Showing posts with label scatterplot. Show all posts
Showing posts with label scatterplot. Show all posts

Monday, September 24, 2012

Example 10.3: Enhanced scatterplot with marginal histograms





Back in example 8.41 we showed how to make a graphic combining a scatterplot with histograms of each variable. A commenter suggested we change the R graphic to allow post-hoc plotting of, for example, lowess lines. In addition, there are further refinements to be made.

In this R-only entry, we'll make the figure more flexible and a bit more robust. See the example linked above for SAS code, or check out Rick Wicklin discussing the same subject-- Rick gives some additional resources.


R
The R code relies heavily on the layout() function. We discussed it last time in a simpler setting with only one column of plots. The goal for the current plot is to enable a title for the whole figure-- this ought to be centered over the whole graphic-- and x- and y-axis labels. In the previous version, there was no title to the page at all and the axis titles would occasionally fail. To do this, we need a layout with a single cell at the top for the whole width of the graphic, a tall narrow cell at the left for the y-axis title, only in the bottom row, and a thin cell at the bottom, only on the left, for the x-axis title. This turns out to be fairly simple with layout() and the results can be checked with layout.show().
zones <- matrix(c(1,1,1, 
                  0,5,0, 
                  2,6,4, 
                  0,3,0), ncol = 3, byrow = TRUE)
layout(zones, widths=c(0.3,4,1), heights = c(1,3,10,.75))
layout.show(n=6)

The matrix input tells R to make the whole first row a single plot area, and that this will be the first thing plotted. The corners of the remaining 3*3 plot cells will be empty. The numbers in the matrix give the order in which the plot cells will be filled. This matrix is the key input to layout(), where we use the remaining options to give the relative widths and heights of the cells. It's possible to do this in the abstract, but is helpful to draw the intended layout first, then test whether the intended design was a achieved using the layout.show() function. The result is shown below. Putting the scatterplot in last will be useful for adding to it post hoc.


With that in hand, it's time to make a function. In generating last week's example, we noted that the layout persists-- that is, the graphics area retains the layout until you shut the graphics device or restore the old parameters. In the new plot, we'll add an option to revert to the old parameters (by default) or retain them. The latter option would be desirable, if, as suggested by a commenter, you wanted to add items to the scatterplot after generating the plot. We also add an option to allow different sized plot symbols.
scatterhist <- function(x, y, xlab = "", ylab = "", plottitle="", 
                        xsize=1, cleanup=TRUE,...){
  # save the old graphics settings-- they may be needed
  def.par <- par(no.readonly = TRUE)
  
  zones <- matrix(c(1,1,1, 0,5,0, 2,6,4, 0,3,0), ncol = 3, byrow = TRUE)
  layout(zones, widths=c(0.3,4,1), heights = c(1,3,10,.75))
  
  # tuning to plot histograms nicely
  xhist <- hist(x, plot = FALSE)
  yhist <- hist(y, plot = FALSE)
  top <- max(c(xhist$counts, yhist$counts))
  
  # for all three titles: 
  #   drop the axis titles and omit boxes, set up margins
  par(xaxt="n", yaxt="n",bty="n",  mar = c(.3,2,.3,0) +.05)
  # fig 1 from the layout
  plot(x=1,y=1,type="n",ylim=c(-1,1), xlim=c(-1,1))
  text(0,0,paste(plottitle), cex=2)
  # fig 2
  plot(x=1,y=1,type="n",ylim=c(-1,1), xlim=c(-1,1))
  text(0,0,paste(ylab), cex=1.5, srt=90)
  # fig 3
  plot(x=1,y=1,type="n",ylim=c(-1,1), xlim=c(-1,1))
  text(0,0,paste(xlab), cex=1.5)
  
  # fig 4, the first histogram, needs different margins
  # no margin on the left
  par(mar = c(2,0,1,1))
  barplot(yhist$counts, axes = FALSE, xlim = c(0, top),
          space = 0, horiz = TRUE)
  # fig 5, other histogram needs no margin on the bottom
  par(mar = c(0,2,1,1))
  barplot(xhist$counts, axes = FALSE, ylim = c(0, top), space = 0)
  # fig 6, finally, the scatterplot-- needs regular axes, different margins
  par(mar = c(2,2,.5,.5), xaxt="s", yaxt="s", bty="n")
  # this color allows traparency & overplotting-- useful if a lot of points
  plot(x, y , pch=19, col="#00000022", cex=xsize, ...)
  
  # reset the graphics, if desired 
  if(cleanup) {par(def.par)}
}

To test this, we'll generate some data and try it out. The results are immediately below; I like this example to help demonstrate that it's not the marginal normality of the data that matter.
x=rexp(1000)
y = x^2 + rnorm(1000)
scatterhist(x[x<4], y[x<4], ylab="This is x", xlab="This is y",   
  "Revised scatterhist", xsize =2)


But let's take advantage of the ability to add curves to the scatterplot.
x=rexp(1000)
y = x^2 + rnorm(1000)
scatterhist(x[x<4], y[x<4], ylab="This is x", xlab="This is y",   
  "Revised scatterhist", xsize =2, cleanup=FALSE)
abline(lm(y~x))
lines(lowess(x,y))

The results are shown at the top-- we can do anything with the scatterplot that we'd be able to do if there were no layout() in effect.



An unrelated note about aggregators:We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Tuesday, December 6, 2011

Example 9.17: (much) better pairs plots


Pairs plots (section 5.1.17) are a useful way of displaying the pairwise relations between variables in a dataset. But the default display is unsatisfactory when the variables aren't all continuous. In this entry, we discuss ways to improve these displays that have been proposed by John Emerson, Walton Green, Barret Schloerke, Dianne Cook, Heike Hofmann, and Hadley Wickham in a manuscript under review entitled The Generalized Pairs Plot. http://www.blogger.com/img/blank.gif

Implementations of the methods in the paper are available in the gpairs and GGally packages; here we use the latter, which is based on the grammar of graphics and the ggplot2 package. This is an R-only entry: we are unaware of efforts to replicate this approach in SAS.

New users may find it easier to break process down into steps, rather than to do everything at once, as the R language allows. One way to do that is to make a smaller version of a dataset, with just the analysis variables included. here we use the HELP data set and choose two categorical variables (gender and housing status) and two continuous ones (the number of drinks per day and a measure of depressive symptoms). Once this new subset is created, the call to ggpairs() is straightforward.

R

library(GGally)
ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
ds$sex = as.factor(ifelse(ds$female==1, "female", "male"))
ds$housing = as.factor(ifelse(ds$homeless==1, "homeless", "housed"))
smallds = subset(ds, select=c("housing", "sex", "i1", "cesd"))
ggpairs(smallds, diag=list(continuous="density", discrete="bar"), axisLabels="show")

For users more comfortable with R, the ggpairs function allows you to select variables to include, via its columns option. The following line produces a plot identical to the above, without the subset().

ggpairs(ds, columns=c("housing", "sex", "i1", "cesd"),
diag=list(continuous="density", discrete="bar"), axisLabels="show")

Various options are available for the diagonal elements of the plot matrix, and the off-diagonals can be controlled with upper and lower options. The examples(ggpairs) command is very helpful for visualizing some of the possibilities.

Monday, June 20, 2011

Example 8.41: Scatterplot with marginal histograms


The scatterplot is one of the most ubiquitous, and useful graphics. It's also very basic. One of its shortcomings is that it can hide important aspects of the marginal distributions of the two variables. To address this weakness, you can add a histogram of each margin to the plot. We demonstrate using the SF-36 MCS and PCS subscales in the HELP data set.

SAS
SAS provides code to perform this using proc template and proc sgrender. These procedures are not intended for casual or typical SAS users. Its syntax is, to our eyes, awkward. This is roughly analogous to R functions that simply call C routines. Nonetheless, it's possible to adapt code that works. The code linked above was edited to set the transparency to 0 and to change the plotted symbol size to 5 from 11px. These options appear in the scatterplot statement about midway through the code.

Once the edited code is submitted, the following lines produce the plot shown above.

proc sgrender data="C:\book\help.sas7bdat" template=scatterhist;
dynamic YVAR="mcs" XVAR="pcs"
TITLE="MCS-PCS Relationship";
run;


R
For R, we adapted some code found in an old R-help post to generate the following function. The mtext() function puts text in the margins and is used here to label the axes. The at option in that function centers the label within the scatterplot data using some algebra.

scatterhist = function(x, y, xlab="", ylab=""){
zones=matrix(c(2,0,1,3), ncol=2, byrow=TRUE)
layout(zones, widths=c(4/5,1/5), heights=c(1/5,4/5))
xhist = hist(x, plot=FALSE)
yhist = hist(y, plot=FALSE)
top = max(c(xhist$counts, yhist$counts))
par(mar=c(3,3,1,1))
plot(x,y)
par(mar=c(0,3,1,1))
barplot(xhist$counts, axes=FALSE, ylim=c(0, top), space=0)
par(mar=c(3,0,1,1))
barplot(yhist$counts, axes=FALSE, xlim=c(0, top), space=0, horiz=TRUE)
par(oma=c(3,3,0,0))
mtext(xlab, side=1, line=1, outer=TRUE, adj=0,
at=.8 * (mean(x) - min(x))/(max(x)-min(x)))
mtext(ylab, side=2, line=1, outer=TRUE, adj=0,
at=(.8 * (mean(y) - min(y))/(max(y) - min(y))))
}

The results of the following code are shown below.

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
with(ds, scatterhist(mcs, pcs, xlab="MCS", ylab="PCS"))