Tuesday, July 5, 2011

Example 9.1: Scatterplots with binning for large datasets



Scatterplots can be very hard to interpret when displaying large datasets, as points inevitably overplot and can't be individually discerned. A number of approaches have been crafted to help with this problem. One approach uses binning. This approach is also sometimes called a heat map, and can be thought of as a two-dimensional histogram, where the shades of the bins take the place of the heights of the bars. Any regular tessellation of the plane can be used, but there is some attraction to using hexagons. Why? In the vignettes for the hexbin package, author Nicholas Lewin-Koh notes:

There are many reasons for using hexagons, at least over squares. Hexagons have symmetry of nearest neighbors which is lacking in square bins. Hexagons are the maximum number of sides a polygon can have for a regular tesselation of the plane, so in terms of packing a hexagon is 13% more efficient for covering the plane than squares. This property translates into better sampling efficiency at least for elliptical shapes. Lastly hexagons are visually less biased for displaying densities than other regular tesselations.


On the other hand, it's unclear whether these advantages are relevant here or whether they outweigh the simplicity of the square and the constant x and y values accompanying it.
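
To make the two-dimensional histogram idea concrete, here is a minimal sketch of square binning in base R. It is not the method used by hexbin or by the SAS macro; the simulated data, the choice of 30 bins per axis, and the 9 gray shades are assumptions made only for illustration.

```r
# Sketch only: square binning of bivariate data using base R.
set.seed(42)                      # arbitrary seed, for reproducibility
n = 10000
x = rnorm(n)
y = 0.5 * x + rnorm(n)            # illustrative correlated data (an assumption)

nbins = 30                        # hypothetical number of bins per axis
# Breaks are padded out to whole numbers so that every point falls in a bin.
xbreaks = seq(floor(min(x)), ceiling(max(x)), length.out = nbins + 1)
ybreaks = seq(floor(min(y)), ceiling(max(y)), length.out = nbins + 1)

# Assign each point to a bin on each axis, then cross-tabulate.
xbin = cut(x, breaks = xbreaks, include.lowest = TRUE)
ybin = cut(y, breaks = ybreaks, include.lowest = TRUE)
counts = table(xbin, ybin)        # nbins x nbins table of counts

# Draw the counts as shaded squares; darker cells hold more points.
xmid = (head(xbreaks, -1) + tail(xbreaks, -1)) / 2
ymid = (head(ybreaks, -1) + tail(ybreaks, -1)) / 2
image(xmid, ymid, matrix(counts, nrow = nbins),
      col = gray(seq(1, 0, length.out = 9)), xlab = "x", ylab = "y")
```

Each cell here has constant x and y boundaries, which is the simplicity referred to above; a hexagonal binning trades that away for the symmetry described in the vignette.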

In this entry, we demonstrate the use of a binned scatterplot for data from a sample of 10,000 generated bivariate normal random variables (section 1.10.6).

R

In R, we use the hexbin package to generate our plot, after generating our bivariate normals with correlation approximately 0.52.

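library(MASS)    # provides mvrnorm()
library(hexbin)  # provides hexbin()
mu = c(1, -1)
# covariance matrix: Var(X1) = 3, Var(X2) = 5, Cov(X1, X2) = 2
Sigma = matrix(c(3, 2,
                 2, 5), nrow=2)
xvals = mvrnorm(10000, mu, Sigma)
Sigma[1,2]/sqrt(Sigma[1,1]*Sigma[2,2]) # correlation implied by Sigma: 2/sqrt(15), about 0.52
plot(hexbin(xvals[,1], xvals[,2]), xlab="X1", ylab="X2")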
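
As a quick check on the stated correlation, the sketch below compares the value implied by Sigma with the sample correlation of the simulated draws. The seed is an arbitrary assumption added for reproducibility, and cov2cor() is used here as a shortcut for the hand calculation above.

```r
# Sketch only: compare the correlation implied by Sigma with the sample value.
library(MASS)                     # provides mvrnorm()
set.seed(2011)                    # arbitrary seed (an assumption)
mu = c(1, -1)
Sigma = matrix(c(3, 2,
                 2, 5), nrow=2)
xvals = mvrnorm(10000, mu, Sigma)

rho = cov2cor(Sigma)[1, 2]        # implied correlation, 2/sqrt(15)
r = cor(xvals[,1], xvals[,2])     # sample correlation of the draws
c(theoretical = rho, sample = r)
```

With 10,000 draws the two values should agree to about two decimal places, though the exact sample value depends on the seed.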


SAS
We're not aware of a SAS procedure to generate a binned scatterplot or of previously existing macros to do it. Ken wrote a relatively simple macro to do it, which can be found here. The macro uses proc gmap, and we hope that someone will develop an approach using proc template and proc sgrender, as demonstrated in an example from SAS Institute.

After running the macro, the following code generates the image shown below.


data Sigma (type=cov);
infile cards;
input _type_ $ _Name_ $ x1 x2;
cards;
cov x1 3 2
cov x2 2 5
;
run;

proc simnormal data=Sigma out=mvnorms numreal = 10000;
var x1 x2;
run;

%twodhist(data=mvnorms,x=x1,y=x2,nbinsx=30,nbinsy=30,nshades=9);




We note that the default number of shades shown in R, and the number chosen here for SAS, seem to exceed the eye's ability to differentiate, especially for the darker shades.
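
One way to reduce the number of shades in R is sketched below. It assumes that the plot method for hexbin objects accepts a colorcut argument giving the class boundaries on the [0, 1] scale, as described in the package documentation; the choice of 5 shades, 30 bins, and the simulated data are assumptions for illustration only.

```r
# Sketch only: a hexbin plot with fewer shade classes.
library(hexbin)
set.seed(1)                       # arbitrary seed (an assumption)
x = rnorm(10000)
y = 0.5 * x + rnorm(10000)        # illustrative correlated data
hb = hexbin(x, y, xbins = 30)     # 30 bins across the x range

nshades = 5                       # hypothetical number of shades
# nshades + 1 boundaries on [0, 1] define nshades color classes
plot(hb, colorcut = seq(0, 1, length.out = nshades + 1),
     xlab="X1", ylab="X2")
```

Fewer classes make neighboring shades easier to tell apart, at the cost of a coarser picture of the density.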

Update

An anonymous commenter reported that the SAS code bombed when run. I (Ken) added a new version of the code at the link listed above. I note it here only to emphasize that in either SAS or R, settings or objects in the environment can affect the performance of code. If you plan to share code, an item to add to your checklist is to run the code in a fresh session.

4 comments:

Anonymous said...

I ran the SAS code as is. It produces an error: it does not find the formats xn. and yn.
Is there something I'm missing?

Ken Kleinman said...

Thanks for letting me know, Anonymous. There's a new version there (same link as above) which should run, and I'll add an update to the post above.

Anonymous said...

Ken, I downloaded the macro again. However, it did not work for me. So I dropped the format code for xcat and ycat from:
format nxycat countn. xcat xn. ycat yn.;

and just left:
format nxycat countn.;

Now it is producing the map shown above.
Thank you so much for this great piece of code!

Ken Kleinman said...

That's what I thought I had changed in the new version. Glad it works for you!