Monday, July 11, 2011

Example 9.2: Transparency and bivariate KDE

In Example 9.1, we showed a binning approach to plotting bivariate relationships in a large data set. Here we show more sophisticated approaches: transparent overplotting and formal two-dimensional kernel density estimation. We use the 10,000 simulated bivariate normals shown in Example 9.1.

In SAS, transparency is available in proc sgplot through the transparency option of the scatter statement, with results shown above. The other options here are fairly self-explanatory.

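proc sgplot data=mvnorms;
scatter x=x1 y=x2 / markerattrs=(symbol=CircleFilled size = .05in)
transparency=0.9; /* illustrative value (0 = opaque, 1 = invisible); the setting used for the image is not shown */
run;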

The image gives a good sense of the overall density, with the darker (overplotted) areas reflecting more observations. Overplotting was the problem we sought to avoid with the binning, but here it becomes an advantage.

Another approach is to use bivariate kernel density estimation. This is perhaps more similar to the binning shown previously, but without the stricture of regular polygons. It also offers default values for smoothing, though whether these are good defaults is debatable.

proc kde data=mvnorms;
bivar x1 x2 / plots=contour;
run;

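In R, the basic plot() function supports transparency through the alpha channel of the color, on graphics devices that can render it. In a color written as "#RRGGBBAA", the final two hex digits set the opacity, so "#00000022" is black at roughly 13% opacity: single points look pale, and overplotted regions darken. The pch, col, and cex parameters govern the shape, color, and size of the plotted symbols, respectively.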

plot(xvals[,1], xvals[,2], pch=19, col="#00000022", cex=0.1)
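
To make the alpha channel concrete, here is a minimal sketch that decodes the color used above and rebuilds it from numeric components. The simulated xvals matrix is only a stand-in for the Example 9.1 data: the correlation rho and the seed are assumptions, not values taken from that example.

```r
# Stand-in data (assumption): 10,000 correlated bivariate normals.
set.seed(42)
n   <- 10000
rho <- 0.5                                  # hypothetical correlation
x1  <- rnorm(n)
x2  <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)
xvals <- cbind(x1, x2)

# Decode "#RRGGBBAA": the fourth component is alpha on a 0-255 scale.
alpha_255  <- col2rgb("#00000022", alpha = TRUE)["alpha", 1]
alpha_frac <- alpha_255 / 255               # opacity as a fraction

# Rebuild the same color from its components.
pale_black <- rgb(0, 0, 0, alpha = alpha_255, maxColorValue = 255)

plot(xvals[, 1], xvals[, 2], pch = 19, col = pale_black, cex = 0.1,
     xlab = "x1", ylab = "x2")
```

Raising or lowering the alpha value trades visibility of sparse regions against saturation in dense ones, so it is worth trying a few settings for a given sample size.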

Bivariate kernel density estimation is available in the smoothScatter() function, which is included in the R distribution as part of the graphics package.

smoothScatter(xvals[,1], xvals[,2])
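
Since the smoothing defaults are open to debate, it can help to compute the density estimate explicitly and see the bandwidth being used. The sketch below relies on kde2d() from the MASS package rather than on smoothScatter()'s internals, and spells out kde2d()'s default bandwidth rule so that it can be changed. As before, the simulated xvals is an assumed stand-in for the Example 9.1 data.

```r
library(MASS)   # provides kde2d() and bandwidth.nrd()

# Stand-in data (assumption): 10,000 correlated bivariate normals.
set.seed(42)
n   <- 10000
rho <- 0.5                                  # hypothetical correlation
x1  <- rnorm(n)
x2  <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)
xvals <- cbind(x1, x2)

# kde2d()'s default bandwidths, written out so they are easy to adjust.
h <- c(bandwidth.nrd(xvals[, 1]), bandwidth.nrd(xvals[, 2]))

# Density estimate on a 50 x 50 grid spanning the range of the data.
dens <- kde2d(xvals[, 1], xvals[, 2], h = h, n = 50)

# Contour plot of the estimated density.
contour(dens, xlab = "x1", ylab = "x2")

# smoothScatter() also marks points in the sparsest regions individually;
# nrpoints = 0 turns that off so only the shaded density remains.
smoothScatter(xvals[, 1], xvals[, 2], nrpoints = 0,
              xlab = "x1", ylab = "x2")
```

Multiplying h by a constant (for example h * 2 or h / 2) is a quick way to see how sensitive the contours are to the amount of smoothing.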

1 comment:

Rick Wicklin said...

I think that it's interesting that the default values for PROC KDE and smoothScatter are so different. The smoothScatter image includes blue shading if there is even a single outlying point.

You can use the same trick to help determine if a bivariate density is unimodal or bimodal. For an example, see