Tuesday, March 29, 2011

Example 8.32: The HistData package, sunflower plots, and getting data from R into SAS



This entry is mainly a promotion of the fascinating HistData R package. The package, compiled by the psychologist, statistician, and graphics innovator Michael Friendly, contains a number of small data sets of historical interest. These include data from John Snow's map of cholera in London, Minard's map of Napoleon's Russian campaign of 1812, Galton's data on heights of parents and children, and many others.

If you have any interest in Minard's map, Friendly also hosts a site about the map, Minard, and a gallery with some re-imaginings of the map data, at http://datavis.ca/gallery/re-minard.php. The gallery includes R and SAS versions, as well as one which uses Google Maps.

R
Once you install the package and load it with library() (section B.6.1), you can access the data with the data() function. For example, we show Galton's data, which led to the description of regression to the mean.

> install.packages("HistData")
> library(HistData)
> data(Galton)
> head(Galton)
  parent child
1   70.5  61.7
2   68.5  61.7
3   65.5  61.7
4   64.5  61.7
5   64.0  61.7
6   67.5  62.2

The package also includes example() methods for many of the data sets: example(Galton) results in the sunflower plot shown above. The sunflower plot (section 5.1.14) is an alternative to jittering when many observations share values: each distinct point is drawn once, with one petal per observation at that location. If the underlying data are closer to continuous, you might see the sunflower plot as a form of two-dimensional histogram. To list the available data sets, use ?'HistData-package'
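As a rough sketch of how the petals relate to shared values, the base R function xyTable() tabulates how many observations fall on each distinct (x, y) point, and sunflowerplot() draws one petal per observation at that point. The small vectors below are made up purely for illustration and are not taken from Galton's data.

```r
# Made-up toy data with repeated (x, y) pairs; not Galton's data
x <- c(1, 1, 1, 2, 2, 3)
y <- c(5, 5, 5, 6, 6, 7)

# Count the multiplicity of each distinct point
tab <- xyTable(x, y, digits = 6)
print(data.frame(x = tab$x, y = tab$y, number = tab$number))

# Draw the sunflower plot; points with several observations get petals
sunflowerplot(x, y, main = "Toy sunflower plot")
```

The same counts can be passed to sunflowerplot() explicitly through its number argument if the data have already been tabulated.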

We're not aware of a companion set of SAS data sets. An easy way to access the data sets in SAS is to load the package into R and export the data into SAS using the foreign package (section 1.2.2).

> library(foreign)
> write.foreign(Galton,"galton.dat","galton.sas",package="SAS")
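To see what write.foreign() produces without installing HistData, here is a minimal sketch that writes a small stand-in data frame to temporary files and prints both outputs. The data frame toy and the temporary file names are hypothetical and used only for illustration.

```r
library(foreign)  # recommended package, included with standard R installations

# Hypothetical stand-in for the Galton data frame
toy <- data.frame(parent = c(70.5, 68.5, 65.5),
                  child  = c(61.7, 61.7, 61.7))

# Temporary files in place of "galton.dat" and "galton.sas"
datafile <- tempfile(fileext = ".dat")
codefile <- tempfile(fileext = ".sas")
write.foreign(toy, datafile, codefile, package = "SAS")

# The first file holds the raw values; the second is a SAS program that reads them
data_lines <- readLines(datafile)
code_lines <- readLines(codefile)
cat(data_lines, sep = "\n")
cat(code_lines, sep = "\n")
```

Reading the generated SAS program before running it is a quick way to confirm the data set name and the path to the data file that SAS will look for.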


SAS
Running the galton.sas file written by the write.foreign function makes a SAS data set called rdata with variables parent and child. We can make a sunflower plot in SAS using a macro written, coincidentally, by Michael Friendly, which he hosts here. Making a plot requires running the "sunfont.sas" file and the "sunplot.sas" file. I had to modify the "sunfont.sas" file slightly, and I give the edited file here:

libname gfont0 'c:\temp';

data sunsymb;
  alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
  do n=1 to 26;
    char=substr(alpha,n,1);
    segment=n;
    x = .2; y = .2; output; /* Draw small box at center */
    x =-.2; y = .2; output; /* of each symbol */
    x =-.2; y =-.2; output;
    x = .2; y =-.2; output;
    x = .2; y = .2; output;
    if n>1 then
      do i=1 to n; /* draw n radial lines */
        x=0; y=0; output;
        x=cos(2*atan(1) + i/n*(8*atan(1)));
        y=sin(2*atan(1) + i/n*(8*atan(1)));
        output;
      end;
  end;

proc gfont data=sunsymb /* name=GB0426 */
name=sun showroman h=3 romht=2 resol=2;

In this step, Friendly is constructing a font whose "letters" are the sunflower symbols with various numbers of petals. Note that if you have already defined a gfont0 library, the first line above is not needed.
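The petal geometry in the data step may be easier to follow in R. Since 2*atan(1) is pi/2 and 8*atan(1) is 2*pi, petal i of n ends at angle pi/2 + 2*pi*i/n on the unit circle, so the petals are evenly spaced and the last one points straight up. The function name petal_endpoints below is hypothetical and is not part of Friendly's macro; it only mirrors the arithmetic.

```r
# Hypothetical helper mirroring the SAS arithmetic for the petal endpoints
petal_endpoints <- function(n) {
  i <- seq_len(n)
  angle <- pi / 2 + i / n * (2 * pi)  # same as 2*atan(1) + i/n*(8*atan(1))
  data.frame(petal = i, x = cos(angle), y = sin(angle))
}

# Endpoints for a symbol with four petals
p <- petal_endpoints(4)
print(round(p, 3))
```

In the SAS code the radial lines are only drawn when n is greater than 1, so a single observation appears as the small central box alone.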

Then the sunplot macro can be read in and run.

%include "c:\ken\sasmacros\sunplot.sas";
%sunplot(data=rdata, x=parent, y = child); run;


The resulting plot is shown below. The SAS version is rather more primitive (and I did not bother to add the ellipses or regression line), but both the SAS and R versions show that children tend to be less unusual than their parents, and the more unusual the parent is, the more the child shrinks toward the mean.
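The shrinkage toward the mean can be sketched numerically. The example below uses simulated, standardized data rather than Galton's measurements; the correlation of 0.5 is an assumption chosen only for illustration. When both variables have the same spread and are imperfectly correlated, the fitted slope of child on parent is less than 1.

```r
set.seed(1)
n   <- 1000
rho <- 0.5  # assumed parent-child correlation, for illustration only

# Simulated standardized "heights"; not Galton's data
parent <- rnorm(n)
child  <- rho * parent + sqrt(1 - rho^2) * rnorm(n)

# A slope below 1 means predicted children are closer to the mean than their parents
fit   <- lm(child ~ parent)
slope <- unname(coef(fit)["parent"])
print(slope)

# Among unusually tall parents, compare the average parent with the average child
tall <- parent > 1
print(c(parent = mean(parent[tall]), child = mean(child[tall])))
```

With the HistData package loaded, the analogous fit on the real data would be lm(child ~ parent, data = Galton).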

2 comments:

Richard Thornton said...

This is great! Thanks for the information.

Anonymous said...

It takes just one word in R to create a sunflowerplot:

> library(HistData)
> data(Galton)
> par(mfrow=c(1,2))
> plot(Galton,main="Scatter Plot")
> sunflowerplot(Galton,main="Sunflower Plot")