Sunday, July 31, 2011

Taking August off!

We'll be back with recharged batteries and lots of new entries in September. Have a great summer*!

As usual, please send any questions you have about using SAS or R.

*Not valid in the southern hemisphere.

Monday, July 25, 2011

Really useful R package: sas7bdat

For SAS users, one hassle in trying things in R, let alone migrating, is the difficulty of getting data out of SAS and into R. In our book (section 1.2.2) and in a blog entry we've covered getting data out of SAS native data sets. Unfortunately, for all of these methods, you need a working, licensed version of SAS.

However, Matt Shotwell has reverse-engineered the sas7bdat file format. This means that you can now read a SAS data set without a working copy of SAS. This is a wonderful thing, and in fact SAS Institute ought to have provided this ability long ago. The package is experimental, but it worked fine for two small data sets. Matt tells me that as of 7/2011, the package only works for sas7bdat files generated on 32-bit Windows systems.

R
Install the package sas7bdat. Then use the read.sas7bdat() function.

library(sas7bdat)
helpfromSAS = read.sas7bdat("http://www.math.smith.edu/sasr/datasets/help.sas7bdat")

> is.data.frame(helpfromSAS)
[1] TRUE
> summary(helpfromSAS$MCS)
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.763 21.680 28.600 31.680 40.940 62.180
> with(helpfromSAS, summary(SUBSTANCE))
alcohol cocaine heroin
177 152 124

It's unclear why the variable names are all capitalized. That didn't happen in another trial, so it must be something about the way the help.sas7bdat data set was constructed.

Monday, July 18, 2011

Example 9.3: augmented display of contingency table


SAS and R often provide different levels of detail in their output. This is particularly true for the descriptive analysis of contingency tables, where SAS makes it easy to display tables with additional quantities (such as the expected cell count).

The mosaic package has added functionality to calculate these quantities in R. We demonstrate using an example from the HELP dataset.

R

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
library(mosaic)
ds$gender = ifelse(ds$female==1, "female", "male")
ds$homeless = ifelse(ds$homeless==1, "homeless", "housed")
tab = xtabs(~ gender + homeless, data=ds)
> tab
homeless
gender homeless housed
female 40 67
male 169 177
> xchisq.test(tab)

Pearson's Chi-squared test with Yates' continuity correction

data: tab
X-squared = 3.8708, df = 1, p-value = 0.04913

40.00 67.00
( 49.37) ( 57.63)
[1.78] [1.52]
<-1.33> < 1.23>

169.00 177.00
(159.63) (186.37)
[0.55] [0.47]
< 0.74> <-0.69>

key:
observed
(expected)
[contribution to X-squared]
<Pearson residual>


We see that there is a borderline statistically significant association between gender and homeless status in the HELP study. We interpret this as fewer homeless females than expected, and more homeless males than expected.

Another idea is to use graphical depictions of the association in this table. One approach is a mosaic plot (note: no relation to Project MOSAIC and the mosaic package). A mosaic plot starts as a square with area equal to one. It is divided into columns with widths proportional to the prevalence of each value of the column variable (in this case, gender). Each column is then divided vertically based on the conditional probability of the other variable within that category.
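The widths and heights described above are just marginal and conditional proportions, which can be checked directly with prop.table(). This is a quick sketch, assuming the tab object built above:

```r
# column widths: marginal distribution of gender
prop.table(margin.table(tab, 1))
# tile heights within each column: homeless status conditional on gender
prop.table(tab, margin=1)
```

The first call returns roughly a quarter for females, matching the narrow left column in the mosaic plot.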

Another graphical display of a table is the association plot. In an association plot, there is also a box for each cell of the table. The area of the box is proportional to the difference between the observed and expected (assuming no association) frequencies. In a typical presentation, excess observed counts are black and above the line, while deficient counts are red and below the line.
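The quantities driving the association plot can also be pulled out of the standard chisq.test() object: the expected counts under independence and the Pearson residuals, (observed - expected)/sqrt(expected). A sketch, again using the tab object built above:

```r
chisqres = chisq.test(tab, correct=FALSE)
chisqres$expected   # expected cell counts under independence
chisqres$residuals  # Pearson residuals
```

The expected counts match the parenthesized values in the xchisq.test() output shown earlier.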

Above, we show the mosaic plot (on the left) and association plot (on the right). Both of these displays demonstrate that there is an association. The mosaic plot indicates that only about a quarter of the sample is female (indicated by the width of the columns), and that homelessness is present in about half the subjects (area shaded in light grey). The slight association demonstrated is that there are fewer homeless women than expected (since the horizontal line moves down between the first and second column). Similarly, for the association plot we note that the expected cell count is less than the observed (indicated in red with values below the line) for the female homeless group.

par(mfrow=c(1,2))
mosaicplot(tab, color=TRUE, main="mosaic plot")
assocplot(tab)
title("association plot")


SAS
As in Example 8.32, we find SAS macros for mosaic plots among the contributions of Michael Friendly. In this complex case, they are somewhat more difficult to access than others. The code for the plots themselves can be downloaded here, while it's useful to also run a wrapper macro. After downloading the files, the following code can be used to make the figure below.

title 'Install mosaic modules';
* location of the zipped files;
filename mosaic 'c:\ken\sasmacros\mosaics';
* storage location of compiled macros;
libname mosaic 'c:\ken\sasmacros\mosaics';

* Code to read in, compile and store the macros;
proc iml ;
reset storage=mosaic.mosaic;
%include mosaic(mosaics) ;
store module=_all_;
show storage;
quit;

* Prep: create the table, save the cell counts;
proc freq data = "c:\book\help.sas7bdat";
tables homeless * female / out=outhelp;
run;

* Read in the wrapper macro;
%include "c:\ken\sasmacros\mosaics\mosaic.sas";

* Make the plot;
%mosaic(data=outhelp,var = female homeless,
sort=homeless descending female, space = 1 1);

The sort and space options make the results more similar to those shown for mosaicplot(). In this version, the colors reflect the signs of the residuals.

Monday, July 11, 2011

Example 9.2: Transparency and bivariate KDE



In Example 9.1, we showed a binning approach to plotting bivariate relationships in a large data set. Here we show more sophisticated approaches: transparent overplotting and formal two-dimensional kernel density estimation. We use the 10,000 simulated bivariate normals shown in Example 9.1.

SAS
In SAS, transparency can be found in proc sgplot, with results shown above. The options here are fairly self-explanatory.

proc sgplot data=mvnorms;
scatter x=x1 y=x2 / markerattrs=(symbol=CircleFilled size = .05in)
transparency=0.85;
run;

The image gives a good sense of the overall density, with the darker (overplotted) areas reflecting more observations. Overplotting was the problem we sought to avoid with the binning, but here it becomes an advantage.

Another approach is to use bivariate kernel density estimation. This is perhaps more similar to the binning shown previously, but without the stricture of regular polygons. It also offers some default values for smoothing, though whether or not these are good default values could be debated.

proc kde data=mvnorms;
bivar x1 x2 / plots=contour;
run;




R

In R, the basic plot() function supports transparency, specified here through the final two hex digits of the color (the alpha channel). The pch, col, and cex parameters govern the shape, color, and size of the plotted symbols, respectively.

plot(xvals[,1], xvals[,2], pch=19, col="#00000022", cex=0.1)



Bivariate kernel density estimation is available in the smoothScatter() function, which is included in the R distribution as part of the graphics package.

smoothScatter(xvals[,1], xvals[,2])
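If the estimated density itself is wanted (for example, to draw contours like those produced by proc kde), the kde2d() function in the MASS package can be called directly rather than relying on smoothScatter()'s internal estimate. A sketch, assuming the xvals matrix from Example 9.1:

```r
library(MASS)
denest = kde2d(xvals[,1], xvals[,2], n=50)  # bivariate KDE on a 50x50 grid
contour(denest, xlab="X1", ylab="X2")       # contour plot of the density
```

The n argument controls the grid resolution, and bandwidths can be supplied via the h argument if the defaults are unsatisfying.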

Tuesday, July 5, 2011

Example 9.1: Scatterplots with binning for large datasets



Scatterplots can get very hard to interpret when displaying large datasets, as points inevitably overplot and can't be individually discerned. A number of approaches have been crafted to help with this problem. One approach uses binning. This approach is also sometimes called a heat map, and can be thought of as a two-dimensional histogram, where shades of the bins take the place of the heights of the bars. Any regular tessellation of the plane can be used, but there is some attraction to using hexagons. Why? In the vignette for the hexbin package, author Nicholas Lewin-Koh notes:

There are many reasons for using hexagons, at least over squares. Hexagons have symmetry of nearest neighbors which is lacking in square bins. Hexagons are the maximum number of sides a polygon can have for a regular tesselation of the plane, so in terms of packing a hexagon is 13% more efficient for covering the plane than squares. This property translates into better sampling efficiency at least for elliptical shapes. Lastly hexagons are visually less biased for displaying densities than other regular tesselations.


On the other hand, it's unclear whether these advantages are relevant here or whether they outweigh the simplicity of the square and the constant x and y values accompanying it.

In this entry, we demonstrate the use of a binned scatterplot for data from a sample of 10,000 generated bivariate normal random variables (section 1.10.6).

R

In R, we use the hexbin package to generate our plot, after generating our bivariate normals with correlation approximately 0.52.

library(MASS)
library(hexbin)
mu = c(1, -1)
Sigma = matrix(c(3, 2,
2, 5), nrow=2)
xvals = mvrnorm(10000, mu, Sigma)
Sigma[1,2]/sqrt(Sigma[1,1]*Sigma[2,2]) # correlation
plot(hexbin(xvals[,1], xvals[,2]), xlab="X1", ylab="X2")


SAS
We're not aware of a SAS procedure to generate a binned scatterplot or of previously existing macros to do it. Ken wrote a relatively simple macro to do it, which can be found here. The macro uses proc gmap, and we hope that someone will develop an approach using proc template and proc sgrender, as demonstrated in an example from SAS Institute.

After running the macro, the following code generates the image shown below.


data Sigma (type=cov);
infile cards;
input _type_ $ _Name_ $ x1 x2;
cards;
cov x1 3 2
cov x2 2 5
;
run;

proc simnormal data=Sigma out=mvnorms numreal = 10000;
var x1 x2;
run;

%twodhist(data=mvnorms,x=x1,y=x2,nbinsx=30,nbinsy=30,nshades=9);




We note that the default number of shades shown in R, and the number chosen here for SAS, seem to exceed the eye's ability to differentiate, especially for the darker shades.

Update

An anonymous commenter reported that the SAS code bombed when run. I (Ken) added a new version of the code at the link listed above. I note it here only to emphasize that in either SAS or R, settings or objects in the environment can affect the performance of code. If you plan to share code, an item to add to your checklist is to run the code in a fresh session.

Friday, July 1, 2011

A third year of entries!

Contrary to previous reports, we started blogging after our book was published, with the conceit that we were adding examples to the book. Today marks the second anniversary of the book's appearance and of the blog. To celebrate, we're turning over our calendar, and starting a new volume of entries-- Example 9.1 will appear on July 5th.

It's very difficult to get an accurate measure of our viewership. As I write this, Feedburner reports about 650 regular readers, but this omits people who see us on R-bloggers and SAS Community Planet or SAS-X. As consumers of those aggregators, we assume there are many others who see us without subscribing directly.

We also get a fair number of views that derive directly from Google searches, which means we must be doing something right. Google Analytics reports over 100,000 total pageviews, while Feedburner claims 250,000.

So far, our most popular entries, according to Feedburner, are:
Example 7.38: Kaplan-Meier survival estimates
Example 7.39: Nelson-Aalen estimate of survivorship
Example 8.3: Pyramid plots
Example 7.30: Simulate censored survival data
Example 8.5: Bubble plots part 3

Blogger's internal metrics vary somewhat:
Example 7.35: Propensity score matching
Example 8.7: Hosmer and Lemeshow goodness-of-fit
Example 7.38: Kaplan-Meier survival estimates
Example 7.30: Simulate censored survival data
Example 7.25: Compare draws with distribution

All of your attention is really gratifying, and we hope we're useful to people-- it's hard work finding new material and collaborating on a new post every week!