Showing posts with label proc sgplot. Show all posts

Tuesday, February 7, 2012

Example 9.20: visualizing Simpson's paradox


Simpson's paradox is always amazing to explain to students. A relationship that is bad for one group, and bad for another group, can look good for everyone if you just collapse over the grouping variable. Unlike many mathematical paradoxes, this arises in a number of real-world settings.
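Before turning to the SAT data, a toy example helps fix ideas. The numbers below are invented for illustration (they are not from the study discussed in this entry):

```r
# Invented data: within each group the slope of y on x is +1,
# but pooling the two groups reverses the sign of the slope.
toy = data.frame(x = c(1, 2, 5, 6),
                 y = c(5, 6, 1, 2),
                 g = c("A", "A", "B", "B"))
coef(lm(y ~ x, data=toy))["x"]                  # pooled slope: negative
coef(lm(y ~ x, data=subset(toy, g=="A")))["x"]  # slope within group A: +1
coef(lm(y ~ x, data=subset(toy, g=="B")))["x"]  # slope within group B: +1
coef(lm(y ~ x + g, data=toy))["x"]              # slope adjusting for group: +1
```

Group B has both larger x values and smaller y values, so collapsing over the group drags the pooled slope negative even though the within-group slopes are positive.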

In this entry, we consider visualizing Simpson's paradox, using data from a study of average SAT scores, average teacher salaries and percentage of students taking the SATs at the state level (published by Deborah Lynn Guber, "Getting what you pay for: the debate over equity in public school expenditures" (1999), Journal of Statistics Education 7(2)).

R
The relevant data are available within the mosaic package.

> library(mosaic); trellis.par.set(theme=col.mosaic())
> head(SAT)
state expend ratio salary frac verbal math sat
1 Alabama 4.405 17.2 31.144 8 491 538 1029
2 Alaska 8.963 17.6 47.951 47 445 489 934
3 Arizona 4.778 19.3 32.175 27 448 496 944
4 Arkansas 4.459 17.1 28.934 6 482 523 1005
5 California 4.992 24.0 41.078 45 417 485 902
6 Colorado 5.443 18.4 34.571 29 462 518 980

The paradox manifests itself here if we ignore the fraction of students taking the SAT and instead consider the bivariate relationship between average teacher salary and average SAT score:

> lm(sat ~ salary, data=SAT)

Coefficients:
(Intercept) salary
1158.86 -5.54

Unfortunately, the relationship here appears to be negative: as salary increases,
average SAT score is predicted to decrease.

Luckily for the educators in the audience, once the confounding (aka lurking) variable is accounted for, we see a statistically significant positive relationship:

> summary(lm(sat ~ salary + frac, data=SAT))

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 987.9005 31.8775 30.991 <2e-16 ***
salary 2.1804 1.0291 2.119 0.0394 *
frac -2.7787 0.2285 -12.163 4e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 33.69 on 47 degrees of freedom
Multiple R-squared: 0.8056, Adjusted R-squared: 0.7973
F-statistic: 97.36 on 2 and 47 DF, p-value: < 2.2e-16

That's all well and good, but how to explain what is going on? We can start with a scatterplot, but unless the state names are plotted then it's hard to see what's going on (imagine the plot at the top of this entry without the text labels or the color shading).

It's straightforward to use the panel functionality within the lattice package to create a more useful plot, where state names are plotted rather than points, and different colors are used to represent the low (red), medium (blue) and high (green) values for the fraction of students taking the SAT.

SAT$fracgrp = cut(SAT$frac, breaks=c(0, 22, 49, 81),
  labels=c("low", "medium", "high"))
SAT$fraccount = 1 + as.numeric(SAT$fracgrp=="medium") +
  2*as.numeric(SAT$fracgrp=="high")
panel.labels = function(x, y, col='black', labels='x', ...) {
  panel.text(x, y, labels, col=col, ...)
}
xyplot(sat ~ salary, data=SAT, groups=fracgrp,
  cex=0.6, panel=panel.labels, labels=SAT$state,
  col=c('red','blue','green')[SAT$fraccount])

We see that within each group, the slope is positive (or at least non-negative), while overall there is a negative slope. Indeed, we see all of the hallmarks of a serious confounder in the correlation matrix for the n=50 observations:

> with(SAT, cor(cbind(sat, frac, salary)))
sat frac salary
sat 1.0000000 -0.8871187 -0.4398834
frac -0.8871187 1.0000000 0.6167799
salary -0.4398834 0.6167799 1.0000000

There's a strong negative association between SAT (Y) and fraction (Z), as well as a strong positive association between salary (X) and fraction (Z).
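The reversal can also be checked numerically. Here is a quick sketch, assuming the mosaic package's SAT data and the fracgrp variable defined above, that compares the overall slope with the slope fit separately within each group:

```r
library(mosaic)  # provides the SAT data frame
SAT$fracgrp = cut(SAT$frac, breaks=c(0, 22, 49, 81),
  labels=c("low", "medium", "high"))
# overall slope of sat on salary (negative, as in the lm() fit above)
coef(lm(sat ~ salary, data=SAT))["salary"]
# slope within each fraction group (non-negative in each group)
sapply(levels(SAT$fracgrp), function(lev)
  coef(lm(sat ~ salary, data=subset(SAT, fracgrp==lev)))["salary"])
```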

SAS

We begin by getting the data out of R. This is a snap, thanks to the proc_r macro we discussed here. Just read the macro in, tell it to grab the sat object on its way back from R, then write the R code to read in the data set. This technique would be great for getting data from the HistData package, discussed here, into SAS.

%include "C:\ken\sasmacros\Proc_R.sas";
%Proc_R (SAS2R =, R2SAS = sat);
Cards4;

library(mosaic)
head(SAT)
sat = SAT

;;;;
%Quit;

We'll skip the regression (best accomplished in proc glm) and move on to the helpful plot. We'll need to categorize the fraction taking the test. A natural way to do this in SAS would be to use formats, but we prefer to generate actual data as a more transparent process. To make the plot, we'll use the sgplot procedure. Typically this allows less control than the gplot procedure, but the results in this case are quick and attractive. The group= and datalabel= options add the colors and legend, and the state names, respectively.

data sat2; set sat;
length fracgrp $ 6;  /* without this, the first assignment would set the length to 3 and truncate "Medium" */
if frac le 22 then fracgrp = "Low";
else if frac le 49 then fracgrp = "Medium";
else fracgrp = "High";
run;

proc sgplot data=sat2;
scatter x = salary y = sat / group=fracgrp datalabel=state;
run; quit;

Monday, July 11, 2011

Example 9.2: Transparency and bivariate KDE



In Example 9.1, we showed a binning approach to plotting bivariate relationships in a large data set. Here we show more sophisticated approaches: transparent overplotting and formal two-dimensional kernel density estimation. We use the 10,000 simulated bivariate normals shown in Example 9.1.

SAS
In SAS, transparency can be found in proc sgplot, with results shown above. The options here are fairly self-explanatory.

proc sgplot data=mvnorms;
scatter x=x1 y=x2 / markerattrs=(symbol=CircleFilled size = .05in)
transparency=0.85;
run;

The image gives a good sense of the overall density, with the darker (overplotted) areas reflecting more observations. Overplotting was the problem we sought to avoid with the binning, but here it becomes an advantage.

Another approach is to use bivariate kernel density estimation. This is perhaps more similar to the binning shown previously, but without the stricture of regular polygons. It also offers some default values for smoothing, though whether or not these are good default values could be debated.

proc kde data=mvnorms;
bivar x1 x2 / plots=contour;
run;




R

In R, the basic plot() function supports transparency through the alpha channel of the color specification (the trailing "22" in the hex color string below), though you must select a suitably pale color to see the effect. The pch, col, and cex parameters govern the shape, color, and size of the plotted symbols, respectively.

plot(xvals[,1], xvals[,2], pch=19, col="#00000022", cex=0.1)



Bivariate kernel density estimation is available in the smoothScatter() function, which is included in the R distribution as part of the graphics package.

smoothScatter(xvals[,1], xvals[,2])

Monday, June 21, 2010

Example 7.42: Testing the proportionality assumption



In addition to the non-parametric tools discussed in recent entries, it's common to use proportional hazards regression (section 4.3.1), also called Cox regression, in evaluating survival data.

It's important in such models to test the proportionality assumption. Below, we demonstrate doing this for a simple model from the HELP data, available at the book web site. Our results below draw on this web page, part of the excellent resources provided by the UCLA ATS site.

SAS

First, we run a proportional hazards regression to assess the effects of treatment on the time to linkage with primary care. (Data were read in and observations with missing values removed in example 7.40.) Only a portion of the results are shown.


proc phreg data=h2;
model dayslink*linkstatus(0)=treat;
output out= propcheck ressch = schres;
run;

Parameter Standard
Parameter DF Estimate Error Chi-Square Pr > ChiSq

treat 1 1.59103 0.19142 69.0837 <.0001


Being in the intervention group would appear to increase the probability of being linked to primary care. But do the model assumptions hold? The output statement above makes a new data set that contains the Schoenfeld residuals. (Schoenfeld D. Residuals for the proportional hazards regression model. Biometrika, 1982, 69(1):239-241.) One assessment of proportional hazards is based on these residuals, which ought to show no association with time if proportionality holds. We can plot them against the time to linkage using proc sgplot (section 5.1.1) and adding a loess curve to assess the relationship.


proc sgplot data=propcheck;
loess x = dayslink y = schres / clm;
run;


From the resulting plot, shown above, there is an indication of a possible problem. One way to assess this is to include a time-varying covariate, an interaction between the suspect predictor(s) and the event time. This can be done easily within proc phreg. If the interaction term is significant, the null hypothesis of proportionality has been rejected.


proc phreg data=h2;
model dayslink*linkstatus(0)=treat treat_time;
treat_time = treat*log(dayslink);
run;

Parameter Standard
Parameter DF Estimate Error Chi-Square Pr > ChiSq

treat 1 4.13105 0.97873 17.8153 <.0001
treat_time 1 -0.61846 0.22569 7.5093 0.0061


The proportional hazards assumption doesn't hold in this case.

R

In R, the time-varying covariate approach is harder to implement. However, the survival library includes a formal test based on the Schoenfeld residuals. (See Therneau and Grambsch, 2000, pages 127-142.) This is accessed by applying the cox.zph() function to the output of the coxph() function. The smallds object used below was created in the previous example. (Some output omitted.)


> library(survival)
> propcheck = coxph(Surv(dayslink, linkstatus) ~ treat, data=smallds, method="breslow")

> summary(propcheck)

coef exp(coef) se(coef) z Pr(>|z|)
treat 1.5910 4.9088 0.1914 8.312 <2e-16 ***

> cox.zph(propcheck)
rho chisq p
treat -0.233 8.47 0.00361


This test agrees with the SAS approach: the null hypothesis that proportionality holds can be rejected. To understand why, you can plot() the test. The transform='identity' option is used to make the results more similar to those obtained with SAS.


plot(cox.zph(propcheck, transform='identity'))
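For the time-varying covariate approach itself, more recent versions of the survival package do offer the tt= argument to coxph(), which fits a time-varying coefficient much like the treat*log(dayslink) interaction in the SAS code above. A sketch using the package's built-in lung data (not the HELP data used in this entry):

```r
library(survival)
# Time-varying coefficient via coxph()'s tt= argument: the tt() term
# is expanded to x * log(t) at each event time, analogous to the
# treat*log(dayslink) programming statement in proc phreg.
fit = coxph(Surv(time, status) ~ ph.ecog + tt(ph.ecog), data=lung,
            tt=function(x, t, ...) x * log(t))
coef(fit)
```

A significant coefficient on the tt() term would again indicate a violation of proportionality for that predictor.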