Showing posts with label proc_r. Show all posts

Tuesday, February 7, 2012

Example 9.20: visualizing Simpson's paradox

Simpson's paradox is always amazing to explain to students. Something that's bad for one group and bad for another group can be good for everyone, if you just collapse over the grouping variable. Unlike many mathematical paradoxes, this arises in a number of real-world settings.
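
The flavor of the paradox is easy to reproduce with made-up numbers (a hypothetical sketch, not the SAT data below): two groups, each with a positive within-group slope, that pool to a negative one.

```python
# Two made-up groups: within each, y rises with x (slope +1), but
# group B sits at high x and low y, so the pooled slope is negative.
def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sum((xi - mx) ** 2 for xi in x)
    return num / den

xa, ya = [0, 1, 2], [10, 11, 12]        # group A: slope +1
xb, yb = [10, 11, 12], [0, 1, 2]        # group B: slope +1

print(slope(xa, ya))            # 1.0 within group A
print(slope(xb, yb))            # 1.0 within group B
print(slope(xa + xb, ya + yb))  # about -0.95 when the groups are pooled
```

Ignoring the grouping variable reverses the sign of the relationship, which is exactly what the SAT data will show.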

In this entry, we consider visualizing Simpson's paradox, using data from a study of average SAT scores, average teacher salaries and percentage of students taking the SATs at the state level (published by Deborah Lynn Guber, "Getting what you pay for: the debate over equity in public school expenditures" (1999), Journal of Statistics Education 7(2)).

The relevant data are available within the mosaic package.

> library(mosaic); trellis.par.set(theme=col.mosaic())
> head(SAT)
state expend ratio salary frac verbal math sat
1 Alabama 4.405 17.2 31.144 8 491 538 1029
2 Alaska 8.963 17.6 47.951 47 445 489 934
3 Arizona 4.778 19.3 32.175 27 448 496 944
4 Arkansas 4.459 17.1 28.934 6 482 523 1005
5 California 4.992 24.0 41.078 45 417 485 902
6 Colorado 5.443 18.4 34.571 29 462 518 980

The paradox manifests itself here if we ignore the fraction of students taking the SAT and instead consider the bivariate relationship between average teacher salary and average SAT score:

> lm(sat ~ salary, data=SAT)

Call:
lm(formula = sat ~ salary, data = SAT)

Coefficients:
(Intercept)       salary
    1158.86        -5.54

Unfortunately, the relationship here appears to be negative: as salary increases,
average SAT score is predicted to decrease.

Luckily for the educators in the audience, once the confounding (aka lurking) variable is accounted for, we see a statistically significant positive relationship:

> summary(lm(sat ~ salary + frac, data=SAT))

Estimate Std. Error t value Pr(>|t|)
(Intercept) 987.9005 31.8775 30.991 <2e-16 ***
salary 2.1804 1.0291 2.119 0.0394 *
frac -2.7787 0.2285 -12.163 4e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 33.69 on 47 degrees of freedom
Multiple R-squared: 0.8056, Adjusted R-squared: 0.7973
F-statistic: 97.36 on 2 and 47 DF, p-value: < 2.2e-16

That's all well and good, but how to explain what is going on? We can start with a scatterplot, but unless the state names are plotted then it's hard to see what's going on (imagine the plot at the top of this entry without the text labels or the color shading).

It's straightforward to use the panel functionality within the lattice package to create a more useful plot, where state names are plotted rather than points, and different colors are used to represent the low (red), medium (blue) and high (green) values for the fraction of students taking the SAT.

SAT$fracgrp = cut(SAT$frac, breaks=c(0, 22, 49, 81),
  labels=c("low", "medium", "high"))
SAT$fraccount = 1 + as.numeric(SAT$fracgrp=="medium") +
  2*as.numeric(SAT$fracgrp=="high")
panel.labels = function(x, y, col='black', labels='x', ...)
  { panel.text(x, y, labels, col=col, ...) }
xyplot(sat ~ salary, data=SAT, groups=fracgrp,
  cex=0.6, panel=panel.labels, labels=SAT$state,
  col=c('red', 'blue', 'green')[SAT$fraccount])

We see that within each group, the slope is positive (or at least non-negative), while overall there is a negative slope. Indeed, we see all of the hallmarks of a serious confounder in the correlation matrix for the n=50 observations:

> with(SAT, cor(cbind(sat, frac, salary)))
sat frac salary
sat 1.0000000 -0.8871187 -0.4398834
frac -0.8871187 1.0000000 0.6167799
salary -0.4398834 0.6167799 1.0000000

There's a strong negative association between SAT (Y) and fraction (Z), as well as a strong positive association between salary (X) and fraction (Z).
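
The sign flip is in fact the omitted-variable-bias identity: the bivariate slope equals the adjusted slope plus the confounder's coefficient times the slope of the confounder on the predictor. A small numeric check with made-up data (not the SAT values):

```python
def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

# Construct data with a known adjusted model: y = 2*x - 3*z exactly,
# where the confounder z is strongly positively tied to x.
x = [0, 1, 2, 3]
z = [2 * xi + d for xi, d in zip(x, [1, -1, 1, -1])]  # z rides up with x
y = [2 * xi - 3 * zi for xi, zi in zip(x, z)]

b_x, b_z = 2.0, -3.0        # the (known) adjusted coefficients
g = slope(x, z)             # slope of confounder on predictor
marginal = slope(x, y)      # what the bivariate regression sees

print(g)                         # 1.6
print(marginal)                  # -2.8
print(round(b_x + b_z * g, 10))  # -2.8: identity holds, sign flipped
```

The adjusted coefficient on x is +2, yet the bivariate slope is negative, just as with salary in the SAT data.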


We begin by getting the data out of R. This is a snap, thanks to the proc_r macro we discussed here. Just read the macro in, tell it to grab the sat object on its way back from R, then write the R code to read in the data set. This technique would be great for getting data from the histdata package, discussed here, into SAS.

%include "C:\ken\sasmacros\";
%Proc_R (SAS2R =, R2SAS = sat);

library(mosaic)
sat = SAT


We'll skip the regression (best accomplished in proc glm) and jump to the helpful plot. We'll need to categorize the fraction taking the test. A natural way to do this in SAS would be to use formats, but we prefer to generate actual data as a more transparent process. To make the plot, we'll use the sgplot procedure. Typically this allows less control than the gplot procedure, but the results in this case are quick and attractive. The group= and datalabel= options add the colors and legend, and the state names, respectively.

data sat2; set sat;
length fracgrp $ 6;
if frac le 22 then fracgrp = "Low";
else if frac le 49 then fracgrp = "Medium";
else fracgrp = "High";
run;

proc sgplot data=sat2;
scatter x = salary y = sat / group=fracgrp datalabel=state;
run; quit;

Thursday, January 26, 2012

SAS Macro Simplifies SAS and R integration (Updated)

Many of us feel very enthusiastic about R. It's free, it features cutting edge applications, it has a large community of users contributing for mutual benefit, and on and on. There are also many things to like about SAS: stability, backwards compatibility, and professional support among them. The way to be the best analyst you can be is to be flexible and have as many tools at your disposal as you can manage. That's the main motivating principle behind our book and what we do here in this blog.

Today we call attention to a SAS macro that greatly eases integrating R from SAS. Published last month in the Journal of Statistical Software, the macro (written by Xin Wei of Roche Pharmaceuticals) is called Proc_R, and we discuss its installation and use today. For a fuller write-up, see the paper, here. For SAS users, the macro is a huge productivity booster, allowing one to easily complete data management and/or partial data analysis in SAS, skip out quickly to R for analyses that are awkward or impossible in SAS, then return to SAS for completion. For people in industry, this may also ease integrating R into documentation systems built for SAS code. See this post on DecisionStats for a review of other integration attempts.

Getting ready

1. Download the "SAS source code" and the "Replication code and instructions".

2. Move the macro somewhere you have write access.

3. Open the macro in a text editor and change line 46 so that the rpath option points to the location of your R executable.

(4. If you're running Windows 7 or Vista, and you have SAS 9.1 or above, follow the instructions in a PDF in the second supplemental file you downloaded. This makes a shortcut for a special version of SAS. I'm not at all sure why you have to do this, though; I had the same results running in my usual SAS set-up.)

That's it! The way the macro works is to read in your R code as a SAS data set, write it out to a file, call R to run it, and then do a bunch of post-processing. The basic macro call looks like this:

%include "C:\ken\sasmacros\";
%Proc_R (SAS2R =, R2SAS =);

***Please Enter R Code Here***


You just replace the starred lines with R code, and run-- the R results, if any, appear in your SAS output and/or results windows. The SAS2R value is a list of the names of SAS data sets you'd like to send to R; they're added into the R environment before your code is executed. The R2SAS value is a list of the names of R objects (that can be coerced to data frames) that you'd like to become SAS data sets.
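
The round trip the macro performs -- dump the code to a file, shell out to the interpreter, read the results back as CSV -- can be sketched in a few lines. Here Python stands in for R so the sketch is self-contained and runnable, and the file and column names are made up:

```python
import csv
import os
import subprocess
import sys
import tempfile

# Sketch of the Proc_R round trip, with Python standing in for R:
# write the user's code to a script, run it in a fresh interpreter,
# then read the CSV the script produced back into the calling session.
workdir = tempfile.mkdtemp()
script = os.path.join(workdir, "job.py")     # stands in for the .R file
out_csv = os.path.join(workdir, "mylm.csv")  # stands in for R's write.csv

with open(script, "w") as f:
    f.write("import csv\n")
    f.write(f"with open({out_csv!r}, 'w', newline='') as out:\n")
    f.write("    w = csv.writer(out)\n")
    f.write("    w.writerow(['_Intercept_', 'x'])\n")
    f.write("    w.writerow([0.0086, 1.0206])\n")

subprocess.run([sys.executable, script], check=True)  # the 'call R' step

with open(out_csv, newline="") as f:
    rows = list(csv.reader(f))
print(rows)  # [['_Intercept_', 'x'], ['0.0086', '1.0206']]
```

In the real macro the last step is a proc import back into SAS rather than a csv.reader, but the mechanism is the same.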

Here's a trivial example-- generate two data sets in SAS, send them to R to run linear regressions, and send the resulting parameter estimates back to SAS.

data test;
do i = 1 to 1000;
  x = normal(0);
  y = x + normal(0);
  output;
  end;
run;

data t2;
do i = 1 to 100;
  x = normal(0);
  y = x + uniform(0);
  output;
  end;
run;

%include "C:\";
%Proc_R (SAS2R =test t2, R2SAS =mylm mylm2);
an.lm = with(test,lm(y ~x))
mylm = t(coef(an.lm))

an.lm2 = with(t2,lm(y~x))
mylm2 = t(coef(an.lm2))

proc print data = mylm; run;
proc print data = mylm2; run;

And here's what you get in the SAS log.

[First, proc_r result]

******************R OUTPUT***********************


> setwd("c:/temp")
> library(grDevices)
> png("c:/temp/....png")
> test<- read.csv('c:/temp/test.csv')
> t2<- read.csv('c:/temp/t2.csv')
> an.lm = with(test,lm(y ~x))
> mylm = t(coef(an.lm))
> summary(an.lm)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-2.8571 -0.6430 -0.0051  0.6713  3.5903

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.008568   0.031686    0.27    0.787
x           1.020640   0.033315   30.64   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.002 on 998 degrees of freedom
Multiple R-squared: 0.4846, Adjusted R-squared: 0.4841
F-statistic: 938.5 on 1 and 998 DF, p-value: < 2.2e-16

> an.lm2 = with(t2,lm(y~x))
> mylm2 = t(coef(an.lm2))
> write.csv(mylm,'mylm.csv',row.names=F)
> write.csv(mylm2,'mylm2.csv',row.names=F)
> dev.off()
null device
          1
> q()
> proc.time()
user system elapsed
0.28 0.10 0.37

[Here are the proc print results]
Obs _Intercept_ x
1 0.0085676126 1.0206400545

Obs _Intercept_ x
1 0.528410053 0.9851225238

(Page breaks and some extraneous stuff removed.)

It's pretty magical for a SAS user to see R living in the SAS output like this. But there are some caveats. First, this is a Windows-only macro. If you run SAS on *nix, you may not be able to get it to work. Second, while the article has examples of graphics from R neatly appearing in SAS, this failed for me. This may be because I run SAS 9.3, while the author of the macro is still using an earlier version of SAS. I may try to diagnose and fix this problem, and will update this entry if I find a fix. (Fixed! See update below.)

(UPDATE: Reader Abhijit suggested a setwd() in the R code as a fix for the graphics problem. This works, and I now get R graphics in my SAS results viewer. Even more magical. Code and output above updated to show this. Thanks, Abhijit!)

However, these seem like minor problems, compared with the overall simplification offered by the macro. It's been of great use to me in the past few months, and I expect it will help others as well. Many thanks and congratulations to Xin Wei!