SAS and R: April 2010

Monday, April 26, 2010

Example 7.34: Propensity scores and causal inference from observational studies

Propensity scores can be used to help make causal interpretation of observational data more plausible, by adjusting for other factors that may responsible for differences between groups. Heuristically, we estimate the probability of exposure, rather than randomize exposure, as we'd ideally prefer to do. The estimated probability of exposure is the propensity score. If our estimation of the propensity score incorporates the reasons why people self-select to exposure status, then two individuals with equal propensity score are equally likely to be exposed, and we can interpret them as being randomly assigned to exposure. This process is not unlike ordinary regression adjustment for potential confounders, but uses fewer degrees of freedom and can incorporate more variables.

As an example, we consider the HELP data used extensively for examples in our book. Does homelessness affect physical health, as measured by the PCS score from the SF-36?

First, we consider modeling this relationship directly. This analysis only answers the question of whether homelessness is associated with poorer physical health.

Then we create a propensity score by estimating a logistic regression to predict homelessness using age, gender, number of drinks, and mental health composite score. Finally, we include the propensity score in the model predicting PCS from homelessness. If we accept that these propensity predictors fully account for the probability of homelessness, and there is an association between homelessness and PCS in the model adjusting for propensity, and the directionality of the association flows from homelessness to PCS, then we can conclude that homelessness causes differences in PCS.

We note here that this conclusion relies on other untestable assumptions, including linearity in the relationship between the propensity and PCS. Many users of propensity scores prefer to fit models within strata of the propensity score, or to match on propensity score, rather than use the regression adjustment we present in this entry. In a future entry we'll demonstrate the use of matching.

In a departure from our usual practice, we show only pieces of the output below.

SAS

We being by reading in the data and fitting the model. This is effectively a t-test (section 2.4.1), but we use proc glm to more easily compare with the adjusted results.


proc glm data="c:\book\help";
  model pcs = homeless/solution;
run;

                                Standard
Parameter        Estimate          Error   t Value   Pr > |t|

Intercept     49.00082904     0.68801845     71.22     <.0001
HOMELESS      -2.06404896     1.01292210     -2.04     0.0422

It would appear that homeless patients are in worse health than the others.

We next use proc logistic to estimate the propensity to homelessness, using the output statement to save the predicted probabilities. We omit the output here; it could be excluded in practice using the ODS exclude all statement.


proc logistic data="c:\book\help" desc;
  model homeless = age female i1 mcs;
  output out=propen pred=propensity;
run;

It's important to make sure that there is a reasonable amount of overlap in the propensity scores between the two groups. Otherwise, we'd be extrapolating outside the range of the data when we adjust.


proc means data=propen;
  class homeless;
  var propensity;
run;
                 N
    HOMELESS   Obs           Mean        Minimum        Maximum
---------------------------------------------------------------
           0   244      0.4296704      0.2136791      0.7876000

           1   209      0.4983750      0.2635031      0.9642827
---------------------------------------------------------------

The mean propensity to homelessness is larger in the homeless group. If this were not the case, we might be concerned the the logistic model is too poor a predictor of homelessness to generate an effective propensity score. However, the maximum propensity among the homeless is 20% larger than the largest propensity in the non-homeless group. This suggests that a further review of the propensities would be wise. To check them, we'll generate histograms for each group using the proc univariate (section 5.1.4).


proc univariate data=propen;
  class homeless;
  var propensity;
  histogram propensity;
run;

The resulting histograms suggest a some risk of extrapolation. In our model, we'll remove subjects with propensities greater than 0.8.


proc glm data=propen;
  where propensity lt .8;
  model pcs = homeless propensity/solution;
run;
                                 Standard
Parameter         Estimate          Error   t Value   Pr > |t|

Intercept      54.19944991     1.98264608     27.34     <.0001
HOMELESS       -1.19612942     1.03892589     -1.15     0.2502
propensity    -12.09909082     4.33385405     -2.79     0.0055

After the adjustment, we see a much smaller difference in the physical health of homeless and non-homeless subjects, and find no significant evidence of an association.

R


ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
attach(ds)
summary(lm(pcs ~ homeless))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   49.001      0.688  71.220   <2e-16 ***
homeless      -2.064      1.013  -2.038   0.0422 *

We use the glm() function to fit the logistic
regression model (section 4.1.1). A formula object is used to specify the model. The predicted probabilities that we'll use as propensity scores are in the fitted element of the output object.


form = formula(homeless ~ age + female + i1 + mcs)
glm1 = glm(form, family=binomial)
X = glm1$fitted

As in the SAS development, we check the resulting values. Here we use the fivenum() function (section 2.1.4) with the tapply() function (section 2.1.2) to get the results for each level of homelessness.


> tapply(X,homeless, FUN=fivenum)
$`0`
      398        97       378        69       438 
0.2136787 0.3464170 0.4040223 0.4984242 0.7876015 

$`1`
       16        18       262       293       286 
0.2635026 0.4002759 0.4739152 0.5768015 0.9642833

Finding the same troubling evidence of non-overlap, we fit a histogram for each group. We do this manually, setting up two output areas with the par() function (section 5.3.6) and conditioning to use data from each homeless value in two calls to the hist() function (section 5.1.4).


par(mfrow=c(2,1))
hist(X[homeless==0], xlim=c(0.2,1))
hist(X[homeless==1], xlim=c(0.2,1))

As noted above, we'll exclude subjects with propensity greater than 0.8. This is done with the subset option to the lm() function (as in section 3.7.4).


summary(lm(pcs ~ homeless + X,subset=(X < .8)))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   54.199      1.983  27.337  < 2e-16 ***
homeless      -1.196      1.039  -1.151  0.25023    
X            -12.099      4.334  -2.792  0.00547 **

Monday, April 19, 2010

Example 7.33: Specifying fonts in graphics

For interactive data analysis, the default fonts used by SAS and R are acceptable, if not beautiful. However, for publication, it may be important to manipulate the fonts. For example, it would be desirable for the fonts in legends, axis labels, or other text printed in plots to approximate the typeface used in the rest of the text. Credit where it's due department: this blog entry is inspired by this blog post by Yihui Xie.

As an example plot, we'll revisit Figure 2.2 from the book, in which we plot MCS by CESD, with plot symbol showing substance abused.

SAS

In SAS, we'll focus on the "traditional" graphics most likely to be used for scientific publication. ODS graphics and the new SG procedures are currently difficult to customize.

There are several fonts supplied with the standard installation. However, any TrueType or Adobe Type 1 font on a local disk can be used. Making the fonts available requires a one-time use of proc fontreg.


proc fontreg mode=all;
   fontpath "c:/windows/fonts";
run;

In the above, the fontpath is the location in the operating system of the .ttf (TrueType) and/or .pfa/.pfb (Type 1) files.

Any font in that directory can then be used in SAS graphics, regardless of the format of the image file. Note however that this approach may mean that running your code on a different computer may alter the appearance of the graphic, since the fonts may not be available on another computer. If you anticipate needing to share the code, you can stick with the fonts SAS supplies, which are described in the online documentation: SAS Products; SAS/GRAPH; Concepts; Fonts.

The simplest way to make all text default to a desired font is to use the goptions statement (sections under 5.3).

 
goptions ftext="Comic Sans MS";

In Windows, the name of the font to put inside the quotation marks is displayed in the Explorer. If bold, italic, or bold italic is available, it can be requested by appending /bo, /it, or /bo/it to the font name, as demonstrated below.

Many statements in SAS/GRAPH accept a font= option, and these can be used to override the default font specified in the goptions statement.

In the example below, we use several different fonts to demonstrate how different statements specify fonts. In fact, the only plot elements in the assigned default Rockwell typeface are the y axis label and numbers and the labels of the symbols in the legend.


filename myurl 
   url 'http://www.math.smith.edu/sasr/datasets/help.csv' 
   lrecl=704;

proc import datafile=myurl out=ds dbms=dlm;
   delimiter=',';
   getnames=yes;
run;

goptions ftext = "Rockwell";

title font="Lucida Handwriting" "MCS by PCS with fonts";

legend1 
  mode=reserve position=(bottom center outside) across=3
  label = (font= "Elephant" h=2 "Substance");

axis1 label=(font="Goudy Old Style" h=2 "The CESD axis") 
  value = ( h = 2 font= "Comic Sans MS") minor=none;

symbol1 font="Comic Sans MS" v='A' h=.7 c=black;
symbol2 font="Agency FB/bo" v='C' h=.7 c=black;
symbol3 font="Franklin Gothic Book/bo/it" v='H' h=.7;
proc gplot data=ds;
   where female=1;
   plot mcs*cesd=substance / legend=legend1 haxis=axis1;
run; quit;

The results, shown below, demonstrate the comic effects of reserving too much of the plot space for labels.

R

In R, the available fonts and the ways to use them varies by device. TrueType fonts can be displayed easily for the windows() device, and can similarly be used in publication graphics through the win.metafile() device (section 5.4.5).


windowsFonts(CS = windowsFont("Comic Sans MS"))
windows()
par(family="CS")

The windowsFonts() function would be called for each font to be included.

Unfortunately, the pdf() and postscript() devices most likely to be useful for publication using LaTeX do not appear to be able to read TrueType fonts, only Adobe Type 1 fonts. Some fonts can be purchased or downloaded for free in this format. If any reader has had success in using external fonts for these devices, I hope they'll provides code or links in the comments. There are some resources for converting TrueType to Type 1 freely available for *nix operating systems. However, for technical reasons, these conversions don't usually offer satisfying results.

Fortunately, R comes with several fonts for these devices. Their names can be easily displayed:


> names(pdfFonts())
 [1] "serif"                "sans"                
 [3] "mono"                 "AvantGarde"          
 [5] "Bookman"              "Courier"             
 [7] "Helvetica"            "Helvetica-Narrow"    
 [9] "NewCenturySchoolbook" "Palatino"            
[11] "Times"                "URWGothic"           
[13] "URWBookman"           "NimbusMon"           
[15] "NimbusSan"            "URWHelvetica"        
[17] "NimbusSanCond"        "CenturySch"          
[19] "URWPalladio"          "NimbusRom"           
[21] "URWTimes"             "Japan1"              
[23] "Japan1HeiMin"         "Japan1GothicBBB"     
[25] "Japan1Ryumin"         "Korea1"              
[27] "Korea1deb"            "CNS1"                
[29] "GB1"

As with SAS, R offers both ways to change the default font (with the par() function) and fine control of individual options in specific function calls. R stores fonts in familys with names as listed above, and the (confusingly named) font which is an integer where 1 corresponds to plain text (the default), 2 to bold face, 3 to italic, 4 to bold italic, and 5 to the symbol font. Not all faces are necessarily available for all font families.


pdf(file="c:/temp/test1.pdf")
par(family="Palatino")
plot(cesd[female==1], mcs[female==1], type="n", bty="n", ylab="MCS",
   xlab = '')
text(cesd[female==1&substance=="alcohol"],
   mcs[female==1&substance=="alcohol"],"A", family="AvantGarde", font=2)
text(cesd[female==1&substance=="cocaine"],
   mcs[female==1&substance=="cocaine"],"C", family="serif")
text(cesd[female==1&substance=="heroin"],
   mcs[female==1&substance=="heroin"],"H", family="Courier", font=4)
title(xlab="This is the CESD axis", family="NewCenturySchoolbook", cex.lab=2)
title(family="Helvetica", font.main=3, cex.main=3, "MCS by CESD with fonts")
dev.off()

Similar to the SAS example, the only characters in the default Palatino font are the y axis label and the numerals.

Replicating a SAS legend appearing below the plot would be more difficult in R, as would replicating the default SAS legend that shows different plotted font characters.

Monday, April 12, 2010

Example 7.32: Add reference lines to a plot; fine control of tick marks

Sometimes it's useful to plot regular reference lines along with the data. For a time-series plot, this can show when critical values are reached in a clearer way than simple tick marks.

As an example, we revisit the empirical CDF plot shown in Example 7.11. If you missed that entry, the data can be downloaded so you can easily explore the code shown below. We'll show how to add a regular grid of lines and lines at specific x or y values.

In a departure from our usual style, we'll discuss SAS and R in parallel, not in sequence.

Original Plot

The plot shown in Example 7.11 was obtained by doing some calculation and then with the following code.

SAS


symbol1 i=j v=none c=blue;
proc gplot data=help_a;
   plot ecdf_pcs * pcs;
run;


plot(sortpcs, ecdfpcs, type="n")
lines(sortpcs, ecdfpcs)

A simple default reference grid can be produced in SAS by changing the plot statement to read plot ecdf_pcs * pcs / grid;. In R, the grid() function (with no parameters specified) has the same effect. Either adds light grey lines at the major tick marks in both the x and y directions. The major tick marks themselves can be selected using the axis statement in SAS or with the axis() function in R, as discussed in section 5.3.7 and 5.3.8. Changing the major tick marks in SAS will make the grid appear at the modified tick locations.


axis1 order=(0 to 1 by .25) minor=none;
axis2 order=(10 to 80 by 35) minor=none;
symbol1 i=j v=none c=blue;
proc gplot data=help_a;
   plot ecdf_pcs * pcs / vaxis=axis1 haxis=axis2 grid;
run;

Unfortunately, it is difficult to get the grid() function to match up with tick marks away from the defaults. A better approach in R is to use the abline() function (section 5.2.1). to draw the reference lines where you want them. In the following code, we also demonstrate customization of the tick marks to match the SAS output shown above.


plot(sortpcs, ecdfpcs, type="n", xaxt="n", yaxt="n", xlim=c(10,80))
axis(side=1, at=c(10,45,80))
axis(side=2, at=c(seq(0,1,.25)))
lines(sortpcs, ecdfpcs)
abline(v=c(10,45,80), col="lightgray", lty="dotted")
abline(h=c(seq(0,1,.25)), col ="lightgrey", lty="dotted")

In the above, the options to plot() suppress the data and axes and specify the range of the x axis. The axis() function calls specify where the tick marks should appear, and the abline() function calls add the reference lines.

SAS also allows manual specification of reference lines. A result equivalent to the demonstrated grid option for the plot statement could be obtained as follows.


axis1 order=(0 to 1 by .25) minor=none;
axis2 order=(10 to 80 by 35) minor=none;
symbol1 i=j v=none c=blue;
proc gplot data=help_a;
   plot ecdf_pcs * pcs / vaxis=axis1 haxis=axis2 href=10,45,80 
      vref=0,.25,.5,.75,1 chref=lightgrey cvref=lightgrey;
run;
quit;

The added control offered by the abline() or ?ref approach is that reference lines can trivially be drawn at points not appearing at major tick marks.
For example, we might have a particular interest in the 90th percentile of the data. We can add abline(h=.9, col="lightgrey", lty="dotted") as a separate command in R, or add the new value to the list of vrefs in SAS to add this line. The final result is shown below.

(The image above modifies the abline() calls to drop the lty option and change the color to "grey". Otherwise the reference lines were too faint to display well here.)

Monday, April 5, 2010

Example 7.31: Contour plot of BMI by weight and height

A contour plot is a simple way to plot a surface in two dimensions. Lines with a constant Z value are plotted on the X-Y plane.

Typical uses include weather maps displaying "isobars" (lines of constant pressure), and maps displaying lines of constant elevation useful in, e.g., hiking. Unusual examples include maps of constant travel time, for example this map, which transforms space so that constant travel time can be presented as circles, or Francis Galton's travel map, below, in which the lines of constant travel time are reflected in the borders between colors.

As an example, here, we plot iso-BMI lines showing regions of underweight, normal weight, overweight, and obesity. As presented by National Heart, Lung, and Blood Institute, the borders between these categories are at BMI of 18.5, 25, and 30. Leaving aside questions of whether risk changes precipitously at these conveniently round numbers, it would be helpful to see what the categories imply across a range of heights and weights.

SAS

In SAS, we loop (section 1.11.1) through a range of heights (English) and weights (avoirdupois), calculating the BMI for each.


data bmi;
  do ht = 48 to 84 by ((84-48)/1000);
    do wt = 90 to 300 by ((300 -90)/1000);
      bmi = (wt*703)/(ht**2); 
   output;
 end;
  end;
run;

We then prepare some labels for the plot, using the annotate macros (section 5.2, 5.2.11).


%annomac;
data annobmi;
  %system(2,2,2); 
  %label(75,120,"Underweight",black,18,0,17);
  %label(70,150,"Normal",black,33,0,17);
  %label(65,170,"Overweight",black,37,0,17);
  %label(55,200,"Obese",black,40,0,17);
  run;

Finally, we use proc contour (section 5.1.12) to make the contour plot, using the anno option to plot the labels.


title h=1.25 "BMI categories by height and weight";
proc gcontour data = bmi anno=annobmi;
  plot wt * ht = bmi / levels = 18.5,25,30 nolegend;
  label ht="Height (inches)" wt="Weight (lbs)";
run;

R

In R, we build the data using the seq() function (section 1.11.3) for height and weight separately, and generate a data frame with all combinations of height and weight with the expand.grid() function (see example 7.22). After defining a function to calculate the BMI once, we apply it to the whole data set, storing the results in the matrix form we'll need using the matrix() function (section 1.9.1).


ht = seq(48,84, length.out=1000)
wt = seq(90,300, length.out=1000)
wtht = expand.grid(x=ht, y=wt)
bmi = function(h,w) {(w * 703)/(h*h)}
bmiwtht = matrix(bmi(wtht$x,wtht$y),length(ht),length(wt))

Now we're ready to use the contour() function (section 5.1.12) to make the plot and to add the labels with the text() function (section 5.2.11).


contour(ht,wt,bmiwtht,levels = c(18.5,25,30), drawlabels=FALSE,
        xlab="Height (inches)",ylab="Weight (lbs)",
        main="BMI categories by height and weight")

text(55,200,"Obese",cex=2,srt=45)
text(65,165,"Overweight",cex=2,srt=40)
text(70,150,"Normal",cex=2,srt=35)
text(75,120,"Underweight",cex=2,srt=18)

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

SAS and R

Catalogs of posts