Showing posts with label confounding. Show all posts

Tuesday, February 7, 2012

Example 9.20: visualizing Simpson's paradox


Simpson's paradox is always amazing to explain to students. What's bad for one group and bad for another group can be good for everyone, once you collapse over the grouping variable. Unlike many mathematical paradoxes, it arises in a number of real-world settings.
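Before turning to real data, a minimal numeric sketch (made-up numbers, not the SAT data) shows the reversal:

```r
# Two groups in which y increases with x (slope +1 within each group),
# yet pooling the groups yields a negative overall slope.
group = rep(c("A", "B"), each=4)
x = c(1, 2, 3, 4, 6, 7, 8, 9)
y = c(6, 7, 8, 9, 1, 2, 3, 4)
coef(lm(y ~ x))[2]                          # overall slope: -0.67
by(data.frame(x, y), group,
   function(d) coef(lm(y ~ x, data=d))[2])  # slope +1 within each group
```

Group A has small x and large y, group B the reverse, so collapsing over the groups flips the sign of the slope.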

In this entry, we consider visualizing Simpson's paradox, using data from a study of average SAT scores, average teacher salaries, and the percentage of students taking the SAT at the state level (published by Deborah Lynn Guber, "Getting what you pay for: the debate over equity in public school expenditures" (1999), Journal of Statistics Education 7(2)).

R
The relevant data are available within the mosaic package.

> library(mosaic); trellis.par.set(theme=col.mosaic())
> head(SAT)
       state expend ratio salary frac verbal math  sat
1    Alabama  4.405  17.2 31.144    8    491  538 1029
2     Alaska  8.963  17.6 47.951   47    445  489  934
3    Arizona  4.778  19.3 32.175   27    448  496  944
4   Arkansas  4.459  17.1 28.934    6    482  523 1005
5 California  4.992  24.0 41.078   45    417  485  902
6   Colorado  5.443  18.4 34.571   29    462  518  980

The paradox manifests itself here if we ignore the fraction of students taking the SAT and instead consider the bivariate relationship between average teacher salary and average SAT score:

> lm(sat ~ salary, data=SAT)

Coefficients:
(Intercept)       salary
    1158.86        -5.54

Unfortunately, the relationship here appears to be negative: as salary increases,
average SAT score is predicted to decrease.

Luckily for the educators in the audience, once the confounding (aka lurking) variable is accounted for, we see a statistically significant positive relationship:

> summary(lm(sat ~ salary + frac, data=SAT))

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 987.9005    31.8775  30.991   <2e-16 ***
salary        2.1804     1.0291   2.119   0.0394 *
frac         -2.7787     0.2285 -12.163    4e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 33.69 on 47 degrees of freedom
Multiple R-squared: 0.8056, Adjusted R-squared: 0.7973
F-statistic: 97.36 on 2 and 47 DF, p-value: < 2.2e-16

That's all well and good, but how can we explain what is going on? We can start with a scatterplot, but unless the state names are plotted it's hard to see the pattern (imagine the plot at the top of this entry without the text labels or the color shading).

It's straightforward to use the panel functionality within the lattice package to create a more useful plot, where state names are plotted rather than points, and different colors represent the low (red), medium (blue), and high (green) values for the fraction of students taking the SAT.

SAT$fracgrp = cut(SAT$frac, breaks=c(0, 22, 49, 81),
  labels=c("low", "medium", "high"))
SAT$fraccount = 1 + as.numeric(SAT$fracgrp=="medium") +
  2*as.numeric(SAT$fracgrp=="high")
panel.labels = function(x, y, col='black', labels='x', ...) {
  panel.text(x, y, labels, col=col, ...)
}
xyplot(sat ~ salary, data=SAT, groups=fracgrp,
  cex=0.6, panel=panel.labels, labels=SAT$state,
  col=c('red', 'blue', 'green')[SAT$fraccount])

We see that within each group, the slope is positive (or at least non-negative), while overall there is a negative slope. Indeed, we see all of the hallmarks of a serious confounder in the correlation matrix for the n=50 observations:

> with(SAT, cor(cbind(sat, frac, salary)))
sat frac salary
sat 1.0000000 -0.8871187 -0.4398834
frac -0.8871187 1.0000000 0.6167799
salary -0.4398834 0.6167799 1.0000000

There's a strong negative association between SAT (Y) and fraction (Z), as well as a strong positive association between salary (X) and fraction (Z).

SAS

We begin by getting the data out of R. This is a snap, thanks to the proc_r macro we discussed here. Just read the macro in, tell it to grab the sat object on its way back from R, then write the R code to read in the data set. This technique would be great for getting data from the HistData package, discussed here, into SAS.

%include "C:\ken\sasmacros\Proc_R.sas";
%Proc_R (SAS2R =, R2SAS = sat);
Cards4;

library(mosaic)
head(SAT)
sat = SAT

;;;;
%Quit;

We'll skip the regression (best accomplished in proc glm) and move on to the helpful plot. We'll need to categorize the fraction taking the test. A natural way to do this in SAS would be to use formats, but we prefer to generate actual data, as a more transparent process. To make the plot, we'll use the sgplot procedure. Typically this allows less control than the gplot procedure, but the results in this case are quick and attractive. The group= and datalabel= options add the colors and legend, and the state names, respectively.

data sat2; set sat;
/* declare the length first, so "Medium" is not truncated to 4 characters */
length fracgrp $ 6;
if frac le 22 then fracgrp = "Low";
else if frac le 49 then fracgrp = "Medium";
else fracgrp = "High";
run;

proc sgplot data=sat2;
scatter x = salary y = sat / group=fracgrp datalabel=state;
run; quit;

Monday, May 10, 2010

Example 7.36: Propensity score stratification

In examples 7.34 and 7.35 we described methods using propensity scores to account for possible confounding factors in an observational study.

In addition to adjusting for the propensity score in a multiple regression and matching on the propensity score, researchers will often stratify by the propensity score, and carry out analyses within each group defined by these scores. With sufficient sample size, we might use 5 or 10 strata, but in this example, we'll use 4.


R

To facilitate running the code from this entry, we include all of the previous code snippets (the code is also posted on the book website as propensity.R).


ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
attach(ds)
summary(lm(pcs ~ homeless))

form = formula(homeless ~ age + female + i1 + mcs)
glm1 = glm(form, family=binomial)
X = glm1$fitted
tapply(X, homeless, FUN=fivenum)
par(mfrow=c(2,1))
hist(X[homeless==0], xlim=c(0.2,1))
hist(X[homeless==1], xlim=c(0.2,1))
par(mfrow=c(1,1))
summary(lm(pcs ~ homeless + X, subset=(X<.8)))
library(Matching)
rr = Match(Y=pcs, Tr=homeless, X=X, M=1)
summary(rr)
MatchBalance(form, match.out=rr, nboots=10)


We will use the propensity scores (stored in the variable X) as a way of stratifying the dataset into four approximately equally sized groups. Unlike our earlier approach, we will include all subjects (even those with very high propensities) and note that as a result we will observe an imbalance in the distribution of homelessness by strata.



breakvals = fivenum(X) # section 2.1.4
strat = cut(X, breaks=breakvals,
labels=c('bot quart', '2nd quart', '3rd quart', 'top quart'),
include.lowest=TRUE)


Next we set some parameters needed to tweak the graphic that will be created.


topval = 82
botval = 15
cexval=.6
eps = 3


Next we create a set of boxplots for the PCS scores of the homeless and non-homeless in each of the four strata plus overall. In addition, we annotate the plot with the labels as well as the sample size in each group.


boxplot(pcs[homeless==0 & strat=='bot quart'],
pcs[homeless==1 & strat=='bot quart'],
pcs[homeless==0 & strat=='2nd quart'],
pcs[homeless==1 & strat=='2nd quart'],
pcs[homeless==0 & strat=='3rd quart'],
pcs[homeless==1 & strat=='3rd quart'],
pcs[homeless==0 & strat=='top quart'],
pcs[homeless==1 & strat=='top quart'],
pcs[homeless==0], pcs[homeless==1],
ylim=c(botval,topval), xaxt="n", ylab="PCS score")
abline(v=2.5)
abline(v=4.5)
abline(v=6.5)
abline(v=8.5, lwd=2)
text(1, topval, "not\nhomeless", cex=cexval)
text(2, topval, "homeless", cex=cexval)
text(3, topval, "not\nhomeless", cex=cexval)
text(4, topval, "homeless", cex=cexval)
text(5, topval, "not\nhomeless", cex=cexval)
text(6, topval, "homeless", cex=cexval)
text(7, topval, "not\nhomeless", cex=cexval)
text(8, topval, "homeless", cex=cexval)
text(9, topval, "not\nhomeless", cex=cexval)
text(10, topval, "homeless", cex=cexval)

text(1.5, botval+eps, "bot quart")
text(3.5, botval+eps, "2nd quart")
text(5.5, botval+eps, "3rd quart")
text(7.5, botval+eps, "top quart")
text(9.5, botval+eps, "overall")

text(1, topval-eps, paste("n=",sum(homeless==0 & strat=='bot quart', na.rm=TRUE),sep=""), cex=cexval)
text(2, topval-eps, paste("n=",sum(homeless==1 & strat=='bot quart', na.rm=TRUE),sep=""), cex=cexval)
text(3, topval-eps, paste("n=",sum(homeless==0 & strat=='2nd quart', na.rm=TRUE),sep=""), cex=cexval)
text(4, topval-eps, paste("n=",sum(homeless==1 & strat=='2nd quart', na.rm=TRUE),sep=""), cex=cexval)
text(5, topval-eps, paste("n=",sum(homeless==0 & strat=='3rd quart', na.rm=TRUE),sep=""), cex=cexval)
text(6, topval-eps, paste("n=",sum(homeless==1 & strat=='3rd quart', na.rm=TRUE),sep=""), cex=cexval)
text(7, topval-eps, paste("n=",sum(homeless==0 & strat=='top quart', na.rm=TRUE),sep=""), cex=cexval)
text(8, topval-eps, paste("n=",sum(homeless==1 & strat=='top quart', na.rm=TRUE),sep=""), cex=cexval)
text(9, topval-eps, paste("n=",sum(homeless==0, na.rm=TRUE),sep=""), cex=cexval)
text(10, topval-eps, paste("n=",sum(homeless==1, na.rm=TRUE),sep=""), cex=cexval)




We note that the difference between the median PCS scores is smaller within each of the quartiles of the propensity scores than the difference between the medians overall. In addition, there is some indication that the difference increases as a function of the propensity score. Finally, note that the proportion of subjects in the two groups varies widely by strata, swinging from 1/3 to 2/3 homeless.

We can carry out a t-test within each stratum to assess whether there are statistically significant differences between the PCS scores for the homeless and non-homeless subjects. Here we use the by() function (which takes three arguments: a dataset, a variable specifying the groups, and a function to be called on the subset defined by each level), combined with the with() function, to carry out the test for the appropriate subset, returning the p-value each time. This requires creating a dataframe with the appropriate quantities.


> stratdf = data.frame(pcs, homeless, strat)
> out = by(stratdf, strat,
function(mydataframe) {with(mydataframe,
t.test(pcs[homeless==0], pcs[homeless==1]))
})
> out
strat: bot quart

Welch Two Sample t-test

data: pcs[homeless == 0] and pcs[homeless == 1]
t = 0.806, df = 58.564, p-value = 0.4235
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.65544 6.23678
sample estimates:
mean of x mean of y
49.88714 48.09647

-------------------------------------------------------------------------------------------
strat: 2nd quart

Welch Two Sample t-test

data: pcs[homeless == 0] and pcs[homeless == 1]
t = -0.1011, df = 101.083, p-value = 0.9197
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.212064 3.803686
sample estimates:
mean of x mean of y
49.37089 49.57508

-------------------------------------------------------------------------------------------
strat: 3rd quart

Welch Two Sample t-test

data: pcs[homeless == 0] and pcs[homeless == 1]
t = 0.823, df = 110.918, p-value = 0.4123
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.357005 5.705745
sample estimates:
mean of x mean of y
49.44396 47.76959

-------------------------------------------------------------------------------------------
strat: top quart

Welch Two Sample t-test

data: pcs[homeless == 0] and pcs[homeless == 1]
t = 0.9222, df = 74.798, p-value = 0.3594
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.304285 6.276212
sample estimates:
mean of x mean of y
46.07584 44.08988



The model results show that while the mean PCS is slightly lower among homeless subjects in 3 of the 4 propensity strata, the standard error of the difference is so large that there's little credible evidence for a health cost ascribable to homelessness.



SAS

We begin by getting the propensity scores.


proc logistic data="c:\book\help" desc;
model homeless = age female i1 mcs;
output out=propen pred=propensity;
run;


Next, we find the quartiles of the propensity scores using proc means (section 2.1.1), saving the results in a new data set. The new data set has one observation containing the three quartiles.


proc means q1 median q3 data=propen;
var propensity;
output out = propenquart q1=q1 median=median q3=q3;
run;


We want to generate a categorical variable to indicate quartile membership. This requires having the quartile values from the propenquart data set available for each row of the data set with the propensities. To do this, we'll use the point option to the set statement. See this note for more details on this approach.


data propen2;
set propen;
setme = 1;
set propenquart point=setme;
if propensity lt q1 then propensity_quartile = "1st quart";
else if propensity lt median then propensity_quartile = "2nd quart";
else if propensity lt q3 then propensity_quartile = "3rd quart";
else if propensity lt 0.8 then propensity_quartile = "4th quart";
else delete;
if homeless = 1 then homeless_status = "Yes";
else if homeless = 0 then homeless_status = "No";
run;


Next, we'll use proc boxplot (section 5.1.7) to make a plot similar to the one made in R. The SAS plot does not show the boxes that ignore strata; these would most likely be quite a hassle to add. The boxplot procedure is finicky, and seems to work best for parallel boxplots when the group variables are character variables; sorting is often necessary.

 
proc sort data=propen2; by propensity_quartile homeless_status; run;

proc boxplot data=propen2;
plot pcs * homeless_status (propensity_quartile);
insetgroup n / pos=bottom;
run;





Conclusions are noted above.

Finally, we perform the analysis comparing PCS in the homeless and non-homeless groups within each stratum, using the ODS system to reduce output.


ods select none;
ods output parameterestimates=propstrata;
proc glm data=propen2;
by propensity_quartile;
model pcs=homeless;
run;
ods select all;


Here are the final results. Small differences from the R results arise because the defaults for computing quartiles differ between the two packages (note that for R, there are 79 subjects in the non-homeless bottom quartile, while SAS included 78 non-homeless subjects in that group).


proc print data=propstrata;
where parameter="HOMELESS";
run;

Obs   propensity_quartile      Estimate        StdErr     Probt

  2   1st quart             -2.00721337    2.10622538    0.3427
  4   2nd quart              0.44727354    2.11652411    0.8330
  6   3rd quart             -1.55746908    2.03369752    0.4454
  8   4th quart             -2.00910422    2.06426071    0.3325

Monday, May 3, 2010

Example 7.35: Propensity score matching

As discussed in example 7.34, it's sometimes preferable to match on propensity scores, rather than adjust for them as a covariate.

SAS

We use a suite of macros written by Jon Kosanke and Erik Bergstralh at the Mayo Clinic. The dist macro calculates the pairwise distances between observations, while the vmatch macro makes matches based on the distances, finding the closest set of matches while minimizing the overall distance. The latter macro uses the nobs macro. The original macros are available for download from the Mayo Clinic Division of Biomedical Statistics and Informatics. Slightly enhanced versions of the nobs and vmatch macros are available from the code section of the book website. The enhanced versions are used to generate the results below.

We previously created the propen data set containing the propensity score variable. In that entry we also noted a lack of overlap in the propensity distributions, and decided to drop observations with propensity > 0.8.

First we drop observations with propensity greater than 0.8 using the subsetting if statement (section 1.5.1). This leaves 201 homeless subjects. We then read in the macros using the %include statement (section 2.1.8). Then we run the dist macro, using its option to call the vmatch macro.


data prop2; set propen;
if propensity lt .8;
run;

%include "C:\ken\sasmacros\vmatch.sas";
%include "C:\ken\sasmacros\dist.sas";
%include "C:\ken\sasmacros\nobs.sas";

%dist(data=prop2, group=homeless, id=id, mvars=propensity,
wts=1, vmatch=Y, a=1, b=1, lilm=201, dmax=0.1,
outm=mp1_b, summatch=n, printm=N, mergeout=mpropen);


The macros are well documented by text included at the top of each macro, as is common with SAS macros that authors share. In the preceding code, the parameter values in the first line are relatively self-explanatory. The wts parameter allows multiple matching variables to be weighted differently. The dmax parameter specifies the maximum distance acceptable for a match. We arbitrarily decide that the propensities must be within 0.1 to make a match (our results would differ if other criteria were specified for the matching, as we will see below in the R comparison). The remaining parameters request the matching, ask for one and only one match per case, ask for all the cases to be matched (if possible), suppress printed output, and name the data set to contain the output.

The output data set mpropen is identical to the input data set, with the addition of a new indicator variable, matched. We can compare the distribution of the covariates before and after the matching with the means procedure (section 2.1.1).


title "observed data";
proc means data=propen mean;
class homeless;
var age female i1 mcs;
run;

title "matched observations";
proc means data=mpropen mean;
where matched;
class homeless;
var age female i1 mcs;
run;

observed data
N
HOMELESS Obs Variable Mean
-----------------------------------------------
0 244 AGE 35.0409836
FEMALE 0.2745902
I1 13.5122951
MCS 32.4868303

1 209 AGE 36.3684211
FEMALE 0.1913876
I1 23.0382775
MCS 30.7308549
-----------------------------------------------

matched observations
N
HOMELESS Obs Variable Mean
-----------------------------------------------
0 201 AGE 35.6218905
FEMALE 0.1791045
I1 15.9154229
MCS 31.4815123

1 201 AGE 36.1492537
FEMALE 0.1990050
I1 19.9452736
MCS 30.9176772
-----------------------------------------------


We see that the covariates are much better balanced after matching. Also note the dramatic impact of removing the 8 (4%) cases with very large propensities. This has reduced the mean number of drinks by 15%!

We can now perform the analysis on the matched cases. The two classes of homeless status now have nearly equal distributions of the probability of homelessness.


proc glm data=mpropen;
where matched;
model pcs = homeless / solution;
run;

Standard
Parameter Estimate Error t Value Pr > |t|

Intercept 48.95273471 0.76199823 64.24 <.0001
HOMELESS -1.79386398 1.07762823 -1.66 0.0968


After matching, the effect of homelessness on physical health is attenuated and has a larger p-value.


R

In R, the Matching library provides tools for matching and analysis. The Match() function implements a variety of algorithms for multivariate matching including propensity score, Mahalanobis and inverse variance matching. The function is intended to be used in conjunction with the MatchBalance() function which determines the extent to which covariate balance has been achieved. The function takes the propensity score as an argument, as well as the outcome to be compared and the group indicators.

A wide variety of matching options are available, including matching with or without replacement, bias adjustment, different methods for handling ties, and exact and caliper matching. The GenMatch() function can be used to find balance automatically via a genetic search algorithm that determines the optimal weight to give each covariate.

An extensive website describes the package and the many options that it supports, and a related paper is forthcoming. Three extended examples are included to help illustrate the mechanics of matching.


library(Matching)
rr = Match(Y=pcs, Tr=homeless, X=X, M=1)


The function returns an object describing the matching.


names(rr)
[1] "est" "se" "est.noadj" "se.standard"
[5] "se.cond" "mdata" "index.treated" "index.control"
[9] "index.dropped" "weights" "orig.nobs" "orig.wnobs"
[13] "orig.treated.nobs" "nobs" "wnobs" "caliper"
[17] "ecaliper" "exact" "ndrops" "ndrops.matches"
[21] "MatchLoopC" "version" "estimand"




The results can be displayed by running summary().
      
summary(rr)

Estimate... -0.80207
AI SE...... 1.4448
T-stat..... -0.55516
p.val...... 0.57878

Original number of observations.............. 453
Original number of treated obs............... 209
Matched number of observations............... 209
Matched number of observations (unweighted). 252


By default, the observations are given equal weight. If all of the observations had a weight of 1 on input, then each matched pair will have a weight of 1 on output if there are no ties.

We see that the causal estimate of -0.80 in the matched comparison is not statistically significant (p=0.58), which is consistent with the other results that accounted for the confounders (though we note that the specific results depend on the particular options that are selected for the matching).

The MatchBalance() function can be used to describe the distribution of the predictors (by homeless status) before and after matching (to save space, only the results for age and i1 are displayed). This is helpful for determining whether the matching resulted in similar marginal distributions.




> MatchBalance(form, match.out=rr, nboots=10)
***** (V1) age ***** Before Matching After Matching
mean treatment........ 36.368 36.368
mean control.......... 35.041 36.423
std mean diff......... 16.069 -0.65642

mean raw eQQ diff..... 1.5981 0.94841
med raw eQQ diff..... 1 1
max raw eQQ diff..... 7 10

mean eCDF diff........ 0.037112 0.022581
med eCDF diff........ 0.026365 0.019841
max eCDF diff........ 0.10477 0.083333

var ratio (Tr/Co)..... 1.3290 1.2671
T-test p-value........ 0.070785 0.93902
KS Bootstrap p-value.. < 2.22e-16 0.3
KS Naive p-value...... 0.16881 0.34573
KS Statistic.......... 0.10477 0.083333

***** (V3) i1 ***** Before Matching After Matching
mean treatment........ 23.038 23.038
mean control.......... 13.512 20.939
std mean diff......... 40.582 8.945

mean raw eQQ diff..... 9.6316 2.1071
med raw eQQ diff..... 8 1
max raw eQQ diff..... 73 66

mean eCDF diff........ 0.11853 0.018753
med eCDF diff........ 0.12377 0.011905
max eCDF diff........ 0.20662 0.087302

var ratio (Tr/Co)..... 2.3763 1.3729
T-test p-value........ 7.8894e-07 0.011786
KS Bootstrap p-value.. < 2.22e-16 0.3
KS Naive p-value...... 0.00013379 0.29213
KS Statistic.......... 0.20662 0.087302


More details regarding each of the tests for differences in means or distributions can be found using ?MatchBalance. The results for both of the variables presented above indicate that the distributions are considerably closer
to each other in the matched sample than in the original dataset.

Monday, April 26, 2010

Example 7.34: Propensity scores and causal inference from observational studies

Propensity scores can be used to help make causal interpretation of observational data more plausible, by adjusting for other factors that may be responsible for differences between groups. Heuristically, we estimate the probability of exposure, rather than randomizing exposure, as we'd ideally prefer to do. The estimated probability of exposure is the propensity score. If our estimation of the propensity score incorporates the reasons why people self-select their exposure status, then two individuals with equal propensity scores are equally likely to be exposed, and we can interpret them as being randomly assigned to exposure. This process is not unlike ordinary regression adjustment for potential confounders, but it uses fewer degrees of freedom and can incorporate more variables.
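Before turning to the HELP data, here is a toy sketch (simulated data, with made-up variable names) of the idea: a confounder drives both exposure and outcome, and adjusting for the estimated propensity score removes the spurious association. The true exposure effect below is exactly zero.

```r
# Simulated data: the confounder z drives both exposure and outcome;
# the exposure itself has no effect on y.
set.seed(42)
n = 1000
z = rnorm(n)                              # confounder
exposed = rbinom(n, 1, plogis(z))         # probability of exposure depends on z
y = 2*z + rnorm(n)                        # outcome depends on z only
coef(lm(y ~ exposed))["exposed"]          # unadjusted: biased well away from 0
ps = fitted(glm(exposed ~ z, family=binomial))  # estimated propensity score
coef(lm(y ~ exposed + ps))["exposed"]     # adjusted: near 0
```

Since exposure depends on z only through the propensity score, adjusting for the score breaks the link between exposure and the confounder.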

As an example, we consider the HELP data used extensively for examples in our book. Does homelessness affect physical health, as measured by the PCS score from the SF-36?

First, we consider modeling this relationship directly. This analysis only answers the question of whether homelessness is associated with poorer physical health.

Then we create a propensity score by estimating a logistic regression to predict homelessness using age, gender, number of drinks, and mental health composite score. Finally, we include the propensity score in the model predicting PCS from homelessness. If we accept that these propensity predictors fully account for the probability of homelessness, and there is an association between homelessness and PCS in the model adjusting for propensity, and the directionality of the association flows from homelessness to PCS, then we can conclude that homelessness causes differences in PCS.

We note here that this conclusion relies on other untestable assumptions, including linearity in the relationship between the propensity and PCS. Many users of propensity scores prefer to fit models within strata of the propensity score, or to match on propensity score, rather than use the regression adjustment we present in this entry. In a future entry we'll demonstrate the use of matching.

In a departure from our usual practice, we show only pieces of the output below.


SAS

We begin by reading in the data and fitting the model. This is effectively a t-test (section 2.4.1), but we use proc glm to more easily compare with the adjusted results.


proc glm data="c:\book\help";
model pcs = homeless/solution;
run;

Standard
Parameter Estimate Error t Value Pr > |t|

Intercept 49.00082904 0.68801845 71.22 <.0001
HOMELESS -2.06404896 1.01292210 -2.04 0.0422


It would appear that homeless patients are in worse health than the others.

We next use proc logistic to estimate the propensity to homelessness, using the output statement to save the predicted probabilities. We omit the output here; it could be excluded in practice using the ODS exclude all statement.


proc logistic data="c:\book\help" desc;
model homeless = age female i1 mcs;
output out=propen pred=propensity;
run;


It's important to make sure that there is a reasonable amount of overlap in the propensity scores between the two groups. Otherwise, we'd be extrapolating outside the range of the data when we adjust.


proc means data=propen;
class homeless;
var propensity;
run;
N
HOMELESS Obs Mean Minimum Maximum
---------------------------------------------------------------
0 244 0.4296704 0.2136791 0.7876000

1 209 0.4983750 0.2635031 0.9642827
---------------------------------------------------------------


The mean propensity to homelessness is larger in the homeless group. If this were not the case, we might be concerned that the logistic model is too poor a predictor of homelessness to generate an effective propensity score. However, the maximum propensity among the homeless is 20% larger than the largest propensity in the non-homeless group. This suggests that a further review of the propensities would be wise. To check them, we'll generate histograms for each group using proc univariate (section 5.1.4).


proc univariate data=propen;
class homeless;
var propensity;
histogram propensity;
run;




The resulting histograms suggest some risk of extrapolation. In our model, we'll remove subjects with propensities greater than 0.8.


proc glm data=propen;
where propensity lt .8;
model pcs = homeless propensity/solution;
run;
Standard
Parameter Estimate Error t Value Pr > |t|

Intercept 54.19944991 1.98264608 27.34 <.0001
HOMELESS -1.19612942 1.03892589 -1.15 0.2502
propensity -12.09909082 4.33385405 -2.79 0.0055


After the adjustment, we see a much smaller difference in the physical health of homeless and non-homeless subjects, and find no significant evidence of an association.



R


ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
attach(ds)
summary(lm(pcs ~ homeless))

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 49.001 0.688 71.220 <2e-16 ***
homeless -2.064 1.013 -2.038 0.0422 *


We use the glm() function to fit the logistic
regression model (section 4.1.1). A formula object is used to specify the model. The predicted probabilities that we'll use as propensity scores are in the fitted element of the output object.


form = formula(homeless ~ age + female + i1 + mcs)
glm1 = glm(form, family=binomial)
X = glm1$fitted


As in the SAS development, we check the resulting values. Here we use the fivenum() function (section 2.1.4) with the tapply() function (section 2.1.2) to get the results for each level of homelessness.


> tapply(X,homeless, FUN=fivenum)
$`0`
398 97 378 69 438
0.2136787 0.3464170 0.4040223 0.4984242 0.7876015

$`1`
16 18 262 293 286
0.2635026 0.4002759 0.4739152 0.5768015 0.9642833


Finding the same troubling evidence of non-overlap, we draw a histogram for each group. We do this manually, setting up two output areas with the par() function (section 5.3.6) and conditioning on each homeless value in two calls to the hist() function (section 5.1.4).


par(mfrow=c(2,1))
hist(X[homeless==0], xlim=c(0.2,1))
hist(X[homeless==1], xlim=c(0.2,1))




As noted above, we'll exclude subjects with propensity greater than 0.8. This is done with the subset option to the lm() function (as in section 3.7.4).


summary(lm(pcs ~ homeless + X,subset=(X < .8)))

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.199 1.983 27.337 < 2e-16 ***
homeless -1.196 1.039 -1.151 0.25023
X -12.099 4.334 -2.792 0.00547 **