SAS and R: Example 7.36: Propensity score stratification

Monday, May 10, 2010

Example 7.36: Propensity score stratification

In examples 7.34 and 7.35 we described methods using propensity scores to account for possible confounding factors in an observational study.

In addition to adjusting for the propensity score in a multiple regression and matching on the propensity score, researchers will often stratify by the propensity score, and carry out analyses within each group defined by these scores. With sufficient sample size, we might use 5 or 10 strata, but in this example, we'll use 4.

R

To facilitate running the code from this entry, we include all of the previous code snippets (the code is also posted on the book website as propensity.R.


ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
attach(ds)
summary(lm(pcs ~ homeless))

form = formula(homeless ~ age + female + i1 + mcs)
glm1 = glm(form, family=binomial)
X = glm1$fitted
tapply(X, homeless, FUN=fivenum)
par(mfrow=c(2,1))
hist(X[homeless==0], xlim=c(0.2,1))
hist(X[homeless==1], xlim=c(0.2,1))
par(mfrow=c(1,1))
summary(lm(pcs ~ homeless + X, subset=(X<.8)))
library(Matching)
rr = Match(Y=pcs, Tr=homeless, X=X, M=1)
summary(rr)
MatchBalance(form, match.out=rr, nboots=10)

We will use the propensity scores (stored in the variable X) as a way of stratifying the dataset into four approximately equally sized groups. Unlike our earlier approach, we will include all subjects (even though with very high propensities) and note that as a result we will observe animbalance in the distribution of homelessness by strata.


breakvals = fivenum(X)   # section 2.1.4
strat = cut(X, breaks=breakvals, 
   labels=c('bot quart', '2nd quart', '3rd quart', 'top quart'), 
   include.lowest=TRUE)

Next we set some parameters needed to tweak the graphic that will be created.


topval = 82
botval = 15
cexval=.6
eps = 3

Next we create a set of boxplots for the PCS scores of the homeless and non-homeless in each of the four strata plus overall. In addition, we annotate the plot with the labels as well as the sample size in each group.


boxplot(pcs[homeless==0 & strat=='bot quart'], 
   pcs[homeless==1 & strat=='bot quart'],
   pcs[homeless==0 & strat=='2nd quart'],
   pcs[homeless==1 & strat=='2nd quart'],
   pcs[homeless==0 & strat=='3rd quart'],
   pcs[homeless==1 & strat=='3rd quart'],
   pcs[homeless==0 & strat=='top quart'], 
   pcs[homeless==1 & strat=='top quart'], 
   pcs[homeless==0], pcs[homeless==1], 
   ylim=c(botval,topval), xaxt="n", ylab="PCS score")
abline(v=2.5)
abline(v=4.5)
abline(v=6.5)
abline(v=8.5, lwd=2)
text(1, topval, "not\nhomeless", cex=cexval)
text(2, topval, "homeless", cex=cexval)
text(3, topval, "not\nhomeless", cex=cexval)
text(4, topval, "homeless", cex=cexval)
text(5, topval, "not\nhomeless", cex=cexval)
text(6, topval, "homeless", cex=cexval)
text(7, topval, "not\nhomeless", cex=cexval)
text(8, topval, "homeless", cex=cexval)
text(9, topval, "not\nhomeless", cex=cexval)
text(10, topval, "homeless", cex=cexval)

text(1.5, botval+eps, "bot quart")
text(3.5, botval+eps, "2nd quart")
text(5.5, botval+eps, "3rd quart")
text(7.5, botval+eps, "top quart")
text(9.5, botval+eps, "overall")

text(1, topval-eps, paste("n=",sum(homeless==0 & strat=='bot quart', na.rm=TRUE),sep=""), cex=cexval)
text(2, topval-eps, paste("n=",sum(homeless==1 & strat=='bot quart', na.rm=TRUE),sep=""), cex=cexval)
text(3, topval-eps, paste("n=",sum(homeless==0 & strat=='2nd quart', na.rm=TRUE),sep=""), cex=cexval)
text(4, topval-eps, paste("n=",sum(homeless==1 & strat=='2nd quart', na.rm=TRUE),sep=""), cex=cexval)
text(5, topval-eps, paste("n=",sum(homeless==0 & strat=='3rd quart', na.rm=TRUE),sep=""), cex=cexval)
text(6, topval-eps, paste("n=",sum(homeless==1 & strat=='3rd quart', na.rm=TRUE),sep=""), cex=cexval)
text(7, topval-eps, paste("n=",sum(homeless==0 & strat=='top quart', na.rm=TRUE),sep=""), cex=cexval)
text(8, topval-eps, paste("n=",sum(homeless==1 & strat=='top quart', na.rm=TRUE),sep=""), cex=cexval)
text(9, topval-eps, paste("n=",sum(homeless==0, na.rm=TRUE),sep=""), cex=cexval)
text(10, topval-eps, paste("n=",sum(homeless==1, na.rm=TRUE),sep=""), cex=cexval)

We note that the difference between the median PCS scores is smaller within each of the quartiles of the propensity scores than the difference between medians overall. In addition, there is some indication that the difference increases as a function of the propensity score. Finally, note that the proportion of subjects in the two groups veries wildly by strata, swinging from 1/3 to 2/3 being homeless.

We can fit a t-test within each stratum, to assess whether there are statistically significant differences between the PCS scores for the homeless and non-homeless subjects. Here we use the by() operator (which takes three arguments: a dataset, a variable to use to specify groups, and a function to be called with the subset defined for that level) combined with the with() function to carry out the test for the appropriate subset, returning the p-value each time. This requires creation of a dataframe with the appropriate quantities.


> stratdf = data.frame(pcs, homeless, strat)
> out = by(stratdf, strat, 
   function(mydataframe) {with(mydataframe, 
      t.test(pcs[homeless==0], pcs[homeless==1])) 
   })
> out
strat: bot quart

 Welch Two Sample t-test

data:  pcs[homeless == 0] and pcs[homeless == 1] 
t = 0.806, df = 58.564, p-value = 0.4235
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -2.65544  6.23678 
sample estimates:
mean of x mean of y 
 49.88714  48.09647 

------------------------------------------------------------------------------------------- 
strat: 2nd quart

 Welch Two Sample t-test

data:  pcs[homeless == 0] and pcs[homeless == 1] 
t = -0.1011, df = 101.083, p-value = 0.9197
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -4.212064  3.803686 
sample estimates:
mean of x mean of y 
 49.37089  49.57508 

------------------------------------------------------------------------------------------- 
strat: 3rd quart

 Welch Two Sample t-test

data:  pcs[homeless == 0] and pcs[homeless == 1] 
t = 0.823, df = 110.918, p-value = 0.4123
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -2.357005  5.705745 
sample estimates:
mean of x mean of y 
 49.44396  47.76959 

------------------------------------------------------------------------------------------- 
strat: top quart

 Welch Two Sample t-test

data:  pcs[homeless == 0] and pcs[homeless == 1] 
t = 0.9222, df = 74.798, p-value = 0.3594
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -2.304285  6.276212 
sample estimates:
mean of x mean of y 
 46.07584  44.08988

The model results show that while the mean PCS is slightly lower among homeless subjects in 3 of the 4 propensity strata, the standard error of the difference is so large that there's little credible evidence for a health cost ascribable to homelessness.

SAS

We begin by getting the propensity scores.


proc logistic data="c:\book\help" desc;
  model homeless = age female i1 mcs;
  output out=propen pred=propensity;
run;

Next, we find the quartiles of the propensity scores using proc means (section 2.1.1) saving the results in a new data set. The new data set has one observation with the three quartiles.


proc means q1 median q3 data=propen;
  var propensity;
  output out = propenquart q1=q1 median=median q3=q3;
run;

We want to generate a categorical variable to indicate quartile membership. This requires having the quartile values from the propenquart data set available for each row of the data set with the propensities. To do this, we'll use the point option to the set statement. See this note for more details on this approach.


data propen2;
set propen;
  setme = 1;
  set propenquart point=setme;
  if propensity lt q1 then propensity_quartile = "1st quart"; 
  else if propensity lt median then propensity_quartile = "2nd quart"; 
  else if propensity lt q3 then propensity_quartile = "3rd quart"; 
  else if propensity lt 0.8 then propensity_quartile = "4th quart"; 
  else delete;
  if homeless = 1 then homeless_status = "Yes";
  else if homeless = 0 then homeless_status = "No";
run;

Next, we'll use proc boxplot (section 5.1.7) to make a plot similar to that made in R. The SAS plot does not show the boxes ignoring strata-- these would be quite a hassle, most likely. The boxplot procedure is finicky, and seems to work best for parallel boxplots when the group variables are character variables, and sorting is often necessary.

 
proc sort data=propen2; by propensity_quartile homeless_status; run;

proc boxplot data=propen2; 
  plot pcs * homeless_status (propensity_quartile);
  insetgroup n / pos=bottom;
run;

Conclusions are noted above.

Finally, we perform the analysis comparing PCS in the homeless and non-homeless groups within each strata, using the ODS system to reduce output.


ods select none;
ods output parameterestimates=propstrata;
proc glm data=propen2;
  by propensity_quartile;
  model pcs=homeless;
run;
ods select all;

Here are the final results. Small differences from the R results are due to the fact that the defaults for specification of quartiles differ between the two packages (note that for R, there are 79 subjects in the non-homeless bottom quartile, while SAS included 78 non-homeless subjects in that group).


proc print data=propstrata;
  where parameter="HOMELESS";
run;

       propensity_
Obs     quartile          Estimate          StdErr     Probt

 2      1st quart      -2.00721337      2.10622538    0.3427
 4      2nd quart       0.44727354      2.11652411    0.8330
 6      3rd quart      -1.55746908      2.03369752    0.4454
 8      4th quart      -2.00910422      2.06426071    0.3325

6 comments:

Anonymous said...: Would it be possible to have a significant p-value in only 1 out of the 4 quartiles? If so, what would that mean?; May 29, 2012 at 5:53 PM
Anonymous said...: Definitely possible. Before interpreting, I would do a joint test across strata to see whether the effect really differs between strata. If so, then I would say that, for example, there appears to be an effect, but only among those people most likely to have the exposure. IOW, if you're really likely to have the exposure, having the exposure is associated with the outcome. It's a form of interaction.; May 30, 2012 at 9:32 AM
Anonymous said...: Thanks for the explanation! Would you have any examples in SAS or R to perform the joint test? It seems another issue with this test is choosing the right covariates...?; June 1, 2012 at 2:33 PM
Ken Kleinman said...: Just fit an interaction between strata and exposure-- the joint test against the null of equal effects in all strata would be from /type3 in SAS or aov() in R.; June 2, 2012 at 11:16 AM
Anonymous said...: Thanks for explanation!

It is possible to estimate separate regression models for each stratum?
to improve the precision of estimates? Anyone have some references?; August 29, 2013 at 10:38 AM
Anonymous said...: Is it possible to use both propensity score information and post-treatment predictors in an outcome model? I haven't seen an example of how to do this in SAS or R.

In my data I have, among other variables:

migrated (binary outcome- yes or no)
military service (binary treatment- yes or no)
age (predictive of outcome and treatment- continuous)
nativity (predictive of outcome and treatment- categorical)
postwar_occupation (occurs after treatment- categorical)

Can anyone provide some guidance on how to build the required models to best account for selection into the treatment group and for the effect of postwar occupation?

Thank you.; January 24, 2018 at 5:51 PM

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

SAS and R

Catalogs of posts

Monday, May 10, 2010

Example 7.36: Propensity score stratification

6 comments:

About SAS and R

Topics discussed