Monday, May 31, 2010

Example 7.39: Nelson-Aalen estimate of cumulative hazard

In our previous example, we demonstrated how to calculate the Kaplan-Meier estimate of the survival function for time to event data.

A related quantity is the Nelson-Aalen estimate of the cumulative hazard. In addition to summarizing the hazard accrued by a particular timepoint, this quantity has been used in models for missing data (see White and Royston, 2009).
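
Recall that the Nelson-Aalen estimator simply accumulates, over the observed event times, the ratio of the number of events to the number at risk. In the usual notation, with $d_i$ events among the $n_i$ subjects at risk at event time $t_i$,

$$\hat{H}(t) = \sum_{t_i \le t} \frac{d_i}{n_i}.$$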

In addition to the usual SAS and R approaches to this, we also show Stata code.


SAS

It's very straightforward to calculate the Nelson-Aalen estimate in SAS. We assume the data are in the ds data set generated in example 7.38.


ods output productlimitestimates=naout;
proc lifetest data=ds nelson;
  time time*event(0);
run;

options formchar="|----|+|---+=|-/\<>*";
options ls=64;
proc print data=naout; var time cumhaz; run;


In the foregoing, the formchar option makes SAS print its output using characters that can easily be cut and pasted into other software, such as Blogger's text entry window, as we described in an earlier entry.


Obs time CumHaz
1 0.0000 0
2 0.5000 .
3 1.0000 .
4 1.0000 0.1111
5 2.0000 0.1736
6 2.0000 .
7 3.0000 0.2450
8 4.0000 0.3220
9 5.0000 .
10 6.0000 0.4129
11 7.0000 .
12 8.0000 0.5240
13 9.0000 0.6490
14 10.0000 .
15 12.0000 0.8156
16 14.0000 1.0156
17 14.0000 .
18 17.0000 .
19 20.0000 1.5156
20 21.0000 .


Note that SAS shows the Nelson-Aalen estimator once per time point, and only when there is a failure at that time. We may need the estimated value at every observed time point in our data, as well as to use the estimate alongside our original data. This requires some additional work. First, we sort our data and the new naout data set by time, removing the rows from the above printout where the cumulative hazard is missing, using the where data set option (section A.6.3). Then we merge the two data sets.


proc sort data=ds; by time; run;
proc sort data=naout (where=(cumhaz ne .)); by time; run;

data t2;
merge ds (in=keepme) naout (keep=time cumhaz);
  * the keep option drops unneeded variables;
by time;
if keepme;
  * drop any extra rows contributed by the naout data set;
retain cumhazlast 0;
  * remember the most recent value of cumhaz;
if cumhaz eq . then cumhaz = cumhazlast;
  * plug in the most recent value when cumhaz is missing;
cumhazlast = cumhaz;
  * the current value of cumhaz is now the most recent;
run;

proc print data=t2;
var time event cumhaz;
run;


This generates the desired result:


Obs time event CumHaz
1 0.5000 0 0
2 1.0000 1 0.1111
3 1.0000 1 0.1111
4 2.0000 1 0.1736
5 2.0000 0 0.1736
6 3.0000 1 0.2450
7 4.0000 1 0.3220
8 5.0000 0 0.3220
9 6.0000 1 0.4129
10 7.0000 0 0.4129
11 8.0000 1 0.5240
12 9.0000 1 0.6490
13 10.0000 0 0.6490
14 12.0000 1 0.8156
15 14.0000 0 1.0156
16 14.0000 1 1.0156
17 17.0000 0 1.0156
18 20.0000 1 1.5156
19 21.0000 0 1.5156


R

While the survfit() command can be used to create a table of the survival function and Nelson-Aalen estimates at each event time, it is slightly harder in R to associate these estimates with the original data. Here we craft a function that does this housekeeping, using the fact that the Nelson-Aalen estimate is just the negative log of the survival function (after specifying the type="aalen" option).

library(survival)   # provides survfit(), coxph(), and Surv()

calcna = function(time, event) {
  na.fit = survfit(coxph(Surv(time, event) ~ 1), type="aalen")
  jumps = c(0, na.fit$time, max(time))
  # need to be careful at the beginning and end
  surv = c(1, na.fit$surv, na.fit$surv[length(na.fit$surv)])

  # apply the appropriate transformation
  neglogsurv = -log(surv)

  # create a placeholder of the correct length
  naest = numeric(length(time))
  for (i in 2:length(jumps)) {
    naest[which(time >= jumps[i-1] & time <= jumps[i])] =
      neglogsurv[i-1]   # snag the appropriate value
  }
  return(naest)
}


This can be used as follows, where we use the time and event objects created in example 7.38.


newna = calcna(time, event)
cbind(time, newna)

time newna
[1,] 0.5 0.0000000
[2,] 1.0 0.1111111
[3,] 1.0 0.1111111
[4,] 2.0 0.1736111
[5,] 2.0 0.1736111
[6,] 3.0 0.2450397
[7,] 4.0 0.3219628
[8,] 5.0 0.3219628
[9,] 6.0 0.4128719
[10,] 7.0 0.4128719
[11,] 8.0 0.5239830
[12,] 9.0 0.6489830
[13,] 10.0 0.6489830
[14,] 12.0 0.8156496
[15,] 14.0 1.0156496
[16,] 14.0 1.0156496
[17,] 17.0 1.0156496
[18,] 20.0 1.5156496
[19,] 21.0 1.5156496
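
As a check on the function, note that at each event time the Nelson-Aalen estimate is just the running sum of (number of events)/(number at risk). The following sketch (our addition, assuming the time and event vectors from example 7.38 and the survival package loaded above) compares this running sum with the negative log of the type="aalen" survival estimates; with this small data set the two columns agree, though in general the handling of tied event times in coxph() can introduce small differences.

na.fit = survfit(coxph(Surv(time, event) ~ 1), type="aalen")
cbind(time = na.fit$time,
  byhand = cumsum(na.fit$n.event / na.fit$n.risk),
  neglogsurv = -log(na.fit$surv))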


We also describe how to calculate the Nelson-Aalen estimate in Stata. First we create a dataset in the appropriate format. The foreign library in R (sections 1.1.5, 1.2.2) can do this directly. In SAS we would need to export to a format that Stata can read.


library(foreign)
write.dta(data.frame(time=time, event=event), "forstata.dta")


Stata

We can then read in the dataset from R:

. use forstata
(Written by R. )

. stset time, failure(event)

failure event: event != 0 & event < .
obs. time interval: (0, time]
exit on or before: failure

------------------------------------------------------------------------------
19 total obs.
0 exclusions
------------------------------------------------------------------------------
19 obs. remaining, representing
11 failures in single record/single failure data
156.5 total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 21
. sts list, cumhaz

failure _d: event
analysis time _t: time
Beg. Net Nelson-Aalen Std.
Time Total Fail Lost Cum. Haz. Error [95% Conf. Int.]
-------------------------------------------------------------------------------
.5 19 0 1 0.0000 0.0000 . .
1 18 2 0 0.1111 0.0786 0.0278 0.4443
2 16 1 1 0.1736 0.1004 0.0559 0.5393
3 14 1 0 0.2450 0.1232 0.0915 0.6565
4 13 1 0 0.3220 0.1453 0.1330 0.7795
5 12 0 1 0.3220 0.1453 0.1330 0.7795
6 11 1 0 0.4129 0.1714 0.1830 0.9313
7 10 0 1 0.4129 0.1714 0.1830 0.9313
8 9 1 0 0.5240 0.2042 0.2441 1.1248
9 8 1 0 0.6490 0.2394 0.3149 1.3375
10 7 0 1 0.6490 0.2394 0.3149 1.3375
12 6 1 0 0.8156 0.2917 0.4046 1.6442
14 5 1 1 1.0156 0.3537 0.5132 2.0099
17 3 0 1 1.0156 0.3537 0.5132 2.0099
20 2 1 0 1.5156 0.6125 0.6865 3.3463
21 1 0 1 1.5156 0.6125 0.6865 3.3463
-------------------------------------------------------------------------------

Monday, May 24, 2010

Example 7.38: Kaplan-Meier survival estimates

In example 7.30 we demonstrated how to simulate data from a Cox proportional hazards model.

In this and the next few entries, we expand upon support in R and SAS for survival (time-to-event) models. We'll start with a small, artificial dataset of 19 subjects. Each subject contributes a pair of variables: the time and an indicator of whether the time is when the event occurred (event=TRUE) or when the subject was censored (event=FALSE).

time  event
0.5   FALSE
1     TRUE
1     TRUE
2     TRUE
2     FALSE
3     TRUE
4     TRUE
5     FALSE
6     TRUE
7     FALSE
8     TRUE
9     TRUE
10    FALSE
12    TRUE
14    FALSE
14    TRUE
17    FALSE
20    TRUE
21    FALSE


Until an instant before time=1, no events were observed (only the censored observation), so the survival estimate is 1. At time=1, 2 of the 18 subjects still at risk experienced the event, so the survival function S(.) at time 1 is S(1) = 16/18 = 0.8889. The next failure occurs at time=2, with 16 still at risk, so S(2) = 15/16 * 16/18 = 0.8333. Note that in addition to the event at time=2, a subject is censored then, so the number at risk at time=3 is 14 (and S(3) = 13/14 * 15/16 * 16/18 = 0.7738). The calculations continue until the final event is observed.
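
The arithmetic in the previous paragraph is easy to verify directly; this is just a sketch of the first three steps of the product-limit calculation, not part of the original example.

s1 = 16/18        # two events among 18 at risk at time 1
s2 = s1 * 15/16   # one event among 16 at risk at time 2
s3 = s2 * 13/14   # one event among 14 at risk at time 3
round(c(s1, s2, s3), 4)   # 0.8889 0.8333 0.7738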


R

In R, we use the survfit() function (section 5.1.19) within the survival library to calculate the survival function across time.

library(survival)
time =  c(0.5, 1,1,2,2,3,4,5,6,7,8,9,10,12,14,14,17,20, 21)
event = c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, 
  TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, 
  TRUE, FALSE)
ds = data.frame(time, event)
fit = survfit(Surv(time, event) ~ 1, data=ds)


The returned survival object includes a number of attributes, such as the survival estimates at each timepoint, the standard error of those estimates, and the number of subjects at risk.

> names(fit)
 [1] "n"         "time"      "n.risk"    "n.event"   "n.censor"  
     "surv"      "type"      "std.err"  
 [9] "upper"     "lower"     "conf.type" "conf.int"  "call" 
> summary(fit)
Call: survfit(formula = Surv(time, event) ~ 1, data = ds)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1     18       2    0.889  0.0741       0.7549        1.000
    2     16       1    0.833  0.0878       0.6778        1.000
    3     14       1    0.774  0.0997       0.6011        0.996
    4     13       1    0.714  0.1084       0.5306        0.962
    6     11       1    0.649  0.1164       0.4570        0.923
    8      9       1    0.577  0.1238       0.3791        0.879
    9      8       1    0.505  0.1276       0.3078        0.829
   12      6       1    0.421  0.1312       0.2285        0.775
   14      5       1    0.337  0.1292       0.1587        0.714
   20      2       1    0.168  0.1354       0.0348        0.815
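
Since these attributes are stored as plain vectors in the fit object, we can also tabulate them directly rather than relying on summary(); a minimal sketch (our addition):

cbind(time=fit$time, n.risk=fit$n.risk, n.event=fit$n.event, surv=fit$surv)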



SAS

We can read in the artificial data using an input statement (section 1.1.8).

data ds;
input time event;
cards;
0.5   0
1     1
1     1
2     1
2     0
3     1
4     1
5     0
6     1
7     0
8     1
9     1
10    0
12    1
14    0
14    1
17    0
20    1
21    0
run;




proc lifetest data=ds;
  time time*event(0);
run;


Here we indicate that observations with event equal to 0 are censored. If instead we had a censoring indicator coded in reverse (1 = censored), the second line might read time time*censored(1);.

The survival function can be estimated in proc lifetest (as shown in section 5.1.19). In a break from our usual practice, we'll include all of the output generated by proc lifetest.

The LIFETEST Procedure
                   Product-Limit Survival Estimates
                                     Survival
                                     Standard     Number      Number 
    time     Survival    Failure      Error       Failed       Left  
  0.0000       1.0000           0           0        0          19   
  0.5000*           .           .           .        0          18   
  1.0000            .           .           .        1          17   
  1.0000       0.8889      0.1111      0.0741        2          16   
  2.0000       0.8333      0.1667      0.0878        3          15   
  2.0000*           .           .           .        3          14   
  3.0000       0.7738      0.2262      0.0997        4          13   
  4.0000       0.7143      0.2857      0.1084        5          12   
  5.0000*           .           .           .        5          11   
  6.0000       0.6494      0.3506      0.1164        6          10   
  7.0000*           .           .           .        6           9   
  8.0000       0.5772      0.4228      0.1238        7           8   
  9.0000       0.5051      0.4949      0.1276        8           7   
 10.0000*           .           .           .        8           6   
 12.0000       0.4209      0.5791      0.1312        9           5   
 14.0000       0.3367      0.6633      0.1292       10           4   
 14.0000*           .           .           .       10           3   
 17.0000*           .           .           .       10           2   
 20.0000       0.1684      0.8316      0.1354       11           1   
 21.0000*           .           .           .       11           0   

NOTE: The marked survival times are censored observations.

Summary Statistics for Time Variable time

             Quartile Estimates
             Point     95% Confidence Interval
Percent    Estimate      [Lower      Upper)
     75     20.0000     12.0000       .    
     50     12.0000      6.0000     20.0000
     25      4.0000      1.0000     12.0000

    Mean    Standard Error
 11.1776            1.9241

NOTE: The mean survival time and its standard error were underestimated because 
      the largest observation was censored and the estimation was restricted to 
      the largest event time.

Summary of the Number of Censored and Uncensored Values
                                 Percent
   Total  Failed    Censored    Censored
      19      11           8       42.11

Monday, May 17, 2010

Example 7.37: calculation of Hotelling's T^2

Hotelling's T^2 is a multivariate statistic used to compare two groups, where multiple outcomes are observed for each subject.
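
For two independent samples of sizes $n_1$ and $n_2$ on $k$ outcomes, with sample mean vectors $\bar{y}_1$ and $\bar{y}_2$ and pooled covariance matrix $S_p$, the statistic and its F transformation (as implemented in the function below) are

$$T^2 = \frac{n_1 n_2}{n_1 + n_2}\,(\bar{y}_1 - \bar{y}_2)^{\top} S_p^{-1} (\bar{y}_1 - \bar{y}_2), \qquad F = \frac{n_1 + n_2 - k - 1}{k\,(n_1 + n_2 - 2)}\,T^2,$$

where F has an $F_{k,\,n_1+n_2-k-1}$ distribution under the null hypothesis of equal mean vectors.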

Here we demonstrate how to calculate Hotelling's T^2 using R and SAS, test the code using a simulation study, and then apply it to data from the HELP study.


R

We utilize an approach suggested by Peter Mandeville in a posting to the R-help mailing list.



hotelling = function(y1, y2) {
  # check that the dimensions are compatible
  k = ncol(y1)
  k2 = ncol(y2)
  if (k2 != k)
    stop("input matrices must have the same number of columns: y1 has ",
      k, " while y2 has ", k2)

  # calculate sample sizes and observed means
  n1 = nrow(y1)
  n2 = nrow(y2)
  ybar1 = apply(y1, 2, mean); ybar2 = apply(y2, 2, mean)
  diffbar = ybar1 - ybar2

  # calculate the pooled variance-covariance matrix
  v = ((n1-1)*var(y1) + (n2-1)*var(y2))/(n1+n2-2)

  # calculate the test statistic and associated quantities
  t2 = n1*n2*diffbar %*% solve(v) %*% diffbar/(n1+n2)
  f = (n1+n2-k-1)*t2/((n1+n2-2)*k)
  pvalue = 1 - pf(f, k, n1+n2-k-1)

  # return the list of results
  return(list(pvalue=pvalue, f=f, t2=t2, diff=diffbar))
}


We begin by ensuring that the two input matrices have the same number of columns, then calculate quantities needed for the calculation. The function returns a list with a number of named objects.

To test the function, we ran a simulation study to ensure that under the null hypothesis the function rejected approximately 5% of the time.


# do a simple simulation study to ensure Type I error rate is preserved
library(MASS)
numoutcomes = 5
mu1 = rep(0, numoutcomes); mu2 = rep(0, numoutcomes)
rho = .4
# compound symmetry covariance matrix
Sigma = rho*matrix(rep(1, numoutcomes^2), numoutcomes) +
  diag(rep(1-rho, numoutcomes))
numsim = 10000
reject = numeric(numsim)
for (i in 1:numsim) {
  grp1 = mvrnorm(100, mu1, Sigma)
  grp2 = mvrnorm(150, mu2, Sigma)
  reject[i] = hotelling(grp1, grp2)$pvalue < 0.05
}


This yields an appropriate Type I error rate (just under 5%).


> table(reject)
reject
   0    1 
9537  463 
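
As a follow-up check (our addition), an exact binomial test confirms that 463 rejections in 10,000 simulations is consistent with a true rejection rate of 0.05.

binom.test(sum(reject), numsim, p=0.05)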


We can also undertake a comparison of non-homeless and homeless subjects in the HELP dataset, comparing individual observations on the PCS (physical component score) and MCS (mental component score) subscales of the SF-36 and on the CESD measure of depressive symptoms.


ds = read.csv("http://www.math.smith.edu/sasr/datasets/help.csv")
smallds = with(ds, cbind(pcs, mcs, cesd))
grp1 = subset(smallds, ds$homeless==0)
grp2 = subset(smallds, ds$homeless==1)
res = hotelling(grp1, grp2)


This saves the results in the object res, which we can display.


> names(res)
[1] "pvalue" "f" "t2" "diff"
> res
$pvalue
          [,1]
[1,] 0.1082217

$f
         [,1]
[1,] 2.035024

$t2
         [,1]
[1,] 6.132267

$diff
      pcs       mcs      cesd 
 2.064049  1.755975 -2.183760 


While there are differences in the mean PCS, MCS, and CESD scores between the homeless and non-homeless subjects, we do not have sufficient evidence to reject the null hypothesis that the two populations have equal mean vectors (p=0.11).

SAS

In SAS, we can easily make the two-group multivariate comparison using the MANOVA statement in proc glm. This hides the actual value of the statistic, but generates the desired p-value. As usual, we use the ODS system to reduce SAS output.


ods select multstat;
proc glm data="c:\book\help";
class homeless;
model pcs mcs cesd = homeless;
manova h=homeless;
run;
quit;


SAS prints four valid test statistics; for a single-degree-of-freedom hypothesis such as this two-group comparison (S=1), all four criteria yield identical exact F tests.


MANOVA Test Criteria and Exact F Statistics for
the Hypothesis of No Overall HOMELESS Effect
H = Type III SSCP Matrix for HOMELESS
E = Error SSCP Matrix

S=1 M=0.5 N=223.5

Statistic Value F Value Num DF Den DF Pr > F

Wilks' Lambda 0.98658536 2.04 3 449 0.1082
Pillai's Trace 0.01341464 2.04 3 449 0.1082
Hotelling-Lawley Trace 0.01359704 2.04 3 449 0.1082
Roy's Greatest Root 0.01359704 2.04 3 449 0.1082


Here we also demonstrate using SAS proc iml (section 1.9) to calculate the statistic and perform the test from scratch. This roughly parallels the R development above, though we omit the check that the two input matrices have the same number of columns.


*get the help data;
data help;
set "c:\book\help";
run;

proc iml;
* build a function;
start hotelling;
n1 = nrow(h1);
n0 = nrow(h0);
onesum = j(n1,1,1); *make an n1*1 matrix of 1s;
zerosum = j(n0,1,1);
meanh1 = h1`*onesum/n1;
meanh0 = h0`*zerosum/n0;
i1 = i(n1); * make identity matrix;
i0 = i(n0);
vc1 = h1`*(i1-onesum*onesum`/n1)*h1/(n1-1.0);
* make variance-covariance matrix;
vc0 = h0`*(i0-zerosum*zerosum`/n0)*h0/(n0-1.0);
pooledvc = ((n1-1.0)*vc1+(n0-1.0)*vc0)/(n1+n0-2.0);
meandiff = meanh1 - meanh0;
t2 = meandiff`* inv(pooledvc) *meandiff * (n1 * n0)/(n1 + n0);
nvars = ncol(h1);
f=((n1+n0-nvars-1)/(nvars*(n1+n0-2)))*t2;
df2=n1+n0-nvars-1;
p=1-probf(f,nvars,df2);
fcrit = finv(.95, nvars, df2);
print meandiff pooledvc t2 f nvars df2 p fcrit;
finish;

*use the function with the HELP data;
use help;
read all var {cesd pcs mcs} where (homeless = 1) into h1;
read all var {cesd pcs mcs} where (homeless = 0) into h0;
run hotelling;
quit;


Here's the output:


 meandiff   pooledvc                           t2          f  nvars

2.1837595  155.76862  -38.46634  -108.8555   6.1322669  2.0350243      3
-2.064049  -38.46634  115.50213  14.423897
-1.755975  -108.8555  14.423897  164.44444


      df2          p      fcrit

      449  0.1082217  2.6247689

Monday, May 10, 2010

Example 7.36: Propensity score stratification

In examples 7.34 and 7.35 we described methods using propensity scores to account for possible confounding factors in an observational study.

In addition to adjusting for the propensity score in a multiple regression and matching on the propensity score, researchers will often stratify by the propensity score, and carry out analyses within each group defined by these scores. With sufficient sample size, we might use 5 or 10 strata, but in this example, we'll use 4.


R

To facilitate running the code from this entry, we include all of the previous code snippets (the code is also posted on the book website as propensity.R).


ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
attach(ds)
summary(lm(pcs ~ homeless))

form = formula(homeless ~ age + female + i1 + mcs)
glm1 = glm(form, family=binomial)
X = glm1$fitted
tapply(X, homeless, FUN=fivenum)
par(mfrow=c(2,1))
hist(X[homeless==0], xlim=c(0.2,1))
hist(X[homeless==1], xlim=c(0.2,1))
par(mfrow=c(1,1))
summary(lm(pcs ~ homeless + X, subset=(X<.8)))
library(Matching)
rr = Match(Y=pcs, Tr=homeless, X=X, M=1)
summary(rr)
MatchBalance(form, match.out=rr, nboots=10)


We will use the propensity scores (stored in the variable X) as a way of stratifying the dataset into four approximately equally sized groups. Unlike our earlier approach, we will include all subjects (even those with very high propensities) and note that as a result we will observe an imbalance in the distribution of homelessness across the strata.



breakvals = fivenum(X)   # section 2.1.4
strat = cut(X, breaks=breakvals,
  labels=c('bot quart', '2nd quart', '3rd quart', 'top quart'),
  include.lowest=TRUE)
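
To see the imbalance mentioned above, we can cross-tabulate the strata against homeless status (a quick check of our own, using the base table() function):

table(strat, homeless)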


Next we set some parameters needed to tweak the graphic that will be created.


topval = 82
botval = 15
cexval=.6
eps = 3


Next we create a set of boxplots for the PCS scores of the homeless and non-homeless in each of the four strata plus overall. In addition, we annotate the plot with the labels as well as the sample size in each group.


boxplot(pcs[homeless==0 & strat=='bot quart'],
  pcs[homeless==1 & strat=='bot quart'],
  pcs[homeless==0 & strat=='2nd quart'],
  pcs[homeless==1 & strat=='2nd quart'],
  pcs[homeless==0 & strat=='3rd quart'],
  pcs[homeless==1 & strat=='3rd quart'],
  pcs[homeless==0 & strat=='top quart'],
  pcs[homeless==1 & strat=='top quart'],
  pcs[homeless==0], pcs[homeless==1],
  ylim=c(botval,topval), xaxt="n", ylab="PCS score")
abline(v=2.5)
abline(v=4.5)
abline(v=6.5)
abline(v=8.5, lwd=2)
text(1, topval, "not\nhomeless", cex=cexval)
text(2, topval, "homeless", cex=cexval)
text(3, topval, "not\nhomeless", cex=cexval)
text(4, topval, "homeless", cex=cexval)
text(5, topval, "not\nhomeless", cex=cexval)
text(6, topval, "homeless", cex=cexval)
text(7, topval, "not\nhomeless", cex=cexval)
text(8, topval, "homeless", cex=cexval)
text(9, topval, "not\nhomeless", cex=cexval)
text(10, topval, "homeless", cex=cexval)

text(1.5, botval+eps, "bot quart")
text(3.5, botval+eps, "2nd quart")
text(5.5, botval+eps, "3rd quart")
text(7.5, botval+eps, "top quart")
text(9.5, botval+eps, "overall")

text(1, topval-eps, cex=cexval,
  paste("n=", sum(homeless==0 & strat=='bot quart', na.rm=TRUE), sep=""))
text(2, topval-eps, cex=cexval,
  paste("n=", sum(homeless==1 & strat=='bot quart', na.rm=TRUE), sep=""))
text(3, topval-eps, cex=cexval,
  paste("n=", sum(homeless==0 & strat=='2nd quart', na.rm=TRUE), sep=""))
text(4, topval-eps, cex=cexval,
  paste("n=", sum(homeless==1 & strat=='2nd quart', na.rm=TRUE), sep=""))
text(5, topval-eps, cex=cexval,
  paste("n=", sum(homeless==0 & strat=='3rd quart', na.rm=TRUE), sep=""))
text(6, topval-eps, cex=cexval,
  paste("n=", sum(homeless==1 & strat=='3rd quart', na.rm=TRUE), sep=""))
text(7, topval-eps, cex=cexval,
  paste("n=", sum(homeless==0 & strat=='top quart', na.rm=TRUE), sep=""))
text(8, topval-eps, cex=cexval,
  paste("n=", sum(homeless==1 & strat=='top quart', na.rm=TRUE), sep=""))
text(9, topval-eps, cex=cexval,
  paste("n=", sum(homeless==0, na.rm=TRUE), sep=""))
text(10, topval-eps, cex=cexval,
  paste("n=", sum(homeless==1, na.rm=TRUE), sep=""))




We note that within each quartile of the propensity score, the difference between the median PCS scores is smaller than the difference between the overall medians. In addition, there is some indication that the difference increases as a function of the propensity score. Finally, note that the proportion of subjects in the two groups varies widely across the strata, swinging from 1/3 to 2/3 homeless.

We can run a t-test within each stratum to assess whether there are statistically significant differences between the PCS scores of the homeless and non-homeless subjects. Here we use the by() function (which takes three arguments: a data frame, a variable that defines the groups, and a function to be called on the subset corresponding to each level) combined with the with() function to carry out the test on the appropriate subset, returning the p-value each time. This requires creating a data frame with the appropriate quantities.


> stratdf = data.frame(pcs, homeless, strat)
> out = by(stratdf, strat,
function(mydataframe) {with(mydataframe,
t.test(pcs[homeless==0], pcs[homeless==1]))
})
> out
strat: bot quart

Welch Two Sample t-test

data: pcs[homeless == 0] and pcs[homeless == 1]
t = 0.806, df = 58.564, p-value = 0.4235
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.65544 6.23678
sample estimates:
mean of x mean of y
49.88714 48.09647

-------------------------------------------------------------------------------------------
strat: 2nd quart

Welch Two Sample t-test

data: pcs[homeless == 0] and pcs[homeless == 1]
t = -0.1011, df = 101.083, p-value = 0.9197
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.212064 3.803686
sample estimates:
mean of x mean of y
49.37089 49.57508

-------------------------------------------------------------------------------------------
strat: 3rd quart

Welch Two Sample t-test

data: pcs[homeless == 0] and pcs[homeless == 1]
t = 0.823, df = 110.918, p-value = 0.4123
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.357005 5.705745
sample estimates:
mean of x mean of y
49.44396 47.76959

-------------------------------------------------------------------------------------------
strat: top quart

Welch Two Sample t-test

data: pcs[homeless == 0] and pcs[homeless == 1]
t = 0.9222, df = 74.798, p-value = 0.3594
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.304285 6.276212
sample estimates:
mean of x mean of y
46.07584 44.08988



The results show that while the mean PCS is slightly lower among homeless subjects in 3 of the 4 propensity strata, the standard errors of the differences are so large that there is little credible evidence of a health cost ascribable to homelessness.



SAS

We begin by getting the propensity scores.


proc logistic data="c:\book\help" desc;
model homeless = age female i1 mcs;
output out=propen pred=propensity;
run;


Next, we find the quartiles of the propensity scores using proc means (section 2.1.1), saving the results in a new data set. The new data set has a single observation containing the three quartiles.


proc means q1 median q3 data=propen;
var propensity;
output out = propenquart q1=q1 median=median q3=q3;
run;


We want to generate a categorical variable to indicate quartile membership. This requires having the quartile values from the propenquart data set available for each row of the data set containing the propensities. To do this, we'll use the point option to the set statement, which here reads the single observation of propenquart into every row of the new data set. Note that, unlike the R code above, we also drop subjects with a propensity of 0.8 or more, as in example 7.35.


data propen2;
set propen;
setme = 1;
set propenquart point=setme;
if propensity lt q1 then propensity_quartile = "1st quart";
else if propensity lt median then propensity_quartile = "2nd quart";
else if propensity lt q3 then propensity_quartile = "3rd quart";
else if propensity lt 0.8 then propensity_quartile = "4th quart";
else delete;
if homeless = 1 then homeless_status = "Yes";
else if homeless = 0 then homeless_status = "No";
run;


Next, we'll use proc boxplot (section 5.1.7) to make a plot similar to the one made in R. The SAS plot does not show the overall boxes that ignore the strata; adding them would most likely be quite a hassle. The boxplot procedure is finicky: it seems to work best for parallel boxplots when the grouping variables are character variables, and sorting is often necessary.

 
proc sort data=propen2; by propensity_quartile homeless_status; run;

proc boxplot data=propen2;
plot pcs * homeless_status (propensity_quartile);
insetgroup n / pos=bottom;
run;





Conclusions are noted above.

Finally, we perform the analysis comparing PCS between the homeless and non-homeless groups within each stratum, using the ODS system to reduce the output.


ods select none;
ods output parameterestimates=propstrata;
proc glm data=propen2;
by propensity_quartile;
model pcs=homeless;
run;
ods select all;


Here are the final results. Small differences from the R results are due to the fact that the defaults for specification of quartiles differ between the two packages (note that for R, there are 79 subjects in the non-homeless bottom quartile, while SAS included 78 non-homeless subjects in that group).


proc print data=propstrata;
where parameter="HOMELESS";
run;

propensity_
Obs quartile Estimate StdErr Probt

2 1st quart -2.00721337 2.10622538 0.3427
4 2nd quart 0.44727354 2.11652411 0.8330
6 3rd quart -1.55746908 2.03369752 0.4454
8 4th quart -2.00910422 2.06426071 0.3325

Monday, May 3, 2010

Example 7.35: Propensity score matching

As discussed in example 7.34, it's sometimes preferable to match on propensity scores, rather than adjust for them as a covariate.

SAS

We use a suite of macros written by Jon Kosanke and Erik Bergstralh at the Mayo Clinic. The dist macro calculates the pairwise distances between observations, while the vmatch macro makes matches based on those distances, finding the set of matches that minimizes the overall distance. The latter macro uses the nobs macro. The original macros are available for download from the Mayo Clinic Division of Biomedical Statistics and Informatics. Slightly enhanced versions of the nobs and vmatch macros are available from the code section of the book website; the enhanced versions are used to generate the results below.

We previously created the propen data set containing the propensity score variable. In that entry we also noted a lack of overlap in the propensity distributions, and decided to drop observations with propensity > 0.8.

First we drop observations with propensity greater than 0.8 using the subsetting if statement (section 1.5.1). This leaves 201 homeless subjects. We then read in the macros using the %include statement (section 2.1.8). Then we run the dist macro, using its option to call the vmatch macro.


data prop2; set propen;
if propensity lt .8;
run;

%include "C:\ken\sasmacros\vmatch.sas";
%include "C:\ken\sasmacros\dist.sas";
%include "C:\ken\sasmacros\nobs.sas";

%dist(data=prop2, group=homeless, id=id, mvars=propensity,
wts=1, vmatch=Y, a=1, b=1, lilm=201, dmax=0.1,
outm=mp1_b, summatch=n, printm=N, mergeout=mpropen);


The macros are well documented by text included at the top of each macro, as is common with SAS macros that authors share. In the preceding code, the parameter values in the first line are relatively self-explanatory. The wts parameter allows multiple matching variables to be weighted differently, while the dmax parameter specifies the maximum distance acceptable for a match. We arbitrarily require that the propensities be within 0.1 to make a match (our results would differ if other criteria were specified for the matching, as we will see below in the R comparison). The remaining parameters request the matching, ask for one and only one match per case, ask for all the cases to be matched (if possible), suppress printed output, and name the data set to contain the output.

The output data set mpropen is identical to the input data set, with the addition of a new indicator variable, matched. We can compare the distribution of the covariates before and after the matching with the means procedure (section 2.1.1).


title "observed data";
proc means data=propen mean;
class homeless;
var age female i1 mcs;
run;

title "matched observations";
proc means data=mpropen mean;
where matched;
class homeless;
var age female i1 mcs;
run;

observed data
N
HOMELESS Obs Variable Mean
-----------------------------------------------
0 244 AGE 35.0409836
FEMALE 0.2745902
I1 13.5122951
MCS 32.4868303

1 209 AGE 36.3684211
FEMALE 0.1913876
I1 23.0382775
MCS 30.7308549
-----------------------------------------------

matched observations
N
HOMELESS Obs Variable Mean
-----------------------------------------------
0 201 AGE 35.6218905
FEMALE 0.1791045
I1 15.9154229
MCS 31.4815123

1 201 AGE 36.1492537
FEMALE 0.1990050
I1 19.9452736
MCS 30.9176772
-----------------------------------------------


We see that the covariates are much better balanced after matching. Also note the dramatic impact of removing the 8 (4%) homeless subjects with very large propensities: their exclusion reduced the mean number of drinks in the homeless group by about 13%.
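
A quick arithmetic check of that percentage (our addition), using the homeless-group means for i1 from the output above:

(23.0382775 - 19.9452736) / 23.0382775   # about 0.134, roughly a 13% reduction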

We can now perform the analysis on the matched cases. The two homeless-status groups now have nearly equal distributions of the estimated probability of homelessness.


proc glm data=mpropen;
where matched;
model pcs = homeless / solution;
run;

Standard
Parameter Estimate Error t Value Pr > |t|

Intercept 48.95273471 0.76199823 64.24 <.0001
HOMELESS -1.79386398 1.07762823 -1.66 0.0968


After matching, the effect of homelessness on physical health is attenuated and has a larger p-value.


R

In R, the Matching library provides tools for matching and analysis. The Match() function implements a variety of algorithms for multivariate matching, including propensity score, Mahalanobis, and inverse variance matching. It is intended to be used in conjunction with the MatchBalance() function, which determines the extent to which covariate balance has been achieved. Match() takes the propensity score as an argument, as well as the outcome to be compared and the group indicators.

A wide variety of matching options are supported, including matching with or without replacement, bias adjustment, different methods for handling ties, and exact and caliper matching. The GenMatch() function can be used to find balance automatically via a genetic search algorithm that determines the optimal weight to give each covariate (a sketch follows below).
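
We have not run GenMatch() here; a minimal sketch of the documented workflow might look like the following, where the covariate matrix and the pop.size value are illustrative choices of ours, not recommendations (GenMatch() also requires the rgenoud package).

library(Matching)
covs = cbind(age, female, i1, mcs)   # assumes these vectors are in the workspace
genout = GenMatch(Tr=homeless, X=covs, BalanceMatrix=covs, pop.size=50)
rr.gen = Match(Y=pcs, Tr=homeless, X=covs, Weight.matrix=genout)
summary(rr.gen)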

An extensive website describes the package and the many options that it supports, and a related paper is forthcoming. Three extended examples are included there to help illustrate the mechanics of matching.


library(Matching)
rr = Match(Y=pcs, Tr=homeless, X=X, M=1)


The function returns an object describing the matching.


names(rr)
[1] "est" "se" "est.noadj" "se.standard"
[5] "se.cond" "mdata" "index.treated" "index.control"
[9] "index.dropped" "weights" "orig.nobs" "orig.wnobs"
[13] "orig.treated.nobs" "nobs" "wnobs" "caliper"
[17] "ecaliper" "exact" "ndrops" "ndrops.matches"
[21] "MatchLoopC" "version" "estimand"




The results can be displayed by running summary().
      
summary(rr)

Estimate... -0.80207
AI SE...... 1.4448
T-stat..... -0.55516
p.val...... 0.57878

Original number of observations.............. 453
Original number of treated obs............... 209
Matched number of observations............... 209
Matched number of observations (unweighted). 252


By default, the observations are given equal weight. If all of the observations had a weight of 1 on input, then each matched pair will have a weight of 1 on output if there are no ties.
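
To see the weights directly, we can inspect the components returned by Match() (all listed in the names(rr) output above); a brief sketch of our own:

table(rr$weights)   # fractional weights indicate ties that were split
head(cbind(treated=rr$index.treated, control=rr$index.control,
  weight=rr$weights))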

We see that the causal estimate of -0.80 in the matched comparison is not statistically significant (p=0.58), which is consistent with the other results that accounted for the confounders (though we note that the specific results depend on the particular options that are selected for the matching).

The MatchBalance() function can be used to describe the distribution of the predictors (by homeless status) before and after matching; to save space, only the results for age and i1 are displayed. This is helpful in determining whether the matching resulted in similar marginal distributions.




> MatchBalance(form, match.out=rr, nboots=10)
***** (V1) age *****
                      Before Matching       After Matching
mean treatment........ 36.368 36.368
mean control.......... 35.041 36.423
std mean diff......... 16.069 -0.65642

mean raw eQQ diff..... 1.5981 0.94841
med raw eQQ diff..... 1 1
max raw eQQ diff..... 7 10

mean eCDF diff........ 0.037112 0.022581
med eCDF diff........ 0.026365 0.019841
max eCDF diff........ 0.10477 0.083333

var ratio (Tr/Co)..... 1.3290 1.2671
T-test p-value........ 0.070785 0.93902
KS Bootstrap p-value.. < 2.22e-16 0.3
KS Naive p-value...... 0.16881 0.34573
KS Statistic.......... 0.10477 0.083333

***** (V3) i1 *****
                      Before Matching       After Matching
mean treatment........ 23.038 23.038
mean control.......... 13.512 20.939
std mean diff......... 40.582 8.945

mean raw eQQ diff..... 9.6316 2.1071
med raw eQQ diff..... 8 1
max raw eQQ diff..... 73 66

mean eCDF diff........ 0.11853 0.018753
med eCDF diff........ 0.12377 0.011905
max eCDF diff........ 0.20662 0.087302

var ratio (Tr/Co)..... 2.3763 1.3729
T-test p-value........ 7.8894e-07 0.011786
KS Bootstrap p-value.. < 2.22e-16 0.3
KS Naive p-value...... 0.00013379 0.29213
KS Statistic.......... 0.20662 0.087302


More details regarding each of the tests for differences in means or distributions can be found using ?MatchBalance. The results for both of the variables presented above indicate that the distributions are considerably closer to each other in the matched sample than in the original dataset.