Monday, May 31, 2010

Example 7.39: Nelson-Aalen estimate of cumulative hazard

In our previous example, we demonstrated how to calculate the Kaplan-Meier estimate of the survival function for time to event data.

A related quantity is the Nelson-Aalen estimate of the cumulative hazard. In addition to summarizing the hazard accrued by a particular timepoint, this quantity has been used in models for missing data (see White and Royston, 2009).
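
Recall that the Nelson-Aalen estimator simply accumulates, over the observed event times, the ratio of the number of events to the number at risk. In the usual notation, with $d_i$ events among the $n_i$ subjects at risk at event time $t_i$,

$$\hat{H}(t) = \sum_{t_i \le t} \frac{d_i}{n_i}.$$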

In addition to the usual SAS and R approaches to this, we also show Stata code.


SAS

It's very straightforward to calculate the Nelson-Aalen estimate in SAS. We assume the data are in the ds data set generated in example 7.38.


ods output productlimitestimates=naout;
proc lifetest data=ds nelson;
  time time*event(0);
run;

options formchar="|----|+|---+=|-/\<>*";
options ls=64;
proc print data=naout; var time cumhaz; run;


In the foregoing, the formchar option makes SAS print its output using characters that can easily be cut and pasted into other software, such as Blogger's text entry window, as we described in an earlier entry.


Obs time CumHaz
1 0.0000 0
2 0.5000 .
3 1.0000 .
4 1.0000 0.1111
5 2.0000 0.1736
6 2.0000 .
7 3.0000 0.2450
8 4.0000 0.3220
9 5.0000 .
10 6.0000 0.4129
11 7.0000 .
12 8.0000 0.5240
13 9.0000 0.6490
14 10.0000 .
15 12.0000 0.8156
16 14.0000 1.0156
17 14.0000 .
18 17.0000 .
19 20.0000 1.5156
20 21.0000 .


Note that SAS shows the Nelson-Aalen estimator once per time point, and only when there is a failure at that time. We may need the estimated value at every observed time point in our data, as well as to use the estimate alongside our original data. This requires some additional work. First, we sort our data and the new naout data set by time, removing the rows from the above printout where the cumulative hazard is missing, using the where data set option (section A.6.3). Then we merge the two data sets.


proc sort data=ds; by time; run;
proc sort data=naout (where=(cumhaz ne .)); by time; run;

data t2;
merge ds (in=keepme) naout (keep=time cumhaz);
  * the keep option drops unneeded variables;
by time;
if keepme;
  * drop any extra rows contributed by the naout data set;
retain cumhazlast 0;
  * remember the most recent value of cumhaz;
if cumhaz eq . then cumhaz = cumhazlast;
  * plug in the most recent value when cumhaz is missing;
cumhazlast = cumhaz;
  * the current value of cumhaz is now the most recent;
run;

proc print data=t2;
var time event cumhaz;
run;


This generates the desired result:


Obs time event CumHaz
1 0.5000 0 0
2 1.0000 1 0.1111
3 1.0000 1 0.1111
4 2.0000 1 0.1736
5 2.0000 0 0.1736
6 3.0000 1 0.2450
7 4.0000 1 0.3220
8 5.0000 0 0.3220
9 6.0000 1 0.4129
10 7.0000 0 0.4129
11 8.0000 1 0.5240
12 9.0000 1 0.6490
13 10.0000 0 0.6490
14 12.0000 1 0.8156
15 14.0000 0 1.0156
16 14.0000 1 1.0156
17 17.0000 0 1.0156
18 20.0000 1 1.5156
19 21.0000 0 1.5156


R

While the survfit() command can be used to create a table of the survival function and Nelson-Aalen estimates at each event time, it is slightly harder in R to associate these estimates with the original data. Here we craft a function that does this housekeeping, using the fact that the Nelson-Aalen estimate is just the negative log of the survival function (after specifying the type="aalen" option).

library(survival)   # provides survfit(), coxph(), and Surv()

calcna = function(time, event) {
  na.fit = survfit(coxph(Surv(time, event) ~ 1), type="aalen")
  jumps = c(0, na.fit$time, max(time))
  # need to be careful at the beginning and end
  surv = c(1, na.fit$surv, na.fit$surv[length(na.fit$surv)])

  # apply the appropriate transformation
  neglogsurv = -log(surv)

  # create a placeholder of the correct length
  naest = numeric(length(time))
  for (i in 2:length(jumps)) {
    naest[which(time >= jumps[i-1] & time <= jumps[i])] =
      neglogsurv[i-1]   # snag the appropriate value
  }
  return(naest)
}


This can be used as follows, where we use the time and event objects created in example 7.38.


newna = calcna(time, event)
cbind(time, newna)

time newna
[1,] 0.5 0.0000000
[2,] 1.0 0.1111111
[3,] 1.0 0.1111111
[4,] 2.0 0.1736111
[5,] 2.0 0.1736111
[6,] 3.0 0.2450397
[7,] 4.0 0.3219628
[8,] 5.0 0.3219628
[9,] 6.0 0.4128719
[10,] 7.0 0.4128719
[11,] 8.0 0.5239830
[12,] 9.0 0.6489830
[13,] 10.0 0.6489830
[14,] 12.0 0.8156496
[15,] 14.0 1.0156496
[16,] 14.0 1.0156496
[17,] 17.0 1.0156496
[18,] 20.0 1.5156496
[19,] 21.0 1.5156496
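
As a check on the function, note that at each event time the Nelson-Aalen estimate is just the running sum of (number of events)/(number at risk). The following sketch (our addition, assuming the time and event vectors from example 7.38 and the survival package loaded above) compares this running sum with the negative log of the type="aalen" survival estimates; with this small data set the two columns agree, though in general the handling of tied event times in coxph() can introduce small differences.

na.fit = survfit(coxph(Surv(time, event) ~ 1), type="aalen")
cbind(time = na.fit$time,
  byhand = cumsum(na.fit$n.event / na.fit$n.risk),
  neglogsurv = -log(na.fit$surv))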


We also describe how to calculate the Nelson-Aalen estimate in Stata. First we create a dataset in the appropriate format. The foreign library in R (sections 1.1.5, 1.2.2) can do this directly. In SAS we would need to export to a format that Stata can read.


library(foreign)
write.dta(data.frame(time=time, event=event), "forstata.dta")


Stata

We can then read in the dataset from R:

. use forstata
(Written by R. )

. stset time, failure(event)

failure event: event != 0 & event < .
obs. time interval: (0, time]
exit on or before: failure

------------------------------------------------------------------------------
19 total obs.
0 exclusions
------------------------------------------------------------------------------
19 obs. remaining, representing
11 failures in single record/single failure data
156.5 total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 21
. sts list, cumhaz

failure _d: event
analysis time _t: time
Beg. Net Nelson-Aalen Std.
Time Total Fail Lost Cum. Haz. Error [95% Conf. Int.]
-------------------------------------------------------------------------------
.5 19 0 1 0.0000 0.0000 . .
1 18 2 0 0.1111 0.0786 0.0278 0.4443
2 16 1 1 0.1736 0.1004 0.0559 0.5393
3 14 1 0 0.2450 0.1232 0.0915 0.6565
4 13 1 0 0.3220 0.1453 0.1330 0.7795
5 12 0 1 0.3220 0.1453 0.1330 0.7795
6 11 1 0 0.4129 0.1714 0.1830 0.9313
7 10 0 1 0.4129 0.1714 0.1830 0.9313
8 9 1 0 0.5240 0.2042 0.2441 1.1248
9 8 1 0 0.6490 0.2394 0.3149 1.3375
10 7 0 1 0.6490 0.2394 0.3149 1.3375
12 6 1 0 0.8156 0.2917 0.4046 1.6442
14 5 1 1 1.0156 0.3537 0.5132 2.0099
17 3 0 1 1.0156 0.3537 0.5132 2.0099
20 2 1 0 1.5156 0.6125 0.6865 3.3463
21 1 0 1 1.5156 0.6125 0.6865 3.3463
-------------------------------------------------------------------------------

Monday, May 24, 2010

Example 7.38: Kaplan-Meier survival estimates

In example 7.30 we demonstrated how to simulate data from a Cox proportional hazards model.

In this and the next few entries, we expand upon support in R and SAS for survival (time-to-event) models. We'll start with a small, artificial dataset of 19 subjects. Each subject contributes a pair of variables: the time and an indicator of whether the time is when the event occurred (event=TRUE) or when the subject was censored (event=FALSE).

time  event
0.5   FALSE
1     TRUE
1     TRUE
2     TRUE
2     FALSE
3     TRUE
4     TRUE
5     FALSE
6     TRUE
7     FALSE
8     TRUE
9     TRUE
10    FALSE
12    TRUE
14    FALSE
14    TRUE
17    FALSE
20    TRUE
21    FALSE


Until an instant before time=1, no events were observed (only the censored observation), so the survival estimate is 1. At time=1, 2 of the 18 subjects still at risk experienced the event, so the survival function S(.) at time 1 is S(1) = 16/18 = 0.8889. The next failure occurs at time=2, with 16 still at risk, so S(2) = 15/16 * 16/18 = 0.8333. Note that in addition to the event at time=2, a subject is censored then, so the number at risk at time=3 is 14 (and S(3) = 13/14 * 15/16 * 16/18 = 0.7738). The calculations continue until the final event is observed.
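
The arithmetic in the previous paragraph is easy to verify directly; this is just a sketch of the first three steps of the product-limit calculation, not part of the original example.

s1 = 16/18        # two events among 18 at risk at time 1
s2 = s1 * 15/16   # one event among 16 at risk at time 2
s3 = s2 * 13/14   # one event among 14 at risk at time 3
round(c(s1, s2, s3), 4)   # 0.8889 0.8333 0.7738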


R

In R, we use the survfit() function (section 5.1.19) within the survival library to calculate the survival function across time.

library(survival)
time =  c(0.5, 1,1,2,2,3,4,5,6,7,8,9,10,12,14,14,17,20, 21)
event = c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, 
  TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, 
  TRUE, FALSE)
ds = data.frame(time, event)
fit = survfit(Surv(time, event) ~ 1, data=ds)


The returned survival object includes a number of attributes, such as the survival estimates at each timepoint, the standard error of those estimates, and the number of subjects at risk.

> names(fit)
 [1] "n"         "time"      "n.risk"    "n.event"   "n.censor"  
     "surv"      "type"      "std.err"  
 [9] "upper"     "lower"     "conf.type" "conf.int"  "call" 
> summary(fit)
Call: survfit(formula = Surv(time, event) ~ 1, data = ds)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1     18       2    0.889  0.0741       0.7549        1.000
    2     16       1    0.833  0.0878       0.6778        1.000
    3     14       1    0.774  0.0997       0.6011        0.996
    4     13       1    0.714  0.1084       0.5306        0.962
    6     11       1    0.649  0.1164       0.4570        0.923
    8      9       1    0.577  0.1238       0.3791        0.879
    9      8       1    0.505  0.1276       0.3078        0.829
   12      6       1    0.421  0.1312       0.2285        0.775
   14      5       1    0.337  0.1292       0.1587        0.714
   20      2       1    0.168  0.1354       0.0348        0.815
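
Since these attributes are stored as plain vectors in the fit object, we can also tabulate them directly rather than relying on summary(); a minimal sketch (our addition):

cbind(time=fit$time, n.risk=fit$n.risk, n.event=fit$n.event, surv=fit$surv)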



SAS

We can read in the artificial data using an input statement (section 1.1.8).

data ds;
input time event;
cards;
0.5   0
1     1
1     1
2     1
2     0
3     1
4     1
5     0
6     1
7     0
8     1
9     1
10    0
12    1
14    0
14    1
17    0
20    1
21    0
run;




proc lifetest data=ds;
  time time*event(0);
run;


Here we indicate that observations with event equal to 0 are censored. If instead we had a censoring indicator coded in reverse (1 = censored), the second line might read time time*censored(1);.

The survival function can be estimated in proc lifetest (as shown in section 5.1.19). In a break from our usual practice, we'll include all of the output generated by proc lifetest.

The LIFETEST Procedure
                   Product-Limit Survival Estimates
                                     Survival
                                     Standard     Number      Number 
    time     Survival    Failure      Error       Failed       Left  
  0.0000       1.0000           0           0        0          19   
  0.5000*           .           .           .        0          18   
  1.0000            .           .           .        1          17   
  1.0000       0.8889      0.1111      0.0741        2          16   
  2.0000       0.8333      0.1667      0.0878        3          15   
  2.0000*           .           .           .        3          14   
  3.0000       0.7738      0.2262      0.0997        4          13   
  4.0000       0.7143      0.2857      0.1084        5          12   
  5.0000*           .           .           .        5          11   
  6.0000       0.6494      0.3506      0.1164        6          10   
  7.0000*           .           .           .        6           9   
  8.0000       0.5772      0.4228      0.1238        7           8   
  9.0000       0.5051      0.4949      0.1276        8           7   
 10.0000*           .           .           .        8           6   
 12.0000       0.4209      0.5791      0.1312        9           5   
 14.0000       0.3367      0.6633      0.1292       10           4   
 14.0000*           .           .           .       10           3   
 17.0000*           .           .           .       10           2   
 20.0000       0.1684      0.8316      0.1354       11           1   
 21.0000*           .           .           .       11           0   

NOTE: The marked survival times are censored observations.

Summary Statistics for Time Variable time

             Quartile Estimates
             Point     95% Confidence Interval
Percent    Estimate      [Lower      Upper)
     75     20.0000     12.0000       .    
     50     12.0000      6.0000     20.0000
     25      4.0000      1.0000     12.0000

    Mean    Standard Error
 11.1776            1.9241

NOTE: The mean survival time and its standard error were underestimated because 
      the largest observation was censored and the estimation was restricted to 
      the largest event time.

Summary of the Number of Censored and Uncensored Values
                                 Percent
   Total  Failed    Censored    Censored
      19      11           8       42.11

Monday, May 17, 2010

Example 7.37: calculation of Hotelling's T^2

Hotelling's T^2 is a multivariate statistic used to compare two groups, where multiple outcomes are observed for each subject.
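
For two independent samples of sizes $n_1$ and $n_2$ on $k$ outcomes, with sample mean vectors $\bar{y}_1$ and $\bar{y}_2$ and pooled covariance matrix $S_p$, the statistic and its F transformation (as implemented in the function below) are

$$T^2 = \frac{n_1 n_2}{n_1 + n_2}\,(\bar{y}_1 - \bar{y}_2)^{\top} S_p^{-1} (\bar{y}_1 - \bar{y}_2), \qquad F = \frac{n_1 + n_2 - k - 1}{k\,(n_1 + n_2 - 2)}\,T^2,$$

where F has an $F_{k,\,n_1+n_2-k-1}$ distribution under the null hypothesis of equal mean vectors.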

Here we demonstrate how to calculate Hotelling's T^2 using R and SAS, test the code using a simulation study, and then apply it to data from the HELP study.


R

We utilize an approach suggested by Peter Mandeville in a posting to the R-help mailing list.



hotelling = function(y1, y2) {
  # check that the dimensions are compatible
  k = ncol(y1)
  k2 = ncol(y2)
  if (k2 != k)
    stop("input matrices must have the same number of columns: y1 has ",
      k, " while y2 has ", k2)

  # calculate sample sizes and observed means
  n1 = nrow(y1)
  n2 = nrow(y2)
  ybar1 = apply(y1, 2, mean); ybar2 = apply(y2, 2, mean)
  diffbar = ybar1 - ybar2

  # calculate the pooled variance-covariance matrix
  v = ((n1-1)*var(y1) + (n2-1)*var(y2))/(n1+n2-2)

  # calculate the test statistic and associated quantities
  t2 = n1*n2*diffbar %*% solve(v) %*% diffbar/(n1+n2)
  f = (n1+n2-k-1)*t2/((n1+n2-2)*k)
  pvalue = 1 - pf(f, k, n1+n2-k-1)

  # return the list of results
  return(list(pvalue=pvalue, f=f, t2=t2, diff=diffbar))
}


We begin by ensuring that the two input matrices have the same number of columns, then calculate quantities needed for the calculation. The function returns a list with a number of named objects.

To test the function, we ran a simulation study to ensure that under the null hypothesis the function rejected approximately 5% of the time.


# do a simple simulation study to ensure Type I error rate is preserved
library(MASS)
numoutcomes = 5
mu1 = rep(0, numoutcomes); mu2 = rep(0, numoutcomes)
rho = .4
# compound symmetry covariance matrix
Sigma = rho*matrix(rep(1, numoutcomes^2), numoutcomes) +
  diag(rep(1-rho, numoutcomes))
numsim = 10000
reject = numeric(numsim)
for (i in 1:numsim) {
  grp1 = mvrnorm(100, mu1, Sigma)
  grp2 = mvrnorm(150, mu2, Sigma)
  reject[i] = hotelling(grp1, grp2)$pvalue < 0.05
}


This yields an appropriate Type I error rate (just under 5%).


> table(reject)
reject
   0    1 
9537  463 
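
As a follow-up check (our addition), an exact binomial test confirms that 463 rejections in 10,000 simulations is consistent with a true rejection rate of 0.05.

binom.test(sum(reject), numsim, p=0.05)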


We can also undertake a comparison of non-homeless and homeless subjects in the HELP dataset, comparing individual observations on the PCS (physical component score) and MCS (mental component score) subscales of the SF-36 and on the CESD measure of depressive symptoms.


ds = read.csv("http://www.math.smith.edu/sasr/datasets/help.csv")
smallds = with(ds, cbind(pcs, mcs, cesd))
grp1 = subset(smallds, ds$homeless==0)
grp2 = subset(smallds, ds$homeless==1)
res = hotelling(grp1, grp2)


This saves the results in the object res, which we can display.


> names(res)
[1] "pvalue" "f" "t2" "diff"
> res
$pvalue
          [,1]
[1,] 0.1082217

$f
         [,1]
[1,] 2.035024

$t2
         [,1]
[1,] 6.132267

$diff
      pcs       mcs      cesd 
 2.064049  1.755975 -2.183760 


While there are differences in the mean PCS, MCS, and CESD scores between the homeless and non-homeless subjects, we do not have sufficient evidence to reject the null hypothesis that the two populations have equal mean vectors (p=0.11).

SAS

In SAS, we can easily make the two-group multivariate comparison using the MANOVA statement in proc glm. This hides the actual value of the statistic, but generates the desired p-value. As usual, we use the ODS system to reduce SAS output.


ods select multstat;
proc glm data="c:\book\help";
class homeless;
model pcs mcs cesd = homeless;
manova h=homeless;
run;
quit;


SAS prints four valid test statistics; for a single-degree-of-freedom hypothesis such as this two-group comparison (S=1), all four criteria yield identical exact F tests.


MANOVA Test Criteria and Exact F Statistics for
the Hypothesis of No Overall HOMELESS Effect
H = Type III SSCP Matrix for HOMELESS
E = Error SSCP Matrix

S=1 M=0.5 N=223.5

Statistic Value F Value Num DF Den DF Pr > F

Wilks' Lambda 0.98658536 2.04 3 449 0.1082
Pillai's Trace 0.01341464 2.04 3 449 0.1082
Hotelling-Lawley Trace 0.01359704 2.04 3 449 0.1082
Roy's Greatest Root 0.01359704 2.04 3 449 0.1082


Here we also demonstrate using SAS proc iml (section 1.9) to calculate the statistic and perform the test from scratch. This roughly parallels the R development above, though we omit the check that the two input matrices have the same number of columns.


*get the help data;
data help;
set "c:\book\help";
run;

proc iml;
* build a function;
start hotelling;
n1 = nrow(h1);
n0 = nrow(h0);
onesum = j(n1,1,1); *make an n1*1 matrix of 1s;
zerosum = j(n0,1,1);
meanh1 = h1`*onesum/n1;
meanh0 = h0`*zerosum/n0;
i1 = i(n1); * make identity matrix;
i0 = i(n0);
vc1 = h1`*(i1-onesum*onesum`/n1)*h1/(n1-1.0);
* make variance-covariance matrix;
vc0 = h0`*(i0-zerosum*zerosum`/n0)*h0/(n0-1.0);
pooledvc = ((n1-1.0)*vc1+(n0-1.0)*vc0)/(n1+n0-2.0);
meandiff = meanh1 - meanh0;
t2 = meandiff`* inv(pooledvc) *meandiff * (n1 * n0)/(n1 + n0);
nvars = ncol(h1);
f=((n1+n0-nvars-1)/(nvars*(n1+n0-2)))*t2;
df2=n1+n0-nvars-1;
p=1-probf(f,nvars,df2);
fcrit = finv(.95, nvars, df2);
print meandiff pooledvc t2 f nvars df2 p fcrit;
finish;

*use the function with the HELP data;
use help;
read all var {cesd pcs mcs} where (homeless = 1) into h1;
read all var {cesd pcs mcs} where (homeless = 0) into h0;
run hotelling;
quit;


Here's the output:


 meandiff   pooledvc                           t2          f  nvars

2.1837595  155.76862  -38.46634  -108.8555   6.1322669  2.0350243      3
-2.064049  -38.46634  115.50213  14.423897
-1.755975  -108.8555  14.423897  164.44444


      df2          p      fcrit

      449  0.1082217  2.6247689

Monday, May 10, 2010

Example 7.36: Propensity score stratification

In examples 7.34 and 7.35 we described methods using propensity scores to account for possible confounding factors in an observational study.

In addition to adjusting for the propensity score in a multiple regression and matching on the propensity score, researchers will often stratify by the propensity score, and carry out analyses within each group defined by these scores. With sufficient sample size, we might use 5 or 10 strata, but in this example, we'll use 4.


R

To facilitate running the code from this entry, we include all of the previous code snippets (the code is also posted on the book website as propensity.R).


ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
attach(ds)
summary(lm(pcs ~ homeless))

form = formula(homeless ~ age + female + i1 + mcs)
glm1 = glm(form, family=binomial)
X = glm1$fitted
tapply(X, homeless, FUN=fivenum)
par(mfrow=c(2,1))
hist(X[homeless==0], xlim=c(0.2,1))
hist(X[homeless==1], xlim=c(0.2,1))
par(mfrow=c(1,1))
summary(lm(pcs ~ homeless + X, subset=(X<.8)))
library(Matching)
rr = Match(Y=pcs, Tr=homeless, X=X, M=1)
summary(rr)
MatchBalance(form, match.out=rr, nboots=10)


We will use the propensity scores (stored in the variable X) as a way of stratifying the dataset into four approximately equally sized groups. Unlike our earlier approach, we will include all subjects (even those with very high propensities) and note that as a result we will observe an imbalance in the distribution of homelessness across the strata.



breakvals = fivenum(X)   # section 2.1.4
strat = cut(X, breaks=breakvals,
  labels=c('bot quart', '2nd quart', '3rd quart', 'top quart'),
  include.lowest=TRUE)
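
To see the imbalance mentioned above, we can cross-tabulate the strata against homeless status (a quick check of our own, using the base table() function):

table(strat, homeless)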


Next we set some parameters needed to tweak the graphic that will be created.


topval = 82
botval = 15
cexval=.6
eps = 3


Next we create a set of boxplots for the PCS scores of the homeless and non-homeless in each of the four strata plus overall. In addition, we annotate the plot with the labels as well as the sample size in each group.


boxplot(pcs[homeless==0 & strat=='bot quart'],
  pcs[homeless==1 & strat=='bot quart'],
  pcs[homeless==0 & strat=='2nd quart'],
  pcs[homeless==1 & strat=='2nd quart'],
  pcs[homeless==0 & strat=='3rd quart'],
  pcs[homeless==1 & strat=='3rd quart'],
  pcs[homeless==0 & strat=='top quart'],
  pcs[homeless==1 & strat=='top quart'],
  pcs[homeless==0], pcs[homeless==1],
  ylim=c(botval,topval), xaxt="n", ylab="PCS score")
abline(v=2.5)
abline(v=4.5)
abline(v=6.5)
abline(v=8.5, lwd=2)
text(1, topval, "not\nhomeless", cex=cexval)
text(2, topval, "homeless", cex=cexval)
text(3, topval, "not\nhomeless", cex=cexval)
text(4, topval, "homeless", cex=cexval)
text(5, topval, "not\nhomeless", cex=cexval)
text(6, topval, "homeless", cex=cexval)
text(7, topval, "not\nhomeless", cex=cexval)
text(8, topval, "homeless", cex=cexval)
text(9, topval, "not\nhomeless", cex=cexval)
text(10, topval, "homeless", cex=cexval)

text(1.5, botval+eps, "bot quart")
text(3.5, botval+eps, "2nd quart")
text(5.5, botval+eps, "3rd quart")
text(7.5, botval+eps, "top quart")
text(9.5, botval+eps, "overall")

text(1, topval-eps, cex=cexval,
  paste("n=", sum(homeless==0 & strat=='bot quart', na.rm=TRUE), sep=""))
text(2, topval-eps, cex=cexval,
  paste("n=", sum(homeless==1 & strat=='bot quart', na.rm=TRUE), sep=""))
text(3, topval-eps, cex=cexval,
  paste("n=", sum(homeless==0 & strat=='2nd quart', na.rm=TRUE), sep=""))
text(4, topval-eps, cex=cexval,
  paste("n=", sum(homeless==1 & strat=='2nd quart', na.rm=TRUE), sep=""))
text(5, topval-eps, cex=cexval,
  paste("n=", sum(homeless==0 & strat=='3rd quart', na.rm=TRUE), sep=""))
text(6, topval-eps, cex=cexval,
  paste("n=", sum(homeless==1 & strat=='3rd quart', na.rm=TRUE), sep=""))
text(7, topval-eps, cex=cexval,
  paste("n=", sum(homeless==0 & strat=='top quart', na.rm=TRUE), sep=""))
text(8, topval-eps, cex=cexval,
  paste("n=", sum(homeless==1 & strat=='top quart', na.rm=TRUE), sep=""))
text(9, topval-eps, cex=cexval,
  paste("n=", sum(homeless==0, na.rm=TRUE), sep=""))
text(10, topval-eps, cex=cexval,
  paste("n=", sum(homeless==1, na.rm=TRUE), sep=""))




We note that within each quartile of the propensity score, the difference between the median PCS scores is smaller than the difference between the overall medians. In addition, there is some indication that the difference increases as a function of the propensity score. Finally, note that the proportion of subjects in the two groups varies widely across the strata, swinging from 1/3 to 2/3 homeless.

We can run a t-test within each stratum to assess whether there are statistically significant differences between the PCS scores of the homeless and non-homeless subjects. Here we use the by() function (which takes three arguments: a data frame, a variable that defines the groups, and a function to be called on the subset corresponding to each level) combined with the with() function to carry out the test on the appropriate subset, returning the p-value each time. This requires creating a data frame with the appropriate quantities.


> stratdf = data.frame(pcs, homeless, strat)
> out = by(stratdf, strat,
function(mydataframe) {with(mydataframe,
t.test(pcs[homeless==0], pcs[homeless==1]))
})
> out
strat: bot quart

Welch Two Sample t-test

data: pcs[homeless == 0] and pcs[homeless == 1]
t = 0.806, df = 58.564, p-value = 0.4235
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.65544 6.23678
sample estimates:
mean of x mean of y
49.88714 48.09647

-------------------------------------------------------------------------------------------
strat: 2nd quart

Welch Two Sample t-test

data: pcs[homeless == 0] and pcs[homeless == 1]
t = -0.1011, df = 101.083, p-value = 0.9197
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.212064 3.803686
sample estimates:
mean of x mean of y
49.37089 49.57508

-------------------------------------------------------------------------------------------
strat: 3rd quart

Welch Two Sample t-test

data: pcs[homeless == 0] and pcs[homeless == 1]
t = 0.823, df = 110.918, p-value = 0.4123
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.357005 5.705745
sample estimates:
mean of x mean of y
49.44396 47.76959

-------------------------------------------------------------------------------------------
strat: top quart

Welch Two Sample t-test

data: pcs[homeless == 0] and pcs[homeless == 1]
t = 0.9222, df = 74.798, p-value = 0.3594
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.304285 6.276212
sample estimates:
mean of x mean of y
46.07584 44.08988



The results show that while the mean PCS is slightly lower among homeless subjects in 3 of the 4 propensity strata, the standard errors of the differences are so large that there is little credible evidence of a health cost ascribable to homelessness.



SAS

We begin by getting the propensity scores.


proc logistic data="c:\book\help" desc;
model homeless = age female i1 mcs;
output out=propen pred=propensity;
run;


Next, we find the quartiles of the propensity scores using proc means (section 2.1.1), saving the results in a new data set. The new data set has a single observation containing the three quartiles.


proc means q1 median q3 data=propen;
var propensity;
output out = propenquart q1=q1 median=median q3=q3;
run;


We want to generate a categorical variable to indicate quartile membership. This requires having the quartile values from the propenquart data set available for each row of the data set containing the propensities. To do this, we'll use the point option to the set statement, which here reads the single observation of propenquart into every row of the new data set. Note that, unlike the R code above, we also drop subjects with a propensity of 0.8 or more, as in example 7.35.


data propen2;
set propen;
setme = 1;
set propenquart point=setme;
if propensity lt q1 then propensity_quartile = "1st quart";
else if propensity lt median then propensity_quartile = "2nd quart";
else if propensity lt q3 then propensity_quartile = "3rd quart";
else if propensity lt 0.8 then propensity_quartile = "4th quart";
else delete;
if homeless = 1 then homeless_status = "Yes";
else if homeless = 0 then homeless_status = "No";
run;


Next, we'll use proc boxplot (section 5.1.7) to make a plot similar to the one made in R. The SAS plot does not show the overall boxes that ignore the strata; adding them would most likely be quite a hassle. The boxplot procedure is finicky: it seems to work best for parallel boxplots when the grouping variables are character variables, and sorting is often necessary.

 
proc sort data=propen2; by propensity_quartile homeless_status; run;

proc boxplot data=propen2;
plot pcs * homeless_status (propensity_quartile);
insetgroup n / pos=bottom;
run;





Conclusions are noted above.

Finally, we perform the analysis comparing PCS between the homeless and non-homeless groups within each stratum, using the ODS system to reduce the output.


ods select none;
ods output parameterestimates=propstrata;
proc glm data=propen2;
by propensity_quartile;
model pcs=homeless;
run;
ods select all;


Here are the final results. Small differences from the R results are due to the fact that the defaults for specification of quartiles differ between the two packages (note that for R, there are 79 subjects in the non-homeless bottom quartile, while SAS included 78 non-homeless subjects in that group).


proc print data=propstrata;
where parameter="HOMELESS";
run;

propensity_
Obs quartile Estimate StdErr Probt

2 1st quart -2.00721337 2.10622538 0.3427
4 2nd quart 0.44727354 2.11652411 0.8330
6 3rd quart -1.55746908 2.03369752 0.4454
8 4th quart -2.00910422 2.06426071 0.3325

Monday, May 3, 2010

Example 7.35: Propensity score matching

As discussed in example 7.34, it's sometimes preferable to match on propensity scores, rather than adjust for them as a covariate.

SAS

We use a suite of macros written by Jon Kosanke and Erik Bergstralh at the Mayo Clinic. The dist macro calculates the pairwise distances between observations, while the vmatch macro makes matches based on those distances, finding the set of matches that minimizes the overall distance. The latter macro uses the nobs macro. The original macros are available for download from the Mayo Clinic Division of Biomedical Statistics and Informatics. Slightly enhanced versions of the nobs and vmatch macros are available from the code section of the book website; the enhanced versions are used to generate the results below.

We previously created the propen data set containing the propensity score variable. In that entry we also noted a lack of overlap in the propensity distributions, and decided to drop observations with propensity > 0.8.

First we drop observations with propensity greater than 0.8 using the subsetting if statement (section 1.5.1). This leaves 201 homeless subjects. We then read in the macros using the %include statement (section 2.1.8). Then we run the dist macro, using its option to call the vmatch macro.


data prop2; set propen;
if propensity lt .8;
run;

%include "C:\ken\sasmacros\vmatch.sas";
%include "C:\ken\sasmacros\dist.sas";
%include "C:\ken\sasmacros\nobs.sas";

%dist(data=prop2, group=homeless, id=id, mvars=propensity,
wts=1, vmatch=Y, a=1, b=1, lilm=201, dmax=0.1,
outm=mp1_b, summatch=n, printm=N, mergeout=mpropen);


The macros are well documented by text included at the top of each macro, as is common with SAS macros that authors share. In the preceding code, the parameter values in the first line are relatively self-explanatory. The wts parameter allows multiple matching variables to be weighted differently, while the dmax parameter specifies the maximum distance acceptable for a match. We arbitrarily require that the propensities be within 0.1 to make a match (our results would differ if other criteria were specified for the matching, as we will see below in the R comparison). The remaining parameters request the matching, ask for one and only one match per case, ask for all the cases to be matched (if possible), suppress printed output, and name the data set to contain the output.

The output data set mpropen is identical to the input data set, with the addition of a new indicator variable, matched. We can compare the distribution of the covariates before and after the matching with the means procedure (section 2.1.1).


title "observed data";
proc means data=propen mean;
class homeless;
var age female i1 mcs;
run;

title "matched observations";
proc means data=mpropen mean;
where matched;
class homeless;
var age female i1 mcs;
run;

observed data
N
HOMELESS Obs Variable Mean
-----------------------------------------------
0 244 AGE 35.0409836
FEMALE 0.2745902
I1 13.5122951
MCS 32.4868303

1 209 AGE 36.3684211
FEMALE 0.1913876
I1 23.0382775
MCS 30.7308549
-----------------------------------------------

matched observations
N
HOMELESS Obs Variable Mean
-----------------------------------------------
0 201 AGE 35.6218905
FEMALE 0.1791045
I1 15.9154229
MCS 31.4815123

1 201 AGE 36.1492537
FEMALE 0.1990050
I1 19.9452736
MCS 30.9176772
-----------------------------------------------


We see that the covariates are much better balanced after matching. Also note the dramatic impact of removing the 8 (4%) homeless subjects with very large propensities: their exclusion reduced the mean number of drinks in the homeless group by about 13%.
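
A quick arithmetic check of that percentage (our addition), using the homeless-group means for i1 from the output above:

(23.0382775 - 19.9452736) / 23.0382775   # about 0.134, roughly a 13% reduction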

We can now perform the analysis on the matched cases. The two homeless-status groups now have nearly equal distributions of the estimated probability of homelessness.


proc glm data=mpropen;
where matched;
model pcs = homeless / solution;
run;

Standard
Parameter Estimate Error t Value Pr > |t|

Intercept 48.95273471 0.76199823 64.24 <.0001
HOMELESS -1.79386398 1.07762823 -1.66 0.0968


After matching, the effect of homelessness on physical health is attenuated and has a larger p-value.


R

In R, the Matching library provides tools for matching and analysis. The Match() function implements a variety of algorithms for multivariate matching, including propensity score, Mahalanobis, and inverse variance matching. It is intended to be used in conjunction with the MatchBalance() function, which determines the extent to which covariate balance has been achieved. Match() takes the propensity score as an argument, as well as the outcome to be compared and the group indicators.

A wide variety of matching options are supported, including matching with or without replacement, bias adjustment, different methods for handling ties, and exact and caliper matching. The GenMatch() function can be used to find balance automatically via a genetic search algorithm that determines the optimal weight to give each covariate (a sketch follows below).
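
We have not run GenMatch() here; a minimal sketch of the documented workflow might look like the following, where the covariate matrix and the pop.size value are illustrative choices of ours, not recommendations (GenMatch() also requires the rgenoud package).

library(Matching)
covs = cbind(age, female, i1, mcs)   # assumes these vectors are in the workspace
genout = GenMatch(Tr=homeless, X=covs, BalanceMatrix=covs, pop.size=50)
rr.gen = Match(Y=pcs, Tr=homeless, X=covs, Weight.matrix=genout)
summary(rr.gen)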

An extensive website describes the package and the many options that it supports, and a related paper is forthcoming. Three extended examples are included there to help illustrate the mechanics of matching.


library(Matching)
rr = Match(Y=pcs, Tr=homeless, X=X, M=1)


The function returns an object describing the matching.


names(rr)
[1] "est" "se" "est.noadj" "se.standard"
[5] "se.cond" "mdata" "index.treated" "index.control"
[9] "index.dropped" "weights" "orig.nobs" "orig.wnobs"
[13] "orig.treated.nobs" "nobs" "wnobs" "caliper"
[17] "ecaliper" "exact" "ndrops" "ndrops.matches"
[21] "MatchLoopC" "version" "estimand"




The results can be displayed by running summary().
      
summary(rr)

Estimate... -0.80207
AI SE...... 1.4448
T-stat..... -0.55516
p.val...... 0.57878

Original number of observations.............. 453
Original number of treated obs............... 209
Matched number of observations............... 209
Matched number of observations (unweighted). 252


By default, the observations are given equal weight. If all of the observations had a weight of 1 on input, then each matched pair will have a weight of 1 on output if there are no ties.
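
To see the weights directly, we can inspect the components returned by Match() (all listed in the names(rr) output above); a brief sketch of our own:

table(rr$weights)   # fractional weights indicate ties that were split
head(cbind(treated=rr$index.treated, control=rr$index.control,
  weight=rr$weights))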

We see that the causal estimate of -0.80 in the matched comparison is not statistically significant (p=0.58), which is consistent with the other results that accounted for the confounders (though we note that the specific results depend on the particular options that are selected for the matching).

The MatchBalance() function can be used to describe the distribution of the predictors (by homeless status) before and after matching; to save space, only the results for age and i1 are displayed. This is helpful in determining whether the matching resulted in similar marginal distributions.




> MatchBalance(form, match.out=rr, nboots=10)
***** (V1) age *****
                      Before Matching       After Matching
mean treatment........ 36.368 36.368
mean control.......... 35.041 36.423
std mean diff......... 16.069 -0.65642

mean raw eQQ diff..... 1.5981 0.94841
med raw eQQ diff..... 1 1
max raw eQQ diff..... 7 10

mean eCDF diff........ 0.037112 0.022581
med eCDF diff........ 0.026365 0.019841
max eCDF diff........ 0.10477 0.083333

var ratio (Tr/Co)..... 1.3290 1.2671
T-test p-value........ 0.070785 0.93902
KS Bootstrap p-value.. < 2.22e-16 0.3
KS Naive p-value...... 0.16881 0.34573
KS Statistic.......... 0.10477 0.083333

***** (V3) i1 *****
                      Before Matching       After Matching
mean treatment........ 23.038 23.038
mean control.......... 13.512 20.939
std mean diff......... 40.582 8.945

mean raw eQQ diff..... 9.6316 2.1071
med raw eQQ diff..... 8 1
max raw eQQ diff..... 73 66

mean eCDF diff........ 0.11853 0.018753
med eCDF diff........ 0.12377 0.011905
max eCDF diff........ 0.20662 0.087302

var ratio (Tr/Co)..... 2.3763 1.3729
T-test p-value........ 7.8894e-07 0.011786
KS Bootstrap p-value.. < 2.22e-16 0.3
KS Naive p-value...... 0.00013379 0.29213
KS Statistic.......... 0.20662 0.087302


More details regarding each of the tests for differences in means or distributions can be found using ?MatchBalance. The results for both of the variables presented above indicate that the distributions are considerably closer to each other in the matched sample than in the original dataset.