Showing posts with label survival analysis. Show all posts

Tuesday, September 27, 2011

Example 9.7: New stuff in SAS 9.3-- Frailty models

Shared frailty models are a way of allowing correlated observations into proportional hazards models. Briefly, instead of l_i(t) = l_0(t)e^(x_iB), we allow l_ij(t) = l_0(t)e^(x_ijB + g_i), where observations j are in clusters i, g_i is typically normal with mean 0, and g_i is uncorrelated with g_i'. The nomenclature "frailty" comes from exponentiating the g_i and rewriting the model as l_ij(t) = l_0(t)u_i*e^(x_ijB), where the u_i = e^(g_i) are now lognormal with median 1. Observations j within cluster i share the frailty u_i, and fail faster (are frailer) than average if u_i > 1.
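Since the median of the u_i is easy to misstate, a quick numerical check helps: if g_i is normal with mean 0, then u_i = exp(g_i) is lognormal with median exp(0) = 1, because exponentiation is monotone. A minimal sketch in Python (illustration only; the blog's own code is in SAS and R):

```python
import math
import random

random.seed(1999)

# draw normal frailties g_i with mean 0 and variance 0.5,
# then exponentiate to get the multiplicative frailties u_i = exp(g_i)
g = [random.gauss(0, math.sqrt(0.5)) for _ in range(100000)]
u = sorted(math.exp(gi) for gi in g)

median_u = u[len(u) // 2]  # empirical median of the u_i
print(median_u)  # close to exp(0) = 1
```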

In SAS 9.2, this model could not be fit, though it is included in the survival package in R (section 4.3.2). With SAS 9.3, it can now be fit. We explore it here through simulation, extending the approach shown in example 7.30.

SAS
To include frailties in the model, we loop across the clusters, first generating each cluster's frailty and then running the loop from example 7.30, which now represents the observations within a cluster, adding the frailty to the survival time model. There's no need to adjust the censoring time.


data simfrail;
  beta1 = 2;
  beta2 = -1;
  lambdat = 0.002;  * baseline hazard;
  lambdac = 0.004;  * censoring hazard;
  do i = 1 to 250;  * new frailty loop;
    frailty = normal(1999) * sqrt(.5);
    do j = 1 to 5;  * original loop;
      x1 = normal(0);
      x2 = normal(0);
      * new model of event time, with frailty added;
      linpred = exp(-beta1*x1 - beta2*x2 + frailty);
      t = rand("WEIBULL", 1, lambdat * linpred);  * time of event;
      c = rand("WEIBULL", 1, lambdac);  * time of censoring;
      time = min(t, c);  * which came first?;
      censored = (c lt t);
      output;
    end;
  end;
run;

For comparison's sake, we replicate the naive model assuming independence:

proc phreg data=simfrail;
model time*censored(1) = x1 x2;
run;

Parameter Standard Hazard
Parameter DF Estimate Error Chi-Square Pr > ChiSq Ratio

x1 1 1.68211 0.05859 824.1463 <.0001 5.377
x2 1 -0.88585 0.04388 407.4942 <.0001 0.412

The parameter estimates are rather biased. In contrast, here is the correct frailty model.

proc phreg data=simfrail;
class i;
model time*censored(1) = x1 x2;
random i / noclprint;
run;
Cov REML Standard
Parm Estimate Error

i 0.5329 0.07995

Parameter Standard Hazard
Parameter DF Estimate Error Chi-Square Pr > ChiSq Ratio

x1 1 2.03324 0.06965 852.2179 <.0001 7.639
x2 1 -1.00966 0.05071 396.4935 <.0001 0.364

This returns estimates gratifyingly close to the truth. The syntax of the random statement is fairly straightforward-- the noclprint option prevents printing all the values of i. The clustering variable must be specified in the class statement. The output shows the estimated variance of the g_i.

R
In our book (section 4.16.14) we show an example of fitting the uncorrelated data model, but we don't display a frailty model. Here, we use the data generated in SAS, so we omit the data simulation in R; as in SAS, it would be a trivial extension of the code presented in example 7.30. For parallelism, we first show the results of ignoring the correlation.

> library(survival)
> with(simfrail, coxph(formula = Surv(time, 1-censored) ~ x1 + x2))

coef exp(coef) se(coef) z p
x1 1.682 5.378 0.0586 28.7 0
x2 -0.886 0.412 0.0439 -20.2 0

with identical results to those above. Note that the Surv function expects an indicator of the event, whereas SAS expects a censoring indicator.

As with SAS, the syntax for incorporating the frailty is simple.

> with(simfrail, coxph(formula = Surv(time, 1-censored) ~ x1 + x2
+ frailty(i)))

coef se(coef) se2 Chisq DF p
x1 2.02 0.0692 0.0662 850 1 0
x2 -1.00 0.0506 0.0484 393 1 0
frailty(i) 332 141 0

Variance of random effect= 0.436

Here the results differ slightly from the SAS model, though the parameter estimates are quite similar. We're not familiar enough with the computational methods to diagnose the differences.

Monday, June 14, 2010

Example 7.41: hazard function plotting



As we continue with our series on survival analysis, we demonstrate how to plot estimated (smoothed) hazard functions.

R

We will utilize the routines available in the muhaz package. Background information on the methods can be found in K.R. Hess, D.M. Serachitopol, and B.W. Brown, "Hazard Function Estimators: A Simulation Study," Statistics in Medicine, 1999; 18(22):3075-3088.

ds = read.csv("http://www.math.smith.edu/sasr/datasets/help.csv")
smallds = data.frame(dayslink=ds$dayslink,
linkstatus=ds$linkstatus, treat=ds$treat)

# drop subjects with missing data
smallds = na.omit(smallds)

treatds = smallds[smallds$treat==1,]
controlds = smallds[smallds$treat==0,]
rm(ds, smallds) # clean up

library(muhaz)
haztreat = with(treatds, muhaz(dayslink, linkstatus))
hazcontrol = with(controlds, muhaz(dayslink, linkstatus))

plot(haztreat, lwd=2, xlab="Follow-up time (days)")
lines(hazcontrol, lty=2, lwd=2)
legend(200, 0.005, legend=c("Treatment", "Control"),
lty=1:2, lwd=2)

The treatment group has dramatically higher hazard, but this drops appreciably after 6 months. The control group hazard is low, and decreases in a roughly linear fashion.
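To give a feel for the mechanics behind smoothed hazard plots, here is a rough Python sketch: it smooths the Nelson-Aalen increments d_i/n_i with an Epanechnikov kernel. This is a crude stand-in for illustration, not the boundary-corrected methods that muhaz actually implements, and the grid and bandwidth values are arbitrary choices; it reuses the small dataset from example 7.38.

```python
# times/events from example 7.38, reused purely for illustration
times = [0.5, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 14, 17, 20, 21]
events = [0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0]

def smooth_hazard(times, events, grid, bandwidth):
    # Epanechnikov-kernel smoothing of the Nelson-Aalen increments d_i/n_i
    def kernel(u):
        return 0.75 * (1.0 - u * u) if abs(u) < 1 else 0.0
    estimates = []
    for t in grid:
        h = 0.0
        for ti in sorted(set(times)):
            n_risk = sum(1 for x in times if x >= ti)
            d = sum(1 for x, e in zip(times, events) if x == ti and e)
            h += kernel((t - ti) / bandwidth) * (d / n_risk) / bandwidth
        estimates.append(h)
    return estimates

hazard = smooth_hazard(times, events, grid=[2, 6, 10, 14, 18], bandwidth=4)
print([round(h, 3) for h in hazard])
```

A production estimator would also correct the bias near the left and right edges of the follow-up window, which is one reason the smooth macro suppresses estimates near the boundaries.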

SAS

Paul Allison includes macros to display estimates from parametric and semiparametric models in Survival Analysis Using SAS (2nd edition). We'll use his smooth macro, which is built to accept output from proc lifetest.

proc import file="c:\book\help.csv"
out=help dbms=dlm;
delimiter=',';
getnames=yes;
run;

data h2;
set help;
if nmiss(dayslink, linkstatus, treat) eq 0;
run;

proc lifetest data=h2 outsurv=allison;
time dayslink*linkstatus(0);
strata treat;
run;

%include "c:/ken/sasmacros/smooth.sas";
%smooth(data=allison, time=dayslink, width=25);

The proc lifetest results (not shown) indicate that group 1 is the control and group 2 is the intervention. The macro uses a simpler smoothing method than that found in R, so that the curve is bumpier and estimates near the edges are not shown.

Monday, May 31, 2010

Example 7.39: Nelson-Aalen estimate of cumulative hazard

In our previous example, we demonstrated how to calculate the Kaplan-Meier estimate of the survival function for time to event data.

A related quantity is the Nelson-Aalen estimate of cumulative hazard. In addition to summarizing the hazard incurred by a particular timepoint, this quantity has been
used in missing data models (see White and Royston, 2009).

In addition to the usual SAS and R approaches to this, we also show Stata code.


SAS

It's very straightforward to calculate the Nelson-Aalen estimate in SAS. We assume the data is in the test data set generated in example 7.38.


ods output productlimitestimates=naout;
proc lifetest data=test nelson;
time time*event(0);
run;

options formchar="|----|+|---+=|-/\<>*";
options ls=64;
proc print data=naout; var time cumhaz; run;


In the foregoing, the formchar option makes SAS print in a format that can easily be cut-and-pasted into other software, such as Blogger's text entry window, as described here.


Obs time CumHaz
1 0.0000 0
2 0.5000 .
3 1.0000 .
4 1.0000 0.1111
5 2.0000 0.1736
6 2.0000 .
7 3.0000 0.2450
8 4.0000 0.3220
9 5.0000 .
10 6.0000 0.4129
11 7.0000 .
12 8.0000 0.5240
13 9.0000 0.6490
14 10.0000 .
15 12.0000 0.8156
16 14.0000 1.0156
17 14.0000 .
18 17.0000 .
19 20.0000 1.5156
20 21.0000 .


Note that SAS shows the Nelson-Aalen estimator once per time point, and only when there is a failure at that time. We may need the estimated value at all observed time points in our data, as well as to use the estimate with our original data. This requires some additional work. First, we sort our data and the new naout data set by time, removing rows from the above printout where the cumulative hazard is missing, using the where data set option (section A.6.3). Then we merge the two data sets.
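The retain-and-merge logic is just last-observation-carried-forward applied to the cumulative hazard. A small Python sketch of the same idea, using a few hypothetical values for illustration only:

```python
# hypothetical Nelson-Aalen values, available only at failure times
na_at_failures = {1.0: 0.1111, 2.0: 0.1736, 3.0: 0.245}
# all observed times in the data, failures and censorings alike
all_times = [0.5, 1.0, 1.0, 2.0, 2.0, 3.0]

cumhaz = []
last = 0.0  # analogous to the retained cumhazlast variable in the SAS step
for t in sorted(all_times):
    if t in na_at_failures:       # a failure time: update the estimate
        last = na_at_failures[t]
    cumhaz.append(last)           # otherwise carry the last value forward

print(cumhaz)  # [0.0, 0.1111, 0.1111, 0.1736, 0.1736, 0.245]
```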


proc sort data=test; by time; run;
proc sort data=naout (where=(cumhaz ne .)); by time; run;

data t2;
  merge test (in=keepme) naout (keep=time cumhaz);
    * the keep option drops unneeded variables;
  by time;
  if keepme;  * deletes any extra rows from the naout dataset;
  retain cumhazlast 0;  * remember most recent value of this;
  if cumhaz eq . then cumhaz = cumhazlast;
    * plug in most recent value of cumhazlast, if cumhaz is missing;
  cumhazlast = cumhaz;  * the current value of cumhaz is now the most recent;
run;

proc print data=t2;
var time event cumhaz;
run;


This generates the desired result:


Obs time event CumHaz
1 0.5000 0 0
2 1.0000 1 0.1111
3 1.0000 1 0.1111
4 2.0000 1 0.1736
5 2.0000 0 0.1736
6 3.0000 1 0.2450
7 4.0000 1 0.3220
8 5.0000 0 0.3220
9 6.0000 1 0.4129
10 7.0000 0 0.4129
11 8.0000 1 0.5240
12 9.0000 1 0.6490
13 10.0000 0 0.6490
14 12.0000 1 0.8156
15 14.0000 0 1.0156
16 14.0000 1 1.0156
17 17.0000 0 1.0156
18 20.0000 1 1.5156
19 21.0000 0 1.5156


R

While the survfit() command can be used to create a table of all of the survival function and Nelson-Aalen estimates at each event time, it is slightly harder in R to associate these with the original data. Here we craft a function in R that does this housekeeping, using the fact that the Nelson-Aalen estimate is just the negative log of the survival function (after specifying the type="aalen" option).
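The estimator itself is simple: H(t) is the sum of d_i/n_i over the failure times up to t, where d_i is the number of events and n_i the number at risk at that time. A self-contained Python sketch (illustration only, using the data from example 7.38):

```python
# times and event indicators from example 7.38
times = [0.5, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 14, 17, 20, 21]
events = [0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0]

def nelson_aalen(times, events):
    # Nelson-Aalen: H(t) = sum over failure times t_i <= t of d_i / n_i
    H, cum = {}, 0.0
    for t in sorted(set(times)):
        n_risk = sum(1 for x in times if x >= t)
        d = sum(1 for x, e in zip(times, events) if x == t and e)
        cum += d / n_risk
        H[t] = cum
    return H

H = nelson_aalen(times, events)
print(round(H[1], 4), round(H[2], 4), round(H[3], 4))  # 0.1111 0.1736 0.245
```

These values match the SAS printout above (0.1111, 0.1736, 0.2450, ..., 1.5156).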

calcna = function(time, event) {
  na.fit = survfit(coxph(Surv(time, event) ~ 1), type="aalen")
  jumps = c(0, na.fit$time, max(time))
  # need to be careful at the beginning and end
  surv = c(1, na.fit$surv, na.fit$surv[length(na.fit$surv)])

  # apply appropriate transformation
  neglogsurv = -log(surv)

  # create placeholder of correct length
  naest = numeric(length(time))
  for (i in 2:length(jumps)) {
    naest[which(time >= jumps[i-1] & time <= jumps[i])] =
      neglogsurv[i-1]   # snag the appropriate value
  }
  return(naest)
}


This can be used as follows, where we use the time and event objects created in example 7.38.


newna = calcna(time, event)
cbind(time, newna)

time newna
[1,] 0.5 0.0000000
[2,] 1.0 0.1111111
[3,] 1.0 0.1111111
[4,] 2.0 0.1736111
[5,] 2.0 0.1736111
[6,] 3.0 0.2450397
[7,] 4.0 0.3219628
[8,] 5.0 0.3219628
[9,] 6.0 0.4128719
[10,] 7.0 0.4128719
[11,] 8.0 0.5239830
[12,] 9.0 0.6489830
[13,] 10.0 0.6489830
[14,] 12.0 0.8156496
[15,] 14.0 1.0156496
[16,] 14.0 1.0156496
[17,] 17.0 1.0156496
[18,] 20.0 1.5156496
[19,] 21.0 1.5156496


We also describe how to calculate the Nelson-Aalen estimate in Stata. First we create a dataset in the appropriate format. The foreign library in R (sections 1.1.5, 1.2.2) can do this directly. In SAS we would need to export to a format that Stata can read.


library(foreign)
write.dta(data.frame(time=time, event=event), "forstata.dta")


Stata

We can then read in the dataset from R:

. use forstata
(Written by R. )

. stset time, failure(event)

failure event: event != 0 & event < .
obs. time interval: (0, time]
exit on or before: failure

------------------------------------------------------------------------------
19 total obs.
0 exclusions
------------------------------------------------------------------------------
19 obs. remaining, representing
11 failures in single record/single failure data
156.5 total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 21
. sts list, cumhaz

failure _d: event
analysis time _t: time
Beg. Net Nelson-Aalen Std.
Time Total Fail Lost Cum. Haz. Error [95% Conf. Int.]
-------------------------------------------------------------------------------
.5 19 0 1 0.0000 0.0000 . .
1 18 2 0 0.1111 0.0786 0.0278 0.4443
2 16 1 1 0.1736 0.1004 0.0559 0.5393
3 14 1 0 0.2450 0.1232 0.0915 0.6565
4 13 1 0 0.3220 0.1453 0.1330 0.7795
5 12 0 1 0.3220 0.1453 0.1330 0.7795
6 11 1 0 0.4129 0.1714 0.1830 0.9313
7 10 0 1 0.4129 0.1714 0.1830 0.9313
8 9 1 0 0.5240 0.2042 0.2441 1.1248
9 8 1 0 0.6490 0.2394 0.3149 1.3375
10 7 0 1 0.6490 0.2394 0.3149 1.3375
12 6 1 0 0.8156 0.2917 0.4046 1.6442
14 5 1 1 1.0156 0.3537 0.5132 2.0099
17 3 0 1 1.0156 0.3537 0.5132 2.0099
20 2 1 0 1.5156 0.6125 0.6865 3.3463
21 1 0 1 1.5156 0.6125 0.6865 3.3463
-------------------------------------------------------------------------------

Monday, May 24, 2010

Example 7.38: Kaplan-Meier survival estimates

In example 7.30 we demonstrated how to simulate data from a Cox proportional hazards model.

In this and the next few entries, we expand upon support in R and SAS for survival (time-to-event) models. We'll start with a small, artificial dataset of 19 subjects. Each subject contributes a pair of variables: the time and an indicator of whether the time is when the event occurred (event=TRUE) or when the subject was censored (event=FALSE).

time  event
0.5   FALSE
1     TRUE
1     TRUE
2     TRUE
2     FALSE
3     TRUE
4     TRUE
5     FALSE
6     TRUE
7     FALSE
8     TRUE
9     TRUE
10    FALSE
12    TRUE
14    FALSE
14    TRUE
17    FALSE
20    TRUE
21    FALSE


Until an instant before time=1, no events were observed (only the censored observation), so the survival estimate is 1. At time=1, 2 subjects out of the 18 still at risk had the event, so the survival function S(.) at time 1 is S(1) = 16/18 = 0.8889. The next failure occurs at time=2, with 16 still at risk, so S(2) = 15/16 * 16/18 = 0.8333. Note that in addition to the event at time=2, there is a subject censored then, so the number at risk at time=3 is just 14 (and S(3) = 13/14 * 15/16 * 16/18 = 0.7738). The calculations continue until the final event is observed.
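The hand calculation above can be checked mechanically. Here is a short Python sketch of the product-limit computation (illustration only; the post's own code uses R and SAS):

```python
# the 19 observations: times and event indicators (1 = event, 0 = censored)
times = [0.5, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 14, 17, 20, 21]
events = [0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0]

def km(times, events):
    # Kaplan-Meier: at each failure time, multiply by (n_risk - d) / n_risk
    S, surv = 1.0, {}
    for t in sorted(set(times)):
        n_risk = sum(1 for x in times if x >= t)
        d = sum(1 for x, e in zip(times, events) if x == t and e)
        if d:                      # survival drops only at failure times
            S *= (n_risk - d) / n_risk
            surv[t] = S
    return surv

S = km(times, events)
print(round(S[1], 4), round(S[2], 4), round(S[3], 4))  # 0.8889 0.8333 0.7738
```

The values agree with the hand calculation and with the survfit() and proc lifetest output shown below.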


R

In R, we use the survfit() function (section 5.1.19) within the survival library to calculate the survival function across time.

library(survival)
time =  c(0.5, 1,1,2,2,3,4,5,6,7,8,9,10,12,14,14,17,20, 21)
event = c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, 
  TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, 
  TRUE, FALSE)
ds = data.frame(time, event)
fit = survfit(Surv(time, event) ~ 1, data=ds)


The returned survival object includes a number of attributes, such as the survival estimates at each timepoint, the standard error of those estimates, and the number of subjects at risk.

> names(fit)
 [1] "n"         "time"      "n.risk"    "n.event"   "n.censor"  
     "surv"      "type"      "std.err"  
 [9] "upper"     "lower"     "conf.type" "conf.int"  "call" 
> summary(fit)
Call: survfit(formula = Surv(time, event) ~ 1, data = ds)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1     18       2    0.889  0.0741       0.7549        1.000
    2     16       1    0.833  0.0878       0.6778        1.000
    3     14       1    0.774  0.0997       0.6011        0.996
    4     13       1    0.714  0.1084       0.5306        0.962
    6     11       1    0.649  0.1164       0.4570        0.923
    8      9       1    0.577  0.1238       0.3791        0.879
    9      8       1    0.505  0.1276       0.3078        0.829
   12      6       1    0.421  0.1312       0.2285        0.775
   14      5       1    0.337  0.1292       0.1587        0.714
   20      2       1    0.168  0.1354       0.0348        0.815



SAS

We can read in the artificial data using an input statement (section 1.1.8).

data ds;
input time event;
cards;
0.5   0
1     1
1     1
2     1
2     0
3     1
4     1
5     0
6     1
7     0
8     1
9     1
10    0
12    1
14    0
14    1
17    0
20    1
21    0
run;




proc lifetest data=ds;
  time time*event(0);
run;


Here we denote censoring as being values where event is equal to 0. If we had a censoring indicator coded in reverse (1 = censoring), the second line might read time time*censored(1);.

The survival function can be estimated in proc lifetest (as shown in section 5.1.19). In a break from our usual practice, we'll include all of the output generated by proc lifetest.

The LIFETEST Procedure
                   Product-Limit Survival Estimates
                                     Survival
                                     Standard     Number      Number 
    time     Survival    Failure      Error       Failed       Left  
  0.0000       1.0000           0           0        0          19   
  0.5000*           .           .           .        0          18   
  1.0000            .           .           .        1          17   
  1.0000       0.8889      0.1111      0.0741        2          16   
  2.0000       0.8333      0.1667      0.0878        3          15   
  2.0000*           .           .           .        3          14   
  3.0000       0.7738      0.2262      0.0997        4          13   
  4.0000       0.7143      0.2857      0.1084        5          12   
  5.0000*           .           .           .        5          11   
  6.0000       0.6494      0.3506      0.1164        6          10   
  7.0000*           .           .           .        6           9   
  8.0000       0.5772      0.4228      0.1238        7           8   
  9.0000       0.5051      0.4949      0.1276        8           7   
 10.0000*           .           .           .        8           6   
 12.0000       0.4209      0.5791      0.1312        9           5   
 14.0000       0.3367      0.6633      0.1292       10           4   
 14.0000*           .           .           .       10           3   
 17.0000*           .           .           .       10           2   
 20.0000       0.1684      0.8316      0.1354       11           1   
 21.0000*           .           .           .       11           0   

NOTE: The marked survival times are censored observations.

Summary Statistics for Time Variable time

             Quartile Estimates
             Point     95% Confidence Interval
Percent    Estimate      [Lower      Upper)
     75     20.0000     12.0000       .    
     50     12.0000      6.0000     20.0000
     25      4.0000      1.0000     12.0000

    Mean    Standard Error
 11.1776            1.9241

NOTE: The mean survival time and its standard error were underestimated because 
      the largest observation was censored and the estimation was restricted to 
      the largest event time.

Summary of the Number of Censored and Uncensored Values
                                 Percent
   Total  Failed    Censored    Censored
      19      11           8       42.11

Tuesday, March 30, 2010

Example 7.30: Simulate censored survival data

To simulate survival data with censoring, we need to model the hazard functions for both time to event and time to censoring.

We simulate both event and censoring times from a Weibull distribution with a shape parameter of 1 (this is equivalent to an exponential random variable). The event time has a Weibull scale parameter of 0.002 times a function of the linear predictor, while the censoring time has a Weibull scale parameter of 0.004. A shape of 1 implies a constant (exponential) baseline hazard, but this can be modified by specifying other shape parameters for the Weibull random variables.

First we'll simulate the data, then we'll fit a Cox proportional hazards regression model (section 4.3.1) to see the results.

Simulation is relatively straightforward, and is helpful in concretizing the notation often used in discussing survival data. After setting some parameters, we generate some covariate values, then simply draw an event time and a censoring time. The minimum of these is "observed" and we record whether it was the event time or the censoring time.
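As a sanity check on the design, the same scheme is easy to sketch in Python with only the standard library (an illustration, not the post's code): a shape-1 Weibull with scale b is exponential with mean b, and expovariate() takes a rate, so we pass 1/b.

```python
import math
import random

random.seed(42)

beta1, beta2 = 2, -1
lambdaT, lambdaC = 0.002, 0.004  # baseline scales for event and censoring times

n, censored = 10000, 0
for _ in range(n):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    scaleT = lambdaT * math.exp(-beta1 * x1 - beta2 * x2)
    t = random.expovariate(1 / scaleT)   # event time
    c = random.expovariate(1 / lambdaC)  # censoring time
    censored += c < t                    # censoring came first

print(censored / n)  # about 0.4 of the observations are censored
```

This reproduces the roughly 40% censoring seen in the SAS and R runs that follow.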

SAS


data simcox;
  beta1 = 2;
  beta2 = -1;
  lambdat = 0.002;  * baseline hazard;
  lambdac = 0.004;  * censoring hazard;
  do i = 1 to 10000;
    x1 = normal(0);
    x2 = normal(0);
    linpred = exp(-beta1*x1 - beta2*x2);
    t = rand("WEIBULL", 1, lambdat * linpred);  * time of event;
    c = rand("WEIBULL", 1, lambdac);  * time of censoring;
    time = min(t, c);  * which came first?;
    censored = (c lt t);
    output;
  end;
run;


The phreg procedure (section 4.3.1) will show us the effects of the censoring as well as the results of fitting the regression model. We use the ODS system to reduce the output.


ods select censoredsummary parameterestimates;
proc phreg data=simcox;
model time*censored(1) = x1 x2;
run;

The PHREG Procedure

Summary of the Number of Event and Censored Values

Percent
Total Event Censored Censored
10000 5971 4029 40.29

Analysis of Maximum Likelihood Estimates

Parameter Standard
Parameter DF Estimate Error Chi-Square Pr > ChiSq

x1 1 1.98628 0.02213 8059.0716 <.0001
x2 1 -1.01310 0.01583 4098.0277 <.0001

Analysis of Maximum Likelihood Estimates

Hazard
Parameter Ratio

x1 7.288
x2 0.363





R


n = 10000
beta1 = 2; beta2 = -1
lambdaT = .002 # baseline hazard
lambdaC = .004 # hazard of censoring

x1 = rnorm(n,0)
x2 = rnorm(n,0)
# true event time
T = rweibull(n, shape=1, scale=lambdaT*exp(-beta1*x1-beta2*x2))
C = rweibull(n, shape=1, scale=lambdaC) #censoring time
time = pmin(T,C) #observed time is min of censored and true
event = time==T # set to 1 if event is observed


Having generated the data, we assess the effects of censoring with the table() function (section 2.2.1) and load the survival library to fit the Cox model.


> table(event)
event
FALSE TRUE
4083 5917




> library(survival)
> coxph(Surv(time, event)~ x1 + x2, method="breslow")
Call:
coxph(formula = Surv(time, event) ~ x1 + x2, method = "breslow")


coef exp(coef) se(coef) z p
x1 1.98 7.236 0.0222 89.2 0
x2 -1.02 0.359 0.0160 -64.2 0

Likelihood ratio test=11369 on 2 df, p=0 n= 10000


These parameters result in data where approximately 40% of the observations are censored. The parameter estimates are similar to the true parameter values.