## Monday, May 24, 2010

### Example 7.38: Kaplan-Meier survival estimates

In example 7.30 we demonstrated how to simulate data from a Cox proportional hazards model.

In this and the next few entries, we expand upon support in R and SAS for survival (time-to-event) models. We'll start with a small, artificial dataset of 19 subjects. Each subject contributes a pair of variables: the time and an indicator of whether the time is when the event occurred (event=TRUE) or when the subject was censored (event=FALSE).

```time  event
0.5   FALSE
1     TRUE
1     TRUE
2     TRUE
2     FALSE
3     TRUE
4     TRUE
5     FALSE
6     TRUE
7     FALSE
8     TRUE
9     TRUE
10    FALSE
12    TRUE
14    FALSE
14    TRUE
17    FALSE
20    TRUE
21    FALSE```

Until an instant before time=1, no events were observed (only the censored observation), so the survival estimate is 1. At time=1, 2 subjects out of the 18 still at risk observed the event, so the survival function S(.) at time 1 is S(1) = 16/18 = 0.8889. The next failure occurs at time=2, with 16 still at risk, so S(2)=15/16 * 16/18 = 0.8333. Note that in addition to the event at time=2, there is a subject censored then, so the number at risk at time=3 is just 13 (so S(3) = 13/14 * 15*16 * 16/18 = 0.7738). The calculations continue until the final event is observed.

R

In R, we use the survfit() function (section 5.1.19) within the survival library to calculate the survival function across time.

```library(survival)
time =  c(0.5, 1,1,2,2,3,4,5,6,7,8,9,10,12,14,14,17,20, 21)
event = c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE,
TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE,
TRUE, FALSE)```
```ds = data.frame(time, event)
fit = survfit(Surv(time, event) ~ 1, data=ds)```

The returned survival object includes a number of attributes, such as the survival estimates at each timepoint, the standard error of those estimates, and the number of subjects at risk.

```> names(fit)
 "n"         "time"      "n.risk"    "n.event"   "n.censor"
"surv"      "type"      "std.err"
 "upper"     "lower"     "conf.type" "conf.int"  "call"
> summary(fit)
Call: survfit(formula = Surv(time, event) ~ 1, data = ds)

time n.risk n.event survival std.err lower 95% CI upper 95% CI
1     18       2    0.889  0.0741       0.7549        1.000
2     16       1    0.833  0.0878       0.6778        1.000
3     14       1    0.774  0.0997       0.6011        0.996
4     13       1    0.714  0.1084       0.5306        0.962
6     11       1    0.649  0.1164       0.4570        0.923
8      9       1    0.577  0.1238       0.3791        0.879
9      8       1    0.505  0.1276       0.3078        0.829
12      6       1    0.421  0.1312       0.2285        0.775
14      5       1    0.337  0.1292       0.1587        0.714
20      2       1    0.168  0.1354       0.0348        0.815```

SAS

We can read in the artificial data using an input statement (section 1.1.8).

```data ds;
input time event;
cards;
0.5   0
1     1
1     1
2     1
2     0
3     1
4     1
5     0
6     1
7     0
8     1
9     1
10    0
12    1
14    0
14    1
17    0
20    1
21    0
run;```

```proc lifetest data=ds;
time time*event(0);
run;```

Here we denote censoring as being values where event is equal to 0. If we had a censoring indicator coded in reverse (1 = censoring), the second line might read time time*censored(1);.

The survival function can be estimated in proc lifetest (as shown in section 5.1.19). In a break from our usual practice, we'll include all of the output generated by proc lifetest.

```The LIFETEST Procedure
Product-Limit Survival Estimates
Survival
Standard     Number      Number
time     Survival    Failure      Error       Failed       Left
0.0000       1.0000           0           0        0          19
0.5000*           .           .           .        0          18
1.0000            .           .           .        1          17
1.0000       0.8889      0.1111      0.0741        2          16
2.0000       0.8333      0.1667      0.0878        3          15
2.0000*           .           .           .        3          14
3.0000       0.7738      0.2262      0.0997        4          13
4.0000       0.7143      0.2857      0.1084        5          12
5.0000*           .           .           .        5          11
6.0000       0.6494      0.3506      0.1164        6          10
7.0000*           .           .           .        6           9
8.0000       0.5772      0.4228      0.1238        7           8
9.0000       0.5051      0.4949      0.1276        8           7
10.0000*           .           .           .        8           6
12.0000       0.4209      0.5791      0.1312        9           5
14.0000       0.3367      0.6633      0.1292       10           4
14.0000*           .           .           .       10           3
17.0000*           .           .           .       10           2
20.0000       0.1684      0.8316      0.1354       11           1
21.0000*           .           .           .       11           0

NOTE: The marked survival times are censored observations.

Summary Statistics for Time Variable time

Quartile Estimates
Point     95% Confidence Interval
Percent    Estimate      [Lower      Upper)
75     20.0000     12.0000       .
50     12.0000      6.0000     20.0000
25      4.0000      1.0000     12.0000

Mean    Standard Error
11.1776            1.9241

NOTE: The mean survival time and its standard error were underestimated because
the largest observation was censored and the estimation was restricted to
the largest event time.

Summary of the Number of Censored and Uncensored Values
Percent
Total  Failed    Censored    Censored
19      11           8       42.11```

Anonymous said...

awesome thanks!!

Anonymous said...

Thank you! You made this so easy to get started.

Nick Horton said...

Kudos to Giles Crane for pointing out that R can also compute quantiles, and confidence limits, of survival curves.
The quantile function has a method for survfit objects of the survival package:

quantile( survfit( fit), c(0.25, 0.50, 0.75) )

Nick

Anonymous said...

Great, much easier to understand. thanks!

Anonymous said...

Thanks for that. If you compare the n.risk at a particular time from the R output to the number left at a particular time from SAS, the two do not match. I am currently validating similar output from SAS using R, is there any way to make the R n.risk match the SAS number left, from within the survfit function in R? Thanks.

Anonymous said...

in the R example you forgot to define the data.frame "ds" before to use survfit(...)

Nick Horton said...

Fixed! Thanks for pointing this out.