Tuesday, March 30, 2010

Example 7.30: Simulate censored survival data

To simulate survival data with censoring, we need to model the hazard functions for both time to event and time to censoring.

We simulate both event times from a Weibull distribution with a scale parameter of 1 (this is equivalent to an exponential random variable). The event time has a Weibull shape parameter of 0.002 times a linear predictor, while the censoring time has a Weibull shape parameter of 0.004. A scale of 1 implies a constant (exponential) baseline hazard, but this can be modified by specifying other scale parameters for the Weibull random variables.

First we'll simulate the data, then we'll fit a Cox proportional hazards regression model (section 4.3.1) to see the results.

Simulation is relatively straightforward, and is helpful in concretizing the notation often used in discussion survival data. After setting some parameters, we generate some covariate values, then simply draw an event time and a censoring time. The minimum of these is "observed" and we record whether it was the event time or the censoring time.

SAS


data simcox;
beta1 = 2;
beta2 = -1;
lambdat = 0.002; *baseline hazard;
lambdac = 0.004; *censoring hazard;
do i = 1 to 10000;
x1 = normal(0);
x2 = normal(0);
linpred = exp(-beta1*x1 - beta2*x2);
t = rand("WEIBULL", 1, lambdaT * linpred);
* time of event;
c = rand("WEIBULL", 1, lambdaC);
* time of censoring;
time = min(t, c); * which came first?;
censored = (c lt t);
output;
end;
run;


The phreg procedure (section 4.3.1) will show us the effects of the censoring as well as the results of fitting the regression model. We use the ODS system to reduce the output.


ods select censoredsummary parameterestimates;
proc phreg data=simcox;
model time*censored(1) = x1 x2;
run;

The PHREG Procedure

Summary of the Number of Event and Censored Values

Percent
Total Event Censored Censored
10000 5971 4029 40.29

Analysis of Maximum Likelihood Estimates

Parameter Standard
Parameter DF Estimate Error Chi-Square Pr > ChiSq

x1 1 1.98628 0.02213 8059.0716 <.0001
x2 1 -1.01310 0.01583 4098.0277 <.0001

Analysis of Maximum Likelihood Estimates

Hazard
Parameter Ratio

x1 7.288
x2 0.363





R


n = 10000
beta1 = 2; beta2 = -1
lambdaT = .002 # baseline hazard
lambdaC = .004 # hazard of censoring

x1 = rnorm(n,0)
x2 = rnorm(n,0)
# true event time
T = rweibull(n, shape=1, scale=lambdaT*exp(-beta1*x1-beta2*x2))
C = rweibull(n, shape=1, scale=lambdaC) #censoring time
time = pmin(T,C) #observed time is min of censored and true
event = time==T # set to 1 if event is observed


Having generated the data, we assess the effects of censoring with the table() function (section 2.2.1) and load the survival() library to fit the Cox model.


> table(event)
event
FALSE TRUE
4083 5917




> library(survival)
> coxph(Surv(time, event)~ x1 + x2, method="breslow")
Call:
coxph(formula = Surv(time, event) ~ x1 + x2, method = "breslow")


coef exp(coef) se(coef) z p
x1 1.98 7.236 0.0222 89.2 0
x2 -1.02 0.359 0.0160 -64.2 0

Likelihood ratio test=11369 on 2 df, p=0 n= 10000


These parameters result in data where approximately 40% of the observations are censored. The parameter estimates are similar to the true parameter values.

14 comments:

Anonymous said...

proc lifereg data =simcox;
model t*censored(1) =x1 x2/dist =weibull;
run;

I find that the estiamted beta1 and beta2 are not mathced to those in simulation. i think i missed something. do you have any comments?

Ken Kleinman said...

I see two potential problems. First, you have "t" in the model statement, rather than "time." This is the uncensored version. You need either

model t =x1 x2 / dist =weibull;

or

model time*censored(1) =x1 x2 / dist =weibull;

otherwise the model won't fit very well.

The other issue is that the models are interpreted very differently. The parameter estimate from PHREG is the log relative hazard associated with a one-unit increase in the covariate, while the estimate from LIFEREG is the log relative change in time to event associated with the same one-unit increase.

This means that when we see a positive est estimate in PHREG (increased hazard of event) we should expect a _negative_ estimate in LIFEREG (sooner occurrence of event).

Anonymous said...

Cool. I got it. Maybe the following question is more basic but sort of confused to me:

linpred = exp(-beta1*x1 - beta2*x2);

I don’t see intercept and measurement error here. On the other hand, in the lifereg model, we indeed have an intercept. Can you illustrate what the estimated intercept stand for?

Thanks.

Ken Kleinman said...

The intercept ought to reflect here exactly what it does in any regression: the expected outcome when the covariates are all set to 0. There's no measurement error, but there's Weibull variability injected directly.

Szilard said...

Could you think a similar simulation scenario for Aalen Additive Hazards model?

I did try everything i could come up,but so far got nowhere.

Anonymous said...

I used the same simulation program, but I got different figures for the event, beta1 and beta2, is it normal to have different figures when the same simulation program is repeated a number of times?

Ken Kleinman said...

Very typical. Unless you set the seed for the pseudo-random number generator, you get a different stream of numbers every time, and the parameter estimates differ. This is one of the great strengths of simulation.

Wen said...

Would you please advise how to check if the estimates for scale and shape are correct? I used the code to generate a weibull distribution with shape=2 and scale=0.0008*(-beta1*x1-beta2*x2). And using the generated dataset to run a survreg model. The transformed shape and beta's are correct, but I have no idea how to check "scale"? What the scale estimate we should expect from the regression model?Is it the original lamdaT or scale=0.0008*(-beta1*x1-beta2*x2)?

Kennedy said...
This comment has been removed by the author.
Kennedy said...

Dear Ken. Your blog is helpful. In your frailty model simulation, time is generated by Weibull i.e t = rand("WEIBULL", 1, lambdaT * linpred); Would you advise on how one can simulate this time so that its roughly say every 3 months i.e baseline, month 3, month 6 etc. Kennedy

Arpan said...

How to simulate Cure Rate Models in R?

Michal Figurski said...

This is exactly what I was looking for - a simple and straightforward solution, not using some "black box" package, just plain weibull. Thank you for this post!

Anonymous said...

Why are there negative signs before the beta for linpred? I have 4 independent variables and betas, will the equation be
linpred = exp (-beta1*x1+beta2*x2+beta3*x3+beta4*x4+beta5*x5);
where beta1 is the intercept.

Nick Horton said...

The negative signs are there because the correct function is exp^{-XBeta). So you should carry the negative sign throughout your linear terms (beta2*x2, etc.)