Showing posts with label Cox proportional hazards model. Show all posts
Showing posts with label Cox proportional hazards model. Show all posts

Tuesday, September 30, 2014

Example 2014.11: Contrasts the basic way for R

As we discuss in section 6.1.4 of the second edition, R and SAS handle categorical variables and their parameterization in models quite differently. SAS treats them on a procedure-by-procedure basis, which leads to some odd differences in capabilities and default parameterizations. For example, in the logistic procedure, the default is effect cell coding, while in the genmod procedure-- which also fits logistic regression-- the default is reference cell coding. Meanwhile, many procedures can only accommodate reference cell coding.

In R, in contrast, categorical variables can be designated as "factors" and parameterization stored an attribute of the factor.

In section 6.1.4, we demonstrate how the parameterization of a factor can be easily changed on the fly, in R, in lm(),glm(), and aov, using the contrasts= option in those functions. Here we show how to set the attribute more generally, for use in functions that don't accept the option. This post was inspired by a question from Julia Kuder, of Brigham and Women's Hospital.

SAS
We begin by simulating censored survival data as in Example 7.30. We'll also export the data to use in R.
data simcox;
  beta1 = 2;
  lambdat = 0.002; *baseline hazard;
  lambdac = 0.004; *censoring hazard;
  do i = 1 to 10000;
    x1 = rantbl(0, .25, .25,.25);
    linpred = exp(-beta1*(x1 eq 4));
    t = rand("WEIBULL", 1, lambdaT * linpred);
    * time of event;
    c = rand("WEIBULL", 1, lambdaC);
           * time of censoring;
    time = min(t, c);    * which came first?;
    censored = (c lt t);
    output;
  end;
run;

proc export data=simcox replace
outfile="c:/temp/simcox.csv"
dbms=csv;
run;

Now we'll fit the data in SAS, using effect coding.
proc phreg data=simcox;
class x1 (param=effect);
model time*censored(0)= x1 ;
run;

We reproduce the rather unexciting results here for comparison with R.
                     Parameter     Standard     
 Parameter     DF     Estimate        Error 

 x1        1    1     -0.02698      0.03471
 x1        2    1     -0.01211      0.03437
 x1        3    1     -0.05940      0.03458


R
In R we read the data in, then use the C() function to assign the contr.sum contrast to a version of the x1 variable that we save as a factor. Once that is done, we can fit the proportional hazards regression with the desired contrast.
simcox<- read.csv("c:/temp/simcox.csv")
sc2 = transform(simcox, x1.eff = C(as.factor(x1), contr.sum(4)))
effmodel <- coxph(Surv(time, censored)~ x1.eff,data= sc2)
summary(effmodel)
We excerpt the relevant output to demonstrate equivalence with SAS.
            coef exp(coef) se(coef)  
x1.eff1 -0.02698   0.97339  0.03471  
x1.eff2 -0.01211   0.98797  0.03437  
x1.eff3 -0.05940   0.94233  0.03458
An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Tuesday, March 13, 2012

Example 9.23: Demonstrating proportional hazards


A colleague recently asked after a slide suitable for explaining proportional hazards. In particular, she was concerned that her audience not focus on the time to event or probability of the event. An initial thought was to display the cumulative hazards, which have a constant proportion if the model is true. But the colleague's audience might get distracted both by the language (what's a "hazard"?) and the fact that the cumulative hazard doesn't have a readily interpretable scale. The failure curve, meaning the probability of failure by time t, over time, might be a bit more accessible.

Rather than just draw some curves, we simulated data, based on the code we demonstrated previously. In this case, there's no need for any interesting censoring, but a more interesting survival curve seems worthwhile.

SAS
The more interesting curve is introduced by manually accelerating and slowing down the Weibull survival time demonstrated in the previous approach. We also trichotomized one of the exposures to match the colleague's study, and censored all values greater than 10 to keep focus where the underlying hazard was interesting.

data simcox;
beta1 = .2;
beta2 = log(1.25);
lambdat = 20; *baseline hazard;
do i = 1 to 10000;
x1 = normal(45);
x2 = (normal(0) gt -1) + (normal(0) gt 1);
linpred = -beta1*x1 - beta2*x2;
t = rand("WEIBULL", 1, lambdaT * exp(linpred));
if t gt 5 then t = 5 + (t-5)/10;
if t gt 7 then t = 7 + (t-7) * 20;
* time of event;
censored = (t > 10);
output;
end;
run;

The phreg procedure will fit the model and produces nice plots of the survival function and the cumulative hazard. But to generate useful versions of these, you need to make an additional data set with the covariate values you want to show plots for. We set the other covariate's value at 0. You can also include an id variable with some descriptive text.

data covars;
x1 = 0; x2 = 2; id = "Arm 3"; output;
x1 = 0; x2 = 1; id = "Arm 2"; output;
x1 = 0; x2 = 0; id = "Arm 1"; output;
run;

proc phreg data = simcox plot(overlay)=cumhaz;
class x2 (ref = "0");
baseline covariates = covars out= kkout cumhaz = cumhaz
survival = survival / rowid = id;
model t*censored(1) = x1 x2;
run;

The cumulative hazard plot generated by the plot option is shown below, demonstrating the correct relative hazard of 1.25. The related survival curve could be generated with plot = s .

To get the desired plot of the failure times, use the out = option to the baseline statement. This generates a data set with the listed statistics (here the cumulative hazard and the survival probability, across time). Then we can produce a plot using the gplot procedure, after generating the failure probability.

data kk2;
set kkout;
iprob = 1 - survival;
run;

goptions reset=all;
legend1 label = none value=(h=2);
axis1 order = (0 to 1 by .25) minor = none value=(h=2)
label = (a = 90 h = 3 "Probability of infection");
axis2 order = (0 to 10 by 2) minor = none value=(h=2)
label = (h=3 "Attributable time");;
symbol1 i = sm51s v = none w = 3 r = 3;
proc gplot data = kk2;
plot iprob * t = id / vaxis = axis1 haxis = axis2 legend=legend1;
run; quit;

The result is shown at the top. Note the use of the h= option in various places in the axis and legend statement to make the font more visible when shrunk to fit onto a slide. The smoothing spline plotted through the data with the smXXs interpolation makes a nice shape out of the underlying abrupt changes in the hazard. The symbol, legend, and axis statements are discussed in chapter 6.

R
As in SAS, we begin by simulating the data. Note that we use the simple categorical variable simulator mentioned in the comments for example 7.20.

n = 10000
beta1 = .2
beta2 = log(1.25)
lambdaT = 20

x1 = rnorm(n,0)
x2 = sample(0:2,n,rep=TRUE,prob=c(1/3,1/3,1/3))
# true event time
T = rweibull(n, shape=1, scale=lambdaT*exp(-beta1*x1 - beta2*x2))
T[T>5] = 5 + (T[T>5] -5)/10
T[T>7] = 7 + (T[T>7] -7) * 20
event = T < rep(10,n)

Now we can fit the model using the coxph() function from the Survival package. There is a default method for plot()ing the "survfit" objects resulting from several package functions. However, it shows the typical survival plot. To show failure probability instead, we'll manually take the complement of the survival probability, but still take advantage of the default method. Note the use of the xmax option to limit the x-axis. The results (below) are somewhat bland, and it's unclear if the lines can be colored differently, or their widths increased. They are also as angular as the cumulative hazards shown in the SAS implementation.

library(survival)
plotph = coxph(Surv(T, event)~ x1 * strata(x2),
method="breslow")
summary(plotph)
sp = survfit(plotph)
sp$surv = 1 - sp$surv
plot(sp, xmax=10)


Consequently, as is so often the case, presentation graphics require more manual fiddling. We'll begin by extracting the data we need from the "survfit" object. We'll take just the failure times and probabilities, as well as the name of the strata to which each observation belongs, limiting the data to time < 10. The last of these lines uses the names() function to pull the names of the strata, repeating each an appropriate number of times with the useful rep() function.

sp = survfit(plotph)
failtimes = sp$time[sp$time <10]
failprobs = 1 - sp$surv[sp$time <10]
failcats = c(rep(names(sp$strata),times=sp$n))[sp$time <10]

All that remains is plotting the data, which is not dissimilar to many examples in the book and in this blog. There's likely some way to make these three lines with a little less typing, but knowing how to do it from scratch gives you the most flexibility. It proved difficult to get the desired amount of smoothness from the loess(), lowess(), or supsmu() functions, but smooth.spline() served admirably. The code below demonstrates increasing the axis and tick label sizes

plot(failprobs~failtimes, type="n", ylim=c(0,1), cex.lab=2, cex.axis= 1.5,
ylab= "Probability of infection", xlab = "Attributable time")
lines(smooth.spline(y=failprobs[failcats == "x2=0"],
x=failtimes[failcats == "x2=0"],all.knots=TRUE, spar=1.8),
col = "blue", lwd = 3)
lines(smooth.spline(y=failprobs[failcats == "x2=1"],
x=failtimes[failcats == "x2=1"],all.knots=TRUE, spar=1.8),
col = "red", lwd = 3)
lines(smooth.spline(y=failprobs[failcats == "x2=2"],
x=failtimes[failcats == "x2=2"],all.knots=TRUE, spar=1.8),
col = "green", lwd = 3)
legend(x=7,y=0.4,legend=c("Arm 1", "Arm 2", "Arm 3"),
col = c("blue","red","green"), lty = c(1,1,1), lwd = c(3,3,3) )

Tuesday, September 27, 2011

Example 9.7: New stuff in SAS 9.3-- Frailty models

Shared frailty models are a way of allowing correlated observations into proportional hazards models. Briefly, instead of l_i(t) = l_0(t)e^(x_iB), we allow l_ij(t) = l_0(t)e^(x_ijB + g_i), where observations j are in clusters i, g_i is typically normal with mean 0, and g_i is uncorrelated with g_i'. The nomenclature frailty comes from examining the logs of the g_i and rewriting the model as l_ij(t) = l_0(t)u_i*e^(xB) where the u_i are now lognormal with median 0. Observations j within cluster i share the frailty u_i, and fail faster (are frailer) than average if u_i > 1.

In SAS 9.2, this model could not be fit, though it is included in the survival package in R. (Section 4.3.2) With SAS 9.3, it can now be fit. We explore here through simulation, extending the approach shown in example 7.30.

SAS
To include frailties in the model, we loop across the clusters to first generate the frailties, then insert the loop from example 7.30, which now represents the observations within cluster, adding the frailty to the survival time model. There's no need to adjust the censoring time.


data simfrail;
beta1 = 2;
beta2 = -1;
lambdat = 0.002; *baseline hazard;
lambdac = 0.004; *censoring hazard;
do i = 1 to 250; *new frailty loop;
frailty = normal(1999) * sqrt(.5);
do j = 1 to 5; *original loop;
x1 = normal(0);
x2 = normal(0);
* new model of event time, with frailty added;
linpred = exp(-beta1*x1 - beta2*x2 + frailty);
t = rand("WEIBULL", 1, lambdaT * linpred);
* time of event;
c = rand("WEIBULL", 1, lambdaC);
* time of censoring;
time = min(t, c); * which came first?;
censored = (c lt t);
output;
end;
end;
run;

For comparison's sake, we replicate the naive model assuming independence:

proc phreg data=simfrail;
model time*censored(1) = x1 x2;
run;

Parameter Standard Hazard
Parameter DF Estimate Error Chi-Square Pr > ChiSq Ratio

x1 1 1.68211 0.05859 824.1463 <.0001 5.377
x2 1 -0.88585 0.04388 407.4942 <.0001 0.412

The parameter estimates are rather biased. In contrast, here is the correct frailty model.

proc phreg data=simfrail;
class i;
model time*censored(1) = x1 x2;
random i / noclprint;
run;
Cov REML Standard
Parm Estimate Error

i 0.5329 0.07995

Parameter Standard Hazard
Parameter DF Estimate Error Chi-Square Pr > ChiSq Ratio

x1 1 2.03324 0.06965 852.2179 <.0001 7.639
x2 1 -1.00966 0.05071 396.4935 <.0001 0.364

This returns estimates gratifyingly close to the truth. The syntax of the random statement is fairly straightforward-- the noclprint option prevents printing all the values of i. The clustering variable must be specified in the class statement. The output shows the estimated variance of the g_i.

R
In our book (section 4.16.14) we show an example of fitting the uncorrelated data model, but we don't display a frailty model. Here, we use the data generated in SAS, so we omit the data simulation in R. As in SAS, it would be a trivial extension of the code presented in example 7.30. For parallelism, we show the results of ignoring the correlation, first.

> library(survival)
> with(simfrail, coxph(formula = Surv(time, 1-censored) ~ x1 + x2))

coef exp(coef) se(coef) z p
x1 1.682 5.378 0.0586 28.7 0
x2 -0.886 0.412 0.0439 -20.2 0

with identical results to above. Note that the Surv function expects an indicator of the event, vs. SAS expecting a censoring indicator.

As with SAS, the syntax for incorporating the frailty is simple.

> with(simfrail, coxph(formula = Surv(time, 1-censored) ~ x1 + x2
+ frailty(i)))

coef se(coef) se2 Chisq DF p
x1 2.02 0.0692 0.0662 850 1 0
x2 -1.00 0.0506 0.0484 393 1 0
frailty(i) 332 141 0

Variance of random effect= 0.436

Here, the results differ slightly from the SAS model, but still return parameter estimates that are quite similar. We're not familiar enough with the computational methods to diagnose the differences.

Monday, June 21, 2010

Example 7.42: Testing the proportionality assumption



In addition to the non-parametric tools discussed in recent entries, it's common to use proportional hazards regression, (section 4.3.1) also called Cox regression, in evaluating survival data.

It's important in such models to test the proportionality assumption. Below, we demonstrate doing this for a simple model from the HELP data, available at the book web site. Our results below draw on this web page, part of the excellent resources provided by the UCLA ATS site.

SAS

First, we run a proportional hazards regression to assess the effects of treatment on the time to linkage with primary care. (Data were read in and observations with missing values removed in example 7.40.) Only a portion of the results are shown.


proc phreg data=h2;
model dayslink*linkstatus(0)=treat;
output out= propcheck ressch = schres;
run;

Parameter Standard
Parameter DF Estimate Error Chi-Square Pr > ChiSq

treat 1 1.59103 0.19142 69.0837 <.0001


Being in the intervention group would appear to increase the probability of being linked to primary care. But do the model assumptions hold? The output statement above makes a new data set that contains the Schoenfeld residuals. (Schoenfeld D. Residuals for the proportional hazards regresssion model. Biometrika, 1982, 69(1):239-241.) One assessment of proportional hazards is based on these residuals, which ought to show no association with time if proportionality holds. We can plot them against the time to linkage using proc sgplot (section 5.1.1) and adding a loess curve to assess the relationship.


proc sgplot data=propcheck;
loess x = dayslink y = schres / clm;
run;


From the resulting plot, shown above, there is an indication of a possible problem. One way to assess this is to include a time-varying covariate, an interaction between the suspect predictor(s) and the event time. This can be done easily within proc phreg. If the interaction term is significant, the null hypothesis of proportionality has been rejected.


proc phreg data=h2;
model dayslink*linkstatus(0)=treat treat_time;
treat_time = treat*log(dayslink);
run;

Parameter Standard
Parameter DF Estimate Error Chi-Square Pr > ChiSq

treat 1 4.13105 0.97873 17.8153 <.0001
treat_time 1 -0.61846 0.22569 7.5093 0.0061


The proportional hazards assumption doesn't hold in this case.

R

In R, the time-varying covariate approach is harder to implement. However, the survival library includes a formal test based on the Schoenfeld residuals. (See Therneau and Grambsch, 2000, pages 127-142.) This is accessed by applying the cox.zph() function to the output of the coxph() function. The smallds object used below was created in the previous example. (Some output omitted.)


> library(survival)
> propcheck = coxph(Surv(dayslink, linkstatus) ~ treat, method="breslow")

> summary(propcheck)

coef exp(coef) se(coef) z Pr(>|z|)
treat 1.5910 4.9088 0.1914 8.312 <2e-16 ***

> cox.zph(propcheck)
rho chisq p
treat -0.233 8.47 0.00361


This test agrees with the SAS approach that the assumptions that proportionality holds can be rejected. To understand why, you can plot() the test. The transform='identity" option is used to make results more similar to those obtained with SAS.


plot(cox.zph(propcheck, transform='identity'))

Tuesday, March 30, 2010

Example 7.30: Simulate censored survival data

To simulate survival data with censoring, we need to model the hazard functions for both time to event and time to censoring.

We simulate both event times from a Weibull distribution with a scale parameter of 1 (this is equivalent to an exponential random variable). The event time has a Weibull shape parameter of 0.002 times a linear predictor, while the censoring time has a Weibull shape parameter of 0.004. A scale of 1 implies a constant (exponential) baseline hazard, but this can be modified by specifying other scale parameters for the Weibull random variables.

First we'll simulate the data, then we'll fit a Cox proportional hazards regression model (section 4.3.1) to see the results.

Simulation is relatively straightforward, and is helpful in concretizing the notation often used in discussion survival data. After setting some parameters, we generate some covariate values, then simply draw an event time and a censoring time. The minimum of these is "observed" and we record whether it was the event time or the censoring time.

SAS


data simcox;
beta1 = 2;
beta2 = -1;
lambdat = 0.002; *baseline hazard;
lambdac = 0.004; *censoring hazard;
do i = 1 to 10000;
x1 = normal(0);
x2 = normal(0);
linpred = exp(-beta1*x1 - beta2*x2);
t = rand("WEIBULL", 1, lambdaT * linpred);
* time of event;
c = rand("WEIBULL", 1, lambdaC);
* time of censoring;
time = min(t, c); * which came first?;
censored = (c lt t);
output;
end;
run;


The phreg procedure (section 4.3.1) will show us the effects of the censoring as well as the results of fitting the regression model. We use the ODS system to reduce the output.


ods select censoredsummary parameterestimates;
proc phreg data=simcox;
model time*censored(1) = x1 x2;
run;

The PHREG Procedure

Summary of the Number of Event and Censored Values

Percent
Total Event Censored Censored
10000 5971 4029 40.29

Analysis of Maximum Likelihood Estimates

Parameter Standard
Parameter DF Estimate Error Chi-Square Pr > ChiSq

x1 1 1.98628 0.02213 8059.0716 <.0001
x2 1 -1.01310 0.01583 4098.0277 <.0001

Analysis of Maximum Likelihood Estimates

Hazard
Parameter Ratio

x1 7.288
x2 0.363





R


n = 10000
beta1 = 2; beta2 = -1
lambdaT = .002 # baseline hazard
lambdaC = .004 # hazard of censoring

x1 = rnorm(n,0)
x2 = rnorm(n,0)
# true event time
T = rweibull(n, shape=1, scale=lambdaT*exp(-beta1*x1-beta2*x2))
C = rweibull(n, shape=1, scale=lambdaC) #censoring time
time = pmin(T,C) #observed time is min of censored and true
event = time==T # set to 1 if event is observed


Having generated the data, we assess the effects of censoring with the table() function (section 2.2.1) and load the survival() library to fit the Cox model.


> table(event)
event
FALSE TRUE
4083 5917




> library(survival)
> coxph(Surv(time, event)~ x1 + x2, method="breslow")
Call:
coxph(formula = Surv(time, event) ~ x1 + x2, method = "breslow")


coef exp(coef) se(coef) z p
x1 1.98 7.236 0.0222 89.2 0
x2 -1.02 0.359 0.0160 -64.2 0

Likelihood ratio test=11369 on 2 df, p=0 n= 10000


These parameters result in data where approximately 40% of the observations are censored. The parameter estimates are similar to the true parameter values.