Tuesday, September 27, 2011

Example 9.7: New stuff in SAS 9.3-- Frailty models

Shared frailty models are a way of allowing correlated observations into proportional hazards models. Briefly, instead of l_i(t) = l_0(t)e^(x_iB), we allow l_ij(t) = l_0(t)e^(x_ijB + g_i), where observations j are nested in clusters i, g_i is typically normal with mean 0, and g_i is uncorrelated with g_i' for i not equal to i'. The nomenclature "frailty" comes from exponentiating the g_i and rewriting the model as l_ij(t) = l_0(t)u_i*e^(xB), where the u_i = e^(g_i) are now lognormal with median 1. Observations j within cluster i share the frailty u_i, and fail faster (are frailer) than average if u_i > 1.
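Since u_i = e^(g_i) with g_i normal with mean 0, the u_i have median e^0 = 1 (though their mean is larger than 1). A quick base-R simulation, with an arbitrary seed, illustrates:

```r
# u = exp(g) is lognormal; with g ~ N(0, 0.5), the median of u is exp(0) = 1
set.seed(42)                            # arbitrary seed
g <- rnorm(100000, mean = 0, sd = sqrt(0.5))
u <- exp(g)
median(u)   # close to 1
mean(u)     # close to exp(0.25), about 1.28 -- the mean is not 1
```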

In SAS 9.2, this model could not be fit, though it is available in the survival package in R (book section 4.3.2). With SAS 9.3, it can be. We explore it here through simulation, extending the approach shown in example 7.30.

SAS
To include frailties in the model, we loop across the clusters, first generating the frailty, then inserting the loop from example 7.30, which now represents the observations within each cluster, and adding the frailty to the survival time model. There's no need to adjust the censoring time.


data simfrail;
  beta1 = 2;
  beta2 = -1;
  lambdat = 0.002;   * baseline hazard;
  lambdac = 0.004;   * censoring hazard;
  do i = 1 to 250;   * new frailty loop;
    frailty = normal(1999) * sqrt(.5);
    do j = 1 to 5;   * original loop;
      x1 = normal(0);
      x2 = normal(0);
      * new model of event time, with frailty added;
      linpred = exp(-beta1*x1 - beta2*x2 + frailty);
      t = rand("WEIBULL", 1, lambdat * linpred);   * time of event;
      c = rand("WEIBULL", 1, lambdac);             * time of censoring;
      time = min(t, c);                            * which came first?;
      censored = (c lt t);
      output;
    end;
  end;
run;

For comparison's sake, we replicate the naive model assuming independence:

proc phreg data=simfrail;
model time*censored(1) = x1 x2;
run;

Parameter Standard Hazard
Parameter DF Estimate Error Chi-Square Pr > ChiSq Ratio

x1 1 1.68211 0.05859 824.1463 <.0001 5.377
x2 1 -0.88585 0.04388 407.4942 <.0001 0.412

The parameter estimates are rather biased. In contrast, here is the correct frailty model.

proc phreg data=simfrail;
class i;
model time*censored(1) = x1 x2;
random i / noclprint;
run;
Cov REML Standard
Parm Estimate Error

i 0.5329 0.07995

Parameter Standard Hazard
Parameter DF Estimate Error Chi-Square Pr > ChiSq Ratio

x1 1 2.03324 0.06965 852.2179 <.0001 7.639
x2 1 -1.00966 0.05071 396.4935 <.0001 0.364

This returns estimates gratifyingly close to the truth. The syntax of the random statement is fairly straightforward-- the noclprint option prevents printing all the values of i. The clustering variable must be specified in the class statement. The output shows the estimated variance of the g_i.

R
In our book (section 4.16.14) we show an example of fitting the uncorrelated data model, but we don't display a frailty model. Here we use the data generated in SAS, so we omit the data simulation in R; as in SAS, it would be a trivial extension of the code presented in example 7.30. For parallelism, we show the results of ignoring the correlation first.

> library(survival)
> with(simfrail, coxph(formula = Surv(time, 1-censored) ~ x1 + x2))

coef exp(coef) se(coef) z p
x1 1.682 5.378 0.0586 28.7 0
x2 -0.886 0.412 0.0439 -20.2 0

with identical results to above. Note that the Surv function expects an indicator of the event, vs. SAS expecting a censoring indicator.

As with SAS, the syntax for incorporating the frailty is simple.

> with(simfrail, coxph(formula = Surv(time, 1-censored) ~ x1 + x2
+ frailty(i)))

coef se(coef) se2 Chisq DF p
x1 2.02 0.0692 0.0662 850 1 0
x2 -1.00 0.0506 0.0484 393 1 0
frailty(i) 332 141 0

Variance of random effect= 0.436

Here, the results differ slightly from the SAS model, but still return parameter estimates that are quite similar. We're not familiar enough with the computational methods to diagnose the differences fully, but note that frailty() fits a gamma frailty by default, while the random statement in proc phreg assumes a normal frailty on the log-hazard scale; frailty(i, distribution = "gaussian") would be a closer analogue.
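For readers who prefer to stay entirely in R, the SAS data step above could be mirrored along these lines. This is a sketch (with an arbitrary seed, and using rweibull() with shape 1 to match rand("WEIBULL", 1, scale)), not code from the book:

```r
# Mirror the SAS simulation: 250 clusters of 5 observations, each cluster
# sharing a normal(0, 0.5) frailty added to the log of the Weibull scale
set.seed(1999)                                  # arbitrary seed
nclust <- 250; m <- 5; n <- nclust * m
beta1 <- 2; beta2 <- -1
lambdat <- 0.002                                # baseline hazard parameter
lambdac <- 0.004                                # censoring hazard parameter
i <- rep(1:nclust, each = m)
frailty <- rep(rnorm(nclust, 0, sqrt(0.5)), each = m)
x1 <- rnorm(n); x2 <- rnorm(n)
linpred <- exp(-beta1 * x1 - beta2 * x2 + frailty)
t <- rweibull(n, shape = 1, scale = lambdat * linpred)   # event times
c <- rweibull(n, shape = 1, scale = lambdac)             # censoring times
simfrail <- data.frame(i, x1, x2,
                       time = pmin(t, c),
                       censored = as.numeric(c < t))
```

The resulting data frame can be fed to the coxph() calls shown above.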

Wednesday, September 21, 2011

Example 9.6: Model comparison plots (Completed)


We often work in settings where the data set has a lot of missing data-- some missingness in the (many) covariates, some in the main exposure of interest, and still more in the outcome. (Nick describes this as "job security for statisticians").

Some analysts are leery of imputing anything at all, preferring to rely on the assumption that the data are missing completely at random. Others will use multiple imputation for covariates, but feel they should use "real" data for the exposure and outcome. Still others will impute the exposure but not the outcome. Theory and experiments (Moons et al., JCE 2006) suggest that all missing data should be imputed. Depending on the imputation method, this can offer some protection when data are merely missing at random, a more general condition than missing completely at random.

In one analysis, we decided to use each of these approaches and demonstrate the results that would be obtained. The data are shown below. The first column denotes the approach used; the second contains, for each approach, the effect estimate followed by the lower and upper confidence limits. How can we present these results clearly? We designed a graphic that requires some customization in either SAS or R but makes the point elegantly.

1 .11
1 -.05
1 .28
2 .07
2 .21
2 -.07
3 .06
3 -.08
3 .2
4 0
4 -.13
4 .12


SAS
The SAS version is shown below. (Click on it for a larger image.) To generate it, add a final column to the data, where the effect estimate is repeated but the other values are not. Then a basic plot can be created in proc gplot with the hiloc interpolation in the symbol statement and the overlay option in the plot statement. (See book section 5.3 and other blog entries for details.) Try the code without the axis statements to see what happens.

data ke1;
  input datatype estimate meanval;
cards;
1 .11 .11
1 -.05 .
1 .28 .
2 .07 .07
2 .21 .
2 -.07 .
3 .06 .06
3 -.08 .
3 .2 .
4 0 0
4 -.13 .
4 .12 .
;
run;

symbol1 i=hiloc c=black v=none;
symbol2 i=none v=dot h=1 c=black;
axis1 minor=none order = (1 to 4 by 1)
value = (tick = 1 "Complete"
justify=c "Case" justify = c "(N = 2055)"
tick=2 "MI" justify=c "Covariates only"
justify=c "(N = 2961)"
tick=3 "MI" justify=c "Covariates and exposure"
justify=c "(N = 3994)"
tick=4 "MI" justify=c "All variables"
justify=c "(N = 6782)"
)
label = none
offset = (2 cm, 2 cm)
;
axis2 minor=none order = (-.2 to .3 by .1)
label = (angle=90 "Effect of exposure on outcome");
title "Compare missingness approaches";
proc gplot data = ke1;
plot (estimate meanval) * datatype /
overlay haxis=axis1 vref=0 vaxis=axis2;
run;
quit;

The two axis statements make the plot work. The axis1 statement uses the value option to hand-write the labels describing the data sets. Note that the justify = c causes a new line to be started. The offset option adds a little space to the left and right of the data. The axis2 statement specifies the range and label for the vertical axis. The extra symbol statement and the overlay option just plot the dots that call attention to the effect estimates-- otherwise they would show just a small crossbar at the effect.

The plot suggests that, as more observations are included and the multiple imputation gains accuracy, the effect attenuates and the standard errors decrease.

R

In R we create the equivalent plot in several steps: first we create an empty plot of the correct size, then iterate through the lines and labels. As with the SAS approach, a little manipulation of the raw data is required.

n = c(2055, 2961, 3994, 6782)
labels = c("Complete Case", "MI\ncovariates only",
"MI\ncovariates and exposure",
"MI\nall variables")
est = c(0.11, 0.07, 0.06, 0)
lower = c(-0.05, -0.07, -0.08, -0.13)
upper = c(0.28, 0.21, 0.20, 0.12)

plot(c(0.5, 4.5), c(min(lower) - .10, max(upper)), type = "n", xlab = "",
     xaxt = "n", ylab = "Effect of exposure on outcome")
title("Compare missingness approaches")
for (i in 1:length(n)) {
  points(i, est[i])
  lines(c(i, i), c(lower[i], upper[i]))
  stringval = paste(labels[i], "\n(N=", n[i], ")")
  text(i, min(lower) - .05, stringval, cex = .6)
}
abline(h = 0, lty = 2)

The resulting plot is shown at the top. As opposed to the SAS approach, more of the figure can be defined using the data. For example, the y-axis values are determined from the minimum and maximum values to plot.



Note: a draft of this entry was published accidentally. Many apologies. --Ken

Tuesday, September 13, 2011

Example 9.5: New stuff in SAS 9.3-- proc FMM




Finite mixture models (FMMs) can be used in settings where some unmeasured classification separates the observed data into groups with different exposure/outcome relationships. One familiar example of this is a zero-inflated model, where some observations come from a degenerate distribution with all mass at 0. In that case the exposure/outcome relationship is less interesting in the degenerate distribution group, but there would be considerable interest in the estimated probability of group membership. Another possibly familiar setting is the estimation of a continuous density as a mixture of normal distributions.

More generally, there could be several groups, with "concomitant" covariates predicting group membership. Each group might have different sets of predictors and outcomes from different distribution families. On the other hand, in a "homogeneous" mixture setting, all groups have the same distributional form, but with different parameter values. If the covariates in the model are the same, this setting resembles an ordinary regression model in which every observed covariate is interacted with the (unobserved) group-membership variable.

SAS

SAS 9.3 includes the "experimental" FMM procedure to fit these models. We're unsure what criteria SAS uses to decide when a procedure is experimental and when it becomes "production", but experimental procedures in SAS/STAT usually do become production eventually.

As hinted at above, the generality implied by these models is fairly vast, and the FMM procedure accommodates much of it. The most obvious limitation is that it requires independent observations.

We'll demonstrate with a simulated data set. We create a variable x that predicts both group membership and an outcome y, with different linear regression parameters depending on group. The mixing probability follows a logistic regression with intercept -1 and slope 2. The intercept and slope for the outcome are (0, 1) and (3, 1.2) for groups 0 and 1, respectively. The resulting data are plotted above using a default plot from proc glm with the actual group membership. A histogram and nearest normal density for the residuals from a simple OLS fit are shown, with R code, below-- they appear more or less normal.

data fmmtest;
  do i = 1 to 5000;
    x = normal(0);
    group = (exp(-1 + 2*x)/(1 + exp(-1 + 2*x))) gt uniform(0);
    y = (group * 3) + ((1 + group/5) * x) + normal(0) * sqrt(1);
    output;
  end;
run;

title "Fixed k = 2";
proc fmm data = fmmtest;
model y = x / k=2 equate=scale;
probmodel x;
run;

In the model statement, the option k defines how many groups (or "components") are to be included. There are also options kmin and kmax which will fit models with various numbers of components and report results of one of them based on some model criterion. The equate option allows the user to force some elements of the component distributions to be equal. Here we force the residuals to be equal, since that's the model that generated our data. The probmodel statement is how the concomitant variables enter the model. The default settings model a logistic (or generalized logit) regression using the listed covariates.

We show only the parameter estimates.

Standard
Component Effect Estimate Error z Value Pr > |z|
1 Intercept 2.9856 0.04856 61.48 <.0001
1 x 1.2060 0.03927 30.71 <.0001
2 Intercept -0.01978 0.02512 -0.79 0.4309
2 x 0.9889 0.02453 40.32 <.0001
1 Variance 1.0279 0.02531
2 Variance 1.0279 0.02531


Parameter Estimates for Mixing Probabilities
Standard
Effect Estimate Error z Value Pr > |z|
Intercept -1.0018 0.05665 -17.68 <.0001
x 1.9403 0.07413 26.17 <.0001

The recovery of the true parameters is quite good. This may be unsurprising given the sample size and the distinctness of the components, but we think it's pretty impressive.

R
The CRAN Task View on Cluster Analysis & Finite Mixture Models provides an overview of packages available for R.

First, we'll make the data and take a look at the simple model diagnostics.

> x = rnorm(5000)
> probgroup1 = exp(-1 + 2*x)/(1 + exp(-1 + 2*x))
> group = ifelse(probgroup1 > runif(5000),1,0)
> y = (group * 3) + ((1 + group/5) * x) + rnorm(5000);

> resids = residuals(lm(y~x))
> hist(resids, freq = FALSE)
> dvals = seq(from=min(resids), to=max(resids),length=100)
> lines(dvals, dnorm(dvals, mean(resids), sd(resids)))

The histogram of the residuals from the simple OLS model is shown below. They appear reasonably normal.

Ron Pearson recently showed a simple example using the mixtools package. We tried to replicate the example above using mixtools but were unable to find functionality for concomitant variables. If you fit the closest model, assuming the mixing probabilities depend on no covariates, this is what happens:

> library(mixtools)
> mixout.mt = regmixEM(y, x, k = 2, arbvar = FALSE)
> summary(mixout.mt)
summary of regmixEM object:
comp 1 comp 2
lambda 0.5275915 0.472408
sigma 1.0553877 1.055388
beta1 -0.0035015 2.237736
beta2 1.5068146 2.004008

which is pretty awful.

Fortunately, the flexmix package, while a bit more difficult to use, offers the needed flexibility. As a side note, the generality of the package is pretty awe-inspiring. Nice work!

library(flexmix)
mixout.fm=flexmix(y~x, k=2, model=FLXMRglmfix(y~x, varFix=TRUE),
concomitant=FLXPmultinom(~ x))

The flexmix function uses a variety of special objects that are created by other functions the package provides. Here we use the FLXMRglmfix function to force equal variances across the components and the FLXPmultinom function to define the logistic regression on the covariate x. The results are:

> parameters(mixout.fm)
Comp.1 Comp.2
coef.(Intercept) 3.000412 -0.008360171
coef.x 1.190454 1.047660263
sigma 0.972306 0.972306015
> parameters(mixout.fm, which = "concomitant")
1 2
(Intercept) 0 0.9605367
x 0 -1.9109429

which also reproduces reality with impressive accuracy. Note that the concomitant variable model apparently predicts membership in the second component, so the signs are reversed from the generating model.

Tuesday, September 6, 2011

Example 9.4: New stuff in SAS 9.3-- MI FCS

We begin the new academic year with a series of entries exploring new capabilities of SAS 9.3, and some functionality we haven't previously written about.

We'll begin with multiple imputation. Here, SAS has previously been limited to multivariate normal data or to monotonic missing data patterns.

SAS

SAS 9.3 adds the FCS statement to proc mi. This implements a fully conditional specification imputation method (e.g., van Buuren, S. (2007), "Multiple Imputation of Discrete and Continuous Data by Fully Conditional Specification," Statistical Methods in Medical Research, 16, 219–242.) Briefly, we begin by imputing all the missing data with a simple method. Then missing values for each variable are imputed using a model created with the real and current imputed values for the other variables, iterating across the variables several times.
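As a toy illustration of that iteration (two continuous variables, deterministic regression imputation only; a proper implementation such as proc mi's also draws parameters and residuals at random), consider this R sketch:

```r
# Toy FCS sketch: fill in starting values, then cycle through the variables,
# re-imputing each from a regression on the current version of the other.
# Deterministic for clarity; real FCS adds random draws at each step.
set.seed(1)
n <- 200
x <- rnorm(n); y <- 2 * x + rnorm(n)
x[sample(n, 40)] <- NA; y[sample(n, 40)] <- NA   # punch some holes
mx <- is.na(x); my <- is.na(y)
xi <- x; yi <- y
xi[mx] <- mean(x, na.rm = TRUE)   # simple starting values
yi[my] <- mean(y, na.rm = TRUE)
for (iter in 1:10) {              # iterate across the variables
  yi[my] <- predict(lm(yi ~ xi), data.frame(xi = xi[my]))
  xi[mx] <- predict(lm(xi ~ yi), data.frame(yi = yi[mx]))
}
```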

We replicate the multiple imputation example from the book, section 6.5. In that example, we used the mcmc statement for imputation: at the time, this was the only method available in SAS when a non-monotone missingness pattern was present. We noted then that this was not "strictly appropriate," since the mcmc method assumes multivariate normality, and two of our missing variables were dichotomous.

filename myhm url "http://www.math.smith.edu/sasr/datasets/helpmiss.csv" lrecl=704;

proc import replace datafile=myhm out=help dbms=dlm;
delimiter=',';
getnames=yes;
run;

proc mi data = help nimpute=20 out=helpmi20fcs;
class homeless female;
var i1 homeless female sexrisk indtot mcs pcs;
fcs
logistic (female)
logistic (homeless);
run;

In the fcs statement, you list the method (logistic, discrim, reg, or regpmm) to be used, naming the variable to which the method applies in parentheses following the method. (You can also specify a subset of covariates to be used in each model, using the usual SAS model-building syntax.) Omitted variables are imputed using the default reg method.

ods output parameterestimates=helpmipefcs
covb = helpmicovbfcs;
proc logistic data=helpmi20fcs descending;
by _imputation_;
model homeless=female i1 sexrisk indtot /covb;
run;

proc mianalyze parms=helpmipefcs covb=helpmicovbfcs;
modeleffects intercept female i1 sexrisk indtot;
run;

with the following primary result:
Parameter    Estimate   Std Error  95% Conf. Limits

intercept -2.492733 0.591241 -3.65157 -1.33390
female -0.245103 0.244029 -0.72339 0.23319
i1 0.023207 0.005610 0.01221 0.03420
sexrisk 0.058642 0.035803 -0.01153 0.12882
indtot 0.047971 0.015745 0.01711 0.07883
which is quite similar to our previous results. Given the small proportion of missing values, this isn't very surprising.

R
Several R packages allow imputation for a general pattern of missingness and missing outcome distribution. A brief summary of missing data tools in R can be found in the CRAN Task view on Multivariate Statistics. We'll return to this topic from the R perspective in a future entry.