SAS and R: Example 8.18: A Monte Carlo experiment

Monday, December 13, 2010

In recent weeks, we've explored methods to fit logistic regression models when a state of quasi-complete separation exists. We considered Firth's penalized likelihood approach, exact logistic regression, and Bayesian models using Markov chain Monte Carlo (MCMC).

Today we'll show how to build a Monte Carlo experiment to compare these approaches. Suppose we have 100 observations with x=0 and 100 with x=1, and suppose that Pr(Y=1|X=0) = 0.001, while Pr(Y=1|X=1) = 0.05. Thus the true odds ratio is (0.05/0.95)/(0.001/0.999) = 52.6 and the log odds ratio we want to find is 3.96. But we will rarely observe any y=1 when x=0, and whenever we observe none, x=0 predicts y=0 perfectly in the sample, which is exactly quasi-complete separation. Which of these approaches is most likely to give us acceptable results?
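The arithmetic behind this setup is easy to check in R. The following is a minimal sketch; the names odds(), or_true, and p_no_events_x0 are ours, introduced only for illustration.

```r
# Simulation settings from the text
p0 <- 0.001   # Pr(Y=1 | X=0)
p1 <- 0.05    # Pr(Y=1 | X=1)
n  <- 100     # observations per group

# Helper (illustrative name): convert a probability to odds
odds <- function(p) p / (1 - p)

or_true     <- odds(p1) / odds(p0)   # true odds ratio
log_or_true <- log(or_true)          # true log odds ratio, about 3.96

# Chance that a simulated data set has no events at all when x=0,
# i.e., that we land in the quasi-complete separation case
p_no_events_x0 <- dbinom(0, n, p0)   # about 0.905

c(or_true = or_true, log_or_true = log_or_true,
  p_no_events_x0 = p_no_events_x0)
```

So roughly nine simulated data sets in ten should show the separation problem.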

Note that in all of the MCMC analyses we use only 6000 iterations, which is likely too few to trust in practice.

The code is long enough here that we annotate within rather than write much text around the code.

SAS

All the SAS procedures used accept the events/trials syntax (section 4.1.1), so we'll generate example data sets as two observations of binomial random variates with the probabilities noted above. We also make extensive use of the ODS system to suppress all printed output (section A.7.1) and to save desired pieces of output as SAS data sets. The latter usage requires first using the ods trace on/listing statement to find the name of the output before saving it. Finally, we use the by statement (section A.6.2) to replicate the analysis for each simulated data set.


data rlog;
do trial = 1 to 100;  
    /* each "trial" is a simulated data set with two observations
       containing the observed number of events with x=0 or x=1 */ 
  x=0; events = ranbin(0,100,.001); n=100; output;
  x=1; events = ranbin(0,100,.05); n=100; output;
  end;
run;

ods select none;   /* omit _all_ printed output */

ods output parameterestimates=glm;   /* save the estimated betas */
proc logistic data = rlog;
  by trial;
  model events / n=x;       /* ordinary logistic regression */
run;

ods output parameterestimates=firth;  /* save the estimated betas */
   /* note the ODS table has the same name (parameterestimates)
      as in the uncorrected fit; we save it to a new data set */
proc logistic data = rlog;
  by trial;
  model events / n = x / firth;   /* do the firth bias correction */
run;

ods output exactparmest=exact;  
       /* the exact estimates have a different name under ODS */
proc logistic data=rlog;
  by trial;
  model  events / n = x;
  exact x / estimate;  /* do the exact estimation */
run;

data prior;
  input _type_ $ Intercept x;
datalines;
Var 25 25
Mean 0 0 
;
run;

ods output postsummaries=mcmc;
proc genmod data = rlog;
  by trial;
  model events / n = x / dist=bin;
  bayes nbi=1000 nmc=6000
    coeffprior=normal(input=prior) diagnostics=none
    statistics=summary;
       /* do the Bayes regression, using the prior made in the 
          previous data step */
run;

Now I have four data sets with parameter estimates in them. I could use them separately, but I'd like to merge them together. I can do this with the merge statement (section 1.5.7) in a data step. I also need to drop the lines with the estimated intercepts and rename the variables that hold the parameter estimates. The latter is necessary because the names are duplicated across the output data sets and desirable in that it allows names that are meaningful. In any event, I can use the where and rename data set options to include these modifications as I do the merge. I'll also add the number of events when x=0 and when x=1, which requires merging in the original data twice.


data lregsep;
merge 
  glm (where = (variable = "x") rename = (estimate = glm)) 
  firth (where = (variable = "x") rename = (estimate = firth)) 
  exact (rename = (estimate = exact))
  mcmc (where = (parameter = "x") rename = (mean=mcmc))
  rlog (where = (x = 1) rename = (events = events1))
  rlog (where = (x = 0) rename = (events = events0));
by trial;
run;

ods select all;  /* now I want to see the output! */
/* check to make sure the output dataset looks right */
proc print data = lregsep (obs = 5) ; 
var trial glm firth exact mcmc; 
run;

/* what do the estimates look like? */ 
proc means data=lregsep;
  var glm firth exact mcmc; 
run;

With the following output.


 Obs    trial         glm       firth        exact        mcmc

   1      1       12.7866      2.7803       2.3186      3.9635
   2      2       12.8287      3.1494       2.7223      4.0304
   3      3       10.7192      1.6296       0.8885      2.5613
   4      4       11.7458      2.2378       1.6906      3.3409
   5      5       10.7192      1.6296       0.8885      2.5115

            Variable            Mean         Std Dev
            ----------------------------------------
            glm           10.6971252       3.4362801
            firth          2.2666700       0.5716097
            exact          1.8237047       0.5646224
            mcmc           3.1388274       0.9620103
            ----------------------------------------

The ordinary logistic estimates are entirely implausible, while the three alternative approaches are more acceptable. The MCMC result has the least bias, but it's unclear to what degree this is a happy coincidence between the odds ratio and the prior precision. The Firth approach appears to be less biased than the exact logistic regression.
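To put rough numbers on the bias, we can subtract the true log odds ratio from the means reported above. This is a small R sketch with the means typed in by hand from the proc means output; the object names are ours.

```r
# True log odds ratio implied by the simulation settings
true_log_or <- log((0.05 / 0.95) / (0.001 / 0.999))

# Means copied by hand from the proc means output above
sas_means <- c(glm = 10.6971252, firth = 2.2666700,
               exact = 1.8237047, mcmc = 3.1388274)

# Estimated bias: mean estimate minus the truth
bias <- sas_means - true_log_or
round(bias, 2)

# Which method is closest to the truth on average?
names(which.min(abs(bias)))
```

Keep in mind that these are means over only 100 simulated data sets, so the estimated biases carry Monte Carlo error of their own.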

R
The R version is roughly analogous to the SAS version. The most notable difference is that the functions want the data in different layouts: I want the "weights" version of the data (see example 8.15) for the glm() and logistf() functions, the events/trials syntax for the elrm() function, and the expanded (one row per observation) version for the MCMClogit() function. The sapply() function (section B.5.3) plays a role similar to that of the by statement in SAS. Finally, rather than spelunking through the ods trace output to find the parameter estimates, I used the str() function (section 1.3.2) to figure out where they are stored in the output objects and indexes (rather than data set options) to pull out the one estimate I need.
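If the three data layouts are unfamiliar, the sketch below may help. It builds all three from one fixed pair of event counts (1 event when x=0 and 3 when x=1, chosen so that there is no separation) and fits each with plain glm(). Using glm() for every layout is our simplification for illustration only; the simulation itself sends each layout to a different function.

```r
events.0 <- 1; events.1 <- 3; n <- 100   # fixed counts, no separation

# 1) "weights" layout: one row per (x, y) cell, with the cell count as weight
xw <- c(0, 0, 1, 1)
yw <- c(0, 1, 0, 1)
ww <- c(n - events.0, events.0, n - events.1, events.1)
b_weights <- coef(glm(yw ~ xw, weights = ww, family = binomial))[2]

# 2) events/trials layout: one row per value of x
et <- data.frame(x = c(0, 1), events = c(events.0, events.1),
                 trials = c(n, n))
b_events <- coef(glm(cbind(events, trials - events) ~ x,
                     family = binomial, data = et))[2]

# 3) expanded layout: one row per observation
x <- rep(c(0, 1), each = n)
y <- c(rep(0, n - events.0), rep(1, events.0),
       rep(0, n - events.1), rep(1, events.1))
b_expanded <- coef(glm(y ~ x, family = binomial))[2]

# All three agree with the sample log odds ratio
c(b_weights, b_events, b_expanded, log((3 / 97) / (1 / 99)))
```

The layouts carry the same information, so the choice is dictated only by what each function accepts. The code for the actual experiment follows.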


# make sure the needed packages are present
require(logistf)
require(elrm)
require(MCMCpack)
# the runlogist() function generates a dataset and runs each analysis
# the parameter "trial" keeps track of which time we're calling runlogist()
runlogist = function(trial) {
  # the result vector will hold the estimates temporarily
  result = matrix(0,4)
    # generate the number of events once 
    events.0 =rbinom(1,100, .001)  # for x = 0
    events.1 = rbinom(1,100, .05)   # for x = 1
    # following for glm and logistf "weights" format
    xw = c(0,0,1,1)
    yw = c(0,1,0,1)
    ww = c(100 - events.0, events.0, 100 - events.1,events.1)
    # run the glm and logistf, grab the estimates, and stick 
    # them into the results vector
    result[1] = 
           glm(yw ~ xw, weights=ww, binomial)$coefficients[2]
    result[2] = logistf(yw ~ xw, weights=ww)$coefficients[2]
    # elrm() needs a data frame in the events/trials syntax
    elrmdata = data.frame(events=c(events.0,events.1), x =c(0,1), 
           trials = c(100,100))
    # run it and grab the estimate
    result[3]=elrm(events/trials ~ x, interest = ~ x, iter = 6000, 
         burnIn = 1000, data = elrmdata, r = 2)$coeffs
    # MCMClogit() needs expanded data
    x = c(rep(0,100), rep(1,100))
    y = c(rep(0,100-events.0), rep(1,events.0),
         rep(0, 100-events.1), rep(1, events.1))
    # run it and grab the mean of the MCMC posteriors
    result[4] = summary(MCMClogit(y~as.factor(x), burnin=1000,
         mcmc=6000, b0=0, B0=.04, 
         seed = list(c(781306, 78632467, 364981736, 6545634, 7654654,
                  4584),trial)))$statistics[2,1]
  # send back the four estimates, plus the number of events 
  # when x=0 and x=1
  return(c(trial, events.0, events.1, result))
}

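Note the construction of the seed= option to the MCMClogit() function. Supplying a list asks MCMCpack to use its L'Ecuyer random number generator, with the six-element vector as the seed and the trial number as the substream. This allows a different random number stream in every call without actually using sequential seeds.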

Now we're ready to call the function repeatedly. We'll do that with the sapply() function, but we need to nest that inside a t() function call to get the estimates to appear as columns rather than rows, and we'll also make it a data frame in the same command. Note that the parameters we change within the sapply() function are merely a list of trial numbers. Finally, we'll add descriptive names for the columns with the names() function (section 1.3.4).
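To see why the t() call is needed, here is a toy version in which a made-up function, toy(), stands in for runlogist() and returns three numbers per call.

```r
# Stand-in for runlogist(): returns a short numeric vector per trial
toy <- function(trial) c(trial, trial^2, trial^3)

# sapply() binds the returned vectors as COLUMNS: a 3 x 4 matrix here
raw <- sapply(1:4, toy)
dim(raw)

# Transposing puts one trial per row, which is the shape we want
res <- as.data.frame(t(raw))
names(res) <- c("trial", "square", "cube")
res
```

The real call has the same shape, just with runlogist() in place of toy() and seven columns instead of three.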


res2 = as.data.frame(t(sapply(1:10, runlogist)))
names(res2) <- c("trial","events.0","events.1", "glm", 
     "firth", "exact-ish", "MCMC")
head(res2)
# column means; in current R, mean() no longer averages data frame columns
colMeans(res2[,4:7], na.rm=TRUE)


  trial events.0 events.1       glm     firth exact-ish     MCMC
1     1        0        6 18.559624 2.6265073 2.6269087 3.643560
2     2        1        3  1.119021 0.8676031 1.1822296 1.036173
3     3        0        5 18.366720 2.4489268 2.1308186 3.555314
4     4        0        5 18.366720 2.4489268 2.0452446 3.513743
5     5        0        2 17.419339 1.6295391 0.9021854 2.629160
6     6        0        9 17.997524 3.0382577 2.1573979 4.017105

      glm     firth exact-ish      MCMC
17.333356  2.278344  1.813203  3.268243

The results are notably similar to SAS, except for the unacceptable glm() results.

In most Monte Carlo experimental settings, one would also be interested in examining the confidence limits for the parameter estimates. Notes and code for doing this can be found here. In a later entry we'll consider plots for the results generated above. As a final note, there are few combinations of event numbers with any mass worth considering. One could compute the probability of each of these and the associated parameter estimates, deriving a more analytic answer to the question. However, this would be difficult to replicate for arbitrary event probabilities and Ns, and very awkward for continuous covariates, while the above approach could be extended with trivial ease.
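For the curious, here is a rough sketch of that more analytic route for the Firth estimator alone. It leans on an assumption worth flagging: in a 2x2 table, we take the Firth-corrected log odds ratio to be what you get by adding 0.5 to every cell. That closed form reproduces the logistf() values printed above (for example 2.6265 with 0 and 6 events), but treat it as a shortcut for this illustration rather than a general recipe. The function name firth_2x2() is ours.

```r
n <- 100; p0 <- 0.001; p1 <- 0.05

# Assumed closed form for the Firth slope in a 2x2 table:
# add 0.5 to each of the four cells, then take the log odds ratio
firth_2x2 <- function(e0, e1, n) {
  log(((e1 + 0.5) / (n - e1 + 0.5)) / ((e0 + 0.5) / (n - e0 + 0.5)))
}

# Every possible pair of event counts, with its probability
grid <- expand.grid(e0 = 0:n, e1 = 0:n)
grid$prob  <- dbinom(grid$e0, n, p0) * dbinom(grid$e1, n, p1)
grid$firth <- firth_2x2(grid$e0, grid$e1, n)

# How many combinations carry any mass worth considering?
sum(grid$prob > 0.001)

# Expected value of the Firth estimate, with no simulation error
expected_firth <- sum(grid$prob * grid$firth)
expected_firth

# The most likely combinations
head(grid[order(-grid$prob), ], 5)
```

This works only because the design is a single binary covariate with fixed group sizes, which is the limitation noted above.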

3 comments:

Anonymous said...: From your text... "Suppose we have 100 observations with x=0 and 100 with x=1, and suppose that the Pr(Y=1|X=0) = 0.001, while the Pr(Y=1|X=0) = 0.05."

Should one of those read Pr(Y=1|X=1)?; December 21, 2010 at 11:13 AM
Ken Kleinman said...: Yup. The second one. Thanks! (I swear I spotted this & would have been sure I fixed it.); December 21, 2010 at 2:07 PM
Anonymous said...: Since P(Y=1|X=0) = 0.001, we have P(Y=0|X=0) = 0.999. On the other hand, P(Y=1|X=1) = 0.05, so P(Y=0|X=1) = 0.95. As a result, we would see Y=0 with a very high chance when x=0 or 1. Why does this situation imply a quasi separation problem?

Next, why log[(0.05/0.95)/(0.001/0.999)] = 4 is the theoretical coefficient of x in the model?; May 5, 2013 at 2:25 AM

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education.

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses.
