Showing posts with label multiple imputation. Show all posts

Tuesday, May 29, 2012

Example 9.33: Multiple imputation, rounding, and bias

Nick has a paper in The American Statistician warning about bias in multiple imputation arising from rounding data imputed under a normality assumption. You might run afoul of this if the data are truly dichotomous or count variables but you model them as normal (either because your software cannot model dichotomous values directly or because you prefer the theoretical soundness of multivariate normal imputation to, e.g., chained equations). In such cases, one might impute assuming normality, then round the imputed values to plausible integers. The paper shows theoretically the bias that can result from this process, and also that allowing the "implausible values" eliminates the bias. (Of course, modeling the missing variable with a logistic regression model is most appropriate here.)
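A back-of-the-envelope calculation in R (ours, not from the paper, and simplified: it treats the mean and variance as known rather than estimated) illustrates the mechanism:

```r
# If imputations are drawn from N(p, p(1-p)) and then rounded at .5,
# the chance a rounded imputation equals 1 is P(draw > .5), not p
p = .25
sdev = sqrt(p * (1 - p))
pnorm(.5, mean = p, sd = sdev, lower.tail = FALSE)  # about 0.28, not .25
```

The rounded imputations overstate the proportion of 1's, which is the bias the paper describes.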

In another paper (TAS 2001), Nick and Stuart Lipsitz comment that the method of predictive mean matching (PMM) "ensures that imputed values are plausible, and may be more appropriate if the normality assumption is violated." Briefly, PMM fits a model to the observed values and generates predicted values for both the observed and the missing cases. The imputation for a subject with a missing value is the observed value of the subject with the nearest predicted value (or a random draw from among the observed values of the subjects with the nearest predicted values).
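The mechanics can be sketched in a few lines of R. This is a toy version of our own for one incomplete variable y with a covariate x; real implementations typically draw at random from several nearest donors rather than always taking the single closest.

```r
# Toy PMM: impute missing y's by borrowing observed y's with the
# nearest predicted values
set.seed(42)
n = 200
x = rnorm(n)
y = rbinom(n, 1, .25)
miss = runif(n) < .5
yobs = ifelse(miss, NA, y)

fit = lm(yobs ~ x)                      # model fit on the observed cases
pred = predict(fit, data.frame(x = x))  # predicted values for everyone

imputed = yobs
for (i in which(miss)) {
  donor = which.min(abs(pred[!miss] - pred[i]))  # nearest observed prediction
  imputed[i] = yobs[!miss][donor]
}
all(imputed %in% c(0, 1))   # TRUE: every imputation is an observed value
```

Because each imputation is copied from an observed case, only plausible values can appear.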

How does this play out in practice? Can the PMM method overcome the theoretical rounding bias while still generating only plausible imputed values?

SAS
We begin by simulating dichotomous data, choosing the value of p (the probability of 1) = .25, a value with a large absolute bias according to the paper. We set values to missing with probability 0.5, using an MCAR mechanism. Then we use proc mi (section 6.5, example 9.4) to impute the missing values, assuming normality. The mean and standard error of the mean of y are calculated in proc summary (section 2.1.1) and combined in proc mianalyze. Then the values are rounded manually and the analysis repeated. Next, we impute separately with PMM. Finally, we impute again with a logistic imputation. We use 5 imputations throughout, though 50 would likely be preferable.

Note that a Poisson regression imputation is not yet available in proc mi, so the exercise is not wholly academic--if you needed to impute count values, you'd have to choose among implausible values, rounding, and PMM. Also note our use of the fcs imputation method, though it is not needed here given the obviously monotone missingness pattern. Finally, note that proc mi requires at least two variables here, for no reason we know of; we generate a normally distributed, uncorrelated covariate.

data testpmm;
  do i = 1 to 5000;
    x = normal(0);
    y = rand('BINOMIAL', .25, 1);
    missprob = ranuni(0);
    if missprob le .5 then y = .;
    output;
  end;
run;

title "Normal imputation";
proc mi data=testpmm out=normal nimpute=5;
var x y;
fcs reg;
run;

title2 "Implausible values";
proc summary data = normal mean stderr;
by _imputation_;
var y;
output out=outnormal mean=meany stderr=stderry;
run;

proc mianalyze data = outnormal;
modeleffects meany;
stderr stderry;
run;


title2 "Rounded";
/* make the rounded data */
data normalrnd;
set normal;
if y lt .5 then y=0;
else y=1;
run;

proc summary data = normalrnd mean stderr;
by _imputation_;
var y;
output out=outnormalrnd mean=meany stderr=stderry;
run;

proc mianalyze data = outnormalrnd;
modeleffects meany;
stderr stderry;
run;

title "regpmm imputation";
proc mi data=testpmm out=pmm nimpute=5;
var x y;
fcs regpmm;
run;
...

title "logistic imputation";
proc mi data=testpmm out=logistic nimpute=5;
class y;
var x y;
fcs logistic;
run;
...

We omit the summary and mianalyze procedures for the latter imputations. Ordinarily, it would be easiest to do this kind of repetitive task with a macro, but we leave it in open code here for legibility.
The results are shown below.

Normal imputation-- Implausible values

Parameter Estimate Std Error 95% Confidence Limits
meany 0.249105 0.008634 0.230849 0.267362

Normal imputation-- Rounded
meany 0.265280 0.006408 0.252710 0.277850

regpmm imputation
meany 0.246320 0.006642 0.233204 0.259436

logistic imputation
meany 0.255120 0.008428 0.237449 0.272791

As theory suggests, rounding the normally imputed values leads to bias, while using the normal imputations does not (though it produces implausible values). Neither PMM imputation nor direct logistic imputation appears to be biased.

R
We will use the mice package written by Stef van Buuren, one of the key developers of chained imputation. Stef also has a new book describing the package and demonstrating its use in many applied examples. We use 5 imputations throughout, though 50 would likely be preferable.

We begin by creating the data. Note that mice(), like proc mi, requires at least two columns of data. For the logistic regression imputation, mice() wants the incomplete variable to be a factor, so we also make a copy of the data as a data frame object.

library(mice)
n = 5000 # number of observations
m = 5 # number of imputations (should be 25-50 in practice)
x = rnorm(n)
y = rbinom(n, 1, .25) # a p with large absolute rounding bias, per the TAS paper
unif = runif(n)
y[unif < .5] = NA # make half of the Y's be missing
ds = cbind(y, x)
ds2 = data.frame(factor(y), x)

The mice package works analogously to proc mi/proc mianalyze. The mice() function performs the imputation, while the pool() function summarizes the results across the completed data sets. The method option to mice() specifies an imputation method for each column in the input object. Here we fit the simplest linear regression model (intercept only).

# normal model with implausible values
impnorm = mice(ds, method="norm", m=m)
summary(pool(with(impnorm, lm(y ~ 1))))

Rounding could be done by tampering with the mids-class object that mice() produces, but the post= option provides a more direct route. It accepts text strings containing R commands to be applied to the imputed values. Here we use the ifelse() function to set the normally imputed values to 0 or 1. The code for the predictive mean matching and logistic regression imputations follows.

impnormround = mice(ds, method="norm", m=m,
post= c("imp[[j]][,i] = ifelse(imp[[j]][,i] < .5, 0, 1)",""))

imppmm = mice(ds, method="pmm", m=m)

implog = mice(ds2, method="logreg", m=m)

The results of the summary(pool()) calls are shown below.

> summary(pool(with(impnorm, lm(y ~ 1))))
est se lo 95 hi 95
(Intercept) 0.272912 0.007008458 0.2589915 0.2868325
> summary(pool(with(impnormround, lm(y ~ 1))))
est se lo 95 hi 95
(Intercept) 0.28544 0.00854905 0.2676263 0.3032537
> summary(pool(with(imppmm, lm(y ~ 1))))
est se lo 95 hi 95
(Intercept) 0.277636 0.03180604 0.2145564 0.3407156
> summary(pool(with(implog, lm(y ~ 1))))
est se lo 95 hi 95
(Intercept) 0.2652899 0.00879988 0.2480342 0.2825457


The message on bias is similar, though there is some hint of trouble for the PMM method: its standard error is several times larger than the others', and the CI seems shifted towards 0.5. The default of 3 donors may be too few (this can be tweaked with the donors = NUMBER option).

Wednesday, September 21, 2011

Example 9.6: Model comparison plots (Completed)


We often work in settings where the data set has a lot of missing data-- some missingness in the (many) covariates, some in the main exposure of interest, and still more in the outcome. (Nick describes this as "job security for statisticians").

Some analysts are leery of imputing anything at all, preferring to rely on the assumption that the data are missing completely at random. Others will use multiple imputation for covariates but feel they should use "real" data for the exposure and outcome. Still others will impute the exposure but not the outcome. Theory and experiments (Moons et al., JCE 2006) suggest that all missing data should be imputed. Depending on the imputation method, this may offer some protection when data are missing at random, a more general condition than missing completely at random.

In one analysis, we decided to use each of these approaches and demonstrate the results that would be obtained. The data are shown below: the first column denotes the approach used, and the second contains the effect estimate and the confidence limits for the effect. How can we present these results clearly? We designed a graphic that requires some customization in either SAS or R but makes the point elegantly.

1 .11
1 -.05
1 .28
2 .07
2 .21
2 -.07
3 .06
3 -.08
3 .2
4 0
4 -.13
4 .12


SAS
The SAS version is shown below. (Click on it for a larger image.) To generate it, add a final column to the data, in which the effect estimate is repeated but the confidence limits are not. Then a basic plot can be created in proc gplot using the hiloc interpolation in the symbol statement and the overlay option in the plot statement. (See book section 5.3 and other blog entries for details.) Try the code without the axis statements to see what happens.

data ke1;
input datatype estimate meanval;
cards;
1 .11 .11
1 -.05 .
1 .28 .
2 .07 .07
2 .21 .
2 -.07 .
3 .06 .06
3 -.08 .
3 .2 .
4 0 0
4 -.13 .
4 .12 .
;
run;

symbol1 i=hiloc c=black v=none;
symbol2 i=none v=dot h=1 c=black;
axis1 minor=none order = (1 to 4 by 1)
value = (tick = 1 "Complete"
justify=c "Case" justify = c "(N = 2055)"
tick=2 "MI" justify=c "Covariates only"
justify=c "(N = 2961)"
tick=3 "MI" justify=c "Covariates and exposure"
justify=c "(N = 3994)"
tick=4 "MI" justify=c "All variables"
justify=c "(N = 6782)"
)
label = none
offset = (2 cm, 2 cm)
;
axis2 minor=none order = (-.2 to .3 by .1)
label = (angle=90 "Effect of exposure on outcome");
title "Compare missingness approaches";
proc gplot data = ke1;
plot (estimate meanval) * datatype /
overlay haxis=axis1 vref=0 vaxis=axis2;
run;
quit;

The two axis statements make the plot work. The axis1 statement uses the value option to hand-write the labels describing the data sets; note that justify = c starts a new line. The offset option adds a little space to the left and right of the data. The axis2 statement specifies the range and label for the vertical axis. The extra symbol statement and the overlay option plot the dots that call attention to the effect estimates--without them, each estimate would show only a small crossbar.

The plot suggests that as more observations are included and the multiple imputation gains accuracy, the effect attenuates and the standard errors decrease.

R

In R we create the equivalent plot in several steps, first creating an empty plot of the correct size, then iterating through the lines for each group. As with the SAS approach, a little manipulation of the raw data is required.

n = c(2055, 2961, 3994, 6782)
labels = c("Complete Case", "MI\ncovariates only",
"MI\ncovariates and exposure",
"MI\nall variables")
est = c(0.11, 0.07, 0.06, 0)
lower = c(-0.05, -0.07, -0.08, -0.13)
upper = c(0.28, 0.21, 0.20, 0.12)

plot(c(0.5, 4.5), c(min(lower)-.10, max(upper)), type="n", xlab="",
xaxt="n", ylab="Effect of exposure on outcome")
title("Compare missingness approaches")
for (i in 1:length(n)) {
points(i, est[i])
lines(c(i,i), c(lower[i], upper[i]))
stringval = paste(labels[i],"\n(N=",n[i],")")
text(i, min(lower) - .05, stringval, cex=.6)
}
abline(h=0, lty=2)

The resulting plot is shown at the top. In contrast to the SAS approach, more of the figure can be defined from the data; for example, the y-axis limits are determined from the minimum and maximum values to be plotted.



Note: a draft of this entry was published accidentally. Many apologies. --Ken

Tuesday, September 6, 2011

Example 9.4: New stuff in SAS 9.3-- MI FCS

We begin the new academic year with a series of entries exploring new capabilities of SAS 9.3, and some functionality we haven't previously written about.

We'll begin with multiple imputation. Previously, SAS was limited to multivariate normal data or to monotone missing data patterns.

SAS

SAS 9.3 adds the FCS statement to proc mi. This implements a fully conditional specification imputation method (e.g., van Buuren, S. (2007), "Multiple Imputation of Discrete and Continuous Data by Fully Conditional Specification," Statistical Methods in Medical Research, 16, 219–242.) Briefly, we begin by imputing all the missing data with a simple method. Then missing values for each variable are imputed using a model created with the real and current imputed values for the other variables, iterating across the variables several times.
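The cycling can be sketched in a few lines of R. This is a toy version of our own with two continuous variables; genuine FCS draws each imputation from the model's posterior predictive distribution rather than plugging in predicted means.

```r
# Toy FCS: fill in crudely, then cycle, re-imputing each variable
# from a model built on the others' current values
set.seed(1)
n = 200
x = rnorm(n); y = x + rnorm(n)
xm = ifelse(runif(n) < .2, NA, x)   # impose MCAR missingness on both
ym = ifelse(runif(n) < .2, NA, y)

# step 0: crude starting fill using observed means
xi = ifelse(is.na(xm), mean(xm, na.rm = TRUE), xm)
yi = ifelse(is.na(ym), mean(ym, na.rm = TRUE), ym)

# cycle: re-fit each model on the originally observed cases and
# refresh that variable's imputations from the others' current values
for (iter in 1:10) {
  fx = lm(xi ~ yi, subset = !is.na(xm))
  xi[is.na(xm)] = predict(fx, data.frame(yi = yi[is.na(xm)]))
  fy = lm(yi ~ xi, subset = !is.na(ym))
  yi[is.na(ym)] = predict(fy, data.frame(xi = xi[is.na(ym)]))
}
```

After a few cycles the imputations stabilize; repeating the whole process with different random draws yields the multiple imputed data sets.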

We replicate the multiple imputation example from the book, section 6.5. In that example, we used the mcmc statement for imputation: at the time, this was the only method available in SAS when a non-monotone missingness pattern was present. We noted then that this was not "strictly appropriate," since the mcmc method assumes multivariate normality and two of our missing variables were dichotomous.

filename myhm url "http://www.math.smith.edu/sasr/datasets/helpmiss.csv" lrecl=704;

proc import replace datafile=myhm out=help dbms=dlm;
delimiter=',';
getnames=yes;
run;

proc mi data = help nimpute=20 out=helpmi20fcs;
class homeless female;
var i1 homeless female sexrisk indtot mcs pcs;
fcs
logistic (female)
logistic (homeless);
run;

In the fcs statement, you list the method (logistic, discrim, reg, or regpmm) to be used, naming the variable to which it applies in parentheses after the method. (You can also specify a subset of covariates to be used, with the usual SAS model-building syntax.) Omitted variables are imputed using the default reg method.

ods output parameterestimates=helpmipefcs
covb = helpmicovbfcs;
proc logistic data=helpmi20fcs descending;
by _imputation_;
model homeless=female i1 sexrisk indtot /covb;
run;

proc mianalyze parms=helpmipefcs covb=helpmicovbfcs;
modeleffects intercept female i1 sexrisk indtot;
run;

with the following primary result:
Parameter    Estimate   Std Error  95% Conf. Limits

intercept -2.492733 0.591241 -3.65157 -1.33390
female -0.245103 0.244029 -0.72339 0.23319
i1 0.023207 0.005610 0.01221 0.03420
sexrisk 0.058642 0.035803 -0.01153 0.12882
indtot 0.047971 0.015745 0.01711 0.07883

which is quite similar to our previous results. Given the small proportion of missing values, this isn't very surprising.

R
Several R packages allow imputation for a general pattern of missingness and missing outcome distribution. A brief summary of missing data tools in R can be found in the CRAN Task view on Multivariate Statistics. We'll return to this topic from the R perspective in a future entry.