SAS and R: MASS library

Tuesday, November 15, 2011

Example 9.14: confidence intervals for logistic regression models

Recently a student asked about the difference between confint() and confint.default() functions, both available in the MASS library to calculate confidence intervals from logistic regression models. The following example demonstrates that they yield different results.

R


ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
library(MASS)
glmmod = glm(homeless ~ age + female, binomial, data=ds)

> summary(glmmod)
Call:
glm(formula = homeless ~ age + female, family = binomial, data = ds)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3600  -1.1231  -0.9185   1.2020   1.5466  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -0.89262    0.45366  -1.968   0.0491 *
age          0.02386    0.01242   1.921   0.0548 .
female      -0.49198    0.22822  -2.156   0.0311 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 625.28  on 452  degrees of freedom
Residual deviance: 617.19  on 450  degrees of freedom
AIC: 623.19

Number of Fisher Scoring iterations: 4

> exp(confint(glmmod))
Waiting for profiling to be done...
                2.5 %    97.5 %
(Intercept) 0.1669932 0.9920023
age         0.9996431 1.0496390
female      0.3885283 0.9522567
> library(MASS)
> exp(confint.default(glmmod))
                2.5 %    97.5 %
(Intercept) 0.1683396 0.9965331
age         0.9995114 1.0493877
female      0.3909104 0.9563045

Why are they different? Which one is correct?

SAS

Fortunately the detailed documentation in SAS can help resolve this. The logistic procedure (section 4.1.1) offers the clodds option to the model statement. Setting this option to both produces two sets of CL, based on the Wald test and on the profile-likelihood approach. (Venzon, D. J. and Moolgavkar, S. H. (1988), “A Method for Computing Profile-Likelihood Based Confidence Intervals,” Applied Statistics, 37, 87–94.)


ods output cloddswald = waldcl cloddspl = plcl;
proc logistic data = "c:\book\help.sas7bdat"  plots=none;
class female (param=ref ref='0');
model homeless(event='1') = age female / clodds = both;
run;

 Odds Ratio Estimates and Profile-Likelihood Confidence Intervals

 Effect                Unit     Estimate     95% Confidence Limits

 AGE                 1.0000        1.024        1.000        1.050
 FEMALE 1 vs 0       1.0000        0.611        0.389        0.952


        Odds Ratio Estimates and Wald Confidence Intervals

 Effect                Unit     Estimate     95% Confidence Limits

 AGE                 1.0000        1.024        1.000        1.049
 FEMALE 1 vs 0       1.0000        0.611        0.391        0.956

Unfortunately, the default precision of the printout isn't quite sufficient to identify whether this distinction aligns with the differences seen in the two R methods. We get around this by using the ODS system to save the output as data sets (section A.7.1). Then we can print the data sets, removing the default rounding formats to find all of the available precision.


title "Wald CL";
proc print data=waldcl; format _all_; run;
title "PL CL";
proc print data=plcl; format _all_; run;

                              Wald CL  
                                     Odds
   Obs    Effect           Unit    RatioEst    LowerCL    UpperCL

    1     AGE                1      1.02415    0.99951    1.04939
    2     FEMALE 1 vs 0      1      0.61143    0.39092    0.95633
 
  
                               PL CL                            
                                     Odds
   Obs    Effect           Unit    RatioEst    LowerCL    UpperCL

    1     AGE                1      1.02415    0.99964    1.04964
    2     FEMALE 1 vs 0      1      0.61143    0.38853    0.95226

With this added precision, we can see that the confint.default() function in the MASS library generates the Wald confidence limits, while the confint() function produces the profile-likelihood limits. This also explains the confint() comment "Waiting for profiling to be done..." Thus neither CI from the MASS library is incorrect, though the profile-likelihood method is thought to be superior, especially for small sample sizes. Little practical difference is seen here.

Tuesday, July 5, 2011

Example 9.1: Scatterplots with binning for large datasets

Scatterplots can get very hard to interpret when displaying large datasets, as points inevitably overplot and can't be individually discerned. A number of approaches have been crafted to help with this problem. One approach uses binning. This approach is also sometimes called a heat map, and can be though of as a two-dimensional histogram, where shades of the bins take the place of the heights of the bars. Any regular tesselation of the plane can be used, but there is some attraction to using hexagons. Why? In the vignettes for the hexbin package author Nicholas Lewin-Koh notes:

There are many reasons for using hexagons, at least over squares. Hexagons have symmetry of nearest neighbors which is lacking in square bins. Hexagons are the maximum number of sides a polygon can have for a regular tesselation of the plane, so in terms of packing a hexagon is 13% more efficient for covering the plane than squares. This property translates into better sampling efficiency at least for elliptical shapes. Lastly hexagons are visually less biased for displaying densities than other regular tesselations.

On the other hand, it's unclear whether these advantages are relevant here or whether they outweigh the simplicity of the square and the constant x and y values accompanying it.

In this entry, we demonstrate the use of a binned scatterplot for data from a sample of 10,000 generated bivariate normal random variables (section 1.10.6).

R

In R, we use the hexbin package to generate our plot, after generating our bivariate normals with correlation approximately 0.52.


library(MASS)
library(hexbin)
mu = c(1, -1)
Sigma = matrix(c(3, 2,
                 2, 5), nrow=2)
xvals = mvrnorm(10000, mu, Sigma)
Sigma[1,2]/sqrt(Sigma[1,1]*Sigma[2,2])    # correlation
plot(hexbin(xvals[,1], xvals[,2]), xlab="X1", ylab="X2")

SAS
We're not aware of a SAS procedure to generate a binned scatterplot or of previously existing macros to do it. Ken wrote a relatively simple macro to do it, which can be found here. The macro uses proc gmap, and we hope that someone will develop an approach using proc template and proc sgrender, as demonstrated in an example from SAS Institute.

After running the macro, the following code generates the image shown below.


data Sigma (type=cov);
infile cards;
input _type_ $ _Name_ $ x1 x2;
cards;
cov x1 3 2
cov x2 2 5
;
run;

proc simnormal data=Sigma out=mvnorms numreal = 10000;
  var x1 x2;
run;

%twodhist(data=mvnorms,x=x1,y=x2,nbinsx=30,nbinsy=30,nshades=9);

We note that the default number of shades shown in R, and the number chosen here for SAS, seem to exceed the eye's ability to differentiate, especially for the darker shades.

Update

An anonymous commenter reported that the SAS code bombed when run. I (Ken) added a new version of the code at the link listed above. I note it here only to emphasize that in either SAS or R, settings or objects in the environment can affect the performance of code. If your plan to share code, an item to add to your checklist is to run the code in a fresh session.

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

SAS and R

Catalogs of posts

Tuesday, November 15, 2011

Example 9.14: confidence intervals for logistic regression models

Tuesday, July 5, 2011

Example 9.1: Scatterplots with binning for large datasets

About SAS and R

Topics discussed