
Tuesday, November 15, 2011

Example 9.14: confidence intervals for logistic regression models

Recently a student asked about the difference between the confint() and confint.default() functions when applied to logistic regression models. For glm fits, confint() dispatches to the profile-likelihood method provided by the MASS package, while confint.default() is part of base R. The following example demonstrates that they yield different results.

R

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
library(MASS)
glmmod = glm(homeless ~ age + female, binomial, data=ds)

> summary(glmmod)
Call:
glm(formula = homeless ~ age + female, family = binomial, data = ds)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.3600 -1.1231 -0.9185 1.2020 1.5466

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.89262 0.45366 -1.968 0.0491 *
age 0.02386 0.01242 1.921 0.0548 .
female -0.49198 0.22822 -2.156 0.0311 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 625.28 on 452 degrees of freedom
Residual deviance: 617.19 on 450 degrees of freedom
AIC: 623.19

Number of Fisher Scoring iterations: 4

> exp(confint(glmmod))
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.1669932 0.9920023
age 0.9996431 1.0496390
female 0.3885283 0.9522567
> exp(confint.default(glmmod))
2.5 % 97.5 %
(Intercept) 0.1683396 0.9965331
age 0.9995114 1.0493877
female 0.3909104 0.9563045

Why are they different? Which one is correct?
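Before turning to the documentation, the distinction is easiest to see in a model with a single parameter, where the profile likelihood is just the likelihood itself. The sketch below (Python purely for illustration; the binomial data are hypothetical, not from the HELP study) compares the Wald interval, which is symmetric on the logit scale, with the likelihood-based interval, which need not be:

```python
import math

# Hypothetical binomial data: y successes in n trials.
y, n = 5, 100
phat = y / n

def loglik(p):
    return y * math.log(p) + (n - y) * math.log(1 - p)

# Wald interval: estimate +/- 1.96 SE on the logit scale, then back-transform.
logit = math.log(phat / (1 - phat))
se = math.sqrt(1 / (n * phat * (1 - phat)))
wald = tuple(1 / (1 + math.exp(-(logit + s * 1.96 * se))) for s in (-1, 1))

# Likelihood-based interval: all p whose log-likelihood is within
# 3.84/2 (the 95% chi-square cutoff, 1 df) of the maximum.
cut = loglik(phat) - 3.84 / 2
inside = [p / 10000 for p in range(1, 10000) if loglik(p / 10000) >= cut]
lik = (min(inside), max(inside))

print("Wald:       (%.4f, %.4f)" % wald)
print("Likelihood: (%.4f, %.4f)" % lik)
```

When the log-likelihood is nearly quadratic around its maximum the two intervals agree; with rare events it is skewed, and the likelihood-based interval shifts accordingly.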

SAS

Fortunately, the detailed SAS documentation can help resolve this. The logistic procedure (section 4.1.1) offers the clodds option to the model statement. Setting this option to both produces two sets of confidence limits: one based on the Wald test and one based on the profile-likelihood approach (Venzon, D. J. and Moolgavkar, S. H. (1988), "A Method for Computing Profile-Likelihood Based Confidence Intervals," Applied Statistics, 37, 87-94).

ods output cloddswald = waldcl cloddspl = plcl;
proc logistic data = "c:\book\help.sas7bdat" plots=none;
class female (param=ref ref='0');
model homeless(event='1') = age female / clodds = both;
run;

Odds Ratio Estimates and Profile-Likelihood Confidence Intervals

Effect Unit Estimate 95% Confidence Limits

AGE 1.0000 1.024 1.000 1.050
FEMALE 1 vs 0 1.0000 0.611 0.389 0.952


Odds Ratio Estimates and Wald Confidence Intervals

Effect Unit Estimate 95% Confidence Limits

AGE 1.0000 1.024 1.000 1.049
FEMALE 1 vs 0 1.0000 0.611 0.391 0.956



Unfortunately, the default precision of the printout isn't quite sufficient to tell whether this distinction accounts for the differences seen between the two R methods. We get around this by using the ODS system to save the output as data sets (section A.7.1). Then we can print the data sets, removing the default rounding formats, to see all of the available precision.

title "Wald CL";
proc print data=waldcl; format _all_; run;
title "PL CL";
proc print data=plcl; format _all_; run;

Wald CL
Odds
Obs Effect Unit RatioEst LowerCL UpperCL

1 AGE 1 1.02415 0.99951 1.04939
2 FEMALE 1 vs 0 1 0.61143 0.39092 0.95633


PL CL
Odds
Obs Effect Unit RatioEst LowerCL UpperCL

1 AGE 1 1.02415 0.99964 1.04964
2 FEMALE 1 vs 0 1 0.61143 0.38853 0.95226

With this added precision, we can see that the confint.default() function generates the Wald confidence limits, while the confint() function produces the profile-likelihood limits. This also explains the confint() message "Waiting for profiling to be done..." Thus neither CI is incorrect, though the profile-likelihood method is generally thought to be superior, especially for small sample sizes. Little practical difference is seen here.
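As a check, the Wald limits can be reproduced by hand: exponentiate the estimate plus or minus 1.96 standard errors. Using the printed coefficient and standard error for female from summary(glmmod) (Python here just for the arithmetic):

```python
import math

est, se = -0.49198, 0.22822  # coefficient and SE for female, from summary(glmmod)
z = 1.959964                 # 97.5th percentile of the standard normal

# Wald limits on the odds-ratio scale
lo, hi = math.exp(est - z * se), math.exp(est + z * se)
print(round(lo, 4), round(hi, 4))  # 0.3909 0.9563, matching exp(confint.default(glmmod))
```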

Tuesday, November 30, 2010

Example 8.16: Exact logistic regression

In example 8.15, on Firth logistic regression, we mentioned alternative approaches to separation troubles. Here we demonstrate exact logistic regression. The code for this appears in the book (section 4.1.2), but we don't show an example of it there. We'll consider a setting with 100 subjects at each of x=0 and x=1, with no events observed when x=0 and 5 events when x=1.

SAS
We'll create the data as summary counts, rather than one line per observation. Then we can use the "events/trials" syntax (section 4.1.1) that both proc logistic and proc genmod accept. This is another way to reduce the size of data sets (along with the weight option mentioned previously), but is less generally useful. The exact statement in proc logistic will fit the exact logistic regression and generate a p-value; the estimate option is required to display the estimated log odds ratio.

data exact;
x=0; count=0; n=100; output;
x=1; count=5; n=100; output;
run;

proc logistic data=exact;
model count/n = x;
exact x / estimate;
run;

This generates the following output:

Exact Parameter Estimates

Standard 95% Confidence
Parameter Estimate Error Limits p-Value

x 1.9414* . -0.0677 Infinity 0.0594

NOTE: * indicates a median unbiased estimate.
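With a single binary predictor and fixed margins, the exact conditional test reduces to Fisher's exact test on the 2x2 table. Here the observed split (all 5 events in the x=1 group) is the most extreme possible, so the one-sided probability is a single hypergeometric term, and doubling it reproduces the p-value above (assuming, as the numbers suggest, that SAS doubles the one-sided exact probability in this case). A quick check (Python, standard library only):

```python
from math import comb

# 5 events total among 200 subjects; under the null, the probability
# that all 5 land among the 100 subjects with x=1 is hypergeometric.
one_sided = comb(100, 5) * comb(100, 0) / comb(200, 5)
print(round(one_sided, 4), round(2 * one_sided, 4))  # 0.0297 0.0594
```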


R
In R we use the elrm() function in the elrm package to approximate exact logistic regression, as described in this paper by the package's authors. The function requires a special formula object with syntax identical to the SAS events/trials syntax. (Note that the function does not behave as expected when identical observations with trials=1 are submitted. Thus data should be collapsed into unique combinations of predictors before using the function.) In addition, it requires its data to be included in a data frame. We'll construct the data frame in one function call to data.frame().

elrmdata = data.frame(count=c(0,5), x=c(0,1), n=c(100,100))
library(elrm)
resexact = elrm(count/n ~ x, interest = ~x, iter=22000,
  burnIn=2000, dataset=elrmdata, r=2)
summary(resexact)

producing the following result:

Call:
[[1]]
elrm(formula = count/n ~ x, interest = ~x, r = 2, iter = 22000,
dataset = elrmdata, burnIn = 2000)

Results:
estimate p-value p-value_se mc_size
x 2.0225 0.02635 0.0011 20000

95% Confidence Intervals for Parameters

lower upper
x -0.02065572 Inf

Differences between the SAS and R results most likely arise from the fact that the elrm() function is an approximation of the exact approach. The upper limit of infinity seen in the exact SAS analysis and approximate exact elrm() analysis reveals a limitation of this approach relative to the Firth approach seen in example 8.15 and the Bayesian approach we'll examine later.

A final note: if the true Pr(Y=1|X=1) = 0.05, then the true Pr(Y=1|X=0) that results in a log odds ratio of 1.94 is about 0.0075; for a log odds ratio of 2.02, the true probability is about 0.0069.
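That arithmetic can be verified directly: convert Pr(Y=1|X=1) to an odds, divide by the odds ratio, and convert back (Python just for the arithmetic):

```python
import math

p1 = 0.05
odds1 = p1 / (1 - p1)  # odds of the event when X=1
p0 = {}
for log_or in (1.94, 2.02):
    odds0 = odds1 / math.exp(log_or)  # divide by the odds ratio
    p0[log_or] = odds0 / (1 + odds0)  # back to the probability scale
print({k: round(v, 4) for k, v in p0.items()})  # {1.94: 0.0075, 2.02: 0.0069}
```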

Monday, November 22, 2010

Example 8.15: Firth logistic regression

In logistic regression, when the outcome has low (or high) prevalence, or when there are several interacted categorical predictors, it can happen that for some combination of the predictors, all the observations have the same event status. A similar problem occurs when continuous covariates predict the outcome too perfectly.

This phenomenon, known as "separation" (including complete and quasi-complete separation) will cause problems fitting the model. Sometimes the only symptom of separation will be extremely large standard errors, while at other times the software may report an error or a warning.

One approach to handling this sort of problem is exact logistic regression, which we discuss in section 4.1.2. But exact logistic regression is complex and may require prohibitive computational resources. Another option is to use a Bayesian approach. Here we show how to use a penalized likelihood method originally proposed by Firth (1993 Biometrika 80:27-38) and described fully in this setting by Georg Heinze (2002 Statistics in Medicine 21:2409-2419 and 2006 25:4216-4226). A nice summary of the method is shown on a web page that Heinze maintains. In later entries we'll consider the Bayesian and exact approaches.

Update: see bottom of the post.

SAS
In SAS, the corrected estimates can be found using the firth option to the model statement in proc logistic. We'll set up the problem in the simple setting of a 2x2 table with an empty cell. Here, we simply output three observations with the three observed combinations of predictor and outcome, along with a weight variable containing the case count in each cell of the table.

data testfirth;
pred=1; outcome=1; weight=20; output;
pred=0; outcome=1; weight=20; output;
pred=0; outcome=0; weight=200; output;
run;

In the proc logistic code, we use the weight statement, available in many procedures, to indicate how many times each observation should be replicated in the analysis. This approach can save a lot of space.

proc logistic data = testfirth;
class outcome pred (param=ref ref='0');
model outcome(event='1') = pred / cl firth;
weight weight;
run;

Without the firth option, the parameter estimate is 19.7 with a standard error of 1349. In contrast, here is the result of the above code.

Analysis of Maximum Likelihood Estimates

Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq

Intercept 1 -2.2804 0.2324 96.2774 <.0001
pred 1 1 5.9939 1.4850 16.2926 <.0001

Note that there is no indication in this part of the output that the Firth method was employed; that appears only at the very top of the voluminous output.
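With a single binary predictor, Firth's penalty reduces (as the numbers here suggest) to the familiar Haldane-Anscombe correction: add 1/2 to each cell of the 2x2 table before computing the estimates. The point estimates above can be reproduced that way, though the standard errors cannot, since those come from the penalized information matrix (Python just for the arithmetic):

```python
import math

# The 2x2 table from data testfirth, with 1/2 added to every cell
a, b = 20.5, 0.5     # pred=1: outcome=1, outcome=0 (the empty cell)
c, d = 20.5, 200.5   # pred=0: outcome=1, outcome=0
intercept = math.log(c / d)          # log odds when pred=0
slope = math.log(a * d / (b * c))    # log odds ratio for pred
print(round(intercept, 4), round(slope, 4))  # -2.2804 5.994
```

These match the -2.2804 and 5.9939 in the SAS table above to the displayed precision.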

R
In R, we can use Heinze's logistf package, which includes the logistf() function. We'll make the same table as in SAS by constructing two vectors of length 240 using the c() and rep() functions.

pred = c(rep(1,20),rep(0,220))
outcome = c(rep(1,40),rep(0,200))
lr1 = glm(outcome ~ pred, binomial)
> summary(lr1)

Call:
glm(formula = outcome ~ pred, family = binomial)

Deviance Residuals:
Min 1Q Median 3Q Max
-0.4366 -0.4366 -0.4366 -0.4366 2.1899

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.3026 0.2345 -9.818 <2e-16 ***
pred 20.8687 1458.5064 0.014 0.989
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 216.27 on 239 degrees of freedom
Residual deviance: 134.04 on 238 degrees of freedom
AIC: 138.04

Number of Fisher Scoring iterations: 17

Note that the diverging estimate differs from the 19.7 that SAS reports without the firth option; with separation, both values are artifacts of where the fitting algorithm stopped rather than meaningful estimates. Here's a more plausible answer.

> library(logistf)
> lr2 = logistf(outcome ~ pred)
> summary(lr2)

logistf(formula = outcome ~ pred)

Model fitted by Penalized ML
Confidence intervals and p-values by Profile Likelihood

coef se(coef) lower 0.95 upper 0.95 Chisq p
(Intercept) -2.280389 0.2324057 -2.765427 -1.851695 Inf 0
pred 5.993961 1.4850029 3.947048 10.852893 Inf 0

Likelihood ratio test=78.95473 on 1 df, p=0, n=240
Wald test = 16.29196 on 1 df, p = 5.429388e-05

Covariance-Matrix:
[,1] [,2]
[1,] 0.05401242 -0.05401242
[2,] -0.05401242 2.20523358

Here, both the estimates and the standard errors are essentially identical to the SAS results; the confidence limits differ, however, since logistf() reports profile penalized-likelihood intervals while the cl option in SAS produces Wald limits.


Update:
Georg Heinze, author of the logistf() function, suggests the following two items.

First, in SAS, the model statement option clparm=pl will generate profile penalized-likelihood confidence intervals, which should be similar to those from logistf(). It certainly makes sense to use confidence limits that more closely reflect the fitting method.

Second, in R, there is a weight option in both glm() and in logistf() that is similar to the weight statement in SAS. For example, the data used above could have been input and run as:

pred = c(1,0,0)
outcome = c(1,1,0)
weight=c(20,20,200)
lr1 = glm(outcome ~ pred, binomial, weights=weight)
lr2 = logistf(outcome ~ pred, weights=weight)

Tuesday, September 28, 2010

Example 8.7: Hosmer and Lemeshow goodness-of-fit

The Hosmer and Lemeshow goodness of fit (GOF) test is a way to assess whether there is evidence for lack of fit in a logistic regression model. Simply put, the test compares the expected and observed number of events in bins defined by the predicted probability of the outcome. This can be calculated in R and SAS.

R

In R, we write a simple function to calculate the statistic and a p-value, based on vectors of observed and predicted probabilities. We use the cut() function (1.4.10) in concert with the quantile() function (2.1.5) to make the bins, then calculate the observed and expected counts, the chi-square statistic, and finally the associated p-value (Table 1.1). The function allows the user to define the number of bins but uses the common default of 10.

hosmerlem = function(y, yhat, g=10) {
  cutyhat = cut(yhat,
    breaks = quantile(yhat, probs=seq(0, 1, 1/g)),
    include.lowest=TRUE)
  obs = xtabs(cbind(1 - y, y) ~ cutyhat)
  expect = xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
  chisq = sum((obs - expect)^2/expect)
  P = 1 - pchisq(chisq, g - 2)
  return(list(chisq=chisq, p.value=P))
}

We'll run it with some of the HELP data (available at the book web site). Note that fitted(object) returns the predicted probabilities, if the object is the result of a call to glm() with family=binomial.

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
attach(ds)
logreg = glm(homeless ~ female + i1 + cesd + age + substance,
family=binomial)
hosmerlem(y=homeless, yhat=fitted(logreg))

This returns the following output:

$chisq
[1] 8.495386
$p.value
[1] 0.3866328

The Design package, by Frank Harrell, includes the le Cessie and Houwelingen test (another goodness-of-fit test, Biometrics 1991 47:1267) and is also easy to run, though it requires using the package's function for logistic regression.

library(Design)
mod = lrm(homeless ~ female + i1 + cesd + age + substance,
x=TRUE, y=TRUE, data=ds)
resid(mod, 'gof')


Sum of squared errors Expected value|H0 SD
104.1091804 103.9602955 0.1655883
Z P
0.8991269 0.3685851

The two tests are reassuringly similar.

SAS

In SAS, the Hosmer and Lemeshow goodness of fit test is generated with the lackfit option to the model statement in proc logistic (section 4.1.1). (We select out the results using the ODS system.)

ods select lackfitpartition lackfitchisq;
proc logistic data="c:\book\help.sas7bdat";
class substance female;
model homeless = female i1 cesd age substance / lackfit;
run;

This generates the following output:

Partition for the Hosmer and Lemeshow Test

HOMELESS = 1 HOMELESS = 0
Group Total Observed Expected Observed Expected

1 45 10 12.16 35 32.84
2 45 12 14.60 33 30.40
3 45 15 15.99 30 29.01
4 45 17 17.20 28 27.80
5 45 27 18.77 18 26.23
6 45 20 20.28 25 24.72
7 45 23 22.35 22 22.65
8 45 25 25.04 20 19.96
9 45 28 27.67 17 17.33
10 48 32 34.95 16 13.05

Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square DF Pr > ChiSq

8.4786 8 0.3882

The partition table shows the observed and expected counts of events in each decile of the predicted probabilities.

The discrepancy between the SAS and R results is likely due to the odd binning SAS uses; the test is unstable in the presence of ties, to the extent that some authorities suggest avoiding it entirely. With continuous predictors, though, ties are rare and these objections are less germane.
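As a final check, the p-value in the SAS table follows from the chi-square statistic with g - 2 = 8 degrees of freedom. For even degrees of freedom, the chi-square upper tail has a closed form (a Poisson partial sum), so no statistics library is needed (Python, standard library only):

```python
import math

def chisq_sf_even_df(x, df):
    """Upper tail P(X > x) for a chi-square with even df, via the Poisson sum."""
    assert df % 2 == 0
    lam = x / 2
    return math.exp(-lam) * sum(lam ** k / math.factorial(k) for k in range(df // 2))

p = chisq_sf_even_df(8.4786, 8)
print(round(p, 4))  # close to the 0.3882 that SAS reports
```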