SAS and R: Firth logistic regression


Thursday, March 22, 2012

Example 9.24: Changing the parameterization for categorical predictors

In our book, we discuss the important question of how to assign different parameterizations to categorical variables when fitting models (section 3.1.3). We show code in R for use in the lm() function, as follows:

lm(y ~ x, contrasts = list(x = "contr.treatment"))

This works great in lm() and some other functions, notably glm(). But for functions from contributed packages, the contrasts option may not work.
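As a quick check of the contrasts option in a function where it does work, here is a minimal sketch using simulated data. The variable names, seed, and group means are made up for illustration. Both fits describe the same model, so only the coefficients change, not the fitted values.

```r
# Simulated data: a three-level factor and a continuous outcome (illustrative only)
set.seed(1)
x <- factor(rep(c("a", "b", "c"), each = 20))
y <- rnorm(60, mean = as.numeric(x))

# Same model, two parameterizations, chosen through the contrasts argument
fit_trt  <- lm(y ~ x, contrasts = list(x = "contr.treatment"))
fit_poly <- lm(y ~ x, contrasts = list(x = "contr.poly"))

names(coef(fit_trt))    # "(Intercept)" "xb" "xc"
names(coef(fit_poly))   # "(Intercept)" "x.L" "x.Q"

# The fitted values agree; only the meaning of the coefficients differs
all.equal(fitted(fit_trt), fitted(fit_poly))
```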

Here we show a more generic approach to setting contrasts in R, using Firth logistic regression, which is discussed in Example 8.15, to demonstrate. This approach is also shown in passing in section 3.7.5.

R
We'll simulate a simple data set for logistic regression, then examine the results of the default parameterization.


n = 100
j = rep(c(0,1,2), each = n)
linpred = -2.5 + j
y = (runif(n*3) < exp(linpred)/(1 + exp(linpred)) )
library(logistf)
flrfactor = logistf(y ~ as.factor(j))
summary(flrfactor)
                    coef  se(coef)      Chisq            p
(Intercept)   -2.1539746 0.3276441        Inf 0.000000e+00
as.factor(j)1  0.3679788 0.4343756  0.7331622 3.918601e-01
as.factor(j)2  1.7936917 0.3855682 26.2224650 3.042623e-07

To see what R is doing, use the contrasts() function:


> contrasts(as.factor(j))
  1 2
0 0 0
1 1 0
2 0 1

R made indicator ("dummy") variables for two of the three levels, so that the estimated coefficients are the log relative odds for these levels vs. the omitted level. This is the "contr.treatment" structure (default for unordered factors). The defaults can be changed with options("contrasts"), but this is a sensible one.
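For readers who do want to change the session-wide default, the following sketch shows one way to inspect, switch, and then restore the setting. The choice of "contr.sum" is only an example, and the small vector j is made up for the demonstration.

```r
# Current defaults: one contrast function for unordered factors, one for ordered
old <- options("contrasts")
old$contrasts

# Temporarily switch unordered factors to sum-to-zero contrasts
options(contrasts = c("contr.sum", "contr.poly"))
j <- rep(c(0, 1, 2), each = 2)
m_sum <- contrasts(as.factor(j))
m_sum   # rows for levels 0 and 1 are indicators; the row for level 2 is -1 -1

# Restore the previous setting so later code is unaffected
options(old)
contrasts(as.factor(j))
```

Restoring the saved value matters because the option silently affects every later model fit in the session.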

But what if we wanted to assess whether a linear effect was plausible, independent of any quadratic effect? For glm() objects we could examine the anova() between the model with the linear term and the model with the linear and quadratic terms. Or, we could use the syntax shown in the introduction, but with "contr.poly" in place of "contr.treatment". The latter approach may be preferable, but for the logistf() function (and likely many other contributed functions) the contrasts = option does not work. In those cases, use the contrasts() function:


jfactor = as.factor(j)
contrasts(jfactor) = contr.poly(3)
flrfc = logistf(y ~ jfactor)
summary(flrfc)
                  coef  se(coef)     Chisq            p
(Intercept) -1.4334177 0.1598591       Inf 0.000000e+00
jfactor.L    1.2683316 0.2726379 26.222465 3.042623e-07
jfactor.Q    0.4318181 0.2810660  2.472087 1.158840e-01
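For comparison, the anova() route mentioned above can be sketched for an ordinary glm() fit. The seed is an arbitrary choice, so the numbers will not match the output shown in this entry, and j is entered as a numeric score rather than as a factor.

```r
# Simulate data as in the example above (seed chosen arbitrarily)
set.seed(42)
n <- 100
j <- rep(c(0, 1, 2), each = n)
linpred <- -2.5 + j
y <- runif(3 * n) < plogis(linpred)   # plogis(x) is exp(x)/(1 + exp(x))

# Linear term only, then linear plus quadratic
fit_lin  <- glm(y ~ j, family = binomial)
fit_quad <- glm(y ~ j + I(j^2), family = binomial)

# Likelihood ratio test for the quadratic term, given the linear term
lrt <- anova(fit_lin, fit_quad, test = "Chisq")
lrt
```

With only three levels, the linear-plus-quadratic model is saturated in j, so this test addresses the same question as the jfactor.Q row.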

Not surprisingly, there is no need for a quadratic term after the linear trend is accounted for. The canned contrasts available in R are somewhat limited--contr.sum() gives effect cell coding only with the last level as the omitted one, for example. You can assign contrasts(x) a matrix you write manually in such cases.
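As a sketch of a hand-written contrast matrix, the code below builds an effect-style (sum-to-zero) coding in which level 0, rather than the last level, is the omitted one. The matrix name eff and the small factor are hypothetical, chosen only for illustration.

```r
# A small three-level factor (illustrative only)
jfactor <- as.factor(rep(c(0, 1, 2), each = 4))

# Hand-written coding: one column per non-omitted level; level 0 gets -1 in both
eff <- cbind(c(-1, 1, 0),
             c(-1, 0, 1))
rownames(eff) <- levels(jfactor)
colnames(eff) <- c("eff1", "eff2")

# Attach the matrix to the factor; modeling functions that build their design
# matrix with model.matrix() will then use it
contrasts(jfactor) <- eff
contrasts(jfactor)

mm <- model.matrix(~ jfactor)
colnames(mm)   # "(Intercept)" "jfactoreff1" "jfactoreff2"
mm[1, ]        # a level-0 row: 1, -1, -1
```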

SAS
In SAS, the class statement for the logistic procedure allows many parameterizations, including "orthpoly", which matches the "contr.poly" contrast from R. However, most modeling procedures do not have this flexibility, and you would have to generate your contrasts manually in those cases, typically by creating new variables with the appropriate contrast values. Here we show the reference cell coding that is the default in R. Perversely, it is not the default in proc logistic, despite being the only option in most procedures. On the other hand, proc logistic does allow the user to specify the reference category.


data test;
do i = 1 to 300;
  j = (i gt 100) + (i gt 200);
  linpred = -2.5 + j;
  y = (uniform(0) lt exp(linpred)/(1 + exp(linpred)) );
  output;
  end;
run;

title "Reference cell";
proc logistic data = test;
class j (param=ref ref='0');
model y(event='1') = j / firth clparm = pl;
run;

title "Polynomials";
proc logistic data = test;
class j (param=orthpoly);
model y(event='1') = j;
run;

With the results:


Reference cell
                                Standard         Wald
  Parameter     DF   Estimate      Error   Chi-Square   Pr > ChiSq
  Intercept      1    -2.6110     0.1252     434.6071       <.0001
  j         1    1     1.2078     0.1483      66.3056       <.0001
  j         2    1     2.2060     0.1409     245.1215       <.0001


Polynomials
                                  Standard        Wald
  Parameter         DF  Estimate     Error  Chi-Square  Pr > ChiSq
  Intercept          1   -1.4761    0.0540    746.6063      <.0001
  j         OPOLY1   1    0.9032    0.0577    245.3952      <.0001
  j         OPOLY2   1   -0.0502    0.0501      1.0029      0.3166

An unrelated note about aggregators
We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers and PROC-X with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Monday, November 22, 2010

Example 8.15: Firth logistic regression

In logistic regression, when the outcome has low (or high) prevalence, or when there are several interacted categorical predictors, it can happen that for some combination of the predictors, all the observations have the same event status. A similar event occurs when continuous covariates predict the outcome too perfectly.

This phenomenon, known as "separation" (including complete and quasi-complete separation), will cause problems fitting the model. Sometimes the only symptom of separation will be extremely large standard errors, while at other times the software may report an error or a warning.

One approach to handling this sort of problem is exact logistic regression, which we discuss in section 4.1.2. But exact logistic regression is complex and may require prohibitive computational resources. Another option is to use a Bayesian approach. Here we show how to use a penalized likelihood method originally proposed by Firth (1993 Biometrika 80:27-38) and described fully in this setting by Georg Heinze (2002 Statistics in Medicine 21:2409-2419 and 2006 25:4216-4226). A nice summary of the method is shown on a web page that Heinze maintains. In later entries we'll consider the Bayesian and exact approaches.

Update: see bottom of the post.

SAS
In SAS, the corrected estimates can be found using the firth option to the model statement in proc logistic. We'll set up the problem in the simple setting of a 2x2 table with an empty cell. Here, we simply output three observations with three combinations of predictor and outcome, along with a weight variable which contains the case counts in each cell of the table.


data testfirth;
   pred=1; outcome=1; weight=20; output;
   pred=0; outcome=1; weight=20; output;
   pred=0; outcome=0; weight=200; output;
run;

In the proc logistic code, we use the weight statement, available in many procedures, to suggest how many times each observation is to be replicated before the analysis. This approach can save a lot of space.


proc logistic data = testfirth;
  class outcome pred (param=ref ref='0');
  model outcome(event='1') = pred / cl firth;
  weight weight;
run;

Without the firth option, the parameter estimate is 19.7 with a standard error of 1349. In contrast, here is the result of the above code.


            Analysis of Maximum Likelihood Estimates

                              Standard         Wald
Parameter     DF   Estimate      Error   Chi-Square   Pr > ChiSq

Intercept      1    -2.2804     0.2324      96.2774       <.0001
pred      1    1     5.9939     1.4850      16.2926       <.0001

Note that there is no suggestion in this part of the output that the Firth method was employed. That appears only at the very top of the voluminous output.

R
In R, we can use Heinze's logistf package, which includes the logistf() function. We'll make the same table as in SAS by constructing two vectors of length 240 using the c() and rep() functions.


pred = c(rep(1,20),rep(0,220))
outcome = c(rep(1,40),rep(0,200))
lr1 = glm(outcome ~ pred, binomial)
summary(lr1)

Call:
glm(formula = outcome ~ pred, family = binomial)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.4366  -0.4366  -0.4366  -0.4366   2.1899  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -2.3026     0.2345  -9.818   <2e-16 ***
pred          20.8687  1458.5064   0.014    0.989    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 216.27  on 239  degrees of freedom
Residual deviance: 134.04  on 238  degrees of freedom
AIC: 138.04

Number of Fisher Scoring iterations: 17
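The enormous standard error and the 17 Fisher scoring iterations are symptoms of separation: the maximum likelihood estimate for pred does not exist (it is infinite), so the reported value mainly reflects where the iterations happened to stop. The sketch below refits the same model with a tighter convergence tolerance. The tolerance and iteration limit are arbitrary values chosen for illustration.

```r
pred    <- c(rep(1, 20), rep(0, 220))
outcome <- c(rep(1, 40), rep(0, 200))

# Default convergence settings, as in the fit above
fit_default <- suppressWarnings(glm(outcome ~ pred, binomial))

# Tighter tolerance and more iterations (values chosen only for illustration)
fit_tight <- suppressWarnings(
  glm(outcome ~ pred, binomial,
      control = glm.control(epsilon = 1e-12, maxit = 50))
)

# The pred coefficient keeps growing; the intercept and deviance barely move
rbind(default = coef(fit_default), tight = coef(fit_tight))
c(fit_default$iter, fit_tight$iter)
c(deviance(fit_default), deviance(fit_tight))
```

Any two programs with different stopping rules can therefore be expected to report different, equally meaningless, values for this coefficient.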

Note that the estimate differs slightly from what SAS reports. Here's a more plausible answer.


library(logistf)
lr2 = logistf(outcome ~ pred)
summary(lr2)

logistf(formula = outcome ~ pred)

Model fitted by Penalized ML
Confidence intervals and p-values by Profile Likelihood 

                 coef  se(coef) lower 0.95 upper 0.95 Chisq p
(Intercept) -2.280389 0.2324057  -2.765427  -1.851695   Inf 0
pred         5.993961 1.4850029   3.947048  10.852893   Inf 0

Likelihood ratio test=78.95473 on 1 df, p=0, n=240
Wald test = 16.29196 on 1 df, p = 5.429388e-05

Covariance-Matrix:
            [,1]        [,2]
[1,]  0.05401242 -0.05401242
[2,] -0.05401242  2.20523358

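Here, the estimates and standard errors match those from SAS, but logistf() bases its confidence intervals and p-values on the profile penalized likelihood rather than on the Wald approximation.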
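To make the penalty concrete, the sketch below maximizes the Firth-penalized log-likelihood directly in base R, adding half the log determinant of the Fisher information to the ordinary log-likelihood. It is an illustration under our own choices (a general-purpose optimizer, the hypothetical function name neg_pen_loglik), not a description of how logistf() or proc logistic compute their estimates.

```r
pred    <- c(rep(1, 20), rep(0, 220))
outcome <- c(rep(1, 40), rep(0, 200))
X <- cbind(1, pred)   # design matrix: intercept and pred

# Negative penalized log-likelihood: -(logLik + 0.5 * log det(X'WX))
neg_pen_loglik <- function(beta) {
  eta <- drop(X %*% beta)
  p <- plogis(eta)
  loglik <- sum(outcome * eta - log1p(exp(eta)))
  info <- crossprod(X, X * (p * (1 - p)))   # Fisher information X'WX
  logdet <- as.numeric(determinant(info, logarithm = TRUE)$modulus)
  -(loglik + 0.5 * logdet)
}

# General-purpose minimizer with numerical gradients
fit <- optim(c(0, 0), neg_pen_loglik, method = "BFGS",
             control = list(reltol = 1e-12, maxit = 500))
fit$par   # close to the logistf() coefficients, about -2.28 and 5.99

# For this 2x2 table the result agrees with adding 1/2 to each cell count
c(log(20.5 / 200.5), log(20.5 / 0.5) - log(20.5 / 200.5))
```

The penalty term goes to minus infinity as the fitted probabilities approach 0 or 1, which is what keeps the estimate finite despite the empty cell.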

Update:
Georg Heinze, author of the logistf() function, suggests the following two items.

First, in SAS, the model statement option clparm=pl will generate profile penalized likelihood confidence intervals, which should be similar to those from logistf(). It certainly makes sense to use confidence limits that more closely reflect the fitting method.

Second, in R, there is a weights option in both glm() and in logistf() that is similar to the weight statement in SAS. For example, the data used above could have been input and run as:


pred = c(1,0,0)
outcome = c(1,1,0)
weight=c(20,20,200)
lr1 = glm(outcome ~ pred, binomial, weights=weight)
lr2 = logistf(outcome ~ pred,  weights=weight)
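A small base R check, using glm() only, suggests in what sense the weighted input is "similar" to the expanded data. The object names with the _long suffix are ours, added so that both versions can coexist in one session.

```r
# Long format: one row per subject, as in the R section above
pred_long    <- c(rep(1, 20), rep(0, 220))
outcome_long <- c(rep(1, 40), rep(0, 200))
fit_long <- suppressWarnings(glm(outcome_long ~ pred_long, binomial))

# Weighted format: one row per non-empty cell of the 2x2 table
pred    <- c(1, 0, 0)
outcome <- c(1, 1, 0)
weight  <- c(20, 20, 200)
fit_wt <- suppressWarnings(glm(outcome ~ pred, binomial, weights = weight))

# Same deviance and intercept; the pred coefficient is not compared because
# under separation its value depends on where the iterations stop
c(deviance(fit_long), deviance(fit_wt))
c(coef(fit_long)[1], coef(fit_wt)[1])

# The residual degrees of freedom differ: 238 versus 1
c(df.residual(fit_long), df.residual(fit_wt))
```

The difference in residual degrees of freedom arises because glm() counts rows, not weights, so summaries that depend on the number of observations should be read with care when weights stand in for replicated rows.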

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education.

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses.
