## Thursday, March 22, 2012

### Example 9.24: Changing the parameterization for categorical predictors

In our book, we discuss the important question of how to assign different parameterizations to categorical variables when fitting models (section 3.1.3). We show code in R for use in the lm() function, as follows:

lm(y ~ x, contrasts=list(x,"contr.treatment")

This works great in lm() and some other functions, notably glm(). But for functions from contributed packages, the contrasts option may not work.

Here we show a more generic approach to setting contrasts in R, using Firth logistic regression, which is discussed in Example 8.15, to demonstrate. This approach is also shown in passing in section 3.7.5.

R
We'll simulate a simple data set for logistic regression, then examine the results of the default parameterization.
`n = 100j = rep(c(0,1,2), each = n)linpred = -2.5 + jy = (runif(n*3) < exp(linpred)/(1 + exp(linpred)) )library(logistf)flrfactor = logistf(y ~ as.factor(j))summary(flrfactor)                    coef  se(coef)      Chisq            p(Intercept)   -2.1539746 0.3276441        Inf 0.000000e+00as.factor(j)1  0.3679788 0.4343756  0.7331622 3.918601e-01as.factor(j)2  1.7936917 0.3855682 26.2224650 3.042623e-07`

To see what R is doing, use the contrasts() function:
`> contrasts(as.factor(j))  1 20 0 01 1 02 0 1`

R made indicator ("dummy") variables for two of the three levels, so that the estimated coefficients are the log relative odds for these levels vs. the omitted level. This is the "contr.treatment" structure (default for unordered factors). The defaults can be changed with options("contrasts"), but this is a sensible one.

But what if we wanted to assess whether a linear effect was plausible, independent of any quadratic effect? For glm() objects we could examine the anova() between the model with the linear term and the model with the linear and quadratic terms. Or, we could use the syntax shown in the introduction, but with "contr.poly" in place of "contr.treatment". The latter approach may be preferable, and for the logistf() function (and likely many other contributed functions) the contrasts = option does not work. In those cases, use the contrasts function:
`jfactor = as.factor(j)contrasts(jfactor) = contr.poly(3)flrfc = logistf(y ~ jfactor)summary(flrfc)                  coef  se(coef)     Chisq            p(Intercept) -1.4334177 0.1598591       Inf 0.000000e+00jfactor.L    1.2683316 0.2726379 26.222465 3.042623e-07jfactor.Q    0.4318181 0.2810660  2.472087 1.158840e-01`

Not surprisingly, there is no need for a quadratic term, after the linear trend is accounted for. The canned contrasts available in R are somewhat limited--effect cell coding is not included, for example. You can assign contrasts(x) a matrix you write manually in such cases.

SAS
In SAS, the class statement for the logistic procedure allows many parametrizations, including "orthpoly", which matches the "contr.poly" contrast from R. However, most modeling procedures do not have this flexibility, and you would have to generate your contrasts manually in those cases, typically by creating new variables with the appropriate contrast values. Here we show the reference cell coding that is the default in R. Perversely, it is not the the default in proc logistic despite it being the only option in most procedures. On the other hand, it does allow the user to specify the reference category.
`data test;do i = 1 to 300;  j = (i gt 100) + (i gt 200);  linpred = -2.5 + j;  y = (uniform(0) lt exp(linpred)/(1 + exp(linpred)) );  output;  end;run;title "Reference cell";proc logistic data = test;class j (param=ref ref='0');model y(event='1') = j / firth clparm = pl;run;title "Polynomials";proc logistic data = test;class j (param=orthpoly);model y(event='1') = j;run;`

With the results:
`Reference cell                                Standard         Wald  Parameter     DF   Estimate      Error   Chi-Square   Pr > ChiSq  Intercept      1    -2.6110     0.1252     434.6071       <.0001  j         1    1     1.2078     0.1483      66.3056       <.0001  j         2    1     2.2060     0.1409     245.1215       <.0001Polynomials                                  Standard        Wald  Parameter         DF  Estimate     Error  Chi-Square  Pr > ChiSq  Intercept          1   -1.4761    0.0540    746.6063      <.0001  j         OPOLY1   1    0.9032    0.0577    245.3952      <.0001  j         OPOLY2   1   -0.0502    0.0501      1.0029      0.3166`