SAS and R: contrasts

Showing posts with label contrasts. Show all posts

Thursday, March 22, 2012

Example 9.24: Changing the parameterization for categorical predictors

In our book, we discuss the important question of how to assign different parameterizations to categorical variables when fitting models (section 3.1.3). We show code in R for use in the lm() function, as follows:

lm(y ~ x, contrasts=list(x,"contr.treatment")

This works great in lm() and some other functions, notably glm(). But for functions from contributed packages, the contrasts option may not work.

Here we show a more generic approach to setting contrasts in R, using Firth logistic regression, which is discussed in Example 8.15, to demonstrate. This approach is also shown in passing in section 3.7.5.

R
We'll simulate a simple data set for logistic regression, then examine the results of the default parameterization.


n = 100
j = rep(c(0,1,2), each = n)
linpred = -2.5 + j
y = (runif(n*3) < exp(linpred)/(1 + exp(linpred)) )
library(logistf)
flrfactor = logistf(y ~ as.factor(j))
summary(flrfactor)
                    coef  se(coef)      Chisq            p
(Intercept)   -2.1539746 0.3276441        Inf 0.000000e+00
as.factor(j)1  0.3679788 0.4343756  0.7331622 3.918601e-01
as.factor(j)2  1.7936917 0.3855682 26.2224650 3.042623e-07

To see what R is doing, use the contrasts() function:


> contrasts(as.factor(j))
  1 2
0 0 0
1 1 0
2 0 1

R made indicator ("dummy") variables for two of the three levels, so that the estimated coefficients are the log relative odds for these levels vs. the omitted level. This is the "contr.treatment" structure (default for unordered factors). The defaults can be changed with options("contrasts"), but this is a sensible one.

But what if we wanted to assess whether a linear effect was plausible, independent of any quadratic effect? For glm() objects we could examine the anova() between the model with the linear term and the model with the linear and quadratic terms. Or, we could use the syntax shown in the introduction, but with "contr.poly" in place of "contr.treatment". The latter approach may be preferable, and for the logistf() function (and likely many other contributed functions) the contrasts = option does not work. In those cases, use the contrasts function:


jfactor = as.factor(j)
contrasts(jfactor) = contr.poly(3)
flrfc = logistf(y ~ jfactor)
summary(flrfc)
                  coef  se(coef)     Chisq            p
(Intercept) -1.4334177 0.1598591       Inf 0.000000e+00
jfactor.L    1.2683316 0.2726379 26.222465 3.042623e-07
jfactor.Q    0.4318181 0.2810660  2.472087 1.158840e-01

Not surprisingly, there is no need for a quadratic term, after the linear trend is accounted for. The canned contrasts available in R are somewhat limited--effect cell coding is not included, for example. You can assign contrasts(x) a matrix you write manually in such cases.

SAS
In SAS, the class statement for the logistic procedure allows many parametrizations, including "orthpoly", which matches the "contr.poly" contrast from R. However, most modeling procedures do not have this flexibility, and you would have to generate your contrasts manually in those cases, typically by creating new variables with the appropriate contrast values. Here we show the reference cell coding that is the default in R. Perversely, it is not the the default in proc logistic despite it being the only option in most procedures. On the other hand, it does allow the user to specify the reference category.


data test;
do i = 1 to 300;
  j = (i gt 100) + (i gt 200);
  linpred = -2.5 + j;
  y = (uniform(0) lt exp(linpred)/(1 + exp(linpred)) );
  output;
  end;
run;

title "Reference cell";
proc logistic data = test;
class j (param=ref ref='0');
model y(event='1') = j / firth clparm = pl;
run;

title "Polynomials";
proc logistic data = test;
class j (param=orthpoly);
model y(event='1') = j;
run;

With the results:


Reference cell
                                Standard         Wald
  Parameter     DF   Estimate      Error   Chi-Square   Pr > ChiSq
  Intercept      1    -2.6110     0.1252     434.6071       <.0001
  j         1    1     1.2078     0.1483      66.3056       <.0001
  j         2    1     2.2060     0.1409     245.1215       <.0001


Polynomials
                                  Standard        Wald
  Parameter         DF  Estimate     Error  Chi-Square  Pr > ChiSq
  Intercept          1   -1.4761    0.0540    746.6063      <.0001
  j         OPOLY1   1    0.9032    0.0577    245.3952      <.0001
  j         OPOLY2   1   -0.0502    0.0501      1.0029      0.3166

An unrelated note about aggregators
We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers and PROC-X with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Tuesday, October 12, 2010

Example 8.9: Contrasts

In example 8.6 we showed how to change the reference category. This is the natural first thought analysts have when their primary comparisons aren't represented in the default output. But our interest might center on a number of comparisons which don't share a category. Or we might need to compare one group with the mean of the other groups. Today we'll explore tests and estimates of effects like these, which are sometimes called contrasts. In a later entry we'll explore more complex multivariate contrasts.

SAS

Access to this kind of comparison in SAS is provided in many model-fitting procedures using a test, estimate, or contrast statement. The differences among these can be subtle. In general, for simple comparisons, we recommend the estimate statement, where available. It is available for the important glm and genmod procedures, among others. In our example from linear regression, we changed the referent from heroin to alcohol by sorting the data and using the order=data option. This gains us pairwise comparisons between heroin and alcohol and between cocaine and alcohol. Here we show how to get the comparison between heroin and cocaine from the same model fit.


proc sort data=help_a; by descending substance; run;

proc glm data=help_a order=data;
   class substance;
   model i1 = age substance / solution;
   estimate "heroin vs. cocaine" substance -1 1 0 / e;
run; quit;

The syntax of the estimate statement is to optionally name the estimate (between the quotes) then to list the effects whose values we want to assess, followed by the desired values for each level of the effect. The e option requests that the contrast vector be printed-- this is a vital check that the contrast is working as desired. Here are the pieces of output generated by the estimate statement.


Coefficients for Estimate heroin vs. cocaine

                            Row 1
Intercept                       0

AGE                             0

SUBSTANCE heroin               -1
SUBSTANCE cocaine               1
SUBSTANCE alcohol               0

This is the result of the e option, and shows that we've correctly specified the difference between heroin and cocaine.


                                            Standard
Parameter                   Estimate           Error    t Value    Pr > |t|
heroin vs. cocaine        3.00540049      2.15576257       1.39      0.1640

This is the comparison we want. It displays the difference, which we could confirm by examining the parameter estimates, as well as the standard error of the difference and the p-value, which can't be determined from the standard output.

To compare alcohol with the average of heroin and cocaine, we could use the following syntax (results omitted).


   estimate "cocaine + alcohol vs heroin" substance -2 1 1 /e divisor=2;

The divisor option allows us to use portions that are not easily represented as decimals. It's necessary because of the reserved use of the / in SAS. Any number of estimate statements can be issued in a single proc.

R

The fit.contrast() function in the gmodels package will generate contrasts for lm and glm objects. Multivariate contrasts can be found in the contrast() function in the Design package.


ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
lm3 = lm(i1 ~ substance + age, data=ds))
library(gmodels)
fit.contrast(lm3, "substance", c(0,-1, 1))

This generates the following output:


                         Estimate Std. Error t value Pr(>|t|)
substance c=( 0 -1 1 )  -3.005400 2.155763 -1.394124 0.1639696

The simple syntax accepts model objects, the name of the effect, and a vector of contrasts to apply. The function does not replicate the contrast, which would be useful, but it is simple enough to check the parameter estimates from the model to ensure the desired result has been obtained.

The syntax for comparing heroin with the mean of alcohol and cocaine is straightforward.


> fit.contrast(lm3,"substance", c(-.5, -.5,1))
                             Estimate Std. Error t value Pr(>|t|)
substance c=( -0.5 -0.5 1 ) -11.09965 1.904192 -5.829059 1.065763e-08

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

SAS and R

Catalogs of posts

Thursday, March 22, 2012

Example 9.24: Changing the parameterization for categorical predictors

Tuesday, October 12, 2010

Example 8.9: Contrasts

About SAS and R

Topics discussed