Tuesday, September 21, 2010

Example 8.6: Changing the reference category for categorical variables

How can we change the reference category for a categorical variable? This question comes up often in a consulting practice.

When including categorical covariates in regression models, there is a question of how to incorporate the categories. One simple method is to generate indicator variables, sometimes called dummy variables. We go into some detail about the parameterization of categorical covariates in the SAS and R book, section 3.1.3.

In the indicator variable approach, new dichotomous variables are generated for all but one of the categories; these have a value of 1 if the subject is in the category and 0 otherwise. SAS and R each have simple ways to do this without explicitly creating new variables. In SAS, many procedures accept a class statement, while in R a variable can be defined as a factor, for example by using as.factor.

Let's consider a simple example with the following display of a categorical variable and the resulting indicators.

id catvar indA indB indC
1 A 1 0 0
2 B 0 1 0
3 D 0 0 0
4 C 0 0 1
5 B 0 1 0
6 D 0 0 0
7 A 1 0 0

When we fit the model, the parameter associated with the indA variable is an estimate of the difference between categories A and D. But what if we want the difference between A and C? Well, we can take out our calculators, but we'd also like the standard error of that estimated difference. One way to do this is to change the reference category, and that is what we'll explore today. In a future entry, we'll demonstrate how to calculate arbitrary comparisons, or contrasts, without refitting the model. That method is likely superior to the one shown here, but as consulting statisticians, the question "how do I change the reference category" is one we often answer.

SAS

For procs logistic, genmod, phreg, and surveylogistic, you can use the ref= option, as follows:

proc logistic data=ds;
class classvar (param=ref ref="name-of-ref-group");
model y = classvar;
run;

Unfortunately, changing the reference in SAS is awkward for other procedures. The SAS default is to make the last category the referent, when last is determined by ordering the characters. To change this, use the order option, frequently an option to the class statement but sometimes an option to the proc statement. If the desired referent is the first category, you can make it the referent by sorting on the variable in descending order and then using the order=data option:

proc sort data=ds; by descending classvar; run;

proc glm data=ds order=data;
class classvar;
model y = classvar;
run;

If your desired reference category is lexicographically in the middle of the list, your best bet is to re-code the categories. My colleague Sheryl Rifas-Shiman renames the labels as, e.g., "a. blue", "b. other", "c. brown". Then sort on the new variable and use the order=data approach. You might also get lucky by sorting on some other variable in the data set and using order=data.

As an example, we consider the simple analysis of covariance discussed in section 3.7.2. The default reference cell for substance is heroin. We can replace this with alcohol using the sorting approach.

proc import datafile='c:/book/help.dta' out=help_a dbms=dta;
run;

proc sort data=help_a; by descending substance; run;

proc glm data=help_a order=data;
class substance;
model i1 = age substance age * substance / solution;
run; quit;
Standard
Parameter Estimate Error t Value Pr > |t|

Intercept 7.913018261 B 6.79251599 1.16 0.2447
AGE 0.557076729 B 0.17437966 3.19 0.0015
SUBSTANCE heroin -2.600851794 B 9.66168958 -0.27 0.7879
SUBSTANCE cocaine 7.853879213 B 10.16492979 0.77 0.4401
SUBSTANCE alcohol 0.000000000 B . . .
AGE*SUBSTANCE heroin -0.450423400 B 0.26525085 -1.70 0.0902
AGE*SUBSTANCE cocaine -0.662468379 B 0.27702051 -2.39 0.0172
AGE*SUBSTANCE alcohol 0.000000000 B . . .

Note that SAS creates the levels for the interaction based on the same implied indicator variables.

R

In R there are several options for changing the reference cell. The simplest of these may be the relevel() function. The two arguments are the factor name and the desired reference category. The as.factor() function can be nested within relevel() if necessary.

> ds = read.csv("http://www.math.smith.edu/sasr/datasets/help.csv")
> lm3 = lm(i1 ~ relevel(substance, "alcohol") * age, data=ds)
> summary(lm3)
Call:
lm(formula = i1 ~ relevel(substance, "alcohol") * age, data=ds)

Residuals:
Min 1Q Median 3Q Max
-34.653 -9.625 -4.832 5.576 102.891

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.9130 6.7925 1.165 0.2447
relevel(substance, "alcohol")cocaine 7.8539 10.1649 0.773 0.4401
relevel(substance, "alcohol")heroin -2.6009 9.6617 -0.269 0.7879
age 0.5571 0.1744 3.195 0.0015 **
relevel(substance, "alcohol")cocaine:age -0.6625 0.2770 -2.391 0.0172 *
relevel(substance, "alcohol")heroin:age -0.4504 0.2653 -1.698 0.0902 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.7 on 447 degrees of freedom
Multiple R-squared: 0.2268, Adjusted R-squared: 0.2181
F-statistic: 26.22 on 5 and 447 DF, p-value: < 2.2e-16

4 comments:

Anonymous said...

Thanks this was very helpful!

Rick Wicklin said...

In SAS, there's another trick, which prevents you from having to sort or recode the data: you can define and apply a format. By default, the CLASS statement orders according to formatted values of the categorical variable.

Ken Kleinman said...

In general, I view almost all use of formats as dangerous and bad practice. The problem with them is that they disguise the actual data, which is what we should be paying attention to. If the formats are not provided, they also limit replicability when sharing data sets. About the only exceptions that come to mind are huge data sets where recoding can add computational burden, and dates, though even there you can get in trouble.

For example, a valued colleague just sent me an addendum to a data set, where the new data has to be merged by date. The colleague helpfully noted that the dates in his Excel spreadsheet were "absolute" (with values 1-365) while the dates in my SAS dataset ranged from 01JAN1960.

A casual SAS programmer unfamiliar with formats might not realize that the actual stored values of the SAS dates were also "absolute" (actually relative to another date in the data set) and that the apparent dates in 1960 were due to the formatting of the absolute dates relative to the SAS default begin date.

In general, while formatting might arguably get you the reference category you want without recoding, I strongly recommend against it.

The right solution is for other SAS procs to adopt the more specific syntax used in proc logistic and a few other procs, so that neither recoding nor formats are needed.

Anonymous said...

Just what I was looking for! Thank you!