When including categorical covariates in regression models, there is a question of how to incorporate the categories. One simple method is to generate indicator variables, sometimes called dummy variables. We go into some detail about the parameterization of categorical covariates in the SAS and R book, section 3.1.3.
In the indicator variable approach, new dichotomous variables are generated for all but one of the categories; these have a value of 1 if the subject is in the category and 0 otherwise. SAS and R each have simple ways to do this without explicitly creating new variables. In SAS, many procedures accept a class statement, while in R a variable can be defined as a factor, for example by using as.factor.
Let's consider a simple example with the following display of a categorical variable and the resulting indicators.
id  catvar  indA  indB  indC
1   A       1     0     0
2   B       0     1     0
3   D       0     0     0
4   C       0     0     1
5   B       0     1     0
6   D       0     0     0
7   A       1     0     0
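For readers who want to see these indicators produced by software, here is a minimal R sketch. It rebuilds the seven-row display above as a hypothetical vector named catvar, makes D the reference level, and asks model.matrix() for the design matrix a regression would use. The column names (catvarA and so on) are R's defaults, not the indA, indB, indC names used in the display.

```r
# Hypothetical data: the seven subjects from the display above
catvar <- factor(c("A", "B", "D", "C", "B", "D", "A"))

# Make D the reference level so that no indicator is generated for it
catvar <- relevel(catvar, ref = "D")

# Design matrix: an intercept plus one 0/1 column per non-reference level
X <- model.matrix(~ catvar)
print(X)
```

The rows for subjects 3 and 6 (category D) are zero in every indicator column, which is what makes D the referent.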
When we fit the model, the parameter associated with the indA variable is an estimate of the difference between categories A and D. But what if we want the difference between A and C? Well, we can take out our calculators and subtract the indC estimate from the indA estimate, but we'd also like the standard error of that estimated difference. One way to get it is to change the reference category, and that is what we'll explore today. In a future entry, we'll demonstrate how to calculate arbitrary comparisons, or contrasts, without refitting the model. That method is likely superior to the one shown here, but as consulting statisticians, we are often asked "how do I change the reference category?"
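The effect of switching the reference category can be checked on simulated data. The sketch below is only an illustration: the data frame sim, its variables g and y, and the group means are all invented for this purpose. It fits the same one-way model twice in R, once with D as the referent and once with C, and compares the A coefficient from the second fit with the A-minus-C difference computed by hand from the first. The second fit also supplies the standard error of that difference directly.

```r
set.seed(42)  # arbitrary seed so the simulated data are reproducible

# Hypothetical data: four groups of 25 with made-up means
sim <- data.frame(g = factor(rep(c("A", "B", "C", "D"), each = 25)))
group_means <- c(A = 10, B = 12, C = 15, D = 11)
sim$y <- unname(group_means[as.character(sim$g)]) + rnorm(100)

# Fit with D as the reference category, then refit with C as the reference
fitD <- lm(y ~ relevel(g, ref = "D"), data = sim)
fitC <- lm(y ~ relevel(g, ref = "C"), data = sim)

bD <- unname(coef(fitD))   # order: intercept (D), A, B, C
bC <- coef(summary(fitC))  # rows:  intercept (C), A, B, D

# A minus C, computed by hand from the D-referenced fit
diff_by_hand <- bD[2] - bD[4]

# The same difference, now with a standard error, from the C-referenced fit
diff_refit <- bC[2, "Estimate"]
se_refit <- bC[2, "Std. Error"]

print(c(by_hand = diff_by_hand, refit = diff_refit, se = se_refit))
```

The two estimates agree because the two fits are reparameterizations of the same model; only the refit reports the standard error without further work.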
For procs logistic, genmod, phreg, and surveylogistic, you can use the ref= option, as follows:
proc logistic data=ds;
  class classvar (param=ref ref="name-of-ref-group");
  model y = classvar;
run;
Unfortunately, changing the reference in SAS is awkward for other procedures. The SAS default is to make the last category the referent, where "last" is determined by sorting the category values in character order. To change this, use the order option, which is frequently an option to the class statement but sometimes an option to the proc statement. If the desired referent is the first category, you can make it the referent by sorting on the variable in descending order, so that it appears last in the data, and then using the order=data option:
proc sort data=ds; by descending classvar; run;
proc glm data=ds order=data;
  class classvar;
  model y = classvar;
run;
If your desired reference category is lexicographically in the middle of the list, your best bet is to re-code the categories. My colleague Sheryl Rifas-Shiman renames the labels as, e.g., "a. blue", "b. other", "c. brown". Then sort on the new variable and use the order=data approach. You might also get lucky by sorting on some other variable in the data set and using order=data.
As an example, we consider the simple analysis of covariance discussed in section 3.7.2. The default reference cell for substance is heroin. We can replace this with alcohol using the sorting approach.
proc import datafile='c:/book/help.dta' out=help_a dbms=dta;
run;
proc sort data=help_a; by descending substance; run;
proc glm data=help_a order=data;
  class substance;
  model i1 = age substance age * substance / solution;
run;
Parameter                     Estimate      Standard Error   t Value   Pr > |t|
Intercept                  7.913018261 B        6.79251599      1.16     0.2447
AGE                        0.557076729 B        0.17437966      3.19     0.0015
SUBSTANCE heroin          -2.600851794 B        9.66168958     -0.27     0.7879
SUBSTANCE cocaine          7.853879213 B       10.16492979      0.77     0.4401
SUBSTANCE alcohol          0.000000000 B         .               .        .
AGE*SUBSTANCE heroin      -0.450423400 B        0.26525085     -1.70     0.0902
AGE*SUBSTANCE cocaine     -0.662468379 B        0.27702051     -2.39     0.0172
AGE*SUBSTANCE alcohol      0.000000000 B         .               .        .
Note that SAS creates the levels for the interaction based on the same implied indicator variables.
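In R there are several options for changing the reference cell. By default, R treats the first level of a factor as the referent. The simplest way to change it may be the relevel() function, whose two arguments are the factor and the desired reference category. The as.factor() function can be nested within relevel() if necessary, for example when the variable was read in as character data; since R 4.0.0, read.csv() no longer converts strings to factors unless stringsAsFactors=TRUE is specified.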
> ds = read.csv("http://www.math.smith.edu/sasr/datasets/help.csv", stringsAsFactors=TRUE)
> lm3 = lm(i1 ~ relevel(substance, "alcohol") * age, data=ds)
> summary(lm3)
Call:
lm(formula = i1 ~ relevel(substance, "alcohol") * age, data=ds)

Residuals:
    Min      1Q  Median      3Q     Max
-34.653  -9.625  -4.832   5.576 102.891

Coefficients:
                                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                                 7.9130     6.7925   1.165   0.2447
relevel(substance, "alcohol")cocaine        7.8539    10.1649   0.773   0.4401
relevel(substance, "alcohol")heroin        -2.6009     9.6617  -0.269   0.7879
age                                         0.5571     0.1744   3.195   0.0015 **
relevel(substance, "alcohol")cocaine:age   -0.6625     0.2770  -2.391   0.0172 *
relevel(substance, "alcohol")heroin:age    -0.4504     0.2653  -1.698   0.0902 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.7 on 447 degrees of freedom
Multiple R-squared: 0.2268, Adjusted R-squared: 0.2181
F-statistic: 26.22 on 5 and 447 DF, p-value: < 2.2e-16
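Two of the other options in R can be sketched briefly. The example below uses a small made-up character vector named substance rather than the HELP data, and picks heroin as the referent purely for illustration. It shows as.factor() nested inside relevel(), and the alternative of listing the levels explicitly in factor(), where the first level listed becomes the referent under R's default treatment contrasts.

```r
# Hypothetical character vector standing in for a categorical covariate
substance <- c("heroin", "cocaine", "alcohol", "cocaine", "heroin", "alcohol")

# Default behavior: levels are sorted, so alcohol would be the referent
default_levels <- levels(as.factor(substance))
print(default_levels)

# Option 1: nest as.factor() within relevel() to choose the referent
f1 <- relevel(as.factor(substance), ref = "heroin")

# Option 2: give the levels explicitly; the first one listed is the referent
f2 <- factor(substance, levels = c("heroin", "alcohol", "cocaine"))

print(levels(f1))
print(levels(f2))
```

Either factor can then be used in a model formula in place of the relevel() call shown above. Creating the releveled variable once, before fitting, also keeps the coefficient labels in the output shorter.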