When including categorical covariates in regression models, there is a question of how to incorporate the categories. One simple method is to generate indicator variables, sometimes called dummy variables. We go into some detail about the parameterization of categorical covariates in the SAS and R book, section 3.1.3.

In the indicator variable approach, new dichotomous variables are generated for all but one of the categories; these have a value of 1 if the subject is in the category and 0 otherwise. SAS and R each have simple ways to do this without explicitly creating new variables. In SAS, many procedures accept a

`class`statement, while in R a variable can be defined as a factor, for example by using

`as.factor`.

Let's consider a simple example with the following display of a categorical variable and the resulting indicators.

id catvar indA indB indC

1 A 1 0 0

2 B 0 1 0

3 D 0 0 0

4 C 0 0 1

5 B 0 1 0

6 D 0 0 0

7 A 1 0 0

When we fit the model, the parameter associated with the

`indA`variable is an estimate of the difference between categories A and D. But what if we want the difference between A and C? Well, we can take out our calculators, but we'd also like the standard error of that estimated difference. One way to do this is to change the reference category, and that is what we'll explore today. In a future entry, we'll demonstrate how to calculate arbitrary comparisons, or contrasts, without refitting the model. That method is likely superior to the one shown here, but as consulting statisticians, the question "how do I change the reference category" is one we often answer.

**SAS**

For procs

`logistic`,

`genmod`,

`phreg`, and

`surveylogistic`, you can use the

`ref=`option, as follows:

proc logistic data=ds;

class classvar (param=ref ref="name-of-ref-group");

model y = classvar;

run;

Unfortunately, changing the reference in SAS is awkward for other procedures. The SAS default is to make the last category the referent, when last is determined by ordering the characters. To change this, use the

`order`option, frequently an option to the

`class`statement but sometimes an option to the

`proc`statement. If the desired referent is the first category, you can make it the referent by sorting on the variable in descending order and then using the

`order=data`option:

proc sort data=ds; by descending classvar; run;

proc glm data=ds order=data;

class classvar;

model y = classvar;

run;

If your desired reference category is lexicographically in the middle of the list, your best bet is to re-code the categories. My colleague Sheryl Rifas-Shiman renames the labels as, e.g., "a. blue", "b. other", "c. brown". Then sort on the new variable and use the

`order=data`approach. You might also get lucky by sorting on some other variable in the data set and using

`order=data`.

As an example, we consider the simple analysis of covariance discussed in section 3.7.2. The default reference cell for

`substance`is

`heroin`. We can replace this with

`alcohol`using the sorting approach.

proc import datafile='c:/book/help.dta' out=help_a dbms=dta;

run;

proc sort data=help_a; by descending substance; run;

proc glm data=help_a order=data;

class substance;

model i1 = age substance age * substance / solution;

run; quit;

Standard

Parameter Estimate Error t Value Pr > |t|

Intercept 7.913018261 B 6.79251599 1.16 0.2447

AGE 0.557076729 B 0.17437966 3.19 0.0015

SUBSTANCE heroin -2.600851794 B 9.66168958 -0.27 0.7879

SUBSTANCE cocaine 7.853879213 B 10.16492979 0.77 0.4401

SUBSTANCE alcohol 0.000000000 B . . .

AGE*SUBSTANCE heroin -0.450423400 B 0.26525085 -1.70 0.0902

AGE*SUBSTANCE cocaine -0.662468379 B 0.27702051 -2.39 0.0172

AGE*SUBSTANCE alcohol 0.000000000 B . . .

Note that SAS creates the levels for the interaction based on the same implied indicator variables.

**R**

In R there are several options for changing the reference cell. The simplest of these may be the

`relevel()`function. The two arguments are the factor name and the desired reference category. The

`as.factor()`function can be nested within

`relevel()`if necessary.

> ds = read.csv("http://www.math.smith.edu/sasr/datasets/help.csv")

> lm3 = lm(i1 ~ relevel(substance, "alcohol") * age, data=ds)

> summary(lm3)

Call:

lm(formula = i1 ~ relevel(substance, "alcohol") * age, data=ds)

Residuals:

Min 1Q Median 3Q Max

-34.653 -9.625 -4.832 5.576 102.891

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.9130 6.7925 1.165 0.2447

relevel(substance, "alcohol")cocaine 7.8539 10.1649 0.773 0.4401

relevel(substance, "alcohol")heroin -2.6009 9.6617 -0.269 0.7879

age 0.5571 0.1744 3.195 0.0015 **

relevel(substance, "alcohol")cocaine:age -0.6625 0.2770 -2.391 0.0172 *

relevel(substance, "alcohol")heroin:age -0.4504 0.2653 -1.698 0.0902 .

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.7 on 447 degrees of freedom

Multiple R-squared: 0.2268, Adjusted R-squared: 0.2181

F-statistic: 26.22 on 5 and 447 DF, p-value: < 2.2e-16

## 4 comments:

Thanks this was very helpful!

In SAS, there's another trick, which prevents you from having to sort or recode the data: you can define and apply a format. By default, the CLASS statement orders according to formatted values of the categorical variable.

In general, I view almost all use of formats as dangerous and bad practice. The problem with them is that they disguise the actual data, which is what we should be paying attention to. If the formats are not provided, they also limit replicability when sharing data sets. About the only exceptions that come to mind are huge data sets where recoding can add computational burden, and dates, though even there you can get in trouble.

For example, a valued colleague just sent me an addendum to a data set, where the new data has to be merged by date. The colleague helpfully noted that the dates in his Excel spreadsheet were "absolute" (with values 1-365) while the dates in my SAS dataset ranged from 01JAN1960.

A casual SAS programmer unfamiliar with formats might not realize that the actual stored values of the SAS dates were also "absolute" (actually relative to another date in the data set) and that the apparent dates in 1960 were due to the formatting of the absolute dates relative to the SAS default begin date.

In general, while formatting might arguably get you the reference category you want without recoding, I strongly recommend against it.

The right solution is for other SAS procs to adopt the more specific syntax used in proc logistic and a few other procs, so that neither recoding nor formats are needed.

Just what I was looking for! Thank you!

Post a Comment