Tuesday, September 28, 2010

Example 8.7: Hosmer and Lemeshow goodness-of-fit

The Hosmer and Lemeshow goodness of fit (GOF) test is a way to assess whether there is evidence for lack of fit in a logistic regression model. Simply put, the test compares the expected and observed number of events in bins defined by the predicted probability of the outcome. This can be calculated in R and SAS.

R

In R, we write a simple function to calculate the statistic and a p-value, based on vectors of observed and predicted probabilities. We use the cut() function (1.4.10) in concert with the quantile() function (2.1.5) to make the bins, then calculate the observed and expected counts, the chi-square statistic, and finally the associated p-value (Table 1.1). The function allows the user to define the number of bins but uses the common default of 10.

hosmerlem = function(y, yhat, g=10) {
cutyhat = cut(yhat,
breaks = quantile(yhat, probs=seq(0,
1, 1/g)), include.lowest=TRUE)
obs = xtabs(cbind(1 - y, y) ~ cutyhat)
expect = xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
chisq = sum((obs - expect)^2/expect)
P = 1 - pchisq(chisq, g - 2)
return(list(chisq=chisq,p.value=P))
}

We'll run it with some of the HELP data (available at the book web site). Note that fitted(object) returns the predicted probabilities, if the object is the result of a call to glm() with family=binomial.

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
attach(ds)
logreg = glm(homeless ~ female + i1 + cesd + age + substance,
family=binomial)
hosmerlem(y=homeless, yhat=fitted(logreg))

This returns the following output:

$chisq
[1] 8.495386
$p.value
[1] 0.3866328

The Design package, by Frank Harrell, includes the le Cessie and Houwelingen test (another goodness-of-fit test, Biometrics 1991 47:1267) and is also easy to run, though it requires using the package's function for logistic regression.

library(Design)
mod = lrm(homeless ~ female + i1 + cesd + age + substance,
x=TRUE, y=TRUE, data=ds)
resid(mod, 'gof')


Sum of squared errors Expected value|H0 SD
104.1091804 103.9602955 0.1655883
Z P
0.8991269 0.3685851

The two tests are reassuringly similar.

SAS

In SAS, the Hosmer and Lemeshow goodness of fit test is generated with the lackfit option to the model statement in proc logistic (section 4.1.1). (We select out the results using the ODS system.)

ods select lackfitpartition lackfitchisq;
proc logistic data="c:\book\help.sas7bdat";
class substance female;
model homeless = female i1 cesd age substance / lackfit;
run;

This generates the following output:

Partition for the Hosmer and Lemeshow Test

HOMELESS = 1 HOMELESS = 0
Group Total Observed Expected Observed Expected

1 45 10 12.16 35 32.84
2 45 12 14.60 33 30.40
3 45 15 15.99 30 29.01
4 45 17 17.20 28 27.80
5 45 27 18.77 18 26.23
6 45 20 20.28 25 24.72
7 45 23 22.35 22 22.65
8 45 25 25.04 20 19.96
9 45 28 27.67 17 17.33
10 48 32 34.95 16 13.05

Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square DF Pr > ChiSq

8.4786 8 0.3882

The partition table shows the observed and expected count or events in each decile of the predicted probabilities.

The discrepancy between the SAS and R results is likely due to the odd binning SAS uses; the test is unstable in the presence of ties to the extent that some authorities suggest avoiding it. In general, with continuous predictors, the objections are not germane.

Tuesday, September 21, 2010

Example 8.6: Changing the reference category for categorical variables

How can we change the reference category for a categorical variable? This question comes up often in a consulting practice.

When including categorical covariates in regression models, there is a question of how to incorporate the categories. One simple method is to generate indicator variables, sometimes called dummy variables. We go into some detail about the parameterization of categorical covariates in the SAS and R book, section 3.1.3.

In the indicator variable approach, new dichotomous variables are generated for all but one of the categories; these have a value of 1 if the subject is in the category and 0 otherwise. SAS and R each have simple ways to do this without explicitly creating new variables. In SAS, many procedures accept a class statement, while in R a variable can be defined as a factor, for example by using as.factor.

Let's consider a simple example with the following display of a categorical variable and the resulting indicators.

id catvar indA indB indC
1 A 1 0 0
2 B 0 1 0
3 D 0 0 0
4 C 0 0 1
5 B 0 1 0
6 D 0 0 0
7 A 1 0 0

When we fit the model, the parameter associated with the indA variable is an estimate of the difference between categories A and D. But what if we want the difference between A and C? Well, we can take out our calculators, but we'd also like the standard error of that estimated difference. One way to do this is to change the reference category, and that is what we'll explore today. In a future entry, we'll demonstrate how to calculate arbitrary comparisons, or contrasts, without refitting the model. That method is likely superior to the one shown here, but as consulting statisticians, the question "how do I change the reference category" is one we often answer.

SAS

For procs logistic, genmod, phreg, and surveylogistic, you can use the ref= option, as follows:

proc logistic data=ds;
class classvar (param=ref ref="name-of-ref-group");
model y = classvar;
run;

Unfortunately, changing the reference in SAS is awkward for other procedures. The SAS default is to make the last category the referent, when last is determined by ordering the characters. To change this, use the order option, frequently an option to the class statement but sometimes an option to the proc statement. If the desired referent is the first category, you can make it the referent by sorting on the variable in descending order and then using the order=data option:

proc sort data=ds; by descending classvar; run;

proc glm data=ds order=data;
class classvar;
model y = classvar;
run;

If your desired reference category is lexicographically in the middle of the list, your best bet is to re-code the categories. My colleague Sheryl Rifas-Shiman renames the labels as, e.g., "a. blue", "b. other", "c. brown". Then sort on the new variable and use the order=data approach. You might also get lucky by sorting on some other variable in the data set and using order=data.

As an example, we consider the simple analysis of covariance discussed in section 3.7.2. The default reference cell for substance is heroin. We can replace this with alcohol using the sorting approach.

proc import datafile='c:/book/help.dta' out=help_a dbms=dta;
run;

proc sort data=help_a; by descending substance; run;

proc glm data=help_a order=data;
class substance;
model i1 = age substance age * substance / solution;
run; quit;
Standard
Parameter Estimate Error t Value Pr > |t|

Intercept 7.913018261 B 6.79251599 1.16 0.2447
AGE 0.557076729 B 0.17437966 3.19 0.0015
SUBSTANCE heroin -2.600851794 B 9.66168958 -0.27 0.7879
SUBSTANCE cocaine 7.853879213 B 10.16492979 0.77 0.4401
SUBSTANCE alcohol 0.000000000 B . . .
AGE*SUBSTANCE heroin -0.450423400 B 0.26525085 -1.70 0.0902
AGE*SUBSTANCE cocaine -0.662468379 B 0.27702051 -2.39 0.0172
AGE*SUBSTANCE alcohol 0.000000000 B . . .

Note that SAS creates the levels for the interaction based on the same implied indicator variables.

R

In R there are several options for changing the reference cell. The simplest of these may be the relevel() function. The two arguments are the factor name and the desired reference category. The as.factor() function can be nested within relevel() if necessary.

> ds = read.csv("http://www.math.smith.edu/sasr/datasets/help.csv")
> lm3 = lm(i1 ~ relevel(substance, "alcohol") * age, data=ds)
> summary(lm3)
Call:
lm(formula = i1 ~ relevel(substance, "alcohol") * age, data=ds)

Residuals:
Min 1Q Median 3Q Max
-34.653 -9.625 -4.832 5.576 102.891

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.9130 6.7925 1.165 0.2447
relevel(substance, "alcohol")cocaine 7.8539 10.1649 0.773 0.4401
relevel(substance, "alcohol")heroin -2.6009 9.6617 -0.269 0.7879
age 0.5571 0.1744 3.195 0.0015 **
relevel(substance, "alcohol")cocaine:age -0.6625 0.2770 -2.391 0.0172 *
relevel(substance, "alcohol")heroin:age -0.4504 0.2653 -1.698 0.0902 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.7 on 447 degrees of freedom
Multiple R-squared: 0.2268, Adjusted R-squared: 0.2181
F-statistic: 26.22 on 5 and 447 DF, p-value: < 2.2e-16

Tuesday, September 14, 2010

Example 8.5: bubble plots part 3



An anonymous commenter expressed a desire to see how one might use SAS to draw a bubble plot with bubbles in three colors, corresponding to a fourth variable in the data set. (x, y, z for bubble size, and the category variable.) In a previous entries we discussed bubble plots and showed how to make the bubble print in two colors depending a fourth dichotomous variable.

The SAS approach to this cannot be extended to fourth variables with many values: we show here an approach to generating this output. The R version below represents a trivial extension of the code demonstrated earlier.

SAS

We'll start by making some data-- 20 observations in each of 3 categories.

data testbubbles;
do cat = 1 to 3;
do i = 1 to 20;
abscissa = normal(0);
ordinate = normal(0);
z = uniform(0);
output;
end;
end;
run;

Our approach will be to make an annotate data set using the annotate macros (section 5.2). The %slice macro easily draws filled circles. Check its documentation for full details on the parameters it needs in the on-line help: SAS Products; SAS/GRAPH; The Annotate Facility; Annotate Dictionary. Here we note that the 5th parameter is the radius of the circle, chosen here as an arbitrary function of z that makes pleasingly sized circles. Other parameters reflect color density, arc, and starting angle, which could be used to represent additional variables.

%annomac;
data annobub1;
set testbubbles;
%system(2,2,3);
%slice(abscissa, ordinate, 0, 360, sqrt(3*z), green, ps, 0);
run;

Unfortunately, due to a quirk of the macro facility, I don't think the color can be changed conditionally in the preceding step. Instead, we need a new data step to do this.

data annobub2;
set annobub1;
if cat=2 then color="red";
if cat=3 then color="blue";
run;

Now we're ready to plot. We use the symbol (section 5.2.2) statement to tell proc gplot not to plot the data, add the annotate data set, and suppress the legend, as the default legend will not look correct here. An appropriate legend could be generated with a legend statement.

symbol1 i=none r=3;
proc gplot data=testbubbles;
plot ordinate * abscissa = cat / annotate = annobub2 nolegend;
run;
quit;

The resulting plot is shown above. Improved axes are demonstrated throughout the book and in many previous blog posts.

R

The R approach merely requires passing three colors to the bg option in the symbols() function. To mimic SAS, we'll start by defining some data, then generate the vector of colors needed.

cat = rep(c(1, 2, 3), each=20)
abscissa = rnorm(60)
ordinate = rnorm(60)
z = runif(60)
plotcolor = ifelse(cat==1, "green", ifelse(cat==2, "red", "blue"))

The nested calls to the ifelse function (section 1.11.2) allow vectorized conditional tests with more than two possibilities. Another option would be to use a for loop (section 1.11.1) but this would be avoiding one of the strengths of R. In this example, I suppose I could have defined the cat vector with the color values as well, and saved some keystrokes.

With the data generated and the color vector prepared, we need only call the symbols() function.

symbols(ordinate, abscissa, circles=z, inches=1/5, bg=plotcolor)

The resulting plot is shown below.

Tuesday, September 7, 2010

Example 8.4: Including subsetting conditions in output

A number of analyses perform operations on subsets. Making it clear what observations have been excluded or included is helpful to include in the output.

SAS

The where statement (section A.6.3) is a powerful and useful tool for subsetting on the fly. (Other options for subsetting typically require data steps or data set options in the proc statement.) But one problem with subsetting in SAS is that the output doesn't show the conditioning. This is a little macro to correct this problem:

%macro where(wherest, titleline);
where &wherest;
title&titleline "where %str(&wherest)";
%mend where;

If you use %where(condition, n); in place of where condition; the condition is both implemented and will appear in the output as the nth title line. This takes advantage of the fact that the title statements just need to appear before the procedure is run, as opposed to before the proc statement.

For example:

title "HELP data set";
proc means data="c:\book\help.sas7bdat";
%where (female eq 1,2);
var age;
run;

Produces the output:

HELP data set
where female eq 1

The MEANS Procedure

Analysis Variable : AGE

Mean
------------
36.2523364
------------

Note that you need to use the testing forms such as eq and not the assignment = in your where statements for this to work.

This tip is cross-posted on my list of SAS tricks.

R

Within R, many functions (e.g. lm()) provide support for a subset option (as in example 3.7.5). Indexing can be used to restrict analyses (B.4.2). In addition, the subset() function can be embedded in a function call to make explicit what subset of observations will be included in the analysis. We demonstrate calculating the mean of female subjects all three ways below.

> ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
> coef(lm(age ~ 1, subset=female==1, data=ds))
(Intercept)
36.25234
> mean(ds$age[ds$female==1])
[1] 36.25234
> mean(subset(ds,female==1,age))
age
36.25234


In interactive R the code and output lie together and it is usually trivial to trace the subsetting used to produce output. But in literate programming settings, or when output may be cut-and-pasted from R to a document, it may be valuable to mirror the SAS macro presented above.

Here is a function to display the subsetting condition for simple functions. It requires some sophistication to turn expressions and variables into objects (and vice-versa). The use of deparse(substitute()) allows us to create a character variable with the name of the subset condition and function, while eval() is used to pass the name of the variable to the subset() function. A similar functionality for more complex functions (such as those requiring a formula, would require some tweaking; a generic explicit subsetting function could be difficult.

verbose.subset = function(ds, condition, variable, FUN=mean) {
cat(paste("subsetting with condition: ",
deparse(substitute(condition)),
"\n", sep=""))
cat(paste("calling the function: ",
deparse(substitute(FUN)),
"\n", sep=""))
FUN(subset(ds, condition, select=eval(variable)))
}

This yields the following output:

> verbose.subset(ds, female==1, "age")
subsetting with condition: female == 1
calling the function: mean
age
36.25234