Tuesday, December 28, 2010

Tools to tidy up R code

Last week we made an impassioned plea for attention to style in formatting R and SAS code.

While it's always better to adopt a consistent style and use it whenever you write code, the reality is that sometimes formatting slips (or you end up repurposing code that others wrote). In those situations, the formatR package (due to Yihui Xie) is a useful tool to help tidy up code.

Imagine that the poorly formatted and somewhat obfuscated code from the last entry has been copied to the clipboard:

ds=read.csv("http://www.math.smith.edu/r/data/help.csv");attach(ds)
fOo=ks.test(age[female==1],age[female==0],data=ds)
plotdens=function(x,y,mytitle, mylab){densx = density(x)
densy = density(y);plot(densx,main=mytitle,lwd=3,xlab=mylab,
bty="l");lines(densy,lty=2,col=2,lwd=3);xvals=c(densx$x,
rev(densy$x));yvals=c(densx$y,rev(densy$y));polygon(xvals,
yvals,col="gray")};mytitle=paste("Test of ages: D=",round(fOo$statistic,3),
" p=",round(fOo$p.value,2),sep="");plotdens(age[female==1],
age[female==0],mytitle=mytitle,mylab="age (in years)")
legend(50,.05,legend=c("Women","Men"),col=1:2,lty=1:2,lwd=2)

Running the commands:

library(formatR)
tidy.source()

will generate the following output:

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
attach(ds)
fOo = ks.test(age[female == 1], age[female == 0], data = ds)
plotdens = function(x, y, mytitle, mylab) {
    densx = density(x)
    densy = density(y)
    plot(densx, main = mytitle, lwd = 3, xlab = mylab, bty = "l")
    lines(densy, lty = 2, col = 2, lwd = 3)
    xvals = c(densx$x, rev(densy$x))
    yvals = c(densx$y, rev(densy$y))
    polygon(xvals, yvals, col = "gray")
}
mytitle = paste("Test of ages: D=", round(fOo$statistic, 3), " p=",
    round(fOo$p.value, 2), sep = "")
plotdens(age[female == 1], age[female == 0], mytitle = mytitle,
    mylab = "age (in years)")

The cleaned-up code is much easier to parse, with minimal effort. Some quibbles: I'm not as fond of using 4 spaces of indentation, and I always worry what'll happen if there are syntax errors (yipes!). But this is a useful tool to have in your toolbox.

Wednesday, December 22, 2010

A plea for consistent style!

As we get close to the end of the year, it's time to look back over the past year and think of resolutions for 2011 and beyond. One that's often on my mind relates to ways to structure my code to make it clearer to others (as well as to myself when I look back upon it months later).

Style guides are common in many programming languages, and are often purported to increase the readability and legibility of code, as well as minimize errors. The Wikipedia page on this topic describes the importance of indentation, spacing, alignment, and other formatting conventions.

Many stylistic conventions are appropriate for statistical code written in SAS and R, and can help to make code clearer and easier to comprehend. Consider the difference between:

ds=read.csv("http://www.math.smith.edu/r/data/help.csv");attach(ds)
fOo=ks.test(age[female==1],age[female==0],data=ds)
plotdens=function(x,y,mytitle, mylab){densx = density(x)
densy = density(y);plot(densx,main=mytitle,lwd=3,xlab=mylab,
bty="l");lines(densy,lty=2,col=2,lwd=3);xvals=c(densx$x,
rev(densy$x));yvals=c(densx$y,rev(densy$y));polygon(xvals,
yvals,col="gray")};mytitle=paste("Test of ages: D=",round(fOo$statistic,3),
" p=",round(fOo$p.value,2),sep="");plotdens(age[female==1],
age[female==0],mytitle=mytitle,mylab="age (in years)")
legend(50,.05,legend=c("Women","Men"),col=1:2,lty=1:2,lwd=2)

and

# code example from the Using R for Data Management, Statistical
# Analysis and Graphics book
# Nicholas Horton, Smith College December 21, 2010
#
ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
attach(ds)

# fit KS test and save object containing p-value
ksres = ks.test(age[female==1], age[female==0], data=ds)

# define function to plot two densities on the same graph
plotdens = function(x, y, mytitle, mylab) {
    densx = density(x)
    densy = density(y)
    plot(densx, main=mytitle, lwd=3, xlab=mylab, bty="l")
    lines(densy, lty=2, col=2, lwd=3)
    xvals = c(densx$x, rev(densy$x))
    yvals = c(densx$y, rev(densy$y))
    polygon(xvals, yvals, col="gray")
}

# craft specialized title containing statistic and p-value
mytitle = paste("Test of ages: D=",
    round(ksres$statistic,3),
    " p=", round(ksres$p.value, 2),
    sep="")

plotdens(age[female==1], age[female==0],
    mytitle=mytitle, mylab="age (in years)")

legend(50, .05, legend=c("Women", "Men"), col=1:2, lty=1:2,
    lwd=2)

While the first example has the advantage of using considerably fewer lines, its readability suffers dramatically. Appropriate indentation, white space, and comments help the analyst when debugging and foster easier reuse in the future. In settings where code review is undertaken, sharing a set of common standards is eminently sensible.

SAS

A specific but somewhat cursory style manual for SAS can be found at the SAS community Style guide for writing and polishing programs. I like the start of this guide, though it is incomplete at present. Other useful words of wisdom can be found here and here.

R

Google's R Style Guide is chock full of tips and guidelines to make R code easier to read, share and verify. Another source of ideas is Henrik Bengtsson's draft R coding conventions. While one can quibble about some of the specific suggestions, overall, the effect of adherence to such a style guide is code that is easier to understand and less likely to hide errors.

Some coders are fundamentalists in insisting on "the correct" style. In general, however, it is more important to develop a sensible, interpretable, and coherent style of your own than to adhere to styles that you find awkward, whatever their provenance. The links above provide some common sense tips that can help improve productivity and make you a better analyst.

Monday, December 13, 2010

Example 8.18: A Monte Carlo experiment

In recent weeks, we've explored methods to fit logistic regression models when a state of quasi-complete separation exists. We considered Firth's penalized likelihood approach, exact logistic regression, and Bayesian models using Markov chain Monte Carlo (MCMC).

Today we'll show how to build a Monte Carlo experiment to compare these approaches. Suppose we have 100 observations with x=0 and 100 with x=1, and suppose that Pr(Y=1|X=0) = 0.001, while Pr(Y=1|X=1) = 0.05. Thus the true odds ratio is (0.05/0.95)/(0.001/0.999) = 52.6 and the log odds ratio we want to recover is 3.96. But we will rarely observe any y=1 when x=0. Which of these approaches is most likely to give us acceptable results?
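As a quick check of that arithmetic, the target values can be computed directly in R:

p0 = 0.001   # Pr(Y=1|X=0)
p1 = 0.05    # Pr(Y=1|X=1)
trueor = (p1/(1 - p1)) / (p0/(1 - p0))
trueor        # about 52.6
log(trueor)   # about 3.96, the log odds ratio we hope to recover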

Note that in all of the MCMC analyses we use only 6000 iterations, which is likely too few to trust in practice.

The code is long enough here that we annotate within rather than write much text around the code.

SAS

All the SAS procedures used accept the events/trials syntax (section 4.1.1), so we'll generate example data sets as two observations of binomial random variates with the probabilities noted above. We also make extensive use of the ODS system to suppress all printed output (section A.7.1) and to save desired pieces of output as SAS data sets. The latter usage requires first using the ods trace on/listing statement to find the name of the output before saving it. Finally, we use the by statement (section A.6.2) to replicate the analysis for each simulated data set.

data rlog;
do trial = 1 to 100;
/* each "trial" is a simulated data set with two observations
containing the observed number of events with x=0 or x=1 */
x=0; events = ranbin(0,100,.001); n=100; output;
x=1; events = ranbin(0,100,.05); n=100; output;
end;
run;

ods select none; /* omit _all_ printed output */

ods output parameterestimates=glm; /* save the estimated betas */
proc logistic data = rlog;
by trial;
model events / n=x; /* ordinary logistic regression */
run;

ods output parameterestimates=firth; /* save the estimated betas */
/* note the output data set has the same name
as in the uncorrected glm */
proc logistic data = rlog;
by trial;
model events / n = x / firth; /* do the firth bias correction */
run;

ods output exactparmest=exact;
/* the exact estimates have a different name under ODS */
proc logistic data=rlog;
by trial;
model events / n = x;
exact x / estimate; /* do the exact estimation */
run;

data prior;
input _type_ $ Intercept x;
datalines;
Var 25 25
Mean 0 0
;
run;

ods output postsummaries=mcmc;
proc genmod data = rlog;
by trial;
model events / n = x / dist=bin;
bayes nbi=1000 nmc=6000
coeffprior=normal(input=prior) diagnostics=none
statistics=summary;
/* do the Bayes regression, using the prior made in the
previous data step */
run;

Now I have four data sets with parameter estimates in them. I could use them separately, but I'd like to merge them together. I can do this with the merge statement (section 1.5.7) in a data step. I also need to drop the lines with the estimated intercepts and rename the variables that hold the parameter estimates. The renaming is necessary because the names are duplicated across the output data sets, and desirable because it allows meaningful names. In any event, I can use the where and rename data set options to include these modifications as I do the merge. I'll also add the number of events when x=0 and when x=1, which requires merging in the original data twice.

data lregsep;
    merge
        glm (where = (variable = "x") rename = (estimate = glm))
        firth (where = (variable = "x") rename = (estimate = firth))
        exact (rename = (estimate = exact))
        mcmc (where = (parameter = "x") rename = (mean=mcmc))
        rlog (where = (x = 1) rename = (events = events1))
        rlog (where = (x = 0) rename = (events = events0));
    by trial;
run;

ods select all; /* now I want to see the output! */
/* check to make sure the output dataset looks right */
proc print data = lregsep (obs = 5) ;
var trial glm firth exact mcmc;
run;

/* what do the estimates look like? */
proc means data=lregsep;
var glm firth exact mcmc;
run;

This produces the following output:

Obs trial glm firth exact mcmc

1 1 12.7866 2.7803 2.3186 3.9635
2 2 12.8287 3.1494 2.7223 4.0304
3 3 10.7192 1.6296 0.8885 2.5613
4 4 11.7458 2.2378 1.6906 3.3409
5 5 10.7192 1.6296 0.8885 2.5115

Variable Mean Std Dev
----------------------------------------
glm 10.6971252 3.4362801
firth 2.2666700 0.5716097
exact 1.8237047 0.5646224
mcmc 3.1388274 0.9620103
----------------------------------------

The ordinary logistic estimates are entirely implausible, while the three alternate approaches are more acceptable. The MCMC result has the least bias, but it's unclear to what degree this is a happy coincidence between the odds ratio and the prior precision. The Firth approach appears to be less biased than the exact logistic regression.

R
The R version is roughly analogous to the SAS version. The notable differences are that 1) I need three forms of the data: the "weights" version (see example 8.15) for the glm() and logistf() functions, the events/trials syntax for the elrm() function, and the expanded (one row per observation) version for the MCMClogit() function; 2) the sapply() function (section B.5.3) serves a similar purpose to the by statement in SAS; and 3) rather than spelunking through the ods trace output to find the parameter estimates, I used the str() function (section 1.3.2) to figure out where the estimates are stored in the output objects, and indexes (rather than data set options) to pull out the one estimate I need.


# make sure the needed packages are present
require(logistf)
require(elrm)
require(MCMCpack)
# the runlogist() function generates a dataset and runs each analysis
# the parameter "trial" keeps track of which time we're calling runlogist()
runlogist = function(trial) {
    # the result vector will hold the estimates temporarily
    result = matrix(0,4)
    # generate the number of events once
    events.0 = rbinom(1, 100, .001)   # for x = 0
    events.1 = rbinom(1, 100, .05)    # for x = 1
    # following for glm and logistf "weights" format
    xw = c(0,0,1,1)
    yw = c(0,1,0,1)
    ww = c(100 - events.0, events.0, 100 - events.1, events.1)
    # run the glm and logistf, grab the estimates, and stick
    # them into the results vector
    result[1] = glm(yw ~ xw, weights=ww, binomial)$coefficients[2]
    result[2] = logistf(yw ~ xw, weights=ww)$coefficients[2]
    # elrm() needs a data frame in the events/trials syntax
    elrmdata = data.frame(events=c(events.0, events.1), x=c(0,1),
        trials=c(100,100))
    # run it and grab the estimate
    result[3] = elrm(events/trials ~ x, interest = ~ x, iter = 6000,
        burnIn = 1000, data = elrmdata, r = 2)$coeffs
    # MCMClogit() needs expanded data
    x = c(rep(0,100), rep(1,100))
    y = c(rep(0, 100-events.0), rep(1, events.0),
        rep(0, 100-events.1), rep(1, events.1))
    # run it and grab the mean of the MCMC posteriors
    result[4] = summary(MCMClogit(y ~ as.factor(x), burnin=1000,
        mcmc=6000, b0=0, B0=.04,
        seed = list(c(781306, 78632467, 364981736, 6545634, 7654654,
            4584), trial)))$statistics[2,1]
    # send back the four estimates, plus the number of events
    # when x=0 and x=1
    return(c(trial, events.0, events.1, result))
}

Note the construction of the seed= option to the MCMClogit() function. This allows a different seed in every call without actually using sequential seeds.

Now we're ready to call the function repeatedly. We'll do that with the sapply() function, but we need to nest that inside a t() function call to get the estimates to appear as columns rather than rows, and we'll also make it a data frame in the same command. Note that the parameters we change within the sapply() function are merely a list of trial numbers. Finally, we'll add descriptive names for the columns with the names() function (section 1.3.4).

res2 = as.data.frame(t(sapply(1:10, runlogist)))
names(res2) <- c("trial","events.0","events.1", "glm",
"firth", "exact-ish", "MCMC")
head(res2)
mean(res2[,4:7], na.rm=TRUE)


trial events.0 events.1 glm firth exact-ish MCMC
1 1 0 6 18.559624 2.6265073 2.6269087 3.643560
2 2 1 3 1.119021 0.8676031 1.1822296 1.036173
3 3 0 5 18.366720 2.4489268 2.1308186 3.555314
4 4 0 5 18.366720 2.4489268 2.0452446 3.513743
5 5 0 2 17.419339 1.6295391 0.9021854 2.629160
6 6 0 9 17.997524 3.0382577 2.1573979 4.017105

glm firth exact-ish MCMC
17.333356 2.278344 1.813203 3.268243

The results are notably similar to SAS, except for the unacceptable glm() results.

In most Monte Carlo experimental settings, one would also be interested in examining the confidence limits for the parameter estimates. Notes and code for doing this can be found here. In a later entry we'll consider plots for the results generated above. As a final note, there are few combinations of event numbers with any mass worth considering. One could compute the probability of each of these and the associated parameter estimates, deriving a more analytic answer to the question. However, this would be difficult to replicate for arbitrary event probabilities and Ns, and very awkward for continuous covariates, while the above approach could be extended with trivial ease.
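To illustrate that last point, here is a small sketch (assuming the event probabilities used above) tabulating the joint probability of the plausible combinations of event counts; a handful of cells carry essentially all of the mass.

# joint probability of (events when x=0, events when x=1), assuming
# independent Binomial(100, .001) and Binomial(100, .05) counts
probs = outer(dbinom(0:3, 100, .001), dbinom(0:15, 100, .05))
dimnames(probs) = list(events.0 = 0:3, events.1 = 0:15)
round(probs, 3)
sum(probs)   # very close to 1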

Monday, December 6, 2010

Example 8.17: Logistic regression via MCMC



In examples 8.15 and 8.16 we considered Firth logistic regression and exact logistic regression as ways around the problem of separation, often encountered in logistic regression. (Re-cap: Separation happens when all the observations in a category share a result, or when a continuous covariate predicts the outcome too well. It results in a likelihood maximized when a parameter is extremely large, and causes trouble with ordinary maximum likelihood approaches.) Another option is to use Bayesian methods. Here we focus on Markov chain Monte Carlo (MCMC) approaches to Bayesian analysis.

SAS

SAS access to MCMC for logistic regression is provided through the bayes statement in proc genmod. There are several default priors available. The normal prior is the most flexible (in the software), allowing different prior means and variances for the regression parameters. The prior is specified through a separate data set. We begin by setting up the data in the events/trials syntax. Then we define a fairly vague prior for the intercept and the effect of the covariate: uncorrelated, and each with a mean of zero and a variance of 1000 (or a precision of 0.001). Finally, we call proc genmod to implement the analysis.


data testmcmc;
x=0; count=0; n=100; output;
x=1; count=5; n=100; output;
run;


data prior;
input _type_ $ Intercept x;
datalines;
Var 1000 1000
Mean 0 0
;
run;

title "Bayes with normal prior";
proc genmod descending data=testmcmc;
model count/n = x / dist=binomial link=logit;
bayes seed=10231995 nbi=1000 nmc=21000
coeffprior=normal(input=prior) diagnostics=all
statistics=summary;
run;

In the foregoing, nbi is the length of the burn-in and nmc is the total number of Monte Carlo iterations. The remaining options define the prior and request certain output. The diagnostics=all option generates many results, including posterior autocorrelations, Gelman-Rubin, Geweke, Raftery-Lewis, and Heidelberger-Welch diagnostics. The summary statistics are presented below; the diagnostics are not especially encouraging.

Posterior Summaries

Standard
Parameter N Mean Deviation

Intercept 21000 -20.3301 10.3277
x 21000 17.2857 10.3368

Posterior Summaries

Percentiles
Parameter 25% 50% 75%

Intercept -27.6173 -18.5558 -11.9025
x 8.8534 15.5267 24.6024

It seems that perhaps this prior is too vague. Perhaps we can make it a little more precise. A log odds ratio of 10 implies an odds ratio > 22,000, so perhaps we can accept a prior variance of 25, with about 95% of the prior weight between -10 and 10.
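A quick sanity check on those numbers in R:

exp(10)   # a log odds ratio of 10 is an odds ratio of about 22,026
qnorm(c(.025, .975), mean=0, sd=5)   # a N(0, 25) prior puts 95% of its mass within about +/- 9.8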

data prior;
input _type_ $ Intercept x;
datalines;
Var 25 25
Mean 0 0
;
run;

ods graphics on;
title "Bayes with normal prior";
proc genmod descending data=testmcmc;
model count/n = x / dist=binomial link=logit;
bayes seed=10231995 nbi=1000 nmc=21000
coeffprior=normal(input=prior) diagnostics=all
statistics=(summary interval) plot=all;
run;

Posterior Summaries

Standard
Parameter N Mean Deviation

Intercept 21000 -6.5924 1.7958
x 21000 3.5169 1.8431

Posterior Summaries

Percentiles
Parameter 25% 50% 75%

Intercept -7.6347 -6.3150 -5.2802
x 2.1790 3.2684 4.5929

Posterior Intervals

Parameter Alpha Equal-Tail Interval HPD Interval

Intercept 0.050 -10.8101 -3.8560 -10.2652 -3.5788
x 0.050 0.5981 7.7935 0.3997 7.4201

These are more plausible values, and the diagnostics are more favorable.
In the above, we added the keyword interval to generate confidence regions, and used the ods graphics on statement to enable ODS graphics and the plot=all option to generate the graphical output shown above.

R
There are several packages in R that include MCMC approaches. Here we use the MCMCpack package, which includes the MCMClogit() function. It appears not to accept the weights option mentioned previously, so we generate data at the observation level to begin. Then we run the MCMC.

events.0=0 # for X = 0
events.1=5 # for X = 1
x = c(rep(0,100), rep(1,100))
y = c(rep(0,100-events.0), rep(1,events.0),
rep(0, 100-events.1), rep(1, events.1))

library(MCMCpack)
logmcmc = MCMClogit(y~as.factor(x), burnin=1000, mcmc=21000, b0=0, B0=.04)

The MCMClogit() function accepts a formula object and allows the burn-in and number of Monte Carlo iterations to be specified. The prior mean b0 can be specified as a vector if different for each parameter, or as a scalar, as shown. Similarly, the prior precision B0 can be a matrix or a scalar; if scalar, the parameters are uncorrelated in the prior.
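One note on the correspondence with the SAS prior above: B0 is a precision (the reciprocal of the variance), so the variance-25 normal prior used in SAS translates directly to the value supplied here.

1/25   # = 0.04, the value passed as B0 in the MCMClogit() call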

> summary(logmcmc)

Iterations = 1001:22000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 21000

1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:

Mean SD Naive SE Time-series SE
(Intercept) -6.570 1.816 0.01253 0.03139
as.factor(x)1 3.513 1.859 0.01283 0.03363

2. Quantiles for each variable:

2.5% 25% 50% 75% 97.5%
(Intercept) -10.6591 -7.634 -6.299 -5.229 -3.831
as.factor(x)1 0.6399 2.147 3.292 4.599 7.698

plot(logmcmc)

The result of the plot() is shown below. These results and those from SAS are reassuringly similar. Many diagnostics are available through the coda package. The codamenu() function allows simple menu-based access to its tools.

Tuesday, November 30, 2010

Example 8.16: Exact logistic regression

In example 8.15, on Firth logistic regression, we mentioned alternative approaches to separation troubles. Here we demonstrate exact logistic regression. The code for this appears in the book (section 4.1.2) but we don't show an example of it there. We'll consider the setting of observing 100 subjects each with x=0 and x=1, observing no events when x=0, and 5 when x=1.

SAS
We'll create the data as a summary, rather than for every line of data. Then we can use the "events/trials" syntax (section 4.1.1) that both proc logistic and proc genmod accept. This is another way to reduce the size of data sets (along with the weight option mentioned previously) but is less generally useful. The exact statement in proc logistic will fit the exact logistic regression and generate a p-value. The estimate option is required to display the estimated log odds ratio.

data exact;
x=0; count=0; n=100; output;
x=1; count=5; n=100; output;
run;

proc logistic data=exact;
model count/n = x;
exact x / estimate;
run;

This generates the following output:

Exact Parameter Estimates

Standard 95% Confidence
Parameter Estimate Error Limits p-Value

x 1.9414* . -0.0677 Infinity 0.0594

NOTE: * indicates a median unbiased estimate.


R
In R we use the elrm() function in the elrm package to approximate exact logistic regression, as described in this paper by the package's authors. The function requires a special formula object with syntax identical to the SAS events/trials syntax. (Note that the function does not behave as expected when identical observations with trials=1 are submitted. Thus data should be collapsed into unique combinations of predictors before using the function.) In addition, it requires its data to be included in a data frame. We'll construct the data frame in one function call to data.frame().

elrmdata = data.frame(count=c(0,5), x=c(0,1), n=c(100,100))
library(elrm)
resexact = elrm(count/n ~ x, interest = ~x, iter=22000,
burnIn=2000, data=elrmdata, r=2)
summary(resexact)

producing the following result:

Call:
[[1]]
elrm(formula = count/n ~ x, interest = ~x, r = 2, iter = 22000,
dataset = elrmdata, burnIn = 2000)

Results:
estimate p-value p-value_se mc_size
x 2.0225 0.02635 0.0011 20000

95% Confidence Intervals for Parameters

lower upper
x -0.02065572 Inf

Differences between the SAS and R results most likely arise from the fact that the elrm() function is an approximation of the exact approach. The upper limit of infinity seen in the exact SAS analysis and approximate exact elrm() analysis reveals a limitation of this approach relative to the Firth approach seen in example 8.15 and the Bayesian approach we'll examine later.

A final note: if the true Pr(Y=1|X=1) = 0.05, then the true Pr(Y=1|X=0) that results in a log odds ratio of 1.94 is about 0.0075; for a log odds ratio of 2.02, the true probability is about 0.0069.
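That back-calculation is easy to verify in R; the small helper function below is just for illustration.

# given Pr(Y=1|X=1) and a log odds ratio, solve for Pr(Y=1|X=0)
p0.from.logor = function(p1, logor) {
    odds0 = (p1/(1 - p1)) / exp(logor)
    odds0 / (1 + odds0)
}
p0.from.logor(0.05, 1.94)   # about 0.0075
p0.from.logor(0.05, 2.02)   # about 0.0069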

Monday, November 29, 2010

SAS and R joins SAS-x

Tal Galili, organizer of the R-bloggers blog aggregator, has opened a new aggregator for people blogging about SAS. If you're unfamiliar with the concept, an aggregator is a single blog which republishes (with permission, in this case) the entries from many contributing blogs, gathering a dispersed community of interest to a single point in your browser or newsreader.

Called SAS-x, the new aggregator is starting up right now, and we're pleased to be included. We find R-bloggers to be a source of stimulating output from many different disciplines and hope that SAS-x will be as successful.

Monday, November 22, 2010

Example 8.15: Firth logistic regression

In logistic regression, when the outcome has low (or high) prevalence, or when there are several interacted categorical predictors, it can happen that for some combination of the predictors, all the observations have the same event status. A similar problem occurs when continuous covariates predict the outcome too perfectly.

This phenomenon, known as "separation" (including complete and quasi-complete separation) will cause problems fitting the model. Sometimes the only symptom of separation will be extremely large standard errors, while at other times the software may report an error or a warning.

One approach to handling this sort of problem is exact logistic regression, which we discuss in section 4.1.2. But exact logistic regression is complex and may require prohibitive computational resources. Another option is to use a Bayesian approach. Here we show how to use a penalized likelihood method originally proposed by Firth (1993 Biometrika 80:27-38) and described fully in this setting by Georg Heinze (2002 Statistics in Medicine 21:2409-2419 and 2006 25:4216-4226). A nice summary of the method is shown on a web page that Heinze maintains. In later entries we'll consider the Bayesian and exact approaches.

Update: see bottom of the post.

SAS
In SAS, the corrected estimates can be found using the firth option to the model statement in proc logistic. We'll set up the problem in the simple setting of a 2x2 table with an empty cell. Here, we simply output three observations with three combinations of predictor and outcome, along with a weight variable which contains the case counts in each cell of the table.

data testfirth;
pred=1; outcome=1; weight=20; output;
pred=0; outcome=1; weight=20; output;
pred=0; outcome=0; weight=200; output;
run;

In the proc logistic code, we use the weight statement, available in many procedures, to suggest how many times each observation is to be replicated before the analysis. This approach can save a lot of space.

proc logistic data = testfirth;
class outcome pred (param=ref ref='0');
model outcome(event='1') = pred / cl firth;
weight weight;
run;

Without the firth option, the parameter estimate is 19.7 with a standard error of 1349. In contrast, here is the result of the above code.

Analysis of Maximum Likelihood Estimates

Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq

Intercept 1 -2.2804 0.2324 96.2774 <.0001
pred 1 1 5.9939 1.4850 16.2926 <.0001

Note that there is no suggestion in this part of the output that the Firth method was employed; that appears only at the very top of the voluminous output.

R
In R, we can use Heinze's logistf package, which includes the logistf() function. We'll make the same table as in SAS by constructing two vectors of length 240 using the c() and rep() functions.

pred = c(rep(1,20),rep(0,220))
outcome = c(rep(1,40),rep(0,200))
lr1 = glm(outcome ~ pred, binomial)
>summary(lr1)

Call:
glm(formula = outcome ~ pred, family = binomial)

Deviance Residuals:
Min 1Q Median 3Q Max
-0.4366 -0.4366 -0.4366 -0.4366 2.1899

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.3026 0.2345 -9.818 <2e-16 ***
pred 20.8687 1458.5064 0.014 0.989
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 216.27 on 239 degrees of freedom
Residual deviance: 134.04 on 238 degrees of freedom
AIC: 138.04

Number of Fisher Scoring iterations: 17

Note that the estimate differs slightly from what SAS reports. Here's a more plausible answer.

>library(logistf)
>lr2 = logistf(outcome ~ pred)
>summary(lr2)

logistf(formula = outcome ~ pred)

Model fitted by Penalized ML
Confidence intervals and p-values by Profile Likelihood

coef se(coef) lower 0.95 upper 0.95 Chisq p
(Intercept) -2.280389 0.2324057 -2.765427 -1.851695 Inf 0
pred 5.993961 1.4850029 3.947048 10.852893 Inf 0

Likelihood ratio test=78.95473 on 1 df, p=0, n=240
Wald test = 16.29196 on 1 df, p = 5.429388e-05

Covariance-Matrix:
[,1] [,2]
[1,] 0.05401242 -0.05401242
[2,] -0.05401242 2.20523358

Here, the estimates and standard errors are nearly identical to SAS, but the confidence limits, based on the profile penalized likelihood, differ from Wald-based limits (see the update below).


Update:
Georg Heinze, author of the logistf() function, suggests the following two items.

First, in SAS, the model statement option clparm=pl will generate profile penalized likelihood confidence intervals, which should be similar to those from logistf(). It certainly makes sense to use confidence limits that more closely reflect the fitting method.

Second, in R, there is a weights option in both glm() and logistf() that is similar to the weight statement in SAS. For example, the data used above could have been input and run as:

pred = c(1,0,0)
outcome = c(1,1,0)
weight=c(20,20,200)
lr1 = glm(outcome ~ pred, binomial, weights=weight)
lr2 = logistf(outcome ~ pred, weights=weight)

Monday, November 15, 2010

Example 8.14: generating standardized regression coefficients

Standardized (or beta) coefficients from a linear regression model are the parameter estimates obtained when the predictors and outcomes have been standardized to have variance = 1. Alternatively, the regression model can be fit and then standardized post-hoc based on the appropriate standard deviations. The parameters are thus interpreted as change in the outcome, in standard deviations, per standard deviation change in the predictors. However they're calculated, standardized coefficients facilitate an assessment of which variables have the greatest association with the outcome (or response) variable, though such an assessment ignores the confidence limits associated with each pairwise association.
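To make the post-hoc calculation concrete: the standardized coefficient for a predictor is just the raw coefficient multiplied by sd(x)/sd(y). A minimal sketch with simulated data:

# post-hoc standardization of a slope: rescale by sd(x)/sd(y)
x = rnorm(100)
y = 2*x + rnorm(100)
fit = lm(y ~ x)
coef(fit)["x"] * sd(x)/sd(y)   # same as the slope from lm(scale(y) ~ scale(x))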

It's straightforward to calculate these quantities in SAS and R. We'll demonstrate with data from the HELP study, modeling PCS as a function of MCS and homelessness among female subjects.

SAS

In SAS, standardized coefficients are available as the stb option for the model statement in proc reg.

proc reg data="c:\book\help";
where female eq 1;
model pcs = mcs homeless / stb;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: PCS

Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 39.62619 2.49830 15.86 <.0001
MCS 1 0.21945 0.07644 2.87 0.0050
HOMELESS 1 -2.56907 1.95079 -1.32 0.1908

Parameter Estimates
Standardized
Variable DF Estimate

Intercept 1 0
MCS 1 0.26919
HOMELESS 1 -0.12348


R
In R we demonstrate the use of the lm.beta() function in the QuantPsyc package (due to Thomas D. Fletcher of State Farm). The function is short and sweet, and takes a linear model object as argument:

>lm.beta
function (MOD)
{
b <- summary(MOD)$coef[-1, 1]
sx <- sd(MOD$model[-1])
sy <- sd(MOD$model[1])
beta <- b * sx/sy
return(beta)
}

Here we apply the function to data from the HELP study.

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
female = subset(ds, female==1)
lm1 = lm(pcs ~ mcs + homeless, data=female)

The results, in terms of unstandardized regression parameters are the same as in SAS:

> summary(lm1)

Call:
lm(formula = pcs ~ mcs + homeless, data = female)

Residuals:
Min 1Q Median 3Q Max
-28.163 -5.821 -1.017 6.775 29.979

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.62619 2.49830 15.861 < 2e-16 ***
mcs 0.21945 0.07644 2.871 0.00496 **
homeless -2.56907 1.95079 -1.317 0.19075
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.761 on 104 degrees of freedom
Multiple R-squared: 0.0862, Adjusted R-squared: 0.06862
F-statistic: 4.905 on 2 and 104 DF, p-value: 0.009212

To generate the standardized parameter estimates, we use the lm.beta() function.

library(QuantPsyc)
lm.beta(lm1)

This generates the following output:

mcs homeless
0.2691888 -0.1234776

A change of 1 standard deviation in MCS has more than twice the impact on PCS as a 1 standard deviation change in the HOMELESS variable. This example points up another potential weakness of standardized regression coefficients, however, in that the homeless variable can take on values of 0 or 1, and a 1 standard deviation change is hard to interpret.

Monday, November 8, 2010

Example 8.13: Bike ride plot, part 2



Before explaining how to make and interpret the plot above, Nick and I want to make a plea for questions--it's hard to come up with useful questions to explore each week!

As shown in Example 8.12, data from the Cyclemeter app can be used to make interesting plots of a bike ride. But in the previous application, questions remain. Why did my speed vary so much, for example? And why do some sections of the route appear to be very straight while others are relatively bumpy? To investigate these questions, we'll add some more data into the plot, using color. First, I'll shade the line according to my elevation-- my speed is probably faster when I'm going downhill. Then I'll add the points where the GPS connected; this may explain the straight sections. Doing all of this will require more complicated coding tasks.

(Data were read in previously.)

SAS

My initial thought was that I would use proc univariate to generate a histogram (section 5.1.4) and generate the color for the line from the histogram categories. But it turns out that you can't force the histogram to have your desired number of categories. (This is true for the R hist() function as well.) So I'm going to find the minimum and maximum elevation using proc summary (section 2.1), saving the results in a data set. Then I'll add those min and max values to each line of the data, using the tip found here. Finally, I'll make my own categories for elevation by looping through the category boundaries until I find one bigger than my observation.

proc summary data=ride;
var elevation__feet_;
output out=minmax max=maxelev min=minelev;
run;

data aride;
    set ride;
    setme = 1;
    set minmax point=setme;
    colorindex = 0;
    range = maxelev-minelev;
    do i = 0 to 5;
        if elevation__feet_ ge (minelev + (i * range/6))
            then colorindex = i + 1;
    end;
run;

To make the plot, I'm going to use the annotate facility, as before. However, the annotate %line macro won't work, for reasons I won't go into. So I need to make an annotate data set directly. Briefly, the way annotate data sets work is that they have a pre-specified variable list in which only certain values are allowed. Some of these are discussed in section 5.1, but in addition to these, below we use the function variable; when function="MOVE" the conceptual plotting pen moves without drawing. When function="DRAW" a line is plotted from the last location to the current one. Locations are determined by x and y variables. The line and size variables affect the line style and thickness.

The other interesting bit of the code below is the creation and use of a temporary array (section 1.11.5) for the colors. I choose the color from this array by indexing on the colorindex variable created above.

For help on the annotate facility, see Contents; SAS Products; SAS/GRAPH; The Annotate Facility. Colors are discussed in section 5.3.11, but we use them in a slightly different way, here. For help with color specification, see Contents; SAS Products; SAS/GRAPH; Concepts; Colors.

data annoride2;
    set aride;
    if elevation__feet_ ne .;
    retain xsys '2' ysys '2' hsys '6';
    array mycolor [6] $ _temporary_
        ("CX252525" "CX636363" "CX969696" "CXBDBDBD" "CXD9D9D9" "CXF7F7F7");
    x = longitude;
    y = latitude;
    /* If this is the first observation, we need to move the pen to
       the first point. Otherwise we draw a line */
    if _n_ ne 1 then function = "DRAW";
    else function = "MOVE";
    color = mycolor[colorindex];
    line = 1;
    size = speed__miles_h_ * 2;
    output;
run;

Finally, we can plot the figure. I use the symbol statement (section 5.2.2) to plot nice dots where the GPS connected, and the axis statement (section 5.3.7) to remove the values of latitude and longitude, which don't interest me much. Analysis of the graph I leave for the R version.

axis1 major=none minor=none value=none;
symbol1 i=none v=dot c=black h=.2;
proc gplot data = ride;
plot latitude * longitude /
vaxis=axis1 haxis=axis1 annotate=annoride2;
run;
quit;

The resulting plot is shown above.

R

In R, I'm going to modify my vectorized version of the lines() function to additionally assign colors to each line. As in the SAS example, I will use categories of elevation to do this. Assigning the elevations to categories is a little trickier if I want to avoid for() loops. To do it, I will first find the category boundaries. I'll then use the which() function (as in section 1.13.2) to figure out the category boundaries smaller than the observation, and the max() function (section 1.8.1) to select the largest of these. Unfortunately, I also need the sapply() function (section B.5.3) so that I can repeat this process for each elevation and return a vector. I've annotated below to explain how this works. Finally, I use the RColorBrewer package to generate a vector of colors. (I also used it to find the colors I put in the SAS code.)

veclines2 = function(x, y, z, c) {
    require(RColorBrewer)
    breaks = min(c) + (0:5 * (max(c) - min(c))/6)
    # The sapply() function applies the function named in the
    # second argument to each element of the vector named in the
    # first argument. Since I don't need this function again,
    # I write it out within the sapply() function call and don't
    # even have to name it.
    ccat = sapply(c, function (r) {max(which(r >= breaks))} )
    mypalette = brewer.pal(6,"Greys")
    for (i in 1:(length(x)-1)) {
        lines(x[i:(i+1)], y[i:(i+1)], lwd=z[i], col=mypalette[7 - ccat[i]])
    }
}


Reader JanOosting pointed out that the segments() function will do exactly what my for-looped lines() function does. The segments() function requires "to" and "from" x and y locations. Here's a version of the above function that uses it.

bikeseg = function(x, y, z, c) {
    l = length(x)
    require(RColorBrewer)
    mypalette <- brewer.pal(6,"Greys")
    breaks = min(c) + (0:5 * (max(c) - min(c))/6)
    ccat = sapply(c, function (r) {max(which(r >= breaks))})
    segments(x0 = x[1:(l-1)], y0 = y[1:(l-1)],
        x1 = x[2:l], y1 = y[2:l], lwd=z, col=mypalette[7-ccat])
}



Then I call the function, after creating an empty plot and making the RColorBrewer library available. Finally, I add the points and some street names (see sections 5.2.3 and 5.2.11), to aid in interpretation. The way R draws the names live, as you enter commands, makes it much easier to place and rotate the names than in SAS, where you would have to re-make the annotate data set and run it to see the results.

plot(Longitude, Latitude, type="n")
library(RColorBrewer)
bikeseg(Longitude, Latitude, Speed..miles.h./2, Elevation..feet.)

points(Longitude, Latitude, cex=2, pch='.')
text(-72.495, 42.34, "Station Road")
text(-72.5035, 42.33, "Middle Street", srt=90)
text(-72.5035, 42.36, "Bike Path", srt=-39)
text(-72.54, 42.36, "Bike Path", srt=15)
text(-72.55, 42.347, "Maple", srt=107)
text(-72.54, 42.338, "Pomeroy")
text(-72.515, 42.331, "Potwine")



In the final image, thicker lines show greater speed, darker colors indicate lower elevations, and the dots are where the GPS connects. My best speed is achieved along Station Road, which is a long, straight drop. There aren't enough categories of color to explain most of the other variations in speed, although you can see me get slower as I near Middle Street along Potwine. The dots are fairly evenly spaced except along the bike path, where there is a lot of tree cover in some spots. This causes the long straight pieces near the top of the plot. Actually, since the phone is sitting in my pocket as I ride, I'm more pleased than disappointed with the performance of the iPhone and Cyclemeter.

Monday, November 1, 2010

Example 8.12: Bike ride plot, part 1



The iPhone app Cyclemeter uses the phone's GPS capability to record location and other data, and infer speed, while you ride. I took a ride near my house recently, and downloaded the data. I'd like to examine my route and my speed. A simple plot of the route is trivial in either SAS or R, but adding the speed data requires a little work. You can download my data from here and I read the data directly via URL in the following code.

SAS

In SAS, I first use proc import with the url filetype, as shown in section 1.1.6. I can then make a simple plot of the route using the i=j option to the symbol statement (as in section 1.13.5), which simply joins successive points.

filename bike url 'http://www.kenkleinman.net/files/cycle-data-10022010.csv';

proc import datafile=bike out=ride dbms=dlm;
delimiter=',';
getnames=yes;
run;

symbol1 i=j;
proc gplot data=ride;
plot latitude * longitude;
run;




I didn't project the data, so this looks a little compressed north-south.

To show my speed at each point, I decided to make a thicker line when I'm going faster. To do this, I use the annotate macros discussed in section 5.2. I decided to use the %line macro, but that requires that each observation in the data set have a starting point and an ending point for its section of line. I use the lag function (section 1.4.17) in a separate data step to add the previous point to each observation. Then I create the annotate data set. Finally, I use the value = none option to the symbol statement to make an empty plot, and the annotate data set draws the line for me.


data twopoints;
set ride;
lastlat = lag(latitude);
lastlong = lag(longitude);
if _n_ ne 1;
run;

%annomac;
data annoride;
set twopoints;
%system(2,2,6);
%line(longitude,latitude,lastlong,lastlat,
black,1,speed__miles_h_);
run;

symbol1 v=none;
proc gplot data=ride;
plot latitude * longitude / annotate=annoride;
run;
quit;

The resulting plot shown below closely resembles the R plot shown at the top of this entry.




R

In R, it's as trivial to make the simple plot as in SAS. Just read in the CSV data from the URL (sections 1.1.2, 1.1.6), make an empty plot (5.1.1), and add the lines (5.2.1).

myride=read.csv("http://www.kenkleinman.net/files/cycle-data-10022010.csv")
attach(myride)
plot(Longitude, Latitude, type="n")
lines(Longitude, Latitude)



Now I want to show the speed, as above. The lines() function has a lwd= option, but unfortunately, it's not vectorized. In other words, it accepts only a scalar that applies to all the line segments drawn in a given call. To get around that, I'll write my own vectorized version of lines() using the disfavored for() function. It calls lines() for each pair of points, with an appropriate lwd value.

veclines = function(x, y, z) {
    for (i in 1:(length(x)-1)) {
        lines(x[i:(i+1)], y[i:(i+1)], lwd=z[i])
    }
}
veclines(Longitude, Latitude, Speed..miles.h./2)

The result is displayed at the top of this blog entry. In the next entry we'll add more information to help explain why the speed varies.

Tuesday, October 26, 2010

Example 8.11: violin plots



We've continued to get useful feedback and ideas from our posts on the combination dotplot/boxplot and other ways to craft similar displays.

Another notion is the violin plot, which combines a boxplot and a (doubled) kernel density plot. While the basic notion of the violin plot does not include the individual points, such a display has virtues, particularly when comparing multiple groups and with large datasets. For teaching purposes, dots representing the data points could be added in. More details on the plot can be found in: Hintze, J. L. and R. D. Nelson (1998). Violin plots: a box plot-density trace synergism. The American Statistician, 52(2):181-4.

R

In R, the vioplot package includes the vioplot() function, which generated the plot at the top of this entry.

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
female = subset(ds, female==1)
library(vioplot)
with(female, vioplot(pcs[homeless==0], pcs[homeless==1],
horizontal=TRUE, names=c("non-homeless", "homeless"),
col = "lightblue"))


SAS

We've neglected SAS in the discussion of teaching graphics. Mimicking the tailored appearance of Wild's approach to the dotplot-boxplot would require at least several hours, and even the shorter code suggested by commenters would be difficult to reproduce. For the most part this reflects the modular nature of R. However, violin plots are similar enough to stacked kernel density estimates that we show the latter here in order to demonstrate the code.

proc sgpanel data="C:\book\help";
where female eq 1;
panelby homeless / columns=1 ;
density pcs / scale=percent type=kernel ;
run;

The output lacks the graphic depiction of central tendency, and does not double the density, but it does highlight similarities and differences between the categories.

Sunday, October 24, 2010

Reader suggestions on alternative ways to create combination dotplot/boxplot






Kudos to several of our readers, who suggested simpler ways to craft the graphical display (combination dotplot/boxplot) from our most recent example.

Yihui Xie combines a boxplot with a coarsened version of the PCS scores (created using the round() function) displayed via the stripchart() function.

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
smallds = subset(ds, female==1)
boxplot(pcs~homeless, data=smallds,
horizontal=TRUE)
stripchart(round(pcs)~homeless,
method='stack', data=smallds,
add=TRUE)

This is far simpler than the code we provided, though it is perhaps less aesthetically pleasing.

An alternative approach was suggested by NetDoc, who creatively utilizes the ggplot2 package authored by Hadley Wickham. This demonstrates how a complex plot can be crafted by building up components using this powerful system. This approach dodges points so that they don't overlap.


source("http://www.math.smith.edu/sasr/examples/wild-helper.R")
ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
female = subset(ds, female==1)
theme_update(panel.background = theme_rect(fill="white", colour=NA))
theme_update(panel.grid.minor=theme_line(colour="white", size=NA))
p = ggplot(female, aes(factor(homeless), pcs))
p + geom_boxplot(outlier.colour="transparent") +
geom_point(position=position_dodge(width=0.2), pch="O") +
opts(title = "PCS by Homeless") +
labs(x="Homeless", y="PCS") +
coord_flip() +
opts(axis.colour="white")



This is also simpler than the approach taken by Wild and colleagues. The increased complexity of Wild's function is the cost of maintaining a consistent, integrated appearance with the other pieces of their package. In addition, the checking of appropriate arguments is important for code used by others.

Tuesday, October 19, 2010

Example 8.10: Combination dotplot/boxplot (teaching graphic in honor of World Statistics Day)



In honor of World Statistics Day and the read paper that my co-authors Chris Wild, Maxine Pfannkuch, Matt Regan, and I are presenting at the Royal Statistical Society today, we present the R code to generate a combination dotplot/boxplot that is useful for students first learning statistics. One of the overriding themes of the paper is that introductory students (be they in upper elementary or early university) should keep their eyes on the data.

When describing distributions, students often are drawn to the most visually obvious aspects: median, quartiles and extremes. These are the ingredients of the basic boxplot, which is often introduced as a graphical display in introductory courses. Students are taught to calculate the quartiles, and this becomes one of the components of a boxplot.

One limitation of the boxplot is that it loses the individual data points. Given a dotplot (see here for an example of one in Fathom), it is very easy to guesstimate and draw a boxplot by hand. Wild et al. argue that doing so is probably the best way of gaining an appreciation of just what a boxplot actually is. The boxplot provides a natural bridge between operating entirely in terms of what is seen in graphics to reasoning using summary statistics. Retaining the dots in the combination dotplot/boxplot provides a reminder that the box plot is just summarizing the raw data, thus preserving a connection to more concrete foundations.

Certainly, such a plot has a number of limitations. It breaks down when there are a large number of observations, and has redundancy not suitable for publication. But as a way to motivate informal inference, it has potential value. It's particularly useful when combined in an animation using multiple samples from a known population, as demonstrated here (scroll through the file quickly).

In addition, the code, drafted by Chris and his colleagues Steve Taylor and Dineika Chandrananda at the University of Auckland, demonstrates a number of useful and interesting techniques.

R

Because of the length of the code, we'll focus just on the boxpoints3() function shown below. The remaining support code must be run first (and can be found at http://www.math.smith.edu/sasr/examples/wild-helper.R). The original boxpoints() function has a large number of options and configuration settings, which the function below sets at the default values.

# Create a plot from two variables, one continuous and one categorical.
# For each level of the categorical variable (grps) a stacked dot plot
# and a boxplot summary are created. Derived from code written by
# Christopher Wild, Dineika Chandrananda and Steve Taylor
#
boxpoints3 = function(x, grps, varnames1, varnames2, labeltext)
{
    observed = (1:length(x))[!is.na(x) & !is.na(grps)]
    x = x[observed]
    grps = as.factor(grps[observed])
    ngrps = length(levels(grps))

    # begin section to align titles and labels
    xlims = range(x) + c(-0.2,0.1)*(max(x)-min(x))
    top = 1.1
    bottom = -0.2
    plot(xlims, c(bottom,top), type="n", xlab="", ylab="", axes=F)
    yvals = ((1:ngrps)-0.7)/ngrps
    text(mean(xlims), top, paste(varnames1, "by", varnames2, sep=" "),
        cex=1, font=2)
    text(xlims[1], top-0.05, varnames2, cex=1, adj=0)
    addaxis(xlims, ylev=0, tickheight=0.03, textdispl=0.07, nticks=5,
        axiscex=1)
    text(mean(xlims), -0.2, varnames1, adj=0.5, cex=1)
    # end section for titles and labels

    for (i in 1:ngrps) {
        xi = x[grps==levels(grps)[i]]
        text(xlims[1], yvals[i]+0.2/ngrps,
            substr(as.character(levels(grps)[i]), 1, 12), adj=0, cex=1)
        prettyrange = range(pretty(xi))
        if (min(diff(sort(unique(xi)))) >= diff(prettyrange)/75)
            xbins = xi # They are sufficiently well spaced.
        else {
            xbins = round(75 * (xi-prettyrange[1])/diff(prettyrange))
        }
        addpoints(xi, yval=yvals[i], vmax=0.62/ngrps, vadd=0.075/ngrps,
            xbin=xbins, ptcex=1, ptcol='grey50', ptlwd=2)
        bxplt = fivenum(xi)
        if (length(xi) != 0)
            addbox(x5=bxplt, yval=yvals[i], hbxwdth=0.2/ngrps,
                boxcol="black", medcol="black", whiskercol="black",
                boxfill=NA, boxlwidth=2)
    }
}

Most of the function tends to housekeeping, in particular aligning titles and labels. Once this is done, the support functions addpoints() and addbox() are called with the appropriate arguments. We can call the function using data from female subjects at baseline of the HELP study, comparing PCS (physical component scores) for homeless and non-homeless subjects.

source("http://www.math.smith.edu/sasr/examples/wild-helper.R")
ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
female = subset(ds, female==1)
with(female,boxpoints3(pcs, homeless, "PCS", "Homeless"))

We see that the homeless subjects have lower PCS scores than the non-homeless (though the homeless group also has the highest score of either group in this sample).

Tuesday, October 12, 2010

Example 8.9: Contrasts

In example 8.6 we showed how to change the reference category. This is the natural first thought analysts have when their primary comparisons aren't represented in the default output. But our interest might center on a number of comparisons which don't share a category. Or we might need to compare one group with the mean of the other groups. Today we'll explore tests and estimates of effects like these, which are sometimes called contrasts. In a later entry we'll explore more complex multivariate contrasts.

SAS

Access to this kind of comparison in SAS is provided in many model-fitting procedures using a test, estimate, or contrast statement. The differences among these can be subtle. In general, for simple comparisons, we recommend the estimate statement, where available. It is available for the important glm and genmod procedures, among others. In our example from linear regression, we changed the referent from heroin to alcohol by sorting the data and using the order=data option. This gains us pairwise comparisons between heroin and alcohol and between cocaine and alcohol. Here we show how to get the comparison between heroin and cocaine from the same model fit.

proc sort data=help_a; by descending substance; run;

proc glm data=help_a order=data;
class substance;
model i1 = age substance / solution;
estimate "heroin vs. cocaine" substance -1 1 0 / e;
run; quit;

The syntax of the estimate statement is to optionally name the estimate (between the quotes) then to list the effects whose values we want to assess, followed by the desired values for each level of the effect. The e option requests that the contrast vector be printed-- this is a vital check that the contrast is working as desired. Here are the pieces of output generated by the estimate statement.

Coefficients for Estimate heroin vs. cocaine

Row 1
Intercept 0

AGE 0

SUBSTANCE heroin -1
SUBSTANCE cocaine 1
SUBSTANCE alcohol 0

This is the result of the e option, and shows that we've correctly specified the difference between heroin and cocaine.

Standard
Parameter Estimate Error t Value Pr > |t|
heroin vs. cocaine 3.00540049 2.15576257 1.39 0.1640

This is the comparison we want. It displays the difference, which we could confirm by examining the parameter estimates, as well as the standard error of the difference and the p-value, which can't be determined from the standard output.

To compare heroin with the average of cocaine and alcohol, we could use the following syntax (results omitted).

estimate "cocaine + alcohol vs heroin" substance -2 1 1 /e divisor=2;

The divisor option allows us to use fractions (such as thirds) that are not easily represented as decimals; it's needed because the / character is reserved in SAS syntax. Any number of estimate statements can be issued in a single proc.

R

The fit.contrast() function in the gmodels package will generate contrasts for lm and glm objects. Multivariate contrasts can be found in the contrast() function in the Design package.

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
lm3 = lm(i1 ~ substance + age, data=ds)
library(gmodels)
fit.contrast(lm3, "substance", c(0,-1, 1))

This generates the following output:

Estimate Std. Error t value Pr(>|t|)
substance c=( 0 -1 1 ) -3.005400 2.155763 -1.394124 0.1639696

The simple syntax accepts a model object, the name of the effect, and a vector of contrast coefficients to apply. The function does not print the contrast vector back (which would be a useful check), but it is simple enough to examine the parameter estimates from the model to ensure the desired result has been obtained.

The syntax for comparing heroin with the mean of alcohol and cocaine is straightforward.

> fit.contrast(lm3,"substance", c(-.5, -.5,1))
Estimate Std. Error t value Pr(>|t|)
substance c=( -0.5 -0.5 1 ) -11.09965 1.904192 -5.829059 1.065763e-08

Tuesday, October 5, 2010

Example 8.8: more Hosmer and Lemeshow

This is a special R-only entry.

In Example 8.7, we showed the Hosmer and Lemeshow goodness-of-fit test. Today we demonstrate more advanced computational approaches for the test.

If you write a function for your own use, it hardly matters what it looks like, as long as it works. But if you want to share it, you might build in some warnings or error-checking, since the user won't know its limitations the way you do. (This is likely good advice even if you are the only one to use your code!)

In R, you can add another layer of detail so that your function conforms to standards for built-in functions. This is a level of detail we don't pursue in our book, but is worth doing in many settings. Here we provide a modified version of a Hosmer-Lemeshow test sent to us by Stephen Taylor of the Auckland University of Technology. We've added a few annotations.

Note that the function accepts a glm object, rather than the two vectors our function used.

hosmerlem2 = function(obj, g=10) {
    # first, check to see if we fed in the right kind of object
    stopifnot(family(obj)$family=="binomial" && family(obj)$link=="logit")
    y = obj$model[[1]]
    # the double bracket (above) extracts the first element of the
    # model frame stored in the object, i.e., the response
    if (is.factor(y))
        y = as.numeric(y)==2
    yhat = obj$fitted.values
    cutyhat = cut(yhat, quantile(yhat, 0:g/g), include.lowest=TRUE)
    obs = xtabs(cbind(1 - y, y) ~ cutyhat)
    expect = xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
    if (any(expect < 5))
        warning("Some expected counts are less than 5. Use smaller number of groups")
    chisq = sum((obs - expect)^2/expect)
    P = 1 - pchisq(chisq, g - 2)
    # by returning an object of class "htest", the function will perform
    # like the built-in hypothesis tests
    return(structure(list(
        method = c(paste("Hosmer and Lemeshow goodness-of-fit test with", g, "bins", sep=" ")),
        data.name = deparse(substitute(obj)),
        statistic = c(X2=chisq),
        parameter = c(df=g-2),
        p.value = P
    ), class='htest'))
}

We can run this using last entry's data from the HELP study.

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
attach(ds)
logreg = glm(homeless ~ female + i1 + cesd + age + substance,
family=binomial)

The results are the same as before:

> hosmerlem2(logreg)
Hosmer and Lemeshow goodness-of-fit test with 10 bins

data: logreg
X2 = 8.4954, df = 8, p-value = 0.3866
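
Because the returned object has class "htest", its components can also be extracted just as they would be from a built-in test (a small sketch):

hl = hosmerlem2(logreg)
hl$statistic   # the X2 value
hl$p.value     # 0.3866...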

Tuesday, September 28, 2010

Example 8.7: Hosmer and Lemeshow goodness-of-fit

The Hosmer and Lemeshow goodness of fit (GOF) test is a way to assess whether there is evidence for lack of fit in a logistic regression model. Simply put, the test compares the expected and observed number of events in bins defined by the predicted probability of the outcome. This can be calculated in R and SAS.

R

In R, we write a simple function to calculate the statistic and a p-value, based on vectors of observed and predicted probabilities. We use the cut() function (1.4.10) in concert with the quantile() function (2.1.5) to make the bins, then calculate the observed and expected counts, the chi-square statistic, and finally the associated p-value (Table 1.1). The function allows the user to define the number of bins but uses the common default of 10.

hosmerlem = function(y, yhat, g=10) {
  cutyhat = cut(yhat,
    breaks = quantile(yhat, probs=seq(0, 1, 1/g)),
    include.lowest=TRUE)
  obs = xtabs(cbind(1 - y, y) ~ cutyhat)
  expect = xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
  chisq = sum((obs - expect)^2/expect)
  P = 1 - pchisq(chisq, g - 2)
  return(list(chisq=chisq, p.value=P))
}

We'll run it with some of the HELP data (available at the book web site). Note that fitted(object) returns the predicted probabilities, if the object is the result of a call to glm() with family=binomial.

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
attach(ds)
logreg = glm(homeless ~ female + i1 + cesd + age + substance,
family=binomial)
hosmerlem(y=homeless, yhat=fitted(logreg))

This returns the following output:

$chisq
[1] 8.495386
$p.value
[1] 0.3866328

The Design package, by Frank Harrell, includes the le Cessie and van Houwelingen test (another goodness-of-fit test; Biometrics 1991, 47:1267) and is also easy to run, though it requires using the package's own function for logistic regression.

library(Design)
mod = lrm(homeless ~ female + i1 + cesd + age + substance,
x=TRUE, y=TRUE, data=ds)
resid(mod, 'gof')


Sum of squared errors Expected value|H0 SD
104.1091804 103.9602955 0.1655883
Z P
0.8991269 0.3685851

The two tests are reassuringly similar.

SAS

In SAS, the Hosmer and Lemeshow goodness of fit test is generated with the lackfit option to the model statement in proc logistic (section 4.1.1). (We select out the results using the ODS system.)

ods select lackfitpartition lackfitchisq;
proc logistic data="c:\book\help.sas7bdat";
class substance female;
model homeless = female i1 cesd age substance / lackfit;
run;

This generates the following output:

Partition for the Hosmer and Lemeshow Test

HOMELESS = 1 HOMELESS = 0
Group Total Observed Expected Observed Expected

1 45 10 12.16 35 32.84
2 45 12 14.60 33 30.40
3 45 15 15.99 30 29.01
4 45 17 17.20 28 27.80
5 45 27 18.77 18 26.23
6 45 20 20.28 25 24.72
7 45 23 22.35 22 22.65
8 45 25 25.04 20 19.96
9 45 28 27.67 17 17.33
10 48 32 34.95 16 13.05

Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square DF Pr > ChiSq

8.4786 8 0.3882

The partition table shows the observed and expected counts of events in each decile of the predicted probabilities.

The small discrepancy between the SAS and R results is likely due to the odd binning SAS uses; the test is unstable in the presence of ties, to the extent that some authorities suggest avoiding it altogether. With continuous predictors, however, these objections are generally not germane.
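
One way to get a feel for this sensitivity in R is to rerun the hosmerlem() function defined above with a different number of bins (a sketch; the statistic and p-value will shift somewhat):

hosmerlem(y=homeless, yhat=fitted(logreg), g=8)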

Tuesday, September 21, 2010

Example 8.6: Changing the reference category for categorical variables

How can we change the reference category for a categorical variable? This question comes up often in a consulting practice.

When including categorical covariates in regression models, there is a question of how to incorporate the categories. One simple method is to generate indicator variables, sometimes called dummy variables. We go into some detail about the parameterization of categorical covariates in the SAS and R book, section 3.1.3.

In the indicator variable approach, new dichotomous variables are generated for all but one of the categories; these have a value of 1 if the subject is in the category and 0 otherwise. SAS and R each have simple ways to do this without explicitly creating new variables. In SAS, many procedures accept a class statement, while in R a variable can be defined as a factor, for example by using as.factor.

Let's consider a simple example with the following display of a categorical variable and the resulting indicators.

id catvar indA indB indC
1 A 1 0 0
2 B 0 1 0
3 D 0 0 0
4 C 0 0 1
5 B 0 1 0
6 D 0 0 0
7 A 1 0 0
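
As an aside, a minimal R sketch (with hypothetical data matching the display above) shows how this expansion happens automatically: model.matrix() turns a factor into indicator columns, and relevel() makes D the omitted reference category.

# hypothetical data matching the table above
catvar = factor(c("A", "B", "D", "C", "B", "D", "A"))
# indicator (dummy) columns for A, B, and C, with D as the reference
model.matrix(~ relevel(catvar, ref="D"))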

When we fit the model, the parameter associated with the indA variable is an estimate of the difference between categories A and D. But what if we want the difference between A and C? Well, we can take out our calculators, but we'd also like the standard error of that estimated difference. One way to do this is to change the reference category, and that is what we'll explore today. In a future entry, we'll demonstrate how to calculate arbitrary comparisons, or contrasts, without refitting the model. That method is likely superior to the one shown here, but as consulting statisticians, the question "how do I change the reference category" is one we often answer.

SAS

For procs logistic, genmod, phreg, and surveylogistic, you can use the ref= option, as follows:

proc logistic data=ds;
class classvar (param=ref ref="name-of-ref-group");
model y = classvar;
run;

Unfortunately, changing the reference category in SAS is awkward for other procedures. The SAS default is to make the last category the referent, where "last" is determined by sorting the category values as character strings. To change this, use the order option, frequently an option to the class statement but sometimes an option to the proc statement. If the desired referent is the first category, you can make it the referent by sorting on the variable in descending order and then using the order=data option:

proc sort data=ds; by descending classvar; run;

proc glm data=ds order=data;
class classvar;
model y = classvar;
run;

If your desired reference category is lexicographically in the middle of the list, your best bet is to re-code the categories. My colleague Sheryl Rifas-Shiman renames the labels as, e.g., "a. blue", "b. other", "c. brown". Then sort on the new variable and use the order=data approach. You might also get lucky by sorting on some other variable in the data set and using order=data.

As an example, we consider the simple analysis of covariance discussed in section 3.7.2. The default reference cell for substance is heroin. We can replace this with alcohol using the sorting approach.

proc import datafile='c:/book/help.dta' out=help_a dbms=dta;
run;

proc sort data=help_a; by descending substance; run;

proc glm data=help_a order=data;
class substance;
model i1 = age substance age * substance / solution;
run; quit;
Standard
Parameter Estimate Error t Value Pr > |t|

Intercept 7.913018261 B 6.79251599 1.16 0.2447
AGE 0.557076729 B 0.17437966 3.19 0.0015
SUBSTANCE heroin -2.600851794 B 9.66168958 -0.27 0.7879
SUBSTANCE cocaine 7.853879213 B 10.16492979 0.77 0.4401
SUBSTANCE alcohol 0.000000000 B . . .
AGE*SUBSTANCE heroin -0.450423400 B 0.26525085 -1.70 0.0902
AGE*SUBSTANCE cocaine -0.662468379 B 0.27702051 -2.39 0.0172
AGE*SUBSTANCE alcohol 0.000000000 B . . .

Note that SAS creates the levels for the interaction based on the same implied indicator variables.

R

In R there are several options for changing the reference cell. The simplest of these may be the relevel() function. The two arguments are the factor name and the desired reference category. The as.factor() function can be nested within relevel() if necessary.

> ds = read.csv("http://www.math.smith.edu/sasr/datasets/help.csv")
> lm3 = lm(i1 ~ relevel(substance, "alcohol") * age, data=ds)
> summary(lm3)
Call:
lm(formula = i1 ~ relevel(substance, "alcohol") * age, data=ds)

Residuals:
Min 1Q Median 3Q Max
-34.653 -9.625 -4.832 5.576 102.891

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.9130 6.7925 1.165 0.2447
relevel(substance, "alcohol")cocaine 7.8539 10.1649 0.773 0.4401
relevel(substance, "alcohol")heroin -2.6009 9.6617 -0.269 0.7879
age 0.5571 0.1744 3.195 0.0015 **
relevel(substance, "alcohol")cocaine:age -0.6625 0.2770 -2.391 0.0172 *
relevel(substance, "alcohol")heroin:age -0.4504 0.2653 -1.698 0.0902 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.7 on 447 degrees of freedom
Multiple R-squared: 0.2268, Adjusted R-squared: 0.2181
F-statistic: 26.22 on 5 and 447 DF, p-value: < 2.2e-16
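
Another way to change the reference cell (a sketch, not from the original post) is to set the factor's level ordering explicitly when it is created, so that alcohol, listed first, becomes the reference level:

# a sketch: specify the level order directly so that alcohol becomes the reference
ds$substance = factor(ds$substance, levels=c("alcohol", "cocaine", "heroin"))
lm3b = lm(i1 ~ substance * age, data=ds)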

Tuesday, September 14, 2010

Example 8.5: bubble plots part 3



An anonymous commenter expressed a desire to see how one might use SAS to draw a bubble plot with bubbles in three colors, corresponding to a fourth variable in the data set (x, y, z for bubble size, and the category variable). In previous entries we discussed bubble plots and showed how to make the bubbles print in two colors depending on a fourth, dichotomous variable.

That SAS approach cannot be extended to a fourth variable with many values, so here we show a different way to generate this output. The R version below is a trivial extension of the code demonstrated earlier.

SAS

We'll start by making some data-- 20 observations in each of 3 categories.

data testbubbles;
do cat = 1 to 3;
do i = 1 to 20;
abscissa = normal(0);
ordinate = normal(0);
z = uniform(0);
output;
end;
end;
run;

Our approach will be to make an annotate data set using the annotate macros (section 5.2). The %slice macro easily draws filled circles; check the on-line help (SAS Products; SAS/GRAPH; The Annotate Facility; Annotate Dictionary) for full details on the parameters it needs. Here we note that the 5th parameter is the radius of the circle, chosen here as an arbitrary function of z that produces pleasingly sized circles. Other parameters reflect color, density, arc, and starting angle, which could be used to represent additional variables.

%annomac;
data annobub1;
set testbubbles;
%system(2,2,3);
%slice(abscissa, ordinate, 0, 360, sqrt(3*z), green, ps, 0);
run;

Unfortunately, due to a quirk of the macro facility, I don't think the color can be changed conditionally in the preceding step. Instead, we need a new data step to do this.

data annobub2;
set annobub1;
if cat=2 then color="red";
if cat=3 then color="blue";
run;

Now we're ready to plot. We use the symbol statement (section 5.2.2) to tell proc gplot not to plot the data, add the annotate data set, and suppress the legend, since the default legend will not look correct here. An appropriate legend could be generated with a legend statement.

symbol1 i=none r=3;
proc gplot data=testbubbles;
plot ordinate * abscissa = cat / annotate = annobub2 nolegend;
run;
quit;

The resulting plot is shown above. Improved axes are demonstrated throughout the book and in many previous blog posts.

R

The R approach merely requires passing three colors to the bg option in the symbols() function. To mimic SAS, we'll start by defining some data, then generate the vector of colors needed.

cat = rep(c(1, 2, 3), each=20)
abscissa = rnorm(60)
ordinate = rnorm(60)
z = runif(60)
plotcolor = ifelse(cat==1, "green", ifelse(cat==2, "red", "blue"))

The nested calls to the ifelse() function (section 1.11.2) allow vectorized conditional tests with more than two possibilities. Another option would be to use a for loop (section 1.11.1), but this would avoid one of the strengths of R. In this example, I suppose I could have defined the cat vector with the color values as well, and saved some keystrokes; see the sketch below.
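
The keystroke-saving alternative alluded to above might look like this (a sketch): index a short vector of colors by the integer category codes, rather than nesting ifelse() calls.

plotcolor = c("green", "red", "blue")[cat]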

With the data generated and the color vector prepared, we need only call the symbols() function.

symbols(ordinate, abscissa, circles=z, inches=1/5, bg=plotcolor)

The resulting plot is shown below.

Tuesday, September 7, 2010

Example 8.4: Including subsetting conditions in output

Many analyses operate on subsets of the data. It is helpful for the output to make clear which observations have been included or excluded.

SAS

The where statement (section A.6.3) is a powerful and useful tool for subsetting on the fly. (Other options for subsetting typically require data steps or data set options in the proc statement.) But one problem with subsetting in SAS is that the output doesn't show the conditioning. Here is a little macro to correct this problem:

%macro where(wherest, titleline);
where &wherest;
title&titleline "where %str(&wherest)";
%mend where;

If you use %where(condition, n); in place of where condition; the condition is both implemented and will appear in the output as the nth title line. This takes advantage of the fact that the title statements just need to appear before the procedure is run, as opposed to before the proc statement.

For example:

title "HELP data set";
proc means data="c:\book\help.sas7bdat";
%where (female eq 1,2);
var age;
run;

This produces the output:

HELP data set
where female eq 1

The MEANS Procedure

Analysis Variable : AGE

Mean
------------
36.2523364
------------

Note that you need to use the mnemonic comparison operators such as eq, rather than the = sign, in where conditions passed to the macro; an = inside the macro call would be interpreted as a keyword-parameter assignment.

This tip is cross-posted on my list of SAS tricks.

R

Within R, many functions (e.g. lm()) provide support for a subset option (as in example 3.7.5). Indexing can be used to restrict analyses (B.4.2). In addition, the subset() function can be embedded in a function call to make explicit what subset of observations will be included in the analysis. We demonstrate calculating the mean of female subjects all three ways below.

> ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
> coef(lm(age ~ 1, subset=female==1, data=ds))
(Intercept)
36.25234
> mean(ds$age[ds$female==1])
[1] 36.25234
> mean(subset(ds,female==1,age))
age
36.25234


In interactive R, the code and output appear together, and it is usually trivial to trace the subsetting used to produce the output. But in literate programming settings, or when output may be cut and pasted from R into a document, it may be valuable to mirror the SAS macro presented above.

Here is a function to display the subsetting condition for simple functions. It requires some sophistication to turn expressions and variables into objects (and vice versa). The use of deparse(substitute()) allows us to create a character variable containing the subset condition and the name of the function, while eval() is used to pass the name of the variable to the subset() function. Similar functionality for more complex functions (such as those requiring a formula) would require some tweaking; a generic explicit subsetting function could be difficult to write.
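
As a minimal illustration of the idiom (a hypothetical helper, not part of the post), deparse(substitute()) captures the unevaluated argument as a character string:

show_expr = function(x) deparse(substitute(x))
show_expr(female == 1)   # returns the string "female == 1"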

verbose.subset = function(ds, condition, variable, FUN=mean) {
  cat(paste("subsetting with condition: ",
    deparse(substitute(condition)), "\n", sep=""))
  cat(paste("calling the function: ",
    deparse(substitute(FUN)), "\n", sep=""))
  FUN(subset(ds, condition, select=eval(variable)))
}

This yields the following output:

> verbose.subset(ds, female==1, "age")
subsetting with condition: female == 1
calling the function: mean
age
36.25234
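
A further usage sketch, with a hypothetical condition and variable; like the call above, it assumes the HELP variables are visible on the search path (for example, via an earlier attach(ds)), since the condition is ultimately evaluated in the calling environment:

verbose.subset(ds, age > 40, "cesd")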

Monday, August 30, 2010

Example 8.3: pyramid plots



Pyramid plots are a common way to display the distribution of age groups in a human population. The percentages of people within a given age category are arranged in a barplot, often back to back. Such displays can be used to contrast males vs. females, two different countries, or the distribution of age at different timepoints. Aidan Kane has an example.

We demonstrate how to generate back-to-back pyramid plots of the age distribution by gender from the HELP (Health Evaluation and Linkage to Primary Care) study. Today's example highlights the differences between the R community and the SAS corporate structure. The R function was written to draw exactly this kind of plot, while the SAS approach coaxes a powerful but general procedure into producing approximately the desired result. The R results are, to our eyes, more attractive; to mimic them exactly in SAS would require drawing much of the content from graphics primitives. Someone may have done this, but the software structure and user community aren't organized for sharing it.

R

We begin by loading the data, then creating a categorical age variable (in 5-year increments) using the cut() command (section 1.4.10). Next, a character variable is created that will be used to display the five-number summaries by gender (section 2.1.2).

ds = read.csv("http://www.math.smith.edu/sasr/datasets/help.csv")
attach(ds)
library(plotrix)

# create a categorical age variable
agegrp = cut(age, breaks=c(18, 20, 25, 30, 35, 40, 45, 50, 55, 60))

# create a nicer description for gender
gender = rep("male", length(agegrp))
gender[female==1] = "female"

# create a vector of percentages in each age range
women = as.vector(100*table(agegrp[female==1])/sum(female==1))
men = as.vector(100*table(agegrp[female==0])/sum(female==0))

# distribution by gender
tapply(age, gender, fivenum)


This yields the following output (five number summaries by gender):

$female
[1] 21.0 31.0 35.0 40.5 58.0

$male
[1] 19 30 35 40 60

Finally, the vectors of percentages at each level of the age variable for men and women are given as arguments to the pyramid.plot() function.

pyramid.plot(men, women,
labels=c("(18,20]","(20,25]","(25,30]","(30,35]",
"(35,40]","(40,45]","(45,50]","(50,55]","(55,60]"),
gap=5)
title("Age distribution at baseline of HELP study")

The age distributions are quite similar, with the males slightly more dispersed than the females.
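
As an aside, the percentage vectors computed above could equivalently be obtained with prop.table() (a sketch; equivalent here because all ages fall within the cut() break points):

women = as.vector(100 * prop.table(table(agegrp[female == 1])))
men = as.vector(100 * prop.table(table(agegrp[female == 0])))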

SAS



We'll use proc gchart with the hbar statement (section 5.1.3) to make the plot. This requires some set-up, due to the desired back-to-back image. We begin, as in R, by generating the age categories and a gender variable. The strategy for categorizing age is shown in section 1.4.9.

data pyr;
set "c:\book\help";
agegrp = (age le 20) + (age le 25) + (age le 30) + (age le 35) +
(age le 40) + (age le 45) + (age le 50) + (age le 55) + (age le 60);
if female eq 1 then gender = "Female";
else gender = "Male";
run;


Next, we generate the percent in each age group, within gender, using proc freq (section 2.3.1). We save the output to a data set with the out option and suppress all the printed output. Then we make the percents for the males negative, so they'll display to the left of 0.


proc freq data=pyr noprint;
tables agegrp * gender/out=sumpyr outpct;
run;

data pyr2;
set sumpyr;
if gender eq "Male" then pct_col=pct_col * -1;
run;

We could proceed with the plot now, but the axes would include age categories 1 through 9 and negative percents for the males. To clean this up, we use axis statements (sections 5.3.7, 5.3.8).

title 'Age distribution at baseline of HELP study';
axis1 value = ("(55,60]" "(50,55]" "(45,50]" "(40,45]"
"(35,40]" "(30,35]" "(25,30]" "(20,25]" "(18,20]" ) ;
axis2 order=(-30 to 30 by 10)
label=("Percent in each age group, within gender")
minor = none
value = ("30" "20" "10" "0" "10" '20' '30');

proc gchart data=pyr2;
hbar agegrp / discrete freq nostats sumvar=pct_col space=0.5
subgroup=gender raxis=axis2 maxis=axis1;
label agegrp="Age";
run;
quit;

In the hbar statement, the key option is sumvar, which tells proc gchart the length of the bars. The discrete option forces a bar for each value of agegrp. Other options associate the defined axis statements with the axes of the plot, generate different colors for each gender, space the bars, and suppress some default plot features.

Different colored bars within gender could be accomplished with pattern statements. More difficult would be coloring the bars within gender by some third variable, as is demonstrated in R in example(pyramid.plot). Replicating the R plot, with the category labels between the genders, would require drawing the plot using annotate data sets.