Tuesday, November 30, 2010

Example 8.16: Exact logistic regression

In example 8.15, on Firth logistic regression, we mentioned alternative approaches to separation troubles. Here we demonstrate exact logistic regression. The code for this appears in the book (section 4.1.2) but we don't show an example of it there. We'll consider the setting of observing 100 subjects each with x=0 and x=1, observing no events when x=0, and 5 when x=1.

SAS
We'll create the data as a summary, rather than for every line of data. Then we can use the "events/trials" syntax (section 4.1.1) that both proc logistic and proc genmod accept. This is another way to reduce the size of data sets (along with the weight option mentioned previously) but is less generally useful. The the exact statement in proc logistic will fit the exact logistic regression and generate a p-value. The estimate option is required to display estimated log odds ratio.

data exact;
x=0; count=0; n=100; output;
x=1; count=5; n=100; output;
run;

proc logistic data=exact;
model count/n = x;
exact x / estimate;
run;

This generates the following output:

Exact Parameter Estimates

Standard 95% Confidence
Parameter Estimate Error Limits p-Value

x 1.9414* . -0.0677 Infinity 0.0594

NOTE: * indicates a median unbiased estimate.


R
In R we use the elrm() function in the elrm package to approximate exact logistic regression, as described in this paper by the package's authors. The function requires a special formula object with syntax identical to the SAS events/trials syntax. (Note that the function does not behave as expected when identical observations with trials=1 are submitted. Thus data should be collapsed into unique combinations of predictors before using the function.) In addition, it requires its data to be included in a data frame. We'll construct the data frame in one function call to data.frame().

elrmdata = data.frame(count=c(0,5), x=c(0,1), n=c(100,100))
library(elrm)
resexact = elrm(count/n ~ x, interest = ~x, iter=22000,
burnIn=2000, data=elrmdata, r=2)
summary(resexact)

producing the following result:

Call:
[[1]]
elrm(formula = count/n ~ x, interest = ~x, r = 2, iter = 22000,
dataset = elrmdata, burnIn = 2000)

Results:
estimate p-value p-value_se mc_size
x 2.0225 0.02635 0.0011 20000

95% Confidence Intervals for Parameters

lower upper
x -0.02065572 Inf

Differences between the SAS and R results most likely arise from the fact that the elrm() function is an approximation of the exact approach. The upper limit of infinity seen in the exact SAS analysis and approximate exact elrm() analysis reveals a limitation of this approach relative to the Firth approach seen in example 8.15 and the Bayesian approach we'll examine later.

A final note: if the true Pr(Y=1|X=1) = 0.05, then the true Pr(Y=1|X=0) that results in a log odds ratio of 1.94 is about 0.0075; for a log odds ratio of 2.02, the true probability is about 0.0069.

Monday, November 29, 2010

SAS and R joins SAS-x

Tal Galili, organizer of the R-bloggers blog aggregator, has opened a new aggregator for people blogging about SAS. If you're unfamiliar with the concept, an aggregator is a single blog which republishes (with permission, in this case) the entries from many contributing blogs, gathering a dispersed community of interest to a single point in your browser or newsreader.

Called SAS-x, the new aggregator is starting up right now, and we're pleased to be included. We find R-bloggers to be a source of stimulating output from many different disciplines and hope that SAS-x will be as successful.

Monday, November 22, 2010

Example 8.15: Firth logistic regression

In logistic regression, when the outcome has low (or high) prevalence, or when there are several interacted categorical predictors, it can happen that for some combination of the predictors, all the observations have the same event status. A similar event occurs when continuous covariates predict the outcome too perfectly.

This phenomenon, known as "separation" (including complete and quasi-complete separation) will cause problems fitting the model. Sometimes the only symptom of separation will be extremely large standard errors, while at other times the software may report an error or a warning.

One approach to handling this sort of problem is exact logistic regression, which we discuss in section 4.1.2. But exact logistic regression is complex and may require prohibitive computational resources. Another option is to use a Bayesian approach. Here we show how to use a penalized likelihood method originally proposed by Firth (1993 Biometrika 80:27-38) and described fully in this setting by Georg Heinze (2002 Statistics in Medicine 21:2409-2419 and 2006 25:4216-4226). A nice summary of the method is shown on a web page that Heinze maintains. In later entries we'll consider the Bayesian and exact approaches.

Update: see bottom of the post.

SAS
In SAS, the corrected estimates can be found using the firth option to the model statement in proc logistic. We'll set up the problem in the simple setting of a 2x2 table with an empty cell. Here, we simply output three observations with three combinations of predictor and outcome, along with a weight variable which contains the case counts in each cell of the table

data testfirth;
pred=1; outcome=1; weight=20; output;
pred=0; outcome=1; weight=20; output;
pred=0; outcome=0; weight=200; output;
run;

In the proc logistic code, we use the weight statement, available in many procedures, to suggest how many times each observation is to be replicated before the analysis. This approach can save a lot of space.

proc logistic data = testfirth;
class outcome pred (param=ref ref='0');
model outcome(event='1') = pred / cl firth;
weight weight;
run;

Without the firth option, the parameter estimate is 19.7 with a standard error of 1349. In contrast, here is the result of the above code.

Analysis of Maximum Likelihood Estimates

Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq

Intercept 1 -2.2804 0.2324 96.2774 <.0001
pred 1 1 5.9939 1.4850 16.2926 <.0001

Note here that these no suggestion in this part of the output that the Firth method was employed. That appears only at the very top of the voluminous output.

R
In R, we can use Heinze's logistf package, which includes the logistf() function. We'll make the same table as in SAS by constructing two vectors of length 240 using the c() and rep() functions.

pred = c(rep(1,20),rep(0,220))
outcome = c(rep(1,40),rep(0,200))
lr1 = glm(outcome ~ pred, binomial)
>summary(lr1)

Call:
glm(formula = outcome ~ pred, family = binomial)

Deviance Residuals:
Min 1Q Median 3Q Max
-0.4366 -0.4366 -0.4366 -0.4366 2.1899

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.3026 0.2345 -9.818 <2e-16 ***
pred 20.8687 1458.5064 0.014 0.989
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 216.27 on 239 degrees of freedom
Residual deviance: 134.04 on 238 degrees of freedom
AIC: 138.04

Number of Fisher Scoring iterations: 17

Note that the estimate differs slightly from what SAS reports. Here's a more plausible answer.

>library(logistf)
>lr2 = logistf(outcome ~ pred)
>summary(lr2)

logistf(formula = outcome ~ pred)

Model fitted by Penalized ML
Confidence intervals and p-values by Profile Likelihood

coef se(coef) lower 0.95 upper 0.95 Chisq p
(Intercept) -2.280389 0.2324057 -2.765427 -1.851695 Inf 0
pred 5.993961 1.4850029 3.947048 10.852893 Inf 0

Likelihood ratio test=78.95473 on 1 df, p=0, n=240
Wald test = 16.29196 on 1 df, p = 5.429388e-05

Covariance-Matrix:
[,1] [,2]
[1,] 0.05401242 -0.05401242
[2,] -0.05401242 2.20523358

Here, the estimates are nearly identical to SAS, but the standard errors differ.


Update:
Georg Heinze, author of the logistf() function, suggests the following two items.

First, in SAS, the model statement option clparm=pl will generate profile penalized likelihood confidence intervals, which should be similar to those from logistf(), It certainly makes sense to use confidence limits that more closely reflect the fitting method.

Second, in R, there is a weight option in both glm() and in logistf() that is similar to the weight statement in SAS. For example, the data used above could have been input and run as:

pred = c(1,0,0)
outcome = c(1,1,0)
weight=c(20,20,200)
lr1 = glm(outcome ~ pred, binomial, weights=weight)
lr2 = logistf(outcome ~ pred, weights=weight)

Monday, November 15, 2010

Example 8.14: generating standardized regression coefficients

Standardized (or beta) coefficients from a linear regression model are the parameter estimates obtained when the predictors and outcomes have been standardized to have variance = 1. Alternatively, the regression model can be fit and then standardized post-hoc based on the appropriate standard deviations. The parameters are thus interpreted as change in the outcome, in standard deviations, per standard deviation change in the predictors. However they're calculated, standardized coefficients facilitate an assessment of which variables have the greatest association with the outcome (or response) variable, though such an assessment ignores the confidence limits associated with each pairwise association.

It's straightforward to calculate these quantities in SAS and R. We'll demonstrate with data from the HELP study, modeling PCS as a function of MCS and homelessness among female subjects.

SAS

In SAS, standardized coefficients are available as the stb option for the model statement in proc reg.

proc reg data="c:\book\help";
where female eq 1;
model pcs = mcs homeless / stb;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: PCS

Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 39.62619 2.49830 15.86 <.0001
MCS 1 0.21945 0.07644 2.87 0.0050
HOMELESS 1 -2.56907 1.95079 -1.32 0.1908

Parameter Estimates
Standardized
Variable DF Estimate

Intercept 1 0
MCS 1 0.26919
HOMELESS 1 -0.12348


R
In R we demonstrate the use of the lm.beta() function in the QuantPsyc package (due to Thomas D. Fletcher of State Farm). The function is short and sweet, and takes a linear model object as argument:

>lm.beta
function (MOD)
{
b <- summary(MOD)$coef[-1, 1]
sx <- sd(MOD$model[-1])
sy <- sd(MOD$model[1])
beta <- b * sx/sy
return(beta)
}

Here we apply the function to data from the HELP study.

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
female = subset(ds, female==1)
lm1 = lm(pcs ~ mcs + homeless, data=female)

The results, in terms of unstandardized regression parameters are the same as in SAS:

> summary(lm1)

Call:
lm(formula = pcs ~ mcs + homeless, data = female)

Residuals:
Min 1Q Median 3Q Max
-28.163 -5.821 -1.017 6.775 29.979

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.62619 2.49830 15.861 < 2e-16 ***
mcs 0.21945 0.07644 2.871 0.00496 **
homeless -2.56907 1.95079 -1.317 0.19075
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.761 on 104 degrees of freedom
Multiple R-squared: 0.0862, Adjusted R-squared: 0.06862
F-statistic: 4.905 on 2 and 104 DF, p-value: 0.009212

To generate the standardized parameter estimates, we use the lm.beta() function.

library(QuantPsyc)
lm.beta(lm1)

This generates the following output:

mcs homeless
0.2691888 -0.1234776

A change in 1 standard deviation of MCS has more than twice the impact on PCS than a 1 standard deviation change in the HOMELESS variable. This example points up another potential weakness of standardized regression coefficients, however, in that the homeless variable can take on values of 0 or 1, and a 1 standard deviation change is hard to interpret.

Monday, November 8, 2010

Example 8.13: Bike ride plot, part 2



Before explaining how to make and interpret the plot above, Nick and I want to make a plea for questions--it's hard to come up with useful questions to explore each week!

As shown in Example 8.12, data from the Cyclemeter app can be used to make interesting plots of a bike ride. But in the previous application, questions remain. Why did my speed vary so much, for example? And why do some sections of the route appear to be very straight while others are relatively bumpy? To investigate these questions, we'll add some more data into the plot, using color. First, I'll shade the line according to my elevation-- my speed is probably faster when I'm going down hill. Then I'll add the points where the GPS connected; this may explain the straight sections. Doing all of this will require more complicated coding tasks.

(Data were read in previously.)

SAS

My initial thought was that I would use proc univariate to generate a histogram (section 5.1.4) and generate the color for the line from the histogram categories. But it turns out that you can't force the histogram to have your desired number of categories. (This is true for the R hist() function as well.) So I'm going to find the minimum and maximum elevation using proc summary (section 2.1), saving the results in a data set. Then I'll add those min and max values to each line of the data, using the tip found here. Finally, I'll make my own categories for elevation by looping through the category boundaries until I find one bigger than my observation.

proc summary data=ride;
var elevation__feet_;
output out=minmax max=maxelev min=minelev;;
run;

data aride;
set ride;
setme = 1;
set minmax point=setme;
colorindex = 0;
range = maxelev-minelev;
do i = 0 to 5;
if elevation__feet_ ge (minelev + (i * range/6) )
then colorindex = i + 1;
end;
run;

To make the plot, I'm going to use the annotate facility, as before. However, the annotate %line macro won't work, for reasons I won't go into. So I need to make an annotate data set directly. Briefly, the way annotate data sets work is that they have a pre-specified variable list in which only certain values are allowed. Some of these are discussed in section 5.1, but in addition to these, below we use the function variable; when function="MOVE" the conceptual plotting pen moves without drawing. When function="DRAW" a line is plotted from the last location to the current one. Locations are determined by x and y variables. The line and size variables affect the line style and thickness.

The other interesting bit of the code below is the creation and use of a temporary array (section 1.11.5) for the colors. I choose the color from this array by indexing on the colorindex variable created above.

For help on the annotate facility, see Contents; SAS Products; SAS/GRAPH; The Annotate Facility. Colors are discussed in section 5.3.11, but we use them in a slightly different way, here. For help with color specification, see Contents; SAS Products; SAS/GRAPH; Concepts; Colors.

data annoride2;
set aride;
if elevation__feet_ ne .;
retain xsys '2' ysys '2' hsys '6';
array mycolor [6] $ _temporary_
("CX252525" "CX636363" "CX969696" "CXBDBDBD" "CXD9D9D9" "CXF7F7F7");
x = longitude;
y = latitude;
/* If this is the first observation, we need to move the pen to
the first point. Otherwise we draw a line */
if _n_ ne 1 then function = "DRAW";
else function = "MOVE";
color = mycolor[colorindex];
line = 1;
size = speed__miles_h_ * 2;
output;
run;

Finally, we can plot the figure. I use the symbol statement (section 5.2.2) to plot nice dots where the GPS connected, and the axis statement (section 5.3.7) to remove the values of latitude and longitude, which don't interest me much. Analysis of the graph I leave for the R version.

axis1 major=none minor=none value=none;
symbol1 i=none v=dot c=black h=.2;
proc gplot data = ride;
plot latitude * longitude /
vaxis=axis1 haxis=axis1 annotate=annoride2;
run;
quit;

The resulting plot is shown above.

R

In R, I'm going to modify my vectorized version of the lines() function to additionally assign colors to each line. As in the SAS example, I will use categories of elevation to do this. Assigning the elevations to categories is a little trickier if I want to avoid for() loops. To do it, I will first find the category boundaries. I'll then use the which() function (as in section 1.13.2) to figure out the category boundaries smaller than the observation, and the max() function (section 1.8.1) to select the largest of these. Unfortunately, I also need the sapply() function (section B.5.3) so that I can repeat this process for each elevation and return a vector. I've annotated below to explain how this works. Finally, I use the RColorBrewer package to generate a vector of colors. (I also used it to find the colors I put in the SAS code.)

veclines2 = function(x, y, z, c) {
require(RColorBrewer)
breaks = min(c) + (0:5 * (max(c) - min(c))/6)
# The sapply() function applies the function named in the
# second argument to each element of the vector named in the
# first argument. Since I don't need this function again,
# I write it out in within the sapply() function call and don't
# even have to name it.
ccat = sapply(c, function (r) {max(which(r >= breaks))} )
mypalette = brewer.pal(6,"Greys")
for (i in 1:(length(x)-1)) {
lines(x[i:(i+1)], y[i:(i+1)], lwd=z[i], col=mypalette[7 - ccat[i]])
}
}


Reader JanOosting pointed out that the segments() function will do exactly what my for-looped lines() function does. The segments() function requires "to" and "from" x and y locations. Here's a version of the above function that uses it.

bikeseg = function(x,y,z,c) {
l=length(x)
require(RColorBrewer)
mypalette <-brewer.pal(6,"Greys")
breaks = min(c) + (0:5 * (max(c) - min(c))/6)
ccat = sapply(c,function (r) {max(which(r >= breaks))})
segments(x0 = x[1:(l-1)], y0 = y[1:(l-1)],
x1 = x[2:l], y1 = y[2:l],lwd=z,col=mypalette[7-ccat])
}



Then I call the function, after creating an empty plot and making the RColorBrewer library available. Finally, I add the points and some street names (see sections 5.2.3 and 5.2.11), to aid in interpretation. The way R draws the names live, as you enter commands, makes it much easier to place and rotate the names than in SAS, where you would have to re-make the annotate data set and run it to see the results.

plot(Longitude, Latitude, type="n")
library(RColorBrewer)
bikeseg(Longitude, Latitude, Speed..miles.h./2, Elevation..feet.)

points(Longitude, Latitude, cex=2, pch='.')
text(-72.495, 42.34, "Station Road")
text(-72.5035, 42.33, "Middle Street", srt=90)
text(-72.5035, 42.36, "Bike Path", srt=-39)
text(-72.54, 42.36, "Bike Path", srt=15)
text(-72.55, 42.347, "Maple", srt=107)
text(-72.54, 42.338, "Pomeroy")
text(-72.515, 42.331, "Potwine")



In the final image, thicker lines show greater speed, darker colors indicate lower elevations, and the dots are where the GPS connects. My best speed is achieved along Station Road, which is a long, straight drop. There aren't enough categories of color to explain most of the other variations in speed, although you can see me get slower as I near Middle Street along Potwine. The dots are fairly evenly spaced except along the bike path, where there is a lot of tree cover in some spots. This causes the long straight pieces near the top of the plot. Actually, since the phone is also sitting in my pocket as I ride, so I'm more pleased than disappointed with the perfomance of the iPhone and Cyclemeter.

Monday, November 1, 2010

Example 8.12: Bike ride plot, part 1



The iPhone app Cyclemeter uses the phone's GPS capability to record location and other data, and infer speed, while you ride. I took a ride near my house recently, and downloaded the data. I'd like to examine my route and my speed. A simple plot of the route is trivial in either SAS or R, but adding the speed data requires a little work. You can download my data from here and I read the data directly via URL in the following code.

SAS

In SAS, I first use proc import with the url filetype, as shown in section 1.1.6. I can then make a simple plot of the route using the i=j option to the symbol statement (as in section 1.13.5), which simply joins successive points.

filename bike url 'http://www.kenkleinman.net/files/cycle-data-10022010.csv';

proc import datafile=bike out=ride dbms=dlm;
delimiter=',';
getnames=yes;
run;

symbol1 i=j;
proc gplot data=ride;
plot latitude * longitude;
run;




I didn't project the data, so this looks a little compressed north-south.

To show my speed at each point, I decided to make a thicker line when I'm going faster. To do this, I use the annotate macros discussed in section 5.2. I decided to use the %line macro to do this, but that requires each observation in the data set have a starting point and an ending point for its section of line. I use the lag function (section 1.4.17) in a separate data step to add the previous point to each observation. Then I create the annotate data set. Finally, I use the value = none option to the symbol statement to make an empty plot and the annotate data set draws the line for me.


data twopoints;
set ride;
lastlat = lag(latitude);
lastlong = lag(longitude);
if _n_ ne 1;
run;

%annomac;
data annoride;
set twopoints;
%system(2,2,6);
%line(longitude,latitude,lastlong,lastlat,
black,1,speed__miles_h_);
run;

symbol1 v=none;
proc gplot data=ride;
plot latitude * longitude / annotate=annoride;;
run;
quit;

The resulting plot shown below closely resembles the R plot shown at the top of this entry.




R

In R, it's as trivial to make the simple plot as in SAS. Just read in the CSV data from the URL (section 1.1.2, 1.1.6) make an empty plot (5.1.1), and add the lines (5.2.1).

myride=read.csv("http://www.kenkleinman.net/files/cycle-data-10022010.csv")
attach(myride)
plot(Longitude, Latitude, type="n")
lines(Longitude, Latitude)



Now I want to show the speed, as above. The lines() function has a lwd= option, but unfortunately, it's not vectorized. In other words, it accepts only a scalar that applies to all the line segments drawn in a given call. To get around that, I'll write my own vectorized version of lines() using the disfavored for() function. It calls lines() for each pair of points, with an appropriate lwd value.

veclines = function(x, y, z) {
for (i in 1:(length(x)-1)) {
lines(x[i:(i+1)], y[i:(i+1)], lwd=z[i])
}
}
veclines(Longitude, Latitude, Speed..miles.h./2)

The result is displayed at the top of this blog entry. In the next entry we'll add more information to help explain why the speed varies.