Showing posts with label ods system. Show all posts

Tuesday, November 15, 2011

Example 9.14: confidence intervals for logistic regression models

Recently a student asked about the difference between the confint() and confint.default() functions for calculating confidence intervals from logistic regression models. (confint.default() is part of base R; for glm objects, confint() uses the profile-likelihood method supplied by the MASS package.) The following example demonstrates that they yield different results.

R

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
library(MASS)
glmmod = glm(homeless ~ age + female, binomial, data=ds)

> summary(glmmod)
Call:
glm(formula = homeless ~ age + female, family = binomial, data = ds)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.3600 -1.1231 -0.9185  1.2020  1.5466

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.89262    0.45366  -1.968   0.0491 *
age          0.02386    0.01242   1.921   0.0548 .
female      -0.49198    0.22822  -2.156   0.0311 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 625.28 on 452 degrees of freedom
Residual deviance: 617.19 on 450 degrees of freedom
AIC: 623.19

Number of Fisher Scoring iterations: 4

> exp(confint(glmmod))
Waiting for profiling to be done...
                2.5 %    97.5 %
(Intercept) 0.1669932 0.9920023
age         0.9996431 1.0496390
female      0.3885283 0.9522567
> exp(confint.default(glmmod))
                2.5 %    97.5 %
(Intercept) 0.1683396 0.9965331
age         0.9995114 1.0493877
female      0.3909104 0.9563045

Why are they different? Which one is correct?

SAS

Fortunately, the detailed SAS documentation can help resolve this. The logistic procedure (section 4.1.1) offers the clodds option on the model statement. Setting clodds=both produces two sets of confidence limits, one based on the Wald test and one on the profile-likelihood approach (Venzon, D. J. and Moolgavkar, S. H. (1988), “A Method for Computing Profile-Likelihood Based Confidence Intervals,” Applied Statistics, 37, 87–94).

ods output cloddswald = waldcl cloddspl = plcl;
proc logistic data = "c:\book\help.sas7bdat" plots=none;
  class female (param=ref ref='0');
  model homeless(event='1') = age female / clodds = both;
run;

Odds Ratio Estimates and Profile-Likelihood Confidence Intervals

Effect Unit Estimate 95% Confidence Limits

AGE 1.0000 1.024 1.000 1.050
FEMALE 1 vs 0 1.0000 0.611 0.389 0.952


Odds Ratio Estimates and Wald Confidence Intervals

Effect Unit Estimate 95% Confidence Limits

AGE 1.0000 1.024 1.000 1.049
FEMALE 1 vs 0 1.0000 0.611 0.391 0.956



Unfortunately, the default precision of the printout isn't quite sufficient to identify whether this distinction aligns with the differences seen in the two R methods. We get around this by using the ODS system to save the output as data sets (section A.7.1). Then we can print the data sets, removing the default rounding formats to find all of the available precision.

title "Wald CL";
proc print data=waldcl; format _all_; run;
title "PL CL";
proc print data=plcl; format _all_; run;

Wald CL
Odds
Obs Effect Unit RatioEst LowerCL UpperCL

1 AGE 1 1.02415 0.99951 1.04939
2 FEMALE 1 vs 0 1 0.61143 0.39092 0.95633


PL CL
Odds
Obs Effect Unit RatioEst LowerCL UpperCL

1 AGE 1 1.02415 0.99964 1.04964
2 FEMALE 1 vs 0 1 0.61143 0.38853 0.95226

With this added precision, we can see that the confint.default() function generates the Wald confidence limits, while the confint() function (which for glm objects relies on the profile method from the MASS package) produces the profile-likelihood limits. This also explains the confint() message "Waiting for profiling to be done..." Thus neither CI is incorrect, though the profile-likelihood method is generally thought to be superior, especially for small sample sizes. Little practical difference is seen here.
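As a sanity check back in R, the Wald limits can be reproduced by hand from the coefficients and standard errors. A minimal sketch, using simulated data rather than the help data set (the model below is purely illustrative):

```r
# Sketch: confint.default() on a glm is the Wald interval
# beta_hat +/- qnorm(0.975) * SE (simulated data, not the help data)
set.seed(42)
x = rnorm(200)
y = rbinom(200, 1, plogis(-0.5 + 0.8 * x))
fit = glm(y ~ x, family = binomial)
se = coef(summary(fit))[, "Std. Error"]
wald = cbind(coef(fit) - qnorm(0.975) * se,
             coef(fit) + qnorm(0.975) * se)
all.equal(unname(wald), unname(confint.default(fit))) # TRUE
```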

Monday, June 27, 2011

Example 8.42: skewness and kurtosis and more moments (oh my!)



While skewness and kurtosis are not as often calculated and reported as the mean and standard deviation, they can be useful at times. Skewness is the standardized third central moment, and characterizes asymmetry: a symmetric distribution has skewness 0. Kurtosis is the standardized fourth central moment, and characterizes peakedness and tail weight: the normal distribution has a value of 3, and smaller values correspond to thinner tails (less peakedness).

Some packages (including SAS) subtract three from the kurtosis, so that the normal distribution has a kurtosis of 0 (this is sometimes called excess kurtosis).
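To make these definitions concrete, here is a minimal sketch of the method-of-moments formulas (the same definitions the moments package uses), applied to simulated normal data:

```r
# Method-of-moments definitions: skewness = m3 / m2^(3/2),
# kurtosis = m4 / m2^2, where m_k is the k-th central moment
skew_mom = function(x) {
  m2 = mean((x - mean(x))^2)
  mean((x - mean(x))^3) / m2^1.5
}
kurt_mom = function(x) {
  m2 = mean((x - mean(x))^2)
  mean((x - mean(x))^4) / m2^2
}
set.seed(1)
x = rnorm(100000)
c(skew = skew_mom(x), kurt = kurt_mom(x)) # near 0 and 3 for normal data
```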

R

library(moments)
library(lattice)
ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
ds$gender = ifelse(ds$female==1, "female", "male")
densityplot(~ cesd, data=ds, groups=gender, auto.key=TRUE)

We see that the distribution of CESD scores is skewed with a long left tail, and appears somewhat less peaked than a normal distribution. This is confirmed by the actual statistics:

> with(ds, tapply(cesd, gender, skewness))
female male
-0.4906171 -0.2464390
> with(ds, tapply(cesd, gender, kurtosis)) # kurtosis
female male
2.748968 2.547061
> with(ds, tapply(cesd, gender, kurtosis))-3 # excess kurtosis
female male
-0.2510318 -0.4529394


SAS
SAS includes much detail on the moments and other statistics in the output from proc univariate. As usual, the quantity of output can be off-putting for new users and students. Here we extract the moments we need with the ODS system. We also generate kernel density estimates roughly analogous to the densityplot() results shown above.

ods output moments = cesdmoments;
proc univariate data="c:\book\help.sas7bdat";
  class female;
  var cesd;
  histogram cesd / kernel;
run;

proc print data=cesdmoments;
  where label1 = "Skewness";
  var female label1 nvalue1 label2 nvalue2;
run;

With the result:

Obs FEMALE Label1 nValue1 Label2 nValue2

4 0 Skewness -0.247513 Kurtosis -0.442010
10 1 Skewness -0.497620 Kurtosis -0.204928

We note that by default SAS produces bias-corrected estimates, rather than the biased method-of-moments estimators produced by the skewness() and kurtosis() functions, and that SAS reports the excess kurtosis.
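For readers who want the bias-corrected versions directly in R, here is a sketch of the standard G1/G2 formulas; we believe these match SAS's default (vardef=df), but treat the correspondence as an assumption to verify against your own output:

```r
# Bias-corrected sample skewness and excess kurtosis (the standard
# G1/G2 formulas; assumed here to match SAS's vardef=df default)
sample_skew = function(x) {
  n = length(x); z = (x - mean(x)) / sd(x) # sd() divides by n-1
  n / ((n - 1) * (n - 2)) * sum(z^3)
}
sample_exkurt = function(x) {
  n = length(x); z = (x - mean(x)) / sd(x)
  n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(z^4) -
    3 * (n - 1)^2 / ((n - 2) * (n - 3))
}
sample_skew(c(-2, -1, 0, 1, 2))   # 0 (symmetric data)
sample_exkurt(c(-2, -1, 0, 1, 2)) # about -1.2
```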

Tuesday, April 12, 2011

Example 8.34: lack of robustness of t test with small n

Tim Hesterberg has effectively argued for a larger role for resampling based inference in introductory statistics courses (and statistical practice more generally). While the Central Limit Theorem is a glorious result, and the Student t-test remarkably robust, there are subtleties that Hesterberg, Jones and others have pointed out that are not always well understood.

In this entry, we explore the robustness of the one-sample Student t-test in terms of coverage probabilities, examining whether the alpha level is split evenly between the two tails.

R
We begin by defining a function to carry out the simulation study.

runsim = function(FUNCTION=rnorm, n=20, numsim=100000,
                  label="normal") {
  missupper = numeric(numsim)
  misslower = numeric(numsim)
  for (i in 1:numsim) {
    x = FUNCTION(n)
    testval = t.test(x)
    missupper[i] = testval$conf.int[2] < 0
    misslower[i] = testval$conf.int[1] > 0
  }
  cat("n=", n, ", number of simulations=", numsim, sep="")
  cat(", (true data are ", label, ")\n", sep="")
  cat(round(100*sum(missupper)/numsim, 1),
      "% of simulations had CI below the true value.\n", sep="")
  cat(round(100*sum(misslower)/numsim, 1),
      "% of simulations had CI above the true value.\n", sep="")
}

We can define three functions to explore: one where the data are actually normally distributed (so the central limit theorem doesn't need to be invoked), one with a symmetric distribution (where the central limit theorem helps at smaller sample sizes), and one with a skewed distribution (where the central limit theorem requires very large sample sizes).

reallynormal = function(n) { return(rnorm(n)) }
symmetric = function(n) { return(runif(n, -1, 1)) }
skewed = function(n) { return(rexp(n, 1) - 1) }

The results (at least to us) are somewhat startling. To get the tail probabilities correct when the underlying distribution is actually skewed, we need a huge sample size:

> runsim(reallynormal, n=20, label="normal")
n=20, number of simulations=1e+05, (true data are normal)
2.5% of simulations had CI below the true value.
2.5% of simulations had CI above the true value.

> runsim(symmetric, n=20, label="symmetric")
n=20, number of simulations=1e+05, (true data are symmetric)
2.6% of simulations had CI below the true value.
2.5% of simulations had CI above the true value.

> runsim(skewed, n=20, label="skewed")
n=20, number of simulations=1e+05, (true data are skewed)
7.7% of simulations had CI below the true value.
0.6% of simulations had CI above the true value.

> runsim(skewed, n=100, label="skewed")
n=100, number of simulations=1e+05, (true data are skewed)
4.5% of simulations had CI below the true value.
1.2% of simulations had CI above the true value.

> runsim(skewed, n=500, label="skewed")
n=500, number of simulations=1e+05, (true data are skewed)
3.4% of simulations had CI below the true value.
1.7% of simulations had CI above the true value.


SAS

In the SAS version, we'll write a macro that first generates all of the data, then uses by processing to perform the t tests, and finally tallies the results and prints them to the output.

%macro simt(n=20, numsim=100000, myrand=normal(0), label=normal);
data test;
  do sim = 1 to &numsim;
    do i = 1 to &n;
      x = &myrand;
      output;
    end;
  end;
run;

ods select none;
ods output conflimits=ttestcl;
proc ttest data=test;
  by sim;
  var x;
run;
ods select all;

data _null_;
  set ttestcl end=last;
  retain missupper 0 misslower 0;
  missupper = missupper + (upperclmean lt 0);
  misslower = misslower + (lowerclmean gt 0);
  if last then do;
    missupper_pct = compress(round(100 * missupper/&numsim, .1));
    misslower_pct = compress(round(100 * misslower/&numsim, .1));
    file print;
    put "n=&n, number of simulations = &numsim, true data are &label";
    put missupper_pct +(-1) "% of simulations had CI below the true value";
    put misslower_pct +(-1) "% of simulations had CI above the true value";
  end;
run;
%mend simt;

A few features of this code bear comment. First, the retain statement and the end option allow us to count the number of misses without an additional data step or procedure. Next, the file print statement redirects the results from the put statements to the output window rather than the log. While this is not strictly needed, the output window is where we expect results to appear. Finally, the odd +(-1) in the final two put statements moves the column pointer back one space, so that the percent sign immediately follows the number.

The various distributions and sample sizes used in the R code can be called as follows:

%simt(numsim=100000, n = 20, myrand = normal(0));
%simt(numsim=100000, n = 20, myrand = (2 * uniform(0)) - 1, label=symmetric);
%simt(numsim=100000, n = 20, myrand = (ranexp(0) - 1), label=skewed);
%simt(numsim=100000, n = 100, myrand = (ranexp(0) - 1), label=skewed);
%simt(numsim=100000, n = 500, myrand = (ranexp(0) - 1), label=skewed);

where the code to generate the desired data is passed to the macro as a character string. The results of
%simt(numsim=100000, n = 100, myrand = (ranexp(0) - 1), label=skewed);
are

n=100, number of simulations = 100000, true data are skewed
4.5% of simulations had CI below the true value
1.2% of simulations had CI above the true value

which agrees nicely with the R results above.

We look forward to Tim's talk at the JSM.

Monday, December 13, 2010

Example 8.18: A Monte Carlo experiment

In recent weeks, we've explored methods to fit logistic regression models when a state of quasi-complete separation exists. We considered Firth's penalized likelihood approach, exact logistic regression, and Bayesian models using Markov chain Monte Carlo (MCMC).

Today we'll show how to build a Monte Carlo experiment to compare these approaches. Suppose we have 100 observations with x=0 and 100 with x=1, and suppose that Pr(Y=1|X=0) = 0.001 while Pr(Y=1|X=1) = 0.05. Thus the true odds ratio is (0.05/0.95)/(0.001/0.999) = 52.6, and the log odds ratio we want to recover is 3.96. But we will rarely observe any y=1 when x=0. Which of these approaches is most likely to give us acceptable results?
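The arithmetic for the target parameters is easy to check in R:

```r
# verify the true odds ratio and log odds ratio for the design above
p0 = 0.001; p1 = 0.05
or = (p1 / (1 - p1)) / (p0 / (1 - p0))
round(c(odds.ratio = or, log.or = log(or)), 2) # 52.58 and 3.96
```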

Note that in all of the MCMC analyses we use only 6000 iterations, which is likely too few to trust in practice.

The code is long enough here that we annotate within rather than write much text around the code.

SAS

All the SAS procedures used accept the events/trials syntax (section 4.1.1), so we'll generate example data sets as two observations of binomial random variates with the probabilities noted above. We also make extensive use of the ODS system to suppress all printed output (section A.7.1) and to save desired pieces of output as SAS data sets. The latter usage requires first using the ods trace on/listing statement to find the name of the output before saving it. Finally, we use the by statement (section A.6.2) to replicate the analysis for each simulated data set.

data rlog;
  do trial = 1 to 100;
    /* each "trial" is a simulated data set with two observations
       containing the observed number of events with x=0 or x=1 */
    x=0; events = ranbin(0,100,.001); n=100; output;
    x=1; events = ranbin(0,100,.05); n=100; output;
  end;
run;

ods select none; /* omit _all_ printed output */

ods output parameterestimates=glm; /* save the estimated betas */
proc logistic data = rlog;
  by trial;
  model events/n = x; /* ordinary logistic regression */
run;

ods output parameterestimates=firth; /* save the estimated betas */
/* note the ODS table has the same name as in the
   uncorrected model above; only our data set name differs */
proc logistic data = rlog;
  by trial;
  model events/n = x / firth; /* do the Firth bias correction */
run;

ods output exactparmest=exact;
/* the exact estimates have a different name under ODS */
proc logistic data=rlog;
  by trial;
  model events/n = x;
  exact x / estimate; /* do the exact estimation */
run;

data prior;
input _type_ $ Intercept x;
datalines;
Var 25 25
Mean 0 0
;
run;

ods output postsummaries=mcmc;
proc genmod data = rlog;
  by trial;
  model events/n = x / dist=bin;
  bayes nbi=1000 nmc=6000
        coeffprior=normal(input=prior) diagnostics=none
        statistics=summary;
  /* do the Bayes regression, using the prior made in the
     previous data step */
run;

Now I have four data sets with parameter estimates in them. I could use them separately, but I'd like to merge them together. I can do this with the merge statement (section 1.5.7) in a data step. I also need to drop the lines with the estimated intercepts and rename the variables that hold the parameter estimates. The latter is necessary because the names are duplicated across the output data sets and desirable in that it allows names that are meaningful. In any event, I can use the where and rename data set options to include these modifications as I do the merge. I'll also add the number of events when x=0 and when x=1, which requires merging in the original data twice.

data lregsep;
  merge
    glm   (where = (variable = "x") rename = (estimate = glm))
    firth (where = (variable = "x") rename = (estimate = firth))
    exact (rename = (estimate = exact))
    mcmc  (where = (parameter = "x") rename = (mean = mcmc))
    rlog  (where = (x = 1) rename = (events = events1))
    rlog  (where = (x = 0) rename = (events = events0));
  by trial;
run;

ods select all; /* now I want to see the output! */
/* check to make sure the output dataset looks right */
proc print data = lregsep (obs = 5) ;
var trial glm firth exact mcmc;
run;

/* what do the estimates look like? */
proc means data=lregsep;
var glm firth exact mcmc;
run;

With the following output.

Obs trial glm firth exact mcmc

1 1 12.7866 2.7803 2.3186 3.9635
2 2 12.8287 3.1494 2.7223 4.0304
3 3 10.7192 1.6296 0.8885 2.5613
4 4 11.7458 2.2378 1.6906 3.3409
5 5 10.7192 1.6296 0.8885 2.5115

Variable Mean Std Dev
----------------------------------------
glm 10.6971252 3.4362801
firth 2.2666700 0.5716097
exact 1.8237047 0.5646224
mcmc 3.1388274 0.9620103
----------------------------------------

The ordinary logistic estimates are entirely implausible, while the three alternate approaches are more acceptable. The MCMC result has the least bias, but it's unclear to what degree this is a happy coincidence between the odds ratio and the prior precision. The Firth approach appears to be less biased than the exact logistic regression.

R
The R version is roughly analogous to the SAS version. The notable differences are that 1) the glm() and logistf() functions want the "weights" version of the data (see example 8.15), the elrm() function needs the events/trials syntax, and the MCMClogit() function needs the expanded (one row per observation) version; and 2) the sapply() function (section B.5.3) serves a purpose similar to the by statement in SAS. Finally, rather than spelunking through the ods trace output to find the parameter estimates, I used the str() function (section 1.3.2) to figure out where they are stored in the output objects, and indexes (rather than data set options) to pull out the one estimate I need.


# make sure the needed packages are present
require(logistf)
require(elrm)
require(MCMCpack)
# the runlogist() function generates a dataset and runs each analysis
# the parameter "trial" keeps track of which time we're calling runlogist()
runlogist = function(trial) {
  # the result vector will hold the estimates temporarily
  result = matrix(0, 4)
  # generate the number of events once
  events.0 = rbinom(1, 100, .001) # for x = 0
  events.1 = rbinom(1, 100, .05)  # for x = 1
  # glm() and logistf() take the "weights" format
  xw = c(0, 0, 1, 1)
  yw = c(0, 1, 0, 1)
  ww = c(100 - events.0, events.0, 100 - events.1, events.1)
  # run the glm and logistf, grab the estimates, and stick
  # them into the results vector
  result[1] = glm(yw ~ xw, weights=ww, binomial)$coefficients[2]
  result[2] = logistf(yw ~ xw, weights=ww)$coefficients[2]
  # elrm() needs a data frame in the events/trials syntax
  elrmdata = data.frame(events=c(events.0, events.1), x=c(0, 1),
                        trials=c(100, 100))
  # run it and grab the estimate
  result[3] = elrm(events/trials ~ x, interest = ~ x, iter=6000,
                   burnIn=1000, dataset=elrmdata, r=2)$coeffs
  # MCMClogit() needs expanded data
  x = c(rep(0, 100), rep(1, 100))
  y = c(rep(0, 100 - events.0), rep(1, events.0),
        rep(0, 100 - events.1), rep(1, events.1))
  # run it and grab the mean of the MCMC posteriors
  result[4] = summary(MCMClogit(y ~ as.factor(x), burnin=1000,
                                mcmc=6000, b0=0, B0=.04,
                                seed=list(c(781306, 78632467, 364981736,
                                            6545634, 7654654, 4584),
                                          trial)))$statistics[2, 1]
  # send back the four estimates, plus the number of events
  # when x=0 and x=1
  return(c(trial, events.0, events.1, result))
}

Note the construction of the seed= option to the MCMClogit() function. This allows a different seed in every call without actually using sequential seeds.

Now we're ready to call the function repeatedly. We'll do that with the sapply() function, but we need to nest that inside a t() call to get the estimates to appear as columns rather than rows, and we'll also convert the result to a data frame in the same command. Note that the only input we vary across calls is the trial number. Finally, we'll add descriptive names for the columns with the names() function (section 1.3.4).
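The transpose pattern can be seen with a toy stand-in (f() below is a hypothetical function, not part of the example):

```r
# toy stand-in: sapply() returns one column per call, so t() is needed
# to get one row per trial before conversion to a data frame
f = function(i) c(trial = i, sq = i^2)
res = as.data.frame(t(sapply(1:3, f)))
res$sq # 1 4 9
```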

res2 = as.data.frame(t(sapply(1:10, runlogist)))
names(res2) <- c("trial","events.0","events.1", "glm",
"firth", "exact-ish", "MCMC")
head(res2)
colMeans(res2[,4:7], na.rm=TRUE)


trial events.0 events.1 glm firth exact-ish MCMC
1 1 0 6 18.559624 2.6265073 2.6269087 3.643560
2 2 1 3 1.119021 0.8676031 1.1822296 1.036173
3 3 0 5 18.366720 2.4489268 2.1308186 3.555314
4 4 0 5 18.366720 2.4489268 2.0452446 3.513743
5 5 0 2 17.419339 1.6295391 0.9021854 2.629160
6 6 0 9 17.997524 3.0382577 2.1573979 4.017105

glm firth exact-ish MCMC
17.333356 2.278344 1.813203 3.268243

The results are notably similar to SAS, except for the unacceptable glm() results.

In most Monte Carlo experimental settings, one would also be interested in examining the confidence limits for the parameter estimates. Notes and code for doing this can be found here. In a later entry we'll consider plots for the results generated above. As a final note, there are few combinations of event numbers with any mass worth considering. One could compute the probability of each of these and the associated parameter estimates, deriving a more analytic answer to the question. However, this would be difficult to replicate for arbitrary event probabilities and Ns, and very awkward for continuous covariates, while the above approach could be extended with trivial ease.