Showing posts with label proc genmod. Show all posts
Showing posts with label proc genmod. Show all posts

Monday, October 15, 2012

Example 10.6: Should Poisson regression ever be used? Negative binomial vs. Poisson regression

In practice, we often find that count data is not well modeled by Poisson regression, though Poisson models are often presented as the natural approach for such data. In contrast, the negative binomial regression model is much more flexible and is therefore likely to fit better, if the data are not Poisson. In example 8.30 we compared the probability mass functions of the two distributions, and found that for a given mean, the negative binomial closely approximates the Poisson, as the scale parameter increases. But how does this affect the choice of regression model? How might another alternative, the overdispersed, or quasi-Poisson model compete with these? Today we generate a rudimentary toolkit for assessing the effects of Poisson, negative binomial, and quasi-Poisson models, assuming data are truly generated by one or the other process.

SAS
We'll begin by simulating Poisson and negative binomial data. Note that we also rely on the poismean_nb function that we created in example 8.30-- this is needed because SAS only accepts the natural parameters of the distribution, while the mean is a (simple) function of the two parameters. As is typical in such settings, we'll begin by generating data under the null of no association between, in this case, the normal covariate and the count outcome. The proportion of rejections should be no greater than alpha (5%, here). However, we'll include code to easily simulate data under the alternative as well. This will facilitate assessing the relative power of the models, later.
data nbp;
do ds = 1 to 10000;
  do i = 0 to 250;
    x = normal(0);
    mean = exp(-.25 + (0 * x));
    pois = rand("POISSON",mean);
    nb1 = rand("NEGBINOMIAL", poismean_nb(mean, 1), 1);
    output;
    end;
  end;
run;
The models will be fit in proc genmod. (See sections 4.1.3, 4.1.5, table 4.1.) It would be good to write a little macro to change the distribution and the output names, but it's not necessary. To save space here, the repetitive lines are omitted. The naming convention is that the true distribution (p or nb) is listed first, followed by the fit model (p, nb, or pod, for overdispersed).
 
ods select none;
ods output parameterestimates = ppests;
proc genmod data = nbp;
by ds;
model pois = x /dist=poisson;
run;

ods output parameterestimates = ppodests;
model pois = x /dist=poisson scale = p;

ods output parameterestimates = pnbests;
model pois = x /dist=negbin;

ods output parameterestimates = nbnbests;
model nb1 = x /dist=negbin;

ods output parameterestimates = nbpests;
model nb1 = x /dist=poisson;

ods output parameterestimates = nbpodests;
model nb1 = x /dist=poisson scale=p;
ods select all;
For analysis, we'll bring all the results together using the merge statement (section 1.5.7). Note that the output data sets contain the Wald CI limits as well as the estimates themselves; all have to be renamed in the merge, or they will overwrite each other.
data results;
merge
ppests (rename = (estimate = pp_est lowerwaldcl = pp_l 
  upperwaldcl = pp_u))
ppodests (rename = (estimate = ppod_est lowerwaldcl = ppod_l 
  upperwaldcl = ppod_u))
pnbests (rename = (estimate = pnb_est lowerwaldcl = pnb_l 
  upperwaldcl = pnb_u))
nbnbests (rename = (estimate = nbnb_est lowerwaldcl = nbnb_l 
  upperwaldcl = nbnb_u))
nbpests (rename = (estimate = nbp_est lowerwaldcl = nbp_l 
  upperwaldcl = nbp_u))
nbpodests (rename = (estimate = nbpod_est lowerwaldcl = nbpod_l 
  upperwaldcl = nbpod_u));
where parameter eq "x";
target = 0;
pprej =   ((pp_l gt target) or (pp_u lt target));
ppodrej =   ((ppod_l gt target) or (ppod_u lt target));
pnbrej =  ((pnb_l gt target) or (pnb_u lt target));
nbnbrej = ((nbnb_l gt target) or (nbnb_u lt target));
nbprej =  ((nbp_l gt target) or (nbp_u lt target));
nbpodrej =  ((nbpod_l gt target) or (nbpod_u lt target));
run;
The indicators of CI that exclude the null are calculated with appropriate names using logical tests that are 1 if true (rejections) and 0 if false. (See, e.g., section 1.4.9.) The final results can be obtained from proc means
proc means data = results; 
var pp_est ppod_est pnb_est nbnb_est nbp_est nbpod_est 
    pprej ppodrej pnbrej nbnbrej nbprej nbpodrej; 
run;

                     Variable             Mean
                     -------------------------
                     pp_est       -0.000349738
                     ppod_est     -0.000349738
                     pnb_est      -0.000344668
                     nbnb_est        0.0013738
                     nbp_est         0.0013588
                     nbpod_est       0.0013588
                     pprej           0.0505000
                     ppodrej         0.0501000
                     pnbrej          0.0468000
                     nbnbrej         0.0535000
                     nbprej          0.1427000
                     nbpodrej        0.0555000
                     -------------------------


All of the estimates appear to be unbiased. However, Poisson regression, when applied to the truly negative binomial data, appears to be dramatically anticonservative, rejecting the null (i.e., with CI excluding the null value) 14% of the time. The overdispersed model may be slightly biased as well. The estimated proportion of rejections is 5.55%, or 555 of 10,000 experiments. An exact CI for the proportion excludes 5%, here, although the anticonservative bias appears to be slight. To test other effect sizes, we'd change the mean, set in the first data step and the target in the results data. It would also be valuable to change the scale parameter for the negative binomial.


R
We begin by defining two simple functions: one to extract the standard errors from a model, and the second to assess whether Wald-type CI for parameter estimates exclude some value. It's a bit confusing that a standard error extracting function is not part of R. Or perhaps it is, and someone will point out the obvious function in the comments. It's useful to use the standard errors and construct the Wald CI in the current setting because the obvious alternative for constructing CI, the confint() function, uses profile likelihoods, which would be too time-consuming in a simulation setting. The second function accepts the parameter estimate, its standard error, and a fixed value which we want to know is in or out of the CI. Both functions are actually single expressions, but having them in hand will reduce the typing in the main function.
# this will work for any model object that works with vcov()
# the test for positive variance should be unnecessary but can't hurt
stderrs = function(model) {
  ifelse(min(diag(vcov(model))) > 0, sqrt(diag(vcov(model))), NA)  
}

# short and sweet: 1 if target is out of Wald CI, 0 if in
ciout = function(est, se, target){
  ifelse( (est - 1.96*se > target) | (est + 1.96*se < target), 1,0)
}
With these ingredients prepared, we're ready to write a function to fit the three models to the two sets of observed data. The function will accept a number of observations per data set and a true beta. The Poisson and overdispersed Poisson are fit with the glm() function (section 4.1.3, table 4.1) but the negative binomial uses the glm.nb() function found in the MASS package (section 4.1.5).
testpnb = function(n, beta) {
# make data
n = 250
x = rnorm(n)
mean = exp(-.25 + (beta * x))
pois = rpois(n,mean)
nb1 = rnbinom(n, size=1, mu=mean)

# fit models to Poisson data
pp = glm(pois ~x, family="poisson")
ppod = glm(pois ~x, family="quasipoisson")
pnb = glm.nb(pois~x)

# fit models to nb data
nbnb = glm.nb(nb1 ~x)
nbp = glm(nb1 ~x, family="poisson")
nbpod = glm(nb1 ~x, family="quasipoisson")

# extract parameter estimates using the coef() function
est = as.numeric(c(coef(pp)[2], coef(ppod)[2], coef(pnb)[2], coef(nbnb)[2], coef(nbp)[2], coef(nbpod)[2]))
# use our two new functions to get the SE and the CI indicator
se = c(stderrs(pp), stderrs(ppod), stderrs(pnb), stderrs(nbnb), stderrs(nbp), stderrs(nbpod))
ci = ciout(est, se, rep(beta,6))
return(matrix(c(est,se,ci),ncol=3))
}
Now we can use the convenient replicate() function to call the experiment many times. Since the output of testnb() is a matrix, the result of replicate() is a three-dimensional matrix, R * C * sheet, where sheet here corresponds to each experimental replicate. To summarize the results, we can use the rowMeans() function to get the proportion of rejections or the mean of the estimates.
mainout = replicate(10000,testpnb(250,0))

# the [,3,] below means "all rows, column 3, all sheets"
> rowMeans(mainout[,3,])
[1] 0.0490 0.0514 0.0463 0.0490 0.1403 0.0493

> rowMeans(mainout[,1,])
[1]  0.0003482834  0.0003482834  0.0003558526 -0.0004123949 -0.0003972441 -0.0003972441
The results agree completely with the SAS results discussed above. The naive Poisson regression would appear a bad idea--if the data are negative binomial, tests don't have the nominal size. It would be valuable to replicate the experiment with some other distribution for the real data as well. One approach to modeling count data would be to fit the Poisson and assess the quality of the fit, which can be done in several ways. However, this iterative fitting also jeopardizes the size of the test, in theory. Perhaps we'll explore the practical impact of this in a future entry. Fortunately, at least in this limited example, a nice alternative exists: We can just fit the negative binomial by default. The costs of this in terms of power could be assessed with a thorough simulation study, but are likely to be small, since only one additional parameter is estimated. And the size of the test is hardly affected at all. The quasi-Poisson model could also be recommended, but has the drawback of relying on what is actually not a viable distribution for the data. Some sources suggest that it may be even more flexible than the negative binomial, however.

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Tuesday, October 4, 2011

Example 9.8: New stuff in SAS 9.3-- Bayesian random effects models in Proc MCMC




Rounding off our reports on major new developments in SAS 9.3, today we'll talk about proc mcmc and the random statement.

Stand-alone packages for fitting very general Bayesian models using Markov chain Monte Carlo (MCMC) methods have been available for quite some time now. The best known of these are BUGS and its derivatives WinBUGS (last updated in 2007) and OpenBUGS . There are also some packages available that call these tools from R.

Today we'll consider a relatively simple model: Clustered Poisson data where cluster means are a constant plus a cluster-specific exponentially-distributed random effect. To be clear:
y_ij ~ Poisson(mu_i)
log(mu_i) = B_0 + r_i
r_i ~ Exponential(lambda)
Of course in Bayesian thinking all effects are random-- here we use the term in the sense of cluster-specific effects.

SAS
Several SAS procedures have a bayes statement that allow some specific models to be fit. For example, in Section 6.6 and example 8.17, we show Bayesian Poisson and logistic regression, respectively, using proc genmod. But our example today is a little unusual, and we could not find a canned procedure for it. For these more general problems, SAS has proc mcmc, which in SAS 9.3 allows random effects to be easily modeled.

We begin by generating the data, and fitting the naive (unclustered) model. We set B_0 = 1 and lambda = 0.4. There are 200 clusters of 10 observations each, which we might imagine represent 10 students from each of 200 classrooms.

data test2;
truebeta0 = 1;
randscale = .4;
call streaminit(1944);
do i = 1 to 200;
randint = rand("EXPONENTIAL") * randscale;
do ni = 1 to 10;
mu = exp(truebeta0 + randint);
y = rand("POISSON", mu);
output;
end;
end;
run;

proc genmod data = test2;
model y = / dist=poisson;
run;

Standard Wald 95%
Parameter Estimate Error Confidence Limits

Intercept 1.4983 0.0106 1.4776 1.5190

Note the inelegant SAS syntax for fitting an intercept-only model. The result is pretty awful-- 50% bias with respect to the global mean. Perhaps we'll do better by acknowledging the clustering. We might try that with normally distributed random effects in proc glimmix.

proc glimmix data = test2 method=laplace;
class i;
model y = / dist = poisson solution;
random int / subject = i type = un;
run;

Cov Standard
Parm Subject Estimate Error
UN(1,1) i 0.1682 0.01841

Standard
Effect Estimate Error t Value Pr > |t|
Intercept 1.3805 0.03124 44.20 <.0001

No joy-- still a 40% bias in the estimated mean. And the variance of the random effects is biased by more than 50%! Let's try fitting the model that generated the data.

proc mcmc data=test2 nmc=10000 thin=10 seed=2011;
parms fixedint 1 gscale 0.4;

prior fixedint ~ normal(0, var = 10000);
prior gscale ~ igamma(.01 , scale = .01 ) ;

random rint ~ gamma(shape=1, scale=gscale) subject = i initial=0.0001;
mu = exp(fixedint + rint);
model y ~ poisson(mu);
run;

The key points of the proc mcmc statement are nmc, the total number of Monte Carlo iterations to perform, and thin, which includes only every nth sample for inference. The prior and model statements are fairly obvious; we note that in more complex models, parameters that are listed within a single prior statement are sampled as a block. We're placing priors on the fixed (shared) intercept and the scale of the exponential. The mu line is actually just a programming statement-- it uses the same syntax as data step programming.
The newly available statement is random. The syntax here is similar to those for the other priors, with the addition of the subject option, which generates a unique parameter for each level of the subject variable. The random effects themselves can be used in later statements, as shown, to enter into data distributions. A final note here is that the exponential distribution isn't explicitly available, but since the gamma distribution with shape fixed at 1 defines the exponential, this is not a problem. Here are the key results.

Posterior Summaries

Standard
Parameter N Mean Deviation
fixedint 1000 1.0346 0.0244
gscale 1000 0.3541 0.0314

Posterior Intervals

Parameter Alpha HPD Interval
fixedint 0.050 0.9834 1.0791
gscale 0.050 0.2937 0.4163

The 95% HPD regions include the true values of the parameters and the posterior means are much less biased than in the model assuming normal random effects.

As usual, MCMC models should be evaluated carefully for convergence and coverage. In this example, I have some concerns (see default diagnostic figure above) and if it were real data I would want to do more.

R
The CRAN task view on Bayesian Inference includes a summary of tools for general and model-specific MCMC tools. However, there is nothing like proc mcmc in terms of being a general and easy to use tool that is native to R. The nearest options are to use R front ends to WinBUGS/OpenBUGS (R2WinBUGS) or JAGS (rjags). (A brief worked example of using rjags was posted last year by John Myles White.) Alternatively, with some math and a little sweat, the mcmc package would also work. We'll explore an approach through one or more of these packages in a later entry, and would welcome a collaboration from anyone who would like to take that on.

Monday, December 6, 2010

Example 8.17: Logistic regression via MCMC



In examples 8.15 and 8.16 we considered Firth logistic regression and exact logistic regression as ways around the problem of separation, often encountered in logistic regression. (Re-cap: Separation happens when all the observations in a category share a result, or when a continuous covariate predicts the outcome too well. It results in a likelihood maximized when a parameter is extremely large, and causes trouble with ordinary maximum likelihood approached.) Another option is to use Bayesian methods. Here we focus on Markov chain Monte Carlo (MCMC) approaches to Bayesian analysis.

SAS

SAS access to MCMC for logistic regression is provided through the bayes statement in proc genmod. There are several default priors available. The normal prior is the most flexible (in the software), allowing different prior means and variances for the regression parameters. The prior is specified through a separate data set. We begin by setting up the data in the events/trials syntax. Then we define a fairly vague prior for the intercept and the effect of the covariate: uncorrelated, and each with a mean of zero and a variance of 1000 (or a precision of 0.001). Finally, we call proc genmod to implement the analysis.


data testmcmc;
x=0; count=0; n=100; output;
x=1; count=5; n=100; output;
run;


data prior;
input _type_ $ Intercept x;
datalines;
Var 1000 1000
Mean 0 0
;
run;

title "Bayes with normal prior";
proc genmod descending data=testmcmc;
model count/n = x / dist=binomial link=logit;
bayes seed=10231995 nbi=1000 nmc=21000
coeffprior=normal(input=prior) diagnostics=all
statistics=summary;
run;

In the forgoing, nbi is the length of the burn-in and nmc is the total number of Monte Carlo iterations. The remaining options define the prior and request certain output. The diagnostics=all option generates many results, including posterior autocorrelations, Gelman-Rubin, Geweke, Raftery-Lewis, and Heidelberger-Welch diagnostics. The summary statistics are presented below; the diagnostics are not especially encouraging.

Posterior Summaries

Standard
Parameter N Mean Deviation

Intercept 21000 -20.3301 10.3277
x 21000 17.2857 10.3368

Posterior Summaries

Percentiles
Parameter 25% 50% 75%

Intercept -27.6173 -18.5558 -11.9025
x 8.8534 15.5267 24.6024

It seems that perhaps this prior is too vague. Perhaps we can make it a little more precise. A log odds ratio of 10 implies an odds ratio > 22,000, so perhaps we can accept a prior variance of 25, with about 95% of the prior weight between -10 and 10.

data prior;
input _type_ $ Intercept x;
datalines;
Var 25 25
Mean 0 0
;
run;

ods graphics on;
title "Bayes with normal prior";
proc genmod descending data=testmcmc;
model count/n = x / dist=binomial link=logit;
bayes seed=10231995 nbi=1000 nmc=21000
coeffprior=normal(input=prior) diagnostics=all
statistics=(summary interval) plot=all;
run;

Posterior Summaries

Standard
Parameter N Mean Deviation

Intercept 21000 -6.5924 1.7958
x 21000 3.5169 1.8431

Posterior Summaries

Percentiles
Parameter 25% 50% 75%

Intercept -7.6347 -6.3150 -5.2802
x 2.1790 3.2684 4.5929

Posterior Intervals

Parameter Alpha Equal-Tail Interval HPD Interval

Intercept 0.050 -10.8101 -3.8560 -10.2652 -3.5788
x 0.050 0.5981 7.7935 0.3997 7.4201

These are more plausible values, and the diagnostics are more favorable.
In the above, we added the keyword interval to generate confidence regions, and used the ods graphics on statement to enable ODS graphics and the plot=all option to generate the graphical output shown above.

R
There are several packages in R that include MCMC approaches. Here we use the MCMCpack package, which include the MCMClogit() function. It appears not to accept the weights option mentioned previously, so we generate data at the observation level to begin. Then we run the MCMC.

events.0=0 # for X = 0
events.1=5 # for X = 1
x = c(rep(0,100), rep(1,100))
y = c(rep(0,100-events.0), rep(1,events.0),
rep(0, 100-events.1), rep(1, events.1))

library(MCMCpack)
logmcmc = MCMClogit(y~as.factor(x), burnin=1000, mcmc=21000, b0=0, B0=.04)

The MCMClogit() accepts a formula object and allows the burn-in and number of Monte Carlo iterations to be specified. The prior mean b0 can be specified as a vector if different for each parameter, or as a scalar, as show. Similarly, the prior precision B0 can be a matrix or a scalar; if scalar, the parameters are uncorrelated in the prior.

> summary(logmcmc)

Iterations = 1001:22000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 21000

1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:

Mean SD Naive SE Time-series SE
(Intercept) -6.570 1.816 0.01253 0.03139
as.factor(x)1 3.513 1.859 0.01283 0.03363

2. Quantiles for each variable:

2.5% 25% 50% 75% 97.5%
(Intercept) -10.6591 -7.634 -6.299 -5.229 -3.831
as.factor(x)1 0.6399 2.147 3.292 4.599 7.698

plot(logmcmc)

The result of the plot() is shown below. These results and those from SAS are reassuringly similar. Many diagnostics are available through the coda package. The codamenu() function allows simple menu-based access to its tools.