Showing posts with label proc glm. Show all posts
Showing posts with label proc glm. Show all posts

Monday, June 25, 2012

Example 9.36: Levene's test for equal variances

The assumption of equal variances among the groups in analysis of variance is an expression of the assumption of homoscedasticity for linear models more generally. For ANOVA, this assumption can be tested via Levene's test. The test is a function of the residuals and means within each group, though various modifications are used, including the Brown-Forsythe test. This uses the medians within group, rather than the mean, and is recommended when normality may be suspect.

We illustrate using the HELP data set available here, modeling depressive symptoms (assessed via CESD) as a function of abused substance.

SAS
In SAS, the tests are available as an option to the means statement in proc glm

data help;
set "C:\book\help.sas7bdat";
run;

proc glm data = help;
class substance;
model cesd = substance;
means substance / hovtest=levene(type=abs) hovtest=bf;
run;

The two requested tests are a version of Levene's test that is produced in R, below, and the aforementioned Brown-Forsythe test. The relevant results are shown below.

Levene's Test for Homogeneity of CESD Variance
ANOVA of Absolute Deviations from Group Means

Sum of Mean
Source DF Squares Square F Value Pr > F

SUBSTANCE 2 272.4 136.2 2.61 0.0747
Error 450 23480.7 52.1793


Brown and Forsythe's Test for Homogeneity of CESD Variance
ANOVA of Absolute Deviations from Group Medians

Sum of Mean
Source DF Squares Square F Value Pr > F

SUBSTANCE 2 266.0 133.0 2.46 0.0864
Error 450 24310.9 54.0243

There's some suggestion of a lack of homoscedasticity; it might be wise to consider methods robust to violations of this assumption.


R
In R, the test can be found in the levene.test() function in the lawstat package.

help = read.csv("http://www.math.smith.edu/r/data/help.csv")
library(lawstat)
with(help, levene.test(cesd, as.factor(substance), location="mean"))

classical Levene's test based on the absolute deviations from the mean
( none not applied because the location is not set to median )

data: cesd
Test Statistic = 2.6099, p-value = 0.07465

with(help, levene.test(cesd, as.factor(substance),location="median"))

modified robust Brown-Forsythe Levene-type test based on the absolute
deviations from the median

data: cesd
Test Statistic = 2.462, p-value = 0.08641


An unrelated note about aggregators:We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Tuesday, September 13, 2011

Example 9.5: New stuff in SAS 9.3-- proc FMM




Finite mixture models (FMMs) can be used in settings where some unmeasured classification separates the observed data into groups with different exposure/outcome relationships. One familiar example of this is a zero-inflated model, where some observations come from a degenerate distribution with all mass at 0. In that case the exposure/outcome relationship is less interesting in the degenerate distribution group, but there would be considerable interest in the estimated probability of group membership. Another possibly familiar setting is the estimation of a continuous density as a mixture of normal distributions.

More generally, there could be several groups, with "concomitant" covariates predicting group membership. Each group might have different sets of predictors and outcomes from different distribution families. On the other hand, in a "homogenous" mixture setting, all groups have the same distributional form, but with different parameter values. If the covariates in the model are the same, this setting is similar to an ordinary regression model where every observed covariate is interacted with the (unobserved) group membership variable.

SAS

SAS 9.3 includes the "experimental" FMM procedure to fit these models. We're unsure what criteria SAS uses to decide when a procedure is experimental and when it becomes "production", but experimental procedures in SAS/STAT usually do become production eventually.

As hinted at above, the generality implies by the models is fairly vast, and the FMM procedure includes a lot of generality. The most obvious limitation is that it requires independent observations.

We'll demonstrate with a simulated data set. We create a variable x that predicts both group membership and an outcome y with different linear regression parameters depending on group. The mixing probability follows a logistic regression with intercept=-2 and slope=1. The intercept and slope for the outcome are (0, 1) and (3, 1.2) for groups 0 and 1, respectively. The resulting data is plotted above using a default plot from proc glm using the actual group membership. A histogram and nearest normal density for the residuals from a simple OLS are shown, with R code, below-- they appear more or less normal.

data fmmtest;
do i = 1 to 5000;
x = normal(0);
group = (exp(-1 + 2*x)/(1 + exp(-1 + 2*x)))
gt uniform(0);
y = (group * 3) + ((1 + group/5) * x) +
normal(0) * sqrt(1);
output;
end;
run;

title "Fixed k = 2";
proc fmm data = fmmtest;
model y = x / k=2 equate=scale;
probmodel x;
run;

In the model statement, the option k defines how many groups (or "components") are to be included. There are also options kmin and kmax which will fit models with various numbers of components and report results of one of them based on some model criterion. The equate option allows the user to force some elements of the component distributions to be equal. Here we force the residuals to be equal, since that's the model that generated our data. The probmodel statement is how the concomitant variables enter the model. The default settings model a logistic (or generalized logit) regression using the listed covariates.

We show only the parameter estimates.

Standard
Component Effect Estimate Error z Value Pr > |z|
1 Intercept 2.9856 0.04856 61.48 <.0001
1 x 1.2060 0.03927 30.71 <.0001
2 Intercept -0.01978 0.02512 -0.79 0.4309
2 x 0.9889 0.02453 40.32 <.0001
1 Variance 1.0279 0.02531
2 Variance 1.0279 0.02531


Parameter Estimates for Mixing Probabilities
Standard
Effect Estimate Error z Value Pr > |z|
Intercept -1.0018 0.05665 -17.68 <.0001
x 1.9403 0.07413 26.17 <.0001

The recovery of the true parameters is quite good. This may be unsurprising given the sample size and the distinctness of the components, but we think it's pretty impressive.

R
The CRAN Task View on Cluster Analysis & Finite Mixture Models provides an overview of packages available for R.

First, we'll make the data and take a look at the simple model diagnostics.

> x = rnorm(5000)
> probgroup1 = exp(-1 + 2*x)/(1 + exp(-1 + 2*x))
> group = ifelse(probgroup1 > runif(5000),1,0)
> y = (group * 3) + ((1 + group/5) * x) + rnorm(5000);

> resids = residuals(lm(y~x))
> hist(resids, freq = FALSE)
> dvals = seq(from=min(resids), to=max(resids),length=100)
> lines(dvals, dnorm(dvals, mean(resids), sd(resids)))

The histogram of the residuals from the simple OLS model is shown below. They appear reasonably normal.

Ron Pearson recently showed a simple example using the mixtools package. We tried to replicate the example above using mixtools but were unable to find functionality for concomitant variables. If you fit the closest model, assuming the mixing probabilities depend on no covariates, this is what happens:

> library(mixtools)
> mixout.mt = regmixEM(y,x,k=2,arbvar = FALSE)> summary(mixout.mt)
summary of regmixEM object:
comp 1 comp 2
lambda 0.5275915 0.472408
sigma 1.0553877 1.055388
beta1 -0.0035015 2.237736
beta2 1.5068146 2.004008

which is pretty awful.

Fortunately, the flexmix package, while a bit more difficult to use, offers the needed flexibility. As a side note, the generality of the package is pretty awe-inspiring. Nice work!

library(flexmix)
mixout.fm=flexmix(y~x, k=2, model=FLXMRglmfix(y~x, varFix=TRUE),
concomitant=FLXPmultinom(~ x))

The flexmix function uses a variety of special objects that are created by other functions the package provides. Here we use the FLXMRglmfix function to force equal variances across the components and the FLXPmultinom function to define the logistic regression on the covariate x. The results are:

> parameters(mixout5)
Comp.1 Comp.2
coef.(Intercept) 3.000412 -0.008360171
coef.x 1.190454 1.047660263
sigma 0.972306 0.972306015
> parameters(mixout5, which = "concomitant")
1 2
(Intercept) 0 0.9605367
x 0 -1.9109429

which also reproduces reality with impressive accuracy. Note that the concomitant variable model apparently predicts membership in the second component, so the signs are reversed from the generating model.