SAS and R: proc glm

Showing posts with label proc glm. Show all posts

Monday, June 25, 2012

Example 9.36: Levene's test for equal variances

The assumption of equal variances among the groups in analysis of variance is an expression of the assumption of homoscedasticity for linear models more generally. For ANOVA, this assumption can be tested via Levene's test. The test is a function of the residuals and means within each group, though various modifications are used, including the Brown-Forsythe test. This uses the medians within group, rather than the mean, and is recommended when normality may be suspect.

We illustrate using the HELP data set available here, modeling depressive symptoms (assessed via CESD) as a function of abused substance.

SAS
In SAS, the tests are available as an option to the means statement in proc glm


data help;
set "C:\book\help.sas7bdat";
run;

proc glm data = help;
class substance;
model cesd = substance;
means substance / hovtest=levene(type=abs) hovtest=bf;
run;

The two requested tests are a version of Levene's test that is produced in R, below, and the aforementioned Brown-Forsythe test. The relevant results are shown below.


           Levene's Test for Homogeneity of CESD Variance
           ANOVA of Absolute Deviations from Group Means

                           Sum of        Mean
  Source           DF     Squares      Square    F Value    Pr > F

  SUBSTANCE         2       272.4       136.2       2.61    0.0747
  Error           450     23480.7     52.1793


     Brown and Forsythe's Test for Homogeneity of CESD Variance
          ANOVA of Absolute Deviations from Group Medians

                           Sum of        Mean
  Source           DF     Squares      Square    F Value    Pr > F

  SUBSTANCE         2       266.0       133.0       2.46    0.0864
  Error           450     24310.9     54.0243

There's some suggestion of a lack of homoscedasticity; it might be wise to consider methods robust to violations of this assumption.

R
In R, the test can be found in the levene.test() function in the lawstat package.


help = read.csv("http://www.math.smith.edu/r/data/help.csv")
library(lawstat)
with(help, levene.test(cesd, as.factor(substance), location="mean"))

 classical Levene's test based on the absolute deviations from the mean 
          ( none not applied because the location is not set to median )

data:  cesd 
Test Statistic = 2.6099, p-value = 0.07465

with(help, levene.test(cesd, as.factor(substance),location="median"))

 modified robust Brown-Forsythe Levene-type test based on the absolute 
           deviations from the median

data:  cesd 
Test Statistic = 2.462, p-value = 0.08641

An unrelated note about aggregators:We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Tuesday, September 13, 2011

Example 9.5: New stuff in SAS 9.3-- proc FMM

Finite mixture models (FMMs) can be used in settings where some unmeasured classification separates the observed data into groups with different exposure/outcome relationships. One familiar example of this is a zero-inflated model, where some observations come from a degenerate distribution with all mass at 0. In that case the exposure/outcome relationship is less interesting in the degenerate distribution group, but there would be considerable interest in the estimated probability of group membership. Another possibly familiar setting is the estimation of a continuous density as a mixture of normal distributions.

More generally, there could be several groups, with "concomitant" covariates predicting group membership. Each group might have different sets of predictors and outcomes from different distribution families. On the other hand, in a "homogenous" mixture setting, all groups have the same distributional form, but with different parameter values. If the covariates in the model are the same, this setting is similar to an ordinary regression model where every observed covariate is interacted with the (unobserved) group membership variable.

SAS

SAS 9.3 includes the "experimental" FMM procedure to fit these models. We're unsure what criteria SAS uses to decide when a procedure is experimental and when it becomes "production", but experimental procedures in SAS/STAT usually do become production eventually.

As hinted at above, the generality implies by the models is fairly vast, and the FMM procedure includes a lot of generality. The most obvious limitation is that it requires independent observations.

We'll demonstrate with a simulated data set. We create a variable x that predicts both group membership and an outcome y with different linear regression parameters depending on group. The mixing probability follows a logistic regression with intercept=-2 and slope=1. The intercept and slope for the outcome are (0, 1) and (3, 1.2) for groups 0 and 1, respectively. The resulting data is plotted above using a default plot from proc glm using the actual group membership. A histogram and nearest normal density for the residuals from a simple OLS are shown, with R code, below-- they appear more or less normal.


data fmmtest;
do i = 1 to 5000;
  x = normal(0);
  group = (exp(-1 + 2*x)/(1 + exp(-1 + 2*x))) 
    gt uniform(0);
  y = (group * 3) + ((1 + group/5) * x) + 
    normal(0) * sqrt(1);
  output;
  end;
run;

title "Fixed k = 2";
proc fmm data = fmmtest;
  model y = x / k=2 equate=scale;
  probmodel x;
run;

In the model statement, the option k defines how many groups (or "components") are to be included. There are also options kmin and kmax which will fit models with various numbers of components and report results of one of them based on some model criterion. The equate option allows the user to force some elements of the component distributions to be equal. Here we force the residuals to be equal, since that's the model that generated our data. The probmodel statement is how the concomitant variables enter the model. The default settings model a logistic (or generalized logit) regression using the listed covariates.

We show only the parameter estimates.


                                Standard
Component  Effect     Estimate     Error  z Value  Pr > |z|
        1  Intercept    2.9856   0.04856    61.48    <.0001
        1  x            1.2060   0.03927    30.71    <.0001
        2  Intercept  -0.01978   0.02512    -0.79    0.4309
        2  x            0.9889   0.02453    40.32    <.0001
        1  Variance     1.0279   0.02531
        2  Variance     1.0279   0.02531


            Parameter Estimates for Mixing Probabilities
                         Standard
Effect       Estimate       Error    z Value    Pr > |z|
Intercept     -1.0018     0.05665     -17.68      <.0001
x              1.9403     0.07413      26.17      <.0001

The recovery of the true parameters is quite good. This may be unsurprising given the sample size and the distinctness of the components, but we think it's pretty impressive.

R
The CRAN Task View on Cluster Analysis & Finite Mixture Models provides an overview of packages available for R.

First, we'll make the data and take a look at the simple model diagnostics.


> x = rnorm(5000)
> probgroup1 = exp(-1 + 2*x)/(1 + exp(-1 + 2*x))
> group = ifelse(probgroup1 > runif(5000),1,0)
> y = (group * 3) + ((1 + group/5) * x) + rnorm(5000);

> resids = residuals(lm(y~x))
> hist(resids, freq = FALSE)
> dvals = seq(from=min(resids), to=max(resids),length=100)
> lines(dvals, dnorm(dvals, mean(resids), sd(resids)))

The histogram of the residuals from the simple OLS model is shown below. They appear reasonably normal.

Ron Pearson recently showed a simple example using the mixtools package. We tried to replicate the example above using mixtools but were unable to find functionality for concomitant variables. If you fit the closest model, assuming the mixing probabilities depend on no covariates, this is what happens:


> library(mixtools)
> mixout.mt = regmixEM(y,x,k=2,arbvar = FALSE)> summary(mixout.mt)
summary of regmixEM object:
  comp 1   comp 2
lambda  0.5275915 0.472408
sigma   1.0553877 1.055388
beta1  -0.0035015 2.237736
beta2   1.5068146 2.004008

which is pretty awful.

Fortunately, the flexmix package, while a bit more difficult to use, offers the needed flexibility. As a side note, the generality of the package is pretty awe-inspiring. Nice work!


library(flexmix)
mixout.fm=flexmix(y~x, k=2, model=FLXMRglmfix(y~x, varFix=TRUE),
   concomitant=FLXPmultinom(~ x))

The flexmix function uses a variety of special objects that are created by other functions the package provides. Here we use the FLXMRglmfix function to force equal variances across the components and the FLXPmultinom function to define the logistic regression on the covariate x. The results are:


> parameters(mixout5)
                   Comp.1       Comp.2
coef.(Intercept) 3.000412 -0.008360171
coef.x           1.190454  1.047660263
sigma            0.972306  0.972306015
> parameters(mixout5, which = "concomitant")
            1          2
(Intercept) 0  0.9605367
x           0 -1.9109429

which also reproduces reality with impressive accuracy. Note that the concomitant variable model apparently predicts membership in the second component, so the signs are reversed from the generating model.

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

SAS and R

Catalogs of posts

Monday, June 25, 2012

Example 9.36: Levene's test for equal variances

Tuesday, September 13, 2011

Example 9.5: New stuff in SAS 9.3-- proc FMM

About SAS and R

Topics discussed