SAS and R: Example 9.5: New stuff in SAS 9.3-- proc FMM

Tuesday, September 13, 2011

Example 9.5: New stuff in SAS 9.3-- proc FMM

Finite mixture models (FMMs) can be used in settings where some unmeasured classification separates the observed data into groups with different exposure/outcome relationships. One familiar example of this is a zero-inflated model, where some observations come from a degenerate distribution with all mass at 0. In that case the exposure/outcome relationship is less interesting in the degenerate distribution group, but there would be considerable interest in the estimated probability of group membership. Another possibly familiar setting is the estimation of a continuous density as a mixture of normal distributions.

More generally, there could be several groups, with "concomitant" covariates predicting group membership. Each group might have different sets of predictors and outcomes from different distribution families. On the other hand, in a "homogenous" mixture setting, all groups have the same distributional form, but with different parameter values. If the covariates in the model are the same, this setting is similar to an ordinary regression model where every observed covariate is interacted with the (unobserved) group membership variable.

SAS

SAS 9.3 includes the "experimental" FMM procedure to fit these models. We're unsure what criteria SAS uses to decide when a procedure is experimental and when it becomes "production", but experimental procedures in SAS/STAT usually do become production eventually.

As hinted at above, the generality implies by the models is fairly vast, and the FMM procedure includes a lot of generality. The most obvious limitation is that it requires independent observations.

We'll demonstrate with a simulated data set. We create a variable x that predicts both group membership and an outcome y with different linear regression parameters depending on group. The mixing probability follows a logistic regression with intercept=-2 and slope=1. The intercept and slope for the outcome are (0, 1) and (3, 1.2) for groups 0 and 1, respectively. The resulting data is plotted above using a default plot from proc glm using the actual group membership. A histogram and nearest normal density for the residuals from a simple OLS are shown, with R code, below-- they appear more or less normal.


data fmmtest;
do i = 1 to 5000;
  x = normal(0);
  group = (exp(-1 + 2*x)/(1 + exp(-1 + 2*x))) 
    gt uniform(0);
  y = (group * 3) + ((1 + group/5) * x) + 
    normal(0) * sqrt(1);
  output;
  end;
run;

title "Fixed k = 2";
proc fmm data = fmmtest;
  model y = x / k=2 equate=scale;
  probmodel x;
run;

In the model statement, the option k defines how many groups (or "components") are to be included. There are also options kmin and kmax which will fit models with various numbers of components and report results of one of them based on some model criterion. The equate option allows the user to force some elements of the component distributions to be equal. Here we force the residuals to be equal, since that's the model that generated our data. The probmodel statement is how the concomitant variables enter the model. The default settings model a logistic (or generalized logit) regression using the listed covariates.

We show only the parameter estimates.


                                Standard
Component  Effect     Estimate     Error  z Value  Pr > |z|
        1  Intercept    2.9856   0.04856    61.48    <.0001
        1  x            1.2060   0.03927    30.71    <.0001
        2  Intercept  -0.01978   0.02512    -0.79    0.4309
        2  x            0.9889   0.02453    40.32    <.0001
        1  Variance     1.0279   0.02531
        2  Variance     1.0279   0.02531


            Parameter Estimates for Mixing Probabilities
                         Standard
Effect       Estimate       Error    z Value    Pr > |z|
Intercept     -1.0018     0.05665     -17.68      <.0001
x              1.9403     0.07413      26.17      <.0001

The recovery of the true parameters is quite good. This may be unsurprising given the sample size and the distinctness of the components, but we think it's pretty impressive.

R
The CRAN Task View on Cluster Analysis & Finite Mixture Models provides an overview of packages available for R.

First, we'll make the data and take a look at the simple model diagnostics.


> x = rnorm(5000)
> probgroup1 = exp(-1 + 2*x)/(1 + exp(-1 + 2*x))
> group = ifelse(probgroup1 > runif(5000),1,0)
> y = (group * 3) + ((1 + group/5) * x) + rnorm(5000);

> resids = residuals(lm(y~x))
> hist(resids, freq = FALSE)
> dvals = seq(from=min(resids), to=max(resids),length=100)
> lines(dvals, dnorm(dvals, mean(resids), sd(resids)))

The histogram of the residuals from the simple OLS model is shown below. They appear reasonably normal.

Ron Pearson recently showed a simple example using the mixtools package. We tried to replicate the example above using mixtools but were unable to find functionality for concomitant variables. If you fit the closest model, assuming the mixing probabilities depend on no covariates, this is what happens:


> library(mixtools)
> mixout.mt = regmixEM(y,x,k=2,arbvar = FALSE)> summary(mixout.mt)
summary of regmixEM object:
  comp 1   comp 2
lambda  0.5275915 0.472408
sigma   1.0553877 1.055388
beta1  -0.0035015 2.237736
beta2   1.5068146 2.004008

which is pretty awful.

Fortunately, the flexmix package, while a bit more difficult to use, offers the needed flexibility. As a side note, the generality of the package is pretty awe-inspiring. Nice work!


library(flexmix)
mixout.fm=flexmix(y~x, k=2, model=FLXMRglmfix(y~x, varFix=TRUE),
   concomitant=FLXPmultinom(~ x))

The flexmix function uses a variety of special objects that are created by other functions the package provides. Here we use the FLXMRglmfix function to force equal variances across the components and the FLXPmultinom function to define the logistic regression on the covariate x. The results are:


> parameters(mixout5)
                   Comp.1       Comp.2
coef.(Intercept) 3.000412 -0.008360171
coef.x           1.190454  1.047660263
sigma            0.972306  0.972306015
> parameters(mixout5, which = "concomitant")
            1          2
(Intercept) 0  0.9605367
x           0 -1.9109429

which also reproduces reality with impressive accuracy. Note that the concomitant variable model apparently predicts membership in the second component, so the signs are reversed from the generating model.

1 comment:

Richard Thornton said...: Very interesting!; September 21, 2011 at 9:32 PM

Post a Comment

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

SAS and R

Catalogs of posts

Tuesday, September 13, 2011

Example 9.5: New stuff in SAS 9.3-- proc FMM

1 comment:

About SAS and R

Topics discussed