Monday, October 31, 2011

Example 9.12: simpler ways to carry out permutation tests



In a previous entry, as well as section 2.4.3 of the book, we describe how to carry out a 2 group permutation test in SAS as well as with the coin package in R. We demonstrate with comparing the ages of the female and male subjects in the HELP study.

In this entry, we revisit the permutation test using other functions.

R

We describe a simpler interface to carry out and visualize permutation tests using the functions from the mosaic package.

We begin by replicating our previous example (section 2.6.4, p. 87).

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
library(coin)
numsim = 1000
oneway_test(age ~ as.factor(female),
distribution=approximate(B=numsim-1), data=ds)

which yields the following output:

Approximative 2-Sample Permutation Test

data: age by as.factor(female) (0, 1)
Z = -0.9194, p-value = 0.3894
alternative hypothesis: true mu is not equal to 0

We conclude that there is minimal evidence to contradict the null hypothesis that the two groups have the same ages back in their respective populations.

Now we demonstrate another way to run this test in a more general form, using the mosaic package's do() function combined with the * operator to repeatedly carry out fitting a linear model with a parameter for female which will calculate our test statistic (difference in means between females and males) repeatedly after shuffling the group indicators. The shuffle() function permutes the group labels, and then the summary statistic is calculated.

> library(mosaic)
> obsdiff = with(ds, mean(age[female==1]) - mean(age[female==0]))
> obsdiff
mean
0.7841284
> summary(age ~ female, data=ds, fun=mean)
age N=453

+-------+---+---+--------+
| | |N |mean |
+-------+---+---+--------+
|female |No |346|35.46821|
| |Yes|107|36.25234|
+-------+---+---+--------+
|Overall| |453|35.65342|
+-------+---+---+--------+

Now we can run the permutation test, then display the results on a souped-up histogram with different shading for values larger in magnitude than the observed statistic (see above).

res = do(numsim) * lm(age ~ shuffle(female), data=ds)
pvalue = sum(abs(res$female) > abs(obsdiff)) / numsim
xhistogram(~ female, groups = abs(female) > abs(obsdiff),
n=50, density=TRUE, data=res, xlab="difference between groups",
main=paste("Permutation test result: p=", round(pvalue, 3)))

The results are similar to those from the previous test: there is little evidence to contradict the null hypothesis.

SAS

In SAS, we'll take another approach, delving into the capabilities of proc iml to make a manual permutation test. We begin by reading the data and replicating the example in the book.

libname k 'c:\book';
proc npar1way data = k.help;
class female;
var age;
exact scores=data / mc n= 9999 alpha = .05;
run;

Data Scores One-Way Analysis

Chi-Square 0.8453
DF 1
Pr > Chi-Square 0.3579

Permuting data is a very awkward thing to do in data steps. But it turns out to be easy in proc iml (the built-in SAS matrix language). Here we read in the key variables from the data set (use and read). Then we generate the permutations (ranperm). However, this generates row for each permuted data set, while we need a column for each, so we transpose the matrix (t) before saving it. Then we save the resulting data in a SAS data set with the female variable. Note that we permuted the ages only, as opposed to the R example-- it doesn't matter which is permuted, of course. Much of the proc iml code used here can be found in section 1.9 of the book-- however, note that curly braces are required in the read statement, as shown below.

proc iml;
use k.help;
read all var{female age} into x;
p = t(ranperm(x[,2],1000));
paf = x[,1]||p;
create newds from paf;
append from paf;
quit;

With the permuted data in hand, we use proc ttest (section 2.4.1) with the ODS system to generate and save the differences. Note that the default variable names from proc iml are fairly nondescript. With the 1000 permuted statistics in hand, we can generate a histogram of the statistics and a p-value with proc univariate.

ods output conflimits=diff;
proc ttest data=newds plots=none;
class col1;
var col2 - col1001;
run;

proc univariate data=diff;
where method = "Pooled";
var mean;
histogram mean / normal;
run;

data diff2;
set diff;
absdiff = abs(mean);
run;

proc univariate data=diff2
loccount mu0 = 0.7841284;
where method = "Pooled";
var absdiff;
run;

Location Counts: Mu0=0.78

Count Value

Num Obs > Mu0 357
Num Obs ^= Mu0 1000
Num Obs < Mu0 643

Sunday, October 30, 2011

Proc tabulate for simple statistics (corrected)

Ken Beath, of Macquarie University, commented on an earlier entry that the best way to generate summary statistics is using proc tabulate. While the best tools might differ, depending on the purpose, we wanted to share Ken's code demonstrating how to replicate the R mosaic package tables using proc tabulate.

SAS

Ken's fully annotated code is appended below; we highlight the key syntax elements here. Reading in the data is shown in many examples.

proc tabulate data=help;
class substance;
var cesd;
table (substance all),cesd*(n nmiss mean median);
run;

The class and var statements serve their usual purpose of identifying the categorical and analysis variables. The table statement does the work of the procedure, specifying a table with rows (requested before the comma) for each level of substance and overall, with columns (requested after the the *) to include the listed statistics for the analysis variable cesd. The resulting table is shown below.

------------------------------------------------------------------------
| | cesd |
| |---------------------------------------------------|
| | N | NMiss | Mean | Median |
|------------------+------------+------------+------------+------------|
|substance | | | | |
|------------------| | | | |
|alcohol | 177.00| 0.00| 34.37| 36.00|
|------------------+------------+------------+------------+------------|
|cocaine | 152.00| 0.00| 29.42| 30.00|
|------------------+------------+------------+------------+------------|
|heroin | 124.00| 0.00| 34.87| 35.00|
|------------------+------------+------------+------------+------------|
|All | 453.00| 0.00| 32.85| 34.00|
------------------------------------------------------------------------


Below we show Ken's code, complete with his helpful annotations. Note his use of formats. We're not fond of formats, but for presentation, as opposed to analysis, they can be very useful.

(Note: a prior version of this entry inexplicably referred to proc report, rather than proc tabulate.)


PROC IMPORT OUT= help
DATAFILE= "C:\Users\kbeath\Documents\tabulate\help.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;

/* missing option create a category missing for each categorical
variable, always a good idea;
the table statement specifies row then column;
so for this example we have substance defining the rows, and
cesd statistics the columns */

proc tabulate data=help missing;
class substance;
var cesd;
table substance,cesd*(mean n nmiss);
run;

/* formchar specifies the characters used to form the borders
- we set them all to blank to have no borders;
mean*f=8.2 specifies that the mean is formatted using
an 8.2 format, etc*/

proc tabulate data=help missing formchar=' ';
class substance;
var cesd;
table substance,cesd*(mean*f=8.2 n*f=8. nmiss*f=8.);
run;


/* substance="Substance" causes change in label to Substance */

proc format;
value $subf "alcohol"="Alcohol"
"cocaine"="Cocaine" "heroin"="Heroin";

proc tabulate data=help missing formchar=' ';
class substance;
var cesd;
format substance $subf.;
table substance="Substance",
cesd="CESD"*(mean*f=8.2 n*f=8. nmiss*f=8.);
run;

/* all the statistics. I've changed the format to 7.2
so they all fit on a line */

proc tabulate data=help missing formchar=' ';
class substance;
var cesd;
format substance $subf.;
table substance="Substance",
cesd="CESD"*(n*f=8. nmiss*f=8.
(mean std min q1 median q3 max)*f=7.2);
run;

/* add a line for all */

proc tabulate data=help missing formchar=' ';
class substance;
var cesd;
format substance $subf.;
table (substance="Substance" all),
cesd="CESD"*(n*f=8. nmiss*f=8.
(mean std min q1 median q3 max)*f=7.2);
run;

/* to show how easy it is, further subdivide
by racial group */

proc tabulate data=help missing formchar=' ';
class substance racegrp;
var cesd;
format substance $subf.;
table (racegrp all)*(substance="Substance" all),
cesd="CESD"*(n*f=8. nmiss*f=8.
(mean std min q1 median q3 max)*f=7.2);
run;

Tuesday, October 25, 2011

Example 9.11: Employment plot


A facebook friend posted the picture reproduced above-- it makes the case that President Obama has been a successful creator of jobs, and also paints GW Bush as a president who lost jobs. Another friend pointed out that to be fair, all of Bush's presidency ought to be included. Let's make a fair plot of job growth and loss. Data can be retrieved from the Bureau of Labor Statistics, where Nick will be spending his next sabbatical. The extract we use below is also available from the book website. This particular table reports the cumulative change over the past three months, adjusting for seasonal trends. This tends to smooth out the line.

SAS

The first job is to get the data into SAS. Here we demonstrate reading it directly from a URL, as outlined in section 1.1.6.

filename myurl
url "http://www.math.smith.edu/sasr/datasets/bls.csv";

data q_change;
infile myurl delimiter=',';
input Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Annual;
run;

The raw data are in a pretty inconvenient format for plotting. To make a long, narrow data set with a row for each month, we'll use proc transpose (section 1.5.3) to flip each year on its side. Then, to attach a date to each measure, we'll use the compress function. First we add "01" (the first of the month) to the month name, which is in a variable created by proc transpose with the default name "_name_". Then we tack on the year variable, and input the string in the date format. The resulting variable is a SAS date (number of days since 12/31/1959, see section 1.6.1).

proc transpose data=q_change out=q2;
by year;
run;

data q3;
set q2;
date1 = input(compress("01"||_name_||year),date11.);
run;

Now the data are ready to plot. It would probably be possible to use proc sgplot but proc gplot is more flexible and allows better control for presentation graphics.

title "3-month change in private-sector jobs, seasonally adjusted";
axis1 minor = none label = (h=2 angle = 90 "Thousands of jobs")
value = (h = 2);
axis2 minor = none value = (h=2)label = none
offset = (1cm, -5cm)
reflabel = (h=1.5 "Truman" "Eisenhower" "Kennedy/Johnson"
"Nixon/Ford" "Carter" "Reagan" "GHW Bush" "Clinton" "GW Bush" "Obama" );
symbol1 i=j v=none w=3;

proc gplot data=q3;
plot col1 * date1 / vaxis=axis1 haxis=axis2 vref=0
href = '12apr1945'd '21jan1953'd '20jan1961'd '20jan1969'd
'20jan1977'd '21jan1981'd '20jan1989'd '20jan1993'd
'20jan2001'd '20jan2009'd;
format date1 monyy6.;
run;
quit;

Much of the syntax above has been demonstrated in our book examples and blog entries. What may be unfamiliar is the use of the href option in the plot statement and the reflabel option in the axis statement. The former draws reference lines at the listed values in the plot, while the latter adds titles to these lines. The resulting plot is shown here.

Looking fairly across postwar presidencies, only the Kennedy/Johnson and Clinton years were mostly unmarred by periods with large losses in jobs. The Carter years were also times jobs were consistently added. While the graphic shared on facebook overstates the case against GW Bush, it fairly shows Obama as a job creator thus far, to the extent a president can be credited with jobs created on his watch.


R

The main trick in R is loading the data and getting it into the correct format.
here we use cbind() to grab the appropriate columns, then transpose that matrix and turn it into a vector which serves as input for making a time series object with the ts() command (as in section 4.2.8). Once this is created, the default plot for a time series object is close to what we have in mind.

ds = read.csv("http://www.math.smith.edu/sasr/datasets/bls.csv",
header=FALSE)
jobs = with(ds, cbind(V2, V3, V4, V5, V6, V7, V8, V9, V10,
V11, V12, V13))
jobsts = ts(as.vector(t(jobs)), start=c(1945, 1),
frequency=12)
plot(jobsts, plot.type="single", col=4,
ylab="number of jobs (in thousands)")

All that remains is to add the reference lines for 0 jobs and the presidencies. The lines are most easily added with the abline() function (section 5.2.1). Easier than adding labels for the lines within the plot function will be to use the mtext() function to place the labels in the margins. We'll write a little function to save a few keystrokes by plotting the line and adding the label together.

abline(h=0)
presline = function(date,line,name){
mtext(at=date,text= name, line=line)
abline(v = date)
}
presline(1946,1,"Truman")
presline(1953,2,"Eisenhower")
presline(1961,1,"Kennedy/Johnson")
presline(1969,2,"Nixon/Ford")
presline(1977,1,"Carter")
presline(1981,2,"Reagan")
presline(1989,1,"GHW Bush")
presline(1993,2,"Clinton")
presline(2001,1,"GW Bush")
presline(2009,2,"Obama")

It might be worthwhile to standardize the number of jobs to the population size, since the dramatic loss of jobs due to demobilization after the Second World War during a single month in 1945 (2.4 million) represented 1.7% of the population, while the recent loss of 2.3 million jobs in 2009 represented only 0.8% of the population.


Monday, October 17, 2011

Example 9.10: more regression trees and recursive partitioning with "partykit"


We discuss recursive partitioning, a technique for classification and regression using a decision tree in section 6.7.3 of the book. Support for these methods is available within the rpart package. Torsten Hothorn and Achim Zeileis have extended the support for these methods with the partykit package, which provides a toolkit with infrastructure for representing, summarizing, and visualizing tree-structured regression and classification models.

In this entry, we revisit the example from the book, which worked to classify predictors of homelessness in the HELP study.

R


ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
library(rpart); library(partykit)
ds$sub = as.factor(ds$substance)
homeless.rpart = rpart(homeless ~ female + i1 + sub + sexrisk + mcs +
pcs, method="class", data=ds)
plot(homeless.rpart)
text(homeless.rpart)

This reproduces Figure 6.2 (p. 236) from the book, while we can display the output from the classification tree using the printcp() command.

> printcp(homeless.rpart)
Classification tree:
rpart(formula = home ~ female + i1 + sub + sexrisk + mcs + pcs,
data = ds, method = "class")
Variables actually used in tree construction:
[1] female i1 mcs pcs sexrisk

Root node error: 209/453 = 0.5
n= 453
CP nsplit rel error xerror xstd
1 0.10 0 1.0 1.0 0.05
2 0.05 1 0.9 1.1 0.05
3 0.03 4 0.8 1.1 0.05
4 0.02 5 0.7 1.0 0.05
5 0.01 7 0.7 0.9 0.05
6 0.01 9 0.7 0.9 0.05

Using the partykit package, we can make a nice graphic describing these results. We'll use the plot.party() function on a party object (coerced from the rpart object generated above using as.party()). This provides more information about the tree (as seen in the Figure above).

plot(as.party(homeless.rpart), type="simple")

More information as well as a lovely vignette can be found here.

SAS

Recursive partitioning is available through SAS Enterprise Miner, a module not always included in SAS installations.

Thursday, October 13, 2011

Example 9.9: Simplifying R using the mosaic package (part 1)



While both SAS and R are powerful systems for statistical analysis, they can be frustrating to new users or those learning statistics for the first time.

R
The mosaic package is designed to help simplify the interface for such new users, while allowing them to undertake sophisticated analyses.

As an example of how the package simplifies life for the novice user, consider calculating summary statistics and displaying a densityplot for the CESD (measure of depressive symptom) scores by substance abuse group in the HELP dataset. Doing this in R without the package would require mastering a package such as plyr to replicate results by substance or a typing-intensive use of syntax to select rows corresponding to each substance.

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
library(mosaic)
options(digits=3)

After loading the data and the package, and setting the number of digits to a more reasonable default, we can call the mean() function to easily calculate this statistic (denoted by S in the result) for each of the three substance abuse groups alcohol, cocaine or heroin.
> mean(cesd ~ substance, data=ds)
  substance    S   N Missing
1   alcohol 34.4 177       0
2   cocaine 29.4 152       0
3    heroin 34.9 124       0

Similar results are seen when we calculate the standard deviations per group:
> sd(cesd ~ substance, data=ds)
  substance    S   N Missing
1   alcohol 12.1 177       0
2   cocaine 13.4 152       0
3    heroin 11.2 124       0

Another function can calculate a raft of summary statistics for each group that are nicely formatted.

UPDATE FROM 12/14/2013: favstats(cesd ~ substance, data=ds)  is an even better way to do this...
> summary(cesd ~ substance, data=ds, fun=favstats)
cesd    N=453
+---------+-------+---+----+---+-------+---+----+-----+----+---+--------+
|         |       |N  |min |Q1 |median |Q3 |max |mean |sd  |n  |missing |
+---------+-------+---+----+---+-------+---+----+-----+----+---+--------+
|substance|alcohol|177|4   |26 |36     |42 |58  |34.4 |12.1|177|0       |
|         |cocaine|152|1   |19 |30     |39 |60  |29.4 |13.4|152|0       |
|         |heroin |124|4   |28 |35     |43 |56  |34.9 |11.2|124|0       |
+---------+-------+---+----+---+-------+---+----+-----+----+---+--------+
|Overall  |       |453|1   |25 |34     |41 |60  |32.8 |12.5|453|0       |
+---------+-------+---+----+---+-------+---+----+-----+----+---+--------+

These commands allow quick review of the data to ensure, for example, that assumptions of equal variance are justified, or that coding errors or missing values haven't crept in.

A graphical depiction using a set of densityplots (shown above) can be created using the command:
densityplot(~ cesd, group=substance, data=ds, auto.key=TRUE)


SAS
We're unaware of any similar program that attempts to simplify SAS syntax for educational use. To replicate the above results, we would use the means and sgpanel procedures.
data ds;
set "C:\book\help.sas7bdat";
run;

options ls=80;
proc means data=ds fw=4
  min q1 median q3 max mean std nmiss n;
  class substance;
  var cesd;
run;
                     Analysis Variable : CESD

              N           Lower             Upper               Std
 SUBSTANCE  Obs   Min  Quartile  Median  Quartile   Max  Mean   Dev
 ------------------------------------------------------------------
 alcohol    177  4.00      26.0    36.0      42.0  58.0  34.4  12.1
 cocaine    152  1.00      19.0    30.0      39.0  60.0  29.4  13.4
 heroin     124  4.00      28.0    35.0      43.0  56.0  34.9  11.2
 ------------------------------------------------------------------

                                 N     N
                    SUBSTANCE  Obs  Miss      N
                    ---------------------------
                    alcohol    177     0    177
                    cocaine    152     0    152
                    heroin     124     0    124
                    ---------------------------

After reading the data in, the meansprocedure can produce any of the desired statistics (plus may others) directly. To replicate the mosaic package in printing a single statistic, list only that statistic in the proc means statement. Note that the overall statistic in the R table is not included. To replicate that row, you would re-run the above code, omitting the class statement.

To the best of our knowledge, there still does not exist an easy way to plot multiple densities in a single SAS plot. In example 2.6.4 we show how it can be done using proc kde, saving the density estimates, and plotting separately. (Code for this is included at the book web site.) But in the interest of simple code, we show a simpler method using proc sgpanel. The result, show below, is less useful than the R plot from the the mosaic package, but still gets the point across.
proc sgpanel data=ds;
  panelby substance / columns=1;
  density cesd / type=kernel;
run;

Tuesday, October 4, 2011

Example 9.8: New stuff in SAS 9.3-- Bayesian random effects models in Proc MCMC




Rounding off our reports on major new developments in SAS 9.3, today we'll talk about proc mcmc and the random statement.

Stand-alone packages for fitting very general Bayesian models using Markov chain Monte Carlo (MCMC) methods have been available for quite some time now. The best known of these are BUGS and its derivatives WinBUGS (last updated in 2007) and OpenBUGS . There are also some packages available that call these tools from R.

Today we'll consider a relatively simple model: Clustered Poisson data where cluster means are a constant plus a cluster-specific exponentially-distributed random effect. To be clear:
y_ij ~ Poisson(mu_i)
log(mu_i) = B_0 + r_i
r_i ~ Exponential(lambda)
Of course in Bayesian thinking all effects are random-- here we use the term in the sense of cluster-specific effects.

SAS
Several SAS procedures have a bayes statement that allow some specific models to be fit. For example, in Section 6.6 and example 8.17, we show Bayesian Poisson and logistic regression, respectively, using proc genmod. But our example today is a little unusual, and we could not find a canned procedure for it. For these more general problems, SAS has proc mcmc, which in SAS 9.3 allows random effects to be easily modeled.

We begin by generating the data, and fitting the naive (unclustered) model. We set B_0 = 1 and lambda = 0.4. There are 200 clusters of 10 observations each, which we might imagine represent 10 students from each of 200 classrooms.

data test2;
truebeta0 = 1;
randscale = .4;
call streaminit(1944);
do i = 1 to 200;
randint = rand("EXPONENTIAL") * randscale;
do ni = 1 to 10;
mu = exp(truebeta0 + randint);
y = rand("POISSON", mu);
output;
end;
end;
run;

proc genmod data = test2;
model y = / dist=poisson;
run;

Standard Wald 95%
Parameter Estimate Error Confidence Limits

Intercept 1.4983 0.0106 1.4776 1.5190

Note the inelegant SAS syntax for fitting an intercept-only model. The result is pretty awful-- 50% bias with respect to the global mean. Perhaps we'll do better by acknowledging the clustering. We might try that with normally distributed random effects in proc glimmix.

proc glimmix data = test2 method=laplace;
class i;
model y = / dist = poisson solution;
random int / subject = i type = un;
run;

Cov Standard
Parm Subject Estimate Error
UN(1,1) i 0.1682 0.01841

Standard
Effect Estimate Error t Value Pr > |t|
Intercept 1.3805 0.03124 44.20 <.0001

No joy-- still a 40% bias in the estimated mean. And the variance of the random effects is biased by more than 50%! Let's try fitting the model that generated the data.

proc mcmc data=test2 nmc=10000 thin=10 seed=2011;
parms fixedint 1 gscale 0.4;

prior fixedint ~ normal(0, var = 10000);
prior gscale ~ igamma(.01 , scale = .01 ) ;

random rint ~ gamma(shape=1, scale=gscale) subject = i initial=0.0001;
mu = exp(fixedint + rint);
model y ~ poisson(mu);
run;

The key points of the proc mcmc statement are nmc, the total number of Monte Carlo iterations to perform, and thin, which includes only every nth sample for inference. The prior and model statements are fairly obvious; we note that in more complex models, parameters that are listed within a single prior statement are sampled as a block. We're placing priors on the fixed (shared) intercept and the scale of the exponential. The mu line is actually just a programming statement-- it uses the same syntax as data step programming.
The newly available statement is random. The syntax here is similar to those for the other priors, with the addition of the subject option, which generates a unique parameter for each level of the subject variable. The random effects themselves can be used in later statements, as shown, to enter into data distributions. A final note here is that the exponential distribution isn't explicitly available, but since the gamma distribution with shape fixed at 1 defines the exponential, this is not a problem. Here are the key results.

Posterior Summaries

Standard
Parameter N Mean Deviation
fixedint 1000 1.0346 0.0244
gscale 1000 0.3541 0.0314

Posterior Intervals

Parameter Alpha HPD Interval
fixedint 0.050 0.9834 1.0791
gscale 0.050 0.2937 0.4163

The 95% HPD regions include the true values of the parameters and the posterior means are much less biased than in the model assuming normal random effects.

As usual, MCMC models should be evaluated carefully for convergence and coverage. In this example, I have some concerns (see default diagnostic figure above) and if it were real data I would want to do more.

R
The CRAN task view on Bayesian Inference includes a summary of tools for general and model-specific MCMC tools. However, there is nothing like proc mcmc in terms of being a general and easy to use tool that is native to R. The nearest options are to use R front ends to WinBUGS/OpenBUGS (R2WinBUGS) or JAGS (rjags). (A brief worked example of using rjags was posted last year by John Myles White.) Alternatively, with some math and a little sweat, the mcmc package would also work. We'll explore an approach through one or more of these packages in a later entry, and would welcome a collaboration from anyone who would like to take that on.