Monday, November 17, 2014

Example 2014.13: Statistics doesn't have to be so hard! Resampling in R and SAS

A recent post pointed us to a great talk that elegantly described how inferences from a trial could be analyzed with a purely resampling-based approach. The talk uses data from a paper that considered the association between beer consumption and mosquito attraction. We recommend the talk linked above for those thinking about creative ways to teach inference.

In this entry, we demonstrate how straightforward it is to replicate these analyses in R, and show how they can be done in SAS.

R

We'll repeat the exercise twice in R: first using the mosaic package that Nick and colleagues have been developing to help teach statistics, and then in base R.

For mosaic, we begin by entering the data and creating a dataframe. The do() operator and the shuffle() function facilitate carrying out a permutation test (see also section 5.4.5 of the book). 

beer = c(27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24, 28, 24, 29, 21, 21, 18, 27, 20)
water = c(21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20)

ds = data.frame(y = c(beer, water), 
                x = c(rep("beer", length(beer)), rep("water", length(water))))
require(mosaic)
obsdiff = compareMean(y ~ x, data=ds)
nulldist = do(999)*compareMean(y ~ shuffle(x), data=ds)
histogram(~ result, xlab="permutation differences", data=nulldist)
ladd(panel.abline(v=obsdiff, col="red", lwd=2))

> obsdiff
[1] -4.377778
> tally(~ abs(result) > abs(obsdiff), format="percent", data=nulldist)

 TRUE FALSE 
  0.1  99.9 

The do() operator evaluates the expression on the right hand side a specified number of times.  In this case we shuffle (or permute) the group indicators. 

We observe a mean difference of 4.4 attractions (comparing the beer to water groups). The histogram of the results-- plotted with the lattice graphics package that mosaic loads by default-- demonstrates that this result would be highly unlikely to occur by chance: if the null hypothesis that the groups were equal was true, results more extreme than this would happen only 1 time out of 1000. This can be displayed using the tally() function, which adds some functionality to table(). We can calculate the p-value by including the observed statistic in the numerator and the denominator = (1+1)/(999 + 1) = .002.

For those not invested in the mosaic package,  base R functions can be used to perform this analysis . We present a version here that begins after making the data vectors.
alldata = c(beer, water)
labels = c(rep("beer", length(beer)), rep("water", length(water)))
obsdiff = mean(alldata[labels=="beer"]) - mean(alldata[labels=="water"])

> obsdiff
[1] -4.377778

The sample() function re-orders the labels, effectively implementing the supposition that the number of bites might have happened under either the water or the beer drinking regimen.
resample_labels = sample(labels)
resample_diff = mean(alldata[resample_labels=="beer"]) - 
                mean(alldata[resample_labels=="water"])

resample_diff
[1] 1.033333

In a teaching setting, the preceding code could be re-run several times, to mimic the presentation seen in the video linked above. To repeat many times, the most suitable base R tool is replicate(). To use it, we make a function of the resampling procedure shown above.
resamp_means = function(data, labs){
  resample_labels = sample(labs)
  resample_diff = mean(data[resample_labels=="beer"]) - 
    mean(data[resample_labels=="water"])
  return(resample_diff)
}

nulldist = replicate(9999,resamp_means(alldata,labels))

hist(nulldist, col="cyan")
abline(v = obsdiff, col = "red")

The histogram is shown above. The p-value is obtained by counting the proportion of statistics (including the actual observed difference) among greater than or equal to the observed statistic:
alldiffs = c(obsdiff,nulldist)
p = sum(abs(alldiffs >= obsdiff)/ 10000)


SAS

The SAS code is relatively cumbersome in comparison. We begin by reading the data in, using the "line hold" double-ampersand and the infile datalines statement that allows us to specify a delimiter (other than a space) when reading data in directly in a data step. This let us copy the data from the R code. To identify the water and beer regimen subjects, we use the _n_ implied variable that SAS creates but does not save with the data.

The summary procedure generates the mean for each group and saves the results in a data set with a row for each group; the transpose procedure makes a data set with a single row and a variable for each group mean. Finally, we calculate the observed difference and use call symput to make it into a macro variable for later use.
data bites;;
if _n_ le 18 then drink="water";
  else drink="beer";
infile datalines delimiter=',';
input bites @@;
datalines;
21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20
27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24
28, 24, 29, 21, 21, 18, 27, 20
;
run;

proc summary nway data = bites;
class drink;
var bites;
output out=obsmeans mean=mean;
run;

proc transpose data = obsmeans out=om2;
var mean;
id drink;
run;

data om3; 
set om2;
obsdiff = beer-water;
call symput('obsdiff',obsdiff);
run;

proc print data = om3; var obsdiff; run;

            Obs    obsdiff
             1     4.37778
(Yes, we could have done this with proc ttest and ODS. But the spirit of the video is that we don't understand t-tests, so we want to avoid them.)

To rerandomize, we can assign a random number to each row, sort on the random number, and assign drink labels based on the new order of the data.
data rerand;
set bites;
randorder = uniform(0);
run;

proc sort data = rerand; by randorder; run;

data rerand2;
set rerand;
if _n_ le 18 then redrink = "water";
  else redrink = "beer";
run;

proc summary nway data = rerand2;
class redrink;
var bites;
output out=rerand3 mean=mean;
run;

proc transpose data = rerand3 out=rerand4;
var mean;
id redrink;
run;

data rrdiff; 
set rerand4;
rrdiff = beer-water;
run;

proc print data = rrdiff; var rrdiff; run;

            Obs     rrdiff
             1     -1.73778
One way to repeat this a bunch of times would be to make a macro out of the above and collect the resulting rrdiff into a data set. Instead, we use the surveyselect procedure to do this much more efficiently. The groups option sample groups of 18 and 25 from the data, while the reps option requests this be done 9,999 times. We can then use the summary and transpose procedures as before, with the addition of the by replicate statement to make a data set with columns for each group mean and a row for each replicate.
proc surveyselect data = bites groups=(18,25) reps = 9999 out = ssresamp; run;

proc summary nway data = ssresamp;
by replicate;
class groupid;
var bites;
output out=ssresamp2 mean=mean;
run;

proc transpose data = ssresamp2 out=ssresamp3 prefix=group;
by replicate;
var mean;
id groupid;
run;
To get a p-value and make a histogram, we use the macro variable created earlier.
data ssresamp4;
set ssresamp3;
diff = group2 - group1;
exceeds = abs(diff) ge &obsdiff;
run;

proc means data = ssresamp4 sum; var exceeds; run;

             The MEANS Procedure
          Analysis Variable : exceeds
                          Sum
                    9.0000000

proc sgplot data = ssresamp4;
histogram diff;
refline &obsdiff /axis=x lineattrs=(thickness=2 color=red);
run;
The p-value is 0.001 (= (9+1)/10000).


An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. With exceptions noted above, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

3 comments:

schnee said...

@jrauser sure left a mark.

Here's how I did much the same...

http://rpubs.com/schnee/34617

Rick Wicklin said...

For a less cumbersome approach in SAS, see http://blogs.sas.com/content/iml/2014/11/21/resampling-in-sas/

Ken Kleinman said...

Thanks, Rick. We're not IML coders, so this route wasn't open to us.