SAS and R: teaching statistics

Showing posts with label teaching statistics. Show all posts

Tuesday, August 14, 2012

The Statistical Sleuth (second edition) in R

For those of you who teach, or are interested in seeing an illustrated series of analyses, there is a new compendium of files to help describe how to fit models for the extended case studies in the Second Edition of the Statistical Sleuth: A Course in Methods of Data Analysis (2002), the excellent text by Fred Ramsey and Dan Schafer. If you are using this book, or would like to see straightforward ways to undertake analyses in R for intro and intermediate statistics courses, these may be of interest.

These files can be found here. The site includes both formatted pdf files as well as the original knitr files which were used to generate the output. Knitr is an elegant, flexible and fast means to undertake reproducible analysis and dynamic report generation within R and RStudio.

This work leverages efforts undertaken by Project MOSAIC, an NSF-funded initiative to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the mosaic package, which was written to simplify the use of R for introductory statistics courses.

An unrelated note about aggregators:We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Monday, July 18, 2011

Example 9.3: augmented display of contingency table

SAS and R often provide different levels of details from output. This is particularly true for the descriptive analysis of contingency tables, where SAS makes it easy to display tables with additional quantities (such as the observed cell count).

The mosaic package has added functionality to calculate these quantities in R. We demonstrate using an example from the HELP dataset.

R


ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
library(mosaic)
ds$gender = ifelse(ds$female==1, "female", "male")
ds$homeless = ifelse(ds$homeless==1, "homeless", "housed")
tab = xtabs(~ gender + homeless, data=ds)
> tab
        homeless
gender   homeless housed
  female       40     67
  male        169    177
> xchisq.test(tab)

 Pearson's Chi-squared test with Yates' continuity correction

data:  tab 
X-squared = 3.8708, df = 1, p-value = 0.04913

  40.00    67.00 
( 49.37) ( 57.63)
 [1.78]   [1.52] 
<-1.33>  < 1.23> 
   
 169.00   177.00 
(159.63) (186.37)
 [0.55]   [0.47] 
< 0.74>  <-0.69> 
   
key:
 observed
 (expected)
 [contribution to X-squared]

We see that there is a borderline statistically significant association between gender and homeless status in the HELP study. We interpret that we see fewer than expected females who are homeless, and more males who are homeless.

Another idea is to use graphical depictions of the association in this table. One approach is a mosaic plot (note: no relation to Project MOSAIC and the mosaic package). A mosaic plot starts as a square with area equal to one. It is divided into columns based on the prevalence in each of the values for the column variable (in this case, gender). Then each bar is divided vertically based on the conditional probability of the other variable within that category.

Another graphical display of a table is the association plot. In an association plot, there is also a box for each cell of the table. The area of the box is proportional to the difference between the observed and expected (assuming no association) frequencies. In a typical presentation, excess observed counts are black and above the line, while deficient counts are red and below the line.

Above, we show the mosaic plot (on the left) and association plot (on the right). Both of these displays demonstrate that there is an association. The mosaic plot indicates that only about a quarter of the sample is female (indicated by the width of the columns), and that homelessness is present in about half the subjects (area shaded in light grey). The slight association demonstrated is that there are fewer homeless women than expected (since the horizontal line moves down between the first and second column). Similarly, for the association plot we note that the expected cell count is less than the observed (indicated in red with values below the line) for the female homeless group.


par(mfrow=c(1,2))
mosaicplot(tab, color=TRUE, main="mosaic plot")
assocplot(tab)
title("association plot")

SAS
As in Example 8.32, we find SAS macros for mosaic plots among the contributions of Michael Friendly. In this complex case, they are somewhat more difficult to access than others. The code for the plots themselves can be downloaded here, while it's useful to also run a wrapper macro. After downloading the files, the following code can be used to make the figure below.


title 'Install mosaic modules';
* location of the zipped files;
filename mosaic  'c:\ken\sasmacros\mosaics';
* storage location of compiled macros;
libname  mosaic   'c:\ken\sasmacros\mosaics';

* Code to read in, compile and store the macros;
proc iml ;
   reset storage=mosaic.mosaic;
   %include mosaic(mosaics) ;
   store module=_all_;
   show storage;
quit;

* Prep: create the table, save the cell counts;
proc freq data = "c:\book\help.sas7bdat";
tables homeless * female / out=outhelp;
run;

* Read in the wrapper macro;
%include "c:\ken\sasmacros\mosaics\mosaic.sas";

* Make the plot;
%mosaic(data=outhelp,var = female homeless, 
        sort=homeless descending female, space = 1 1);

The sort and space options make the results more similar to those shown for mosaicplot(). In this version, the colors reflect the signs of the residuals.

Tuesday, October 19, 2010

Example 8.10: Combination dotplot/boxplot (teaching graphic in honor of World Statistics Day)

In honor of World Statistics Day and the read paper that my co-authors Chris Wild, Maxine Pfannkuch, Matt Regan, and I are presenting at the Royal Statistical Society today, we present the R code to generate a combination dotplot/boxplot that is useful for students first learning statistics. One of the over-riding themes of the paper is that introductory students (be they in upper elementary or early university) should keep their eyes on the data.

When describing distributions, students often are drawn to the most visually obvious aspects: median, quartiles and extremes. These are the ingredients of the basic boxplot, which is often introduced as a graphical display in introductory courses. Students are taught to calculate the quartiles, and this becomes one of the components of a boxplot.

One limitation of the boxplot is that it loses the individual data points. Given a dotplot (see here for an example of one in Fathom), it is very easy to guesstimate and draw a boxplot by hand. Wild et al. argue that doing so is probably the best way of gaining an appreciation of just what a boxplot actually is. The boxplot provides a natural bridge between operating entirely in terms of what is seen in graphics to reasoning using summary statistics. Retaining the dots in the combination dotplot/boxplot provides a reminder that the box plot is just summarizing the raw data, thus preserving a connection to more concrete foundations.

Certainly, such a plot has a number of limitations. It breaks down when there are a large number of observations, and has redundancy not suitable for publication. But as a way to motivate informal inference, it has potential value. It's particularly useful when combined in an animation using multiple samples from a known population, as demonstrated here (scroll through the file quickly).

In addition, the code, drafted by Chris and his colleagues Steve Taylor and Dineika Chandrananda at the University of Auckland, demonstrates a number of useful and interesting techniques.

R

Because of the length of the code, we'll focus just on the boxpoints3() function shown below. The remaining support code must be run first (and can be found at http://www.math.smith.edu/sasr/examples/wild-helper.R). The original boxpoints() function has a large number of options and configuration settings, which the function below sets at the default values.


# Create a plot from two variables, one continuous and one categorical.
# For each level of the categorical variable (grps) a stacked dot plot
# and a boxplot summary are created.  Derived from code written by 
# Christopher Wild, Dineika Chandrananda and Steve Taylor
#
boxpoints3 = function(x,grps,varnames1,varnames2,labeltext)
{
  observed = (1:length(x))[!is.na(x) & !is.na(grps)]
  x = x[observed]
  grps = as.factor(grps[observed])
  ngrps = length(levels(grps))
  
  # begin section to align titles and labels
  xlims = range(x) + c(-0.2,0.1)*(max(x)-min(x))
  top = 1.1
  bottom = -0.2
  plot(xlims, c(bottom,top), type="n", xlab="", ylab="", axes=F)
  yvals = ((1:ngrps)-0.7)/ngrps
  text(mean(xlims), top, paste(varnames1, "by", varnames2, sep=" "), 
     cex=1, font=2)
  text(xlims[1],top-0.05,varnames2, cex=1, adj=0)
  addaxis(xlims, ylev=0, tickheight=0.03, textdispl=0.07, nticks=5,
     axiscex=1)
  text(mean(xlims), -0.2, varnames1, adj=0.5, cex=1)
  # end section for titles and labels

  for (i in 1:ngrps) {
    xi = x[grps==levels(grps)[i]]
    text(xlims[1], yvals[i]+0.2/ngrps,
       substr(as.character(levels(grps)[i]), 1, 12), adj=0, cex=1)
    prettyrange = range(pretty(xi))
    if (min(diff(sort(unique(xi)))) >= diff(prettyrange)/75)
      xbins = xi  # They are sufficiently well spaced.
    else {
      xbins = round(75 * (xi-prettyrange[1])/diff(prettyrange))
    }
    addpoints(xi, yval=yvals[i], vmax=0.62/ngrps, vadd=0.075/ngrps,
      xbin=xbins, ptcex=1, ptcol='grey50', ptlwd=2)
    bxplt = fivenum(xi)
    if(length(xi) != 0)
      addbox(x5=bxplt, yval=yvals[i], hbxwdth=0.2/ngrps,
             boxcol="black", medcol="black", whiskercol="black", 
             boxfill=NA, boxlwidth=2)
  }
}

Most of the function tends to issues of housekeeping, in particular aligning titles and labels. Once this is done, the support functions addpoints() and addbox() functions are called with the appropriate arguments. We can call the function using data from female subjects at baseline of the HELP study, comparing PCS (physical component scores) for homeless and non-homeless subjects.


source("http://www.math.smith.edu/sasr/examples/wild-helper.R")
ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
female = subset(ds, female==1)
with(female,boxpoints3(pcs, homeless, "PCS", "Homeless"))

We see that the homeless subjects have lower PCS scores than the non-homeless (though the homeless group also has the highest score of either group in this sample).

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

SAS and R

Catalogs of posts