SAS and R: Example 9.9: Simplifying R using the mosaic package (part 1)

Thursday, October 13, 2011

Example 9.9: Simplifying R using the mosaic package (part 1)

While both SAS and R are powerful systems for statistical analysis, they can be frustrating to new users or those learning statistics for the first time.

R
The mosaic package is designed to help simplify the interface for such new users, while allowing them to undertake sophisticated analyses.

As an example of how the package simplifies life for the novice user, consider calculating summary statistics and displaying a densityplot for the CESD (measure of depressive symptom) scores by substance abuse group in the HELP dataset. Doing this in R without the package would require mastering a package such as plyr to replicate results by substance or a typing-intensive use of syntax to select rows corresponding to each substance.

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
library(mosaic)
options(digits=3)

After loading the data and the package, and setting the number of digits to a more reasonable default, we can call the mean() function to easily calculate this statistic (denoted by S in the result) for each of the three substance abuse groups alcohol, cocaine or heroin.

> mean(cesd ~ substance, data=ds)
  substance    S   N Missing
1   alcohol 34.4 177       0
2   cocaine 29.4 152       0
3    heroin 34.9 124       0

Similar results are seen when we calculate the standard deviations per group:

> sd(cesd ~ substance, data=ds)
  substance    S   N Missing
1   alcohol 12.1 177       0
2   cocaine 13.4 152       0
3    heroin 11.2 124       0

Another function can calculate a raft of summary statistics for each group that are nicely formatted.

UPDATE FROM 12/14/2013: favstats(cesd ~ substance, data=ds) is an even better way to do this...

> summary(cesd ~ substance, data=ds, fun=favstats)
cesd    N=453
+---------+-------+---+----+---+-------+---+----+-----+----+---+--------+
|         |       |N  |min |Q1 |median |Q3 |max |mean |sd  |n  |missing |
+---------+-------+---+----+---+-------+---+----+-----+----+---+--------+
|substance|alcohol|177|4   |26 |36     |42 |58  |34.4 |12.1|177|0       |
|         |cocaine|152|1   |19 |30     |39 |60  |29.4 |13.4|152|0       |
|         |heroin |124|4   |28 |35     |43 |56  |34.9 |11.2|124|0       |
+---------+-------+---+----+---+-------+---+----+-----+----+---+--------+
|Overall  |       |453|1   |25 |34     |41 |60  |32.8 |12.5|453|0       |
+---------+-------+---+----+---+-------+---+----+-----+----+---+--------+

These commands allow quick review of the data to ensure, for example, that assumptions of equal variance are justified, or that coding errors or missing values haven't crept in.

A graphical depiction using a set of densityplots (shown above) can be created using the command:

densityplot(~ cesd, group=substance, data=ds, auto.key=TRUE)

SAS
We're unaware of any similar program that attempts to simplify SAS syntax for educational use. To replicate the above results, we would use the means and sgpanel procedures.

data ds;
set "C:\book\help.sas7bdat";
run;

options ls=80;
proc means data=ds fw=4
  min q1 median q3 max mean std nmiss n;
  class substance;
  var cesd;
run;
                     Analysis Variable : CESD

              N           Lower             Upper               Std
 SUBSTANCE  Obs   Min  Quartile  Median  Quartile   Max  Mean   Dev
 ------------------------------------------------------------------
 alcohol    177  4.00      26.0    36.0      42.0  58.0  34.4  12.1
 cocaine    152  1.00      19.0    30.0      39.0  60.0  29.4  13.4
 heroin     124  4.00      28.0    35.0      43.0  56.0  34.9  11.2
 ------------------------------------------------------------------

                                 N     N
                    SUBSTANCE  Obs  Miss      N
                    ---------------------------
                    alcohol    177     0    177
                    cocaine    152     0    152
                    heroin     124     0    124
                    ---------------------------

After reading the data in, the meansprocedure can produce any of the desired statistics (plus may others) directly. To replicate the mosaic package in printing a single statistic, list only that statistic in the proc means statement. Note that the overall statistic in the R table is not included. To replicate that row, you would re-run the above code, omitting the class statement.

To the best of our knowledge, there still does not exist an easy way to plot multiple densities in a single SAS plot. In example 2.6.4 we show how it can be done using proc kde, saving the density estimates, and plotting separately. (Code for this is included at the book web site.) But in the interest of simple code, we show a simpler method using proc sgpanel. The result, show below, is less useful than the R plot from the the mosaic package, but still gets the point across.

proc sgpanel data=ds;
  panelby substance / columns=1;
  density cesd / type=kernel;
run;

5 comments:

Ken said...: If you want to do tables of means etc in SAS, the best way is PROC TABULATE. It is amazingly powerful, see http://www2.sas.com/proceedings/sugi27/p060-27.pdf (or search for PROC TABULATE will find lots of tutorials) and quite easy to use. For some reason it isn't discussed in most books on SAS for statisticians. One minor point, I always use MISSING option in the PROC TABULATE statement, otherwise it omits any observation with a missing in the CLASS statement.; October 14, 2011 at 3:55 AM
Ken Kleinman said...: Thanks, Ken. I usually think of proc tabulate as being necessary for reports, but not for getting data. And since I rarely produce reports, I don't use it.

In any case, proc tabulate is definitely not in my toolbox. It would be great if you can show the code needed to replicate the tables shown above! The data live at the book web site linked in the R section above.; October 14, 2011 at 8:46 AM
Ken said...: Have added it to my things to do. Maybe I need a graduate student.; October 25, 2011 at 6:33 PM
Sarah Anoke said...: The code would be as follows:

proc tabulate missing;
class substance;
var cesd;
table substance, cesd*(min q1 median q3 max mean std nmiss n);
run;; October 26, 2011 at 9:08 PM
Ken Kleinman said...: Thanks, Sarah! Ken also sent code-- we'll be posting it later this week.; October 26, 2011 at 10:17 PM

Post a Comment

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

SAS and R

Catalogs of posts

Thursday, October 13, 2011

Example 9.9: Simplifying R using the mosaic package (part 1)

5 comments:

About SAS and R

Topics discussed