Thursday, October 13, 2011

Example 9.9: Simplifying R using the mosaic package (part 1)

While both SAS and R are powerful systems for statistical analysis, they can be frustrating to new users or those learning statistics for the first time.

The mosaic package is designed to help simplify the interface for such new users, while allowing them to undertake sophisticated analyses.

As an example of how the package simplifies life for the novice user, consider calculating summary statistics and displaying a densityplot for the CESD (measure of depressive symptom) scores by substance abuse group in the HELP dataset. Doing this in R without the package would require mastering a package such as plyr to replicate results by substance or a typing-intensive use of syntax to select rows corresponding to each substance.

ds = read.csv("")

After loading the data and the package, and setting the number of digits to a more reasonable default, we can call the mean() function to easily calculate this statistic (denoted by S in the result) for each of the three substance abuse groups alcohol, cocaine or heroin.
> mean(cesd ~ substance, data=ds)
  substance    S   N Missing
1   alcohol 34.4 177       0
2   cocaine 29.4 152       0
3    heroin 34.9 124       0

Similar results are seen when we calculate the standard deviations per group:
> sd(cesd ~ substance, data=ds)
  substance    S   N Missing
1   alcohol 12.1 177       0
2   cocaine 13.4 152       0
3    heroin 11.2 124       0

Another function can calculate a raft of summary statistics for each group that are nicely formatted.

UPDATE FROM 12/14/2013: favstats(cesd ~ substance, data=ds)  is an even better way to do this...
> summary(cesd ~ substance, data=ds, fun=favstats)
cesd    N=453
|         |       |N  |min |Q1 |median |Q3 |max |mean |sd  |n  |missing |
|substance|alcohol|177|4   |26 |36     |42 |58  |34.4 |12.1|177|0       |
|         |cocaine|152|1   |19 |30     |39 |60  |29.4 |13.4|152|0       |
|         |heroin |124|4   |28 |35     |43 |56  |34.9 |11.2|124|0       |
|Overall  |       |453|1   |25 |34     |41 |60  |32.8 |12.5|453|0       |

These commands allow quick review of the data to ensure, for example, that assumptions of equal variance are justified, or that coding errors or missing values haven't crept in.

A graphical depiction using a set of densityplots (shown above) can be created using the command:
densityplot(~ cesd, group=substance, data=ds, auto.key=TRUE)

We're unaware of any similar program that attempts to simplify SAS syntax for educational use. To replicate the above results, we would use the means and sgpanel procedures.
data ds;
set "C:\book\help.sas7bdat";

options ls=80;
proc means data=ds fw=4
  min q1 median q3 max mean std nmiss n;
  class substance;
  var cesd;
                     Analysis Variable : CESD

              N           Lower             Upper               Std
 SUBSTANCE  Obs   Min  Quartile  Median  Quartile   Max  Mean   Dev
 alcohol    177  4.00      26.0    36.0      42.0  58.0  34.4  12.1
 cocaine    152  1.00      19.0    30.0      39.0  60.0  29.4  13.4
 heroin     124  4.00      28.0    35.0      43.0  56.0  34.9  11.2

                                 N     N
                    SUBSTANCE  Obs  Miss      N
                    alcohol    177     0    177
                    cocaine    152     0    152
                    heroin     124     0    124

After reading the data in, the meansprocedure can produce any of the desired statistics (plus may others) directly. To replicate the mosaic package in printing a single statistic, list only that statistic in the proc means statement. Note that the overall statistic in the R table is not included. To replicate that row, you would re-run the above code, omitting the class statement.

To the best of our knowledge, there still does not exist an easy way to plot multiple densities in a single SAS plot. In example 2.6.4 we show how it can be done using proc kde, saving the density estimates, and plotting separately. (Code for this is included at the book web site.) But in the interest of simple code, we show a simpler method using proc sgpanel. The result, show below, is less useful than the R plot from the the mosaic package, but still gets the point across.
proc sgpanel data=ds;
  panelby substance / columns=1;
  density cesd / type=kernel;


Ken said...

If you want to do tables of means etc in SAS, the best way is PROC TABULATE. It is amazingly powerful, see (or search for PROC TABULATE will find lots of tutorials) and quite easy to use. For some reason it isn't discussed in most books on SAS for statisticians. One minor point, I always use MISSING option in the PROC TABULATE statement, otherwise it omits any observation with a missing in the CLASS statement.

Ken Kleinman said...

Thanks, Ken. I usually think of proc tabulate as being necessary for reports, but not for getting data. And since I rarely produce reports, I don't use it.

In any case, proc tabulate is definitely not in my toolbox. It would be great if you can show the code needed to replicate the tables shown above! The data live at the book web site linked in the R section above.

Ken said...

Have added it to my things to do. Maybe I need a graduate student.

Sarah Anoke said...

The code would be as follows:

proc tabulate missing;
class substance;
var cesd;
table substance, cesd*(min q1 median q3 max mean std nmiss n);

Ken Kleinman said...

Thanks, Sarah! Ken also sent code-- we'll be posting it later this week.