Tuesday, September 7, 2010

Example 8.4: Including subsetting conditions in output

A number of analyses perform operations on subsets. Making it clear what observations have been excluded or included is helpful to include in the output.

SAS

The where statement (section A.6.3) is a powerful and useful tool for subsetting on the fly. (Other options for subsetting typically require data steps or data set options in the proc statement.) But one problem with subsetting in SAS is that the output doesn't show the conditioning. This is a little macro to correct this problem:

%macro where(wherest, titleline);
where &wherest;
title&titleline "where %str(&wherest)";
%mend where;

If you use %where(condition, n); in place of where condition; the condition is both implemented and will appear in the output as the nth title line. This takes advantage of the fact that the title statements just need to appear before the procedure is run, as opposed to before the proc statement.

For example:

title "HELP data set";
proc means data="c:\book\help.sas7bdat";
%where (female eq 1,2);
var age;
run;

Produces the output:

HELP data set
where female eq 1

The MEANS Procedure

Analysis Variable : AGE

Mean
------------
36.2523364
------------

Note that you need to use the testing forms such as eq and not the assignment = in your where statements for this to work.

This tip is cross-posted on my list of SAS tricks.

R

Within R, many functions (e.g. lm()) provide support for a subset option (as in example 3.7.5). Indexing can be used to restrict analyses (B.4.2). In addition, the subset() function can be embedded in a function call to make explicit what subset of observations will be included in the analysis. We demonstrate calculating the mean of female subjects all three ways below.

> ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
> coef(lm(age ~ 1, subset=female==1, data=ds))
(Intercept)
36.25234
> mean(ds$age[ds$female==1])
[1] 36.25234
> mean(subset(ds,female==1,age))
age
36.25234


In interactive R the code and output lie together and it is usually trivial to trace the subsetting used to produce output. But in literate programming settings, or when output may be cut-and-pasted from R to a document, it may be valuable to mirror the SAS macro presented above.

Here is a function to display the subsetting condition for simple functions. It requires some sophistication to turn expressions and variables into objects (and vice-versa). The use of deparse(substitute()) allows us to create a character variable with the name of the subset condition and function, while eval() is used to pass the name of the variable to the subset() function. A similar functionality for more complex functions (such as those requiring a formula, would require some tweaking; a generic explicit subsetting function could be difficult.

verbose.subset = function(ds, condition, variable, FUN=mean) {
cat(paste("subsetting with condition: ",
deparse(substitute(condition)),
"\n", sep=""))
cat(paste("calling the function: ",
deparse(substitute(FUN)),
"\n", sep=""))
FUN(subset(ds, condition, select=eval(variable)))
}

This yields the following output:

> verbose.subset(ds, female==1, "age")
subsetting with condition: female == 1
calling the function: mean
age
36.25234

No comments: