Monday, January 10, 2011

Example 8.20: Referencing lists of variables, part 2

In Example 8.19, we discussed how to refer to a group of variables with sequential names, such as varname1, varname2, varname3. This is trivial in SAS and can be done in R as we showed.

It's also sometimes useful to refer to all variables which begin with a common character string. For example, in the HELP data set, there are the variables cesd, cesd1, cesd2, cesd3 and cesd4.

SAS
In SAS, this can be done with the : operator, which functions much like the * wildcard available in many operating systems. Here, cesd: matches every variable whose name begins with cesd.

proc means data="c:\book\help.sas7bdat" mean;
var cesd:;
run;


Variable            Mean
------------------------
CESD1         22.7154472
CESD2         23.5837321
CESD3         22.0685484
CESD4         20.1428571
CESD          32.8476821
------------------------


R
R has no direct equivalent of the : operator for variable lists. But, as with the sequentially named variable problem, you can use R's string functions to replicate the effect.

In this case, we use the names() function (section 1.3.4) to get a list of the variables in the data set, then search for names whose beginnings match the desired string using the substr() function (section 1.4.3). Note that the substr() == comparison returns a vector of logicals, one per variable, rather than the variable names themselves; that logical vector is what selects the columns.

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
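# colMeans() averages each selected column; in current versions of R, mean() no longer accepts a data frame
colMeans(ds[, substr(names(ds), 1, 4) == "cesd"], na.rm=TRUE)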

   cesd1    cesd2    cesd3    cesd4     cesd
22.71545 23.58373 22.06855 20.14286 32.84768
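
To make the intermediate steps concrete, here is a minimal sketch on a small made-up data frame (toy and its values are hypothetical stand-ins, not the HELP data). It shows the names, the logical vector produced by the comparison, and the column selection. It also assumes R 3.3.0 or later for startsWith(), which gives the same logical vector without counting characters.

```r
# Toy data frame standing in for the HELP data (values are made up)
toy <- data.frame(cesd = c(30, 40), cesd1 = c(20, NA), age = c(35, 52))

# Step 1: the variable names
names(toy)

# Step 2: a logical vector marking the names that begin with "cesd"
keep <- substr(names(toy), 1, 4) == "cesd"
keep

# Step 3: use the logical vector to pick columns, then average each one
colMeans(toy[, keep], na.rm = TRUE)

# startsWith() (base R 3.3.0 and later) builds the same logical vector
identical(keep, startsWith(names(toy), "cesd"))
```

Only the first two columns are kept here, since age does not begin with cesd.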

The typing required for the previous statement is rather involved, and requires counting characters. You may want to make a function to do this instead.

The function will accept a data frame as input and return the data frame with just the desired variables. It looks much like the direct version displayed above, but uses the substitute() function to capture the "varname" parameter unevaluated, as a name rather than as an object, and deparse() to turn that name into text. We store those characters in the object vname.

matchin = function(dsname, varname) {
# capture the unquoted argument and convert it to a character string
vname = deparse(substitute(varname))
# keep the columns whose names begin with that string
return(dsname[substr(names(dsname),1,nchar(vname)) == vname])
}

Now we can just type

colMeans(matchin(ds, cesd), na.rm=TRUE)

with results identical to those displayed above.
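
If you would rather pass the prefix as an ordinary quoted string, for example because it is stored in a variable, a variant along the following lines may be more convenient. This is only a sketch: matchin2, prefix and the toy data frame are hypothetical names introduced here, and startsWith() assumes R 3.3.0 or later.

```r
# Hypothetical variant of matchin(): the prefix is passed as a quoted string
matchin2 <- function(dsname, prefix) {
  # TRUE for each column name that begins with the prefix
  keep <- startsWith(names(dsname), prefix)
  # drop = FALSE keeps a data frame even when only one column matches
  dsname[, keep, drop = FALSE]
}

# Made-up data for illustration only
toy <- data.frame(cesd = c(30, 40), cesd1 = c(20, NA), age = c(35, 52))

prefix <- "cesd"
names(matchin2(toy, prefix))
colMeans(matchin2(toy, prefix), na.rm = TRUE)
```

The trade-off is a pair of quotation marks at the call site in exchange for a function that behaves the same whether the prefix is typed directly or held in a variable. With the substitute() version, a variable passed as the second argument would be matched by its own name rather than by its contents.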
