SAS and R: subsetting

Showing posts with label subsetting. Show all posts

Thursday, January 5, 2012

Example 9.18: Constructing the fastest relay team via enumeration

In competitive swimming, the medley relay is a team event in which four different swimmers each swim one of the four strokes: freestyle, breaststroke, backstroke, and butterfly. In general, every swimmer might be able swim any given stroke. How can we choose the fastest relay team? Here we solve this by enumerating all possible teams, though a more efficient routine likely exists.

Some example practice times can be seen on this Google Spreadsheet. Also, using the steps outlined here, the same spreadsheet is available as a CSV file here. (FTR, these are actual practice times for 100 yards for mostly 12-13 year-old swimmers; the names have been changed.)

SAS
We first read the data from the URL, using the technique outlined in section 1.1.6. Note that if you cut-and-paste this, you'll need to get the whole URL onto one line-- we break it up here for display only.


filename swimurl url 'https://docs.google.com/spreadsheet
/pub?key=0AvJKgZUzMYLYdE5xTHlEWkNUM3NoOHB1ZTJoTFMzUUE&
single=true&gid=0&output=csv';

proc import datafile=swimurl out=swim dbms=csv;
run;

Next, we use the point= option in nested set statements to generate a single data set with all the possible combinations of names and times. Meanwhile we change the names of the variables so they don't get overwritten in the next set statement. Note the use of the nobs option to find the number of rows in the data set.


data permute 
    (keep=free freetime fly flytime back backtime  breast breasttime);
set swim (rename = (swimmer=free freestyle=freetime)) nobs=nobs;
do i = 1 to nobs;
  set swim(rename = (swimmer=fly butterfly=flytime)) point=i;
  do j = 1 to nobs;
    set swim(rename = (swimmer=back backstroke=backtime)) point=j;
    do k = 1 to nobs;
      set swim(rename = (swimmer=breast breaststroke=breasttime)) 
        point = k;
      output;
 end;
  end;
end;
run;

The resulting data set has 12^4 rows, and includes rosters with the same swimmer swimming all four legs. In fact, a quick glance will show that Anna has the best time in each stroke, and thus the best "team" based on these practice times would use her for each stroke. This is against the rules, and also probably isn't reflective of performance in a race. We'll remove illegal line-ups using a where statement (section 1.5.1) and also calculate the total team time.


data prep;
set permute;
where free ne back and free ne breast and free ne fly and 
  back ne breast and back ne fly and breast ne fly;
time = sum(freetime, flytime, backtime, breasttime);
run;

The resulting data set has (12 permute 4) lines. To find the best team, we just sort by the total time and look at the first line. Here the first 10 lines (10 best teams) are shown.


proc sort data=prep; by time; run;
proc print data=prep (obs=10); run;

                                                        b
                                                        r
           f                             b              e
           r              f              a              a
           e              l              c      b       s
           e              y              k      r       t
   f       t              t      b       t      e       t      t
   r       i      f       i      a       i      a       i      i
   e       m      l       m      c       m      s       m      m
   e       e      y       e      k       e      t       e      e

 Kara    109.3  Dora    126.8  Lara    117.7  Anna    126.9  480.7
 Anna    102.8  Dora    126.8  Lara    117.7  Beth    134.6  481.9
 Kara    109.3  Anna    120.5  Lara    117.7  Beth    134.6  482.1
 Anna    102.8  Dora    126.8  Lara    117.7  Honora  136.4  483.7
 Kara    109.3  Jane    129.8  Lara    117.7  Anna    126.9  483.7
 Kara    109.3  Anna    120.5  Lara    117.7  Dora    136.4  483.9
 Kara    109.3  Anna    120.5  Lara    117.7  Honora  136.4  483.9
 Kara    109.3  Lara    123.1  Jane    124.7  Anna    126.9  484.0
 Anna    102.8  Dora    126.8  Lara    117.7  Inez    136.8  484.1
 Carrie  112.7  Dora    126.8  Lara    117.7  Anna    126.9  484.1

The best time shaves a whole second off the predicted time using the second-best team.

R
Since published Google Spreadsheets use https rather than http, we use the RCurl package and its getURL() function. (Note that if you cut-and-paste this, you'll need to get the whole URL onto one line-- we break it up here for display only.) Then we can read the data with read.csv() and textConnection().


library(RCurl)
swim = getURL("https://docs.google.com/spreadsheet
/pub?key=0AvJKgZUzMYLYdE5xTHlEWkNUM3NoOHB1ZTJoTFMzUUE&
single=true&gid=0&output=csv")

swim2=read.csv(textConnection(swim))

To make an object with the combinations of names, we use the expand.grid() function highlighted in Example 7.22, providing as arguments the swimmers names four times. As in the SAS example, the result has has 12^4 rows. The combn() function might be a better fit here, but was more difficult to use.


test2 = expand.grid(swim2$Swimmer,swim2$Swimmer, swim2$Swimmer, swim2$Swimmer)

It'll be useful to assign these copies of the names to each of the strokes. We'll do that with the rename() function available in the reshape package. (This approach is mentioned in section 1.3.4.). Then we can remove the rows where the same name appears twice using some logic. The logic is nested in the with() function to save some keystrokes and is generally preferable to attach()ing the test2 object.


library(reshape)
test2 = rename(test2, c("Var1" = "free", "Var2" = "fly", 
  "Var3" = "back", "Var4" = "breast"))
test3 = with(test2, test2[(free != breast) & (free != fly) 
  & (free != back) & (breast != fly) & (breast != back) 
  & (fly != back) ,])

Finally, we can use the which.min() function to pick the best team.


> bestteam =   
+ test3[which.min(swim2$Freestyle[test3$free]+swim2$Breaststroke[test3$breast] +
+ swim2$Butterfly[test3$fly] + swim2$Backstroke[test3$back]),]
>bestteam
     free  fly back breast
1631 Kara Dora Lara   Anna

For new users of R, this may look very peculiar-- it uses elegant but powerful features of the R language that may be challenging for new users to grasp. Essentially, in swim2$Freestyle[test3$free] we say: from the "freestyle" times in the swim2 object, take the time from the row that has the name in a row of "free" names in the test3 object. The which.min() function replicates this request for every row in the test3 object (which has all of the permutations) in it, returning the row number with that minimum sum. The outer test3[rows,columns] syntax grabs the values in this row. (The number 1631 is the row number, for some reason showing the row in the test2 object created by expand.grid().)

Now, we might also want the actual times associated with the best team. We can find them by calling the correct rows (names from the best team) and columns (stroke associated with that name) from the original data set.


> times = c(swim2[swim2$Swimmer == bestteam$free,2], 
+      swim2[swim2$Swimmer == bestteam$fly,3], 
+      swim2[swim2$Swimmer == bestteam$back,4], 
+      swim2[swim2$Swimmer == bestteam$breast,5])
> times
[1] 109.3 126.8 117.7 126.9

If instead, one wanted to list the times in order, one approach would be to add columns to the test3 object with the time for each stroke, calculate their sum, and sort on the sum.

Tuesday, September 7, 2010

Example 8.4: Including subsetting conditions in output

A number of analyses perform operations on subsets. Making it clear what observations have been excluded or included is helpful to include in the output.

SAS

The where statement (section A.6.3) is a powerful and useful tool for subsetting on the fly. (Other options for subsetting typically require data steps or data set options in the proc statement.) But one problem with subsetting in SAS is that the output doesn't show the conditioning. This is a little macro to correct this problem:


%macro where(wherest, titleline);
  where &wherest;
  title&titleline "where %str(&wherest)";
%mend where;

If you use %where(condition, n); in place of where condition; the condition is both implemented and will appear in the output as the nth title line. This takes advantage of the fact that the title statements just need to appear before the procedure is run, as opposed to before the proc statement.

For example:


title "HELP data set";
proc means data="c:\book\help.sas7bdat";
  %where (female eq 1,2);
  var age;
run;

Produces the output:


HELP data set      
where female eq 1

The MEANS Procedure

Analysis Variable : AGE

        Mean
------------
  36.2523364
------------

Note that you need to use the testing forms such as eq and not the assignment = in your where statements for this to work.

This tip is cross-posted on my list of SAS tricks.

R

Within R, many functions (e.g. lm()) provide support for a subset option (as in example 3.7.5). Indexing can be used to restrict analyses (B.4.2). In addition, the subset() function can be embedded in a function call to make explicit what subset of observations will be included in the analysis. We demonstrate calculating the mean of female subjects all three ways below.


> ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
> coef(lm(age ~ 1, subset=female==1, data=ds))
(Intercept) 
   36.25234 
> mean(ds$age[ds$female==1])
[1] 36.25234
> mean(subset(ds,female==1,age))
     age 
36.25234

In interactive R the code and output lie together and it is usually trivial to trace the subsetting used to produce output. But in literate programming settings, or when output may be cut-and-pasted from R to a document, it may be valuable to mirror the SAS macro presented above.

Here is a function to display the subsetting condition for simple functions. It requires some sophistication to turn expressions and variables into objects (and vice-versa). The use of deparse(substitute()) allows us to create a character variable with the name of the subset condition and function, while eval() is used to pass the name of the variable to the subset() function. A similar functionality for more complex functions (such as those requiring a formula, would require some tweaking; a generic explicit subsetting function could be difficult.


verbose.subset = function(ds, condition, variable, FUN=mean) {
   cat(paste("subsetting with condition: ", 
      deparse(substitute(condition)),
      "\n", sep=""))
   cat(paste("calling the function: ", 
      deparse(substitute(FUN)),
      "\n", sep=""))
   FUN(subset(ds, condition, select=eval(variable)))
}

This yields the following output:


> verbose.subset(ds, female==1, "age")
subsetting with condition: female == 1
calling the function: mean
     age 
36.25234

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

SAS and R

Catalogs of posts

Thursday, January 5, 2012

Example 9.18: Constructing the fastest relay team via enumeration

Tuesday, September 7, 2010

Example 8.4: Including subsetting conditions in output

About SAS and R

Topics discussed