Showing posts with label set statement options. Show all posts
Showing posts with label set statement options. Show all posts

Monday, May 14, 2012

Example 9.31: Exploring multiple testing procedures

In example 9.30 we explored the effects of adjusting for multiple testing using the Bonferroni and Benjamini-Hochberg (or false discovery rate, FDR) procedures. At the time we claimed that it would probably be inappropriate to extract the adjusted p-values from the FDR method from their context. In this entry we attempt to explain our misgivings about this practice.

The FDR procedure is described in Benjamini and Hochberg (JRSSB, 1995) as a "step-down" procedure. Put simply, the procedure has the following steps:

0. Choose the familywise alpha
1. Rank order the unadjusted p-values
2. Beginning with the Mth of the ordered p-values p(m),
2a. if p(m) < alpha*(m/M), then reject all tests 1 ... m,
2b. if not, m = m-1
3. Repeat steps 2a and 2b until the condition is met
or p(1) > alpha/M

where M is the number of tests. The "adjusted p-value" based on this procedure is the smallest familywise alpha under which the current test would have been rejected. To calculate this, we can modify the routine above:

1. Rank order the unadjusted p-values
2. For ordered p-values p(m) M to 1,
2a. candidate ap(m) = p(m) *(M/m)
2b. if candidate ap(m) > ap(m+1) then ap(m) = ap(m+1)
2c. else ap(m) = candidate ap(m)

where ap(m) refers to the adjusted p-value corresponding to the mth ordered unadjusted p-value. It's interesting to note that the adjusted p-value for the Mth ordered test is the same as the unadjusted p-value, while the candidate adjusted p-value for the smallest test is the Bonferroni adjusted p-value. The primary difficulty with taking these p-values (as opposed to the test results) out of context is captured in steps 2b and 2c. They imply that the p-value for a given test may be lowered by other observed p-values in the family of tests. It's also true that the adjusted p-value depends on the number of tests included in the family, but this seems somewhat less troubling.

To examine the impact of the procedure on the adjusted p-values for the individual tests, we'll compare the candidate ap(m) from step 2a against the actual ap(m). Our sense is that to the degree these are different, the adjusted p-value should not be extracted from the context of the observed family of tests.


SAS
Our SAS code relies heavily on the array statement (section 1.11.5). We loop through the p-values from largest to smallest, calculating the candidate fdr p-value as above, before arriving at the final adjusted p-value. To compare the values conveniently, we make a new data set with two copies of the original data set, renaming first the candidate and then the adjusted p-values to have the same names. The in = data set option creates a temporary variable which identifies which data set an observation was read from; here it denotes which version of the same data set (and which set of p-values) was used.

data fdr;
array pvals [10] pval1 - pval10
(.001 .001 .001 .001 .001 .03 .035 .04 .05 .05);
array cfdrpvals [10] cfdr1 - cfdr10;
array fdrpvals [10] fdr1 - fdr10;
fdrpvals[10] = pvals[10];
do i = 9 to 1 by -1;
cfdrpvals[i] = pvals[i] * 10/i;
if cfdrpvals[i] > fdrpvals[i+1] then fdrpvals[i] = fdrpvals[i+1];
else fdrpvals[i] = cfdrpvals[i];
end;
run;

data compare;
set fdr (in = cfdr rename=(cfdr1=c1 cfdr2=c2 cfdr3=c3 cfdr4=c4
cfdr5=c5 cfdr6=c6 cfdr7=c7 cfdr8=c8 cfdr9=c9))
fdr (in = fdr rename=(fdr1=c1 fdr2=c2 fdr3=c3 fdr4=c4 fdr5=c5
fdr6=c6 fdr7=c7 fdr8=c8 fdr9=c9));
if cfdr then adjustment = "Candidate fdr";
if fdr then adjustment = "Final fdr";
run;

proc print data = compare; var adjustment c1-c9; run;

adjustment c1 c2 c3 c4 c5 c6 c7 c8 c9

Candidate fdr 0.010 .005 .0033 .0025 .002 .05 .05 .05 .055
Final fdr 0.002 .002 .0020 .0020 .002 .05 .05 .05 .050

(We omit the last p-value because the adjustment does not affect it.) The result shows that for many of the tests in this family, a substantially smaller p-value is obtained with the final FDR p-value than the candidate. To this degree, the FDR p-value is dependent on the observed values of the p-values in the tests in the family, and ought not to be removed from the context of these other tests. We would recommend caution in displaying the FDR p-values in such settings, given readers' propensity to use them as if they were ordinary p-values, safely adjusted for multiple testing.

R
Comparison of the R and SAS code may make SAS programmers weep. The candidate values are easily calculated, and can be presented with the final p-values in one step using the p.adjust() function. Three lines of code, albeit incorporating multiple functions in each line. (And it could sensibly be done in two, calculating the candidate p-values within the rbind() function call.) Note especially the line calculating the candidate p-values, in which vectorization allows a for loop to be avoided in a very natural fashion.

fakeps = c(rep(.2, 5), 6, 7, 8, 10, 10)/200
cfdr = fakeps * 10/(1:10)
rbind(cfdr, fdr=p.adjust(fakeps, "fdr"))[,1:9]

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
cfdr 0.010 0.005 0.0033 0.0025 0.002 0.05 0.05 0.05 0.0556 0.05
fdr 0.002 0.002 0.0020 0.0020 0.002 0.05 0.05 0.05 0.0500 0.05


An unrelated note about aggregatorsWe love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers and PROC-X with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Thursday, January 5, 2012

Example 9.18: Constructing the fastest relay team via enumeration

In competitive swimming, the medley relay is a team event in which four different swimmers each swim one of the four strokes: freestyle, breaststroke, backstroke, and butterfly. In general, every swimmer might be able swim any given stroke. How can we choose the fastest relay team? Here we solve this by enumerating all possible teams, though a more efficient routine likely exists.

Some example practice times can be seen on this Google Spreadsheet. Also, using the steps outlined here, the same spreadsheet is available as a CSV file here. (FTR, these are actual practice times for 100 yards for mostly 12-13 year-old swimmers; the names have been changed.)


SAS
We first read the data from the URL, using the technique outlined in section 1.1.6. Note that if you cut-and-paste this, you'll need to get the whole URL onto one line-- we break it up here for display only.

filename swimurl url 'https://docs.google.com/spreadsheet
/pub?key=0AvJKgZUzMYLYdE5xTHlEWkNUM3NoOHB1ZTJoTFMzUUE&
single=true&gid=0&output=csv';

proc import datafile=swimurl out=swim dbms=csv;
run;

Next, we use the point= option in nested set statements to generate a single data set with all the possible combinations of names and times. Meanwhile we change the names of the variables so they don't get overwritten in the next set statement. Note the use of the nobs option to find the number of rows in the data set.

data permute
(keep=free freetime fly flytime back backtime breast breasttime);
set swim (rename = (swimmer=free freestyle=freetime)) nobs=nobs;
do i = 1 to nobs;
set swim(rename = (swimmer=fly butterfly=flytime)) point=i;
do j = 1 to nobs;
set swim(rename = (swimmer=back backstroke=backtime)) point=j;
do k = 1 to nobs;
set swim(rename = (swimmer=breast breaststroke=breasttime))
point = k;
output;
end;
end;
end;
run;

The resulting data set has 12^4 rows, and includes rosters with the same swimmer swimming all four legs. In fact, a quick glance will show that Anna has the best time in each stroke, and thus the best "team" based on these practice times would use her for each stroke. This is against the rules, and also probably isn't reflective of performance in a race. We'll remove illegal line-ups using a where statement (section 1.5.1) and also calculate the total team time.

data prep;
set permute;
where free ne back and free ne breast and free ne fly and
back ne breast and back ne fly and breast ne fly;
time = sum(freetime, flytime, backtime, breasttime);
run;

The resulting data set has (12 permute 4) lines. To find the best team, we just sort by the total time and look at the first line. Here the first 10 lines (10 best teams) are shown.

proc sort data=prep; by time; run;
proc print data=prep (obs=10); run;

b
r
f b e
r f a a
e l c b s
e y k r t
f t t b t e t t
r i f i a i a i i
e m l m c m s m m
e e y e k e t e e

Kara 109.3 Dora 126.8 Lara 117.7 Anna 126.9 480.7
Anna 102.8 Dora 126.8 Lara 117.7 Beth 134.6 481.9
Kara 109.3 Anna 120.5 Lara 117.7 Beth 134.6 482.1
Anna 102.8 Dora 126.8 Lara 117.7 Honora 136.4 483.7
Kara 109.3 Jane 129.8 Lara 117.7 Anna 126.9 483.7
Kara 109.3 Anna 120.5 Lara 117.7 Dora 136.4 483.9
Kara 109.3 Anna 120.5 Lara 117.7 Honora 136.4 483.9
Kara 109.3 Lara 123.1 Jane 124.7 Anna 126.9 484.0
Anna 102.8 Dora 126.8 Lara 117.7 Inez 136.8 484.1
Carrie 112.7 Dora 126.8 Lara 117.7 Anna 126.9 484.1

The best time shaves a whole second off the predicted time using the second-best team.

R
Since published Google Spreadsheets use https rather than http, we use the RCurl package and its getURL() function. (Note that if you cut-and-paste this, you'll need to get the whole URL onto one line-- we break it up here for display only.) Then we can read the data with read.csv() and textConnection().

library(RCurl)
swim = getURL("https://docs.google.com/spreadsheet
/pub?key=0AvJKgZUzMYLYdE5xTHlEWkNUM3NoOHB1ZTJoTFMzUUE&
single=true&gid=0&output=csv")

swim2=read.csv(textConnection(swim))

To make an object with the combinations of names, we use the expand.grid() function highlighted in Example 7.22, providing as arguments the swimmers names four times. As in the SAS example, the result has has 12^4 rows. The combn() function might be a better fit here, but was more difficult to use.

test2 = expand.grid(swim2$Swimmer,swim2$Swimmer, swim2$Swimmer, swim2$Swimmer)

It'll be useful to assign these copies of the names to each of the strokes. We'll do that with the rename() function available in the reshape package. (This approach is mentioned in section 1.3.4.). Then we can remove the rows where the same name appears twice using some logic. The logic is nested in the with() function to save some keystrokes and is generally preferable to attach()ing the test2 object.

library(reshape)
test2 = rename(test2, c("Var1" = "free", "Var2" = "fly",
"Var3" = "back", "Var4" = "breast"))
test3 = with(test2, test2[(free != breast) & (free != fly)
& (free != back) & (breast != fly) & (breast != back)
& (fly != back) ,])

Finally, we can use the which.min() function to pick the best team.

> bestteam =
+ test3[which.min(swim2$Freestyle[test3$free]+swim2$Breaststroke[test3$breast] +
+ swim2$Butterfly[test3$fly] + swim2$Backstroke[test3$back]),]
>bestteam
free fly back breast
1631 Kara Dora Lara Anna

For new users of R, this may look very peculiar-- it uses elegant but powerful features of the R language that may be challenging for new users to grasp. Essentially, in swim2$Freestyle[test3$free] we say: from the "freestyle" times in the swim2 object, take the time from the row that has the name in a row of "free" names in the test3 object. The which.min() function replicates this request for every row in the test3 object (which has all of the permutations) in it, returning the row number with that minimum sum. The outer test3[rows,columns] syntax grabs the values in this row. (The number 1631 is the row number, for some reason showing the row in the test2 object created by expand.grid().)

Now, we might also want the actual times associated with the best team. We can find them by calling the correct rows (names from the best team) and columns (stroke associated with that name) from the original data set.

> times = c(swim2[swim2$Swimmer == bestteam$free,2],
+ swim2[swim2$Swimmer == bestteam$fly,3],
+ swim2[swim2$Swimmer == bestteam$back,4],
+ swim2[swim2$Swimmer == bestteam$breast,5])
> times
[1] 109.3 126.8 117.7 126.9

If instead, one wanted to list the times in order, one approach would be to add columns to the test3 object with the time for each stroke, calculate their sum, and sort on the sum.