
Monday, June 18, 2012

Example 9.35: Discrete randomization and formatted output

A colleague asked for help with randomly choosing a kid within a family. This is for a trial in which families are recruited at well-child visits, but in each family only one of the children having a well-child visit that day can be in the study. The idea is that after recruiting the family, the research assistant needs to choose one child, but if the RAs make that choice themselves, the chosen children are unlikely to be representative. Instead, we'll let them make the decision randomly, using an easy-to-read slip that can be sealed in an envelope. The envisioned process is that the RA will recruit the family, determine the number of eligible children, then open the envelope to find out which child was randomly selected.

One thought here would be to generate separate stacks of envelopes for each given family size, and have the research assistant open an envelope from the appropriate stack. However, this could be logistically challenging, especially since the RAs will spend weeks away from the home office. Instead, we'll include all plausible family sizes on each slip of paper. It seems unlikely that more than 5 children in a family will have well-child visits on the same day.

SAS
We'll use the SAS example to demonstrate using SAS macros to write SAS code, as well as to show a plausible use for SAS formats (section 1.4.12) and to make use of proc print.

/* the following macro will write out equal probabilities for selecting
each integer between 1 and the argument, in the format needed for the
rand function. E.g., if the argument is 3,
it will write out
1/3,1/3,1/3
*/

%macro tbls(n);
%do i = 1 %to &n;
1/&n %if &i < &n %then ,;
%end;
%mend tbls;

/* then we can use the %tbls macro to create the randomization
via rand("TABLE") (section 1.10.4). */
data kids;
do family = 1 to 100000;
nkids = 2; chosen = rand("TABLE",%tbls(2)); output;
nkids = 3; chosen = rand("TABLE",%tbls(3)); output;
nkids = 4; chosen = rand("TABLE",%tbls(4)); output;
nkids = 5; chosen = rand("TABLE",%tbls(5)); output;
end;
run;

/* check randomization */
proc freq data = kids;
table nkids * chosen / nocol nopercent;
run;

nkids chosen

Frequency|
Row Pct | 1| 2| 3| 4| 5| Total
---------+--------+--------+--------+--------+--------+
2 | 50256 | 49744 | 0 | 0 | 0 | 100000
| 50.26 | 49.74 | 0.00 | 0.00 | 0.00 |
---------+--------+--------+--------+--------+--------+
3 | 33429 | 33292 | 33279 | 0 | 0 | 100000
| 33.43 | 33.29 | 33.28 | 0.00 | 0.00 |
---------+--------+--------+--------+--------+--------+
4 | 25039 | 24839 | 25245 | 24877 | 0 | 100000
| 25.04 | 24.84 | 25.25 | 24.88 | 0.00 |
---------+--------+--------+--------+--------+--------+
5 | 19930 | 20074 | 20188 | 20036 | 19772 | 100000
| 19.93 | 20.07 | 20.19 | 20.04 | 19.77 |
---------+--------+--------+--------+--------+--------+
Total 128654 127949 78712 44913 19772 400000

Looks pretty good. Now we need to make the output usable to the research assistants by formatting the results into English. We'll use the same format for each number of kids. This saves some keystrokes now, but may cause the RAs some confusion-- it means that we might refer to the "4th oldest" of 4 children, rather than the "youngest". We could fix this by using a different format for each number of children, analogous to the R version below.

proc format;
value chosen
1 = "oldest"
2 = '2nd oldest'
3 = '3rd oldest'
4 = '4th oldest'
5 = '5th oldest';
run;

/* now, make a text variable that concatenates (section 1.4.5) the variables
and some explanatory text */
data k2;
set kids;
if nkids eq 2 then
t1 = "If there are " || strip(nkids) ||" children then choose the " ||
strip(put(chosen,chosen.)) || " child.";
else
t1 = " " || strip(nkids) ||" ________________________ " ||
strip(put(chosen,chosen.));
run;

/* then we print. Notice the system options to shorten the page length and
width and to remove the date and page number from the SAS output, the
formchar option to ensure plain-text table characters, and the proc print
options to remove the observation number, along with a few other tricks:
the '00'x null label blanks the variable header for t1, and relabeling
family makes the by-lines read "Envelope=n" */
options nonumber nodate ps = 60 ls = 68;
options formchar="|----|+|---+=|-/\<>*";
proc print data = k2 (obs = 3) noobs label sumlabel;
by family;
var t1;
label t1 = '00'x family = "Envelope";
run;

---------------------------- Envelope=1 ----------------------------



If there are 2 children then choose the 2nd oldest child.
3 ________________________ 3rd oldest
4 ________________________ 4th oldest
5 ________________________ 5th oldest


---------------------------- Envelope=2 ----------------------------



If there are 2 children then choose the 2nd oldest child.
3 ________________________ oldest
4 ________________________ oldest
5 ________________________ 3rd oldest


---------------------------- Envelope=3 ----------------------------



If there are 2 children then choose the 2nd oldest child.
3 ________________________ 2nd oldest
4 ________________________ 3rd oldest
5 ________________________ 2nd oldest


R
For R, we leave some trial code in place, to demonstrate how one might discover, test, and build R code in this setting. Most results have been omitted.

sample(5, size = 1)
# choose a (discrete uniform) random integer between 1 and 5

apply(matrix(2:5),1,sample,size=1)
# choose a random integer between 1 and 2, then between 1 and 3, etc.,
# using apply() to repeat the call to sample() with different maximum number
# apply() needs a matrix or array input
# result of this is the raw data needed for one family

replicate(3,apply(matrix(2:5),1,sample,size=1))
# replicate() is in the apply() family and just repeats the
# function n times

[,1] [,2] [,3]
[1,] 2 1 2
[2,] 2 1 2
[3,] 2 2 2
[4,] 3 5 4

Now we have the raw data for the envelopes. Before formatting it for printing, let's check it to make sure it works correctly.

test=replicate(100000, apply(matrix(2:5), 1, sample, size=1))
apply(test, 1, summary)
[,1] [,2] [,3] [,4]
Min. 1.0 1 1.000 1.000
1st Qu. 1.0 1 1.000 2.000
Median 1.0 2 2.000 3.000
Mean 1.5 2 2.492 3.003
3rd Qu. 2.0 3 3.000 4.000
Max. 2.0 3 4.000 5.000
# this is not so helpful-- need the count or percent for each number
# this would be the default if the data were factors, but they aren't
# check to see if we can trick summary() into treating these integers
# as if they were factors
methods(summary)
# yes, there's a summary() method for factors-- let's apply it
# there's also apply(test,1,table) which might be better, if you remember it
apply(test, 1, summary.factor)
[[1]]
1 2
50025 49975

[[2]]
1 2 3
33329 33366 33305

[[3]]
1 2 3 4
25231 25134 24849 24786

[[4]]
1 2 3 4 5
19836 20068 20065 20022 20009

Well, that's not too pretty, but it's clear that the randomization is working. Now it's time to work on formatting the output.

mylist=replicate(5, apply(matrix(2:5), 1, sample, size=1))
# brief example data set

# We'll need to use some formatted values (section 1.4.12), as in SAS.
# Here, we'll make new value labels for each number of children,
# which will make the output easier to read. We add in an envelope
# number and wrap it all into a data frame.
df = data.frame(envelope = 1:5,
twokids=factor(mylist[1,],1:2,labels=c("youngest","oldest")),
threekids=factor(mylist[2,],1:3,labels=c("youngest", "middle", "oldest")),
fourkids=factor(mylist[3,],1:4,labels=c("youngest", "second youngest",
"second oldest", "oldest")),
fivekids=factor(mylist[4,],1:5,labels=c("youngest", "second youngest",
"middle", "second oldest", "oldest"))
)

# now we need a function to take a row of the data frame and make a single slip
# the paste() function (section 1.4.5) puts together the fixed and variable
# content of each row, while the cat() function will print it without quotes
slip = function(kidvec) {
cat(paste("------------- Envelope", kidvec[1], "------------------"))
cat(paste("\nIf there are", 2:5, "children, select the", kidvec[2:5], "child"))
cat("\n \n \n")
}

# test it on one row
slip(df[1,])

# looks good-- now we can apply() it to each row of the data frame
apply(df, 1, slip)

------------- Envelope 1 ------------------
If there are 2 children, select the youngest child
If there are 3 children, select the youngest child
If there are 4 children, select the second youngest child
If there are 5 children, select the youngest child


------------- Envelope 2 ------------------
If there are 2 children, select the youngest child
If there are 3 children, select the youngest child
If there are 4 children, select the second oldest child
If there are 5 children, select the middle child


------------- Envelope 3 ------------------
If there are 2 children, select the youngest child
If there are 3 children, select the youngest child
If there are 4 children, select the youngest child
If there are 5 children, select the second youngest child

# and so forth

# finally, we can save the result in a file with
# capture.output()
capture.output(apply(df,1,slip), file="testslip.txt")



Monday, May 21, 2012

Example 9.32: Multiple testing simulation

In examples 9.30 and 9.31 we explored corrections for multiple testing and then extracted p-values adjusted by the Benjamini and Hochberg (or FDR) procedure. In this post we'll develop a simulation to explore the impact of the "strong" and "weak" control of the family-wise error rate offered by multiple comparison corrections. Loosely put, weak control procedures may fail when some of the null hypotheses are actually false, in that the remaining (true) nulls may be rejected more often than the nominal proportion of the time.

For our simulation, we'll develop flexible code to generate some p-values from false nulls and others from true nulls. We'll assume that the true nulls have p-values distributed uniform(0,1); the false nulls will have p-values distributed uniform with a user-determined maximum. We'll also allow the number of tests overall and the number of false nulls to be set.

SAS
In SAS, a macro does the job. It accepts the user parameters described above, then generates false and true nulls for each desired simulation. With the data created, we can use proc multtest to apply the FDR procedure, with the ODS system saving the results. Note how the by statement allows us to replicate the analysis for each simulated set of p-values without creating a separate data set for each one. (Also note that we do not use proc sort before that by statement-- this can be risky, but works fine here.)

%macro fdr(nsims=1, ntests = 20, nfalse=10, howfalse=.01);
ods select none;
data test;
do sim = 1 to &nsims;
do i = 1 to &ntests;
raw_p = uniform(0) *
( ((i le &nfalse) * &howfalse ) + ((i gt &nfalse) * 1 ) );
output;
end;
end;
run;

ods output pvalues = __pv;
proc multtest inpvalues=test fdr;
by sim;
run;

With the results in hand (still within the macro), we need to do some massaging to make the results usable. First we'll recode the rejections (assuming a 0.05 alpha level) so that non-rejections are 0 and rejections are 1/number of tests. That way we can just sum across the results to get the proportion of rejections. Next, we transpose the data to get each simulation in a row (section 1.5.4). (The data output from proc multtest has nsims*ntests rows; after transposing, there are nsims rows.) Finally, we can sum across the rows to get the proportion of tests rejected in each simulated family of tests. The results are shown in a table made with proc freq.

data __pv1;
set __pv;
if falsediscoveryrate lt 0.05 then fdrprop = 1/&ntests;
else fdrprop =0;
run;

proc transpose data = __pv1 (keep =sim fdrprop) out = pvals_a;
by sim; run;

data pvals;
set pvals_a;
prop = sum(of col1 - col&ntests);
run;
ods select all;

proc freq data = pvals; tables prop; run;
%mend fdr;

%fdr(nsims = 1000, ntests = 20, nfalse = 10, howfalse=.001);

Cumulative Cumulative
prop Frequency Percent Frequency Percent
---------------------------------------------------------
0.5 758 75.80 758 75.80
0.55 210 21.00 968 96.80
0.6 27 2.70 995 99.50
0.65 5 0.50 1000 100.00

So at least one true null was rejected about 24% of the time (the rows with prop greater than 0.5, where more than the 10 false nulls were rejected: 21.00 + 2.70 + 0.50 = 24.2%), which seems like a lot. Multiple comparison procedures with "strong" control of the familywise error rate will reject them only 5% of the time. Building this simulation as a macro facilitates exploring the effects of the multiple comparison procedures in a variety of settings.

R
As in example 9.31, the R code is rather simpler, though perhaps a bit opaque. To make the p-values, we generate them first for all of the tests with false nulls, then for all of the tests with true nulls. The matrix() function reads these in by column, by default, meaning that the first nfalse columns get the nsims*nfalse false-null observations. The apply() function generates the FDR p-values for each row (that is, each simulation) of the data set. The t() function just transposes the resulting matrix so that we get back a row for each simulation. As in the SAS version, we'll count each rejection as 1/ntests and non-rejections as 0; we do this with the ifelse() function. Then we sum the rejections within each simulation with another call to apply() and show the results with a simple table.

checkfdr = function(nsims=1, ntests=100, nfalse=0, howfalse=0.001) {
raw_p = matrix(c(runif(nfalse * nsims) * howfalse,
runif((ntests-nfalse) * nsims)), nrow=nsims)
fdr = t(apply(raw_p, 1, p.adjust, "fdr"))
reject = ifelse(fdr<.05, 1/ntests,0)
prop = apply(reject, 1, sum)
prop.table(table(prop))
}

> checkfdr(nsims=1000, ntests=20, nfalse=10, howfalse=.001)
prop
0.5 0.55 0.6 0.65
0.755 0.210 0.032 0.003

The results are reassuringly similar to those from SAS. In this R code, it's particularly simple to try a different test-- just replace "fdr" in the p.adjust() call. Here's the result with the Hochberg test, which has strong control.

checkhoch = function(nsims=1, ntests=100, nfalse=0, howfalse=0.001) {
pvals = matrix(c(runif(nfalse * nsims) * howfalse,
runif((ntests-nfalse) * nsims)), nrow=nsims)
hochberg = t(apply(pvals, 1, p.adjust,"hochberg"))
reject = ifelse(hochberg<.05,1/ntests,0)
prop = apply(reject, 1, sum)
prop.table(table(prop))
}

> checkhoch(nsims=1000, ntests=20, nfalse=10, howfalse=.001)
prop
0.5 0.55 0.6
0.951 0.046 0.003

With this procedure, one or more of the true nulls is rejected an appropriate 4.9% of the time (4.6% + 0.3%). For the most part, we feel more comfortable using multiple testing procedures with "strong control".



Monday, April 16, 2012

Example 9.27: Baseball and shrinkage


To celebrate the beginning of the professional baseball season here in the US and Canada, we revisit a famous example of using baseball data to demonstrate statistical properties.

In 1977, Bradley Efron and Carl Morris published a paper about the James-Stein estimator-- the shrinkage estimator that has better mean squared error than the simple average. Their prime example was the batting averages of 18 players in the 1970 season: they considered trying to estimate each player's average over the remainder of the season, based on his first 45 at-bats. The paper is a pleasure to read, and can be downloaded here. The data are available here, on the pages of statistician Phil Everson, of Swarthmore College.

Today we'll review plotting the data, and intend to look at some other shrinkage estimators in a later entry.

SAS
We begin by reading in the data from Everson's page. (Note that the long address would need to be on one line, or you could use a URL shortener like TinyURL.com.) To read the data, we use the infile statement to indicate a tab-delimited file and to say that the data begin in row 2. The informat statement helps read in the variable-length names.


filename bb url "http://www.swarthmore.edu/NatSci/peverso1/Sports%20Data/
JamesSteinData/Efron-Morris%20Baseball/EfronMorrisBB.txt";

data bball;
infile bb delimiter='09'x MISSOVER DSD lrecl=32767 firstobs=2 ;
informat firstname $7. lastname $10.;
input FirstName $ LastName $ AtBats Hits BattingAverage RemainingAtBats
RemainingAverage SeasonAtBats SeasonHits SeasonAverage;
run;

data bballjs;
set bball;
js = .212 * battingaverage + .788 * .265;

avg = battingaverage; time = 1;
if lastname not in("Scott","Williams", "Rodriguez", "Unser","Swaboda","Spencer")
then name = lastname; else name = '';
output;
avg = seasonaverage; name = ''; time = 2; output;
avg = js; time = 3; name = ''; output;
run;

In the second data step, we calculate the James-Stein estimator according to the values reported in the paper. Then, to facilitate plotting, we convert the data to the "long" format, with three rows for each player, using the explicit output statement. The average in the first 45 at-bats, the average in the remainder of the season, and the James-Stein estimator are recorded in the same variable in each of the three rows, respectively. To distinguish between the rows, we assign a different value of time: this will be used to order the values on the graphic. We also record the last name of (most of) the players in a new variable, but only in one of the rows. This will be plotted in the graphic-- some players' names can't be shown without plotting over the data or other players' names.

Now we can generate the plot. Many features shown here have been demonstrated in several entries. We call out 1) the h option, which increases the text size in the titles and labels, 2) the offset option, which moves the data away from the edge of the plot frame, 3) the value option in the axis statement, which replaces the values of "time" with descriptive labels, and 4) the handy a*b=c syntax which replicates the plot for each player.

title h=3 "Efron and Morris example of James-Stein estimation";
title2 h=2 "Baseball players' 1970 performance estimated from first 45 at-bats";
axis1 offset = (4cm,1cm) minor=none label=none
value = (h = 2 "Avg. of first 45" "Avg. of remainder" "J-S Estimator");
axis2 order = (.150 to .400 by .050) minor=none offset=(0.5cm,1.5cm)
label = (h =2 r=90 a = 270 "Batting Average");
symbol1 i = j v = none l = 1 c = black r = 20 w=3
pointlabel = (h=2 j=l position = middle "#name");

proc gplot data = bballjs;
plot avg * time = lastname / haxis = axis1 vaxis = axis2 nolegend;
run; quit;

To read the plot (shown at the top), consider approaching the nominal true probability of a hit, as represented by the average over the remainder of the season, in the center. If you begin on the left, you see the difference associated with using the simple average of the first 45 at-bats as the estimator. Coming from the right, you see the difference associated with using the James-Stein shrinkage estimator. The improvement associated with the James-Stein estimator is reflected in the generally shallower slopes coming from the right. With the exception of Pirates great Roberto Clemente and declining third baseman Max Alvis, almost every line has a shallower slope from the right; James and Stein's theoretical work shows that in aggregate the lines must be shallower from the right.

R
A similar process is undertaken within R. Once the data are loaded, and a subset of the names are blanked out (to improve the readability of the figure), the matplot() and matlines() functions are used to create the lines.

bball = read.table("http://www.swarthmore.edu/NatSci/peverso1/Sports%20Data/JamesSteinData/Efron-Morris%20Baseball/EfronMorrisBB.txt",
header=TRUE, stringsAsFactors=FALSE)
bball$js = bball$BattingAverage * .212 + .788 * (0.265)
bball$LastName[!is.na(match(bball$LastName,
c("Scott","Williams", "Rodriguez", "Unser","Swaboda","Spencer")))] = ""
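As an aside, the .212 and .788 weights above need not be taken on faith. Here is a minimal sketch of the shrinkage calculation, assuming the untransformed James-Stein formula with a binomial variance evaluated at the grand mean (the paper works on a variance-stabilized scale, so this is only an approximation, and the variable names are our own):

ybar = mean(bball$BattingAverage)  # grand mean of the first-45 averages, about .265
s2 = ybar * (1 - ybar)/45          # binomial variance of an average over 45 at-bats
ss = sum((bball$BattingAverage - ybar)^2)
k = nrow(bball)                    # 18 players
shrink = 1 - (k - 3) * s2/ss       # comes out near the .212 used above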

a = matrix(rep(1:3, nrow(bball)), 3, nrow(bball))
b = matrix(c(bball$BattingAverage, bball$SeasonAverage, bball$js),
3, nrow(bball), byrow=TRUE)
matplot(a, b, pch=" ", ylab="predicted average", xaxt="n", xlim=c(0.5, 3.1), ylim=c(0.13, 0.42))
matlines(a, b)
text(rep(0.7, nrow(bball)), bball$BattingAverage, bball$LastName, cex=0.6)
text(1, 0.14, "First 45\nat bats", cex=0.5)
text(2, 0.14, "Average\nof remainder", cex=0.5)
text(3, 0.14, "J-S\nestimator", cex=0.5)

Thursday, February 23, 2012

Example 9.21: The birthday "problem" re-examined



The so-called birthday paradox or birthday problem is simply the counter-intuitive discovery that the probability of (at least) two people in a group sharing a birthday goes up surprisingly fast as the group size increases. If the group has only 23 people, there is a 50% chance that two of them share a birthday, and with 40 people it's about 90%. There is an excellent Wikipedia page discussing this.
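Those numbers are easy to verify under the equal-probability assumption; here is a quick sketch (pmatch2 is just a name chosen here, and the stats package's pbirthday() function gives similar answers):

# P(at least two of n people share a birthday), 365 equally likely days
pmatch2 = function(n) 1 - prod((365 - 0:(n - 1))/365)
pmatch2(23)   # about 0.507
pmatch2(40)   # about 0.891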

However, this analytically derived probability is based on the assumption that births are equally likely on any day of the year. (It also ignores the occasional February 29th, and any social factors that lead people born at the same time of year to seek like spouses, and so forth.) But this assumption does not appear to be true, as laid out anecdotally and in the press.

As noted in the latter link, any disparity in the probability of birth between days will improve the chances of a match. But how much? An analytic solution seems quite complex, even if we approximate the true daily distribution with a constant birth probability per month. Simulation will be simpler. While we're at it, we'll include leap days as well, since February 29th approaches.

SAS

Our approach here is based on the observation that the probability of at least one match among N people is equal to the sum, over n = 2,...,N, of the probability that the first match occurs with exactly n people. In addition, rather than simulating groups of 2, estimating the probability of a match, and repeating for groups of 3,...,N, we'll keep adding people to a group until we have a match, finding the probability of a match for all group sizes at once.

Here we use arrays (section 1.11.5) to keep track of the number of days in a month and of the people in our group. To reduce computation, we'll check for matches as we add people to the group, and only generate their birthdays if there is not yet a match. We also demonstrate the useful hyphen tool for referring to ranges of variables (1.11.4).

data bd1;
array daysmo [12] _temporary_ (31 28.25 31 30 31 30 31 31 30 31 30 31);
array dob [367] dob1 - dob367; * these variables will hold the birthdays--
                                 the hyphen includes all the variables in
                                 the sequence;

do group = 1 to 10000000; * simulate this many groups;
match = 0; * initialize whether there's a match in this
group, yet;
do i = 1 to 367; * loop through up to 367 subjects... the maximum
possible, obviously;
month = rantbl(0, 31*.0026123, 28*.0026785, 31*.0026838, 30*.0026426,
31*.0026702, 30*.0027424, 31*.0028655, 31*.0028954, 30*.0029407,
31*.0027705, 30*.0026842);
* choose a month of birth, by probabilities reported
in the Science News link, which are daily by month;
day = ceil((4 * daysmo[month] * uniform(0))/4);
* choose a day within the month-- note the trick used to get leap
days: February has 28.25 "days", or 113 quarter-days, so day 29
comes up with probability 1/113;
dob[i] = mdy(month, day, 1960);
* convert month and day into a day in the year--
1960 is a convenient leap year;
do j = 1 to (i-1) until (match gt 0);
* compare each old person to the new one;
if dob[j] = dob[i] then match = i;
* if there was a match, we needed i people in the
group to make it;
end;
if match gt 0 then leave;
* no need to generate the other 367-i people;
end;
output;
end;
run;

We note here that while we allow up to 367 birthdays before a match, the probability of more than 150 is so infinitesimal that we could save the space and speed up processing time by ignoring it. Now that the groups have been simulated, we just need to summarize and present them. We tabulate how many cases of groups of size N were recorded, generate the simple analytic answer, and merge them.

proc freq data = bd1;
tables match / out=bd2 outcum; * the bd2 data set has the results;
run;

data simpreal;
set bd2;
prob = 1 - ((fact(match) * comb (365,match)) / 365**match);
realprob = cum_freq/10000000;
diff = realprob-prob;
diffpct = 100 * (diff)/prob;
run;

It's easiest to interpret the results by way of a plot. We'll plot the absolute and the relative difference on the same image with two different axes. The axis and symbol statements will make it slightly prettier, and allow us to make 0 appear at the same point on both axes.

axis1 order = (-.75 to .75 by .25) minor = none;
axis2 order = (-.00025 to .00025 by .00005) minor = none;
symbol1 v = dot h = .75 c = blue;
symbol2 font=marker v = U h = .5 c = red;

proc gplot data= simpreal (obs = 89);
plot diffpct * match / vref = 0 vaxis=axis1 legend;
plot2 diff *match/ vaxis = axis2 legend;
run; quit;

The results, shown below, are very clear-- the leap day and the disequilibrium in birth month probability do increase the probability of at least one match in any group of a given size, relative to the uniform distribution across days assumed in the analytic solution. But the difference is minuscule on both the absolute and the relative scale.

R
Here we mimic the approach used above, but use the apply() function family in place of some of the looping.

dayprobs = c(.0026123,.0026785,.0026838,.0026426,.0026702,.0027424,.0028655,
.0028954,.0029407,.0027705,.0026842,.0026864)
daysmo = c(31,28,31,30,31,30,31,31,30,31,30,31)
daysmo2 = c(31,28.25,31,30,31,30,31,31,30,31,30,31)
# need both: the former is how the probs are reported,
# while the latter allows leap days

moprob = daysmo * dayprobs

With the monthly probabilities established, we can sample a birth month for everyone, and then choose a birth day within month. We use the same trick as above to allow birthdays of February 29th. Here we show code for 10,000 groups; on the simple cloud-based R setup this code was developed on, larger numbers caused a crash.

We've stopped referencing our book exhaustively, and doing so here would be tedious. Instead, we'll just comment that the tools we use here can be found in sections 1.4.5, 1.4.15, 1.4.16, 1.5.2, 1.8.3, 1.8.4, 1.9.1, 1.11.1, 5.2.1, 5.6.1, B.5.2, and probably others.

mob = sample(1:12,10000 * 367,rep=TRUE,prob=moprob)
dob = sapply(mob, function(x) ceiling(sample(4*daysmo2[x], 1)/4))
# sample() can't accept a vector of different maxima, so we call it
# once per person via sapply(); sampling quarter-days and applying
# ceiling() allows February 29th, as in the SAS version.
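As a quick side check of the quarter-day trick (not part of the original analysis): February's 28.25 "days" are 113 quarter-days, so day 29 should turn up with probability 1/113, about 0.0088.

febdays = ceiling(sample(4*28.25, 100000, replace=TRUE)/4)
mean(febdays == 29)   # should be near 1/113 = 0.00885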

mobdob = paste(mob,dob)
# concatenate the month and day to make a single variable to compare
# between people. The ISOdate() function would approximate the SAS mdy()
# function but would be much longer, and we don't need it.

# convert the vector into a matrix with the maximum
# group size as the number of columns
# as noted above, this could safely be truncated, with great savings
mdmat = matrix(mobdob, ncol=367, nrow=10000)

To find duplicate birthdays in each row of the matrix, we'll write a function that compares the number of unique values with the length of the vector, calling it repeatedly in a for() loop until there is a difference. Then, to save (a lot of) computation, we'll break out of the loop and report the number of people needed to make the match. Finally, we'll use apply() to run this vector-based function on each row of the birthday matrix.

matchn = function(x) {
for (i in 1:367){
if (length(unique(x[1:i])) != i) break
}
return(i)
}

groups = apply(mdmat, 1, matchn)

bdprobs = cumsum(table(groups)/10000)
# find the N with each group number, divide by number of groups
# and get the cumulative sum

rgroups = as.numeric(names(bdprobs))
# extract the group sizes from the table
probs = 1 - ((factorial(rgroups) * choose(365,rgroups)) / 365**rgroups)
# calculate the analytic answer, for any group size
# for which there was an observed simulated value

diffs = bdprobs - probs
diffpcts = diffs/probs

To plot the differences and percent differences in probabilities, we modify (slightly) the functions for a multiple-axis scatterplot we show in our book in section 5.6.1. You can find the code for this and all the book examples on the book web site.

addsecondy <- function(x, y, origy, yname="Y2") {
prevlimits <- range(origy)
axislimits <- range(y)
axis(side=4, at=prevlimits[1] + diff(prevlimits)*c(0:5)/5,
labels=round(axislimits[1] + diff(axislimits)*c(0:5)/5, 3))
mtext(yname, side=4)
newy <- (y-axislimits[1])/(diff(axislimits)/diff(prevlimits)) +
prevlimits[1]
points(x, newy, pch=2)
abline(h=(-axislimits[1])/(diff(axislimits)/diff(prevlimits)) +
prevlimits[1])
}

plottwoy <- function(x, y1, y2, xname="X", y1name="Y1", y2name="Y2")
{
plot(x, y1, ylab=y1name, xlab=xname)
abline(h=0)
addsecondy(x, y2, y1, yname=y2name)
}

plottwoy(rgroups, diffs, diffpcts, xname="Number in group",
y1name="Diff in prob", y2name="Diff in percent")
legend(80, .0013, pch=1:2, legend=c("Diffs", "Pcts"))

The resulting plot (based on 100,000 groups, a tolerable compute time on a laptop) is shown at the top. Aligning the 0 on each axis was more of a hassle than seemed worth it for today. However, the message is the same-- a clearly larger probability with the observed birth distribution, but not a meaningful difference.

Tuesday, July 5, 2011

Example 9.1: Scatterplots with binning for large datasets



Scatterplots can get very hard to interpret when displaying large datasets, as points inevitably overplot and can't be individually discerned. A number of approaches have been crafted to help with this problem. One approach uses binning. This approach is also sometimes called a heat map, and can be thought of as a two-dimensional histogram, where shades of the bins take the place of the heights of the bars. Any regular tessellation of the plane can be used, but there is some attraction to using hexagons. Why? In the vignettes for the hexbin package, author Nicholas Lewin-Koh notes:

There are many reasons for using hexagons, at least over squares. Hexagons have symmetry of nearest neighbors which is lacking in square bins. Hexagons are the maximum number of sides a polygon can have for a regular tesselation of the plane, so in terms of packing a hexagon is 13% more efficient for covering the plane than squares. This property translates into better sampling efficiency at least for elliptical shapes. Lastly hexagons are visually less biased for displaying densities than other regular tesselations.


On the other hand, it's unclear whether these advantages are relevant here or whether they outweigh the simplicity of the square and the constant x and y values accompanying it.

In this entry, we demonstrate the use of a binned scatterplot for data from a sample of 10,000 generated bivariate normal random variables (section 1.10.6).

R

In R, we use the hexbin package to generate our plot, after generating our bivariate normals with correlation approximately 0.52.

library(MASS)
library(hexbin)
mu = c(1, -1)
Sigma = matrix(c(3, 2,
2, 5), nrow=2)
xvals = mvrnorm(10000, mu, Sigma)
Sigma[1,2]/sqrt(Sigma[1,1]*Sigma[2,2]) # correlation
plot(hexbin(xvals[,1], xvals[,2]), xlab="X1", ylab="X2")
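For comparison with the squares discussed above, a square-binned version takes only a couple of lines of base R. This is a sketch, not part of the original example; the 30 bins and the gray scale are arbitrary choices of ours.

# two-dimensional histogram with square bins: cross-tabulate binned x and y,
# then display the counts with image()
sqbin = table(cut(xvals[,1], 30), cut(xvals[,2], 30))
image(1:30, 1:30, unclass(sqbin), col=gray(9:0/9), xlab="X1 bin", ylab="X2 bin")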


SAS
We're not aware of a SAS procedure to generate a binned scatterplot or of previously existing macros to do it. Ken wrote a relatively simple macro to do it, which can be found here. The macro uses proc gmap, and we hope that someone will develop an approach using proc template and proc sgrender, as demonstrated in an example from SAS Institute.

After running the macro, the following code generates the image shown below.


data Sigma (type=cov);
infile cards;
input _type_ $ _Name_ $ x1 x2;
cards;
cov x1 3 2
cov x2 2 5
;
run;

proc simnormal data=Sigma out=mvnorms numreal = 10000;
var x1 x2;
run;

%twodhist(data=mvnorms,x=x1,y=x2,nbinsx=30,nbinsy=30,nshades=9);




We note that the default number of shades shown in R, and the number chosen here for SAS, seem to exceed the eye's ability to differentiate, especially for the darker shades.

Update

An anonymous commenter reported that the SAS code bombed when run. I (Ken) added a new version of the code at the link listed above. I note it here only to emphasize that in either SAS or R, settings or objects in the environment can affect the performance of code. If you plan to share code, an item to add to your checklist is to run the code in a fresh session.
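In R, one rough approximation to a fresh session is to clear the workspace before re-running the code, though a true restart (or launching with R --vanilla) is more thorough, since removing objects doesn't reset options or unload packages:

rm(list=ls())   # remove all user-created objects from the current workspace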