Thursday, February 23, 2012

Example 9.21: The birthday "problem" re-examined



The so-called birthday paradox or birthday problem is simply the counter-intuitive discovery that the probability of (at least) two people in a group sharing a birthday goes up surprisingly fast as the group size increases. With only 23 people in the group, there is better than a 50% chance that two of them share a birthday, and with 40 people it's about 90%. There is an excellent Wikipedia page discussing this.
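
(A quick aside of ours: under the equal-likelihood assumption, the analytic probabilities are easy to check in R with the built-in pbirthday() function.)

pbirthday(23)   # 0.507 -- just over half
pbirthday(40)   # 0.891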

However, this analytically derived probability is based on the assumption that births are equally likely on any day of the year. (It also ignores the occasional February 29th, as well as any social factors that lead people born at the same time of year to seek like spouses, and so forth.) But this assumption does not appear to be true, as laid out anecdotally and in the press.

As noted in the latter link, any disparity in the probability of birth between days will improve the chances of a match. But how much? An analytic solution seems quite complex, even if we approximate the true daily distribution with a constant birth probability per month. Simulation will be simpler. While we're at it, we'll include leap days as well, since February 29th approaches.

SAS

Our approach here is based on the observation that the probability of at least one match among N people equals the sum, over i = 2,...,N, of the probability that the first match occurs when the i-th person joins the group. So rather than simulating groups of size 2, estimating the probability of a match, and repeating for groups of 3,...,N, we keep adding people to a group until we get a match, estimating the probability of a match for all group sizes at once.

Here we use arrays (section 1.11.5) to keep track of the number of days in a month and of the people in our group. To reduce computation, we'll check for matches as we add people to the group, and only generate their birthdays if there is not yet a match. We also demonstrate the useful hyphen tool for referring to ranges of variables (1.11.4).

data bd1;
array daysmo [12] _temporary_ (31 28.25 31 30 31 30 31 31 30 31 30 31);
array dob [367] dob1 - dob367;
  * these variables will hold the birthdays -- the hyphen
    includes all the variables in the sequence;
do group = 1 to 10000000;      * simulate this many groups;
  match = 0;                   * initialize whether there is a match
                                 in this group, yet;
  do i = 1 to 367;             * loop through up to 367 subjects -- the
                                 maximum possible, since 367 people cannot
                                 all have distinct birthdays among the 366
                                 possible days;
    month = rantbl(0, 31*.0026123, 28*.0026785, 31*.0026838, 30*.0026426,
                      31*.0026702, 30*.0027424, 31*.0028655, 31*.0028954,
                      30*.0029407, 31*.0027705, 30*.0026842);
      * choose a month of birth, by probabilities reported in the
        Science News link, which are daily by month -- rantbl()
        assigns the leftover probability mass to category 12, December;
    day = ceil((4 * daysmo[month] * uniform(0))/4);
      * choose a day within the month --
        note the trick used to get Leap Days;
    dob[i] = mdy(month, day, 1960);
      * convert month and day into a day in the year --
        1960 is a convenient leap year;
    do j = 1 to (i-1) until (match gt 0);
      * compare each old person to the new one;
      if dob[j] = dob[i] then match = i;
        * if there was a match, we needed i people in the group to make it;
    end;
    if match gt 0 then leave;  * no need to generate the other 367-i people;
  end;
  output;
end;
run;

We note here that while we allow up to 367 birthdays before a match, the probability of needing more than 150 people is so infinitesimal that we could save space and speed up processing by capping the group size there. Now that the groups have been simulated, we just need to summarize and present them. We tabulate how many groups of each size N were recorded, generate the simple analytic answer, and merge them.
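
Just how infinitesimal? Under the uniform assumption, the probability that the first 150 people produce no match at all is easy to compute directly (a quick R aside of ours):

prod(1 - (0:149)/365)   # about 2.5e-16 -- effectively zero
# each factor (365-k)/365 is the chance person k+1 misses all earlier birthdays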

proc freq data = bd1;
tables match / out=bd2 outcum; * the bd2 data set has the results;
run;

data simpreal;
set bd2;
prob = 1 - ((fact(match) * comb(365, match)) / 365**match);
realprob = cum_freq / 10000000;
diff = realprob - prob;
diffpct = 100 * diff / prob;
run;

It's easiest to interpret the results by way of a plot. We'll plot the absolute and the relative difference on the same image with two different axes. The axis and symbol statements will make it slightly prettier, and allow us to make 0 appear at the same point on both axes.

axis1 order = (-.75 to .75 by .25) minor = none;
axis2 order = (-.00025 to .00025 by .00005) minor = none;
symbol1 v = dot h = .75 c = blue;
symbol2 font=marker v = U h = .5 c = red;

proc gplot data=simpreal (obs=89);
plot diffpct * match / vref=0 vaxis=axis1 legend;
plot2 diff * match / vaxis=axis2 legend;
run; quit;

The results, shown below, are very clear-- leap day and the disequilibrium in birth month probability do increase the probability of at least one match in any group of a given size, relative to the uniform distribution across days assumed in the analytic solution. But the difference is minuscule on both the absolute and relative scales.

R
Here we mimic the approach used above, but use the apply() function family in place of some of the looping.

dayprobs = c(.0026123,.0026785,.0026838,.0026426,.0026702,.0027424,.0028655,
.0028954,.0029407,.0027705,.0026842,.0026864)
daysmo = c(31,28,31,30,31,30,31,31,30,31,30,31)
daysmo2 = c(31,28.25,31,30,31,30,31,31,30,31,30,31)
# need both: the former is how the probs are reported,
# while the latter allows leap days

moprob = daysmo * dayprobs
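
(A quick sanity check of ours: since the reported probabilities are per day, the monthly masses should total about 1.)

sum(moprob)   # just over 1 -- and sample() rescales prob= in any case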

With the monthly probabilities established, we can sample a birth month for everyone, and then choose a birth day within month. We use the same trick as above to allow birthdays of February 29th. Here we show code for 10,000 groups; on the modest cloud R instance this code was developed on, larger runs caused a crash.

We've stopped referencing our book exhaustively, and doing so here would be tedious. Instead, we'll just comment that the tools we use here can be found in sections 1.4.5, 1.4.15, 1.4.16, 1.5.2, 1.8.3, 1.8.4, 1.9.1, 1.11.1, 5.2.1, 5.6.1, B.5.2, and probably others.

mob = sample(1:12, 10000 * 367, replace=TRUE, prob=moprob)
dob = sapply(mob, function(x) ceiling(sample((4*daysmo2[x]), 1)/4) )
# sample() isn't vectorized over its first argument (the number of
# days differs by month), so we call it once per person with sapply().

mobdob = paste(mob,dob)
# concatenate the month and day to make a single variable to compare
# between people. The ISOdate() function would approximate the SAS mdy()
# function but would be much longer, and we don't need it.

# convert the vector into a matrix with the maximum
# group size as the number of columns
# as noted above, this could safely be truncated, with great savings
mdmat = matrix(mobdob, ncol=367, nrow=10000)
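
As an aside of ours, it's easy to check that the leap-day trick behaves as intended: within February, ceiling(sample(4*28.25, 1)/4) maps four draws to each of days 1 through 28 and a single draw (113) to day 29.

feb = ceiling(sample(4 * 28.25, 100000, replace=TRUE) / 4)
mean(feb == 29)   # should be close to 1/113, about 0.0088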

To find duplicate birthdays in each row of the matrix, we'll write a function that compares the number of unique values against the length of the vector, calling it repeatedly in a for() loop until there is a difference. Then, to save (a lot of) computation, we break out of the loop and report the number of people needed to make the match. Finally, we call this function with apply() to run it on each row of the birthday matrix.

matchn = function(x) {
  for (i in 1:367) {
    if (length(unique(x[1:i])) != i) break
  }
  return(i)
}

groups = apply(mdmat, 1, matchn)
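
As an aside, R's duplicated() function offers a vectorized alternative to the loop (ours, not from the book): the position of the first TRUE is exactly the person who completes the first match, and with 367 people and only 366 possible birthdays a duplicate is guaranteed to exist.

matchn2 = function(x) which(duplicated(x))[1]
groups2 = apply(mdmat, 1, matchn2)
all.equal(groups, groups2)   # should be TRUE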

bdprobs = cumsum(table(groups)/10000)
# find the N with each group number, divide by number of groups
# and get the cumulative sum

rgroups = as.numeric(names(bdprobs))
# extract the group sizes from the table
probs = 1 - ((factorial(rgroups) * choose(365,rgroups)) / 365**rgroups)
# calculate the analytic answer, for any group size
# for which there was an observed simulated value

diffs = bdprobs - probs
diffpcts = diffs/probs

To plot the differences and percent differences in probabilities, we modify (slightly) the functions for a multiple-axis scatterplot we show in our book in section 5.6.1. You can find the code for this and all the book examples on the book web site.

addsecondy <- function(x, y, origy, yname="Y2") {
  prevlimits <- range(origy)
  axislimits <- range(y)
  axis(side=4, at=prevlimits[1] + diff(prevlimits)*c(0:5)/5,
       labels=round(axislimits[1] + diff(axislimits)*c(0:5)/5, 3))
  mtext(yname, side=4)
  newy <- (y-axislimits[1])/(diff(axislimits)/diff(prevlimits)) +
    prevlimits[1]
  points(x, newy, pch=2)
  abline(h=(-axislimits[1])/(diff(axislimits)/diff(prevlimits)) +
           prevlimits[1])
}

plottwoy <- function(x, y1, y2, xname="X", y1name="Y1", y2name="Y2") {
  plot(x, y1, ylab=y1name, xlab=xname)
  abline(h=0)
  addsecondy(x, y2, y1, yname=y2name)
}

plottwoy(rgroups, diffs, diffpcts, xname="Number in group",
         y1name="Diff in prob", y2name="Diff in percent")
legend(80, .0013, pch=1:2, legend=c("Diffs", "Pcts"))

The resulting plot (based on 100,000 groups, a tolerable compute time on a laptop) is shown at the top. Aligning the 0 on each axis was more of a hassle than it seemed worth today. However, the message is equally clear-- the observed birth distribution yields a clearly larger probability of a match, but not a meaningfully larger one.

Monday, February 13, 2012

RStudio in the cloud, for dummies

You can have your own cloud computing version of R, complete with RStudio. Why should you? It's cool! Plus, there's a lot more power out there than you can easily get on your own hardware. And, it's R in a web page. Run it from your tablet. Run it from work, even if you're not supposed to install software. Run it from your boyfriend's laptop while he's on a beer run.

This entry is largely made possible by the work of Louis Aslett, who's completing his doctoral work at Trinity College, University of Dublin. We had thought that running a cloud compute application was beyond our current technical abilities, but Louis' work makes it pretty easy to do. In this entry, we'll show you how. (Louis graciously vetted the text for this entry, but all errors are our responsibility.)

Start-up
1. Get an account with Amazon Web Services (AWS). This is slightly more involved than your ordinary Amazon account, but not a big deal. There are no fees unless you use the services. There's also a "free tier" which means that you start with 750 hours of usage per month for a year.* Effectively, they're giving you a free computer for a year.

2. Go to this handy page maintained by Louis. Click on the 32-bit link for your region. This is a shortcut that gets you the right AMI. (An AMI is an "Amazon Machine Image". Each of these is effectively an operating system with a bunch of pre-loaded software. The ones that Louis maintains have R and RStudio built into them, and have an additional feature we'll encounter later.) You can also find Louis' AMIs without his page.**

3. The shortcut from Louis' page brings you through the first steps of setting up. You'll next see a page in the "Request Instances Wizard". An "instance" consists of some virtual hardware for your chosen OS and software. It's effectively a computer in the cloud. The defaults on the wizard are fine, with one key exception, but we'll add a little detail about what they are.

a. Click Continue on that first page, as if you had reviewed the data. (You might notice that this is an Ubuntu OS. But if you're not a Linux user, don't fret-- you'll never really see which OS is running.)
b. On the "Instance details" page, you can also click continue. The main option here is choosing the virtual hardware the AMI will run on. (The defaults are fine.)
c. The next page is also "Instance details" and you can click through.
d. The next "Instance details" page lets you assign a name to the instance. This can be useful if you end up running several instances at the same time, but you can click through for now.
e. Click through the "Create Key Pair" page; this is also convenient if you're a heavy user, but not necessary.
f. The next step is to "Configure Firewall". This is where you do have to pay attention. Since you'll want to access your virtual machine via a browser, you need to allow HTTP access. To do this,

1) click "Create a New Security Group".
2) Give a name (like "RStudio") and
3) a description ("RStudio")-- both are required. Then,
4) in the "Port range" window, type 80. (Leave the source at 0.0.0.0/0, which means that you can connect from any IP address.)
5) Click "Add Rule". You should get a little blue box describing the rule. Now
6) click continue. On the following page, click "Back" and check that your new security group is selected.

g. Click "Launch". Your virtual machine is being started! There's a page with some links, which you can "Close".

Use
4. To use the new computer, click "Instances" on the "Navigation" panel of the AWS Management Console Amazon EC2 page. You'll see a row with an "empty" Name, and a State that is either "Pending" or "Running". (You might need to click refresh to see when it starts running.) When it's running, click on it. You get a bunch of information in the box below.

5. Scroll down to "Public DNS". Copy the DNS and paste it into the address bar in your browser. If all went well, you should see an RStudio login window. This is the genius of Louis' approach-- you never need to see the operating system. Use the username rstudio and password rstudio. In a moment, that beautiful, familiar RStudio interface appears!

6. For security, it makes sense to change your password. But since Louis wants to spare you the OS, he's cleverly built in a way to change it from within RStudio. Just change the "Password" in the "Welcome.r" file, then source it. You should probably avoid saving the "Welcome.r" file-- maybe just close it-- because saving it will result in your password being saved as plain text. Probably not a big risk, but why tempt fate?

7. You can close your browser and open the window again any time you like, from any browser you like, using your new password.

There's your R in the cloud! Use RStudio's built-in package installation tools to easily build your working environment.
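
Or install from the console as usual-- the package names here are just examples:

install.packages(c("mosaic", "lattice"))   # whatever your projects need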

Management
Our understanding of the "Free Usage Tier" is that you can leave this on all the time for a year without incurring any charges. Amazing. But caveat emptor.
You should also know how billing works. According to the FAQ, for instances other than the "micro" version we used here (or for "micro" instances after your "Free Usage Tier" period is over), you're billed an hourly rate from when the instance starts running until it's terminated. The "micro" Linux instance we chose above will cost $0.02/hour after the free period is over. Still cheap to use for a few hours, but at $0.02 x 24 x 30, roughly $14.40 a month, too costly to leave on all the time for fun.

However, there is also a "stopped" state. The stopped state is important for the other aspect of billing: data storage. Storage costs something like $0.10 per GB per month. When your instance is "stopped" you don't pay the hourly instance charge, but you still pay the monthly data storage charge. The "free usage tier" includes 30GB of storage for the first year. (Obviously, there are no charges when your instance is terminated, since you lose all stored data.) Louis' AMIs have only 2 GB of storage built in, so they will run cheaply once your free usage period is over-- on the order of $0.20 per month while stopped.

As long as an instance is running, you retain all aspects of your session-- it's just as if you had a computer that you left on running RStudio all the time. An instance that is stopped will retain all the loaded packages and local objects, but you have to log into AWS to start it.

Amazon warns that instances will occasionally fail, and if that happens, you're supposed to be able to restart them as if you had stopped them on purpose.*** But it might be a good idea to back things up.

Happy cloud computing!

* The free usage is limited to "micro" instances, such as we use here. For any other kind, the usual fees apply.

** To find the AMIs without Louis' handy web page, start the AWS management console, go to EC2, and click "Launch Instance". You'll get a page with some standard instances where you can click a radio button to "Launch Classic Wizard". Click that, then Community AMIs. Then search for rstudio, click the one with the RStudio and R version you want, and proceed as from step 3.

*** Louis says " I've not experienced it first-hand as they've been reliable for me, but apparently the instance will disappear and the hard drive will be left hanging round. When you are in the Amazon console on the EC2 tab, if you look further down the left "Navigation" you'll see "Volumes" under the "Elastic Block Store" section. You can look there when your instance is running and see its hard drive which will say "attached" -- this becomes "available" if an instance fails. So, you need to create a new instance and then attach the drive to it and reboot."

Friday, February 10, 2012

managing projects using RStudio

We're continually amazed by new developments within RStudio, the integrated development environment for R that we highlighted previously (among others, Andrew Gelman agrees with us about its value).

The most recent addition addresses one of our earlier concerns, by adding support for projects within RStudio. These allow work to be divided into multiple contexts, each with their own working directory, workspace, history, and source documents. For those multi-taskers amongst us, this is a big win. Projects can be created within a new or existing directory, as well as through use of a version control system (Git or Subversion).

When you create or move to a project within RStudio, lots of useful things happen:
(1) A new R session (process) is started.
(2) The .Rprofile file in the project's main directory (if any) is sourced by R (see the sketch just after this list).
(3) The .RData file in the project's main directory is loaded (this can be controlled by an option).
(4) The .Rhistory file in the project's main directory is loaded into the RStudio History pane (and used for Console Up/Down arrow command history).
(5) The current working directory is set to the project directory.
(6) Previously edited source documents are restored into editor tabs.
(7) Other RStudio settings (e.g., active tabs, splitter positions, etc.) are restored to where they were the last time the project was closed.
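
As a sketch of item (2), here is a minimal, hypothetical per-project .Rprofile-- the file name and sourcing behavior are RStudio's, but the contents below are entirely our invention, for illustration:

# sourced automatically each time this project is opened
options(digits = 4, scipen = 5)                        # project-wide options
if (interactive()) message("Project profile loaded")   # confirm on startup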

If you haven't updated your version of RStudio recently, or have never checked it out, this would be a great time to consider it. More information about these new features can be found here, along with an excellent screencast overview here. Happy coding!

Tuesday, February 7, 2012

Example 9.20: visualizing Simpson's paradox


Simpson's paradox is always amazing to explain to students: a relationship that holds within each of several groups can reverse when the groups are combined-- what's bad for one group and bad for another can look good for everyone, if you just collapse over the grouping variable. Unlike many mathematical paradoxes, it arises in a number of real-world settings.
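
Before turning to the real data, here's a minimal simulated sketch of the reversal (our illustration, with made-up numbers-- not from the SAT study): each group has a slope of about +1, but the group with the larger x values has a much lower intercept, so the pooled slope comes out negative.

set.seed(42)
x1 = rnorm(100, mean=0); y1 =  5 + x1 + rnorm(100, sd=0.5)   # group A
x2 = rnorm(100, mean=4); y2 = -5 + x2 + rnorm(100, sd=0.5)   # group B
x = c(x1, x2); y = c(y1, y2); group = rep(c("A", "B"), each=100)
coef(lm(y ~ x))                          # pooled slope: negative
coef(lm(y ~ x, subset = (group=="A")))   # within-group slope: about +1
coef(lm(y ~ x, subset = (group=="B")))   # within-group slope: about +1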

In this entry, we consider visualizing Simpson's paradox, using data from a study of average SAT scores, average teacher salaries and percentage of students taking the SATs at the state level (published by Deborah Lynn Guber, "Getting what you pay for: the debate over equity in public school expenditures" (1999), Journal of Statistics Education 7(2)).

R
The relevant data are available within the mosaic package.

> library(mosaic); trellis.par.set(theme=col.mosaic())
> head(SAT)
state expend ratio salary frac verbal math sat
1 Alabama 4.405 17.2 31.144 8 491 538 1029
2 Alaska 8.963 17.6 47.951 47 445 489 934
3 Arizona 4.778 19.3 32.175 27 448 496 944
4 Arkansas 4.459 17.1 28.934 6 482 523 1005
5 California 4.992 24.0 41.078 45 417 485 902
6 Colorado 5.443 18.4 34.571 29 462 518 980

The paradox manifests itself here if we ignore the fraction of students taking the SAT and instead consider the bivariate relationship between average teacher salary and average SAT score:

> lm(sat ~ salary, data=SAT)

Coefficients:
(Intercept) salary
1158.86 -5.54

Unfortunately, the relationship here appears to be negative: as salary increases, average SAT score is predicted to decrease.

Luckily for the educators in the audience, once the confounding (aka lurking) variable is accounted for, we see a statistically significant positive relationship:

> summary(lm(sat ~ salary + frac, data=SAT))

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 987.9005 31.8775 30.991 <2e-16 ***
salary 2.1804 1.0291 2.119 0.0394 *
frac -2.7787 0.2285 -12.163 4e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 33.69 on 47 degrees of freedom
Multiple R-squared: 0.8056, Adjusted R-squared: 0.7973
F-statistic: 97.36 on 2 and 47 DF, p-value: < 2.2e-16

That's all well and good, but how do we explain what is going on? We can start with a scatterplot, but unless the state names are plotted, it's hard to see what's happening (imagine the plot at the top of this entry without the text labels or the color shading).

It's straightforward to use the panel functionality within the lattice package to create a more useful plot, where state names are plotted rather than points, and different colors represent the low (red), medium (blue), and high (green) values of the fraction of students taking the SAT.

SAT$fracgrp = cut(SAT$frac, breaks=c(0, 22, 49, 81),
                  labels=c("low", "medium", "high"))
SAT$fraccount = 1 + as.numeric(SAT$fracgrp=="medium") +
  2*as.numeric(SAT$fracgrp=="high")
panel.labels = function(x, y, col='black', labels='x', ...) {
  panel.text(x, y, labels, col=col, ...)
}
xyplot(sat ~ salary, data=SAT, groups=fracgrp,
       cex=0.6, panel=panel.labels, labels=SAT$state,
       col=c('red','blue','green')[SAT$fraccount])

We see that within each group, the slope is positive (or at least non-negative), while overall there is a negative slope. Indeed, we see all of the hallmarks of a serious confounder in the correlation matrix for the n=50 observations:

> with(SAT, cor(cbind(sat, frac, salary)))
sat frac salary
sat 1.0000000 -0.8871187 -0.4398834
frac -0.8871187 1.0000000 0.6167799
salary -0.4398834 0.6167799 1.0000000

There's a strong negative association between SAT (Y) and fraction (Z), as well as a strong positive association between salary (X) and fraction (Z).
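
In fact, the sign flip can be written down exactly via the standard omitted-variable identity (our gloss, not part of the original analysis): the pooled salary slope equals the frac-adjusted salary slope plus the frac coefficient times the slope from regressing frac on salary.

both   = coef(lm(sat ~ salary + frac, data=SAT))
pooled = coef(lm(sat ~ salary, data=SAT))["salary"]
gamma  = coef(lm(frac ~ salary, data=SAT))["salary"]
pooled - (both["salary"] + both["frac"] * gamma)   # zero, up to rounding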

SAS

We begin by getting the data out of R. This is a snap, thanks to the proc_r macro we discussed here. Just read the macro in, tell it to grab the sat object on its way back from R, then write the R code that sets up the data set. This technique would be great for getting data from the HistData package, discussed here, into SAS.

%include "C:\ken\sasmacros\Proc_R.sas";
%Proc_R (SAS2R =, R2SAS = sat);
Cards4;

mynorm = rnorm(10)
mynorm
library(mosaic)
head(SAT)
sat = SAT

;;;;
%Quit;

We'll skip the regression (best accomplished with proc glm) and jump to the helpful plot. We'll need to categorize the fraction taking the test; a natural way to do this in SAS would be with formats, but we prefer to generate actual data as a more transparent process. To make the plot, we'll use the sgplot procedure. It typically allows less control than the gplot procedure, but the results here are quick and attractive. The group= and datalabel= options add the colors and legend, and the state names, respectively.

data sat2;
set sat;
length fracgrp $ 6;  * without this, the first assignment would set the
                       length to 3 and "Medium" would be truncated;
if frac le 22 then fracgrp = "Low";
else if frac le 49 then fracgrp = "Medium";
else if frac gt 49 then fracgrp = "High";
run;

proc sgplot data=sat2;
scatter x = salary y = sat / group=fracgrp datalabel=state;
run; quit;