Monday, July 27, 2009

Book now shipping from Amazon

Amazon now reports that the book is in stock! The current discount is 13%.

Or, order from the publisher. If you are an ASA member, you can use the online discount code 634LH to obtain a 15% discount.

Saturday, July 25, 2009

Example 7.7: Tabulate binomial probabilities

Suppose we wanted to assess the probability P(X=x) for a binomial random variate with n = 10 and with p = .81, .84, ..., .99. This could be helpful, for example, in various game settings.

In SAS, we find the probability that X=x using differences in the CDF calculated via the cdf function (1.10.1). We loop through the various binomial probabilities and values of x using the do ... end structure (section 1.11.1). Finally, we store the probabilities in implicitly named variables via an array statement (section 1.11.5).

SAS

data table (drop=j);
array probs [11] prob0 prob1 - prob10;
do p = .81 to .99 by .03;
do j = 0 to 10;
if j eq 1 then probs[j+1] = cdf("BINOMIAL", 0, p, 10);
else probs[j+1] = cdf("BINOMIAL", j, p, 10)
- cdf("BINOMIAL", j-1, p, 10);
end;
output;
end;
run;


The results are printed with two decimal places using the format statement (section 1.2.4). The options statement (section A.4) uses the ls option to narrow the output width to be compatible with Blogger.


options ls=64;
proc print data=table noobs;
var p prob0 prob1 - prob10;
format _numeric_ 3.2;
run;


And the results are:

p
p p p p p p p p p p r
r r r r r r r r r r o
o o o o o o o o o o b
b b b b b b b b b b 1
p 0 1 2 3 4 5 6 7 8 9 0
.81 .00 .00 .00 .00 .00 .02 .08 .19 .30 .29 .12
.84 .00 .00 .00 .00 .00 .01 .05 .15 .29 .33 .17
.87 .00 .00 .00 .00 .00 .00 .03 .10 .25 .37 .25
.90 .00 .00 .00 .00 .00 .00 .01 .06 .19 .39 .35
.93 .00 .00 .00 .00 .00 .00 .00 .02 .12 .36 .48
.96 .00 .00 .00 .00 .00 .00 .00 .01 .05 .28 .66
.99 .00 .00 .00 .00 .00 .00 .00 .00 .00 .09 .90


R
In R we start by making a vector of the binomial probabilities, using the : operator (section 1.11.3) to generate a sequence of integers. After creating a matrix (section B.4.4) to hold the table results, we loop (section 1.11.1) through the binomial probabilities, calling the dbinom() function (section 1.1) to find the probability that X=x. This calculation is nested within the round() function (section 1.2.5) to reduce the digits displayed. Finally, we glue the vector of binomial probabilities to the results.


p <- .78 + (3 * 1:7)/100
allprobs <- matrix(nrow=length(p), ncol=11)
for (i in 1:length(p)) {
allprobs[i,] <- round(dbinom(0:10, 10, p[i]),2)
}
table <- cbind(p, allprobs)
table


With the result:

p
[1,] 0.81 0 0 0 0 0 0.02 0.08 0.19 0.30 0.29 0.12
[2,] 0.84 0 0 0 0 0 0.01 0.05 0.15 0.29 0.33 0.17
[3,] 0.87 0 0 0 0 0 0.00 0.03 0.10 0.25 0.37 0.25
[4,] 0.90 0 0 0 0 0 0.00 0.01 0.06 0.19 0.39 0.35
[5,] 0.93 0 0 0 0 0 0.00 0.00 0.02 0.12 0.36 0.48
[6,] 0.96 0 0 0 0 0 0.00 0.00 0.01 0.05 0.28 0.66
[7,] 0.99 0 0 0 0 0 0.00 0.00 0.00 0.00 0.09 0.90


As with the example of jittered scatterplots for dichotomous y by continuous x, (Example 7.3, Example 7.4, and Example 7.5) we will revisit this example for improvement in later entries.

Monday, July 20, 2009

Example 7.6: Find Amazon sales rank for a book

In honor of Amazon's official release date for the book, we offer this blog entry.

Both SAS and R can be used to find the Amazon Sales Rank for a book by downloading the desired web page and ferreting out the appropriate line. This code is likely to break if Amazon’s page format is changed (but it worked as of October, 2010). [Note: as of spring 2010 Amazon changed the format for their webpages, and the appropriate text to search for changed from "Amazon.com Sales Rank" to "Amazon Bestsellers Rank". We've updated the blog code with this string. As of October 9, 2010 they added a number of blank lines to the web page, which we also now address.]

In this example, we find the sales rank for our book. Some interesting information about interpreting the rank can be found here or here.

Both SAS and R code below rely on section 1.1.3, ”Reading more complex text files.” Note that in the displayed SAS and R code, the long URL has been broken onto several lines, while it would have to be entered on a single line to run correctly.



In SAS, we assign the URL an internal name (section 1.1.6), then input the file using a data step. We exclude all the lines which don’t contain the sales rank, using the count function (section 1.4.6). We then extract the number using the substr function (section 1.4.3), with the find function (section 1.4.6) employed to locate the number within the line. The last step is to turn the extracted text (which contains a comma) into a numeric variable.

SAS

filename amazon url "http://www.amazon.com/
SAS-Management-Statistical-Analysis-Graphics/
dp/1420070576/ref=sr_1_1?ie=UTF8&s=books
&qid=1242233418&sr=8-1";

data test;
infile amazon truncover;
input @1 line $256.;
if count(line, "Amazon Bestsellers Rank") ne 0;
rankchar = substr(line, find(line, "#")+1,
find(line, "in Books") - find(line, "#") - 2);
rank = input(rankchar, comma9.);
run;

proc print data=test noobs;
var rank;
run;



R

# grab contents of web page
urlcontents <- readLines("http://www.amazon.com/
SAS-Management-Statistical-Analysis-Graphics/
dp/1420070576/ref=sr_1_1?ie=UTF8&s=books
&qid=1242233418&sr=8-1")
# find line with sales rank
linenum <- suppressWarnings(grep("Amazon Bestsellers Rank:",
urlcontents))

newline = linenum + 1 # work around October 2010 blank spaces
while (urlcontents[newline] == "") {
newline = newline + 1
}

# split line into multiple elements
linevals <- strsplit(urlcontents[newline], ' ')[[1]]

# find element with sales rank number
entry <- grep("#", linevals)
# snag that entry
charrank <- linevals[entry]
# kill '#' at start
charrank <- substr(charrank, 2, nchar(charrank))
# remove commas
charrank <- gsub(',','', charrank)
# turn it into a numeric opject
salesrank <- as.numeric(charrank)
cat("salesrank=",salesrank,"\n")


The resulting output (on July 16, 2009) is

SAS

rank

23476


R

salesrank= 23467

Saturday, July 18, 2009

Book excerpts now posted

We've posted excerpts from the book on the book website. The excerpts include Chapter 3 (regression and ANOVA) in its entirety. This demonstrates how the entries (the generic descriptions of software functions) and the worked examples reinforce each other.

We've also posted selected entries from each of the other chapters, describing probability distributions and random variables, two sample comparisons (T-tests, Wilcoxon rank-sum test, Kolmogorov-Smirnov test, median test), linear mixed models for correlated data (including general correlation, random intercepts and slopes, and more complex structures), graphics (scatterplots, histograms, saving plots as PDF), and power calculations (analytic and Monte Carlo).

Wednesday, July 15, 2009

Example 7.5: Replicating a prettier jittered scatterplot

The scatterplot in section 7.4 is a plot we could use repeatedly. We demonstrate how to create a macro (SAS, section A.8) and a function (R, section B.5) to do it more easily.

SAS

%macro logiplot(x=x, y=y, data=, jitterwidth=.05, smooth=50);
data lp1;
set &data;
if &y eq 0 or &y eq 1;
jitter = uniform(0) * &jitterwidth;
if &y eq 1 then yplot = &y - jitter;
else if &y eq 0 then yplot = &y + jitter;
run;

axis1 minor=none label = ("&x");
axis2 minor=none label = (angle = 270 rotate = 90 "&y");
symbol1 i=sm&smooth.s v=none c=blue;
symbol2 i=none v=dot h=.2 c=blue;
proc gplot data=lp1;
plot (&y yplot) * &x / overlay haxis=axis1 vaxis=axis2;
run;
quit;


R

logiplot <- function(x, y, jitamount=.01, smoothspan=2/3,
xlabel="X label", ylabel="Y label") {
jittery <- jitter(y, amount=jitamount/2)
correction <- ifelse(y==0, jitamount/2, -jitamount/2)
jittery <- jittery + correction
plot(x, y, type="n", xlab=xlabel, ylab=ylabel)
points(x, jittery, pch=20, col="blue")
lines(lowess(x, y, f=smoothspan), col="blue")
}


We’ll load the example data set from the book via the web (section 1.1.6), then make a plot of the real data.

SAS

filename myurl
url 'http://www.math.smith.edu/sasr/datasets/help.csv' lrecl=704;

proc import
datafile=myurl
out=ds dbms=dlm;
delimiter=',';
getnames=yes;
run;

%logiplot(x = mcs, y = homeless, data=ds, smooth=40);


R

ds <- read.csv("http://www.math.smith.edu/sasr/datasets/help.csv")
logiplot(ds$mcs, ds$homeless, jitamount=.05, smoothspan=.3,
xlabel="mcs", ylabel="homeless")


The resulting plots are quite similar, but still differ with respect to the smoother and the exact jitter applied to each point.



(Click for a bigger picture.)

Thursday, July 2, 2009

Example 7.4: A prettier jittered scatterplot

The plot in section 7.3 has some problems. At the very least, the jittered values ought to be between 0 and 1, so the smoothed lines fit better with them. Once again we use the data generated in section 7.2 as an example. For both SAS and R, we use conditioning (section 1.11.2) to make the jitter happen within the 0-1 range.

SAS
In SAS, we use axis statements (section 5.3.8) to clean up the axis tick marks and labels.

data lp1;
set test;
jitter = uniform(0) * 0.075;
if ytest eq 1 then yplot = ytest - jitter;
else if ytest eq 0 then yplot = ytest + jitter;
run;

axis1 minor = none label = ("xtest");
axis2 minor = none label = (angle=270 rotate=90 "ytest");
symbol1 i=sm50s v=none c = blue;
symbol2 i=none v=dot h = .2 c = blue;
proc gplot data = lp1;
plot (ytest yplot) * xtest / overlay haxis=axis1 vaxis=axis2;
run;
quit;


And the resulting plot is:




















R


In R, we add a label to the y axis with the ylab option (section 5.3.8). We also modify the smoother to be a little less responsive to the data (by using a wider window, see section 5.2.6).


jittery <- jitter(ytest, amount=.0375)
correction <- ifelse(ytest==0, .0375, -.0375)
jittery <- jittery + correction
plot(xtest, ytest, type="n")
points(xtest, jittery, pch = 20, col = "blue", ylab = "ytest")
lines(lowess(xtest, ytest, f = .4), col = "blue")



And the resulting plot is:
















As with the uglier version shown in example 7.3, the differences between the two plots results from there being different randomly-generated data sets and because we use two different smoothers.

In the next example, we'll show how to make a SAS Macro or an R function to replicate this plot easily.