Or, order from the publisher. If you are an ASA member, you can use the online discount code 634LH to obtain a 15% discount.
Monday, July 27, 2009
Book now shipping from Amazon
Or, order from the publisher. If you are an ASA member, you can use the online discount code 634LH to obtain a 15% discount.
Saturday, July 25, 2009
Example 7.7: Tabulate binomial probabilities
In SAS, we find the probability that X=x using differences in the CDF calculated via the cdf function (1.10.1). We loop through the various binomial probabilities and values of x using the do ... end structure (section 1.11.1). Finally, we store the probabilities in implicitly named variables via an array statement (section 1.11.5).
SAS
data table (drop=j);
array probs [11] prob0 prob1 - prob10;
do p = .81 to .99 by .03;
do j = 0 to 10;
if j eq 1 then probs[j+1] = cdf("BINOMIAL", 0, p, 10);
else probs[j+1] = cdf("BINOMIAL", j, p, 10)
- cdf("BINOMIAL", j-1, p, 10);
end;
output;
end;
run;
The results are printed with two decimal places using the format statement (section 1.2.4). The options statement (section A.4) uses the ls option to narrow the output width to be compatible with Blogger.
options ls=64;
proc print data=table noobs;
var p prob0 prob1 - prob10;
format _numeric_ 3.2;
run;
And the results are:
p
p p p p p p p p p p r
r r r r r r r r r r o
o o o o o o o o o o b
b b b b b b b b b b 1
p 0 1 2 3 4 5 6 7 8 9 0
.81 .00 .00 .00 .00 .00 .02 .08 .19 .30 .29 .12
.84 .00 .00 .00 .00 .00 .01 .05 .15 .29 .33 .17
.87 .00 .00 .00 .00 .00 .00 .03 .10 .25 .37 .25
.90 .00 .00 .00 .00 .00 .00 .01 .06 .19 .39 .35
.93 .00 .00 .00 .00 .00 .00 .00 .02 .12 .36 .48
.96 .00 .00 .00 .00 .00 .00 .00 .01 .05 .28 .66
.99 .00 .00 .00 .00 .00 .00 .00 .00 .00 .09 .90
R
In R we start by making a vector of the binomial probabilities, using the : operator (section 1.11.3) to generate a sequence of integers. After creating a matrix (section B.4.4) to hold the table results, we loop (section 1.11.1) through the binomial probabilities, calling the dbinom() function (section 1.1) to find the probability that X=x. This calculation is nested within the round() function (section 1.2.5) to reduce the digits displayed. Finally, we glue the vector of binomial probabilities to the results.
p <- .78 + (3 * 1:7)/100
allprobs <- matrix(nrow=length(p), ncol=11)
for (i in 1:length(p)) {
allprobs[i,] <- round(dbinom(0:10, 10, p[i]),2)
}
table <- cbind(p, allprobs)
table
With the result:
p
[1,] 0.81 0 0 0 0 0 0.02 0.08 0.19 0.30 0.29 0.12
[2,] 0.84 0 0 0 0 0 0.01 0.05 0.15 0.29 0.33 0.17
[3,] 0.87 0 0 0 0 0 0.00 0.03 0.10 0.25 0.37 0.25
[4,] 0.90 0 0 0 0 0 0.00 0.01 0.06 0.19 0.39 0.35
[5,] 0.93 0 0 0 0 0 0.00 0.00 0.02 0.12 0.36 0.48
[6,] 0.96 0 0 0 0 0 0.00 0.00 0.01 0.05 0.28 0.66
[7,] 0.99 0 0 0 0 0 0.00 0.00 0.00 0.00 0.09 0.90
As with the example of jittered scatterplots for dichotomous y by continuous x, (Example 7.3, Example 7.4, and Example 7.5) we will revisit this example for improvement in later entries.
Monday, July 20, 2009
Example 7.6: Find Amazon sales rank for a book
Both SAS and R can be used to find the Amazon Sales Rank for a book by downloading the desired web page and ferreting out the appropriate line. This code is likely to break if Amazon’s page format is changed (but it worked as of October, 2010). [Note: as of spring 2010 Amazon changed the format for their webpages, and the appropriate text to search for changed from "Amazon.com Sales Rank" to "Amazon Bestsellers Rank". We've updated the blog code with this string. As of October 9, 2010 they added a number of blank lines to the web page, which we also now address.]
In this example, we find the sales rank for our book. Some interesting information about interpreting the rank can be found here or here.
Both SAS and R code below rely on section 1.1.3, ”Reading more complex text files.” Note that in the displayed SAS and R code, the long URL has been broken onto several lines, while it would have to be entered on a single line to run correctly.
In SAS, we assign the URL an internal name (section 1.1.6), then input the file using a data step. We exclude all the lines which don’t contain the sales rank, using the count function (section 1.4.6). We then extract the number using the substr function (section 1.4.3), with the find function (section 1.4.6) employed to locate the number within the line. The last step is to turn the extracted text (which contains a comma) into a numeric variable.
SAS
filename amazon url "http://www.amazon.com/
SAS-Management-Statistical-Analysis-Graphics/
dp/1420070576/ref=sr_1_1?ie=UTF8&s=books
&qid=1242233418&sr=8-1";
data test;
infile amazon truncover;
input @1 line $256.;
if count(line, "Amazon Bestsellers Rank") ne 0;
rankchar = substr(line, find(line, "#")+1,
find(line, "in Books") - find(line, "#") - 2);
rank = input(rankchar, comma9.);
run;
proc print data=test noobs;
var rank;
run;
R
# grab contents of web page
urlcontents <- readLines("http://www.amazon.com/
SAS-Management-Statistical-Analysis-Graphics/
dp/1420070576/ref=sr_1_1?ie=UTF8&s=books
&qid=1242233418&sr=8-1")
# find line with sales rank
linenum <- suppressWarnings(grep("Amazon Bestsellers Rank:",
urlcontents))
newline = linenum + 1 # work around October 2010 blank spaces
while (urlcontents[newline] == "") {
newline = newline + 1
}
# split line into multiple elements
linevals <- strsplit(urlcontents[newline], ' ')[[1]]
# find element with sales rank number
entry <- grep("#", linevals)
# snag that entry
charrank <- linevals[entry]
# kill '#' at start
charrank <- substr(charrank, 2, nchar(charrank))
# remove commas
charrank <- gsub(',','', charrank)
# turn it into a numeric opject
salesrank <- as.numeric(charrank)
cat("salesrank=",salesrank,"\n")
The resulting output (on July 16, 2009) is
SAS
rank
23476
R
salesrank= 23467
Saturday, July 18, 2009
Book excerpts now posted
We've also posted selected entries from each of the other chapters, describing probability distributions and random variables, two sample comparisons (T-tests, Wilcoxon rank-sum test, Kolmogorov-Smirnov test, median test), linear mixed models for correlated data (including general correlation, random intercepts and slopes, and more complex structures), graphics (scatterplots, histograms, saving plots as PDF), and power calculations (analytic and Monte Carlo).
Wednesday, July 15, 2009
Example 7.5: Replicating a prettier jittered scatterplot
SAS
%macro logiplot(x=x, y=y, data=, jitterwidth=.05, smooth=50);
data lp1;
set &data;
if &y eq 0 or &y eq 1;
jitter = uniform(0) * &jitterwidth;
if &y eq 1 then yplot = &y - jitter;
else if &y eq 0 then yplot = &y + jitter;
run;
axis1 minor=none label = ("&x");
axis2 minor=none label = (angle = 270 rotate = 90 "&y");
symbol1 i=sm&smooth.s v=none c=blue;
symbol2 i=none v=dot h=.2 c=blue;
proc gplot data=lp1;
plot (&y yplot) * &x / overlay haxis=axis1 vaxis=axis2;
run;
quit;
R
logiplot <- function(x, y, jitamount=.01, smoothspan=2/3,
xlabel="X label", ylabel="Y label") {
jittery <- jitter(y, amount=jitamount/2)
correction <- ifelse(y==0, jitamount/2, -jitamount/2)
jittery <- jittery + correction
plot(x, y, type="n", xlab=xlabel, ylab=ylabel)
points(x, jittery, pch=20, col="blue")
lines(lowess(x, y, f=smoothspan), col="blue")
}
We’ll load the example data set from the book via the web (section 1.1.6), then make a plot of the real data.
SAS
filename myurl
url 'http://www.math.smith.edu/sasr/datasets/help.csv' lrecl=704;
proc import
datafile=myurl
out=ds dbms=dlm;
delimiter=',';
getnames=yes;
run;
%logiplot(x = mcs, y = homeless, data=ds, smooth=40);
R
ds <- read.csv("http://www.math.smith.edu/sasr/datasets/help.csv")
logiplot(ds$mcs, ds$homeless, jitamount=.05, smoothspan=.3,
xlabel="mcs", ylabel="homeless")
The resulting plots are quite similar, but still differ with respect to the smoother and the exact jitter applied to each point.

(Click for a bigger picture.)
Labels:
R function,
SAS macro
0
comments
Thursday, July 2, 2009
Example 7.4: A prettier jittered scatterplot
SAS
In SAS, we use axis statements (section 5.3.8) to clean up the axis tick marks and labels.
data lp1;
set test;
jitter = uniform(0) * 0.075;
if ytest eq 1 then yplot = ytest - jitter;
else if ytest eq 0 then yplot = ytest + jitter;
run;
axis1 minor = none label = ("xtest");
axis2 minor = none label = (angle=270 rotate=90 "ytest");
symbol1 i=sm50s v=none c = blue;
symbol2 i=none v=dot h = .2 c = blue;
proc gplot data = lp1;
plot (ytest yplot) * xtest / overlay haxis=axis1 vaxis=axis2;
run;
quit;
And the resulting plot is:

R
In R, we add a label to the y axis with the ylab option (section 5.3.8). We also modify the smoother to be a little less responsive to the data (by using a wider window, see section 5.2.6).
jittery <- jitter(ytest, amount=.0375)
correction <- ifelse(ytest==0, .0375, -.0375)
jittery <- jittery + correction
plot(xtest, ytest, type="n")
points(xtest, jittery, pch = 20, col = "blue", ylab = "ytest")
lines(lowess(xtest, ytest, f = .4), col = "blue")
And the resulting plot is:

As with the uglier version shown in example 7.3, the differences between the two plots results from there being different randomly-generated data sets and because we use two different smoothers.
In the next example, we'll show how to make a SAS Macro or an R function to replicate this plot easily.
Labels:
conditioning,
graphics,
R,
SAS
1 comments
Subscribe to:
Posts (Atom)