SAS and R: July 2009

Monday, July 27, 2009

Book now shipping from Amazon

Amazon now reports that the book is in stock! The current discount is 13%.

Or, order from the publisher. If you are an ASA member, you can use the online discount code 634LH to obtain a 15% discount.

Saturday, July 25, 2009

Example 7.7: Tabulate binomial probabilities

Suppose we wanted to assess the probability P(X=x) for a binomial random variate with n = 10 and with p = .81, .84, ..., .99. This could be helpful, for example, in various game settings.

In SAS, we ﬁnd the probability that X=x using differences in the CDF calculated via the cdf function (1.10.1). We loop through the various binomial probabilities and values of x using the do ... end structure (section 1.11.1). Finally, we store the probabilities in implicitly named variables via an array statement (section 1.11.5).

SAS


data table (drop=j);
array probs [11] prob0 prob1 - prob10;
do p = .81 to .99 by .03;
  do j = 0 to 10;
    if j eq 1 then probs[j+1] = cdf("BINOMIAL", 0, p, 10);
    else probs[j+1] = cdf("BINOMIAL", j, p, 10) 
                         - cdf("BINOMIAL", j-1, p, 10);
    end;
  output;
  end;
run;

The results are printed with two decimal places using the format statement (section 1.2.4). The options statement (section A.4) uses the ls option to narrow the output width to be compatible with Blogger.


options ls=64;
proc print data=table noobs;
   var p prob0 prob1 - prob10;
   format _numeric_ 3.2; 
run;

And the results are:


                                                         p
       p    p    p    p    p    p    p    p    p    p    r
       r    r    r    r    r    r    r    r    r    r    o
       o    o    o    o    o    o    o    o    o    o    b
       b    b    b    b    b    b    b    b    b    b    1
  p    0    1    2    3    4    5    6    7    8    9    0
.81  .00  .00  .00  .00  .00  .02  .08  .19  .30  .29  .12
.84  .00  .00  .00  .00  .00  .01  .05  .15  .29  .33  .17
.87  .00  .00  .00  .00  .00  .00  .03  .10  .25  .37  .25
.90  .00  .00  .00  .00  .00  .00  .01  .06  .19  .39  .35
.93  .00  .00  .00  .00  .00  .00  .00  .02  .12  .36  .48
.96  .00  .00  .00  .00  .00  .00  .00  .01  .05  .28  .66
.99  .00  .00  .00  .00  .00  .00  .00  .00  .00  .09  .90

R
In R we start by making a vector of the binomial probabilities, using the : operator (section 1.11.3) to generate a sequence of integers. After creating a matrix (section B.4.4) to hold the table results, we loop (section 1.11.1) through the binomial probabilities, calling the dbinom() function (section 1.1) to ﬁnd the probability that X=x. This calculation is nested within the round() function (section 1.2.5) to reduce the digits displayed. Finally, we glue the vector of binomial probabilities to the results.


p <- .78 + (3 * 1:7)/100
allprobs <- matrix(nrow=length(p), ncol=11)
for (i in 1:length(p)) {
   allprobs[i,] <- round(dbinom(0:10, 10, p[i]),2)
}
table <- cbind(p, allprobs)
table

With the result:


        p                                        
[1,] 0.81 0 0 0 0 0 0.02 0.08 0.19 0.30 0.29 0.12
[2,] 0.84 0 0 0 0 0 0.01 0.05 0.15 0.29 0.33 0.17
[3,] 0.87 0 0 0 0 0 0.00 0.03 0.10 0.25 0.37 0.25
[4,] 0.90 0 0 0 0 0 0.00 0.01 0.06 0.19 0.39 0.35
[5,] 0.93 0 0 0 0 0 0.00 0.00 0.02 0.12 0.36 0.48
[6,] 0.96 0 0 0 0 0 0.00 0.00 0.01 0.05 0.28 0.66
[7,] 0.99 0 0 0 0 0 0.00 0.00 0.00 0.00 0.09 0.90

As with the example of jittered scatterplots for dichotomous y by continuous x, (Example 7.3, Example 7.4, and Example 7.5) we will revisit this example for improvement in later entries.

Monday, July 20, 2009

Example 7.6: Find Amazon sales rank for a book

In honor of Amazon's official release date for the book, we offer this blog entry.

Both SAS and R can be used to find the Amazon Sales Rank for a book by downloading the desired web page and ferreting out the appropriate line. This code is likely to break if Amazon’s page format is changed (but it worked as of October, 2010). [Note: as of spring 2010 Amazon changed the format for their webpages, and the appropriate text to search for changed from "Amazon.com Sales Rank" to "Amazon Bestsellers Rank". We've updated the blog code with this string. As of October 9, 2010 they added a number of blank lines to the web page, which we also now address.]

In this example, we find the sales rank for our book. Some interesting information about interpreting the rank can be found here or here.

Both SAS and R code below rely on section 1.1.3, ”Reading more complex text ﬁles.” Note that in the displayed SAS and R code, the long URL has been broken onto several lines, while it would have to be entered on a single line to run correctly.

In SAS, we assign the URL an internal name (section 1.1.6), then input the ﬁle using a data step. We exclude all the lines which don’t contain the sales rank, using the count function (section 1.4.6). We then extract the number using the substr function (section 1.4.3), with the find function (section 1.4.6) employed to locate the number within the line. The last step is to turn the extracted text (which contains a comma) into a numeric variable.

SAS


filename amazon url "http://www.amazon.com/
         SAS-Management-Statistical-Analysis-Graphics/
         dp/1420070576/ref=sr_1_1?ie=UTF8&s=books
         &qid=1242233418&sr=8-1";

data test;
infile amazon truncover;
input @1 line $256.;
   if count(line, "Amazon Bestsellers Rank") ne 0;
   rankchar = substr(line, find(line, "#")+1, 
        find(line, "in Books") - find(line, "#") - 2);
   rank = input(rankchar, comma9.);
run;

proc print data=test noobs; 
   var rank;
run;


# grab contents of web page
urlcontents <- readLines("http://www.amazon.com/
           SAS-Management-Statistical-Analysis-Graphics/
           dp/1420070576/ref=sr_1_1?ie=UTF8&s=books
           &qid=1242233418&sr=8-1")
# find line with sales rank
linenum <- suppressWarnings(grep("Amazon Bestsellers Rank:",
           urlcontents))

newline = linenum + 1    # work around October 2010 blank spaces
while (urlcontents[newline] == "") {
   newline = newline + 1
}

# split line into multiple elements
linevals <- strsplit(urlcontents[newline], ' ')[[1]]

# find element with sales rank number
entry <- grep("#", linevals)   
# snag that entry
charrank <- linevals[entry]
# kill '#' at start
charrank <- substr(charrank, 2, nchar(charrank))  
# remove commas
charrank <- gsub(',','', charrank) 
# turn it into a numeric opject
salesrank <- as.numeric(charrank)
cat("salesrank=",salesrank,"\n")

The resulting output (on July 16, 2009) is

SAS


                      rank
       
                      23476


salesrank= 23467

Saturday, July 18, 2009

Book excerpts now posted

We've posted excerpts from the book on the book website. The excerpts include Chapter 3 (regression and ANOVA) in its entirety. This demonstrates how the entries (the generic descriptions of software functions) and the worked examples reinforce each other.

We've also posted selected entries from each of the other chapters, describing probability distributions and random variables, two sample comparisons (T-tests, Wilcoxon rank-sum test, Kolmogorov-Smirnov test, median test), linear mixed models for correlated data (including general correlation, random intercepts and slopes, and more complex structures), graphics (scatterplots, histograms, saving plots as PDF), and power calculations (analytic and Monte Carlo).

Wednesday, July 15, 2009

Example 7.5: Replicating a prettier jittered scatterplot

The scatterplot in section 7.4 is a plot we could use repeatedly. We demonstrate how to create a macro (SAS, section A.8) and a function (R, section B.5) to do it more easily.

SAS


%macro logiplot(x=x, y=y, data=, jitterwidth=.05, smooth=50);
data lp1;
set &data;
if &y eq 0 or &y eq 1;
jitter = uniform(0) * &jitterwidth;
if &y eq 1 then yplot = &y - jitter;
else if &y eq 0 then yplot = &y + jitter;
run;

axis1 minor=none label = ("&x");
axis2 minor=none label = (angle = 270 rotate = 90 "&y");
symbol1 i=sm&smooth.s v=none c=blue;
symbol2 i=none v=dot h=.2 c=blue;
proc gplot data=lp1;
plot (&y yplot) * &x / overlay haxis=axis1 vaxis=axis2;
run;
quit;


logiplot <- function(x, y, jitamount=.01, smoothspan=2/3, 
   xlabel="X label", ylabel="Y label") {
   jittery <- jitter(y, amount=jitamount/2)
   correction <- ifelse(y==0, jitamount/2, -jitamount/2)
   jittery <- jittery + correction
   plot(x, y, type="n", xlab=xlabel, ylab=ylabel)
   points(x, jittery, pch=20, col="blue")
   lines(lowess(x, y, f=smoothspan), col="blue")
}

We’ll load the example data set from the book via the web (section 1.1.6), then make a plot of the real data.

SAS


filename myurl 
   url 'http://www.math.smith.edu/sasr/datasets/help.csv'  lrecl=704;

proc import
   datafile=myurl
   out=ds dbms=dlm;
   delimiter=',';
   getnames=yes;
run;

%logiplot(x = mcs, y = homeless, data=ds, smooth=40);


ds <- read.csv("http://www.math.smith.edu/sasr/datasets/help.csv")
logiplot(ds$mcs, ds$homeless, jitamount=.05, smoothspan=.3, 
   xlabel="mcs", ylabel="homeless")

The resulting plots are quite similar, but still differ with respect to the smoother and the exact jitter applied to each point.

(Click for a bigger picture.)

Thursday, July 2, 2009

Example 7.4: A prettier jittered scatterplot

The plot in section 7.3 has some problems. At the very least, the jittered values ought to be between 0 and 1, so the smoothed lines ﬁt better with them. Once again we use the data generated in section 7.2 as an example. For both SAS and R, we use conditioning (section 1.11.2) to make the jitter happen within the 0-1 range.

SAS
In SAS, we use axis statements (section 5.3.8) to clean up the axis tick marks and labels.

data lp1;
set test;
jitter = uniform(0) * 0.075;
if ytest eq 1 then yplot = ytest - jitter;
else if ytest eq 0 then yplot = ytest + jitter;
run;

axis1 minor = none label = ("xtest");
axis2 minor = none label = (angle=270 rotate=90 "ytest");
symbol1 i=sm50s v=none c =  blue;
symbol2 i=none v=dot h = .2 c = blue;
proc gplot data = lp1;
plot (ytest yplot) * xtest / overlay haxis=axis1 vaxis=axis2;
run;
quit;

And the resulting plot is:

R

In R, we add a label to the y axis with the ylab option (section 5.3.8). We also modify the smoother to be a little less responsive to the data (by using a wider window, see section 5.2.6).


jittery <- jitter(ytest, amount=.0375)
correction <- ifelse(ytest==0, .0375, -.0375)
jittery <- jittery + correction
plot(xtest, ytest, type="n")
points(xtest, jittery, pch = 20, col = "blue", ylab = "ytest")
lines(lowess(xtest, ytest, f = .4), col = "blue")

And the resulting plot is:

As with the uglier version shown in example 7.3, the differences between the two plots results from there being different randomly-generated data sets and because we use two different smoothers.

In the next example, we'll show how to make a SAS Macro or an R function to replicate this plot easily.

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

SAS and R

Catalogs of posts