SAS and R: apply family of functions

Monday, December 10, 2012

Example 10.8: The upper 95% CI is 3.69

Apologies for the long and unannounced break-- the longest since we started blogging, three and a half years ago. I was writing a 2-day course for SAS users to learn R. Contact me if you're interested. And Nick and I are beginning work on the second edition of our book-- look for it in the fall. Please let us know if you have ideas about what we omitted last time or would otherwise like to see added. In the mean time, we'll keep blogging, though likely at a reduced rate.

Today: what can you say about the probability of an event if the observed number of events is 0? It turns out that the upper 95% CI for the probability is 3.69/N. There's a sweet little paper with some rationale for this, but it's in my other office. And I couldn't recall the precise value-- so I used SAS and R to demonstrate it to myself.

R

The R code is remarkably concise. After generating some Ns, we write a little function to perform the test and extract the (exact) upper 95% confidence limit. This is facilitated by the "..." notation, which passes along unused arguments to functions. Then we use apply() to call the new function for each N, passing the numerator 0 each time. Note that apply() needs a matrix argument, so the simple vector of Ns is converted to a matrix before use. [The sapply() function will accept a vector input, but took about 8 times as long to run.] Finally, we plot the upper limit * N against N. showing the asymptote. A log scaled x-axis is useful here, and is achieved with the log='x' option. (Section 5.3.12.) the result is shown above.

bin.m = seq(10, 10000, by=5)
mybt = function(...) { binom.test(...)$conf.int[2] }
uci = apply(as.matrix(bin.m), 1, mybt, x=0)
plot(y=bin.m * uci, x=bin.m, ylim=c(0,4), type="l", 
     lwd=5, col="red", cex=5, log='x',  
     ylab="Exact upper CI", xlab="Sample size", 
     main="Upper CI when there are 0 cases observed")
abline(h=3.69)

SAS

In SAS, the data, really just the N and a numerator of 0, are generated in a data step. The CI are found using the binomial option in the proc freq tables statement and saved using the output statement. Note that the weight statement is used here to avoid having a row for each Bernoulli trial.

data binm;
do n = 10 to 10000 by 5;
  x=0;
  output;
  end;
run;

ods select none;
proc freq data=binm;
by n;
weight n;
tables x / binomial;
output out=bp binomial;
run;
ods select all;

To calculate the upper limit*N, another data step is needed-- note that in this setting SAS will only produce the lower limit against the probability that all observations share the same value, thus the subtraction from 1 shown below. The log scale x-axis is obtained with the logbase option to the axis statement. (Section 5.3.12.) The result is shown below.

data uci;
set bp;
limit = (1-xl_bin) * n;
run;

axis1 order = (0 to 4 by 1);
axis2 logbase=10 logstyle=expand;
symbol1 i = j v = none c = red w=5 l=1;
proc gplot data=uci;
plot limit * n / vref=3.69 vaxis=axis1 haxis=axis2;
label n="Sample size" limit="Exact upper CI";
run;
quit;

It's clear that the upper 95% limit on the number of successes asymptotes to about 3.69. Thus the upper limit on the binomial probability p is 3.69/N.

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Monday, April 30, 2012

Example 9.29: the perils of for loops

A recent exchange on the R-sig-teaching list featured a discussion of how best to teach new students R. The initial post included an exercise to write a function, that given a n, will draw n rows of a triangle made up of "*", noting that for a beginner, this may require two for loops. For example, in pseudo-code:


for i = 1 to n
  for j = 1 to i
    print "*"

Unfortunately, as several folks (including Richard M. Heiberger and R. Michael Weylandt) noted, for loops in general are not the best way to take full advantage of R. In this entry, we review two solutions they proposed which fit within the R philosophy.

Richard's solution uses the outer() function to generate a 5x5 matrix of logical values indicating whether the column number is bigger than the row number. Next the ifelse() function is used to replace TRUE with *.


> ifelse(outer(1:5, 1:5, `>=`), "*", " ")  
     [,1] [,2] [,3] [,4] [,5]
[1,] "*"  " "  " "  " "  " " 
[2,] "*"  "*"  " "  " "  " " 
[3,] "*"  "*"  "*"  " "  " " 
[4,] "*"  "*"  "*"  "*"  " " 
[5,] "*"  "*"  "*"  "*"  "*"

Michael's solution uses the lapply() function to call a function repeatedly for different values of n. This returns a list rather than a matrix, but accomplishes the same task.


> lapply(1:5, function(x) cat(rep("*", x), "\n"))
* 
* * 
* * * 
* * * * 
* * * * *

While this exercise is of little practical value, it does illustrate some important points, and provides a far more efficient as well as elegant way of accomplishing the tasks. For those interested in more, another resource is the R Inferno project of Patric Burns.

SAS
We demonstrate a SAS data step solution mainly to call out some useful features and cautions. In all likelihood a proc iml matrix-based solution would be more elegant;


data test;
array star [5] $ star1 - star5;
do i = 1 to 5;
  star[i] = "*";
  output;
  end;
run;

proc print noobs; var star1 - star5; run;

             star1    star2    star3    star4    star5

               *
               *        *
               *        *        *
               *        *        *        *
               *        *        *        *        *

In particular, note the $ in the array statement, which allows the variables to contain characters; by default variables created by an array statement are numeric. In addition, note the reference to a sequentially suffixed list of variables using the single hyphen shortcut; this would help in generalizing to n rows. Finally, note that we were able to avoid a second do loop (SAS' primary iterative looping syntax) mainly by luck-- the most recently generated value of a variable is saved by default. This can cause trouble, in general, but here it keeps all the previous "*"s when moving on to the next row.

An unrelated note about aggregatorsWe love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers and PROC-X with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

SAS and R

Catalogs of posts

Monday, December 10, 2012

Example 10.8: The upper 95% CI is 3.69

Monday, April 30, 2012

Example 9.29: the perils of for loops

About SAS and R

Topics discussed