Today: what can you say about the probability of an event if the observed number of events is 0? It turns out that the upper 95% CI for the probability is 3.69/N. There's a sweet little paper with some rationale for this, but it's in my other office. And I couldn't recall the precise value-- so I used SAS and R to demonstrate it to myself.
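Where does 3.69 come from? With 0 successes in N trials, the exact (Clopper-Pearson) upper limit of a two-sided 95% interval is the p_U satisfying P(X = 0 | p_U) = 0.025, i.e. (1 - p_U)^N = 0.025, so p_U = 1 - 0.025^(1/N), and N * p_U tends to -ln(0.025) ≈ 3.6889 as N grows. A minimal numeric sketch of this closed form (shown in Python purely for illustration, since it needs nothing from R or SAS):

```python
import math

# With x = 0 successes, the Clopper-Pearson upper 95% limit solves
# (1 - p)^N = 0.025, giving p_U = 1 - 0.025**(1/N).
def upper_limit(n):
    return 1 - 0.025 ** (1 / n)

# N * p_U approaches -log(0.025) ~= 3.6889 as N grows.
for n in (10, 100, 10000):
    print(n, n * upper_limit(n))

print(-math.log(0.025))
```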

**R**

The R code is remarkably concise. After generating some Ns, we write a little function to perform the test and extract the (exact) upper 95% confidence limit. This is facilitated by the `...` notation, which passes along unused arguments to functions. Then we use `apply()` to call the new function for each N, passing the numerator 0 each time. Note that `apply()` needs a matrix argument, so the simple vector of Ns is converted to a matrix before use. [The `sapply()` function will accept a vector input, but took about 8 times as long to run.] Finally, we plot the upper limit * N against N, showing the asymptote. A log-scaled x-axis is useful here, and is achieved with the `log='x'` option. (Section 5.3.12.) The result is shown above.

```r
bin.m = seq(10, 10000, by=5)
mybt = function(...) { binom.test(...)$conf.int[2] }
uci = apply(as.matrix(bin.m), 1, mybt, x=0)
plot(y=bin.m * uci, x=bin.m, ylim=c(0,4), type="l", lwd=5, col="red",
     cex=5, log='x', ylab="Exact upper CI", xlab="Sample size",
     main="Upper CI when there are 0 cases observed")
abline(h=3.69)
```

**SAS**

In SAS, the data, really just the N and a numerator of 0, are generated in a `data` step. The CI are found using the `binomial` option in the `proc freq` `tables` statement and saved using the `output` statement. Note that the `weight` statement is used here to avoid having a row for each Bernoulli trial.

```sas
data binm;
  do n = 10 to 10000 by 5;
    x = 0;
    output;
  end;
run;

ods select none;
proc freq data=binm;
  by n;
  weight n;
  tables x / binomial;
  output out=bp binomial;
run;
ods select all;
```

To calculate the upper limit * N, another `data` step is needed-- note that in this setting SAS will only produce the lower limit against the probability that all observations share the same value, thus the subtraction from 1 shown below. The log-scaled x-axis is obtained with the `logbase` option to the `axis` statement. (Section 5.3.12.) The result is shown below.

```sas
data uci;
  set bp;
  limit = (1 - xl_bin) * n;
run;

axis1 order = (0 to 4 by 1);
axis2 logbase=10 logstyle=expand;
symbol1 i=j v=none c=red w=5 l=1;
proc gplot data=uci;
  plot limit * n / vref=3.69 vaxis=axis1 haxis=axis2;
  label n="Sample size" limit="Exact upper CI";
run;
quit;
```

It's clear that the upper 95% limit on the number of successes asymptotes to about 3.69. Thus the upper limit on the binomial probability p is 3.69/N.

**An unrelated note about aggregators:** We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least two other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

## 3 comments:

Based on K. Ulm's 1990 paper on the standardized mortality ratio, the upper 95% CI when the number of events is 0 is (in R code): `qchisq(0.975, 2*(0 + 1))/2 = 3.688879`
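The chi-square formula above agrees with the exact binomial limit in closed form: a chi-square variable with 2 df is exponential with mean 2, so its 0.975 quantile is -2 ln(0.025), and dividing by 2 gives -ln(0.025) ≈ 3.688879. A quick check (sketched in Python, for illustration only):

```python
import math

# A chi-square with 2 df is exponential with mean 2: CDF 1 - exp(-x/2).
# Its 0.975 quantile is therefore -2*log(0.025).
q_975_df2 = -2 * math.log(0.025)

# The Ulm (1990) upper limit for 0 events is this quantile divided by 2.
upper = q_975_df2 / 2
print(upper)  # about 3.688879
```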

Approximation using Poisson distribution (two-sided CI):

```r
> 1 - dpois(0, 3.688893)
[1] 0.9750003
```

Approximation using Poisson distribution (one-sided CI):

```r
> 1 - dpois(0, 2.995718)
[1] 0.9499993
```
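Since `dpois(0, lambda)` is simply exp(-lambda), both coverage checks above reduce to evaluating 1 - exp(-lambda), and can be reproduced with nothing but the exponential function (sketched in Python, for illustration; the lambda values are those from the comment):

```python
import math

# dpois(0, lam) = exp(-lam), so each check is just 1 - exp(-lam).
two_sided = 1 - math.exp(-3.688893)   # about 0.975
one_sided = 1 - math.exp(-2.995718)   # about 0.95
print(two_sided, one_sided)
```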

I calculated those values by minimizing the function:

```r
ul_0 <- function(x) (dpois(0, x) - a)^2
```

where a = 0.05 for a one-sided and a = 0.025 for a two-sided confidence interval.

Good to have you back.

A few comments:

(1) If the goal was to demonstrate the use of `...`, then move on to the next point. Otherwise, note that x=0 for every call to mybt. Thus you can redefine

```r
mybt <- function(n) binom.test(x=0, n=n)$conf.int[2]
```

(To get the code to run you'll then need to remove x=0 from the apply call.)

(2) I had not seen a speed comparison between sapply and apply. Not an intuitive result.

A well written loop is not as pretty, but is comparable to apply for speed:

```r
uci <- numeric(length(bin.m))
for (i in seq_along(bin.m)) {
  uci[i] <- binom.test(x=0, n=bin.m[i])$conf.int[2]
}
```

[More about simulation speed at http://pages.stat.wisc.edu/~st471-1/Rnotes/SimSpeed.html#sec-1]

(3) As for speed (which I know is only a couple seconds for this graphic) you can improve performance by using equally spaced points. Most of the 1999 points in your simulation are bunched on the right because they are on a linear scale but the graphic is on a logarithmic scale. For example, the following covers the same interval (10 to 10000) with equally spaced (on the graph) points :

```r
bin.m <- round(10^seq(1, 4, length=100))
```
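The same log-equispaced grid is easy to build in any language: take 100 exponents equally spaced between 1 and 4 and raise 10 to each. A sketch of the idea (in Python, mirroring the R line above for illustration):

```python
# 100 points equally spaced on a log10 scale from 10 to 10000,
# mirroring round(10^seq(1, 4, length=100)) in R.
grid = [round(10 ** (1 + 3 * i / 99)) for i in range(100)]
print(grid[0], grid[-1], len(grid))
```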
