SAS and R: axis control


Monday, December 10, 2012

Example 10.8: The upper 95% CI is 3.69

Apologies for the long and unannounced break-- the longest since we started blogging, three and a half years ago. I was writing a 2-day course for SAS users to learn R. Contact me if you're interested. And Nick and I are beginning work on the second edition of our book-- look for it in the fall. Please let us know if you have ideas about what we omitted last time or would otherwise like to see added. In the meantime, we'll keep blogging, though likely at a reduced rate.

Today: what can you say about the probability of an event if the observed number of events is 0? It turns out that the upper 95% CI for the probability is 3.69/N. There's a sweet little paper with some rationale for this, but it's in my other office. And I couldn't recall the precise value-- so I used SAS and R to demonstrate it to myself.

R

The R code is remarkably concise. After generating some Ns, we write a little function to perform the test and extract the (exact) upper 95% confidence limit. This is facilitated by the "..." notation, which passes any arguments given to the new function straight through to binom.test(). Then we use apply() to call the new function for each N, passing the numerator 0 each time. Note that apply() needs a matrix argument, so the simple vector of Ns is converted to a matrix before use. [The sapply() function will accept a vector input, but took about 8 times as long to run.] Finally, we plot the upper limit * N against N, showing the asymptote. A log-scaled x-axis is useful here, and is achieved with the log='x' option (section 5.3.12). The result is shown above.

bin.m = seq(10, 10000, by=5)
mybt = function(...) { binom.test(...)$conf.int[2] }
uci = apply(as.matrix(bin.m), 1, mybt, x=0)
plot(y=bin.m * uci, x=bin.m, ylim=c(0,4), type="l", 
     lwd=5, col="red", cex=5, log='x',  
     ylab="Exact upper CI", xlab="Sample size", 
     main="Upper CI when there are 0 cases observed")
abline(h=3.69)
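
For readers who would rather skip the matrix conversion, here is a minimal sketch of the same calculation using vapply(), which accepts the vector of Ns directly. It is offered as an alternative for when speed is not a concern, not as the code behind the plot; the name uci2 is introduced here purely for illustration.

```r
# Sketch: same upper limits via vapply(), which takes a plain vector
bin.m = seq(10, 10000, by=5)
# numeric(1) declares that each call returns exactly one number
uci2 = vapply(bin.m, function(n) binom.test(x=0, n=n)$conf.int[2], numeric(1))
# scaled limits rise toward, but stay below, 3.69
range(bin.m * uci2)
```

The scaled values can be plotted exactly as before by substituting uci2 for uci.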

SAS

In SAS, the data, really just the N and a numerator of 0, are generated in a data step. The CI are found using the binomial option in the proc freq tables statement and saved using the output statement. Note that the weight statement is used here to avoid having a row for each Bernoulli trial.

data binm;
do n = 10 to 10000 by 5;
  x=0;
  output;
  end;
run;

ods select none;
proc freq data=binm;
by n;
weight n;
tables x / binomial;
output out=bp binomial;
run;
ods select all;

To calculate the upper limit*N, another data step is needed. Note that in this setting, where every observation shares the same value (x=0), SAS reports the binomial proportion for that single observed level, so the useful quantity is its lower confidence limit; the upper limit for the event probability is 1 minus that lower limit, thus the subtraction shown below. The log scale x-axis is obtained with the logbase option to the axis statement (section 5.3.12). The result is shown below.

data uci;
set bp;
limit = (1-xl_bin) * n;
run;
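
The subtraction from 1 can be checked with a short R sketch. It assumes that the exact limits saved by proc freq are the same Clopper-Pearson limits that binom.test() reports; the names lower_all and upper_none are ours.

```r
# Sketch: lower limit when all N observations share one value,
# compared with the upper limit when 0 of N are events
n = c(10, 100, 1000)
lower_all = sapply(n, function(k) binom.test(x=k, n=k)$conf.int[1])
upper_none = sapply(n, function(k) binom.test(x=0, n=k)$conf.int[2])
# the last two columns should agree
cbind(n, one_minus_lower = 1 - lower_all, upper_none)
```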

axis1 order = (0 to 4 by 1);
axis2 logbase=10 logstyle=expand;
symbol1 i = j v = none c = red w=5 l=1;
proc gplot data=uci;
plot limit * n / vref=3.69 vaxis=axis1 haxis=axis2;
label n="Sample size" limit="Exact upper CI";
run;
quit;

It's clear that the upper 95% limit on the number of successes asymptotes to about 3.69. Thus the upper limit on the binomial probability p is 3.69/N.
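
A quick sketch of where the 3.69 comes from: with 0 events, the exact upper limit p solves (1 - p)^N = 0.025, so p = 1 - 0.025^(1/N), and N * p approaches -log(0.025) as N grows. The helper upper0() below is hypothetical, written only to check this against binom.test().

```r
# Sketch: closed form for the exact upper limit when 0 events are observed
# (upper0 is a hypothetical helper, not part of any package)
upper0 = function(n, alpha=0.05) 1 - (alpha/2)^(1/n)

n = c(10, 100, 1000, 10000)
exact = sapply(n, function(k) binom.test(x=0, n=k)$conf.int[2])
# closed form vs. binom.test(), plus the scaled limit N * p
cbind(n, closed=upper0(n), exact, scaled=n * upper0(n))

# limiting value of N * p as N grows
limit = -log(0.025)
limit
```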

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Tuesday, March 13, 2012

Example 9.23: Demonstrating proportional hazards

A colleague recently asked after a slide suitable for explaining proportional hazards. In particular, she was concerned that her audience not focus on the time to event or probability of the event. An initial thought was to display the cumulative hazards, which have a constant proportion if the model is true. But the colleague's audience might get distracted both by the language (what's a "hazard"?) and the fact that the cumulative hazard doesn't have a readily interpretable scale. The failure curve, meaning the probability of failure by time t, over time, might be a bit more accessible.

Rather than just draw some curves, we simulated data, based on the code we demonstrated previously. In this case, there's no need for any interesting censoring, but a more interesting survival curve seems worthwhile.

SAS
The more interesting curve is introduced by manually accelerating and slowing down the Weibull survival time demonstrated in the previous approach. We also trichotomized one of the exposures to match the colleague's study, and censored all values greater than 10 to keep focus where the underlying hazard was interesting.


data simcox;
  beta1 = .2;
  beta2 = log(1.25);
  lambdat = 20; *baseline hazard;
  do i = 1 to 10000;
    x1 = normal(45);
    x2 = (normal(0) gt -1) + (normal(0) gt 1);
    linpred = -beta1*x1 - beta2*x2;
    * time of event;
    t = rand("WEIBULL", 1, lambdaT * exp(linpred));
    * compress times past 5, then stretch times past 7;
    if t gt 5 then t = 5 + (t-5)/10;
    if t gt 7 then t = 7 + (t-7) * 20;
    * censor values greater than 10;
    censored = (t > 10);
    output;
  end;
run;

The phreg procedure will fit the model and produce nice plots of the survival function and the cumulative hazard. But to generate useful versions of these, you need to make an additional data set with the covariate values you want to show plots for. We set the other covariate's value at 0. You can also include an id variable with some descriptive text.


data covars;
x1 = 0; x2 = 2; id = "Arm 3"; output;
x1 = 0; x2 = 1; id = "Arm 2"; output;
x1 = 0; x2 = 0; id = "Arm 1"; output;
run;

proc phreg data = simcox plot(overlay)=cumhaz;
class x2 (ref = "0");
baseline covariates = covars out= kkout cumhaz = cumhaz 
    survival = survival / rowid = id;
model t*censored(1) = x1 x2;
run;

The cumulative hazard plot generated by the plot option is shown below, demonstrating the correct relative hazard of 1.25. The related survival curve could be generated with plot = s .

To get the desired plot of the failure times, use the out = option to the baseline statement. This generates a data set with the listed statistics (here the cumulative hazard and the survival probability, across time). Then we can produce a plot using the gplot procedure, after generating the failure probability.


data kk2;
set kkout;
iprob = 1 - survival;
run;

goptions reset=all;
legend1 label = none value=(h=2);
axis1 order = (0 to 1 by .25) minor = none value=(h=2)
  label = (a = 90 h = 3 "Probability of infection");
axis2 order = (0 to 10 by 2) minor = none value=(h=2)
  label = (h=3 "Attributable time");
symbol1 i = sm51s v = none w = 3 r = 3;
proc gplot data = kk2;
plot iprob * t = id / vaxis = axis1 haxis = axis2 legend=legend1;
run; quit;

The result is shown at the top. Note the use of the h= option in various places in the axis and legend statements to make the font more visible when shrunk to fit onto a slide. The smoothing spline plotted through the data with the smXXs interpolation makes a nice shape out of the underlying abrupt changes in the hazard. The symbol, legend, and axis statements are discussed in chapter 6.

R
As in SAS, we begin by simulating the data. Note that we use the simple categorical variable simulator mentioned in the comments for example 7.20.


n = 10000
beta1 = .2
beta2 = log(1.25)
lambdaT = 20

x1 = rnorm(n,0)
x2 = sample(0:2,n,rep=TRUE,prob=c(1/3,1/3,1/3))
# true event time
T = rweibull(n, shape=1, scale=lambdaT*exp(-beta1*x1 - beta2*x2)) 
T[T>5] = 5 + (T[T>5] -5)/10
T[T>7] = 7 + (T[T>7] -7) * 20
event = T < rep(10,n)
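
Because the same monotone warping is applied to every event time, the hazards stay proportional on the warped time scale. A minimal sketch of that check follows, using the analytic cumulative hazard of the exponential (Weibull, shape 1) times at x1 = 0; the helper names warp() and cumhaz() are hypothetical.

```r
# Sketch: the time warping is monotone, so proportional hazards are preserved
lambdaT = 20
beta2 = log(1.25)

# same two-step warping as applied to T above (hypothetical helper)
warp = function(t) {
  t = ifelse(t > 5, 5 + (t - 5)/10, t)
  ifelse(t > 7, 7 + (t - 7) * 20, t)
}

# cumulative hazard of the unwarped exponential time for a given arm, x1 = 0
cumhaz = function(t, arm) t / (lambdaT * exp(-beta2 * arm))

t0 = seq(0.5, 40, by=0.5)    # original time scale
s = warp(t0)                 # warped time scale, still increasing
ratio = cumhaz(t0, arm=1) / cumhaz(t0, arm=0)
range(ratio)                 # constant at exp(beta2)

# failure probability for arm 0, to be read against warped time s
failprob = 1 - exp(-cumhaz(t0, arm=0))
```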

Now we can fit the model using the coxph() function from the survival package. There is a default method for plot()ing the "survfit" objects resulting from several package functions. However, it shows the typical survival plot. To show failure probability instead, we'll manually take the complement of the survival probability, but still take advantage of the default method. Note the use of the xmax option to limit the x-axis. The results (below) are somewhat bland, and it's unclear if the lines can be colored differently, or their widths increased. They are also as angular as the cumulative hazards shown in the SAS implementation.


library(survival)
plotph = coxph(Surv(T, event)~ x1 * strata(x2), 
     method="breslow")
summary(plotph)
sp = survfit(plotph)
sp$surv = 1 - sp$surv
plot(sp, xmax=10)

Consequently, as is so often the case, presentation graphics require more manual fiddling. We'll begin by extracting the data we need from the "survfit" object. We'll take just the failure times and probabilities, as well as the name of the strata to which each observation belongs, limiting the data to time < 10. The last of these lines uses the names() function to pull the names of the strata, repeating each an appropriate number of times with the useful rep() function.


sp = survfit(plotph)
failtimes = sp$time[sp$time <10]
failprobs = 1 - sp$surv[sp$time <10]
# sp$strata holds the number of time points in each stratum
failcats = rep(names(sp$strata), times=sp$strata)[sp$time <10]
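
As a small illustration of the rep() idiom, here is a sketch with made-up counts standing in for the per-stratum counts of a "survfit" object; the vector counts is hypothetical.

```r
# Sketch: repeat each stratum name once per time point in that stratum
counts = c("x2=0"=3, "x2=1"=2, "x2=2"=4)   # hypothetical counts
labels = rep(names(counts), times=counts)
table(labels)
```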

All that remains is plotting the data, which is not dissimilar to many examples in the book and in this blog. There's likely some way to make these three lines with a little less typing, but knowing how to do it from scratch gives you the most flexibility. It proved difficult to get the desired amount of smoothness from the loess(), lowess(), or supsmu() functions, but smooth.spline() served admirably. The code below demonstrates increasing the axis and tick label sizes.


plot(failprobs~failtimes, type="n", ylim=c(0,1), cex.lab=2, cex.axis= 1.5, 
     ylab= "Probability of infection", xlab = "Attributable time")
lines(smooth.spline(y=failprobs[failcats == "x2=0"], 
     x=failtimes[failcats == "x2=0"],all.knots=TRUE, spar=1.8), 
     col = "blue", lwd = 3)
lines(smooth.spline(y=failprobs[failcats == "x2=1"], 
     x=failtimes[failcats == "x2=1"],all.knots=TRUE, spar=1.8), 
     col = "red", lwd = 3)
lines(smooth.spline(y=failprobs[failcats == "x2=2"], 
     x=failtimes[failcats == "x2=2"],all.knots=TRUE, spar=1.8),
     col = "green", lwd = 3)
legend(x=7,y=0.4,legend=c("Arm 1", "Arm 2", "Arm 3"), 
     col = c("blue","red","green"), lty = c(1,1,1), lwd = c(3,3,3) )
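
One possible way to cut down the typing is to loop over the strata. The sketch below is self-contained, so it uses made-up stand-ins for failtimes, failprobs, and failcats, and a milder spar value than above; all of these, along with the name arm.cols, are assumptions for illustration only.

```r
# Sketch: fit and draw one smoothing spline per stratum in a loop
arm.cols = c("x2=0"="blue", "x2=1"="red", "x2=2"="green")

# hypothetical stand-ins for the vectors extracted from the survfit object
failtimes = rep(seq(0.1, 9.9, length.out=100), times=3)
failcats = rep(names(arm.cols), each=100)
failprobs = 1 - exp(-rep(c(0.05, 0.0625, 0.078), each=100) * failtimes)

# one spline fit per stratum
fits = lapply(names(arm.cols), function(a) {
  keep = failcats == a
  smooth.spline(x=failtimes[keep], y=failprobs[keep], all.knots=TRUE, spar=0.8)
})

plot(failprobs ~ failtimes, type="n", ylim=c(0,1),
     ylab="Probability of infection", xlab="Attributable time")
for (i in seq_along(fits)) lines(fits[[i]], col=arm.cols[i], lwd=3)
legend(x=7, y=0.4, legend=c("Arm 1", "Arm 2", "Arm 3"),
     col=arm.cols, lty=1, lwd=3)
```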

Monday, August 30, 2010

Example 8.3: pyramid plots

Pyramid plots are a common way to display the distribution of age groups in a human population. The percentages of people within a given age category are arranged in a barplot, often back to back. Such displays can be used to distinguish males vs. females, differences between two different countries or the distribution of age at different timepoints. Aidan Kane has an example.

We demonstrate how to generate back to back pyramid plots by gender of the age distribution from the HELP (Health Evaluation and Linkage to Primary Care) study. The example today highlights the differences between the R community and the SAS corporate structure. The R function was constructed to do exactly a pyramid plot, while the SAS approach tricks a powerful but general approach to achieve approximately the desired results. The R results are, to our eyes, more attractive; to mimic them exactly in SAS would require drawing much of the content from primitives. Someone may have done this, but the software structure and user community aren't organized for sharing.

R

We begin by loading the data then creating a categorical age variable (in 5 year increments) using the cut() command (section 1.4.10). Next a character variable is created that will be used to display the five number summaries by gender (section 2.1.2).


ds = read.csv("http://www.math.smith.edu/sasr/datasets/help.csv")
attach(ds)
library(plotrix)

# create a categorical age variable
agegrp = cut(age, breaks=c(18, 20, 25, 30, 35, 40, 45, 50, 55, 60))

# create a nicer description for gender
gender = rep("male", length(agegrp))
gender[female==1] = "female"

# create a vector of percentages in each age range
women = as.vector(100*table(agegrp[female==1])/sum(female==1))
men = as.vector(100*table(agegrp[female==0])/sum(female==0))

# distribution by gender
tapply(age, gender, fivenum)

This yields the following output (five number summaries by gender):


$female
[1] 21.0 31.0 35.0 40.5 58.0

$male
[1] 19 30 35 40 60

Finally, the vectors of percentages at each level of the age variable for men and women are given as arguments to the pyramid.plot() function.


pyramid.plot(men, women, 
   labels=c("(18,20]","(20,25]","(25,30]","(30,35]", 
   "(35,40]","(40,45]","(45,50]","(50,55]","(55,60]"), 
   gap=5)
title("Age distribution at baseline of HELP study")

The age distributions are quite similar, with the males slightly more dispersed than the females.
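
The interval labels typed out in the call above are the ones cut() generates, so they could also be taken from the factor itself. A minimal sketch, using a few made-up ages rather than the HELP data:

```r
# Sketch: cut() builds the "(18,20]"-style labels itself
age = c(19, 20, 23, 31, 38, 44, 52, 60)   # hypothetical ages
agegrp = cut(age, breaks=c(18, 20, 25, 30, 35, 40, 45, 50, 55, 60))
# these labels could be supplied as labels=levels(agegrp)
levels(agegrp)
```

Note that intervals are closed on the right by default, so an age of exactly 20 falls in the first group.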

SAS

We'll use proc gchart with the hbar statement (section 5.1.3) to make the plot. This requires some set-up, due to the desired back-to-back image. We begin, as in R, by generating the age categories and a gender variable. The strategy for categorizing age is shown in section 1.4.9.


data pyr;
set "c:\book\help";
  agegrp = (age le 20) + (age le 25) + (age le 30) + (age le 35) +
  (age le 40) + (age le 45) + (age le 50) + (age le 55) + (age le 60);
  if female eq 1 then gender = "Female";
    else gender = "Male";
run;
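
The sum-of-comparisons trick used for agegrp counts how many cutpoints an age does not exceed, so the youngest group gets the value 9 and the oldest gets 1. A small R sketch of the same arithmetic, with made-up ages, may make the coding clearer:

```r
# Sketch: count the cutpoints that each age is less than or equal to
age = c(19, 20, 21, 37, 60)        # hypothetical ages
cutpoints = seq(20, 60, by=5)      # 20, 25, ..., 60
# outer() builds an age-by-cutpoint table of TRUE/FALSE; rowSums() counts the TRUEs
agegrp = rowSums(outer(age, cutpoints, "<="))
agegrp
```

This reversed coding is why the labels in the axis statement further down run from the oldest group to the youngest.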

Next, we generate the percent in each age group, within gender, using proc freq (section 2.3.1). We save the output to a data set with the out option and suppress all the printed output. Then we make the percents for the males negative, so they'll display to the left of 0.


proc freq data=pyr noprint;
  tables agegrp * gender/out=sumpyr outpct;
run;

data pyr2;
set sumpyr;
if gender eq "Male" then pct_col=pct_col * -1;
run;

We could proceed with the plot now, but the axes would include age categories 1 through 9 and negative percents for the males. To clean this up, we use axis statements (sections 5.3.7, 5.3.8).


title 'Age distribution at baseline of HELP study';
axis1 value = ("(55,60]" "(50,55]" "(45,50]" "(40,45]" 
   "(35,40]" "(30,35]" "(25,30]"  "(20,25]" "(18,20]" ) ;
axis2 order=(-30 to 30 by 10) 
label=("Percent in each age group, within gender")
minor = none
value = ("30" "20" "10" "0" "10" '20' '30');

proc gchart data=pyr2;
hbar agegrp / discrete freq nostats sumvar=pct_col space=0.5
   subgroup=gender raxis=axis2 maxis=axis1;
label agegrp="Age";
run;
quit;

In the hbar statement, the key option is sumvar, which tells proc gchart the length of the bars. The discrete option forces a bar for each value of agegrp. Other options associate the defined axis statements with axes of the plot, generate different colors for each gender, space the bars, and suppress some default plot features.

Different colored bars within gender could be accomplished with pattern statements. More difficult would be coloring the bars within gender by some third variable, as is demonstrated in R in example(pyramid.plot). Replicating the R plot with the category labels between the genders would require drawing the plot using annotate data sets.

Monday, April 12, 2010

Example 7.32: Add reference lines to a plot; fine control of tick marks

Sometimes it's useful to plot regular reference lines along with the data. For a time-series plot, this can show when critical values are reached in a clearer way than simple tick marks.

As an example, we revisit the empirical CDF plot shown in Example 7.11. If you missed that entry, the data can be downloaded so you can easily explore the code shown below. We'll show how to add a regular grid of lines and lines at specific x or y values.

In a departure from our usual style, we'll discuss SAS and R in parallel, not in sequence.

Original Plot

The plot shown in Example 7.11 was obtained by doing some calculation and then running the following code.

SAS


symbol1 i=j v=none c=blue;
proc gplot data=help_a;
   plot ecdf_pcs * pcs;
run;

R


plot(sortpcs, ecdfpcs, type="n")
lines(sortpcs, ecdfpcs)

A simple default reference grid can be produced in SAS by changing the plot statement to read plot ecdf_pcs * pcs / grid;. In R, the grid() function (with no parameters specified) has the same effect. Either adds light grey lines at the major tick marks in both the x and y directions. The major tick marks themselves can be selected using the axis statement in SAS or with the axis() function in R, as discussed in sections 5.3.7 and 5.3.8. Changing the major tick marks in SAS will make the grid appear at the modified tick locations.


axis1 order=(0 to 1 by .25) minor=none;
axis2 order=(10 to 80 by 35) minor=none;
symbol1 i=j v=none c=blue;
proc gplot data=help_a;
   plot ecdf_pcs * pcs / vaxis=axis1 haxis=axis2 grid;
run;

Unfortunately, it is difficult to get the grid() function to match up with tick marks away from the defaults. A better approach in R is to use the abline() function (section 5.2.1) to draw the reference lines where you want them. In the following code, we also demonstrate customization of the tick marks to match the SAS output shown above.


plot(sortpcs, ecdfpcs, type="n", xaxt="n", yaxt="n", xlim=c(10,80))
axis(side=1, at=c(10,45,80))
axis(side=2, at=c(seq(0,1,.25)))
lines(sortpcs, ecdfpcs)
abline(v=c(10,45,80), col="lightgray", lty="dotted")
abline(h=c(seq(0,1,.25)), col ="lightgrey", lty="dotted")

In the above, the options to plot() suppress the data and axes and specify the range of the x axis. The axis() function calls specify where the tick marks should appear, and the abline() function calls add the reference lines.

SAS also allows manual specification of reference lines. A result equivalent to the demonstrated grid option for the plot statement could be obtained as follows.


axis1 order=(0 to 1 by .25) minor=none;
axis2 order=(10 to 80 by 35) minor=none;
symbol1 i=j v=none c=blue;
proc gplot data=help_a;
   plot ecdf_pcs * pcs / vaxis=axis1 haxis=axis2 href=10,45,80 
      vref=0,.25,.5,.75,1 chref=lightgrey cvref=lightgrey;
run;
quit;

The added control offered by the abline() or href/vref approach is that reference lines can trivially be drawn at points not appearing at major tick marks. For example, we might have a particular interest in the 90th percentile of the data. We can add abline(h=.9, col="lightgrey", lty="dotted") as a separate command in R, or add the new value to the list of vrefs in SAS to add this line. The final result is shown below.

(The image above modifies the abline() calls to drop the lty option and change the color to "grey". Otherwise the reference lines were too faint to display well here.)
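
For readers without the Example 7.11 data at hand, the following self-contained sketch puts the R pieces together. It uses simulated values as a stand-in for the pcs variable (an assumption; the real data will look different) and also marks the x value at which the empirical CDF first reaches 0.9, which goes slightly beyond the plot described above.

```r
# Sketch: empirical CDF with custom ticks and reference lines
set.seed(732)
pcs = rnorm(200, mean=48, sd=11)   # hypothetical stand-in for the HELP pcs scores
sortpcs = sort(pcs)
ecdfpcs = seq_along(sortpcs) / length(sortpcs)

# suppress the default axes, then add ticks where we want them
plot(sortpcs, ecdfpcs, type="n", xaxt="n", yaxt="n", xlim=c(10,80))
axis(side=1, at=c(10,45,80))
axis(side=2, at=seq(0,1,.25))
lines(sortpcs, ecdfpcs)

# reference lines at the ticks, plus one at the 90th percentile
abline(v=c(10,45,80), h=seq(0,1,.25), col="grey")
abline(h=.9, col="grey")

# first observed value at which the empirical CDF reaches 0.9
q90 = sortpcs[which(ecdfpcs >= .9)[1]]
abline(v=q90, col="grey", lty="dashed")
```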

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.
