SAS and R: symbol statement

Showing posts with label symbol statement. Show all posts

Monday, October 1, 2012

Example 10.4: Multiple comparisons and confidence limits

A colleague is a devotee of confidence intervals. To him, the CI have the magical property that they are immune to the multiple comparison problem-- in other words, he feels its OK to look at a bunch of 95% CI and focus on the ones that appear to exclude the null. This though he knows well the one-to-one relationship between 95% CIs that exclude the null and p-values below 0.05.

Today, we'll create a Monte Carlo experiment to demonstrate that fishing by CI is just as dangerous as fishing by p-value; generating the image above. We'll do this by replicating a bivariate experiment 100 times. Later, we'll examine the results of a single experiment with many predictors.

R
To begin with, we'll write a function to generate a single experiment, using a logistic regression. This is a simple modification of one of our first and most popular entries.

simci = function(){
  intercept = 0
  beta = 0
# beta = 0 because we're simulating under the null 
# make the variance of x in this experiment vary a bit
  xtest = rnorm(1000) * runif(1,.6,1.4)
  linpred = intercept + xtest*beta
  prob = exp(linpred)/(1 + exp(linpred))
  runis = runif(1000)
  ytest = ifelse(runis < prob,1,0)
# now, fit the model
  fit = glm(ytest~xtest,family=binomial(link="logit"))
# the standard error of the estimates is easiest to find in the
  pe = summary(fit)$coefficients
# calculate the Wald CI; an alternative would be confint(), but
# that calculated profile CI, which take longer to generate
  ci = exp(c(pe[2,1] - 1.96*pe[2,2], pe[2,1] + 1.96*pe[2,2] ))
  return(ci)
}

Then we can use the replicate() function to repeat the experiment 100 times. The t() function (section 1.9.2) transposes the resulting matrix to have one row per experiment.

sim100 = t(replicate(100,simci()))

plot(x = sim100[,1], y = 1:100, 
  xlim = c(min(sim100),max(sim100)), type="n")
segments(y0=1:100,x0=sim100[,1],y1 = 1:100,x1=sim100[,2], 
  col = ifelse(sim100[,1]>1 | sim100[,2]<1,"red","black"))
abline(v=1)

The result is shown at the top. In the code, we set the limits of the x-axis by finding the max and min across the whole matrix-- this is a little wasteful of CPU cycles, but saves some typing. The segments() function (see example 8.42) is a vector-enabled line-drawer. Here we draw a line from the lower CI limit to the upper, giving the experiment number as the x value for each. We assign a red plot line if the CI excludes 1, using the ifelse() function (section 1.11.2), a vectorwise logic test. Finally, a reference line helps the viewer see for far the end of the CI is from the null. We omit prettying up the axis labels.

SAS
In SAS, considerably more lines are required. We begin by simulating the data, as in example 7.2. The modifications are to generate 100 examples with an outside do loop (section 1.11.1) and the random element added to the variance.

data simci;
beta = 0;
intercept = 0;
do sim = 1 to 100;   /* outer loop */
  xvar = (uniform(0) *.8) + .6;  /* variance != 1 */
  do i = 1 to 1000;
    xtest = normal(0) * xvar;
    linpred = intercept + (xtest * beta);
    prob =  exp(linpred)/(1 + exp(linpred));
    ytest = (uniform(0) < prob);
    output;
  end;
end;
run;

Then we fit the logistic regression. We leave in the ods trace commands (section A.7.1) to remind you how to find the SAS names of the output elements, needed to save the results in the ods output statement. The CI for the odds ratios are requested in the clodds statement, which accepts a pl value for a profile likelihood based interval.

*ods trace on/listing; 
ods select none;
ods output cloddswald = lrci;
proc logistic data = simci;
by sim;
model ytest(event='1')=xtest / clodds=wald;
run;
*ods trace off;
ods select all;

Our plotting approach will require the "long" data set style, with two rows for each experiment. We'll generate that while checking whether the null is excluded from the CI.

data lrp2;
set lrci;
red = 0;
if lowercl > 1 or uppercl < 1 then red = 1;
point = lowercl; output;
point = uppercl; output;
run;

Finally, we're ready to make the graphic. We use the hilob interpolation to connect the upper and lower CI for each experiment; the b requests bars instead of a line, and the bwidth option specifies a very narrow bar. These options prevent the default plotting of the "mean" value with a tick. The a*b=c syntax (section 5.2.2) allows the different line colors.

symbol1 i=hilob bwidth=.05 c=black;
symbol2 i=hilob bwidth=.05 c=red;
proc gplot data = lrp2;
plot point * sim = red / vref = 1;
run;quit;

The result is just below. The vertical alignment seen in the R plot seems more natural, but this would not be possible with the hilo interpolation. As theory and logic would suggest, quite a few of the hundred simulated CI exclude the null, sometimes by a large proportion of the CI width.

An unrelated note about aggregators:We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Tuesday, March 13, 2012

Example 9.23: Demonstrating proportional hazards

A colleague recently asked after a slide suitable for explaining proportional hazards. In particular, she was concerned that her audience not focus on the time to event or probability of the event. An initial thought was to display the cumulative hazards, which have a constant proportion if the model is true. But the colleague's audience might get distracted both by the language (what's a "hazard"?) and the fact that the cumulative hazard doesn't have a readily interpretable scale. The failure curve, meaning the probability of failure by time t, over time, might be a bit more accessible.

Rather than just draw some curves, we simulated data, based on the code we demonstrated previously. In this case, there's no need for any interesting censoring, but a more interesting survival curve seems worthwhile.

SAS
The more interesting curve is introduced by manually accelerating and slowing down the Weibull survival time demonstrated in the previous approach. We also trichotomized one of the exposures to match the colleague's study, and censored all values greater than 10 to keep focus where the underlying hazard was interesting.


data simcox;
  beta1 = .2;
  beta2 = log(1.25);
  lambdat = 20; *baseline hazard;
  do i = 1 to 10000;
    x1 = normal(45);
    x2 = (normal(0) gt -1) + (normal(0) gt 1);
    linpred = -beta1*x1 - beta2*x2;
    t = rand("WEIBULL", 1, lambdaT * exp(linpred));
 if t gt 5 then t = 5 + (t-5)/10;
 if t gt 7 then t = 7 + (t-7) * 20;
    * time of event;
 censored = (t > 10);
    output;
  end;
run;

The phreg procedure will fit the model and produces nice plots of the survival function and the cumulative hazard. But to generate useful versions of these, you need to make an additional data set with the covariate values you want to show plots for. We set the other covariate's value at 0. You can also include an id variable with some descriptive text.


data covars;
x1 = 0; x2 = 2; id = "Arm 3"; output;
x1 = 0; x2 = 1; id = "Arm 2"; output;
x1 = 0; x2 = 0; id = "Arm 1"; output;
run;

proc phreg data = simcox plot(overlay)=cumhaz;
class x2 (ref = "0");
baseline covariates = covars out= kkout cumhaz = cumhaz 
    survival = survival / rowid = id;
model t*censored(1) = x1 x2;
run;

The cumulative hazard plot generated by the plot option is shown below, demonstrating the correct relative hazard of 1.25. The related survival curve could be generated with plot = s .

To get the desired plot of the failure times, use the out = option to the baseline statement. This generates a data set with the listed statistics (here the cumulative hazard and the survival probability, across time). Then we can produce a plot using the gplot procedure, after generating the failure probability.


data kk2;
set kkout;
iprob = 1 - survival;
run;

goptions reset=all;
legend1 label = none value=(h=2);
axis1 order = (0 to 1 by .25) minor = none value=(h=2)
  label = (a = 90 h = 3 "Probability of infection");
axis2 order = (0 to 10 by 2) minor = none value=(h=2)
  label = (h=3 "Attributable time");;
symbol1 i = sm51s v = none w = 3 r = 3;
proc gplot data = kk2;
plot iprob * t = id / vaxis = axis1 haxis = axis2 legend=legend1;
run; quit;

The result is shown at the top. Note the use of the h= option in various places in the axis and legend statement to make the font more visible when shrunk to fit onto a slide. The smoothing spline plotted through the data with the smXXs interpolation makes a nice shape out of the underlying abrupt changes in the hazard. The symbol, legend, and axis statements are discussed in chapter 6.

R
As in SAS, we begin by simulating the data. Note that we use the simple categorical variable simulator mentioned in the comments for example 7.20.


n = 10000
beta1 = .2
beta2 = log(1.25)
lambdaT = 20

x1 = rnorm(n,0)
x2 = sample(0:2,n,rep=TRUE,prob=c(1/3,1/3,1/3))
# true event time
T = rweibull(n, shape=1, scale=lambdaT*exp(-beta1*x1 - beta2*x2)) 
T[T>5] = 5 + (T[T>5] -5)/10
T[T>7] = 7 + (T[T>7] -7) * 20
event = T < rep(10,n)

Now we can fit the model using the coxph() function from the Survival package. There is a default method for plot()ing the "survfit" objects resulting from several package functions. However, it shows the typical survival plot. To show failure probability instead, we'll manually take the complement of the survival probability, but still take advantage of the default method. Note the use of the xmax option to limit the x-axis. The results (below) are somewhat bland, and it's unclear if the lines can be colored differently, or their widths increased. They are also as angular as the cumulative hazards shown in the SAS implementation.


library(survival)
plotph = coxph(Surv(T, event)~ x1 * strata(x2), 
     method="breslow")
summary(plotph)
sp = survfit(plotph)
sp$surv = 1 - sp$surv
plot(sp, xmax=10)

Consequently, as is so often the case, presentation graphics require more manual fiddling. We'll begin by extracting the data we need from the "survfit" object. We'll take just the failure times and probabilities, as well as the name of the strata to which each observation belongs, limiting the data to time < 10. The last of these lines uses the names() function to pull the names of the strata, repeating each an appropriate number of times with the useful rep() function.


sp = survfit(plotph)
failtimes = sp$time[sp$time <10]
failprobs = 1 - sp$surv[sp$time <10]
failcats = c(rep(names(sp$strata),times=sp$n))[sp$time <10]

All that remains is plotting the data, which is not dissimilar to many examples in the book and in this blog. There's likely some way to make these three lines with a little less typing, but knowing how to do it from scratch gives you the most flexibility. It proved difficult to get the desired amount of smoothness from the loess(), lowess(), or supsmu() functions, but smooth.spline() served admirably. The code below demonstrates increasing the axis and tick label sizes


plot(failprobs~failtimes, type="n", ylim=c(0,1), cex.lab=2, cex.axis= 1.5, 
     ylab= "Probability of infection", xlab = "Attributable time")
lines(smooth.spline(y=failprobs[failcats == "x2=0"], 
     x=failtimes[failcats == "x2=0"],all.knots=TRUE, spar=1.8), 
     col = "blue", lwd = 3)
lines(smooth.spline(y=failprobs[failcats == "x2=1"], 
     x=failtimes[failcats == "x2=1"],all.knots=TRUE, spar=1.8), 
     col = "red", lwd = 3)
lines(smooth.spline(y=failprobs[failcats == "x2=2"], 
     x=failtimes[failcats == "x2=2"],all.knots=TRUE, spar=1.8),
     col = "green", lwd = 3)
legend(x=7,y=0.4,legend=c("Arm 1", "Arm 2", "Arm 3"), 
     col = c("blue","red","green"), lty = c(1,1,1), lwd = c(3,3,3) )

Thursday, March 1, 2012

Example 9.22: shading plots and inequalities

A colleague teaching college algebra wrote in the R-sig-teaching list asking for assistance in plotting the solutions to the inequality x^2 - 3 > 0. This type of display is handy in providing a graphical solution to accompany an analytic one.

R
The plotFun() function within the mosaic package comes in handy here.


library(mosaic)
plotFun( x^2 -3 ~ x, xlim=c(-4,4))
ladd(panel.abline(h=0,v=0,col='gray50'))
plotFun( (x^2 -3) * (x^2 > 3) ~ x, type='h', alpha=.5, 
  lwd=4, col='lightblue', add=TRUE)
plotFun( x^2 -3 ~ x, xlim=c(-4,4), add=TRUE)

As is common when crafting figures using R, the final product is built up in parts. First the curve is created, then vertical and horizontal lines are added. The shading is done using a second call to plotFun(). Finally, the curve is plotted again, to leave it on top of the final figure.

Alternatively, one might want to construct the solution more directly. This is fairly straightforward using the lines() (section 5.2.1) and polygon() (sections 2.6.4, 5.2.13) functions.


x = seq(-4,4,length=81)
fun = (x^2 -3)
sol = ((x^2 -3) * (x^2 > 3))

plot(x,fun, type="l", ylab=expression(x^2 - 3))
lines(x, sol)
polygon(c(-4,x,4), c(0,sol,0), col= "gray", border=NA)
abline(h=0, v=0)

The type="l" option to plot() draws a line plot instead of the default scatterplot. In the polygon() call we add points on the x axis to close the shape. One advantage of this approach is that the superscript can be correctly displayed in the y axis label, as shown in the plot below.

SAS
In SAS we'll construct the plot from scratch, using a data step to generate the function and solution, and the areas option of the gplot statement to make the shaded areas. The areas option fills in the area between the first line and the bottom of the plot, then between pairs of lines, so we have to draw the x axis manually, and we'll make the data for this as well.


data test;
do x = -4 to 4 by .1;
  sol = (x*x - 3) * (x*x >3);
  fun = x*x-3;
  zero = 0;
  output;
  end;
run;

The symbol statement is required so that there are lines to shade between. The pattern statements define the colors to use in the shading. Here we get a white color below the x axis (plotted zero line), then a blue color between the solution and the x axis. Then we plot the x axis line again-- otherwise it does not show. The overlay option plots all four lines on the same image. The result is shown below.


pattern1 color=white;
pattern2 color=blue;
symbol1 i = j v = none c = black;
symbol2 i = j v = none c = black;
symbol3 i = j v = none c = black;
symbol4 i = j v = none c = black;
proc gplot data = test;
plot (zero sol fun zero) * x / overlay areas=2;
label zero = "x^2 -3";
run; quit;

Thursday, February 23, 2012

Example 9.21: The birthday "problem" re-examined

The so-called birthday paradox or birthday problem is simply the counter-intutitive discovery that the probability of (at least) two people in a group sharing a birthday goes up surprisingly fast as the group size increases. If the group is only 23 people, there is a 50% chance that two of them share a birthday, and with 40 people it's about 90%. There is an excellent wikipedia page discussing this.

However, this analytically derived probability is based on the assumption that births are equally likely on any day of the year. (It also ignores the occasional February 29th, and any social factors that lead people born at the same time of year to seek like spouses, and so forth.) But this assumption does not appear to be true, as laid out anecdotally and in press.

As noted in the latter link, any disparity in the probability of birth between days will improve the chances of a match. But how much? An analytic solution seems quite complex, even if we approximate the true daily distribution with a constant birth probability per month. Simulation will be simpler. While we're at it, we'll include leap days as well, since February 29th approaches.

SAS

Our approach here is based on the observation that the probability of at least one match among N people is equal to the sum of the probabilities of exactly one match in 2,...,N people. In addition, rather than simulating groups of 2, estimating the probability of a match, and repeating for groups of 3,...,N, we'll keep adding people to a group until we have a match, finding the probability of a match in all group sizes at once.

Here we use arrays (section 1.11.5) to keep track of the number of days in a month and of the people in our group. To reduce computation, we'll check for matches as we add people to the group, and only generate their birthdays if there is not yet a match. We also demonstrate the useful hyphen tool for referring to ranges of variables (1.11.4).


data bd1;
array daysmo [12] _temporary_  (31 28.25 31 30 31 30 31 31 30 31 30 31);
array dob [367] dob1 - dob367;  * these variables will hold the birthdays
                                * the hyphen includes all the variables in the
                                * sequence

do group = 1 to 10000000;       * simulate this many groups;
  match = 0;                    * initialize whether there's a match in this 
                                  group, yet;
  do i = 1 to 367;              * loop through up to 367 subjects... the maximum
                                  possible, obviously;
    month = rantbl(0, 31*.0026123, 28*.0026785, 31*.0026838, 30*.0026426,
        31*.0026702, 30*.0027424, 31*.0028655, 31*.0028954, 30*.0029407,
        31*.0027705, 30*.0026842);
                                * choose a month of birth, by probabilities reported
                                  in the Science News link, which are daily by month;
    day = ceil((4 * daysmo[month] * uniform(0))/4);  
                                * choose a day within the month,
                                  note the trick used to get Leap Days; 
    dob[i] = mdy(month, day, 1960);
                                * convert month and day into a day in the year--
                                  1960 is a convenient leap year;
 do j = 1 to (i-1) until (match gt 0);
                                * compare each old person to the new one;
   if dob[j] = dob[i] then match = i;
                                * if there was a match, we needed i people in the 
                                  group to make it;
   end;
 if match gt 0 then leave;  
                                * no need to generate the other 367-i people;
  end;
  output;
end;
run;

We note here that while we allow up to 367 birthdays before a match, the probability of more than 150 is so infinitesimal that we could save the space and speed up processing time by ignoring it. Now that the groups have been simulated, we just need to summarize and present them. We tabulate how many cases of groups of size N were recorded, generate the simple analytic answer, and merge them.


proc freq data = bd1;
tables match / out=bd2 outcum;  * the bd2 data set has the results;
run;

data simpreal;
set bd2;
prob = 1 - ((fact(match) * comb (365,match)) / 365**match);
realprob = cum_freq/10000000;
diff = realprob-prob;
diffpct = 100 * (diff)/prob;
run;

It's easiest to interpret the results by way of a plot. We'll plot the absolute and the relative difference on the same image with two different axes. The axis and symbol statements will make it slightly prettier, and allow us to make 0 appear at the same point on both axes.


axis1 order = (-.75 to .75 by .25) minor = none;
axis2 order = (-.00025 to .00025 by .00005) minor = none;
symbol1 v = dot h = .75 c = blue;
symbol2 font=marker v = U h = .5 c = red;

proc gplot data= simpreal (obs = 89);
plot diffpct * match / vref = 0 vaxis=axis1 legend;
plot2 diff *match/ vaxis = axis2 legend;
run; quit;

The results, shown below, are very clear-- leap day and the disequilibrium in birth month probability does increase the probability of at least one match in any group of a given size, relative to the uniform distribution across days assumed in the analytic solution. But the difference is miniscule in both the absolute and relative scale.

R
Here we mimic the approach used above, but use the apply() function family in place of some of the looping.


dayprobs = c(.0026123,.0026785,.0026838,.0026426,.0026702,.0027424,.0028655,
    .0028954,.0029407,.0027705,.0026842,.0026864)
daysmo = c(31,28,31,30,31,30,31,31,30,31,30,31)
daysmo2 = c(31,28.25,31,30,31,30,31,31,30,31,30,31)
# need both: the former is how the probs are reported, 
# while the latter allows leap days

moprob = daysmo * dayprobs

With the monthly probabilities established, we can sample a birth month for everyone, and then choose a birth day within month. We use the same trick as above to allow birth days of February 29th. Here we show code for 10,000 groups; with the simple cloud R this code was developed on, more caused a crash.

We've stopped referencing our book exhaustively, and doing so here would be tedious. Instead, we'll just comment that the tools we use here can be found in sections 1.4.5, 1.4.15, 1.4.16, 1.5.2, 1.8.3, 1.8.4, 1.9.1, 1.11.1, 5.2.1, 5.6.1, B.5.2, and probably others.


mob = sample(1:12,10000 * 367,rep=TRUE,prob=moprob)
dob = sapply(mob,function(x) ceiling(sample((4*daysmo2[x]),1)/4) )
# The ceiling() function isn't vectorized, so we make the equivalent
# using sapply().

mobdob = paste(mob,dob)
# concatenate the month and day to make a single variable to compare
# between people.  The ISOdate() function would approximate the SAS mdy() 
# function but would be much longer, and we don't need it.

# convert the vector into a matrix with the maximum
# group size as the number of columns
# as noted above, this could safely be truncated, with great savings
mdmat = matrix(mobdob, ncol=367, nrow=10000)

To find duplicate birthdays in each row of the matrix, we'll write a function to compare the number of unique values with the length of the vector, then call it repeatedly in a for() loop until there is a difference. Then, to save (a lot) of computations, we'll break out of the loop and report the number needed to make the match. Finally, we'll call this vector-based function using apply() to perform it on each row of the birthday matrix.


matchn = function(x) {
  for (i in 1:367){
    if (length(unique(x[1:i])) != i) break
  }
  return(i)
}

groups = apply(mdmat, 1, matchn)

bdprobs = cumsum(table(groups)/10000)
# find the N with each group number, divide by number of groups
# and get the cumulative sum

rgroups = as.numeric(names(bdprobs))
# extract the group sizes from the table
probs = 1 - ((factorial(rgroups) * choose(365,rgroups)) / 365**rgroups)
# calculate the analytic answer, for any group size 
# for which there was an observed simulated value

diffs = bdprobs - probs
diffpcts = diffs/probs

To plot the differences and percent differences in probabilities, we modify (slightly) the functions for a multiple-axis scatterplot we show in our book in section 5.6.1. You can find the code for this and all the book examples on the book web site.


addsecondy <- function(x, y, origy, yname="Y2") {
  prevlimits <- range(origy)
  axislimits <- range(y)
  axis(side=4, at=prevlimits[1] + diff(prevlimits)*c(0:5)/5,
       labels=round(axislimits[1] + diff(axislimits)*c(0:5)/5, 3))
  mtext(yname, side=4)
  newy <- (y-axislimits[1])/(diff(axislimits)/diff(prevlimits)) +
    prevlimits[1]
  points(x, newy, pch=2)
  abline(h=(-axislimits[1])/(diff(axislimits)/diff(prevlimits)) +
    prevlimits[1])
}

plottwoy <- function(x, y1, y2, xname="X", y1name="Y1", y2name="Y2")
{
  plot(x, y1, ylab=y1name, xlab=xname)
  abline(h=0)
  addsecondy(x, y2, y1, yname=y2name)
}

plottwoy(rgroups, diffs, diffpcts, xname="Number in group",
  y1name="Diff in prob", y2name="Diff in percent")
legend(80, .0013, pch=1:2, legend=c("Diffs", "Pcts"))

The resulting plot, (which is based on 100,000 groups, tolerable compute time on a laptop) is shown at the top. Aligning the 0 on each axis was more of a hassle than seemed worth it for today. However, the message is equally clear-- a clearly larger probability with the observed birth distribution, but not a meaningful difference.

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

SAS and R

Catalogs of posts