SAS and R: lines()


Tuesday, March 13, 2012

Example 9.23: Demonstrating proportional hazards

A colleague recently asked for a slide suitable for explaining proportional hazards. In particular, she was concerned that her audience not focus on the time to event or probability of the event. An initial thought was to display the cumulative hazards, which have a constant ratio if the model is true. But the colleague's audience might get distracted both by the language (what's a "hazard"?) and by the fact that the cumulative hazard doesn't have a readily interpretable scale. The failure curve, meaning the probability of failure by time t, plotted over time, might be a bit more accessible.
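To make the distinction concrete, here is a minimal R sketch of what "proportional" means on each scale. It assumes a constant baseline hazard of 1/20 and a hazard ratio of 1.25; the time grid and variable names are illustrative choices only.

```r
# Illustrative assumptions: constant baseline hazard 1/20, hazard ratio 1.25
times  = seq(0.5, 10, by = 0.5)
hr     = 1.25
H_arm1 = times / 20          # cumulative hazard in the reference arm
H_arm2 = hr * H_arm1         # proportional hazards: H2(t) = HR * H1(t)
F_arm1 = 1 - exp(-H_arm1)    # failure probability by time t
F_arm2 = 1 - exp(-H_arm2)

ratio_H = H_arm2 / H_arm1    # the same value at every time point
ratio_F = F_arm2 / F_arm1    # changes with t, shrinking toward 1
round(cbind(times, ratio_H, ratio_F)[c(1, 10, 20), ], 3)
```

The ratio of cumulative hazards stays fixed, while the ratio of failure probabilities does not. So a failure-curve slide shows the consequence of proportional hazards rather than a constant ratio that the audience can read off directly.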

Rather than just draw some curves, we simulated data, based on the code we demonstrated previously. In this case, there's no need for any interesting censoring, but a more interesting survival curve seems worthwhile.

SAS
The more interesting curve is introduced by manually accelerating and slowing down the Weibull survival time demonstrated in the previous approach. We also trichotomized one of the exposures to match the colleague's study, and censored all values greater than 10 to keep focus where the underlying hazard was interesting.


data simcox;
  beta1 = .2;
  beta2 = log(1.25);
  lambdaT = 20; *baseline hazard;
  do i = 1 to 10000;
    x1 = normal(45);
    x2 = (normal(0) gt -1) + (normal(0) gt 1);
    linpred = -beta1*x1 - beta2*x2;
    * time of event;
    t = rand("WEIBULL", 1, lambdaT * exp(linpred));
    * accelerate time beyond 5, then slow it down beyond 7;
    if t gt 5 then t = 5 + (t-5)/10;
    if t gt 7 then t = 7 + (t-7) * 20;
    censored = (t > 10);
    output;
  end;
run;

The phreg procedure will fit the model and produce nice plots of the survival function and the cumulative hazard. But to generate useful versions of these, you need to make an additional data set with the covariate values you want to show plots for. We set the other covariate's value at 0. You can also include an id variable with some descriptive text.


data covars;
x1 = 0; x2 = 2; id = "Arm 3"; output;
x1 = 0; x2 = 1; id = "Arm 2"; output;
x1 = 0; x2 = 0; id = "Arm 1"; output;
run;

proc phreg data = simcox plot(overlay)=cumhaz;
class x2 (ref = "0");
baseline covariates = covars out= kkout cumhaz = cumhaz 
    survival = survival / rowid = id;
model t*censored(1) = x1 x2;
run;

The cumulative hazard plot generated by the plot option is shown below, demonstrating the correct relative hazard of 1.25. The related survival curve could be generated with plot = s .

To get the desired plot of the failure times, use the out = option to the baseline statement. This generates a data set with the listed statistics (here the cumulative hazard and the survival probability, across time). Then we can produce a plot using the gplot procedure, after generating the failure probability.


data kk2;
set kkout;
iprob = 1 - survival;
run;

goptions reset=all;
legend1 label = none value=(h=2);
axis1 order = (0 to 1 by .25) minor = none value=(h=2)
  label = (a = 90 h = 3 "Probability of infection");
axis2 order = (0 to 10 by 2) minor = none value=(h=2)
  label = (h=3 "Attributable time");
symbol1 i = sm51s v = none w = 3 r = 3;
proc gplot data = kk2;
plot iprob * t = id / vaxis = axis1 haxis = axis2 legend=legend1;
run; quit;

The result is shown at the top. Note the use of the h= option in various places in the axis and legend statements to make the font more visible when shrunk to fit onto a slide. The smoothing spline plotted through the data with the smXXs interpolation makes a nice shape out of the underlying abrupt changes in the hazard. The symbol, legend, and axis statements are discussed in chapter 6.

R
As in SAS, we begin by simulating the data. Note that we use the simple categorical variable simulator mentioned in the comments for example 7.20.


n = 10000
beta1 = .2
beta2 = log(1.25)
lambdaT = 20

x1 = rnorm(n,0)
x2 = sample(0:2,n,rep=TRUE,prob=c(1/3,1/3,1/3))
# true event time
T = rweibull(n, shape=1, scale=lambdaT*exp(-beta1*x1 - beta2*x2)) 
T[T>5] = 5 + (T[T>5] -5)/10
T[T>7] = 7 + (T[T>7] -7) * 20
event = T < rep(10,n)
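The two rescaling lines are easier to reason about when wrapped in a function. The sketch below uses a hypothetical helper, warp(), which is not part of the original code, to show what happens to a few raw event times.

```r
# Hypothetical helper wrapping the two-step rescaling of the event times
warp = function(t) {
  t = ifelse(t > 5, 5 + (t - 5) / 10, t)   # compress time beyond 5: events arrive faster
  t = ifelse(t > 7, 7 + (t - 7) * 20, t)   # stretch time beyond 7: events arrive more slowly
  t
}

raw = c(1, 5, 10, 25, 26, 30)
cbind(raw, warped = warp(raw))
```

The transformation is continuous and increasing, so the ordering of the event times is preserved. Only the pace at which events accumulate changes, which is what gives the survival curve its more interesting shape.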

Now we can fit the model using the coxph() function from the survival package. There is a default method for plot()ing the "survfit" objects resulting from several package functions. However, it shows the typical survival plot. To show failure probability instead, we'll manually take the complement of the survival probability, but still take advantage of the default method. Note the use of the xmax option to limit the x-axis. The results (below) are somewhat bland, and it's unclear if the lines can be colored differently, or their widths increased. They are also as angular as the cumulative hazards shown in the SAS implementation.


library(survival)
plotph = coxph(Surv(T, event)~ x1 * strata(x2), 
     method="breslow")
summary(plotph)
sp = survfit(plotph)
sp$surv = 1 - sp$surv
plot(sp, xmax=10)
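As a side note, the plot() method for "survfit" objects in current versions of the survival package documents a fun argument along with col and lwd. The sketch below assumes such a version is installed and uses a small simulated data set of its own (not the data from this entry) so that it runs independently.

```r
library(survival)

# Small simulated data set (an assumption for illustration only)
set.seed(923)
n     = 2000
x1    = rnorm(n)
x2    = sample(0:2, n, replace = TRUE)
time  = rexp(n, rate = exp(0.2 * x1 + log(1.25) * x2) / 20)
event = time < 10

fit = coxph(Surv(time, event) ~ x1 + strata(x2), method = "breslow")
sp2 = survfit(fit)

# fun = "event" plots 1 - survival; col and lwd are applied across the strata
plot(sp2, fun = "event", xmax = 10, col = c("blue", "red", "green"), lwd = 3,
     xlab = "Attributable time", ylab = "Probability of infection")
```

The curves drawn this way are still step functions, so the smoother look of the SAS plot still calls for extracting the values and working with them directly.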

Consequently, as is so often the case, presentation graphics require more manual fiddling. We'll begin by extracting the data we need from the "survfit" object. We'll take just the failure times and probabilities, as well as the name of the strata to which each observation belongs, limiting the data to time < 10. The last of these lines uses the names() function to pull the names of the strata, repeating each an appropriate number of times with the useful rep() function.


sp = survfit(plotph)
failtimes = sp$time[sp$time <10]
failprobs = 1 - sp$surv[sp$time <10]
# sp$strata holds the number of time points recorded in each stratum
failcats = rep(names(sp$strata), times=sp$strata)[sp$time <10]

All that remains is plotting the data, which is not dissimilar to many examples in the book and in this blog. There's likely some way to make these three lines with a little less typing, but knowing how to do it from scratch gives you the most flexibility. It proved difficult to get the desired amount of smoothness from the loess(), lowess(), or supsmu() functions, but smooth.spline() served admirably. The code below demonstrates increasing the axis and tick label sizes.


plot(failprobs~failtimes, type="n", ylim=c(0,1), cex.lab=2, cex.axis= 1.5, 
     ylab= "Probability of infection", xlab = "Attributable time")
lines(smooth.spline(y=failprobs[failcats == "x2=0"], 
     x=failtimes[failcats == "x2=0"],all.knots=TRUE, spar=1.8), 
     col = "blue", lwd = 3)
lines(smooth.spline(y=failprobs[failcats == "x2=1"], 
     x=failtimes[failcats == "x2=1"],all.knots=TRUE, spar=1.8), 
     col = "red", lwd = 3)
lines(smooth.spline(y=failprobs[failcats == "x2=2"], 
     x=failtimes[failcats == "x2=2"],all.knots=TRUE, spar=1.8),
     col = "green", lwd = 3)
legend(x=7,y=0.4,legend=c("Arm 1", "Arm 2", "Arm 3"), 
     col = c("blue","red","green"), lty = c(1,1,1), lwd = c(3,3,3) )
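One way to reduce the typing is a loop over the strata names. The sketch below is only one possibility. So that it runs independently, it builds stand-in versions of failtimes, failprobs, and failcats from exponential failure curves instead of using the values extracted from the "survfit" object.

```r
# Stand-in data (an assumption); in the entry these come from the survfit object
failtimes = rep(seq(0.1, 9.9, by = 0.1), times = 3)
failcats  = rep(c("x2=0", "x2=1", "x2=2"), each = 99)
rate      = rep(c(1, 1.25, 1.25^2) / 20, each = 99)
failprobs = 1 - exp(-rate * failtimes)

arms = c("x2=0", "x2=1", "x2=2")
cols = c("blue", "red", "green")

plot(failprobs ~ failtimes, type = "n", ylim = c(0, 1), cex.lab = 2, cex.axis = 1.5,
     ylab = "Probability of infection", xlab = "Attributable time")
fits = list()
for (i in seq_along(arms)) {
  keep = failcats == arms[i]                      # observations in this stratum
  fits[[i]] = smooth.spline(x = failtimes[keep], y = failprobs[keep],
                            all.knots = TRUE, spar = 1.8)
  lines(fits[[i]], col = cols[i], lwd = 3)        # one smoothed curve per arm
}
legend(x = 7, y = 0.4, legend = c("Arm 1", "Arm 2", "Arm 3"),
       col = cols, lty = 1, lwd = 3)
```

Keeping the arm names and colors in parallel vectors means a fourth arm would need only one more entry in each.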

Thursday, March 1, 2012

Example 9.22: shading plots and inequalities

A colleague teaching college algebra wrote to the R-sig-teaching list asking for assistance in plotting the solutions to the inequality x^2 - 3 > 0. This type of display is handy in providing a graphical solution to accompany an analytic one.

R
The plotFun() function within the mosaic package comes in handy here.


library(mosaic)
plotFun( x^2 -3 ~ x, xlim=c(-4,4))
ladd(panel.abline(h=0,v=0,col='gray50'))
plotFun( (x^2 -3) * (x^2 > 3) ~ x, type='h', alpha=.5, 
  lwd=4, col='lightblue', add=TRUE)
plotFun( x^2 -3 ~ x, xlim=c(-4,4), add=TRUE)

As is common when crafting figures using R, the final product is built up in parts. First the curve is created, then vertical and horizontal lines are added. The shading is done using a second call to plotFun(). Finally, the curve is plotted again, to leave it on top of the final figure.

Alternatively, one might want to construct the solution more directly. This is fairly straightforward using the lines() (section 5.2.1) and polygon() (sections 2.6.4, 5.2.13) functions.


x = seq(-4,4,length=81)
fun = (x^2 -3)
sol = ((x^2 -3) * (x^2 > 3))

plot(x,fun, type="l", ylab=expression(x^2 - 3))
lines(x, sol)
polygon(c(-4,x,4), c(0,sol,0), col= "gray", border=NA)
abline(h=0, v=0)

The type="l" option to plot() draws a line plot instead of the default scatterplot. In the polygon() call we add points on the x axis to close the shape. One advantage of this approach is that the superscript can be correctly displayed in the y axis label, as shown in the plot below.
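To tie the picture back to the analytic solution, the boundaries of the shaded regions can be found numerically. This is a small sketch using uniroot() from base R; the search intervals are chosen by eye from the plot.

```r
f = function(x) x^2 - 3

# The function changes sign once on each side of zero within [-4, 4]
left  = uniroot(f, c(-4, 0))$root
right = uniroot(f, c(0, 4))$root

c(left = left, right = right)   # compare with -sqrt(3) and sqrt(3)
```

The inequality holds for x below the left root or above the right root, which matches the two shaded regions.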

SAS
In SAS we'll construct the plot from scratch, using a data step to generate the function and solution, and the areas option of the plot statement in proc gplot to make the shaded areas. The areas option fills in the area between the first line and the bottom of the plot, then between pairs of lines, so we have to draw the x axis manually, and we'll make the data for this as well.


data test;
do x = -4 to 4 by .1;
  sol = (x*x - 3) * (x*x >3);
  fun = x*x-3;
  zero = 0;
  output;
  end;
run;

The symbol statements are required so that there are lines to shade between. The pattern statements define the colors to use in the shading. Here we get a white color below the x axis (the plotted zero line), then a blue color between the solution and the x axis. Then we plot the x axis line again; otherwise it does not show. The overlay option plots all four lines on the same image. The result is shown below.


pattern1 color=white;
pattern2 color=blue;
symbol1 i = j v = none c = black;
symbol2 i = j v = none c = black;
symbol3 i = j v = none c = black;
symbol4 i = j v = none c = black;
proc gplot data = test;
plot (zero sol fun zero) * x / overlay areas=2;
label zero = "x^2 -3";
run; quit;

Tuesday, March 15, 2011

Example 8.30: Compare Poisson and negative binomial count models

How similar can a negative binomial distribution get to a Poisson distribution?

When confronted with modeling count data, our first instinct is to use Poisson regression. But in practice, count data is often overdispersed. We can fit the overdispersion in the Poisson (Section 4.1) using quasi-likelihood methods, but a better alternative might be to use a negative binomial regression (section 4.1.5). Nick has a paper exploring these models (and others) in an application.

One concern about this is how well the negative binomial might approximate the Poisson, if in fact a Poisson obtains.

We present here a function and a macro to explore how similar the negative binomial can get to the Poisson, if we keep the means of the distributions equal. But before doing so, it will be helpful to review their definitions:

The Poisson is defined as P(Y=y | λ) = [e^(-λ) λ^y] / y!
and the negative binomial as: P(X=x | n,p) = [(n + x - 1)! / (x! (n-1)!)] p^n (1-p)^x

In the Poisson, the mean is λ, while the negative binomial counts the number of failures x before n successes, where the probability of success is p. The mean of X is n(1-p)/p. There are several characterizations of the negative binomial.

R

In R, the pnbinom() function (section 1.10) can be called either with the parameters n and p given above, or by specifying the mean mu and a dispersion parameter (denoted size, which plays the role of n), where mu = n(1-p)/p as above. It's convenient to parameterize via the mean, to keep the negative binomial mean equal to the Poisson mean.
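A quick numerical check of the two parameterizations may be reassuring. The sketch below uses arbitrary illustrative values (n = 5, mu = 2) and the failures-before-n-successes form of the negative binomial, with p = n/(n + mu).

```r
n  = 5                 # number of successes (R's size argument)
mu = 2                 # target mean
p  = n / (n + mu)      # success probability implied by the mean
x  = 0:10

# Probability function written out by hand
by_hand = factorial(n + x - 1) / (factorial(x) * factorial(n - 1)) * p^n * (1 - p)^x
all.equal(by_hand, dnbinom(x, size = n, prob = p))

# The (size, prob) and (size, mu) forms give the same cumulative probabilities
all.equal(pnbinom(x, size = n, prob = p), pnbinom(x, size = n, mu = mu))

# The mean, summed far enough into the tail, agrees with n*(1-p)/p
xs = 0:1000
c(summed = sum(xs * dnbinom(xs, size = n, prob = p)), formula = n * (1 - p) / p)
```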

Our function will accept a series of integers and a mean value as input, and plot the Poisson cumulative probabilities and the negative binomial cumulative probabilities for three values of n. We make use of the type="n" option in the plot() function (section 5.1.1) and add the negative binomial values with the lines() function (section 5.2.1).


poissonvsnb = function(values,mean)  {
  probs = ppois(values,mean)
  plot(y=probs, x=values, type="n", ylim=c(0,1))
  lines(y=probs, x=values, col="red")
  readline("Poisson shown. Press Enter to continue...")
  nbprobs1 = pnbinom(values, mu=mean, size=1)
  nbprobs5 = pnbinom(values, mu=mean, size=5)
  nbprobs40 = pnbinom(values, mu=mean, size=40)
  lines(y=nbprobs1, x=values, col="black")
  lines(y=nbprobs5, x=values, col="blue")
  lines(y=nbprobs40, x=values, col="green")
}
poissonvsnb(0:10,1)

The result is shown above. The red line representing the Poisson is completely overplotted by the negative binomial with size=40. This can be seen when running live, due to the readline() statement, which waits for input before continuing.
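Since the overplotting hides how close the curves actually are, a numerical summary can help. This sketch computes the largest absolute gap between the Poisson and negative binomial cumulative probabilities for the same three size values, using the same mean and range of values as the call above.

```r
values = 0:10
mu     = 1
sizes  = c(1, 5, 40)

# Largest absolute difference in cumulative probability for each size
maxdiff = sapply(sizes, function(s) {
  max(abs(ppois(values, mu) - pnbinom(values, mu = mu, size = s)))
})
names(maxdiff) = paste0("size=", sizes)
maxdiff
```

The gap shrinks as size grows, consistent with the size=40 curve sitting on top of the Poisson.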

SAS

In SAS, the cdf function (section 1.10) does not have the flexibility of parameterizing directly via the mean. Its NEGBINOMIAL distribution is specified by the success probability p and the number of successes n, and describes the number of failures, which has mean n(1-p)/p. Thus if we want to hold the mean constant at mu, we need to solve mu = n(1-p)/p for p and find probabilities from the distribution where p = n/(n + mu).

To make this process a little less cumbersome to type, we'll also demonstrate the use of proc fcmp, which allows you to compile functions that can be used in data steps and some other procedures. In general, it works as you might hope, with a function statement and a return statement. The only hassle is telling SAS where to store the functions and where to find them when they're needed.


proc fcmp outlib=sasuser.funcs.test;
function poismean_nb(mean, size);
  return(size/(mean+size));
  endsub;
run;

options cmplib = sasuser.funcs;
run;

Now we're ready to write a macro to replicate the R function. Note how the new function is nested within the call to the cdf function, with the appropriate size parameter. The overlay option allows plotting several y values on the same x axis; the r option to the symbol statement (section 5.1.19) keeps the symbol in effect for several y values. SAS generates a legend easily; this allows us to see the (mostly overplotted) Poisson. A pause like the one readline() provides in R is not available.

As a suggestion about how to write macros in SAS, I left this one a little messy. I first wrote the code to make the plot once, with the number of X values and the mean specified in the code with fixed values. This makes two extra lines of code, but when I converted to a macro, I only needed to change the fixed values to the macro parameters. For elegance, I would omit the first two lines and replace the later occurrences of n and mean with the macro parameters.


%macro nbptest(maxn, mean);
data nbp;
n = &maxn;
mean = &mean;
do i = 0 to n;
  probpois = cdf("POISSON", i, mean);
  probnb1 = CDF("NEGBINOMIAL", i, poismean_nb(mean, 1), 1);
  probnb5 = CDF("NEGBINOMIAL", i, poismean_nb(mean, 5), 5);
  probnb40 = CDF("NEGBINOMIAL", i, poismean_nb(mean, 40), 40);
  output;
  end;
run;

axis1 order = (0 to 1 by .2) minor=none ;
symbol1 v=none i=j r=4;
proc gplot data=nbp;
plot (probpois probnb1 probnb5 probnb40)*i / 
       overlay vaxis=axis1 legend;
run; quit;
%mend;

%nbptest(10,2);

The results are shown below. The negative binomial approaches the Poisson very closely as size increases, holding the mean constant.

Monday, November 1, 2010

Example 8.12: Bike ride plot, part 1

The iPhone app Cyclemeter uses the phone's GPS capability to record location and other data, and infer speed, while you ride. I took a ride near my house recently, and downloaded the data. I'd like to examine my route and my speed. A simple plot of the route is trivial in either SAS or R, but adding the speed data requires a little work. The data can be downloaded from the URL used in the following code, which reads the file directly.

SAS

In SAS, I first use proc import with the url filetype, as shown in section 1.1.6. I can then make a simple plot of the route using the i=j option to the symbol statement (as in section 1.13.5), which simply joins successive points.


filename bike url 'http://www.kenkleinman.net/files/cycle-data-10022010.csv';

proc import datafile=bike out=ride dbms=dlm;
  delimiter=',';
  getnames=yes;
run;

symbol1 i=j;
proc gplot data=ride;
  plot latitude * longitude;
run;

I didn't project the data, so this looks a little compressed north-south.

To show my speed at each point, I decided to make a thicker line when I'm going faster. To do this, I use the annotate macros discussed in section 5.2. I decided to use the %line macro to do this, but that requires each observation in the data set have a starting point and an ending point for its section of line. I use the lag function (section 1.4.17) in a separate data step to add the previous point to each observation. Then I create the annotate data set. Finally, I use the value = none option to the symbol statement to make an empty plot and the annotate data set draws the line for me.


data twopoints;
set ride;
  lastlat = lag(latitude);
  lastlong = lag(longitude);
  if _n_ ne 1;
run;

%annomac;
data annoride;
set twopoints;
%system(2,2,6);
%line(longitude,latitude,lastlong,lastlat,
  black,1,speed__miles_h_);
run;

symbol1 v=none;
proc gplot data=ride;
  plot latitude * longitude / annotate=annoride;
run;
quit;

The resulting plot shown below closely resembles the R plot shown at the top of this entry.

R

In R, it's as trivial to make the simple plot as in SAS. Just read in the CSV data from the URL (sections 1.1.2, 1.1.6), make an empty plot (5.1.1), and add the lines (5.2.1).


myride=read.csv("http://www.kenkleinman.net/files/cycle-data-10022010.csv")
attach(myride)
plot(Longitude, Latitude, type="n")
lines(Longitude, Latitude)
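The north-south compression noted in the SAS section can be reduced without a full projection by setting the plot's aspect ratio. The sketch below uses the asp argument of plot() with a common approximation, 1/cos(latitude). The coordinates are made-up stand-ins, so the block does not depend on downloading the ride data.

```r
# Made-up coordinates standing in for the ride data (an assumption)
Longitude = c(-72.52, -72.51, -72.50, -72.51, -72.52)
Latitude  = c(42.37, 42.38, 42.37, 42.36, 42.37)

# A degree of latitude covers more ground than a degree of longitude
# by a factor of roughly 1/cos(latitude)
asp_ratio = 1 / cos(mean(Latitude) * pi / 180)

plot(Longitude, Latitude, type = "n", asp = asp_ratio)
lines(Longitude, Latitude)
```

This is only a local approximation, but over the few miles of a bike ride it should be close enough for display purposes.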

Now I want to show the speed, as above. The lines() function has a lwd= option, but unfortunately, it's not vectorized. In other words, it accepts only a scalar that applies to all the line segments drawn in a given call. To get around that, I'll write my own vectorized version of lines() using the disfavored for() function. It calls lines() for each pair of points, with an appropriate lwd value.


veclines = function(x, y, z) {
  for (i in 1:(length(x)-1)) {
    lines(x[i:(i+1)], y[i:(i+1)], lwd=z[i])
  }
}
veclines(Longitude, Latitude, Speed..miles.h./2)
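An alternative that avoids the loop is segments(), whose documentation states that col, lty, and lwd can be vectors. The sketch below uses a small made-up path and width vector in place of the ride data; x, y, and z are illustrative names.

```r
# Made-up path and line widths standing in for longitude, latitude, and speed
x = c(0, 1, 2, 3, 4)
y = c(0, 1, 0, 1, 0)
z = c(1, 2, 4, 8)              # one width per segment

n = length(x)
plot(x, y, type = "n")
# Each segment runs from point i to point i + 1, with its own lwd
segments(x[-n], y[-n], x[-1], y[-1], lwd = z)
```

The hand-written loop remains useful when something other than a plain line segment needs to be drawn for each pair of points.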

The result is displayed at the top of this blog entry. In the next entry we'll add more information to help explain why the speed varies.

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education.

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses.
