SAS and R: adding text to graphics

Showing posts with label adding text to graphics. Show all posts

Tuesday, February 15, 2011

Example 8.25: more latent class models (plus a graphical display)

In recent entries (here, here, here and here), we've been fitting a series of latent class models using SAS and R. One of the most commonly used and powerful package for latent class model estimation is Mplus. In this entry, we demonstrate how to use the MplusAutomation package to automate the process of fitting and interpreting a series of models using Mplus.

The first chunk of code needs to be run on a Windows computer with Mplus installed. We undertake the same analytic steps as before, then run the prepareMplusData() function to create the dataset, then createModels() to create the Mplus input files.


ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
attach(ds)
library(MplusAutomation)
cesdcut = ifelse(cesd>20, 1, 0)
smallds = na.omit(data.frame(homeless, cesdcut, 
  satreat, linkstatus))
prepareMplusData(smallds, file="mplus.dat")
createModels("mplus.txt")

Once the preliminaries have been completed, we can run each of the models, then scarf up the results.


runModels()
summary=extractModelSummaries()
models=readModels()

To help create graphical summaries of the classes, I crafted some code that utilizes the Mplus output files. These include functions to calculate the antilogit (alogit()), determine class probabilities in terms of model parameters (calcclasses() and findprobs()), and a routine to plot the resulting values (plotres()). Unfortunately, these routines require tweaking for models with different number of predictors (as well as care if some of the predictors have more than 2 levels).


alogit = function(x) {
        return(exp(x)/(1+exp(x)))
}

calcclasses = function(vals) {
   numvals = length(vals)
   classes = numeric(numvals+1)   
   for (i in 1:numvals) {
      classes[i] = exp(vals[i])/
         (1+sum(exp(vals[1:numvals])))
   }
   classes[numvals+1] = 1/(1+sum(exp(vals[1:numvals])))
   return(classes)
}

findprobs = function(df) {
   numclass = length(levels(as.factor(df$LatentClass)))-1
   v1 = numeric(numclass)
   v2 = numeric(numclass)
   v3 = numeric(numclass)
   v4 = numeric(numclass)
   for (i in 1:numclass) {
      v1[i] = alogit(-df$est[4*(i-1)+1])
      v2[i] = alogit(-df$est[4*(i-1)+2])
      v3[i] =  alogit(-df$est[4*(i-1)+3])
      v4[i] =  alogit(-df$est[4*(i-1)+4])
   }
   if (numclass>1) {
      classes=calcclasses(df$est[(4*numclass+1):(4*numclass+numclass-1)])
   } else classes=1
   return(list(prop=cbind(v1=v1, v2=v2, v3=v3, v4=v4), 
     classes=classes))
}

plotres = function(mylist, roundval=1, cexval=.75) {
# can only plot at most 4 classes!
   reorder = order(mylist$classes)
   probs = mylist$classes[reorder]
   results = mylist$prop
   dimen = dim(results)
   cols = c(1,4,2,5) # black, blue, red, and turquoise
   ltys = c(3,1,2,4) # dotted, solid, dash, dash-dot
   ltyslines = ltys[rank(-mylist$classes)]
   colslines = cols[rank(-mylist$classes)]
   ltysrev = rev(ltys[1:dimen[1]])
   colsrev = rev(cols[1:dimen[1]])
   plot(c(0.9, dimen[2]), c(-0.08,1), xlab="", 
           ylab="estimated prevalence", xaxt="n", type="n")
   abline(h=0, col="gray")
   abline(h=1, col="gray")
   for (i in 1:(dimen[1])) {
      lines(1:(dimen[2]), results[i,], lty=ltyslines[i], 
         col=colslines[i], lwd=2)
      points(1:(dimen[2]), results[i,], col=colslines[i])
   }
   text(1,0,"homeless", pos=1, cex=cexval)
   text(2,0,"cesdcut", pos=1, cex=cexval)
   text(3,0,"satreat", pos=1, cex=cexval)
   text(4,0,"linkstat", pos=1, cex=cexval)
   legendval = character(dimen[1])
   for (i in 1:(dimen[1])) {
      legendval[i] = paste("class ",i , " (", round(100*probs[i], 
         roundval), "%)", sep="")
   }
   legend(2, 0.4, legend=legendval, 
      lty=ltysrev, col=colsrev, cex=cexval, lwd=2)
}

files=list.files()
files=files[grep(".out",files)
]
par(mfrow=c(4,1), mar=c(1, 2, 2, 1) + 0.1)
for (i in 1:length(files)) {
   cat("file=",files[i],"\n")
   res = extractModelParameters(target=files[i])
   newres = findprobs(res$unstandardized)
   plotres(newres)
   title(substr(files[i], 1, nchar(files[i])-4))
}

Finally, by spelunking through the output files in the current directory (using list.files()) the results from each of the four models can be collated and displayed as seen in the Figure above.

The single class model simply reproduces the prevalences of each of the predictors. The two class solution primarily separates those not receiving substance abuse treatment from those that do. The three class solution further splits along substance abuse treatment, as well as homeless status. The four class solution is somewhat jumbled, with a group of homeless subjects comprising the largest class. None of the classes distinguish linkstatus (the primary outcome of the RCT).

Monday, November 8, 2010

Example 8.13: Bike ride plot, part 2

Before explaining how to make and interpret the plot above, Nick and I want to make a plea for questions--it's hard to come up with useful questions to explore each week!

As shown in Example 8.12, data from the Cyclemeter app can be used to make interesting plots of a bike ride. But in the previous application, questions remain. Why did my speed vary so much, for example? And why do some sections of the route appear to be very straight while others are relatively bumpy? To investigate these questions, we'll add some more data into the plot, using color. First, I'll shade the line according to my elevation-- my speed is probably faster when I'm going down hill. Then I'll add the points where the GPS connected; this may explain the straight sections. Doing all of this will require more complicated coding tasks.

(Data were read in previously.)

SAS

My initial thought was that I would use proc univariate to generate a histogram (section 5.1.4) and generate the color for the line from the histogram categories. But it turns out that you can't force the histogram to have your desired number of categories. (This is true for the R hist() function as well.) So I'm going to find the minimum and maximum elevation using proc summary (section 2.1), saving the results in a data set. Then I'll add those min and max values to each line of the data, using the tip found here. Finally, I'll make my own categories for elevation by looping through the category boundaries until I find one bigger than my observation.


proc summary data=ride;
  var elevation__feet_;
  output out=minmax max=maxelev min=minelev;;
run;

data aride;
set ride;
  setme = 1;
  set minmax point=setme;
  colorindex = 0;
  range = maxelev-minelev;
  do i = 0 to 5;
    if elevation__feet_ ge (minelev + (i * range/6) ) 
      then colorindex = i + 1;    
  end;
run;

To make the plot, I'm going to use the annotate facility, as before. However, the annotate %line macro won't work, for reasons I won't go into. So I need to make an annotate data set directly. Briefly, the way annotate data sets work is that they have a pre-specified variable list in which only certain values are allowed. Some of these are discussed in section 5.1, but in addition to these, below we use the function variable; when function="MOVE" the conceptual plotting pen moves without drawing. When function="DRAW" a line is plotted from the last location to the current one. Locations are determined by x and y variables. The line and size variables affect the line style and thickness.

The other interesting bit of the code below is the creation and use of a temporary array (section 1.11.5) for the colors. I choose the color from this array by indexing on the colorindex variable created above.

For help on the annotate facility, see Contents; SAS Products; SAS/GRAPH; The Annotate Facility. Colors are discussed in section 5.3.11, but we use them in a slightly different way, here. For help with color specification, see Contents; SAS Products; SAS/GRAPH; Concepts; Colors.


data annoride2;
set aride;
  if elevation__feet_ ne .;
  retain xsys '2' ysys '2' hsys '6';
  array mycolor [6] $ _temporary_ 
    ("CX252525" "CX636363" "CX969696" "CXBDBDBD" "CXD9D9D9" "CXF7F7F7");
  x = longitude;
  y = latitude;
  /* If this is the first observation, we need to move the pen to 
     the first point.  Otherwise we draw a line */
  if _n_ ne 1 then function = "DRAW"; 
  else function = "MOVE";
  color = mycolor[colorindex];
  line = 1;
  size = speed__miles_h_ * 2;
  output;
run;

Finally, we can plot the figure. I use the symbol statement (section 5.2.2) to plot nice dots where the GPS connected, and the axis statement (section 5.3.7) to remove the values of latitude and longitude, which don't interest me much. Analysis of the graph I leave for the R version.


axis1 major=none minor=none value=none;
symbol1 i=none v=dot c=black h=.2;
proc gplot data = ride;
plot latitude * longitude / 
   vaxis=axis1 haxis=axis1 annotate=annoride2;
run;
quit;

The resulting plot is shown above.

R

In R, I'm going to modify my vectorized version of the lines() function to additionally assign colors to each line. As in the SAS example, I will use categories of elevation to do this. Assigning the elevations to categories is a little trickier if I want to avoid for() loops. To do it, I will first find the category boundaries. I'll then use the which() function (as in section 1.13.2) to figure out the category boundaries smaller than the observation, and the max() function (section 1.8.1) to select the largest of these. Unfortunately, I also need the sapply() function (section B.5.3) so that I can repeat this process for each elevation and return a vector. I've annotated below to explain how this works. Finally, I use the RColorBrewer package to generate a vector of colors. (I also used it to find the colors I put in the SAS code.)


veclines2 = function(x, y, z, c) {
  require(RColorBrewer)
  breaks = min(c) + (0:5 * (max(c) - min(c))/6)
  # The sapply() function applies the function named in the 
  # second argument to each element of the vector named in the 
  # first argument.  Since I don't need this function again, 
  # I write it out in within the sapply() function call and don't 
  # even have to name it.
  ccat = sapply(c, function (r) {max(which(r >= breaks))} )
  mypalette = brewer.pal(6,"Greys")
  for (i in 1:(length(x)-1)) {
    lines(x[i:(i+1)], y[i:(i+1)], lwd=z[i], col=mypalette[7 - ccat[i]])
  }
}

Reader JanOosting pointed out that the segments() function will do exactly what my for-looped lines() function does. The segments() function requires "to" and "from" x and y locations. Here's a version of the above function that uses it.


bikeseg = function(x,y,z,c) {
  l=length(x)
  require(RColorBrewer)
  mypalette <-brewer.pal(6,"Greys")
breaks = min(c) + (0:5 * (max(c) - min(c))/6)
ccat = sapply(c,function (r) {max(which(r >= breaks))})
segments(x0 = x[1:(l-1)], y0 = y[1:(l-1)], 
  x1 = x[2:l], y1 = y[2:l],lwd=z,col=mypalette[7-ccat])
}

Then I call the function, after creating an empty plot and making the RColorBrewer library available. Finally, I add the points and some street names (see sections 5.2.3 and 5.2.11), to aid in interpretation. The way R draws the names live, as you enter commands, makes it much easier to place and rotate the names than in SAS, where you would have to re-make the annotate data set and run it to see the results.


plot(Longitude, Latitude, type="n")
library(RColorBrewer)
bikeseg(Longitude, Latitude, Speed..miles.h./2, Elevation..feet.)

points(Longitude, Latitude, cex=2, pch='.')
text(-72.495, 42.34, "Station Road")
text(-72.5035, 42.33, "Middle Street", srt=90)
text(-72.5035, 42.36, "Bike Path", srt=-39)
text(-72.54, 42.36, "Bike Path", srt=15)
text(-72.55, 42.347, "Maple", srt=107)
text(-72.54, 42.338, "Pomeroy")
text(-72.515, 42.331, "Potwine")

In the final image, thicker lines show greater speed, darker colors indicate lower elevations, and the dots are where the GPS connects. My best speed is achieved along Station Road, which is a long, straight drop. There aren't enough categories of color to explain most of the other variations in speed, although you can see me get slower as I near Middle Street along Potwine. The dots are fairly evenly spaced except along the bike path, where there is a lot of tree cover in some spots. This causes the long straight pieces near the top of the plot. Actually, since the phone is also sitting in my pocket as I ride, so I'm more pleased than disappointed with the perfomance of the iPhone and Cyclemeter.

Monday, April 5, 2010

Example 7.31: Contour plot of BMI by weight and height

A contour plot is a simple way to plot a surface in two dimensions. Lines with a constant Z value are plotted on the X-Y plane.

Typical uses include weather maps displaying "isobars" (lines of constant pressure), and maps displaying lines of constant elevation useful in, e.g., hiking. Unusual examples include maps of constant travel time, for example this map, which transforms space so that constant travel time can be presented as circles, or Francis Galton's travel map, below, in which the lines of constant travel time are reflected in the borders between colors.

As an example, here, we plot iso-BMI lines showing regions of underweight, normal weight, overweight, and obesity. As presented by National Heart, Lung, and Blood Institute, the borders between these categories are at BMI of 18.5, 25, and 30. Leaving aside questions of whether risk changes precipitously at these conveniently round numbers, it would be helpful to see what the categories imply across a range of heights and weights.

SAS

In SAS, we loop (section 1.11.1) through a range of heights (English) and weights (avoirdupois), calculating the BMI for each.


data bmi;
  do ht = 48 to 84 by ((84-48)/1000);
    do wt = 90 to 300 by ((300 -90)/1000);
      bmi = (wt*703)/(ht**2); 
   output;
 end;
  end;
run;

We then prepare some labels for the plot, using the annotate macros (section 5.2, 5.2.11).


%annomac;
data annobmi;
  %system(2,2,2); 
  %label(75,120,"Underweight",black,18,0,17);
  %label(70,150,"Normal",black,33,0,17);
  %label(65,170,"Overweight",black,37,0,17);
  %label(55,200,"Obese",black,40,0,17);
  run;

Finally, we use proc contour (section 5.1.12) to make the contour plot, using the anno option to plot the labels.


title h=1.25 "BMI categories by height and weight";
proc gcontour data = bmi anno=annobmi;
  plot wt * ht = bmi / levels = 18.5,25,30 nolegend;
  label ht="Height (inches)" wt="Weight (lbs)";
run;

R

In R, we build the data using the seq() function (section 1.11.3) for height and weight separately, and generate a data frame with all combinations of height and weight with the expand.grid() function (see example 7.22). After defining a function to calculate the BMI once, we apply it to the whole data set, storing the results in the matrix form we'll need using the matrix() function (section 1.9.1).


ht = seq(48,84, length.out=1000)
wt = seq(90,300, length.out=1000)
wtht = expand.grid(x=ht, y=wt)
bmi = function(h,w) {(w * 703)/(h*h)}
bmiwtht = matrix(bmi(wtht$x,wtht$y),length(ht),length(wt))

Now we're ready to use the contour() function (section 5.1.12) to make the plot and to add the labels with the text() function (section 5.2.11).


contour(ht,wt,bmiwtht,levels = c(18.5,25,30), drawlabels=FALSE,
        xlab="Height (inches)",ylab="Weight (lbs)",
        main="BMI categories by height and weight")

text(55,200,"Obese",cex=2,srt=45)
text(65,165,"Overweight",cex=2,srt=40)
text(70,150,"Normal",cex=2,srt=35)
text(75,120,"Underweight",cex=2,srt=18)

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

SAS and R

Catalogs of posts