
Monday, August 18, 2014

Example 2014.10: Panel by a continuous variable

In Example 8.40, side-by-side histograms, we showed how to generate histograms for some continuous variable, for each level of a categorical variable in a data set. An anonymous reader asked how we would do this if both the variables were continuous. Keep the questions coming!

The SAS solution we presented relied on the sgpanel procedure. There, the panelby statement names a variable for which each distinct value will generate a panel. If there are many values, for example for a continuous variable, there will be many panels generated, which is probably not the desired result. As far as we know, there is no option to automatically categorize a continuous panel variable in proc sgpanel. If this is required, a two-step approach will be needed to first make groups of one of the variables.

We do that below using proc rank. In this approach, the groups option is the number of groups required and the ranks statement names a new variable to hold the group indicator. Once the groups are made, the same code demonstrated earlier can be used. (This is an example of "it's never too late to learn"-- I used to do this via a sort and a data step with implied variables, until I realized that there had to be a way to do it via a procedure. --KK)

In this setting, the panels are another approach to the data we examine in a scatterplot. As an example, we show the physical component score (PCS) by grouping of the mental component score (MCS) in the HELP data set.
proc rank data = 'c:\book\help.sas7bdat' groups = 6 out = catmcs;
var mcs;
ranks mcs_sextile;
run;

title "Histograms of PCS by sextile of MCS";
proc sgpanel data = catmcs;
  panelby mcs_sextile / columns = 3 rows = 2;
  histogram pcs;
run;
We also demonstrate the columns and rows options to the panelby statement, which allow control over the presentation of the panel results. The graphic produced is shown above.

Our R solution in the earlier entry used the lattice package (written by Deepayan Sarkar) to plot a formula such as histogram(~a | b). A simple substitution of a continuous covariate b into that syntax will also generate a panel for each distinct value of the covariate, which is probably not what we want: a factor is expected. In the package, an implementation of Trellis graphics, the term "shingles" is used for the notion of categorizing a continuous variable for making panels. The function equal.count() is provided to make the (possibly overlapping) categories of the variable, and uses the panel headers to suggest the ranges of the continuous covariate that are included in each panel.
ds = read.csv("")
histogram(~ pcs | equal.count(mcs), 
   main="Histograms of PCS by shingle of MCS",
   index.cond=list(c(4,5,6,1,2,3)), data=ds)
Note that the default ordering of panels in lattice is left to right, bottom to top. The index.cond option here re-orders the panels to go from left to right, top to bottom.

The default behavior of equal.count() is to allow some overlap between the categories, which is a little odd. In addition, there is a good deal of visual imprecision in the method used to identify the panels-- there's no key given, and the only indicator of the shingle value is the shading of the title bars. A more precise method would be to use the quantile() function manually, as we demonstrated in example 8.7, the Hosmer and Lemeshow goodness-of-fit test. We show here how the mutate() function in Hadley Wickham's dplyr package can be used to add a new variable to a data frame.

ds = mutate(ds, cutmcs = cut(ds$mcs, 
   breaks = quantile(ds$mcs, probs=seq(0,1, 1/6)), include.lowest=TRUE))
histogram(~ pcs | cutmcs,  main="Histograms of PCS by sextile of MCS",
          index.cond=list(c(4,5,6,1,2,3)), data=ds)
This shows the exact values of the bin ranges in the panel titles, surely a better use of that space. Minor differences in the histograms are due to the overlapping categories included in the previous version.

Finally, we also show the approach one might use with the ggplot2 package, an implementation of Leland Wilkinson's Grammar of Graphics, coded by Hadley Wickham. The package includes the useful cut_number() function, which does something similar to the cut(..., breaks=quantile(...)) construction we showed above. In ggplot2, "facets" are analogous to the shingles used in lattice.
ds = mutate(ds, cutmcsgg = cut_number(ds$mcs, n=6))
ggplot(ds, aes(pcs)) + geom_bar() + 
  facet_wrap(~cutmcsgg) + ggtitle("Histograms of PCS by sextile of MCS")
Roughly, we can read the syntax to state: 1) make a plot from the ds dataset in which the primary analytic variable will be pcs; 2) make histograms; 3) make facets of the cutmcsgg variable; 4) add a title. Since the syntax is a little unusual, Hadley provides the qplot() function, a wrapper which operates more like traditional functions. An identical plot to the above can be generated with qplot() as follows:
qplot(data=ds,x=pcs, geom="bar", facets= ~cutmcsgg, 
   main="Histograms of PCS by sextile of MCS")

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Monday, August 11, 2014

Example 2014.9: Rolling averages. Also: Second Edition is shipping!

As of today, the second edition of "SAS and R: Data Management, Statistical Analysis, and Graphics" is shipping from CRC Press, Amazon, and other booksellers. There are lots of additional examples from this blog, new organization, and other features we hope you'll find useful. Thanks for your support. We'll be continuing to blog.

Now, on to today's main course.
For cyclical data, it's sometimes useful to generate rolling averages-- the average of some number of recent measurements, usually one full cycle. For example, for retail sales, one might want the rolling average of the most recent week. The rolling average will dampen the effects of repeated patterns but still show the location of the data.
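As a quick sketch of the idea (toy data, base R only; the specific numbers are made up for illustration), a rolling mean over the most recent k values can be computed with stats::filter() using one-sided weights:

```r
# toy series with a cycle of length 2: high, low, high, low, ...
x <- c(10, 2, 11, 3, 12, 4, 13, 5)
# rolling mean of the most recent 4 values (current value and 3 previous)
ra <- stats::filter(x, rep(1/4, 4), sides = 1)
ra  # first 3 entries are NA; the 4th is mean(x[1:4]) = 6.5
```

The filtered series hovers near the overall level while the raw series oscillates, which is the dampening described above.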

In keeping with our habit of plotting personal data (e.g., Example 8.11, Example 8.12, Example 10.1, Example 10.2), I'll use my own weight recorded over the past 6 months. After reading about "alternate day dieting" in The Atlantic, I decided to try the diet described in the book by Varady. I've never really tried to diet for weight loss before, but this diet has worked really well for me over the past six months. The basics are that you eat 500 calories every other day (diet days) and on the non-diet days you eat what you want. There's a little science supporting the approach. I can't really recommend the book, unfortunately, unless you're a fan of the self-help style.

As you can imagine, one's weight tends to fluctuate pretty wildly between diet days and non-diet days. The cycle is just two days, but to get a sense of my weight at any given time, it might be best to use the rolling average of the past, say, four days.

The beginning of the data, available from, follows.
1/11/14 219
1/12/14 NA
1/13/14 219
1/14/14 NA
1/15/14 221.8
1/16/14 218

As you can tell from the NAs, I compiled the data with the intent to read it into R.
> weights = read.table("c:/temp/weight.txt")
> head(weights)

       V1    V2
1 1/11/14 219.0
2 1/12/14    NA
3 1/13/14 219.0
4 1/14/14    NA
5 1/15/14 221.8
6 1/16/14 218.0
Note, though, that the date values are just character strings (read in as a factor variable), and not so useful as read in.
> str(weights)
'data.frame': 161 obs. of  2 variables:
 $ V1: Factor w/ 161 levels "1/11/14","1/12/14",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ V2: num  219 NA 219 NA 222 ...
The lubridate package contributed by the invaluable Hadley Wickham contains functions to make it easier to use dates in R. Here, I use its mdy() function to convert character values into R dates.
with(weights, plot(V2 ~ mdy(V1), 
  xlim = c(mdy("1/1/14"),mdy("6/30/14")),
  ylab="Weight", xlab="Date"))
The simple plot has enough values that you can clearly see the trend of weight loss over time, and perhaps the rolling average exercise is somewhat misplaced, here. To calculate the rolling average, I adapted (below) the lag function from section 2.2.18 (2nd edition; 1.4.17 in the 1st ed.)-- this is a simpler version that does not check for errors. The result of lag(x,k) is a vector with the first k values missing and with the remaining values being the beginning values of x. Thus the ith value of lag(x,k) is x[i-k]. To get the rolling average, I just take the mean of several lags. Here I use the rowMeans() function to do it for all the values at once. The lines() function adds the rolling average to the plot.
lag = function(x,k) {
  return( c(rep(NA,k), x[1:(length(x)-k)]) )
}

y = weights$V2
ra = rowMeans(cbind(y, lag(y,1), lag(y,2), lag(y,3)), na.rm=T)
with(weights, lines(mdy(V1), ra))
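As a quick check on the simplified lag() (repeating its definition so the snippet stands alone):

```r
lag <- function(x, k) {
  # first k values missing; the rest are the beginning values of x
  c(rep(NA, k), x[1:(length(x) - k)])
}
lag(c(219, NA, 219, NA, 221.8), 2)
# NA NA 219.0 NA 219.0 -- the ith value is x[i-k]
```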

The final plot is shown above. Note that the initial values of the lagged vector are missing, as are weights for several dates throughout this period. The na.rm=T option causes rowMeans() to return the mean of the observed values-- equivalent to a single imputation of the mean of the observed values, which perhaps Nick will allow me in this setting (note from NH: I don't have major issues with this). There are also two periods where I failed to record weights for four days running. For these periods, rowMeans() returns NaN, or "Not a Number". This is usefully converted to regions in the plot where the running average line is not plotted. Compare, for instance, with the default SAS behavior shown below. For the record, I was ill in early May and had little appetite regardless of my dieting schedule.
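The distinction between NA input and NaN output from rowMeans() is easy to verify on a small matrix:

```r
m <- matrix(c(1,  2,
              NA, 4,
              NA, NA), nrow = 3, byrow = TRUE)
rowMeans(m, na.rm = TRUE)
# 1.5 for the complete row, 4 where one value is missing,
# and NaN where every value in the row is missing
```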

The data can be easily read with the input statement. The mmddyy7. informat tells SAS that the data in the first field are as many as 7 characters long and should be read as dates. SAS will store them as SAS dates (section 2.4 in the 2nd edition; 1.6 in the 1st edition). As the data are read in, I use the lagk functions (section 2.2.18 2nd edition; 1.4.17 in the 1st ed.) to recall the values from recent days and calculate the rolling average as I go.
data weights;
infile "c:\temp\weight.txt";
input date mmddyy7. weight;
ra = mean(weight, lag(weight), lag2(weight), lag3(weight));
run;
Note that the input statement expects the weight values to be numbers, and interprets the NAs in the data as "Invalid data". It inserts missing values into the data set, which is what we desire. The mean function provides the mean of the non-missing values. When the weight and all of the lagged values of weight are missing, it will return a missing value. With the rolling average in hand, I can plot the observed weights and the rolling average. To print Julian dates rather than SAS dates, use the format statement to tell SAS that the date variable should be printed using the date. format.
symbol1 i = none v=dot c = blue;
symbol2 i = j v = none c = black w=5;
proc gplot data = weights;
plot (weight ra)*date /overlay;
format date date.;
run; quit;
The results are shown below. The main difference from the R plot is that the gaps in my recording do not appear in the line. The SAS symbol statement, the equivalent of the lines() function, more or less, does not encounter NaNs, but only missing values, and so it connects the points. I think R's behavior is more appropriate here-- there's no particular reason to suppose a linear interpolation between the observed data points is best, and so the line ought to be missing.


Monday, December 10, 2012

Example 10.8: The upper 95% CI is 3.69

Apologies for the long and unannounced break-- the longest since we started blogging, three and a half years ago. I was writing a 2-day course for SAS users to learn R. Contact me if you're interested. And Nick and I are beginning work on the second edition of our book-- look for it in the fall. Please let us know if you have ideas about what we omitted last time or would otherwise like to see added. In the meantime, we'll keep blogging, though likely at a reduced rate.

Today: what can you say about the probability of an event if the observed number of events is 0? It turns out that the upper 95% CI for the probability is 3.69/N. There's a sweet little paper with some rationale for this, but it's in my other office. And I couldn't recall the precise value-- so I used SAS and R to demonstrate it to myself.


The R code is remarkably concise. After generating some Ns, we write a little function to perform the test and extract the (exact) upper 95% confidence limit. This is facilitated by the "..." notation, which passes along unused arguments to functions. Then we use apply() to call the new function for each N, passing the numerator 0 each time. Note that apply() needs a matrix argument, so the simple vector of Ns is converted to a matrix before use. [The sapply() function will accept a vector input, but took about 8 times as long to run.] Finally, we plot the upper limit * N against N, showing the asymptote. A log-scaled x-axis is useful here, and is achieved with the log='x' option (section 5.3.12). The result is shown above.
bin.m = seq(10, 10000, by=5)
mybt = function(...) { binom.test(...)$conf.int[2] }
uci = apply(as.matrix(bin.m), 1, mybt, x=0)
plot(y=bin.m * uci, x=bin.m, ylim=c(0,4), type="l", 
     lwd=5, col="red", cex=5, log='x',  
     ylab="Exact upper CI", xlab="Sample size", 
     main="Upper CI when there are 0 cases observed")
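The constant can also be recovered analytically. With 0 successes in N trials, the exact (Clopper-Pearson) upper limit p solves (1-p)^N = 0.025, so p = 1 - 0.025^(1/N), and N*p approaches -log(0.025) ≈ 3.689 as N grows. A quick sketch:

```r
n <- c(100, 1000, 10000)
upper <- 1 - 0.025^(1/n)  # exact upper limit when 0 events are seen
n * upper                 # creeps up toward the asymptote
-log(0.025)               # 3.688879
```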


In SAS, the data, really just the N and a numerator of 0, are generated in a data step. The CI are found using the binomial option in the proc freq tables statement and saved using the output statement. Note that the weight statement is used here to avoid having a row for each Bernoulli trial.
data binm;
do n = 10 to 10000 by 5;
  x = 0;
  output;
end;
run;

ods select none;
proc freq data=binm;
by n;
weight n;
tables x / binomial;
output out=bp binomial;
run;
ods select all;
To calculate the upper limit*N, another data step is needed-- note that in this setting SAS will only produce the lower limit against the probability that all observations share the same value, thus the subtraction from 1 shown below. The log scale x-axis is obtained with the logbase option to the axis statement. (Section 5.3.12.) The result is shown below.
data uci;
set bp;
limit = (1-xl_bin) * n;
run;

axis1 order = (0 to 4 by 1);
axis2 logbase=10 logstyle=expand;
symbol1 i = j v = none c = red w=5 l=1;
proc gplot data=uci;
plot limit * n / vref=3.69 vaxis=axis1 haxis=axis2;
label n="Sample size" limit="Exact upper CI";
run; quit;
It's clear that the upper 95% limit on the number of successes asymptotes to about 3.69. Thus the upper limit on the binomial probability p is 3.69/N.


Monday, September 17, 2012

Example 10.2: Custom graphic layouts

In example 10.1 we introduced data from a CPAP machine. In brief, it's hard to tell exactly what's being recorded in the data set, but it seems to be related to the pattern of breathing. Measurements are taken five times a second, leading to on the order of 100,000 data points in a typical night. To get a visual sense of what a night's breathing looks like is therefore non-trivial.

Today, we'll make the graphic shown above, which presents an hour of data.

In SAS, the sgpanel procedure (section 5.1.11) will produce a similar graphic pretty easily. But we need to make a data set with indicators of the hour, and of ten-minute blocks within the hour. This we'll do with the ceil function (section 1.8.4).

data cycles2;
set cycles;
hour = ceil(time_min/60);
tenmin = ceil(time_min/10);
time_in_ten = mod(time_min - 1/300, 10);
/* 1/300 adjustment keeps last measure in the correct
10-min block */
run;

title "Hour 4 of pressure";
proc sgpanel data = cycles2;
where hour eq 4;
panelby tenmin / layout=rowlattice rows=6 spacing = 4;
colaxis display=none;
rowaxis display = (nolabel);
series x = time_in_ten y = byte;
run; quit;

The resulting plot is shown below. It would be nicer to omit the labels on the right of each plot, but this does not appear to be an option. It would likely only be possible with a fair amount of effort.

In R, we'll use the layout() function to make a 7-row layout-- one for the title and 6 for the 10-minute blocks of time. Before we get there, though, we'll construct a function to fill the time block plots with input data. The function accepts a data vector and plots only 3,000 values from it, choosing the values based on an input hour and 10-minute block within the hour. To ensure an equal y-axis range for each call, we'll also send minimum and maximum values as input to the function. All of this will be fed into plot() with the type="l" option to make a line plot.

plot10 = function(hour, tenmins, miny, maxy, data=cycles){
  start = hour*18000 + tenmins*3000 + 1
  plot((1:3000)/300, data[start:(start + 2999)],
       ylim = c(miny,maxy), type="l", xaxs="i", yaxs="i")
}

The documentation for layout() is rather opaque, so we'll review it separately.

oldpar = par(no.readonly = TRUE)
# revert to this later

layout(matrix(1:7), widths=1, heights=c(3,8,8,8,8,8,8), respect=FALSE)

The layout() function divides the plot area into a matrix of cells, some of which will be filled by the next output plots. The first argument says where in the matrix the next N objects will go. All the integers 1...N must appear in the matrix; cells that will be left empty have a 0 instead. Here, we have no empty cells, and only one column, so the "matrix" is really just a vector with 1...7 in order. The widths option specifies the relative widths of the columns-- here we have only one column, so any constant will result in the use of the whole width of the output area. Similarly, the heights option gives the relative height of the cells. Here the title will get 3/51 of the height, while each 10-minute block will get 8/51. This unequal shape of the plot regions is one reason to prefer layout() to some other ways to plot multiple images on a page. The respect option, when TRUE, makes the otherwise relative widths and heights conform, so that a unit of height is equal to a unit of width. We also use layout() in example 8.41.
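A minimal, self-contained illustration of the same idea-- a short title cell over two taller plot cells (toy data, chosen only for the demonstration):

```r
# one column, three cells; the title cell gets 1/9 of the total height
layout(matrix(1:3, ncol = 1), heights = c(1, 4, 4))
plot.new()
text(0.5, 0.5, "Title row", cex = 2)  # fill the narrow cell with text
plot(1:10, type = "l")                # first tall cell
plot(10:1, type = "l")                # second tall cell
layout(1)  # restore the default single-cell layout
```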

With the layout in hand, we're ready to fill it.

par(xaxt="n", mar = c(.3,2,.3,0) +.05)
# drop the x-axis, change the spacing around the plot
plot(x=1,y=1,type="n",ylim=c(-1,1), xlim=c(-1,1), yaxt="n",bty="n")
# the first (narrow) plot is just empty
text(0,0,paste("Hour ", (hour + 1), " of pressure data"), cex=2)
# text to put in the first plot
miny = min(cycles[(hour * 18000 + 1):((hour + 1) * 18000)])
maxy = max(cycles[(hour * 18000 + 1):((hour + 1) * 18000)])
# find min and max across the whole hour, to keep range
# of y-axis constant across the plots
for (x in 0:5) plot10(hour, x, miny, maxy)
# plot the 6 ten-minute blocks
par(oldpar)
# reset the graphics options

The resulting plot is shown at the top of the entry. There's clearly something odd going on around 11-15 minutes into the hour-- this could be a misadjusted mask, or a real problem with the breathing. There's also a period around 58 minutes when it looks like breathing stops-- exactly the kind of event the machine is meant to prevent.


Monday, July 16, 2012

Example 9.38: dynamite plots, revisited

Dynamite plots are a somewhat pejorative term for a graphical display where the height of a bar indicates the mean, and the vertical line on top of it represents the standard deviation (or standard error). These displays are commonly found in many scientific disciplines, as a way of communicating group differences in means.

Many find these displays troubling. One post called them an unmitigated evil.
The Vanderbilt University Department of Biostatistics has a formal policy discouraging use of these plots, stating that:

Dynamite plots often hide important information. This is particularly true of small or skewed data sets. Researchers are highly discouraged from using them, and department members have the option to decline participation in papers in which the lead author requires the use of these plots.

Despite the limitations of the display, we believe that there may be times when the display is helpful as a way to compare groups, assuming distributions that are approximately normal. Samuel Brown also described creation of these figures, as a way to encourage computing in R. We previously demonstrated how to create them in SAS and R, and today discuss code created by Randall Pruim to demonstrate how such graphics can be created using lattice graphics within the mosaic package.


library(lattice)
library(grid)

dynamitePlot <- function(height, error, names = as.character(1:length(height)),
significance = NA, ylim = c(0, maxLim), ...) {
if (missing(error)) { error = 0 }
maxLim <- 1.2 * max(mapply(sum, height, error))
mError <- min(error, na.rm=TRUE)
barchart(height ~ names, ylim=ylim, panel=function(x, y, ...) {
panel.barchart(x, y, ...)
grid.polyline(c(x,x), c(y, y+error), id=rep(x,2), default.units='native',
arrow=arrow(angle=45, length=unit(mError, 'native')))
grid.polyline(c(x,x), c(y, y-error), id=rep(x,2), default.units='native',
arrow=arrow(angle=45, length=unit(mError, 'native')))
grid.text(x=x, y=y + error + .05*maxLim, label=significance,
default.units='native')
}, ...)
}

Much of the code involves setting up the appropriate axis limits, then drawing the lines and adding the text. We can call this new function with an artificial example with 4 groups:

Values <- c(1,2,5,4)
Errors <- c(0.25, 0.5, 0.33, 0.12)
Names <- paste("Trial", 1:4)
Sig <- c("a", "a", "b", "b")
dynamitePlot(Values, Errors, names=Names, significance=Sig)

We still don't recommend frequent use of these plots, as other displays may be better (e.g., dotplots for very small sample sizes, or violin plots), but having the capability to generate dynamite plots within the lattice framework can be handy.


Monday, April 2, 2012

Example 9.25: It's been a mighty warm winter? (Plot on a circular axis)

Updated (see below)

People here in the northeast US consider this to have been an unusually warm winter. Was it?

The University of Dayton and the US Environmental Protection Agency maintain an archive of daily average temperatures that's reasonably current. In the case of Albany, NY (the most similar of their records to our homes in the Massachusetts' Pioneer Valley), the data set as of this writing includes daily records from 1995 through March 12, 2012.

In this entry, we show how to use R to plot these temperatures on a circular axis, that is, where January first follows December 31st. We'll color the current winter differently to see how it compares. We're not aware of a tool to enable this in SAS. It would most likely require a bit of algebra and manual plotting to make it work.

The work of plotting is done by the radial.plot() function in the plotrix package. But there are a number of data management tasks to be employed first. Most notably, we need to calculate the relative portion of the year that's elapsed through each day. This is trickier than it might be, because of leap years. We'll read the data directly via URL, which we demonstrate in Example 8.31. That way, when the unseasonably warm weather of last week is posted, we can update the plot with trivial ease.

temp1 = read.table("
leap = c(0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1)
days = rep(365, 18) + leap
monthdays = c(31,28,31,30,31,30,31,31,30,31,30,31)
temp1$V3 = temp1$V3 - 1994

The leap, days, and monthdays vectors identify leap years, count the correct number of days in each year, and give the number of days in each month in non-leap years, respectively. We need each of these to get the elapsed time in the year for each day. The columns in the data set are the month, day, year, and average temperature (in Fahrenheit). The years are renumbered, since we'll use them as indexes later.

The yearpart() function, below, computes the proportion of the year that has elapsed through a given day.

yearpart = function(daytvec, yeardays, mdays=monthdays){
  part = (sum(mdays[1:(daytvec[1]-1)]) +
    (daytvec[1] > 2) * (yeardays[daytvec[3]]==366) +
    daytvec[2] - ((daytvec[1] == 1)*31)) / yeardays[daytvec[3]]
  return(part)
}

The daytvec argument to the function will be a row from the data set. The function works by first summing the days in the months that have passed (sum(mdays[1:(daytvec[1]-1)])), adding one if it's after February in a leap year ((daytvec[1] > 2) * (yeardays[daytvec[3]]==366)). Then the days passed so far in the current month are added. Finally, we subtract the length of January, if it's January. This is needed because 1:0 evaluates to c(1, 0), so mdays[1:0] picks out January's 31 days-- the result of which is that January is counted as a month that has "passed" when the sum() quoted above is calculated for January days. Then we just divide by the number of days in the current year.
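A worked example may help: for March 12 of the 18th (leap) year, the result should be (31 + 29 + 12)/366. The snippet below repeats the needed definitions so it runs on its own:

```r
monthdays <- c(31,28,31,30,31,30,31,31,30,31,30,31)
days <- rep(365, 18) + c(0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1)
yearpart <- function(daytvec, yeardays, mdays = monthdays){
  (sum(mdays[1:(daytvec[1]-1)]) +
     (daytvec[1] > 2) * (yeardays[daytvec[3]] == 366) +
     daytvec[2] - ((daytvec[1] == 1) * 31)) / yeardays[daytvec[3]]
}
yearpart(c(3, 12, 18), days)  # (31 + 29 + 12)/366 = 0.1967213
```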

The rest is fairly simple. We calculate the radians as the portion of the year passed * 2 * pi, using the apply() function to repeat across the rows of the data set. Then we make matrices with time before and time since this winter started, admittedly with some ugly logical expressions (section 1.14.11), and use the radial.plot() function to make the plots. The options to the function are fairly self-explanatory.

temp2 = as.matrix(temp1)
radians = 2* pi * apply(temp2,1,yearpart,days,monthdays)

t3old = matrix(c(temp1$V4[temp1$V4 != -99 & ((temp1$V3 < 17) | (temp1$V3 == 17 & temp1$V1 < 12))],
radians[temp1$V4 != -99 & ((temp1$V3 < 17) | (temp1$V3 == 17 & temp1$V1 < 12))]), ncol=2)

t3now= matrix(c(temp1$V4[temp1$V4 != -99 &
((temp1$V3 == 18) | (temp1$V3 == 17 & temp1$V1 == 12))],
radians[temp1$V4 != -99 & ((temp1$V3 == 18) |
(temp1$V3 == 17 & temp1$V1 == 12))]),ncol=2)
# from the plotrix package
radial.plot(t3old[,1],t3old[,2],rp.type="s", point.col = 2, point.symbols=46,
clockwise=TRUE, start = pi/2, label.pos = (1:12)/6 * (pi),
labels=c("February 1","March 1","April 1","May 1","June 1",
"July 1","August 1","September 1","October 1","November 1",
"December 1","January 1"), radial.lim=c(-20,10,40,70,100))

radial.plot(t3now[,1],t3now[,2],rp.type="s", point.col = 1, point.symbols='*',
clockwise=TRUE, start = pi/2, add=TRUE, radial.lim=c(-20,10,40,70,100))

The result is shown at the top. The dots (the point.symbols option is like pch-- here 46, which plots a tiny point; section 5.2.2) show the older data, while the asterisks are the current winter. An alternate plot can be created with the rp.type="p" option, which makes a line plot. The result is shown below, but the lines connecting the dots get most of the ink and are not what we care about today.

Either plot demonstrates clearly that a typical average temperature in Albany is about 60 to 80 in August and about 10 to 35 in January, the coldest month.

The top figure shows that it has in fact been quite a warm winter-- most of the black asterisks are near the outside of the range of red dots. Updating with more recent weeks will likely increase this impression. In the first edition of this post, the radial.lim option was omitted, which resulted in different axes in the original and "add" calls to radial.plot. This made the winter look much cooler. Many thanks to Robert Allison for noticing the problem in the main plot. Robert has made many hundreds of beautiful graphics in SAS, which can be found here. He also has a book. Robert also created a version of the plot above in SAS, which you can find here, with code here. Both SAS and R (not to mention a host of other environments) are sufficiently general and flexible that you can do whatever you want to do-- but varying amounts of expertise might be required.


Wednesday, January 11, 2012

Example 9.19: Demonstrating the central limit theorem

A colleague recently asked "why should the average get closer to the mean when we increase the sample size?" We should interpret this question as asking why the standard error of the mean gets smaller as n increases. The central limit theorem shows that (under certain conditions, of course) the standard error must do this, and that the mean approaches a normal distribution. But the question was why does it? This seems so natural that it may have gone unquestioned in the past.

The best simple rationale may be that there are more ways to get middle values than extreme values--for example, the mean of a die roll (uniform discrete distribution on 1, 2, ..., 6) is 3.5. With one die, you're equally likely to get an "average" of 3 or of 1. But with two dice there are five ways to get an average of 3, and only one way to get an average of 1. You're 5 times more likely to get the value that's closer to the mean than the one that's further away.
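The counting claim is easy to verify by enumerating all 36 equally likely two-dice outcomes (a quick Python sketch, not part of the original post):

```python
from itertools import product
from collections import Counter

# count how many of the 36 equally likely two-dice outcomes
# produce each possible average
counts = Counter(sum(roll) / 2 for roll in product(range(1, 7), repeat=2))

print(counts[3.0])  # five ways to average 3: (1,5),(2,4),(3,3),(4,2),(5,1)
print(counts[1.0])  # only one way to average 1: (1,1)
```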

Here's a simple graphic to show that the standard error decreases with increasing n.

We begin by simulating some data-- normal, here, but of course that doesn't matter (assuming that the standard deviation exists for whatever distribution we pick and the sample size is appropriately large). Rather than simulate separate samples with n = 1 ... k, it's easier to add a random variate to a series and keep a running tally of the mean, which is easy with a little algebra. This approach also allows tracking the progress of the mean of each series, which could also be useful.
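The "little algebra" is the standard incremental-mean update: the new mean is (k-1) times the old mean plus the new value, all divided by k. A minimal sketch of the same bookkeeping (Python, variable names are mine):

```python
import random

random.seed(1)
meanx = 0.0
xs = []
for obs in range(1, 101):
    x = random.gauss(0, 1)
    xs.append(x)
    # incremental update: identical to recomputing sum(xs)/len(xs)
    meanx = (meanx * (obs - 1) + x) / obs

# the running tally agrees with the brute-force mean
assert abs(meanx - sum(xs) / len(xs)) < 1e-12
```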

%let nsamp = 100;
data normal;
do sample = 1 to &nsamp;
meanx = 0;
do obs = 1 to &nsamp;
x = normal(0);
meanx = ((meanx * (obs - 1)) + x)/obs;
output;
end;
end;
run;

We can now plot the means vs. the number of observations.

symbol1 i = none v = dot h = .2;
proc gplot data = normal;
plot meanx * obs;
run; quit;

symbol1 i=join v=none r=&nsamp;
proc gplot data=normal;
plot meanx * obs = sample / nolegend;
run; quit;

The graphic resulting from the first proc gplot is shown above, and demonstrates both how quickly the variability of the estimate of the mean decreases when n is small, and how little it changes when n is larger. A plot showing the means for each sequence converging can be generated with the second block of code. Note the use of the global macro variable nsamp assigned using the %let statement (section A.8.2).

We'll also generate sequences of variates in R. We'll do this by putting the random variates in a matrix, and treating each row as a sequence. We'll use the apply() function (sections 1.10.6 and B.5.3) to treat each row of the matrix separately.

numsim = 100
matx = matrix(rnorm(numsim^2), nrow=numsim)

runavg = function(x) { cumsum(x)/(1:length(x)) }
ramatx = t(apply(matx, 1, runavg))

The simple function runavg() calculates the running average of a vector and returns a vector of equal length. By using it as the function in apply(), we can get the running average of each row. The result must be transposed (with the t() function, section 1.9.2) to keep the sequences in rows. To plot the values, we'll use the type="n" option to plot(), specifying the first column of the running average matrix as the y variable. While it's possible that a running average will surpass the range set by the n=1 values, we ignore that case in this simple demonstration.
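For readers outside R, the runavg()-plus-apply() combination amounts to a running average computed independently for each row. A plain-Python sketch of the same computation (no libraries assumed):

```python
def runavg(x):
    """Running average of a sequence, like cumsum(x)/(1:length(x)) in R."""
    out, total = [], 0.0
    for k, v in enumerate(x, start=1):
        total += v
        out.append(total / k)
    return out

# one running average per row, mirroring apply(matx, 1, runavg)
mat = [[1.0, 2.0, 3.0], [4.0, 0.0, 2.0]]
print([runavg(row) for row in mat])  # [[1.0, 1.5, 2.0], [4.0, 2.0, 2.0]]
```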

plot(x=1:numsim, y = ramatx[,1], type="n",
xlab="number of observations", ylab="running mean")
rapoints = function(x) points(x~seq(1:length(x)), pch=20, cex=0.2)
apply(ramatx, 1, rapoints)

plot(x=1:numsim, y = ramatx[,1], type="n",
xlab="number of observations", ylab="running mean")
ralines = function(x) lines(x~seq(1:length(x)))
apply(ramatx, 1, ralines)

Here we define another simple function to plot the values in a vector against the place number, then again use the apply() function to plot each row as a vector. The first set of code generates a plot resembling the SAS graphic presented above. The second set of code will connect the values in each sequence, with results shown below.

Wednesday, September 21, 2011

Example 9.6: Model comparison plots (Completed)

We often work in settings where the data set has a lot of missing data-- some missingness in the (many) covariates, some in the main exposure of interest, and still more in the outcome. (Nick describes this as "job security for statisticians").

Some analysts are leery of imputing anything at all, preferring to rely on the assumption that the data are missing completely at random. Others will use multiple imputation for covariates, but feel they should use "real" data for the exposure and outcome. Still others will impute the exposure but not the outcome. Theory and experiments suggest (Moons et al., JCE 2006) that all missing data should be imputed. Depending on the imputation method, this may offer some protection when data are missing at random, a condition more general than missing completely at random.

In one analysis, we decided to use each of these approaches and compare the results. The data are shown below. The first column denotes the data used; the second holds the effect estimate and the confidence limits for the effect. How can we present these results clearly? We designed a graphic that requires some customization in either SAS or R but makes the point elegantly.

1 .11
1 -.05
1 .28
2 .07
2 .21
2 -.07
3 .06
3 -.08
3 .2
4 0
4 -.13
4 .12

The SAS version is shown below. (Click on it for a larger image.) To generate it, add a final column to the data, where the effect estimate is repeated but the other values are not. Then a basic plot can be created in proc gplot with the hiloc interpolation in the symbol statement and the overlay option in the plot statement. (See book section 5.3 and other blog entries for details.) Try the code without the axis statements to see what happens.

data ke1;
input datatype estimate meanval;
datalines;
1 .11 .11
1 -.05 .
1 .28 .
2 .07 .07
2 .21 .
2 -.07 .
3 .06 .06
3 -.08 .
3 .2 .
4 0 0
4 -.13 .
4 .12 .
;
run;

symbol1 i=hiloc c=black v=none;
symbol2 i=none v=dot h=1 c=black;
axis1 minor=none order = (1 to 4 by 1)
value = (tick = 1 "Complete"
justify=c "Case" justify = c "(N = 2055)"
tick=2 "MI" justify=c "Covariates only"
justify=c "(N = 2961)"
tick=3 "MI" justify=c "Covariates and exposure"
justify=c "(N = 3994)"
tick=4 "MI" justify=c "All variables"
justify=c "(N = 6782)"
label = none
offset = (2 cm, 2 cm);
axis2 minor=none order = (-.2 to .3 by .1)
label = (angle=90 "Effect of exposure on outcome");
title "Compare missingness approaches";
proc gplot data = ke1;
plot (estimate meanval) * datatype /
overlay haxis=axis1 vref=0 vaxis=axis2;
run; quit;

The two axis statements make the plot work. The axis1 statement uses the value option to hand-write the labels describing the data sets. Note that the justify = c causes a new line to be started. The offset option adds a little space to the left and right of the data. The axis2 statement specifies the range and label for the vertical axis. The extra symbol statement and the overlay option just plot the dots that call attention to the effect estimates-- otherwise they would show just a small crossbar at the effect.

The plot suggests that as more observations are included and the multiple imputation gains accuracy the effect attenuates and the standard errors decrease.


In R we create the equivalent plot in multiple steps, first by creating an empty plot of the correct size then iterating through each of the lines. As with the SAS approach, a little manipulation of the raw data is required.

n = c(2055, 2961, 3994, 6782)
labels = c("Complete Case", "MI\ncovariates only",
"MI\ncovariates and exposure",
"MI\nall variables")
est = c(0.11, 0.07, 0.06, 0)
lower = c(-0.05, -0.07, -0.08, -0.13)
upper = c(0.28, 0.21, 0.20, 0.12)

plot(c(0.5, 4.5), c(min(lower)-.10, max(upper)), type="n", xlab="",
xaxt="n", ylab="Effect of exposure on outcome")
title("Compare missingness approaches")
for (i in 1:length(n)) {
points(i, est[i])
lines(c(i,i), c(lower[i], upper[i]))
stringval = paste(labels[i],"\n(N=",n[i],")")
text(i, min(lower) - .05, stringval, cex=.6)
}
abline(h=0, lty=2)

The resulting plot is shown at the top. As opposed to the SAS approach, more of the figure can be defined using the data. For example, the y-axis values are determined from the minimum and maximum values to plot.

Note: a draft of this entry was published accidentally. Many apologies. --Ken

Monday, November 8, 2010

Example 8.13: Bike ride plot, part 2

Before explaining how to make and interpret the plot above, Nick and I want to make a plea for questions--it's hard to come up with useful questions to explore each week!

As shown in Example 8.12, data from the Cyclemeter app can be used to make interesting plots of a bike ride. But in the previous application, questions remain. Why did my speed vary so much, for example? And why do some sections of the route appear to be very straight while others are relatively bumpy? To investigate these questions, we'll add some more data into the plot, using color. First, I'll shade the line according to my elevation-- my speed is probably faster when I'm going downhill. Then I'll add the points where the GPS connected; this may explain the straight sections. Doing all of this will require more complicated coding.

(Data were read in previously.)


My initial thought was that I would use proc univariate to generate a histogram (section 5.1.4) and generate the color for the line from the histogram categories. But it turns out that you can't force the histogram to have your desired number of categories. (This is true for the R hist() function as well.) So I'm going to find the minimum and maximum elevation using proc summary (section 2.1), saving the results in a data set. Then I'll add those min and max values to each line of the data, using the tip found here. Finally, I'll make my own categories for elevation by looping through the category boundaries until I find one bigger than my observation.
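The category-boundary loop described above is easy to state language-neutrally: scan the six equal-width boundaries and keep the index of the last one the value meets. A Python sketch (the function name and test values are mine):

```python
def equal_width_bin(value, lo, hi, ngroups=6):
    """Return a 1-based category index by scanning equal-width boundaries,
    mirroring the do-loop in the SAS data step below."""
    width = (hi - lo) / ngroups
    index = 0
    for i in range(ngroups):
        # keep updating: the last boundary satisfied wins
        if value >= lo + i * width:
            index = i + 1
    return index

print(equal_width_bin(0.0, 0.0, 60.0))   # 1: at the minimum
print(equal_width_bin(59.9, 0.0, 60.0))  # 6: near the maximum
```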

proc summary data=ride;
var elevation__feet_;
output out=minmax max=maxelev min=minelev;
run;

data aride;
set ride;
setme = 1;
set minmax point=setme;
colorindex = 0;
range = maxelev - minelev;
do i = 0 to 5;
if elevation__feet_ ge (minelev + (i * range/6))
then colorindex = i + 1;
end;
run;

To make the plot, I'm going to use the annotate facility, as before. However, the annotate %line macro won't work, for reasons I won't go into. So I need to make an annotate data set directly. Briefly, the way annotate data sets work is that they have a pre-specified variable list in which only certain values are allowed. Some of these are discussed in section 5.1, but in addition to these, below we use the function variable; when function="MOVE" the conceptual plotting pen moves without drawing. When function="DRAW" a line is plotted from the last location to the current one. Locations are determined by x and y variables. The line and size variables affect the line style and thickness.
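The MOVE/DRAW convention amounts to building one pen-command record per point, with only the first record a MOVE. A sketch of that record-building step (Python, with field names chosen to echo the annotate variables):

```python
def annotate_records(points):
    """Turn a list of (x, y) points into MOVE/DRAW pen commands,
    the way the data step below does row by row (_n_ = 1 gets MOVE)."""
    records = []
    for n, (x, y) in enumerate(points, start=1):
        function = "MOVE" if n == 1 else "DRAW"
        records.append({"function": function, "x": x, "y": y})
    return records

recs = annotate_records([(-72.50, 42.34), (-72.51, 42.35), (-72.52, 42.35)])
print([r["function"] for r in recs])  # ['MOVE', 'DRAW', 'DRAW']
```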

The other interesting bit of the code below is the creation and use of a temporary array (section 1.11.5) for the colors. I choose the color from this array by indexing on the colorindex variable created above.

For help on the annotate facility, see Contents; SAS Products; SAS/GRAPH; The Annotate Facility. Colors are discussed in section 5.3.11, but we use them in a slightly different way, here. For help with color specification, see Contents; SAS Products; SAS/GRAPH; Concepts; Colors.

data annoride2;
set aride;
if elevation__feet_ ne .;
retain xsys '2' ysys '2' hsys '6';
array mycolor [6] $ _temporary_
("CX252525" "CX636363" "CX969696" "CXBDBDBD" "CXD9D9D9" "CXF7F7F7");
x = longitude;
y = latitude;
/* If this is the first observation, we need to move the pen to
the first point. Otherwise we draw a line */
if _n_ ne 1 then function = "DRAW";
else function = "MOVE";
color = mycolor[colorindex];
line = 1;
size = speed__miles_h_ * 2;
run;

Finally, we can plot the figure. I use the symbol statement (section 5.2.2) to plot nice dots where the GPS connected, and the axis statement (section 5.3.7) to remove the values of latitude and longitude, which don't interest me much. Analysis of the graph I leave for the R version.

axis1 major=none minor=none value=none;
symbol1 i=none v=dot c=black h=.2;
proc gplot data = ride;
plot latitude * longitude /
vaxis=axis1 haxis=axis1 annotate=annoride2;
run; quit;

The resulting plot is shown above.


In R, I'm going to modify my vectorized version of the lines() function to additionally assign colors to each line. As in the SAS example, I will use categories of elevation to do this. Assigning the elevations to categories is a little trickier if I want to avoid for() loops. To do it, I will first find the category boundaries. I'll then use the which() function (as in section 1.13.2) to figure out the category boundaries smaller than the observation, and the max() function (section 1.8.1) to select the largest of these. Unfortunately, I also need the sapply() function (section B.5.3) so that I can repeat this process for each elevation and return a vector. I've annotated below to explain how this works. Finally, I use the RColorBrewer package to generate a vector of colors. (I also used it to find the colors I put in the SAS code.)
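The expression max(which(r >= breaks)) is just "index of the rightmost boundary not exceeding r", which is a binary search. A Python sketch using bisect (the break values are illustrative only):

```python
from bisect import bisect_right

def category(r, breaks):
    """1-based index of the largest break <= r,
    equivalent to max(which(r >= breaks)) in the R code."""
    return bisect_right(breaks, r)

breaks = [0, 10, 20, 30, 40, 50]  # min(c) + 0:5 * range/6, for illustration
print(category(5, breaks))   # 1
print(category(35, breaks))  # 4
```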

veclines2 = function(x, y, z, c) {
breaks = min(c) + (0:5 * (max(c) - min(c))/6)
# The sapply() function applies the function named in the
# second argument to each element of the vector named in the
# first argument. Since I don't need this function again,
# I write it out within the sapply() function call and don't
# even have to name it.
ccat = sapply(c, function (r) {max(which(r >= breaks))} )
mypalette = brewer.pal(6,"Greys")
for (i in 1:(length(x)-1)) {
lines(x[i:(i+1)], y[i:(i+1)], lwd=z[i], col=mypalette[7 - ccat[i]])
}
}

Reader JanOosting pointed out that the segments() function will do exactly what my for-looped lines() function does. The segments() function requires "to" and "from" x and y locations. Here's a version of the above function that uses it.

bikeseg = function(x,y,z,c) {
mypalette <- brewer.pal(6,"Greys")
breaks = min(c) + (0:5 * (max(c) - min(c))/6)
ccat = sapply(c,function (r) {max(which(r >= breaks))})
l = length(x)
segments(x0 = x[1:(l-1)], y0 = y[1:(l-1)],
x1 = x[2:l], y1 = y[2:l], lwd=z, col=mypalette[7-ccat])
}

Then I call the function, after creating an empty plot and making the RColorBrewer library available. Finally, I add the points and some street names (see sections 5.2.3 and 5.2.11), to aid in interpretation. The way R draws the names live, as you enter commands, makes it much easier to place and rotate the names than in SAS, where you would have to re-make the annotate data set and run it to see the results.

plot(Longitude, Latitude, type="n")
bikeseg(Longitude, Latitude, Speed..miles.h./2, Elevation..feet.)

points(Longitude, Latitude, cex=2, pch='.')
text(-72.495, 42.34, "Station Road")
text(-72.5035, 42.33, "Middle Street", srt=90)
text(-72.5035, 42.36, "Bike Path", srt=-39)
text(-72.54, 42.36, "Bike Path", srt=15)
text(-72.55, 42.347, "Maple", srt=107)
text(-72.54, 42.338, "Pomeroy")
text(-72.515, 42.331, "Potwine")

In the final image, thicker lines show greater speed, darker colors indicate lower elevations, and the dots are where the GPS connects. My best speed is achieved along Station Road, which is a long, straight drop. There aren't enough categories of color to explain most of the other variations in speed, although you can see me get slower as I near Middle Street along Potwine. The dots are fairly evenly spaced except along the bike path, where there is a lot of tree cover in some spots. This causes the long straight pieces near the top of the plot. Since the phone is also sitting in my pocket as I ride, I'm more pleased than disappointed with the performance of the iPhone and Cyclemeter.

Monday, November 1, 2010

Example 8.12: Bike ride plot, part 1

The iPhone app Cyclemeter uses the phone's GPS capability to record location and other data, and infer speed, while you ride. I took a ride near my house recently, and downloaded the data. I'd like to examine my route and my speed. A simple plot of the route is trivial in either SAS or R, but adding the speed data requires a little work. You can download my data from here and I read the data directly via URL in the following code.


In SAS, I first use proc import with the url filetype, as shown in section 1.1.6. I can then make a simple plot of the route using the i=j option to the symbol statement (as in section 1.13.5), which simply joins successive points.

filename bike url '';

proc import datafile=bike out=ride dbms=dlm;
run;

symbol1 i=j;
proc gplot data=ride;
plot latitude * longitude;
run; quit;

I didn't project the data, so this looks a little compressed north-south.

To show my speed at each point, I decided to make a thicker line when I'm going faster. To do this, I use the annotate macros discussed in section 5.2. I decided to use the %line macro to do this, but that requires each observation in the data set have a starting point and an ending point for its section of line. I use the lag function (section 1.4.17) in a separate data step to add the previous point to each observation. Then I create the annotate data set. Finally, I use the value = none option to the symbol statement to make an empty plot and the annotate data set draws the line for me.
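The lag() trick is simply pairing each observation with its predecessor and dropping the first row; sketched generically in Python:

```python
def with_lag(points):
    """Pair each point with the previous one, dropping the first row
    (the SAS code's 'if _n_ ne 1'), like lag() in a data step."""
    return [(prev, cur) for prev, cur in zip(points, points[1:])]

pts = [(0, 0), (1, 2), (3, 3)]
print(with_lag(pts))  # [((0, 0), (1, 2)), ((1, 2), (3, 3))]
```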

data twopoints;
set ride;
lastlat = lag(latitude);
lastlong = lag(longitude);
if _n_ ne 1;
run;

data annoride;
set twopoints;
/* %annomac must have been run to make the annotate macros available */
%line(lastlong, lastlat, longitude, latitude, black, 1, speed__miles_h_ * 2);
run;

symbol1 v=none;
proc gplot data=ride;
plot latitude * longitude / annotate=annoride;
run; quit;

The resulting plot shown below closely resembles the R plot shown at the top of this entry.


In R, it's as trivial to make the simple plot as in SAS. Just read in the CSV data from the URL (sections 1.1.2, 1.1.6), make an empty plot (5.1.1), and add the lines (5.2.1).

ds = read.csv("")
attach(ds)
plot(Longitude, Latitude, type="n")
lines(Longitude, Latitude)

Now I want to show the speed, as above. The lines() function has a lwd= option, but unfortunately, it's not vectorized. In other words, it accepts only a scalar that applies to all the line segments drawn in a given call. To get around that, I'll write my own vectorized version of lines() using the disfavored for() function. It calls lines() for each pair of points, with an appropriate lwd value.

veclines = function(x, y, z) {
for (i in 1:(length(x)-1)) {
lines(x[i:(i+1)], y[i:(i+1)], lwd=z[i])
}
}
veclines(Longitude, Latitude, Speed..miles.h./2)

The result is displayed at the top of this blog entry. In the next entry we'll add more information to help explain why the speed varies.

Tuesday, October 26, 2010

Example 8.11: violin plots

We've continued to get useful feedback and ideas from our posts on the combination dotplot/boxplot and other ways to craft similar displays.

Another notion is the violin plot, which combines a boxplot and a (doubled) kernel density plot. While the basic notion of the violin plot does not include the individual points, such a display has virtues, particularly when comparing multiple groups and with large datasets. For teaching purposes, dots representing the data points could be added in. More details on the plot can be found in: Hintze, J. L. and R. D. Nelson (1998). Violin plots: a box plot-density trace synergism. The American Statistician, 52(2):181-4.
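For readers who haven't met them, the density half of a violin is an ordinary kernel density estimate mirrored about the axis. A bare-bones Gaussian-kernel sketch (Python; the bandwidth and data are arbitrary illustrations):

```python
import math

def kde(data, grid, bandwidth=1.0):
    """Gaussian kernel density estimate evaluated at each grid point."""
    n = len(data)
    const = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [const * sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2)
                        for x in data)
            for g in grid]

density = kde([1.0, 2.0, 2.5, 3.0], grid=[0.0, 2.0, 4.0])
# a violin doubles the curve: plot +density and -density around the axis
print([round(d, 3) for d in density])
```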


In R, the vioplot package includes the vioplot() function, which generated the plot at the top of this entry.

library(vioplot)
ds = read.csv("")
female = subset(ds, female==1)
with(female, vioplot(pcs[homeless==0], pcs[homeless==1],
horizontal=TRUE, names=c("non-homeless", "homeless"),
col = "lightblue"))


We've neglected SAS in the discussion of teaching graphics. Mimicking the tailored appearance of Wild's approach to the dotplot-boxplot would require at least several hours, while even the shorter code suggested by commenters would be difficult. For the most part this reflects the modular nature of R. However, violin plots are similar enough to stacked kernel density estimates that we show the latter here to demonstrate the code.

proc sgpanel data="C:\book\help";
where female eq 1;
panelby homeless / columns=1;
density pcs / scale=percent type=kernel;
run;

The output lacks the graphic depiction of central tendency, and does not double the density, but it does highlight similarities and differences between the categories.

Sunday, October 24, 2010

Reader suggestions on alternative ways to create combination dotplot/boxplot

Kudos to several of our readers, who suggested simpler ways to craft the graphical display (combination dotplot/boxplot) from our most recent example.

Yihui Xie combines a boxplot with a coarsened version of the PCS scores (coarsened via the round() function) displayed by the stripchart() function.

ds = read.csv("")
smallds = subset(ds, female==1)
boxplot(pcs ~ homeless, data=smallds, horizontal=TRUE)
stripchart(round(pcs) ~ homeless, data=smallds,
method='stack', add=TRUE)

This is far simpler than the code we provided, though aesthetically is perhaps less pleasing.

An alternative approach was suggested by NetDoc, who creatively utilizes the ggplot2 package authored by Hadley Wickham. This demonstrates how a complex plot can be crafted by building up components using this powerful system. This approach dodges points so that they don't overlap.

library(ggplot2)
ds = read.csv("")
female = subset(ds, female==1)
theme_update(panel.background = theme_rect(fill="white", colour=NA))
theme_update(panel.grid.minor=theme_line(colour="white", size=NA))
p = ggplot(female, aes(factor(homeless), pcs))
p + geom_boxplot(outlier.colour="transparent") +
geom_point(position=position_dodge(width=0.2), pch="O") +
opts(title = "PCS by Homeless") +
labs(x="Homeless", y="PCS") +
coord_flip()

This is also simpler than the approach taken by Wild and colleagues. The increased complexity of Wild's function is the cost of maintaining a consistent, integrated appearance with the other pieces of their package. The checking of appropriate arguments is also important for code that will be used by others.

Tuesday, September 14, 2010

Example 8.5: bubble plots part 3

An anonymous commenter expressed a desire to see how one might use SAS to draw a bubble plot with bubbles in three colors, corresponding to a fourth variable in the data set (x, y, z for bubble size, and the category variable). In previous entries we discussed bubble plots and showed how to print the bubbles in two colors depending on a fourth, dichotomous variable.

That SAS approach cannot be extended to a fourth variable with many values, so we show here another way to generate this output. The R version below is a trivial extension of the code demonstrated earlier.


We'll start by making some data-- 20 observations in each of 3 categories.

data testbubbles;
do cat = 1 to 3;
do i = 1 to 20;
abscissa = normal(0);
ordinate = normal(0);
z = uniform(0);
output;
end;
end;
run;

Our approach will be to make an annotate data set using the annotate macros (section 5.2). The %slice macro easily draws filled circles. Check its documentation for full details on the parameters it needs in the on-line help: SAS Products; SAS/GRAPH; The Annotate Facility; Annotate Dictionary. Here we note that the 5th parameter is the radius of the circle, chosen here as an arbitrary function of z that makes pleasingly sized circles. Other parameters reflect color density, arc, and starting angle, which could be used to represent additional variables.

data annobub1;
set testbubbles;
%slice(abscissa, ordinate, 0, 360, sqrt(3*z), green, ps, 0);
run;

Unfortunately, due to a quirk of the macro facility, I don't think the color can be changed conditionally in the preceding step. Instead, we need a new data step to do this.

data annobub2;
set annobub1;
if cat=2 then color="red";
if cat=3 then color="blue";
run;

Now we're ready to plot. We use the symbol (section 5.2.2) statement to tell proc gplot not to plot the data, add the annotate data set, and suppress the legend, as the default legend will not look correct here. An appropriate legend could be generated with a legend statement.

symbol1 i=none r=3;
proc gplot data=testbubbles;
plot ordinate * abscissa = cat / annotate = annobub2 nolegend;
run; quit;

The resulting plot is shown above. Improved axes are demonstrated throughout the book and in many previous blog posts.


The R approach merely requires passing three colors to the bg option in the symbols() function. To mimic SAS, we'll start by defining some data, then generate the vector of colors needed.

cat = rep(c(1, 2, 3), each=20)
abscissa = rnorm(60)
ordinate = rnorm(60)
z = runif(60)
plotcolor = ifelse(cat==1, "green", ifelse(cat==2, "red", "blue"))

The nested calls to the ifelse function (section 1.11.2) allow vectorized conditional tests with more than two possibilities. Another option would be to use a for loop (section 1.11.1) but this would be avoiding one of the strengths of R. In this example, I suppose I could have defined the cat vector with the color values as well, and saved some keystrokes.
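The nested ifelse() calls implement a vectorized three-way lookup; in most languages the natural equivalent is a map applied elementwise. A quick Python sketch of the same color assignment:

```python
# map each category code to a color, with a default for safety
colors = {1: "green", 2: "red", 3: "blue"}
cat = [1, 2, 3] * 20  # 60 observations, as in the example
plotcolor = [colors.get(c, "black") for c in cat]

print(plotcolor[:4])  # ['green', 'red', 'blue', 'green']
```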

With the data generated and the color vector prepared, we need only call the symbols() function.

symbols(ordinate, abscissa, circles=z, inches=1/5, bg=plotcolor)

The resulting plot is shown below.

Monday, August 30, 2010

Example 8.3: pyramid plots

Pyramid plots are a common way to display the distribution of age groups in a human population. The percentages of people within a given age category are arranged in a barplot, often back to back. Such displays can be used to distinguish males from females, two different countries, or the distribution of age at different timepoints. Aidan Kane has an example.

We demonstrate how to generate back-to-back pyramid plots by gender of the age distribution from the HELP (Health Evaluation and Linkage to Primary Care) study. Today's example highlights the differences between the R community and the SAS corporate structure. The R function was constructed to do exactly a pyramid plot, while the SAS approach tricks a powerful but general procedure into achieving approximately the desired results. The R results are, to our eyes, more attractive; to mimic them exactly in SAS would require drawing much of the content from primitives. Someone may have done this, but the software structure and user community aren't organized for sharing it.


We begin by loading the data then creating a categorical age variable (in 5 year increments) using the cut() command (section 1.4.10). Next a character variable is created that will be used to display the five number summaries by gender (section 2.1.2).

ds = read.csv("")

# create a categorical age variable
agegrp = cut(age, breaks=c(18, 20, 25, 30, 35, 40, 45, 50, 55, 60))

# create a nicer description for gender
gender = rep("male", length(agegrp))
gender[female==1] = "female"

# create a vector of percentages in each age range
women = as.vector(100*table(agegrp[female==1])/sum(female==1))
men = as.vector(100*table(agegrp[female==0])/sum(female==0))

# distribution by gender
tapply(age, gender, fivenum)

This yields the following output (five-number summaries by gender):

$female
[1] 21.0 31.0 35.0 40.5 58.0

$male
[1] 19 30 35 40 60

Finally, the vectors of percentages at each level of the age variable for men and women is given as arguments to the pyramid.plot() function.

library(plotrix)
pyramid.plot(men, women, labels=levels(agegrp))
title("Age distribution at baseline of HELP study")

The age distributions are quite similar, with the males slightly more dispersed than the females.


We'll use proc gchart with the hbar statement (section 5.1.3) to make the plot. This requires some set-up, due to the desired back-to-back image. We begin, as in R, by generating the age categories and a gender variable. The strategy for categorizing age is shown in section 1.4.9.

data pyr;
set "c:\book\help";
agegrp = (age le 20) + (age le 25) + (age le 30) + (age le 35) +
(age le 40) + (age le 45) + (age le 50) + (age le 55) + (age le 60);
if female eq 1 then gender = "Female";
else gender = "Male";
run;
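The agegrp expression is a compact trick: summing boolean comparisons counts how many upper boundaries the age falls at or below, yielding a group number with no if/else ladder. The same idea sketched in Python:

```python
def age_group(age, uppers=(20, 25, 30, 35, 40, 45, 50, 55, 60)):
    """Count how many upper bounds age falls at or below,
    mirroring agegrp = (age le 20) + (age le 25) + ... in the data step."""
    return sum(age <= u for u in uppers)

print(age_group(19))  # 9: below every boundary (youngest group)
print(age_group(58))  # 1: only the 60 boundary applies
```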

Next, we generate the percent in each age group, within gender, using proc freq (section 2.3.1). We save the output to a data set with the out option and suppress all the printed output. Then we make the percents for the males negative, so they'll display to the left of 0.

proc freq data=pyr noprint;
tables agegrp * gender/out=sumpyr outpct;
run;

data pyr2;
set sumpyr;
if gender eq "Male" then pct_col=pct_col * -1;
run;

We could proceed with the plot now, but the axes would include age categories 1 through 9 and negative percents for the males. To clean this up, we use axis statements (sections 5.3.7, 5.3.8).

title 'Age distribution at baseline of HELP study';
axis1 value = ("(55,60]" "(50,55]" "(45,50]" "(40,45]"
"(35,40]" "(30,35]" "(25,30]" "(20,25]" "(18,20]" ) ;
axis2 order=(-30 to 30 by 10)
label=("Percent in each age group, within gender")
minor = none
value = ("30" "20" "10" "0" "10" '20' '30');

proc gchart data=pyr2;
hbar agegrp / discrete freq nostats sumvar=pct_col space=0.5
subgroup=gender raxis=axis2 maxis=axis1;
label agegrp="Age";
run; quit;

In the gchart statement, the key option is sumvar, which tells proc gchart the length of the bars. The discrete option forces a bar for each value of agegrp. Other options associate the defined axis statements with the axes of the plot, generate different colors for each gender, space the bars, and suppress some default plot features.

Different colored bars within gender could be accomplished with pattern statements. More difficult would be coloring the bars within gender by some third variables, as is demonstrated in R in example(pyramid.plot). Replicating the R plot with the category labels between the genders would require drawing the plot using annotate data sets.

Monday, April 19, 2010

Example 7.33: Specifying fonts in graphics

For interactive data analysis, the default fonts used by SAS and R are acceptable, if not beautiful. However, for publication, it may be important to manipulate the fonts. For example, it would be desirable for the fonts in legends, axis labels, or other text printed in plots to approximate the typeface used in the rest of the text. Credit where it's due department: this blog entry is inspired by this blog post by Yihui Xie.

As an example plot, we'll revisit Figure 2.2 from the book, in which we plot MCS by CESD, with plot symbol showing substance abused.


In SAS, we'll focus on the "traditional" graphics most likely to be used for scientific publication. ODS graphics and the new SG procedures are currently difficult to customize.

There are several fonts supplied with the standard installation. However, any TrueType or Adobe Type 1 font on a local disk can be used. Making the fonts available requires a one-time use of proc fontreg.

proc fontreg mode=all;
fontpath "c:/windows/fonts";
run;

In the above, the fontpath is the location in the operating system of the .ttf (TrueType) and/or .pfa/.pfb (Type 1) files.

Any font in that directory can then be used in SAS graphics, regardless of the format of the image file. Note however that this approach may mean that running your code on a different computer may alter the appearance of the graphic, since the fonts may not be available on another computer. If you anticipate needing to share the code, you can stick with the fonts SAS supplies, which are described in the online documentation: SAS Products; SAS/GRAPH; Concepts; Fonts.

The simplest way to make all text default to a desired font is to use the goptions statement (sections under 5.3).

goptions ftext="Comic Sans MS";

In Windows, the name of the font to put inside the quotation marks is displayed in the Explorer. If bold, italic, or bold italic is available, it can be requested by appending /bo, /it, or /bo/it to the font name, as demonstrated below.

Many statements in SAS/GRAPH accept a font= option, and these can be used to override the default font specified in the goptions statement.

In the example below, we use several different fonts to demonstrate how different statements specify fonts. In fact, the only plot elements in the assigned default Rockwell typeface are the y axis label and numbers and the labels of the symbols in the legend.

filename myurl
url '';

proc import datafile=myurl out=ds dbms=dlm;
run;

goptions ftext = "Rockwell";

title font="Lucida Handwriting" "MCS by PCS with fonts";

legend1 mode=reserve position=(bottom center outside) across=3
label = (font= "Elephant" h=2 "Substance");

axis1 label=(font="Goudy Old Style" h=2 "The CESD axis")
value = ( h = 2 font= "Comic Sans MS") minor=none;

symbol1 font="Comic Sans MS" v='A' h=.7 c=black;
symbol2 font="Agency FB/bo" v='C' h=.7 c=black;
symbol3 font="Franklin Gothic Book/bo/it" v='H' h=.7;
proc gplot data=ds;
where female=1;
plot mcs*cesd=substance / legend=legend1 haxis=axis1;
run; quit;

The results, shown below, demonstrate the comic effects of reserving too much of the plot space for labels.


In R, the available fonts and the ways to use them vary by device. TrueType fonts can be displayed easily with the windows() device, and can similarly be used in publication graphics through the win.metafile() device (section 5.4.5).

windowsFonts(CS = windowsFont("Comic Sans MS"))

The windowsFonts() function would be called for each font to be included.

Unfortunately, the pdf() and postscript() devices most likely to be useful for publication with LaTeX do not appear to be able to read TrueType fonts, only Adobe Type 1 fonts. Some fonts can be purchased or downloaded for free in this format. If any reader has had success using external fonts with these devices, I hope they'll provide code or links in the comments. There are some freely available resources for converting TrueType to Type 1 on *nix operating systems. However, for technical reasons, these conversions don't usually produce satisfying results.

Fortunately, R comes with several fonts for these devices. Their names can be easily displayed:

> names(pdfFonts())
[1] "serif" "sans"
[3] "mono" "AvantGarde"
[5] "Bookman" "Courier"
[7] "Helvetica" "Helvetica-Narrow"
[9] "NewCenturySchoolbook" "Palatino"
[11] "Times" "URWGothic"
[13] "URWBookman" "NimbusMon"
[15] "NimbusSan" "URWHelvetica"
[17] "NimbusSanCond" "CenturySch"
[19] "URWPalladio" "NimbusRom"
[21] "URWTimes" "Japan1"
[23] "Japan1HeiMin" "Japan1GothicBBB"
[25] "Japan1Ryumin" "Korea1"
[27] "Korea1deb" "CNS1"
[29] "GB1"

As with SAS, R offers both a way to change the default font (with the par() function) and fine control of individual options in specific function calls. R stores fonts in families with names as listed above, along with the (confusingly named) font option, an integer where 1 corresponds to plain text (the default), 2 to bold face, 3 to italic, 4 to bold italic, and 5 to the symbol font. Not all faces are necessarily available for all font families.
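For reference, the integer face codes can be tabulated. This is just the mapping described above, written out as a small Python dictionary for concreteness (not code from the original post):

```python
# R's `font` graphical parameter: integer code -> typeface
R_FONT_FACES = {
    1: "plain",
    2: "bold",
    3: "italic",
    4: "bold italic",
    5: "symbol",
}

def face_name(code):
    """Name of the face an R `font` code requests; 1 (plain) is the default."""
    return R_FONT_FACES.get(code, "plain")
```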

plot(cesd[female==1], mcs[female==1], type="n", bty="n", ylab="MCS",
  xlab = '')
text(cesd[female==1&substance=="alcohol"],
  mcs[female==1&substance=="alcohol"], "A", family="AvantGarde", font=2)
text(cesd[female==1&substance=="cocaine"],
  mcs[female==1&substance=="cocaine"], "C", family="serif")
text(cesd[female==1&substance=="heroin"],
  mcs[female==1&substance=="heroin"], "H", family="Courier", font=4)
title(xlab="This is the CESD axis", family="NewCenturySchoolbook", cex.lab=2)
title(family="Helvetica", font.main=3, cex.main=3, "MCS by CESD with fonts")

Similar to the SAS example, the only characters in the default Palatino font are the y axis label and the numerals.

Replicating a SAS legend appearing below the plot would be more difficult in R, as would replicating the default SAS legend that shows different plotted font characters.

Monday, March 22, 2010

Example 7.28: Bubble plots

A bubble plot is a means of displaying three variables in a scatterplot. The third (z) dimension is presented in the size of the plot symbol, typically a circle, whose area or radius is proportional to the value of the third variable. This can be a very effective data presentation method. For example, consider Andrew Gelman's recent re-presentation of data on health expenditure, survival, and annual number of doctor visits per person. On the other hand, Edward Tufte suggests that such representations are ambiguous, in that it is often unclear whether the area, radius, or height reflects the third variable. In addition, he reports that humans tend not to be good judges of relative area.

However, other means of presenting three dimensions on a flat screen or piece of paper often rely on visual cues regarding perspective, which some find difficult to judge.
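Tufte's area-versus-radius ambiguity is easy to quantify. The sketch below (Python, not from the original posts) computes circle radii under both conventions: when the radius is proportional to z, doubling z quadruples the plotted area, while area-proportional scaling keeps the area faithful to z:

```python
import math

def bubble_radii(z, scale="area"):
    """Radii for bubble symbols.  With scale="area", the circle AREA is
    proportional to z; with scale="radius", the radius itself is, so the
    area grows as z squared."""
    if scale == "area":
        return [math.sqrt(v / math.pi) for v in z]
    return [float(v) for v in z]

def areas(radii):
    """Circle areas corresponding to a list of radii."""
    return [math.pi * r ** 2 for r in radii]

a_area = areas(bubble_radii([1, 2], scale="area"))      # areas 1 and 2
a_radius = areas(bubble_radii([1, 2], scale="radius"))  # areas pi and 4*pi
```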

Here we demonstrate SAS and R bubble plots using the HELP data set used in our book. We show a plot of depression by age, with bubble size proportional to the average number of drinks per day. To make the plot a little easier to read, we show this only for female alcohol abusers.


In SAS, we can use the bubble statement in proc gplot. We demonstrate here the use of the where data set option (section 1.5.1) for subsetting, which allows us to avoid using any data steps. SAS allows the circle area or radius to be proportional to the third variable; we choose the radius for compatibility with R. We alter the size of the circles for the same reason. We also demonstrate options for coloring in the filled circles.

libname k "c:\book";

proc gplot data = k.help (where=((female eq 1)
and (substance eq "alcohol")));
bubble cesd*age=i1 / bscale = radius bsize=60
bcolor=blue bfill=solid;
run; quit;


In R, we can use the symbols() function for the plot. Here we also demonstrate reading in data previously saved in native R format (section 1.1.1), as well as the subset() function and the with() function (the latter appears in section 1.3.1). The inches option is an arbitrary scale factor. We note that the symbols() function has a great deal of additional capability-- it can substitute squares for circles for plotting the third variable, and add additional dimensions with rectangles or stars. Proportions can be displayed with thermometers, and boxplots can also be displayed.

femalealc = subset(ds, female==1 & substance=="alcohol")
with(femalealc, symbols(age, cesd, circles=i1,
inches=1/5, bg="blue"))

The results are shown below. It appears that younger women with more depressive symptoms tend to report more drinking.