Monday, August 18, 2014

Example 2014.10: Panel by a continuous variable

In Example 8.40, side-by-side histograms, we showed how to generate histograms for some continuous variable, for each level of a categorical variable in a data set. An anonymous reader asked how we would do this if both the variables were continuous. Keep the questions coming!

SAS
The SAS solution we presented relied on the sgpanel procedure. There, the panelby statement names a variable, and each distinct value of that variable generates a panel. If there are many values, as for a continuous variable, many panels will be generated, which is probably not the desired result. As far as we know, there is no option to automatically categorize a continuous panel variable in proc sgpanel. If this is required, a two-step approach is needed: first make groups from one of the variables, then panel on the groups.

We do that below using proc rank. In this approach, the groups option specifies the number of groups required and the ranks statement names a new variable to hold the group indicator. Once the groups are made, the same code demonstrated earlier can be used. (This is an example of "it's never too late to learn"-- I used to do this via a sort and a data step with implied variables, until I realized that there had to be a way to do it via a procedure. --KK)

In this setting, the panels are another approach to the data we examine in a scatterplot. As an example, we show the physical component score (PCS) by groupings of the mental component score (MCS) in the HELP data set.
proc rank data = 'c:\book\help.sas7bdat' groups = 6 out = catmcs;
var mcs;
ranks mcs_sextile;
run;

title "Histograms of PCS by sextile of MCS";
proc sgpanel data = catmcs;
  panelby mcs_sextile / columns = 3 rows = 2;
  histogram pcs;
run;
We also demonstrate the columns and rows options to the panelby statement, which allow control over the presentation of the panel results. The graphic produced is shown above.

R
Our R solution in the earlier entry used the lattice package (written by Deepayan Sarkar) to plot a formula such as histogram(~a | b). A simple substitution of a continuous covariate b into that syntax will also generate a panel for each distinct value of the covariate: a factor is expected. In the package, an implementation of Trellis graphics, the term "shingles" refers to the categorization of a continuous variable for making panels. The function equal.count() is provided to make the (possibly overlapping) categories of the variable, and the panel headers indicate the range of the continuous covariate included in each panel.
ds = read.csv("http://www.amherst.edu/~nhorton/r2/datasets/help.csv")
library(lattice)
histogram(~ pcs | equal.count(mcs), 
   main="Histograms of PCS by shingle of MCS",
   index.cond=list(c(4,5,6,1,2,3)),data=ds)
Note that the default ordering of panels in lattice is left to right, bottom to top. The index.cond option here re-orders the panels to go from left to right, top to bottom.
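To see why that particular permutation works, here is a small helper of our own (a sketch, not from the original post) that computes the index.cond permutation converting lattice's bottom-to-top filling order into top-to-bottom reading order:

```r
# Hypothetical helper: permutation that makes a rows x cols lattice layout
# read left-to-right, top-to-bottom instead of the bottom-to-top default.
top_to_bottom <- function(rows, cols) {
  # lay the packet numbers out in reading order, flip the rows to match
  # lattice's filling order, then flatten row by row
  as.vector(t(matrix(1:(rows * cols), nrow = rows, byrow = TRUE)[rows:1, ]))
}
top_to_bottom(2, 3)
# c(4, 5, 6, 1, 2, 3) -- the permutation used in index.cond above
```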

The default behavior of equal.count() is to allow some overlap between the categories, which is a little odd. In addition, there is a good deal of visual imprecision in the method used to identify the panels-- there's no key given, and the only indicator of the shingle value is the shading of the title bars. A more precise method would be to use the quantile() function manually, as we demonstrated in example 8.7, the Hosmer and Lemeshow goodness-of-fit test. We show here how the mutate() function in Hadley Wickham's dplyr package can be used to add a new variable to a data frame.

require(dplyr)
ds = mutate(ds, cutmcs = cut(mcs, 
   breaks = quantile(mcs, probs = seq(0, 1, 1/6)), include.lowest = TRUE))
histogram(~ pcs | cutmcs,  main="Histograms of PCS by sextile of MCS",
          index.cond=list(c(4,5,6,1,2,3)), data=ds)
This shows the exact values of the bin ranges in the panel titles, surely a better use of that space. Minor differences in the histograms are due to the overlapping categories included in the previous version.
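One way to see the overlap concretely (a quick check of our own, using simulated data in place of the HELP mcs values):

```r
# Hypothetical check with simulated data standing in for ds$mcs:
# equal.count() shingles may overlap; quantile() breaks never do.
library(lattice)
set.seed(42)
mcs <- rnorm(200, mean = 32, sd = 12)
sh <- equal.count(mcs, number = 6)
print(levels(sh))                              # six (possibly overlapping) intervals
print(quantile(mcs, probs = seq(0, 1, 1/6)))   # seven non-overlapping breakpoints
```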

Finally, we also show the approach one might use with the ggplot2 package, an implementation of Leland Wilkinson's Grammar of Graphics, coded by Hadley Wickham. The package includes the useful cut_number() function, which does something similar to the cut(..., breaks=quantile(...)) construction we showed above. In ggplot2, "facets" are analogous to the shingles used in lattice.
library(ggplot2)
ds = mutate(ds, cutmcsgg = cut_number(mcs, n = 6))
ggplot(ds, aes(pcs)) + geom_histogram() + 
  facet_wrap(~ cutmcsgg) + ggtitle("Histograms of PCS by sextile of MCS")
Roughly, we can read the syntax to state: 1) make a plot from the ds dataset in which the primary analytic variable will be pcs; 2) make histograms; 3) make facets of the cutmcsgg variable; 4) add a title. Since the syntax is a little unusual, Hadley provides the qplot() function, a wrapper which operates more like traditional functions. An identical plot to the above can be generated with qplot() as follows:
qplot(data = ds, x = pcs, geom = "histogram", facets = ~ cutmcsgg, 
   main = "Histograms of PCS by sextile of MCS")


An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Monday, August 11, 2014

Example 2014.9: Rolling averages. Also: Second Edition is shipping!

As of today, the second edition of "SAS and R: Data Management, Statistical Analysis, and Graphics" is shipping from CRC Press, Amazon, and other booksellers. There are lots of additional examples from this blog, new organization, and other features we hope you'll find useful. Thanks for your support. We'll be continuing to blog.

Now, on to today's main course.
For cyclical data, it's sometimes useful to generate rolling averages-- the average of some number of recent measurements, usually one full cycle. For example, for retail sales, one might want the rolling average of the most recent week. The rolling average will dampen the effects of repeated patterns but still show the location of the data.
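As a quick aside (our own illustration, not part of the original workflow), base R's stats::filter() will compute such a trailing average directly when there are no missing values:

```r
# Toy example: 4-point trailing (one-sided) moving average with
# stats::filter(); the result is NA until a full window is available.
x <- c(10, 12, 11, 13, 14, 12, 15)
ra <- as.numeric(stats::filter(x, rep(1/4, 4), sides = 1))
round(ra, 2)
# first three entries are NA; the fourth is mean(10, 12, 11, 13) = 11.5
```

Because filter() propagates missing values, the lag-based approach below is more convenient for data with NAs.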

In keeping with our habit of plotting personal data (e.g., Example 8.11, Example 8.12, Example 10.1, Example 10.2), I'll use my own weight recorded over the past 6 months. After reading about "alternate day dieting" in The Atlantic, I decided to try the diet described in the book by Varady. I've never really tried to diet for weight loss before, but this diet has worked really well for me over the past six months. The basics are that you eat 500 calories every other day (diet days) and on the non-diet days you eat what you want. There's a little science supporting the approach. I can't really recommend the book, unfortunately, unless you're a fan of the self-help style.

As you can imagine, one's weight tends to fluctuate pretty wildly between diet days and non-diet days. The cycle is just two days, but to get a sense of my weight at any given time, it might be best to use the rolling average of the past, say, four days.

The beginning of the data, available from http://www.amherst.edu/~nhorton/sasr2/datasets/weight.txt, follows.
1/11/14 219
1/12/14 NA
1/13/14 219
1/14/14 NA
1/15/14 221.8
1/16/14 218
...


R
As you can tell from the NAs, I compiled the data with the intent to read it into R.
> weights = read.table("c:/temp/weight.txt")
> head(weights)

       V1    V2
1 1/11/14 219.0
2 1/12/14    NA
3 1/13/14 219.0
4 1/14/14    NA
5 1/15/14 221.8
6 1/16/14 218.0
Note, though, that the date values are just character strings (read in as a factor variable), and not so useful as read in.
> str(weights)
'data.frame': 161 obs. of  2 variables:
 $ V1: Factor w/ 161 levels "1/11/14","1/12/14",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ V2: num  219 NA 219 NA 222 ...
The lubridate package contributed by the invaluable Hadley Wickham contains functions to make it easier to use dates in R. Here, I use its mdy() function to convert character values into R dates.
library(lubridate)
with(weights, plot(V2 ~ mdy(V1), 
  xlim = c(mdy("1/1/14"),mdy("6/30/14")),
  ylab="Weight", xlab="Date"))
The simple plot has enough values that you can clearly see the trend of weight loss over time, and perhaps the rolling average exercise is somewhat misplaced here. To calculate the rolling average, I adapted (below) the lag function from section 2.2.18 (2nd edition; 1.4.17 in the 1st ed.)-- this is a simpler version that does not check for errors. The result of lag(x,k) is a vector whose first k values are missing and whose remaining values are the beginning values of x. Thus the ith value of lag(x,k) is x[i-k]. To get the rolling average, I take the mean of several lags, using the rowMeans() function to do it for all the values at once. The lines() function adds the rolling average to the plot.
lag = function(x, k) {
  # simple k-step lag with no error checking; note this masks stats::lag()
  return( c(rep(NA, k), x[1:(length(x) - k)]) )
}

y = weights$V2
ra = rowMeans(
  matrix(c(y, lag(y, 1), lag(y, 2), lag(y, 3)), ncol = 4, byrow = FALSE),
  na.rm = TRUE)

lines(mdy(weights$V1),ra)
The final plot is shown above. Note that the initial values of the lagged vector are missing, as are weights for several dates throughout this period. The na.rm=TRUE option causes rowMeans() to return the mean of the observed values-- equivalent to a single imputation of the mean of the observed values, which perhaps Nick will allow me in this setting (note from NH: I don't have major issues with this). There are also two periods where I failed to record weights for four days running. For these periods, rowMeans() returns NaN ("Not a Number"). This is usefully rendered as gaps in the plot where the running average line is not drawn. Compare, for instance, with the default SAS behavior shown below. For the record, I was ill in early May and had little appetite regardless of my dieting schedule.
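The NaN behavior is easy to verify on a toy matrix (our own illustration, not from the post):

```r
# Rows with some observed values average the non-missing entries;
# an all-NA row yields NaN, which plotting functions treat as a gap.
m <- matrix(c(219, NA,  NA, NA,
              221, 220, NA, NA,
              NA,  NA,  NA, NA), nrow = 3, byrow = TRUE)
rowMeans(m, na.rm = TRUE)
# row 1: 219; row 2: 220.5; row 3: NaN
```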

SAS
The data can be easily read with the input statement. The mmddyy7. informat tells SAS that the data in the first field are as many as 7 characters long and should be read as dates; SAS will store them as SAS dates (section 2.4 in the 2nd edition; 1.6 in the 1st edition). As the data are read in, I use the lag, lag2, and lag3 functions (section 2.2.18, 2nd edition; 1.4.17 in the 1st ed.) to recall the values from recent days and calculate the rolling average as I go.
data weights;
infile "c:\temp\weight.txt";
input date mmddyy7. weight;
ra = mean(weight, lag(weight), lag2(weight), lag3(weight));
run;
Note that the input statement expects the weight values to be numbers, and interprets the NAs in the data as "Invalid data". It inserts missing values into the data set, which is what we desire. The mean function returns the mean of the non-missing values; when the weight and all of the lagged values of weight are missing, it returns a missing value. With the rolling average in hand, I can plot the observed weights and the rolling average. To print readable dates rather than raw SAS date values (days since January 1, 1960), use the format statement to tell SAS that the date variable should be printed using the date. format.
symbol1 i = none v=dot c = blue;
symbol2 i = j v = none c = black w=5;
proc gplot data = weights;
plot (weight ra)*date /overlay;
format date date.;
run;
The results are shown below. The main difference from the R plot is that the gaps in my recording do not appear in the line. The SAS symbol statement, the equivalent of the lines() function, more or less, does not encounter NaNs, but only missing values, and so it connects the points. I think R's behavior is more appropriate here-- there's no particular reason to suppose a linear interpolation between the observed data points is best, and so the line ought to be missing.
