Monday, August 18, 2014

Example 2014.10: Panel by a continuous variable

In Example 8.40, side-by-side histograms, we showed how to generate histograms for some continuous variable, for each level of a categorical variable in a data set. An anonymous reader asked how we would do this if both the variables were continuous. Keep the questions coming!

SAS
The SAS solution we presented relied on the sgpanel procedure. There, the panelby statement names a variable for which each distinct value will generate a panel. If there are many values, for example for a continuous variable, there will be many panels generated, which is probably not the desired result. As far as we know, there is no option to automatically categorize a continuous panel variable in proc sgpanel. If this is required, a two-step approach will be needed to first make groups of one of the variables.

We do that below using proc rank. In this approach, the groups option is the number of groups required and the ranks statement names a new variable to hold the group indicator. Once the groups are made, the same code demonstrated earlier can be used. (This is an example of "it's never too late to learn"-- I used to do this via a sort and a data step with implied variables, until I realized that there had to be a way to it via a procedure. --KK)

In this setting, the panels are another approach to the data we examine in a scatterplot. As an example, we show the mental compentency score by grouping of the physical competency score in the HELP data set.
proc rank data = 'c:\book\help.sas7bdat' groups = 6 out = catmcs;
var mcs;
ranks mcs_sextile;
run;

title "Histograms of PCS by sextile of MCS";
proc sgpanel data = catmcs;
  panelby mcs_sextile / columns = 3 rows =2;
  histogram pcs;
run;
We also demonstrate the columns and rows options to the panelby statement, which allow control over the presentation of the panel results. The graphic produced is shown above.

R
Our R solution in the earlier entry used the lattice package (written by Deepayan Sarkar) to plot a formula such as histogram(~a | b). A simple substitution of a continuous covariate b into that syntax will also generate a panel for each distinct value of the covariates: a factor is expected. In the package, an implementation of Trellis graphics, the term "shingles" is used to approach the notion of categorizing a continuous variable for making panels. The function equal.count() is provided to make the (possibly overlapping) categories of the variables, and uses the panel headers to suggest the ranges of continuous covariate that are included in each panel.
ds = read.csv("http://www.amherst.edu/~nhorton/r2/datasets/help.csv")
library(lattice)
histogram(~ pcs | equal.count(mcs), 
   main="Histograms of PCS by shingle of MCS",
   index.cond=list(c(4,5,6,1,2,3)),data=ds)
Note that the default ordering of panels in lattice is left to right, bottom to top. The index.cond option here re-orders the panels to go from left to right, top to bottom.

The default behavior of equal.count() is to allow some overlap between the categories, which is a little odd. In addition, there is a good deal of visual imprecision in the method used to identify the panels-- there's no key given, and the only indicator of the shingle value is the shading of the title bars. A more precise method would be to use the quantile() function manually, as we demonstrated in example 8.7, the Hosmer and Lemeshow goodness-of-fit test. We show here how the mutate() function in Hadley Wickham's dplyr package can be used to add a new variable to a data frame.

require(dplyr)
ds = mutate(ds, cutmcs = cut(ds$mcs, 
   breaks = quantile(ds$mcs, probs=seq(0,1, 1/6)), include.lowest=TRUE))
histogram(~ pcs | cutmcs,  main="Histograms of PCS by sextile of MCS",
          index.cond=list(c(4,5,6,1,2,3)), data=ds)
This shows the exact values of the bin ranges in the panel titles, surely a better use of that space. Minor differences in the histograms are due to the overlapping categories included in the previous version.

Finally, we also show the approach one might use with the ggplot2 package, an implementation of Leland Wilkinson's Grammar of Graphics, coded by Hadley Wickham. The package includes the useful cut_number() function, which does something similar to the cut(..., breaks=quantile(...)) construction we showed above. In ggplot2, "facets" are analogous to the shingles used in lattice.
library(ggplot2)
ds = mutate(ds, cutmcsgg = cut_number(ds$mcs, n=6))
ggplot(ds, aes(pcs)) + geom_bar() + 
  facet_wrap(~cutmcsgg) + ggtitle("Histograms of PCS by sextile of MCS")
Roughly, we can read the syntax to state: 1) make a plot from the ds dataset in which the primary analytic variable will be pcs; 2) make histograms; 3) make facets of the cutmcsgg variable; 4) add a title. Since the syntax is a little unusual, Hadley provides the qplot() function, a wrapper which operates more like traditional functions. An identical plot to the above can be generated with qplot() as follows:
qplot(data=ds,x=pcs, geom="bar", facets= ~cutmcsgg, 
   main="Histograms of PCS by sextile of MCS")


An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

5 comments:

Chris Andrews said...

You can use a FORMAT statement to categorize a continuous variable

DATA a;
DO i=1 TO 10000;
x = RAND("normal");
y = RAND("normal") + x;
OUTPUT;
END;
RUN;

PROC FORMAT;
VALUE broken
LOW-<-3.5=-4
-3.5-<-2.5 = -3
-2.5-<-1.5 = -2
-1.5-<-0.5 = -1
-0.5-<0.5 = 0
0.5-<1.5 = 1
1.5-<2.5 = 2
2.5-<3.5 = 3
3.5-HIGH = 4;
RUN;


TITLE "Histograms paneled by a continuous variable";
PROC SGPANEL DATA = a;
FORMAT y broken.;
PANELBY y / COLUMNS = 3 ROWS = 3;
HISTOGRAM x;
RUN;

Ken Kleinman said...

Thanks, Chris. We should have mentioned that, with our usual cautions and principled explanations of why we generally avoid formats in SAS.

Kris said...

Hi Ken,
I'm curious as to your reasons for avoiding formats in SAS. This might be answered in your book (which is on my shopping list!).

I've found formats and informats to be very valuable in SAS, particularly when it comes to storing and processing large (my definition >1 million rows) datasets, and being able to create these directly from metadata is not only time-saving but also takes out a potential source of error.

Cheers, Kris.

Ken Kleinman said...

Thanks, Sebastian. as.table=TRUE is nice to re-order top-to-bottom, left-to-right, thanks! The demonstrated code has the advantage of generality, so it's great to know both.

Ken Kleinman said...

Kris, informats are invaluable-- I didn't mean to cast any aspersions against them. Formats included with SAS can also be very useful, especially date/time formats.

My objections are to user-defined formats. They mostly come from my experiences in using others' data sets.

My preference is for data to be as self-contained and fully self-descriptive as possible. As far as I know, it is not possible to save user-defined formats with your data. So depending on formats, rather than true recoding of the data, runs the risk of losing the recoding. Perhaps this is not as bad as losing the data, but it can be extremely frustrating. (A related issue is that when you share data but not the proc format code to make your formats, the colleagues you share with will get errors unless they use options nofmterr;.) In the code shown by Chris Andrews above, it might not be possible to re-generate a format-based graphic once the proc format code was lost. This would be the case, for example, if there was a typo in the format definition.

Another thing I don't like about formats is that they hide the true values stored in the data set. For example, consider a modification of Chris Andrews' code to show a typical way people use formats:

DATA a;
DO i=1 TO 10000;
x = RAND("normal");
y = RAND("normal") + x;
OUTPUT;
END;
format x broken.;
RUN;

This is "nice" in that my proc sgpanel code is shorter-- no need to use the format statement there:

proc sgpanel data=a;
panelby x;
histogram y;
run;

But when I look at my data, I only see those formatted values!

proc print data = a (obs=2); run;

Obs i x y

1 1 2 1.94797
2 2 -1 -2.68371

I suppose both of the these objections carry less weight if you never share your data, or if you never associate formats in your data set. And I see your point about the storage, and will take your word for it on the time savings, though I'm skeptical. I don't think I see a way it can help with accuracy, though, and I give some counter-examples here. What's your experience with improved accuracy through using them.