
Thursday, September 21, 2017

ggformula: another option for teaching graphics in R to beginners

A previous entry (http://sas-and-r.blogspot.com/2017/07/options-for-teaching-r-to-beginners.html) describes an approach to teaching graphics in R that also “get[s] students doing powerful things quickly”, as David Robinson suggested.

In this guest blog entry, Randall Pruim offers an alternative way based on a different formula interface. Here's Randall: 

For a number of years, several of my colleagues and I have been teaching R to beginners using an approach that includes a combination of
  • the lattice package for graphics,
  • several functions from the stats package for modeling (e.g., lm(), t.test()), and
  • the mosaic package for numerical summaries and for smoothing over edge cases and inconsistencies in the other two components.
Important in this approach is the syntactic similarity that the following “formula template” brings to all of these operations.  

    goal ( y ~ x , data = mydata, ... )


Many data analysis operations can be executed by filling in four pieces of information (goal, y, x, and mydata) with the appropriate information for the desired task. This allows students to become fluent quickly with a powerful, coherent toolkit for data analysis.
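For example, here is a minimal sketch of the template in action (using the HELPrct data bundled with mosaic; any data frame with a quantitative y and a grouping x would do):

library(mosaic)
mean(age ~ sex, data = HELPrct)      # goal = mean: group means
bwplot(age ~ sex, data = HELPrct)    # goal = bwplot: side-by-side boxplots
t.test(age ~ sex, data = HELPrct)    # goal = t.test: two-sample comparison
lm(age ~ sex, data = HELPrct)        # goal = lm: a simple linear model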

Trouble in paradise
As the earlier post noted, the use of lattice has some drawbacks. While basic graphs like histograms, boxplots, scatterplots, and quantile-quantile plots are simple to make with lattice, it is challenging to combine these simple plots into more complex plots or to plot data from multiple data sources. Splitting data into subgroups and either overlaying with multiple colors or separating into sub-plots (facets) is easy, but labeling such plots is less convenient (and takes more space) than in the equivalent plots made with ggplot2. And in our experience, students generally find the look of ggplot2 graphics more appealing.
On the other hand, introducing ggplot2 into a first course is challenging. The syntax tends to be more verbose, so it takes up more of the limited space on projected images and course handouts. More importantly, the syntax is entirely unrelated to the syntax used for other aspects of the course. For those adopting a “Less Volume, More Creativity” approach, ggplot2 is tough to justify.
ggformula: The third-and-a-half way
Danny Kaplan and I recently introduced ggformula, an R package that provides a formula interface to ggplot2 graphics. Our hope is that this provides the best aspects of lattice (the formula interface and lighter syntax) and ggplot2 (modularity, layering, and better visual aesthetics).
For simple plots, the only thing that changes is the name of the plotting function. Each of these functions begins with gf. Here are two examples, either of which could replace the side-by-side boxplots made with lattice in the previous post.
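As a concrete sketch (the original figures are not reproduced here; this reuses the recoded two-color diamonds data constructed in the July post below):

library(ggformula)
gf_boxplot(price ~ col, data = recoded)   # side-by-side boxplots
gf_violin(price ~ col, data = recoded)    # violin plots of the same data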
We can even overlay these two types of plots to see how they compare. To do so, we simply place what I call the "then" operator (%>%, also commonly called a pipe) between the two layers and adjust the transparency so we can see both where they overlap.
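A sketch of such an overlay (again assuming the recoded data; alpha controls the transparency):

gf_violin(price ~ col, data = recoded, alpha = 0.3) %>%
  gf_boxplot(alpha = 0.5)   # the boxplot layer inherits formula and data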

Comparing groups
Groups can be compared either by overlaying multiple groups distinguishable by some attribute (e.g., color)
or by creating multiple plots arranged in a grid rather than overlaying subgroups in the same space. The ggformula package provides two ways to create these facets. The first uses | very much like lattice does, as sketched below. Notice that the gf_lm() layer inherits information from the gf_point() layer in these plots, saving some typing when the information is the same in multiple layers.
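A hedged sketch of the lattice-style facet syntax, with the same data assumptions as above:

gf_point(price ~ carat | col, data = recoded, alpha = 0.2) %>%
  gf_lm()   # inherits the formula and data from gf_point()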


The second way adds facets with gf_facet_wrap() or gf_facet_grid() and can be more convenient for complex plots or when customization of facets is desired.
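For example (a sketch, same data assumptions as above):

gf_point(price ~ carat, data = recoded, alpha = 0.2) %>%
  gf_lm() %>%
  gf_facet_wrap(~ col)   # one facet per color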
Fitting into the tidyverse workflow
ggformula also fits into a tidyverse-style workflow (arguably better than ggplot2 itself does). Data can be piped into the initial call to a ggformula function, and there is no need to switch between %>% and + when moving from data transformations to plot operations.
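A minimal sketch of that workflow (assuming dplyr and ggformula are loaded):

diamonds %>%
  filter(color %in% c("D", "J")) %>%
  gf_boxplot(price ~ color)   # no switch from %>% to + mid-pipeline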
Summary
The “Less Volume, More Creativity” approach is based on a common formula template that has served well for several years, but the arrival of ggformula strengthens this approach by bringing a richer graphical system into reach for beginners without introducing new syntactical structures. The full range of ggplot2 features and customizations remains available, and the ggformula package vignettes and tutorials describe these in more detail.
-- Randall Pruim

Thursday, July 27, 2017

Options for teaching R to beginners: a false dichotomy?

I've been reading David Robinson's excellent blog entry "Teach the tidyverse to beginners" (http://varianceexplained.org/r/teach-tidyverse), which argues that a tidyverse approach is the best way to teach beginners.  He summarizes two competing curricula:

1) "Base R first": teach syntax such as $ and [[]], built in functions like ave() and tapply(), and use base graphics

2) "Tidyverse first": start from scratch with pipes (%>%) and leverage dplyr and use ggplot2 for graphics

If I had to choose one of these approaches, I'd also go with 2) ("Tidyverse first"), since it moves us closer to helping our students "think with data" using more powerful tools (see here for my sermon on this topic).

A third way

Of course, there’s a third option that addresses David’s imperative to "get students doing powerful things quickly".  The mosaic package was written to make R easier to use in introductory statistics courses.  The package is part of Project MOSAIC (http://mosaic-web.org), an NSF-funded initiative to integrate statistics, modeling, and computing. A paper outlining the mosaic package's "Less Volume, More Creativity" approach was recently published in the R Journal (https://journal.r-project.org/archive/2017/RJ-2017-024). To his credit, David mentions the mosaic package in a response to one of the comments on his blog.

Less Volume, More Creativity

One of the big ideas in the mosaic package is that students build on the existing formula interface in R as a mechanism to calculate summary statistics, generate graphical displays, and fit regression models. Randy Pruim has dubbed this approach "Less Volume, More Creativity".

While teaching this formula interface involves adding a new learning outcome (what is "Y ~ X"?), the mosaic approach simplifies the calculation of summary statistics by groups and the generation of two- or three-dimensional displays on day one of an introductory statistics course (see, for example, Wang et al., "Data Viz on Day One: bringing big ideas into intro stats early and often" (2017), TISE).

The formula interface also prepares students for more complicated models in R (e.g., logistic regression, classification).
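For instance (a hedged sketch using mosaic's HELPrct data, not an example from the original post), a logistic regression follows the identical template:

library(mosaic)
glm(homeless ~ age, family = binomial, data = HELPrct)   # same goal(y ~ x, data = ...) pattern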

Here's a simple example using the diamonds data from the ggplot2 package.  We model the relationships between two colors (D and J), number of carats, and price.

I'll begin with a bit of data wrangling to generate an analytic dataset with just those two colors. (Early in a course I would either hide the next code chunk or make the recoded data frame accessible to the students, to avoid cognitive overload.) Note that an R Markdown file with the following commands is available for download at https://nhorton.people.amherst.edu/mosaic-blog.Rmd.

library(mosaic)
recoded <- diamonds %>%
  filter(color=="D" | color=="J") %>%
  mutate(col = as.character(color))

We first calculate the mean price (in US$) for each of the two colors.

mean(price ~ col, data = recoded)
   D    J 
3170 5324 

This call is an example of how the formula interface facilitates calculation of a variable's mean for each of the levels of another variable. We see that D color diamonds tend to cost less than J color diamonds.

A useful function in mosaic is favstats(), which provides a comprehensive set of summary statistics (including the sample size and the number of missing values) by group.

favstats(price ~ col, data = recoded)
  col min   Q1 median   Q3   max mean   sd    n missing
1   D 357  911   1838 4214 18693 3170 3357 6775       0
2   J 335 1860   4234 7695 18710 5324 4438 2808       0

A similar command can be used to generate side-by-side boxplots. Here we illustrate the use of lattice graphics. (An alternative formula-based graphics system, ggformula, will be the focus of a future post.)

bwplot(col ~ price, data = recoded)


The distributions are skewed to the right (not surprisingly since they are prices). If we wanted to formally compare these sample means we could do so with a two-sample t-test (or in a similar fashion, by fitting a linear model).

t.test(price ~ col, data = recoded)
Welch Two Sample t-test

data:  price by col
t = -20, df = 4000, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2336 -1971
sample estimates:
mean in group D mean in group J 
           3170            5324 


msummary(lm(price ~ col, data = recoded))
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3170.0       45.0    70.4   <2e-16 ***
colJ          2153.9       83.2    25.9   <2e-16 ***

Residual standard error: 3710 on 9581 degrees of freedom
Multiple R-squared:  0.0654, Adjusted R-squared:  0.0653 

F-statistic:  670 on 1 and 9581 DF,  p-value: <2e-16

The results from the two approaches are consistent: the group differences are highly statistically significant. We could conclude that J diamonds tend to cost more than D diamonds in the population of all diamonds.

Let's do a quick review of the mosaic modeling syntax to date:
mean(price ~ col)
bwplot(price ~ col)
t.test(price ~ col)
lm(price ~ col)

See the pattern?

On a statistical note, it's important to remember that the diamonds were not randomized into colors: this is a found (observational) dataset, so there may be other factors at play. The revised GAISE College report reiterates the importance of multivariate thinking in intro stats.

Moving to three dimensions

Let's continue with the "Less Volume, More Creativity" approach to bring in a third variable: the number of carats in each diamond.

xyplot(price ~ carat, groups=col, auto.key=TRUE, type=c("p", "r"), data = recoded)
We see that controlling for the number of carats, the D color diamonds tend to sell for more than the J color diamonds.  We can confirm this by fitting a regression model that controls for both variables (and then display the resulting predicted values from this parallel slopes model using plotModel()).
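A sketch of what that model might look like (plotModel() ships with mosaic; the exact call used for the original figure isn't shown in this text):

mod <- lm(price ~ carat + col, data = recoded)   # parallel slopes model
msummary(mod)
plotModel(mod)   # overlays the fitted parallel lines for colors D and J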
This is a great example of Simpson's paradox: accounting for the number of carats has yielded opposite results from a model that didn't include carats. If we were to move forward with such an analysis we'd need to be sure to undertake an assessment of our model and verify conditions and assumptions (but for the purpose of the blog entry I'll defer that).

Moving beyond mosaic

The revised GAISE College report enunciated the importance of technology when teaching statistics. Many courses still use calculators or web-based applets to incorporate technology into their classes. R is an excellent environment for teaching statistics, but many instructors feel uncomfortable using it (particularly if they feel compelled to teach the $ and [[]] syntax, which many find off-putting). The mosaic approach helps make the use of R feasible for many audiences by keeping things simple.

It's unfortunately true that many introductory statistics courses don't move beyond bivariate relationships (so students may feel paralyzed about what to do about other factors). The mosaic approach has the advantage that it can bring multivariate thinking, modeling, and exploratory data tools together with a single interface (and a modest degree of difficulty in terms of syntax). I've been teaching multiple regression as a descriptive method early in an intro stat course for the past ten years (and it helps to get students excited about material that they haven't seen before).

The mosaic approach also scales well: it's straightforward to teach students dplyr/tidyverse data wrangling by adding in the pipe operator and some key data idioms. (So perhaps the third option should be labeled "mosaic and tidyverse".)

See the following for an example of how favstats() can be replaced by dplyr idioms. 

recoded %>%
  group_by(col) %>%
  summarize(meanval = mean(price, na.rm = TRUE))
  col meanval
1 D      3170
2 J      5324
That being said, I suspect that many students (and instructors) will still use favstats() for simple tasks (e.g., to check sample sizes, check for missing data, etc.). I know that I do. But the important thing is that unlike training wheels, mosaic doesn't hold them back when they want to learn new things.

I'm a big fan of ggplot2, but even Hadley agrees that the existing syntax is not what he wants it to be. While it's not hard to learn to use + to glue together multiple graphics commands and to get your head around aesthetics, teaching ggplot2 adds several additional learning outcomes to a course that's already overly pregnant with them.


Side note

I would argue that a lot of what is in mosaic should have been in base R (e.g., a formula interface to mean(), a data= option for mean()). Other parts are more focused on teaching (e.g., plotModel(), xpnorm(), and resampling with the do() function).

Closing thoughts

In summary, I argue that the mosaic approach is consistent with the tidyverse. It dovetails nicely with David's "Teach tidyverse" as an intermediate step that may be more accessible for undergraduate audiences without a strong computing background.  I'd encourage people to check it out (and let Randy, Danny, and me know if there are ways to improve the package).

Want to learn more about mosaic?  In addition to the R Journal paper referenced above, you can see how we get students using R quickly in the package's "Less Volume, More Creativity" and "Minimal R" vignettes.  We also provide curated examples from commonly used textbooks in the “mosaic resources” vignette and a series of freely downloadable and remixable monographs including The Student’s Guide to R and Start Teaching with R.

Monday, August 18, 2014

Example 2014.10: Panel by a continuous variable

In Example 8.40, side-by-side histograms, we showed how to generate histograms for some continuous variable, for each level of a categorical variable in a data set. An anonymous reader asked how we would do this if both the variables were continuous. Keep the questions coming!

SAS
The SAS solution we presented relied on the sgpanel procedure. There, the panelby statement names a variable for which each distinct value will generate a panel. If there are many values, for example for a continuous variable, there will be many panels generated, which is probably not the desired result. As far as we know, there is no option to automatically categorize a continuous panel variable in proc sgpanel. If this is required, a two-step approach will be needed to first make groups of one of the variables.

We do that below using proc rank. In this approach, the groups option is the number of groups required and the ranks statement names a new variable to hold the group indicator. Once the groups are made, the same code demonstrated earlier can be used. (This is an example of "it's never too late to learn"-- I used to do this via a sort and a data step with implied variables, until I realized that there had to be a way to do it via a procedure. --KK)

In this setting, the panels are another approach to the data we examine in a scatterplot. As an example, we show the physical competency score (PCS) by grouping of the mental competency score (MCS) in the HELP data set.
proc rank data = 'c:\book\help.sas7bdat' groups = 6 out = catmcs;
var mcs;
ranks mcs_sextile;
run;

title "Histograms of PCS by sextile of MCS";
proc sgpanel data = catmcs;
  panelby mcs_sextile / columns = 3 rows =2;
  histogram pcs;
run;
We also demonstrate the columns and rows options to the panelby statement, which allow control over the presentation of the panel results. The graphic produced is shown above.

R
Our R solution in the earlier entry used the lattice package (written by Deepayan Sarkar) to plot a formula such as histogram(~a | b). A simple substitution of a continuous covariate b into that syntax will generate a panel for each distinct value of the covariate, which is probably not what's wanted: a factor is expected. In the package, an implementation of Trellis graphics, the term "shingles" is used for the notion of categorizing a continuous variable to make panels. The function equal.count() is provided to make the (possibly overlapping) categories of the variable, and the panel headers are used to suggest the ranges of the continuous covariate included in each panel.
ds = read.csv("http://www.amherst.edu/~nhorton/r2/datasets/help.csv")
library(lattice)
histogram(~ pcs | equal.count(mcs), 
   main="Histograms of PCS by shingle of MCS",
   index.cond=list(c(4,5,6,1,2,3)),data=ds)
Note that the default ordering of panels in lattice is left to right, bottom to top. The index.cond option here re-orders the panels to go from left to right, top to bottom.

The default behavior of equal.count() is to allow some overlap between the categories, which is a little odd. In addition, there is a good deal of visual imprecision in the method used to identify the panels-- there's no key given, and the only indicator of the shingle value is the shading of the title bars. A more precise method would be to use the quantile() function manually, as we demonstrated in example 8.7, the Hosmer and Lemeshow goodness-of-fit test. We show here how the mutate() function in Hadley Wickham's dplyr package can be used to add a new variable to a data frame.

require(dplyr)
ds = mutate(ds, cutmcs = cut(mcs, 
   breaks = quantile(mcs, probs = seq(0, 1, 1/6)), include.lowest = TRUE))
histogram(~ pcs | cutmcs,  main="Histograms of PCS by sextile of MCS",
          index.cond=list(c(4,5,6,1,2,3)), data=ds)
This shows the exact values of the bin ranges in the panel titles, surely a better use of that space. Minor differences in the histograms are due to the overlapping categories included in the previous version.

Finally, we also show the approach one might use with the ggplot2 package, an implementation of Leland Wilkinson's Grammar of Graphics, coded by Hadley Wickham. The package includes the useful cut_number() function, which does something similar to the cut(..., breaks=quantile(...)) construction we showed above. In ggplot2, "facets" are analogous to the shingles used in lattice.
library(ggplot2)
ds = mutate(ds, cutmcsgg = cut_number(mcs, n = 6))
ggplot(ds, aes(pcs)) + geom_bar() + 
  facet_wrap(~cutmcsgg) + ggtitle("Histograms of PCS by sextile of MCS")
Roughly, we can read the syntax to state: 1) make a plot from the ds dataset in which the primary analytic variable will be pcs; 2) make histograms; 3) make facets of the cutmcsgg variable; 4) add a title. Since the syntax is a little unusual, Hadley provides the qplot() function, a wrapper which operates more like traditional functions. An identical plot to the above can be generated with qplot() as follows:
qplot(data=ds,x=pcs, geom="bar", facets= ~cutmcsgg, 
   main="Histograms of PCS by sextile of MCS")


An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Monday, August 11, 2014

Example 2014.9: Rolling averages. Also: Second Edition is shipping!

As of today, the second edition of "SAS and R: Data Management, Statistical Analysis, and Graphics" is shipping from CRC Press, Amazon, and other booksellers. There are lots of additional examples from this blog, new organization, and other features we hope you'll find useful. Thanks for your support. We'll be continuing to blog.

Now, on to today's main course.
For cyclical data, it's sometimes useful to generate rolling averages-- the average of some number of recent measurements, usually one full cycle. For example, for retail sales, one might want the rolling average of the most recent week. The rolling average will dampen the effects of repeated patterns but still show the location of the data.

In keeping with our habit of plotting personal data (e.g., Example 8.11, Example 8.12, Example 10.1, Example 10.2), I'll use my own weight recorded over the past 6 months. After reading about "alternate day dieting" in The Atlantic, I decided to try the diet described in the book by Varady. I've never really tried to diet for weight loss before, but this diet has worked really well for me over the past six months. The basics are that you eat 500 calories every other day (diet days) and on the non-diet days you eat what you want. There's a little science supporting the approach. I can't really recommend the book, unfortunately, unless you're a fan of the self-help style.

As you can imagine, one's weight tends to fluctuate pretty wildly between diet days and non-diet days. The cycle is just two days, but to get a sense of my weight at any given time, it might be best to use the rolling average of the past, say, four days.

The beginning of the data, available from http://www.amherst.edu/~nhorton/sasr2/datasets/weight.txt, follows.
1/11/14 219
1/12/14 NA
1/13/14 219
1/14/14 NA
1/15/14 221.8
1/16/14 218
...


R
As you can tell from the NAs, I compiled the data with the intent to read it into R.
> weights = read.table("c:/temp/weight.txt")
> head(weights)

       V1    V2
1 1/11/14 219.0
2 1/12/14    NA
3 1/13/14 219.0
4 1/14/14    NA
5 1/15/14 221.8
6 1/16/14 218.0
Note, though, that the date values are just character strings (read in as a factor variable), and not so useful as read in.
> str(weights)
'data.frame': 161 obs. of  2 variables:
 $ V1: Factor w/ 161 levels "1/11/14","1/12/14",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ V2: num  219 NA 219 NA 222 ...
The lubridate package contributed by the invaluable Hadley Wickham contains functions that make it easier to use dates in R. Here, I use its mdy() function to convert the character values into R dates.
library(lubridate)
with(weights, plot(V2 ~ mdy(V1), 
  xlim = c(mdy("1/1/14"),mdy("6/30/14")),
  ylab="Weight", xlab="Date"))
The simple plot has enough values that you can clearly see the trend of weight loss over time, and perhaps the rolling average exercise is somewhat misplaced here.

To calculate the rolling average, I adapted (below) the lag function from section 2.2.18 (2nd edition; 1.4.17 in the 1st ed.)-- this is a simpler version that does not check for errors. The result of lag(x,k) is a vector with the first k values missing and with the remaining values being the beginning values of x. Thus the ith value of lag(x,k) is x[i-k]. To get the rolling average, I just take the mean of several lags. Here I use the rowMeans() function to do it for all the values at once. The lines() function adds the rolling average to the plot.
lag = function(x,k) {
  # shift x by k positions: element i of the result is x[i-k],
  # with NAs filling the first k positions
  return( c(rep(NA,k), x[1:(length(x)-k)]) )
}

y = weights$V2
ra = rowMeans(
  matrix(c(y,lag(y,1),lag(y,2),lag(y,3)),ncol=4,byrow=F),
    na.rm=T)

lines(mdy(weights$V1),ra)
The final plot is shown above. Note that the initial values of the lagged vector are missing, as are weights for several dates throughout this period. The na.rm=T option causes rowMeans() to return the mean of the observed values-- equivalent to a single imputation of the mean of the observed values, which perhaps Nick will allow me in this setting (note from NH: I don't have major issues with this). There are also two periods where I failed to record weights for four days running. For these periods, rowMeans() returns NaN, or "Not a Number". This is usefully converted to regions in the plot where the running average line is not plotted. Compare, for instance, with the default SAS behavior shown below. For the record, I was ill in early May and had little appetite regardless of my dieting schedule.

SAS
The data can be easily read with the input statement. The mmddyy7. informat tells SAS that the data in the first field are as many as 7 characters long and should be read as dates. SAS will store them as SAS dates (section 2.4 in the 2nd edition; 1.6 in the 1st edition). As the data are read in, I use the lag functions lag(), lag2(), and lag3() (section 2.2.18, 2nd edition; 1.4.17 in the 1st ed.) to recall the values from recent days and calculate the rolling average as I go.
data weights;
infile "c:\temp\weight.txt";
input date mmddyy7. weight;
ra = mean(weight,lag(weight), lag2(weight), lag3(weight));
run;
Note that the input statement expects the weight values to be numbers, and interprets the NAs in the data as "Invalid data". It inserts missing values into the data set, which is what we desire. The mean function provides the mean of the non-missing values. When the weight and all of the lagged values of weight are missing, it will return a missing value. With the rolling average in hand, I can plot the observed weights and the rolling average. To print Julian dates rather than SAS dates, use the format statement to tell SAS that the date variable should be printed using the date. format.
symbol1 i = none v=dot c = blue;
symbol2 i = j v = none c = black w=5;
proc gplot data = weights;
plot (weight ra)*date /overlay;
format date date.;
run;
The results are shown below. The main difference from the R plot is that the gaps in my recording do not appear in the line. The SAS symbol statement, the equivalent of the lines() function, more or less, does not encounter NaNs, but only missing values, and so it connects the points. I think R's behavior is more appropriate here-- there's no particular reason to suppose a linear interpolation between the observed data points is best, and so the line ought to be missing.



Monday, December 10, 2012

Example 10.8: The upper 95% CI is 3.69

Apologies for the long and unannounced break-- the longest since we started blogging, three and a half years ago. I was writing a 2-day course for SAS users to learn R. Contact me if you're interested. And Nick and I are beginning work on the second edition of our book-- look for it in the fall. Please let us know if you have ideas about what we omitted last time or would otherwise like to see added. In the mean time, we'll keep blogging, though likely at a reduced rate.


Today: what can you say about the probability of an event if the observed number of events is 0? It turns out that the upper 95% CI for the probability is 3.69/N. There's a sweet little paper with some rationale for this, but it's in my other office. And I couldn't recall the precise value-- so I used SAS and R to demonstrate it to myself.
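A quick sketch of where the constant comes from (my derivation, not from the paper): with 0 events observed in N trials, the exact (Clopper-Pearson) upper limit pU solves (1 - pU)^N = 0.025, so pU = 1 - 0.025^(1/N), which for large N is approximately -log(0.025)/N. In R:

-log(0.025)
[1] 3.688879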

R

The R code is remarkably concise. After generating some Ns, we write a little function to perform the test and extract the (exact) upper 95% confidence limit. This is facilitated by the "..." notation, which passes along unused arguments to functions. Then we use apply() to call the new function for each N, passing the numerator 0 each time. Note that apply() needs a matrix argument, so the simple vector of Ns is converted to a matrix before use. [The sapply() function will accept a vector input, but took about 8 times as long to run.] Finally, we plot the upper limit * N against N, showing the asymptote. A log-scaled x-axis is useful here, and is achieved with the log='x' option (section 5.3.12). The result is shown above.
bin.m = seq(10, 10000, by=5)
mybt = function(...) { binom.test(...)$conf.int[2] }
uci = apply(as.matrix(bin.m), 1, mybt, x=0)
plot(y=bin.m * uci, x=bin.m, ylim=c(0,4), type="l", 
     lwd=5, col="red", cex=5, log='x',  
     ylab="Exact upper CI", xlab="Sample size", 
     main="Upper CI when there are 0 cases observed")
abline(h=3.69)


SAS

In SAS, the data, really just the N and a numerator of 0, are generated in a data step. The CI are found using the binomial option in the proc freq tables statement and saved using the output statement. Note that the weight statement is used here to avoid having a row for each Bernoulli trial.
data binm;
do n = 10 to 10000 by 5;
  x=0;
  output;
  end;
run;

ods select none;
proc freq data=binm;
by n;
weight n;
tables x / binomial;
output out=bp binomial;
run;
ods select all;
To calculate the upper limit*N, another data step is needed-- note that in this setting SAS will only produce the lower limit against the probability that all observations share the same value, thus the subtraction from 1 shown below. The log scale x-axis is obtained with the logbase option to the axis statement. (Section 5.3.12.) The result is shown below.
data uci;
set bp;
limit = (1-xl_bin) * n;
run;

axis1 order = (0 to 4 by 1);
axis2 logbase=10 logstyle=expand;
symbol1 i = j v = none c = red w=5 l=1;
proc gplot data=uci;
plot limit * n / vref=3.69 vaxis=axis1 haxis=axis2;
label n="Sample size" limit="Exact upper CI";
run;
quit;
It's clear that the upper 95% limit on the number of successes asymptotes to about 3.69. Thus the upper limit on the binomial probability p is 3.69/N.


Monday, September 17, 2012

Example 10.2: Custom graphic layouts

In example 10.1 we introduced data from a CPAP machine. In brief, it's hard to tell exactly what's being recorded in the data set, but it seems to be related to the pattern of breathing. Measurements are taken five times a second, leading to on the order of 100,000 data points in a typical night. To get a visual sense of what a night's breathing looks like is therefore non-trivial.

Today, we'll make the graphic shown above, which presents an hour of data.

SAS
In SAS, the sgpanel procedure (section 5.1.11) will produce a similar graphic pretty easily. But we need to make a data set with indicators of the hour, and of ten-minute blocks within the hour. This we'll do with the ceil function (section 1.8.4).

data cycles2;
set cycles;
hour = ceil(time_min/60);
tenmin = ceil(time_min/10);
time_in_ten = mod(time_min - 1/300,10);
/* 1/300 adjustment keeps last measure in the correct
10-min block */
run;

title "Hour 4 of pressure";
proc sgpanel data = cycles2;
where hour eq 4;
panelby tenmin / layout=rowlattice rows=6 spacing = 4;
colaxis display=none;
rowaxis display = (nolabel);
series x = time_in_ten y = byte;
run; quit;

The resulting plot is shown below. It would be nicer to omit the labels on the right of each plot, but this does not appear to be an option. It would likely only be possible with a fair amount of effort.

R
In R, we'll use the layout() function to make a 7-row layout-- one for the title and 6 for the 10-minute blocks of time. Before we get there, though, we'll construct a function to fill the time block plots with input data. The function accepts a data vector and plots only 3,000 values from it, choosing the values based on an input hour and 10-minute block within the hour. To ensure an equal y-axis range for each call, we'll also send minimum and maximum values as input to the function. All of this will be fed into plot() with the type="l" option to make a line plot.

plot10 = function(hour, tenmins, miny, maxy, data=cycles){
  # plot one ten-minute block (3,000 values) of the data vector,
  # selected by hour and by ten-minute block within the hour
  start = hour*18000 + tenmins*3000
  plot((1:3000)/300, data[(start + 1):(start + 3000)],
    ylim = c(miny, maxy), type="l", xaxs="i", yaxs="i")
}

The documentation for layout() is rather opaque, so we'll review it separately.

oldpar = par(no.readonly = TRUE)
# revert to this later

layout(matrix(1:7), widths=1, heights=c(3,8,8,8,8,8,8), respect=FALSE)

The layout() function divides the plot area into a matrix of cells, some of which will be filled by the next output plots. The first argument says where in the matrix the next N objects will go. All the integers 1...N must appear in the matrix; cells that will be left empty have a 0 instead. Here, we have no empty cells, and only one column, so the "matrix" is really just a vector with 1...7 in order. The widths option specifies the relative widths of the columns-- here we have only one column, so any constant will result in the use of the whole width of the output area. Similarly, the heights option gives the relative height of the cells. Here the title will get 3/51 of the height, while each 10-minute block will get 8/51. This unequal shape of the plot regions is one reason to prefer layout() to some other ways to plot multiple images on a page. The respect option, when TRUE, makes the otherwise relative widths and heights conform, so that a unit of height is equal to a unit of width. We also use layout() in example 8.41.

With the layout in hand, we're ready to fill it.

par(xaxt="n", mar = c(.3,2,.3,0) +.05)
# drop the x-axis, change the spacing around the plot
plot(x=1,y=1,type="n",ylim=c(-1,1), xlim=c(-1,1), yaxt="n",bty="n")
# the first (narrow) plot is just empty
hour=3
text(0,0,paste("Hour ", (hour + 1), " of pressure data"), cex=2)
# text to put in the first plot
miny = min(cycles[(hour * 18000 + 1):((hour + 1) * 18000)])
maxy = max(cycles[(hour * 18000 + 1):((hour + 1) * 18000)])
# find min and max across the whole hour, to keep range
# of y-axis constant across the plots
for (x in 0:5) plot10(hour, x, miny, maxy)
# plot the 6 ten-minute blocks
par(oldpar)
# reset the graphics options

The resulting plot is shown at the top of the entry. There's clearly something odd going on around 11-15 minutes into the hour-- this could be a misadjusted mask, or a real problem with the breathing. There's also a period around 58 minutes when it looks like breathing stops. That's what the machine is meant to stop.



Monday, July 16, 2012

Example 9.38: dynamite plots, revisited


"Dynamite plot" is a somewhat pejorative term for a graphical display where the height of a bar indicates the mean, and the vertical line on top of it represents the standard deviation (or standard error). These displays are commonly found in many scientific disciplines, as a way of communicating group differences in means.

Many find these displays troubling. One post called them an unmitigated evil.
The Vanderbilt University Department of Biostatistics has a formal policy discouraging use of these plots, stating that:

Dynamite plots often hide important information. This is particularly true of small or skewed data sets. Researchers are highly discouraged from using them, and department members have the option to decline participation in papers in which the lead author requires the use of these plots.

Despite the limitations of the display, we believe that there may be times when the display is helpful as a way to compare groups, assuming distributions that are approximately normal. Samuel Brown also described creation of these figures, as a way to encourage computing in R. We previously demonstrated how to create them in SAS and R, and today discuss code created by Randall Pruim to demonstrate how such graphics can be created using lattice graphics within the mosaic package.

R

library(mosaic)
library(grid)  # for grid.polyline(), arrow(), and unit()

dynamitePlot <- function(height, error, names = as.character(1:length(height)),
                         significance = NA, ylim = c(0, maxLim), ...) {
  if (missing(error)) { error = 0 }
  maxLim <- 1.2 * max(mapply(sum, height, error))
  mError <- min(error, na.rm=TRUE)
  barchart(height ~ names, ylim=ylim, panel=function(x, y, ...) {
    panel.barchart(x, y, ...)
    # vertical error lines above and below each bar, with angled arrowheads
    grid.polyline(c(x,x), c(y, y+error), id=rep(x,2), default.units='native',
                  arrow=arrow(angle=45, length=unit(mError, 'native')))
    grid.polyline(c(x,x), c(y, y-error), id=rep(x,2), default.units='native',
                  arrow=arrow(angle=45, length=unit(mError, 'native')))
    # significance labels placed just above the error bars
    grid.text(x=x, y=y + error + .05*maxLim, label=significance,
              default.units='native')
  }, ...)
}

Much of the code involves setting up the appropriate axis limits, then drawing the lines and adding the text. We can call this new function with an artificial example with 4 groups:

Values <- c(1,2,5,4)
Errors <- c(0.25, 0.5, 0.33, 0.12)
Names <- paste("Trial", 1:4)
Sig <- c("a", "a", "b", "b")
dynamitePlot(Values, Errors, names=Names, significance=Sig)

We still don't recommend frequent use of these plots, as other displays may be better (e.g., dotplots for very small sample sizes, or violin plots), but having the capability to generate dynamite plots within the lattice framework can be handy.
