Thursday, July 27, 2017

Options for teaching R to beginners: a false dichotomy?

I've been reading David Robinson's excellent blog entry "Teach the tidyverse to beginners" (http://varianceexplained.org/r/teach-tidyverse), which argues that a tidyverse approach is the best way to teach beginners.  He summarizes two competing curricula:

1) "Base R first": teach syntax such as $ and [[]], built-in functions like ave() and tapply(), and use base graphics

2) "Tidyverse first": start from scratch with pipes (%>%) and leverage dplyr and use ggplot2 for graphics

If I had to choose one of these approaches, I'd also go with 2) ("Tidyverse first"), since it moves us closer to helping our students "think with data" using more powerful tools (see here for my sermon on this topic).

A third way

Of course, there’s a third option that addresses David’s imperative to "get students doing powerful things quickly".  The mosaic package was written to make R easier to use in introductory statistics courses.  The package is part of Project MOSAIC (http://mosaic-web.org), an NSF-funded initiative to integrate statistics, modeling, and computing. A paper outlining the mosaic package's "Less Volume, More Creativity" approach was recently published in the R Journal (https://journal.r-project.org/archive/2017/RJ-2017-024). To his credit, David mentions the mosaic package in a response to one of the comments on his blog.

Less Volume, More Creativity

One of the big ideas in the mosaic package is that students build on the existing formula interface in R as a mechanism to calculate summary statistics, generate graphical displays, and fit regression models. Randy Pruim has dubbed this approach "Less Volume, More Creativity".

While teaching this formula interface involves adding a new learning outcome (what is "Y ~ X"?), the mosaic approach simplifies calculation of summary statistics by groups and the generation of two or three dimensional displays on day one of an introductory statistics course (see for example Wang et al., "Data Viz on Day One: bringing big ideas into intro stats early and often" (2017), TISE).

The formula interface also prepares students for more complicated models in R (e.g., logistic regression, classification).
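
As a rough sketch of how the same Y ~ X pattern carries over to a more complicated model, the following uses only base R and the built-in mtcars data. Both the dataset and the variable choice (am, wt) are my own stand-ins for illustration, not part of the diamonds example.

```r
# A formula is an ordinary R object naming a response and explanatory variable(s).
f <- am ~ wt          # transmission type (0/1) as a function of car weight
class(f)              # "formula"
all.vars(f)           # "am" "wt"

# The same formula drives a linear model and a logistic regression;
# only the fitting function (and family) changes.
linear_fit   <- lm(f, data = mtcars)
logistic_fit <- glm(f, data = mtcars, family = binomial)

coef(logistic_fit)    # intercept and slope on the log-odds scale
```

Once students can read `Y ~ X`, moving from lm() to glm() is mostly a matter of learning one new argument.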

Here's a simple example using the diamonds data from the ggplot2 package.  We model the relationships between two colors (D and J), number of carats, and price.

I'll begin with a bit of data wrangling to generate an analytic dataset with just those two colors. (Early in a course I would either hide the next code chunk or make the recoded dataframe accessible to the students to avoid cognitive overload.)  Note that an R Markdown file with the following commands is available for download at https://nhorton.people.amherst.edu/mosaic-blog.Rmd.

library(mosaic)
recoded <- diamonds %>%
  filter(color=="D" | color=="J") %>%
  mutate(col = as.character(color))

We first calculate the mean price (in US$) for each of the two colors.

mean(price ~ col, data = recoded)
   D    J 
3170 5324 

This call is an example of how the formula interface facilitates calculation of a variable's mean for each of the levels of another variable. We see that D color diamonds tend to cost less than J color diamonds.

A handy function in mosaic is favstats(), which provides a standard set of summary statistics (including sample size and number of missing values) by group.

favstats(price ~ col, data = recoded)
col  min    Q1  median    Q3    max  mean    sd     n  missing
D    357   911    1838  4214  18693  3170  3357  6775        0
J    335  1860    4234  7695  18710  5324  4438  2808        0

A similar command can be used to generate side-by-side boxplots. Here we illustrate the use of lattice graphics. (An alternative formula-based graphics system, ggformula, will be the focus of a future post.)

bwplot(col ~ price, data = recoded)


The distributions are skewed to the right (not surprisingly since they are prices). If we wanted to formally compare these sample means we could do so with a two-sample t-test (or in a similar fashion, by fitting a linear model).

t.test(price ~ col, data = recoded)
Welch Two Sample t-test

data:  price by col
t = -20, df = 4000, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2336 -1971
sample estimates:
mean in group D mean in group J 
           3170            5324 


msummary(lm(price ~ col, data = recoded))
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3170.0       45.0    70.4   <2e-16 ***
colJ          2153.9       83.2    25.9   <2e-16 ***

Residual standard error: 3710 on 9581 degrees of freedom
Multiple R-squared:  0.0654, Adjusted R-squared:  0.0653 

F-statistic:  670 on 1 and 9581 DF,  p-value: <2e-16

The results from the two approaches are consistent: the group differences are highly statistically significant.  We could conclude that J diamonds tend to cost more than D diamonds, back in the population of all diamonds.
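
One detail worth a sketch: t.test() defaults to the Welch (unequal variance) test, which is why its t statistic (about -20) differs in size from the linear model's (25.9). Assuming the same two-color subset as above, built here with base R's subset() rather than dplyr, the pooled-variance t-test should reproduce the linear model's t statistic up to sign.

```r
library(ggplot2)   # only for the diamonds data

# Rebuild the two-color analytic dataset with base R tools.
diamonds_df <- as.data.frame(diamonds)
recoded <- subset(diamonds_df, color %in% c("D", "J"))
recoded$col <- as.character(recoded$color)

welch  <- t.test(price ~ col, data = recoded)                    # default: unequal variances
pooled <- t.test(price ~ col, data = recoded, var.equal = TRUE)  # pooled variance
fit    <- lm(price ~ col, data = recoded)

# t statistic for the colJ coefficient from the linear model
lm_t <- summary(fit)$coefficients["colJ", "t value"]

# The pooled t-test and the linear model agree up to sign
# (t.test reports D minus J, lm reports J minus D).
all.equal(abs(unname(pooled$statistic)), abs(lm_t))
```

Either way the conclusion is the same here; the sketch is only meant to show where the two numbers come from.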

Let's do a quick review of the mosaic modeling syntax to date:

mean(price ~ col)
bwplot(price ~ col)
t.test(price ~ col)
lm(price ~ col)

See the pattern?

On a statistical note, it's important to remember that the diamonds were not randomized into colors: this is a found (observational) dataset, so there may be other factors at play.  The revised GAISE College report reiterates the importance of multivariate thinking in intro stats.

Moving to three dimensions

Let's continue with the "Less Volume, More Creativity" approach to bring in a third variable: the number of carats in each diamond.

xyplot(price ~ carat, groups=col, auto.key=TRUE, type=c("p", "r"), data = recoded)

We see that, controlling for the number of carats, the D color diamonds tend to sell for more than the J color diamonds.  We can confirm this by fitting a regression model that controls for both variables (and then display the resulting predicted values from this parallel slopes model using plotModel()).

This is a great example of Simpson's paradox: accounting for the number of carats has yielded opposite results from a model that didn't include carats. If we were to move forward with such an analysis we'd need to be sure to undertake an assessment of our model and verify conditions and assumptions (but for the purpose of the blog entry I'll defer that).
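
The parallel slopes model described above can be sketched as follows. This is a minimal version under my own assumptions: the object names (unadjusted, adjusted) are chosen for illustration, and the data step uses base R's subset() in place of the dplyr chunk shown earlier.

```r
library(ggplot2)   # only for the diamonds data

# Rebuild the two-color analytic dataset with base R tools.
diamonds_df <- as.data.frame(diamonds)
recoded <- subset(diamonds_df, color %in% c("D", "J"))
recoded$col <- as.character(recoded$color)

unadjusted <- lm(price ~ col, data = recoded)
adjusted   <- lm(price ~ carat + col, data = recoded)   # parallel slopes

coef(unadjusted)["colJ"]   # positive: J diamonds cost more on average
coef(adjusted)["colJ"]     # negative: at a given carat, J diamonds cost less
```

The sign flip on the colJ coefficient is the reversal described above. With mosaic loaded, passing the adjusted model to plotModel() should draw the two parallel fitted lines.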

Moving beyond mosaic

The revised GAISE College report enunciated the importance of technology when teaching statistics. Many courses still use calculators or web-based applets to incorporate technology into their classes.

R is an excellent environment for teaching statistics, but many instructors feel uncomfortable using it (particularly if they feel compelled to teach the $ and [[]] syntax, which many find off-putting).  The mosaic approach helps make the use of R feasible for many audiences by keeping things simple.

It's unfortunately true that many introductory statistics courses don't move beyond bivariate relationships (so students may feel paralyzed about what to do about other factors). The mosaic approach has the advantage that it can bring multivariate thinking, modeling, and exploratory data tools together with a single interface (and modest degree of difficulty in terms of syntax). I've been teaching multiple regression as a descriptive method early in an intro stat course for the past ten years (and it helps to get students excited about material that they haven't seen before).

The mosaic approach also scales well: it's straightforward to teach students dplyr/tidyverse data wrangling by adding in the pipe operator and some key data idioms. (So perhaps the third option should be labeled "mosaic and tidyverse".)

See the following for an example of how favstats() can be replaced by dplyr idioms. 

recoded %>%
  group_by(col) %>%
  summarize(meanval = mean(price, na.rm = TRUE))
col  meanval
D       3170
J       5324
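
To get closer to what favstats() reports, the same idiom can be extended with a few more summaries. This is only a sketch: the column names (n, missing, median_price, mean_price, sd_price) are my own, and it rebuilds the recoded dataset so that it runs on its own.

```r
library(dplyr)

# Same two-color analytic dataset as above.
recoded <- ggplot2::diamonds %>%
  filter(color == "D" | color == "J") %>%
  mutate(col = as.character(color))

by_col <- recoded %>%
  group_by(col) %>%
  summarize(
    n            = sum(!is.na(price)),            # number of non-missing values
    missing      = sum(is.na(price)),             # number of missing values
    median_price = median(price, na.rm = TRUE),
    mean_price   = mean(price, na.rm = TRUE),
    sd_price     = sd(price, na.rm = TRUE)
  )
by_col
```

This takes noticeably more typing than a single favstats() call, but every piece is a reusable dplyr idiom.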
That being said, I suspect that many students (and instructors) will still use favstats() for simple tasks (e.g., to check sample sizes, check for missing data, etc).  I know that I do.  But the important thing is that unlike training wheels, mosaic doesn't hold them back when they want to learn new things.

I'm a big fan of ggplot2, but even Hadley agrees that the existing syntax is not what he wants it to be.  While it's not hard to learn to use + to glue together multiple graphics commands and to get your head around aesthetics, teaching ggplot2 adds several additional learning outcomes to a course that's already overly pregnant with them.


Side note

I would argue that a lot of what is in mosaic should have been in base R (e.g., formula interface to mean(), data= option for mean()).  Other parts are more focused on teaching (e.g., plotModel(), xpnorm(), and resampling with the do() function).

Closing thoughts

In summary, I argue that the mosaic approach is consistent with the tidyverse. It dovetails nicely with David's "Teach tidyverse" as an intermediate step that may be more accessible for undergraduate audiences without a strong computing background.  I'd encourage people to check it out (and let Randy, Danny, and me know if there are ways to improve the package).

Want to learn more about mosaic?  In addition to the R Journal paper referenced above, you can see how we get students using R quickly in the package's "Less Volume, More Creativity" and "Minimal R" vignettes.  We also provide curated examples from commonly used textbooks in the “mosaic resources” vignette and a series of freely downloadable and remixable monographs including The Student’s Guide to R and Start Teaching with R.

9 comments:

THK said...

I have encountered this dilemma (trilemma?) as well in teaching R. While mosaic looks like a great package, I have to complain about the mean(y ~ x, data = DF) syntax. That seems to violate the R convention on interpreting formulas. Mosaic should have limited it to mean(~ y | x, data = DF). Otherwise I could not give a consistent explanation for how formulas work in R.

Nick Horton said...

I'm not sure what you mean by "violate the R convention on interpreting formulas". Isn't mean(Y ~ X) equivalent in meaning to t.test(Y ~ X) and aov(Y ~ X)?

Note that mosaic does support the syntax you describe for summary statistics for the aggregating functions:

mean(~ Y | X)
favstats(~ Y | X)

along with

mean(~ Y, group=X)
favstats(~ Y, group=X)

> mean(~ age | substance, data=HELPrct)
alcohol cocaine heroin
38.2 34.5 33.4
> mean(~ age, group=substance, data=HELPrct)
alcohol cocaine heroin
38.2 34.5 33.4
> mean(age ~ substance, data=HELPrct)
alcohol cocaine heroin
38.2 34.5 33.4

Achim Zeileis said...

Thanks for this post. I agree that strengthening the teaching of formula-based functions is a good idea and easy to learn for beginners. One useful R feature that is often overlooked in this context is that plot(y ~ x, data = df) chooses a suitable plot for various combinations of y and x. Of course, if both variables are numeric, this creates a scatter plot. For numeric "response" and categorical "explanatory variable" we get parallel boxplots. And if the response is categorical we get a spineplot or spinogram, respectively. For many data sets that are relevant to our students (business & economics) this goes quite a long way. And from that point onwards I can teach what kind of principles - and corresponding R functions/packages - can be used to construct more complex displays etc.

Nick Horton said...

An additional feature of the mosaic package is the multi-purpose mplot() function available within RStudio.

If you provide a linear model object as argument, it allows you to generate the typical regression diagnostics (including a regression coefficient plot).

If you provide a dataframe as argument, you get an interactive data visualizer that lets you explore univariate, bivariate, and multivariate graphical displays (see http://escholarship.org/uc/item/84v3774z for an example of how we incorporate this on day one of an introductory statistics course).

In RStudio, try running:

library(mosaic)
mplot(HELPrct)

The "Show Expression" feature is particularly useful: it's an easy way to see the syntax to generate the selected plot using lattice, ggplot2, or ggformula.

R. Pruim said...

While you can certainly begin with base graphics and the plot() function, that only seems like a reasonable solution to me if you want to continue with base graphics throughout the course (which I don't). The different graphics systems don't play well together, so I find it best to pick one and stick with it. If you want to use lattice or ggformula, I'd suggest starting there and avoiding base graphics altogether.

Note too, that plot() is a bit quirky in its choices. plot(~price, data = diamonds) is not a very good choice, for example. A student should expect something much better. On balance, I don't find plot() or qplot() from ggplot2 compelling. I generally want to get beyond what these provide, and I don't find teaching lattice or ggformula to be challenging without such a function.

Bob Muenchen said...

Thanks for writing this thought-provoking article. I teach a lot of workshops for organizations that are migrating from SAS to R, and one of the things that confuses people is R's inconsistent treatment of missing values. Simple stat functions require setting na.rm = TRUE while formula-based ones don't. Using the mosaic functions, you can set options(na.rm = TRUE) and from then on, its simple functions will find that setting and work more like formula-based ones. Since mosaic is nice enough to add formulas to simple stat functions, it would be nice for that to be the default.

Another inconsistency that I would like to see mosaic fix is that the data argument works only with formulas. So this finds the variables: t.test(y ~ group, data = mydata) while this does not: t.test(pre, post, paired = TRUE, data = mydata).

R. Pruim said...

The treatment of missing data is not really a matter of whether formulas are used but of the particular function used. One difference between R and SAS is that R is written by a community and SAS by a company. So it is easier for SAS to enforce stronger consistency across functions. Naming conventions (camelCase, dots, underscores, etc.) are also inconsistent across R. (The reason functions like lm() discard missing data by default is that they use model.frame() which has this as its default.)

The reason mosaic doesn't change the default behavior with regard to missing values is that we decided early on that our versions of the functions should behave just like the originals in cases where the original functions produced sensible output. (It would be bad if mean() gave a different answer depending on whether mosaic was or was not loaded.) The user can, as you note, set the default behavior actively using options(). That seemed like a good compromise.

For functions like favstats(), which we could write without worrying about compatibility with core R functions, we were free to do other things. In this case, we compute statistics after dropping missing values and also display the number of values that were missing.

Regarding t.test(pre, post, paired = TRUE, data = mydata), I have received numerous requests for things like this. (Usually the request is along the lines of mean(x, data = mydata).) And for a time (and against my better judgment) we supported this. But it was a bad idea for several reasons. For starters, the code is ambiguous if x exists both in the environment and in mydata. It also makes constructing the functions and keeping them compatible with their core R counterparts much more challenging and led to some subtle bugs. In R, a formula is the correct way to designate a name to be evaluated in a special way. Finally, it was somewhat confusing that some functions accepted the "bare variables + data" syntax and others required formulas. It is more systematic if they all require the formula.

The bigger inconsistency -- that t.test(y ~ x) worked but t.test(~ x) did not -- we fixed in mosaic.


In general, the problem with nonstandard evaluation is that it is nonstandard, so it can be hard to predict behavior. There are times where NSE is very useful, but it works best when operating within a well-defined system where the NSE can be correctly anticipated by the user. The use of formulas in lattice, ggformula, mosaic, and in functions like lm() and its cousins provides an established context for evaluation of formulas with a data context.

Unknown said...

Do not teach beginners how to use R statistics. Teach them proper research methodology. They will learn by themselves how to use any statistical software including R. That is power in learning. They will never forget.

Nick Horton said...

Our goal, consistent with the revised GAISE College report (https://arxiv.org/abs/1705.09530) is to integrate the teaching of key statistical concepts with the use of appropriate technology.

It's certainly possible to teach research methodology (e.g., addressing confounding using multiple regression) without software. I think that having students use real tools with a straightforward interface can augment such instruction.