Showing posts with label reproducible analysis. Show all posts

Tuesday, May 3, 2011

To attach() or not attach(): that is the question

R objects that reside in other R objects can require a lot of typing to access. For example, to refer to a variable x in a dataframe df, one could type df$x. This is no problem when the dataframe and variable names are short, but can become burdensome when longer names or repeated references are required, or objects in complicated structures must be accessed.

The attach() function in R can be used to make objects within dataframes accessible in R with fewer keystrokes. As an example:

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
names(ds)
attach(ds)
mean(cesd)
[1] 32.84768

The search() function can be used to list attached objects and packages. Let's see what is there, then detach() the dataset to clean up after ourselves.

> search()
[1] ".GlobalEnv" "ds" "tools:RGUI" "package:stats"
[5] "package:graphics" "package:grDevices" "package:utils" "package:datasets"
[9] "package:methods" "Autoloads" "package:base"
> detach(ds)

As noted in section B.4.5, users are cautioned that if there is already a variable called cesd in the local workspace, then after issuing attach(ds), a reference to cesd may not mean ds$cesd. Name conflicts of this type are a common problem with attach(), and care should be taken to avoid them.
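Here is a minimal sketch of that conflict, using a small made-up dataframe (toy, with invented values) rather than the HELP data. It also shows a second, related trap: attach() works on a copy, so later changes to the dataframe are not visible through the attached names.

```r
# Toy dataframe standing in for the real data (values are made up).
toy <- data.frame(cesd = c(10, 20, 30))

# A workspace variable that happens to share a column name.
cesd <- c(1, 2, 3)

# attach() issues a message that cesd is masked by the workspace copy.
attach(toy)

# The workspace comes first on the search path, so this is the mean of
# the workspace vector, not of toy$cesd.
masked_mean <- mean(cesd)        # 2
explicit_mean <- mean(toy$cesd)  # 20

# Remove the workspace copy, then change the dataframe.
rm(cesd)
toy$cesd <- toy$cesd * 2

# cesd now resolves to the attached copy, which still holds the old values.
stale_mean <- mean(cesd)         # 20
current_mean <- mean(toy$cesd)   # 40

detach(toy)
```

In both cases the bare name cesd silently refers to something other than the current toy$cesd, which is exactly the kind of error that is hard to spot later.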

The help page for attach() notes that attach can lead to confusion. The Google R Style Manual is blunter, offering the following advice about attach():
The possibilities for creating errors when using attach are numerous. Avoid it.


After being burned by this one too many times, we concur.

So what options exist for those who decide to go cold turkey?

  1. Reference variables directly (e.g. lm(ds$y ~ ds$x))

  2. Specify the dataframe for commands which support this (e.g. lm(y ~ x, data=ds))

  3. Use the with() function, which returns the value of whatever expression is evaluated (e.g. with(ds, lm(y ~ x)))

  4. (Also note the within() function, which is similar to with(), but returns a modified object.)


Some examples may be helpful.

> # fit a linear model
> lm1 = lm(cesd ~ pcs, data=ds)

> mean(ds$cesd[ds$female==1]) # these next three are equivalent
[1] 36.88785
> with(ds, mean(cesd[female==1]))
[1] 36.88785
> with(subset(ds, female==1), mean(cesd))
[1] 36.88785
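The within() function mentioned above can be sketched in the same spirit. This example uses a small made-up dataframe (toy) and a hypothetical derived column (cesd_centered), neither of which comes from the HELP data; it is meant only to contrast the two functions.

```r
# Toy dataframe with invented values.
toy <- data.frame(cesd = c(30, 40, 50), female = c(1, 0, 1))

# with() evaluates the expression and returns its value.
m <- with(toy, mean(cesd[female == 1]))

# within() evaluates the expression and returns a modified copy of the
# dataframe; the original toy is left unchanged.
toy2 <- within(toy, {
  cesd_centered <- cesd - mean(cesd)
})

names(toy)   # "cesd" "female"
names(toy2)  # "cesd" "female" "cesd_centered"
```

Because within() returns a new object, the result has to be assigned (here to toy2) for the change to be kept.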

In short, there's never an actual need to use attach(), using it can lead to confusion or errors, and alternatives exist that avoid the problems. We recommend against it.

In SAS, all procedures use the most recent data set or must reference a data set explicitly. Very roughly speaking, using attach() in R is like relying on the implicit use of the most recent data set. Our recommendation against attach() thus mirrors our use of the data= option throughout our books.

Monday, February 28, 2011

Plug for RStudio: powerful, free, and easy to use interactive development environment for R




As a longtime SAS user, one obstacle for me in using R professionally has been figuring out a process for saving and testing code across several work sessions and for integrating code composition and execution. There are a couple of integrated R environments available, including ESS, Tinn-R, and others. However, each of these seemed to require a serious investment of time, and I never did get around to using them (nor did Nick, despite several good-faith attempts). Instead I used a clunky system of editing code in a text editor, then copying and pasting or sourcing it. This really inhibited my ability first to learn R and then to code in it efficiently.

Then Nick introduced me to the folks who have created RStudio. They are a small group of wicked smart programmers who know how to help other programmers be more efficient. They've now turned their attention to helping statisticians and other R users. RStudio, publicly available as of 2/28/2011, is an open source product that is freely available. Its abilities are extremely broad, and I'm bound to miss something important in the brief description below, but suffice it to say that it's well worth your time to check it out. Neither Nick nor I have any vested interest in recommending it (though he's moved all of his teaching of introductory and intermediate statistics courses to it, along with his collaborative research projects).

RStudio is an integrated development environment for R that includes 1) text editing windows from which code can be submitted to the console and/or saved to the OS, 2) live lists of the objects in your workspace, 3) easily searchable infinite history with ability to insert from the history to the console or a text editing window, 4) tab completion in the console for objects, commands, and help, 5) interface with the OS for access to files, 6) help window with back and forward buttons, 7) package downloading, and 8) support for Sweave to facilitate reproducible analysis. Despite all these capabilities, RStudio is very easy to get started with.

There is also a server version, which you can access over the web if someone installs it and gives you access. If you're not familiar with this idea, it means you can work from most browsers--I was even able to use it on a Kindle. The server version saves your workspace from session to session, so you can work in exactly the same way, in exactly the same workspace (with a continuous history and all your objects), on whatever OS/CPU you have in front of you: Windows, Mac OS, Chrome, Linux. You can switch OS, you can shut your computer down, and RStudio comes up just as you left it. Forgot your laptop? No problem.

The standalone version is an ordinary downloadable program. It uses the existing R binaries on your Mac (OSX 10.5+), Windows (XP/Vista/7), Ubuntu or Fedora Linux machine. The local and server applications have the same interface.

For me, the most useful aspect has been the integrated editor, but each one of the items I listed above has saved me a great deal of time over the past few months. The integrated help alone might be reason enough to adopt it. As a consulting statistician, I find RStudio a huge leap forward. It changes R from an important tool that I have to be able to use into a plausible system in which to do all of my work. I really can't overstate its value to me. Go to http://www.rstudio.org/ to learn more, see screenshots, and download!