Showing posts with label with(). Show all posts
Showing posts with label with(). Show all posts

Monday, June 20, 2011

Example 8.41: Scatterplot with marginal histograms


The scatterplot is one of the most ubiquitous, and useful graphics. It's also very basic. One of its shortcomings is that it can hide important aspects of the marginal distributions of the two variables. To address this weakness, you can add a histogram of each margin to the plot. We demonstrate using the SF-36 MCS and PCS subscales in the HELP data set.

SAS
SAS provides code to perform this using proc template and proc sgrender. These procedures are not intended for casual or typical SAS users. Its syntax is, to our eyes, awkward. This is roughly analogous to R functions that simply call C routines. Nonetheless, it's possible to adapt code that works. The code linked above was edited to set the transparency to 0 and to change the plotted symbol size to 5 from 11px. These options appear in the scatterplot statement about midway through the code.

Once the edited code is submitted, the following lines produce the plot shown above.

proc sgrender data="C:\book\help.sas7bdat" template=scatterhist;
dynamic YVAR="mcs" XVAR="pcs"
TITLE="MCS-PCS Relationship";
run;


R
For R, we adapted some code found in an old R-help post to generate the following function. The mtext() function puts text in the margins and is used here to label the axes. The at option in that function centers the label within the scatterplot data using some algebra.

scatterhist = function(x, y, xlab="", ylab=""){
zones=matrix(c(2,0,1,3), ncol=2, byrow=TRUE)
layout(zones, widths=c(4/5,1/5), heights=c(1/5,4/5))
xhist = hist(x, plot=FALSE)
yhist = hist(y, plot=FALSE)
top = max(c(xhist$counts, yhist$counts))
par(mar=c(3,3,1,1))
plot(x,y)
par(mar=c(0,3,1,1))
barplot(xhist$counts, axes=FALSE, ylim=c(0, top), space=0)
par(mar=c(3,0,1,1))
barplot(yhist$counts, axes=FALSE, xlim=c(0, top), space=0, horiz=TRUE)
par(oma=c(3,3,0,0))
mtext(xlab, side=1, line=1, outer=TRUE, adj=0,
at=.8 * (mean(x) - min(x))/(max(x)-min(x)))
mtext(ylab, side=2, line=1, outer=TRUE, adj=0,
at=(.8 * (mean(y) - min(y))/(max(y) - min(y))))
}

The results of the following code are shown below.

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
with(ds, scatterhist(mcs, pcs, xlab="MCS", ylab="PCS"))

Tuesday, May 3, 2011

To attach() or not attach(): that is the question

R objects that reside in other R objects can require a lot of typing to access. For example, to refer to a variable x in a dataframe df, one could type df$x. This is no problem when the dataframe and variable names are short, but can become burdensome when longer names or repeated references are required, or objects in complicated structures must be accessed.

The attach() function in R can be used to make objects within dataframes accessible in R with fewer keystrokes. As an example:

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
names(ds)
attach(ds)
mean(cesd)
[1] 32.84768

The search() function can be used to list attached objects and packages. Let's see what is there, then detach() the dataset to clean up after ourselves.

search()
> search()
[1] ".GlobalEnv" "ds" "tools:RGUI" "package:stats"
[5] "package:graphics" "package:grDevices" "package:utils" "package:datasets"
[9] "package:methods" "Autoloads" "package:base"
detach(ds)

As noted in section B.4.5, users are cautioned that if there is already a variable
called cesd in the local workspace, issuing attach(ds), may not mean that cesd references ds$cesd. Name conflicts of this type are a common problem with attach() and care should be taken to avoid them.

The help page for attach() notes that attach can lead to confusion. The Google R Style Manual provides clear advice on this point, providing the following advice about attach():
The possibilities for creating errors when using attach are numerous. Avoid it.


After being burned by this one too many times, we concur.

So what options exist for those who decide to go cold turkey?

  1. Reference variables directly (e.g. lm(ds$x ~ ds$y))

  2. Specify the dataframe for commands which support this (e.g. lm(y ~ x, data=ds))

  3. Use the with() function, which returns the value of whatever expression is evaluated (e.g. with(ds,lm(y ~x)))

  4. (Also note the within() function, which is similar to with(), but returns a modified object.)


Some examples may be helpful.

> # fit a linear model
> lm1 = lm(cesd ~ pcs, data=ds)

> mean(ds$cesd[ds$female==1]) # these next three are equivalent
[1] 36.88785
> with(ds, mean(cesd[female==1]))
[1] 36.88785
> with(subset(ds, female==1), mean(cesd))
[1] 36.88785

In short, there's never an actual need to use attach(), using it can lead to confusion or errors, and alternatives exists that avoid the problems. We recommend against it.

In SAS, all procedures use the most recent data set or must reference a data set explicitly. Very roughly speaking, using attach() in R is like relying on the implicit use of the most recent data set. Our recommendation against attach() thus mirrors our use of the data= option throughout our books.