Monday, July 2, 2012

Citing R or SAS

One of us recently read a colleague's first draft of a paper, in which she had written: "All analyses were done in R 2.14.0." We assume we're preaching to the converted here when we say that the enormous amount of work that goes into R needs to be recognized as often as possible, and that R's creators deserve to reap some credit for their labors. In contrast to SAS, after all, most work on R is not compensated with a paycheck. As a reminder, the citation() function produces the correct citation for R in general and should be used when citing R.

The project in question had used a negative binomial regression function from the MASS package, but our colleague had omitted any reference to it. In this case, a citation would provide both credit to the authors and a useful guide to anyone wanting to replicate our approach. It would also allow readers to consider whether changes in the package might affect the results observed. A call to citation(package="MASS") will provide the preferred citation here. (Any package name can be inserted, of course, though some authors may not have provided a full citation.)
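For the curious, here's a minimal sketch of what these calls look like in practice; the output will vary with your R and package versions:

```r
# Citation for R itself -- paste into your reference manager
citation()

# Preferred citation for a contributed package (MASS here)
citation(package = "MASS")

# If you write in LaTeX, you can get a BibTeX entry directly
toBibtex(citation())
```

The toBibtex() conversion is handy when assembling a bibliography, since it spares you from retyping the reference by hand.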

Similarly, while SAS authors are rarely identified by name and presumably get a salary from SAS, it's preferable to identify the version of the software and where it can be obtained. In medical research this is usually done by an in-text reference. For example: "Analyses were performed in SAS 9.3 (SAS Institute, Cary NC)."

For complex analyses, it is also best to mention the SAS procedure used. As with the R package, this can help readers plan similar analyses, and may inform interpretation.

So a multi-software analysis section might end with the following statement: "Analyses were performed in R 2.14.2 [1] using the MASS package [2] glm.nb() function for negative binomial regression and in SAS 9.3 (SAS Institute, Cary NC) using the MCMC procedure for negative binomial mixture models." References [1] and [2] would be generated using the citation() function.
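To make the R side of that statement concrete, here is a small sketch of a negative binomial regression with glm.nb(). The data are simulated purely for illustration; the variable names are our own invention:

```r
library(MASS)

# Simulated (hypothetical) data: a count outcome with overdispersion
set.seed(42)
d <- data.frame(x = rnorm(200))
d$y <- rnbinom(200, mu = exp(1 + 0.5 * d$x), size = 2)

# Fit the negative binomial regression cited in the methods section
fit <- glm.nb(y ~ x, data = d)
summary(fit)
```

The fitted object carries the estimated dispersion parameter (theta) along with the usual regression coefficients, which is one reason to tell readers which function was used.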

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least two other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit from this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

3 comments:

Rick Wicklin said...

Another reason to cite the software is to implicitly document the algorithm used to compute the statistics. For example, if you cite the function/procedure and the software release, someone can look up the documentation and conclude "Ah-ha! Those parameter estimates were computed by using a quasi-likelihood and a reference parameterization."

There is a tendency to cite relatively new technology but not established software. That's how software moves into the mainstream. I remember reading papers in the 80s and 90s that explicitly said "Typeset using LaTeX" and referenced Knuth and/or Lamport. Similarly, I've read older papers that reference Tcl/Tk, Perl, and even emacs :-)

Interestingly, I think Mathematica will always get a lot of references because people like to be able to skip the steps in a complicated derivation by saying, "we used Mathematica[CITATION] to apply the XYZ transformation, change variables, and simplify. The result is...."

Ken Kleinman said...

That's what I was trying to get at when I mentioned "informing the interpretation"; thanks for elaborating!

I think the reasons to cite in general are to give credit to others for their work and to provide a sufficient roadmap to those who follow. It's a funny thing deciding what methods need citing, whether they be software or analytic methods, or physical tools, though. You rarely see a citation to Gosset, Kruskal and Wallis, or Fisher, for example. The implicit assumption is that "everyone knows" what those things mean, and that credit need not accrue any longer. I think this is somewhat wrongheaded, at least from the perspective of credit.

On the other hand, citations for methods walk a fine line between enough detail to communicate what was done and appropriate brevity. For example, I use mixed models a lot in my applied work. Whenever possible, I use numerical approximations to the likelihood, and if that's not possible, I use penalized quasi-likelihood. The difference matters, often, and there are nice citations to the methods available. But I don't recall ever deciding that this level of detail merited inclusion in an applied manuscript. Similarly, unless a paper is directly related to anthropometry, you won't see the manufacturer of a scale noted. I think it's just that the information is seen as of too little interest to too few people.

Rick Wicklin said...

Just read this interesting article on how R is trying to make it easier to cite authors of R packages: http://journal.r-project.org/archive/2012-1/RJournal_2012-1_Hornik~et~al.pdf