Wednesday, December 22, 2010

A plea for consistent style!

As we get close to the end of the year, it's time to look back over the past year and think of resolutions for 2011 and beyond. One that's often on my mind relates to ways to structure my code to make it clearer to others (as well as to myself when I look back upon it months later).

Style guides are common in many programming languages, and are often purported to increase the readability and legibility of code, as well as minimize errors. The Wikipedia page on this topic describes the importance of indentation, spacing, alignment, and other formatting conventions.

Many stylistic conventions are appropriate for statistical code written in SAS and R, and can help to make code clearer and easier to comprehend. Consider the difference between:

ds=read.csv("http://www.math.smith.edu/r/data/help.csv");attach(ds)
fOo=ks.test(age[female==1],age[female==0],data=ds)
plotdens=function(x,y,mytitle, mylab){densx = density(x)
densy = density(y);plot(densx,main=mytitle,lwd=3,xlab=mylab,
bty="l");lines(densy,lty=2,col=2,lwd=3);xvals=c(densx$x,
rev(densy$x));yvals=c(densx$y,rev(densy$y));polygon(xvals,
yvals,col="gray")};mytitle=paste("Test of ages: D=",round(fOo$statistic,3),
" p=",round(fOo$p.value,2),sep="");plotdens(age[female==1],
age[female==0],mytitle=mytitle,mylab="age (in years)")
legend(50,.05,legend=c("Women","Men"),col=1:2,lty=1:2,lwd=2)

and

# code example from the Using R for Data Management, Statistical
# Analysis and Graphics book
# Nicholas Horton, Smith College December 21, 2010
#
ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
attach(ds)

# fit KS test and save object containing p-value
ksres = ks.test(age[female==1], age[female==0], data=ds)

# define function to plot two densities on the same graph
plotdens = function(x, y, mytitle, mylab) {
  densx = density(x)
  densy = density(y)
  plot(densx, main=mytitle, lwd=3, xlab=mylab, bty="l")
  lines(densy, lty=2, col=2, lwd=3)
  xvals = c(densx$x, rev(densy$x))
  yvals = c(densx$y, rev(densy$y))
  polygon(xvals, yvals, col="gray")
}

# craft specialized title containing statistic and p-value
mytitle = paste("Test of ages: D=",
  round(ksres$statistic, 3),
  " p=", round(ksres$p.value, 2),
  sep="")

plotdens(age[female==1], age[female==0],
  mytitle=mytitle, mylab="age (in years)")

legend(50, .05, legend=c("Women", "Men"), col=1:2, lty=1:2,
  lwd=2)

While the first example has the advantage of using considerably fewer lines, its readability suffers dramatically. Appropriate indentation, white space, and comments help the analyst when debugging and make the code easier to reuse in the future. In settings where code review is undertaken, sharing a set of common standards is eminently sensible.

SAS

A specific but somewhat cursory style manual for SAS can be found at the SAS community Style guide for writing and polishing programs. I like the start of this guide, though it is incomplete at present. Other useful words of wisdom can be found here and here.

R

Google's R Style Guide is chock full of tips and guidelines to make R code easier to read, share and verify. Another source of ideas is Henrik Bengtsson's draft R coding conventions. While one can quibble about some of the specific suggestions, overall, the effect of adherence to such a style guide is code that is easier to understand and less likely to hide errors.
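As a rough illustration of the kind of advice these guides give, here is a minimal sketch of two of Google's recommendations: use <- rather than = for assignment, and avoid attach(). The data frame below is simulated purely so that the snippet runs without a network connection. Its column names mirror the age and female variables used above, but the values are invented and the object names (women, men) are just placeholders.

```r
# Simulated stand-in for the HELP data: all values are made up
set.seed(42)
ds <- data.frame(age = rnorm(200, mean = 36, sd = 8),
                 female = rbinom(200, size = 1, prob = 0.25))

# Refer to columns through the data frame instead of calling attach(ds)
women <- ds$age[ds$female == 1]
men <- ds$age[ds$female == 0]

# Two-sample Kolmogorov-Smirnov test comparing the two age distributions
ksres <- ks.test(women, men)
```

Spelling out ds$age is a bit more typing than relying on attach(), but it leaves no doubt about which data set a variable comes from.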

Some coders are fundamentalists in insisting on "the correct" style. In general, however, it is more important to develop a sensible, interpretable, and coherent style of your own than to adhere to styles that you find awkward, whatever their provenance. The links above provide some common sense tips that can help improve productivity and make you a better analyst.

3 comments:

Ken said...

One problem is that several of your comments duplicate the code, when it is fairly obvious what the code does. Just spacing would have been sufficient. Comments are best used for describing code which isn't obvious, or the purpose behind a block of code.

Nick Horton said...

Good point. I agree that several of the comments are gratuitous. In my defense, I spend much of my time teaching introductory statistics courses using R, where the redundancy may be helpful to novices.

Peter Flom said...

One thing that I disagree with fairly strongly is the idea that an opening curly brace should go at the end of a line. I start a new line. This not only looks neater (at least to me) but it lets me see where curly braces match up. With indentation, I can use the up and down arrows to see which { matches which }.

I don't see the advantage of opening braces being at the end of a line.