Monday, April 30, 2012

Example 9.29: the perils of for loops

A recent exchange on the R-sig-teaching list featured a discussion of how best to teach new students R. The initial post included an exercise: write a function that, given n, draws n rows of a triangle made up of "*". The post noted that, for a beginner, this may require two for loops. For example, in pseudo-code:

for i = 1 to n
  for j = 1 to i
    print "*"
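
A direct R translation of this pseudo-code might look like the following sketch. The function name draw_triangle is our own invention, and the cat("\n") call that ends each row is an addition that the pseudo-code leaves implicit.

```r
# Hypothetical helper (the name is ours): draw n rows of a "*" triangle
# using two nested for loops, mirroring the pseudo-code.
draw_triangle <- function(n) {
  for (i in seq_len(n)) {
    for (j in seq_len(i)) {
      cat("*")        # print one star, staying on the current line
    }
    cat("\n")         # end the current row
  }
  invisible(NULL)     # the function is called only for its printed output
}

draw_triangle(3)
```

Using seq_len(n) rather than 1:n means that a call with n = 0 prints nothing instead of looping over c(1, 0).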

Unfortunately, as several folks (including Richard M. Heiberger and R. Michael Weylandt) noted, for loops in general are not the best way to take full advantage of R. In this entry, we review two solutions they proposed which fit within the R philosophy.

Richard's solution uses the outer() function to generate a 5x5 matrix of logical values indicating whether the row number is greater than or equal to the column number. Next the ifelse() function is used to replace TRUE with "*" and FALSE with a blank.

> ifelse(outer(1:5, 1:5, `>=`), "*", " ")
     [,1] [,2] [,3] [,4] [,5]
[1,] "*"  " "  " "  " "  " "
[2,] "*"  "*"  " "  " "  " "
[3,] "*"  "*"  "*"  " "  " "
[4,] "*"  "*"  "*"  "*"  " "
[5,] "*"  "*"  "*"  "*"  "*"
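
To generalize beyond 5 rows, the same idea can be wrapped in a function. This is only a sketch: triangle_matrix is a name we made up, and seq_len(n) stands in for the hard-coded 1:5.

```r
# Hypothetical wrapper (the name is ours) around the outer()/ifelse() idea.
triangle_matrix <- function(n) {
  idx <- seq_len(n)
  # TRUE where the row index is at least the column index (lower triangle)
  lower <- outer(idx, idx, ">=")
  ifelse(lower, "*", " ")
}

m <- triangle_matrix(4)
# Paste each row into a single string and print one row per line
cat(apply(m, 1, paste, collapse = " "), sep = "\n")
```

Because the function returns a character matrix rather than printing directly, the result can be stored, subset, or formatted in a later step.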

Michael's solution uses the lapply() function to call a function repeatedly for different values of n. This returns a list rather than a matrix, but accomplishes the same task.

> lapply(1:5, function(x) cat(rep("*", x), "\n"))
*
* *
* * *
* * * *
* * * * *
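
The stars printed by this call are a side effect of cat(); the list that lapply() returns holds only NULL values. The sketch below, which uses variable names of our own, makes the distinction visible and shows one possible way to build the rows as strings when they are needed later.

```r
# Capture the return value instead of letting it auto-print.
res <- lapply(1:3, function(x) cat(rep("*", x), "\n"))
str(res)   # a list of 3 NULLs; the stars printed above came from cat()

# If the rows themselves are wanted, build them as strings first
# and print them in a separate step.
rows <- vapply(1:3, function(x) paste(rep("*", x), collapse = " "),
               character(1))
cat(rows, sep = "\n")
```

Wrapping the original call in invisible() would suppress the printed list of NULLs while keeping the stars.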

While this exercise is of little practical value, it does illustrate some important points, and the solutions above offer a more efficient as well as more elegant way of accomplishing the task. For those interested in more, another resource is The R Inferno by Patrick Burns.

We demonstrate a SAS data step solution mainly to call out some useful features and cautions. In all likelihood a proc iml matrix-based solution would be more elegant.

data test;
array star [5] $ star1 - star5;
do i = 1 to 5;
  star[i] = "*";
  output;   /* write one row per loop iteration */
end;
run;

proc print noobs; var star1 - star5; run;

star1    star2    star3    star4    star5

  *
  *        *
  *        *        *
  *        *        *        *
  *        *        *        *        *

In particular, note the $ in the array statement, which allows the variables to contain characters; by default, variables created by an array statement are numeric. In addition, note the reference to a sequentially suffixed list of variables using the single hyphen shortcut; this would help in generalizing to n rows. Finally, note that we were able to avoid a second do loop (SAS' primary iterative looping syntax) mainly by luck: within a single pass through the data step, the most recently assigned value of a variable is kept by default. This can cause trouble in general, but here it keeps all the previous "*"s when moving on to the next row.

An unrelated note about aggregators

We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers and PROC-X with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.


Michael Weylandt said...

Just to clarify: the solution I gave does indeed return a list, but that's not so much the point of the construct. This would have done the same thing without making a list (admittedly, less clear):

apply(matrix(1:5), 1, function(x) cat(rep("*",x),"\n"))

The somewhat useless list returned by my lapply() is actually rep(list(NULL), n) (i.e., a list of NULL n times over) -- the observed star-printing is a side effect of cat(), while the return value is actually an invisible NULL which only becomes visible when aggregated by lapply.

If the stars need to be passed off to something else, Richard's solution actually returns a character matrix with the desired stars.

Thanks for the post!


Luis said...

While I agree that many times loops are not the best thing in R, I think for teaching purposes they are a lot easier to understand. For example, a fluent use of outer() relies on the student understanding some matrix algebra (outer products), which may not be the most common situation.

Loops are demonized in matrix languages, but for small examples the performance penalty is minimal and the gains in understanding are big. I would tend to teach the loop first and then explain the workings of outer and show a big example where performance differences matter.

Ken Kleinman said...

I agree with you, Luis. Sometimes people get dogmatic about the "right" way to do something, and lose sight of the fact that learning different ways to use the tools makes for a more flexible craftsperson.

If it were possible to teach R without showing the for loop syntax, the result would be students who are less flexible and some convoluted though possibly elegant code.

Michael Weylandt said...

Just to clarify: my solution does return a list, but that's independent of the stars printed. cat() returns NULL invisibly and has the side effect of printing its arguments. lapply() aggregates these NULLs into a list. Richard's solution gives something that can be manipulated later.

tim said...

@Luis: completely agree. Ways that work are never wrong.

In this case, the for loop has several nice attributes: it is explicit about what it is doing. By comparison "outer" is really nice, but also obscure.

The *apply family of functions of course agrees, as they are built on for loops. I find that the only time I am forced to use these, however, is when I wish someone had taken the time to make a nicely polymorphic version of the function that I am wrapping apply around.

More generally I find the apply family implementation difficult, mostly because these functions try to get so much done in a small space (datatype, margins, extra parameters, side effects).

for i = 1 to n
  for j = 1 to i
    print "I must use a for loop today so that people know what I am doing :-)"