Monday, December 28, 2009

Example 7.19: find the closest pair of observations

Suppose we need to find the closest pair of observations on some variable x. For example, we might be concerned that some data had been accidentally duplicated. We return the ID's of the two closest observations, and their distance from each other. In both languages, we'll first create the data, then sort it, recognizing that the smallest difference must come between two adjacent observations, once they're sorted.

SAS

First, we'll generate the data (see section 1.11.1) and and sort it (section 1.5.6).


data ds;
do id = 1 to 10;
x = normal(0);
output;
end;
run;

proc sort data=ds; by x; run;


Then we'll find the closest pairing and find their IDs. There's relatively complicated stuff going on here, so we annotate within the code rather than try to explain in text.


data smallest;
set ds end=eof;
retain smalldist last lastid smallid largeid;
/* variables we keep between processing observations
in the ds dataset */
if _n_ gt 1 then do; /* no diff for the smallest obs */
smalldist = min(smalldist, x - last);
/* is it smaller now than last time? */
if smalldist = x - last then do; /* if so */
smallid = lastid;
/* the current smallest id is the one
from the last row */
largeid = id;
/* and the current largest id is */
end; /* the one in this row */
end;
last = x; /* if this row turns out to be the */
lastid = id; /* smaller end of the smallest pair */
/* we'll need to know x and the id */
if eof then output; /* if we're at the last row, stop */
run;


The value of smalldist is missing for the first execution of the min function, which returns the minimum of the non-missing observations as discussed in section 1.4.14. Note that the end option to the set statement makes a temporary variable which = 1 for the last row of the data set. An output statement is the implicit last line of all data steps. When it is explicit, lines are output only when the condition is true.

The result is a one-line data set with the two IDs as smallid and largeid and their distance as smalldist.

Here is an example of the generated input data and the resulting smallest dataset.


proc print data = ds; run;

Obs id x

1 8 -2.14041
2 2 -1.17532
3 10 -0.80683
4 4 -0.77437
5 9 -0.06806
6 5 0.14425
7 6 0.40988
8 7 0.66603
9 1 0.68565
10 3 1.86831

proc print data = smallest;
var smalldist smallid largeid;
run;

Obs smalldist smallid largeid

1 0.019622 7 1



R

First, generate data (1.11.1). Next, sort the observations (1.5.6), then find the adjacent differences.


id <- 1:10
x <- rnorm(10)
sortx <- x[order(x)]
oneless <- sortx[2:length(x)]
diff <- oneless - sortx[1:length(x)-1]


Again, the next step is complex, but the use of indexing makes intra-code comments awkward, so we attempt to explain it here. The ID of the smaller of the two observations with the smallest distance is the value in the ID vector that's in the place where x equals the value of the sorted x vector that's in the same place as the smallest difference. The larger ID can be found the same way, using the shifted vector.


smallid <- id[x == sortx[which.min(diff)]]
largeid <- id[x == oneless[which.min(diff)]]
smalldist <- min(diff)



> options(digits=2)
> x
[1] 1.412 -0.518 1.298 0.351 2.123 -1.388
[7] 0.431 1.268 0.658 -0.014
> sortx
[1] -1.388 -0.518 -0.014 0.351 0.431 0.658
[7] 1.268 1.298 1.412 2.123
> oneless
[1] -0.518 -0.014 0.351 0.431 0.658 1.268
[8] 1.298 1.412 2.123
> diff
[1] 0.87 0.50 0.36 0.08 0.23 0.61 0.03 0.11 0.71
> smallid
[1] 8
> largeid
[1] 3
> smalldist
[1] 0.03

2 comments:

nacnudus said...

Note that in R, "diff" is a function. So the lines:

sortx <- x[order(x)]
oneless <- sortx[2:length(x)]
diff <- oneless - sortx[1:length(x)-1]

not only redefine a base function as a vector(!) but can be replaced with the simpler code:
diffx <- diff(x[order(x)])

Subsequent uses of "diff" in the post need to be replaced with "diffx"

Nick Horton said...

While I agree that it's poor style to use the same name for an object as a function, it doesn't overwrite it in the way that you suggest. Here's an example immediately after running the code in the entry:

> diff(c(3,2))
[1] -1
> diff
[1] 1.050152551 0.467807397 0.027438018 0.308591548 0.194177152 0.298517741 0.098446471 0.007216097 0.616300595

That being said, I'll be more careful in the future in picking names for objects.

Best wishes for the remainder of the International Year of Statistics.