*x*. For example, we might be concerned that some data had been accidentally duplicated. We return the ID's of the two closest observations, and their distance from each other. In both languages, we'll first create the data, then sort it, recognizing that the smallest difference must come between two adjacent observations, once they're sorted.

**SAS**

First, we'll generate the data (see section 1.11.1) and and sort it (section 1.5.6).

data ds;

do id = 1 to 10;

x = normal(0);

output;

end;

run;

proc sort data=ds; by x; run;

Then we'll find the closest pairing and find their IDs. There's relatively complicated stuff going on here, so we annotate within the code rather than try to explain in text.

data smallest;

set ds end=eof;

retain smalldist last lastid smallid largeid;

/* variables we keep between processing observations

in the ds dataset */

if _n_ gt 1 then do; /* no diff for the smallest obs */

smalldist = min(smalldist, x - last);

/* is it smaller now than last time? */

if smalldist = x - last then do; /* if so */

smallid = lastid;

/* the current smallest id is the one

from the last row */

largeid = id;

/* and the current largest id is */

end; /* the one in this row */

end;

last = x; /* if this row turns out to be the */

lastid = id; /* smaller end of the smallest pair */

/* we'll need to know x and the id */

if eof then output; /* if we're at the last row, stop */

run;

The value of smalldist is missing for the first execution of the

`min`function, which returns the minimum of the non-missing observations as discussed in section 1.4.14. Note that the

`end`option to the set statement makes a temporary variable which = 1 for the last row of the data set. An

`output`statement is the implicit last line of all

`data`steps. When it is explicit, lines are output only when the condition is true.

The result is a one-line data set with the two IDs as

`smallid`and

`largeid`and their distance as

`smalldist`.

Here is an example of the generated input data and the resulting

`smallest`dataset.

proc print data = ds; run;

Obs id x

1 8 -2.14041

2 2 -1.17532

3 10 -0.80683

4 4 -0.77437

5 9 -0.06806

6 5 0.14425

7 6 0.40988

8 7 0.66603

9 1 0.68565

10 3 1.86831

proc print data = smallest;

var smalldist smallid largeid;

run;

Obs smalldist smallid largeid

1 0.019622 7 1

**R**

First, generate data (1.11.1). Next, sort the observations (1.5.6), then find the adjacent differences.

id <- 1:10

x <- rnorm(10)

sortx <- x[order(x)]

oneless <- sortx[2:length(x)]

diff <- oneless - sortx[1:length(x)-1]

Again, the next step is complex, but the use of indexing makes intra-code comments awkward, so we attempt to explain it here. The ID of the smaller of the two observations with the smallest distance is the value in the ID vector that's in the place where

*x*equals the value of the

**sorted**

*x*vector that's in the same place as the smallest difference. The larger ID can be found the same way, using the shifted vector.

smallid <- id[x == sortx[which.min(diff)]]

largeid <- id[x == oneless[which.min(diff)]]

smalldist <- min(diff)

> options(digits=2)

> x

[1] 1.412 -0.518 1.298 0.351 2.123 -1.388

[7] 0.431 1.268 0.658 -0.014

> sortx

[1] -1.388 -0.518 -0.014 0.351 0.431 0.658

[7] 1.268 1.298 1.412 2.123

> oneless

[1] -0.518 -0.014 0.351 0.431 0.658 1.268

[8] 1.298 1.412 2.123

> diff

[1] 0.87 0.50 0.36 0.08 0.23 0.61 0.03 0.11 0.71

> smallid

[1] 8

> largeid

[1] 3

> smalldist

[1] 0.03

## 2 comments:

Note that in R, "diff" is a function. So the lines:

sortx <- x[order(x)]

oneless <- sortx[2:length(x)]

diff <- oneless - sortx[1:length(x)-1]

not only redefine a base function as a vector(!) but can be replaced with the simpler code:

diffx <- diff(x[order(x)])

Subsequent uses of "diff" in the post need to be replaced with "diffx"

While I agree that it's poor style to use the same name for an object as a function, it doesn't overwrite it in the way that you suggest. Here's an example immediately after running the code in the entry:

> diff(c(3,2))

[1] -1

> diff

[1] 1.050152551 0.467807397 0.027438018 0.308591548 0.194177152 0.298517741 0.098446471 0.007216097 0.616300595

That being said, I'll be more careful in the future in picking names for objects.

Best wishes for the remainder of the International Year of Statistics.

Post a Comment