SAS and R: Example 7.19: find the closest pair of observations

Monday, December 28, 2009

Example 7.19: find the closest pair of observations

Suppose we need to find the closest pair of observations on some variable x. For example, we might be concerned that some data had been accidentally duplicated. We return the ID's of the two closest observations, and their distance from each other. In both languages, we'll first create the data, then sort it, recognizing that the smallest difference must come between two adjacent observations, once they're sorted.

SAS

First, we'll generate the data (see section 1.11.1) and and sort it (section 1.5.6).


data ds;
do id = 1 to 10;
  x = normal(0);
  output;
  end;
run;

proc sort data=ds; by x; run;

Then we'll find the closest pairing and find their IDs. There's relatively complicated stuff going on here, so we annotate within the code rather than try to explain in text.


data smallest;
set ds end=eof;
retain smalldist last lastid smallid largeid;  
   /* variables we keep between processing observations 
      in the ds dataset */
if _n_ gt 1 then do;  /* no diff for the smallest obs */
   smalldist = min(smalldist, x - last);  
                 /* is it smaller now than last time? */
   if smalldist = x - last then do;       /* if so */
      smallid = lastid;                    
                 /* the current smallest id is the one 
                                    from the last row */
      largeid = id;                        
                 /* and the current largest id is */
   end;                                 /* the one in this row */
end;
last = x;           /* if this row turns out to be the  */
lastid = id;       /* smaller end of the smallest pair  */
                     /* we'll need to know x and the id */
if eof then output;   /* if we're at the last row, stop */
run;

The value of smalldist is missing for the first execution of the min function, which returns the minimum of the non-missing observations as discussed in section 1.4.14. Note that the end option to the set statement makes a temporary variable which = 1 for the last row of the data set. An output statement is the implicit last line of all data steps. When it is explicit, lines are output only when the condition is true.

The result is a one-line data set with the two IDs as smallid and largeid and their distance as smalldist.

Here is an example of the generated input data and the resulting smallest dataset.


proc print data = ds; run;

Obs    id        x

  1     8    -2.14041
  2     2    -1.17532
  3    10    -0.80683
  4     4    -0.77437
  5     9    -0.06806
  6     5     0.14425
  7     6     0.40988
  8     7     0.66603
  9     1     0.68565
 10     3     1.86831

proc print data = smallest;
var smalldist smallid largeid; 
run;

Obs    smalldist    smallid    largeid

 1      0.019622       7          1

R

First, generate data (1.11.1). Next, sort the observations (1.5.6), then find the adjacent differences.


id <- 1:10
x <- rnorm(10)
sortx <- x[order(x)]
oneless <- sortx[2:length(x)]
diff <- oneless - sortx[1:length(x)-1]

Again, the next step is complex, but the use of indexing makes intra-code comments awkward, so we attempt to explain it here. The ID of the smaller of the two observations with the smallest distance is the value in the ID vector that's in the place where x equals the value of the sorted x vector that's in the same place as the smallest difference. The larger ID can be found the same way, using the shifted vector.


smallid <- id[x == sortx[which.min(diff)]]
largeid <- id[x == oneless[which.min(diff)]]
smalldist <-  min(diff)


> options(digits=2)
> x
 [1]  1.412 -0.518  1.298  0.351  2.123 -1.388  
 [7]  0.431  1.268  0.658 -0.014
> sortx
 [1] -1.388 -0.518 -0.014  0.351  0.431  0.658  
 [7]  1.268  1.298  1.412  2.123
> oneless
 [1] -0.518 -0.014  0.351  0.431  0.658  1.268  
 [8]  1.298  1.412  2.123
> diff
 [1] 0.87 0.50 0.36 0.08 0.23 0.61 0.03 0.11 0.71
> smallid
 [1] 8
> largeid
 [1] 3
> smalldist
 [1] 0.03

2 comments:

nacnudus said...: Note that in R, "diff" is a function. So the lines:

sortx <- x[order(x)]
oneless <- sortx[2:length(x)]
diff <- oneless - sortx[1:length(x)-1]

not only redefine a base function as a vector(!) but can be replaced with the simpler code:
diffx <- diff(x[order(x)])

Subsequent uses of "diff" in the post need to be replaced with "diffx"; June 26, 2013 at 6:40 PM
Nick Horton said...: While I agree that it's poor style to use the same name for an object as a function, it doesn't overwrite it in the way that you suggest. Here's an example immediately after running the code in the entry:

> diff(c(3,2))
[1] -1
> diff
[1] 1.050152551 0.467807397 0.027438018 0.308591548 0.194177152 0.298517741 0.098446471 0.007216097 0.616300595

That being said, I'll be more careful in the future in picking names for objects.

Best wishes for the remainder of the International Year of Statistics.; July 4, 2013 at 8:51 AM

Post a Comment

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

SAS and R

Catalogs of posts

Monday, December 28, 2009

Example 7.19: find the closest pair of observations

2 comments:

About SAS and R

Topics discussed