SAS and R: December 2009

Monday, December 28, 2009

Example 7.19: find the closest pair of observations

Suppose we need to find the closest pair of observations on some variable x. For example, we might be concerned that some data had been accidentally duplicated. We return the ID's of the two closest observations, and their distance from each other. In both languages, we'll first create the data, then sort it, recognizing that the smallest difference must come between two adjacent observations, once they're sorted.

SAS

First, we'll generate the data (see section 1.11.1) and and sort it (section 1.5.6).


data ds;
do id = 1 to 10;
  x = normal(0);
  output;
  end;
run;

proc sort data=ds; by x; run;

Then we'll find the closest pairing and find their IDs. There's relatively complicated stuff going on here, so we annotate within the code rather than try to explain in text.


data smallest;
set ds end=eof;
retain smalldist last lastid smallid largeid;  
   /* variables we keep between processing observations 
      in the ds dataset */
if _n_ gt 1 then do;  /* no diff for the smallest obs */
   smalldist = min(smalldist, x - last);  
                 /* is it smaller now than last time? */
   if smalldist = x - last then do;       /* if so */
      smallid = lastid;                    
                 /* the current smallest id is the one 
                                    from the last row */
      largeid = id;                        
                 /* and the current largest id is */
   end;                                 /* the one in this row */
end;
last = x;           /* if this row turns out to be the  */
lastid = id;       /* smaller end of the smallest pair  */
                     /* we'll need to know x and the id */
if eof then output;   /* if we're at the last row, stop */
run;

The value of smalldist is missing for the first execution of the min function, which returns the minimum of the non-missing observations as discussed in section 1.4.14. Note that the end option to the set statement makes a temporary variable which = 1 for the last row of the data set. An output statement is the implicit last line of all data steps. When it is explicit, lines are output only when the condition is true.

The result is a one-line data set with the two IDs as smallid and largeid and their distance as smalldist.

Here is an example of the generated input data and the resulting smallest dataset.


proc print data = ds; run;

Obs    id        x

  1     8    -2.14041
  2     2    -1.17532
  3    10    -0.80683
  4     4    -0.77437
  5     9    -0.06806
  6     5     0.14425
  7     6     0.40988
  8     7     0.66603
  9     1     0.68565
 10     3     1.86831

proc print data = smallest;
var smalldist smallid largeid; 
run;

Obs    smalldist    smallid    largeid

 1      0.019622       7          1

R

First, generate data (1.11.1). Next, sort the observations (1.5.6), then find the adjacent differences.


id <- 1:10
x <- rnorm(10)
sortx <- x[order(x)]
oneless <- sortx[2:length(x)]
diff <- oneless - sortx[1:length(x)-1]

Again, the next step is complex, but the use of indexing makes intra-code comments awkward, so we attempt to explain it here. The ID of the smaller of the two observations with the smallest distance is the value in the ID vector that's in the place where x equals the value of the sorted x vector that's in the same place as the smallest difference. The larger ID can be found the same way, using the shifted vector.


smallid <- id[x == sortx[which.min(diff)]]
largeid <- id[x == oneless[which.min(diff)]]
smalldist <-  min(diff)


> options(digits=2)
> x
 [1]  1.412 -0.518  1.298  0.351  2.123 -1.388  
 [7]  0.431  1.268  0.658 -0.014
> sortx
 [1] -1.388 -0.518 -0.014  0.351  0.431  0.658  
 [7]  1.268  1.298  1.412  2.123
> oneless
 [1] -0.518 -0.014  0.351  0.431  0.658  1.268  
 [8]  1.298  1.412  2.123
> diff
 [1] 0.87 0.50 0.36 0.08 0.23 0.61 0.03 0.11 0.71
> smallid
 [1] 8
> largeid
 [1] 3
> smalldist
 [1] 0.03

Friday, December 18, 2009

SAS and R included on R bloggers

The R bloggers site is an aggregator for blogs about R. We're excited to be joining that community and suggest any readers of this blog may also find it useful.

Monday, December 14, 2009

Example 7.18: Displaying missing value categories in a table

When displaying contingency tables (section 2.3.1), there are times when it is useful to either show or hide the missing data category. Both SAS and the typical R command default to displaying the table only for observations where both factors are observed.

In this example, we generate some multinomial data (section 1.10.4) and then produce tables with and without missing data categories.

SAS

Generate the multinomial data, uniform data, and use the latter to censor the former:


data blog;
do i = 1 to 300;
   x = rand("TABLE",.3,.4);
   y = rand("TABLE",.3,.4);
   if uniform(0) gt .8 then x = .;
   if uniform(0) gt .8 then y = .;
   output;
   end;
run;

Print the default table with only complete data. Note the options used to reduce output, as in section 4.6.9.


proc freq data=blog;
   tables y*x / norow nocol;
run;

This produces:


Table of y by x

y         x

Frequency|
Percent  |       1|       2|       3|  Total
---------+--------+--------+--------+
       1 |     16 |     13 |     18 |     47
         |   8.16 |   6.63 |   9.18 |  23.98
---------+--------+--------+--------+
       2 |     18 |     32 |     22 |     72
         |   9.18 |  16.33 |  11.22 |  36.73
---------+--------+--------+--------+
       3 |     28 |     31 |     18 |     77
         |  14.29 |  15.82 |   9.18 |  39.29
---------+--------+--------+--------+
Total          62       76       58      196
            31.63    38.78    29.59   100.00

Frequency Missing = 104

The missing categories are included through the missprint option.


proc freq data = blog;
   tables y*x / norow nocol missprint;
run;

This produces:


Table of y by x

y         x

Frequency|
Percent  |       .|       1|       2|       3|  Total
---------+--------+--------+--------+--------+
       . |     12 |     12 |     20 |     14 |      .
         |      . |      . |      . |      . |      .
---------+--------+--------+--------+--------+
       1 |     10 |     16 |     13 |     18 |     47
         |      . |   8.16 |   6.63 |   9.18 |  23.98
---------+--------+--------+--------+--------+
       2 |     17 |     18 |     32 |     22 |     72
         |      . |   9.18 |  16.33 |  11.22 |  36.73
---------+--------+--------+--------+--------+
       3 |     19 |     28 |     31 |     18 |     77
         |      . |  14.29 |  15.82 |   9.18 |  39.29
---------+--------+--------+--------+--------+
Total           .       62       76       58      196
                .    31.63    38.78    29.59   100.00
Frequency Missing = 104

Note that if there are no missing values, SAS will not print the rows and columns headed with a '.' which is analogous to the "ifany" option in R shown below.

R

First, generate the data:


library(Hmisc)
x <- rMultinom(matrix(c(.3,.3,.4),1,3),300)
y <- rMultinom(matrix(c(.3,.3,.4),1,3),300)

Then, generate some random Uniforms to censor some of the observed data:


censprobx <- runif(300)
censproby <- runif(300)

Censor the data:


x[censprobx > .8] <- NA
y[censproby > .8] <- NA

Produce the default table (omits any missing data):


table(y,x)

   x
y    1  2  3
  1 18 18 29
  2 17 21 22
  3 20 30 40

Make the table which includes the missing category:


table(y, x, useNA="ifany")

      x
y       1  2  3   NA
  1    18 18 29    9
  2    17 21 22   17
  3    20 30 40   17
  NA   14  5 14    9

The useNA option also allows the values "no" and "always". The value "no" corresponds to the default behavior in R or SAS, while the "always" option is not available in SAS. SAS, however, shows the total number missing in any case.

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

SAS and R

Catalogs of posts