Monday, December 28, 2009

Example 7.19: find the closest pair of observations

Suppose we need to find the closest pair of observations on some variable x. For example, we might be concerned that some data had been accidentally duplicated. We return the ID's of the two closest observations, and their distance from each other. In both languages, we'll first create the data, then sort it, recognizing that the smallest difference must come between two adjacent observations, once they're sorted.

SAS

First, we'll generate the data (see section 1.11.1) and and sort it (section 1.5.6).


data ds;
do id = 1 to 10;
x = normal(0);
output;
end;
run;

proc sort data=ds; by x; run;


Then we'll find the closest pairing and find their IDs. There's relatively complicated stuff going on here, so we annotate within the code rather than try to explain in text.


data smallest;
set ds end=eof;
retain smalldist last lastid smallid largeid;
/* variables we keep between processing observations
in the ds dataset */
if _n_ gt 1 then do; /* no diff for the smallest obs */
smalldist = min(smalldist, x - last);
/* is it smaller now than last time? */
if smalldist = x - last then do; /* if so */
smallid = lastid;
/* the current smallest id is the one
from the last row */
largeid = id;
/* and the current largest id is */
end; /* the one in this row */
end;
last = x; /* if this row turns out to be the */
lastid = id; /* smaller end of the smallest pair */
/* we'll need to know x and the id */
if eof then output; /* if we're at the last row, stop */
run;


The value of smalldist is missing for the first execution of the min function, which returns the minimum of the non-missing observations as discussed in section 1.4.14. Note that the end option to the set statement makes a temporary variable which = 1 for the last row of the data set. An output statement is the implicit last line of all data steps. When it is explicit, lines are output only when the condition is true.

The result is a one-line data set with the two IDs as smallid and largeid and their distance as smalldist.

Here is an example of the generated input data and the resulting smallest dataset.


proc print data = ds; run;

Obs id x

1 8 -2.14041
2 2 -1.17532
3 10 -0.80683
4 4 -0.77437
5 9 -0.06806
6 5 0.14425
7 6 0.40988
8 7 0.66603
9 1 0.68565
10 3 1.86831

proc print data = smallest;
var smalldist smallid largeid;
run;

Obs smalldist smallid largeid

1 0.019622 7 1



R

First, generate data (1.11.1). Next, sort the observations (1.5.6), then find the adjacent differences.


id <- 1:10
x <- rnorm(10)
sortx <- x[order(x)]
oneless <- sortx[2:length(x)]
diff <- oneless - sortx[1:length(x)-1]


Again, the next step is complex, but the use of indexing makes intra-code comments awkward, so we attempt to explain it here. The ID of the smaller of the two observations with the smallest distance is the value in the ID vector that's in the place where x equals the value of the sorted x vector that's in the same place as the smallest difference. The larger ID can be found the same way, using the shifted vector.


smallid <- id[x == sortx[which.min(diff)]]
largeid <- id[x == oneless[which.min(diff)]]
smalldist <- min(diff)



> options(digits=2)
> x
[1] 1.412 -0.518 1.298 0.351 2.123 -1.388
[7] 0.431 1.268 0.658 -0.014
> sortx
[1] -1.388 -0.518 -0.014 0.351 0.431 0.658
[7] 1.268 1.298 1.412 2.123
> oneless
[1] -0.518 -0.014 0.351 0.431 0.658 1.268
[8] 1.298 1.412 2.123
> diff
[1] 0.87 0.50 0.36 0.08 0.23 0.61 0.03 0.11 0.71
> smallid
[1] 8
> largeid
[1] 3
> smalldist
[1] 0.03

Friday, December 18, 2009

SAS and R included on R bloggers

The R bloggers site is an aggregator for blogs about R. We're excited to be joining that community and suggest any readers of this blog may also find it useful.

Monday, December 14, 2009

Example 7.18: Displaying missing value categories in a table

When displaying contingency tables (section 2.3.1), there are times when it is useful to either show or hide the missing data category. Both SAS and the typical R command default to displaying the table only for observations where both factors are observed.

In this example, we generate some multinomial data (section 1.10.4) and then produce tables with and without missing data categories.

SAS

Generate the multinomial data, uniform data, and use the latter to censor the former:

data blog;
do i = 1 to 300;
x = rand("TABLE",.3,.4);
y = rand("TABLE",.3,.4);
if uniform(0) gt .8 then x = .;
if uniform(0) gt .8 then y = .;
output;
end;
run;


Print the default table with only complete data. Note the options used to reduce output, as in section 4.6.9.

proc freq data=blog;
tables y*x / norow nocol;
run;


This produces:

Table of y by x

y x

Frequency|
Percent | 1| 2| 3| Total
---------+--------+--------+--------+
1 | 16 | 13 | 18 | 47
| 8.16 | 6.63 | 9.18 | 23.98
---------+--------+--------+--------+
2 | 18 | 32 | 22 | 72
| 9.18 | 16.33 | 11.22 | 36.73
---------+--------+--------+--------+
3 | 28 | 31 | 18 | 77
| 14.29 | 15.82 | 9.18 | 39.29
---------+--------+--------+--------+
Total 62 76 58 196
31.63 38.78 29.59 100.00

Frequency Missing = 104


The missing categories are included through the missprint option.

proc freq data = blog;
tables y*x / norow nocol missprint;
run;


This produces:

Table of y by x

y x

Frequency|
Percent | .| 1| 2| 3| Total
---------+--------+--------+--------+--------+
. | 12 | 12 | 20 | 14 | .
| . | . | . | . | .
---------+--------+--------+--------+--------+
1 | 10 | 16 | 13 | 18 | 47
| . | 8.16 | 6.63 | 9.18 | 23.98
---------+--------+--------+--------+--------+
2 | 17 | 18 | 32 | 22 | 72
| . | 9.18 | 16.33 | 11.22 | 36.73
---------+--------+--------+--------+--------+
3 | 19 | 28 | 31 | 18 | 77
| . | 14.29 | 15.82 | 9.18 | 39.29
---------+--------+--------+--------+--------+
Total . 62 76 58 196
. 31.63 38.78 29.59 100.00
Frequency Missing = 104


Note that if there are no missing values, SAS will not print the rows and columns headed with a '.' which is analogous to the "ifany" option in R shown below.

R

First, generate the data:

library(Hmisc)
x <- rMultinom(matrix(c(.3,.3,.4),1,3),300)
y <- rMultinom(matrix(c(.3,.3,.4),1,3),300)


Then, generate some random Uniforms to censor some of the observed data:

censprobx <- runif(300)
censproby <- runif(300)


Censor the data:

x[censprobx > .8] <- NA
y[censproby > .8] <- NA


Produce the default table (omits any missing data):

table(y,x)

x
y 1 2 3
1 18 18 29
2 17 21 22
3 20 30 40


Make the table which includes the missing category:

table(y, x, useNA="ifany")

x
y 1 2 3 NA
1 18 18 29 9
2 17 21 22 17
3 20 30 40 17
NA 14 5 14 9


The useNA option also allows the values "no" and "always". The value "no" corresponds to the default behavior in R or SAS, while the "always" option is not available in SAS. SAS, however, shows the total number missing in any case.