Friday, January 8, 2010

Example 7.21: Write a function to simulate categorical data

In example 7.20, we showed how to simulate categorical data. But we might anticipate needing to do that frequently. If a SAS function weren't built in and an equivalent R function not available in a package, we could build them from scratch.


The SAS code is particularly tortured, since we must parse the parameter string to extract the table probabilities. We do this with the count and find (section 1.4.6) and substr (section 1.4.3) functions. We also turn the characters into numbers (section 1.4.2). The SAS Macro (section A.8.1) runs a data step, so it cannot be called from within another data step, as would usually be desirable. Instead, you would have to generate the data and then merge it into any additional data that was required. On the whole, it might make more sense to treat every application as a special case and hand-code each one, rather than to use the macro, were a built-in function not available.

%macro rtable(outdata=, outvar=, n=1, p=);
data &outdata (keep=&outvar);
/* In the first section of code, we determine the number
of categories and set up an array to hold the
probabilities for each category. A maximum number of
catagories must be provided; here we use 100. */
nprobs = count(&p,",");
lengthprobs = length(&p);
array probs [100] _temporary_;

/* In the next section we parse the parameter string to
extract the probabilities for each category. */
lastcomma = 0;
lastprob = 0;
do prob = 1 to nprobs;
nextcomma = find(&p,",",lastcomma +1);
probs[prob] = lastprob +
substr(&p,lastcomma+1, (nextcomma-lastcomma -1));
lastcomma = nextcomma;
lastprob = probs[prob];
/* Finally, we generate a uniform random variate and
use it to find the appropriate category. */
do i = 1 to &n;
__x = uniform(0);
&outvar = 1;
do j = 1 to nprobs;
&outvar = &outvar + (__x gt probs[j]);
%mend rtable;

We've left this code so that an extra comma must be included in the probability parameter string. This could be fixed with an additional line or two in the data step.

We can then run the macro and test the resulting data using proc freq.

%rtable(outdata=test2,outvar= mycat, n = 10000,
p = ".1,.2,.3,");

proc freq data = test2; tables mycat; run;

Cumulative Cumulative
mycat Frequency Percent Frequency Percent
1 1032 10.32 1032 10.32
2 2013 20.13 3045 30.45
3 2956 29.56 6001 60.01
4 3999 39.99 10000 100.00


In contrast, the R code to write a function (section B.5.2) is trivial, using the cut() function as suggested by Douglas Rivers, discussed in section 1.4.10.

rtable1 <- function(cuts,n) {
output <- cut(runif(n), cuts, labels=FALSE)

In fact, the function hardly seems worth writing. To use it, you need to include 0 and 1 as the first and last elements of the cuts vector. Results can be examined using the table() function (section 2.2.1)

> mycat1 <- rtable1(c(0,.1,.3,.6,1),10000)
> summary(factor(mycat1))
1 2 3 4
1058 1989 2978 3975

To add a little bit of programming to it, suppose we want to list the probabilities per category, rather than the cutpoints. We'll omit the final category probability, assuming the input vector sums to less than 1 and has 1 fewer elements than the desired number of categories.

rtable2 <- function(p, n) {
cuts <- numeric(length(p) + 2)
cuts[1] <- 0
cuts[length(p) + 2] <- 1
for (k in 1:length(p)) {
cuts[k+1] <- sum(p[1:k])
output <- cut(runif(n), cuts, labels=FALSE)

The results are similar to the simpler function. We can see the results tabulated without explicitly making the result a factor by using the table() function (section 2.2).

> mycat2 <- rtable2(c(.1, .2, .3), 10000)
> table(mycat2)
1 2 3 4
961 1986 3058 3995

1 comment:

Ken Kleinman said...

... and proving the point that there are always many ways to skin a cat, XS, commenting on example 7.20, notes that the sample() function may be another more direct way to do this, meeting the goal of the more complex R function we show above.