## Friday, January 8, 2010

### Example 7.21: Write a function to simulate categorical data

In example 7.20, we showed how to simulate categorical data. But we might anticipate needing to do that frequently. If a SAS function weren't built in and an equivalent R function not available in a package, we could build them from scratch.

SAS

The SAS code is particularly tortured, since we must parse the parameter string to extract the table probabilities. We do this with the count and find (section 1.4.6) and substr (section 1.4.3) functions. We also turn the characters into numbers (section 1.4.2). The SAS Macro (section A.8.1) runs a data step, so it cannot be called from within another data step, as would usually be desirable. Instead, you would have to generate the data and then merge it into any additional data that was required. On the whole, it might make more sense to treat every application as a special case and hand-code each one, rather than to use the macro, were a built-in function not available.

`%macro rtable(outdata=, outvar=, n=1, p=);data &outdata (keep=&outvar);/* In the first section of code, we determine the number   of categories and set up an array to hold the    probabilities for each category.  A maximum number of    catagories must be provided; here we use 100. */nprobs = count(&p,",");lengthprobs = length(&p);array probs  _temporary_;/* In the next section we parse the parameter string to    extract the probabilities for each category. */lastcomma = 0;lastprob = 0;do prob = 1 to nprobs; nextcomma = find(&p,",",lastcomma +1); probs[prob] = lastprob +      substr(&p,lastcomma+1, (nextcomma-lastcomma -1)); lastcomma = nextcomma; lastprob = probs[prob]; end;/* Finally, we generate a uniform random variate and   use it to find the appropriate category. */do i = 1 to &n; __x = uniform(0); &outvar = 1; do j = 1 to nprobs;   &outvar = &outvar + (__x gt probs[j]);       end; output; end;run;%mend rtable;`

We've left this code so that an extra comma must be included in the probability parameter string. This could be fixed with an additional line or two in the data step.

We can then run the macro and test the resulting data using proc freq.

`%rtable(outdata=test2,outvar= mycat, n = 10000,                                    p = ".1,.2,.3,");proc freq data = test2; tables mycat; run;                                Cumulative  Cumulativemycat    Frequency   Percent    Frequency     Percent------------------------------------------------------  1        1032       10.32        1032        10.32  2        2013       20.13        3045        30.45  3        2956       29.56        6001        60.01  4        3999       39.99       10000       100.00`

R

In contrast, the R code to write a function (section B.5.2) is trivial, using the cut() function as suggested by Douglas Rivers, discussed in section 1.4.10.

`rtable1 <- function(cuts,n) {   output <- cut(runif(n), cuts, labels=FALSE)   return(output)}`

In fact, the function hardly seems worth writing. To use it, you need to include 0 and 1 as the first and last elements of the cuts vector. Results can be examined using the table() function (section 2.2.1)

`> mycat1 <- rtable1(c(0,.1,.3,.6,1),10000)> summary(factor(mycat1))   1    2    3    4 1058 1989 2978 3975 `

To add a little bit of programming to it, suppose we want to list the probabilities per category, rather than the cutpoints. We'll omit the final category probability, assuming the input vector sums to less than 1 and has 1 fewer elements than the desired number of categories.

`rtable2 <- function(p, n) {   cuts <- numeric(length(p) + 2)   cuts <- 0   cuts[length(p) + 2] <- 1   for (k in 1:length(p))  {      cuts[k+1] <- sum(p[1:k])   }   output <- cut(runif(n), cuts, labels=FALSE)   return(output)}`

The results are similar to the simpler function. We can see the results tabulated without explicitly making the result a factor by using the table() function (section 2.2).

`> mycat2 <- rtable2(c(.1, .2, .3), 10000)> table(mycat2)mycat2   1    2    3    4  961 1986 3058 3995 `

#### 1 comment:

Ken Kleinman said...

... and proving the point that there are always many ways to skin a cat, XS, commenting on example 7.20, notes that the sample() function may be another more direct way to do this, meeting the goal of the more complex R function we show above.