SAS and R: Example 7.21: Write a function to simulate categorical data

Friday, January 8, 2010

Example 7.21: Write a function to simulate categorical data

In example 7.20, we showed how to simulate categorical data. But we might anticipate needing to do that frequently. If a SAS function weren't built in and an equivalent R function not available in a package, we could build them from scratch.

SAS

The SAS code is particularly tortured, since we must parse the parameter string to extract the table probabilities. We do this with the count and find (section 1.4.6) and substr (section 1.4.3) functions. We also turn the characters into numbers (section 1.4.2). The SAS Macro (section A.8.1) runs a data step, so it cannot be called from within another data step, as would usually be desirable. Instead, you would have to generate the data and then merge it into any additional data that was required. On the whole, it might make more sense to treat every application as a special case and hand-code each one, rather than to use the macro, were a built-in function not available.


%macro rtable(outdata=, outvar=, n=1, p=);
data &outdata (keep=&outvar);
/* In the first section of code, we determine the number
   of categories and set up an array to hold the 
   probabilities for each category.  A maximum number of 
   catagories must be provided; here we use 100. */
nprobs = count(&p,",");
lengthprobs = length(&p);
array probs [100] _temporary_;

/* In the next section we parse the parameter string to 
   extract the probabilities for each category. */
lastcomma = 0;
lastprob = 0;
do prob = 1 to nprobs;
 nextcomma = find(&p,",",lastcomma +1);
 probs[prob] = lastprob + 
     substr(&p,lastcomma+1, (nextcomma-lastcomma -1));
 lastcomma = nextcomma;
 lastprob = probs[prob];
 end;
/* Finally, we generate a uniform random variate and
   use it to find the appropriate category. */
do i = 1 to &n;
 __x = uniform(0);
 &outvar = 1;
 do j = 1 to nprobs;
   &outvar = &outvar + (__x gt probs[j]);
       end;
 output;
 end;
run;
%mend rtable;

We've left this code so that an extra comma must be included in the probability parameter string. This could be fixed with an additional line or two in the data step.

We can then run the macro and test the resulting data using proc freq.


%rtable(outdata=test2,outvar= mycat, n = 10000, 
                                   p = ".1,.2,.3,");

proc freq data = test2; tables mycat; run;

                                Cumulative  Cumulative
mycat    Frequency   Percent    Frequency     Percent
------------------------------------------------------
  1        1032       10.32        1032        10.32
  2        2013       20.13        3045        30.45
  3        2956       29.56        6001        60.01
  4        3999       39.99       10000       100.00

R

In contrast, the R code to write a function (section B.5.2) is trivial, using the cut() function as suggested by Douglas Rivers, discussed in section 1.4.10.


rtable1 <- function(cuts,n) {
   output <- cut(runif(n), cuts, labels=FALSE)
   return(output)
}

In fact, the function hardly seems worth writing. To use it, you need to include 0 and 1 as the first and last elements of the cuts vector. Results can be examined using the table() function (section 2.2.1)


> mycat1 <- rtable1(c(0,.1,.3,.6,1),10000)
> summary(factor(mycat1))
   1    2    3    4 
1058 1989 2978 3975

To add a little bit of programming to it, suppose we want to list the probabilities per category, rather than the cutpoints. We'll omit the final category probability, assuming the input vector sums to less than 1 and has 1 fewer elements than the desired number of categories.


rtable2 <- function(p, n) {
   cuts <- numeric(length(p) + 2)
   cuts[1] <- 0
   cuts[length(p) + 2] <- 1
   for (k in 1:length(p))  {
      cuts[k+1] <- sum(p[1:k])
   }
   output <- cut(runif(n), cuts, labels=FALSE)
   return(output)
}

The results are similar to the simpler function. We can see the results tabulated without explicitly making the result a factor by using the table() function (section 2.2).


> mycat2 <- rtable2(c(.1, .2, .3), 10000)
> table(mycat2)
mycat2
   1    2    3    4 
 961 1986 3058 3995

1 comment:

Ken Kleinman said...: ... and proving the point that there are always many ways to skin a cat, XS, commenting on example 7.20, notes that the sample() function may be another more direct way to do this, meeting the goal of the more complex R function we show above.; January 8, 2010 at 2:45 PM

Post a Comment

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

SAS and R

Catalogs of posts

Friday, January 8, 2010

Example 7.21: Write a function to simulate categorical data

1 comment:

About SAS and R

Topics discussed