SAS and R: Example 7.20: Simulate categorical data

Monday, January 4, 2010

Example 7.20: Simulate categorical data

Both SAS and R provide built-in means of simulating categorical data (see section 1.10.4). Alternatively, it is straightforward to write code to do this directly. In this entry, we show how to do it once, for four categories with probabilities 0.1, 0.2, 0.3, and 0.4. In a future entry, we'll demonstrate writing a SAS macro (section A.8.1) and a function in R (section B.5.2) to do it repeatedly.

SAS


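data test;
  /* probabilities for categories 1-3; category 4 gets the remaining 0.4 */
  p1 = .1; p2 = .2; p3 = .3;
  do i = 1 to 10000;
    x = uniform(0);   /* uniform(0,1) variate; a seed of 0 uses the system clock */
    /* each true comparison adds 1, so mycat counts the cutpoints x exceeds */
    mycat = (x ge 0) + (x gt p1) + (x gt p1 + p2)
                     + (x gt p1 + p2 + p3);
    output;
  end;
run;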

Here each parenthesized logical test in the mycat assignment resolves to 1 if the test is true and 0 otherwise, as discussed in section 1.4.9.
The (x ge 0) term is always true, so it makes the categories range from 1 to 4, rather than 0 to 3.

The results can be assessed using proc freq:


proc freq data=test; tables mycat; run;

                                  Cumulative    Cumulative
mycat    Frequency     Percent     Frequency      Percent
----------------------------------------------------------
    1         947        9.47           947         9.47
    2        2061       20.61          3008        30.08
    3        3039       30.39          6047        60.47
    4        3953       39.53         10000       100.00

R

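In contrast, the R syntax to get the same results is rather dense. The loop starts at 0 so that the first pass, where p[0:0] is empty and sums to 0, adds 1 to every observation. In later passes the index 0 is silently dropped, so sum(p[0:i]) is the cumulative probability through category i.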


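p <- c(.1,.2,.3)          # probabilities of categories 1-3; category 4 gets the rest
x <- runif(10000)
mycat <- numeric(10000)
for (i in 0:length(p)) {
    # add 1 wherever x has reached the cumulative probability through category i
    mycat <- mycat + (x >= sum(p[0:i]))
}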
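
The loop above adds 1 to mycat each time x reaches another cumulative probability. A minimal sketch of the same idea without the loop, assuming base R's cumsum() and findInterval(), might look like the following. The names breaks and mycat_alt and the seed are ours, chosen only for illustration, and this is not necessarily how the book does it.

```r
set.seed(720)                 # arbitrary seed, only so the sketch is reproducible
p <- c(.1, .2, .3)            # probabilities of categories 1-3; category 4 gets the rest
x <- runif(10000)

# Loop version from the entry, for comparison
mycat <- numeric(10000)
for (i in 0:length(p)) {
    mycat <- mycat + (x >= sum(p[0:i]))
}

# Vectorized version: count how many cumulative cutpoints each x has reached
breaks <- cumsum(p)                        # cumulative probabilities through categories 1-3
mycat_alt <- findInterval(x, breaks) + 1   # +1 so the categories run from 1 to 4

all(mycat == mycat_alt)       # the two versions should agree
```

findInterval() counts the cutpoints less than or equal to each x, which is the same comparison the loop makes with >=.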

We can display the results using the summary() function.


summary(factor(mycat))
  1    2    3    4
 990 2047 2978 3985
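
As a rough check on frequencies like these, one could compare the simulated proportions with the target probabilities. The sketch below assumes base R's table(), prop.table(), and chisq.test(); the names target, observed, and sim_prop and the seed are ours, not from the original entry, so the counts it produces will differ from those shown above.

```r
set.seed(720)                        # arbitrary seed, only so the sketch is reproducible
p <- c(.1, .2, .3)
target <- c(p, 1 - sum(p))           # category 4 takes the remaining probability

# Simulate as in the entry
x <- runif(10000)
mycat <- numeric(10000)
for (i in 0:length(p)) {
    mycat <- mycat + (x >= sum(p[0:i]))
}

# Tabulate with all four levels present, then convert counts to proportions
observed <- table(factor(mycat, levels = 1:4))
sim_prop <- as.vector(prop.table(observed))

# Target and simulated proportions side by side
round(rbind(target = target, simulated = sim_prop), 3)

# Goodness-of-fit test of the counts against the target probabilities
chisq.test(observed, p = target)
```

With 10,000 draws the simulated proportions should typically land within about a percentage point of the targets, as they do in the output above.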

11 comments:

Douglas Rivers said...: Or, you could just use

mycat <- cut(runif(10000), c(0, 0.1, 0.3, 0.6, 1), labels=FALSE); January 4, 2010 at 10:53 PM
Ken Kleinman said...: Thanks, Douglas! Much better.

It looks like if I omit the labels=FALSE, the factor labels are very useful, too.

> mycat <- cut(runif(10000), c(0, 0.1, 0.3, 0.6, 1))

> summary(mycat)
(0,0.1] (0.1,0.3] (0.3,0.6] (0.6,1]
987 1993 3047 3973; January 5, 2010 at 8:43 AM
Unknown said...: Sample may be a better function to simulate categorical data:

> mycat <- sample(1:4,10000,replace=TRUE,prob=c(.1,.2,.3,.4))
> table(mycat)

1 2 3 4
1012 2074 2924 3990; January 8, 2010 at 4:36 AM
Anonymous said...: Hello,

How could I simulate data from a multinomial logit model depending on a metric variable?; April 19, 2011 at 10:21 AM
Ken Kleinman said...: I'm not sure what you're asking. You can simulate data from a multinomial logistic model using a process similar to what we show for logistic regression in this entry: http://sas-and-r.blogspot.com/2009/06/example-72-simulate-data-from-logistic.html. What do you mean by a "metric" variable, though?; April 19, 2011 at 12:36 PM
burakaydin said...: Hello,
Can I simulate variables with a known Pearson covariance matrix?
I need to simulate categorical, continuous and binary variables based on the Pearson covariance matrix. Thanks; March 30, 2012 at 5:48 PM
Ken Kleinman said...: In example 6.3 in our book, we show correlated binary variables, based on Lipsitz et al, Stats in Med 1990, 9:1517-1525. You'll find many cites if you search with "simulate correlated" as your base.; March 30, 2012 at 8:24 PM
burakaydin said...: Thanks for the response.
There is an R package called "bindata". It performs almost perfectly for creating correlated binary variables with known marginal probabilities and correlations.
What I need is the simulation of correlated continuous and categorical variables using a single multivariate distribution.; March 30, 2012 at 9:27 PM
Ken Kleinman said...: Good to know about that one, thanks. I don't know of a technique to do what you need, offhand. A brief search turned up this thread: http://stats.stackexchange.com/questions/22856/how-to-generate-correlated-test-data-that-has-bernoulli-categorical-and-contin where copulas are suggested. And also this paper: http://www.springerlink.com/content/011x633m554u843g/. Let me know what you end up doing.; March 30, 2012 at 10:20 PM
Nick Horton said...: There is a literature that might be relevant. A starting point might be Cox, D. R. and Wermuth, N. (1992). Response models for mixed binary and quantitative variables. Biometrika, 79, 441-461. They propose a flexible multivariate distribution which might be useful.; March 31, 2012 at 9:31 AM
Anonymous said...: What is the variance of the error term when a multinomial logit is simulated in this way?; May 13, 2013 at 11:04 AM


About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education.

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses.
