## Monday, January 4, 2010

### Example 7.20: Simulate categorical data

Both SAS and R provide means of simulating categorical data (see section 1.10.4). Alternatively, it is trivial to write code to do this directly. In this entry, we show how to do it once. In a future entry, we'll demonstrate writing a SAS Macro (section A.8.1) and a function in R (section B.5.2) to do it repeatedly.

SAS

```
data test;
  p1 = .1; p2 = .2; p3 = .3;
  do i = 1 to 10000;
    x = uniform(0);
    mycat = (x ge 0) + (x gt p1) + (x gt p1 + p2)
                     + (x gt p1 + p2 + p3);
    output;
  end;
run;
```

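Here each parenthetical logical test in the mycat = line resolves to 1 if the test is true and 0 otherwise, as discussed in section 1.4.9, so mycat counts how many of the cumulative cutoffs x exceeds.
The (x ge 0) test is always true, so it adds 1 to every observation and makes the categories range from 1 to 4, rather than 0 to 3.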

The results can be assessed using proc freq:

```
proc freq data=test;
  tables mycat;
run;

                                  Cumulative    Cumulative
mycat    Frequency     Percent     Frequency      Percent
----------------------------------------------------------
    1         947        9.47           947         9.47
    2        2061       20.61          3008        30.08
    3        3039       30.39          6047        60.47
    4        3953       39.53         10000       100.00
```

R

In contrast, the R syntax to get the results is rather dense.

```
p <- c(.1,.2,.3)
x <- runif(10000)
mycat <- numeric(10000)
for (i in 0:length(p)) {
  mycat <- mycat + (x >= sum(p[0:i]))
}
```
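To see what the loop is doing, it may help to trace it on a few fixed values instead of random draws. The sketch below is only an illustration: `x_small` and `cat_small` are made-up names, and the four inputs were picked by hand so that one falls in each interval.

```r
p <- c(.1, .2, .3)

# Hypothetical fixed inputs, one per interval, in place of runif() draws
x_small <- c(0.05, 0.2, 0.5, 0.9)
cat_small <- numeric(length(x_small))

for (i in 0:length(p)) {
  # R drops the 0 index, so p[0:i] is the first i elements of p
  # and the successive cutoffs are 0, .1, .3 and .6
  cat_small <- cat_small + (x_small >= sum(p[0:i]))
}

cat_small
# [1] 1 2 3 4
```

Each pass adds 1 to every value at or above the current cutoff, so a value ends up in the category equal to the number of cutoffs it has reached.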

We can display the results using the summary() function.

```
summary(factor(mycat))
   1    2    3    4 
 990 2047 2978 3985
```
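Counts are easier to judge as proportions set against the intended probabilities. The following is a rough sketch of one way to do that. The seed value and the names `target` and `observed` are assumptions added for illustration, and the simulation is repeated so the block runs on its own.

```r
set.seed(720)  # arbitrary seed, used only to make the run repeatable

p <- c(.1, .2, .3)
x <- runif(10000)
mycat <- numeric(10000)
for (i in 0:length(p)) {
  mycat <- mycat + (x >= sum(p[0:i]))
}

# The fourth category takes whatever probability is left over (.4 here)
target <- c(p, 1 - sum(p))

# Observed share of each category; fixing the levels keeps all four cells
observed <- as.vector(prop.table(table(factor(mycat, levels = 1:4))))

round(rbind(target, observed), 3)
```

With 10,000 draws, the observed proportions should typically land within about a percentage point of the targets.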

#### 11 comments:

Douglas Rivers said...

Or, you could just use

mycat <- cut(runif(10000), c(0, 0.1, 0.3, 0.6, 1), labels=FALSE)

Ken Kleinman said...

Thanks, Douglas! Much better.

It looks like if I omit the labels=FALSE, the factor labels are very useful, too.

> mycat <- cut(runif(10000), c(0, 0.1, 0.3, 0.6, 1))

> summary(mycat)
(0,0.1] (0.1,0.3] (0.3,0.6] (0.6,1]
987 1993 3047 3973

Unknown said...

Sample may be a better function to simulate categorical data:

> mycat <- sample(1:4,10000,replace=TRUE,prob=c(.1,.2,.3,.4))
> table(mycat)

1 2 3 4
1012 2074 2924 3990

Anonymous said...

Hello,

How could I simulate data from a multinomial logit model that depends on a metric variable?

Ken Kleinman said...

I'm not sure what you're asking. You can simulate data from a multinomial logistic model using a process similar to what we show for logistic regression in this entry: http://sas-and-r.blogspot.com/2009/06/example-72-simulate-data-from-logistic.html. What do you mean by a "metric" variable, though?

burakaydin said...

Hello,
Can I simulate variables with a known Pearson covariance matrix?
I need to simulate categorical, continuous, and binary variables based on the Pearson covariance matrix. Thanks.

Ken Kleinman said...

In example 6.3 in our book, we show correlated binary variables, based on Lipsitz et al, Stats in Med 1990, 9:1517-1525. You'll find many cites if you search with "simulate correlated" as your base.

burakaydin said...

Thanks for the response.
There is an R package called "bindata". It performs almost perfectly for creating correlated binary variables with known marginal probabilities and correlations.
What I need is the simulation of correlated continuous and categorical variables using a single multivariate distribution.

Ken Kleinman said...

Good to know about that one, thanks. I don't know of a technique to do what you need, offhand. A brief search turned up this thread: http://stats.stackexchange.com/questions/22856/how-to-generate-correlated-test-data-that-has-bernoulli-categorical-and-contin where copulas are suggested. And also this paper: http://www.springerlink.com/content/011x633m554u843g/. Let me know what you end up doing.

Nick Horton said...

There is a literature that might be relevant. A starting point might be Cox, D. R. and Wermuth, N. (1992). Response models for mixed binary and quantitative variables. Biometrika, 79, 441-461. They propose a flexible multivariate distribution which might be useful.

Anonymous said...

What is the variance of the error term when a multinomial logit is simulated in this way?