SAS and R: Example 7.2: Simulate data from a logistic regression

Saturday, June 13, 2009

Example 7.2: Simulate data from a logistic regression

It can be useful to simulate data from a logistic regression (section 4.1.1). Our process is to generate the linear predictor, apply the inverse link, and then draw from a distribution with the resulting parameter. Because only the link and the distribution change, the same approach can easily be applied to other generalized linear models. In this example we assume an intercept of 0 and a slope of 0.5, and generate 1,000 observations. See section 4.6.1 for an example of fitting logistic regression.
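As a sketch of how the same three steps carry over to another generalized linear model, here is a Poisson regression with a log link in R. The seed, the standard normal covariate, and the variable names are assumptions made for illustration; only the inverse link and the final draw differ from the logistic case.

```r
# Sketch: the three-step recipe applied to a Poisson regression (log link).
# Parameter values and the seed are illustrative assumptions.
set.seed(42)
n <- 1000
intercept <- 0
beta <- 0.5
x <- rnorm(n)                      # standard normal covariate (assumed)
linpred <- intercept + beta * x    # step 1: linear predictor
mu <- exp(linpred)                 # step 2: inverse of the log link
y <- rpois(n, lambda = mu)         # step 3: draw from a Poisson with mean mu
```

Swapping `exp()` for the expit and `rpois()` for a Bernoulli draw gives the logistic version.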

SAS
In SAS, we do this within a data step. We define parameters for the model and use looping (section 1.11.1) to replicate the model scenario for random draws of standard normal covariate values (section 1.10.5). For each draw we calculate the linear predictor, take its expit (the inverse logit), and test the result against a random draw from a standard uniform distribution (section 1.10.3).


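data test;
   intercept = 0;                               /* true intercept */
   beta = .5;                                   /* true slope */
   do i = 1 to 1000;
      xtest = normal(12345);                    /* standard normal covariate */
      linpred = intercept + (xtest * beta);     /* linear predictor */
      prob = exp(linpred)/ (1 + exp(linpred));  /* expit of the linear predictor */
      ytest = uniform(0) lt prob;               /* 1 if the uniform draw is below prob, else 0 */
      output;
   end;
run;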

R
In R we begin by assigning parameter values for the model. We then generate 1,000 random normal variates (section 1.10.5), calculate the linear predictor and expit for each, and test vectorwise (section 1.11.2) against 1,000 random uniforms (section 1.10.3).


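intercept = 0                            # true intercept
beta = 0.5                               # true slope
xtest = rnorm(1000,1,1)                  # 1,000 normal covariates with mean 1 and SD 1
linpred = intercept + xtest*beta         # linear predictor
prob = exp(linpred)/(1 + exp(linpred))   # expit (inverse logit)
runis = runif(1000,0,1)                  # 1,000 standard uniform draws
ytest = ifelse(runis < prob,1,0)         # 1 where the uniform draw is below prob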
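One way to sanity check data simulated this way is to fit a logistic regression to it and compare the estimates with the values used to generate it. The sketch below regenerates the data with an arbitrary seed (an assumption, so the run is repeatable) and uses `plogis()`, R's built-in inverse logit, in place of the hand-written expit.

```r
# Sketch: simulate as in the post, then fit the model back to the data.
set.seed(2009)                             # arbitrary seed for repeatability
intercept <- 0
beta <- 0.5
xtest <- rnorm(1000, 1, 1)
prob <- plogis(intercept + xtest * beta)   # same as exp(lp) / (1 + exp(lp))
ytest <- ifelse(runif(1000) < prob, 1, 0)

fit <- glm(ytest ~ xtest, family = binomial)
coef(fit)               # estimates should land near 0 and 0.5
confint.default(fit)    # Wald confidence intervals for both coefficients
```

With 1,000 observations the estimates will not match the true values exactly, but the intervals should usually cover them.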

24 comments:

Brian said...: Great post, I am confused by ytest = ifelse(runis < prob,1,0). What is the logic of this step?

If you are using this for simulation to determine sample size, would one experiment with the proper intercept (versus zero) to get the probability to average around the expected probability of success in the population?; March 14, 2011 at 6:20 PM
Ken Kleinman said...: Hi Brian--

ifelse() is a vectorwise logic test. It tests whether the ith element of the runis vector is less than the ith element of the prob vector. If it is, then the ith element of ytest will be 1, otherwise it will be 0.

The prob vector contains the probability of the outcome being 1, given the covariate value and intercept. The ifelse() is like flipping a coin with probability of heads specified by the prob vector.

For a sample size calculation via Monte Carlo methods, you would solve prob = exp(intercept)/(1 + exp(intercept)) with prob = known proportion in the population when the covariate=0. You also need to choose the covariate distribution that best mimics your anticipation. Then simulate many times; fit a logistic regression model to each, and see what proportion of times the null is rejected. That's your power. Vary the sample size and beta to find the sample size and beta that achieve the power you need.; March 14, 2011 at 8:39 PM
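The Monte Carlo recipe in this reply might be sketched in R as follows. The function name `sim_power`, the standard normal covariate, the baseline proportion of 0.3, the sample sizes, and the number of replicates are all assumptions chosen for illustration; `qlogis()` solves prob = exp(intercept)/(1 + exp(intercept)) for the intercept.

```r
# Hypothetical helper: estimate power for the slope by repeated simulation.
sim_power <- function(n, intercept, beta, nsim = 200, alpha = 0.05) {
  reject <- logical(nsim)
  for (s in seq_len(nsim)) {
    x <- rnorm(n)                                    # assumed covariate distribution
    y <- rbinom(n, 1, plogis(intercept + beta * x))  # Bernoulli draw with P(Y = 1) = prob
    fit <- glm(y ~ x, family = binomial)
    pval <- summary(fit)$coefficients["x", "Pr(>|z|)"]
    reject[s] <- pval < alpha                        # was the null (beta = 0) rejected?
  }
  mean(reject)                                       # proportion rejected = estimated power
}

set.seed(1)
p0 <- 0.3                    # assumed P(Y = 1) when the covariate is 0
intercept <- qlogis(p0)      # logit of p0
power_by_n <- sapply(c(50, 100, 200), sim_power, intercept = intercept, beta = 0.5)
power_by_n                   # estimated power at each sample size
```

Raising `nsim` reduces the Monte Carlo error in each power estimate at the cost of run time.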
Brian said...: I see, that is excellent! The light bulb went on :)

I was really thrown by the comparison to the random uniform value. Could you also take a draw from a binomial distribution with p=prob instead?; March 14, 2011 at 9:02 PM
Ken Kleinman said...: Absolutely. I thought of that last night and wondered why I didn't write it that way to begin with. I think in R you'd have to use an apply function to allow a different p for each binomial. It would look something like:

y = sapply(prob, function(p) {rbinom(1,1,p)})

So the binomial part would be easier to understand, but the code might be less accessible.; March 15, 2011 at 9:23 AM
Brian said...: If I am not mistaken, if p is a vector and the first argument of rbinom is equal to the length of p, this will return a vector in which each entry is drawn using the value in p at that index.

rbinom(length(p),1,p); March 15, 2011 at 6:16 PM
Ken Kleinman said...: Right you are. Very reasonable for it to work that way. (I didn't try to make it work that way, obviously, and the documentation doesn't clarify that p can be a vector.)

Using your code would be both easier to understand and more "the R way."; March 15, 2011 at 8:47 PM
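To make the exchange above concrete, the sketch below generates the outcome three ways from the same prob vector: the uniform comparison from the post, the sapply() version, and the vectorized rbinom() call. The seed is an arbitrary assumption. The three vectors differ draw by draw, since each uses fresh random numbers, but each should have a mean close to mean(prob).

```r
# Sketch: three equivalent ways to draw Bernoulli outcomes with varying p.
set.seed(123)
prob <- plogis(0 + 0.5 * rnorm(1000, 1, 1))

y_unif  <- ifelse(runif(1000) < prob, 1, 0)           # approach used in the post
y_apply <- sapply(prob, function(p) rbinom(1, 1, p))  # one call per probability
y_binom <- rbinom(length(prob), 1, prob)              # prob used element by element

c(target = mean(prob), unif = mean(y_unif),
  apply = mean(y_apply), binom = mean(y_binom))
```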
Anonymous said...: Mr Ken Kleinman
I used your post to generate logistic data. I want to generate x1, x2, and x3. x1 has 2 categories (binary), x2 has 3 categories, and x3 has a normal distribution.
When I generate 300 observations, it gives a good model. But when I generate 30, the parameters aren't significant. Can you tell me the reason?; May 24, 2011 at 10:06 PM
Ken Kleinman said...: With 30 observations, your power is very poor. That's the most likely issue. Try simulating and fitting your sample of 30 one thousand times. You should find each parameter with p < .05 more (just slightly more) than 5% of the time. As you increase the sample from 30 to 300, the proportion of rejections should increase-- that's your power increasing.; May 25, 2011 at 8:19 AM
Unknown said...: Hi,

Your program is very helpful, but if I would like to make a restriction of proportion difference like P(Y=1|X=1)-P(Y=1|X=0) = a, is there any way to generate such data controlled by this kind of proportion difference? Any suggestion will be pretty helpful. Thanks!; November 7, 2012 at 1:40 PM
Unknown said...: I'm interested in doing a post-hoc power test for a logistic regression, but I also have an interaction between continuous and categorical variables. Any suggestions on how to incorporate this into your code? Sorry, I realize this is several years after the original post, just hoping to get some advice as I've only been able to find information on assessing power for an interaction OR for a logistic regression, but not both. Thanks!; June 18, 2014 at 10:18 AM
Ken Kleinman said...: Post-hoc power assessment is fairly controversial and is frowned upon by many statisticians. But it's not hard to adapt our code to simulate your setting and then to assess power to detect an interaction. I'll write up a new blog post to address this shortly, and thanks for the question; June 19, 2014 at 11:44 AM
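As a rough illustration of adapting the code to a setting with an interaction, the sketch below simulates one dataset with a binary covariate, a continuous covariate, and their product, then fits the model with the interaction term. Every coefficient value, the sample size, the covariate distributions, and the seed are hypothetical choices, not values from the post.

```r
# Sketch: logistic data with a continuous-by-binary interaction.
set.seed(2014)
n <- 500
group <- rbinom(n, 1, 0.5)     # binary covariate, assumed 50/50 split
x <- rnorm(n)                  # continuous covariate, assumed standard normal

# Hypothetical coefficients: intercept, group, x, and group-by-x interaction
b0 <- -0.5; b_group <- 0.3; b_x <- 0.4; b_int <- 0.6
linpred <- b0 + b_group * group + b_x * x + b_int * group * x
y <- rbinom(n, 1, plogis(linpred))

fit <- glm(y ~ group * x, family = binomial)
summary(fit)$coefficients["group:x", ]   # estimate, SE, z, and p-value for the interaction
```

Repeating this many times and recording how often the interaction p-value falls below 0.05 would give a simulation-based power estimate for the interaction.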
Unknown said...: My primary reason for wanting to do post hoc power was to retain the relationships between the variables we have. I'm having trouble simulating variables and accounting for the relationships between them. I have one categorical variable and one continuous one, and I'm interested in looking at the interaction between them. Any suggestions?; June 24, 2014 at 11:12 AM
Ken Kleinman said...: Possibly. Please contact me privately via e-mail.; June 25, 2014 at 11:46 AM
Anonymous said...: Mr Ken Kleinman

I used your post on how to generate data from a logistic regression in SAS. It's a very helpful post for new users. Following it, I generated 1,000 observations. Now I want to replicate this 100 times. How can I do this? Any suggestions would be very helpful. Thanks; October 13, 2016 at 7:36 PM
Unknown said...: Supposing I already have a dataset, can I use same to simulate several logistic regression results?; November 27, 2017 at 5:08 AM
Nick Horton said...: Safiya, you could certainly use your dataset as the basis of your simulations and create new Y's using the approach we've described. Is that what you mean by "simulate several logistic regression results"?; November 27, 2017 at 7:45 AM
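A minimal sketch of this idea in R: keep the covariates from an existing data frame fixed and draw new outcomes from the logistic model several times. The data frame `mydata` here is a stand-in built for the example, and the intercept, slope, and number of replicates are assumptions.

```r
# Sketch: new Y's generated from the covariates of an existing dataset.
set.seed(2017)
mydata <- data.frame(x = rnorm(200, 1, 1))    # stand-in for your own data frame
intercept <- 0                                # assumed parameter values
beta <- 0.5
prob <- plogis(intercept + beta * mydata$x)   # P(Y = 1) for each existing row

nsim <- 5
newY <- replicate(nsim, rbinom(nrow(mydata), 1, prob))  # one column per simulated outcome
dim(newY)                                     # rows = observations, columns = replicates
```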
Unknown said...: Thanks, that was what I meant. What I am interested in is the probabilities of treatment assignment. Is it that I would have one set of probabilities representative of all the datasets or each dataset would have its own set computed separately?; November 30, 2017 at 4:23 AM
Nick Horton said...: You can use whatever probabilities you like: the simulation can be structured to track the scenario of interest.; November 30, 2017 at 6:05 AM
Unknown said...: This is very helpful. Thank you for the post!
Is it possible to modify your program to simulate a case-control data with P(Y=1)=0.5? Many thanks!!; June 9, 2018 at 9:44 PM
Emmanuel said...: This is very helpful and informative. How do I introduce error terms to the ytest?; April 13, 2019 at 12:27 PM
Anonymous said...: How to simulate a binary response (obese or not obese) variable given a distribution of body weight?; July 9, 2019 at 2:40 PM
Seanlove said...: Can someone help me? How do I repeat the simulation, say 30 times, with the different datasets in one file? For example:

simulation  id  y  x1  x2
1           1   1  0   18
1           2   0  1   20
1           3   0  0   24
1           4   1  1   28
1           5   0  1   40
2           1   1  1   44
2           2   0  0   25
2           3   1  1   38
2           4   1  1   39
2           5   1  0   41
3           1   1  1   43
3           2   1  0   45
3           3   1  0   43
3           4   0  0   41
3           5   0  1   40; July 30, 2019 at 12:41 PM
Nick Horton said...: I'd suggest turning our code into a function then iterating over each of the values in your dataset.; August 4, 2019 at 9:26 AM
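One reading of this suggestion, sketched in R: wrap the simulation in a function, call it once per replicate, and stack the results with a simulation identifier. The function name `simulate_logistic`, the single covariate, the five observations per replicate, and the seed are assumptions for illustration.

```r
# Hypothetical helper: simulate one dataset of size n from the logistic model.
simulate_logistic <- function(n, intercept = 0, beta = 0.5) {
  x <- rnorm(n, 1, 1)
  prob <- plogis(intercept + beta * x)
  data.frame(id = seq_len(n), y = rbinom(n, 1, prob), x = x)
}

set.seed(2019)
nsim <- 30
sims <- do.call(rbind, lapply(seq_len(nsim), function(s) {
  cbind(simulation = s, simulate_logistic(n = 5))   # tag each replicate with its number
}))
head(sims, 7)    # first replicate plus the start of the second
```

The stacked data frame can then be split by `simulation` to fit a model to each replicate.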

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education.

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses.
