Saturday, June 13, 2009

Example 7.2: Simulate data from a logistic regression

It might be useful to be able to simulate data from a logistic regression (section 4.1.1). Our process is to generate the linear predictor, then apply the inverse link, and finally draw from a distribution with this parameter. This approach is useful in that it can easily be applied to other generalized linear models. In this example we assume an intercept of 0 and a slope of 0.5, and generate 1,000 observations. See section 4.6.1 for an example of fitting logistic regression.


SAS
In SAS, we do this within a data step. We define parameters for the model and use looping (section 1.11.1) to replicate the model scenario for random draws of standard normal covariate values (section 1.10.5), calculating the linear predictor for each, and testing the resulting expit against a random draw from a standard uniform distribution (section 1.10.3).

data test;
intercept = 0;
beta = .5;
do i = 1 to 1000;
xtest = normal(12345);
linpred = intercept + (xtest * beta);
prob = exp(linpred)/ (1 + exp(linpred));
ytest = uniform(0) lt prob;
output;
end;
run;


R
In R we begin by assigning parameter values for the model. We then generate 1,000 random normal variates (section 1.10.5), calculating the linear predictor and expit for each, and then testing vectorwise (section 1.11.2) against 1,000 random uniforms (1.10.3).

intercept = 0
beta = 0.5
xtest = rnorm(1000,1,1)
linpred = intercept + xtest*beta
prob = exp(linpred)/(1 + exp(linpred))
runis = runif(1000,0,1)
ytest = ifelse(runis < prob,1,0)

24 comments:

Brian said...

Great post, I am confused by ytest = ifelse(runis < prob,1,0). What is the logic of this step?

If you are using this for simulation to determine sample size, would one experiment with the proper intercept (versus zero) to get the probability to average around the expected probability of success in the population?

Ken Kleinman said...

Hi Brian--

ifelse() is a vectorwise logic test. It tests whether the i element of the runis vector is less than the i element of the prob vector. If it is, then the i element of ytest will be 1, otherwise it will be 0.

The prob vector contains the probability of the outcome being 1, given the covariate value and intercept. The ifelse() is like flipping a coin with probability of heads specified by the prob vector.

For a sample size calculation via Monte Carlo methods, you would solve prob = exp(intercept)/(1 + exp(intercept)) with prob = known proportion in the population when the covariate=0. You also need to choose the covariate distribution that best mimics your anticipation. Then simulate many times; fit a logistic regression model to each, and see what proportion of times the null is rejected. That's your power. Vary the sample size and beta to find the sample size and beta that achieve the power you need.

Brian said...

I see, that is excellent! The light bulb went on :)

I was really thrown by the comparison to the random uniform value. Could you also take a draw from a binomial distribution with p=prob instead?

Ken Kleinman said...

Absolutely. I thought of that last night and wondered why didn't write it that way to begin with. I think in R you'd have to use an apply function to allow a different p for each binomial. It would look something like:

y = sapply(prob, function(p) {rbinom(1,1,p)})

So the binomial part wouls be easier to understand, but the code might be less accessible.

Brian said...

If I am not mistaken, I think if p is a vector and rbinom has its first argument is equal to the length of p, this will return a vector with each entry equal to the value in p at that index.

rbinom(length(p),1,p)

Ken Kleinman said...

Right you are. Very reasonable for it to work that way. (I didn't try to make it work that way, obviously, and the documentation doesn't clarify that p can be a vector.)

Using your code would be both easier to understand and more "the R way."

Anonymous said...

Mr Ken Kleinman
i used your post to generate data logistic. I want to generate x1,x2, and x3. x1 has 2 category (binary), x2 has 3 category, and x3 has normal distribution.
When i generate 300, it gave a good model. But when i generate 30, the parameter did't significant. Can you tell me the reason?

Ken Kleinman said...

With 30 observations, your power is very poor. That's the most likely issue. Try simulating and fitting your sample of 30 one thousand times. You should find each parameter with p < .05 more (just slightly more) than 5% of the time. As you increase the sample from 30 to 300, the proportion of rejections should increase-- that's your power increasing.

Unknown said...

Hi,

Your program is very helpful, but if I would like to make a restriction of proportion difference like P(Y=1|X=1)-P(Y=1|X=0) = a, is there any way to generate such data controlled by this kind of proportion difference? Any suggestion will be pretty helpful. Thanks!

Unknown said...
This comment has been removed by the author.
Unknown said...

I'm interested in doing a post-hoc power test for a logistic regression, but I also have an interaction between continuous and categorical variables. Any suggestions on how to incorporate this into your code? Sorry, I realize this is several years after the original post, just hoping to get some advice as I've only been able to find information on assessing power for an interaction OR for a logistic regression, but not both. Thanks!

Ken Kleinman said...

Post-hoc power assessment is fairly controversial and is frowned upon by many statisticians. But it's not hard to adapt our code to simulate your setting and then to assess power to detect an interaction. I'll write up a new blog post to address this shortly, and thanks for the question

Unknown said...

My primary reason for wanting to do post hoc power was to retain the relationships between the variables we have. I'm having trouble simulating variables and accounting for the relationships between them. I have one categorical variable and one continuous one, and I'm interested in looking at the interaction between them. Any suggestions?

Ken Kleinman said...

Possibly. Please contact me privately via e-mail.

Anonymous said...

Mr Ken Kleinman

I used your post how to generate data from a logistic regression in SAS. Its very helpful posts for new users. I use your post of generating data from logistic regression I generate 1000 random numbers, Now I want to replicate this results 100 times, how i can do this. Any suggestions will be pretty helpful. Thanks

Unknown said...

Supposing I already have a dataset, can I use same to simulate several logistic regression results?

Nick Horton said...

Safiya, you could certainly use your dataset as the basis of your simulations and create new Y's using the approach we've described. Is that what you mean by "simulate several logistic regression results"?

Unknown said...

Thanks, that was what I meant. What I am interested in is the probabilities of treatment assignment. Is it that I would have one set of probabilities representative of all the datasets or each dataset would have it's own set computed separately?

Nick Horton said...

You can use whatever probabilities you like: the simulation can be structured to track the scenario of interest.

Unknown said...

This is very helpful. Thank you for the post!
Is it possible to modify your program to simulate a case-control data with P(Y=1)=0.5? Many thanks!!

Emmanuel said...

This is very helpful and informative. I do I introduce error terms to the ytest?

Anonymous said...

How to simulate a binary response (obese or not obese) variable given a distribution of body weight?

Seanlove said...

Can someone help me? How do I repeat the simulation like 30 times having different datasets in one file for example
simulation id y x1 x2
1 1 1 0 18
1 2 0 1 20
1 3 0 0 24
1 4 1 1 28
1 5 0 1 40
2 1 1 1 44
2 2 0 0 25
2 3 1 1 38
2 4 1 1 39
2 5 1 0 41
3 1 1 1 43
3 2 1 0 45
3 3 1 0 43
3 4 0 0 41
3 5 0 1 40

Nick Horton said...

I'd suggest turning our code into a function then iterating over each of the values in your dataset.