SAS
In SAS, we do this within a data step. We define parameters for the model and use looping (section 1.11.1) to replicate the model scenario for random draws of standard normal covariate values (section 1.10.5), calculating the linear predictor for each and comparing the resulting expit (the predicted probability) against a random draw from a standard uniform distribution (section 1.10.3).
data test;
  intercept = 0;
  beta = .5;
  do i = 1 to 1000;
    /* standard normal covariate; the seed is used only on the first call */
    xtest = normal(12345);
    linpred = intercept + (xtest * beta);
    /* expit: predicted probability that y = 1 */
    prob = exp(linpred) / (1 + exp(linpred));
    /* ytest = 1 if a uniform draw falls below prob */
    ytest = uniform(0) lt prob;
    output;
  end;
run;
R
In R we begin by assigning parameter values for the model. We then generate 1,000 random normal variates (section 1.10.5), calculate the linear predictor and expit for each, and test vectorwise (section 1.11.2) against 1,000 random uniforms (section 1.10.3).
intercept = 0
beta = 0.5
# 1,000 normal covariate values with mean 1 and SD 1
xtest = rnorm(1000, 1, 1)
linpred = intercept + xtest * beta
# expit (inverse logit): predicted probability that y = 1
prob = exp(linpred) / (1 + exp(linpred))
runis = runif(1000, 0, 1)
# ytest is 1 when the uniform draw falls below prob
ytest = ifelse(runis < prob, 1, 0)
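As a quick sanity check (not part of the original post), one can fit a logistic regression to the simulated data and confirm that the estimates land near the true intercept (0) and slope (0.5). A minimal sketch, with an arbitrary seed for reproducibility:

```r
# sketch: regenerate the data and check that glm() roughly recovers
# intercept = 0 and beta = 0.5 (the seed value is an arbitrary choice)
set.seed(42)
xtest = rnorm(1000, 1, 1)
prob = exp(0 + 0.5 * xtest) / (1 + exp(0 + 0.5 * xtest))
ytest = ifelse(runif(1000) < prob, 1, 0)
fit = glm(ytest ~ xtest, family = binomial)
coef(fit)   # estimates should be near 0 and 0.5
```

With 1,000 observations the standard errors are small enough that the fitted coefficients should fall within a few hundredths of the true values on most runs.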
Comments:
Great post! I am confused by ytest = ifelse(runis < prob, 1, 0). What is the logic of this step?
If you are using this for simulation to determine sample size, would one experiment with the proper intercept (versus zero) to get the probability to average around the expected probability of success in the population?
Hi Brian--
ifelse() is a vectorwise logic test. It tests whether the ith element of the runis vector is less than the ith element of the prob vector. If it is, then the ith element of ytest will be 1; otherwise it will be 0.
The prob vector contains the probability of the outcome being 1, given the covariate value and intercept. The ifelse() is like flipping a coin with probability of heads specified by the prob vector.
For a sample size calculation via Monte Carlo methods, you would solve prob = exp(intercept)/(1 + exp(intercept)) with prob = the known proportion in the population when the covariate = 0. You also need to choose the covariate distribution that best mimics what you anticipate. Then simulate many times, fit a logistic regression model to each simulated dataset, and see what proportion of times the null is rejected. That's your power. Vary the sample size and beta until you achieve the power you need.
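The recipe in that reply can be sketched in R. This is a hypothetical illustration, not code from the post: the sample size (100), effect size (0.5), covariate distribution, and number of replicates (500) are all arbitrary choices.

```r
# Monte Carlo power sketch: simulate, fit, and count rejections
# (assumed values: n = 100, intercept = 0, beta = 0.5, 500 replicates)
set.seed(1)
n = 100; nsim = 500
pvals = replicate(nsim, {
  x = rnorm(n, 1, 1)
  p = exp(0 + 0.5 * x) / (1 + exp(0 + 0.5 * x))
  y = ifelse(runif(n) < p, 1, 0)
  # p-value for the slope from the fitted logistic regression
  summary(glm(y ~ x, family = binomial))$coefficients["x", 4]
})
mean(pvals < 0.05)   # estimated power at alpha = 0.05
```

To find a sample size, one would wrap this in a loop over candidate values of n and pick the smallest n whose estimated power exceeds the target.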
I see, that is excellent! The light bulb went on :)
I was really thrown by the comparison to the random uniform value. Could you also take a draw from a binomial distribution with p=prob instead?
Absolutely. I thought of that last night and wondered why I didn't write it that way to begin with. I think in R you'd have to use an apply function to allow a different p for each binomial. It would look something like:
y = sapply(prob, function(p) {rbinom(1,1,p)})
So the binomial part would be easier to understand, but the code might be less accessible.
If I am not mistaken, I think if p is a vector and rbinom's first argument is equal to the length of p, this will return a vector with each entry drawn using the value of p at that index.
rbinom(length(p),1,p)
Right you are. Very reasonable for it to work that way. (I didn't try to make it work that way, obviously, and the documentation doesn't clarify that p can be a vector.)
Using your code would be both easier to understand and more "the R way."
Mr Ken Kleinman
I used your post to generate logistic regression data. I want to generate x1, x2, and x3: x1 has 2 categories (binary), x2 has 3 categories, and x3 has a normal distribution.
When I generated 300 observations, it gave a good model. But when I generated 30, the parameters weren't significant. Can you tell me the reason?
With 30 observations, your power is very poor. That's the most likely issue. Try simulating and fitting your sample of 30 one thousand times. You should find each parameter with p < .05 more (just slightly more) than 5% of the time. As you increase the sample from 30 to 300, the proportion of rejections should increase-- that's your power increasing.
Hi,
Your program is very helpful, but I would like to impose a restriction on the proportion difference, like P(Y=1|X=1) - P(Y=1|X=0) = a. Is there any way to generate data controlled by this kind of proportion difference? Any suggestion would be pretty helpful. Thanks!
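One way to handle this when X is binary, sketched here as an illustration rather than part of the original post: choose P(Y=1|X=0), convert the target risk difference to logit-scale parameters, and then simulate as in the post. The values of p0, a, the group split, and the sample size below are all assumptions.

```r
# sketch: binary X with a fixed risk difference a (values are assumed)
p0 = 0.3; a = 0.2                   # P(Y=1|X=0) and the target difference
intercept = qlogis(p0)              # log(p0 / (1 - p0))
beta = qlogis(p0 + a) - intercept   # slope implied by P(Y=1|X=1) = p0 + a
set.seed(7)
x = rbinom(10000, 1, 0.5)           # binary covariate, half in each group
y = rbinom(10000, 1, plogis(intercept + beta * x))
mean(y[x == 1]) - mean(y[x == 0])   # should be close to a = 0.2
```

The same idea works for any link: fix the two conditional probabilities, then back out the intercept and slope on the linear-predictor scale.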
I'm interested in doing a post-hoc power test for a logistic regression, but I also have an interaction between continuous and categorical variables. Any suggestions on how to incorporate this into your code? Sorry, I realize this is several years after the original post, just hoping to get some advice as I've only been able to find information on assessing power for an interaction OR for a logistic regression, but not both. Thanks!
Post-hoc power assessment is fairly controversial and is frowned upon by many statisticians. But it's not hard to adapt our code to simulate your setting and then to assess power to detect an interaction. I'll write up a new blog post to address this shortly. Thanks for the question!
My primary reason for wanting to do post hoc power was to retain the relationships between the variables we have. I'm having trouble simulating variables and accounting for the relationships between them. I have one categorical variable and one continuous one, and I'm interested in looking at the interaction between them. Any suggestions?
Possibly. Please contact me privately via e-mail.
Mr Ken Kleinman
I used your post on how to generate data from a logistic regression in SAS. It's a very helpful post for new users. Following your approach, I generate 1,000 random numbers. Now I want to replicate these results 100 times; how can I do this? Any suggestions would be pretty helpful. Thanks!
Supposing I already have a dataset, can I use it to simulate several logistic regression results?
Safiya, you could certainly use your dataset as the basis of your simulations and create new Y's using the approach we've described. Is that what you mean by "simulate several logistic regression results"?
Thanks, that was what I meant. What I am interested in is the probabilities of treatment assignment. Would I have one set of probabilities representative of all the datasets, or would each dataset have its own set computed separately?
You can use whatever probabilities you like: the simulation can be structured to track the scenario of interest.
This is very helpful. Thank you for the post!
Is it possible to modify your program to simulate case-control data with P(Y=1) = 0.5? Many thanks!!
This is very helpful and informative. How do I introduce error terms to the ytest?
How to simulate a binary response (obese or not obese) variable given a distribution of body weight?
Can someone help me? How do I repeat the simulation, say, 30 times, keeping the different datasets in one file? For example:
simulation id y x1 x2
1 1 1 0 18
1 2 0 1 20
1 3 0 0 24
1 4 1 1 28
1 5 0 1 40
2 1 1 1 44
2 2 0 0 25
2 3 1 1 38
2 4 1 1 39
2 5 1 0 41
3 1 1 1 43
3 2 1 0 45
3 3 1 0 43
3 4 0 0 41
3 5 0 1 40
I'd suggest turning our code into a function, then iterating to generate each replicate dataset.
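That suggestion can be sketched as follows. This is illustrative only: the function name, the coefficients, the covariate distributions, and the 30-replicate count are all assumptions chosen to mimic the example layout above.

```r
# sketch: wrap the simulation in a function, then stack 30 replicates,
# each tagged with a simulation id as in the example table
simdata = function(simulation, n = 5) {
  x1 = rbinom(n, 1, 0.5)            # binary covariate
  x2 = round(runif(n, 18, 45))      # numeric covariate, range as in the example
  p = plogis(0 + 0.5 * x1 + 0.01 * x2)
  data.frame(simulation, id = 1:n, y = rbinom(n, 1, p), x1, x2)
}
set.seed(3)
alldata = do.call(rbind, lapply(1:30, simdata))
head(alldata)
```

The stacked data frame can then be written to a single file, or the models can be fit per replicate with, e.g., by() or split() on the simulation column.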