I've found Sonja Swanson's excellent paper to be helpful with those questions:

https://www.ncbi.nlm.nih.gov/pubmed/21882219

A Monte Carlo investigation of factors influencing latent class analysis: an application to eating disorder research.

Swanson SA, Lindenberg K, Bauer S, Crosby RD.
Abstract
OBJECTIVE:
Latent class analysis (LCA) has frequently been used to identify qualitatively distinct phenotypes of disordered eating. However, little consideration has been given to methodological factors that may influence the accuracy of these results.
METHOD:
Monte Carlo simulations were used to evaluate methodological factors that may influence the accuracy of LCA under scenarios similar to those seen in previous eating disorder research.
RESULTS:
Under these scenarios, the aBIC provided the best overall performance as an information criterion, requiring sample sizes of 300 in both balanced and unbalanced structures to achieve accuracy proportions of at least 80%. The BIC and cAIC required larger samples to achieve comparable performance, while the AIC performed poorly universally in comparison. Accuracy generally was lower with unbalanced classes, fewer indicators, greater or nonrandom missing data, conditional independence assumption violations, and lower base rates of indicator endorsement.
DISCUSSION:
These results provide critical information for interpreting previous LCA research and designing future classification studies.
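
For reference, the information criteria compared in the abstract can all be computed from a model's maximized log-likelihood. The sketch below uses the commonly cited definitions (the sample-size-adjusted BIC replaces n with (n + 2)/24). The function name info_criteria and the example inputs are made up for illustration and do not come from the paper.

```r
# Information criteria from a fitted model's log-likelihood.
# loglik: maximized log-likelihood; p: number of free parameters; n: sample size.
# info_criteria() is a hypothetical helper, not part of any package.
info_criteria <- function(loglik, p, n) {
  dev <- -2 * loglik
  c(AIC  = dev + 2 * p,
    BIC  = dev + p * log(n),
    CAIC = dev + p * (log(n) + 1),
    aBIC = dev + p * log((n + 2) / 24))
}

# Made-up example values; when comparing candidate models, smaller is better.
ic <- info_criteria(loglik = -1000, p = 10, n = 300)
print(round(ic, 2))
```

You would call this once per candidate model (for example, the 2-, 3-, and 4-class solutions) and compare the values across models.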
Hi
I wonder how I can compare the fit statistics when I have more than 25 variables. Which one is more reliable (log likelihood, G-squared, AIC, BIC, CAIC, adjusted BIC, or entropy)?
Thanks in advance

Nope, the coding doesn't matter. If you make the switch this will just cause all of your parameter estimates to flip sign. (I'd encourage you to try this on a minimally reproducible example.)
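
Here is one way to see the sign flip in a minimal example. It is only a sketch: it uses an intercept-only logistic regression on a single simulated binary indicator rather than a full LCA, on the assumption that your LCA software reports item parameters on the logit scale. The data and variable names are made up, and the indicator is coded 0/1 rather than 1/2.

```r
# Simulated binary indicator (made-up data): 1 = yes, 0 = no.
set.seed(42)
yes_coded <- rbinom(200, size = 1, prob = 0.3)
# Same data with the coding switched: 1 = no, 0 = yes.
no_coded <- 1 - yes_coded

# Intercept-only logistic regressions: the intercept is the logit of P(indicator = 1).
b_yes <- coef(glm(yes_coded ~ 1, family = binomial))
b_no  <- coef(glm(no_coded ~ 1, family = binomial))

# Same magnitude, opposite sign.
print(c(yes_coded = unname(b_yes), no_coded = unname(b_no)))
```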

Nick
Hi,

I am conducting an LCA and am wondering whether the way you code your binary indicators matters. For example, 1=yes and 2=no versus 1=no and 2=yes.
Hi Isabella--

This is a good question. In general in R, you can use relevel() to change the reference category for a factor variable. But it doesn't seem to work for the varIdent() function used here!

For example, in the code below, the variances are identical. I think to do what you want, you might have to recode the factor manually!

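library(nlme)  # provides gls() and varIdent()

# Make the 4th level of the factor mc the reference category
milk$mc4 = relevel(milk$mc, ref=4)

mod = gls(value~mc, data=milk, weights = varIdent(form = ~1|mc), method="ML")
mod_ref4 = gls(value~mc4, data=milk, weights = varIdent(form = ~1|mc4), method="ML")

# Compare the variance structures estimated by the two fits
mod$modelStruct$varStruct
mod_ref4$modelStruct$varStruct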
Hi Ken, 

Is it possible to control the reference level of mc in the formula weights = varIdent(form = ~1|mc) so as to force R to construct and report ratios of variances which reflect the choice of reference level? 

Thanks,

Isabella

Fixed! Thanks for pointing this out.

In the R example, you forgot to define the data frame "ds" before using survfit(...).
Hi, 

Thanks for this post, but what would you do if you have age as the time scale? I have data set up as a single individual per row with the model statement written as:
(age_in, age_out)*no_deaths (0)=drug_type

How do you assess proportional hazards in this case? I can't put in (age_out-age_in) as the x in the sgplot command... 

Thanks!

March 23, 2017
Hello everybody,

I am conducting an LTA analysis and, in my case, the % of seeds in the best fitting model is 48%. I have seen that your example has 40% seeds. What is the minimum % needed for the model to be identified?
Thank you in advance :)

best wishes,

How to simulate Cure Rate Models in R?

The abline() function adds a straight line to an existing plot, so you can use it to draw the OLS regression line. Just add

abline(coef = coef(lm(y~x)))

after the plot() function in the existing code.
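
As a self-contained sketch of that idea (using simulated data and made-up variable names, not the data from the question):

```r
# Simulated data standing in for the real x and y.
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)

plot(x, y)
# Overlay the OLS fit; coef() returns c(intercept, slope), which is what abline() expects.
ols <- coef(lm(y ~ x))
abline(coef = ols, col = "red")
```

Inside scatterhist(), the abline() call has to come directly after plot(x, y) and before the next par() and barplot() calls, so that the line is drawn in the scatterplot panel and not in one of the histogram panels.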
Hello community,
I am using Ken's R code to generate the graph and it works.
I would now like to ask what I need to add to this code to get a regression line.
My code is:
scatterhist = function(x, y, xlab="1", ylab="2"){
 zones=matrix(c(2,0,1,3), ncol=2, byrow=TRUE)
 layout(zones, widths=c(4/5,1/5), heights=c(1/5,4/5))
 xhist = hist(x, plot=FALSE)
 yhist = hist(y, plot=FALSE)
 top = max(c(xhist$counts, yhist$counts))
 par(mar=c(3,3,1,1))
 plot(x,y)
 par(mar=c(0,3,1,1))
 barplot(xhist$counts, axes=FALSE, ylim=c(0, top), space=0)
 par(mar=c(3,0,1,1))
 barplot(yhist$counts, axes=FALSE, xlim=c(0, top), space=0, horiz=TRUE)
 par(oma=c(3,3,0,0))
 mtext(xlab, side=1, line=1, outer=TRUE, adj=0, 
 at=.8 * (mean(x) - min(x))/(max(x)-min(x)))
 mtext(ylab, side=2, line=1, outer=TRUE, adj=0, 
 at=(.8 * (mean(y) - min(y))/(max(y) - min(y))))
}
ds = read.csv("popAhohenheim.csv", header = TRUE, sep = ";", dec = ",", na.strings = "*", stringsAsFactors = FALSE)
with(ds, scatterhist(AES, Flowering, xlab="AES", ylab="Flowering"))


Best wishes

Thanks for this amazing post. I managed to get it up and running but hit a snag. To enable ports, you now need to add "Custom", "-1" and "0 - 65535" under Application, Protocol, and Port range respectively. It seems that Amazon Lightsail just tweaked something at their end.

Thanks for this. Helpful for someone like me just beginning with R.

Off topic but I was wondering if you could recommend any training providers in London. I work for an I Bank and so can get training budget for a couple of days of dedicated training. Just wondering if you had any thoughts?

The latter was a great suggestion. I was actually able to embed it into a DESeq2 analysis, co-opting the way that heatmaps handle outlier issues and applying it to this. Thanks again.

Simply superb article, thank you.
Hi Justin--

My first thought would be to handle this on a case-by-case basis, meaning to arbitrarily remove the large values by hand before plotting the data.

But it would be an interesting exercise to construct a function to detect range issues like this. You could also embed the R code in a function and include an option to trim the n largest values before plotting.
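
A minimal sketch of that second idea, assuming the data are in a data frame with the bubble size in a numeric column. The helper name trim_largest and the example columns are hypothetical.

```r
# Drop the n rows with the largest values of column zvar before plotting.
# trim_largest() is a hypothetical helper, not part of any package.
trim_largest <- function(df, zvar, n = 1) {
  n <- min(n, nrow(df))
  if (n < 1) return(df)
  drop <- order(df[[zvar]], decreasing = TRUE)[seq_len(n)]
  df[-drop, , drop = FALSE]
}

# Made-up example: two values dwarf the rest.
d <- data.frame(x = 1:6, y = c(2, 4, 3, 5, 1, 6), z = c(1, 5, 100, 3, 250, 2))
trimmed <- trim_largest(d, "z", n = 2)
symbols(trimmed$x, trimmed$y, circles = sqrt(trimmed$z), inches = 0.3)
```

Returning the trimmed data frame, instead of plotting inside the helper, leaves the choice of plotting code open and makes it easy to check which rows were removed.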

This is great, thanks. Is there a way to restrict the Z value to limit outliers? All of my points are "significant" but even after log transforming I still have one or two points that are much larger than the others, dwarfing the majority of bubbles. 

Thanks!

Thanks for the R function. Is there a way to get the same table of observations that we get in SAS for the Hosmer-Lemeshow test?

Hi Edward-- I think this is easier/quicker. I only need to install one program in Linux (in Amazon-- none if I use Digital Ocean). And the different docker images have various packages pre-installed, which I think might save a lot of time for a casual user.

If all you want to do is host a personal RStudio Server in the cloud, why do you need Docker? Why not just stand up an Ubuntu 16.04 server instance, install r-base, r-base-dev, git, gdebi-core and RStudio Server?
Mr Ken Kleinman

I used your post on how to generate data from a logistic regression in SAS. It is a very helpful post for new users. Following it, I generated 1000 random numbers. Now I want to replicate these results 100 times. How can I do this? Any suggestions would be very helpful. Thanks