tag:blogger.com,1999:blog-1275149608391671670.comments2017-10-20T22:26:10.967-04:00SAS and RKen Kleinmanhttp://www.blogger.com/profile/09525118721291529157noreply@blogger.comBlogger613125tag:blogger.com,1999:blog-1275149608391671670.post-3200695085341721052017-09-22T16:55:40.085-04:002017-09-22T16:55:40.085-04:00lm() is largely out of our control, and I don'...lm() is largely out of our control, and I don't think it is a good idea to write a replacement for lm().<br /><br />For numerical summaries, take a look at df_stats(). I think you will find it does what you want and interoperates well with ggformula. In particular, it always returns a tidy data frame (hence the d in df_stats). Here is an example:<br /><br /><br />require(mosaic)<br />HELPrct %>% filter(sex == "male") %>% df_stats(age ~ substance, mean, median)<br />## substance mean_age median_age<br />## 1 alcohol 37.95035 38.0<br />## 2 cocaine 34.36036 33.0<br />## 3 heroin 33.05319 32.5Randall Pruimhttps://www.blogger.com/profile/07500805842903136651noreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-23874489521891099772017-09-22T16:10:45.837-04:002017-09-22T16:10:45.837-04:00This is awesome! I just wish I knew about this a ...This is awesome! I just wish I knew about this a week ago (just spent a week teaching my intro stats class a messy conglomerate of tidyverse and mosaic for plotting).<br /><br />I love that the gf_ functions work with the %>% operator! Unfortunately, other formula using methods from base (like lm) or mosaic (like tally or favstats) do not work with the operator, and need data = ., which can potentially be rather confusing.Janhttps://www.blogger.com/profile/06073872742931383080noreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-66112573489488637002017-07-31T10:18:41.982-04:002017-07-31T10:18:41.982-04:00Our goal, consistent with the revised GAISE Colleg...Our goal, consistent with the revised GAISE College report (https://arxiv.org/abs/1705.09530) is to integrate the teaching of key statistical concepts with the use of appropriate technology. <br /><br />It's certainly possible to teach research methodology (e.g., addressing confounding using multiple regression) without software. I think that having students use real-tools with a straightforward interface can augment such instruction.<br />Nick Hortonhttps://www.blogger.com/profile/00242216324355342047noreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-33963424320220172142017-07-31T10:08:57.895-04:002017-07-31T10:08:57.895-04:00We very intentionally organized the material in th...We very intentionally organized the material in the book to ensure that there is a solid foundation in statistics. This permeates the data viz and data wrangling chapters (which are intended to answer a statistical question), the foundations in statistics chapter (which reviews key statistical concepts), and the topics chapters (e.g. text as data, spatial, network statistics). Such an approach seems critically important to be able to "think with data": http://amstat.tandfonline.com/doi/full/10.1080/00031305.2015.1094283Nick Hortonhttps://www.blogger.com/profile/00242216324355342047noreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-52988935120908356832017-07-31T09:38:01.483-04:002017-07-31T09:38:01.483-04:00I agree! I mistakingly chose to get my second Mas...I agree! I mistakingly chose to get my second Master in bioinformatics instead of biostatistics, and I'm very disappointed with my program. It is predicated on the assumption that you can teach people data science without any statistics. The whole field of "data science" seems to have this idea, and it will send us backward in time. It isn't leading to better research, just more research.Jessicahttps://www.blogger.com/profile/01125948286235962017noreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-28766309650373788622017-07-31T03:18:18.811-04:002017-07-31T03:18:18.811-04:00Do not teach beginners how to use R statistics. Te...Do not teach beginners how to use R statistics. Teach them proper research methodology. They will learn by themselves how to use any statistical software including R. That is power in learning. They will never forget.gatara timothyhttps://www.blogger.com/profile/04946108944968998288noreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-39790301932722802162017-07-28T09:59:00.518-04:002017-07-28T09:59:00.518-04:00The treatment of missing data is not really a matt...The treatment of missing data is not really a matter of whether formulas are used but of the particular function used. One difference between R and SAS is that R is written by a community and SAS by a company. So it is easier for SAS to enforce stronger consistency across functions. Naming conventions (camelCase, dots, underscores, etc.) are also inconsistent across R. (The reason functions like lm() discard missing data by default is that they use model.frame() which has this as its default.)<br /><br />The reason mosaic doesn't change the default behavior with regard to missing values is that we decided early on that our versions of the functions should behave just like the originals in cases where the original functions produced sensible output. (It would be bad if mean() gave a different answer depending on whether mosaic was or was not loaded.) The user can, as you note, set the default behavior actively using options(). That seemed like a good compromise.<br /><br />For functions like favstats(), which we could write without worrying about compatibility with core R functions, we were free to do other things. In this case, we compute statistics after dropping missing values and also display the number of values that were missing.<br /><br />Regarding t.test(pre, post, paired = TRUE, data = mydata), I have received numerous requests for things like this. (Usually the request is along the lines of mean(x, data = mydata). And for a time (and against my better judgment) we supported this. But it was a bad idea for several reasons. For starters, the code is ambiguous if x exists both in the environment and in mydata. It also makes constructing the functions and keeping them compatible with their core R counterparts much more challenging and led to some subtle bugs. In R, a formula is the correct way to designate a name to be evaluated in a special way. Finally, it was somewhat confusing that some functions accepted the "bare variables + data" syntax and others required formulas. It is more systematic if they all require the formula.<br /><br />The bigger inconsistency -- that t.test(y ~ x) worked by t.test( ~ x) did not -- we fixed in mosaic.<br /><br /><br />In general, the problem with nonstandard evaluation is that it is nonstandard, so it can be hard to predict behavior. There are times where NSE is very useful, but it works best when operating within a well-defined system where the NSE can be correctly anticipated by the user. The use of formulas in lattice, ggformula, mosaic, and in functions like lm() at its cousins provides an established context for evaluation of formulas with a data context.<br />Randall Pruimhttps://www.blogger.com/profile/07500805842903136651noreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-54888502585821656842017-07-28T09:01:57.734-04:002017-07-28T09:01:57.734-04:00Thanks for writing this thought-proving article. I...Thanks for writing this thought-proving article. I teach a lot of workshops for organizations that are migrating from SAS to R, and one of the things that confuses people is R's inconsistent treatment of missing values. Simple stat functions require setting na.rm = TRUE while formula-based ones don't. Using the mosaic functions, you can set options(na.rm = TRUE) and from then on, its simple functions will find that setting and work more like formula-based ones. Since mosaic is nice enough to add formulas to simple stat functions, it would be nice for that to be the default. <br /><br />Another inconsistency that I would like to see mosaic fix is that the data argument works only with formulas. So this finds the variables: t.test(y ~ group, data = mydata) while this does not: t.test(pre, post, paired = TRUE, data = mydata).Bob Muenchenhttps://www.blogger.com/profile/14224906531398701275noreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-13309438569933827332017-07-28T07:54:12.714-04:002017-07-28T07:54:12.714-04:00While you can certainly begin with base graphics a...While you can certainly begin with base graphics and the plot() function, that only seems like a reasonable solution to me if you want to continue with base graphics throughout the course (which I don't). The different graphics systems don't play well together, so I find it best to pick one and stick with it. If you want to use lattice or ggformula, I'd suggest starting there and avoiding base graphics altogether.<br /><br />Note too, that plot() is a bit quirky in its choices. plot(~price, data = diamonds) is not a very good choice, for example. A student should expect something much better. On balance, I don't find plot() or qplot() from ggplot2 compelling. I generally want to get beyond what these provide, and I don't find teaching lattice or ggformula to be challenging without such a function.Randall Pruimhttps://www.blogger.com/profile/07500805842903136651noreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-51392486450443399082017-07-28T07:21:06.579-04:002017-07-28T07:21:06.579-04:00An additional feature of the mosaic package is the...An additional feature of the mosaic package is the multi-purpose mplot() function available within RStudio. <br /><br />If you provide a linear model object as argument, it allows you to generate the typical regression diagnostics (including a regression coefficient plot). <br /><br />If you provide a dataframe as argument, you get an interactive data visualizer that lets you explore univariate, bivariate, and multivariate graphical displays (see http://escholarship.org/uc/item/84v3774z for an example of how we incorporate this on day one of an introductory statistics course).<br /><br />in RStudio, try running:<br /><br />library(mosaic)<br />mplot(HELPrct)<br /><br />The "Show Expression" feature is particularly useful: it's an easy way to see the syntax to generate the selected plot using lattice, ggplot2, or ggformula.<br />Nick Hortonhttps://www.blogger.com/profile/00242216324355342047noreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-41662107859163601802017-07-28T06:18:53.747-04:002017-07-28T06:18:53.747-04:00Thanks for this post. I agree that strengthening t...Thanks for this post. I agree that strengthening the teaching of formula-based functions is a good idea and easy to learn for beginners. One useful R feature that is often overlooked in this context is that plot(y ~ x, data = df) chooses a suitable plot for various combinations of y and x. Of course, if both variables are numeric, this creates a scatter plot. For numeric "response" and categorical "explanatory variable" we get parallel boxplots. And if the response is categorical we get a spineplot or spinogram, respectively. For many data sets that are relevant to our students (business & economics) this goes quite a long way. And from that point onwards I can teach what kind of principles - and corresponding R functions/packages - can be used to construct more complex displays etc.Achim Zeileishttps://eeecon.uibk.ac.at/~zeileis/noreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-8217944099962425032017-07-28T06:06:48.117-04:002017-07-28T06:06:48.117-04:00One point is that everything that it seems a good ...One point is that everything that it seems a good data scientist should know; statistics, computing including information technology and the art of graphics is what statisticians always knew we needed. The problem seems to be that those of us who learnt statistics in the eighties or earlier only ever got taught statistics and have learnt everything else as we went along. My belief is that a good background in statistics and basic programming skills is the most important thing to know. What is worrying is that the university I'm at, like many others, has the computer science department offering the data science masters.<br /><br />One point on your book. k-means should die. I know everyone teaches it, but mixtures of multivariate normals is so much more powerful and leads on to mixtures as solution to other problems. The main problem is understanding it requires a proper statistical background, but without it nobody understands k-means either.Ken Bushwalkerhttps://www.blogger.com/profile/00356020862476667328noreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-79082258108140172242017-07-27T17:39:18.023-04:002017-07-27T17:39:18.023-04:00I'm not sure what you mean by "violate th...I'm not sure what you mean by "violate the R convention on interpreting formulas". Isn't mean(Y ~ X) equivalent in meaning to the t.test(Y ~ X) and aov(Y ~ X)? <br /><br />Note that mosaic does support the syntax you describe for summary statistics for the aggregating functions:<br /><br />mean(~ Y | X)<br />favstats(~ Y | X)<br /><br />along with<br /><br />mean(~ Y, group=X)<br />favstats(~ Y, group=X)<br /><br />> mean(~ age | substance, data=HELPrct)<br />alcohol cocaine heroin <br /> 38.2 34.5 33.4 <br />> mean(~ age, group=substance, data=HELPrct)<br />alcohol cocaine heroin <br /> 38.2 34.5 33.4 <br />> mean(age ~ substance, data=HELPrct)<br />alcohol cocaine heroin <br /> 38.2 34.5 33.4 <br /><br />Nick Hortonhttps://www.blogger.com/profile/00242216324355342047noreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-69837253831501845132017-07-27T16:47:35.627-04:002017-07-27T16:47:35.627-04:00I have encountered this dilemma (trilemma?) as wel...I have encountered this dilemma (trilemma?) as well in teaching R. While mosaic looks like a great package, I have to complain about the mean(y ~ x, data = DF) syntax. That seems to violate the R convention on interpreting formulas. Mosaic should have limited it to mean(~ y | x, data = DF). Otherwise I could not give a consistent explanation for how formulas work in R.THKhttps://www.blogger.com/profile/17577107837780164044noreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-52693687321958087172017-07-04T11:21:17.884-04:002017-07-04T11:21:17.884-04:00I have a question that is and R question and a st...I have a question that is and R question and a statistical question:<br />I am analysing sales of a retailer. These sales are related to some vars: var1, var2, var3.., varN<br />Most of the vars are continuos. <br />I want to analyze the relationship between sales and the vars. I have made a linear regression with R:<br /><br />rg<-lm(sales ~ var1 + var2 + var3 + var4, data=sales_2017)<br />summary(rg)<br /><br />Now I want to know which is the most important variable in sales, and to know the percent of importance of each var. I am doing this (caret package):<br /><br />varImp(rg, scale = FALSE)<br />rsimp <- varImp(rg, scale = FALSE)<br />plot(rsimp)<br /><br />Is this a good method to obtain variables importance??, is good way in R?<br />Thanks in advance. Any advice will be greatly apreciated.<br /><br />JuanJuan V.https://www.blogger.com/profile/11172645377583674938noreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-53208635719295031732017-06-20T18:37:31.012-04:002017-06-20T18:37:31.012-04:00I've found Sonja Swanson's excellent paper...I've found Sonja Swanson's excellent paper to be helpful with those questions:<br /><br />https://www.ncbi.nlm.nih.gov/pubmed/21882219<br /><br />A Monte Carlo investigation of factors influencing latent class analysis: an application to eating disorder research.<br /><br />Swanson SA1, Lindenberg K, Bauer S, Crosby RD.<br />Author information<br />Abstract<br />OBJECTIVE:<br />Latent class analysis (LCA) has frequently been used to identify qualitatively distinct phenotypes of disordered eating. However, little consideration has been given to methodological factors that may influence the accuracy of these results.<br />METHOD:<br />Monte Carlo simulations were used to evaluate methodological factors that may influence the accuracy of LCA under scenarios similar to those seen in previous eating disorder research.<br />RESULTS:<br />Under these scenarios, the aBIC provided the best overall performance as an information criterion, requiring sample sizes of 300 in both balanced and unbalanced structures to achieve accuracy proportions of at least 80%. The BIC and cAIC required larger samples to achieve comparable performance, while the AIC performed poorly universally in comparison. Accuracy generally was lower with unbalanced classes, fewer indicators, greater or nonrandom missing data, conditional independence assumption violations, and lower base rates of indicator endorsement.<br />DISCUSSION:<br />These results provide critical information for interpreting previous LCA research and designing future classification studies.Nick Hortonhttps://www.blogger.com/profile/00242216324355342047noreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-68764514120417604042017-06-20T14:06:58.328-04:002017-06-20T14:06:58.328-04:00Hi
I wonder how I can compare the fit statistics w...Hi<br />I wonder how I can compare the fit statistics when I have more than 25 variables. which one is more reliable (log liklihood, G-square, AIC, BIC, CAIC, Adjusted BIC or Entropy)? <br />Thanks by advanceAnonymousnoreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-9879021711331272442017-06-09T10:07:06.704-04:002017-06-09T10:07:06.704-04:00Nope, the coding doesn't matter. If you make ...Nope, the coding doesn't matter. If you make the switch this will just cause all of your parameter estimates to flip sign. (I'd encourage you to try this on a minimally reproducible example.)<br /><br />NickNick Hortonhttps://www.blogger.com/profile/00242216324355342047noreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-41886097893528899372017-06-08T15:38:28.349-04:002017-06-08T15:38:28.349-04:00Hi,
I am conducting an LCA and am wondering if ho...Hi,<br /><br />I am conducting an LCA and am wondering if how you code your binary indicators matters? For example 1=yes and 2=no verses 1=no and 2=yes. <br />Jenni Millerhttps://www.blogger.com/profile/12912475051931063784noreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-16170092488322615112017-05-26T10:31:58.667-04:002017-05-26T10:31:58.667-04:00Hi Isabella--
This is a good question. In genera...Hi Isabella--<br /><br />This is a good question. In general in R, you can use relevel() to change the reference category for a factor variable. But it doesn't seem to work for the varIdent() function used here!<br /><br />For example, in the code below, the variances are identical. I think to do what you want, you might have to recode the factor manually!<br /><br />milk$mc4 = relevel(milk$mc, ref=4)<br /><br />mod = gls(value~mc, data=milk, weights = varIdent(form = ~1|mc), method="ML")<br />mod_ref4 = gls(value~mc4, data=milk, weights = varIdent(form = ~1|mc4), method="ML")<br /><br />mod$modelStruct$varStruct<br />mod_ref4$modelStruct$varStruct<br /><br /><br />Ken Kleinmanhttps://www.blogger.com/profile/09525118721291529157noreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-20213210247869089682017-05-25T23:39:55.779-04:002017-05-25T23:39:55.779-04:00Hi Ken,
Is it possible to control the reference ...Hi Ken, <br /><br />Is it possible to control the reference level of mc in the formula weights = varIdent(form = ~1|mc) so as to force R to construct and report ratios of variances which reflect the choice of reference level? <br /><br />Thanks,<br /><br />IsabellaIsabella R. Ghement, Ph.D.https://www.blogger.com/profile/12764806870780813453noreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-69018454721884766032017-04-23T09:50:11.461-04:002017-04-23T09:50:11.461-04:00Fixed! Thanks for pointing this out.Fixed! Thanks for pointing this out.Nick Hortonhttps://www.blogger.com/profile/00242216324355342047noreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-56612118008807804822017-04-20T09:44:28.892-04:002017-04-20T09:44:28.892-04:00in the R example you forgot to define the data.fra...in the R example you forgot to define the data.frame "ds" before to use survfit(...)Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-9038513719795636012017-03-23T08:19:24.781-04:002017-03-23T08:19:24.781-04:00Hi,
Thanks for this post, but what would you do ...Hi, <br /><br />Thanks for this post, but what would you do if you have age as the time scale? I have data set up as a single individual per row with the model statment written as: <br />(age_in, age_out)*no_deaths (0)=drug_type<br /><br />How do you assess proportional hazards in this case? I can't put in (age_out-age_in) as the x in the sgplot command... <br /><br />Thanks!<br /><br />March 23, 2017Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-1275149608391671670.post-27173556732360626252017-03-23T08:18:21.824-04:002017-03-23T08:18:21.824-04:00This comment has been removed by the author.Dana Cullenhttps://www.blogger.com/profile/18393628322054247472noreply@blogger.com