While there are a lot of other useful textbooks and references out there (e.g., R for Data Science, Practical Data Science with R, Intro to Data Science with Python), we saw a need for a book that incorporates statistical and computational thinking to solve real-world problems with data. The result was Modern Data Science with R, a comprehensive data science textbook for undergraduates that features meaty, real-world case studies integrated with modern data science methods. (Figure 8.2 above was taken from a case study in the supervised learning chapter.)
Part I (introduction to data science) motivates the book and provides an introduction to data visualization, data wrangling, and ethics. Part II (statistics and modeling) begins with fundamental concepts in statistics, then covers supervised learning, unsupervised learning, and simulation. Part III (topics in data science) reviews dynamic visualization, SQL, spatial data, text as data, network statistics, and moving towards big data. A series of appendices covers the mdsr package, an introduction to R, algorithmic thinking, reproducible analysis, multiple regression, and database creation.
We believe that several features of the book are distinctive:
- minimal prerequisites: while some background in statistics and computing is ideal, appendices provide an introduction to R, how to write a function, and key statistical topics such as multiple regression
- ethical considerations are raised early, to motivate later examples
- recent developments in the R ecosystem (e.g., RStudio and the tidyverse) are featured
This book is intended to help readers with some background in statistics and modest prior experience with coding develop and practice the skills needed to tackle complex data science projects. We've taught a variety of courses using it, ranging from an introduction to data science to a sophomore-level data science course, and we've also used it as one component of a senior capstone class.
We've made three chapters freely available for download: data wrangling I, data ethics, and an introduction to multiple regression. An instructor's solution manual is available, and we're working to create a series of lab activities (e.g., text as data). (The code to generate the above figure can be found in the supervised learning materials at http://mdsr-book.github.io/instructor.html.)
An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.
One point is that everything a good data scientist seemingly should know (statistics, computing including information technology, and the art of graphics) is what statisticians always knew we needed. The problem seems to be that those of us who learnt statistics in the eighties or earlier only ever got taught statistics and have learnt everything else as we went along. My belief is that a good background in statistics, together with basic programming skills, is the most important thing to have. What is worrying is that the university I'm at, like many others, has the computer science department offering the data science masters.
One point on your book. k-means should die. I know everyone teaches it, but mixtures of multivariate normals are so much more powerful and lead on to mixtures as a solution to other problems. The main problem is that understanding them requires a proper statistical background, but without it nobody understands k-means either.
I agree! I mistakenly chose to get my second master's in bioinformatics instead of biostatistics, and I'm very disappointed with my program. It is predicated on the assumption that you can teach people data science without any statistics. The whole field of "data science" seems to have this idea, and it will send us backward in time. It isn't leading to better research, just more research.
We very intentionally organized the material in the book to ensure that there is a solid foundation in statistics. This permeates the data viz and data wrangling chapters (which are intended to answer a statistical question), the foundations in statistics chapter (which reviews key statistical concepts), and the topics chapters (e.g., text as data, spatial, network statistics). Such an approach seems critically important to be able to "think with data": http://amstat.tandfonline.com/doi/full/10.1080/00031305.2015.1094283