R for Data Science (had.co.nz)
230 points by hadley on Sept 16, 2016 | 70 comments



I'm the author, and I'm happy to answer any questions.

The book should be in print by (hopefully) the end of this year, or definitely by Jan 2017. The content will not change significantly, but there will be minor fixes and a lot of proofreading.


Great book, I'm getting a lot out of the site and I'm looking forward to the release. Thanks!

I understand there is always one more library or topic that could be included...

... but with that acknowledged, what do you think of sqldf as an alternative to dplyr? You mention that dplyr is a bit easier (within the context of being specialized for data analysis). I'd have trouble weighing in because I don't use R all that much, but I do really like the python "equivalent" pandasql.

Also, I've used SQL for a long time, so I'd have trouble at this point really knowing what's "easier" for someone new to both, but I do often find it easier to use SQL than do data frame operations in pandas. dplyr seems to be a closer cousin to standard SQL, so the difference might not be quite as great.


I wondered something similar about sqldf, because at one point my brain just seemed to work better "in" SQL.

The biggest issue I found was that sqldf was significantly slower than dplyr and other alternatives.

I started trying to mess about with something I was calling sqldf2. Didn't get very far, but there is some perhaps somewhat useful benchmarking in the R script here:

https://github.com/phillc73/sqldf2/blob/master/R/sqldf2.R


I have a related technical question. Why couldn't something highly embeddable like SQLite be the default underlying implementation for a data frame in something like Python or R? It seems like Pandas and R data frames have a great deal of redundant functionality.

SQLite seems like it has the guts to be the standard libdataframe.c for R, Python, Julia, etc. As a side benefit it already has a super consistent API (a.k.a. SQL).


Because it's designed to support typical relational DB workloads (i.e. lots of changes), not data analysis workloads. Data frames in R, pandas, etc. are column-oriented, which leads to better trade-offs for analysis.

Also SQL is a substantially inferior API for data analysis. (Not because it's a bad language, but again because that's not what it's designed for)
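
The row-versus-column point above can be made concrete with a rough sketch. This is only an illustration in plain Python, not how SQLite, R, or pandas actually store data; the flight records and the function names are hypothetical.

```python
# Hypothetical flight records, used only for illustration.

# Row-oriented layout: one record per entry, so adding or changing a
# whole record touches a single place.
rows = [
    {"carrier": "AA", "dep_delay": 5, "distance": 733},
    {"carrier": "UA", "dep_delay": -2, "distance": 1400},
    {"carrier": "AA", "dep_delay": 12, "distance": 1089},
]

# Column-oriented layout: one sequence per variable.
columns = {
    "carrier": ["AA", "UA", "AA"],
    "dep_delay": [5, -2, 12],
    "distance": [733, 1400, 1089],
}


def mean_delay_rows(records):
    # Visits every record and picks one field out of each.
    return sum(r["dep_delay"] for r in records) / len(records)


def mean_delay_columns(cols):
    # Reads only the single column the analysis needs.
    values = cols["dep_delay"]
    return sum(values) / len(values)


def append_row(cols, record):
    # Adding one record to the column layout touches every column.
    for name, value in record.items():
        cols[name].append(value)
```

Both functions compute the same mean; the difference is which layout makes a whole-column summary cheap and which makes a single-record change cheap.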


I agree completely that SQL is not the language for the kind of data analysis you're discussing in this book - to me, the question is whether it's useful to do querying and filtering through SQL and data analysis through python and R on the resulting datasets. I think pretty much everything you've written here would continue to be useful if you used sqldf to generate data frames in R, but I don't know R well enough to be sure of that.

Because pandasql returns a data frame from a data frame (not sure if this is the case with R), I find it relatively easy to do data things with sql and data analysis with python. However, that's not a huge surprise since I've been using SQL for a while but don't know the pandas or R data frame syntax especially well.

I'm not sure why sqlite was chosen - could it have to do with the in-memory nature of dataframes? So far, my use of sql with data frames has been pretty generic, so I haven't bumped up against any implementation-specific SQL issues.
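
The "SQL for querying, host language for analysis" workflow described above can be sketched with Python's built-in sqlite3 module and an in-memory database. This is a minimal sketch, not how sqldf or pandasql are implemented; the table, its columns, and the sample rows are hypothetical.

```python
import sqlite3
import statistics

# Hypothetical flight records: (carrier, origin, dep_delay in minutes).
flights = [
    ("AA", "JFK", 5),
    ("UA", "EWR", -2),
    ("AA", "JFK", 12),
    ("DL", "LGA", 30),
    ("UA", "JFK", 7),
]


def query_jfk_delays(records):
    """Load the records into an in-memory SQLite table and filter with SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE flights (carrier TEXT, origin TEXT, dep_delay INTEGER)"
        )
        conn.executemany("INSERT INTO flights VALUES (?, ?, ?)", records)
        cur = conn.execute(
            "SELECT dep_delay FROM flights WHERE origin = ? ORDER BY dep_delay",
            ("JFK",),
        )
        return [row[0] for row in cur.fetchall()]
    finally:
        conn.close()


# The filtering happens in SQL; the analysis happens in the host language.
jfk_delays = query_jfk_delays(flights)
median_delay = statistics.median(jfk_delays)
```

The query result comes back as an ordinary Python list, so anything downstream (summaries, plots, models) does not need to know SQL was involved.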


I think teaching multiple languages would make life much harder for new learners.

Also window functions are really useful for data analysis, and they are much easier to express in dplyr than they are in SQL (at the cost of being slightly less general).
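
For readers who have not met window functions, here is a rough sketch of what one computes: a per-group running total, roughly what `SUM(dep_delay) OVER (PARTITION BY carrier ORDER BY day)` expresses in SQL. It is written in plain Python purely for illustration; the records and the function name are hypothetical, and it does not reproduce dplyr's or any database's implementation.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records: (carrier, day, dep_delay).
records = [
    ("AA", 1, 5),
    ("UA", 1, -2),
    ("AA", 2, 12),
    ("UA", 2, 7),
    ("AA", 3, 0),
]


def with_running_total(rows):
    """Append a per-carrier running total of dep_delay, ordered by day."""
    # Sort by carrier, then day, so each group is contiguous and ordered.
    ordered = sorted(rows, key=itemgetter(0, 1))
    out = []
    for carrier, group in groupby(ordered, key=itemgetter(0)):
        total = 0
        for _, day, delay in group:
            total += delay
            # Unlike an aggregate, every input row is kept in the output.
            out.append((carrier, day, delay, total))
    return out


result = with_running_total(records)
```

The key property is that the output has one row per input row, with the extra column computed over that row's group, which is what distinguishes a window function from a grouped summary.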


Thank you for the reply--I immediately started Googling for more info about column-oriented data stores to see if there was something analogous to SQLite in this space. It looks like there's an embedded MonetDBLite package for R now that I'll need to check out.


If you don't already know SQL or dplyr, I think you would find dplyr significantly easier to learn. Some people who do know SQL well have commented that they too find dplyr easier. I think this is because the scope of dplyr is much smaller than SQL and it is designed specifically to facilitate data analysis.


Trivial, self-serving question: is there a library for generating the diagram of table relationships here (13.2 nycflights13) http://r4ds.had.co.nz/relational-data.html

And of course, thanks for another great book, it's helpful for learning R but I'm always enlightened by how thoroughly you explain the general concepts (e.g. Relational data and joins). Have heard a few people on faculty speak enthusiastically about the book even as I hold out for more adoption of Python :)


It won't be as pretty, but I deal with large data models all the time, and like to use SchemaSpy [1] which generates an interactive page in HTML and can be used on the command line (I guess you could always modify the CSS to make it pretty). It's literally one of the most useful tools in my life, and the output is good enough to show to clients.

If I'm designing a DB or even just an SQL example, I'll run the code on my local machine (psql + the Postgres app [2]) or if I'm lucky, the client already has a server running Postgres and I can run it there instead. All SchemaSpy then needs is access to the DB and voila, interactive example.

[1] http://schemaspy.sourceforge.net/

[2] http://postgresapp.com/


Lucidchart has the ability to generate SQL for you to run and it'll generate a schema for you. I used it to figure out the schema for a particularly poorly designed DB I had to get data out of.


No, I wish there was. Those were painstakingly drawn by hand.



As far as I recall, the MySQL Workbench has a pretty useful ER generation tool built in.


Hey Hadley. Huge fan of your work! Many of the libraries you have authored or co-authored have had a big influence on how I think about building tools. I look forward to getting a hard copy of the book!

I have a bit of a nitpick about chapter 13 on "relational data", in which I believe you are consistently misusing the technical term "relation" to refer to the relationship between two data sets. In the context of relational database theory, "relation" is just another word for "table" (although it connotes more mathematical formalism).

I think it is worth respecting the precise technical usage in this case: Consider a student who might read your book and be told that "relations are always defined between a pair of tables," and that "a primary key and the corresponding foreign key in another table form a relation." The same student might also stumble across the wikipedia page for the relational model and learn that "a relation is defined as a set of tuples that have the same attributes," and that the relational model "organizes data into one or more tables (or relations)."


Technically, the relation (value) is the set of tuples (i.e. the data itself, "the true statements", the rows in the table), the relation variable is what is defined by CREATE TABLE (i.e. how the data is constrained, "what can be true") and what most people are trying to model.

The relational data model - the set of relation variables - is thus the "equation that defines what is going on in your company" and what the DBMS puts in - the set of relation values, usually abbreviated to relations - can be seen as "the history of what happened at your company".


That is technically correct (the best kind of correct ;) but I think for newcomers it gives a better sense of the spirit of relational data.


I'm a software engineer who is already quite comfortable with Python and has more of an interest in machine learning than data science (as I understand it), is there any reason for me to learn R?


I'm in bioinformatics. R is huge, Python is a distant second. So I am forced to use both.

R is a terrible programming language. It's slow and syntactically inconsistent. For interactive statistical analysis it can be OK, but anything beyond a small program becomes unmanageable quickly. In particular, if you want to manipulate strings or hierarchical data structures quickly in R, good luck.

For ML, scikit-learn is almost always sufficient. For statistics, statsmodels is OK but very underdeveloped compared to what is available in R. IMO plotting is equally painful in both Python and R (seaborn is a good Python library to ease the pain if you haven't seen it).

Personally, I write everything in Python and call out to R as infrequently as possible using rpy2, usually only for specific statistical routines or bioinformatics-specific libraries.


Probably not any strong reasons. That said, if you're a software engineer, you shouldn't find it too hard to pick up enough R to be useful. You might enjoy <http://adv-r.had.co.nz> which describes R from more of a programming language perspective.


If you use M$ products you will find that SQL Server 2016 has R baked in.

https://www.r-bloggers.com/demo-r-in-sql-server-2016/


I use both R and python quite a bit. I prefer python as a programming language. Here's my take on 'Why learn R?': (1) R/ggplot is hands-down better for plotting than anything in python. I also think that R is better for EDA generally. (2) Many smart, knowledgeable people use R and publish their code. To learn from it, you need to know enough R to read and modify it. (3) R has better package support than python in several common data analysis domains. For example, in forecasting and in graph analysis, the best R packages available are much better than the best python packages.


Are there solutions for the exercises?

A lot of the exercises, especially in the exploratory data analysis part, are "why is blah?" or "is there a relationship in blah?"

I think I know the answers, but it would be nice to be able to check if I see what I'm supposed to see in the data.


No, but I'll probably crowd source when the book is final.


Just curious, but why crowd source? You're the author, I assume you wrote the questions, didn't you solve them when you wrote them?


I love your sweet naivety about the process of writing exercises :P


The proof is trivial and is left as an exercise for the publisher ;)


I've always had a hard time understanding pipes, so it may just be that the concept will take more time for me to grok. I thought that the little bunny foo foo example in section 18.2 was really hard to grasp. I think that doing a similarly in depth example with numerical data, while more boring, may make the concept easier to understand.
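
A small numerical illustration of the idea, sketched in Python rather than R: a pipe passes the result of one step as the input to the next, so `x %>% f %>% g` in R means `g(f(x))`. The `pipe` helper and the step functions below are hypothetical stand-ins, not part of the book or of any library.

```python
from functools import reduce
import math


def pipe(value, *functions):
    """Pass value through each function in turn, left to right."""
    return reduce(lambda acc, fn: fn(acc), functions, value)


def square_all(xs):
    return [x * x for x in xs]


def mean(xs):
    return sum(xs) / len(xs)


values = [3, 4]

# Nested calls read inside-out: square, then average, then take the root.
nested = math.sqrt(mean(square_all(values)))

# The piped form lists the same steps in the order they happen.
piped = pipe(values, square_all, mean, math.sqrt)
```

Both forms compute the root mean square of the values; the pipe only changes the order in which the steps are written, not what is computed.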

I've been using r4ds for the past couple weeks. This is the first time I've really understood how everything in R and the tidyverse fits together. I am really enjoying the book. It has already helped me immensely.

Seriously, thank you for writing this.


I'll be sure to buy a copy -- I'd just rather wait for the final version.


If you pre-order on amazon, you'll get the final version as soon as it's available.


Is there (or are there plans for) a version of this in Jupyter notebooks?


No, because I don't use Jupyter notebooks.


I don't see how there would be a need for that. The code will work in Jupyter Notebooks just as it will work in a different IDE.

I came from Python using iPython. I missed them for the first few weeks, but now I can't switch from RStudio. It really is just such a great tool for data science.


The advantage of Jupyter notebooks over RStudio is having text and code in the same place. Personally, I find that being able to run+modify the code in a textbook is much more informative than simply reading the syntax (for example - https://github.com/CamDavidsonPilon/Probabilistic-Programmin...). Sure, I could just copy and paste from the website into an IDE, but notebooks are a more natural way of communicating code and prose, IMO.


In R we also have notebooks that do this; it is just slightly different. RMarkdown is how we do the same thing. At first I missed the different code and text blocks, but it is just easier to work with when it is all a text file. You can make the RMarkdown for reports and then just run them from the command line and never have to open RStudio or R. Saves me a ton of time.

You use back ticks to make your code chunks.

For example: Some random Markdown text here is treated as a text block in Jupyter

```{r}
summary(cars)
```

Then some further text goes here.

http://rmarkdown.rstudio.com/index.html


I agree, though I don't know if I'm just missing something when working in Jupyter, or if Python has an equivalent of RMarkdown?

Jupyter has some conveniences, but the tradeoffs aren't worth it for me. Working in a web browser has much less power, when it comes to keyboarding, than Atom/Sublime. And I generally don't need to interact with my data; I know what I'm outputting, I just want to show the results to readers alongside my code. I don't use RStudio but RMarkdown is easy to run from the command line.

By contrast, Jupyter requires (AFAIK) working within the browser and, when you save the file, you get a huge jumble of JSON, which is how the notebook is serialized. I tend to write a lot of vignettes/explorations and the need to full-text grep them is important to me and is not feasible when the text content is saved as JSON.
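
One possible workaround for the grep problem, sketched with only the standard library: an .ipynb file is a JSON document whose top-level `cells` list holds each cell's `source` as a string or a list of strings, so the prose and code can be pulled back out as plain text. The function name and the sample notebook below are hypothetical, and this is a sketch rather than a replacement for the official conversion tools.

```python
import json


def notebook_text(nb_json):
    """Return the source of every cell in a notebook as one plain-text string."""
    nb = json.loads(nb_json)
    chunks = []
    for cell in nb.get("cells", []):
        source = cell.get("source", "")
        # The notebook format allows source to be a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source)
    return "\n\n".join(chunks)


# Hypothetical minimal notebook, built inline so the example is self-contained.
example = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Exploring delays\n", "Some notes."],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": "summary(cars)",
        },
    ],
})

text = notebook_text(example)
```

The resulting string can be written to a file or searched directly, which restores full-text search at the cost of an extra extraction step.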


>Jupyter, or if Python has an equivalent of RMarkdown?

You can convert the Notebooks to different formats, though they tend to have to be fixed up a bit for a script (which I might also need to do in R to turn it into a script).

https://ipython.org/ipython-doc/3/notebook/nbconvert.html


The source is available in https://github.com/hadley/r4ds, and works exactly as you describe when using the preview edition of RStudio (and indeed that's how I write the book)


I just saw this the other day! Is this a wrapper around Pandoc to make our lives easier (tidyr)? I just went through the process of converting a markdown file to an epub before I saw this; I wish I had seen it earlier.


R for Data Science is the canonical source for learning R and other real-world R tools such as dplyr/tidyr/ggplot2, and one I've recommended on HN submissions about R tutorials which simply go over primitive data types and out-of-date packages. (It's one of the reasons I've postponed making R tutorials myself, since the book would be better/more accurate in all circumstances.)


Seems interesting. Quick question:

Some background on myself first. I am a financial consultant (only 1 year since graduating) and am planning to do a PhD in Accounting in the next 3 years. Currently working through the GMAT, but once that is complete, I will find myself with 2 or so years to do things that will help prepare me for research. One thing I have considered is taking a course/reading books on data science and such to prepare me for the advanced stats/data analysis that will go on during research. As someone with no coding experience, and with solid quant background (I was an economics undergrad), would this book be a good starting point for getting experience with this stuff? And is R the appropriate language to learn? I don't mind learning to code, but it is intimidating.

Thanks!


I'm taking a research/PhD-access master's degree, and all I can tell you is that R is one of the most popular tools. If it is used at your university, go for it.

It's not just the features of the R programming language. It is also free, open source and can be integrated easily with many external tools. For example, you can write a paper using Markdown (RMarkdown) and convert it to ->TEX->PDF. Or you can use RMarkdown to make a presentation, converting it to ->TEX->beamer slides. Or just output an HTML document/webpage. It is also well integrated with online databases and resources.

While it's true that it can be a bit less intuitive or not as visualization focused as compared to some other tools, it is just as powerful under the hood.

Other open source tools like python are good for programming but when it comes to data analysis they can be a little bit cumbersome. For example, python is quite object oriented and for straightforward purposes with low reusability the amount of code needed can be large.

There are other tools like Matlab, Mathematica, SPSS... Matlab is good at visualization and Mathematica has really nice features aimed to improve understanding. However these are closed source and cost a lot of money to you or the university.


Personally I work in macro and fixed income market analysis (strategist), and I can heartily recommend R as your first language. Indeed, coming from a CS background, I first applied Python to many problems, and resisted R, which was not a "grown up" programming language, in my opinion (some would make the same accusation of Python). However, I dipped my toe in the water one day because R had a Bloomberg terminal add-in and Python did not (at the time), and after about a month of uphill learning curve the eureka moments started materializing thick and fast. I cannot recommend R enough as a problem exploration language. It just beats Python hands down when it comes to grabbing some (usually dirty) data, mangling it around, cleaning it, and then install.packages'ing a bunch of potentially useful libraries which allow you to do everything you could possibly imagine to a small to medium sized data set. And crucially, static graphing. Nothing else comes close for this use case.

Now...caveats. R is not a production programming language. If you find yourself creating something truly useful for many users, that requires robust programming language structures such as threading, proper memory management, server-capability, or indeed, speed, R is going to become frustrating. Yes a whole bunch of people will tell you "it's possible, I do it, etc", but that is not its sweet spot. Also, if your data set is bigger than 2-3 gig or so, you're going to start hitting R's memory management wall. It's slow. You'll then be better off with Python, C, or indeed, Scala, or possibly, Apache Spark. The common thing about these caveats, however, is that they're definitely second order problems, later in your career life cycle, than the excellent mainstream data science tool which is R for people who have outgrown Excel, but are not full fledged computer scientists, and who want to get (lots of) stuff, done.

(by the way, pre-empting comments. Yes Pandas is great, but no it's not quite R).


Thanks for the replies everyone. I'll definitely save this link and pick up the book when it comes out!


If you want to attain a deeper understanding of coding, I suggest you try another language once you get the working basics of R. Maybe some basic MOOC on Python or C. You'll understand concepts of CS that are hard to grasp by learning only R.


It looks like it starts at a low enough level that you should be fine without a programming background.

Though they suggest Garrett's book [1] as a companion to R for Data Science in the Prerequisites section [2].

> And is R the appropriate language to learn?

I'd think so. Possibly either R or Python, or both (if you want to get beyond Excel).

[1] https://www.amazon.com/dp/1449359019 [2] http://r4ds.had.co.nz/intro.html


Yes


Looks really nice. I'm a heavy Python/Stata user, but I'm seriously thinking about transitioning, given all the amazing work in the hadleyverse.

Also, RMarkdown looks incredibly well thought out


cough tidyverse cough


I'm learning R for fun at the moment. I'm sure it's super useful for statisticians but it's quite an intricate language! It's an unlikely mix of different paradigms and features mixed together. Not something I'd recommend to a beginner programmer, yet it seems that people love it (even non-programmers).

I looked at several tutorials and what worked for me the best so far are the official manuals https://cran.r-project.org/manuals.html (esp. the language definition and the "introduction to R").

Moreover, for the programming languages enthusiasts, the following article is pretty interesting:

Evaluating the Design of the R Language (Morandat, Hill, Osvald, Vitek).

"R is a dynamic language for statistical computing that combines lazy functional features and object-oriented programming. This rather unlikely linguistic cocktail would probably never have been prepared by computer scientists, yet the language has become surprisingly popular. With millions of lines of R code available in repositories, we have an opportunity to evaluate the fundamental choices underlying the R language design. Using a combination of static and dynamic program analysis we can assess the impact and success of different language features."


It can be challenging for sure.

I will say, having struggled to use R and finally prevailing, that I will NEVER use anything else for generating figures for academic publications or internal reports (ggplot2 - same as the author of course).

Nothing comes close to the composable, functional way that it "just works" -- I don't use it every day - but it is my "go to" for data exploration etc. -- more than pandas / ipython.


You might enjoy <http://adv-r.had.co.nz/>, which discusses R from more of a programming language perspective (albeit a programming language that is chiefly used for data analysis). There are a lot of misunderstandings about R the language.


Great book!

We also have almost 10,000 forkable & executable R examples on Kaggle (https://www.kaggle.com/kernels - select R from languages). Almost all of these use at least one of Hadley's libraries


hadley, I love your book, and I learn a lot from your preferences in R packages. Now is there a general source for determining the "best" packages for various tasks?

CRAN has task views, but they are long lists and don't clearly show popularity or feature matrices. There are just so many options.

I'm thinking something like https://djangopackages.org , for example see https://djangopackages.org/grids/g/commenting/


Not yet. It's a hard problem.


Hadley, can you share a bit more about your plans for modelr and what need(s) the package will be designed to solve? Congrats on your book btw, I've been reading it for a few weeks and it's quite simply excellent.


I don't think modelr is going to change significantly in the future. It solved a pressing problem (fitting models as part of a pipeline) so I could teach modelling using the same interface as everything else in the book.

However, the modelling infrastructure in R is generally showing its age, and thinking about how to make modelling easier is something that I will be working on in the coming months.


I saw the vctrs package repo the other day on your GitHub. What's your plan with that? I guess you're covering all of R's base data types (dplyr:data frames, purrr:lists, forcats:factors, vctrs:vectors)? Also do you plan on developing further functional programming packages?


Yes. You can read the issues for what the package will start with.

No plans for more FP packages in the near future, although I do want to add multicore and progress bars to purrr.


Sounds great. Something like a tidyverse version of the data pipeline capabilities of scikit-learn would be awesome.


What is the equivalent book for Python and data science?


I LOVE Python and really am pleased with Pandas, but ...

I use R exclusively for data science. Really encourage you to just give it a try. The tools, packages, community and the industry support is just awesome.

I did my reports for the end of the year and people loved the reports, but the office is so MS Office focused that they wanted them in Word and PowerPoint (UGH). R has great tools for that: RMarkdown and the ReporteRs library convinced me to switch. Also, indexing from 1 is a super strong selling point from now on for doing data science. http://davidgohel.github.io/ReporteRs/


Whatever the data problem, I have found R much more likely to have a solution than Python or anything else.


There's a large number of such books, though none that are as authoritative with respect to Python (this is a statement about the size of Python's community vs. R, not necessarily about the authors):

- via Wes McKinney, creator of pandas (which makes Python about as close to R as you can get): https://www.amazon.com/Python-Data-Analysis-Wrangling-IPytho...

- http://joelgrus.com/2015/04/26/data-science-from-scratch-fir...

There are a bunch of books specific to machine learning too though I haven't read them myself.


Wes's book is definitely the standard. But I would hold off buying one right now - he's currently working on a (much-needed) second edition, coming out next year (http://wesmckinney.com/).

Joel's book is a great resource for preparing for interviews or learning really basic stuff and less of an introduction to the tools.


What would you recommend for visualisation?


I actually have no idea about that. I don't think there's an equivalent to R's base graphics, so that would seem to make matplotlib the closest thing to a standard -- seaborn [0], which I've seen used a lot lately for more advanced dataviz, lives atop it, but it's also relatively new.

People seem to have conflicted feelings about matplotlib, maybe because of its origin in MATLAB? Not that MATLAB itself is bad, but I think the decision to make matplotlib's API comfortable for MATLAB users seems to cause confusion for contemporary users, even before the usual 2.x vs 3.x issues (matplotlib was ported to 3.x a few years ago but many users still write Python in the 2.x style).

Anecdotally, I feel like I see advice like "Just use plotly" more than I see recommendations to actually learn matplotlib. I actually gave up on matplotlib until I stumbled upon this comprehensive tutorial, which covers the basics and many elaborate use cases. If there's a book that does it better, I haven't heard about it:

http://www.labri.fr/perso/nrougier/teaching/matplotlib/

The matplotlib site itself is chock-full of well-documented examples, but some of them seem to be significantly more verbose than they need to be. My impression is that the library is stable/ubiquitous enough that there isn't a big movement to overhaul things. Last time I looked at the API changes for v2.0 [1] (1.5.3 is stable), most of the changes had to do with default styles and stylesheets, which is non-trivial given the number of people who use ggplot2 because it "just works".

[0] https://stanford.edu/~mwaskom/software/seaborn/

[1] http://matplotlib.org/devdocs/users/dflt_style_changes.html


What are your thoughts on bokeh? I seem to always revert to R for visualizations


I've used R and Python. I stick with Python whenever possible because IMO it supports the non-modeling parts of data science more effectively. ETL scripts, API creation, Flask for hosting simple websites, etc. yHat makes python-ggplot and Rodeo, similar to RStudio. I explore and develop algorithms in Jupyter notebooks, documenting along the way, while running "hardened" code from the command line, often nohup'ing it on a Linux box, for services that run perpetually, keep me updated via Slack/SMS/email, etc.

For visualization, almost everything I do is in D3, p5.js, or in Processing (Java), which has a Python interpreter, for those interested. There are some great Processing books and Daniel Shiffman is the Hadley of that world. Tons of engaging resources from him. There are tons and tons of good D3 books and online resources. bl.ocks and Mike Bostock's other online articles are wonderful.

Every organization with data scientists defines "data science" differently. People with a modeling and stats focus probably should stick with R. If you find yourself in a position with a wider scope, you simply must have more tools in your tool belt, and in my opinion, R, Python, and JavaScript all are part of that package. For me, personally, Processing is, too. Have a look at Ben Fry's work to understand why. I also use openFrameworks when the volume of data to visualize and performance concerns require it.



