Great book, I'm getting a lot out of the site and I'm looking forward to the rel...

phillc73 · on Sept 16, 2016

I wondered something similar about sqldf, because at one point my brain just seemed to work better "in" SQL.

The biggest issue I found was that sqldf was significantly slower than dplyr and other alternatives.

I started trying to mess about with something I was calling sqldf2. Didn't get very far, but there is some perhaps somewhat useful benchmarking in the R script here:

https://github.com/phillc73/sqldf2/blob/master/R/sqldf2.R

peatmoss · on Sept 16, 2016

I have a related technical question. Why couldn't something highly embeddable like SQLite be the default underlying implementation for a data frame in something like Python or R? It seems like Pandas and R data frames have a great deal redundant functionality.

SQLite seems like it has the guts to be the standard libdataframe.c for R, Python, Julia, etc. As a side benefit it already has a super consistent API (a.k.a. SQL).

hadley · on Sept 16, 2016

Because it's designed to support typically relational db workloads (i.e. Lots of changes) not data analysis workloads. Dataframes in R, pandas etc, are column oriented, which leads to better trade offs for analysis.

Also SQL is a substantially inferior API for data analysis. (Not because it's a bad language, but again because that's not what it's designed for)

geebee · on Sept 16, 2016

I agree completely that SQL is not the language for the kind of data analysis you're discussing in this book - to me, the question is whether it's useful to do querying and filtering through SQL and data analysis through python and R on the resulting datasets. I think pretty much everything you've written here would be continue useful if you used sqldf to generate data frames in R, but I don't know R well enough to be sure of that.

Because pandasql returns a data frame from a data frame (not sure if this is the case with R), I find it relatively easy to do data things with sql and data analysis with python. However, that's not a huge surprise since I've been using SQL for a while but don't know the pandas or R data frame syntax especially well.

I'm not sure why sqlite was chosen - could it have to do with the in-memory nature of dataframes? So far, my use of sql with data frames has been pretty generic, so I haven't bumped up any implementation specific SQL issues.

hadley · on Sept 16, 2016

I think teaching multiple languages would make life much harder for new learners.

Also window functions are really useful for data analysis, and they are much easier to express in dplyr than they are in SQL (at the cost of being slightly less general).

peatmoss · on Sept 16, 2016

Thank you for the reply--I immediately started Googling for more info about column-oriented data stores to see if there was something analogous to SQLite in this space. It looks like there's an embedded MoneDBLite package for R now that I'll need to check out.

hadley · on Sept 16, 2016

If you don't already know SQL or dplyr, I think you would find dplyr significantly easier to learn. Some people who do know SQL well have commented that they too find dplyr easier. I think this is because the scope of dplyr is much smaller than SQL and it is designed specifically to facilitate data analysis.