Great book, I'm getting a lot out of the site and I'm looking forward to the release. Thanks!
I understand there is always one more library or topic that could be included...
.. but with that acknowledged, what do you think of sqldf as an alternative to dplyr? You mention that dplyr is a bit easier (within the context of being specialized for data analysis). I'd have trouble weighting in because I don't use R all that much, but I do really like the python "equivalent" pandasql.
Also, I've used SQL for a long time, so I'd have trouble at this point really knowing what's "easier" for someone new to both, but I do often find it easier to use SQL than do data frame operations in pandas. dplyr seems to be a closer cousin to standard SQL, so the difference might not be quite as great.
I wondered something similar about sqldf, because at one point my brain just seemed to work better "in" SQL.
The biggest issue I found was that sqldf was significantly slower than dplyr and other alternatives.
I started trying to mess about with something I was calling sqldf2. Didn't get very far, but there is some perhaps somewhat useful benchmarking in the R script here:
I have a related technical question. Why couldn't something highly embeddable like SQLite be the default underlying implementation for a data frame in something like Python or R? It seems like Pandas and R data frames have a great deal redundant functionality.
SQLite seems like it has the guts to be the standard libdataframe.c for R, Python, Julia, etc. As a side benefit it already has a super consistent API (a.k.a. SQL).
Because it's designed to support typically relational db workloads (i.e. Lots of changes) not data analysis workloads. Dataframes in R, pandas etc, are column oriented, which leads to better trade offs for analysis.
Also SQL is a substantially inferior API for data analysis. (Not because it's a bad language, but again because that's not what it's designed for)
I agree completely that SQL is not the language for the kind of data analysis you're discussing in this book - to me, the question is whether it's useful to do querying and filtering through SQL and data analysis through python and R on the resulting datasets. I think pretty much everything you've written here would be continue useful if you used sqldf to generate data frames in R, but I don't know R well enough to be sure of that.
Because pandasql returns a data frame from a data frame (not sure if this is the case with R), I find it relatively easy to do data things with sql and data analysis with python. However, that's not a huge surprise since I've been using SQL for a while but don't know the pandas or R data frame syntax especially well.
I'm not sure why sqlite was chosen - could it have to do with the in-memory nature of dataframes? So far, my use of sql with data frames has been pretty generic, so I haven't bumped up any implementation specific SQL issues.
I think teaching multiple languages would make life much harder for new learners.
Also window functions are really useful for data analysis, and they are much easier to express in dplyr than they are in SQL (at the cost of being slightly less general).
Thank you for the reply--I immediately started Googling for more info about column-oriented data stores to see if there was something analogous to SQLite in this space. It looks like there's an embedded MoneDBLite package for R now that I'll need to check out.
If you don't already know SQL or dplyr, I think you would find dplyr significantly easier to learn. Some people who do know SQL well have commented that they too find dplyr easier. I think this is because the scope of dplyr is much smaller than SQL and it is designed specifically to facilitate data analysis.
I understand there is always one more library or topic that could be included...
.. but with that acknowledged, what do you think of sqldf as an alternative to dplyr? You mention that dplyr is a bit easier (within the context of being specialized for data analysis). I'd have trouble weighting in because I don't use R all that much, but I do really like the python "equivalent" pandasql.
Also, I've used SQL for a long time, so I'd have trouble at this point really knowing what's "easier" for someone new to both, but I do often find it easier to use SQL than do data frame operations in pandas. dplyr seems to be a closer cousin to standard SQL, so the difference might not be quite as great.