From personal experience, I learned R during college as a part of my stats courses, before tidyverse even existed.
After doing a few personal projects using base R with complex manipulations, I would have quit using R entirely if it were not for tidyverse/dplyr.
I'm a pragmatic software engineer: I prefer to get a data analysis job done as fast as possible, and IMO there isn't much merit in making things more complicated by using base R just for the sake of base R. It's definitely not faster, either in cognitive load or in execution speed. In my data science work there hasn't been a single case where I've needed to fall back to base R, or wanted to.
Pipes are being undersold here: the examples only show a single piped action, whereas pipes shine when you have to do 5+ operations in a single manipulation (I've seen base R code that does that by just having a lot of redundant df assignments and it's a pain to read).
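To make that concrete, a sketch of the kind of chain I mean (dplyr verbs on mtcars, purely illustrative, not taken from the article) versus the pile of intermediate assignments it replaces:

library(dplyr)

# instead of tmp1 <- filter(mtcars, ...); tmp2 <- mutate(tmp1, ...); tmp3 <- group_by(tmp2, ...); ...
result <- mtcars %>%
  filter(cyl > 4) %>%
  mutate(kml = mpg * 0.425) %>%       # miles per gallon to km per litre
  group_by(gear) %>%
  summarise(mean_kml = mean(kml), n = n()) %>%
  arrange(desc(mean_kml))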
There are nuances that, to me, make R data frames, especially with magrittr, simpler to work with than Pandas. Pandas indexing always seems to get in the way.
I have been using Python a lot longer than R, but the Pandas syntax is taking longer to internalize.
That's an interesting caveat with pipes/magrittr (I didn't know of it), although it's out of scope of this debate since this discussion is just about how tidyverse uses pipes.
In my experience if there's a package that's required in a script/R Notebook outside of tidyverse, I just import the entire library at the start of the file Python-style (to make the dependency obvious), which would avoid this issue.
I think it's fair to mention that by making things more complicated with pipes you expose yourself to other issues.
(In my opinion the first variant is not less readable, and it has the advantage of allowing any of the operations to be removed by commenting the corresponding line, while in the second case to remove the last operation you also need to remove the piping operator in the preceding line.)
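For readers who don't have the upthread example in front of them, the two variants being compared look roughly like this (reconstructed from the magrittr example referenced further down, so treat the details as illustrative):

# first variant: a sequence of assignments; any step can be commented out on its own
the_data <- read.csv('/path/to/data/file.csv')
the_data <- subset(the_data, variable_a > x)
the_data <- transform(the_data, variable_c = variable_a / variable_b)
the_data <- head(the_data, 100)

# second variant: the piped form; dropping the last step also means editing the %>% above it
the_data <-
  read.csv('/path/to/data/file.csv') %>%
  subset(variable_a > x) %>%
  transform(variable_c = variable_a / variable_b) %>%
  head(100)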
[Edit: I’m not sure I understand your “out of scope” remark. You don’t mean that magrittr is outside of tidyverse, do you?]
They are roughly equivalent to read; but the reason for that is that the first code sample is clearly someone unwinding something that they are conceptualising as a piped statement. If you aren't conceptualising the code as an unwound pipe, you'd have to actually inspect the code and think about what each function argument means rather than being confident it is a set of chained transforms as you can be by glancing and seeing a %>%.
I'm a bit suspicious about that coincidence - after I met the pipe operator, a lot of my code looks like that. Before I met the pipe operator, not very much of my code looked like that.
As long as people are thinking in pipes then they can choose if they value making commenting code cheap vs making reading/typing code cheap. But it is quite important that beginners are introduced to pipes as a concept as they learn R. The semantics are critical, using the syntax is optional.
Well, yeah, if anyone wants to write their pipes that way that is also an option. They aren't going to though, that isn't really human readable when there is a sensible option like using a pipe operator.
If you find that the example above is not "human readable" then the following code will make your head explode because it cannot be transformed into a pipe representation (or at least it's not so straightforward):
You've thrown out a few variations on a theme here, but I'm not sure where you are trying to angle towards. There are a lot of ways to format code, and I can't tell you what works best for you.
But this isn't shaking the paradigm of having an initial block of data (here separated a little awkwardly into two files) that is being subjected to a series of functional transformations, which is sorta pipe-like. Being able to talk about this as a 'pipe' is hugely useful, and having a pipe operator really standardises how people can talk about it. In my experience, selling this as a concept to a beginner is a lot easier when it has syntax support so they know they are doing the 'right thing' and get error messages/uncomfortable looking code when they try to cheat. I don't really care how you nest the brackets, if you want to write something that isn't a pipe go with (edit obviously a non-buggy version of...):
for (i in 1:nrow(df)) {
  if (df[i, 'variable_a'] > df[i, 'x']) next
  df[i, 'variable_c'] <- df[i, 'variable_a'] / df[i, 'variable_b']
}
If someone shows me that and I can tell them to 'go rewrite it using pipe operators, it will be better' they will naturally start seeing that that horrible block of mixed ideas can be separated out into a few basic composable operations, and that is funnily enough easier for a beginner to grasp in my experience teaching people to use R. Plus, they can tell that a statement is going to be a series of composes on top of raw data if they see a pipe operator. This is a useful thing that makes it easier to read the code.
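A minimal sketch of the rewrite being asked for, assuming dplyr and the column names from the loop above (and that variable_c didn't exist beforehand, so skipped rows end up NA):

library(dplyr)

df <- df %>%
  mutate(variable_c = if_else(variable_a > x, NA_real_, variable_a / variable_b))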
For the record, the original example is from the homepage of magrittr (I found it here https://www.fromthebottomoftheheap.net/2015/06/03/my-aversio...) and I'm not trying to angle towards anywhere. I only wanted to point out that the version with a sequence of assignments is not so painful to read (to me) and even the nested version is readable and has some advantages. Not that anyone cares, but I think I would have written something like:
My point is that sometimes you don't have a single "thread" of calculation. Putting rbind into a pipe like you did is somewhat artificial (broken symmetry) and doesn't work so well if there is some pre-processing before the merge and some post-processing after the merge (or if we have two or more essentially different arguments in a function that need some preprocessing). You may say that having multiple pipes, one merge operation (or whatever the function with multiple inputs is), and then a downstream pipe is the "human readable" way to do that. I'm not sure if that makes programmers able to handle some nesting superhuman or subhuman :-)
Teaching a "paradigm" can be too limiting. Looking at some random tutorial on the web:
"To demonstrate the above advantages of the pipe operator, consider the following example.
round(cos(exp(sin(log10(sqrt(25))))), 2)
# -0.33
"The code above looks messy and it is cumbersome to step through all the different functions and also keep track of the brackets when writing the code.
"The method below uses magrittr‘s pipe (%>%) and makes the function calls easier to understand.
which is a perfectly readable way of calculating the evolution of the value of a 60/40 portfolio of stocks and bonds from the value of each component at each rebalancing date?
I don’t think anyone is arguing that the pipe should be the -only- form of composition. Just that it’s a useful form when you have a linear sequence of transformations. Sometimes it’s useful to force something close to being linear into a linear form for consistency, but typically you would switch to an alternate form of composition, typically assigning to intermediate variables.
It’s easy to find examples of using the pipe in ways that I would consider suboptimal. But that doesn’t affect my thesis that, on average, the use of the pipe leads to more readable code.
I was also initially quite sceptical of the pipe, since it is a fundamentally new syntax for R (although obviously used in many other languages). I think the uptake by a wide variety of people across the R community does suggest there’s something there.
I was actually excited about pipes when they were introduced but in the end I'm pretty happy writing and debugging "unreadable" code.
I had a similar experience with Lisp syntax: Clojure's threading macros seem a neat idea but I do actually prefer old-style nesting of function calls.
Maybe my R journey is a bit atypical. I started learning R around the time you created reshape and ggplot and used them extensively. But as the "tidyverse" thing has evolved I have found myself more attracted to "base" R as I've become more familiar with its data structures and functionalities.
Because the pipes are at the end of the line, you can comment out a line and it will still work fine unless you're removing the last line of the chain.
He is glossing over the major issue with base R, which is its tendency to switch data types in a way that appears random to new users. As an undergrad I spent nights literally on the verge of tears debugging R code where the types had been mucked up by R's bizarre semantics. Claiming that data[foo] is the same as filter(data, foo) is not correct - the [] and [[]] operators have a lot of strange side effects depending on what argument is passed in. It is like saying map() and a for loop are the same because everything a map can do a for loop can do. That is true, but the single-mindedness of map is a major selling point because you know it is only going to do certain specific things. The issues here will disproportionately affect beginners who are likely to accidentally pass in the wrong thing - it needs to fail with an error or in a predictable way to best help them. Hadley's code fails sensibly.
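A small illustration of the shape-shifting in question (not from the article, just the standard behaviour of base R extraction):

df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = TRUE)
df["x"]            # a one-column data frame
df[["x"]]          # the underlying integer vector
df[, "x"]          # also a vector -- selecting a single column "drops" by default
df[, c("x", "y")]  # back to a data frame
class(df$y)        # "factor", not "character", unless stringsAsFactors is turned off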
One of the major innovations of the tibble package is that it puts the types of each column into the headings when it prints the table out. Dr. Matloff is grossly underestimating how painful it is for beginners to, eg, figure out when they are dealing with a column of strings vs a column of factors. If a beginner copies code to someone else for help, the other person can actually tell what types are involved! It is like magic compared to base R.
Given how ad-hoc the R community is and the paucity of people with programmer backgrounds, these minor usability improvements on controlling types are important. And anybody who cares about efficiency is using the wrong language.
EDIT Also, this comment that RStudio is bypassing an Open Source project's leadership - that is one of the major features of open source. It is usually a healthy sign that the project leadership are lost in the weeds. I recall a situation with broad parallels in the history of GCC, for example; it didn't kill the project, although I believe a bad codebase died.
> "Dr. Matloff is grossly underestimating how painful it is for beginners to, eg, figure out when they are dealing with a column of strings vs a column of factors."
This is just sapply(df, class)... Is this not found in R 101 material that beginners are likely to find?
Pay no attention to the myriad of other apply functions hiding behind the curtain and the difficulty in remembering what each of them does and why you can't just use map, eh? :)
So to answer your question, no, it's not just simply that.
After reading your explanation I still, as has been the case for years, don't understand what sapply or rapply does, and vapply sounds weird.
That isn't going to change, because I'm just not going to use them or 'invest' the time in finding out what some statistician-of-yore's interpretation of a map is. Instead I'll stick to tidyverse map - returns a list. Or tidyverse map_[int/chr/dbl/etc, etc] if I want a vector of [int/chr/dbl/etc, etc].
That and the data frame manipulation verbs cover the most useful 80% of cases where *apply would otherwise be needed. If the base R team were implementing functions that way in the base, stats professors wouldn't need to complain about mass exoduses from base R.
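For anyone who hasn't seen the purrr spelling, a quick sketch of the contract described above (illustrative only):

library(purrr)
map(mtcars, mean)       # always returns a list, whatever the function gives back
map_dbl(mtcars, mean)   # a double vector, or an error if any result isn't a length-one double
map_chr(mtcars, class)  # a character vector of column classes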
Yeah, that's a valid point. And it took me a couple years to internalize these differences -- mostly by being burned on numerous occasions. Base R has a lot of idiosyncrasies and they have to be memorized, unfortunately.
I develop R packages for my colleagues and so I stick to Base R whenever possible. I don't want my packages depending on the tidyverse at all. But for EDA, I am agnostic about what my colleagues do. They should use the tools that stay out of the way and let them get their hands around the dataset intuitively. For me that's Base R, for others it's data.table or tidyverse.
> After reading your explanation I still, as has been the case for years, don't understand what sapply or rapply does,
There is nothing complicated about what sapply does... It simply means loop over the elements of the first argument (typically a list; btw a data frame, df, is internally just a list of columns) and apply some function. lapply does this and returns a list, sapply does this and can optionally "simplify" the results into a vector, etc.
So:
lapply(df, class) = loop over the elements of df and tell me the class, return this in the form of a list
sapply(df, class) = loop over the elements of df and tell me the class, return this as a character vector
This is basically lapply:
res <- list()
for (i in seq_along(df)) {
  res[[i]] <- class(df[[i]])   # df[[i]] is the column itself; df[i] would be a one-column data frame
}
names(res) <- names(df)
res
Looking at the documentation, "sapply" is itself a wrapper for "lapply", and will do a number of different things depending on what arguments are passed into the function or not.
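Which is easy to see interactively; the return shape also depends on what the results happen to look like (illustrative, not from the thread):

sapply(mtcars, mean)    # simplifies to a named numeric vector
sapply(mtcars, range)   # every result has length 2, so it simplifies to a matrix
sapply(list(), mean)    # nothing to simplify, so you get a list back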
In this article and others I've read, people complain about the tidyverse's lack of performance, but I think that places too much emphasis on speed of execution versus speed of development. As an academic, most of the R users I know only code as a portion of their scientific projects. Besides data analysis we're doing data collection, manuscript writing and grant writing. The tidyverse's more english-like syntax (eg select() versus `[`) and following a series of pipes rather than unnesting ten sets of brackets makes it so much easier to come back to my code after a few weeks or months and pick up where I left off, or to work with other people's code. My time is more valuable than computer time so the tidyverse is my choice.
This is my experience as a data engineer/analyst. I work in the healthcare space and my analyses run on datasets that top out at around half a million rows. Yes, data.table is faster than dplyr/tibbles, but I don't care. Like you said, most of my time is spent on expressing my thoughts into code, not running the code. Tidyverse really simplifies it compared to base R.
It's good to know tools like data.table exist though; people shouldn't think the tidyverse is the only way to do things.
As an R user of some 10+ years, I think some of this is spot on and some of it is just silly.
Pretty much everyone knows the Tidyverse is slow to run, but many of us find it faster to write and very readable because it avoids one-time-use assignments (i.e. in `a <- f(x); g(a)` the variable `a` is only used once). Naming things is hard. Prof. Matloff's point is well taken though - you don't need the Tidyverse and it doesn't run faster. Fair criticism.
> The "star" of the Tidyverse, dplyr, consists of 263 functions. While a user initially need not use more than a small fraction of them, the high complexity is clear. Every time a user needs some variant of an operation, she must sift through those hundreds of functions for one suited to her current need.
> By contrast, if she knows base-R (not difficult), she can handle any situation.
Yet base R has some 1,346 functions. Another 455 functions in `stats`. I'm not sure the quantity of functions has as much bearing on learning as the quality and consistency. So this seems like a silly argument to me.
Mostly I think Prof. Matloff is forgetting that base R is kinda terrible and no one else has done much about it aside from RStudio.
> Mostly I think Prof. Matloff is forgetting that base R is kinda terrible and no one else has done much about it aside from RStudio.
I think he'd disagree with the contention that base R is kinda terrible. Like, I'm pretty sure he's wrong. But this isn't something he's forgotten, it's something he's never believed.
The point about the Tidyverse and RStudio is that RStudio doesn't make any of its revenue from the Tidyverse. They distribute it free. There are no upsells for any of it -- I can't go out and buy ggplot2 add ons, I can't buy dplyr support contracts. There's no money to be made on the Tidyverse, and plenty of resources thrown at it.
So what is RStudio selling? They sell an IDE, they sell servers to host things built in R, they sell a package manager solution, they offer most of these things as a SaaS offering as well (although rstudio.cloud is still in trial and they haven't started charging yet).
In other words, RStudio offers tooling for people who write R. They don't really care what kind of R you're writing, from a financial perspective. They get paid the same if you're using RStudio to write stuff using data.table as they get paid if you use dplyr. RStudio is making a bet that more people will want to write R code if they make R a more pleasant language to work in. And anecdotally, they're absolutely right -- I tried to learn R before the Tidyverse, and I hated every moment of it. I like post-Tidyverse R (although I got into it when people still called it the Hadleyverse). I know several other people with similar experiences to mine.
If you don't think base R is kinda terrible then I recommend you read R Inferno. It's truly a crazy language with enough footguns for a centipede. A lot of the attraction to the tidyverse is having sane defaults and consistent APIs that help you do the right thing quickly.
Decent critiques in my opinion. I was trained on/learned base-R myself ~10 years ago and love the "tidyverse" (although the author is correct, data.table is superior for big data and it's not particularly close). Can't imagine people that only know tidyr/dplyr/etc without knowing base-R; seems like those people would get exposed quickly.
I enjoy base-R solutions where I can. I use data.table when my data sets are medium-ish. But like most people, my data is small and people just call it big, and thus the attraction to dplyr in particular. Those chained pipes are addicting.
Always love seeing R discussion on HN, seems to be taboo though (not a 'real' language and all that).
The R standard library is kind of weird and hard to use for "general purpose" programming. It's very clearly a domain specific language.
Also, it's unbelievably slow for basic operations like looping, function calls, and variable assignment. It's literally orders of magnitude slower than Python (I've tested it). Unlike its fellow C-flavored Lisp JavaScript, R retains an extreme level of homoiconicity, which apparently makes the language very difficult to compile or optimize. I don't know how e.g. SBCL does it, but over the lifetime of the R language no one has managed to implement a non-trivial subset of R that performs better than GNU R (FastR was never finished to my knowledge). So you are basically relegated to writing code in C, Fortran, or C++ if you need even decent performance. Otherwise you are stuck using "vectorized" operations like lapply(), which are fine but make for a jarring experience if you're coming from other languages.
So of course it's a "real" language in the sense that Bash is a real language. But it's ultimately and fundamentally a domain-specific scripting language and I don't know if there is a way around that.
I can give you some sense of why one can compile SBCL and not R. SBCL and other Lisps these days maintain some kind of conceptual barrier between macroexpansion time and run time, so that you can _stage_ your activity appropriately. Read the code, gather the macro definitions, expand the code (repeat as necessary) until you have some base language without metaprogramming in it.
Then you can optimize and compile. This is particularly true of Scheme which foregoes much of the dynamic quality of Common Lisp in favor of a much more static view of the world. But its still basically true in CL.
R doesn't have macros of that kind. It has a sort of weird laziness based on arguments remaining unevaluated until needed. Metaprogramming in R is typically done by intercepting those "quoted" arguments and then evaluating them in modified contexts (this is exposed to the user more or less by letting them insert scopes into the "stack").
Thus, there is no distinction between macroexpansion and execution time. Hence, its tough to write a compiler, which is basically just a program which identifies invariants ahead of time and pre-computes them (eg, lexical scope is so good for compilers because you can pre-compute variable access). Because all sorts of shenanigans can be got up to by walking quoted expressions and evaluating them in modified contexts, R is hard to compile.
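A tiny sketch of the kind of run-time scope games being described (standard base R, nothing exotic):

f <- function(x) {
  expr <- substitute(x)        # capture the argument unevaluated
  eval(expr, list(a = 10))     # evaluate it in a context constructed at run time
}
a <- 1
f(a + 1)   # 11, not 2 -- what `a` means is only decided when the expression is evaluated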
This is, by the way, why the `with` keyword was removed from Javascript. It provided exactly the ability to insert a scope into the stack used to look up variable bindings.
I didn't think the with statement was removed, as that would break backwards compatibility on the web with any older code that used it. Instead, it's just not allowed under strict mode, but it is still part of the ECMAScript specification. I just tried it in the Chrome console, and it still works.
> This is particularly true of Scheme which foregoes much of the dynamic quality of Common Lisp
The idea of even the standard Common Lisp is that both are possible: a static Common Lisp and a dynamic Common Lisp, even within the same application in different sections of the program. Common Lisp allows hints to the compiler to remove various features (like fully generic code being reduced to type-specific code); it allows a compiler to do various optimizations (inlining, getting rid of late binding, ...) - while at the same time other parts of the code might be interpreted on the s-expression level.
It's definitely the idea. I know it works well enough for a lot of people, but for me, I'm happy to just give up on the dynamic behavior in favor of a simpler universe.
The last thing I want to be thinking about is whether my compiler needs a "hint" that something can be stack allocated, for instance. That is their business, not mine.
Common Lisp as a language standard makes no commitment to the smartness of a compiler. If a certain compiler can figure out things, great, but some compilers might be dumb by design (for example to be fast for interactive use). That's left to implementations.
it's ultimately and fundamentally a domain-specific scripting language
A domain-specific rather than general-purpose programming language? Most definitely. A scripting language? I would dispute that. Scripting languages are for automating the execution of sequences of tasks that you'd otherwise run independently.
Well, it may not be keeping up with the new versions but at least it does cover a non-trivial subset of the standard implementation. :-)
I have not tried it in many years and I don't know how much faster or compatible it is. Nevertheless, I think it's a good thing that an "outsider" is looking at performance issues.
As far as I know it's essentially a one-man's effort. One could imagine that RStudio or Microsoft would be able to improve R performance substantially if they wanted to.
I’d say one of the major problems I have with R is when people try to use it as a general purpose language. Just because you can doesn’t mean you should. It just wasn’t designed that way. I’ve seen a lot of R code that would have been trivial to write and execute significantly faster in another language.
I don't think there's any shame in R effectively being a good, productive DSL for a lot of stats and viz work, built around fast stuff written in C++. I also think this is largely true of a lot of Python as well, tbh. In the future my hope is that projects like Arrow end up with even more of the workload being taken off R's shoulders.
I'm a data scientist and, frankly, I really strongly prefer R over Python. It's mostly the whitespace in Python, but the language is also slow and crufty as heck. R, if you use it responsibly, feels a lot more like your standard, lexically scoped, dynamically typed language.
Absolutely. I usually respond to the "R-is-not-a-real-language" line with something like "R is syntactic sugar on a Lisp where the atoms are APL arrays". It's plenty interesting from a computer science perspective, if you bother to look.
For instance, why does R use <- instead of = for assignment? Because early versions of S took their assignment arrow from the APL-influenced notation in use at Bell Labs, rather than the = convention of C, developed down the hall.
The "not a real language" critique often comes from those without any knowledge of Scheme; "real" languages are like C# or Java, and maybe Python, even though the lisp-like properties of Python are not so much appreciated. As C# and Java have gotten more functional aspects over the years, I've seen less and less of the "not a real language" criticism.
There's really not much Lisp/Scheme left in R these days.
Function arguments/environments are still pairlists: a type of cons cell.
You can manipulate the parameters to a function as symbols before you evaluate them. This is useful — many very good R libraries all make heavy use of it — and certainly very interesting. Lispers probably know it as quote/antiquote.
I suppose we could also point that, much like most Lisps, R gives you plenty of object systems to choose from...
And that's about it for R-as-a-Lisp.
It's also a really bad language for working with tree-structured data, something Lisps normally excel at.
I agree, which is why I generally get confused when I see that criticism. This is only an assumption though; maybe it's just a vocal minority that doesn't like R, or I only notice the negative comments. Who knows.
For what it’s worth, I am a JavaScript programmer and I feel like HN shits on JS All. The. Time. Some people just like to shit on languages.
I think it’s just insecurity. They worry a lot about whether they’re a good enough developer, so they try hard to be on the “most advanced” platform they can figure out, and then need to talk down to the other platforms.
Secure, talented, older developers, in my experience, tend to be open to whatever technology is most appropriate. They have confidence in their ability to learn new things. And they see the good traits of any tech, as well as the bad ones - a necessary skill if you're going to be the kind of developer who can make traction in any situation.
The platform fetishists can only really work from inside their special tower.
The last time I tried putting a REST API in front of my predictive model, I used plumber. I also then learned that the R runtime was single threaded, so that only 1 request could be processed at a time. I don't know how you can overcome this, and it makes things significantly harder to put into a real production environment.
I'm sure someone will chime in that there are different runtimes, like Microsoft's (does this run on macOS? If not, how can I use it to develop and test on my laptop?) or some expensive RStudio solution.
The fact is you have none of those problems in, e.g., Python. Although I would prefer to use caret for many predictive modeling tasks, it is non-trivial to take the R runtime to production. That was what I walked away with.
I'm in a similar boat, using base R, data.table, or tidyverse as needed. But over time I've found that I respect, more and more, the good decisions that have been made in base R.
I think he’s exaggerating the effect of some of these claims. People who “grew up” learning R+Tidyverse will discriminate against non-Tidyverse users? Give me a break. You are either an advanced or competent or novice R user. You either understand how pipes (%>%) work or you don’t.
I doubt that anyone is going to be denied a job because they are an amazing R programmer but they just don’t have the experience with a particular set of packages, especially those that are as well implemented and easy to learn as the Tidyverse.
I can totally imagine someone not getting a job because they are a crap programmer, and then blaming it on something else. Or I can also imagine someone saying that the Tidyverse packages suck, and then be denied a job because of their attitude.
I could make a similar argument for most of the substantive claims in his post.
Some of these examples (dplyr vs. data.table) are cherry picked. I have several of my own examples where read_csv is way faster than read.csv, so maybe, like all good programmers, we should be testing and profiling our code and implementing the parts that make the most sense for our needs.
The bottom line is that the Tidyverse is a good set of packages and you are free to use them or not. There isn’t some blood feud (like vi vs. emacs) between users and non-users, we all get along just fine.
There are dozens of great tutorials on how to learn R that don’t use the Tidyverse, and RStudio is under no obligation to offer a full course on every possible way to learn R. Any R user is also a competent Google user.
It’s fine to be opinionated. But, a professor as respected as Norm Matloff should be careful of how they say things, or they risk souring their students on a set of packages that might be very useful in the future.
"cherry picked"? There is no doubt that data.table is generally way faster than dplyr, unless you cherry pick a few use cases.
The problem is that dplyr is much slower than data.table and RStudio is promoting tidyverse too much to make the slower choice a default for many users.
Depending on the size of your data, you might not care that dplyr is slower than data.table. If you're better at writing/composing dplyr, you can often make up the speed difference between the two in terms of the savings in time spent writing and reading code. And if your data is that large, there are solutions like dbplyr out there to run dplyr code on various backends and offload the computation outside of R.
In practice, if there's ever a case that there's "too much" data such that dplyr starts to hang (e.g. millions of rows, hundreds of columns), you would get better value by setting up a database first with the data. Which you can then query with dbplyr!
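For anyone who hasn't seen dbplyr, a minimal sketch of what that looks like (assuming the DBI and RSQLite packages are available; the names are illustrative):

library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars, "mtcars")

tbl(con, "mtcars") %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  show_query()   # the pipeline is translated to SQL and executed by the database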
Yeah, I'm surprised we're having performance arguments about these two libraries with mostly undefined performance characteristics which both run on a single-threaded runtime.
data.table's OpenMP stuff is pretty haphazard, and can't parallelise anything that calls back into R code. And anything outside of this that involves a lot of forking has just been painful every time I've seen it, and again, way slower than doing it on a more performant platform up front.
> If you're better at writing/composing dplyr, you can often make up the speed difference between the two in terms of the savings in time spent writing and reading code.
dplyr syntax is definitely more concise and readable than base R, but comparing to data.table I don't think it has any advantage in terms of saving time writing or reading code.
I think the article sort of punts on providing examples of a complicated set of operations on a data frame. dplyr's author provides what I think is a good example of the differences between data.table and dplyr on a reasonably complex problem:
I feel like the first example is far more readable than the second. People can disagree on this, but the adoption rates of dplyr versus data.table do suggest (don't prove, but suggest) that the consensus on the issue leans towards dplyr. As we've noted, people certainly aren't adopting dplyr for the speed.
I honestly find that less readable than Hadley's version, especially turning "by = cut" into "cut." This is where terseness really cuts into readability (and positional arguments is one of my least favorite features about R in terms of long-term readability of code).
It’s right on this issue, but the Tidyverse is a collection of packages, and most are speedy. Discussing the rare case that supports an argument but neglecting the numerous others that do not support it is the definition of cherry picking.
And I don’t expect RStudio to say, “we have this collection, which works fine for most users, except in this case you should replace package x with y, or in this corner case you might like package z.”
RStudio doesn’t need to promote a fragmented ecosystem if they don’t want to, it won’t cause the death of R.
See my other comment, but I would be curious to know how many users actually could detect the speed difference between a data.table and a tidy solution. Speaks to how small most datasets really are IMO
He was careful how he said it; thus the many disclaimers about being a fan of tidyverse in general. I think you are underestimating the effect of tidyverse being taught. Also, RStudio is absolutely free to push whatever packages they want. Same as R users in general are free to use any package they want. But if most people learn the tidyverse, that's invariably what they will default to, whether people like it or not.
Sure, he puts the required disclaimers about respect in the beginning so that way he doesn't get accused of being hateful. But then he accuses RStudio of doing "an end run around the core R leadership team," and that what they are doing is "bad for the health of the project," which are pretty strong words.
RStudio isn't on the leadership team. If the language ends up dying (which it won't, R is dug into its place like a tick), it would be due to the leadership team's refusal to adapt to good ideas coming in from the community.
I teach and consult on R and data science. I had the privilege of learning R from one of R's core developers. My students often ask why I don't use the Tidyverse. The answer is because I don't need to - I can do everything the Tidyverse does and so much more in base R.
This article only briefly touches on what I think is the biggest issue with the Tidyverse. The Tidyverse is incredibly limiting. The "tidy" workflow hides many details of the language, which leads many users to think all data has to be in a tidy "data.frame" (or now tibble) and organized just so. The functions the Tidyverse authors have chosen to implement are seen as the limits of R's capabilities. Every data problem has to be forced into the Tidyverse box (or it is impossible). I cannot tell you how many Tidyverse scripts I have seen where 90% of the operations are munging the data into the correct format for the Tidyverse functions, when the original data can be handled with a single base R function (e.g. lapply). Most Tidyverse users accept this as just the way R is.
Most R users who only learn the Tidyverse never hit its limits, and for them the Tidyverse is perfectly OK. Those that do either resign themselves to the perceived limits of R, or (hopefully) start learning some base R. One friend who took the latter path once exclaimed to me "Logical subsetting is amazing!?!" - this is a foundational piece of the language. That she went 2+ years in the "Tidyverse" without even knowing it was an option was eye opening to me.
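For anyone following along, "logical subsetting" here just means indexing with a logical vector, e.g. (illustrative only):

mtcars[mtcars$wt > 3.5, ]     # rows of a data frame meeting a condition
mtcars$mpg[mtcars$cyl == 4]   # elements of a vector meeting a condition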
To its credit, the Tidyverse is empowering - with very little programming knowledge a beginner can do a lot of data science. However, the majority of these people get stuck as "expert beginners". Some of these become fierce advocates of the Tidyverse without truly understanding base R. Meanwhile, none of the truly advanced R users I encounter use the Tidyverse.
"Meanwhile, none of the truly advanced R users I encounter use the Tidyverse."
This is most likely because truly advanced R users have been using R since far before the Tidyverse existed. A whole new generation of R users is being brought up with the Tidyverse, so I'm curious to see how the situation will be in 10 years' time.
This is not entirely true - many are of the same "generation" as me. To be clear, I don't actually consider myself an advanced user. To me, many of these advanced users are pushing against the limits of the language in a way you don't see with run of the mill data science.
That said, I have heard my mentor say "I don't get the point of <insert tidyverse function> - I did this 20 years ago."
> That said, I have heard my mentor say "I don't get the point of <insert tidyverse function> - I did this 20 years ago."
Tidyverse has a tendency to promote "new" things without any references to what came before. If you listen to their talks they speak as if they invented functional programming and the idea of pure and tiny functions working together.
And it trickles down to the users. A lot of people who learned tidyverse first for example, praise the `purrr` package, but have no idea that something like `Map()` is in base R.
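E.g. (a small illustration, not from the parent comment):

Map(function(a, b) a + b, 1:3, 4:6)   # base R: a list-returning map over one or more inputs
purrr::map2(1:3, 4:6, ~ .x + .y)      # the purrr spelling of the same thing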
I run a large scale national survey. We download the data from our survey platform. Survey respondents are asked 100+ questions. The questions change week to week and so the column names are not consistent. We exclude respondents who appear to be cheating the system (rushing through questions, straight-lining, skipping almost every question, etc.) As part of our completion check, we want to do a row-wise map of the data frame e.g. apply(respondents, 1, completion_function). For the sake of this post, let's say that the desired completion function is simply to return the number of NAs in the row -- sum(is.na(x)). I don't care about the name of the variables, I just want to treat each row as a vector one at a time, then perform the operation, and return it as a mutated variable.
When I wrote code to do this, the preferred tidyverse routes would either be to 1:nrow(df) %>% map(function(index) { row = df[index, ] }) or else df %>% pmap(function(.... laundry list of variables here)) { }. A brief Google shows this has gotten worse in the last few years as dplyr deprecated rowwise operations. I do see stack overflow posts of people writing their own tidy/pipe-friendly row-wise iteration functions, but nothing official.
Maybe I missed something. I am a reasonably competent R programmer and package author, but I don't live and breathe tidyverse data-wrangling the way some people do.
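For reference, the completion check described above in base R, independent of what the columns are called (n_missing is just a hypothetical name for the new column):

respondents$n_missing <- rowSums(is.na(respondents))
# or the apply() form from the comment: apply(respondents, 1, function(x) sum(is.na(x)))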
You are right that tidy data is in a different form than many supplied tables are in.
If I understand correctly, you want to know how many NAs there are in each row (respondent) of a wide-form dataset (as opposed to a tidy dataset).
# One line to make the data tidy.
# The form of data will be 3 columns: id, question, answer, and no, we don't care what the columns are called, except for id.
tidydf <- df %>% gather("question", "answer", -id)
# one line to do your check
tidydf %>% group_by(id) %>% summarise(n_NA = sum(is.na(answer)))
Tidyverse is highly opinionated about its data structure, and it is one of its limiting factors, as it basically treats every dataset as a sparse dataset. This actually fits very well with your data, as a datapoint is not a fixed questionnaire, but rather a datapoint is a respondents answer to a question (as questionnaires vary in questions, a tall table layout is quite fitting).
From there on you have to think in groups and summaries, unless you wanna fight the library.
Tidyverse is an 80% datascience solution. It solves what you need 80% of the time really, really well, and the last 20% you either have to fall back to base R or really torture dplyr.
I agree that an alternative way to do this would be to mutate an ID column (maybe row number), then gather, then summarize, except that this of course will throw away all the rest of the data, so not great if all you want to do is add a column. Hence, I normally map rows or use base R.
Thank you for providing this example! I share your experience that rowwise operations seem more difficult to program using the tidyverse than using apply.
Right, which is why my example used its cousin, apply, instead of lapply. apply over the row margins of a data frame does not have an equivalent in tidyverse.
As someone who uses and teaches R extensively (and loves the tidyverse), I find the tidyverse so much easier to teach to people without any programming experience and who have very little faith in their tech or maths skills (which was me as well).
The tidyverse just 'made sense' to me when I started using R for the first time a few years ago, and now I love using R and programming. On the other hand, some of my ex-classmates learnt base R (because that's what we were taught) and found it hard, didn't learn anything properly, and now still think R or other programming languages are opaque and hard.
I'm not particularly fussed if Statistics Profs prefer data.table to dplyr or base R to tidyr, I know what is easier to teach, understand and use for me and a lot of other ecology/bio students and people.
I concur with your sentiments having cultivated data science teams from the ground up with diverse educational backgrounds.
Programming in base R is more akin to assembly language and has accreted a babel of inconsistencies that make it difficult to teach and learn. Learning base R isolates you into a Galapagos island of academics who are either ignorant of the needs of data workers or too elitist to engage with those not in their priesthood.
Learning Tidyverse is a considerably better transition for learning other languages, frameworks, and libraries.
Functional programming is closer to algebra than indexing into data structures with magic numbers. I've found more success teaching functional pipelines of data structures using the idioms in Tidyverse as a general framework for data work than base R. Abstraction has a cost but for learning it is the appropriate cost.
I sense that much of this `monopolistic` fear mongering is really about feeling out of date.
"I know what is easier to teach, understand and use"
I think this really depends on the end point. If you want to learn to read data into R and do basic manipulations, plotting, modeling, etc., the Tidyverse absolutely has a lower bar to entry. Once you get into writing functions, it gets a little trickier. Knowing some of the base R programming concepts and skills will make you much more efficient. If you start debugging and profiling code, only knowing the Tidyverse becomes a liability because you fundamentally do not understand R's computational model (the Tidyverse does not follow it). Hence, if your end goal is to write and debug functions in R, the steeper learning curve of base R can more than pay off. If not, then the Tidyverse's low bar to entry can be more attractive.
I was waiting for the critique... but I never quite saw it.
Imho, the data problems Tidyverse is trying to solve are basically the ones we face in a database. So, select, join, inner join and so forth. Show me all the rows in this data table where the 4th column is larger than the 6th column and the number itself is odd. Something like that.
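For concreteness, that query in both styles (col4/col6 are hypothetical names standing in for "the 4th and 6th columns"):

library(dplyr)
df %>% filter(col4 > col6, col4 %% 2 == 1)

# base R, by position
df[df[[4]] > df[[6]] & df[[4]] %% 2 == 1, ]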
There might be other ways to do it, but you want your select, filter, summarize, mutate etc functions to all work with each other, pipe to each other and be compatible.
Maybe there is a better way to do all this -- I haven't seen it but I am not an expert -- but you have to show that to me.
So, in base R, walk through a set of example of mutating, joining, filtering and so forth, and show me how they are all easier. Then I'll say, wow there is an alternative to this Tidyverse thing. But in lieu of that demo, this felt more like an intro to a complaint than an actual complaint.
Edit: Also, it's funny that Wickham is (apparently) such a nice fellow that people go out of their way to be nice to him in critiques.
His critique is more about the impact of the full ecosystem effect of the Tidyverse, not what you are referring to, which is just the dplyr semantics. The Tidyverse demands that its many related packages use tidy data principles and lock users into that approach, which differs from base-R. Much of this discussion is really just a debate about dplyr and magrittr rather than the fragmentation that the broader tidyverse has brought on. All that said, I agree with many commenters that the Tidyverse's improvements to speed of development can more than offset the speed of execution issues, at least for small-to-medium datasets.
I think anyone making the point about speed of development has seriously missed the boat. Let's be real: tibbles suck. Once you get the hang of data.table syntax for matrix operations, its superiority becomes abundantly clear.
I run a data science group at a large geospatial company and we develop day in and day out in R and Python. We've purged tidyverse as much as possible from all of our code base. We've moved completely over to data.tables, which make the vast majority of the tidyverse irrelevant.
Let’s keep the discussion civil please. People have legitimately different needs, and just because a package isn’t well suited to your needs doesn’t mean that it doesn’t help people with different backgrounds and goals.
I initially viewed this article with skepticism, rather liking the functional style of the Tidyverse.
However, the claims the author makes are entirely valid. data.table is a fantastic package, and easy to learn. Tidyverse really has split the R community into disparate sets of R users.
To my mind, one of R's greatest strengths is its metaprogramming capabilities. It is this that allows such broad paradigms to exist within the same language. In that, the challenges R is presented with are similar to those of languages like Lisp dialects and Scala.
I don’t have solutions, but the author has convinced me that a problem is indeed there.
Wow, the idea that the Tidyverse is somehow burdened with an overabundance of computer science or functional programming is just hilarious to this programmer who has had to maintain, debug and optimise other people's R code.
I think this is what the author is getting at, that correct tidyverse usage requires greater experience and knowledge and creates difficult to debug and optimize code when used incorrectly (and is easy to use incorrectly).
I tend to agree. I've found tidyverse code has a write-only quality to it. Since I'm going to see more of it I plan to dive into the inner workings of at least dplyr and purrr.
That said it is hard to deny what Hadley Wickham has done for the R community and I can't write off the idea that he sees a bigger picture I'm missing.
Tidyverse code is not write-only; it is designed to be mostly read-only where the level of abstraction is at the level of the domain, which in this case is data frames and data pipelines akin to the same constructs in relational algebra/SQL or data processing (map/reduce).
Correct usage requires learning the right abstractions, but fortunately these abstractions are shared across language communities and frameworks.
If you are doing any data processing work and you do not know about foundational functional concepts, i.e. map/reduce, then I would argue the code you write is less readable, overly focused on the idiosyncrasies of how to do something rather than how entities and data are related to each other - their logic and structure.
The main optimization is to be correct and expressive of one's intention and purpose. If you need performance use Julia.
It’s very hard to write expressive, readable code that munges some horrible tabular format into another arbitrary tabular format, to be fair. That said it’s true that most R code doesn’t seem like map/reduce, or some other pipeline of transformations. It’s usually more like someone cut and pasted a long, hard REPL session into a notebook and has no intention of ever scrolling back up.
I understand that's the goal, but it hasn't been what I've seen from user's code. I think the issue, which the article describes, is that the tidyverse has too large a surface area for something aimed at very smart people who are not software engineers and don't have time to become one.
So they end up with a subset of functionality they can get stuff done with, but then they need to collaborate with someone who uses a different subset. That can be messy.
I think in an environment where everyone learns R from the same course/book/MOOC/whatever (or have done lots of different sorts of programming) and the organization can impose a style guide the tidyverse approach would be great, but when you have people coming from all sorts of places and backgrounds I don't think it's a good fit.
This is more a symptom of the fact that nobody has told R programmers that they are software engineers and the thing that they’re creating is software first, and not just the plot that comes at the end. No library can really help that.
What’s really crazy is I never realized how fast and easy excel pivot tables were to work with.
I know I know it sounds ridiculous.
But if you do a lot of data slicing and dicing, Excel can actually get your cuts out way faster through pivot tables than writing R code.
So if you use R for almost everything, give Excel a try as well. And the nice thing is this is even more powerful if you work with non data science folks, because then you can say "I just did this in Excel" (where Excel skills should be required of everyone in the company, and R of the data scientists), and they'll probably just leave you alone the next time they need something, since they will try to do it in Excel first.
Agree 100%, but only if it is a one time activity. If you have to automatically pull files and do the operation several times, it is better to go with R (or python, or awk sed or whatever).
Most of these tidyverse vs. data.table arguments end up sounding a bit irrelevant to me. They tend to focus on syntax, which is largely down to preference, or performance, which is really only important once in a while. And they seem oddly focused on the need to choose one or the other, when it seems obvious that all are good choices with various tradeoffs.
On the other hand, it's nice to see some criticism of RStudio and Hadley, even if it does stray into the conspiratorial. They've done a lot for the R community, but they do have some obvious blind spots.
It's not that RStudio has nefarious motives that cause harm, it's that they have neutral and sometimes benevolent motives with unforeseen consequences.
At least in my case, although I do like some of the tidyverse (ggplot is tidyverse after all, and probably my favorite plotting system in any language), I find the "tibble" data structure and its disdain for row names really annoying. Many, many packages which I use expect a data frame with row names and likewise output such. So I am constantly converting back and forth between tibbles and data frames.
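The usual round trip, for reference (a sketch, using mtcars as the stand-in):

library(tibble)
tb <- rownames_to_column(mtcars, var = "car")   # row names become an ordinary column
df <- column_to_rownames(tb, var = "car")       # and back to a classic data frame with row names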
Seeing as the tidyverse is pretty much your invention, it's well within your purview to define what is in and what is out. I do think, respectfully, that both the OP and the general sentiment of the article have the right of it.
Data.table and data.table-esque notation represent such an improvement over tibble/dplyr that, within my company, we're making a concerted effort to purge all tidyverse packages from general use (less ggplot). When new developers come on, if they are coming from the tidyverse, their first task will be something involving pipes and data.table. Tidyverse was fine in school. It doesn't pass muster in production, at least not in our work.
Data.table syntax is simpler, easier to read, easier to teach, and orders of magnitude faster. It plays nicer with other packages than the tidyverse (if it fits into a DF, it almost always fits into a DT, and I've never met a tibble that I didn't wish was a data.table), and since almost all of our datasets are tens to thousands of millions of rows long, the decision was really made for us.
If you find data.table more useful, you should by all means use it.
My greatest regret about coining the word tidyverse is that for some reason people seem to think it’s a monolith. It’s not; you’re totally free to pick and choose whatever parts of it you find useful.
It doesn't hurt my feelings if packages that I helped write aren't the perfect fit for your problems. Use whatever makes you happy :)
Sure, with your amount of data you need data.table. But that is your specific use case. That has nothing to do with dplyr not being production ready, just that it is not the right tool for you. Separate things but somehow programmers love to consider them the same.
I probably should have phrased my comment better. My point was that it’s perfectly possible to use ggplot without explicit knowledge of tidy principles or tibbles. This is great! And for example, I’m reading through your ggplot book and it doesn’t make reference to it.[1] In my work, we use data.tables (we believe we need the raw performance) with ggplot for visualization and are not using the rest of tidyverse and it works fine.
[1]: In recent versions this seems to be slowly changing.
Pretty much all of the individual components of the tidyverse were created before the tidyverse since it’s only 3 years old.
But there’s no reason to use only the tidyverse. That’s not something I’ve ever recommended and it would be extremely hard. I just object to people claiming that some of the most important parts of the tidyverse aren’t actually parts of it.
I am a fervent user of data.table and try to avoid dplyr whenever possible. I have colleagues who are exactly the opposite. No one cares.
If you want efficient code, try not to use dplyr. If you want readable code, data.table isn't the best answer. If you want speed, there are better languages than R to choose from, even with Rcpp.
If you want to get something done, go with whatever you know best and will do the job.
I recently tried to do some stuff in R with tidyverse, and was not a fan. I'm no expert, but the tidyverse's frequent use of nonstandard evaluation drove me crazy. It makes it so much more difficult to write functions encapsulating tidy functions. However, I never see people complain about this, so maybe I'm doing something wrong or not grokking something...
It's because RStudio are pushing R/tidyverse as an interactive datascience environment, so most of the tutorials you see will be aimed at users sending a few lines a time to the REPL. Programming with the tidyverse and its non-standard evaluation really is a pain.
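A minimal sketch of the wrapper problem being described and the enquo/!! incantation it takes to fix it (assuming dplyr's tidy-eval tools, which are re-exported from rlang; the function and names are illustrative):

library(dplyr)
library(rlang)

mean_of <- function(data, col) {
  col <- enquo(col)                                      # capture the bare column name
  data %>% summarise(avg = mean(!!col, na.rm = TRUE))    # splice it back into the verb
}
mean_of(mtcars, mpg)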
A friend pointed out to me that the author a few weeks ago put out a Python versus R comparison where he makes the rather incredible claim that the addition of some opinionated packages in the R ecosystem is more disunifying than the 2.7 to 3.x split in Python:
With respect, it's laughable to think that anyone working on big data with R has _not_ heard of `data.table`. When your livelihood depends on it, you tend to find it out one way or the other.
However, I'd claim most folks don't deal with big data or even medium data! [Aside: I chuckled at the 1e5 columns on the plot]. Personally, at the small-to-medium range, all languages are fast enough. It boils down to ergonomics for me. I _detest_ the subscript notation. It slows me down -- not just productivity-wise but also how I read or reason-about my code. Not surprisingly, I prefer `dplyr` over `data.table` (base R or `pandas`)
When writing the book (www.anotherbookondatascience.com) I always try to avoid using the tidyverse, and I got some criticism for that. But now I feel supported.
That's a lot of added complexity just to be able to use bare column names, and to top it all off, the use of non-standard evaluation makes things very complex when one goes a little beyond basic EDA.
Well, strictly literally, for this case the tidyverse command would be:
mtcars %<>% mutate(pounds = wt / 1000)
Which is obviously simpler, and quite likely what a beginner is actually trying to do.
The real complaint - "tidyverse doesn't let us use variables to name columns!" - is fair enough; following quasiquotation is hard work. If you need that, drop back to base R. However, base R is harder to teach because all the common operations (select, filter, etc.) will involve some combination of []/[[]], and it is relatively hard to figure out which one a piece of code is trying to do.
Except for the lucky few who have brains wired to think in terms of arrays and database relations, someone learning the language is unlikely to thank you for that.
Tidyverse was nearly the only thing I really liked about R. The syntax is quirky, it's full of cruft, and unpleasant to make full programs out of. But if I used a new package, I knew instantly how to use it.
One reason dplyr was promoted is that you can connect to different server backends like Spark, SQL databases, etc., which is the enterprise direction RStudio is aiming at. On the other hand, data.table is your friend when you are using your own machine. With more and more memory and CPU power available, its benefits are actually increasing.
I have been using data.table from almost day one; it deserves more recognition. The syntax is a little hard in the beginning but can be grasped with some effort. I often feel that my exploration code would look much more tedious if written in dplyr style.
People who were big into R pre-Tidyverse have this tendency to view the way they manipulated data frames pre-Tidyverse as the "basics," and Tidyverse as the "advanced" method. I think it does anyone learning R for the first time an incredible disservice, for the reason you state: dplyr is fantastic. But the old guard is (rightly) afraid that if people start with dplyr, they'll never get around to learning what base R provides for manipulating data frames. Which is fine in my view - nobody should have to learn that.
I think the same thing is happening in Python for data science due to Pandas. It’s a great package, but many data scientists are only able to manipulate data using Pandas, and know little “base” Python.
I've been planning to go from "can work it out with copious examples" to "knows when to use apply, transform, etc. off the cuff" level of knowledge in Pandas - do you have any suggestions on good resources on this?
I'd like to understand better why the data structures work the way they do and thus have an intuition on what operations to use when. The O'Reilly Python Data Science Handbook[0] seems like it might be useful here, but I'm not sure if it is still up to date.
I find the pandas documentation pretty good [0]. Actually, I find the documentation, official tutorials of pandas, numpy, matplotlib all pretty good. Each package has their own idiosyncrasies but I think the documentation and code examples cover them well enough.
The O'Reilly book should not be that outdated, if at all. It also covers other essential tools.
> Ironically, though consistency of interface is a goal, new versions of Tidy libraries are often incompatible with older ones, a very serious problem in the software engineering world.
I was (sadly) amused by a recent comment on HN linking to some tweets celebrating that some R code still ran four years later.
That was my comment, and the point wasn't that the R code still ran four years later--the point was that R Core's focus on backwards compatibility (in reference to R being hamstrung via its focus on "compatibility with S") isn't a bad thing. You're never going to get the 2.x - 3.x schism that Python had with future versions of R, assuming R Core continues with that philosophy.
I completely agree with your point and I think saying that R is keeping compatibility with S doesn’t make much sense. R is keeping compatibility with R, as it should.
Maybe I lack context, but those tweets looked to me as if someone said “I can run this four year old game in the latest release of Windows!” or “Three versions of Office later I can still open my excel files!”