Hacker News new | past | comments | ask | show | jobs | submit login
Rich Hickey: Reducers - A Library And Model For Collection Processing (clojure.com)
240 points by swannodette on May 8, 2012 | hide | past | favorite | 79 comments



I've been using Clojure for a production system for about two years now. It feels so great to use a language that just keeps handing you better tools as you go along.

What's nice is that the new tools are not just syntactic sugar, as in so many other languages. They either adress specific performance pain points (non-rebindable functions and numerics) or introduce new abstractions and tools (reduce-kv is a nice small example).

I love the fact that each time I read about a new thing coming to Clojure, I immediately think "well this will fit right into what I'm building, great!".

Clojure strikes a good balance between nice ideas and practicality.


Clojure is all substance and no glitter. It's all about getting the job done, which is why it's the most important language development in 20 years.


The best way to get “all substance and no glitter” is apparently Rich Hickey’s “hammock driven development”. http://news.ycombinator.com/item?id=1962051

“To arrive at the simplest truth, as Newton knew and practiced, requires years of contemplation. Not activity. Not reasoning. Not calculating. Not busy behaviour of any kind. Not reading. Not talking. Not making an effort. Not thinking. Simply bearing in mind what it is one needs to know. And yet those with the courage to tread this path to real discovery are not only offered practically no guidance on how to do so, they are actively discouraged and have to set abut it in secret, pretending meanwhile to be diligently engaged in the frantic diversions and to conform with the deadening personal opinions which are continually being thrust upon them.” –George Spencer Brown in The Laws of Form, 1969


I would add Erlang to this list. Both are not extremely popular, but their importance comes not from providing features, but from providing a clear path to developing modern software that tackles modern problems on modern hardware.


The implementation is a nice example of clojure protocols kicking ass - the core vector and map implementations needn't know anything about implementing coll-fold as the protocol definitions can be added within the reducers library.


Just wondering out loud... I'm interested in how this relates to the existing implementations in core.

For example, if we went back in time, would all of the core functions have been implemented this way? Would this be a possible drop in replacement in the future? Could future versions of Clojure integrate these ideas more deeply? If so, what are these backwards compatibility concerns?


> For example, if we went back in time, would all of the code functions have been implemented this way?

I don't think so - the existing implementations work on the higher-level abstraction of sequences. Reducers are optimized parallel versions that work on collections. While parallelism is extremely useful in some parts of your code, there is overhead and I don't think you would want either the overhead or the restriction of working below the sequence abstraction in the general case.

I seem to see some of the same choices being made available in the new Java 8 collections and parallel operations work. That is, it is up to the developer when to "go parallel".

For an entirely different approach, check out Guy Steele's Fortress language which eagerly executes most things in parallel by default (like all iterations of a loop, arguments to a function, etc) and you have to tell it not to do that.

Guy's Strange Loop 2010 talk is an interesting complement to this work: http://www.infoq.com/presentations/Thinking-Parallel-Program...


Well besides the implicit parallelism of fold, wouldn't this be generally useful to reduce intermediate list creation? Or do lazy-seqs already solve that problem?


Yes, reducers provide that benefit to sequential reduce, independent of the parallelism of fold.


So then, what downsides, if any (I'm sure there are), would there be to moving this reducers model to being the "default"?


When you call a reducer like r/map or r/filter, the result is reducible but not seqable. So, if you have an incremental algorithm, are calling first etc, the seq-based fns are a better fit. Also, lazy fns automatically cache whereas reducers recalc every reduce, until you save the realized result. They are genuinely complementary in many ways.


1) What is the benefit of recalculated on every reduce? Is this so you can use side-effects?

2) If seq or first is called on a reducible, wouldn't it be easy to just implicitly realize the reducible in to a sequence first?


1) The benefit is that you don't have to cache the results in a data structure, which really slows it down. Suppose you map the function (fn [x] (+ x 1)) over a reducible, and then you sum it by reducing it with +. With reducibles, there is no intermediate allocation, and it will run really fast especially if the functions can be inlined. Compare this with mapping and then folding a lazy seq: map builds an intermediate data structure, and reduce immediately consumes it.

2) That's possible, but it makes it too easy to write code with abysmal performance because of (1). The common case is that you call both first and rest on the reducible. If both turn the reducible into a seq first, then both will take O(n) time in the best case (might be much worse depending on how the reducible was built up). Combine that with the fact that most times, you're going to recurse on the rest, and you've got an O(n^2) algorithm where you expected O(n), if everything is based on reducibles. Additionally, it's impossible to take the first or rest of an infinite reducible (well, perhaps you could do it with exceptions -- in general you can turn a reducible into a seq with continuations).


In the reducible model as described, it's not possible to implement some operations efficiently, like map on multiple collections (this is called zipWith in Haskell). Also, if you filer/map/etc a function over a reducible and then use the resulting reducible twice, it will cause the filter/map/etc to be performed twice. This can lead to performance surprises, for example it can turn a linear time algorithm into an exponential time algorithm in some cases. Thus this is not a drop-in replacement for seqs, although it can probably cover most uses more efficiently.


Fold is implicitly sequential. See Guy Steele's ICFP 09 presentation on conc lists, previously discussed here as http://news.ycombinator.com/item?id=814632.


foldl and foldr are sequential, but Clojure's fold is potentially parallel.


Rich Hickey proves again that he is among the leaders, if not the leader in careful, industrial-strength application of excellent theory. Bravo.


Rich will be talking about this library at the next Clojure NYC meetup. If anyone is in the neighborhood, feel free to drop by.

http://www.meetup.com/Clojure-NYC/events/56212552/


Will there be a video of this? That would be very awesome.


Oh yes, please! Or at least slides & example code.


Is there no need to tune the number of threads with an approach like this? Or is there a general notion of the appropriate number of threads given the number of CPU cores?


If you're not doing any I/O, the number of threads can be limited to the number of cores. I believe this is the default for Fork/Join.


Correct. ForkJoinPool defaults to Runtime.getRuntime().availableProcessors() threads (but can be adjusted). The reducers library (https://github.com/clojure/clojure/commit/89e5dce0fdfec4bc09...) seems to initialize the pool with the default constructor.


I believe the Java fork/join library handles this tuning automatically... so it's a freebie based on Rich Hickey's design.


Recommended reading: Organizing functional code for parallel execution or, foldl and foldr considered slightly harmful (2009) by Guy L. Steele, Jr.

http://dl.acm.org/citation.cfm?id=1596551


> ...producing the same result as the Clojure's seq-based fns. The difference is that, reduce being eager, and these reducers fns being out of the seq game, there's no per-step allocation overhead, so it's faster. Laziness is great when you need it, but when you don't you shouldn't have to pay for it.

Looking forward to trying this out. I've been implementing another language in Clojure (tho just experimenting for the moment). It's a non-lazy language (Groovy) so I have reduce all over the place, e.g.

  (defn multiply [a b]
    (cond (and (number? a) (number? b)) (* a b)
          (and (vector? a) (integer? b)) (reduce conj [] (flatten (repeat b a)))
          (and (string? a) (integer? b)) (reduce str (repeat b a))))

  (defn findAll [coll clos]
    (reduce conj [] (filter clos coll)))
Hope reducers give a speed boost by removing the laziness I'm not using.


Forgive my simplified response.. Is a way to look at this.. Rather than inside out, the functions compose outside in and then evaluate the reducable as a last step. That gives the benefit, since there's only one step the laziness needs to "devolve" in?


Awesome for the linked talk alone

http://www.infoq.com/presentations/Simple-Made-Easy

If you haven't seen this just take time to watch the first 15 minutes. Really worth it.


If you use a mapped reducible twice, does that evaluate the function on each element twice?


If you call reduce on it twice, then sure, it would do all the work twice.


Jules's point is that the map is done twice, which it is. If you don't want that, you can reduce into a collection and then reduce the collection twice.


The blog post compares this to Haskell enumerators/iteratees, but I think a more direct comparison to something haskelly is to monads.

He says: "The only thing that knows how to apply a function to a collection is the collection itself." Which is like a monad in the sense that the insides of a monad are opaque; you can only interact with a monad through the functions it gives you.

The "map" function from his "reducers" library has type:

fn * reducible -> reducible

(i.e., it takes a function and a reducible and gives you back a reducible)

while monadic "fmap" is a little higher-order and has type parameters, but it does something analogous:

(t -> u) -> (M t -> M u)

(i.e., take a function from type "t" to type "u", and return a function from "monad of t" to "monad of u"). It's a little different in that Hickey's "reducers/map" applies the argument function itself, while monadic fmap gives you a function that will do that.

Of course, his "reducers" library addresses a bunch of other stuff like parallelism, which isn't something that monads themselves are concerned with. I'm just saying that part of the interface to his new collection abstraction is monad-like.


Everywhere you wrote "monad" you should have "functor". A monad is a functor with extra structure relating to how elements are inserted into the "collection" in the first place. Functors only discuss how functions are applied to elements already collected.


This will go nicely with Storm.


I thought storm worked on infinite streams? Does this support that?


It should. It depends on which reducing function you choose -- the map step doesn't consume the stream, it just set up the mapped function to be called as you reduce. If your reducing function is lazy, then it works on infinite streams.


This works on collections, which are not infinite. If I read correctly, sequences (a higher level abstraction) fall back to their curent implementation.


I believe the existing pmap and preduce functions work in parallel over lazy streams by chunking work and parcelling it out. Depending on your use case, this is not necessarily ideal.


Why not just do relational programming - prolog or sql. The lisp weenies still don't get it.


> Why not just do relational programming - prolog or sql. The lisp weenies still don't get it.

John McCarthy was the first to recommend using logic in programs with his advice taker proposal. This inspired the logic programming paradigm:

http://www-formal.stanford.edu/jmc/mcc59/mcc59.html

Paul Graham (creator of the hacker news site) has an entire chapter in his book On Lisp on embedding prolog:

http://www.bookshelf.jp/texi/onlisp/onlisp_25.html

Peter Norvig's Paradigms of Artificial Intelligence programming includes an implementation of prolog in Lisp:

http://norvig.com/paip/prolog.lisp

Common Lisp relations: an extension of Lisp for logic programming proposed a mechanism for logic programming back in 1988:

http://www.lw20.com/2011051092557312.html

Franz. Inc is developing Allegro Prolog as an extension to Common Lisp:

http://www.franz.com/support/documentation/8.2/doc/prolog.ht...

The Clojure programming language includes relational programming functionality through the core/logic module:

https://github.com/frenchy64/Logic-Starter/wiki

Racklog is an embedding of Prolog-style programming in Racket.

http://docs.racket-lang.org/racklog/

The Shen programming languages includes a fully functional embedded prolog system:

http://shenlanguage.org/learn-shen/prolog.html

CycL is a declarative ontology language based upon classical first-order language which includes support for ontology components including parts and relations:

http://www.cyc.com/cycdoc/ref/cycl-syntax.html

What is it that we don't get again?



There's even a parallel branch built on top of fork/join that needs more work: https://github.com/clojure/core.logic/tree/fork-join


So basically a thin sliver of what an sql rdbms provides.


Your "thin sliver" idea is intriguing. Maybe we can call slivers "simple components" and build bigger things out of them.

I wonder if there would be any benefit to that?


If 'components' were that useful for the things rdbms's are used for the thousands of highly trained phd's at oracle, microsoft, ibm would have added them to their sql products.

Of course, this a ROR crowd, software invented by a business school grad/game review writer, so it's can be an uphill battle explaining this stuff.


For those keeping score at home:

* argumentum ad verecundiam * ad hominem


If you think what a PhD gets you is "training", you've tragically misunderstood the point.


You say that like it's a bad thing.

Clojure tends to emphasise simple components that do specific tasks; so core.logic focuses on logic programming and not, for instance, persistence or atomicity. Because it's so focused, it's more capable in its particular area of expertise than a more general tool like SQL.


I think we will see a Datalog made faster by parallel reduce/combine real soon now. :-)


I'm getting there. :)


I'm not a supporter of Lisp myself, but still interested in what they don't get.


Well, basically, most applications that you want to write are basically 90 percent rdbms with a thin layer on top for the views (usually php etc.). The 'power' of lisp isn't relevant. Other types of applications are usually performance critical and written in c etc, or glue code written in languages with libraries tcl/perl/python etc.

So basically the traditional niche of lisp, complex apps, has been replaced by sql.

There are apps written in lisp, like this site for example, but that's just for vanity mainly (pg is famous for his lisp books) - it would be have been easier to write it in sql.


What on earth are you talking about? Are you a SQL consultant or a professional Troll or what am I missing here?


He's a big data consultant, and all he has is a hammer :-).


I've explained a bit longer than your 'wtf'. Perhaps you can explain to me applications where lisp is the best fit?

Almost every website or software running a business I can think of runs sql. The very small few that don't, google for example, run on c++ and assembly.

So, in 2012, with gigs of ram and postgres, why would I use lisp?


Dude, you advertised on here 36 days ago that you are a Lisp contractor (https://news.ycombinator.com/item?id=3784418) -- WTF?


Why WTF! I'm looking for AI type work, which is much much rarer than sql work, which was my point. You wouldn't use lisp for that either (matlab/numpy followed by c optimizations is used there).


Because 1 hour ago you said:

Why not just do relational programming - prolog or sql. The lisp weenies still don't get it.

And now you've peppered the entire thread with non sequiturs and off-the-wall droppings. Please take a break.


I'll plead WTF on that.

I'm simply trying to tell the kids on here that SQL is great and ROR/LISP/NoSQL worship around these parts is misguided. No 'droppings' in that.

Yes, I write Lisp on my resume to make me look like a stronger candidate. I learnt it a few years ago for fun, that's all (I'm not a lisp flip-flopper). Just want to put the advice out there to be a relational weenie (instead of lisp weenie).


Your point is a complete non-sequitur in this thread. Lumping Ruby on Rails, Lisp and NoSQL together is non-sensical. One is a framework for building web applications, the other is a family of programming languages with over 50 years of history, and "NoSQL" is a bad name for highly scalable data storage methods which abandon the relational model.


What planet do you live on? Your comments are comprised entirely of droppings. This post had nothing to do with any of those topics until you arrived; it's about a way to abstract operations over collections of data that has some cool properties. You then came in unprompted and attempted to troll the whole world about your pet peeve.

If you have something to say about the actual post, then please say it. Otherwise, get out.


Possibly a chatterbot with all the non sequiturs and AI references and previous postings about "bots" (http://news.ycombinator.com/item?id=808240).


Because many programmers are not in the website business. I'm a scientist, working with long-term simulation models. My problem is trying to find a good model representation of the data that I observe. I keep my eyes open towards probabilistic relational programming languages (such as http://www.openbugs.info ), but so far I haven't found a way to apply it to my work.

Lisp is useful to me (more useful than ruby or C was), predominantly because I can work with emacs/swank/slime and change and query my program while I'm observing its output.


How come you're not using matlab/numpy?


I think Twitter is using Clojure now that they've acquired BackType.


I think you have a good spirit to what you're saying. Everything is Turing Complete here, and yes with views and stored procedures and the like, and normalizing everything most apps would indeed just be a thin layer over top of these forms. However, set theory isn't exclusive to tables, nor even tree or document stores, and often with online systems like how a previous person mentioned Storm, you might not have the luxury of the overhead of a whole interpreted SQL stack, but may only need linked lists or something more bare, and closer to logic. I think you're not being downvoted on merit of the idea, just perhaps the quickness as classifying this work, even if it is correct, as something that is just more of the same. Everything in computers is arguably just more of the same. Just my opinion!


Even when all you need is a thin layer over SQL, a language with macros helps you keep your code cleaner, e.g. http://brl.codesimply.net/brl_4.html#SEC31


I applaude your attempt to say that SQL is "real world FP".

But you made your claim too big.

SQL (even with recursive CTE) is still just a DSL for relational data. Even more: DSL for relational data only.


wrexsoul? Is that you?


It seems like everything I read about Closure gives me a reason to use it, but a slightly stronger reason not to.

I'm happy to see a language putting this approach to collections into its core libraries and even combining it with ideas about parallel processing of data structures.

On the other hand, the whole thing is written as if Rich Hickey had an awesome idea, wrote some awesome code, and is now sharing his awesomeness with us. It's kind of a lost opportunity to give credit to the people who gave him the ideas (and maybe the people who helped him write the code, if there were any) and it's kind of a turn-off.

One good, prior write-up about reducing as a collections interface is:

http://okmij.org/ftp/papers/LL3-collections-enumerators.txt


I make no claims to novelty, and the blog post does link to http://www.haskell.org/haskellwiki/Enumerator_and_iteratee, the most similar work and a clear influence. If more people knew about Iteratee, it would be worth spending more time talking about the connections and contrasts, but they don't, and knowledge of it is not prerequisite to understanding reducers. No one helped me write the code.


Isn't foldr/build fusion much closer? A collection is represented by a "build" function that takes the reducer, and list transformers become reducer transformers. The main difference is that it's applied automatically by list library using rewrite rules, so it's not as obvious, the reducer is supplied as a pair of "cons" function and "init" value rather than a variadic function, and there's no parallel fold.

http://research.microsoft.com/~simonpj/papers/deforestation-... The first paper (yes, that's a .ps.Z - check the date)

http://www.scs.stanford.edu/11au-cs240h/notes/omgwtfbbq.html some recent slides, which also include a bit about a "stream fusion" which isn't yet in the standard library

http://darcs.haskell.org/cgi-bin/gitweb.cgi?p=packages/base.... The details from GHC's libraries - it's all in the RULES.


> and there's no parallel fold.

When working up from the bottom it might seem that this is just manual stream/list fusion. But the parallelism is the key prize. We need to stop defining map/filter etc in terms of lists/sequences, and, once we do, there is nothing to fuse, no streams/thunks to avoid allocating etc, because they were never essential in the first place, just an artifact of the history of FP and its early emphasis on lists and recursion.


Yes, foldr/build is almost exactly reducibles, but not foldables.

Iterators do nothing for parallelism either.


First, thank you for implementing and promoting this approach to collections.

I did see that link, because I liked the concept enough to read it all the way through without knowing Clojure, but that's not quite what I had in mind. You're right that most people have never heard of that library, which is why the way you presented it will leave most people with no idea that it was an influence (even if they read that far). That's something you could have just said, in one sentence, without getting into detail about how it was different.

I'm not really trying to criticize you for saying or not saying certain things, and I don't think you did anything wrong. Not really acknowledging influences is just a symptom of what turned me off. I feel like this post was written from a sort of aggressive fighting-for-popularity mindset that I'm uncomfortable with in a language.

EDIT: missing word


This can't really be true because there has never been anything "clear" about enumerators and iteratees. Maybe a dozen people understand what's going on there.

Also, most of what's special about enumerators is the gymnastics needed for Haskell's type system, which isn't relevant to Clojure at all...


This is a blog post, not an academic paper with citations. It's about conveying a message: "Here's this new thing I made. You can use it. Here are it's pros and cons."

One of the nice things about Clojure is that it's a lot of great ideas from a lot of great people, packaged up with Rich Hickey's particular brand of digestible design.

Having watched most of Rich's published talks, it's apparent to me that he is quite humble about the fact that none of his ideas are that novel. He simply takes credit for this particular arrangement and takes pride in his design approach. His design sensibilities really resonate with a lot of people, hence Clojure's popularity.

On a personal note, I wish I could give credit to all the giant's whose shoulders I have stood on. However, it's simply not possible to remember all those details. Sometimes, ideas take days, weeks, months, YEARS to bake. Sometimes, someone tells you the solution to the problem, but you don't get it. Then 5 years later, it comes to you in a dream.


Fair enough and absolutely true in principle, but I don't think "here are its pros and cons" is the message being conveyed.

Rich Hickey might be very humble in person, and I don't think this blog post is necessarily arrogant (but you're also not going to convince me it's humble).

I'm not really complaining about a lack of humility or even about Rich Hickey. I just feel like Clojure is one of those languages caught up in a sort of Cult of Awesomeness, where people feel obligated to lionize the pros and downplay or omit the cons and the contributions of others.

I don't really want to use a language that's being driven by that. Maybe I'm mistaken about Clojure fitting that description, and maybe meeting Rich Hickey would set me straight, but I still think it's a reasonable perception to take away from this blog post.


You dont want to use a language where people build the most awesome thing they can think of? The people are all about building cool stuff but if you ask any of them what is bad you will hear them becoming very very critical.

Do you expect Rich or anybody else to write a blogpost that goes 'I wrote this library and here is why it sucks'. EVERYBODY writes blogpost on why what they themselfs build is good, I dont see any diffrence between clojure and other languages (or communitys).




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: