
While cool and interesting, this strikes me as exactly the type of overengineering tech-heavy startups keep being told to stop doing. I get the argument that it'll "give you a platform to build on" and help you "move fast later," but how much did you spend engineering a storage library that doesn't solve a critical business need? $10k? $20k? How many people spent hours and hours building another Clojure-based abstraction layer over key-value stores?

I also don't buy the argument that it'll help you be really "scalable". Once your datasets no longer fit in memory, it's all custom. Hadoop is a great solution if you have infinite money. Data locality matters. All those different services may expose a consistent-enough interface that you can build a common abstraction over them, but their latency and reliability properties, failure modes, consistency guarantees, etc. are not homogeneous, and at scale you can't pretend that they are. If your data fits in memory today (and that means 144-192GB per machine), then why are you worrying about this? You'll need to rewrite a huge part of your infrastructure to scale from 100k users to 10M users anyway.

TL;DR: solve the problem first and abstract out the framework later. Also, "scalability" is incredibly expensive (and unnecessary) to engineer for a priori.




I'm not quite sure you got the gist of the post. The library isn't about scalability. It's about having a consistent way to describe how data sources accumulate/aggregate new values to do something useful with them.
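
To illustrate (a minimal sketch; the protocol and names here are made up for this comment, not our actual API):

    ;; A tiny protocol for key-value stores, plus an atomic
    ;; update operation for describing how values accumulate.
    (defprotocol Bucket
      (get-val [this k] "Fetch the value under key k, or nil.")
      (put-val [this k v] "Store value v under key k.")
      (update-val [this k f] "Atomically replace the value at k with (f old)."))

    ;; An in-memory implementation backed by an atom holding a map.
    (defn mem-bucket []
      (let [state (atom {})]
        (reify Bucket
          (get-val [_ k] (get @state k))
          (put-val [_ k v] (swap! state assoc k v) nil)
          (update-val [_ k f] (swap! state update k f) nil))))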

We don't have a batch MapReduce problem, so Hadoop isn't really an option; and having used it in the past, I'm not sure it's even a great choice.

Also, most of the data we're working with comes from content being shared, not the number of users, so in our case in particular, dealing with a lot of data is an issue from day one.

Lastly, "solve the problem first and abstract out the framework later" is, I think, one of the core misconceptions in computer science and engineering. Abstraction lets you build faster if done correctly.

Again, if we only needed a simple key-value store and only had to deal with managing user data/events, we wouldn't have built this kind of library. Much of the reason we built it is the kinds of data aggregation we encounter in relevance ranking and machine learning.
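
For example (again an illustrative sketch, not our real code), accumulating per-term counts for ranking features becomes a single merge function, independent of the backing store:

    ;; Accumulate a map of term -> count into whatever is stored at k.
    (defn merge-counts! [bucket k new-counts]
      (update-val bucket k #(merge-with + (or % {}) new-counts)))

    ;; e.g. (merge-counts! b :doc-42 {"clojure" 3, "storage" 1})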


I didn't necessarily mean scalability in terms of just compute throughput; I was also talking about sustained productivity, though the two are related. Abstraction does let you build faster if done correctly, but in my experience I've never seen a project that started framework-first turn out a good framework. Rails was abstracted out of Basecamp. MapReduce was abstracted out of Google's experience with a huge mess of scripts for running batch jobs, with some theory on top.

I'm working on a similar problem (a huge amount of base data, relatively independent of users), so I've experienced the big-data-from-day-one issue. I started by writing a bunch of stuff to make the "heavy lifting" easier. I came out of that experience with the lesson that when the data is big enough, abstractions break down, because you do care about the properties of the underlying data stores. On InnoDB/XtraDB, DELETEs cause an in-memory copy of the table to be created for MVCC, and if any INSERTs commit in that interval you'll get sporadic, hard-to-debug timeouts on 30GB tables, even if the machine has 96GB of memory. If you're making recommendations online and need to do set intersections to apply a topic model, you absolutely do care about the underlying data store. Code abstractions obscure this, are probably not the right ones anyway, and therefore end up being worked around when you're trying to debug these things.
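
To make the set-intersection point concrete with a sketch (get-val here stands in for whatever generic read such a library exposes; not any particular API):

    (require '[clojure.set :as set])

    ;; Backend-agnostic version: has to ship both full sets to the client.
    (defn shared-topics [bucket user-a user-b]
      (set/intersection (set (get-val bucket user-a))
                        (set (get-val bucket user-b))))

Redis can do the same thing server-side in one round trip (SINTER), so the "same" operation has wildly different cost and failure modes depending on the backend, and a generic abstraction either hides that or gets bypassed.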


1 million upvotes: good computer science is about designing/choosing abstractions that simplify and facilitate solving problems in your domain.

I'd argue that the grandparent's comment that bespoke, unique solutions are always needed really just means that we as computer scientists and programmers don't (yet) have a good handle on what the "right" abstractions for big/monster data problems are. To repeat a favorite CS proverb: "When all you have is a hammer, everything looks like a nail".


> we as computer scientists and programmers don't (yet) have a good handle on what the "right" abstractions for big/monster data problems are

It's more that the underlying technologies aren't good enough to support these abstractions, so the abstractions end up breaking down when we need to debug things.


If an abstraction breaks, it's not a good abstraction, in which case the Prismatic folks are doing precisely what you're advocating anyway: spending time getting things working rather than overengineering before they need to solve the problem.

My point, though, is that if an abstraction breaks, it's not a good/real abstraction, and a new abstraction/solution needs to come into play. There's no reason a data-store library can't be parameterized (whether internally or visibly to the client) over the semantics of the various backends and adjust its underlying manipulations accordingly.

(Note: I'm not saying that the Prismatic folks are or aren't doing this, just that an abstraction is only a good abstraction if you can't break it.)

I agree that the underlying technologies do not have uniform data-store semantics. However, that does not preclude client-side library logic that can provide uniform semantics over these stores, and/or warn when it cannot.
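
A sketch of what I mean (the backend names and traits are invented for illustration, building on the bucket sketch upthread):

    ;; Invented traits table; a real library would derive these
    ;; from the backend drivers themselves.
    (def backend-traits
      {:redis     {:atomic-update? true}
       :in-memory {:atomic-update? true}
       :s3        {:atomic-update? false}})

    (defn checked-update! [backend bucket k f]
      (if (get-in backend-traits [backend :atomic-update?])
        (update-val bucket k f)
        ;; Fall back, and warn that the semantics are weaker here.
        (do (println "WARN:" backend "cannot update atomically;"
                     "falling back to read-modify-write")
            (put-val bucket k (f (get-val bucket k))))))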


I haven't used these Buckets, nor have I scaled anything to that level with any need for uptime guarantees, so I'm admittedly pretty naive here.

But at the same time, simplifying the interfaces seems to me to (a) provide fewer points of buggy failure and (b) let you write lots of machine learning methods that all "just work" on buckets.

It might all be custom, but if you're spending lots of time building, testing, analyzing, and deploying statistical models from scratch (the place where I do have expertise), then a simple, consistent data interface is a godsend. Scaling might still be custom, but if you can expose the Bucket interface then it "doesn't matter" what happens underneath.
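
For example, here's a sketch (assuming a generic bucket interface like the one sketched upthread) of a running mean written once against the interface, which then runs unchanged over any backing store:

    ;; Keep a running [count sum] pair under k; works on any bucket.
    (defn observe! [bucket k x]
      (update-val bucket k (fn [[n s]] [(inc (or n 0)) (+ (or s 0.0) x)])))

    (defn mean [bucket k]
      (when-let [[n s] (get-val bucket k)]
        (/ s n)))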

If there's some real bottlenecking, then it's probably important to break through that interface or to expand it (like that flush-rate parameter the OP mentions). Until there is, though, you're winning on experimental fluidity.



