How Prismatic deals with data storage and aggregation (getprismatic.com)
69 points by snippyhollow on April 9, 2012 | 16 comments



While cool and interesting, this strikes me as exactly the type of overengineering tech-heavy startups keep being told to stop doing. I get the argument that it'll "give you a platform to build on" and help you "move fast later," but how much did you spend on engineering a storage library that doesn't solve a critical business need? $10k? $20k? How many people spent hours and hours on building another Clojure-based abstraction layer over key-value stores?

I also don't buy the argument that it'll help you be really "scalable". Once your datasets no longer fit in memory, it's all custom. Hadoop is a great solution if you have infinite money. Data locality matters. All those different services may expose a consistent-enough interface that you can build a common abstraction over them, but their latency and reliability properties, failure modes, consistency, etc. are not homogeneous, and at scale you can't pretend that they are. If your data fits in memory today (and that means 144GB-192GB per machine), then why are you worrying about this? You'll need to rewrite a huge part of your infrastructure to scale from 100k users to 10M users anyway.

TL;DR: solve the problem first and abstract out the framework later. Also "scalability" is incredibly expensive (and unnecessary) to engineer for a priori.


I'm not quite sure you got the gist of the post. The library isn't about scalability. It's about having a consistent way to describe how data sources accumulate/aggregate new values to do something useful with them.
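
To give a flavor of what I mean (a deliberately simplified sketch with made-up names, not our actual API): a bucket pairs storage with a merge function that says how a new value folds into the existing value for a key.

    (defn mem-bucket [merge-fn]
      {:data (atom {}) :merge merge-fn})

    (defn bucket-merge! [{:keys [data merge]} k v]
      (swap! data update-in [k] merge v))

    (def counts    (mem-bucket (fnil + 0)))        ; accumulate counts
    (def followers (mem-bucket (fnil conj #{})))   ; accumulate sets

    (bucket-merge! counts "clojure" 1)    ; => {"clojure" 1}
    (bucket-merge! counts "clojure" 1)    ; => {"clojure" 2}

Any backend that knows how to apply the merge function can then run the same aggregation code.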

We don't have a batch MapReduce problem, so Hadoop isn't really an option, and having used it in the past, I'm not sure it's even a great choice.

Also, most of the data we're working with comes from content being shared rather than from the number of users, so in our case in particular, dealing with a lot of data is an issue from day one.

Lastly, " solve the problem first and abstract out the framework later" I think is one of the core misconceptions in computer science and engineering. Abstraction lets you build faster if done correctly.

Again, if we only needed a simple key-value store and only had to deal with managing user data/events, we wouldn't have built this kind of library out. A big part of the reason we built it is the kinds of data aggregation we encounter in relevance ranking and machine learning.


I didn't necessarily mean scalability in terms of just compute throughput; I was also talking about sustained productivity, though the two are related. Abstraction does let you build faster if done correctly, but in my experience I've never seen a project turn out a good framework that started framework first. Rails was abstracted out of Basecamp. MapReduce was abstracted out of Google's experience with a huge mess of scripts for running batch jobs, with some theory on top.

I'm working on a similar issue (a huge amount of base data, relatively independent of users), so I've experienced the big-data-from-day-one problem. I started by writing a bunch of stuff to make the "heavy lifting" easier. I came out of that experience with the lesson that when the data is big enough, abstractions break down, because you do care about the properties of the underlying data stores. On InnoDB/XtraDB, DELETEs cause an in-memory copy of the table to be created for MVCC, and if any INSERTs commit in that interval you'll get sporadic, hard-to-debug timeouts on 30GB tables, even if the machine has 96GB of memory. If you're making recommendations online and need to do set intersections to apply a topic model, you absolutely do care about the underlying data store. Code abstractions obscure this, are probably not the right ones anyway, and therefore end up being worked around when you're trying to debug these things.


1 million upvotes: good computer science is about designing / choosing abstractions which simplify and facilitate solving problems in your domain.

I'd argue that the grandparent's comment that bespoke, unique solutions are always needed really just means that we as computer scientists and programmers don't (yet) have a good handle on what the "right" abstractions for big/monster data problems are. To repeat a favorite CS proverb: "When all you have is a hammer, everything looks like a nail."


> we as computer scientists and programmers don't (yet) have a good handle on what the "right" abstractions for big/monster data problems are

It's more that the underlying technologies aren't good enough to support these abstractions, so they end up breaking down as we need to debug things.


If an abstraction breaks, it's not a good abstraction, in which case the Prismatic folks are doing precisely what you're advocating anyway: spending time getting things working rather than overengineering before they need to solve the problem.

My point, though, is that if an abstraction breaks, it's not a good/real abstraction, and a new abstraction/solution needs to come into play. There is no reason a data store library can't be parameterized (whether internally, or visibly to the client) over the semantics of the various backends and adjust the underlying manipulations accordingly.

(Note: I am not saying that the Prismatic folks are or are not doing this, just that an abstraction is only a good abstraction if you can't break it.)

I agree that the underlying technologies do not have uniform data store semantics. However, that does not preclude client-side library logic that provides uniform semantics over these stores, and/or warns when it cannot.
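
Concretely, something like the following hypothetical sketch (not Prismatic's code): expose each backend's primitive operations plus a flag for whether it can do an atomic read-modify-write, and build the generic merge on top of that, warning or falling back when it can't.

    (defprotocol KVStore
      (kv-get [store k])
      (kv-put! [store k v])
      (atomic-merge? [store]))   ; can this backend do an atomic read-modify-write?

    (defn merge-value!
      "Generic merge built on the primitives; warn (or fall back to
       locking, retries, etc.) when the backend can't make it atomic."
      [store merge-fn k v]
      (when-not (atomic-merge? store)
        (println "WARN: merge on" k "is not atomic for this backend"))
      (kv-put! store k (merge-fn (kv-get store k) v)))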


I haven't used these Buckets nor have I scaled anything to that level with any need for uptime guarantees, so I'm certainly really naive.

But at the same time, simplifying the interfaces seems to me to (a) provide fewer points for buggy failure and (b) allow you to write lots of machine learning methods that all "just work" on buckets.

It might all be custom, but if you're spending lots of time building, testing, analyzing, and deploying statistical models from scratch (the place where I do have expertise), then a simple, consistent data interface is a godsend. Scaling might still be custom, but if you can expose the Bucket interface then it "doesn't matter" what happens underneath.

If there's some real bottlenecking then it's probably important to break through that interface or to expand it (like that flush-rate parameter the OP mentions). Until there is, though, you're winning on experimental fluidity.
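
What I mean by "just work", roughly (a sketch with a hypothetical merge function, not the actual Bucket API): the model code only ever calls one merge operation, so whatever store sits underneath shouldn't matter to it.

    (defn count-bigrams! [merge-fn! tokens]
      (doseq [[a b] (partition 2 1 tokens)]
        (merge-fn! [a b] 1)))

    ;; same code whether merge-fn! writes to an in-memory atom or a real store:
    (def counts (atom {}))
    (count-bigrams! #(swap! counts update-in [%1] (fnil + 0) %2)
                    ["prismatic" "runs" "on" "coffee"])
    ;; @counts => {["prismatic" "runs"] 1, ["runs" "on"] 1, ["on" "coffee"] 1}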


This is your second post, and you've already become one of my favorites. I have an NLP question for you. I noticed you're doing: {"Prismatic" {"runs" 1}, "runs" {"on" 1}, "on" {"coffee" 1}}. Why do you care to have a count of each word that follows rather than assuming it's always one? Why not just use a simple bigram as the key and an int as the value to keep the count?
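
To make sure I'm reading it right, the two shapes I'm comparing (made-up counts):

    ;; nested: first word -> {following word -> count}
    {"Prismatic" {"runs" 3, "is" 1}
     "runs"      {"on" 3}}

    ;; flat: the bigram itself is the key, the count is the value
    {["Prismatic" "runs"] 3
     ["Prismatic" "is"]   1
     ["runs" "on"]        3}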


Are the merge operations performed within a transaction for the backends that support them, or is atomicity of merges not desired?


Just a small typo: the 'associative' link points to the Wikipedia article on commutativity.


Fixed. Thanks for noticing.


Will Prismatic be releasing any Clojure libraries as open source?


Yes. We don't have a definite date, but soon.


I can't wait. I've been working with a bunch of XML files (TEI corpora, actually) and using document stores to build out indexes. One of the things I've found is that what I want to do with reduce functions (build complex dictionaries, for instance with the key being a word stem and the value being another dict with key=word form and value=list of locations, or related words, or whatever) is not the kind of thing Mongo/CouchDB reduce functions are designed for (i.e., they expect scalar results).
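
Concretely, the kind of nested reduce I keep reaching for (illustrative Clojure with made-up data):

    (def tokens
      [{:stem "run" :form "running" :loc "doc1:12"}
       {:stem "run" :form "ran"     :loc "doc2:4"}
       {:stem "run" :form "running" :loc "doc3:7"}])

    (reduce (fn [index {:keys [stem form loc]}]
              (update-in index [stem form] (fnil conj []) loc))
            {}
            tokens)
    ;; => {"run" {"running" ["doc1:12" "doc3:7"], "ran" ["doc2:4"]}}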

I learned what a 'reduce' is from learning Clojure (which was my intro to functional programming), not from databases. And so I got really used to the idea of reducing into nested dictionaries, because Clojure is good at that. Which meant that when I came to using MapReduce in DB environments (particularly CouchDB, but also a lil' Hadoop), I was a little disappointed.

Which is to say that a solid Clojuric abstraction layer that emphasizes a bucket approach on top of a variety of different stores sounds like my dream library.


Looking forward to the libraries' release!


Are you going to release a public API?



