It is frustrating that Google is always 5 to 10 years ahead of everyone else but never open sources its back-end technologies (except recently with the ML stuff). The whole reason Hadoop exists is that Google only released whitepapers (which was good) but not code. I wonder whether Google really benefits from this strategy, given that it has to be an entire ecosystem instead of benefiting from being part of one. I also wonder whether the industry is better off for having to cooperatively reinvent the Google architecture. I doubt the Hadoop ecosystem would have arisen had it been Google code at its heart.
How much of BigQuery's performance do you think stems from Capacitor versus the rest of the system? For example, if you swapped it out for Parquet but kept everything else (Colossus, Dremel, background reordering, metadata stored in Spanner, etc.), would it be 10/30/50% worse, or would it be an order of magnitude worse?
We looked at Parquet early on and it wasn’t competitive even with what we were using at the time.
And yeah, this really depends not just on the dataset but also on how selective your queries are, what predicates and aggregations they employ, etc. A significant percentage of queries get orders of magnitude faster. I can’t disclose how much faster things got on average, but it was a significant gain, way more than enough to offset the increased cost of encoding (another aspect people typically don’t consider), even though much of the data people encode is hardly ever touched.
BigQuery uses a lot of tricks to get efficiency, but this post emphasizes Apache Arrow and open data formats like it as the way forward (in particular the last point, "Be open, or else…"), and those are not currently supported by BigQuery.
If Apache Arrow takes off, I hope BigQuery will support it as a data interchange format in the future. Zero-copy is pretty awesome, as are open standards in general. As far as I know this feature does not exist in BigQuery today, and it's definitely not discussed in the source.
One thing people commonly miss is that all of this is meaningless if you don’t have the corresponding runtime integration; these techniques very much imply co-design, and therefore tight coupling, between the format and the runtime. To give a concrete example: to make any of this efficient and fast, you must push predicates directly into the decoder, so that filters can skip the data they don’t need to decode. Some aggregations can be handled the same way.
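A minimal sketch of what predicate pushdown into the decoder might look like, assuming a column stored as blocks that carry min/max statistics. The Block struct, the decode callback, and ScanWithPushdown are all hypothetical illustrations, not Capacitor's actual interface:

    // Hypothetical column layout: each block keeps min/max stats captured at
    // encode time, plus a still-encoded payload (dictionary, RLE, whatever).
    #include <cstdint>
    #include <functional>
    #include <vector>

    struct Block {
      int64_t min_value;
      int64_t max_value;
      std::vector<uint8_t> payload;  // encoded values, not yet decoded
    };

    // Scan a column for values in [lower, upper], pruning whole blocks whose
    // statistics already rule them out; only surviving blocks are decoded.
    std::vector<int64_t> ScanWithPushdown(
        const std::vector<Block>& blocks, int64_t lower, int64_t upper,
        const std::function<std::vector<int64_t>(const std::vector<uint8_t>&)>& decode) {
      std::vector<int64_t> out;
      for (const Block& b : blocks) {
        if (b.max_value < lower || b.min_value > upper) continue;  // skip: never decoded
        for (int64_t v : decode(b.payload)) {
          if (v >= lower && v <= upper) out.push_back(v);          // row-level filter
        }
      }
      return out;
    }

The point is that the pruning happens against the encoded blocks themselves, so skipped blocks are never decompressed at all, and that only works if the scan layer and the format agree on what statistics exist.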
So it’s a little incorrect to think of this as a “file format” in the first place. If you design it like that, you won’t get a lot of the gains that the Abadi paper (and others like it) alludes to. My suggestion would be to go whole hog and push as much filtering and aggregation down there as is feasible, exposing a higher-level interface with _at least_ filtering predicate support, and to do it in C++.
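In the same hypothetical vein, aggregation pushdown: with a run-length-encoded column, SUM and COUNT can be computed straight from (value, run length) pairs without ever materializing individual rows. The Run struct and functions below are illustrative only:

    #include <cstdint>
    #include <vector>

    // Hypothetical run-length-encoded column: `length` consecutive rows hold `value`.
    struct Run {
      int64_t value;
      uint32_t length;
    };

    // SUM computed directly off the runs: value * run length, no row materialization.
    int64_t SumRle(const std::vector<Run>& runs) {
      int64_t sum = 0;
      for (const Run& r : runs) sum += r.value * static_cast<int64_t>(r.length);
      return sum;
    }

    // COUNT is just the total run length.
    int64_t CountRle(const std::vector<Run>& runs) {
      int64_t count = 0;
      for (const Run& r : runs) count += static_cast<int64_t>(r.length);
      return count;
    }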