It is frustrating that Google is always 5 to 10 years ahead of everyone else but never open sources its back-end technologies (except recently with the ML stuff). The whole reason Hadoop exists is that Google only released whitepapers (which was good) but not code. I wonder whether Google really benefits from this strategy, given that it has to be an entire ecosystem instead of benefiting from being part of one. I also wonder whether the industry is better off for having to cooperatively reinvent the Google architecture. I doubt the Hadoop ecosystem would have arisen had it been Google code at its heart.
How much of BigQuery's performance do you think stems from Capacitor versus the rest of the system? For example, if you swapped it out for Parquet but kept everything else (Colossus, Dremel, background reordering, metadata stored in Spanner, etc.), would it be 10/30/50% worse, or would it be an order of magnitude worse?
We looked at Parquet early on and it wasn’t competitive even with what we were using at the time.
And yeah, this really depends not just on the dataset but also on how selective your queries are, what predicates and aggregations they employ, etc. A significant percentage of queries get orders of magnitude faster. I can’t disclose how much faster things got on average, but it was a significant gain, way more than enough to offset the increased cost of encoding (another aspect people typically don’t consider), even though much of the data people encode is hardly ever touched.
BigQuery uses a lot of tricks to get efficiency, but this post emphasizes Apache Arrow and open data formats like it as the way forward (in particular the last point, "Be open, or else…"), and those are not currently supported by BigQuery.
If Apache Arrow takes off, I hope BigQuery will support it as a data interchange format in the future. Zero-copy is pretty awesome, as are open standards in general. As far as I know this feature does not exist in BigQuery today, and it's definitely not discussed in the source.
One thing people commonly miss is that all of this is meaningless if you don’t have the corresponding runtime integration; these techniques very much imply co-design, and therefore tight coupling, between the format and the runtime. To give a concrete example: to make any of this efficient and fast, you must push predicates directly into the decoder, so that filters can skip the data they don’t need to decode. Some aggregations can be handled the same way.
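A minimal sketch of what predicate pushdown into the decoder might look like, assuming a column stored as blocks that carry min/max statistics. The Block struct, the decode callback, and ScanWithPushdown are all hypothetical illustrations, not Capacitor's actual interface:

    // Hypothetical column layout: each block keeps min/max stats captured at
    // encode time, plus a still-encoded payload (dictionary, RLE, whatever).
    #include <cstdint>
    #include <functional>
    #include <vector>

    struct Block {
      int64_t min_value;
      int64_t max_value;
      std::vector<uint8_t> payload;  // encoded values, not yet decoded
    };

    // Scan a column for values in [lower, upper], pruning whole blocks whose
    // statistics already rule them out; only surviving blocks are decoded.
    std::vector<int64_t> ScanWithPushdown(
        const std::vector<Block>& blocks, int64_t lower, int64_t upper,
        const std::function<std::vector<int64_t>(const std::vector<uint8_t>&)>& decode) {
      std::vector<int64_t> out;
      for (const Block& b : blocks) {
        if (b.max_value < lower || b.min_value > upper) continue;  // skip: never decoded
        for (int64_t v : decode(b.payload)) {
          if (v >= lower && v <= upper) out.push_back(v);          // row-level filter
        }
      }
      return out;
    }

The point is that the pruning happens against the encoded blocks themselves, so skipped blocks are never decompressed at all, and that only works if the scan layer and the format agree on what statistics exist.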
So it’s a little incorrect to think of this as a “file format” in the first place. If you design it like that, you won’t get a lot of the gains that the Abadi paper (and others like it) alludes to. My suggestion would be to go whole hog and push as much filtering and aggregation down there as is feasible, exposing a higher-level interface with _at least_ filtering predicate support, and to do it in C++.
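In the same hypothetical vein, aggregation pushdown: with a run-length-encoded column, SUM and COUNT can be computed straight from (value, run length) pairs without ever materializing individual rows. The Run struct and functions below are illustrative only:

    #include <cstdint>
    #include <vector>

    // Hypothetical run-length-encoded column: `length` consecutive rows hold `value`.
    struct Run {
      int64_t value;
      uint32_t length;
    };

    // SUM computed directly off the runs: value * run length, no row materialization.
    int64_t SumRle(const std::vector<Run>& runs) {
      int64_t sum = 0;
      for (const Run& r : runs) sum += r.value * static_cast<int64_t>(r.length);
      return sum;
    }

    // COUNT is just the total run length.
    int64_t CountRle(const std::vector<Run>& runs) {
      int64_t count = 0;
      for (const Run& r : runs) count += static_cast<int64_t>(r.length);
      return count;
    }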