
Unfortunately, it really only works ergonomically at Google, because you need the global protodb and a proto-aware SQL dialect to make it usable for humans, and you need proto-aware indexing to make the retrieval speed usable for machines if you ever query by the contents of a proto field. Externally, the closest I've been able to get is using JSON-encoded protos as values, but that comes with its own problems.
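For reference, the closest external shape I've seen is something like the sketch below - hotel_info_pb2/HotelInfo, the table, and the column names are all made up for illustration:

    # Rough sketch of the "JSON-encoded protos as values" approach outside Google.
    # Assumes a hypothetical generated module hotel_info_pb2 and a table created as:
    #   CREATE TABLE hotels (id TEXT PRIMARY KEY, info JSONB);
    import psycopg2
    from google.protobuf import json_format

    from hotel_info_pb2 import HotelInfo  # hypothetical generated proto class

    def store(conn, key, info):
        # Serialize the proto to JSON so individual fields stay queryable.
        payload = json_format.MessageToJson(info, preserving_proto_field_name=True)
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO hotels (id, info) VALUES (%s, %s) "
                "ON CONFLICT (id) DO UPDATE SET info = EXCLUDED.info",
                (key, payload),
            )

    conn = psycopg2.connect("dbname=ads")
    store(conn, "hotel/123", HotelInfo(has_pool=True))
    conn.commit()
    # Field-level queries then go through JSON operators, e.g.
    #   SELECT id FROM hotels WHERE (info->>'has_pool')::bool;
    # and that's where the problems start: no proto-aware indexing, stringly
    # typed comparisons, int64/enum fields serialized as strings, and so on.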



You are absolutely right, but when it works - it's amazing...


What are some of the benefits of doing it this way?


Most database schemas at Google have exactly two "columns": a primary key column, and an "Info" column which contains a large protobuf. The difference between that and a simple K/V store, though, is that the storage engine is building indices and columnar data stores under the hood that make it perform as if it had been defined with all the fields as flattened columns, but you never actually have to declare a schema in any detail. That cuts down dramatically on tooling and release nonsense - there's nothing like a Rails migration script, because the only update you're likely to do is "sync to latest proto definition". It also means that you have all the semantics of a protocol buffer to define your data structure - oneofs to express optionality, and submessages to group fields together - which simply don't exist in a normal DBMS.
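The storage-engine magic is the part you can't easily replicate, but the shape itself is simple; a minimal sketch of the "key + Info proto" pattern, with customer_pb2/CustomerInfo and the table name made up:

    # "Primary key + one big Info proto" over a plain store; no per-field schema.
    import sqlite3

    from customer_pb2 import CustomerInfo  # hypothetical generated proto class

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, info BLOB)")

    def put(key, info):
        # The whole record is one serialized proto, so adding a field to
        # CustomerInfo needs no migration - "sync to latest proto definition"
        # is the only update.
        conn.execute(
            "INSERT OR REPLACE INTO customers VALUES (?, ?)",
            (key, info.SerializeToString()),
        )

    def get(key):
        row = conn.execute(
            "SELECT info FROM customers WHERE id = ?", (key,)
        ).fetchone()
        info = CustomerInfo()
        info.ParseFromString(row[0])
        return info

    # What the internal engines add on top - and what's hard to get elsewhere -
    # is the indexing and columnar layout derived from the proto's descriptors.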


I can speak only from my experience. I was on a team doing ads evaluations, and there were multiple other teams specializing in various bits & pieces of the Ads infrastructure - for example, one team would track which amenities a hotel/motel had, and so on. To capture all these mundane but intricate details - e.g. a "bool has_pool" of sorts - they'd keep adding fields, or introducing new proto messages, to fully capture that data. Each team would own several, if not hundreds, of these protos, and they'd be bundled with a `oneof` (i.e. a "union") or something like that into a bigger encapsulating structure. So in your row database (Bigtable), each team would have a column that stores their proto (or nothing). We'd then run a batch process to read each such column, treat all the fields in the proto stored there as data, and expand those fields as individual columns over in the columnar (read-only) db.

Later, an analyst/statistician/linguist/etc. could query that columnar db for the data they were interested in. So that's what I remember (I left several years ago, so things might've changed), but pretty much: instead of the typical row-database setup with a column for everything, you just have a column for your protobuf (a bit like storing HSTORE/JSON in Postgres), and then the ETL process mows through each field in that "column" and creates "columns" for each such field.
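The expansion itself is basically just proto reflection over whatever fields happen to be set; roughly something like this (the dotted column naming is only my illustration, not the real pipeline's):

    # Expand a proto stored in one bigtable "column" into per-field columns.
    def flatten(msg, prefix=""):
        """Turn a (possibly nested) proto into {column_name: value} pairs."""
        columns = {}
        for field, value in msg.ListFields():  # only fields that are actually set
            name = prefix + field.name
            if field.label == field.LABEL_REPEATED:
                # Repeated/map fields really want arrays or child tables;
                # keep them as plain lists for the sketch.
                columns[name] = list(value)
            elif field.type == field.TYPE_MESSAGE:
                # Submessages become dotted columns, e.g. "amenities.has_pool".
                columns.update(flatten(value, name + "."))
            else:
                columns[name] = value
        return columns

    # e.g. flatten(some_team_proto) -> {"amenities.has_pool": True, ...};
    # each key becomes a column in the read-only columnar copy.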

We had to do some workarounds though, as it was exporting too much, and it was not clear how we could know which fields would actually be asked about (there was probably a way, but it would've taken time to coordinate with some other team, so it came down more, I think, to coordinating with internal "customers" on which fields to disallow from export).

But the cool thing is that if a customer team added a new field to their proto, our process would see it and it would get expanded. If they deprecated a proto, there could be logic (not sure if there was, but it could be added) to no longer export it. But for this to work you need the "protodb", i.e. the ability to introspect and reflect on the actual field names in order to generate the columns.
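That introspection is roughly a walk over the compiled descriptors; a sketch of "see the new field, generate a column for it", with HotelInfo standing in for one of those team protos:

    # Enumerate declared fields (not just populated ones), so a field a team
    # adds tomorrow shows up as a new column on the next run.
    from hotel_info_pb2 import HotelInfo  # hypothetical generated proto class

    def declared_columns(descriptor, prefix=""):
        for field in descriptor.fields:
            if field.GetOptions().deprecated:
                continue  # one possible "stop exporting deprecated stuff" hook
            name = prefix + field.name
            if field.type == field.TYPE_MESSAGE and field.label != field.LABEL_REPEATED:
                # Recurse into submessages (ignoring recursive protos for brevity).
                yield from declared_columns(field.message_type, name + ".")
            else:
                yield name

    print(sorted(declared_columns(HotelInfo.DESCRIPTOR)))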


That’s very cool. Thanks so much for the detailed explanation



