Arguing against using protobuffers (reasonablypolymorphic.com)
307 points by haskellandchill on Oct 10, 2018 | 298 comments



Hi there,

I'm an actual author of Protocol Buffers :)

I think Sandy's analysis would benefit from considering why Protocol Buffers behave the way they do rather than outright attacking the design because it doesn't appear to make sense from a PL-centric perspective. As with all software systems, there are a number of competing constraints that have been weighed that have led to compromises.

- D

P.S. I also don't believe the personal attacks to be warranted or productive.


While I agree personal attacks do not help, he gives many reasons why he thinks Protocol Buffers are wrong. You could either respond to the issues he raises or explain what those constraints and compromises you mention are, but your comment basically just says "I'm an author, he does not know what he's talking about", which is not very productive either.


It astonishes me how many Google products are not developer friendly. Because they think they are so freaking smart, they figure the rest of us can waste our time wrestling with balky code.

Protobuf raised all my red flags the first time I saw it and every time I see it again.

For instance I once tested five vision recognition APIs and I could get the other ones working in 15 minutes each. The Google API went way into overtime because Google's libraries trashed my Python installation forcing me to reinstall.

Google made a real boner with namespace packages in Python and they've contributed a big chunk of entropy to the Java ecosystem with the Guava library that, to this day, holds back Hadoop and all of the code around it to version 13 point something because what was supposed to be a minor revision broke HDFS.


Google hires the smartest software engineers in the world.

The problem with the SMARTEST software engineers is they are incapable of distinguishing between good and bad code. They're fantastic at writing EFFICIENT code, certainly, but not code that is comprehensible by others - to them, all code is equally comprehensible.

Am I trolling? Slightly. I am exaggerating a little bit. Obviously all engineers care about readability/maintainability to a degree. But I have definitely noted a correlation during my career between super intelligent engineers and an over-emphasis on succinctness over readability. And when provided code review feedback, they are genuinely confused: "why would your version be better? Mine is perfectly understandable."


The only thing I was thinking while (half-) reading this article is that there's some fundamental misunderstanding about what protobuf is for.


Dear D,

I'm very interested in Protocol Buffers. Could you explain the tradeoffs you made while designing Protobuf, and what you would change if you were to design it now?

Cheers!


There have been some Protobuf spinoffs that claim to be "Protobuf but better" or "Protobuf but with changes from experience", like (iirc) Cap'n Proto and FlatBuffers, and full alternatives like Thrift and Fast Buffers.


I thought Kenton Varda was the author of protobufs?


Nope, I'm not the original author -- that would be Jeff Dean and Sanjay Ghemawat (also often credited with inventing things like MapReduce, BigTable, Spanner, ...). I wrote version 2 (a complete rewrite, but largely following the original design) and open sourced it. I stopped working on Protobuf about 8 years ago. Many others who have been on the Protobuf team since can certainly call themselves "authors".


> that would be Jeff Dean and Sanjay Ghemawat

or "amateurs", as the post would call them.


"an actual author"


Can we get access to more metadata in the lite api? Asking for a friend O:)


Though I dislike the hyperbolic tone and personal attacks, the author isn't entirely wrong. There are many design choices in Protocol Buffers that seem directly related to the scale and complexity at which Google operates, and which sacrifice safety, clarity and language integration.

The utter awkwardness of Protobuf-generated code is particularly problematic. I've had pretty good results with the TypeScript code generator, but the Go generator is just egregiously bad. There's no way to write a client or server that uses the Protobuf structs directly as first-class data types without ending up looking like a ball of spaghetti, at least not if you venture into the realm of oneofs, timestamps or the "well known types" set of types.

I think we're at a juncture where we really desperately need a better, unified way to express schemas and pass type-safe data over the network that flows through these schemas. I'm currently working on a project that involves gRPC (which uses Protobuf), GraphQL and JSON Schema, with the backends written in Go and the frontend in TypeScript, and the overlap between all of these is ridiculous. TypeScript is a breath of fresh air that mostly solves the brittleness of JavaScript's type system, and it's absolutely crucial to extend this type-safety all the way to the backend.

The challenge is that no language is the same, and a common schema format ends up targeting a lowest common denominator. For example, GraphQL errs (in my opinion, wrongly) on the side of simplicity and doesn't support maps, so now you have a whole layer that needs to either use custom types (JSON as an escape hatch, basically) or emulate maps using arrays of pairs, neither of which is ideal.


An approach I've used on the Go side is to build nice native Go types and converters to/from proto representations for wire marshaling. There's a minor performance hit but it's insignificant for what I'm doing. The result is quite nice, and you can largely ignore protobufs except for defining service APIs (which it's actually decent at).

Which tool are you using for TypeScript protobuf code generation? This is the next step for me, and I'd be keen to hear more of your experience there.


Yes, we use the same technique for our larger apps. We treat the Protobuf layer as a distinct layer related to the API; we have two packages, "protofy" and "unprotofy", so in the server implementation, you do something like:

  func (srv *Server) GetFoo(
    ctx context.Context,
    req *proto.GetFooRequest) (*proto.GetFooResponse, error) {
    foo, err := srv.db.getFoo(req.GetId())
    if err != nil { ... }
    pFoo, err := protofy.Foo(foo)
    if err != nil { ... }
    return &proto.GetFooResponse{
      Foo: pFoo,
    }, nil
  }
    
Obviously, a bit less pretty in reality. But the principle is the same. Having two packages makes it easier to read — e.g. protofy.Foo() is always about taking a "native" Foo and turning it into a *proto.Foo, and unprotofy.Foo() is the reverse.
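
To make that concrete, a minimal sketch of what those two packages might look like (the model/proto field names are made up, and it assumes the github.com/golang/protobuf/ptypes helpers for the Timestamp conversion):

  // package protofy: "native" model -> wire type
  // (uses github.com/golang/protobuf/ptypes for Timestamp conversion)
  func Foo(f *model.Foo) (*proto.Foo, error) {
    createdAt, err := ptypes.TimestampProto(f.CreatedAt)
    if err != nil {
      return nil, err
    }
    return &proto.Foo{
      Id:        f.ID,
      Title:     f.Title,
      CreatedAt: createdAt,
    }, nil
  }

  // package unprotofy: wire type -> "native" model, validating as it goes
  func Foo(p *proto.Foo) (*model.Foo, error) {
    if p == nil {
      return nil, errors.New("missing foo")
    }
    createdAt, err := ptypes.Timestamp(p.CreatedAt)
    if err != nil {
      return nil, err
    }
    return &model.Foo{ID: p.Id, Title: p.Title, CreatedAt: createdAt}, nil
  }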

For TypeScript, I'm using ts-protoc-gen [1]. The weird part, which I don't fully understand, is that the serialization code is emitted as JavaScript code. All the type definitions end up in a .d.ts file, but the client is a .js file. Which means you still get type safety and autocompletion, just like plain TS, but it's still weird to me.

[1] https://github.com/improbable-eng/ts-protoc-gen


I believe that means you're more or less using it as intended. Protobuffers are intended for serialization. In Go, serialization is often handled at a separate layer from the business logic.



That’s what you should do with all serialization, but everyone always thinks it’s a good idea to tie the model object to the serializer object at the beginning of a project, and by the time you are ready to build a time machine to kill the person who made that mistake, it’s much too late to do anything but rewrite the whole thing.


Could you elaborate on what you don't like about the generated Go code?

My main complaint initially was that everything was a pointer, which made constructing things harder.

Then I realized it made zero values much easier to implement. And it lets you foo.GetA().GetB().GetC() without all the intermediate steps, which feels like a violation of the Law of Demeter, but in a glue language I think it's probably the right trade-off.

Yes, you can't use them as classes/types themselves, but you're not supposed to subclass them in any language, so they're really better thought of as structs/dicts.
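
Roughly the shape of the generated getters that make the chaining safe (hypothetical A/B/C types, not the real generated code, just to show why a nil anywhere in the chain doesn't panic):

  type C struct{ Value string }
  type B struct{ C *C }
  type A struct{ B *B }
  type Foo struct{ A *A }

  // Each getter is nil-safe: a nil receiver returns the zero value.
  func (f *Foo) GetA() *A {
    if f == nil {
      return nil
    }
    return f.A
  }

  func (a *A) GetB() *B {
    if a == nil {
      return nil
    }
    return a.B
  }

  func (b *B) GetC() *C {
    if b == nil {
      return nil
    }
    return b.C
  }

  // foo.GetA().GetB().GetC() is nil (not a panic) if anything along the chain is unset.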


Oneofs, at least with gogoproto, are quite unpleasant to work with. There's a separate intermediate type generated for each member of a oneof field. With the following message:

    message Foo {
        oneof bar {
            Baz baz = 1;
            Qux qux = 2;
        }
    }
You'll have the types Foo, Baz and Qux like you'd expect, but you also get Foo_Baz, which does nothing but wrap the Baz for use with the Foo.bar property, and Foo_Qux, which does the same thing but for Qux instead of Baz. To make matters worse, the types all implement an interface which is declared but not exported. Stringing the whole thing together to assign to a oneof looks like this:

    foo := &rpc.Foo{Bar: &rpc.Foo_Baz{Baz: &rpc.Baz{}}}
That example makes it look better than it really is; in practice, you'll probably have more properties and longer names.
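
Reading the value back out is a type switch over those same wrapper types (continuing the hypothetical rpc package above; handleBaz/handleQux are made-up handlers):

    // The unexported interface means you can't name the common type,
    // so you switch on the concrete wrappers instead.
    switch v := foo.Bar.(type) {
    case *rpc.Foo_Baz:
      handleBaz(v.Baz)
    case *rpc.Foo_Qux:
      handleQux(v.Qux)
    case nil:
      // oneof not set at all
    }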


I'd like to humbly suggest that we use JSON please, in particular:

JSON + JSONSchema[0] +/- JSON Hyperschema[1] +/- JSON LD[2]

It's a bit to learn but I promise you, it's worth it. The technologies are not redundant (jsonschema spec is for validation, hyperschema spec is for specifying how you interact, and LD is for semantics like language and more). If you take a few hours, read all 3 specs, you're almost guaranteed to imagine a new (albeit somewhat daunting) future, where APIs and clients can be smarter.

There's no reason I can think of that a sufficiently specified JSON schema cannot be automatically converted into an efficient binary format. I have absolutely zero reason to use Protobuf and its whole ecosystem of tools if that's the case; I'd much rather go with tools built on a standard like JSON.

GraphQL's functionality is a subset of the above set of tools -- IIRC GraphQL can basically be boiled down to an option (or two) on a resource endpoint -- how you query on the server side does indeed change a bit to support the syntax, but this is basically the same as if you'd implemented horizontal filtering like in PostgREST[3], or its resource embedding feature[4] (which everyone ends up doing at some point with REST-ish APIs). I honestly don't know who or what is trying so hard to drive GraphQL adoption, but I think it's mostly hype, and in terms of general technology, by default it forces clients to know the inner workings of the data model, and that's bad for any kind of future in which you don't want to have to learn X APIs to interact with X systems.

BTW I absolutely love a good, well-supported rant that agrees with all my biases (yes, Google, like every other company, has a bunch of amateurs as well as a small number of very skilled engineers), and all the problems noted with protobuf are real problems, and this viewpoint is sorely missed in all the talks and conferences where people are touting gRPC as the best thing since sliced bread with Protobufs as the transport (you can switch gRPC's payload encoding to JSON, for example[5]).

[0]: https://json-schema.org/

[1]: https://datatracker.ietf.org/doc/draft-handrews-json-schema-... (also on json-schema.org)

[2]: https://json-ld.org/

[3]: https://postgrest.org/en/v5.1/api.html#horizontal-filtering-...

[4]: https://postgrest.org/en/v5.1/api.html#resource-embedding

[5]: https://grpc.io/blog/grpc-with-json


I don't think GraphQL is over-hyped at all. Maybe it's flawed, but the design is absolutely on the right track. GraphQL completely changes how you work with APIs in a front end.

I work on React apps, and by using GraphQL, a component's data requirements can now be entirely declarative. For example, a component can do this (simplified):

  <Query query={`{
    posts(limit: 10) {
      title, creator { name }
    }
  }`}>
    {({data, loading, error}) => {
      return <ul>
        {data.posts.map(({title, creator}) => <li>
          {title} by {creator.name}
        </li>)}
      </ul>
    }}
  </Query>
The component knows what it needs to render, so it declares that. With TypeScript, you can get type-safety all the way from the backend to the frontend, which means your IDE (e.g. VS Code) can correctly autocomplete, say, "creator." and suggest "name". It's rather magical.

I work on a product called Sanity [1], which evolves GraphQL one step further. It's a Firebase-like data store with schemas and joins, and a structured data model. Without implementing anything on the server end, you can run queries like these:

  *[_type == "post" && published] {
    title,
    creator -> { name },
    topPosts: *[creator.id == ^.creator._ref] {
      _id, title
    } | order(viewCount desc)[0..10],
    photos: photo -> {url}
  } | order(_createdAt desc)[0..100]
This doesn't just follow a child object (creator), it also joins with an unrelated set of objects (topPosts, which finds the top 10 most viewed posts created by the same user) as well as joining a bunch of 1:1 relations. I'm totally biased, but I think it's a game-changer when it comes to writing web apps, because you can just dump the whole query in a React component, no state management or API writing required.

[1] https://www.sanity.io/


Nothing about what you just posted couldn't be done with a normal RESTful endpoint with sufficient support for column-level filtering and embedded item filtering.

Your post is the perfect example of GraphQL is being over hyped. I absolutely get that there's a benefit to filtering at both these levels, and that the DSL cuts down on noise and gives you a way to "query" without thinking of web requests, but it's not a leap in thinking -- teams that have to deal with mobile environments have long been strapping on filtering options to endpoints to avoid sending unneeded bytes to mobile devices.

Yes -- GraphQL standardizes this stuff, but it does it in a way that is basically not compatible with anything else.

> The component knows what it needs to render, so it declares that.

??? The component doesn't know anything, components don't think. You're describing it as if you just gave the component a list of entities but you actually wrote a query DSL -- that's how the component "knows" -- you told it.

What GraphQL is doing for you here is:

- Enforcing consistent access patterns (which is how "posts" gets translated into the right URL)

- Ensuring "limit" is supported on the endpoint

- Ensuring horizontal filtering is supported

- Ensuring embedded entities get returned and they're filtered

This is not a paradigm shift. It's better, maybe, but that DSL will absolutely fail you at some point, when you try to do a more dynamic query, and you'll have to drop back to writing code that looks a lot like life did before GraphQL.

> With TypeScript, you can get type-safety all the way from the backend to the frontend, which means your IDE (e.g. VS Code) can correctly autocomplete, say, "creator." and suggest "name". It's rather magical.

This is basically orthogonal... Write well typed javascript and your IDE is going to be able to help you out.

> Firebase-like data store with schemas and joins

You've lost me here. The excerpt you've posted looks even worse than SQL. At that point why not just send SQL directly to the backend (as long as you can get the permissions right and your DB is secure enough)?


Could one invent an ad-hoc REST API to perform joins, apply parameters and so forth? Of course. That's what people have been doing. That means every solution is ad-hoc. People get tired of this.

Have you tried any of the GraphQL client tools? You can literally point a client at an arbitrary server and not only see its schema, but also interact with the API. REST doesn't give you that, because the authors of REST neglected to actually specify anything beyond some fuzzy principles.

> ??? The component doesn't know anything, components don't think.

This is being obtuse. You know perfectly well what I meant: That the component localizes the information needed to make an informed request. This is domain knowledge encoded in the component.

This is contrary to how most developers design clients, where they might call an endpoint such as /posts?limit=10, which happens to fetch all attributes, even if the component doesn't actually need all the attributes.

> You've lost me here. The excerpt you've posted looks even worse than SQL.

Before a summary (and, honestly, tone deaf) dismissal such as this, I recommend reading up on it a bit [1], and maybe trying it out. SQL is great, but it does not handle nested documents/relationships, so you end up with a lot of messy structural mapping between flat relational data and structured data, which is why projects like Rails/ActiveRecord and Hibernate are so popular.

For example, a query such as this:

  *{
     name,
     photos->{ url },
     friends->{ name }
  }
will return something like:

  [
    {
      "name": "Bob",
      "photos": [
        {"url": "http://..."},
        {"url": "http://..."}
      ],
      "friends": [
        {"name": "Jane"}
      ]
    },
    ...
  ]
The point isn't that you cannot, with a sufficiently complex ORM and enough elbow grease, do that with SQL. It's that it should be unnecessary, because developers shouldn't need to write an entire data layer every time they want to bring an app up. This is part of what GraphQL brings to the table, too.

[1] https://www.sanity.io/docs/data-store/how-queries-work


You can do this with PostgReST as well. Something like:

  GET /user?select=name,photo(url),friend(name)
I do not use PostgReST, but created something similar.

When I perform the following 3 individual requests after I performed the above request the results can all be retrieved from cache (at the server and/or client side):

  GET /user?select=name
  GET /user/{name}/photo?select=url
  GET /user/{name}/friend?select=name
Even when I do the following (without the photo(url) part) it can now just build it from the cache without hitting the database server:

  GET /user?select=name,friend(name)

That is one of the advantages of using ReST. It allows you to utilize and benefit from already existing proven and well defined infrastructure components (the connectors etc.).

GraphQL can be used on top of this as well. I experimented with this once as an alternative to the PostgReST-like syntax.


Someone linked to this thing called subZero[0], which seems to be a starter kit that exposes APIs as GraphQL and REST from a Postgres database (it uses postgrest), maybe it's a good example of them working side by side.

[0]: http://docs.subzero.cloud/


GraphQL certainly has a cleaner/less involved interface, but it's also less interoperable. My problem with it is that it doesn't offer much value for the amount of effort, and it isn't the paradigm shift it's claiming to be in the first place; it's just being marketed well. HATEOAS[0] + Swagger[1]/jsonschema hyperschema[2] is enough to do what GraphQL does.

> This is being obtuse. You know perfectly well what I meant: That the component localizes the information needed to make an informed request. This is domain knowledge encoded in the component.

> This is contrary to how most developers design clients, where they might call an endpoint such as /posts?limit=10, which happens to fetch all attributes, even if the component doesn't actually need all the attributes.

I apologize, that was a terrible way to phrase my objection.

But I want to note that basically GraphQL has offered you a way to avoid writing:

    axios.get("/posts?limit=10&filter[0]=prop&filter[1]=otherProp&fetchEmbeddedEntities[0]=thing")
My point was that you can tack on a non-string query API that declaratively builds this without committing to "implementing a GraphQL endpoint" for all your APIs -- the technology is already there, what people are lacking is shared structure. GraphQL does create shared structure, but in an all-or-nothing way, and I avoid building with tools that do that.

BTW, as far as actual dynamic API recognition goes, the promise of jsonschema + json hyperschema + json LD is much more promising (it's basically equivalent to the semantic web promise) -- GraphQL is a step in the right direction but the lack of interop means we have to step backwards to go in any other direction.

Let me put it this way, could you imagine writing a query where you don't know the name of the model on the server side? Like you only know the thing you want (let's say a "vehicle"), and you know the backend has "vehicles" but you don't know what they're called? Being able to do that query is a paradigm shift, and it's possible with the tools I mentioned, though the promise is yet to be realized by and large.

> Before a summary (and, honestly, tone deaf) dismissal such as this, I recommend reading up on it a bit [1], and maybe trying it out. SQL is great, but it does not handle nested documents/relationships, so you end up with a lot of messy structural mapping between flat relational data and structured data, which is why projects like Rails/ActiveRecord and Hibernate are so popular.

I've tried it -- I'm not impressed; this is the crux of my point, and I haven't seen anything yet that can help me change my mind. It is a fact that the chunk of code you posted as a query is a nightmare to look at. Yes, my brain will eventually get used to it.

Despite how people try, there aren't many query languages that can match the expressive power of SQL. For all its warts, it is excellent at what it does.

> SQL is great, but it does not handle nested documents/relationships, so you end up with a lot of messy structural mapping between flat relational data and structured data, which is why projects like Rails/ActiveRecord and Hibernate are so popular.

This is a wildly inaccurate statement. SQL is for querying relational database systems. Well structured relational data is the most structured data you're ever going to find. ORMs (ActiveRecord/Hibernate) are:

- excellent at reducing boilerplate when it's a perfect fit (pro)

- often used by people who don't understand the expressive power of SQL (con)

- great at creating N+1 query problems (con)

- great at stopping you from using the deeper features of your DB (con)

ORMs are easier, that's why they saw widespread use.

It's been my experience that people fall out of love with ORMs as they reach their rough edges (which they must have, they are leaky abstractions). I much prefer query builders/generators and in-language level abstractions (methods/functions/etc).

> For example, a query such as this:

Please stop posting queries, I get the query language, the DSL is fairly easy to understand (this is great for GraphQL).

> The point isn't that you cannot, with a sufficiently complex ORM and enough elbow grease, do that with SQL. It's that it should be unnecessary, because developers shouldn't need to write an entire data layer every time they want to bring an app up. This is part of what GraphQL brings to the table, too.

I think you've conflated an ORM with a sufficiently capable and metadata-tagged backend. These features aren't necessarily ORM-level features; look at Postgrest's vertical column filtering[3] and embedded entities[4].

Also, my original point was that GraphQL is over-hyped considering that it is not a paradigm shift, and requires you to do a bunch of non-interoperable work aside from REST without much benefit.

I'm going to find some time and implement GraphQL automatically (with the help of jsonschema and hyperschema, etc) on top of a regular RESTful endpoint, I'll post it to HN when I do.

[0]: https://en.wikipedia.org/wiki/HATEOAS

[1]: https://swagger.io/specification/

[2]: https://datatracker.ietf.org/doc/draft-handrews-json-schema-...

[3]: https://postgrest.org/en/v5.1/api.html#vertical-filtering-co...

[4]: https://postgrest.org/en/v5.1/api.html#resource-embedding


> GraphQL has offered you a way to avoid writing:

> axios.get("/posts?limit=10&filter[0]=prop&filter[1]=otherProp&fetchEmbeddedEntities[0]=thing")

Is that right? I've never used GraphQL, but I thought the main advantage was not that the client didn't have to write that, but that the server didn't have to write the handling code for the joins to the embedded entity (to pick one example), or any of the dozens of other possible joins someone might want to do for any given relation.


GraphQL is more or less a data description language. It describes queries and mutations. There are JS libraries that take a GraphQL query/mutation, convert it into the spec'd JSON payload, send it to the backend (single endpoint), and map the requested fields back into a JS object to be used. This is super nice when paired with React -- it's hard to overstate how nice it is.

On the backend, there are libraries that process that JSON payload into whatever your framework wants (dicts in Python, for example). From that decoded payload, you have to figure out what to do. If you're using Django and you get something like

    {'name': 'Ruth', 'age': 85}
to your 'People' query, you need to ask your ORM for People named Ruth who are 85 years old, and return them. It is essentially exactly like REST API design, with a few important differences:

- You don't have to worry about irrelevant HTTP stuff. What status code is appropriate? What HTTP action? Isn't this URL a little wrong? All irrelevant.

- There is no standard for pagination or ordering. In fairness, REST's answer (do it in query params... just like everything else!) is a little unsatisfying. But REST never made grandiose claims about data description. You go through the whole GraphQL site and it's more or less just an exposition in how great it is to just _describe_ the fields you want to retrieve/change, but nothing about "what if I just want the first 10,000 results?" This is, as you might imagine, an important and common issue.

- There is no (meaningful) standard for errors.

I think it's a step in the right direction from REST. URL-based APIs really just never earn their keep, HTTP actions are too limiting, versioning weirds everything, query params naturally grow to contain/become their own DSLs. But as far as backend engineering goes, GraphQL really only addresses the problems we didn't really mind (URLs/HTTP actions) anyway. To answer your question directly: the benefit of GraphQL is... 98% realized only by API clients. The backend is still doing everything it used to.


This is an excellent response and is 100% correct.

At this point all I can say about my stance is that I would have preferred a standard for HTTP+JSON instead of how GraphQL is completely different.

I also completely forgot about HAL[0].

To be clear, I'm not against GraphQL, I understand it and I understand what it brings to the table but I just thought the alternative was better if only because it relied on a web of standards that were always aimed at doing more (semantic web).

[0]: http://stateless.co/hal_specification.html


I've long thought about creating a standard (IANA type) for the PostgREST JSON schema + HTTP querystring conventions. I've seen some APIs[1] (only one I can remember now) that follow PostgREST conventions, so perhaps this could benefit all of us who don't want to jump on the GraphQL bandwagon but still want an expressive and interoperable/standardized way to query resources.

We currently use OpenAPI but we found some shortcomings[2] and would like to have a more tailored standard.

Right now what we're missing is more funding, we don't have a backing company like GraphQL/gRPC, we rely on community donations through Patreon[3]. Definitely consider supporting us if you want to improve the state of REST APIs, allow us to keep fighting for web standards :).

[1]:https://developers.blockapps.net/advanced/concepts/#cirrus

[2]:https://github.com/PostgREST/postgrest/issues/790#issuecomme...

[3]:https://www.patreon.com/postgrest


Thanks to you and the other maintainers for your tireless work on Postgrest (also as a member of the Haskell community)!

Is there a paypal link for one-time-donations? I don't use patreon and am not really down to increase my online footprint but would love to be able to donate (I also couldn't find any mention of taking donations on the github or in the docs @ postgrest.org... you might get more donations if it were at least mentioned?)


Thanks a lot for your feedback @hardwaresofton, I've just added a section in our README that includes a Paypal link for one-time donations.

https://github.com/PostgREST/postgrest#supporting-developmen...

Thank you for your support!


Ha thanks.

I have fundamental misgivings about "APIs as bespoke ORMs for browser clients" -- at least how they're currently commonly implemented ([db] -> [orm] -> [controller magic] -> [request parser & deserializer / response serializer]). It just has this sort of... ad-hoc lack of coupling. Things like Datasette or PostgREST feel more "right", but I'm still a little incredulous that they're sufficient for even the majority of needs.

Designing a system that can handle concurrent reads/writes at production scale is hard even when you control the entire stack. When you consider all the constraints that modern web dev labors under, it's a miracle any of it works at all. But to be honest, my suspicion is that we get it to work by trading 90% of our performance potential for all this extra machinery between the client and the DB. We needed to do this 20 years ago because DBs didn't have the functionality (ex: row-level security) they do now, but at this point, maybe it's time to revisit. Either way, I'm doubtful that choosing a schema and protocol for client/server communication would meaningfully move the needle. I think a more fundamental rethink is necessary. Maybe after 3-4 years of wasm.



The other problem with GraphQL is that letting external clients make arbitrary queries is a denial-of-service attack in waiting, and even with internal clients, how much can you trust another department? Having ugly, not-quite-proper REST endpoints adds a negligible amount of implementation time to the client but lets the server ensure it can actually meet its performance goals.


I have seen research in this problem area, where one is capable of knowing ahead of time whether the query is legit or not. Anyway, for internal purposes it's fine.


Proper numeric types would like a word with you. (In particular, just look into how you would get infinity/NaN in there. Fun times.)

I mean, yes, you can do everything by just passing the string representation. Not exactly efficient, though. And most schema attempts in json are usually less than compelling.
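
To make the NaN/Infinity point concrete, here's what Go's encoding/json does with them out of the box -- you don't even get as far as a schema:

  package main

  import (
    "encoding/json"
    "fmt"
    "math"
  )

  func main() {
    _, err := json.Marshal(math.NaN())
    fmt.Println(err) // json: unsupported value: NaN

    _, err = json.Marshal(math.Inf(1))
    fmt.Println(err) // json: unsupported value: +Inf
  }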


You're absolutely right that it is a sticking point, but custom types can help this. Also, I don't know about you but I don't very often have Infinity or NaN as inputs that I want to see in my schemas...

Also, the JSON Schema spec is an evolving document being developed in the open; it can be improved over time (they're already on Draft 7) -- if numeric types are lacking, then suggest a way to make them better.

> And most schema attempts in json are usually less than compelling.

I'm not sure 100% what this means -- jsonschema is pretty easy to read and works pretty well for validation, people are already being very productive with it, what is not compelling?


I've wanted NaN and Infinity in a few serialization worlds. Mainly because I didn't want to have to reinvent what the IEEE numbers already have.

My complaint on the items being less than compelling is that they almost instantly devolve into bitter fights about what I'm supposed to care about. I fully grant that XML was too heavy in much of its schema attempts. However, it is much easier to reason about the extremity of XML than it is where in the middle road I want to be in.

That it is an evolving document being developed in the open is worthless if it is not always maintaining backwards compatibility. And, since there is no way to maintain backwards compatibility now that that ship has sailed, I'm skeptical of it.

I'm sure it will hit a good maxima at some point. But "is pretty easy to read and works pretty well for validation" was also fairly accurate for most XSD work. "People are already being very productive with it" was absolutely accurate. It was only when people started to stretch with it, that things got obnoxious. And we fell back on JSON not because it was somehow technically superior, but because it was technically easier.


> My complaint on the items being less than compelling is that they almost instantly devolve into bitter fights about what I'm supposed to care about. I fully grant that XML was too heavy in much of its schema attempts. However, it is much easier to reason about the extremity of XML than it is where in the middle road I want to be in.

Could you give a more concrete example? I'm really not following -- JSON + jsonschema is almost equivalent to XML + XSD in my mind, what was it that XML/XSD gave you that didn't devolve into bitter fights?

> That it is an evolving document being developed in the open is worthless if it is not always maintaining backwards compatibility. And, since there is no way to maintain backwards compatibility now that that ship has sailed, I'm skeptical of it.

This is a pretty extreme statement, if it's not hyperbole I don't think we could ever agree. I'd much rather live in a world where bad decisions can be overturned/corrected in a sufficiently new version when a big enough problem is found. Backwards compatibility is important but I don't subscribe to the thought that any non-backwards compatible changes make a spec or a document worthless.

> I'm sure it will hit a good maxima at some point. But "is pretty easy to read and works pretty well for validation" was also fairly accurate for most XSD work. "People are already being very productive with it" was absolutely accurate. It was only when people started to stretch with it, that things got obnoxious. And we fell back on JSON not because it was somehow technically superior, but because it was technically easier.

Yeahhh, I don't know if "pretty easy to read and works pretty well for validation" is true for XSD and the other XML ecosystem tools. Complexity was everywhere, and it introduced many bugs and critical security issues.

Also, I think that JSON is not only technically easier, it is technically simpler (in the Rich Hickey "Simple vs Easy" sense). JSON does not have potential cleverness built in, and that means it will likely never be what XML was (or at least all the complexity built on top will be very, very opt-in).


XML/XSD did devolve into many fights. I will not argue against that. The main outcome was the wide adoption of JSON.

What XML/XSD did give you was ridiculously easy-to-implement autocomplete in any XML editor that was decently featured. And it gave people a ton of rope to try and over-specify things.

Jsonschema seems poised to repeat the rope mistake. That is, I'm of the mind that the anti-fragile practices of schemaless JSON have actually been a boon to its success. Trying to solidify things back to a schema gives me little reason to think it will be better this time around.

I definitely agree it would be good to have bad decisions overturned. However, I will also not adopt a technology that is a constant grind to retread the same ground over and over. At least, not if I can avoid it. This puts me in the position of not trusting the likes of committees that seem to think throwing out decisions of a scant few years ago is a good idea.

The cynic in me thinks places do it so that they can just stay faster than everyone they are assigning migration tasks to.

My assertion for "pretty easy to read and works pretty well for validation" was not broadly applied to everyone. In particular, it failed at cross vendor efforts. However, the original claim was merely the vague "people", and I stand by that being just as true for XML as it was for JSON. With similar levels of complexity regarding different bugs in parsers and whatnot.


> Jsonschema seems poised to repeat the rope mistake. That is, I'm of the mind that the anti-fragile practices of schemaless JSON have actually been a boon to its success. Trying to solidify things back to a schema gives me little reason to think it will be better this time around.

Yeah I think I agree -- while considering your point I found myself wondering what made jsonschema any different this time around...

> I definitely agree it would be good to have bad decisions over turned. However, I will also not adopt a technology that is a constant grind to retread the same ground over and over. At least, not if I can avoid it. This puts me in the position of not trusting the likes of committees that seem to think throwing out decisions of a scant few years ago are a good idea.

Not like you need me to verify but that's an absolutely reasonable stance, the timeline of a standard and associated versions is important in deciding whether to use it or not.

> My assertion for "pretty easy to read and works pretty well for validation" was not broadly applied to everyone. In particular, it failed at cross vendor efforts. However, the original claim was merely the vague "people", and I stand by that being just as true for XML as it was for JSON. With similar levels of complexity regarding different bugs in parsers and whatnot.

I don't think I agree, but we can definitely agree to disagree here. XML was certainly easy to read (if you didn't use any advanced features) and as good for validation, but I think the simplest feature-complete XML parser one could write is more complex than the simplest feature-complete JSON parser -- but then again, maybe all the features I'm thinking of that XML had weren't in the core spec (I haven't read it).


To be fair, I fully acknowledge I'm being cynical. And, to that, I'm not proud.


No worries, I'm pretty cynical myself, which is why I'm pretty surprised when I see technology that actually makes me believe things could be less shit in the future (hyperschema + JSON-LD gave me that feeling).


I'd love to see something with the approximate structure of JSON, but less JavaScript-bound syntax. EDN has caught my eye, but then again I've always been fonder of Lisp-y syntax as a whole.


ooh I'd love EDN as well -- but I don't think people are willing to adopt S-expressions :(


Yeah EDN is a great format. It should be a lot more popular than it is.


Small correction -- I meant vertical filtering in postgrest[0]

[0]: https://postgrest.org/en/v5.1/api.html#vertical-filtering-co...


Yup. No stream API to unmarshal protobuffers sucks majorly. You have to re-invent framing.


Which language are you talking about? C++ proto is built around CodedInputStream which is what it sounds like. You can use it with or without the generated message code.

https://developers.google.com/protocol-buffers/docs/referenc...


Golang
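
For the curious, "re-inventing framing" in Go usually ends up as a hand-rolled varint length prefix, something like this sketch (not an official API):

  import (
    "bufio"
    "encoding/binary"
    "io"

    "github.com/golang/protobuf/proto"
  )

  // writeDelimited frames a message as <uvarint length><message bytes>.
  func writeDelimited(w *bufio.Writer, m proto.Message) error {
    b, err := proto.Marshal(m)
    if err != nil {
      return err
    }
    var lenBuf [binary.MaxVarintLen64]byte
    n := binary.PutUvarint(lenBuf[:], uint64(len(b)))
    if _, err := w.Write(lenBuf[:n]); err != nil {
      return err
    }
    _, err = w.Write(b)
    return err
  }

  // readDelimited reads one framed message back out of the stream.
  func readDelimited(r *bufio.Reader, m proto.Message) error {
    size, err := binary.ReadUvarint(r)
    if err != nil {
      return err
    }
    buf := make([]byte, size)
    if _, err := io.ReadFull(r, buf); err != nil {
      return err
    }
    return proto.Unmarshal(buf, m)
  }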


People should be able to provide constructive and useful criticism of a technology without having to build their own solution.


I spent 2.5 years at Google, and most of what I did was pushing one protobuf from one place to another :) - and I loved it... Honestly though, you can complain all day, but some of the decisions criticized in the list you presented most likely come from daily experience, not as a mere user of protobufs (which is all I was), but as someone who had to support a plethora of compression formats, how protobufs get stored in the different databases, proxying, etc. etc.

Your "oneoff" and "repeated" might've been (I dunno really), a decision based on metrics where encoding repeated oneoff would've cost more, or made problems, more testing, or who knows what.

For one, Google was, and I believe still is, open to discussing all kinds of matters, and you can get onto a design doc and comment if you like, if you care, if you have something to say... You can definitely influence things, and people really did talk (you can even experience this from their public design docs - like from Chromium, etc.).

And I love protobufs, so I guess I'm biased...


Wow, different strokes indeed. Discovering that most of what I was supposed to do at google was going to consist of pushing protobufs from one place to another completely destroyed my enthusiasm for working there. I hated it!


I got bad news for you but almost all programming is just moving bytes from one place to another, and occasionally multiplying, adding, subtracting, or dividing some of them. ;)


I wasn't really expecting to work on the next breakthrough algorithm and such (although folks in my team actually worked on some incredible statistics-related stuff - and this is where I understood I know nothing about statistics).

Plenty for everyone, I think. I'm glad I was part of the experience!

As an ex-game-tools developer (back then, and back again - current job), it was quite a change for me. If I haven't learned enough, then at least I've seen things from quite a different perspective.


Come and work in the film industry, where you will also discover it is still mostly plumbing, moving data from one place to another.

(Hello from another fellow Real Software alumnus :D)


Hi! That's a name I haven't seen in a while. :-)

Oddly enough, I worked on film-industry data plumbing for a couple of years before I joined Real - a distributed, fibre-channel based file system called Centravision. Just looked it up and it's apparently still in use, two acquisitions later, under the name "StorNext"...


Ignoring possible implementation difficulties, the application of a repeated oneof seems very esoteric. In most cases repeating the underlying fields is cleaner than repeating a oneof. E.g. instead of

    repeated oneof Option {
      Foo foo = 1;
      Bar bar = 2;
    }
use simply:

    repeated Foo foo = 1;
    repeated Bar bar = 2;
There is no longer "one of", therr are multiple, so why keep the complexity? If the interleaved ordering really matters, you can always fall back to a sub-message.


This is the first time I've worked on a project using protobuf. Previously, I have used CORBA, ASN.1 and XML. Protobuf (in Java) is a real pleasure to work with.


I only interned at Google but I also really liked protocol buffers, somehow it seems very structured and defined to me, and better than Java. But this could just be that I haven't worked with better things... Lol



> Despite map fields being able to be parameterized, no user-defined types can be. This means you'll be stuck hand-rolling your own specializations of common data structures.

What a pain! Well, at least Google won't make that mistake again!


My contention with the quoted text is that you probably shouldn't be using elaborate data structures in streams/files. Using DTOs as heap/stack data has bitten me enough times that I'm fairly certain that it's an anti-pattern.

It doesn't matter if you're using a quantum binomial tree in a black hole: save it as a 'stupid' map<k,v> when it hits the network. That way everyone who interacts with your service can decide how they want to represent that structure. You can compose any in-memory data structure you can dream of, you can validate the data with more richness than "missing" or "present", and Protobuf doesn't contaminate your codebase.

Option 3 is the correct choice. Serialization libraries should be amateurish by design.


> That way everyone who interacts with your service can decide how they want to represent that structure.

Protobufs aren't a message format for publicly specced standard wire protocols. They're a serialization layer for polyglot RPC request and response messages. The whole point of them is that you're transporting the same typed data around between different languages, rather than there having to be two conversions (source language -> "standard format"; "standard format" -> dest language).

In this sense, they're a lot like, say, Erlang's External Term Format—driven entirely by the needs of an RPC-oriented distributed-computation wire protocol, doing interchange between nodes that aren't necessarily running the same version of the same software. Except, unlike ETF, Protobufs can be decoded without the backing of a garbage-collected language runtime (e.g. in C++ nodes.)

I'm not saying Protobufs are the best, but you have to understand the problem they're solving—the "rivals" to Protobufs are Cap'n Proto and Thrift, not JSON. JSON can't even be used to do a lot of the stuff Protobufs do (like talk to memory-constrained embedded kernels running on SDN infra.)

> and Protobuf doesn't contaminate your codebase

Just like in HTTP web-app codebases, a codebase properly designed to interact with an https://en.wikipedia.org/wiki/Enterprise_service_bus will tend toward Hexagonal architecture—you have your own logic, and then you have an "RPC gateway" that acts as an MVC system, exposing controllers that decode from the RPC format and forward requests into your business logic; and then build responses back into the RPC "view" format.

Once you have that, there's no point in having your application's parsed-out representation of the RPC format be different than the on-the-wire RPC format. The only thing that's touching those RPC messages is your RPC gateway's controller code anyway. So why not touch them with as little up-front design required as possible?


In the context of gRPC, it's true that Protobuf is intended to be a wire protocol, but the argument would be more convincing if the code generator toolchain Google fostered created code that integrated better with the languages that people use.

Typically, the data types and interfaces generated by these tools are so poor that you need to build another layer on top that translates between "Protobuf types" and "native types", as you describe in your comment, and shields the application from the nitty-gritty details of gRPC. So investing in gRPC means that when you've generated, say, Go files from your .proto files, you're only halfway done. Protobuf in itself introduces a kind of impedance mismatch, a kind of stupid in-bred cousin whose limited vocabulary has to be translated back and forth into proper language.

So gRPC/Protobuf solves something important at the wire level, but developers really want to productively communicate via APIs, and so what you have is just half the solution.

I wish the Protobuf/gRPC toolchain were organized in such a way that the generated code could actually be used as first-class code. Maybe similar to how parser generators like Lex/YACC or Ragel work, where you provide implementation code that is threaded through the output code.


> you probably shouldn't be using elaborate data structures in streams/files.

I don't consider Option<T> a particularly "elaborate" data structure, but protobuf would benefit heavily from it. Instead, as the author alludes to, you're forced to do different things depending on which side of the message/scalar dichotomy T falls on; if it's a message, you get Option<T> essentially for free, as a submessage can always be omitted. (Which is bad in the case that you just want a T; protobufs will happily let you forget to encode that attribute, and happily decode the lack thereof at the other end, all without error, despite the message being malformed, as it lacks a way to express this — rather common — constraint.) If T is a scalar, you get to use one of the WKTs.

See this GitHub issue for more context: https://github.com/protocolbuffers/protobuf/issues/1606

I work with data where most attributes outside of the attributes that compose the key identifying some object are quite possibly unknown. Our data comes from messy datasets, and while we clean it up as best we can, sometimes an attribute's value is unintelligible in the source data, and it is more pragmatic to move on without it. In SQL, we represent this as NULL, in JSON, null, but protobuf makes it rather difficult, and this frustration is evident from the other posters in that GitHub issue.
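
A sketch of how that asymmetry looks in Go, assuming a hypothetical pb.Record message with a message-typed Address field and an Age field declared as google.protobuf.Int32Value:

  import "github.com/golang/protobuf/ptypes/wrappers"

  func describe(rec *pb.Record) {
    // Message fields get presence "for free": nil means absent.
    if rec.GetAddress() == nil {
      // unknown in the source data -- what SQL NULL would express
    }

    // A plain int32 can't distinguish 0 from "unset", hence the WKT wrapper.
    if rec.GetAge() == nil {
      // unknown
    } else {
      _ = rec.GetAge().Value // known, possibly zero
    }
  }

  func setKnownAge(rec *pb.Record) {
    rec.Age = &wrappers.Int32Value{Value: 85}
  }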


You can always write a code-generator to ease the pain.


That's what I hear from Go devs all the time; it doesn't sound convincing to me.


I think both comments are sarcastic jabs at Go.


This feels pretty vitriolic, I wouldn't be surprised if there is some bias here. A lot of these problems seem pretty minor and there's weird stuff like

"Unlike most companies in the tech space, paying engineers is one of Google's smallest expenses."

According to here https://www.quora.com/How-many-software-engineers-does-Googl... there were ~28,000 engineers in 2014, which, paying at $120,000 a year (Glassdoor), would be... $3.3 billion a year? And this was 2014, and doesn't include stock compensation.

Don't get me wrong - there are definitely problems with protobufs. But this post is citing a lot of stuff to back itself up that doesn't add up.

Also, required vs optional - definitely only optional, there should be nothing required. So I flat out disagree with that one.


Google's annual revenue was $110 billion in 2017 [1]. Even if headcount doubled and salary has increased, that's $7 billion a year. That's not peanuts, but at a company level it's not a massive expense.

[1]: https://www.androidauthority.com/alphabet-q4-2017-earnings-8...


True. I'm also guessing this number changes a lot with stock grants (that number was just base pay), and that yearly number seems pretty low (NYT has the entry level engineer at 124,000 https://www.nytimes.com/2017/09/08/technology/google-salarie...) so glassdoor might not have good data.

I wouldn't be surprised if it was much more than 6%.


I also wouldn't be surprised if it was more than 6% - but I very much doubt it could be over 12-15%. That's still not a very high percentage of revenue. It's most definitely a significant cost, but unlikely to be their biggest cost.


Anytime: pull out the revenue gun. It sounds like the revenue is actually produced by the machines, though. I wonder why there isn't another search company competing with Google...

Btw, if you use Glassdoor to deduce Google's cost per employee, you probably have no idea how a company works...


lol yeah I know, but it's at least a lower bound. I would guess employees are one of the largest expenses for Google.


While you're right, I'd like to point out that an employee earning 120k in wages costs an employer a lot more than 120k. Social security tax, Medicare tax, Federal Unemployment Tax Act tax, possibly state unemployment tax, health insurance, and some states have something called workers' compensation insurance. At Google, I'm sure there's also a lot of benefits included that would count towards the cost of each employee.


Thinking of that problem in terms of engineers not being the biggest expense is also a silly way to approach the problem.

At 20000 engineers, it’s a net positive to add a full time person to save everybody else 5 seconds a day. If protobuf were such a big waste of engineering time, then working to make it more efficient for engineers while being no worse on the wire would be worth a lot of dedicated engineers.


That's why Microsoft made sure that Windows Update always runs in the background and never reboots during a presentation. Oh wait..


> Protobuffers correspond to the data you want to send over the wire, which is often related but not identical to the actual data the application would like to work with.(...)

> Option 1 is clearly the "right" solution, but its untenable with protobuffers. The language isn't powerful enough to encode types that can perform double-duty as both wire and application formats. Which means you'd need to write a completely separate datatype, evolve it synchronously with the protobuffer, and explicitly write serialization code between the two.

I'd say, get used to this. This is exactly the same story you have with ORM/ActiveRecord. You either treat it as a different layer and write the translation code that's needed, or mix it with internal logic and pay the price of a messy codebase later. The problem isn't protobuf- (or ORM-) specific. It's just a consequence of the fact that your internal business model and the data you want to save/send are usually two different things.


The worst is when people deny that is the truth.

"Then model is like 80% the same, just use the same class!"


Yeah. It's not always obvious at the beginning, though. In my second job, I didn't notice this until it grew to the point we couldn't really implement anything new without spending 90% of the time on the spaghetti the business logic became. My excuse is that we were all new at this kind of work, but the lesson is clear: just don't do that.


Put me firmly in the camp of "optional fields are bad." I believe all fields should be required.

I come from an ONC/RPC background, which is the original UNIX RPC. Every iteration of an rpc would get versioned, and then you could write a conversion between versions, ex from V1 to V2, from V2 to V3, etc. This allowed for true backwards compatibility.

The idea of "forwards compatibility" is a pipedream, in my opinion. Just because something won't crash because a field is or isn't there doesn't mean that it will actually work. Code ages over time, and instead of looking at the spec, which will tell you which fields are required, you need to read the documentation, which may or may not tell you what is required. And especially when dealing with an older client, it simply may not work because what used to be optional-optional fields become required-optional fields.

It's the same argument as schemaless NoSQL databases like Mongo vs MySQL. Having a schema is a pain in the ass, but it's better over time and ages much more gracefully than schemaless ones, because things won't just break.


If all fields are required, you cannot have middleware that processes multiple versions of the same protobuf, and every application has to be updated when a field is added, even if they do not use that field. This is one of the more important design goals underlying not only protobufs, but most of the non-language specific binary formatters.


> every application has to be updated when a field is added, even if they do not use that field

This is when, in a protocol, you reach for the hammer called "extensibility."

In this case (decoding to native structs), you'd probably have your FooMessage product-type have an "extensions" field, which is a list of zero or more FooExtensionStructs, where a FooExtensionStruct is a sum type of the known extensions to FooMessage.

Then, just make the semantics of your format such that a sum type can have a pragma in its IDL indicating that it isn't required to be decodable for the product-type containing it to be successfully decoded. I.e., the sum type becomes a (DecodeResult = Decoded(sum type) | Opaque(raw wire data)).

Opaque (unrecognized) extension data can't be manipulated by the program, but it can losslessly survive a trip back through the wire encoder. So you can choose to pass it through in your middleware, or not.

Boom: you've reinvented the "chunks" concept of the https://en.wikipedia.org/wiki/Interchange_File_Format, as seen in PNG and ELF.
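
A rough Go-flavored sketch of that shape (all the names are made up; the point is only the Decoded-vs-Opaque split):

  // FooExtension is the sum type of known extensions, plus an escape hatch.
  type FooExtension interface{ isFooExtension() }

  // A known extension decodes into a real struct the program can use.
  type RetryPolicyExt struct{ MaxAttempts int }

  func (RetryPolicyExt) isFooExtension() {}

  // An unrecognized extension is kept as opaque wire bytes: the program
  // can't touch it, but it survives a round trip through the encoder.
  type OpaqueExt struct{ Raw []byte }

  func (OpaqueExt) isFooExtension() {}

  // FooMessage itself is final; all future growth goes through Extensions.
  type FooMessage struct {
    Id         uint64
    Extensions []FooExtension
  }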


I believe you've also reinvented optional fields in a more generalizable way. Generalizability can be good or bad, depending on how much complexity it adds.


Nope—there's a very important difference in the two approaches. You have to choose to add the kind of extensibility I was describing to a particular type, in advance, as part of that type's original specification. You have no choice in having optional fields.

With Protobuf optional fields, even if you formally specify a protocol where some particular message is absolutely "final" and will never be extended, anyone can just throw an optional member on there when sending one to you, and your implementation won't reject the struct. This is a big problem if you're trying to specify types that really map one-to-one to fixed-size structs you'll use in the business logic of your non-memory-managed language.

With explicit extension points, you get to dictate where and when extension happens; and, as well, you get to formalize exactly what extensions look like (i.e. you don't have to allow for un-recognized extensions; you can simply have extensions be a finite set of known new extensions, where everything else is a real decode error.)


... however, it's a great feature if the use case is "protocol for client-server network communication," which was the design goal for protobuffers. It follows the "permissive in what you accept, strict in what you emit" design philosophy.


> every application has to be updated when a field is added, even if they do not use that field

No, you maintain the older versions of the API. V1 of the API uses the V1 struct, V2 of the API uses the V2 struct, and so on. Older applications keep working because they call the older APIs, and you can convert between V1 and V2 while keeping only one implementation behind them. Or, if you want, you can maintain both versions of the API, V1 and V2, at the expense of maintenance costs. But it's absolutely doable (we did it for decades).
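As a minimal sketch of that convert-at-the-edge approach (TypeScript-flavored, hypothetical types):

    interface UserV1 { name: string }
    interface UserV2 { name: string; email: string }

    // The V1 endpoint converts to the V2 struct, runs the single V2 code
    // path, then converts the response back down.
    function v1ToV2(u: UserV1): UserV2 {
      return { name: u.name, email: "" };   // explicit, documented default
    }

    function v2ToV1(u: UserV2): UserV1 {
      return { name: u.name };              // drop fields V1 never had
    }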


You are missing the case where you have a middle layer. A message comes in at v3, hits a layer that only knows v1, then gets passed to a layer that is at v4.

I'd wager most places don't have that many layers. But, if you are embracing microservices, you'll find yourself here fairly fast.


I doubt any api call trying to work in such a chaotic environment would actually work, and having all optional fields won't magically make things work. It will probably fail but in very mysterious ways. This sounds more like an environment where microservices are completely out of control and chaotic.


Doubt all you will, but being able to update different pieces of your service independently is a huge win. Having to update everything in lockstep is what I'd describe as 'out of control'. Which is precisely what forcing every field to be required does.

There's a million and one reasons for why you may want to push, or rollback only 1 out of X services. If you follow a few simple rules when adding/removing fields, you can do this safely.


You don't need to update everything lockstep. You maintain compatibility by converting the structs back and forth. However, all ambiguity is gone and everything is well understood.


This fails in several key spots. First, not every update should require a complete migration of all at rest or in flight data. Adding that as a requirement will instantly kill many efforts.

Second, you can't convert "to and from" a format you don't have specified yet. So, you have to first push out all code that can convert to a new version into all spots before you can start using that version. This will similarly kill many efforts.

If you are a smaller shop, you can jump both of those hurdles. But I will gladly assert that the extra work of jumping those hurdles will grow faster than your ability to advance past them.


No one said that all fields should be optional, only that it should be possible to have optional fields. It may sound like chaos to those who've not worked in such environments, but in fact it is not at all unusual in large companies to have the same message pass through several layers of services in a single call-flow. This long predates the term "microservices".


> No one said that all fields should be optional

As of protobuf v3, all fields must be optional. https://github.com/protocolbuffers/protobuf/issues/2497


Unless you are pushing validation of your data to an overly simplistic language specification, you will have to validate your data in your code. Once you are doing that, you might as well get used to the fact that you can't trust any data at rest. Even if you supposedly wrote it.

To that end, I fully agree that all fields should be "optional" for the protocol layer.


> You are missing when you have a middle layer.

And storage. People might have petabytes of historic data stored in protobufs.


How do you handle binary rollbacks and rollouts safely with an "everything is required" approach? Do you force binaries to roll out in a strict order with appropriate soak time at each layer? How does that affect developer velocity?


No, you create conversion routines that convert between different versions of structs. This keeps things well understood, with no ambiguity. This is very easy; we were doing it 20 years ago, autogenerating the conversions with ANTLR by parsing the XDR files for ONC/RPC.


This doesn't make sense with the concept of rollbacks.

If I rollout server version 2 and client version 2 which each use a new required field, and then realize that there is some terrible error in server version 2, I can't roll it back to version 1, since it will reject all client calls from v2 clients.

The only way to make it work is to add a translation layer, as you suggest, on the server, wait a while, push the new 'required' client, wait a while longer, and then push the server without the translation layer. That's the "strict ordering with appropriate soak time" GP mentions.


Instead, you're going to get errors from the clients using version 2, because server version 2 was rolled back. You have to roll back the clients as well then.

Or you could have client version 2 know how to automatically convert to server version 1, because you know what version the server is on, and you can convert your client parameters or even behavior to fit version 1.

You can't do this with protobufs because there is no such concept, you just add optional fields, and ignore them with different versions, and it's chaos.


>Instead, you're going to get errors from the clients using version 2, because server version 2 was rolled back. You have to roll back the clients as well then.

This depends on the update. If indeed the field was optional, you won't. A common example would be a field that is necessary for a new feature, but without which everything functions just fine, or with a minor degradation in experience.

But more importantly, you want to decouple features from API versions and communication protocols. It should be possible to enable a feature without rolling out a new version (for example, via a flipped flag). So a degraded communication protocol shouldn't actually impact anything. The application worked fine an hour ago; adding a new field won't break it.

>You have to roll back the clients as well then.

And if the clients are phones?

>Or you could have client version 2 know how to automatically convert to server version 1

This requires communicating with the server beforehand. Why should I have to round-trip to negotiate which API version I should use? And what happens if you want to modify the protocol that you use to decide on API versions? It's turtles all the way down.

>because you know what version the server is on, and you can convert your client parameters or even behavior to fit version 1.

So now, before I can roll out a server update, I need to roll out a client update that can make sure to negotiate back to a degraded experience until I update the server. Then I have to wait until that new client is rollback safe. Then I can update the server, then eventually I can update the client to remove the shim code. That's the "strict ordering with appropriate soak time" issue that you're still running into.

This has cascading consequences, each server/client update dance has to be mostly atomic, so you can only really do one of these dances at a time, and all clients have to be in sync. If a deep dependency service wants to make an api change, it has to wait until all of the clients are prepared before updating, and if any of those clients is the server in another context, they have to wait until everyone is ready.

That's chaos. And I don't want to do that when the other option is "update the server whenever, as clients upgrade, they'll see the improved service". You avoid the dance. It moves the initiative of an upgrade from the client to the server, and this is a good thing, because there are more clients than servers.


> each server/client update dance has to be mostly atomic

There’s a nice write-up showing how that’s a risk, but not an absolute restriction, when using messages with required fields: https://martin.kleppmann.com/2012/12/05/schema-evolution-in-...

The basic idea is to acknowledge that all systems with forward and backward compatibility will have a translation layer, the question is just how is that defined and implemented?

If all fields are optional, it means that all readers need to handle any field being missing, in other words all readers must be able to process empty messages. A user update message might be missing a user id, and the reader will have to handle that. A couple of options come to mind: do nothing if there is no user id, or return an invalid message error. The key is that this is a translation layer that can noop or error before the message reaches the service business logic.
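A tiny sketch of that boundary check (TypeScript-flavored, with a hypothetical message type):

    // Everything is optional at the wire level; the reader decides, up
    // front, whether a missing user id is a no-op or an error.
    interface UserUpdate { userId?: string; displayName?: string }

    function handleUserUpdate(msg: UserUpdate): void {
      if (msg.userId === undefined) {
        throw new Error("invalid message: missing userId");  // or: return (no-op)
      }
      applyUpdate(msg.userId, msg.displayName);
    }

    function applyUpdate(userId: string, displayName?: string): void {
      // business logic only ever sees messages that satisfy its requirements
    }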

Then another thought is that message schemas needn’t be defined with version ids; trying to define a strict ordering between message versions is hard, as you say, especially when handling non-linear updates, e.g. rollbacks or readers and writers skipping versions.

Instead, let’s define message schema compatibility. The user message processor could be defined to say it will only process messages with user ids - which practically speaking will be the case regardless of the message definition format - then a message without a user id can be rejected by common message parsing code, without per-service per-field translation code.

With a clear set of compatibility rules, it is even possible to write sensible reusable schema compatibility checking, eg: https://avro.apache.org/docs/1.7.7/api/java/org/apache/avro/...


We agree to disagree. I don't think you can convince me that all-optional is better than all-required or vice versa, which is okay. My point is that required fields make software age better over the long run, because everything is explicit. If you don't agree, that's your prerogative. Everyone thought schemaless NoSQL was a godsend, until their code/service iterated a dozen times, developers left, documentation got out of date, and now all their older data can't be read because it doesn't match the code. The same holds true for RPC, in my opinion. Yours may differ.


Right, and my point is that all required fields prevents you from iterating. Your software doesn't age at all.

I've never found the problems you describe, and I work with some of the oldest protos around!


If all fields are required, how do you add a new required field without breaking all of the clients that were using the previous protobuffer version?


Middle Name. Address Line 2. Date Of Death.


Where something like JSON or (if you must) XML is used, I prefer the idea of REQUIRING the preservation of all fields in the original structure UNLESS a field or set of fields is validated and then updated in place.

This lets end users and extension authors do things that make sense, such as adding a comment tag (JSON key:value "__customcomment": "Any well formed string will be safe here.", or XML <!-- Please don't eat me when parsing the config! -->) to a configuration file stored in either format, and actually having it persist.


Hmmm, I suppose I objectively agree with some of the points the author made, but as someone who works with protocol buffers daily, those issues never actually come to be problematic in practice. In fact, I have nothing but positive things to say about protocol buffers and find them pleasant to work with. Definitely a step up from sending raw JSON down the wire.

Granted, the application I'm working on is fairly boring/vanilla so maybe I don't feel the pain points that come from going off the beaten path.


Maybe, for the benefit of everyone, could you elaborate more on your vanilla application? That way we can understand why your problem-free experience differs from the author's or others'.


> Give the ability to parameterize product and coproduct types by other types.

This is "worse is better" at work. protobufs are inconsistent and types map to Java. But it's a feature because Java is a language designed for for your average programmer. It avoids words like "coproduct types". Because most people's eyes will glaze over that and their mind will think "this is too complicated, there must be something simpler, like a map or something..."

> "Google is filled with the world's best engineers," and that "anything they build is, by definition, not built by amateurs.

That's true too though. It's a major factor when picking tools. So protobufs are used because of Google. gRPC is used because of Google. Kubernetes because of Google. Angular because of Google. Go became popular because of Google etc. Of course there are technical advantages to those things and they are great tools, but I think quite often the reason is simply ... because it's Google.


It's abundantly clear why this author lasted only a year at Google. Aside from using the non-idiomatic "protobuffers" ...

"nothing wants to inspect only some bits of a message and then forward it on unchanged"

In my experience it is extremely useful to partially parse a protobuf, or to not parse it at all and simply modify it by appending to it. Also useful is the ability to define a message type that is isomorphic on the wire with a different type, but is much cheaper to parse (example: a deeply nested structure where all of the fields of interest are defined at the top level).

Nobody even within Google will argue that no mistakes have been made with proto. Proto3 is generally acknowledged to have been a big mistake and it doesn't have a lot of traction within Google. Map and OneOf are both basically antagonistic to high performance application code, and people avoid those for that reason. But the author's arguments are pretty weak and his dismissal of Google itself as an existence proof is insufficiently supported.


Ah right... when you can't attack the ideas, attack the author. Sandy is a Haskell enthusiast, and it is not a surprise that his criticism of protobuf is inspired heavily by that.

The argument presented is that protobuf does not attempt to even replicate the best practices we already have in terms of data representation. It makes sense why languages like C are constrained in their data representations -- they want to be close to the machine. However, a language meant literally to specify data really ought to be able to handle the basic competencies of its field (like sum and product types and polymorphism) with ease.

There is no excuse in 2018 for a data language without polymorphism and without product and co-product types. Compatibility with legacy or substandard languages is not an excuse for poor design. Compatibility with needed legacy languages should be the thing that's tacked on, not basic functionality.


Sure that makes sense if you can write off C++ as a "legacy" language, which is fine if you just can't get your mind wrapped around the scale at which Google operates. Due to the nature of weighted averages, you can't just write off something as being a "Google-only problem". Google and its peers like Amazon and Facebook own a very large fraction of the world's computing resources.

I think the overall mistake you and the author are making is assuming it is desirable for one program examining an encoded message to respect the type system of the program that produced it. That assumption is not obviously correct. The flexibility to just treat a vector of numbers as a vector of numbers, or to skip it, or ignore it, etc are pretty important in practice. That is why protobuf is defined at the level where it is defined. It encodes numbers and strings on the wire. It happens to be a fairly good way to represent search data. It brings no CS academic wankery to the party. It's very practical.


C++ is a legacy language when it comes to data representation. Its ideas on what constitutes 'data' are from a different era. That is not a criticism of C++ or its utility. It is just a fact that data representation has changed a lot since the time of C++'s development.

Products and coproduct types are not academic wankery.


The machine does not think about category theory. The machine thinks about numbers. It can add them together! The way the machine thinks about numbers has not meaningfully changed in 40+ years.


Product and coproduct types are not category theory (what is that?)


> Ah right... when you can't attack the ideas, attack the author.

This feels unintentionally ironic considering the article starts with "They're clearly written by amateurs, unbelievably ad-hoc, mired in gotchas, tricky to compile, and solve a problem that nobody but Google really has"


> All you've managed to do is decentralize sanity-checking logic from a well-defined boundary and push the responsibility of doing it throughout your entire codebase.

One wonders if the author has practical experience with evolving a complex system composed of multiple independent services without downtime. Centralized sanity checking conflicts with transitioning services stepwise, e.g. from foo.a to foo.new_a by first deploying producers that send both foo.a and foo.new_a, and then phasing out foo.a after all consumers are updated.

Services which communicate via some general-purpose, strongly typed schema language that's still lax enough to evolve backwards-compatibly are, in my experience, far easier to understand, maintain and evolve than the alternatives, such as JSON REST APIs, overly tied-down static protocols, or completely ad hoc ones.


He meant "centralised in the serialisation library". I don't see how that conflicts with versioned evolution, especially because protobufs historically did verify the presence of fields.

The story behind "required considered harmful" at Google is really quite shameful. I was there at the time and couldn't quite believe people were making that argument. Beyond all the logical problems with it, they were basically saying Jeff Dean didn't know what he was doing when he made that decision (except they never actually said it explicitly because the notion itself would have been subjected to ridicule).

I used to like protobufs but I wouldn't use proto3. It seems to have regressed over time instead of getting better. Partly this is because the wire format is not evolvable. Protobufs do not have a "this is protobufs v1" header in them. This also explains the type system issues: Google cannot/will not evolve the wire format so each new version of protobufs is just different tooling and APIs over it.


I was there at the time too and couldn't believe people were defending required. I had already run into enough problems with it that I stopped using it earlier, and I worked on a relatively small system with not that many moving parts! I can hardly imagine how much pain required must have caused in something like the indexing pipeline.

The thing I realized was that even if a field is present, you almost always still want to do further checks on it before declaring the whole thing valid and good to go. "required" just tells you that it's an integer, for example, it doesn't tell you if it's in the range [0, 1000). Or that a string is a valid filesystem path. Or that two parallel arrays are the same size. You have to do those things anyway, so the has_... check is no big deal. And making everything optional is really nice when you have to make large changes to your protocol and don't want to have to carry around dummy values forever.

Protocol buffers weren't designed from scratch by Jeff Dean, they evolved from various ad-hoc ways of specifying query parameters (which, ok, were probably mostly written by Jeff), so I don't think it reflects badly on him to say that required was a mistake. Groups were a mistake in retrospect too, but no one thinks that means Jeff is an idiot.


Btw, I realize this argument is very similar to the argument for dynamic typing + lots of unit tests instead of static typing: "you have to write the tests anyway, and the tests will check the types".

I'd like to note that even if you're a static typing fan and don't believe that argument for code, you can still believe it for serialization and protocols, because data schemas and protocols evolve in different ways than code does (they're almost duals of each other).


The argument from Jeff Dean authority suffers from the fact that no matter how smart you are, getting some trade-offs right will require empirical validation.

Just having multiple endpoints or doing version negotiation might work fine in some cases, but I can't see a particularly good way that field-replacement evolution would work with exclusively required fields in general.

What's your suggested approach?


Hello. I didn't invent Protocol Buffers, but I did write version 2 and was responsible for open sourcing it. I believe I am the author of the "manifesto" entitled "required considered harmful" mentioned in the footnote. Note that I mostly haven't touched Protobufs since I left Google in early 2013, but I have created Cap'n Proto since then, which I imagine this guy would criticize in similar ways.

This article appears to be written by a programming language design theorist who, unfortunately, does not understand (or, perhaps, does not value) practical software engineering. Type theory is a lot of fun to think about, but being simple and elegant from a type theory perspective does not necessarily translate to real value in real systems. Protobuf has undoubtedly, empirically proven its real value in real systems, despite its admittedly large number of warts.

The main thing that the author of this article does not seem to understand -- and, indeed, many PL theorists seem to miss -- is that the main challenge in real-world software engineering is not writing code but changing code once it is written and deployed. In general, type systems can be both helpful and harmful when it comes to changing code -- type systems are invaluable for detecting problems introduced by a change, but an overly-rigid type system can be a hindrance if it means common types of changes are difficult to make.

This is especially true when it comes to protocols, because in a distributed system, you cannot update both sides of a protocol simultaneously. I have found that type theorists tend to promote "version negotiation" schemes where the two sides agree on one rigid protocol to follow, but this is extremely painful in practice: you end up needing to maintain parallel code paths, leading to ugly and hard-to-test code. Inevitably, developers are pushed towards hacks in order to avoid protocol changes, which makes things worse.

I don't have time to address all the author's points, so let me choose a few that I think are representative of the misunderstanding.

> Make all fields in a message required. This makes messages product types.

> Promote oneof fields to instead be standalone data types. These are coproduct types.

This seems to miss the point of optional fields. Optional fields are not primarily about nullability but about compatibility. Protobuf's single most important feature is the ability to add new fields over time while maintaining compatibility. This has proven -- in real practice, not in theory -- to be an extremely powerful way to allow protocol evolution. It allows developers to build new features with minimal work.

Real-world practice has also shown that quite often, fields that originally seemed to be "required" turn out to be optional over time, hence the "required considered harmful" manifesto. In practice, you want to declare all fields optional to give yourself maximum flexibility for change.

The author dismisses this later on:

> What protobuffers are is permissive. They manage to not shit the bed when receiving messages from the past or from the future because they make absolutely no promises about what your data will look like. Everything is optional! But if you need it anyway, protobuffers will happily cook up and serve you something that typechecks, regardless of whether or not it's meaningful.

In real world practice, the permissiveness of Protocol Buffers has proven to be a powerful way to allow for protocols to change over time.

Maybe there's an amazing type system idea out there that would be even better, but I don't know what it is. Certainly the usual proposals I see seem like steps backwards. I'd love to be proven wrong, but not on the basis of perceived elegance and simplicity, but rather in real-world use.

> oneof fields can't be repeated.

(background: A "oneof" is essentially a tagged union -- a "sum type" for type theorists. A "repeated field" is an array.)

Two things:

1. It's that way because the "oneof" pattern long-predates the "oneof" language construct. A "oneof" is actually syntax sugar for a bunch of "optional" fields where exactly one is expected to be filled in. Lots of protocols used this pattern before I added "oneof" to the language, and I wanted those protocols to be able to upgrade to the new construct without breaking compatibility.

You might argue that this is a side-effect of a system evolving over time rather than being designed, and you'd be right. However, there is no such thing as a successful system which was designed perfectly upfront. All successful systems become successful by evolving, and thus you will always see this kind of wart in anything that works well. You should want a system that thinks about its existing users when creating new features, because once you adopt it, you'll be an existing user.

2. You actually do not want a oneof field to be repeated!

Here's the problem: Say you have your repeated "oneof" representing an array of values where each value can be one of 10 different types. For a concrete example, let's say you're writing a parser and they represent tokens (number, identifier, string, operator, etc.).

Now, at some point later on, you realize there's some additional piece of data you want to attach to every element. In our example, it could be that you now want to record the original source location (line and column number) where the token appeared.

How do you make this change without breaking compatibility? Now you wish that you had defined your array as an array of messages, each containing a oneof, so that you could add a new field to that message. But because you didn't, you're probably stuck creating a parallel array to store your new field. That sucks.

In every single case where you might want a repeated oneof, you always want to wrap it in a message (product type), and then repeat that. That's exactly what you can do with the existing design.
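In TypeScript-ish terms (a sketch of the token example above; the names are made up):

    // Repeat a wrapper record containing the sum type, not the sum type
    // itself, so per-element fields can be added later without breaking
    // the array.
    type TokenValue =
      | { kind: "number"; value: number }
      | { kind: "identifier"; name: string }
      | { kind: "operator"; op: string };

    interface Token {
      value: TokenValue;
      sourceLocation?: { line: number; column: number };  // added later, compatibly
    }

    type TokenStream = Token[];   // "repeated Token", never "repeated oneof"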

The author's complaints about several other features have similar stories.

> One possible argument here is that protobuffers will hold onto any information present in a message that they don't understand. In principle this means that it's nondestructive to route a message through an intermediary that doesn't understand this version of its schema. Surely that's a win, isn't it?

> Granted, on paper it's a cool feature. But I've never once seen an application that will actually preserve that property.

OK, well, I've worked on lots of systems -- across three different companies -- where this feature is essential.


Indeed, I was waiting for the punchline where he reveals a superior alternative (given the fundamental nature of his criticism, it needed to be a fundamentally superior concept to justify the rhetoric: something that would give you an a-ha moment rather than iterative improvements). When you open sourced protobuf (thank you for doing so), we at the company I worked for at the time had implemented a functionally similar but not language-agnostic format (as I believe many companies had), and it was very exciting to have a cross-language implementation. Also, to your last point, one of the major drivers was the forward-backward compatibility and message preservation across intermediate systems that only required knowledge of their portion of the message, so the author's assertion that this never comes up also surprised me.

Of course I am all ears for the next revolution in high performance message formats that support schema evolution, etc., but the author did not provide one, and I think if one is going to unleash that level of negative rhetoric, one takes on the responsibility to also unveil at least a glimpse at an alternative.


> OK, well, I've worked on lots of systems -- across three different companies -- where this feature is essential.

Here's an actual real-life example: Chrome uses this in the Chrome Sync feature that lets you sync your browser configuration and state across devices. The feature is implemented basically like this: Chrome sends its version of the state to the server in a proto, and the Sync server reconciles it with the one it has saved, updating each according to which one is more recent. The feature fundamentally depends on the fact that the client won't drop the fields it doesn't know about, because that would be data loss for some other, more recent client that knows these fields and synced them to the server: if the older client dropped these unknown fields, it would be equivalent to syncing an empty value for the field.

Original designers of proto3 (the most recent protocol buffers definition language and semantics) actually decided to drop the unknown field preservation, for simplicity reasons. This made Googlers so unhappy that an internal doc was created listing many internal use cases for this feature, and after discussion, this was added back to proto3.


It's also useful the other way around - if you add a field which has significance only to the client (and these account for most fields), you don't have to count on the server knowing about this field immediately which simplifies your testing and deployment.


> Maybe there's an amazing type system idea out there that would be even better, but I don't know what it is. Certainly the usual proposals I see seem like steps backwards. I'd love to be proven wrong, but not on the basis of perceived elegance and simplicity, but rather in real-world use.

Care to elaborate on what these usual proposals are and why they're backwards?

Right now I'm thinking of 'row polymorphism', which, to my understanding, is just permissiveness cooked into the type system — you get to specify types that say "I may or may not have other fields, but I definitely have a 'name' and 'email' field" for example.


"Row polymorphism" sounds like exactly what protobuf does, which I think is the right answer.

As I mentioned, I've heard people argue that the client and server should pre-negotiate exactly which version of the protocol they are using, and then that version should have a rigidly-defined schema. It may have been unfair to suggest that this is a common opinion among type theorists -- I don't have that many data points.


Row polymorphism is a more principled approach that makes the type theorists happy.

The point of row polymorphism is not just that you can say "I may or may not have other fields, but I definitely have a 'name' and 'email' field", but that the extra unknown fields have a name that can be used to constrain other types.

For example, you can have this function type:

{name: string, email: string, &rest} -> {name: string, email: string, &rest}

This is different than this function type:

{name: string, email: string, &rest} -> {name: string, email: string}

The first type can act as a forward compatible passthrough because it says that whatever extra fields are in the input are also in the output. The second type promises that its output only has name and email fields.
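A loose TypeScript approximation of those two signatures (TypeScript has no true row variables, but a generic "rest" parameter gets the idea across):

    type Known = { name: string; email: string };

    // First signature: whatever extra fields come in also go out.
    function passthrough<Rest extends object>(msg: Known & Rest): Known & Rest {
      console.log(msg.name, msg.email);   // only inspect the known fields
      return msg;
    }

    // Second signature: the output promises only the known fields.
    function project<Rest extends object>(msg: Known & Rest): Known {
      return { name: msg.name, email: msg.email };
    }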

The same applies to variant types: you can say that a variant has option A, option B, and other options rest:

(A | B | &rest) -> (A | B | &rest)

vs

(A | B | &rest) -> (A | B)

Functions of these types are polymorphic in the schema. For instance, the type (A | B | &rest) -> (A | B) can be instantiated with rest = (C | D) to get (A | B | C | D) -> (A | B).

So row types are fundamentally different in that it's the functions that explicitly deal with multiple possible schemas. At the end of the day, all of the rest variables get instantiated with explicit types to get a concrete instantiation of a function in which all the rest variables have been replaced by concrete types.

A serialisation library could use this to do version negotiation automatically. After it has negotiated a version, the library instantiates the functions so that the rest type variables get instantiated with the actual concrete version of the schema.

Different languages implement row types in different ways. Some compile row types in a way akin to Java's type erasure. They compile a single version of each polymorphic function that can be used regardless of how the rest parameters get instantiated. Some compile row types in a way akin to C#'s generics or C++ templates: they compile a separate version for each instantiation.

The advantage of the latter is that the data representation can be optimised with full knowledge of the concrete schema. If we have a function of type {name: string, email: string, &rest} -> {name: string, email: string, &rest} instantiated with rest = {age: int} then that compiles to a version of type {name: string, email: string, age: int} -> {name: string, email: string, age: int}. This compiles to faster code because the compiler statically knows the size of the thing.

In a client-server situation you wouldn't know the schema of rest until run time, so you'd either need a JIT compiler that can compile new versions at run time, or you'd have to specify a fixed number of options for rest at compile time. To update a client-server application you'd need to recompile both the client and server with support for the new version. That's not nice, but it doesn't have a chicken-and-egg problem, because both the client and server still support the old version too.

TL;DR: with row types schemas are always rigidly defined, it's the functions that can handle multiple schemas.


Out of the gate, identifying the OP as a theorist, isn't that an ad hominem attack? Why trust anything he says, he's a dirty theorist!


> In every single case where you might want a repeated oneof, you always want to wrap it in a message (product type), and then repeat that. That's exactly what you can do with the existing design.

But I could use the same argument for repeated anything. Why even allow repeated primitives at all, if your argument is convincing?


Repeated primitives were in the first version, before anyone knew better. I'd argue against them in new code.

In Cap'n Proto, I made it possible to upgrade a repeated primitive to a repeated struct whose first field is the primitive type, to avoid this trap.


There is a protobuffer best practice that suggests that if you think you might need a repeated message in the future, you should conservatively use a repeated message instead of a repeated primitive.


Maybe there's an amazing type system idea out there that would be even better, but I don't know what it is.

Required and optional is just an encoding of nullability in the type system. This is a common feature in most modern languages (Go excepted). Clearly Google got a very long way with proto1, whose designers felt strongly enough this was important to put it into what is otherwise a very feature-lite system.

The crux of your argument is not fundamental to serialisation or messaging systems but rather, is specific to Google's environment, business and early design choices. It is wrong for virtually all users of serialisation mechanisms:

Real-world practice has also shown that quite often, fields that originally seemed to be "required" turn out to be optional over time, hence the "required considered harmful" manifesto

There are many, many examples where a field does start required and stay required for the lifetime of a system. But even if over time a required field stops being used and you'd like to clean up the data structures or protocols by removing it, what proto3 does is completely wrong for any normal business.

Obviously, you can always remove a required field by changing it to optional and updating the software that uses that field to handle the case where it's missing (perhaps by doing nothing). The "required considered harmful" movement that took hold at Google whilst I was there directly contradicts all modern statically typed language design, except for Go, which also originated at Google and to put it bluntly, almost revels in its reputation for having ignored any PL thinking from the past 20 years. But if you look at - say - Kotlin, which the Android team have adopted as their new practical working language for practical working programmers, it has required/optional as a deeply integrated feature of the language. And this is not controversial. I do not see long running flamewars saying that Maybe types or Kotlin-style integrated null checking is a bad idea. Basically everyone who tries it says, this is great!

So why did Google have this feature and remove it?

The stated rationale of the "optional everywhere" movement was something like this: if we make a required field optional and stop setting it, some server, somewhere, might barf and cause a production outage. And nothing is worse than an outage, therefore, it is better to process data that's wrong (some default zero value casted to whatever that field is meant to mean), than crash.

This set of priorities is absurd for basically any company that isn't Google or in a closely related business, i.e. consumer services in which huge sums of money can be made by serving data even if it's "wrong" (or in which correctness is unknowable, like a search result page). But for most firms correctness does matter. They cannot afford to corrupt data by interpreting a version skew bug as whatever zero might have meant.

Arguably even Google can't afford to do this, which is why Jeff Dean made "required" a feature of the proto1 language.

There's a simple and very bad reason for why protocol buffers were changed to make all fields optional - the protocol buffer wire format was set very, very early in the company's lifetime, when there were only a few servers and Google was very much in startup mode, writing everything from scratch in C++ ... and without using any other frameworks. If memory serves, they're called "protocol buffers" because they started as a utility class on top of the indexserver protocol. The class provided basic PushTaggedInt() type methods which wrote a buffer representing a request. Over time the IDL and compiler were added on top. But ultimately the wire format has never changed, except for re-interpreting some fields from being "byte array of unknown encoding" to being UTF-8 encoded strings (and even that subtlety caused data corruption in prod).

Look at this wire format! https://developers.google.com/protocol-buffers/docs/encoding

There is no metadata anywhere. A protobuf with a single tagged int32 field encodes to just three bytes. There's no version header. There's no embedded schema. There's nothing that can be used to even mark a message as using an upgraded version of the format. There is however a rather complex varint encoding.
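To give a sense of just how bare it is, here's a sketch of encoding a single varint field per that doc (TypeScript; the function names are mine, not a real library API):

    // Varint: 7 bits per byte, least-significant group first, high bit set
    // on every byte except the last. Non-negative values only; real proto
    // int32 negatives sign-extend to 64 bits.
    function encodeVarint(value: number): number[] {
      const out: number[] = [];
      let v = value >>> 0;
      while (v >= 0x80) {
        out.push((v & 0x7f) | 0x80);
        v >>>= 7;
      }
      out.push(v);
      return out;
    }

    // A field is just a varint key (fieldNumber << 3 | wireType) followed
    // by its payload; wire type 0 is varint. No header, no schema, nothing.
    function encodeInt32Field(fieldNumber: number, value: number): number[] {
      return [...encodeVarint((fieldNumber << 3) | 0), ...encodeVarint(value)];
    }

    // encodeInt32Field(1, 150) -> [0x08, 0x96, 0x01]: three bytes total.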

This makes a ton of sense given it was refactored out of a protocol for inter-server communication in a web search engine (Google search being largely a machine that spends its time decoding varints). But having come into existence this way, protobufs quickly proliferated everywhere including things like log files which may have to be read years later. So almost immediately the format became permanently fixed.

And what can't you do if the format is fixed? You can't add any sort of schema metadata that might help you analyse where data is going or what versions of a data structure are deployed in production.

As a consequence, when I was at Google, nobody had any way to know whether there was something running in production still producing or consuming a protobuf that they were trying to remove a field from. With lack of visibility comes fear of change. Combined with a business that makes bazillions of dollars a minute and in which end users can't complain if the served results are wrong, you get protocol buffers.

This is not fundamental. Instead it reflects an understandable but unfortunate lack of forward planning when protobufs were first created. Other companies are not compelled to repeat Google's mistake.


You have misunderstood the "required considered harmful" argument. It's not fundamentally about the abstract concept of required vs. optional but about the specific implementation in Protocol Buffers, which turns out to have had unintended consequences.

Specifically: As implemented, required field checking occurred every time a message was serialized, not just when it was produced or consumed. Many systems involve middlemen who receive a message and then send it on to another system without looking at the content -- except that it would check that all required fields were sent, because that's baked into the protobuf implementation.

What happened over and over again is that some obscure project would redefine a "required" field to "optional", update both the producer and the consumer of the message, and then push it to production. But, once in production, the middlemen would start rejecting any message where these fields were missing. Often, the middleman servers were operating on "omnibus" messages containing bits of data originating from many different servers and projects -- for example, a set of search results might contain annotations from dozens of Search Quality algorithms. Normally, those annotations are considered non-essential, and Google's systems are carefully architected so that the failure of one backend doesn't lead to overall failure of the system. However, when one of these non-essential backends sent a message missing its newly optional fields, the entire omnibus message would be rejected, leading to a total production outage. This problem repeatedly affected the search engine, Gmail, and many other projects.

The fundamental lesson here is: A piece of data should be validated by the consumer, but should not be validated by pass-through middlemen. However, because required fields are baked into the protobuf implementation, it was unclear how they could follow this principle. Hence, the solution: Don't use required fields. Validate your data in application code, at consumption time, where you can handle errors gracefully.

Could you design some other version of "required" that doesn't have this particular problem? Probably. But would it actually be valuable? People who don't have a lot of experience here -- including Jeff and Sanjay when they first designed protobufs -- think that the idea of declaring a field "required" is obvious. But the surprising result that could only come from real-world experience is that this kind of validation is an application concern which does not belong in the serialization layer.

> There is no metadata anywhere.

Specifically, you mean there is no header / container around a protobuf. This is one of the best properties of protobufs, because it makes them compose nicely with other systems that already have their own metadata, or where metadata is irrelevant. Adding required metadata wastes bytes and creates ugly redundancy. For example, if you're sending a protobuf in an HTTP response, the obvious place to put metadata is in the headers -- encoding metadata in the protobuf body would be redundant and wasteful.

From what you wrote it sounds like you think that if Protobufs had metadata, it would have been somehow easier to migrate to a new encoding later, and Google would have done it. This is naive. If we wanted to add any kind of metadata, we could have done so at any time by using reserved field numbers. For example, the field number zero has always been reserved. So at any time, we could have said: Protobuf serialization version B is indicated by prefixing the message with 0x00 0x01 -- an otherwise invalid byte sequence to start a protobuf.

The reason no one ever did this is not because the format was impossible to change, but because the benefits of introducing a whole new encoding were never shown to be worth the inevitable cost involved: implementation in many languages and tools, code bloat of supporting two encodings at once (code bloat is a HUGE problem for protobuf!), etc.


> Validate your data in application code, at consumption time, where you can handle errors gracefully.

Honest question: how can I validate data in application code when optional fields decode to a necessarily-valid value by design?

Suppose I'm an application author and I have an integer field called "quantity" which decoded to a 0. How can I tell whether that 0 meant "the quantity was 0 in the database" or "the quantity field was missing" instead?

(One answer is that I should opt into a different default value, like -1, which the application can know indicates failure. If that's what I should always do, then why not help me gracefully recover by requiring that I always specify my fallback value explicitly, rather than silently defaulting to a potentially misinterpretable valid value like `0`?)

I understand that required fields break message buses that only need to decode the envelope, but if I am working on a client/server application where message buses are not involved (as almost all client/server programmers in the world are), I don't follow how "everything is optional, and optional means always succeed with a valid default value" facilitates graceful recovery in the application layer. In order to gracefully recover, the application has to be informed that something went wrong!

It seems to me that this design more directly facilitates bugs in the application layer that are difficult to detect because the information that something unexpected happened during decoding is intentionally discarded by default. It makes the resulting bugs "not the protocol layer's fault" by definition, but that is not a compelling pitch to me as an application author.

What am I missing?


> Suppose I'm an application author and I have an integer field called "quantity" which decoded to a 0. How can I tell whether that 0 meant "the quantity was 0 in the database" or "the quantity field was missing" instead?

First, this is clear on the level of wire encoding: either the field has encoded 0 value, or it is simply missing from encoding.

Second, in proto2, you actually have has_quantity() method on a proto message, which will tell you whether quantity is missing or set to 0.

In proto3, the design decision was that the has_foo() methods are available only on embedded message field, and not available on primitive fields, so you'd have to wrap your int64 in a message wrapper, like e.g. the ones available in google/protobuf/wrappers.proto.

The point here (and a common pattern inside google3) is that in your handling code you simply check the presence of all required fields manually: if (!foo.has_quantity()) { return FailedPreconditionError("missing quantity"); }. It is a bit of a hassle, but the benefit is that you have control over where the bug originates and how it is handled in your application layer, as opposed to silently dropping the whole proto message on the floor.


Gotcha, thank you for the clear explanation!


In proto2, you could use `has_foo()` to check if `foo` is present, even for integer types. You could also specify what the default value should be, so you could specify e.g. a default of -1 or some other invalid value, if zero is valid for your app.

Unfortunately, proto3 removed both of these features (`has_` and non-zero defaults). I personally think that was a mistake. I'm not sure what proto3 considers idiomatic here. Proto3 is after my time.

Cap'n Proto also doesn't support `has_` due to the nature of the encoding, but it does support defaults. So you can set a default of -1 or whatever. Alternatively, you can declare a union like:

    # (Cap'n Proto syntax)
    foo :union {
      unset @0 :Void;
      value @1 :Int32;
    }
This will take an extra 16 bits on the wire to store the tag, but gets the job done. `unset` will be the default state of the union because it has the lowest ordinal number.

I suppose in proto3, you ought to be able to use a `oneof` in a similar way.


Good to know about the 0x00 trick.

My point about metadata is that if protobufs had even a small amount of self-description in them, the middlemen that weren't being updated could all have been found automatically, and the version skew problem would have been much less of an issue. Like how Dapper can follow RPCs around, but for data and running binaries.

Google doesn't do that because, for its specific domain, it needs a very tight encoding, among other reasons (and the legacy issues). It could have fixed the validating-but-not-updated-middleman issue in other ways, but instead it made the schema type system less rigorous rather than more rigorous. That seems like the wrong direction.


I'm using protobufs for an embedded product, and I think they're awesome. Not the solution to everything - certainly - but definitely a highly useful tool for the right circumstances.

In my case, I have a single .proto file that serves as the input to two software systems: an embedded OS running on ESP8266 hardware, and the web interface served up by that OS to edit OS-specific features/variables/etc.

So, I define in one file, my types that I want shared both with the embedded OS, and with the web interface. On compile, both systems get their relevant generated code, and both systems are therefore able to communicate easily with each other. The user can use the embedded OS to change values coded in the protobuf, and the web interface that the system serves up immediately knows about the change. The web interface, by the way, generates the UI based on the protobuf declarations - meaning that adding a new var to the project is a simple matter of editing the .proto file, and doing a full clean build. Very, very nice: one change, multiple systems updated, and it produces an auto-generated UI for the new vars.

This kind of interoperability is very key - I'm sure there are alternative ways of doing this, but I can't think of anything as smooth and easy to integrate as protobufs have been .. maybe if we'd put a Lua VM in the mix somewhere, this'd be better, but .. protobufs are pretty tight.


Mostly lost me at "Make all fields in a message required. This makes messages product types."

Your constraints on protocols can change over time and context, changing something that is required to something that is optional can cause crashes in prod. Unless somehow the 'optional' design has a way to handle that? You also will have to do custom validation anyways, as your storage format will never be able to enforce all constraints you care about.

Also, it would seem the debate over whether 'required' is harmful has been resolved, as proto3 makes everything optional.


If you would continue reading, there is an explanation of how optional would work almost immediately after this quote.


Right - this is the same as having the possibility of both required and optional. I'm saying there should not be any possibility of a 'required' field


In the context of the current protobuf design, required was a mistake and optional is clearly better, but the author is arguing for a grander type system built on different principles, including stronger validation.

Criticizing protobufs is like criticizing C++ or Java. Both have major shortcomings, but they solve practical problems and are the de facto lingua franca, with no practical replacements.


From a practical standpoint the problem is that "required" handles a trivial subset of message validation.

I mean, I'm not going to claim that it never happens that your only constraint on a valid field value is "present", but quite often you want to be able to require that one of three fields is set, or that a number be between 0 and 1048576, or that a field be equal to an existing user ID, or that a string contain at least one printable non-punctuation, non-space letter.

So no matter your RPC message parsing code, you're going to need a custom bit of code in each of your handlers to say "This isn't a valid message, fuck off". Enforcing "this field should be of this primitive type" saves a lot of time, but it turns out that "this field should exist" doesn't save that much because you still have to write "... and have a sensible value" in your own code.

So you have a language feature which causes problems sometimes and doesn't really help much.
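For a concrete flavor of that per-handler code (TypeScript-flavored, hypothetical message):

    interface Req {
      count?: number;
      path?: string;
      a?: string; b?: string; c?: string;
    }

    // "required" would only cover the first check; the rest is on you anyway.
    function validate(r: Req): string | null {
      if (r.count === undefined) return "count is missing";
      if (r.count < 0 || r.count >= 1048576) return "count out of range";
      if (r.a === undefined && r.b === undefined && r.c === undefined)
        return "need at least one of a/b/c";
      if (r.path !== undefined && !/\p{L}/u.test(r.path))
        return "path needs at least one letter";
      return null;   // valid as far as the handler can tell
    }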


> it turns out that "this field should exist" doesn't save that much because you still have to write "... and have a sensible value" in your own code.

Disagree. Dozens of "x should exist" checks are tedious to write and even more tedious to read, obscuring the more relevant business-specific validation logic. Better to move the low-hanging fruit into the message format.


Like, 5 lines later:

> For example, we can rebuild optional fields:


How is this different from having both required and optional?


It's explained. Later in the article.


There are enough voices on the net these days that we don't have to put up with this kind of attitude anymore. This sort of stuff was "just the way things were" when we were teenagers visiting forums run by teenagers. I don't know if the guy is right or wrong; I've just got better things to do than sift through all the vitriol to get to whatever point he is trying to make.


I think a lot of people are still attracted to the "hostile genius" persona, or whatever. Maybe it justifies their own poor behavior.


These objections are interesting. I have mixed opinions on protos, I think overall I'm mostly in favor. I'm a bit confused by this set of objections though.

While oneof fields cannot be repeated, oneof fields can be arbitrary protos, so they can contain repeated fields. In other words, you can have a (pseudo-proto)

    oneof {
      RFoo {
        repeated Foo;
      }
      RBar {
        repeated Bar;
      }
    }
so in practice this isn't a restriction. If anything, it's a (very minor) API wart.

As for maps, I'd say don't use them. I have, I don't think they're very useful, except for prototyping things. `repeated (string key, string value)` is just as useful for prototyping things, and you should be quick to promote things to fields. Optional fields aren't costly.

I'm also in the minority who thinks that the `has_foo` methods are a smell, and everything should always be set to a default value[0]. If it really matters, one can define a maybe type via a oneof, but that should be the exception, not the norm. Most people disagree with that though, so maybe I'm crazy and you should ignore me.

[0]: But if that's not the case, everything should be optional, nothing required.


While oneof fields cannot be repeated, oneof fields can be arbitrary protos, so they can contain repeated fields. In other words, you can have a oneof, so in practice this isn't a restriction. If anything, it's a (very minor) API wart.

That's right. Moreover, oneof-of-repeated and repeated-of-oneof are two different types: in the first, you have a list of X, or a list of Y, or a list of Z, etc.; in the second, you have a single list where any element can be either X, Y, or Z. Both can be cleanly expressed with the existing API, so I see no reason for any special treatment.


    repeated(oneof(Foo, Bar))
is not the same as

    oneof(repeated(Foo), repeated(Bar))


Correct, both of those are possible to represent as is, though. Each oneof just requires an extra proto to wrap it.

So you have

    repeated OneofWrapper {
      oneof {
        Foo
        Bar
      }
    }
Or what I did in my previous comment.


> By far, the biggest problem with protobuffers is their terrible type-system. Fans of Java should feel right at home with protobuffers ...

Yeah, right, because insulting only Protocol Buffers and Google engineering wouldn't be edgy enough for a blog post... :/


“Make all fields in a message required. This makes messages product types.”

This article was clearly written by amateurs.


I've used protocol buffers on a large project, evolving the messages over time. They are a fine choice.

The tone of this article is a little less than objective, which I feel discredits it a bit. But hey, it made it to the front page of Hacker News, so...


> Make all fields in a message required. This makes messages product types.

Now you lost one of the biggest benefits of using protobufs, the possibility to make things backwards/forwards compatible. When you add a field to the proto, a server using the new version of the proto will fail to deserialize a request from a client using an older version that's missing that field.

The author mentions this later, but doesn't really address it/propose a better solution, as far as I can see.


I cannot really take seriously someone who calls Jeff Dean an amateur...


Once I was faced with a task of serializing object hierarchy to a binary on disk[1], with the requirement of transparent backward compatibility (and desired forward compatibility), and the ability to be accessed from several programming languages in the future.

Protobufs looked promising because of their forward compatibility promise, but it looked like they only work if you write your code from the get-go using them. Rewiring a 15-year-old codebase would be a nightmare, so that was not an option.

Boost::serialization[2] looked promising, because it's non-intrusive -- but even "dumb" forward compatibility (skipping unknown fields) is not supported -- and reading its archives from other languages seems nontrivial.

So I went ahead and rolled out a custom framework with C++ template trickery which was non-intrusive, wrote to SQLite (with a fixed ORM schema), and did the job.

I wondered for a while if I was re-inventing the wheel, but in the end, I am glad I did it:

1. It wasn't that much work - especially compared to rewiring everything for Protobufs

2. It made me appreciate the goals and the API of Boost::serialization better - mine ended up being not too dissimilar. It was a great way to learn that - or any other- serialization API - without even looking at it! Now it's "let's see how they did ___" instead of "what does it do and why is this here".

Reading this article made me glad again: partly, for not depending on Protobuf -- and partly, for it having some of the same design flaws that my framework has (like having difficulty telling a zero-initialized value from a value not present in the data).

But the take-away for me was that sometimes, writing a custom framework is justified. Not all tools are suited for all purposes. Figuring out what one wants, and checking the list against the goals of a project can save a lot of headache.

[1] Binary was a necessity because some of the data is large arrays of full-precision floating point numbers, on the scale of 10-100MB. This was scientific data, no precision loss on saving/loading was acceptable.

[2]https://www.boost.org/doc/libs/1_67_0/libs/serialization/doc...


I think it was Churchill said that "protobufs are the worst form of data serialization, except for all those other forms that have been tried from time to time." I really don't understand why you would ever want to serialize data without a description of what can be in the serialized data blob. And yes, I do care about the difference between a single precision float, a double precision float, or an int64. Thank you, typing system.


The main point that people are missing is that experienced engineers don’t want to work with people who think like the author of this article.

Protocol Buffers are not wrong, they simply have constraints, advantages, and disadvantages.

No language, binary format, text format, etc. is free of disadvantages. All of them have different use cases.

If you are building a system where your data can be described by protobufs, it may be a good choice. If your data structures don't map well onto protobufs and you have to manipulate them heavily, protobufs may be a bad choice.

Instead of pointing out use cases where a different serialization format may be better than protobufs and use cases where protobufs are better, and why, the author is spouting dogma about how protobufs are bad for every use case.

Be wary of working with developers who prefer to argue about why they hate certain technologies instead of providing useful data and ways to solve problems. You don’t always have to solve a problem just because you are aware of it but don’t go shouting from the rooftops that a technology sucks for every use case under the planet when that’s obviously not the case. The author’s opinion is more of: I don’t like protobufs. Not: protobufs are wrong


The author isn’t trying to say either of those things, really.

I think their real point was something like: many companies that have an Enterprise Service Bus architecture for their pile of polyglot microservices, have a dogma for the encoding of the data flowing over the Bus. And Protobufs, though good for some use-cases, are a particularly bad dogma to be stuck with. That is, when they don’t work for a use-case, they really don’t work for that use-case—and an ESB means having to try to shove a wire format into pretty much every use-case. So “when it’s bad, it’s really bad” is a bad property for a format aimed specifically at being an enterprise’s ESB-interchange-format dogma to have.

(Contrast to, say, JSON-RPC, which I’d place on the opposite end of the scale being described here. JSON-RPC is mediocre at best in terms of data fidelity, performance, etc., but it’s inoffensive—when it’s bad, it’s no worse than when it’s good. That makes it a good choice of ESB dogma... even though, in engineering terms, it sucks!)


No, the author is showing off how smart they are, and in an unprofessional and nasty manner to boot.

It's kind of you to gift the article with some genuinely thoughtful conclusions, but you're doing all the work there, not the original.

As the grandparent says, it's easy to poke holes in something, especially when disregarding important requirements that influenced its design. The fact that the OP can't point to any implementation of the "right" way to do things is telling.


I mean, all the arguments to support my conclusions, are their arguments. If someone spends 10 pages giving you facts and anecdotes and other types of data, and then uses them to support a "dumb" conclusion, they've still done useful work. They don't realize what exactly it is their data proved, but they did do the work required to prove something. Those facts and anecdotes support a conclusion, whether or not they feed it to you at the end/in the abstract.

And, that being said, I think they're right about the conclusion, too. A shorter way to say "Protocol Buffers are a bad choice of common ESB-bus format, even though being an ESB-bus format is the primary thing they're for and what everyone tries to use them for" is "Protocol Buffers are a wrong design." If something doesn't work when used to do the thing it's advertised to do, then it's broken, even if it can do something else.

The only difference between "Protobuffers Are Wrong" and my conclusion is that I'm making the implicit context of their argument explicit. Read between the lines of their argument—they are talking about the use of protocol buffers (specifically, gRPC) in an ESB-bus common-format scenario. None of their arguments make sense if they aren't.


How is the author showing off exactly? Because he said “coproduct” and “Prism” (both of which have better-defined meanings than protobufs btw lmao)


The poster might be being kind, but their conclusions are spot on.


What makes you say that Protobufs are a bad thing? I'm finding the opposite - where you end up with everyone's favourite JSON parser: some can't handle UTF-8 here and there, some handle multi-line JSON, others handle comments, etc, etc. It's also certainly slower, and when you process the data back you have to serialize, convert, check, etc.

For me protobuf+grpc gives you enough building blocks not to shoot yourself in the foot - you start from something ready. On top of that you get census metrics, bi-directional communication, and yes there are limitations - it's HTTP2 only (right? correct me if I'm wrong), but maybe that's what you need.

Instead, I see folks going back to socket(), using sockets directly, and then reinventing wheels of wheels of wheels.

Then discovery comes - how do I discover what's out there? I've got used to the nice feature (forgot the internal name) where I can ask the "grpc" server (well, stubby, whatever it was called) to give me its endpoints - and it gives me the endpoints. Then I can talk to it.

Yes, it complicates build systems (need to generate the damn protos), but you can also write custom protobuf generator plugins that generate custom-tailored access, where you get performance gains by sacrificing features you don't need, like project perfetto - https://android.googlesource.com/platform/external/perfetto/...

Here: ProtoZero - https://android.googlesource.com/platform/external/perfetto/...

Possibly old, there might be newer version, but still: https://docs.google.com/document/d/1bxlk9F79JZDk4wRXZQ9WQ5DP...


It doesn't complicate just build systems; because of the weaknesses of its type system and code generation it requires wrappers. Those get desynced and take maintenance.

This is the main point where the whole reason to use Protobuf falls apart - it doesn't work well enough to skip manual parsing.

And if you do manual parsing you might as well go all the way and use JSON with a schema and a reasonable library to handle it. You gain very little convenience-wise and save only a handful of bytes on the wire at best, which typically does not matter for use cases that are not Google-sized.


With certain build systems this is a solved problem, or at least solved well, but it's a buy-in: no great solution for MSBuild (for example), but it works well in Bazel.


>experienced engineers don’t want to work with people who think like the author of this article

- has opinions on interface design

- isn't afraid to be wrong publicly

- is brash on a personal blog

I dunno, this is mostly positive. I'd have to see how well they'd adapt to the much different context and goals of one of our design reviews, but this isn't an immediate red flag. There's a million ways to be bad at a job. I'll give them enough rope to hang themselves in a serious conversation rather than invent the idea that someone would be an unworkable perfectionist professionally just from being a type purist on the internet.


Brash on a personal blog is tricky. In this post's case, I'd worry that the blogger has a hard time separating technical deficiencies from professional incompetence. "Designed by amateurs" is an over-the-top and dubious claim.


As an experienced engineer I can hear the pain behind that dubious claim. No, I wouldn't want to work with someone who behaves like this all the time, but I wouldn't mind someone who once in a while gets fed up and shows some emotion. It's all about how you handle the aftermath, i.e. whether you can apologize to the people you accidentally hurt along the way.


How about don't hurt people to begin with?


The Googlers who designed Protobuffers’ type system were undeniably amateurs in that realm without knowledge of the state-of-the-art.


Their goals were not to create the most academically sound type system. They were focused on engineering challenges. Read the comments here to understand the authors' priorities and obstacles.


I see protobuf as being designed to be serialized fast and with a low footprint by not too complex assembly or C or Go. All the trade-offs are correct with that mindset. Maybe the author is looking at this from an academic type-theory perspective, and that's why he can't see the "why" for each one of the trade-offs.


Nope. There are similar formats designed to be rapidly serialized and deserialized that outperform protobufs on this front.

* cap'n proto https://capnproto.org/

* flatbuffers https://google.github.io/flatbuffers/

* hdf5 (for machine learning/numerical analysis/finance) https://support.hdfgroup.org/HDF5/

You're just not going to beat these formats in serialization speed with anything (just mmap the file and use it; good luck beating that, certainly not with protobuf). Seriously, use hdf5 for machine learning and you can restart experiments and... well, they start. They don't spend the first minute or 2-3 reading their data back in.

Protobufs focus on:

* standard (meaning it's perhaps a bad standard but it's standard) (also meaning of course you don't get to choose formats, or perhaps I should say your company will lose a lot if it lets developers choose alternatives to protobuf anywhere) (this is where the frustration comes from)

* streaming support (meaning you can write out and read in protobufs without having to keep the whole thing in memory)

* extendable (including sort-of kind-of backwards-forward-compatibility. Meaning old code can read in a new version protobuf, change something and write out a proto that still has the fields it didn't understand)

* language agnostic

* composability (meaning cat protobuf1 protobuf2 > resultbuf means resultbuf is deserializable)

* an attempt to efficiently store integers (meaning its integer encoding format is bloody complex, but "space efficient" - except not quite so efficient that it doesn't need compression, at which point, why bother?)

And on these fronts protobuf will beat the above formats (except cap'n proto).

None of the alternatives to protobuf existed when it was invented though. The only alternative in existence was ASN.1. And it beat that, by a LOT.


All of those significantly post-date protocol buffers, though. It would have been great if they had existed before PB, so we could have used them at Google. But, we had to invent something because nothing that existed at the time was suitable, and the migration cost to use something else now would be astonishing. I'll readily admit that a new company would likely be better off standardizing on one of these. However, standardizing on Protocol Buffers is still way better than having a mish mash of different formats, even if all of those formats are individually better.


Would you recommend someone, not Google, to standardize on protobuffers now?


I'd probably pick Cap'n Proto or Flat Buffers if speed were paramount, in a grass is greener sort of way. I haven't used either of those technologies, though, just read about them. I'm also cool with plain JSON, which is beautiful from an ease-of-getting-started and universality perspective. I also think GraphQL is super compelling, and there's something to be said for records-as-in-SQL.

Mainly, I just think that interchange formats should be boring, simple, and ideally not incredibly slow. Protocol buffers at least meets those bars, even if it doesn't meet the consistency bar that the article desires. The whole "stop being a hipster and just use X" meme comes to mind. Probably realistically X is JSON in this day and age.


Absolutely no to JSON, IMO.

The problem with JSON for internal formats is that there’s often only one consumer and producer, so documenting the format rarely happens. Later, when you want to reimplement one side, you learn that there is no “one place” where you parse the JSON, but that you hand bits and pieces of it to completely unrelated areas of code.

Figuring out these ad hoc formats is nigh impossible, so you end up just looking at the data over the wire and writing code that parses what you see. Except then you have broken code that doesn’t parse the less-common variants that have extra fields, don’t have extra fields, sometimes have their fields as stringified ints instead of just binary ints, etc.

Virtually every time I’ve seen JSON used as an internal interchange format, reverse engineering that format a few years later has become a massive, error-prone, tedious, and time-consuming task. Save yourself from this ahead of time and pick formats that require a predefined structure, like protobufs.


This is the classic static vs dynamic typing question. Suffice to say, I don't think there is sufficient science or anecdata on the issue to really give a solid answer.


I am a proponent of both static (Rust) and dynamic (Ruby) typing in programming languages. They both have their place, and if there are bugs you can always fix the code.

The one place where I don’t think dynamic types have their place is in internal interchange formats. These can live forever, and they can change subtly over time. It’s the living forever part that changes the calculus.


To be more precise, this isn't really a knock on JSON so much as the culture surrounding people who typically use JSON.

Presumably you would be okay if the care and rigor were taken to define a schema and its physical representation just happened to be JSON?

How many of those people exist?


This.

Don't forget integer types being implementation-defined. Very easy to shoot yourself in the foot when consumers/producers are not written in the same language, even more so if it's a mix of dynamic and static languages.

Nothing beats a well defined and commented schema.


Protobufs also run anywhere, there are libraries of it ported to almost any machine.

When my team was deciding on a binary serialization and RPC communication format, we needed something that ran on a Cortex M3 with no malloc, and on every major phone platform (including WP at the time) and desktop OS.

It came down to cap'n proto, and Protobufs.

Protobufs runs everywhere. It won.


Am I missing something, or do cap'n proto and flatbuffers not support mapping types?

Protobuf has disadvantages, but mapping types are something I frequently make use of. losing out on them is a big deal


Protobuf did not support maps for a long time. In practice using a list of key/value pairs works fine for most use cases. Sending the message over the wire is obviously always O(n); if you need fast inserts/lookups in code, then it usually makes sense to convert to and from the appropriate in-memory data structure for that purpose.

That said, I think maps are a fine feature for a serialization format to have. But, I haven't gotten around to adding them in Cap'n Proto because I have yet to hit a use case where I wasn't happy with a list of key/value pairs.
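
To make that concrete, here's a minimal Go sketch of the "convert at the boundary" idea. The Entry type stands in for a hypothetical generated key/value message; it is not real protobuf- or capnp-generated code.

    package main

    import "fmt"

    // Entry stands in for a hypothetical generated `repeated Entry` message
    // with key/value fields; it is not real generated code.
    type Entry struct {
        Key   string
        Value string
    }

    // toMap converts the repeated key/value wire form into a native map
    // for fast lookups in application code.
    func toMap(entries []Entry) map[string]string {
        m := make(map[string]string, len(entries))
        for _, e := range entries {
            m[e.Key] = e.Value
        }
        return m
    }

    func main() {
        wire := []Entry{{"host", "example.com"}, {"port", "8080"}}
        m := toMap(wire)
        fmt.Println(m["port"]) // 8080
    }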


For what it's worth, I like them primarily because I don't mind using proto-generated objects as pieces passed around the stack (at least at higher levels of the stack). I lose consistency if I can do that for some objects, but am required to convert objects if they would need a mapping type.

It follows sorta the same reason that Clojure became a fairly widely used lisp – mapping types are extremely common in real world programs. Having them supported first class makes a lot of common tasks quicker to get through.


No, you're right. The canonical way to store a map in capnp is to use a List<Pair<key, value>>. Personally it doesn't bother me that much (mostly from an aesthetic standpoint) because it is not meant as an in-memory, working data structure.

What does bother me is that in the above generic type, "key" and "value" must be pointer types (i.e. non-scalar). The reasons are not unreasonable but it's definitely obnoxious in practice.


HDF5 is quite bad and corruption prone.

Parquet is gaining a lot of traction lately for this use case.


Aren't you doing exactly what you are accusing the author of? One could say - "People are not wrong, they simply have constraints, advantages, and disadvantages". Drawing a conclusion about whether all experienced engineers want to work with the author based on a single article or opinion appears just as myopic.


Protocol buffers are definitely not perfect. As of 2014 in Java the biggest issue was that you could not have different versions of protobuf [easily] running in the same VM due to the fact that protobuf generated version-dependent implementation code for the IDL. So if one library depended on version X of protobufs and another library depended on version Y, you had a problem because Java will only load one protobuf jar file. (Maybe it's been fixed but it was a problem then.)

That said...I used protobufs to encode the log of Tungsten Replicator, a replicator for MySQL. We did not have a single version related error while I worked on it over a period of years. That was essential given the fact that replication logs are an on-disk format that must remain stable across version upgrades. Our logs in aggregate contained billions of transactions--you can't not read an older log file just because you are running a new software version. The claim that protobufs do not handle versioning is nonsense.


Is that a problem with ProtoBufs or Java?


Java only allows one version of any class within a single class loader. That's just how it works. So the problem was at the very least a limitation of the protobufs Java implementation.

Speaking of which, the generated code was incredibly convoluted with what seemed like tangled dependencies on the underlying protobufs library. I used to dread debugging it--sometimes I would have to go into it to figure out what the upper layers were doing wrong. Luckily it did not happen very often. And I don't recall ever hitting a bug in protobufs itself for all that the code was virtually unreadable.


I'm an outsider, but maybe it's a problem of their interaction?

Most likely, the vast majority of Protobuf use is with code generated to a single, canonical location (which leads to the problem described above).

But hypothetically, it would be possible for code to generate and refer to two different versions. That doesn't sound impossible. Maybe just improbable given the way things are typically set up.


If we suppose that his conclusion is using boolean logic, then what you're saying is a strawman because of his last claim; namely, protobufs are bad if "[...] && !Google":

> They're clearly written by amateurs, unbelievably ad-hoc, mired in gotchas, tricky to compile, and solve a problem that nobody but Google really has.

This dovetails with other arguments that I've seen recently that are becoming more frequent:

Have we entered a new world where the lessons of companies working at massive scales are not only generally superfluous for smaller scales, but are actively harmful?


> Have we entered a new world where the lessons of companies working at massive scales are not only generally superfluous for smaller scales, but are actively harmful?

I think so, yeah.

Microservices turn out to have a lot of negative consequences, and their positives work best when you have dozens or hundreds of developers. If you've got a handful of developers... not so great.


Your argument really depends on the use case. If you have discrete, well-factored operations microservices can make even small systems easier to deploy and manage. For example, you do the front-end API in Java (easier to build secure, debuggable systems with good RDBMS access) and backend analytic services in Python (easier to scrape data out of XML/JSON). Splitting them up into 2 or more microservices can simplify development, CI/CD, and deployment.

Whenever I see large numbers of microservices anywhere my null hypothesis is that some organizational dysfunction is leading teams to factor applications into unnecessarily small pieces.


No. But you have to examine use cases carefully including the assumptions. I think this has always been the case but people (in my experience at least) get a little dazzled by the massive scale of companies like Facebook and try to apply their solutions to problems for which they are simply not applicable.


He did give a way to solve the problem though -.- The author isn't spouting dogma and in fact uses a lot of factual arguments against the design.


Maybe that's the main point, for you. I don't think I'm missing anything :)

You can take your logic and apply it to anything.

For example killing innocent babies:

"Killing babies has constraints, advantage and disadvantages. No other action in the universe is free from having advantages and disadvantages. They all have different use cases...

I don't get why the author is 'spouting dogma' about how killing innocent babies is bad.

Be wary of people who argue about killing babies etc etc."

This relativism you're expressing can be seen as ridiculous when applied to killing babies - it can actually seem to make sense when you use it to justify your own biases - which is what you're doing. Of course doing that has its advantages and disadvantages, etc etc, no opinion is better or worse, everything is relative :)


Many of the restrictions are there because you need to operate as well as you can at the intersection of all target language type systems.


Agree with other ppl here, an unnecessarily violent rant, and the point the author makes is kind of lost in the title "Protobuffers Are Wrong". What they really mean is "I think Protobuffers type system could be better".

I recently saw a talk by Rich Hickey about Effective Programs [1]. The talk explains why Hickey favors dynamic types, with EDN [2] and transit [3] proposed as an alternative to more statically typed data exchange formats like protobufs and less structured ones like Json (the talk explains the reasons, nearing the end I think).

1: https://www.youtube.com/watch?v=2V1FtfBDsLU 2: https://github.com/edn-format/edn 3: https://github.com/cognitect/transit-format


> Option 1 is clearly the "right" solution, but its untenable with protobuffers. The language isn't powerful enough to encode types that can perform double-duty as both wire and application formats. Which means you'd need to write a completely separate datatype, evolve it synchronously with the protobuffer, and explicitly write serialization code between the two. Seeing as most people seem to use protobuffers in order to not write serialization code, this is obviously never going to happen.

This last is not the case. I've only ever seen it used when people a) already have written serialization code that works with XML/JSON, and b) don't want to write code to serialize to a binary format.

JSON also doesn't let you use objects as object keys, or have any form of polymorphism. It seems like the author wants protobufs to be an RPC system with a modern type system, when in fact it is a data interchange format.


But the author is only talking about data types. I don't see any reason it shouldn't be designed with composition in mind.

The only restriction I can think of that JSON imposes (in terms of composition) is that keys must be strings. And I don't necessarily think JSON is a great serialization format either.


I disagree severely with large swaths of this article. The biggest problem is here:

>Of course, the actual serialization logic is allowed to do something smarter than pushing linked-lists across the network---after all, implementations and semantics don't need to align one-to-one.

I like that I can read a pb dump in hex and have it map quite closely to the protocol definition, rather than being viewed through heaping piles of abstractions. This is only going to make it more confusing and more work to use, with the only gain being to make some functional programmer feel squishy about their beautiful type system.

I largely agree with the "The Lie of Backwards- and Forwards-Compatibility" section. You can feed /dev/urandom into nearly any protobuf definition and it'll happily give you nonsense back, no errors. I understand why this is the case, and I'm not sure it should change, but it's bitten me several times.


I will say one thing. gzipped JSON is damn efficient over the wire. It elegantly solves the problem that field names are repeated in every object. The only problem is that adding compression and a serialization/deserialization step costs CPU time. However if you have this problem then you're often better off using something like cap'n proto or flatbuffers which do not have an extra serdes step. If you consider this then there is only a very very narrow band in which it makes sense to use protobuffers vs JSON (which is a de facto standard) or lesser known solutions that provide superior performance.
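
For a feel of the effect, here's a minimal Go sketch with made-up sample data: the repeated field names compress away almost entirely once the stream is gzipped.

    package main

    import (
        "bytes"
        "compress/gzip"
        "encoding/json"
        "fmt"
    )

    func main() {
        // Made-up sample data: the same field names repeat in every element.
        type Point struct{ ID, X, Y int }
        var pts []Point
        for i := 0; i < 200; i++ {
            pts = append(pts, Point{ID: i, X: i * 3, Y: i * 7})
        }
        raw, _ := json.Marshal(pts)

        // Gzip the JSON; the repeated "ID"/"X"/"Y" keys cost almost nothing.
        var buf bytes.Buffer
        zw := gzip.NewWriter(&buf)
        zw.Write(raw)
        zw.Close()

        fmt.Printf("json: %d bytes, gzipped: %d bytes\n", len(raw), buf.Len())
    }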


The problem with JSON as a wire/interop format is the lack of any sort of schema. Formats like proto are nice because if you can unmarshall the bytes successfully, you can be reasonably sure you've got a valid message and reason about its contents. A deserialized JSON object can literally be or contain anything.


Proto doesn't really have a schema. You can encode a proto of type X and tell your program to decode it as type Y, and it might work. Might not, but might. You can easily imagine messages that are isomorphic on the wire. E.g.

    message X { optional string foo = 1; }
    message Y { optional X foo = 1; }

Encoded buffers of X will decode as Y just fine.


Sure, but it has more of a schema than JSON does :-) That's why I said "you can be REASONABLY sure you've got a valid message" (if it decodes successfully). It might not be the actual message you THINK it is, but at least all fields are going to be there.

Then again, if you're accepting arbitrary bytes off the wire and blindly assuming they're going to be the correct proto as long as they decode, you're... most stubby services I worked on. :-)


It could even be a gzip bomb


Recent algorithms, like zstd or lz4 can do the same compression ratio as gzip for a fraction of the cost. Hopefully they will be replacing gzip over the coming decade.


I like protobufs. I've only ever used them as a serialization format for save and map files for some C# games that I tinker away at, but I've been pretty happy using protobuf-net[1]. File size is smaller than JSON produced by JSON.NET or using the built-in binary serializer, much less XML. And it couldn't be much easier to use.

[1] https://github.com/mgravell/protobuf-net


I’ll give you one example where the protobuf design helped out a lot: GWT Protobuffers. The GWT Compiler has a version of Protobufs that transpiles all of the protos into Overlay types of naked JSON objects.

You get a statically typed Java interface, but underneath it a raw JSON object, and deserializing the proto is basically JSON.parse() + a cast.

e.g. something like this

    MyProto foo = MessageDeserializer.deserialize(serializedProto);
    console.log(foo.getFieldFoo());

compiles to this JS:

    var foo = JSON.parse(serializedProto);
    console.log(foo[1]);


>Fields with scalar types are always present. Even if you don't set them. Did I mention that (at least in proto3) all protobuffers can be zero-initialized with absolutely no data in them?

I don't get this complaint. You have to initialize data with something. proto2 had to "solve" this problem in C/C++/Go by making everything a pointer. How is dealing with null the "sane" case against zero-initialization?


To be fair, in C++ you could use std::optional or something else that's type-safe.

Go has its own philosophy about zero values which is controversial, but at least with pointers you do get to express optional values. And it's not like you can't work around the ergonomic awkwardness that arises:

  type Post struct {
    Title *string
  }

  func (p *Post) GetTitle() (string, bool) {
    if p.Title == nil {
      return "", false
    }
    return *p.Title, true
  }

  func (p *Post) SetTitle(s string) {
    p.Title = &s
  }
Hardly elegant, but this is Go.

Go also runs into the same issue when encoding and decoding JSON. Most languages do distinguish between an empty string and a missing string, but not Go, which makes it hard to validate anything. I wrote a code generator for JSON Schema [1] recently, which applies validations while deserializing, and has to resort to a rather low-tech trick to do so:

  type Post struct {
    Title string `json:"title"`
  }

  func (v *Post) UnmarshalJSON(b []byte) error {
    var raw map[string]interface{}
    if err := json.Unmarshal(b, &raw); err != nil {
      return err
    }
    if _, ok := raw["title"]; !ok {
      return errors.New("field title: must be set")
    }
    type plain Post
    var p plain
    if err := json.Unmarshal(b, &p); err != nil {
      return err
    }
    *v = Post(p)
    return nil
  }
(Clearly there are slightly more refined and faster ways to do this, but not that easily without importing third-party code, which I wanted to avoid.)

C and Go are in the minority here. This doesn't come up in languages like Rust, Swift, TypeScript, Nim, Haskell, OCaml, C#, F#, or Java >= 8, all of which have some sort of optional support (algebraic data types or built-in). For example, TypeScript:

  interface Post {
    title?: string
  }
Or Rust:

  struct Post<'a> {
    title: Option<&'a str>
  }
[1] https://github.com/atombender/go-jsonschema


so, just for fun, here's a way to do that check without double unmarshalling and allocating maps and such for everything under your struct (also does a check for extra fields, but you could pull that out if you want):

    type Post struct {
    	Title string `json:"title"`
    }

    func (v *Post) UnmarshalJSON(b []byte) error {
    	dec := json.NewDecoder(bytes.NewReader(b))
    	required := map[string]struct{}{
    		"title": struct{}{},
    	}
    	tok, err := dec.Token()
    	if err != nil {
    		return err
    	}
    	if d, ok := tok.(json.Delim); !ok || d != '{' {
    		return errors.New("Expected object")
    	}
    	for {
    		tok, err := dec.Token()
    		if err != nil {
    			return err
    		}
    		if d, ok := tok.(json.Delim); ok && d == '}' {
    			break
    		}
    		switch tok {
    		case "title":
    			delete(required, "title")
    			err := dec.Decode(&v.Title)
    			if err != nil {
    				return err
    			}
    		default:
    			(*v) = Post{}
    			return errors.New(fmt.Sprintf("Unexpected field %s", tok))
    		}
    	}
    
    	if len(required) > 0 {
    		(*v) = Post{}
    		return errors.New(fmt.Sprintf("Missing %v required fields", len(required)))
    	}
    	return nil
    }


Thanks, that's neat! The challenge here is that I also need to validate values (e.g. support minimum/maximum), not just within structs, but also standalone values, which means an UnmarshalJSON directly on the type. I might end up doing something like your example, though. (Rejecting extra fields is on my list!)


the json.NewDecoder stuff does still trigger unmarshals of the types in the struct and such: i.e.

    package main
    
    import (
    	"bytes"
    	"encoding/json"
    	"errors"
    	"fmt"
    )
    
    func main() {
    	fmt.Println("Hello, playground")
    	var p Post
    	err := json.Unmarshal([]byte(`{"title": "test"}`), &p)
    	fmt.Println(err, p)
    	err = json.Unmarshal([]byte(`{"title2": "test"}`), &p)
    	fmt.Println(err, p)
    	err = json.Unmarshal([]byte(`{}`), &p)
    	fmt.Println(err, p)
    	err = json.Unmarshal([]byte(`{"title": "long test"}`), &p)
    	fmt.Println(err, p)
    }
    
    type Post struct {
    	Title ShortTitle `json:"title"`
    }
    
    func (v *Post) UnmarshalJSON(b []byte) error {
    	dec := json.NewDecoder(bytes.NewReader(b))
    	required := map[string]struct{}{
    		"title": struct{}{},
    	}
    	tok, err := dec.Token()
    	if err != nil {
    		return err
    	}
    	if d, ok := tok.(json.Delim); !ok || d != '{' {
    		return errors.New("Expected object")
    	}
    	for {
    		tok, err := dec.Token()
    		if err != nil {
    			return err
    		}
    		if d, ok := tok.(json.Delim); ok && d == '}' {
    			break
    		}
    		switch tok {
    		case "title":
    			delete(required, "title")
    			err := dec.Decode(&v.Title)
    			if err != nil {
    				return err
    			}
    		default:
    			(*v) = Post{}
    			return errors.New(fmt.Sprintf("Unexpected field %s", tok))
    		}
    	}

    	if len(required) > 0 {
    		(*v) = Post{}
    		return errors.New(fmt.Sprintf("Missing %v required fields", len(required)))
    	}
    	return nil
    }
    
    type ShortTitle string
    
    func (st *ShortTitle) UnmarshalJSON(b []byte) error {
    	var tmpst string
    	err := json.Unmarshal(b, &tmpst)
    	if err != nil {
    		return err
    	}
    	if len(tmpst) > 6 {
    		return errors.New("Title too long!")
    	}
    	*st = ShortTitle(tmpst)
    	return nil
    }


Scalar primitive fields in proto1/proto2 C++ generated code are not pointers. They are just class member fields returned by accessors (foo()) and their presence is indicated by a presence vector (has_foo()). At least that's what you get if you use Google's proto compiler.

The class layout for proto3 is the same (surprise!) but the presence vector is ignored.


If protobuffers are wrong, then I don’t want to be right.


Now to the real question: what is the best alternative?


One of the most surprising features of Proto mentioned in this thread (at least for me/my background) is the desire to forward on a message you receive. Possibly a message you can't read/parse, or don't want to. And you want to forward that whole payload to the next person.

What kind of systems have this architecture? Why would you forward on a message and not know its contents 100%?


> What kind of systems have this architecture?

The Web, for example.

> Why would you forward on a message and not know its contents 100%?

In the web architecture you can observe the following cases:

* proxying, which itself is an umbrella term for a lot of different things, such as:

  * request routing

  * filtering

  * access control

  * logging

* caching

* load balancing

For example, a very typical scenario for load balancing is "sticky sessions". Basically, a reverse proxy (e.g. nginx) needs to inspect an HTTP request and analyze certain headers to find the server to forward the request to (the one which already has the user's session). Note that nginx can just pass the headers along without understanding what they mean. And it simply cannot understand the meaning of the query part of the URL or of the message body.
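
A minimal Go sketch of that sticky-session idea (the backend addresses and the header name are made up): the proxy routes on one header it understands and forwards everything else opaquely.

    package main

    import (
        "log"
        "net/http"
        "net/http/httputil"
        "net/url"
    )

    func main() {
        // Hypothetical backends; in reality these would come from service discovery.
        backends := map[string]*httputil.ReverseProxy{
            "a": httputil.NewSingleHostReverseProxy(&url.URL{Scheme: "http", Host: "10.0.0.1:8080"}),
            "b": httputil.NewSingleHostReverseProxy(&url.URL{Scheme: "http", Host: "10.0.0.2:8080"}),
        }
        handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // Route on a single session header; the body and all other
            // headers pass through without the proxy understanding them.
            p, ok := backends[r.Header.Get("X-Session-Shard")]
            if !ok {
                p = backends["a"] // default backend for unknown sessions
            }
            p.ServeHTTP(w, r)
        })
        log.Fatal(http.ListenAndServe(":8000", handler))
    }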

So, apparently, protobuf gives you a lot of HTTP's flexibility in a binary RPC format: e.g. you can build a cache server that caches responses without understanding them, and evolve your backend without updating the cache server each time you add a new field to a response.

What I don't understand, however, is why they couldn't control that behavior using compiler options. Sometimes you want to be lax with validation, sometimes you don't. Why not give the programmer a choice...


Protobuffers always seemed like an interesting approach but every time I've tried to use in a prototype I've ended up deciding against the added overhead of including them. And then I use JSON because I simply wanted versioned serialisation when messaging or persisting data.


Versioning was the biggest disappointment for me. I just want a middleware layer that can handle clients of different versions reliably. Surely everyone has the same problem.


>Surely everyone has the same problem.

Yes, we do.

But the appropriate answer in most cases is "it's up to the business logic". Trying to solve it at the transport layer is futile.


Avro has a solution to this that I always thought was quite elegant. Schemas are external (serialized data isn't self-describing) but the library exploits this to make it easy to read and write old versions.

To quote the documentation:

The library provides "a resolving decoder, which accepts calls for according to a reader's schema but decodes data corresponding to a different (writer's) schema doing schema resolution according to resolution rules in the Avro specification."

Those resolution rules are defined here: https://avro.apache.org/docs/current/spec.html#Schema+Resolu...

The downside, of course, is performance.


> Surely everyone has the same problem.

it doesn't mean that there is a single solution that will satisfy everyone


I never understood the debate over technology that just works

You may not prefer it, but it works. Do something better, and make people use it.


The disagreement is over what "just works" means. Obviously the author thinks Protobuf isn't suitable for some use cases. They don't work, in his view.


Wow, some minor complaints about language formality issues, and then the thing gets discredited as a whole.

But in the end, the author probably does not understand that PB solves the inter-language data sharing problem, which makes all of his complaints secondary and inconsequential.

It's like complaining that a car that drives fast and has great gas efficiency cannot talk. (Well, in 2018, it might be less of a stretch to demand a car that can talk.)


> But in the end, the author probably does not understand that PB solves the inter-language data sharing problem, which makes all of his complaints secondary and inconsequential.

Huh? It's a serialization format. This problem has been solved many times in many different ways.

The author is pointing out that Protobufs were designed poorly, and they didn't have to be.


> This problem has been solved many times in many different ways.

Mind providing examples of such tools?


Plain old JSON, with types defined.


Way too slow even at moderate scale.


There's thrift, msgpack, Avro, ASN.1, JSON, XML, BSON, S-Expressions, EDN, etc.


What was I reading? It is easy to badmouth someone without knowing the context. It is not uncommon to end up with a flawed design if you find yourself in circumstances like a short deadline, uncertain requirements, lack of resources, a small user base, no QA, and so on - things that are outside of the developer's means to change.


I think attacking a widely used open source piece of software will always be less work than quietly trying to help improve it behind the scenes without drawing attention to yourself.


Yet protobuf is probably the most compact, efficient and performant serialization method, especially when saving bandwidth is important. I experimented with protobuf, flatbuffers and messagepack and always found protobuf messages the most compact by a noticeable margin.


That's the core of the author's argument. Protobuffers optimize for something besides usability and maintainability, because Google cares more about incremental performance than developer-friendliness. Which is a fine thing to care about at Google's scale, but maybe others' calculations should be different.


That doesn't seem like the author's main argument - they say: "Protobuffers were obviously built by amateurs because they offer bad solutions to widely-known and already-solved problems."


That's a fine argument. But then the author decides to call the people who wrote protobuf idiots and amateurs.


Note that you could succinctly put in one sentence something that the author took a whole page to say. I feel there is a quip about compression to be made here, but I'll leave that as an exercise to the reader.


That really depends on the nature of your data. The space saving in protobufs really comes from its variable-length varint/zig-zag encoding of numeric fields (including the lengths of arrays and strings).
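
Roughly how that encoding behaves, as a minimal Go sketch (using the standard library's varint helper, not the actual protobuf implementation): small magnitudes, positive or negative, take few bytes.

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    // zigzag maps signed ints to unsigned so small negatives stay small:
    // 0->0, -1->1, 1->2, -2->3, ...
    func zigzag(n int64) uint64 {
        return uint64((n << 1) ^ (n >> 63))
    }

    func main() {
        buf := make([]byte, binary.MaxVarintLen64)
        for _, n := range []int64{1, -1, 150, -300, 1 << 40} {
            size := binary.PutUvarint(buf, zigzag(n))
            fmt.Printf("%14d -> %d byte(s): % x\n", n, size, buf[:size])
        }
    }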

Tbh, i'm surprised msgpack didn't produce smaller outputs for you, because it really should (at the cost of being slower)


No way that msgpack can produce more compact messages than protobuf; it is schemaless and stores the element names inside the message. Here is a simple benchmark I have of an array of objects where each is {id: int, x: int, y: int, rot: float}. An array of 20 such objects with random values takes 804 bytes in JSON, 538 bytes in mpack, and only 304 bytes in protobuf3. Of course this changes if you put in other values, but as you can see, for this case protobuf is almost 60% of the msgpack message size, and it's around the same ratio even if you make the array length 200 instead of 20.


Yep, you're right. I forgot about the keys.


Protobuf is not size efficient. The half self-describing nature and built-in backwards/forwards compatibility features sacrifice a lot of bits in efficiency. I'm not saying that this is a bad thing, just that choosing protobuf if bandwidth efficiency is your main goal is probably not a good idea.


See my other reply below in this comment tree; I also tried flatbuffers and it was extremely inefficient compared to protobuf, but I don't have numbers for it currently.


"msgpack and flatbuffers are worse" is faint praise.


It definitely is not.

ASN.1 PER encodes data on a bit level (that is, it represents integers using the minimal number of bits they fit in) and doesn't waste space on tags if not necessary. It is used in all kinds of telecom protocols, particularly mobile protocols such as LTE. Obviously, it needs to be both space-efficient and efficient in processing.

You can find more about ASN.1 uses here: https://www.marben-products.com/


Good points, but can someone articulate what the best alternative to protobuffers would be in 2018, you know with 'hindsight' etc.?


I've heard good things about https://capnproto.org/ but the author of this piece may strongly disagree, since I think their objections are rather more fundamental, and also because Cap'n Proto went with banning the concept of "required" fields.

I'd also be interested in what alternatives the author might suggest.


You should definitely also look at Cap'n Proto – it is in many ways a theoretically more elegant design: far less overhead, and the "time-traveling" RPC design is quite clever. Another potentially important advantage, especially for crypto, could be that Cap'n Proto in theory has a canonical representation, but in practice this was neither fully defined nor usable last I looked.

Cap'n Proto also suffers from a number of practical disadvantages, which can be quite severe depending on your use case. The main problem is ecosystem maturity and rate of adoption. So whilst protobufs have excellent cross-language support and a mature RPC system, in cap'n proto everything but C++ feels a bit second class, even other popular languages. Python and Rust, and I believe Java, have workable but not necessarily great bindings for serialization and in theory RPC, but the RPC support for languages other than C++ didn't look that great last I looked (which was over a year ago), and even with C++ it looks much easier to get something going with gRPC (tooling, http2 based etc). For other languages I looked at (e.g. Ocaml, Lua and Haskell) the bindings looked very immature. There are a few other minor annoyances with capnproto, such as the lack of a timestamp type (you can build your own, incompatible with everyone else's, but this is still an annoying omission). The C++ library is also saddled with the weird home-grown kj library that no one other than Kenton Varda uses. However, it has to be said that the C++ API and also the Python API for Capnproto feel a lot more natural and idiomatic than the protobuf ones. If protobuf is too high overhead for your use case and you don't need the wide language and ecosystem support, Cap'n Proto is worth a serious look.


Capnproto is built by the guy who originally built protobufs, and he cites capnproto as being better because he was able to learn from his mistakes. Makes me optimistic.


Minor clarification: the author of Capnproto worked on protobufs at Google but was not their original creator.


Minor clarification of the clarification: the author of Cap'n'proto was the original creator of Proto2, which was a ground-up (but binary compatible) rewrite of Proto1. The original authors of Proto1 were Jeff Dean & Sanjay Ghemawat. The current version of protobufs is Proto3, which AIUI is maintained by a team at Google (it was released after I left), and is an evolution of the Proto2 codebase with many new features.


Ah, my mistake. Thanks guys. :+1:


Transit: https://github.com/cognitect/transit-format

Transit has the following features:

- self-describing

- allows extension types on top of base types

- strong value types + dynamic serialization

- can transparently pass through data without consumer knowing about schema

- has "repeated key" optimization in the encoder to allow compression in the stream

- allows transparent binary encoding with msgpack

- is stream-based, so processing can begin immediately unlike JSON which has enclosing {}

This set of features basically allows transparent proxying to JSON as well, which is something that proto2/3 cannot do without updating the proxy.


There's Avro: https://avro.apache.org/docs/current/

Not sure if the best, but maybe has different tradeoffs


Depending on your goals, Cap’n Proto is excellent.


Which of the problems listed in the article are solved by it?


Parameterization of types, off the top of my head.


Also, cap'n proto loses a whole lot of API and semantic weirdness the author complains about: implicit auto-assignment of fields, "repeated" etc. The overall design feels much cleaner, although there are still some surprises.


Furthermore, in the case where I'm using protobufs as a necessary evil of grpc, what's the best alternative here?


protos are not necessary for grpc. You can use whatever payload format you want to use.


JSON/custom binary format + HTTP2?


HTTP2 is just the transport, there would be a ton of machinery to implement in each language in the stack.


How about json + jsonschema + gzipped over the wire


I know a few companies who use flatbuffers instead.


JSON or graphql. Personally I think the typing, API compatibility and extensive language support make protobuffers currently the winner though, despite the warts


Neither of those are binary formats though, and if I’m picking up a binary format like Protobuffers it’s probably because I want the performance benefits.

Personally, these days I’m taking the approach of just outright avoiding pretty much anything google does. Either because it solves a problem only they have, or they’ll probably just drop it out of the blue one day. Or like Tensorflow, it’s fucking impossible to get your head around because their documentation seems to be written with the goal of being maximally confusing.


I like GraphQL schema, but it would have to have a standardized serialization aspect. Moreover, it really is designed for querying and has a bunch of stuff totally irrelevant to data.

But I agree ... I almost wish they'd finish it off and go for full on serialization.

That said, it may not be all that hard - a serialization spec can be pretty short if need be. Certainly one could 'roll one's own' for a mid-sized project.


How would graphql help reduce the latency of API calls between services?


I'm no GraphQL expert/fan but one example is requesting only the data you need -- you can cut down on what you have to send over the wire which is at least guaranteed to help some.

Also, the nested query structure can save what would have normally been multiple requests with a by-the-book RESTful HATEOAS setup.


Apache Thrift is another one.


Thrift suffers from pretty much all the same problems as protobufs and has many similar dysfunctional failure modes, like places that rely on serializing into Thrift structs stored in HDFS and treating that like a de facto database, with the Thrift struct definitions as the schema. It is so miserable to work in code bases like that.


This is part of the original lambda architecture. What would you recommend the schema be represented in instead? Or are you preferring something like NewSQL?


I can’t tell if you mean a schema for the generated RPC or just for data storage. I’m only talking about data storage. But for RPCs, I just think don’t autogenerate them. Just expose an RPC API and let other people write code to consume it in whatever language’s packages for web requests, and don’t ever transmit things that are expected to automatically be treated as any type of object. Just send JSON or an equivalent thing in a more optimized buffer that is never allowed to be anything but a key-value store for primitive types.

On the data side, use relational database systems. Yes, even for huge modern webscale event data for apps or services with hundreds of millions of users. Don’t ever use hadoop, period.

If you’re bigger than hundreds of millions of users or otherwise are generating hundreds of billions of records per day or more, that’s the size when you might _start_ considering something different than sharded and distributed standard RDBMSs, but you are likely big enough at that point that you need an in-house, highly customized version of a distributed file store that matches your usage patterns and cost optimizations in a way that a one-size-fits-all solution cannot, and so again you should never be using hadoop.

As a result, if you find yourself relying on Thrift data serialized on a distributed filesystem as if those “flat files” are a database and map-reduce is your de facto table scanner, it’s a red-alarm-level bad code smell that you are growing your data scale in a horribly broken way that’s going to cause huge problems the minute you have products which need a data model or some flexibility that cannot be supported when you’ve welded data infrastructure to Thrift objects.


Easiest way to send data over the wire.

   struct mytype s;
   s.field1 = something;
   s.field2 = something else;
   send(socket, &s, sizeof(s), 0)
or, using a language with a good type system like Haskell

   data MyData = MyData Int Int deriving Generic
   instance Storable MyData

   alloca $ \buf -> do
     let d = MyData field1 field2
     poke buf d
     send socket buf (sizeof d)


This may be easy, but it’s wrong. Endianness issues are just the start. Information leaks due to padding are a big deal. And it straight up doesn’t work if nontrivial data structures are involved.


Endianness issues are something every programmer should be aware of when sending data over the wire. I'm sorry I didn't insert the htons, htonls in my C code.

In the Haskell code, endianness is handled by your Storable implementation (you define it after all, and can customize it however you want).

I agree my example is somewhat tongue in cheek. The point though is that most languages have standard libraries for dealing with binary data that are almost universally more straightforward than protobufs.


htons, etc are very 1980s, and I’ll make a fairly strong claim: they should never be used in new code, with a single exception. The reason is that an int with network endianness simply should not exist. In other words, when someone sends you a four byte big-endian integer, they sent four bytes, not an int. You can turn it into an int by shifting each byte by the relevant amount and oring them together. And a modern compiler will generate good code.

The sole exception is legacy APIs like inet_aton() that actually require these nonsensical conversions.


> You can turn it into an int by shifting each byte by the relevant amount and oring them together.

Yeah! And you can even write a function to do that for you! Maybe call it "ntohl".


You could also use the functions with explicit bit widths, like bswap64 and bswap32.


No, you should not. By the time you have a wrong-endian uint64_t or whatever, you’ve already done it wrong.


After re-reading your comment above, I'm actually confused. You think you should never store a big-endian int? That is ridiculous. Some architectures are big-endian. You should not be using custom bitswapping as part of application code, because you cannot know the endianness of your architecture.

The ntoh* functions are the right approach, and your claim is not only strong, it's wrong. The ntoh* functions exist to transform network byte-order to host byte-order. Depending on your architecture endianness, their functionality will change.


Let me try saying it differently. The following code is poorly written:

    char *buf = ...;
    uint32_t word = *(int32_t *)buf;
    uint32_t host_word = ntohl(word);
Because you just type-punned the read from buf. (In fact, this code is UB.) You could write it a little better like:

    char *buf = ...;
    uint32_t word;
    memcpy(&word, buf, 4);
    uint32_t host_word = ntohl(word);
Although IIRC there is or at least was still some disagreement as to whether this might be UB. You could use a union to make it definitely not UB.

But none of these variants are sensible, and, in fact, they don't even translate to most safer languages than C. The correct way to write this code is:

    char *buf = ...;
    uint32_t host_word = ((uint32_t)buf[0] << 24) |
        ((uint32_t)buf[1] << 16) | ((uint32_t)buf[2] << 8) |
        (uint32_t)buf[3];
On any recent compiler, this will generate as good or better code, and it doesn't make pointless assumptions about the representation of uint32_t on the platform you're using.

So I stand by my claim: well-written modern C code should not contain any "network-order" values. They should contain bytes, vectors of bytes, and numbers.
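
For what it's worth, the same idea in a language with a byte-order helper in the standard library looks like this (a minimal Go sketch): treat the wire as bytes and decode them explicitly, rather than reinterpreting memory.

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    func main() {
        buf := []byte{0x00, 0x00, 0x01, 0x2c} // four big-endian bytes off the wire
        hostWord := binary.BigEndian.Uint32(buf)
        fmt.Println(hostWord) // 300
    }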


My C code didn't include mention of ints though, so I'm wondering where you got that from.

Your first example is UB and again, is not something my example depended on.

Your final claims are overly cautious. It is perfectly fine to use uint32_t in this way. uint32_t is defined as a 32-bit unsigned integer. There is a bijection between network-order 32-bit unsigned integers and host-order integers, and ntohl is that bijection. It is no different than storing any other value. It is certainly not wrong.


C doesn't make struct memory layout guarantees, endian guarantees, or size guarantees in some cases, so unless you're running the exact same binary this is a very poor serialization technique.


C does not, but few compilers implement only standard C. Most C compilers document how structs are laid out and offer many options to change that behavior.



