I love flatbuffers but they're only worthwhile in a very small problem space.
If your main concern is "faster than JSON" then you're better off using Protocol Buffers simply because they're way more popular and better supported. FlatBuffers are cool because they let you decode on demand. Say you have an array of 10,000 complex objects. With JSON or Protocol Buffers you're going to need to decode and load into memory all 10,000 before you're able to access the one you want. But with FlatBuffers you can decode item X without touching 99% of the rest of the data. Quicker and much more memory efficient.
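Roughly, the difference looks like this in TypeScript (a sketch only: ItemList and Item are hypothetical classes generated by flatc for a schema like the one sketched below; the flatbuffers npm package's ByteBuffer is the only real import):

    import * as flatbuffers from "flatbuffers";
    // Hypothetical flatc-generated accessor for a root table holding a vector of Item tables.
    import { ItemList } from "./generated/item-list";

    // JSON style: the whole document is materialized before you can index into it.
    function nameFromJson(jsonText: string, index: number): string {
      const everything = JSON.parse(jsonText); // all 10,000 objects decoded here
      return everything.items[index].name;
    }

    // FlatBuffers style: wrap the raw bytes and decode only what you touch.
    function nameFromFlatBuffer(bytes: Uint8Array, index: number): string | null {
      const buf = new flatbuffers.ByteBuffer(bytes); // no up-front parse step
      const list = ItemList.getRootAsItemList(buf);  // just reads the root offset
      return list.items(index)?.name() ?? null;      // only item `index` gets decoded
    }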
But it's not simple to implement. You have to write a schema then turn that schema into source files in your target language. There's an impressive array of target languages but it's a custom executable and that adds complexity to any build. Then the generated API is difficult to use (in JS at least) because of course an array isn't a JavaScript array, it's an object with decoder helpers.
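For the sketch above, the schema and codegen step would look something like this (flatc is the FlatBuffers schema compiler; the table and field names are illustrative):

    // item.fbs, compiled with something like `flatc --ts item.fbs`,
    // which is the extra codegen binary your build has to carry.
    table Item {
      id:   ulong;
      name: string;
    }

    table ItemList {
      items: [Item];
    }

    root_type ItemList;

And the generated vector really is accessed through method pairs along the lines of items(i) / itemsLength() rather than a plain JS array.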
It's also quite easy to trip yourself up in terms of performance by decoding the same data over and over again rather than re-using the first decode like you would with JSON or PB. So you have to think about which decoded items to store in memory, where, for how long, etc... I kind of think of it as the data equivalent of a programming language with manual memory management. Definitely has a place. But the majority of projects are going to be fine with automatic memory management.
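A minimal sketch of that bookkeeping, reusing the hypothetical ItemList accessors from above; the cache is exactly the kind of "who keeps this decoded value, and for how long" decision that JSON never asks of you:

    // Strings pulled out of the buffer are decoded on every getter call,
    // so repeated reads of a hot item are worth caching deliberately.
    const nameCache = new Map<number, string>();

    function itemName(list: ItemList, index: number): string | null {
      const cached = nameCache.get(index);
      if (cached !== undefined) return cached;

      const name = list.items(index)?.name() ?? null; // decodes from the buffer
      if (name !== null) nameCache.set(index, name);  // you decide when to evict
      return name;
    }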
> But it's not simple to implement. You have to write a schema then turn that schema into source files in your target language. There's an impressive array of target languages but it's a custom executable and that adds complexity to any build. Then the generated API is difficult to use
Worth noting that all these things are true for protobuf as well.
Less so. Many languages have a native protobuf implementation written in the language itself, so the build doesn't depend on a separate compiled binary (e.g. pbandk), and the generated code is relatively idiomatic.
Are there protobuf throughput benchmarks somewhere? I haven't been able to verify that they're faster than JSON.
Edit: I was able to find these at https://github.com/hnakamur/protobuf-deb/blob/master/docs/pe... but these numbers don't seem conclusive. Protobuf decode throughput for most schemas tested is much slower than JSON, but protobufs will probably also be a bit smaller. One would have to compare decode throughput for the same documents serialized both ways rather than just looking at a table.
While I haven't benchmarked JSON vs protobuf, I've observed that JSON.stringify() can be shockingly inefficient when you have something like a multi-megabyte binary object that's been serialized to base64 and dropped into an object. As in, multiple hundreds of megabytes of memory needed to run JSON.stringify({"content": <4-megabyte Buffer that's been base64-encoded>}) in Node.
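A rough way to see the shape of the problem in Node (just a sketch, not a rigorous benchmark; numbers vary by Node version and GC timing):

    import { randomBytes } from "node:crypto";

    // ~4 MB of binary becomes ~5.3 MB of base64 once encoded...
    const payload = { content: randomBytes(4 * 1024 * 1024).toString("base64") };

    const before = process.memoryUsage().heapUsed;
    const json = JSON.stringify(payload);   // ...and stringify allocates well beyond that
    const after = process.memoryUsage().heapUsed;

    console.log(`output ${(json.length / 1e6).toFixed(1)}M chars,`,
                `heap grew ~${((after - before) / 1e6).toFixed(1)} MB`);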
JSON is the default serialization format that most JS developers use for most things, not because it's good but because it's simple (or at least seems simple until you start running into trouble) and it's readily available.
Large values are by no means the only footgun in JSON. Another unfortunately-common gotcha is attempting to encode an int64 from a database (often an ID field) as a JSON number rather than a JSON string: JS numbers are IEEE 754 doubles, so anything that doesn't fit in 53 bits silently loses precision.
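A quick illustration:

    // 9007199254740993 is 2^53 + 1: representable as an int64, not as a JS number.
    console.log(JSON.parse('{"id": 9007199254740993}').id);   // 9007199254740992, off by one, no error
    console.log(JSON.parse('{"id": "9007199254740993"}').id); // "9007199254740993", exact, as a string
    console.log(BigInt("9007199254740993"));                  // 9007199254740993n if you need arithmetic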
A more thoughtful serialization format like proto3 binary encoding would avoid both the memory spike issue and the silent loss of numeric precision issue, with the tradeoff that the raw encoded value is not human readable.
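As a sketch (message and field names are illustrative), the same payload in proto3 keeps the bytes binary and the ID a genuine 64-bit integer on the wire:

    syntax = "proto3";

    message Blob {
      int64 id      = 1;  // full 64-bit range preserved on the wire
      bytes content = 2;  // raw bytes, no base64 expansion, no giant intermediate string
    }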
Isn't HTTP POST content similarly encoded? Likewise with small embedded images in CSS, though I am rusty on that topic. Likewise with binary email attachments in SMTP (though this may be uuencoded, same net effect).
The particular example of a trivial message that is mostly-binary just sounds like a useful test case, more than anything else.
Copying into a string is a safe default. Also, proto's API currently returns string references (not views), so making a copy is required in the open-source release.
(Although, now that std::string_view is common, I hear rumors that the proto API might change…)
Maybe that's true, but safety is an additional concern. You have way more lifetime headaches if you alias the underlying data. Copying avoids all that.
simdjson also does this sort of thing. All strings are decoded and copied to an auxiliary buffer. For strings without escape sequences in them or for end users who don't mind destroying the json document by decoding the strings in-place, these copies could be avoided. I may get around to shipping a version of simdjzon (the Zig port) that optionally avoids these copies (this sort of has to be optional because the current API lets you throw away your input buffer after parsing, and this option would mean you cannot do that), but porting this stuff back to C++ and getting it upstreamed sounds more difficult.
If you only want "faster than JSON", then a binary JSON like MessagePack or JSONB (is that the PostgreSQL one?) avoids having to deal with a schema. If you want a schema, you're in different territory from JSON anyway.
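For example, with the @msgpack/msgpack npm package the round trip looks just like JSON, only binary (a minimal sketch):

    import { encode, decode } from "@msgpack/msgpack";

    const original = { id: 42, name: "widget", tags: ["a", "b"] };

    const bytes = encode(original);  // Uint8Array, typically smaller than the JSON text
    const restored = decode(bytes);  // a plain JS value again, like JSON.parse

    console.log(bytes.byteLength, restored);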
MsgPack is great. However, I've been curious about Amazon's Ion, mainly since the namespace stuff lets you define tags up front as ints. MsgPack still transfers field names as strings unless you create your own string-to-int mapping.