I love flatbuffers but they're only worthwhile in a very small problem space.
If your main concern is "faster than JSON" then you're better off using Protocol Buffers simply because they're way more popular and better supported. FlatBuffers are cool because they let you decode on demand. Say you have an array of 10,000 complex objects. With JSON or Protocol Buffers you're going to need to decode and load into memory all 10,000 before you're able to access the one you want. But with FlatBuffers you can decode item X without touching 99% of the rest of the data. Quicker and much more memory efficient.
But it's not simple to implement. You have to write a schema then turn that schema into source files in your target language. There's an impressive array of target languages but it's a custom executable and that adds complexity to any build. Then the generated API is difficult to use (in JS at least) because of course an array isn't a JavaScript array, it's an object with decoder helpers.
It's also quite easy to trip yourself up in terms of performance by decoding the same data over and over again rather than re-using the first decode like you would with JSON or PB. So you have to think about which decoded items to store in memory, where, for how long, etc... I kind of think of it as the data equivalent of a programming language with manual memory management. Definitely has a place. But the majority of projects are going to be fine with automatic memory management.
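To make the random-access point concrete, here's a rough sketch in Python. The schema, module name, and file names are hypothetical; flatc generates the accessors from your schema, and the exact method names depend on your schema and flatc version:

```python
import json
import flatbuffers                      # pip install flatbuffers (needed by the generated code)
from myschema.Root import Root          # hypothetical module generated by `flatc --python`

# JSON: all 10,000 items are parsed into Python objects before you can touch one.
with open("items.json", "rb") as f:
    items = json.load(f)
print(items[9999]["name"])

# FlatBuffers: the buffer is used as-is; indexing item 9,999 just follows offsets,
# and none of the other items are ever decoded.
with open("items.bin", "rb") as f:
    buf = f.read()
root = Root.GetRootAsRoot(buf, 0)       # generated accessor (name depends on the schema)
print(root.Items(9999).Name())          # vector access follows offsets; strings come back as bytes
```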
> But it's not simple to implement. You have to write a schema then turn that schema into source files in your target language. There's an impressive array of target languages but it's a custom executable and that adds complexity to any build. Then the generated API is difficult to use
Worth noting that all these things are true for protobuf as well.
Less so. Many languages have a native implementation of protobuf that uses that language to build rather than a binary (e.g. pbandk), and will generate relatively idiomatic code.
Are there protobuf throughput benchmarks somewhere? I haven't been able to verify that they're faster than JSON.
Edit: I was able to find these at https://github.com/hnakamur/protobuf-deb/blob/master/docs/pe... but these numbers don't seem conclusive. Protobuf decode throughput for most schemas tested is much slower than JSON, but protobufs will probably also be a bit smaller. One would have to compare decode throughput for the same documents serialized both ways rather than just looking at a table.
While I haven't benchmarked JSON vs protobuf, I've observed that JSON.stringify() can be shockingly inefficient when you have something like a multi-megabyte binary object that's been serialized to base64 and dropped into an object. As in, multiple hundreds of megabytes of memory needed to run JSON.stringify({"content": <4-megabyte Buffer that's been base64-encoded>}) in node.
JSON is the default serialization format that most JS developers use for most things, not because it's good but because it's simple (or at least seems simple until you start running into trouble) and it's readily available.
Large values are by no means the only footgun in JSON. Another unfortunately-common gotcha is encoding an int64 from a database (often an ID field) as a JSON number rather than a JSON string: the JS number type is a double, so anything above 2^53 can silently lose precision.
A more thoughtful serialization format like proto3 binary encoding would avoid both the memory spike issue and the silent loss of numeric precision issue, with the tradeoff that the raw encoded value is not human readable.
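A quick sketch of that precision point; the 2^53 boundary is just a fact of IEEE-754 doubles, and the "id" field name is made up:

```python
import json

big_id = 9007199254740993          # 2**53 + 1, fits comfortably in an int64

# Any consumer that maps JSON numbers to doubles (JS, for one) will round it;
# float() shows the same loss on the Python side.
print(float(big_id))               # 9007199254740992.0 -- silently off by one

# Safer: ship the ID as a string and parse it explicitly on the other end.
print(json.dumps({"id": str(big_id)}))   # {"id": "9007199254740993"}
```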
Isn't HTTP POST content similarly encoded? Likewise with small embedded images in CSS, though I am rusty on that topic. Likewise with binary email attachments in SMTP (though this may be uuencoded, same net effect).
The particular example of a trivial message that is mostly-binary just sounds like a useful test case, more than anything else.
Copying into a string is a safe default. Also, proto's API currently returns string references (and not views), so making a copy is required, at least in the open-source implementation.
(Although now std::string_view is common, I hear rumors that the proto API might change…)
Maybe that's true, but safety is an additional concern. You have way more lifetime headaches if you alias the underlying data. Copying avoids all that.
simdjson also does this sort of thing. All strings are decoded and copied to an auxiliary buffer. For strings without escape sequences in them or for end users who don't mind destroying the json document by decoding the strings in-place, these copies could be avoided. I may get around to shipping a version of simdjzon (the Zig port) that optionally avoids these copies (this sort of has to be optional because the current API lets you throw away your input buffer after parsing, and this option would mean you cannot do that), but porting this stuff back to C++ and getting it upstreamed sounds more difficult.
If you only want "faster than JSON" then a binary JSON like MessagePack or JSONB (is that the PostgreSQL one?) avoids dealing with a schema. If you want a schema, then you're playing a different game from JSON anyway.
MsgPack is great. However, I've been curious about Amazon's Ion, mainly since the namespace stuff allows you to define tags up front as ints. MsgPack still transfers field names as strings unless you create your own string-to-int mapping.
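For what it's worth, a quick illustration with the Python msgpack package; the key table is something you'd have to maintain yourself, and the field names here are made up:

```python
import msgpack  # pip install msgpack

reading = {"sensor": "roof-1", "temperature": 21.5}

# Field names ride along as strings in every message...
packed = msgpack.packb(reading)
assert b"temperature" in packed

# ...unless you maintain your own string->int key table, Ion-symbol style.
KEYS = {"sensor": 0, "temperature": 1}
packed_small = msgpack.packb({KEYS[k]: v for k, v in reading.items()})

print(len(packed), len(packed_small))   # the int-keyed form is noticeably smaller
assert msgpack.unpackb(packed) == reading
```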
Similarly does anyone use https://capnproto.org/? It's a project I was really interested in a few years back, but I haven't heard much in the way of it lately.
IMO it's the best available option if you can choose to use it; the foundations are well researched, it's well designed, it has many good features that alternatives lack (cough sum types) and overall is robust and has a good, fast C++ and Rust implementation. From a technical perspective it's by far the best option both in design and implementation, if you ask me.
The problem, of course, is that technically worse solutions win for various reasons. One of those reasons is language support. Even if Protocol Buffers is worse in many metrics, it has a lot of language bindings and tools available.
I find myself primarily working in the Apache Arrow domain these days, so I've done more work with the Arrow Flight RPC than anything else (https://arrow.apache.org/docs/format/Flight.html). I'll make sure to keep an eye on the Cap'n Proto project though. Seems well thought out and very interesting.
Hey that's me. And yeah, if you don't hear much about Cap'n Proto, it's not because it isn't advancing, but more because I don't really have any reason to advertise it. Cap'n Proto's goal right now -- for me at least -- is not to take over the world or anything, but rather to support the needs of my main project, which is Cloudflare Workers. If other people find it useful too, great! But that's not my focus.
The C++ implementation is the reference implementation which is by far the most complete, and is the only one I maintain personally. The other implementations have varying levels of completion and quality, ranging from "pretty solid" (Rust, Go) to "someone's weekend project one time in 2015". I suppose number of dependencies would depend entirely on the particular implementation.
I've played with using it to interop some Python data science code with some C++ code efficiently without writing the project in Cython or using a tool like Pybind11. It worked pretty well in my test scenario, but I'm not sure how great of an idea that truly is.
Is the Cap'n Proto use case similar to something like ZeroMQ or NNG? I'm still not fully sure.
They (and similar technologies) are used where it matters.
Games, data visualization, ... numerically heavy applications mainly.
On a side note: JSON has been somewhat of a curse. The developer ergonomics of it are so good that web devs completely disregard how they should lay out their data. You know, sending a table as a bunch of nested arrays, that sort of thing. Yuck.
In web apps, data is essentially unusable until it has been unmarshalled. Fine for small things, horrible for data-heavy apps, which really so many apps are now.
Sometimes I wonder if it will change. I'm optimistic that the popularity of mem-efficient formats like this will establish a new base paradigm of data transfer, and be adopted broadly on the web.
See if the domain allows for more constrained typing (e.g. ints as actual ints). That will naturally push devs toward binary formats.
Consider a column-oriented layout. This will also help with more advanced compression (e.g. delta encoding). (See the sketch after this list.)
When sticking with JSON, avoid large nested hierarchies that would spam the heap when unmarshalling (i.e. prefer [1,2,4,5] over [[1,2],[4,5]]).
In general, for large payloads, see if you can avoid deserialization of the payload altogether and just scan through it. Oftentimes the program just ends up copying values from one place (the file) to another (a buffer; DOM objects), so there's really no need to allocate the entire dataset on the heap as many individual objects. That's a hit the user will always feel twice: once at parse time (freezing), once at garbage collection (framerate hiccups). You could technically do that with stream parsing of JSON, but then you need special libraries anyway. And once you move to stream-based parsing, you may as well choose a format which has a lot of other advantages as well (e.g. a rich type system, column layouts).
Regarding text:
It matters less (?). Nonetheless, scannable formats are good here too (e.g. that's the whole reason we have line-delimited JSON, to bypass JSON's main limitation).
These are just some general workable ideas. ymmv, ianal, etc..
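As a rough illustration of the flat / column-oriented points above (field names made up):

```python
import json

# Row-oriented and nested: one dict per record when unmarshalled --
# lots of small heap objects for the GC to chew on.
rows = [{"t": 1, "v": 0.5}, {"t": 2, "v": 0.7}, {"t": 3, "v": 0.9}]

# Column-oriented and flat: two arrays total regardless of record count,
# and each column is now friendly to delta encoding / compression.
cols = {"t": [1, 2, 3], "v": [0.5, 0.7, 0.9]}

print(len(json.dumps(rows)), len(json.dumps(cols)))  # the flat form is smaller on the wire too
```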
Is using flatbuffers as the on-disk storage format for an application a hare-brained idea?
If yes, is it a less hare-brained idea than using the ctypes Python module to mmap a file as a C struct? That's what I'm currently doing to get 10x speedup relative to SQLite for an application bottlenecked on disk bandwidth, but it's unergonomic to say the least.
Flatbuffers look like a way to get the same performance with better ergonomics, but maybe there's a catch. (E.g. I thought the same thing about Apache Arrow before, but then I realized it's basically read-only. I don't expect to need to resize my tables often, but I do need to be able to twiddle individual values inside the file.)
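If it helps, this is roughly what the ctypes-over-mmap approach described above looks like; the record layout and file name are made up, and the struct must of course match whatever wrote the file:

```python
import ctypes
import mmap

class Record(ctypes.Structure):
    # Hypothetical fixed-size record.
    _fields_ = [("key", ctypes.c_uint64), ("value", ctypes.c_double)]

with open("table.bin", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    # View record #42 in place: no copy, no deserialization.
    rec = Record.from_buffer(mm, 42 * ctypes.sizeof(Record))
    rec.value += 1.0          # twiddle an individual value directly in the file
    del rec                   # drop the exported buffer before closing the mmap
    mm.close()
```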
I don't think it's hare-brained, I think it'd be great. No less hare-brained than storing stuff to disk in any other format like json or yaml.
That said, the ergonomics are absolutely awful for modifying existing objects; you can't modify an existing object, you need to serialize a whole new object.
There's also a schemaless version (flexbuffers) which retains a number of the flatbuffers benefits (zero-copy access to data, compact binary representation), but is also a lot easier to use for ad-hoc serialization and deserialization; you can `loads`/`dumps` the flexbuffer objects, for example.
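If I remember the Python package right, the flexbuffers usage is about as simple as it gets (treat the exact module path and function names as an assumption from memory):

```python
from flatbuffers import flexbuffers  # pip install flatbuffers

doc = {"name": "sensor-7", "samples": [1, 2, 3], "ok": True}

buf = flexbuffers.Dumps(doc)     # schemaless, compact binary encoding
print(flexbuffers.Loads(buf))    # round-trips back to plain Python objects
```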
It really depends on what you're storing and how you need to access the content to meet performance requirements. A simple example of where flatbuffers shines is in TCP flows or in Kafka where each message is a flatbuffer. In Kafka, the message size and type can be included in metadata. In TCP, framing is your responsibility. Serializing a message queue to a flat file is reasonable and natural.
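For the TCP case, the framing is usually just a fixed-size length prefix in front of each buffer. A minimal sketch; the 4-byte prefix is a common convention, not anything FlatBuffers mandates:

```python
import socket
import struct

LEN = struct.Struct("!I")  # 4-byte big-endian length prefix

def send_msg(sock: socket.socket, payload: bytes) -> None:
    # Prefix each flatbuffer (or any blob) with its length so the receiver
    # knows where one message ends and the next begins.
    sock.sendall(LEN.pack(len(payload)) + payload)

def recv_msg(sock: socket.socket) -> bytes:
    (size,) = LEN.unpack(_recv_exact(sock, LEN.size))
    return _recv_exact(sock, size)

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        buf.extend(chunk)
    return bytes(buf)
```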
Regarding files-as-C-structs: That isn't (necessarily) harebrained (if you can trust the input), Microsoft Word .DOC files were just memory dumps. However, twiddling bits inside a flatbuffer isn't recommended per the documentation; rather, the guidance is to replace the entire buffer. If you don't want to manage the IO yourself, then a key/value store that maps indices to flatbuffers is entirely possible. I'd suggest a look at Redis.
Read only is a good case, afaik one of the usecases of flatbuffers is that you can mmap a huge flatbuffer file and then randomly access the data quickly without paying a huge deserialization cost.
1. It's not easier to use them than JSON when just getting started. However, the payoff is the strong typing and zero-copy access they offer to folks who need to support clients on multiple architectures.
2. No, writers can directly embed structs and primitive data types into binary buffers at runtime through an API generated from an IDL file. Readers use direct memory access to pull values out of the buffers. If you set it up right, this can result in a massive perf boost by eliminating the encoding and decoding steps.
3. Facebook uses them in their mobile app. Another commenter mentioned use of them in the Arrow format. The flatbuffers website isn't the best, but it clearly documents the flatbuffers IDL.
The google documentation has a minimal tutorial. There are implementations for all of the major languages. The level of documentation in the ecosystem, though, is poor. My best recommendation for you is to jump in and get the tutorial/hello world example working in a language you're comfortable with.
They aren't hard to use, but they aren't the easiest thing either.
Once you get the gist of the API through the tutorial, the other important topics that come up immediately are version control; git repo design; headers and framing.
In production, they've been bulletproof, as long as you account for the compile-time issues (schema versioning, repo layout, headers and framing, etc.).
We were using protobufs at Spotify and ditched them for simple JSON calls on the client side. No one complained, and I'm never going back to having anything like that on the client side if I can help it.
Just too many drawbacks.
For server-to-server they might be fine, but for clients, just stick with JSON (which, when compressed, is pretty efficient).
One could combine JSON and a serialization-less library: your JSON would be blown up with whitespace, but reads and updates could be O(1), serialization would be a memcpy, and you could probably canonicalize the JSON during the memcpy using Lemire's SIMD techniques.
I did this once for reading JSON on the fast path: the sending system laid out the arrays in a periodic pattern in memory, which enabled parse-free retrieval of individual values.
That's an intriguing idea but limits you to strings for your internal representation. Every time you wanted to pull a number out of it you'd be reparsing it.
Also I assume you'd have to have some sort of binary portion bundled with it to hold the field offsets, no?
It sounds like the approach is to set aside e.g. 11 bytes for an i32 field and write or read all 11 of them on each access, and to do similar things for strings, such that their lengths must be bounded up-front. It's interesting, but it requires a bit of work from the end user, and to be polite one may want to remove all the extra spaces before sending the data over the wire to a program that isn't using these techniques.
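Presumably something like this; a minimal sketch of the padded fixed-width idea, with the field name and 11-byte width chosen only for illustration:

```python
import json

WIDTH = 11  # wide enough for any i32 rendered as decimal text

def write_count(buf: bytearray, offset: int, value: int) -> None:
    # Overwrite the fixed-width slot in place; the document stays valid JSON
    # because the padding is plain whitespace.
    buf[offset:offset + WIDTH] = str(value).rjust(WIDTH).encode()

buf = bytearray(b'{"count":' + b"0".rjust(WIDTH) + b'}')
write_count(buf, 9, 1234567)       # O(1) in-place update, no re-serialization
print(json.loads(bytes(buf)))      # {'count': 1234567}
```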
I think I'd take a different approach and send along an "offset map" index blob which maps field IDs (statically known in advance, based on a schema that both client and server would need to agree on) to memory offsets and lengths into a standard JSON file.
Then you have readable JSON, but also a fast, O(1) way to access the fields in a zero-copy environment.
Done right the blob could even fit in an HTTP response header, so standard clients could use the msg as is while 'smart' clients could use the index map for optimized access.
But as I said, it would suffer from numeric values being text-encoded. And actually 'compiling' the blob would be an extra step. It wouldn't have all the benefits of flatbuffers or capnproto, but it could be an interesting compromise.
I'd be surprised if this isn't already being done somewhere.
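A rough sketch of how building such an index could look; the function name and the map format are made up, and a real version would presumably derive the offsets from the shared schema rather than shipping them alongside each body:

```python
import json

def encode_with_offsets(fields):
    """Build a JSON object plus a {field: (offset, length)} map into its bytes."""
    parts, offsets, pos = [b"{"], {}, 1
    for i, (key, value) in enumerate(fields.items()):
        prefix = (b"," if i else b"") + json.dumps(key).encode() + b":"
        encoded = json.dumps(value).encode()
        pos += len(prefix)
        offsets[key] = (pos, len(encoded))
        parts.append(prefix + encoded)
        pos += len(encoded)
    parts.append(b"}")
    return b"".join(parts), offsets

body, index = encode_with_offsets({"id": "42", "payload": "x" * 1000, "score": 3.5})
off, length = index["score"]
print(body[off:off + length])      # b'3.5' -- O(1) slice straight out of the buffer
print(json.loads(body)["score"])   # ordinary clients just parse the JSON as usual
```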
When the library decoding the data is failing with weird errors, and you open the devtools in the browser and the data being transmitted is all binary, well, you have a very hard time debugging things.
We moved to flatbuffers and then back to JSON because, at the end of the day, for our data, JSON+gzip was similar in size to the original (which had some other fields that we were not using) and 10-20 times faster to decode.
That said, the use case for flatbuffers and capnproto isn't really about data size, it's about avoiding unnecessary copies in the processing pipeline. "Zero copy" really does pay dividends where performance is a concern if you write your code the right way.
Most people working on typical "web stack" type applications won't hit these concerns. But there are classes of applications where what flatbuffers (and other zerocopy payload formats) offer is important.
The difference in computation time between operating on something sitting in L1 cache vs not-in-cache is orders of magnitude. And memory bandwidth is a bottleneck in some applications and on some machines (particularly embedded.)
Not OP, but I'm going to guess because it's an added dependency in your client library, and even worse, it includes some code generation in each client's build.
How is it annoying? To be fair, we're fronting our gRPC service with an AWS LB that terminates TLS (so our gRPC is plaintext), so we don't deal with certs as direct dependencies of our server.
FlatGeoBuf [1] is an encoding for geographic data (vector features, i.e. points, lines, polygons and so on) written around flatbuffers that is increasingly well supported in geospatial software (GDAL, MapServer), with people reporting experiments and demos on the @flatgeobuf Twitter account.
It's used for several ML-related projects, including as the model format for TensorFlow Lite (TFLite). The TFLite format also has long-term support as part of Google Play Services. The main attraction is ability to pass large amounts of data without having to serialize/deserialize all of it to access fields.
I'm using flatbuffers as the basis of communication for my multiplayer game. They're really quite pleasant to work with after you get into the flow of it.
Yeah, they're used a lot. I think the difference is that JSON is good for data or APIs you want to be easily shared, while flatbuffers (or protobuf or capnproto) are good for data that stays internal. That's just a guideline and there are plenty of exceptions, but it's a starting point to thinking about it.
Yes, I work on a product that uses Flatbuffers to control stormwater: https://optirtc.com/
Basically, we use a rather bandwidth-constrained link between our services running in the cloud and Particle-based IoT devices deployed in many locations. Some locations are remote, some are urban.
I personally haven't had to touch the Flatbuffers code since I joined the company two years ago. It's written and hasn't needed to be maintained.
TensorFlow Lite (tflite) uses flatbuffers. This format, and vendor-specific forks of it, ship on hundreds of millions of phones and other embedded devices.
I experimented with them (also with capnproto) at my last job for a usecase involving dense numerical data where being able to randomly seek within a dataset would have been really helpful for speed reasons, but found that as compared to protobuf, these formats were unacceptably bulky (lots of extra padding for word alignment, etc.), and the added cost even just to read the extra data from disk mostly negated the savings from avoiding the explicit decode step, plus would have had significant implications in terms of storage cost, etc. I ended up writing a custom wire format that allowed for seeking with less wasted space.
Seems like a neat idea, but as another commenter said, the usecases where it's the best choice seem pretty narrow.
Yep! Our platform uses flatbuffers as the primary format for both IPC, including for web responses, and for object persistence. It's a phenomenal format; I'm super happy with it.
1) SQLite with BLOB storage gives you binary benefits for the file layout plus database solutions for metadata, versioning, and indexing into large structures.
2) FlexBuffers look like a more flexible solution within the FlatBuffers library.
FlatBuffers was designed around schemas, because when you want maximum performance and data consistency, strong typing is helpful.
There are however times when you want to store data that doesn't fit a schema, because you can't know ahead of time what all needs to be stored.
For this, FlatBuffers has a dedicated format, called FlexBuffers. This is a binary format that can be used in conjunction with FlatBuffers (by storing a part of a buffer in FlexBuffers format), or also as its own independent serialization format.
If you're looking for something faster (but C++-specific), more compact in serialized size, and more efficient to serialize, you can try cista: https://github.com/felixguendling/cista
(Disclaimer: I'm the author, always happy for feedback, eg in the GitHub issues)