
Yes, flatbuffers are fantastic. Let me know if you have any specific questions; happy to respond.



Is using flatbuffers as the on-disk storage format for an application a hare-brained idea?

If yes, is it a less hare-brained idea than using the ctypes Python module to mmap a file as a C struct? That's what I'm currently doing to get 10x speedup relative to SQLite for an application bottlenecked on disk bandwidth, but it's unergonomic to say the least.

Flatbuffers look like a way to get the same performance with better ergonomics, but maybe there's a catch. (E.g. I thought the same thing about Apache Arrow before, but then I realized it's basically read-only. I don't expect to need to resize my tables often, but I do need to be able to twiddle individual values inside the file.)
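For reference, the ctypes-over-mmap approach described above can be sketched like this; the `Record` layout here is hypothetical, and a real application's struct will differ:

```python
import ctypes, mmap, os, tempfile

# Hypothetical fixed-layout record; the real application's layout will differ.
class Record(ctypes.Structure):
    _fields_ = [("key", ctypes.c_uint64), ("value", ctypes.c_double)]

path = os.path.join(tempfile.mkdtemp(), "table.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * (ctypes.sizeof(Record) * 100))  # pre-size the file

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    table = (Record * 100).from_buffer(mm)  # view the mapping as C structs
    table[3].value = 2.5                    # twiddle a value in place
    result = table[3].value
    del table                               # release refs before closing
    mm.close()

print(result)
```

No serialization or parsing happens anywhere: reads and writes go straight through the page cache, which is where the 10x over SQLite comes from.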


I don't think it's hare-brained; I think it'd be great. No more hare-brained than storing stuff to disk in any other format like JSON or YAML.

That said, the ergonomics of modifying existing objects are absolutely awful: you can't mutate a serialized object in place; you need to serialize a whole new object.

There's also a schemaless version (flexbuffers) which retains a number of the flatbuffers benefits (zero-copy access to data, compact binary representation), but is also a lot easier to use for ad-hoc serialization and deserialization; you can `loads`/`dumps` the flexbuffer objects, for example.
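A quick sketch of that flexbuffers round-trip, assuming the `flatbuffers` pip package (which bundles the `flexbuffers` module):

```python
from flatbuffers import flexbuffers

# Ad-hoc, schemaless serialization: Dumps/Loads like json, but binary
# and readable with zero-copy access on the other side.
buf = flexbuffers.Dumps({"name": "sensor-7", "readings": [1, 2, 3]})
obj = flexbuffers.Loads(buf)
print(obj["name"], obj["readings"])
```

No IDL file and no generated code needed, which is what makes it handy for ad-hoc work.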


> ctypes Python module to mmap a file as a C struct

Tell me more! Is your data larger than memory? You need persistence?

You might take a look at Aerospike, even on a single node if you need low latency persistence.


It really depends on what you're storing and how you need to access the content to meet performance requirements. A simple example of where flatbuffers shines is in TCP flows or in Kafka where each message is a flatbuffer. In Kafka, the message size and type can be included in metadata. In TCP, framing is your responsibility. Serializing a message queue to a flat file is reasonable and natural.
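The TCP framing responsibility mentioned above is usually handled with a length prefix. Here's a minimal sketch; the 4-byte big-endian prefix is one common convention, not something flatbuffers mandates (flatbuffers also ships its own size-prefixed helpers):

```python
import struct

def frame(payload: bytes) -> bytes:
    # Prefix each message with its length so the receiver can split the stream.
    return struct.pack(">I", len(payload)) + payload

def deframe(stream: bytes):
    # Walk the byte stream, yielding one message per length-prefixed frame.
    off = 0
    while off < len(stream):
        (size,) = struct.unpack_from(">I", stream, off)
        off += 4
        yield stream[off:off + size]
        off += size

stream = frame(b"msg-one") + frame(b"msg-two")
msgs = list(deframe(stream))
print(msgs)
```

Each payload here would be a complete flatbuffer; in Kafka the broker does this bookkeeping for you via message boundaries and metadata.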

Regarding files-as-C-structs: that isn't (necessarily) hare-brained (if you can trust the input); Microsoft Word .DOC files were essentially memory dumps. However, twiddling bits inside a flatbuffer isn't recommended per the documentation; rather, the guidance is to replace the entire buffer. If you don't want to manage the IO yourself, then a key/value store that maps indices to flatbuffers is entirely possible. I'd suggest a look at Redis.


Read-only is a good case: afaik one of the use cases of flatbuffers is that you can mmap a huge flatbuffer file and then randomly access the data quickly without paying a huge deserialization cost.


1. Are they easier to use than JSON?

2. Is it just putting strings into a big array?

3. Do you have an examples of them being used?


1. It's not easier to use them than JSON when just getting started. However, the payoff is the strong typing and zero-copy access they offer to folks who need to support clients on multiple architectures.

2. No, writers can directly embed structs and primitive data types into binary buffers at runtime through an API generated from an IDL file. Readers use direct memory access to pull values out of the buffers. If you set it up right, this can result in a massive perf boost by eliminating the encoding and decoding steps.
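To illustrate what "direct memory access" buys you: real flatbuffers readers use flatc-generated accessors that follow a vtable, but the read itself boils down to the same idea as this toy fixed-layout example; compute an offset and read bytes in place, with no decode pass over the buffer:

```python
import struct

# Toy fixed-layout record: u32 id, f64 score ("<" = little-endian, no padding).
# This stands in for a flatbuffer's field access; the layout is hypothetical.
buf = struct.pack("<Id", 42, 3.14)

record_id = struct.unpack_from("<I", buf, 0)[0]  # read in place at offset 0
score = struct.unpack_from("<d", buf, 4)[0]      # read in place at offset 4
print(record_id, score)
```

Contrast with JSON, where every access implies the whole document was parsed first.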

3. Facebook uses them in their mobile app. Another commenter mentioned their use in the Arrow format. The flatbuffers website isn't the best, but it clearly documents the flatbuffers IDL.


Is there any good writeup about using them that is easy to follow? With drawbacks on using them in prod?


The Google documentation has a minimal tutorial. There are implementations for all of the major languages. The level of documentation in the ecosystem, though, is poor. My best recommendation is to jump in and get the tutorial/hello-world example working in a language you're comfortable with.

They aren't hard to use, but they aren't the easiest thing either.

Once you get the gist of the API through the tutorial, the other important topics that come up immediately are schema versioning, git repo design, and headers and framing.

In production, they've been bulletproof, as long as you account for the compile-time issues (schema versioning, repo layout, headers and framing, etc.).


We were using protobufs at Spotify and ditched them for simple JSON calls on the client side. No one complained, and I'm never going back to anything like that on the client side if I can help it.

Just too many drawbacks.

For server-to-server, they might be fine, but for clients just stick with JSON (which, when compressed, is pretty efficient).


One could combine JSON with a serialization-less library: your JSON would be blown up with whitespace, but reads and updates could be O(1) and serialization would be a memcpy. You could probably even canonicalize the JSON during the memcpy using Lemire's SIMD techniques.

I did this once for reading JSON on the fast path: the sending system laid out the arrays in a periodic pattern in memory, which enabled parseless retrieval of individual values.

https://github.com/simdjson/simdjson


That's an intriguing idea but limits you to strings for your internal representation. Every time you wanted to pull a number out of it you'd be reparsing it.

Also I assume you'd have to have some sort of binary portion bundled with it to hold the field offsets, no?


It sounds like the approach is to set aside e.g. 11 bytes for an i32 field and write or read all 11 of them on each access, and to do similar things for strings, such that their lengths must be bounded up-front. It's interesting, but it requires a bit of work from the end user, and to be polite one may want to remove all the extra spaces before sending the data over the wire to a program that isn't using these techniques.
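A minimal sketch of that reserved-width scheme for a single field; `PAD`, `OFF`, and `set_count` are hypothetical names for illustration:

```python
import json

PAD = 11  # bytes reserved for the numeric field, bounded up front

# Build a JSON object whose "count" value is padded to a fixed width.
doc = bytearray(b'{"count": ' + b"0".ljust(PAD) + b"}")
OFF = doc.index(b":") + 2  # start of the reserved value region

def set_count(doc: bytearray, n: int) -> None:
    # O(1) in-place update: overwrite the reserved bytes, space-padded.
    field = str(n).ljust(PAD).encode()
    assert len(field) == PAD, "value exceeds the reserved width"
    doc[OFF:OFF + PAD] = field

set_count(doc, 31337)
print(json.loads(bytes(doc))["count"])
```

Note the update never reallocates or reserializes the document; it just overwrites bytes, which is the whole point.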


Ah, I see.

I think I'd take a different approach and send along an "offset map" index blob, which maps field IDs (statically known in advance from a schema that both client and server agree on) to byte offsets and lengths within a standard JSON file.

Then you have readable JSON, but also a fast, O(1) way to access the fields in a zero-copy environment.

Done right the blob could even fit in an HTTP response header, so standard clients could use the msg as is while 'smart' clients could use the index map for optimized access.

But as I said, it would suffer from numeric values being text-encoded. And actually 'compiling' the blob would be an extra step. It wouldn't have all the benefits of flatbuffers or capnproto, but it could be an interesting compromise.

I'd be surprised if this isn't already being done somewhere.
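The offset-map idea above can be sketched like this; building the index during serialization is hypothetical glue code, not an existing library:

```python
import json

# Build the JSON once, recording (offset, length) for each field's value.
fields = {"user": "alice", "score": 98}
index, doc = {}, b"{"
for i, (k, v) in enumerate(fields.items()):
    doc += (("," if i else "") + json.dumps(k) + ":").encode()
    value = json.dumps(v).encode()
    index[k] = (len(doc), len(value))  # where this value lives in the doc
    doc += value
doc += b"}"

# Ordinary clients just parse the JSON as usual:
assert json.loads(doc) == fields

# "Smart" clients use the shipped index for O(1), zero-parse field access:
off, length = index["score"]
print(json.loads(doc[off:off + length]))
```

Serialized compactly (e.g. `user=8+7,score=24+2`), the index could indeed ride along in an HTTP response header.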


Take a look at Msgpack.


I have before, but that's very different from what I'm brainstorming here.


Yeah, you all get it.


Is the serializer public?


Why is it bad in server to client?


Being able to debug through a simple curl or browser devtools is golden.

Also, the browser has JSON parsing built in. Fewer dependencies. Easier tooling overall.

In my experience people overuse protobuf. But I also worked at Google, where it's the hammer in constant search of any nail it can find.

At the very least, endpoints should offer a JSON representation through content negotiation.


When the library decoding the data is failing with weird errors, and you open the devtools in the browser and the data being transmitted is all binary, you have a very hard time debugging things.

We moved to flatbuffers and back to JSON because, at the end of the day, for our data, JSON+gzip was similar in size to the original payload (which carried some fields we weren't using) and 10-20 times faster to decode.


Truth.

That said, the use case for flatbuffers and capnproto isn't really about data size, it's about avoiding unnecessary copies in the processing pipeline. "Zero copy" really does pay dividends where performance is a concern if you write your code the right way.

Most people working on typical "web stack" type applications won't hit these concerns. But there are classes of applications where what flatbuffers (and other zerocopy payload formats) offer is important.

The difference in computation time between operating on something sitting in L1 cache vs. not-in-cache is orders of magnitude. And memory bandwidth is a bottleneck in some applications and on some machines (particularly embedded).


Not OP, but I'm going to guess because it's an added dependency in your client library, and even worse, it includes some code generation in each client's build.


Setting up TLS in gRPC is annoying if the alternative is using an already-existing HTTPS endpoint.


That sounds like a problem with gRPC, not protobufs.


How is it annoying? To be fair, we're fronting our gRPC service with an AWS LB that terminates TLS (so our gRPC is plaintext), so we don't deal with certs as direct dependencies of our server.



