We were using protobufs at Spotify and ditched them for simple JSON calls on the client side. No one complained, and I'm never going back to having anything like that on the client side if I can help it.
Just too many drawbacks.
For server-to-server, they might be fine, but for clients just stick with JSON (which, when compressed, is pretty efficient).
One could combine JSON and a serialization-less library: your JSON would be blown up with whitespace, but reads and updates could be O(1), serialization would be a memcpy, and you could probably canonicalize the JSON during the memcpy using Lemire's SIMD techniques.
I did this once for reading JSON on the fast path: the sending system laid out the arrays in a periodic pattern in memory that enabled parseless retrieval of individual values.
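As a rough sketch of that kind of layout (my own illustration, not the actual system; the element width and helper names are invented), the sender pads every array element to a fixed width so a reader can slice out element i at a computed offset without parsing anything before it:

    // Sketch of a "periodic" JSON array layout: each element is padded to the
    // same width, so element i can be sliced out at a computed offset without
    // parsing anything before it. Widths and names here are invented.
    const enc = new TextEncoder();
    const dec = new TextDecoder();

    const ELEM_WIDTH = 6; // every element rendered as exactly 6 characters

    function encodeSamples(samples: number[]): Uint8Array {
      const body = samples.map((s) => String(s).padStart(ELEM_WIDTH, " ")).join(",");
      return enc.encode(`[${body}]`);
    }

    // Parseless retrieval: element i starts at byte 1 + i * (ELEM_WIDTH + 1).
    function readSample(doc: Uint8Array, i: number): number {
      const start = 1 + i * (ELEM_WIDTH + 1);
      return Number(dec.decode(doc.subarray(start, start + ELEM_WIDTH)));
    }

    const doc = encodeSamples([3, 1400, -27, 90210]);
    console.log(readSample(doc, 3)); // 90210, without touching elements 0..2; still valid JSON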
That's an intriguing idea but limits you to strings for your internal representation. Every time you wanted to pull a number out of it you'd be reparsing it.
Also I assume you'd have to have some sort of binary portion bundled with it to hold the field offsets, no?
It sounds like the approach is to set aside e.g. 11 bytes for an i32 field and write or read all 11 of them on each access, and to do similar things for strings, such that their lengths must be bounded up-front. It's interesting, but it requires a bit of work from the end user, and to be polite one may want to remove all the extra spaces before sending the data over the wire to a program that isn't using these techniques.
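A minimal sketch of that padded-field idea, under my own assumptions (the 11-byte width is just the widest decimal i32; the layout, offsets, and helper names are hypothetical):

    // Sketch: an i32 field padded to 11 bytes (enough for "-2147483648"),
    // at a fixed offset agreed on out of band. Reads and writes touch only
    // that range; the document stays valid JSON for ordinary parsers.
    const enc = new TextEncoder();
    const dec = new TextDecoder();
    const I32_WIDTH = 11;
    const ID_OFFSET = '{"id":'.length; // constant once the layout is fixed

    function buildDoc(id: number, name: string): Uint8Array {
      const idText = String(id).padStart(I32_WIDTH, " ");
      return enc.encode(`{"id":${idText},"name":${JSON.stringify(name)}}`);
    }

    // O(1) read: decode just the reserved bytes.
    function readId(doc: Uint8Array): number {
      return Number(dec.decode(doc.subarray(ID_OFFSET, ID_OFFSET + I32_WIDTH)));
    }

    // O(1) update: overwrite the reserved bytes in place, no re-serialization.
    function writeId(doc: Uint8Array, id: number): void {
      doc.set(enc.encode(String(id).padStart(I32_WIDTH, " ")), ID_OFFSET);
    }

    const doc = buildDoc(42, "example");
    writeId(doc, 1234567);
    console.log(dec.decode(doc)); // {"id":    1234567,"name":"example"} (still parseable JSON)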
I think I'd take a different approach and send along an "offset map" index blob which maps field IDs (statically known in advance, based on a schema that both client and server would need to agree on) to byte offsets and lengths into a standard JSON file.
Then you have readable JSON, but also a fast, O(1) way to access the fields in a zero-copy environment.
Done right, the blob could even fit in an HTTP response header, so standard clients could use the message as-is while 'smart' clients could use the index map for optimized access.
But as I said, it would suffer from numeric values being text-encoded, and actually 'compiling' the blob would be an extra step. It wouldn't have all the benefits of flatbuffers or capnproto, but it could be an interesting compromise.
I'd be surprised if this isn't already being done somewhere.
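For what it's worth, here's a minimal sketch of the offset-map idea as I understand it (the field IDs, offsets, and helper names are all invented; both ends would have to agree on the IDs via a shared schema):

    // The JSON body stays plain and readable; a small side blob (small enough
    // for a response header) maps schema-assigned field IDs to byte offsets
    // and lengths of the values inside it.
    const enc = new TextEncoder();
    const dec = new TextDecoder();

    type OffsetMap = Record<number, [offset: number, length: number]>;

    // Server side: build the JSON and record where each value landed.
    // (ASCII-only here, so character offsets equal byte offsets.)
    function encodeWithIndex(payload: { userId: number; name: string }) {
      const prefix = '{"userId":';
      const idText = String(payload.userId);
      const middle = ',"name":';
      const nameText = JSON.stringify(payload.name);
      const body = enc.encode(prefix + idText + middle + nameText + "}");
      const map: OffsetMap = {
        1: [prefix.length, idText.length],                                    // field 1 = userId
        2: [prefix.length + idText.length + middle.length, nameText.length], // field 2 = name
      };
      return { body, indexHeader: JSON.stringify(map) };
    }

    // "Smart" client: O(1) access to one field by ID, straight out of the buffer.
    function readField(body: Uint8Array, indexHeader: string, fieldId: number): string {
      const [off, len] = (JSON.parse(indexHeader) as OffsetMap)[fieldId];
      return dec.decode(body.subarray(off, off + len));
    }

    const { body, indexHeader } = encodeWithIndex({ userId: 7, name: "Ada" });
    console.log(readField(body, indexHeader, 1));   // "7" (numbers are still text-encoded)
    console.log(JSON.parse(dec.decode(body)));      // plain clients just parse the body as usual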
When the library decoding the data is failing with weird errors, and you open the devtools in the browser and the data being transmitted is all binary, you have a very hard time debugging things.
We moved to flatbuffers and back to JSON because, at the end of the day, for our data, JSON+gzip was similar in size to the original (which had some other fields that we were not using) and 10-20 times faster to decode.
That said, the use case for flatbuffers and capnproto isn't really about data size, it's about avoiding unnecessary copies in the processing pipeline. "Zero copy" really does pay dividends where performance is a concern if you write your code the right way.
Most people working on typical "web stack" type applications won't hit these concerns. But there are classes of applications where what flatbuffers (and other zero-copy payload formats) offer is important.
The difference in computation time between operating on something sitting in L1 cache vs not-in-cache is orders of magnitude. And memory bandwidth is a bottleneck in some applications and on some machines (particularly embedded).
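As a toy illustration (not flatbuffers' or capnproto's actual generated API; the offset and field name are invented), a zero-copy read is just an offset calculation into the buffer the network handed you, whereas the JSON path has to decode, parse, and allocate an object graph before a single field can be touched:

    // Zero-copy style: the "schema" says a little-endian i32 lives at byte 8,
    // so the read touches only those four bytes. No allocation, no copying.
    function readTemperatureZeroCopy(wire: ArrayBuffer): number {
      return new DataView(wire).getInt32(8, /* littleEndian */ true);
    }

    // JSON style: the whole payload is decoded to text, parsed, and turned
    // into heap objects before the one field we care about can be read.
    function readTemperatureJson(wire: ArrayBuffer): number {
      return JSON.parse(new TextDecoder().decode(wire)).temperature;
    }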
Not OP, but I'm going to guess it's because it's an added dependency in your client library, and even worse, it includes some code generation in each client's build.
How is it annoying? To be fair, we’re fronting our gRPC service with an AWS LB that terminates TLS (so our gRPC is plaintext), so we don’t deal with certs as direct dependencies of our server.