When Not to Serialize

panic · on Aug 10, 2018

This quote from http://erights.org/e/StateSerialization.html has always stuck in my mind:

Do you, Programmer, take this Object to be part of the persistent state of your application, to have and to hold, through maintenance and iterations, for past and future versions, as long as the application shall live?

- Erm, can I get back to you on that?

0xcde4c3db · on Aug 11, 2018

See also: configuration file formats. Outside of systems where configuring the installation is basically part of the sales process, you will probably have at least a handful of customers raising hell if version 10.0 doesn't seamlessly run with a config file from version 1.0. And maybe even vice-versa.

repsilat · on Aug 11, 2018

An upside to SAAS, I guess -- if the user data lives on your servers in a reasonably structured format you can try to migrate it, and your migrations don't have to work across too many versions if that's difficult (unlike file format compatibility, which is a long term commitment.)

Of course, that only works to a certain extent. Removed features can't be "migrated" cleanly, and often config files (or worse -- code written by users in most DSLs) aren't well-structured enough to make migration straightforward.

Matthias247 · on Aug 11, 2018

I agree with the overall message of the article, but I don't think "serialize" is the right term here. Serialize means going from an in-memory data representation to a flat byte array, which is eventually persisted. There are numerous ways to perform serialization, from just memcopying the data structures up to defining a good and extensible persistent format and converting into that.

The compatibility and extensibility issues mostly come from the first approach, and can often be avoided by using a more flexible persistent format, which can be anything from a totally domain-specific format up to json, xml, protobuf, etc.
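
A minimal sketch of the second approach in C, assuming a hypothetical tag-length-value layout (the tags, the `Person` type and `person_decode` are made up for illustration, not taken from any real format). The reader skips tags it does not recognize, which is what lets a newer writer add fields without breaking older readers:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record layout: repeated [tag:1 byte][len:1 byte][payload:len bytes]. */
enum { TAG_ID = 1, TAG_AGE = 2 };

typedef struct {
    uint32_t id;
    uint8_t  age;
} Person;

/* Decodes the fields this version knows about and skips everything else.
 * Returns 0 on success, -1 if the buffer is truncated. */
int person_decode(const unsigned char *buf, size_t len, Person *out)
{
    size_t pos = 0;

    /* Defaults for fields that an older writer never emitted. */
    out->id = 0;
    out->age = 0;

    while (pos < len) {
        if (len - pos < 2)
            return -1;                      /* no room for tag + length */
        unsigned char tag  = buf[pos];
        unsigned char flen = buf[pos + 1];
        pos += 2;
        if (len - pos < flen)
            return -1;                      /* payload runs past the end */

        if (tag == TAG_ID && flen == 4) {
            /* The payload is little-endian regardless of the host byte order. */
            out->id = (uint32_t)buf[pos]
                    | ((uint32_t)buf[pos + 1] << 8)
                    | ((uint32_t)buf[pos + 2] << 16)
                    | ((uint32_t)buf[pos + 3] << 24);
        } else if (tag == TAG_AGE && flen == 1) {
            out->age = buf[pos];
        }
        /* Unknown tags and unexpected lengths are skipped, not treated as errors. */
        pos += flen;
    }
    return 0;
}
```

The cost compared with a raw memcpy is an explicit decode step; what it buys is that the bytes on disk no longer depend on compiler layout or on the current shape of the struct.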

rb808 · on Aug 10, 2018

I think Serialization and associated changes in APIs is my #1 headache for software development. The problem is with all the new auto magic frameworks now it seems to be getting worse not better. Anyone got any solutions?

leggomylibro · on Aug 10, 2018

Do protocol buffers count as a 'new auto magic framework'?

https://developers.google.com/protocol-buffers/

ninkendo · on Aug 10, 2018

The problem is, protobuf isn't just a serialization protocol, it's also a bunch of generated model code you have to start using in your application. You don't just serialize your domain model to protobuf, you tell protoc to build you some classes that become your model.

Which means if you aren't careful, you can't easily move to anything else for serialization, ever again. Your code now uses protobuf-specific objects everywhere, because that's what protobuf encourages. I'm currently in a codebase where countless method signatures (which should be serialization-agnostic) take or return `Message`-derived objects because that's what we get when we read in a request or emit a response, and using those types everywhere was just so tempting.

And now, we have new requirements that introduce some dynamism to our data model, in a way protobuf doesn't provide, so we're trying to move away from protobuf, and it's turning out to require a rewrite of practically everything because these protobuf classes are our data model, so everything depends on them.

What I've come to prefer is for serialization to be implemented at the boundaries of your service, with your models at least somewhat isolated from any given serialization technique. Protobuf is a foot-gun here because it blends these roles in a way that's hard to get away from.
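
A rough sketch of that separation in C, for illustration only. The `User` type, the `id=...;name=...` text format and all function names here are hypothetical; the point is just that application code sees `User`, and only the two boundary functions know what the bytes look like:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Domain model: a plain application type that knows nothing about any wire format. */
typedef struct {
    uint32_t id;
    char     name[32];
} User;

/* Boundary code: the only place that knows the (hypothetical) text wire format.
 * Returns the number of characters written, or -1 if the output did not fit. */
int user_to_wire(const User *u, char *out, size_t cap)
{
    int n = snprintf(out, cap, "id=%lu;name=%s", (unsigned long)u->id, u->name);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}

/* Parses the same format back into the domain type. Returns 0 on success. */
int user_from_wire(const char *in, User *u)
{
    unsigned long id;
    /* %31[^;] reads at most 31 non-';' characters, leaving room for the NUL. */
    if (sscanf(in, "id=%lu;name=%31[^;]", &id, u->name) != 2)
        return -1;
    u->id = (uint32_t)id;
    return 0;
}

/* Application logic takes the domain type, never a wire-format message. */
int user_is_admin(const User *u)
{
    return u->id == 0;
}
```

Swapping the wire format then means rewriting `user_to_wire` and `user_from_wire`, while `user_is_admin` and everything like it stays untouched.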

deathanatos · on Aug 11, 2018

> What I've come to prefer is for serialization to be implemented at the boundaries of your service, with your models at least somewhat isolated from any given serialization technique.

I think this is the right way to do it. Just like how UTF-8 to a string type is kept at the borders. Inevitably, someone comes along with a requirement that implies the first iteration of the data modeling was not only wrong, but backwards-incompatibly wrong.

It's hard to convince coworkers that it isn't code duplication though.

> Protobuf is a foot-gun here because it blends these roles in a way that's hard to get away from.

I'm not sure; in many ways it is just trying to give you a way to supply it the data to serialize with those models. It'd be nice to not have the "foot gun", but I'm not sure what such a serialization framework would look like.

ninkendo · on Aug 11, 2018

IMO the serializers should be their own standalone classes/modules which live separately from your application’s core types. You can invoke them when you need to do the serialization and keep parallel versions of them for legacy clients, etc.

ActiveModel::Serializers work like this in Rails, although I haven’t tried any similar approaches in statically-typed languages where protobuf is so commonly used.

aldarn · on Aug 11, 2018

For Python there's Marshmallow (https://github.com/marshmallow-code/marshmallow) and Django REST Framework if you're using Django (http://www.django-rest-framework.org/api-guide/serializers/). Both of these work as you described.

SamReidHughes · on Aug 12, 2018

Serializers are just functions. Why do they need to be classes?

TeMPOraL · on Aug 11, 2018

> What I've come to prefer is for serialization to be implemented at the boundaries of your service, with your models at least somewhat isolated from any given serialization technique. Protobuf is a foot-gun here because it blends these roles in a way that's hard to get away from.

This is exactly what I think about using ORMs, too, and keep repeating it. Using ORM-generated model classes as your models is a semi-automatic footgun with a hair trigger.

skybrian · on Aug 11, 2018

It seems like this might be fixed with a more flexible code generator? Perhaps one that merges app-specific definitions and the definitions in the .proto file.

foota · on Aug 11, 2018

Yeah, though it can be tempting to just pass some protos around, it's generally best to use some other abstraction for your code (even if it's just a wrapper around a proto!)

sprucely · on Aug 10, 2018

How about FlatBuffers, making the file format and in-memory format one and the same?

https://google.github.io/flatbuffers/

thechao · on Aug 10, 2018

So, this is literally the whole point of this article: it turns out that that is a bad idea, in the long run.

rdtsc · on Aug 11, 2018

It seems unrelated in a way. FlatBuffers and Protobufs are ways to serialize data. The fact that FlatBuffers happens to serialize such that the persistent format is the same as the in-memory representation is an optimization. It is just as easy to shoot yourself in the foot with one as it is with the other in regards to what the article talks about. That is, you could choose to dump your objects as vertices only, instead of including edge information, with protobufs, json, flatbuffers, xml, s-expressions etc.

The main point was that serialization needs to be thought about very carefully, because it will involve compatibility issues. It shouldn't be an automatic stream of current object structures to disk.

osigurdson · on Aug 11, 2018

Coming up with a format which is independent of your in memory structure can be limiting in some situations. A successful strategy is to isolate key persistable types in a manner that allows you to carry the old types into later versions of your application at minimal cost. This allows you to deserialize the data in its exact original form. From this point a series of transforms are used to map the data to the current version of the application. The nice thing about this strategy is it is entirely additive - transforms are added as required and chained together and old transforms are never mutated. Having said this, if you can get away with defining your data structure up front, by all means do it as there are many advantages to doing so. If you cannot (unknown requirements, large team, etc) then a more rigorous transform chain approach can be a reasonable option.
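
To make the transform-chain idea concrete, here is a small sketch in C under made-up assumptions: three hypothetical versions of a persisted rectangle type, each frozen as its own struct, and one transform per version step. None of these names come from a real codebase:

```c
#include <stdint.h>

/* Hypothetical persisted types: one frozen struct per format version. */
typedef struct { int32_t width, height; }       RectV1;  /* original format */
typedef struct { int32_t x, y, width, height; } RectV2;  /* v2 added a position */
typedef struct { double  x, y, width, height; } RectV3;  /* v3 switched to double */

enum { RECT_CURRENT_VERSION = 3 };

/* Each transform maps one version to the next and is never edited afterwards. */
static RectV2 rect_v1_to_v2(RectV1 r)
{
    RectV2 o = { 0, 0, r.width, r.height };   /* old files had no position */
    return o;
}

static RectV3 rect_v2_to_v3(RectV2 r)
{
    RectV3 o = { r.x, r.y, r.width, r.height };
    return o;
}

/* Data exactly as it was deserialized, tagged with the version that wrote it. */
typedef struct {
    int version;
    union { RectV1 v1; RectV2 v2; RectV3 v3; } as;
} StoredRect;

/* Walks the chain one step at a time until the current version is reached.
 * Returns 0 on success, -1 for a version this build does not know. */
int rect_upgrade(const StoredRect *in, RectV3 *out)
{
    StoredRect cur = *in;

    while (cur.version < RECT_CURRENT_VERSION) {
        switch (cur.version) {
        case 1: { RectV2 n = rect_v1_to_v2(cur.as.v1); cur.as.v2 = n; cur.version = 2; break; }
        case 2: { RectV3 n = rect_v2_to_v3(cur.as.v2); cur.as.v3 = n; cur.version = 3; break; }
        default: return -1;
        }
    }
    if (cur.version != RECT_CURRENT_VERSION)
        return -1;                            /* written by a newer build */
    *out = cur.as.v3;
    return 0;
}
```

Adding a fourth version in this sketch would mean one new struct, one new transform, one new `case`, and bumping `RECT_CURRENT_VERSION`; the existing transforms stay as they are.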

faragon · on Aug 11, 2018

Example for C (for structures that use neither enums nor bitfields, because those are compiler-dependent, and that avoid architecture-dependent types such as size_t/ssize_t):

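    if (islittleendian() && sizeof(mystruct) == REFSIZE_mystruct)
        memcpy(buffer, &mystruct, sizeof(mystruct));  /* fast path: in-memory layout matches the reference format */
    else
        conversion_mystruct(buffer, mystruct);        /* slow path: field-by-field conversion */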

(so you can avoid slow serialization on most platforms)
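
For anyone who wants to try the idea, a fuller sketch might look like the following. The struct fields, the little-endian reference format, and the bodies of `islittleendian` and `conversion_mystruct` are all assumptions made up for this example (here the conversion takes a pointer); only the fast-path/slow-path shape comes from the snippet above:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical struct: fixed-width fields only, ordered so that a size of
 * 8 bytes implies there is no padding between them. */
typedef struct {
    uint32_t id;
    uint16_t flags;
    uint16_t count;
} mystruct_t;

#define REFSIZE_mystruct 8   /* size of the record in the reference format */

/* Runtime byte-order probe: looks at the first byte of a 16-bit 1. */
static int islittleendian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

static void put_le16(unsigned char *p, uint16_t v)
{
    p[0] = (unsigned char)(v & 0xff);
    p[1] = (unsigned char)(v >> 8);
}

static void put_le32(unsigned char *p, uint32_t v)
{
    put_le16(p, (uint16_t)(v & 0xffff));
    put_le16(p + 2, (uint16_t)(v >> 16));
}

/* Slow path: writes each field at its reference offset in little-endian order. */
static void conversion_mystruct(unsigned char *buffer, const mystruct_t *s)
{
    put_le32(buffer, s->id);
    put_le16(buffer + 4, s->flags);
    put_le16(buffer + 6, s->count);
}

/* Uses memcpy when the host layout already equals the reference format. */
void serialize_mystruct(unsigned char buffer[REFSIZE_mystruct], const mystruct_t *s)
{
    if (islittleendian() && sizeof *s == REFSIZE_mystruct)
        memcpy(buffer, s, sizeof *s);
    else
        conversion_mystruct(buffer, s);
}
```

Note that the size check only proves the layout matches because of how these particular fields are chosen; for a struct where padding could hide inside the same total size, comparing `offsetof` for each field would be a safer guard.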