Transit – A format for conveying values between different languages (cognitect.com)
299 points by _halgari on July 22, 2014 | 109 comments



I really think the future is schema-based.

The evolution of technologies goes something like this:

1. Generation 1 is statically typed / schemaful because it's principled and offers performance benefits.

2. Everyone recoils in horror at how complicated and over-designed generation 1 is. Generation 2 is dynamically typed / schemaless, and conventional wisdom becomes that this is generally more programmer-friendly.

3. The drawbacks of schemaless become clearer (annoying runtime errors, misspelled field names, programs/systems that are harder to statically analyze). Meanwhile, the static typing people have figured out how to offer the benefits of static typing without making it feel so complicated.

We see this with programming languages:

1. C++

2. Ruby/Python/PHP/etc.

3. Swift, Dart, Go, Rust to some extent, as well as the general trend of inferred types and optional type annotations

Or messaging formats:

1. CORBA, ASN.1, XML Schema, SOAP

2. JSON

3. Protocol Buffers, Cap'n Proto, Avro, Thrift

Or databases:

1. SQL

2. NoSQL

3. well, sort of a return to SQL to some extent, it wasn't that bad to begin with given the right tooling.

If you are allergic to the idea of schemas, I would be curious to ask:

1. isn't most of your data "de facto" schemaful anyway? Like when you send an API call with JSON, isn't there a standard set of keys that the server is expecting? Isn't it nicer to actually write down this set of keys and their expected types in a way that a machine can understand, instead of it just being documentation on a web page? (See the sketch after this list.)

2. Is it the schema itself that you are opposed to, or the pain that clunky schema-based technologies have imposed on you? If importing your schema types was as simple as importing any other library function in your native language, are you still opposed to it?
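
A minimal sketch of point 1 in Python, using the jsonschema package (assumed installed; the field names are invented for illustration):

  import jsonschema

  # The "de facto" schema of an API call, written down so a machine
  # can check it, instead of living only in documentation.
  sale_schema = {
      "type": "object",
      "properties": {
          "articleId": {"type": "string"},
          "quantity": {"type": "integer", "minimum": 1},
      },
      "required": ["articleId", "quantity"],
  }

  # Raises ValidationError pointing at the missing "articleId":
  # a precise error at the boundary, not a crash three calls later.
  jsonschema.validate({"quantity": 2}, sale_schema)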


Completely agree. A key thing we realized recently at Snowplow was that people's data starts schema'ed - in MySQL tables, or Protocol Buffers, or Backbone models. Normally when data is being passed around in JSONs (e.g. into/out of APIs, into SaaS analytics), it means the original schema has been _lost_ - not that there was never a schema in the first place. And that's something that needs fixing. We released Iglu (http://snowplowanalytics.com/blog/2014/07/01/iglu-schema-rep...) as a JSON Schema repository system recently and it's been really cool seeing people start to use it for other schema use cases outside of Snowplow.


If anyone finds JSON Schema in Ruby to be too slow, I developed a Ruby-based schema system that is much faster:

http://rubygems.org/gems/classy_hash

https://github.com/deseretbook/classy_hash

I wrote it for an internal backend system at a small ecommerce site with a large retail legacy.

Edit: Ruby Hashes (the base "language" used by Classy Hash) aren't easily serialized and shared, but if there's enough interest, it would be possible to compile most JSON Schema schemas to Classy Hash schemas.


Have you looked at contracts.ruby (https://github.com/egonSchiele/contracts.ruby)? I'm sure you could share some code.


Interesting. It looks like contracts.ruby does for method calls what Classy Hash aims to do for API data.


> 1. isn't most of your data "de facto" schemaful anyway? Like when you send an API call with JSON, isn't there a standard set of keys that the server is expecting? Isn't it nicer to actually write down this set of keys and their expected types in a way that a machine can understand, instead of it just being documentation on a web page?

I'd argue that it's not.

Your schema is implicitly defined somewhere in the business logic, and you have to first learn the schema description language in order to translate your application code into schema description code. And when the application code changes, you won't be very excited to adjust the schema again.

Sometimes it's worth the effort and makes development easier, often it's the opposite. An error message saying `error: articleId missing in sale object` is more informative than `schema error in line 4282`.


Here is a JSON Schema validation failure taken straight out of the Snowplow test suite (pretty printed):

  {
    "level": "error",
    "schema": {
      "loadingURI": "#",
      "pointer": ""
    },
    "instance": {
      "pointer": ""
    },
    "domain": "validation",
    "keyword": "required",
    "message": "object has missing required properties ([\"targetUrl\"])",
    "required": [
      "targetUrl"
    ],
    "missing": [
      "targetUrl"
    ]
  }
You can't seriously prefer a NullPointerException (or choose your poison) three functions later.


I'll admit that this doesn't look too bad, but it's an additional effort. You wouldn't do this for very simple formats.


I'm not sure I agree with all your points. For programming languages, hopefully Python and Ruby will not be going away anytime soon. JavaScript mind share is also continually growing.

For databases, weren't there other reasons people got excited about NoSQL databases? Not having a schema was one aspect of it, but it mostly had to do with scaling. Now people realize SQL scales just fine for pretty much most of the use cases that were getting replaced with NoSQL. And also that most data (in webpages at least) is relational in nature.


> For programming languages hopefully Python and Ruby will not be going away anytime soon.

Neither will C++. But languages being designed these days don't look like Python or Ruby; they look like Swift, Dart, and Go.


I'm not sure I would lump Dart in with those as its typing discipline is completely optional. Also Julia, Racket and Clojure are all relatively recent and very actively developed. I think dynamic and schema-less approaches have some very serious legs yet :)


I'm not sure I would describe Racket as recent.


Racket is not a single language so much as a family of languages, quite a few of which are quite recent.


okay, fair enough.


Hmm. Maybe because we don't need a new high-level language like Python or Ruby, but there are opportunities for better low-level languages?


Agreed. The recent spate of compiled languages says more about C++ than anything else.


I like the idea of schemas but only when they're built into the message (which is probably not a good way of conveying my meaning).

Essentially JSON gives you numbers, strings and nulls so when accepted on the other side it obviously knows what's a number, what's a string, etc.

Honestly, if JSON could be expanded to stay essentially the same but gain additional built-in types, along with the ability to bolt on new ones (extensibility), then I think it would be perfect for the job.

At least in my opinion.
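
Something like this hand-rolled tagging sketch in Python, which is roughly the direction Transit takes (the "~#date" tag is invented here; Transit defines its own tags and escaping rules):

  import json
  from datetime import datetime, timezone

  def encode(value):
      # Wrap non-JSON types in a tagged object; plain JSON otherwise.
      if isinstance(value, datetime):
          return {"~#date": value.isoformat()}
      return value

  def decode(obj):
      # object_hook runs on every decoded object, innermost first.
      if "~#date" in obj:
          return datetime.fromisoformat(obj["~#date"])
      return obj

  wire = json.dumps({"created": encode(datetime.now(timezone.utc))})
  parsed = json.loads(wire, object_hook=decode)
  # parsed["created"] is a real datetime again on the receiving side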


Schemas built into the message certainly have the benefit of being self-describing. But they also have downsides:

- encoding the schema along with the message makes encodings like this less efficient.

- without an ahead-of-time schema, you don't have any canonical list for all the fields that can exist and their types. Instead this gets specified in ad-hoc ways in documentation. For example, like this: https://developers.facebook.com/docs/graph-api/reference/v2....

That URL describes a schema for groups. The schema exists, it's just not machine-readable! That means you can't use it for IDE auto-completion, you can't reflect over it programmatically, and you can't use it to make encoding/decoding more CPU/memory efficient. It's so close to being useful for these purposes, why not just take that final step and put it in a machine-readable format?


It's really frustrating how in 2014 almost everybody is still writing API definitions which are only human-readable. Two worthy exceptions:

- https://github.com/balanced/balanced-api

- https://helloreverb.com/developers/swagger

You can have self-describing messages without having to embed the schema in the instance. Instead you embed a reference to the schema in the message. We came up with an approach to this called self-describing JSONs: http://snowplowanalytics.com/blog/2014/05/15/introducing-sel... The Avro community do something similar.
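
For the curious, such a self-describing JSON is just an envelope; a minimal sketch (the schema URI is made up for illustration):

  import json

  # The message carries a *reference* to its schema, not the schema
  # itself. A receiver resolves the "schema" URI against a repository
  # (e.g. Iglu) and validates "data" before using it.
  message = {
      "schema": "iglu:com.acme/link_click/jsonschema/1-0-0",
      "data": {"targetUrl": "http://example.com"},
  }

  print(json.dumps(message))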


> Instead you embed a reference to the schema in the message

AKA XML DTD


You're talking about typing, not really about schemas. With JSON, you can know whether a value is a number as opposed to a string, but you can't know whether it's supposed to be a number.


Dumb question: why doesn't the value being a number tell us it's supposed to be a number?


Assuming the service is operating correctly, I think the more accurate statement is that it doesn't tell you the value will always be a number. That is, perhaps a field has multiple valid types. Maybe that field won't exist all the time, or similar data may be available in a different structure. Without a schema, these questions can't really be answered.


Example: Postal codes. Say you're transferring an address in JSON and you have a postal code field. In the UK, postal codes are strings (e.g. "BS42BG"), easy enough. Now, someone enters a US postal code (90505). Should we transfer it as a number, or a string?


Definitely as a string. Numbers aren't things that have digits. Numbers are things you do math with.


OK, that's logical. So where do we specify this without a schema? What happens if a client sends a number instead of a string to the server? Should it accept it and convert it, or return an error?
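
With a schema, the answer lives in one machine-checkable place. A minimal sketch with Python's jsonschema package (field name assumed):

  import jsonschema

  # The schema decides: postal codes are strings, and a client that
  # sends 90505 as a number gets a ValidationError instead of silent
  # coercion.
  address_schema = {
      "type": "object",
      "properties": {"postalCode": {"type": "string"}},
      "required": ["postalCode"],
  }

  jsonschema.validate({"postalCode": "BS42BG"}, address_schema)  # passes
  jsonschema.validate({"postalCode": 90505}, address_schema)     # raises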


Many technologies are developed, few are adopted. That's where your 3's are.

The usefulness of schemas is inversely proportional to the rate of change. They are great for getting it right, but what's the point if it all has to change before you are done?

Rate of change is a question of fact, not personal preference.

NB. I like getting things right, choose static typing, and am developing a tooling technology to further this.


So I'm confused, are you in favor of the transit way or against it?


Transit appears to be schema-less, so I would be in favor of other formats that have explicit schemas.


Transit lets you define your own semantic types, with handlers and decoders to map from/to your programming language types.

What exactly would one gain from using schemas, if I can send the value (state) of any of my static types to another application using Transit?


> What exactly would one gain from using schemas, if I can send the value (state) of any of my static types to another application using Transit?

Interoperability with other languages, for one. The static type you defined in your language can't be used with any other languages. Schemas are static types that can be used across languages.


Right, but you can write a decoder and let Transit convert your type to an equivalent type in another language.

That's the whole point of Transit: interoperability with other languages, via a good set of scalar types, basic composite types, and the ability to extend them with your own semantic types built recursively from the base types.
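
As a concrete illustration, here is a custom point type round-tripped with transit-python; this is sketched from that library's README, so the exact API may have drifted:

  from io import StringIO
  from transit.writer import Writer
  from transit.reader import Reader

  class Point(object):
      def __init__(self, x, y):
          self.x, self.y = x, y

  class PointWriteHandler(object):
      @staticmethod
      def tag(_):
          return "point"          # the semantic type's tag on the wire
      @staticmethod
      def rep(p):
          return [p.x, p.y]       # its representation in base types
      @staticmethod
      def string_rep(p):
          return None

  class PointReadHandler(object):
      @staticmethod
      def from_rep(v):
          return Point(*v)        # rebuild the native type on receipt

  io = StringIO()
  writer = Writer(io, "json")
  writer.register(Point, PointWriteHandler)
  writer.write(Point(1, 2))

  reader = Reader("json")
  reader.register("point", PointReadHandler)
  point = reader.read(StringIO(io.getvalue()))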


So, EDN [1] is a formalization of Clojure data-literal syntax that includes tagged types, has a text representation, and no built-in caching.

Fressian [2] supports the same types and extensibility as EDN, has a compact binary encoding, and the serializer/writer can choose its own caching strategy (so-called domain-specific caching[3]). I believe it was created to provide a serialization format for Datomic.

Transit sounds like an evolution of EDN and Fressian: make the bottom layer pluggable to support human-readable/browser-friendly JSON or use the well-established msgpack for compactness. Caching is still there, but it can only be used for keywords/strings/symbols/etc. instead of arbitrary values like Fressian -- probably a good trade-off for simplicity.

[1]: http://edn-format.org
[2]: http://fressian.org
[3]: https://github.com/Datomic/fressian/wiki/Caching


Nailed it.


1. Protobuf
2. Avro
3. Thrift
4. MsgPack
5. CORBA
6. ASN.1
7. Cap'n Proto
8. FlatBuffers

+ whatever internal stuff big software companies have cooked up etc.

What was so special about your use-case that demanded a totally new standard?

I hate to bring up that xkcd but it's actually relevant here.

Is it the higher-level semantics on top that allow abstraction over the underlying serialization format?

The "caching" doesn't seem to be that big of a win where network latency is high and some of the other formats can be directly mmapped, but it looks intriguing however it seems like something that could be added in a versioned binary format that some of the others provide.


Protobuf - static schema, not self-describing

Avro - not self-describing

MsgPack - limited data types (no URLs, dates, etc.)

etc.

Go find a format that offers everything transit does, and when you don't find a perfect match for all the goals, you'll understand why this library was created.

Cross-platform (without writing in C), self-describing, schema-less, extensible, support for caching, etc.


Avro is self-describing [1]. It embeds the schema of whatever it is describing inside the format, so a reader can handle arbitrary Avro-serialized files. It is also cross-platform. I'm not sure if Avro maps can cache keys, but if you use a record type the keys are not stored in the data. As far as I know, though, Avro cannot handle custom types, so those will need customized importers and exporters.

[1]: http://avro.apache.org/docs/current/


Why is self-description a necessity here?


Gotta differentiate it from the other stuff somehow.


That is a pretty cheap potshot. Please read the objectives section: http://en.wikipedia.org/wiki/Self-documenting

"minimize the effort required to maintain or extend legacy systems" and "reduce the need for users and developers of a system to consult secondary documentation sources" are fitting here.


JSON does all the things Transit does, with fewer types. (And less stupidity)

Should rename it to Enterprise JSON, because it's JSON with more complexity for those architects who don't realize you can easily store a date as an int, or a URL as a string. (Or cache ANY document)

Seriously... why does a document format need support for caching? It would seem to me that a document format should be agnostic to whether it has been cached or not.

Also, why does a document care what language writes it? I don't understand how a document couldn't be cross platform, like maybe if you're using 36-bit words or some fuckery, but most people these days store documents using 8 bit words. Does anyone seriously have issues with JSON on a PDP-10?
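
For example, in Python, the proposed convention is a couple of lines; both ends just have to agree on it out of band:

  import json
  import time

  # Date as an int (epoch seconds), URL as a string, in plain JSON.
  doc = json.dumps({"created": int(time.time()),
                    "link": "http://example.com"})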


> JSON does all the things Transit does, with fewer types. (And less stupidity)

I really hate comments like this.

These guys took the time to show the world this thing they created to fill a need they had, and this comment takes a dump on it without its author first getting any experience using the system. As though the author understands Transit's purpose better than Transit's authors do.

Hey, I get it: Transit /does/ (at first glance) seem largely redundant with all the other serialization libraries out there. But before we assume that its authors spent all this time on their project because they're "stupid", it behooves us to try to understand their motivations.

In the end they're not hurting anyone by releasing this thing they built. If it's bad, you don't have to use it. There's no need to be mean or get upset.


I'm not upset, I'm fine with other people using it. I still think the format is stupid. For the same reasons I think XML is stupid, and no I don't have to use XML, nor do I.

Also, my comment isn't hurting anyone, if you don't like it you don't have to read it, there's no reason to hate :)

Nowhere in my post did I say Rich Hickey is stupid; he has some very great ideas, I just don't think this is one of them.


Your comment is hurting people. It hurts the developers, based on at best subjective and at worst ignorant evidence. And it lowers the quality of discussion because people end up having to address your culturally poor behavior rather than the topic at hand.

Being mindful isn't hard.


> if you don't like it you don't have to read it

How do we know we don't like it until we've read it?


I upvoted your comment because you make some valid points, e.g. caching should be kept orthogonal to the document format, which needs to be kept simple above almost all else.

I upvoted only after some hesitation, though, because of the unnecessary snark about stupidity. I think that's what you were downvoted for.


Do 1,2,3,4,5,6,7,8 have good, performant implementations in JavaScript - a programming language that many services have to communicate with today?


If the answer to this question is no, the answer to whether or not a new format is necessary is not automatically yes. Certainly for at least one of those formats, a reasonably performant JS implementation could be created.


We assessed what prior work had been done and found these attempts to ultimately be unsatisfactory performance-wise for the breadth of JavaScript clients we would like to reach.


We have also found that none of the common serialization formats perform acceptably in javascript except JSON (and that's huge on the wire).

As an aside, transit seems dramatically faster in v8 than in firefox, at least in the versions of browsers I'm using, despite the fact that JSON.parse and hydrate are faster in firefox. Has it been specially optimised for v8?


We did not specifically optimize for V8; the optimizations present resulted in a performance win in all browsers. Firefox simply does not deliver the same performance as V8 or JavaScriptCore for this kind of work. Still, I think Transit is plenty acceptable under Firefox for many typical JavaScript programs. Hopefully the existence and usage of Transit will encourage Firefox to further improve their JavaScript performance profile.


I would love to see this data; it would help me greatly when picking a format that needs to work.


http://jsperf.com/json-bson-msgpack/2

Look at JSPerf for other serialization formats vs. JSON. They all look pretty much like this. Compare these results to http://jsperf.com/json-vs-transit/2 where in some cases we beat JSON.


Possibly but in many ways it's a lot easier to pick and write a fast implementation using e.g. typed arrays than it is to convince the world to use your special format


Do typed arrays work in the last 14 years of browser technology or JavaScript environments?


It would be very nice if y'all gave a detailed rationale mentioning things like this over "we made a cool new format"

I still am not seeing why the other formats fail, especially with the very limited compression that you have baked into the spec (!)

Where are the benchmarks on ie6 era browsers? And why should I let ie6 era perf direct my future data format design?


One of the big issues we've been struggling with is getting large ClojureScript data structures with tons and tons of structural sharing (think application state history) 1) small enough to transmit to the server and 2) efficient to store.

It sounds like Transit may help with this via its caching etc.? Can someone from Cognitect comment on whether this is a suitable use?


I don't think it can help here - caching only applies to map keys, transit keywords, transit symbols, and transit tags and it's not as of yet configurable.


Ah, bummer. The search goes on then - thanks for the quick reply!


What are you doing instead now? Using event sourcing and replaying events on a server side copy of the client code?


A tour of the JS implementation is at http://cognitect.github.io/transit-tour/



This is likely a prelude to how Datomic will support multiple languages outside of the JVM.


Why re-invent the wheel when MessagePack already exists, supports a similar set of types, and has far greater implementation reach?


The biggest difference is that MessagePack extensibility (which is not yet widely implemented) is based upon binary blobs, whereas Transit defines extensions in terms of other Transit types. Also, Transit can reach the browser via JSON. And Transit has caching...


Okay, thanks for edifying. As someone else said (to the creators) it would be nice to see a "why I would use this" blurb.


MessagePack implementations in JavaScript get trounced by JSON for read/write performance and JavaScript is a pretty important part of the puzzle for many people building systems these days. Transit on the other hand can best JSON on more recent JS engines and also in a bind I'd rather debug Transit verbose JSON output than MessagePack :)


I wouldn't say that it's a reinvention of MessagePack. Indeed, Transit uses MessagePack at the bottom (when specified) to provide a level of extensibility and richer types.


The reasons why it's better or why it's not solving the same problem should probably be listed on that page. There is an oblique reference to MessagePack but no clear comparison with it.


They also released a podcast episode about Transit, which doesn't seem to be mentioned on the blog post:

http://blog.cognitect.com/cognicast/060-tim-ewald


Literally about an hour ago I was browsing around Rich Hickey's Twitter account and the Cognitect website because I thought, "Hey, I haven't heard anything from him/them in a while", and voila! Just like that, this appears.


Just when you think Rich Hickey has retired to his hammock to play classical guitar, he throws out something new and awesome.


Is this just NIHism...? I find it frustrating the announcement didn't explain why they felt the need to create an alternative to Avro, Cap'n Proto etc. Transit doesn't seem to do anything new... maybe it's better, maybe not


It is accessible from the browser (and is fast, on par with JSON), and it has a rich set of basic types plus extensibility built in.


Haven't you realized it's written by demigod Hickey?


Can anyone explain what this would mean for the day-to-day programmer?


It's basically a format that lets you transmit data somewhat more extensibly than JSON would on its own (and doesn't take a huge performance hit).

This means that you can, for example, transmit an array of dates and not need to worry about parsing the dates in the right location when you receive them.

Additionally, it's extensible, so you can define formats for any domain specific data that you're dealing with, if you need to.
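
For instance, a round trip with transit-python (API sketched from its README; details may vary by version):

  from io import StringIO
  from datetime import datetime
  from transit.writer import Writer
  from transit.reader import Reader

  # Dates arrive as datetime objects; the receiver never has to know
  # which fields happen to hold date strings.
  io = StringIO()
  Writer(io, "json").write([datetime(2014, 7, 22), datetime(2014, 7, 23)])
  dates = Reader("json").read(StringIO(io.getvalue()))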


Manually marshalling/unmarshalling JSON has been a pain point for me on recent projects. I took a quick look at Protocol buffers just now and instantly understood how that will solve most of my issues with JSON. I define a simple message format (.proto) and generate native classes. My service methods can use the generated classes as parameters.

The only thing missing for me was native support for JS (but I quickly found 3rd party libraries).

I don't quite understand how transit, since it's schema-less, addresses that problem. From transit-java docs:

Object data = reader.read();

I might be missing something, but it seems I have to manually create the native classes on both endpoints and cast to those classes. Either that or I still have to manually extract values using the reader API.
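
In other words, the mapping stays hand-written, something like this (Sale is a made-up example, sketched in Python rather than Java for brevity):

  # Without generated classes, the decoded value is plain data, and the
  # "cast" into a native type is written by hand on both endpoints.
  class Sale(object):
      def __init__(self, article_id, quantity):
          self.article_id = article_id
          self.quantity = quantity

  def to_sale(data):
      return Sale(data["articleId"], data["quantity"])

  sale = to_sale({"articleId": "a-17", "quantity": 2})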


This comment is not necessarily related to Transit, but to any serialization specification: in XML, we had XSLT, which could transform any well-defined input to an XML output and vice versa.

What's the equivalent for JSON/Transit etc? When parsing and validating the correctness of an input, what is the standard protocol for propagating error messages laced with contextual domain information?

The two solutions I've found were:

- use XSLT

- use a domain-specific language


Reminds me of Thrift[1] which is an Apache foundation project started by Facebook and supports more languages. It also is battle tested. I've seen it used in production under heavy load. I don't know if Thrift does the caching or needs to. Data on the wire is already compressible via gzip which should handle repetitive values.

[1] https://thrift.apache.org/


Thrift is not self-describing, which seems to be incompatible with one of the aims of Transit (being self-describing).


Could someone explain the similarities and differences between this and EDN? I realize this seems more aimed at transferring data, whereas EDN may have been more targeted at serialization (is that correct? please correct me if I'm misremembering), but I thought they covered similar use cases (with EDN obviously not including the performance enhancements that Transit seems to have).


It's EDN but on top of JSON/MsgPack instead of having its own serialization format.


transit is cross-language and faster


It wasn't exactly clear from the post...how is this different from msgpack? Is it just an implementation of more complex data types on top of it?


I couldn't find any documentation about it, but is there any way to achieve forward/backward compatibility with Transit?


As neat as this sounds, I would prefer to do a little extra parsing by hand in exchange for the readability of JSON. Looking at some of Transit's examples, it seems like it would be difficult to gain as complete an understanding of a set of information at a glance.


JSON is readable in small quantities. A 100MB JSON file is just as readable as a 100MB binary file, except that it is bigger and slower to parse.

Also unless you can see electrons bouncing on the wire, JSON is readable because there is a program that decodes and shows it to you. It would probably take a couple of lines of code in python to cat a msgpack file.
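
Something like this, using the msgpack package (assumed installed):

  import sys
  import msgpack

  # "cat" for a msgpack file: decode the bytes and print the result.
  with open(sys.argv[1], "rb") as f:
      print(msgpack.unpackb(f.read(), raw=False))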


Displaying files is not the only way to read JSON. I've often read it on the browser development tools, on Wireshark, on tcpdump, etc.


All JSON-based Transit readers and writers can respectively read and emit verbose output. You can easily imagine setting a dev configuration flag to enable this during development. Transit verbose JSON output is quite readable IMO - http://cognitect.github.io/transit-tour/resources/example.js...


In most cases, readability is not a concern for inter-program messaging. A good logging system in your program should take care of the readability concerns. I've never come across a single instance of a text format for interprogram messaging not having to be rewritten in a binary format once it becomes obvious that it is a huge waste of computing resources.

As Alan Kay suggested, it is the total and deliberate ignorance of lessons learned from our history that has made modern programming into a pop-culture phenomenon.


Transit does specify a json-verbose mode that generates a human-readable JSON packet. I suppose readability lies in the eye of the beholder, but I think it looks reasonable.


If it's programmatically self-describing, nobody ever has to read it.

EDIT: I mean as opposed to something like json that has what the authors describe as out-of-band schema.


You have to read it when you want to look at what is going over the wire. Like with Wireshark or just inspecting network traffic.


This is exactly what I was thinking. Also, when inheriting projects from other developers that didn't produce documentation.


For another binary JSON, see CBOR (Concise Binary Object Representation, IETF RFC 7049): http://cbor.io/ I wrote Python and Go implementations of CBOR. It packs smaller and parses faster than JSON.


I think this is very helpful to keep in mind... consumers pushing their demands to producers... and eliminating waste and inefficiency.

http://en.wikipedia.org/wiki/Lean_manufacturing


I think it would be nice if more serialization formats could at least support timezone-aware date/times and deltas, as they are used really really frequently and it's a total pain to have to do a second parse to deserialize them.
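
The second parse in question, sketched with plain JSON in Python:

  import json
  from datetime import datetime

  # JSON hands back a string; recovering the datetime is a manual,
  # per-field step driven by out-of-band convention.
  payload = json.loads('{"start": "2014-07-22T09:00:00+00:00"}')
  start = datetime.fromisoformat(payload["start"])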


Can this be used for saving state to disk when you don't want to use a database?


They're suggesting not using this for storage at the moment since the spec is still in flux. But I think once the spec stabilizes, that seems reasonable.


It looks promising, but is there a mapping for XML? I would recommend adding this (as an optional profile) in a future version of the spec. It would help interoperability with legacy (non-Transit based) systems.


Obligatory XKCD standards post - http://xkcd.com/927/


Cool, now just add a TransitSchema package for every scripting language and we can use it in place of protocol buffers


something like BSON?


As easy as:

[["^ ","~:district/region","~:region/e","~:db/id",["^ ","~:idx",-1000001,"~:part","~:db.part/user"],"~:district/name","East"],["^ ","^2",["^ ...


Or with verbose output:

  [{"~:district/region": "~:region/e",
    "~:db/id": {"~:idx": -1000001, "~:part": "~:db.part/user"},
    "~:district/name": "East"}, ...]


My eyes, the goggles, they do nothing!


Underwhelming... From the teasers it seemed like it could be something actually novel.


You say that as if it's a bad thing. It may not be novel, but it'll definitely be useful.



