Introducing TJSON, a stricter, typed form of JSON (tonyarcieri.com)
140 points by bascule on Nov 2, 2016 | 129 comments



All of the keys in JSON must be strings, so they should not need tags themselves. Instead, why not put the tag of the value assigned to the key in the key:

    {
        "s:string":"Hello, world!",
        "b64:binary":"SGVsbG8sIHdvcmxk",
        "i:integer":42,
        "f:float":42.0,
        "t:timestamp":"2016-11-02T02:07:30Z"
    }
This avoids having to mess with the values in general, and integers don't need to be encoded as strings.

EDIT:

I see this constraint:

   Member names in TJSON must be distinct. The use of the same member name more than once in the same object is an error.
which is still satisfied; however, you could have `i:foo` and `s:foo`, which would result in redundant keys in the resulting JSON document. The constraint could be clarified to say that untagged key names must be unique.

Another question: is a MIME type planned for this? `application/tjson`?


I agree: why define a new format that is more verbose when you can just make it a convention at first and let parsers evolve? I probably wouldn't use ':' even in quotes, to prevent confusion. Something like this seems safe and doesn't break anything:

    {
        "string$s":"Hello, world!",
        "binary$b64":"SGVsbG8sIHdvcmxk",
        "integer$i":42,
        "float$f":42.0,
        "timestamp$t":"2016-11-02T02:07:30Z"
    }

This makes it easy for the parser to determine whether it should perform type checking. If you run this JSON through a non-typed parser, you could easily strip out the $type suffix yourself (until those parsers evolve as well). Surely not perfect, but it gives you self-describing data and the ability to perform type checking if desired. $0.02
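For instance, a rough sketch of that stripping step in Python (the `$tag` suffix convention here is just this proposal, not anything standardized):

    import json

    def strip_type_suffix(obj):
        # Recursively drop a trailing "$tag" from object keys (hypothetical convention).
        if isinstance(obj, dict):
            return {k.rsplit("$", 1)[0]: strip_type_suffix(v) for k, v in obj.items()}
        if isinstance(obj, list):
            return [strip_type_suffix(v) for v in obj]
        return obj

    doc = json.loads('{"string$s": "Hello, world!", "integer$i": 42}')
    print(strip_type_suffix(doc))  # {'string': 'Hello, world!', 'integer': 42}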


Putting type sigils on object keys does not solve the problem of typing arrays, unless array elements are always homogeneous in type, arrays are disallowed as the root symbol (they are presently allowed), and arrays are always typed by their membership in an object (and therefore by the key referring to them). It also does not solve the problem of how to type multidimensional arrays.

The question of homogeneous types for non-scalars is still an open issue, and that issue is probably the best place to discuss this further:

https://github.com/tjson/tjson-spec/issues/23

As an aesthetic note: I personally find "$" visually noisy as a sigil, and think it has generally lost favor as a sigil for commonly used expressions in programming languages, but is probably familiar to users of Perl, PHP, bash, and BASIC


Arrays are a more complex issue. The biggest question is whether the scheme breaks existing parsers or not. Additionally, the extra typing for things like heterogeneous or nested arrays will require application code that understands the typing instead of leaving that up to the parser. I think the simplest rule for now would be to only allow homogeneous arrays. This is quite an interesting problem. (Other suggestions to send along a JSONSchema seem unrealistic; the beauty of JSON is its simplicity and brevity, and nobody wants another XML.)

    { "homog1$ai": [2, 3, 4] }
    { "homog2$as": ["a", "b", "c"] }

I don't love $, but _ is so much more likely to be used in a key name for clarity, like first_name. I also doubt that many people end their keys with $type, so there is unlikely to be a conflict. If they do, it is probably a code standard they are using internally for a similar purpose anyway. Personally, I think things like jQuery have trained people to see $ as a marker for "identifier" enough that it feels pretty natural, at least at this point. Again, just my $0.02, and your mileage may vary...


Please see this issue for homogeneous typing of arrays:

https://github.com/tjson/tjson-spec/issues/23

Also based on the feedback I've received, I'm putting together a full proposal for moving all type information to object keys, and fully typing all non-scalars (and nested non-scalars) in a way that will be friendlier to statically typed languages.

That said, I don't think the "$" thing is going to happen.


I've made a concrete proposal for moving type signatures exclusively to a postfix tag on object keys here:

https://github.com/tjson/tjson-spec/issues/30


For anyone interested in further discussing encoding type information about object members in the names instead of the values, there's an open issue on the GitHub repo for the spec:

https://github.com/tjson/tjson-spec/issues/28

Regarding MIME types, since the format is JSON-compatible I would prefer it remain "application/json"; however, "application/json+tjson" might make sense.


That is not the case. Binary data is also allowed as the keys of objects (see https://www.tjson.org or the spec).

As noted in the "Content-Aware Hashing" section, an intended future feature is to support redaction, so tags on keys are needed to support this feature.

Finally, if you were to do it that way I think it would make more sense to place the type tags on the values, not the keys, both visually and semantically.


> That is not the case. Binary data is also allowed as the keys of objects

What is the value of a binary key? A key is just the name for a value; it should not contain any data itself.

> I think it would make more sense to place the type tags on the values, not the keys, both visually and semantically.

Tags on keys are like types for columns or any other schema. I would rather not have to pre-process the values. To be pedantic, this would require copying all string-based values just to add a prefix.


Binary keys are useful anywhere data is named/identified by a cryptographic key or hash, such as content-addressable systems:

https://en.wikipedia.org/wiki/Content-addressable_storage

A keyring where the object members are named by public keys is another example of where binary keys are useful.

> Tags on keys are like types for columns or any other schema. I would rather not have to pre-process the values.

Tags on keys do not work for arrays, at least as the format is presently specified. They could potentially work if arrays always consisted of homogeneous types, and objects were the only nonterminal allowed by the root symbol. See:

https://github.com/tjson/tjson-spec/issues/23


I'd like to register a weak vote of dissent on this. And I'm pretty well down the conversion funnel on "Content-addressable is the way, the truth, and the light". Binary keys are very dubious. I'd rather a format without them.

It's just incredibly annoying to work with non-string keys in almost every language. To pick an example, just for the sake of being concrete: in golang, `map[string]interface{}` is manageable; `map[interface{}]interface{}` is utterly disgusting to work with.

We have to print keys, almost invariably. Values we can sometimes shrug and say "...elided [binary content]...", but doing it on the keys is typically nonviable. Keeping the data in binary and choosing ways to stringify it to print at runtime has historically been a disaster: keys, key fingerprints, which base-$N they're going to use, and so forth, has been an unmitigated trainwreck in openssl and its ilk. I have a cheatsheet of pgp and ssl commands to print key fingerprints in various formats and I hate that cheatsheet with the cold weeping fire of a disintegrating neutron. Let's not do that again, for anything, ever, please. Picking a format composed of printable characters once and using it consistently in an application is the far better road.

Non-string keys are something that, if permitted, almost no one will ever use; and yet every client library will have a massively more complicated interface in order to handle them. At the same time, if my prior experience with people using e.g. YAML parsers that return wildcard types for keys is any indication, every caller will so aggressively disregard the feature that whether libraries support it will be moot: callers writing in any strongly typed language will write code that rejects non-string keys out of hand anyway in order to simplify the rest of their program. I can't imagine the battle being worth fighting.


I think you may still want to encode integers as strings anyway if you are encoding/decoding in JavaScript.


A human readable bencode.


When have you ever written a program that doesn't know ahead of time what type of data it's going to be operating on? Especially if you're using a statically typed language.

Whether you validate incoming payloads in JSONSchema or not, you will always have some understanding of what the shape of the incoming JSON is supposed to be, down to the most concrete types. You'll probably receive many JSON payloads that all conform to the same schema. So why bother redundantly describing that schema in every individual payload?

If you want strict types, write a JSONSchema. If you need to know specific sub-type information, start specifying what should go into the "format" field in JSONSchema. They did it in Swagger: http://swagger.io/specification/

Since the article complains about JSON parsers not knowing how to handle certain situations, perhaps people should start writing JSON parsers that allow you to pass in a JSONSchema document at parse time so they're sure to handle each field type correctly.
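As a rough sketch of the parse-then-validate flow in Python (using the third-party jsonschema package; validating during parsing itself would need a custom parser):

    import json
    import jsonschema  # pip install jsonschema

    schema = {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "email": {"type": "string", "format": "email"},
        },
        "required": ["id"],
    }

    payload = json.loads('{"id": 42, "email": "a@example.com"}')
    # Raises jsonschema.ValidationError on a type mismatch; note that "format"
    # values are only enforced if a FormatChecker is supplied.
    jsonschema.validate(payload, schema)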


The zen of JSON is that it's a schema-free, self-describing structure.

If people can be bothered to deal with schemas, they can probably also consume the Protobuf serialization of a particular object. JSON is targeting a market that doesn't want to do that.

There's no reason such an audience can't also reap the benefits of richer types and cryptographic authentication. TJSON aims to provide these benefits to programmers who don't necessarily want to consume schemas up-front to integrate.

This is particularly useful for tools which consume a small number of fields. There's a lot of overhead to pulling in IDL definitions (and keeping them up-to-date), and often it's coupled to boilerplate code generation systems. If you're just plucking a few fields from an object here and there, there's no need to go through that ceremony.

I say this as someone who's defining the data model in Protobufs. If you're doing any serious data access / API integration: use protobufs. But JSON is a nice fallback for simpler integrations.

That is quite literally the point of going through this whole exercise.


There are many use cases where you don't know the shape of the data. Many apps need to index or store or transform arbitrary key/value pairs, but without knowing anything about what those keys or values mean. JSON is a schemaless interchange format, so those situations arise pretty much by default.

Not that I love this format -- fixing JSON needs a bit more effort, especially on syntax.


> Many apps need to index or store or transform arbitrary key/value pairs, but without knowing anything about what those keys or values mean.

Then those apps shouldn't be interpreting those values. E.g. if you don't know and don't care whether a given JSON number is an integer or a decimal, don't represent it as a number in your app. Just copy the serialized number verbatim (or a canonicalized version thereof).
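In Python, for instance, the stdlib parser can be told to keep numeric literals as the original strings rather than interpreting them:

    import json

    # parse_int/parse_float receive the raw literal text; passing str keeps it verbatim.
    doc = json.loads('{"id": 9007199254740993, "price": 1.10}',
                     parse_int=str, parse_float=str)
    print(doc)  # {'id': '9007199254740993', 'price': '1.10'}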


Of course, if you're not touching it, that's a fine strategy. But maybe you're transforming it. Or you want to parse it, extract some value and send it somewhere. There are lots of use cases where there's no way around parsing and interpreting the data types in a JSON blob.


I don't mean the entire JSON structure. Go ahead and parse that; leave the individual values opaque.


Plenty of times, either when I'm taking other people's JSON or when I'm coding for future-me or for arbitrary JSON traversal or search.

But I agree that TJSON rubs me the wrong way. The simplicity of JSON is what I like and I can code around it when I need to.


^ this


I'm still waiting on xml with curly braces instead of angle brackets. As far as I can tell that's all that's holding us back


Yup, we already have schema validation, JSONRPC, and transformations, all that's really missing is namespaces and comments.

Then we can go full WSDL and SOAP.


Don't worry, we have comments https://hjson.org/

and our scientists are hard at work on namespaces: http://www.goland.org/jsonnamespace/


I find this comment extremely offensive. Microsoft is going to fix the namespaces in their SOAP for .net real soon now. In the meantime all you have to do is put a few patches in your non .net SOAP code to deal with the badly formed namespaces. Besides, those 20 patches have only been needed for ten years now.


The sarcasm detectors seem faulty today. I got it at least. ;)


My mistake is probably making a joke of the pain people have gone through with SOAP. It is probably not funny to many. :)


I'm looking forward to JSON-I becoming the recommended intersection of standards to interoperate with others.



It's amazing how many people are trying to reinvent protocol buffers! Every time I see something like this I think the developer didn't do their research, or maybe they wanted to make a hobby project anyway. Stuff like this is dangerous to use in production. Even JSON, as simple as it looks, has had a lot of bugs that are only now being ironed out.

If you want typed data structure transfer, use Protocol Buffers.


I take it you don't know who either of the two authors are?

- https://en.wikipedia.org/wiki/Ben_Laurie

- https://github.com/tarcieri

They know about protocol buffers.


Pretty damn impressive. I never knew about them either to be honest.


Did you even read the post? First paragraph:

"Its primary intended use is in cryptographic authentication contexts, particularly ones where JSON is used as a human-friendly alternative representation of data in a system which otherwise works natively in a binary format."

See also the "Content-Aware Hashing" section: The goal of this format is to enable content-aware hashing which produces the same digest for data encoded either as TJSON or a binary format such as Protobufs.

I am using it in conjunction with Protobufs.


So they are inventing ASN.1 again, for the third time. Next thing they will invent distinguished encoding rules so data can be hashed without decoding.


No, far from "inventing ASN.1 again", TJSON could potentially be a very useful format for representing equivalent structures to ASN.1, similar to:

https://github.com/google/der-ascii


That's not correct -- Protocol buffers differ from both JSON and TJSON because the schema isn't part of the protocol.

That is, the information on the wire doesn't contain enough information to interpret the data -- the schema has to be compiled into or otherwise included in the binary to know what the data means. That's not the case with JSON or TJSON.


Yes, this is a very important point: (T)JSON is self-describing in ways protobufs and other similar on-the-wire formats are not. Making sense of a protobuf involves both knowing what type of protobuf you're looking at in advance (protobufs are NOT self-identifying) and loading the corresponding schema. Otherwise each member of a protobuf is identified by an integer, so good luck guessing what each field represents.


I wouldn't necessarily compare it to protobuf, since that in general requires a schema for exact parsing and forwarding. However, there are other encodings which should already cover all the desired features. E.g. CBOR is a standardized JSON-like binary format which can store exact types for binary data, dates, etc.

Yeah, such a format isn't directly human-readable on the wire, but you can just get it into whatever string-like representation you want in your program. And parsers are for sure not harder to write than for most text formats.


I am curious if you simply missed the numerous places this post refers to CBOR? Such as this:

There exists a binary analogue of JWT called CWT which is based on the Compact Binary Object Representation (CBOR) standard. Unfortunately you can’t convert JWTs to CWTs without the original issuer re-issuing them and re-signing them in the new format.

Or this:

This is a pervasive problem for anyone who would like to store authenticated/signed data natively in a binary format (e.g. Protobufs, Thrift, capnp, MessagePack, BSON, or CBOR), but also permit clients to work natively with a JSON API without necessarily being aware of a full-and-evolving schema.

JSON is a human-meaningful serialization format, as opposed to all the binary formats named in the post, including CBOR.

See also:

https://github.com/tjson/tjson-spec/issues/26

TJSON could potentially provide non-lossy transcoding to/from the similarly tagged types in CBOR in ways JSON itself cannot.


Enough people have given reasons why Protobufs (and other schema-based interchange formats) are different from the unstructured JSON/BSON/TSON/whatever.

But if you are using a schema there are now far better alternatives to Protobufs. If you're primarily sending data over the network I've found Google's Flatbuffers to be great. For writing to disk, Cap'n Proto is similar and equally good. Receiving market data where every nanosecond matters? Simple Binary Encoding (SBE).

All of these formats employ some form of code generation to extract values from what are essentially cleverly packed structs. All data is sent little-endian and follows machine word sizes. It's not truly cross-language, but Flatbuffers has native support for C/C++, Python, Java, Go, and C#, with third-party support for Rust.


The main reason to prefer protobufs is GRPC: there is now a pseudo-standard HTTP/2-based RPC format with many robust language implementations. That's mostly to say: I think I've generally observed GRPC being embraced. My intended deployment profile is having a single HTTP(/2) server listening on a single TCP port which can speak JSON over HTTP as well as protos over GRPC, both of which can be authenticated by the same objecthash/signature.

You can store whatever format you want on disk, but if you primarily intend to serve proto-consuming clients, you might as well store protos on disk so what you serve to the network is an opaque blob of bytes with no transcoding.

Don't get me wrong, I really love capnp, particularly the CapTP-like features, but I feel like many of the novelties of its IDL/serialization format (possibly ones involving kentonv's original work before he left Google) have actually shipped in proto3. I really love capnp, but there's this handwavy "this is the way the wind is blowing" argument to be made for GRPC, I think.


Protobufs have a binary size bloat issue compared to JSON.

JSON wins for similar reasons to why HTTP 1.1 won. It's a human-readable, simple format, and performant enough for the majority of development cases. Being human readable makes debugging easier.

I hope TJSON's type hints will help increase parser speeds.


> Its primary intended use is in cryptographic authentication contexts, particularly ones where JSON is used as a human-friendly alternative representation of data in a system which otherwise works natively in a binary format.

The author might care to take a look at canonical S-expressions, a format from the 90s which attempted to do the same thing for many of the same reasons, and has the advantage of being rather more elegant.

E.g:

    {
        "s:string":"s:Hello, world!",
        "s:binary":"b64:SGVsbG8sIHdvcmxk",
        "s:integer":"i:42",
        "s:float":42.0,
        "s:timestamp":"t:2016-11-02T02:07:30Z"
    }
could be:

    (string "Hello, world!"
     binary [b]|SGVsbG8sIHdvcmxk|
     integer [i]"42"
     float [f]"42.0"
     timestamp [t]"2016-11-02T02:07:30Z")
Which is a perfectly valid encoding, but can use the canonical encoding (useful for cryptographic hashes):

    (6:string13:Hello, world!6:binary[1:b]13:Hello, world!7:integer[1:i]2:425:float[f]4:42.09:timestamp[1:t]20:2016-11-02T02:07:30Z)
Which can be encoded for transport as:

    {KDY6c3RyaW5nMTM6SGVsbG8sIHdvcmxkITY6YmluYXJ5WzE6Yl0xMzpIZWxsbywgd29ybGQhNzpp
    bnRlZ2VyWzE6aV0yOjQyNTpmbG9hdFtmXTQ6NDIuMDk6dGltZXN0YW1wWzE6dF0yMDoyMDE2LTEx
    LTAyVDAyOjA3OjMwWik=}
Granted, 'elegance' is in the eye of the beholder, but I like it.

I also think that there's a deeper concern with any shallow notion of types. An application doesn't care so much about 'some integer' as it does about 'a valid integer for this domain,' and that concern is what leads to schemas and profiles and things like that. Just encoding the machine type of a value is insufficient: one has to encode the domain type, which means conveying the domain, which means assuming some sort of shared knowledge.


S-expressions are great, and I'm a big fan of SPKI/SDSI, which used S-expressions in a security context.

However, they have generally not gained favor in the greater programming ecosystem, whereas JSON has. TJSON is trying to tap into the greater ecosystem of people who are familiar with JSON to some extent. Hence its backwards compatibility with JSON, and not adding a backwards-incompatible type syntax, as Amazon Ion did.


I feel like there's a missed opportunity in not calling it TySON or something like that.

That aside, wouldn't it make more sense to fix the JSON parsers instead? They are the ones having issues parsing e.g. 64 bit integers, JSON has no problem holding them.


I was confused by the claim that JSON parsers do not handle 64-bit integers. If the parser is written in Javascript, then it has a problem because Javascript does not support 64-bit integers. But I have not seen that problem in any other language. For example, Postgres's JSON parser can handle whatever the maximum size of PG numeric is and Python can handle extremely large numbers as well.


From RFC 7159 section 6. Numbers:

https://tools.ietf.org/html/rfc7159#section-6

   Note that when such software is used, numbers that are integers and
   are in the range [-(2**53)+1, (2**53)-1] are interoperable in the
   sense that implementations will agree exactly on their numeric
   values.
You can't depend on interoperable support for 64-bit integers in JSON. Furthermore many JSON libraries convert all numbers to floats, so this problem doesn't affect only JavaScript.

TJSON requires conforming parsers to support the full 64-bit signed and unsigned ranges. This will involve using bignums in JavaScript.
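To make the failure mode concrete, here's a quick Python illustration of the 2^53 cliff that any parser coercing to doubles will hit:

    # Doubles cannot distinguish integers beyond 2**53, so a parser that turns
    # every JSON number into a float silently corrupts large 64-bit IDs.
    big = 2**53 + 1              # 9007199254740993
    print(float(big) == 2**53)   # True: the +1 is silently lost
    print(int(float(big)))       # 9007199254740992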


Which ones besides Javascript implementations? Do you have examples?


Go's JSON parser parses all numbers as floats, for example


Also note that the sinister problem here is that implementations which convert numbers to floats will silently lose precision when they overflow the range allowed in RFC 7159. This leads to quite subtle errors, and is why Twitter moved to encoding Snowflake IDs as strings:

https://blog.twitter.com/2011/important-direct-message-ids-w...


Yes, but Postgres has high standards ;) There are plenty of crappy JSON libraries out there. (I took to writing my own in C for just this reason.)


Yes! Name should totally be changed. I hope they see this.


As a woman in tech, I would feel uncomfortable using a format named after a famous rapist.


wow, this looks awful and painful.

There's no reason to tag the type of a field when you have a typed syntax. The real problems with JSON aren't at all addressed by this:

- keys have to be strings
- lack of 'attributes' like XML has, which means you have to make a document convoluted from the start

For example, let's say I am storing product data. I might do it like:

{'title': "Billy goes to Buffalo", 'page_count': 193, 'author': "Ray Broadbunky"}

But later I might want to be able to store attributes or metadata, in xml this doesn't change the schema of the document:

    <product>
      <title>Billy goes to Buffalo</title>
      <page_count>193</page_count>
      <author>Ray Broadbunky</author>
    </product>

Can be extended to:

    <product>
      <title human_verified="false">Billy goes to Buffalo</title>
      <page_count human_verified="true">193</page_count>
      <author human_verified="true">Ray Broadbunky</author>
    </product>

It's not beautiful but anything using this data will not have to change at all to add any metadata like this.

However, with JSON you have to either add new data that can somehow be joined to the data originally, or more commonly you have to be very defensive and 'plan for' this stuff, greatly complicating the schema.

You end up starting with:

    {'attributes': [
        {'name': 'title', 'value': "Billy goes to Buffalo"},
        {'name': 'page_count', 'value': 193},
        ...

so that you can add unanticipated things later without breaking consumers of the data

but at least some problems are addressed:

- no standard way to store bytestrings
- lack of a time type


Isn't there a way to extend the types to specify our own and register constructors for them, like Transit?

Otherwise we will be in the same place as JSON in terms of extension, where our own types are second-class citizens.


The problem I see is that everyone has their own favorite type systems. Functional people may consider sum types (tagged unions) indispensable, while OOP people might want their types to have notions of inheritance. Another functional programmer might want existential quantification, higher-kinded types that most people outside the functional niche have never heard of, but a lisp programmer might want actual code as data (quote/eval) so the type has to involve functions, etc. Extending the types beyond the basic primitives is difficult because there are so many different ways of doing that.


It's not about specifying a type system, just letting users specify a tag for a type and then register a constructor for that tag. Inside it you can have whatever type-system machinery you like; the serialization format doesn't care. For example, given {"#Some:myOption": "s:value"}, the decoder will call the constructor registered for Some, passing it the value, and not care about your type system.
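A rough sketch of that kind of registry in Python (the "#Tag:name" key syntax is just this comment's example, not part of TJSON):

    import json

    CONSTRUCTORS = {}

    def register(tag):
        def wrap(fn):
            CONSTRUCTORS[tag] = fn
            return fn
        return wrap

    @register("Some")
    def make_some(value):
        return ("Some", value)

    def decode_member(key, value):
        # "#Tag:name" -> call the constructor registered for Tag with the raw value.
        # Minimal sketch: unregistered tags would raise a KeyError here.
        if key.startswith("#"):
            tag, _, name = key[1:].partition(":")
            return name, CONSTRUCTORS[tag](value)
        return key, value

    doc = json.loads('{"#Some:myOption": "s:value"}')
    print(dict(decode_member(k, v) for k, v in doc.items()))
    # {'myOption': ('Some', 's:value')}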


Agreed. Just adding some fixed types doesn't really help that much.

Something like EDN for JSON would be cool: https://github.com/edn-format/edn


Isn't Transit basically EDN for JSON in that it adds types and whatnot, and encodes to JSON?

Or do you mean, you want a format that's sort of halfway between EDN and JSON?


Transit works great, except that it's unreadable with current tools (for example browser devtools, or attaching listeners to Kafka).

I know it's a tooling problem, but I don't see the whole world embracing Transit.

If this format gets adopted with extensible types, we get a readable format that has what Transit provides, and if there's no tooling support we can still read it with standard JSON tools or none at all.


Transit is unreadable exactly because it has to work around the limitations of JSON (like string-only keys) to deliver its primary features: true maps, tagged collections etc. TJSON only has tags for primitives, so yeah, it's not much different from JSON this way, the tooling is happy.


> Just adding some fixed types doesn't really help that much.

It brings the set of scalar types you can express in a JSON message on par with other serialization formats like Protobufs:

https://developers.google.com/protocol-buffers/docs/proto3


We could just write a JSON Schema for it. It allows you to specify a "format": http://json-schema.org/latest/json-schema-validation.html#an...

So you can write a schema like: {"type": "string", "format": "email"}

or: {"type": "integer", "format": "uint64"}

There's no spec for what is allowed as a "format", so you have to decide on your own values and write your own validators, but someone could come up with a standard spec for this. Swagger formally specifies some values of "format" in this document: http://swagger.io/specification/


The purpose of TJSON is to be self-identifying and schema-free. If you want a schema, use Protobufs or the myriad JSON schema languages.


I don't want a schema; I want to preserve types between serialization and deserialization, thus avoiding conventions or having to specify those types "out of band". The same way you want to make it clear that an int is an int and a date is a date, I want to tag an object to say that the object is a city, a person, or something else. Each program should register a function to rebuild the actual object, but at least it's not a convention anymore.


Those labels in the example are confusing. Instead of string, binary, integer, float, timestamp, please use something like name, password, age, height, sessiontime.

Using string and binary is worse than using foo and bar.


Reminds me of Tyre – Typed regular expressions: https://news.ycombinator.com/item?id=12292389


"underspecification has lead to a proliferation of interoperability problems and ambiguities."

So TJSON has a perfect spec and everyone, now and forever, will interpret it perfectly?


No, but it has a set of machine-readable examples which are intended to cover JSON's present underspecified edge cases:

https://github.com/tjson/tjson-spec/blob/master/draft-tjson-...


Huh. Thought that's what XML with namespaces and schemas was supposed to do.

Only being a little sarcastic...


On the other hand you're "better off with a diamond with a flaw than a pebble without". Perfect is the enemy of good and all that.



Ion is a superset of JSON: not all Ion documents are valid JSON documents.

TJSON can be viewed as a subset of JSON: all TJSON documents are valid JSON documents, and can be parsed by existing JSON parsers. Consuming TJSON documents as JSON will involve stripping the tags, but as noted in the post, people already do these sorts of transformations on parsed JSON to e.g. extract binary data.
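As a rough sketch of what that stripping might look like for string members in Python (binary, integer, and timestamp values would need real conversion, not just tag removal):

    import json

    def strip_tag(s):
        # Drop the leading "tag:" from a TJSON string; leave untagged strings alone.
        tag, sep, rest = s.partition(":")
        return rest if sep else s

    doc = json.loads('{"s:hello": "s:world"}')
    plain = {strip_tag(k): (strip_tag(v) if isinstance(v, str) else v)
             for k, v in doc.items()}
    print(plain)  # {'hello': 'world'}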


Why muddy up the actual values, so that you have to parse each value for a "t:" prefix, where t is the type?

Why stuff it into one key/value? Why not keep them separate, where the parser looks to see if a type is present and, if so, converts to it/validates against it? (You can also place other validations/constraints on it, like min/max values, length, etc.; that falls apart if you are trying to stuff it all into one key/value.)

Like this:

  {
    "val":"Hello, world!",
    "type":"string",
    "validation": "[regex]"
  }
Instead of:

  {
    "s:string":"s:Hello, world!"
  }
This is typically how we type fields in JSON when needed, as there is no parsing needed on the value. If you need to check the type and it is present, you can act on it.


Storing validation next to the type like that is a bad idea in general. If you can't trust the incoming data to be valid, then for the same reasons, you can't trust the incoming data's claim for what would make it valid.


Possibly out in the wild, but if it comes from a server you control and systems you validate, then both this and TJSON (or any JSON type system) have that same issue. Typically typing/schemas are system-to-system and not necessarily filled in by users or exposed where they can be edited. Same issue with XML validation: any schema info needs to be enforced by the server/backend/API.


That's a lot of extra bytes you have to send over the wire. Also, I don't think validation makes sense. When sent by the server, it's too limited (would lead to situations where you're doing half the validation in TJSON and half in the client code). When sent by the client, it can't be trusted anyway.


True, if validation is on there. I just put it in to show you could easily add other validations aside from type (TJSON is locked to just type, as it is concatenated into one colon-separated value). If you just take the "val" and "type" it is really no extra bytes, or very minimal, but cleaner.

  {
    "val":"Hello World",
    "type":"string"
  }

  OR 

  {
    "s:string":"s:Hello, world!"
  }
Pretty much the same. I guess my personal preference is that I don't like to mash values together and then parse them back out of key/value strings.

In the end all validation is done on the server anyway, so types/schemas for JSON are really just a nice-to-have and should not be relied on unless you control both ends of the pipe.


>That's a lot of extra bytes you have to send over the wire.

Redundant data is not a problem if you combine JSON with gzip. JSON with gzip is basically good enough for everything except fast serialization or deserialization.

If you care about that then you should use something like Protocol Buffers or Cap'n Proto.


You've just reinvented XML


Or made it closer to current schemas for JSON like JSON Schema[1]

TJSON type tagging "t:" looks eerily like XML namespace prefixes.

Personally I'm not a big fan of typed JSON, and I hate XML/SOAP/bloat, but I'm also not a fan of mashed/concatenated values, which are reminiscent of the CSV days. Most protocol buffers are reminiscent of the binary data exchange days; those were even more fun /s.

You can apply constraints on an instance by adding validation keywords to the schema. For instance, the "type" keyword can be used to restrict an instance to an object, array, string, number, boolean, or null:

  { "type": "string" }
[1] http://json-schema.org/


Compactness. TJSON expresses in 2 characters what you're taking another roughly 24 to do (omitting the "validation", which I think is unhelpful and pointless)


True, a few bytes more. But on the flip side, there's less processing, since you don't have to parse every key/value for the colon-concatenated tag.


I don't see how the "less processing" argument works. In your version, parsing a value also requires additional work - there's a whole { } to go through.


JSON became so popular in the first place because of its simplicity: no schemas, namespaces, or attributes, and less bizarre notation than XML. Let's keep it that way.


TJSON doesn't add any of the things you just complained about


It doesn't. Instead, it takes it to new heights:

"s:id":"i:11"

This illustrates what is, in my mind, the main problem with contemporary software development. In the old days, first there was a problem, for which we had to find a tool that was good enough. Nowadays there are plenty of tools for which we are hoping someone will find a problem.


This looks similar to msgpack with saltpack for crypto parts. Right?

http://msgpack.org/

https://saltpack.org/


Six things:

1) "Lack of full precision 64-bit integers" is bullshit. Numeric precision is not specified by JSON. If a parser can't deal with 64-bit integer values, it's a poor parser.

2) "s: UTF-8 string" What does this mean? JSON strings are strings of Unicode code points; JSON itself may be encoded as UTF-8, -16, or -32. So does this mean "encode the string as UTF-8, then represent as Unicode code points"? That makes no sense.

Does this mean "encode the string as UTF-8 and output directly regardless of the encoding of the rest of the JSON output"? That makes no sense either.

So I'm guessing the author just conflated "UTF-8" with "Unicode", which is concerning given that he is attempting to define an interchange protocol.

3) "i: signed integer (base 10, 64-bit range)" What does this mean? (-2^64,2^64)? (-2^63,2^63)? [-2^63,2^63)?

4) "t: timestamp (Z-normalized)" What does that mean? There are literally dozens of timestamp formats. Does he mean full ISO 8601, restricted to UTC?

5) What is the point of TJSON anyway? When you deserialize, you still have to check that the data is of the type you expect. At best this saves a bit of parsing, since the deserializer can do that automatically. Various JSON schema languages already exist, which give you this richer typechecking.

The only use case I can think of for this is exactly what the author mentions further down the article: canonicalization for content-aware hashing. But this only works if the only types you care about fall into the small handful he thought of. What about, say, IP addresses? Case-insensitive strings (such as e-mail addresses)?

6) If we're talking about canonicalization, TJSON does not say how to canonicalize decimal numbers. I suppose this stems from the author's mistaken belief that numbers in JSON are IEEE floats (they're not, regardless of what common broken parsers do).

I hate to be so negative, but this really comes off as half-baked.

EDIT: Looking at the spec [1] it seems to address some of these, but still indicates a strong confusion between data types (Unicode, rational numeric) and data representations (UTF-8, IEEE double).

[1] https://github.com/tjson/tjson-spec/blob/master/draft-tjson-...


Responding to:

> EDIT: Looking at the spec [1] it seems to address some of these, but still indicates a strong confusion between data types (Unicode, rational numeric) and data representations (UTF-8, IEEE double).

The format is described in terms of the tags (which act as type annotations), each of which corresponds to a specific on-the-wire format. Different tagged serializations of the same data may correspond to data of the same type. A better place to discuss ambiguities in the spec regarding this issue is here: https://github.com/tjson/tjson-spec/issues/27

The idea that different on-the-wire representations of an object correspond to the same typed data object (and can therefore result in the same hash) is core to understanding content-aware hashing.

So, to your "I'm guessing the author just conflated" accusations: I don't think you fully understand what's going on here.


JSON is not defined in terms of UTF-8. That would be patently ridiculous, since UTF-8 is a serialization.

JSON is defined in terms of Unicode code points. A string in JSON is a sequence of code points, some of which are (necessarily) escaped, others of which may be.

So, to say "the string must be UTF-8" makes no sense. The JSON serialization itself can be UTF-8 (which I presume is what the author means). But nowhere does JSON talk about the encoding of a string within JSON, because it is not encoded.

Furthermore, what does the author intend for escaped characters? Are they allowed? Presumably not, since that would provide for non-canonical representations. But some escapes must be allowed, since control characters (i.e. code points less than U+0020) must be escaped per the JSON spec. Nowhere does he address this; just a technically meaningless "strings must be UTF-8".


> JSON is not defined in terms of UTF-8. That would be patently ridiculous, since UTF-8 is a serialization.

TJSON is defined as a serialization format on top of a JSON-like data model. The TJSON spec originally used the terminology "Unicode String", but moved to using "UTF-8 String", the rationale for which is given here: https://github.com/tjson/tjson-spec/issues/27

If your intent is to actually effect a change in the specification, that is the proper place to do it, but specific criticisms of the exact wording of the specification, preferably in the form of pull requests, would be the best way to effect such changes.

If your intent is not to effect a change in the specification, you're entitled to your opinion, but I'm done discussing the matter as the discussion has ceased to be meaningful to me. Generic criticisms like "You used 'UTF-8' instead of 'Unicode'" outside the context of specific sections of the specification aren't particularly helpful.

> Furthermore, what does the author intend for escaped characters? Are they allowed? Presumably not, since that would provide for non-canonical representations.

You are continuing to miss the point: TJSON intends to provide a foundation for content-aware hashing in lieu of a canonicalization scheme, an alternative solution which works across multiple encodings of the same data, sidesteps the exact problems you're talking about, and also allows arbitrary subsets of an object graph to be authenticated without requiring rehashing/resigning. Please see this closed issue on canonicalization ("won't do"):

https://github.com/tjson/tjson-spec/issues/24

From what I can gather, TJSON is offering a degree of abstraction you have not yet fully gleaned. The core idea is: many serializations, one underlying data structure/object graph. TJSON is a mere serialization layer, and indeed many TJSON documents may refer to the same underlying data structure, but all will have the same "objecthash":

https://github.com/benlaurie/objecthash


> The core idea is: many serializations, one underlying data structure/object graph.

Then why is UTF-8 even mentioned? Or time zone offsets, for that matter?


So it's possible to specify a rigorous set of test cases that, ideally, if all are passed, can be used to certify a conforming implementation.

In other words, to solve this problem:

http://seriot.ch/parsing_json.php

While in some cases it might make sense to relax some of the requirements, I'm a fan of keeping things simple. Call me one of those crazy people who thinks Postel's Law is wrong.

TJSON specifies a set of test cases for this purpose here:

https://raw.githubusercontent.com/tjson/tjson-spec/master/dr...

I prefer to specify things in such a way that it's relatively easy to specify a test suite that covers all of the corner cases.

A secondary goal of TJSON is to produce a stricter format, so I'd prefer to start with additional strictness requirements, and relax them if a reasonable case can be made.


1) Numeric precision of integers is (under)specified in RFC 7159 section 6:

https://tools.ietf.org/html/rfc7159#section-6

  Note that when such software is used, numbers that are integers and
   are in the range [-(2**53)+1, (2**53)-1] are interoperable in the
   sense that implementations will agree exactly on their numeric
   values.
There is no contract that JSON integers give you full 64-bit precision. TJSON has such a contract, and tests for full-precision 64-bit integer support (and expected failure in the boundary cases) are specified in the canonical test cases/examples file:

https://github.com/tjson/tjson-spec/blob/master/draft-tjson-...

2) Please see https://github.com/tjson/tjson-spec/issues/27

3) Yes, these specific ranges are covered in the spec: https://www.tjson.org/spec/#rfc.section.3.3

4) Z-normalized RFC3339. See: https://www.tjson.org/spec/#rfc.section.3.4

5) TJSON provides a repertoire of types which approximates what's available in the scalar types of a format like Protobufs:

https://developers.google.com/protocol-buffers/docs/proto3#s...

> What about, say, IP addresses

Simple solution for that case: IP addresses have canonical representations as strings, so use their string representations. Or, if you prefer, represent them as a TJSON object.
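For instance, Python's stdlib ipaddress module will already give you a canonical, compressed string form; a small illustration, not something the spec mandates:

    import ipaddress

    print(str(ipaddress.ip_address("2001:DB8:0:0:0:0:0:1")))  # 2001:db8::1
    print(str(ipaddress.ip_address("192.0.2.1")))             # 192.0.2.1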

6) objecthash provides an alternative to canonicalization: we can use a "content-aware" hash algorithm to produce a digest of the content rather than trying to arrange the content into a canonical form. See: https://github.com/tjson/tjson-spec/issues/24

> I hate to be so negative, but this really comes off as half-baked.

As far as I can tell, you didn't read the spec. All of your perceived ambiguities are addressed.


1) That paragraph is discussing interoperability, not the semantics of JSON. JSON "integers" have no such concept as "precision". They are just a sequence of digits. Just like most environments have no problem with very large strings, many environments also have no problem with very large numbers. Dictating "numbers can only be this big" is quite a step backward.

5) Ignoring for a second that IP addresses (particularly IPv6) don't have a universally-accepted canonical format, that's a great solution. But it's one that applies equally to every other data type, even those TJSON special-cases. TJSON is picking a handful of "privileged" types that won't be enough for everyone, so we'll just hit the same problem again.


1) TJSON imposes precision requirements on parsers which JSON lacks. It gives you guarantees where JSON doesn't. JSON may or may not lose precision when you go outside the RFC 7159 range [-(2^53)+1, (2^53)-1]. This is a potential silent failure that mangles data and is unacceptable in a security context, and one present in popular language environments such as JavaScript and Go.

5) The set of scalar types provided by TJSON is not too far off from that provided by protos. As I explained in my previous response, if you want to go beyond those, use a non-scalar type:

https://developers.google.com/protocol-buffers/docs/proto3#s...

This is par for the course for most typed languages and serialization formats. You don't magically define new scalar types de novo: you build them as sum/product types from scalars and other non-scalars.

TJSON's objects are self-describing product types.


Why don't float types use a tagged string? It says "tagging is mandatory" in the initial document, but floating point types are then omitted in the official spec


Floating point types are tagged by the use of the floating point grammar. It would require the standard to be clear that the only way to indicate integers is via "i:288", though, or there will be ambiguity.

I don't know if that circle can be squared, either; if you require integers to use the tagged string, it isn't really backwards compatible any more. If you don't, the floats remain ambiguous.

Given that the text of the blog post suggests, probably correctly, that new parsers will be necessary to use this format, I'm not convinced that trying to reuse JSON's grammar is that advantageous. If I'm switching parsers, the competition is no longer JSON, it's the full range of possible replacements, including Protocol Buffers, Cap'n Proto, XML, BSON, and everything else. If you're willing to replace parsers there's probably already something out there for you.


> It would require the standard to be clear that the only way to indicate integers is via "i:288", though, or there will be ambiguity.

The spec does this here:

https://www.tjson.org/spec/#rfc.section.4.3

  4.3.  Floating Points

     All numeric literals which are not represented as tagged strings MUST
     be treated as floating points under TJSON.  This is already the
     default behavior of many JSON libraries.
> If I'm switching parsers, the competition is no longer JSON, it's the full range of possible replacements, including Protocol Buffers, Cap'n Proto, XML, BSON, and everything else.

As noted in the post (which names a similar list of binary formats), TJSON is intended to be supplemental to binary formats, not a "replacement"


Thank you. I skimmed over that accidentally. Good.


I had the same thought. It seems inconsistent/confusing that everything else has its type defined by the tag prefix, except for floats. Why not `{ "s:float": "f:42.0" }`?


This. From a purely semantic point of view it seems odd


Floating points already have a distinct type. Many JSON parsers already convert number literals to floats in all cases. For ones that emit a mixture of integers and floats, converting to a float consistently is a simple transform.
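For example, with Python's stdlib parser that transform is a one-liner (a sketch, not something from the spec):

    import json

    # Coerce every integer literal to a float at parse time, matching TJSON's
    # rule that untagged numeric literals are floating points.
    doc = json.loads('{"s:count": 42, "s:ratio": 1.5}', parse_int=float)
    print(doc)  # {'s:count': 42.0, 's:ratio': 1.5}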

Floats are not typically used in the intended contexts for TJSON (cryptographically authenticated data), and normalizing them is rather difficult: https://github.com/benlaurie/objecthash/blob/master/objectha...


I've opened an issue about using tagged strings for floats here: https://github.com/tjson/tjson-spec/issues/32


I've been writing a JSON parser when I have a few minutes here and there. I was surprised by the lack of specificity in defining numbers, specifically floats. If floats are known to lose precision after a few decimal places...

iex> 1.5555555555555555

1.5555555555555556

...why not just specify a max precision? You can always say "if you need a more precise number, just store it as a string". If I wanted room for interpretation, I'd use YAML!


This argumentation is complete bullshit and even dangerous.

> "Parsing JSON is a Minefield": From a strictly software engineering perspective these ambiguities can lead to annoying bugs and reliability problems, but in a security context such as JOSE they can be fodder for attackers to exploit. It really feels like JSON could use a well-defined “strict mode”.

Not at all. That article just outlined the differences between the various implementations with respect to the two specs, and then added a spec test suite, including all the undefined problems, with suggestions on how to go forward.

JSON is already strict enough. The problem are people like op to make it even not-stricter. The latest JSON spec RFC 7159 adds ambiguity by allowing all scalar values on the top level, which leads to practical exploitability. See e.g. https://metacpan.org/pod/Cpanel::JSON::XS#OLD-VS.-NEW-JSON-R...

"For example, imagine you have two banks communicating, and on one side, the JSON coder gets upgraded. Two messages, such as 10 and 1000 might then be confused to mean 101000, something that couldn't happen in the original JSON, because neither of these messages would be valid JSON.

If one side accepts these messages, then an upgrade in the coder on either side could result in this becoming exploitable."

What the OP now suggests is repeating the security mistake YAML made by adding tags to all keys. Here types don't add security; they weaken it!

It is a security nightmare, as it leads to exploits which have already been added to e.g. Metasploit (CVE-2015-1592). Tagged decoders are always a problem, and currently JSON and msgpack are the only serializers safe from such exploits due to their strictness.

I would suggest that the remaining JSON libraries first fix their problems by conforming to the specs. First the secure old variant (RFC 4627) as default, and then maybe the relaxed new RFC 7159 variant, but denoting the security problems with interop of scalar values.

Currently only my Cpanel::JSON::XS library passes all of the tests from the Minefield article. The Ruby one, which the author complains about, does not, for example. The type problem is especially problematic in dynamic languages like Ruby, where classes are not finalized by default.


So, why would I use this instead of actual JSON (== browser support), BSON (binary JSON), or Cap'n Proto (I control both ends of this)?


I'd rather use XML than this atrocity.


> All base64url strings in TJSON MUST NOT include any padding with the '=' character.

This seems like it makes a streaming parser's job (slightly) more of a headache, without any serious advantage. Which seems particularly odd to me given that this seems heavily focused on binary stuff.


Padding is redundant when base64url is encapsulated in a quoted string.

If you're writing a state machine-based parser which is processing a quoted base64url it will, in amortized time, be able to find a close quote token faster than it will be able to find valid close padding.
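On the decoding side, restoring the implied padding before handing off to a stock decoder is a one-liner anyway; a small Python sketch:

    import base64

    def b64url_decode_unpadded(s: str) -> bytes:
        # Re-add the padding the length implies, then use the stdlib decoder.
        return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

    print(b64url_decode_unpadded("SGVsbG8sIHdvcmxk"))  # b'Hello, world'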


I'm a bit confused that TJSON only allows UTF-8 strings. The only way to escape Unicode characters in JSON is \uXXXX. But to encode astral characters with this syntax, UTF-16 surrogate pairs must be used. How does TJSON handle this, if strings must be encoded with UTF-8 only?


JSON is defined to use surrogate pairs to encode these; TJSON doesn't need to do anything special here.

e.g. \ud8a4\uddd1 => U+391d1
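A quick Python check of that example:

    import json

    s = json.loads('"\\ud8a4\\uddd1"')   # the escaped surrogate pair from above
    print(hex(ord(s)), len(s))           # 0x391d1 1  -> a single astral code point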


Does a time zone key trigger the enforcement of a specific ISO standard format for the value?


Why not just have a separate metadata file? It would keep the JSON file lean.


And still no ability to have comments, one reason I strongly prefer JSON5: http://json5.org/


Can you have a typed array too?



This is literally protobufs.


It's actually the opposite of protobufs. This format is self-describing - the type information is carried along with the data. Protobufs aren't self-describing. You need to have the type information out-of-line in order to make any sense of serialized protobufs.


I'm sort of nitpicking, but Protobufs have wire-level type tags (so old app versions are able to handle newer schemes, with fields they don't know yet). They're limited, but they exist.


[flagged]


Any example of what you think is a serious data interchange format?


Based on ubiquity do you disagree?


Yes. Ubiquity doesn't mean it is fit for purpose, especially for most of the things these extensions try to overcome.

In this case ubiquity is largely a product of the fact that the primary consumer of much JSON data is a web browser. Likely much of that data is simple enough that it does not require more than what JSON provides.


Ubiquity is a very useful property of interchange formats (APIs are a big deal).

That said, for internal-only things where I have a lot of control (and I'm writing in supported languages - wtb elixir), I'd probably be using gRPC.


The next guy who inherits your internal code would prefer you to just use JSON. There is a reason it is ubiquitous: simplicity. I wonder how long that will last, though, with all these type systems and XJSONs.


Mmm, disagree. The interchange format isn't the only useful part of GRPC (though protobuf is pretty standardized these days). This was just in the context of "I'm a big org standardizing on microservice tradeoffs", so maybe slightly out of context.


At least GRPC is a standard, and JSON/REST is arguably simpler. But both are at least standards that let teams avoid reinventing everything and provide a baseline for others to build on.

Just stating that most programmers would probably rather inherit a JSON/REST app than a GRPC one, though GRPC is quite nice.


I agree, and if it successfully handles 80% or more of use cases then it's a win. I just don't have the expectation that it should handle the other 20% and if I had a use case in that 20% I probably wouldn't start addressing my problem by creating yet another JSON extension.





