
> In conclusion, JSON is not a data format you can rely on blindly.

What does HN suggest for configuration files (to be written by a human essentially)?

I am looking at YAML and TOML. My experience with JSON based config files was horrible.




Funnily enough, as I've been experimenting with Chef and trying to stick to JSON config files where allowed, I was again struck that (a) it's not a good choice for config files (b) it's an OK choice though (c) lots of people are using it anyway (d) nearly everyone that does so (including Chef) allows comments, so in reality are not actually using JSON at all.

Point (d) is the important one. I really think we need a standard for json-with-comments. JSONC or whatever, but it should have a different standard filename and it should have an RFC dictating what is and isn't allowed. Personally I would allow only // comments because there are too many subtle issues with C-style comments, but it may be too late to agree on that.

Half the point of JSON is that if application A stores its data as JSON then application B can parse that without any nasty surprises. Except, there are now probably thousands of noncompliant implementations in the wild that only exist because the standard doesn't allow comments. Each one of those implementations adds subtle differences (in addition to the comments themselves) depending largely on how it removes the comments before passing the text to a standards-compliant JSON parser (assuming it does that, which, being DC's recommended approach, is as close to a standard as currently exists).
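To make the "strip the comments, then hand off to a real parser" approach concrete, here is a minimal sketch in Python. It assumes only `//` comments are allowed, and `strip_line_comments` is a made-up helper name, not part of any library. The fiddly part is not treating `//` inside a string (a URL, say) as a comment, which is exactly where implementations tend to diverge.

```python
import json


def strip_line_comments(text: str) -> str:
    """Remove // comments that appear outside JSON string literals."""
    out = []
    in_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < len(text):
                # Copy the escaped character so \" does not end the string.
                out.append(text[i + 1])
                i += 1
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif text.startswith("//", i):
            # Skip to the end of the line, keeping the newline itself.
            while i < len(text) and text[i] != "\n":
                i += 1
            continue
        else:
            out.append(ch)
        i += 1
    return "".join(out)


SAMPLE = """{
  // where to listen
  "url": "http://localhost:8080",  // the // inside the string must survive
  "retries": 3
}"""

# After stripping, the text is plain JSON and the standard parser takes over.
config = json.loads(strip_line_comments(SAMPLE))
print(config)
```

A naive regex like "delete everything after //" would mangle the URL, which is the kind of subtle difference I mean.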


> (b) it's an OK choice though

I really think it's not an OK choice. A config file format that doesn't allow comments provides some of the worst possible UX.

One of the nice things about config files is that normally they are self-documenting, explaining the meaning of the various directives and providing possible values.

Without comments, you have to constantly switch between the documentation and the config file.

Also, the restriction on trailing commas is another really bad issue for a config file language as it pollutes diffs, makes moving lines around needlessly difficult and is one more landmine waiting to happen for the sysadmin editing a file.
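A quick way to see the trailing comma landmine, using Python's standard `json` module as the strict parser (the `parses` helper and the sample strings are just for illustration):

```python
import json

WITHOUT_TRAILING = """{
  "host": "127.0.0.1",
  "port": 8080
}"""

# Same file after someone reorders the lines or appends a key and
# leaves a comma on the last entry.
WITH_TRAILING = """{
  "port": 8080,
  "host": "127.0.0.1",
}"""


def parses(text: str) -> bool:
    """Return True if the text is accepted by the strict JSON parser."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses(WITHOUT_TRAILING))
print(parses(WITH_TRAILING))
```

The first call succeeds and the second is rejected, even though the only difference is line order and one comma.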

No. JSON is not at all OK as a config file language.


> I really think it's not an OK choice. A config file format that doesn't allow comments provides some of the worst possible UX.

Are we sure about that? E.g. when messing with configuration files, I mostly have to visit the documentation anyway, and that documentation could explain the key-value pairs in the json file. Furthermore, there are configuration files which consist mostly of comments explaining all kinds of edge cases, with the actual configuration data commented out, which can be a mess of its own.

If good code should be self-explanatory then why not configurations?


How does json "not support comments"?

{"comment":"default values for this object"}


Since when is in-band signalling a good idea? What if one of your configuration keys is named "comment"?


And what do you name your second comment?
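To illustrate with Python's standard `json` module (other parsers may behave differently, since JSON leaves duplicate keys underspecified): reuse the "comment" key and the first one silently disappears.

```python
import json

TEXT = '{"comment": "first note", "port": 80, "comment": "second note"}'

# Default behaviour in Python: the last duplicate key wins, no warning.
data = json.loads(TEXT)

# object_pairs_hook exposes the raw key/value pairs, duplicates included.
pairs = json.loads(TEXT, object_pairs_hook=list)

print(data)
print(pairs)
```

So you end up numbering them ("comment1", "comment2", ...) or accepting that only one survives a load/dump cycle.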


{"notes to self":["Don't edit config files by hand","use a decent hierarchy"]}

//Http://jsoneditoronline.org


You could even write comments as a linear RSS feed of nested OPML outlines, by converting all that XML to JSON.

http://convertjson.com/xml-to-json.htm


Yep, I went through the process of converting all our XML objects into JSON ones about 6 years ago now. Smaller files and much easier to read, manipulate, store and transfer.

And while I personally quite liked XSLT, JavaScript is a much more flexible and reliable option.


For one thing, not every place where you might want a comment happens to be in an object.


Like where? Who for? For what purpose?

I can't think of a single example where this would be useful.


A list of items in an array. Some items in that list are of particular note.

(Say you have 15 items in an array. For whatever reason, such as you not being in control of the expected input structure of whatever you're giving this JSON to, you have them grouped with comments at the top of each block.)


Why would you want to do that in what is basically a human readable binary file and not in a readme?

You seemed to stop at the "for who" and "for what purpose".

I can 'just' about see the case in something like nodeJS package.json files (that is the least of nodeJS's problems, but that's a whole other conversation). But a readme is a so much better option than having to trawl through code comments.


The discussion above concerns why JSON isn't a good choice for configuration files. That's exactly because configuration files are not human readable binary files.


TBH. I find package.json files to be much much much much nicer files than configure.sh or settings.txt files, even a configure.py.

And if that isn't the case for you, it's not like you don't have the choice.


Did you miss the part where parent says he is using JSONC?


Reflecting the great tradition of "C++", I hereby propose calling it "//JSON".


We have had good luck with HOCON for config files:

https://github.com/typesafehub/config/blob/master/HOCON.md


JSON with comments? How about { "object":{ "foo":1,"bar":"New Jersey"} , "_comment": "blah blah blah" }

A bit hackish, but always worked for me.


Now add a comment for foo and another one for bar. Also try it when your client throws an exception whenever it encounters invalid keys in object. Or if perhaps it blindly persists or tries to perform some logic using that "data". And so on.
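A sketch of the "client throws on invalid keys" case, in Python. `ALLOWED_KEYS` and `load_strict` are hypothetical and stand in for whatever schema validation the consuming application does:

```python
import json

# Hypothetical schema: the only keys this client knows about.
ALLOWED_KEYS = {"foo", "bar"}


def load_strict(text: str) -> dict:
    """Parse a JSON object and reject any key outside the schema."""
    data = json.loads(text)
    unknown = set(data) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    return data


plain = '{"foo": 1, "bar": "New Jersey"}'
with_comment = '{"foo": 1, "bar": "New Jersey", "_comment": "blah blah blah"}'

print(load_strict(plain))
try:
    load_strict(with_comment)
except ValueError as exc:
    print("rejected:", exc)
```

The "_comment" key is data as far as the parser is concerned, so a strict consumer has every right to reject it.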


I like HJSON[0] for files you need to edit manually. You have comments and other user friendly things.

[0]https://hjson.org/


YAML is equally horrible and the spec is an order of magnitude more complex. I wasted half an hour trying to spot an error in the ejabberd yaml config, only to find out something trivial was missing. At least JSON has braces even though it's not suitable for configuration files. By all means choose TOML or something else (even ini or java properties files) instead.


YAML has braces. In fact, it's a super set of JSON. Any YAML parser should be able to parse JSON encoded data with one exception. Block comments. Which is a bastardization of JSON (as mentioned in the article) so block comments shouldn't be a problem in most cases.


> [YAML is] a super set of JSON.

The specification says this, but I was never convinced it was true. Specifically, the spec around escaped unicode characters lacks any mention of surrogates being encoded in two \u sequences, but rather specifies \u as:

> Escaped 16-bit Unicode character.

Which is an unhelpfully not-even-wrong statement. In practice, trying to treat JSON as a subset of YAML results in things not round-tripping, like the following:

  In [9]: yaml.load(json.dumps('\N{PILE OF POO}'))
  Out[9]: '\ud83d\udca9'
(in case it isn't clear, that's not a valid repr of pile-of-poo in Python:

  In [14]: _9 == '\N{PILE OF POO}'
  Out[14]: False
)

I've now got surrogates in a decoded Unicode string (why is this even allowed, I don't know); this results in fun behavior like `.encode('utf-8')` raising.
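For comparison, here is the same round trip through Python's `json` module alone (no PyYAML involved), plus what happens with a string that holds the two surrogates as separate code points. This is a sketch on a recent Python 3:

```python
import json

# json.dumps escapes non-ASCII by default, producing a surrogate pair.
encoded = json.dumps("\N{PILE OF POO}")

# json.loads recombines the pair into a single code point.
decoded = json.loads(encoded)

# A str literal with two \u escapes keeps them as two surrogate code points.
lone_surrogates = "\ud83d\udca9"

try:
    lone_surrogates.encode("utf-8")
    encodes = True
except UnicodeEncodeError:
    encodes = False

print(encoded)
print(len(decoded), len(lone_surrogates), encodes)
```

The JSON decoder hands back a one-character string, while the surrogate version has length 2 and cannot be encoded as UTF-8.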

Edit: weird: HN appears to strip pile of poo from comments…


That's not wrong.

    In [1]: import json
    
    In [2]: json.dumps("\N{PILE OF POO}")
    Out[2]: '"\\ud83d\\udca9"'
Quality detective work there. JSON, being JavaScript [Object Notation], is standardized to UTF-16. What you see here is "PILE OF POO" in UTF-16.

YAML specifies that the recommended encoding is UTF-8, but should be able to parse UTF-16 and UTF-32. And, if you want that, then you need to tell YAML to expect UTF-16. You can do that by including a BOM (Byte-Order Mark).

PyYAML is documented as only supporting UTF-8 and UTF-16, but not UTF-32.

I'm scratching my head on how to get that "UTF-16 encoded" UTF-8 string to a proper representation (in Python). If this were Ruby, I would use str.force_encoding() which doesn't change the bytes at all. If the solution comes to be, then I'll update this comment (or reply).

http://pyyaml.org/wiki/PyYAMLDocumentation http://yaml.org/spec/1.2/spec.html#id2771184
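One possible way to do that in Python, as far as I can tell: encode to UTF-16 with the `surrogatepass` error handler (which lets the surrogate code points through as code units), then decode normally so the pair is merged. `merge_surrogates` is my own name for it, not a library function.

```python
def merge_surrogates(text: str) -> str:
    """Recombine surrogate code points into the characters they encode."""
    # surrogatepass writes each surrogate out as a raw UTF-16 code unit;
    # the normal UTF-16 decoder then pairs them up again.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


broken = "\ud83d\udca9"
fixed = merge_surrogates(broken)

print(len(broken), len(fixed))
print(fixed == "\N{PILE OF POO}")
```

An unpaired surrogate would still fail at the decode step, which seems like the right outcome.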


> What you see here is "PILE OF POO" in UTF-16.

Well, yes, but you missed my point: we're not looking at an encoded string object. A `str` (Python's string type) is supposed to represent a Unicode string — "Strings are immutable sequences of Unicode code points"; the underlying encoding is supposed to be transparent, and in fact, it's possible to construct an example (invalid, IMO, like the above example) YAML that PyYAML will decode where the underlying encoding in memory would be UTF-32, but still contain those surrogates.¹

In Unicode's lexicon, this string contains surrogate code points, not surrogate code units. This is wrong, and as I stated, it is surprising that Python permits it. Unicode explicitly warns against this behavior:

> A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character.

And that's what's happening here.

> YAML specifies that the recommended encoding is UTF-8, but should be able to parse UTF-16 and UTF-32. And, if you want that, then you need to tell YAML to expect UTF-16. You can do that by including a BOM (Byte-Order Mark).

I'm passing PyYAML a string, but I can pass it encoded text as well; the output is the same. Note that the input contains characters completely in ASCII, so it's really not input encoding at play here. I realize now it wasn't explicit in my original comment, but here's the input being given to PyYAML (JSON that we're claiming is also YAML, because the claim was that JSON is a subset of YAML):

  "\ud83d\udca9"
Note that this is the raw JSON/YAML, not a Python repr of it. Those slashes are literal slashes.

> PyYAML is documented as only supporting UTF-8 and UTF-16, but not UTF-32.

PyYAML is documented to contain a bug, then. (Though admittedly better than it being undocumented.) It is not developer friendly to only decode some text, and emit erroneous output on others.

> I'm scratching my head on how to get that "UTF-16 encoded" UTF-8 string to a proper representation (in Python).

It's hard because it shouldn't be possible. Unicode forbids it. The encoded UTF-32 version of that string (if Python permitted it) would be:

  0x0000d83d  0x0000dca9
unless you explicitly handled the surrogate code points separately by decoding them first, and then re-encoding them into the target encoding, but that's crazy.

(this is another example of why the above is not valid or expected output from PyYAML.)

¹this:

  In [19]: yaml.load('"\\ud83d\\udca9, \\U0001f4a9"')
  Out[19]: '\ud83d\udca9, '
is, in CPython, a string that in memory is stored in UTF-32, but contains two surrogate code points.

(Note that I'm using a late version of Python 3. Early Python 3 and Python 2 handle Unicode poorly, but this example should work equally strangely there too. While PyYAML's output here is clearly buggy, the spec is equally vague and underspecified about what should happen.)


> In Unicode's lexicon, this string contains surrogate code points, not surrogate code units. This is wrong, and as I stated, it is surprising that Python permits it. Unicode explicitly warns against this behavior:

> Note that this is the raw JSON/YAML, not a Python repr of it. Those slashes are literal slashes.

I just came to this realization over dinner. Apparently, this odd behavior is defined by JSON. So, it's not Python's json module at fault because it is, actually, implemented correctly.

https://en.wikipedia.org/wiki/JSON#Data_portability_issues

This whole time I thought the json module had a bug, but now I am wondering if it's another PyYAML bug.

Maybe not. I'll have to read this section a few more times. http://yaml.org/spec/1.2/spec.html#id2770814

Edit: Going back a bit here.

> Specifically, the spec around escaped unicode characters lacks any mention of surrogates being encoded in two \u sequences,

http://yaml.org/spec/1.2/spec.html#id2771184 makes the statement, "All characters mentioned in this specification are Unicode code points. Each such code point is written as one or more bytes depending on the character encoding used. Note that in UTF-16, characters above #xFFFF are written as four bytes, using a surrogate pair.". And, there are numerous mentions of JSON compatibility. So, I suppose, this is an issue with PyYAML (and Ruby's YAML, I checked).

I had never seen this issue before. Ruby's JSON module doesn't break up UTF-8 characters into surrogate pairs, nor does any online JSON parser that I could find via Google.


What was the trivial missing thing?


I've used JSON and YAML for a very long time, but whenever I have the option, I'll be using TOML every time.

YAML is _huge_ (did you know all JSON is valid YAML?) and those features can come back to bite you... http://blog.codeclimate.com/blog/2013/01/10/rails-remote-cod...

JSON has a number of annoyances, mostly no comments, no trailing commas, all the stuff this article gets into.

Formats like INI or CSV don't really have a spec, or if they do, most implementations don't seem to follow them.

TOML is a bit weird at first, but it's grown on me quite a bit.


I don't have a specific recommendation, but when I see a project that uses a JSON file as configuration, I wonder: "hasn't the author ever needed to include a comment in the configuration?".


I can't speak for others. When I write something that uses JSON for configuration files those files are pretty much always written from configuration management. The comments are then in the manifest/recipe/playbook that generates the config file, which is where people are actually working with the values.
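Roughly what that looks like, with a plain Python script standing in for the manifest/recipe/playbook (the setting names are invented for the example):

```python
import json

# The comments live here, next to the values people actually edit.
settings = {
    "listen_port": 8080,  # must match the load balancer health check
    "workers": 4,         # roughly one per CPU core
}

# The generated file is plain, comment-free JSON for the application.
rendered = json.dumps(settings, indent=2, sort_keys=True)
print(rendered)
```

Nobody edits the rendered file by hand, so it never needs comments of its own.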



This is what I use, it's very good.

https://github.com/typesafehub/config


I have found it is not very cross-language (and a bit of a pain w/ the replacement hierarchies to easily implement everywhere).


Lua was written for exactly this purpose, and I personally enjoy writing with it, so it would be my first choice in most cases.


What purpose? Writing config files? Lua is a programming language (a nice one too). It's code. Code should not be used for config files, nor for data serialization because once you eval it you are executing it.


Yes, Lua was originally written as a language for rich configuration files and has grown out of that into a more fully featured language.

It's also one of the easier languages to sandbox, since you can evaluate user provided code in a custom environment that only contains the functions you deem safe. You can even use the standard debug hooks to set an upper limit on the number of instructions a script can execute to prevent someone from creating an infinite loop in a config file and locking whatever thread is reading the config.

It's not very appropriate as a data serialization format, or as machine written config, but the parent post specifically asked about human written configuration files.



In my personal experience TOML works really well. It's a little reminiscent of .ini files, but definitely is better.


I have actually used Lua before with good success. It was on a smaller scale, so I can't speak to edge cases, but I would certainly recommend considering it at the least.


If you decide to use YAML, make sure to check out Strict YAML[1] and its FAQ[2].

[1]: https://github.com/crdoconnor/strictyaml

[2]: https://github.com/crdoconnor/strictyaml/blob/master/FAQ.rst...


INI files.

Of course, some lunatics try to embed big amounts of text, and that is where INI files do not look OK.



> What does HN suggest for configuration files (to be written by a human essentially)?

JSON.

YAML confuses many people by being whitespace-sensitive; ini files I find too limited.
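As a rough illustration of "too limited", assuming Python's `configparser` as the reader (other INI dialects differ): you get one level of sections, and every value is a string until you convert it yourself.

```python
import configparser

INI_TEXT = """
[server]
; comments are allowed
host = 127.0.0.1
port = 8080
"""

parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)

port_raw = parser["server"]["port"]     # always a string
port = parser.getint("server", "port")  # explicit conversion

print(repr(port_raw), port)
```

There is no native list or nested structure, so anything beyond flat key/value pairs ends up as an ad hoc convention.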


A general rule of thumb: Never use yet another non-markup language designed by people who claimed to be designing yet another markup language from the very outset, then after somebody awkwardly pointed out that what they'd designed wasn't actually a markup language, they invent a backronym to contradict that embarrassing historical fact.

It just makes me wonder what the hell they thought they were doing all that time... It's like designing something called YACC, and ending up with an interpreter interpreter!

https://en.wikipedia.org/wiki/YAML

>Originally YAML was said to mean Yet Another Markup Language, referencing its purpose as a markup language with the yet another construct, but it was then repurposed as YAML Ain't Markup Language, a recursive acronym, to distinguish its purpose as data-oriented, rather than document markup.


In Clojure we use EDN (Extensible Data Notation) for config files and as a data transfer format. https://github.com/edn-format/edn

It is really a pleasure to use compared to JSON and XML. While it may not be as compact as ProtoBuffers, Thrift, or Avro, it is human readable and also valid Clojure code. Libraries are readily available to convert it to JSON.


JSON is great for data exchange but config files should be human readable and amply commented. Even rolling your own simple format is probably better than using JSON.


YAML or TSV depending on whether your configuration looks like a rectangular table.

If you want extreme flexibility using C++ as the main language, take a look at my project: https://github.com/jzwinck/pccl

It lets you configure your C++ apps using Python. Config items can even be Python functions.


After trying json, yaml, json5, java properties, ini and toml, I finally chose hjson* as the configuration file format for the software I'm building. It's the easiest format to read and write IMHO, a bit like nginx config files.

* http://hjson.org


I would recommend TOML, but I am a bit biased as the author of the toml python package.


XML



