Funnily enough, as I've been experimenting with Chef and trying to stick to JSON config files where allowed, I was again struck that (a) it's not a good choice for config files (b) it's an OK choice though (c) lots of people are using it anyway (d) nearly everyone that does so (including Chef) allows comments, so in reality are not actually using JSON at all.
Point (d) is the important one. I really think we need a standard for json-with-comments. JSONC or whatever, but it should have a different standard filename and it should have an RFC dictating what is and isn't allowed. Personally I would allow only // comments because there are too many subtle issues with C-style comments, but it may be too late to agree on that.
Half the point of JSON is that if application A stores its data as JSON then application B can parse that without any nasty surprises. Except, there are now probably thousands of noncompliant implementations in the wild that only exist because the standard doesn't allow comments. Each one of those implementations adds subtle differences (beyond the comments themselves), depending largely on how it removes the comments before passing the result to a standards-compliant JSON parser (assuming it does that at all, which, being Crockford's recommended approach, is as close to a standard as currently exists).
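To make that concrete, here's a deliberately naive comment-stripping pre-pass (a sketch; `naive_strip_comments` is hypothetical, not any real library's API) that shows exactly the kind of subtle divergence these implementations introduce:

```python
import json
import re

# Naive pre-pass: delete everything from "//" to end of line, then hand
# the result to a standards-compliant JSON parser.
def naive_strip_comments(text):
    return re.sub(r"//[^\n]*", "", text)

# Works for the obvious case:
print(json.loads(naive_strip_comments('{"port": 8080}  // listen port')))

# But a "//" inside a string value gets mangled too, so two different
# JSON-with-comments implementations can disagree on the same document:
mangled = naive_strip_comments('{"url": "http://example.com"}')
print(mangled)  # the string value has been truncated mid-way
```

A stripper that tracks whether it's inside a string fixes this case but introduces its own edge cases (escaped quotes, comments inside multi-line values), which is the point: each implementation draws those lines slightly differently.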
I really think it's not an OK choice. A config file format that doesn't allow comments provides some of the worst possible UX.
One of the nice things about config files is that normally they are self-documenting, explaining the meaning of the various directives and providing possible values.
Without comments, you have to constantly switch between the documentation and the config file.
Also, the ban on trailing commas is another really bad property for a config file language: it pollutes diffs, makes moving lines around needlessly fiddly, and is one more landmine waiting to happen for the sysadmin editing a file.
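A quick illustration of that landmine: reorder or delete the last item of a list and leave a comma behind, and a compliant parser rejects the whole file.

```python
import json

# Trailing comma left over after editing the list by hand:
try:
    json.loads('{"hosts": ["alpha", "beta",]}')
except json.JSONDecodeError as exc:
    print("config rejected:", exc)
```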
> I really think it's not an OK choice. A config file format that doesn't allow comments provides some of the worst possible UX.
Are we sure about that? In my experience, when messing with configuration files I have to visit the documentation anyway, and the documentation could explain the key-value pairs in the JSON file. Furthermore, there are configuration files that consist mostly of comments explaining all kinds of edge cases, with the actual configuration data commented out, which can be a mess of its own.
If good code should be self-explanatory then why not configurations?
Yep, I went through the process of replacing all our XML objects with JSON ones about six years ago now. Smaller files, and much easier to read, manipulate, store and transfer.
And while I personally quite liked XSLT, javascript is a much more flexible and reliable option.
A list of items in an array. Some items in that list are of particular note.
(Say you have 15 items in an array. For whatever reason, such as you not being in control of the expected input structure of whatever you're giving this JSON to, you have them grouped with comments at the top of each block.)
Why would you want to do that in what is basically a human readable binary file and not in a readme?
You seem to have stopped at the "for whom, and for what purpose?" part.
I can just about see the case for something like Node.js package.json files (that's the least of Node.js's problems, but that's a whole other conversation). But a readme is a much better option than having to trawl through code comments.
The discussion above concerns why JSON isn't a good choice for configuration files. That's exactly because configuration files are not human readable binary files.
Now add a comment for foo and another one for bar. Also try it when your client throws an exception whenever it encounters invalid keys in an object. Or when it blindly persists that "data", or tries to perform some logic with it. And so on.
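A sketch of that failure mode, assuming a hypothetical strict client that whitelists keys: the usual fake-key workaround for comments immediately trips it.

```python
import json

# Common workaround: smuggle the comment in as a fake key.
config = json.loads('{"//": "foo must match the upstream name", "foo": 1}')

# A strict consumer that validates keys (the whitelist here stands in
# for "your client") now rejects the whole file:
allowed = {"foo", "bar"}
unknown = sorted(set(config) - allowed)
print("invalid keys in object:", unknown)
```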
YAML is equally horrible and the spec is an order of magnitude more complex. I wasted half an hour trying to spot an error in the ejabberd yaml config, only to find out something trivial was missing. At least JSON has braces even though it's not suitable for configuration files. By all means choose TOML or something else (even ini or java properties files) instead.
YAML has braces. In fact, it's a superset of JSON: any YAML parser should be able to parse JSON-encoded data, with one exception, block comments. But those are a bastardization of JSON (as mentioned in the article), so block comments shouldn't be a problem in most cases.
The specification says this, but I was never convinced it was true. Specifically, the spec around escaped unicode characters lacks any mention of surrogates being encoded in two \u sequences, but rather specifies \u as:
> Escaped 16-bit Unicode character.
Which is an unhelpfully not-even-wrong statement. In practice, trying to treat JSON as a subset of YAML results in things not round-tripping, like the following:
In [9]: yaml.load(json.dumps('\N{PILE OF POO}'))
Out[9]: '\ud83d\udca9'
(in case it isn't clear, that's not a valid repr of pile-of-poo in Python:
In [14]: _9 == '\N{PILE OF POO}'
Out[14]: False
)
I've now got surrogates in a decoded Unicode string (why this is even allowed, I don't know); this results in fun behavior like `.encode('utf-8')` raising a UnicodeEncodeError.
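A minimal sketch of that failure, and of the only repair I know of (round-tripping through an encoding with the `surrogatepass` error handler so the pair gets rejoined):

```python
# A str containing lone surrogate code points cannot be encoded as UTF-8:
s = "\ud83d\udca9"  # two surrogate code points, not one astral character
try:
    s.encode("utf-8")
except UnicodeEncodeError as exc:
    print("encode failed:", exc)

# Encoding with "surrogatepass" and decoding as UTF-16 rejoins the pair:
fixed = s.encode("utf-16", "surrogatepass").decode("utf-16")
assert fixed == "\N{PILE OF POO}"
```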
Edit: weird: HN appears to strip pile of poo from comments…
In [1]: import json
In [2]: json.dumps("\N{PILE OF POO}")
Out[2]: '"\\ud83d\\udca9"'
Quality detective work there. JSON, being JavaScript [Object Notation], is standardized to UTF-16. What you see here is "PILE OF POO" in UTF-16.
YAML specifies that the recommended encoding is UTF-8, but should be able to parse UTF-16 and UTF-32. And, if you want that, then you need to tell YAML to expect UTF-16. You can do that by including a BOM (Byte-Order Mark).
PyYAML is documented as only supporting UTF-8 and UTF-16, but not UTF-32.
I'm scratching my head over how to get that "UTF-16 encoded" UTF-8 string to a proper representation (in Python). If this were Ruby, I would use str.force_encoding(), which doesn't change the bytes at all. If a solution comes to me, I'll update this comment (or reply).
Well, yes, but you missed my point: we're not looking at an encoded string object. A `str` (Python's string type) is supposed to represent a Unicode string — "Strings are immutable sequences of Unicode code points"; the underlying encoding is supposed to be transparent, and in fact, it's possible to construct an example (invalid, IMO, like the above example) YAML that PyYAML will decode where the underlying encoding in memory would be UTF-32, but still contain those surrogates.¹
In Unicode's lexicon, this string contains surrogate code points, not surrogate code units. This is wrong, and as I stated, it is surprising that Python permits it. Unicode explicitly warns against this behavior:
> A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character.
And that's what's happening here.
> YAML specifies that the recommended encoding is UTF-8, but should be able to parse UTF-16 and UTF-32. And, if you want that, then you need to tell YAML to expect UTF-16. You can do that by including a BOM (Byte-Order Mark).
I'm passing PyYAML a string, but I can pass it encoded text as well; the output is the same. Note that the input consists entirely of ASCII characters, so input encoding really isn't at play here. I realize now it wasn't explicit in my original comment, but here's the input being given to PyYAML (JSON that we're claiming is also YAML, because the claim was that JSON is a subset of YAML):
"\ud83d\udca9"
Note that this is the raw JSON/YAML, not a Python repr of it. Those slashes are literal slashes.
> PyYAML is documented as only supporting UTF-8 and UTF-16, but not UTF-32.
PyYAML is documented to contain a bug, then. (Though admittedly that's better than the bug being undocumented.) It is not developer friendly to correctly decode only some text and emit erroneous output for the rest.
> I'm scratching my head on how to get that "UTF-16 encoded" UTF-8 string to a proper representation (in Python).
It's hard because it shouldn't be possible. Unicode forbids it. The encoded UTF-32 version of that string (if Python permitted it) would be:
0x0000d83d 0x0000dca9
unless you explicitly handled the surrogate code points separately by decoding them first, and then re-encoding them into the target encoding, but that's crazy.
(this is another example of why the above is not valid or expected output from PyYAML.)
¹this:
In [19]: yaml.load('"\\ud83d\\udca9, \\U0001f4a9"')
Out[19]: '\ud83d\udca9, '
is, in CPython, a string that is stored in memory as UTF-32, but contains two surrogate code points.
(Note that I'm using a late version of Python 3. Early Python 3 and Python 2 handle Unicode poorly, but this example should work equally strangely there too. While PyYAML's output here is clearly buggy, the spec is equally vague and underspecified about what should happen.)
> In Unicode's lexicon, this string contains surrogate code points, not surrogate code units. This is wrong, and as I stated, it is surprising that Python permits it. Unicode explicitly warns against this behavior:
> Note that this is the raw JSON/YAML, not a Python repr of it. Those slashes are literal slashes.
I just came to this realization over dinner. Apparently, this odd behavior is defined by JSON. So it's not Python's json module at fault, because it is actually implemented correctly.
> Specifically, the spec around escaped unicode characters lacks any mention of surrogates being encoded in two \u sequences,
http://yaml.org/spec/1.2/spec.html#id2771184 makes the statement, "All characters mentioned in this specification are Unicode code points. Each such code point is written as one or more bytes depending on the character encoding used. Note that in UTF-16, characters above #xFFFF are written as four bytes, using a surrogate pair.". And there are numerous mentions of JSON compatibility. So, I suppose, this is an issue in PyYAML (and in Ruby's YAML, which I checked).
I had never seen this issue before. Ruby's JSON module doesn't break up UTF-8 characters into surrogate pairs, nor does any online JSON parser that I could find via Google.
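For comparison, Python's json module does emit the surrogate-pair escape by default, but it also pairs the surrogates back up on the way in, so the round trip through JSON itself is fine; it's the detour through PyYAML that loses:

```python
import json

s = "\N{PILE OF POO}"
dumped = json.dumps(s)            # default ensure_ascii=True
print(dumped)                     # the \ud83d\udca9 escape, literal backslashes
assert json.loads(dumped) == s    # loads() reassembles the pair

# Emitting the raw character instead avoids the surrogate pair entirely:
print(json.dumps(s, ensure_ascii=False))
```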
I don't have a specific recommendation, but when I see a project use a JSON file for configuration, I wonder: "hasn't the author ever needed to include a comment in the configuration?".
I can't speak for others. When I write something that uses JSON for configuration files those files are pretty much always written from configuration management. The comments are then in the manifest/recipe/playbook that generates the config file, which is where people are actually working with the values.
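A sketch of that workflow (the keys and values here are hypothetical): the comments live in the generator as ordinary source comments, and the emitted JSON is a machine artifact nobody edits by hand.

```python
import json

# Comments live here, in the recipe/manifest that people actually edit:
config = {
    "listen_port": 8080,  # must match the load balancer's backend port
    "workers": 4,         # one per core on the standard instance size
}

# The rendered JSON file is generated output, so its lack of comments
# doesn't hurt anyone:
rendered = json.dumps(config, indent=2, sort_keys=True)
print(rendered)
```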
What purpose? Writing config files? Lua is a programming language (a nice one too). It's code. Code should not be used for config files, nor for data serialization because once you eval it you are executing it.
Yes, Lua was originally written as a language for rich configuration files and has grown out of that into a more fully featured language.
It's also one of the easier languages to sandbox, since you can evaluate user provided code in a custom environment that only contains the functions you deem safe. You can even use the standard debug hooks to set an upper limit on the number of instructions a script can execute to prevent someone from creating an infinite loop in a config file and locking whatever thread is reading the config.
It's not very appropriate as a data serialization format, or as machine written config, but the parent post specifically asked about human written configuration files.
I have actually used Lua before with good success. It was on a smaller scale, so I can't speak to edge cases, but I would certainly recommend considering it at the least.
A general rule of thumb: Never use yet another non-markup language designed by people who claimed to be designing yet another markup language from the very outset, then after somebody awkwardly pointed out that what they'd designed wasn't actually a markup language, they invent a backronym to contradict that embarrassing historical fact.
It just makes me wonder what the hell they thought they were doing all that time... It's like designing something called YACC, and ending up with an interpreter interpreter!
>Originally YAML was said to mean Yet Another Markup Language, referencing its purpose as a markup language with the yet another construct, but it was then repurposed as YAML Ain't Markup Language, a recursive acronym, to distinguish its purpose as data-oriented, rather than document markup.
It is really a pleasure to use compared to JSON and XML. While it may not be as compact as Protocol Buffers, Thrift, or Avro, it is human readable and also valid Clojure code. Libraries are readily available to convert it to JSON.
JSON is great for data exchange but config files should be human readable and amply commented. Even rolling your own simple format is probably better than using JSON.
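As a sketch of the roll-your-own option: a key = value format with # comments takes only a few lines to parse (this parser and its keys are illustrative, not from any real project).

```python
# Minimal hand-rolled config parser: "key = value" lines, "#" comments.
def parse_simple_config(text):
    config = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config

example = """
# Network settings
host = 0.0.0.0   # bind on all interfaces
port = 8080
"""
print(parse_simple_config(example))
```

Note the values all come back as strings; type coercion and error reporting are exactly where hand-rolled formats start accumulating complexity, which is the argument for TOML or INI instead.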
After trying json, yaml, json5, java properties, ini and toml, I finally chose hjson as the configuration file format for the software I'm building. It's the easiest format to read and write IMHO, a bit like nginx config files.
What does HN suggest for configuration files (to be written by a human essentially)?
I am looking at YAML and TOML. My experience with JSON based config files was horrible.