Show HN: Very low footprint JSON parser in portable ANSI C (github.com/udp)
111 points by udp on Feb 12, 2012 | 54 comments



There definitely is a lack of good JSON parsers for C. We wrote our own in QEMU. The relevant code is:

http://git.qemu.org/?p=qemu.git;a=blob;f=json-lexer.c;h=3cd3...

http://git.qemu.org/?p=qemu.git;a=blob;f=json-parser.c;h=849...

Among other things, this supports streaming, is fairly fast, and has gotten a fair bit of scrutiny against malicious input.

The lexer is a hand-written state machine, which seems like something you should never do, but it turned out to be pretty reasonable.


What's wrong with YAJL?


Nothing, YAJL got a lot of things very right. We use it in our C daemons for a bunch of things. Not having to unpack the whole JSON into memory is pretty handy.


Nothing is wrong with YAJL; it is awesome! I have used the Lua module for YAJL for some time with no hiccups.


  case JSON_INTEGER:
  obj = QOBJECT(qint_from_int(strtoll(token_get_value(token), NULL, 10)));
Why is it so hard to parse an integer from a string correctly?
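
For reference, a minimal sketch of what a stricter conversion might look like. The snippet above passes NULL as the end pointer and never looks at errno, so an out-of-range value is silently clamped. `parse_int64` is a hypothetical helper name, not something from QEMU, and the sketch assumes the lexer has already rejected leading whitespace and a leading '+', both of which strtoll would otherwise accept.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper (not QEMU code): returns 0 on success and
   stores the value in *out; returns -1 on any failure. */
static int parse_int64(const char *s, long long *out)
{
    char *end;
    long long v;

    errno = 0;                 /* strtoll only sets errno, never clears it */
    v = strtoll(s, &end, 10);

    if (end == s)              /* no digits were consumed */
        return -1;
    if (*end != '\0')          /* trailing characters after the number */
        return -1;
    if (errno == ERANGE)       /* value was clamped to LLONG_MIN/LLONG_MAX */
        return -1;

    *out = v;
    return 0;
}
```

Note that strtoll is C99, so strict C89 code would need strtol and a narrower range.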


Another alternative, https://github.com/esnme/ultrajson

"Ultra fast JSON decoder and encoder written in C with Python bindings"

From the people that built the Battlefield 3 web portal.

Medium complex object:

ujson encode : 18757.01101 calls/sec

yajl encode : 6315.14030 calls/sec

simplejson encode : 5542.03928 calls/sec

cjson encode : 4651.59072 calls/sec

---------

ujson decode : 10759.69649 calls/sec

simplejson decode : 8148.35221 calls/sec

cjson decode : 7931.04387 calls/sec

yajl decode : 5887.38201 calls/sec


This Show HN makes me think there needs to be a site for more formalized code reviews of open software. Ideally with some great game mechanics to make sure engagement is high and things are getting reviewed well.


I am actually working on that as we speak. Here is my (very) alpha prototype. I plan on expanding the supported languages as I roll out each iteration.

Link: http://codetique.com


As a younger programmer with great ambitions, some sort of code review site would be awesome!



What happens when the input length is longer than 2^31? You used an "int" for the length (also, why ever use a signed value for length?) --- even on LP64, that counter wraps at ~32 bits.

(Same question applies to how you handle the max_memory computation).
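
A minimal sketch of the kind of check I mean, assuming lengths are kept in size_t and a running total is compared against a caller-supplied limit. `add_checked` and its parameters are hypothetical names for illustration, not part of the library.

```c
#include <stddef.h>

/* Hypothetical helper: adds `extra` bytes to a running total.
   Returns 0 on success, or -1 (leaving *total untouched) if the
   sum would wrap around or exceed `limit`. */
static int add_checked(size_t *total, size_t extra, size_t limit)
{
    /* (size_t)-1 is the largest size_t value; this works in C89,
       which has no SIZE_MAX. */
    if (extra > (size_t)-1 - *total)
        return -1;             /* the addition would wrap */
    if (*total + extra > limit)
        return -1;             /* over the configured memory limit */

    *total += extra;
    return 0;
}
```

Doing the wrap test before the addition matters: once an unsigned sum has wrapped, it looks like a small, valid number.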


Added some protection against that, thanks.


You aren't checking the return value of json_alloc() in new_value().


Well spotted! Fixed, thanks.


Is JSON guaranteed to be ASCII?

To clarify: Any "lookup table" that maps hex values to assumed character values is a portability red flag. When using them, it's polite to add comments to explicitly call out the code page dependency and argue (from a spec or RFC, say) why that assumption is okay.


http://www.ietf.org/rfc/rfc4627 specifies that the encoding must be Unicode: "JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."


Neither the execution nor source character set (of C) is guaranteed to be ASCII though. This makes the general parsing as well as lines like "if (c >= 'A' && c <= 'F')" non-portable.


Non-portable to different character sets, not platforms. One could argue that the argument to json_parse is a UTF-8 string.


I'm assuming UTF-8, so the characters I'm looking for should match up fine (\u escape sequences are also converted to UTF-8 for output).


\u escapes are UTF-16 code units, so two consecutive escapes (a surrogate pair) can encode a character outside the Basic Multilingual Plane. You don't seem to handle this; not sure how many parsers do.

Also not sure you handle the case where the JSON terminates, invalidly, in the middle of a \u sequence.

Just from a quick glance though, may be wrong.
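
To illustrate the surrogate pair point, a rough sketch of how two \u escapes could be combined into one code point and then written out as UTF-8. Both function names are made up for this example and are not taken from the parser under review.

```c
/* Hypothetical helper: combines a UTF-16 surrogate pair into a
   code point. Returns -1 if either half is out of range. */
static long combine_surrogates(unsigned hi, unsigned lo)
{
    if (hi < 0xD800 || hi > 0xDBFF)
        return -1;             /* not a high surrogate */
    if (lo < 0xDC00 || lo > 0xDFFF)
        return -1;             /* not a low surrogate */
    return 0x10000L + (((long)(hi - 0xD800) << 10) | (long)(lo - 0xDC00));
}

/* Hypothetical helper: encodes a code point as UTF-8 into out[],
   which must have room for 4 bytes. Returns the number of bytes
   written, or 0 for a lone surrogate or a value above U+10FFFF. */
static int utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;          /* unpaired surrogate */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, "\uD83D\uDE00" would combine to U+1F600 and encode as the four bytes F0 9F 98 80.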


ts=3? Is this some sort of subtle troll to irritate all factions of tab stop religions?


Interesting. I may try my hand at learning some Erlang NIF creation using this. Then I can benchmark it against Bob Ippolito's mochijson2 module.

Could be a fun little exercise.


I like using Jansson: http://www.digip.org/jansson/


+1 for Jansson here too. Lightweight and works really well.


I really liked using it too.


  const json_char *cur_line_begin, *i;
  ...
  top->u.dbl = strtod (i, (json_char **) &i);
  top->u.integer = strtol (i, (json_char **) &i, 10);
Ick
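
One way to avoid casting away const, as a small sketch: give strtod its own non-const end pointer and copy it back afterwards. `parse_number` is a hypothetical name, and this is not how the library currently does it.

```c
#include <stdlib.h>

/* Hypothetical helper: parses a double at *cursor and advances
   *cursor past the characters strtod consumed. */
static double parse_number(const char **cursor)
{
    char *end;                         /* strtod wants a plain char ** */
    double v = strtod(*cursor, &end);

    *cursor = end;                     /* char * converts to const char * */
    return v;
}
```

Separately, strtod honours the current locale's decimal point and, in C99, accepts forms such as hex floats and "inf" that JSON does not allow, so it probably wants a syntax check in front of it.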


I've used cJSON (http://sourceforge.net/projects/cjson/) in the past, which worked very well for what I needed (simple 1 file JSON parser for config files). Maybe I'll give this a shot the next time I need to do some simple JSON parsing.

You should get the project listed on http://www.json.org/


If that's the same cJSON I was using a few months ago, I found it a lot more memory-hungry than it needed to be. I was doing some network code with lwIP on an embedded system, so the all-static nature of js0n (with some helper functions) was a better fit for me.


Seems similar to the one up on CCAN [1], which is also BSD-MIT licensed.

EDIT: forgot to mention that it includes a bunch of great helper functions, too.



Here's another minimalist alternative: https://bitbucket.org/zserge/jsmn/wiki/Home

I've used it and think it's pretty neat. One of these days I'll get around to releasing the helper functions we've written to make it easier to use too.


I use it as well, and I'm very happy with it! It's been running in production for quite a while now without any hiccups.


Any reason you roll your own numeric() instead of using isdigit()? It's in C89.


Just to be sure it's inlined, really. Although I assume isdigit would be, being a compiler built-in.


Seems like a premature optimization. Don't assume, check the assembly if you care :)
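
For what it's worth, both versions are short. A sketch with hypothetical function names follows; the one detail worth knowing is that isdigit takes an int that must be representable as unsigned char (or be EOF), so a plain char needs a cast first.

```c
#include <ctype.h>

/* Range check: the C standard guarantees '0'..'9' are contiguous. */
static int is_json_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* Library version: the cast avoids undefined behaviour when char
   is signed and c holds a negative value. Normalised to 0 or 1. */
static int is_digit_ctype(char c)
{
    return isdigit((unsigned char)c) != 0;
}
```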


why are the flag values not enums (and why is 4 missing?)? is using a lookup table for decoding hex really faster than the (minimal) logic (what if it causes cache misses)? do you really think that a state machine with bit flags is the best way to express the logic here? is string_add meant to increment string_length on subsequent passes? what is "json_value * cur_value" supposed to do at the top of json_value_free (maybe i am missing some c trick here?)?

[not dissing you, just bored on a sunday afternoon...]


> why are the flag values not enums (and why is 4 missing?)?

What would the advantage of using an enum be? (and I guess I used 4 and then removed it later.)

> is using a lookup table for decoding hex really faster than the (minimal) logic (what if it causes cache misses)?

No idea, that's just the way I did it. Feel free to try something else and profile if you're really that concerned.

> do you really think that a state machine with bit flags is the best way to express the logic here? is string_add meant to increment string_length on subsequent passes?

There are only two passes, and it increments the length on both (the first is to measure the string, the second is to know where to write in it).

> what is "[..] cur_value" supposed to do at the top of json_value_free (maybe i am missing some c trick here?)?

You're not supposed to mix code and value declarations in ANSI C, so I put it at the top of the function. It's just used to temporarily store the value while reading the parent.


I've converted the lookup table to a few lines of logic. I think it's more readable, and I would definitely bet on it being faster, though since I haven't profiled I don't know how much difference it would make.

https://github.com/PeterScott/json-parser/commit/db9c326f747...
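
For anyone who doesn't want to click through, the general shape of a table-free hex decoder is roughly the following. This is an illustrative sketch, not the contents of that commit; `hex_value` is a made-up name, and it assumes 'a'..'f' and 'A'..'F' are contiguous, which holds for ASCII and therefore for UTF-8 input.

```c
/* Hypothetical helper: returns the value 0-15 of a hex digit,
   or -1 if c is not a hex digit. Taking unsigned char avoids
   range surprises when plain char is signed. */
static int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}
```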


Yeah, I'll go with that - cheers.


ha. on the last one i was confused by your spaces - thought it was a multiplication... (sorry)

[edit] on the bitfield / enum question, i've been looking around for a consistent, standard way of doing things and there doesn't seem to be any one best practice (although various people note that bit fields are normally unsigned ints, while enums are signed).


> You're not supposed to mix code and value declarations in ANSI C

You can in C99.


In ordinary usage, "ANSI C" is a synonym for strict C89. C99 is not well-supported by many compilers, so code intended to be widely portable is still frequently written to strict C89.


sizeof(enum) depends on your compilation flags, so enums are bad for library ABIs.


What are you doing with json.h:121 in _json_value::&operator[](const char* index) when your key doesn't exist?

Still, very nice. Comparable to jsonxx, which I've been using up until now.


Hmm, what should I do? (since it returns a reference). I could make it a pointer instead, but then you wouldn't be able to chain it.

Maybe some kind of const json_null value to return when the key isn't found.

edit: Done that.


longjmp to an earlier stage where you can "retract" the error or somehow wrap it in a chainable form (e.g. add a union to your result to signal whether it's a value or error, or whatever)

That's what exceptions are supposed to do. C doesn't have exceptions, so you use setjmp/longjmp.
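
A toy sketch of the pattern, with hypothetical names and a deliberately trivial "parser" that sums decimal digits: the inner function bails out with longjmp, and the outer function turns that into an ordinary error return.

```c
#include <setjmp.h>

/* Single global jump target: fine for a sketch, not reentrant. */
static jmp_buf parse_error;

/* Inner routine: jumps out on bad input instead of returning
   an error code through every caller. */
static int parse_digit(char c)
{
    if (c < '0' || c > '9')
        longjmp(parse_error, 1);
    return c - '0';
}

/* Outer routine: returns 0 and stores the digit sum in *out,
   or returns -1 if any character was not a digit. */
static int sum_digits(const char *s, int *out)
{
    int total = 0;

    if (setjmp(parse_error) != 0)
        return -1;             /* arrived here via longjmp */

    for (; *s != '\0'; s++)
        total += parse_digit(*s);

    *out = total;
    return 0;
}
```

The error path deliberately does not read `total`: non-volatile locals modified after setjmp have indeterminate values once longjmp returns. Anything allocated between the two calls also has to be cleaned up by hand.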


On a related note, if you want to get something done on a Sunday afternoon, writing a simple recursive descent JSON parser from scratch is both doable and fun.


This should go without saying, but never use such a thing on user-supplied data.


The inline keyword is optional and redundant for member functions defined within a class or struct declaration.


Where c is `json_c', "return c > 127 ? 0xFF : hex_table [c];" always returns false due to the range of types.


Nice catch. It's fixed in the latest version.


Why are you doing "#define numeric(b) ((b) >= '0' && (b) <= '9')" instead of say isdigit(3) ?


Where are the tests?


Oh, thanks. I was starting to feel weird because nobody was saying anything about the lack of tests.

HN may or may not work as a code review platform, but I don't think I would use third-party software that doesn't provide tests.



