Show HN: Very low footprint JSON parser in portable ANSI C

aliguori · on Feb 12, 2012

There definitely is a lack of good JSON parsers for C. We wrote our own in QEMU. The relevant code is:

http://git.qemu.org/?p=qemu.git;a=blob;f=json-lexer.c;h=3cd3...

http://git.qemu.org/?p=qemu.git;a=blob;f=json-parser.c;h=849...

Among other things, this supports streaming, is fairly fast, and has gotten a fair bit of scrutiny against malicious input.

The lexer is a hand-written state machine, which seems like something you should never do, but it turned out to be pretty reasonable.

haberman · on Feb 12, 2012

What's wrong with YAJL?

lflux · on Feb 13, 2012

Nothing, YAJL got a lot of things very right. We use it in our C daemons for a bunch of things. Not having to unpack the whole JSON into memory is pretty handy.

mikehuffman · on Feb 13, 2012

Nothing is wrong with yajl; it is awesome! I have used the Lua module for yajl for some time with no hiccups.

adobriyan · on Feb 13, 2012

  case JSON_INTEGER:
  obj = QOBJECT(qint_from_int(strtoll(token_get_value(token), NULL, 10)));

Why is it so hard to parse integer from string correctly?
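For readers wondering what the complaint is about: the quoted call passes NULL as the end pointer and never looks at errno, so trailing garbage and out-of-range values go unnoticed. Below is a minimal sketch of a stricter wrapper. The name parse_int64_strict is hypothetical and this is not QEMU code; it only illustrates the checks strtoll makes possible.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper, not QEMU code: parse a base-10 integer strictly.
 * Returns 0 on success and stores the value in *out; returns -1 if the
 * text has no digits, has trailing characters, or is out of range. */
static int parse_int64_strict(const char *text, long long *out)
{
    char *end;
    long long value;

    errno = 0;                      /* strtoll only sets errno on failure */
    value = strtoll(text, &end, 10);

    if (end == text)                /* no digits were consumed */
        return -1;
    if (*end != '\0')               /* trailing garbage such as "12abc" */
        return -1;
    if (errno == ERANGE)            /* result was clamped to LLONG_MIN/MAX */
        return -1;

    *out = value;
    return 0;
}
```

Note that strtoll still skips leading whitespace and accepts a leading '+', so a lexer that has already validated the token is assumed here.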

mape · on Feb 12, 2012

Another alternative, https://github.com/esnme/ultrajson

"Ultra fast JSON decoder and encoder written in C with Python bindings"

From the people that built the Battlefield 3 web portal.

Medium complex object:

ujson encode : 18757.01101 calls/sec

yajl encode : 6315.14030 calls/sec

simplejson encode : 5542.03928 calls/sec

cjson encode : 4651.59072 calls/sec

---------

ujson decode : 10759.69649 calls/sec

simplejson decode : 8148.35221 calls/sec

cjson decode : 7931.04387 calls/sec

yajl decode : 5887.38201 calls/sec

spullara · on Feb 13, 2012

This Show HN makes me think there needs to be a site for more formalized code reviews of open software. Ideally with some great game mechanics to make sure engagement is high and things are getting reviewed well.

jsaunders · on Feb 13, 2012

I am actually working on that as we speak. Here is my (very) alpha prototype. I plan on expanding the supported languages as I roll out each iteration.

Link: http://codetique.com

alexchamberlain · on Feb 13, 2012

As a younger programmer with great ambitions, some sort of code review site would be awesome!

zoul · on Feb 13, 2012

http://codereview.stackexchange.com/

tptacek · on Feb 12, 2012

What happens when the input length is longer than 2^31? You used an "int" for the length (also, why ever use a signed value for length?) --- even on LP64, that counter wraps at ~32 bits.

(Same question applies to how you handle the max_memory computation).
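One common way to address both points is to keep lengths in size_t and to phrase the limit check so that the arithmetic itself cannot wrap. A minimal sketch follows; reserve_bytes and its parameters are hypothetical names for illustration and are not the library's actual max_memory code.

```c
#include <stddef.h>

/* Hypothetical accounting helper: add `amount` bytes to *used unless that
 * would exceed `limit`. Returns 1 on success and 0 on refusal. The check
 * is written as a subtraction so that *used + amount is never computed
 * before we know it fits. */
static int reserve_bytes(size_t *used, size_t amount, size_t limit)
{
    if (*used > limit)
        return 0;
    if (amount > limit - *used)     /* cannot wrap: *used <= limit here */
        return 0;
    *used += amount;
    return 1;
}
```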

udp · on Feb 12, 2012

Added some protection against that, thanks.

halayli · on Feb 12, 2012

You aren't checking the return value of json_alloc() in new_value()

udp · on Feb 12, 2012

Well spotted! Fixed, thanks.

feralchimp · on Feb 12, 2012

Is JSON guaranteed to be ASCII?

To clarify: Any "lookup table" that maps hex values to assumed character values is a portability red flag. When using them, it's polite to add comments to explicitly call out the code page dependency and argue (from a spec or RFC, say) why that assumption is okay.

michael_miller · on Feb 12, 2012

http://www.ietf.org/rfc/rfc4627 specifies that the encoding must be Unicode: "JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."

pantaloons · on Feb 13, 2012

Neither the execution nor source character set (of C) is guaranteed to be ASCII though. This makes the general parsing as well as lines like "if (c >= 'A' && c <= 'F')" non-portable.

udp · on Feb 13, 2012

Non-portable to different character sets, not platforms. One could argue that the argument to json_parse is a UTF-8 string.

udp · on Feb 12, 2012

I'm assuming UTF-8, so the characters I'm looking for should match up fine (\u escape sequences are also converted to UTF-8 for output).

justincormack · on Feb 12, 2012

\u escapes are UTF-16 code units, so you should be able to put two of them together (a surrogate pair) to get a character outside the Basic Multilingual Plane. You don't seem to handle this; not sure how many parsers do.

Also not sure you handle the case where the JSON terminates invalidly in the middle of a \u sequence.

Just from a quick glance though, may be wrong.
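To make the surrogate-pair point concrete, here is a minimal sketch of the two steps involved: combining the two \u values into one code point, then encoding that code point as UTF-8. The function names are hypothetical and this is not taken from the parser under discussion.

```c
#include <stddef.h>

/* Combine a UTF-16 surrogate pair into one code point.
 * Returns 0 if hi/lo are not a valid high/low surrogate pair. */
static unsigned long combine_surrogates(unsigned hi, unsigned lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0;
    return 0x10000UL + (((unsigned long)(hi - 0xD800) << 10) | (lo - 0xDC00));
}

/* Encode a code point as UTF-8 into out (room for 4 bytes required).
 * Returns the number of bytes written, or 0 for an invalid code point. */
static size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;               /* lone surrogate: not a character */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, the escape pair \uD83D\uDE00 combines to U+1F600, which encodes as the four bytes F0 9F 98 80. A parser would also need to reject input that ends before all four hex digits of an escape have been read.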

lflux · on Feb 13, 2012

ts=3? Is this some sort of subtle troll to irritate all factions of tab stop religions?

chops · on Feb 12, 2012

Interesting. I may try my hand at learning some Erlang NIF creation using this. Then I can benchmark it against Bob Ippolito's mochijson2 module.

Could be a fun little exercise.

roschdal · on Feb 12, 2012

I like using Jansson: http://www.digip.org/jansson/

scumola · on Feb 14, 2012

+1 for Jansson here too. Lightweight and works really well.

rmk · on Feb 13, 2012

I really liked using it too.

schlecht · on Feb 12, 2012

  const json_char *cur_line_begin, *i;
  ...
  top->u.dbl = strtod (i, (json_char **) &i);
  top->u.integer = strtol (i, (json_char **) &i, 10);

Ick

tcas · on Feb 12, 2012

I've used cJSON (http://sourceforge.net/projects/cjson/) in the past, which worked very well for what I needed (simple 1 file JSON parser for config files). Maybe I'll give this a shot the next time I need to do some simple JSON parsing.

You should get the project listed on http://www.json.org/

mikepurvis · on Feb 12, 2012

If that's the same cJSON I was using a few months ago, I found it a lot more memory-hungry than it needed to be. I was doing some network code with lwIP on an embedded system, so the all-static nature of js0n (with some helper functions) was a better fit for me.

peterldowns · on Feb 12, 2012

Seems similar to the one up on CCAN [1], which is also BSD-MIT licensed.

EDIT: forgot to mention that it includes a bunch of great helper functions, too.

rmgraham · on Feb 12, 2012

1. http://ccodearchive.net/info/json.html

m_eiman · on Feb 12, 2012

Here's another minimalist alternative: https://bitbucket.org/zserge/jsmn/wiki/Home

I've used it and think it's pretty neat. One of these days I'll get around to releasing the helper functions we've written to make it easier to use too.

fmardini · on Feb 13, 2012

I use it as well, and I'm very happy with it! It's been running in production for quite a while now without any hiccups.

avar · on Feb 12, 2012

Any reason you roll your own numeric() instead of using isdigit()? It's in C89.

udp · on Feb 12, 2012

Just to be sure it's inlined, really. Although I assume isdigit would be, being a compiler built-in.

shtylman · on Feb 12, 2012

Seems like a premature optimization. Don't assume, check the assembly if you care :)
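One detail worth knowing if you do switch to isdigit(): the <ctype.h> functions require an argument that is representable as unsigned char (or EOF), so a plain char holding a byte from a UTF-8 sequence should be cast first. A minimal sketch, where is_json_digit is a hypothetical wrapper name and not part of the library:

```c
#include <ctype.h>

/* Hypothetical wrapper: safe to call with any char value, including
 * bytes >= 0x80 that may be negative when plain char is signed. */
static int is_json_digit(char c)
{
    return isdigit((unsigned char)c) != 0;
}
```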

andrewcooke · on Feb 12, 2012

why are the flag values not enums (and why is 4 missing?)? is using a lookup table for decoding hex really faster than the (minimal) logic (what if it causes cache misses)? do you really think that a state machine with bit flags is the best way to express the logic here? is string_add meant to increment string_length on subsequent passes? what is "json_value * cur_value" supposed to do at the top of json_value_free (maybe i am missing some c trick here?)?

[not dissing you, just bored on a sunday afternoon...]

udp · on Feb 12, 2012

> why are the flag values not enums (and why is 4 missing?)?

What would the advantage of using an enum be? (and I guess I used 4 and then removed it later.)

> is using a lookup table for decoding hex really faster than the (minimal) logic (what if it causes cache misses)?

No idea, that's just the way I did it. Feel free to try something else and profile if you're really that concerned.

> do you really think that a state machine with bit flags is the best way to express the logic here? is string_add meant to increment string_length on subsequent passes?

There's only two passes, and it increments the length on both (the first is to measure the string, the second is to know where to write in it).

> what is "[..] cur_value" supposed to do at the top of json_value_free (maybe i am missing some c trick here?)?

You're not supposed to mix code and value declarations in ANSI C, so I put it at the top of the function. It's just used to temporarily store the value while reading the parent.

pjscott · on Feb 12, 2012

I've converted the lookup table to a few lines of logic. I think it's more readable, and I would definitely bet on it being faster, though since I haven't profiled I don't know how much difference it would make.

https://github.com/PeterScott/json-parser/commit/db9c326f747...
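For anyone who does not want to follow the link, the general shape of a table-free hex decoder looks something like the sketch below. This is an illustration written for this thread, not the code from the linked commit, and hex_value is a hypothetical name. It assumes an ASCII-compatible execution character set, where 'a'..'f' and 'A'..'F' are contiguous; the C standard only guarantees that for '0'..'9'.

```c
/* Hypothetical helper: return the value of a hex digit (0-15),
 * or -1 if c is not a hex digit. Taking unsigned char avoids any
 * surprises with negative values for bytes >= 0x80. */
static int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}
```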

udp · on Feb 12, 2012

Yeah, I'll go with that - cheers.

andrewcooke · on Feb 12, 2012

ha. on the last one i was confused by your spaces - thought it was a multiplication... (sorry)

[edit] on the bitfield / enum question, i've been looking around for a consistent, standard way of doing things and there doesn't seem to be any one best practice (although various people note that bit fields are normally unsigned ints, while enums are signed).

parenthesis · on Feb 12, 2012

> You're not supposed to mix code and value declarations in ANSI C

You can in C99.

nknight · on Feb 12, 2012

In ordinary usage, "ANSI C" is a synonym for strict C89. C99 is not well-supported by many compilers, so code intended to be widely portable is still frequently written to strict C89.

cpeterso · on Feb 12, 2012

sizeof(enum) depends on your compilation flags, so enums are bad for library ABIs.

mappu · on Feb 12, 2012

What are you doing with json.h:121 in _json_value::&operator[](const char* index) when your key doesn't exist?

Still, very nice. Comparable to jsonxx which i've been using up until now.

udp · on Feb 12, 2012

Hmm, what should I do? (since it returns a reference). I could make it a pointer instead, but then you wouldn't be able to chain it.

Maybe some kind of const json_null value to return when the key isn't found.

edit: Done that.
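The same sentinel idea can be sketched in plain C: a lookup that returns a pointer to a shared, immutable null value instead of NULL, so calls can still be chained. The node type and node_get below are hypothetical and much simpler than the library's json_value; they only illustrate the pattern.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified value type; not the library's json_value. */
typedef enum { NODE_NULL, NODE_INT, NODE_OBJECT } node_type;

typedef struct node {
    node_type type;
    long integer;                   /* valid when type == NODE_INT */
    size_t length;                  /* member count when type == NODE_OBJECT */
    const char **names;             /* member names */
    const struct node **values;     /* member values, parallel to names */
} node;

/* Shared sentinel returned for every failed lookup. */
static const node node_null = { NODE_NULL, 0, 0, NULL, NULL };

/* Look up `key` in `obj`. Never returns NULL: a missing key, or a lookup
 * on a non-object, yields a pointer to node_null. */
static const node *node_get(const node *obj, const char *key)
{
    size_t i;

    if (obj->type != NODE_OBJECT)
        return &node_null;
    for (i = 0; i < obj->length; i++)
        if (strcmp(obj->names[i], key) == 0)
            return obj->values[i];
    return &node_null;
}
```

With this shape, node_get(node_get(root, "a"), "b") is safe even when "a" is absent; the caller checks the type once at the end of the chain.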

mahmud · on Feb 12, 2012

longjmp to an earlier stage where you can "retract" the error or somehow wrap it in a chainable form (e.g. add a union to your result to signal whether it's a value or error, or whatever)

That's what exceptions are supposed to do. C doesn't have exceptions, so you use setjmp/longjmp.
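A minimal sketch of that setjmp/longjmp pattern, assuming a hypothetical parse_ctx that carries the jump buffer and an error message. The digit-summing "parser" is only a stand-in for deeply nested parsing code.

```c
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical parser context holding the recovery point. */
typedef struct {
    jmp_buf on_error;
    const char *message;
} parse_ctx;

/* Record the error and unwind to the setjmp call. Does not return. */
static void fail(parse_ctx *ctx, const char *message)
{
    ctx->message = message;
    longjmp(ctx->on_error, 1);
}

/* Nested code can bail out without threading error codes upward. */
static int parse_digit(parse_ctx *ctx, char c)
{
    if (c < '0' || c > '9')
        fail(ctx, "expected digit");
    return c - '0';
}

/* Returns 0 and stores the digit sum in *out, or -1 with ctx->message set. */
static int sum_digits(parse_ctx *ctx, const char *text, int *out)
{
    int total = 0;
    const char *p;

    ctx->message = NULL;
    if (setjmp(ctx->on_error) != 0)
        return -1;                  /* reached via longjmp from fail() */

    for (p = text; *p != '\0'; p++)
        total += parse_digit(ctx, *p);

    *out = total;
    return 0;
}
```

Two caveats apply: fail() may only be called while sum_digits is still on the stack, and the error path deliberately avoids reading locals modified after setjmp, since their values are indeterminate after a longjmp unless they are volatile. Any memory allocated before the jump also has to be tracked and freed by hand.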

krakensden · on Feb 12, 2012

On a related note, if you want to get something done on a sunday afternoon, writing a simple recursive descent JSON parser from scratch is both doable and fun.
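To give a flavor of how small such a parser can start, here is a sketch of a recursive descent validator for a deliberately tiny subset of JSON: arrays, integers, and the literals true, false and null. Strings, objects, fractions and exponents are left out, and all names, including the MAX_DEPTH cap, are hypothetical choices for this illustration.

```c
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 64    /* arbitrary cap so deep nesting cannot exhaust the stack */

static const char *skip_ws(const char *p)
{
    while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r')
        p++;
    return p;
}

/* Each parse_* function returns a pointer just past the construct it
 * recognized, or NULL on a syntax error. */
static const char *parse_value(const char *p, int depth);

static const char *parse_literal(const char *p, const char *word)
{
    size_t n = strlen(word);
    return strncmp(p, word, n) == 0 ? p + n : NULL;
}

static const char *parse_integer(const char *p)
{
    if (*p == '-')
        p++;
    if (*p == '0')                  /* JSON forbids leading zeros */
        return p + 1;
    if (*p < '1' || *p > '9')
        return NULL;
    while (*p >= '0' && *p <= '9')
        p++;
    return p;
}

static const char *parse_array(const char *p, int depth)
{
    if (depth >= MAX_DEPTH)
        return NULL;
    p = skip_ws(p + 1);             /* step over '[' */
    if (*p == ']')
        return p + 1;
    for (;;) {
        p = parse_value(p, depth + 1);
        if (p == NULL)
            return NULL;
        p = skip_ws(p);
        if (*p == ']')
            return p + 1;
        if (*p != ',')
            return NULL;
        p++;                        /* step over ',' */
    }
}

static const char *parse_value(const char *p, int depth)
{
    p = skip_ws(p);
    switch (*p) {
    case '[': return parse_array(p, depth);
    case 't': return parse_literal(p, "true");
    case 'f': return parse_literal(p, "false");
    case 'n': return parse_literal(p, "null");
    default:  return parse_integer(p);
    }
}

/* Returns 1 if text is exactly one valid value of the subset, else 0. */
static int validate(const char *text)
{
    const char *end = parse_value(text, 0);
    return end != NULL && *skip_ws(end) == '\0';
}
```

Each grammar rule maps to one function, which is most of the appeal. The depth cap is there because unbounded recursion on nested input is one of the easiest ways for such a parser to fall over.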

lmm · on Feb 13, 2012

This should go without saying, but never use such a thing on user-supplied data.

cpeterso · on Feb 12, 2012

The inline keyword is optional and redundant for member functions defined within a class or struct declaration.

schlecht · on Feb 12, 2012

Where c is `json_c', the test in "return c > 127 ? 0xFF : hex_table [c];" is always false due to the range of the type.

pjscott · on Feb 12, 2012

Nice catch. It's fixed in the latest version.

schlecht · on Feb 12, 2012

Why are you doing "#define numeric(b) ((b) >= '0' && (b) <= '9')" instead of say isdigit(3) ?

robocop · on Feb 13, 2012

Where are the tests?

reidrac · on Feb 13, 2012

Oh, thanks. I was starting to feel weird because nobody was saying anything about the lack of tests.

HN may or may not work as a code review platform, but I don't think I would use third-party software myself that doesn't provide tests.