A Journey building a fast JSON parser and full JSONPath, Oj for Go (github.com/ohler55)
111 points by dcu on July 5, 2020 | 70 comments



I wrote a fast C JSON parser a few years ago (600+ MB/s), and it was an interesting experience. Validating the JSON roughly halved the performance. The most interesting performance gain was going from aligned structs to packed byte offsets accessed via memcpy; it added 20-30%. The cost of unaligned access was nothing compared to the gain from fewer cache misses. In the end I found that making a truly fast JSON parser mostly depends on what you parse it into. Like, is the structure read-only, and how fast is it to access?
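The packed-offset trick, roughly (an illustrative sketch, not the actual parser code; the names and layout are made up):

  #include <stdint.h>
  #include <string.h>

  // Aligned layout: 1 + 4 + 4 bytes of fields pad out to 12 bytes per node.
  struct aligned_node {
      uint8_t  type;
      uint32_t key_off;
      uint32_t val_off;
  };

  // Packed layout: 9 bytes per node, no padding. Fields are read with
  // memcpy, which compilers lower to a plain unaligned load on x86.
  #define NODE_SIZE 9

  static uint32_t read_u32(const uint8_t *base, size_t off) {
      uint32_t v;
      memcpy(&v, base + off, sizeof v);
      return v;
  }

  static uint32_t node_val_off(const uint8_t *nodes, size_t idx) {
      return read_u32(nodes, idx * NODE_SIZE + 5);  // skip type (1) + key_off (4)
  }

Denser nodes mean more of them fit per cache line, which is where the 20-30% came from.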


Without validation, all that speed and performance is worthless if badly formed JSON can break your application


There are many circumstances where the JSON is always perfect. Having the option to skip validation is beneficial


If you have so much control over the JSON and performance is a big deal, then there's a good chance you can get rid of the JSON in favor of a more performant format.


That isn't true, though; it's the lingua franca of data transfer, and we are stuck with JSON for better or worse. Also, why are we giving away money because of this view?

I've seen parsers compared on the same real task: say, parse, map, and reduce 100MB of doubles encoded as JSON. A decent library takes much less than half a second and very little memory, roughly the size of the result plus the memory mapping of the JSON document. Many common libraries take multiple seconds while using hundreds of MB, or gigabytes, of memory. That means one has to pay for bigger machines with more uptime per task. That is giving money away.


If you want performance and you control both ends to the point that you don't want to validate the format, why are you using a lingua franca and not a specific solution? You could swap it out for bson or msgpack.

A little ETL goes a long way.


For sure, if you can control both ends, use something more efficient. But, at least in my cases, I am consuming data from others.


So you are consuming data from others yet assuming it to be perfect and not in need of validation?


> It's the lingua franca of data transfer.

Fortunately, this is really only true at the application level. Space/line delimited byte strings and binary formats are used frequently at lower levels.


Interesting data sources, and even RPC for remote services, are in JSON though. But I am only arguing that it shouldn't be slow just because JSON is "slow"; make the best of what we have.


> 100MB of doubles encoded as JSON.

That sounds like a very bad use case for JSON. I would be surprised if your program wasn't more efficient with an ad-hoc binary format for that piece of data.


It's really hard to get away from getting data in JSON these days. It's ubiquitous.

It was a bit contrived, true, but some libraries do it in 0.1s while many are in the 1.5-2s range, and that was parse time, not loading the data or startup. If I were to go binary it would probably be something like protobuf, not ad hoc. Ad hoc formats have issues with maintenance, interop, and tooling.


> It's really hard to get away from getting data in JSON these days. It's ubiquitous.

Yes, I agree with that. It's a neat format if you have small pieces of information to move around, and it's very easy for humans to read, but for large enough data wouldn't it turn into a bottleneck?

> If I would go binary it would probably be something like protobuf, not ad-hoc. Ad-hoc has issues with maintenance, interop, and tooling.

Indeed, there are always these dimensions to take into consideration, as well as evolution. The main issue is finding a library/format which is well supported across all sorts of languages, and JSON has that. I don't know if there are many binary formats with the same level of support.

I suggested ad-hoc because the format seemed simple enough to be mmapped directly, or something equivalent (not sure how scripting languages would do in that case).


The lingua franca of data transfer is still the unsigned byte array, which is understood by every major programming language in existence.

JSON is just a higher abstraction layer which might introduce errors.


If the JSON is always perfect (under your control), why then use JSON and not something like Protobuf or Thrift with byte arrays?


There is a nugget buried at the bottom of this: you will get different output for different kinds of for ... range or for i ... loops. You'd think these are equivalent:

        for i := range arr {
                sum += arr[i]
        }
        for _, c := range arr {
                sum += c
        }
... but the latter is 4 bytes shorter in x86. The compiler takes things literally.


Nice explanation, thanks.


> Both the Ruby Oj and the C parser OjC are the best performers in their respective languages.

Um, no, they aren't:

https://github.com/simdjson/simdjson


According to the OjC README:

> No official benchmarks are available but in informal tests Oj is 30% faster that [sic] simdjson.

Source: https://github.com/ohler55/ojc/blob/master/README.md#benchma...

If you have benchmarks that show otherwise, that would be great for the discussion here, but your point appears to have already been addressed?


I find it surprising that a fairly straightforward parser runs "30% faster" than simdjson (full disclosure: I wrote the original simdjson code) while parsing things byte by byte with copious conditional branches. This looks exactly like the approach used in most JSON parsers, which simdjson generally beat by an order of magnitude or more.

"Informal tests", especially ones with zero description of what was being tested, or even ballpark numbers for how fast this system supposedly runs, may not entirely address this point.

We wrote a paper and anyone curious about our methodology can download and run simdjson with a few clicks and verify our performance claims, which (I hope) are well-described in the paper and line up with the arguments we make in favour of the design.

Given that no indication of what was being measured is provided, it's kind of hard to tell what this performance claim even is, or how to verify it.


Are there benchmarks with simdjson getting the JSON reified into native data structures outside the parser?


No. This is an interesting problem but a non-goal for the C++ version of simdjson, which produces usable "native data structures" albeit not outside the parser (speaking informally - the structures aren't really "in the parser" but they are specific to simdjson if you see what I mean).

The reification problem is awkward but it's a different problem for each language. I did daydream of breaking some API boundaries and using SIMD to magic up some Python structures directly, but this would be incredibly brittle. Someone would probably use it but the chance of getting burned by this is... high.


We have started experimenting with how to do it with simdjson, though :) https://github.com/simdjson/simdjson/pull/947


Sorry, I meant the application's data structures, not the library's. And C++, yes; at least that is my exposure to simdjson.


I'll get the code up on the OjC repo in the next day or two. You are right, there should be reproducible benchmarks to back up the statement.



These results aren't reproducible when both are compiled with -O3 (similar results with -O2):

  ojc_parse_str    1000000 entries in  709.300 msecs. ( 1409 iterations/msec)
  simdjson_parse   1000000 entries in  450.724 msecs. ( 2218 iterations/msec)
In those results simdjson is roughly 60% faster.

Telling simdjson we have padded input gives us further improvement:

  simdjson_parse   1000000 entries in  369.234 msecs. ( 2708 iterations/msec)
About 90% faster. Example changes:

   simd_parse(const char *str, int64_t iter) {
       simdjson::dom::parser parser;
       int64_t   dt;
  +    auto   padded = simdjson::padded_string(std::string_view(str));
       int64_t   start = clock_micro();
   
       for (int i = iter; 0 < i; i--) {
  -        simdjson::dom::element doc = parser.parse(str, strlen(str));
  +        simdjson::dom::element doc = parser.parse(padded);
       }
       dt = clock_micro() - start;


These are my results on an (admittedly old) Intel i3-2120 CPU @ 3.30GHz. Compiling both programs with -O3:

    ojc_parse_str    1000000 entries in 3607.615 msecs. (  277 iterations/msec)
    simdjson_parse   1000000 entries in  418.997 msecs. ( 2386 iterations/msec)
and might as well throw in my own parser...

    uj_parse   1000000 entries in 1959.731 msecs. (  510 iterations/msec)
The -O3 seems to make a large difference for simdjson.


Shockingly different. The -O3 option made hardly any difference with OjC but more than a 10x difference with simdjson. I'll be removing the claim from the OjC readme.

Thank you for being civil with your reply. Much appreciated.


What I've learned from this (as a simdjson author) is that we need to update the quick start in the README to have -O3. I was so psyched about the fact that we now compiled warning-free without any parameters ... that I didn't stop to think that some people would go "huh I did what they told me and simdjson is slow, wtf." Because we evidently told you to compile it in debug mode in the quick start :)

simdjson relies deeply on inlining to let us write performant code that is also readable.

Sorry to have sent you down a blind alley!

One thing to note: if you want to get good numbers to chew on, we have a bunch of really good real world examples, of ALL sizes (simdjson is about big and small), in the jsonexamples/ directory of simdjson. And if you want to check ojc's validation, there are a number of pass.json and fail.json files in the jsonchecker/ directory.


The structuring of simdjson is considerably easier to read if we rely on -O3. In order to get the same performance at lower levels of optimization we need to do a lot of manual things that make the source code quite difficult to read and work with.


One thing to note with benchmarks like this: they are the norm, but they measure the time to validate a document and do not reflect much of the real world, where we reify the JSON back into native data structures. Few benchmarks get that far; I suspect much of it is that the setup is tedious.


I see you are using strlen in the inner loop in simdjson, which is an extra O(n) operation.
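Hoisting it out of the timed loop would look like this (a sketch based on the snippet above):

  size_t len = strlen(str);  // O(n), but computed once, outside the timing
  int64_t start = clock_micro();
  for (int i = iter; 0 < i; i--) {
      simdjson::dom::element doc = parser.parse(str, len);
  }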

UPDATE: your test data is also quite short. Maybe you're mostly measuring startup overhead?


I tried to make the test realistic. Agreed that in a benchmark where the string is the same every time the length could be hard-coded, but in a real-world scenario the length of a string is not known in advance.

As for the length of the test, I also tried with 10 times the number of iterations with the same results, but then the benchmark takes almost a minute to complete. I figured 5 seconds was good enough to get the results, and someone can bump up the iterations if they want to.


As discussed elsewhere, simdjson is comfortable with the notion that JSON files might be of different sizes, but does rather better if you can pad the buffers that it reads data into. This is required to avoid the small but real risk that it reads over a page boundary in its SIMD loops.

The test is not realistic. Measuring performance on small cases is a legitimate use case, of course, but using the same data every time means that both ojc and simdjson will train the branch predictor almost perfectly on any modern architecture. This will help either parser during its "branchy" code, which is all of ojc and a significant portion of simdjson (despite all our yelling about how great SIMD is, once we have tokens, we go to a fairly traditional state machine).
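One cheap way to blunt that training effect in a harness is to rotate across several different documents instead of re-parsing the same bytes (a sketch, taking a pre-loaded set of documents; this is not our actual benchmark code):

  #include <simdjson.h>
  #include <vector>

  void bench(const std::vector<simdjson::padded_string> &docs, int iterations) {
      simdjson::dom::parser parser;
      for (int i = 0; i < iterations; i++) {
          // A different document each pass, so the predictor can't memorize one input.
          simdjson::dom::element doc = parser.parse(docs[i % docs.size()]);
          (void)doc;  // keep the result live so the parse isn't optimized away
      }
  }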

On a tiny message, startup costs are probably more significant. Fixing buffer alignment, -O3, and working with larger messages I would not be shocked to see a factor of 20x. There's nothing wrong with ojc but it is a typical parsing strategy that isn't all that different from the other 'normal' JSON parsing libraries that we evaluate (RapidJSON, sajson, dropbox, fastjson, gason, ultrajson, jsmn, cJSON, and JSON for Modern C++).

It's not clear to me why you were so convinced that your library, by extension, would run 10-25x faster than all these other libraries. With the exception of RapidJSON and sajson - which have some performance tricks to go faster - most of these libraries read input in the same way.


If you read the docs, you can see that the string must be allocated to len+SIMDJSON_PADDING bytes. I suppose that still works by accident because it's followed by other static data.

EDIT: Actually, there is a parameter 'realloc_if_needed' that defaults to true and that will reallocate the string because it assumes you haven't padded it. You're likely just measuring malloc performance.


I suspect the need to reallocate in simdjson is a factor in performance, but that is really part of the benchmark. When using a parser it would be a very rare case to have the luxury of embedding the string to be parsed in the code. Usually a string comes in through some other means, such as over HTTP or maybe a simple socket; then again, it could be passed from some other part of an application. If the parser has a requirement to expand or modify the string in some fashion, then a realloc, or really a malloc and copy, is part of the parsing process.


simdjson doesn't care what is in the padding and won't modify it; it just needs the buffer the string lives in to have 32 extra addressable (allocated) bytes. It doesn't ever use the bytes to make decisions, but it may read them as certain critical algorithms run faster when you overshoot a tiny bit and correct after.

Most real world applications like sockets read into a buffer, and can easily meet this requirement.

If you are interested in what it's for, the place where it parses/unescapes a string is a good example. Instead of copying byte by byte, it generally copies 16-32 raw bytes at a time, and just sort of caps it off at the end quote, even though it might have copied a little extra. Here's some pseudocode (note this isn't the full algorithm; I left out some error conditions and escape parsing for clarity):

  // Check the quote
  if (in[i] == '"') {
    i++;
    len = 0;
    while (true) {
      // Use simd to copy 32 bytes from input to output
      chunk = in[i..i+32];
      out[len..len+32] = chunk;

      // Note we already wrote 32 bytes, and NOW check if there was a quote in there
      quote = chunk.find('"');  // index of '"' within the chunk, or -1 if none
      if (quote >= 0) {
        len += quote;
        break;
      }
      len += 32; // No quote, so keep parsing the string
      i += 32;
    }
  }


Yes, this is exactly why simdjson wants padding. It certainly doesn't need the string to be embedded in source or any such nonsense.

I wish there was a standardized attribute that C++ knew about that pretty much just said "hey, we're not right next to some memory-managed disaster, and if you read off this buffer, you promise not to use the results".

It is awful practice to read off the end of a buffer and let those bytes affect your behavior, but it is almost always harmless to read extra bytes (and mask them off or ignore them) unless you're next to a page boundary or in some dangerous region of memory that's mapped to some device.

This attribute would also need to be understood by tools like Valgrind (to the extent that valgrind can/can't track whether you're feeding this nonsense into a computation, which it handles pretty well).


You don't have to embed the string in the code, obviously. If you want to use a certain lib, I'm sure you can arrange for the buffer to be slightly bigger.


Sorry, this is not realistic in C. The moment you get the JSON data, you also have to get the length, or you could not have allocated enough memory for it.

Also, the number of iterations is not the issue; the length of the input is. What happens if your JSON is a few KB, or bigger than L1/L2 cache, ...?


Have you tried running the tests with a pre-calculated length? When I made that change there was no noticeable difference. The iterations/msec went from 195 to 192 which is not statistically significant.

I have no doubt a more complete set of benchmarks could be made with various sizes of JSON, from file as well as from string. This was put together quickly to get an answer to the questions raised in this post. Since it does seem to be a topic of interest I'll expand and clean up the tests in the future, but for now it does give at least one data point.

You might notice that the simdjson::dom::parser is reused. Without that optimization to allow warming up, the rate was only 160 iterations/msec.


Nope. You've got a code review from an interested random stranger on the net who spent 10 seconds looking at your code and noticed something suspect going on.

Given your response, I spent 10 extra seconds reading the simdjson docs and noted that you violate the SIMDJSON_PADDING requirement. So either your code is a crash waiting to happen, or you use a very non-optimal code path in simdjson that requires it to re-copy all the data.

That's also the maximum amount of attention you'll get from this random stranger. My time is up ;-)


The code example I followed was in the simdjson basics.md. The error-handling example left off the realloc_if_needed argument, which does default to true. I have updated the tests to set the third argument explicitly for clarity.

As for the path in simdjson being non-optimal and having to copy bytes, that should be expected if the string is to be modified. The buf argument type is const, after all, so it should not be modified in place. In any case, glangdale's code is clean and solid, so I doubt his code is anything but optimized and the examples correct.


You can thank the cat on my lap for some bonus attention. I'll try to answer some points in this and other threads of yours. All this from a simdjson non-expert with a reasonable amount of C experience.

Suppose there is a socket with JSON data. What you do is: at init, create one or more buffers of, say, 64KB plus the required padding. Then, when epoll or whatever says there is data, you call read on this buffer. This gives you a length.

At this point, you have a padded buffer and a length, so the requirements for simdjson with realloc_if_needed=false are met, and you can now parse at full speed. When done, reuse the buffer for the next epoll/read cycle.
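In code, the cycle looks roughly like this (a sketch; buffer size and error handling are illustrative):

  #include <simdjson.h>
  #include <unistd.h>
  #include <vector>

  // Allocated once at init: 64KB of usable space plus simdjson's padding.
  std::vector<char> buf(64 * 1024 + simdjson::SIMDJSON_PADDING);
  simdjson::dom::parser parser;

  void on_readable(int fd) {
      // read() gives us the length; the padding is already allocated.
      ssize_t n = read(fd, buf.data(), buf.size() - simdjson::SIMDJSON_PADDING);
      if (n <= 0) return;
      simdjson::dom::element doc =
          parser.parse(buf.data(), size_t(n), /*realloc_if_needed=*/false);
      // ... use doc, then reuse buf on the next epoll/read cycle ...
  }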

There are complications, of course. Data is not guaranteed to arrive all together in one read call. There might be an HTTP library feeding you. All of this amounts to mostly buffering and chunking. When careful, it can be solved in a mostly zero-copy way, ready for optimal simdjson.

The example simdjson code you refer to is a kind of demo mode. It gets you off the ground quickly, but is far from optimal.

I assume simdjson does not write to the buffer. It just reads more than one byte at once, so if it reads, say, 8 bytes and there is only 1 left, it will read 7 bytes of random junk and discard them when it notices its unfounded optimism. However, it needs to be allowed to read these extra bytes without causing a SIGSEGV, hence the padding.

UPDATE: all of this is an educated guess; the simdjson authors are welcome to fix/fine-tune whatever I said.


Seeing the downvotes I assume something is wrong in here. Feel free to add a correction.


The issue is you've been pretty arrogant/rude over multiple comments, talking about your time like it's some kind of gift from the heavens.


OK, sorry. Apologies to everyone involved, the author first of all. It wasn't meant like that, for what it's worth, but it clearly came out like that.


Can you explain what you were measuring and what the general "GB/s"-type numbers you were seeing from both systems were?


I was measuring extracting each piece of data in a JSON document, on the assumption that all the data was needed as opposed to just a single field, and then noting the time it took for each parser to complete. I will get the benchmark code up.

That being said, simdjson does take an interesting approach and for just pulling out specific parts of a JSON document it is very fast.


I'm confused as to how you did these measurements, then.

simdjson is designed to do the opposite of what you describe - nearly all the intellectual work in simdjson was focused on parsing a whole document in one hit, not on "pulling out specific parts of a JSON document".

So, I'm quite unclear on how you could derive a measurement that made us look fast on pulling out specific parts of a JSON document. That being said, I don't have a lot of experience with the current simdjson DOM traversal code and maybe there's some terrible "buried treasure" there (I mean sarcastically - i.e. we have some dumb decisions, maybe?) that you're tripping over. But the one-shot "parse" should be fast.

Anyhow, if you could try to code up something to show parsing speeds on workloads similar to what we describe at https://github.com/simdjson/simdjson it would be very interesting. We've made the effort to make all our benchmarks repeatable - if your approach is faster it would be cool and surprising.

That latter task is a valid use case, and interesting in itself, but not something we really knew how to measure and write about. The problem with those sorts of benchmarks is that they tend to resemble "ask yourself a question, then answer it" - I couldn't think of a good way of coming up with a "let's query a JSON file for a specific thing" benchmark that didn't seem ludicrously contrived.


What I tried was getting each member of a parsed/loaded doc vs a single element. If I was using the library incorrectly my apologies. When I put the code up maybe you can give suggestions on how to make it more realistic. I will gladly retract my performance statement if it turns out I was using your library incorrectly.



Do you not need to specify optimization flags when compiling?


The -O3 flag is just an optimization; for me it made a 1% difference in performance. It's just something I had in the Makefile I copied.


Well, now I'm invested in this.


That's one person saying it is faster. Without a reproducible test, that's pretty much worthless.


You don’t see the irony in that statement? You’re one person saying simdjson is faster, and you are also lacking a reproducible test. That doesn’t make your side of the argument particularly compelling either.

I’ve never used either library, so I don’t care which one is faster, but I would generally trust the author of the library to have tested the performance at some point more than I would trust someone who isn’t the author to comment on the performance of code they’ve never tested.

The author of that library isn’t present to make their case, so you’re the only one of the two who could provide evidence right now.

EDIT: the author is present now: https://news.ycombinator.com/item?id=23738512


>You don’t see the irony in that statement? You’re one person saying simdjson is faster, and you are also lacking a reproducible test. That doesn’t make your side of the argument particularly compelling either.

simdjson is a massively popular library (~9.9k stars, the third most popular JSON library on GitHub) that is well known to beat the pants off every other JSON library. Likewise, the simdjson repo has reproducible benchmarks with charts and the like. So yes, I would consider that statement pretty much "worthless" given the popularity of simdjson.


Given how things turned out, this might be a good lesson as to whether "I measured my own library informally on some benchmarks that I couldn't be bothered to describe to you and it WINS!" is trustworthy in future.

When we introduced simdjson, we went out of our way to explain (a) why we thought it might be faster than other libraries and (b) how people could replicate our measurements or find fault in our methodology. There were zero performance claims, public or private about the performance of simdjson before this material was available.

In my long experience, undocumented and non-reproducible performance claims suffer from the fact that people generally stop benchmarking when they get the result that they wanted to hear, even if the absolute numbers involved are unreasonable. It's very easy to make mistakes in performance measurement (compiling simdjson with anything short of -O2 or -O3, doing repeated runs of a tiny input and thus getting perfect branch prediction behavior), but people generally fix these mistakes only on the side that makes their stuff look good.

Back when I worked on the Hyperscan project ( https://github.com/intel/hyperscan ), we had an API call that allocated scratch memory (necessary due to Hyperscan's rather constrained approach to memory usage). I lost count of the number of times we had users run this allocation (which only ever needed to be done once per thread, effectively) inside their measurement loop. And whenever we were competing against an incumbent system and the people doing the measurement were the authors of that incumbent system, it was amazing how quickly they'd accept the "truth" of their benchmark run.
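For the record, the right shape is scratch allocated once per thread, with only the scan inside the timed loop (a sketch, assuming db, data, len, and an on_match callback are already set up):

  hs_scratch_t *scratch = NULL;
  if (hs_alloc_scratch(db, &scratch) != HS_SUCCESS) {
      /* handle the allocation failure */
  }
  for (int i = 0; i < iterations; i++) {
      // Only this call belongs inside the measurement loop.
      hs_scan(db, data, len, 0, scratch, on_match, NULL);
  }
  hs_free_scratch(scratch);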


I think it’s worth clarifying that I said I would “generally trust {author} more than {random internet person who hasn’t even tested author’s library}”. Are you saying that you trust random internet person more by default?

Trusting one more than the other is not the same as blindly trusting one and never running my own tests. The author who put in the effort to develop and benchmark their library deserves more benefit of the doubt than some drive-by commenter, in my opinion.

If I needed a high performance JSON parsing library, I would definitely set up a benchmark representative of my use case, and test a few different libraries. JSON libraries are usually pretty ergonomic and easy to swap between, so it shouldn’t be an arduous task.

Your very first two comments in this thread were far more insightful than {random internet person} just saying over and over that simdjson can’t possibly be slower, and those comments of yours on their own merit would have removed the need for me to leave any comments whatsoever to defend the original HN post, even without needing to observe that you’re an expert in this field, which just lends further credibility to your comments.

I think the rest of the discussion that occurred was really interesting to watch!


Your comment would improve without the last sentence.


I was iterating on the comment rapidly after posting to try to make my point in the most neutral way possible, so I’m not sure which “last sentence” you saw, but hopefully it is better now.


This is much better, no need for ad hominems.

But please keep the editorializing in check too; it is etiquette to mention edits, and continual editing breaks continuity in the discussion, as you pointed out yourself.


How does this compare to something like https://github.com/valyala/fastjson ?


The fastjson package is fast, I can't disagree, but unfortunately it accepts invalid JSON, and what you get from the parser is a Value that you access through a Get() function, not a simple Go type. To get a simple Go type like []interface{} or map[string]interface{} you have to know what the paths are. You can't actually iterate over a map, as far as I can tell.

Basically the two packages differ in what they provide. OjG returns simple Go types, and instead of a simple Get() it provides a full JSONPath implementation. Different, but both have their uses, so depending on a user's needs one might be better than the other.


I wonder how people who make super-specific things like this their whole career make money. I wonder if they are independently wealthy and just do things like this for fun.


It's kind of a "continued learning" requirement, a little. Work doesn't always have the problems (or funding for them) that grow you in the right directions.

Open source is also a sort of portfolio for many people, I think. Resumes can only tell you so much; code speaks volumes. It pays, just not directly or immediately.

Though my first job in Silicon Valley-sized tech was at Netscape, and I got that specifically after writing some big open source patches for Mozilla. So it can sometimes pay more directly.


Most of us have a day job and do this sort of thing for fun. I know that might seem weird to many but I think you will find most open source developers fall into that category.


Go is becoming the language to implement JSON parsers in.



