"I hadn't noticed this before since the parser is written in a way that it will ignore everything that doesn't look like JSON."
And this is precisely why you want to fail hard when you encounter invalid input. Yes, it's annoying in the cases of "nearly valid" input or "valid input but with some garbage". Yes, it's more work to deal with the error.
But it also means that something like this blows up before you end up in a "sometimes it works, sometimes it doesn't" situation.
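To make that concrete, here's a minimal sketch of the fail-hard style in Java, using Jackson as an example parser (my assumption; not necessarily the parser from the post, and it needs a reasonably recent Jackson, 2.9+). With FAIL_ON_TRAILING_TOKENS enabled, "valid input but with some garbage" raises instead of being silently accepted:

    import com.fasterxml.jackson.databind.DeserializationFeature;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class StrictParse {
        // Refuse any trailing tokens after the JSON value instead of
        // quietly parsing the valid prefix and ignoring the rest.
        private static final ObjectMapper MAPPER = new ObjectMapper()
                .enable(DeserializationFeature.FAIL_ON_TRAILING_TOKENS);

        static JsonNode parseOrBlowUp(String payload) throws Exception {
            return MAPPER.readValue(payload, JsonNode.class);
        }

        public static void main(String[] args) throws Exception {
            System.out.println(parseOrBlowUp("{\"ok\":true}"));        // fine
            System.out.println(parseOrBlowUp("{\"ok\":true}garbage")); // throws
        }
    }

The second call dies immediately with an exception pointing at the garbage, which is exactly the "blows up before you end up in a sometimes-it-works situation" behaviour.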
Yes. I used to deal with, and even prefer, the silently-failing kind of functionality, but over the years I've been bitten by it too many times to still prefer it in good conscience.
You might still spend more time overall dealing with bitchy libraries, but at least you will hopefully never have to deal with bugs that happen only sometimes, as those are really hard to track down and fix (if it's possible at all).
Sure. Sometimes you can get away with "yeah, it fails at times; that's an unfortunate fact of life", but the moment that rarely-appearing issue starts costing you or your customers money, it all becomes really important and "it fails at times" just doesn't cut it. Of course, by then the problem needs to be fixed right away, which just doesn't go very well with "it usually works".
At that point you spend the hours it takes to track the problem down, and you will curse the decision you once made to fail silently.
I'm currently having issues with em-http-request, resque and resque-retry. It still sometimes drops work before the retry limit is reached and doesn't behave nicely with the retry timespans. Also, the async HTTP request randomly ignores the timeout value...
It only happens with big traffic: roughly 0.01% of requests just fail in the wrong way. It's not much, but it's still our money and our customers' money. I hope our get-together tomorrow to solve this problem helps.
I can argue this both ways, mainly based on who I focus on.
If I focus on minimizing pain for the end user, I want things to blow up as little as possible. If I focus on minimizing my pain, I tend to go for hard failure.
For things that are important enough to spend the time on, I go for both: a system that is maximally kind to its users, but is internally a fussbudget. But that requires building a decent infrastructure for logging and alerting, plus an organization disciplined enough to take the alerts seriously.
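A sketch of what that split can look like (the names and fallback payload are made up for illustration): the strict core throws on anything it doesn't fully understand, while the outer layer logs loudly for the alerting pipeline and degrades gracefully for the user:

    import java.util.logging.Logger;

    public class Fussbudget {
        private static final Logger LOG = Logger.getLogger("requests");

        // Gentle on the outside: the user never sees a stack trace.
        static String handleRequest(String payload) {
            try {
                return renderStrictly(payload);
            } catch (Exception e) {
                // Loud on the inside: this is what the alerting hooks into.
                LOG.severe("invalid payload, waking up on-call: " + e);
                return "{\"status\":\"degraded\"}";
            }
        }

        // Hypothetical strict core: refuses anything it doesn't fully understand.
        static String renderStrictly(String payload) {
            if (payload == null || !payload.startsWith("{")) {
                throw new IllegalArgumentException("not JSON: " + payload);
            }
            return payload;
        }

        public static void main(String[] args) {
            System.out.println(handleRequest("{\"ok\":true}"));
            System.out.println(handleRequest("garbage")); // logged, degraded
        }
    }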
This bug reminds me of what my dad taught me when I got my driver's license. He taught me that knowing how to drive carefully wasn't enough to prevent accidents. I had to drive for myself and for everyone else. I clearly remember "you don't know what kind of drunk will be blowing a red light".
In this bug, Paul was driving carefully - he relied on the parser to do a good job. But relying on the parser is like crossing on a GREEN light without checking. 99.9% of the time you should be OK. Until that one time with the drunk blowing the RED light.
My dad was right. Had Paul not relied entirely on the parser and instead done accurate memory allocation (checked for that drunk blowing the RED light) - everything would have been fine.
I wonder how many people even know that dashes are legal in DNS names. (I mean, of course, the ASCII character that serves as hyphen, en-dash, and minus sign.)
I think there are lots of domain names that would benefit from a well-placed dash -- the most amusing example I've seen being Pen Island's.
Expert Sex Change. (Familiar to almost everyone here, I think.) Power Genitalia. (An Italian battery company.) Whore Presents. (A service for finding out about people's publicity agents, etc.)
Would it be awfully smug to point out that Valgrind would've caught this bug in mere minutes? That's exactly why I make a habit of running my tests under Valgrind regularly during development; there's no point wasting hours debugging the classes of problem that tools can pinpoint in minutes.
So, wait. When you allocate a new array in the JVM, it's filled with random data instead of zeroes? That seems like a fundamental security model error. Or are these 'buffers' special native IO primitives that break all the Java security rules and guidelines? I haven't used Java in a while...
Heh no, this was the actual bug (that it was reading "random" data from memory on the first iteration). I just hadn't noticed the issue until this "random memory" contained fragments of invalid json.
Were you or the network library reusing buffer objects (to avoid reallocating them), so the random data was left over from an earlier socket read? I'm surprised the JVM would allocate a new buffer object with non-zero data.
Could you elaborate on this? If the random data isn't caused by the JVM not filling a buffer with zeros (which I'm sure it does) how is the data actually leaking? Do you share byte[] arrays between threads or recycle them once a thread dies?
JVM objects are always zero-allocated, but Java libraries typically don't make any guarantees about memory that sits outside the JVM heap, which is probably where the socket was reading from.
By default, the JVM initialises arrays as appropriate for their type. Presumably this was an array of String, so the initialisation values are nulls, not zeros. Arrays of boolean are initialised to false, etc.
We don't know anything about it, because the blog post is slightly meager on details.
That said, this reads more like a byte[] array or similar to me, since you are reading data from the net/a stream. Somewhere there will be a process to interpret these bytes as a string in a specific encoding, but the error 'sounds' like it's related to a raw, power-of-two-sized byte buffer.
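For what it's worth, here's a self-contained sketch of the buffer-reuse theory (an assumed scenario, not the actual code from the post): read() only overwrites the first n bytes of a reused byte[], so decoding the whole buffer instead of just the first n bytes picks up the tail of the previous message:

    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;

    public class StaleBuffer {
        public static void main(String[] args) throws Exception {
            byte[] buffer = new byte[32]; // fresh JVM arrays ARE zero-filled

            // First "socket read": a longer payload fills part of the buffer.
            int n = new ByteArrayInputStream(
                    "{\"user\":\"alice\"}".getBytes(StandardCharsets.UTF_8)).read(buffer);
            System.out.println(new String(buffer, 0, n, StandardCharsets.UTF_8));

            // Second "socket read" into the SAME buffer: a shorter payload.
            n = new ByteArrayInputStream(
                    "{\"ok\":1}".getBytes(StandardCharsets.UTF_8)).read(buffer);

            // Wrong: decodes the stale tail of the previous message too.
            // Prints {"ok":1}"alice"} plus trailing NULs -- "fragments of invalid json".
            System.out.println(new String(buffer, StandardCharsets.UTF_8));

            // Right: trust only the byte count this read actually returned.
            System.out.println(new String(buffer, 0, n, StandardCharsets.UTF_8));
        }
    }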
We currently use it extensively for analytics and our product recommendation engine (blog post on that soon!), and we are writing a custom Scala-based HTTP proxy for our new API. In general, we are trying to progressively do more Scala and less Ruby.
If you (or anybody else) are interested in hacking with us, please drop me a note (a link to your GitHub profile is enough) at paul@dawanda.com :)
Oh, Don. No, everything is fine with Ruby, I just got bored. And Scala... she is so much faster! See you soon in AMS; I'll bring Mikael for fun and profit ;)