"I hadn't noticed this before since the parser is written in a way that it will ignore everything that doesn't look like JSON."
And this is precisely why you want to fail hard when you encounter invalid input. Yes, it's annoying in the cases of "nearly valid" input or "valid input but with some garbage". Yes, it's more work to deal with the error.
But it also means that something like this blows up before you end up in a "sometimes it works, sometimes it doesn't" situation.
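To make that concrete, here's a minimal sketch of the fail-hard style in Java, using Jackson as an example parser (my assumption; not necessarily the parser from the post, and it needs a reasonably recent Jackson, 2.9+). With FAIL_ON_TRAILING_TOKENS enabled, "valid input but with some garbage" raises instead of being silently accepted:

    import com.fasterxml.jackson.databind.DeserializationFeature;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class StrictParse {
        // Refuse any trailing tokens after the JSON value instead of
        // quietly parsing the valid prefix and ignoring the rest.
        private static final ObjectMapper MAPPER = new ObjectMapper()
                .enable(DeserializationFeature.FAIL_ON_TRAILING_TOKENS);

        static JsonNode parseOrBlowUp(String payload) throws Exception {
            return MAPPER.readValue(payload, JsonNode.class);
        }

        public static void main(String[] args) throws Exception {
            System.out.println(parseOrBlowUp("{\"ok\":true}"));        // fine
            System.out.println(parseOrBlowUp("{\"ok\":true}garbage")); // throws
        }
    }

The second call dies immediately with an exception pointing at the garbage, which is exactly the "blows up before you end up in a sometimes-it-works situation" behaviour.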
Yes. I used to deal with, and even prefer, the silently-failing kind of functionality, but over the years I've been bitten by it too many times to still prefer it in good conscience.
You might still spend more time overall dealing with bitchy libraries, but at least you will hopefully never have to deal with bugs that happen only sometimes, as those are really hard to track down and fix (if it's possible at all).
Sure. Sometimes you can get away with "yeah, it fails at times; that's an unfortunate fact of life", but the moment that rarely-appearing issue starts costing you or your customers money, it all becomes really important and "it fails at times" just doesn't cut it. Of course, by then the problem needs to be fixed right away, which just doesn't go very well with "it usually works".
At that point you spend the hours it takes to track the problem down, and you will curse the decision you once made to fail silently.
I'm currently having issues with em-http-request, resque and resque-retry. It still sometimes drops work before the retry limit is reached and doesn't behave nicely with the retry timespans. Also, the async HTTP request randomly ignores the timeout value...
It only happens with big traffic: roughly 0.01% of requests just fail in the wrong way. It's not much, but it's still our money and our customers' money. I hope our get-together tomorrow to solve this problem helps.
I can argue this both ways, mainly based on who I focus on.
If I focus on minimizing pain for the end user, I want things to blow up as little as possible. If I focus on minimizing my pain, I tend to go for hard failure.
For things that are important enough to spend the time on, I go for both: a system that is maximally kind to its users, but is internally a fussbudget. But that requires building a decent infrastructure for logging and alerting, plus an organization disciplined enough to take the alerts seriously.
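A sketch of what that split can look like (the names and fallback payload are made up for illustration): the strict core throws on anything it doesn't fully understand, while the outer layer logs loudly for the alerting pipeline and degrades gracefully for the user:

    import java.util.logging.Logger;

    public class Fussbudget {
        private static final Logger LOG = Logger.getLogger("requests");

        // Gentle on the outside: the user never sees a stack trace.
        static String handleRequest(String payload) {
            try {
                return renderStrictly(payload);
            } catch (Exception e) {
                // Loud on the inside: this is what the alerting hooks into.
                LOG.severe("invalid payload, waking up on-call: " + e);
                return "{\"status\":\"degraded\"}";
            }
        }

        // Hypothetical strict core: refuses anything it doesn't fully understand.
        static String renderStrictly(String payload) {
            if (payload == null || !payload.startsWith("{")) {
                throw new IllegalArgumentException("not JSON: " + payload);
            }
            return payload;
        }

        public static void main(String[] args) {
            System.out.println(handleRequest("{\"ok\":true}"));
            System.out.println(handleRequest("garbage")); // logged, degraded
        }
    }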
This bug reminds me of what my dad taught me when I got my driver's license. He taught me that knowing how to drive carefully wasn't enough to prevent accidents. I had to drive for myself and for everyone else. I clearly remember "you don't know what kind of drunk will be blowing a red light".
In this bug, Paul was driving carefully - he relied on the parser to do a good job. But relying on the parser is like crossing on a GREEN light without checking. 99.9% of the time you should be OK. Until that one time with the drunk blowing the RED light.
My dad was right. Had Paul not relied entirely on the parser and instead done accurate memory allocation (checked for that drunk blowing the RED light) - everything would have been fine.
I wonder how many people even know that dashes are legal in DNS names. (I mean, of course, the ASCII character that serves as hyphen, en-dash, and minus sign.)
I think there are lots of domain names that would benefit from a well-placed dash -- the most amusing example I've seen being Pen Island's.
Expert Sex Change. (Familiar to almost everyone here, I think.) Power Genitalia. (An Italian battery company.) Whore Presents. (A service for finding out about people's publicity agents, etc.)
Would it be awfully smug to point out that Valgrind would've caught this bug in mere minutes? That's exactly why I make a habit of running my tests under Valgrind regularly during development; there's no point wasting hours debugging the classes of problem that tools can pinpoint in minutes.
So, wait. When you allocate a new array in the JVM, it's filled with random data instead of zeroes? That seems like a fundamental security model error. Or are these 'buffers' special native IO primitives that break all the Java security rules and guidelines? I haven't used Java in a while...
Heh no, this was the actual bug (that it was reading "random" data from memory on the first iteration). I just hadn't noticed the issue until this "random memory" contained fragments of invalid json.
Were you or the network library reusing buffer objects (to avoid reallocating them), so the random data was left over from an earlier socket read? I'm surprised the JVM would allocate a new buffer object with non-zero data.
Could you elaborate on this? If the random data isn't caused by the JVM not filling a buffer with zeros (which I'm sure it does) how is the data actually leaking? Do you share byte[] arrays between threads or recycle them once a thread dies?
JVM objects are always zero-allocated, but Java libraries typically don't make any guarantees about memory that sits outside the JVM heap, which is probably where the socket was reading from.
By default, the JVM initialises arrays as appropriate for their type. Presumably this was an array of String, so the initialisation values are nulls, not zeros. Arrays of boolean are initialised to false, etc.
We don't know anything about it, because the blog post is slightly meager on details.
That said, this reads more like a byte[] array or similar to me, since you are reading data from the net/a stream. Somewhere there will be a process to interpret these bytes as a string in a specific encoding, but the error 'sounds' like it's related to a raw, power-of-two-sized byte buffer.
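For what it's worth, here's a self-contained sketch of the buffer-reuse theory (an assumed scenario, not the actual code from the post): read() only overwrites the first n bytes of a reused byte[], so decoding the whole buffer instead of just the first n bytes picks up the tail of the previous message:

    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;

    public class StaleBuffer {
        public static void main(String[] args) throws Exception {
            byte[] buffer = new byte[32]; // fresh JVM arrays ARE zero-filled

            // First "socket read": a longer payload fills part of the buffer.
            int n = new ByteArrayInputStream(
                    "{\"user\":\"alice\"}".getBytes(StandardCharsets.UTF_8)).read(buffer);
            System.out.println(new String(buffer, 0, n, StandardCharsets.UTF_8));

            // Second "socket read" into the SAME buffer: a shorter payload.
            n = new ByteArrayInputStream(
                    "{\"ok\":1}".getBytes(StandardCharsets.UTF_8)).read(buffer);

            // Wrong: decodes the stale tail of the previous message too.
            // Prints {"ok":1}"alice"} plus trailing NULs -- "fragments of invalid json".
            System.out.println(new String(buffer, StandardCharsets.UTF_8));

            // Right: trust only the byte count this read actually returned.
            System.out.println(new String(buffer, 0, n, StandardCharsets.UTF_8));
        }
    }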
We currently use it extensively for analytics and our product recommendation engine (blog post on that soon!), and we are writing a custom Scala-based HTTP proxy for our new API. In general, we are trying to progressively do more Scala and less Ruby.
If you (or anybody else) are interested in hacking with us, please drop me a note (a link to your GitHub profile is enough) at paul@dawanda.com :)
Oh, Don. No, everything is fine with Ruby, I just got bored. And Scala... she is so much faster! See you soon in AMS; I'll bring Mikael for fun and profit ;)