This post reminded me of a /fun/ bug I ran into at my last job. The root of the issue was that our server returned a 408 timeout error when something timed out on the backend. Astute readers might immediately notice the issue with that code: 408 is a /client/ timeout, not a server timeout, which is 504.
Since we controlled the client and server you might think it doesn't matter much; as long as the client knows how to handle a 408, everything should be fine, right? Well, not exactly. We had a number of pages/endpoints that would be overwhelmed when there was too much data on the backend for a given user/company (which was its own issue, but let's take that as fact for this and move on). We dutifully sent back 408s and knew that we needed to optimize or chunk those endpoints in the future.
The problem was that the code that timed out would keep running until it finished (or hit a different timeout). On its own this isn't the end of the world: the server does work that never gets sent to the client and is essentially thrown away. But some of these endpoints ran very heavy queries that could bring down our database if enough of them ran in a short window. Even more confusing, it appeared as if our server was running multiple of these queries for a single request.
We were using Akka (actor-based) on our backend and thought maybe something was misconfigured, or that we might accidentally be dropping duplicate messages into the queues and firing off multiple queries. We fought with this on and off for months (you know, something else is always a higher priority). Finally we tracked down that the browser was sending multiple requests. This was only clear when testing in an isolated dev environment with no other traffic, where you could clearly see a second request being made after the first one timed out. What was extremely frustrating was that Chrome did not show this second request in the dev tools; it only showed the initial request.
After more digging and a little luck, I stumbled across a "feature" of browsers where they will retry the same request under certain circumstances without showing the retry in dev tools. A 408 was one of those cases. Switching to the correct code, 504, immediately fixed our odd self-DDoS against our DB.
Obviously this was the fault of whoever initially defined `TIMEOUT = 408` somewhere in our HTTP error codes class, but to this day I feel like Chrome should have given some indication that it was firing off another request. If you left the tab open, Chrome would just keep retrying and slowly overwhelm the DB with heavy queries until it fell over.
I had a similar problem with unusual status codes in Safari which caused lots of confusion. If you give it a 204 (example: http://httpbin.org/status/204), weird things happen - it doesn't even change the currently displayed page to a blank one, the URL bar behavior is undefined (the URL bar doesn't change unless the developer tools are open) and the request doesn't appear in the developer tools either.
It's likely being retried on a lower level, similar to the sibling comment about CORS.
The dev tools aren't intentionally hiding the retry from you; from their perspective, they're asking the networking layer to fetch a page, and the fact that that layer retries behind the scenes is invisible to them.
Yeah, that was my thought as well. It threw me for a loop because I never would have expected that my browser dev tools would hide something like that. It was a core tenet of my debugging that was broken.
Very nice write-up. I'd be curious to see it using fetch rather than the older XHR API - that would make more sense as a library today, I think? Or are there compelling reasons to stick with XHR in a post-IE world?
Thank you for your kind words! For lofi.limo I support back to Chrome 49 and Firefox 52 (the last versions available for Windows XP), which I /think/ both have Fetch, so I could have probably used it here.
Fetch has about the same capabilities as XHR, so it should be a drop-in replacement in this project. Let's call it an exercise for the reader! ;)
The only gap I've hit recently is that fetch doesn't have a progress API.
Use case: I was fetching a large image from a server to render into a canvas and I wanted to show a progress bar for the download. AFAIK there is no way to do this with fetch
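For comparison, the XHR route looks roughly like this (a sketch, with made-up names): the `progress` event hands you `loaded`/`total` as the body streams in, which is exactly what a download bar needs. With fetch you'd have to count bytes off `response.body`'s reader yourself.

```javascript
// Pure helper: percent complete, or null when the server sent no
// Content-Length (indeterminate progress).
function percentDone(loaded, total) {
  return total > 0 ? Math.round((loaded / total) * 100) : null;
}

// Browser-only sketch: download a large resource (e.g. an image for a
// canvas) while reporting progress via a callback.
function downloadWithProgress(url, onProgress) {
  return new Promise((resolve, reject) => {
    const xhr = new XMLHttpRequest();
    xhr.open('GET', url);
    xhr.responseType = 'blob';
    xhr.onprogress = (e) =>
      onProgress(percentDone(e.loaded, e.lengthComputable ? e.total : 0));
    xhr.onload = () =>
      xhr.status === 200
        ? resolve(xhr.response)
        : reject(new Error(`HTTP ${xhr.status}`));
    xhr.onerror = () => reject(new Error('network error'));
    xhr.send();
  });
}
```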
I still use xhr because of familiarity. Using a javascript closure as a callback in xhr works quite well and seems much easier to me than dealing with the promises that are required for fetch. (I understand what promises are; I just think the ones in javascript are poorly-defined.)
> That's the point I made in "Taming the asynchronous beast with ES7", where I explored the ES7 async/await keywords, and how they integrate promises more deeply into the language. Instead of having to write pseudo-synchronous code (with a fake catch() method that's kinda like catch, but not really), ES7 will allow us to use the real try/catch/return keywords, just like we learned in CS 101.
> This is a huge boon to JavaScript as a language. Because in the end, these promise anti-patterns will still keep cropping up, as long as our tools don't tell us when we're making a mistake.
Async/await is extremely well-supported and reliable in the ecosystem now, and allows you to express Promise-based asynchronous flows with a much more natural syntax. I highly recommend that you take a look, as it pretty much obviates all the pain points mentioned in the linked article.
I'll do that. The whole promise chaining thing seems very awkward to me, and I stopped investigating promise-based solutions because I was under the impression that "everybody" preferred promise chaining to async/await. Maybe that's exactly the reason I might prefer async/await ;-)
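For what it's worth, here's the same two-step flow written both ways, using stub async functions (all names are made up):

```javascript
// Stubs standing in for real network calls.
const getUser = async (id) => ({ id, name: 'ada' });
const getPosts = async (user) => [`post by ${user.name}`];

// Promise chaining: each step hangs off .then(), errors off .catch().
function postsForChained(id) {
  return getUser(id)
    .then((user) => getPosts(user))
    .catch((err) => {
      console.error(err);
      return [];
    });
}

// async/await: the identical flow with ordinary control-flow keywords,
// including a real try/catch instead of the .catch() method.
async function postsForAwait(id) {
  try {
    const user = await getUser(id);
    return await getPosts(user);
  } catch (err) {
    console.error(err);
    return [];
  }
}
```

Both return the same promise-for-an-array; the second just reads like the synchronous code you already know.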
This is a good introduction to the subject. At the end you might want to mention rate limiting, truncation (giving up after a specified number of retries) and, for the most sophisticated setups, circuit breakers.
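A minimal sketch of the bounded-retry part (illustrative code, not from the article): exponential backoff with jitter, truncated after a fixed number of retries. A circuit breaker would go further and stop issuing calls entirely after repeated failures.

```javascript
// Retry an async operation with truncated exponential backoff + jitter.
// maxRetries bounds the extra attempts; after that, the error propagates.
async function retry(fn, { maxRetries = 3, baseMs = 100 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // truncation: give up
      // Exponential backoff, randomized (50-100% of the slot) so many
      // clients don't retry in lockstep and hammer the server together.
      const delay = baseMs * 2 ** attempt * (0.5 + Math.random() / 2);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

The jitter is the piece naive retry loops usually miss - without it, every client that failed at the same moment retries at the same moment too.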
If every developer just did what you advise and no more, I would consider it a big win. I saw at a previous employer exactly how much extra work and complexity we had to build into our operations just to deal with clients using dumb simple retry loops.
At this point I like to wrap my remoting code (including optional retries) into async functions that return a promise. The promise resolves with a result from the server, or rejects with any sort of error that can come from an XHR (or now, fetch), and the consumer can decide what to do with that.
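Something like this, as a rough sketch (names are illustrative; assumes a global fetch): resolve with the parsed result on success, reject with whatever transport or HTTP error occurred, and let the caller pick the retry budget.

```javascript
// Wrap a remote call: resolves with parsed JSON, rejects with the
// transport error or a non-2xx status; retry count is the caller's choice.
async function callApi(url, { retries = 0 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      const res = await fetch(url);
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return await res.json();
    } catch (err) {
      if (attempt >= retries) throw err; // consumer decides what to do next
    }
  }
}
```

The consumer then just does `await callApi(url, { retries: 2 })` in a try/catch and handles the failure however fits the UI.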
In my software I almost never do automatic retries, rather I warn the user that there was a connection error if necessary, and revert whatever state was waiting on a call. In my view anything that can fail silently probably does not need to be retried right now.