Rackspace Investigating Current Issue (mosso.com)
19 points by mcargian on Dec 18, 2009 | 13 comments



A bit off topic, but: I like the way you get redundancy with Amazon: Elastic Load Balancing can proxy traffic to multiple availability zones. That said, AWS has had outages this year too - goes with the business.


I was wondering what happened. This wasn't just the cloud, it affected our dedicated box in DFW too.


We just had a discussion here about redundancy and how to achieve it. I think a big problem is that web browsers don't try multiple IP addresses - am I correct in this?

What I'm thinking is: if a DNS server goes down, no big deal, DNS clients just try another server. But if a web browser can't connect on the first address it resolves, it won't try the other addresses?

If so, could this be fixed on the client side? Would this even need an RFC?


This could be fixed on the client side: a DNS lookup for a name can return multiple A records, and a browser could step through the IPs the way the OS does for DNS servers, or a mail client does for MX records. Run 'nslookup www.google.com' for an example of a name that returns multiple IPs.
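
As a rough sketch of that stepping-through behaviour, using only Python's standard socket module (the host, port, and timeout here are arbitrary choices, not anything a browser actually uses):

    import socket

    HOST, PORT, TIMEOUT = "www.google.com", 80, 3  # arbitrary choices

    # Step through every address the name resolves to, instead of
    # giving up after the first connection failure.
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            HOST, PORT, type=socket.SOCK_STREAM):
        s = socket.socket(family, socktype, proto)
        s.settimeout(TIMEOUT)
        try:
            s.connect(addr)
            print("connected to", addr[0])
            s.close()
            break                      # first reachable address wins
        except OSError:
            print("failed:", addr[0])  # fall through to the next one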

I wrote a blog post about availability after the BitBucket discussion a couple of months back. It applies to today's situation as well: http://www.bretpiatt.com/blog/2009/10/03/availability-is-a-f...

All of the people saying "You should have a multi-provider / multi-location deployment and you should DIY" are talking in theory or from a large-enterprise perspective. I know very few start-ups or SMBs that go the DIY multi-location route -- few even go multi-DC with the help of a provider.

The reason behind this is opportunity cost: if you are a startup and have a 30-minute outage on a Friday afternoon, people will cope with it, and it won't cost you enough revenue to justify more than doubling the cost of your configuration (a full multi-site configuration costs 2x + y, where x = the cost of a single site and y = the cost of keeping data consistent across sites and deciding which location to route traffic to).
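
As a worked example with made-up numbers (x and y as defined above; the figures are illustrative only, not from any real deployment):

    % illustrative figures only
    \[ x = \$3000,\quad y = \$1500
       \;\Rightarrow\; 2x + y = \$7500 = 2.5x \]

So a full multi-site setup runs 2.5 times the single-site cost, which a 30-minute outage rarely justifies.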


This already exists - it's called an SRV record. SRV records allow you to assign a priority and a weight to a lookup result, the same way you can with MX records (and SRV can specify different ports as well, very cool). It was first defined in RFC 2052 back in 1996, standardized as RFC 2782 in 2000, and Windows 2000's Active Directory was one of its first major users.

It has been widely adopted: it is used in zeroconf, SIP, and XMPP, and there are proposals to route WHOIS lookups using SRV. Everywhere, that is, except in browsers, for various reasons.

You can do this with MX in email because the time the multiple lookups take isn't an issue - mail goes out in the background. A records, by contrast, were designed to be round-robin. (A sketch of that MX behaviour appears below.)
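
Here is a minimal sketch of MX-style ordering, using the third-party dnspython library (an assumed dependency; the domain is just an example):

    import dns.resolver  # third-party "dnspython", assumed available

    # Order MX records by preference (lowest first), the way a mail
    # server does before attempting delivery to each host in turn.
    for mx in sorted(dns.resolver.resolve("example.com", "MX"),
                     key=lambda r: r.preference):
        print(mx.preference, mx.exchange)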

In theory, SRV records in browsers are a fantastic solution that would change the web and make failover easy.

Here is the problem: even if the first server is up all the time, you double the number of DNS queries on the web. SRV records don't store IP addresses, they store names. So the browser does an SRV request, orders the results by priority and weight, picks one, then does an A lookup and attempts a connection. If that connection fails (meaning a 10-20 second wait), it moves to the next.

This then raises caching issues. A high TTL defeats the purpose, because clients keep using the cached record for a failed server; bring the TTL down and you lose the benefit of DNS caching, with the browser re-checking the failed A record every time it expires. Multiply this out by the number of elements on a page and you start to see why SRV is implemented almost everywhere except web browsers. It reaches a point where managing the records on the server end makes more sense: which servers are healthy is determined once and all clients are informed, rather than each client checking for itself each time.
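
To make the double lookup concrete, here is a minimal sketch of the client-side SRV flow described above, again with dnspython (the service name is hypothetical, and the weighted pick is a simplification of the RFC 2782 selection algorithm):

    import random
    import dns.resolver  # third-party "dnspython", assumed available

    SERVICE = "_sip._tcp.example.com"  # hypothetical service name

    # First round trip: the SRV query returns names, not addresses.
    records = sorted(dns.resolver.resolve(SERVICE, "SRV"),
                     key=lambda r: r.priority)

    # Keep only the lowest-priority group, then pick one member
    # weighted by its 'weight' field.
    top = [r for r in records if r.priority == records[0].priority]
    pick = random.choices(top, weights=[r.weight or 1 for r in top])[0]

    # Second round trip: resolve the chosen target name to an address.
    addr = dns.resolver.resolve(str(pick.target), "A")[0].address
    print(addr, pick.port)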


"SRV records dont store IP addresses, they store names. So the browser does a SRV req, orders by priority and weight, picks one, then does an A lookup and attempts connection."

Not necessarily in this case: the DNS server can provide the A record in the Additional Section of the same reply. Lookups that return hostname-based records (e.g. MX, NS, CNAME) benefit from this too -- until the reply exceeds the UDP size limit, of course.

But if no SRV record exists, the client has to fall back to the result of a plain A lookup. That A lookup can be done sequentially or in parallel (think of how dual-stack browsers handle AAAA/A lookups).
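
As a rough illustration of the Additional Section point, a sketch with dnspython (the service name is hypothetical, and many resolvers strip or never populate this section):

    import dns.resolver  # third-party "dnspython", assumed available

    answer = dns.resolver.resolve("_sip._tcp.example.com", "SRV")

    # Any A records the server volunteered ride along in the
    # Additional Section of the same reply, saving the second lookup.
    for rrset in answer.response.additional:
        print(rrset.name, [rdata.to_text() for rdata in rrset])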


(cont)

It would still be nice to have the support, though, since not all sites would need to take advantage of it, and it could be made easier by allowing IP addresses directly within SRV records.

Amusing moment - I recall a proposal to let browsers report back on which records did or didn't work. It was taken seriously for a few minutes, until somebody pointed out the obvious implication of such a feature: it would become terribly easy to tear down the entire internet.


Almost all modern web browsers will try multiple A records on a connection failure. You do, though, need to make sure the connection to the first fails quickly rather than timing out.


Ah, thanks. It feels like that's almost the wrong way round, though: if the browser doesn't get a rapid TCP connection establishment, it should try another address.

I guess one issue is that the client sends the first data in the HTTP protocol, so you don't get a full health indication until after you've issued your request, at which point you have to worry about idempotency etc.


Maybe the refresh button should just try another A record. Most of the time when a page is unresponsive, I'll give it a couple of refreshes before I conclude it's dead.

(Or maybe some browsers do this already.)


Seems like this could be addressed by a browser UI enhancement: let the user decide when to cut off a slow connection and try another IP. After n seconds, show a "try elsewhere" bar.


That sounds like a bad solution for such a problem.

Imagine your typical internet user going to yahoo.com and being asked "Try elsewhere?" They'd have no idea what that means. They don't know what IP addresses or servers are.


This is divine punishment for the term "the cloud."



