Hacker News
Stack Overflow: How we upgrade a live data center (serverfault.com)
213 points by Nick-Craver on March 5, 2015 | 68 comments



I love reading about StackOverflow, particularly their infrastructure.

The site has been a useful resource for so many years, and it works so well. It was a joy to discover that it all ran on something like two racks' worth of servers, and still does. Having seen corporate intranet portals, with maybe a thousand daily active users, running on excessive* hardware (needlessly, of course), it's like a breath of fresh air.

*EDIT: Removed hyperbole. Not more hardware, but too much nonetheless.


We take a lot of pride in how much we get out of each piece of hardware. You don't need to have 1000 servers to run a large site, just the mindset of performance first.


Likewise, I always find these write-ups fascinating, well written, well planned, etc. I really appreciate the extra effort involved in making all this public!


Perhaps off topic, but this post just reminds me so much of why I love AWS EC2, etc. Not ever having to think about hardware again is wonderful.


I agree. I read this post and was shocked at the amount of planning, process, man-hours, hardware issues and other problems that come with hardware. I've worked at places with ~500 EC2 machines in a dozen autoscaling groups across 3 AZs with many ELBs, databases, SQS queues and other AWS infrastructure and never had to deal with anything like this when upgrading.

Upgrading hardware in EC2 is as simple as changing a launch configuration and updating an auto-scaling group. Maybe an hour of my time to update configs, verify and deploy. Updating something like a database or caching servers is more work for sure, but with 0 time needed to get to the DC, unpack, rack and configure servers you do save time with 'the cloud'.
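
For what it's worth, the mechanics of that are only a few API calls. Here's a rough sketch using boto3 - all of the names, the AMI ID, and the instance type are made up for illustration, not anyone's real setup:

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    # Register a new launch configuration pointing at the bigger instance type.
    # The AMI ID, security group, and names here are placeholders.
    autoscaling.create_launch_configuration(
        LaunchConfigurationName="web-v2",
        ImageId="ami-12345678",
        InstanceType="c4.2xlarge",
        SecurityGroups=["sg-12345678"],
    )

    # Point the auto-scaling group at the new launch configuration; anything it
    # launches from now on comes up on the new instance type. Then cycle out
    # the old instances at your leisure.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName="web-asg",
        LaunchConfigurationName="web-v2",
    )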

I get that you do pay more for EC2 instances, especially if you keep hardware for 4 years. But AWS prices drop every year or two, along with (generally) faster instance generations, so your overall costs do drop over time.

How many ops employees would you need for a fleet of 500 servers in a datacenter? We managed it all with 4 people with AWS.


The Stack Exchange philosophy is that because they can buy truly mega hardware (each one of those two blade chassis they bought has 72 cores and 1.4TB of RAM, remember!), they don't need those 500 servers to start with. Plus the hardware is an asset and you get to depreciate it.

Everywhere I've ever worked we've had the "big spreadsheet" of projected cloud costs, projected ops costs, and hardware costs. In general the "scale horizontally" philosophy will favor the cloud, while the "scale vertically" philosophy still seems to favor owned hardware in local datacenters. Which is superior is a crazy, long-standing debate with no clear answer.


The biggest cost of using amazon isn't the hardware, it's the markup on traffic (if you are a dynamic site.)


Can you elaborate? I thought the answer to that question was to scale up if you can, because it's much simpler and therefore cheaper. Similar to how you don't give up ACID unless the scale you're working at doesn't permit it anymore.


There's never really a "one size fits all" answer, which is why it's a long-running debate and depends heavily on the product.

Scaling horizontally can let you use smaller, cheaper hardware on average and burst to higher capacity more easily if you need to, at the expense of a lot of complexity. It also (done right, which is rare) tends to gain you a greater degree of fault tolerance, since hardware instances become rapidly-replaceable commodities.

Most web apps have spiky but relatively predictable load. For example, a typical enterprise SaaS startup gets more traffic during work hours than on weekends. For these companies, the complexity of developing a horizontally scaled architecture can be offset by the savings: instead of buying really big machines sized for peak load, they scale up for the peaks and back down to a couple of small instances during periods of below-average load.

That's (ostensibly) why AWS exists in the first place: Amazon had to buy a lot of peak capacity for Black Friday and Christmas and found it going unused the rest of the year. They never meant to sell their excess capacity, but they realized the tools that they built to dynamically scale their infrastructure were valuable to others.

Plus, a lot of work is offline data analytics, ETL, and so on. It's very cost effective to scale these workloads horizontally on-demand - spin up extra workers to run your reporting each hour/night and keep costs down the rest of the time when you don't need the capacity.
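
As a concrete sketch of that pattern (hypothetical names and sizes, nothing from a real setup), a pair of scheduled scaling actions can grow a worker group for the nightly reporting run and shrink it back down afterwards:

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    # Grow the reporting workers at 01:00 UTC every night...
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="reporting-workers",
        ScheduledActionName="nightly-scale-up",
        Recurrence="0 1 * * *",  # cron syntax, evaluated in UTC
        MinSize=0,
        MaxSize=20,
        DesiredCapacity=20,
    )

    # ...and drop back to zero once the reporting window is over.
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="reporting-workers",
        ScheduledActionName="nightly-scale-down",
        Recurrence="0 3 * * *",
        MinSize=0,
        MaxSize=20,
        DesiredCapacity=0,
    )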

On the flip side, companies like Stack Exchange and Basecamp have high, relatively stable traffic worldwide. For companies like this it makes more sense to scale vertically - if they were in the cloud, they would never scale down or shut down their instances anyway.

Personally, I agree that horizontal scalability is oversold and most people can, indeed, scale up instead of out. However, plenty of smart people disagree with me and have valid reasons to scale horizontally, too.


> a typical enterprise SaaS startup gets more traffic during work hours than on weekends.

You still need to compare what you can get renting dedicated hardware vs. renting virtual machines. For example, a dual Xeon X5670 machine w/ 96GB RAM and 4x480GB SSD can be had for $249 per month (just something random I found for demo purposes). Even with a one-year reserved instance on EC2, that kind of money gets you an m3.2xlarge, and that's only 30GB RAM and 2x80GB SSD.

It might be worth it to rent this sort of iron instead of spinning EC2 instances up and down, especially if you can reasonably get a machine large enough to cut out a lot of the headaches arising from distributed computing. The right tool for the job.

Owning hardware is again a different bag of hurt.


> I thought the answer to that question was to scale up if you can, because it's much simpler and therefore cheaper

✱cough✱ SQL Server licensing fees ✱cough✱


That's a good point that I overlooked in my post - I'm sure this is a huge consideration for Stack Exchange.

For what it's worth, Basecamp also evangelizes the "scale up" approach, and they're on an open-source stack.


> How many ops employees would you need for a fleet of 500 servers in a datacenter? We managed it all with 4 people with AWS.

This could be a false dichotomy. Just because a service with AWS uses so many servers doesn't mean a more monolithic system would need as many.

We had talks with one of our competitors (before they were a competitor). We mentioned that we ran our infrastructure on 4 large VM hosts (with a light density of 3-4 VMs per host). They were shocked. At the time, they were running over a hundred EC2 instances, plus the relevant satellite services. They literally could not believe that we could provide a comparable service without reliance on something like AWS.

It's amazing what can be done with the right knowledge. In our case, myself and one of my coworkers maintain our infrastructure at something like 4-6 hours per week total (mostly patching, reviewing logs, etc). We both have previous networking, hardware, and software experience. When we do major upgrades (about every 2 years), it takes one of us about a week to source the hardware, get it loaded into the rack, and turned on. Then we migrate guests over and we're done. This doesn't even get into the cost savings of running on our own hardware vs. AWS pricing.


We run about a hundred servers (soon to be lots more) with a part time staff of 3 (as in, we all do dev work most of the time). It used to be mostly me for ages, but we got big enough that I got promoted out of most of the day-to-day stuff.

All our own hardware. Having just had a reboot on SoftLayer's schedule to fix the Xen issue for a separate project we're running on their gear, being able to schedule your own maintenance windows is so much nicer. We spend less time dealing with problems on our own hardware than we do dealing with cloud providers having issues.


At my previous company, I built a dedicated hardware system that consistently delivered a sub-second response time at a cost of about a third of what it would cost to host on AWS.

After I left, the new CTO who replaced me migrated it to Rackspace's cloud offering. I don't know the costs involved, but now the site averages an 8s load time.

You can't really beat the raw performance for the price of dedicated equipment.


Where I am currently working, thanks to good automation procedures, there are only 3 people managing 4 datacenters on 3 continents with over 1000 virtual machines and a couple of hundred physical servers. Those 3 people are one Linux sysadmin, one network guy, and one VMware guy. None of them works full-time on maintaining the infrastructure, just on patching/upgrading/installing new systems, and that's 1-2 days a week at most. I have now finished the plan for the migration of two datacenters, and that process takes about 2 months with shipping/networking/configuring/installing machines.

I really don't understand the obsession of getting rid of ops / hardware guys and relying on Amazon/Google/CoolCloudProvider to handle everything.


I worked for a large SCADA company. We collected large amounts of data from thousands of large industrial installations.

One day we got a new VP, a "cloud expert" from a well-known firm. He moved (nearly) all of our infrastructure to AWS, after producing untold numbers of spreadsheets/PowerPoints showing how much cheaper/better/faster it was going to be.

Long story short, it was 4x as expensive as running it in-house. By the time they went back to our own infrastructure, most of the internal sysops (including "The Glue" guy) had moved on and much of the old internal hardware had been re-purposed or was gone. It was a fiasco that they still have not fully recovered from.

I would be very careful in characterizing AWS as the solution for every large scale computer infrastructure problem.

Conversely, I have had excellent experiences with AWS in my current job, although we still have a rather large HPC cluster internally which would never make sense to move to AWS.


> How many ops employees would you need for a fleet of 500 servers in a datacenter? We managed it all with 4 people with AWS.

I'd say our goal is to keep growing and serving more content without _needing_ 500 servers in a data center. We are doing pretty well at that so far. We'll see what happens in the future.


The cloud is awesome if you have zero interest whatsoever in hardware. It's not without trade-offs (nor is the other direction), and too many for us - but if it works for you then great I say.

We obviously feel very different, and are just doing what works best for us.


most of the public cloud cheer-leading is just people rationalizing to themselves what an awesome decision they made deploying on aws onto a billion tiny instances or whatever. and for lots of folks, it probably is pretty awesome.

however, i've seen very, very few people compare actual before/after $ figures on hn. when it comes time to show your cards everyone gets cold feet, either because they 1. don't have any idea because they aren't the ones paying for it, or 2. don't have a baseline for comparison and are just paying whatever amazon asks, or 3. found it ended up costing 2-4x as much on amazon

when your bill is $5k/month 2-4x isn't that big of a deal. when it's $100k+ a month, it becomes a really big deal.


> We obviously feel very different, and are just doing what works best for us.

And doing it very well, if I may say so.


Actually, occasionally you do need to think about it. See the various times AWS has emailed customers about unplanned outages that need to occur because of hardware issues/patching/etc.

Cloud is great, but there are still plenty of reasons to run your own datacenter. Yes, for many startups it might not make sense, but at a certain size of company/application it can easily make sense.


AWS just announced a few days ago that their latest Xen patch will be deployed through a live update to their hypervisor kernel, and that going forward they expect patches like this to be rolled out live.

The real upside of AWS is that they have relentlessly pursued and killed off reasons for you to care about things like this. They've eliminated points of failure in their infrastructure and given operators a wealth of tools to ensure their apps stay up through any update or event (AZ-affine ELBs and autoscaling groups, single-IP ELBs, continuous improvements to EBS and S3, etc.) Given the scale of their infrastructure in us-east-1, it's now also highly unlikely that any customer will manage to overload it on their own.


I can't resist reminding you that 1 command from a sysadmin routing traffic to the wrong network was the cause of the last major outage there :)

They are getting much better, as all providers are. They're still just not a fit for many people because of performance requirements that are either impossible or too costly to meet on that type of infrastructure.

I've always said this: the cloud isn't a good fit for us; do what works for you.


> I can't resist reminding you that 1 command from a sysadmin routing traffic to the wrong network was the cause of the last major outage there :)

If you're going to bring that up, I can't resist reminding you of that time you had poor sysadmins running up and down stairs with buckets of fuel to keep your servers running[1].

> I've always said this: the cloud isn't a good fit for us; do what works for you.

There are costs. Frankly, it appears Stack Exchange prefers to lay those on its people rather than its purse.

[1]: http://blog.stackoverflow.com/2012/11/se-podcast-36-we-got-h...


As Kyle says, we were helping our sister company Fog Creek keep their servers online (as well as other people in that facility, like Squarespace) because we cared. Our traffic was not being served from that data center, and in fact we shut down most of our servers during that time to conserve generator fuel. Our traffic was flowing just fine from Oregon - a decision Kyle and I made the night before, when we concluded they would probably shut down power to Lower Manhattan in preparation for flooding.

When your neighbor's house is on fire you don't argue over the price of the hose. You help. Our remote people that couldn't come help in person also helped them replicate their entire network in AWS as a backup plan.

I don't usually post pissed off comments, but you're dead wrong here and, intentionally or not, demeaning a good company and good people who, because they cared, came to help in a time of emergency. I take it you weren't in New York during Sandy; it looked like a post-apocalyptic war zone afterwards.

TL;DR - You don't know what you're talking about.


> intentionally or not demeaning a good company

It's sad that you can't handle a snarky comment (right or wrong) that was given in response to your own snarky comment. If you can't handle being poked and it pisses you off, don't poke others.

Also, you seem particularly offended that your altruism is being maligned, when you're also complaining that the GP was unaware that the action was altruistic in the first place (???).


His comment wasn't snarky in the first place, it was pointing out that mistakes scale too. The response was totally inappropriate.


I'm finding it hard to contrive a debating situation where "I can't resist reminding you..." isn't snarky.


That wasn't Stack Exchange. We failed over to our secondary data center during Sandy.


Are you at the same scale as Stack Exchange?


That's hardly an argument. Netflix, consuming 33% of the nation's bandwidth, is a great counterexample.

In terms of scale, GoGuardian (the company I co-founded) has passed Stack Exchange. Articles like this make me so happy to be on AWS. Delegating this work to AWS allows us to focus on the product instead of the hardware that it runs on.


> That's hardly an argument. Netflix, consuming 33% of the nation's bandwidth, is a great counterexample.

Actually it isn't.

Can AWS handle the scale of StackOverflow?

Absolutely. Netflix is a GREAT example of the scale AWS can handle.

However, can Netflix migrate their entire system off of AWS without careful planning (which the parent of my comment was scoffing at)?

I'm a huge proponent of AWS, but I wouldn't be so naive to say that AWS allows me to be oblivious of hardware and infrastructure costs.

> In terms of scale, GoGuardian (the company I co-founded) has passed Stack Exchange.

How so? StackOverflow is #56 in the world in Internet Traffic Rank (alexa.com).


Woah, the datacenter you guys moved to is across the street from my apartment. Just a heads up, I considered it for a project myself but ruled it out because... well this is the back corner of that building: http://40.media.tumblr.com/tumblr_lqnh988WRS1qzpdb2o4_1280.j...


That's nothing: https://gigaom.com/2012/10/31/how-good-prep-and-a-bucket-bri...

We were hosted at that Peer 1 facility during Sandy. Also, if the water makes it up to the 16th floor it's time to give up anyway ... no one will care at that point :-D


Yeah, we got hit hard by it too (33 Whitehall), that's why we went shopping for a second location in Jersey. We wound up in Newark at 165 Halsey.


It's odd to see bollards and speed bumps in a swimming pool that big.



I'm really wondering how SO uses Redis.

For example, how are failure scenarios handled, and what is the role of the slaves?


We have a hot slave at all times for all instances. In the event of a failure the second slave will kick in. All applications are already connected to both servers via the StackExchange.Redis (https://github.com/StackExchange/StackExchange.Redis) library. The mechanisms for failover, etc. are built in there. We use this library via the dashboard (pictured at the bottom of the post) in Opserver (https://github.com/opserver/Opserver) to do quick swapping of master/slave, etc. We can do this during the day without anyone noticing.

Slaves are not just for backups - they also serve as pub/sub mechanisms. Every publish propagates to slaves. Since we use pub/sub for things like web sockets, we can easily move that entire concurrent connection load to another data center with a simple DNS change. Yep, we've tested this - it worked well.
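
If you want to see that pub/sub propagation behavior for yourself, here's a tiny sketch in Python with redis-py and made-up hostnames - to be clear, our apps use the StackExchange.Redis client linked above, not this:

    import redis

    # Hypothetical hostnames, for illustration only.
    master = redis.Redis(host="redis-master.example.com", port=6379)
    replica = redis.Redis(host="redis-replica.example.com", port=6379)

    # Subscribe on the replica...
    pubsub = replica.pubsub()
    pubsub.subscribe("websocket-events")

    # ...then publish on the master. The publish propagates over the
    # replication link, so subscribers attached to the replica receive it too.
    master.publish("websocket-events", "new-answer:12345")

    for message in pubsub.listen():
        if message["type"] == "message":
            print(message["channel"], message["data"])
            break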

It is of course worth noting: Redis just doesn't fail. We had one out-of-memory failure for the Q&A web sites when forking, and that's it. It has been rock solid here.



That's a really nice cabling job. I've inherited 2 DCs with just-awful-enough cabling that I'd be exceptionally happy to set up new racks again and migrate, just to sort the bloody cables out.

Sigh, don't think it's gonna happen.

As a customer of Dell, it was interesting to see someone else's insight into managing a Dell farm.

So neat! I am mildly envious! ;)


Thanks! We honestly do this for us and my mild cabling OCD, but if we can help anyone by sharing the details then it's worth it.


I'm curious to know what your perceptions of the FX2s are, now that they've been racked up.

(EDIT: Over and above what's in the article... I'm asking because they look quite nice - we haven't got any... yet! Still performing well?)


They are nice; we honestly haven't had to do much with them after we got them set up, since they are VM hosts.

The most annoying part was that the Force10 OS the IOAs run is slightly awkward to work with when you are used to working on Cisco gear. I'd almost rather people stopped doing "it's close to Cisco" and did their own thing, because the cognitive dissonance is jarring when something is close but not quite what you expect.


We've got 2 M1000e's, but they're slowly being retired and TBH they're in a 'do-less not do-more' situation (mostly!), so I honestly haven't had much experience past run-of-the-mill blade config, and that particular issue happily hasn't really bitten me :)

But thanks (to both you and Nick) for responding, the FX2s look interesting enough that I might have to see about an eval unit :)


I haven't had nearly as much experience as George and Shane have with them after initial spin up. I'll get one of those guys to chime in here.


Minor nitpick - it isn't mild (which is why it looks so damn amazing :-)


You've had us looking for your model of label-maker :)


> We learned that accidentally sticking a server with nothing but naked IIS into rotation is really bad. Sorry about that one.

Wait, does that mean what I think it means? Did someone get an IIS splash page when visiting SO?

I was going to say they should live-blog during their next upgrade, but they did it on Twitter, which is awesome.



How long did it take to figure out it occurred? Do you have any stats on how many people hit that machine?

Would be hilarious if someone who went to SO to figure out why the IIS splash was showing instead of his app happened to hit that page.


It took about 20 minutes to notice and fix. The Twitter alerting system let us know really quick.


Why is their load balancer not configured to take servers like that out of rotation?


It normally would, except the base IIS page returns a proper 200 to all requests. The load balancer removes any servers having issues - but from HAProxy's point of view, things were dandy.

As always, we are looking at how to adjust our deployment process to prevent it from happening in the future.
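
One likely direction (sketched here in Python purely for illustration - our actual stack is IIS/.NET) is a dedicated health endpoint that only answers once the application itself is loaded, returning a marker the load balancer can verify rather than a bare 200. A box running nothing but naked IIS would then fail the check; HAProxy's http-check expect can look for that marker instead of treating any 200 as healthy.

    # Minimal sketch of an app-aware health endpoint (Flask, hypothetical route).
    from flask import Flask, jsonify

    app = Flask(__name__)
    APP_VERSION = "2015.03.05.1"  # hypothetical build stamp written at deploy time


    @app.route("/ping")
    def ping():
        # Only the real application serves this marker; a default web server
        # splash page never would, so the load balancer can tell them apart.
        return jsonify(status="ok", version=APP_VERSION), 200


    if __name__ == "__main__":
        app.run(port=8080)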


Everyone I know big enough to run a load balancer has had this happen.

The next step in evolution is to have your script look for specific text on a page, which will change 24 months later and have all the perfectly good servers pulled out of rotation.


Does bare IIS respond with a 200 to /ping or /sping?


We generally pick a hashtag on twitter and do it that way. We've also done Hangouts On Air for non-hardware in the past which have been pretty fun as well.


If by fun you mean a 4 hour maintenance window turning into 8 hours ending well after midnight :-P


I never claimed to be sane


I think what impressed me most was the power density that data center is supporting. That looks like nearly 10kW of power in a single rack! Many providers top out at 4kW and HE2 in Fremont was just 1.8kW per rack when I was there.


> Most companies depreciate hardware across 3 years

Can somebody explain why exactly? Was there ever some large-scale research with the outcome that this was somehow the cheapest? Or is it a mere byproduct of the consumption-based society? I'm asking because I've used lots of different types of hardware (both computing and non-computing) and >90% lasted well over 3 years. It seems a waste of time/money to get rid of it after 3 years. Which matches with SE's findings: they chose 4 years. Still not that much, but already 'better'.


The default accounting rules say 3 years (or maybe "minimum 3 years"). Therefore older hardware is assumed to have no value, unless someone takes the time to think about it.


Most leasing contracts are on a 3 year basis.


Ok yes, but why 3?


Reading this brought back fond memories. Or nightmares. Sometimes it's difficult to distinguish between the two.


If anything you can see the enthusiasm and passion they have for their job. It's inspiring!


Thanks for sharing! Happy to see some of my past work in action.



