
Second this. For reference: 100K+ users on a very dynamic site served from a single box with some partial caching. The machine is a fair-sized one (lots of RAM, fast disks, 32 cores), but I see no advantages from using multiple machines where a single one will do.

I've seen quite a few places with over-complex setups due to gross inefficiencies in the software they deployed.

KISS is good advice, even in 2014.




I don't think this applies to you. Loads of RAM and 32 cores sounds like you already had to scale, and as another comment points out, you seem to have forgotten about redundancy.

I'll add that security is a concern in that case, too. Getting access to your frontend server means getting access to your whole solution, while on a properly architected solution that wouldn't be true. In any case I don't know the details of your situation so I might be wrong, but I highly doubt it :(

Edit: You just left an answer to this on a different part of this thread. I'll add a reply to your answer here:

>>> MTBF is a function of the number of parts in your system. A single machine will have substantially less risk of breaking down than a complex setup. In the past, when servers were not this powerful and the system consisted of 9 (!) servers (5 web front ends, a load balancer, replicated DB back ends and a logging host), we had a lot of problems due to bits & pieces failing.

That's not true. A single machine has a higher chance of breaking than a complex setup... if that setup is properly designed. It doesn't matter if you have 5 frontends and replicated DBs if you're going to have one single load balancer. Also, redundancy is not only about hardware, but about processes. If a critical process fails and there are no plans or processes to avoid disaster, then it doesn't matter if you've got every single piece of hardware at least duplicated.

I mostly agree with the rest of your answer and am happy to see that going simpler suits you. There's no "final solution design"; it depends a lot on your company's particularities and your application design.


Well, whether you have a single load balancer, a single uplink, a single upstream router or a single data center is not really relevant. Unless you go fully distributed (across multiple data centers), those things are almost equivalent from a risk perspective. I don't think the load balancer ever broke, but even if it had it would not have been the end of the world.

What matters is that you stay away from complexity as long as you can't afford to expend your time, energy and funds on it.

And if you have to then go for it.


Would strongly disagree with this. Not all parts of your stack are equal. It is far, far more likely that your application tier will fall over or, say, suffer a full GC than that your load balancer will fail. You've also eliminated the ability to do rolling restarts and many other redundancy activities.

Staying away from complexity is one thing. But only if you are running some toy site.


What about RAID? If I follow your logic, you're claiming that by adding more hard drives to a RAID array I'm increasing the risk of data loss?


No, that means you're not following my logic. Hot-swappable power supplies and/or drives obviously improve your reliability (assuming you don't configure for speed but for redundancy). But they're not a guarantee against data loss; only off-site back-ups that you test are a guarantee for that.

After all, if your controller goes on the blink, your precious RAID could easily die with it.


Also known as common sense :)


> I see no advantages from using multiple machines where a single one will do

One machine is OK when downtime is OK too. Such situations do exist, so it is sometimes a viable solution.

If you're serving traffic to 100K+ users then chances are that's not one of those situations. How are you handling redundancy? Or is possible downtime just an accepted risk?


MTBF is a function of the number of parts in your system. A single machine will have substantially less risk of breaking down than a complex setup. In the past, when servers were not this powerful and the system consisted of 9 (!) servers (5 web front ends, a load balancer, replicated DB back ends and a logging host), we had a lot of problems due to bits & pieces failing.

Moving it all to one box was an interesting decision; it has paid off handsomely over time.

Redundancy is a good thing to have, obviously. But it is not simple (or cheap) to get it right. This machine has redundant power supplies and redundant drives, and we back up multiple times per day. Worst case (a total system failure or a fire in the hosting center) we'd be down for a while, but that exact scenario has hit us once before (we were an EV1 customer when their datacenter had a fire) and we came through quite well.

It all depends on the kind of service you are running, what your competitive space looks like and how much money you can throw at the problem.

But for the majority of web apps, especially when funds are critical and you're concentrating on the business side of things rather than the tech, you will find that having it all on one box allows you to focus on your immediate problems rather than on how to stay on top of all the complexities that running a distributed application brings.


> MTBF is a function of the number of parts in your system. A single machine will have substantially less risk of breaking down than a complex setup.

A. The probability that at least one component in your system fails increases as the number of components increases.

B. The probability of the entire system failing decreases as the number of components increases.

Where (B) goes wrong is if the system is designed in such a way that components are dependent on each other.

Imagine you have a system containing 4 parts, each of which has to have at least one operating component for the system to remain operational. The components are:

WS = Web Server
DB = Database Server
AS = Application Server (executing long-running tasks)
LB = Load Balancer

Each component has a different probability of failure on a given day, given here:

WS, AS = 0.001
DB = 0.002
LB = 0.000001

If you do this:

10 x WS = (0.001)^10
10 x AS = (0.001)^10
1 x DB = (0.002)^1
2 x LB = (0.000001)^2

Then the probability of failure is roughly 0.002, because if the single database fails then the system fails. To increase redundancy you need to increase the number of DB servers too. If you have two DB servers, then the probability drops to 0.000004, 500 times lower.
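To make that arithmetic concrete, here's a small Python sketch (mine, not part of the original comment) that assumes failures are independent and that a tier only goes down when every copy in it fails:

    # Sketch of the redundancy arithmetic above (assumes independent failures).

    def tier_failure(p_single, copies):
        """A tier fails only if every one of its copies fails on the same day."""
        return p_single ** copies

    def system_failure(tiers):
        """The system fails if at least one tier has no surviving copy."""
        p_all_tiers_ok = 1.0
        for p_single, copies in tiers:
            p_all_tiers_ok *= 1.0 - tier_failure(p_single, copies)
        return 1.0 - p_all_tiers_ok

    # 10 web servers, 10 app servers, 1 DB, 2 load balancers:
    print(system_failure([(0.001, 10), (0.001, 10), (0.002, 1), (0.000001, 2)]))
    # ~0.002 -- dominated entirely by the single database

    # Same setup with a second DB server:
    print(system_failure([(0.001, 10), (0.001, 10), (0.002, 2), (0.000001, 2)]))
    # ~0.000004 -- roughly 500 times lower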

I believe that you really did experience problems with your setup, and I'll hazard a guess that the root cause has nothing to do with the architecture of your system but everything to do with the exponential increase in SNAFUs caused by the extra complexity.

Hardware failures are rare, people failures are common.


> I'll hazard a guess that the root cause has nothing to do with the architecture of your system but everything to do with the exponential increase in SNAFUs caused by the extra complexity.

Almost :) It has more to do with the fact that no amount of testing such a setup under realistic conditions, modelling all the potential failure modes you can think of, is a match for the variety of ways in which a distributed system can actually fail. Network cards that still send but don't receive? Check (heartbeat thinks you're doing a-ok). Link between two DCs down, DCs themselves still up and running? Check... and so on.
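For the "sends but doesn't receive" case, here is a minimal sketch of my own (host names made up) of a naive one-way UDP heartbeat. The monitor only ever learns that the node can transmit, so a NIC that can still send but no longer receive keeps looking healthy:

    # Naive one-way heartbeat: only proves the node can SEND, not that it can RECEIVE.
    import socket
    import time

    MONITOR_ADDR = ("monitor.internal.example", 9999)  # hypothetical monitor host

    def node_heartbeat_loop():
        """Runs on the application node: fires an 'alive' packet every second."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        while True:
            sock.sendto(b"alive", MONITOR_ADDR)  # still succeeds when inbound traffic is dead
            time.sleep(1)

    def monitor_loop(timeout=5.0):
        """Runs on the monitor: declares the node down only if heartbeats stop arriving."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("0.0.0.0", 9999))
        sock.settimeout(timeout)
        while True:
            try:
                _, addr = sock.recvfrom(1024)
                print(f"{addr} looks healthy")  # yet the node may be unable to serve a single request
            except socket.timeout:
                print("node down")              # the only failure this scheme can actually detect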

Doing this right is extremely hard, and even the best of the best still get caught out (witness Amazon and Google outages, and I refuse to believe they don't know their stuff).

Hardware failures are rare, people failures are common, distributed systems are hard.


>A single machine will have substantially less risk of breaking down than a complex setup.

You're being pretty disingenuous here.

A single machine breaking down means an outage. A component in a complex system breaking down means no outage. Again, if you are running a toy site then sure, go with the single machine.

And redundancy is very easy to get right if you are using something like AWS or even DigitalOcean: a provider-based load balancer + app tier + multi-master database like Cassandra.
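To illustrate the idea behind that load balancer + app tier layer, here is a toy sketch of my own (not a real provider API, host names made up): requests only go to instances that pass a health check, so a single app server dying does not become an outage.

    import random
    import urllib.request

    APP_SERVERS = ["http://app1.internal:8080",   # hypothetical app-tier instances
                   "http://app2.internal:8080",
                   "http://app3.internal:8080"]

    def is_healthy(base_url, timeout=1.0):
        """Treat an instance as healthy if its /health endpoint answers 200."""
        try:
            with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    def pick_backend():
        """Route to any healthy instance; it takes ALL of them failing to cause an outage."""
        healthy = [s for s in APP_SERVERS if is_healthy(s)]
        if not healthy:
            raise RuntimeError("no healthy app servers left")
        return random.choice(healthy)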


And developers who are not full-stack focused (more and more developers) make bad architecture decisions early on and often don't consider whether future downtime is an acceptable risk, therefore don't plan for it, and therefore build applications that don't scale well.

It's fine to say that in most instances you don't need more than one server, that you can scale "physically" - but it's unfair to suggest you shouldn't consider the risks involved and the (sometimes very rapid) needs at future scale.


Curious: who do you use as a host for 32 cores with a ton of RAM? I also favor this vertical scaling over splitting across multiple servers and replicating the database with a load balancer sitting on top. Best to deal with one point of failure; if you need network concurrency, up the RAM and bandwidth. RAM should be easily upgradeable up to a decent double-digit gig figure, and likewise for disk space and backup.


I bought the box and co-located it.

It's a run-of-the-mill HP; total cost including memory was under $10K.



