If you are just starting, you should have the simplest setup - everything on one server - and scale it only when it becomes necessary. Premature scalability adds complexity and slows down your iterations.
My setups usually consist of nginx serving static content and proxying application requests (doing gzip, etc.). The data tier is initially collapsed into the application, as described in http://www.underengineering.com/2014/05/22/DIY-NoSql/. This architecture allows very fast iterations while providing enough performance headroom; it can serve 10k simple (CRUD) HTTP requests per second on a single core.
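The data-tier idea boils down to keeping the working set in process memory and persisting it out of band. A minimal sketch of that idea in Python (illustrative only, not the implementation from the linked post):

    import threading

    # All "hot" data lives in an in-process dict guarded by a lock; reads and
    # writes are plain memory operations, which is where the headroom comes from.
    _store = {}
    _lock = threading.Lock()

    def put(key, value):
        with _lock:
            _store[key] = value

    def get(key, default=None):
        with _lock:
            return _store.get(key, default)

Persistence (snapshots to disk or object storage) happens out of band, so the request path never waits on I/O.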
I'm falling behind on security updates because I built an all-in-one box like this. While I mostly agree with you, I now wish I had separated out the db on day 1 into a private network so I could maintain a stateless app tier and update the public-facing OS image with no downtime or risk.
You might want to check with your hosting provider. I know with Linode I can "clone" my running server, do the upgrades on it, and then swap the IP addresses of the clone and the original (so now the clone is my main server). It's not the most convenient process, but it should work with minimal downtime.
You could do the same thing without the IP swap trick using DNS; it just takes a bit longer to propagate. The real catch is what happens to data that was added or changed between when you clone the server and when you make the switch. Ideally you'd want to put your site in a read-only mode if possible.
I think rolling release distros are your friend for these type of all-in-one setups. Small weekly updates. Easy to test too because you only need one test machine.
>I think rolling release distros are your friend for these type of all-in-one setups. Small weekly updates. Easy to test too because you only need one test machine.
I think the important thing here is a distro that tests its changes well, and one that doesn't force you into a major upgrade (where you have to change your configs) very often.
Distributions like Debian, which want you to do a major upgrade every two years or so, are, I think, more difficult to deal with: if you have anything at all custom and the config file format changed, the upgrade is going to require work and testing on your part to move the configs over, even if the developers test perfectly (and nothing is perfect).
I think RHEL/CentOS is best, assuming that the latest RHEL/CentOS carries all the packages you need in-distro. (If you need to step outside of the distro repos, that kind of defeats the point. Maintaining a package yourself on an ancient distro gets old fast, and most of the smaller 3rd-party repos don't put as much effort into keeping the old package versions patched up.)
That's the thing, sure, you have to format and re-install for a major upgrade, but you have ten years before you have to worry about that.
Second this. For reference: 100K+ users on a very dynamic site served from a single box with some partial caching. The machine is a fair sized one (lots of ram, fast disks, 32 cores) but I see no advantages from using multiple machines where a single one will do.
I've seen quite a few places with over-complex setups due to gross inefficiencies in the software they deployed.
I don't think this applies to you.
Loads of RAM and 32 cores sounds like you already had to scale and as another comment points out, you seem to have forgotten about redundancy.
I'll add that security is a concern in that case, too. Getting access to your frontend server means getting access to your whole solution, while in a properly architected setup that wouldn't be true. In any case I don't know the details of your situation so I might be wrong, but I highly doubt it :(
Edit: You just left an answer to this on a different part of this thread. I'll add a reply to your answer here:
>>> MTBF is a factor of the number of parts in your system. A single machine will have substantially less risk of breaking down than a complex setup. In the past (when servers were not this powerful) and the system consisted of 9 (!) servers, 5 web front ends, a load balancer, replicated DB back ends and a logging host we had a lot of problems due to bits & pieces failing.
That's not true. A single machine has a higher chance of taking you down than a complex setup does... if that setup is properly designed. It doesn't matter if you have 5 frontends and replicated DBs if you're going to have one single load balancer. Also, redundancy is not only about hardware but about processes. If a critical process fails and there are no plans or processes to avoid disaster, then it doesn't matter if you've got every single piece of hardware at least duplicated.
I mostly agree with the rest of your answer and am happy to see that going simpler suits you. There's no "final solution design"; it depends a lot on your company's particularities and your application design.
Well, whether you have a single load balancer, a single uplink, a single upstream router or a single datacenter is not really relevant. Unless you go fully distributed (across multiple data centers), those things are, from a risk perspective, almost equivalent. I don't think the load balancer ever broke, but even if it had it would not have been the end of the world.
What matters is that you stay away from complexity as long as you can't afford to expend your time, energy and funds on it.
Would strongly disagree with this. Not all parts of your stack are equal. It is far, far more likely that your application tier will fall over or, say, suffer a full GC pause than that your load balancer will fail. You've also eliminated the ability to do rolling restarts and many other redundancy activities.
Staying away from complexity is one thing. But only if you are running some toy site.
No, that means you're not following my logic. Hot-swapping power supplies and/or drives obviously improves your reliability (assuming you configure for redundancy rather than speed). But they're not a guarantee against data loss; only off-site backups that you test are a guarantee for that.
After all, if your controller goes on the blink, your precious RAID could easily die with it.
> I see no advantages from using multiple machines where a single one will do
One machine is OK when downtime is OK too. Such situations do exist, so it is sometimes a viable solution.
If you're serving traffic to 100K+ users then chances are that's not one of those situations. How are you handling redundancy? Or is possible downtime just an accepted risk?
MTBF is a factor of the number of parts in your system. A single machine will have substantially less risk of breaking down than a complex setup. In the past (when servers were not this powerful) and the system consisted of 9 (!) servers, 5 web front ends, a load balancer, replicated DB back ends and a logging host we had a lot of problems due to bits & pieces failing.
Moving it all to one box was an interesting decision, it has paid off handsomely over time.
Redundancy is a good thing to have, obviously. But it is not simple (nor cheap) to get it right. This machine has redundant power supplies, redundant drives and we back-up multiple times per day. Worst case (a total system failure or a fire in the hosting center) we'd be down for a while but that exact scenario has hit us once before (we were an EV1 customer when their datacenter had a fire) and we came through that quite well.
It all depends on the kind of service you are running, what your competitive space looks like, and how much money you can throw at the problem.
But for the majority of web apps, especially when funds are tight and you're concentrating on the business side of things rather than the tech, you will find that having it all on one box allows you to focus on your immediate problems rather than on how to stay on top of all the complexities running a distributed application brings.
> MTBF is a factor of the number of parts in your system.
> A single machine will have substantially less risk
> of breaking down than a complex setup
A. The probability that a single component in your system fails increases as the number of components increases.
B. The probability of the entire system failing decreases as the number of components increases.
Where (B) goes wrong is if the system is designed in such a way that components are dependent on each other.
Imagine you have a system containing 4 parts, all of which have to have at least one operating component for the system to remain operational. The components are:
WS = Web Server
DB = Database Server
AS = Application Server (executing long-running tasks)
LB = Load Balancer
Each component has a different probability of failure on a given day, given here:
WS, AS = 0.001
DB = 0.002
LB = 0.000001
If you do this:
10 x WS = (0.001)^10
10 x AS = (0.001)^10
1 x DB = (0.002)^1
2 x LB = (0.000001)^2
Then the probability of failure is 0.002, because if the database fails then the system fails. To increase redundancy you need to increase the number of DB servers too. If you have two DB servers, then the probability is 0.000004, 500 times lower.
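The arithmetic generalizes. A quick sketch of the same calculation in Python, assuming independent failures and that a tier is down only when every one of its replicas is down:

    # Probability that the whole system is down on a given day.
    def system_failure_probability(tiers):
        """tiers: list of (per_node_failure_probability, replica_count)."""
        p_up = 1.0
        for p_fail, count in tiers:
            p_tier_down = p_fail ** count   # all replicas of this tier down
            p_up *= (1 - p_tier_down)       # ...otherwise the tier survives
        return 1 - p_up

    # WS/AS/LB replicated, single DB: dominated by the DB at ~0.002
    print(system_failure_probability([(0.001, 10), (0.001, 10), (0.002, 1), (0.000001, 2)]))
    # Add a second DB and it drops to ~0.000004
    print(system_failure_probability([(0.001, 10), (0.001, 10), (0.002, 2), (0.000001, 2)]))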
I believe that you really did experience a problem with your setup, and I'll hazard a guess that the root cause is nothing to do with the architecture of your system but everything to do with the exponential increase in SNAFUs caused by the extra complexity.
Hardware failures are rare, people failures are common.
> I'll hazard a guess that the root cause is nothing to do with the architecture of your system but everything to do with the exponential increase in SNAFUs caused by the extra complexity.
Almost :) It has more to do with the fact that testing such a setup under realistic conditions, modelling all the potential failure modes, is no match for the variety of ways in which a distributed system can fail. Network cards that still send but don't receive? Check (heartbeat thinks you're doing a-ok). Link between two DCs down, DCs themselves still up and running? Check... and so on.
Doing this right is extremely hard, and even the best of the best still get caught out (witness Amazon and Google outages, and I refuse to believe they don't know their stuff).
Hardware failures are rare, people failures are common, distributed systems are hard.
>A single machine will have substantially less risk of breaking down than a complex setup.
Being pretty disingenuous here.
A single machine breaking down means an outage. A redundant system losing a component means no outage. Again, if you are running a toy site then sure, go with the single machine.
And redundancy is very easy to get right if you are using something like AWS or even DigitalOcean: a provider-based load balancer + app tier + multi-master database like Cassandra.
And developers who are not full-stack focused (more and more developers) make bad architecture decisions early on, often don't consider whether future downtime is an acceptable risk, therefore don't plan for it, and therefore build applications that don't scale well.
It's fine to say in most instances you don't need more than one server that you can scale "physically" - but it's unfair to suggest you shouldn't consider the risks involved and the (sometimes very rapid) needs at future scale.
Curious: who do you use as a host for 32 cores with a ton of RAM? I also favor this vertical scaling over splitting across multiple servers and replicating the database with a load balancer sitting on top. Best to deal with one point of failure; if you need network concurrency, up the RAM and bandwidth. RAM should be easily upgradeable to a decent double-digit gig figure, and the same goes for disk space and backup.
I was surprised with how long we were able to get away with the simplest setup: everything in one server. It was about ~2 years. I'm very glad I didn't introduce complexity in the setup from the get-go.
+1 again. Followed the simple setup at our startup. Worked for ~6 years. We've been growing and scaling the backend architecture as user needs have grown. We're now at the blended #6. Very glad we waited to optimize until business needs made certain choices imperative. It also keeps the server budget optimized "just in time."
I had a startup experience rapid growth. Guess how long it took me to add capacity: minutes, with zero outage. Basic redundancy is not some scary, impossible monster to implement. It is the standard.
This is good advice, but I would also recommend using a config manager like Puppet right from the off. That way, when it comes to scaling and rebuilding, you know exactly what state your one good server is in and aren't bitten by that random, super-important config some admin ssh'd in to set and then forgot about last year.
I agree everything on one box makes sense very early on. I'd add that you do want to put some time into designing your architecture so that it's easier to decouple, so you can iterate and separate services when you have a moment to do so, or you can scale when you absolutely have to.
Spending a few hours early on thinking about the following will save you days of headaches:
Your app can eventually talk to a database on a different machine.
Your web app is either stateless, or has an external session cache.
You can have a process that is not your web app process, do "jobs."
And above all else, make those items "configurable" so you can easily swap them when needed. Rails' use of `database.yml`, which can be easily modified by a Chef recipe, is a good example of getting configuration right early.
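`database.yml` is Rails-specific, but the principle carries to any stack. A minimal sketch in Python (the variable names are illustrative, not from any particular framework): every external dependency is resolved from configuration, with defaults that match the single-box setup.

    import os

    # Defaults assume everything runs on one box; when the database, session
    # store or job queue moves to its own machine, only the environment (or a
    # Chef/Puppet-managed config file) changes -- not the application code.
    DATABASE_URL = os.environ.get("DATABASE_URL", "postgresql://localhost:5432/app")
    SESSION_STORE_URL = os.environ.get("SESSION_STORE_URL", "redis://localhost:6379/0")
    JOB_QUEUE_URL = os.environ.get("JOB_QUEUE_URL", SESSION_STORE_URL)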
Personally, I would (and did) start directly with solution 3, i.e. 2 FE instances and a single db. Switching later can be costly, and you can find a lot of bugs/things not working just because you didn't think about them (e.g. bad use of sessions, local caching of shared objects...).
It probably depends upon the size of the project you have in mind, but if you think you'll have to scale, imho solution 3 is not that much of an extra burden and it forces you to do things better.
I agree totally. This is not premature optimization, this is planning for the future. Another thing is to script your different server types from day 1. If you can't take the time to learn how to use Puppet or Chef, then just script it with bash. You will thank yourself later.
The worst thing is to stand in the 11th hour when your app is a raging success and having to plan for infrastructure upgrades and potential downtime right at the moment when you would be hurt the most by downtime and upgrade related problems.
I certainly agree with the overall idea, particularly in the context of 'just starting'.
However, I find it odd that it's seen as such a wide-ranging absolute in these comments. These days, I'm not given to thinking of the number of 'machines' as the sole or even primary axis of complexity.
I don't necessarily agree with that. I think it's important to think about your application at scale and whether it makes sense for you to accept a slightly more complex beginning in exchange for a much easier time when you hit scale. A lot of my job is helping people who are stuck on a single box with an increasingly complex, rapidly scaling application and no well-thought-out plan for how to start splitting that app into clustered components.
> I think it's important to think about your application at scale and whether it makes sense for you to accept a slightly more complex beginning in exchange for a much easier time when you hit scale.
That's a text-book example of premature optimization.
> A lot of my job is helping people who are stuck on a single box with an increasingly complex, rapidly scaling application and no well-thought-out plan for how to start splitting that app into clustered components.
That's a fine time to start thinking about how to solve that problem, and there are plenty of good, battle-tested solutions out there. The first one is to cache as much as you can; that will buy you a lot of time to get to a more scale-friendly setup.
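To make "cache as much as you can" concrete, here's a rough sketch of a tiny TTL cache in front of an expensive call (in practice you'd probably reach for memcached/Redis or HTTP-level caching first):

    import functools
    import time

    def ttl_cache(seconds):
        """Cache a function's results for `seconds`; stale entries are recomputed."""
        def decorator(fn):
            cache = {}
            @functools.wraps(fn)
            def wrapper(*args):
                now = time.time()
                hit = cache.get(args)
                if hit is not None and now - hit[1] < seconds:
                    return hit[0]
                value = fn(*args)
                cache[args] = (value, now)
                return value
            return wrapper
        return decorator

    @ttl_cache(seconds=30)
    def expensive_report(customer_id):
        # Stand-in for a slow database query or template render.
        time.sleep(0.5)
        return {"customer": customer_id, "rows": []}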
Remember that if they had spent their precious runway time on thinking about scaling instead of customer acquisition that they probably would not have a company at all at this stage, rather than a solvable scaling problem.
Are you comfortable sharing your application's actual availability? I'd love to throw away my preconceptions about redundancy but I find it hard to believe you can reach five 9s with a single machine.
The biggest issues in reliability are hard drives, power supplies, network interfaces and power infrastructure. At least, over the last 16 years of operating a series of websites those have been the main causes of trouble.
Machine uptime does not say much about service uptime: if the network uplink on one of those machines is down then the users will experience an outage, and a redundant, multi-data-center setup would guard against such a situation.
But that would immediately introduce a whole pile of other problems. For instance, in a multi-master setup it would be quite difficult to recover if the only thing that went down was the peering link between the two data centers, with both locations still accessible from the public internet.
In that situation there is a 50/50 chance that my simple-but-dumb strategy would not even be noticed and a 50/50 chance that we'd be down.
That doesn't mean there are no situations where such a distributed setup would be warranted, but from where I'm sitting the economics just aren't there.
Having regular hardware is no reason by itself why such hardware could not be reliable; regular application stacks perform remarkably well, and the weak points in networking are just as weak when they connect otherwise reliable components across WAN links as when they connect outsiders to your co-location facility.
Once you start scaling up and/or out, the whole equation changes and you need to invest a lot more into planning and testing your setup. Most people find out that their distributed setup was a little less distributed than they thought it was when the first outage hits them. This stuff is very hard to get right, and most companies do not operate at a scale where this is a requirement, nor do they have a 100% uptime requirement. Of course we'd all like to pretend we're that important, but that's a nonsense argument: the only way you're going to get to 100% is by spending an infinity of money. Everything can go down.
Sorry, counter-experience here. I've seen too many shops with over-exuberant usage expectations waste (and I mean waste) months and months of developer time when they really needed to be working on other features. A couple of times it killed the company, because they spent so much time working on (non-existent) scalability issues that they never got around to deploying or addressing the actual needs of the end user. Too much emphasis was placed on getting everything perfect the first time around, rather than staging that out over time as the need arises.
Now, when dealing with what you are talking about, typically I've seen that from developers just making bad decisions in the development process full-stop. E.g. Moving the database to a different server should amount to a config change, not a code change.
Well, sort of. With this architecture (not using a real database) it is fairly easy to save the state to S3 every minute. I don't have an automated way to rebuild the server in case of failure yet, but this is coming as part of https://t1mr.com
It should soon be possible to recover automatically, in a different data center, in matter of minutes. That's tolerable for me.
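For the curious, that periodic save can be as small as something like this (a sketch assuming boto3 and a bucket name of your choosing, not the actual code behind the site):

    import json
    import time

    import boto3  # assumes AWS credentials are configured in the environment

    s3 = boto3.client("s3")

    def save_state_forever(get_state, bucket, key="state/latest.json", interval=60):
        """Serialize the in-memory state and push it to S3 once a minute."""
        while True:
            body = json.dumps(get_state()).encode("utf-8")
            s3.put_object(Bucket=bucket, Key=key, Body=body)
            time.sleep(interval)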
Was about to say that. For certain applications you can't tolerate data loss, so there is no one-size-fits-all solution. For most content-heavy apps, the simplest option works well enough.
The one thing I really want from Digital Ocean is a guide that carefully explains how to set up the "private network" piece of the equation.
The "orange box" that represents the private network in each of the examples is taken for granted, but for someone coming from an application development perspective that piece isn't trivial to make. EC2 Security groups make that sort of box incredibly easy to make, but DO doesn't have anything like that.
The article on setting up and effectively using the private network for what you're describing is actually in our pipeline now. It will probably be live later this week or early next week :)
Great! If that article covered setting up custom hostnames without using the hosts file (local DNS?), that'd be awesome. I don't like those ugly IP addresses.
This "private networking" is private to the every DO droplet in the same datacenter. If you expose a port for private networking in a data center and I get your IP I can access it if I also get a droplet in that data center.
> I think I finally understand how serving web applications works
It's nice to see you got a lot out of the article, but this is hardly a complete course on how the web works from the server side. It is more of a quick guide to a number of common server setups for mid-sized web sites. If you want to learn more about 'how serving web applications works', I suggest you follow one of the how-to guides about setting up a web server of your own and serving up a couple of pages. You won't need any extra hardware for this, all the software is open source and won't cost you a dime. Depending on what kind of operating system you normally use you could start with any of these:
>It's nice to see you got a lot out of the article, but this is hardly a complete course on how the web works from the server side. It is more of a quick guide to a number of common server setups for mid-sized web sites. If you want to learn more about 'how serving web applications works', I suggest you follow one of the how-to guides about setting up a web server of your own and serving up a couple of pages.
I actually meant just the hardware aspect of the setup, sorry for the confusion. That said, I'm still super interested in how the actual serving works. The resources you've provided seem to be exactly what I'm looking for. Thanks so much for providing those.
My experience is in HPC, where 'serving content' actually means 'sending data to other nodes'. The upside of this is that in a compute cluster all the nodes are usually in the same room and are actually located very close together. There's still a lot of networking involved in getting the nodes to communicate, but it's super interesting to me to see how to scale things on the web, where nodes are not necessarily even located in the same country! The example of having the DB and application servers on different machines is a good illustration.
Anyway, sorry for the digression, and thanks again for the links. It'll be bed-time reading for me :)
I really enjoy the community-driven articles/tutorials that DigitalOcean provides. They have documentation for a lot of processes that are not readily documented or still emerging.
Thanks so much! We really appreciate it and are always excited to cover new topics. If you have any suggestions of what you'd like to see in our community, I'd recommend posting them in the comments here: https://www.digitalocean.com/community/articles/digitalocean...
I am hosting all of my stuff on a single VPS instance in Docker/LXC containers. It is reasonably easy to migrate stuff out if I need larger hardware, but it's also very cheap.
Regarding scaling: a couple of years ago I ran a database on a single CPU core (because of licensing issues). It stored 50M rows a day and also executed various queries quite quickly. So I seriously doubt that most of us are going to need large clusters.
He did not say how the data was structured. Someone with simple/skinny indexes would do fine on weaker hardware than someone with complex/wide indexes on the same size dataset.
How do you figure they need Docker for that? Seems like it's what they chose, not that they had an undying belief that Docker is the One-True-Solution.
Website hosted on 1 droplet. An additional droplet per customer is deployed through the Stripe and DO APIs.
DO lets you save a snapshot and load it onto a droplet. I have a snapshot that is basically a copy of my 'software'. It's a LAMP stack with an init script to load the webapp from a git repo.
Customer logs in at username.mywebapp.com
The beauty of this is that I never have to worry about things breaking or becoming a bottleneck. If one customer outgrows their droplet, they won't affect other customers' resources. It scales linearly: new customer, new droplet. I don't need to worry about writing crazy deployment scripts, although I use paramiko to ssh into each server when I need to get dirty.
The main website is mostly static content. I could host it even on Amazon S3 but currently using cloudflare.
Updating the product code requires me to restart the droplet instance. However, I test things out on a separate staging droplet. Once things work there, I use the DO API to iterate through all the customer droplets and do a restart.
The obvious question is why do a new webapp deployment per customer instead of a multi-tenant app? Multi-tenancy requires more code in the webapp to isolate accounts, but it lets you pool and consolidate web servers.
I did it because a new webapp deployment costs nothing; no extra work is involved at all with DO, just add a snapshot and an ssh key to a new droplet. If something needs to be consolidated, I just use the DO API and paramiko to ssh into each droplet and run new commands. If it's updating the webapp across all customers, it's a matter of issuing a restart command to all the droplets via the API.
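Roughly, that looks like the sketch below (written against DO's v2 API with paramiko; the service handling, repo path and key location are assumptions, not the actual setup described above):

    import os

    import paramiko
    import requests

    API = "https://api.digitalocean.com/v2"
    HEADERS = {"Authorization": f"Bearer {os.environ['DO_TOKEN']}"}

    droplets = requests.get(f"{API}/droplets?per_page=200", headers=HEADERS).json()["droplets"]

    # Fleet-wide update: reboot every droplet via the API so its init script
    # pulls the webapp from the git repo again on boot.
    for droplet in droplets:
        requests.post(f"{API}/droplets/{droplet['id']}/actions",
                      headers=HEADERS, json={"type": "reboot"})

    # For "getting dirty", ssh in with paramiko and run an ad-hoc command.
    def run_everywhere(command):
        for droplet in droplets:
            ip = next(n["ip_address"] for n in droplet["networks"]["v4"]
                      if n["type"] == "public")
            ssh = paramiko.SSHClient()
            ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
            ssh.connect(ip, username="root",
                        key_filename=os.path.expanduser("~/.ssh/id_rsa"))
            ssh.exec_command(command)
            ssh.close()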
Aren't you paying for hosting on each droplet though? Consolidating would save you that money, but I guess this wouldn't have a huge impact if the income per customer is a lot larger than the cost of an additional droplet.
This is awesome! Great content for DigitalOcean to be pushing out, as I am probably the exact audience they were targeting when they published this. E.g. I've never gone beyond a shared hosting setup but have been curious to try my luck at learning more of the stack by using the DO platform.
Yeah, I've been googling for tutorials/configs/info on various deployment setups and digitalocean's come up more and more with solid guides. Thanks for sharing the secrets of the somewhat arcane!
The effort D.O. puts into their community education is one of my favorite things about them. The few times I've had problems with a droplet configuration, inevitably someone had already posted a solution in the help section.
Wouldn't it be much better to teach the concept of horizontal scalability applied to the application stack? Your server is a stack of interfaces: a frontend cache, a static content server, a dynamic content server and a database. You can horizontally scale each layer of the stack. Much simpler, and applicable to different scenarios.
However, this approach won't give you a viral article title like "eight server setups for your app" (replace eight by 2^n where n is the layer count).
Excellent writeup! Next I'd like to see an article on deployment. What if I want my development team to be able to push code changes regularly to an app cluster via a git-based workflow and have these deploys all occur with zero downtime? I think an article demonstrating how to use modern deployment tools such as Ansible or Docker to achieve those goals on a commonly used programming environment such as Ruby would serve to lure quite a few developers away from PaaS towards something like DigitalOcean.
For now, though, those tasks are still "hard", which means that for many developers DigitalOcean is still hard to use relative to other emerging platforms such as Red Hat's OpenShift or Heroku. I know there are many shops who would love to jump ship from PaaS to a less expensive platform, but they feel the cost of rolling their own zero-downtime clustered deployment infrastructure is not worth the $ savings.
I suspect that if IaaS providers were to dedicate resources towards producing more educational material for developers with the aim of demonstrating how to achieve these deployment objectives on all the popular platforms using modern open source tools then loads of PaaS developers would jump ship.
For example: How can I use Ansible to instantiate 5 new droplets and automatically install a load-balancing server on one of them while setting up the Ruby on Rails platform and Ganglia on the remaining ones? How can I run a load-balancing test suite against the newly created cluster, interpret the results, and then tear the whole thing back down again, all with a few keystrokes? How could this same script allow me to add additional nodes, and how does the resulting system allow for the deployment of fresh application code? How can it be improved to handle logging and backup?
I know that it's possible to create a deployment system to answer the above questions in less than a few hundred lines of ansible + Ruby, so I imagine it could be explained in a short series of blog posts, but you would probably need to hire a well-paid dev-ops guru to produce such documentation. I bet if you ask around on HN...
Thanks for the write up. It's the perfect time for me to be reminded about starting simple and changing the architecture as needed. I have prematurely optimized on one project in the past. It was painful. And after all that pain the mythical millions of unique visits never arrived.
Virtually no mention of how the different server setups affect availability, which is very unfortunate. Availability and disaster recovery are two things I think are significantly more important than scaling, and your choice of server setup will affect both.
Yup: Uptime. In the face of a box rebooting unexpectedly because it's a VPS. In the face of a data center experiencing connectivity issues. In the face of Hurricane Sandy's big brother.
As the "Startup Standards" begin to take shape, these guides prove to be extremely useful for the newcomers out there. Sure in 6-12 months it may become a bit dated (depending on the guide) but if kept up-to-date, they can be a powerful tool for a new company.
It would be very helpful if DigitalOcean sold a load balancer too, as Linode does, because the bandwidth limits are per Droplet, which makes it illogical to use DigitalOcean for this. Of course, we can use Cloudflare or similar, but it is still a need.
Does anyone know what a bare-minimum monitoring setup looks like for a single server running nginx, postgres and rails? I'm far too intimidated by nagios to do anything significant.
If I'm reading correctly that you just want something simple and completely hands-off, I'd suggest New Relic - absolutely trivial to set up, free for basic server monitoring.
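If even New Relic feels like too much, a cron-driven check script covers the bare minimum. A sketch, where the health URL, database DSN, addresses and the local MTA are all placeholder assumptions to adapt:

    import smtplib
    import urllib.request
    from email.message import EmailMessage

    import psycopg2  # third-party: pip install psycopg2-binary

    def check_nginx_and_rails():
        # A dynamic page fetched through nginx also exercises the Rails app behind it.
        urllib.request.urlopen("http://127.0.0.1/health", timeout=5)

    def check_postgres():
        conn = psycopg2.connect("dbname=myapp user=monitor", connect_timeout=5)
        conn.cursor().execute("SELECT 1")
        conn.close()

    def alert(subject):
        msg = EmailMessage()
        msg["Subject"], msg["From"], msg["To"] = subject, "monitor@example.com", "you@example.com"
        smtplib.SMTP("localhost").send_message(msg)  # assumes a local MTA

    for name, check in [("nginx/rails", check_nginx_and_rails), ("postgres", check_postgres)]:
        try:
            check()
        except Exception as exc:
            alert(f"{name} check failed: {exc}")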
I propose an alteration to the typical LAMP stack: Replace Apache with Nginx and MySQL with MongoDB. Personally, the reduced resource use of Nginx is nice since I can run on a smaller "box". MongoDB is just a choice depending on the data set, but it does allow for sharding out horizontally without too much effort.