
I'm now working on my third distributed, high-performance enterprise storage cluster. SCSI requires a certain (high) level of consistency per LU. And availability... and yet I'm delivering performance?

What? How could that be possible?

This blog post has so many assumptions about how nodes can and should behave that it's almost unreadable.

Faulty network? Get two fabrics. Nodes going down? RDMA to non-volatile RAM, and fail the I/O over to the other node. This works for all your banks, airlines, governments, etc.
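For the curious, a minimal sketch of that failover path in Python, assuming two redundant fabrics and a partner node whose non-volatile cache mirrors ours over RDMA. Every name here (Fabric, write_block, mirrored_write) is hypothetical, not a real driver API:

    class FabricError(Exception):
        """Raised when an I/O cannot complete over a given fabric."""

    class Fabric:
        def __init__(self, name, healthy=True):
            self.name = name
            self.healthy = healthy

        def write_block(self, lun, lba, data):
            if not self.healthy:
                raise FabricError(f"fabric {self.name} unavailable")
            # Real code would issue the SCSI WRITE over this fabric here.
            return True

    def mirrored_write(fabrics, partner_nvram, lun, lba, data):
        # Stage the write in the partner's NV cache first (the RDMA mirror),
        # so a local node crash can't lose acknowledged data.
        partner_nvram.append((lun, lba, data))
        for fabric in fabrics:
            try:
                return fabric.write_block(lun, lba, data)
            except FabricError:
                continue  # path failover: try the redundant fabric
        raise IOError("all fabrics down; partner node must complete the write")

    # Usage: fabric A is down, so the write completes over fabric B.
    nvram = []
    mirrored_write([Fabric("A", healthy=False), Fabric("B")], nvram,
                   lun=0, lba=1024, data=b"\x00" * 512)

The point is that with dedicated hardware you can enumerate the failure modes and buy your way out of each one.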

Heroku could do much better, but really: most of these startups should be hiring some ops people and designing their own architectures already!




You're right: reliable fault-tolerance between two nodes in datacenters with dedicated physical hardware and isolated network links is easier than distributing state over a hundred geographically disparate nodes in virtualized, multitenant environments backed by unreliable hardware abstractions.


My point was to complain about what RapGenius actually needed by describing an extreme that doesn't exhibit the behaviors your post describes. However, you were speaking in terms of Heroku, and my post reads like it's trying to deride that. My comment reads pretty awfully, actually.

I guess I'm just frustrated with how much I hear about cloud services solving the "ops problem" when, in reality, all I see is people forgoing rigorous architecture for their applications.

There are extremes that solve just about any problem RapGenius is facing, because I'd bet their traffic, I/O, and CPU profiles are not that difficult a problem to solve for their specific case.

I highly doubt that RapGenius needs over a hundred geographically disparate nodes in a virtualized multitenant environment (notably because they are only a single tenant!), but somehow that's what they're using, and they don't understand what it means for them.

I'm sorry that previous comment sucked, and possibly sorry for this one.


Heroku operates in EC2, which is an Amazon service providing virtual machines spread across several datacenters. RapGenius's nodes, as well as all of Heroku's internal infrastructure (e.g. routing), run as EC2 virtual machines. EC2 VMs are multitenant environments because they share physical hosts with other customers' virtual machines. The disk, CPU, and network are all virtualized, shared resources, with wildly varying latency, throughput, and reliability characteristics.

RapGenius doesn't need a hundred nodes to do their routing, but Heroku does, because they're responsible for lots of customers pushing a large volume of traffic. Their routing infrastructure needs to share global state about which dynos are available for which applications, and which versions of those applications a given HTTP request belongs to; that routing infrastructure is broadly distributed. I'd phrase their routing TCP load as "nontrivial", in the sense that if I had to build a system like this on physical hardware in, say, Portland, I'd start by buying five floors in the Pittock building.
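To make that concrete, here's a minimal single-router sketch in Python of the kind of shared state involved. Everything here is hypothetical (this is not Heroku's actual router); the hard part is keeping this table consistent across many router nodes while dynos come and go:

    import itertools
    import threading

    class RoutingTable:
        def __init__(self):
            self._lock = threading.Lock()
            self._dynos = {}    # (app, release) -> list of dyno addresses
            self._cursors = {}  # (app, release) -> round-robin iterator

        def update(self, app, release, dynos):
            # Called whenever dynos start or stop. In a distributed router,
            # this update must propagate to every node; that's the
            # global-state problem described above.
            with self._lock:
                self._dynos[(app, release)] = list(dynos)
                self._cursors[(app, release)] = itertools.cycle(list(dynos))

        def route(self, app, release):
            # Pick a dyno for an incoming HTTP request.
            with self._lock:
                cursor = self._cursors.get((app, release))
                if cursor is None:
                    raise LookupError(f"no dynos for {app} at {release}")
                return next(cursor)

    table = RoutingTable()
    table.update("rapgenius", "v123", ["10.0.0.5:5000", "10.0.0.6:5000"])
    print(table.route("rapgenius", "v123"))  # -> 10.0.0.5:5000

On one box this is trivial; spread it across a hundred routers with dynos cycling constantly and you're squarely back in distributed-systems territory.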

That's not to say Heroku's architecture is optimal, just that their problem is not as easy as it may appear.

[edit] As for why companies choose to run in a heavily virtualized environment with so many drawbacks... it comes down to a tradeoff between time, personnel, and cost. Running in EC2 allows for dramatically shorter lead times in hardware acquisition, and when your customers can increase their traffic by an order of magnitude without warning, I think that comes in handy. It also frees them from running their own power, network, physical security, hardware acquisition pipeline, associated vendor contracts, etc. I suspect both Heroku and RapGenius have weighed these costs carefully, but I don't really know their specific constraints.



