
I'm now working on my third distributed, high-performance enterprise storage cluster. SCSI requires a certain (high) level of consistency per LU. And availability... and yet I'm delivering performance?

What? How could that be possible?

This blog post has so many assumptions about how nodes can and should behave that it's almost unreadable.

Faulty network? Get two fabrics. Nodes going down? RDMA to non-volatile RAM, and fail the I/O over to the other node. This works for all your banks, airlines, governments, etc.
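For the curious, a minimal sketch of that failover path in Python, assuming two redundant fabrics and a partner node whose non-volatile cache mirrors ours over RDMA. Every name here (Fabric, write_block, mirrored_write) is hypothetical, not a real driver API:

    class FabricError(Exception):
        """Raised when an I/O cannot complete over a given fabric."""

    class Fabric:
        def __init__(self, name, healthy=True):
            self.name = name
            self.healthy = healthy

        def write_block(self, lun, lba, data):
            if not self.healthy:
                raise FabricError(f"fabric {self.name} unavailable")
            # Real code would issue the SCSI WRITE over this fabric here.
            return True

    def mirrored_write(fabrics, partner_nvram, lun, lba, data):
        # Stage the write in the partner's NV cache first (the RDMA mirror),
        # so a local node crash can't lose acknowledged data.
        partner_nvram.append((lun, lba, data))
        for fabric in fabrics:
            try:
                return fabric.write_block(lun, lba, data)
            except FabricError:
                continue  # path failover: try the redundant fabric
        raise IOError("all fabrics down; partner node must complete the write")

    # Usage: fabric A is down, so the write completes over fabric B.
    nvram = []
    mirrored_write([Fabric("A", healthy=False), Fabric("B")], nvram,
                   lun=0, lba=1024, data=b"\x00" * 512)

The point is that with dedicated hardware you can enumerate the failure modes and buy your way out of each one.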

Heroku could do much better, but really: most of these startups should be hiring some ops people and designing their own architectures already!




You're right: reliable fault-tolerance between two nodes in datacenters with dedicated physical hardware and isolated network links is easier than distributing state over a hundred geographically disparate nodes in virtualized, multitenant environments backed by unreliable hardware abstractions.


My point was to complain about what RapGenius actually needed by describing an extreme that doesn't exhibit the behaviors your post describes. However, you were speaking in terms of Heroku, and my post reads like it's trying to deride that. My comment reads pretty awfully, actually.

I guess I'm just frustrated with how much I hear about cloud services solving the "ops problem" when, in reality, all I see is people forgoing rigorous architecture for their applications.

There are extremes that solve just about any problem RapGenius is facing, because I'd bet their traffic, I/O, and CPU profiles are not that difficult a problem to solve for their specific case.

I highly doubt that RapGenius needs over a hundred geographically disparate nodes in a virtualized multitenant environment (notably because they are only a single tenant!), but somehow that's what they're using, and they don't understand what it means for them.

I'm sorry that previous comment sucked, and possibly sorry for this one.


Heroku operates in EC2, which is an Amazon service providing virtual machines spread across several datacenters. RapGenius's nodes, as well as all of Heroku's internal infrastructure (e.g. routing), run as EC2 virtual machines. EC2 VMs are multitenant environments because they share physical hosts with other customers' virtual machines. The disk, CPU, and network are all virtualized, shared resources, with wildly varying latency, throughput, and reliability characteristics.

RapGenius doesn't need a hundred nodes to do their routing, but Heroku does, because they're responsible for lots of customers pushing a large volume of traffic. Their routing infrastructure needs to share global state about which dynos are available for which applications, and which versions of those applications a given HTTP request belongs to; that routing infrastructure is broadly distributed. I'd phrase their routing TCP load as "nontrivial", in the sense that if I had to build a system like this on physical hardware in, say, Portland, I'd start by buying five floors in the Pittock building.
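To make that concrete, here's a minimal single-router sketch in Python of the kind of shared state involved. Everything here is hypothetical (this is not Heroku's actual router); the hard part is keeping this table consistent across many router nodes while dynos come and go:

    import itertools
    import threading

    class RoutingTable:
        def __init__(self):
            self._lock = threading.Lock()
            self._dynos = {}    # (app, release) -> list of dyno addresses
            self._cursors = {}  # (app, release) -> round-robin iterator

        def update(self, app, release, dynos):
            # Called whenever dynos start or stop. In a distributed router,
            # this update must propagate to every node; that's the
            # global-state problem described above.
            with self._lock:
                self._dynos[(app, release)] = list(dynos)
                self._cursors[(app, release)] = itertools.cycle(list(dynos))

        def route(self, app, release):
            # Pick a dyno for an incoming HTTP request.
            with self._lock:
                cursor = self._cursors.get((app, release))
                if cursor is None:
                    raise LookupError(f"no dynos for {app} at {release}")
                return next(cursor)

    table = RoutingTable()
    table.update("rapgenius", "v123", ["10.0.0.5:5000", "10.0.0.6:5000"])
    print(table.route("rapgenius", "v123"))  # -> 10.0.0.5:5000

On one box this is trivial; spread it across a hundred routers with dynos cycling constantly and you're squarely back in distributed-systems territory.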

That's not to say Heroku's architecture is optimal, just that their problem is not as easy as it may appear.

[edit] As for why companies choose to run in a heavily virtualized environment with so many drawbacks... it comes down to a tradeoff between time, personnel, and cost. Running in EC2 allows for dramatically shorter lead times in hardware acquisition, and when your customers can increase their traffic by an order of magnitude without warning, I think that comes in handy. It also frees them from running their own power, network, physical security, hardware acquisition pipeline, associated vendor contracts, etc. I suspect both Heroku and RapGenius have weighed these costs carefully, but I don't really know their specific constraints.



