Why do you think AWS is a poor fit for your workload? There are customers who churn petabytes in S3. If I were running a crawl-and-index operation, I'd put the bytes in blob storage like that and aggressively negotiate better pricing with Amazon.
There are some pretty obvious reasons mainframes aren't the server format of choice for today's cloud. The people building for today's cloud grew up in a world where fast desktop computers running Linux were ubiquitous and cheap (important for poor students learning to ship code on a budget), while mainframes were something you couldn't easily get access to even while waving large sums of cash at IBM.
No doubt there could be warehouse-scale systems that were more efficient, overall, than the current designs the cloud providers use. Every bit of the stack could be squeezed to provide exactly the hardware you needed for a particular problem. But the economic incentives don't seem to exist at this time for a provider like Amazon (I imagine they make far more money on modest VM configs with no GPUs or InfiniBand than they do on the high-end stuff, even if the latter has a higher profit margin, because volume wins in cloud).
Multi-core-large-memory-SSD is just the latest architectural evolution. We've just been pushing the bottleneck around between the various parts of the computer, and custom manufacturers have grown rich and then gone out of business producing a system with 30% more "X" than the top-of-the-line you could buy from Dell (Bull will sell you an x86 system with 24 TB of RAM!). Right now the pain point is having many memory spaces spread across machines, rather than a single super-fast RAM you can access from any processor (the laws of physics and cache coherency suggest this is a pretty hard problem), because multicore and SSD got so big and fast.
> Why do you think AWS is a poor fit for your workload?
Every time I've done the math it comes out 10x as expensive, and I've done the math with pretty much everyone from the CTO down to the guy who is "sure we can do it cost-effectively."
Generally, folks who have petabytes of data in S3 don't have a large amount of read/write churn. A typical file- or image-sharing site like Imgur or GitHub is 'read mostly'. When you're doing search, you crawl billions of documents, often replacing 30-40% of your store, and you are constantly re-reading them as you index and rank them. Further, as you process the data, what you really want to do is push your computation out to the data rather than pull it over the wire, mutate it, and push it back over the wire.

Processing 1 petabyte of data on a 10 Gbps backbone where you pull it and then push it, running full duplex (so you're pushing and pulling at the same time), takes a million seconds at a gigabyte per second. That is 11 days, 13 hrs, 47 minutes, and that is what happens changing 1/3 of a 3 PB data set. Pushing the computation into the data (which is to say, running the processors where your data is actually stored on disk), you can process a petabyte (assuming your data distribution algorithm is good (and ours is)) in about a day and a half. Not quite 1/10th the time.
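A quick sanity check of that arithmetic (a minimal sketch; the 1 GB/s figure is just the comment's rounding of a 10 Gbps link, and the day-and-a-half for in-place processing is the figure quoted above):

    # Sanity check of the numbers above. The only inputs are figures from
    # the comment: 1 PB moved (1/3 of a 3 PB data set), ~1 GB/s over a
    # 10 Gbps backbone, and ~1.5 days for processing the petabyte in place.
    from datetime import timedelta

    dataset_bytes = 1e15                 # 1 PB changed out of a 3 PB store
    wire_rate = 1e9                      # ~1 GB/s usable on a 10 Gbps link

    wire_s = dataset_bytes / wire_rate   # full duplex: pull and push overlap
    print(timedelta(seconds=wire_s))     # 11 days, 13:46:40 -- the 11d 13h 47m

    in_place_s = 1.5 * 86400             # "about a day and a half"
    print(f"in place is ~{wire_s / in_place_s:.1f}x faster")   # roughly 7.7x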
If you want to coordinate 10,000 worker threads working through your data, you need to be able to pass messages between them. Those messages don't have to happen often, but their latency adds up if each one takes too long.
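To make that concrete, here's a toy model of the latency tax; the 10 ms work unit and the round-trip times are my own illustrative assumptions, not numbers from the comment:

    # Toy model: a worker blocks on one coordination round-trip between
    # work units. Work-unit size and latencies are illustrative assumptions;
    # the point is how quickly a slow message path eats throughput.
    work_per_unit_s = 0.010                  # hypothetical useful work per unit

    for rtt_s in (0.0001, 0.001, 0.050):     # ~100 us LAN, 1 ms, 50 ms WAN
        busy = work_per_unit_s / (work_per_unit_s + rtt_s)
        print(f"{rtt_s * 1e3:6.1f} ms round-trip -> "
              f"{busy:6.1%} of time spent on real work")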
You end up asking for the same system built in the "cloud" that you've built in a colo: dedicated "fat" machines with lots of memory and disk, all within easy network 'shouting' distance of each other (aka a non-blocking full-crossbar-bandwidth network), with no confounding network traffic going over your backbone. And when you arrive at that inevitable conclusion, the actual loaded cost of the machine falls right out the bottom and lands in the customer's lap, including all the loaded-up margin costs. Three months or so ago (right after the last price war) we ran all the numbers again; Amazon would cost about $1.5M/month to let us do what we want to do.
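For the shape of that comparison, a sketch with placeholder numbers; the only figure taken from the comment is the ~$1.5M/month Amazon quote, and the colo-side inputs are made-up assumptions chosen so the output lands near the 10x claimed earlier:

    # Colo-vs-cloud back-of-envelope. ONLY the $1.5M/month cloud total comes
    # from the comment; fleet size, node price, amortization, and opex are
    # placeholder assumptions for illustration.
    nodes = 100                          # hypothetical fleet of fat machines
    cloud_month = 1_500_000.0            # the quoted Amazon figure, per month

    node_capex = 30_000.0                # placeholder purchase price per node
    amort_months = 36                    # straight-line over three years
    node_opex_month = 700.0              # placeholder power/space/network

    colo_month = nodes * (node_capex / amort_months + node_opex_month)
    print(f"cloud ${cloud_month:,.0f}/mo vs colo ${colo_month:,.0f}/mo "
          f"-> {cloud_month / colo_month:.0f}x")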
Now, as I point out to cloud sales guys, and to you, this isn't "bad" or actually a problem; building search engines requires an extraordinary amount of horsepower brought to bear on a very large, very noisy data set. There is absolutely no rationale for making that configuration cost-effective: it's an outlier, and you can count the number of people who do it on one hand. But it is the same reason people don't just pop out the AWS toolkit and poof, crawl and index billions of web pages :-)
> volume wins in cloud
I don't agree with that. I think what 'wins' in the cloud is the ability to oversubscribe the hardware. Just like Comcast sells everyone on the street 50 megabits of internet knowing darn well that if more than a handful use that much at the same time everyone will throttle down, cloud providers sell you an 'instance' which probably spends most of its time not doing anything at all. And while yours isn't doing anything, someone else's is. That is the 'magic' that makes this stuff so profitable for Amazon. Not people like me who have 100 GB memory machines running at 85% utilization 24 hours a day[1]. It would be like everyone on the block signing up for high-speed internet and every one of us downloading a copy of the entire Internet Archive :-) Not a likely situation, so it's rarely considered something the infrastructure needs to support.
[1] To be fair, they don't do that continuously; crawls start and stop and we switch things around, but when they are in the fetcher/extractor phase and running at R3, it's a wonder to behold :-)
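To put rough numbers on the oversubscription point above, a sketch where the tenant mix is an illustrative assumption, not any provider's real utilization data:

    # Why oversubscription pays: most tenants are mostly idle, so a provider
    # can sell many more nominal instances than it has hardware. The tenant
    # mix below is an illustrative assumption, not real provider data.
    import random

    random.seed(1)
    capacity = 1_000                 # abstract units of physical hardware
    tenants = 4_000                  # instances sold, 1 nominal unit each

    def tenant_demand():
        # 2% run hot around the clock (like the crawler above); the rest idle.
        return 1.0 if random.random() < 0.02 else random.uniform(0.0, 0.15)

    demand = sum(tenant_demand() for _ in range(tenants))
    print(f"sold {tenants} units against {capacity} real ones "
          f"({tenants / capacity:.0f}x oversubscribed); "
          f"expected concurrent demand ~{demand:.0f}")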
> what 'wins' in the cloud is the ability to oversubscribe the hardware
We're increasing compute exponentially, so it makes sense we'd want to oversubscribe it as much as possible. Demand is like a dog nipping at your heels.