Can anyone think of any ways to collect data on just how big AWS really is? So far it seems like everyone is just trying to infer based on the "other" line of their revenues and anecdotes from the cloud community.
If we could somehow collect information on the power usage of their data centers, I think that would be the most accurate measure. You might be able to do that by using a thermal camera to determine how much heat is dissipated by their cooling systems.
If we could generate random IP samples by crawling, and if the IP addresses are assigned sequentially, then we would have a problem analogous to the German tank problem (http://en.wikipedia.org/wiki/German_tank_problem), allowing us to compute the minimum-variance unbiased estimate of the total number of IPs from a random sample as: max(IPs observed)*(1+(number of IPs observed)^-1)-1, where each IP is counted as its position in the sequence, starting from 1.
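To make the estimator concrete, here is a rough Python sketch. The function names and the 198.51.100.0/24 block (a reserved documentation range, not a real AWS block) are made up for illustration, and it assumes addresses are handed out sequentially from the start of the block and that the sample is uniform and without replacement, which is a big assumption for crawled data.

```python
import ipaddress


def german_tank_estimate(serials):
    """Estimate the population size N as m * (1 + 1/k) - 1.

    m is the largest serial observed and k is the number of distinct
    serials observed. Serials are assumed to start at 1.
    """
    unique = set(serials)  # sampling without replacement: drop repeats
    k = len(unique)
    if k == 0:
        raise ValueError("need at least one observation")
    m = max(unique)
    return m * (1 + 1 / k) - 1


def serials_in_block(ips, block):
    """Map observed IPs to 1-based positions inside a CIDR block.

    IPs outside the block are ignored. The block is hypothetical here;
    real allocations are unlikely to be this tidy.
    """
    net = ipaddress.ip_network(block)
    base = int(net.network_address)
    serials = []
    for ip in ips:
        addr = ipaddress.ip_address(ip)
        if addr in net:
            serials.append(int(addr) - base + 1)
    return serials


if __name__ == "__main__":
    # Hypothetical observations; the last address falls outside the block.
    observed = [
        "198.51.100.18",
        "198.51.100.39",
        "198.51.100.41",
        "198.51.100.59",
        "203.0.113.5",
    ]
    serials = serials_in_block(observed, "198.51.100.0/24")
    print(serials, german_tank_estimate(serials))
```

With several separate blocks you would have to run the estimate per block and add the results up, and it still only counts publicly reachable addresses.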
Their servers send an indication in the http headers:
Server: Apache/2.2.21 (Amazon)
I guess you'd need to crawl the web and look for those headers. Or maybe you could look at IP addresses? It would certainly be difficult to do with any kind of accuracy but you could probably get some decent estimates if your sample size was large enough.
Very large numbers of Amazon servers are used for something other than cranking out HTTP pages. Expect a lot of them to be crunching numbers for bioinformatics problems, physics simulations and so on. That's why there is a CUDA-enabled instance.
Rendering web pages is actually one of the worst use cases for Amazon from a bang-for-the-buck perspective, especially when you factor in bandwidth.
When people are building 30,000 core compute clusters [1] on EC2 - presumably with zero publicly available web servers, I'd be very interested in any methodology that provides reasonable estimates of revenue based on public web servers.
I don't know where you got that header; S3 returns Server: AmazonS3 and http://aws.amazon.com returns Server: Server. I would suspect a lot of their backend services use custom HTTP servers, and maybe a small handful actually use Apache. Moreover, you'll be completely missing backend servers that have no public IP at all...
I haven't used AWS in a while. Is there a pattern to the AWS IP addresses? (e.g. are they using a set of specific blocks? ...that seems too simple). The other interesting data point would be how much AWS capacity is consumed by Amazon itself. I understand that they have been moving big pieces of infrastructure onto AWS over the past couple of years.
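If it does turn out that they use a fixed set of blocks, checking a crawled address against them is easy. A minimal Python sketch, assuming you already have a list of CIDR blocks from somewhere; the blocks below are reserved documentation ranges standing in for real ones, and the function names are just for illustration.

```python
import ipaddress

# Placeholder blocks (documentation ranges), NOT real AWS allocations.
HYPOTHETICAL_BLOCKS = ["198.51.100.0/24", "203.0.113.0/25"]


def build_networks(cidrs):
    """Parse CIDR strings and merge overlapping or adjacent blocks."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return list(ipaddress.collapse_addresses(nets))


def in_any_block(ip, networks):
    """Return True if the address falls inside one of the networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


def total_addresses(networks):
    """Size of the address space covered by the networks.

    This is an upper bound on public addresses, not a server count.
    """
    return sum(net.num_addresses for net in networks)


if __name__ == "__main__":
    networks = build_networks(HYPOTHETICAL_BLOCKS)
    for ip in ["198.51.100.7", "203.0.113.5", "203.0.113.200"]:
        print(ip, in_any_block(ip, networks))
    print("addresses covered:", total_addresses(networks))
```

The address-space total only bounds the number of public IPs; it says nothing about how many are in use, or about machines with no public IP.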
(You probably have to have showdead on to see me because I'm hellbanned. I've emailed pg to try and get this fixed, and had no response. If you happen to see this, please check my comment history to realise that I'm not at all a troll, and consider upvoting me on the off chance it will get my account back into positive karma land. Thanks!)