"If you added one S3 object every 60 hours starting at the Big Bang, you'd have accumulated almost two trillion of them by now."
That actually sounds underwhelming! IMHO our brains have an easier time thinking "hey, 1 every 60 hours that's not much" compared to figuring out the universe is really incredibly old ;-)
How about comparing to the lifetime of one person? With the US average life expectancy (78 years), you'd have to upload about 800 objects/second for your whole life to get to 2 trillion :)
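For anyone who wants to redo that napkin math, here's a quick check (assuming the announced 2 trillion objects and the 78-year lifespan above):

```python
objects = 2e12                          # announced S3 object count
lifetime_s = 78 * 365.25 * 24 * 3600    # US average life expectancy, in seconds

print(objects / lifetime_s)             # ~812 objects/second, i.e. roughly 800/s
```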
Edit: One S3 object for each fish in the ocean (3 to 4 trillion) could also be a nice future milestone (if also slightly underwhelming :)
Edit2: I also love the eye blink as a unit of time. Each time you blink (average is around 15 times a minute), XXX more objects will have been uploaded to/requested on S3
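A rough way to fill in the XXX, assuming the announced peak of 1.1 million requests/second and the 15 blinks a minute above:

```python
requests_per_second = 1.1e6     # announced peak request rate
blinks_per_minute = 15          # average blink rate from the comment above

seconds_per_blink = 60 / blinks_per_minute        # 4 seconds between blinks
print(requests_per_second * seconds_per_blink)    # ~4.4 million requests per blink
```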
My math is probably wrong... but I believe if you ate a Twinkie for every request, at the end of a year it would take 1,350 Blue Marlin heavy-lift ships to move you across the ocean.
Oh, oh, I like doing these analogies, and we badly need AWS stickers for the office.
A typical grain of beach sand – which must be about the lightest solid thing that’s easy for most people to envision – is 3 mg. That many grains of sand would mass 6e6 kg (6000 metric tons), which in turn is near the upper limit of masses that make sense to most people in everyday terms.
So you could say that if every S3 object were a grain of sand, S3 would be 3× the launch mass of the Space Shuttle, or somewhat more than the gold held in Fort Knox, or as much as the heaviest living thing: http://en.wikipedia.org/wiki/Pando_(tree)
And the 1.1 million requests per second would be about 3.3 kg, or as much as a gallon of milk/water.
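Rough check of both figures, taking the 3 mg grain and the announced numbers at face value:

```python
grain_mg = 3                    # assumed mass of one grain of sand
objects = 2e12
requests_per_second = 1.1e6

print(objects * grain_mg / 1e6)              # 6,000,000 kg = 6,000 metric tons
print(requests_per_second * grain_mg / 1e6)  # ~3.3 kg of sand per second of requests
```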
Imagine each object being a person, and imagine a row of 10 000 people, about the population of a small village, standing side by side, barely able to touch each other's hands, spanning roughly 15-20 km.
Now imagine putting 9 999 people behind every person.
And finally, imagine stacking 9 999 more on each of those 100 million people's heads. That stack reaches roughly 17 km - well above the altitude a normal airliner cruises at.
Then double that number. That's how many objects have been created so far.
---
A bit long, but I think the roughly human-scale numbers, the 9 999s, and the "and then even more" escalation make it almost imaginable.
EDIT: Another one - you could fill the whole of Manhattan with cents and still have money to spare.
(Manhattan is 87.46 km^2 and a one-cent coin covers roughly 285 mm^2, so 2 trillion of them would cover Manhattan several times over.)
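Quick sanity check on both the stacking math and the pennies (the ~285 mm^2 penny area is my assumption, based on a 19.05 mm US cent):

```python
import math

# People analogy: 10,000 in a row, 10,000 deep behind each, 10,000 stacked high
print(10_000 ** 3 * 2)                          # 2,000,000,000,000 objects

# Manhattan in cents
manhattan_km2 = 87.46
cent_area_mm2 = math.pi * (19.05 / 2) ** 2      # ~285 mm^2 per US cent (assumed)
covered_km2 = 2e12 * cent_area_mm2 / 1e12       # mm^2 -> km^2
print(covered_km2 / manhattan_km2)              # ~6.5 Manhattans' worth of pennies
```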
It's so many objects that if you had all their names printed out and spent the rest of your life reading them, you wouldn't have time to get to the end.
In fact, even if you and all your friends spent your lives reading the lists of object names, you wouldn't have time to get to the end.
In fact, even if you and all your friends and all of their friends devoted your lives to reading lists of S3 object names, you wouldn't even make a dent because new objects are arriving faster than you could collectively read their names.
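Putting rough numbers on that, assuming you can read about two object names per second and that objects keep arriving at roughly a trillion per year:

```python
objects = 2e12
names_per_second = 2                       # assumed reading speed
seconds_per_year = 365.25 * 24 * 3600

print(objects / names_per_second / seconds_per_year)   # ~31,700 years for one reader

arrival_rate = 1e12 / seconds_per_year     # ~31,700 new objects per second
print(arrival_rate / names_per_second)     # ~15,800 readers just to keep pace
```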
If each object were a megabyte, you could stick them all on microSD cards packed into a cube a bit taller than an adult.
(One card is 1.1 cm x 1.5 cm x 0.1 cm and holds 32 GB, i.e. 32,000 one-megabyte objects, so 2 trillion objects need about 62.5M cards. A 2.2 m cube (a bit over 7 feet) holds roughly 200 * 146 * 2200 = 64M cards, which even leaves room for some error-correcting codes just in case.)
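Here's that arithmetic in one place, in case anyone wants to swap in different card sizes (the dimensions are the standard 11 x 15 x 1 mm microSD form factor):

```python
objects = 2e12
card_objects = 32_000                      # 32 GB card / 1 MB objects
card_cm3 = 1.1 * 1.5 * 0.1                 # microSD volume in cm^3

cards = objects / card_objects             # ~62.5 million cards
side_cm = (cards * card_cm3) ** (1 / 3)
print(cards, side_cm)                      # ~62.5e6 cards, cube ~218 cm on a side
```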
I like to use dice to think about storage. If a byte is the size of a die (say a 1 cm cube), then a kilobyte is a 10 cm cube, a megabyte is a 1 m crate, a gigabyte is a 10 m house, a terabyte is a 100 m tall building, a petabyte is a 1 km wide Borg cube, etc.
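The rule there is just that every x1000 in bytes is a x10 in edge length; a tiny script makes the progression explicit (assuming 1 byte = a 1 cm die, as above):

```python
units = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte", "petabyte"]
for power, unit in enumerate(units):
    print(f"1 {unit} ~ a cube {10 ** power} cm on a side")   # x10 per x1000 in volume
```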
If each object were written on a single standard 8 1/2 x 11 inch piece of paper and the sheets were stacked up, the stack would be about 125,000 miles high (about half the distance to the Moon).
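Checking that with a sheet thickness of roughly 0.1 mm (my assumption for ordinary office paper):

```python
objects = 2e12
sheet_mm = 0.1                      # assumed paper thickness
mm_per_mile = 1_609_344
moon_miles = 238_855                # average Earth-Moon distance

stack_miles = objects * sheet_mm / mm_per_mile
print(stack_miles, stack_miles / moon_miles)   # ~124,000 miles, ~0.52 of the way
```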
Note that RRS is "Reduced Redundancy Storage" - an S3 option that offers 99.99% durability, as opposed to the 99.999999999% durability offered by Standard S3.
Considering we don't hear about problems that often, this is quite an impressive feat of engineering. You really don't think about it until we get a hiccup and half the Internet goes down.
To put this into perspective, that's 2.7B new objects per day (assuming 1 trillion new objects averaged over 365 days).
Assuming each object is 100KB (generous estimate, after compression) that would be 270GB per day -- or assuming ten levels of redundancy and striped across three RAID storage devices (per level of redundancy) then 8.1TB per day.
I'm not familiar with their hard disk procurement policies but it wouldn't be difficult to assume they've been purchasing 1TB drives, so 10 new disk drives per day just for keeping ahead of growth. Furthermore let's assume their disk drive failure churn rate is 10% per day so another 1 new disk drive for parts replacement (so 11 disk drives per day).
These are really loose numbers not based on any actual data (or any personal experience at all) but just napkin math, so take it all with a grain of salt.
I'm not convinced that 100KB is a great estimate on file size, but either way you're off by a few zeroes. It's not 270GB per day, it's 270TB. Even if each object were just one byte, that would be 2.7GB. 100KB is one hundred thousand bytes. So it's quite a bit more than eleven drives per day!
After applying the same shoddy math with each object being 100KB -- 270TB with 10 levels of redundancy across 3 RAID drives resulting in 8,100TB per day. This would be 8,100 drives (at 1TB per drive), or 8,910 drives after 10% being dead-on-arrival.
The math is sketchy, so let's cut it down by 10x (10KB per object): 891 drives per day. Keep in mind this is just for S3 and it doesn't account for existing drives failing, growth, or what other services require (e.g. EC2, RDS, CloudWatch, CloudFront, etc.).
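Spelling out that corrected napkin math with the same assumptions (2.7B new objects/day, 100 KB each, 10 copies x 3 RAID drives, 1 TB drives, 10% spares):

```python
new_objects_per_day = 2.7e9      # ~1 trillion / 365, rounded as above
object_kb = 100                  # assumed average object size
overhead = 10 * 3                # 10 levels of redundancy x 3 RAID drives (napkin value)

raw_tb = new_objects_per_day * object_kb / 1e9   # KB -> TB: 270 TB/day
total_tb = raw_tb * overhead                     # 8,100 TB/day
print(raw_tb, total_tb, total_tb * 1.1)          # 8,910 one-TB drives/day incl. 10% spares
```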
Average size is irrelevant. A handful of 1 GB objects dwarfs hundreds of 1-byte objects when computing the average. The overall distribution is what's interesting. There are actually three: GET sizes, PUT sizes, and stored sizes. They are not identical distributions, especially since, as the PUT size distribution has changed over time, it has drifted out of sync with the stored size distribution. Wish I could tell you more; there are some fascinating data points in there but, you know, NDA. Source: former S3 employee.
You could then multiply the average object size by 2 trillion to work out their total data volume under management, which I believe they consider commercially sensitive information, but I can't quite put my finger on why.
Wow, Windows Azure seems to be beating it by a long way. Genuine surprise... I assumed it would be much closer.
9 months ago, they announced they stored twice this amount - 4 trillion objects. A year before that, 1 trillion. Given that previous rate of growth, we can expect they have a lot more than this now.
They also announced peaks of 880,000 requests a second. Whilst Amazon wins here, I'd say it's fair to assume this number has increased in those 9 months.
Bear in mind that Azure storage includes "Blobs, Disks/Drives, Tables, and Queues" however S3 is only blobs (other services like Amazon's DynamoDB would be analogous to table storage). Hence, it's not an apples-to-apples comparison.
An object is a single blob of data. S3 objects vary in size from 1 byte up to 5 Terabytes. They can be uploaded with a single PUT, or with multiple PUTs in series or in parallel (which we call multipart upload). They can be downloaded as a unit, in full (GET) or in part (range GET).
I am surprised that the number of requests per second is this low — especially if this includes PUTs. There must be a pretty huge multiple between CloudFront and S3 that keeps this in check.
Anyone else find this surprisingly low? I'd imagine your typical web service holds a few thousand objects in S3 for images etc., then backups and anything else. Then you have your big players like Netflix, Dropbox, etc. that use the service and store data for tens of millions of customers...
Netflix is unlikely to store customer data on S3. They use it for movie files IIRC, but customer data would go in a proper database of some sort.
Dropbox does use S3, but I think you're probably making the fairly standard human mistake of not realizing just how big a trillion is. Dropbox hit 100M users in November of 2012, so for Dropbox to use up two trillion objects each user would need to have 20,000 of them. Dropbox does deduplication, has lots of inactive/minimal users, etc., so they're probably a percent or two of S3 objects.
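The division behind that 20,000 figure, using the announced 2 trillion objects and the ~100M Dropbox users mentioned above:

```python
s3_objects = 2e12
dropbox_users = 100e6
print(s3_objects / dropbox_users)   # 20,000 objects per user if Dropbox were all of S3
```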
Netflix stores most of their user data in Cassandra (at the AWS conference last year I attended a talk by them, and specifically asked after some of the things they were storing).
(At one point, the stuff I was storing/logging for Cydia actually represented over a percent, maybe it was even over two percent, of all objects in S3; now I'm between 0.1% and 0.5%.)
There is more to S3 than just serving static objects off of the local drive. Maintaining the integrity of the data and ensuring that the data sent out is consistent is not a trivial task. A constantly changing map with 2 trillion keys is a hard problem on its own. Also, serving 1000 1MiB objects per second is not the same as serving 1000 1KiB objects per second so it's hard to say how many resources just the serving portion consumes.
I was assuming that the number of requests referred to read requests, and I guess that the system is designed in a way that makes read requests very cheap, maybe even cheaper than reading a file on a usual filesystem, at least for hot data.
Just because it's huge and complex doesn't mean it's slow and that requests are expensive.
If you ignore the fact that these aren't actually static objects and require a lot more computation to work out where they are and where they need to go.
1.1M RPS is the amount they actually serve, not how much they can serve. Just because your single server can serve more than 1,000 static objects/second (in fact, that number should be much higher), it doesn't mean you need to.
> If you ignore the fact that these aren't actually static objects
It depends what you call a static object.
If the content of your object is stored somewhere and you can just send it without transformations, it's some kind of static object.
Now, looking at a single S3 bucket as a key-value store, with some kind of routing mapping an object's URL to a set of shards each containing the object (something like the toy sketch below), one could argue it's serving static objects.
> require a lot more computation to work out where they are and where they need to go
I hope not. I would bet it's not very far from serving a file from the filesystem. There may be a lot of I/O contention, though.
> in fact, that number should be much higher
Yes, very probably, and that only makes the 1.1M RPS number seem even less impressive - for Amazon.
> 1.1M RPS is the amount they actually serve, not how much they can serve
My whole point was that this number of requests seems low for Amazon, not that they couldn't handle more.
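For what it's worth, here's a purely hypothetical sketch of the kind of key-to-shard routing mentioned above; the shard count, replica count, and hash choice are all made up for illustration:

```python
import hashlib

# Hypothetical only -- not how S3 actually works, just the shape of
# "hash the key, pick a set of shards that hold copies of the object".
SHARDS = ["shard-%04d" % i for i in range(1000)]
REPLICAS = 3

def shards_for(bucket, key, replicas=REPLICAS):
    digest = hashlib.md5(("%s/%s" % (bucket, key)).encode()).hexdigest()
    start = int(digest, 16) % len(SHARDS)
    return [SHARDS[(start + i) % len(SHARDS)] for i in range(replicas)]

print(shards_for("my-bucket", "photos/cat.jpg"))
```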