Can you share more about your implementation? (Disclosure: I'm very interested since I work on OpenStack Swift.)

Starter questions:

How does data placement work? How is data checked for correctness? How do you do listings at scale? During hardware failures, do you still make any durability or availability guarantees? How do you handle hot content? How do you resolve conflicts (eg concurrent writes)?




> How does data placement work? Each object bigger than some size is split into small parts; these parts are linked to a metadata object that holds all the information, such as name, bucket, size, date, checksum, etc. All data is distributed across server groups - each group is at least 3 mirrored servers with no more than 5TB of data, to keep the system flexible. Server groups can be added to increase system capacity or removed to decrease it.
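
Roughly, placement looks like this (a simplified sketch, not our actual code; the part size, checksum algorithm, and field names are just for illustration):

    import hashlib

    PART_SIZE = 8 * 1024 * 1024  # illustrative part size, not the real threshold

    def place_object(bucket, name, data, server_groups):
        # Split the object into fixed-size parts.
        parts = [data[i:i + PART_SIZE] for i in range(0, len(data), PART_SIZE)]
        # Pick one server group for the parts; choosing the group with
        # the most free space is just an illustrative policy here.
        group = max(server_groups, key=lambda g: g["free_bytes"])
        # The metadata object links the parts back to the logical object.
        meta = {
            "bucket": bucket,
            "name": name,
            "size": len(data),
            "checksum": hashlib.md5(data).hexdigest(),
            "group": group["id"],
            "parts": [hashlib.md5(p).hexdigest() for p in parts],
        }
        return meta, parts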

> How is data checked for correctness? With checksums. Once the data is uploaded, the user receives its checksum and must compare it with a locally computed checksum to make sure it was correctly transferred and stored. The same checksum is used to verify data correctness on the server side.
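
For example, the client-side check is as simple as this (illustrative sketch, assuming an MD5 hex digest; the exact algorithm may differ):

    import hashlib

    def verify_upload(local_data, checksum_from_server):
        # Compute the checksum locally and compare it with the one
        # returned by the service after the upload.
        local_checksum = hashlib.md5(local_data).hexdigest()
        if local_checksum != checksum_from_server:
            raise IOError("upload corrupted: checksums do not match")
        return True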

> How do you do listings at scale? There is a trick - we support only one delimiter (/), which means we can use a very simple listing algorithm that scales very easily.
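
The idea, in simplified form (not the actual implementation): because keys are sorted and there is only one possible delimiter, a listing is just a range scan that collapses everything after the next "/" into a common prefix.

    def list_prefix(sorted_keys, prefix, delimiter="/"):
        # Return direct children of `prefix`: plain objects plus
        # "common prefixes" (pseudo-directories) for deeper levels.
        objects, prefixes = [], set()
        for key in sorted_keys:
            if not key.startswith(prefix):
                continue
            rest = key[len(prefix):]
            i = rest.find(delimiter)
            if i == -1:
                objects.append(key)
            else:
                prefixes.add(prefix + rest[:i + 1])
        return objects, sorted(prefixes)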

> During hardware failures, do you still make any durability or availability guarantees? Yes, all data is split across server groups of 3 servers each. If one of the 3 servers fails, the group keeps running as if nothing happened, though some in-flight requests may fail. If 2 servers fail at the same time, the group and all data in it are put into read-only mode to avoid any possible data damage.
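
In effect the rule per group is roughly this (a simplified illustration):

    def group_mode(alive_servers, group_size=3):
        # 3/3 or 2/3 replicas alive -> normal read/write;
        # only 1/3 alive -> read-only to avoid damaging the data.
        failed = group_size - alive_servers
        if failed <= 1:
            return "read-write"
        if alive_servers >= 1:
            return "read-only"
        return "unavailable"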

> How do you handle hot content? It is cached in RAM by the OS; we do not take any additional measures. The OS does a pretty good job.

> How do you resolve conflicts (eg concurrent writes)? Some conflicts are resolved by the software when possible. Unrecoverable conflicts are returned to the user with HTTP 400 or 500 errors so they know something went wrong and must retry the request. For concurrent writes we use a simple rule - the last one wins.
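
For the concurrent-write case the rule really is that simple (illustrative sketch):

    def resolve_concurrent_writes(versions):
        # versions is a list of (timestamp, value) pairs for one key;
        # last-write-wins keeps the value with the latest timestamp.
        return max(versions, key=lambda v: v[0])[1]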


Interesting, and thanks for the response. If I may probe a little further, I have a couple of follow-up questions.

1) Server groups of at least 3 mirrored servers, with a max of 5TB.

This seems like an interesting design choice. What do you mean by "at least"? Does this mean you'll have some data with more replicas? Are these server pools filled up and then powered down until they are needed? How do you choose which server pool to send the data to? And since you have a mirrored set of servers, when do you send a response back to the client?

Is the 5TB number something that is a limit for the storage server (ie 15TB total for a cluster of 3)? That seems rather low. It also doesn't divide evenly into common drive sizes available from HDD vendors today. So what kind of density are you getting in your storage servers? How many drives per CPU, and how many TB per rack? Since you're advertising low prices, I'd think very high density would be pretty important.

2) You say you split data into smaller chunks if it crosses some threshold. Let's suppose that once the data is bigger than 5MB, you split it into 1MB objects. And each 1MB chunk is then written to some server pool which has replicated storage (via the mirroring). How do you tie the chunks back to the logical object? Do you have a centralized metadata layer that stores placement information? If so, how do you deal with the scaling issue there? If not, another option would be to store a manifest object that contains the set of chunks. But in either case, you've got a potentially very high number of servers that are required to be available at the time of a read request in order to serve the data.

Just as an example (and using some very conservative numbers), suppose I have a 100MB file I want to store and you chunk at 10MB. So that means there are now 10 chunks replicated in your system, for a total of 30 unique drives. Now when I read the data, your system needs to find which 10 server pools have my chunks and then establish a connection with one of the servers in each server group. This seems like a lot of extra networking overhead for read requests. What benefits does it provide that offset the network connection overhead?

And what happens when one of the chunks is unavailable? Can you rebuild it from the remaining pieces (which would essentially be some sort of erasure coding)?

Overall, the chunking and mirroring design choices seem to me like they would introduce a lot of extra complexity into the system. I'd love to hear more about how you arrived at these choices, and what you see as their advantages.

In order to not make my long post even longer, I'll not pursue more questions around listings, failures, or hot content.


1) I made a typo - not "at least", but "at maximum", meaning that each server can store up to 5TB of data; the servers have 2x3TB hard drives. The density is very low because we use cheap hardware that fails regularly, and such a small amount of data per server means high recovery speed. 5TB is a soft limit and could differ between server groups, but it doesn't at the moment. Each group of 3 servers has a total usable capacity of 5TB because the data is mirrored.

2) We have a centralised replicated metadata layer which is stored on the same servers as the data itself. All chunks of an object are stored in a single server group, so there is no need to connect to multiple servers to serve a file; it is enough to connect to one server from the group to get all the data. The metadata may be stored in a different server group, though. All chunks are replicated to the 3 servers at the same time using a sequential append-only log to ensure that all servers have the same data. This can introduce replication lag, and if the lag gets too big for some server, that server is removed from the server group until its replication lag is back to normal (1-3 seconds usually).
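
A rough sketch of the replication side (simplified; the log structure and the exact lag threshold here are illustrative, not our real values):

    import time

    MAX_LAG_SECONDS = 3  # illustrative threshold; normal lag is 1-3 seconds

    class ServerGroup:
        def __init__(self, servers):
            self.servers = list(servers)   # the 3 replicas
            self.log = []                  # sequential append-only log of writes
            self.applied_at = {s: time.time() for s in servers}  # last apply time per replica

        def append_write(self, record):
            # Every write is appended to the log; replicas apply it in order,
            # so all servers end up with the same data.
            self.log.append(record)

        def mark_applied(self, server):
            self.applied_at[server] = time.time()

        def active_replicas(self):
            # A replica that lags too far behind is temporarily removed
            # from the group until its lag is back to normal.
            now = time.time()
            return [s for s in self.servers
                    if now - self.applied_at[s] <= MAX_LAG_SECONDS]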

Actually, it is much simpler than I explained: the data layer, with replication, data consistency, and sharding, is completely transparent to the application layer, and it is really small and simple. Email me at support@s3for.me and I'll share the software details with you, and you will see how simple it is.



