Ask HN: Using AWS S3 as a database
2 points by dazhbog on May 4, 2015 | 12 comments
Hello,

I have a website that displays data from sensors. The backend is a Node.js server (doesn't really matter) and I have been trialling MongoDB and MySQL. I haven't been that happy with MySQL, as it needs a lot of code to play well with Node. On the other hand there's Mongo; it's nice, but I have no idea how it will scale in the future. Plus, seeing how the HN community approaches Mongo, I am exploring other options :).

Anyways.. to the question.

So I have been thinking of using S3 as a database and writing a Node.js driver with some basic query abilities. I am willing to accept a JSON request taking up to 500-700ms if it means scaling cheaply and easily to terabytes of data. I would add caching in the future, but for now reliable, scalable storage is what I am after.
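To make the idea concrete, the driver would be little more than this kind of wrapper. This is only a rough sketch; the bucket name and key layout are made up:

    // Minimal sketch: treat S3 as a JSON key-value store (aws-sdk v2).
    // Bucket name and key layout are made up for illustration.
    const AWS = require('aws-sdk');
    const s3 = new AWS.S3();
    const BUCKET = 'my-sensor-data'; // hypothetical bucket

    // Store one record as a JSON object under readings/<id>.json
    function putReading(id, data, callback) {
      s3.putObject({
        Bucket: BUCKET,
        Key: 'readings/' + id + '.json',
        Body: JSON.stringify(data),
        ContentType: 'application/json'
      }, callback);
    }

    // Fetch one record back and parse it
    function getReading(id, callback) {
      s3.getObject({ Bucket: BUCKET, Key: 'readings/' + id + '.json' }, function (err, res) {
        if (err) return callback(err);
        callback(null, JSON.parse(res.Body.toString('utf8')));
      });
    }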

I have done a preliminary benchmark between my server and S3.

* A 6.9MB file takes 400ms on average, with spikes of many seconds on the first requests.

* A 400-byte file takes under 50ms, again with random spikes.

So, given how nicely S3 scales, would this be a good way of attacking the problem? What do you think?

I did research before posting :)

Cheers.




Hi,

It's not the worst idea, but here are some thoughts to consider:

* concurrency and locking

Databases are good at giving you a consistent view of data; some are better at it than others. Depending on your application this may be more or less important to you. But how often could your application receive multiple requests at the same time? How would you deal with two requests writing to the same file in S3? Which one would end up with the accurate representation of your data?

To give an example, let's say request A comes in and pulls down a JSON object stored in S3 (step 1), then makes a subsequent request to store the modified data back in S3 (step 2), say because a value in the JSON needs to change. Meanwhile another session, request B, also pulls down a copy of the data, and it does so in between A's step 1 and step 2. Then on request B's step 2 you are potentially overwriting data that you did not intend to.
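A minimal sketch of that lost-update pattern (bucket and key names are made up). Nothing in the plain GET/PUT calls detects that another writer got in between:

    // Naive read-modify-write against S3: two concurrent callers can both
    // read the same version, and the second write silently wins.
    const AWS = require('aws-sdk');
    const s3 = new AWS.S3();

    function incrementCounter(callback) {
      // step 1: pull down the JSON
      s3.getObject({ Bucket: 'my-bucket', Key: 'state.json' }, function (err, res) {
        if (err) return callback(err);
        const state = JSON.parse(res.Body.toString('utf8'));

        state.counter += 1; // mutate a value

        // step 2: write it back -- nothing stops another request from having
        // read the old value in the meantime, so one increment can be lost.
        s3.putObject({
          Bucket: 'my-bucket',
          Key: 'state.json',
          Body: JSON.stringify(state),
          ContentType: 'application/json'
        }, callback);
      });
    }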

* performance fluctuation

Having used S3 extensively, I can tell you that its performance varies. What if your 400ms requests go through a 15-minute period where they become 4000ms requests? Stranger things have happened. S3 is an extremely reliable system in terms of not losing your data and generally working all the time, but performance fluctuates. With each request that takes longer than expected, resources in your application (file descriptors, CPU, etc.) pile up. If your application can sustain itself comfortably at these upper levels of latency you will be fine, but the slowdown may trigger downstream impatience in the users of your application, causing them to refresh data even more.

* querying and features

S3 is just a key-value store at the end of the day. Maybe that is enough for your application. One day, however, if you decide you want to know how many items in your database have a certain value, or how many were created on a certain date, you have no option but to download the entire bucket iteratively from S3.

S3 doesn't even have a good way to estimate the amount of data in a bucket. Clients that do this have to iterate over the contents of your bucket, and if one day the data grows to 100GB, that is a slow task.
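To make that concrete, here is roughly what a full scan looks like. This is a sketch only; it assumes an aws-sdk (v2) recent enough to expose .promise(), and that every object in the bucket is a JSON document:

    // "Querying" S3 means listing every key and downloading every object.
    const AWS = require('aws-sdk');
    const s3 = new AWS.S3();

    // Count stored JSON objects matching a predicate -- touches the whole bucket.
    async function countMatching(bucket, predicate) {
      let count = 0;
      let marker;

      do {
        const page = await s3.listObjects({ Bucket: bucket, Marker: marker }).promise();

        for (const obj of page.Contents) {
          const res = await s3.getObject({ Bucket: bucket, Key: obj.Key }).promise();
          if (predicate(JSON.parse(res.Body.toString('utf8')))) count += 1;
        }

        // paginate by passing the last key seen as the next Marker
        marker = page.IsTruncated
          ? page.Contents[page.Contents.length - 1].Key
          : undefined;
      } while (marker);

      return count;
    }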

In short, S3 tends to lend itself better to data that doesn't have as many concurrency issues, or data that is mostly static. However, where there's a will there's a way. If it were me, I would be most worried about the lack of database-like features, and secondly about the concurrency issues, but it does depend on your application.

Cheers


Excellent points, thanks for sharing your knowledge.


I've used S3 to store JSON objects. Two pain points I've noticed:

- If your tests change the time (e.g. with Delorean for Ruby), S3 will fail because the protocol depends on your client having approximately the same time as the server.

- If you ever want to load several S3 files at once, e.g. to show a list of 30 Foos, you'll need to make 30 requests. So this is a bit like an n+1 problem. There might be a way around this, but I haven't investigated it, and it will probably require you to sidestep the abstractions you've built. I'd say with 99.9% confidence you will want to do this someday.
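The usual mitigation is to at least fire those requests in parallel. A rough sketch, assuming an aws-sdk (v2) recent enough to expose .promise() and a made-up bucket/key layout:

    // Fetch 30 "Foo" objects in parallel rather than one after another --
    // still 30 requests, but latency is roughly one round trip instead of 30.
    const AWS = require('aws-sdk');
    const s3 = new AWS.S3();

    function loadFoos(ids) {
      return Promise.all(ids.map(function (id) {
        return s3.getObject({ Bucket: 'my-bucket', Key: 'foos/' + id + '.json' })
          .promise()
          .then(function (res) { return JSON.parse(res.Body.toString('utf8')); });
      }));
    }

    // usage: loadFoos(['1', '2', '3']).then(console.log);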

These days I mostly use Postgres instead of MySQL, but I can't help but think that querying MySQL from Node has got to be easier than building your own database on top of S3.


I am also thinking of using Postgres; I think the support is better now with native JSON support. You still need to write the SQL queries by hand (in Node) though, which is a pain. Another thing I remember is that with a new DB you had to get the tables set up and initialized, so more boilerplate code there :)
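For reference, this is roughly the kind of query code I mean; a minimal sketch with the pg module, where the table and column names are made up and connection settings come from the usual PG* environment variables:

    // Querying a Postgres JSON column from Node with the `pg` module.
    // Table and column names are made up for illustration.
    const { Pool } = require('pg');
    const pool = new Pool(); // reads PGHOST, PGUSER, PGDATABASE, etc.

    // Find readings whose JSON payload has sensor = <sensorId>.
    function readingsForSensor(sensorId) {
      return pool.query(
        "SELECT id, payload FROM readings WHERE payload->>'sensor' = $1",
        [sensorId]
      ).then(function (res) { return res.rows; });
    }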


You don't realize it yet, but you're expressing a ton of anti-patterns here. Not the least of which is an urge to prematurely optimize your project, and a desire to invent your own solution to a broadly (but not universally) solved problem.

Don't assume that you can anticipate where your scaling challenges are really going to strike. MySQL and Mongo are both more than capable of supporting your project for quite a while, and after you collect more empirical data on your project's growth and bottlenecks, you can start thinking about how to address those problems.


Hi, thanks for your reply. I like your answer. I am aware of the programmer's urge to over-optimize "all the things" for the million requests per second; I get that urge often myself.

My issue is that, while learning, you want to make the best choice when it comes to the users' data. My take is that if one does not fully understand a component of the stack, one takes precautions. For example, I might not be the best at setting up a couple of MySQL servers doing replication and monitoring the whole lot. What one can do, however, is use a service like RDS, which takes backups for you and lets you restore or spin up new instances, etc. Which is nice.

Having switched to Mongo a while ago (using Compose), I feel that I don't have that much control. So yeah, just experimenting I guess.


This could work, but it would make it impossible to query your data without traversing every record in your DB.

If you want to store unstructured data but don't want to deal with scaling pains in the future, try looking into hosted solutions.

- Compose is pretty great for RethinkDB, Mongo, and ElasticSearch.

- Amazon DynamoDB can scale like mad and supports secondary indexes, arbitrary queries, and triggers (through Lambda).
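To give a flavour of querying DynamoDB, here is a rough sketch using the DocumentClient; the table, index, and attribute names are made up:

    // Querying a DynamoDB global secondary index with the DocumentClient.
    // Table, index, and attribute names are made up for illustration.
    const AWS = require('aws-sdk');
    const db = new AWS.DynamoDB.DocumentClient();

    // All readings for one sensor since a given timestamp.
    function recentReadings(sensorId, since, callback) {
      db.query({
        TableName: 'Readings',
        IndexName: 'sensor-timestamp-index',
        KeyConditionExpression: 'sensorId = :s AND #ts >= :t',
        ExpressionAttributeNames: { '#ts': 'timestamp' }, // "timestamp" is a reserved word
        ExpressionAttributeValues: { ':s': sensorId, ':t': since }
      }, callback);
    }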


Hi, thanks for your reply. I actually use Compose at the moment. I think it's pretty good! Just being a 3rd-party solution makes me slightly uneasy :)


Try S3fuse. It saves you from writing that driver and the caching.

https://code.google.com/p/s3fuse/


Object store or database? The use cases you suggest are centered on storing files. If that is your central use case, then S3 is designed for exactly that, and perfect.


Given you're looking at using S3, is there a particular reason you aren't using DynamoDB? Or even RDS?


Hi, I am actually using RDS for some other projects. I haven't used DynamoDB. I am using mongo at the moment. Thanks



