Ask HN: Using AWS S3 as a database
2 points by dazhbog on May 4, 2015 | 12 comments
Hello,

I have a website that displays data from sensors. The backend is a Node.js server (doesn't really matter) and I have been trialling MongoDB and MySQL. I haven't been that happy with MySQL, as it needs a lot of code to play well with Node. On the other hand there's Mongo; it's nice, but I have no idea how it will scale in the future. Plus, seeing how the HN community approaches Mongo, I am exploring other options :).

Anyways.. to the question.

So I have been thinking of using S3 as a database and writing a Node.js driver with some basic query abilities. I am willing to accept a JSON request taking up to 500-700ms if it means scaling cheaply and easily to terabytes of data. I would add caching in the future, but for now reliable, scalable storage is what I am after.
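To make the idea concrete, the driver would be little more than this kind of wrapper. This is only a rough sketch; the bucket name and key layout are made up:

    // Minimal sketch: treat S3 as a JSON key-value store (aws-sdk v2).
    // Bucket name and key layout are made up for illustration.
    const AWS = require('aws-sdk');
    const s3 = new AWS.S3();
    const BUCKET = 'my-sensor-data'; // hypothetical bucket

    // Store one record as a JSON object under readings/<id>.json
    function putReading(id, data, callback) {
      s3.putObject({
        Bucket: BUCKET,
        Key: 'readings/' + id + '.json',
        Body: JSON.stringify(data),
        ContentType: 'application/json'
      }, callback);
    }

    // Fetch one record back and parse it
    function getReading(id, callback) {
      s3.getObject({ Bucket: BUCKET, Key: 'readings/' + id + '.json' }, function (err, res) {
        if (err) return callback(err);
        callback(null, JSON.parse(res.Body.toString('utf8')));
      });
    }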

I have done a preliminary benchmark between my server and S3.

* A 6.9MB file takes 400ms on average, with spikes of many seconds on the first requests.

* A 400-byte file takes under 50ms, again with random spikes.

So, given how nicely S3 scales, would this be a good way of attacking the problem? What do you think?

I did research before posting :)

Cheers.




Hi,

It's not the worst idea, but here are some thoughts to consider:

* concurrency and locking

Databases are good at giving you a consistent view of data; some are better at it than others. Depending on your application this may be more or less important to you. But how often could your application receive multiple requests at the same time? How would you deal with two requests writing to the same file in S3? Which one would end up with the accurate representation of your data?

To give an example, let's say request A comes in and pulls down a JSON object stored in S3 (step 1), then makes a subsequent request to store the modified data back in S3 (step 2), say because a value in the JSON needs to change. Meanwhile another session, request B, also pulls down a copy of the data, and it does so in between A's step 1 and step 2. Then on request B's step 2 you are potentially overwriting data that you did not intend to.
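A minimal sketch of that lost-update pattern (bucket and key names are made up). Nothing in the plain GET/PUT calls detects that another writer got in between:

    // Naive read-modify-write against S3: two concurrent callers can both
    // read the same version, and the second write silently wins.
    const AWS = require('aws-sdk');
    const s3 = new AWS.S3();

    function incrementCounter(callback) {
      // step 1: pull down the JSON
      s3.getObject({ Bucket: 'my-bucket', Key: 'state.json' }, function (err, res) {
        if (err) return callback(err);
        const state = JSON.parse(res.Body.toString('utf8'));

        state.counter += 1; // mutate a value

        // step 2: write it back -- nothing stops another request from having
        // read the old value in the meantime, so one increment can be lost.
        s3.putObject({
          Bucket: 'my-bucket',
          Key: 'state.json',
          Body: JSON.stringify(state),
          ContentType: 'application/json'
        }, callback);
      });
    }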

* performance fluctuation

Having used S3 extensively, I can tell you that its performance varies. What if your 400ms requests go through a 15-minute period where they become 4000ms requests? Stranger things have happened. S3 is an extremely reliable system in terms of not losing your data and generally working all the time, but performance fluctuates. With each request that takes longer than expected, resources in your application (file descriptors, CPU, etc.) pile up. If your application can sustain itself comfortably at these upper levels of latency you will be fine, but the slowdown may trigger downstream impatience in the users of your application, causing them to refresh data even more.

* querying and features

S3 is just a key-value store at the end of the day. Maybe that is enough for your application. One day, however, if you decide you want to know how many items in your database have a certain value, or how many were created on a certain date, you have no option but to download the entire bucket iteratively from S3.

S3 doesn't even have a good way to estimate the amount of data in a bucket. Clients that do this have to iterate over the contents of your bucket, and if one day the data grows to 100GB, that is a slow task.
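To make that concrete, here is roughly what a full scan looks like. This is a sketch only; it assumes an aws-sdk (v2) recent enough to expose .promise(), and that every object in the bucket is a JSON document:

    // "Querying" S3 means listing every key and downloading every object.
    const AWS = require('aws-sdk');
    const s3 = new AWS.S3();

    // Count stored JSON objects matching a predicate -- touches the whole bucket.
    async function countMatching(bucket, predicate) {
      let count = 0;
      let marker;

      do {
        const page = await s3.listObjects({ Bucket: bucket, Marker: marker }).promise();

        for (const obj of page.Contents) {
          const res = await s3.getObject({ Bucket: bucket, Key: obj.Key }).promise();
          if (predicate(JSON.parse(res.Body.toString('utf8')))) count += 1;
        }

        // paginate by passing the last key seen as the next Marker
        marker = page.IsTruncated
          ? page.Contents[page.Contents.length - 1].Key
          : undefined;
      } while (marker);

      return count;
    }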

In short, S3 tends to lend itself better to data that doesn't have as many concurrency issues, or data that is mostly static. However, where there's a will there's a way. If it were me, I would be most worried about the lack of database-like features, and secondly about the concurrency issues, but it does depend on your application.

Cheers


Excellent points, thanks for sharing your knowledge.


I've used S3 to store JSON objects. Two pain points I've noticed:

- If your tests change the time (e.g. with Delorean for Ruby), S3 will fail because the protocol depends on your client having approximately the same time as the server.

- If you ever want to load several S3 files at once, e.g. to show a list of 30 Foos, you'll need to make 30 requests. So this is a bit like an n+1 problem. There might be a way around this, but I haven't investigated it, and it will probably require you to sidestep the abstractions you've built. I'd say with 99.9% confidence you will want to do this someday.
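The usual mitigation is to at least fire those requests in parallel. A rough sketch, assuming an aws-sdk (v2) recent enough to expose .promise() and a made-up bucket/key layout:

    // Fetch 30 "Foo" objects in parallel rather than one after another --
    // still 30 requests, but latency is roughly one round trip instead of 30.
    const AWS = require('aws-sdk');
    const s3 = new AWS.S3();

    function loadFoos(ids) {
      return Promise.all(ids.map(function (id) {
        return s3.getObject({ Bucket: 'my-bucket', Key: 'foos/' + id + '.json' })
          .promise()
          .then(function (res) { return JSON.parse(res.Body.toString('utf8')); });
      }));
    }

    // usage: loadFoos(['1', '2', '3']).then(console.log);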

These days I mostly use Postgres instead of MySQL, but I can't help but think that querying MySQL from Node has got to be easier than building your own database on top of S3.


I am also thinking of using Postgres; I think the support is better now with native JSON support. You still need to write the SQL queries by hand (in Node) though, which is a pain. Another thing I remember is that with a new DB you had to get the tables set up and initialized, so more boilerplate code there :)
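For reference, this is roughly the kind of query code I mean; a minimal sketch with the pg module, where the table and column names are made up and connection settings come from the usual PG* environment variables:

    // Querying a Postgres JSON column from Node with the `pg` module.
    // Table and column names are made up for illustration.
    const { Pool } = require('pg');
    const pool = new Pool(); // reads PGHOST, PGUSER, PGDATABASE, etc.

    // Find readings whose JSON payload has sensor = <sensorId>.
    function readingsForSensor(sensorId) {
      return pool.query(
        "SELECT id, payload FROM readings WHERE payload->>'sensor' = $1",
        [sensorId]
      ).then(function (res) { return res.rows; });
    }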


You don't realize it yet, but you're expressing a ton of anti-patterns here. Not the least of which is an urge to prematurely optimize your project, and a desire to invent your own solution to a broadly (but not universally) solved problem.

Don't assume that you can anticipate where your scaling challenges are really going to strike. MySQL and Mongo are both more than capable of supporting your project for quite a while, and after you collect more empirical data on your project's growth and bottlenecks, you can start thinking about how to address those problems.


Hi, thanks for your reply. I like your answer. I am aware of the programmer's urge to over-optimize "all the things" for the million requests per second; I get that urge often myself.

My issue is that, while learning, you want to make the best choice when it comes to the users' data. My take is that if one does not fully understand a component of the stack, one takes precautions. For example, I might not be the best at setting up a couple of MySQL servers doing replication and monitoring the whole lot. What one can do, however, is use a service like RDS, which takes backups for you and lets you restore or spin up new instances, etc. Which is nice.

Having switched to Mongo a while ago (using Compose), I feel that I don't have that much control. So yeah, just experimenting I guess.


This could work, but it would make it impossible to query your data without traversing every record in your DB.

If you want to store unstructured data but don't want to deal with scaling pains in the future, try looking into hosted solutions.

- Compose is pretty great for RethinkDB, Mongo, and ElasticSearch.

- Amazon DynamoDB can scale like mad and supports secondary indexes, arbitrary queries, and triggers (through Lambda).
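To give a flavour of querying DynamoDB, here is a rough sketch using the DocumentClient; the table, index, and attribute names are made up:

    // Querying a DynamoDB global secondary index with the DocumentClient.
    // Table, index, and attribute names are made up for illustration.
    const AWS = require('aws-sdk');
    const db = new AWS.DynamoDB.DocumentClient();

    // All readings for one sensor since a given timestamp.
    function recentReadings(sensorId, since, callback) {
      db.query({
        TableName: 'Readings',
        IndexName: 'sensor-timestamp-index',
        KeyConditionExpression: 'sensorId = :s AND #ts >= :t',
        ExpressionAttributeNames: { '#ts': 'timestamp' }, // "timestamp" is a reserved word
        ExpressionAttributeValues: { ':s': sensorId, ':t': since }
      }, callback);
    }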


Hi, thanks for your reply. I actually use Compose at the moment. I think it's pretty good! Just being a 3rd-party solution makes me slightly uneasy :)


Try S3fuse. It saves you from writing that driver and the caching.

https://code.google.com/p/s3fuse/


Object store or database? The use cases you suggest are centered on storing files. If that is your central use case, then S3 is designed for exactly that, and perfect.


Given you're looking at using S3, is there a particular reason you aren't using DynamoDB? Or even RDS?


Hi, I am actually using RDS for some other projects. I haven't used DynamoDB. I am using mongo at the moment. Thanks



