Hello,
I have a website that displays data from sensors. The backend is a Node.js server (doesn't really matter), and I have been trialling MongoDB and MySQL. I haven't been that happy with MySQL, as it needs a lot of code to play well with Node. On the other hand there's Mongo; it's nice, but I have no idea how it will scale in the future. Plus, seeing how the HN community approaches Mongo, I am exploring other options :).
Anyway... on to the question.
So I have been thinking of using S3 as a database and writing a Node.js driver with some basic query abilities. I am willing to have a JSON request take up to 500-700ms, as long as it scales cheaply and easily to terabytes of data. I would add caching in the future, but for now reliable, scalable storage is what I am after.
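To give an idea of what I mean by a "driver", here is a minimal sketch: store and fetch JSON documents by key. The bucket name and key layout are made up, and I'm assuming the AWS SDK v3 S3 client for Node.

```javascript
// Minimal sketch of a JSON "driver" on top of S3.
// Bucket name, key layout, and document shape are placeholders; AWS SDK v3 assumed.
const { S3Client, GetObjectCommand, PutObjectCommand } = require("@aws-sdk/client-s3");
const s3 = new S3Client({ region: "us-east-1" });
const BUCKET = "my-sensor-data"; // placeholder

// Save one JSON document, e.g. under "readings/<sensorId>/<timestamp>.json"
async function putDoc(key, doc) {
  await s3.send(new PutObjectCommand({
    Bucket: BUCKET,
    Key: key,
    Body: JSON.stringify(doc),
    ContentType: "application/json",
  }));
}

// Fetch one JSON document by key
async function getDoc(key) {
  const res = await s3.send(new GetObjectCommand({ Bucket: BUCKET, Key: key }));
  return JSON.parse(await res.Body.transformToString());
}
```

The "query abilities" would then be built on top of listing keys and filtering in the Node layer.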
I have done a preliminary benchmark between my server and S3 (a rough version of the timing script follows the numbers).
* A file of 6.9MB takes an average of 400ms, with spikes of several seconds on the first requests.
* A file of 400 bytes takes under 50ms, again with random spikes.
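The timing was gathered with roughly this kind of loop (placeholder bucket/keys, again assuming the v3 SDK client):

```javascript
// Rough latency check against S3: time a GET, including draining the body.
const { S3Client, GetObjectCommand } = require("@aws-sdk/client-s3");
const s3 = new S3Client({ region: "us-east-1" });

async function timeGet(key) {
  const start = process.hrtime.bigint();
  const res = await s3.send(new GetObjectCommand({ Bucket: "my-sensor-data", Key: key }));
  await res.Body.transformToString(); // drain the body so the full transfer is timed
  return Number(process.hrtime.bigint() - start) / 1e6;
}

// A handful of runs to see the spread, not just one sample.
(async () => {
  for (let i = 0; i < 10; i++) {
    console.log(`${(await timeGet("readings/sensor-1/latest.json")).toFixed(1)} ms`);
  }
})();
```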
So, given how nicely S3 scales, would this be a good way of attacking the problem? What do you think?
I did research before posting :)
Cheers.
It's not the worst idea, but here are some thoughts to consider:
* concurrency and locking
Databases are good at giving you a consistent view of data. Some are better at it than others. Depending on your application, this may be more or less important to you. However, how often could your application have multiple requests come in at once? How would you deal with two requests writing to the same file in S3? Which one will have the accurate representation of your data?
To give an example, let's say (step 1) request A comes in and pulls down this JSON stored in S3. This turns into a subsequent request (step 2) to store the data back in S3 (say a value in the JSON needs to change). Meanwhile, another session is started: request B, which also pulls down a copy of the data. Let's say request B does this in between step 1 and step 2. Then on step 2 of request B, you are potentially overwriting data that you did not intend to.
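To make that concrete, this is the shape of code where the lost update happens. It's just a sketch with made-up bucket/key names, assuming the AWS SDK v3 client:

```javascript
// Sketch of the read-modify-write race described above.
const { S3Client, GetObjectCommand, PutObjectCommand } = require("@aws-sdk/client-s3");
const s3 = new S3Client({ region: "us-east-1" });

async function bumpValue(key) {
  // step 1: pull the current JSON down from S3
  const res = await s3.send(new GetObjectCommand({ Bucket: "my-sensor-data", Key: key }));
  const doc = JSON.parse(await res.Body.transformToString());
  doc.count = (doc.count || 0) + 1; // change a value locally

  // step 2: write it back. Nothing stops request B, which read the same
  // version between step 1 and step 2, from overwriting this write later.
  await s3.send(new PutObjectCommand({
    Bucket: "my-sensor-data",
    Key: key,
    Body: JSON.stringify(doc),
    ContentType: "application/json",
  }));
}
```

A real database would give you a transaction, a lock, or an atomic update for this; with plain S3 you have to design around it.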
* performance fluctuation
Having used S3 extensively, I can tell you that its performance varies. What if your 400ms request goes through a 15-minute period where it becomes a 4000ms request? Stranger things have happened. S3 is an extremely reliable system in terms of not losing your data and generally working all the time, but performance varies. With each request that takes longer than you expect, resources in your application (file descriptors, CPU, etc.) pile up. If your application can sustain itself comfortably at these upper levels of latency you would be fine, but the latency may trigger downstream impatience in the users of your application, causing them to refresh data even more.
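If you do go this route, one mitigation for the pile-up (not for the latency itself) is to put a hard cap on each request. A sketch, assuming the v3 SDK and Node's built-in AbortController; the cutoff value and names are arbitrary:

```javascript
// Cap how long a single S3 read can tie up resources.
// AbortController is built into Node 15+; the 2s cutoff is arbitrary.
const { S3Client, GetObjectCommand } = require("@aws-sdk/client-s3");
const s3 = new S3Client({ region: "us-east-1" });

async function getWithTimeout(key, ms = 2000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    const res = await s3.send(
      new GetObjectCommand({ Bucket: "my-sensor-data", Key: key }),
      { abortSignal: controller.signal } // throws if the cutoff is hit, instead of piling up
    );
    return JSON.parse(await res.Body.transformToString());
  } finally {
    clearTimeout(timer);
  }
}
```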
* querying and features
S3 is just a key-value store at the end of the day. Maybe that is enough for your application. One day, however, if you decide you want to know how many items in your database have a certain value, or how many were created on a certain date, you have no option but to iterate over the entire bucket, downloading objects from S3 as you go.
S3 doesn't even have a good way to estimate the amount of data in a bucket. Clients that do this have to iterate over the bucket's contents, and if the data one day grows to 100GB, that is a slow task.
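To illustrate, an ad-hoc "query" with nothing but S3 ends up looking something like this sketch (placeholder bucket name, v3 SDK assumed): list every key page by page, download each object, and filter in application code.

```javascript
// Ad-hoc "query" without a database: one LIST page at a time, one GET per object.
const { S3Client, ListObjectsV2Command, GetObjectCommand } = require("@aws-sdk/client-s3");
const s3 = new S3Client({ region: "us-east-1" });

async function countMatching(predicate) {
  let count = 0;
  let token;
  do {
    const page = await s3.send(new ListObjectsV2Command({
      Bucket: "my-sensor-data",
      ContinuationToken: token,
    }));
    for (const obj of page.Contents || []) {
      const res = await s3.send(new GetObjectCommand({ Bucket: "my-sensor-data", Key: obj.Key }));
      const doc = JSON.parse(await res.Body.transformToString());
      if (predicate(doc)) count++; // every "query" pays the full download cost
    }
    token = page.NextContinuationToken;
  } while (token);
  return count;
}
```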
In short, S3 tends to lend itself better to data that doesn't have many concurrency issues, or data that is mostly static. However, where there's a will there's a way. If it were me, I would be most worried about the lack of database-like features, and secondarily about the concurrency issues, but it does depend on your application.
Cheers