Infinitely Scalable Framework with Amazon Web Services? (nanobeepers.com)
8 points by mattjaynes on April 7, 2007 | hide | past | favorite | 12 comments



This is pretty interesting. I'm not sure I understand what you mean though about separating databases "by user". Could you elaborate on that? How would my application's database change if I were a user of this service?


Sure. Basically each user would have their own database file generated on account creation. Then in your app, instead of doing something like:

connectDB("/path/to/db")

you would do:

connectDB("/path/to/$user/db")

Also, if you needed to query all of your users' data in aggregate, you would use the 'ATTACH' capability of SQLite to essentially create a master database. Note that you will need to design your schema appropriately (thoughtful table/column naming, etc) for this to work well. See: http://www.sqlite.org/lang_attach.html
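A rough sketch of both ideas in Python's stdlib sqlite3 module -- one database file per user, then ATTACH-ing the per-user files into a single connection for an aggregate query. The directory layout and the `orders` table are made up for illustration; a real app would use its own paths and schema:

```python
import os
import sqlite3
import tempfile

# Stand-in for the app's data root (the "/path/to/$user/db" idea).
base = tempfile.mkdtemp()

def user_db_path(user):
    path = os.path.join(base, user)
    os.makedirs(path, exist_ok=True)
    return os.path.join(path, "db")

# Each account gets its own database file with the same schema.
for user, amount in [("alice", 10), ("bob", 32)]:
    con = sqlite3.connect(user_db_path(user))
    con.execute("CREATE TABLE orders (total INTEGER)")
    con.execute("INSERT INTO orders VALUES (?)", (amount,))
    con.commit()
    con.close()

# Aggregate query: ATTACH each per-user file into one master connection.
# (SQLite caps the number of simultaneously attached databases.)
master = sqlite3.connect(":memory:")
for i, user in enumerate(["alice", "bob"]):
    master.execute("ATTACH DATABASE ? AS u%d" % i, (user_db_path(user),))

# Because every file shares the same schema, each attached db can be
# queried through its alias (u0, u1, ...).
total = master.execute(
    "SELECT (SELECT SUM(total) FROM u0.orders)"
    " + (SELECT SUM(total) FROM u1.orders)"
).fetchone()[0]
print(total)  # 42
```

For many users you'd likely attach and detach in batches rather than all at once, since SQLite limits how many databases one connection can hold.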


From what I've read of evaluations of S3, the latency of S3 requests seems significant. Significant enough that using S3 as a backend database doesn't seem feasible. Is this true or am I mistaken?


I forgot to mention that I think this is a fantastic idea. I'm just not sure if it will be able to replace a local MySQL cluster.


Thanks, AWS is amazing and has been super fun to play with. This solution will of course only work for a certain subset of apps; apps that require more database complexity will not be a good fit for this framework. My apps are typically designed so simply that they could just use a flat-file storage system instead of a db. SQLite is awesome because it provides the flat-file storage, but in a format that allows you to query it like a normal database. The simplicity of that model is amazing and hard for me to resist.

I've used MySQL clusters in most of my projects while working for companies, but I often found that the added complexity did little for us. My last company hired two database guys with fat salaries just to manage it. My goal with this framework is to really reduce the cost of maintenance, backups, etc. S3 is distributed and so really doesn't need much in the way of backups -- but backing up this system is as simple as tar-gzipping the directory of flat-file databases.
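The backup step described above can be sketched with Python's stdlib tarfile module. The directory names here are throwaway stand-ins for wherever the per-user database files actually live:

```python
import os
import tarfile
import tempfile

def backup_databases(db_root, archive_path):
    # Archive the whole directory of per-user database files in one pass.
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(db_root, arcname="db-backup")

# Demo with a temporary directory standing in for the real db root.
root = tempfile.mkdtemp()
open(os.path.join(root, "alice.db"), "w").close()
archive = os.path.join(tempfile.mkdtemp(), "backup.tar.gz")
backup_databases(root, archive)

with tarfile.open(archive) as tar:
    names = tar.getnames()
```

Since each user's data is a single self-contained file, per-user restore is just extracting one member from the archive.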

This was interesting from pg's Viaweb FAQ:

http://www.paulgraham.com/vwfaq.html

"What database did you use?

We didn't use one. We just stored everything in files. The Unix file system is pretty good at not losing your data, especially if you put the files on a Netapp.

It is a common mistake to think of Web-based apps as interfaces to databases. Desktop apps aren't just interfaces to databases; why should Web-based apps be any different? The hard part is not where you store the data, but what the software does.

While we were doing Viaweb, we took a good deal of heat from pseudo-technical people like VCs and industry analysts for not using a database-- and for using cheap Intel boxes running FreeBSD as servers. But when we were getting bought by Yahoo, we found that they also just stored everything in files-- and all their servers were also cheap Intel boxes running FreeBSD.

(During the Bubble, Oracle used to run ads saying that Yahoo ran on Oracle software. I found this hard to believe, so I asked around. It turned out the Yahoo accounting department used Oracle.)"


There's a guy working on a MySQL storage engine that uses S3:

http://fallenpegasus.livejournal.com/tag/s3

It's still at a very early stage in terms of implemented features, but he seems to be moving forward rapidly.

I have no idea what kind of performance he's getting -- I'm planning to check out the MySQL source sometime next week and do a test build. Extremely cool idea. :-)


Yeah, that's a valid concern. Fortunately the read/write caching that S3DFS provides largely negates that problem. The caching is on the EC2 instance and so should provide great responsiveness to the app.


I was thinking more of the latency involved in making the roundtrip request from the app server to S3, which is probably on the order of tenths of a second. It would force you to rely heavily on a local memcached to achieve acceptable performance (which is probably a good thing anyway).
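The cache-aside pattern being suggested could look roughly like this, with a plain dict standing in for a local memcached instance (all names and the TTL are hypothetical):

```python
import time

cache = {}       # stand-in for a local memcached instance
TTL = 60.0       # seconds before an entry is refetched
s3_calls = 0     # counts the simulated round trips

def fetch_from_s3(key):
    # Placeholder for the real S3 request (tenths of a second each).
    global s3_calls
    s3_calls += 1
    return "object-for-" + key

def get(key):
    hit = cache.get(key)
    if hit is not None and time.time() - hit[0] < TTL:
        return hit[1]                   # served locally, no S3 latency
    value = fetch_from_s3(key)          # cache miss: pay the round trip
    cache[key] = (time.time(), value)
    return value

first = get("photo.jpg")
second = get("photo.jpg")   # cache hit; no second round trip
```

Only the first request for a key pays the S3 latency; everything inside the TTL window is served from local memory, which is the same effect the S3DFS read cache mentioned elsewhere in this thread aims for.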

I'm reminded of a conversation with Facebook's CTO at Startup School, where he mentioned that they experimented with S3 for their photo hosting, but the latency of loading images from S3 made it unacceptable by their performance standards.


Facebook's 'young' and 'technical' performance requirements must be off the wall ;)

For the vast majority of apps, the latency will not be an issue. Especially since S3DFS has caching built in.

A good example is SmugMug, who switched over to using S3 to serve images while they were working on their own datacenter. Don MacAskill (their CEO and head programmer) says that switching back and forth was completely transparent to their users, with no human-discernible latency. For more, see:

http://blogs.smugmug.com/don/2007/03/08/amazon-s3-the-speed-of-light-problem/

http://blogs.smugmug.com/onethumb/files/ETech-SmugMug-Amazon-2007.pdf


The terms[1] of S3DFS don't work for me, so I suggest people look at MogileFS. I just learned about it recently, so I can't give a review... but it looks interesting.

http://www.danga.com/mogilefs/

[1] I'll pay good money for software, and their prices are ok; I just don't like the feeling I get that they will have my company by the balls.


Added this in the hopes of getting some good feedback from the community on this framework I'm developing.

The post is derived from this comment: http://news.ycombinator.com/comments?id=9991


You may want to take a look at http://www.enomalism.com; it provides migration to and from Amazon EC2 as well as the ability to create your own geo-targeted cloud.



