The reason you don't keep files in your database is that file systems are much better at handling files: faster, more efficient, all the usual reasons a single-purpose layer tends to beat a general-purpose one.
Databases are much better at handling discrete data than file systems - that's what they are built for. Sure, I could keep my data in a bunch of little files, but that doesn't work as well.
(MS SQL has a feature, FILESTREAM, where you "store" the file in the database, but the db actually writes the file to the filesystem and just maintains a pointer to it - not a bad hybrid)
I don't know how well GridFS stacks up (it is on my todo list), although I do like the idea of replication and sharding being built in. My gut (which has been wrong before) says that it is good for websites, not so good for general storage.
I use MongoDB for the same reason as mrkurt: prototyping new schemas is a breeze. I still find myself reaching for the old RDBMS toolbox as things move along, grow, and stabilize. Sometimes, a JOIN _is_ the right tool for the job.
To provide some balance: the reasons that many people /do/ keep files in their database are 1) to extend transactions around the storage of those files, and 2) to allow a single backup or replication infrastructure (such as write-ahead log shipping) to handle all forms of data that are being managed.
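Concretely, point 1 looks something like this (a minimal sketch, assuming Postgres with psycopg2 and a made-up documents table): the file bytes and their metadata commit or roll back as one unit, which you can't get when the bytes live out on a filesystem.

    # Sketch: file bytes and metadata share one transaction.
    # Assumes: CREATE TABLE documents (name text, body bytea);
    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:   # the with-block wraps a transaction
        data = open("report.pdf", "rb").read()
        cur.execute("INSERT INTO documents (name, body) VALUES (%s, %s)",
                    ("report.pdf", psycopg2.Binary(data)))
        # any other statements here are part of the same transaction;
        # an exception rolls back the file and the metadata together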
What about batch processing a large number of small files? Say, 10 million image files of 500KB each. A typical file system will need a separate seek for each small file.
I wonder if GridFS stores small files in blocks to allow efficient batch retrieval for processing.
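If it does, batch retrieval could look roughly like this (a sketch assuming pymongo and the standard fs.files/fs.chunks layout): one query pulls the chunks for a whole batch of files, instead of one filesystem seek per file.

    # Sketch: fetch many small files' chunks with a single query.
    from pymongo import MongoClient

    db = MongoClient().mydb
    ids = [f["_id"] for f in db.fs.files.find().limit(1000)]
    cursor = db.fs.chunks.find({"files_id": {"$in": ids}}).sort(
        [("files_id", 1), ("n", 1)])
    files = {}
    for c in cursor:                       # reassemble each file's chunks
        files.setdefault(c["files_id"], []).append(c["data"])
    blobs = {fid: b"".join(parts) for fid, parts in files.items()}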
GridFS is just a standard convention for mapping files onto a store like MongoDB -- you can implement GridFS over MongoDB in just a few lines of Ruby code. GridFS breaks files into fixed-size chunks and uses a single MongoDB document per chunk. It's not exactly rocket science.
The author of the blog post touts it as a _feature_ of MongoDB, but it's more accurate to say that it's an artifact of MongoDB's 4MB document size limit -- you simply cannot store large files in MongoDB without breaking them up. Sure, by splitting files into chunks you can parallelize loading them, but that's about the only advantage.
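Something like this toy version (sketched in Python rather than Ruby; real GridFS also keeps a metadata document per file in fs.files, with length, md5, and so on):

    # Toy GridFS: fixed-size chunks, one MongoDB document per chunk.
    from bson import ObjectId
    from pymongo import MongoClient

    db = MongoClient().mydb
    CHUNK = 256 * 1024

    def put(data):
        fid = ObjectId()
        for i in range(0, len(data), CHUNK):
            db.chunks.insert_one({"files_id": fid, "n": i // CHUNK,
                                  "data": data[i:i + CHUNK]})
        return fid

    def get(fid):
        parts = db.chunks.find({"files_id": fid}).sort("n", 1)
        return b"".join(p["data"] for p in parts)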
Among the key-value NoSQL databases, Cassandra and Riak are much better at storing large chunks of data -- neither has a specific limit on object size. I have used both successfully to store assets such as JPEGs, and both are extremely fast on reads and writes.
Neither is built for that purpose, though, and both will load an entire object into memory instead of streaming it, so with lots of concurrent queries you will simply run out of memory at some point -- 10 clients each loading a 10MB image at the same time will make the database peak at 100MB at that moment.
Actually, Riak uses dangerously large amounts of memory when just saving a number of large files. I don't know if that's because of Erlang's garbage collector lagging behind, or what; I would be worried about swapping or running out of memory when running it in a production system.
You actually list one of the advantages of GridFS right there in your post: streaming. If you are serving up a 700MB video, you don't want to have to load the whole thing into memory or push the whole thing to the app server before you can start streaming. Since we break the files into chunks, you can start sending data as soon as the first chunk (256 KB by default) is loaded, and only need to have a little bit in RAM at any given moment. (Although obviously the more you have in RAM, the faster you will be able to serve files)
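In pymongo the read side looks roughly like this (a sketch; file_id here stands for whatever a previous fs.put() returned, and response for wherever you're sending bytes):

    # Sketch: stream a big file out of GridFS chunk by chunk.
    import gridfs
    from pymongo import MongoClient

    fs = gridfs.GridFS(MongoClient().mydb)
    grid_out = fs.get(file_id)             # file_id: hypothetical, from fs.put()
    while True:
        chunk = grid_out.read(256 * 1024)  # one chunk's worth at a time
        if not chunk:
            break
        response.write(chunk)              # response: hypothetical client socket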
GridFS is simple (and probably could be implemented with most DBs) but it was designed to have some nice properties. Notably, you don't have to load the whole file into memory: you can stream it back to the client. You can also easily get random access to the file, not just sequential.
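The random access falls out of the same chunk layout; with pymongo's gridfs it's just a seek (a sketch, again with a hypothetical file_id):

    # Sketch: jump into the middle of a stored file without reading it all.
    import gridfs
    from pymongo import MongoClient

    fs = gridfs.GridFS(MongoClient().mydb)
    f = fs.get(file_id)         # hypothetical id from an earlier fs.put()
    f.seek(f.length // 2)       # seek to the middle of the file
    middle = f.read(64 * 1024)  # only the chunks covering this range are read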
"The reason you don't keep files in your database is that file systems are much better at handling files. Faster, more efficient, basically all the reasons that a single-purpose layer tends to be faster than a general-purpose layer."
Databases are much better at handling discrete data than file systems - that's what they are built for. Sure, I could keep my data in a bunch of little files, but that doesn't work as well.
(MS SQL has a feature where you "store" the file in the database, but the db writes the file to the filesystem, and just maintains a pointer to the actual file - not a bad hybrid)
I don't know how well GridFS stacks up (it is on my todo list), although I do like the idea of replication and sharding being built in. My gut (which has been wrong before) says that it is good for websites, not so good for general storage.
I use MongoDB for the same reason as mrkurt: prototyping new schemas is a breeze. I still find myself reaching for the old RDBMS toolbox as things move along, grow, and stabilize. Sometimes, a JOIN _is_ the right tool for the job.