I must point out Jim Gray's paper To Blob or Not To Blob[0]. His team considered NTFS vs. SQL Server, but most of the rationale applies to any filesystem vs. database decision.
The summary was "The study indicates that if objects are larger than one megabyte on average, NTFS has a clear advantage over SQL Server. If the objects are under 256 kilobytes, the database has a clear advantage. Inside this range, it depends on how write intensive the workload is," but keep in mind this is spinning media from 2006. Modern SSDs change the equation quite a bit, as they are much more friendly to random IO and benefit less from database write-ahead log and buffer pool behavior.
Also, when deciding between blob vs. filesystem, blobs bring transactional and recovery consistency. The DB is self contained, and all blobs are contained in it. A restore of the DB on a different system yields a consistent system: it won't have links to missing files, and there won't be orphaned files left over (files not referenced by records in the DB).
Despite all this, my practical experience is that filesystem is better than blobs for things like uploaded content, images, pngs and jpgs etc. Blobs bring additional overhead, require bigger DB storage (usually more expensive, think AWS RDS), and the increased size cascades into operational overhead (bigger backups, slower restores etc).
My experience (for an application which had a working set of under 1 GB of files in the 50kb to N MB range, and approximately 50 GB persisted at any given time) was that preserving access to the toolchain which operates trivially with files was worth the additional performance overhead of working with the files and, separately, occasionally having to retool things to e.g. not have 10e7 files in a single folder, which is something that Linux has some opinions (none of them good) about.
Trivial example: it's easy to delete any file most recently accessed more than N months ago [+] with a trivial line in cron (the exact line escapes me -- it involves find), but doing that with a database requires that you roll your own access tracking logic. Incremental backups of a directory structure are easy (rsync or tarsnap); incremental backups of a database appeared to me to be highly non-trivial.
[+] Since we could re-generate PDFs or gifs from our source of truth at will (with 5~10 seconds of added latency), we deleted anything not accessed in N months to limit our hard disk usage.
> e.g. not have 10e7 files in a single folder, which is something that Linux has some opinions (none of them good) about.
I keep running into this as a common anti-pattern in software, even with software developers who should know better (having fallen afoul of it before).
I ran into it with some semi-bespoke software once, used by a remarkable number of state departments across the country. It was one of many, many things wrong with that godawful thing. Worse, not only were they dumping files in a single directory, almost any time they went to do something with the directory they'd request a directory listing too. Like many of the bizarre things the app did, it never actually used the information it acquired that way.
I never understood why it matters whether files are in a single directory or split over multiple directories. At the end of the day it is the same file system with the same number of files. The directories are just a logical structure. Why would there be a difference?
If you want raw file access, your ideal filesystem is a giant key-value store that keys on the full path to the file. This choice means doing a directory listing will involve a lot of random access reads across the disk, and in the days before SSDs this would be a performance killer.
So instead, a directory would have a little section of disk to write its file entries to. These would be fairly small, because most directories are small, and if there were more files added than there was room, another block would be allocated. Doing a file listing would be reasonably quick as you're just iterating over a block and following the link to the next block, but random file access would get slower the more files are added.
It's probably a solvable problem in filesystem design, but because current filesystems fall down hard on the million-files-in-a-directory use case, nobody actually does that, so there's no drive to make a filesystem performant in that case.
Ok but listing 100 directories with 10,000 files each should take the same time as listing one directory with 1,000,000 files (what you are describing is a filesystem with more files vs. fewer files, not one with more subdirectories vs. fewer subdirectories).
Well, my Linux-fu isn't very up-to-date, but my guess (which actually goes for any file system) is that a directory is a bit more than just a logical organization. It's an actual "thing" on your hard drive which keeps track of everything beneath it. And so it grows in size when more files are there, which slows down access.
Just don't let anyone mount with noatime for performance's sake. I found that little gem on a drive where paring down to recently accessed content would have been very helpful.
> Despite all this, my practical experience is that filesystem is better than blobs for things like uploaded content, images, pngs and jpgs etc.
This is especially true if you want to deliver them back to the users in the context of a web application. If you place this kind of content in a database, you'll need to serve them with your application.
If you use files for this content, you get two interesting options. For one, you can use any stock web server like nginx to serve these files - and nginx will outperform your application in this context. On top of that, it's easy to push this content onto a CDN in order to further cut the latency to the user.
Completely shameless plug here: My webserver[1] will outperform nginx for serving up static content like this.
It has many optimizations, like not opening the file it's serving (which reduces the latency of walking/interacting with the VFS layer); almost no runtime allocations on the heap (minimal stack frames and sizes as well), which can be completely disabled at the expense of logging; high performance logging which does not interfere with processing requests; and the normal stuff in a fast webserver (sendfile, keep-alive, minimal code paths between reading the user request and delivering the first byte).
You can check its health by making a request to a known resource -- they're practically free.
It delivers an ETag header that it computes the first time it opens a file and maintains as long as that file is open. If it has to re-open the file it will generate a new ETag value.
In this case nginx is the 'application'. The request is still going to be expressed as a SQL query, sent to PG, parsed, compiled, optimized, executed, and then the tabular response formatted as the HTTP response. Many more steps compared to a file-on-disk response.
Ah, but the filesystem has to do all that as well! It must receive a path, parse it, and then execute the query, with possible optimizations (e.g. ext4 even has indexes implemented with hashed B-trees).
Does your database transparently DMA the buffer cache to the NIC after loading the file in the background using read-ahead without any process context switching or upcalls to userland?
Loosely. Today's file systems aren't transactional (i.e. ACID compliant), which is a basic property that most people consider necessary for a database.
Many filesystems are logged and provide atomic operations. You can build multi-step transactions on top of them with various operations. And with CoW filesystems you can even get read consistency. Oh, and optimistic locking also is an option.
Are we talking about the POSIX interface to files or is there a new transactional API that allows for atomic writes (all pages are written to disk or none are), consistency, isolation (opening a file in another process which is currently being written won't show the pages that are being written), and makes the writes always durable (sync; not fsync)?
Multi-file transactions can be built by moving whole directories over symlinks to the previous version. The linux-specific RENAME_EXCHANGE flag can simplify this.
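A minimal sketch of the symlink-swap idea, with made-up directory names (error handling trimmed):

    #include <stdio.h>    /* rename() */
    #include <unistd.h>   /* symlink(), unlink() */

    /* Publish a new version directory atomically: point a fresh symlink at it,
     * then rename that symlink over the "current" name. rename() replaces its
     * target atomically, so readers see either the old tree or the new tree. */
    int publish(const char *new_version_dir) {
        unlink("current.tmp");                       /* ok if it doesn't exist */
        if (symlink(new_version_dir, "current.tmp") == -1)
            return -1;
        if (rename("current.tmp", "current") == -1)
            return -1;
        /* On Linux with glibc >= 2.28, renameat2(AT_FDCWD, "staging", AT_FDCWD,
         * "current", RENAME_EXCHANGE) can instead swap the two names in one
         * step, keeping the old version reachable for rollback. */
        return 0;
    }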
> or is there a new transactional API
CoW filesystems give you snapshots and reflink copies via ioctls. Under heavy concurrency this can provide cheaper isolation than locking.
> and makes the writes always durable (sync; not fsync)?
For durability fsync is sufficient if you do it on the file and the directory.
To combine atomic and durable you can do the write, fsync, rename, fsync dir dance.
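For concreteness, a minimal sketch of that dance (error handling abbreviated, paths made up):

    #include <fcntl.h>
    #include <stdio.h>    /* rename() */
    #include <unistd.h>

    /* Durably replace "data/blob": write a temp file, fsync it, rename it over
     * the target (the atomic step), then fsync the directory so the rename
     * itself survives a crash. */
    int atomic_replace(const void *buf, size_t len) {
        int fd = open("data/blob.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        close(fd);

        if (rename("data/blob.tmp", "data/blob") != 0) return -1;  /* atomic */

        int dfd = open("data", O_RDONLY);
        if (dfd < 0) return -1;
        int rc = fsync(dfd);                          /* make the rename durable */
        close(dfd);
        return rc;
    }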
>For durability fsync is sufficient if you do it on the file and the directory. To combine atomic and durable you can do the write, fsync, rename, fsync dir dance.
Right. This allows for the correct behaviour but I wouldn't conclude that POSIX is a transactional API. I would regard this as being able to build a transactional API on top of it; but not transactional in itself. I mean, if you had to do the same dance in SQL (insert to dummy pkey and then update the entry) you would rightly dismiss it.
> Also, when deciding between blob vs. filesystem, blobs bring transactional and recovery consistency.
Interesting. I would have thought the other way around, at least for crash-resistance: the I/O stack (including disk hardware) has a tendency to reorder writes and so updates that live within a single file will corrupt fairly easily. Separate files not so much. I vaguely remember a paper on that (Usenix?) and sqlite generally did very well except for that point.
You have two consistency issues with storing the files in the filesystem:
- rollbacks in the DB can lead to orphaned files on disk. One can try to add logic in the app (e.g. a catch block that removes the file if the DB rolled back) but that is not gonna help on a crash
- it is impossible to obtain a consistent backup of both the DB and the filesystem. You can back up the filesystem and the DB, but the two will not be consistent with each other unless you froze the app during the backup. When you restore the two backups (filesystem, DB) you may encounter any number of anomalies: orphaned files (exist on the filesystem but have no entry in the DB), broken links (entry in the DB referencing a non-existent file), etc. This is because the moments at which the backup 'views' the file and the DB record referencing it are distinct in time.
As for write reordering: write-ahead log systems rely on correct write order. All DBs worth their name enforce this one way or another (via special APIs, via config requirements, etc).
I'm surprised that there aren't any tools provided by various databases to handle that use case.
Something which can abstract away the storage on-disk of large blobs and manage/maintain them over time to prevent a lot of the issues you talk about, but still give the ability for raw file access if/when it's needed.
I've given it all of 10 seconds of thought, but even something like a DB type of a file handle would be useful. Do a query, get back a handle to a file that you can treat just like you opened it yourself.
> Do a query, get back a handle to a file that you can treat just like you opened it yourself
Things are a bit more complex. For one, the trivial problem of client vs. server host. The DB cannot return a handle (a FD) from the server, because it has no meaning on the host running the app. The second problem is that any file manipulation must conform to the DB semantics for transactions, locking, rollback and recovery.
What you describe does exist: it's the FileStream feature, which dates back to 2007 if I remember correctly. I'm describing the SQL Server feature since this is what I'm familiar with. The app queries the DB for a token, using GET_FILESTREAM_TRANSACTION_CONTEXT[0], and then uses this token to get a Win32 handle for the 'file' using OpenSqlFilestream[1]. The resulting handle is valid for the usual file handle operations (read, write, seek etc).
There were great expectations for this feature, but in real life it flopped. For one, it caused all sorts of operational headaches, from the increased DB file size (increased backup size etc) to problems like having to investigate 'filestream tombstone status'[2]. But more importantly, adoption required an application rewrite (to use OpenSqlFilestream), which of course never materialized.
FileTables are a newer stab at this problem, and this one does allow exposing the DB files as a network share; apps can create and manipulate files on this share and everything is backed by the DB behind the scenes. But it turns out a lot of apps do all sorts of crazy things with files, like copy-rename-and-swap as a means of doing failure-safe saves, and many such operations are significantly more expensive in a DB context. And when the DB content is manipulated directly by apps that 'think' they interact with the filesystem, a lot of useful metadata is never collected in the DB, since the file API used never requires it (think file author, subject etc).
I figured there was something massively annoying with it.
I'm hoping that PostgreSQL can try to tackle this use-case somehow because it's so damn common and 99% of the time the solution that is used completely throws out all the guarantees that the database gives you.
I'm not sure there's a non-annoying way to bridge the problem. The semantics are just so different that there's a lot of friction that needs to be handled somewhere. (What does O_DIRECT mean in this context? What happens when someone decides to store files that are never normally closed?)
In some ways, the problem mirrors OR mappers. A lot of common cases can be handled, but there are always situations where you're confronted with the fact that relational logic just doesn't map cleanly to OO (or hierarchic storage).
Databases use journaling to avoid these issues. The difference between RDBMS' journaling and file system's journaling is the RDBMS journals application data, the file system does not (with few exceptions). Thus, a RDBMS gives you consistent application state, while the file system does not (with fewer exceptions, basically only TxF).
Another pain point is backup; it's not possible to create a consistent snapshot of (rdbms state, other state on the file system). This is avoided entirely if you only need to snapshot (rdbms state,).
> Despite all this, my practical experience is that filesystem is better than blobs for things like uploaded content, images, pngs and jpgs etc.
If you store them on a filesystem, how do you deal with redundancy / failover / scaling ? At a previous job we did a bit of experimenting with using a clustered FS but they all introduced a lot of problems. However, this was a couple of years ago so the situation may be different now.
If you store them in sqlite, how do you deal with redundancy / failover / scaling? There are other databases with better tooling for this, but they weren't part of this comparison. The world is full of tradeoffs. Scale and availability vs. microbenchmark performance is one of them. It would be very interesting to see a comparison of clustered MySQL or Postgres vs. Gluster or Ceph. I very much doubt the performance side of it would favor the database so much.
> If you store them on a filesystem, how do you deal with redundancy / failover / scaling ?
In most cases, a solution like Amazon S3 will have all three of these sorted out for you pretty well. And while it's not a full mountable filesystem, from a system development perspective it has a really good API for retrieving and storing files.
Technically there is s3fs, but it's fairly terrible and I actively recommend against it. Still, it does come handy from time to time, in a limited set of cases.
After years of development, databases and file systems borrow ideas from each other. I am not surprised that the breakeven range might be expanded. It would be interesting to re-visit the conclusions with current databases, file systems, hardware (SSD vs HDD) etc.
There are several very valid business cases to store files as blobs in a DB.
What's the problem if the DB is 150GB? It's not like the working set (which for the files will just be the metadata) will be that big for storing file blobs.
A 10GB database + 140GB of pdfs on the filesystem is not any different from a 150GB DB with everything in it.
And you have other issues (consistency, transactional issues, backups, etc).
How about a 100TB database with 95TB being pdf's? It seems there are better ways of managing lots of immutable small files (including for backups/replication) than shoving them all into a database.
>How about a 100TB database with 95TB being pdf's?
How about it?
>It seems there are better ways of managing lots of immutable small files (including for backups/replication) than shoving them all into a database.
Depends on the business case. A database gives certain guarantees you'll have to replicate (usually badly) in any other way.
Conceptually, it's absolutely cleaner.
As for the scientific or engineering standpoint, there are again no laws that dictate whether this is bad or good.
Even the performance characteristics depend on the implementation of the particular DB storage engine. They might be totally on par with storing in the filesystem (or close enough not to matter).
Not to mention that some filesystems might even have worse overhead depending on the type, size, etc. of files. In fact this very FA speaks of "SQLite small blob storage" being "35% Faster Than the Filesystem".
Plus filesystem storage and DBs are not that different in most cases -- they share algorithms for storage and indexing (log storage, B-trees, etc), and the main difference is the access layer. The FreeBSD filesystem, for one, was more like a DB storage layer than a 70s-style Unix filesystem.
Originator of the MBTiles format here. The initial intention was moving 100,000s of raster map tiles between computer and mobile device, either over USB or network. The main benefit was avoiding per-file transaction overhead, as well as the potential for checksumming. Side benefits included a small space savings over the filesystem because of block overhead, plus later iterations allowed for de-duplication of e.g. all-blue water map tiles. These days, we still use the format for transport between local machines and the Mapbox backend, despite having moved to "vector tiles", which are essentially protocol-buffer-encoded geometries.
Also, to clarify a bit on why spatial capabilities weren't part of the MBTiles spec: when rendering maps, you assemble tiles for a given zoom scale in the vertical and horizontal directions, like a checkerboard. The unique tuple defining a particular square on the map is z/x/y — zoom level, horizontal, vertical. So stuffing these into SQLite and uniquely keying across z/x/y there makes for fast retrieval at render time.
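Roughly what that lookup looks like against the MBTiles `tiles` table (a sketch; error handling trimmed):

    #include <sqlite3.h>
    #include <stdlib.h>
    #include <string.h>

    /* Fetch one tile blob by its z/x/y key into a malloc'd buffer.
     * Assumes the standard MBTiles "tiles" table, uniquely keyed on
     * (zoom_level, tile_column, tile_row). Returns NULL if not found. */
    void *fetch_tile(sqlite3 *db, int z, int x, int y, int *len) {
        sqlite3_stmt *stmt;
        void *out = NULL;
        const char *sql = "SELECT tile_data FROM tiles "
                          "WHERE zoom_level=?1 AND tile_column=?2 AND tile_row=?3";
        if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK) return NULL;
        sqlite3_bind_int(stmt, 1, z);
        sqlite3_bind_int(stmt, 2, x);
        sqlite3_bind_int(stmt, 3, y);
        if (sqlite3_step(stmt) == SQLITE_ROW) {
            *len = sqlite3_column_bytes(stmt, 0);
            out = malloc(*len);
            if (out) memcpy(out, sqlite3_column_blob(stmt, 0), *len);
        }
        sqlite3_finalize(stmt);
        return out;
    }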
Try SpatiaLite[0] if you need advanced geometry capabilities. It has R-Tree spatial indexes if you use SQLite 3.6+, and a rudimentary MBR index for prior versions. It has rudimentary support for curves as well.
As a data point, one of our users figured out how to load the SpatiaLite extension for DB Browser for SQLite (on Windows), then wrote up step by step instructions:
However, part of the concerns outlined in that article, and of more importance for this company, is ease of distribution/compatibility for customers of mapsets and mobile devices, rather than raw read/write performance.
Though the storage size claims might be relevant, and definitely a clear example that sqlite can be very workable and performant in the context of blob storage.
The article speculates that the overhead of calling `fopen` for each file and subsequently reading it is more expensive than reading from an already opened database file.
But how expensive `fopen` really is depends on many factors.
Filesystems can be built in memory (ramdisks) too, but comparing sqlite in memory to files on disk would be pretty silly. For any serious and fair comparison, in sqlite still means on disk.
For what it's worth, I think Mapbox's vector tiles can be stored in around 48GiB of RAM, which is perfectly within reason for a server. It's also the sort of data which is simple to shard. They can have multiple geolocated databases for a local area, and maybe fewer for international queries. This can be done pretty simply at the load balancer level.
This is weird benchmarketing. They are comparing reading/writing 100,000 individual files vs writing 100,000 entries into a single file(sqlite database).
For comparison one could concatenate the same data into a single big file even faster than into sqlite.
They then do not offer logical analysis as to why things are faster.
My understanding is that reads are probably faster due to operating system readahead being able to predict reads better when they are within a single file. Writes are faster because they do one bulk COMMIT instead of many individual fsyncs.
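As a rough sketch of that write-side point, with a made-up `kvblobs(name, data)` table: wrapping all the inserts in one transaction is what turns thousands of per-file fsyncs into a single commit.

    #include <sqlite3.h>
    #include <stdio.h>

    /* Insert many blobs with a single journal commit at the end, instead of
     * one sync per file as the per-file filesystem approach effectively does. */
    int store_blobs(sqlite3 *db, const void **blobs, const int *sizes, int n) {
        sqlite3_stmt *stmt;
        sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
        sqlite3_prepare_v2(db,
            "INSERT INTO kvblobs(name, data) VALUES (?1, ?2)", -1, &stmt, NULL);
        for (int i = 0; i < n; i++) {
            char name[32];
            snprintf(name, sizeof name, "blob-%d", i);
            sqlite3_bind_text(stmt, 1, name, -1, SQLITE_TRANSIENT);
            sqlite3_bind_blob(stmt, 2, blobs[i], sizes[i], SQLITE_STATIC);
            sqlite3_step(stmt);
            sqlite3_reset(stmt);
        }
        sqlite3_finalize(stmt);
        /* With default settings the journal is synced once here. */
        return sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);
    }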
> For comparison one could concatenate the same data into a single big file even faster than into sqlite.
One could, but then one would have great difficulty retrieving the individual files back when needed.
The point is not about what has the greatest raw speed, the point is that for applications that read lots of small files from the filesystem, they'll possibly get better performance and almost certainly use less disk space by sticking the files in SQLite instead.
Writing your own methods to retrieve arbitrary chunks of data from a monolithic file would be a lot of work. Using map tiles as the example, how do you easily retrieve just the tiles in a specific region? Ok, how about all tiles that have the “hasLand” attribute set? Or the “containsCoastline” attribute?
You’d end up rewriting your own version of a database.
> You’d end up rewriting your own version of a database.
This is the hell we just crawled out of in one of our codebases. It's not so much the access for a single process, but when you want to share the data in the file, that's when it seriously starts to get messy, breaking through otherwise clean abstraction layers and/or forcing recompiles on compatibility-breaking changes.
Straight array of structs. The complication was that there were several of them, they were stored in bus-attached MRAM which was accessed directly (e.g. mmap of /dev/ram) by several independent processes.
The idea came from previous generations of the product which generally ran on microcontrollers coupled with custom FPGA designs, so, not entirely unwarranted.. but completely incorrect for a newer product with much greater computing power than their previous devices.
> Oh goodness have we come a long way if interacting with a file is harder than interacting with a SQL database.
It's not interacting with a file, it's interacting with a file that contains other files of varying sizes that need to be accessed randomly with good performance.
If you only want to read then both zip and tar are good for this purpose. It's commonly used and known to work well. There are many formats that are really just a renamed zip file. Like jar, and epub. File entries don't have to be compressed so you can mmap.
If you want to do random writes then more consideration is needed.
> If you only want to read then both zip and tar are good for this purpose
And programmatically accessing them is significantly more complicated than using SQLite which involves dropping a single header in your project and about 10 lines of code.
> If you want to do random writes then more consideration is needed.
Which is almost certainly what you want to do for the sorts of use cases where SQLite is also under consideration.
As far as I know, querying a database has always been easier than querying a custom file format. It's the whole reason databases exist in the first place. And using one off the shelf is clearly easier than writing your own.
Sure, it's completely trivial to read a file containing an array of C structs of metadata and offsets into a blob of concatenated PNGs: just mmap the thing, and your main data structure is sitting right there in memory.
But problems start arising when you need to update the structure of that metadata. Add some more fields. Some more structured, multi-value fields, referring to multiple entities in your increasingly complex data structure. Entities which must be guaranteed to exist. And perform modifications concurrently, from multiple processes. And have the file not get corrupted when something kills a writer process in the middle of writing...
Databases are slow and overkill for the simple case, but they certainly make adding unexpected requested features much easier. And it can be expected that you will get unexpected feature requests.
Nothing wrong with comparing apples and oranges. They have a lot more in common than people give them credit for (both are fruit, and they share shape, flavour, nutrient content, where they grow, who eats them, etc). You can even compare things that aren't similar at all and it's not necessarily pointless.
The third paragraph might offer the logical analysis you're looking for:
> The performance difference arises (we believe) because when working from an SQLite database, the open() and close() system calls are invoked only once, whereas open() and close() are invoked once for each blob when using blobs stored in individual files. It appears that the overhead of calling open() and close() is greater than the overhead of using the database. The size reduction arises from the fact that individual files are padded out to the next multiple of the filesystem block size, whereas the blobs are packed more tightly into an SQLite database.
A big part of it is because they're repeatedly walking the same path while reading and writing it. I got a 20% performance improvement on FreeBSD+ZFS with this change, which merely opens an fd, and uses that to stat and write, cutting the number of path traversals in half.
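The actual patch isn't reproduced here, but the idea is roughly this shape (a sketch, not the real change):

    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Before: every operation takes a path, so the kernel re-walks the
     * directory tree for stat() and again for open().
     * After: resolve the path once, then do everything through the fd. */
    void write_blob(const char *path, const void *buf, size_t len) {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return;
        struct stat st;
        fstat(fd, &st);          /* was: stat(path, &st) -- a second path walk */
        write(fd, buf, len);
        close(fd);
    }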
Which is why before making such broad statements they should check on the other platforms, which have much less broken file systems.
(Check Ext4, ZFS, UFS2, HFS+ and APFS at least.)
What they did was check an untuned Ext4, UBIFS, HFS+ and NTFS vs a tuned database in a single-threaded scenario. Especially misleading in the mmap case, where synchronization would just kill the performance dead.
I find that comparison justified: How many people are even aware of the fact that filesystems can be tuned (except for, maybe, noatime and discard for SSDs)?
No, no, if it is faster than the filesystem it must mean that it has a faster storage medium. Maybe it is writing to memory, or the aether. /s Now, seriously, how can the foot be faster than the shoe?
> So let your take-away be this: read/write latency for SQLite is competitive with read/write latency of individual files on disk. Often SQLite is faster. Sometimes SQLite is almost as fast. Either way, this article disproves the common assumption that a relational database must be slower than direct filesystem I/O.
This is under the 1.1 Caveats heading, which makes me feel the title is a little misleading (but only in the way most benchmark headings probably are, I guess).
Incidentally, can someone more experienced with filesystem and database I/O confirm or contest the assertion here? Specifically, I'm not sure it's fair to generalize these results (even if valid) to categorical relational databases. But this is not a special area of expertise for me.
You're asking whether a database can read/write records faster than disk? Yes. The opposite is also true. I'm not an expert in either fs or db, but it's obvious that the design and use of one is not the same as the design and use of the other, and given the right circumstances one will do a particular job better than another.
An append-only transaction log is probably going to perform better on a spinning disk than random writes. An intelligent app sorting and committing writes in order is probably going to perform better than an fs on a disk with no queue ordering facility. As they said in the post, block-aligned writes in bulk are going to be more efficient than spreading out a bunch of writes not block-aligned. And latency for a db write is going to be lower if you don't close an fd, assuming that closing an fd would always trigger an fsync.
But a properly tuned filesystem, with the right filesystem, with the right kernel driver, with the right disk, with the right bus, with the right control plane, etc etc may work fantastically better than an unoptimized database, and a given workload may fit that fs perfectly.
It's important to remember that a benchmark is only meaningful to the person who is running the benchmark. If you want it to be meaningful to you, you have to try it yourself (and then presumably blog about it).
For small files, I could see this being true, but not in a way that would scale to large files.
For example, instead of a database, you could make a folder (instead of a table) and then a bunch of 4 byte files with ints in them. Doing a table scan would be like cat *.
Obviously this uses a lot more open/close/read calls, with more seeks and iops, because those files may or may not be clustered together on disk. If you packed it all into the same file, it would go faster because of the locality and less system calls. You can also be smarter about caching and other things that the file system does well, but probably isn't optimized for the small file use case.
From OP: "SQLite is much faster than direct writes to disk on Windows when anti-virus protection is turned on. Since anti-virus software is and should be on by default in Windows, that means that SQLite is generally much faster than direct disk writes on Windows."
I don't get this. If scanning the content is important (as acknowledged by the author), then bypassing the scan via blob storage is a security issue and the application should go through some extra hoops to scan the content before saving it to blob, and this should be measured and part of the comparison.
Also, if the SQLite files are exempt from AV scanning, then to level the playing field the test should also exempt the uploaded-files folder. I mean, knowing the dice are loaded and then claiming it as an advantage does not seem professional.
Data chunks inside another file are effectively quarantined by default. There is no exempting or security issue, it's safe by design. It will get scanned occasionally but not on every write.
Except when they are not. Just because malcode is encapsulated doesn't mean it is neutralised. The issue is untrusted data, not how it is stored.
Word docs with embedded content that can trigger code execution on parsing, data structures that break AV parsers, image or font data that causes system libraries to overflow/UAF/etc.: these all could potentially be a security issue in a DB, the same as in a FS.
You don't need to worry about a single process writing data to a file and then reading it back and performing malicious actions. That process could just as easily have been exploited entirely in-memory. Scan the database when it exits and you're fine.
The article may be unclear on this point: anti-virus was turned off for these tests. What that paragraph is pointing out is that performance tanks badly (roughly a factor of 10x) when you turn on Windows Defender, and presumably even worse with other anti-virus software.
Credential: I was one of several people who vetted the article before it was published. I argued that Defender should be enabled, since it is a platform default. D. Richard Hipp chose to take the charitable path and not report those results alongside the others.
The test is easily replicated yourself. The source code for it is part of the SQLite source tree, and command lines for building and running the test are given in the article.
In any case, I really take issue with that approach. Nearly every windows database vendor, including Microsoft, recommends or requires that you disable most types of scanning in databases. Not only is it ineffective and performance impacting, it potentially undermines integrity of the data.
My contention was that this article is about least-effort methods (i.e. fopen() vs default SQLite) so a lot of Windows SQLite users are going to have Defender (or heaven help them, something much worse) installed.
Once you start getting into optimizations, much of this article starts to sway back and forth. Put enough effort into optimization, and you can outdo SQLite for a specific application, simply because SQLite isn't magic, it's just C. Well-written and highly-optimized C, but purpose-written code always has the potential to do better than general-purpose code.
The point of SQLite in general and this report in specific is that you get a lot of performance, power, and safety for free with SQLite as opposed to `fopen()` and such. Good performance while antimalware interferes is just one way this manifests.
As I understand it, drh simply disabled Windows Defender entirely for these tests.
SQLite gets an additional advantage with Defender enabled because Defender doesn't try to scan inside the SQLite DB file for the sub-files, whereas with separate files, Defender must at least look at each file's header to determine whether it's a potentially dangerous file type.
While I was vetting this article, I did some of my own tests, and I found that simply excluding the directory containing the test tree was sufficient to get the same benefit.
My point in other posts here is that you don't need to disable Defender to get this benefit when you use SQLite.
But let's be clear: the 35% claimed performance benefit has nothing to do with Windows Defender. That's a wholly separate issue. Defender pays more attention to a pile of individual files than to an amalgamation of that same data into a single file, and this magnifies SQLite's advantages, but it is not the source of the advantage the linked report talks about.
Interesting. We built a couple of systems many years ago using SQLite as the storage engine for binary files - one was a kind of mini version control utility for storing report files for a legacy ERP system.
I always thought that the SQLite system was far quicker to search and retrieve files from the dataset than it was doing so in our previous version, using hierarchical file folders.
Interestingly, we canned the project because the ERP vendor themselves released a similar tool using - wait for it - individual files in a folder system... (Addendum - much later on they switched to - wait for it - CVS for version control.... in 2010!!)
For writing, the major drawback of SQLite is it doesn’t support concurrent writes.
All filesystems do (at least when writing different files), all full-fledged RDBMS-es do, even some embedded databases do (like ESENT). Yet, in SQLite only a single thread can write.
Isn't that because SQLite's whole use case is "simple, small, fast database that competes with fopen() and not Postgres"? At least in my experience on NTFS, unless you know the section that 2 threads are writing to, and that section is different, you can't write to a file concurrently (N.B.: in C#)
I'd be happy to be wrong (seriously, that would make my day if someone could explain a clean way to do the above) but if I want to get "big boy database features" then I'd use a heavy database like SQL Server. I think it's great that there's an option like SQLite out there for rapid prototyping and "store all your music metadata" type applications!
> "simple, small, fast database that competes with fopen() and not Postgres"
The hardware evolved in a way so even very small systems (phones, embedded, Raspberry Pi and other IoT devices) are now multi core. Multithreading is required to benefit from a multi-core CPU. For IO-heavy tasks, it’s sometimes a good idea to implement multi-threaded IO as well.
> you can't write to a file concurrently
I can concurrently write different files. Or to different streams/blobs/records for most other IO methods, except SQLite.
Storing many small files in a few (compressed) bundle files has been common since at least the 90s for storing game assets, though usually not with SQLite but with some sort of compressed format (simple zip archives are quite common, but there are faster alternatives now). The bundling gives you better overall compression, and faster read performance compared to individual files (at least on Windows) as long as you open the file only once, and from then on only seek and read.
SQLite is a big code base; using a simple archive file format instead gives you most of the advantages, but without the bloat.
PS: This is mostly for read-only situations though. Using SQLite probably starts to make sense when the applications needs to write to and create new files in the archive.
But that would work only for static content, like the game assets. The discussion of BLOBs vs. filesystem comes up mostly in the context of content management and user/app uploaded content.
I was going to make a semi-sarcastic comment along the same lines, but some other posters rightly point out the benefit of using an RDBMS when the file schema is changing.
Not surprised by the results, very surprised by the explanation given, the overhead of open/close calls.
As far as I know (from many measurements and talking to kernel people and researching the mechanism involved), the difference is due to the fact that buffers are shared between all pages of a single file, and not between files.
So for reads, the filesystem will do read-ahead of significantly more data than requested and keep that in buffers, and future reads will profit. Similar for buffers shared when writing.
The same effect will be reproducible with any format storing multiple objects in a single file, it has virtually nothing to do with "SQLite" or "Databases".
One tradeoff is the greater potential for inconsistencies, which despite all the measures taken is much greater when you modify a file rather than writing only completely new files. Another is the inconvenience and duplication of effort, because all your default file-system management tools aren't available.
It would be interesting to see if a pseudo-filesystem that is mapped to a single underlying file would show the same effects (preferably implemented in user space to avoid overheads of multiple kernel round trips).
> The same effect will be reproducible with any format storing multiple objects in a single file
Given unbounded developer time, any advantage SQLite has here can of course be matched or beaten by custom code. It's just another C library, not magic.
The real issue is, how much work will it take you to do that?
SQLite is billed as competing with `fopen()`. That's the proper way to compare it: given equal development time spent talking to the C runtime library vs. integrating SQLite, how much speed and robustness can you achieve?
> the greater potential for inconsistencies
How? In this application, SQLite is filling a role more like a filesystem than a DBMS or `fopen()` call. You can corrupt a filesystem just as easily as a DBMS. SQLite protects itself in much the same ways that good filesystems do, and it can be defeated in much the same sort of ways.
> Another is the inconvenience and duplication of effort, because all your default file-system management tools aren't available.
It's a classic tradeoff: do you need the speed this technique buys or not?
A better reason to avoid this technique is when the files you're considering storing as BLOBs need to be served by a web server that uses the `sendfile(2)` system call. There, the additional syscalls caused by the DBMS layer will probably eat up the speed advantage.
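For reference, a rough sketch of the path a static-file server gets to take on Linux (the socket fd is assumed to come from an accepted connection):

    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Copy a file straight from the page cache to the socket without
     * bouncing the bytes through user space -- something a blob read
     * through a DBMS layer cannot take advantage of. */
    void serve_file(int client_fd, const char *path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return;
        struct stat st;
        fstat(fd, &st);
        off_t offset = 0;
        sendfile(client_fd, fd, &offset, st.st_size);  /* out_fd, in_fd, offset, count */
        close(fd);
    }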
Software development is all about tradeoffs. No technique or technology is perfect for everything.
Now you know one more technique. Maybe it will be of some use to you someday.
> a pseudo-filesystem that is mapped to a single underlying file
That pretty much describes SQLite. Both SQLite and a good filesystem use tree-based structures to index data, both have ways to deal with fragmentation, both have ways to ensure consistency, both strive for durability, etc., etc.
> preferably implemented in user space to avoid overheads of multiple kernel round trips
How are you going to avoid kernel round trips when I/O is involved?
If you think user-space filesystems are fast, go try FUSE.
that thread began: "Now we just need a redesign of Unix tools and principles/practice"
This is perhaps one advantage of a SQLite fs: the flexibility.
You could design concepts into the db structure, and have the driver translate this into the usual language of folders/files/read-write.
But as well as requiring new fsck tools etc, you might also need new ways to interact with the fs as it would have new capabilities, e.g. storing data non-hierarchically.
If the OS can be pushed forward wrt some feature, I care less about whether it would perform.
SQLite is much faster than direct writes to disk on Windows when anti-virus protection is turned on. Since anti-virus software is and should be on by default in Windows, that means that SQLite is generally much faster than direct disk writes on Windows.
Isn't this because writing to a SQLite database file bypasses most (or all) of the antivirus file scanning, since it can't see a complete file and can only look at raw blocks of data?
So if you find value in having your antivirus scan all of your files, that's a disadvantage for using SQLite to store them?
The reason I still use files rather then SQLite is that I don't know how SQLite handles concurrency.
For example I have a PHP app that is used by 10k users a day and it happily handles 100k tmp files in a single directory. On each request, it checks the file age via filemtime() and if new enough, includes the tmp file with a simple include(). (I write PHP arrays into the tmp files). If too old, it recalculates the data and writes it via fopen(), fputs() and fclose().
This might be archaic but it has been working fine for years and never gave me any problems.
Somehow I would expect that if I simply replaced it with SQLite, I would run into concurrency problems.
> Somehow I would expect that if I simply replaced it with SQLite, I would run into concurrency problems.
SQLite wouldn't really be ACID-compliant if multiple threads were enough to defeat the Durability guarantee, would it?
That's not to say that there are no concurrency problems in SQLite, but that they're more in the way of potential bottlenecks than data corruption risks. For a reader-heavy application like yours, I suspect SQLite will perform just as well or better than your existing solution.
If you're solely after speed, I'm not sure it would be worth rewriting your app to use SQLite. This technique's value is simply in the benefit it gives when you were already going to use SQLite for some other reason. If you just want a 35% speed boost, wait a few months or buy a faster SSD. Both are going to be easier and cheaper than rewriting the data storage layer of your application.
That said, maybe there are other things in SQLite that you could use. Easy schema changes, full-text searching, more advanced indexing than the filesystem allows, etc. If you go for one of those, then the extra speed is a nice bonus.
> SQLite wouldn't really be ACID-compliant if multiple threads were enough to defeat the Durability guarantee, would it?
I'm not so much concerned about durability. More about SQLite not responding to "SELECT v FROM t WHERE id=123" with value v but instead with something like "Error: v is currently being written by another process. Try again later" or something.
No idea if that is a realistic scenario. I'm kind of surprised I don't have this kind of problem with my filebased solution. What happens if process A reads from a file while process B writes it? No clue.
> What happens if process A reads from a file while process B writes it?
If process B started first, the writer blocks access to the table being written to, so the reader waits for the writer to complete. There are timeout and retry behaviors, but within those configurable limits, that's what happens.
You can make SQLite behave as you worry about if you set the retries to 0 and timeout to 0, but that's not the default.
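Concretely, that policy is a single call (the 5000 ms value here is just an example):

    #include <sqlite3.h>

    /* Tell SQLite to retry for up to 5 seconds when another connection holds
     * the write lock, instead of returning SQLITE_BUSY immediately. */
    int open_with_timeout(const char *path, sqlite3 **db) {
        int rc = sqlite3_open(path, db);
        if (rc != SQLITE_OK) return rc;
        sqlite3_busy_timeout(*db, 5000);   /* milliseconds */
        return SQLITE_OK;
    }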
I know that SQLite performs exceptionally on embedded devices including most phones, IoT devices, and more. However, for better or worse, it's a flat file. Does anyone know of a TCP/IP-speaking SQL database that would work well on an embedded device? PostgreSQL/MariaDB seem kinda heavy, and the net couplers for SQLite look pretty unsupported.
They were not technically correct, they were just correct. A flat file database is a file with no indexing structure, usually plaintext, usually one record per line. SQLite uses a binary format consisting of an index of pages and a rollback journal or log file. They are significantly different.
Also, yes, there are networked databases with sqlite database drivers.
MySQL/MariaDB can be tuned to run very well on low resource devices. Remember that they were originally developed back in the days of 486 CPUs. A couple of years ago I had MySQL backing a commercial product on a 512M ARM7 5400RPM device (we originally developed it with 256M of memory). Performance was not an issue for our use case even with on the order of a million rows in the DB.
and what exactly do you think any database is at the end of the day? It's not magic, it all ends up on the disk as a file. I mean why is a page file called a FILE?
Sure, we can agree that (almost) everything's a file. However, through their related programs and daemons, we can ask files for more useful functionality such as indexes, versioning, and yes, socket communications.
The source code to the test program is part of the SQLite source tree. It should be fairly easy to modify it to test the scheme you have in mind.
> How well does SQLite deal with fragmentation?
SQLite operates a lot like a modern filesystem: tree-based indexing structures, writes go to empty spaces, deletes leave holes that may later be filled by inserts, etc.
Are you stating that the SQLite database did not write the files to disk? Or simply that it's caching them in-memory? In the latter case, doesn't an operating system typically offer in-memory file buffering / caching? Otherwise, why would we need O_DIRECT or O_SYNC or fsync()?
I'm no expert in FS or sqlite, but as everyone knows, filesystems are not good at processing small files, so having an application that bundles small blobs in memory and then flushes them to disk as a single file is what makes it faster.
ReiserFS 4 was (I believe still is) the only FS that improves small-file performance by bundling small files and storing them on disk as a single blob.
as for that person who downvoted my original comment, I'd love to hear from you why you think I'm wrong. you don't just downvote comments because you disagree.
I do think your original claim is wrong. Your explanation in this response is not the same as your original claim. And what you wrote here is exactly in line with claims in the article:
"The size of the blobs in the test data affects performance. The filesystem will generally be faster for larger blobs, since the overhead of open() and close() is amortized over more bytes of I/O, whereas the database will be more efficient in both speed and space as the average blob size decreases."
Yes, of course SQLite is going to do some operations in-memory. Raw file writes using fwrite are also going to be initially written in-memory, unless O_DIRECT or some other mechanism is involved. And the article explicitly outlines that they made no effort to bypass file buffering, to the point of not even explicitly flushing to disk.
If both processes are writing the BLOBs to disk, then I don't see how your original claim applies. Writing to disk in a manner that is more efficient for small files is still writing to disk, and so the test is not in-memory vs on-disk as your original claim.
If I asserted: "It's faster to put 10,000 rows of csv data in single file instead of 10,000 individual files" even the most junior programmer would likely say "Well, duh, it's 1 file instead of 10,000".
Yet this benchmark is at the top of HN for some reason.
"Measure before optimizing" has limited application. There are many things that need not be measured. Appending to a file is definitely not slower than doing memory management inside the file to allocate a new chunk, seeking to that position and then writing out the chunk. Period. (And the serialization overhead is negligible compared to the disk I/O).
It's not only "writing" though. It's "writing" AND "reading".
Can you read the correct line in a CSV faster? Can you find a row matching specific criteria as fast as SQLite? Can you handle concurrency correctly like SQLite? Can you make sure your CSV is always in a consistent state like SQLite?
The point of the article is that SQLite is still fast even with all the benefit of database.
The idea of SQLite is that you could default to using it, and move away when it hits its limit. Nobody should default to using CSV, you use CSV when you have to, despite all its limitation.
I'm not attempting to compare the usefulness of CSV to a RDBMS. I'm simply pointing out it should be inherently obvious that writing to 1 file is faster than writing to N files, which is the only thing the benchmark is in question is concerned with as well.
Opening a file means locating it physically on the disk by following references (usually at least one per directory level). This is a costly operation.
Compare to the typical case of adding another row by appending to an already allocated disk block. No seek operation needed.
This analysis is incorrect. SQLite must allocate database pages for rows just like the kernel must allocate filesystem blocks for files. Reading a row in SQLite means traversing a B-tree to find the row. Reading a file from the filesystem means traversing a B-tree to find the file metadata. Very similar. The kernel doesn't need to "seek" any more than SQLite does, the article says that all the hard drives are SSDs which don't have heads so they don't seek.
The major differences here are due to the different levels of isolation and authentication that the kernel and SQLite provide. Another difference is the fact that when you find a row in SQLite by primary key, you don't have to do a second lookup to get the data from the metadata. This is a good optimization to use when storing many small pieces of data, but a bad optimization for large pieces of data.
This was just about opening a file and appending a few bytes vs appending to an open file. Obviously the latter can't be slower (and is typically faster).
Because SQLite isn't meant as a general purpose filesystem. It can do some things better (like, apparently, small blob storage), but not everything.
For example, SQLite vacuuming to free up deleted data can be slow; on large SQLite databases, we've found it much faster to rewrite the entire file than to vacuum it.
It also has some scalability limits, it uses locking to limit to a single concurrent writer (short duration locks), which only scales up to a point.
There are a lot of filesystem characteristics that differ from a database file format. SQLite could probably be more filesystem-like, but then it would diverge from being a fast and lightweight database format.
The parent comment is a bit misleading, or completely wrong. SQLite also has to allocate blocks in the database for anything you store. Some of the structures and techniques that SQLite uses for doing this are very similar to the way a filesystem does it.
Instead, think about it this way. With a filesystem, the database is managed by the kernel. Every time you want to read a file, you might do four system calls: open, fstat, read, close. Or you might do mmap instead of read, but you're still doing four context switches, at least in typical cases. Switching to the kernel and back has cost. Normally this cost is small, but if you make a lot of system calls you'll notice the costs piling up. The kernel also has to check permissions to make sure that you have permission to read the file.
With SQLite, the database is inside your application. When you read a row, there's a chance that the row is already in your application's memory. This means no context switches back and forth between application and kernel.
Additionally, when you read a row, the entire database page is read into memory, which includes other rows too. The kernel won't do anything like that with your application--it won't give you img1.png and img2.png if you just ask for img1.png. Maybe they'll both be in the kernel's page cache, but you still have to open and read the file.
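For comparison, the per-file pattern sketched above looks roughly like this:

    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* The four kernel round trips described above, per file read. With SQLite,
     * the equivalent lookup may be satisfied from the library's own page cache
     * inside the application, with no system call at all. */
    void *slurp(const char *path, ssize_t *len) {
        int fd = open(path, O_RDONLY);       /* 1: open  */
        if (fd < 0) return NULL;
        struct stat st;
        fstat(fd, &st);                      /* 2: fstat */
        void *buf = malloc(st.st_size);
        *len = read(fd, buf, st.st_size);    /* 3: read  */
        close(fd);                           /* 4: close */
        return buf;
    }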
Microsoft tried doing this and there are file systems that have tried or are trying. It just opens up multiple cans of worms but it's a pretty interesting idea.
Now I'm confused myself: Parent's context was: Why appending to a CSV is faster than opening a file and appending to that. There was no word about a database.
This page does not live up to the standards of the other high quality technical articles on sqlite.org. I couldn't even find a reference to the file system used for comparison. But the whole setup is a farce anyway.
They tested it on 5 different OSes, which presumably included 3, probably 4 different filesystems (Windows, Mac, and Ubuntu would be using different ones, and I'm guessing the Ubuntu system has a different filesystem than Android too, though I don't actually know), so it doesn't really seem like it matters.
Of course it matters. Different file systems have different performance characteristics. Make sqlite into a file system, and I'm pretty sure you can't beat it by putting another sqlite on top.
When the author has tested 3 or 4 different filesystems against SQLite (and at least 2 of them you can be pretty sure as to what they are simply based on the OS, with a good guess at the other 2 as well), and SQLite beats them all, then it really doesn't matter.
As for turning SQLite into a filesystem, that's not really going to work. SQL isn't designed to support things like cheap "file" appends or reading only portions of a value or seeking or anything like that, so you'd end up having to read the entire value for any read, and write a new copy of the entire value for any write, and your performance would be really really bad. So yeah, putting SQLite inside of SQLite isn't going to work, because SQLite isn't a filesystem and wasn't designed to behave like one. Not to mention this entire article is about small blob storage, and embedding a SQLite database inside of SQLite isn't a small blob.
As for benchmarking multiple file systems with consistent result, point taken. To a degree. Some file systems optimize for directory lookups, and some don't.
As for turning SQLite into a filesystem, we could make it into a file system that would be fast in this particular benchmark. Right?
I mean, is SQLite not also using the filesystem? Maybe I'm confused here. Seems like saying using C is 15% faster than ASM. But that includes user error, right?
The article is basically entirely an answer to that question.
The third sentence in particular is:
The performance difference arises (we believe) because when working from an SQLite database, the open() and close() system calls are invoked only once, whereas open() and close() are invoked once for each blob when using blobs stored in individual files. It appears that the overhead of calling open() and close() is greater than the overhead of using the database. The size reduction arises from the fact that individual files are padded out to the next multiple of the filesystem block size, whereas the blobs are packed more tightly into an SQLite database.
The advice to use a single open/close call instead of multiple is great, but that is advice on how to effectively use the filesystem. It isn't in anyway, shape, or form beating the filesystem.
You can't use a single open/close call to write multiple BLOBs, unless you are concatenating them into one file. That's not typically what happens nor what people are advocating when they recommend not using database BLOBs and using the filesystem instead.
During my first week at American Express I did not have the credentials to install any applications. My manager wanted me to build a prototype and I couldn't install a database. Firebase and other third-party options were impossible because I had to work on an internal server. I thought SQLite would save my day, but I remembered it being really hard to install and set up. So I had to write my own DB that I called 'stupid-db'. It literally just reads/writes data into a file: https://github.com/songz/stupid-db
SQLite is a library, not a database server. There's no install step - it's a single .C file (and a header, if you're into that). Most of the major scripting languages have rock-solid bindings that don't require any additional system libraries or software installs.
[0] https://www.microsoft.com/en-us/research/publication/to-blob...