I must point out Jim Gray's paper To Blob or Not To Blob[0]. His team considered NTFS vs. SQL Server, but most of the rationale applies to any filesystem vs. database decision.
The summary was "The study indicates that if objects are larger than one megabyte on average, NTFS has a clear advantage over SQL Server. If the objects are under 256 kilobytes, the database has a clear advantage. Inside this range, it depends on how write intensive the workload is," but keep in mind this is spinning media from 2006. Modern SSDs change the equation quite a bit, as they are much more friendly to random IO and benefit less from database write-ahead log and buffer pool behavior.
Also, when deciding between blob vs. filesystem, blobs bring transactional and recovery consistency. The DB is self contained, and all blobs are contained in it. A restore of the DB on a different system yields a consistent system: it won't have links to missing files, and there won't be orphaned files left over (files not referenced by records in the DB).
Despite all this, my practical experience is that filesystem is better than blobs for things like uploaded content, images, pngs and jpgs etc. Blobs bring additional overhead, require bigger DB storage (usually more expensive, think AWS RDS), and the increased size cascades into operational overhead (bigger backups, slower restores etc).
My experience (for an application which had a working set of under 1 GB of files in the 50kb to N MB range, and approximately 50 GB persisted at any given time) was that preserving access to the toolchain which operates trivially with files was worth the additional performance overhead of working with the files and, separately, occasionally having to retool things to e.g. not have 10e7 files in a single folder, which is something that Linux has some opinions (none of them good) about.
Trivial example: it's easy to delete any file most recently accessed more than N months ago [+] with a trivial line in cron (the exact line escapes me -- it involves find), but doing that with a database requires that you roll your own access tracking logic. Incremental backups of a directory structure are easy (rsync or tarsnap); incremental backups of a database appeared to me to be highly non-trivial.
[+] Since we could re-generate PDFs or gifs from our source of truth at will (with 5~10 seconds of added latency), we deleted anything not accessed in N months to limit our hard disk usage.
> e.g. not have 10e7 files in a single folder, which is something that Linux has some opinions (none of them good) about.
I keep running into this as a common anti-pattern in software, even with software developers who should know better (having fallen afoul of it before).
I ran into it with some semi-bespoke software once, used by a remarkable number of state departments across the country. It was one of many, many things wrong with that godawful thing. Worse, not only were they dumping files in a single directory, almost any time they went to do something with the directory they'd request a directory listing too. Like many of the bizarre things the app did, it never actually used the information it acquired that way.
I never understood why it matters whether files are in a single directory or split over multiple directories. At the end of the day it is the same file system with the same number of files. The directories are just a logical structure. Why would there be a difference?
If you want raw file access, your ideal filesystem is a giant key-value store that keys on the full path to the file. This choice means doing a directory listing will involve a lot of random access reads across the disk, and in the days before SSDs this would be a performance killer.
So instead, a directory would have a little section of disk to write its file entries to. These would be fairly small, because most directories are small, and if there were more files added than there was room, another block would be allocated. Doing a file listing would be reasonably quick as you're just iterating over a block and following the link to the next block, but random file access would get slower the more files are added.
It's probably a solvable problem in filesystem design, but because current filesystems fall down hard on the million-files-in-a-directory use case, nobody actually does that, so there's no drive to make a filesystem performant in that case.
Ok but listing 100 directories with 10,000 files each should take the same time as listing one directory with 1,000,000 files (what you are describing is a filesystem with more files vs. fewer files, not one with more subdirectories vs. fewer subdirectories).
Well, my Linux-fu isn't very up-to-date, but my guess (which actually goes for any file system) is that a directory is a bit more than just a logical organization. It's an actual "thing" on your hard drive which keeps track of everything beneath it. And so it grows in size when more files are there, which slows down access.
Just don't let anyone mount with noatime for performance's sake. I found that little gem on a drive where paring down to recently accessed content would have been very helpful.
> Despite all this, my practical experience is that filesystem is better than blobs for things like uploaded content, images, pngs and jpgs etc.
This is especially true if you want to deliver them back to the users in the context of a web application. If you place this kind of content in a database, you'll need to serve them with your application.
If you use files for this content, you get two interesting options. For one, you can use any stock web server like nginx to serve these files - and nginx will outperform your application in this context. On top of that, it's easy to push this content onto a CDN in order to further cut the latency to the user.
Completely shameless plug here: My webserver[1] will outperform nginx for serving up static content like this.
It has many optimizations, like not opening the file it's serving (which reduces the latency of walking/interacting with the VFS layer); almost no runtime allocations on the heap (minimal stack frames and sizes as well), which can be completely disabled at the expense of logging; high performance logging which does not interfere with processing requests; and the normal stuff in a fast webserver (sendfile, keep-alive, minimal code paths between reading the user request and delivering the first byte).
You can check its health by making a request to a known resource -- they're practically free.
It delivers an ETag header that it computes the first time it opens a file and maintains as long as that file is open. If it has to re-open the file it will generate a new ETag value.
In this case nginx is the 'application'. The request is still going to be expressed as a SQL query, sent to PG, parsed, compiled, optimized, executed, and then the tabular response formatted as the HTTP response. Many more steps compared to a file-on-disk response.
Ah, but the filesystem has to do all that as well! It must receive a path, parse it, and then execute the query, with possible optimizations (e.g. ext4 even has indexes implemented with hashed B-trees).
Does your database transparently DMA the buffer cache to the NIC after loading the file in the background using read-ahead without any process context switching or upcalls to userland?
Loosely. Today's file systems aren't transactional (i.e. ACID compliant), which is a basic property that most people consider necessary for a database.
Many filesystems are logged and provide atomic operations. You can build multi-step transactions on top of them with various operations. And with CoW filesystems you can even get read consistency. Oh, and optimistic locking also is an option.
Are we talking about the POSIX interface to files or is there a new transactional API that allows for atomic writes (all pages are written to disk or none are), consistency, isolation (opening a file in another process which is currently being written won't show the pages that are being written), and makes the writes always durable (sync; not fsync)?
Multi-file transactions can be built by moving whole directories over symlinks to the previous version. The linux-specific RENAME_EXCHANGE flag can simplify this.
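A minimal sketch of the symlink-swap idea, with made-up directory names (error handling trimmed):

    #include <stdio.h>    /* rename() */
    #include <unistd.h>   /* symlink(), unlink() */

    /* Publish a new version directory atomically: point a fresh symlink at it,
     * then rename that symlink over the "current" name. rename() replaces its
     * target atomically, so readers see either the old tree or the new tree. */
    int publish(const char *new_version_dir) {
        unlink("current.tmp");                       /* ok if it doesn't exist */
        if (symlink(new_version_dir, "current.tmp") == -1)
            return -1;
        if (rename("current.tmp", "current") == -1)
            return -1;
        /* On Linux with glibc >= 2.28, renameat2(AT_FDCWD, "staging", AT_FDCWD,
         * "current", RENAME_EXCHANGE) can instead swap the two names in one
         * step, keeping the old version reachable for rollback. */
        return 0;
    }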
> or is there a new transactional API
CoW filesystems give you snapshots and reflink copies via ioctls. Under heavy concurrency this can provide cheaper isolation than locking.
> and makes the writes always durable (sync; not fsync)?
For durability fsync is sufficient if you do it on the file and the directory.
To combine atomic and durable you can do the write, fsync, rename, fsync dir dance.
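For concreteness, a minimal sketch of that dance (error handling abbreviated, paths made up):

    #include <fcntl.h>
    #include <stdio.h>    /* rename() */
    #include <unistd.h>

    /* Durably replace "data/blob": write a temp file, fsync it, rename it over
     * the target (the atomic step), then fsync the directory so the rename
     * itself survives a crash. */
    int atomic_replace(const void *buf, size_t len) {
        int fd = open("data/blob.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        close(fd);

        if (rename("data/blob.tmp", "data/blob") != 0) return -1;  /* atomic */

        int dfd = open("data", O_RDONLY);
        if (dfd < 0) return -1;
        int rc = fsync(dfd);                          /* make the rename durable */
        close(dfd);
        return rc;
    }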
>For durability fsync is sufficient if you do it on the file and the directory. To combine atomic and durable you can do the write, fsync, rename, fsync dir dance.
Right. This allows for the correct behaviour but I wouldn't conclude that POSIX is a transactional API. I would regard this as being able to build a transactional API on top of it; but not transactional in itself. I mean, if you had to do the same dance in SQL (insert to dummy pkey and then update the entry) you would rightly dismiss it.
> Also, when deciding between blob vs. filesystem, blobs bring transactional and recovery consistency.
Interesting. I would have thought the other way around, at least for crash-resistance: the I/O stack (including disk hardware) has a tendency to reorder writes and so updates that live within a single file will corrupt fairly easily. Separate files not so much. I vaguely remember a paper on that (Usenix?) and sqlite generally did very well except for that point.
You have two consistency issues with storing the files in the filesystem:
- rollbacks in the DB can lead to orphaned files on disk. One can try to add logic in the app (e.g. a catch block that removes the file if the DB rolled back) but that is not gonna help on a crash
- it is impossible to obtain a consistent backup of both the DB and the filesystem. You can back up the filesystem and the DB, but the two will not be consistent with each other unless you froze the app during the backup. When you restore the two backups (filesystem, DB) you may encounter any number of anomalies: orphaned files (exist on the filesystem but have no entry in the DB), broken links (entry in the DB referencing a non-existent file), etc. This is because the moments at which the backup 'views' the file and the DB record referencing it are distinct in time.
As for write reordering: write-ahead log systems rely on correct write order. All DBs worth their name enforce this one way or another (via special APIs, via config requirements, etc).
I'm surprised that there aren't any tools provided by various databases to handle that use case.
Something which can abstract away the storage on-disk of large blobs and manage/maintain them over time to prevent a lot of the issues you talk about, but still give the ability for raw file access if/when it's needed.
I've given it all of 10 seconds of thought, but even something like a DB type of a file handle would be useful. Do a query, get back a handle to a file that you can treat just like you opened it yourself.
> Do a query, get back a handle to a file that you can treat just like you opened it yourself
Things are a bit more complex. For one, the trivial problem of client vs. server host. The DB cannot return a handle (a FD) from the server, because it has no meaning on the host running the app. The second problem is that any file manipulation must conform to the DB semantics for transactions, locking, rollback and recovery.
What you describe does exist: it's the FileStream feature, which dates back to 2007 if I remember correctly. I'm describing the SQL Server feature since this is what I'm familiar with. The app queries the DB for a token, using GET_FILESTREAM_TRANSACTION_CONTEXT[0], and then uses this token to get a Win32 handle for the 'file' using OpenSqlFilestream[1]. The resulting handle is valid for the usual file handle operations (read, write, seek etc).
There were great expectations for this feature, but in real life it flopped. For one, it caused all sorts of operational headaches, from the increased DB file size (increased backup size etc) to problems like having to investigate 'filestream tombstone status'[2]. But more importantly, adoption required an application rewrite (to use OpenSqlFilestream), which of course never materialized.
FileTables are a newer stab at this problem, and this one does allow exposing the DB files as a network share; apps can create and manipulate files on this share and everything is backed by the DB behind the scenes. But it turns out a lot of apps do all sorts of crazy things with files, like copy-rename-and-swap as a means of doing failure-safe saves, and many such operations are significantly more expensive in a DB context. And when the DB content is manipulated directly by apps that 'think' they interact with the filesystem, a lot of useful metadata is never collected in the DB, since the file API used never requires it (think file author, subject etc).
I figured there was something massively annoying with it.
I'm hoping that PostgreSQL can try to tackle this use-case somehow because it's so damn common and 99% of the time the solution that is used completely throws out all the guarantees that the database gives you.
I'm not sure there's a non-annoying way to bridge the problem. The semantics are just so different that there's a lot of friction that needs to be handled somewhere. (What does O_DIRECT mean in this context? What happens when someone decides to store files that are never normally closed?)
In some ways, the problem mirrors OR mappers. A lot of common cases can be handled, but there are always situations where you're confronted with the fact that relational logic just doesn't map cleanly to OO (or hierarchic storage).
Databases use journaling to avoid these issues. The difference between RDBMS' journaling and file system's journaling is the RDBMS journals application data, the file system does not (with few exceptions). Thus, a RDBMS gives you consistent application state, while the file system does not (with fewer exceptions, basically only TxF).
Another pain point is backup; it's not possible to create a consistent snapshot of (rdbms state, other state on the file system). This is avoided entirely if you only need to snapshot (rdbms state,).
> Despite all this, my practical experience is that filesystem is better than blobs for things like uploaded content, images, pngs and jpgs etc.
If you store them on a filesystem, how do you deal with redundancy / failover / scaling ? At a previous job we did a bit of experimenting with using a clustered FS but they all introduced a lot of problems. However, this was a couple of years ago so the situation may be different now.
If you store them in sqlite, how do you deal with redundancy / failover / scaling? There are other databases with better tooling for this, but they weren't part of this comparison. The world is full of tradeoffs. Scale and availability vs. microbenchmark performance is one of them. It would be very interesting to see a comparison of clustered MySQL or Postgres vs. Gluster or Ceph. I very much doubt the performance side of it would favor the database so much.
> If you store them on a filesystem, how do you deal with redundancy / failover / scaling ?
In most cases, a solution like Amazon S3 will have all three of these sorted out for you pretty well. And while it's not a full mountable filesystem, from a system development perspective it has a really good API for retrieving and storing files.
Technically there is s3fs, but it's fairly terrible and I actively recommend against it. Still, it does come handy from time to time, in a limited set of cases.
After years of development, databases and file systems borrow ideas from each other. I am not surprised that the breakeven range might be expanded. It would be interesting to re-visit the conclusions with current databases, file systems, hardware (SSD vs HDD) etc.
There are several very valid business cases to store files as blobs in a DB.
What's the problem if the DB is 150GB? It's not like the working set (which for the files will just be the metadata) will be that big for storing file blobs.
A 10GB database + 140GB of pdfs on the filesystem is not any different from a 150GB DB with everything in it.
And you have other issues (consistency, transactional issues, backups, etc).
How about a 100TB database with 95TB being pdf's? It seems there are better ways of managing lots of immutable small files (including for backups/replication) than shoving them all into a database.
>How about a 100TB database with 95TB being pdf's?
How about it?
>It seems there are better ways of managing lots of immutable small files (including for backups/replication) than shoving them all into a database.
Depends on the business case. A database gives certain guarantees you'll have to replicate (usually badly) in any other way.
Conceptually, it's absolutely cleaner.
As for the scientific or engineering standpoint, there are again no laws that dictate whether this is bad or good.
Even the performance characteristics depend on the implementation of the particular DB storage engine. They might be totally on par with storing in the filesystem (or close enough not to matter).
Not to mention that some filesystems might even have worse overhead depending on the type, size, etc. of files. In fact this very FA speaks of "SQLite small blob storage" being "35% Faster Than the Filesystem".
Plus filesystem storage and DBs are not that different in most cases -- they share algorithms for storage and indexing (log storage, B-trees, etc), and the main difference is the access layer. The FreeBSD filesystem, for one, was more like a DB storage layer than a 70s-style Unix filesystem.
Originator of the MBTiles format here. The initial intention was moving 100,000s of raster map tiles between computer and mobile device, either over USB or network. The main benefit was avoiding per-file transaction overhead, as well as the potential for checksumming. Side benefits included a small space savings over the filesystem because of block overhead, plus later iterations allowed for de-duplication of e.g. all-blue water map tiles. These days, we still use the format for transport between local machines and the Mapbox backend, despite having moved to "vector tiles", which are essentially protocol-buffer-encoded geometries.
Also, to clarify a bit on why spatial capabilities weren't part of the MBTiles spec: when rendering maps, you assemble tiles for a given zoom scale in the vertical and horizontal directions, like a checkerboard. The unique tuple defining a particular square on the map is z/x/y — zoom level, horizontal, vertical. So stuffing these into SQLite and uniquely keying across z/x/y there makes for fast retrieval at render time.
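Roughly what that lookup looks like against the MBTiles `tiles` table (a sketch; error handling trimmed):

    #include <sqlite3.h>
    #include <stdlib.h>
    #include <string.h>

    /* Fetch one tile blob by its z/x/y key into a malloc'd buffer.
     * Assumes the standard MBTiles "tiles" table, uniquely keyed on
     * (zoom_level, tile_column, tile_row). Returns NULL if not found. */
    void *fetch_tile(sqlite3 *db, int z, int x, int y, int *len) {
        sqlite3_stmt *stmt;
        void *out = NULL;
        const char *sql = "SELECT tile_data FROM tiles "
                          "WHERE zoom_level=?1 AND tile_column=?2 AND tile_row=?3";
        if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK) return NULL;
        sqlite3_bind_int(stmt, 1, z);
        sqlite3_bind_int(stmt, 2, x);
        sqlite3_bind_int(stmt, 3, y);
        if (sqlite3_step(stmt) == SQLITE_ROW) {
            *len = sqlite3_column_bytes(stmt, 0);
            out = malloc(*len);
            if (out) memcpy(out, sqlite3_column_blob(stmt, 0), *len);
        }
        sqlite3_finalize(stmt);
        return out;
    }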
Try SpatiaLite[0] if you need advanced geometry capabilities. It has R-Tree spatial indexes if you use SQLite 3.6+, and a rudimentary MBR index for prior versions. It has rudimentary support for curves as well.
As a data point, one of our users figured out how to load the SpatiaLite extension for DB Browser for SQLite (on Windows), then wrote up step by step instructions:
However, part of the concerns outlined in that article, and of more importance for this company, is ease of distribution/compatibility for customers of mapsets and mobile devices, rather than raw read/write performance.
Though the storage size claims might be relevant, and definitely a clear example that sqlite can be very workable and performant in the context of blob storage.
The article speculates that the overhead of calling `fopen` for each file and subsequently reading it is more expensive than reading from an already opened database file.
But how expensive `fopen` really is depends on many factors.
Filesystems can be built in memory (ramdisks) too, but comparing sqlite in memory to files on disk would be pretty silly. For any serious and fair comparison, in sqlite still means on disk.
For what it's worth, I think Mapbox's vector tiles can be stored in around 48GiB of RAM, which is perfectly within reason for a server. It's also the sort of data which is simple to shard. They can have multiple geolocated databases for a local area, and maybe fewer for international queries. This can be done pretty simply at the load balancer level.
This is weird benchmarketing. They are comparing reading/writing 100,000 individual files vs writing 100,000 entries into a single file(sqlite database).
For comparison one could concatenate the same data into a single big file even faster than into sqlite.
They then do not offer logical analysis as to why things are faster.
My understanding is that reads are probably faster due to operating system readahead being able to predict reads better when they are within a single file. Writes are faster because they do one bulk COMMIT instead of many individual fsyncs.
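As a rough sketch of that write-side point, with a made-up `kvblobs(name, data)` table: wrapping all the inserts in one transaction is what turns thousands of per-file fsyncs into a single commit.

    #include <sqlite3.h>
    #include <stdio.h>

    /* Insert many blobs with a single journal commit at the end, instead of
     * one sync per file as the per-file filesystem approach effectively does. */
    int store_blobs(sqlite3 *db, const void **blobs, const int *sizes, int n) {
        sqlite3_stmt *stmt;
        sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
        sqlite3_prepare_v2(db,
            "INSERT INTO kvblobs(name, data) VALUES (?1, ?2)", -1, &stmt, NULL);
        for (int i = 0; i < n; i++) {
            char name[32];
            snprintf(name, sizeof name, "blob-%d", i);
            sqlite3_bind_text(stmt, 1, name, -1, SQLITE_TRANSIENT);
            sqlite3_bind_blob(stmt, 2, blobs[i], sizes[i], SQLITE_STATIC);
            sqlite3_step(stmt);
            sqlite3_reset(stmt);
        }
        sqlite3_finalize(stmt);
        /* With default settings the journal is synced once here. */
        return sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);
    }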
> For comparison one could concatenate the same data into a single big file even faster than into sqlite.
One could, but then one would have great difficulty retrieving the individual files back when needed.
The point is not about what has the greatest raw speed, the point is that for applications that read lots of small files from the filesystem, they'll possibly get better performance and almost certainly use less disk space by sticking the files in SQLite instead.
Writing your own methods to retrieve arbitrary chunks of data from a monolithic file would be a lot of work. Using map tiles as the example, how do you easily retrieve just the tiles in a specific region? Ok, how about all tiles that have the “hasLand” attribute set? Or the “containsCoastline” attribute?
You’d end up rewriting your own version of a database.
> You’d end up rewriting your own version of a database.
This is the hell we just crawled out of in one of our codebases. It's not so much the access for a single process, but when you want to share the data in the file, that's when it seriously starts to get messy, breaking through otherwise clean abstraction layers and/or forcing recompiles on compatibility-breaking changes.
Straight array of structs. The complication was that there were several of them, they were stored in bus-attached MRAM which was accessed directly (e.g. mmap of /dev/ram) by several independent processes.
The idea came from previous generations of the product which generally ran on microcontrollers coupled with custom FPGA designs, so, not entirely unwarranted.. but completely incorrect for a newer product with much greater computing power than their previous devices.
> Oh goodness have we come a long way if interacting with a file is harder than interacting with a SQL database.
It's not interacting with a file, it's interacting with a file that contains other files of varying sizes that need to be accessed randomly with good performance.
If you only want to read then both zip and tar are good for this purpose. It's commonly used and known to work well. There are many formats that are really just a renamed zip file. Like jar, and epub. File entries don't have to be compressed so you can mmap.
If you want to do random writes then more consideration is needed.
> If you only want to read then both zip and tar are good for this purpose
And programmatically accessing them is significantly more complicated than using SQLite which involves dropping a single header in your project and about 10 lines of code.
> If you want to do random writes then more consideration is needed.
Which is almost certainly what you want to do for the sorts of use cases where SQLite is also under consideration.
As far as I know, querying a database has always been easier than querying a custom file format. It's the whole reason databases exist in the first place. And using one off the shelf is clearly easier than writing your own.
Sure, it's completely trivial to read a file containing an array of C structs of metadata and offsets into a blob of concatenated PNGs: just mmap the thing, and your main data structure is sitting right there in memory.
But problems start arising when you need to update the structure of that metadata. Add some more fields. Some more structured, multi-value fields, referring to multiple entities in your increasingly complex data structure. Entities which must be guaranteed to exist. And perform modifications concurrently, from multiple processes. And have the file not get corrupted when something kills a writer process in the middle of writing...
Databases are slow and overkill for the simple case, but they certainly make adding unexpected requested features much easier. And it can be expected that you will get unexpected feature requests.
Nothing wrong with comparing apples and oranges. They have a lot more in common than people give them credit for (both are fruit, and they share shape, flavour, nutrient content, where they grow, who eats them, etc). You can even compare things that aren't similar at all and it's not necessarily pointless.
The third paragraph might offer the logical analysis you're looking for:
> The performance difference arises (we believe) because when working from an SQLite database, the open() and close() system calls are invoked only once, whereas open() and close() are invoked once for each blob when using blobs stored in individual files. It appears that the overhead of calling open() and close() is greater than the overhead of using the database. The size reduction arises from the fact that individual files are padded out to the next multiple of the filesystem block size, whereas the blobs are packed more tightly into an SQLite database.
A big part of it is because they're repeatedly walking the same path while reading and writing it. I got a 20% performance improvement on FreeBSD+ZFS with this change, which merely opens an fd, and uses that to stat and write, cutting the number of path traversals in half.
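The actual patch isn't reproduced here, but the idea is roughly this shape (a sketch, not the real change):

    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Before: every operation takes a path, so the kernel re-walks the
     * directory tree for stat() and again for open().
     * After: resolve the path once, then do everything through the fd. */
    void write_blob(const char *path, const void *buf, size_t len) {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return;
        struct stat st;
        fstat(fd, &st);          /* was: stat(path, &st) -- a second path walk */
        write(fd, buf, len);
        close(fd);
    }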
Which is why before making such broad statements they should check on the other platforms, which have much less broken file systems.
(Check Ext4, ZFS, UFS2, HFS+ and APFS at least.)
What they did was check an untuned Ext4, UBIFS, HFS+ and NTFS vs a tuned database in a single-threaded scenario. Especially misleading in the mmap case, where synchronization would just kill the performance dead.
I find that comparison justified: How many people are even aware of the fact that filesystems can be tuned (except for, maybe, noatime and discard for SSDs)?
No, no, if it is faster than the filesystem it must mean that it has a faster storage medium. Maybe it is writing to memory, or the aether. /s Now, seriously, how can the foot be faster than the shoe?
> So let your take-away be this: read/write latency for SQLite is competitive with read/write latency of individual files on disk. Often SQLite is faster. Sometimes SQLite is almost as fast. Either way, this article disproves the common assumption that a relational database must be slower than direct filesystem I/O.
This is under the 1.1 Caveats heading, which makes me feel the title is a little misleading (but only in the way most benchmark headings probably are, I guess).
Incidentally, can someone more experienced with filesystem and database I/O confirm or contest the assertion here? Specifically, I'm not sure it's fair to generalize these results (even if valid) to categorical relational databases. But this is not a special area of expertise for me.
You're asking whether a database can read/write records faster than disk? Yes. The opposite is also true. I'm not an expert in either fs or db, but it's obvious that the design and use of one is not the same as the design and use of the other, and given the right circumstances one will do a particular job better than another.
An append-only transaction log is probably going to perform better on a spinning disk than random writes. An intelligent app sorting and committing writes in order is probably going to perform better than an fs on a disk with no queue ordering facility. As they said in the post, block-aligned writes in bulk are going to be more efficient than spreading out a bunch of writes not block-aligned. And latency for a db write is going to be lower if you don't close an fd, assuming that closing an fd would always trigger an fsync.
But a properly tuned filesystem, with the right filesystem, with the right kernel driver, with the right disk, with the right bus, with the right control plane, etc etc may work fantastically better than an unoptimized database, and a given workload may fit that fs perfectly.
It's important to remember that a benchmark is only meaningful to the person who is running the benchmark. If you want it to be meaningful to you, you have to try it yourself (and then presumably blog about it).
For small files, I could see this being true, but not in a way that would scale to large files.
For example, instead of a database, you could make a folder (instead of a table) and then a bunch of 4 byte files with ints in them. Doing a table scan would be like cat *.
Obviously this uses a lot more open/close/read calls, with more seeks and iops, because those files may or may not be clustered together on disk. If you packed it all into the same file, it would go faster because of the locality and less system calls. You can also be smarter about caching and other things that the file system does well, but probably isn't optimized for the small file use case.
From OP: "SQLite is much faster than direct writes to disk on Windows when anti-virus protection is turned on. Since anti-virus software is and should be on by default in Windows, that means that SQLite is generally much faster than direct disk writes on Windows."
I don't get this. If scanning the content is important (as acknowledged by the author), then bypassing the scan via blob storage is a security issue and the application should go through some extra hoops to scan the content before saving it to blob, and this should be measured and part of the comparison.
Also, if the SQLite files are exempt from AV scanning, then to level the playing field the test should also exempt the uploaded-files folder. I mean, knowing the dice are loaded and then claiming it as an advantage does not seem professional.
Data chunks inside another file are effectively quarantined by default. There is no exempting or security issue, it's safe by design. It will get scanned occasionally but not on every write.
Except when they are not. Just because malcode is encapsulated doesn't mean it is neutralised. The issue is untrusted data, not how it is stored.
Word docs with embedded content that can trigger code execution on parsing, data structures that break AV parsers, image or font data that causes system libraries to overflow/UAF/etc.: these all could potentially be a security issue in a DB, the same as in a FS.
You don't need to worry about a single process writing data to a file and then reading it back and performing malicious actions. That process could just as easily have been exploited entirely in-memory. Scan the database when it exits and you're fine.
The article may be unclear on this point: anti-virus was turned off for these tests. What that paragraph is pointing out is that performance tanks badly (roughly a factor of 10x) when you turn on Windows Defender, and presumably even worse with other anti-virus software.
Credential: I was one of several people who vetted the article before it was published. I argued that Defender should be enabled, since it is a platform default. D. Richard Hipp chose to take the charitable path and not report those results alongside the others.
The test is easily replicated yourself. The source code for it is part of the SQLite source tree, and command lines for building and running the test are given in the article.
In any case, I really take issue with that approach. Nearly every windows database vendor, including Microsoft, recommends or requires that you disable most types of scanning in databases. Not only is it ineffective and performance impacting, it potentially undermines integrity of the data.
My contention was that this article is about least-effort methods (i.e. fopen() vs default SQLite) so a lot of Windows SQLite users are going to have Defender (or heaven help them, something much worse) installed.
Once you start getting into optimizations, much of this article starts to sway back and forth. Put enough effort into optimization, and you can outdo SQLite for a specific application, simply because SQLite isn't magic, it's just C. Well-written and highly-optimized C, but purpose-written code always has the potential to do better than general-purpose code.
The point of SQLite in general and this report in specific is that you get a lot of performance, power, and safety for free with SQLite as opposed to `fopen()` and such. Good performance while antimalware interferes is just one way this manifests.
As I understand it, drh simply disabled Windows Defender entirely for these tests.
SQLite gets an additional advantage with Defender enabled because Defender doesn't try to scan inside the SQLite DB file for the sub-files, whereas with separate files, Defender must at least look at each file's header to determine whether it's a potentially dangerous file type.
While I was vetting this article, I did some of my own tests, and I found that simply excluding the directory containing the test tree was sufficient to get the same benefit.
My point in other posts here is that you don't need to disable Defender to get this benefit when you use SQLite.
But let's be clear: the 35% claimed performance benefit has nothing to do with Windows Defender. That's a wholly separate issue. Defender pays more attention to a pile of individual files than to an amalgamation of that same data into a single file, and this magnifies SQLite's advantages, but it is not the source of the advantage the linked report talks about.
Interesting. We built a couple of systems many years ago using SQLite as the storage engine for binary files - one was a kind of mini version control utility for storing report files for a legacy ERP system.
I always thought that the SQLite system was far quicker to search and retrieve files from the dataset than it was doing so in our previous version, using hierarchical file folders.
Interestingly, we canned the project because the ERP vendor themselves released a similar tool using - wait for it - individual files in a folder system... (Addendum - much later on they switched to - wait for it - CVS for version control.... in 2010!!)
For writing, the major drawback of SQLite is it doesn’t support concurrent writes.
All filesystems do (at least when writing different files), all full-fledged RDBMS-es do, even some embedded databases do (like ESENT). Yet, in SQLite only a single thread can write.
Isn't that because SQLite's whole use case is "simple, small, fast database that competes with fopen() and not Postgres"? At least in my experience on NTFS, unless you know the section that 2 threads are writing to, and that section is different, you can't write to a file concurrently (N.B.: in C#)
I'd be happy to be wrong (seriously, that would make my day if someone could explain a clean way to do the above) but if I want to get "big boy database features" then I'd use a heavy database like SQL Server. I think it's great that there's an option like SQLite out there for rapid prototyping and "store all your music metadata" type applications!
> "simple, small, fast database that competes with fopen() and not Postgres"
The hardware evolved in a way so even very small systems (phones, embedded, Raspberry Pi and other IoT devices) are now multi core. Multithreading is required to benefit from a multi-core CPU. For IO-heavy tasks, it’s sometimes a good idea to implement multi-threaded IO as well.
> you can't write to a file concurrently
I can concurrently write different files. Or to different streams/blobs/records for most other IO methods, except SQLite.
Storing many small files in a few (compressed) bundle files has been common since at least the 90s for storing game assets, though usually not with SQLite but with some sort of compressed format (simple zip archives are quite common, but there are faster alternatives now). The bundling gives you better overall compression, and faster read performance compared to individual files (at least on Windows) as long as you open the file only once, and from then on only seek and read.
SQLite is a big code base; using a simple archive file format instead gives you most of the advantages, but without the bloat.
PS: This is mostly for read-only situations though. Using SQLite probably starts to make sense when the applications needs to write to and create new files in the archive.
But that would work only for static content, like the game assets. The discussion of BLOBs vs. filesystem comes up mostly in the context of content management and user/app uploaded content.
I was going to make a semi-sarcastic comment along the same lines, but some other posters rightly point out the benefit of using an RDBMS when the file schema is changing.
Not surprised by the results, very surprised by the explanation given, the overhead of open/close calls.
As far as I know (from many measurements and talking to kernel people and researching the mechanism involved), the difference is due to the fact that buffers are shared between all pages of a single file, and not between files.
So for reads, the filesystem will do read-ahead of significantly more data than requested and keep that in buffers, and future reads will profit. Similar for buffers shared when writing.
The same effect will be reproducible with any format storing multiple objects in a single file, it has virtually nothing to do with "SQLite" or "Databases".
One tradeoff is the greater potential for inconsistencies, which despite all the measures taken is much greater when you modify a file rather than writing only completely new files. Another is the inconvenience and duplication of effort, because all your default file-system management tools aren't available.
It would be interesting to see if a pseudo-filesystem that is mapped to a single underlying file would show the same effects (preferably implemented in user space to avoid overheads of multiple kernel round trips).
> The same effect will be reproducible with any format storing multiple objects in a single file
Given unbounded developer time, any advantage SQLite has here can of course be matched or beaten by custom code. It's just another C library, not magic.
The real issue is, how much work will it take you to do that?
SQLite is billed as competing with `fopen()`. That's the proper way to compare it: given equal development time spent talking to the C runtime library vs. integrating SQLite, how much speed and robustness can you achieve?
> the greater potential for inconsistencies
How? In this application, SQLite is filling a role more like a filesystem than a DBMS or `fopen()` call. You can corrupt a filesystem just as easily as a DBMS. SQLite protects itself in much the same ways that good filesystems do, and it can be defeated in much the same sort of ways.
> Another is the inconvenience and duplication of effort, because all your default file-system management tools aren't available.
It's a classic tradeoff: do you need the speed this technique buys or not?
A better reason to avoid this technique is when the files you're considering storing as BLOBs need to be served by a web server that uses the `sendfile(2)` system call. There, the additional syscalls caused by the DBMS layer will probably eat up the speed advantage.
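For reference, a rough sketch of the path a static-file server gets to take on Linux (the socket fd is assumed to come from an accepted connection):

    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Copy a file straight from the page cache to the socket without
     * bouncing the bytes through user space -- something a blob read
     * through a DBMS layer cannot take advantage of. */
    void serve_file(int client_fd, const char *path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return;
        struct stat st;
        fstat(fd, &st);
        off_t offset = 0;
        sendfile(client_fd, fd, &offset, st.st_size);  /* out_fd, in_fd, offset, count */
        close(fd);
    }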
Software development is all about tradeoffs. No technique or technology is perfect for everything.
Now you know one more technique. Maybe it will be of some use to you someday.
> a pseudo-filesystem that is mapped to a single underlying file
That pretty much describes SQLite. Both SQLite and a good filesystem use tree-based structures to index data, both have ways to deal with fragmentation, both have ways to ensure consistency, both strive for durability, etc., etc.
> preferably implemented in user space to avoid overheads of multiple kernel round trips
How are you going to avoid kernel round trips when I/O is involved?
If you think user-space filesystems are fast, go try FUSE.
that thread began: "Now we just need a redesign of Unix tools and principles/practice"
This is perhaps one advantage of a SQLite fs: the flexibility.
You could design concepts into the db structure, and have the driver translate this into the usual language of folders/files/read-write.
But as well as requiring new fsck tools etc, you might also need new ways to interact with the fs as it would have new capabilities, e.g. storing data non-hierarchically.
If the OS can be pushed forward wrt some feature, I care less about whether it would perform.
SQLite is much faster than direct writes to disk on Windows when anti-virus protection is turned on. Since anti-virus software is and should be on by default in Windows, that means that SQLite is generally much faster than direct disk writes on Windows.
Isn't this because writing to a SQLite database file bypasses most (or all) of the antivirus file scanning, since it can't see a complete file and can only look at raw blocks of data?
So if you find value in having your antivirus scan all of your files, that's a disadvantage for using SQLite to store them?
The reason I still use files rather then SQLite is that I don't know how SQLite handles concurrency.
For example I have a PHP app that is used by 10k users a day and it happily handles 100k tmp files in a single directory. On each request, it checks the file age via filemtime() and if new enough, includes the tmp file with a simple include(). (I write PHP arrays into the tmp files). If too old, it recalculates the data and writes it via fopen(), fputs() and fclose().
This might be archaic but it has been working fine for years and never gave me any problems.
Somehow I would expect that if I simply replaced it with SQLite, I would run into concurrency problems.
> Somehow I would expect that if I simply replaced it with SQLite, I would run into concurrency problems.
SQLite wouldn't really be ACID-compliant if multiple threads were enough to defeat the Durability guarantee, would it?
That's not to say that there are no concurrency problems in SQLite, but that they're more in the way of potential bottlenecks than data corruption risks. For a reader-heavy application like yours, I suspect SQLite will perform just as well or better than your existing solution.
If you're solely after speed, I'm not sure it would be worth rewriting your app to use SQLite. This technique's value is simply in the benefit it gives when you were already going to use SQLite for some other reason. If you just want a 35% speed boost, wait a few months or buy a faster SSD. Both are going to be easier and cheaper than rewriting the data storage layer of your application.
That said, maybe there are other things in SQLite that you could use. Easy schema changes, full-text searching, more advanced indexing than the filesystem allows, etc. If you go for one of those, then the extra speed is a nice bonus.
> SQLite wouldn't really be ACID-compliant if multiple threads were enough to defeat the Durability guarantee, would it?
I'm not so much concerned about durability. More about SQLite not responding to "SELECT v FROM t WHERE id=123" with value v but instead with something like "Error: v is currently being written by another process. Try again later" or something.
No idea if that is a realistic scenario. I'm kind of surprised I don't have this kind of problem with my filebased solution. What happens if process A reads from a file while process B writes it? No clue.
> What happens if process A reads from a file while process B writes it?
If process B started first, the writer blocks access to the table being written to, so the reader waits for the writer to complete. There are timeout and retry behaviors, but within those configurable limits, that's what happens.
You can make SQLite behave as you worry about if you set the retries to 0 and timeout to 0, but that's not the default.
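Concretely, that policy is a single call (the 5000 ms value here is just an example):

    #include <sqlite3.h>

    /* Tell SQLite to retry for up to 5 seconds when another connection holds
     * the write lock, instead of returning SQLITE_BUSY immediately. */
    int open_with_timeout(const char *path, sqlite3 **db) {
        int rc = sqlite3_open(path, db);
        if (rc != SQLITE_OK) return rc;
        sqlite3_busy_timeout(*db, 5000);   /* milliseconds */
        return SQLITE_OK;
    }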
I know that SQLite performs exceptionally on embedded devices including most phones, IoT devices, and more. However, for better or worse, it's a flat file. Does anyone know of a TCP/IP-speaking SQL database that would work well on an embedded device? PostgreSQL/MariaDB seem kinda heavy, and the net couplers for SQLite look pretty unsupported.
They were not technically correct, they were just correct. A flat file database is a file with no indexing structure, usually plaintext, usually one record per line. SQLite uses a binary format consisting of an index of pages and a rollback journal or log file. They are significantly different.
Also, yes, there are networked databases with sqlite database drivers.
MySQL/MariaDB can be tuned to run very well on low resource devices. Remember that they were originally developed back in the days of 486 CPUs. A couple of years ago I had MySQL backing a commercial product on a 512M ARM7 5400RPM device (we originally developed it with 256M of memory). Performance was not an issue for our use case even with on the order of a million rows in the DB.
and what exactly do you think any database is at the end of the day? It's not magic, it all ends up on the disk as a file. I mean why is a page file called a FILE?
Sure, we can agree that (almost) everything's a file. However, through their related programs and daemons, we can ask files for more useful functionality such as indexes, versioning, and yes, socket communications.
The source code to the test program is part of the SQLite source tree. It should be fairly easy to modify it to test the scheme you have in mind.
> How well does SQLite deal with fragmentation?
SQLite operates a lot like a modern filesystem: tree-based indexing structures, writes go to empty spaces, deletes leave holes that may later be filled by inserts, etc.
Are you stating that the SQLite database did not write the files to disk? Or simply that it's caching them in-memory? In the latter case, doesn't an operating system typically offer in-memory file buffering / caching? Otherwise, why would we need O_DIRECT or O_SYNC or fsync()?
I'm no expert in FS or sqlite, but as everyone knows, filesystems are not good at processing small files, so having an application that bundles small blobs in memory and then flushes them to disk as a single file is what makes it faster.
ReiserFS 4 was (I believe still is) the only FS that improves small-file performance by bundling small files and storing them on disk as a single blob.
as for that person who downvoted my original comment, I'd love to hear from you why you think I'm wrong. you don't just downvote comments because you disagree.
I do think your original claim is wrong. Your explanation in this response is not the same as your original claim. And what you wrote here is exactly in line with claims in the article:
"The size of the blobs in the test data affects performance. The filesystem will generally be faster for larger blobs, since the overhead of open() and close() is amortized over more bytes of I/O, whereas the database will be more efficient in both speed and space as the average blob size decreases."
Yes, of course SQLite is going to do some operations in-memory. Raw file writes using fwrite are also going to be initially written in-memory, unless O_DIRECT or some other mechanism is involved. And the article explicitly outlines that they made no effort to bypass file buffering, to the point of not even explicitly flushing to disk.
If both processes are writing the BLOBs to disk, then I don't see how your original claim applies. Writing to disk in a manner that is more efficient for small files is still writing to disk, and so the test is not in-memory vs on-disk as your original claim.
If I asserted: "It's faster to put 10,000 rows of csv data in single file instead of 10,000 individual files" even the most junior programmer would likely say "Well, duh, it's 1 file instead of 10,000".
Yet this benchmark is at the top of HN for some reason.
"Measure before optimizing" has limited application. There are many things that need not be measured. Appending to a file is definitely not slower than doing memory management inside the file to allocate a new chunk, seeking to that position and then writing out the chunk. Period. (And the serialization overhead is negligible compared to the disk I/O).
It's not only "writing" though. It's "writing" AND "reading".
Can you read the correct line in a CSV faster? Can you find a row matching specific criteria as fast as SQLite? Can you handle concurrency correctly like SQLite? Can you make sure your CSV is always in a consistent state like SQLite?
The point of the article is that SQLite is still fast even with all the benefit of database.
The idea of SQLite is that you could default to using it, and move away when it hits its limit. Nobody should default to using CSV, you use CSV when you have to, despite all its limitation.
I'm not attempting to compare the usefulness of CSV to a RDBMS. I'm simply pointing out it should be inherently obvious that writing to 1 file is faster than writing to N files, which is the only thing the benchmark is in question is concerned with as well.
Opening a file means locating it physically on the disk by following references (usually at least one per directory level). This is a costly operation.
Compare to the typical case of adding another row by appending to an already allocated disk block. No seek operation needed.
This analysis is incorrect. SQLite must allocate database pages for rows just like the kernel must allocate filesystem blocks for files. Reading a row in SQLite means traversing a B-tree to find the row. Reading a file from the filesystem means traversing a B-tree to find the file metadata. Very similar. The kernel doesn't need to "seek" any more than SQLite does, the article says that all the hard drives are SSDs which don't have heads so they don't seek.
The major differences here are due to the different levels of isolation and authentication that the kernel and SQLite provide. Another difference is the fact that when you find a row in SQLite by primary key, you don't have to do a second lookup to get the data from the metadata. This is a good optimization to use when storing many small pieces of data, but a bad optimization for large pieces of data.
This was just about opening a file and appending a few bytes vs appending to an open file. Obviously the latter can't be slower (and is typically faster).
Because SQLite isn't meant as a general purpose filesystem. It can do some things better (like, apparently, small blob storage), but not everything.
For example, SQLite vacuuming to free up deleted data can be slow; on large SQLite databases, we've found it much faster to rewrite the entire file than to vacuum it.
It also has some scalability limits, it uses locking to limit to a single concurrent writer (short duration locks), which only scales up to a point.
There are a lot of filesystem characteristics that differ from a database file format. SQLite could probably be more filesystem-like, but then it would diverge from being a fast and lightweight database format.
The parent comment is a bit misleading, or completely wrong. SQLite also has to allocate blocks in the database for anything you store. Some of the structures and techniques that SQLite uses for doing this are very similar to the way a filesystem does it.
Instead, think about it this way. With a filesystem, the database is managed by the kernel. Every time you want to read a file, you might do four system calls: open, fstat, read, close. Or you might do mmap instead of read, but you're still doing four context switches, at least in typical cases. Switching to the kernel and back has cost. Normally this cost is small, but if you make a lot of system calls you'll notice the costs piling up. The kernel also has to check permissions to make sure that you have permission to read the file.
With SQLite, the database is inside your application. When you read a row, there's a chance that the row is already in your application's memory. This means no context switches back and forth between application and kernel.
Additionally, when you read a row, the entire database page is read into memory, which includes other rows too. The kernel won't do anything like that with your application--it won't give you img1.png and img2.png if you just ask for img1.png. Maybe they'll both be in the kernel's page cache, but you still have to open and read the file.
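For comparison, the per-file pattern sketched above looks roughly like this:

    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* The four kernel round trips described above, per file read. With SQLite,
     * the equivalent lookup may be satisfied from the library's own page cache
     * inside the application, with no system call at all. */
    void *slurp(const char *path, ssize_t *len) {
        int fd = open(path, O_RDONLY);       /* 1: open  */
        if (fd < 0) return NULL;
        struct stat st;
        fstat(fd, &st);                      /* 2: fstat */
        void *buf = malloc(st.st_size);
        *len = read(fd, buf, st.st_size);    /* 3: read  */
        close(fd);                           /* 4: close */
        return buf;
    }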
Microsoft tried doing this and there are file systems that have tried or are trying. It just opens up multiple cans of worms but it's a pretty interesting idea.
Now I'm confused myself: Parent's context was: Why appending to a CSV is faster than opening a file and appending to that. There was no word about a database.
This page does not live up to the standards of the other high quality technical articles on sqlite.org. I couldn't even find a reference to the file system used for comparison. But the whole setup is a farce anyway.
They tested it on 5 different OSes, which presumably included 3, probably 4 different filesystems (Windows, Mac, and Ubuntu would be using different ones, and I'm guessing the Ubuntu system has a different filesystem than Android too, though I don't actually know), so it doesn't really seem like it matters.
Of course it matters. Different file systems have different performance characteristics. Make sqlite into a file system, and I'm pretty sure you can't beat it by putting another sqlite on top.
When the author has tested 3 or 4 different filesystems against SQLite (and at least 2 of them you can be pretty sure as to what they are simply based on the OS, with a good guess at the other 2 as well), and SQLite beats them all, then it really doesn't matter.
As for turning SQLite into a filesystem, that's not really going to work. SQL isn't designed to support things like cheap "file" appends or reading only portions of a value or seeking or anything like that, so you'd end up having to read the entire value for any read, and write a new copy of the entire value for any write, and your performance would be really really bad. So yeah, putting SQLite inside of SQLite isn't going to work, because SQLite isn't a filesystem and wasn't designed to behave like one. Not to mention this entire article is about small blob storage, and embedding a SQLite database inside of SQLite isn't a small blob.
As for benchmarking multiple file systems with consistent result, point taken. To a degree. Some file systems optimize for directory lookups, and some don't.
As for turning SQLite into a filesystem, we could make it into a file system that would be fast in this particular benchmark. Right?
I mean, is SQLite not also using the filesystem? Maybe I'm confused here. Seems like saying using C is 15% faster than ASM. But that includes user error, right?
The article is basically entirely an answer to that question.
The third sentence in particular is:
The performance difference arises (we believe) because when working from an SQLite database, the open() and close() system calls are invoked only once, whereas open() and close() are invoked once for each blob when using blobs stored in individual files. It appears that the overhead of calling open() and close() is greater than the overhead of using the database. The size reduction arises from the fact that individual files are padded out to the next multiple of the filesystem block size, whereas the blobs are packed more tightly into an SQLite database.
The advice to use a single open/close call instead of multiple is great, but that is advice on how to effectively use the filesystem. It isn't in anyway, shape, or form beating the filesystem.
You can't use a single open/close call to write multiple BLOBs, unless you are concatenating them into one file. That's not typically what happens nor what people are advocating when they recommend not using database BLOBs and using the filesystem instead.
During my first week at American Express I did not have the credentials to install any applications. My manager wanted me to build a prototype and I couldn't install a database. Firebase and other third-party options were impossible because I had to work on an internal server. I thought SQLite would save my day, but I remembered it being really hard to install and set up. So I had to write my own DB that I called 'stupid-db'. It literally just reads/writes data into a file: https://github.com/songz/stupid-db
SQLite is a library, not a database server. There's no install step - it's a single .C file (and a header, if you're into that). Most of the major scripting languages have rock-solid bindings that don't require any additional system libraries or software installs.
[0] https://www.microsoft.com/en-us/research/publication/to-blob...