Note that the measurements in the paper were made before they fixed a bug where they confused bits and bytes.
So SQLite only used 1/8 of the reserved bloom filter space, thus increasing the false positive rate significantly:
I found and reported the bug because I wanted to know how the bloom filters work in SQLite for my uni seminar paper. Still wondering if one can find those kind of bugs with test cases.
On top of that I don't think it's fair to say it's 10x faster when it btree was tested only on integer index primary key column. Benchmarks with that bold statements should include short string (1-16 chars maybe) and UUID indexes at least.
I do not know if it is still the case, but the last time I looked into the source code SQLite did hash all strings to the exact same value.
So the bloom filter optimization does not work there.
It had to do with the different ways strings can be compared with collating functions, as strings may be equal even if they have different bytes: https://sqlite.org/forum/forumpost/0846211821
Call me crazy, but I'm simply splitting my UUID into the higher and lower bits and indexing off that.
IE
CREATE TABLE foo(
id_ms UNSIGNED BIG INT NOT NULL,
id_ls UNSIGNED BIG INT NOT NULL,
PRIMARY KEY (id_ms, id_ls)
) WITHOUT ROWID;
That works well with UUIDv7 and is just storing 128bits rather than a full string. In most languages it's pretty trivial to turn 2 longs into a UUID and vice versa.
In SQLite, if you were to define a TEXT column (or anything other than INTEGER, for that matter) with a UUID as the PK, you’d already have two indices, because it stores data based on the rowid [0]. So you’d already have a level of indirection, where the “PK” would be pointing to the rowid.
You could define the table as WITHOUT ROWID [1], but as docs point out, the average row size shouldn’t exceed 200 bytes for the default 4 KiB page size. Since a UUID in text form is at best 32 chars, that doesn’t leave much for the rest of the columns.
It's also a problem in machine learning. Your data might be mangled due to a bug but the NN will still extract something useful out of it. Or, on the flip side, if you make a change to the data and things do break (learning stops converging), you never really know if it's the architecture or the data that's the issue.
SQLite only knows nested loop joins and the bloom filter can just tell us "no need to do a join, there is definitely no matching entry".
If it has a false positive all the time (the worst case) then the performance is the same as before the bloom filter optimization was implemented (besides the small bloom filter overhead).
As the bloom filter size in SQLite directly depends on the table size I estimated a false positive rate of 63.2% due to this bug, while it could have been just 11.75%.
https://sqlite.org/src/info/56d9bb7aa63043f5
I found and reported the bug because I wanted to know how the bloom filters work in SQLite for my uni seminar paper. Still wondering if one can find those kind of bugs with test cases.