Co-author of the paper here. The repo https://github.com/FastFilter/fastfilter_cpp contains all the source code and benchmarks. Yes, construction of the xor filter is slower, but I wouldn't say massively. You should probably compare Bloom 12 against Xor 8, as those have similar false positive rates. Bloom 12 is 60 ns/key, versus Xor 8 at 110 ns/key. Cuckoo filter is somewhere between. Those number don't include calculating the hash code of a key, which might be another 30 ns/key or so.
What is your use case? Maybe Blocked Bloom might be better (if you have the memory).
Thanks, found the benchmarks (your blog links primarily to different repo, "/xorfilter").
My use case is a new kind of debugger (https://BugJail.com), which works by capturing 20+ million events per second from running application, and then reconstructing model of program execution from the raw events. With 20+ million events per second that 110ns per key is just not going to work..
Very interesting! If both construction and lookup need to be very fast, and if you do have enough memory, then I would say a blocked Bloom filter would be best (without knowing your use case in detail). With a regular Bloom filter, you will have lots of CPU cache misses. Cuckoo and xor filter might make sense if you can do (batch) construction in background threads, multiple filters concurrently. That way you can save memory.
The filters are used to seek from database files, so lookup/query side performance is not major issue (Xor filters were mainly interesting because of the smaller size). Normally a secondary index would do the job fine, but because of the high sustained rate there is no time to write it. And all CPU cores are already maxed out, so moving more to background doesn't provide net-benefit..
That is a very interesting project that you have on the go there, it doesn't happen often that I see tools that I think might be game changers coming along and this just might be one of those.
I know, I asked dang to fix it but he refused. I'll have to live with it I guess.
What I like about your project is that I've always felt that debuggers and such give us the same 'von Neumann' view of running programmers that we feel our CPU has. It's a bit like trying to figure out why your car won't start when all you have is a view through a microscope. And besides that, it is quite often that which is not happening that is indicative of an issue and debuggers usually allow you to focus only very well on everything that does happen.
Maybe you feel that way because that's exactly what debugger was designed for: CPU register access in assembler programs..
It's pretty insane that every other aspect of software engineering has changed so dramatically since the 1970s, but debugger has remained fundamentally unchanged for almost half a century now.
Also.. next BugJail beta will have SQL queries against captured program execution (select * from field_write join method_call...) to test hypothesis about runtime behavior. If you can express "that which is not happening" in SQL and get back 0 rows, then it didn't happen. Do you think this could solve your need?
The website mentions a waiting list. Can you point me in the direction of where I can get email notifications about this project? Would be very interested in this for .NET. Is it similar to https://devblogs.microsoft.com/dotnet/debugging-net-apps-wit... ?
The underlying technology in BugJail and Microsoft TTD is somewhat similar (capturing based), but the functionality build on top of that is very different: Microsoft TTD still looks and works like "normal debugger plus back button", while BugJail is total redesign from scratch and looks nothing like any traditional debugger.
I know it's frowned upon to say this at HN, but I'm beginning to wonder who's reading the actual page. At least to me the page seems pretty clear about it:
>The xor filters takes a bit longer to build, but once built, it uses less memory and is about 25% faster in some demonstration test.
When they say "faster," they mean for querying. Which is actually pretty useful for a lot of use cases where bloom filters are used today.
To be clear, I wrote specifically "construction performance".
The blog already stated that "build is bit slower", so i was mainly commenting that instead of "bit slower" it is actually order of magnitude slower (in some cases).
According to the paper, the query side performance is indeed faster, so they are not misleading in any way.
What is your use case? Maybe Blocked Bloom might be better (if you have the memory).