How is this different from a hashset where you simply never do the final key comparison? As far as I can tell, both require linear storage and have roughly the same space complexity.
It seems like the main "innovation" here is simply using a perfect hash function for your hashset.
From the paper: "We store the fingerprints in an array B with capacity c slightly larger than the cardinality of the set |S| (i.e., c ≈ 1.23 × |S|)."
The constant factor is different. If you want a small number of collisions, you need a very small load factor (you can't use most collision-handling techniques because you aren't storing the key). So if you treat your hash table as an array of bits and run it at a load factor of, say, 0.1, you're now using 10 bits per key, which is a lot more space than a Bloom filter or an xor filter with the same false-positive rate.
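To put rough numbers on that, here's a back-of-the-envelope sketch (Python, illustrative only: the 0.1 load factor is just the example above, and the Bloom/xor sizing formulas are the standard textbook ones, not benchmarks from this paper):

```python
import math

# "Hash table as an array of bits" at load factor 0.1:
# roughly 10% of the bits are set, so the false-positive rate is about 0.1,
# and the cost is 1 bit per slot / 0.1 load factor = 10 bits per key.
load_factor = 0.1
bitmap_fpr = load_factor            # approximate: fraction of set bits
bitmap_bits_per_key = 1 / load_factor

# A Bloom filter tuned for the same false-positive rate p needs about
# -ln(p) / (ln 2)^2 bits per key (standard formula, optimal k).
bloom_bits_per_key = -math.log(bitmap_fpr) / (math.log(2) ** 2)

# An xor filter needs about 1.23 * log2(1/p) bits per key.
xor_bits_per_key = 1.23 * math.log2(1 / bitmap_fpr)

print(f"bitmap @ load 0.1 : ~{bitmap_bits_per_key:.1f} bits/key, FPR ~{bitmap_fpr:.0%}")
print(f"bloom  @ same FPR : ~{bloom_bits_per_key:.1f} bits/key")
print(f"xor    @ same FPR : ~{xor_bits_per_key:.1f} bits/key")
```

That works out to roughly 10 bits/key for the bitmap versus about 4.8 and 4.1 bits/key for a Bloom filter and an xor filter at the same ~10% false-positive rate.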
Bloom filters don't require linear space. You can essentially create a set of arbitrarily large objects, as long as you have a hash for them. It requires O(C) memory where C is the size of your largest hash function. As a trade-off, it's not deterministic, since there can be hash collisions due to the pigeonhole principle. So, even though the Bloom filter claims the object is in the set, there is a certain probability (which you can calculate) that it's not actually there. On the other hand, a tree-set or hash-set will be deterministically correct, but they require O(N) space, and a tree-set has O(log N) lookup cost.
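As a concrete (and purely hypothetical) illustration of that probabilistic behaviour, here's a minimal Bloom filter sketch in Python. This is not the xor filter from the paper; the class and method names are made up:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch (illustrative only).
    k positions are derived from one SHA-256 digest via double hashing."""

    def __init__(self, m_bits: int, k_hashes: int):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Two 64-bit values from one digest, combined Kirsch-Mitzenmacher style.
        d = hashlib.sha256(item).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        # False means "definitely not in the set"; True means "probably in the set".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

For example, `BloomFilter(m_bits=9_600, k_hashes=7)` is sized for roughly 1,000 items at about 1% error; `might_contain` on an added key always returns True, while a key that was never added returns False except with about that 1% probability.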
Bloom filters absolutely call for linear space if you are using them appropriately: for a given hash-collision probability, the size of the required filter is linear in the number of elements you want to store. Yes, it is "technically true" (the most useless kind of true) that you can store as much stuff as you want in the filter... but you get increasingly worse false-positive rates that eventually saturate to near 100%. If you actually read any of the math--including this linked work on Xor filters--you will always see them described in terms of "bits per item stored" (where the number of bits is extremely small compared to what you would expect). Hell: Wikipedia even describes the algorithm's storage requirements as "A Bloom filter with 1% error and an optimal value of k, in contrast, requires only about 9.6 bits per element, regardless of the size of the elements."
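For what it's worth, that 9.6 figure drops straight out of the standard sizing formulas; a quick check (Python, standard formulas, nothing specific to this paper):

```python
import math

# Standard Bloom filter sizing: m/n = -ln(p) / (ln 2)^2, optimal k = (m/n) * ln 2.
p = 0.01
bits_per_element = -math.log(p) / (math.log(2) ** 2)  # ~9.59 bits per element
optimal_k = bits_per_element * math.log(2)            # ~6.64 -> use 7 hash functions

print(f"{bits_per_element:.2f} bits per element, optimal k ≈ {optimal_k:.1f}")
```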
I disagree. It requires linear space in the sense that you need another data structure to deal with hash collisions; if you read my comment, that's exactly what I was talking about. In a typical application, a Bloom filter sits in front of a cache layer to answer whether your data is in the cache. For this you do not need a list of the objects themselves: if the Bloom filter wrongly decides an object is in the set, it's just a cache miss, and as long as a Bloom filter lookup is faster than a cache miss this is usually a good tradeoff, because normally, when you're about to have a cache miss, you end up not going to the cache at all since the Bloom filter says the object is not there. For this application you do NOT need O(N) space inside the Bloom filter layer, but obviously someone will eventually need to go retrieve the data.
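To make that cache-layer pattern concrete, here's a hypothetical read path (the names `cache` and `backing_store` are made up for illustration; `bloom` could be the sketch above):

```python
# Hypothetical lookup path: the filter only ever saves you a trip,
# it never stores the data itself.
def lookup(key: bytes, bloom, cache, backing_store):
    if not bloom.might_contain(key):
        # Definitely not cached: skip the cache round-trip and go straight
        # to the backing store (this is the common "would-be miss" case).
        return backing_store.get(key)
    # "Probably cached": a false positive here just costs one wasted cache miss.
    value = cache.get(key)
    if value is None:
        value = backing_store.get(key)
    return value
```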
> If you actually read any of the math
I mean, I did, and I was trying to describe that math above. I've built Bloom filters many times before. I'm curious if you misread my comment.
You said the following, so I assume you agree with it:
> It requires O(C) memory where C is the size of your largest hash function.
For any given expected false-positive rate P, C must be chosen to be of size O(N), so a Bloom filter with expected false-positive rate P will be of size O(N).
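Put another way, with the usual sizing formula the per-item cost is a constant that depends only on P, so the total size grows proportionally with N. A quick illustration (Python, standard formula, illustrative numbers):

```python
import math

def bloom_bits(n_items: int, p: float) -> float:
    # Standard Bloom filter size for n items at false-positive rate p:
    # m = -n * ln(p) / (ln 2)^2, i.e. linear in n for any fixed p.
    return -n_items * math.log(p) / (math.log(2) ** 2)

for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"n={n:>13,}: {bloom_bits(n, 0.01) / n:.2f} bits per item")
```

The bits-per-item stays constant (about 9.6 at P = 1%), so the total is O(N).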