The paper claims that it is more space efficient than Golomb-coded sequences, but this is clearly not the case, at least when the Golomb parameters and FP rate are set optimally[1]: a GCS can come well within a fraction of 1% of the information-theoretic limit.

The main limitation of GCS in terms of communication efficiency is just that there aren't that many optimal rate options. Arithmetic coding of the differences isn't really much more complex to implement, though, and it's optimal for whatever parameters you want.
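
For concreteness, a minimal sketch of the coding in question (my illustration, not the paper's or the gist's code): hash the N keys uniformly into [0, N*M), sort, and Golomb-Rice code the gaps, which come out approximately geometric with mean M. Names are made up:

    import java.util.Arrays;

    // Sketch: total size in bits of a Golomb-Rice coded GCS.
    class GcsEncodeSketch {
        // Rice parameter q means divisor 2^q: each gap costs
        // (gap >> q) unary bits, 1 terminator bit, q remainder bits.
        static long encodedBits(long[] hashes, int q) {
            long[] sorted = hashes.clone();
            Arrays.sort(sorted);
            long bits = 0, prev = 0;
            for (long h : sorted) {
                long gap = h - prev;
                prev = h;
                bits += (gap >>> q) + 1 + q;
            }
            return bits;
        }
    }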

Now, if it wanted to talk about _access_ efficiency, then sure: GCS don't result in a data structure that's efficient for random access.

[1] https://gist.github.com/sipa/576d5f09c3b86c3b1b75598d799fc84...




(Co-author here) For GCS, I also found the Golomb-Rice parameter Q-1 to be better; this is what we used. According to my calculation, the overhead of GCS is around 1.5 bits/key. See also https://github.com/0xcb/Golomb-coded-map . Sure, with arithmetic coding, or ANS coding, it should be lower; it would be interesting to see how much. I kind of like GCS.
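
To put numbers on the Q-1 observation, a quick back-of-the-envelope check (my own arithmetic, not from the linked repo): for geometric gaps with mean M, the expected Rice code length is r^(2^q)/(1 - r^(2^q)) + 1 + q bits with r = 1 - 1/M, versus a gap entropy of about log2(e*M):

    // Expected bits/key of Rice coding vs. gap entropy, here for M = 2^20.
    class RiceCost {
        public static void main(String[] args) {
            double M = 1 << 20;
            double p = 1.0 / M, r = 1.0 - p;
            // Entropy of the geometric gap distribution, ~log2(e*M).
            double entropy = (-r * log2(r) - p * log2(p)) / p;
            for (int q = 18; q <= 21; q++) {
                double b = Math.pow(2, q);
                double unary = Math.pow(r, b) / (1.0 - Math.pow(r, b));
                System.out.printf("q=%d: %.3f bits/key (entropy %.3f)%n",
                        q, unary + 1 + q, entropy);
            }
        }
        static double log2(double x) { return Math.log(x) / Math.log(2); }
    }

For M = 2^20 this gives roughly 21.54 bits/key at q = 19 against 21.58 at q = 20, so Q-1 does win by a sliver, with both sitting about 0.1-0.15 bits above the 21.44-bit entropy.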


The overhead you're seeing is because you've mismatched your FP rate with the optimal size for your GCS code. If they're exactly matched, the overhead is very low (not as low as with an arithmetic code, simply because each entry is still coded into whole bits, so you get something like a half-bit overhead per entry on average from that).


I would love to see improvements to the GCS code I have, but I tested with many different sizes and Golomb-Rice parameters... I was not able to reduce the size overhead in any way. Maybe you can help? There are two implementations, one is Java: https://github.com/FastFilter/fastfilter_java/blob/master/sr... and the other is C++: https://github.com/FastFilter/fastfilter_cpp/blob/master/src...


Good point about Q-1. I never looked at optimizing parameters when writing up the GCM.

A quick empirical check shows a small improvement for Q-1, but the claim that it comes within a fraction of 1% of the minimum seems to be based on an unusual definition of the minimum. nullc gives it as log2(eM), but the real minimum is simply log2(M), in which case the unary terminator alone imposes a 5% penalty at M=2^20.

It may still be possible to optimize the parameters further, but the gain would be perhaps a tenth of a bit per item.


> but the real minimum is simply log2(M)

The minimum entropy to code such a thing is log2(MN choose N)/N bits per element.

log2(eM) is just an approximation when N is large.

You cannot code a uniform choice of N out of MN options with fewer than log2(MN choose N) bits on average, by the pigeonhole principle.
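
Easy to check numerically (my own arithmetic, using C(a,b) = prod over i of (a-b+i)/i):

    // log2(C(M*N, N))/N vs. the log2(e*M) approximation, for M = 2^20.
    class SetLowerBound {
        public static void main(String[] args) {
            long M = 1 << 20;
            for (long N : new long[] {10, 1000, 100000}) {
                double log2Binom = 0;
                for (long i = 1; i <= N; i++)
                    log2Binom += log2((double) (M * N - N + i) / i);
                System.out.printf("N=%d: %.4f bits/element, log2(eM)=%.4f%n",
                        N, log2Binom / N, log2(Math.E * M));
            }
        }
        static double log2(double x) { return Math.log(x) / Math.log(2); }
    }

The per-element figure approaches log2(eM), about 21.44 here, from below as N grows.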

> the unary terminator alone

The terminator carries information. If M is chosen well, the terminator carries almost a full bit of information.


The minimum for approximate membership testers is log2((M choose N) / ((N + fp_rate*(M-N)) choose N))/N, assuming M >> N. The asymptotic is log2(1/fp_rate). For simplicity's sake, below I pretend we're dealing with a set of size 1, so that 1/fp_rate = M.

I thought you had this in mind in the first paragraph with "1 / log(2) (around 1.44) times more compact than Bloom filters". However, using your definition of the minimum, for practical purposes it isn't possible to be anywhere near 1.44 times more compact than a Bloom filter.

Using your example of M=2^20, log2(M) log2(e) / log2(eM) is about 1.35. Personally I've never spent more than 45 bits per item on such structures, but even at 64 bits the ratio only reaches 1.41.

The oft-cited 1.44x Bloom filter tax is simply its per-item cost of log2(M) log2(e) measured against the information-theoretic minimum of log2(M), leaving just the log2(e) multiplier.
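
The arithmetic behind these ratios, for anyone following along (my own check, with fp rate 1/M = 2^-k):

    // Bloom bits/key, log2(M)*log2(e), vs. the coded-set cost log2(e*M).
    class BloomVsGcsRatio {
        public static void main(String[] args) {
            double LOG2E = 1.0 / Math.log(2); // log2(e), about 1.4427
            for (int k : new int[] {20, 45, 64}) {
                double bloom = k * LOG2E; // Bloom filter cost at fp = 2^-k
                double set = k + LOG2E;   // log2(e*M) with M = 2^k
                System.out.printf("fp=2^-%d: bloom %.2f, set %.2f, ratio %.3f%n",
                        k, bloom, set, bloom / set);
            }
        }
    }

This prints ratios of about 1.346, 1.398, and 1.411 respectively.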


I think I was caught up thinking the discussion was all about the efficiency of the GCS coding for the underlying bit-set, which it is indeed nearly optimal for (but only for some parameters, which don't happen to be where the FP rate is 1/2^k, unfortunately the case usually considered in papers!). Sorry about that.

The Bloom filter costs 1.44 times the lower bound, while the difference between an optimally coded bit-set and the asymptotic lower bound is an additive 1.44 bits per element. This is a pretty big difference!


If I were coding a static approximate membership query structure, I wouldn't use either Bloom filters or GCS; I'd use an Elias-Fano-coded sequence of fingerprints. That has nearly the same compression efficiency as Golomb coding, but can be efficiently queried with no decompression.
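
A minimal sketch of what I mean (the standard Elias-Fano layout, assuming u >> n; the query below scans naively where a real structure would use a select index):

    // Elias-Fano over n sorted fingerprints in [0, u): low bits stored
    // verbatim, high parts as a unary-coded bit vector (ones = elements,
    // zeros = bucket boundaries; element i sits at position high(i) + i).
    class EliasFanoSketch {
        final int lowBits;     // floor(log2(u/n))
        final long[] low;      // low parts, conceptually lowBits each
        final boolean[] upper; // unary-coded high parts

        EliasFanoSketch(long[] sortedValues, long u) {
            int n = sortedValues.length;
            lowBits = Math.max(0, 63 - Long.numberOfLeadingZeros(u / n));
            low = new long[n];
            upper = new boolean[n + (int) (u >>> lowBits) + 1];
            for (int i = 0; i < n; i++) {
                low[i] = sortedValues[i] & ((1L << lowBits) - 1);
                upper[(int) (sortedValues[i] >>> lowBits) + i] = true;
            }
        }

        boolean contains(long v) {
            long hi = v >>> lowBits, lo = v & ((1L << lowBits) - 1);
            int i = 0; // number of ones (elements) seen so far
            for (int pos = 0; pos < upper.length; pos++) {
                if (!upper[pos]) continue;
                long bucket = pos - i; // zeros before pos = high part
                if (bucket == hi && low[i] == lo) return true;
                if (bucket > hi) return false;
                i++;
            }
            return false;
        }
    }

The low array is plain random access, so a lookup only ever touches one bucket's worth of the upper vector; a rank/select directory over that vector is what makes the query constant-time in a real implementation.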


Interesting idea! Elias-Fano monotone sequences unfortunately do need a bit more space than just a list of Rice codes. What is possible, without having to increase the space usage, is to re-arrange the Rice codes in each bucket so that all the variable parts come first and the fixed parts at the end. That way, lookup speed would be much, much faster for large buckets. We used this for RecSplit: https://arxiv.org/abs/1910.06416 . I will try that out for GCS as well!
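
Roughly, the idea (a simplified sketch, not the actual RecSplit code): once the variable unary parts are grouped at the front of the bucket, the i-th fixed remainder lands at a directly computable bit offset.

    // Layout [all unary quotients][all q-bit remainders]: remainder i can
    // be read without decoding items 0..i-1 of the bucket.
    class SplitRiceBucket {
        static long remainder(long[] words, long remaindersStart, int q, int i) {
            return readBits(words, remaindersStart + (long) i * q, q);
        }
        // Read len (< 64) bits starting at bitPos from a packed word array.
        static long readBits(long[] words, long bitPos, int len) {
            int w = (int) (bitPos >>> 6), shift = (int) (bitPos & 63);
            long v = words[w] >>> shift;
            if (shift + len > 64) v |= words[w + 1] << (64 - shift);
            return v & ((1L << len) - 1);
        }
    }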



