Note that intersections using HLLs don't work in general: the error rates become enormous for sets that have small intersections, or whose cardinalities differ greatly.
A general solution for estimating intersection cardinalities is to use HLLs for union operations and keep a MinHash sketch for each key as well, using the MinHash to estimate intersections.
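The combination works because |A ∩ B| = J(A, B) · |A ∪ B|: the MinHash sketches estimate the Jaccard similarity J, and the HLL union estimates |A ∪ B|. Here's a minimal sketch of the MinHash side (class and function names are my own; a real deployment would pair this with an actual HLL for the union term rather than the exact size used below):

```python
import hashlib

def _hash(item, seed):
    # stand-in hash family; any k roughly independent hashes work
    return int(hashlib.sha1(f"{seed}:{item}".encode()).hexdigest(), 16)

class MinHash:
    """k-hash MinHash: keeps the minimum hash value per seed."""
    def __init__(self, k=256):
        self.k = k
        self.mins = [float("inf")] * k

    def add(self, item):
        for i in range(self.k):
            v = _hash(item, i)
            if v < self.mins[i]:
                self.mins[i] = v

    def jaccard(self, other):
        # fraction of matching minima estimates J(A, B)
        return sum(a == b for a, b in zip(self.mins, other.mins)) / self.k

a, b = MinHash(), MinHash()
for x in range(1000):
    a.add(x)            # A = {0..999}
for x in range(500, 1500):
    b.add(x)            # B = {500..1499}; true Jaccard = 500/1500

j = a.jaccard(b)
union_size = 1500       # in practice: estimated by merging HLL(A) and HLL(B)
intersection_estimate = j * union_size   # true intersection size is 500
```

Note how the error compounds: the intersection estimate inherits error from both the Jaccard estimate and the union estimate, which is why this still degrades for very small intersections.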
HN is great. I was looking for something similar to HLL to do intersections, but couldn't find it because I didn't know what it was called. MinHash, apparently!
It would be really great if this was implemented in stream-lib (https://github.com/addthis/stream-lib), which is a pretty standard library for HLL counters.
Also, the community supporting that lib is very familiar with the mathematics behind these structures, and would be better able to critique the error rates and such.
HyperLogLog is an algorithm that estimates the number of distinct elements in a dataset using very little memory.
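The intuition: hash every element and look at the position of the first 1 bit; seeing a run of r leading zeros suggests roughly 2^r distinct elements. Splitting the stream across m registers and averaging tames the variance. A toy sketch of that idea (parameter choices and names are mine; this is illustrative, not a production HLL):

```python
import hashlib
import math

class TinyHLL:
    """Toy HyperLogLog: 2**p registers, each remembering the maximum
    'rank' (position of the first 1 bit) among hashes routed to it."""
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.regs = [0] * self.m

    def add(self, item):
        x = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        idx = x & (self.m - 1)     # low p bits choose a register
        w = x >> self.p            # remaining bits supply the rank
        rank = 1
        while w & 1 == 0 and rank < 160 - self.p:
            rank += 1
            w >>= 1
        self.regs[idx] = max(self.regs[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias-correction constant
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.regs)
        zeros = self.regs.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)  # small-range correction
        return int(est)

hll = TinyHLL()
for i in range(100_000):
    hll.add(f"user-{i % 10_000}")   # 10,000 distinct values, each seen 10x

estimate = hll.count()              # close to 10,000 despite duplicates
```

With p=10 the sketch is just 1,024 small registers, and the typical relative error is around 1.04/sqrt(m), roughly 3% here.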
HyperLogSandwich solves a similar problem. From the README:
> Unlike the HyperLogLog, which estimates the cardinality of unique items in a set, and unlike a CountMinSketch, which estimates the frequency of a specific item, the HyperLogSandwich estimates the number of unique items that have occurred for any frequency.
Real world examples include:
- How many people viewed my Tweets 4+ times this month?
- How many users watched this video 2 times this week?
- How many users visited my website 3 times during an arbitrary time range?
- How many log exceptions happened 10+ times last week?
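I haven't read the implementation, but the queries above suggest a layered structure along these lines: a frequency estimator (such as a Count-Min Sketch) decides which "level" an item has reached, and a distinct-counting sketch (such as an HLL) per level tracks the items at that frequency. A toy version with exact stand-ins (a Counter for the CMS, plain sets for the HLLs) to show the shape of the queries — my guess at the mechanics, not the project's actual code:

```python
from collections import Counter

class FrequencySandwich:
    """Toy sketch: a Counter stands in for a Count-Min Sketch, and plain
    sets stand in for per-level HLLs. An item seen f times appears in
    levels 1 through f, so unioning levels >= f answers 'f+ times'."""
    def __init__(self, max_level=16):
        self.freq = Counter()
        self.levels = [set() for _ in range(max_level + 1)]

    def add(self, item):
        self.freq[item] += 1
        f = min(self.freq[item], len(self.levels) - 1)
        self.levels[f].add(item)

    def count_with_frequency_at_least(self, f):
        # Union of levels >= f. HLL unions are lossless, so this query
        # survives the switch from sets to real sketches.
        seen = set()
        for level in self.levels[f:]:
            seen |= level
        return len(seen)

sandwich = FrequencySandwich()
for user, views in [("a", 5), ("b", 2), ("c", 1)]:
    for _ in range(views):
        sandwich.add(user)
```

Here `count_with_frequency_at_least(2)` answers "how many users visited 2+ times" (user "a" and user "b", so 2). With real sketches both the frequency estimate and the distinct counts become approximate, so the errors of the two structures compound.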
There is a very good analysis of the HLL + MinHash intersection technique over at http://tech.adroll.com/blog/data/2013/07/10/hll-minhash.html