I'm feeling like I'm missing something, but why not? 2^64 is the number of cards in the deck, but that's not the number of cards we're picking. That might be just 10.
As a sibling comment mentions, the constraints given simply don't allow sampling with rejection, due to the memory limit. You get a stream of uniform samples; you're not allowed to store them.
But even if you were allowed to store them the algorithm still has an advantage. It doesn't require random access, only sequential reading and skipping. This was made for tapes, but is also useful for cases where it's simpler or more efficient to stream your data source instead of doing random access.
And of course it also works for m < k < n, where m is your memory size. E.g. when you're performing analysis on a dataset with 10^12 entries, only want to sample 10^9 of them, and don't have the storage (or don't want to pay the IO overhead) for either.
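For concreteness, here's a minimal sketch of the kind of sequential selection sampling being described (in the spirit of Knuth's Algorithm S); the function name and the iterator interface are my own, not from the article. Each item is accepted with probability (still needed) / (still remaining), so exactly k of the n items are chosen uniformly in one sequential pass, with only O(1) extra state (two counters here):

```python
import random

def sequential_sample(stream, n, k, rng=random.random):
    """Select exactly k of the n items in `stream`, uniformly at random,
    in one sequential pass, keeping only two counters of extra state."""
    needed = k        # how many items we still have to select
    remaining = n     # how many items are left, including the current one
    for item in stream:
        if needed == 0:
            break
        # Accept the current item with probability needed / remaining.
        if rng() * remaining < needed:
            yield item
            needed -= 1
        remaining -= 1

# Example: pick 10 "cards" from a deck of 2**20 without ever storing the deck.
picked = list(sequential_sample(range(2**20), 2**20, 10))
```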
> even if you were allowed to store them the algorithm still has an advantage. It doesn't require random access, only sequential reading and skipping. This was made for tapes, but is also useful for cases where it's simpler or more efficient to stream your data source instead of doing random access.
This is a good point, but its evil twin is that this algorithm requires you to read the entire data set, every time, making it undesirable for any application where you can do random access.
Not necessarily. If you're using it on a tape drive, it requires you to scroll through half of the entire tape on average every time, but fast-forwarding a tape is a faster operation than actually reading the entire tape. Other sequential access data stores may have similar behavior--i.e., if you are accessing data from a scrolling/paging API that lets you skip ahead and drop pages, but doesn't let you back up without restarting.
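To illustrate the skip-ahead case, a hedged sketch (mine, not from the article): the same per-record accept/reject decision, but rejected records are seeked past rather than read, assuming fixed-size records in a forward-seekable file. On a tape this corresponds to fast-forwarding instead of reading:

```python
import random

def sample_fixed_records(path, record_size, n, k, rng=random.random):
    """Pick k of the n fixed-size records in the file at `path`, reading only
    the selected records and seeking (forward only) past everything else."""
    picked = []
    needed, remaining = k, n
    with open(path, "rb") as f:
        for i in range(n):
            if needed == 0:
                break
            if rng() * remaining < needed:
                f.seek(i * record_size)          # fast-forward to the chosen record
                picked.append(f.read(record_size))
                needed -= 1
            remaining -= 1
            # Rejected records are never read, only skipped over.
    return picked
```

The seek positions only ever increase, so the same pattern works for any store that lets you skip ahead but not back up.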
Because we're restricted to O(1) extra space by the problem statement, regardless of the number of cards we're picking. We could be picking all 2^64, but definitely don't want the constant in O(1) to be 2^64!
The paper describes an online one-pass algorithm for which this is not true: as you get each input value you accept or reject it. So you only need O(1) space to store the number of values you've seen, the number remaining, and the number you still need to accept.
You're not storing the values themselves, only counts. If the count of values you've processed is too large to fit in memory... you probably had to spend a long, long time processing those values. It's not a relevant objection.
It doesn't. The original problem statement is "generate a list of random tweets, without duplication". If you can't store the list, you're stuffed no matter what.
I found the article a bit obtuse, but I take it to mean we are drawing a uniform sample from a very large set. This means we don't need a shuffled sample; we just need one with a uniform chance of each source element being in the sample.
Duplication in the context of the article appears to refer to tweet identity rather than content, so taking a stream of all existing tweets and selecting a uniformly sampled subset of them is certainly possible without storing the list of sampled tweets.
The algorithm referred to has the additional constraint of sequential access only. If you have random access, there are simpler approaches to take. If you have an unbounded input stream, you just filter using a PRNG for a uniformly distributed sample.
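A minimal sketch of that PRNG-filter approach for an unbounded stream (the 1% keep rate and the names below are my own illustration): every element is kept independently with the same probability, so inclusion is uniform, but the sample size is not fixed:

```python
import random

def bernoulli_filter(stream, p=0.01, rng=random.random):
    """Keep each element of an (unbounded) stream independently with
    probability p: uniform inclusion chance, variable sample size."""
    for item in stream:
        if rng() < p:
            yield item

# Hypothetical usage:
# for tweet in bernoulli_filter(tweet_stream, p=0.001):
#     process(tweet)
```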
The article only applies to enormous and sequentially accessed data sources, which is why the algorithm died with tape drives: it's not actually a common problem any more.
He also just throws this out there: "visual inspection was enough to convince me that it [Python's sample function] was broken, without going further" -- without explaining in what way it is supposedly broken.