I'm feeling like I'm missing something, but why not? 2^64 is the number of cards in the deck, but that's not the number of cards we're picking. That might be just 10.
As a sibling comment mentions, the constraints given simply don't allow sampling with rejection, due to the memory limit. You get a stream of uniform samples; you're not allowed to store them.
But even if you were allowed to store them the algorithm still has an advantage. It doesn't require random access, only sequential reading and skipping. This was made for tapes, but is also useful for cases where it's simpler or more efficient to stream your data source instead of doing random access.
And of course it also works for m < k < n, where m is your memory size. E.g. when you're performing analysis on a dataset with 10^12 entries, only want to sample 10^9 of them, and don't have the storage (or don't want to pay the IO overhead) for either.
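For concreteness, here's a minimal sketch of the kind of sequential selection sampling being described (in the spirit of Knuth's Algorithm S); the function name and the iterator interface are my own, not from the article. Each item is accepted with probability (still needed) / (still remaining), so exactly k of the n items are chosen uniformly in one sequential pass, with only O(1) extra state (two counters here):

```python
import random

def sequential_sample(stream, n, k, rng=random.random):
    """Select exactly k of the n items in `stream`, uniformly at random,
    in one sequential pass, keeping only two counters of extra state."""
    needed = k        # how many items we still have to select
    remaining = n     # how many items are left, including the current one
    for item in stream:
        if needed == 0:
            break
        # Accept the current item with probability needed / remaining.
        if rng() * remaining < needed:
            yield item
            needed -= 1
        remaining -= 1

# Example: pick 10 "cards" from a deck of 2**20 without ever storing the deck.
picked = list(sequential_sample(range(2**20), 2**20, 10))
```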
> even if you were allowed to store them the algorithm still has an advantage. It doesn't require random access, only sequential reading and skipping. This was made for tapes, but is also useful for cases where it's simpler or more efficient to stream your data source instead of doing random access.
This is a good point, but its evil twin is that this algorithm requires you to read the entire data set, every time, making it undesirable for any application where you can do random access.
Not necessarily. If you're using it on a tape drive, it requires you to scroll through half of the entire tape on average every time, but fast-forwarding a tape is a faster operation than actually reading the entire tape. Other sequential access data stores may have similar behavior--i.e., if you are accessing data from a scrolling/paging API that lets you skip ahead and drop pages, but doesn't let you back up without restarting.
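To illustrate the skip-ahead case, a hedged sketch (mine, not from the article): the same per-record accept/reject decision, but rejected records are seeked past rather than read, assuming fixed-size records in a forward-seekable file. On a tape this corresponds to fast-forwarding instead of reading:

```python
import random

def sample_fixed_records(path, record_size, n, k, rng=random.random):
    """Pick k of the n fixed-size records in the file at `path`, reading only
    the selected records and seeking (forward only) past everything else."""
    picked = []
    needed, remaining = k, n
    with open(path, "rb") as f:
        for i in range(n):
            if needed == 0:
                break
            if rng() * remaining < needed:
                f.seek(i * record_size)          # fast-forward to the chosen record
                picked.append(f.read(record_size))
                needed -= 1
            remaining -= 1
            # Rejected records are never read, only skipped over.
    return picked
```

The seek positions only ever increase, so the same pattern works for any store that lets you skip ahead but not back up.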
Because we're restricted to O(1) extra space by the problem statement, regardless of the number of cards we're picking. We could be picking all 2^64, but definitely don't want the constant in O(1) to be 2^64!
The paper describes an online one-pass algorithm for which this is not true: as you get each input value you accept or reject it. So you only need O(1) space to store the number of values you've seen, the number remaining, and the number you still need to accept.
You're not storing the values themselves, only counts. If the count of values you've processed is too large to fit in memory... you probably had to spend a long, long time processing those values. It's not a relevant objection.
It doesn't. The original problem statement is "generate a list of random tweets, without duplication". If you can't store the list, you're stuffed no matter what.
I found the article a bit obtuse, but I take it to mean we are drawing a uniform sample from a very large set. This means we don't need a shuffled sample; we just need one with a uniform chance of each source element being in the sample.
Duplication in the context of the article appears to refer to tweet identity rather than content, so taking a stream of all existing tweets and selecting a uniformly sampled subset of them is certainly possible without storing the list of sampled tweets.
The algorithm referred to has the additional constraint of sequential access only. If you have random access, there are simpler approaches to take. If you have an unbounded input stream, you just filter using a PRNG for a uniformly distributed sample.
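A minimal sketch of that PRNG-filter approach for an unbounded stream (the 1% keep rate and the names below are my own illustration): every element is kept independently with the same probability, so inclusion is uniform, but the sample size is not fixed:

```python
import random

def bernoulli_filter(stream, p=0.01, rng=random.random):
    """Keep each element of an (unbounded) stream independently with
    probability p: uniform inclusion chance, variable sample size."""
    for item in stream:
        if rng() < p:
            yield item

# Hypothetical usage:
# for tweet in bernoulli_filter(tweet_stream, p=0.001):
#     process(tweet)
```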
The article only applies to enormous and sequentially accessed data sources, which is why the algorithm died with tape drives: it's not actually a common problem any more.
He also just throws this out there: "visual inspection was enough to convince me that it [Python's sample function] was broken, without going further" -- without explaining in what way it is supposedly broken.