"(Stated more formally: given non-negative integers k and n with k <= n, generate a list of k distinct random numbers, each less than n. We reasonably ask that this be done in O(k) time and O(1) additional space. Finally, the distribution must be uniform.)"
"...it’s also possible to use the language of cards: deal k cards from a deck of size n, without replacement. So a poker hand like A2345 of clubs is valid for k=5 and n=52, but a hand containing AAAAA of clubs is not."
These seem to be fairly straight forward problems to solve- can anybody ELI5 why the Vitter algorithm is necessary?
Probably the hardest things are the space and time requirements. If you relax e.g. the O(k) time requirement to O(k log k), the problem essentially becomes trivial.[1]
More generally, though, what solution are you thinking of? (The problem, as stated, is actually fairly difficult! As far as I can tell...)
We can compute a random k-size subset of {0..n-1} as rndsubset(0,k,n), where
rndsubset(i,0,n) = {}
rndsubset(i,k,n) = {i} union rndsubset(i+1,k-1,n) with probability k/(n-i), rndsubset(i+1,k,n) otherwise
This however takes on average n/k time per element, which is prohibitive when n >> k.
The probability of {h} being the next chosen element in rndsubset(i,k,n) is k/(n-h) * Prod_{i<=j<h} 1-k/(n-j), which is sampled directly in constant time by Vitter.
Trivial algorithms (like one implemented in Python) slow down when k is close to n. Or when n is large, then it is not possible to use bitmap to mark dealt cards.
NOTE: This comment I wrote is totally wrong. I haven't had my morning coffee yet. H_n - H_(cn) ≈ constant, not diverging (for 0 < c ≤ 1). :)
Yeah, but the k ~ n/2 case (in which inverse selection and normal selection have the same runtime) is still Ω(n log n) (equiv Ω(k log k)), which is still "slower" than the presented algorithm.
Actually, this comment is correct despite making a similar argument to your incorrect one elsewhere. Using a set to accumulate intermediate results requires k inserts at O(log k) each.
I'm still not sure about this—the amortized time of insertion should be O(1) on any one of the usual hash table-type-structures. We're not inserting a single item to an already-built hash-table, but rather k of them into an empty hash-table, with elements being uniformly drawn and bucketed using a good hash function. All of this should give O(1) average time for insertions.
"...it’s also possible to use the language of cards: deal k cards from a deck of size n, without replacement. So a poker hand like A2345 of clubs is valid for k=5 and n=52, but a hand containing AAAAA of clubs is not."
These seem to be fairly straight forward problems to solve- can anybody ELI5 why the Vitter algorithm is necessary?