
"(Stated more formally: given non-negative integers k and n with k <= n, generate a list of k distinct random numbers, each less than n. We reasonably ask that this be done in O(k) time and O(1) additional space. Finally, the distribution must be uniform.)"

"...it’s also possible to use the language of cards: deal k cards from a deck of size n, without replacement. So a poker hand like A2345 of clubs is valid for k=5 and n=52, but a hand containing AAAAA of clubs is not."

These seem like fairly straightforward problems to solve; can anybody ELI5 why the Vitter algorithm is necessary?

Probably the hardest things are the space and time requirements. If you relax e.g. the O(k) time requirement to O(k log k), the problem essentially becomes trivial.[1]

More generally, though, what solution are you thinking of? (The problem, as stated, is actually fairly difficult! As far as I can tell...)

---

[1] See https://en.wikipedia.org/wiki/Coupon_collector's_problem . The solution is then to draw repeatedly, rejecting any card that has already been drawn.
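
A minimal sketch of that rejection approach (the helper name is mine, not from the article):

    import random

    def sample_by_rejection(k, n):
        # Redraw on collision. While k << n this takes O(k) expected
        # draws, but the expected draw count is n*(H_n - H_{n-k}),
        # which grows as k approaches n; the set also costs O(k)
        # extra space, so both stated bounds are missed in general.
        seen = set()
        while len(seen) < k:
            seen.add(random.randrange(n))
        return list(seen)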


I see; in that case I need to read up on the difference between O(k) and O(k log k).


We can compute a random k-size subset of {0..n-1} as rndsubset(0,k,n), where

rndsubset(i,0,n) = {}

rndsubset(i,k,n) = {i} union rndsubset(i+1,k-1,n) with probability k/(n-i), rndsubset(i+1,k,n) otherwise

This, however, takes on average n/k time per element, i.e. O(n) in total, which is prohibitive when n >> k.
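
An iterative sketch of that same recursion (this is the classic sequential sampling scheme, Knuth's Algorithm S; uniform and O(1) extra space, but O(n) total time):

    import random

    def rndsubset(k, n):
        # Scan candidates in order; keep i with probability
        # (picks remaining) / (candidates remaining) = k/(n-i).
        out = []
        for i in range(n):
            if random.random() * (n - i) < k:
                out.append(i)
                k -= 1
                if k == 0:
                    break
        return out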

The probability of h being the next chosen element in rndsubset(i,k,n) is k/(n-h) * Prod_{i<=j<h} (1 - k/(n-j)), a distribution Vitter samples directly in constant expected time.
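
To illustrate, here is a sketch that samples the skip length by inverting that distribution. The linear search below is Vitter's Method A, which costs O(skip) per draw; Method D replaces the search with rejection sampling against a continuous approximation to reach O(1) expected time per element:

    import random

    def next_skip(k, n_rem):
        # Invert P(skip >= s) = prod_{t<s} (n_rem-k-t)/(n_rem-t).
        v = random.random()
        s = 0
        quot = (n_rem - k) / n_rem
        while quot > v:
            s += 1
            quot *= (n_rem - k - s) / (n_rem - s)
        return s

    def vitter_a(k, n):
        # Yield a sorted k-subset of {0..n-1}, one skip at a time.
        i = 0
        while k > 0:
            i += next_skip(k, n - i)
            yield i
            i += 1
            k -= 1

For the card example, list(vitter_a(5, 52)) deals a sorted 5-card hand from a 52-card deck.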


Trivial algorithms (like the one implemented in Python) slow down when k is close to n. And when n is large, it is not possible to use a bitmap to mark dealt cards.


The Python algorithm does not get slow when n is large. The k ~ n case obviously calls for inverse selection.

https://github.com/python/cpython/blob/master/Lib/random.py#...
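For reference, inverse selection just means sampling the complement when k is large. A sketch (this is not CPython's actual code; random.sample switches between set-based rejection and a partial Fisher-Yates shuffle of a copied pool, depending on how k compares to n):

    import random

    def sample_inverse(k, n):
        # When k > n/2, draw the n-k excluded elements instead, so
        # the rejection loop sees few collisions. The O(n) output
        # work is unavoidable here since k is itself Theta(n).
        reject = set()
        while len(reject) < n - k:
            reject.add(random.randrange(n))
        return [i for i in range(n) if i not in reject]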


NOTE: This comment I wrote is totally wrong. I haven't had my morning coffee yet. H_n - H_(cn) ≈ constant, not diverging (for 0 < c ≤ 1). :)

Yeah, but the k ~ n/2 case (in which inverse selection and normal selection have the same runtime) is still Ω(n log n) (equivalently Ω(k log k)), which is still "slower" than the presented algorithm.


Actually, this comment is correct despite making a similar argument to your incorrect one elsewhere. Using a set to accumulate intermediate results requires k inserts at O(log k) each.


I'm still not sure about this: the amortized time of insertion should be O(1) in any of the usual hash-table structures. We're not inserting a single item into an already-built hash table, but rather k items into an empty one, with elements drawn uniformly and bucketed by a good hash function. All of that should give O(1) average time per insertion.


What's preventing someone from implementing a non-trivial algorithm in Python?


That was not at all what I meant.



