The sampling algorithm you describe becomes pretty slow as k grows. There are much faster algorithms. We wrote a paper on fast sampling: https://arxiv.org/pdf/1610.05141.pdf (page 11; what you describe is labelled "H", for hash table, in the plot. Sorry about the single-letter names, it's a weird convention for sampling algorithms).

You can see that as soon as k is more than a few thousand, "R" ("recursive") is much faster. It recursively splits the range (initially 1...N) in half and decides how many samples come from the left half and how many from the right half. It does this by drawing a hypergeometrically distributed number, which is exactly the distribution for sampling without replacement. Its base case, once fewer than ~1000 elements are left, is the algorithm you describe (H). Algorithm R is also very easy to parallelise. ("B_MKL" and "B_GPU" are vectorised, non-portable implementations of a Bernoulli-sampling-based algorithm using Intel MKL and Nvidia CUDA, respectively.)
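
Here's a rough Python sketch of that recursion (variable names and the ~1000-element cutoff are just placeholders; the implementations benchmarked in the paper are tuned C++):

  import numpy as np

  def sample_range(rng, lo, hi, k, cutoff=1024):
      # Draw k distinct integers uniformly from [lo, hi) by recursive
      # hypergeometric splitting (the idea behind algorithm "R").
      n = hi - lo
      if k == 0:
          return []
      if n <= cutoff:
          # Base case "H": throw darts into a hash set until k distinct hits.
          hits = set()
          while len(hits) < k:
              hits.add(int(rng.integers(lo, hi)))
          return list(hits)
      mid = lo + n // 2
      # How many of the k samples fall into the left half [lo, mid)?
      # That count is hypergeometrically distributed.
      k_left = int(rng.hypergeometric(mid - lo, hi - mid, k))
      return (sample_range(rng, lo, mid, k_left, cutoff) +
              sample_range(rng, mid, hi, k - k_left, cutoff))

  rng = np.random.default_rng(0)
  print(sorted(sample_range(rng, 0, 10**9, 10)))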

Reservoir sampling can be done much faster than O(N): Θ(k log(N/k)) is possible. That bound is tight, so something like O(k) is impossible, because every position in the reservoir gets replaced roughly log(N/k) times in expectation. See https://www.researchgate.net/profile/Jeffrey_Vitter/publicat... for a proof.
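
For the curious, the skipping trick looks roughly like this in Python (this sketch follows Li's "Algorithm L"; Vitter's Algorithm Z computes the skips with a different distribution, but the structure is the same, and the names are mine):

  import math, random

  def reservoir_sample(seq, k):
      # Skip-based reservoir sampling over an indexable sequence.
      # Elements that are jumped over are never even looked at, so only
      # about k*(1 + log(N/k)) elements are touched in expectation.
      n = len(seq)
      reservoir = list(seq[:k])
      if n <= k:
          return reservoir
      w = math.exp(math.log(random.random()) / k)
      i = k - 1
      while True:
          # Jump ahead; everything in between never enters the reservoir.
          i += math.floor(math.log(random.random()) / math.log(1.0 - w)) + 1
          if i >= n:
              return reservoir
          reservoir[random.randrange(k)] = seq[i]
          w *= math.exp(math.log(random.random()) / k)

  print(sorted(reservoir_sample(range(10**6), 5)))

The sub-linear running time of course assumes you can jump to the i-th element cheaply (an array, or a file you can seek in); if the input is a stream you have to read anyway, you still pay O(N) for the reading, just not for the per-item random number generation.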

But you're right that sorting in order to draw a sample is not particularly efficient most of the time. Not to mention that a selection algorithm would be sufficient as well, and runs in expected time O(N): https://en.wikipedia.org/wiki/Quickselect
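
As a concrete illustration (my example, not the parent's): the sort-based approach gives every element a uniform random key and keeps the k smallest; a partition-based selection finds the same k elements without fully sorting the keys, e.g. via numpy's argpartition (introselect) instead of a hand-rolled quickselect:

  import numpy as np

  def sample_by_selection(data, k, rng=None):
      # Give every element a random key and select the indices of the
      # k smallest keys with a partition-based selection instead of a sort.
      rng = rng or np.random.default_rng()
      keys = rng.random(len(data))
      idx = np.argpartition(keys, k)[:k]   # k smallest keys, unordered
      return [data[i] for i in idx]

  print(sample_by_selection(list(range(1000)), 5))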



