
A bunch of comments here are missing the main point: Unless you have an Exabyte of memory, you can't use any kind of data structure to remember which cards you've already picked among 2^64 of them. The goal here is an algorithm that generates the selected cards, one by one, in order, as if from a coroutine that only uses a tiny, constant amount of memory. So, no arrays, lists, sets, hash tables, bitmaps, etc., even if they're hidden in some cute construct built into your favorite language.



Is there an easy way to adapt the algorithm to get the cards drawn in a random order? The article says "it’s easy to randomize the order after the fact" but if we can't store them in memory then that's a no-go.


You can draw the cards in a random order by feeding an incrementing counter through a block cipher and using the output as an index. The fixed output size of a block cipher does entail some extra work filtering out numbers beyond the range you want, just as it would with an LFSR. (Since 2^64 happens to be a power of two, you could use an LFSR or a 64-bit block cipher in ECB mode directly, as if it were format-preserving, but that's a coincidence of the range.)
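
For illustration, a minimal sketch of that filtering idea (the names and the `perm` argument are mine; any bijection on [0, 2^m), such as a Feistel network or a small block cipher, would do):

  // Draw all n cards in a pseudorandom order: walk a counter through a
  // bijection on [0, 2^m) with 2^m >= n, discarding out-of-range outputs.
  function* drawInRandomOrder(n, m, perm) {
      for (let ctr = 0; ctr < 2 ** m; ctr++) {
          const card = perm(ctr);
          if (card < n) yield card;   // keep only in-range outputs
      }
  }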

You can produce an exact random permutation with a Feistel network for any number, in this case selecting every number from 0 to 2^64 - 1 exactly once, with no appreciable space requirements.

   procedure feistel(input) {
      result = input;
      h = 2**32;
      for (let i = 0; i < 8; i++) {
          l = (input / h)|0;
          r = input % h;
          temp = r;
          fn = (r + i) % h;
          r = (l + fn) % h;
          l = temp;
          result = h * l + r;
      }
  
      return result;
  }


Is that code correct? Why do you overwrite `result` every step of the loop if you only use it at the end?


Good eye, `input` should be replaced with `result` when building the subkeys within the loop.

Here's a working implementation:

  function feistel(ctr, lr, ll) {
      let result = ctr;
      for (let i = 0; i < 8; i++) {
          // split into halves; the round counter i acts as a trivial subkey
          let l = (result / ll)|0;
          let r = result % ll;
          let temp = r;
          // round function; alter its multiplier & increment accordingly
          let fn = (r * 16807 + i) % lr;
          r = (l + fn) % lr;
          l = temp;
          result = lr * l + r;
      }
      return result;
  }

  // sanity check: every value in [0, rsize*rsize) shows up as an output
  let rsize = 8;
  let outputs = new Set();
  for (let i = 0; i < rsize * rsize; i++) outputs.add(feistel(i, rsize, rsize));
  for (let i = 0; i < rsize * rsize; i++) if (!outputs.has(i)) console.log('failure: ', i);


There are kinds of PRNGs that are designed to emit every possible output value in the range, and some that will emit every possible value before any repeat.

If you look at his code, the first function seems to be calling a method that claims to do the latter. I think his documentation may be out of sync.
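
For a concrete illustration of the full-period kind, a minimal sketch (the parameters are mine, chosen to satisfy the Hull-Dobell conditions, not taken from the article's code):

  // An LCG mod 2^16 with c odd and a-1 divisible by 4 has full period:
  // it emits every value in [0, 2^16) exactly once before repeating.
  const M = 2 ** 16;
  function* fullPeriodLCG(seed = 12345) {
      let x = seed % M;
      for (let i = 0; i < M; i++) {
          x = (x * 69069 + 1) % M;   // a = 69069, c = 1
          yield x;
      }
  }

  // Sanity check: one full cycle produces 2^16 distinct outputs.
  console.log(new Set(fullPeriodLCG()).size === M);  // true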


You select N from M, in order, where M is "too big" and N fits in memory / swap / disk. You then shuffle your N selections to get random ordering.
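
For the shuffle step, assuming the N selections fit in an in-memory array, a minimal sketch:

  // Standard Fisher-Yates shuffle: uniformly random ordering in O(N) time.
  function shuffleInPlace(arr) {
      for (let i = arr.length - 1; i > 0; i--) {
          const j = Math.floor(Math.random() * (i + 1));  // uniform in [0, i]
          [arr[i], arr[j]] = [arr[j], arr[i]];            // swap
      }
      return arr;
  }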


I understand "you can't use any kind of data structure to remember which cards you've already picked among 2^64 of them" as "N cards don't fit in memory". Or is the relevant distinction memory vs disk, as it's feasible to randomize on-disk data?


The problem asks to find k cards (integers) out of n, in time O(k) and O(1) additional space — this means additional to the k required to store the output. So while holding n items wouldn't fit in memory, it is implicit that k would. But we still want to avoid Ω(k log k) time, or more than a constant amount of additional space.


Not sure that this qualifies as a uniform distribution, even if you spread the N across the M, because the selection must be predictable so that the next N don't include the previous ones. That's presuming the ‘user’ might eventually iterate over all M items, even if you pick chunks of N internally.


Well, the main point of the algorithm is that the output is going to be a sequential, increasing list of int64 "cards".

Which may be useful for tape drives, but I'm not sure what else.

The big point the author is missing is that if you drop that requirement, or rather if you require the output to be shuffled, like you'd expect from dealing cards, then there are super easy algorithms.

Just pick a random permutation function over the 64-bit integers, and start counting. It's literally like dealing cards.

People in this thread have suggested block encryption as permutation functions, which will work. But there are also good non-cryptographic (much faster) permutation functions out there.

If you have a permutation function you don't need memory to store the permutation, and poof, the problem just goes away.
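
As a rough sketch of the permute-and-count idea (not the author's algorithm; the constants are the splitmix64-style finalizer, which happens to be a bijection on 64-bit integers, and the function names are made up):

  const MASK = (1n << 64n) - 1n;   // work mod 2^64 using BigInt

  // Invertible 64-bit mix: xorshifts and odd multipliers are each bijections
  // mod 2^64, so their composition is a fixed pseudorandom permutation.
  function permute64(x) {
      x &= MASK;
      x ^= x >> 30n;  x = (x * 0xbf58476d1ce4e5b9n) & MASK;
      x ^= x >> 27n;  x = (x * 0x94d049bb133111ebn) & MASK;
      x ^= x >> 31n;
      return x;
  }

  // "Deal" k distinct 64-bit cards by permuting a counter.
  function* deal(k, start = 0n) {
      for (let i = 0n; i < k; i++) yield permute64(start + i);
  }

  for (const card of deal(5n)) console.log(card);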

You need this algorithm only if you absolutely need the output to remain in order. But that's not really part of the problem he sketches, which is about "take X without replacement".


> Which may be useful for tape drives, but I'm not sure what else.

Partitioning large data sets, or accessing any data store where sequential access is more efficient than random access. Tape drives are merely a canonical and previously-common example of such data stores, but such situations still exist in the modern world, even if they are not as common. I for one was rather happy to see this, as it addresses a problem I have that I did not know already had a solution!

Suppose you have a huge dataset that you cannot load into memory all at once, but can request samples of via a paging API. You want to sample a random subset of the records in that dataset. Just generating the indices won't do--you need to turn those indices into actual records. Random access imposes too much overhead--even if you can request arbitrary pages, you don't want to have to request each page more than once, and you don't want to have to hold previously-requested pages in memory Just In Case you decide to back up and select another record from them later. Ergo, a non-ordered permutation function doesn't give optimal performance.

With ordered sequential sampling, you can request one page at a time, get all the records you will ever need out of that page, then move on to the next page that would contain an index from your sample.
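
A minimal sketch of that access pattern (fetchPage is a hypothetical API returning one page of records as an array):

  // Consume an increasing stream of sampled indices one page at a time,
  // never fetching any page more than once.
  async function collectSample(sortedIndices, pageSize, fetchPage) {
      const records = [];
      let page = null, pageNo = -1;
      for (const idx of sortedIndices) {
          const p = Math.floor(idx / pageSize);
          if (p !== pageNo) {              // indices only move forward
              page = await fetchPage(p);
              pageNo = p;
          }
          records.push(page[idx % pageSize]);
      }
      return records;
  }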

Now, if your sample size is small enough that you can fit the list of card-indices into memory, rather than just streaming the actual selected records off somewhere else for processing (which in my case, it is), then you could achieve equal data-access performance by selecting the head of a random permutation and then sorting it--but as the article mentions, that kills your asymptotic performance for computing the indices, and means you can no longer perform selection on-line, or parallelize the operations of index selection and data retrieval.


How do you pick a random permutation function over 64-bit integers? On a set of N=2^64 integers, there are N! permutations, so you need lg(N!) ≈ N lg N random bits to generate a uniformly random permutation, and thus at least that much time. This is too much: we want a solution in O(K) time and O(1) additional space (in addition to the space for K integers) — something that's Ω(N lg N) is out of the question. Things like block encryption do not generate a uniform distribution over all possible N! permutations of N=2^64 integers (at typical parameter sizes); obviously they can produce at most 2^(number of random key bits) distinct permutations.



What “format-preserving encryption” gives you is a hard-to-predict (pseudorandom) permutation on the N elements. The resulting permutation is not guaranteed to be uniformly distributed on the space of all N! permutations; it just comes from some large space that's still much smaller than N! (depending on the number of random bits used), and it's hard to predict which member of that space the permutation will be.

It is simply not possible to sample from a uniform distribution on a set S (here, the set of all N! permutations) in time less than lg|S|. Just think about it: whatever your randomized algorithm, if you only flip a coin M times, there are only 2^M possible outputs of your process. If you want every member of S to be achievable, this requires 2^M ≥ |S|, or M ≥ lg|S|.


As long as N is factorable into primes, it can be done by Feistel encrypting each prime using modular addition.

There are ways to do it with greater coverage of the space between 4 and 2^64, but I would defer to better judgement before doing so.


Factoring into primes doesn't really affect the lower bound on complexity.

Note that for N=2^64, the number N! is about 10^(10^20.5), i.e. just the number of digits in N! is itself about 3.5×10^20, and the number of bits in N! is over 10^21. Every prime number from 2 to 2^64 (over 10^17 of them) is a prime factor of this number N!; if an algorithm is supposed to work with that many primes, it's clearly no longer O(K), as K could be small relative to N.
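
For reference, a quick Stirling estimate of where those figures come from:

  lg(N!) ~= N lg N - N lg e
         = 2^64 * 64 - 2^64 * 1.4427
        ~= 1.18×10^21 - 2.7×10^19
        ~= 1.15×10^21 bits

  log10(N!) = lg(N!) / lg 10 ~= 3.5×10^20 digits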

An algorithm (like one of the proposed encryption schemes) that produces, say, one of 2^1024 permutations may be “good enough for practical purposes” (as 2^1024 is much larger than the number of atoms in the universe), yet as a fraction of the space of all possible permutations, it is dizzyingly insignificant.


That is correct.

To achieve full coverage of possible permutations, one needs independent keys and innumerable rounds.


I'm feeling like I'm missing something, but why not? 2^64 is the number of cards in the deck, but that's not the number of cards we're picking. That might be just 10.


As a sibling comment mentions, the constraints given simply don't allow sampling with rejection, because of the memory limit: you get a stream of uniform samples, but you're not allowed to store them.

But even if you were allowed to store them the algorithm still has an advantage. It doesn't require random access, only sequential reading and skipping. This was made for tapes, but is also useful for cases where it's simpler or more efficient to stream your data source instead of doing random access.

And of course it also works for m < k < n, where m is your memory size. E.g. when you're performing analysis on a dataset with 10^12 entries, you only want to sample 10^9 of them, and you don't have the storage (or don't want to pay the IO overhead) for either the full dataset or the sample.


> even if you were allowed to store them the algorithm still has an advantage. It doesn't require random access, only sequential reading and skipping. This was made for tapes, but is also useful for cases where it's simpler or more efficient to stream your data source instead of doing random access.

This is a good point, but its evil twin is the point that this algorithm requires you to read the entire data set, every time, making it undesirable for any application where you can do random access.


Not necessarily. If you're using it with a tape drive, it requires you to scroll through half of the entire tape on average each time, but fast-forwarding a tape is a faster operation than actually reading the whole tape. Other sequential-access data stores may have similar behavior--i.e., if you are accessing data through a scrolling/paging API that lets you skip ahead and drop pages, but doesn't let you back up without restarting.


Because we're restricted to O(1) extra space by the problem statement, regardless of the number of cards we're picking. We could be picking all 2^64, but definitely don't want the constant in O(1) to be 2^64!


O(1) /extra/ space, but we must have working space at least the size of the output.


The paper describes an online one-pass algorithm for which this is not true: as you get each input value you accept or reject it. So you only need O(1) space to store the number of values you've seen, the number remaining, and the number you still need to accept.
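
A minimal sketch of that accept/reject loop (classic selection sampling a la Knuth's Algorithm S, not Vitter's faster skip-based method):

  // One pass over n items, selecting k of them in increasing order, using
  // only O(1) extra state: items seen so far and items still needed.
  function* sampleInOrder(n, k) {
      let needed = k;
      for (let seen = 0; seen < n && needed > 0; seen++) {
          // accept item `seen` with probability needed / (n - seen)
          if (Math.random() * (n - seen) < needed) {
              needed--;
              yield seen;
          }
      }
  }

  console.log([...sampleInOrder(100, 10)]);  // 10 indices, already in order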


The number of values you've seen could be "only" an exabyte, which is a problem.


You're storing the number of values you've seen:

    So far, I've seen 172 values.
You're not storing the values themselves. If the number of values you've processed is too large to fit in memory... you probably had to spend a long, long time processing those values. It's not a relevant objection.


The algorithm has to deal with picking, say, 2^63 of them. No room in RAM or even on disk to keep track of the chosen cards.


It doesn't. The original problem statement is "generate a list of random tweets, without duplication". If you can't store the list, you're stuffed no matter what.


I found the article a bit obtuse, but I take it to mean we are drawing a uniform sample from a very large set. This means we don't need a shuffled sample; we just need one with a uniform chance of each source element being in the sample.

Duplication in the context of the article appears to refer to tweet identity rather than content, so taking a stream of all existing tweets and selecting a uniformly sampled subset of them is certainly possible without storing the list of sampled tweets.

The algorithm referred to has an additional constraint of sequential access only. If you have random access, there are simpler approaches to take. If you have an unbounded input stream, you just filter using a PRNG for a uniformly distributed sample.

The article only applies to enormous and sequentially accessed data sources, which is why the algorithm died with tape drives: it's not actually a common problem any more.


From the article: "Stated more formally: given non-negative integers k and n with k <= n"

Focus on k <= n.

Also, in the article's comment section, the author counters a comment in favor of Python's random generator along the same lines.


He also just throws this out there: "visual inspection was enough to convince me that it [Python's sample function] was broken, without going further" -- with no explanation of how or in what way it is supposedly broken.


The original problem is to "generate a list of random tweets, without duplication". So lists are most assuredly allowed, and the implied constraints on the output size are in play. Using those to solve the problem under an assumption that the output size is much smaller than the input is therefore fair game.


Picking tweets is an example of the problem, not the definition. Further, the article was very explicit about noting the small-size cases and ignoring them as irrelevant and boring. So no, they're not fair game.


Let's just read the first paragraph again:

"Here’s a program roughly 0% of programmers know how to write: generate a list of random tweets, without duplication. You may think you know how to do this, but you almost assuredly don’t."

This problem can definitely be changed into harder problems, but this is the premise of the article and it's pretty darn easy.


I think the premise is ambiguous; my initial reading caught the implication from "tweets" that the dataset is huge, and therefore keeping track of state in the naive solution is impractical.


The problem he states doesn't really call for the algorithm he fails to actually explain. The algorithm only makes sense if you require the output to be in order (and you need a really large sample). The part where he suggests shuffling the output afterwards is absurd; if you wanted random order, there are much simpler algorithms.


Yes, but I feel like he's still working too hard.

I've 'reinvented' an algorithm at least twice to generate/iterate all permutations or combinations of a list by counting from 1 to P(n, k) or C(n, k) and just doing repeated divisions to determine the next element in the sequence. That generates them in order. If n is small enough to fit into memory then you can shuffle the inputs, but that still gives you a partially ordered output (all outputs with 'foo' in the same position are contiguous, for instance).

I always left myself a TODO to use a PRNG instead of an increment if I needed the outputs to be more shuffled than that, but it never got that far.
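
Something along those lines (a hypothetical reconstruction, not the parent's actual code): unranking the i-th k-combination in lexicographic order by repeated comparisons against binomial coefficients.

  // Binomial coefficient C(n, k); exact as long as the value fits in a double.
  function binomial(n, k) {
      if (k < 0 || k > n) return 0;
      let result = 1;
      for (let j = 1; j <= k; j++) result = result * (n - k + j) / j;
      return result;
  }

  // Return the rank-th k-combination of {0..n-1} in lexicographic order.
  function unrankCombination(rank, n, k) {
      const combo = [];
      let x = 0;                                   // next candidate element
      for (let slots = k; slots > 0; slots--) {
          // skip whole blocks of combinations until `rank` falls inside one
          let count = binomial(n - x - 1, slots - 1);
          while (rank >= count) {
              rank -= count;
              x++;
              count = binomial(n - x - 1, slots - 1);
          }
          combo.push(x);
          x++;
      }
      return combo;
  }

  // Enumerate all C(5, 3) = 10 combinations in order.
  for (let i = 0; i < binomial(5, 3); i++) console.log(unrankCombination(i, 5, 3));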


If you can store 2^64 data points, why can't you store 2^63 indexes? That's a very small increase in memory requirements.


The issue isn't storage requirements; the issue is that, in the environments where this algorithm is useful, random access to the storage device is orders of magnitude slower than sequential access. Think magnetic tape in 1986, or today's spinning rust hard drives that take relatively forever to "seek", or even RAM being accessed sequentially, allowing the CPU to speculatively pre-fetch cache lines in burst-mode.

Using Vitter's algorithm for generating the random numbers in order, one at a time, lets you do a single, serial pass through the 2^64 records on said device, streaming out the selected records sequentially.

If instead you generate your 2^63 random numbers out of order, you'll have to do at least 2^63 random writes one way or another. Even if you have an extra exabyte to do these writes into (and don't mind waiting to generate all 2^63 of them before outputting the first one), it will take a prohibitively long time to do, since it's not sequential writing.

That's the motivation behind this algorithm; if the characteristics of your exabyte-scale storage device are not "sequential = fast / random = orders-of-magnitude-slower", then it may not be of interest.
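
For illustration, a minimal sketch of that single serial pass (the tape object with skip()/readRecord() is a hypothetical sequential-access API):

  // Given selected indices arriving in increasing order, fast-forward to
  // each one and read exactly one record; never seek backwards.
  function streamSelected(selectedIndices, tape, emit) {
      let pos = 0;                          // current position on the device
      for (const idx of selectedIndices) {
          tape.skip(idx - pos);             // skip forward without reading
          emit(tape.readRecord());
          pos = idx + 1;
      }
  }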


No, if you generate them out of order, you can just use a permutation function and have zero memory requirements.

This algorithm is only fast given that sequential access is much faster than random access (and you don't need your output shuffled).


You can't naively store 2^63 indices, or else every insert or duplicate check degenerates into a linear scan through the list.

Using a random access memory structure like a binary tree gives you logarithmic run time, but for a 64-bit index and 64-bit pointers will cost you 3x the space, so now your index is larger than your source data.

And since you probably don't have random access memory, but instead have random block access, you'll need to use something like a b-tree to avoid getting destroyed by seek times, and this has a further space overhead - you're now more like 6-8 times larger than the source data.

You may as well merge sort the 64-bit source data (indices) at O(n log n) time and constant space cost.


It might be easy to generate and output the 2^64 data points, because you can throw them away. With 2^63 indices you need to keep them around.



