My experience with compiler development is that the incentives all align toward making the "default RNG that people go to" as fast as possible to the exclusion of all else, including the quality of the generated numbers. That's because people frequently write benchmarks gated on the speed of random number generation, and those benchmarks usually don't care about the quality of the random numbers. Sometimes those benchmarks get popular, which encourages a race to the bottom in RNG quality.
(Similar incentives exist for hash functions, by the way.)
Popular RNG-bound benchmarks:
* "fasta" from the Benchmarks Game [1] (thankfully, this one mandates use of its own RNG, though it's vulnerable to fast-forwarding as demonstrated in [2])
The author makes a reasonable point that there are algorithms that are both faster AND higher quality than the one v8 uses. I think not understanding random number generators and the resulting voodoo programming is the real culprit.
Fast hash functions can be really important for performance of hash sets and maps. In my experience high speed is more important than a low collision rate.
That was for primitive types, it may well be different for types where comparison is more costly. And of course if you need protection from algorithmic complexity attacks, then you will need to take that into account.
As long as your usecase isn't worried about hash-flooding [0]. Unfortunately too few people know this is even an attack, so it may be reasonable for languages to default to a stronger hash algorithm for safety, but this isn't the common case...
language design is hard.
There isn't much complexity and performance cost of optimistically using a faster hash, and dynamically falling back to siphash (or another secure keyed hash function).
Using associative arrays implemented via chaining, one could have a bit in the associative array header that indicates if the associative array uses the fast keyed hash or siphash (or another secure hash). When inserting an item, if one finds that the chain length exceeds a certain limit without the load exceeding the resize threshold (indicating poor hash distribution), one could rehash all of the keys using siphash.
An associative array using open addressing could do something similar, switching hash functions if the probe sequence got too long (analogous to a chain being too long).
Alternatively, if using open addressing, one could use something like cuckoo hashing. A very fast keyed hash could be computed, and upon a collision, siphash could be used, with a probe sequence similar to used by Python dicts. Though, if your use case has lots of missed lookups, then performance will be better just using siphash. Of course, one could use counters to detect this condition, set a bit in the associative array header to indicate all future lookups should just use siphash, and rehash all of the existing keys.
Rust's solution is to provide the hash function as a generic parameter so you can just Fnv/Xx/Sip based on your workload, with Sip as the default if you don't pick. Unfortunately this is mostly incompatible with the adaptive scheme you suggest, because you definitely don't want the adaptive logic if you've already picked Xx/Fnv. Making it possible to disable all that if you pick something other than Sip would make things quite complicated.
But yeah, data structuring is ultimately really work-load specific. Any solution a language provides will always be suboptimal for tons of use cases. The solution you describe sounds good for a "never think about it" solution that works for 85% of cases. Rust's solution kinda necessitates more thinking more often, but I think it makes it more applicable for more usecases if you're willing to do that little bit of thinking.
I've literally never heard of Vhash, and I see no references to it outside of the 2007 paper. As far as I know, the state of the art for "fast and secure" is still SipHash 2-4 or 2-3.
While SipHash perf is generally quite good, it's still not as fast as Fnv on small inputs or XxHash on large inputs. As in, 4x slower [0]. FarmHash has great perf if you can invoke it in the way it wants, but doesn't do well in a streaming context.
> I've literally never heard of Vhash, and I see no references to it outside of the 2007 paper.
It is intimately tied up with VMAC. The title of the 2006 paper that introduced them is "Message authentication on 64-bit architectures". Google scholar says it has been cited 32 times, while "SipHash: a fast short-input PRF" has been cited 43 times.
> As far as I know, the state of the art for "fast and secure" is still SipHash 2-4 or 2-3.
SipHash seems to me like a good choice.
SipHash has some security properties that Vhash does not, although Vhash-based-VMAC has similar security properties to SipHash, I believe. However, for hash flooding attacks, if the hash values themselves are not directly exposed to the attackers and there is enough noise in the response latency that timing attacks are difficult, I do not know what benefits SipHash has.
"Faster 64-bit universal hashing using carry-less multiplications" has some benchmarks that show Vhash as faster than SipHash, substantially so on long input.
Speed-wise, an iterated string hash using Dietzfelbinger-style multiply-shift hashing is also substantially faster (3-10x) than SipHash on both large and small inputs in my testing, and it needs about 20 lines of code to implement. I haven't written any Rust in a while, but I'll try and send you a patch to the repo you linked.
The repo's a bit of a commented-out mess (I was trying to get a bunch of broken impls working before I had to get back to real work), so let me know if you have any trouble. Always happy to help people learn the language. :)
I have previously written about how to predict the next Math.random in Java (and hence Firefox also): http://franklinta.com/2014/08/31/predicting-the-next-math-ra...
The TL;DR; is that it is easy since you only have to brute force around 2^22 possibilities (which runs in < 1 second).
Someone then asked whether it is also possible in Chrome and Node: https://github.com/fta2012/ReplicatedRandom/issues/2. After examining the MWC1616 code I saw the same “two concatenated sub-generators” problem explained in this post. The implication of it is that you can brute force the top and bottom 16 bits independently. So it is also easy since that’s just doing 2^16 possibilities twice! Code: https://gist.github.com/fta2012/57f2c48702ac1e6fe99b
This is actually a solved problem already on most underlying operating systems, it just hasn't filtered down into the Javascript-world. On Windows you want a GUID and on UNIX-like OSs you want a UUID.
GUIDs and UUIDs are guaranteed to be unique not simply because of their length, but also how they're constructed (MAC address, timestamp, and a random element).
So even if the random number generator does loop back around, it still won't ever generate the same GUID/UUID even on the same hardware, and on other hardware it is more unlikely yet still (due to MAC address).
So the question is: Why isn't GUID/UUID generation not available to Javascript? It works extremely well and is used for exactly this type of scenario.
Great point / good question. Here's why we didn't just use UUIDs:
* UUID generation still requires a good (CS)PRNG and should be vetted the same way you'd vet a (CS)PRNG.
* UUIDs are one-size-fits-all 36 character base-16 encoded strings. If you want a short random identifier that you can use in a URL (that people might need to type on a phone or something) they're not ideal. Re-encoding them and/or truncating them is as hard and error prone as just generating the randoms yourself.
* Researching a good UUID library and maintaining a dependency is harder than vetting the 10 lines of code required to do it yourself. Without a trusted standard library solution it's unclear which library to use (maybe not as true today as it was a couple years ago when this happened)
If UUIDs were available in the Javascript standard library we probably would have used them in some of the places that we currently use our own identifiers. We do use UUID4s in our Java services, for instance.
I'm guessing that UUIDs aren't standardized because the standards process is browser-centric and it's low priority in that context... but that's just a guess.
Two questions (not trying to be argumentative or snarky):
> we like using randomly generated identifiers.
Why? Why do your identifiers need to be cryptographically secure at all? Why do they even need to be random? Unless you're using them AS KEYS (your use of Math.random() suggests you aren't), "cryptographically secure" doesn't even have a valid context.
It seems like UNIQUE would suffice, which brings me to:
> Researching a good UUID library and maintaining a dependency is harder than vetting the 10 lines of code required to do it yourself
Well.....except that if you had done the former, you wouldn't have had to analyze collisions because, you know, they wouldn't have happened.
And if you're not using identifiers AS KEYS (which would be horrible practice anyway, since it's apparently output to the logs), why does the library need to be kept "up to date"? Seems as long as its generating unique id's, you're good to go.
Also, I don't understand, if you're using Node.JS and server side javascript, why couldn't you just build a dead simple add-on/module in C++ that makes a direct call to libuuid? (and by using a dylib, that would stay up to date as long as you patch your servers).
I mean the math was definitely an interesting read, but it seems like it was all for naught.
Again, I didn't want to come off as rude or anything, I just happen to agree with the top level comment.
Not rude at all. They're actually all good questions.
> Why? Why do your identifiers need to be cryptographically secure at all?
They don't have to be cryptographically secure. They just need to be unique. Using random identifiers are easier to generate than deterministic/semi-sequential identifiers which would require some form of coordination, which is a pain in the ass and a point of failure. We switched to a cryptographically secure PRNG because it is higher quality than the non-CS alternative.
> Well.....except that if you had done the former, you wouldn't have had to analyze collisions because, you know, they wouldn't have happened.
You're assuming a UUID library that didn't just use Math.random() under the hood. Many libraries at the time of this incident did. Either way it would have been work to find one we could trust, vs. 10 lines. Also, UUIDs are 36 characters and not suitable for some use cases like short identifiers that will appear in URLS (which might need to be typed oh phones, etc).
> why does the library need to be kept "up to date"?
We would have had to update it when someone realized it was using Math.random() under the hood, for instance. In general there's a certain amount of maintenance involved with any dependency.
> why couldn't you just build a dead simple add-on/module in C++
Because 10 lines of Javascript vs. that.
> it was all for naught.
The random identifier generation was relatively irrelevant / just a fun narrative around the real story, which was about the subtleties of PRNGs, doing your homework, and Math.random() being crappy. If you don't think that use case is legitimate, that's fine. There are lots of other use cases where Math.random() is problematic, like shuffling an array (for an unbiased shuffle of a length n array you need a cycle length of n!).
Well, to ensure uniqueness you always need coordination. If using UUIDv1, you need to ensure your MAC addresses are not cloned. UUIDv4, you need to ensure the generated IDs are not duplicated by ensuring you have not generated each ID before. Sounds like you've missed a critical step here.
UUIDv1 is pretty much your best bet here because it's the least pain in the ass. 36 chars are hardly any less suitable for the use cases you mention than 22 chars. UUIDv1 can survive on a poor PRNG, so your reason for updating library isn't really valid. Sure there's a small amount of maintenance, but I'd expect a developer with solid experience and knowledge to almost immediately understand that (even) a 10 line homebaked solution is likely to cause more issues than a standardised algorithm developed specifically for this purpose.
"10 lines of JavaScript" isn't really what you wrote now, is it?
As for your other use cases, CSPRNG! If you need something to be unbiased, a PRNG just doesn't work. In fact, PRNGs pretty much never work for anything that needs anything. They exist for convenience only.
You don't need to check for duplicates with UUI4s if you have a good source of entropy. I'm not sure where you're getting your information, but here's what Wikipedia has to say, for instance[1]:
"To put these numbers into perspective, the annual risk of a given person being hit by a meteorite is estimated to be one chance in 17 billion,[4] which means the probability is about 0.00000000006 (6 × 10−11), equivalent to the odds of creating a few tens of trillions of UUIDs in a year and having one duplicate. In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%."
There seems to be some sort of collective delusion that UUIDs magically appear somehow. Most UUID libraries are very small packages that contain code nearly identical to the code we're using. Many of the libraries that existed at the time this problem occurred actually used Math.random() under the hood. Moreover, we need our code to produce variable length identifiers for different use cases. Since we already need it, it's sensible to re-use it for a case where UUIDs might also be used.
In some places (when we're producing much shorter IDs, for instance, or in extremely critical sections of code) we do check the database for duplicates. In our Java-based accounting system we even use UUID1s for transaction IDs. I understand the trade offs. The code in question was being used to generate correlation identifiers for log tracking. The performance hit of checking for dupes would have been a waste.
PRNGs work find for many use cases. Please cite some source or at least give me some reasonable argument to believe otherwise.
>Re-encoding them and/or truncating them is as hard and error prone as just generating the randoms yourself.
Feed the whole thing into sha256 and take the last X bytes where X is the amount you need for your shortener. If you can show that the last X bytes of sha256 have a chance for collision due to input, there is a lot of money in it for you. One of the nice things about a cryptographic hash is that all of the output has to be completely unpredictable based on the input.
Exactly, it's as hard and error prone as generating the randoms yourself. The solution you described requires an analysis of the hash function to determine collision probability, diligence to find a sha-256 implementation (circa a few years ago), review of said implementation, another analysis to figure out your truncation, then a proper implementation resulting in ~the same amount of code, except this time slower and with less entropy.
You're generating a random number (the UUID) then you're hashing it. It can't possibly have more entropy, and there's a chance two inputs to the hash will collide thus reducing entropy.
Hashing a counter is actually a variety of CSPRNG. Basically you're re-seeding a hash-based PRNG with whatever PRNG the UUID code uses at each run. It's well known that an PRNG cannot have more entropy than its seed. Hence, it will at best have the same entropy, and probably have slightly less.
* UUIDs are one-size-fits-all 36 character base-16 encoded strings. If you want a short random identifier that you can use in a URL (that people might need to type on a phone or something) they're not ideal. Re-encoding them and/or truncating them is as hard and error prone as just generating the randoms yourself.
Google's shortener is also not very good in a url ( as implementation of unique short identifier).
Any solution that has it possible to have an I ( capital i) or l ( lowercased l) in the same sentence is flawed due to various fonts in any app where you can use an url ( SMS, facebook, Whatsapp, browsers, nokia 3310 ...)
I don't know how much has changed in the last 2 years, but there's a `uuid` npm package that uses the `crypt` built in module on Node, but `Math.random` in the browser.
Using UUIDs does not magically solve your problem of using a bad PRNG. RFC 4122 imposes no guarantees on how random the V4 UUIDs really are. You'll still need to verify yourself if the method you use to generate them is good enough for your application. Especially if you need your UUIDs to be unpredictable. It even says so explicitly:
Do not assume that UUIDs are hard to guess; they should not be used
as security capabilities (identifiers whose mere possession grants
access), for example. A predictable random number source will
exacerbate the situation.
So many people use UUIDs for session cookies, where the unpredictability is vital.
I do. V1 UUIDs, due to the generation via timestamps makes them sortable. This is a obvious boon for DB indexing.
Specifically with Postgres I use the v1mc version as it provides the added benefit of using "a random multicast MAC address instead of the real MAC address of the computer"[0].
> So even if the random number generator does loop back around, it still won't ever generate the same GUID/UUID even on the same hardware, and on other hardware it is more unlikely yet still (due to MAC address).
True, but in crypto that's not the only thing you care about. Usually you want random numbers that are both hard to replicate and, and this was missing, hard to predict. MAC & and timestamp do nothing to reduce the predictability of the RNG.
Also, going back to the speed debate: There are usecases where you don't care about these aspects at all and the cited perlin noise, often used in graphics, is one of them. If anything I probably want the noise to be predictable so my preview in Cinema 4D or whatever looks exactly like the rendered result.
Stuff gets dangerous when implementors ignore the tradeoffs and want to make things go fast without knowing the implications.
I wouldn't touch PCG yet. It's based on a single paper submitted to a journal which hasn't even been reviewed yet. Has there been any independent testing of the algorithm? Not that I know of. And the entire PCG website appears to have been built by the paper author as a promotional tool, which is a bit spooky given that the paper hasn't even been published.
MT is far from a "bad generator". It doesn't pass a few stringent TestU01 tests, which is the case for a number of very well regarded generators.
What failure mode are you worried about? Yes, the author has a definite incentive to self-promote. At the same time she's produced a testable artifact. I think her claims about predictability are bogus, but the rest of it we can trivially verify on the merits.
Simple PCG PRNGs are fast and pass TestU01 BigCrush. Someone could conceivably cook up a new battery of tests that it fails on, but we'd need to reevaluate the whole zoology of RNGs at this point.
Edit: Sorry, I was playing loose classifying MT's BigCrush failures as rudimentary. That said, I think failing any part of BigCrush should disqualify a PRNG from use.
No, it isn't a reasonable comparison: PCG is a family while MT is particular Generalized Feedback Shift Register. Despite my suspicions I can't tell you off the top of my head if the entire GFSR class is dodgy, but we can point out flaws in a single generator. I buy O'Neill's arguments (other than those about security, which are a load of hooey) and I think added scrutiny will bear her out. Usually what I see is people flocking to MT because of its comically large period
When it comes down to it, PRNG quality gets assessed empirically. Why would you use ever a generator with 2kb of state when faster & smaller generators like PCG or Xorshift* pass the same statistical battery?
You can also write proofs about their behavior to ferret out the lousy ones. For example, we can demonstrate that PCG is full period. Consider the inner LCG:
x1 = (ax0 + c) % m
The modulus term is equivalent to the size of `x`, so in practice you can skip it and rely on overflow. By requiring that `c` is odd and `m` is a power of two, we know that `c` and `m` are coprime. If you select `a` such that (a - 1) % 4 = 0, the inner LCG is full period.
If you look at the permutation function, it's also provably unbiased by counting. Since all possible bit sequences are fairly represented in the PCG state we know that all combinations of permuting bits appear with all possible output bits.
I was just researching PCG, and I didn't find anything relevant apart from the the author's paper and website. I was trying to find out if it has been peer reviewed since its release last year. I'm really curious, how did you check if this paper was peer reviewed?
MT only fails LinearComp systematically. It is not a bad generator, it is a very good generator. The biggest problem is the "escape from zero-land" stuff where it's biased towards zero for a while if its seed is mostly zeros, but other generators have the same problem (though not to the same extent) and it's still statistically very unlikely to be a problem and it self corrects as the sequence runs.
Comparing to ChaCha20, which is a CSPRNG, is a little apples/oranges.
Just familiarity. Mostly I just can't recommend something I don't know much about. The SIMD variant of MT is architecture specific, I thought? Other newer variants of MT/WELL generators and PCG are probably all good options. Also xorshift*.
But I know that MT is a very good algorithm and using it has a certain pragmatic appeal - you make a lot of people's lives easier because when they Google to learn more about your generator they'll find good information quickly.
I thought you were interested to find the tests. The tests that fail are there. MT doesn't fail all, far from it. But there are RNGs that don't fail these that MT fails that are faster. PCG was mentioned here.
The only information on PCG I can find is the author's own paper which also fails some TestU01 tests. I am much more inclined to trust an established generator than an unproved one.
PCG is a family of PRNGs. Any sensible member of the family will trivially pass TestU01 BigCrush. If you use a 32-bit PCG it's going to necessarily fail BigCrush.
PRNG design is an empirical science. We can implement a generator and throw it into a statistical battery. If one passes, we've got a constructive proof of goodness. Any of us can run BigCrush and verify the results. Unlike CSPRNGs, we don't need to sit around and wait for the results of cryptanalysis.
MT is "on average" maybe "as fast" but the key is "on average:" depending on the usage pattern it can cost much more.
This specific RNG follows a traditional approach (that is, like in C or Java standard libraries) to have one algorithm in the standard library that is "as fast as possible, as bad as possible." Which meant even "just multiply with constant, add another constant, return. Bad? Yes we know." So it's still on the level "it's two multiplies and two-three adds."
Traditionally, those who expect some specific properties, like the author of the article, should have actually used the implementation they added to their code (for example MT if it fits) if they ultimately cared about the speed and wanted exactly the properties they know they needed, or if they just didn't care about the speed used some much "slower" but "much better" library.
Maybe the tradition can be broken in this specific case, I really don't know which benchmarks define what's acceptable here. Note also that the implementation must be verified, just the fact that there is somewhere some implementation available doesn't even mean that it does what is expected to do. The devil is in the details here.
Did anybody make any tests of the resulting streams and compared the period and the properties? The code which inspired the article is obviously able to discover certain weaknesses.
Default PRNGs primary use case is generating session ids for websites with upto 1000s of simultaneous users, not for assigning UUIDs.
One should use a combo of a sequential component and a random component. When you have multiple servers, make them combine a sequential component, a unique system identifier and a random number. In this case, you can now use the faster PRNGs which will also give you speed while ensuring uniqueness.
Another takeaway from this problem is that the usual method of Multiply-and-floor has a serious flaw of using only the highest bits which may not be good enough for the default fast PRNGs. A better method is to carry forward the remaining bits from the floor using a modulo function. The default fast PRNGs rely on all of their bits and simply throwing away a big chunk is not going to help. Multiply and floor might be more useful in a scenario where you are using better PRNGs. Even then, they are slower and it is not wise to simply throw the hard earned randomness.
Using node IDs and timestamps is just an additional safety factor. Statistically it's not necessary if you have a good generator. Even without those things our target collision probability is less than the expected uncorrectable error bit rate of a HDD.
Good PRNGs have equivalent entropy at each bit. With a good PRNG (even a non-CS PRNG) you shouldn't need to mix entropy to do scaling. You should still do rejection sampling[1] if you care about bias. It looks like a good scaling method might be added to the ECMA spec as part of the standard library thanks to some awesome people at Google.[2]
"Using node IDs and timestamps is just an additional safety factor. Statistically it's not necessary if you have a good generator."
Sorry, that's just not sensible. You always want to add node IDs and timestamps (provided you hash the final output so as not to leak details about your system) in case your generator fails. Why would you not want another layer of safety? It also helps protect against the case where an attacker might gain something by being able to predict the next ID in the sequence.
If the attacker might gain something by being able to predict the next ID in the sequence then you should be using a CSPRNG. That's not a problem here. There's nothing for them to gain.
It absolutely is sensible. Adding node IDs and timestamps leaks information. If you add a hash function now you have two problems -- the hash of a random value is actually a _new_ random value with entirely different characteristics. You're falling into another trap. Which is why you might not want another layer of safety -- you're introducing another layer of complexity and another place to fuck up. You had good intentions, but in the scenario you described you've just introduced an additional point of failure with limited upside. Why wouldn't you do the math and implement the simpler solution using a generator that won't fail?
As I've said elsewhere, the likelihood of collision with our identifiers is lower than the uncorrectable error bit rate of a HDD. In other words, it's more likely for a perfect deterministic method to generate a collision because there was a hardware failure persisting it to disk. Or, more pragmatically, the risk is far below the level that any sensible person should ever be worried about.
It seems your original function made the same assumption of V8's Math.random. All PRNGs fail at some point, even CSPRNGs. You may as well write your code accordingly, with less optimistic assumptions.
If you're not going to be adding layers of safety, and if you're going to keep insisting that your PRNG "won't fail" then I guess it's only a matter of time before you will have to repeat the same mistake.
As other commenters have pointed out, you should have written your function in such a way that it does not place a critical reliance on any single component.
I generally trust peer reviewed formal mathematical proofs that show something won't "fail" in a particular, relevant, way. If you don't then you probably shouldn't be on a computer. The code I'm relying on makes the same sorts of assumptions that keep your data secure. It is inconsistent to trust it in one place but not in another.
I don't see the need for belt-and-suspenders here and there are legit reasons not to add host/time to an identifier. That's why we have UUID1 and UUID4.
JavaScript's Math.random (and most languages' default RNGs) is not intended to be cryptographically secure, so I'm not sure "broken" is correct. It's probably random enough for most uses that don't require a CSPRNG. Are there any scenarios where this isn't the case?
On the other hand, enough people make this mistake that APIs should just use a CSPRNG for "random()" and offer a "fastRandom()" for those who need speed but not "secure" random.
Statistical simulations require good randomness (or your simulation might give a wrong result) and high speed but not cryptographic security. See, for example, xorshift* (http://xorshift.di.unimi.it), which is efficient and high-quality but not cryptographically secure.
My personal definition of random includes the requirement that plotting the bits must look random, and have no visible signs of pattern. The OP did that and you can see patterns. If plotting a picture of a PRNG's bits has visible patterns, I definitely call it completely broken.
If you need good quality random IDs cheaply, you can do the following:
1. Generate a pool of 2048 bytes or so of entropy at startup using window.crypto.getRandomValues or crypto.pseudoRandomBytes. To this, append Date.now() and Math.random().toString() just in case your crypto method fails badly. Then call SHA256 on this to get your starting entropy distilled into 32 bytes. If some of the 2048 bytes of entropy are not high quality, it won't matter as much since you compress it into 32 bytes (i.e. it's worse if you just ask for 32 bytes from getRandomValues).
2. Initialize a counter to 0.
3. Each time you need an ID, increment the counter (handle wrap-around if necessary), and take a SHA256 hash of (counter, 32 bytes entropy hash obtained in 1. above, Date.now(), Math.random().toString()). Then truncate and encode this using whatever character set as needed.
This way you don't drain out your cryptographic entropy pool every time you generate an ID. You also don't leak any details as to your system time, startup time, or your current position in Math.random() (which would allow someone to predict the next Math.random() result) since the final ID is hashed.
Since what you're proposing is pretty close to just running SHA2 in counter mode, you could simplify this by generating 128 bits of random data, using it as an AES key, and just using your crypto library's AES-CTR function to generate a keystream.
If you need a predictable stream of uncorrelated bits, this has the benefits of being simple and trivially seedable.
In case the entropy pool is drained, one would still want to get more than 128 bits from urandom and then hash this with SHA2 to get the 128 bit key, right?
I had in mind something portable to the browser without requiring AES there, but will try it out on the server.
In a practical sense, there is no such thing as draining an entropy pool. The period for AES-CTR is 2^128.
The primary reason CSPRNGs rekey themselves periodically is for forward security, so that if your machine gets hacked, an attacker can't snarf the RNG state and predict all future numbers the machine generates.
Forward security is obviously a desirable design feature in a CSPRNG (as a building block that's evaluated and reviewed on its own merits), but I can't help but feel that it's often distracting people from a whole system view.
If an attacker has access to your computer on a level where he can inspect the CSPRNG's state, you've probably lost completely and no reseeding will help you.
You also want to stop backtracking. If I can grab the key and counter for your AES stream, I can roll back and generate all the numbers you generated last week too.
You can't use monotonically increasing numbers from multiple threads (or servers) without either synchronizing (which is terrible for scalability) or using separate prefixes for different threads or servers (in which case you have problems related to sizing and allocating the prefixes). Right?
The UTC time + random number idea is interesting, but I'm not sure it's any better than totally random. Say you want a 64-bit unique id. It would take about 42 bits to store a millisecond Unix timestamp, leaving 22 bits for randomness. If you consider the probability that a given id will collide with a previously-generated one: with this scheme, you only need to consider ids generated in the same millisecond, but you only have 22 bits of randomness that can be used to avoid a collision. If you use 64 random bits, you can theoretically collide with ids generated across all of time, but the odds of collision are many orders of magnitude lower. By constraining the first 42 bits to store a millisecond timestamp, every millisecond that goes by removes 2^22 values from the possible id space (regardless of whether they're used). (Besides that, this scheme removes from the space of possible ids all of the millisecond values from before the system was created, which seems like about 2/3 of them, assuming about a 68-year span of values for Unix times, which started 45 years ago.)
That's kind of handwavy, so let's do the math. With 22 bits of randomness, the odds of any randomly selected pair of ids generated in the same millisecond colliding is 1/4194304. Assuming independently generated ids, the expected number of collisions after 4194304 milliseconds is 1. So if you have just two threads generating ids once per millisecond, you'd expect a collision in just 70 minutes. That's not great.
You can't use a PRNG from multiple threads without synchronizing either. And interlocked increment is probably a thousand times cheaper than mutexing a PRNG...
But you still have to generate a seed for each thread. Then you could just generate a random starting point for each thread and run with incrementing counters instead of hitting the PRNG for each ID.
True. But I don't think it's hard to come up with a seed which is guaranteed to be unique. If you use counters you have a ton of management overhead. You now need to keep track of where the counter for each thread should start and what happens when a thread crashes. Also, what happens when you want to add a new thread? Now you have to change the increment for your counters.
Why do you have to care about all that? Just pick a random number and start incrementing++ from there. I don't see how this would yield any more collisions than a PRNG? Just because you move across the ID space sequentially instead of in an obfuscated pattern doesn't really mean there's a higher chance of picking the same ID twice?
(My argument here, really, is that using a PRNG at all sounds pretty crazy for ID generation, if you don't take steps to prevent or detect collisions)
Yep, length could be a problem, I was assuming a longer id.
They're using 22 char identifiers. A base64'd unix epoch time in ms is another 36 chars. But I'm sure you could do better than that.
The chances of a collision must be vanishingly smaller within even a second (let alone a millisecond) even if the random number generation is defective.
What kind of UUID? Standard UUIDs of versions 1, 2, and 3 are not really unique enough[1]. Version 4 is basically a random number, so it doesn't help to say "just use that".
well, I use uuidgen[1]. If I have to generate uuids for multiple servers, then I create a microservice to handle uuid generation, and get it to generate a few held in a buffer so the rest of the system doesn't have to wait for one.
Exactly, my biggest takeaway is "don't write your own unique ID algorithm." At best it does the same as existing methods, at worst you screw it up and end up here.
I understand why they want unique generated id's, i don't understand why they just didn't use UUID/Guid, which are way more performant ( string vs int) and less big in size ( 22 bytes vs 16 bytes).
Both are created for the same purpose and Guid's are even possible as keys in MS SQL Server (next to the standard autoincremented integer). So it's safe to say that it's pretty reliable.
Essentially the same size. They ended up with 132 bits because they use a 6 bit alphabet; 128 is not divisible by 6, so you would need 22 characters of a 6 bit alphabet to represent a UUID, exactly the same as them.
(note, the 6 bit alphabet is so identifiers can go in urls; the commonly used uuid hexadecimal representation is a 4-bit alphabet, so there are 32 characters and usually 4 hyphens; so 36 characters in the URL vs. only 22)
As a programmer, I quite enjoyed reading this post. But I can't believe the amount of time the author wasted learning what many crypto library authors already know: Math.random isn't any good. I mean this sincerely. You'd think they have so many more pressing problems to solve for their users.
Perhaps, but the mindset here is fundamentally academic: here is an interesting problem, I wonder what I can learn about its properties? And having looked into those properties, the author decides to write them up and disseminate them. (Very well, I might add. I filed the link in a bookmark folder I keep for "articles I'd like to assign my students to read, next time I teach a relevant class. This one is well-written and well-referenced.)
1. Math.random, despite his insistence, isn't broken. There are different degrees of randomness that a programmer might need for their program and Math.random simply didn't suit his.
2. Their initial implementation of a critical portion of their infrastructure was so naïve, I almost couldn't couldn't believe it. I just seemed like such an anti-pattern to try to build your own UUIDs, discover (in production!) that they're not so good and then spend days figuring out why a built-in random number generator that ships in every browser isn't so great.
1. In this case we're arguing semantics. Fine, it's not broken, but it's still bad and there are better alternatives with no drawbacks. Arguing to keep it is sort of like arguing to keep an O(n) algorithm that has an O(log n) alternative that's also more intuitive and easier to code. If you still don't believe me, here's what Brendan Eich thinks[1].
2. I've addressed the standard solution / UUID question elsewhere[2]. The way we were generating identifiers is not an anti-pattern. It's a pattern. It's the standard way to produce a random string from an alphabet. If you look at the source code of any website, from HN to Google, I guarantee you will find an almost identical piece of code somewhere. The code is simple, but that doesn't make it bad.
Your code used the naive algorithm and the system PRNG to generate a set of "random" and thus probably "unique" strings. The PRNGs supplied with many languages are widely known to be of poor quality. The results were bad, for reasons you later discovered. "Anti-pattern" sounds correct to me.
A safe approach is to run multiple uncorrelated sources of probable uniqueness/[pseudo]randomness through a modern cryptographic hash function, and generate your "unique" string from the hash output.
In other words the code itself is, in principal, fine. The PRNG it used was not fine. Had it used a CSPRNG there would not be a problem, and generating identifiers in this manner is perfectly safe and not an anti-pattern. Hashing the output of an already good (CS)PRNG is the real anti-pattern. It can only reduce entropy and make things worse.
I've acknowledged the incorrect assumptions / taken blame for not doing proper diligence on the PRNG. What I'm saying here is that the identifier generation technique is not, in principal, flawed. It is a common technique and it has many legitimate use cases. It's simple enough that pulling in a dependency to solve it for you is not an obviously better alternative.
True randomness and a crypto-strength PRNG are good options here, but they're not necessary. There are peer reviewed proofs that show something like MT19937 is suitable (perhaps even better suited) for this sort of task. See the deeply nested comment thread ITT for more of that debate.
Suppose, somehow, you seed your super-duper PRNG with the same value on multiple servers. If you are not incorporating other forms of entropy into your unique IDs, you are hosed.
The cryptographic hash acts sort of like a blender for all your random/unique-ish information sources. Counters/MACs/urandom/times/IPaddrs/etc go in, a nicely mixed value comes out.
Meh. You should just use all of that random unique-ish stuff to properly seed your PRNG. You're trying to improve the generator "randomly," which is generally dangerous and can be counterproductive. Better to rely on sound theory and good practice / keep it simple. Leave all that craziness to the kernel entropy pool and just use it as a seed.
Anyways, if you decide you want to learn more about secure hash functions and cryptography in general, you can't go wrong with Bruce Schneier's "Applied Cryptography".
Applied Cryptography is a good book. I first read it more than 15 years ago :). Schneier actually designed the CSPRNG that is used by OS X. It's called Yarrow and is SHA based. It does more or less what you suggest, but with many subtle improvements to mitigate attacks. It has features that you're unlikely to properly reproduce in user-space. Trying to improve its output by re-hashing in user-space is, even if done properly, unnecessary. If you screw something up and do it improperly you could easily reduce the quality of the generator.
I'm not sure how your link is relevant. If there's not enough entropy on the system for the kernel, there's not enough for you either.
I think the whole idea of generating random ids is flawed. Why take a chance when there is no need? Just take some machine id (seq. number, MAC), timestamp and a big enough per-machine counter and there can be no collision, guaranteed. No need to use random.voodoo().
It's easier to generate a random identifier and if the generator is good the probability of a collision is lower than the probability that you HDD will fuck up in an uncorrectable way while storing the identifier (as I've pointed out elsewhere in thread). So the answer is that random identifiers are easier to do in a distributed system and, in reality, have the same guarantees.
Sure. But you are not lowering one risk on account of the other, so what's that got to do with anything? You will still have the same chance of HDD error, but now you also add a (admittedly small) chance of id collision. As for simplicity... You need some unique device identifiers in any distributed system, so why not just use these?
I know MongoDB is not popular around here, but their ObjectIds are (imho) a much better approach than yours. Much more robust and less prone to errors when planets align just right. That said, I have seen worse solutions than yours and made my share of bad decisions too, so I don't judge you. This is just a technical opinion, given that this particular problem wouldn't even exist if you avoided randomness.
There are legitimate reasons to have / not have some form of node id, and/or a timestamp, in an identifier. One reason to have them is that you don't need as good a PRNG. However, with a good PRNG the collision probabilities are astronomically low either way.
The point of the comparison to HDD error rates is meant to indicate that the collision probability is well within what is typically considered an acceptable risk level. There's a higher probability of much worse things happening, and I accept those risks for similar pragmatic reasons. Why should I behave any differently in this case? I do encourage anyone implementing this sort of system to do the math themselves and take whatever precautions they deem necessary. I've done so and decided that the additional complexity and space trade off of the alternative approaches make them inferior for our use cases. That doesn't mean ours is universally the best choice. That's why we have UUID1 and UUID4, for instance.
Ok, we seem to have different approaches... :) I understand your position, I just think other (deterministic) solutions are safer (meaning: less likely to cause trouble). I know it is easy to be smart now that we know what happened, but I think the post nicely shows some of the dangers when using nondeterministic ids.
I am quite sure it would be easier to find cause for duplicates if you were not using random. Anyway, at least you caught a nice bug. :)
If you get identical MAC addresses on machines operating the same service, presumably on the same network, you have bigger problems than ID generation.
Can someone explain how the following calculation was done?
With 2¹³² possible values, if identifiers were randomly generated at the rate of one million per second for the next 300 years the chance of a collision would be roughly 1 in six billion.
I tries using the formula for the Birthday problem, but the values are too large.
I'm not sure about their numbers, but with a little trick it can be estimated as 1-e^(-2^(n-x-1)), where x is the number of bits and n is log_2 of the number of random draws. When n and x are relatively close, you should be able to calculate this without numbers getting too big or too small.
Huh. I always check for collisions if I'm using a random number as a unique key. I guess it's faster to just not check, but it's just a single seek to check for existence. It's pretty fast, and you almost never have to do it twice.
As a side benefit you can get by with shorter (more human readable, typeable, fits-in-a-tweet) keys because you correct the few collisions you get.
If the extra keyspace seek were ever a serious performance bottleneck I would just skip the collision check in that one spot and use a proper RNG as OP is suggesting. But how often does that really happen? Very few systems need to generate millions of unique keys a second.
Checking for collisions becomes hardly possible in distributed systems pretty quickly (take git as an extreme example), and provides virtually no benefit if you can trust your RNG.
At those astronomical possibilities, there are many things that can go wrong before collisions become viable. Think about it: Why bother checking, when you have a somewhat similarly egregious probability that a random bit flip occurs right after your check?
In the situations you encountered, it's probably simple and cheap, but in the cases where it's not (the situation in the article being a probable example), it can be expensive, error-prone complexity without gain.
If he was actually hitting a cycle in the numbers generated, check/retry won't help -- eventually your entire network will bottleneck on DB reads without making any writes.
I would definitely argue against that given the perf that you can get out of CSPRNGs. AES/CTR could give you over 100 million random bytes per second back in 2009 (source: https://www.cryptopp.com/benchmarks.html).
How much faster do you need your RNG to be in any non-niche situation?
The niche situations aren't going to use the built-in rand() anyway because they can write their own or use someone else's that's faster. Even if they end up using the same algorithm (e.g. a Mersenne Twister) they can still often go faster.
Interestingly according to https://bocoup.com/weblog/random-numbers Firefox is just going to use its CSPRNG once the crypto api is ready. (Bug 322529 is still in NEW state, though.)
On many platforms it's several orders of magnitude slower than a fast, dead simple PRNG with very good distributional properties, like xorshift1024* [1].
If your product is in a competitive environment and is being compared against others in benchmarks of random number generation, the answer is "as many as you can possibly generate".
This scenario corresponds to JavaScript engines of major Web browsers.
I can't blame V8 engineers for not wanting to regress SunSpider. Sad or not—and, let me tell you, nobody who works on browsers likes this state of affairs when it comes to that benchmark—doing well on the popular benchmarks has to be an engineering goal in the environment browsers find themselves in.
FWIW xorshift1024* would almost certainly be faster than MWC1616 and it has no systematic failures on BigCrush. The only downside is that it uses more memory, but I think we can afford it. If not xorshift64* is also faster than MWC1616 and miles ahead of it in quality.
For large simulations, this can easily become a bottleneck, which is why those generally use a good non-CS PRNG. In addition, CSPRNGs actually have slightly worse distributional properties than some much faster PRNGs, because the CSPRNG is more focused on being unpredictable. CSPRNGs aren't the end-all be-all of PRNGs, they have their appropriate and inappropriate applications even if we ignore speed.
"Large simulations" sound like the things that should be using a specialized, domain-specific RNG, though. And that's what a non-CS PRNG should be treated as: a specialized "high-performance" RNG, trading away the default of guaranteed unpredictability to achieve that performance.
CSRNGs make good general-purpose-use guarantees, because those who don't demand much of their RNG will enjoy "living in ignorance" of RNG trade-offs while getting the right set of guarantees so as to never accidentally generate a collision; while those who don't want the particular set of guarantees a CSRNG makes will be domain experts (i.e. people who have profiled their code and noticed that the RNG is the hotspot) who know what they need instead.
The same cannot be said if you make a non-CS PRNG the default. There, "living in ignorance" will result in disaster (as in the article); and just knowing that the provided PRNG isn't right for you doesn't imply you know what you do need.
> CSRNGs make good general-purpose-use guarantees, because those who don't demand much of their RNG will enjoy "living in ignorance" of RNG trade-off
I totally agree! Very few applications will notice if the random number generation takes 100 times more/less time. My point wasn't that CSPRNGs shouldn't be the default, it was that there definitely exist real situations where CSPRNGs are both not fast enough, and don't maximize the right kind of (domain-specific) quality.
The more vexing aspect is that many default random number generators are both slower and much, much worse than some really simple (we're talking 10 lines of C) PRNGs that out score them in every measure of quality. This is a choice that makes absolutely zero sense.
> CSPRNGs actually have slightly worse distributional properties than some much faster PRNGs, because the CSPRNG is more focused on being unpredictable.
What does this even mean? Non-uniform distribution for CSRNG is fatal. What is an RNG that has "better" distribution than CSRNG?
The actually relevant aspect is what percentage of a simulation's time is spent generating random numbers if it's using a CSPRNG. I have had this become a bottleneck in my work, and it's very rare to use CSPRNGs in scientific simulations for this reason.
> What does this even mean? Non-uniform distribution for CSRNG is fatal. What is an RNG that has "better" distribution than CSRNG?
There are distributional properties other than uniformity, and depending on how you use these numbers (e.g. Monte Carlo kernel form) these can be far more important than linear uniformity. The author briefly mentioned this in the article ("CSPRNGs emphasize unpredictability over all other measures of quality, some of which might be more important for your use case"). Look into the work of George Marsaglia, he wrote the test suites that generate the quality numbers most engineers use to evaluate both CSPRNG and normal PRNGs.
I find it disingenuous to see properties such as equidistribution or low discrepancy being described as 'quality' of a generator. Yes, they make certain Monte Carlo algorithms converge quicker. No, they are not closer to a truly random sequence than a CSPRNG is. The goal of a CSPRNG is to be indistinguishable from true randomness, and a predictor is quite obviously a distinguisher. But so is a statistical test, or any 'nice' property that makes some algorithm behave better.
By the way, I think I've said this before, but I'll say it again. I really dislike the usage of the term 'CSPRNG' in these discussions. A CSPRNG is typically understood to be /dev/urandom---an algorithm that not only generates a long sequence of numbers, but also harvests entropy and tries to achieve forward and backward security. But it is also sometimes overloaded to mean a stream cipher, which is a conceptually simpler construct, and also able to be much faster. Stream ciphers can definitely compete in performance with things like Xorshift.
If, by quality, you mean "indistinguishable from true randomness" then there are many valid metrics. One important metric that many non-CS PRNGs have, but many CSPRNGs do not have, is a provable deterministic cycle length for all seeds.
These characteristics don't just make Monte Carlo simulations converge quicker, they also do things like guarantee that your 52 deck shuffle will definitely produce the one shuffle that gives player two a royal flush on the flop in a round of texas hold 'em. Or, more generally, they guarantee that your Array.shuffle() method is unbiased for a reasonable sized array. In general you need true randomness and/or a long cycle linear-transformation PRNG to make those sorts of guarantees. It's also nice that they're fast. That said, I do also agree that the scenarios in which these things matter are rather rare, and most people should probably just use a CSPRNG by default.
Do you really think a CSPRNG is fast enough for your standard library's array shuffle though? Wouldn't you rather have something like xorshift1024* that's probably > an order of magnitude faster? Maybe a CSPRNG is fast enough... I don't have all the numbers. So this is an honest question.
CSPRNGs are a type of pseudo random number generator, which are defined to be deterministic. I have no problem with calling /dev/urandom a CSPRNG for practical purposes, but very strictly speaking, it's not.
Do you mean there cannot be a CSPRNG? Most of cryptography is based on security in practice, not in absolute theory (i.e. everything except for one time pads and maybe some quantum crypto). That's like saying there are no cryptographically secure encryption algorithms other than one time pads, since you can break all of them in theory. That's a pretty useless definition of cryptographically secure.
edit: By in theory I mean that with enough computation resources you could break them even if you didn't find some new, clever weakness. Not that you could break them in theory because a weakness could always be found.
No, I mean that the fact that /dev/urandom "reseeds" in mid stream means it is not strictly speaking pseudo-random, since it is not completely deterministic. Maybe I'm wrong, but the comment I was replying to was arguing that /dev/urandom was the only CSPRNG, and things like stream cipher algorithms are not.
> But so is a statistical test, or any 'nice' property that makes some algorithm behave better.
Not really. In general, the statistical tests are chosen such that when the law of large numbers comes into force, a perfectly random sequence would pass them with flying colors. If repeated runs of a perfect random number generator were failing these with a frequency that were improbable (i.e. far outside what the central limit theorem would predict), I think it would be safe to say the universe was playing a trick on you. Many of these are very closely related to predictability in practice as well. Uniformity is a statistical test, and I think it's probably the lowest bar you can set for "unpredictable".
The fundamental issue with measuring "randomness", especially for a PRNG, is that there is no set definition of information entropy suitable for calculation. Shannon entropy requires you to have a (possibly numerically sampled) PDF. The problem with that is that is you want to find the entropy of a sequence, you can't use the PDF of all the bytes you've gotten out the generator, since that PDF will have 1 for the exact sequence you obtained, and zero everywhere else. If you do it by multiple runs, there is no way you'd get enough data to construct an accurate PDF with a suitably large sample size. You have to use something based on highly subdivided sequence of values, but if you only look at each value in isolation you will only determine the uniformity with some small bin size, which ignores patterns due to proximity of values in the output.
So generally entropy is calculated using a more complex probability kernel, like the ones used in these statistical tests. But none of these can possibly be perfect. The "perfect" measure of entropy is Kolmogorov complexity[1], which captures the minimum length of a program you would need to be able to print the given output. On top of being undecidable in general, and also unobtainable in practice, you can show an upper bound of the Kolmogorov complexity on a CSPRNG that is actually quite low. Just take the algorithm the CSPRNG uses, plus any external entropy it got (from hardware interrupts, etc.). So in the end, various kinds of statistical tests are the only ones that make sense.
The actual goal of a CSPRNG is far more specific than just simulating a truly random sequence, which is why it differs from those used in simulations. It is unpredictability, defined by trying to make it so this short program which can generate all the output is impossibly hard to obtain from just the output in practice. In a vague sense, it is like (some of) the desired properties of a hash or cipher; we can't use a perfect random source (resp. one time pad), so let's get something that will take an impossible amount of computational power/measured output to break (resp. ciphertext using the same key), and mix it in with some as true random as we can get seed (resp. IV), and then mix some more in from time to time to make it even harder (resp. changing keys). That is the reason stream ciphers, while generally not as good as /dev/urandom-style CSPRNGs, are actually closely related enough that I think it's fair to call them CSPRNGs as well. They're just ones with a far worse algorithm/entropy source.
A statistical test _is_ a distinguisher! This seems self-evident to me. When the test repeatedly fails (with an extreme p-value or whatever criteria you want to apply here for 'failure'), you have successfully distinguished the sequence from random. Also you seem to misunderstand what I meant by equidistribution---uniformity is a different property, and of course crucial. I am puzzled, then, as to why you claim some generators are more 'uniform' than cryptographic ones.
Any pseudorandom bit generator whose next bit cannot be predicted in under 2^n "operations" also passes _all_ statistical tests that require less than this computational effort. This is a well-known result by Yao. In other words, unpredictability and indistinguishability are equivalent. In the case of, e.g., AES-CTR, n = 64 (birthday-bound distinguishers force it to be 64 instead of 128).
Your philosophizing about what true randomness really is does not strike me as very useful for practical purposes. Given unbounded computational power, obviously no such algorithm is actually indistinguishable. But who cares?
Unpredictability is irrelevant to a use case like shuffling an array. Moreover, your conjectured polynomial-time perfect CSPRNG may actually produce a less useful stream of entropy for this use case than a good non-CS PRNG because its cycle length may well be shorter, and all that matters is uniformity (which can actually be guaranteed, not just conjectured and measured empirically) and enough seed entropy to kick things off. Top it all off with the non-CS PRNG being perhaps orders of magnitude faster.
I'll repeat once again that I do think people should generally default to CSPRNGs, but I really don't understand why everyone involved in crypto insists that there are no valid use cases for non-CS PRNGs / that CSPRNGs are uniformly better for all use cases when it seems very obvious to me that they are not. I don't see any reason why a good PRNG and a good CSPRNG shouldn't exist in every standard library. Am I missing something?
A short cycle would be a catastrophic outcome for any serious cipher. Although not every generator has a provably guaranteed period, relying instead on the properties of the 'average' random function, it easy to guarantee period. AES in counter mode, where key=seed and block=counter, has a guaranteed period of 2^128.
As to why we insist that cryptographic primitives are uniformly better? Well, that is more subjective. I personally believe that they have gotten fast enough that the set of cases where they are unacceptably slow is vanishingly small, and the trend is clearly on the side of crypto getting faster. And I can sleep at night knowing that my results are not wrong due to some unknown, unexpected, property of my generator.
That's all fair-ish, and I think I need to learn more about CS alternatives. Necessary cycle length is relative. For crypto it seems like the definition is "big enough that it generates keys that are infeasible to predict," or "big enough that it can be used to stretch entropy a long ways for a cipher" for all practical purposes. But even the AES generator you cited sounds like it would be incapable of generating an unbiased shuffle of a deck of cards, which has 52! or ~2^225 unique shuffles... For a use case like this it's hard for me to wrap my head around using anything other than a good non-CS PRNG or true randomness. If you believe otherwise I'm truly interested in your thoughts, links, etc.
Another case where a long cycle length is useful would be the sort of scenario in my post, at much larger scale. If you have thousands of machines re-seeding at every service restart and then generating sequences of millions or billions of random numbers from a sequence, and you want a vanishingly small probability that those sequences will never overlap, could you use a CSPRNG?
Re: the generator properties, do you _really_ believe that? I mean, good non-CS PRNGs usually have fairly rigorous presentations and pass statistical tests. They aren't as sexy as crypto stuff, and maybe don't get the eyeballs, but they're relatively straightforward proven math whereas all of the CSPRNG results are based on conjecture. It seems at least as likely that some unexpected property of a CSPRNG will be discovered, a la RC4.
The shuffle question ties in with the overlap question. Let's stick with the AES in counter mode example. AES is a block cipher, or in other words a keyed permutation. What this means is that each key (resp. seed) determines a distinct permutation that we use to generate the blocks AES_K(0), AES_K(1), etc. In effect, each seed will result in an entirely distinct and independent 2^128-block sequence, instead of simply the same sequence starting at a different place, which is usually the case for non-cryptographic generators. So 2^128 possible keys (or 2^256, if you go AES-256) times 2^128 output blocks per key is probably enough to cover all possible card shuffles.
I do believe that. RC4 is a good example, actually; the unexpected properties that make RC4 be considered completely broken for cryptography are considered irrelevant for simulations (I won't swear by this, but I'll bet that RC4 passes all of TestU01's tests). The standards are just so much higher for cryptographic primitives. I'm sure unsexy generators can be suitable to this or that application. But due to the ad hoc nature of the statistical tests used, you can't be sure that your generator is not oddly correlated with a concrete application. I've written a comment a few days ago---on a thread very similar to this one---about a real-world case of this happening [1].
Ok, I get the seed/state space vs. cycle length bit. So it can theoretically generate all of the permutations of a shuffle if you re-key/re-seed periodically. But you can't guarantee that it actually will, right? It is possible that, for all 2^128 keys, and in all 2^128 cycles there are some shuffles that AES will never generate (in other words, some sequences of 52 numbers that will never be generated) so you effectively have a 0% chance of hitting them.
From a cryptanalysis standpoint I think that's consistent -- the probability of generating any particular single shuffle is astronomically low, so the difference between the actual probability and 0% is probably outside the realm of what's computationally feasible. But with something like MT19937 I can show that its at least possible for every outcome to have a non-zero probability. Perhaps this is just interesting academically, but maybe not if you're actually playing a card game for instance?
No, that cannot be guaranteed. Equidistribution in 52 dimensions requires necessarily a minimum state size of 52 log_2(52) bits, which we don't even have. Even if we did, it could not be guaranteed anyway.
You have already covered the distinction between guaranteed equidistribution and not, so I don't think there is much to add there. I do think the concern is mostly academic, since the difference between a subsequence that doesn't exist and one that will never be reached before the universe collapses is irrelevant to the user. I will note that Richard Brent, someone who is much better qualified than I to discuss this, has argued against the usefulness of equidistribution: http://maths-people.anu.edu.au/~brent/pub/pub240.html
I think you have to hit every permutation, at least for AES, because AES is reversible.
In particular, if there were some 0 <= y < 2^128 such that AES_k(x) != y for all 0 <= x < 2^128, then there must also be some z such that two different values x1 and x2 encrypt to the same value, which can't happen if AES is reversible.
Sure but I don't immediately see how that equates to a proof of its ability to produce every combination of, say, 52 consecutive numbers... Just that it will produce every possible individual number. In PRNG-parlance it shows 1-dimensional equidistribution but not 52-dimensional (or generally (k, w)-dimensional) equidistribution. Or is there another pigeonhole step I'm missing?
If you want a random entry from an array, Math.random is fine. Anyone who doesn't know that prng's are broken for things like uuids shouldn't be in charge.
If you want to do something as trivial as an unbiased Fisher-Yates shuffle of an array, however, Math.random() is broken. And Math.random() doesn't have to be broken for things like UUIDs. Python and Ruby both have PRNGs that are suitable for such things.
We fucked up by not vetting the algorithm, that's definitely the primary lesson here. Mea culpa. I'm sure you would have done things differently, but V8 is a modern system and I don't think assuming a modern PRNG was completely unreasonable while quickly getting to MVP at a new startup.
You seem to be getting a lot of flack from people with 20-20 hindsight. For what it's worth, I have made the exact same assumptions you have when generating internal unique IDs in a distributed system purely for tracking purposes. The only difference was that mine was in Python and I got lucky that its random implementation uses a mersenne twister.
We fucked up by not vetting the algorithm, that's definitely the primary lesson here. Mea culpa.
Bingo. Exactly.
I heavily criticized you in another post for 1) your initial choice of substandard algorithm and 2) your rationalizations (e.g. Google's "good reputation").
But that is ancient history. Here you're admitting that you fucked up, which IMO you didn't admit previously.
Also, kudos for your article which kicked off this entire discussion. In that article you showed that you carefully analyzed what was happening, how you went wrong, and how you could improve.
Even more important, you did this all publicly, both your article and your responses here on HN. Everyone learns from this. I commend you for your openness.
Well Google fucked up too, in a way that affects many more people. They used a broken form of an obsolete random number generator with a 2^30 cycle. There are faster and simpler generators that perform overwhelmingly better on statistical tests.
The bug report that prompted V8 to fix their Math.random was someone doing A/B testing on their site by checking Math.random()>0.5, and the V8 PRNG was so bad that it would give a consistent 2-3% error, which would make any results useless.
Seems like V8's Math.random was only fine for extremely trivial cases (Google doodle Pacman I guess). Meanwhile Webkit has been using a CSPRNG and nobody seems to be complaining.
Yeah - the thing I was thinking throughout the article was .... this is all interesting, but isn't it an obvious consequence of using a runtime designed for more quickly loading the Gmail UI on the server side? I went and checked the docs for the Java Math.random() function and it clearly specifies the algorithm and doesn't have the same issue.
V8 is a great piece of work but it was never intended for anything beyond speeding up web pages. Why are people using node.js on the server side at all, when there are more industrial strength runtimes with better performance available?
> Anyone who doesn't know that prng's are broken for things like uuids shouldn't be in charge.
Depends on how you're using them. If you only care about collisions, and you're not worried about an attacker trying to manipulate your system, then using a PRNG whose output is well-distributed (read: not necessarily a language's random function) is fine. The problem is most language's default PRNGs are complete crap :(
True. But every language has gone through this issue. Node is particularly vulnerable to weak prng seeding due to relying on forking to support multiple processes. Am I that old and grumpy to expect people to be aware of the history here?
I think the size of this thread as well as the existence of blog posts like this means it's a mistake that's always going to be repeated. There's way too many programming gotchas to expect people to keep track of them all. Language support to avoid classes of mistakes is a great idea to me e.g. such as having a good random algorithm by default and a UUID library.
The challenge of "randomness" is overlooked and leads to security issues more than most people realize. For example in php, the `rand()` function documentation (to their credit) has a big warning that the function is not cryptographically secure.
When in doubt, use unix, and <updated>/dev/urandom</updated>
Also, here is a command (to my knowledge secure) which generates a nice random 20 length alphanumeric:
LC_CTYPE=C < /dev/urandom tr -dc A-Za-z0-9 | head -c20
That blog post is old and inaccurate. It propagates myths about CSPRNGs that some have been trying to dispel, particularly Thomas Hühn [1] and djb [2].
/dev/urandom is the same cryptographically secure PRNG as /dev/random, except that /dev/random will superstitiously block when it thinks entropy has been "consumed". (CSPRNGs do not meaningfully "consume" entropy.)
If you use /dev/random for the mythical warm fuzzies, you might also be misunderstanding other things. Don't be superstitious. Use /dev/urandom.
If the system has not gathered any entropy, then /dev/random/ will output predictable numbers. You are consuming random numbers does not meaningfully consume entropy [0]. However, if you try to consume a random number before the system has generated entropy in the first place, then /dev/random will still happily give you non random numbers. Most desktop/servers will store entropy between boots, so this is mostly an issue for embeded to worry about, but still worth keeping in mind.
[0] Technically, it is correct to say that we do consume entropy in an information theoretic sense. However, this is not particularly revelent, as the CSPRNG is still (believed to be) secure in a complexity theoretic sense. If it turns out that the CSPRNG is not secure in a complexity theoretic sense, then we have bigger issues to worry about then a broken RNG.
In brief, wifi routers need to generate unique crypto keys. The programmers started off using /dev/random, observed that blocking, and switched to /dev/urandom. The problem was that they were generating the keys at first boot, before any entropy had accumulated, with the result that many, many devices wound up with similar or identical keys.
What is the definition of "part of the OS's boot sequence"?
In the factorable.net paper[1], a test system didn't generate 192 bits of entropy until 66 seconds(!) after boot. The problem was apparently exacerbated by Linux not mixing the entropy into the /dev/urandom pool at all until that threshold was reached, but even if it started doing so immediately, standard server processes like ssh would still start up long before there was enough entropy to make urandom sufficiently unpredictable. Sure, this is something of an extreme case... and as mentioned in the paper, there are various tools used on Linux to save a random seed to disk, so maybe the OS should ensure that happens before it starts any other services... but software that responds to an easy-to-make configuration mistake by silently behaving insecurely is the polar opposite of good security design.
The right answer is Linux's new getrandom syscall, whose default mode "blocks if the entropy pool has not yet been initialized", but thereafter acts like /dev/urandom. But on systems where that syscall is not available, I'm skeptical that using /dev/urandom is sane design.
Yes, getrandom(3) is good. It's just a patch for a broken architecture, though; the right answer, architecturally, is that userland services (like sshd) should not be started at all until /dev/urandom has been seeded.
To put it another way: /dev/urandom is effectively a service that can be in a "starting" or "started" state. When you, as a service, depend on another service, the idiomatic thing to do is to teach your init/supervisor daemon about that dependency. Then, instead of your service coming up and sitting around doing voodoo to your pipes/sockets to try to figure out whether your dependencies are up, your supervisor can just delay starting your service until it knows the dependent service is up.
(Or, to get really clever about it, you could just make /dev/urandom itself a socket attached to a "service" whose job is to, on startup, do a blocking-seed of the "real" /dev/urandom device, and thereafter become a dumb pipe over to the "real" /dev/urandom device. Then all your services just block on reading /dev/urandom [the socket] the first time, with the first one of them to try reading from it in fact causing the reseeding to happen. Socket activation!)
No need to get clever about it, or rather it's already been done. ;-) Except for the reseeding bit, getrandom() does that for you by default: blocks if /dev/urandom hasn't been seeded yet. At least according to the man page.
Hey, does anyone tried to Google for CSPRNG? What Google return for wikipedia's article title is totally off:
https://www.google.com/search?q=CSPRNG
Random are numbers - Wikipedia
https://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator
I got the same result. Usually google gives a different title than the Wikipedia title when it's closer to the search term used. However, these titles are taken from redirects to the article and "Random are numbers" wasn't ever an article or a redirect. The phrase also doesn't appear anywhere in the wikitext or the generated html. Searching for the exact string also yields only two other results, one from a reddit thread, and another from a paper. Very strange.
Good idea! But it looks like you re-invented a SHA/counter based CSPRNG :). That's not necessarily a bad thing, but there are probably library CSPRNGs that work as well and hide some details. Also, it's not generally any more or less likely to produce collisions than the method we're using. The probability of collision is dependent on the hash function / PRNG being used.
I have to say I'm surprised — I thought better of V8 than this.
Arguably it's usually going to be preferable to know you're getting the same kind of random numbers across platforms anyway, so a BYO Mersenne Twister is a decent remedy.
For those wondering why not use UUID, most uuid implementation are guaranteed to be unique only when running a single machine. When working on loads of servers without putting any load on a central resource, you can not rely on either uuid or sequential.
This is true for UUID version 4, which is completely random. There are variant that use host-specific information in the process (in fact version 4 is the only one not doing it). [1]
This said, I see that often only v4 is implemented, so maybe there are issue with the other variants.
UUID v4 seems to be based on random numbers, so should be "as good" as whatever scheme they were trying to hand roll here. Additionally, the uuid npm library uses crypto.randomBytes, which is cryptographically secure.
The biggest problem is that they take up a lot of space. Plus you have to find a good library and vet it (there's nothing in the standard library). Or just write 10 lines of straightforward code.
If you go and look at node-uuid commit history it, too, used Math.random() back in the day.
Given that browser JS has no way to interface with /dev/urandom, I think it is a mistake for V8 not to make the only random number generator available to browser JS secure by default.
tl;dr: the algorithm powering V8's Math.random() is very poor quality. For many use cases you can't safely pretend its output is actually random. Don't use it for anything non-trivial that you care about. It should probably be fixed. In the meantime, use crypto.randomBytes() or crypto.getRandomValues() instead.
I think the real lesson here is that if your application is dependent on the exact behavior of an undefined algorithm, you probably should fix your application.
It wasn't dependent on exact behavior. It was dependent on a sensible general contract. We rely on sensible implementations of general contracts all over the place in software development.
We did make an incorrect assumption that the PRNG (created by Google, who has a good reputation, and in what is probably the most popular software in the world) was high quality. That was a mistake, and we should have done more homework. However, there's no reason why the Math.random() implementation in V8 should _not_ be good enough to not have to worry about.
This is a tricky answer. The default random function in many environments is poor (see other examples given above and below), if you are generating a unique id using a PRNG and think in terms of the birthday paradox, then you should also be concerning yourselves with the quality of the random generator. While it is unreasonable to expect every developer to know the ins and outs of a deeply technical issue, defaulting to using a known good (enough) algorithm like MT over making an assumption about the quality of an unknown algorithm (unknown at least until you looked into the source code) just because it came from google is poor engineering (IMHO)
The analysis in this article was great engineering - something that in my experience often happens after poor engineering... YMMV
It was dependent on it being a CSPRNG which it wasn't. Math.random is designed for things like games and animations where randomness isn't critical and performance is important.
Actually CSPRNGs are slightly worse if you are more worried about distribution than predictability (he mentions this in the article). If you are using it in a context where a hostile attacker cannot gain anything via predicting or manipulating the generated numbers, then for birthday-paradox type resistance an alternate non-CS PRNG, like those typically used for simulations, is more appropriate. However when you're using JavaScript, it's likely the natively implemented CSPRNG will be faster than a custom "faster" PRNG implemented in JavaScript.
No, it wasn't. Read the article again. Assumption made was on Math.random being high quality PRNG. Being cryptographically secure is completely unrelated.
META: Maybe the mods need to look into how certain popular sites treat #, as medium seems to generate a new #<hash> on each refresh. I submitted this 1 day ago, intentionally removing the #<hash> - https://news.ycombinator.com/item?id=10598335 - r0muald clearly caught a better timing for this article but frustrating that an exact duplicate submitted in very close proximity becomes a totally separate submission!
HN's dupe detector deliberately allows reposts if an article hasn't had significant attention yet, so the added hash didn't make a difference in this case. That said, there are definitely too many duplicates right now and we're working on a new approach that will (as a side effect) privilege the original submitter more often.
It's frustrating when you get there first and someone else hits the jackpot, but it evens out in the long run if you submit enough good stories.
I am only a little frustrated that I missed out on so much karma love (I am in my 30's now and find that these things matter less as I get older :), it was more the fact that Medium naturally appends this, so literally all submissions of a Medium article will result in reposts.
It's not a simple problem as some URLs will rely on the hash to work correctly, e.g. hxxp://www.example.com/#!/foo/bar, so you can't just strip it. I think it requires some special cases for sites known to do this kind of thing (not sure why Medium do it.)
Glad to hear duplicates is an issue you're looking at, has certainly been a problem, though I agree it makes sense to allow it in certain cases. For example I love the XV6 OS and like to see it periodically submitted again from time-to-time to see new discussion. It might even be nice to have an auto-generated list of previous submissions in this case?
As an amusing aside, this whole issue has a relationship to the story - perhaps the generated hash uses Math.random() and therefore might actually result in duplicate URLs after not so many resubmissions? ;)
I can't believe you're defending "an incorrect assumption" regarding your previous generator, especially after you wrote a very good article showing how badly wrong you were.
Wait, let me restate that. You're Engineering the Disruption of Real Money Gaming and yet you say of some PRNG "no reason the ... implementation ... should not be good enough to not have to worry about"? You're not just wrong, in fact you're "not even wrong", that's how wrong you are. Plus that sentence has three negatives!
Jules Winnfield said it better than I could. Riffing from him, what you were doing was, compared to what you should have been doing: ain't the same fuckin' ballpark, it ain't the same league, it ain't even the same fuckin' sport.
Here's something I came up with in 60 seconds. It's far better than what you didn't want to worry about:
ID Quantique hardware RNG [1]
mixed with Intel RdRand [2]
Yes, you're right. That's not seed-able, so you can't create a reproducible sequence of values (e.g., for testing). That's an advantage. You want to test, you replace hardware with software. But if you want to disrupt real money gaming you don't use some crappy software in some random library that wasn't designed for the purpose.
Once again, I can't believe your defense of how badly you failed was the "reputation" that Google has. Real money is involved and you thought it was OK to rely on some crappy PRNG created by Google? For something that is at the heart of, the quintessence of what you are attempting to accomplish?
For the record, we are not and at no point have we ever used the Node.js PRNG to generate random numbers for our gambling games. That is an entirely separate, audited, and carefully controlled system. This code was unrelated to gambling. In fact, it was not even consumer facing.
Good question. As they author noted, k bits of state can support a cycle length of 2^k. You can look at that from the other direction and state it as a cycle length of L requires at least log2(L) bits of state.
If your cycle length is L and you are using k bits of state, and k > log2(L), then in theory with a cleverer encoding of state you could save k - log2(L) bits of memory.
Whether or not I'd characterize k - log2(L) excess bits as "wasted" memory depends on just how big it is. For instance, suppose someone implemented Mersenne Twister and used twice as many bits as theoretically needed for the cycle length. Typical MT has a cycle length of around 2^20000, so needs 20000 bits of state. Someone doubling that is using an excess of 2500 bytes.
On the other hand, the generator the article is about has a cycle length of around 2^60, and so need in theory 60 bits of state. If an implementation doubled that they have an excess of less of 8 bytes.
2500 bytes is enough that if I were implementing I'd probably take a good look at seeing if I could eliminate it, even if it meant making the implementation a little more obtuse. I'd be inclined to consider that 2500 bytes as wasted memory.
8 bytes? That's small enough that even on most embedded systems I'd probably consider clean and clear code to be more important than saving that memory. I'd not be inclined to consider it wasted memory.
I have another reason based on some mucking with simple PNG's a long time ago. Might not apply to more complicated ones[1].
Maximal cycle length generators don't produce degenerate sequences. Meaning I seem to remember ones where they'd have say a sequence of 2^k - 349 and a sequence of 340 and a sequence of 7. Which means they fail badly if seeded incorrectly.
[1] Noting an old article I read on FPGA design that said, 'helps to design your state machines so that that illegal states transition to legal ones' So I assume sub 2^k generators exist that don't have sub-sequences.
Re: submitting a patch, there were several previous issues and discussions about Math.random() being problematic that were closed-wontfix or otherwise dismissed. It seemed like the general perception was that it was "good enough." It didn't make sense to me to open another similar ticket without a good analysis and whatnot.
I meant to file another issue before posting, but I forgot. There is an open issue now, however![1]
(Similar incentives exist for hash functions, by the way.)
Popular RNG-bound benchmarks:
* "fasta" from the Benchmarks Game [1] (thankfully, this one mandates use of its own RNG, though it's vulnerable to fast-forwarding as demonstrated in [2])
* "Perlin noise" [3] (almost entirely RNG performance bound)
* From SunSpider (this one being particularly relevant to V8): string-validate-input [4], string-base64 [5]
[1]: http://benchmarksgame.alioth.debian.org/u64q/fasta-descripti...
[2]: https://github.com/TeXitoi/benchmarksgame-rs/blob/master/src...
[3]: https://github.com/nsf/pnoise
[4]: https://www.webkit.org/perf/sunspider-0.9/string-validate-in...
[5]: https://www.webkit.org/perf/sunspider-0.9/string-base64.html