2^120 Ways to Ensure Unique Identifiers

CRidge · on Feb 12, 2015

Perhaps I'm missing something here, but where's the guarantee that these ID's are unique? As far as I can tell, an ID is made up of a timestamp and a random number, generated by the JavaScript enginge - and typically seeded with the current time.

Wouldn't that give relatively good chances that for a system with many users, two users would actually generate the same ID?

bmm6o · on Feb 12, 2015

Best-case scenario (for them) is that the random data is "truly" random. 72 bits of the ID are sourced from Math.random(), which means you need to start worrying about collisions when clients attempt to generate 2^36 events with the same timestamp. That's about 70 billion, so it's not something that would keep me up at night.

What's harder to quantify is the role that sub-optimal random number generators would play in this. Implementation of Math.random is left to the browser, but you could imagine a simple LCG with just a 64-bit long as its state, initialized with the current time. Having only 64 bits of state is the limiting factor here, both for properly filling out the 72 bits of entropy (you can't) and for 2 users to have the same internal state in their LCG.

antics · on Feb 12, 2015

Only-kind-of-related: Instagram has a cool post about using ordered IDs to shard photos. It's here: http://instagram-engineering.tumblr.com/post/10853187575/sha...

Pretty cool stuff.

xwintermutex · on Feb 12, 2015

I don't understand this sentence: "One caveat to the randomness is that in order to preserve chronological ordering if a client creates multiple push IDs in the same millisecond, we just ‘increment’ the random bits by one."

avar · on Feb 12, 2015

If you generate ids like:

    <fixed width timestamp><random data>

Your ids are guaranteed to be sorted chronologically until you start generating them at the same time. So as a hack to work around that they generate:

    <fixed width timestamp><fuzzed random data>

Where the fuzz to the random data ensures that it's sorted chronologically.

DanielStraight · on Feb 12, 2015

The ID consists of a timestamp to millisecond precision and a random value. Since the timestamp only offers millisecond precision, two IDs generated in the same millisecond could be out of order depending on the random value because the timestamps will be the same. In order to prevent that, if two IDs are generated in the same millisecond, instead of generating a new random value, they take the random value from last time and add 1. This ensures that even though the timestamp of the second ID is the same as the first one, it will sort after the first one.

xwintermutex · on Feb 12, 2015

I see, thanks!

emmab · on Feb 12, 2015

The article would have benefited from explaining in what situations you might prefer this to a UUID

pstrateman · on Feb 12, 2015

The only reason to use this over a UUID is if you want the primary key to be implicitly chronologically ordered.

The ordering guarantee is weak though, it's provided by clients all potentially with their own concept of time.

rosser · on Feb 12, 2015

Yeah, you absolutely can't trust client-generated timestamps. We have a table tracking metadata on customer-uploaded photos (dimensions, file size, color profile, &c), including the camera's timestamp. According to their cameras, our customers have uploaded pics taken before the Buddha sat beneath the Bodhi tree, and after the Khitomer Accord was signed.

Fortunately, we also store an upload timestamp, and our servers all sync to local NTP.

zaroth · on Feb 12, 2015

> "This guarantees chronological ordering, since the ServerValue.TIMESTAMP will be resolved on the Firebase server to be an accurate timestamp."

To be sure, this can't possibly guarantee chronological ordering. But it can get you fairly close and perhaps almost predictable chronological ordering...

If at any level there is a clock which can skew and is being adjusted with a stochastic process, there is jitter, and you can get out of order. The solution is simply to apply corrective factors in a continuous manor -- slow down or speed up time. That's what the client adjustment should have been doing all along, and then you would never need the ServerValue.TIMESTAMP hack.

If Firebase is relying on a ntp utility, I'm not sure but I would assume that they are smart enough to slow or speedup time and not jerk to a new setting. Or at least they would have a flag to enable that functionality.

Edit: Distributed auto-incrementing counter is not that hard. A field for time, a field for ID. I don't see a reason to conflate the two. I guess it makes object instantiating easier, since you don't have to fetch an ID in any sense. 64 vs 128 bits is not a big difference. So in the end it's a good design decision and is a net positive for them.

bmm6o · on Feb 12, 2015

> then you would never need the ServerValue.TIMESTAMP hack

    ref.push({
    foo: 'bar',
    date: Firebase.ServerValue.TIMESTAMP
    });

Seems way easier than synchronizing a clock in the browser and one on the server, and you don't have to try to account for network latency between browser and server.

> If at any level there is a clock which can skew and is being adjusted with a stochastic process, there is jitter, and you can get out of order

My understanding is that Linux has support for a monotonic clock. As long as its a single box resolving the times this shouldn't be a hard problem.

x3c · on Feb 12, 2015

Could you please elaborate on time speed up or slow down? Or if you could refer me to some reading material? It's an interesting problem and I'd love to deep dive into it.

zaroth · on Feb 12, 2015

http://www.ntp.org/ntpfaq/NTP-s-algo.htm#Q-ALGO-CLK-UPD

ademarre · on Feb 12, 2015

This looks similar to the method employed here:

https://eager.io/blog/how-long-does-an-id-need-to-be/

(HN discussion: https://news.ycombinator.com/item?id=8539866)

obstinate · on Feb 12, 2015

How do you ensure that all your server times are perfectly synched? Or is there just one server? This seems somewhat doubtful.

What happens when I push after a negative leap second?

What happens if the system clock is randomly skewy (see the relevant Rachel by the Bay article).

dkubb · on Feb 12, 2015

You could side-step leap second issues by using a TAI based timestamp.

kenneth · on Feb 12, 2015

This has always been my favorite way to generate IDs, though I typically do it with a 64bit integer and a custom epoch.

Being able to store keys as 64bit integers is much easier to work with, and also much more efficient than b64 encoding. In the odd case you're working with a language that can't handle numbers (ahem… javascript), you can always encode it as a string.

Here's a library that implements this method: https://github.com/Fubelly/rubyflake

jimktrains2 · on Feb 12, 2015

I love snowflakes:)

I wrote a little rust implementation[1] also

[1] https://github.com/jimktrains/rustflake

nodesocket · on Feb 12, 2015

We generate ids using the following:

   $id = Functions::generate_random(5) . uniqid() . Functions::generate_random(3);

`uniqid()` ensures each id is unique (somewhat), and the random paddings of 5 and 3 characters make the ids harder to guess and unpredictable.

stormbrew · on Feb 12, 2015

This uniqid() [1]? That appears to just be a microsecond timestamp. Definitely not globally unique by any stretch.

[1] http://php.net/manual/en/function.uniqid.php

ars · on Feb 12, 2015

Yes, if you need unique, and generated on the server (not remotely) then also include getmypid() and the IP address (or name) of the webserver (if you have more than one).

Then use the option for more entropy in case the function is called more than one in a milliseconds.

So in total:

    uniqid(gethostname().getmypid(), true);

Will be unique if you give sane names to your machines.

intortus · on Feb 12, 2015

This looks like a longer, simpler form of twitter's 64-bit snowflake.