Sqids – Generate short unique IDs from numbers (sqids.org)
565 points by vyrotek on Nov 25, 2023 | 232 comments



I appreciate that the author clearly states that security, i.e., output can't be reversed back to the input, is a non-requirement. We can't criticize the author too much for that either, because, as a rule, "random-looking id generator" algorithms will always be either not secure, or not short, or not collision-free. Or they'll be a key-value database.

A secure "random-looking id generator" is called a block cipher. Block ciphers with less than 128 bits of output are widely considered insecure: this corresponds to about 22 base64 characters.

Going further, you probably do want to use a secure algorithm. History is full of people thinking "we don't need this to be secure, it's not used for security-sensitive things" and later regretting it. Some case studies at [1].

[1] https://www.schneier.com/blog/archives/2016/04/security_risk... .


It's good to consider this but... plenty of sites expose user IDs as regular integers. In some cases you might want to avoid this (leaking user count to competitors, etc.), but I have never heard of anyone calling this a vulnerability.


It's referred to as an "Insecure Direct Object Reference" (IDOR) vulnerability. In many cases it is not actually a vulnerability, however, when an application contains sensitive information and lacks authorization or rate-limiting it can be exploited to enumerate the entire database.

https://cheatsheetseries.owasp.org/cheatsheets/Insecure_Dire...

When I first joined $company, HR sent me a SharePoint document with a numerical ID. Incrementing or decrementing the ID allowed me to view personal information of other employees including their pay.


I hear this all the time. Every 3PPT report I see is cranky if you have userid=2345 as you can enumerate it.

Personally I think it's stupid but this is a tempting solution.


> i.e., output can't be reversed back to the input

But in the "Not good for" section it states the opposite: "[not good for] User IDs can be decoded, revealing user count" so it can be reversed?


Added parentheses to make what your parent commenter is saying clearer:

>I appreciate that the author clearly states that security, (i.e., output can't be reversed back to the input), is a non-requirement.


I realize now the site lists "One-Time Passwords" as a use-case. I'm a positive-vibes-only HN poster though, so will leave it as an exercise to the reader to decide whether that's going to turn out well or not.


What's the issue with that, provided the input number is random?

My issue with Sqids is they're longer and more complicated than the input!


The mention of one-time passcodes seems odd. Those need to be unguessable, but don't need to be unique. If you supply a suitable random source, then I suppose it works, but the "padded with junk" feature makes these look more complex than they really are.

The standard choice of 4 to 8 random digits works well, and it's clear what level of security they provide. Digits are easier to understand than case-sensitive Latin characters, especially when your native language uses a different character set.


I think you completely misunderstood the article, or you have not read it.

The uniqueness of the system comes from the fact that two different numbers will never have the same ID.

The pad only works for things like user ids.

You can also change the alphabet it uses so it's not case-sensitive; using this alphabet: "ABCDEFGHJKLMNPQRSTUVWXYZ0123456789" and a minimum length of 8, it will produce 8-character IDs all the way up to 4294967295 (0xffffffff).
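
For illustration, roughly how that looks with the JavaScript library (a sketch; the package name and the alphabet/minLength constructor options are taken from the sqids-javascript README):

    import Sqids from 'sqids';

    // 34 symbols: A-Z minus I and O, plus the digits.
    const sqids = new Sqids({
      alphabet: 'ABCDEFGHJKLMNPQRSTUVWXYZ0123456789',
      minLength: 8,
    });

    const id = sqids.encode([4294967295]); // 34^8 ≈ 1.8e12, so 8 chars cover 0xffffffff
    sqids.decode(id);                      // => [4294967295]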


This doesn't go against what the other commenter said. The understanding was correct.


I like the idea, though I use nanoid with the safe letter dictionary (it excludes letters used for profanity[0])

They should use a similar dictionary approach IMO because I looked at the implementation and it’s hardcoded to look for “bad” words

Otherwise looks real straightforward! I’d love to see some performance test suites for it

[0]: https://github.com/sqids/sqids-javascript/blob/ebca95e114932...

[1]: though with UUID v4 so common to generate and well optimized in most languages, I wonder if these userland solutions are really better. You can always generate a UUID and re-encode it with base32 or base64, which is also well optimized in most languages.


> it excludes letters used for profanity

That doesn't seem possible. How would that work?

> I looked at the implementation and it’s hardcoded to look for “bad” words.

If you mean https://github.com/y-gagar1n/nanoid-good, that seems to be doing the same thing.

In general, I'm a bit wary of solutions that "guarantee no bad words" – this is usually highly language-specific: one language's perfectly acceptable name is another language's swear word.


This is the implementation: https://github.com/CyberAP/nanoid-dictionary

We use it in a highly internationalized product spanning multiple languages and haven't yet run into a complaint, or a value on audit that would constitute something offensive in any language, per our intl content teams anyway.

That isn't to say it's 100% (and simply enough, we don't audit every single URL) but I suspect we would have gotten at least a heads up from a user by now.

Nevertheless, we are moving our approach to UUIDs that get base32 encoded for some of our use cases. They're easier to work with for us in many scenarios.


It's particularly funny because their example docs for .NET output "B4aajs", which, to any Swedish l33t-speaking individual, reads as "Bajs", which means "shit".


Somewhere there's a database for every bad word and every bad typo in every language and that one just got added.


Omit vowels and you're 90% of the way there; omit the vowel-looking digits 0,1,3,4 and you're probably >99% of the way there.


fxck


Which is, evidently, why nanoid also excludes x and X, as well as v and V (fvck).


fjck


> That doesn't seem possible. How would that work?

agree; b00b, DlCK, cntfcker

But I suppose, if the user doesn't get to craft the input, the collision space of converted numerical IDs and words like the above is sufficiently small to be ignorable.


Besides vowels, nanoid excludes 0, 1, 3, 4, 5, I, l, x, X, v, V, and other lookalikes, so the chances of generating something naughty in any language are close to zero.


Humans have a high capacity for spotting rudeness. Nanoid’s nolookalikesSafe alphabet would allow blwjb69FKmyD7CK.

(Sorry)


Buy me drink first, jeez


Looks like the dictionaries used are from this file?

https://registry.npmjs.org/naughty-words/-/naughty-words-1.2...

From a quick look, the lists are pretty short, except for the English one, which at least has some 404 words, but I can imagine there are far more bad words that you'd want to avoid than just those?



Grepping out naughty words in randomly generated text definitely strictly weakens the information content if you're using it for a secure application but is often necessary.

In the early dotcom era, the company I worked for was about to go live, and the final step was demoing the end-to-end flow to the CEO. I had done the back-end stuff and hadn't paid much attention to the front-end. The person who did the account creation process wanted to nudge people toward memorable yet strongish passwords, so when it created your account it would generate a random password, which he did by choosing two four-letter words at random from the Unix dictionary and putting a two-digit number between them. He ran that past me as an idea and I thought "yeah, good idea" and didn't think more of it.

However, he forgot to first grep out all the naughty words, so when we demoed it to the CEO/non-technical founder, both of the words in his randomly generated password were swearwords.


I tried something similar with a fixed alphabet that guarantees no profanity and a checksum (luhn)

https://github.com/tttp/dxid


Side note: there are some business insights you can get from a company using serial ids.

i.e., if you sign up and get user ID 32588 and make another account a few days later, you can tell the growth rate of the company.

And this is possible with every resource type in the application.

I do wonder how much the URL bar junk thing matters these days. I tend to use ULIDs (waiting on wide UUIDv7 adoption), and they're a bit ugly, but most browsers hide most of the URL now anyway. The fact that there is a built-in time component comes in clutch sometimes (e.g. object merging rules).


> you can tell the growth rate of the company.

You can even do this when you don’t know the exact interval by using probabilities. The Allies used this method to estimate German tank production in World War II by analyzing the serial numbers of captured or destroyed tanks.

This is known as the German Tank Problem [1].

[1] https://en.wikipedia.org/wiki/German_tank_problem
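
The standard frequentist estimator is simple enough to sketch in a few lines: with k observed serials and m the largest of them, the minimum-variance unbiased estimate of the total is m + m/k - 1.

    // German tank estimator: N ≈ m + m/k - 1,
    // where m = largest serial observed and k = number of samples.
    function estimateTotal(serials) {
      const m = Math.max(...serials);
      const k = serials.length;
      return m + m / k - 1;
    }

    estimateTotal([19, 40, 42, 60]); // 60 + 60/4 - 1 = 74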


Very interesting.

I’m a lawyer and using sequential IDs in a fraud case right now, to determine the number of victims.

Unfortunately, so far, I only have the IDs of two victims, and those are from just within about a month, whereas the fraud has likely been going on for several years. Simply extrapolating that growth rate isn't going to be very accurate.

Also, I suspect that the perpetrators did not start at ID 1.


You might try to use the information to find more victims first.


ehm, yeah, n=2 will not get you anything useful...

that'll be like trying to determine the average salary in a company with only two known ones, which could be the janitor's and the CEO's


> that'll be like trying to determine the average salary in a company with only two known ones, which could be the janitor's and the CEO's

Ironically that would be somewhat close to the actual average.


It would be significantly above the average unless the company is ridiculously top-heavy or has shockingly little variation in salary. Or if the "salary" for the CEO ignores certain compensation (eg: paid a salary of $1 + stock options).


Sure thing. I could have worded it better, but I was trying to say that it would be much more skewed if the two samples were, say, CEO and the CFO, or two janitors.


Even with n=1 you can get something useful. IIRC, "on average", if you have ID x then the best population estimate is 2*x. Of course the error margin is immense, but it's still better than nothing.


It also makes it slightly easier to perform certain attacks since it's trivial to figure out other IDs.


Making non-guessable IDs for broken authorization is security by obscurity.

If you have integer IDs it is also trivial to find authorization flaws on your own. Any pentester will go for it right away.

If you make non guessable IDs they might skip it and go look for other stuff.


I would have introduced random, increasing skips in the sequence to make my army look 10x bigger.


I see so many organizations add weird slowdowns from debts associated with this. I reflect on some of the most successful tech businesses of the last decade and remember that all their APIs exposed this kind of data early on and many still do.

Does anyone have an example they can reference of a business being harmed by this information being out there?


I don’t know any stories of digital businesses but there was a case where someone went and counted customers at the door only to realize that the company was lying on the yearly reports.

So I guess that kinda harmed that fraudulent business strategy…


At an internship long ago, my boss instructed me to always add a few extra to the auto incremented order ID so customers couldn’t guess how business was going if they happen to order stuff quickly in a row.


How big and random were these "few extras"?


The solution I prefer is to simply just encrypt the data such as IDs.

Instead of giving user an ID in response, user gets hmac(cipher(Data, secret_key), secret_key) + cipher(Data, secret_key) and then some simple pre-request handler just iterates over query params / form data and decrypts them if signature matches.

It also works as a really nice CSRF protection as user ID of currently signed user can be embedded into Data and checked if current user.id == decrypted data.id.

Another nice advantage is that you can deny the request right in the beginning as you know ahead of time that the provided data is not valid (signature doesn't match), saving some DB queries.

The downside is that the URL gets pretty long. But if that's hidden by the browser or the user doesn't care, it's a non-issue.
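
For the curious, a rough sketch of that shape with Node's built-in crypto (encrypt-then-MAC; the key handling, IV strategy and encodings here are illustrative assumptions, not a vetted design):

    import crypto from 'node:crypto';

    // Illustrative keys; in practice these would be derived and stored properly.
    const ENC_KEY = crypto.randomBytes(32);
    const MAC_KEY = crypto.randomBytes(32);

    // Token = HMAC(ciphertext) || ciphertext, roughly as described above.
    function encodeId(data) {
      const iv = crypto.randomBytes(16);
      const cipher = crypto.createCipheriv('aes-256-ctr', ENC_KEY, iv);
      const ct = Buffer.concat([iv, cipher.update(String(data), 'utf8'), cipher.final()]);
      const mac = crypto.createHmac('sha256', MAC_KEY).update(ct).digest();
      return Buffer.concat([mac, ct]).toString('base64url');
    }

    function decodeId(token) {
      const buf = Buffer.from(token, 'base64url');
      if (buf.length < 48) return null;                        // 32-byte MAC + 16-byte IV minimum
      const mac = buf.subarray(0, 32);
      const ct = buf.subarray(32);
      const expected = crypto.createHmac('sha256', MAC_KEY).update(ct).digest();
      if (!crypto.timingSafeEqual(mac, expected)) return null; // reject before any DB work
      const decipher = crypto.createDecipheriv('aes-256-ctr', ENC_KEY, ct.subarray(0, 16));
      return Buffer.concat([decipher.update(ct.subarray(16)), decipher.final()]).toString('utf8');
    }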


Can confirm, when I worked in VC we used this to verify order volume for a number of startups we were evaluating. For one startup, I wrote a bot to place a small order a few times a day, and log the order number.


Did you use this for due diligence, to verify that the data reported by the entrepreneur in the pitch was good, or were you looking at some companies in a specific space and checking which company had more orders?


The former.


> most browsers

Not chrome...

Also, links are a thing in chat, etc


Here's what a recent YouTube link I shared looked like (YouTube being something Sqids documents as a sample use case):

> https://www.youtube.com/watch?v=fFMzQ3tYTFU&pp=ygURY2hQImVzZ...

Or Twitter:

> https://x.com/elonmusk/status/172853302828286055507?s=20

Or TikTok:

> https://www.tiktok.com/@<userId>/video/730292574259232054785...

While I tend to strip the tracking params and there are extensions that do this, I don't think most people do. These URLs are pretty 'ugly'.

So if the links that are being shared most on the internet (YT, TikTok, Twitter) don't care, you probably shouldn't either. I think the onus is on the UI layers (Chat apps, etc) to show urls how they look best on their respective platforms.

Edit: to this point, it looks like HN truncates these to make them less ugly too.


If you use the share button and not the url bar you get prettier and shorter urls.

Don't know how common that is. Wouldn't be surprised if non-techies don't know how to copy from the URL bar.


I always take note of invoice numbers for this exact reason, they give you a feel for how busy they are.


And this is the stuff you get if you manage to get your access control right.

Get it wrong and we jump from actionable business metadata to actionable business data (like perhaps which of your customer's customers are poachable)


Yes but just using Sqids doesn't fix this. Sqids are decodable. You need to use a random or otherwise unpredictable input.

What's the advantage of uuid7 (sequential + random) vs uuid4 (full random) for this?


You can extract the timestamp from UUID7/8 so it also can reveal business information.
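
(The leak is easy to demonstrate: per the spec, the first 48 bits of a UUIDv7 are a Unix timestamp in milliseconds. A quick sketch:)

    // Recover the creation time embedded in a UUIDv7.
    function uuid7Timestamp(uuid) {
      const ms = parseInt(uuid.replace(/-/g, '').slice(0, 12), 16); // first 48 bits
      return new Date(ms);
    }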


That's why I'm asking, it seems like a liability.


You don't get the cardinality of the data type, just when the object was created.

There are probably some business cases where the "when" information is potentially useful (I can't think of any), but you can't know, for example, how many users are in the database.


It's usually benign, but why encode any info into your public IDs? I wouldn't go anywhere near that.

It can make sense for some situational internal database use case where you want temporal locality and can't use full sequential since it's distributed, and even then your DBMS might recommend something else, e.g. Spanner explicitly says not to do this. And it doesn't need to be exposed to users.


Assuming the alternative is a fully random key, this can wreak havoc on performance depending on the database engine and index type used (lots have been written on this topic).

But I do agree: if performance isn't an issue with your DB choice and you're not interested in getting a free "created_at", you might as well go fully random.


Primary keys have to be chosen carefully because they impact disk layout, joins, etc, and full random makes bad PKs in certain distributed DBs. But it's simple and cheap to convert a public user ID (full random) to/from internal row keys (sequential-ish) at the API boundaries using a secondary index or even a cache.


(Like, simple/cheap enough that it's probably worth doing instead of exposing your row keys. Maybe in some careful cases exposing uuid7 row keys makes sense, but not nearly enough to recommend that so broadly. It's not a safe default.)


Makes sense. Longest HN convo I've managed to keep at this point -- thanks for engaging!


Yes, that's how I know I was roughly the 600,000th person to sign up for thefacebook.com.


Unless they are doing master-master replication so are incrementing by something other than 1


It's weird that under "Get Started" they have links to 40 different languages. You can only get started with 15 of the 40 languages listed; the other 25 are skeleton repos asking people to star the repo to indicate interest.


It’s kinda clever. The people most likely to look at this project are also likely ideal candidates for implementing the library in a new language (Developer, FOSS enthusiast, interested in the project, need the library in a language they’re familiar with that isn’t implemented yet).

Also, the language pills differentiate between those that have been implemented (color logo, dark text, bold) and those that aren’t (grayscale).


Good points. Those pages also contain links to old implementations (Hashids), because a lot of projects still use those and want to be able to find them.


Also can help track which languages people click on, probably a good proxy of where there would be the need to develop a lib


The approach definitely works. Some time ago I saw .NET listed but discovered it wasn't complete. I was eager to replace an existing Hashids implementation, so I made some comments, shared a starter snippet, and then someone was excited enough to complete it in just a few days. It was great to see how quickly the community stepped in. Maybe there was a bit of Cunningham's Law in effect with my contribution, ha.

https://github.com/sqids/sqids-dotnet/issues/2#issuecomment-...


Maybe a slam dunk first FOSS contribution?


This seems like a perfect use case for an LLM :)


The second example in each language sample, where the generated sqid ends up being "B4aajs", essentially reads as "P0ooop" to a Swedish speaker.

Which is fine, they don’t propose to filter “bad” words in other languages, but kind of funny when that’s one of the highlighted examples, right next to the goal of filtering words. Goes to show how hard it is to filter profanity generally for international audiences


Could you just remove vowels and hit 99.9% of profanity in all languages? Ditto for removing their 0-9 equivalents, if you're really worried about it. Quick out of the box support for that via being able to define a custom alphabet.


You might still have issues with generating sequences like "XtrmlyBdWrd" that are still recognizable.


Well until we figure out a way to remove pattern matching from humans... use GUIDs if that's an issue. Removing vowels fixes "spelling almost all bad words explicitly", though I'm open to being proven wrong with fun new swears in exotic (to me) languages :)

The problem of "pick any N symbols that don't make any profanity in any language across all time" isn't what this is solving, nor should it have to. Take the same concept but use whitelisted words to build the token if you're that averse to computer-generated, fill-in-the-blank naughty words. Keep "pen" and "island", among other things, off that list ;)


With enough fricatives, some languages still manage. See the Serbian for Serbian, srpska https://en.wikipedia.org/wiki/Republika_Srpska.



Neat library!

We're using randomly generated strings for many things. IDs, password recovery tokens, etc. We've generated millions of them in our system, for various use-cases. Hundreds of thousands of people see them every day.

I've never heard any complaints about a random content-id being "lR8vDick4r" (dick) or whatever.

But nowadays our society is so afraid of offending anyone that profanity filters have extended all the way to database IDs and password recovery tokens.

(there are some legit cases, like randomly generated IDs for user profiles shared in public URLs, that users have to live with, but even there just make the min length 8 and you're unlikely to have any full-word profanity as the complete ID; put differently, I don't understand why they made the block list an opt-out thing)


The block list is 2/3 of the (minified) library. I found this entire choice odd.

First, it's highly incomplete because you can find at least 10x more combinations spelling the same "word". And probably 10x more slurs that aren't in this block list. Second, because it's hardcoded in your source. Third, because there are more elegant solutions.

Such as to pick an alphabet that can't spell readable words unless you're trying really hard to read a slur into it. Say this (no vowels or digits):

bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ (length 42)

The full lower+upper+digits alphabet they use is 62. Feels like you're losing a lot, but... not really.

- A 128-bit id in base 62 = 22 letters.

- A 128-bit id in base 42 = 24 letters.

JUST TWO MORE LETTERS. And it's one more letter for 64-bit id (11 vs 12). And we can avoid this entire silliness. The problem is the author doesn't realize that logN is... logarithmic, I suppose.


A slight mod: I'd remove Y, despite it not being exactly a vowel, and add back the digits that can't be interpreted as vowels.

bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ25679

Gives us base 45. And below is a JS snippet to make an id. There's your lib.

    function id(num) {
        num = BigInt(num);
        // 45-character alphabet: consonants (minus y) in both cases, plus
        // the digits that can't be read as vowels.
        const dict = "bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ25679";
        let id = '';
        // Peel off base-45 digits, least significant first.
        while (num > 0n) {
            id += dict[Number(num % 45n)];
            num /= 45n;
        }
        return id || dict[0]; // num = 0 encodes as the first character
    }
Example:

    id(123456789012345678901234567890n);

    "bq99hC6fbtjLrkxLPm"


Totally agree. 2/3 is wild, especially given it seems like you could mitigate most of the risk just by removing vowels from the dictionary.


I'm actually more convinced the problem is corporate risk management to deal with the tendencies of social media to overhype issues by design, rather than a statement on society


There are other "general" rules when it comes to random human-readable tokens, such as not using Os and Is if your strings include numbers: people can and will confuse them with 0s and 1s if they have to type them over.

Most gift card tokens for example don't allow the use of those two (or quietly correct it) to avoid making that mistake.


California state driver's license and ID numbers are one letter and then 7 digits. But always, if the character in second place is a zero, everybody reads it as the letter O.


Well if you don't filter, things like this may happen:

https://github.com/compiler-explorer/compiler-explorer/issue...

Me and you will know it's just random characters, but if HR enters the chat...


In a Ruby app we just convert to a high base, like

  > 1234567890.to_s(36)
  => "kf12oi"
That gets us most of the way there, but Sqids has a Ruby library and lets you set a much higher base, including upper-case characters and, I suppose, emoji. We're going to need much bigger numbers before that space savings makes much difference. I like it, but it's hard to know when something like that is worth adding a dependency.



Ironically its default encoding is base 55. I figured if you're going to use emoji you should at least use the size of the space to maximize the numbers of bits per grapheme beyond what is possible with ASCII, but apparently not.

Which I guess is the part where senior (citizen) programmers like me get triggered as promised in the README.


I believe a big part of the idea is for the hash to be unpredictable as well.

If I figure out you're using (36) then I know the next number 1234567891 is "kf12oj".

Not the case with Sqids.


No, sqids are predictable; you can't use them to hide information.

They call it out on their front page.


I haven't looked at the implementation yet but HashIds (the former project name) required a salt. It would be weird if they changed that.


The `salt` in hashids just shuffled the alphabet. Now they removed the `salt`, but you can have the same level of "obfuscation" as before if you shuffle the alphabet yourself before calling the library.


They're weakly unpredictable.

Obfuscated


You can easily brute-force this. Sqids also says it's not good for sensitive data.


It looks like an easy brute force too, there's no compute-hard operations here. I guess you could scramble your alphabet? Otherwise Uk always comes after bM, etc.


My understanding is that you can re-order the source alphabet and encode numbers with swapped characters. Unless you know of 36 numbers that are exactly 1 ID apart, you will always have uncertainty about what the IDs actually map to. I guess given a large, large number of IDs, along with the order in which they're assigned (which might be given through the time at which they're assigned), you could make a probabilistic statement about the actual order of the 36 characters, based on the fact that most numbers increase/decrease faster the bigger they are. (This fact is e.g. used to detect fabricated bank statements: if the first digit of any number in the statement is equally likely to be 1 or 9, the numbers are randomly generated, whereas if they're real, 9 is less likely than 1.)


Scrambling the source alphabet should have an effect similar to a monoalphabetic substitution cypher. This is not strong cryptography. If the attacker has any ability to generate IDs quickly, like by creating user accounts or other resources, they can create many IDs with known ordering. Likely effective against non-serious attempts.


Yeah this is very much the bad kind of DIY crypto.


I'd prefer to use Crockford-encoded entropy with Stripe-style token prefixes to create unique ID namespaces. Run it through a bad-words filter, and it's perfect.

user_1hrpt0xpax7ps

file_xpax7psaz0tv6az0tv6

Etc.

In distributed systems you can use the trailing bytes to encode things like author cluster, in case you're active-active and need to route subsequent writes before create event replication.

Easy to copy, debug, run ops/on-call against. If you have an API, they're user-friendly.

Of course you still want to instruct people the prefixes are opaque.
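
A rough sketch of that shape (the prefix and the 10-byte length are arbitrary choices; the alphabet is Crockford's base32, which drops I, L, O and U):

    import { randomBytes } from 'node:crypto';

    const CROCKFORD = '0123456789abcdefghjkmnpqrstvwxyz'; // lowercased Crockford base32

    function newId(prefix, bytes = 10) {
      let n = BigInt('0x' + randomBytes(bytes).toString('hex'));
      let out = '';
      while (n > 0n) {
        out = CROCKFORD[Number(n % 32n)] + out;
        n /= 32n;
      }
      return `${prefix}_${out || '0'}`;
    }

    newId('user'); // e.g. a "user_1hrpt0xpax7ps"-style token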


Yeah don’t forget the bad words filter. I worked on an IKEA mailing where the list processing house was adding an autogenerated discount code to the address label. The customers received codes with BOOB, DICK, TWAT, and CUNT embedded within. People were not happy.


Did they never make an IKEA purchase after that or did they get over it like a normal adult?

I don't work retail, but something tells me people will make a stink out of just about anything if it meant potentially free products or other compensation.

Plus, are you filtering just English curse words or all curse words for countries that use Latin characters?


The risk is not in offending someone, but in that someone posting the rude string on social media in real or mock indignation, causing the outrage machine to turn on your brand. There’s a steady supply of bottom-feeding journalists waiting to write the article “IKEA’s new system is sending hate messages to customers”.


Can you cite an example of a journalist writing an article that makes such an accusation?


We caught it about 2 million into a 15 million print run. I had to go through the control files and change the offending value or identify the pallet and bundle of mail it was in so we could remove and readdress the catalog. It was about 10 years ago and anyone who complained got a coupon.

The mailing house that did the work was fired though.


This is very interesting. Do you know how Amazon creates its order IDs? They are all numeric.


Correct me if I'm wrong, but it cannot be unpredictable, which makes the library redundant for security concerns, which would be the one business case for reaching for anything other than a UUID (which is already built into Ruby).


I didn't think of that, but this is a nice trick!


This used to have a totally different name iirc, they used to be called hashids


A recent Show HN from the dev (who's also in this thread) explaining the rebrand: https://news.ycombinator.com/item?id=38185437


Yeah, it says that both in the page title and the logo at the upper left


I've always thought it was called bijection.


It would be great to have a quick primer on why this is better than what people typically homebrew, like base62 encoding a random number.


If you use a random number then you need to store it somewhere to map back to the original. Sqids is an encoding, you can decode the sqid back to the original without storage overhead.

Features like the profanity filter avoid creating URL routes like /user/cuntFh.

Cross language support allows interop between the encoder and decoder across microservices written in different languages.


base62 decodes back to Int32 perfectly.


Database PKs usually aren’t random, which AFAIK is what is usually used as the number in this case.


Database PK really shouldn't be exposed like this. I think that's what they're getting at, though. The thing is, sequential database PK is still shorter and more readable than a Sqid.


I haven’t been able to find a case for this because ids either need to be unique or they’re not going to be large. If they’re unique, I’m using uuid or ulid (uuidv7 of tomorrow) as the sortable primary key type to avoid conflicts without using the db to generate and maintain sequences.

Where do you have unique ids that aren’t the primary key? I would be more interested in a retrospectively unique truncated encoding for extant ulid/uuid; ie given that we’ve passed timestamp foo, we know that (where no external data is merged) we only need a bucketed time granularity of x for the random component of the id to remain unique (for when sortability is no longer needed).

Or just more generally a way to convert a ulid/uuidv7 to a shorter sequence if we are using it for external hash table lookups only and can do without the timestamp component.


The idea is that you encode and decode database IDs with this. You wouldn't save them separately unless you were using it for a purpose other than shareable "identifiers" which don't leak significant amounts of database state. Imagine something like a link shortener where you want to provide a short link to users, but don't want it to just be a number.


My point was that this only encodes numbers and you wouldn’t use numbers as primary keys for the reasons outlined.


These leak the state just as much. They're reversible back to integers.


> Where do you have unique ids that aren’t the primary key?

For things that need public IDs, I always generate and store uuid4s* to avoid leaking database state or anything like that. Even uuid7 leaks creation timestamps. It's cheap and easy to map public IDs to/from PKs at the API boundaries.

* possibly encoded as something nicer like base64 for APIs or URLs


I’m currently using sqids as slugs to have a shorter url than just using my uuid primary key


Why would the number the sqid converts to not be your primary key, though? Why do you need the uuid at all?


why is ulid the uuidv7 of tomorrow?


Uuidv7 is inspired by ulid.


For a similar thing, (X bytes to X bytes, no collisions) https://en.wikipedia.org/wiki/Format-preserving_encryption is a good page


Also see Knuth Hash and k-dimensional equidistribution.


I wrote a Ruby gem to address this problem of hiding sequential primary keys that uses a Feistel network to effectively shuffle int64 IDs: https://github.com/abevoelker/gfc64

So instead of

    /customers/1
    
    /customers/2
You'll get something like

    /customers/4552956331295818987
    
    /customers/3833777695217202560
Kinda similar idea to this library but you're encoding from an integer to another integer (i.e. it's format-preserving encryption). I like keeping the IDs as integers without having to reach for e.g. UUIDs


I recommend reviewing my comment where I also use a Feistel cipher [1]; the difference is that it is not limited to int64, I can even use 8 bits. Obviously you lose security properties, but it works as an obfuscation method. If you use a random source with relatively few bits you should check for duplicates, while with the Feistel cipher you are sure there aren't any.

[1] https://news.ycombinator.com/item?id=38418198


On a side note, "Sqids ... is an open-source library that lets you generate YouTube-looking IDs from numbers.", "The main use of Sqids is purely visual."

If the purpose of it is to give a friendlier URL / ID, why not use something like friendly_id instead? (http://norman.github.io/friendly_id)

The url is readable and searchable through the history.

I would much rather prefer people using "www.website.com/channel/video/a-dog-walking" instead of "www.website.com/channel/video/3cXv8c".


These are called slugs <https://en.wikipedia.org/wiki/Clean_URL#Slug>. If you’re willing to commit to a persistent slug, please do: they are nicer. But for many applications, e.g. most user-generated content, you can’t be confident they won’t change. There’s a reason for the general advice to not use data as a primary key in databases, but to use a separate ID.


Actually I'm a bit disappointed that it can't format 128 bit integers or byte arrays. That would allow formatting UUIDs. I'm not a huge fan of public facing integer IDs. There is always the risk of leaking some kind of critical information with ascending IDs.

So I will probably keep Base64URL formatting my UUIDs to make them shorter for URLs, QR Codes and so on.

Quick example:

  20b30b32-d421-4cfb-bdbc-9a4e0475abea
  =>
  MguzICHU-0y9vJpOBHWr6g
It's important to keep in mind that there are different ways to convert a UUID to a byte array (little/big endian, order of segments, ...)
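
For reference, the basic round trip is only a few lines (a sketch using the straightforward big-endian hex byte order, which, per the caveat above, is just one of the possible conventions):

    // Shorten a UUID to 22 URL-safe characters by base64url-encoding its 16 raw bytes.
    function shortUuid(uuid) {
      return Buffer.from(uuid.replace(/-/g, ''), 'hex').toString('base64url');
    }

    function longUuid(short) {
      const hex = Buffer.from(short, 'base64url').toString('hex');
      return hex.replace(/^(.{8})(.{4})(.{4})(.{4})(.{12})$/, '$1-$2-$3-$4-$5');
    }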


Even if it could format 128 bit integers, I don't see how that would be any better than base64-encoding them. Base64 is already pretty close to the maximum efficiency you can get for shortening UUIDs while still using URL-safe characters.


It would add some of the additional features Sqids provides. You can also convert any smaller integer to a string with Base64.


I actually used the sqids algo for something very different, where I had to encode arbitrary-sized byte arrays. And with Ruby it was very simple to remove the limit; I think it was a matter of monkey-patching https://github.com/sqids/sqids-ruby/blob/main/lib/sqids.rb#L... to return infinity.

The reason for the limit is most likely to ensure interoperability with libraries in other languages where working with bignums is much more complicated.


This limit could easily be lifted with byte array encoding support.


The shortuuid library can do that:

https://pypi.org/project/shortuuid/


I offered something similar here [1], and it is used by many companies, including Philip Morris and the Argentinian tax agency, for the same purposes.

The technique I used (I should publish it as open source) is a Feistel cipher [2] with a key. The Feistel network can be adjusted to almost any size, and the key used in every round is an expansion of a general key using a key derivation function [3] (KDF3 if I remember correctly).

Basically it is a symmetric cipher of arbitrary size.

[1] https://www.nektra.com/products/secure-coupon-code-generator...

[2] https://en.wikipedia.org/wiki/Feistel_cipher

[3] https://en.wikipedia.org/wiki/Key_derivation_function
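
For readers who haven't seen the construction, here is a toy sketch of the core idea (a balanced Feistel network over 32-bit values split into two 16-bit halves; the round function and keys below are purely illustrative, and a real deployment would derive round keys with a KDF as described above and use more rounds):

    // Encrypts a 32-bit value; pass the same round keys with decrypt = true to
    // invert. Every input maps to a distinct output, so there are no collisions
    // within the 32-bit domain.
    function feistel32(x, roundKeys, decrypt = false) {
      let l = (x >>> 16) & 0xffff;
      let r = x & 0xffff;
      const keys = decrypt ? [...roundKeys].reverse() : roundKeys;
      for (const k of keys) {
        const f = (Math.imul(r ^ k, 0x9e3779b1) >>> 16) & 0xffff; // keyed round function
        [l, r] = [r, l ^ f];
      }
      return ((r << 16) | l) >>> 0; // final swap so reversed keys undo the rounds
    }

    const keys = [0x3a1f, 0x7c2d, 0xe99b, 0x1d04]; // illustrative only
    const code = feistel32(12345, keys);           // fixed-width, random-looking number
    feistel32(code, keys, true);                   // => 12345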


Interestingly, it seems there might have been independent invention of the same idea[0]. Have you checked for any patents in the space?

[0] https://bytes.grubhub.com/why-we-use-crypto-when-generating-...


Thank you for the article; I was not aware of it. My first implementation was done in 1997 or 1998. I have been involved in computer security for three decades and I was familiar with Feistel networks. I am not a cryptographer, but I have reverse engineered algorithms using this structure. I never checked patents because I think this is a natural construction, and I always wondered why there were no other public ideas around this problem. This is not the first time this has happened to me.


I'm not a cryptographer and did not understand half the things you said except symmetric crypto.

I think there are 2 problems with this approach:

How do you prevent the double-spend problem, for example duplicate entry tickets? You would have to mark the ticket as used in a central database anyway to prevent it.

What happens if the secret key material is compromised? Anyone can issue new valid numbers, etc..

Please correct me if I'm wrong.


No problem, the two answers here:

1/ You don't have duplicates because a Feistel network assures you there are no duplicates: every input has a different output with a fixed key in every round.

2/ If the secret is compromised, anyone can issue valid numbers, in the same way that exposing the secret keys you use to encrypt your data exposes that data. The idea is that the secret key material is never compromised, as is assumed in all security cases. You should keep custody of it with the right measures, based on the attack vectors you have. The custody of secrets is independent of this method.

Please let me know if you have more questions.


>The idea is that the secret key material is never compromised as it is assumed in all security cases.

That's not true; we have (perfect) forward secrecy, backward secrecy and key rotation mechanisms because we often care about what happens after the key is inevitably compromised. In this case the problem makes it hard to "rotate" the keys in a meaningful way, but I'm yet to see a proof that it's impossible.


I think we are talking about different things here or the conversation is not clear.

The assertion you mentioned is the "what" assumption, while what you said is the "how" we came to that assertion. Most probably you took my wording in a more specific sense than I intended. My answer was informal because we are here on HN and not writing a paper.


Feistel ciphers are a good technique for doing just this, but it's also worth noting that if all you are looking for is "produce a pseudorandom permutation of 1..N without actually shuffling a list of numbers", you can use an LFSR as well.


The difference is that the method is more secure than an LFSR.


of course, but an LFSR is going to be faster (provided you have a reasonable number of rounds in your Feistel cipher), have some situationally desirable statistical properties and is easier to adapt than a format-preserving encryption technique like a Feistel network.


Sorry, but faster, given the computing capabilities we have in every electronic device, is not significant in this application. I have not seen a customer inquiring about the speed of this process. They were looking for a method that is more secure than others.

For example, in the case of Philip Morris it was about winning prizes; imagine if they had tried other methods before and some smart people had reversed them.


Securecouponcodes.com is down. I was looking for a PHP example.


How do you adjust or evolve the blocklist with this, without making previously generated IDs incorrect?

The ID is simply incremented if it is blacklisted [1]. So the ID is fixed to the blacklist content, and adjusting it in any way invalidates certain segments of previously generated IDs?

1. https://github.com/sqids/sqids-rust/blob/9f987886bc06875d782...


They address it here: https://sqids.org/faq#future-blocklist

You're right that you can't update the blocklist in a backwards compatible way.


Didn't check the code, but you could encode the offset hash plus the offset used to avoid blacklisted words, and the decoder would decode any version of the hash (offset or not).


Include a version prefix perhaps?


I think the "unique IDs" part of the title throws people off and brings security to mind. "The main use of Sqids is purely visual" is what you need to know. It's not necessarily for IDs, it's just a user-friendly way to encode/decode numbers.


Also, why is it non-sequential? That suggests it's unpredictable, but it's not. Just realized that was the thing bothering me about this.


Exactly. An alphanumeric encoding for integers.


An argument could be made for calling it “compression of the string encoding”. Since HTTP is string oriented, that makes some sense as an optimization of sorts, perhaps.


It doesn't seem compressed, in fact the output of this is less compressible than the input. It's made for human interface.


I see many people in this thread saying that this is a good way to hide insights from IDs/numbers. I don't understand; aren't the generated values easily decoded? Couldn't I just decode a couple of numbers to get that insight? What am I missing?


The docs state:

  Not Good For:  
  [...]  
  User IDs  
    Can be decoded, revealing user count
So yeah, just using a sequential id and encoding the number with this library is not a viable idea if you want to hide your insights.


I noticed this too, and imo it's a design flaw that people get misled this way. Sequential inputs should yield sequential outputs, otherwise you might think it's meant to be unpredictable like SHA256.


Odd design decision in that if you provide your own blocklist, it overwrites their (extensive) default list instead of adding to it.

And in general the algorithm is surprisingly complicated for something that could be replaced with simple base64 encoding; the given example (1, 2, 3) base64-encodes to a string with just one more letter than this algorithm produces.

That said I do appreciate the semicolon-free-style. I don't typically see that in libs besides my own.

https://github.com/sqids/sqids-javascript/blob/main/src/sqid...


The problem is their block list will change over time. If you don't override it, then your IDs won't decode right when you update. This is a huge risk.

> You have to account for scenarios where a new word might be introduced to the default blocklist

https://sqids.org/faq#future-blocklist

Honestly, I think they need to rethink this. Otherwise you've got different library versions for different languages each using different default blocklists, none of which are compatible.


> If you don't override it, then your IDs won't decode right when you update.

They'll decode fine. Encoding might change.


I think you're right, but that violates the uniqueness guarantee. It doesn't seem like something that should change over time by default.


There explicitly is no uniqueness guarantee. It is known that the same values can encode into many different strings, and they recommend always decoding then reencoding with a stable algorithm to achieve a "canonical" id if one is needed.

https://sqids.org/faq#valid-ids


What’s the use case for passing in an array of numbers? Typically when generating an ID my input is either a single random number, a string that’s being hashed, or nothing at all.


[shard_number, primary_id_number, timestamp]
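
e.g., roughly (a sketch with the JS library; encode/decode per its README):

    import Sqids from 'sqids';

    const sqids = new Sqids();
    const id = sqids.encode([3, 1094, 1700900000]); // [shard, row id, unix timestamp]
    sqids.decode(id);                               // => [3, 1094, 1700900000]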


But then why not just arbitrary text?


Hashids seems way better than their new implementation


I'd love a short ID standard of some kind. There are so many situations where I just want a quick, ephemeral ID. Doesn't need to be cryptographically secure, collisions are expected. It would be nice to have something to import from the standard library in most languages.


Sqids vs Squids. Missing the 'U' for unique but nevertheless a unique shortened version of the regular spelling.


If the original numeric ID can be figured out from the sqid string, then what’s the point of the conversion?


The point is to make a long number more human-friendly. But I don't get why these are non-sequential then.


Unless we are talking about very long numbers, surely the numbers are easier for people to deal with, say over the phone, etc?


I was going to say it's for very long numbers, but I tried an example, and 123645634 becomes ARsz1pHw789c7ESzhy. The output is actually longer, and more complicated.

Looks like it uses a hash. I was expecting it to just convert the base except with something to skip profanity, which would give you something much shorter going base 10 to base ~36. Tbh I don't see why it's like this.


The original numeric ID(s) can only be decoded if you know the original alphabet that was used for encoding.


I wouldn't bank on that information not getting out. It may even be easy to reverse-engineer. Sqid says not to use it this way.


I wanted to say that I use similar project called HashIDs, but I see that HashIDs rebranded to Sqids :)


What we did 20yrs ago was generate random numbers and then ensure the # wasn't already used. We do this only for employee ID #'s and it works fine for us. It's definitely not ideal though.


Do we really need a library for that? Shouldn't it be a simple function?


A library is a good place to put (even simple) functions that are shared between a lot of programs.


Reminds me of proquints https://github.com/dsw/proquint

But 127.0.0.1 looks more "readable" to me than lusab-babad


Oh, they finally fixed the badly named "hashids"


Is there any way to generate short unique IDs from UUIDs? Snowflake is incredibly slow when joining UUID => UUID columns.


Isn't the thing that makes UUIDs UUIDs that they have enough bits that they are guaranteed to be unique without any synchronization?

I don't think you could reduce the amount of random bits (I guess there are some non-random parts in standard UUIDs but not a significant amount) while preserving that property unless you add back some other form of synchronization to ensure that there aren't collisions which seems like it would defeat the purpose.


If you know which type of uuid you have (v1, v4, etc) then you can take a look at how many bits of randomness it has, how many total items you have, and compute the probability of a collision if you just take a subset of the bits and use that as an ID.

In theory it's definitely possible. The 128 bits you get in a UUID is a LOT of randomness for an identifier. Postgres BIGINTs are just 64 bits. Instagram's sharded IDs are just 64 bits. (See below.)

You can test it. If you're using uuidv4 (which is 100% random bits, minus a few for the version), you could make a new column in your table in Snowflake, populate it with the first 64 random bits of your existing uuid column, then see if you have any collisions.

https://instagram-engineering.com/sharding-ids-at-instagram-...
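
To put rough numbers on it, the birthday bound is easy to compute (a sketch; the approximation holds while n is much smaller than 2^bits):

    // Approximate probability of at least one collision among n random k-bit values:
    // p ≈ 1 - exp(-n² / 2^(k+1))
    function collisionProbability(n, bits = 64) {
      return 1 - Math.exp(-(n * n) / Math.pow(2, bits + 1));
    }

    collisionProbability(1e8); // ≈ 0.00027
    collisionProbability(1e9); // ≈ 0.027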


Encoding the UUID in e.g. base 64 or Crockford’s base 32 (instead of the standard hex+dashes) saves you some space.


Was hoping for more of a high level technical explanation of how it differs from alternatives in the FAQ


saving it to my list of uid implementations https://github.com/swyxio/brain/blob/master/R%20-%20Dev%20No...


Why should you hide your user count?


The rate of change over time can be used against you; many people consider their businesses’ month-over-month growth (or lack thereof) to be private information.

“$WEBSITE did 50,000 signups a month during the beginning of the pandemic, but now struggles to sign up a thousand a week” is a story.


The name (but not function) seems really close to squuids from Datomic/Clojure.


Sad that it is not for user ids


I think that's only if you don't want to leak user count when your ID is an autoincrement. Elsewhere people mention cryptographicly remapping integers, which could work (by itself, or before passing the ID to sqids).


Makes me think of something totally different, which is how Urbit converts randomly generated numerical IDs into sorta memorable names.

For example, the number 5,702,400 becomes ~sorreg-namtyv. And 1,742,733,824 becomes ~master-morzod.

I enjoy the quirkiness of the names.


> Not Good For:

> User IDs - Can be decoded, revealing user count

Suppose you don't want to leak the count; what's a reasonable way of implementing that?

You can of course have a UUIDv7 / ULID or something as the primary key. Or have it as a public-facing primary key, mapping back to a sequential ID PK (there might be some performance hits with larger PKs in e.g. Postgres? or is that just FUD?).

But you could also generate a public ID with something like encrypt(seq_id, secret) and then encode it with whatever alphabet and or profanity filter you'd like - right? The issue then is that all public ID's would be long (and of course dealing with a decrypt operation on all incoming requests).

Don't know what's best really.


Add an offset, multiply by a large prime number, and modulo. I don't think you can recover the original number without figuring out the prime.


Ah, that's neat. Why is the offset necessary?


Might not be, but I like to start with a big number instead of 0 or 1 to fill all the bits. For example, if your prime is 100019, then your first number in binary is 00011000011010110011 but if your max number is something like 2^53 (00100000000000000000000000000000000000000000000000000000) then you have a lot of unfilled bits. The way I have mine set up, the output ID is always exactly 9 chars. 99% of the time it's naturally 9 chars because just by probability most of the numbers will be large, but some of them come out 8 chars and then I just pad up to 9 so it's nice and consistent.
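
Sketched out, with the modular inverse that recovers the original number (the constants are illustrative, and as noted elsewhere in the thread this is obfuscation, not encryption):

    const M = 2n ** 53n;      // modulus: keeps results in safe-integer range
    const PRIME = 100019n;    // the multiplier
    const OFFSET = 87654321n; // "start with a big number to fill the bits"

    // Extended Euclid, so the multiplication can be undone.
    function modInverse(a, m) {
      let [old_r, r] = [a % m, m];
      let [old_s, s] = [1n, 0n];
      while (r !== 0n) {
        const q = old_r / r;
        [old_r, r] = [r, old_r - q * r];
        [old_s, s] = [s, old_s - q * s];
      }
      return ((old_s % m) + m) % m;
    }
    const PRIME_INV = modInverse(PRIME, M);

    const obfuscate   = (n) => ((BigInt(n) + OFFSET) * PRIME) % M;
    const deobfuscate = (x) => (((x * PRIME_INV) % M) - OFFSET + M) % M;

    deobfuscate(obfuscate(42)); // => 42n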


What is the decimal range of these values?


Damn, those squids are getting smart


anyone have a copy pasta for the widest possible alphabet (i.e. extended unicode safe chars)


I don't get it, that's like two lines of code, why does it have a library and even a domain


What I don't get is that you can see the code, and it is clearly more than two lines, but you still claim that it's two lines of code.

Yes, it's not that complicated (about 160 lines of low-density python), but it's portable, and portable, reusable code goes into libraries.


Also wondering this


[flagged]


I have over 20 years of experience, I run a team, and I can code something like this in my sleep, as well as any average developer (but maybe "average" is lower than where it used to be in my day). Don't talk down to strangers on the Internet so arrogantly, you only end up embarrassing yourself.

I went to check what this code does (beyond a base conversion with a custom alphabet) and I laughed my ass off.

1. Literally TWO-THIRDS of the "library" is a clumsily put together ban list of slurs. Hardcoded, in the library. This DOES NOT belong there.

2. It also contains alphabet checks (like, say, repeated letters), which it'll run every single time it produces an ID, instead of statically checking the alphabet ahead of time and just using it in production... or even simply selecting a URL-safe alphabet and sticking with it.

3. Despite all the checks, block lists, the default alphabet doesn't do the most elementary thing when producing alpha-numeric ids: exclude letters and digits which look similar, and so will look identical in some fonts. Nasty when you have to write an ID from say a screenshot, or a poster, or what have you.

All in all, that's only borderline more utility than the infamous "leftpad" library, and it may actually be harmful in practice, because it contains a bunch of specific and/or superfluous concerns by the author, almost as if they're trying to justify why this is a library in the first place, but which will likely not align with the precise requirements of the users and their project business rules, while it excludes others (like the one I mentioned in point 3 up there). But I'm sure the kids will be impressed.


>Don't talk down to strangers on the Internet so arrogantly, you only end up embarrassing yourself

hoisted by your own petard there, gramps


You forgot to have a point.



This looks handy.

Given the retry-on-bad-word feature, I was sceptical of the no-collision claim -- but after looking at the JS source code, I'm confident it's correct.

For each number encoded, one character from the alphabet is "held back" to use as a delimiter (with the particular character chosen changing as each number is processed, presumably to make the output "look more random"). The very first character output is essentially a "free choice" that selects the initial permutation of the alphabet to use.

The algorithm is implemented as a function encodeNumbers() that calls toId() to encode each number, then checks for bad words and recurses with a new value of "increment" if it finds any. To prove correctness it's helpful to imagine an "in-between" function, tryEncodeNumbers(numbers, alphabet), which the outer encodeNumbers() calls in a loop to do most of its work (pseudocode):

  function encodeNumbers(numbers) {
    offset = sum(numbers)
    do {
      result =  tryEncodeNumbers(numbers, permute(alphabet, offset + increment++))
    } while (badWordIn(result))
    return result
  }
Here permute() is a function that permutes the characters in its first (string) argument according to its second (integer) argument in some arbitrary way.

The toId(num, alphabet) function, which simply encodes a single number using "digits" taken from the characters in the alphabet parameter, is clearly injective with respect to the num parameter provided that alphabet contains no duplicate characters -- that is, if we hold some duplicate-free alphabet string fixed, every distinct value of num produces a distinct encoded string as output. (For example, toId(42, "0123456789") gives "42", and no other value of num produces this string when the alphabet remains unchanged.)

tryEncodeNumbers(numbers, alphabet) first outputs a character representing alphabet (i.e., its initial alphabet -- which is its complete internal state), then joins together a bunch of these toId()-encoded numbers, with an extra character in between each that is known not to appear as a digit in the preceding number, permuting the alphabet in an alphabet-dependent but num-independent way for each encoded number output. Because the alphabet permutation is independent of the input array of numbers, this means that the alphabet used to encode the i-th number depends only on the initial alphabet and i. This means that, again holding its initial alphabet fixed, tryEncodeNumbers() is likewise injective with respect to the input array of numbers.

(Suppose it were not: Then there is an initial dupe-free alphabet, and two distinct arrays of numbers, that produce the same output. Find the first position where the two arrays differ. Since the final results are identical by assumption, one of the two encoded outputs for this position must be a prefix of the other. But if the two encoded outputs are of different lengths, the shorter one must either terminate the entire string, making it shorter than the other encoded string, or be immediately followed by a character that cannot appear in the output of toId() with the alphabet used at this position, both of which contradict the assumption that the resulting strings are equal. Therefore toId() must have output identical strings for the two different numbers at this position, when given the same alphabet. But this contradicts injectivity of toId(), so this (non-injectivity of tryEncodeNumbers()) is impossible.)

The final step is to see that, if encoding two inputs with the top-level encodeNumbers() function gives the same first character C, it must be because the first successful (bad-word-free) loop iteration for the first input passed the same alphabet to tryEncodeNumbers() as the first successful loop iteration for the second input. If the remainders of the two encoded strings are also equal, then by injectivity of tryEncodeNumbers() for fixed alphabet choice their input number arrays must have also been the same. Since no restrictions were placed on the two inputs, this holds for all possible input pairs -- that is, it is impossible for encodeNumbers() to produce the same encoded output string for two different inputs.


Skipping profanity seems like a liability in this design. It means in order to preserve the encoding you need to make the banned word list immutable, otherwise old sqids will decode to the wrong thing when you get them back.


The stupid simple way I did this ages ago was:

1. Start with a-z.

2. Drop all vowels, numbers, most homoglyphs, and the letter 'x'.

3. Map digits 0-9 to one of the remaining letters.

4. Stringify the integer and replace the digit in each decimal place with its corresponding character.

For my use-case, all the numbers were >7 digits long, so the odds of you getting an offensive acronym were reasonably low unless you started combining them.
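
In code the whole scheme is tiny (a sketch; the particular ten-letter mapping below is just one arbitrary choice satisfying those rules):

    const DIGIT_MAP = 'bcdfghjkmp'; // ten consonants: no vowels, no l/i/o/x

    const encode = (n) => String(n).split('').map((d) => DIGIT_MAP[+d]).join('');
    const decode = (s) => Number([...s].map((c) => DIGIT_MAP.indexOf(c)).join(''));

    encode(123645634);   // => "cdfjghjfg"
    decode('cdfjghjfg'); // => 123645634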

But there's no perfect solution. As this dataset shows, you can find offense in almost anything if you look hard enough:

California Personalized License Plate Requests Flagged for Review 2015-2016: https://docs.google.com/spreadsheets/d/18IUVU9Q4uN_lxqNd5AsN...


Many of those reviewer comments are utterly moronic. And that is my polite opinion.

How does this work? Is there a review board? Is it put to public review? A few of them like "dick out" and "shtlord" are reasonable, but many of them seem so bonkers it looks like the work of trolls.

Anyway, TIL that 1970s Intel was a MS-13 gang outfit and that Octocat really means "eight vaginas".


> Many of those reviewer comments are utterly moronic.

Reason for review: hostile, insulting, or degrading

/s


WONTFIX: behaves as intended.


Most numbers can be used as letters or phonemes.

I could give a fuck about avoiding swear words, but if you want to avoid slurs and eyebleach-inducing ideas and still have any sort of compact representation, I suspect we have to look not at problematic letters but problematic groups of letters. There's nothing intrinsically wrong with the letter E. Not with G, I, N, or R, but you can sure get a lot of attention you don't want by arranging them in the wrong order. K and Y aren't bad either, unless you're hating on Jewish people.

So maybe there's a 5:4 or a 5:3 encoding out there where you avoid making syllables.


> California Personalized License Plate Requests Flagged for Review 2015-2016: https://docs.google.com/spreadsheets/d/18IUVU9Q4uN_lxqNd5AsN...

Wow, this is a funny peek into a weird predicament where people need to justify that they have a good reason to have a specific license plate.

Some seems obviously ok such as:

INT13H

314 PI

And some are obviously not:

DRY(hand emoji)JOB

DICK OUT

Come to think of it: Can license plates have emojis now?!


California allows one hand, star, or heart shape in the plate.


I don't think this holds; you can enforce filtering in the encoding step, i.e., be strict about what you output, but always decode, even if the input is profanity. This means you can also be backwards compatible if you update the list, etc. So, in short, the old maxim of being strict about your outputs and lenient about your inputs.


From their FAQ: "The best way to ensure your IDs stay consistent throughout future updates is to provide a custom blocklist, even if it is identical to the current default blocklist."


The *encoding* changes. The decoding stays consistent:

> Decoding IDs will usually produce some kind of numeric output, but that doesn't necessarily mean that the ID is canonical. To check that the ID is valid, you can re-encode decoded numbers and check that the ID matches.

The reason this is not done automatically is that if the default blocklist changes in the future, we don't want to automatically invalidate the ID that has been generated in the past and might now be matching a new blocklist word.


In that case it sounds like a shortcoming on their part. There is no fundamental reason to have that limitation. I understand it can make the implementation easier to not have it, but in my opinion being blocklist change agnostic would be a much better value offering.


Agreed, this is a big risk, made worse by the fact that the default word list can change over time.

https://sqids.org/faq#future-blocklist


It should at least take a blocklistVersion parameter (or similar), even if passing the wrong version just generates an error.


It's a base62 encoder that takes multiple integers as input. Probably a bit-length prefixed encoding. I am assuming it just pads an extra junk integer to re-roll the encoded number.


I guess they haven't heard of base-64.


That doesn't solve the same set of problems as TFA. Randomized output order for sequential input, skips IDs that include profanity.


There is no need for randomised output, and they've said themselves that you can recover the input from the output. It's literally just so that it looks like something cool and technological is happening. You can skip IDs that contain base64 profanity just by doing the reverse encoding on your block list and skipping those IDs. Or you could exclude vowels from your alphabet and do a base 54 encoding. Or you could grow up and accept that your 4 millionth ID which no one is going to read might have a naughty word in it (horror).


There's no info on how this actually works. Actually, I have no idea what it does in the first place. From the FAQ:

> How can I make my IDs unique?

> The library accepts a custom alphabet from which it can generate IDs. Simply pre-shuffle the default alphabet that's provided.

I also have no idea how to use the library. What are the 3 number parameters for encode()? The actual code: https://github.com/sqids/sqids-ruby/blob/main/lib/sqids.rb

I'd simply base-58 a GUID instead. At least you'll know what kind of constraints / sorting the IDs will have.



One of the points is also being able to use a custom alphabet.



