I appreciate that the author clearly states that security, i.e., output can't be reversed back to the input, is a non-requirement. We can't criticize the author too much for that either, because, as a rule, "random-looking id generator" algorithms will always be either not secure, or not short, or not collision-free. Or they'll be a key-value database.
A secure "random-looking id generator" is called a block cipher. Block ciphers with less than 128 bits of output are widely considered insecure: this corresponds to about 22 base64 characters.
Going further, you probably do want to use a secure algorithm. History is full of people thinking "we don't need this to be secure, it's not used for security-sensitive things" and later regretting it. Some case studies at [1].
It's good to consider this but... Plenty of sites expose user IDs as regular integers. In some cases you might want to avoid this (leaking user counts to competitors, etc.), but I have never heard of anyone calling this a vulnerability.
It's referred to as an "Insecure Direct Object Reference" (IDOR) vulnerability. In many cases it is not actually a vulnerability; however, when an application contains sensitive information and lacks authorization or rate limiting, it can be exploited to enumerate the entire database.
When I first joined $company, HR sent me a SharePoint document with a numerical ID. Incrementing or decrementing the ID allowed me to view personal information of other employees including their pay.
I realize now the site lists "One-Time Passwords" as a use-case. I'm a positive-vibes-only HN poster though, so will leave it as an exercise to the reader to decide whether that's going to turn out well or not.
The mention of one-time passcodes seems odd. Those need to be unguessable, but don't need to be unique. If you supply a suitable random source, then I suppose it works, but the "padded with junk" feature makes these look more complex than they really are.
The standard choice of 4 to 8 random digits works well, and it's clear what level of security they provide. Digits are easier to understand than case-sensitive Latin characters, especially when your native language uses a different character set.
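A minimal sketch of that standard choice, assuming Node.js and its CSPRNG-backed crypto.randomInt:

    const crypto = require('crypto');

    // A 6-digit one-time passcode: uniform over 000000..999999 (~20 bits of entropy),
    // drawn from a CSPRNG, zero-padded so leading zeros are kept.
    const otp = String(crypto.randomInt(0, 1_000_000)).padStart(6, '0');
    console.log(otp);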
I think you completely misunderstood the article, or you have not read it.
The system's uniqueness comes from the fact that two different numbers will never produce the same id.
The padding is only relevant for things like user ids.
You can also change the alphabet it uses so it's not case sensitive; using the alphabet "ABCDEFGHJKLMNPQRSTUVWXYZ0123456789" and a minimum length of 8, it will produce 8-character ids all the way up to 4294967295 (0xffffffff).
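For example, with the JavaScript package (the alphabet and minLength options are documented constructor options; the exact output is illustrative, not guaranteed):

    import Sqids from 'sqids';

    // Case-insensitive alphabet (no I or O) and padding to a minimum of 8 characters.
    const sqids = new Sqids({
      alphabet: 'ABCDEFGHJKLMNPQRSTUVWXYZ0123456789',
      minLength: 8,
    });

    const id = sqids.encode([4294967295]); // an 8-character id for 0xffffffff
    console.log(sqids.decode(id));         // [4294967295]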
[1]: though with UUID v4 so easy to generate and well optimized in most languages, I wonder if these userland solutions are really better. You can always generate a UUID and re-encode it with base32 or base64, which is also well optimized in most languages.
In general, I'm a bit wary of solutions that "guarantee no bad words" – this is usually highly language-specific: one language's perfectly acceptable name is another language's swear word.
We use it in a highly internationalized product spanning multiple languages and haven't yet run into a complaint, or a value flagged on audit, that would constitute something offensive in any language, per our international content teams anyway.
That isn't to say it's 100% (and, simply enough, we don't audit every single URL), but I suspect we would have gotten at least a heads up from a user by now.
Nevertheless, we are moving our approach to UUIDs that get base32 encoded for some of our use cases. They're easier for us to work with in many scenarios.
It's particularly funny because their example docs for .NET output "B4aajs", which any Swedish l33t-speaking individual would read as "Bajs", which means "shit".
> That doesn't seem possible. How would that work?
agree; b00b, DlCK, cntfcker
But I suppose, if the user doesn't get to craft the input, the collision space between converted numerical ids and words like the above is sufficiently small to be ignorable.
Besides vowels, nanoid excludes 0, 1, 3, 4, 5, I, l, x, X, v, V, and other lookalikes, so the chances of generating something naughty in any language are close to zero.
From a quick look, the lists are pretty short, except for the English one, which has some 404 words, but I can imagine there are far more bad words you'd want to avoid than just those?
Grepping naughty words out of randomly generated text strictly weakens the information content if you're using it for a secure application, but it is often necessary.
In the early dotcom era, the company I worked for was about to go live, and the final step was demoing the end-to-end flow to the CEO. I had done the back-end stuff and hadn't paid much attention to the front end. The person who built the account-creation process wanted to nudge people toward memorable yet strongish passwords, so on account creation it would generate a random password, which he did by choosing two four-letter words at random from the unix dictionary and putting a two-digit number between them. He ran that past me as an idea and I thought "yeah, good idea" and didn't think more of it.
However, he forgot to first grep out all the naughty words, so when we demoed it to the CEO/non-technical founder, both of the words in his randomly generated password were swearwords.
Side note: there are some business insights you can get from a company using serial ids.
i.e., if you sign up and get user id 32588 and make another account a few days later, you can estimate the growth rate of the company.
And this is possible with every resource type in the application.
I do wonder how much the url bar junk thing matters these days. I tend to use ULIDs (while waiting on wide uuid v7 adoption), and they're a bit ugly, but most browsers hide most of the URL now anyway. The fact that there's a built-in time component comes in clutch sometimes (e.g. object-merging rules).
You can even do this when you don’t know the exact interval by using probabilities. The Allies used this method to estimate German tank production in World War II by analyzing the serial numbers of captured or destroyed tanks.
I’m a lawyer and using sequential IDs in a fraud case right now, to determine the number of victims.
Unfortunately, so far I only have the IDs of two victims, and those are from within about a month of each other, whereas the fraud has likely been going on for several years. Simply extrapolating that growth rate isn't going to be very accurate.
Also, I suspect that the perpetrators did not start at ID 1.
It would be significantly above the average unless the company is ridiculously top-heavy or has shockingly little variation in salary. Or if the "salary" for the CEO ignores certain compensation (e.g., paid a salary of $1 plus stock options).
Sure thing. I could have worded it better, but I was trying to say that it would be much more skewed if the two samples were, say, CEO and the CFO, or two janitors.
Even with n=1 you can get something useful. IIRC, "on average", if you have ID x then the best population estimate is 2*x. Of course the error margin is immense, but it's still better than nothing.
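This is the "German tank problem"; the standard frequentist minimum-variance unbiased estimator is simple enough to sketch (the function name is mine):

    function estimatePopulation(serials) {
      // German tank problem, frequentist MVUE: N_hat = m + m/k - 1,
      // where m is the largest serial observed and k is the sample size.
      const m = Math.max(...serials);
      const k = serials.length;
      return m + m / k - 1;
    }

    console.log(estimatePopulation([32588]));          // 65175: with one sample, ~2x - 1
    console.log(estimatePopulation([19, 40, 42, 60])); // 74: m=60, k=4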
I see so many organizations add weird slowdowns from the technical debt associated with this. I reflect on some of the most successful tech businesses of the last decade and remember that all their APIs exposed this kind of data early on, and many still do.
Does anyone have an example they can reference of a business being harmed by this information being out there?
I don't know any stories about digital businesses, but there was a case where someone went and counted customers at the door, only to realize that the company was lying in its yearly reports.
So I guess that kinda harmed that fraudulent business strategy…
At an internship long ago, my boss instructed me to always add a few extra to the auto-incremented order ID so customers couldn't guess how business was going if they happened to place several orders in a row.
The solution I prefer is to simply encrypt the data, such as IDs.
Instead of giving the user an ID in the response, the user gets hmac(cipher(Data, secret_key), secret_key) + cipher(Data, secret_key), and then some simple pre-request handler just iterates over query params / form data and decrypts them if the signature matches.
It also works as a really nice CSRF protection, as the user ID of the currently signed-in user can be embedded into Data and checked: current user.id == decrypted data.id.
Another nice advantage is that you can deny the request right at the beginning, as you know ahead of time that the provided data is not valid (signature doesn't match), saving some DB queries.
The downside is that the URL gets pretty long, but if that's hidden by the browser or the user doesn't care, it's a non-issue.
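A minimal Node.js sketch of that scheme as I understand it (AES in a deterministic single-block mode plus a truncated HMAC tag; all names, the key derivation, and the 8-byte tag length are my own illustrative choices):

    const crypto = require('crypto');
    const KEY = crypto.scryptSync('secret_key', 'some-salt', 32); // illustrative key derivation

    function sealId(id) {
      // Deterministic: AES-256-ECB over exactly one 16-byte block (no IV needed for one block).
      const cipher = crypto.createCipheriv('aes-256-ecb', KEY, null);
      cipher.setAutoPadding(false);
      const block = Buffer.alloc(16);
      block.writeBigUInt64BE(BigInt(id), 8);
      const ct = Buffer.concat([cipher.update(block), cipher.final()]);
      const tag = crypto.createHmac('sha256', KEY).update(ct).digest().subarray(0, 8);
      return Buffer.concat([tag, ct]).toString('base64url'); // hmac(cipher(...)) + cipher(...)
    }

    function openId(token) {
      const buf = Buffer.from(token, 'base64url');
      const tag = buf.subarray(0, 8), ct = buf.subarray(8);
      const expected = crypto.createHmac('sha256', KEY).update(ct).digest().subarray(0, 8);
      if (!crypto.timingSafeEqual(tag, expected)) return null; // reject before touching the DB
      const decipher = crypto.createDecipheriv('aes-256-ecb', KEY, null);
      decipher.setAutoPadding(false);
      return Buffer.concat([decipher.update(ct), decipher.final()]).readBigUInt64BE(8);
    }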
Can confirm, when I worked in VC we used this to verify order volume for a number of startups we were evaluating. For one startup, I wrote a bot to place a small order a few times a day, and log the order number.
Did you use this for due diligence, to verify that the data reported by the entrepreneur in the pitch was accurate, or were you looking at companies in a specific space and checking which one had more orders?
While I tend to strip the tracking params and there are extensions that do this, I don't think most people do. These URLs are pretty 'ugly'.
So if the links that are being shared most on the internet (YT, TikTok, Twitter) don't care, you probably shouldn't either. I think the onus is on the UI layers (Chat apps, etc) to show urls how they look best on their respective platforms.
Edit: to this point, it looks like HN truncates these to make them less ugly too.
You don't get the cardinality of the data type, just when the object was created.
There are probably some business cases where the "when" information is potentially useful (I can't think of any), but you can't know, for example, how many users are in the database.
It's usually benign, but why encode any info into your public IDs? I wouldn't go anywhere near that.
It can make sense for some situational internal database use case where you want temporal locality and can't use full sequential since it's distributed, and even then your DBMS might recommend something else, e.g. Spanner explicitly says not to do this. And it doesn't need to be exposed to users.
Assuming the alternative is a fully random key, this can wreak havoc for performance depending on the database engine and index type used (lots written on this topic).
But I do agree: if performance isn't an issue with your db choice and you're not interested in getting a free "created_at", you might as well go fully random.
Primary keys have to be chosen carefully because they impact disk layout, joins, etc, and full random makes bad PKs in certain distributed DBs. But it's simple and cheap to convert a public user ID (full random) to/from internal row keys (sequential-ish) at the API boundaries using a secondary index or even a cache.
(Like, simple/cheap enough that it's probably worth doing instead of exposing your row keys. Maybe in some careful cases exposing uuid7 row keys makes sense, but not nearly enough to recommend that so broadly. It's not a safe default.)
It's weird that under "Get Started" they have links to 40 different languages. You can only get started with 15 of the 40 languages listed; the other 25 are skeleton repos asking people to star the repo to indicate interest.
It’s kinda clever. The people most likely to look at this project are also likely ideal candidates for implementing the library in a new language (Developer, FOSS enthusiast, interested in the project, need the library in a language they’re familiar with that isn’t implemented yet).
Also, the language pills differentiate between those that have been implemented (color logo, dark text, bold) and those that aren’t (grayscale).
Good points. Those pages also contain links to old implementations (Hashids), because a lot of projects still use those and want to be able to find them.
The approach definitely works. Some time ago I saw .NET listed but discovered it wasn't complete. I was eager to replace an existing Hashids implementation, so I made some comments and shared a starter snippet, and then someone was excited enough to complete it in just a few days. It was great to see how quickly the community stepped in. Maybe there was a bit of Cunningham's Law in effect with my contribution, ha.
The second example in each language sample, where the generated sqid ends up being "B4aajs", essentially reads as "P0ooop" to a Swedish speaker.
Which is fine, they don’t propose to filter “bad” words in other languages, but kind of funny when that’s one of the highlighted examples, right next to the goal of filtering words. Goes to show how hard it is to filter profanity generally for international audiences
Could you just remove vowels and hit 99.9% of profanity in all languages? Ditto for removing their 0-9 equivalents, if you're really worried about it. There's quick out-of-the-box support for that via the ability to define a custom alphabet.
Well until we figure out a way to remove pattern matching from humans... use GUIDs if that's an issue. Removing vowels fixes "spelling almost all bad words explicitly", though I'm open to being proven wrong with fun new swears in exotic (to me) languages :)
The problem of "pick any N symbols that don't make any profanity in any language across all time" isn't what this is solving, nor should it have to. Take the same concept but use whitelisted words to build the token if you're that averse to computer-generated, fill-in-the-blank naughty words. Keep "pen" and "island", among other things, off that list ;)
We're using randomly generated strings for many things. IDs, password recovery tokens, etc. We've generated millions of them in our system, for various use-cases. Hundreds of thousands of people see them every day.
I've never heard any complaints about a random content-id being "lR8vDick4r" (dick) or whatever.
But nowadays our society is so afraid of offending anyone that profanity filters have extended all the way to database IDs and password recovery tokens.
(there are some legit cases, like randomly generated IDs for user profiles shared in public URLs, that users have to live with, but even there, just make the min length 8 and you're unlikely to have any full-word profanity as the complete ID; put differently, I don't understand why they made the block list an opt-out thing)
The block list is 2/3 of the (minified) library. I found this entire choice odd.
First, it's highly incomplete because you can find at least 10x more combinations spelling the same "word". And probably 10x more slurs that aren't in this block list. Second, because it's hardcoded in your source. Third, because there are more elegant solutions.
Such as picking an alphabet that can't spell readable words unless you're trying really hard to read a slur into it. Say this (no vowels or digits): bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ
The full lower+upper+digits alphabet they use is 62. Feels like you're losing a lot, but... not really.
- A 128-bit id in base 62 = 22 letters.
- A 128-bit id in base 42 = 24 letters.
JUST TWO MORE LETTERS. And it's one more letter for a 64-bit id (11 vs 12). And we can avoid this entire silliness. The problem is the author doesn't realize that log N is... logarithmic, I suppose.
A slight mod: I'd remove Y, despite it not exactly being a vowel, and add back the digits that can't be interpreted as vowels:
bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ25679
Gives us base 45. And below is a JS snippet to make an id. There's your lib.
function id(num) {
  // Encode a non-negative integer in base 45, least-significant digit first.
  num = BigInt(num);
  const dict = "bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ25679";
  const base = BigInt(dict.length); // 45
  let id = '';
  while (num > 0n) {
    id += dict[Number(num % base)];
    num /= base;
  }
  return id || dict[0]; // 0 maps to the first character
}
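And the matching decoder, assuming the same least-significant-digit-first layout as the snippet above:

    function num(id) {
      const dict = "bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ25679";
      let n = 0n;
      // Digits were emitted least-significant first, so fold from the end.
      for (let i = id.length - 1; i >= 0; i--) {
        n = n * BigInt(dict.length) + BigInt(dict.indexOf(id[i]));
      }
      return n;
    }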
I'm actually more convinced the problem is corporate risk management dealing with the tendency of social media to overhype issues by design, rather than a statement on society.
There are other "general" rules when it comes to random human-readable tokens, such as not using O and I if your strings include numbers - people can and will confuse them with 0 and 1 if they have to type them over.
Most gift card tokens, for example, don't allow the use of those two (or quietly correct it) to avoid that mistake.
California driver's license and ID numbers are one letter followed by seven digits. But if the character in the second position is a zero, everybody reads it as the letter O.
In a Ruby app we just convert to a high base, like
> 1234567890.to_s(36)
=> "kf12oi"
That gets us most of the way there, but Sqids has a Ruby library and lets you set a much higher base, including upper-case characters and, I suppose, emoji. We're going to need much bigger numbers before that space saving makes much difference. I like it, but it's hard to know when something like that is worth adding a dependency.
Ironically, its default encoding is base 55. I figured if you're going to use emoji you should at least use the size of the space to maximize the number of bits per grapheme beyond what is possible with ASCII, but apparently not.
Which I guess is the part where senior (citizen) programmers like me get triggered as promised in the README.
The `salt` in hashids just shuffled the alphabet. Now they've removed the `salt`, but you can have the same level of "obfuscation" as before if you shuffle the alphabet yourself before calling the library.
It looks like an easy brute force too; there are no compute-hard operations here. I guess you could scramble your alphabet? Otherwise "Uk" always comes after "bM", etc.
My understanding is that you can re-order the source alphabet and encode numbers with swapped characters. Unless you know of 36 numbers that are exactly 1 id apart, you will always have uncertainty about what the ids actually map to. I guess given a large, large number of ids along with the order they're assigned (which might be given through the time at which they're assigned), you could make a probabilistic statement about the actual order of the 36 characters, based on the fact that most numbers increase/decrease faster the bigger they are. (This fact is e.g. used to detect fabricated bank statements: if the first digit of any number in the statement is equally likely to be 1 or 9, the numbers are randomly generated, whereas if they're real, 9 is less likely than 1.)
Scrambling the source alphabet should have an effect similar to a monoalphabetic substitution cipher. This is not strong cryptography. If the attacker has any ability to generate IDs quickly, like by creating user accounts or other resources, they can create many IDs with known ordering. Likely effective against non-serious attempts only.
I'd prefer to use Crockford-encoded entropy with Stripe-style token prefixes to create unique ID namespaces. Run it through a bad-words filter, and it's perfect.
user_1hrpt0xpax7ps
file_xpax7psaz0tv6az0tv6
Etc.
In distributed systems you can use the trailing bytes to encode things like author cluster, in case you're active-active and need to route subsequent writes before create event replication.
Easy to copy, debug, and run ops/on-call against. If you have an API, they're user-friendly.
Of course you still want to instruct people that the prefixes are opaque.
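A sketch of that token shape (lowercase Crockford base32 over random bytes; the helper name, the 10-byte default, and the example prefixes are illustrative):

    const crypto = require('crypto');
    const CROCKFORD = '0123456789abcdefghjkmnpqrstvwxyz'; // no i, l, o, u -- fewer misreads

    function token(prefix, nbytes = 10) {
      let bits = 0n;
      for (const b of crypto.randomBytes(nbytes)) bits = (bits << 8n) | BigInt(b);
      let out = '';
      for (let i = 0; i < Math.ceil((nbytes * 8) / 5); i++) { // 5 bits per character
        out = CROCKFORD[Number(bits & 31n)] + out;
        bits >>= 5n;
      }
      return `${prefix}_${out}`;
    }

    console.log(token('user')); // e.g. "user_16d9rs0v8m2xh1pt"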
Yeah don’t forget the bad words filter. I worked on an IKEA mailing where the list processing house was adding an autogenerated discount code to the address label. The customers received codes with BOOB, DICK, TWAT, and CUNT embedded within. People were not happy.
Did they never make an IKEA purchase after that or did they get over it like a normal adult?
I don't work retail, but something tells me people will make a stink out of just about anything if it meant potentially free products or other compensation.
Plus, are you filtering just English curse words or all curse words for countries that use Latin characters?
The risk is not in offending someone, but in that someone posting the rude string on social media in real or mock indignation, causing the outrage machine to turn on your brand. There’s a steady supply of bottom-feeding journalists waiting to write the article “IKEA’s new system is sending hate messages to customers”.
We caught it about 2 million into a 15 million print run. I had to go through the control files and change the offending value or identify the pallet and bundle of mail it was in so we could remove and readdress the catalog. It was about 10 years ago and anyone who complained got a coupon.
The mailing house that did the work was fired though.
Correct me if I'm wrong, but it cannot be unpredictable, which makes the library redundant for security concerns - and that would be the one business case for seeking anything other than a UUID (which is already built into Ruby).
If you use a random number, then you need to store it somewhere to map back to the original. Sqids is an encoding: you can decode the sqid back to the original without storage overhead.
Features like the profanity filter avoid creating URL routes like /user/cuntFh.
Cross language support allows interop between the encoder and decoder across microservices written in different languages.
A database PK really shouldn't be exposed like this. I think that's what they're getting at, though. The thing is, a sequential database PK is still shorter and more readable than a Sqid.
I haven't been able to find a case for this, because ids either need to be unique or they're not going to be large. If they're unique, I'm using UUID or ULID (the uuidv7 of tomorrow) as the sortable primary key type, to avoid conflicts without using the db to generate and maintain sequences.
Where do you have unique ids that aren't the primary key? I would be more interested in a retrospectively unique truncated encoding for extant ULIDs/UUIDs; i.e., given that we've passed timestamp foo, we know that (where no external data is merged) we only need a bucketed time granularity of x for the random component of the id to remain unique (for when sortability is no longer needed).
Or just more generally a way to convert a ulid/uuidv7 to a shorter sequence if we are using it for external hash table lookups only and can do without the timestamp component.
The idea is that you encode and decode database IDs with this. You wouldn't save them separately unless you were using it for a purpose other than shareable "identifiers" which don't leak significant amounts of database state. Imagine something like a link shortener where you want to provide a short link to users, but don't want it to just be a number.
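A sketch of that shortener shape using the JavaScript package (the route shape and minLength are illustrative; the decode-then-re-encode check is what the Sqids FAQ recommends for canonical ids):

    import Sqids from 'sqids';
    const sqids = new Sqids({ minLength: 5 });

    // Outbound: the row's integer PK becomes the short path -- nothing extra to store.
    const shortPath = (rowId) => `/l/${sqids.encode([rowId])}`;

    // Inbound: decode, then re-encode and compare to reject non-canonical ids.
    function rowIdFromPath(id) {
      const nums = sqids.decode(id);
      if (nums.length !== 1 || sqids.encode(nums) !== id) return null;
      return nums[0];
    }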
> Where do you have unique ids that aren’t the primary key?
For things that need public IDs, I always generate and store uuid4s* to avoid leaking database state or anything like that. Even uuid7 leaks creation timestamps. It's cheap and easy to map public IDs to/from PKs at the API boundaries.
* possibly encoded as something nicer like base64 for APIs or URLs
I wrote a Ruby gem to address this problem of hiding sequential primary keys that uses a Feistel network to effectively shuffle int64 IDs: https://github.com/abevoelker/gfc64
Kinda similar idea to this library, but you're encoding from an integer to another integer (i.e. it's format-preserving encryption). I like keeping the IDs as integers without having to reach for e.g. UUIDs.
I recommend reviewing my comment where I also use a Feistel cipher [1]; the difference is that it is not limited to int64 - I can use even 8 bits. Obviously you lose security properties, but it works as an obfuscation method. If you use a random source with relatively few bits you have to check for duplicates, while with the Feistel cipher you are sure there aren't any.
On a side note, "Sqids ... is an open-source library that lets you generate YouTube-looking IDs from numbers.", "The main use of Sqids is purely visual."
If the purpose of it is to give a friendlier url/id, why not use something like friendly_id instead? (http://norman.github.io/friendly_id)
The url is readable and searchable through the history.
I would much rather prefer people using "www.website.com/channel/video/a-dog-walking" instead of "www.website.com/channel/video/3cXv8c".
These are called slugs <https://en.wikipedia.org/wiki/Clean_URL#Slug>. If you’re willing to commit to a persistent slug, please do: they are nicer. But for many applications, e.g. most user-generated content, you can’t be confident they won’t change. There’s a reason for the general advice to not use data as a primary key in databases, but to use a separate ID.
Actually, I'm a bit disappointed that it can't format 128-bit integers or byte arrays. That would allow formatting UUIDs. I'm not a huge fan of public-facing integer IDs; there is always the risk of leaking some kind of critical information with ascending IDs.
So I will probably keep Base64URL-formatting my UUIDs to make them shorter for URLs, QR codes and so on.
Even if it could format 128 bit integers, I don't see how that would be any better than base64-encoding them. Base64 is already pretty close to the maximum efficiency you can get for shortening UUIDs while still using URL-safe characters.
I actually used the sqids algorithm for something very different, where I had to encode arbitrarily sized byte arrays. And with Ruby it was very simple to remove the limit - I think it was a matter of monkey-patching https://github.com/sqids/sqids-ruby/blob/main/lib/sqids.rb#L... to return infinity.
The reason for the limit is most likely to ensure interoperability with libraries in other languages where working with bignums is much more complicated.
I offered something similar here [1], and it is used by many companies, including Philip Morris and the Argentinian tax agency, for the same purposes.
The technique I used (I should publish it as open source) is a Feistel cipher [2] with a key. The Feistel network can be adjusted to almost any size, and the key used in every round is an expansion of a general key using a key derivation function [3] (KDF3 if I remember well).
Basically it is a symmetric cipher of arbitrary size.
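As a concrete illustration of the shape (not the poster's implementation): a toy 4-round balanced Feistel permutation over 32-bit ids, with HMAC-SHA256 standing in for the keyed round function:

    const crypto = require('crypto');

    // Bijection on 32-bit integers: encrypting maps every id to a unique scrambled id,
    // and running the rounds in reverse inverts it. Toy parameters -- not a vetted cipher.
    function feistel32(id, key, decrypt = false, rounds = 4) {
      const F = (half, i) => crypto.createHmac('sha256', key)
        .update(Buffer.from([i, half >> 8, half & 0xff]))
        .digest().readUInt16BE(0); // keyed round function, truncated to 16 bits
      let l = (id >>> 16) & 0xffff, r = id & 0xffff;
      if (!decrypt) for (let i = 0; i < rounds; i++) [l, r] = [r, l ^ F(r, i)];
      else for (let i = rounds - 1; i >= 0; i--) [l, r] = [r ^ F(l, i), l];
      return ((l << 16) | r) >>> 0;
    }

    const obfuscated = feistel32(12345, 'secret');
    console.log(feistel32(obfuscated, 'secret', true)); // 12345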
Thank you for the article; I was not aware of it. My first implementation was done in 1997 or 1998. I have been involved in computer security for three decades and I was familiar with Feistel networks. I am not a cryptographer, but I have reverse engineered algorithms using this structure. I never checked patents, because I think this is a natural construction, and I always wondered why there were no other public ideas around this problem. This is not the first time that has happened to me.
I'm not a cryptographer and did not understand half the things you said, except symmetric crypto.
I think there are 2 problems with this approach:
How do you prevent the double-spend problem, for example duplicate entry tickets? You would have to mark the ticket as used in a central database anyway to prevent it.
What happens if the secret key material is compromised? Anyone can issue new valid numbers, etc.
1/ You don't have duplicates, because a Feistel network assures there are no duplicates: every input has a different output with a fixed key in every round.
2/ If the secret is compromised, anyone can issue valid numbers, in the same way that exposing your secret encryption keys exposes your encrypted data. The idea is that the secret key material is never compromised, as is assumed in all security cases. You should safeguard it with the right measures, based on the attack vectors you have. The custody of secrets is independent of this method.
>The idea is that the secret key material is never compromised as it is assumed in all security cases.
That's not true; we have (perfect) forward secrecy, backward secrecy and key-rotation mechanisms because we often care what happens after the key is inevitably compromised. In this case the problem makes it hard to "rotate" the keys in a meaningful way, but I'm yet to see a proof that it's impossible.
I think we are talking about different things here or the conversation is not clear.
The assertion you mention is the "what" assumption, while what you said is the "how" we came to that assertion. Most probably you took my wording in a more specific sense than I intended. My answer was informal because we are on HN, not writing a paper.
Feistel ciphers are a good technique for doing just this, but it's also worth noting that if all you are looking for is "produce a pseudorandom permutation of 1..N without actually shuffling a list of numbers", you can use an LFSR as well.
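For instance, a minimal sketch: a 16-bit maximal-length Galois LFSR (taps from the textbook polynomial x^16 + x^14 + x^13 + x^11 + 1), which visits every value in 1..65535 exactly once before repeating:

    // Pseudorandom permutation of 1..65535 without materializing or shuffling a list.
    function* lfsr16(seed = 1) {
      let s = seed & 0xffff; // must be nonzero
      do {
        yield s;
        s = (s >> 1) ^ (-(s & 1) & 0xb400); // Galois step with taps 0xB400
      } while (s !== seed);
    }

    const gen = lfsr16();
    console.log(gen.next().value, gen.next().value); // 1, then a "random-looking" successor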
Of course, but an LFSR is going to be faster (provided you have a reasonable number of rounds in your Feistel cipher), has some situationally desirable statistical properties, and is easier to adapt than a format-preserving encryption technique like a Feistel network.
Sorry, but "faster", given the computing capabilities we have in every electronic device, is not significant in this application. I have never had a customer inquire about the speed of this process. They were looking for a method that is more secure than others.
For example, the Philip Morris case was about winning prizes - imagine if they had tried other methods before, where some smart people reversed the method.
How do you adjust or evolve the blocklist with this, without making previously generated IDs incorrect?
The ID is simply incremented if it is blacklisted [1]. So the IDs are fixed to the blocklist content, and adjusting it in any way invalidates certain segments of previously generated IDs?
Didn't check the code, but you could encode the offset hash plus the offset used to avoid blocklisted words, and the decoder would then decode any version of the hash (offset or not).
I think the "unique IDs" part of the title throws people off and brings security to mind. "The main use of Sqids is purely visual" is what you need to know. It's not necessarily for IDs, it's just a user-friendly way to encode/decode numbers.
An argument could be made for calling it “compression of the string encoding”. Since HTTP is string oriented, that makes some sense as an optimization of sorts, perhaps.
I see many people in this thread saying that this is a good way to hide insights from ids/numbers. I don't understand - aren't the generated values easily decoded? Couldn't I just decode a couple of numbers to get that insight? What am I missing?
I noticed this too, and imo it's a design flaw that people get misled this way. Sequential inputs should yield sequential outputs; otherwise you might think it's meant to be unpredictable like SHA256.
Odd design decision: if you provide your own blocklist, it overwrites their (extensive) default list instead of adding to it.
And in general the algorithm is surprisingly complicated for something that could be replaced with simple base64 encoding; the given example (1, 2, 3) base64-encodes to a string with just one more letter than this algorithm produces.
That said, I do appreciate the semicolon-free style. I don't typically see that in libs besides my own.
The problem is their block list will change over time. If you don't override it, then your IDs won't decode right when you update. This is a huge risk.
> You have to account for scenarios where a new word might be introduced to the default blocklist
Honestly, I think they need to rethink this. Otherwise you've got different library versions for different languages each using different default blocklists, none of which are compatible.
There explicitly is no uniqueness guarantee. It is known that the same values can encode into many different strings, and they recommend always decoding then reencoding with a stable algorithm to achieve a "canonical" id if one is needed.
What’s the use case for passing in an array of numbers? Typically when generating an ID my input is either a single random number, a string that’s being hashed, or nothing at all.
I'd love a short ID standard of some kind. There are so many situations where I just want a quick, ephemeral ID. Doesn't need to be cryptographically secure, collisions are expected. It would be nice to have something to import from the standard library in most languages.
I was going to say it's for very long numbers, but I tried an example, and 123645634 becomes ARsz1pHw789c7ESzhy. The output is actually longer, and more complicated.
Looks like it uses a hash. I was expecting it to just convert the base except with something to skip profanity, which would give you something much shorter going base 10 to base ~36. Tbh I don't see why it's like this.
What we did 20 years ago was generate random numbers and then ensure the number wasn't already used. We do this only for employee ID numbers and it works fine for us. It's definitely not ideal though.
Isn't the thing that makes UUIDs UUIDs that they have enough bits that they are guaranteed to be unique without any synchronization?
I don't think you could reduce the number of random bits (I guess there are some non-random parts in standard UUIDs, but not a significant amount) while preserving that property, unless you add back some other form of synchronization to ensure there aren't collisions, which seems like it would defeat the purpose.
If you know which type of UUID you have (v1, v4, etc.) then you can look at how many bits of randomness it has and how many total items you have, and compute the probability of a collision if you just take a subset of the bits and use that as an ID.
In theory it's definitely possible. The 128 bits you get in a UUID is a LOT of randomness for an identifier. Postgres BIGINTs are just 64 bits. Instagram's sharded IDs are just 64 bits. (See below.)
You can test it. If you're using uuidv4 (which is 100% random bits, minus a few for the version), you could add a new column to your table in Snowflake, populate it with the first 64 random bits of your existing UUID column, then see if you have any collisions.
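The back-of-the-envelope version of that check, using the standard birthday approximation (nothing here is Snowflake-specific):

    // P(collision) ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n items and b random bits.
    function collisionProb(n, bits) {
      return 1 - Math.exp(-n * (n - 1) / Math.pow(2, bits + 1));
    }

    console.log(collisionProb(1e9, 64));  // ~0.027 -- a billion rows on 64 bits is already risky
    console.log(collisionProb(1e9, 128)); // ~1.5e-21 -- why full UUIDs are "a LOT" of randomness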
The rate of change over time can be used against you; many people consider their businesses’ month-over-month growth (or lack thereof) to be private information.
“$WEBSITE did 50,000 signups a month during the beginning of the pandemic, but now struggles to sign up a thousand a week” is a story.
I think that's only if you don't want to leak user count when your ID is an autoincrement. Elsewhere people mention cryptographically remapping integers, which could work (by itself, or before passing the ID to sqids).
Suppose you don't want to leak the count; what's a reasonable way of implementing that?
You can of course have a uuid v7 / ULID or something as the primary key. Or have it as a public-facing primary key mapping back to a sequential-ID PK (there might be some performance hits with larger PKs in e.g. postgres? or is that just FUD?)
But you could also generate a public ID with something like encrypt(seq_id, secret) and then encode it with whatever alphabet and/or profanity filter you'd like - right? The issue then is that all public IDs would be long (and of course there's a decrypt operation on all incoming requests).
Might not be, but I like to start with a big number instead of 0 or 1 to fill all the bits. For example, if your prime is 100019, then your first number in binary is 00011000011010110011 but if your max number is something like 2^53 (00100000000000000000000000000000000000000000000000000000) then you have a lot of unfilled bits. The way I have mine set up, the output ID is always exactly 9 chars. 99% of the time it's naturally 9 chars because just by probability most of the numbers will be large, but some of them come out 8 chars and then I just pad up to 9 so it's nice and consistent.
I have over 20 years of experience, I run a team, and I can code something like this in my sleep, as well as any average developer (but maybe "average" is lower than where it used to be in my day). Don't talk down to strangers on the Internet so arrogantly, you only end up embarrassing yourself.
I went to check what this code does (beyond a base conversion with a custom alphabet) and I laughed my ass off.
1. Literally TWO-THIRDS of the "library" is a clumsily put-together ban list of slurs. Hardcoded in the library. This DOES NOT belong there.
2. It also contains alphabet checks, like for repeated letters, which it runs every single time it produces an ID, instead of statically checking the alphabet ahead of time and just using it in production... or even simply selecting a URL-safe alphabet and sticking with it.
3. Despite all the checks and block lists, the default alphabet doesn't do the most elementary thing when producing alphanumeric ids: exclude letters and digits which look similar and so will look identical in some fonts. Nasty when you have to copy an ID from, say, a screenshot, or a poster, or what have you.
All in all, that's only borderline more utility than the infamous "leftpad" library, and it may actually be harmful in practice, because it bakes in a bunch of specific and/or superfluous concerns of the author's - almost as if they're trying to justify why this is a library in the first place - which will likely not align with the precise requirements of users and their projects' business rules, while excluding others (like the one I mentioned in point 3 above). But I'm sure the kids will be impressed.
Given the retry-on-bad-word feature, I was sceptical of the no-collision claim -- but after looking at the JS source code, I'm confident it's correct.
For each number encoded, one character from the alphabet is "held back" to use as a delimiter (with the particular character chosen changing as each number is processed, presumably to make the output "look more random"). The very first character output is essentially a "free choice" that selects the initial permutation of the alphabet to use.
The algorithm is implemented as a function encodeNumbers() that calls toId() to encode each number, then checks for bad words and recurses with a new value of "increment" if it finds any. To prove correctness it's helpful to imagine an "in-between" function, tryEncodeNumbers(numbers, alphabet), which the outer encodeNumbers() calls in a loop to do most of its work (pseudocode):
function encodeNumbers(numbers) {
    offset = sum(numbers)
    increment = 0
    do {
        result = tryEncodeNumbers(numbers, permute(alphabet, offset + increment++))
    } while (badWordIn(result))
    return result
}
Here permute() is a function that permutes the characters in its first (string) argument according to its second (integer) argument in some arbitrary way.
The toId(num, alphabet) function, which simply encodes a single number using "digits" taken from the characters in the alphabet parameter, is clearly injective with respect to the num parameter provided that alphabet contains no duplicate characters -- that is, if we hold some duplicate-free alphabet string fixed, every distinct value of num produces a distinct encoded string as output. (For example, toId(42, "0123456789") gives "42", and no other value of num produces this string when the alphabet remains unchanged.)

tryEncodeNumbers(numbers, alphabet) first outputs a character representing alphabet (i.e., its initial alphabet -- which is its complete internal state), then joins together a bunch of these toId()-encoded numbers, with an extra character in between each that is known not to appear as a digit in the preceding number, permuting the alphabet in an alphabet-dependent but num-independent way for each encoded number output. Because the alphabet permutation is independent of the input array of numbers, this means that the alphabet used to encode the i-th number depends only on the initial alphabet and i. This means that, again holding its initial alphabet fixed, tryEncodeNumbers() is likewise injective with respect to the input array of numbers.

(Suppose it were not: Then there is an initial dupe-free alphabet, and two distinct arrays of numbers, that produce the same output. Find the first position where the two arrays differ. Since the final results are identical by assumption, one of the two encoded outputs for this position must be a prefix of the other. But if the two encoded outputs are of different lengths, the shorter one must either terminate the entire string, making it shorter than the other encoded string, or be immediately followed by a character that cannot appear in the output of toId() with the alphabet used at this position, both of which contradict the assumption that the resulting strings are equal. Therefore toId() must have output identical strings for the two different numbers at this position, when given the same alphabet. But this contradicts injectivity of toId(), so this (non-injectivity of tryEncodeNumbers()) is impossible.)
The final step is to see that, if encoding two inputs with the top-level encodeNumbers() function gives the same first character C, it must be because the first successful (bad-word-free) loop iteration for the first input passed the same alphabet to tryEncodeNumbers() as the first successful loop iteration for the second input. If the remainders of the two encoded strings are also equal, then by injectivity of tryEncodeNumbers() for fixed alphabet choice their input number arrays must have also been the same. Since no restrictions were placed on the two inputs, this holds for all possible input pairs -- that is, it is impossible for encodeNumbers() to produce the same encoded output string for two different inputs.
Skipping profanity seems like a liability in this design. It means in order to preserve the encoding you need to make the banned word list immutable, otherwise old sqids will decode to the wrong thing when you get them back.
2. Drop all vowels, numbers, most homoglyphs, and the letter 'x'.
3. Map digits 0-9 to one of the remaining letters.
4. Stringify the integer and replace the digit in each decimal place with its corresponding character.
For my use-case, all the numbers were >7 digits long, so the odds of you getting an offensive acronym were reasonably low unless you started combining them.
But there's no perfect solution. As this dataset shows, you can find offense in almost anything if you look hard enough:
Many of those reviewer comments are utterly moronic. And that is my polite opinion.
How does this work? Is there a review board? Is it put to public review? A few of them like "dick out" and "shtlord" are reasonable, but many of them seem so bonkers it looks like the work of trolls.
Anyway, TIL that 1970s Intel was a MS-13 gang outfit and that Octocat really means "eight vaginas".
I could give a fuck about avoiding swear words, but if you want to avoid slurs and eyebleach-inducing ideas and still have any sort of compact representation, I suspect we have to look not at problematic letters but problematic groups of letters. There's nothing intrinsically wrong with the letter E. Not with G, I, N, or R, but you can sure get a lot of attention you don't want by arranging them in the wrong order. K and Y aren't bad either, unless you're hating on Jewish people.
So maybe there's a 5:4 or a 5:3 encoding out there where you avoid making syllables.
I don't think this holds; you can enforce filtering in the encoding step, i.e. be strict about what you output, but always decode, even if the input is profanity. This means you can also be backwards compatible if you update the list, etc. In short, the old maxim: be strict about your outputs and lenient about your inputs.
From their FAQ: "The best way to ensure your IDs stay consistent throughout future updates is to provide a custom blocklist, even if it is identical to the current default blocklist."
The *encoding* changes. The decoding stays consistent:
> Decoding IDs will usually produce some kind of numeric output, but that doesn't necessarily mean that the ID is canonical. To check that the ID is valid, you can re-encode decoded numbers and check that the ID matches.
The reason this is not done automatically is that if the default blocklist changes in the future, we don't want to automatically invalidate the ID that has been generated in the past and might now be matching a new blocklist word.
In that case it sounds like a shortcoming on their part. There is no fundamental reason to have that limitation. I understand the implementation is easier without it, but in my opinion being agnostic to blocklist changes would be a much better value offering.
It's a base62 encoder that takes multiple integers as input. Probably a bit-length prefixed encoding. I am assuming it just pads an extra junk integer to re-roll the encoded number.
There is no need for randomised output, and they've said themselves that you can recover the input from the output. It's literally just so that it looks like something cool and technological is happening. You can skip IDs that contain base64 profanity just by doing the reverse encoding on your block list and skipping those IDs. Or you could exclude vowels from your alphabet and do a base-54 encoding. Or you could grow up and accept that your 4-millionth ID, which no one is going to read, might have a naughty word in it (the horror).
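A sketch of that reverse-encoding idea, assuming a plain positional encoding over a base-62 alphabet (not Sqids' actual algorithm; the example words are illustrative):

    const ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

    // Decode a word to the integer whose encoding it is, or null if it can't occur.
    function preimage(word) {
      let n = 0n;
      for (const ch of word) {
        const d = ALPHABET.indexOf(ch);
        if (d === -1) return null; // this alphabet can never spell the word
        n = n * BigInt(ALPHABET.length) + BigInt(d);
      }
      return n;
    }

    const blocked = new Set(['sexy', 'hell'].map(preimage).filter((x) => x !== null));
    // At allocation time: while (blocked.has(nextId)) nextId++;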
A secure "random-looking id generator" is called a block cipher. Block ciphers with less than 128 bits of output are widely considered insecure: this corresponds to about 22 base64 characters.
Going further, you probably do want to use a secure algorithm. History is full of people thinking "we don't need this to be secure, it's not used for security-sensitive things" and later regretting it. Some case studies at [1].
[1] https://www.schneier.com/blog/archives/2016/04/security_risk... .