I appreciate that the author clearly states that security, i.e., output can't be reversed back to the input, is a non-requirement. We can't criticize the author too much for that either, because, as a rule, "random-looking id generator" algorithms will always be either not secure, or not short, or not collision-free. Or they'll be a key-value database.
A secure "random-looking id generator" is called a block cipher. Block ciphers with less than 128 bits of output are widely considered insecure: this corresponds to about 22 base64 characters.
Going further, you probably do want to use a secure algorithm. History is full of people thinking "we don't need this to be secure, it's not used for security-sensitive things" and later regretting it. Some case studies at [1].
It's good to consider this but... Plenty of sites expose user IDs as regular integers. In some cases you might want to avoid this (leaking user counts to competitors, etc.), but I have never heard of anyone calling this a vulnerability.
It's referred to as an "Insecure Direct Object Reference" (IDOR) vulnerability. In many cases it is not actually a vulnerability; however, when an application contains sensitive information and lacks authorization or rate limiting, it can be exploited to enumerate the entire database.
When I first joined $company, HR sent me a SharePoint document with a numerical ID. Incrementing or decrementing the ID allowed me to view personal information of other employees including their pay.
I realize now the site lists "One-Time Passwords" as a use-case. I'm a positive-vibes-only HN poster though, so will leave it as an exercise to the reader to decide whether that's going to turn out well or not.
The mention of one-time passcodes seems odd. Those need to be unguessable, but don't need to be unique. If you supply a suitable random source, then I suppose it works, but the "padded with junk" feature makes these look more complex than they really are.
The standard choice of 4 to 8 random digits works well, and it's clear what level of security they provide. Digits are easier to understand than case-sensitive Latin characters, especially when your native language uses a different character set.
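A minimal sketch of that standard choice, assuming Node.js and its CSPRNG-backed crypto.randomInt:

    const crypto = require('crypto');

    // A 6-digit one-time passcode: uniform over 000000..999999 (~20 bits of entropy),
    // drawn from a CSPRNG, zero-padded so leading zeros are kept.
    const otp = String(crypto.randomInt(0, 1_000_000)).padStart(6, '0');
    console.log(otp);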
I think you completely misunderstood the article, or you have not read it.
The system's uniqueness comes from the fact that two different numbers will never produce the same id.
The padding is only relevant for things like user ids.
You can also change the alphabet it uses so it's not case sensitive; using the alphabet "ABCDEFGHJKLMNPQRSTUVWXYZ0123456789" and a minimum length of 8, it will produce 8-character ids all the way up to 4294967295 (0xffffffff).
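For example, with the JavaScript package (the alphabet and minLength options are documented constructor options; the exact output is illustrative, not guaranteed):

    import Sqids from 'sqids';

    // Case-insensitive alphabet (no I or O) and padding to a minimum of 8 characters.
    const sqids = new Sqids({
      alphabet: 'ABCDEFGHJKLMNPQRSTUVWXYZ0123456789',
      minLength: 8,
    });

    const id = sqids.encode([4294967295]); // an 8-character id for 0xffffffff
    console.log(sqids.decode(id));         // [4294967295]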
[1]: though with UUID v4 so easy to generate and well optimized in most languages, I wonder if these userland solutions are really better. You can always generate a UUID and re-encode it with base32 or base64, which is also well optimized in most languages.
In general, I'm a bit wary of solutions that "guarantee no bad words" – this is usually highly language-specific: one language's perfectly acceptable name is another language's swear word.
We use it in a highly internationalized product spanning multiple languages and haven't yet run into a complaint, or a value flagged on audit, that would constitute something offensive in any language, per our international content teams anyway.
That isn't to say it's 100% (and, simply enough, we don't audit every single URL), but I suspect we would have gotten at least a heads up from a user by now.
Nevertheless, we are moving our approach to UUIDs that get base32 encoded for some of our use cases. They're easier for us to work with in many scenarios.
It's particularly funny because their example docs for .NET output "B4aajs", which any Swedish l33t-speaking individual would read as "Bajs", which means "shit".
> That doesn't seem possible. How would that work?
agree; b00b, DlCK, cntfcker
But I suppose, if the user doesn't get to craft the input, the collision space between converted numerical ids and words like the above is sufficiently small to be ignorable.
Besides vowels, nanoid excludes 0, 1, 3, 4, 5, I, l, x, X, v, V, and other lookalikes, so the chances of generating something naughty in any language are close to zero.
From a quick look, the lists are pretty short, except for the English one, which has some 404 words, but I can imagine there are far more bad words you'd want to avoid than just those?
Grepping naughty words out of randomly generated text strictly weakens the information content if you're using it for a secure application, but it is often necessary.
In the early dotcom era, the company I worked for was about to go live, and the final step was demoing the end-to-end flow to the CEO. I had done the back-end stuff and hadn't paid much attention to the front end. The person who built the account-creation process wanted to nudge people toward memorable yet strongish passwords, so on account creation it would generate a random password, which he did by choosing two four-letter words at random from the unix dictionary and putting a two-digit number between them. He ran that past me as an idea and I thought "yeah, good idea" and didn't think more of it.
However, he forgot to first grep out all the naughty words, so when we demoed it to the CEO/non-technical founder, both of the words in his randomly generated password were swearwords.
Side note: there are some business insights you can get from a company using serial ids.
i.e., if you sign up and get user id 32588 and make another account a few days later, you can estimate the growth rate of the company.
And this is possible with every resource type in the application.
I do wonder how much the url bar junk thing matters these days. I tend to use ULIDs (while waiting on wide uuid v7 adoption), and they're a bit ugly, but most browsers hide most of the URL now anyway. The fact that there's a built-in time component comes in clutch sometimes (e.g. object-merging rules).
You can even do this when you don’t know the exact interval by using probabilities. The Allies used this method to estimate German tank production in World War II by analyzing the serial numbers of captured or destroyed tanks.
I’m a lawyer and using sequential IDs in a fraud case right now, to determine the number of victims.
Unfortunately, so far I only have the IDs of two victims, and those are from within about a month of each other, whereas the fraud has likely been going on for several years. Simply extrapolating that growth rate isn't going to be very accurate.
Also, I suspect that the perpetrators did not start at ID 1.
It would be significantly above the average unless the company is ridiculously top-heavy or has shockingly little variation in salary. Or if the "salary" for the CEO ignores certain compensation (e.g., paid a salary of $1 plus stock options).
Sure thing. I could have worded it better, but I was trying to say that it would be much more skewed if the two samples were, say, CEO and the CFO, or two janitors.
Even with n=1 you can get something useful. IIRC, "on average", if you have ID x then the best population estimate is 2*x. Of course the error margin is immense, but it's still better than nothing.
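This is the "German tank problem"; the standard frequentist minimum-variance unbiased estimator is simple enough to sketch (the function name is mine):

    function estimatePopulation(serials) {
      // German tank problem, frequentist MVUE: N_hat = m + m/k - 1,
      // where m is the largest serial observed and k is the sample size.
      const m = Math.max(...serials);
      const k = serials.length;
      return m + m / k - 1;
    }

    console.log(estimatePopulation([32588]));          // 65175: with one sample, ~2x - 1
    console.log(estimatePopulation([19, 40, 42, 60])); // 74: m=60, k=4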
I see so many organizations add weird slowdowns from the technical debt associated with this. I reflect on some of the most successful tech businesses of the last decade and remember that all their APIs exposed this kind of data early on, and many still do.
Does anyone have an example they can reference of a business being harmed by this information being out there?
I don't know any stories about digital businesses, but there was a case where someone went and counted customers at the door, only to realize that the company was lying in its yearly reports.
So I guess that kinda harmed that fraudulent business strategy…
At an internship long ago, my boss instructed me to always add a few extra to the auto-incremented order ID so customers couldn't guess how business was going if they happened to place several orders in a row.
The solution I prefer is to simply encrypt the data, such as IDs.
Instead of giving the user an ID in the response, the user gets hmac(cipher(Data, secret_key), secret_key) + cipher(Data, secret_key), and then some simple pre-request handler just iterates over query params / form data and decrypts them if the signature matches.
It also works as a really nice CSRF protection, as the user ID of the currently signed-in user can be embedded into Data and checked: current user.id == decrypted data.id.
Another nice advantage is that you can deny the request right at the beginning, as you know ahead of time that the provided data is not valid (signature doesn't match), saving some DB queries.
The downside is that the URL gets pretty long, but if that's hidden by the browser or the user doesn't care, it's a non-issue.
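A minimal Node.js sketch of that scheme as I understand it (AES in a deterministic single-block mode plus a truncated HMAC tag; all names, the key derivation, and the 8-byte tag length are my own illustrative choices):

    const crypto = require('crypto');
    const KEY = crypto.scryptSync('secret_key', 'some-salt', 32); // illustrative key derivation

    function sealId(id) {
      // Deterministic: AES-256-ECB over exactly one 16-byte block (no IV needed for one block).
      const cipher = crypto.createCipheriv('aes-256-ecb', KEY, null);
      cipher.setAutoPadding(false);
      const block = Buffer.alloc(16);
      block.writeBigUInt64BE(BigInt(id), 8);
      const ct = Buffer.concat([cipher.update(block), cipher.final()]);
      const tag = crypto.createHmac('sha256', KEY).update(ct).digest().subarray(0, 8);
      return Buffer.concat([tag, ct]).toString('base64url'); // hmac(cipher(...)) + cipher(...)
    }

    function openId(token) {
      const buf = Buffer.from(token, 'base64url');
      const tag = buf.subarray(0, 8), ct = buf.subarray(8);
      const expected = crypto.createHmac('sha256', KEY).update(ct).digest().subarray(0, 8);
      if (!crypto.timingSafeEqual(tag, expected)) return null; // reject before touching the DB
      const decipher = crypto.createDecipheriv('aes-256-ecb', KEY, null);
      decipher.setAutoPadding(false);
      return Buffer.concat([decipher.update(ct), decipher.final()]).readBigUInt64BE(8);
    }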
Can confirm, when I worked in VC we used this to verify order volume for a number of startups we were evaluating. For one startup, I wrote a bot to place a small order a few times a day, and log the order number.
Did you use this for due diligence, to verify that the data reported by the entrepreneur in the pitch was accurate, or were you looking at companies in a specific space and checking which one had more orders?
While I tend to strip the tracking params and there are extensions that do this, I don't think most people do. These URLs are pretty 'ugly'.
So if the links that are being shared most on the internet (YT, TikTok, Twitter) don't care, you probably shouldn't either. I think the onus is on the UI layers (Chat apps, etc) to show urls how they look best on their respective platforms.
Edit: to this point, it looks like HN truncates these to make them less ugly too.
You don't get the cardinality of the data type, just when the object was created.
There are probably some business cases where the "when" information is potentially useful (I can't think of any), but you can't know, for example, how many users are in the database.
It's usually benign, but why encode any info into your public IDs? I wouldn't go anywhere near that.
It can make sense for some situational internal database use case where you want temporal locality and can't use full sequential since it's distributed, and even then your DBMS might recommend something else, e.g. Spanner explicitly says not to do this. And it doesn't need to be exposed to users.
Assuming the alternative is a fully random key, this can wreak havoc for performance depending on the database engine and index type used (lots written on this topic).
But I do agree: if performance isn't an issue with your db choice and you're not interested in getting a free "created_at", you might as well go fully random.
Primary keys have to be chosen carefully because they impact disk layout, joins, etc, and full random makes bad PKs in certain distributed DBs. But it's simple and cheap to convert a public user ID (full random) to/from internal row keys (sequential-ish) at the API boundaries using a secondary index or even a cache.
(Like, simple/cheap enough that it's probably worth doing instead of exposing your row keys. Maybe in some careful cases exposing uuid7 row keys makes sense, but not nearly enough to recommend that so broadly. It's not a safe default.)
It's weird that under "Get Started" they have links to 40 different languages. You can only get started with 15 of the 40 languages listed; the other 25 are skeleton repos asking people to star the repo to indicate interest.
It’s kinda clever. The people most likely to look at this project are also likely ideal candidates for implementing the library in a new language (Developer, FOSS enthusiast, interested in the project, need the library in a language they’re familiar with that isn’t implemented yet).
Also, the language pills differentiate between those that have been implemented (color logo, dark text, bold) and those that aren’t (grayscale).
Good points. Those pages also contain links to old implementations (Hashids), because a lot of projects still use those and want to be able to find them.
The approach definitely works. Some time ago I saw .NET listed but discovered it wasn't complete. I was eager to replace an existing Hashids implementation, so I made some comments and shared a starter snippet, and then someone was excited enough to complete it in just a few days. It was great to see how quickly the community stepped in. Maybe there was a bit of Cunningham's Law in effect with my contribution, ha.
The second example in each language sample, where the generated sqid ends up being "B4aajs", essentially reads as "P0ooop" to a Swedish speaker.
Which is fine, they don’t propose to filter “bad” words in other languages, but kind of funny when that’s one of the highlighted examples, right next to the goal of filtering words. Goes to show how hard it is to filter profanity generally for international audiences
Could you just remove vowels and hit 99.9% of profanity in all languages? Ditto for removing their 0-9 equivalents, if you're really worried about it. There's quick out-of-the-box support for that via the ability to define a custom alphabet.
Well until we figure out a way to remove pattern matching from humans... use GUIDs if that's an issue. Removing vowels fixes "spelling almost all bad words explicitly", though I'm open to being proven wrong with fun new swears in exotic (to me) languages :)
The problem of "pick any N symbols that don't make any profanity in any language across all time" isn't what this is solving, nor should it have to. Take the same concept but use whitelisted words to build the token if you're that averse to computer-generated, fill-in-the-blank naughty words. Keep "pen" and "island", among other things, off that list ;)
We're using randomly generated strings for many things. IDs, password recovery tokens, etc. We've generated millions of them in our system, for various use-cases. Hundreds of thousands of people see them every day.
I've never heard any complaints about a random content-id being "lR8vDick4r" (dick) or whatever.
But nowadays our society is so afraid of offending anyone that profanity filters have extended all the way to database IDs and password recovery tokens.
(there are some legit cases, like randomly generated IDs for user profiles shared in public URLs, that users have to live with, but even there, just make the min length 8 and you're unlikely to have any full-word profanity as the complete ID; put differently, I don't understand why they made the block list an opt-out thing)
The block list is 2/3 of the (minified) library. I found this entire choice odd.
First, it's highly incomplete because you can find at least 10x more combinations spelling the same "word". And probably 10x more slurs that aren't in this block list. Second, because it's hardcoded in your source. Third, because there are more elegant solutions.
Such as picking an alphabet that can't spell readable words unless you're trying really hard to read a slur into it. Say this (no vowels or digits): bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ
The full lower+upper+digits alphabet they use is 62. Feels like you're losing a lot, but... not really.
- A 128-bit id in base 62 = 22 letters.
- A 128-bit id in base 42 = 24 letters.
JUST TWO MORE LETTERS. And it's one more letter for a 64-bit id (11 vs 12). And we can avoid this entire silliness. The problem is the author doesn't realize that log N is... logarithmic, I suppose.
A slight mod: I'd remove Y, despite it not exactly being a vowel, and add back the digits that can't be interpreted as vowels:
bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ25679
Gives us base 45. And below is a JS snippet to make an id. There's your lib.
function id(num) {
  // Encode a non-negative integer in base 45, least-significant digit first.
  num = BigInt(num);
  const dict = "bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ25679";
  const base = BigInt(dict.length); // 45
  let id = '';
  while (num > 0n) {
    id += dict[Number(num % base)];
    num /= base;
  }
  return id || dict[0]; // 0 maps to the first character
}
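And the matching decoder, assuming the same least-significant-digit-first layout as the snippet above:

    function num(id) {
      const dict = "bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ25679";
      let n = 0n;
      // Digits were emitted least-significant first, so fold from the end.
      for (let i = id.length - 1; i >= 0; i--) {
        n = n * BigInt(dict.length) + BigInt(dict.indexOf(id[i]));
      }
      return n;
    }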
I'm actually more convinced the problem is corporate risk management dealing with the tendency of social media to overhype issues by design, rather than a statement on society.
There are other "general" rules when it comes to random human-readable tokens, such as not using O and I if your strings include numbers - people can and will confuse them with 0 and 1 if they have to type them over.
Most gift card tokens, for example, don't allow the use of those two (or quietly correct it) to avoid that mistake.
California driver's license and ID numbers are one letter followed by seven digits. But if the character in the second position is a zero, everybody reads it as the letter O.
In a Ruby app we just convert to a high base, like
> 1234567890.to_s(36)
=> "kf12oi"
That gets us most of the way there, but Sqids has a Ruby library and lets you set a much higher base, including upper-case characters and, I suppose, emoji. We're going to need much bigger numbers before that space saving makes much difference. I like it, but it's hard to know when something like that is worth adding a dependency.
Ironically, its default encoding is base 55. I figured if you're going to use emoji you should at least use the size of the space to maximize the number of bits per grapheme beyond what is possible with ASCII, but apparently not.
Which I guess is the part where senior (citizen) programmers like me get triggered as promised in the README.
The `salt` in hashids just shuffled the alphabet. Now they've removed the `salt`, but you can have the same level of "obfuscation" as before if you shuffle the alphabet yourself before calling the library.
It looks like an easy brute force too; there are no compute-hard operations here. I guess you could scramble your alphabet? Otherwise "Uk" always comes after "bM", etc.
My understanding is that you can re-order the source alphabet and encode numbers with swapped characters. Unless you know of 36 numbers that are exactly 1 id apart, you will always have uncertainty about what the ids actually map to. I guess given a large, large number of ids along with the order they're assigned (which might be given through the time at which they're assigned), you could make a probabilistic statement about the actual order of the 36 characters, based on the fact that most numbers increase/decrease faster the bigger they are. (This fact is e.g. used to detect fabricated bank statements: if the first digit of any number in the statement is equally likely to be 1 or 9, the numbers are randomly generated, whereas if they're real, 9 is less likely than 1.)
Scrambling the source alphabet should have an effect similar to a monoalphabetic substitution cipher. This is not strong cryptography. If the attacker has any ability to generate IDs quickly, like by creating user accounts or other resources, they can create many IDs with known ordering. Likely effective against non-serious attempts only.
I'd prefer to use Crockford-encoded entropy with Stripe-style token prefixes to create unique ID namespaces. Run it through a bad-words filter, and it's perfect.
user_1hrpt0xpax7ps
file_xpax7psaz0tv6az0tv6
Etc.
In distributed systems you can use the trailing bytes to encode things like author cluster, in case you're active-active and need to route subsequent writes before create event replication.
Easy to copy, debug, and run ops/on-call against. If you have an API, they're user-friendly.
Of course you still want to instruct people that the prefixes are opaque.
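A sketch of that token shape (lowercase Crockford base32 over random bytes; the helper name, the 10-byte default, and the example prefixes are illustrative):

    const crypto = require('crypto');
    const CROCKFORD = '0123456789abcdefghjkmnpqrstvwxyz'; // no i, l, o, u -- fewer misreads

    function token(prefix, nbytes = 10) {
      let bits = 0n;
      for (const b of crypto.randomBytes(nbytes)) bits = (bits << 8n) | BigInt(b);
      let out = '';
      for (let i = 0; i < Math.ceil((nbytes * 8) / 5); i++) { // 5 bits per character
        out = CROCKFORD[Number(bits & 31n)] + out;
        bits >>= 5n;
      }
      return `${prefix}_${out}`;
    }

    console.log(token('user')); // e.g. "user_16d9rs0v8m2xh1pt"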
Yeah don’t forget the bad words filter. I worked on an IKEA mailing where the list processing house was adding an autogenerated discount code to the address label. The customers received codes with BOOB, DICK, TWAT, and CUNT embedded within. People were not happy.
Did they never make an IKEA purchase after that or did they get over it like a normal adult?
I don't work retail, but something tells me people will make a stink out of just about anything if it meant potentially free products or other compensation.
Plus, are you filtering just English curse words or all curse words for countries that use Latin characters?
The risk is not in offending someone, but in that someone posting the rude string on social media in real or mock indignation, causing the outrage machine to turn on your brand. There’s a steady supply of bottom-feeding journalists waiting to write the article “IKEA’s new system is sending hate messages to customers”.
We caught it about 2 million into a 15 million print run. I had to go through the control files and change the offending value or identify the pallet and bundle of mail it was in so we could remove and readdress the catalog. It was about 10 years ago and anyone who complained got a coupon.
The mailing house that did the work was fired though.
Correct me if I'm wrong, but it cannot be unpredictable, which makes the library redundant for security concerns - and that would be the one business case for seeking anything other than a UUID (which is already built into Ruby).
If you use a random number, then you need to store it somewhere to map back to the original. Sqids is an encoding: you can decode the sqid back to the original without storage overhead.
Features like the profanity filter avoid creating URL routes like /user/cuntFh.
Cross language support allows interop between the encoder and decoder across microservices written in different languages.
A database PK really shouldn't be exposed like this. I think that's what they're getting at, though. The thing is, a sequential database PK is still shorter and more readable than a Sqid.
I haven't been able to find a case for this, because ids either need to be unique or they're not going to be large. If they're unique, I'm using UUID or ULID (the uuidv7 of tomorrow) as the sortable primary key type, to avoid conflicts without using the db to generate and maintain sequences.
Where do you have unique ids that aren't the primary key? I would be more interested in a retrospectively unique truncated encoding for extant ULIDs/UUIDs; i.e., given that we've passed timestamp foo, we know that (where no external data is merged) we only need a bucketed time granularity of x for the random component of the id to remain unique (for when sortability is no longer needed).
Or just more generally a way to convert a ulid/uuidv7 to a shorter sequence if we are using it for external hash table lookups only and can do without the timestamp component.
The idea is that you encode and decode database IDs with this. You wouldn't save them separately unless you were using it for a purpose other than shareable "identifiers" which don't leak significant amounts of database state. Imagine something like a link shortener where you want to provide a short link to users, but don't want it to just be a number.
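A sketch of that shortener shape using the JavaScript package (the route shape and minLength are illustrative; the decode-then-re-encode check is what the Sqids FAQ recommends for canonical ids):

    import Sqids from 'sqids';
    const sqids = new Sqids({ minLength: 5 });

    // Outbound: the row's integer PK becomes the short path -- nothing extra to store.
    const shortPath = (rowId) => `/l/${sqids.encode([rowId])}`;

    // Inbound: decode, then re-encode and compare to reject non-canonical ids.
    function rowIdFromPath(id) {
      const nums = sqids.decode(id);
      if (nums.length !== 1 || sqids.encode(nums) !== id) return null;
      return nums[0];
    }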
> Where do you have unique ids that aren’t the primary key?
For things that need public IDs, I always generate and store uuid4s* to avoid leaking database state or anything like that. Even uuid7 leaks creation timestamps. It's cheap and easy to map public IDs to/from PKs at the API boundaries.
* possibly encoded as something nicer like base64 for APIs or URLs
I wrote a Ruby gem to address this problem of hiding sequential primary keys that uses a Feistel network to effectively shuffle int64 IDs: https://github.com/abevoelker/gfc64
Kinda similar idea to this library, but you're encoding from an integer to another integer (i.e. it's format-preserving encryption). I like keeping the IDs as integers without having to reach for e.g. UUIDs.
I recommend reviewing my comment where I also use a Feistel cipher [1]; the difference is that it is not limited to int64 - I can use even 8 bits. Obviously you lose security properties, but it works as an obfuscation method. If you use a random source with relatively few bits you have to check for duplicates, while with the Feistel cipher you are sure there aren't any.
On a side note, "Sqids ... is an open-source library that lets you generate YouTube-looking IDs from numbers.", "The main use of Sqids is purely visual."
If the purpose of it is to give a friendlier url/id, why not use something like friendly_id instead? (http://norman.github.io/friendly_id)
The url is readable and searchable through the history.
I would much rather prefer people using "www.website.com/channel/video/a-dog-walking" instead of "www.website.com/channel/video/3cXv8c".
These are called slugs <https://en.wikipedia.org/wiki/Clean_URL#Slug>. If you’re willing to commit to a persistent slug, please do: they are nicer. But for many applications, e.g. most user-generated content, you can’t be confident they won’t change. There’s a reason for the general advice to not use data as a primary key in databases, but to use a separate ID.
Actually, I'm a bit disappointed that it can't format 128-bit integers or byte arrays. That would allow formatting UUIDs. I'm not a huge fan of public-facing integer IDs; there is always the risk of leaking some kind of critical information with ascending IDs.
So I will probably keep Base64URL-formatting my UUIDs to make them shorter for URLs, QR codes and so on.
Even if it could format 128 bit integers, I don't see how that would be any better than base64-encoding them. Base64 is already pretty close to the maximum efficiency you can get for shortening UUIDs while still using URL-safe characters.
I actually used the sqids algorithm for something very different, where I had to encode arbitrarily sized byte arrays. And with Ruby it was very simple to remove the limit - I think it was a matter of monkey-patching https://github.com/sqids/sqids-ruby/blob/main/lib/sqids.rb#L... to return infinity.
The reason for the limit is most likely to ensure interoperability with libraries in other languages where working with bignums is much more complicated.
I offered something similar here [1], and it is used by many companies, including Philip Morris and the Argentinian tax agency, for the same purposes.
The technique I used (I should publish it as open source) is a Feistel cipher [2] with a key. The Feistel network can be adjusted to almost any size, and the key used in every round is an expansion of a general key using a key derivation function [3] (KDF3 if I remember well).
Basically it is a symmetric cipher of arbitrary size.
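As a concrete illustration of the shape (not the poster's implementation): a toy 4-round balanced Feistel permutation over 32-bit ids, with HMAC-SHA256 standing in for the keyed round function:

    const crypto = require('crypto');

    // Bijection on 32-bit integers: encrypting maps every id to a unique scrambled id,
    // and running the rounds in reverse inverts it. Toy parameters -- not a vetted cipher.
    function feistel32(id, key, decrypt = false, rounds = 4) {
      const F = (half, i) => crypto.createHmac('sha256', key)
        .update(Buffer.from([i, half >> 8, half & 0xff]))
        .digest().readUInt16BE(0); // keyed round function, truncated to 16 bits
      let l = (id >>> 16) & 0xffff, r = id & 0xffff;
      if (!decrypt) for (let i = 0; i < rounds; i++) [l, r] = [r, l ^ F(r, i)];
      else for (let i = rounds - 1; i >= 0; i--) [l, r] = [r ^ F(l, i), l];
      return ((l << 16) | r) >>> 0;
    }

    const obfuscated = feistel32(12345, 'secret');
    console.log(feistel32(obfuscated, 'secret', true)); // 12345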
Thank you for the article; I was not aware of it. My first implementation was done in 1997 or 1998. I have been involved in computer security for three decades and I was familiar with Feistel networks. I am not a cryptographer, but I have reverse engineered algorithms using this structure. I never checked patents, because I think this is a natural construction, and I always wondered why there were no other public ideas around this problem. This is not the first time that has happened to me.
I'm not a cryptographer and did not understand half the things you said, except symmetric crypto.
I think there are 2 problems with this approach:
How do you prevent the double-spend problem, for example duplicate entry tickets? You would have to mark the ticket as used in a central database anyway to prevent it.
What happens if the secret key material is compromised? Anyone can issue new valid numbers, etc.
1/ You don't have duplicates, because a Feistel network assures there are no duplicates: every input has a different output with a fixed key in every round.
2/ If the secret is compromised, anyone can issue valid numbers, in the same way that exposing your secret encryption keys exposes your encrypted data. The idea is that the secret key material is never compromised, as is assumed in all security cases. You should safeguard it with the right measures, based on the attack vectors you have. The custody of secrets is independent of this method.
>The idea is that the secret key material is never compromised as it is assumed in all security cases.
That's not true; we have (perfect) forward secrecy, backward secrecy and key-rotation mechanisms because we often care what happens after the key is inevitably compromised. In this case the problem makes it hard to "rotate" the keys in a meaningful way, but I'm yet to see a proof that it's impossible.
I think we are talking about different things here or the conversation is not clear.
The assertion you mention is the "what" assumption, while what you said is the "how" we came to that assertion. Most probably you took my wording in a more specific sense than I intended. My answer was informal because we are on HN, not writing a paper.
Feistel ciphers are a good technique for doing just this, but it's also worth noting that if all you are looking for is "produce a pseudorandom permutation of 1..N without actually shuffling a list of numbers", you can use an LFSR as well.
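For instance, a minimal sketch: a 16-bit maximal-length Galois LFSR (taps from the textbook polynomial x^16 + x^14 + x^13 + x^11 + 1), which visits every value in 1..65535 exactly once before repeating:

    // Pseudorandom permutation of 1..65535 without materializing or shuffling a list.
    function* lfsr16(seed = 1) {
      let s = seed & 0xffff; // must be nonzero
      do {
        yield s;
        s = (s >> 1) ^ (-(s & 1) & 0xb400); // Galois step with taps 0xB400
      } while (s !== seed);
    }

    const gen = lfsr16();
    console.log(gen.next().value, gen.next().value); // 1, then a "random-looking" successor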
Of course, but an LFSR is going to be faster (provided you have a reasonable number of rounds in your Feistel cipher), has some situationally desirable statistical properties, and is easier to adapt than a format-preserving encryption technique like a Feistel network.
Sorry, but "faster", given the computing capabilities we have in every electronic device, is not significant in this application. I have never had a customer inquire about the speed of this process. They were looking for a method that is more secure than others.
For example, the Philip Morris case was about winning prizes - imagine if they had tried other methods before, where some smart people reversed the method.
How do you adjust or evolve the blocklist with this, without making previously generated IDs incorrect?
The ID is simply incremented if it is blacklisted [1]. So the IDs are fixed to the blocklist content, and adjusting it in any way invalidates certain segments of previously generated IDs?
Didn't check the code, but you could encode the offset hash plus the offset used to avoid blocklisted words, and the decoder would then decode any version of the hash (offset or not).
I think the "unique IDs" part of the title throws people off and brings security to mind. "The main use of Sqids is purely visual" is what you need to know. It's not necessarily for IDs, it's just a user-friendly way to encode/decode numbers.
An argument could be made for calling it “compression of the string encoding”. Since HTTP is string oriented, that makes some sense as an optimization of sorts, perhaps.
I see many people in this thread saying that this is a good way to hide insights from ids/numbers. I don't understand - aren't the generated values easily decoded? Couldn't I just decode a couple of numbers to get that insight? What am I missing?
I noticed this too, and imo it's a design flaw that people get misled this way. Sequential inputs should yield sequential outputs; otherwise you might think it's meant to be unpredictable like SHA256.
Odd design decision: if you provide your own blocklist, it overwrites their (extensive) default list instead of adding to it.
And in general the algorithm is surprisingly complicated for something that could be replaced with simple base64 encoding; the given example (1, 2, 3) base64-encodes to a string with just one more letter than this algorithm produces.
That said, I do appreciate the semicolon-free style. I don't typically see that in libs besides my own.
The problem is their block list will change over time. If you don't override it, then your IDs won't decode right when you update. This is a huge risk.
> You have to account for scenarios where a new word might be introduced to the default blocklist
Honestly, I think they need to rethink this. Otherwise you've got different library versions for different languages each using different default blocklists, none of which are compatible.
There explicitly is no uniqueness guarantee. It is known that the same values can encode into many different strings, and they recommend always decoding then reencoding with a stable algorithm to achieve a "canonical" id if one is needed.
What’s the use case for passing in an array of numbers? Typically when generating an ID my input is either a single random number, a string that’s being hashed, or nothing at all.
I'd love a short ID standard of some kind. There are so many situations where I just want a quick, ephemeral ID. Doesn't need to be cryptographically secure, collisions are expected. It would be nice to have something to import from the standard library in most languages.
I was going to say it's for very long numbers, but I tried an example, and 123645634 becomes ARsz1pHw789c7ESzhy. The output is actually longer, and more complicated.
Looks like it uses a hash. I was expecting it to just convert the base except with something to skip profanity, which would give you something much shorter going base 10 to base ~36. Tbh I don't see why it's like this.
What we did 20 years ago was generate random numbers and then ensure the number wasn't already used. We do this only for employee ID numbers and it works fine for us. It's definitely not ideal though.
Isn't the thing that makes UUIDs UUIDs that they have enough bits that they are guaranteed to be unique without any synchronization?
I don't think you could reduce the number of random bits (I guess there are some non-random parts in standard UUIDs, but not a significant amount) while preserving that property, unless you add back some other form of synchronization to ensure there aren't collisions, which seems like it would defeat the purpose.
If you know which type of UUID you have (v1, v4, etc.) then you can look at how many bits of randomness it has and how many total items you have, and compute the probability of a collision if you just take a subset of the bits and use that as an ID.
In theory it's definitely possible. The 128 bits you get in a UUID is a LOT of randomness for an identifier. Postgres BIGINTs are just 64 bits. Instagram's sharded IDs are just 64 bits. (See below.)
You can test it. If you're using uuidv4 (which is 100% random bits, minus a few for the version), you could add a new column to your table in Snowflake, populate it with the first 64 random bits of your existing UUID column, then see if you have any collisions.
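The back-of-the-envelope version of that check, using the standard birthday approximation (nothing here is Snowflake-specific):

    // P(collision) ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n items and b random bits.
    function collisionProb(n, bits) {
      return 1 - Math.exp(-n * (n - 1) / Math.pow(2, bits + 1));
    }

    console.log(collisionProb(1e9, 64));  // ~0.027 -- a billion rows on 64 bits is already risky
    console.log(collisionProb(1e9, 128)); // ~1.5e-21 -- why full UUIDs are "a LOT" of randomness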
The rate of change over time can be used against you; many people consider their businesses’ month-over-month growth (or lack thereof) to be private information.
“$WEBSITE did 50,000 signups a month during the beginning of the pandemic, but now struggles to sign up a thousand a week” is a story.
I think that's only if you don't want to leak user count when your ID is an autoincrement. Elsewhere people mention cryptographically remapping integers, which could work (by itself, or before passing the ID to sqids).
Suppose you don't want to leak the count; what's a reasonable way of implementing that?
You can of course have a uuid v7 / ULID or something as the primary key. Or have it as a public-facing primary key mapping back to a sequential-ID PK (there might be some performance hits with larger PKs in e.g. postgres? or is that just FUD?)
But you could also generate a public ID with something like encrypt(seq_id, secret) and then encode it with whatever alphabet and/or profanity filter you'd like - right? The issue then is that all public IDs would be long (and of course there's a decrypt operation on all incoming requests).
Might not be, but I like to start with a big number instead of 0 or 1 to fill all the bits. For example, if your prime is 100019, then your first number in binary is 00011000011010110011 but if your max number is something like 2^53 (00100000000000000000000000000000000000000000000000000000) then you have a lot of unfilled bits. The way I have mine set up, the output ID is always exactly 9 chars. 99% of the time it's naturally 9 chars because just by probability most of the numbers will be large, but some of them come out 8 chars and then I just pad up to 9 so it's nice and consistent.
I have over 20 years of experience, I run a team, and I can code something like this in my sleep, as well as any average developer (but maybe "average" is lower than where it used to be in my day). Don't talk down to strangers on the Internet so arrogantly, you only end up embarrassing yourself.
I went to check what this code does (beyond a base conversion with a custom alphabet) and I laughed my ass off.
1. Literally TWO-THIRDS of the "library" is a clumsily put-together ban list of slurs. Hardcoded in the library. This DOES NOT belong there.
2. It also contains alphabet checks, like for repeated letters, which it runs every single time it produces an ID, instead of statically checking the alphabet ahead of time and just using it in production... or even simply selecting a URL-safe alphabet and sticking with it.
3. Despite all the checks and block lists, the default alphabet doesn't do the most elementary thing when producing alphanumeric ids: exclude letters and digits which look similar and so will look identical in some fonts. Nasty when you have to copy an ID from, say, a screenshot, or a poster, or what have you.
All in all, that's only borderline more utility than the infamous "leftpad" library, and it may actually be harmful in practice, because it bakes in a bunch of specific and/or superfluous concerns of the author's - almost as if they're trying to justify why this is a library in the first place - which will likely not align with the precise requirements of users and their projects' business rules, while excluding others (like the one I mentioned in point 3 above). But I'm sure the kids will be impressed.
Given the retry-on-bad-word feature, I was sceptical of the no-collision claim -- but after looking at the JS source code, I'm confident it's correct.
For each number encoded, one character from the alphabet is "held back" to use as a delimiter (with the particular character chosen changing as each number is processed, presumably to make the output "look more random"). The very first character output is essentially a "free choice" that selects the initial permutation of the alphabet to use.
The algorithm is implemented as a function encodeNumbers() that calls toId() to encode each number, then checks for bad words and recurses with a new value of "increment" if it finds any. To prove correctness it's helpful to imagine an "in-between" function, tryEncodeNumbers(numbers, alphabet), which the outer encodeNumbers() calls in a loop to do most of its work (pseudocode):
function encodeNumbers(numbers) {
    offset = sum(numbers)
    increment = 0
    do {
        result = tryEncodeNumbers(numbers, permute(alphabet, offset + increment++))
    } while (badWordIn(result))
    return result
}
Here permute() is a function that permutes the characters in its first (string) argument according to its second (integer) argument in some arbitrary way.
The toId(num, alphabet) function, which simply encodes a single number using "digits" taken from the characters in the alphabet parameter, is clearly injective with respect to the num parameter provided that alphabet contains no duplicate characters -- that is, if we hold some duplicate-free alphabet string fixed, every distinct value of num produces a distinct encoded string as output. (For example, toId(42, "0123456789") gives "42", and no other value of num produces this string when the alphabet remains unchanged.)

tryEncodeNumbers(numbers, alphabet) first outputs a character representing alphabet (i.e., its initial alphabet -- which is its complete internal state), then joins together a bunch of these toId()-encoded numbers, with an extra character in between each that is known not to appear as a digit in the preceding number, permuting the alphabet in an alphabet-dependent but num-independent way for each encoded number output. Because the alphabet permutation is independent of the input array of numbers, this means that the alphabet used to encode the i-th number depends only on the initial alphabet and i. This means that, again holding its initial alphabet fixed, tryEncodeNumbers() is likewise injective with respect to the input array of numbers.

(Suppose it were not: Then there is an initial dupe-free alphabet, and two distinct arrays of numbers, that produce the same output. Find the first position where the two arrays differ. Since the final results are identical by assumption, one of the two encoded outputs for this position must be a prefix of the other. But if the two encoded outputs are of different lengths, the shorter one must either terminate the entire string, making it shorter than the other encoded string, or be immediately followed by a character that cannot appear in the output of toId() with the alphabet used at this position, both of which contradict the assumption that the resulting strings are equal. Therefore toId() must have output identical strings for the two different numbers at this position, when given the same alphabet. But this contradicts injectivity of toId(), so this (non-injectivity of tryEncodeNumbers()) is impossible.)
The final step is to see that, if encoding two inputs with the top-level encodeNumbers() function gives the same first character C, it must be because the first successful (bad-word-free) loop iteration for the first input passed the same alphabet to tryEncodeNumbers() as the first successful loop iteration for the second input. If the remainders of the two encoded strings are also equal, then by injectivity of tryEncodeNumbers() for fixed alphabet choice their input number arrays must have also been the same. Since no restrictions were placed on the two inputs, this holds for all possible input pairs -- that is, it is impossible for encodeNumbers() to produce the same encoded output string for two different inputs.
Skipping profanity seems like a liability in this design. It means in order to preserve the encoding you need to make the banned word list immutable, otherwise old sqids will decode to the wrong thing when you get them back.
2. Drop all vowels, numbers, most homoglyphs, and the letter 'x'.
3. Map digits 0-9 to one of the remaining letters.
4. Stringify the integer and replace the digit in each decimal place with its corresponding character.
For my use-case, all the numbers were >7 digits long, so the odds of you getting an offensive acronym were reasonably low unless you started combining them.
But there's no perfect solution. As this dataset shows, you can find offense in almost anything if you look hard enough:
Many of those reviewer comments are utterly moronic. And that is my polite opinion.
How does this work? Is there a review board? Is it put to public review? A few of them like "dick out" and "shtlord" are reasonable, but many of them seem so bonkers it looks like the work of trolls.
Anyway, TIL that 1970s Intel was a MS-13 gang outfit and that Octocat really means "eight vaginas".
I could give a fuck about avoiding swear words, but if you want to avoid slurs and eyebleach-inducing ideas and still have any sort of compact representation, I suspect we have to look not at problematic letters but problematic groups of letters. There's nothing intrinsically wrong with the letter E. Not with G, I, N, or R, but you can sure get a lot of attention you don't want by arranging them in the wrong order. K and Y aren't bad either, unless you're hating on Jewish people.
So maybe there's a 5:4 or a 5:3 encoding out there where you avoid making syllables.
I don't think this holds; you can enforce filtering in the encoding step, i.e. be strict about what you output, but always decode, even if the input is profanity. This means you can also be backwards compatible if you update the list, etc. In short, the old maxim: be strict about your outputs and lenient about your inputs.
From their FAQ: "The best way to ensure your IDs stay consistent throughout future updates is to provide a custom blocklist, even if it is identical to the current default blocklist."
The *encoding* changes. The decoding stays consistent:
> Decoding IDs will usually produce some kind of numeric output, but that doesn't necessarily mean that the ID is canonical. To check that the ID is valid, you can re-encode decoded numbers and check that the ID matches.
The reason this is not done automatically is that if the default blocklist changes in the future, we don't want to automatically invalidate the ID that has been generated in the past and might now be matching a new blocklist word.
In that case it sounds like a shortcoming on their part. There is no fundamental reason to have that limitation. I understand the implementation is easier without it, but in my opinion being agnostic to blocklist changes would be a much better value offering.
It's a base62 encoder that takes multiple integers as input. Probably a bit-length prefixed encoding. I am assuming it just pads an extra junk integer to re-roll the encoded number.
There is no need for randomised output, and they've said themselves that you can recover the input from the output. It's literally just so that it looks like something cool and technological is happening. You can skip IDs that contain base64 profanity just by doing the reverse encoding on your block list and skipping those IDs. Or you could exclude vowels from your alphabet and do a base-54 encoding. Or you could grow up and accept that your 4-millionth ID, which no one is going to read, might have a naughty word in it (the horror).
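A sketch of that reverse-encoding idea, assuming a plain positional encoding over a base-62 alphabet (not Sqids' actual algorithm; the example words are illustrative):

    const ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

    // Decode a word to the integer whose encoding it is, or null if it can't occur.
    function preimage(word) {
      let n = 0n;
      for (const ch of word) {
        const d = ALPHABET.indexOf(ch);
        if (d === -1) return null; // this alphabet can never spell the word
        n = n * BigInt(ALPHABET.length) + BigInt(d);
      }
      return n;
    }

    const blocked = new Set(['sexy', 'hell'].map(preimage).filter((x) => x !== null));
    // At allocation time: while (blocked.has(nextId)) nextId++;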
A secure "random-looking id generator" is called a block cipher. Block ciphers with less than 128 bits of output are widely considered insecure: this corresponds to about 22 base64 characters.
Going further, you probably do want to use a secure algorithm. History is full of people thinking "we don't need this to be secure, it's not used for security-sensitive things" and later regretting it. Some case studies at [1].
[1] https://www.schneier.com/blog/archives/2016/04/security_risk... .