Absolutely spot on. Additionally, it's worth mentioning that a lot of content is now locked behind a few major platforms (e.g. Facebook, LinkedIn, Medium, YouTube) or CDNs like Cloudflare, which often block crawlers that don't come from the IPs of Google or other well-known search engines.
While the other costs mentioned here can be optimized with current hardware prices and a good database, anti-crawling measures necessitate thousands of IPs/proxies, making the process even more challenging and costly.
*Additionally, it's worth mentioning that a lot of content is now locked behind a few major platforms (e.g. Facebook, LinkedIn, Medium, YouTube, etc.) or CDNs like Cloudflare, which often block crawling from non-Google IPs or well-known search engines.*
I think this is fine. If I want to find something on one of those big sites, I just go there directly. However, if I want to search the web for a site I’ve never been to before, then I’m stuck with the bad results of the current search offerings. It’s quite depressing!
I created RemotePN because I needed to enter an OTP while an automated cron job was running on a server. It is particularly useful when you want to pause the execution of a script and wait for a manual prompt or response from a user via Telegram, for example when user input or confirmation is necessary before proceeding with further actions.
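As a rough illustration of the "pause and wait for a Telegram reply" idea (not RemotePN's actual code, which isn't shown here), here is a minimal sketch built on the Telegram Bot API methods `sendMessage` and `getUpdates`. The names `wait_for_reply`, `make_http_api`, and the `call_api` transport callable are hypothetical, and the sketch assumes a bot token and a known chat id.

```python
import json
import time
import urllib.parse
import urllib.request


def make_http_api(token):
    """Hypothetical transport: POSTs form-encoded params to the Bot API."""
    def call_api(method, params):
        url = f"https://api.telegram.org/bot{token}/{method}"
        data = urllib.parse.urlencode(params).encode()
        # Allow the HTTP request to outlive the long-poll timeout a little.
        http_timeout = params.get("timeout", 0) + 10
        with urllib.request.urlopen(url, data=data, timeout=http_timeout) as resp:
            return json.load(resp)
    return call_api


def wait_for_reply(call_api, chat_id, prompt, timeout_s=300.0, poll_s=20):
    """Send `prompt`, then block until `chat_id` answers or time runs out.

    Returns the reply text, or None on timeout.
    """
    offset = None
    # Skip updates that were already pending, so an old message is not
    # mistaken for the answer (one batch only; enough for a sketch).
    pending = call_api("getUpdates", {"timeout": 0}).get("result", [])
    if pending:
        offset = pending[-1]["update_id"] + 1

    call_api("sendMessage", {"chat_id": chat_id, "text": prompt})

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        params = {"timeout": poll_s}  # long polling
        if offset is not None:
            params["offset"] = offset
        for update in call_api("getUpdates", params).get("result", []):
            offset = update["update_id"] + 1
            message = update.get("message") or {}
            if message.get("chat", {}).get("id") == chat_id and "text" in message:
                return message["text"].strip()
    return None
```

In a cron script this would be used roughly as `otp = wait_for_reply(make_http_api(TOKEN), CHAT_ID, "Enter the OTP")`, with the script aborting if `otp` is `None`.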
A couple of days ago I put together a comparison list of the best-known UUIDs. It was quite fun discovering all the different kinds of UIDs and their pros and cons.
What's the resolution on those? 32 bits, 100 years... that's seconds, right? That doesn't sound excellent for time ordering. 100 years also seems a little short, but at least I'll be dead.
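A quick sanity check on the range question, as a sketch: it assumes the 32 bits are an unsigned tick counter, which the comment does not confirm, and the helper name `rollover_years` is made up for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year


def rollover_years(bits, ticks_per_second=1):
    """Years until an unsigned counter of `bits` bits wraps around."""
    return 2 ** bits / ticks_per_second / SECONDS_PER_YEAR


print(f"32 bits of seconds:      {rollover_years(32):.1f} years")
print(f"32 bits of milliseconds: {rollover_years(32, 1000):.3f} years")
```

With one-second ticks the range comes out at about 136 years, so a figure around 100 years is consistent with second resolution; at millisecond resolution the same 32 bits would wrap in under two months.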
Author here. You're correct: the title is misleading and equates surrogate keys with primary keys in a simplistic way, especially for cases where you can't make use of a natural key (most of the cases in the domain I work in).
If you can't identify a natural key, that indicates a failure to normalize or even the wrong model of the data.
I'm not sure about this point. As a matter of fact, a lot of entities in the real world do not have a natural key, or if they have one, it could be subject to change. How do you identify a user or an anonymous post on a forum?
Maybe for a user you could just use the email. Then, what if the user wants to change their email?
Or what if one day you want to allow users to sign up with just their social login?
Surrogate keys are always a good level of indirection to keep a consistent identity, and they are especially useful as foreign keys and hence also as primary keys. Of course there are some exceptions (e.g. intermediate tables, where the PK spans multiple columns).
But I'm also curious to hear about other scenarios, since my view is skewed by the business domain I work in.
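To make the "level of indirection" argument concrete, here is a minimal sketch using Python's built-in `sqlite3`. The `users` and `posts` tables are hypothetical; the point is only that the email can change, or be missing entirely for a social-login user, without touching any foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,   -- surrogate key
        email TEXT UNIQUE            -- natural attribute, nullable, may change
    );
    CREATE TABLE posts (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        body    TEXT NOT NULL
    );
""")

user_id = conn.execute(
    "INSERT INTO users (email) VALUES (?)", ("ann@example.com",)
).lastrowid
conn.execute("INSERT INTO posts (user_id, body) VALUES (?, ?)", (user_id, "hello"))

# The user changes their email: only one row in one table is updated.
conn.execute("UPDATE users SET email = ? WHERE id = ?", ("ann@new.example", user_id))

# A user who signed up through a social login has no email at all.
social_id = conn.execute("INSERT INTO users (email) VALUES (NULL)").lastrowid

rows = conn.execute(
    "SELECT u.email, p.body FROM posts p JOIN users u ON u.id = p.user_id"
).fetchall()
print(rows)
```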
Primary key seems like a synonym for surrogate key for a few reasons. I think ORMs almost always assume surrogate key-based schemas, for example. It seems that database design has defaulted to surrogate keys over time. My complaints about the article came from my interpretation that a primary key equals a surrogate key; I apologize for reading more into it than intended.
Primary keys must satisfy a couple of constraints: uniqueness and not NULL. A primary key can change, and all modern RDBMSs can cascade primary key changes to foreign keys. But using a natural key like an email address does open up the possibility of duplicates, whereas surrogate keys never have that problem. With a schema that uses (say) email as a primary key, you have to decide between preventing the addition of a new user because of a duplicate key, or allowing the duplicate email address in the database and dealing with it some other way.
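A small sketch of both behaviours described above, using SQLite through Python's `sqlite3` as a stand-in for "a modern RDBMS". The schema is hypothetical: email is the natural primary key, `ON UPDATE CASCADE` propagates a key change to the referencing table, and a second user with the same email is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for cascades in SQLite
conn.executescript("""
    CREATE TABLE users (
        email TEXT PRIMARY KEY NOT NULL      -- natural key
    );
    CREATE TABLE posts (
        id           INTEGER PRIMARY KEY,
        author_email TEXT NOT NULL
            REFERENCES users(email) ON UPDATE CASCADE,
        body         TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO users VALUES ('ann@example.com')")
conn.execute(
    "INSERT INTO posts (author_email, body) VALUES ('ann@example.com', 'hello')"
)

# Changing the primary key cascades to the foreign key column.
conn.execute(
    "UPDATE users SET email = 'ann@new.example' WHERE email = 'ann@example.com'"
)
cascaded = conn.execute("SELECT author_email FROM posts").fetchone()[0]

# A second user with the same email violates the primary key constraint.
try:
    conn.execute("INSERT INTO users VALUES ('ann@new.example')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(cascaded, duplicate_rejected)
```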
I try to use natural keys but as you point out that can get impractical and seem quixotic when surrogate keys solve real-world problems. I don't default to surrogate keys (and I don't use ORMs) but I often fall back on surrogate keys. And I've had to refactor databases away from natural keys to surrogate keys because the original key caused more problems than it solved.
As for making primary keys publicly visible, that comes up in web applications that include IDs in URLs. That information -- that I have user ID 2345 -- lets hackers try scanning the space of IDs to see what they get. UUIDs or some other non-sequential values prevent that to some degree, at the cost of dealing with long keys creating large indexes and more cache thrashing in the RDBMS. Which works best depends on the application and what other measures you can take to prevent hackers digging through your database with HTTP requests.
Very interesting insight. As you say, ORMs probably instilled this way of thinking, but I fully agree that the primary key is not limited to that function.
* at the cost of dealing with long keys creating large indexes and more cache thrashing in the RDBMS. *
This is actually one of the points I'm currently dealing with, especially finding the right trade-off for managing records in the order of hundreds of millions or billions. The right decision could save gigabytes of space both on disk and in RAM, and seconds of computation during each JOIN operation, for example.
That's why I drafted a comparison table. Sometimes I see people start using UUIDs right away just because ORMs offer them out of the box or because it sounds "cool", without knowing that there are better or simpler alternatives out there.
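The space side of that trade-off can be estimated with back-of-the-envelope arithmetic. The sketch below counts raw key bytes only; real engines add per-row and per-page overhead, so treat the numbers as lower bounds. The key sizes are the usual fixed widths, and `raw_key_gib` is a made-up helper.

```python
# Bytes needed to store one key value, by representation.
KEY_SIZES = {
    "int32": 4,
    "int64": 8,
    "uuid_binary": 16,  # 128-bit UUID stored as raw bytes
    "uuid_text": 36,    # canonical hex form with hyphens
}


def raw_key_gib(rows, key_bytes, copies=1):
    """GiB of raw key data; `copies` counts every place the key is stored
    (the primary key index plus each foreign key column or index)."""
    return rows * key_bytes * copies / 2 ** 30


rows = 1_000_000_000
for name, size in KEY_SIZES.items():
    # Assume the key lives in the PK index and in two referencing columns.
    print(f"{name:12s} {raw_key_gib(rows, size, copies=3):7.1f} GiB")
```

Under these assumptions, moving from a binary UUID to a 64-bit integer halves the raw key footprint, and the gap grows with every table that references the key.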
I've run into the same problems. You compiled good, useful information; I will bookmark it. If you had named it "Choosing the best surrogate key format" I wouldn't have complained at all!
I have a customer with a fairly large database (university students). URLs on the site often have primary keys in them. They use autoincrement integers for the primary keys. Most operations require the user to log in first, so the system can check access to rows based on user permissions, so even if someone tries to change the key in the URL they won't get anything. For the handful of public-facing URLs that have a key in them we set up tokens (UUIDs) that map through an intermediate table to the real key. That way we get the benefits of simple incrementing integer keys for authorized users and internal operations, and hide those keys when presented to the outside world. Very much what you describe. We could not identify good natural keys for the main tables. Some tables have natural keys, e.g. the country table uses ISO country codes. The ordering/pagination issue you mention in your article falls outside the realm of proper relational database design but comes up a lot in real applications.
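The token-mapping setup described above could look roughly like this. It is a sketch, not the customer's actual schema: the `students` and `public_tokens` tables and the `issue_token` / `resolve_token` helpers are hypothetical names.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE students (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- internal key, never public
        name TEXT NOT NULL
    );
    CREATE TABLE public_tokens (
        token      TEXT PRIMARY KEY,             -- random UUID shown in URLs
        student_id INTEGER NOT NULL UNIQUE REFERENCES students(id)
    );
""")


def issue_token(conn, student_id):
    """Create a random public token that maps to the internal key."""
    token = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO public_tokens (token, student_id) VALUES (?, ?)",
        (token, student_id),
    )
    return token


def resolve_token(conn, token):
    """Return the internal key for a public token, or None if unknown."""
    row = conn.execute(
        "SELECT student_id FROM public_tokens WHERE token = ?", (token,)
    ).fetchone()
    return row[0] if row else None


student_id = conn.execute(
    "INSERT INTO students (name) VALUES (?)", ("Ada",)
).lastrowid
token = issue_token(conn, student_id)
print(token, "->", resolve_token(conn, token))
```

Internal queries and joins keep using the compact integer key, while a guessed or incremented value in a public URL simply fails to resolve.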
While working on a big data and distributed system, I started asking myself which database primary key is best in terms of performance, security, and functionality. Despite some answers on Stack Overflow and a couple of blog posts around the web highlighting the pros of some identifiers over others, I didn't find a complete overall picture. So I decided to create this comparison table.
My next step is to understand the best way to avoid exposing the primary key to the world. Maybe an additional column with a public random ID? Maybe encoding/decoding the PK with some cryptographically secure algorithm in the REST API layer? Or, better, a key-value mapping cache layer between the public ID and the internal primary key?
I understand. I've been in the same situation before, and I wouldn't recommend it to anyone.
Having more than 2 or 3 tools is like having no tool at all, or worse. This idea was born more for open source communities, hobbyists, etc. than for companies. So yeah, I agree the headline here is a little bit misleading in that sense.
These kinds of communities usually need their content to be indexable by Google or other search engines in order to be discovered by new users. So in my mind it shouldn't dilute the attention of the community members. Of course it's still a proof of concept, and any feedback is very welcome to help me understand the environment better.
I've seen a trending post on HN about the disadvantages of using Discord vs a forum.
Coincidentally, over the past few weeks I've been working on a solution for that, which allows you to create a mirrored version of your Discord server as an online forum.
I certainly get having some kind of searchable log of a Discord server, but turning that into an actual forum with separate discussion threads and everything strikes me as something that probably doesn't work all that well. The forms of communication used in chats and in forums are pretty different. How have you handled this?
How about the other direction as well, forum-first discussions going into chat?
That's a very good point. The idea is to allow the moderators to pick the best content with the help of a Discord bot. Of course I recognize the content shared won't be on par with what you can find on a forum, but that depends a lot on how the community is using threads. I'm also working on a way to filter out meaningless content, or at least help moderators repurpose only the best content they have on Discord. The other way around also seems like a very good idea; I'll think about it for sure.
Or maybe you can do both, that's the reason behind the tool I'm building right now: https://www.channelsync.chat/
This way you can mitigate both #1, "the memory hole", and #2, "Google can’t see inside chats", by creating a mirrored forum version of the best content you have on Discord.
It's still in closed alpha right now.
But I would love to know what you think about it.