Ask HN: What are some examples of good database schema designs?
437 points by vanilla-almond on Feb 14, 2020 | 175 comments
I've seen many examples of database schemas but don't always recognise a good design over a mediocre or poor design.

This is what I mean by schema (is this a good design?)

https://www.drupal.org/node/1785994

When I read about database schemas with thousands of tables I wonder what kind of problem could possibly require so many tables.

What are good schema examples? And what are some poor examples?




I'd highly recommend reading SQL Antipatterns. It's a very approachable book that illustrates how to design a database schema by taking commonly encountered scenarios and first showing you the naive approach. After explaining why this is bad, it then shows you the recommended way of doing it and why this way works better.

I think 'learning from how not to do something' is a really powerful pedagogical technique that should be more widespread.


I tried that in a class I taught. The students were very frustrated and considered it a waste of time.

I agree it’s a FANTASTIC way to learn. I was very disheartened I didn’t effectively communicate that to the students.


Just a guess, but your students probably didn't have the frame of reference or perspective to appreciate good design or bad design.

Someone who's been in the trenches for several years would probably get A LOT out of this technique, as they would have many experiences to pull from.


Beginners often lack the experience to appreciate "critical thinking" based learning. At first they just want (need?) to know the steps to get something right, especially when they are still not particularly fluent in the very basics.

Once you've got some experience (which usually means getting things wrong a few times, seeing wrongness promoted to production because there isn't time to refactor, and having to fix the wrongness later once more layers of wrong are piled on top) you appreciate this sort of analysis much more, as it helps you get things right the first time more often, and/or helps you spot the actual problem (rather than chasing symptoms) in more complex situations.


It's like humans have to learn how to extract features before they can make efficient use of straightforward supervised learning...


> Beginners often lack the experience to appreciate "critical thinking" based learning.

I don't think that is it. _Beginners_ being the critical word.

Most learning is part of a negative feedback loop: if we only ever succeeded, we wouldn't know why we succeeded. Failure has such bad connotations in our society that it blinds students to deeply understanding a subject. Maybe replace the word with "experience"?

Back to the subject of _Beginners_: we really should be teaching students philosophy, cognitive science, and epistemology from a very young age. They should embrace experience; it shouldn't be up to the database schema instructor to teach both data modeling and learning by failure. Students should be fully versed in how to care for and feed their brains by the time they arrive in the GP's class.


Generally yes, the general learning and problem solving mindset is going to help more than teaching "do this to do that" by rote.

But in any specific subject you need a certain level of basic knowledge taught that way before you can be expected to use the tools available (and potentially discover more) to problem solve.


I should have been more clear. My conjecture is that if we trained students on learning theory specifically and directly, they would understand the "suffering and pointlessness" of duplicating failed solutions. They would more openly embrace experience (failure) as a necessary didactic tool.

Destigmatize not #winning and embrace experiential learning.


Also, the "right" way often changes in technology, and if you only learned the "right" way, then when a better way comes out you'll miss it, because the "right" way is all you know.


In a work-related context recently, I saw this approach used to attempt to describe how to achieve something. The problem for me was that I didn't understand the context well enough to understand why I'd initially even choose the wrong approach, let alone how that then related to choosing the correct approach. Making sure the fundamentals are understood is important here.


> SQL Antipatterns

I'd definitely second that recommendation, both for relative beginners and those of us who have been at it long enough to have learned and forgotten these things a few times over...

https://pragprog.com/book/bksqla/sql-antipatterns or your favourite [e]book seller, for those wanting a copy.

> why this is bad, it then shows you the recommended way of doing it and why this way works better

He also takes the time to mention when the anti-pattern might be OK to use.


SQL Antipatterns

Saw this thread and just acquired and read this. The book's premise is a great one, I just don't like the execution. Years of my life were dedicated to SQL CRUD and schema evolution before I dabbled briefly in NoSQL (meh) and random caching systems, then had my aha moment and upgraded to files on unix (awesome caching! great compatibility!) and occasional use of SQLite (easy backup and parallelism! no RDBMS master/slave complexity!).

While there are a few core issues raised, and all struck a chord, I was not a huge fan of the book overall because it was needlessly verbose. I found its authoritative tone grating. Ultimately with so many dialects and projects, SQL style is personal, organizational or project level preference. One issue I felt was undertreated was clarity of syntax. For example, I personally absolutely loathe any use of JOIN as needlessly obtuse cognitive baggage. An untreated issue was (over/mis)use of stored procedures.


How the hell do you deal with locks properly on files? Or is each file a row?


Right tool for the job. If you have a write-heavy, rapidly evolving dataset with short publishing times, critical contention considerations, and absolute referential integrity requirements, or a huge dataset where memory efficient access is an issue, SQL may be your friend. But try SQLite before a full-blown RDBMS. See also https://yourdatafitsinram.net/

In the real world, however, most systems do not have this type of requirement.


I'm sorry, can you help me understand how to stop using JOIN? Some examples or articles on the subject? I can't wrap my head around such a concept.


Your approach sounds sensible! Do you have some blog posts or pointers which further explain the approach and the factors to consider?


I've started reading, and already the second anti-pattern, "Naive trees", is something I've been dealing with at work recently. Hey, how'd he know?!? Cool book.


THANK YOU!!!! This is a fantastic reference, and helps me put some teeth behind my instincts when it comes to SQL.


Thanks for the SQL Antipatterns book recommendation. Just ordered it.


[flagged]


Please do not share links to pirated copies of resources such as this. I can't stop you from pirating it yourself, but it shouldn't be actively encouraged.


Just out of curiosity, why? Pirating traditional media, journals, and movies and entertainment is actively encouraged on this site, with popular sites and software for doing so frequently referenced, and it is never flagged. Seems like a double standard.

(Full disclosure, I agree with your objection and believe piracy is wrong in any case, but seems like we should be consistent with the application of the rules)


There is a difference between talking generally about piracy and its ramifications, even encouraging it as some would be doing, and directly linking to unlicensed copies of material.

Even if an individual did believe that piracy is perfectly acceptable, direct linking still puts the site under risk of moving into the cross-hairs of law enforcement.


I do not agree there is a material difference in that case. Whether or not the site is under the crosshairs is also irrelevant to the point.

Besides, as I mentioned in my comment, direct links to hubs that contain pirated material and direct links containing suites of tools specifically used for pirating such as DRM removal on ebooks and video players, and bypasses for news organizations and scientific publications are also frequently directly linked.


I've seen links to hubs taken down like this link to a specific item, so if there is a discrepancy it isn't consistent. Either a few slip through the cracks and you've seen them, or I'm only seeing the other portion for some reason.

Mentions of key hubs like The Pirate Bay don't count as linking to them IMO, though they might effectively be so, as the mention might clue some in to their existence. Discussing the matter completely without mentioning those places/services would not be practical.

Tools for removing DRM have valid uses. Minority uses, sure, but still valid. Whether removing it for your own transcoding & platform shifting purposes, rather than distribution, is morally OK is a complex discussion (I say yes, lawyers will almost certainly say no!), but if you allow that then the valid use list grows.

Bypassing news filters is usually a case of exploiting the fact that they give different responses in different cases to improve, for example, their Google rank. This is against the engines' policies. You can of course argue that "two wrongs don't make a right", but then equally one wrong doesn't either! I'm not sure if the scientific publications matter is the same or not, I'll have to submit to your expertise/experience there.


Could this be a citizen of Antigua and Barbuda?


I don't think this is a legal (US laws) copy of the book. It is still under copyright and the publisher has not released it as a free download.


PostGraphile [1] is a framework for generating a GraphQL API based on the tables in your database; as a result, good database design is crucial. Graphile Starter [2] is a quickstart project that demonstrates best practices for getting up and running with PostGraphile quickly. In particular, check out the SQL migration file in that project [3]. It demonstrates:

1. Dividing up tables so that one user can have more than one email address

2. Using PostgreSQL’s “row-level security” system to restrict database results based on the logged in user

3. Dividing tables across multiple schemas for additional security

4. SQL comments for documentation

5. SQL functions for common functionality, such as logging in or verifying an email

It’s a fascinating project, and well worth a look!

[1](https://www.graphile.org/postgraphile/)

[2](https://github.com/graphile/starter)

[3](https://github.com/graphile/starter/blob/master/%40app/db/mi...)
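For a flavour of what points 1, 3, 4 and 5 might look like, here is a minimal, hypothetical sketch (this is not the starter's actual SQL; all table, column and function names are illustrative):

    CREATE SCHEMA app_public;   -- tables exposed through the GraphQL API
    CREATE SCHEMA app_private;  -- hidden tables: secrets, password hashes, etc.

    CREATE TABLE app_public.users (
      id       serial PRIMARY KEY,
      username text NOT NULL UNIQUE
    );

    -- one user, many email addresses (point 1)
    CREATE TABLE app_public.user_emails (
      id          serial PRIMARY KEY,
      user_id     integer NOT NULL REFERENCES app_public.users(id),
      email       text NOT NULL UNIQUE,
      is_verified boolean NOT NULL DEFAULT false
    );

    CREATE TABLE app_private.user_secrets (
      user_id       integer PRIMARY KEY REFERENCES app_public.users(id),
      password_hash text NOT NULL
    );

    -- SQL comments double as documentation (point 4)
    COMMENT ON TABLE app_public.user_emails IS 'Email addresses belonging to a user.';

    -- common functionality as a SQL function (point 5)
    CREATE FUNCTION app_public.verify_email(email_id integer) RETURNS boolean AS $$
      UPDATE app_public.user_emails SET is_verified = true WHERE id = email_id RETURNING true;
    $$ LANGUAGE sql;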


If you are going to auto generate an api for a database, just use SQL. Adding extra steps between you and a database with no encapsulation is just adding extra steps for no reason.

One of the key things you need to do in good database design is to map business verbs to API endpoints in something like a 1-1 way. Having an api endpoint that is essentially "insert this row to the database" is just cargo culting. There is already an API for that, it's called SQL.


It's not generally safe to expose SQL to untrusted clients. For example, PostgreSQL 12.2 was released yesterday and fixed a security issue where `ALTER ... DEPENDS ON EXTENSION` did not have any privilege check whatsoever. SQL is also not at all well suited for the needs of frontend web app developers - just ask Facebook about their experiences with FQL! Using an API that's more ergonomic for the frontend, such as GraphQL, backed by a language which is optimised for the backend, such as SQL, is the best of both worlds.


No idea why Benjie is getting downvoted; SQL from a client is a bad idea, and writing billions and billions of CRUD endpoints is soul-draining.

PostGraphile is the best of both worlds, providing a nice GraphQL interface on top of your database.

If you decide you need to write crud endpoints, you have your database still. It's a zero-cost abstraction, which is wild.


Yes, of course you can't allow uncontrolled SQL execution, but an API that just maps to CRUD operations isn't good either.


Agreed: an API that _just_ maps to CRUD operations isn’t good. I’m not advocating for that, neither is singingwolfboy, and the starter repo he’s linked to basically does not use them: there are only 4 CRUD mutations, all the others are custom. I rarely use CRUD operations in PostGraphile, mostly I use custom mutations either defined in SQL or TypeScript.


Counterpoint yes it is.


Not according to https://publications.opengroup.org/standards/soa

There is a lot of context lost in generalities so I admit you have to look at every specific situation, but in general CRUD means pushing business logic down to the client (which is generally some kind of code running in a browser or mobile app), which is the opposite of everything good in the world.


I'm using Postgraphile in production on 2 real-world projects and it saves me an incredible amount of time. I spend 90%+ of my time on the front-end because the api is all automated.

PostGraphile allows you to rename/disable API endpoints if needed.

Postgraphile rocks. And it's written in modular JavaScript, so I can hack it if I need to. Unlike Hasura.


I'm saying it does something that is a bad idea in the first place. You are saying "yeah, but it does it with so little effort".


No I'm not agreeing with you at all.

It's a great idea, and my clients and bank account agree with me.


This is frustrating.

There are lots of ways to do software. Some are widely considered by experts to be good, some are not so good.

Then there is a thing called business. You can certainly sell software that is built using bad practices, and honestly nobody will probably complain so long as it works. It might not be quite as maintainable, it might require more effort to add features, and it might be necessary to completely rewrite that software in 5 years when other parts of the system change.

Terrible software is bought all the time, and that's not even really a problem.

Even though you sold it and your customers are happy, there still are things you can probably learn, right?


Users having multiple addresses is something I've cursed a lot over. I work in a team that does data analytics for a news publishing company, and our print business is still very important. Unfortunately, in our database of print customers, users are basically addresses, because as a distributor you don't really need to know how many people are receiving your paper, only where and how many papers. Since it's also been a safe assumption for a century that people share newspapers with each other, market research was done street by street to inform ad buyers of which markets we reached. Many people have more than one home. Some people take out another subscription for a relative.

This mapped very awkwardly to digital subscribers who we had individual data on. We were able to join databases in a way that sort of works through more or less (mostly less) comfortable assumptions. The queries are not pretty.


There's a whole subfield of information science dedicated to basically this exact problem: entity resolution.

Hilariously, it has dozens of names, because it just comes up in so many places for so many people. It appears that "record linkage" is the term that has won the top spot at Wikipedia: https://en.wikipedia.org/wiki/Record_linkage


Record linkage seems to be unrelated. While OP isn't sure how to segregate and join data, he has perfect joining capability through unique indices.

Record linkage seems to be concerned with joins that aren't guaranteed to be correct because there are no unique keys.


Multiple email support does indeed seem complex; even the most popular CRM on the market doesn't support this feature, even though it's requested a lot. https://success.salesforce.com/ideaview?id=08730000000BrPIAA...


>> 2. Using PostgreSQL’s “row-level security” system to restrict database results based on the logged in user

I'm interested in PostGraphile, but I have a question: how do you apply permissions when your user table is different from the Postgres user system? I only have a handful of users that have permissions spanning a lot of tables.

Do I need to create one Postgres user for each of my application users?


Absolutely not, this is a common misconception. Have a read of this: https://learn.graphile.org/docs/PostgreSQL_Row_Level_Securit...
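Roughly, the common pattern is that application users live in an ordinary table, and RLS policies read a per-request setting rather than the Postgres role. A minimal sketch (the `jwt.claims.user_id` setting name and the `app_public` schema are illustrative assumptions, not necessarily what the linked guide uses):

    CREATE SCHEMA IF NOT EXISTS app_public;

    CREATE TABLE app_public.posts (
      id        serial PRIMARY KEY,
      author_id integer NOT NULL,
      body      text NOT NULL
    );

    ALTER TABLE app_public.posts ENABLE ROW LEVEL SECURITY;

    -- the policy compares each row against a per-request setting, not a database role
    CREATE POLICY select_own ON app_public.posts FOR SELECT
      USING (author_id = NULLIF(current_setting('jwt.claims.user_id', true), '')::integer);

    -- the backend sets that claim for each request/transaction, e.g.
    SELECT set_config('jwt.claims.user_id', '42', true);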


The Stack Overflow schema is decent, especially for a relative amateur to study. It's nothing special or complex; however, it's a good example of something that actually works well in practice (and at massive scale). The CRUD-CMS Q&A style lends itself nicely to a basic db schema that is easy to get your head around at a glance.

https://meta.stackexchange.com/questions/2677/database-schem...

https://i.stack.imgur.com/AyIkW.png

19 million questions, 29m answers, 12m users, 10m visits per day, and most of the db action is in less than two dozen tables.
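For those who don't want to click through, a simplified sketch in the spirit of the public data-dump schema (column names are approximate, not the exact production DDL):

    CREATE TABLE Users (
      Id          integer PRIMARY KEY,
      DisplayName text NOT NULL,
      Reputation  integer NOT NULL DEFAULT 1
    );

    CREATE TABLE Posts (
      Id               integer PRIMARY KEY,
      PostTypeId       integer NOT NULL,              -- 1 = question, 2 = answer
      ParentId         integer REFERENCES Posts(Id),  -- answers point at their question
      AcceptedAnswerId integer,
      OwnerUserId      integer REFERENCES Users(Id),
      OwnerDisplayName text,                          -- denormalised copy of the display name
      Title            text,
      Body             text NOT NULL,
      Score            integer NOT NULL DEFAULT 0
    );

    CREATE TABLE Votes (
      Id         integer PRIMARY KEY,
      PostId     integer NOT NULL REFERENCES Posts(Id),
      VoteTypeId integer NOT NULL,
      UserId     integer REFERENCES Users(Id)
    );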


This is the only answer yet that seems to actually answer the question :) Great example!


Interesting to see the denormalization of user Display names on the most important tables, but not everywhere.


Having seen some training sessions on performance tuning that use this database as the example, my educated guess is that it's done on purpose. I saw a few cases where the usual rules have to be bent to get functionality; sometimes reality beats the book.


Oh, I'm sure it is on purpose. I have had the same problem before and seeing this example makes me a little more comfortable to denormalize like that.

I'm curious what they do if someone changes their name (though that should be really rare on SO)


Any idea which tool was used to generate the ER diagram?


I believe it was: https://dbschema.com


Thank you!


A schema is a way of designing data structures such that they are efficiently organized and easy to use.

But that leaves the obvious question: use for what?

Structures, whether it's databases or object graphs, exist for only two reasons: to do stuff and to fit into a pattern of rules you've decided to use beforehand. Without either of those, there is no way to judge a schema. We could talk about generally organizing data. That's set theory and normal forms. But that wasn't your question.

I would design the schema for a three-microservice widely-distributed high-performance application far differently than I would a local checkbook app for my brother. You gotta have more information.

Apologies. I am unable to help.


I agree, and I've come to the conclusion that you should avoid designing a database schema until you have some clear understanding about how the application you're persisting data for will be used. The data structures and schemas will practically jump out at you. It seems, at this point in my career, a very obvious thing, and yet I cannot communicate this point of view to my colleagues in a way that will effect change.


>I agree, and I've come to the conclusion that you should avoid designing a database schema until you have some clear understanding about how the application you're persisting data for will be used.<

I don't buy this argument, with due respect. (I see a lot of this line of thought these days, and I'm genuinely worried we are about to swing the pendulum of software design into the dark ages, where focus is taken away from the data model in favour of the convenience of process design, front-end design and data access frameworks.)

In my experience: the objects from the problem domain will always have certain immutable relationships with each other right from the beginning. You have to look for them first before you do anything. They are clearly identifiable. What to persist and what not to persist are also discernible from existing manual business processes and the documents operational staff use daily (many of these are legal documents like tax invoices, contract confirmation, mandates and policy agreements etc ... a great way to start is to look at what reports users might need. Yes, those boring reports! Start there). Then, cardinalities and ordinalities will be known the moment you understand the objects relationships too (Can a client with product A also have product B? etc). Then, identifiers, business keys and other coding formats are also determined very early on in the project.

All of this before you even choose a front-end framework or the people's favourite ORM library.

I start with the data model. I use a good ER modeling tool to avoid frustrating myself when large changes are needed for my model. The models, when sufficiently complete, are printed in large format and pasted on walls for all to see daily. That is then considered holiest of holies.

(I understand that data models that need changes other than adding new columns or new sub entities can be painful to refactor ... but the fear of this eventuality should not automatically translate into turning the development process on its head.)

Just my 2c.


Analysis is great and absolutely necessary.

Too often we confuse analysis with design, then the heartaches start. Some problem domain things are immutable, some are not, some change over a fixed range, some are firm but expected to change, etc. All that stuff is critical to know as part of analysis.

I'd argue that you can do analysis incrementally right along with everything else. Reports sound like a great starting point. I agree with your advice, even the part about the pendulum swinging too far the other way. The problem is that, honestly, we suck no matter which way the pendulum swings. We continue to confuse the process with the goals.


>All that stuff is critical to know as part of analysis.

Exactly. I wonder if the rush to get started as soon as possible ultimately results in data models that aren't ready yet and are therefore brittle (IOW: insufficient analysis and logical testing of the data models before UI and other process logic commenced).


I think the fear with defining the models too early is that you won't see what's not needed; this approach probably has a tendency towards adding unnecessary models, indices, etc.


For my last few projects, I’ve started by building with a document-based database knowing a rewrite will come within 6-12 months. On rewrite, I can design the new relational db with 5NF very quickly.


> I would design the schema for a three-microservice widely-distributed high-performance application far differently than I would a local checkbook app for my brother.

I think it might be helpful to expand on what some of those differences are. I think most of us here would be able to come up with at least a satisfactory schema for a local checkbook app. What are the changes you need to make to that type of schema to make it appropriate for a multi-service app, specifically? What about distributed? HP? Are any of these changes in conflict with each other? Etc


I agree with everything you said, except for the part about your being unable to help. I think OP is asking what information informs schema decisions, and what are some heuristics to use. Expanding on the differences between your two examples would be very valuable!


I get what you're saying, that by moving to the meta level we can talk about heuristics and patterns of development.

Unfortunately, and I apologize for sounding difficult, this is still far too broad to gain traction on.

I think the thing to remember when you're learning various architectures, from database schemas to build pipelines, is that many times the people teaching you are teaching you from a position of having a completed project and then looking back on the lessons learned and applying some heuristic-making to them.

For instance, if you look at database normalization, which I started with when I started coding, it makes total sense for a small-ish project. Back in the day, you controlled the app, the machine, the storage, and the code. You owned it all. So changes to the schema involved a finite and easyish-to-do set of practices.

This started falling apart really quickly, though, with folks talking about impedance mismatch just a few years after relational databases went mainstream. If I had to generalize, everything got more and more complicated and the assumption that you could grab the entire application in your head easily and change it was no longer true. Then came a ton of CASE tools, now ORMs, and so forth, all in an effort to get us back to easily owning and changing data schemas.

But the real problem was there all along, we just didn't realize it: thinking you knew everything and could manage it. This idea worked great in classrooms, and worked great in personal and small apps. It even worked great in larger apps with tight control. But at the level of complexity we have now, it just doesn't seem realistic to be teaching people the perfect way to do things. (Of course, they should be aware of them!) Instead, what's needed is how to gradually get from here to there in a complex world without getting lost. So the heuristics you'd get would be from perfect-world completed apps, and what you really need to know is, well, how to develop software. That kind of advice ain't happening in an HN thread.

Hope that made some sense. Ping me offline if I can help explain any more.


I've been using Len Silverston's Universal Data Models for 15 years. You'll be writing to lots of tables and will want views for your common aggregates. But you'll have the common tables you'll need, the patterns for those you don't, and be able to handle new requirements with minimal change.

There is no Customer table.

"The Data Model Resource Book, Vol. 1: A Library of Universal Data Models for All Enterprises"


>> "The Data Model Resource Book, Vol. 1: A Library of Universal Data Models for All Enterprises"

[1] https://www.wiley.com/en-us/The+Data+Model+Resource+Book%2C+...


As per my comment elsewhere, this open source project implements a lot of that book (I can second the recommendation):

https://ofbiz.apache.org/


In a similar vein, I recommend David Hay's "Data Model Patterns: Conventions of Thought". It's more focused on the concept model side, but with awareness of how models are implemented in db schemas and the trade-offs involved.


1000 tables isn't an extraordinary amount. As far as database schema design goes, it really depends on your needs, e.g. is your database intended for OLTP or OLAP use? Depending on your needs you can decide to what degree you want to normalise your tables; also there are lots of schema designs to choose from (star, snowflake, etc.) and it's worth reading about them. Having a good schema design is about knowing the data you have and how you plan on using it; there is no one "good" solution in my opinion.

Edit: Security is also something worth taking into account; design for the applications and users that are going to access your database. You might create a separate schema for an application and a separate schema for users, and then grant them access to a "main" schema, for instance, but this really all depends on your needs in the end.
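For anyone unfamiliar with the star layout mentioned above, a tiny illustrative sketch (all names are made up): one wide fact table surrounded by small dimension tables.

    CREATE TABLE dim_date (
      date_key  integer PRIMARY KEY,   -- e.g. 20200214
      full_date date NOT NULL,
      month     integer NOT NULL,
      year      integer NOT NULL
    );

    CREATE TABLE dim_product (
      product_key integer PRIMARY KEY,
      name        text NOT NULL,
      category    text NOT NULL
    );

    -- the fact table holds the measures plus foreign keys to the dimensions
    CREATE TABLE fact_sales (
      date_key    integer NOT NULL REFERENCES dim_date(date_key),
      product_key integer NOT NULL REFERENCES dim_product(product_key),
      quantity    integer NOT NULL,
      amount      numeric(12,2) NOT NULL
    );

A snowflake schema further normalises the dimensions, e.g. splitting category out of dim_product into its own table.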


Not a database person but 1000 tables does seem like an extraordinary amount imo. At that scale I wonder if such a use case warrants a document based database...


I have been working in the banking industry since 2015 as an Oracle database developer and contractor, and having a few hundred or several thousand tables in a single database schema is typical. In a single ETL or ELT process you can have several different kinds of tables, such as staging tables, configuration tables, tables for slowly changing dimensions, enrichment tables, temporary tables, tables for partition exchange, etc.


On a somewhat related note: the OpenType (OT) font format specifies 55 different "tables". These are not really what would be a table in a database but more like sections that may have multiple tables. E.g. the "math" OT table has constants, glyph info, italic corrections, top accent attachment, kern info, variants, glyph construction... -- I can't say for sure but I think a hundred tables to properly describe a font is probably a good estimate.


On this topic: I'm in the process of making a compiler for a DSL I designed to help with the schema design process.

https://gist.github.com/nomsolence/69bc0b5fe1ba943d82cd37fdb...

Pictured is the compiler internals; the attached .syrup is the DSL. (It started as a tangent on a project I'm doing for my girlfriend; the schema described in the .syrup has been improved a bit since.)

Note: things after # are comments.

I find even just defaulting to NOT NULL and not having to worry about commas is a boon for when I create schemas.

The DSL will of course support things like compound primary keys and SQLite's WITHOUT ROWID.

I'll post the code here, likely before the weekend: https://github.com/nomsolence/syrup


I really like the arrow for foreign keys. That and the intermediate table for n-n relationships (and the associated join queries) are the main pain points of SQL, in comparison to object modeling.


10 years ago, I'd have said "at least third normal form"... but today: whatever gets the job done. When the application is not really dependent on weird queries (e.g. just a blog), screw the normal forms and design your schema to use the least number of queries for a certain task. Nobody understands three lines of code of queries with left and right joins.

On the other hand, if your bookkeeping application uses a database try to keep things as tidy as possible.


Just don't break the first normal form. Nowadays, with json columns, breaking the first normal form is a real dumb move.
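To make the point concrete, a small hypothetical example: packing a list of values into a json column versus the 1NF version.

    -- what the comment warns against: a list of values packed into one column
    -- CREATE TABLE person (id integer PRIMARY KEY, emails jsonb);  -- '["a@x.com","b@x.com"]'

    -- first normal form instead: one atomic value per column, one row per email
    CREATE TABLE person (
      id   integer PRIMARY KEY,
      name text NOT NULL
    );

    CREATE TABLE person_email (
      person_id integer NOT NULL REFERENCES person(id),
      email     text NOT NULL,
      PRIMARY KEY (person_id, email)
    );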


I'd say this applies to virtually all best practices, patterns, architectures, etc. If you're doing something very simple, who cares about modularity or any kind of code hygiene? I don't. But what happens in reality? Simple and small systems or experiments grow, one little addition at a time, and we all know the mess that ensues.

So in my understanding, the question posed only applies to at least moderately complex systems, which is where engineering skills matter. And in that context, learning what distinguishes a good database design is obviously very valuable, not to say crucial.

> Nobody understands three lines of code of queries with left and right joins.

Not sure if you're being flippant, but a) this is not true, and more importantly b) why is it that we don't expect programmers to be at least as fluent in SQL as in other, less important, languages?


I'd rather say: use an ORM! It will design the DB schema better and faster than you. Still comprehensive enough.


If you’re doing any moderately complex analysis of the data in your database, the ORM will quickly start falling down. Abstracting the query layer into the application codebase is nice, and mapping entities to objects is nice, but ORM is not a silver bullet. Learning what makes for a good schema vs a bad schema and how to avoid N+1 loading or query problems is important no matter what.

ORMs aren’t bad, but learn their escape hatches or else you’ll have a hard time doing more complicated things.


It also usually forces your design towards the entities themselves rather than the specific way they’re stored, which positions you better for switching to a completely different storage system in the future if, for instance, it’s becoming too slow or expensive to maintain everything in a traditional big name RDBMS.


> forces your design towards the entities themselves

I agree that it's very important to not let the physical schema leak into the rest of the system, and to have a strong conceptual model (aka entities and relations). This has been well understood for almost half a century: https://en.wikipedia.org/wiki/Three-schema_approach

But I don't think ORMs are in any special position to help with this. They typically introduce so much other confusion that they tend to divert attention from designing both a good physical schema and a good conceptual model, and from maintaining a sensible mapping between the two. This can be done with RDBMS views for example, with a fraction of the overhead of an ORM. Most ORM-based code bases I've seen leak tons of db-level details.

> switching to a completely different storage system in the future

Designing for this eventuality is not healthy IMO. If you get there, it will be a so-called "good problem to have" and you will have to deal with whatever unique challenges you face at that level. We might as well be writing code with the possibility of "switching to a completely different programming language in the future" in mind. Yes, clean, modular code will help, but beyond that, not committing to the capabilities of the tools you have chosen will harm your system.


I disagree with pretty much everything you said. :)

I think there are plenty of bad ORMs and there are plenty of ways to use the good ones in a bad way, but that doesn’t mean that they aren’t providing the value I mentioned. For instance Entity Framework Core with code-first migrations has you designing the data models themselves, then wiring up relationships and other metadata (indexes, keys, etc.) in the DB context itself - your actual entities are completely portable and have nothing to do with the db itself outside of being used by it.

And sure, needing to switch to another storage system may be a good problem to have... that doesn’t mean you should explicitly tie all of your code to one particular RDBMS. If a user is a user is a user, it shouldn’t matter to anything else in your codebase how or where it is stored, it should still be the same entity. Moving those users from your SQL Server to Mongo or to a third party like Auth0 or an Azure/AWS/etc. federated directory service doesn’t change the fact that every user has an ID, an email, a name, etc.

Code for today, but design for tomorrow.


It's well-known to be a topic that splits opinion, so I'm not surprised we disagree :) To me, "designing the data model", "wiring up relationships", etc doesn't require an ORM. On the other hand, I do agree it's good to have some tooling around it and that's something many more bare-bones frameworks (ORM or not) are lacking.

I don't hear people talk about "coding for the web, but design so that you can easily switch to deploy as a Windows desktop app." Or "write it in Python, but in such a way that we can easily swap to OCaml." It seems to me databases are uniquely treated this way, as some kind of disposable, simple piece of side equipment. Again, modular code will always be easier to migrate, but I prefer to take full advantage of db capabilities, as it results in much less code and frees up time and mental space to focus on a good conceptual model and physical schema, among other things.

I've never used EF, so I might not see what you are seeing.


> It seems to me databases are uniquely treated this way, as some kind of disposable, simple piece of side equipment

This is exactly right - lots of people are still cargo-culting rules of thumb that no longer make any sense.

This was an artifact of the last generation's commercial DB market. Open source DBs weren't "there" yet; a combination of real limitations and risk-conservatism kept companies shoveling huge amounts of money at vendors for features and stability now provided by `apt-get install postgresql-server`.

If you just lit seven figures on fire for a database license, you're not hungry to do it again, so you wanted all your software to be compatible with whichever vendor you just locked yourself in to. And certain DB vendors are very well known for brass-knuckle negotiation; if you could credibly threaten to migrate to $competition instead of upgrading, it was one of the few actually useful negotiating levers available.

Today, open source DBs are better than the commercial ones in many situations, certainly not worse in general use, and the costs of running a bunch of different ones are far lower. Not to mention, the best way to win a software audit is to run zero instances of something.


Very useful historical perspective, thanks! Confirms what I had pieced together: that DBs used to be a big liability for organizations, with a special clan (DBAs) of people gatekeeping and introducing patterns that programmers found infuriating. Hence the hatred towards stored procedures, layered schemas, and databases in general. It's probably important to keep stressing, as you do, how different things are now. It's only been a few years that Postgres has had row level security, for example.


DBAs still have their place. In my shop, we have more DBAs than infrastructure people.

When you have a small team working on a given tool that only really needs to manage its own data, it really doesn't matter. But at some point, you do need expert gatekeepers to tell engineers when they're Doing It Wrong, when there are many heterogeneous clients accessing large datastores for different purposes, complex audit requirements, etc.


Yes specialization is often useful. But the divide between developers and DBAs seems to have been similar to the dev/ops divide. Probably still is in many places. There is always a need for seniors or specialists to guide work, I'm not against that. But something like DevOps for RDBMS is needed. DevDat?


The GP has no sense of cost: designing around being DB-agnostic is costly, and those who really need that flexibility are in the 1%.


Haven't heard specifically of the terminology "three schema approach", but the idea fits with other notions I've heard about: https://www.martinfowler.com/bliki/BoundedContext.html and Clean Architecture: https://blog.cleancoder.com/uncle-bob/2012/08/13/the-clean-a...


The number of teams who design a data model & ORM layer "just to make it easy to move later on": Lots

The number of teams who eventually move to a different data store? Almost zero.

Getting "locked in" to a database is a non-issue. In fact you should get locked into a database system, provided you picked a good one to start with. Most teams never even scratch the surface of what a powerful DB like Postgres can do for them and it breaks my heart every time.


> which positions you better for switching to a completely different storage system in the future if

YAGNI


DatabaseAnswers[1] has been a valuable resource of learning for me. [1] http://www.databaseanswers.org/data_models/index.htm


Database schemas should be well organized and well managed.


> When I read about database schemas with thousands of tables I wonder what kind of problem could possibly require so many tables.

Businesses that trade in thousands of products, employ thousands of people in dozens of countries, with hundreds of different taxation, human relations, timekeeping, payroll, health insurance and retirement savings laws, dozens of sites, inventories in hundreds of locations managed by dozens of contractors, moved by one of potentially thousands of logistics firms with at least a dozen modes of shipment, reporting standards for multiple countries, reporting standards for stock exchanges, internal management reports, bank accounts in multiple countries, in multiple currencies, with a mix of transfers via a variety of means, hundreds of cash alternatives with varying rules about whether they are cash alternatives...

Modern large businesses are very, very complex.


Or the systems aren't multi-tenant, so they end up cloning the schema for each customer.


Sounds like SAP :)


I’ve lived 20 years as an amateur SQL database designer and last year I designed my first production nosql schema on mongo. Nosql is a different world... collections, embedded documents. I’m not sure it’s 100% “correct” but it’s working great for me so far.

My project is a license server for my electron app. The tech stack is a joy: mongo/mongoose, express, graphql, JWT, all with a React web front end. Payments by Stripe. The actual licenses are the signed JWT vended by the graphql API (not to be confused with the totally separate JWT for auth).

The main goal is to sell software so I license by machine fingerprint (node module on electron).

It’s been running for over 6 months without issue. I’m just beginning to start architecting an update where I allow a subscription model similar to Jetbrains Perpetual Fallback license, but slightly different in favor of the user. I’ve taken a lot of notes from comments at https://news.ycombinator.com/item?id=17007036

Here’s what I’m thinking so far:

A) Non-Expiring License at $97.99. Unlimited bug fixes for that one version. or B) Expiring License at $9.79/month, and you always have access to the latest version. After 12 consecutive payments you are granted a Non-Expiring License for the most recent version at the time of each payment.

Now, to model this...


Read up on third normal form. It's all you need for 99% of OLTP databases (the backend to a store). And read up on snowflake (the design, not the company) and star schemas for data warehouses used for analytics.


Beware of crossing context boundaries when applying 3NF to commerce records.

E.g. The description of an item in a sales invoice is the description at the time the contract of sale is made, and must be immortalised as such as a copy of that description.
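As a rough sketch of what that looks like (illustrative names), the line item carries its own copy of the description and price at the time of sale, with the product reference kept only for traceability:

    CREATE TABLE product (
      id          integer PRIMARY KEY,
      description text NOT NULL,
      unit_price  numeric(12,2) NOT NULL
    );

    CREATE TABLE invoice (
      id        integer PRIMARY KEY,
      issued_at date NOT NULL
    );

    CREATE TABLE invoice_item (
      invoice_id  integer NOT NULL REFERENCES invoice(id),
      line_no     integer NOT NULL,
      product_id  integer REFERENCES product(id),  -- traceability only
      description text NOT NULL,                   -- immortalised copy at time of sale
      unit_price  numeric(12,2) NOT NULL,          -- price at time of sale
      quantity    integer NOT NULL,
      PRIMARY KEY (invoice_id, line_no)
    );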


Another commenter mentioned it already, but it’s worth repeating. You can solve this problem by introducing temporal concepts.

We rely strongly on this at my place of work, it really works! Essentially, you just need two tstzrange columns, representing: (1) when the row was a “valid” representation of the key, (2) when the row could have been used to conduct other “transactions”.

With a valid period and a transaction period, you have a history of the values of an object, and the ability to make non-destructive updates to that object.

It’s an essential component of any audit-worthy system, because it empowers you to trivially answer the question “what did our database think John’s 2018/01/01 address was on the day that we mailed him a check on 2017/12/15”?


I don't understand the distinction between "valid" and "could have been used for other transactions". Can you elaborate and/or give an example? Is one of them always a subset of the other?


Happy to! It's not impossible for one instance to be a subset of the other. They can also be mutually exclusive periods of time. Let's do an example to explore

Let's say that the present concept of John's address is "123 Apple St"; this was inserted into the database at 1/1/2020.

We don't have credible evidence for what John's address was prior to 1/1/2020.

Therefore, the row looks like this:

> (address:"123 Apple St", valid_period:[1/1/2020,+inf), transaction_period:[1/1/2020,+inf))

This is a case where transaction period and valid period are equal. If John sent us this information by mail, and he signed it 12/25/2019, we have credible evidence that this was John's address, at least effective 12/25/2019.

Therefore, his row would look like this: (address:"123 Apple St", valid_period:[12/25/2019,+inf), transaction_period:[1/1/2020,+inf)).

Now, the transaction period is still 1/1/2020, because this information found its way into the database on 1/1/2020.

Now, let's continue with the second scenario. Let's say we get a second letter from John, processed on 6/1/2020, saying that he moved to "456 Orange St" on 5/1/2020. There are now two rows in the database, as below:

> (address:"123 Apple St", vp:[12/25/2019,5/1/2020), tp:[1/1/2020,+inf))

> (address:"456 Orange St", vp:[5/1/2020,+inf), tp:[6/1/2020,+inf))

Then, we receive a FINAL letter from John, processed on 7/1/2020, revealing that his previous letter contained a typo! it was "789 Orange St", not "456"! Darn. Our table now contains three rows:

> (address:"123 Apple St", vp:[12/25/2019,5/1/2020), tp:[1/1/2020,+inf))

> (address:"456 Orange St", vp:[5/1/2020,+inf), tp:[6/1/2020,7/1/2020))

> (address:"456 Orange St", vp:[5/1/2020,+inf), tp:[7/1/2020,+inf))

Let me know if you have any questions about this example!


Ok, so the validity period tries to describe the real world, while the tp is more about the state of the database's knowledge, or rather belief. I think I understand how each one starts when it does, but not when you update the endings.

- Does the validity period of the erroneous entry ever get closed?

- Or the transaction period of the one that got superseded?

- Do transaction periods close for reasons other than finding out something was wrong, or can I think of them as the period that the rest of the row (including validity) is/was believed?

Thanks for the detailed explanation. This is really interesting.


No problem! It’s fun! And a powerful conceptual tool. In my experience, it can be used to solve many thorny problems, and I never learned about it in school.

I had a typo in the last row, it should have been

> (“789 Orange St”, vp:(5/1/2020, +inf), tp:(7/1/2020, +inf))

my apologies!

- The valid period of the middle record is never closed, because closing it would misrepresent the database's perception of the address at that point in "transaction time".

- indeed, the superseded transaction period is closed.

- formally, transaction periods close (and new rows are created) whenever a column’s value changes. Think of it like an “updated at” time stamp, except the meaning is more like “canonical during”. The second half of your question is totally correct, the TP is the period during which the values representing the specified slice of valid time are/were believed.

As the third bullet point implies, one characteristic of this schema is that you have “non-destructive” updates: state is never lost, it’s just put into transaction history. This makes it possible to “roll back” to an earlier, known-good state: simply specify a point in transaction time.

There’s a whole additional rabbit hole to dive down: how to make this stuff fast and intuitive. People have mostly solved the fast part, but we’re still working on the intuition.


I meant to ask about the transaction period of the first row (my bad, "superseded" is ambiguous), but I think I get that one now: the transaction period is still open because we still believe (and probably will forever) that that's the correct validity period for that address. Is that right? And the validity period for the middle one never closes because there's never a time in that row's transaction period where we know the end of the validity period.

The intuition I'm building is that the validity period is metadata and the transaction period is metametadata. I think the similarity between them is a little deceptive, since the validity period is in some sense just another piece of data sort of controlled by (or interpreted in light of) the transaction period, which is itself a way to simulate versioning the whole database; I think of it like the internal representation for a database wrapper that lets you time travel in a database with a history-oblivious schema. In that vision, validity periods are actually part of the history-oblivious schema. You could totally have "transaction periods" for stuff with no concept of validity period, e.g. something timeless where our understanding evolves over time. If you're trying to teach people, I'd suggest introducing the two concepts independently, then showing the interplay with an example like the one you gave here.

I don't know if weird nested contexts can ever be really "intuitive". :D


For total ordering are you using an instant (with say seconds granularity) or an additional monotone id to distinguish between changes for the same day?


I imagine you also save changes to the address as separate versions so you can query them in the future?


Yes. In practice, the typical "UPDATE" looks like this:

1. create a "hole" in the current transaction space's valid space (through expirations of records in transaction time, valid time).

2. insert the new version at (transaction_period=(now, inf), valid_period=(effective_date, inf)).

It sounds complicated, and it sometimes is, but in practice most of the hard work (specifically around transaction periods) is performed for you by triggers. Usually you only have to think about valid periods, as a developer, which is easier to wrap your mind around.

The unfortunate thing is that there aren't a lot of very straightforward open source implementations of bitemporal triggers. The ones that exist are mostly designed to be theoretically sound and feature-complete, rather than usable.

At my company, we have our own implementation that has slowly grown over the years. It's about 500 lines of SQL triggers, and maybe 2k lines of library code in Python/Go to make the ergonomics a little bit easier on the developer.
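For illustration, a hand-rolled version of that "hole then insert" update without triggers, against the address_history sketch from earlier in the thread (dates follow the Orange St example above; this is an assumed shape, not the poster's actual code):

    BEGIN;

    -- 1. close the transaction period of the row we currently believe
    UPDATE address_history
       SET transaction_period = tstzrange(lower(transaction_period), now())
     WHERE person_id = 1
       AND upper_inf(transaction_period)
       AND valid_period @> timestamptz '2020-05-01';

    -- 2. insert the corrected version; its transaction period starts now
    INSERT INTO address_history (person_id, address, valid_period, transaction_period)
    VALUES (1, '789 Orange St', tstzrange(timestamptz '2020-05-01', NULL), tstzrange(now(), NULL));

    COMMIT;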


You made a great example, and I think it's the most important thing to remember when designing schema. I've seen arguments for/against 1NF, 3NF, various db features, etc, but really, the schema should model the real world relationships.

If your InvoiceItem has no description, and only has a FK to Product to get the description, you're going to have a bad time. Once you build a system around a wrong relationship like that, it's very hard to go back and fix the inevitable issue that updating a product changes old sales invoice records.


Never done something regarding invoices and related stuff, so just an academic question: Why not just version the product description and FK to appropriate version?


That's one solution. You could copy the description to the invoice item. That's another. There are a lot more solutions, but the point is none of them will be trivial to implement in a large existing code base. It's really important to get the data model right as much as possible up front. This is a bit counter to all of the agile/scrum rage of iterate and refactor. That works for code, but schemas and data migrations do not lend themselves to constant refactoring.


versioning solves this problem.

in practice, you have to support retroactive changes, so you have to have "middle" values for versions (i.e. 1.1 between 1 and 2).

People are usually interested in the timestamps associated with those versions (for which period of time is this version valid, when was this version created), so in practice it's easier to just keep track of the timestamps, and leave the versions implicit.


> E.g. The description of an item in a sales invoice is the description at the time the contract of sale is made, and must be immortalised as such as a copy of that description.

+1. I maintain a couple of systems that younger-me wrote 15+ years ago. If younger-me had understood this, I could have avoided much re-engineering :)


As such, it helps to also study 4NF and 5/6NF (which are temporal normalisations). This allows us to refer a sale to an earlier 'version' of the product.


> I wonder what kind of problem could possibly require so many tables.

There are three main factors behind this, none of which are bad schema design:

1. Table partitioning/sharding. RDBMS storage engines don't tend to scale for common hybrid access patterns (e.g. contended inserts+reads on tables with many indices and uniqueness constraints) past a billion rows or so. You can scale them yourself, though, by splitting a table's dataset into multiple tables (all with the same schema) and getting the RDBMS to direct reads/writes appropriately.

2. ETL ingest tables for separate datasources. If you have e.g. 100 different scrapers for various APIs used in your data warehouse, all writing into your DB, then even if they all are writing data with the same schema, it's operationally very challenging to have them write to the same physical table. Much simpler to have them write into their own staging tables, which can have their own exclusive-access locks, get their own deduping, get TRUNCATEd separately between batch loads, etc. They can copy into a shared table, but this step is often unnecessary and skipped.

3. Matviews built for extremely expensive queries that exist for running specific reports against. These are effectively materializations of subqueries shared by common customer queries. As such, there might be hundreds or thousands of these.

In the end, none of these are really tables ("relations", in the relational-algebra sense); they're all operational implementation details. Partitions are literally "parts" of larger tables; ETL ingest staging tables are distinct "inboxes" to a shared table; and reporting matviews are "caches" of queries. All of these things could be separate first-class object types, hidden inside the logic of the DBMS, not presented as "tables" per se, despite taking up storage space like tables. But—in most-any RDBMS I know of—they're not. So you see thousands of tables.
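As a concrete illustration of the first factor, PostgreSQL's declarative partitioning (10+) makes every partition look like its own table (names here are made up):

    CREATE TABLE measurements (
      created_at timestamptz NOT NULL,
      payload    jsonb
    ) PARTITION BY RANGE (created_at);

    -- one of these per month quickly adds up to a lot of "tables"
    CREATE TABLE measurements_2020_01 PARTITION OF measurements
      FOR VALUES FROM ('2020-01-01') TO ('2020-02-01');

    CREATE TABLE measurements_2020_02 PARTITION OF measurements
      FOR VALUES FROM ('2020-02-01') TO ('2020-03-01');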


I think in the OP's case, they are staring at a dizzying number of tables for Drupal and wondering if the schema is "good."

It's easy to create a clean schema for a simple problem.

Drupal (and many other mature CMSes) are the result of a decade or more accumulation of features. Much like perl, "it's worse ... because people wanted it worse"[1].

1: https://groups.google.com/forum/m/#!msg/comp.lang.perl.misc/...


Ah, I didn't realize people were taking the linked schema as an example of something with a lot of tables. To me, the linked schema is by no means a bad design—nor a design with a particularly large number of tables!

The single problem this schema has, is nothing to do with the schema itself, but rather the way it's visualized (and possibly the way it's manipulated at the RDBMS level.) That problem is that these tables are not namespaced.

Clearly, there are multiple components/microservices here. There is one set of tables for a AAA system with generic object ACLs; another set for a generic object annotations system; another set for a half-baked JIRA-like "database in the database" to handle custom CMS object-types; another set for an object store; another for a custom reverse-indexing search-engine implementation; etc.

Each of these components has a sensible number of tables, and is in a practically-useful normalization form. You just can't see where each component starts and ends in this diagram, so it looks like one massive web of tables.

If you draw in the component boundaries, then this schema is quite elegant, both in concept and in practice. When working with the code of any given component, you'll only be dealing with a mental model consisting of the tables related to that component, and maybe with the foreign-key references from the "generic" components that can make assertions about anything (like the tags/annotations, or the ACLs)—but you can usually forget about those and just CASCADE updates/deletes down to them.

There's no feature-itis here. "Drupal" is just the name of a system containing seven or eight distinct services. Sort of like "Kubernetes" is.

The only "problem", if you want to call it that, is that because Drupal is one codebase and one process, it doesn't strictly need to take advantage of the isolation features RDBMSes provide to allow distinct applications/microservices to not get in one-another's way, e.g. SCHEMA objects. Unlike something built as a bunch of standalone processes, there's no force pushing Drupal to componentize (and therefore, naturally namespace) these tables. They'd have to make an explicit choice to do so.


Kinda tough to give a good answer without more context, IMO. What I mean is that a good e-commerce schema that serves a single small store and runs off a single database server would look quite different than a multi-tenant or distributed data store for an e-commerce site at scale.

The one you linked is a pretty typical relational model and isn't bad, but it has trade-offs that I'd personally not make; however, that doesn't make it bad.

In the end context, scale and usage all determine a good schema design. Sometimes what would be a good relational design on paper would be tragically horrid in practice once you get beyond a small dataset.


Yes, but where can we all see the schemas that have worked well and the ones which haven't? I mean, all that information isn't accessible to someone who wants to know the state of the art. In some ways the software industry has been really bad at documenting and sharing knowledge; it is really hard to get hold of documents of large software projects, and informed commentary on them from the industry, for educational purposes. I wish someone would do something about it.


Very fair and very true, and I am 100% in agreement with your points, I do not know of a single resource we can go to as an educational toolset that is based on real world products/experience. The problem is the industry as a whole is locked into these NDA type agreements (along with proprietary information) which prevent a lot of the complete schema sharing etc which would help people learn the best.

That said, one of the best things you can do is look at open source projects and research their pain points and highlights; that will often point you to the trade-offs they made in modeling and in code. For e-commerce, things like Magento and other open source solutions can be really informative, and even WooCommerce and how it shoehorns itself into WordPress is instructive as to the trade-offs between highly normalized, lightly normalized, and key/value designs.


There are three volumes of "The Data Model Resource Book" by Len Silverston, and despite being about 20 years old they remain industry best practices. All sorts of data model design patterns - insurance, ecommerce, data warehouses, party, finance, etc. David Hay's "Data Model Patterns" is also excellent, although less concrete.


Take a look at Magento (open source e-commerce). I'll leave it to you to decide whether it's reducible to notions like "good" or "bad" but it's definitely fascinating in a WTF sort of way.

https://i.stack.imgur.com/wnwrJ.jpg


LOL, appreciate the link :D


3NF or more is good. Else it's poor. The number of tables does not matter.


There are good reasons to denormalise, but as a rule of thumb... yeah, this. I don’t think you can go that far wrong with schemas as long as you have an idea of your entities and their cardinalities. It’s much easier than designing, say, the associated Java classes, because there are clear rules about how to do it and it’s just obvious when you’ve done it wrong (your cardinality is all messed up).


Schemas that reflect reality, not the current specs. Flexibility is key. In my experience, adding tables and migrating existing data to them is hard; adding columns is easy. So spend extra time at the start on what tables there should be.

Spec: Product has a supplier [tables:product, supplier]

Reality: Product can be bought from multiple suppliers [tables: product, supplier, product_supplier]
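
For illustration, a minimal sketch of the "reality" version in Postgres-style SQL (names mirror the tables above and are otherwise invented):

    CREATE TABLE product (
      id   SERIAL PRIMARY KEY,
      name TEXT NOT NULL
    );

    CREATE TABLE supplier (
      id   SERIAL PRIMARY KEY,
      name TEXT NOT NULL
    );

    -- The junction table lets a product be bought from many suppliers (and a
    -- supplier sell many products) without a painful migration later.
    CREATE TABLE product_supplier (
      product_id  INT NOT NULL REFERENCES product(id),
      supplier_id INT NOT NULL REFERENCES supplier(id),
      PRIMARY KEY (product_id, supplier_id)
    );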


100%

The importance of having db schemas and other software entities reflect real-world things is not appreciated enough. It looks unimportant at first, but soon all intuition becomes useless and the system tends to not have "joints" in the right places, ie it doesn't have flexibility in the same places that reality does.


Database Answers [1] has been around for years. It doesn't exactly have complicated schemas but there are enough examples to showcase various design patterns.

[1] http://www.databaseanswers.org/data_models/


I have been working with event sourcing for the past few years, and a design I have implemented in the repositories (db) lately is to have one table for events and one table for snapshots (i.e. the objects in their current state, in serialized form). Then, depending on the needs of the application (i.e. what queries will be run), I create tables that serve as pure indices by which I can look up the aggregates (objects) I need. This gives me incredible flexibility and I do not need to bother with a complex schema at all. I use event reactors within the transaction context (imagine a pre-save trigger per object) to fill these tables with data (or remove data), and from now on I think this is the way to go for me for anything.

Having your SQL schema match your objects/entities is very restricting and not flexible for future development. With this approach I have the full data available (snapshots), so I don't need to hydrate each aggregate from the event stream, I can filter the aggregates as I need, and I also have a highly optimized schema for any query I desire. When something changes in the future, I can simply play through the entire event stream and fill in new indices or whatever is needed. Machines are fast these days, so storing the entire object as a snapshot in serialized form costs nothing, and it beats having to load tens of fields/columns and parse them into objects manually.
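
Roughly, such a layout might look like this (Postgres-style SQL; table and column names are invented for illustration, not the actual schema):

    CREATE TABLE events (
      id           BIGSERIAL   PRIMARY KEY,
      aggregate_id UUID        NOT NULL,
      payload      BYTEA       NOT NULL,  -- the serialized event
      occurred_at  TIMESTAMPTZ NOT NULL DEFAULT now()
    );

    CREATE TABLE snapshots (
      aggregate_id UUID   PRIMARY KEY,
      state        BYTEA  NOT NULL,       -- the full object in serialized form
      version      BIGINT NOT NULL        -- last event applied to this snapshot
    );

    -- A "pure index" table, filled by an event reactor in the same transaction,
    -- shaped for one specific query (here: looking up orders by customer).
    CREATE TABLE idx_orders_by_customer (
      customer_id UUID NOT NULL,
      order_id    UUID NOT NULL,
      PRIMARY KEY (customer_id, order_id)
    );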


Nice pattern indeed.

This article explores when it is not appropriate to apply it (for example, when eventual consistency would be a big problem): https://medium.com/@hugo.oliveira.rocha/what-they-dont-tell-...

I guess at some level of load (for scalability reasons) one might need to decouple writing into the append-only event log from updating the tables for fast reads, so the one-transaction approach won't work, hence the eventual consistency between written data and read data.

Schema maintenance is also a non-trivial task, as described in the article: keeping backward compatibility for various event versions, upcasting, lazy upcasting. Hitting a good granularity level for events is important too: too big or too small are both a problem.

For the operational team without intimate app knowledge it is also harder to do ad-hoc work.


How do you handle event schema changes? I mean, an event has some data attached to it, and the schema of this data may need to change over time. How do you replay older events that do not match the current schema? Do you keep all versions of the event reactors to be able to replay old events?


The stored events are called events, but they are actually envelopes. The true change (the "event"), i.e. the payload of the envelope, is stored as a serialized field. The envelope holds metadata like dates, domain, aggregate type and id, event name, correlation, causation, user/account, event schema version (this is what you are asking about), and so on. I am using protocol buffers, so they are backwards compatible and the event schema can evolve over time. But you can use any format, as long as you keep the schema version within the envelope and only add new fields to the objects.

Then, when you are parsing the events (hydrating or replaying), you just check the schema version in the envelope and handle the changes/payload accordingly.

It's actually very trivial once you put it all together.
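
Such an envelope might look roughly like this as a table (Postgres-style SQL; column names invented for illustration):

    CREATE TABLE event_envelope (
      id             BIGSERIAL   PRIMARY KEY,
      aggregate_type TEXT        NOT NULL,
      aggregate_id   UUID        NOT NULL,
      event_name     TEXT        NOT NULL,
      schema_version INT         NOT NULL,  -- replay code picks the right decoder by this
      payload        BYTEA       NOT NULL,  -- serialized protobuf (or JSON, etc.)
      correlation_id UUID,
      causation_id   UUID,
      occurred_at    TIMESTAMPTZ NOT NULL DEFAULT now()
    );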


It really all depends on the requirements.

I used to think WordPress had the worst schema in the world. After actually using it and writing plugins for it, I've come to the conclusion that it is genius. Their schema design really makes it very easy for others to "extend" the structure without having to actually alter the schema.

To elaborate on what I mean by extend: WordPress's schema is basically a posts table that has the very minimal columns required for creating a post (like the title, date, status and a couple of others), and a post_meta table that references the posts table and is basically used like a huge key/value store. Really, all it is is a post_id column (which references back to the posts table), a key column and a value column. You can add whatever you want to a post by adding rows to post_meta.
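
In table form, a simplified sketch of those two tables (columns trimmed; see the codex link below for the real definitions):

    -- wp_posts, trimmed to the columns mentioned above
    CREATE TABLE wp_posts (
      ID          BIGINT PRIMARY KEY,
      post_title  TEXT,
      post_date   DATETIME,
      post_status VARCHAR(20)
    );

    -- wp_postmeta: effectively a huge key/value store hanging off posts
    CREATE TABLE wp_postmeta (
      meta_id    BIGINT PRIMARY KEY,
      post_id    BIGINT,          -- logically references wp_posts.ID
      meta_key   VARCHAR(255),
      meta_value LONGTEXT
    );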

This design is copied for all other areas as well, such as users, comments and what have you.

https://codex.wordpress.org/Database_Description

Now obviously this kind of design isn't going to work for something like a financial institution, where you most likely want a ton of referential integrity built into the schema to prevent accidental data loss and to validate data input, but for a CMS it works very well.

Again... it all depends on what the system requirements are.


After searching for years for what that design was called, I finally read somewhere that it’s the “property bag pattern”. Seemed fitting.


I've referred to it, and been understood, as entity-attribute-value, or EAV.


This is a database that satisfies at least the first three normal forms, and that might be why it felt comfortable. I admit that I have seen some dirty usage of this from a few WordPress plugin authors on a consultation, and I have since then not thought as highly of it as a system that enforces good practices.


I've long recommended this book:

"The Data Model Resource Book": https://amzn.to/2tXNiuF

You can look at an implementation of a number of the ideas here: https://ofbiz.apache.org/

It's kind of complicated - probably overly so for some things - but there are a lot of ideas to think about and maybe utilize at least in part.


Take a look at http://www.softwaregems.com/Documents/Student%20Resolutions/.

I will refrain from commenting on the quality of Drupal’s schema, but that diagram just shows a bunch of tables and how those tables are connected by foreign keys. What do those connections mean? What other constraints are there in the data? Are they represented in the diagram?

A good database design conveys a lot more semantics. There is currently one ISO standard (ISO 31320-2, key-based style modeling) for database design. Adopting that standard does not automagically guarantee good designs, but, if used correctly, it helps a lot (it doesn’t help if you don’t have a good grasp of the Relational model, so I would recommend that you get familiar with that first and foremost).

Most database schemas you’ll find around are riddled with useless ID attributes that are essentially record identifiers (they are not even “surrogates” as many people call them: for the definition of a “surrogate” read Codd’s 1979 paper) and, as a consequence, they are navigational in nature (they resemble CODASYL structures more than Relational models): to connect records in table A with records in table D, you must join with (navigate through) B and C, while in a well-designed Relational schema you might have joined A and D directly. Do you want a rule of thumb to recognize poorly designed databases? Check for ID attributes in every table (there are many other ways in which database design can suck, though).

How do you recognize good database diagrams? They can be easily translated into natural language (predicates and facts), and two (technically competent) different people would translate them in the same way, i.e., the model is unambiguous. Can you say so for Drupal’s schema?


I did a quick Google search and ISO 31320-2 seems to be available only by paying ~170 CHF. Is there any text which explains how it is used? Even if I wanted to pay the aforementioned price, I am afraid that I would get just a standard reference text with little or nothing in terms of actually teaching how it should be used and what the benefits are.


Your link is just a directory?


Disclaimer: theoretical opinion.

I think the primary problem of giving examples here is similar to teaching software engineering, which needs complex projects solving complex problems - too big for a semester project.

A good schema depends on the problem it's solving.

A secondary problem is similar to code that has sacrificed clarity for performance. The tweaks made for performance are not intrinsic to the logical problem, but are an additional constraint.

For performance on common queries, the schema can be flattened or "denormalized". The ability to do so was one of the original motivations for Codd's relational algebra.


> When I read about database schemas with thousands of tables I wonder what kind of problem could possibly require so many tables.

They're probably mostly small (in terms of columns), many of them done just to have foreign key constraints, separate indexes, and arguably easier querying (joining) on just the data you need.

But I think it's a particular style, rather than a 'problem' that 'requires' so many.

(IANA database expert though, just my tuppence.)


When you think about large enterprise software systems with screen after screen full of data entry forms, with drop down lists everywhere, every single one of those drop down selectors is likely a database table. Every schema like that will have a mix of large tables (customers, products, etc) and lots of small tables.

I don’t think there’s a number above which you should not go, it’s really about how the data is organized and how easy it is to get data in and out of. I agree with another comment here about looking at anti-patterns first. There are some common ones that would be red flags.

As an example of what not to do: I once worked with a DBA who insisted that tables shouldn’t have more than 15 columns. So any table with more than 15 columns would have to be split into multiple tables, even if those tables would always have to be fetched together.


Not to be too self-promoty, but I spent a ton of time thinking about the schema for my site, and have gotten some compliments on the detail and documentation: https://aytwit.com/sql

I hope that helps spark some ideas for you.


I came to the same conclusion when I read that a basic SAP installation comes with over 20k tables. (see: https://retool.com/blog/erp-for-engineers/)


A SAP partner once told me (the company I was working at was considering using SAP) that the deployment would have ~180K tables - I don't know if they got the figure wrong, I have misremembered (I did check when they said it) or maybe that's for a "fully loaded" instance.

Edit: Not SAP, but certain other ERP products have an alarming habit of not using foreign keys - which makes working out the structure of the database quite interesting...


It depends on the version and how many modules you have installed. In my company it was ~ 50k tables for R/3, so the order of magnitude is right.


How do you perform joins without foreign keys? Do you just have a column that is effectively a foreign key but is not marked as such?


Sure. We can perform DB joins on any columns as long as the data types match and the values line up.
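
For example (hypothetical tables, nothing vendor-specific):

    -- orders.customer_id is never declared as a FOREIGN KEY,
    -- but the join works as long as the types and values line up
    SELECT o.id, c.name
    FROM   orders o
    JOIN   customers c ON c.id = o.customer_id;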


> What are good schema examples?

Anything that doesn't break the first normal form ("1NF") [1].

> And what are some poor examples?

Anything that breaks 1NF.

You break 1NF when...

* a column cannot be described by the key, ex. user.id describes user.name but not user.items_purchased.

* values in a column are repeated, ex. a user table that stores multiple addresses for the same user should be split into an addresses table (see the sketch below).

Treat tables as classes and you'll be fine. Just like a method may not belong in a class, a column may not belong in a table.

[1](https://en.wikipedia.org/wiki/First_normal_form)
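
A minimal sketch of the second point above, splitting repeated addresses into their own table (column names invented for illustration):

    CREATE TABLE users (
      id   SERIAL PRIMARY KEY,
      name TEXT NOT NULL
    );

    CREATE TABLE addresses (
      id      SERIAL PRIMARY KEY,
      user_id INT  NOT NULL REFERENCES users(id),
      line1   TEXT NOT NULL,
      city    TEXT NOT NULL
    );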


s/1NF/3NF/g ... the point here is that good data design comes from objects in 3NF. 1NF is a good start; 3NF would be better...


I always think there is some magical perfect schema that I am missing. And that if I just designed things perfectly in the beginning, I wouldn't need all this complex query logic, and reverse lookup tables, and fan-outs, and other sundry hacks.

But action always trumps thought. And it's better to just slurp up as much data as possible.

The simplest design for keys in a dict-type data store such as Redis hashes is to just auto-increment user ids. Resolve id=user. And then all data is just flat {var:id=value}. The key type is just a string delimited by ":". Gets you in the game fast. Mine structure and relationships later ;)


A database schema is the result of practicing the art of information modeling - a timeless art that will unlock the deepest mysteries of the universe if you master it. When database threads hit HN it's basically disgusting how poorly understood this is and the bad advice that flows here. Everyone here is wrong wrong wrong wrong wrong wrong wrong wrong wrong wrong wrong............. apologies I'm not myself today, posting anyways, good luck!


"what kind of problem could possibly require so many tables"

CRMs often have hundreds of tables and ERPs have thousands or tens of thousands or more.


WikiMedia's database schema is used by Wikipedia, one of the most popular websites there is, so it's arguably a very successful database schema design:

https://www.mediawiki.org/w/index.php?title=Manual:Database_...


Vertabelo has a blog which is almost entirely devoted to modelling schemas for various contrived scenarios: https://www.vertabelo.com/blog/ .. I take umbrage with some of the decisions made but broadly they are interesting and well thought out.


Northwind Traders


That made me laugh.

My 2 cents after doing this long enough to recognise it

- Aim for 3NF but not religiously. Still, if you need a flat table try a view.

- Any ternary relationship can be modeled as a pair of binary relations (you'll never regret keeping it simpler)

- You don't need EAV (Magento is a good example of why you shouldn't)

- On the other hand don't serialize data (looking at you WordPress)

- XML and JSON data types though are perfectly fine when you need to store an object

- Every table should have a primary key (preferably an integer)

- If you really want a string for your primary key, make it a candidate key (why? because someone will insist on changing it; see the sketch after this list)

- E/R diagrams are your friend

- So are Venn diagrams for visualizing a complex select
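
A minimal sketch of the primary-key/candidate-key point (names invented for illustration):

    CREATE TABLE country (
      id   SERIAL  PRIMARY KEY,      -- integer surrogate key that foreign keys point at
      code CHAR(2) NOT NULL UNIQUE,  -- the natural string key, kept as a candidate key,
      name TEXT    NOT NULL          -- so it can change without rewriting every FK
    );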


The Sakila schema: https://dev.mysql.com/doc/sakila/en/

Which I learned about from the jOOQ folks: https://www.jooq.org/sakila


The Kimball book series suggests some generalized schemas by industry, for the purpose of data warehousing. Here the academic schema rules tend to bend a bit - denormalization can become a useful and often necessary technique, for example.

Might be helpful for a rounded understanding of good schema design - it can depend on the context.


Too many tables in one schema tells me that there is no clear separation of responsibilities. I would first look at clear domain boundaries and separate them. A JSON column type is one thing I have found useful when non-relational data needs to be stored simply, without creating associative tables.


This is a bit off topic maybe, but I have heard SAP works with tens of thousands of tables. Is that correct?


A recent HN article answers this question, actually!

"A basic installation of SAP has 20,000 database tables, 3,000 of which are configuration tables."

https://news.ycombinator.com/item?id=22244750


Is the schema (even partially) available publicly?


I've always liked how isfdb, the Internet Speculative Fiction Database, laid theirs out. http://www.isfdb.org/wiki/index.php/Database_Schema


Dumb question, but does anyone have a recommendation for good software for generating schema diagrams like the ones in the Drupal link, but for Redshift?


MySQL workbench will generate a diagram for mysql/mariadb databases.


I always wondered why people don't share SQL schemas like we do with code.

Anyone know if there's a site like that, or even a marketplace for it?


This site is pretty solid from what I've dug into http://www.databaseanswers.org/data_models/index.htm


Take a look at StackOverflow’s schema. Freely available to study, used in a production site, and a great resource.


> good database schema designs?

Ones that ship.


I have worked on some shot database designs that managed to ship.


Ones that last, I'd say.


Where are the repositories of this info?

Just for record keeping, analysis, learning etc.


Asking this is like asking:

Is there an example of a good collection of variables/classes in Java?

For an RDBMS, the schema IS the "classes, variables and types" of other languages. It just has the massive advantage of being much more visible and easier to understand.


My naive approach these days is avoiding relational schemas, because I'm tired of migrations. Document-based NoSQL is hard to maintain too (dupes). The only solution to me is graph databases like Neo4j or GunDB (in-memory).


1-2-3NF is super important and superficially simple. But just to give you an example with 1NF where the devil is hidden in the details, let's have a closer look at ye good old "Address" entity. If we are to store a postal address for package delivery, the columns "Address Line 1-2-3" should suffice and enable quick entry in the UI and simple printing on letters and package labels. What's not to love?

But what if you are working in a municipality and need to be able to select all commercial buildings with more than five stories, or businesses situated in the basement of the building, for the yearly fire inspections? Then string-searching all those randomly entered address lines will quickly become a complete nightmare, whereas if the floor number were normalized and stored in its own column, the query for the fire inspector's report would be a piece of cake.
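
A minimal sketch of the normalized alternative (Postgres-style SQL, columns invented for illustration):

    CREATE TABLE address (
      id           SERIAL PRIMARY KEY,
      street       TEXT NOT NULL,
      house_number TEXT NOT NULL,
      floor        INT,              -- e.g. -1 for basement
      postal_code  TEXT NOT NULL,
      city         TEXT NOT NULL
    );

    -- The fire inspector's report becomes a trivial query instead of a string search:
    SELECT * FROM address WHERE floor >= 5 OR floor = -1;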

This is a good example of why it's so hard to do cookie-cutter, implementation-ready schema design templates. It's also a good example of why data modeling is important no matter what underlying tech stack the data model is going to be implemented on.

Also, I prefer modeling the app, business or process in Chen's ERD first as I think it is much better at capturing modeling details than UML and other ERD-variants.

Also, just as each object class in OOP should do only one thing, each entity should be saved in just one table. E.g. that "Employee" table in the first chapter of every beginner database book, with a "Manager" relation as a foreign key to itself, is an absolute catastrophe. The moment your CEO decides you are now a matrix organisation, everything breaks down data-model-wise. The Employees go into one table, the organisational structure into another; they are related by foreign keys, and it's not that different from good OO modeling, as people say. The tables containing organisational structure should probably also have columns with a from- and to-date and a relationship to a Department table, so different departments can be organised differently throughout their lifetime.
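
A minimal sketch of that separation (Postgres-style SQL, names invented for illustration):

    CREATE TABLE employee (
      id   SERIAL PRIMARY KEY,
      name TEXT NOT NULL
    );

    CREATE TABLE department (
      id   SERIAL PRIMARY KEY,
      name TEXT NOT NULL
    );

    -- Organisational structure lives in its own table instead of a self-referencing
    -- manager_id on employee, so matrix organisations and reorgs just add rows.
    CREATE TABLE org_assignment (
      employee_id   INT  NOT NULL REFERENCES employee(id),
      department_id INT  NOT NULL REFERENCES department(id),
      role          TEXT,           -- e.g. 'manager', 'member'
      valid_from    DATE NOT NULL,
      valid_to      DATE,           -- NULL = still current
      PRIMARY KEY (employee_id, department_id, valid_from)
    );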

Also, entities which have some sort of lifecycle should be split into different tables. So there should be a table for "Prospective Employees", "Current Employees", "Resigned Employees", "Former Employees", etc. An employee's day of resignation can now be NOT NULL and go into the right table. You can always UNION these three or four tables together into one big view or materialized table, and at the same time you avoid a massive amount of WHERE clauses, each needing its own indexes, picking out just the right employees from one big generic Employee table in every effing query.
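
A minimal sketch of the lifecycle split plus the combining view (Postgres-style SQL, names invented for illustration):

    CREATE TABLE current_employee (
      id       INT  PRIMARY KEY,
      name     TEXT NOT NULL,
      hired_on DATE NOT NULL
    );

    CREATE TABLE former_employee (
      id          INT  PRIMARY KEY,
      name        TEXT NOT NULL,
      hired_on    DATE NOT NULL,
      resigned_on DATE NOT NULL  -- can be NOT NULL here, unlike in one generic table
    );

    -- One view for the cases where you really do need everybody at once.
    CREATE VIEW all_employees AS
      SELECT id, name, hired_on, NULL::date AS resigned_on FROM current_employee
      UNION ALL
      SELECT id, name, hired_on, resigned_on FROM former_employee;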

Also, columns with endless NULL values are a "code smell" in a relational database. Whatever value is hiding in the few rows with values probably belongs to another entity and should have been stored in another table with the name of that "thing". E.g. the employee's day of resignation again.

Also, 99% of all business data models can be implemented in a relational database using just tables, views and queries created in standard SQL. You will rarely if ever need user-defined functions, generators, connectors, stored procedures, foreign code blobs and other exotic and vendor-specific extensions.

Also, I recommend reading everything by Joe Celko.



