Denormalization for Performance: Don't Blame the Relational Model (dbdebunk.com)
70 points by sgeneris on Oct 18, 2016 | 62 comments



From my experience, denormalization of a relational model is a special case of "caching". What you are really doing is caching data where it is most likely to be used at access time.

The overview is like this:

1) You need some operation to have low turnaround time / latency

2) So you maintain and update a cache, typically while writing to the data store.

3) Like all caches, you can either invalidate it before changing the data, slightly harming availability, or you can have it lag behind (eventual consistency).

So the article is actually wrong: you don't need to trade consistency for performance. You can increase read performance (lower latency) without losing consistency by having a cache (denormalization) and invalidating the relevant caches BEFORE writing data (which adds to write latency, but not necessarily hurts write throughput... typically, we don't care about write latency as much as read latency.)
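A minimal sketch of that pattern in SQL, assuming a hypothetical order_totals summary table acting as the "cache" for an orders base table; invalidating inside the same transaction keeps readers from ever seeing a stale total:

  BEGIN;
  -- invalidate the cached aggregate first ...
  DELETE FROM order_totals WHERE customer_id = 42;
  -- ... then write the base data
  INSERT INTO orders (customer_id, amount) VALUES (42, 99.50);
  COMMIT;
  -- a reader that misses order_totals falls back to SUM(amount)
  -- over orders and may repopulate the cached row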


Well you clearly have cache invalidation figured out. Any tips on naming things?


Similarly -- any thoughts about off-by-one?


Concurrency

Nice one. The only hard thing missing now is:


The nice thing about SQL queries is that they tell you what the dependencies are. It's still hard, but not intractable.


To a hammer ...


If you write often, another way is to use triggers. They can be extremely useful for this kind of scenario.


Agreed. Triggers are especially useful for maintaining counts. If you want to feel extra smug, there's a post from Instagram eng about how they only realized pretty late in the game that they should denormalize like counts.
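A rough sketch in PostgreSQL-flavoured SQL; the likes/posts tables and the like_count column are made-up names for illustration, not anyone's real schema:

  CREATE FUNCTION bump_like_count() RETURNS trigger AS $$
  BEGIN
    UPDATE posts SET like_count = like_count + 1 WHERE id = NEW.post_id;
    RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER likes_after_insert
    AFTER INSERT ON likes
    FOR EACH ROW EXECUTE PROCEDURE bump_like_count();
  -- a matching AFTER DELETE trigger would decrement the count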


The basic issue is that derived data should not be included in base relations -- it belongs in derived relations, such as views.
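For example (hypothetical names), an invoice total can live in a view over the base relation instead of being stored redundantly:

  CREATE VIEW invoice_totals AS
  SELECT invoice_id, SUM(quantity * unit_price) AS total
  FROM invoice_lines
  GROUP BY invoice_id;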


"Well implemented true RDBMSs..." exist only in the weird fantasy world of relational purists. In that perfectly normalized world you just wave your magic wand and all performance problems disappear.


I think most are pragmatic and admit you may have to denormalise for performance (e.g. I've read this in C.J. Date's books). I interpret their point as being that the need for denormalisation is not a flaw in the relational model, but a flaw in its implementation. The relational model does not specify an on-disk format.


So rare to actually have to denormalise for performance today, though. Query planners are miles ahead of where they were in the 80s and 90s. SSDs delivered an order-of-magnitude boost to performance. Rapidly expanding memory hugely increases how much of the working set you can hold in RAM.

No one seems to talk about this, but the "Moore's law has ended" meme does not apply to most database scenarios. Database servers are normally not CPU bound: they are bound by available memory and disk bandwidth, and those are both still increasing.


Anecdotally, I started at a company 6 years ago where hundreds of database servers all ran on spinning disks. It was common to have to trace down bad queries causing performance issues. Once the servers were upgraded with more RAM and SSDs, the need to micromanage performance issues disappeared.


The sheer fact that so many of the comments are about physical storage is an excellent validation of the article's claim that data professionals don't know and understand data fundamentals and the RDM.


I think today, more often than not, the fault is between the keyboard and the chair. There is a clear lack of domain knowledge. RDBMSs like PostgreSQL, Oracle, DB2, etc. can do much more than people expect them to.

Relational databases are relatively easy for anyone to start working with, but optimizing and tuning queries requires a fair amount of domain knowledge. You'd be surprised to learn how little denormalization big financial companies' databases have. They have people who are experts that do only databases.

For I/O-intensive work you may need to lay out transaction rows so that data written together is stored together, but that's not denormalization; it's optimizing for an I/O bottleneck.

Query planners can do most of the work, if the person crafting queries understands how they work. Carefully crafted queries and a correctly tuned memory layout for tables are usually the right way to fix performance problems.


Yes, I would be surprised.

Because, unlike most products, which only require training to work with, true RDBMSs require knowledge of data and relational fundamentals, and such knowledge is scarce today, because education has been replaced by sheer training.

How many database "experts" do you know today who have any background in logic and math, particularly those designing DBMSs?

I've watched them for almost 50 years and they've become worse and worse, not better.


They are a fantasy only because users do not grasp the advantages of the RDM and confuse them with SQL, thus robbing vendors of any incentive to produce true RDBMSs.

If users are so accepting of denormalization as a solution to performance and are unaware of all its costs, why should vendors bother to come up with true RDBMSs that obviate the need to denormalize?

As long as users do not understand the RDM and its benefits true RDBMSs will remain a fantasy. Ignorance has never been conducive to progress.


I can imagine an optimizer that, having analyzed and simplified the query plan, notices that certain attributes bubble up all the way to the final projection. Coupled with some statistics, it's entirely possible for a DBMS to maintain a precomputed partial result cache.


Responding to several comments. (The _underscores_ in the following quotes are mine for emphasis.)

>kpmah: I think most are pragmatic and admit you _may_ have to do denormalise for performance

>calpaterson: So _rare_ to actually have to denormalise for performance today though.

The article's author didn't describe it as "may"/"rare", which would have left some wiggle room for the cases of denormalization required in real-world implementations. Instead, he used absolute qualifiers such as "never", "no reason", "any":

- the "additional development costs" that Bolenok refers to -- but they would _never_ be justified:

- "consistency-performance tradeoff [...], there is _no_ reason to expect _any_."

The author writes his advice in abstract terms instead of discussing concrete "case studies", so we are left to speculate what mental model of the database world he holds in his mind when he's rigid with strict rules of normalization and relational purity. Based on the topics in his papers[1], I'm guessing his world consists of a single OLTP database. E.g., you develop a non-cloud restaurant reservation & POS system with a single-instance database. Yes, you don't need any denormalization hacks in that scenario.

But for other problems such as distributed databases, you can't do joins across 2 geographically separated data centers. (Well, you theoretically could do it but the slow nested-loop performance across the WAN would make it unusable for a real-world application.) Some duplication via data denormalization is required and no "sufficiently smart db engine"[2] can automatically optimize for it. An application architect has to manually make that design decision and live with the deliberate tradeoffs. (E.g., batch jobs now have to be run to periodically keep databases at different datacenters in sync.) There are many real-world scenarios that require denormalization which have nothing to do with a junior programmer's lack of SQL knowledge to join 20 tables to populate an "edit customer" data entry screen.
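To make the batch-sync idea concrete, a hedged sketch (all table names hypothetical, PostgreSQL-style upsert): instead of joining across the WAN at read time, a periodic job pulls the remote site's data into a local, deliberately duplicated copy:

  -- run nightly from the local datacenter
  INSERT INTO local_customer_summary (customer_id, region, lifetime_value)
  SELECT customer_id, region, lifetime_value
  FROM remote_customer_summary          -- e.g. a foreign/replicated table
  WHERE updated_at > now() - interval '1 day'
  ON CONFLICT (customer_id) DO UPDATE
    SET region = EXCLUDED.region,
        lifetime_value = EXCLUDED.lifetime_value;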

[1]http://www.dbdebunk.com/p/papers_3.html

[2] Riff on the theoretical "Sufficiently Smart Compiler" that solves all performance optimizations so there's never a need to write performance-specific language syntax or switch to a "faster" language... because, as we all know, "languages" are not fast or slow -- it's the "implementations" that are fast or slow. That Ruby is not as fast as C/C++ is an implementation detail (SSC) and not an issue of the language syntax.


  But for other problems such as distributed databases, you can't do joins across 2 geographically separated data centers
That's where replication comes in as your (arguably) most powerful weapon.

I have worked with databases for, literally, decades and have never seen a successful implementation of two-phase commits or joins on geographically distinct entities.


What does this have to do with the article?


The argument in the paper is that data professionals -- including yourself -- do not understand data fundamentals, for which reason they blame "poor performance" on the RDM. But the RDM and normalization are purely LOGICAL and, as the article states, performance is determined exclusively by the PHYSICAL implementation. So the article does imply that on rare occasion you may be forced to denormalize WITH EXISTING SQL IMPLEMENTATIONS, but this has nothing to do with the RDM and normalization and everything to do with SQL implementations, which are not relational, do not have proper support for physical independence, and are not versatile enough at the physical level.


>data professionals -- including yourself -- do not understand data fundamentals, for which reason they blame "poor performance" on the RDM.

I don't blame poor performance on the mathematical purity of logical relations between entities.

>the article does imply that on rare occasion you may be forced to denormalize

You didn't read the article carefully enough. You overlooked how the author seemed to allow for denormalization but then immediately dismissed the follow-up dev work as "never justified":

" -- and performance is still unsatisfactory. You denormalize and, as Bolenok recognizes, introduce redundancy. [..] - but they would never be justified:"


Given a fully normalized db and a query which has to join say ten tables - what physical implementation would optimise this query?


I find zero explanation of how to solve performance with a relational model.

As I understand the article, it seems to say...just because all the existing databases you have seen suck at performance when normalized doesn't mean normalization can't be fast.


There are several basic concepts you can apply to improve performance in the RDBMS and still avoid denormalization. For example:

* use as many constraints as possible (this helps the query optimizer)

* use indexes which bring better performance in your use case (e.g. bitmap join index or even index-organized tables)

* apply table/index partitioning

* use materialized views as a query result cache (see the sketch below)
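To illustrate the last point, a materialized view used as a query result cache (PostgreSQL-style syntax; the sales table is hypothetical):

  CREATE MATERIALIZED VIEW daily_sales AS
  SELECT sale_date, SUM(amount) AS total
  FROM sales
  GROUP BY sale_date;

  -- refresh after batch loads or on a schedule
  REFRESH MATERIALIZED VIEW daily_sales;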


Or you just avoid all that hassle and have some duplicates.

I have never seen a properly denormalized table. In practice you get a "historically grown" system way too often, and doing anything like that will break things.

The whole article seems quite academic. From my personal experience, a textbook-normalized database is slow beyond belief (I did exactly that once and we had to revert it).


Or better yet, since you don't care about your data anyways, just don't bother storing it. Infinitely scalable and always blazingly fast.

>I have never seen a properly denormalized table

Do you mean normalized? There's no such thing as "properly" denormalized, anything that is not normal is denormalized.

>The whole article seems quite academic. From my personal experience, a textbook-normalized database is slow beyond belief (I did exactly that once and we had to revert it).

I've seen lots of people say that, but then consistently found those same people don't actually know what the normalization rules are, and all they did was create a different denormalized database that happened to have poor performance for the queries they were using.


For an existing application I personally prefer changing the index type, partitioning the table or tuning DB parameters first. It's far less risky because you don't need to change a single query and it's transparent to the application. If you cannot get the desired performance by tuning the RDBMS, then you need to consider changing the way the tables are modelled. From my experience, normalizing one step further usually improves the performance, at least for OLTP use cases.


If you are looking to solve performance "with a relational model" then you do not understand physical independence and the relational model. This is exactly what the article explains you should not do.

You mean DBMSs, not databases. Yes, that's the argument, but it's precisely this kind of lack of understanding that prevents better RDBMSs.


Exactly. The idea that you have to sacrifice the right abstraction for the sake of performance is preposterous, and can only be explained by lack of imagination.

Just because most SQL DBMSes happen to implement relations, foreign keys, aggregates, etc. in a specific way, it doesn't automatically mean that the specifics of these implementations must be elevated to the status of laws of nature.


SQL DBMSs are not relational, so they don't really implement relations (they support bags and NULLs); not all their operations preserve closure; they have weak support for relational constraints and for physical and logical independence; I could go on and on.

The problem is that practitioners confuse RDBMSs with SQL DBMSs and are incapable of seeing what they're missing in terms of the practical benefits of the former.


NULL is part of Codd's articulation of the relational model, even if later theorists have proposed cleaner variants of the model without it.


Yeah, these things you mention have always annoyed me. In particular, every relation should have a primary key, possibly consisting of zero attributes. If the primary key consists of zero attributes, the relation must have at most one element, because the primary key (the empty tuple!) determines every other attribute.


Indeed, but it helps a great deal to give examples that show better ways, or if not working examples, even sketches of credible implementation ideas.


Very nice article, but the claims are far from what is practised in the industry nowadays (see all the companies working with tons of data and spinning up big-data solutions).

So, while it might be perfectly true it somehow misses some big point.


No, I think you missed its point.

I dare you to demonstrate that the "big data solutions" guarantee logical and semantic correctness the way true RDBMSs do. Without that, garbage in, garbage out. But because those "solutions" are so complex that nobody understands them, including those who designed them, their results are BELIEVED to be correct, which is not the same thing as being correct.


I have never worked at true "web scale", so my experience may or may not apply to such scenarios. I tend to build a more or less fully normalized structure, and introduce aggregated data outside of the core database. It is essentially introducing eventual consistency for performance-critical data. This data can be stored in a database, or in a cache layer in front of the database if it is accessed frequently enough to keep it off the disk all the time.


Good luck trying to show the number of likes on Twitter, Instagram, Facebook, Kardashian's Tinder...

Any query that needs to calculate some result over many rows benefits from storing pre-calculated values, with trade-offs in precision, correctness, and so on.

Might apply to your business or not.


What does it have to do with normalization?


As the saying goes... Normalize till it hurts..then denormalize till it works :-)


What do you know about the hurts of denormalization? Can you even specify what they are?


Updates, inserts, and deletes become more complicated and expensive when you denormalize.
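A small illustration with hypothetical tables: if a customer's city is copied into orders to speed up reads, every address change now has to touch both tables, and forgetting the second statement silently leaves the data inconsistent:

  UPDATE customers SET city = 'Berlin' WHERE id = 42;
  UPDATE orders SET customer_city = 'Berlin' WHERE customer_id = 42;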


One more thing: As these comments demonstrate, there is almost exclusive focus on performance, but nobody considers the drawbacks (cost) of denormalization. Practitioners are oblivious to them.


What systems that are usable today come closest to the Platonic ideal of RDBMSs?


Today most RDBMSs have good query planners, even recent versions of MySQL. In my experience you very often get improvements in performance by normalising (to BCNF). Most of the benefit comes from a) narrow tables reducing the physical size of the working set (you can hold more of it in memory), b) narrow tables handling writes quicker (more rows per page), and c) more opportunities for covering indices.


Isn't table width dictated by your data model though? And thus, in a purely normalized db schema you have no control over table width?


Normalised designs invariably have pretty narrow tables. In most BCNF schemas most tables will end up with 3 or 4 columns at most. Additionally in the situation where you have a lot of non-prime columns you always have the option of decomposing the table into multiple tables sharing the same primary key. Useful if some columns are large but rarely used.
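A sketch of that kind of vertical split (hypothetical names): large, rarely read columns move to a second table keyed by the same primary key:

  CREATE TABLE users (
    id    bigint PRIMARY KEY,
    name  text NOT NULL,
    email text NOT NULL
  );

  CREATE TABLE user_profiles (
    user_id bigint PRIMARY KEY REFERENCES users (id),
    bio     text,    -- large, rarely queried
    avatar  bytea
  );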


Normalization is not the same as full normalization.

Full normalization does not decompose into relations with the same key, because that would mean splitting one type of entity in two, which makes no logical sense. Rather, it splits non-5NF relations that represent facts of multiple types into 5NF relations that represent facts of a single type, and those have different keys.


What are you talking about? A table has as many columns as the thing it describes has attributes to be described. You can have a fully normalized database with 50 column tables in it, and there's nothing unusual or suspect about that.


No, relations describe facts, not things. Heath's theorem means you can decompose a table/relation that has n non-key attributes into n tables. This is something you have to do for BCNF.

Theoretically it's possible to imagine a normalised relation with 50 columns. In practice a table like that is unusual and suspect (frankly, even in a denormalised design).
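For reference, the standard statement of Heath's theorem: if a relation R(A, B, C) satisfies the functional dependency A -> B, then R equals the join of its projections R[A, B] and R[A, C], so the decomposition is lossless. Applying it once per non-key attribute yields the n-table decomposition mentioned above.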


There is NO relationship whatsoever between normalization and the number of attributes, which is determined by the number of properties that entities have in the real world, i.e., the conceptual model.


That is not in conflict with what I said


No. I just thought the clarification was important.


>No, relations describe facts, not things.

That is impressively pointless pedantry even for HN, well done.

>Theoretically it's possible to imagine a normalised relation with 50 columns

Yes, that would be what I said. "You're wrong because exactly what you said is correct!"


No, correctness is not pedantry, except to the ignorant. The database facts are axioms and the facts in query results are theorems -- logical inferences from the axioms. If you don't understand that, you have no clue what a relational database is and what an RDBMS does. But hey, you're in the majority.

Any linking of the number of columns to normalization is pure nonsense. The only accurate thing that can be said is that if you split a relation into multiple projections, by definition each will have fewer attributes than the original relation. But this is a triviality and it says nothing about the number of columns in the projected relations.


>No, correctness is not pedantry

Pedantry is though. And that's what that entirely pointless post was.

>Any linking of number of columns to normalization is pure nonsense.

That's what I said; that's the point. Don't jump into a thread you can't be bothered to read.


Pointless to anybody incapable of getting the point, of which there are too many.


Tables are not relations. Relations have attributes, not columns, that represent entity properties.

An R-table is a visual shorthand for a relation -- a way to visualize it on some physical medium (paper, screen) -- and the two should not be confused.


> the option of decomposing the table into multiple tables sharing the same primary key

sounds suspiciously like denormalization


First, you got it backwards: denormalization joins relations, normalization splits them;

Second, if they share the same key, then it's not normalization, as the split relations have distinct keys.


In most cases that ends up being true but that's not the definition of either of those terms.

Normalization is the process of organizing the attributes and relations to, among other things, eliminate redundancy. Two relations that store and share the same primary key are by definition redundant.

Splitting relations for the sake of skinnier tables creates redundancy (the primary key is stored twice) and insert/update anomalies, and can reasonably be thought of as a denormalizing behavior.




