I stopped worrying and learned to love denormalized tables (glean.io)
108 points by vndrewlee on May 13, 2023 | 98 comments



Well...

1) Normalisation at all costs is foolish - if the cost exceeds the value, then don't do it. That isn't complicated. Denormalised data sometimes points at design flaws, but even then all systems have design flaws and they don't automatically need to be fixed. Quality is expensive, like every other property (even doing things the cheap way is expensive, ironically - software is all about managing costs).

2) For any given user it is better to have denormalised data where the data model is perfectly aligned to their use case. For a system with multiple users it is better to have normalised data. And the corollary is that any data important enough to be recorded is probably valuable enough that it will eventually have multiple interested users, even if the person building the system swears that this time is different - so they should normalise their data. Brownie points to anyone who has reached enlightenment and understands that the you of 12 months hence is a different user with different needs of the data.


Regarding quality being expensive.

It's not only about cost to implement. There's also cost to change.

If your isolated module is bad, you can rewrite the code, keeping the API the same. The cost to change is not high. You might introduce new bugs and that's about it.

Changing database structure might be hard. Adding new checks might require manual fixes to already-bad data, or multiple code paths for old and new data. Some migrations might require putting the system offline. Often you can't just roll back your changes if things go wrong after a few days.

Changing an API with hundreds of customers... Good luck with that.

Changing the POSIX API at this point is probably just not possible.

Whenever something is hard to change, quality requirements are naturally higher.

For a data model, quality requirements should be high. The more time you spend, the more time will be saved later. As we say: we're not rich enough to buy cheap things. We're not rich enough to afford poor DB schemas.


As usual, context matters and decontextualized discussions often devolve into people shouting past each other.

Sometimes you can't afford to do it right, quick and dirty is the way. Sometimes you can't afford not to do it right. It all depends heavily on how costs and payoffs are distributed socially and temporally.

The real trick is to be doing what fits your situation at the moment, and knowing how your situation might change over time.


> For any given user it is better to have denormalised data where the data model is perfectly aligned to their use case. For a system with multiple users it is better to have normalized data.

I don't see why you are making the distinction by single user or multi user. If you are writing information into the database, it has to be written into a normalized schema. There is no alternative, multi user or not. To do otherwise would be to maintain multiple sources of truth and nothing in the database would make sense if two different sources of data go out of sync. Make illegal states unrepresentable.

Also, if you want to align your data model perfectly to a user, you might as well use a single-table database like DynamoDB [1], where everything is screaming fast but the downside is you can't modify your schema at all (because it's already set in stone to perfectly model the old use case).

[1] https://news.ycombinator.com/item?id=18824164


> if the cost exceeds the value, then don't do it. That isn't complicated.

Could you try to elaborate on what you mean by "cost" and "value"? The complicated part is not the part you said, but specifying those two very abstract terms.


In the case of this analyst, say every couple of days he is writing SELECT * FROM events JOIN projects. Everyone is asking him for information about events and projects.

He's going to save time and make fewer mistakes if he uses DBT and has a denormalised table that merges events and projects. However you choose to define costs and benefits, it'll turn out to be a good choice. As he discovered in the article. Good move. People keep asking him to do it, so there is probably value there. He is avoiding a cost, which is writing the same join over and over again.

Normalisation is there to make data accessible for multiple different users (analysts, application programmers, infrastructure teams, people who want to leverage the database for a new purpose, etc). It isn't good at servicing any specific need, but it is a basic and general data layout that lets people tailor the data to their needs quickly. When there is a specific user, they should always be asking if the normalised data layout is helpful and looking for opportunities to avoid writing the same JOIN in 20 different queries. As long as the source of truth is in normal form it is reasonable practice to denormalise for specific use cases.

DBT does this really well, I might add - it encourages normalising the source of truth and then having an analytics team denormalise the data to meet business needs. The ideas there are strong, flexible and encourage good practices. Analysts love big flat tables; they are easy to work with.
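
To make that concrete, here's a rough sketch of such a dbt model (table and column names are made up; in dbt a model is just a SELECT saved as a .sql file, and {{ ref() }} wires up the dependency on the normalised sources):

```
-- models/marts/events_with_projects.sql (hypothetical model name)
-- A denormalised "big flat table" for analysts, built on top of the
-- normalised source-of-truth tables.
select
    e.event_id,
    e.event_type,
    e.occurred_at,
    p.project_id,
    p.project_name,
    p.project_owner
from {{ ref('events') }} as e
join {{ ref('projects') }} as p
    on p.project_id = e.project_id
```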


> Normalisation is there to make data accessible for multiple different users

Normalization is there to avoid anomalies, which is another word for data corruption. If you have the same datum repeated multiple times in the base tables, any update can (and probably will, due to Murphy's law) lead to inconsistencies, which means the database is not a reliable source of information anymore. How do you quantify the cost of that?

> looking for opportunities to avoid writing the same JOIN in 20 different queries.

Then you define a view, which is literally just a named and reusable query, which can be used in other queries. Writing queries or using views is certainly not "denormalization". Having redundant data in a query or view output is commonplace and not a problem since it cannot introduce update-anomalies. (Some databases allow updateable views in some scenarios, but only when the view can be unambiguously mapped back to base tables, so no risk of update-anomalies here either.)


Use a view that acts as a quick-and-useful abstraction to mimic a denormalized table?

E.g.

``` CREATE VIEW vw_events_and_projects AS SELECT * FROM events JOIN projects ```

Then

``` SELECT * FROM vw_events_and_projects ```

Edit:

And if you need OLAP, replicate the normalized table to a database that handles analytics workflows better (e.g. ClickHouse).

Then you get the normalized forms for your OLTP workflows (your "bread and butter"); and you also get the efficiency and ergonomics of real-deal OLAP.

Of course, your biggest issue is going to be keeping the two in sync. The obvious approach is to have your OLTP database stream synchronization data to the replica whenever data is modified.
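
One hedged sketch of the sync piece, assuming Postgres on the OLTP side: publish the normalized tables over logical replication and let a CDC consumer (Debezium, a ClickHouse ingestion job, etc.) apply the changes to the analytics copy.

```
-- On the Postgres (OLTP) side: expose the normalized tables for CDC.
-- Table names are illustrative.
ALTER SYSTEM SET wal_level = 'logical';  -- takes effect after a restart

CREATE PUBLICATION analytics_pub
    FOR TABLE events, projects;

-- A downstream consumer subscribes to this publication and applies the
-- change stream to the analytics database, where the data can be laid
-- out however the OLAP workload prefers.
```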


You'd like DBT, you should go and read up on it. It uses views.


Plus there are many cases where you want to see the data as of a given time, for instance all the positions of a fund at a given time. Then it makes sense to denormalise because this data should never be updated in the future.


What about using logs as the source of truth? It's a pattern I've seen multiple times.


I find it interesting how the argument of the current top comment[1] - that denormalized tables are great, as long as you never have any updates for existing entities - basically follows directly from your point 2) :

If for any individual use case a denormalized table fitted specifically to that use case is superior to a normalized table, the most straightforward solution for multiple use cases would be to just keep multiple tables and replicate any insert on each of them. This would work relatively well until the moment you have to update something...

[1] https://news.ycombinator.com/item?id=35927226


> Normalisation at all costs is foolish - if the cost exceeds the value, then don't do it.

Value of denormalization: May improve query performance for some particular use cases. (But probably not any more than using a materialized view)

Cost of denormalization: Risk of anomalies (inconsistencies) which mean the database is not a reliable source of information anymore. Murphy's law means that this is basically bound to happen sooner or later. You should worry about this.

Just use a view.
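
For instance (Postgres syntax, hypothetical table and column names), a materialized view gives you the precomputed flat table without creating a second source of truth:

```
-- The normalized tables remain the single source of truth; the
-- materialized view is just a cached, denormalized read model.
CREATE MATERIALIZED VIEW events_with_projects AS
SELECT e.*, p.name AS project_name, p.owner AS project_owner
FROM events e
JOIN projects p ON p.id = e.project_id;

-- Recompute it when freshness matters (e.g. from a scheduled job).
REFRESH MATERIALIZED VIEW events_with_projects;
```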


I kinda observe this from a different point of view: what are the use cases needed for the data and what are the queries that could satisfy those?

The DynamoDB book was enlightening from this point of view, even for designing SQL databases, where we normalize data as much as possible no matter what.


The book is this one: https://www.dynamodbbook.com/


Hi there, when you say the dynamodb book, do you mean https://www.dynamodbbook.com/?


Yes, I mean that one!


Which DynamoDB book? It's very relevant to my interests.



It sounds like the author is calling a kind of materialized/persisted view "denormalized tables". The actual DB tables stay untouched and fully normalized.

It sure makes sense to love them, views are great. I don't know why they need a new name.

> Transformation tools such as dbt (Data Build Tool) have revolutionized the management and maintenance of denormalized tables. With dbt, we can establish clear relationships between table abstractions, create denormalized analytics datasets on top of them, and ensure data integrity and consistency with tests.


Fully concur with this! It succinctly sums up the article! There really isn't much meat there!


There was me assuming it was about normal form, you know, what everyone else means when they talk about 'normalising' in a database context...

Which even more confusingly almost applies but the opposite way around:

> create denormalized analytics datasets on top of them

A common pattern is to have source data 'warehoused' or whatever you want to call it, and then build your, er, normalised schema on top of that.


One of the main reasons to normalize data is to eliminate duplication, and to find "the source data" by properly designing the relationships between tables.


By source data I meant the data as originally supplied, which is perhaps not in normal form and which you have no control over.


> It sure makes sense to love them, views are great. I don't know why they need a new name.

in very many cases, old things get new names so that more people can share in the claim that they invented what has been in fact merely re-invented


merely re-invented

*rediscovered


As a thought experiment. Was the original thing invented or discovered?

Many things are re-invented/rediscovered from first principles. In maths, some people prefer to think maths are created, and some prefer to think in terms of discovering/uncovering something that was already there.


It’s complicated. Databases have entities called “views” and “materialized views” that have a specific meaning in that ecosystem. dbt lets you define views in the abstract sense, but they’re implemented using a variety of different database primitives, like tables, temp tables, common table expressions, and views. dbt calls them “models”.
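
For example, the choice of primitive in dbt is a one-line config on the model. The model name and columns below are hypothetical, but the config syntax is dbt's own:

```
-- models/marts/fct_orders.sql (hypothetical)
-- The same SELECT can be materialized as a 'view', 'table',
-- 'incremental' or 'ephemeral' (CTE-style) model.
{{ config(materialized='view') }}

select
    order_id,
    customer_id,
    amount
from {{ ref('stg_orders') }}
```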


The debate over normalized/denormalized has to do with how authoritative online data should be stored. What you do with derived datasets is not really contentious; do whatever you want.


Sounds like a typical workflow in any BI tool. I wish the author would have taken a moment to explain how Glean is any different.


For the third time this week, in relatively unrelated fields of computation science, I'm reminded of the quote: "Duplication is less expensive than the wrong abstraction".

An awful lot of the time, a table schema is a terrible abstraction of the actual series it is designed to record. Sometimes it's designed under constraints that exist only to self-sustain the abstraction. Some of them have viable reasoning, some don't. How these structures sustain themselves for . . decades . . is a mystery to me. These non-relational movements, in part represented by the OP article, are (in part) attempts to shift the computing from data to the actual programmatic area. Because the real world doesn't have schemas - although that's still, incredibly, a source of intense disagreement.

Just an interesting thing that keeps cropping up. I wonder what the formal, "scientific" name for this is?


The schemas always exist, it’s just a question of where: the database or the code that interacts with the database.


The other way of putting it is this: the database gives you the chance to express facts (propositions) about the world in the form of tuples in relations ("rows" in "tables".)

You can, naturally, also express facts any other way you like. Comments in this forum, scribbles on the bathroom walls, seven layers of JavaScript buried in 10 microservices.

But a relational DB gives you the tools to at least attempt the expression of those propositions in a way that makes them findable and provable later, with minimal pain, and by people across the organization, not just in one code base.

Doing it in code, your mileage may vary.

And just like in written or spoken language communication, we can express facts poorly or incorrectly in any medium.

It's just that the relational model gives us a tool and framework to help put discipline (and query flexibility) into that process. If we use the tool right, it helps us clarify our logic.


I'm not entirely sure how true that actually rings, because it's true by definition in some sense, but having tables just be a loose collection of typed nullable columns sans natural (maybe compound) primary keys is a very different experience from the kind of thing you expect. It's a schema in the way @dataclass is a schema and just ends up being a place to put stuff. Natural keys lead to very, very few joins being necessary, they map well to most problem domains, and foreign key constraints feel almost unnecessary: since your real data is the key, it's really hard to "lose" relationships.
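
A tiny sketch of the style being described (entirely hypothetical domain and names): the natural compound key is the data itself, so the relationship travels with every row.

```
-- The key (tenant, sku, warehouse) is real domain data rather than a
-- surrogate id, so rows are self-describing and joins are rarely needed.
CREATE TABLE stock_level (
    tenant     text        NOT NULL,
    sku        text        NOT NULL,
    warehouse  text        NOT NULL,
    on_hand    integer     NOT NULL DEFAULT 0,
    updated_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (tenant, sku, warehouse)
);
```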


It feels to me like you're mixing the terms schema and relations/joins.

You can have a schema on a table without any foreign keys or relations. But in the above you seem to use those concepts interchangeably.

Taking OP's comments as being directly about schemas, it seems true to me.


If your app/architecture is effectively "BYO schema", then if one schema is wrong the others aren't necessarily, and the cost of making a mistake is much lower. And I would also argue that even if your database is normalized and has a strict schema, the code _still_ has the ability to implement its own schema after pulling data out.


> the real world doesn't have schemas

This is simply incorrect. The real world has business logic, and that's ironclad (or should be, for a successful business, assuming you're not acquired, at which point you start the business data schema over). Where business logic has exceptions, those exceptions should be accommodated in the schema. But there should be no daylight between how the business works and how the schema records its every function. Anything else is bad design. It works for decades only if a business makes its decisions partially based on how those are constrained by prior logic, and the DNA of that logic is the schema.


In twenty years of "ERP Whispering" I've never known a "formal" "schema" (going to put "schema" in quotes here too, what that actually represents is a strictly validating structure, or system of any kind) that described a business object as utilized (process, reference, part number database) with anything greater than, say, forty percent commonality. At the high end. The remainder is waived, "ad-hocced", or is simply fibbed ("oh sure, all our parts have registered NSNs[1]").

The systems are, at their best, an executive class (VPs, Senior Mgmt) contracting a shaman/priestly class (me) to lay a sheet of order on throbbing chaos.

I probably should mention that it's possible - likely, even - that I have not worked for a non-dysfunctional business, and of course the ERP software ecosystem introduces its own peculiarities on top of that.

I realize I am badly abusing the word "schema" here. A lot of the time, we design schemas based on high level business requirements, and that's what I'm thinking about. There should be a LOT more between schema and business - a whole universe of business architecture, design, and software - but there never seems to be the money to do so.

In the wider context, though, the phenomenological world, I would posit, does not have business logic inherent in the fact of its own existence. This could get to be a very spacious, "dude-wheres-my-bong" sort of discussion, but I think the difference in Weltanschauung we're seeing here might be due to divergent experience.

[1] NATO Stock Numbers


It's almost certainly down to divergent experience. The shamaning and priesting I've done has almost exclusively begun with small companies that were working out their business operations at the same time as they were commissioning software. As they grow, the business logic evolves and the software has to grapple with that, but the originally unified logic imposes constraints both ways. The software sets limits on, and is in conversation with, the whims of management... if for no other reason than that 10 years in, on version 72, the cost of changing the data schema may outweigh the hoped-for advantage of some off-the-cuff idea to change fundamental operating practices.

More concretely, e.g. if a custom logistics system grows up with a company from the ground floor, the business logic tends to bend around the software's limitations [clarification: employees learn to get things done that weren't originally anticipated, in ways that weren't envisioned]; then the software slowly incorporates the new needs into formal features. If you're dealing with rational management (like the original founders) they will see the sense in maintaining logical continuity and tweak operations as needed to accommodate what should or shouldn't be done in software. [As opposed to "in operations". It's when employees are repeatedly writing on white boards or passing papers around for the same thing that it needs to be integrated as a feature.]

But telling the management that "changing this fundamental aspect is impossible without significant downtime" is often enough to start the conversation of how to shape it so that the schema and the business remain in lock step.

This isn't perfect. Ask Southwest Airlines, who grew with their own software until they couldn't, and then switched catastrophically to an entirely new system. Sometimes things reach a scale of complexity that the software simply can't conform to. But a really good designer should see those things 2-3 years out and plan for them.

This is the real power of the shaman. I don't dictate business logic, but I do whisper when it contravenes the hard logic of the schema, I fight to minimize the number of exceptions ("hacks" of any kind) and through this keep the beast in check.


I don't know what you mean by business logic but the (in my view successful) businesses I have seen operate more on the highly variable whim of upper management than any sort of ironclad logic.

You notice this quickly when you computerise existing human processes. They are riddled with (sometimes very valuable!) inconsistencies that are hard to fit into the regular computer mold some designer thought was sensible.


I responded at length to the sibling but just want to say - usually where I've come in there is an operations manual that works very much like a schema. Investigating where it's being overridden in daily practice often helps to form more linear processes / more consistent logic than was in the manuals. Once those are worked out, the software becomes the glue that forces employees to follow the processes, and the manual is about the software. But you're right: Stepping into a messy organic system and writing software around it is hard. It's much harder if they aren't willing to be flexible.


That's not what the article is about but the question of what level of abstraction and how flexible it needs to be is always interesting.

On table schemas, I think designers will usually have a good idea of how stable it needs to be and what should go in it. For instance someone creating an invoice table will probably already have a set of invariable stuff that needs to go there for sheer legal reasons.

The other part being, an inefficient or slightly clunky table schema is not the end of the world and can either be fixed, even in pretty active production envs (it's "just" that much more effort- and cost-intensive), or be dealt with at the application layer.

Stacking abstractions is always an option if nothing else helps, and I'd see trying denormalized tables at the very start of a new application as a worse tradeoff and a lack of thinking about what the application is supposed to do at its core.


Maybe "Impedance Mismatch"?


That's perfect!

The electrical metaphor is a powerful one, as evidenced by how effortlessly it describes a sister of the OP problem, "Object-Relational Impedance Mismatch". Looking at the most compact expression of the problem - the electrical one, i.e. math - you start to wonder if the root cause of all these is scale(observer) vs speed vs signal.

Could it be expressed as a logical abstraction over this family of phenomena: impedance matching; object-relational impedance; business system vs ERP?

For every "reference frame" (electrical, mechanical, software, database, system) an organization node (single developer, team, organization) might be in, there would be a sort of minimum beyond which no unit is discernible. As this unit grows, the risk of "impedance mismatch" grows, even if signal and velocity remain static. If signal and velocity ALSO grow, the probability of mismatch rapidly becomes 100%. Unlike in electronics, the actual physical size of the "carrier wave" is getting bigger[1].

Which, honestly, ok, this all sounds pretty damn obvious. Maybe that's why this is a solved problem in EE, but it's a forty-year-clusterpoop in ERP world. Could it be that the root cause, then, is nontechnical leadership? A PoliSci MS / MBA won't - or can't - see that larger systems necessarily have different signalling / flow, but they "think they can pull this off" because "airplanes and lawnmowers are basically the same thing" and "our culture is always our first product". Blop. Fail. Repeat for two generations, and here we all are.

[1] Which, hmmph, ok, that can happen in some specialized setups. But that's outside this sandbox.


> the real world doesn't have schemas

In the same sense that the real world doesn't have numbers, types, or functions. Platonists might disagree, but in any case it is beside the point. What matters is whether these concepts are useful.

There is no such thing as a schema-less database, the question is if the schema is formalized and stored and enforced in the database itself, or if it only exists implicitly in code or in the minds of the people using the database.


Hard agree on the computing system schema - it's always there, whether in a bag of brains or as an XSD file.

The usefulness of these concepts lies on a spectrum, with user behavior located in one part of that spectrum. Numbers occupy a broad spectrum that overlies most human experience, although not all. Type systems have overhead that occludes some aspects of computing, and by itself the phrase "type system" has fuzzy edges. Somewhat - although less so - same thing with functions - is a function a callable unit or is it an explicit method? Platonists - I think - would definitely agree that these things have the inherent quality of existence as a consequence of their nature; Anselm's Ontological Argument is sort of the Final Evolution of dire Platonism.

The world, though, as it exists in nature? Functions, types, even numbers are not primary observables. Our experience of reality is probably more akin to the observed experience of a rainbow, where the observer stands 138 degrees to the sun precisely . . or the rainbow doesn't exist. We're in a 138 degree arc to reality, living on a rainbow, counting the bands. But honestly? Who knows? Probably some learned cosmologist. I am but a MilStdJunkie on the internet.


The world is always 6NF.


Data denormalization makes sense if the data is written once and never updated - like with a data warehouse / analytics.

If you need to update the data then denormalization can turn into a big source of trouble. For example you end up with many copies of the same stuff, and you must make an extra effort to update all the duplicates upon update.

Or you end up with multiple entries, where the validity of an entry is determined by some extra 'isValid == true' or 'deleted == false' field. Now all these 'invalid' entries start to clog up the table/collection, and performance may quietly deteriorate.

I once had to use a denormalized schema for nested data, as lookups through too many references would have suffered. But that was no fun at all.
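
To make the update problem above concrete, here is a minimal sketch with made-up column names: one logical change fans out into many physical rows, and every writer has to remember all of them.

```
-- Normalized: one logical fact, one row to change.
UPDATE customers SET name = 'Acme GmbH' WHERE customer_id = 42;

-- Denormalized: the same fact is copied onto every dependent row, so the
-- rename must touch them all; any code path that forgets one of these
-- statements leaves the copies inconsistent.
UPDATE orders   SET customer_name = 'Acme GmbH' WHERE customer_id = 42;
UPDATE invoices SET customer_name = 'Acme GmbH' WHERE customer_id = 42;
```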


I'd argue that most of the time it is better to write normalized data and use some form of permanently existing database views (rather than actual tables) to read from pseudo-denormalized data. That way you combine the best of both worlds.


That really, really depends on whether you’re driving something that gets read often enough that query complexity ruins your product’s ability to predictably deliver on query deadlines — whether that’s “load a webpage” deadline or a “submit half a million payments to the bank before it closes” deadline.


I've started doing this (materialized views denormalizing data) at work more recently and it's been immensely helpful. Surprised more people don't do this.


Data denormalization also helps with restoring individual tables and sharding. IMO one should aim for normalization and slowly denormalize only if needed.


It's important to note that the author uses denormalized tables for data analysis only. It was never outright stated "don't do this for source-of-truth, authoritative data," but yeah, don't do this for source-of-truth, authoritative data (in general).


OT (or maybe not) : It's interesting how the idiom "I stopped worrying and learned to love X" today is taken at face value and is basically a plea to accept something seemingly insane and just go with it - when in the original movie, the person making the statement was genuinely insane and the movie's entire message was basically the opposite.


As someone who today has to maintain a database with a lot of denormalised data, do this only if your database is pretty much write once (the author's use case).

For anything else you might feel that it's saving you time and performance by not needing to join tables, but you're just shooting yourself in the foot with a delayed effect. You're just moving complexity from the read to the write operation with a multiplication effect.


This post is about data analysis i.e. read-only queries. Of course denormalization wouldn't hurt, and might speed things up. Normalization is about making writes easier, because you only have a single source of truth to update. Try writing into a database that has a denormalized schema... ugh.

> Denormalized tables prioritize performance and simplicity, allowing data redundancy and duplicate info for faster queries. By embracing denormalization, we can create efficient, maintainable data models that promote insightful analysis.

Yes, for read-only queries. Again, your business should not be storing data into denormalized tables. Store them in normalized tables and pull the data out into denormalized tables for data analysis.


While this talks mostly about data warehousing, oftentimes denormalization is useful for everyday web app data storage. If your web app (usually on Postgres) is mostly frequent reads and rare writes (most web apps are) — there's no excuse for your pages to load slower than a static site. Store your data as normalized as you want, add a denormalized materialized view, update it on writes, render pages based on the view.

Of course I'm talking about 95% of the apps where this is acceptable, not 5% where table locking can cause problems, leading to the need for concurrent update handling.
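
A hedged sketch of that pattern on Postgres, with a hypothetical materialized view called page_summaries: give it a unique index so it can be refreshed CONCURRENTLY, which lets page reads keep hitting the old contents while the refresh runs.

```
-- One-time setup: CONCURRENTLY requires a unique index on the view.
CREATE UNIQUE INDEX ON page_summaries (page_id);

-- Run after writes (application hook, queue worker, or cron); readers
-- keep seeing the old rows until the refreshed ones are swapped in.
REFRESH MATERIALIZED VIEW CONCURRENTLY page_summaries;
```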


i think i heard once that there are only a few problems in computer science, including off-by-one errors, and concurrent update handling

oh yeah and overengineering, probably


Concurrency."

"There are three hard things in computer science: cache invalidation, naming things, off-by-one errors, and

Original quote:

   There are only two hard things in computer science: cache invalidation and naming things. — Phil Karlton 
The one that most people know:

   There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors. - Jeff Atwood


- Knock knock

- Race condition

- Who's there?


95% of the apps don't need a database. Just store data in files. One less layer to worry about.


> there's no excuse for your pages to load slower than a static site.

Indeed all web pages should be static pages. Anyone still doing server-side rendering in 2023 and defending it for any use case at all needs to turn in their badge.

Same goes for people promoting frameworks like react or vue for anything but sufficiently complex web apps.


Uh, what? Let’s take this very site. How do static pages work for logged-in users?


Strong statements without any substance behind them like yours are not helpful in my opinion.

So how about something a bit more nuanced: choose the right tool for the right job.


The whole SSR revival thing in frontend right now seems to exist only to game SEO and has no benefit for the user.


There are real speed gains to be had.

Typically your server is right next to the database. The end user isn't, so being able to serve their first view already rendered right, without an extra round trip, is pretty good.


What do you think about Remix Run?


Okay, so someone who analyzes reads, who's never written software that needs consistent writes, is in favor of denormalized data. Let's see how this post updates in ten years with parts 4 thru 9 where they go from realizing their data is inconsistent to writing some monstrous beast to try to normalize it.


This really does remind me of that meme with the bell curve.

On the far left and right side, there is a person saying "denormalized data is great"

It is just the one in the middle who doesn't like it.

After doing pretty major projects for 40 years, let me tell you, "denormalized data is great"

It is like watching people go down the hole of "patterns for everything" and then watching them crawl back out of it again many years later.


A provocative spin:

Left-side: ORMs suck, just write SQL

Middle: ORMs relieve the impedance mismatch of the OOP "object graph" paradigm and the SQL "relational algebra" paradigm. One could even make the argument that this mismatch is inherent and we're just doing CPR on a rotting horse -- so now the industry standard is to use NoSQL databases, such as MongoDB, to get away from bygone ways of thinking about data access. Likewise, SQL is not conducive to thorough testing coverage; and migration management is prone to user-error. In this vein, the industry has once again innovated and revolutionized data-access, moving away from "CREATE READ UPDATE DELETE" to "ENCAPSULATE INSTANTIATE INJECT CREATE READ(+UPDATE|DELETE)* TRACK PUSH SYNC." With all this in-mind, it would be utterly baroque to use anything other than JS+Node+Mongo+Mongoose for a fully unified front-end and back-end.

*An astute reader will recognize that ORMs are constantly doing N+1 queries (to first pull the data, transform it into objects, update said objects, and then push the changes to the database). We feel as though hardware has gotten sufficiently advanced that the mind-space these costs inhabit is no longer justified -- and it is OK to do things this way.

Right-side: ORMs suck, just write SQL


The weird part here is you are assuming SQL was even the right thing to begin with.

But sure, if you want to use SQL, then yeah, I agree with you.


Why isn't the author just using a document store instead? This entire post is defeating the purpose of RDBMSs. What happens when data needs to be updated? Would that require n update statements in order to change a username, for instance?


Given that initially defining your db schema is a one-time thing and we generally don’t change it very often after, I can’t fully get behind what the article suggests. However, the one thing I tend to do is add a text or json field called “extra” to my main tables that just stores a JSON map with fields I want to add to the record but don’t need to necessarily query by.
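
For what it's worth, on Postgres that pattern looks roughly like the following (hypothetical table and field names):

```
-- A schemaless escape hatch on an otherwise normalized table.
ALTER TABLE orders ADD COLUMN extra jsonb NOT NULL DEFAULT '{}';

-- Writes tuck ad-hoc fields into the map...
UPDATE orders
SET extra = extra || '{"gift_note": "Happy birthday"}'::jsonb
WHERE id = 1;

-- ...and they can still be read (or even indexed) later if needed.
SELECT extra->>'gift_note' FROM orders WHERE id = 1;
```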


You have to denormalize in very common cases, even for mostly OLTP workloads, in order to get sortable data into an index. Consider a cloud storage product where folders connect one-to-many to documents. The product wants to display the most recently used folders ordered by their inner document modification date.

Because composite indexes commonly can't span tables, you have to push the last modification date up to the folder row in order to get the data in the right place to build the obvious index.

Denormalization is a normal and expected optimization to scale a relational database.
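
A minimal sketch of what that looks like in practice (hypothetical schema and names): copy the document timestamp up to the folder and index it there.

```
-- Denormalize the latest document timestamp onto the folder row...
ALTER TABLE folders ADD COLUMN last_doc_modified_at timestamptz;

-- ...keep it current on every document write (from app code or a trigger;
-- :modified_folder_id is an application-side bind parameter)...
UPDATE folders
SET last_doc_modified_at = now()
WHERE folder_id = :modified_folder_id;

-- ...so "most recently used folders" becomes a single index scan.
CREATE INDEX folders_recent_idx
    ON folders (owner_id, last_doc_modified_at DESC);
```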


Couldn't you also use a materialised view here? Or, alternatively, what about an index on the documents table alone? It might depend on whether your db has a flexible enough indexing system to do what you want, but I don't see why the index would have to span tables since it only needs to depend on the folder id and the document modification time


A materialized view is another approach, but that's still essentially denormalization. I prefer using indexes as they are a bit more of a 1st-class relational concept in my mind. With a MV you are copying ALL table data, but with an index you can concentrate on just the ordering columns (you can make a few-column MV too and then join, but you are now reinventing indexes).

You cannot get an index to be used across a join in most relational DB systems. But joins pop up everywhere when you normalize.

This query is a fairly distilled example of it https://stackoverflow.com/questions/16402225/index-spanning-...


Hmm as far as I can remember that's what they told us at uni. After hammering normal forms into our heads for a while they added "and you can carefully denormalize for speed".


Why not use first normal form instead of fully denormalized tables? What is the point of using an RDBMS if you do not need normalization? Denormalized tables are only good for sequential scans; you will screw up the DB performance if you need update or insert operations on such tables. And if you do not need updates/inserts then you do not really need an RDBMS. And you will definitely screw the DB performance if you permanently scan denormalized tables.


Who decided the point of using databases is normalization? Where is that coming from? Relational databases have existed before the concept of normalization existed. Also, an index is nothing more than a partial copy of a table with a different key. It denormalizes your data. Do you use indexes other than the PK?


> Who decided the point of using databases is normalization? Where is that coming from? Relational databases have existed before the concept of normalization existed.

Ted Codd, the guy who wrote "A Relational Model of Data for Large Shared Data Banks", the paper that introduced the relational model. The first section is entitled “Relational Model and Normal Form”.

https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf

https://en.wikipedia.org/wiki/Relational_model

https://en.wikipedia.org/wiki/Database_normalization

Projections (think views) and indexes (which are generally on-disk projections) are not disallowed. The point is that the logical representation is normalized. The logical representation and the on-disk or in-memory representations are not the same thing. There’s nothing in the logical model preventing you from having different representations for performance or other reasons. This has been obscured somewhat by the common implementations of relational databases where the on-disk and logical representations are often the same, which tends to make people think they have to be the same. That’s not the case.


If we simply mark denormalized data as an internal representation or an index, then denormalization won’t exist and everyone will be happy.


> Also, an index is nothing more than a partial copy of a table with a different key. It denormalizes your data. Do you use indexes other than the PK?

The relational model distinguishes between the logical and physical levels. Normalization applies to the logical level. At the physical level you can have duplicate data, e.g. caches, which is fine as long as the engine guarantees consistency of the logical model.


Normalization only applies to the base tables, i.e. the tables which work as the "single source of truth". Queries and views (which are just stored and possibly cached queries) often have repeated information and this is not a problem. A data-warehouse is basically a cached query over the base tables, so again, normalization is not an issue.

The article seems to agree with this, but it is kind of buried in the text.


I've worked in early-stage startup environments where we didn't always have the time or resources to build proper management tools, and editing the database sometimes was the management tool. I'm not proud of this but that's how it was.

In this situation, denormalised tables are much easier to hand edit than tables that have been normalised out into the eighth dimension and beyond.


Denormalized tables are like bedrooms: the mess is fine if it is your own mess, not so great if it is someone else's.

Right now I have some data with both MUNI and MUNICODE. I just thought I'd casually check and sure enough, I have a non-trivial number of cases where the two do not match up right. So, the risk there is ... which one do I believe?


Indexing is a form of denormalization. It rekeys the table from another vantage point to enable different optimal queries. So the presence of denormalized data reveals a shortcoming of the indexing capabilities of a system. An index is a projection that is guaranteed to stay in sync with the source of truth. And denormalizations of data are all projections.


A denormalized table is just a hard-to-maintain (materialized) view over normalized tables.


Yeah, for an OLTP system, until you hit a certain scale, normalized is fine and actually really needed.

For OLAP reporting and analysis denormalized is the way to go for sure. Reducing the number of joins needed (preferably to 0) makes things go very fast.


At my company we have normalized data warehouses, and data scientists query them to make big, static, denormalized tables to run through their analyses and models.

I thought this was a pretty standard, table stakes move in the analytics world.


dimensional modeling seems to be a waste of time in many cases, when the end user just wants a flat table they can drop into excel or into a data frame to do some eda/modeling.

if it makes it easier for the data warehouse team to build a kimball model and then put views on top of it to deliver the flat tables, or it works better to build a kimball model because you're going to set powerbi on top of it, fine.

for most data analysts, data scientists, etc. stopping at a dimensional model just leaves them needing to put the pieces together themselves.


Every time someone praises dbt, a view sheds another tear.


Honest question: if I want to apply IaC to my views, is there a better tool than dbt to take care of that for me? I have never used dbt before, only read about it.


I use views every day, but what is dbt?

If I have to install binaries locally, that sounds boring.


Can you explain why that sounds boring?

I ask, as my preference would be to do the boring thing and install a binary locally. Like how one generally uses git for example.


I meant in a SQL context. I don’t want to manage extra binaries to deploy per environment. Especially if the same result can be achieved via standard SQL.


What is dbt? How is it related to denormalized tables? The article is a bit confusing to me. What are the alternatives?


A wizened Perl guru once told me “Normalization is an optimization not a rule…”


How do you deal with multiple cardinality in a denormalized table? Using Arrays?



