> Normalization was built for a world with very different assumptions. In the data centers of the 1980s, storage was at a premium and compute was relatively cheap. But the times have changed. Storage is cheap as can be, while compute is at a premium.
Normalisation isn't primarily about saving storage; it's about avoiding update anomalies, i.e. correctness.
I'm reminded of being 18, in my first year of college, with this lovely database professor who was good with crowds. He'd get us all to memorize and repeat, as a group, certain phrases.
"Why do we normalize?"
150 students, in unison: "To make better relations"
"And why do we DE-normalize?"
150 students, in unison: "Performance"
"And what is a database?"
150 students, in unison: "A self-describing set of integrated records"
That was 16 years ago, and I'm still able to say those answers verbatim.
Maybe to someone who could make sense of the DDL and read the language the column labels are written in. And understand all the implicit units, the rules around nulls/empties, the presence of magic strings (SSN, SKU) and special numbers (-1), and on and on. For that you need something like RDF and a proper data model.
Aren't you conflating the lexicon of data management with specific implementations of a relational database management system (RDBMS)?
Sorry, but your response sounds snarky and reminds me of all the ego hurdles I had to overcome when leaving/loving databases and set theory. Please remember that your comment could be someone's first introduction or an early step in learning.
If you use Oracle, PostgreSQL or MySQL (those are the ones I'm familiar with), you can always query the data dictionary and see how your tables relate. For me that is self-describing.
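For example, in PostgreSQL something like the following pulls the foreign-key relationships out of information_schema. This is just a sketch; the connection string is made up and psycopg2 is only one of many clients you could use:

```python
# Sketch: list foreign-key relationships from PostgreSQL's data dictionary.
# The DSN is hypothetical; any SQL client would do.
import psycopg2

FK_QUERY = """
SELECT
    tc.table_name   AS child_table,
    kcu.column_name AS child_column,
    ccu.table_name  AS parent_table,
    ccu.column_name AS parent_column
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu
  ON tc.constraint_name = ccu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY';
"""

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute(FK_QUERY)
    for child_table, child_col, parent_table, parent_col in cur.fetchall():
        print(f"{child_table}.{child_col} -> {parent_table}.{parent_col}")
```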
From the abstract of [1], Codd's stated motivation for applying relation theory to data storage:
"Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed."
From Section 1.4, "Normal Form":
"A relation whose domains are all simple can be represented in storage by a two-dimensional column-homogeneous array of the kind discussed above. Some more complicated data structure is necessary for a relation with one or more nonsimple domains. For this reason (and others to be cited below) the possibility of eliminating nonsimple domains appears worth investigating. There is, in fact, a very simple elimination procedure, which we shall call normalization."
As I read this, normalization was originally "about" making storage simpler than it was with contemporaneous models. Section 2.3 ("Consistency") discusses how "data bank state" can become inconsistent, and how such inconsistencies might be addressed, up to and including "the user or someone responsible for the security and integrity of the data is notified".
I think it's reasonable to infer that guaranteed transactional consistency (what I think kpmah means above by "correctness") and the space-saving properties of eliminating redundant data both happened later, and both fell out of the initial motivation of simplification.
[The relational model] provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation and organization of data on the other.
A further advantage of the relational view is that it forms a sound basis for treating derivability, redundancy, and consistency of relations [...]
So I would say it is mainly about flexibility, correctness and the possibility to create a simple yet powerful query language.
Great point! It was originally about both of these things, but the storage aspect isn't discussed much anymore because it's really not a concern.
The data integrity issue is still a concern, and I talk about that in the book. You need to manage data integrity in your application and think about how to handle updates properly. But it's completely doable and many people have.
> You need to manage data integrity in your application
This is just another way of saying you need to implement your own system for managing consistency. Dynamo offers transactions now, but they don’t offer actual serialization. Your transaction will simply fail if it runs into any contention. You might think that’s ok, just retry. But because you’ve chosen to sacrifice modelling your data, this will happen a lot. If you want to use aggregates in your Dynamo, you have to update an aggregate field every time you update your data, which means you create single points of contention for your failure prone transactions to run into all over your app. There are ways around this, but they all come at the cost of further sacrificing consistency guarantees. The main issue with that being that once you’ve achieved an inconsistent state, it’s incredibly difficult to even know it’s happened, let alone to fix it.
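To make that concrete, here's a rough boto3 sketch of the pattern described above: every write also bumps an aggregate item in the same transaction, and the whole thing has to retry when the transaction is cancelled by contention. The table, key, and attribute names are all made up.

```python
# Sketch: write an item and bump a shared aggregate in one DynamoDB transaction,
# retrying when the transaction is cancelled by contention.
# "Orders" and the key/attribute names are hypothetical.
import boto3

dynamodb = boto3.client("dynamodb")

def add_order(order_id: str, customer_id: str, amount: int, max_retries: int = 5):
    for attempt in range(max_retries):
        try:
            dynamodb.transact_write_items(
                TransactItems=[
                    {
                        "Put": {
                            "TableName": "Orders",
                            "Item": {
                                "PK": {"S": f"CUSTOMER#{customer_id}"},
                                "SK": {"S": f"ORDER#{order_id}"},
                                "Amount": {"N": str(amount)},
                            },
                            # fail if this order already exists
                            "ConditionExpression": "attribute_not_exists(PK)",
                        }
                    },
                    {
                        # the aggregate item is a single point of contention:
                        # every concurrent order for this customer touches it
                        "Update": {
                            "TableName": "Orders",
                            "Key": {
                                "PK": {"S": f"CUSTOMER#{customer_id}"},
                                "SK": {"S": "AGGREGATE"},
                            },
                            "UpdateExpression": "ADD OrderCount :one, TotalSpent :amt",
                            "ExpressionAttributeValues": {
                                ":one": {"N": "1"},
                                ":amt": {"N": str(amount)},
                            },
                        }
                    },
                ]
            )
            return
        except dynamodb.exceptions.TransactionCanceledException:
            continue  # contention: the whole transaction failed, try again
    raise RuntimeError("gave up after repeated transaction cancellations")
```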
Then you run into the issue of actually implementing your data access. With a denormalized data set, implementing all your data access comes at the expense of increasing interface complexity. Your schema ends up having objects, which contain arrays of objects, which contain arrays... and you have to design all of your interfaces around keys that belong to the top level parent object.
The relational model wasn’t designed to optimize one type of performance over another. It was designed to optimize operating on a relational dataset, regardless of the implementation details of the underlying management system. Trying to cram a relational set into a NoSQL DB unavoidably comes at the expense of some very serious compromises, with the primary benefit being that the DB itself is easier to administer. It’s not as simple as cost of storage vs cost of compute. NoSQL DBs like dynamo (actually especially dynamo) are great technology, with many perfectly valid use cases. But RDBMS is not one of them, and everybody I’ve seen attempt to use it as an RDBMS eventually regrets it.
Are there any basic examples you can give around maintaining that integrity?
I'm liking DynamoDB for tasks that fit nicely within a single domain, have relatively pain-free access patterns, etc. And I've found good fits, but there are some places where the eventual consistency model makes me nervous.
I'm specifically thinking about updating multiple different DynamoDB keys that might need to be aggregated for a data object. The valid answer may be "don't do that!" – if so, what should I do?
> Are there any basic examples you can give around maintaining that integrity?
For those types of use cases, the OP's advice would actually require implementing a fully bespoke concurrency control system in your business logic layer. Without trying to disparage the OP, this is, for all intents and purposes, impossible (aside from also being very, very impractical). There are some things you can do to create additional almost-functional (though still highly impractical) consistency controls for Dynamo (like throttling through FIFO queues), but they all end up being worse performance and scaling trade-offs than you'd get from simply using an RDBMS.
A lot of it boils down to the fact that Dynamo doesn't have (and wasn't designed to have) locking, meaning that pretty much any concurrency control system you want to implement on top of it is eventually going to run into a brick wall. The best you'd possibly be able to do is a very, very slow and clunky reimplementation of some of Spanner's design patterns.
Yeah, the limited transaction support is a killer for many use cases.
Thankfully, it’s not actually all that hard to implement your own “dynamo layer” on top of an SQL database and get most of the scaling benefits without giving up real transactions.
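For what it's worth, the skeleton of such a layer can be tiny. A hedged sketch, with SQLite standing in for the SQL database and all names invented:

```python
# Sketch of a minimal "dynamo-ish" key-value layer on top of a SQL database.
# SQLite stands in for Postgres/MySQL here; names are invented for illustration.
import json
import sqlite3

conn = sqlite3.connect("kv.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS items (
        pk  TEXT NOT NULL,
        sk  TEXT NOT NULL,
        doc TEXT NOT NULL,          -- JSON blob, like a DynamoDB item
        PRIMARY KEY (pk, sk)
    )
""")

def put_item(pk: str, sk: str, item: dict) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO items (pk, sk, doc) VALUES (?, ?, ?)",
        (pk, sk, json.dumps(item)),
    )
    conn.commit()

def query(pk: str, sk_prefix: str = "") -> list[dict]:
    rows = conn.execute(
        "SELECT doc FROM items WHERE pk = ? AND sk LIKE ? ORDER BY sk",
        (pk, sk_prefix + "%"),
    )
    return [json.loads(doc) for (doc,) in rows]
```

The difference, of course, is that underneath this interface you still have real transactions and real indexes when you need them.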
Author here! If you want more, I just released a book on DynamoDB yesterday --> https://www.dynamodbbook.com/ . There's a launch discount for the next few days.
The book is highly recommended by folks at AWS, including Rick Houlihan, the leader of the NoSQL Blackbelt Team at AWS[0].
Happy to answer any questions you have! Also available on Twitter and via email (I'm easily findable).
After seeing Rick's re:Invent talk, the one where at about minute 40 everyone's heads exploded, I emailed him (I'm in a very far away other department of Amazon) to ask him for more, because everything he was saying was absolutely not the way my group was using DynamoDB (i.e. we were doing it wrong).
He could have ignored my email entirely. He's a busy guy, right? I wouldn't have held it against him at all. Instead, he was super nice, provided me with more documentation, and honestly was just really helpful.
Shawn lists a four-part breakdown that's pretty on-point:
- Background and basics (Chapters 1-6)
- General advice for modeling & implementation (Chapters 7-9)
- DynamoDB strategies, such as how to handle one-to-many relationships, many-to-many relationships, complex filtering, migrations, etc. (Chapters 10-16).
- Five full walkthrough examples, including some pretty complex ones. One of them implements most of the GitHub metadata backend. Nothing related to the git contents specifically but everything around Repos, Issues, PRs, Stars, Forks, Users, Orgs, etc.
You can probably find the material from the first nine chapters if you google around enough, though it's helpful to have it all in one place. But the last 13 chapters are unlike anything else available, if I do say so myself. Super in-depth stuff.
I think it will be helpful even if you've been using it for a while, but it really depends on how deep you went on DynamoDB.
It's a bit expensive, especially with exchange rates, but it is what it is; these things take your time and effort to produce.
From the website I cannot see a table of contents for the book, unless it's behind the "provide an email for free chapters" form. Can you please provide a publicly visible table of contents, no email required, on the site or here?
This will be the deciding factor for me in making a purchase, as I'll be able to see what the book covers and whether it'll be of use to me, having several years of Dynamo experience.
The concepts are definitely applicable. You'll need to do some small work to translate vocabulary and work around a slightly different feature set, but most of it should work for you.
And yep, Cassandra is pretty similar to DynamoDB. Both are wide-column data stores. Some of the original folks that worked on Dynamo (not DynamoDB) at Amazon.com went to Facebook and worked on Cassandra. The concepts underlying Dynamo became the basis for DynamoDB at AWS.
If you take Cassandra and remove the power user features whose runtime cost is hard to predict, what you're left with is pretty close to DynamoDB. It's harder to use and incompatible with everything else, but its key feature is not overpromising capacity. We're only considering alternatives because there's no cost saving story around tiered storage.
Thanks! Honestly, I'm right with you. I would love a physical copy. I did a bit of research and didn't find any great options for making a physical copy of a self-published book for the number of copies I'm expecting to sell (given it's a fairly niche technical area).
That said, if anyone has any great recommendations here, I'm all ears. Actual experience would be best if possible, rather than the first thing you see in Google :).
> "With denormalizing, data integrity is more of an application concern. You'll need to consider when this duplicated data can change and how to update it if needed. But this denormalization will give you a greater scale than is possible with other databases."
There's the big catch. As another poster pointed out, normalisation is not about efficiency. It's about correctness. People have been quick to make the comparison between storage and compute cost. The high cost of development and bug-fixing time trumps both of them by an order of magnitude. The guarantee of referential integrity alone that SQL offers helps eradicate an entire class of bugs for your application with no added effort. This article glosses blithely over this critical caveat. Whenever this discussion comes up I'm quick to refer back to the yardstick of "Does your application have users? If so, then its data is relational." I can't wait for the day when we look back at NoSQL as the 'dancing sickness' of the IT world.
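To illustrate the "entire class of bugs" point, here's a tiny sketch (SQLite here, but any RDBMS behaves the same way): the foreign key makes it impossible to even commit an orphaned row, with zero application code. Schema and values are invented.

```python
# Sketch: a foreign-key constraint rejects orphaned rows outright.
# Illustrative schema only; SQLite needs foreign keys switched on explicitly.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        total   INTEGER NOT NULL
    );
""")

conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
conn.execute("INSERT INTO orders (user_id, total) VALUES (1, 4200)")  # fine

try:
    conn.execute("INSERT INTO orders (user_id, total) VALUES (999, 100)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # the orphaned row never makes it in
```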
It's also worth questioning: 'At what scale does this tradeoff become worthwhile?' Another poster here correctly pointed out that modern versions of Postgres scale remarkably well. The tipping point where this kind of NoSQL implementation becomes the most efficient option is likely to be far beyond the scale of most products. It's true that completely denormalising your data will make reads much faster; this is undeniable. That does not mean you need to throw the baby out with the bathwater and store your master data in NoSQL.
> I can't wait for the day when we look back at NoSQL as the 'dancing sickness' of the IT world.
It does have some compelling use cases. It's just that relational data isn't one of them. If you have a use case with a low potential for write contention, a tolerance for eventual consistency, a very simple data structure, and a high demand for read throughput, then it's great. One area where I've seen it used with great success is content publishing. You have one author, perhaps an additional editor/proofreader; the content is one document (with perhaps one other related document, like an author bio); and hopefully you want thousands or perhaps millions of people to be able to get decent read performance. Another example could be pretty much anything you'd typically use a Materialized View for in a DB. You can compute the view in your RDBMS, periodically publish it to a document database, and then offload all read throughput to a better suited system.
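A rough sketch of that last pattern, assuming Postgres and DynamoDB and entirely invented names (view, table, columns), just to show how little glue it takes:

```python
# Sketch: periodically publish a Postgres materialized view into DynamoDB
# so reads can be served from the document store. All names are hypothetical.
import boto3
import psycopg2

def publish_top_articles():
    pg = psycopg2.connect("dbname=cms user=me")              # hypothetical DSN
    table = boto3.resource("dynamodb").Table("TopArticles")  # hypothetical table

    with pg, pg.cursor() as cur:
        cur.execute("REFRESH MATERIALIZED VIEW top_articles")
        cur.execute("SELECT slug, title, view_count FROM top_articles")
        rows = cur.fetchall()

    with table.batch_writer() as batch:
        for slug, title, view_count in rows:
            batch.put_item(Item={
                "PK": f"ARTICLE#{slug}",
                "SK": "SUMMARY",
                "title": title,
                "view_count": view_count,
            })
```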
NoSQL is usually used wrong imo, but that doesn't mean there aren't ways to use it right. There are valid use cases for graph databases and stream processing systems too. But they're not hip enough to produce the same volume of highly questionable web apps.
You're absolutely correct with the point about the materialised view use case. I didn't want to labor the point by going into extra detail in my post. The most successful use I've seen for NoSQL databases is aggregating complex relational data structures into a single document record periodically. You're entirely correct. It's not that the technology is inherently wrong (in most cases; MongoDB is another story), it's just the widespread misuse that gives this technology a bad name.
At what point do these auto-sharding databases like DynamoDB become worth the effort these days? You can squeeze a lot out of a single Postgres instance and much more if you go with read replicas or Redis caches.
When you start with a relational model you don't need a priori knowledge of your data access and you get solid performance and guarantees. If you need this access knowledge beforehand, is DynamoDB best for scaling mature products?
I just answered this on Twitter, but I think there are two instances where it's a no-brainer to use DynamoDB:
- High-scale situations where you're worried about performance of a relational database, particularly joins, as it scales.
- If you're using serverless compute (e.g. AWS Lambda or AppSync) where traditional databases don't fit well with the connection model.
That said, you can use DynamoDB for almost every OLTP application. It's just more a matter of personal preference as to whether you want to use a relational database or something like DynamoDB. I pick DynamoDB every time b/c I understand how to use it and like the other benefits (billing model, permissions model, performance characteristics), but I won't say you're wrong if you don't choose it in these other situations.
I would say there is much more maintenance around schemas and migrations.
Since there is no enforcement of schema at the database level, you need to be very careful in understanding every single way your application(s) work with the data, and ensure that they are backwards and forward compatible.
This generally involves writing a lot of custom tooling and batch migration logic, and ensuring strict control over code that modifies data.
It's very easy to only discover schema migration problems in production, as the data is accessed.
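To give a flavour of that tooling, here's a hedged sketch of a scan-and-rewrite backfill with boto3; the table, keys, and transformation are invented, and throttling, error handling, and dual-reading old and new item shapes in the application are all still on you:

```python
# Sketch: backfill a new attribute across an existing DynamoDB table.
# Table name, keys and the transformation are invented for illustration.
import boto3

table = boto3.resource("dynamodb").Table("AppData")

def migrate():
    scan_kwargs = {}
    while True:
        page = table.scan(**scan_kwargs)
        for item in page["Items"]:
            if "GSI1PK" in item:
                continue  # already migrated; migrations must be re-runnable
            item["GSI1PK"] = f"EMAIL#{item['email'].lower()}"  # new access pattern
            table.put_item(Item=item)
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```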
If you know all your access patterns and your writes >>> reads, a NoSQL solution will be cheaper to operate than Postgres. Meaning, I believe, that for most deployments you can get the same amount of performance from Postgres, but simply at a higher cost (which may be 3-6x at most). Another reason to go with NoSQL is if you are latency sensitive, although I don't think Dynamo falls in this bucket.
NoSQL was also really good for OLAP, but I think now there are several really good OLAP solutions (like Clickhouse for OSS and Redshift/BigQuery in the cloud) that are easier to manage.
What happens when you need to restart that “single Postgres instance” to apply config changes or upgrade to a more powerful instance class? How do you promote a replica to primary without downtime?
Those concerns are mostly gone when you rely on a service like DynamoDB. It's not “free”, it comes with increased complexity at the app level, but it does offer peace of mind if you can afford the $$$.
In my experience (using Dynamo for 4-5 use cases and Postgres for many more), a properly configured HA Postgres on RDS costs way more than Dynamo for the same workload. There is one big catch: data modeling is not as straightforward with Dynamo. You cannot change it as easily as with Postgres, and your access patterns must cover every single use case. If you can do that, model your data, and the access patterns fit, Dynamo is great; if not, you are going to have a hard time.
Managed Postgres services tend to be fairly expensive for production use cases at any small-to-medium organization, and they all come with their own little caveats here and there.
Are you running one cluster per application? I've seen that a lot but it's actually quite inexpensive when you use one cluster for multiple applications.
There is a lot more you should learn about DynamoDB but I appreciate the effort of the author. Please read the AWS documentation, it's not that big and explains vital things that just aren't in this article. Very important things like:
- LSIs can't be created after the table is created
- Creating a GSI late generates back pressure on the main table
- If you have an LSI, your table will not scale beyond 10GB
- How often will a table scale up and down per day/hour?
- Cost of auto-scaling in CloudWatch (alarms aren't free)
...and so much more. I've been working with DynamoDB for over 2 years now and I love it.
One note on this -- if you have an LSI, you can't have an item collection larger than 10GB, where an item collection refers to all the items with the same partition key in your main table and your LSI.
A DynamoDB table with an LSI can scale far beyond 10GB. That said, I would avoid LSIs in almost all circumstances. Just go with a GSI.
Thank you for the clarification abd12, you are right. I only create an LSI if I know for sure the data within a given shard will not go beyond 10GB... and since we never really know, we always go with a GSI.
Did business with a startup: signed up, started getting service; they play an intermediary biller / payor role.
Because of an issue with the company name used in signup, their billing system fell over and didn't set up billing.
But what was crazy is I quickly realized this shop was a NoSQL shop. NOTHING connected to anything - so since they hadn't built any reports to cross-check any of this, they literally did not notice (I noticed other consistency issues elsewhere).
In a SQL database this stuff, especially around accounting / money, is 101 stuff, but NoSQL seemed to really struggle here based on how they'd set it up.
I finally bugged them to charge us, but even that was a bit goofy (basically it looked like they exported some transactions to a credit card system - but I doubt they had any logic to handle failed payment issues etc).
We have one other vendor where the amount actually charged is a few pennies off the receipt totals - issues with doubles, rounding, and application logic that doesn't simply use the same number (from the database) for order / item detail, total, and billing (a toy example of the kind of drift I mean is below).
So at least in finance / accounting, a ledger which is basically a source of truth, and is linked / summarized / etc. in various ways to other parts of the system (sales tax, receipts, credit card charges by order, etc.) really results in some free consistency wins that don't seem so free in NoSQL land.
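The toy example, for the curious: binary doubles can't represent most decimal amounts exactly, so totals computed with floats in application code drift from totals computed from the same rows elsewhere.

```python
# Toy illustration of how binary doubles drift on money maths.
from decimal import Decimal

print(0.1 + 0.2 == 0.3)                   # False: doubles can't represent these exactly
print(0.1 + 0.2)                          # 0.30000000000000004
print(Decimal("0.10") + Decimal("0.20"))  # 0.30
```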
Operational data stores are not reporting data stores. A healthy system would fire change events to which a reporting system subscribes, placing the data where it belongs in the reporting data model.
This is a great post, and DDB is a great database for the right use cases. I want to give a shout out to FaunaDB for anybody looking for alternatives - it's also serverless and crazy scalable, with usage-based pricing. Its downside is its proprietary FQL query language, not because it sucks (it doesn't!) but because there is a learning curve. They provide a rich GraphQL interface as an alternative to FQL. Its upside vs DDB is a much richer set of functions, including aggregations, and first-class support for user-defined functions. Their attribute-based permissions system is phenomenal. It's definitely worth a look if you're considering DynamoDB but want something that takes less upfront planning about access patterns.
Having spent a few years working with DynamoDB to build multi-region, multi-tenancy platforms, I must say that DynamoDB is a good fit as a supplementary data store, i.e. you should only store a sub-set of the information managed by your serverless microservice. DynamoDB multi-region replication is just amazing. Unfortunately, we had a few massive billing spikes with DynamoDB, and we ended up adding pricing measurement and tests to track read/write units in all our functions.
I generally don't recommend DynamoDB as a primary data store irrespective of your use case. It takes too much time to model the data. With every new requirement, you have to redo a lot of the modelling exercise. Choices you made in the beginning start looking bad, and you will not remember why you created that particular combination of the composite key or local secondary index which offers no benefit due to incremental changes. Transaction support is painful; existing SDKs just don't cut it.
I often wish some of the GCP Firebase features were available in DynamoDB, like namespaces, control over daily throughput to avoid billing spikes, and transaction support.
Curious: besides being a truly serverless and scalable database, why else would one choose to model relational data in DynamoDB? With the 'single table design' scheme the author talks about, aren't you in a world of hurt if you need new access patterns? Which is highly probable for most systems.
I think there are some features of DynamoDB that are miles ahead of other databases:
- Billing model. Pay for reads and writes directly rather than trying to guess how your queries turn into CPU & RAM. Also able to scale reads and writes up and down independently, or use pay-per-use pricing to avoid capacity planning.
- Permissions model: Integrates tightly with AWS IAM so works well with AWS compute (EC2, ECS/EKS, Lambda) with IAM roles. Don't need to think about credential management and rotation.
- Queries will perform the same as you scale. It's going to work the exact same in testing and staging as it is in prod. You don't need to rewrite when you get four times as many users.
A lot of folks are worried about migrations, but they're not as bad as you think. I've got a whole chapter on how to handle migrations. Plus, one of the examples imagines that we're re-visiting a previous example a year later and want to add new objects and change some access patterns. I show how it all works, and they're really not that scary.
Already implied that in "truly serverless and scalable database". My point is that it would be very dumb to not use something like postgres for relational data unless you really need those dynamodb features.
DynamoDB seems to be so low level that it takes a lot of design and programming effort to get right. Are there any higher level solutions that build on DynamoDB that take care of these things automatically? For example denormalization sounds pretty error prone if you implement it by hand.
It's different than a relational database in that you need to model your data to your patterns, rather than model your data and then handle your patterns.
Once you learn the principles, it really is like clockwork. It changes your process, but you implement the same process every time.
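A tiny sketch of what "model to your patterns" looks like in practice, with invented table and key names: one table, composite keys shaped around the query you know you'll run.

```python
# Sketch: single-table layout where the key schema encodes the access pattern
# "get a customer and all of their orders in one query". Names are invented.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("AppTable")

# Items share a partition key so one Query returns the whole aggregate.
table.put_item(Item={"PK": "CUSTOMER#alice", "SK": "PROFILE",
                     "email": "alice@example.com"})
table.put_item(Item={"PK": "CUSTOMER#alice", "SK": "ORDER#2020-04-01#1234",
                     "total": 42})

# The access pattern drives the key design, not the other way around.
resp = table.query(KeyConditionExpression=Key("PK").eq("CUSTOMER#alice"))
profile_and_orders = resp["Items"]
```

If a new access pattern shows up that this key design can't serve, that's when you reach for a GSI or a migration, which is the trade-off discussed elsewhere in the thread.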
Honestly, I think part of the problem is that there's a lot of bad NoSQL content out there. A little standardization of process in this space will go a long way, IMO :)
"Normalization was built for a world with very different assumptions. In the data centers of the 1980s, storage was at a premium and compute was relatively cheap."
But forget to do normalisation and you will be paying five figures a month for your AWS RDS server.
"Storage is cheap as can be, while compute is at a premium."
This person fundamentally does not understand databases. Compute has almost nothing to do with the data layer - or at least, if your DB is maxing out on CPU, then something is wrong, like a missing index. And for storage, it's not like you are just keeping old movies on your old hard disk - you are actively accessing that data.
It would be more correct to say: disk storage is cheap, but SDRAM cache is 1000x more expensive.
The main issue with databases is IO, and the more data you have to read, process, and keep in cache, the slower your database becomes. Relational or non-relational, it still follows these rules of physics.
> This person fundamentally does not understand databases.
Oh boy I do love hackernews :).
It sounds like you’ve spent a lot of your career in a SQL world. Have you worked a lot with DDB/MongoDB/Cassandra? If not then give it a whirl with more than a toy application and share your thoughts. Already done that? Try the brand new “constructive criticism” framework.
Instead of “this person fundamentally does not understand databases” based on 13 words in a 1200+ word article, consider: “I disagree with this statement and here’s why”.
You get all of the karma with none of the ad hominem! Win win!
> To handle [compound primary keys] in DynamoDB, you would need to create two items in a transaction where each operation asserts that there is not an existing item with the same primary key.
There are other approaches. In cases where I've needed compound keys, I've had success using version-5 UUIDs as the primary key, constructed from a concatenation of the compound key fields. The advantage is that Dynamo's default optimistic locking works as expected with no transaction needed. A potential disadvantage is that if you frequently need to look records up by just one component, you'd need a secondary index instead of the primary key doing double duty.
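Roughly what I mean, with made-up field names:

```python
# Sketch: derive a deterministic primary key from the compound key fields,
# so the same (tenant_id, order_id) pair always maps to the same item.
import uuid

def compound_key(tenant_id: str, order_id: str) -> str:
    # NAMESPACE_URL is arbitrary here; any fixed namespace UUID works,
    # as long as you never change it.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{tenant_id}|{order_id}"))

print(compound_key("tenant-42", "order-9001"))
# Same inputs -> same UUID, so a plain conditional put on the primary key
# is enough to enforce uniqueness of the compound key.
```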
Thanks for this Alex! Will definitely check this out. DynamoDBGuide.com was a huge help to me when I was learning serverless last year to build wanderium.com. There's definitely a learning curve for DynamoDB but the performance is really good.
Do you talk about the best way to do aggregations in your book? That's one of the more annoying downsides of DynamoDB that I've kind of had to hack my way around. (I combine DynamoDB streams with a Lambda function to increment/decrement a count)
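Roughly, my stream handler looks like this (table, key, and attribute names changed):

```python
# Sketch of the stream-triggered Lambda that keeps a count in sync.
# Table name, keys and attribute names are placeholders.
import boto3

table = boto3.resource("dynamodb").Table("AppTable")

def handler(event, context):
    delta = 0
    for record in event["Records"]:
        if record["eventName"] == "INSERT":
            delta += 1
        elif record["eventName"] == "REMOVE":
            delta -= 1
    if delta:
        table.update_item(
            Key={"PK": "STATS", "SK": "ITEM_COUNT"},
            UpdateExpression="ADD item_count :d",
            ExpressionAttributeValues={":d": delta},
        )
```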
Thank you! Glad you found DynamoDBGuide.com helpful :)
Yep, I do talk about aggregations in the book. One strategy that I've discussed is available in a blog post here[0] and involves using DynamoDB Transactions to handle aggregates.
If you're looking for large-scale aggregates for analytics (e.g. "What are my top-selling items last month?"), I have an Analytics supplement in the Plus package that includes notes on different patterns for analytics w/ DynamoDB.
Once you get comfortable modeling your data in Dynamo, it becomes hard to justify using RDBMS with all of the overhead that goes along with it. Dynamo is not for every use case, of course, but as long as you design the table generically enough to handle adjacency lists and you don't mind making more than 1 query sometimes, it works really well.
I disagree. I spent a few months with it and it was a bit of a nightmare. It fits certain use cases, but AWS and others push it as a good fit for more than it should be.
Have geo data? Need a count? Need to look something up for a team member and you don't have time to set up BI tools? Want to query something differently because of design changes? For all of those, the answer is pretty much no, or at best "you can, but it'll take some extra thing."
We ended up picking up an Aurora Serverless and it is working just fine. We can quickly make changes as business cases change, our entire team can query it for now (until we have fancier tools on the biz end), and we are able to move much more quickly.
Yep, if you need aggregated data at all for operational purposes, Dynamo isn't a good choice. A lot of the hardship people have with Dynamo is not really understanding how to model their data correctly. If you don't design it correctly from the start, you are f*cked, and I sympathize with that.
There is an unhealthy attachment to relational data stores today. It’s a tool, not an architecture or solution. We shouldn’t start with them and often should exclude them from our operational implementations. Reporting and analysis systems benefit tremendously from relational data stores. But we learned years ago that separate operational and reporting systems provided optimal performance.
I suggest those of you still unfamiliar with NoSQL operational data storage patterns trust companies like Trek10 and Accenture (where I saw great success).
There’s an aspect to software development relating to speed/agility. NoSQL data stores offer a schemaless approach That reduces a great deal of unnecessary friction. The amount of time we’ve spent managing relational schemas is ludicrously expensive. There are still great usage patterns for relational, but operationally is not one of them. I’d argue it’s an anti-pattern.
I would not call NoSQL data stores more agile. While it is true that you can add new columns at any point in time, the same cannot be said about supporting new access patterns you didn't design for.
People often confuse startup speed with agility. The talk "Agility ≠ Speed" does a good job of critiquing that idea. Agility is about the ability to change direction gracefully. NoSQL definitely gets you started quickly—as OP noted, you avoid much of the overhead of data modeling and making changes to schemas—but at a certain point you can get into real trouble, and that will slam the brakes on your velocity. I've experienced the same thing with weakly typed languages. These sorts of tools can be very useful in the right use cases, but trading away some of the guardrails can extract a large cost if you aren't thoughtful and careful.
If I hadn’t been on large corporate projects that were wildly successful in extreme shortened periods of design and development, I wouldn’t make my arguments.
Adapting to the needs of a given domain may require a relational data store, but it’s my experience those are rare occasions.