Beware of crossing context boundaries when applying 3NF to commerce records. E.g...

gen220 · on Feb 14, 2020

Another commenter mentioned it already, but it’s worth repeating. You can solve this problem by introducing temporal concepts.

We rely strongly on this at my place of work, it really works! Essentially, you just need two tstzrange columns, representing: (1) when the row was a “valid” representation of the key, (2) when the row could have been used to conduct other “transactions”.

With a valid period and a transaction period, you have a history of the values of an object, and the ability to make non-destructive updates to that object.

It’s an essential component of any audit-worthy system, because it empowers you to trivially answer the question “what did our database think John’s 2018/01/01 address was on the day that we mailed him a check on 2017/12/15”?

andrewflnr · on Feb 14, 2020

I don't understand the distinction between "valid" and "could have been used for other transactions". Can you elaborate and/or give an example? Is one of them always a subset of the other?

gen220 · on Feb 14, 2020

Happy to! It's not impossible for one instance to be a subset of the other. They can also be mutually exclusive periods of time. Let's do an example to explore

Let's say that the present concept of John's address is "123 Apple St"; this was inserted into the database at 1/1/2020.

We don't have credible evidence for what John's address was prior to 1/1/2020.

Therefore, the row looks like this:

> (address:"123 Apple St", valid_period:[1/1/2020,+inf), transaction_period:[1/1/2020,+inf))

This is a case where transaction period and valid period are equal. If John sent us this information by mail, and he signed it 12/25/2019, we have credible evidence that this was John's address, at least effective 12/25/2019.

Therefore, his row would look like this: (address:"123 Apple St", valid_period:[12/25/2019,+inf), transaction_period:[1/1/2020,+inf)).

Now, the transaction period is still 1/1/2020, because this information found it's way into the database on 1/1/2020.

Now, let's continue with the second scenario. Let's say we get a second letter from John, processed on 6/1/2020, saying that he moved to "456 Orange St" on 5/1/2020. There are now two rows in the database, as below:

> (address:"123 Apple St", vp:[12/25/2019,5/1/2020), tp:[1/1/2020,+inf))

> (address:"456 Orange St", vp:[5/1/2020,+inf), tp:[6/1/2020,+inf))

Then, we receive a FINAL letter from John, processed on 7/1/2020, revealing that his previous letter contained a typo! it was "789 Orange St", not "456"! Darn. Our table now contains three rows:

> (address:"123 Apple St", vp:[12/25/2019,5/1/2020), tp:[1/1/2020,+inf))

> (address:"456 Orange St", vp:[5/1/2020,+inf), tp:[6/1/2020,7/1/2020))

> (address:"456 Orange St", vp:[5/1/2020,+inf), tp:[7/1/2020,+inf))

Let me know if you have any questions about this example!

andrewflnr · on Feb 15, 2020

Ok, so the validity period tries to describe the real world, while the tp is more about the state of the database's knowledge, or rather belief. I think I understand how each one starts when it does, but not when you update the endings.

- Does the validity period of the erroneous entry ever get closed?

- Or the transaction period of the one that got superseded?

- Do transaction periods close for reasons other than finding out something was wrong, or can I think of them as the period that the rest of the row (including validity) is/was believed?

Thanks for the detailed explanation. This is really interesting.

gen220 · on Feb 15, 2020

No problem! It’s fun! And a powerful conceptual tool. In my experience, it can be used to solve many thorny problems, and I never learned about it in school.

I had a typo in the last row, it should have been

> (“789 Orange St”, vp:(5/1/2020, +inf), tp:(7/1/2020, +inf))

my apologies!

- the valid period of the middle record is never closed, because it would be a misrepresentation of how the database’s perception of the address at that point in ”transaction time”.

- indeed, the superseded transaction period is closed.

- formally, transaction periods close (and new rows are created) whenever a column’s value changes. Think of it like an “updated at” time stamp, except the meaning is more like “canonical during”. The second half of your question is totally correct, the TP is the period during which the values representing the specified slice of valid time are/were believed.

As the third bullet point implies, one characteristic of this schema is that you have “non-destructive” updates: state is never lost, it’s just put into transaction history. This makes it possible to “roll back” to an earlier, known-good state: simply specify a point in transaction time.

There’s a whole additional rabbit hole to dive down: how to make this stuff fast and intuitive. People have mostly solved the fast part, but we’re still working on the intuition.

andrewflnr · on Feb 15, 2020

I meant to ask about the transaction period of the first row (my bad, "superseded" is ambiguous), but I think I get that one now: the transaction period is still open because we still believe (and probably will forever) that that's the correct validity period for that address. Is that right? And the validity period for the middle one never closes because there's never a time in that row's transaction period where we know the end of the validity period.

The intuition I'm building is that the validity period is metadata and the transaction period is metametadata. I think the similarity between them is a little deceptive, since the validity period is in some sense just another piece of data sort of controlled by (or interpreted in light of) the transaction period, which is itself a way to simulate versioning the whole database; I think of it like the internal representation for a database wrapper that lets you time travel in a database with a history-oblivious schema. In that vision, validity periods are actually part of the history-oblivious schema. You could totally have "transaction periods" for stuff with no concept of validity period, e.g. something timeless where our understanding evolves over time. If you're trying to teach people, I'd suggest introducing the two concepts independently, then showing the interplay with an example like the one you gave here.

I don't know if weird nested contexts can ever be really "intuitive". :D

madno · on Feb 21, 2020

For total ordering are you using an instant (with say seconds granularity) or an additional monotone id to distinguish between changes for the same day?

andrei_says_ · on Feb 14, 2020

I imagine you also save changes to the address as separate versions so you can query them in the future?

gen220 · on Feb 14, 2020

Yes. In practice, the typical "UPDATE" looks like this:

1. create a "hole" in the current transaction space's valid space (through expirations of records in transaction time, valid time).

2. insert the new version at (transaction_period=(now, inf), valid_period=(effective_date, inf)).

It sounds complicated, and it sometimes is, but in practice most of the hard work (specifically around transaction periods) is performed for you by triggers. Usually you only have to think about valid periods, as a developer, which is easier to wrap your mind around.

The unfortunate thing is that there aren't a lot of very straightforward open source implementations of bitemporal triggers. The ones that exist are mostly designed to be theoretically sound and feature-complete, rather than usable.

At my company, we have our own implementation that has slowly grown over the years. It's about 500 lines of SQL triggers, and maybe 2k lines of library code in Python/Go to make the ergonomics a little bit easier on the developer.

snarfy · on Feb 14, 2020

You made a great example, and I think it's the most important thing to remember when designing schema. I've seen arguments for/against 1NF, 3NF, various db features, etc, but really, the schema should model the real world relationships.

If your InvoiceItem has no description, and only has a FK to Product to get the description you're going to have a bad time. Once you build a system around a wrong relationship like that, it's very hard to go back and fix the inevitable issue that updating a product changes old sales invoice records.

KptMarchewa · on Feb 14, 2020

Never done something regarding invoices and related stuff, so just an academic question: Why not just version the product description and FK to appropriate version?

snarfy · on Feb 15, 2020

That's one solution. You could copy the description to the invoice item. That's another. There are a lot more solutions, but the point is none of them will be trivial to implement in a large existing code base. It's really important to get the data model right as much as possible up front. This is a bit counter to all of the agile/scrum rage of iterate and refactor. That works for code, but schemas and data migrations do not lend themselves to constant refactoring.

gen220 · on Feb 14, 2020

versioning solves this problem.

in practice, you have to support retroactive changes, so you have to have "middle" values for versions (i.e. 1.1 between 1 and 2).

People are usually interested in the timestamps associated with those versions (for which period of time is this version valid, when was this version created), so in practice it's easier to just keep track of the timestamps, and leave the versions implicit.

pbowyer · on Feb 14, 2020

> E.g. The description of an item in a sales invoice is the description at the time the contract of sale is made, and must be immortalised as such as a copy of that description.

+1. I maintain a couple of systems that younger-me wrote 15+ years ago. If younger-me had understood this, I could have avoided much re-engineering :)

edejong · on Feb 14, 2020

As such, it helps to also study 4NF and 5/6NF (which are temporal normalisations). This allows us to refer a sale to an earlier 'version' of the product.