Versioning data in Postgres? Testing a Git like approach (specfy.io)
178 points by h1fra on Oct 20, 2023 | 89 comments



I just taught one of my developers how to use a basic Oracle Flashback query to pull a historical version of a table.

This is my cheat sheet:

  ALTER SESSION SET NLS_TIMESTAMP_FORMAT = 'YYYY/MM/DD HH24:MI:SS';

  select count(*) from dual
  AS OF TIMESTAMP TO_TIMESTAMP('2010/01/01 00:00:00');
I was aware that Postgres had "time travel," though I don't know whether it used this syntax, and I understand the feature has since been removed. I don't know if "AS OF" is still supported. In Oracle, it has been quite helpful.

Edit: Flashback came to Oracle in 9i, when "rollback segments" were replaced by the "undo tablespace" that could present any table as it appeared in the past, limited by the amount of "undo" available. The "FLASHBACK ANY TABLE" privilege is required to see this history on tables that you do not own, and that also conveys the privilege to fully revert tables to their previous contents. You must be a DBA to exercise flashback on tables owned by SYS.
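
For completeness, fully reverting a table (rather than just querying it) looks roughly like this; a sketch, assuming a hypothetical table my_table and enough undo retention to cover the target time:

  ALTER TABLE my_table ENABLE ROW MOVEMENT;

  FLASHBACK TABLE my_table
    TO TIMESTAMP TO_TIMESTAMP('2010/01/01 00:00:00', 'YYYY/MM/DD HH24:MI:SS');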


Expanding on the above,

> The original version of PostgreSQL from the 1980s did not remove dead tuples. The idea was that keeping all the older versions allowed applications to execute “time-travel” queries to examine the database at a particular point in time [via https://ottertune.com/blog/the-part-of-postgresql-we-hate-th...]

Postgres deprecated support for time travel in ~1997, and the more general notion of "system time" wasn't standardised until SQL:2011. This blog post is a good overview of the SQL:2011 spec and how databases have adopted (bi-)temporal versioning: https://illuminatedcomputing.com/posts/2019/08/sql2011-surve...

Temporal versioning is less sophisticated than git-like versioning (no branching etc.) but is usually more aligned with common end-user requirements. Kent Beck suggests this framing of "eventual business consistency": https://tidyfirst.substack.com/p/eventual-business-consisten...
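
As a rough illustration of what SQL:2011 system versioning looks like (MariaDB-flavoured syntax; the table and column names are made up):

  CREATE TABLE account (
    id        INT PRIMARY KEY,
    balance   DECIMAL(12,2),
    sys_start TIMESTAMP(6) GENERATED ALWAYS AS ROW START,
    sys_end   TIMESTAMP(6) GENERATED ALWAYS AS ROW END,
    PERIOD FOR SYSTEM_TIME (sys_start, sys_end)
  ) WITH SYSTEM VERSIONING;

  -- the table as it was at a past instant
  SELECT * FROM account
  FOR SYSTEM_TIME AS OF TIMESTAMP '2023-01-01 00:00:00';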


> (bi-)temporal versioning

I once spent some time trying to find a way to do bi-temporal versioning in Postgres.

The only thing I found was a half-dead abandonware external project and an associated presentation PDF from some conference the author once spoke at.

I was unaware that they previously had some form of temporal queries and deprecated it. That is a great shame.


A shame indeed; they were only a few decades too early! The support for "Time Varying Data" is discussed briefly in Stonebraker's 1986 paper "The Design of Postgres":

> POSTQUEL allows users to save and query historical data and versions. By default, data in a relation is never deleted or updated. Conventional retrievals always access the current tuples in the relation. Historical data can be accessed by indicating the desired time when defining a tuple variable.

> [...] Finally, POSTGRES provides support for versions. A version can be created from a relation or a snapshot. Updates to a version do not modify the underlying relation and updates to the underlying relation will be visible through the version unless the value has been modified in the version.

https://dsf.berkeley.edu/papers/ERL-M85-95.pdf


Those are cool features, so why were they discontinued?


This has been a feature of SQL Server since, I believe, SQL Server 2008: temporal tables. They are used (incorrectly) at my place of work quite a bit. If auditing data (who changed what and when) is important, there are two choices: 1) system versioning or 2) adding a "version" column or something similar that increments with each change. There are tradeoffs for both.

"Temporal tables (also known as system-versioned temporal tables) are a database feature that brings built-in support for providing information about data stored in the table at any point in time, rather than only the data that is correct at the current moment in time."

https://learn.microsoft.com/en-us/sql/relational-databases/t...
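
Roughly what option 1 looks like in T-SQL, going by the linked docs (table and column names here are illustrative):

  CREATE TABLE dbo.Employee (
    EmployeeID INT PRIMARY KEY,
    Salary     DECIMAL(10,2),
    ValidFrom  DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo    DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
  )
  WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.EmployeeHistory));

  -- "who changed what and when": the row as it looked at a given instant
  SELECT * FROM dbo.Employee
  FOR SYSTEM_TIME AS OF '2023-06-01T00:00:00';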


What got introduced in 2008 were the audit features (C2 audit, or something like that) and change tracking. From what I remember, temporal tables were a 2016 feature, which your link more or less confirms.


Oracle, with all the baggage attached to the namesake company, is a remarkable database.

I was working in a dual MySQL/Oracle environment exactly 20 years ago, and MySQL - still at version 3 or 4 - was a toy DB by comparison. I was writing Oracle queries that could make your head spin, with optimizer hints, for example:

https://renenyffenegger.ch/notes/development/databases/Oracl...
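
(To give a flavour of what optimizer hints look like, a small made-up example: the comment forces an index and a join strategy the optimizer might not have picked on its own.)

  SELECT /*+ LEADING(o) USE_NL(c) INDEX(o orders_cust_ix) */
         c.name, o.total
  FROM   orders o
  JOIN   customers c ON c.id = o.customer_id
  WHERE  o.created_at > DATE '2004-01-01';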


> I was writing Oracle queries that could make your head spin

Absolutely, and some of the things Tom Kyte (Oracle's resident DB performance guru) could do with Oracle were spectacular.

As you say, it's a shame about the Oracle baggage, and in particular the steep price tag. Otherwise I'm certain it would be far more widely deployed as a database.


It's either free (for XE), $17,500 per cpu (for SE2), or $47,500 per cpu (for Enterprise). x86-64 gets a 2-for-1 core discount.

You can also license it by user.

https://www.oracle.com/assets/technology-price-list-070617.p...


Using hints is typically a sign of failure. The DB optimiser should typically not need them.


Totally untrue...but adding hints makes your queries dependent on current performance.

Assuming you're a good DBA, you can be better than the optimizer because you understand the whole system. Even databases are general-purpose machines. That's the whole point of adding indexes, etc.

I mean, has there ever been a database that auto-adds indexes based on profiler/optimizer feedback? (To answer my own question, apparently Azure SQL does).


> (To answer my own question, apparently Azure SQL does)

Which is fun when you add a future schema migration that drops a column, test it locally and in some test cases, and then see it bomb in production because the column has an index you weren't aware of. (That's an easy thing to solve, but it's definitely one of those cases where "the database is messing with its schema on its own" does funny things.)


It's not automatically creating them AFAIK, but alloydb has an index suggestion tool https://cloud.google.com/alloydb/docs/use-index-advisor

Generally this seems like a nice balance to me: watch real queries, look for expensive plans, and sniff out likely missing indexes based on the data distribution, but give the DBA the final say on whether it's a good idea.


> adding hints makes your queries dependent on current performance

I don't understand this; can you explain?

> Assuming you're a good DBA, you can be better than the optimizer because you understand the whole system

That's inaccurate. As a DBA I can understand the system better at a higher level, but the database has statistics which typically give it a better understanding of the data distribution at a lower level. Indeed, I could get that information and feed it into the query plan via hints, but that's going to take an enormous amount of my time, and I would have to do it every time the query is run, whereas the database can keep an eye on the statistics as they vary and rebuild the query plan over time.

Equally, if an index is added you can expect the database to start using it immediately without adding hints. Etc

I am a fairly(?) skilled DBA who has a reasonable idea of what goes on underneath the hood, I do have some idea what I'm talking about.


Around the same time, I wrote an extremely convoluted MySQL query that changed its plan based on the column order in the SELECT clause.

I wouldn't post snooty and dismissive comments about this - assume that you are in a company of people who have been to more rodeos than you have.


I've >25 years of DB rodeoing. Perhaps I know something about this by now. See my other answers on this thread.

> that changed its plan based on the column order in the SELECT clause.

Hmm? Could you elaborate: do you mean the order of the columns in the SELECT list, or using ORDER BY? I assume you mean the latter, but it sounded like the former.


The Postgres team is generally opposed to hints, and their reasoning sort of makes sense.

When do you need a hint? Why do you need a hint? If you can answer those questions, wouldn't it be better to codify those answers into the query planner than to apply a one-off hint?

Which nicely sidesteps how hard it would actually be to, first, understand the query planner to the degree needed to change it and, second, the amount of effort it would take to transform a one-off hint into a general-purpose optimization.


In a perfect world, I wouldn't disagree; but we don't live in a perfect world where everything always goes your way.

Sometimes the planner needs a little help, and that's OK.


I did say 'typically'! Yes, agreed, but it should rarely need help. Also hints are oh so often jammed in by newbies. I strongly get the impression use of hints is inversely correlated with experience/knowledge.


I'm a greybeard you cheeky scamp ;)

I very rarely used hints in my more database'y days - but every so often, they were needed, and often to tell the database something that seemed screamingly obvious to me.

I'm not sure where you got the notion that hints were something noobs would be working with? At least, that certainly wasn't my experience.


We are in strong agreement, and the phrasing and measure of your post made it pretty obvious you are knowledgeable (another inverse correlation: absolute opinions launched with violence also seemed to be the realm of newbies; and yours was eminently not).

> I very rarely used hints in my more database'y days

evidence thereof!

> I'm not sure where you got the notion that hints were something noobs would be working with? At least, that certainly wasn't my experience.

It's been very, very much my experience. I worked in one place where nolock hints were applied everywhere, without them realising that READ UNCOMMITTED trans iso even existed, so had to show them that, and show them it could be specified at the application layer (not in the SQL). In another recent case I came across their use and I'm damn sure the person who used it didn't actually know what it did. Yeah, I've seen far too much hinting in my career.


Ok downvoter, why do you think a DB that needs hints for good performance is better than one that usually doesn't? That is, why do you think it better a person does a computer's job instead of the computer?


You've never run into a query planner that suddenly changes its mind once your tables grow.

Try using row level security in postgres - it's awfully slow and you need to be a master architect and sql developer to make it run fast because it's too dumb on its own. With query planner hints I could at least guide it. Shame pg doesn't believe in that.
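
(For anyone unfamiliar, this is roughly the kind of RLS setup being described; a bare sketch with made-up names. The policy predicate gets bolted onto every query against the table, which is what the planner then has to cope with.)

  ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

  CREATE POLICY tenant_isolation ON orders
    USING (tenant_id = current_setting('app.tenant_id')::int);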


Poster said oracle not PG


Total db noob here: is Oracle's query planner better? What's the difference between the two?


This is not about Oracle versus postgres, it's about whether or not to use hints. My position is that needing hints indicates a failure of the database optimiser. The optimiser theoretically should do a perfect job. In practice it doesn't, and never can so sometimes hints are necessary, but IME and speaking as an MSSQL guy, I very rarely need them.


Oracle has query hints; pg does not. Query planner hints are useful. That's my point.


No one is saying that. Your original comment can be read as saying that using hints is a sign of failure by the person writing the query, which is probably why the downvotes. Maybe you meant a sign of failure for the optimizer, which is also debatable given that optimizers really cannot be perfect. An optimizer will quickly run into NP-hard problems like join ordering - which is just one small reason among many that it won't always find the fastest query plan.


You're right, I meant a failure of the DB/optimiser not the person. Thanks.

No, they're not perfect, but I rarely need them with MSSQL.

> An optimizer will quickly run into NP-hard problems like join ordering

As a start, dynamic planning based on cardinality estimates derived from stats. And then some. It explodes horribly, yes.


https://neon.tech/docs/introduction/branching is exactly what the author is looking for and is the best way to do highly efficient point-in-time versioning of Postgres data I have seen so far.

I deployed NeonDB at a large enterprise client in Q2 of this year and, yes, the initial data migration is the hard part. We added 2x100GbE network cards directly on the VMware hosts where the data lived, so the migration to the Kubernetes cluster running NeonDB took hours instead of weeks.


Neon CEO here. Did you deploy Neon on prem?


Hi Neon CEO. I’m curious if users have deployed Neon as a (selective) audit-logging tool for configuration which might need to be edited or rolled back from time to time. Think configuration-driven/event-driven systems where an update to a single database-stored config can cause other downstream applications to start behaving mysteriously. Going to scour the website some more!


We have customers using Neon as a time machine b/c you can create a branch at a point in time. Is this what you need?


Essentially, yes. Between a few times per day and a few times per month I would like to create branches to store the “backup” copies of table-stored confs to compare against logs/events and see if configuration changes caused certain conditions to arise. In the case of a configuration error, rolling back to a prior state would be awesome.


Thanks for sharing, adding that to the article.


We are using a Datomic derivative at my work because Postgres doesn't have this feature. I don't think anyone has actually used Datomic's capabilities for this purpose.


Which derivative?


If I had to guess, they are probably using XTDB, which is bitemporal, i.e. every entity has an associated "transaction time" and "validity time". You can use the validity time to deal with situations like "Company reported X assets and Y liabilities in their 2020 balance sheet, and in 2022 it was discovered there was some 2020 assets unaccounted for", and in addition to this, answer questions like "How much assets did Company believe it had in 2020 given the facts we know in 2020? How about given the facts we know in 2022?"

I am not sure how useful this is for versioning data though. It seems like an orthogonal problem to me.


Honest question: Why do you suppose no one at work used this feature, even though the stack was picked for it explicitly?


Project manager needed to check a box on some form for the skip-boss to be happy.


> That also means at any given time, git has a copy of all your files since the beginning of the project.

This is true in a sense, but these are commonly stored compressed and of course you can expect to have a pretty good compression ratio when changed files are mostly similar. And if they're not, you really have to store a copy anyway.


On the topic of PostgreSQL and git. As crazy as this sounds, I once entertained the idea of writing a VCS that used PostgreSQL's large object facility for storing the data. I always have PostgreSQL installed on my personal devices so communication would be through Unix Domain Sockets anyway (or through a gigabit LAN in worst case). And the transactions and SQL interface (for metadata) were just that tempting.

But I realized that pg_largeobject (1) chunks data into pieces of roughly 2 kB, (2) stores each chunk in its own row, and (3) each row uses the "oid" type as its identifier, which is only 32 bits long. It's probably large enough for anything I would ever need, but for some reason I don't feel comfortable with only 32 bits as a primary key.
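
For anyone curious, the facility is easy to poke at; a minimal sketch (the returned oid will differ, and reading pg_largeobject directly needs the right privileges):

  -- store a blob and read it back via the large object facility
  SELECT lo_from_bytea(0, 'hello world'::bytea);  -- returns a fresh oid, e.g. 16403
  SELECT lo_get(16403);

  -- each large object is split into ~2 kB rows keyed by (loid, pageno)
  SELECT count(*) FROM pg_largeobject WHERE loid = 16403;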


You probably already know of fossil, which is based on SQLite, but just in case, I’ll share a link with you:

https://sqlite.org/whynotgit.html


Why not git? Specifically as relates to the current discussion?


Assuming that the full 32 bits are used, that means there can be up to 4,294,967,295 chunks.

4,294,967,295 * 2 KB = 8,589,934,590 KB

So, around 8.6 TB. Unless your hypothetical VCS is for a large library of videos, you should be fine.


Not entirely crazy. The creator of SQLite has done just that: https://fossil-scm.org/home/doc/trunk/www/index.wiki


4 years ago I wrote an internal tool to serialize records of our hierarchical database to a bunch of hierarchical json files, which are then managed in a normal git repo.

This was/is a very specific solution to our very specific set of problems. So not applicable to the general problem of "versioning a database".

It is still in use and still under active development – I'm actually fixing a few bugs with it just now.

I wrote much more about it here:

https://news.ycombinator.com/item?id=25005993


Neon does something like this I believe. They allow you to "branch" your databases: https://neon.tech


That's interesting. It seems to be made for CI and testing, though, not for use inside a single production database.


Neon is designed to run in production. It runs on top of our storage, which is designed for webscale and production usage.


I meant the branching itself. Is it possible to build features on top of this branching strategy? Is it possible to list versions and get a specific one using just SQL?


> Naively when you read a git diff or a pull request, you might think like me, that git is storing a diff between A and B and just displays it when needed. But in reality it stores the full file in the object storage.

I thought logically git stored every file, but implementation-wise, it does git-object compaction. So in reality, it's not actually storing every file on disk, no?


The objects directory stores every file (and tree, commit, etc.). Pack files are an optimization that stores deltas.


Which is such a genius implementation. You get the straightforward implementation of using plain files (e.g. not deltas), while also being able to get the storage boost from storing deltas.


It creates lots of complexity and performance issues. I agree that using the PostgreSQL extension for temporal tables is a better idea.

I wish PostgreSQL offered something like Oracle's Flashback feature, plus the ability to control how much history you want to keep for specific tables, since typically only part of your data needs full auditing.


I am curious what success stories people have with versioned databases, and whether those success stories could be replicated by a combination of bitemporal schema design and infrequent snapshots/backups. That is, to avoid the complexity and performance issues of adding a temporal dimension to 100% of your data instead of the X% that actually needs it.


Just to pile on, I think that adopting bitemporal schemas is the solution.

There are various schemes you can employ to keep "archived" data within postgres (e.g. offloading it to another instance accessible behind a FDW).

I've thought about this problem a lot, and I think every alternative solution I've seen is just an approximation of bitemporal schemas with some limitations that are typically not worth the cost:benefit trade-off.

The main difficulty is there isn't a standardized framework/FOSS solution for bitemporal data. I feel like it'll have to get into the SQL standard before we see widespread adoption.
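
In the absence of such a framework, a minimal hand-rolled bitemporal sketch in Postgres might look like this (made-up table; btree_gist is needed for the exclusion constraint):

  CREATE EXTENSION IF NOT EXISTS btree_gist;

  CREATE TABLE balance_history (
    account_id  INT NOT NULL,
    amount      NUMERIC(12,2) NOT NULL,
    valid_time  tstzrange NOT NULL,                              -- when it was true in the world
    system_time tstzrange NOT NULL DEFAULT tstzrange(now(), NULL),  -- when the database knew it
    EXCLUDE USING gist (account_id WITH =, valid_time WITH &&, system_time WITH &&)
  );

  -- "what did we believe on 2022-01-01 the balance was in mid-2020?"
  SELECT * FROM balance_history
  WHERE account_id = 1
    AND valid_time  @> timestamptz '2020-06-01'
    AND system_time @> timestamptz '2022-01-01';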


I would prefer a generalized version of the "Oracle Flashback" feature: the ability to add extra where conditions to a table, e.g.

  create table test (id int primary key, attr int, deleted_at timestamp);

  alter table test add default_filter is_active where deleted_at IS NULL;

Then all selects like "select * from test where attr = 42" would automatically get the above criterion added and become "select * from test where attr = 42 AND deleted_at is NULL".

When you want old rows, you can override the default condition using something like:

  select * from test where attr = 42
  filter is_active (deleted_at IS NULL or deleted_at > '2010-01-01'::datetime)

You would then be able to have multiple criteria instead of just when the row was deleted.
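
One way to approximate that today (a sketch using the same test table): hide the default filter behind a view, and hit the base table directly when you explicitly want history.

  CREATE VIEW test_active AS
    SELECT * FROM test WHERE deleted_at IS NULL;

  -- default path: current rows only
  SELECT * FROM test_active WHERE attr = 42;

  -- explicit opt-in to history: query the base table
  SELECT * FROM test
  WHERE attr = 42 AND (deleted_at IS NULL OR deleted_at > timestamp '2010-01-01');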


In case it's useful to anyone, I've been running the cheapest possible implementation of Git-based revision history for my blog's PostgreSQL database for a few years now.

I run a GitHub Actions workflow every two hours which grabs the latest snapshot of the database, writes the key tables out as newline-delimited JSON and commits them to a Git repository.

https://github.com/simonw/simonwillisonblog-backup

This gives me a full revision history (1,500+ commits at this point) for all of my content and I didn't have to do anything extra in my PostgreSQL or Django app to get it.

If you need version tracking for audit purposes, or the ability to manually revert a mistake, and you're dealing with tens of thousands of rows, I think this is actually a pretty solid, simple way to get it.
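
The dump step itself can be as small as a psql one-liner per table; a sketch with a hypothetical table name (not necessarily how the linked repo does it):

  \copy (SELECT row_to_json(e) FROM blog_entry e ORDER BY id) TO 'blog_entry.json'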


That's not a full revision history, just snapshots (basically a backup solution). Multiple changes in a 2-hour window will get conflated into a single one.


pgreplay parses not the WAL (Write-Ahead Log) but the server log file; it "extracts the SQL statements and executes them in the same order and relative time against a PostgreSQL database cluster": https://github.com/laurenz/pgreplay

"A PostgreSQL Docker container that automatically upgrades your database" (2023) https://news.ycombinator.com/item?id=36748041 :

pgkit wraps Postgres PITR backup and recovery: https://github.com/SadeghHayeri/pgkit#pitr :

  $ sudo pgkit pitr backup <name> <delay>
  
  $ sudo pgkit pitr recover <name> <time>
  $ sudo pgkit pitr recover <name> latest


Are you overwriting the previous version of the JSON when committing then merging? Or are you just archiving a new copy of the JSON alongside all the old ones?



Aquameta has an interesting extension that does something like this in a different way: https://github.com/aquametalabs/aquameta/tree/master/extensi....


There's an interesting approach to it that works with Rails and PostgreSQL using triggers.

https://github.com/palkan/logidze


> If you have rewrote a file 999 times, there is 999 times copy of this file in the object storage. That's why it can be slow to clone a git repository.

Sort of. While Git’s canonical storage is indeed not diffs, it uses delta compression to, well, store diffs as a storage and transit optimization https://git-scm.com/book/en/v2/Git-Internals-Packfiles


Rather than using generic blobs, you can easily integrate versioning directly into your entity tables with only two extra fields:

https://higherlogics.blogspot.com/2015/10/versioning-domain-...

With this schema, you simply add an extra clause to each of your queries and they can simultaneously return the latest version or any past version.
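
In the same spirit (a generic sketch, not necessarily the exact schema from the linked post), a two-extra-column layout could look like:

  CREATE TABLE product (
    id         INT NOT NULL,
    name       TEXT NOT NULL,
    price      NUMERIC(10,2) NOT NULL,
    valid_from TIMESTAMPTZ NOT NULL DEFAULT now(),
    valid_to   TIMESTAMPTZ NOT NULL DEFAULT 'infinity',
    PRIMARY KEY (id, valid_from)
  );

  -- latest version
  SELECT * FROM product WHERE id = 1 AND valid_to = 'infinity';

  -- any past version
  SELECT * FROM product
  WHERE id = 1 AND valid_from <= timestamptz '2023-01-01'
              AND valid_to   >  timestamptz '2023-01-01';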


Postgres already replicates the whole row on a change, and it's somewhat accessible too:

https://stackoverflow.com/questions/7118432/how-to-reveal-ol...

This work is great, but it feels wasteful given the features Postgres already has; still, rebuilding it in user space is probably the only reasonable solution.
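
(A sketch of what that low-level visibility looks like: the MVCC bookkeeping is exposed through system columns, though dead row versions only survive until VACUUM reclaims them, so it is not a durable history.)

  -- each row carries the transaction ids that created (xmin) and deleted/updated (xmax) it
  SELECT xmin, xmax, ctid, * FROM test;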


I've built this a few times as well.

They all revolve around known patterns, https://en.wikipedia.org/wiki/Slowly_changing_dimension

I highly doubt there's a single way to do this that solves every scenario.


It is possible to produce a long transaction versioned database, but the tricky bit is normally merging the structured data. For most applications this either needs to be entirely automated (so no user intervention is required) or the conflict resolution needs to surface in a lot of UI.


You mentioned you hadn't looked at it in the article, but FWIW this is exactly the approach Dolt takes.



This looks really promising.

Here's an alternative designed to work with an ORM, if you like that: https://www.youtube.com/watch?v=JsO551E7ySY&t=1094s.


Isn't this basically how etcd (the config database used in Kubernetes) works?



I didn’t see it mentioned in the article, but that’s exactly what https://postgres.ai/ is doing already, unless I’m missing something.


I hadn’t read about NeonDB before, but a brief search of their docs doesn’t mention ZFS. Postgres.ai uses ZFS to implement the snapshots, branching, etc.

I’d be very interested in a comparison of the two.


No ZFS. ZFS is not "cloud native" - we build our storage for the cloud and for Postgres from the ground up.


You could rename the blobs/branches tables to post-bodies/drafts; then all your tables are domain specific.


Better idea: just install the Postgres extension that enables ISO SQL temporal tables, like what MSSQL and MariaDB have had for 8 years now: https://pgxn.org/dist/temporal_tables/
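
Usage is roughly the following, going from memory of the extension's docs, so treat it as a sketch:

  CREATE TABLE subscriptions (
    name       text NOT NULL,
    state      text NOT NULL,
    sys_period tstzrange NOT NULL DEFAULT tstzrange(current_timestamp, null)
  );

  CREATE TABLE subscriptions_history (LIKE subscriptions);

  CREATE TRIGGER versioning_trigger
    BEFORE INSERT OR UPDATE OR DELETE ON subscriptions
    FOR EACH ROW EXECUTE PROCEDURE
      versioning('sys_period', 'subscriptions_history', true);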


Does this robustly support altering tables? What happens if a column is deleted - does it delete historical data, or mark the history column as nullable? As the OP article notes/hints, there are all sorts of performance and regulatory considerations around this.

Separately though, it's great to see that there's still people trying to solve this problem at the extension level for Postgres. I was sad to discover that Postgres for a very long time had an official extension for time travel - https://www.postgresql.org/docs/11/contrib-spi.html#id-1.11.... - only to realize that it had been removed in https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit... - with a note that "it's easy to implement this separately" but no real hints that anyone was taking up the mantle.

As a simpler alternative, if you're only modifying data via Django and can tolerate some unreliability, https://django-simple-history.readthedocs.io/en/latest/ can get you much of the way there without a need for extensions. We've built on top of it with some domain-specific modifications to visualize and filter diffs, giving us at least a best-efforts audit-esque log system.


I wrote a guide and simple experimental setup for this extension: https://tembo.io/blog/table-version-history


Does MySQL have temporal? I thought only MariaDB implemented this.

https://mariadb.com/kb/en/temporal-tables/

A quick Google turns up nothing for MySQL temporal.


My mistake, sorry; I meant MariaDB. I've edited my post.


Author here: yes, I agree, and I mention it at the end. However, it's not possible to install the extension everywhere; for example, it's unfortunately not available in GCP Cloud SQL.


It was reimplemented in pure SQL for exactly this purpose here: https://github.com/nearform/temporal_tables


I feel your pain, except in my case it's the crippling inability to use SQLCLR in Azure SQL, so we can't even do things like use a Regex to clean-up data in-situ. Yeargh.


Great tip, thanks



