I just taught one of my developers how to use a basic Oracle Flashback query to pull a historical version of a table.
This is my cheat sheet:
ALTER SESSION SET NLS_TIMESTAMP_FORMAT = 'YYYY/MM/DD HH24:MI:SS';
select count(*) from dual
AS OF TIMESTAMP TO_TIMESTAMP('2010/01/01 00:00:00');
I was aware that Postgres had "time travel," though I don't know whether it used this syntax, and I understand the feature has since been removed. I don't know if "AS OF" is still supported. In Oracle, it has been quite helpful.
Edit: Flashback came to Oracle in 9i, when "rollback segments" were replaced by the "undo tablespace" that could present any table as it appeared in the past, limited by the amount of "undo" available. The "FLASHBACK ANY TABLE" privilege is required to see this history on tables that you do not own, and that also conveys the privilege to fully revert tables to their previous contents. You must be a DBA to exercise flashback on tables owned by SYS.
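For the revert case mentioned above, the statement is FLASHBACK TABLE rather than a SELECT ... AS OF. A minimal sketch with a made-up table name (row movement has to be enabled first, and enough undo must still be available):
-- Oracle; 'orders' is a hypothetical table, NLS_TIMESTAMP_FORMAT set as in the cheat sheet above
ALTER TABLE orders ENABLE ROW MOVEMENT;
FLASHBACK TABLE orders TO TIMESTAMP TO_TIMESTAMP('2010/01/01 00:00:00');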
> The original version of PostgreSQL from the 1980s did not remove dead tuples. The idea was that keeping all the older versions allowed applications to execute “time-travel” queries to examine the database at a particular point in time [via https://ottertune.com/blog/the-part-of-postgresql-we-hate-th...]
Postgres deprecated support for time-travel in ~1997, and the more general notion of "system time" wasn't standardised until SQL:2011. This blog post is a good overview of the SQL:2011 spec and the adoption of (bi-)temporal versioning in databases: https://illuminatedcomputing.com/posts/2019/08/sql2011-surve...
Temporal versioning is less sophisticated than git-like versioning (no branching etc.) but is usually more aligned with common end-user requirements. Kent Beck suggests this framing of "eventual business consistency": https://tidyfirst.substack.com/p/eventual-business-consisten...
A shame indeed, they were only a few decades too early! The support for "Time Varying Data" is discussed briefly in this 1995 paper by Stonebraker, "The Design of Postgres":
> POSTQUEL allows users to save and query historical data and versions. By default, data in a relation is never deleted or updated. Conventional retrievals always access the current tuples in the relation. Historical data can be accessed by indicating the desired time when defining a tuple variable.
> [...] Finally, POSTGRES provides support for versions. A version can be created from a relation or a snapshot. Updates to a version do not modify the underlying relation and updates to the underlying relation will be visible through the version unless the value has been modified in the version.
Temporal tables have been a feature of SQL Server since, I believe, SQL Server 2008. They are used (incorrectly) at my place of work quite a bit. If auditing data (who changed what and when) is important, there are two choices: 1) system versioning, or 2) add a "version" column (or something similar) that increments with each change. There are tradeoffs for both.
"Temporal tables (also known as system-versioned temporal tables) are a database feature that brings built-in support for providing information about data stored in the table at any point in time, rather than only the data that is correct at the current moment in time."
It was the audit features that got introduced in 2008: C2 audit (or something like that) and change tracking.
From what I remember, temporal tables were a 2016 feature, which is more or less confirmed by your link.
Oracle, with all the baggage attached to the namesake company, is a remarkable database.
I was working in a dual MySQL/Oracle environment exactly 20 years ago, and MySQL, still at version 3 or 4, was a toy DB by comparison. I was writing Oracle queries that could make your head spin, with optimizer hints, for example:
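(Not the original query, just a hypothetical illustration of the style; the table and index names below are made up:)
SELECT /*+ LEADING(c o) USE_NL(o) INDEX(o orders_cust_idx) */
       c.name, o.order_date, o.total
FROM   customers c
JOIN   orders o ON o.customer_id = c.id
WHERE  c.region = 'EMEA';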
> I was writing Oracle queries that could make your head spin
Absolutely, and some of the things Tom Kyte (Oracle's resident DB performance guru) could do with Oracle were spectacular.
As you say, it's a shame about the Oracle baggage, and in particular the steep price tag. Otherwise I'm certain it would be far more widely deployed as a database.
Totally untrue...but adding hints makes your queries dependent on current performance.
Assuming you're a good DBA, you can be better than the optimizer because you understand the whole system. Even databases are general-purpose machines, after all. That's the whole point of adding indexes, etc.
I mean, has there ever been a database that auto-adds indexes based on profiler/optimizer feedback? (To answer my own question, apparently Azure SQL does).
> (To answer my own question, apparently Azure SQL does)
Which is fun when you add a future schema migration that drops a column, test it locally and in some test cases, and then see it bomb in production because the column has an index that you weren't aware of. (That's an easy thing to solve, but it's definitely one of those cases where "the database is messing with its schema on its own" does do funny things.)
Generally this seems like a nice balance to me - watch real queries, look for expensive plans and sniff out likely missing indexes based on the data distribution, but give the DBA final say on whether it's a good idea
> adding hints makes your queries dependent on current performance
I don't understand this, can you explain?
> Assuming you're a good DBA, you can be better than the optimizer because you understand the whole system
That's inaccurate. As a DBA I can understand the system better at a higher level, but the database has statistics, which typically give it a better understanding of the data distribution at a lower level. I could indeed get that information and feed it into the query plan via hints, but that would take an enormous amount of my time, and I would have to redo it every time the query is run, whereas the database can keep an eye on the statistics as they vary and rebuild the plan over time.
Equally, if an index is added, you can expect the database to start using it immediately without adding hints. Etc.
I am a fairly(?) skilled DBA who has a reasonable idea of what goes on under the hood, so I do have some idea what I'm talking about.
The Postgres team is generally opposed to hints, and their reasoning sort of makes sense.
When do you need a hint? Why do you need a hint? If you can answer those questions, wouldn't it be better to codify those answers into the query planner than to apply a one-off hint?
Which nicely sidesteps how hard it actually would be to, first, understand the query planner to the degree needed to change it and, second, the amount of effort it would take to transform a one-off hint into a general-purpose optimization engine.
I did say 'typically'! Yes, agreed, but it should rarely need help. Also hints are oh so often jammed in by newbies. I strongly get the impression use of hints is inversely correlated with experience/knowledge.
I very rarely used hints in my more database'y days - but every so often, they were needed, and often to tell the database something that seemed screamingly obvious to me.
I'm not sure where you got the notion that hints were something noobs would be working with? At least, that certainly wasn't my experience.
We are in strong agreement, and the phrasing and measure of your post made it pretty obvious you are knowledgeable (another inverse correlation: absolute opinions launched with violence also seemed to be the realm of newbies; and yours was eminently not).
> I very rarely used hints in my more database'y days
evidence thereof!
> I'm not sure where you got the notion that hints were something noobs would be working with? At least, that certainly wasn't my experience.
It's been very, very much my experience. I worked in one place where NOLOCK hints were applied everywhere, without them realising that the READ UNCOMMITTED transaction isolation level even existed, so I had to show them that, and show them it could be specified at the application layer (not in the SQL). In another recent case I came across their use and I'm damn sure the person who used it didn't actually know what it did. Yeah, I've seen far too much hinting in my career.
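For anyone following along, the two forms in question look like this in T-SQL (the table name is made up); the second can be issued once from the application layer instead of sprinkling the hint through every statement:
-- hint repeated in every query
SELECT * FROM dbo.Orders WITH (NOLOCK) WHERE Status = 'OPEN';
-- same effect, set once per session/connection
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT * FROM dbo.Orders WHERE Status = 'OPEN';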
Ok downvoter, why do you think a DB that needs hints for good performance is better than one that usually doesn't? That is, why do you think it better a person does a computer's job instead of the computer?
You've never run into a query planner that suddenly changes its mind once your tables grow.
Try using row-level security in Postgres - it's awfully slow, and you need to be a master architect and SQL developer to make it run fast because it's too dumb on its own. With query planner hints I could at least guide it. Shame PG doesn't believe in that.
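For readers who haven't touched it, enabling RLS is only a few statements; the pain is that the policy predicate gets folded into every query against the table. A minimal sketch with made-up table, column, and setting names:
CREATE TABLE documents (id bigint PRIMARY KEY, tenant_id int NOT NULL, body text);
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON documents
    USING (tenant_id = current_setting('app.tenant_id')::int);
-- the policy is silently ANDed into every query on the table,
-- so indexing and the cost of the predicate matter a lot here
CREATE INDEX ON documents (tenant_id);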
This is not about Oracle versus Postgres, it's about whether or not to use hints. My position is that needing hints indicates a failure of the database optimiser. The optimiser theoretically should do a perfect job. In practice it doesn't, and never can, so sometimes hints are necessary, but IME, and speaking as an MSSQL guy, I very rarely need them.
No one is saying that. Your original comment can be read like you're saying using hints is a sign of failure of the person writing the query which is probably why the downvotes. Maybe you meant a sign of failure for the optimizer, which is also debatable given they really cannot be perfect. An optimizer will quickly run into NP-hard problems like join ordering - which is just one small reason among many that they won't always find the fastest query plan.
https://neon.tech/docs/introduction/branching is exactly what the author is looking for and is the best way to do highly efficient point-in-time versioning of Postgres data I have seen so far.
I deployed NeonDB for a large enterprise client in Q2 of this year and, yes, the initial data migration is the hard part. We added 2x100GbE network cards directly to the VMware hosts where the data was, so the migration to the Kubernetes cluster running NeonDB took hours instead of weeks.
Hi Neon CEO. I’m curious if users have deployed Neon as a (selective) audit-logging tool for configuration which might need to be edited or rolled back from time to time. Think configuration-driven/event-driven systems where an update to a single database-stored config can cause other downstream applications to start behaving mysteriously. Going to scour the website some more!
Essentially, yes. Between a few times per day and a few times per month I would like to create branches to store the “backup” copies of table-stored confs to compare against logs/events and see if configuration changes caused certain conditions to arise. In the case of a configuration error, rolling back to a prior state would be awesome.
We are using a Datomic derivative at my work because Postgres doesn't have this feature. I don't think anyone has actually used Datomic's capabilities for this purpose.
If I had to guess, they are probably using XTDB, which is bitemporal, i.e. every entity has an associated "transaction time" and "validity time". You can use the validity time to deal with situations like "Company reported X assets and Y liabilities in their 2020 balance sheet, and in 2022 it was discovered there were some 2020 assets unaccounted for", and, in addition, answer questions like "How many assets did Company believe it had in 2020, given the facts known in 2020? How about given the facts known in 2022?"
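A rough sketch of that bitemporal idea in plain SQL rather than any particular product's syntax (tables and columns are made up; recorded_at stands in for transaction time, valid_year for the reporting period):
CREATE TABLE balance_facts (
    company_id  int,
    valid_year  int,          -- which reporting period the fact is about
    recorded_at timestamptz,  -- when we learned / stored the fact
    assets      numeric,
    liabilities numeric
);
-- What did we believe the 2020 assets were, given only what was known by end of 2020?
SELECT assets FROM balance_facts
WHERE company_id = 1 AND valid_year = 2020 AND recorded_at <= '2020-12-31'
ORDER BY recorded_at DESC LIMIT 1;
-- Same question, but with everything known by end of 2022 (restatements included):
SELECT assets FROM balance_facts
WHERE company_id = 1 AND valid_year = 2020 AND recorded_at <= '2022-12-31'
ORDER BY recorded_at DESC LIMIT 1;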
I am not sure how useful this is for versioning data though. It seems like an orthogonal problem to me.
> That also means at any given time, git has a copy of all your files since the beginning of the project.
This is true in a sense, but these objects are commonly stored compressed, and of course you can expect a pretty good compression ratio when changed files are mostly similar. And if they're not, you really have to store a full copy anyway.
On the topic of PostgreSQL and git. As crazy as this sounds, I once entertained the idea of writing a VCS that used PostgreSQL's large object facility for storing the data. I always have PostgreSQL installed on my personal devices so communication would be through Unix Domain Sockets anyway (or through a gigabit LAN in worst case). And the transactions and SQL interface (for metadata) were just that tempting.
But I realized pg_largeobject (1) chunks data into roughly 2 kB chunks, (2) stores each chunk in its own row, and (3) each row uses the "oid" type as identifier, which is just 32 bits long. It's probably large enough for anything I would ever need, but for some reason I don't feel comfortable with only 32 bits as a primary key.
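If anyone wants to try the same experiment without pg_largeobject, one alternative sketch (entirely hypothetical schema) is a content-addressed bytea table, which sidesteps the 32-bit oid and the ~2 kB chunking at the cost of Postgres's ~1 GB per-value limit:
CREATE TABLE blobs (
    hash    bytea PRIMARY KEY,   -- sha256 of the content (built-in sha256() since PG 11)
    content bytea NOT NULL       -- whole object; TOAST-compressed by Postgres, max ~1 GB
);
INSERT INTO blobs (hash, content)
VALUES (sha256('hello world'::bytea), 'hello world'::bytea)
ON CONFLICT (hash) DO NOTHING;   -- dedupe identical objects, git-style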
4 years ago I wrote an internal tool to serialize records of our hierarchical database to a bunch of hierarchical json files, which are then managed in a normal git repo.
This was/is a very specific solution to our very specific set of problems. So not applicable to the general problem of "versioning a database".
It is still in use and still under active development – I'm actually fixing a few bugs with it just now.
I meant the branching itself. Is it possible to build features on top of this branching strategy? Is it possible to list versions and get a specific one using just SQL?
> Naively when you read a git diff or a pull request, you might think like me, that git is storing a diff between A and B and just displays it when needed. But in reality it stores the full file in the object storage.
I thought that logically git stores every file, but implementation-wise it packs objects (with delta compression). So in reality, it's not actually storing a full copy of every file on disk, no?
Which is such a genius implementation. You get the straightforward model of storing whole files (not deltas), while still getting the storage savings of deltas once objects are packed.
It creates lots of complexity and performance issues. I agree that using a PostgreSQL extension for temporal tables is a better idea.
I wish PostgreSQL allowed something like the Oracle Flashback feature, plus control over how much history you want to keep for specific tables, as typically only part of your data needs full auditing.
I am curious what success stories people have with versioned databases, and whether those success stories could be replicated by a combination of bitemporal schema design and infrequent snapshots/backups. That is, to avoid the complexity and performance issues of adding a temporal dimension to 100% of your data instead of the X% that actually needs it.
Just to pile on, I think that adopting bitemporal schemas is the solution.
There are various schemes you can employ to keep "archived" data within postgres (e.g. offloading it to another instance accessible behind a FDW).
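For concreteness, the FDW route looks roughly like this (hypothetical server, credentials, and table names):
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE SERVER archive FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'archive-db.internal', dbname 'archive');
CREATE USER MAPPING FOR CURRENT_USER SERVER archive
    OPTIONS (user 'archiver', password 'secret');
CREATE FOREIGN TABLE orders_history_archive (
    id bigint, payload jsonb, valid_from timestamptz, valid_to timestamptz
) SERVER archive OPTIONS (schema_name 'public', table_name 'orders_history');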
I've thought about this problem a lot, and I think every alternative solution I've seen is just an approximation of bitemporal schemas with some limitations that are typically not worth the cost:benefit trade-off.
The main difficulty is there isn't a standardized framework/FOSS solution for bitemporal data. I feel like it'll have to get into the SQL standard before we see widespread adoption.
I would prefer a generalized version of the "Oracle Flashback" feature: the ability to add extra where conditions to a table, e.g.
create table test (id int primary key, attr int, deleted_at timestamp);
alter table test add default_filter is_active where deleted_at IS NULL;
Then all selects like "select * from test where attr = 42" would automatically get the above criterion added and become "select * from test where attr = 42 AND deleted_at is NULL".
When you want old rows, you can override the default condition using something like:
select * from test where attr = 42 filter is_active (deleted_at IS NULL or deleted_at > '2010-01-01'::datetime)
You would then be able to have multiple criteria instead of just one for when a row was deleted.
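Until something like that exists, one way to approximate it today (a sketch reusing the hypothetical test table above) is a view over the live rows, falling back to the base table when you want history:
CREATE VIEW test_active AS
    SELECT * FROM test WHERE deleted_at IS NULL;
-- everyday queries go through the view
SELECT * FROM test_active WHERE attr = 42;
-- when old rows are wanted, hit the base table with an explicit condition
SELECT * FROM test
WHERE attr = 42
  AND (deleted_at IS NULL OR deleted_at > timestamp '2010-01-01');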
In case it's useful to anyone, I've been running the cheapest possible implementation of Git-based revision history for my blog's PostgreSQL database for a few years now.
I run a GitHub Actions workflow every two hours which grabs the latest snapshot of the database, writes the key tables out as newline-delimited JSON and commits them to a Git repository.
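One way to produce newline-delimited JSON like that (not necessarily the exact command used here; blog_entry is a stand-in table name) is a query per table, run via psql with the output redirected into the repo:
-- one JSON object per line; ORDER BY keeps successive dumps diff-friendly
-- (run with e.g. psql --tuples-only --no-align -c '...' > blog_entry.ndjson)
SELECT row_to_json(t) FROM blog_entry t ORDER BY id;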
This gives me a full revision history (1,500+ commits at this point) for all of my content and I didn't have to do anything extra in my PostgreSQL or Django app to get it.
If you need version tracking for audit purposes, or the ability to manually revert a mistake, and you're dealing with tens of thousands of rows, I think this is actually a pretty solid, simple way to get that.
That's not a full revision history, just snapshots (basically a backup solution). Multiple changes in a 2-hour window will get conflated into a single one.
pgreplay parses not the WAL (write-ahead log) but the regular log file; it "extracts the SQL statements and executes them in the same order and relative time against a PostgreSQL database cluster": https://github.com/laurenz/pgreplay
Are you overwriting the previous version of the JSON when committing then merging? Or are you just archiving a new copy of the JSON alongside all the old ones?
> If you have rewrote a file 999 times, there is 999 times copy of this file in the object storage. That's why it can be slow to clone a git repository.
This work is great, but it feels wasteful given the features Postgres already has; then again, rebuilding it in user space is probably the only reasonable solution.
It is possible to produce a long transaction versioned database, but the tricky bit is normally merging the structured data. For most applications this either needs to be entirely automated (so no user intervention is required) or the conflict resolution needs to surface in a lot of UI.
I haven’t read about NeonDB before but a brief search of their docs doesn’t mention ZFS. Postgres.ai uses ZFS to implement the snapshots, branching, etc.
I’d be very interested in a comparison of the two.
Better idea: just install the Postgres extension that enables ISO SQL Temporal Tables, just like what MSSQL and MariaDB have had for 8 years now: https://pgxn.org/dist/temporal_tables/
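From memory of that extension's docs (check the pgxn page for the exact trigger signature), usage is roughly: add a period column, create a history table, and attach the versioning trigger. Table names below are made up:
CREATE EXTENSION temporal_tables;
CREATE TABLE subscriptions (
    id         int PRIMARY KEY,
    plan       text NOT NULL,
    sys_period tstzrange NOT NULL DEFAULT tstzrange(current_timestamp, null)
);
CREATE TABLE subscriptions_history (LIKE subscriptions);
CREATE TRIGGER versioning_trigger
    BEFORE INSERT OR UPDATE OR DELETE ON subscriptions
    FOR EACH ROW EXECUTE PROCEDURE versioning('sys_period', 'subscriptions_history', true);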
Does this robustly support altering tables? What happens if a column is deleted - does it delete historical data, or mark the history column as nullable? As the OP article notes/hints, there are all sorts of performance and regulatory considerations around this.
Separately though, it's great to see that there's still people trying to solve this problem at the extension level for Postgres. I was sad to discover that Postgres for a very long time had an official extension for time travel - https://www.postgresql.org/docs/11/contrib-spi.html#id-1.11.... - only to realize that it had been removed in https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit... - with a note that "it's easy to implement this separately" but no real hints that anyone was taking up the mantle.
As a simpler alternative, if you're only modifying data via Django and can tolerate some unreliability, https://django-simple-history.readthedocs.io/en/latest/ can get you much of the way there without a need for extensions. We've built on top of it with some domain-specific modifications to visualize and filter diffs, giving us at least a best-efforts audit-esque log system.
Author here: yes, I agree, and I mention it at the end.
However, it's not possible to install the extension everywhere; for example, it's unfortunately not available in GCP Cloud SQL.
I feel your pain, except in my case it's the crippling inability to use SQLCLR in Azure SQL, so we can't even do things like use a Regex to clean-up data in-situ. Yeargh.