Yes, for those pgroll migrations that require a new column + backfill, starting the migration can be expensive.
Backfills are done in fixed-size batches to avoid long-lived row locks, but the operation can still be expensive in terms of time and potentially I/O. Options to control the rate of backfilling could be a useful addition here, but they aren't present yet.
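To make the batching concrete, here's a rough sketch of the pattern in plain SQL - not pgroll's actual implementation; the table, column, 'up' expression and batch size are all made up:

```sql
-- Hypothetical batched backfill: copy values into the new column a fixed
-- number of rows at a time so that no single statement holds row locks for
-- long. Repeat in separate transactions until no rows remain.
UPDATE users
SET    email_checked = lower(email)        -- the migration's 'up' expression
WHERE  id IN (
    SELECT id
    FROM   users
    WHERE  email_checked IS NULL           -- rows not yet backfilled
    ORDER  BY id
    LIMIT  1000                            -- fixed batch size
);
```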
Any pgroll operations[0] that require a change to an existing column, such as adding a constraint, will create a new copy of the column, backfill it using the 'up' SQL defined in the migration, and apply the change to that new column.
There are no operations that will modify the data of an existing column in-place, as this would violate the invariant that the old schema must remain usable alongside the new one.
Maybe this is explained somewhere in the docs but I'm lazy: how does it cope with possible performance issues in highly trafficked tables? Can you somehow control the backfill speed if it's taking up too much I/O?
Backfills are done in fixed-size batches to avoid taking long-lived row locks on many rows, but there is nothing in place to control the overall rate of backfilling.
This would certainly be a nice feature to add soon though.
The bloat incurred by the extra column is certainly present while the migration is in progress (i.e. after it's been started with `pgroll start` but before running `pgroll complete`).
Once the migration is completed any extra columns are dropped.
Allowing the database and the application to be out of sync (to +1/-1 versions) is really the point of pgroll though.
pgroll presents two versions of the database schema, to be used by the current and vNext versions of the app while syncing data between the two.
An old version of the app can continue to access the old version of the schema until such time as all instances of the application are gracefully shut down. At the same time the new versions of the app can be deployed and run against the new schema.
But then how do you avoid errors introduced by the lack of app awareness? For example, the old app can keep writing data with the old schema, whereas the new version makes assumptions based on the new schema - assumptions that could be broken by new data written through the old schema that doesn't follow them.
Seems like a dangerous way to introduce very hard to find data inconsistencies.
> For example the old app can keep writing data with the old schema, whereas the new version makes assumptions based on the new schema
To do safe migrations it's a good idea to avoid rolling out versions that start making assumptions about the new schema until the migration is done and dusted.
This does of course start to create weird territory where migrations cease to be trivially "stackable" if you want to retain the whole zero-downtime thing, but this is true for a lot of "graceful update" patterns in computing (in fact I've not seen it solved properly before).
> To do safe migrations it's a good idea to avoid rolling out versions that start making assumptions about the new schema until the migration is done and dusted.
Or make the app do the migration as part of its normal operation, so you have total control over it. :)
I think we're saying the same thing, I'm just suggesting it's safer to do it in the app itself, because then app changes and database changes are tied together instead of divorced from each other.
If you only have one or two people working on the code that interacts with the table that is being changed, both methods are pretty much the same. But if you have a large team or want to be able to rollback better, then doing it all in the app code feels like the better way to go for me.
For migrations altering columns (for instance adding a constraint), data gets upgraded/downgraded between the old and new versions through the up/down functions. These are defined by the migration. They work like this:
- For rows inserted/updated through the old version of the schema, the up function gets executed, copying the resulting value into the new column
- For rows inserted/updated through the new version, the down function gets executed, copying the resulting value into the old column
For instance, you can test that the up function works before releasing the new app that relies on it, just by checking that the values present in the new schema are correct.
This migration[0] adds a CHECK constraint to a column.
When the migration is started, a new column with the constraint is created and values from the old column are backfilled using the 'up' SQL from the migration. The 'up' SQL rewrites values that don't meet the constraint so that they do.
The same 'up' SQL is also applied 'on the fly' as data is written to the old schema by applications - it runs as part of a trigger to copy data into the new column, rewriting as necessary to ensure the constraint on the new column is met.
As the sibling comment makes clear, it is currently the migration author's responsibility to ensure that the 'up' SQL really does rewrite values so that they meet the constraint.
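For a rough idea of what that trigger pattern looks like, here is a conceptual sketch - not the code pgroll actually generates; the table, column names and 'up' expression are invented:

```sql
-- Conceptual sketch: rows written through the old schema have the
-- migration's 'up' expression applied, keeping the new, constrained
-- column in sync.
CREATE FUNCTION posts_title_up() RETURNS trigger AS $$
BEGIN
    -- 'up' SQL: rewrite values that would violate the new
    -- CHECK (length(title) > 3) constraint so that they pass.
    NEW.title_checked := CASE
        WHEN length(NEW.title) <= 3 THEN rpad(NEW.title, 4, '-')
        ELSE NEW.title
    END;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER posts_title_up_trigger
BEFORE INSERT OR UPDATE ON posts
FOR EACH ROW EXECUTE FUNCTION posts_title_up();
```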
If I may be so free, I read your docs quite carefully last night and it wasn’t immediately clear to me that “up” and “down” are automagically run to keep both the old and the new data in sync, while a migration is pending. To my reading this is a spectacular feature because it solves the “the only way is forward” problem (where you can’t actually ever roll back a migration or risk losing / making inconsistent some recent data) for many kinds of migrations.
May I suggest spelling this out more explicitly in the docs? As far as I can tell it’s only mentioned by example halfway down the tutorial. To me it seems like one of pgroll’s true killer features, worth more coverage :-)
Yes, this is the case as of today. It's important to get the up/down functions right.
The good thing is that this will probably be detected during the start phase, as data would fail to be backfilled.
We are thinking of ideas to improve this in the longer term, for example:
* Having a dry-run mode where you check that the data present in the DB will stay valid after applying the up function (a rough sketch of such a check is below)
* Optionally allowing the backfill to go ahead anyway, but "quarantining" the affected rows so they are no longer available in the new schema.
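For the dry-run idea, the check could amount to something like this - purely illustrative; the table, 'up' expression and constraint are placeholders:

```sql
-- Count rows whose value would still violate the intended CHECK constraint
-- after the migration's 'up' expression has been applied.
SELECT count(*)
FROM   posts
WHERE  NOT (
    length(CASE WHEN length(title) <= 3 THEN rpad(title, 4, '-') ELSE title END) > 3
);
-- A non-zero count means the 'up' SQL does not repair every existing row.
```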
There is no code to do this because it's actually a nice feature of Postgres - if the underlying column is renamed, the pgroll views that depend on that column are updated automatically as part of the same transaction.
You're right. I wish schema wasn't such an overloaded term :)
In order to access either the old or new version of the schema, applications should configure the Postgres `search_path`[0] which determines which schema and hence which views of the underlying tables they see.
This is touched on in the documentation here[1], but could do with further expansion.
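For example, a client can pin itself to one version at connection time - the versioned schema name below is hypothetical; pgroll derives the real name from the migration:

```sql
-- Route this session's unqualified table references through the views in
-- one versioned schema.
SET search_path TO public_02_add_check_constraint;

-- Or set it per role, so every connection made by that role gets it:
ALTER ROLE app_v2 SET search_path TO public_02_add_check_constraint;
```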
I'm not very familiar with pg-osc, but migrations with pgroll are a two phase process - an 'in progress' phase, during which both old and new versions of the schema are accessible to client applications, and a 'complete' phase after which only the latest version of the schema is available.
To support the 'in progress' phase, some migrations (such as adding a constraint) require creating a new column and backfilling data into it. Triggers are also created to keep both old and new columns in sync. So during this phase there is 'bloat' in the table in the sense that this extra column and the triggers are present.
Once completed, however, the old version of this column is dropped from the table along with any triggers, so there is no bloat left behind after the migration is done.
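Conceptually, the complete phase boils down to something like the following - a simplified sketch with invented names, not pgroll's exact statements:

```sql
-- Remove the views that exposed the old schema version, drop the sync
-- trigger and the superseded column, and rename the new column into place.
DROP SCHEMA public_01_initial_schema CASCADE;
DROP TRIGGER posts_title_up_trigger ON posts;
ALTER TABLE posts DROP COLUMN title;
ALTER TABLE posts RENAME COLUMN title_checked TO title;
```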
Thanks for the reply. My question was specifically about the MVCC feature that creates new rows for updates like this. If you're backfilling data into a new column then you'll likely end up creating new rows for the entire table and the space for the old rows will be marked for re-use via auto-vacuuming. Anyway, bloat like this is a big pain for me when making migrations on huge tables. It doesn't sound like this type of bloat cleanup is a goal for pgroll. Regardless, it's always great to have more options in this space. Thanks for your work!
> ... so there is no bloat left behind after the migration is done.
This is only true once all rows have been rewritten after the old column is dropped. In standard, unmodified Postgres, DROP COLUMN does not rewrite existing tuples.
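If that leftover space matters, reclaiming it today means forcing a table rewrite outside of pgroll, for example (standard Postgres, table name invented):

```sql
-- VACUUM FULL rewrites the whole table under an ACCESS EXCLUSIVE lock,
-- reclaiming dead tuples and the storage of dropped columns.
VACUUM FULL posts;
-- Extensions such as pg_repack can do a similar rewrite without holding
-- the exclusive lock for the duration.
```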
Yea, keeping your application consistent with two different schema versions. And I'm not talking about this from a blue/green standpoint, but from whatever pgroll does instead, so that when a rollback happens you don't lose data.
I don't see the need to keep your application consistent with both schema versions. During a migration pgroll exposes two Postgres schemas - one for the old version of the database schema and another for the new one. The old version of the application can be ignorant of the new schema and the new version of the application can be ignorant of the old.
pgroll (or rather the database triggers that it creates along with the up and down SQL defined in the migration) does the work to ensure that data written by the old applications is visible to the new and vice-versa.
A rollback in pgroll then only requires dropping the schema that contains the new version of the views on the underlying tables and any new versions of columns that were created to support them.
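In rough terms, that amounts to something like this - a conceptual sketch with invented names, not pgroll's exact statements:

```sql
-- Drop the views that exposed the new schema version, plus the column and
-- trigger that were created to back it; the old schema is untouched.
DROP SCHEMA public_02_add_check_constraint CASCADE;
DROP TRIGGER posts_title_up_trigger ON posts;
ALTER TABLE posts DROP COLUMN title_checked;
```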
Eh, isn't this making it easier on the application side? Today, when I make a change in my app that needs a migration, I need multiple steps to make sure I don't break old instances of the app still running. With this it looks like one can avoid all that?
That is what this project is trying to achieve. By allowing your client apps to access both the old and the new schema at the same time, you can have the old instances of your application working while the new ones (using the new schema) get deployed.
They can work in parallel for a while until you complete the rollout and mark the migration as done.
We are looking to build integrations with other tools, but for now it isn't recommended to use pgroll alongside another migration tool.
To try out pgroll on a database with an existing schema (whether created by hand or by another migration tool), you should be able to have pgroll infer the schema when you run your first migration.
You could try this out in a staging/test environment by following the docs to create your first migration with pgroll. The resulting schema will then contain views for all your existing tables that were created with alembic. Subsequent migrations could then be created with pgroll.
It would be great to try this out and get some feedback on how easy it is to make this switch; it may be that the schema inference is incomplete in some way.
Migrations are in JSON format as opposed to pure SQL for at least a couple of reasons:
1. The need to define up and down SQL scripts that are run to backfill a new column with values from an old column (e.g. when adding a constraint).
2. Each of the supported operation types is careful to sequence operations in such a way as to avoid taking long-lived locks (e.g. initially creating constraints as NOT VALID, as sketched below). A pure SQL solution would push this kind of responsibility onto migration authors.
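To illustrate point 2: adding a CHECK constraint in plain Postgres is typically split into two steps so that existing rows aren't checked under a long exclusive lock. This is standard Postgres behaviour, nothing pgroll-specific; the names are made up:

```sql
-- Adding the constraint as NOT VALID takes only a brief lock, because
-- existing rows are not checked yet.
ALTER TABLE posts
    ADD CONSTRAINT title_length CHECK (length(title) > 3) NOT VALID;

-- Validation scans the table but holds only a SHARE UPDATE EXCLUSIVE lock,
-- so normal reads and writes can continue while it runs.
ALTER TABLE posts VALIDATE CONSTRAINT title_length;
```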
A state-based approach to infer migrations based on schema diffs is out of scope for pgroll for now but could be something to consider in future.
Thanks for releasing this tool! I actually interpreted the question differently: Rather than manipulating in SQL, would you consider exposing it as something like a stored procedure? Could still take in JSON to describe the schema change, and would presumably execute multiple transactions under the hood. But this would mean I can invoke a migration from my existing code rather than needing something out-of-band, I can use PG’s existing authn/authz in a simple way, etc.