
> Anyone know how well this works with very large datasets? The backfill sounds like it would take a while to do.

It can take a long time, yes. It's somewhat similar in that regard to, for example, gh-ost for MySQL, which also does backfills. The advantage of Postgres here is that a backfill is required for fewer migration types, and pgroll only does backfills when needed.

> Does this Go binary need to be continuously running, or does it keep track of migration state in the database?

The latter, you only run the Go binary when doing schema changes.


How can you properly plan for e.g. disk storage requirements? Does the tool calculate that upfront via some sort of dry-run mode? For companies with larger datasets this would be a rather important consideration. Also, those backfills will generate a lot of network traffic in clustered environments.


This is a good point; I believe we can look into estimating storage needs or timings before a migration. It definitely looks like a nice-to-have.


Sorry about that, small glitch with our blog: we tried to fix something in the title and accidentally broke its metadata :)


When can we expect blogroll to be released??


Well played, well played.


Ah yes, good catch! Fixing..


Yes, that's exactly it. I generally agree with "always move forward" but if rolling back is as easy as dropping the "new" view, that makes it a lot less scary.


hey HN, repo is here: https://github.com/xataio/pgroll

Would love to hear your thoughts!


Great to see more innovation in this space! How does this compare to pg-osc?

https://github.com/shayonj/pg-osc


Hi there, I'm one of the pgroll authors :)

I could be mistaken here, but I believe that pg-osc and pgroll use similar approaches when it comes to avoiding locks and how backfilling happens.

While pg-osc uses a shadow table and switches to it at the end of the process, pgroll creates shadow columns within the existing table and leverages views to expose old and new versions of the schema at the same time. Having both versions available means you can deploy the new version of the client app in parallel to the old one, and perform an instant rollback if needed.
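Very roughly, the views part looks something like this (a hand-wavy sketch, not pgroll's actual DDL; table, column and schema names are made up):

    -- shadow column added to the existing table
    ALTER TABLE users ADD COLUMN email_new text;

    -- old version of the schema, exposing the original column
    CREATE SCHEMA app_v1;
    CREATE VIEW app_v1.users AS SELECT id, email FROM public.users;

    -- new version of the schema, exposing the shadow column under the old name
    CREATE SCHEMA app_v2;
    CREATE VIEW app_v2.users AS SELECT id, email_new AS email FROM public.users;

Each app version talks to its own schema of views, and rolling back is just dropping the new schema.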


Thanks for the reply, a write-up on the pros/cons of these approaches would be fantastic. I have no clue which is better, but I believe pg-osc is heavily inspired by github/gh-ost, their tool for online schema changes for MySQL.


Does pgroll have any process to address table bloat after the migration? One of the (many) nice things about pg-osc is that it results in a fresh new table without bloat.


Another pgroll author here :)

I'm not very familiar with pg-osc, but migrations with pgroll are a two-phase process - an 'in progress' phase, during which both old and new versions of the schema are accessible to client applications, and a 'complete' phase after which only the latest version of the schema is available.

To support the 'in progress' phase, some migrations (such as adding a constraint) require creating a new column and backfilling data into it. Triggers are also created to keep both old and new columns in sync. So during this phase there is 'bloat' in the table in the sense that this extra column and the triggers are present.

Once completed, however, the old version of this column is dropped from the table along with any triggers, so there is no bloat left behind after the migration is done.
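To make the "kept in sync" part concrete, conceptually it's something like this (illustrative only, not the triggers pgroll actually generates; only the up direction is shown, and the names are made up):

    -- hypothetical example: old column "email", shadow column "email_new"
    CREATE FUNCTION users_email_up() RETURNS trigger AS $$
    BEGIN
      -- derive the new column from the old one on every write
      NEW.email_new := LOWER(NEW.email);
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER users_email_up
      BEFORE INSERT OR UPDATE ON users
      FOR EACH ROW EXECUTE FUNCTION users_email_up();

A corresponding 'down' trigger would cover writes coming in through the new version of the schema.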


Thanks for the reply. My question was specifically about the MVCC behavior that creates new row versions for updates like this. If you're backfilling data into a new column then you'll likely end up creating new row versions for the entire table, and the space for the old rows will be marked for re-use via auto-vacuuming. Anyway, bloat like this is a big pain for me when making migrations on huge tables. It doesn't sound like this type of bloat cleanup is a goal for pgroll. Regardless, it's always great to have more options in this space. Thanks for your work!


> ... so there is no bloat left behind after the migration is done.

This is only true once all rows have been rewritten after the old column is dropped. In standard, unmodified Postgres, DROP COLUMN does not rewrite existing tuples.
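For anyone following along: in stock Postgres the dropped column is just marked as dropped in the catalog, and the space comes back only as rows are naturally rewritten. Reclaiming it immediately needs a full table rewrite, e.g. (table name hypothetical):

    -- rewrites the table and reclaims dropped-column space,
    -- but holds an ACCESS EXCLUSIVE lock for the duration
    VACUUM FULL users;

Tools like pg_repack can do the same rebuild without holding that lock the whole time.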


Is it possible to make it work using SQL only in the future?

Also, what about if the user can just maintain one schema file (no migrations), and the lib figures out the change and applies it?


Hi, one of the authors of pgroll here.

Migrations are in JSON format, as opposed to pure SQL, for at least a couple of reasons:

1. The need to define up and down SQL scripts that are run to backfill a new column with values from an old column (e.g. when adding a constraint).

2. Each of the supported operation types is careful to sequence its operations in such a way as to avoid taking long-lived locks (e.g. initially creating constraints as NOT VALID; see the sketch below). A pure SQL solution would push this kind of responsibility onto migration authors.
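For example, the usual low-lock pattern for adding a CHECK constraint is roughly this (illustrative only, not literal pgroll output; table and constraint names are made up):

    -- takes only a brief lock; existing rows are not checked yet
    ALTER TABLE users ADD CONSTRAINT email_not_empty
      CHECK (email <> '') NOT VALID;

    -- scans and checks existing rows without blocking normal writes
    ALTER TABLE users VALIDATE CONSTRAINT email_not_empty;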

A state-based approach to infer migrations based on schema diffs is out of scope for pgroll for now but could be something to consider in future.


Thanks for releasing this tool! I actually interpreted the question differently: Rather than manipulating in SQL, would you consider exposing it as something like a stored procedure? Could still take in JSON to describe the schema change, and would presumably execute multiple transactions under the hood. But this would mean I can invoke a migration from my existing code rather than needing something out-of-band, I can use PG’s existing authn/authz in a simple way, etc.


> Also, what about if the user can just maintain one schema file (no migrations), and the lib figures out the change and applies it?

Because that only solves the DDL issues and not the DML. It is still useful though.

I use a schema comparison tool that does exactly this to assist in building my migration and rollback plans, but when simply comparing two schemas there is no way to tell the difference (for example) between a column rename and a drop column/add column. The tooling provides a great scaffold and saves a ton of time.
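For example (assuming "email" is a text column; names are made up), a diff tool sees the same end state for both of these, even though one preserves the data and the other throws it away:

    -- a rename...
    ALTER TABLE users RENAME COLUMN email TO email_address;

    -- ...and a drop/add end up with an identical schema
    ALTER TABLE users DROP COLUMN email;
    ALTER TABLE users ADD COLUMN email_address text;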


The link to the introductory blog post here appears to be broken

https://xata.io/blog/pgroll-schema-migrations-postgres


Cool stuff! Do you have any thoughts about how this compares to https://github.com/fabianlindfors/reshape?


Great question! Reshape was definitely a source of inspiration, and in fact, our first PoC version was based on it.

We decided to start a new project for a couple of reasons. First, we preferred to have it in Go, so we can integrate it more easily into Xata. Second, we wanted to push it further, based on our experience, to also deal with constraints (with Reshape, constraints are shared between versions).


This is a fantastic article! It shows that even simple migrations (like adding or removing a column) can be quite tricky to deploy in concert with the application deployment.

We (at Xata) have tried for a while to come up with a generic schema migration system for PostgreSQL that makes this easier. We ended up using views and temporary columns in such a way that we can provide both the "old" and the "new" schema simultaneously. Up/down triggers convert newly inserted data from old to new and the other way around. This also has the advantage that it can do rollbacks instantly by just dropping the "new" view.

We were just planning to announce this as an open source project this week, but actually it is already public, so if you are curious: https://github.com/xataio/pgroll


If you are using Postgres, I think you can create a view for each table and put all the views in a schema, then you switch the app all at once by using `SET search_path TO`.
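Something like this, I mean (schema and table names just for illustration):

    CREATE SCHEMA api_v2;
    CREATE VIEW api_v2.users AS
      SELECT id, email_address AS email FROM public.users;

    -- each app version picks its schema at the connection/session level
    SET search_path TO api_v2;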


You can, but I'd argue if your API schema is just 1-to-1 views for the underlying tables, you're not getting much value from this approach.


I was thinking of it as the starting point, then you evolve from there.

Tbh, views are a bit of a leaky abstraction (for example when it comes to constraints) so adopting them gradually seems like a good idea.


Until you change the underlying tables, or the API schema. At which point you don't need to change the other.


If your API view is a direct mapping of an underlying table (e.g. "SELECT *"), and you make a breaking change to that table, the view will also change and your application will break.

What I've done instead is change the underlying table as needed, and ensure the view interface stays the same, by changing the view definition. This lets you refactor the low-level schema without breaking clients, and this migration can be done in a single transaction.
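For example, splitting a name column while keeping the view stable looks roughly like this (hypothetical names, all in one transaction):

    BEGIN;

    ALTER TABLE users
      ADD COLUMN first_name text,
      ADD COLUMN last_name text;

    -- the view keeps presenting the old interface to clients
    CREATE OR REPLACE VIEW api.users AS
      SELECT id, first_name || ' ' || last_name AS name
      FROM users;

    COMMIT;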


hey HN, and thank you todsacerdoti for posting it!

This is a feature that we have wanted for a long time, but we also wanted to get it right. It's somewhat the equivalent of storing files and images in an S3 bucket and then putting URLs to them in the database, except that you then have to take care of keeping things in sync, managing security permissions, deleting the files when the database row is deleted, etc. We automate all of that for you, and even more: we cache the files in a CDN, and if they are images we support transformations out of the box.


There are plenty of frameworks and libraries that link files in buckets with rows in a database. Django is one that is pretty simple.


I think Azure offers sharding via Citus: https://learn.microsoft.com/en-us/azure/cosmos-db/postgresql...

Then, there's the Postgres compatible offerings like CockroachDB and Yugabyte that do sharding transparently.


Thanks, I found out Percona does this. I'll check out CockroachDB and Yugabyte.


Yes, CockroachDB and YugabyteDB do more than sharding; they are distributed SQL databases (difference here: https://dev.to/yugabyte/partitioning-vs-sharding-what-about-...). YugabyteDB is open source and has more Postgres features, as its query layer is a fork of Postgres, but it's good to test both.


That sounds interesting, are you using `ts_rank` or otherwise sorting by relevancy?
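Something in this direction is what I have in mind (table and column names are just made up):

    SELECT title, ts_rank(search_vec, q) AS rank
    FROM docs, plainto_tsquery('english', 'example query') AS q
    WHERE search_vec @@ q
    ORDER BY rank DESC
    LIMIT 20;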

