Show HN: Version Control for Databases (github.com/infostreams)
224 points by akie on Dec 9, 2019 | 88 comments



The way I have been doing database versioning and rollout for years is with Liquibase (https://www.liquibase.org/). It is based on the Database Refactoring book that came out some years ago, and it provides forward and backward migration based on a table that tracks the checksums of the changes, allowing them to be rolled out wherever the file goes. You can share with teammates, propagate to environments, and have separate setups for structure, data, and test data.

Restoring snapshots isn't really what you want with databases. The data in production is only in production, so you need to refactor it. The issue in today's 24/7, always-online world is that refactoring database structure that way with a lot of data causes downtime, so these sorts of tools require a lot of orchestration around them.

Liquibase can also capture an existing structure and data set with an export from just about any database engine out there, so you can capture your initial point as you are and then move forward, with updates being individual changes, so it is relatively easy to adopt.
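
For example, capturing an existing database as a baseline changelog is a single Liquibase command - a sketch only, with connection details and file names purely illustrative:

  liquibase --url=jdbc:mysql://localhost/mydb --username=me --password=secret \
    --changeLogFile=baseline.xml generateChangeLog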


I don't know why people are stuck on this model of a long chain/log of successive schema changes when you can simply make a diff.

I lay out this approach here: https://djrobstep.com/talks/your-migrations-are-bad-and-you-...


Hibernate has this model built in as well, I believe. https://stackoverflow.com/a/221422

hbm2ddl.auto=update

But it isn’t recommended for production because the automatic diff-based approach can lead to unpleasant surprises. With individual migration files, you always know exactly what they are going to do and can plan ahead for how a rollback would work if one is needed.

It’s fair to say that you don’t necessarily need to keep 100 individual files around once they’ve been deployed to every environment for a while and there’s no chance you’d roll back. But I think people typically take this approach because it is simpler and easier than attempting to export the database structure at #100 or whatever and setting a new baseline every few months.


I'm certainly not suggesting applying generated diff scripts without reviewing them!

In fact, probably the main benefit of the diff approach is that you can rigorously test the outcome, and explicitly confirm a matching schema will result.


How does this possibly work then, if you have to review the scripts first?

You're just storing a migration script in the repo still, but it's generated and works against only a single schema "revision"... ?


I don't understand the objection here. You want to apply the same migration script against different schemas?


The same schema, from potentially different points in time.

So it's not really applying to different schemas, each single migration script applies against the same state that the previous one produced - but collectively they can be applied against a database from any point in time while that migration system has been used.


Again, what exactly are you objecting to? You're just describing how traditional migrations work. There's nothing stopping you saving a sequence of scripts that you use a diff tool to generate, it's just less necessary because you can work against the actual version instead.


... your comments are both confusing and contradictory.

You've simultaneously said that storing scripts is "bad" and that a live "diff" is better, but then also said that those generated "diff" scripts can be reviewed, which means you have to store them.

I'm completely aware of using tools to generate scripts by diffing a model schema against a DB, and using a different tool to apply the finalised, reviewed scripts without invoking the "diff" tooling outside of development environments.

What I cannot grasp is your convoluted claim that storing scripts is bad, yet that diffs can be reviewed before they're applied without storing them, and will still work against cowboy-esque DB modifications.


? Nobody is saying that storing scripts is bad. This is getting silly. I'm skeptical that you even read the original link before attacking it.


I tried to read your "approach" and I got as far as "have to track version numbers" and gave up.

IMO you're inventing problems to justify your own preferences.

For context: I use, and am a proponent of the 'series of patches' approach for SQL migrations. I wrote a tool to apply them, because I wasn't happy having to constantly patch the existing one I'd been using.

I don't really understand what you mean about "managing versions" - you just need a naming scheme that ensures the items can be applied in order. Dates (as a prefix, with a descriptive suffix) usually works well.
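
For example, a layout like this (file names purely illustrative), where lexicographic order is also application order:

  migrations/2019-11-02_add_users_table.sql
  migrations/2019-11-20_add_index_on_users_email.sql
  migrations/2019-12-05_drop_legacy_flags.sql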

I also don't see how you think having the patches accumulate in the repo is a problem. If you don't like seeing the full history of changes - just remove the old ones after a while.

The "series of patches" approach also has a benefit that I don't see how your "diff" approach would solve (without some overly complex tooling): upgrading a previous version of the database to the current schema.

With our current system, I could take a database dump from 2 years and dozens of migrations ago, run the migrations on it, and it would come up to the current schema.


If you gain some satisfaction from having to track version numbers, then good for you. I and most other people find it tedious.

They're also impractical to use in environments where your versioning system isn't the only thing touching the data. If you or some DBA needs to make an emergency database change - suddenly the real DB doesn't match your versioning.

But if you compare directly against production, it's not a problem at all.

Not directly checking against the production schema is rather like flying a plane with no reference to what is actually outside the windows - rather an unreliable experience.

Nothing is stopping you from keeping a long chain of these scripts if you want - you'll lose nothing and gain more automation and testability.

But in practice nobody wants to keep these files around, and nobody wants to restore 2 year old backups to production.

And if you did, a diff-based approach will do just as well, more automatically.

Where the diff-based approach probably shines the most is when making experimental changes locally during development. Play around with your models, add a column, add a constraint, rename the column, make it not null, change your mind and drop it again. One command syncs the dev database to your models automatically and instantly.


> If you gain some satisfaction from having to track version numbers, then good for you. I and most other people find it tedious.

I literally said there are no version numbers to track. If you find identifying the current date either manually or in some automated tool tedious, I don't know how to help you.

> If you or some DBA needs to make an emergency database change - suddenly the real DB doesn't match your versioning.

Well by that logic, if I suddenly need to make some "emergency" fix to the code, it won't match the version control system.

The solution there is to have a method in place to rapidly deploy a change, not to make your migration tool also work around cowboy solutions.

Your previous comments also imply that to be usable, the "diff" system needs to be reviewed (i.e. to handle table renames, and to be considered safe for production).

So how does that handle the cowboy approach where the schema doesn't match? Either your "diffs" are generated at the time of execution, so the previous state doesn't matter but you can't review them, OR your diffs are generated ahead of time, so they can be reviewed, but will not necessarily work against the live database.

So which is it?

I'm not against a "make changes against the DB directly" workflow for development. I've written code that does exactly that, as you describe, from models. But it's not practical for production use.

It's usable as a development tool, and to produce static migration scripts.


"They're version dates not version numbers" is hairsplitting. You still have external information that you must rely on to determine what state your production db is in, and that's bad.

If an on-call SRE calls me in the middle of the night asking me if he can add an index to solve a performance problem, I'd rather say "yes, no problem", not present a series of additional steps for them to jump through.

You review the script, and when the time comes to apply, recheck the database to make sure it's still the same state you generated the script against. Generally people tell you if they've made changes on live dbs that you're working on, but it's nice to double-check regardless.


> You still have external information that you must rely on to determine what state your production db is in, and that's bad

What external information? Whether each migration has been applied or not is stored in the database itself. The dates are literally used just to ensure correct ordering - that's literally their only purpose.

> not present a series of additional steps for them to jump through.

If someone can't write the change they want to make into a file, write the opposite action into another file, commit and push that change to a VCS repo, I don't think they should be given access to a god damn toaster oven much less your production database.

> You review the script, and when the time comes to apply, recheck the database to make sure it's still the same state you generated the script against.

.. How can that possibly work with automated deployments? And how on earth do you "recheck the database to make sure it's still the same", with any degree of certainty?

Your entire approach smells like a very manual process that doesn't work for teams any larger than 1 person.


> Whether each migration has been applied or not is stored in the database itself.

You're dragging this further into pedantic territory here. A chain of scripts and a version table is external to the structure of the database itself.

> If someone can't write the change they want to make into a file, write the opposite action into another file, commit and push that change to a VCS repo...

The recurring theme here is that you have a preference for mandatory busywork instead of a direct approach. People putting out fires ought to be focused on what will directly solve the problem most quickly and safely. In larger environments with dedicated ops people supporting multiple applications/environments/databases, not every ops person is going to be familiar with your code and preferred workflow.

> How can that possibly work with automated deployments? And how on earth do you "recheck the database to make sure it's still the same", with any degree of certainty?

...with a diff tool.

> Your entire approach smells like a very manual process that doesn't work for teams any larger than 1 person.

The whole point is that it is automatic rather than manual. I've used it before in teams "larger than 1 person" and it has worked fine.


> Your entire approach smells like a very manual process that doesn't work for teams any larger than 1 person.

You may be misunderstanding the concept. Automated declarative schema management (AKA diff-based approach) has been successfully used company-wide by Facebook for nearly a decade, to manage schema changes for one of the largest relational database installations on the planet. It's also a widely used approach for MS SQL Server shops. It's not some newfangled untested crazy thing.

I have a post discussing the benefits here: https://www.skeema.io/blog/2019/01/18/declarative/


I understand the concept of a tool that changes the schema to match some declared state dynamically.

I wrote the same functionality into a library.

What I cannot comprehend is the poster who claims that such an approach can simultaneously:

- be automatically applied

- be reviewed and even edited after generation to handle e.g. renames

- handle previously unknown changes in the DB schema (aka handling cowboy behaviour from other ops).

All three are simply not possible at once.


Here's how you can achieve all 3.

- Develop intended schema (I)

- Inspect production schema (P), save as P0

- Generate migration (M) by comparing the intended schema to production: I - P0 = M

- Edit M as necessary, test for correctness, commit to master (meets your second criteria)

- Deploy code, with the migration running as a deploy step (meets your first criteria)

- Migration works as follows:

- Inspect P again, save as P1. If P0 != P1, abort process (this prevents any issues from out-of-band changes as per your third criteria, and means the pending script won't run more than once)

- Apply M.

- Inspect P once more, save as P2. Double-check P2 == I as expected.
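
A minimal shell sketch of that deploy step, assuming MySQL and using schema-only dumps as the "inspect" operation (file and database names are made up; in practice you'd normalize the dumps, e.g. strip AUTO_INCREMENT counters, or use a proper schema-diff tool):

  mysqldump --no-data --skip-comments mydb > p0.sql   # P0, captured at review time; migration.sql is reviewed against it
  mysqldump --no-data --skip-comments mydb > p1.sql   # P1, captured at deploy time
  diff -q p0.sql p1.sql || { echo "schema drifted since review, aborting"; exit 1; }
  mysql mydb < migration.sql                          # apply M
  mysqldump --no-data --skip-comments mydb > p2.sql   # P2
  diff intended.sql p2.sql                            # double-check P2 == I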


I disagree. This is definitely all possible at once with proper tooling.

Ideally this workflow is wrapped in a CI/CD pipeline. To request a schema change, you create a pull request which modifies the CREATE statement in a .sql file in the schema repo. CI then automatically generates and shows you what DDL it wants to translate this change to.

If that DDL matches your intention, merge the PR and the CD system will apply the change automatically. If that DDL doesn't match your intention, you can either modify the PR by adding more commits, or take some manual out-of-band action, or modify the generated DDL (if the system materializes it to a file prior to execution).

In any case, renames are basically the only general situation requiring such manual action. Personally I feel they're a FUD example, for reasons I've mentioned here: https://news.ycombinator.com/item?id=21758143
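
A rough sketch of that pipeline, using skeema-style commands (the exact commands, file paths and environment names here are assumptions, not a prescription):

  git checkout -b add-signup-date
  $EDITOR schema/users.sql    # edit the CREATE TABLE in the schema repo
  skeema diff production      # CI runs this and posts the DDL it would execute
  # reviewer confirms the generated ALTER matches the intent, then merges the PR
  skeema push production      # CD applies the change after merge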


Because diffs don't work. If you rename a table, for instance, a diff algorithm will see that as dropping the table and creating a new one.


People love to cite this reason, but practically speaking, it's FUD. Renaming in production -- whether entire tables or just a column -- is operationally complex no matter what.

Assuming any non-trivial software deployment (multiple servers), it's impossible to execute the rename DDL at the exact same time as new application code goes live. So either you end up with user-facing errors in the interim, or you can try to write application logic that can interact with both old and new names simultaneously. That's overly complex, typically not supported by ORMS or DAOs, and very risky in terms of bug potential anyway.

I'm a database expert, and among 100% of the companies I've worked at or consulted for, renames were either banned entirely or treated as a very rare special-occasion event requiring extra manual work. Either way, lack of rename support in diff-based schema management isn't really a problem, as long as the tooling has these two properties:

1. Catches unsafe/destructive changes in general and requires special confirmation for them (preventing an accidental rename-as-drop in the rare case where a rename truly is desired)

2. Offers a way to "pull" from the db side, so that if a rename is actually needed, it can be done "out of band" / manually, and the schema repo can still be updated anyway
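
For instance, in Skeema those two properties roughly correspond to the following (illustrative usage, not a full reference):

  skeema push production                  # refuses destructive DDL by default
  skeema push production --allow-unsafe   # destructive changes need an explicit opt-in
  skeema pull production                  # after an out-of-band rename, refresh the repo from the live schema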


> People love to cite this reason, but practically speaking, it's FUD.

> Either way, lack of rename support in diff-based schema management

You're trying to polish a turd here, this isn't "lack of support" it's "it will drop objects in production."

> Catches unsafe/destructive changes in general and requires special confirmation for them

Again, polishing a turd: the only way your "automation" works is through manual intervention.

> Assuming any non-trivial software deployment (multiple servers), it's impossible to execute the rename DDL at the exact same time as new application code goes live.

You can construct a view referencing the old table, and rename the table. Yes, it has to be an updateable view and you need transactional DDL, but within those constraints, it's doable.
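
A minimal sketch of that idea, assuming Postgres (transactional DDL; a simple single-table view is automatically updatable) and a hypothetical users -> accounts rename:

  # rename the table and immediately recreate the old name as a view, so app
  # servers that haven't been redeployed yet keep working
  psql mydb -c "BEGIN; ALTER TABLE users RENAME TO accounts; CREATE VIEW users AS SELECT * FROM accounts; COMMIT;"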

> I'm a database expert, and among 100% of the companies I've worked at or consulted for, renames were either banned entirely or treated as a very rare special-occasion event requiring extra manual work.

If they're using a DBMS that doesn't support transactional DDL, completely understandable. If their tools are liable to drop production data due to renames, also completely understandable.

But the fact that they ban a trivial operation is a symptom of the problem with all the half solutions and snake oil in SQL schema management. It's so bad that you have large companies investing heavily in ripping out the schema entirely, which itself is just more snake oil.

In the problem space of trying to keep a schema in sync, we know that diffing leads to unacceptable answers, that is an indicator that it's the wrong conceptual basis for a correct solution.


> You're trying to polish a turd here, this isn't "lack of support" it's "it will drop objects in production."

That's a strawman argument. Any reasonable schema management implementation has safeguards against DROPs. If your tooling blindly executes a DROP without extra confirmation, use better tooling.

> Again, polishing a turd: the only way your "automation" works is through manual intervention.

There's absolutely nothing wrong with requiring extra human confirmation for destructive actions. Quite the contrary. I've spent most of the last decade working on database automation and operations at social network scale, and will happily say this is a common practice, and it's a good one at that.

> You can construct a view referencing the old table, and rename the table. Yes, it has to be an updateable view and you need transactional DDL, but within those constraints, it's doable.

So you're assuming that every single table has a view in front; or you're dynamically replacing the table with a view and hoping that has no detrimental impact to other database objects or app performance. Either way, you're talking about something operationally complex enough that it isn't fair to say that production table renames or column renames are a "trivial operation" at the vast majority of companies.

> It's so bad that you have large companies investing heavily in ripping out the schema entirely, which itself is just more snake oil.

This is frequently overstated. For example, although Facebook uses a flexible solution for its largest sharded tables, there are many tens of thousands of other tables at Facebook using traditional schemas.

> In the problem space of trying to keep a schema in sync, we know that diffing leads to unacceptable answers, that is an indicator that it's the wrong conceptual basis for a correct solution.

The only "unacceptable answer" you've cited is rename scenarios, which even if it incorrectly leads to a DROP, the tooling will catch.

If you need crazy view-swapping magic to support an operation (renames), that is an indicator that it's a conceptually problematic operation that should be strongly reconsidered in production.

As I've already stated elsewhere in this thread, declarative schema management has been successfully used company-wide by Facebook for nearly a decade, and is also a common practice in the MS SQL Server world. If you're unconvinced, that's fine, but many companies have found it to be a great workflow!


Which is straightforwardly solved by editing the generated script accordingly.


Huh, this got me thinking.

How about you just give a UUID to everything in the schema?

The technical challenge with generated scripts (that you could edit by hand, but that just means that you now don't have an automated system) is that they don't understand changes at a deep enough level - they lack the context to see that a table or column has been renamed because they have no understanding of identity.

So - just give them identity.

  Schema at commit 20b1ea23

  03496418-e44c-42a6-a6a4-6563b7ae7bfb users
    25233812-9a95-4bc3-893e-6accb935fa49 name
    2f4c79c3-81b6-42d4-8379-ce5f0ed8ef62 address
    83fc34c8-56c7-49d4-94d0-150cd76204bc password

  Schema at commit c0d07562

  03496418-e44c-42a6-a6a4-6563b7ae7bfb users
    89482484-8205-40ad-a73b-a1bb988dc1d9 firstname
    25233812-9a95-4bc3-893e-6accb935fa49 lastname
    2f4c79c3-81b6-42d4-8379-ce5f0ed8ef62 address
    c9f0d35d-439e-488c-a6d5-7a144c54335c address2
    83fc34c8-56c7-49d4-94d0-150cd76204bc password
Now you could diff these two:

  @@ -1,4 +1,6 @@
   03496418-e44c-42a6-a6a4-6563b7ae7bfb users
  -    25233812-9a95-4bc3-893e-6accb935fa49 name
  +    89482484-8205-40ad-a73b-a1bb988dc1d9 firstname
  +    25233812-9a95-4bc3-893e-6accb935fa49 lastname
       2f4c79c3-81b6-42d4-8379-ce5f0ed8ef62 address
  +    c9f0d35d-439e-488c-a6d5-7a144c54335c address2
       83fc34c8-56c7-49d4-94d0-150cd76204bc password
and every line of the diff has the necessary information to decide what operation you intended.
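
To make that concrete, here's one way tooling could classify changes from such listings - a sketch assuming each snapshot is a flat file of "uuid name" lines (old.txt and new.txt are made-up names):

  # same uuid, different name on each side => rename
  join <(sort old.txt) <(sort new.txt) | awk '$2 != $3 {print "rename", $2, "->", $3}'
  # uuid only in the new snapshot => added; only in the old snapshot => dropped
  join -v2 <(sort old.txt) <(sort new.txt) | awk '{print "add", $2}'
  join -v1 <(sort old.txt) <(sort new.txt) | awk '{print "drop", $2}'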


Liquibase "Hello World": https://github.com/joelparkerhenderson/demo_liquibase_hello_...

    --liquibase formatted sql
    --changeset alice@example.com:1
    create table person (
      id int not null primary key,
      name varchar(100)
    );


Can/should I use liquibase instead of my framework's own migrations? Maybe I should complement those with liquibase?


I made this, and have been using it in production and for local development for the past year or two. It's been immensely helpful to quickly save or easily restore the database to a previously known 'good point'. I have also been using it to quickly load staging databases into development and vice versa. So far it's MySQL only, but the connection points for other databases such as PostgreSQL are in place.

Its main goal is to simplify development - it's not intended as a replacement for writing database migrations or for working with production databases. I would be grateful for any feedback, preferably in the form of pull requests ;-)


I like this pattern of being able to restore snapshots in your development environment. For the uninitiated, docker actually has all these functions. For example, if you create a container for postgres and mount a volume, you can create a snapshot of the volume for distribution to your team so they can mount/unmount it anytime.


I was under the impression container file systems make running a DB in a container a bad idea? Or is it just in performance critical situations?


Doing so into the main container filesystem (which is usually an overlay/union stack) is probably bad for performance.

But a bind mount (which is usually the recommended way to do it so that data survives the container lifespan) will avoid that and just leave you with whatever "problems" the host filesystem has re: database files.
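
e.g. the usual bind-mount pattern looks something like this (image, host path and credentials are just placeholders):

  docker run -d --name dev-db \
    -v /srv/mysql-data:/var/lib/mysql \
    -e MYSQL_ROOT_PASSWORD=secret \
    mysql:8.0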


> you can create a snapshot of the volume

how? note that docker commit/export does not include volumes


This functionality for Docker was proposed by NetApp (https://netapp.io/2017/06/22/snapshots-clones-docker-volume-...)

There was also a github issue with numerous people voting for it (https://github.com/moby/moby/issues/33782)

But for whatever reason, the docker team apparently decided not to build it. We still don't have a properly easy way to snapshot or share docker volumes.

So I came up with my own method.

Basically, you have to mount the volume and then tar the parts of it that you want snapshotted (typically your data directories). You can do this with any linux machine image. Here's an example (where "foo" is the docker container you want to snapshot):

  mkdir -p docker-volume-snapshots && docker run --rm --volumes-from foo -v $(pwd)/docker-volume-snapshots:/docker-volume-snapshots debian:stretch-slim tar -czvf /docker-volume-snapshots/snapshot.tar.gz /var/lib/my_data_folder
Then, to restore the snapshot, here's what you do:

  docker run --rm --volumes-from foo -v $(pwd)/docker-volume-snapshots:/docker-volume-snapshots:delegated debian:stretch-slim /bin/sh -c "cd / && tar -xvf /docker-volume-snapshots/snapshot.tar.gz"
(Edit: Don't forget to start up/initialize and then pause/stop the running "foo" docker container before you run these commands to snapshot the volume)


FYI, in the Titan project (https://titan-data.io), we solve this problem by running a privileged storage container (with ZFS on Linux) that is then able to export volumes into other containers, giving us ZFS snapshots and clones under the hood. This was inspired by the approach taken by dotmesh (https://dotmesh.com/) but is definitely tailor-made to our use case. If someone was sufficiently motivated, I'm sure you could build a different CLI on top of our server and use it simply to manage docker volumes without the Titan trimmings (e.g. push and pull).


Looks interesting, but I doubt that'll be a consistent snapshot, right? I guess you could combine it with `docker pause`


oh, yeah, I forgot to mention that I stop the running docker container before I do this


Interesting utility. I can see a few really exciting use cases for this.

If I had one wish it'd be native support for syncing generated snapshots to S3 or Google Storage for easy transfers between authenticated systems and users. Maybe I'll take a look at building out a PR for that over the weekend if that's okay


Yes, by all means please do. Simplicity of use is very important to me, so make sure it's generally applicable and easily understandable, but other than that - go for it! Awesome.


Very interesting, my go-to workflow so far has been creating backup copies in phpMyAdmin but this seems much more convenient and reliable. I'll give it a try soon!


There are multiple tools that attempt to do this, some specific to a framework some not. To name a few,

- flyway

- liquibase

- dbmate

- alembic for python/sqlalchemy

- django migrations for python/django

- diesel migrations for rust/diesel

- echo migrations for elixir/echo

- persistent migrations for haskell/persistent

All of these, in one way or another, provide an ability to create sequential database schema changes that can be stored in version control and applied/rolled back.

What is your system doing that is different from all of these?


This is more meant as something that supplements migrations, as a “time machine” of your database so to speak. You would run it before you do any kind of potentially destructive operation on your database, such as running or rolling back a migration. Not all migrations are reversible (for example, migrations that drop columns), so you cannot easily go back to before you ran that migration. It’s also convenient when you check out an older version of your code - you can just restore the database as it was at that point in time. You could run a command in a git hook to enforce saving a new version when you tag a new release for example. In any case, it makes your database less of a black box, and it basically becomes ok to destroy it - because you can so easily make a new snapshot and go back to how it was before.


> it makes your database less of a black box, and it basically becomes ok to destroy it - because you can so easily make a new snapshot and go back to how it was before.

There are some extremely serious caveats here that need mentioning. A production DB is typically taking application writes 24/7. Restoring from a snapshot would cause data loss of any writes that occurred after the snapshot and before the restore.

You would need to also stop prod traffic and then replay the binary logs beginning from the same point as the snapshot, but that can be complicated if you're trying to reverse some DDL operation that is inherently incompatible with the transactions in the binlogs.

Also, if I'm not mistaken, your tool's invocation of mysqldump is not using the --single-transaction option. Without that, mysqldump doesn't create consistent snapshots for InnoDB tables. This means what you're dumping and restoring doesn't actually represent a single valid point-in-time of the DB's state.
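
For reference, a consistent InnoDB dump looks something like this (database name illustrative):

  # --single-transaction takes a consistent snapshot without locking InnoDB tables
  mysqldump --single-transaction --routines --triggers mydb > snapshot.sql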


You are right: this tool offers no guarantees whatsoever. Its intended use is for development and staging. You could theoretically use it in production, but not on a live database server. If you have that use case, you need to think carefully about your tools and your processes to guarantee a consistent outcome.


I'm working on a temporal document store in my spare time. If done right, and you have an index structure which indexes, let's say, each revision of a table (in my case binary XML or JSON as of now) - as in Git, for instance - you simply append a new RevisionRootPage. This page is indexed of course and points to the former revision data. It should be super fast, so you don't have to replay changes made in the meantime, even if I'd write to a distributed log for instance and would read the changes since revision n from it. I think that way you can also version the whole database easily.

https://sirix.io / https://github.com/sirixdb/sirix


That sounds like point in time recovery. Is that what it is?



Migrations allow you to version the db _schema_, not the actual rows.

I can see myself using a tool like this to streamline testing of destructive data changes.


For elixir, it’s ecto, not echo. If we are listing migration libraries, the migrations in rails’ activerecord are nice too.


Tangential, but is there anything that does this for spreadsheets?

I haven't tried to maintain one in Git before, but I've been curious about this functionality.


Dolt does this: https://github.com/liquidata-inc/dolt.

You can import from either CSV or directly from XLS.


FWIW, Google Sheets has version history (including at the cell level). I'm not sure if this is available in their API but it's pretty easy to ingest via the web.


Is this approach materially different/more scalable than using the active record, ORM, and seed file patterns supported by popular MVC frameworks (e.g., like how Rails does it https://edgeguides.rubyonrails.org/active_record_migrations....)? Isn't the crux of all that to be able to source control and easily distribute schemas and data in a database agnostic way?


This was my same thought. Really impressive as an exercise to develop this yourself, but this is core functionality of a lot of ORMs. Rails' ActiveRecord is kind of the shining example of this, with its `schema.rb` versioning file.


This is more meant as something that supplements migrations, as a “time machine” of your database so to speak. You would run it before you do any kind of potentially destructive operation on your database, such as running or rolling back a migration. Not all migrations are reversible (for example, migrations that drop columns), so you cannot easily go back to before you ran that migration. It’s also convenient when you check out an older version of your code - you can just restore the database as it was at that point in time. You could run a command in a git hook to enforce saving a new version when you tag a new release for example. In any case, it makes your database less of a black box, and it basically becomes ok to destroy it - because you can so easily make a new snapshot and go back to how it was before.


I think this is for data too, not just schemas.


But for data we can use the seed files and associated patterns, right? Since it's just text in those files, they're trivially checked into source control, diffed, distributed etc.


This doesn't tackle the primary issue I have with databases and source control - diff.

A proper source control system for databases should handle both migrations and diffing across migrations. I should be able to compare any two commits and view exactly what is different between them.


There are databases purpose-built for this, such as Dolt (https://www.dolthub.com/), itself purportedly derived from noms (https://github.com/attic-labs/noms). Because it's baked into the database, you can do complex things like diffs and merges, but it requires that you run everything on that RDBMS vs. something standard. Generic diffing between two arbitrary data states of any database is pretty challenging, even if you have all the transaction history.


Yeah, Dolt is built for distribution / collaboration of data sets, rather than for actually hosting live data in production (although it can do that). The workflow you would use to diff your database versions would be very similar to OPs: periodically take a snapshot dump of prod, then import that snapshot into Dolt. The reason this is better than just diffing the snapshot directly (i.e. textual diff where you see which INSERT statements are present in one version but not the other) is that 1) you get git-like version control semantics where each revision is associated with a person and a commit message, and 2) you can run SQL queries to inspect what's different between two revisions, rather than needing to write some analysis program.


In the case of a large db that's not really practical for data changes. It could be done easily if you just mean schema diffs, but then that's of limited extra use if you can already see the migrations themselves.


I agree a data diff would be challenging, especially at scale, but one tool I think is lacking is schema diffs. For sure one can see the sequence of migrations that were applied, but if all you have is a series of sql files that add/remove/update column definitions, by the end of one or more diffs, you may not actually know or remember what's IN the table you're trying to understand. And if you don't have prod access to show create table (or equivalent), you're left with tracing the diff operations and reconstructing the table schema yourself. Have you seen a tool that can do that?


I'm the author of a schema management tool, Skeema [1], designed to solve this problem for MySQL and MariaDB.

There are a number of other existing tools that can compare/diff schemas on 2 live databases, but Skeema is designed to also actually manage your database structure through a declarative repo of CREATE statements. It works at any scale (natively supports sharding and external OSC tools) and is trusted by several large users, including GitHub [2] and Twilio SendGrid [3].

[1] https://www.skeema.io

[2] https://fosdem.org/2020/schedule/event/mysql_github_schema/

[3] https://sendgrid.com/blog/schema-management-with-skeema/


jOOQ is an SQL library, which in addition to an internal SQL DSL for Java also includes other goodies like an SQL parser and since very recently (still under active development) also a schema diff tool (also available as CLI): https://www.jooq.org/diff/.

One thing which sets jOOQ apart from most other tools out there is the fact that it supports many different SQL dialects. Thus the schema diff tool can for instance also parse DDL in one dialect and render the diff in another SQL dialect. For certain applications this could be of interest.

Disclaimer: I am an active committer on the jOOQ project.


I run migrations locally and on dev on sanitised snapshots of live data, and have easy access to those, so I just use the db to view the schema if required. Regular snapshots of the data are useful too.

If migrations are kept small there isn't usually much confusion over what changed (see migration), or what exists (see db).


For sure you'd have to add rolling hashes, but that's probably more natural in a tree-based data store.

I've added these to https://sirix.io, which speeds up diffing considerably, especially with deep trees. Otherwise, indexing changes could be done :-)


We built a Git-style version controlled SQL database called Dolt (https://github.com/liquidata-inc/dolt) mentioned deep in this thread. It allows diffs and merges across schema and data based on commits.

We are focused on the data distribution/open data use case (https://www.liquidata.co/blog/2019-10-09-where-is-the-data-c..., https://www.liquidata.co/blog/2019-12-06-the-history-of-data...) but I think a number of use cases mentioned in this thread are covered by Dolt.

You do have to move a copy of your data to Dolt, it does not sit on top or beside of your current database. So, it is an extra step in your data modification workflow, similar to versioning code. You had to add a version control step into your code modification workflow.


Pretty nice tool :-)

But what if the data store would do all of that already?

  - retaining the history of your data:
    - free deduplication through copy on write and sharing unchanged structures
    - further storage space savings through page-level versioning (not all records in a page have to be written: just the changed/inserted/deleted ones plus, depending on the algorithm, a few more to fast-track its reconstruction)
    - read and write performance is balanced using a sliding snapshot algorithm: this avoids having to read long chains of incremental page changes, and it avoids write peaks from intermittent full page snapshots. Instead they get trailed over a few revisions (usually three or four).
    - the storage manager stores timestamps for every revision in a RevisionRootPage, which itself is indexed

  - Optional commit author and message
  - Reverting to and committing a past "good" revision is just adding a few bytes to index a new RevisionRootPage which references the past revision data (simple pointer)
  - You can diff taking rolling hashes into account
  - Currently I'm working on a diffing JSON format and a web front-end to display these in special space filling approaches
  - Sophisticated time travel queries
  - typed, user-defined and versioned secondary index structures (based on AVL trees)
  - a path summary kept up to date always
  - a lot of metadata stored and kept up to date for each record/node
It's already implemented in https://sirix.io / https://github.com/sirixdb/sirix

I'm working on getting it stable, but help from you, the Open Source community is always needed :-)


I'm working on a project looking to solve a similar problem, but using a very different mechanism to provide git-like semantics for any database that runs in a container (https://titan-data.io). We support local docker containers today and will be adding k8s support in the coming weeks to run in any Kubernetes cluster using k8s snapshots (e.g. push data state from CI/CD k8s cluster to S3 bucket, and clone locally later for forensics).

Each automation approach (SQL, dumpfiles, or storage) carries different tradeoffs, but we believe there's a real problem facing developers when it comes to managing data within their workflows. Would love to learn more about what some of the killer use cases are, and how these approaches can complement each other.


Which databases can this be used for? Based on the usage of phpMyAdmin and WordPress in the examples, I take it this only works for MySQL?


Yes, so far this is MySQL only.

I don't think adding Postgres support would be that far-fetched - have a look at https://github.com/infostreams/db/tree/master/bin/drivers/my... for example to see the meat of the load & save operations. You'd have to implement it twice though, once for local servers and once for remote servers (via ssh).
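
For the local-server case, the Postgres equivalents of the save & load operations would boil down to something like this (a sketch only - the real driver also needs to handle credentials, remote/ssh connections, etc.):

  pg_dump --format=custom mydb > snapshot.pgdump                 # save
  pg_restore --clean --if-exists --dbname=mydb snapshot.pgdump   # load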


This should be on your README (ignore if it already is and I just missed it).


I was a little annoyed at having to hunt for this information too, but scrolling the README.md a little, I found this [0]:

> At the moment, only mysql is supported.

Seems like I won't be able to test this for work since we are on SQL Server. Too bad.

[0]: https://github.com/infostreams/db/blob/master/readme.md#synt...


I think dotmesh does something similar - https://dotmesh.com/

Github link - https://github.com/dotmesh-io/dotmesh


Looks very interesting. In the README it says:

> The text mentions connection type phpmyadmin, but it doesn't actually work. Don't use it.

Will phpmyadmin support work in future? My use case is making backups of a DB where I don't have SSH access but do have a phpmyadmin account.


Yes, that's why I originally added it. After a day of hacking I had it (mostly) working on one particular instance of phpmyadmin, but when I tried it on another one it completely broke. Not sure why that was, but since I didn't have the need myself anymore I stopped developing that part.


So this just automates mysqldump and import, right?


I can't tell if it is for just the schema, or the schema + data inside the tables. If it is just the schema, I feel like there are better ways in software to define + manage database migrations (aka make them part of the repo with up/down methods to roll forward / roll back)



I'm not sure what advantage this has over putting MySQL dumps into git (or manually versioning them with filenames)?


Hi, thanks for making this. I tried to give it a shot with a remote MySQL database, but I'm not sure how to specify the connection port (I submitted an issue about it). Sorry if it's obvious and I missed it.


Seems useful as a tool for integration tests. Makes it a bit cleaner to update the state of a database and test against the application level code


Nice idea to speed up the process of development and testing.

Does the snapshot generated include the views, triggers and stored procedures?


This is a really cool idea. Thanks for sharing this. I will definitely give it a go once it has Postgres support.


Looks nice, but I don't see what advantages it has compared to liquibase or flyway.


It's not just structure, it's data, too


Well, most of the flyway migration schemes I have seen so far also cover data (both initial refdata and migrating existing when the structure changes).


If you were to take the hash and place it in a bitcoin transaction on the blockchain, you could "notarize" each hash with your private key.



