How do you end up with two identical SQL tables in the first place?

thedirt0115 · on July 28, 2023

Consider owning an ETL system where users can create periodic jobs, like calculating a "daily active user" rollup table from a bunch of event source tables. Now consider being asked to switch the query execution engine from A to B, perhaps because engine B is much more cost-effective for your system's query patterns. For example, you are tasked with switching from MapReduce to Tez, or from Presto to Spark. Before the full migration, you can reduce risk by double-writing jobs, with engine A writing to the standard output location and engine B writing to some other migration location. This should lead to two identical SQL tables, which you should then verify, because sometimes these execution engines have bugs hiding in the corners :)
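A minimal sketch of that verification step, not the article's query and not any particular engine's API. It assumes both outputs can be loaded into one SQLite database and that the two tables have the same columns in the same order. The table names `dau_engine_a` and `dau_engine_b` and the helper `table_diff` are hypothetical. It groups on every column and counts, so duplicate rows are compared too, then takes `EXCEPT` in both directions:

```python
import sqlite3


def table_diff(conn, table_a, table_b):
    """Return (rows only in A, rows only in B); the last field of each row is its count."""

    def counted(table):
        # Hypothetical helper: group on every column so duplicate rows get a count.
        cols = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        col_list = ", ".join(f'"{c}"' for c in cols)
        return f'SELECT {col_list}, COUNT(*) AS n FROM "{table}" GROUP BY {col_list}'

    # EXCEPT treats NULLs as equal, so NULL-heavy rows still match each other.
    only_a = conn.execute(f"{counted(table_a)} EXCEPT {counted(table_b)}").fetchall()
    only_b = conn.execute(f"{counted(table_b)} EXCEPT {counted(table_a)}").fetchall()
    return only_a, only_b


# Assumed sample data: engine B disagrees on one day's count.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dau_engine_a (day TEXT, users INTEGER);
CREATE TABLE dau_engine_b (day TEXT, users INTEGER);
INSERT INTO dau_engine_a VALUES ('2023-07-27', 10), ('2023-07-28', 12);
INSERT INTO dau_engine_b VALUES ('2023-07-27', 10), ('2023-07-28', 13);
""")

only_a, only_b = table_diff(conn, "dau_engine_a", "dau_engine_b")
print("only in A:", only_a)
print("only in B:", only_b)
```

Two empty lists mean the tables hold the same multiset of rows. Anything else is the exact set of rows to go chase down in engine B.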

hypercube33 · on July 28, 2023

I immediately considered it for comparison of backups.

j16sdiz · on July 28, 2023

The SQL in the article is more like brain gymnastics.

It is interesting, but not something you should use. It scales horribly with the number of columns.

abujazar · on July 28, 2023

Good point

amtamt · on July 28, 2023

or a host migration.

theodpHN · on July 28, 2023

Regression testing.

remywang · on July 28, 2023

grading homework