
It is always good to see new approaches to testing, but I don't see how this one is going to work. I've worked at multiple database companies. Diff'ing data is one of the weakest and most cumbersome ways to verify correctness.

Diffs are relatively slow; when they fail you get a blizzard of errors; and the oracles (i.e., the "good" output) have to be updated constantly as the product changes. Plus I don't see how this helps with schema migration or performance issues, which are major problems in data management. And don't get me started on handling things like dates, which change constantly and hence break diffs.

If you really care about correctness it's better to use approaches like having focused test cases that check specific predicates on data. They can run blindingly fast and give you actionable data about regressions. They're also a pain to code but are most productive in the long run.
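To give a concrete flavor of what I mean, a predicate-style test can be a few lines (the table and column names below are made up, and pandas is just for illustration):

    # Sketch of a focused, predicate-style data test. Hypothetical table and
    # column names; assumes the data fits in pandas for the sake of the example.
    import pandas as pd

    def test_orders_invariants(orders: pd.DataFrame) -> None:
        # Primary key must be present and unique.
        assert orders["order_id"].notna().all()
        assert orders["order_id"].is_unique
        # Amounts must never be negative.
        assert (orders["amount"] >= 0).all()
        # Statuses must come from the known set.
        assert orders["status"].isin({"new", "paid", "shipped", "cancelled"}).all()

Checks like these run in milliseconds and tell you exactly which invariant broke.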




Thank you for sharing!

I assume we are talking about analytical, not transactional data:

> Diff'ing data is one of the weakest and most cumbersome ways to verify correctness.

It depends on the use case: if the goal is to assess the impact of a source-code change on the resulting dataset (an extremely common ETL dev workflow in my experience), then isn't a diff the natural solution? Of course, it depends on how the results are presented. A row-by-row output for a billion-row dataset is useless. That's why we provide diff stats across columns/rows and data distribution comparisons, while allowing the user to drill down to a value-level diff if needed.
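Roughly the kind of summary we aim for, sketched here with pandas and made-up column names (this is just to illustrate the idea, not our actual implementation):

    # Illustrative column-level diff stats between two versions of a dataset,
    # joined on a primary key. Assumes both versions share the same columns;
    # NaN-vs-NaN pairs count as changes here, which a real tool would handle.
    import pandas as pd

    def diff_stats(old: pd.DataFrame, new: pd.DataFrame, key: str) -> dict:
        merged = old.merge(new, on=key, how="outer",
                           suffixes=("_old", "_new"), indicator=True)
        both = merged[merged["_merge"] == "both"]
        return {
            "rows_only_in_old": int((merged["_merge"] == "left_only").sum()),
            "rows_only_in_new": int((merged["_merge"] == "right_only").sum()),
            "changed_values_per_column": {
                col: int((both[f"{col}_old"] != both[f"{col}_new"]).sum())
                for col in old.columns if col != key
            },
        }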

> Diffs are relatively slow

In general, yes; that's why we've implemented configurable sampling. In the majority of cases, the developer is looking to assess the magnitude of the difference and spot certain patterns, for which you don't need a large sample size. Our customers typically use ~1/10000 of the target table's row count as a sample size.
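The important detail is that the sample has to be deterministic and keyed, so the same rows are picked on both sides of the diff. Conceptually it's something like this (the hash choice and rate here are just for illustration):

    # Deterministic, key-based sampling: keep ~1 row in 10,000, chosen by a
    # stable hash of the primary key so both versions sample identical rows.
    import zlib

    SAMPLE_RATE = 10_000

    def in_sample(primary_key: str) -> bool:
        return zlib.crc32(primary_key.encode("utf-8")) % SAMPLE_RATE == 0

In practice you'd push this down into the warehouse as SQL (a fingerprint hash of the key modulo N) rather than doing it client-side.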

> when they fail you get a blizzard of errors

We try to fail gracefully :)

> I don't see how this helps with schema migration or performance issues.

For schema migration, you can verify that nothing has changed in your dataset besides the intended schema changes (unintended changes have certainly happened on my watch).
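Conceptually it boils down to a check like this (pandas and the expected_added/expected_removed sets below are just for illustration, not our implementation):

    # Sketch: after a migration, confirm only the intended columns were added
    # or removed, and that the untouched columns still carry identical data.
    import pandas as pd

    def check_migration(old: pd.DataFrame, new: pd.DataFrame,
                        expected_added: set, expected_removed: set) -> None:
        added = set(new.columns) - set(old.columns)
        removed = set(old.columns) - set(new.columns)
        assert added == expected_added, f"unexpected new columns: {added - expected_added}"
        assert removed == expected_removed, f"unexpectedly dropped: {removed - expected_removed}"
        # Columns untouched by the migration should carry identical data.
        common = sorted(set(old.columns) & set(new.columns))
        pd.testing.assert_frame_equal(
            old[common].sort_values(common).reset_index(drop=True),
            new[common].sort_values(common).reset_index(drop=True),
        )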

> or performance issues

We certainly don't claim to solve all DBA issues with diff, but here's a real example from one of our customers: they are optimizing their ETL jobs in BigQuery to lower their GCP bill by reducing query runtime. After refactoring the code, they diff the production output against the new code's output to ensure the data produced hasn't been affected.
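A rough manual version of that check looks something like this (not our product's implementation, just the idea; the table names are placeholders and it assumes both queries produce the same columns):

    # Count rows that exist in one version of the output but not the other,
    # in both directions. Requires the google-cloud-bigquery client library.
    from google.cloud import bigquery

    QUERY = """
    SELECT 'only_in_prod' AS side, COUNT(*) AS n FROM (
      SELECT * FROM `project.dataset.orders_prod`
      EXCEPT DISTINCT
      SELECT * FROM `project.dataset.orders_refactored`
    )
    UNION ALL
    SELECT 'only_in_refactored', COUNT(*) FROM (
      SELECT * FROM `project.dataset.orders_refactored`
      EXCEPT DISTINCT
      SELECT * FROM `project.dataset.orders_prod`
    )
    """

    client = bigquery.Client()
    for row in client.query(QUERY).result():
        print(row["side"], row["n"])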

> If you really care about correctness it's better to use approaches like having focused test cases that check specific predicates on data.

Possibly, but

> They're also a pain to code

...which is often a prohibitive pain point if you have 300+ analytical datasets with 50+ columns each (a common layout for companies of 250+ people).

And there's another problem: the more test cases you have, the more failures you get on every run. Unlike unit testing of application code, you can't expect the cases to stay relevant, since the data changes constantly. So those "unit testing" suites require constant maintenance, and as soon as you stop keeping them up to date, their value drops to zero.

I think diffing and "unit testing" are complementary approaches, and neither is a panacea. So my recommendation has been to use both: 1) specific test cases to validate the most important assumptions about the data, and 2) a diff tool for regression testing.


I'm unconvinced your approach works beyond a narrow range of use cases. The weakness is the "is this a problem" issue. You have a diff. Is it really significant? If it's significant, how did it arise? You can spend an inordinate amount of time answering those two questions, and you may have to do it again with every run. Diffs are cheap to implement but costly to use over time. That inversion of costs means users may end up bogged down maintaining the existing mechanism and unable to invest in other approaches.

If I were going after the same problem I would try to do a couple of things.

1. Reframe the QA problem to make it smaller. Reducing the number and size of pipelines is a good start. That has a bunch of knock-on benefits beyond correctness.

2. Look at data cleaning technologies. QA on datasets is a variation on this problem. For example, if you can develop predicates that check for common safety conditions on data, like detecting bad addresses or SSANs, you give users immediately usable quality information. There's a lot more you can do here.
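A tiny sketch of what I mean by safety conditions (the regex, threshold, and column names are purely illustrative):

    # Flag rows with a malformed SSN or an obviously unusable address.
    # Regex, threshold, and column names are placeholders.
    import pandas as pd

    def flag_unsafe_rows(df: pd.DataFrame) -> pd.DataFrame:
        bad_ssn = ~df["ssn"].astype(str).str.match(r"^\d{3}-\d{2}-\d{4}$")
        bad_address = df["address"].isna() | (df["address"].str.len() < 5)
        return df[bad_ssn | bad_address]

Predicates like these are reusable across datasets and give users something actionable immediately, rather than a pile of raw diffs to interpret.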

Assuming you are working on this project, I wish you good luck. You can contact me at rhodges at altinity dot com if you want to discuss further. I've been dealing with QA problems on data for a long time.


P.S. To expand on #2: if you can "discover" useful safety conditions on data, you change the economics of testing, much as #1 does.



