By large tables we mean 1B+ rows. You likely don't want to do joins on them directly.
> You just do an either-way except
If one column in one of the tables is completely different, it'll return 1M rows. That's not exactly helpful, so let's add an order by and limit. Now you'll need to spot differences in a 100-column table, since hiding matching values is even more hand-written / generated SQL.
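To make that concrete, the quoted either-way EXCEPT amounts to something like this sketch (Postgres-style syntax; the table names and the "id" column are placeholders, nothing specific):

    -- Rows in table_a that aren't in table_b, and vice versa,
    -- with the order by / limit bolted on to tame the output.
    (SELECT * FROM table_a EXCEPT SELECT * FROM table_b)
    UNION ALL
    (SELECT * FROM table_b EXCEPT SELECT * FROM table_a)
    ORDER BY id
    LIMIT 100;

One bad column still turns every row into a "difference", which is why the limit alone doesn't help much.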
It's pretty good to know about all problems upfront and not find and fix them one-by-one, but creating a representative sample of differences is difficult to do with hand-written SQL.
Then what about cross-db comparison, sampling, integrating with CI / github PRs?
> There are easy solutions with checksums, error correction (comes free with networks) or round-tripping, is that a problem
Some companies are using complex pipelines that can and sometimes do lose events. Often they don't need their data to be 100% perfect. So they need to know if they lost anything, what exactly, and whether it's within an acceptable range.
As always, roll-your-own is a valid approach, as is using already available tools to save time. Many large companies like Uber and Spotify, with data teams of 100+, have detailed roadmaps for diffing tools because they find their data engineers spending days on regression testing, and those engineers are certainly proficient with SQL.
> You likely don't want to do joins on them directly.
I can't see that you have any alternative. To find any discrepancy in either table, you have to compare them both completely. Is there an alternative?
(edit: actually there is, the checksumming I mentioned earlier. I've never needed to do that.)
> so let's add an order by and limit. Now you'll need to spot differences in a 100-column table, since hiding matching values is even more hand-written / generated SQL.
Yes, I added a limit for this too, and spotting two different columns out of a hundred, well it's not a big deal, but having done that I agree some GUI/other assistance would make it a lot nicer.
> It's pretty good to know about all problems upfront and not find and fix them one-by-one...
Well, you compare the schemas first, and since they're small you can do this very quickly, so you can pick up errors there one after the other very easily. After that you compare the data, and I suppose you can run multiple queries one after the other in a batch file and check them when done. I guess I never had enough data to need to do that.
> ...but creating a representative sample of differences is difficult to do with hand-written SQL.
I'm not sure what you mean.
> Then what about cross-db comparison,
Straightforward. In fact, if you aren't comparing data across DBs then you aren't doing ETL. (edit: ok, I think ISWYM. Solutions are roundtripping if the amount of data is not overwhelming, checksumming otherwise.)
> sampling, integrating with CI / github PRs?
I don't understand why any of these are needed, but I'm pretty certain that reflects on my lack of experience, so okay.
> Some companies are using complex pipelines that can and sometimes do lose events
Then you've got a software quality problem. It's also not difficult (with joins against primary keys) to efficiently pick up the differences and to make things idempotent - just re-run the script. I've done that too.
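For illustration, a minimal sketch of what I mean (hypothetical "source" / "target" tables keyed by "id"):

    -- Pick up the difference: rows that never made it downstream.
    SELECT s.id
    FROM source s
    LEFT JOIN target t ON t.id = s.id
    WHERE t.id IS NULL;

    -- Idempotent repair: re-running this only inserts rows that are still missing.
    INSERT INTO target
    SELECT s.*
    FROM source s
    LEFT JOIN target t ON t.id = s.id
    WHERE t.id IS NULL;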
> Often they don't need their data to be 100% perfect. So they need to know if they lost anything, what exactly, and whether it's within an acceptable range.
Very good point.
> As always, roll-your-own is a valid approach, as is using already available tools to save time
This leads to a very interesting question about upskilling employees versus the cost of tools - good training (say, in SQL) is somewhat expensive, but lack of knowledge is far more expensive. I don't think this is the right place to have such a discussion, and I don't want to detract from your product (I do seem to have dissed it, which was not my intention).
> Many large companies like Uber and Spotify
Okay, if we're talking on the scale of companies like this then we're talking about something well outside my experience.
I'll leave it here, I think I've distracted enough from your launch, and I wish you well with it!
Thank you for asking deep questions and providing specific examples! The degree of scale and complexity varies significantly between companies, so we definitely can't claim that every ETL developer should use such a tool.
Philosophically, I see our diff tool playing a similar role to autocomplete/autorefactor in an IDE. Can you type/change every word manually? Yes. Is it a good use of your time though?
> you have to compare them both completely. Is there an alternative?
When I was doing this stuff for a telco 10 years ago (comparing before/after for CDR mediation changes), I found it was much faster to dump the two x00M-row tables as CSV, sort them, and then use a Perl script to compare them on a row-by-row basis. The join approach taken by the Oracle expert took many hours; my dump-sort-scan took under an hour.
Yeah, I've no idea how it was that much slower, but they were reasonably sized tables (x00M rows) with tens of columns and it was basically a full join on each column - you'd probably need an index on every column to make it sensible?
[edit: And with CDR mediation, it's not as simple as "do all the columns match exactly?", because some of them are going to differ in certain allowed ways that still need to be considered a match, and you also need to be able to identify which rows differ in one column to say "these look the same but XYZ is different". Which is probably why the query was horrendous.]
> I can't see that you have any alternative. To find any discrepancy in either table, you have to compare them both completely. Is there an alternative?
The alternative is sampling. With analytical data you often have to accept that the data isn't perfect, and that it doesn't need to be. So if it can be guaranteed that the amount of "bad" data is, say, less than 0.001%, then it may just be good enough. So cheap checks, like "how many PKs are null", can be done on full tables, while anything that requires joins or mass-comparing all values can be done on sampled subsets.
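As a rough sketch of the split (Postgres-flavoured syntax; the table and column names are made up):

    -- Cheap checks run over the full tables:
    SELECT COUNT(*) AS null_pks
    FROM events_a
    WHERE id IS NULL;

    -- Heavier value-by-value comparison runs on a sample of rows:
    SELECT a.id
    FROM events_a AS a TABLESAMPLE SYSTEM (0.1)
    JOIN events_b AS b USING (id)
    WHERE a.amount IS DISTINCT FROM b.amount;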
>> ...but creating a representative sample of differences is difficult to do with hand-written SQL.
> I'm not sure what you mean.
Let's say column "a" has completely different values in the two tables. Column "b" has differences in 0.1% of rows. If we limit the diff output to 100 rows in a straightforward way, most likely it won't capture the differences in column "b" at all. So you'll need to fix column "a" first or remove it from the diff, then you'll discover that "b" has problems too, then the issues with some column "c" may become visible. With the right sampling you'll get 20 examples for each of "a", "b", "c" and whatever else is different, and can fix the ETL in one go. Or maybe it'll help uncover the root issue behind those discrepancies faster. One may very well do without this feature, but it saves time and effort.
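Hand-written, that per-column sampling ends up as one branch per column, something like this sketch (illustrative names, Postgres-style syntax), which is exactly what becomes painful at 100 columns:

    (SELECT 'a' AS col, x.id FROM t1 x JOIN t2 y USING (id)
     WHERE x.a IS DISTINCT FROM y.a LIMIT 20)
    UNION ALL
    (SELECT 'b', x.id FROM t1 x JOIN t2 y USING (id)
     WHERE x.b IS DISTINCT FROM y.b LIMIT 20)
    UNION ALL
    (SELECT 'c', x.id FROM t1 x JOIN t2 y USING (id)
     WHERE x.c IS DISTINCT FROM y.c LIMIT 20);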
Generally that's what the tool does: it saves time and effort. The time and effort of writing one-off SQL, or of building more generic scripts that do codegen, or of building a set of scripts that handle edge cases, have a nice UX and can be used by teammates, or of putting a GUI on top of all that.
> Solutions are roundtripping if the amount of data is not overwhelming, checksumming otherwise
Agree, checksumming or sampling again. Since data is often expected to have small amounts of discrepancies, checksums would need to be done on blocks, so you can drill down into the failed blocks and see which rows caused the failures.
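Something along these lines, as a Postgres-flavoured sketch with made-up names, run on both sides and compared block by block:

    SELECT
      id / 10000 AS block,
      COUNT(*)   AS row_count,
      MD5(STRING_AGG(CONCAT(id, '|', amount, '|', status), ',' ORDER BY id)) AS block_hash
    FROM events
    GROUP BY id / 10000;

Blocks whose hashes disagree are the only ones worth pulling row by row.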
> Then you've got a software quality problem. It's also not difficult (with joins against primary keys) to efficiently pick up the differences and to make things idempotent - just re-run the script. I've done that too.
Agree, but software quality problems are bound to happen: requirements change, and what was once a perfect architecture becomes a pain point. Attempting to fix some piece of the system often requires creating a new one from scratch and running both side by side during a transitional period. Some companies begin to realize that they don't have enough visibility into their data migration processes; some understand the need for tooling but can't afford to spend the time or resources to build it in-house, especially if they'll only need it during a migration. If we can help companies by providing a ready-to-go solution, that's a great thing.
> This leads to a very interesting question about upskilling employees versus the cost of tools
Right, it's a complicated question involving multiple tradeoffs...
I'll print this out; it deserves careful reading, which I'll do shortly, so for now I'll reply briefly on one point.
I've worked with data where a small amount of data loss was tolerable and indeed inevitable (once in the system it was entirely reliable, but incoming data was sometimes junk and had to be discarded, and that was OK if occasional), so I understand.
But a third-party tool that can be configured to tolerate a small amount of data loss could put you at a legal disadvantage if things go wrong, even if the loss is within the spec given for your product. If you have a very small amount of lost data, then checksumming and patching the holes when transferring data might be a very good idea, legally speaking rather than data-speaking, and low overhead too.
Also, you might just be assuming that lost data is uncorrelated, i.e. scattered randomly throughout a table. Depending on where it's lost during the transfer, say some network outage, it may be lost in coherent chunks. That might, or might not, matter.