Except that NULL is not the same as UNKNOWN! NULL is a data value (like integers) that can appear in data expressions (like NULL + 1) and comparisons (like NULL = 1) whereas UNKNOWN is a truth value that can appear in boolean/logical expressions constructed from logical connectives like AND, OR, NOT.
A data expression always evaluates to a data value, and usually whenever any part of the expression is NULL, the entire expression evaluates to NULL.
A comparison evaluates to a truth value, and usually when a comparison invovles a NULL it returns UNKNOWN. This leads to weird behaviors where both `SELECT 3 WHERE NULL = NULL;` and `SELECT 3 WHERE NULL <> NULL;` returns nothing (because the query engine does not output a row if the predicate returns UNKNOWN on it).
What you listed above only comes into play for boolean/logical connectives like AND, OR, NOT, and in that case we follow 3-valued logic.
And there's more annoying corner cases when you deal with DISTINCT. The situation is so hopeless that SQLite has a whole table documenting divergent behaviors of NULL in different systems: https://www.sqlite.org/nulls.html
Indeed, they're not identical - that's why I just said "based on", and that's likely why the word UNKNOWN itself isn't used in SQL.
Nevertheless I find it a useful intuition pump. I wager that most people reading `UNKNOWN = UNKNOWN` or `UNKNOWN <> UNKNOWN` and thinking about the examples above would stop and say, "Wait, I actually don't know the value of that statement for sure either, since the LHS and the RHS could be completely different things," and would then double check what their SQL dialect would actually do in this situation.
That leads to an even more confusing point, that some systems (at least SQLite) overloads NULL to mean UNKNOWN, for example `SELECT 1 WHERE NULL = (NULL = NULL)`.
And the dangerous thing about NULLs is not when they are explicitly used in the query as a literal (as I did for brevity), but when they appear in tables. It's perfectly reasonable to assume `SELECT COUNT( * ) FROM t;` should be the same as `SELECT COUNT( * ) from t WHERE t.x = t.x OR t.x <> t.x`, but they are not the same because the latter does not return NULL rows. This has lead to real query optimizer bugs. For more examples see this paper https://dl.acm.org/doi/10.14778/3551793.3551818
ANSI SQL has had IS [NOT] UNKNOWN since SQL:1999 [1]. It's an optional feature that some databases don't support. Postgres, MySQL, and MSSQL do support it.
This is fantastic! I’ve been looking for something like this to take on my runs forever, as I hate bringing my phone while running. Also for people who want to replace their phone with a smartwatch, the only thing missing is this.
Author here! This was mostly intended for entertainment, and partly to demonstrate the idea of "relations as vectors" and "queries as polynomials". But I guess I'll indulge myself a bit more:
> Just use sqldiff
sqldiff is sensitive to ordering, e.g., it'll say the relation [1, 2] is different from [2, 1] (I consider them to be the same because they are the same multiset). You'd need to sort with ORDER BY first, but that also requires listing all attributes explicitly like the GROUP BY solution (ORDER BY * doesn't work).
> What about CHECKSUM
It's also sensitive to ordering, and I was told different tables can have the same CHECKSUM (hash collisions?).
> Are the tables the same if they only differ by schema?
I'd say no. Perhaps a better definition of "the same" is that all SQL queries (using "textbook" SQL features) return the same result over the tables, if you just replace t1 with t2 in the query. Wait, but how do you know if the results are the same... :)
> There are better ways to compare tables
Absolutely, my recursive query runs in time O(N^N) so I'm sure you can be a little better than that.
> Perhaps a better definition of "the same" is that all SQL queries (using "textbook" SQL features) return the same result over the tables, if you just replace t1 with t2 in the query. Wait, but how do you know if the results are the same... :)
It's actually still a useful definition! (Assuming we're talking about all deterministic SQL queries and can define precisely what we mean by that!)
It's a useful definition because it includes all possible aggregations as well, including very onerous ones like ridiculous STRING_AGG stuff. Those almost certainly be candidates for reasonable queries to solve your original problem, but they are useful in benchmarking whether a proposed solution is accurate.
tl;dr: SQL hates you and it's your fault it hates you.
to elaborate: I'm rapidly appreciating that SQL is a "bondage and discipline" language where it is impossible for SQL to fail at any given task, it can only be failed by its users/developers.
edit: further, it occurs to me also that SQL hates you because in a sane language you'd be able to write up some generic method that says "compare every element of these collections to each other" that was reusable for all possible collections. But try defining a scalar user-defined function that takes two arbitrary resultsets as parameters. But not only can I not do that, I'm a bad person for wanting that because it violates Relational Theory.
It doesn't violate relational theory at all -- it merely violates the query compiler's requirements about what must be known at the time of query compilation.
You can absolute write a query that does what you want (you need a stored procedure that generates ad-hoc functions based on the involved tables' signatures), but stored sql objects must have defined inputs and outputs. What you're asking for is a meta-sql function that can translate to different query plans based on its inputs, and that's not allowed because its performance cannot be predetermined.
A data expression always evaluates to a data value, and usually whenever any part of the expression is NULL, the entire expression evaluates to NULL.
A comparison evaluates to a truth value, and usually when a comparison invovles a NULL it returns UNKNOWN. This leads to weird behaviors where both `SELECT 3 WHERE NULL = NULL;` and `SELECT 3 WHERE NULL <> NULL;` returns nothing (because the query engine does not output a row if the predicate returns UNKNOWN on it).
What you listed above only comes into play for boolean/logical connectives like AND, OR, NOT, and in that case we follow 3-valued logic.
And there's more annoying corner cases when you deal with DISTINCT. The situation is so hopeless that SQLite has a whole table documenting divergent behaviors of NULL in different systems: https://www.sqlite.org/nulls.html