remywang's comments

remywang · 2025-01-10T04:23:47 1736483027

Except that NULL is not the same as UNKNOWN! NULL is a data value (like integers) that can appear in data expressions (like NULL + 1) and comparisons (like NULL = 1) whereas UNKNOWN is a truth value that can appear in boolean/logical expressions constructed from logical connectives like AND, OR, NOT.

A data expression always evaluates to a data value, and usually whenever any part of the expression is NULL, the entire expression evaluates to NULL.

A comparison evaluates to a truth value, and usually when a comparison invovles a NULL it returns UNKNOWN. This leads to weird behaviors where both `SELECT 3 WHERE NULL = NULL;` and `SELECT 3 WHERE NULL <> NULL;` returns nothing (because the query engine does not output a row if the predicate returns UNKNOWN on it).

What you listed above only comes into play for boolean/logical connectives like AND, OR, NOT, and in that case we follow 3-valued logic.

And there's more annoying corner cases when you deal with DISTINCT. The situation is so hopeless that SQLite has a whole table documenting divergent behaviors of NULL in different systems: https://www.sqlite.org/nulls.html

hiAndrewQuinn · 2025-01-10T06:29:36 1736490576

Indeed, they're not identical - that's why I just said "based on", and that's likely why the word UNKNOWN itself isn't used in SQL.

Nevertheless I find it a useful intuition pump. I wager that most people reading `UNKNOWN = UNKNOWN` or `UNKNOWN <> UNKNOWN` and thinking about the examples above would stop and say, "Wait, I actually don't know the value of that statement for sure either, since the LHS and the RHS could be completely different things," and would then double check what their SQL dialect would actually do in this situation.

remywang · 2025-01-10T07:04:05 1736492645

> the word UNKNOWN itself isn't used in SQL

That leads to an even more confusing point, that some systems (at least SQLite) overloads NULL to mean UNKNOWN, for example `SELECT 1 WHERE NULL = (NULL = NULL)`.

And the dangerous thing about NULLs is not when they are explicitly used in the query as a literal (as I did for brevity), but when they appear in tables. It's perfectly reasonable to assume `SELECT COUNT( * ) FROM t;` should be the same as `SELECT COUNT( * ) from t WHERE t.x = t.x OR t.x <> t.x`, but they are not the same because the latter does not return NULL rows. This has lead to real query optimizer bugs. For more examples see this paper https://dl.acm.org/doi/10.14778/3551793.3551818

atombender · 2025-01-11T12:23:50 1736598230

ANSI SQL has had IS [NOT] UNKNOWN since SQL:1999 [1]. It's an optional feature that some databases don't support. Postgres, MySQL, and MSSQL do support it.

[1] https://modern-sql.com/concept/three-valued-logic#compatibil...

remywang · 2024-06-04T02:57:20 1717469840

This is fantastic! I’ve been looking for something like this to take on my runs forever, as I hate bringing my phone while running. Also for people who want to replace their phone with a smartwatch, the only thing missing is this.

remywang · on July 28, 2023

grading homework

remywang · on July 28, 2023

My guess is that t1 = t2 seems cheap but is actually expensive, so you might accidentally write very slow queries if it were built into the language.

Pxtl · on July 28, 2023

Uh, that ship has sailed. Sooooo sailed. It is incredibly trivially easy to write unsargable queries in SQL.

remywang · on July 28, 2023

Author here! This was mostly intended for entertainment, and partly to demonstrate the idea of "relations as vectors" and "queries as polynomials". But I guess I'll indulge myself a bit more:

> Just use sqldiff

sqldiff is sensitive to ordering, e.g., it'll say the relation [1, 2] is different from [2, 1] (I consider them to be the same because they are the same multiset). You'd need to sort with ORDER BY first, but that also requires listing all attributes explicitly like the GROUP BY solution (ORDER BY * doesn't work).

> What about CHECKSUM

It's also sensitive to ordering, and I was told different tables can have the same CHECKSUM (hash collisions?).

> Are the tables the same if they only differ by schema?

I'd say no. Perhaps a better definition of "the same" is that all SQL queries (using "textbook" SQL features) return the same result over the tables, if you just replace t1 with t2 in the query. Wait, but how do you know if the results are the same... :)

> There are better ways to compare tables

Absolutely, my recursive query runs in time O(N^N) so I'm sure you can be a little better than that.

tshaddox · on July 28, 2023

> Perhaps a better definition of "the same" is that all SQL queries (using "textbook" SQL features) return the same result over the tables, if you just replace t1 with t2 in the query. Wait, but how do you know if the results are the same... :)

It's actually still a useful definition! (Assuming we're talking about all deterministic SQL queries and can define precisely what we mean by that!)

It's a useful definition because it includes all possible aggregations as well, including very onerous ones like ridiculous STRING_AGG stuff. Those almost certainly be candidates for reasonable queries to solve your original problem, but they are useful in benchmarking whether a proposed solution is accurate.

justinclift · on July 28, 2023

Not sure if Go code is your kind of thing. But if it is my colleague wrote a SQLite "diff" capability for our online SQLite hosting service:

https://github.com/sqlitebrowser/dbhub.io/blob/5c9e1ab1cfe0f...

The code there can also output a "merge" object out of the differences too, in order to merge the differences from one database object into another.

bigbonch · on July 28, 2023

When you say "bag semantics" are you just referring to Set Theory? Is there a formal distinction between bags and sets?

pkkim · on July 28, 2023

With bag (multiset) semantics, {1, 2, 2} is different from {1, 2}. With set semantics, they're the same.

Pxtl · on July 28, 2023

tl;dr: SQL hates you and it's your fault it hates you.

to elaborate: I'm rapidly appreciating that SQL is a "bondage and discipline" language where it is impossible for SQL to fail at any given task, it can only be failed by its users/developers.

edit: further, it occurs to me also that SQL hates you because in a sane language you'd be able to write up some generic method that says "compare every element of these collections to each other" that was reusable for all possible collections. But try defining a scalar user-defined function that takes two arbitrary resultsets as parameters. But not only can I not do that, I'm a bad person for wanting that because it violates Relational Theory.

tremon · on July 28, 2023

because it violates Relational Theory

It doesn't violate relational theory at all -- it merely violates the query compiler's requirements about what must be known at the time of query compilation.

You can absolute write a query that does what you want (you need a stored procedure that generates ad-hoc functions based on the involved tables' signatures), but stored sql objects must have defined inputs and outputs. What you're asking for is a meta-sql function that can translate to different query plans based on its inputs, and that's not allowed because its performance cannot be predetermined.

Pxtl · on July 30, 2023

C has had this problem solved with preprocessor macros for 50 years.

BrentBrewington · on July 28, 2023

You should definitely check out dbt :)

Some links in my comment: https://news.ycombinator.com/item?id=36911937