> but it's the twin meanings of NULL that really kills me
NULL only has one meaning: NULL. This is roughly analogous to unknown.
The one that his a lot of people is WHERE <value> NOT IN (<set>) where <set> contains a NULL. Because NOT IN unrolls to “<value> <> <s1> AND <value> <> <s2> AND … AND <value> <> <sx>” any NULL values in the set makes one predicate NULL which makes the whole expression NULL even if one or more of the other values match.
> I have no way of knowing if that means there's a record [5, NULL] in table y, or if there's no record in table y that starts with 5.
Not directly, but you can infer the difference going by the value of (in your example) y.a - if it is NULL then there was no match, otherwise a NULL for y.b is a NULL from the source not an indication of no match.
> SQL should have specified another NULL-like value
This sort of thing causes problems of its own. Are the unknowns equivalent? Where are they relevant? How do they affect each other? Do you need more to cover other edge cases? I have memories of VB6's four constants of the apocalypse (null, empty, missing, nothing).
This is one of the reasons some purists argue against NULL existing in SQL at all, rather than needing a family of NULL-a-likes.
> NULL only has one meaning: NULL. This is roughly analogous to unknown.
Which is what I'm arguing against, because it's used in two unrelated ways in SQL -- as a data value, and to express no matching row found in a JOIN. So no matter how it's defined formally, in practice it has two meanings that have nothing whatsoever to do with each other. One is a value and can be stored, the other says 'not found' and results only from expressions and isn't for storage.
> but you can infer the difference
Yes, as I said there are lots of workarounds. But they're still workarounds.
> This is one of the reasons some purists argue against NULL existing in SQL at all
Sure, but the reality is that NULL is used in a practical sense to mean "no data entered". It's so ridiculously common that only some columns have no data for a particular row, and so e.g. you need the 'time_finished' column to either have a valid date, or it's NULL if the activity hasn't finished yet. The alternative is to have an additional boolean column 'is_finished' and for 'time_finished' to be arbitrarily '1970-01-01T00:00:00Z' whenever is_finished is false, which is clunky and redundant.
Purists can argue what they want, but NULL is so ridiculously useful as a stored data value it doesn't really matter.
> you need the 'time_finished' column to either have a valid date, or it's NULL if the activity hasn't finished yet
Rust’s Diesel has an interesting way of dealing with this. “None” means missing while “Some(None)” means the value null. So when updating a record, “None” makes no change while “Some(None)” sets the value to null.
This distinction is obviously lost when retrieving from the database, but it’s an interesting concept.
But in classic Javascript style, it's half-assed. Being able to assign anything to undefined makes the distinction between null and undefined pointless.
If you have a business need to represent "empty" or "n/a" or "declined to answer" or something like that, use a specific value for that. NULL does not mean anything. Or, it means nothing. It's just NULL. Once I got that into my head, SQL became less frustrating.
You can't without defining yet another field, usually. It's really annoying to double the number of columns so that [time_started, time_finished] becomes [time_started_exists, time_started, time_finished_exists, time_finished].
NULL values makes business logic far more compact and intuitive. For enumerated values in fields it's easy enough to define another value in the same field to mean 'unknown' or 'not entered', but you can't do that for strings, numbers, datetimes, etc. -- you have to throw in a bunch of unwieldy additional boolean fields instead.
The underlying problem here is that SQL lacks Sum Types (aka Tagged Unions). Such types solve all these problems effortlessly.
In contrast to what SQL has, Sum Types combined with Product Types (which is basically what a row is) are actually a universal way to model all possible data[0]. (Of course you may want syntax sugar, etc. on top of that, but Sums and Products at the bottom is sufficient.)
[0] I'm actually not sure if I should qualify that -- I believe Sums+Products can actually model anything, assuming you allow recursive type definitions -- which might be hard to make perform well. Storing a linked list in a database field, e.g. might not be the best idea.
That may be implementation or circumstance specific. For instance in MS SQL Server with a heap table (one without a clustered index) or a table where the primary key is not the clustering key, it will result in extra page reads to check the other field's value (the query planner / engine could infer from it being the PK that it can never be null, so the lookup to check is unnecessary, but IIRC it does not do this). As the columns used in the join predicate have to be read to perform the join, no extra reads will result from using them for other filtering.
In your example it is very likely that the primary key is the clustering key, so will be present in the non-clustered index that I assume will be on answers.questionId, making my point moot, but if for some unusual reason neither Id nor questionId were the clustering key checking Id may result in extra reads being needed.
In DBMSs without clustering keys implemented similarly to SQL Server, there may be such concerns in all cases.
I assume NULL would still exist there, but there would be no explicit NULL values permitted in tuples (rows) that actually exist.
Or perhaps the purists would remove outer joins too, it isn't since University that I've read around the discussion, but given alternate syntax to do the same thing can sometimes be convoluted that might be a bad idea itself.
In the relational model, no there is no such thing as "optional values".
Even Codd often hinted (quite a bit of material in the 1990 book gives such hints) that "if a row has a null somewhere, it SHOULDN'T BE CONSIDERED AS BELONGING TO THE RELATION". E.g. if an attribute of a FK is null, no FK checking should be done at all on account of that null. The FK checking should be done only when the FK value is "complete".
As for outer join, one problem with it is that its result is (intended to be) the extension of a disjuctive predicate, the disjuncts in question being exactly the predicate for the "matching" case and the other one the predicate for the "no match" case. Now if you see a row in the result, how can you tell from that row which of the two disjuncts made that row appear in the result ?
Outer join is problematic by definition precisely because of the possibly ambiguous interpretation that derives from its disjunctive predicate.
(Would you do that with base tables ??? I mean design a base table that can hold a row for strictly more than one possible reason ??? I mean where the user then sees a row in that table and subsequently cannot tell whether that row is there for reason/meaning 1 or for reason/meaning 2 ??? I'll tell what you -and anyone else- would do in such a case : you'd also add the indicator telling whether it's meaning 1 or 2 (and then the problem of 'what if it's both' is also solved). Now think back about what could be done in outer join to resolve that very same kind of disjunctive ambiguity.)
Yes. The reason is lazy people expect to be able to ask five questions in one go and expect one single answer to answer all five of them.
The key is to realize that submitting a query is to ask a question. So how many queries to you think you have to issue if the number of questions you have is five ?
It is technically perfectly possible to "ask five questions at the same time" to any given truly relational DBMS. Said DBMS will give five answers at the same time (and not leave you to guess which answer corresponds to which question). That SQL doesn't do it, does not mean it's impossible.
You've got a list of countries and are pulling each country's national flower, national bird, largest port city etc.
Without outer joins, Liechtenstein with no ports doesn't show in the list at all. Sad news for people who want to know all countries, or Liechtenstein's national bird (eagle).
What's your point? You can live without nulls, they're not required and you don't need to avoid WHERE-clause for that. It's a theoretical concept, nobody in sane mind would do that, but nulls are not required for relational algebra.
You started this subthread by asserting that one shouldn't use outer joins, and when other people pointed out valid use cases for outer joins, you advocated to use application logic to work around your refusal to use them.
My response was just to point out the absurdity of your premise.
"NULL only has one meaning: NULL. This is roughly analogous to unknown."
From the re-published version of "Much ado about nothing, part 2" (re-published in "Database Dreaming, Vol II) :
"An outer join produces nulls because we ask the DBMS to show us data from one table even when no matching data can be found in another table. Such nulls have no meaning, no semantic content."
Personally I'd add to that : and even if it's the case that they do, that meaning is patently not "simply unknown" but rather a very clear indication of non-existence.
But I have no doubt the apologists will obviously always apologize no matter what and handwave arguments such as these away with "So what ? What's the difference." Well, it's the difference between knowing of non-existence and not knowing at all.
"The one that his a lot of people is WHERE <value> NOT IN (<set>) where <set> contains a NULL. Because NOT IN unrolls to “<value> <> <s1> AND <value> <> <s2> AND … AND <value> <> <sx>” any NULL values in the set makes one predicate NULL which makes the whole expression NULL even if one or more of the other values match."
Sorry, but this is FALSE.
"Even if one of the other values match" will make that particular INequality test FALSE and one FALSE conjunct will make the entire conjunction FALSE, even in 3VL.
WHERE NULL=NULL should never return true, so no rows returned is correct. It is only implementation defined in that some implementations get this wrong!¹³
SQL Server's handling of NULLs with in unique indexes/constraints is not standard. You can get better using a filtered unique index (“CREATE UNIQUE INDEX more_compliant_unique_index ON dbo.some_table(this_column) WHERE this_column IS NOT NULL;”), but be aware that filtered indexes have their own collection of gotchas⁴.
----
[1] MS Access IIRC, MS SQL Server if the ANSI_NULLS option set to off, MS² SSIS's default (consider NULLS equal) behaviour for merge steps, and so on.
[2] You may be noticing a pattern here… MS are not the only culprit though.
[3] I'm making a distinction here between places where behaviour is not defined in the standards, so implementations are free to go their own way, and where implementations go against what is expected by the standards.
[4] particularly with respect to the collection of unfixed/wontfix gnarly edge cases that is the MERGE statement.
NULL only has one meaning: NULL. This is roughly analogous to unknown.
The one that his a lot of people is WHERE <value> NOT IN (<set>) where <set> contains a NULL. Because NOT IN unrolls to “<value> <> <s1> AND <value> <> <s2> AND … AND <value> <> <sx>” any NULL values in the set makes one predicate NULL which makes the whole expression NULL even if one or more of the other values match.
> I have no way of knowing if that means there's a record [5, NULL] in table y, or if there's no record in table y that starts with 5.
Not directly, but you can infer the difference going by the value of (in your example) y.a - if it is NULL then there was no match, otherwise a NULL for y.b is a NULL from the source not an indication of no match.
> SQL should have specified another NULL-like value
This sort of thing causes problems of its own. Are the unknowns equivalent? Where are they relevant? How do they affect each other? Do you need more to cover other edge cases? I have memories of VB6's four constants of the apocalypse (null, empty, missing, nothing).
This is one of the reasons some purists argue against NULL existing in SQL at all, rather than needing a family of NULL-a-likes.