Hacker News new | past | comments | ask | show | jobs | submit login

I've written a bajillion queries and have tons of nitpicks, but it's the twin meanings of NULL that really kills me.

NULL can be the value of a field in a record, but it is also used to indicate the lack of a record in a JOIN.

If I run:

  SELECT x.a, y.b FROM x LEFT JOIN y on x.a = y.a
and I get back

  [5, NULL]
I have no way of knowing if that means there's a record [5, NULL] in table y, or if there's no record in table y that starts with 5.

Obviously there are lots of workarounds, but to me this is a truly fundamental design flaw. SQL should have specified another NULL-like value, call it EMPTY, used only for joins that find no data. And analagous to IS NULL, it would be checked using IS EMPTY.




> but it's the twin meanings of NULL that really kills me

NULL only has one meaning: NULL. This is roughly analogous to unknown.

The one that his a lot of people is WHERE <value> NOT IN (<set>) where <set> contains a NULL. Because NOT IN unrolls to “<value> <> <s1> AND <value> <> <s2> AND … AND <value> <> <sx>” any NULL values in the set makes one predicate NULL which makes the whole expression NULL even if one or more of the other values match.

> I have no way of knowing if that means there's a record [5, NULL] in table y, or if there's no record in table y that starts with 5.

Not directly, but you can infer the difference going by the value of (in your example) y.a - if it is NULL then there was no match, otherwise a NULL for y.b is a NULL from the source not an indication of no match.

> SQL should have specified another NULL-like value

This sort of thing causes problems of its own. Are the unknowns equivalent? Where are they relevant? How do they affect each other? Do you need more to cover other edge cases? I have memories of VB6's four constants of the apocalypse (null, empty, missing, nothing).

This is one of the reasons some purists argue against NULL existing in SQL at all, rather than needing a family of NULL-a-likes.


> NULL only has one meaning: NULL. This is roughly analogous to unknown.

Which is what I'm arguing against, because it's used in two unrelated ways in SQL -- as a data value, and to express no matching row found in a JOIN. So no matter how it's defined formally, in practice it has two meanings that have nothing whatsoever to do with each other. One is a value and can be stored, the other says 'not found' and results only from expressions and isn't for storage.

> but you can infer the difference

Yes, as I said there are lots of workarounds. But they're still workarounds.

> This is one of the reasons some purists argue against NULL existing in SQL at all

Sure, but the reality is that NULL is used in a practical sense to mean "no data entered". It's so ridiculously common that only some columns have no data for a particular row, and so e.g. you need the 'time_finished' column to either have a valid date, or it's NULL if the activity hasn't finished yet. The alternative is to have an additional boolean column 'is_finished' and for 'time_finished' to be arbitrarily '1970-01-01T00:00:00Z' whenever is_finished is false, which is clunky and redundant.

Purists can argue what they want, but NULL is so ridiculously useful as a stored data value it doesn't really matter.


> you need the 'time_finished' column to either have a valid date, or it's NULL if the activity hasn't finished yet

Rust’s Diesel has an interesting way of dealing with this. “None” means missing while “Some(None)” means the value null. So when updating a record, “None” makes no change while “Some(None)” sets the value to null.

This distinction is obviously lost when retrieving from the database, but it’s an interesting concept.


Javascript having both `null` and `undefined` doesn't seem so crazy now


But in classic Javascript style, it's half-assed. Being able to assign anything to undefined makes the distinction between null and undefined pointless.


If you have a business need to represent "empty" or "n/a" or "declined to answer" or something like that, use a specific value for that. NULL does not mean anything. Or, it means nothing. It's just NULL. Once I got that into my head, SQL became less frustrating.


You can't without defining yet another field, usually. It's really annoying to double the number of columns so that [time_started, time_finished] becomes [time_started_exists, time_started, time_finished_exists, time_finished].

NULL values makes business logic far more compact and intuitive. For enumerated values in fields it's easy enough to define another value in the same field to mean 'unknown' or 'not entered', but you can't do that for strings, numbers, datetimes, etc. -- you have to throw in a bunch of unwieldy additional boolean fields instead.


The underlying problem here is that SQL lacks Sum Types (aka Tagged Unions). Such types solve all these problems effortlessly.

In contrast to what SQL has, Sum Types combined with Product Types (which is basically what a row is) are actually a universal way to model all possible data[0]. (Of course you may want syntax sugar, etc. on top of that, but Sums and Products at the bottom is sufficient.)

[0] I'm actually not sure if I should qualify that -- I believe Sums+Products can actually model anything, assuming you allow recursive type definitions -- which might be hard to make perform well. Storing a linked list in a database field, e.g. might not be the best idea.


Indeed, the issue of non-existence is a somewhat tricky philosophical question (e.g. there are many ways something may not exist).


A better option in many cases is to check the primary key.

e.g.

select questions.Id , questions.Text , answers.Id , answers.Text

from questions left join answers on answers.QuestionId = Questions.Id

answers.Id is non-nullable as a primary key, so if answers.Text is null but answers.Id is not null they've declined to answer.


That may be implementation or circumstance specific. For instance in MS SQL Server with a heap table (one without a clustered index) or a table where the primary key is not the clustering key, it will result in extra page reads to check the other field's value (the query planner / engine could infer from it being the PK that it can never be null, so the lookup to check is unnecessary, but IIRC it does not do this). As the columns used in the join predicate have to be read to perform the join, no extra reads will result from using them for other filtering.

In your example it is very likely that the primary key is the clustering key, so will be present in the non-clustered index that I assume will be on answers.questionId, making my point moot, but if for some unusual reason neither Id nor questionId were the clustering key checking Id may result in extra reads being needed.

In DBMSs without clustering keys implemented similarly to SQL Server, there may be such concerns in all cases.


> This is one of the reasons some purists argue against NULL existing in SQL at all, rather than needing a family of NULL-a-likes.

So then are there no optional values? What happens when OUTER JOINs don't match?


I assume NULL would still exist there, but there would be no explicit NULL values permitted in tuples (rows) that actually exist.

Or perhaps the purists would remove outer joins too, it isn't since University that I've read around the discussion, but given alternate syntax to do the same thing can sometimes be convoluted that might be a bad idea itself.


In the relational model, no there is no such thing as "optional values".

Even Codd often hinted (quite a bit of material in the 1990 book gives such hints) that "if a row has a null somewhere, it SHOULDN'T BE CONSIDERED AS BELONGING TO THE RELATION". E.g. if an attribute of a FK is null, no FK checking should be done at all on account of that null. The FK checking should be done only when the FK value is "complete".

As for outer join, one problem with it is that its result is (intended to be) the extension of a disjuctive predicate, the disjuncts in question being exactly the predicate for the "matching" case and the other one the predicate for the "no match" case. Now if you see a row in the result, how can you tell from that row which of the two disjuncts made that row appear in the result ?

Outer join is problematic by definition precisely because of the possibly ambiguous interpretation that derives from its disjunctive predicate.

(Would you do that with base tables ??? I mean design a base table that can hold a row for strictly more than one possible reason ??? I mean where the user then sees a row in that table and subsequently cannot tell whether that row is there for reason/meaning 1 or for reason/meaning 2 ??? I'll tell what you -and anyone else- would do in such a case : you'd also add the indicator telling whether it's meaning 1 or 2 (and then the problem of 'what if it's both' is also solved). Now think back about what could be done in outer join to resolve that very same kind of disjunctive ambiguity.)


You don’t do outer joins.


Don’t these exist for a reason?


Yes. The reason is lazy people expect to be able to ask five questions in one go and expect one single answer to answer all five of them.

The key is to realize that submitting a query is to ask a question. So how many queries to you think you have to issue if the number of questions you have is five ?

It is technically perfectly possible to "ask five questions at the same time" to any given truly relational DBMS. Said DBMS will give five answers at the same time (and not leave you to guess which answer corresponds to which question). That SQL doesn't do it, does not mean it's impossible.


Of course.

You've got a list of countries and are pulling each country's national flower, national bird, largest port city etc.

Without outer joins, Liechtenstein with no ports doesn't show in the list at all. Sad news for people who want to know all countries, or Liechtenstein's national bird (eagle).


You’re not obliged to pull everything in one request. You can issue several requests.


Sure. You're also not obligated to include a WHERE-clause in the query you send to your database. You can do the filtering in the application.


What's your point? You can live without nulls, they're not required and you don't need to avoid WHERE-clause for that. It's a theoretical concept, nobody in sane mind would do that, but nulls are not required for relational algebra.


You started this subthread by asserting that one shouldn't use outer joins, and when other people pointed out valid use cases for outer joins, you advocated to use application logic to work around your refusal to use them.

My response was just to point out the absurdity of your premise.


"NULL only has one meaning: NULL. This is roughly analogous to unknown."

From the re-published version of "Much ado about nothing, part 2" (re-published in "Database Dreaming, Vol II) :

"An outer join produces nulls because we ask the DBMS to show us data from one table even when no matching data can be found in another table. Such nulls have no meaning, no semantic content."

Personally I'd add to that : and even if it's the case that they do, that meaning is patently not "simply unknown" but rather a very clear indication of non-existence.

But I have no doubt the apologists will obviously always apologize no matter what and handwave arguments such as these away with "So what ? What's the difference." Well, it's the difference between knowing of non-existence and not knowing at all.


"The one that his a lot of people is WHERE <value> NOT IN (<set>) where <set> contains a NULL. Because NOT IN unrolls to “<value> <> <s1> AND <value> <> <s2> AND … AND <value> <> <sx>” any NULL values in the set makes one predicate NULL which makes the whole expression NULL even if one or more of the other values match."

Sorry, but this is FALSE.

"Even if one of the other values match" will make that particular INequality test FALSE and one FALSE conjunct will make the entire conjunction FALSE, even in 3VL.


NULL itself is open to interpretation.

If I CREATE TABLE foo (bar CHAR(1) PRIMARY KEY, baz char(1) UNIQUE), then different things happen on different databases.

In Microsoft SQL server, only one insert of a null into the baz column is allowed, and the null value is indexed.

In Oracle, null is never indexed in this context, so any number of null insertions into baz are allowed.

On a composite index, I believe that nulls are always indexed, on any database (they must be).

As far as I understand it, this is implementation-defined:

  SQL> select * from dual where null=null;

  no rows selected


WHERE NULL=NULL should never return true, so no rows returned is correct. It is only implementation defined in that some implementations get this wrong!¹³

SQL Server's handling of NULLs with in unique indexes/constraints is not standard. You can get better using a filtered unique index (“CREATE UNIQUE INDEX more_compliant_unique_index ON dbo.some_table(this_column) WHERE this_column IS NOT NULL;”), but be aware that filtered indexes have their own collection of gotchas⁴.

----

[1] MS Access IIRC, MS SQL Server if the ANSI_NULLS option set to off, MS² SSIS's default (consider NULLS equal) behaviour for merge steps, and so on.

[2] You may be noticing a pattern here… MS are not the only culprit though.

[3] I'm making a distinction here between places where behaviour is not defined in the standards, so implementations are free to go their own way, and where implementations go against what is expected by the standards.

[4] particularly with respect to the collection of unfixed/wontfix gnarly edge cases that is the MERGE statement.


A nullable column can always be extracted to a separate table with a non-nullable column and a foreign key. If you left-join this back with the base table, you will get the nulls again for the missing values. So to me it seem nicely symmetrical to use the same kind of NULL value.


I'd check to see if y.a IS NULL in that situation. I'm sure there are cases where it matters, but most of the time for me the difference between "there's a row in y, but the value is NULL" and "there's no row in y, the value is NULL" is irrelevant. I can't think of a time when that distinction has been important, and I'm working on a project converting thousands of complex SQL queries.

The thing that really bugs me about NULL is the default assumptions - 99.9% of the time I want NULL in a string context to be the empty string and NULL in a numeric context to be 0, but I have to use ISNULL or COALESCE to get that. I wish it were the other way round where NULL is treated as '' or 0 by default, but I can do something special if I really want to test for NULL'ness.


Interesting, we must work with really different data/queries.

I've come up against the distinction lots of times (and yes have to retrieve additional fields in order to address it), while I don't ever want to confuse 0 with NULL. Tons of things are legitimately zero but crucially non-null, like an inventory count. (Empty strings, on the other hand, do seem much more interchangeable with NULL in probably the vast majority of contexts.)


'' is NULL in Oracle. People complain about it, and it always seemed a little weird to me.


As another commenter said, if your table have required ID columns, you should specify it in your SELECT.

     SELECT x.Id,x.a,y.ID,y.b 
     FROM x 
     LEFT JOIN y 
       on x.a = y.a    
The output [Id12345,5,Id6789,NULL] means the column is null. The output [Id12345,5,NULL,NULL] means the row is empty.

On a side not, I use the SQL below all the time to identify missing rows on the "comparetable".

  SELECT #basetable.columnname
  FROM #basetable
  LEFT OUTER JOIN #comparetable
  ON #basetable.columnname1 = #comparetable.columnname1
    and #basetable.columnname2 = #comparetable.columnname2
    and #basetable.columnname3 = #comparetable.columnname3
  WHERE #comparetable.columnname1 is null


Fabian Pascal made a whole career out of writing about how stupid "NULL" is in SQL. See if you can find his writings some time. He was positively livid about it.

Notably NULL of any kind is completely absent from the relational algebra. There are other ways to express absence. Notably, this is all supposed to be first-order predicate logic <hand wave hand wave> stuff, and... in that world "null" is also not a "thing".

Also SQL insists on allowing duplicate "rows" in its "tables"; whereas the relational model is very clear that the contents of a relation ("table") is a set, not a bag.

These two confusions in SQL actually lead to many many data integrity problems as follow-on effects and also complicate query planning and optimization.


> I have no way of knowing if that means there's a record [5, NULL] in table y, or if there's no record in table y that starts with 5.

You can always add SELECT y.a, which will allow you to disambiguate between the two options (it will be non-NULL in the first case and NULL in the second).


You need one more output column. The specifics will vary by RDBMS because SQL isn't that standard. The most portable thing would be to do this:

  SELECT x.a, y.a IS NOT NULL, y.b
  FROM x
  LEFT JOIN y ON x.a = y.a;
assuming y.a is a NOT NULL column, or a PRIMARY KEY column, since PKs are supposed to be non-nullable.

In PG you could:

  SELECT x.a, y IS DISTINCT FROM NULL, y.b
  FROM x
  LEFT JOIN y ON x.a = y.a;


The predicate corresponding to an outer join is fundamentally disjunctive in nature (any outer join is equivalent to a UNION of at least 2 SELECTs) meaning that "after the fact", there is no way to determine which particular one of the disjuncts made the row appear in the result.


Why does it kill you? A NULL in a row is semantically identical to the row never existing in the first place, unless there is some other non-null column (the primary key) that lets you disambiguate between the two cases. It's hardly a problem in practice because all good tables should have primary keys.


This is why JS has both null and undefined.


Ah, JS, the pinnacle of data correctness. 1 == "1" results in undefined, right?


nope 1 == "1" is true and 1 === "1" is false


How do you expect `y.b` to behave on EMPTY rows? Should it return EMPTY or NULL?


you should not need to distinguish between empty and null because they mean the same thing. if you want to select records from x that aren't in y you should use an anti-join (where not exists, not in, etc.)


This is not true--especially when considering different SQL implementations (e.g. Oracle SQL versus Microsoft SQL). NULL and EMPTY handle the intersection of ontic versus epistemic claims.

EMPTY implies a known, 0-byte value whereas NULL can imply either an unknown value or the "unknowability" of a value (i.e. the in-existence of a value).

In practical terms, this would be like equating the statements "I don't know whether that dog has a name" (i.e. a NULL name) and "I do not know the name of that dog" (an EMPTY name). The former does not assert the existence of a proper noun to represent "that dog", whereas the latter implicitly asserts that there exists a proper noun which nominates "that dog".


beg pardon I wasn't aware sql actually defined empty. of the engines I'm familiar with there is no keyword for empty like what gp is asking for (and I hold my argument for why it should not be necessary). it sounds like what you're describing is a blank string and I agree with using blanks to distinguish between "has no phone number" and "we don't know their phone number" but that isn't what gp was talking about.


Also you should select y.a to know wether the y-record exists.


Good point, it would be nice to have both NULL and EMPTY to split these meanings.


To piggy-back, it bothers me so much that this is valid syntax in many implementations:

UPDATE x SET a = NULL WHERE b IS NULL;

Like... wat?


Why are you joining on a nullable column in the first place?

If your database is well designed, joining on a nullable column should be a relatively exotic use case, and can be handled by writing a tiny bit more code to check for NULLs before performing your join.


It's not.

Assume y.a is the non-nullable primary key, while y.b is the column that may be null.

But in this example there might not be a row where y.a = 5. Or there might be. But you can't tell.


But in those cases, you can always tell by including the non-nullable key in your result set. It's already being evaluated, so it's virtually free (only adds to the network transport size).




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: