09:48

Comparing NULLs Considered Silly

17 December 2010

tl;dr: You can’t compare NULLs. A nullable primary key is a contradiction in terms. You can’t join on NULL, so a NULL foreign key refers to nothing, by definition. NULL doesn’t do what you think it does, no matter what you think it does.

NULL in SQL is annoyingly complex.

There’s really no conceptual model of NULL that will not end up surprisingly you in unpleasant ways. Jeff Davis, last year, wrote a great blog post that, if could be so bold, could be paraphrased as “conceptual models of NULL considered harmful.”

Thus, it’s not surprising that some… well, surprising ideas about NULL sometimes pop up.

Recently, on the Django developers’ list, the phrase “nullable primary key” caught my eye. This inspired me to write these thoughts about NULL, and in particular NULL being used as keys.

First:

It ie meaningless to compare two NULL values.

I’ve noticed application programmers often treat NULL as a magic value that any type can possess (I’ve been quite guilty of this, too). While this is somewhat true, it’s also a dangerous path to go down, because:

NULL = NULL

… is NULL, not true. Whatever else you can say about NULL, a NULL value means you can make no claims about what value it is. Saying, “I have no idea what this value is, and I have no idea about what that value is, but are they equal?” is, I would hope, pretty self-evidently meaningless.

Now, this immediately implies:

You cannot join on NULL.

If a foreign key column is NULL, you can’t do an inner join on it to another table, even if key column(s) being referred to is NULL. This follows directly from the fact that you can’t compare NULL values; joining is just built around comparison, after all.

Yes, you can do things like:

SELECT a.*
    FROM a
    INNER JOIN b
        ON (b.col = a.col) OR ( (b.col IS NULL) AND (a.col IS NULL) )

Setting aside that you’ve pretty much committed yourself to a nested loop at this point (and thus a very expensive operation), the fact that you have to jump through this hoop should be an indication that the wrong path is being trod.

So, please remember: NULL in a foreign key field does not mean “This refers to rows in the other table that have a matching NULL,” because there’s no such thing as a “matching NULL.”

Moving on to primary keys:

A primary key is a combination of columns whose values, taken together, uniquely specify a row.

Thus, a nullable primary key is equally meaningless, as the whole point of a primary key is for it it to be compared to other values to determine uniqueness. (The SQL standard prohibits NULLs in primary key columns, so it’s not just a good idea, it’s the law, or at least the recommendation.)

The SQL standard calls for NULL to be thrown up in places where it really should require an error. For example:

SELECT SUM(col) FROM t WHERE FALSE

… returns NULL, as the result of any aggregate function over zero rows is NULL. But the sum of no numbers is 0, not “unspecified” (or whatever you want to call NULL).

Worse:

SELECT AVG(col) FROM t WHERE FALSE

is NULL, while:

SELECT 0/0

… much more rationally gives a divsion-by-zero error.

My guess is that the SQL standards committee is loathe to have the spec require errors for more-or-less common operations, and that’s where a lot of the stranger cases of NULL come from, as a way of having a normal-but-flagged return from an edge case.

It’s really a shame that NULL is so complex and counterintuitive, but there’s really no hope for it except to learn the rules, and not try to abuse NULL to do things it wasn’t designed for.

There are 3 comments.

Joe at 15:21, 19 December 2010:

Here’s a related problem. Say you have a table that keeps track of URLs, each with a unique numeric identifier as the primary key. The URL table breaks down the path components: scheme, subdomain, domain_id, port_num, url_path, query_string. To save space, since close to 100% of the scheme values are ‘http’ and the port is almost never present (i.e., it defaults to 80), the columns are defined as nullable. Similarly, the subdomain and query_string are also nullable because they’re not present in a large number of URLs. A view is used to reconstruct the full URL, i.e., use COALESCE to add a ‘?’ and the query_string if not null, or a ‘.’ between the subdomain and domain strings. The problem is: if you create a UNIQUE index on the component columns, you can still add duplicate rows because UNIQUE index on nullable columns isn’t really useful.

Xof at 00:02, 20 December 2010:

Yes, that’s no good. If there was ever an example of premature optimization, using NULL as a way of saving space in a text field must qualify.

Ben Finney at 03:57, 22 December 2010:

Adrian Walker gives us SQL Null Rage:

Just don’t allow nulls in your relational database tables, just don’t do it. No. No, not even then, no. No. NO.
…
“”The Third Manifesto”, by C.J. Date and Hugh Darwen (3rd edition, Addison-Wesley, 2005), contains a categorical proscription against support for anything like SQL’s NULL, in its blueprint for relational database language design.” Sounds good to me.