Nullability is whether or not a column may contain the NULL value. By default in
SQL, a column in any row can contain a special value that says that the value is
empty or unknown. Although this is quite useful, NULLs have unexpected side
effects. Almost every comparison returns “unknown” if any argument is NULL, and
“unknown” is treated as false.
The following very simple query looks like it is counting all the rows in the
ZipCensus table where the FIPCo column is not NULL. (<> is the SQL operator for
FROM ZipCensus zc
WHERE zc.fipco <> NULL
Alas, this query always returns zero. When a NULL value is involved in a
comparison—even “not equals”—the result is almost always NULL, which is treated
Of course, determining which rows have NULL values is quite useful, so SQL
provides the special operators IS NULL and IS NOT NULL. These behave as expected,
with the preceding query returning 32,845 instead of 0.
The problem is more insidious when comparing column values, either within a
single table or between tables. For instance, the column fipco contains the
primary county of a zip code and fipco2 contains the second county, if any. The
following query counts the number of zip codes in total and the number where
these two county columns have different values. This query uses conditional
aggregation, which is when a conditional statement (CASE) is the argument to an
aggregation function such as SUM():
SUM(CASE WHEN fipco <> fipco2 THEN 1 ELSE 0 END) as numsame
FROM ZipCensus zc
Or does it? The columns fipco and fipco2 should always have different values, so
the two counts should be the same. In reality, the query returns the values 32,989
and 8,904. And changing the not-equals to equals shows that there are 0 rows
where the values are equal. What is happening on the other 32,989 − 8,904 rows?
Once again, the “problem” is NULL values. When fipco2 is NULL, the test always
When a table is created, there is the option to allow NULL values on each column in
the table. This is a relatively minor decision when creating the table. However,
making mistakes on columns with NULL values is easy.
NULL values may seem troublesome, but they solve an important problem: how to
represent values that are not present. One alternative method is to use a special
value, such as -99 or 0. However, the database would just treat this as a regular
value, so calculations (such as MIN(), MAX(), and SUM()) would be incorrect.
Another alternative would be to have separate flags indicating whether or not a
value is NULL. That would make even simple calculations cumbersome. “A + B”, for
instance, would have to be written as something like “(CASE WHEN A_flag = 1 AND
B_flag = 1 THEN A + B END)”. Given the alternatives, having NULLs in the database
is a practical approach to handling missing values.
Designing databases is different from analyzing the data inside them. For
example, NULL columns can cause unexpected—and inaccurate—results when
analyzing data and make reading queries difficult. Be very careful when using
columns that allow them.