ZipCensus is an example of a reference table summarized at the zip code level.
Each row describes a zip code and any given zip code appears exactly once in the
table. As a consequence, the zcta5 column makes it possible to look up census
information for zip codes stored in another table. Intuitively, this is one of the
most natural join operations, using a foreign key in one table to look up values in
a reference table.
A lookup join makes the following two assumptions about the base and reference
All values of the key in the base table are in the reference table (missing join
keys lose rows unexpectedly).
The lookup key is the primary key in the reference table (duplicate join keys
cause unexpected rows).
Unfortunately, SQL does not provide direct support for lookups because there is
no simple check in the query ensuring these two conditions are true. However, the
join mechanism makes it possible to do lookups, and this works smoothly when
the two preceding conditions are true.
Consider the SQL query that appends the zip code population to each row of
SELECT o.OrderId, o.ZipCode, zc.totpop
FROM Orders o JOIN
ON o.ZipCode = zc.zcta5
This example uses the ON clause to establish the condition between the tables.
There is no requirement that the condition be equality in general, but for a lookup
From the dataflow perspective, the lookup could be implemented with
CROSSJOIN. The output from the CROSSJOIN is first filtered to the correct rows
(those where the two zip codes are equal) and the desired columns (all columns
from Orders plus totpop) are selected.
Unlike the dataflow diagram, the SQL query describes that a join needs to take
place, but does not explain how this is done. The cross-join is one method,
although it would be quite inefficient in practice. Databases are practical, so
database writers have invented many different ways to speed this up. The details
of such performance enhancements are touched upon in Chapter 14,
“Performance Is the Issue: Using SQL Effectively.” It is worth remembering that
databases are practical, not theoretical, and the database engine is usually trying
to optimize the run-time performance of queries.
Although the preceding query does implement the lookup, it does not guarantee
the two conditions mentioned earlier. If there were multiple rows in ZipCensus for
a given zip code, there would be extra rows in the output (because any matching
row would appear more than once). You can define a constraint or unique index
on the table to ensure that it has no duplicates, but in the query itself there is no
evidence of whether or not such a constraint is present. On the other hand, if zip
code values in Orders were missing in ZipCensus, rows would unexpectedly
disappear. In fact, this happens and the output has fewer rows than the original
Orders table. The condition that all the zip codes in Orders match a row in
ZipCensus could be enforced (if it were true) with another type of constraint, a
foreign key constraint.
Having multiple rows in ZipCensus for a given zip code is not an outlandish idea.
For instance, it could include information for both the 2000 and 2010 censuses,
which would make it possible to see changes over time. One way to implement this
would be to have another column, say, CensusYear to specify the year of the
census. Now the primary key would be a compound key composed of zcta5 and
CensusYear together. A join on the table using just zip code would result in
multiple rows, one for each census year.