dataflow is a graphical way of visualizing data transformations. Dataflows have
two important elements. The nodes in a dataflow diagram transform data, taking
zero or more inputs and producing output. The edges in a dataflow diagram are
pipes connecting the nodes. Think of the data flowing through the pipes and
getting banged and pushed and pulled and flattened into shape by the nodes. In
the end, the data has been transformed into information.
Figure 1.3 shows a simple dataflow that adds a new column, called SCF for
Sectional Center Facility (something the U.S. Post Office uses to route mail). This
column is the first three digits of a zip code. The output is each zip code with its
SCF. The dataflow has four nodes, connected by three edges. The first, shaped like
a cylinder, represents a database table or file and is the source of the data. The
edge leaving this node shows some of the records being passed from it, records
from the ZipCensus table.
The second node appends the new column to the table, which is also visible along
the edge leading out from the node. The third selects two columns for output—in
this case, zcta5 and SCF. And the final node simply represents the output. On the
dataflow diagram, imagine a magnifying glass that makes it possible to see the
data moving through the flow. Seeing the data move from node to node shows
what is happening in the flow.
The actual processing could be implemented in either SQL or Excel. The SQL code
corresponding to this dataflow is:
SELECT zc.zcta5, LEFT(zc.zcta5, 3) as scf
FROM ZipCensus zc
Alternatively, if the data were in an Excel worksheet with the zip codes in column
A, the following formula would extract the SCF:
=MID(A1, 1, 3)
Of course, the formula would have to be copied down the column.
Excel, SQL, and dataflows are three different ways of expressing similar
transformations. The advantage of dataflows is that they provide an intuitive way
of visualizing and thinking about data manipulations, independent of the tool
used for the processing. Dataflows facilitate understanding, but in the end, the
work described in this book is in SQL or Excel.