[postgis-devel] PostGIS 2.0.0SVN: TIGER Geocoder and TIGER's own Primary Keys and the use of tableA.statefp = tableB.statefp: I think its' a kludge

Tue Dec 20 17:31:53 PST 2011

(Disclaimer in advance:  I could be totally wrong)

I have just started to work with the TIGER Geocoder against a previously
pre-populated complete TIGER database.  

[Aside:  Recommend de-coupling code that builds the geocoder from code
that imports the database]

It seems to me that the geocoder ignores TIGER's own native primary keys
and relies on some unnecessary workarounds (ie populating tables
unnecessarily with the STATEFP attribute).

I have started digging into for example the geocode.sql script and so
can give one specific example.   But let me back up.

I believe the problem stems from the very beginning - during the
creation and import of the TIGER data itself.   Most TIGER themes do
have native primary keys (e.g. the EDGES theme has the primary key TLID,
the ADDR theme has the pkey 'ARID' and links to the FEATNAMES theme via
ADDRFN arid-->linearid).  However, these will be duplicated at the
shared boundaries of counties and states, and so the process of
importing the entire database needs to take this into account.  I
achieved this by making sure the following flags were included in the
ogr2ogr command which loads the TIGER data:

-skipfailures -gt 1

The -gt 1 (Group features per Transaction = 1 ) flag will attempt to
load and then verify every single record individually.  The
-skipfailures flag will keep the bulk load from crashing when a specific
record fails.  So here, what we're doing is rejecting one-by-one the
features that would violate the primary key constraint.   (Fake example:
"TLID:   12345567890; FULLNAME:  County Line RD" would be a line shared
by two county shapefiles.  The second attempt to insert this into the
database would fail gracefully, since the TLID pkey already exists)

The end result is a nation-wide TIGER database with all native primary
keys.

I believe that the TIGER geocoder suffers from the lack of this use of
native primary key integrity.

Following from that belief, I have a hunch that the geocoder uses a
number of unneccesary workarounds.

Namely, I see the geocoder artificially and unnecessarily populating
state FIPS (statefp) data into tables (e.g. FEATNAMES, ZCTA5, etc) when
the database normal form does not require it.

Here is the promised example, from the file 'geocode_address.sql'

Current SVN code, requiring the tableA.statefp = tableB.statefp kludge
(around line 107)

FROM featnames AS f INNER JOIN addr As ad ON (f.tlid = ad.tlid) WHERE
$10 = f.statefp AND $10 = ad.statefp      ### neither Featnames nor Addr
have  STATEFP as a native attribute 

I think the following does the same thing using the native TIGER data
and without requiring the kludge of adding and populating the statefp
attribute into the featnames table:

FROM featnames NATURAL JOIN addrfn INNER JOIN
addr(addrfn.arid=addrfn.arid) WHERE featnames.tlid = addr.tlid '   ###
requires only the columns native to the TIGER tables

......

And so that is where I am right now.  Looking through the various SQL
statements I see this WHERE tableA.statefp = tableB.statefp clause
peppered throughout the code.

I do not believe there is any inherent reason within the TIGER database
to require these WHERE clauses, but the complexity of the SQL code is
such that I don't yet understand it fully enough to make all (more?) of
the required changes.

Discussion?

-- 
Steve Walker
Middle Fork Geographic Information Services
(360)671-2505
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-devel/attachments/20111220/0815c568/attachment.html>