[postgis-devel] PostGIS 2.0.0SVN: TIGER Geocoder and TIGER's own Primary Keys and the use of tableA.statefp = tableB.statefp: I think its' a kludge

Tue Dec 20 19:40:13 PST 2011

Steve,

Thanks for taking an interest in this.  Concerning some of your
observations.

1) TLID  - correct these would be duplicated in counties which is why we
don't use it as a primary key.  We were thinking of after the fact creating
a routine that would purge duplicate tlids,
but since shp2pgsql doesn't have a skip failures option, and we can't rely
on people having ogr2ogr because it doesn't ship with PostGIS, we kept it as
is.

It probably would help improve things if we made it a unique key at least.

2) The statefp being needlessly added to all tables.  There is a very good
reason for that.  The main reason is we need it there to take advantage of
constraint exclusion.  
Since each set of records goes in its own table, as silly as it sounds we
need statefp in there so that tables that couldn't possibly offer results
can be skipped.

Hope that answers your questions,
Regina
http://www.postgis.us

  _____  

From: postgis-devel-bounces at postgis.refractions.net
[mailto:postgis-devel-bounces at postgis.refractions.net] On Behalf Of Steve
Walker
Sent: Tuesday, December 20, 2011 8:32 PM
To: postgis-devel at postgis.refractions.net
Subject: [postgis-devel] PostGIS 2.0.0SVN: TIGER Geocoder and TIGER's own
Primary Keys and the use of tableA.statefp = tableB.statefp: I think its' a
kludge

(Disclaimer in advance:  I could be totally wrong)

I have just started to work with the TIGER Geocoder against a previously
pre-populated complete TIGER database.  

[Aside:  Recommend de-coupling code that builds the geocoder from code that
imports the database]

It seems to me that the geocoder ignores TIGER's own native primary keys and
relies on some unnecessary workarounds (ie populating tables unnecessarily
with the STATEFP attribute).

I have started digging into for example the geocode.sql script and so can
give one specific example.   But let me back up.

I believe the problem stems from the very beginning - during the creation
and import of the TIGER data itself.   Most TIGER themes do have native
primary keys (e.g. the EDGES theme has the primary key TLID, the ADDR theme
has the pkey 'ARID' and links to the FEATNAMES theme via ADDRFN
arid-->linearid).  However, these will be duplicated at the shared
boundaries of counties and states, and so the process of importing the
entire database needs to take this into account.  I achieved this by making
sure the following flags were included in the ogr2ogr command which loads
the TIGER data:

-skipfailures -gt 1

The -gt 1 (Group features per Transaction = 1 ) flag will attempt to load
and then verify every single record individually.  The -skipfailures flag
will keep the bulk load from crashing when a specific record fails.  So
here, what we're doing is rejecting one-by-one the features that would
violate the primary key constraint.   (Fake example: "TLID:   12345567890;
FULLNAME:  County Line RD" would be a line shared by two county shapefiles.
The second attempt to insert this into the database would fail gracefully,
since the TLID pkey already exists)

The end result is a nation-wide TIGER database with all native primary keys.

I believe that the TIGER geocoder suffers from the lack of this use of
native primary key integrity.

Following from that belief, I have a hunch that the geocoder uses a number
of unneccesary workarounds.

Namely, I see the geocoder artificially and unnecessarily populating state
FIPS (statefp) data into tables (e.g. FEATNAMES, ZCTA5, etc) when the
database normal form does not require it.

Here is the promised example, from the file 'geocode_address.sql'

Current SVN code, requiring the tableA.statefp = tableB.statefp kludge
(around line 107)

FROM featnames AS f INNER JOIN addr As ad ON (f.tlid = ad.tlid) WHERE $10 =
f.statefp AND $10 = ad.statefp      ### neither Featnames nor Addr have
STATEFP as a native attribute 

I think the following does the same thing using the native TIGER data and
without requiring the kludge of adding and populating the statefp attribute
into the featnames table:

FROM featnames NATURAL JOIN addrfn INNER JOIN addr(addrfn.arid=addrfn.arid)
WHERE featnames.tlid = addr.tlid '   ### requires only the columns native to
the TIGER tables

......

And so that is where I am right now.  Looking through the various SQL
statements I see this WHERE tableA.statefp = tableB.statefp clause peppered
throughout the code.

I do not believe there is any inherent reason within the TIGER database to
require these WHERE clauses, but the complexity of the SQL code is such that
I don't yet understand it fully enough to make all (more?) of the required
changes.

Discussion?

-- 

Steve Walker

Middle Fork Geographic Information Services

(360)671-2505

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-devel/attachments/20111220/ae263f83/attachment.html>