<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<HTML><HEAD>

<META content="text/html; charset=us-ascii" http-equiv=Content-Type>

<META name=GENERATOR content="MSHTML 9.00.8112.16440"></HEAD>

<BODY>

<DIV dir=ltr align=left><SPAN class=060083403-21122011><FONT color=#0000ff 

size=2 face=Arial>Steve,</FONT></SPAN></DIV>

<DIV dir=ltr align=left><SPAN class=060083403-21122011><FONT color=#0000ff 

size=2 face=Arial></FONT></SPAN> </DIV>

<DIV dir=ltr align=left><SPAN class=060083403-21122011><FONT color=#0000ff 

size=2 face=Arial>Thanks for taking an interest in this.  Concerning some 

of your observations.</FONT></SPAN></DIV>

<DIV dir=ltr align=left><SPAN class=060083403-21122011><FONT color=#0000ff 

size=2 face=Arial></FONT></SPAN><SPAN class=060083403-21122011><FONT 

color=#0000ff size=2 face=Arial></FONT></SPAN> </DIV>

<DIV dir=ltr align=left><SPAN class=060083403-21122011><FONT color=#0000ff 

size=2 face=Arial>1) TLID  - correct these would be duplicated in counties 

which is why we don't use it as a primary key.  We were thinking of after 

the fact creating a routine that would purge duplicate 

tlids,</FONT></SPAN></DIV>

<DIV dir=ltr align=left><SPAN class=060083403-21122011><FONT color=#0000ff 

size=2 face=Arial>but since shp2pgsql doesn't have a skip failures option, and 

we can't rely on people having ogr2ogr because it doesn't ship with PostGIS, we 

kept it as is.</FONT></SPAN></DIV>

<DIV dir=ltr align=left><SPAN class=060083403-21122011></SPAN> </DIV>

<DIV dir=ltr align=left><SPAN class=060083403-21122011><FONT color=#0000ff 

size=2 face=Arial>It probably would help improve things if we made it 

a unique key at least.</FONT></SPAN></DIV>

<DIV dir=ltr align=left><SPAN class=060083403-21122011></SPAN> </DIV>

<DIV dir=ltr align=left><SPAN class=060083403-21122011><FONT color=#0000ff 

size=2 face=Arial>2) The statefp being needlessly added to all tables.  

There is a very good reason for that.  The main reason is we need it there 

to take advantage of constraint exclusion.  </FONT></SPAN></DIV>

<DIV dir=ltr align=left><SPAN class=060083403-21122011><FONT color=#0000ff 

size=2 face=Arial>Since each set of records goes in its own table, as silly 

as it sounds we need statefp in there so that tables that couldn't possibly 

offer results can be skipped.</FONT></SPAN></DIV>

<DIV dir=ltr align=left><SPAN class=060083403-21122011></SPAN> </DIV>

<DIV dir=ltr align=left><SPAN class=060083403-21122011><FONT color=#0000ff 

size=2 face=Arial>Hope that answers your questions,</FONT></SPAN></DIV>

<DIV dir=ltr align=left><SPAN class=060083403-21122011><FONT color=#0000ff 

size=2 face=Arial>Regina</FONT></SPAN></DIV>

<DIV dir=ltr align=left><SPAN class=060083403-21122011><FONT color=#0000ff 

size=2 face=Arial><A 

href="http://www.postgis.us">http://www.postgis.us</A></FONT></SPAN></DIV>

<DIV dir=ltr align=left><SPAN class=060083403-21122011></SPAN> </DIV>

<DIV dir=ltr align=left><SPAN class=060083403-21122011> </SPAN></DIV><BR>

<BLOCKQUOTE 

style="BORDER-LEFT: #0000ff 2px solid; PADDING-LEFT: 5px; MARGIN-LEFT: 5px; MARGIN-RIGHT: 0px">

  <DIV dir=ltr lang=en-us class=OutlookMessageHeader align=left>

  <HR tabIndex=-1>

  <FONT size=2 face=Tahoma><B>From:</B> 

  postgis-devel-bounces@postgis.refractions.net 

  [mailto:postgis-devel-bounces@postgis.refractions.net] <B>On Behalf Of 

  </B>Steve Walker<BR><B>Sent:</B> Tuesday, December 20, 2011 8:32 

  PM<BR><B>To:</B> postgis-devel@postgis.refractions.net<BR><B>Subject:</B> 

  [postgis-devel] PostGIS 2.0.0SVN: TIGER Geocoder and TIGER's own Primary Keys 

  and the use of tableA.statefp = tableB.statefp: I think its' a 

  kludge<BR></FONT><BR></DIV>

  <DIV></DIV>(Disclaimer in advance:  I could be totally wrong)<BR><BR>I 

  have just started to work with the TIGER Geocoder against a previously 

  pre-populated complete TIGER database.  <BR><BR>[Aside:  Recommend 

  de-coupling code that builds the geocoder from code that imports the 

  database]<BR><BR><BR>It seems to me that the geocoder ignores TIGER's own 

  native primary keys and relies on some unnecessary workarounds (ie populating 

  tables unnecessarily with the STATEFP attribute).<BR><BR>I have started 

  digging into for example the geocode.sql script and so can give one specific 

  example.   But let me back up.<BR><BR>I believe the problem stems 

  from the very beginning - during the creation and import of the TIGER data 

  itself.   Most TIGER themes do have native primary keys (e.g. the 

  EDGES theme has the primary key TLID, the ADDR theme has the pkey 'ARID' and 

  links to the FEATNAMES theme via ADDRFN arid-->linearid).  However, 

  these will be duplicated at the shared boundaries of counties and states, and 

  so the process of importing the entire database needs to take this into 

  account.  I achieved this by making sure the following flags were 

  included in the ogr2ogr command which loads the TIGER 

  data:<BR><BR>-skipfailures -gt 1<BR><BR>The -gt 1 (Group features per 

  Transaction = 1 ) flag will attempt to load and then verify every single 

  record individually.  The -skipfailures flag will keep the bulk load from 

  crashing when a specific record fails.  So here, what we're doing is 

  rejecting one-by-one the features that would violate the primary key 

  constraint.   (Fake example: "TLID:   12345567890; 

  FULLNAME:  County Line RD" would be a line shared by two county 

  shapefiles.  The second attempt to insert this into the database would 

  fail gracefully, since the TLID pkey already exists)<BR><BR>The end result is 

  a nation-wide TIGER database with all native primary keys.<BR><BR>I believe 

  that the TIGER geocoder suffers from the lack of this use of native primary 

  key integrity.<BR><BR>Following from that belief, I have a hunch that the 

  geocoder uses a number of unneccesary workarounds.<BR><BR>Namely, I see the 

  geocoder artificially and unnecessarily populating state FIPS (statefp) data 

  into tables (e.g. FEATNAMES, ZCTA5, etc) when the database normal form does 

  not require it.<BR><BR>Here is the promised example, from the file 

  'geocode_address.sql'<BR><BR>Current SVN code, requiring the tableA.statefp = 

  tableB.statefp kludge<BR>(around line 107)<BR><BR><B>FROM featnames AS f INNER 

  JOIN addr As ad ON (f.tlid = ad.tlid) WHERE $10 = f.statefp AND $10 = 

  ad.statefp      ### neither Featnames nor Addr 

  have  STATEFP as a native attribute </B><BR><BR>I think the following 

  does the same thing using the native TIGER data and without requiring the 

  kludge of adding and populating the statefp attribute into the featnames 

  table:<BR><BR><B>FROM featnames NATURAL JOIN addrfn INNER JOIN 

  addr(addrfn.arid=addrfn.arid) WHERE featnames.tlid = addr.tlid '   

  ### requires only the columns native to the TIGER 

  tables</B><BR><BR>......<BR><BR>And so that is where I am right now.  

  Looking through the various SQL statements I see this <B>WHERE tableA.statefp 

  = tableB.statefp</B> clause peppered throughout the code.<BR><BR><BR>I do not 

  believe there is any inherent reason within the TIGER database to require 

  these WHERE clauses, but the complexity of the SQL code is such that I don't 

  yet understand it fully enough to make all (more?) of the required 

  changes.<BR><BR><BR>Discussion?<BR><BR><BR><BR><BR><BR><BR><BR>

  <TABLE cellSpacing=0 cellPadding=0 width="100%">

    <TBODY>

    <TR>

      <TD><PRE>-- 

Steve Walker

Middle Fork Geographic Information Services

(360)671-2505

</PRE></TD></TR></TBODY></TABLE></BLOCKQUOTE></BODY></HTML>