[postgis-tickets] [PostGIS] #2289: Redesign Tiger Geocoder

Thu Apr 25 20:55:20 PDT 2013

#2289: Redesign Tiger Geocoder
----------------------------+-----------------------------------------------
 Reporter:  woodbri         |       Owner:  robe 
     Type:  enhancement     |      Status:  new  
 Priority:  medium          |   Milestone:       
Component:  tiger geocoder  |     Version:  2.0.x
 Keywords:                  |  
----------------------------+-----------------------------------------------
 I migrated these comments from #1118 as they really should be in a
 separate ticket. The discussion around the issue in #1118 is really a
 symptom of the fact that we need to redesign the tiger geocoder to better
 use the new pagc_address_standardizer and, as a secondary goal, to make it
 possible in the future to be less Tiger centric.

 Ticket #1118 is a problem of not standardizing the reference dataset and
 relying on the existing standardization. This is a process bug, not a code
 bug. If you take a random address and ask some people to standardize it
 into components, you will surely get some different results because the
 people will have a different set of rules in mind. Tiger data has been
 standardized by 3300 different counties, where it was collected and then
 given to Census, so you will not even find consistency within Tiger.
 Relying on the pre-parsed standardization is the wrong way to approach
 this problem.

 The way to fix this is to load the tiger data, then clump the name
 attributes into a single string and give it to the standardizer to parse
 and then save that. When we get a query request, we standardize that using
 our same standardizer and rules and we match those results against our
 standardized reference set.

 Then we don't care if the standardization is right or wrong, because if it
 is wrong, it will be wrong in both cases and will still match.
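
 A minimal sketch of that idea in Python, not the actual
 pagc_address_standardizer: the lexicon, the sample rows and the function
 names below are all hypothetical. It deliberately maps "saint" to "ST"
 (arguably wrong) to show that the match still succeeds when the reference
 data and the query go through the same rules.

```python
import re

# Hypothetical toy lexicon; the real lex/gaz/rules tables are far larger.
# Mapping "saint" to "ST" is deliberately questionable.
LEXICON = {
    "street": "ST", "st": "ST",
    "avenue": "AVE", "ave": "AVE",
    "north": "N", "n": "N",
    "saint": "ST",
}

def standardize(raw):
    """Tokenize, then map each token through the lexicon (toy rules)."""
    tokens = re.findall(r"[a-z0-9]+", raw.lower())
    return " ".join(LEXICON.get(t, t.upper()) for t in tokens)

# Hypothetical reference rows: (id, pre-parsed name parts as delivered).
reference_rows = [
    (1, ("N", "Saint Paul", "St")),
    (2, ("", "Main", "Avenue")),
]

# Clump the parts into one string, standardize it, and save the result.
reference = {
    standardize(" ".join(part for part in parts if part)): row_id
    for row_id, parts in reference_rows
}

def lookup(query):
    """Standardize the query with the same rules and match on the result."""
    return reference.get(standardize(query))
```

 Here lookup("north st. paul street") finds row 1 even though neither side
 was parsed "correctly", because both sides were parsed identically.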

 This process also has the benefit that you can analyze those records that
 failed to standardize because of missing lexicon, gazetteer or rules
 entries, and add those that we might need to improve the tools over time.
 This part can be done separately from the automated loading process. It
 should be done as part of the bug fixing and enhancements to the geocoder
 over time.
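
 As a rough illustration of that kind of analysis, the Python sketch below
 counts trailing tokens that a toy lexicon does not recognize, so the most
 frequent gaps can be reviewed first. The failure criterion (an unknown
 last token), the KNOWN_TYPES set and the sample names are assumptions made
 for the example, not how the standardizer reports failures.

```python
from collections import Counter

# Hypothetical set of street types the lexicon already knows.
KNOWN_TYPES = {"ST", "AVE", "RD"}

def unknown_type_tokens(names):
    """Count trailing tokens that are not in the known street types."""
    counts = Counter()
    for name in names:
        tokens = name.upper().split()
        if tokens and tokens[-1] not in KNOWN_TYPES:
            counts[tokens[-1]] += 1
    return counts

# Hypothetical sample of reference names.
sample = ["Main St", "Oak Trce", "Pine Trce", "Elm Ave", "Hill Xing"]
report = unknown_type_tokens(sample)
```

 Sorting the report with report.most_common() gives a candidate list of
 lexicon entries to consider adding.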

 While the pagc address standardizer improves things and provides an easy
 tool to change the behavior, if you don't make this process change you
 will have an endless list of bugs like this that have nothing to do with
 the code. While you might be able to fix some of these with changes to
 lex, gaz and rules, you also might be breaking other cases that are not
 obvious when you make changes. DAMHIK.

 I know the plan is to move forward without making this process change, but
 it should be planned for sometime in the future.

 ---------------- robe ----------------------------

 Yah, I was thinking of it for the future. I'll ticket that. I'm leaning
 toward using hstore to store the normalized hash for the tiger set,
 possibly only doing it for the obvious ambiguities.

 The issues I have with doing it after load and for all records:

 1) Inserting is a lot less painful than updating, since updating requires
 both an insert and a delete. So it's faster to do on load.

 2) Since this is in flux, there'll be a lot of updating going on
 initially, so I don't want to push that on users until things are more
 stable. Plus it complicates the update script, with updates requiring
 user data changes -- something I kind of want to stay away from until I
 have my upgrade bulletproof.

 3) I actually don't think it's necessary to standardize all of tiger (I
 would say about 85% or more of it is fine). For the most part there
 aren't that many ambiguities, and a lot of those would be long and
 painful to itemize; doing it by lex is probably not the right way.

 Clearly for things like Camino etc that would be the right thing.

 So I'm thinking more along the lines of a hybrid. It would also make my
 hstore index way shorter and faster to scan if it's only the questionable,
 problematic ones that need to be changed. Anyway, I'll put in a separate
 future ticket. For PostGIS 2.1 I would like to change the norm_addy
 structure, since part of the issue is that I am mixing pre-abbreviation
 with post-abbreviation forms.

 ---------------- woodbri --------------------------------

 I don't do any updates. I load the tiger data into a table, then
 standardize that into a stdaddr table that is linked by the primary key
 of the tiger table. If I make changes to the lex, gaz, or rules, I drop the
 stdaddr table and recreate it. All searches are done only on the stdaddr
 table and only when I have candidate records do I join those back to the
 tiger data to get the geometry and compute the location.
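
 The workflow above could be sketched roughly as follows. This is Python
 with an in-memory SQLite database standing in for PostgreSQL; the column
 names, the sample rows and the placeholder standardize() function are
 assumptions for illustration, and geometry is kept as plain WKT text.

```python
import sqlite3

def standardize(raw):
    """Placeholder for the real standardizer: uppercase, strip periods."""
    return " ".join(raw.upper().replace(".", "").split())

def build_stdaddr(conn):
    """Drop and recreate stdaddr from the tiger table; no updates needed."""
    conn.execute("DROP TABLE IF EXISTS stdaddr")
    conn.execute(
        "CREATE TABLE stdaddr ("
        " id INTEGER PRIMARY KEY REFERENCES tiger(id),"
        " std_name TEXT)"
    )
    rows = conn.execute("SELECT id, fullname FROM tiger").fetchall()
    conn.executemany(
        "INSERT INTO stdaddr (id, std_name) VALUES (?, ?)",
        [(row_id, standardize(name)) for row_id, name in rows],
    )
    conn.execute("CREATE INDEX stdaddr_name_idx ON stdaddr (std_name)")

def geocode(conn, query):
    """Search stdaddr only, then join candidates back to tiger."""
    candidates = conn.execute(
        "SELECT id FROM stdaddr WHERE std_name = ?", (standardize(query),)
    ).fetchall()
    results = []
    for (row_id,) in candidates:
        results.extend(
            conn.execute(
                "SELECT id, geom FROM tiger WHERE id = ?", (row_id,)
            ).fetchall()
        )
    return results

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tiger (id INTEGER PRIMARY KEY, fullname TEXT, geom TEXT)"
)
conn.executemany(
    "INSERT INTO tiger VALUES (?, ?, ?)",
    [
        (1, "Main St.", "LINESTRING(0 0, 1 1)"),
        (2, "  oak   ave", "LINESTRING(2 2, 3 3)"),
    ],
)
build_stdaddr(conn)
```

 After a change to the standardization rules, calling build_stdaddr(conn)
 again rebuilds the search table without touching the loaded tiger rows.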

 So for production, you install a "standard" set of tables for lex, gaz and
 rules. You load your data, create the stdaddr table and you are done.
 Users should not be modifying the lex, gaz or rules unless they are
 developing a different geocoder, and then they are not your normal user
 and they have to understand the process for doing this, including the
 fact that they need to recreate the stdaddr table if they make changes.

 While this may require a lot of changes in the current geocoder to move to
 this structure, long term it is good because it moves you away from being
 Tiger centric. If our northern neighbors want to use it for Canadian data,
 they can make a loader for that data, standardize it into a stdaddr table
 and your geocoder will work on that too.

 This simplicity will also translate into cleaner and simpler code which
 will be easier to maintain and in all likelihood be faster also.

-- 
Ticket URL: <http://trac.osgeo.org/postgis/ticket/2289>
PostGIS <http://trac.osgeo.org/postgis/>
The PostGIS Trac is used for bug, enhancement & task tracking, a user and developer wiki, and a view into the subversion code repository of PostGIS project.