[postgis-tickets] [PostGIS] #2289: Redesign Tiger Geocoder
PostGIS
trac at osgeo.org
Thu Apr 25 20:55:20 PDT 2013
#2289: Redesign Tiger Geocoder
----------------------------+-----------------------------------------------
Reporter: woodbri | Owner: robe
Type: enhancement | Status: new
Priority: medium | Milestone:
Component: tiger geocoder | Version: 2.0.x
Keywords: |
----------------------------+-----------------------------------------------
I migrated these comments from #1118 as they really should be in a
separate ticket. The discussion around the issue in #1118 is really a
symptom of the fact that we need to redesign the tiger geocoder to better
use the new pagc_address_standardizer, and to make it possible in the
future to be less Tiger centric (as a secondary goal).
Ticket #1118 is a problem of not standardizing the reference dataset and
relying on the existing standardization. This is a process bug, not a code
bug. If you take a random address and ask some people to standardize it
into components, you will surely get some different results because the
people will have a different set of rules in mind. Tiger data was
standardized by 3300 different counties, where it was collected and then
given to Census, so you will not even find consistency within Tiger.
Relying on the pre-parsed standardization is the wrong way to approach
this problem.
The way to fix this is to load the tiger data, then clump the name
attributes into a single string and give it to the standardizer to parse
and then save that. When we get a query request, we standardize that using
our same standardizer and rules and we match those results against our
standardized reference set.
Then we don't care if the standardization is right or wrong, because if it
is wrong, it will be wrong in both cases and will still match.
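A minimal sketch of that idea in Python, for illustration only. The
toy_standardize function and its tiny lexicon are invented stand-ins and are
not the pagc_address_standardizer or its rules; the row layout is also an
assumption. The point is only that the reference data and the query go through
the same function, so differently pre-parsed source rows end up matching.

```python
# Toy stand-in for the real standardizer: the lexicon below is invented
# for illustration and is NOT the pagc_address_standardizer lexicon.
TOY_LEXICON = {"street": "ST", "st": "ST", "avenue": "AVE", "ave": "AVE",
               "north": "N", "n": "N"}

def toy_standardize(text):
    """Strip punctuation and map each token to a single canonical form."""
    tokens = text.replace(".", " ").replace(",", " ").split()
    return " ".join(TOY_LEXICON.get(t.lower(), t.upper()) for t in tokens)

def build_reference(rows):
    """Clump each row's pre-parsed name parts into one string, then standardize."""
    index = {}
    for row_id, parts in rows:
        clumped = " ".join(p for p in parts if p)
        index.setdefault(toy_standardize(clumped), []).append(row_id)
    return index

def lookup(index, query):
    """Standardize the query with the same rules and match exactly."""
    return index.get(toy_standardize(query), [])

reference = build_reference([
    (1, ("N", "Main", "Street")),     # parts as one source parsed them
    (2, ("", "North Main", "St.")),   # same street, parsed differently
    (3, ("", "Oak", "Avenue")),
])
```

Here lookup(reference, "north main st") finds both rows 1 and 2, even though
the two source rows split the name into components differently.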
This process also has the benefit that you can analyze the records that
failed to standardize because of missing lexicon, gazetteer or rules
entries and add those we need, improving the tools over time. This part
can be done separately from the automated loading process. It should be
done as part of the bug fixing and enhancements to the geocoder over time.
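One way such an analysis could look, sketched in Python. Everything here is
hypothetical: the KNOWN_TYPES set is a toy lexicon, and treating "the last
token is not a known street type" as a standardization failure is a
simplifying assumption, not how the real standardizer reports failures.

```python
from collections import Counter

# Hypothetical toy lexicon of street types; the real lex/gaz/rules tables
# are much larger and structured differently.
KNOWN_TYPES = {"ST", "AVE", "RD", "DR"}

def unrecognized_types(street_names):
    """Count trailing tokens that the toy lexicon does not recognize."""
    counts = Counter()
    for name in street_names:
        tokens = name.upper().replace(".", "").split()
        if tokens and tokens[-1] not in KNOWN_TYPES:
            counts[tokens[-1]] += 1
    return counts

sample = ["Main St", "Oak Ave.", "Vista Xing", "Deer Xing", "Elm Trce"]
report = unrecognized_types(sample)
```

Sorting the report by frequency (report.most_common()) gives a rough list of
candidate lexicon entries to review by hand, most common first.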
While the pagc address standardizer improves things and provides some easy
tools to change the behavior, if you don't make this process change you
will have an endless list of bugs like this that have nothing to do with
the code. While you might be able to fix some of these with changes to lex,
gaz and rules, you might also be breaking other cases that are not obvious
when you make the changes. DAMHIK.
I know the plan is to move forward without making this process change, but
it should be planned for sometime in the future.
---------------- robe ----------------------------
Yah I was thinking of it in future. I'll ticket that. I'm leaning toward
using hstore to store the normalized hash for the tiger set, possibly only
doing it for the obvious ambiguities.
The issues I have with doing it after load and for all records:
1) Inserting is a lot less painful than updating, since updating requires
both an insert and a delete. So it's faster to do on load.
2) Since this is in flux, there'll be a lot of updating going on initially,
so I don't want to push that on users until things are more stable. Plus
it complicates the update script, with updates requiring user data changes --
something I kind of want to stay away from until I have my upgrade
bulletproof.
3) I actually don't think it's necessary to standardize all of tiger (I
would say about 85% or more of it is fine). For the most part there aren't
that many ambiguities, and a lot of those would be long and painful to
itemize, and doing it by lex is probably not the right way.
Clearly for things like Camino etc. that would be the right thing.
So I'm thinking more along the lines of a hybrid. It would also make my
hstore index way shorter and faster to scan if it's only the questionable,
problematic ones that need to be changed. Anyway, I'll put in a separate
future ticket.
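One possible reading of the hybrid idea, sketched in Python. A plain dict
stands in for the hstore column, toy_standardize is an invented placeholder,
and the test for "questionable" (the standardized form differs from the
pre-parsed one) is an assumption; the thread does not say how those records
would be picked.

```python
def toy_standardize(text):
    # Invented two-entry rule set, for illustration only.
    mapping = {"STREET": "ST", "AVENUE": "AVE"}
    return " ".join(mapping.get(t, t) for t in text.upper().split())

def build_overrides(rows):
    """Keep a standardized form only where it differs from the pre-parsed one."""
    overrides = {}
    for row_id, preparsed in rows:
        std = toy_standardize(preparsed)
        if std != preparsed:
            overrides[row_id] = std
    return overrides

rows = [(1, "MAIN ST"), (2, "OAK AVENUE"), (3, "ELM ST")]
overrides = build_overrides(rows)
```

Only row 2 gets an entry here, so the override store stays small when most of
the pre-parsed data already agrees with the standardizer.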
For PostGIS 2.1 I would like to change the norm_addy structure, since part
of the issue is that I am mixing pre-abbreviation with post-abbreviation
values.
---------------- woodbri --------------------------------
I don't do any updates. I load the tiger data into a table, I then
standardize that into a stdaddr table that is linked by the primary key of the
tiger table. If I make changes to the lex, gaz, or rules, I drop the
stdaddr table and recreate it. All searches are done only on the stdaddr
table and only when I have candidate records do I join those back to the
tiger data to get the geometry and compute the location.
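A rough sketch of that workflow using Python's sqlite3 module as a stand-in
for PostgreSQL. The tiger_edges table, its columns, and toy_standardize are
hypothetical names made up for this example; only the stdaddr table name comes
from the description above.

```python
import sqlite3

def toy_standardize(text):
    # Placeholder for the real standardizer: uppercases and collapses spaces.
    return " ".join(text.upper().split())

def rebuild_stdaddr(conn):
    """Drop and recreate stdaddr from the loaded source table (no updates)."""
    conn.execute("DROP TABLE IF EXISTS stdaddr")
    conn.execute(
        "CREATE TABLE stdaddr ("
        "id INTEGER PRIMARY KEY REFERENCES tiger_edges(id), std_name TEXT)")
    rows = conn.execute("SELECT id, fullname FROM tiger_edges").fetchall()
    conn.executemany("INSERT INTO stdaddr (id, std_name) VALUES (?, ?)",
                     [(rid, toy_standardize(name)) for rid, name in rows])
    conn.execute("CREATE INDEX stdaddr_name_idx ON stdaddr (std_name)")

def find(conn, query):
    """Search stdaddr only, then join candidates back for the geometry."""
    return conn.execute(
        "SELECT t.id, t.geom_wkt FROM stdaddr s "
        "JOIN tiger_edges t ON t.id = s.id "
        "WHERE s.std_name = ? ORDER BY t.id",
        (toy_standardize(query),)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tiger_edges (id INTEGER PRIMARY KEY, fullname TEXT, geom_wkt TEXT)")
conn.executemany("INSERT INTO tiger_edges VALUES (?, ?, ?)", [
    (1, "Main  St", "LINESTRING(0 0, 1 0)"),
    (2, "Oak Ave", "LINESTRING(0 1, 1 1)"),
])
rebuild_stdaddr(conn)
```

After a change to the rules, calling rebuild_stdaddr(conn) again replaces the
whole derived table, and the source table is never touched.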
So for production, you install a "standard" set of tables for lex, gaz and
rules. You load your data, create the stdaddr table and you are done.
Users should not be modifying the lex, gaz or rules unless they are
developing a different geocoder, in which case they are not your normal
user and they have to understand the process for doing this, including the
fact that they need to recreate the stdaddr table if they make changes.
While this may require a lot of changes in the current geocoder to move to
this structure, long term it is good because it moves you away from being
Tiger centric. If our northern neighbors want to use it for Canadian data,
they can make a loader for that data, standardize it into a stdaddr table
and your geocoder will work on that too.
This simplicity will also translate into cleaner and simpler code which
will be easier to maintain and in all likelihood be faster also.
--
Ticket URL: <http://trac.osgeo.org/postgis/ticket/2289>
PostGIS <http://trac.osgeo.org/postgis/>