[postgis-users] A more practical geocoder
Jason Horning
jason.horning at bullberrysystems.com
Tue Nov 13 08:17:11 PST 2007
To whom it may concern,
When we started to prototype our proof-of-concept web map using PostGIS,
we needed a geocoder but couldn't find one to fit our needs. We tried
the Tiger Geocoder, but its data requirements seemed excessive, and we
also needed to operate on the custom GIS data being developed by state
and local government GIS departments.
So, based on our fairly intimate understanding of how people do address
searches, we set out to prototype our own geocoder. Our geocoder
functions somewhat differently from others we have worked with.
One notable difference: we rely more on pattern matching than on
normalization in the source data. For example, our geocoder does not
require road names in the roads table to be broken down into: prefix
direction, prefix type, street name, suffix type, and suffix direction.
We require only that a given road segment have a label. So, for example,
with other geocoders, a segment of "Main Street NorthEast" would be
attributed like so: prefix direction = "", prefix type = "", street
name = "Main", suffix type = "Street", suffix direction = "NorthEast".
Our geocoder allows the segment of road to be attributed with only:
label="Main Street NorthEast". We use pattern matching techniques to
normalize that data when we create the geocoding indexes. We then use
the same pattern matching techniques to normalize user input when
someone searches for "123 Main Street NE". While this may not be
entirely revolutionary, we do get good matches. We firmly believe that
simplifying the data model to allow the computer (instead of the GIS
analyst) to do the normalization is less error-prone and can have other
side benefits as well. We perform interpolation along line segments in
the usual fashion, using the LeftFrom, LeftTo, RightFrom and RightTo
address-range attributes on each segment.
Finally, we do not require zone information (Zip Left, Zip Right,
Community Left, Community Right, etc.) on the road segments themselves
and instead rely on the spatial relationship of a road segment to the
zone (polygon) it intersects or is within.
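
The label normalization and address-range interpolation described above can be sketched roughly as follows. This is an illustration in Python, not the Perl code the post refers to. The abbreviation tables, the function names, and the assumption that each side of a segment carries house numbers of a single parity are all hypothetical choices made for the example.

```python
import re

# Hypothetical abbreviation tables; a real deployment would need fuller lists.
DIRECTIONS = {
    "north": "N", "south": "S", "east": "E", "west": "W",
    "northeast": "NE", "northwest": "NW",
    "southeast": "SE", "southwest": "SW",
}
STREET_TYPES = {
    "street": "ST", "avenue": "AVE", "road": "RD",
    "boulevard": "BLVD", "drive": "DR", "lane": "LN",
}


def normalize_label(label):
    """Reduce a free-form road label to a canonical upper-case key."""
    tokens = re.findall(r"[a-z0-9]+", label.lower())
    out = []
    for tok in tokens:
        if tok in DIRECTIONS:
            out.append(DIRECTIONS[tok])
        elif tok in STREET_TYPES:
            out.append(STREET_TYPES[tok])
        else:
            out.append(tok.upper())
    return " ".join(out)


def split_house_number(query):
    """Split '123 Main Street NE' into (123, 'Main Street NE')."""
    match = re.match(r"\s*(\d+)\s+(.*\S)\s*$", query)
    if match is None:
        return None, query.strip()
    return int(match.group(1)), match.group(2)


def interpolate_fraction(number, left_from, left_to, right_from, right_to):
    """Return (side, fraction along the segment), or None if out of range."""
    for side, start, end in (("left", left_from, left_to),
                             ("right", right_from, right_to)):
        low, high = min(start, end), max(start, end)
        # Assumption: each side holds numbers of one parity (odd or even).
        if low <= number <= high and number % 2 == start % 2:
            if start == end:
                return side, 0.5
            return side, (number - start) / (end - start)
    return None
```

The same `normalize_label` function is applied to the road labels when the index is built and to the street part of the user's query, so "Main Street NorthEast" and "Main Street NE" end up with the same key. The fraction returned by `interpolate_fraction` could then be handed to a function such as PostGIS's ST_LineInterpolatePoint to obtain a point on the segment.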
To find and sort matches by relevance, we use a multi-step
incremental scoring algorithm (matching the street name perfectly earns
a lot of points, a soundex match earns a few, and there are bonus points
for being in the correct postal zone, etc.). We use a number of
heuristics that have come out of our general experience. Those
heuristics are evident in the code.
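
As a rough illustration of this kind of incremental scoring, here is a minimal Python sketch. It is not the scoring code from our Perl implementation: the point values, the function names, and the rule that the zone bonus only applies once the name has matched are assumptions made for the example.

```python
def soundex(name):
    """Classic four-character Soundex code, e.g. 'Robert' -> 'R163'."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    chars = [ch for ch in name.upper() if ch.isalpha()]
    if not chars:
        return ""
    result = chars[0]
    previous = codes.get(chars[0], "")
    for ch in chars[1:]:
        digit = codes.get(ch, "")
        if digit and digit != previous:
            result += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            previous = digit
    return (result + "000")[:4]


# Hypothetical point values, chosen only to show the relative weighting.
EXACT_NAME_POINTS = 100
SOUNDEX_POINTS = 20
ZONE_BONUS = 10


def score_candidate(query_name, query_zip, cand_name, cand_zip):
    """Add up points step by step for one candidate road segment."""
    score = 0
    if query_name.upper() == cand_name.upper():
        score += EXACT_NAME_POINTS
    elif soundex(query_name) == soundex(cand_name):
        score += SOUNDEX_POINTS
    # Assumption: the zone bonus only counts once the name has matched.
    if score and query_zip is not None and query_zip == cand_zip:
        score += ZONE_BONUS
    return score


def rank(query_name, query_zip, candidates):
    """Sort (name, zip) candidates by descending score, dropping non-matches."""
    scored = [(score_candidate(query_name, query_zip, name, zone), name, zone)
              for name, zone in candidates]
    return sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
```

For example, calling `rank("Main", "59601", ...)` on a candidate list would place an exact "Main" in zone 59601 ahead of an exact "Main" in another zone, and both ahead of a sound-alike such as "Mane".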
Essentially, we have approached geocoding as a natural language parsing
problem. Our geocoder has been built around USPS addressing
standards, but it would be possible to generalize it to work
with other locales. The key difference in the way we have approached
the problem is that the street name data is not normalized and as such
there is no requirement to break it up into components that only make
sense to a subset of localities.
While we understand that what we have is still essentially a prototype,
we think that what we have done to date could be of some use to
people focused on mapping and geocoding projects, and we would
be happy to provide it for others to inspect. It is implemented
in fairly well-commented (and relatively standard) Perl, so we think the
code should be readable by most coders working in this area.
Jason Horning
BullBerry Systems, Inc.