[postgis-users] Geocoders - General thoughts and questions
woodbri at swoodbridge.com
Sun Dec 9 07:08:26 PST 2012
Hi Regina, and others interested in geocoders,
There has been a more general OSGeo Labs discussion that has included
OpenGeocoder. OpenGeocoder was a list I started a while back that never
grew into a project. I want to have a more focused discussion here.
Ten years ago I wrote a geocoder that used the old Tiger/Line data as
input. Back then I converted Tiger/Line to shapefiles and wrote a
cgi-bin geocoder in C that searched them. It worked very well, typically
returning results in 0.5-2.0 seconds, which was pretty good considering
the hardware back then. The problem was that much of the code I wrote
implemented indexing and searching that a database does far better. I
abandoned that product when I decided it was too hard to morph the Tiger
shapefiles into the data structures I had used, and the code was too
tightly structured around those old data structures.
In the interim, I supported and enhanced a proprietary geocoder written
in Perl and eventually helped migrate these legacy tools over to Oracle
Spatial so I got some experience with that also.
Fast forward to the last few years, when I was looking for an
alternative to my old code and started looking at the PostGIS geocoder,
the PAGC geocoder, and a few others. At the time the Tiger geocoder
looked like what I wanted, but it was a little too oriented to Tiger
data and a little rough around the edges. I think Regina and others
have made significant progress in smoothing that out and improving
performance, but I have not tried it recently. PAGC had some very good
technology internally for fuzzy-match address standardization, result
scoring, etc., but when I tried to scale it up from a few counties or a
state to a national data set it ran into serious performance problems.
I have been working over the past year with the developer to fix these
issues, and it has become apparent that the code needs to be refactored
and modularized. Similar to the geocoder that I wrote, much of the PAGC
code is dedicated to doing database-like things.
So what does this have to do with postgis geocoding?
If we break geocoding down into a series of process steps, it looks
something like this:
1. prep and load data for a reference set
2. standardize the reference data
3. build indexes and supporting infrastructure as required
Geocoding a request:
1. parse request and standardize it
2. generate a query plan to match the standardized request to the
standardized reference set; this might include fuzzy searching
3. execute the query plan
4. score the results and order them best to worst
5. compute the location for each result
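As a rough illustration of steps 2-4 above, here is a minimal sketch in Python of building a query plan (an exact pass with a fuzzy fallback) and scoring candidates. The table name, column names, and score weights are all invented for the example; they are not from PAGC or the Tiger geocoder.

```python
# Hypothetical sketch: given a standardized request, produce an ordered
# list of queries to try (exact first, then fuzzy), and rank candidate
# rows by a crude additive score. All names and weights are made up.

def build_query_plan(std):
    """Return an ordered list of (label, sql, params) to execute."""
    plan = []
    # Pass 1: exact match on the standardized fields.
    plan.append((
        "exact",
        "SELECT * FROM ref_streets "
        "WHERE name = %s AND suftype = %s AND city = %s",
        (std["name"], std["suftype"], std["city"]),
    ))
    # Pass 2: fuzzy fallback, e.g. soundex on the street name.
    plan.append((
        "fuzzy",
        "SELECT * FROM ref_streets "
        "WHERE soundex(name) = soundex(%s) AND city = %s",
        (std["name"], std["city"]),
    ))
    return plan

def score(request, candidate):
    """Crude additive score: higher is a better match."""
    s = 0
    if candidate["name"] == request["name"]:
        s += 50
    if candidate["suftype"] == request["suftype"]:
        s += 10
    if candidate["city"] == request["city"]:
        s += 20
    # Penalize house numbers outside the candidate segment's range.
    if not (candidate["from_hn"] <= request["house"] <= candidate["to_hn"]):
        s -= 30
    return s

def rank(request, candidates):
    """Order candidates best to worst."""
    return sorted(candidates, key=lambda c: score(request, c), reverse=True)
```

In a real implementation the plan would be executed against the reference tables and later passes skipped once an earlier pass returns good matches.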
So from this there appear to be the following sub-processes that are
not handled directly by the database:
a. parsing the request, for example a complete address in a single field
b. standardizing an address or reference record
c. dynamically building a query plan based on the standardized request
d. scoring the results
e. computing the resultant location
f. the initial data prep and load
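To make sub-processes a and b concrete, here is a toy Python sketch of parsing a single-field address and normalizing its tokens against a small lexicon. A real standardizer (like PAGC's) is far more rule-driven; the regex, tokens, and abbreviations here are invented for illustration only.

```python
import re

# Tiny illustrative lexicon mapping raw tokens to canonical forms.
LEXICON = {
    "AVENUE": "AVE", "AV": "AVE",
    "STREET": "ST", "STR": "ST",
    "NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W",
}

def parse(line):
    """Very naive parse: house number, street name words, suffix type."""
    m = re.match(r"^\s*(\d+)\s+(.+?)\s+(\w+)\s*$", line)
    if not m:
        return None
    return {"house": int(m.group(1)),
            "name": m.group(2).upper(),
            "suftype": m.group(3).upper()}

def standardize(parsed):
    """Map each token to its canonical lexicon form."""
    std = dict(parsed)
    std["suftype"] = LEXICON.get(std["suftype"], std["suftype"])
    std["name"] = " ".join(LEXICON.get(w, w) for w in std["name"].split())
    return std
```

For example, "360 North Wabasha Street" parses to house 360 with name "NORTH WABASHA" and suffix "STREET", which standardizes to "N WABASHA" / "ST".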
Using this model I was able to write a new geocoder very quickly:
1. I extracted the address standardizer code from PAGC and created a
simple API and library. This code is checked into a branch in PAGC svn.
2. I wrote a simple PostgreSQL wrapper for the standardizer, not yet
checked into svn.
3. I wrote a similar PostgreSQL wrapper for the single-line parser.
4. I wrote the rest of the geocoder (c, d, and e) in plpgsql.
5. For step f, I reused a process I had developed for loading Tiger
shapefiles into PAGC, only loading the records into a PostGIS table
instead.
Most of my effort went into extracting the standardizer code and
creating a library. This is still pretty rough code, as it was more of
a prototype to see if I could do it. Step 2 still has a problem that I
need some help on, but I have a temporary workaround. Step 3 was a
trivial afterthought. Step 4 took a few days to work out, code, and
test.
This prototype was able to geocode a table of about 216K addresses in
St Paul, MN at an average of 44 ms each. Obviously there was a lot of
caching happening because all the queries were focused in one fairly
small area, but the Tiger data is represented in 50+ million records,
and cold-cache hits are typically 100-200 ms. This varies based on
whether fuzzy searches are used. This is on a 3-4 year old server with
4GB of memory and SATA1 disks in software RAID, running PostgreSQL 8.3.
I also spent a day adding intersection query support and have spent
some time writing a PHP wrapper to expose it as a web service. I plan
to add support for geocoding to parcel or site addresses, i.e. where
you are not interpolating an address along a segment.
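For readers unfamiliar with segment interpolation (step 5, computing the location), here is a minimal Python sketch of placing a house number proportionally along a segment's address range. The coordinates and ranges are invented; in PostGIS this kind of calculation is typically done with functions such as ST_LineInterpolatePoint.

```python
# Hypothetical sketch: linear interpolation of a house number along a
# street segment, given the segment's from/to address range and its
# endpoint coordinates. Straight-line segment assumed for simplicity.

def interpolate(house, from_hn, to_hn, start_xy, end_xy):
    """Place `house` proportionally between the segment endpoints."""
    if to_hn == from_hn:
        frac = 0.5                      # degenerate range: use midpoint
    else:
        frac = (house - from_hn) / (to_hn - from_hn)
    frac = min(max(frac, 0.0), 1.0)     # clamp onto the segment
    x = start_xy[0] + frac * (end_xy[0] - start_xy[0])
    y = start_xy[1] + frac * (end_xy[1] - start_xy[1])
    return (x, y)
```

Parcel or site geocoding skips this entirely and returns the stored point for the matched address record.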
So my thought related to the Tiger geocoder is that some of this might
be usable there. Also, there is nothing Tiger-specific other than the
initial loading of the data and whatever preprocessing might need to be
done. The standardizer is table driven: it loads a lexicon, gazetteer,
and the address parsing rules from tables. So if the data changes or
the locale changes, you only need to change these tables, in theory. In
reality, more work needs to be done on the address parser so that it
supports UTF-8 for internationalization, but I have ideas.
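The table-driven idea can be sketched as: the standardizer code stays fixed and only the lexicon rows change per locale. The rows below mimic what might live in a lexicon table; the words and abbreviations are invented examples, not PAGC's actual tables.

```python
# Two hypothetical per-locale lexicons, as (raw word, canonical form)
# rows such as might be loaded from a database table.
LEXICON_EN = [("STREET", "ST"), ("AVENUE", "AVE")]
LEXICON_DE = [("STRASSE", "STR"), ("PLATZ", "PL")]

def make_standardizer(rows):
    """Build a word standardizer from lexicon rows; the code never
    changes, only the data it is fed."""
    table = dict(rows)
    def standardize_word(word):
        return table.get(word.upper(), word.upper())
    return standardize_word

std_en = make_standardizer(LEXICON_EN)
std_de = make_standardizer(LEXICON_DE)
```

Swapping the rows retargets the same standardizer to a different locale without touching the code.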
So if this is of interest, I would be happy to answer questions, compare
results with the Tiger geocoder, etc.
Does the Tiger geocoder have a set of test addresses that could be used
for comparison? Or do any of the users have one? I would be happy to
invest some effort into loading a test set of addresses, especially if
they already have geocodes so I can compare results.