[postgis-users] Geocoders - General thoughts and questions

Sun Dec 9 07:08:26 PST 2012

Hi Regina, and others interested in geocoders,

There has been a more general OSGeo Labs discussion that has included 
OpenGeocoder. OpenGeocoder was a list I started a while back that never 
grew into project. I want to have a more focused discussion here.

Some background:

Ten years ago I wrote a geocoder that used the old Tiger/Line data as 
input. Back then I converted Tiger/Line to shapefiles, and wrote a 
cgi-bin geocoder written in C that searched the shapefiles. It worked 
very well and returned results typically in 0.5-2.0 sec responses which 
was pretty good considering the HW back then. The problem with it was 
that a lot of code I wrote was implementing indexes and searching that 
is done way better in a database. That product was abandoned when I 
decided it was too hard to morph the Tiger shapefiles into the data 
structures that I used back then and the code was too structured around 
those old data structures.

In the interim, I supported and enhanced a proprietary geocoder written 
in Perl and eventually helped migrate these legacy tools over to Oracle 
Spatial so I got some experience with that also.

Fast forward to the last few years, where I was looking for an 
alternative, to my old code and I started looking at the postgis 
geocoder, PAGC geocoder, and a few others. At that time the Tiger 
geocoder looked like what I wanted but was a little too oriented to 
Tiger data and was a little rough around the edges, I think Regina and 
other have made some significant progress in smoothing that out and 
improving performance, but I have not tried it recently. PAGC had some 
very good technology internally for fuzzy match address standardization, 
result scoring, etc, but when I tried to scale it up from a few counties 
or a state to a national data set it ran into serious performance 
problems. I have been working over the past year with the developer to 
fix these issues and it has become more apparent that the code needs to 
be refactored and modularized. Similar to the geocoder that I wrote, 
much of the PAGC code is dedicated to doing database like things.

So what does this have to do with postgis geocoding?

If we break down geocoding into a series a process steps it looks 
something like this:

Loading data:
1. prep and load data for a reference set
2. standardize the reference data
3. build indexes and supporting infrastructure as required

Geocoding a request:
1. parse request and standardize it
2. generate a query plan to match the standardized request to the 
standardized reference set, this might include fuzzy searching
3. execute the query plan
4. score the results and order them best to worst
5. compute the location for each result

So from this there appear to be the following sub processes that are not 
handle directly by the database:

a. parse the request, for example a complete address in a single field
b. standardize an address or reference record
c. dynamically building a query plan based on the standardized request
d. scoring results
e. computing the resultant location
f. the prep and load data initially

Using this model I was able to write a new geocoder very quickly by:

1. I extracted the address standardizer code from PAGC and created an 
simple API and library. This code is checked into a branch in PAGC svn.
2. I wrote a simple postgresql wrapper for the standardizer, not yet 
checked into svn
3. I wrote a similar postgresql wrapper for the single line parser.
4. I wrote the rest of the geocoder c, d, and e in plpgsql.
5. for step f, I reused a process that I developed for load Tiger 
shapefiles into PAGC, only loaded the records into a postgis table instead.

Most of my effort was in extracting the standardizer code and creating a 
library. This is still pretty rough code as it was more of a prototype 
to see if I could do it. Step 2 still has a problem that I need some 
help on but I have a temporary work around. Step 3. was a trivial after 
thought. And step 4. took a few days to work out, code, and test.

This prototype was able to geocode a table of about 216K addresses in St 
Paul, MN on average in 44 ms each. Obviously there was a lot of caching 
happening because all the queries were focused in one fairly small area, 
but the Tiger data is represented in 50+ Million records and cold cache 
hits are typically 100-200 ms. This varies based on doing fuzzy searches 
or not. This is on a 3-4 year old server with 4GB of memory and SATA1 
disks in software RAID, running Postgresql 8.3.

I also spent a day and added intersection query support and have spent 
some time writing a PHP wrapper to expose it as a webservice. I plan to 
add support for geocoding to parcel or site addresses, ie: where you are 
not interpolating an address along a segment.

So my thoughts related to the Tiger Geocoder is that some of this might 
be able to be used there. Also, there is nothing that is Tiger specific 
other than the initial loading of the data and preprocessing that might 
need to be done. The standardizer is table driven and it loads a 
lexicon, gazeteer, and the address parsing rules from tables. So if the 
data changes or the locale changes you only need to change these tables 
in theory. In reality, more work needs to be done on the address parser 
so that it will support UTF8 for internationalization but I have ideas 
for that.

So if this is of interest, I would be happy to answer questions, compare 
results with the Tiger geocoder, etc.

Does the Tiger geocoder have a set of test address that could be used 
for comparison? Or do any of the users have this? I would be happy to 
invest some effort into loading a test set of addresses, especially if 
they have geocodes already so I can compare results.

Thoughts?
    -Steve W