[postgis-users] Geocoders - General thoughts and questions

Paragon Corporation lr at pcorp.us
Sun Dec 9 10:59:03 PST 2012


Steve,


> Does the Tiger geocoder have a set of test addresses that could be used
> for comparison?

As far as test addresses go, in the PostGIS code base trunk there is an
extras\tiger_geocoder\tiger_2011\regress folder that has some normalizer,
geocode, and reverse_geocode tests we run. These are mostly ones from
tickets that people complained about and that we fixed.
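
For a flavor of what those tests exercise, here's a sketch (the sample
address is illustrative; normalize_address and pprint_addy are the actual
geocoder functions, and the expected output is from memory, so treat it as
approximate):

  -- run the normalizer and compare the printed form to an expected string
  SELECT pprint_addy(normalize_address('202 East Fremont Street, Las Vegas, Nevada 89101'));
  -- expect: 202 E Fremont St, Las Vegas, NV 89101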



Loading data:
> 1. prep and load data for a reference set 
The PostGIS Tiger geocoder does this for the TIGER set; 2.1 can handle the
TIGER 2011/2012 load as well via a table-driven pseudo-rule approach,
doing the wget download, the unzip, and the shp2pgsql load.

It also abstracts away Linux vs. Windows, all in SQL.  Basically it uses
SQL to generate a Linux/Unix sh script or a Windows batch script.

Does yours currently do that?  Just wondering.  When we took on
maintenance of the Tiger geocoder, the main stickler was that Windows
users were in essence forgotten, so it was no good for them. Since our
main client supporting our work was on Windows, it was REALLY important
to fix this so that Windows users could load data without much thinking,
but we also needed to make sure the process stayed easy for Linux users.
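
In case it's useful for comparison, the script generation in 2.x looks
roughly like this (Loader_Generate_Script is the actual function; the
exact arguments may differ a bit between versions):

  -- emit a data load script for Massachusetts, per platform
  SELECT Loader_Generate_Script(ARRAY['MA'], 'sh');       -- Linux/Unix shell
  SELECT Loader_Generate_Script(ARRAY['MA'], 'windows');  -- Windows batch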



> 2. standardize the reference data 
I think George Silva was interested in this too.  He, and I think David
Bitner, were working on something similar.
I am a bit bothered that PostGIS Tiger is so tied to the TIGER structure
and hope to change that, but speed and normalizing are the highest
priorities at the moment.



> Geocoding a request:
> 1. parse request and standardize it
This, I think, is one area where we can probably share a good chunk of
effort. I think it's the hardest part and consumes most of our time.
What we have for normalizing is not modular and, as Leo would say, is a
huge mess and an eyesore he badly wants to modularize.

When we took it over, I discovered that a good chunk of the time was being
wasted normalizing: as much as 200 ms, and sometimes a minute. For most
cases I think we reduced this to 4-20 ms by getting rid of some of the
regex checks in the WHERE conditions so that normalizing could use
indexes.
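
To illustrate the kind of rewrite I mean (the table below is a simplified
stand-in, not the geocoder's actual schema): a regex buried in the WHERE
clause forces a scan of every lookup row, while an equality test against
the tokenized input can use the index on name.

  -- regex in the WHERE clause: no index can be used, every row is tested
  SELECT abbrev FROM street_type_lookup
   WHERE 'MAIN STREET' ~* ('\m' || name || '\M');

  -- equality against the tokenized input: an index on name is usable
  SELECT abbrev FROM street_type_lookup
   WHERE name = ANY (regexp_split_to_array(upper('MAIN STREET'), '\s+'));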

I would be interested in seeing what the performance is like if we swap
that out for your C extension; maybe we can package that along with the
PostGIS geocoder.



> 2. generate a query plan to match the standardized request to the
standardized reference set, this might include fuzzy searching 

I think the inheritance model we have and building an ad hoc query works
fairly well. I haven't loaded the whole US in a while since I'm usually
only interested in one state like MA or PA, but I suspect it scales well.
For 1-8 states loaded (including CA, TX, AZ), even on a low-end Windows
desktop we were getting speeds of around 30-100 ms warm and 200 ms cold.
The cloud offerings we usually run on have varying disk speeds, so they
are hard to benchmark.

Since we do a state hop first, I suspect performance should be even better
in 9.2 and 9.3, since improvements to inheritance planning have been made
in those releases.
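
For anyone following along, the layout is per-state child tables, so
constraint exclusion prunes every state but the one being searched (the
tables below are simplified from what the loader actually creates):

  CREATE TABLE edges (statefp text, fullname text /* ... */);
  CREATE TABLE edges_ma (CHECK (statefp = '25')) INHERITS (edges);
  CREATE TABLE edges_pa (CHECK (statefp = '42')) INHERITS (edges);

  -- with constraint_exclusion = partition (the default), this query
  -- only scans edges_ma
  SELECT fullname FROM edges
   WHERE statefp = '25' AND fullname ILIKE 'MAIN%';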



> scoring results
The Tiger geocoder currently does hard-coded scoring.  I forget the order
of preference.  One thing to consider is a flexible scoring system;
several people have asked for that.  E.g. if I know I'm dealing with LA
data, penalize anything not in a specific region.
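
Something along these lines could sit on top of the rating that geocode()
already returns (geocode, its rating column, and pprint_addy are real;
the regions table and the penalty of 50 are assumptions for the example):

  SELECT g.rating
         + CASE WHEN NOT ST_Intersects(g.geomout, r.geom)
                THEN 50 ELSE 0 END AS adjusted_rating,
         pprint_addy(g.addy) AS address
    FROM geocode('100 Main St, Los Angeles, CA') AS g
         CROSS JOIN (SELECT geom FROM regions WHERE name = 'LA') AS r
   ORDER BY adjusted_rating
   LIMIT 1;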

 


> Most of my effort was in extracting the standardizer code and creating a
library. This is still pretty rough code as it was more of a prototype to
see if I could do it. 
Yep, we'd love to see this piece and see if we can reuse it.

> This prototype was able to geocode a table of about 216K addresses in St
Paul, MN on average in 44 ms each. Obviously there was a lot of caching
happening because all the queries were focused in one fairly small area, but
the Tiger data is represented in 50+ Million records and cold cache hits are
typically 100-200 ms. This varies based on doing fuzzy searches or not. This
is on a 3-4 year old server with 4GB of memory and SATA1 disks in software
RAID, running Postgresql 8.3.

Again, we would love to see benchmarks.  We haven't been benchmarking much
on Linux since we service mostly Windows PostGIS users. From the short
tests we've done, I think Linux is generally a bit faster (but not by
much) with the same number of cores, but we weren't able to do an
apples-to-apples test on the exact same server config, disks, etc. to say
one way or the other.

The timings you are getting sound similar to ours, but given that we
haven't loaded the whole TIGER set, and that we often run on cloud
servers, which generally have worse speeds than physical disks, we can't
be sure.



> So my thoughts related to the Tiger Geocoder is that some of this might be
able to be used there. Also, there is nothing that is Tiger specific other
than the initial loading of the data and preprocessing that might need to be
done. The standardizer is table driven and it loads a lexicon, gazetteer, and
the address parsing rules from tables. So if the data changes or the locale
changes you only need to change these tables in theory. In reality, more
work needs to be done on the address parser so that it will support UTF8 for
internationalization but I have ideas for that.

Definitely, and again we would like to focus on building something that is
more internationally useful.  We like the basic philosophy of the PostGIS
Tiger geocoder in that it really only requires a machine running PostGIS
(no external dependencies like PHP etc. needed) and that you can do
geocoding without leaving the database.
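
A rough sketch of what "table driven" buys you, as we understand it (the
schema below is illustrative, not your standardizer's actual layout):

  CREATE TABLE lexicon (
      word    text,  -- token as it appears in input, e.g. 'SAINT'
      stdword text,  -- standardized form, e.g. 'ST'
      token   int    -- token-class code consumed by the parsing rules
  );
  INSERT INTO lexicon VALUES ('SAINT', 'ST', 1), ('AVENIDA', 'AVE', 2);
  -- switching locales then means loading different rows, not new code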
 

Thanks,
Regina and Leo
http://www.postgis.us



