[postgis-users] Geocoders - General thoughts and questions

Sun Dec 9 18:23:57 PST 2012

On 12/9/2012 1:59 PM, Paragon Corporation wrote:
> Steve,
>
>
>> Does the Tiger geocoder have a set of test address that could be used for
> comparison?
>    As far as test address we use, in the PostGIS code base trunk, there is a:
>
>    extras\tiger_geocoder\tiger_2011\regress folder that has some normalizer,
> geocode,reverse_geocode tests we run. These are mostly  ones from tickets
> that people complained about that we fixed.

If you want a large set of geocoded addresses like the one I ran, I can 
check if we can check it into the project and use it for testing.

In the past I have just used a large set of addresses that I geocode and 
save, then I repeat that after code changes can compare the results 
including the distances from the previous run. This can be used as a 
sanity check for performance and for changes from a baseline. Obviously 
you have to account for the fact the reference set changing over time, 
but spot checking difference is usually adequate to determine if 
something is really wrong.

> Loading data:
>> 1. prep and load data for a reference set
> PostGIS tiger geocoder does this for tiger set the 2.1 can handle Tiger
> 2011/2012 load as well via a table driven pseudo rule approach
> , doing the wget download, unzip, and the shp2pgsql load.
>
> It also abstracts away Linux vs. Windows all in sql.  Basically it uses SQL
> to generate a Linux/Unix sh script or windows batch script.
>
> Does yours currently do that?  Just wondering.

No, I have Perl and C code that I use to process the Tiger shapefile 
into a standard table structure. I can then you that standard table 
structure for loading PAGC, my reverse geocoder or this new geocoder. I 
have also written another variant on this that will map Navteq data into 
this same table structure. The C code acutally uses a SQLite database 
and a lot of SQL during the process. So it read the raw vendor data 
files and creates new shapefiles using my standard table structure that 
looks like this:

Filename:       /u/srcdata/tiger2012-pagc/01/001/Streets.dbf
Version:        0x03 (ver. 3)
Num of records: 11837
Header length:  737
Record length:  547
Last change:    1995/7/26
Num fields:     22
Field info:
Num     Name            Type    Len     Decimal
1.      TLID            N       10      0  -- Tiger TLID
2.      REFADDR         N       10      0
3.      NREFADDR        N       10      0
4.      SQLNUMF         C       12      0  -- house num format string
5.      SQLFMT          C       16      0  -- house num format string
6.      CFORMAT         C       16      0  -- house num format string
7.      NAME            C       80      0
8.      PREDIRABRV      C       15      0
9.      PRETYPABRV      C       50      0
10.     PREQUALABR      C       15      0
11.     SUFDIRABRV      C       15      0
12.     SUFTYPABRV      C       50      0
13.     SUFQUALABR      C       15      0
14.     SIDE            N       1       0  -- 1=right, 2=left
15.     TFID            N       10      0  -- Tiger TFID
16.     USPS            C       35      0  -- USPS preferred city name
17.     AC5             C       35      0  -- place || cousub
18.     AC4             C       35      0  -- cousub
19.     AC3             C       35      0  -- county
20.     AC2             C       35      0  -- state
21.     AC1             C       35      0  -- country
22.     POSTCODE        C       11      0  -- Tiger Zipcode

>  When we took on maintenance
> of tiger geocoder, that was the main stickler
> Was that windows users were in essence forgotten so it was no good for
> windows users, and since our main client supporting our work
> Was on windows, this was REALLY important to fix to make it easy for windows
> users to load data without much thinking,
> But we also needed to make sure the process was easy for Linux users.

Right, this makes perfect sense. For most of my clients I load and prep 
the data and create a dump for them and that is trivial to load on 
Windows or Linux.

>> 2. standardize the reference data
> I think George Silva was interested in this too.  He and I think David
> Bitner were working on similar.
> I am a bit bothered that postgis tiger is so tied to TIGER structure and
> hope to change that, but speed and normalizing are highest priority at
> moment.

One of the things that I did on this most recent geocoder was to load my 
shapefiles into a table. Then standardize them into a separate table 
link by the gid field. This table tends to be smaller and does not 
include the geometry column. This separate standardized table is where 
all the queries are done so it has the structure of the standardizer 
output and not of the source data. You only access the source data when 
you score the results and compute location.

>> Geocoding a request:
>> 1. parse request and standardize it
> This I think is one area where we can probably share a good chunk of effort.
> I think it's the hardest and consumes most of our time.
> What we have for normalizing is not modular and as Leo would say is a huge
> mess and eyesore he wants badly to modularize.
>
> When we took it over, I discovered that a good chunk of the time was being
> wasted normalizing. As much as 200 ms and someitmes a minute.  This for most
> cases
> I think we reduced to 4-20 ms by getting rid of some of the regex checks in
> the WHERE conditions so that normalizing could use indexes.
>
> I would be interested in seeing what the performance is like if we can swap
> out that with you C extension and maybe we can package that along with the
> postgis geocoder.

This was my epiphany, which lead to building the standardizer API using 
the PAGC code. This code has some issues but I can probably get is 
checked into svn so you can play it. It currently has the following 
signature:

select * from standardize_address(
     'select seq, word::text, stdword::text, token from lex order by id',
     'select seq, word::text, stdword::text, token from gaz order by id',
     'select * from rules order by id',
     'select 0::int4 as id,
             ''116A commonwealth ave''::text as micro,
             ''west concord, ma 01742''::text as macro');

The first three arguments are SQL queries used to load the lexicon, 
gazeteer and parser rules. The 4th argument is a query to retrieve the 
the addresses you want to standardize in three fields of id, micro, 
marco. We do not want to use the preparsed fields from Tiger because we 
want the standardizer code to decide how to break the fields into 
components. So even if it parses them incorrectly, we want it parse 
consistently on both the reference set and the geocode request.

One problem I have with this wrapper is it uses up all the memory if I 
try to standardize 50M records. It is not a leak, I just have not 
figured out how to handle something like a cursor over successive calls 
in a SRF or how to properly chunk the rows. So I currently call it via 
Perl and standardize 1M rows at a wack selecting them with gid between A 
and B.

Also, there might be room for improvement if we could compile the lex, 
gaz and rules into a blob and retrieve that, rather than reconstructing 
them each call.

>> 2. generate a query plan to match the standardized request to the
> standardized reference set, this might include fuzzy searching
>
> I think the inheritance model we have and building an adhoc query works
> fairly well. I haven't loaded the whole US in a while since I'm usually only
> interested in one state like MA or PA, but I suspect it scales well.
> But for 1 - 8 (with CA, TX,AZ loaded) states even on a low end windows
> desktop we were getting around 30-100ms speeds warm and 200ms cold.
> Cloud offerings which we usually run on have varying disk speeds so hard to
> benchmark.
>
> I suspect since we do a state hop first, performance should be even better
> in 9.2 and 9.3 since improvements to inheritance planning have been made in
> those.

Yeah, my queries and table structure do not have this optimiaztion build 
into them and it would probably help for some queries.

>> scoring results
> Tiger geocoder currently does a hard-coded scoring.  I forget the order of
> preference.  I think one thing to consider is to have a flexible scoring
> system.
> This is something several people have asked for.  E.g. if I know I'm dealing
> with LA data penalize anything not in a specific region.

I have thought about creating a metadata table for my geocoder that 
would be like a config file. It would be trivial to store key-value 
pairs that could be used to control how scoring is done.

>> Most of my effort was in extracting the standardizer code and creating a
> library. This is still pretty rough code as it was more of a prototype to
> see if I could do it.
> Yap we'd love to see this piece and see if we can reuse it.

Ok, I will see about getting this code checked in, include a simple test 
case to demonstrate how to use it. Next is going to get busy for me so 
that my slow things down a little. PAGC does compile under Windows if 
I'm not mistaken, but I only have tools for Linux.

The standardizer code the I built into a library is here:
http://pagc.svn.sourceforge.net/viewvc/pagc/branches/sew-refactor/pagclib/api/

This everything you need to start on linux.
Start by looking at test_main.c this implements a simple CLI for 
testing. This was made to mirror a similar command pagc_stand so that I 
could verify that my extracted library work identically to the the 
original PAGC code. The postgresql binding is a little cleaner and 
easier to understand. I will get that checked in.

>> This prototype was able to geocode a table of about 216K addresses in St
> Paul, MN on average in 44 ms each. Obviously there was a lot of caching
> happening because all the queries were focused in one fairly small area, but
> the Tiger data is represented in 50+ Million records and cold cache hits are
> typically 100-200 ms. This varies based on doing fuzzy searches or not. This
> is on a 3-4 year old server with 4GB of memory and SATA1 disks in software
> RAID, running Postgresql 8.3.
>
> Again would love to see benchmarks.  We haven't been benchmarking much on
> Linux since we service mostly windows PostGIS users. Though from the short
> tests we've done I think Linux is generally a bit faster (but not by much)
> with the same number of cores again, but we weren't able to do an apple and
> apple test on exact server config, disk etc to say one way or other.
>
> I think the timings you are getting sound about similar to ours, but given
> that we haven't loaded the whole tiger set and that we often run on cloud
> servers which generally have worse speed than physical discs, can't be sure.

Yeah getting apples to apples comparisons can be tuff. At some point if 
we need to do it maybe someone can setup a couple of VMs one with each 
OS, but I would assume that they should be reasonably comparable 
assuming the OS's and postgresql are optimized appropriately.

I think the real value of having performance tests to to track changes 
from change to change and release to release.

>> So my thoughts related to the Tiger Geocoder is that some of this might be
> able to be used there. Also, there is nothing that is Tiger specific other
> than the initial loading of the data and preprocessing that might need to be
> done. The standardizer is table driven and it loads a lexicon, gazeteer, and
> the address parsing rules from tables. So if the data changes or the locale
> changes you only need to change these tables in theory. In reality, more
> work needs to be done on the address parser so that it will support UTF8 for
> internationalization but I have ideas for that.
>
> Definitely and again we would like to focus on building something that is
> more internationally useful.  We like the basic philosophy of the postgis
> tiger geocoder

I would love to get some funding so we could get the author of PAGC to 
extend the standardizer along international needs or do that our selves 
I already have a lot of data on what is needed for most of the countries 
in Europe that could be the starting point for doing some work along 
these lines.

> In that it really only requires a machine running PostGIS (no external
> dependencies like PHP etc. needed) and that you can do geocoding with it
> without leaving the database.

This is what I like also. It is trivial to access the geocoder via PHP, 
.NET, Perl, Java, etc if you can connect to the database and issue a 
simple query.

Thanks,
   -Steve

> Thanks,
> Regina and Leo
> http://www.postgis.us
>
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users
>