[postgis-users] Geocoders - General thoughts and questions

Stephen Woodbridge woodbri at swoodbridge.com
Sun Dec 9 19:11:05 PST 2012


Regina,

I just checked in the wrapper code and some README files to get you
started. The wrappers are in the .../api/psql/ folder. See the svn link
below.

Thanks,
   -Steve

On 12/9/2012 9:23 PM, Stephen Woodbridge wrote:
> On 12/9/2012 1:59 PM, Paragon Corporation wrote:
>> Steve,
>>
>>
>>> Does the Tiger geocoder have a set of test address that could be used
>>> for
>> comparison?
>>    As far as test addresses go, in the PostGIS code base trunk there
>> is an:
>>
>>    extras\tiger_geocoder\tiger_2011\regress folder that has some
>> normalizer, geocode, and reverse_geocode tests we run. These are mostly
>> ones from tickets that people complained about that we fixed.
>
> If you want a large set of geocoded addresses like the one I ran, I can
> see whether we can check it into the project and use it for testing.
>
> In the past I have just used a large set of addresses that I geocode and
> save, then I repeat that after code changes and compare the results,
> including the distances from the previous run. This can be used as a
> sanity check for performance and for changes from a baseline. Obviously
> you have to account for the fact that the reference set changes over
> time, but spot checking the differences is usually adequate to determine
> if something is really wrong.
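>
> A comparison pass can be as simple as joining the saved baseline run to
> the new run on the address id and flagging anything that moved more than
> some tolerance. Something along these lines (untested sketch; the table
> and column names are just placeholders):
>
>    -- rows whose geocoded point moved more than 100 m since the baseline
>    SELECT n.id,
>           ST_Distance(b.geom::geography, n.geom::geography) AS moved_m
>      FROM geocode_baseline b
>      JOIN geocode_newrun n USING (id)
>     WHERE ST_Distance(b.geom::geography, n.geom::geography) > 100.0
>     ORDER BY moved_m DESC;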
>
>> Loading data:
>>> 1. prep and load data for a reference set
>> The PostGIS tiger geocoder does this for the Tiger set; 2.1 can handle
>> the Tiger 2011/2012 load as well via a table-driven pseudo-rule
>> approach, doing the wget download, unzip, and shp2pgsql load.
>>
>> It also abstracts away Linux vs. Windows, all in SQL.  Basically it
>> uses SQL to generate a Linux/Unix sh script or a Windows batch script.
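>>
>> For example, something along these lines (quoting from memory, so the
>> exact call may differ slightly):
>>
>>    -- emit a Windows batch script (or pass 'sh' for Linux) that does
>>    -- the wget/unzip/shp2pgsql steps for the listed states
>>    SELECT Loader_Generate_Script(ARRAY['MA'], 'windows');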
>>
>> Does yours currently do that?  Just wondering.
>
> No, I have Perl and C code that I use to process the Tiger shapefiles
> into a standard table structure. I can then use that standard table
> structure for loading PAGC, my reverse geocoder, or this new geocoder. I
> have also written another variant on this that will map Navteq data into
> this same table structure. The C code actually uses a SQLite database
> and a lot of SQL during the process. So it reads the raw vendor data
> files and creates new shapefiles using my standard table structure,
> which looks like this:
>
> Filename:       /u/srcdata/tiger2012-pagc/01/001/Streets.dbf
> Version:        0x03 (ver. 3)
> Num of records: 11837
> Header length:  737
> Record length:  547
> Last change:    1995/7/26
> Num fields:     22
> Field info:
> Num     Name            Type    Len     Decimal
> 1.      TLID            N       10      0  -- Tiger TLID
> 2.      REFADDR         N       10      0
> 3.      NREFADDR        N       10      0
> 4.      SQLNUMF         C       12      0  -- house num format string
> 5.      SQLFMT          C       16      0  -- house num format string
> 6.      CFORMAT         C       16      0  -- house num format string
> 7.      NAME            C       80      0
> 8.      PREDIRABRV      C       15      0
> 9.      PRETYPABRV      C       50      0
> 10.     PREQUALABR      C       15      0
> 11.     SUFDIRABRV      C       15      0
> 12.     SUFTYPABRV      C       50      0
> 13.     SUFQUALABR      C       15      0
> 14.     SIDE            N       1       0  -- 1=right, 2=left
> 15.     TFID            N       10      0  -- Tiger TFID
> 16.     USPS            C       35      0  -- USPS preferred city name
> 17.     AC5             C       35      0  -- place || cousub
> 18.     AC4             C       35      0  -- cousub
> 19.     AC3             C       35      0  -- county
> 20.     AC2             C       35      0  -- state
> 21.     AC1             C       35      0  -- country
> 22.     POSTCODE        C       11      0  -- Tiger Zipcode
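>
> In PostgreSQL terms that comes out as roughly the table below (just a
> sketch; the types approximate the DBF definitions above, and the
> geometry column and SRID are my assumption for Tiger data):
>
>    CREATE TABLE streets (
>        gid        serial PRIMARY KEY,
>        tlid       bigint,            -- Tiger TLID
>        refaddr    integer,
>        nrefaddr   integer,
>        sqlnumf    text,              -- house num format string
>        sqlfmt     text,              -- house num format string
>        cformat    text,              -- house num format string
>        name       text,
>        predirabrv text, pretypabrv text, prequalabr text,
>        sufdirabrv text, suftypabrv text, sufqualabr text,
>        side       smallint,          -- 1=right, 2=left
>        tfid       bigint,            -- Tiger TFID
>        usps       text,              -- USPS preferred city name
>        ac5        text, ac4 text, ac3 text, ac2 text, ac1 text,
>        postcode   text,              -- Tiger zipcode
>        the_geom   geometry(LineString, 4269)
>    );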
>
>
>>  When we took on maintenance of the tiger geocoder, the main stickler
>> was that Windows users were in essence forgotten, so it was no good for
>> them. Since our main client supporting our work was on Windows, this
>> was REALLY important to fix, to make it easy for Windows users to load
>> data without much thinking, but we also needed to make sure the process
>> was easy for Linux users.
>
> Right, this makes perfect sense. For most of my clients I load and prep
> the data and create a dump for them and that is trivial to load on
> Windows or Linux.
>
>>> 2. standardize the reference data
>> I think George Silva was interested in this too.  He and, I think, David
>> Bitner were working on something similar.
>> I am a bit bothered that postgis tiger is so tied to the TIGER structure
>> and hope to change that, but speed and normalizing are the highest
>> priorities at the moment.
>
> One of the things that I did on this most recent geocoder was to load my
> shapefiles into a table, then standardize them into a separate table
> linked by the gid field. This table tends to be smaller and does not
> include the geometry column. This separate standardized table is where
> all the queries are done, so it has the structure of the standardizer
> output and not of the source data. You only access the source data when
> you score the results and compute the location.
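>
> Roughly like this (an illustrative sketch only; the standardized column
> names below are made up, not the actual standardizer output):
>
>    -- standardized copy, linked back to the source row by gid;
>    -- no geometry here, so it stays small
>    CREATE TABLE stdstreets (
>        gid       integer PRIMARY KEY REFERENCES streets (gid),
>        house_num text, predir text, pretype text, name text,
>        suftype   text, sufdir text, city text, state text, postcode text
>    );
>
>    -- searches run against stdstreets; the source table is only touched
>    -- to score candidates and interpolate the location
>    SELECT s.gid, r.refaddr, r.nrefaddr, r.the_geom
>      FROM stdstreets s
>      JOIN streets r USING (gid)
>     WHERE s.name = 'COMMONWEALTH' AND s.suftype = 'AVE';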
>
>>> Geocoding a request:
>>> 1. parse request and standardize it
>> This I think is one area where we can probably share a good chunk of
>> effort.
>> I think it's the hardest part and consumes most of our time.
>> What we have for normalizing is not modular and, as Leo would say, is a
>> huge mess and an eyesore he badly wants to modularize.
>>
>> When we took it over, I discovered that a good chunk of the time was
>> being wasted normalizing: as much as 200 ms and sometimes a minute.
>> For most cases I think we reduced this to 4-20 ms by getting rid of
>> some of the regex checks in the WHERE conditions so that normalizing
>> could use indexes.
>>
>> I would be interested in seeing what the performance is like if we can
>> swap that out with your C extension, and maybe we can package that
>> along with the postgis geocoder.
>
> This was my epiphany, which led to building the standardizer API using
> the PAGC code. This code has some issues but I can probably get it
> checked into svn so you can play with it. It currently has the following
> signature:
>
> select * from standardize_address(
>      'select seq, word::text, stdword::text, token from lex order by id',
>      'select seq, word::text, stdword::text, token from gaz order by id',
>      'select * from rules order by id',
>      'select 0::int4 as id,
>              ''116A commonwealth ave''::text as micro,
>              ''west concord, ma 01742''::text as macro');
>
> The first three arguments are SQL queries used to load the lexicon,
> gazetteer, and parser rules. The 4th argument is a query to retrieve
> the addresses you want to standardize, in three fields: id, micro,
> macro. We do not want to use the preparsed fields from Tiger because we
> want the standardizer code to decide how to break the fields into
> components. So even if it parses them incorrectly, we want it to parse
> consistently on both the reference set and the geocode request.
>
> One problem I have with this wrapper is that it uses up all the memory
> if I try to standardize 50M records. It is not a leak; I just have not
> figured out how to handle something like a cursor over successive calls
> in a SRF, or how to properly chunk the rows. So I currently call it via
> Perl and standardize 1M rows at a whack, selecting them with gid between
> A and B.
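>
> Each chunk ends up as a call roughly like this (sketch; the first three
> arguments are the same lex/gaz/rules queries as above, and
> streets_micro_macro is a hypothetical view that assembles the micro and
> macro strings from the source columns):
>
>    SELECT *
>      FROM standardize_address(
>           'select seq, word::text, stdword::text, token from lex order by id',
>           'select seq, word::text, stdword::text, token from gaz order by id',
>           'select * from rules order by id',
>           'select gid as id, micro, macro
>              from streets_micro_macro
>             where gid between 1 and 1000000');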
>
> Also, there might be room for improvement if we could compile the lex,
> gaz and rules into a blob and retrieve that, rather than reconstructing
> them each call.
>
>>> 2. generate a query plan to match the standardized request to the
>>> standardized reference set; this might include fuzzy searching
>>
>> I think the inheritance model we have and building an ad hoc query work
>> fairly well. I haven't loaded the whole US in a while since I'm usually
>> only interested in one state like MA or PA, but I suspect it scales
>> well. But for 1-8 states (with CA, TX, AZ loaded), even on a low-end
>> Windows desktop we were getting around 30-100 ms warm and 200 ms cold.
>> Cloud offerings, which we usually run on, have varying disk speeds, so
>> it is hard to benchmark.
>>
>> I suspect that since we do a state hop first, performance should be
>> even better in 9.2 and 9.3, since improvements to inheritance planning
>> have been made in those.
>
> Yeah, my queries and table structure do not have this optimization built
> into them, and it would probably help for some queries.
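>
> For reference, my understanding of that state hop is roughly the pattern
> below: one child table per state with a CHECK constraint, so the planner
> can exclude every other state once the query pins the state down (a
> generic sketch, not your actual schema):
>
>    CREATE TABLE addr_all (gid integer, state text, name text);
>    CREATE TABLE addr_ma (CHECK (state = 'MA')) INHERITS (addr_all);
>    CREATE TABLE addr_pa (CHECK (state = 'PA')) INHERITS (addr_all);
>
>    SET constraint_exclusion = partition;   -- the default in 8.4+
>    SELECT * FROM addr_all
>     WHERE state = 'MA' AND name = 'COMMONWEALTH';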
>
>>> scoring results
>> The Tiger geocoder currently does hard-coded scoring.  I forget the
>> order of preference.  I think one thing to consider is a flexible
>> scoring system.
>> This is something several people have asked for.  E.g. if I know I'm
>> dealing with LA data, penalize anything not in a specific region.
>
> I have thought about creating a metadata table for my geocoder that
> would be like a config file. It would be trivial to store key-value
> pairs that could be used to control how scoring is done.
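>
> Something as simple as this would probably do (hypothetical names and
> values, just to illustrate the idea):
>
>    CREATE TABLE geocoder_settings (
>        name  text PRIMARY KEY,
>        value text
>    );
>
>    -- e.g. penalize candidates that fall outside a region of interest
>    INSERT INTO geocoder_settings (name, value) VALUES
>        ('region.bbox',            '-118.7,33.6,-117.6,34.4'),
>        ('penalty.outside_region', '0.25');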
>
>>> Most of my effort was in extracting the standardizer code and creating
>>> a library. This is still pretty rough code as it was more of a
>>> prototype to see if I could do it.
>> Yep, we'd love to see this piece and see if we can reuse it.
>
> OK, I will see about getting this code checked in, including a simple
> test case to demonstrate how to use it. Things are going to get busy for
> me next, so that may slow things down a little. PAGC does compile under
> Windows if I'm not mistaken, but I only have tools for Linux.
>
> The standardizer code that I built into a library is here:
> http://pagc.svn.sourceforge.net/viewvc/pagc/branches/sew-refactor/pagclib/api/
>
>
> This is everything you need to get started on Linux.
> Start by looking at test_main.c; it implements a simple CLI for
> testing. This was made to mirror a similar command, pagc_stand, so that
> I could verify that my extracted library works identically to the
> original PAGC code. The postgresql binding is a little cleaner and
> easier to understand. I will get that checked in.
>
>>> This prototype was able to geocode a table of about 216K addresses in
>>> St Paul, MN on average in 44 ms each. Obviously there was a lot of
>>> caching happening because all the queries were focused in one fairly
>>> small area, but the Tiger data is represented in 50+ million records
>>> and cold cache hits are typically 100-200 ms. This varies based on
>>> doing fuzzy searches or not. This is on a 3-4 year old server with 4GB
>>> of memory and SATA1 disks in software RAID, running PostgreSQL 8.3.
>>
>> Again, we would love to see benchmarks.  We haven't been benchmarking
>> much on Linux since we service mostly Windows PostGIS users. From the
>> short tests we've done I think Linux is generally a bit faster (but not
>> by much) with the same number of cores, but we weren't able to do an
>> apples-to-apples test on the exact same server config, disk, etc. to
>> say one way or the other.
>>
>> I think the timings you are getting sound about similar to ours, but
>> given that we haven't loaded the whole tiger set and that we often run
>> on cloud servers, which generally have worse speed than physical disks,
>> we can't be sure.
>
> Yeah, getting apples-to-apples comparisons can be tough. At some point,
> if we need to do it, maybe someone can set up a couple of VMs, one with
> each OS, but I would assume they should be reasonably comparable,
> assuming the OSes and postgresql are tuned appropriately.
>
> I think the real value of having performance tests is to track changes
> from change to change and release to release.
>
>>> So my thoughts related to the Tiger Geocoder are that some of this
>>> might be able to be used there. Also, there is nothing that is Tiger
>>> specific other than the initial loading of the data and any
>>> preprocessing that might need to be done. The standardizer is table
>>> driven: it loads a lexicon, gazetteer, and the address parsing rules
>>> from tables. So if the data changes or the locale changes, in theory
>>> you only need to change these tables. In reality, more work needs to
>>> be done on the address parser so that it will support UTF8 for
>>> internationalization, but I have ideas for that.
>>
>> Definitely and again we would like to focus on building something that is
>> more internationally useful.  We like the basic philosophy of the postgis
>> tiger geocoder
>
> I would love to get some funding so we could get the author of PAGC to
> extend the standardizer for international needs, or do that ourselves.
> I already have a lot of data on what is needed for most of the countries
> in Europe, which could be the starting point for doing some work along
> these lines.
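>
> Since the lexicon and gazetteer are just tables, a lot of the
> locale-specific work boils down to rows like these (illustrative only;
> the token values below are placeholders, not real PAGC token codes):
>
>    INSERT INTO lex (id, seq, word, stdword, token)
>         VALUES (1001, 1, 'STRASSE', 'STR', 2);
>    INSERT INTO gaz (id, seq, word, stdword, token)
>         VALUES (2001, 1, 'BAYERN', 'BAVARIA', 11);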
>
>> In that it really only requires a machine running PostGIS (no external
>> dependencies like PHP etc. needed) and that you can do geocoding with it
>> without leaving the database.
>
> This is what I like also. It is trivial to access the geocoder from PHP,
> .NET, Perl, Java, etc., if you can connect to the database and issue a
> simple query.
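>
> With the tiger geocoder, for example, the whole interaction is roughly a
> single query like this (a sketch based on my reading of its docs):
>
>    SELECT g.rating,
>           pprint_addy(g.addy) AS address,
>           ST_X(g.geomout) AS lon,
>           ST_Y(g.geomout) AS lat
>      FROM geocode('116A commonwealth ave, west concord, ma 01742') AS g
>     ORDER BY g.rating
>     LIMIT 1;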
>
> Thanks,
>    -Steve
>
>> Thanks,
>> Regina and Leo
>> http://www.postgis.us
>>
>>
>


