[postgis-users] Geocoders - General thoughts and questions
woodbri at swoodbridge.com
Sun Dec 9 19:11:05 PST 2012
I just checked in the wrapper code and some README files to get you
started. the wrappers are in .../api/psql/ folder. See the link to svn
On 12/9/2012 9:23 PM, Stephen Woodbridge wrote:
> On 12/9/2012 1:59 PM, Paragon Corporation wrote:
>>> Does the Tiger geocoder have a set of test address that could be used
>> As far as test address we use, in the PostGIS code base trunk,
>> there is a:
>> extras\tiger_geocoder\tiger_2011\regress folder that has some
>> geocode,reverse_geocode tests we run. These are mostly ones from tickets
>> that people complained about that we fixed.
> If you want a large set of geocoded addresses like the one I ran, I can
> check if we can check it into the project and use it for testing.
> In the past I have just used a large set of addresses that I geocode and
> save, then I repeat that after code changes can compare the results
> including the distances from the previous run. This can be used as a
> sanity check for performance and for changes from a baseline. Obviously
> you have to account for the fact the reference set changing over time,
> but spot checking difference is usually adequate to determine if
> something is really wrong.
>> Loading data:
>>> 1. prep and load data for a reference set
>> PostGIS tiger geocoder does this for tiger set the 2.1 can handle Tiger
>> 2011/2012 load as well via a table driven pseudo rule approach
>> , doing the wget download, unzip, and the shp2pgsql load.
>> It also abstracts away Linux vs. Windows all in sql. Basically it
>> uses SQL
>> to generate a Linux/Unix sh script or windows batch script.
>> Does yours currently do that? Just wondering.
> No, I have Perl and C code that I use to process the Tiger shapefile
> into a standard table structure. I can then you that standard table
> structure for loading PAGC, my reverse geocoder or this new geocoder. I
> have also written another variant on this that will map Navteq data into
> this same table structure. The C code acutally uses a SQLite database
> and a lot of SQL during the process. So it read the raw vendor data
> files and creates new shapefiles using my standard table structure that
> looks like this:
> Filename: /u/srcdata/tiger2012-pagc/01/001/Streets.dbf
> Version: 0x03 (ver. 3)
> Num of records: 11837
> Header length: 737
> Record length: 547
> Last change: 1995/7/26
> Num fields: 22
> Field info:
> Num Name Type Len Decimal
> 1. TLID N 10 0 -- Tiger TLID
> 2. REFADDR N 10 0
> 3. NREFADDR N 10 0
> 4. SQLNUMF C 12 0 -- house num format string
> 5. SQLFMT C 16 0 -- house num format string
> 6. CFORMAT C 16 0 -- house num format string
> 7. NAME C 80 0
> 8. PREDIRABRV C 15 0
> 9. PRETYPABRV C 50 0
> 10. PREQUALABR C 15 0
> 11. SUFDIRABRV C 15 0
> 12. SUFTYPABRV C 50 0
> 13. SUFQUALABR C 15 0
> 14. SIDE N 1 0 -- 1=right, 2=left
> 15. TFID N 10 0 -- Tiger TFID
> 16. USPS C 35 0 -- USPS preferred city name
> 17. AC5 C 35 0 -- place || cousub
> 18. AC4 C 35 0 -- cousub
> 19. AC3 C 35 0 -- county
> 20. AC2 C 35 0 -- state
> 21. AC1 C 35 0 -- country
> 22. POSTCODE C 11 0 -- Tiger Zipcode
>> When we took on maintenance
>> of tiger geocoder, that was the main stickler
>> Was that windows users were in essence forgotten so it was no good for
>> windows users, and since our main client supporting our work
>> Was on windows, this was REALLY important to fix to make it easy for
>> users to load data without much thinking,
>> But we also needed to make sure the process was easy for Linux users.
> Right, this makes perfect sense. For most of my clients I load and prep
> the data and create a dump for them and that is trivial to load on
> Windows or Linux.
>>> 2. standardize the reference data
>> I think George Silva was interested in this too. He and I think David
>> Bitner were working on similar.
>> I am a bit bothered that postgis tiger is so tied to TIGER structure and
>> hope to change that, but speed and normalizing are highest priority at
> One of the things that I did on this most recent geocoder was to load my
> shapefiles into a table. Then standardize them into a separate table
> link by the gid field. This table tends to be smaller and does not
> include the geometry column. This separate standardized table is where
> all the queries are done so it has the structure of the standardizer
> output and not of the source data. You only access the source data when
> you score the results and compute location.
>>> Geocoding a request:
>>> 1. parse request and standardize it
>> This I think is one area where we can probably share a good chunk of
>> I think it's the hardest and consumes most of our time.
>> What we have for normalizing is not modular and as Leo would say is a
>> mess and eyesore he wants badly to modularize.
>> When we took it over, I discovered that a good chunk of the time was
>> wasted normalizing. As much as 200 ms and someitmes a minute. This
>> for most
>> I think we reduced to 4-20 ms by getting rid of some of the regex
>> checks in
>> the WHERE conditions so that normalizing could use indexes.
>> I would be interested in seeing what the performance is like if we can
>> out that with you C extension and maybe we can package that along with
>> postgis geocoder.
> This was my epiphany, which lead to building the standardizer API using
> the PAGC code. This code has some issues but I can probably get is
> checked into svn so you can play it. It currently has the following
> select * from standardize_address(
> 'select seq, word::text, stdword::text, token from lex order by id',
> 'select seq, word::text, stdword::text, token from gaz order by id',
> 'select * from rules order by id',
> 'select 0::int4 as id,
> ''116A commonwealth ave''::text as micro,
> ''west concord, ma 01742''::text as macro');
> The first three arguments are SQL queries used to load the lexicon,
> gazeteer and parser rules. The 4th argument is a query to retrieve the
> the addresses you want to standardize in three fields of id, micro,
> marco. We do not want to use the preparsed fields from Tiger because we
> want the standardizer code to decide how to break the fields into
> components. So even if it parses them incorrectly, we want it parse
> consistently on both the reference set and the geocode request.
> One problem I have with this wrapper is it uses up all the memory if I
> try to standardize 50M records. It is not a leak, I just have not
> figured out how to handle something like a cursor over successive calls
> in a SRF or how to properly chunk the rows. So I currently call it via
> Perl and standardize 1M rows at a wack selecting them with gid between A
> and B.
> Also, there might be room for improvement if we could compile the lex,
> gaz and rules into a blob and retrieve that, rather than reconstructing
> them each call.
>>> 2. generate a query plan to match the standardized request to the
>> standardized reference set, this might include fuzzy searching
>> I think the inheritance model we have and building an adhoc query works
>> fairly well. I haven't loaded the whole US in a while since I'm
>> usually only
>> interested in one state like MA or PA, but I suspect it scales well.
>> But for 1 - 8 (with CA, TX,AZ loaded) states even on a low end windows
>> desktop we were getting around 30-100ms speeds warm and 200ms cold.
>> Cloud offerings which we usually run on have varying disk speeds so
>> hard to
>> I suspect since we do a state hop first, performance should be even
>> in 9.2 and 9.3 since improvements to inheritance planning have been
>> made in
> Yeah, my queries and table structure do not have this optimiaztion build
> into them and it would probably help for some queries.
>>> scoring results
>> Tiger geocoder currently does a hard-coded scoring. I forget the
>> order of
>> preference. I think one thing to consider is to have a flexible scoring
>> This is something several people have asked for. E.g. if I know I'm
>> with LA data penalize anything not in a specific region.
> I have thought about creating a metadata table for my geocoder that
> would be like a config file. It would be trivial to store key-value
> pairs that could be used to control how scoring is done.
>>> Most of my effort was in extracting the standardizer code and creating a
>> library. This is still pretty rough code as it was more of a prototype to
>> see if I could do it.
>> Yap we'd love to see this piece and see if we can reuse it.
> Ok, I will see about getting this code checked in, include a simple test
> case to demonstrate how to use it. Next is going to get busy for me so
> that my slow things down a little. PAGC does compile under Windows if
> I'm not mistaken, but I only have tools for Linux.
> The standardizer code the I built into a library is here:
> This everything you need to start on linux.
> Start by looking at test_main.c this implements a simple CLI for
> testing. This was made to mirror a similar command pagc_stand so that I
> could verify that my extracted library work identically to the the
> original PAGC code. The postgresql binding is a little cleaner and
> easier to understand. I will get that checked in.
>>> This prototype was able to geocode a table of about 216K addresses in St
>> Paul, MN on average in 44 ms each. Obviously there was a lot of caching
>> happening because all the queries were focused in one fairly small
>> area, but
>> the Tiger data is represented in 50+ Million records and cold cache
>> hits are
>> typically 100-200 ms. This varies based on doing fuzzy searches or
>> not. This
>> is on a 3-4 year old server with 4GB of memory and SATA1 disks in
>> RAID, running Postgresql 8.3.
>> Again would love to see benchmarks. We haven't been benchmarking much on
>> Linux since we service mostly windows PostGIS users. Though from the
>> tests we've done I think Linux is generally a bit faster (but not by
>> with the same number of cores again, but we weren't able to do an
>> apple and
>> apple test on exact server config, disk etc to say one way or other.
>> I think the timings you are getting sound about similar to ours, but
>> that we haven't loaded the whole tiger set and that we often run on cloud
>> servers which generally have worse speed than physical discs, can't be
> Yeah getting apples to apples comparisons can be tuff. At some point if
> we need to do it maybe someone can setup a couple of VMs one with each
> OS, but I would assume that they should be reasonably comparable
> assuming the OS's and postgresql are optimized appropriately.
> I think the real value of having performance tests to to track changes
> from change to change and release to release.
>>> So my thoughts related to the Tiger Geocoder is that some of this
>>> might be
>> able to be used there. Also, there is nothing that is Tiger specific
>> than the initial loading of the data and preprocessing that might need
>> to be
>> done. The standardizer is table driven and it loads a lexicon,
>> gazeteer, and
>> the address parsing rules from tables. So if the data changes or the
>> changes you only need to change these tables in theory. In reality, more
>> work needs to be done on the address parser so that it will support
>> UTF8 for
>> internationalization but I have ideas for that.
>> Definitely and again we would like to focus on building something that is
>> more internationally useful. We like the basic philosophy of the postgis
>> tiger geocoder
> I would love to get some funding so we could get the author of PAGC to
> extend the standardizer along international needs or do that our selves
> I already have a lot of data on what is needed for most of the countries
> in Europe that could be the starting point for doing some work along
> these lines.
>> In that it really only requires a machine running PostGIS (no external
>> dependencies like PHP etc. needed) and that you can do geocoding with it
>> without leaving the database.
> This is what I like also. It is trivial to access the geocoder via PHP,
> .NET, Perl, Java, etc if you can connect to the database and issue a
> simple query.
>> Regina and Leo
>> postgis-users mailing list
>> postgis-users at lists.osgeo.org
> postgis-users mailing list
> postgis-users at lists.osgeo.org
More information about the postgis-users