[postgis-users] Geocoders - General thoughts and questions
Stephen Woodbridge
woodbri at swoodbridge.com
Sun Dec 9 18:23:57 PST 2012
On 12/9/2012 1:59 PM, Paragon Corporation wrote:
> Steve,
>
>
>> Does the Tiger geocoder have a set of test address that could be used for
> comparison?
> As far as test address we use, in the PostGIS code base trunk, there is a:
>
> extras\tiger_geocoder\tiger_2011\regress folder that has some normalizer,
> geocode,reverse_geocode tests we run. These are mostly ones from tickets
> that people complained about that we fixed.
If you want a large set of geocoded addresses like the one I ran, I can
check if we can check it into the project and use it for testing.
In the past I have just used a large set of addresses that I geocode and
save, then I repeat that after code changes can compare the results
including the distances from the previous run. This can be used as a
sanity check for performance and for changes from a baseline. Obviously
you have to account for the fact the reference set changing over time,
but spot checking difference is usually adequate to determine if
something is really wrong.
> Loading data:
>> 1. prep and load data for a reference set
> PostGIS tiger geocoder does this for tiger set the 2.1 can handle Tiger
> 2011/2012 load as well via a table driven pseudo rule approach
> , doing the wget download, unzip, and the shp2pgsql load.
>
> It also abstracts away Linux vs. Windows all in sql. Basically it uses SQL
> to generate a Linux/Unix sh script or windows batch script.
>
> Does yours currently do that? Just wondering.
No, I have Perl and C code that I use to process the Tiger shapefile
into a standard table structure. I can then you that standard table
structure for loading PAGC, my reverse geocoder or this new geocoder. I
have also written another variant on this that will map Navteq data into
this same table structure. The C code acutally uses a SQLite database
and a lot of SQL during the process. So it read the raw vendor data
files and creates new shapefiles using my standard table structure that
looks like this:
Filename: /u/srcdata/tiger2012-pagc/01/001/Streets.dbf
Version: 0x03 (ver. 3)
Num of records: 11837
Header length: 737
Record length: 547
Last change: 1995/7/26
Num fields: 22
Field info:
Num Name Type Len Decimal
1. TLID N 10 0 -- Tiger TLID
2. REFADDR N 10 0
3. NREFADDR N 10 0
4. SQLNUMF C 12 0 -- house num format string
5. SQLFMT C 16 0 -- house num format string
6. CFORMAT C 16 0 -- house num format string
7. NAME C 80 0
8. PREDIRABRV C 15 0
9. PRETYPABRV C 50 0
10. PREQUALABR C 15 0
11. SUFDIRABRV C 15 0
12. SUFTYPABRV C 50 0
13. SUFQUALABR C 15 0
14. SIDE N 1 0 -- 1=right, 2=left
15. TFID N 10 0 -- Tiger TFID
16. USPS C 35 0 -- USPS preferred city name
17. AC5 C 35 0 -- place || cousub
18. AC4 C 35 0 -- cousub
19. AC3 C 35 0 -- county
20. AC2 C 35 0 -- state
21. AC1 C 35 0 -- country
22. POSTCODE C 11 0 -- Tiger Zipcode
> When we took on maintenance
> of tiger geocoder, that was the main stickler
> Was that windows users were in essence forgotten so it was no good for
> windows users, and since our main client supporting our work
> Was on windows, this was REALLY important to fix to make it easy for windows
> users to load data without much thinking,
> But we also needed to make sure the process was easy for Linux users.
Right, this makes perfect sense. For most of my clients I load and prep
the data and create a dump for them and that is trivial to load on
Windows or Linux.
>> 2. standardize the reference data
> I think George Silva was interested in this too. He and I think David
> Bitner were working on similar.
> I am a bit bothered that postgis tiger is so tied to TIGER structure and
> hope to change that, but speed and normalizing are highest priority at
> moment.
One of the things that I did on this most recent geocoder was to load my
shapefiles into a table. Then standardize them into a separate table
link by the gid field. This table tends to be smaller and does not
include the geometry column. This separate standardized table is where
all the queries are done so it has the structure of the standardizer
output and not of the source data. You only access the source data when
you score the results and compute location.
>> Geocoding a request:
>> 1. parse request and standardize it
> This I think is one area where we can probably share a good chunk of effort.
> I think it's the hardest and consumes most of our time.
> What we have for normalizing is not modular and as Leo would say is a huge
> mess and eyesore he wants badly to modularize.
>
> When we took it over, I discovered that a good chunk of the time was being
> wasted normalizing. As much as 200 ms and someitmes a minute. This for most
> cases
> I think we reduced to 4-20 ms by getting rid of some of the regex checks in
> the WHERE conditions so that normalizing could use indexes.
>
> I would be interested in seeing what the performance is like if we can swap
> out that with you C extension and maybe we can package that along with the
> postgis geocoder.
This was my epiphany, which lead to building the standardizer API using
the PAGC code. This code has some issues but I can probably get is
checked into svn so you can play it. It currently has the following
signature:
select * from standardize_address(
'select seq, word::text, stdword::text, token from lex order by id',
'select seq, word::text, stdword::text, token from gaz order by id',
'select * from rules order by id',
'select 0::int4 as id,
''116A commonwealth ave''::text as micro,
''west concord, ma 01742''::text as macro');
The first three arguments are SQL queries used to load the lexicon,
gazeteer and parser rules. The 4th argument is a query to retrieve the
the addresses you want to standardize in three fields of id, micro,
marco. We do not want to use the preparsed fields from Tiger because we
want the standardizer code to decide how to break the fields into
components. So even if it parses them incorrectly, we want it parse
consistently on both the reference set and the geocode request.
One problem I have with this wrapper is it uses up all the memory if I
try to standardize 50M records. It is not a leak, I just have not
figured out how to handle something like a cursor over successive calls
in a SRF or how to properly chunk the rows. So I currently call it via
Perl and standardize 1M rows at a wack selecting them with gid between A
and B.
Also, there might be room for improvement if we could compile the lex,
gaz and rules into a blob and retrieve that, rather than reconstructing
them each call.
>> 2. generate a query plan to match the standardized request to the
> standardized reference set, this might include fuzzy searching
>
> I think the inheritance model we have and building an adhoc query works
> fairly well. I haven't loaded the whole US in a while since I'm usually only
> interested in one state like MA or PA, but I suspect it scales well.
> But for 1 - 8 (with CA, TX,AZ loaded) states even on a low end windows
> desktop we were getting around 30-100ms speeds warm and 200ms cold.
> Cloud offerings which we usually run on have varying disk speeds so hard to
> benchmark.
>
> I suspect since we do a state hop first, performance should be even better
> in 9.2 and 9.3 since improvements to inheritance planning have been made in
> those.
Yeah, my queries and table structure do not have this optimiaztion build
into them and it would probably help for some queries.
>> scoring results
> Tiger geocoder currently does a hard-coded scoring. I forget the order of
> preference. I think one thing to consider is to have a flexible scoring
> system.
> This is something several people have asked for. E.g. if I know I'm dealing
> with LA data penalize anything not in a specific region.
I have thought about creating a metadata table for my geocoder that
would be like a config file. It would be trivial to store key-value
pairs that could be used to control how scoring is done.
>> Most of my effort was in extracting the standardizer code and creating a
> library. This is still pretty rough code as it was more of a prototype to
> see if I could do it.
> Yap we'd love to see this piece and see if we can reuse it.
Ok, I will see about getting this code checked in, include a simple test
case to demonstrate how to use it. Next is going to get busy for me so
that my slow things down a little. PAGC does compile under Windows if
I'm not mistaken, but I only have tools for Linux.
The standardizer code the I built into a library is here:
http://pagc.svn.sourceforge.net/viewvc/pagc/branches/sew-refactor/pagclib/api/
This everything you need to start on linux.
Start by looking at test_main.c this implements a simple CLI for
testing. This was made to mirror a similar command pagc_stand so that I
could verify that my extracted library work identically to the the
original PAGC code. The postgresql binding is a little cleaner and
easier to understand. I will get that checked in.
>> This prototype was able to geocode a table of about 216K addresses in St
> Paul, MN on average in 44 ms each. Obviously there was a lot of caching
> happening because all the queries were focused in one fairly small area, but
> the Tiger data is represented in 50+ Million records and cold cache hits are
> typically 100-200 ms. This varies based on doing fuzzy searches or not. This
> is on a 3-4 year old server with 4GB of memory and SATA1 disks in software
> RAID, running Postgresql 8.3.
>
> Again would love to see benchmarks. We haven't been benchmarking much on
> Linux since we service mostly windows PostGIS users. Though from the short
> tests we've done I think Linux is generally a bit faster (but not by much)
> with the same number of cores again, but we weren't able to do an apple and
> apple test on exact server config, disk etc to say one way or other.
>
> I think the timings you are getting sound about similar to ours, but given
> that we haven't loaded the whole tiger set and that we often run on cloud
> servers which generally have worse speed than physical discs, can't be sure.
Yeah getting apples to apples comparisons can be tuff. At some point if
we need to do it maybe someone can setup a couple of VMs one with each
OS, but I would assume that they should be reasonably comparable
assuming the OS's and postgresql are optimized appropriately.
I think the real value of having performance tests to to track changes
from change to change and release to release.
>> So my thoughts related to the Tiger Geocoder is that some of this might be
> able to be used there. Also, there is nothing that is Tiger specific other
> than the initial loading of the data and preprocessing that might need to be
> done. The standardizer is table driven and it loads a lexicon, gazeteer, and
> the address parsing rules from tables. So if the data changes or the locale
> changes you only need to change these tables in theory. In reality, more
> work needs to be done on the address parser so that it will support UTF8 for
> internationalization but I have ideas for that.
>
> Definitely and again we would like to focus on building something that is
> more internationally useful. We like the basic philosophy of the postgis
> tiger geocoder
I would love to get some funding so we could get the author of PAGC to
extend the standardizer along international needs or do that our selves
I already have a lot of data on what is needed for most of the countries
in Europe that could be the starting point for doing some work along
these lines.
> In that it really only requires a machine running PostGIS (no external
> dependencies like PHP etc. needed) and that you can do geocoding with it
> without leaving the database.
This is what I like also. It is trivial to access the geocoder via PHP,
.NET, Perl, Java, etc if you can connect to the database and issue a
simple query.
Thanks,
-Steve
> Thanks,
> Regina and Leo
> http://www.postgis.us
>
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users
>
More information about the postgis-users
mailing list