[postgis-users] Geocoder (from extras)
Stephen Woodbridge
woodbri at swoodbridge.com
Sat May 7 17:06:38 PDT 2011
On 5/7/2011 7:29 PM, Johnathan Leppert wrote:
> Do you know what the status of the PAGC project is? Is it still under
> active development? The last post was in 2008.
Johnathan,
Yes, the restructuring done by Walter was a major step forward in
usability in that it allows you to use SQLite instead of Berkeley DB,
which means that you can load the data into the geocoder and zip up the
SQLite database for deployment, something that was not easy, and maybe
not possible, with Berkeley DB. In addition, support has been added for
landmark geocoding, that is, named places without street addresses, so
cities, landmarks, points of interest and the like can now be loaded
and geocoded.
I have also added a single-line parser as a library and integrated it
with the geocode_response module. It will recognize street addresses,
street intersections and landmarks, and then pass the correct
parameters into the PAGC library.
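As a rough sketch of the kind of dispatch such a parser performs (the patterns below are illustrative assumptions, not PAGC's actual grammar):

```python
import re

def classify_query(text):
    """Toy single-line classifier: decide whether free-form input looks
    like a street intersection, a street address, or a landmark, so the
    right parameters can be passed to the geocoding library."""
    text = text.strip()
    # Intersections are usually written "A St & B Ave" or "A St and B Ave".
    if re.search(r'\s(&|and|at|@)\s', text, re.IGNORECASE) \
            and not re.match(r'\d', text):
        return 'intersection'
    # Street addresses typically begin with a house number.
    if re.match(r'\d+\s+\S', text):
        return 'address'
    # Anything else is treated as a named place (landmark).
    return 'landmark'
```

A real parser handles far more cases (directionals, suite numbers, zip-only queries), but the three-way split above is the essential decision.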
Part of what I have been trying to drive toward is a set of internal
interfaces that will modularize the code somewhat and hopefully allow
others to learn parts of it and be able to contribute to the project. I
have started with my own efforts in this regard and hope to learn more
of the internals as I have time to dig into them.
-Steve
> On Sat, May 7, 2011 at 5:00 PM, Stephen Woodbridge
> <woodbri at swoodbridge.com <mailto:woodbri at swoodbridge.com>> wrote:
>
> Stephen,
>
> In a geocoder I wrote in 2000, I mapped city, state and/or zipcode
> into a list of counties, then only searched those counties. This
> greatly reduced the 3300 counties in Tiger to a small handful that
> needed to be searched. The FIPS-4 document has a mapping of all
> place names to counties. Zipcodes are easier to handle because you
> can just index the records by zipcode, but if you want to widen the
> search, map the zipcode to a place name and then to counties.
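A minimal sketch of that narrowing step, with made-up sample data standing in for the FIPS-4 mapping:

```python
# Made-up sample lookup tables; in practice these would be built from
# the FIPS-4 place-to-county mapping and a zipcode-to-place table.
ZIP_TO_PLACE = {'43215': ('Columbus', 'OH')}
PLACE_TO_COUNTIES = {
    # Columbus spills across three Ohio counties (FIPS codes:
    # Franklin, Fairfield, Delaware).
    ('Columbus', 'OH'): ['39049', '39045', '39041'],
}

def candidate_counties(zipcode=None, city=None, state=None):
    """Return the handful of county FIPS codes worth searching instead
    of scanning all ~3300 Tiger counties."""
    if zipcode in ZIP_TO_PLACE:
        city, state = ZIP_TO_PLACE[zipcode]
    return PLACE_TO_COUNTIES.get((city, state), [])
```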
>
> My first thought was to use inherited tables where each state or
> state equivalent is in a separate table and filter the tables by the
> state name or abbreviation. It sounds like you might already be
> doing something like this.
>
> I think trigrams would provide some benefits. I have used double
> metaphone, but only used the first code for fuzzy searches; in my
> opinion it is a better phonetic search key than soundex because it
> matches things that sound alike even if they do not start with the
> same letter. Since I used it as a key, I only used the first 4
> characters of the metaphone code.
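For comparison, trigram similarity is easy to sketch in a few lines. This mimics pg_trgm's padding convention; the Jaccard scoring is my approximation of its formula:

```python
def trigrams(s):
    """Set of 3-character substrings, padded like pg_trgm (two leading
    blanks, one trailing) so word boundaries count as context."""
    s = '  ' + s.lower() + ' '
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    """Jaccard similarity of the two trigram sets, in [0, 1].  Like a
    phonetic key, it tolerates errors that are not at the start of the
    word, unlike soundex."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)
```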
>
> I wrote my own tiger geocoder in C and have since abandoned that in
> favor of PAGC. PAGC has just recently refactored the code to support
> SQLite as a datastore, and some work has been done to support
> PostgreSQL in addition to the previous support for Berkeley DB. One
> idea I had was that, with appropriate support for database backing
> stores, the PAGC library might be wrapped in a stored procedure,
> which would then provide a high-quality, high-performance geocoder
> in the database.
>
> Here is an instance that I created using SQLite as the backing store
> and Tiger 2010 data:
>
> http://imaptools.com:8080/geocode/
>
> The single line parsing is broken at the moment; I have a ticket to
> look into it and hope to have it resolved shortly.
>
> The PAGC project can be found here:
>
> http://www.pagcgeo.org/
>
> Regards,
> -Steve
>
>
>
> On 5/6/2011 10:36 PM, Stephen Frost wrote:
>
> * Johnathan Leppert (johnathan.leppert at gmail.com
> <mailto:johnathan.leppert at gmail.com>) wrote:
>
> Ok, I have it working, but it's very very slow. It takes about
> two seconds to geocode a single record. This isn't very realistic
> for large datasets. Anyone have any ideas on possible
> optimization? It appears to have created proper indexes on all
> the tables.
>
>
> Yeah, there's a few things I've been working on. For one, a lot
> of the time is spent planning that massive query at the heart of
> the geocoding, when it really doesn't need to be re-planned for
> every query. It's done the way it is currently because the tables
> are partitioned up by state. I've got someone working on simply
> splitting that query up to be 53 (or however many...) queries,
> one for each state/territory, which will allow them to be planned
> once and then the plans re-used inside the same session. That
> should improve things.
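The gain from that split can be sketched abstractly. Here the per-state "plan" is just a string standing in for what a real driver would create with a server-side PREPARE; the class and query shape are illustrative assumptions:

```python
class GeocoderSession:
    """Simulates caching one small per-state query plan per session,
    instead of re-planning one giant all-states query on every call."""

    def __init__(self):
        self._plans = {}      # state abbreviation -> cached plan
        self.plans_built = 0  # times we paid the planning cost

    def plan_for(self, state):
        if state not in self._plans:
            # Paid once per state per session, not once per lookup.
            self._plans[state] = (
                f"SELECT * FROM tiger_{state.lower()} WHERE ..."
            )
            self.plans_built += 1
        return self._plans[state]
```

Geocoding many records against the same few states then reuses the cached plans, which is where the per-call planning cost disappears.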
>
> I've also been looking into using trigrams for things instead of
> the current indexes. I think they'll perform better with regard
> to speed; I just need to make sure it still returns good results,
> etc.
>
> I'm very interested in anyone else working on this and any
> suggestions people have for how to improve it. I've not been able
> to work on it much recently, and while I've tried to delegate it
> to others, it's a pretty complex system which is, in particular,
> hard to test well.
>
> Thanks,
>
> Stephen
>
>
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at postgis.refractions.net
> <mailto:postgis-users at postgis.refractions.net>
> http://postgis.refractions.net/mailman/listinfo/postgis-users
>
>
> --
> /Johnathan /
> Software Architect & Developer
> Columbus, Ohio
> /Follow me on Twitter: @iamleppert <http://twitter.com/#!/iamleppert> /
>
>
>