[postgis-users] Geocoding Issues with Route, ##-## house numbers; upgrade questions

Stephen Woodbridge woodbri at swoodbridge.com
Wed Jul 27 20:29:39 PDT 2011


On 7/27/2011 5:36 PM, Paragon Corporation wrote:
> Dan,
>
>>  Hi,
>>  I'm using and abusing the geocoder, and I've come across a couple issues:
>
>>  1) Routes
>>  example: '1820 ROUTE 32, MODENA , NY 12548 ':

OK, there are a couple of issues here:

1. You are working with Tiger data.
2. There is probably room to improve the matching of equivalent names:
    state highway == route == ny highway == etc

So try these addresses instead:

1820 State Highway 32, plattekill, ny
1820 State Highway 32, wallkill, ny

and you will probably get your location. I know 12548 is Modena, but 
I don't think the segment in Tiger is tagged as 12548. In my PAGC 
geocoder I'm getting this as a likely candidate:

<GeocodedAddress>
   <Address>
     <SiteAddress>
       <CompleteAddressNumber>1820</CompleteAddressNumber>
       <CompleteStreetName>
         <PreType>State Hwy</PreType>
         <StreetName>32</StreetName>
       </CompleteStreetName>
       <PlaceName>WALLKILL</PlaceName>
       <PlaceName_USPS>Plattekill</PlaceName_USPS>
       <StateName>NEW YORK</StateName>
       <ZipCode>12589</ZipCode>
     </SiteAddress>
   </Address>
   <gml:Point><gml:pos>-74.075645 41.612969</gml:pos></gml:Point>
   <GeocodeMatchCode accuracy="0.830191" matchType="INTERPOLATED" note="P"/>
   <source>
     <dataSource>Streets</dataSource>
     <addressIdentifier>41916602</addressIdentifier>
   </source>
</GeocodedAddress>
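
As for pre-processing, the equivalence noted above (state highway == route 
== ny highway) could be applied to the input string before it reaches the 
geocoder. This is only a sketch: the function name normalize_route, the 
list of synonyms, and the choice of "State Highway" as the canonical form 
are my assumptions, not something either geocoder does today, and it 
deliberately leaves "US ROUTE" and "COUNTY ROUTE" alone.

```python
import re

# Hypothetical pre-processing step: rewrite "ROUTE 32", "NY HIGHWAY 32",
# "STATE RTE 32", etc. to one canonical form. The synonym list is an
# assumption and is certainly incomplete.
_ROUTE = re.compile(
    r"(?<!US )(?<!COUNTY )\b(?:(?:STATE|NY)\s+)?(?:ROUTE|RTE|HIGHWAY|HWY)\s+(\d+)\b",
    re.IGNORECASE,
)

def normalize_route(address):
    """Replace a state-route synonym with 'State Highway <number>'."""
    return _ROUTE.sub(lambda m: "State Highway " + m.group(1), address)

print(normalize_route("1820 ROUTE 32, MODENA , NY 12548 "))
```

The rewrite is idempotent, so an address that already says "State Highway 
32" passes through unchanged.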

>>  rating | lon | lat | address | predirabbrev | streetname |
> streettypeabbrev | postdirabbrev | internal | location | stateabbrev |
> zip | parsed
> --------+------------+-----------+---------+--------------+------------+------------------+---------------+----------+----------+-------------+-------+--------
>>  22 | -73.9374945714286 | 40.6108123469388 | 1820 | E | 32nd | St | | |
> New York | NY | 11234 | t
>
>>  which is 85 miles away =)
>
> I think item 1 I fixed already. I forget if I committed my fix for it
> though. I think I did, but I haven’t committed anything for a while since
>
> I’m working on speeding up things, and sadly if things work faster in
> one version of PostgreSQL, they work slower in another and so forth. So
> I’m working on a comfortable balance. Mostly fiddling with index
> selectivity.
>
>
>>  2) ##-## addresses
>
>>  example: '112-31 196 STREET, SAINT ALBANS , NY'

This is a harder problem because there are a great many different house 
number range patterns and most geocoders assume only simple house 
numbers. This problem is compounded by the fact that you need to match 
even if one or more of the components are missing. For example, if the 
street has an address range does the 112 or the 31 component vary over 
the range?

Regina, if you want all the various patterns, I have some other docs 
that describe them. We might also want to load all the Tiger data and 
run a pattern classifier on all the Tiger house numbers. 
We could convert the actual house number into something like:

n - a string of digits
a - a string of letters
p - punctuation
s - space(s)

Then generate a distinct list of the patterns like:

n    - 23
npn  - 123-45
an   - G23
anan - N123W45
etc.
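
A classifier along these lines could be sketched as follows. The function 
names are hypothetical, and treating every non-alphanumeric, non-space 
character as punctuation is my assumption:

```python
import re
from collections import Counter

# One alternative per token class: n = digits, a = letters,
# s = space(s), p = anything else (treated as punctuation).
_TOKEN = re.compile(
    r"(?P<n>[0-9]+)|(?P<a>[A-Za-z]+)|(?P<s>\s+)|(?P<p>[^0-9A-Za-z\s]+)"
)

def house_number_pattern(house_number):
    """Collapse a house number into a pattern string such as 'npn'."""
    # m.lastgroup is the name of the alternative that matched this token.
    return "".join(m.lastgroup for m in _TOKEN.finditer(house_number.strip()))

def pattern_counts(house_numbers):
    """Count how often each distinct pattern occurs."""
    return Counter(house_number_pattern(h) for h in house_numbers)

for sample in ["23", "123-45", "G23", "N123W45"]:
    print(sample, house_number_pattern(sample))
```

Running pattern_counts over every house number in the Tiger tables would 
give the distinct list of patterns and how common each one is.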

I'm not sure whether this helps. It seems the geocoding should prioritize 
from macro to micro, like:
   country, state, city, street, number
   country, postcode, state, city, street, number
with some appropriate rules around relaxing the tests if fields are 
missing or have potential errors in them. So for the house number you could 
match on the pattern first, then on variants if the first test fails: an 
input of 123-45 is "npn"; if that fails, check for "nsn" or "n?n", or 
finally "n", or something like that.
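
The relaxation order for the house number pattern might look something 
like this. It is only a sketch: the function names are hypothetical, and 
reading "?" as "either punctuation or space" is my interpretation:

```python
import re

def relaxed_patterns(pattern):
    """Return the patterns to try, most specific first, without duplicates."""
    steps = [
        pattern,                        # exact pattern, e.g. "npn"
        re.sub(r"p", "s", pattern),     # punctuation entered as a space: "nsn"
        re.sub(r"[ps]", "?", pattern),  # any separator: "n?n"
        "n",                            # last resort: a plain number
    ]
    ordered = []
    for step in steps:
        if step not in ordered:
            ordered.append(step)
    return ordered

def first_matching_pattern(pattern, known_patterns):
    """Return the first known pattern reached while relaxing, or None."""
    for candidate in relaxed_patterns(pattern):
        # "?" stands for either separator class (assumption).
        regex = re.compile(candidate.replace("?", "[ps]"))
        for known in known_patterns:
            if regex.fullmatch(known):
                return known
    return None

print(relaxed_patterns("npn"))
```

Here known_patterns would be the patterns seen on the candidate street 
segments, so a segment using "nsn" can still be reached from an "npn" input.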

-Steve

> rating | lon | lat | address | predirabbrev | streetname |
> streettypeabbrev | postdirabbrev | internal | location | stateabbrev |
> zip | parsed
> --------+------------+-----------+---------+--------------+------------+------------------+---------------+----------+----------+-------------+-------+--------
>>  20 | -73.756229 | 40.693842 | | | 196th | St | | | New York | NY |
> 11412 | t
>
>>  which is only .3 miles away, but note that it just ignored the house
> number.
> This one I have listed as a bug already on my todo –
>
> http://trac.osgeo.org/postgis/ticket/886 (although your above looks like
> a slightly different issue which I may have already fixed)
>
> Questions:
> a. Is there something I can do to pre-process either of these types of
> addresses to help the geocoder?
>>  b. If I know that the zip code is correct, is there a setting I can
> adjust so that the geocoder never looks outside the provided zip code?
>
> http://www.postgis.org/documentation/manual-svn/Geocode.html (Give the
> geometry filter option a try. I haven’t really stress tested it)
>
> I’ve also got on todo to revamp the rating so that you can better
> control the weighting scores, but that won’t happen until I’ve tackled
> the speed
>
> Listed here: http://trac.osgeo.org/postgis/ticket/1111
>
> You can add yourself to the cc of these tickets if you want to be
> notified when they are amended/closed
>
> According to normalize_address.sql, I'm using this version of the
> Geocoder:
>>  7616 2011-07-07 12:41:13Z
>>  If this is the version I 'installed' - ie started with - do I still
> need to run upgrade_geocoder.sh? what about
>
> Yes – latest version is: 7632 2011-07-12 (so you are already behind :) )
>
> Missing_Indexes_Generate_Script()?
>
> I have that now as part of the update script to install missing indexes.
> It runs pretty fast if you have all the key indexes in place already.
>
> So basically runs this command now --
> http://www.postgis.org/documentation/manual-svn/Install_Missing_Indexes.html
>
>
>>  Lastly, a small contribution: I noticed the geocoder was also having
> problems with addresses like '45 3 STREET' and '45 WEST 3
>
>>  STREET', and I found that by adding a suffix to the '3' ('3' -> '3RD')
> gave it a push in the right direction. The regular expression I'm using
> to catch these is:
>
>>  foo=re.match(r'([0-9\-]+ +)([0-9]+)( +[a-zA-Z_]+)', street)
>>  foo2=re.match(r'([0-9\-]+ +)([WESTASOUHNOR]+ )([0-9]+)( +[a-zA-Z_]+)',
> street)
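
For reference, the two expressions above could be wrapped into a small 
pre-processing function like this. The regular expressions are Dan's; the 
function names and the suffix rules (11, 12, 13 take TH) are my own sketch. 
Note that the direction character class also matches other words made 
only of those letters (ROUTE, for example), so it is looser than it looks.

```python
import re

# Dan's two patterns: "<number> <digits> <type>" and
# "<number> <direction> <digits> <type>".
PLAIN = re.compile(r'([0-9\-]+ +)([0-9]+)( +[a-zA-Z_]+)')
WITH_DIR = re.compile(r'([0-9\-]+ +)([WESTASOUHNOR]+ )([0-9]+)( +[a-zA-Z_]+)')

def ordinal(digits):
    """Append ST/ND/RD/TH to a string of digits: '3' -> '3RD'."""
    n = int(digits)
    if 10 <= n % 100 <= 20:
        suffix = "TH"  # covers 11TH, 12TH, 13TH
    else:
        suffix = {1: "ST", 2: "ND", 3: "RD"}.get(n % 10, "TH")
    return digits + suffix

def add_ordinal_suffix(street):
    """Turn '45 3 STREET' into '45 3RD STREET'; leave other input alone."""
    m = WITH_DIR.match(street)
    if m:
        number, direction, name, street_type = m.groups()
        return number + direction + ordinal(name) + street_type + street[m.end():]
    m = PLAIN.match(street)
    if m:
        number, name, street_type = m.groups()
        return number + ordinal(name) + street_type + street[m.end():]
    return street

print(add_ordinal_suffix("45 WEST 3 STREET"))
```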
>
> Thanks – I’ll check that out.
>
> Regina
>
> http://www.postgis.us <http://www.postgis.us/>