[postgis-users] standardize_address: Rule does not work
Stephen Woodbridge
woodbri at swoodbridge.com
Thu Oct 29 06:47:01 PDT 2015
On 10/29/2015 2:42 AM, Iris Rittinger (Terraplan) wrote:
> Hi Steve,
>
> that sounds great. If you need someone to test it with German Navteq
> Data or with data for each housenumber I could try to do that.
Thank you for the offer, I will keep that in mind. So while the current
code can be used for a lot of lanuages, German is somewhat problematic
as the address standardizer needs a lot of work internally for the likes of:
* adding utf8 support
* enhance it to support separating embedded prefix and postfix types
* probably a few other issues related to internationalization
* possible conversion to C++ to make it more modular and managable
These are things that we want to do, but are major efforts and will
require funding or additional developer time.
I think it would be a very valuable effort to the community to have a
world class geocoder that can support data for multiple countries and
multiple languages. We are working in the direction as we have time but
it is not happening very fast without funding.
Thanks,
-Steve
> Thanks,
> Iris
>
> Am 28.10.2015 um 14:47 schrieb Stephen Woodbridge:
>> Iris,
>>
>> I build a geocoder which will be used as the model for the next
>> generation geocoder in postgis. It is structured like this:
>>
>>
>> Table or view of raw street segments with name, location, address
>> ranges and geometry, etc. I prefer to split records with both right
>> and left address and location information into multiple records one
>> for each side because it greatly simplifies and streamlines processing
>> later in the process.
>>
>> A standard address table that looks like the output of the
>> standard_address with a foreign key pointing to the raw data table
>>
>> We standardize the raw data into standard address table, add metaphone
>> columns for fuzzy search, etc.
>>
>> Now we geocode by running the request through standardize_address and
>> then match that against the standard address table and score the
>> results for the best match, then fetch the geometry for that record
>> and compute the location based on the address or address range.
>>
>> The key step in this process is to run the raw data through the
>> standardize_address() and to match against that. Think of
>> standardize_address as generating a key, we don't really care how the
>> key is generated, as long as the key generated from the reference data
>> matches the key generated by the incoming request. If we use data that
>> is already standardized by a vendor and then try to match and incoming
>> request against that, we run the risk that the vendor does not
>> standardize the same way that the code does and we get a mismatch.
>>
>> I have been able to get about 25-30 ms on average geocoding results on
>> a 80 million record reference set using US Tiger data or Navteq data
>> for US and Canada. And this can be improved upon if the data is
>> presorted by zip, state, city.
>>
>> Anyway, hope this give you some ideas.
>>
>> -Steve
>>
>> On 10/28/2015 7:34 AM, Iris Rittinger (Terraplan) wrote:
>>> Hi Regina, hi Steve,
>>>
>>> thanks a lot for your answers.
>>> When I try the parse_address function I do not get any useful result.
>>> Seperating the macro and micro part in the standardize_address function
>>> does work and the result looks pretty good.
>>>
>>> * The pagc code was not built around parsing UTF8 character strings.
>>> -- Yes, that is a problem. It does not parse german letters
>>>
>>> * One needed enhancement to the the pagc parser is that it currently has
>>> no ability separate inline street types from their street name
>>> -- Yes, that is in deed a problem for our use case
>>>
>>> Would be nice if others are willing to help out on this effort (either
>>> via development or monetary support)
>>> -- I will talk to our developers and my chef if there is a possiblity to
>>> get involved
>>>
>>> and also our upcoming next generation geocoder effort which we plan to
>>> use the address_standardizer with
>>> -- sound great, since that is what I am really looking for. We have a
>>> reference table for geocoding and we would like to pass the incoming
>>> address against that table.
>>>
>>> Thanks a lot,
>>> Iris
>>>
>>>
>>>
>>> Am 27.10.2015 um 23:08 schrieb Paragon Corporation:
>>>> Hi Iris,
>>>>
>>>> There are actually two standard_address functions:
>>>> http://postgis.net/docs/manual-2.2/standardize_address.html
>>>>
>>>>
>>>>
>>>> 1) takes a single line (which is what you are using) and internally
>>>> relies on parse_address which does have a lot of hard-coded
>>>> assumptions about things (particularly US and Canada assumptions)
>>>> stdaddr standardize_address(text lextab, text gaztab, text rultab,
>>>> text address);
>>>>
>>>> To check if you are being bitten by internal hard-codings in
>>>> parse_address, then try the parse_address function and see if what you
>>>> get back makes sense:
>>>>
>>>> http://postgis.net/docs/manual-2.2/parse_address.html
>>>>
>>>>
>>>>
>>>> 2) Is a two line standardizer that takes micro and macro
>>>> (I apologize I neglected to provide an example for the second variant
>>>> since I use the first mostly. I'll add some documentation to that
>>>> effect) and more detail about what is considered micro/ macro
>>>>
>>>> standardize_address(text lextab, text gaztab, text rultab, text micro,
>>>> text macro);
>>>>
>>>> micro is like - house_num street suffix
>>>> macro is like - city, state, zip country
>>>>
>>>>
>>>> Steve and I had discussed redoing the parse_address function to not
>>>> use a hard-coded set of data, but that's another thing on our todo.
>>>> Would be nice if others are willing to help out on this effort (either
>>>> via development or monetary support) and also our upcoming next
>>>> generation geocoder effort which we plan to use the
>>>> address_standardizer with.
>>>>
>>>>
>>>> Thanks,
>>>> Regina
>>>> http://www.postgis.us
>>>> http://postgis.net
>>>>
>>>> -----Original Message-----
>>>> From: postgis-users [mailto:postgis-users-bounces at lists.osgeo.org] On
>>>> Behalf Of Stephen Woodbridge
>>>> Sent: Tuesday, October 27, 2015 10:41 AM
>>>> To: postgis-users at lists.osgeo.org
>>>> Subject: Re: [postgis-users] standardize_address: Rule does not work
>>>>
>>>> Hi Iris,
>>>>
>>>> It has been able 3+ years since I did the initial integration of pagc
>>>> into postgres and I'm afraid I have forgotten a lot in that time. But
>>>> here are my thoughts.
>>>>
>>>> * I do not think the code explicitly hard codes any mapping like str
>>>> => street so I would check that you lex and gaz tables are truly empty
>>>> and that there is not some other table in the search_path that it is
>>>> finding. I think this is likely the issue.
>>>>
>>>> * The pagc code was not built around parsing UTF8 character strings.
>>>> It might be ok, but it might not handle them well. I don't think
>>>> anyone has done any testing of this so any feedback as you work on
>>>> your project would be interesting and please open tickets if you find
>>>> specific issues.
>>>>
>>>> * One needed enhancement to the the pagc parser is that it currently
>>>> has no ability separate inline street types from their street name. IE:
>>>> street types that are appended to the street name without a space.
>>>>
>>>> * I just pushed rules2txt and txt2rules to
>>>> https://sourceforge.net/p/pagc/code/HEAD/tree/branches/sew-refactor/postgresql/
>>>>
>>>>
>>>> You might find these scripts helpful because the convert rules to
>>>> human readable tokens and back to numbers.
>>>>
>>>> Regina did the final integration of standardize_address into postgis
>>>> and she might have some addition thoughts.
>>>>
>>>> -Steve
>>>>
>>>> On 10/27/2015 6:00 AM, Iris Rititnger (Terraplan) wrote:
>>>>> Hi,
>>>>>
>>>>> I am trying to use the standardize_address function to standardize
>>>>> german street names. But I have a problem when the street is
>>>>> recognized as some kind of special token and then followed by a space.
>>>>>
>>>>> When I try
>>>>> select * from standardize_address('lex_d','gaz_d','rules_d','Hein
>>>>> strasse 11')
>>>>> I get the result I wish for: name -- HEIN STRASSE; house_num -- 11
>>>>> When I try select * from
>>>>> standardize_address('lex_d','gaz_d','rules_d','Hein str 11') I get an
>>>>> empty row as result.
>>>>> When I try
>>>>> select * from standardize_address('lex_d','gaz_d','rules_d','Hein
>>>>> str11') I get the result I wish for: name -- HEIN STR; house_num -- 11
>>>>>
>>>>> My lex and gaz tables are empty. My rule table only has two rules
>>>>> 1 1 0 -1 5 5 1 -1 1 1
>>>>> 1 0 -1 5 1 -1 1 1
>>>>> It seems that str is recognized as something different even though my
>>>>> gaz and lex tables are empty.
>>>>>
>>>>> When I add the rule 1 2 0 -1 5 5 1 -1 1 1 and define
>>>>> ;1;"STR";"street";2; in my lex table I still have the same problem
>>>>> select * from standardize_address('lex_d','gaz_d','rules_d','Hein
>>>>> str11') gives me a result of name -- HEIN street and house_num -- 1
>>>>> select * from standardize_address('lex_d','gaz_d','rules_d','Hein str
>>>>> 11') returns an empty row.
>>>>>
>>>>> To double check I installed pagc on my computer. Here I get the
>>>>> desired result when I define the rule 1 2 0 -1 5 5 1 -1 1 1, but this
>>>>> does not seem to work in Postgis.
>>>>>
>>>>> Do you know why it is not working. Can I somehow get postgres to show
>>>>> me the input and output tokens it tries to use like pagc shows me?
>>>>>
>>>>> Thanks,
>>>>> Iris
>>>>>
>>>>>
>>>> _______________________________________________
>>>> postgis-users mailing list
>>>> postgis-users at lists.osgeo.org
>>>> http://lists.osgeo.org/mailman/listinfo/postgis-users
>>>>
>>>>
>>>> _______________________________________________
>>>> postgis-users mailing list
>>>> postgis-users at lists.osgeo.org
>>>> http://lists.osgeo.org/mailman/listinfo/postgis-users
>>>
>>
>> _______________________________________________
>> postgis-users mailing list
>> postgis-users at lists.osgeo.org
>> http://lists.osgeo.org/mailman/listinfo/postgis-users
>
More information about the postgis-users
mailing list