[postgis-users] One line solution: select a dirty full address and get a clean address

Shaozhong SHI shishaozhong at gmail.com
Mon Dec 21 13:58:38 PST 2020


Hi, Steve,

If I can find budget or a bigger enough project, your geocoder is certainly
very useful.

Regards,

David

On Mon, 21 Dec 2020 at 21:40, Shaozhong SHI <shishaozhong at gmail.com> wrote:

> Hi, Steve,
>
> Alternatively, one could try to do something like the following to
> generate values to mark each row with a group number, by using something
> like the following:
>
> SELECT t.value
> from regexp_matches('1234567 - 7654321 - some - more - text', '\d+', 'g')
> with ordinality as t(value,idx)
> where t.idx = 2;
>
> Then, the strategy is to do address-matching for each group, in order to
> copy with different cases.
>
> The cases, here, of interest is to, for the first step, determine how many
> elements of regexp_matches are present.
>
> What do you think?
>
> Regards,
>
> David
>
> On Mon, 21 Dec 2020 at 21:32, Shaozhong SHI <shishaozhong at gmail.com>
> wrote:
>
>> Hi, Steve,
>>
>> I do respect your work.
>>
>> I am exploring a much simpler way, due to current circumstance, and
>> curiosity to research.
>>
>> I have been struggling to find a way to locate and access regex
>> elements.  For instance, how to refer to and access the one before the
>> last.  The idea is precisely about your mentioning of matched groups.
>>
>> Is there a way to reverse this index?
>>
>>
>> https://stackoverflow.com/questions/52178844/get-second-match-from-regexp-matches-results
>>
>> Regards,
>>
>> David
>>
>>
>>
>> On Mon, 21 Dec 2020 at 20:52, Stephen Woodbridge <
>> stephenwoodbridge37 at gmail.com> wrote:
>>
>>> David,
>>>
>>> in terms of geocoding, which is how I think about these issues, a street
>>> can have a range of house numbers distributed over it. When you talk
>>> about "101d--120a Some Great Street" as a street you have to decided:
>>>
>>> 1. is this one location or many?
>>> 2. how is it distributed over the street segment?
>>>
>>> On input what does this address mean?
>>>
>>> 1. is it one location? therefore only an exact match should geocode to
>>> it.
>>> 2. if it is an address range, then what are you looking for in the
>>> reference data set. If multiple ranges intersect this input range, then
>>> what is the correct response coordinate?
>>> 3. what is the location that the user is looking for? What is the
>>> physical location of "101d--120a Some Great Street" that someone is
>>> trying to drive to, or deliver a package to? And how is that encoded in
>>> the reference dataset? Because that is what you need to match it to.
>>>
>>> Geocoding is a human language parsing problem: a human has entered an
>>> address and is expecting a response that will get them to the desired
>>> location. This is a hard problem and I have never seen a "simple""
>>> solution that works for a wide range of inputs. I always start with the
>>> 80-20 rule, try to solve 80% of your cases with 20% of the rules or
>>> effort, and then look at the cases that are failing and try to abstract
>>> them so you can add a few more rules to solve 80% of the failures and so
>>> on till you get close enough.
>>>
>>> My address standardizer is only the front-end parser for the address
>>> text to separate it into classified components so you can then make
>>> sense out of the text and do what you want with the components.
>>>
>>> Read some of the links I sent you to understand the problems. This code
>>> has been used with multiple datasets and multiple countries. I
>>> personally have used it to build geocoders based on US Tiger data,
>>> Canada Census data, Australia's master address list, UK data,
>>> Navteq/HERE data for US, Canada, UK, and Germany.
>>>
>>> It requires some work but the results have been really good. You have to
>>> try stuff and see if it works for you or not. Maybe as you suggest there
>>> is a simple solution for the case you present, but is this case 80% of
>>> your addresses fit into or is it the case that 1/2% of your addresses
>>> fit into. Whichever it is, focus on the 80% of the cases and solve that
>>> first, then tackle the harder outliers.
>>>
>>> By the way postgis, has an address standardizer included with it. I
>>> ported that into postgis from another project, my
>>> github.com/woodbri/address_standardizer repository is not the same, I
>>> wrote this to be C++ based, supporting utf8, multiple languages and
>>> multiple countries and to be easier for humans to configure and modify,
>>> none of which exists in the current postgis code. I had hoped to get my
>>> new code integrated into postgis, but I'm retired now and energies are
>>> getting directed elsewhere.
>>>
>>> Hope this helps you to get started.
>>>
>>> -Steve
>>>
>>> On 12/21/2020 3:00 PM, Shaozhong SHI wrote:
>>> > Hi, Steve,
>>> >
>>> > Thanks.
>>> >
>>> > I think that the key matter lies to one thing:
>>> >
>>> > Get SAO and PAO numbers and text separate, so that we can intersect
>>> > SAO and PAO number arrays.  By doing so, we can detect addresses for
>>> > matching addresses to known address points.
>>> >
>>> > This could offer a much simpler alternative.
>>> >
>>> > What do you think?
>>> >
>>> > Regards,
>>> >
>>> > David
>>> >
>>> > On Mon, 21 Dec 2020 at 17:06, Stephen Woodbridge
>>> > <stephenwoodbridge37 at gmail.com <mailto:stephenwoodbridge37 at gmail.com>>
>>>
>>> > wrote:
>>> >
>>> >     On 12/21/2020 11:40 AM, Shaozhong SHI wrote:
>>> >     > Hi, Steve W,
>>> >     >
>>> >     > Many thanks,
>>> >     >
>>> >     > How best to see where two addresses match.
>>> >     >
>>> >     > For instance, one would say:  101d--120a Some Great Street and
>>> >     > 104d-110d Some Great Street match.
>>> >     >
>>> >     > How best to do it?
>>> >     >
>>> >     > Would it be possible to turn both ranges into arrays, and then
>>> >     > intersect the range to check out?
>>> >     >
>>> >     > Please enlighten me.
>>> >     So what I've done with ranges is to put start and end in the
>>> database
>>> >     for the reference dataset, then you have to decide how you want to
>>> >     handle ranges on input. Ranges on input typically imply generating
>>> >     multiple queries to the database. When geocoding you first want to
>>> >     match
>>> >     the country, city, street which will give you a small set of
>>> records,
>>> >     then evaluate each record but ranking them with some rules, or
>>> >     evaluating each, etc to get the best case. It also depends on
>>> >     where you
>>> >     are doing fuzzy matching or not.
>>> >
>>> >     You might want to checkout:
>>> >     https://github.com/woodbri/imaptools.com
>>> >     <https://github.com/woodbri/imaptools.com> - this is all my public
>>> >     code
>>> >     that I developed consulting
>>> >
>>> https://github.com/woodbri/imaptools.com/blob/master/README-geocoder-design.md
>>> >     <
>>> https://github.com/woodbri/imaptools.com/blob/master/README-geocoder-design.md
>>> >
>>> >
>>> https://github.com/woodbri/imaptools.com/blob/master/README-geocoder.md
>>> >     <
>>> https://github.com/woodbri/imaptools.com/blob/master/README-geocoder.md>
>>> >
>>> >     I'm retired now and released all this code under MIT-X license,
>>> >     and you
>>> >     will find my SQL Geocoder buried in there and hopefully, I have
>>> left
>>> >     enough comments for you to figure out how to reuse it. Start by
>>> >     reading
>>> >     the README-geocoder*.md links above.
>>> >
>>> >     -Steve
>>> >
>>> >     >
>>> >     > Regards,
>>> >     >
>>> >     > David
>>> >     >
>>> >     > On Sun, 20 Dec 2020 at 00:30, Stephen Woodbridge
>>> >     > <stephenwoodbridge37 at gmail.com
>>> >     <mailto:stephenwoodbridge37 at gmail.com>
>>> >     <mailto:stephenwoodbridge37 at gmail.com
>>> >     <mailto:stephenwoodbridge37 at gmail.com>>>
>>> >     > wrote:
>>> >     >
>>> >     >     On 12/19/2020 7:24 PM, Stephen Woodbridge wrote:
>>> >     >     > Hi David,
>>> >     >     >
>>> >     >     > Yes and No! you can do that for some simple addresses, but
>>> >     >     trying to
>>> >     >     > clean up addresses in general is much more complicated.
>>> >     I've been
>>> >     >     > doing this for 20+ years writing geocoders. Check out:
>>> >     >     >
>>> >     >     > https://github.com/woodbri/address-standardizer
>>> >     <https://github.com/woodbri/address-standardizer>
>>> >     >     <https://github.com/woodbri/address-standardizer
>>> >     <https://github.com/woodbri/address-standardizer>>
>>> >     >     >
>>> >     >     > This will install as a postgresql extension and once you
>>> >     set it
>>> >     >     up it
>>> >     >     > has an easy function call that can be used to clean things
>>> up.
>>> >     >     > Read this to get a better idea:
>>> >     >     >
>>> >     >     >
>>> >     >
>>> >
>>> https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
>>> >     <
>>> https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
>>> >
>>> >     >
>>> >      <
>>> https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
>>> >     <
>>> https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
>>> >>
>>> >     >
>>> >     >     >
>>> >     >     >
>>> >     >     > And there is a docs/ directory with more documentation. I
>>> also
>>> >     >     provide
>>> >     >     > sample starter files for parsing addresses of about 25
>>> >     >     countries, so
>>> >     >     > you can pick the one(s) you need and then customize them
>>> for
>>> >     >     your needs.
>>> >     >     >
>>> >     >     > If you just want to try working with simple regex you
>>> probably
>>> >     >     want to
>>> >     >     > read:
>>> >     >     >
>>> >     >     > https://www.postgresqltutorial.com/regexp_replace/
>>> >     <https://www.postgresqltutorial.com/regexp_replace/>
>>> >     >     <https://www.postgresqltutorial.com/regexp_replace/
>>> >     <https://www.postgresqltutorial.com/regexp_replace/>>
>>> >     >     >
>>> >     >     > |SELECT REGEXP_REPLACE( REGEXP_REPLACE||(string, |'[,
>>> .]+', '
>>> >     >     > '),       -- convert comma, space, period to a single space
>>> >     >     >                 '^ *|--| *$', '')     -- remove leading and
>>> >     >     trailing
>>> >     >     > white space and '--'
>>> >     >     >        );
>>> >     >
>>> >     >     But this will not handle "101 - 103 -- a" without adding more
>>> >     >     clauses.
>>> >     >     And how do you plan to handle things like abbreviations?
>>> >     Ave, ave,
>>> >     >     avenue, st, street, saint, or lots of other issues. A lot
>>> >     depends on
>>> >     >     your use case, are you geocoding these are just trying to
>>> >     clean up
>>> >     >     some
>>> >     >     text.
>>> >     >     >
>>> >     >     >
>>> >     >     > -Steve W
>>> >     >     >
>>> >     >     > On 12/19/2020 5:40 PM, Shaozhong SHI wrote:
>>> >     >     >> For instance,
>>> >     >     >>
>>> >     >     >> We may encounter,
>>> >     >     >>
>>> >     >     >> My Great Place. 101-103--a, Great Avenue, London, , SW3
>>> 5EE
>>> >     >     >>
>>> >     >     >> Can we do some thing like
>>> >     >     >>
>>> >     >     >> SELECT
>>> >     >     >>     REGEXP_MATCHES(' My Great' Place. 101-103--a, Great
>>> >     Avenue,
>>> >     >     >> London, , SW3 5EE ',
>>> >     >     >>          a_regular_expression,
>>> >     >     >>         'g');
>>> >     >     >>
>>> >     >     >> so that we end up a clean address line with one line of
>>> code?
>>> >     >     >>
>>> >     >     >> My Great Place 101-103a Great Avenue London SW3 5EE
>>> >     >     >>
>>> >     >     >> What regex is that?
>>> >     >     >>
>>> >     >     >> Can we guarantee a single space between words?
>>> >     >     >>
>>> >     >     >> Regards,
>>> >     >     >>
>>> >     >     >> David
>>> >     >     >>
>>> >     >     >> _______________________________________________
>>> >     >     >> postgis-users mailing list
>>> >     >     >> postgis-users at lists.osgeo.org
>>> >     <mailto:postgis-users at lists.osgeo.org>
>>> >     >     <mailto:postgis-users at lists.osgeo.org
>>> >     <mailto:postgis-users at lists.osgeo.org>>
>>> >     >     >> https://lists.osgeo.org/mailman/listinfo/postgis-users
>>> >     <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>>> >     >     <https://lists.osgeo.org/mailman/listinfo/postgis-users
>>> >     <https://lists.osgeo.org/mailman/listinfo/postgis-users>>
>>> >     >     >
>>> >     >
>>> >     >     _______________________________________________
>>> >     >     postgis-users mailing list
>>> >     > postgis-users at lists.osgeo.org
>>> >     <mailto:postgis-users at lists.osgeo.org>
>>> >     <mailto:postgis-users at lists.osgeo.org
>>> >     <mailto:postgis-users at lists.osgeo.org>>
>>> >     > https://lists.osgeo.org/mailman/listinfo/postgis-users
>>> >     <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>>> >     >     <https://lists.osgeo.org/mailman/listinfo/postgis-users
>>> >     <https://lists.osgeo.org/mailman/listinfo/postgis-users>>
>>> >     >
>>> >     >
>>> >     > _______________________________________________
>>> >     > postgis-users mailing list
>>> >     > postgis-users at lists.osgeo.org <mailto:
>>> postgis-users at lists.osgeo.org>
>>> >     > https://lists.osgeo.org/mailman/listinfo/postgis-users
>>> >     <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>>> >
>>> >     _______________________________________________
>>> >     postgis-users mailing list
>>> >     postgis-users at lists.osgeo.org <mailto:
>>> postgis-users at lists.osgeo.org>
>>> >     https://lists.osgeo.org/mailman/listinfo/postgis-users
>>> >     <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>>> >
>>> >
>>> > _______________________________________________
>>> > postgis-users mailing list
>>> > postgis-users at lists.osgeo.org
>>> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>>>
>>> _______________________________________________
>>> postgis-users mailing list
>>> postgis-users at lists.osgeo.org
>>> https://lists.osgeo.org/mailman/listinfo/postgis-users
>>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20201221/730dd819/attachment.html>


More information about the postgis-users mailing list