[postgis-users] One line solution: select a dirty full address and get a clean address

Shaozhong SHI shishaozhong at gmail.com
Mon Dec 21 13:40:09 PST 2020


Hi, Steve,

Alternatively, one could try to do something like the following to generate
values to mark each row with a group number, by using something like the
following:

SELECT t.value
from regexp_matches('1234567 - 7654321 - some - more - text', '\d+', 'g')
with ordinality as t(value,idx)
where t.idx = 2;

Then, the strategy is to do address-matching for each group, in order to
copy with different cases.

The cases, here, of interest is to, for the first step, determine how many
elements of regexp_matches are present.

What do you think?

Regards,

David

On Mon, 21 Dec 2020 at 21:32, Shaozhong SHI <shishaozhong at gmail.com> wrote:

> Hi, Steve,
>
> I do respect your work.
>
> I am exploring a much simpler way, due to current circumstance, and
> curiosity to research.
>
> I have been struggling to find a way to locate and access regex elements.
> For instance, how to refer to and access the one before the last.  The idea
> is precisely about your mentioning of matched groups.
>
> Is there a way to reverse this index?
>
>
> https://stackoverflow.com/questions/52178844/get-second-match-from-regexp-matches-results
>
> Regards,
>
> David
>
>
>
> On Mon, 21 Dec 2020 at 20:52, Stephen Woodbridge <
> stephenwoodbridge37 at gmail.com> wrote:
>
>> David,
>>
>> in terms of geocoding, which is how I think about these issues, a street
>> can have a range of house numbers distributed over it. When you talk
>> about "101d--120a Some Great Street" as a street you have to decided:
>>
>> 1. is this one location or many?
>> 2. how is it distributed over the street segment?
>>
>> On input what does this address mean?
>>
>> 1. is it one location? therefore only an exact match should geocode to it.
>> 2. if it is an address range, then what are you looking for in the
>> reference data set. If multiple ranges intersect this input range, then
>> what is the correct response coordinate?
>> 3. what is the location that the user is looking for? What is the
>> physical location of "101d--120a Some Great Street" that someone is
>> trying to drive to, or deliver a package to? And how is that encoded in
>> the reference dataset? Because that is what you need to match it to.
>>
>> Geocoding is a human language parsing problem: a human has entered an
>> address and is expecting a response that will get them to the desired
>> location. This is a hard problem and I have never seen a "simple""
>> solution that works for a wide range of inputs. I always start with the
>> 80-20 rule, try to solve 80% of your cases with 20% of the rules or
>> effort, and then look at the cases that are failing and try to abstract
>> them so you can add a few more rules to solve 80% of the failures and so
>> on till you get close enough.
>>
>> My address standardizer is only the front-end parser for the address
>> text to separate it into classified components so you can then make
>> sense out of the text and do what you want with the components.
>>
>> Read some of the links I sent you to understand the problems. This code
>> has been used with multiple datasets and multiple countries. I
>> personally have used it to build geocoders based on US Tiger data,
>> Canada Census data, Australia's master address list, UK data,
>> Navteq/HERE data for US, Canada, UK, and Germany.
>>
>> It requires some work but the results have been really good. You have to
>> try stuff and see if it works for you or not. Maybe as you suggest there
>> is a simple solution for the case you present, but is this case 80% of
>> your addresses fit into or is it the case that 1/2% of your addresses
>> fit into. Whichever it is, focus on the 80% of the cases and solve that
>> first, then tackle the harder outliers.
>>
>> By the way postgis, has an address standardizer included with it. I
>> ported that into postgis from another project, my
>> github.com/woodbri/address_standardizer repository is not the same, I
>> wrote this to be C++ based, supporting utf8, multiple languages and
>> multiple countries and to be easier for humans to configure and modify,
>> none of which exists in the current postgis code. I had hoped to get my
>> new code integrated into postgis, but I'm retired now and energies are
>> getting directed elsewhere.
>>
>> Hope this helps you to get started.
>>
>> -Steve
>>
>> On 12/21/2020 3:00 PM, Shaozhong SHI wrote:
>> > Hi, Steve,
>> >
>> > Thanks.
>> >
>> > I think that the key matter lies to one thing:
>> >
>> > Get SAO and PAO numbers and text separate, so that we can intersect
>> > SAO and PAO number arrays.  By doing so, we can detect addresses for
>> > matching addresses to known address points.
>> >
>> > This could offer a much simpler alternative.
>> >
>> > What do you think?
>> >
>> > Regards,
>> >
>> > David
>> >
>> > On Mon, 21 Dec 2020 at 17:06, Stephen Woodbridge
>> > <stephenwoodbridge37 at gmail.com <mailto:stephenwoodbridge37 at gmail.com>>
>> > wrote:
>> >
>> >     On 12/21/2020 11:40 AM, Shaozhong SHI wrote:
>> >     > Hi, Steve W,
>> >     >
>> >     > Many thanks,
>> >     >
>> >     > How best to see where two addresses match.
>> >     >
>> >     > For instance, one would say:  101d--120a Some Great Street and
>> >     > 104d-110d Some Great Street match.
>> >     >
>> >     > How best to do it?
>> >     >
>> >     > Would it be possible to turn both ranges into arrays, and then
>> >     > intersect the range to check out?
>> >     >
>> >     > Please enlighten me.
>> >     So what I've done with ranges is to put start and end in the
>> database
>> >     for the reference dataset, then you have to decide how you want to
>> >     handle ranges on input. Ranges on input typically imply generating
>> >     multiple queries to the database. When geocoding you first want to
>> >     match
>> >     the country, city, street which will give you a small set of
>> records,
>> >     then evaluate each record but ranking them with some rules, or
>> >     evaluating each, etc to get the best case. It also depends on
>> >     where you
>> >     are doing fuzzy matching or not.
>> >
>> >     You might want to checkout:
>> >     https://github.com/woodbri/imaptools.com
>> >     <https://github.com/woodbri/imaptools.com> - this is all my public
>> >     code
>> >     that I developed consulting
>> >
>> https://github.com/woodbri/imaptools.com/blob/master/README-geocoder-design.md
>> >     <
>> https://github.com/woodbri/imaptools.com/blob/master/README-geocoder-design.md
>> >
>> >
>> https://github.com/woodbri/imaptools.com/blob/master/README-geocoder.md
>> >     <
>> https://github.com/woodbri/imaptools.com/blob/master/README-geocoder.md>
>> >
>> >     I'm retired now and released all this code under MIT-X license,
>> >     and you
>> >     will find my SQL Geocoder buried in there and hopefully, I have left
>> >     enough comments for you to figure out how to reuse it. Start by
>> >     reading
>> >     the README-geocoder*.md links above.
>> >
>> >     -Steve
>> >
>> >     >
>> >     > Regards,
>> >     >
>> >     > David
>> >     >
>> >     > On Sun, 20 Dec 2020 at 00:30, Stephen Woodbridge
>> >     > <stephenwoodbridge37 at gmail.com
>> >     <mailto:stephenwoodbridge37 at gmail.com>
>> >     <mailto:stephenwoodbridge37 at gmail.com
>> >     <mailto:stephenwoodbridge37 at gmail.com>>>
>> >     > wrote:
>> >     >
>> >     >     On 12/19/2020 7:24 PM, Stephen Woodbridge wrote:
>> >     >     > Hi David,
>> >     >     >
>> >     >     > Yes and No! you can do that for some simple addresses, but
>> >     >     trying to
>> >     >     > clean up addresses in general is much more complicated.
>> >     I've been
>> >     >     > doing this for 20+ years writing geocoders. Check out:
>> >     >     >
>> >     >     > https://github.com/woodbri/address-standardizer
>> >     <https://github.com/woodbri/address-standardizer>
>> >     >     <https://github.com/woodbri/address-standardizer
>> >     <https://github.com/woodbri/address-standardizer>>
>> >     >     >
>> >     >     > This will install as a postgresql extension and once you
>> >     set it
>> >     >     up it
>> >     >     > has an easy function call that can be used to clean things
>> up.
>> >     >     > Read this to get a better idea:
>> >     >     >
>> >     >     >
>> >     >
>> >
>> https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
>> >     <
>> https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
>> >
>> >     >
>> >      <
>> https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
>> >     <
>> https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
>> >>
>> >     >
>> >     >     >
>> >     >     >
>> >     >     > And there is a docs/ directory with more documentation. I
>> also
>> >     >     provide
>> >     >     > sample starter files for parsing addresses of about 25
>> >     >     countries, so
>> >     >     > you can pick the one(s) you need and then customize them for
>> >     >     your needs.
>> >     >     >
>> >     >     > If you just want to try working with simple regex you
>> probably
>> >     >     want to
>> >     >     > read:
>> >     >     >
>> >     >     > https://www.postgresqltutorial.com/regexp_replace/
>> >     <https://www.postgresqltutorial.com/regexp_replace/>
>> >     >     <https://www.postgresqltutorial.com/regexp_replace/
>> >     <https://www.postgresqltutorial.com/regexp_replace/>>
>> >     >     >
>> >     >     > |SELECT REGEXP_REPLACE( REGEXP_REPLACE||(string, |'[, .]+',
>> '
>> >     >     > '),       -- convert comma, space, period to a single space
>> >     >     >                 '^ *|--| *$', '')     -- remove leading and
>> >     >     trailing
>> >     >     > white space and '--'
>> >     >     >        );
>> >     >
>> >     >     But this will not handle "101 - 103 -- a" without adding more
>> >     >     clauses.
>> >     >     And how do you plan to handle things like abbreviations?
>> >     Ave, ave,
>> >     >     avenue, st, street, saint, or lots of other issues. A lot
>> >     depends on
>> >     >     your use case, are you geocoding these are just trying to
>> >     clean up
>> >     >     some
>> >     >     text.
>> >     >     >
>> >     >     >
>> >     >     > -Steve W
>> >     >     >
>> >     >     > On 12/19/2020 5:40 PM, Shaozhong SHI wrote:
>> >     >     >> For instance,
>> >     >     >>
>> >     >     >> We may encounter,
>> >     >     >>
>> >     >     >> My Great Place. 101-103--a, Great Avenue, London, , SW3 5EE
>> >     >     >>
>> >     >     >> Can we do some thing like
>> >     >     >>
>> >     >     >> SELECT
>> >     >     >>     REGEXP_MATCHES(' My Great' Place. 101-103--a, Great
>> >     Avenue,
>> >     >     >> London, , SW3 5EE ',
>> >     >     >>          a_regular_expression,
>> >     >     >>         'g');
>> >     >     >>
>> >     >     >> so that we end up a clean address line with one line of
>> code?
>> >     >     >>
>> >     >     >> My Great Place 101-103a Great Avenue London SW3 5EE
>> >     >     >>
>> >     >     >> What regex is that?
>> >     >     >>
>> >     >     >> Can we guarantee a single space between words?
>> >     >     >>
>> >     >     >> Regards,
>> >     >     >>
>> >     >     >> David
>> >     >     >>
>> >     >     >> _______________________________________________
>> >     >     >> postgis-users mailing list
>> >     >     >> postgis-users at lists.osgeo.org
>> >     <mailto:postgis-users at lists.osgeo.org>
>> >     >     <mailto:postgis-users at lists.osgeo.org
>> >     <mailto:postgis-users at lists.osgeo.org>>
>> >     >     >> https://lists.osgeo.org/mailman/listinfo/postgis-users
>> >     <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>> >     >     <https://lists.osgeo.org/mailman/listinfo/postgis-users
>> >     <https://lists.osgeo.org/mailman/listinfo/postgis-users>>
>> >     >     >
>> >     >
>> >     >     _______________________________________________
>> >     >     postgis-users mailing list
>> >     > postgis-users at lists.osgeo.org
>> >     <mailto:postgis-users at lists.osgeo.org>
>> >     <mailto:postgis-users at lists.osgeo.org
>> >     <mailto:postgis-users at lists.osgeo.org>>
>> >     > https://lists.osgeo.org/mailman/listinfo/postgis-users
>> >     <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>> >     >     <https://lists.osgeo.org/mailman/listinfo/postgis-users
>> >     <https://lists.osgeo.org/mailman/listinfo/postgis-users>>
>> >     >
>> >     >
>> >     > _______________________________________________
>> >     > postgis-users mailing list
>> >     > postgis-users at lists.osgeo.org <mailto:
>> postgis-users at lists.osgeo.org>
>> >     > https://lists.osgeo.org/mailman/listinfo/postgis-users
>> >     <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>> >
>> >     _______________________________________________
>> >     postgis-users mailing list
>> >     postgis-users at lists.osgeo.org <mailto:postgis-users at lists.osgeo.org
>> >
>> >     https://lists.osgeo.org/mailman/listinfo/postgis-users
>> >     <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>> >
>> >
>> > _______________________________________________
>> > postgis-users mailing list
>> > postgis-users at lists.osgeo.org
>> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>>
>> _______________________________________________
>> postgis-users mailing list
>> postgis-users at lists.osgeo.org
>> https://lists.osgeo.org/mailman/listinfo/postgis-users
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20201221/c1b296a7/attachment.html>


More information about the postgis-users mailing list