[postgis-users] One line solution: select a dirty full address and get a clean address

Shaozhong SHI shishaozhong at gmail.com
Mon Dec 21 13:32:47 PST 2020


Hi, Steve,

I do respect your work.

I am exploring a much simpler way, due to current circumstance, and
curiosity to research.

I have been struggling to find a way to locate and access regex elements.
For instance, how to refer to and access the one before the last.  The idea
is precisely about your mentioning of matched groups.

Is there a way to reverse this index?

https://stackoverflow.com/questions/52178844/get-second-match-from-regexp-matches-results

Regards,

David



On Mon, 21 Dec 2020 at 20:52, Stephen Woodbridge <
stephenwoodbridge37 at gmail.com> wrote:

> David,
>
> in terms of geocoding, which is how I think about these issues, a street
> can have a range of house numbers distributed over it. When you talk
> about "101d--120a Some Great Street" as a street you have to decided:
>
> 1. is this one location or many?
> 2. how is it distributed over the street segment?
>
> On input what does this address mean?
>
> 1. is it one location? therefore only an exact match should geocode to it.
> 2. if it is an address range, then what are you looking for in the
> reference data set. If multiple ranges intersect this input range, then
> what is the correct response coordinate?
> 3. what is the location that the user is looking for? What is the
> physical location of "101d--120a Some Great Street" that someone is
> trying to drive to, or deliver a package to? And how is that encoded in
> the reference dataset? Because that is what you need to match it to.
>
> Geocoding is a human language parsing problem: a human has entered an
> address and is expecting a response that will get them to the desired
> location. This is a hard problem and I have never seen a "simple""
> solution that works for a wide range of inputs. I always start with the
> 80-20 rule, try to solve 80% of your cases with 20% of the rules or
> effort, and then look at the cases that are failing and try to abstract
> them so you can add a few more rules to solve 80% of the failures and so
> on till you get close enough.
>
> My address standardizer is only the front-end parser for the address
> text to separate it into classified components so you can then make
> sense out of the text and do what you want with the components.
>
> Read some of the links I sent you to understand the problems. This code
> has been used with multiple datasets and multiple countries. I
> personally have used it to build geocoders based on US Tiger data,
> Canada Census data, Australia's master address list, UK data,
> Navteq/HERE data for US, Canada, UK, and Germany.
>
> It requires some work but the results have been really good. You have to
> try stuff and see if it works for you or not. Maybe as you suggest there
> is a simple solution for the case you present, but is this case 80% of
> your addresses fit into or is it the case that 1/2% of your addresses
> fit into. Whichever it is, focus on the 80% of the cases and solve that
> first, then tackle the harder outliers.
>
> By the way postgis, has an address standardizer included with it. I
> ported that into postgis from another project, my
> github.com/woodbri/address_standardizer repository is not the same, I
> wrote this to be C++ based, supporting utf8, multiple languages and
> multiple countries and to be easier for humans to configure and modify,
> none of which exists in the current postgis code. I had hoped to get my
> new code integrated into postgis, but I'm retired now and energies are
> getting directed elsewhere.
>
> Hope this helps you to get started.
>
> -Steve
>
> On 12/21/2020 3:00 PM, Shaozhong SHI wrote:
> > Hi, Steve,
> >
> > Thanks.
> >
> > I think that the key matter lies to one thing:
> >
> > Get SAO and PAO numbers and text separate, so that we can intersect
> > SAO and PAO number arrays.  By doing so, we can detect addresses for
> > matching addresses to known address points.
> >
> > This could offer a much simpler alternative.
> >
> > What do you think?
> >
> > Regards,
> >
> > David
> >
> > On Mon, 21 Dec 2020 at 17:06, Stephen Woodbridge
> > <stephenwoodbridge37 at gmail.com <mailto:stephenwoodbridge37 at gmail.com>>
> > wrote:
> >
> >     On 12/21/2020 11:40 AM, Shaozhong SHI wrote:
> >     > Hi, Steve W,
> >     >
> >     > Many thanks,
> >     >
> >     > How best to see where two addresses match.
> >     >
> >     > For instance, one would say:  101d--120a Some Great Street and
> >     > 104d-110d Some Great Street match.
> >     >
> >     > How best to do it?
> >     >
> >     > Would it be possible to turn both ranges into arrays, and then
> >     > intersect the range to check out?
> >     >
> >     > Please enlighten me.
> >     So what I've done with ranges is to put start and end in the database
> >     for the reference dataset, then you have to decide how you want to
> >     handle ranges on input. Ranges on input typically imply generating
> >     multiple queries to the database. When geocoding you first want to
> >     match
> >     the country, city, street which will give you a small set of records,
> >     then evaluate each record but ranking them with some rules, or
> >     evaluating each, etc to get the best case. It also depends on
> >     where you
> >     are doing fuzzy matching or not.
> >
> >     You might want to checkout:
> >     https://github.com/woodbri/imaptools.com
> >     <https://github.com/woodbri/imaptools.com> - this is all my public
> >     code
> >     that I developed consulting
> >
> https://github.com/woodbri/imaptools.com/blob/master/README-geocoder-design.md
> >     <
> https://github.com/woodbri/imaptools.com/blob/master/README-geocoder-design.md
> >
> >
> https://github.com/woodbri/imaptools.com/blob/master/README-geocoder.md
> >     <
> https://github.com/woodbri/imaptools.com/blob/master/README-geocoder.md>
> >
> >     I'm retired now and released all this code under MIT-X license,
> >     and you
> >     will find my SQL Geocoder buried in there and hopefully, I have left
> >     enough comments for you to figure out how to reuse it. Start by
> >     reading
> >     the README-geocoder*.md links above.
> >
> >     -Steve
> >
> >     >
> >     > Regards,
> >     >
> >     > David
> >     >
> >     > On Sun, 20 Dec 2020 at 00:30, Stephen Woodbridge
> >     > <stephenwoodbridge37 at gmail.com
> >     <mailto:stephenwoodbridge37 at gmail.com>
> >     <mailto:stephenwoodbridge37 at gmail.com
> >     <mailto:stephenwoodbridge37 at gmail.com>>>
> >     > wrote:
> >     >
> >     >     On 12/19/2020 7:24 PM, Stephen Woodbridge wrote:
> >     >     > Hi David,
> >     >     >
> >     >     > Yes and No! you can do that for some simple addresses, but
> >     >     trying to
> >     >     > clean up addresses in general is much more complicated.
> >     I've been
> >     >     > doing this for 20+ years writing geocoders. Check out:
> >     >     >
> >     >     > https://github.com/woodbri/address-standardizer
> >     <https://github.com/woodbri/address-standardizer>
> >     >     <https://github.com/woodbri/address-standardizer
> >     <https://github.com/woodbri/address-standardizer>>
> >     >     >
> >     >     > This will install as a postgresql extension and once you
> >     set it
> >     >     up it
> >     >     > has an easy function call that can be used to clean things
> up.
> >     >     > Read this to get a better idea:
> >     >     >
> >     >     >
> >     >
> >
> https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
> >     <
> https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
> >
> >     >
> >      <
> https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
> >     <
> https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
> >>
> >     >
> >     >     >
> >     >     >
> >     >     > And there is a docs/ directory with more documentation. I
> also
> >     >     provide
> >     >     > sample starter files for parsing addresses of about 25
> >     >     countries, so
> >     >     > you can pick the one(s) you need and then customize them for
> >     >     your needs.
> >     >     >
> >     >     > If you just want to try working with simple regex you
> probably
> >     >     want to
> >     >     > read:
> >     >     >
> >     >     > https://www.postgresqltutorial.com/regexp_replace/
> >     <https://www.postgresqltutorial.com/regexp_replace/>
> >     >     <https://www.postgresqltutorial.com/regexp_replace/
> >     <https://www.postgresqltutorial.com/regexp_replace/>>
> >     >     >
> >     >     > |SELECT REGEXP_REPLACE( REGEXP_REPLACE||(string, |'[, .]+', '
> >     >     > '),       -- convert comma, space, period to a single space
> >     >     >                 '^ *|--| *$', '')     -- remove leading and
> >     >     trailing
> >     >     > white space and '--'
> >     >     >        );
> >     >
> >     >     But this will not handle "101 - 103 -- a" without adding more
> >     >     clauses.
> >     >     And how do you plan to handle things like abbreviations?
> >     Ave, ave,
> >     >     avenue, st, street, saint, or lots of other issues. A lot
> >     depends on
> >     >     your use case, are you geocoding these are just trying to
> >     clean up
> >     >     some
> >     >     text.
> >     >     >
> >     >     >
> >     >     > -Steve W
> >     >     >
> >     >     > On 12/19/2020 5:40 PM, Shaozhong SHI wrote:
> >     >     >> For instance,
> >     >     >>
> >     >     >> We may encounter,
> >     >     >>
> >     >     >> My Great Place. 101-103--a, Great Avenue, London, , SW3 5EE
> >     >     >>
> >     >     >> Can we do some thing like
> >     >     >>
> >     >     >> SELECT
> >     >     >>     REGEXP_MATCHES(' My Great' Place. 101-103--a, Great
> >     Avenue,
> >     >     >> London, , SW3 5EE ',
> >     >     >>          a_regular_expression,
> >     >     >>         'g');
> >     >     >>
> >     >     >> so that we end up a clean address line with one line of
> code?
> >     >     >>
> >     >     >> My Great Place 101-103a Great Avenue London SW3 5EE
> >     >     >>
> >     >     >> What regex is that?
> >     >     >>
> >     >     >> Can we guarantee a single space between words?
> >     >     >>
> >     >     >> Regards,
> >     >     >>
> >     >     >> David
> >     >     >>
> >     >     >> _______________________________________________
> >     >     >> postgis-users mailing list
> >     >     >> postgis-users at lists.osgeo.org
> >     <mailto:postgis-users at lists.osgeo.org>
> >     >     <mailto:postgis-users at lists.osgeo.org
> >     <mailto:postgis-users at lists.osgeo.org>>
> >     >     >> https://lists.osgeo.org/mailman/listinfo/postgis-users
> >     <https://lists.osgeo.org/mailman/listinfo/postgis-users>
> >     >     <https://lists.osgeo.org/mailman/listinfo/postgis-users
> >     <https://lists.osgeo.org/mailman/listinfo/postgis-users>>
> >     >     >
> >     >
> >     >     _______________________________________________
> >     >     postgis-users mailing list
> >     > postgis-users at lists.osgeo.org
> >     <mailto:postgis-users at lists.osgeo.org>
> >     <mailto:postgis-users at lists.osgeo.org
> >     <mailto:postgis-users at lists.osgeo.org>>
> >     > https://lists.osgeo.org/mailman/listinfo/postgis-users
> >     <https://lists.osgeo.org/mailman/listinfo/postgis-users>
> >     >     <https://lists.osgeo.org/mailman/listinfo/postgis-users
> >     <https://lists.osgeo.org/mailman/listinfo/postgis-users>>
> >     >
> >     >
> >     > _______________________________________________
> >     > postgis-users mailing list
> >     > postgis-users at lists.osgeo.org <mailto:
> postgis-users at lists.osgeo.org>
> >     > https://lists.osgeo.org/mailman/listinfo/postgis-users
> >     <https://lists.osgeo.org/mailman/listinfo/postgis-users>
> >
> >     _______________________________________________
> >     postgis-users mailing list
> >     postgis-users at lists.osgeo.org <mailto:postgis-users at lists.osgeo.org>
> >     https://lists.osgeo.org/mailman/listinfo/postgis-users
> >     <https://lists.osgeo.org/mailman/listinfo/postgis-users>
> >
> >
> > _______________________________________________
> > postgis-users mailing list
> > postgis-users at lists.osgeo.org
> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20201221/e6d46407/attachment.html>


More information about the postgis-users mailing list