[postgis-users] One line solution: select a dirty full address and get a clean address

Stephen Woodbridge stephenwoodbridge37 at gmail.com
Mon Dec 21 09:06:10 PST 2020


On 12/21/2020 11:40 AM, Shaozhong SHI wrote:
> Hi, Steve W,
>
> Many thanks,
>
> How best to see where two addresses match.
>
> For instance, one would say:  101d--120a Some Great Street and 
> 104d-110d Some Great Street match.
>
> How best to do it?
>
> Would it be possible to turn both ranges into arrays, and then 
> intersect the range to check out?
>
> Please enlighten me.
So what I've done with ranges is to put start and end in the database 
for the reference dataset, then you have to decide how you want to 
handle ranges on input. Ranges on input typically imply generating 
multiple queries to the database. When geocoding you first want to match 
the country, city, street which will give you a small set of records, 
then evaluate each record but ranking them with some rules, or 
evaluating each, etc to get the best case. It also depends on where you 
are doing fuzzy matching or not.

You might want to checkout:
https://github.com/woodbri/imaptools.com  - this is all my public code 
that I developed consulting
https://github.com/woodbri/imaptools.com/blob/master/README-geocoder-design.md
https://github.com/woodbri/imaptools.com/blob/master/README-geocoder.md

I'm retired now and released all this code under MIT-X license, and you 
will find my SQL Geocoder buried in there and hopefully, I have left 
enough comments for you to figure out how to reuse it. Start by reading 
the README-geocoder*.md links above.

-Steve

>
> Regards,
>
> David
>
> On Sun, 20 Dec 2020 at 00:30, Stephen Woodbridge 
> <stephenwoodbridge37 at gmail.com <mailto:stephenwoodbridge37 at gmail.com>> 
> wrote:
>
>     On 12/19/2020 7:24 PM, Stephen Woodbridge wrote:
>     > Hi David,
>     >
>     > Yes and No! you can do that for some simple addresses, but
>     trying to
>     > clean up addresses in general is much more complicated. I've been
>     > doing this for 20+ years writing geocoders. Check out:
>     >
>     > https://github.com/woodbri/address-standardizer
>     <https://github.com/woodbri/address-standardizer>
>     >
>     > This will install as a postgresql extension and once you set it
>     up it
>     > has an easy function call that can be used to clean things up.
>     > Read this to get a better idea:
>     >
>     >
>     https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
>     <https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md>
>
>     >
>     >
>     > And there is a docs/ directory with more documentation. I also
>     provide
>     > sample starter files for parsing addresses of about 25
>     countries, so
>     > you can pick the one(s) you need and then customize them for
>     your needs.
>     >
>     > If you just want to try working with simple regex you probably
>     want to
>     > read:
>     >
>     > https://www.postgresqltutorial.com/regexp_replace/
>     <https://www.postgresqltutorial.com/regexp_replace/>
>     >
>     > |SELECT REGEXP_REPLACE( REGEXP_REPLACE||(string, |'[, .]+', '
>     > '),       -- convert comma, space, period to a single space
>     >                 '^ *|--| *$', '')     -- remove leading and
>     trailing
>     > white space and '--'
>     >        );
>
>     But this will not handle "101 - 103 -- a" without adding more
>     clauses.
>     And how do you plan to handle things like abbreviations? Ave, ave,
>     avenue, st, street, saint, or lots of other issues. A lot depends on
>     your use case, are you geocoding these are just trying to clean up
>     some
>     text.
>     >
>     >
>     > -Steve W
>     >
>     > On 12/19/2020 5:40 PM, Shaozhong SHI wrote:
>     >> For instance,
>     >>
>     >> We may encounter,
>     >>
>     >> My Great Place. 101-103--a, Great Avenue, London, , SW3 5EE
>     >>
>     >> Can we do some thing like
>     >>
>     >> SELECT
>     >>     REGEXP_MATCHES(' My Great' Place. 101-103--a, Great Avenue,
>     >> London, , SW3 5EE ',
>     >>          a_regular_expression,
>     >>         'g');
>     >>
>     >> so that we end up a clean address line with one line of code?
>     >>
>     >> My Great Place 101-103a Great Avenue London SW3 5EE
>     >>
>     >> What regex is that?
>     >>
>     >> Can we guarantee a single space between words?
>     >>
>     >> Regards,
>     >>
>     >> David
>     >>
>     >> _______________________________________________
>     >> postgis-users mailing list
>     >> postgis-users at lists.osgeo.org
>     <mailto:postgis-users at lists.osgeo.org>
>     >> https://lists.osgeo.org/mailman/listinfo/postgis-users
>     <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>     >
>
>     _______________________________________________
>     postgis-users mailing list
>     postgis-users at lists.osgeo.org <mailto:postgis-users at lists.osgeo.org>
>     https://lists.osgeo.org/mailman/listinfo/postgis-users
>     <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users



More information about the postgis-users mailing list