[postgis-users] One line solution: select a dirty full address and get a clean address

Stephen Woodbridge stephenwoodbridge37 at gmail.com
Mon Dec 21 12:51:46 PST 2020


David,

in terms of geocoding, which is how I think about these issues, a street 
can have a range of house numbers distributed over it. When you talk 
about "101d--120a Some Great Street" as a street you have to decided:

1. is this one location or many?
2. how is it distributed over the street segment?

On input what does this address mean?

1. is it one location? therefore only an exact match should geocode to it.
2. if it is an address range, then what are you looking for in the 
reference data set. If multiple ranges intersect this input range, then 
what is the correct response coordinate?
3. what is the location that the user is looking for? What is the 
physical location of "101d--120a Some Great Street" that someone is 
trying to drive to, or deliver a package to? And how is that encoded in 
the reference dataset? Because that is what you need to match it to.

Geocoding is a human language parsing problem: a human has entered an 
address and is expecting a response that will get them to the desired 
location. This is a hard problem and I have never seen a "simple"" 
solution that works for a wide range of inputs. I always start with the 
80-20 rule, try to solve 80% of your cases with 20% of the rules or 
effort, and then look at the cases that are failing and try to abstract 
them so you can add a few more rules to solve 80% of the failures and so 
on till you get close enough.

My address standardizer is only the front-end parser for the address 
text to separate it into classified components so you can then make 
sense out of the text and do what you want with the components.

Read some of the links I sent you to understand the problems. This code 
has been used with multiple datasets and multiple countries. I 
personally have used it to build geocoders based on US Tiger data, 
Canada Census data, Australia's master address list, UK data, 
Navteq/HERE data for US, Canada, UK, and Germany.

It requires some work but the results have been really good. You have to 
try stuff and see if it works for you or not. Maybe as you suggest there 
is a simple solution for the case you present, but is this case 80% of 
your addresses fit into or is it the case that 1/2% of your addresses 
fit into. Whichever it is, focus on the 80% of the cases and solve that 
first, then tackle the harder outliers.

By the way postgis, has an address standardizer included with it. I 
ported that into postgis from another project, my 
github.com/woodbri/address_standardizer repository is not the same, I 
wrote this to be C++ based, supporting utf8, multiple languages and 
multiple countries and to be easier for humans to configure and modify, 
none of which exists in the current postgis code. I had hoped to get my 
new code integrated into postgis, but I'm retired now and energies are 
getting directed elsewhere.

Hope this helps you to get started.

-Steve

On 12/21/2020 3:00 PM, Shaozhong SHI wrote:
> Hi, Steve,
>
> Thanks.
>
> I think that the key matter lies to one thing:
>
> Get SAO and PAO numbers and text separate, so that we can intersect 
> SAO and PAO number arrays.  By doing so, we can detect addresses for 
> matching addresses to known address points.
>
> This could offer a much simpler alternative.
>
> What do you think?
>
> Regards,
>
> David
>
> On Mon, 21 Dec 2020 at 17:06, Stephen Woodbridge 
> <stephenwoodbridge37 at gmail.com <mailto:stephenwoodbridge37 at gmail.com>> 
> wrote:
>
>     On 12/21/2020 11:40 AM, Shaozhong SHI wrote:
>     > Hi, Steve W,
>     >
>     > Many thanks,
>     >
>     > How best to see where two addresses match.
>     >
>     > For instance, one would say:  101d--120a Some Great Street and
>     > 104d-110d Some Great Street match.
>     >
>     > How best to do it?
>     >
>     > Would it be possible to turn both ranges into arrays, and then
>     > intersect the range to check out?
>     >
>     > Please enlighten me.
>     So what I've done with ranges is to put start and end in the database
>     for the reference dataset, then you have to decide how you want to
>     handle ranges on input. Ranges on input typically imply generating
>     multiple queries to the database. When geocoding you first want to
>     match
>     the country, city, street which will give you a small set of records,
>     then evaluate each record but ranking them with some rules, or
>     evaluating each, etc to get the best case. It also depends on
>     where you
>     are doing fuzzy matching or not.
>
>     You might want to checkout:
>     https://github.com/woodbri/imaptools.com
>     <https://github.com/woodbri/imaptools.com> - this is all my public
>     code
>     that I developed consulting
>     https://github.com/woodbri/imaptools.com/blob/master/README-geocoder-design.md
>     <https://github.com/woodbri/imaptools.com/blob/master/README-geocoder-design.md>
>     https://github.com/woodbri/imaptools.com/blob/master/README-geocoder.md
>     <https://github.com/woodbri/imaptools.com/blob/master/README-geocoder.md>
>
>     I'm retired now and released all this code under MIT-X license,
>     and you
>     will find my SQL Geocoder buried in there and hopefully, I have left
>     enough comments for you to figure out how to reuse it. Start by
>     reading
>     the README-geocoder*.md links above.
>
>     -Steve
>
>     >
>     > Regards,
>     >
>     > David
>     >
>     > On Sun, 20 Dec 2020 at 00:30, Stephen Woodbridge
>     > <stephenwoodbridge37 at gmail.com
>     <mailto:stephenwoodbridge37 at gmail.com>
>     <mailto:stephenwoodbridge37 at gmail.com
>     <mailto:stephenwoodbridge37 at gmail.com>>>
>     > wrote:
>     >
>     >     On 12/19/2020 7:24 PM, Stephen Woodbridge wrote:
>     >     > Hi David,
>     >     >
>     >     > Yes and No! you can do that for some simple addresses, but
>     >     trying to
>     >     > clean up addresses in general is much more complicated.
>     I've been
>     >     > doing this for 20+ years writing geocoders. Check out:
>     >     >
>     >     > https://github.com/woodbri/address-standardizer
>     <https://github.com/woodbri/address-standardizer>
>     >     <https://github.com/woodbri/address-standardizer
>     <https://github.com/woodbri/address-standardizer>>
>     >     >
>     >     > This will install as a postgresql extension and once you
>     set it
>     >     up it
>     >     > has an easy function call that can be used to clean things up.
>     >     > Read this to get a better idea:
>     >     >
>     >     >
>     >
>     https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
>     <https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md>
>     >   
>      <https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
>     <https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md>>
>     >
>     >     >
>     >     >
>     >     > And there is a docs/ directory with more documentation. I also
>     >     provide
>     >     > sample starter files for parsing addresses of about 25
>     >     countries, so
>     >     > you can pick the one(s) you need and then customize them for
>     >     your needs.
>     >     >
>     >     > If you just want to try working with simple regex you probably
>     >     want to
>     >     > read:
>     >     >
>     >     > https://www.postgresqltutorial.com/regexp_replace/
>     <https://www.postgresqltutorial.com/regexp_replace/>
>     >     <https://www.postgresqltutorial.com/regexp_replace/
>     <https://www.postgresqltutorial.com/regexp_replace/>>
>     >     >
>     >     > |SELECT REGEXP_REPLACE( REGEXP_REPLACE||(string, |'[, .]+', '
>     >     > '),       -- convert comma, space, period to a single space
>     >     >                 '^ *|--| *$', '')     -- remove leading and
>     >     trailing
>     >     > white space and '--'
>     >     >        );
>     >
>     >     But this will not handle "101 - 103 -- a" without adding more
>     >     clauses.
>     >     And how do you plan to handle things like abbreviations?
>     Ave, ave,
>     >     avenue, st, street, saint, or lots of other issues. A lot
>     depends on
>     >     your use case, are you geocoding these are just trying to
>     clean up
>     >     some
>     >     text.
>     >     >
>     >     >
>     >     > -Steve W
>     >     >
>     >     > On 12/19/2020 5:40 PM, Shaozhong SHI wrote:
>     >     >> For instance,
>     >     >>
>     >     >> We may encounter,
>     >     >>
>     >     >> My Great Place. 101-103--a, Great Avenue, London, , SW3 5EE
>     >     >>
>     >     >> Can we do some thing like
>     >     >>
>     >     >> SELECT
>     >     >>     REGEXP_MATCHES(' My Great' Place. 101-103--a, Great
>     Avenue,
>     >     >> London, , SW3 5EE ',
>     >     >>          a_regular_expression,
>     >     >>         'g');
>     >     >>
>     >     >> so that we end up a clean address line with one line of code?
>     >     >>
>     >     >> My Great Place 101-103a Great Avenue London SW3 5EE
>     >     >>
>     >     >> What regex is that?
>     >     >>
>     >     >> Can we guarantee a single space between words?
>     >     >>
>     >     >> Regards,
>     >     >>
>     >     >> David
>     >     >>
>     >     >> _______________________________________________
>     >     >> postgis-users mailing list
>     >     >> postgis-users at lists.osgeo.org
>     <mailto:postgis-users at lists.osgeo.org>
>     >     <mailto:postgis-users at lists.osgeo.org
>     <mailto:postgis-users at lists.osgeo.org>>
>     >     >> https://lists.osgeo.org/mailman/listinfo/postgis-users
>     <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>     >     <https://lists.osgeo.org/mailman/listinfo/postgis-users
>     <https://lists.osgeo.org/mailman/listinfo/postgis-users>>
>     >     >
>     >
>     >     _______________________________________________
>     >     postgis-users mailing list
>     > postgis-users at lists.osgeo.org
>     <mailto:postgis-users at lists.osgeo.org>
>     <mailto:postgis-users at lists.osgeo.org
>     <mailto:postgis-users at lists.osgeo.org>>
>     > https://lists.osgeo.org/mailman/listinfo/postgis-users
>     <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>     >     <https://lists.osgeo.org/mailman/listinfo/postgis-users
>     <https://lists.osgeo.org/mailman/listinfo/postgis-users>>
>     >
>     >
>     > _______________________________________________
>     > postgis-users mailing list
>     > postgis-users at lists.osgeo.org <mailto:postgis-users at lists.osgeo.org>
>     > https://lists.osgeo.org/mailman/listinfo/postgis-users
>     <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>
>     _______________________________________________
>     postgis-users mailing list
>     postgis-users at lists.osgeo.org <mailto:postgis-users at lists.osgeo.org>
>     https://lists.osgeo.org/mailman/listinfo/postgis-users
>     <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users



More information about the postgis-users mailing list