[postgis-users] One line solution: select a dirty full address and get a clean address
Shaozhong SHI
shishaozhong at gmail.com
Mon Dec 21 13:40:09 PST 2020
Hi, Steve,
Alternatively, one could try to do something like the following to generate
values to mark each row with a group number, by using something like the
following:
SELECT t.value
from regexp_matches('1234567 - 7654321 - some - more - text', '\d+', 'g')
with ordinality as t(value,idx)
where t.idx = 2;
Then, the strategy is to do address-matching for each group, in order to
copy with different cases.
The cases, here, of interest is to, for the first step, determine how many
elements of regexp_matches are present.
What do you think?
Regards,
David
On Mon, 21 Dec 2020 at 21:32, Shaozhong SHI <shishaozhong at gmail.com> wrote:
> Hi, Steve,
>
> I do respect your work.
>
> I am exploring a much simpler way, due to current circumstance, and
> curiosity to research.
>
> I have been struggling to find a way to locate and access regex elements.
> For instance, how to refer to and access the one before the last. The idea
> is precisely about your mentioning of matched groups.
>
> Is there a way to reverse this index?
>
>
> https://stackoverflow.com/questions/52178844/get-second-match-from-regexp-matches-results
>
> Regards,
>
> David
>
>
>
> On Mon, 21 Dec 2020 at 20:52, Stephen Woodbridge <
> stephenwoodbridge37 at gmail.com> wrote:
>
>> David,
>>
>> in terms of geocoding, which is how I think about these issues, a street
>> can have a range of house numbers distributed over it. When you talk
>> about "101d--120a Some Great Street" as a street you have to decided:
>>
>> 1. is this one location or many?
>> 2. how is it distributed over the street segment?
>>
>> On input what does this address mean?
>>
>> 1. is it one location? therefore only an exact match should geocode to it.
>> 2. if it is an address range, then what are you looking for in the
>> reference data set. If multiple ranges intersect this input range, then
>> what is the correct response coordinate?
>> 3. what is the location that the user is looking for? What is the
>> physical location of "101d--120a Some Great Street" that someone is
>> trying to drive to, or deliver a package to? And how is that encoded in
>> the reference dataset? Because that is what you need to match it to.
>>
>> Geocoding is a human language parsing problem: a human has entered an
>> address and is expecting a response that will get them to the desired
>> location. This is a hard problem and I have never seen a "simple""
>> solution that works for a wide range of inputs. I always start with the
>> 80-20 rule, try to solve 80% of your cases with 20% of the rules or
>> effort, and then look at the cases that are failing and try to abstract
>> them so you can add a few more rules to solve 80% of the failures and so
>> on till you get close enough.
>>
>> My address standardizer is only the front-end parser for the address
>> text to separate it into classified components so you can then make
>> sense out of the text and do what you want with the components.
>>
>> Read some of the links I sent you to understand the problems. This code
>> has been used with multiple datasets and multiple countries. I
>> personally have used it to build geocoders based on US Tiger data,
>> Canada Census data, Australia's master address list, UK data,
>> Navteq/HERE data for US, Canada, UK, and Germany.
>>
>> It requires some work but the results have been really good. You have to
>> try stuff and see if it works for you or not. Maybe as you suggest there
>> is a simple solution for the case you present, but is this case 80% of
>> your addresses fit into or is it the case that 1/2% of your addresses
>> fit into. Whichever it is, focus on the 80% of the cases and solve that
>> first, then tackle the harder outliers.
>>
>> By the way postgis, has an address standardizer included with it. I
>> ported that into postgis from another project, my
>> github.com/woodbri/address_standardizer repository is not the same, I
>> wrote this to be C++ based, supporting utf8, multiple languages and
>> multiple countries and to be easier for humans to configure and modify,
>> none of which exists in the current postgis code. I had hoped to get my
>> new code integrated into postgis, but I'm retired now and energies are
>> getting directed elsewhere.
>>
>> Hope this helps you to get started.
>>
>> -Steve
>>
>> On 12/21/2020 3:00 PM, Shaozhong SHI wrote:
>> > Hi, Steve,
>> >
>> > Thanks.
>> >
>> > I think that the key matter lies to one thing:
>> >
>> > Get SAO and PAO numbers and text separate, so that we can intersect
>> > SAO and PAO number arrays. By doing so, we can detect addresses for
>> > matching addresses to known address points.
>> >
>> > This could offer a much simpler alternative.
>> >
>> > What do you think?
>> >
>> > Regards,
>> >
>> > David
>> >
>> > On Mon, 21 Dec 2020 at 17:06, Stephen Woodbridge
>> > <stephenwoodbridge37 at gmail.com <mailto:stephenwoodbridge37 at gmail.com>>
>> > wrote:
>> >
>> > On 12/21/2020 11:40 AM, Shaozhong SHI wrote:
>> > > Hi, Steve W,
>> > >
>> > > Many thanks,
>> > >
>> > > How best to see where two addresses match.
>> > >
>> > > For instance, one would say: 101d--120a Some Great Street and
>> > > 104d-110d Some Great Street match.
>> > >
>> > > How best to do it?
>> > >
>> > > Would it be possible to turn both ranges into arrays, and then
>> > > intersect the range to check out?
>> > >
>> > > Please enlighten me.
>> > So what I've done with ranges is to put start and end in the
>> database
>> > for the reference dataset, then you have to decide how you want to
>> > handle ranges on input. Ranges on input typically imply generating
>> > multiple queries to the database. When geocoding you first want to
>> > match
>> > the country, city, street which will give you a small set of
>> records,
>> > then evaluate each record but ranking them with some rules, or
>> > evaluating each, etc to get the best case. It also depends on
>> > where you
>> > are doing fuzzy matching or not.
>> >
>> > You might want to checkout:
>> > https://github.com/woodbri/imaptools.com
>> > <https://github.com/woodbri/imaptools.com> - this is all my public
>> > code
>> > that I developed consulting
>> >
>> https://github.com/woodbri/imaptools.com/blob/master/README-geocoder-design.md
>> > <
>> https://github.com/woodbri/imaptools.com/blob/master/README-geocoder-design.md
>> >
>> >
>> https://github.com/woodbri/imaptools.com/blob/master/README-geocoder.md
>> > <
>> https://github.com/woodbri/imaptools.com/blob/master/README-geocoder.md>
>> >
>> > I'm retired now and released all this code under MIT-X license,
>> > and you
>> > will find my SQL Geocoder buried in there and hopefully, I have left
>> > enough comments for you to figure out how to reuse it. Start by
>> > reading
>> > the README-geocoder*.md links above.
>> >
>> > -Steve
>> >
>> > >
>> > > Regards,
>> > >
>> > > David
>> > >
>> > > On Sun, 20 Dec 2020 at 00:30, Stephen Woodbridge
>> > > <stephenwoodbridge37 at gmail.com
>> > <mailto:stephenwoodbridge37 at gmail.com>
>> > <mailto:stephenwoodbridge37 at gmail.com
>> > <mailto:stephenwoodbridge37 at gmail.com>>>
>> > > wrote:
>> > >
>> > > On 12/19/2020 7:24 PM, Stephen Woodbridge wrote:
>> > > > Hi David,
>> > > >
>> > > > Yes and No! you can do that for some simple addresses, but
>> > > trying to
>> > > > clean up addresses in general is much more complicated.
>> > I've been
>> > > > doing this for 20+ years writing geocoders. Check out:
>> > > >
>> > > > https://github.com/woodbri/address-standardizer
>> > <https://github.com/woodbri/address-standardizer>
>> > > <https://github.com/woodbri/address-standardizer
>> > <https://github.com/woodbri/address-standardizer>>
>> > > >
>> > > > This will install as a postgresql extension and once you
>> > set it
>> > > up it
>> > > > has an easy function call that can be used to clean things
>> up.
>> > > > Read this to get a better idea:
>> > > >
>> > > >
>> > >
>> >
>> https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
>> > <
>> https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
>> >
>> > >
>> > <
>> https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
>> > <
>> https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
>> >>
>> > >
>> > > >
>> > > >
>> > > > And there is a docs/ directory with more documentation. I
>> also
>> > > provide
>> > > > sample starter files for parsing addresses of about 25
>> > > countries, so
>> > > > you can pick the one(s) you need and then customize them for
>> > > your needs.
>> > > >
>> > > > If you just want to try working with simple regex you
>> probably
>> > > want to
>> > > > read:
>> > > >
>> > > > https://www.postgresqltutorial.com/regexp_replace/
>> > <https://www.postgresqltutorial.com/regexp_replace/>
>> > > <https://www.postgresqltutorial.com/regexp_replace/
>> > <https://www.postgresqltutorial.com/regexp_replace/>>
>> > > >
>> > > > |SELECT REGEXP_REPLACE( REGEXP_REPLACE||(string, |'[, .]+',
>> '
>> > > > '), -- convert comma, space, period to a single space
>> > > > '^ *|--| *$', '') -- remove leading and
>> > > trailing
>> > > > white space and '--'
>> > > > );
>> > >
>> > > But this will not handle "101 - 103 -- a" without adding more
>> > > clauses.
>> > > And how do you plan to handle things like abbreviations?
>> > Ave, ave,
>> > > avenue, st, street, saint, or lots of other issues. A lot
>> > depends on
>> > > your use case, are you geocoding these are just trying to
>> > clean up
>> > > some
>> > > text.
>> > > >
>> > > >
>> > > > -Steve W
>> > > >
>> > > > On 12/19/2020 5:40 PM, Shaozhong SHI wrote:
>> > > >> For instance,
>> > > >>
>> > > >> We may encounter,
>> > > >>
>> > > >> My Great Place. 101-103--a, Great Avenue, London, , SW3 5EE
>> > > >>
>> > > >> Can we do some thing like
>> > > >>
>> > > >> SELECT
>> > > >> REGEXP_MATCHES(' My Great' Place. 101-103--a, Great
>> > Avenue,
>> > > >> London, , SW3 5EE ',
>> > > >> a_regular_expression,
>> > > >> 'g');
>> > > >>
>> > > >> so that we end up a clean address line with one line of
>> code?
>> > > >>
>> > > >> My Great Place 101-103a Great Avenue London SW3 5EE
>> > > >>
>> > > >> What regex is that?
>> > > >>
>> > > >> Can we guarantee a single space between words?
>> > > >>
>> > > >> Regards,
>> > > >>
>> > > >> David
>> > > >>
>> > > >> _______________________________________________
>> > > >> postgis-users mailing list
>> > > >> postgis-users at lists.osgeo.org
>> > <mailto:postgis-users at lists.osgeo.org>
>> > > <mailto:postgis-users at lists.osgeo.org
>> > <mailto:postgis-users at lists.osgeo.org>>
>> > > >> https://lists.osgeo.org/mailman/listinfo/postgis-users
>> > <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>> > > <https://lists.osgeo.org/mailman/listinfo/postgis-users
>> > <https://lists.osgeo.org/mailman/listinfo/postgis-users>>
>> > > >
>> > >
>> > > _______________________________________________
>> > > postgis-users mailing list
>> > > postgis-users at lists.osgeo.org
>> > <mailto:postgis-users at lists.osgeo.org>
>> > <mailto:postgis-users at lists.osgeo.org
>> > <mailto:postgis-users at lists.osgeo.org>>
>> > > https://lists.osgeo.org/mailman/listinfo/postgis-users
>> > <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>> > > <https://lists.osgeo.org/mailman/listinfo/postgis-users
>> > <https://lists.osgeo.org/mailman/listinfo/postgis-users>>
>> > >
>> > >
>> > > _______________________________________________
>> > > postgis-users mailing list
>> > > postgis-users at lists.osgeo.org <mailto:
>> postgis-users at lists.osgeo.org>
>> > > https://lists.osgeo.org/mailman/listinfo/postgis-users
>> > <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>> >
>> > _______________________________________________
>> > postgis-users mailing list
>> > postgis-users at lists.osgeo.org <mailto:postgis-users at lists.osgeo.org
>> >
>> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>> > <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>> >
>> >
>> > _______________________________________________
>> > postgis-users mailing list
>> > postgis-users at lists.osgeo.org
>> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>>
>> _______________________________________________
>> postgis-users mailing list
>> postgis-users at lists.osgeo.org
>> https://lists.osgeo.org/mailman/listinfo/postgis-users
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20201221/c1b296a7/attachment.html>
More information about the postgis-users
mailing list