[postgis-users] One line solution: select a dirty full address and get a clean address
Stephen Woodbridge
stephenwoodbridge37 at gmail.com
Mon Dec 21 14:40:16 PST 2020
I would start with the address standardizer and try to compile and
install it. I'm pretty sure it compiled and install in PostgreSQL up
through 11, but more recent ones may need a little tweaking. Anyway, it
handles tokenizing the address string into tokens. It works internally
by building and optimizing HUGE regular expressions for matching tokens
to the lexicon and gazetteer. Part of what you are trying to do is to
tokenize the string. The standardizer, first tokenizes the string, then
classifies and standardizes the tokens based on your lexicon and
gazetteer. It is important to standardize the reference records the same
way you standardize the input requests so you are comparing apples to
apples and not to oranges. This is true even if you reference data is
already tokenized. If I tokenize/standardize one way and you
tokenize/standardize another way then they will not always match even if
they match sometimes. You can think of this process like creating a hash
key, if you generate the same key for both the input record and the
reference record then you have a match.
-Steve
On 12/21/2020 4:58 PM, Shaozhong SHI wrote:
> Hi, Steve,
>
> If I can find budget or a bigger enough project, your geocoder is
> certainly very useful.
>
> Regards,
>
> David
>
> On Mon, 21 Dec 2020 at 21:40, Shaozhong SHI <shishaozhong at gmail.com
> <mailto:shishaozhong at gmail.com>> wrote:
>
> Hi, Steve,
>
> Alternatively, one could try to do something like the following to
> generate values to mark each row with a group number, by using
> something like the following:
>
> SELECT t.value
> from regexp_matches('1234567 - 7654321 - some - more - text',
> '\d+', 'g') with ordinality as t(value,idx)
> where t.idx = 2;
>
> Then, the strategy is to do address-matching for each group, in
> order to copy with different cases.
>
> The cases, here, of interest is to, for the first step, determine
> how many elements of regexp_matches are present.
>
> What do you think?
>
> Regards,
>
> David
>
> On Mon, 21 Dec 2020 at 21:32, Shaozhong SHI
> <shishaozhong at gmail.com <mailto:shishaozhong at gmail.com>> wrote:
>
> Hi, Steve,
>
> I do respect your work.
>
> I am exploring a much simpler way, due to current
> circumstance, and curiosity to research.
>
> I have been struggling to find a way to locate and access
> regex elements. For instance, how to refer to and access the
> one before the last. The idea is precisely about your
> mentioning of matched groups.
>
> Is there a way to reverse this index?
>
> https://stackoverflow.com/questions/52178844/get-second-match-from-regexp-matches-results
> <https://stackoverflow.com/questions/52178844/get-second-match-from-regexp-matches-results>
>
> Regards,
>
> David
>
>
>
> On Mon, 21 Dec 2020 at 20:52, Stephen Woodbridge
> <stephenwoodbridge37 at gmail.com
> <mailto:stephenwoodbridge37 at gmail.com>> wrote:
>
> David,
>
> in terms of geocoding, which is how I think about these
> issues, a street
> can have a range of house numbers distributed over it.
> When you talk
> about "101d--120a Some Great Street" as a street you have
> to decided:
>
> 1. is this one location or many?
> 2. how is it distributed over the street segment?
>
> On input what does this address mean?
>
> 1. is it one location? therefore only an exact match
> should geocode to it.
> 2. if it is an address range, then what are you looking
> for in the
> reference data set. If multiple ranges intersect this
> input range, then
> what is the correct response coordinate?
> 3. what is the location that the user is looking for? What
> is the
> physical location of "101d--120a Some Great Street" that
> someone is
> trying to drive to, or deliver a package to? And how is
> that encoded in
> the reference dataset? Because that is what you need to
> match it to.
>
> Geocoding is a human language parsing problem: a human has
> entered an
> address and is expecting a response that will get them to
> the desired
> location. This is a hard problem and I have never seen a
> "simple""
> solution that works for a wide range of inputs. I always
> start with the
> 80-20 rule, try to solve 80% of your cases with 20% of the
> rules or
> effort, and then look at the cases that are failing and
> try to abstract
> them so you can add a few more rules to solve 80% of the
> failures and so
> on till you get close enough.
>
> My address standardizer is only the front-end parser for
> the address
> text to separate it into classified components so you can
> then make
> sense out of the text and do what you want with the
> components.
>
> Read some of the links I sent you to understand the
> problems. This code
> has been used with multiple datasets and multiple
> countries. I
> personally have used it to build geocoders based on US
> Tiger data,
> Canada Census data, Australia's master address list, UK data,
> Navteq/HERE data for US, Canada, UK, and Germany.
>
> It requires some work but the results have been really
> good. You have to
> try stuff and see if it works for you or not. Maybe as you
> suggest there
> is a simple solution for the case you present, but is this
> case 80% of
> your addresses fit into or is it the case that 1/2% of
> your addresses
> fit into. Whichever it is, focus on the 80% of the cases
> and solve that
> first, then tackle the harder outliers.
>
> By the way postgis, has an address standardizer included
> with it. I
> ported that into postgis from another project, my
> github.com/woodbri/address_standardizer
> <http://github.com/woodbri/address_standardizer>
> repository is not the same, I
> wrote this to be C++ based, supporting utf8, multiple
> languages and
> multiple countries and to be easier for humans to
> configure and modify,
> none of which exists in the current postgis code. I had
> hoped to get my
> new code integrated into postgis, but I'm retired now and
> energies are
> getting directed elsewhere.
>
> Hope this helps you to get started.
>
> -Steve
>
> On 12/21/2020 3:00 PM, Shaozhong SHI wrote:
> > Hi, Steve,
> >
> > Thanks.
> >
> > I think that the key matter lies to one thing:
> >
> > Get SAO and PAO numbers and text separate, so that we
> can intersect
> > SAO and PAO number arrays. By doing so, we can detect
> addresses for
> > matching addresses to known address points.
> >
> > This could offer a much simpler alternative.
> >
> > What do you think?
> >
> > Regards,
> >
> > David
> >
> > On Mon, 21 Dec 2020 at 17:06, Stephen Woodbridge
> > <stephenwoodbridge37 at gmail.com
> <mailto:stephenwoodbridge37 at gmail.com>
> <mailto:stephenwoodbridge37 at gmail.com
> <mailto:stephenwoodbridge37 at gmail.com>>>
> > wrote:
> >
> > On 12/21/2020 11:40 AM, Shaozhong SHI wrote:
> > > Hi, Steve W,
> > >
> > > Many thanks,
> > >
> > > How best to see where two addresses match.
> > >
> > > For instance, one would say: 101d--120a Some
> Great Street and
> > > 104d-110d Some Great Street match.
> > >
> > > How best to do it?
> > >
> > > Would it be possible to turn both ranges into
> arrays, and then
> > > intersect the range to check out?
> > >
> > > Please enlighten me.
> > So what I've done with ranges is to put start and
> end in the database
> > for the reference dataset, then you have to decide
> how you want to
> > handle ranges on input. Ranges on input typically
> imply generating
> > multiple queries to the database. When geocoding you
> first want to
> > match
> > the country, city, street which will give you a
> small set of records,
> > then evaluate each record but ranking them with some
> rules, or
> > evaluating each, etc to get the best case. It also
> depends on
> > where you
> > are doing fuzzy matching or not.
> >
> > You might want to checkout:
> > https://github.com/woodbri/imaptools.com
> <https://github.com/woodbri/imaptools.com>
> > <https://github.com/woodbri/imaptools.com
> <https://github.com/woodbri/imaptools.com>> - this is all
> my public
> > code
> > that I developed consulting
> >
> https://github.com/woodbri/imaptools.com/blob/master/README-geocoder-design.md
> <https://github.com/woodbri/imaptools.com/blob/master/README-geocoder-design.md>
> >
> <https://github.com/woodbri/imaptools.com/blob/master/README-geocoder-design.md
> <https://github.com/woodbri/imaptools.com/blob/master/README-geocoder-design.md>>
> >
> https://github.com/woodbri/imaptools.com/blob/master/README-geocoder.md
> <https://github.com/woodbri/imaptools.com/blob/master/README-geocoder.md>
> >
> <https://github.com/woodbri/imaptools.com/blob/master/README-geocoder.md
> <https://github.com/woodbri/imaptools.com/blob/master/README-geocoder.md>>
> >
> > I'm retired now and released all this code under
> MIT-X license,
> > and you
> > will find my SQL Geocoder buried in there and
> hopefully, I have left
> > enough comments for you to figure out how to reuse
> it. Start by
> > reading
> > the README-geocoder*.md links above.
> >
> > -Steve
> >
> > >
> > > Regards,
> > >
> > > David
> > >
> > > On Sun, 20 Dec 2020 at 00:30, Stephen Woodbridge
> > > <stephenwoodbridge37 at gmail.com
> <mailto:stephenwoodbridge37 at gmail.com>
> > <mailto:stephenwoodbridge37 at gmail.com
> <mailto:stephenwoodbridge37 at gmail.com>>
> > <mailto:stephenwoodbridge37 at gmail.com
> <mailto:stephenwoodbridge37 at gmail.com>
> > <mailto:stephenwoodbridge37 at gmail.com
> <mailto:stephenwoodbridge37 at gmail.com>>>>
> > > wrote:
> > >
> > > On 12/19/2020 7:24 PM, Stephen Woodbridge wrote:
> > > > Hi David,
> > > >
> > > > Yes and No! you can do that for some simple
> addresses, but
> > > trying to
> > > > clean up addresses in general is much more
> complicated.
> > I've been
> > > > doing this for 20+ years writing geocoders.
> Check out:
> > > >
> > > >
> https://github.com/woodbri/address-standardizer
> <https://github.com/woodbri/address-standardizer>
> > <https://github.com/woodbri/address-standardizer
> <https://github.com/woodbri/address-standardizer>>
> > >
> <https://github.com/woodbri/address-standardizer
> <https://github.com/woodbri/address-standardizer>
> > <https://github.com/woodbri/address-standardizer
> <https://github.com/woodbri/address-standardizer>>>
> > > >
> > > > This will install as a postgresql extension
> and once you
> > set it
> > > up it
> > > > has an easy function call that can be used
> to clean things up.
> > > > Read this to get a better idea:
> > > >
> > > >
> > >
> >
> https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
> <https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md>
> >
> <https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
> <https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md>>
> > >
> >
> <https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
> <https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md>
> >
> <https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
> <https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md>>>
> > >
> > > >
> > > >
> > > > And there is a docs/ directory with more
> documentation. I also
> > > provide
> > > > sample starter files for parsing addresses
> of about 25
> > > countries, so
> > > > you can pick the one(s) you need and then
> customize them for
> > > your needs.
> > > >
> > > > If you just want to try working with simple
> regex you probably
> > > want to
> > > > read:
> > > >
> > > >
> https://www.postgresqltutorial.com/regexp_replace/
> <https://www.postgresqltutorial.com/regexp_replace/>
> > <https://www.postgresqltutorial.com/regexp_replace/
> <https://www.postgresqltutorial.com/regexp_replace/>>
> > >
> <https://www.postgresqltutorial.com/regexp_replace/
> <https://www.postgresqltutorial.com/regexp_replace/>
> > <https://www.postgresqltutorial.com/regexp_replace/
> <https://www.postgresqltutorial.com/regexp_replace/>>>
> > > >
> > > > |SELECT REGEXP_REPLACE(
> REGEXP_REPLACE||(string, |'[, .]+', '
> > > > '), -- convert comma, space, period to
> a single space
> > > > '^ *|--| *$', '') --
> remove leading and
> > > trailing
> > > > white space and '--'
> > > > );
> > >
> > > But this will not handle "101 - 103 -- a"
> without adding more
> > > clauses.
> > > And how do you plan to handle things like
> abbreviations?
> > Ave, ave,
> > > avenue, st, street, saint, or lots of other
> issues. A lot
> > depends on
> > > your use case, are you geocoding these are
> just trying to
> > clean up
> > > some
> > > text.
> > > >
> > > >
> > > > -Steve W
> > > >
> > > > On 12/19/2020 5:40 PM, Shaozhong SHI wrote:
> > > >> For instance,
> > > >>
> > > >> We may encounter,
> > > >>
> > > >> My Great Place. 101-103--a, Great Avenue,
> London, , SW3 5EE
> > > >>
> > > >> Can we do some thing like
> > > >>
> > > >> SELECT
> > > >> REGEXP_MATCHES(' My Great' Place.
> 101-103--a, Great
> > Avenue,
> > > >> London, , SW3 5EE ',
> > > >> a_regular_expression,
> > > >> 'g');
> > > >>
> > > >> so that we end up a clean address line with
> one line of code?
> > > >>
> > > >> My Great Place 101-103a Great Avenue London
> SW3 5EE
> > > >>
> > > >> What regex is that?
> > > >>
> > > >> Can we guarantee a single space between words?
> > > >>
> > > >> Regards,
> > > >>
> > > >> David
> > > >>
> > > >> _______________________________________________
> > > >> postgis-users mailing list
> > > >> postgis-users at lists.osgeo.org
> <mailto:postgis-users at lists.osgeo.org>
> > <mailto:postgis-users at lists.osgeo.org
> <mailto:postgis-users at lists.osgeo.org>>
> > > <mailto:postgis-users at lists.osgeo.org
> <mailto:postgis-users at lists.osgeo.org>
> > <mailto:postgis-users at lists.osgeo.org
> <mailto:postgis-users at lists.osgeo.org>>>
> > > >>
> https://lists.osgeo.org/mailman/listinfo/postgis-users
> <https://lists.osgeo.org/mailman/listinfo/postgis-users>
> >
> <https://lists.osgeo.org/mailman/listinfo/postgis-users
> <https://lists.osgeo.org/mailman/listinfo/postgis-users>>
> > >
> <https://lists.osgeo.org/mailman/listinfo/postgis-users
> <https://lists.osgeo.org/mailman/listinfo/postgis-users>
> >
> <https://lists.osgeo.org/mailman/listinfo/postgis-users
> <https://lists.osgeo.org/mailman/listinfo/postgis-users>>>
> > > >
> > >
> > > _______________________________________________
> > > postgis-users mailing list
> > > postgis-users at lists.osgeo.org
> <mailto:postgis-users at lists.osgeo.org>
> > <mailto:postgis-users at lists.osgeo.org
> <mailto:postgis-users at lists.osgeo.org>>
> > <mailto:postgis-users at lists.osgeo.org
> <mailto:postgis-users at lists.osgeo.org>
> > <mailto:postgis-users at lists.osgeo.org
> <mailto:postgis-users at lists.osgeo.org>>>
> > >
> https://lists.osgeo.org/mailman/listinfo/postgis-users
> <https://lists.osgeo.org/mailman/listinfo/postgis-users>
> >
> <https://lists.osgeo.org/mailman/listinfo/postgis-users
> <https://lists.osgeo.org/mailman/listinfo/postgis-users>>
> > >
> <https://lists.osgeo.org/mailman/listinfo/postgis-users
> <https://lists.osgeo.org/mailman/listinfo/postgis-users>
> >
> <https://lists.osgeo.org/mailman/listinfo/postgis-users
> <https://lists.osgeo.org/mailman/listinfo/postgis-users>>>
> > >
> > >
> > > _______________________________________________
> > > postgis-users mailing list
> > > postgis-users at lists.osgeo.org
> <mailto:postgis-users at lists.osgeo.org>
> <mailto:postgis-users at lists.osgeo.org
> <mailto:postgis-users at lists.osgeo.org>>
> > >
> https://lists.osgeo.org/mailman/listinfo/postgis-users
> <https://lists.osgeo.org/mailman/listinfo/postgis-users>
> >
> <https://lists.osgeo.org/mailman/listinfo/postgis-users
> <https://lists.osgeo.org/mailman/listinfo/postgis-users>>
> >
> > _______________________________________________
> > postgis-users mailing list
> > postgis-users at lists.osgeo.org
> <mailto:postgis-users at lists.osgeo.org>
> <mailto:postgis-users at lists.osgeo.org
> <mailto:postgis-users at lists.osgeo.org>>
> > https://lists.osgeo.org/mailman/listinfo/postgis-users
> <https://lists.osgeo.org/mailman/listinfo/postgis-users>
> >
> <https://lists.osgeo.org/mailman/listinfo/postgis-users
> <https://lists.osgeo.org/mailman/listinfo/postgis-users>>
> >
> >
> > _______________________________________________
> > postgis-users mailing list
> > postgis-users at lists.osgeo.org
> <mailto:postgis-users at lists.osgeo.org>
> > https://lists.osgeo.org/mailman/listinfo/postgis-users
> <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> <mailto:postgis-users at lists.osgeo.org>
> https://lists.osgeo.org/mailman/listinfo/postgis-users
> <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users
More information about the postgis-users
mailing list