[postgis-users] Various ways to handle addresses in postgresql
Shaozhong SHI
shishaozhong at gmail.com
Sun Apr 26 14:22:09 PDT 2020
Hi, Steve,
Thank you very much.
I am testing and evaluating possibilities and things which are worthwhile
to do. Things which have solid proof.
Regards,
Shao
On Sun, 26 Apr 2020 at 17:13, Stephen Woodbridge <
stephenwoodbridge37 at gmail.com> wrote:
> On 4/26/2020 8:09 AM, Shaozhong SHI wrote:
> > Hi, Steve,
> >
> > Thanks.
> >
> > 2 questions.
> >
> > 1. How can we remove things like Room 2a, Buildings 2-6b and etc with
> > regexp replace?
> You need to read up on how to write regexp, becuase you will keep
> finding cases like this you want to add to your regexp and you need to
> understand what they are doing. Compare the first one I gave you to this
> and make sure you understand why I made these changes to it. Also this
> is a little outside the scope of the PostGIS list.
>
> '^( *(Buildings?|Bldg|Room) *[0-9]+)?[- 0-9]*[a-zA-Z]? '
>
> > 2. Once extensions created, can these functions be adapted? Are
> > codes available ? I will see whether to put it into a project, so
> > that our programmers can have work to do.
>
> The code for these in in the PostGIS repository, but if you are serious
> about making changes to them, then I recommend working with
> https://github.com/woodbri/address-standardizer because I rewrote these
> function to be easier to modify. The existing code is very hard to
> understand and very difficult to make changes to without breaking
> things. My new code is hopefully well documented in the repository and
> should be straight forward to work with. FYI, I have never built/tested
> with PostgreSQL 12+ so it might need some changes to support that, but
> it should be good for 9, 10 and 11 versions.
>
> -Steve
>
> >
> > Regards,
> >
> > Shao
> >
> > On Sun, 26 Apr 2020 at 03:09, Stephen Woodbridge
> > <stephenwoodbridge37 at gmail.com <mailto:stephenwoodbridge37 at gmail.com>>
> > wrote:
> >
> > On 4/25/2020 7:19 PM, Shaozhong SHI wrote:
> > > Hi, Steve,
> > >
> > > Many thanks. Please send me the link to parse_address() and
> > > standardize_address().
> >
> > If you already have postGIS installed then
> >
> > create extension address_standardizer;
> >
> > # \df parse_address
> > List of functions
> > Schema | Name | Result data type | Argument data types
> > | Type
> >
> --------+---------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+--------
> > public | parse_address | record | text, OUT num text, OUT
> > street text, OUT street2 text, OUT address1 text, OUT city text, OUT
> > state text, OUT zip text, OUT zipplus text, OUT country text | normal
> > (1 row)
> >
> > # select * from parse_address('123-2 main street city ny');
> > num | street | street2 | address1 | city | state
> > | zip
> > | zipplus | country
> >
> -------+-------------+---------+-------------------+------+-------+-----+---------+---------
> > 123-2 | main street | | 123-2 main street | city | NY |
> > | | US
> > (1 row)
> >
> > # \df standardize*
> > List of
> functions
> > Schema | Name | Result data type |
> > Argument data types | Type
> >
> --------+---------------------+------------------+---------------------------------------------------------------+--------
> > public | standardize_address | stdaddr | lextab text,
> > gaztab
> > text, rultab text, address text | normal
> > public | standardize_address | stdaddr | lextab text,
> > gaztab
> > text, rultab text, micro text, macro text | normal
> >
> > You need tables for the lexicon, gazetteer, and rules, which
> > should be
> > included in the extension but I'm not seeing them. So you can grab
> > these
> > from:
> >
> >
> https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-gaz.sql
> >
> https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-lex.sql
> >
> https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-rules.sql
> >
> > and load them like:
> >
> > psql mydb -f us-gaz.sql
> > psql mydb -f us-lex.sql
> > psql mydb -f us-rules.sql
> >
> > # select * from standardize_address('lex', 'gaz', 'rules', '123-2
> > main
> > street city ny');
> > building | house_num | predir | qual | pretype | name | suftype |
> > sufdir | ruralroute | extra | city | state | country | postcode
> > | box
> > | unit
> >
> ----------+-----------+--------+------+---------+--------+---------+--------+------------+-------+------+----------+---------+----------+-----+------
> > | 123 | | | | 2 MAIN | STREET
> > | | | | CITY | NEW YORK | USA | | |
> > (1 row)
> >
> >
> > This is a good example of why parsing addresses is so difficult. The
> > rules for standardize_address do not account for a house number like
> > "123-2", but the regexp in parse_address do handle it. It is easy
> > to get
> > the 80% right and very hard to get it much above that.
> >
> > -Steve
> >
> >
> > >
> > > I need to find these first before test-running.
> > >
> > > Regards,
> > >
> > > Shao
> > >
> > > On Sat, 25 Apr 2020 at 21:20, Stephen Woodbridge
> > > <stephenwoodbridge37 at gmail.com
> > <mailto:stephenwoodbridge37 at gmail.com>
> > <mailto:stephenwoodbridge37 at gmail.com
> > <mailto:stephenwoodbridge37 at gmail.com>>>
> > > wrote:
> > >
> > > Shao,
> > >
> > > '^( *Building *[0-9]+)?[- 0-9]*'
> > >
> > > or something like that should do it. But I think you will
> > find that a
> > > more robust solution is to use parse_address() and/or
> > > standardize_address() as they will recognize a lot of other
> > address
> > > constructs, like "apt 3a" for for example.
> > >
> > > parse_address() that a text field and breaks it into "house
> > number
> > > street name" and "city state zip", but only works well in North
> > > America.
> > >
> > > standardize_address() that comes with postGIS, breaks the
> > address
> > > down
> > > into its components and can separate out things like
> > buildings, and
> > > apartment/unit specifiers so you can then take the fields
> > you are
> > > interested in and recombine just them in a new string.
> > Again, this
> > > works
> > > best in North America.
> > >
> > > My github address-standardizer is built to recognize address
> > for most
> > > counties, but it can also be configured to recognize address
> > > standards
> > > for any county without too much effort. It compiles and
> > installs as
> > > postgresql extension.
> > >
> > > Addresses are generally very messy and unless your addresses
> > are vary
> > > simple you will be constantly fighting with this or that
> > exception.
> > >
> > > -Steve
> > >
> > > On 4/25/2020 2:55 PM, Shaozhong SHI wrote:
> > > > Is there a way to left trim including the building and
> number?
> > > >
> > > > Building 3 21-1 Great Avenue, a city, a
> > country, this
> > > > planet.
> > > >
> > > > How to take way those things which are too local to an
> > address?
> > > >
> > > > Regards,
> > > >
> > > > Shao
> > > >
> > > > On Sat, 25 Apr 2020 at 01:48, Shaozhong SHI
> > > <shishaozhong at gmail.com <mailto:shishaozhong at gmail.com>
> > <mailto:shishaozhong at gmail.com <mailto:shishaozhong at gmail.com>>
> > > > <mailto:shishaozhong at gmail.com
> > <mailto:shishaozhong at gmail.com> <mailto:shishaozhong at gmail.com
> > <mailto:shishaozhong at gmail.com>>>>
> > > wrote:
> > > >
> > > > I find this is a simple, but important question.
> > > >
> > > > How best to split numbers and the rest of address?
> > > >
> > > > For instance, one tricky one is as follows:
> > > >
> > > > 21-1 Great Avenue, a city, a country, this planet
> > > >
> > > > How to turn this into the following:
> > > >
> > > > column 1, column 2
> > > >
> > > > 21-1 Great Avenue, a city, a country,
> > this planet
> > > >
> > > > Note: there is a hyphen in 21-1
> > > >
> > > > Any clue?
> > > >
> > > > Regards,
> > > >
> > > > Shao
> > > >
> > > >
> > > > _______________________________________________
> > > > postgis-users mailing list
> > > > postgis-users at lists.osgeo.org
> > <mailto:postgis-users at lists.osgeo.org>
> > <mailto:postgis-users at lists.osgeo.org
> > <mailto:postgis-users at lists.osgeo.org>>
> > > > https://lists.osgeo.org/mailman/listinfo/postgis-users
> > >
> > > _______________________________________________
> > > postgis-users mailing list
> > > postgis-users at lists.osgeo.org
> > <mailto:postgis-users at lists.osgeo.org>
> > <mailto:postgis-users at lists.osgeo.org
> > <mailto:postgis-users at lists.osgeo.org>>
> > > https://lists.osgeo.org/mailman/listinfo/postgis-users
> > >
> > >
> > > _______________________________________________
> > > postgis-users mailing list
> > > postgis-users at lists.osgeo.org <mailto:
> postgis-users at lists.osgeo.org>
> > > https://lists.osgeo.org/mailman/listinfo/postgis-users
> >
> > _______________________________________________
> > postgis-users mailing list
> > postgis-users at lists.osgeo.org <mailto:postgis-users at lists.osgeo.org>
> > https://lists.osgeo.org/mailman/listinfo/postgis-users
> >
> >
> > _______________________________________________
> > postgis-users mailing list
> > postgis-users at lists.osgeo.org
> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20200426/233b5bce/attachment.html>
More information about the postgis-users
mailing list