[postgis-users] Various ways to handle addresses in postgresql

Shaozhong SHI shishaozhong at gmail.com
Sun Apr 26 14:22:09 PDT 2020


Hi, Steve,

Thank you very much.

I am testing and evaluating possibilities and things which are worthwhile
to do.  Things which have solid proof.

Regards,

Shao

On Sun, 26 Apr 2020 at 17:13, Stephen Woodbridge <
stephenwoodbridge37 at gmail.com> wrote:

> On 4/26/2020 8:09 AM, Shaozhong SHI wrote:
> > Hi, Steve,
> >
> > Thanks.
> >
> > 2 questions.
> >
> > 1.  How can we remove things like Room 2a, Buildings 2-6b and etc with
> > regexp replace?
> You need to read up on how to write regexp, becuase you will keep
> finding cases like this you want to add to your regexp and you need to
> understand what they are doing. Compare the first one I gave you to this
> and make sure you understand why I made these changes to it. Also this
> is a little outside the scope of the PostGIS list.
>
> '^( *(Buildings?|Bldg|Room) *[0-9]+)?[- 0-9]*[a-zA-Z]? '
>
> > 2.  Once extensions created, can these functions be adapted?  Are
> > codes available ?    I will see whether to put it into a project, so
> > that our programmers can have work to do.
>
> The code for these in in the PostGIS repository, but if you are serious
> about making changes to them, then I recommend working with
> https://github.com/woodbri/address-standardizer because I rewrote these
> function to be easier to modify. The existing code is very hard to
> understand and very difficult to make changes to without breaking
> things. My new code is hopefully well documented in the repository and
> should be straight forward to work with. FYI, I have never built/tested
> with PostgreSQL 12+ so it might need some changes to support that, but
> it should be good for 9, 10 and 11 versions.
>
> -Steve
>
> >
> > Regards,
> >
> > Shao
> >
> > On Sun, 26 Apr 2020 at 03:09, Stephen Woodbridge
> > <stephenwoodbridge37 at gmail.com <mailto:stephenwoodbridge37 at gmail.com>>
> > wrote:
> >
> >     On 4/25/2020 7:19 PM, Shaozhong SHI wrote:
> >     > Hi, Steve,
> >     >
> >     > Many thanks.  Please send me the link to parse_address() and
> >     > standardize_address().
> >
> >     If you already have postGIS installed then
> >
> >     create extension address_standardizer;
> >
> >     # \df parse_address
> >     List of functions
> >       Schema |     Name      | Result data type | Argument data types
> >     |  Type
> >
>  --------+---------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+--------
> >       public | parse_address | record           | text, OUT num text, OUT
> >     street text, OUT street2 text, OUT address1 text, OUT city text, OUT
> >     state text, OUT zip text, OUT zipplus text, OUT country text | normal
> >     (1 row)
> >
> >     # select * from parse_address('123-2 main street city ny');
> >        num  |   street    | street2 |     address1      | city | state
> >     | zip
> >     | zipplus | country
> >
>  -------+-------------+---------+-------------------+------+-------+-----+---------+---------
> >       123-2 | main street |         | 123-2 main street | city | NY |
> >     |         | US
> >     (1 row)
> >
> >     # \df standardize*
> >                                                          List of
> functions
> >       Schema |        Name         | Result data type |
> >     Argument data types                      | Type
> >
>  --------+---------------------+------------------+---------------------------------------------------------------+--------
> >       public | standardize_address | stdaddr          | lextab text,
> >     gaztab
> >     text, rultab text, address text           | normal
> >       public | standardize_address | stdaddr          | lextab text,
> >     gaztab
> >     text, rultab text, micro text, macro text | normal
> >
> >     You need tables for the lexicon, gazetteer, and rules, which
> >     should be
> >     included in the extension but I'm not seeing them. So you can grab
> >     these
> >     from:
> >
> >
> https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-gaz.sql
> >
> https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-lex.sql
> >
> https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-rules.sql
> >
> >     and load them like:
> >
> >     psql mydb -f us-gaz.sql
> >     psql mydb -f us-lex.sql
> >     psql mydb -f us-rules.sql
> >
> >     # select * from standardize_address('lex', 'gaz', 'rules', '123-2
> >     main
> >     street city ny');
> >       building | house_num | predir | qual | pretype |  name  | suftype |
> >     sufdir | ruralroute | extra | city |  state   | country | postcode
> >     | box
> >     | unit
> >
>  ----------+-----------+--------+------+---------+--------+---------+--------+------------+-------+------+----------+---------+----------+-----+------
> >                | 123       |        |      |         | 2 MAIN | STREET
> >     |        |            |       | CITY | NEW YORK | USA     | |     |
> >     (1 row)
> >
> >
> >     This is a good example of why parsing addresses is so difficult. The
> >     rules for standardize_address do not account for a house number like
> >     "123-2", but the regexp in parse_address do handle it. It is easy
> >     to get
> >     the 80% right and very hard to get it much above that.
> >
> >     -Steve
> >
> >
> >     >
> >     > I need to find these first before test-running.
> >     >
> >     > Regards,
> >     >
> >     > Shao
> >     >
> >     > On Sat, 25 Apr 2020 at 21:20, Stephen Woodbridge
> >     > <stephenwoodbridge37 at gmail.com
> >     <mailto:stephenwoodbridge37 at gmail.com>
> >     <mailto:stephenwoodbridge37 at gmail.com
> >     <mailto:stephenwoodbridge37 at gmail.com>>>
> >     > wrote:
> >     >
> >     >     Shao,
> >     >
> >     >     '^( *Building *[0-9]+)?[- 0-9]*'
> >     >
> >     >     or something like that should do it. But I think you will
> >     find that a
> >     >     more robust solution is to use parse_address() and/or
> >     >     standardize_address() as they will recognize a lot of other
> >     address
> >     >     constructs, like "apt 3a" for for example.
> >     >
> >     >     parse_address() that a text field and breaks it into "house
> >     number
> >     >     street name" and "city state zip", but only works well in North
> >     >     America.
> >     >
> >     >     standardize_address() that comes with postGIS, breaks the
> >     address
> >     >     down
> >     >     into its components and can separate out things like
> >     buildings, and
> >     >     apartment/unit specifiers so you can then take the fields
> >     you are
> >     >     interested in and recombine just them in a new string.
> >     Again, this
> >     >     works
> >     >     best in North America.
> >     >
> >     >     My github address-standardizer is built to recognize address
> >     for most
> >     >     counties, but it can also be configured to recognize address
> >     >     standards
> >     >     for any county without too much effort. It compiles and
> >     installs as
> >     >     postgresql extension.
> >     >
> >     >     Addresses are generally very messy and unless your addresses
> >     are vary
> >     >     simple you will be constantly fighting with this or that
> >     exception.
> >     >
> >     >     -Steve
> >     >
> >     >     On 4/25/2020 2:55 PM, Shaozhong SHI wrote:
> >     >     > Is there a way to left trim including the building and
> number?
> >     >     >
> >     >     > Building 3  21-1              Great Avenue, a city, a
> >     country, this
> >     >     > planet.
> >     >     >
> >     >     > How to take way those things which are too local to an
> >     address?
> >     >     >
> >     >     > Regards,
> >     >     >
> >     >     > Shao
> >     >     >
> >     >     > On Sat, 25 Apr 2020 at 01:48, Shaozhong SHI
> >     >     <shishaozhong at gmail.com <mailto:shishaozhong at gmail.com>
> >     <mailto:shishaozhong at gmail.com <mailto:shishaozhong at gmail.com>>
> >     >     > <mailto:shishaozhong at gmail.com
> >     <mailto:shishaozhong at gmail.com> <mailto:shishaozhong at gmail.com
> >     <mailto:shishaozhong at gmail.com>>>>
> >     >     wrote:
> >     >     >
> >     >     >     I find this is a simple, but important question.
> >     >     >
> >     >     >     How best to split numbers and the rest of address?
> >     >     >
> >     >     >     For instance, one tricky one is as follows:
> >     >     >
> >     >     >     21-1 Great Avenue, a city, a country, this planet
> >     >     >
> >     >     >     How to turn this into the following:
> >     >     >
> >     >     >     column 1,       column 2
> >     >     >
> >     >     >       21-1              Great Avenue, a city, a country,
> >     this planet
> >     >     >
> >     >     >     Note:  there is a hyphen in  21-1
> >     >     >
> >     >     >     Any clue?
> >     >     >
> >     >     >     Regards,
> >     >     >
> >     >     >     Shao
> >     >     >
> >     >     >
> >     >     > _______________________________________________
> >     >     > postgis-users mailing list
> >     >     > postgis-users at lists.osgeo.org
> >     <mailto:postgis-users at lists.osgeo.org>
> >     <mailto:postgis-users at lists.osgeo.org
> >     <mailto:postgis-users at lists.osgeo.org>>
> >     >     > https://lists.osgeo.org/mailman/listinfo/postgis-users
> >     >
> >     >     _______________________________________________
> >     >     postgis-users mailing list
> >     > postgis-users at lists.osgeo.org
> >     <mailto:postgis-users at lists.osgeo.org>
> >     <mailto:postgis-users at lists.osgeo.org
> >     <mailto:postgis-users at lists.osgeo.org>>
> >     > https://lists.osgeo.org/mailman/listinfo/postgis-users
> >     >
> >     >
> >     > _______________________________________________
> >     > postgis-users mailing list
> >     > postgis-users at lists.osgeo.org <mailto:
> postgis-users at lists.osgeo.org>
> >     > https://lists.osgeo.org/mailman/listinfo/postgis-users
> >
> >     _______________________________________________
> >     postgis-users mailing list
> >     postgis-users at lists.osgeo.org <mailto:postgis-users at lists.osgeo.org>
> >     https://lists.osgeo.org/mailman/listinfo/postgis-users
> >
> >
> > _______________________________________________
> > postgis-users mailing list
> > postgis-users at lists.osgeo.org
> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20200426/233b5bce/attachment.html>


More information about the postgis-users mailing list