[postgis-users] Various ways to handle addresses in postgresql

Stephen Woodbridge stephenwoodbridge37 at gmail.com
Wed Apr 29 10:52:29 PDT 2020


On 4/29/2020 9:15 AM, Shaozhong SHI wrote:
> Hi, Steve,
>
> I have address_standardizer extension created.
>
> It is not like other functions.  How to review its code? Does it need 
> to adapted to handle other countries addresses?

As I tried to indicate, the postGIS address_standardizer extension is 
extremely hard to understand and nearly impossible to customize the 
files. I migrated the PAGC code into this extension but I barely 
understood the code. Conceptually, the ideas are straight forward, but 
the code is very hard to follow and understand so I portted it as a 
black box.

If the exisiting address_standardizer_data_us does not work for you or 
you need to support other countries, you will need to build and install 
the one here:

https://github.com/woodbri/address-standardizer

This was written as a replacement
This one has sample files for 25 countries here:

https://github.com/woodbri/address-standardizer/tree/develop/data/sample

that you can customize and has (hopefully) good documentation here:

https://github.com/woodbri/address-standardizer/blob/develop/README.md
https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
https://github.com/woodbri/address-standardizer/tree/develop/data

Also, this code is written in C++ and I hope it is written to be easier 
to understand and review.

Not sure I can help much more than this unless you have questions on 
https://github.com/woodbri/address-standardizer which we should probably 
take off the PostGIS list as this is not part of PostGIS.

-Steve

>
> Regards,
>
> Shao
>
> On Sun, 26 Apr 2020 at 17:23, Stephen Woodbridge 
> <stephenwoodbridge37 at gmail.com <mailto:stephenwoodbridge37 at gmail.com>> 
> wrote:
>
>     Shao,
>
>     I just remembered the lex, gaz, and rules data is in a separate
>     extension. The correct way to install it is with:
>
>     create extension address_standardizer_data_us;
>
>     -Steve
>
>     On 4/26/2020 8:09 AM, Shaozhong SHI wrote:
>     > Hi, Steve,
>     >
>     > Thanks.
>     >
>     > 2 questions.
>     >
>     > 1.  How can we remove things like Room 2a, Buildings 2-6b and
>     etc with
>     > regexp replace?
>     > 2.  Once extensions created, can these functions be adapted?  Are
>     > codes available ?    I will see whether to put it into a
>     project, so
>     > that our programmers can have work to do.
>     >
>     > Regards,
>     >
>     > Shao
>     >
>     > On Sun, 26 Apr 2020 at 03:09, Stephen Woodbridge
>     > <stephenwoodbridge37 at gmail.com
>     <mailto:stephenwoodbridge37 at gmail.com>
>     <mailto:stephenwoodbridge37 at gmail.com
>     <mailto:stephenwoodbridge37 at gmail.com>>>
>     > wrote:
>     >
>     >     On 4/25/2020 7:19 PM, Shaozhong SHI wrote:
>     >     > Hi, Steve,
>     >     >
>     >     > Many thanks.  Please send me the link to parse_address() and
>     >     > standardize_address().
>     >
>     >     If you already have postGIS installed then
>     >
>     >     create extension address_standardizer;
>     >
>     >     # \df parse_address
>     >     List of functions
>     >       Schema |     Name      | Result data type | Argument data
>     types
>     >     |  Type
>     >
>      --------+---------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+--------
>     >       public | parse_address | record           | text, OUT num
>     text, OUT
>     >     street text, OUT street2 text, OUT address1 text, OUT city
>     text, OUT
>     >     state text, OUT zip text, OUT zipplus text, OUT country text
>     | normal
>     >     (1 row)
>     >
>     >     # select * from parse_address('123-2 main street city ny');
>     >        num  |   street    | street2 |     address1      | city |
>     state
>     >     | zip
>     >     | zipplus | country
>     >
>      -------+-------------+---------+-------------------+------+-------+-----+---------+---------
>     >       123-2 | main street |         | 123-2 main street | city |
>     NY |
>     >     |         | US
>     >     (1 row)
>     >
>     >     # \df standardize*
>     > List of functions
>     >       Schema |        Name         | Result data type |
>     >     Argument data types                      | Type
>     >
>      --------+---------------------+------------------+---------------------------------------------------------------+--------
>     >       public | standardize_address | stdaddr          | lextab text,
>     >     gaztab
>     >     text, rultab text, address text           | normal
>     >       public | standardize_address | stdaddr          | lextab text,
>     >     gaztab
>     >     text, rultab text, micro text, macro text | normal
>     >
>     >     You need tables for the lexicon, gazetteer, and rules, which
>     >     should be
>     >     included in the extension but I'm not seeing them. So you
>     can grab
>     >     these
>     >     from:
>     >
>     >
>     https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-gaz.sql
>     >
>     https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-lex.sql
>     >
>     https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-rules.sql
>     >
>     >     and load them like:
>     >
>     >     psql mydb -f us-gaz.sql
>     >     psql mydb -f us-lex.sql
>     >     psql mydb -f us-rules.sql
>     >
>     >     # select * from standardize_address('lex', 'gaz', 'rules',
>     '123-2
>     >     main
>     >     street city ny');
>     >       building | house_num | predir | qual | pretype | name  |
>     suftype |
>     >     sufdir | ruralroute | extra | city |  state   | country |
>     postcode
>     >     | box
>     >     | unit
>     >
>      ----------+-----------+--------+------+---------+--------+---------+--------+------------+-------+------+----------+---------+----------+-----+------
>     >                | 123       |        |      |         | 2 MAIN |
>     STREET
>     >     |        |            |       | CITY | NEW YORK | USA     |
>     |     |
>     >     (1 row)
>     >
>     >
>     >     This is a good example of why parsing addresses is so
>     difficult. The
>     >     rules for standardize_address do not account for a house
>     number like
>     >     "123-2", but the regexp in parse_address do handle it. It is
>     easy
>     >     to get
>     >     the 80% right and very hard to get it much above that.
>     >
>     >     -Steve
>     >
>     >
>     >     >
>     >     > I need to find these first before test-running.
>     >     >
>     >     > Regards,
>     >     >
>     >     > Shao
>     >     >
>     >     > On Sat, 25 Apr 2020 at 21:20, Stephen Woodbridge
>     >     > <stephenwoodbridge37 at gmail.com
>     <mailto:stephenwoodbridge37 at gmail.com>
>     >     <mailto:stephenwoodbridge37 at gmail.com
>     <mailto:stephenwoodbridge37 at gmail.com>>
>     >     <mailto:stephenwoodbridge37 at gmail.com
>     <mailto:stephenwoodbridge37 at gmail.com>
>     >     <mailto:stephenwoodbridge37 at gmail.com
>     <mailto:stephenwoodbridge37 at gmail.com>>>>
>     >     > wrote:
>     >     >
>     >     >     Shao,
>     >     >
>     >     >     '^( *Building *[0-9]+)?[- 0-9]*'
>     >     >
>     >     >     or something like that should do it. But I think you will
>     >     find that a
>     >     >     more robust solution is to use parse_address() and/or
>     >     >     standardize_address() as they will recognize a lot of
>     other
>     >     address
>     >     >     constructs, like "apt 3a" for for example.
>     >     >
>     >     >     parse_address() that a text field and breaks it into
>     "house
>     >     number
>     >     >     street name" and "city state zip", but only works well
>     in North
>     >     >     America.
>     >     >
>     >     >     standardize_address() that comes with postGIS, breaks the
>     >     address
>     >     >     down
>     >     >     into its components and can separate out things like
>     >     buildings, and
>     >     >     apartment/unit specifiers so you can then take the fields
>     >     you are
>     >     >     interested in and recombine just them in a new string.
>     >     Again, this
>     >     >     works
>     >     >     best in North America.
>     >     >
>     >     >     My github address-standardizer is built to recognize
>     address
>     >     for most
>     >     >     counties, but it can also be configured to recognize
>     address
>     >     >     standards
>     >     >     for any county without too much effort. It compiles and
>     >     installs as
>     >     >     postgresql extension.
>     >     >
>     >     >     Addresses are generally very messy and unless your
>     addresses
>     >     are vary
>     >     >     simple you will be constantly fighting with this or that
>     >     exception.
>     >     >
>     >     >     -Steve
>     >     >
>     >     >     On 4/25/2020 2:55 PM, Shaozhong SHI wrote:
>     >     >     > Is there a way to left trim including the building
>     and number?
>     >     >     >
>     >     >     > Building 3  21-1              Great Avenue, a city, a
>     >     country, this
>     >     >     > planet.
>     >     >     >
>     >     >     > How to take way those things which are too local to an
>     >     address?
>     >     >     >
>     >     >     > Regards,
>     >     >     >
>     >     >     > Shao
>     >     >     >
>     >     >     > On Sat, 25 Apr 2020 at 01:48, Shaozhong SHI
>     >     >     <shishaozhong at gmail.com
>     <mailto:shishaozhong at gmail.com> <mailto:shishaozhong at gmail.com
>     <mailto:shishaozhong at gmail.com>>
>     >     <mailto:shishaozhong at gmail.com
>     <mailto:shishaozhong at gmail.com> <mailto:shishaozhong at gmail.com
>     <mailto:shishaozhong at gmail.com>>>
>     >     >     > <mailto:shishaozhong at gmail.com
>     <mailto:shishaozhong at gmail.com>
>     >     <mailto:shishaozhong at gmail.com
>     <mailto:shishaozhong at gmail.com>> <mailto:shishaozhong at gmail.com
>     <mailto:shishaozhong at gmail.com>
>     >     <mailto:shishaozhong at gmail.com
>     <mailto:shishaozhong at gmail.com>>>>>
>     >     >     wrote:
>     >     >     >
>     >     >     >     I find this is a simple, but important question.
>     >     >     >
>     >     >     >     How best to split numbers and the rest of address?
>     >     >     >
>     >     >     >     For instance, one tricky one is as follows:
>     >     >     >
>     >     >     >     21-1 Great Avenue, a city, a country, this planet
>     >     >     >
>     >     >     >     How to turn this into the following:
>     >     >     >
>     >     >     >     column 1,       column 2
>     >     >     >
>     >     >     >       21-1              Great Avenue, a city, a country,
>     >     this planet
>     >     >     >
>     >     >     >     Note:  there is a hyphen in  21-1
>     >     >     >
>     >     >     >     Any clue?
>     >     >     >
>     >     >     >     Regards,
>     >     >     >
>     >     >     >     Shao
>     >     >     >
>     >     >     >
>     >     >     > _______________________________________________
>     >     >     > postgis-users mailing list
>     >     >     > postgis-users at lists.osgeo.org
>     <mailto:postgis-users at lists.osgeo.org>
>     >     <mailto:postgis-users at lists.osgeo.org
>     <mailto:postgis-users at lists.osgeo.org>>
>     >     <mailto:postgis-users at lists.osgeo.org
>     <mailto:postgis-users at lists.osgeo.org>
>     >     <mailto:postgis-users at lists.osgeo.org
>     <mailto:postgis-users at lists.osgeo.org>>>
>     >     >     > https://lists.osgeo.org/mailman/listinfo/postgis-users
>     >     >
>     >     >  _______________________________________________
>     >     >     postgis-users mailing list
>     >     > postgis-users at lists.osgeo.org
>     <mailto:postgis-users at lists.osgeo.org>
>     >     <mailto:postgis-users at lists.osgeo.org
>     <mailto:postgis-users at lists.osgeo.org>>
>     >     <mailto:postgis-users at lists.osgeo.org
>     <mailto:postgis-users at lists.osgeo.org>
>     >     <mailto:postgis-users at lists.osgeo.org
>     <mailto:postgis-users at lists.osgeo.org>>>
>     >     > https://lists.osgeo.org/mailman/listinfo/postgis-users
>     >     >
>     >     >
>     >     > _______________________________________________
>     >     > postgis-users mailing list
>     >     > postgis-users at lists.osgeo.org
>     <mailto:postgis-users at lists.osgeo.org>
>     <mailto:postgis-users at lists.osgeo.org
>     <mailto:postgis-users at lists.osgeo.org>>
>     >     > https://lists.osgeo.org/mailman/listinfo/postgis-users
>     >
>     >     _______________________________________________
>     >     postgis-users mailing list
>     > postgis-users at lists.osgeo.org
>     <mailto:postgis-users at lists.osgeo.org>
>     <mailto:postgis-users at lists.osgeo.org
>     <mailto:postgis-users at lists.osgeo.org>>
>     > https://lists.osgeo.org/mailman/listinfo/postgis-users
>     >
>     >
>     > _______________________________________________
>     > postgis-users mailing list
>     > postgis-users at lists.osgeo.org <mailto:postgis-users at lists.osgeo.org>
>     > https://lists.osgeo.org/mailman/listinfo/postgis-users
>
>     _______________________________________________
>     postgis-users mailing list
>     postgis-users at lists.osgeo.org <mailto:postgis-users at lists.osgeo.org>
>     https://lists.osgeo.org/mailman/listinfo/postgis-users
>
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users



More information about the postgis-users mailing list