[postgis-users] Various ways to handle addresses in postgresql

Stephen Woodbridge stephenwoodbridge37 at gmail.com
Sun Apr 26 09:12:52 PDT 2020


On 4/26/2020 8:09 AM, Shaozhong SHI wrote:
> Hi, Steve,
>
> Thanks.
>
> 2 questions.
>
> 1.  How can we remove things like Room 2a, Buildings 2-6b and etc with 
> regexp replace?
You need to read up on how to write regexp, becuase you will keep 
finding cases like this you want to add to your regexp and you need to 
understand what they are doing. Compare the first one I gave you to this 
and make sure you understand why I made these changes to it. Also this 
is a little outside the scope of the PostGIS list.

'^( *(Buildings?|Bldg|Room) *[0-9]+)?[- 0-9]*[a-zA-Z]? '

> 2.  Once extensions created, can these functions be adapted?  Are 
> codes available ?    I will see whether to put it into a project, so 
> that our programmers can have work to do.

The code for these in in the PostGIS repository, but if you are serious 
about making changes to them, then I recommend working with 
https://github.com/woodbri/address-standardizer because I rewrote these 
function to be easier to modify. The existing code is very hard to 
understand and very difficult to make changes to without breaking 
things. My new code is hopefully well documented in the repository and 
should be straight forward to work with. FYI, I have never built/tested 
with PostgreSQL 12+ so it might need some changes to support that, but 
it should be good for 9, 10 and 11 versions.

-Steve

>
> Regards,
>
> Shao
>
> On Sun, 26 Apr 2020 at 03:09, Stephen Woodbridge 
> <stephenwoodbridge37 at gmail.com <mailto:stephenwoodbridge37 at gmail.com>> 
> wrote:
>
>     On 4/25/2020 7:19 PM, Shaozhong SHI wrote:
>     > Hi, Steve,
>     >
>     > Many thanks.  Please send me the link to parse_address() and
>     > standardize_address().
>
>     If you already have postGIS installed then
>
>     create extension address_standardizer;
>
>     # \df parse_address
>     List of functions
>       Schema |     Name      | Result data type | Argument data types
>     |  Type
>     --------+---------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+--------
>       public | parse_address | record           | text, OUT num text, OUT
>     street text, OUT street2 text, OUT address1 text, OUT city text, OUT
>     state text, OUT zip text, OUT zipplus text, OUT country text | normal
>     (1 row)
>
>     # select * from parse_address('123-2 main street city ny');
>        num  |   street    | street2 |     address1      | city | state
>     | zip
>     | zipplus | country
>     -------+-------------+---------+-------------------+------+-------+-----+---------+---------
>       123-2 | main street |         | 123-2 main street | city | NY |
>     |         | US
>     (1 row)
>
>     # \df standardize*
>                                                          List of functions
>       Schema |        Name         | Result data type |
>     Argument data types                      | Type
>     --------+---------------------+------------------+---------------------------------------------------------------+--------
>       public | standardize_address | stdaddr          | lextab text,
>     gaztab
>     text, rultab text, address text           | normal
>       public | standardize_address | stdaddr          | lextab text,
>     gaztab
>     text, rultab text, micro text, macro text | normal
>
>     You need tables for the lexicon, gazetteer, and rules, which
>     should be
>     included in the extension but I'm not seeing them. So you can grab
>     these
>     from:
>
>     https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-gaz.sql
>     https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-lex.sql
>     https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-rules.sql
>
>     and load them like:
>
>     psql mydb -f us-gaz.sql
>     psql mydb -f us-lex.sql
>     psql mydb -f us-rules.sql
>
>     # select * from standardize_address('lex', 'gaz', 'rules', '123-2
>     main
>     street city ny');
>       building | house_num | predir | qual | pretype |  name  | suftype |
>     sufdir | ruralroute | extra | city |  state   | country | postcode
>     | box
>     | unit
>     ----------+-----------+--------+------+---------+--------+---------+--------+------------+-------+------+----------+---------+----------+-----+------
>                | 123       |        |      |         | 2 MAIN | STREET
>     |        |            |       | CITY | NEW YORK | USA     | |     |
>     (1 row)
>
>
>     This is a good example of why parsing addresses is so difficult. The
>     rules for standardize_address do not account for a house number like
>     "123-2", but the regexp in parse_address do handle it. It is easy
>     to get
>     the 80% right and very hard to get it much above that.
>
>     -Steve
>
>
>     >
>     > I need to find these first before test-running.
>     >
>     > Regards,
>     >
>     > Shao
>     >
>     > On Sat, 25 Apr 2020 at 21:20, Stephen Woodbridge
>     > <stephenwoodbridge37 at gmail.com
>     <mailto:stephenwoodbridge37 at gmail.com>
>     <mailto:stephenwoodbridge37 at gmail.com
>     <mailto:stephenwoodbridge37 at gmail.com>>>
>     > wrote:
>     >
>     >     Shao,
>     >
>     >     '^( *Building *[0-9]+)?[- 0-9]*'
>     >
>     >     or something like that should do it. But I think you will
>     find that a
>     >     more robust solution is to use parse_address() and/or
>     >     standardize_address() as they will recognize a lot of other
>     address
>     >     constructs, like "apt 3a" for for example.
>     >
>     >     parse_address() that a text field and breaks it into "house
>     number
>     >     street name" and "city state zip", but only works well in North
>     >     America.
>     >
>     >     standardize_address() that comes with postGIS, breaks the
>     address
>     >     down
>     >     into its components and can separate out things like
>     buildings, and
>     >     apartment/unit specifiers so you can then take the fields
>     you are
>     >     interested in and recombine just them in a new string.
>     Again, this
>     >     works
>     >     best in North America.
>     >
>     >     My github address-standardizer is built to recognize address
>     for most
>     >     counties, but it can also be configured to recognize address
>     >     standards
>     >     for any county without too much effort. It compiles and
>     installs as
>     >     postgresql extension.
>     >
>     >     Addresses are generally very messy and unless your addresses
>     are vary
>     >     simple you will be constantly fighting with this or that
>     exception.
>     >
>     >     -Steve
>     >
>     >     On 4/25/2020 2:55 PM, Shaozhong SHI wrote:
>     >     > Is there a way to left trim including the building and number?
>     >     >
>     >     > Building 3  21-1              Great Avenue, a city, a
>     country, this
>     >     > planet.
>     >     >
>     >     > How to take way those things which are too local to an
>     address?
>     >     >
>     >     > Regards,
>     >     >
>     >     > Shao
>     >     >
>     >     > On Sat, 25 Apr 2020 at 01:48, Shaozhong SHI
>     >     <shishaozhong at gmail.com <mailto:shishaozhong at gmail.com>
>     <mailto:shishaozhong at gmail.com <mailto:shishaozhong at gmail.com>>
>     >     > <mailto:shishaozhong at gmail.com
>     <mailto:shishaozhong at gmail.com> <mailto:shishaozhong at gmail.com
>     <mailto:shishaozhong at gmail.com>>>>
>     >     wrote:
>     >     >
>     >     >     I find this is a simple, but important question.
>     >     >
>     >     >     How best to split numbers and the rest of address?
>     >     >
>     >     >     For instance, one tricky one is as follows:
>     >     >
>     >     >     21-1 Great Avenue, a city, a country, this planet
>     >     >
>     >     >     How to turn this into the following:
>     >     >
>     >     >     column 1,       column 2
>     >     >
>     >     >       21-1              Great Avenue, a city, a country,
>     this planet
>     >     >
>     >     >     Note:  there is a hyphen in  21-1
>     >     >
>     >     >     Any clue?
>     >     >
>     >     >     Regards,
>     >     >
>     >     >     Shao
>     >     >
>     >     >
>     >     > _______________________________________________
>     >     > postgis-users mailing list
>     >     > postgis-users at lists.osgeo.org
>     <mailto:postgis-users at lists.osgeo.org>
>     <mailto:postgis-users at lists.osgeo.org
>     <mailto:postgis-users at lists.osgeo.org>>
>     >     > https://lists.osgeo.org/mailman/listinfo/postgis-users
>     >
>     >     _______________________________________________
>     >     postgis-users mailing list
>     > postgis-users at lists.osgeo.org
>     <mailto:postgis-users at lists.osgeo.org>
>     <mailto:postgis-users at lists.osgeo.org
>     <mailto:postgis-users at lists.osgeo.org>>
>     > https://lists.osgeo.org/mailman/listinfo/postgis-users
>     >
>     >
>     > _______________________________________________
>     > postgis-users mailing list
>     > postgis-users at lists.osgeo.org <mailto:postgis-users at lists.osgeo.org>
>     > https://lists.osgeo.org/mailman/listinfo/postgis-users
>
>     _______________________________________________
>     postgis-users mailing list
>     postgis-users at lists.osgeo.org <mailto:postgis-users at lists.osgeo.org>
>     https://lists.osgeo.org/mailman/listinfo/postgis-users
>
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users



More information about the postgis-users mailing list