[postgis-tickets] [PostGIS] #5397: debug_standardize_address

Wed Jun 7 17:31:52 PDT 2023

#5397: debug_standardize_address
---------------------------------+---------------------------
 Reporter:  robe                 |      Owner:  robe
     Type:  defect               |     Status:  new
 Priority:  high                 |  Milestone:  PostGIS 3.4.0
Component:  pagc_address_parser  |    Version:  master
 Keywords:                       |
---------------------------------+---------------------------
 This is a new function for address_standardizer extension that allows you
 to debug why an address is standardized the way it is.

 It's most useful when abbreviations or words can resolve to more than one
 kind of token - e.g.  ST for Street vs. ST for SAINT.

 The way the address standardizer works, it generates a standardized list
 of all tokens in an address and then applies rules to them.  In many cases
 many rules can result in some answer and the address standardizer picks
 the one with the highest score.

 This function allows you to look into the guts of what it is doing.

 I'm still working locally fitting all the bits and am almost there.
 One thought is if I should expose this as just json text (similar to
 ST_AsGeoJSON) or as a bonified jsonb or json.  I don't like that jsonb
 looses the sorting of the keys so that is one downside.

 I went with JSON as Leo's suggestion cause it allows maximum flexibility
 in turns of structure without having to change the output of the function.

 For example

 {{{
 SELECT s::jsonb
 FROM debug_standardize_address( 'us_lex', 'us_gaz','us_rules','One
 Devonshire Pl PH 301', 'Boston, MA 02109') AS s;
 }}}

 returns at the moment:

 {{{
 {
   "rules": [
     {
       "no": 0,
       "score": 0.87625,
       "rule_string": "0 1 2 17 0 -1 1 5 6 17 17",
       "rule_tokens": [
         {
           "pos": 0,
           "input-token": "NUMBER",
           "output-token": "HOUSE",
           "input-token-code": 0,
           "output-token-code": 1
         },
         {
           "pos": 1,
           "input-token": "WORD",
           "output-token": "STREET",
           "input-token-code": 1,
           "output-token-code": 5
         },
         {
           "pos": 2,
           "input-token": "TYPE",
           "output-token": "SUFTYP",
           "input-token-code": 2,
           "output-token-code": 6
         },
         {
           "pos": 3,
           "input-token": "UNITT",
           "output-token": "UNITT",
           "input-token-code": 17,
           "output-token-code": 17
         },
         {
           "pos": 4,
           "input-token": "NUMBER",
           "output-token": "UNITT",
           "input-token-code": 0,
           "output-token-code": 17
         }
       ]
     },
     {
       "no": 1,
       "score": 0.81125,
       "rule_string": "0 1 2 17 0 -1 1 5 6 17 17",
       "rule_tokens": [
         {
           "pos": 0,
           "input-token": "NUMBER",
           "output-token": "HOUSE",
           "input-token-code": 0,
           "output-token-code": 1
         },
         {
           "pos": 1,
           "input-token": "WORD",
           "output-token": "STREET",
           "input-token-code": 1,
           "output-token-code": 5
         },
         {
           "pos": 2,
           "input-token": "TYPE",
           "output-token": "SUFTYP",
           "input-token-code": 2,
           "output-token-code": 6
         },
         {
           "pos": 3,
           "input-token": "UNITT",
           "output-token": "UNITT",
           "input-token-code": 17,
           "output-token-code": 17
         },
         {
           "pos": 4,
           "input-token": "NUMBER",
           "output-token": "UNITT",
           "input-token-code": 0,
           "output-token-code": 17
         }
       ]
     },
     {
       "no": 2,
       "score": 0.754167,
       "rule_string": "0 1 2 17 0 -1 1 5 6 17 17",
       "rule_tokens": [
         {
           "pos": 0,
           "input-token": "NUMBER",
           "output-token": "HOUSE",
           "input-token-code": 0,
           "output-token-code": 1
         },
         {
           "pos": 1,
           "input-token": "WORD",
           "output-token": "STREET",
           "input-token-code": 1,
           "output-token-code": 5
         },
         {
           "pos": 2,
           "input-token": "TYPE",
           "output-token": "SUFTYP",
           "input-token-code": 2,
           "output-token-code": 6
         },
         {
           "pos": 3,
           "input-token": "UNITT",
           "output-token": "UNITT",
           "input-token-code": 17,
           "output-token-code": 17
         },
         {
           "pos": 4,
           "input-token": "NUMBER",
           "output-token": "UNITT",
           "input-token-code": 0,
           "output-token-code": 17
         }
       ]
     },
     {
       "no": 3,
       "score": 0.735937,
       "rule_string": "0 1 2 17 0 -1 1 5 6 17 17",
       "rule_tokens": [
         {
           "pos": 0,
           "input-token": "NUMBER",
           "output-token": "HOUSE",
           "input-token-code": 0,
           "output-token-code": 1
         },
         {
           "pos": 1,
           "input-token": "WORD",
           "output-token": "STREET",
           "input-token-code": 1,
           "output-token-code": 5
         },
         {
           "pos": 2,
           "input-token": "TYPE",
           "output-token": "SUFTYP",
           "input-token-code": 2,
           "output-token-code": 6
         },
         {
           "pos": 3,
           "input-token": "UNITT",
           "output-token": "UNITT",
           "input-token-code": 17,
           "output-token-code": 17
         },
         {
           "pos": 4,
           "input-token": "NUMBER",
           "output-token": "UNITT",
           "input-token-code": 0,
           "output-token-code": 17
         }
       ]
     },
     {
       "no": 4,
       "score": 0.724167,
       "rule_string": "0 1 2 17 0 -1 1 5 5 17 17",
       "rule_tokens": [
         {
           "pos": 0,
           "input-token": "NUMBER",
           "output-token": "HOUSE",
           "input-token-code": 0,
           "output-token-code": 1
         },
         {
           "pos": 1,
           "input-token": "WORD",
           "output-token": "STREET",
           "input-token-code": 1,
           "output-token-code": 5
         },
         {
           "pos": 2,
           "input-token": "TYPE",
           "output-token": "STREET",
           "input-token-code": 2,
           "output-token-code": 5
         },
         {
           "pos": 3,
           "input-token": "UNITT",
           "output-token": "UNITT",
           "input-token-code": 17,
           "output-token-code": 17
         },
         {
           "pos": 4,
           "input-token": "NUMBER",
           "output-token": "UNITT",
           "input-token-code": 0,
           "output-token-code": 17
         }
       ]
     },
     {
       "no": 5,
       "score": 0.70625,
       "rule_string": "0 1 2 2 0 -1 1 5 5 6 17",
       "rule_tokens": [
         {
           "pos": 0,
           "input-token": "NUMBER",
           "output-token": "HOUSE",
           "input-token-code": 0,
           "output-token-code": 1
         },
         {
           "pos": 1,
           "input-token": "WORD",
           "output-token": "STREET",
           "input-token-code": 1,
           "output-token-code": 5
         },
         {
           "pos": 2,
           "input-token": "TYPE",
           "output-token": "STREET",
           "input-token-code": 2,
           "output-token-code": 5
         },
         {
           "pos": 3,
           "input-token": "TYPE",
           "output-token": "SUFTYP",
           "input-token-code": 2,
           "output-token-code": 6
         },
         {
           "pos": 4,
           "input-token": "NUMBER",
           "output-token": "UNITT",
           "input-token-code": 0,
           "output-token-code": 17
         }
       ]
     }
   ],
   "input_tokens": [
     {
       "pos": 0,
       "token": "NUMBER",
       "stdword": "1",
       "token-code": 0
     },
     {
       "pos": 0,
       "token": "WORD",
       "stdword": "1",
       "token-code": 1
     },
     {
       "pos": 1,
       "token": "WORD",
       "stdword": "DEVONSHIRE",
       "token-code": 1
     },
     {
       "pos": 2,
       "token": "TYPE",
       "stdword": "PLACE",
       "token-code": 2
     },
     {
       "pos": 3,
       "token": "TYPE",
       "stdword": "PATH",
       "token-code": 2
     },
     {
       "pos": 3,
       "token": "UNITT",
       "stdword": "PENTHOUSE",
       "token-code": 17
     },
     {
       "pos": 4,
       "token": "NUMBER",
       "stdword": "301",
       "token-code": 0
     }
   ]
 }
 }}}

 Changes I need to make before I commit my work

 1) Add an option for the single address version (this will be the same
 function, but the last will be a default arg, which when left blank will
 call parse address first (much like current one does when only one arg).

 {{{
 SELECT s::jsonb
 FROM debug_standardize_address( 'us_lex', 'us_gaz','us_rules','One
 Devonshire Pl PH 301, Boston, MA 02109') AS s;
 }}}

 2) Output the macro rules (Not sure why it isn't at moment. This is the
 one that parses Boston, MA 02109 part
 3) Define for each rule what kind of rule it is
 4) Finish off the rulestring.  Right now the rule_string is missing the
 type of rule and the weight of the rule to allow it to be easily matched
 up in the rule table.  I also think there might be something wrong with my
 looping thru the rules since some rules appear to be repeated.

 5) Output the results much like what:

 {{{
 SELECT s.*
 FROM standardize_address( 'us_lex', 'us_gaz','us_rules','One Devonshire Pl
 PH 301', 'Boston, MA 02109') AS s;
 }}}

 This will be another array node in the document: where each element with
 be a jsonified stdaddress

 e.g.

 {{{
 "standardized_addresses": [
 {"building":null,"house_num":"1","predir":null,"qual":null,"pretype":null,"name":"DEVONSHIRE","suftype":"PLACE","sufdir":null,"ruralroute":null,"extra":null,"city":"BOSTON","state":"MASSACHUSETTS","country":null,"postcode":"02109","box":null,"unit":"#
 PENTHOUSE 301"} ]
 }}}

 I was thinking of skipping the nulls and eventually adding the lower
 ranking addresses after.
-- 
Ticket URL: <https://trac.osgeo.org/postgis/ticket/5397>
PostGIS <http://trac.osgeo.org/postgis/>
The PostGIS Trac is used for bug, enhancement & task tracking, a user and developer wiki, and a view into the subversion code repository of PostGIS project.