[postgis-tickets] [PostGIS] #5397: debug_standardize_address
PostGIS
trac at osgeo.org
Wed Jun 7 17:31:52 PDT 2023
#5397: debug_standardize_address
---------------------------------+---------------------------
Reporter: robe | Owner: robe
Type: defect | Status: new
Priority: high | Milestone: PostGIS 3.4.0
Component: pagc_address_parser | Version: master
Keywords: |
---------------------------------+---------------------------
This is a new function for address_standardizer extension that allows you
to debug why an address is standardized the way it is.
It's most useful when abbreviations or words can resolve to more than one
kind of token - e.g. ST for Street vs. ST for SAINT.
The way the address standardizer works, it generates a standardized list
of all tokens in an address and then applies rules to them. In many cases
many rules can result in some answer and the address standardizer picks
the one with the highest score.
This function allows you to look into the guts of what it is doing.
I'm still working locally fitting all the bits and am almost there.
One thought is if I should expose this as just json text (similar to
ST_AsGeoJSON) or as a bonified jsonb or json. I don't like that jsonb
looses the sorting of the keys so that is one downside.
I went with JSON as Leo's suggestion cause it allows maximum flexibility
in turns of structure without having to change the output of the function.
For example
{{{
SELECT s::jsonb
FROM debug_standardize_address( 'us_lex', 'us_gaz','us_rules','One
Devonshire Pl PH 301', 'Boston, MA 02109') AS s;
}}}
returns at the moment:
{{{
{
"rules": [
{
"no": 0,
"score": 0.87625,
"rule_string": "0 1 2 17 0 -1 1 5 6 17 17",
"rule_tokens": [
{
"pos": 0,
"input-token": "NUMBER",
"output-token": "HOUSE",
"input-token-code": 0,
"output-token-code": 1
},
{
"pos": 1,
"input-token": "WORD",
"output-token": "STREET",
"input-token-code": 1,
"output-token-code": 5
},
{
"pos": 2,
"input-token": "TYPE",
"output-token": "SUFTYP",
"input-token-code": 2,
"output-token-code": 6
},
{
"pos": 3,
"input-token": "UNITT",
"output-token": "UNITT",
"input-token-code": 17,
"output-token-code": 17
},
{
"pos": 4,
"input-token": "NUMBER",
"output-token": "UNITT",
"input-token-code": 0,
"output-token-code": 17
}
]
},
{
"no": 1,
"score": 0.81125,
"rule_string": "0 1 2 17 0 -1 1 5 6 17 17",
"rule_tokens": [
{
"pos": 0,
"input-token": "NUMBER",
"output-token": "HOUSE",
"input-token-code": 0,
"output-token-code": 1
},
{
"pos": 1,
"input-token": "WORD",
"output-token": "STREET",
"input-token-code": 1,
"output-token-code": 5
},
{
"pos": 2,
"input-token": "TYPE",
"output-token": "SUFTYP",
"input-token-code": 2,
"output-token-code": 6
},
{
"pos": 3,
"input-token": "UNITT",
"output-token": "UNITT",
"input-token-code": 17,
"output-token-code": 17
},
{
"pos": 4,
"input-token": "NUMBER",
"output-token": "UNITT",
"input-token-code": 0,
"output-token-code": 17
}
]
},
{
"no": 2,
"score": 0.754167,
"rule_string": "0 1 2 17 0 -1 1 5 6 17 17",
"rule_tokens": [
{
"pos": 0,
"input-token": "NUMBER",
"output-token": "HOUSE",
"input-token-code": 0,
"output-token-code": 1
},
{
"pos": 1,
"input-token": "WORD",
"output-token": "STREET",
"input-token-code": 1,
"output-token-code": 5
},
{
"pos": 2,
"input-token": "TYPE",
"output-token": "SUFTYP",
"input-token-code": 2,
"output-token-code": 6
},
{
"pos": 3,
"input-token": "UNITT",
"output-token": "UNITT",
"input-token-code": 17,
"output-token-code": 17
},
{
"pos": 4,
"input-token": "NUMBER",
"output-token": "UNITT",
"input-token-code": 0,
"output-token-code": 17
}
]
},
{
"no": 3,
"score": 0.735937,
"rule_string": "0 1 2 17 0 -1 1 5 6 17 17",
"rule_tokens": [
{
"pos": 0,
"input-token": "NUMBER",
"output-token": "HOUSE",
"input-token-code": 0,
"output-token-code": 1
},
{
"pos": 1,
"input-token": "WORD",
"output-token": "STREET",
"input-token-code": 1,
"output-token-code": 5
},
{
"pos": 2,
"input-token": "TYPE",
"output-token": "SUFTYP",
"input-token-code": 2,
"output-token-code": 6
},
{
"pos": 3,
"input-token": "UNITT",
"output-token": "UNITT",
"input-token-code": 17,
"output-token-code": 17
},
{
"pos": 4,
"input-token": "NUMBER",
"output-token": "UNITT",
"input-token-code": 0,
"output-token-code": 17
}
]
},
{
"no": 4,
"score": 0.724167,
"rule_string": "0 1 2 17 0 -1 1 5 5 17 17",
"rule_tokens": [
{
"pos": 0,
"input-token": "NUMBER",
"output-token": "HOUSE",
"input-token-code": 0,
"output-token-code": 1
},
{
"pos": 1,
"input-token": "WORD",
"output-token": "STREET",
"input-token-code": 1,
"output-token-code": 5
},
{
"pos": 2,
"input-token": "TYPE",
"output-token": "STREET",
"input-token-code": 2,
"output-token-code": 5
},
{
"pos": 3,
"input-token": "UNITT",
"output-token": "UNITT",
"input-token-code": 17,
"output-token-code": 17
},
{
"pos": 4,
"input-token": "NUMBER",
"output-token": "UNITT",
"input-token-code": 0,
"output-token-code": 17
}
]
},
{
"no": 5,
"score": 0.70625,
"rule_string": "0 1 2 2 0 -1 1 5 5 6 17",
"rule_tokens": [
{
"pos": 0,
"input-token": "NUMBER",
"output-token": "HOUSE",
"input-token-code": 0,
"output-token-code": 1
},
{
"pos": 1,
"input-token": "WORD",
"output-token": "STREET",
"input-token-code": 1,
"output-token-code": 5
},
{
"pos": 2,
"input-token": "TYPE",
"output-token": "STREET",
"input-token-code": 2,
"output-token-code": 5
},
{
"pos": 3,
"input-token": "TYPE",
"output-token": "SUFTYP",
"input-token-code": 2,
"output-token-code": 6
},
{
"pos": 4,
"input-token": "NUMBER",
"output-token": "UNITT",
"input-token-code": 0,
"output-token-code": 17
}
]
}
],
"input_tokens": [
{
"pos": 0,
"token": "NUMBER",
"stdword": "1",
"token-code": 0
},
{
"pos": 0,
"token": "WORD",
"stdword": "1",
"token-code": 1
},
{
"pos": 1,
"token": "WORD",
"stdword": "DEVONSHIRE",
"token-code": 1
},
{
"pos": 2,
"token": "TYPE",
"stdword": "PLACE",
"token-code": 2
},
{
"pos": 3,
"token": "TYPE",
"stdword": "PATH",
"token-code": 2
},
{
"pos": 3,
"token": "UNITT",
"stdword": "PENTHOUSE",
"token-code": 17
},
{
"pos": 4,
"token": "NUMBER",
"stdword": "301",
"token-code": 0
}
]
}
}}}
Changes I need to make before I commit my work
1) Add an option for the single address version (this will be the same
function, but the last will be a default arg, which when left blank will
call parse address first (much like current one does when only one arg).
{{{
SELECT s::jsonb
FROM debug_standardize_address( 'us_lex', 'us_gaz','us_rules','One
Devonshire Pl PH 301, Boston, MA 02109') AS s;
}}}
2) Output the macro rules (Not sure why it isn't at moment. This is the
one that parses Boston, MA 02109 part
3) Define for each rule what kind of rule it is
4) Finish off the rulestring. Right now the rule_string is missing the
type of rule and the weight of the rule to allow it to be easily matched
up in the rule table. I also think there might be something wrong with my
looping thru the rules since some rules appear to be repeated.
5) Output the results much like what:
{{{
SELECT s.*
FROM standardize_address( 'us_lex', 'us_gaz','us_rules','One Devonshire Pl
PH 301', 'Boston, MA 02109') AS s;
}}}
This will be another array node in the document: where each element with
be a jsonified stdaddress
e.g.
{{{
"standardized_addresses": [
{"building":null,"house_num":"1","predir":null,"qual":null,"pretype":null,"name":"DEVONSHIRE","suftype":"PLACE","sufdir":null,"ruralroute":null,"extra":null,"city":"BOSTON","state":"MASSACHUSETTS","country":null,"postcode":"02109","box":null,"unit":"#
PENTHOUSE 301"} ]
}}}
I was thinking of skipping the nulls and eventually adding the lower
ranking addresses after.
--
Ticket URL: <https://trac.osgeo.org/postgis/ticket/5397>
PostGIS <http://trac.osgeo.org/postgis/>
The PostGIS Trac is used for bug, enhancement & task tracking, a user and developer wiki, and a view into the subversion code repository of PostGIS project.
More information about the postgis-tickets
mailing list