[postgis-users] address_standardizer - More detailed documentation on how rules work

Grant Orr Grant.Orr at sjrb.ca
Thu Mar 18 10:49:56 PDT 2021


I realized where you were going with this. Because my implementation lives entirely within the database as procedural code, I will likely use a regular-expression approach: match on patterns and reorder via capture groups with regexp_replace, applied against the parsed result rather than trying to preprocess the input.

Something like:

select RR, regexp_replace(RR, '(ROUTE) ([0-9]+)($|\s)', '\2 \1\3')
from (VALUES ('ROUTE 110'), ('120 ROUTE')) AS t (RR);
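The pattern can be sanity-checked outside the database before embedding it in SQL. Here is a quick sketch using Python's re module (whose syntax agrees with Postgres for this pattern); the replacement keeps the trailing separator as \3 so a mid-string match does not swallow its following space:

```python
import re

# Swap "ROUTE <num>" to "<num> ROUTE"; inputs already in
# number-first order do not match and pass through unchanged.
pattern = r"(ROUTE) ([0-9]+)($|\s)"

for rr in ["ROUTE 110", "120 ROUTE"]:
    print(re.sub(pattern, r"\2 \1\3", rr))
```

This prints "110 ROUTE" and then "120 ROUTE", i.e. only the ROUTE-first form is reordered.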

I'm trying to use the parser to standardize user-entered address data of questionable quality.

Right now I'm having one main issue with the parser that has me concerned.

There appears to be a bug in the logic that uses the lexicon: the parser presents the wrong value when applying certain rule combinations.
For example, I have FIRST in the lexicon defined both as type ORD (standard word "1") and as type WORD (standard word "FIRST"), yet in some rule scenarios the parser returns the ORD value even though the command-line response indicates that it is not being used.

Regardless of what the rules are, based on the output from the pagc_stand function, it should be giving me "FIRST NATIONS" for the STREETNAME.
This is consistent across both the command-line response and the Postgres function call.

If I reverse the definition numbers for the lexicon entries, I get the correct result from the command line but continue to get the wrong result from the PostgreSQL implementation.

(I use the parser in the database, but I test with the command-line interface when I run into challenges.)
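Until the lexicon behavior is sorted out, the same post-processing idea can paper over this particular misparse: when the standardizer returns a street name that begins with a bare ordinal number followed by another word (as in "1 NATIONS"), map the number back to its spelled-out form. A hypothetical Python sketch; the ORD_WORDS table and the trigger condition are my own assumptions, not part of the address_standardizer:

```python
import re

# Hypothetical reverse mapping from ORD standardizations back to words.
ORD_WORDS = {"1": "FIRST", "2": "SECOND", "3": "THIRD"}

def restore_ordinal_word(street_name: str) -> str:
    """If a parsed street name starts with a bare number followed by
    another word (e.g. '1 NATIONS'), restore the spelled-out word."""
    m = re.match(r"^(\d+) (\w.*)$", street_name)
    if m and m.group(1) in ORD_WORDS:
        return f"{ORD_WORDS[m.group(1)]} {m.group(2)}"
    return street_name

print(restore_ordinal_word("1 NATIONS"))   # FIRST NATIONS
print(restore_ordinal_word("41 AVENUE"))   # 41 AVENUE (number not in the mapping, unchanged)
```

Note that this would also rewrite a legitimate "1 <word>" street name, so the trigger condition would need tightening against real data before relying on it.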
------------------------------------------------------------------------------------------------
Incorrect in both Postgres and pagc_stand, with lexicon entries:
------------------------------------------------------------------------------------------------
"1","FIRST",1,"FIRST"
"2","FIRST",15,"1"

MICRO:41 FIRST NATIONS TR 
MACRO:CALGARY
<SiteAddress><CompleteAddressNumber>41</CompleteAddressNumber>
<CompleteStreetName><StreetName>1 NATIONS</StreetName>
<PostType>TRAIL</PostType></CompleteStreetName> <PlaceName>CALGARY</PlaceName></SiteAddress>

Input tokenization candidates:
	(0) std: 41, tok: 0 (NUMBER)
	(1) std: FIRST, tok: 1 (WORD)
	(1) std: 1, tok: 15 (ORD)
	(2) std: NATIONS, tok: 1 (WORD)
	(3) std: TRAIL, tok: 2 (TYPE)
Raw standardization 0 with score 0.692500:
	(0) Input 0 (NUMBER) text 41 mapped to output 1 (HOUSE)
	(1) Input 1 (WORD) text FIRST mapped to output 5 (STREET)
	(2) Input 1 (WORD) text NATIONS mapped to output 5 (STREET)
	(3) Input 2 (TYPE) text TRAIL mapped to output 6 (SUFTYP)
Raw standardization 1 with score 0.665000:
	(0) Input 0 (NUMBER) text 41 mapped to output 1 (HOUSE)
	(1) Input 1 (WORD) text FIRST mapped to output 5 (STREET)
	(2) Input 1 (WORD) text NATIONS mapped to output 5 (STREET)
	(3) Input 2 (TYPE) text TRAIL mapped to output 6 (SUFTYP)
Raw standardization 2 with score 0.652500:
	(0) Input 0 (NUMBER) text 41 mapped to output 17 (UNITT)
	(1) Input 15 (ORD) text 1 mapped to output 1 (HOUSE)
	(2) Input 1 (WORD) text NATIONS mapped to output 5 (STREET)
	(3) Input 2 (TYPE) text TRAIL mapped to output 6 (SUFTYP)

Rule 4 is of type 0 (MACRO)
: Input : |1 (WORD)|
Output: |10 (CITY)|
rank 3 ( 0.375000): hits 1 out of 6

Rule 308 is of type 1 (MICRO)
: Input : |0 (NUMBER)||1 (WORD)||2 (TYPE)|
Output: |1 (HOUSE)||5 (STREET)||6 (SUFTYP)|
rank 10 ( 0.700000): hits 1 out of 6

Rule 332 is of type 2 (ARC)
: Input : |1 (WORD)||2 (TYPE)|
Output: |5 (STREET)||6 (SUFTYP)|
rank 10 ( 0.700000): hits 2 out of 6

Rule 350 is of type 3 (CIVIC)
: Input : |0 (NUMBER)|
Output: |1 (HOUSE)|
rank 15 ( 0.900000): hits 1 out of 6

Rule 354 is of type 3 (CIVIC)
: Input : |0 (NUMBER)||15 (ORD)|
Output: |17 (UNITT)||1 (HOUSE)|
rank 12 ( 0.800000): hits 1 out of 6
Found 5 rules hit

------------------------------------------------------------------------------------------------
Correct in pagc_stand and incorrect in Postgres, with lexicon entries:
------------------------------------------------------------------------------------------------
"1","FIRST",15,"1"
"2","FIRST",1,"FIRST"

MICRO:41 FIRST NATIONS TR
MACRO:CALGARY
<SiteAddress><CompleteAddressNumber>41</CompleteAddressNumber>
<CompleteStreetName><StreetName>FIRST NATIONS</StreetName>
<PostType>TRAIL</PostType></CompleteStreetName> <PlaceName>CALGARY</PlaceName></SiteAddress>

Input tokenization candidates:
	(0) std: 41, tok: 0 (NUMBER)
	(1) std: 1, tok: 15 (ORD)
	(1) std: FIRST, tok: 1 (WORD)
	(2) std: NATIONS, tok: 1 (WORD)
	(3) std: TRAIL, tok: 2 (TYPE)
Raw standardization 0 with score 0.692500:
	(0) Input 0 (NUMBER) text 41 mapped to output 1 (HOUSE)
	(1) Input 1 (WORD) text FIRST mapped to output 5 (STREET)
	(2) Input 1 (WORD) text NATIONS mapped to output 5 (STREET)
	(3) Input 2 (TYPE) text TRAIL mapped to output 6 (SUFTYP)
Raw standardization 1 with score 0.665000:
	(0) Input 0 (NUMBER) text 41 mapped to output 1 (HOUSE)
	(1) Input 1 (WORD) text FIRST mapped to output 5 (STREET)
	(2) Input 1 (WORD) text NATIONS mapped to output 5 (STREET)
	(3) Input 2 (TYPE) text TRAIL mapped to output 6 (SUFTYP)
Raw standardization 2 with score 0.652500:
	(0) Input 0 (NUMBER) text 41 mapped to output 17 (UNITT)
	(1) Input 15 (ORD) text 1 mapped to output 1 (HOUSE)
	(2) Input 1 (WORD) text NATIONS mapped to output 5 (STREET)
	(3) Input 2 (TYPE) text TRAIL mapped to output 6 (SUFTYP)

Rule 4 is of type 0 (MACRO)
: Input : |1 (WORD)|
Output: |10 (CITY)|
rank 3 ( 0.375000): hits 1 out of 6

Rule 308 is of type 1 (MICRO)
: Input : |0 (NUMBER)||1 (WORD)||2 (TYPE)|
Output: |1 (HOUSE)||5 (STREET)||6 (SUFTYP)|
rank 10 ( 0.700000): hits 1 out of 6

Rule 332 is of type 2 (ARC)
: Input : |1 (WORD)||2 (TYPE)|
Output: |5 (STREET)||6 (SUFTYP)|
rank 10 ( 0.700000): hits 2 out of 6

Rule 350 is of type 3 (CIVIC)
: Input : |0 (NUMBER)|
Output: |1 (HOUSE)|
rank 15 ( 0.900000): hits 1 out of 6

Rule 354 is of type 3 (CIVIC)
: Input : |0 (NUMBER)||15 (ORD)|
Output: |17 (UNITT)||1 (HOUSE)|
rank 12 ( 0.800000): hits 1 out of 6
Found 5 rules hit

Any thoughts?



On 2021-03-18, 8:12 AM, "Stephen Woodbridge" <stephenwoodbridge37 at gmail.com> wrote:

      CAUTION: This email is from an external source. Do not click links or open attachments unless you recognize the sender and know the content is safe.

    If your problem is just "11 RTE" and not the general class of "<num>
    RTE", then just add "11 RTE" to the lexicon as a street type and that
    will fix the problem and bypass the rules issues. Or if you only have a
    few <num>s to deal with, add them all.

    Did you try using the rules2txt and txt2rules scripts I linked to? These
    convert the numbers into human-readable text so you can understand what
    the rules are saying.

    You could add a rule like: <num> <num> <type> -> <house> <name>
    <suftype> <score>
    (token names are not correct, but you should get the idea)
    and then play with the <score> value. But as I said, increasing its
    value to force this rule into play over other rules will impact those
    rules and can potentially cause undesirable side effects.

    -Steve

    On 3/18/2021 4:26 AM, Grant Orr wrote:
    > Thanks Stephen.
    >
    > I've been finding a couple of bugs and I've been trying to figure out if it is just my understanding of the functionality or an issue in the code.
    > I haven't touched C in a long time, and I'm principally using this in an AWS RDS instance, so there is little opportunity to address them anyway.
    >
    > I appreciate the feedback.
    >
    > Grant
    >
    > On 2021-03-15, 8:05 PM, "Stephen Woodbridge" <stephenwoodbridge37 at gmail.com> wrote:
    >
    >
    >      Hi Grant,
    >
    >      I ported the address standardizer from PAGC, which is no longer
    >      supported. It is VERY confusing, so much so that I could not understand
    >      it well enough to make changes myself; the code is very opaque and
    >      very hard (impossible?) to follow. This is why I wrote a new address
    >      standardizer in C++, designed to be easy to change. I think it
    >      is unlikely that you will find anyone who can answer your very good
    >      questions.
    >
    >      Here is what I remember about the existing code: each rule has a weight
    >      associated with it. When a string of tokens is compared to a rule, the
    >      weights are applied to calculate an overall weight for the string of
    >      tokens. Remember, an address == a string of tokens, which might match
    >      multiple different collections of rules. Each set of rule matches
    >      provides a score for that combination of rules, and the best-scored
    >      rule combination wins.
    >
    >      But, and this is the unstable part, if you change the rule weights to
    >      solve one address, you might have ALSO broken it for another address. So
    >      it is usually safer to make changes to the lexicon or gazetteer to make
    >      the address tokens fall into the rule that you want, rather than
    >      changing the rules.
    >
    >      My address standardizer has a similar problem, but it is much easier to
    >      work with.
    >
    >      When I did this in the past, I would load my reference addresses into a
    >      table, say "raw", and then standardize them into "standard"; then look at
    >      all the records that failed to standardize and try to abstract them into
    >      classes of failures; then make some small tweaks to the lexicon,
    >      gazetteer, or rules, and repeat the standardization process. I did this
    >      until I was happy, which in some cases meant getting to 95+% standardized
    >      and ignoring the rest. Your tolerance might be different, but you might
    >      find that getting the last few percent is hard, as the changes to fix
    >      one problem cause other problems.
    >
    >      Sorry I can't be more helpful, but it's been 6-7 years since I used it,
    >      and mostly I used my new code because of the above issues with that code.
    >
    >      Best regards,
    >         -Steve W
    >
    >      On 3/15/2021 5:25 PM, Grant Orr wrote:
    >      >
    >      > I’m looking for more detailed documentation on how rules work for the
    >      > address_standardizer
    >      >
    >      >   * Which rule types take precedence over which rule types
    >      >       o MICRO precedes ARC, etc.
    >      >   * How the rules that have overlapping attributes are resolved
    >      >       o Address NUMBER NUMBER WORD SUFTYPE
    >      >       o Rules  :   (ARC) NUMBER WORD SUFTYPE, (CIV) NUMBER NUMBER,
    >      >         (CIV) NUMBER , (EXTRA) NUMBER
    >      >       o How does the parser know which rules to apply? How does Rank
    >      >         come into this?
    >      >   * Where does the RAW standardization come from?  Why does it appear
    >      >     to supersede everything except MICRO?
    >      >
    >      > I’ve been trying to figure this out with the existing documentation
    >      > and a lot of trial and error but it is challenging
    >      >
    >      > Any help is appreciated
    >      >
    >      >
    >      > _______________________________________________
    >      > postgis-users mailing list
    >      > postgis-users at lists.osgeo.org
    >      > https://lists.osgeo.org/mailman/listinfo/postgis-users
    >
    >
    >      --
    >      This email has been checked for viruses by Avast antivirus software.
    >      https://www.avast.com/antivirus
    >
    >





