[postgis-users] address_standardizer - More detailed documentation on how rules work

Stephen Woodbridge stephenwoodbridge37 at gmail.com
Thu Mar 18 20:28:35 PDT 2021


Grant,

I'm not sure how much help I can be with this as it's been years since I 
ported the code.

My gut says to remove the lexicon entry for "FIRST" -> "1" as ORD, 
because it is probably a rarely used artifact, though I'm not sure that 
will help. At least try that and see if it works.
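
If you want to test that quickly in the database, something like this
should work (an untested sketch; it assumes your lexicon lives in a
table like us_lex from address_standardizer_data_us, and uses token
15 = ORD as shown in your trace below):

-- work on a copy so the shipped lexicon stays intact
CREATE TABLE my_lex AS SELECT * FROM us_lex;
-- drop the ORD interpretation of FIRST (token 15 in your trace)
DELETE FROM my_lex WHERE word = 'FIRST' AND token = 15;

and then point standardize_address() at my_lex instead of us_lex.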

And if the cmdline and the database don't match, that sounds like a bug, 
but I would make sure that both tests are using the same lexicon, 
gazetteer and rules. If it's a bug, it is unlikely to get fixed 
unless you find it and submit a pull request.
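
For example, you can pin the database test to explicit tables (a
sketch, assuming the us_lex/us_gaz/us_rules tables from
address_standardizer_data_us):

SELECT * FROM standardize_address(
    'us_lex', 'us_gaz', 'us_rules',
    '41 FIRST NATIONS TR', 'CALGARY');

and then compare that output field by field against a cmdline run
driven by the same lexicon, gazetteer and rules.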

My new address parser makes extensive use of regular expressions, and in 
fact turns the whole lexicon into a giant, highly optimized regular 
expression, but luckily the user doesn't have to deal with that directly 
as it's all done under the covers.

	(0) std: 41, tok: 0 (NUMBER)
	(1) std: FIRST, tok: 1 (WORD)
	(1) std: 1, tok: 15 (ORD)
	(2) std: NATIONS, tok: 1 (WORD)
	(3) std: TRAIL, tok: 2 (TYPE)

This means FIRST was categorized as both WORD and ORD, so you have two 
possible token streams to evaluate:

1) NUMBER WORD WORD TYPE
2) NUMBER ORD WORD TYPE

Both of these get processed through the rules and scored, and this should 
be the winning evaluation:

Raw standardization 0 with score 0.692500:
	(0) Input 0 (NUMBER) text 41 mapped to output 1 (HOUSE)
	(1) Input 1 (WORD) text FIRST mapped to output 5 (STREET)
	(2) Input 1 (WORD) text NATIONS mapped to output 5 (STREET)
	(3) Input 2 (TYPE) text TRAIL mapped to output 6 (SUFTYP)

> If I reverse the definition number for the lexicon, I get the correct result from the command line but continue to get the wrong result from the postgresql implementation.

Ahh, there is some caching of the lexicon, rules, etc. at the connection 
level. So if you make changes in postgresql, you have to close the 
connection and reopen it (i.e. exit to the command line and re-enter psql) 
when you are testing changes. You might be able to \c database, but if 
that doesn't work, exit psql and restart it to get a new connection. The 
reason for this is that it speeds things up during production work, 
but it's not obvious when you are testing and making changes.
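
So a test cycle in psql looks roughly like this (same assumed table
names as in the sketches above; "yourdb" is a placeholder):

-- tweak the lexicon on the current connection
DELETE FROM my_lex WHERE word = 'FIRST' AND token = 15;
-- reconnect so the cached lexicon/rules are discarded
\c yourdb
-- re-run the test on the fresh connection
SELECT * FROM standardize_address(
    'my_lex', 'us_gaz', 'us_rules',
    '41 FIRST NATIONS TR', 'CALGARY');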

Hope that helps,
   -Steve W

On 3/18/2021 1:49 PM, Grant Orr wrote:
> I realized where you were going with this, and because my implementation is entirely within the database as procedural code, I will likely use a regular expression approach: match on patterns and reorder them with expression groups in regexp_replace. I would apply those to the parsed result rather than try to preprocess the input.
>
> Something like
> select RR, regexp_replace(RR, '(ROUTE) ([0-9]+)($|\s)', '\2 \1')
> from (VALUES ('ROUTE 110'), ('120 ROUTE')) AS t (RR);
>
> I'm trying to use the parser to standardize user-entered address data that is of questionable quality.
>
> Right now I'm having one main issue with the parser that has me concerned.
>
> There appears to be a bug in the logic that uses the lexicon, where the parser presents the wrong value when applying a few rule combinations.
> The example is where I have FIRST in the lexicon coming back as "1" for type ORD and "FIRST" for WORD, yet in some rule scenarios it returns the ORD value even though the command line response indicates that it should not.
>
> Regardless of what the rules are, based on the output response from the pagc_stand function, it should be giving me "FIRST NATIONS" for the STREETNAME.
> This is consistent for both the command line response and the Postgres function call.
>
> If I reverse the definition number for the lexicon, I get the correct result from the command line but continue to get the wrong result from the postgresql implementation.
>
> (I am using the parser in the database but I use the command line interface to test where I have challenges)
> ------------------------------------------------------------------------------------------------
> Incorrect in both postgres and pagc_stand With Lexicon entries
> ------------------------------------------------------------------------------------------------
> "1","FIRST",1,"FIRST"
> "2","FIRST",15,"1"
>
> MICRO:41 FIRST NATIONS TR
> MACRO:CALGARY
> <SiteAddress><CompleteAddressNumber>41</CompleteAddressNumber>
> <CompleteStreetName><StreetName>1 NATIONS</StreetName>
> <PostType>TRAIL</PostType></CompleteStreetName> <PlaceName>CALGARY</PlaceName></SiteAddress>
>
> Input tokenization candidates:
> 	(0) std: 41, tok: 0 (NUMBER)
> 	(1) std: FIRST, tok: 1 (WORD)
> 	(1) std: 1, tok: 15 (ORD)
> 	(2) std: NATIONS, tok: 1 (WORD)
> 	(3) std: TRAIL, tok: 2 (TYPE)
> Raw standardization 0 with score 0.692500:
> 	(0) Input 0 (NUMBER) text 41 mapped to output 1 (HOUSE)
> 	(1) Input 1 (WORD) text FIRST mapped to output 5 (STREET)
> 	(2) Input 1 (WORD) text NATIONS mapped to output 5 (STREET)
> 	(3) Input 2 (TYPE) text TRAIL mapped to output 6 (SUFTYP)
> Raw standardization 1 with score 0.665000:
> 	(0) Input 0 (NUMBER) text 41 mapped to output 1 (HOUSE)
> 	(1) Input 1 (WORD) text FIRST mapped to output 5 (STREET)
> 	(2) Input 1 (WORD) text NATIONS mapped to output 5 (STREET)
> 	(3) Input 2 (TYPE) text TRAIL mapped to output 6 (SUFTYP)
> Raw standardization 2 with score 0.652500:
> 	(0) Input 0 (NUMBER) text 41 mapped to output 17 (UNITT)
> 	(1) Input 15 (ORD) text 1 mapped to output 1 (HOUSE)
> 	(2) Input 1 (WORD) text NATIONS mapped to output 5 (STREET)
> 	(3) Input 2 (TYPE) text TRAIL mapped to output 6 (SUFTYP)
>
> Rule 4 is of type 0 (MACRO)
> : Input : |1 (WORD)|
> Output: |10 (CITY)|
> rank 3 ( 0.375000): hits 1 out of 6
>
> Rule 308 is of type 1 (MICRO)
> : Input : |0 (NUMBER)||1 (WORD)||2 (TYPE)|
> Output: |1 (HOUSE)||5 (STREET)||6 (SUFTYP)|
> rank 10 ( 0.700000): hits 1 out of 6
>
> Rule 332 is of type 2 (ARC)
> : Input : |1 (WORD)||2 (TYPE)|
> Output: |5 (STREET)||6 (SUFTYP)|
> rank 10 ( 0.700000): hits 2 out of 6
>
> Rule 350 is of type 3 (CIVIC)
> : Input : |0 (NUMBER)|
> Output: |1 (HOUSE)|
> rank 15 ( 0.900000): hits 1 out of 6
>
> Rule 354 is of type 3 (CIVIC)
> : Input : |0 (NUMBER)||15 (ORD)|
> Output: |17 (UNITT)||1 (HOUSE)|
> rank 12 ( 0.800000): hits 1 out of 6
> Found 5 rules hit
>
> ------------------------------------------------------------------------------------------------
> Correct in pagc_stand and Incorrect in postgres With Lexicon entries
> ------------------------------------------------------------------------------------------------
> "1","FIRST",15,"1"
> "2","FIRST",1,"FIRST"
>
> MICRO:41 FIRST NATIONS TR
> MACRO:CALGARY
> <SiteAddress><CompleteAddressNumber>41</CompleteAddressNumber>
> <CompleteStreetName><StreetName>FIRST NATIONS</StreetName>
> <PostType>TRAIL</PostType></CompleteStreetName> <PlaceName>CALGARY</PlaceName></SiteAddress>
>
> Input tokenization candidates:
> 	(0) std: 41, tok: 0 (NUMBER)
> 	(1) std: 1, tok: 15 (ORD)
> 	(1) std: FIRST, tok: 1 (WORD)
> 	(2) std: NATIONS, tok: 1 (WORD)
> 	(3) std: TRAIL, tok: 2 (TYPE)
> Raw standardization 0 with score 0.692500:
> 	(0) Input 0 (NUMBER) text 41 mapped to output 1 (HOUSE)
> 	(1) Input 1 (WORD) text FIRST mapped to output 5 (STREET)
> 	(2) Input 1 (WORD) text NATIONS mapped to output 5 (STREET)
> 	(3) Input 2 (TYPE) text TRAIL mapped to output 6 (SUFTYP)
> Raw standardization 1 with score 0.665000:
> 	(0) Input 0 (NUMBER) text 41 mapped to output 1 (HOUSE)
> 	(1) Input 1 (WORD) text FIRST mapped to output 5 (STREET)
> 	(2) Input 1 (WORD) text NATIONS mapped to output 5 (STREET)
> 	(3) Input 2 (TYPE) text TRAIL mapped to output 6 (SUFTYP)
> Raw standardization 2 with score 0.652500:
> 	(0) Input 0 (NUMBER) text 41 mapped to output 17 (UNITT)
> 	(1) Input 15 (ORD) text 1 mapped to output 1 (HOUSE)
> 	(2) Input 1 (WORD) text NATIONS mapped to output 5 (STREET)
> 	(3) Input 2 (TYPE) text TRAIL mapped to output 6 (SUFTYP)
>
> Rule 4 is of type 0 (MACRO)
> : Input : |1 (WORD)|
> Output: |10 (CITY)|
> rank 3 ( 0.375000): hits 1 out of 6
>
> Rule 308 is of type 1 (MICRO)
> : Input : |0 (NUMBER)||1 (WORD)||2 (TYPE)|
> Output: |1 (HOUSE)||5 (STREET)||6 (SUFTYP)|
> rank 10 ( 0.700000): hits 1 out of 6
>
> Rule 332 is of type 2 (ARC)
> : Input : |1 (WORD)||2 (TYPE)|
> Output: |5 (STREET)||6 (SUFTYP)|
> rank 10 ( 0.700000): hits 2 out of 6
>
> Rule 350 is of type 3 (CIVIC)
> : Input : |0 (NUMBER)|
> Output: |1 (HOUSE)|
> rank 15 ( 0.900000): hits 1 out of 6
>
> Rule 354 is of type 3 (CIVIC)
> : Input : |0 (NUMBER)||15 (ORD)|
> Output: |17 (UNITT)||1 (HOUSE)|
> rank 12 ( 0.800000): hits 1 out of 6
> Found 5 rules hit
>
> Any thoughts?
>
>
>
> On 2021-03-18, 8:12 AM, "Stephen Woodbridge" <stephenwoodbridge37 at gmail.com> wrote:
>
>      If your problem is just "11 RTE" and not the general class of "<num>
>      RTE", then just add "11 RTE" to the lexicon as a street type; that
>      will fix the problem and bypass the rules issues. Or if you only have a
>      few <num>s to deal with, add them all.
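>
>      For example, in the lexicon CSV layout (seq, word, token, stdword;
>      token 2 is the street TYPE token), such an entry might look like
>      this, where the seq number and the standardized form are
>      hypothetical:
>
>      "1","11 RTE",2,"ROUTE 11"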
>
>      Did you try using the rules2txt and txt2rules scripts I linked to? These
>      convert the numbers into human readable text so you can understand what
>      the rules are saying.
>
>      You could add a rule like: <num> <num> <type> -> <house> <name>
>      <suftype> <score>
>      (token names are not correct, but you should get the idea)
>      and then play with the <score> value, but as I said this will impact
>      other rules as you increase its value to force it into play over other
>      rules and potentially cause side effects that are not desirable.
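>
>      As I recall, a raw rule is a single line of numbers: the input token
>      codes, -1, the output token codes, -1, the rule type, then the rank.
>      So a hypothetical MICRO rule for <num> <num> <type> -> <house>
>      <street> <suftype> might look like this (input NUMBER=0, TYPE=2;
>      output HOUSE=1, STREET=5, SUFTYP=6; type 1 = MICRO; rank 10 is the
>      score you would tune):
>
>      0 0 2 -1 1 5 6 -1 1 10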
>
>      -Steve
>
>      On 3/18/2021 4:26 AM, Grant Orr wrote:
>      > Thanks Stephen.
>      >
>      > I've been finding a couple of bugs and I've been trying to figure out if it is just my understanding of the functionality or an issue in the code.
>      > I haven't touched C in a long time, and I'm principally using this in an AWS RDS instance, so there is little opportunity to address them anyway.
>      >
>      > I appreciate the feedback.
>      >
>      > Grant
>      >
>      > On 2021-03-15, 8:05 PM, "Stephen Woodbridge" <stephenwoodbridge37 at gmail.com> wrote:
>      >
>      >      Hi Grant,
>      >
>      >      I ported the address standardizer from PAGC, which is no longer
>      >      supported. It is VERY confusing, so much so that I could not understand
>      >      it well enough to make changes myself; the code is very opaque and
>      >      very hard (impossible?) to follow. That is why I wrote a new address
>      >      standardizer in C++ designed to be easy to change. I think it
>      >      is unlikely that you will find anyone who can answer your very good
>      >      questions.
>      >
>      >      Here is what I remember about the existing code: each rule has a weight
>      >      associated with it. When a string of tokens is compared to a rule, the
>      >      weights are applied to calculate an overall weight for the string of
>      >      tokens. Remember, an address == a string of tokens, which might match
>      >      multiple different collections of rules. Each set of rule matches
>      >      provides a score for that combination of rules, and the best-scored rule
>      >      combination wins.
>      >
>      >      But, and this is the unstable part, if you change the rule weights to
>      >      solve one address, you might have ALSO broken it for another address. So
>      >      it is usually safer to make changes to the lexicon or gazetteer to make
>      >      the address tokens fall into the rule that you want, rather than
>      >      changing the rules.
>      >
>      >      My address standardizer has a similar problem, but it is much easier to
>      >      work with.
>      >
>      >      When I did this in the past, I would load my reference addresses into a
>      >      table, say "raw", and then standardize them into "standard", then look at
>      >      all the records that failed to standardize and try to abstract them into
>      >      classes of failures, then make some small tweaks to the lexicon,
>      >      gazetteer or rules, and repeat the standardization process, and do this
>      >      until I was happy, which meant getting to 95+% standardized and ignoring
>      >      the rest in some cases. Your tolerance might be different, but you might
>      >      find that getting the last few percent is hard, as the changes to fix
>      >      one problem cause other problems.
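>      >
>      >      In SQL that loop might look something like this (a sketch; the
>      >      "raw" table and its columns are made up for illustration):
>      >
>      >      -- standardize every raw address into a review table
>      >      CREATE TABLE standard AS
>      >      SELECT r.id, r.address, r.city,
>      >             (standardize_address('us_lex', 'us_gaz', 'us_rules',
>      >                                  r.address, r.city)).*
>      >      FROM raw r;
>      >
>      >      -- then pull out the failures and look for classes of them
>      >      SELECT * FROM standard WHERE name IS NULL;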
>      >
>      >      Sorry I can't be more helpful, but it's been 6-7 years since I used it, and
>      >      mostly I use my new code because of the above issues with that code.
>      >
>      >      Best regards,
>      >         -Steve W
>      >
>      >      On 3/15/2021 5:25 PM, Grant Orr wrote:
>      >      >
>      >      > I’m looking for more detailed documentation on how rules work for the
>      >      > address_standardizer
>      >      >
>      >      >   * Which rule types take precedence over which rule types
>      >      >       o Micro precedes ARC, etc.
>      >      >   * How the rules that have overlapping attributes are resolved
>      >      >       o Address NUMBER NUMBER WORD SUFTYPE
>      >      >       o Rules  :   (ARC) NUMBER WORD SUFTYPE, (CIV) NUMBER NUMBER,
>      >      >         (CIV) NUMBER , (EXTRA) NUMBER
>      >      >       o How does the parser know which rules to apply? How does Rank
>      >      >         come into this?
>      >      >   * Where does the RAW standardization come from?  Why does it appear
>      >      >     to supersede everything except MICRO?
>      >      >
>      >      > I've been trying to figure this out with the existing documentation
>      >      > and a lot of trial and error, but it is challenging.
>      >      >
>      >      > Any help is appreciated
>      >      >
>      >      >
>      >
>
>

