[postgis-users] address_standardizer - More detailed documentation on how rules work

Mon Mar 15 19:05:03 PDT 2021

Hi Grant,

I ported the address standardizer from PAGC, which is no longer 
supported. It is VERY confusing, so much so that I could not understand 
it well enough to make changes myself, and the code is very opaque and 
very hard (impossible?) to follow. This is why I wrote a new address 
standardizer in C++ designed to be easy to change. I think it 
is unlikely that you will find anyone who can answer your very good 
questions.

Here is what I remember about the existing code: each rule has a weight 
associated with it. When a string of tokens is compared to a rule, the 
weights are applied to calculate an overall weight for the string of 
tokens. Remember that an address == a string of tokens, which might 
match multiple different collections of rules. So each set of rule 
matches provides a score for that combination of rules, and the 
best-scoring rule combination wins.
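
As a rough illustration of that scoring idea, here is a toy sketch in 
Python. It is NOT the PAGC algorithm: the token string and rule patterns 
are taken from Grant's question quoted below, but the weights are 
invented, and summing the weights of a combination is only an assumption 
about how an overall score might be formed.

```python
from typing import Dict, Iterator, List, Sequence, Tuple

Rule = Tuple[str, Tuple[str, ...]]   # (rule type, token pattern)
Match = Tuple[str, Tuple[str, ...], float]

# Hypothetical weights; the real rule tables use their own ranking values.
RULES: Dict[Rule, float] = {
    ("ARC", ("NUMBER", "WORD", "SUFTYPE")): 0.9,
    ("CIV", ("NUMBER", "NUMBER")): 0.4,
    ("CIV", ("NUMBER",)): 0.7,
    ("EXTRA", ("NUMBER",)): 0.3,
}


def combinations(tokens: Sequence[str]) -> Iterator[List[Match]]:
    """Yield every way to cover `tokens` left to right with rule patterns."""
    if not tokens:
        yield []
        return
    for (kind, pattern), weight in RULES.items():
        n = len(pattern)
        if tuple(tokens[:n]) == pattern:
            for rest in combinations(tokens[n:]):
                yield [(kind, pattern, weight)] + rest


def score(combo: List[Match]) -> float:
    # Assumption: the overall score is the sum of the matched rule weights.
    return sum(weight for _, _, weight in combo)


def best(tokens: Sequence[str]) -> List[Match]:
    """Return the highest-scoring combination, or [] if nothing covers tokens."""
    return max(combinations(tokens), key=score, default=[])


address = ["NUMBER", "NUMBER", "WORD", "SUFTYPE"]
for combo in combinations(address):
    print([kind for kind, _, _ in combo], round(score(combo), 2))
print("winner:", [kind for kind, _, _ in best(address)])
```

With these made-up weights only two combinations cover the whole token 
string, (CIV) NUMBER + (ARC) NUMBER WORD SUFTYPE and (EXTRA) NUMBER + 
(ARC) NUMBER WORD SUFTYPE, and the first one wins because its weights 
add up to more.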

But, and this is the unstable part, if you change the rule weights to 
solve one address, you might ALSO have broken it for another address. So 
it is usually safer to make changes to the lexicon or gazetteer so that 
the address tokens fall into the rule that you want, rather than 
changing the rules.
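
One way to catch that kind of breakage is to keep the results from 
before a change and compare them with the results after it. The sketch 
below is only an illustration: the function name, the dictionary layout 
and the sample addresses are all hypothetical, and it assumes you have 
already exported each run as a mapping from raw address to standardized 
output.

```python
from typing import Dict, Optional, Tuple


def changed_addresses(
    before: Dict[str, Optional[str]],
    after: Dict[str, Optional[str]],
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {address: (old, new)} for every address whose result differs."""
    return {
        addr: (before.get(addr), after.get(addr))
        for addr in sorted(set(before) | set(after))
        if before.get(addr) != after.get(addr)
    }


# Made-up sample data: None means the address failed to standardize.
before = {"12 3 MAIN ST": None, "45 OAK AVE": "45|OAK|AVE"}
after = {"12 3 MAIN ST": "12|3 MAIN|ST", "45 OAK AVE": None}

for addr, (old, new) in changed_addresses(before, after).items():
    print(addr, ":", old, "->", new)
```

Here the change fixed the first address but broke the second, which is 
exactly the situation you want to see before keeping a tweak.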

My address standardizer has a similar problem, but it is much easier to 
work with.

When I did this in the past, I would load my reference addresses into a 
table, say "raw", and standardize them into a table "standard". Then I 
would look at all the records that failed to standardize and try to 
abstract them into classes of failures, make some small tweaks to the 
lexicon, gazetteer or rules, and repeat the standardization process. I 
did this until I was happy, which in some cases meant getting 95+% of 
the records to standardize and ignoring the rest. Your tolerance might 
be different, but you might find that getting the last few percent is 
hard, as the changes that fix one problem cause other problems.
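
The bookkeeping for that loop can be sketched roughly as follows. This 
is a hypothetical helper, not something that ships with the 
standardizer: `standardize` stands for whatever you call to standardize 
one address (assumed to return None on failure), and grouping failures 
by a crude number/word "shape" is just one assumed way to abstract them 
into classes.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple


def shape(address: str) -> str:
    """Crude failure class: N for an all-digit token, W for anything else."""
    return " ".join("N" if tok.isdigit() else "W" for tok in address.split())


def report(
    raw: Sequence[str],
    standardize: Callable[[str], Optional[str]],
) -> Tuple[float, List[Tuple[str, int]]]:
    """Return (success rate, failure classes sorted by frequency)."""
    failures = [addr for addr in raw if standardize(addr) is None]
    rate = 1 - len(failures) / len(raw) if raw else 0.0
    return rate, Counter(shape(addr) for addr in failures).most_common()


# Stand-in standardizer for the example: pretend PO boxes always fail.
def fake_standardize(addr: str) -> Optional[str]:
    return None if addr.startswith("PO ") else addr.title()


raw = ["123 MAIN ST", "PO BOX 12", "PO BOX 99", "45 OAK AVE"]
rate, classes = report(raw, fake_standardize)
print(f"standardized: {rate:.0%}")
for cls, count in classes:
    print(count, cls)
```

After each tweak you would rerun the report, look at the largest failure 
class first, and stop once the rate reaches whatever level you can live 
with.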

Sorry I can't be more helpful, but it's been 6-7 years since I used it, 
and I mostly used my new code because of the issues described above.

Best regards,
   -Steve W

On 3/15/2021 5:25 PM, Grant Orr wrote:
>
> I’m looking for more detailed documentation on how rules work for the 
> address_standardizer
>
>   * Which rule types take precedence over which rule types
>       o Micro precedes ARC, etc..
>   * How the rules that have overlapping attributes are resolved
>       o Address NUMBER NUMBER WORD SUFTYPE
>       o Rules  :   (ARC) NUMBER WORD SUFTYPE, (CIV) NUMBER NUMBER,
>         (CIV) NUMBER , (EXTRA) NUMBER
>       o How does the parser know which rules to apply? How does Rank
>         come into this?
>   * Where does the RAW standardization come from?  Why does it appear
>     to supersede everything except MICRO?
>
> I’ve been trying to figure this out with the existing documentation 
> and a lot of trial and error but it is challenging
>
> Any help is appreciated
