[postgis-users] address_standardizer - More detailed documentation on how rules work

Grant Orr Grant.Orr at sjrb.ca
Thu Mar 18 01:26:16 PDT 2021


Thanks Stephen. 

I've been finding a couple of bugs and I've been trying to figure out if it is just my understanding of the functionality or an issue in the code.  
I haven't touched C in a long time and I'm principally using this in an AWS RDS instance so there is little opportunity to address them anyways.

I appreciate the feedback.  

Grant

On 2021-03-15, 8:05 PM, "Stephen Woodbridge" <stephenwoodbridge37 at gmail.com> wrote:


    Hi Grant,

    I ported the address standardizer from PAGC, which is no longer
    supported. It is VERY confusing, so much so that I could not understand
    it well enough to make changes myself, and the code is very opaque and
    very hard (impossible?) to follow. This was why I wrote a new address
    standardizer in C++ designed to be easy to make changes to. I think it
    is unlikely that you will find anyone who can answer your very good
    questions.

    Here is what I remember about the existing code: each rule has a weight
    associated with it. When a string of tokens is compared to a rule, the
    weights are applied to calculate an overall weight for the string of
    tokens. Remember that an address is a string of tokens, which might
    match multiple different collections of rules. So each set of rule
    matches provides a score for that combination of rules, and the
    best-scoring rule combination wins.
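A toy sketch of the scoring idea described above, in Python. It is not the PAGC algorithm: the weights are invented, combining weights by product is an assumption, and `combinations` and `best` are hypothetical helper names. The token pattern and the four rules are the ones from the question quoted further down.

```python
from math import prod

# (rule type, token pattern, weight). The weights are made up for illustration.
RULES = [
    ("ARC", ("NUMBER", "WORD", "SUFTYPE"), 0.7),
    ("CIV", ("NUMBER", "NUMBER"), 0.5),
    ("CIV", ("NUMBER",), 0.9),
    ("EXTRA", ("NUMBER",), 0.3),
]

def combinations(tokens, rules):
    """Yield every way to cover tokens left to right with consecutive rule matches."""
    if not tokens:
        yield []
        return
    for rule in rules:
        pattern = rule[1]
        n = len(pattern)
        if tuple(tokens[:n]) == pattern:
            for rest in combinations(tokens[n:], rules):
                yield [rule] + rest

def best(tokens, rules):
    """Return the highest-scoring combination, or None if nothing covers the tokens.

    Assumption: the score of a combination is the product of its rule weights.
    """
    return max(
        combinations(tokens, rules),
        key=lambda combo: prod(w for _, _, w in combo),
        default=None,
    )

tokens = ["NUMBER", "NUMBER", "WORD", "SUFTYPE"]

# (CIV) NUMBER NUMBER cannot be completed here, because no rule in this toy
# set covers the remaining WORD SUFTYPE. That leaves CIV+ARC and EXTRA+ARC.
print([rule_type for rule_type, _, _ in best(tokens, RULES)])

# Raising one weight flips the winner for this address, and would do the same
# for every other address that can use the EXTRA rule.
tweaked = [r if r[0] != "EXTRA" else ("EXTRA", r[1], 0.95) for r in RULES]
print([rule_type for rule_type, _, _ in best(tokens, tweaked)])
```

With these made-up weights the first call picks CIV followed by ARC, and the tweaked rule set picks EXTRA followed by ARC.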

    But, and this is the unstable part, if you change the rule weights to
    solve one address, you might have ALSO broken it for another address. So
    it is usually safer to make changes to the lexicon or gazetteer to make
    the address tokens fall into the rule that you want, rather than
    changing the rules.

    My address standardizer has a similar problem, but it is much easier to
    work with.

    When I did this in the past, I would load my reference addresses into a
    table, say "raw", and standardize them into "standard". Then I would
    look at all the records that failed to standardize and try to abstract
    them into classes of failures, make some small tweaks to the lexicon,
    gazetteer or rules, and repeat the standardization process. I did this
    until I was happy, which meant getting to 95+% standardizing and
    ignoring the rest in some cases. Your tolerance might be different, but
    you might find that getting the last few percent is hard, as the changes
    to fix one problem cause other problems.
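One pass of that loop might be sketched like this in Python. Everything here is hypothetical: `toy_standardize` stands in for whatever actually calls the standardizer and returns None on failure, and `shape` is just one crude way to group failed records into classes.

```python
from collections import Counter

def shape(address):
    """Crude failure class: numeric tokens become N, alphabetic tokens become W."""
    parts = []
    for tok in address.split():
        if tok.isdigit():
            parts.append("N")
        elif tok.isalpha():
            parts.append("W")
        else:
            parts.append("?")
    return " ".join(parts)

def review(raw, standardize):
    """Return (success rate, Counter of failure shapes) for one standardization pass."""
    failures = [a for a in raw if standardize(a) is None]
    rate = 1 - len(failures) / len(raw) if raw else 0.0
    return rate, Counter(shape(a) for a in failures)

def toy_standardize(address):
    """Stand-in standardizer: only 'number word word' addresses succeed."""
    return address.upper() if shape(address) == "N W W" else None

raw = ["123 Main St", "45 Oak Ave", "Unit 5 12 Elm Rd", "PO Box 99"]
rate, classes = review(raw, toy_standardize)
print(rate)
for failure_shape, count in classes.most_common():
    print(count, failure_shape)
```

The idea is to fix the most common failure class first, rerun, and compare the rate against the previous pass so that a tweak which breaks other addresses shows up immediately.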

    Sorry I can't be more helpful, but it's been 6-7 years since I used it,
    and mostly I used my new code because of the above issues with that code.

    Best regards,
       -Steve W

    On 3/15/2021 5:25 PM, Grant Orr wrote:
    >
    > I’m looking for more detailed documentation on how rules work for the
    > address_standardizer
    >
    >   * Which rule types take precedence over which rule types
    >       o Micro precedes ARC, etc..
    >   * How the rules that have overlapping attributes are resolved
    >       o Address NUMBER NUMBER WORD SUFTYPE
    >       o Rules  :   (ARC) NUMBER WORD SUFTYPE, (CIV) NUMBER NUMBER,
    >         (CIV) NUMBER , (EXTRA) NUMBER
    >       o How does the parser know which rules to apply? How does Rank
    >         come into this?
    >   * Where does the RAW standardization come from?  Why does it appear
    >     to supersede everything except MICRO?
    >
    > I’ve been trying to figure this out with the existing documentation
    > and a lot of trial and error but it is challenging
    >
    > Any help is appreciated
    >
    >
    > _______________________________________________
    > postgis-users mailing list
    > postgis-users at lists.osgeo.org
    > https://lists.osgeo.org/mailman/listinfo/postgis-users





