[postgis-users] address_standardizer - More detailed documentation on how rules work

Stephen Woodbridge stephenwoodbridge37 at gmail.com
Thu Mar 18 07:12:36 PDT 2021


If your problem is just "11 RTE" and not the general class of "<num> 
RTE", then just add "11 RTE" to the lexicon as a street type; that 
will fix the problem and bypass the rules issues. Or, if you only have 
a few <num>s to deal with, add them all.
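For example, something like this (a sketch only; it assumes your lexicon 
lives in the us_lex table installed by the address_standardizer_data_us 
extension, and that token 2 is the street-type token -- verify both 
against your installed data before running anything):

```sql
-- Add "11 RTE" as a single lexicon entry so it is tokenized as a
-- street type up front and never reaches the ambiguous rules.
-- Assumed layout: us_lex(id, seq, word, stdword, token)
INSERT INTO us_lex (id, seq, word, stdword, token)
SELECT max(id) + 1, 1, '11 RTE', '11 ROUTE', 2
FROM us_lex;
```

Repeat with '12 RTE', '99 RTE', etc. if you only have a handful of 
<num>s to cover.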

Did you try using the rules2txt and txt2rules scripts I linked to? These 
convert the numbers into human-readable text so you can understand what 
the rules are saying.

You could add a rule like: <num> <num> <type> -> <house> <name> 
<suftype> <score>
(the token names are not correct, but you should get the idea)
and then play with the <score> value. But, as I said, increasing its 
value to force it into play over other rules will impact those rules 
and can cause side effects that are not desirable.
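In the rules table, that rule would be one line of numeric token codes. 
A sketch of what the insert might look like -- the specific input-token, 
output-token, and rule-type numbers below are from memory and may be 
wrong, so check them against the token tables in the documentation (or 
the rules2txt output) before using:

```sql
-- Hypothetical rule: NUMBER NUMBER TYPE -> HOUSE STREET SUFTYP
-- Rule format: input tokens, -1, output tokens, -1, rule type, rank
-- Assumed codes: NUMBER=0, TYPE=2 (input); HOUSE=1, STREET=5,
-- SUFTYP=6 (output); MICRO rule type=1; rank is the <score>, 0-17.
INSERT INTO us_rules (id, rule)
SELECT max(id) + 1, '0 0 2 -1 1 5 6 -1 1 11'
FROM us_rules;
```

Raising the last number (the rank) is what forces this rule to win over 
competing parses, with the side effects described above.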

-Steve

On 3/18/2021 4:26 AM, Grant Orr wrote:
> Thanks Stephen.
>
> I've been finding a couple of bugs and I've been trying to figure out if it is just my understanding of the functionality or an issue in the code.
> I haven't touched C in a long time and I'm principally using this in an AWS RDS instance so there is little opportunity to address them anyways.
>
> I appreciate the feedback.
>
> Grant
>
> On 2021-03-15, 8:05 PM, "Stephen Woodbridge" <stephenwoodbridge37 at gmail.com> wrote:
>
>
>      Hi Grant,
>
>      I ported the address standardizer from PAGC, which is no longer
>      supported. It is VERY confusing, so much so that I could not
>      understand it well enough to make changes myself; the code is very
>      opaque and very hard (impossible?) to follow. That is why I wrote
>      a new address standardizer in C++ designed to be easy to change.
>      I think it is unlikely that you will find anyone who can answer
>      your very good questions.
>
>      Here is what I remember about the existing code: each rule has a
>      weight associated with it. When a string of tokens is compared to
>      a rule, the weights are applied to calculate an overall weight for
>      the string of tokens. Remember, an address == a string of tokens,
>      which might match multiple different collections of rules. Each
>      set of rule matches provides a score for that combination of
>      rules, and the best-scored rule combination wins.
>
>      But, and this is the unstable part, if you change the rule weights
>      to solve one address, you might ALSO have broken it for another
>      address. So it is usually safer to make changes to the lexicon or
>      gazetteer to make the address tokens fall into the rule that you
>      want, rather than changing the rules.
>
>      My address standardizer has a similar problem, but it is much easier to
>      work with.
>
>      When I did this in the past, I would load my reference addresses
>      into a table, say "raw", and then standardize them into
>      "standard". I would then look at all the records that failed to
>      standardize and try to abstract them into classes of failures,
>      make some small tweaks to the lexicon, gazetteer, or rules, and
>      repeat the standardization process until I was happy, which in
>      some cases meant getting to 95+% standardized and ignoring the
>      rest. Your tolerance might be different, but you might find that
>      getting the last few percent is hard, as the changes that fix one
>      problem cause other problems.
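>
>      That loop can be sketched roughly as follows (a sketch, assuming
>      a raw(address text) table and the stock us_lex/us_gaz/us_rules
>      tables; standardize_address returns NULL in the fields it could
>      not parse):
>
>      ```sql
>      -- Standardize everything from raw into standard.
>      DROP TABLE IF EXISTS standard;
>      CREATE TABLE standard AS
>      SELECT a.address,
>             (standardize_address('us_lex', 'us_gaz', 'us_rules',
>                                  a.address)).*
>      FROM raw a;
>
>      -- Addresses where no street name was recovered: candidates for
>      -- new lexicon/gazetteer entries. Eyeball these for classes of
>      -- failures, tweak, and re-run.
>      SELECT address FROM standard WHERE name IS NULL;
>      ```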
>
>      Sorry I can't be more helpful, but it's been 6-7 years since I
>      used it, and mostly I used my new code because of the above issues
>      with that code.
>
>      Best regards,
>         -Steve W
>
>      On 3/15/2021 5:25 PM, Grant Orr wrote:
>      >
>      > I’m looking for more detailed documentation on how rules work for the
>      > address_standardizer
>      >
>      >   * Which rule types take precedence over which rule types
>      >       o Micro precedes ARC, etc..
>      >   * How the rules that have overlapping attributes are resolved
>      >       o Address NUMBER NUMBER WORD SUFTYPE
>      >       o Rules  :   (ARC) NUMBER WORD SUFTYPE, (CIV) NUMBER NUMBER,
>      >         (CIV) NUMBER , (EXTRA) NUMBER
>      >       o How does the parser know which rules to apply? How does Rank
>      >         come into this?
>      >   * Where does the RAW standardization come from?  Why does it appear
>      >     to supersede everything except MICRO?
>      >
>      > I’ve been trying to figure this out with the existing documentation
>      > and a lot of trial and error but it is challenging
>      >
>      > Any help is appreciated
>      >
>      >
>      > _______________________________________________
>      > postgis-users mailing list
>      > postgis-users at lists.osgeo.org
>      > https://lists.osgeo.org/mailman/listinfo/postgis-users
>
>
>
>




