[postgis-users] address_standardizer - More detailed documentation on how rules work
Stephen Woodbridge
stephenwoodbridge37 at gmail.com
Thu Mar 18 07:12:36 PDT 2021
If your problem is just "11 RTE" and not the general class of "<num>
RTE", then just add "11 RTE" to the lexicon as a street type; that
will fix the problem and bypass the rules issues. Or, if you only have
a few <num>s to deal with, add them all.
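To see why a lexicon entry sidesteps the rules entirely, here is a simplified sketch (my own toy, not the real C tokenizer; token names are invented): multi-word lexicon phrases are matched before any rule scoring happens, so "11 RTE" comes out as a single street-type token.

```python
# Toy tokenizer: greedy longest-match against the lexicon. The real
# standardizer's lexicon handling is more involved; this only illustrates
# why a phrase entry bypasses the <num> <num> <type> rule problem.
LEXICON = {
    "11 RTE": "TYPE",   # the phrase entry suggested above
    "RTE": "TYPE",
    "MAIN": "WORD",
}

def classify(address):
    """Tokenize, preferring the longest lexicon phrase at each position.
    Unknown digit strings become NUMBER, everything else WORD."""
    words = address.upper().split()
    tokens, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):     # longest phrase first
            phrase = " ".join(words[i:j])
            if phrase in LEXICON:
                tokens.append((phrase, LEXICON[phrase]))
                i = j
                break
        else:
            w = words[i]
            tokens.append((w, "NUMBER" if w.isdigit() else "WORD"))
            i += 1
    return tokens

print(classify("123 11 RTE"))
# [('123', 'NUMBER'), ('11 RTE', 'TYPE')] -- one TYPE token, so the
# ambiguous NUMBER NUMBER TYPE pattern never reaches the rules.
```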
Did you try using the rules2txt and txt2rules scripts I linked to? These
convert the numbers into human-readable text so you can understand what
the rules are saying.
You could add a rule like: <num> <num> <type> -> <house> <name>
<suftype> <score>
(the token names are not correct, but you should get the idea)
and then play with the <score> value. But as I said, this will impact
other rules as you increase its value to force it into play over them,
and it can potentially cause side effects that are not desirable.
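To make that tradeoff concrete, here is a toy model (token names, output labels, and scores are all invented, not the actual PAGC rule tables): two rules can match the same token pattern with different labelings, and bumping one rule's score flips the winner for every address that tokenizes to that pattern, not just the one you are trying to fix.

```python
# Toy model: two rules compete for the same token pattern.
# Token names, output labels, and scores are invented for illustration.
def best_rule(tokens, rules):
    """Return the output labeling of the highest-scoring matching rule."""
    candidates = [(score, out) for pat, out, score in rules if pat == tuple(tokens)]
    return max(candidates)[1] if candidates else None

rules = [
    # (input pattern,               output labeling,              score)
    (("NUMBER", "NUMBER", "TYPE"), ("HOUSE", "HOUSE", "SUFTYP"), 0.7),
    (("NUMBER", "NUMBER", "TYPE"), ("HOUSE", "NAME", "SUFTYP"),  0.5),
]

# Every address that tokenizes to NUMBER NUMBER TYPE must get the same
# labeling, whether the middle NUMBER is really a name or not:
print(best_rule(("NUMBER", "NUMBER", "TYPE"), rules))   # the 0.7 rule wins

# Raise the second rule's score to "fix" one address...
rules[1] = (rules[1][0], rules[1][1], 0.8)
print(best_rule(("NUMBER", "NUMBER", "TYPE"), rules))   # ...and it now wins
# everywhere this pattern occurs, including addresses that were fine before.
```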
-Steve
On 3/18/2021 4:26 AM, Grant Orr wrote:
> Thanks Stephen.
>
> I've been finding a couple of bugs and I've been trying to figure out if it is just my understanding of the functionality or an issue in the code.
> I haven't touched C in a long time and I'm principally using this in an AWS RDS instance so there is little opportunity to address them anyways.
>
> I appreciate the feedback.
>
> Grant
>
> On 2021-03-15, 8:05 PM, "Stephen Woodbridge" <stephenwoodbridge37 at gmail.com> wrote:
>
> Hi Grant,
>
> I ported the address standardizer from PAGC, which is no longer
> supported. It is VERY confusing, so much so that I could not understand
> it well enough to make changes myself; the code is very opaque and
> very hard (impossible?) to follow. That is why I wrote a new address
> standardizer in C++, designed to be easy to change. I think it is
> unlikely that you will find anyone who can answer your very good
> questions.
>
> Here is what I remember about the existing code: each rule has a weight
> associated with it. When a string of tokens is compared to a rule, the
> weights are applied to calculate an overall weight for that string of
> tokens. Remember, an address == a string of tokens, which might match
> multiple different collections of rules. Each set of rule matches
> provides a score for that combination of rules, and the best-scored
> rule combination wins.
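The scoring just described can be sketched as follows (a conceptual toy with invented rule patterns, labels, and integer weights; PAGC's real tables are far larger): every way of carving the token string into rule-matched segments gets a combined score, and the best-scoring combination wins.

```python
# Conceptual sketch of best-scoring rule-combination selection.
# Rule patterns, output labels, and weights are invented for illustration.
from functools import lru_cache

RULES = {
    # input pattern            -> (output labeling, weight)
    ("NUMBER",):                 (("HOUSE",), 6),
    ("NUMBER", "NUMBER"):        (("HOUSE", "HOUSE"), 5),
    ("WORD",):                   (("NAME",), 5),
    ("WORD", "TYPE"):            (("NAME", "SUFTYP"), 8),
    ("TYPE",):                   (("SUFTYP",), 3),
    ("NUMBER", "WORD", "TYPE"):  (("HOUSE", "NAME", "SUFTYP"), 9),
}

def best_parse(tokens):
    """Best-scoring segmentation of `tokens` into rule-matched pieces."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def go(i):
        if i == len(tokens):
            return (0, ())
        best = None
        for j in range(i + 1, len(tokens) + 1):
            seg = tokens[i:j]
            if seg in RULES:
                out, w = RULES[seg]
                rest = go(j)
                if rest is not None:
                    cand = (w + rest[0], out + rest[1])
                    if best is None or cand[0] > best[0]:
                        best = cand
        return best

    return go(0)

# "123 MAIN RTE" parses the way you want...
print(best_parse(("NUMBER", "WORD", "TYPE")))
# ...but "123 11 RTE" scores highest with both numbers labeled as house
# number, which is exactly the kind of mis-parse being discussed here.
print(best_parse(("NUMBER", "NUMBER", "TYPE")))
```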
>
> But, and this is the unstable part, if you change the rule weights to
> solve one address, you might ALSO have broken it for another address. So
> it is usually safer to make changes to the lexicon or gazetteer to make
> the address tokens fall into the rule that you want, rather than
> changing the rules.
>
> My address standardizer has a similar problem, but it is much easier to
> work with.
>
> When I did this in the past, I would load my reference addresses into a
> table, say "raw", and then standardize them into "standard". Then I
> would look at all the records that failed to standardize, try to
> abstract them into classes of failures, make some small tweaks to the
> lexicon, gazetteer, or rules, and repeat the standardization process
> until I was happy, which usually meant getting 95+% to standardize and
> ignoring the rest. Your tolerance might be different, but you might
> find that getting the last few percent is hard, as the changes that fix
> one problem cause other problems.
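That tuning loop can be sketched as follows (the regex "standardizer" is a stand-in for the real extension, which you would call from SQL; the sample addresses, lexicon, and success criterion are invented):

```python
# Sketch of the iterative tuning loop described above: standardize every
# raw address, inspect the failures, tweak the lexicon, and run again.
import re

def standardize(addr, lexicon):
    """Toy standardizer: succeeds only when the trailing street type
    is a word the lexicon knows about."""
    m = re.match(r"(\d+) (.*?)\s*(\w+)$", addr)
    if m and m.group(3).upper() in lexicon:
        return {"house": m.group(1), "name": m.group(2), "suftype": m.group(3)}
    return None

def tune_pass(raw_addresses, lexicon):
    """One pass: standardize everything, return the success rate and the
    failures to abstract into classes before the next tweak."""
    failed = [a for a in raw_addresses if standardize(a, lexicon) is None]
    rate = 1 - len(failed) / len(raw_addresses)
    return rate, failed

raw = ["12 MAIN ST", "7 OAK AVE", "11 RTE"]
lexicon = {"ST", "AVE", "RD"}

rate, failed = tune_pass(raw, lexicon)   # failed == ["11 RTE"]
lexicon.add("RTE")                       # the tweak for this failure class
rate, failed = tune_pass(raw, lexicon)   # failed == []
```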
>
> Sorry I can't be more helpful, but it's been 6-7 years since I used it,
> and mostly I used my new code because of the above issues with that code.
>
> Best regards,
> -Steve W
>
> On 3/15/2021 5:25 PM, Grant Orr wrote:
> >
> > I’m looking for more detailed documentation on how rules work for the
> > address_standardizer
> >
> > * Which rule types take precedence over which rule types
> > o Micro precedes ARC, etc.
> > * How the rules that have overlapping attributes are resolved
> > o Address NUMBER NUMBER WORD SUFTYPE
> > o Rules : (ARC) NUMBER WORD SUFTYPE, (CIV) NUMBER NUMBER,
> > (CIV) NUMBER , (EXTRA) NUMBER
> > o How does the parser know which rules to apply? How does Rank
> > come into this?
> > * Where does the RAW standardization come from? Why does it appear
> > to supersede everything except MICRO?
> >
> > I’ve been trying to figure this out with the existing documentation
> > and a lot of trial and error, but it is challenging.
> >
> > Any help is appreciated
> >
> >
> > _______________________________________________
> > postgis-users mailing list
> > postgis-users at lists.osgeo.org
> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>