[postgis-users] address_standardizer - More detailed documentation on how rules work
Stephen Woodbridge
stephenwoodbridge37 at gmail.com
Mon Mar 15 19:05:03 PDT 2021
Hi Grant,
I ported the address standardizer from PAGC, which is no longer
supported. It is VERY confusing, so much so that I could not understand
it well enough to make changes myself, and the code is very opaque and
very hard (impossible?) to follow. This was why I wrote a new address
standardizer in C++ designed to be easy to make changes to. I think it
is unlikely that you will find anyone who can answer your very good
questions.
Here is what I remember about the existing code: each rule has a weight
associated with it. When a string of tokens is compared to a rule, the
weights are applied to calculate an overall weight for the string of
tokens. Remember, an address == a string of tokens, which might match
multiple different collections of rules. So each set of rule matches
provides a score for that combination of rules, and the best-scoring
rule combination wins.
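
A toy sketch of that idea in Python, using the token string and rules
from the question quoted below. The weights are made up, and summing
the rule weights to score a combination is an assumption chosen for
illustration; it is not the formula the PAGC code actually uses.

```python
from typing import Iterator, List, Optional, Sequence, Tuple

# (rule type, token pattern, weight). Weights are hypothetical.
Rule = Tuple[str, Tuple[str, ...], float]

RULES: List[Rule] = [
    ("ARC", ("NUMBER", "WORD", "SUFTYPE"), 0.9),
    ("CIV", ("NUMBER", "NUMBER"), 0.4),
    ("CIV", ("NUMBER",), 0.7),
    ("EXTRA", ("NUMBER",), 0.3),
]


def rule_combinations(tokens: Sequence[str],
                      rules: List[Rule]) -> Iterator[List[Rule]]:
    """Yield every sequence of rules whose patterns, laid end to end,
    cover the token string exactly."""
    if not tokens:
        yield []
        return
    for rule in rules:
        pattern = rule[1]
        if tuple(tokens[:len(pattern)]) == pattern:
            for rest in rule_combinations(tokens[len(pattern):], rules):
                yield [rule] + rest


def score(combo: List[Rule]) -> float:
    # Assumption: the overall score is the sum of the rule weights.
    return sum(weight for _, _, weight in combo)


def best_combination(tokens: Sequence[str],
                     rules: List[Rule]) -> Optional[List[Rule]]:
    """Return the highest-scoring combination, or None if nothing fits."""
    return max(rule_combinations(tokens, rules), key=score, default=None)


address = ["NUMBER", "NUMBER", "WORD", "SUFTYPE"]
for combo in rule_combinations(address, RULES):
    print([rule_type for rule_type, _, _ in combo], score(combo))
```

With these made-up weights, two combinations cover the whole token
string (CIV + ARC and EXTRA + ARC), and the first one scores higher.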
But, and this is the unstable part, if you change the rule weights to
solve one address, you might ALSO have broken it for another address. So
it is usually safer to make changes to the lexicon or gazetteer to make
the address tokens fall into the rule that you want, rather than
changing the rules.
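
One way to catch that kind of breakage is to keep a small set of
addresses with known-good results and compare the output before and
after each tweak. A minimal sketch in Python; the function name, the
dictionaries and the sample values are all hypothetical, and the
"results" stand for whatever standardized form you compare on.

```python
from typing import Dict, List


def find_regressions(expected: Dict[str, str],
                     before: Dict[str, str],
                     after: Dict[str, str]) -> List[str]:
    """Return addresses that matched the expected result before a
    change but no longer match it afterwards."""
    return sorted(
        addr for addr, want in expected.items()
        if before.get(addr) == want and after.get(addr) != want
    )


# Hypothetical sample data: raw address -> standardized street name.
expected = {"12 3 MAIN ST": "MAIN", "7 OAK AVE": "OAK"}
before = {"12 3 MAIN ST": "3 MAIN", "7 OAK AVE": "OAK"}
after = {"12 3 MAIN ST": "MAIN", "7 OAK AVE": "7 OAK"}

# The tweak fixed the first address but broke the second one.
print(find_regressions(expected, before, after))
```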
My address standardizer has a similar problem, but it is much easier to
work with.
When I did this in the past, I would load my reference addresses into a
table, say "raw", and then standardize them into "standard". Then I
would look at all the records that failed to standardize and try to
abstract them into classes of failures, make some small tweaks to the
lexicon, gazetteer or rules, and repeat the standardization process. I
did this until I was happy, which in some cases meant getting to 95+%
standardizing and ignoring the rest. Your tolerance might be different,
but you might find that getting the last few percent is hard, as the
changes to fix one problem cause other problems.
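
The "abstract them into classes of failures" step can be sketched
roughly as follows, in Python. The input format (a list of raw address
plus a success flag) and the coarse N/W/M pattern used to group the
failures are assumptions for illustration; any grouping that makes
similar failures land together would do.

```python
from collections import Counter
from typing import List, Tuple


def shape(address: str) -> str:
    """Reduce an address to a coarse pattern: N for all-digit tokens,
    W for all-letter tokens, M for anything mixed."""
    parts = []
    for tok in address.split():
        if tok.isdigit():
            parts.append("N")
        elif tok.isalpha():
            parts.append("W")
        else:
            parts.append("M")
    return " ".join(parts)


def summarize(results: List[Tuple[str, bool]]):
    """results holds (raw_address, standardized_ok) pairs.
    Returns the success rate and failure patterns, most common first."""
    failures = [raw for raw, ok in results if not ok]
    total = len(results)
    rate = (total - len(failures)) / total if total else 0.0
    return rate, Counter(shape(raw) for raw in failures).most_common()


# Hypothetical sample data.
results = [
    ("12 Main St", True),
    ("PO Box 7", False),
    ("RR 2 Box 19", False),
    ("AB Box 9", False),
    ("5 Elm Rd", True),
]
rate, classes = summarize(results)
print(f"{rate:.0%} standardized")
for pattern, count in classes:
    print(count, pattern)
```

Working on the most common failure pattern first, then re-running the
whole set, matches the tweak-and-repeat loop described above.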
Sorry I can't be more helpful, but it's been 6-7 years since I used it,
and mostly I used my new code because of the above issues with that code.
Best regards,
-Steve W
On 3/15/2021 5:25 PM, Grant Orr wrote:
>
> I’m looking for more detailed documentation on how rules work for the
> address_standardizer
>
>   * Which rule types take precedence over which rule types
>       o Micro precedes ARC, etc.
>   * How the rules that have overlapping attributes are resolved
>       o Address NUMBER NUMBER WORD SUFTYPE
>       o Rules: (ARC) NUMBER WORD SUFTYPE, (CIV) NUMBER NUMBER,
>         (CIV) NUMBER, (EXTRA) NUMBER
>       o How does the parser know which rules to apply? How does Rank
>         come into this?
>   * Where does the RAW standardization come from? Why does it appear
>     to supersede everything except MICRO?
>
> I’ve been trying to figure this out with the existing documentation
> and a lot of trial and error but it is challenging
>
> Any help is appreciated
>
>