[postgis-users] Fuzzy Address Matching - PostgreSql equivalent to FuzzyStringComparer using Python difflib module

James Klassen klassen.js at gmail.com
Mon May 11 10:34:26 PDT 2020


There are lots of classes of tricky cases:

For example:

    1234 County Road E, Hudson

Is actually "County Road E", not "County Road East".  The county roads
there are indexed by letters.  There is also a County Road F, and a County
Road N, and a County Road FF, etc.)  Last I checked, even Google, with all
their data, gets this one wrong.

In general, address parsing is a hard problem and there are always (at
least in the US) going to be cases that are ambiguous to parse.  The people
who have been answering your questions on this list have been working on
this problem for as long as I can remember.  And they have spent many years
developing libraries that make dealing with many of the corner cases of
address parsing easier.  They started these libraries because the quick and
simple solutions you seem to be hoping for failed to deal with the
realities of addressing.  There is not a perfect solution, but what they
are suggesting are the best solutions available.

If you want to start over and reinvent a simpler solution, go ahead.  But
be advised that the people answering you have been down that path.  They
know what the next problem will be, and the problem after that, and they
know that at the end of the chain of problems you will discover you need
something like the tools and libraries they are suggesting.

On Mon, May 11, 2020 at 12:31 PM Paul Ramsey <pramsey at cleverelephant.ca>
wrote:

> There's a huge range of potential functionality/features that can reside
> under the banner "fuzzy address matching".
>
> At the simplest, you're trying to catch common lexicographic differences,
> like off-by-one addresses or alternate spellings. This is the realm of
> trigrams and levenstein distances.
> Then you start dealing with different forms of abbreviations (rd, road, r)
> and formats (unit 4, #4, apt 4, 4). This is the realm of data-less
> standardizers.
> Then you start dealing with larger forms of aliasing. Standard city and
> state names, recognizing major components of addresses. This is the real of
> dictionary-backed standardizers.
> Then you start dealing with all forms of aliasing. This is actually
> getting down into geocoding, and address standardizers backed by complete
> address and road databases.
>
> What people mean when they say "can you do address standardization" can
> vary massively. It can also be frustrating because the performance of
> something like Google's geocoder algorithm, backed by the largest, and most
> up-to-data geographic database in the world gets casually compared to
> simple format and dictionary backed standardizer by folks with no
> understanding of the complexity or amount of data that lives under the
> covers of this "simple" task. I think people are far more understanding of
> something like machine translation and "get" that it's a really hard
> problem, because learning a new language is hard. But geocoding is "easy",
> anyone can look at an address and then look that address up in a map book
> (ha ha, well anyone over 40 at least).
>
> P.
>
>
> > On May 11, 2020, at 10:15 AM, Imre Samu <pella.samu at gmail.com> wrote:
> >
> > > ... Fuzzy Address Matching? ...
> > > For instance, South Great Avenue, A City, Planet Earth may be put down
> as the following:
> >
> > sometimes .. the same  country/city/building   -  has a different names
> ..
> >
> > "A City"  example:
> > https://en.wikipedia.org/wiki/Vienna
> >
> > Vienna:
> > "Beč (Croatian, Serbian, older Bulgarian), Beç (older Turkish)*, Bech or
> Vidnya (Romani), Bécs (Hungarian)*, Bin / Pin - 빈 (Korean), Dunaj
> (Slovene)*, Fienna (Welsh), Vedunia (Celtic), Vena - Вена (Russian), Vídeň
> (Czech)*, Viden' / Videň (Ukrainian)*, Viedeň (Slovak), Viên (Vietnamese),
> Viena / Vijena/ Виена (Belarusian, Bulgarian, Macedonian), Viena (Catalan*,
> Lithuanian, Portuguese*, Romanian*, Spanish*, Tagalog*), Vienna (Italian)*,
> Vienne (French)*, Viénni - Βιέννη (Greek), Vieno (Esperanto), Viin
> (Estonian), Vin - װין (Yiddish), Vín (Irish, Icelandic), Vina - וינה
> (Hebrew), Vínarborg (Icelandic variant), Vindobona (Latin), Vīne
> (Latvian)*, Viyana (Turkish)*, Vjenë (Albanian), Vjenna (Maltese), Vyana
> (Azeri), Wean (local Viennese, Austrian and Bavarian dialects)*, Weiyena -
> 維也納 (Chinese)*, Wene (Afrikaans), Wenen (Dutch)*, Wiedeń (Polish)*, Wien
> (Danish*, Finnish*, German*, Norwegian*, Swedish*), Wīn - ウィーン (Japanese)*,
> Wina (Indonesian), فيينا (Arabic), وين (Persian)"
> > "
> > see:
> https://en.wikipedia.org/wiki/Names_of_European_cities_in_different_languages_(U%E2%80%93Z)#V
> >
> https://en.wikipedia.org/wiki/List_of_names_of_European_cities_in_different_languages
> >
> > or see the Nominatim names for Vienna:
> >
> https://nominatim.openstreetmap.org/details.php?osmtype=R&osmid=109166&class=boundary
> >
> >
> > And sometimes - the buildings has a different names ...
> >
> https://nominatim.openstreetmap.org/details.php?osmtype=R&osmid=11101&class=amenity
> > - Wien Rathaus (name)
> > - Ратуша (name:be)
> > - Vídeňská radnice (name:cs)
> > - Rathaus (name:de)
> > - Vienna City Hall (name:en)
> > - 市庁舎 (name:ja)
> > - Ратуша (name:ru)
> > - Віденська ратуша (name:uk)
> >
> >
> >
> > Best,
> >  Imre
> >
> >
> >
> >
> > Shaozhong SHI <shishaozhong at gmail.com> ezt írta (időpont: 2020. máj.
> 11., H, 18:24):
> > Hello,
> >
> > I got a few questions as follows:
> >
> > 1.  Which one is the best way for Fuzzy Address Matching?
> >
> > 2.  FME FuzzyStringComparer uses  Python difflib module.  Which one in
> Postgres is equivalent or similar to it?
> >
> > 3.  Often, addresses collected by different people may well be correct.
> But, there may be typing errors, or addresses are composed not in a
> consistent manner.
> >
> > For instance, South Great Avenue, A City, Planet Earth may be put down
> as the following:
> >
> > S. Great Aveue, City A, Earth Planet
> > Great Avene South, A City, Earth Planet
> > Great Avenue S, A City, Planet Earth
> >
> > Surely, there would be solutions to deal with this problem.
> >
> > Can anyone enlighten me?
> >
> > Regards,
> >
> > Shao
> > _______________________________________________
> > postgis-users mailing list
> > postgis-users at lists.osgeo.org
> > https://lists.osgeo.org/mailman/listinfo/postgis-users
> > _______________________________________________
> > postgis-users mailing list
> > postgis-users at lists.osgeo.org
> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20200511/78b539be/attachment.html>


More information about the postgis-users mailing list