[postgis-users] Fuzzy Address Matching - PostgreSql equivalent to FuzzyStringComparer using Python difflib module

Paul Ramsey pramsey at cleverelephant.ca
Mon May 11 10:31:09 PDT 2020


There's a huge range of potential functionality/features that can reside under the banner "fuzzy address matching". 

At the simplest, you're trying to catch common lexicographic differences, like off-by-one addresses or alternate spellings. This is the realm of trigrams and levenstein distances.
Then you start dealing with different forms of abbreviations (rd, road, r) and formats (unit 4, #4, apt 4, 4). This is the realm of data-less standardizers.
Then you start dealing with larger forms of aliasing. Standard city and state names, recognizing major components of addresses. This is the real of dictionary-backed standardizers.
Then you start dealing with all forms of aliasing. This is actually getting down into geocoding, and address standardizers backed by complete address and road databases.

What people mean when they say "can you do address standardization" can vary massively. It can also be frustrating because the performance of something like Google's geocoder algorithm, backed by the largest, and most up-to-data geographic database in the world gets casually compared to simple format and dictionary backed standardizer by folks with no understanding of the complexity or amount of data that lives under the covers of this "simple" task. I think people are far more understanding of something like machine translation and "get" that it's a really hard problem, because learning a new language is hard. But geocoding is "easy", anyone can look at an address and then look that address up in a map book (ha ha, well anyone over 40 at least).

P.


> On May 11, 2020, at 10:15 AM, Imre Samu <pella.samu at gmail.com> wrote:
> 
> > ... Fuzzy Address Matching? ... 
> > For instance, South Great Avenue, A City, Planet Earth may be put down as the following:
> 
> sometimes .. the same  country/city/building   -  has a different names .. 
> 
> "A City"  example:
> https://en.wikipedia.org/wiki/Vienna
> 
> Vienna:
> "Beč (Croatian, Serbian, older Bulgarian), Beç (older Turkish)*, Bech or Vidnya (Romani), Bécs (Hungarian)*, Bin / Pin - 빈 (Korean), Dunaj (Slovene)*, Fienna (Welsh), Vedunia (Celtic), Vena - Вена (Russian), Vídeň (Czech)*, Viden' / Videň (Ukrainian)*, Viedeň (Slovak), Viên (Vietnamese), Viena / Vijena/ Виена (Belarusian, Bulgarian, Macedonian), Viena (Catalan*, Lithuanian, Portuguese*, Romanian*, Spanish*, Tagalog*), Vienna (Italian)*, Vienne (French)*, Viénni - Βιέννη (Greek), Vieno (Esperanto), Viin (Estonian), Vin - װין (Yiddish), Vín (Irish, Icelandic), Vina - וינה (Hebrew), Vínarborg (Icelandic variant), Vindobona (Latin), Vīne (Latvian)*, Viyana (Turkish)*, Vjenë (Albanian), Vjenna (Maltese), Vyana (Azeri), Wean (local Viennese, Austrian and Bavarian dialects)*, Weiyena - 維也納 (Chinese)*, Wene (Afrikaans), Wenen (Dutch)*, Wiedeń (Polish)*, Wien (Danish*, Finnish*, German*, Norwegian*, Swedish*), Wīn - ウィーン (Japanese)*, Wina (Indonesian), فيينا (Arabic), وين (Persian)"
> "
> see: https://en.wikipedia.org/wiki/Names_of_European_cities_in_different_languages_(U%E2%80%93Z)#V
> https://en.wikipedia.org/wiki/List_of_names_of_European_cities_in_different_languages
> 
> or see the Nominatim names for Vienna:
> https://nominatim.openstreetmap.org/details.php?osmtype=R&osmid=109166&class=boundary
> 
> 
> And sometimes - the buildings has a different names ...
> https://nominatim.openstreetmap.org/details.php?osmtype=R&osmid=11101&class=amenity
> - Wien Rathaus (name)
> - Ратуша (name:be)
> - Vídeňská radnice (name:cs)
> - Rathaus (name:de)
> - Vienna City Hall (name:en)
> - 市庁舎 (name:ja)
> - Ратуша (name:ru)
> - Віденська ратуша (name:uk)
> 
> 
> 
> Best,
>  Imre
> 
> 
> 
> 
> Shaozhong SHI <shishaozhong at gmail.com> ezt írta (időpont: 2020. máj. 11., H, 18:24):
> Hello,
> 
> I got a few questions as follows:
> 
> 1.  Which one is the best way for Fuzzy Address Matching?
> 
> 2.  FME FuzzyStringComparer uses  Python difflib module.  Which one in Postgres is equivalent or similar to it?
> 
> 3.  Often, addresses collected by different people may well be correct.  But, there may be typing errors, or addresses are composed not in a consistent manner.
> 
> For instance, South Great Avenue, A City, Planet Earth may be put down as the following:
> 
> S. Great Aveue, City A, Earth Planet
> Great Avene South, A City, Earth Planet
> Great Avenue S, A City, Planet Earth
> 
> Surely, there would be solutions to deal with this problem.
> 
> Can anyone enlighten me?
> 
> Regards,
> 
> Shao
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users



More information about the postgis-users mailing list