[postgis-users] Fuzzy Address Matching - PostgreSql equivalent to FuzzyStringComparer using Python difflib module

Shaozhong SHI shishaozhong at gmail.com
Mon May 11 19:56:32 PDT 2020


Sorry.  It is a very interesting topic, so far as our field is concerned.

Regards,

Shao

On Tue, 12 May 2020 at 03:55, Shaozhong SHI <shishaozhong at gmail.com> wrote:

> Dear All Friends,
>
> Thank you very much for all the information.
>
> I will spend time to digest and think over all these.
>
> It is certainly a very interest topic.
>
> Regards,
>
> Shao
>
> On Mon, 11 May 2020 at 18:34, James Klassen <klassen.js at gmail.com> wrote:
>
>> There are lots of classes of tricky cases:
>>
>> For example:
>>
>>     1234 County Road E, Hudson
>>
>> Is actually "County Road E", not "County Road East".  The county roads
>> there are indexed by letters.  There is also a County Road F, and a County
>> Road N, and a County Road FF, etc.)  Last I checked, even Google, with all
>> their data, gets this one wrong.
>>
>> In general, address parsing is a hard problem and there are always (at
>> least in the US) going to be cases that are ambiguous to parse.  The people
>> who have been answering your questions on this list have been working on
>> this problem for as long as I can remember.  And they have spent many years
>> developing libraries that make dealing with many of the corner cases of
>> address parsing easier.  They started these libraries because the quick and
>> simple solutions you seem to be hoping for failed to deal with the
>> realities of addressing.  There is not a perfect solution, but what they
>> are suggesting are the best solutions available.
>>
>> If you want to start over and reinvent a simpler solution, go ahead.  But
>> be advised that the people answering you have been down that path.  They
>> know what the next problem will be, and the problem after that, and they
>> know that at the end of the chain of problems you will discover you need
>> something like the tools and libraries they are suggesting.
>>
>> On Mon, May 11, 2020 at 12:31 PM Paul Ramsey <pramsey at cleverelephant.ca>
>> wrote:
>>
>>> There's a huge range of potential functionality/features that can reside
>>> under the banner "fuzzy address matching".
>>>
>>> At the simplest, you're trying to catch common lexicographic
>>> differences, like off-by-one addresses or alternate spellings. This is the
>>> realm of trigrams and levenstein distances.
>>> Then you start dealing with different forms of abbreviations (rd, road,
>>> r) and formats (unit 4, #4, apt 4, 4). This is the realm of data-less
>>> standardizers.
>>> Then you start dealing with larger forms of aliasing. Standard city and
>>> state names, recognizing major components of addresses. This is the real of
>>> dictionary-backed standardizers.
>>> Then you start dealing with all forms of aliasing. This is actually
>>> getting down into geocoding, and address standardizers backed by complete
>>> address and road databases.
>>>
>>> What people mean when they say "can you do address standardization" can
>>> vary massively. It can also be frustrating because the performance of
>>> something like Google's geocoder algorithm, backed by the largest, and most
>>> up-to-data geographic database in the world gets casually compared to
>>> simple format and dictionary backed standardizer by folks with no
>>> understanding of the complexity or amount of data that lives under the
>>> covers of this "simple" task. I think people are far more understanding of
>>> something like machine translation and "get" that it's a really hard
>>> problem, because learning a new language is hard. But geocoding is "easy",
>>> anyone can look at an address and then look that address up in a map book
>>> (ha ha, well anyone over 40 at least).
>>>
>>> P.
>>>
>>>
>>> > On May 11, 2020, at 10:15 AM, Imre Samu <pella.samu at gmail.com> wrote:
>>> >
>>> > > ... Fuzzy Address Matching? ...
>>> > > For instance, South Great Avenue, A City, Planet Earth may be put
>>> down as the following:
>>> >
>>> > sometimes .. the same  country/city/building   -  has a different
>>> names ..
>>> >
>>> > "A City"  example:
>>> > https://en.wikipedia.org/wiki/Vienna
>>> >
>>> > Vienna:
>>> > "Beč (Croatian, Serbian, older Bulgarian), Beç (older Turkish)*, Bech
>>> or Vidnya (Romani), Bécs (Hungarian)*, Bin / Pin - 빈 (Korean), Dunaj
>>> (Slovene)*, Fienna (Welsh), Vedunia (Celtic), Vena - Вена (Russian), Vídeň
>>> (Czech)*, Viden' / Videň (Ukrainian)*, Viedeň (Slovak), Viên (Vietnamese),
>>> Viena / Vijena/ Виена (Belarusian, Bulgarian, Macedonian), Viena (Catalan*,
>>> Lithuanian, Portuguese*, Romanian*, Spanish*, Tagalog*), Vienna (Italian)*,
>>> Vienne (French)*, Viénni - Βιέννη (Greek), Vieno (Esperanto), Viin
>>> (Estonian), Vin - װין (Yiddish), Vín (Irish, Icelandic), Vina - וינה
>>> (Hebrew), Vínarborg (Icelandic variant), Vindobona (Latin), Vīne
>>> (Latvian)*, Viyana (Turkish)*, Vjenë (Albanian), Vjenna (Maltese), Vyana
>>> (Azeri), Wean (local Viennese, Austrian and Bavarian dialects)*, Weiyena -
>>> 維也納 (Chinese)*, Wene (Afrikaans), Wenen (Dutch)*, Wiedeń (Polish)*, Wien
>>> (Danish*, Finnish*, German*, Norwegian*, Swedish*), Wīn - ウィーン (Japanese)*,
>>> Wina (Indonesian), فيينا (Arabic), وين (Persian)"
>>> > "
>>> > see:
>>> https://en.wikipedia.org/wiki/Names_of_European_cities_in_different_languages_(U%E2%80%93Z)#V
>>> >
>>> https://en.wikipedia.org/wiki/List_of_names_of_European_cities_in_different_languages
>>> >
>>> > or see the Nominatim names for Vienna:
>>> >
>>> https://nominatim.openstreetmap.org/details.php?osmtype=R&osmid=109166&class=boundary
>>> >
>>> >
>>> > And sometimes - the buildings has a different names ...
>>> >
>>> https://nominatim.openstreetmap.org/details.php?osmtype=R&osmid=11101&class=amenity
>>> > - Wien Rathaus (name)
>>> > - Ратуша (name:be)
>>> > - Vídeňská radnice (name:cs)
>>> > - Rathaus (name:de)
>>> > - Vienna City Hall (name:en)
>>> > - 市庁舎 (name:ja)
>>> > - Ратуша (name:ru)
>>> > - Віденська ратуша (name:uk)
>>> >
>>> >
>>> >
>>> > Best,
>>> >  Imre
>>> >
>>> >
>>> >
>>> >
>>> > Shaozhong SHI <shishaozhong at gmail.com> ezt írta (időpont: 2020. máj.
>>> 11., H, 18:24):
>>> > Hello,
>>> >
>>> > I got a few questions as follows:
>>> >
>>> > 1.  Which one is the best way for Fuzzy Address Matching?
>>> >
>>> > 2.  FME FuzzyStringComparer uses  Python difflib module.  Which one in
>>> Postgres is equivalent or similar to it?
>>> >
>>> > 3.  Often, addresses collected by different people may well be
>>> correct.  But, there may be typing errors, or addresses are composed not in
>>> a consistent manner.
>>> >
>>> > For instance, South Great Avenue, A City, Planet Earth may be put down
>>> as the following:
>>> >
>>> > S. Great Aveue, City A, Earth Planet
>>> > Great Avene South, A City, Earth Planet
>>> > Great Avenue S, A City, Planet Earth
>>> >
>>> > Surely, there would be solutions to deal with this problem.
>>> >
>>> > Can anyone enlighten me?
>>> >
>>> > Regards,
>>> >
>>> > Shao
>>> > _______________________________________________
>>> > postgis-users mailing list
>>> > postgis-users at lists.osgeo.org
>>> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>>> > _______________________________________________
>>> > postgis-users mailing list
>>> > postgis-users at lists.osgeo.org
>>> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>>>
>>> _______________________________________________
>>> postgis-users mailing list
>>> postgis-users at lists.osgeo.org
>>> https://lists.osgeo.org/mailman/listinfo/postgis-users
>>
>> _______________________________________________
>> postgis-users mailing list
>> postgis-users at lists.osgeo.org
>> https://lists.osgeo.org/mailman/listinfo/postgis-users
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20200512/abfdbb5b/attachment.html>


More information about the postgis-users mailing list