[postgis-users] Fuzzy Address Matching - PostgreSql equivalent to FuzzyStringComparer using Python difflib module

Shaozhong SHI shishaozhong at gmail.com
Mon May 11 19:55:10 PDT 2020


Dear All Friends,

Thank you very much for all the information.

I will spend time to digest and think over all these.

It is certainly a very interest topic.

Regards,

Shao

On Mon, 11 May 2020 at 18:34, James Klassen <klassen.js at gmail.com> wrote:

> There are lots of classes of tricky cases:
>
> For example:
>
>     1234 County Road E, Hudson
>
> Is actually "County Road E", not "County Road East".  The county roads
> there are indexed by letters.  There is also a County Road F, and a County
> Road N, and a County Road FF, etc.)  Last I checked, even Google, with all
> their data, gets this one wrong.
>
> In general, address parsing is a hard problem and there are always (at
> least in the US) going to be cases that are ambiguous to parse.  The people
> who have been answering your questions on this list have been working on
> this problem for as long as I can remember.  And they have spent many years
> developing libraries that make dealing with many of the corner cases of
> address parsing easier.  They started these libraries because the quick and
> simple solutions you seem to be hoping for failed to deal with the
> realities of addressing.  There is not a perfect solution, but what they
> are suggesting are the best solutions available.
>
> If you want to start over and reinvent a simpler solution, go ahead.  But
> be advised that the people answering you have been down that path.  They
> know what the next problem will be, and the problem after that, and they
> know that at the end of the chain of problems you will discover you need
> something like the tools and libraries they are suggesting.
>
> On Mon, May 11, 2020 at 12:31 PM Paul Ramsey <pramsey at cleverelephant.ca>
> wrote:
>
>> There's a huge range of potential functionality/features that can reside
>> under the banner "fuzzy address matching".
>>
>> At the simplest, you're trying to catch common lexicographic differences,
>> like off-by-one addresses or alternate spellings. This is the realm of
>> trigrams and levenstein distances.
>> Then you start dealing with different forms of abbreviations (rd, road,
>> r) and formats (unit 4, #4, apt 4, 4). This is the realm of data-less
>> standardizers.
>> Then you start dealing with larger forms of aliasing. Standard city and
>> state names, recognizing major components of addresses. This is the real of
>> dictionary-backed standardizers.
>> Then you start dealing with all forms of aliasing. This is actually
>> getting down into geocoding, and address standardizers backed by complete
>> address and road databases.
>>
>> What people mean when they say "can you do address standardization" can
>> vary massively. It can also be frustrating because the performance of
>> something like Google's geocoder algorithm, backed by the largest, and most
>> up-to-data geographic database in the world gets casually compared to
>> simple format and dictionary backed standardizer by folks with no
>> understanding of the complexity or amount of data that lives under the
>> covers of this "simple" task. I think people are far more understanding of
>> something like machine translation and "get" that it's a really hard
>> problem, because learning a new language is hard. But geocoding is "easy",
>> anyone can look at an address and then look that address up in a map book
>> (ha ha, well anyone over 40 at least).
>>
>> P.
>>
>>
>> > On May 11, 2020, at 10:15 AM, Imre Samu <pella.samu at gmail.com> wrote:
>> >
>> > > ... Fuzzy Address Matching? ...
>> > > For instance, South Great Avenue, A City, Planet Earth may be put
>> down as the following:
>> >
>> > sometimes .. the same  country/city/building   -  has a different names
>> ..
>> >
>> > "A City"  example:
>> > https://en.wikipedia.org/wiki/Vienna
>> >
>> > Vienna:
>> > "Beč (Croatian, Serbian, older Bulgarian), Beç (older Turkish)*, Bech
>> or Vidnya (Romani), Bécs (Hungarian)*, Bin / Pin - 빈 (Korean), Dunaj
>> (Slovene)*, Fienna (Welsh), Vedunia (Celtic), Vena - Вена (Russian), Vídeň
>> (Czech)*, Viden' / Videň (Ukrainian)*, Viedeň (Slovak), Viên (Vietnamese),
>> Viena / Vijena/ Виена (Belarusian, Bulgarian, Macedonian), Viena (Catalan*,
>> Lithuanian, Portuguese*, Romanian*, Spanish*, Tagalog*), Vienna (Italian)*,
>> Vienne (French)*, Viénni - Βιέννη (Greek), Vieno (Esperanto), Viin
>> (Estonian), Vin - װין (Yiddish), Vín (Irish, Icelandic), Vina - וינה
>> (Hebrew), Vínarborg (Icelandic variant), Vindobona (Latin), Vīne
>> (Latvian)*, Viyana (Turkish)*, Vjenë (Albanian), Vjenna (Maltese), Vyana
>> (Azeri), Wean (local Viennese, Austrian and Bavarian dialects)*, Weiyena -
>> 維也納 (Chinese)*, Wene (Afrikaans), Wenen (Dutch)*, Wiedeń (Polish)*, Wien
>> (Danish*, Finnish*, German*, Norwegian*, Swedish*), Wīn - ウィーン (Japanese)*,
>> Wina (Indonesian), فيينا (Arabic), وين (Persian)"
>> > "
>> > see:
>> https://en.wikipedia.org/wiki/Names_of_European_cities_in_different_languages_(U%E2%80%93Z)#V
>> >
>> https://en.wikipedia.org/wiki/List_of_names_of_European_cities_in_different_languages
>> >
>> > or see the Nominatim names for Vienna:
>> >
>> https://nominatim.openstreetmap.org/details.php?osmtype=R&osmid=109166&class=boundary
>> >
>> >
>> > And sometimes - the buildings has a different names ...
>> >
>> https://nominatim.openstreetmap.org/details.php?osmtype=R&osmid=11101&class=amenity
>> > - Wien Rathaus (name)
>> > - Ратуша (name:be)
>> > - Vídeňská radnice (name:cs)
>> > - Rathaus (name:de)
>> > - Vienna City Hall (name:en)
>> > - 市庁舎 (name:ja)
>> > - Ратуша (name:ru)
>> > - Віденська ратуша (name:uk)
>> >
>> >
>> >
>> > Best,
>> >  Imre
>> >
>> >
>> >
>> >
>> > Shaozhong SHI <shishaozhong at gmail.com> ezt írta (időpont: 2020. máj.
>> 11., H, 18:24):
>> > Hello,
>> >
>> > I got a few questions as follows:
>> >
>> > 1.  Which one is the best way for Fuzzy Address Matching?
>> >
>> > 2.  FME FuzzyStringComparer uses  Python difflib module.  Which one in
>> Postgres is equivalent or similar to it?
>> >
>> > 3.  Often, addresses collected by different people may well be
>> correct.  But, there may be typing errors, or addresses are composed not in
>> a consistent manner.
>> >
>> > For instance, South Great Avenue, A City, Planet Earth may be put down
>> as the following:
>> >
>> > S. Great Aveue, City A, Earth Planet
>> > Great Avene South, A City, Earth Planet
>> > Great Avenue S, A City, Planet Earth
>> >
>> > Surely, there would be solutions to deal with this problem.
>> >
>> > Can anyone enlighten me?
>> >
>> > Regards,
>> >
>> > Shao
>> > _______________________________________________
>> > postgis-users mailing list
>> > postgis-users at lists.osgeo.org
>> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>> > _______________________________________________
>> > postgis-users mailing list
>> > postgis-users at lists.osgeo.org
>> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>>
>> _______________________________________________
>> postgis-users mailing list
>> postgis-users at lists.osgeo.org
>> https://lists.osgeo.org/mailman/listinfo/postgis-users
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20200512/9199f653/attachment.html>


More information about the postgis-users mailing list