[OSGeo-Discuss] The existence (and value of) "clean" geocoding tools?
Stephen Woodbridge
woodbri at swoodbridge.com
Wed Sep 24 21:44:06 PDT 2008
David Dearing wrote:
> Hi. I just recently stumbled across OSGeo and have poked around to try
> and get a feel for the different projects, but still have a lingering
> question. Forgive me if this isn't the appropriate channel to be asking
> this.
>
> It seems that there is a solid focus on mapping, image manipulation, and
> geometric processing at OSGeo. And, in the more broad world including
> non-open source projects, there are a lot of tools available for the
> mass production of geotagged or geocoded documents. However, the
> accuracy of these systems, while good, doesn't seem sufficient when
> accuracy is at a premium (from what I've seen they tend to focus on
> volume).
>
> Are there any existing tools that can be used to tag/code documents,
> perhaps sacrificing the mass-produced aspect for better accuracy? Have
> I just missed/overlooked some existing tool(s) that meet this
> description? Or, am I in the minority in wanting to produce fewer
> "clean" geocoded/tagged documents rather than many "pretty good" documents?
Have you looked at:
http://ofb.net/~egnor/google.html
http://www.pagcgeo.org/
Geocoding is NOT exact; in fact, it deals with a very messy area of
natural language parsing. While it is more constrained than free text,
it still has to deal with typos, abbreviations, punctuation, etc., and
then it has to match the user input to some vendor data.
For example: matching AL 44, Alabama 44, AL-44, Alabama Highway 44,
Highway 44, State Highway 44, Rt 44, and various other abbreviations for
Highway; simple typos; adding N, N., North, S, S., South, etc.
designations to the highway; adding Alt., Bus., Byp., etc.; and on it
goes. You also need to deal with accented characters that are sometimes
entered without accents.
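
To make the highway example concrete, here is a minimal sketch in Python of reducing several spellings to one comparable form. The function names and the tiny STATE_WORDS table are made up for illustration, and the sketch assumes the input is already known to be a highway reference. A real standardizer would need far larger tables and would also handle the N/S and Alt./Bus./Byp. qualifiers.

```python
import re
import unicodedata

# Hypothetical lookup table: spellings of a state mapped to one code.
STATE_WORDS = {"al": "AL", "alabama": "AL"}


def strip_accents(text):
    """Drop accents so 'Montréal' and 'Montreal' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_highway(raw):
    """Reduce a highway reference to a (state, number) pair.

    Toy rule set: punctuation becomes whitespace, state spellings map to
    a code, and every other word (Highway, Rt, State, ...) is ignored.
    """
    text = strip_accents(raw).lower()
    tokens = re.sub(r"[^a-z0-9]+", " ", text).split()
    state = None
    number = None
    for tok in tokens:
        if tok in STATE_WORDS:
            state = STATE_WORDS[tok]
        elif tok.isdigit():
            number = tok
    return (state, number)


for spelling in ["AL 44", "Alabama 44", "AL-44", "Alabama Highway 44"]:
    print(spelling, "->", normalize_highway(spelling))
```

Note that "Highway 44" and "Rt 44" come back with no state at all, so even after normalizing, the matcher still has to decide which state's route 44 was meant.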
In a geocoder, you typically have a standardizer that sorts out all that
craziness. When you load the geocoder, you standardize the vendor
data and store it in a standard form. When you get a geocode request, you
standardize the incoming request and then try to match its standard form
against the vendor data, which is also in standard form. As an alternative
to a standardizer, some geocoders use statistical record-matching techniques.
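
The load-time and request-time flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the CANONICAL table, the function names, and the sample vendor records are all hypothetical, and exact lookup on the standard form is the simplest possible matching step.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation table; a real standardizer has a far larger one.
CANONICAL = {
    "street": "ST", "st": "ST",
    "avenue": "AVE", "ave": "AVE",
    "north": "N", "n": "N",
    "south": "S", "s": "S",
}


def standardize(name):
    """Lowercase, drop punctuation, and map each token to a canonical form."""
    tokens = re.sub(r"[^a-z0-9]+", " ", name.lower()).split()
    return " ".join(CANONICAL.get(tok, tok.upper()) for tok in tokens)


def build_index(vendor_records):
    """Load time: standardize the vendor names once and index by that form."""
    index = defaultdict(list)
    for record_id, name in vendor_records:
        index[standardize(name)].append(record_id)
    return index


def lookup(index, request):
    """Request time: standardize the request the same way, then match."""
    return index.get(standardize(request), [])


vendor = [(1, "North Oak Street"), (2, "S. Oak St"), (3, "Elm Avenue")]
index = build_index(vendor)
print(lookup(index, "n oak st."))
print(lookup(index, "Elm Ave"))
```

The important design point is that the same standardize() runs on both sides, so the vendor data and the request can only disagree in ways the standardizer does not know about.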
You can also use techniques like Metaphone/Soundex codes to do fuzzy
searching and then use Levenshtein distance to score the possible
matches by how close they are to the request.
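
As a rough sketch of that two-step idea, the Python below uses a Soundex code to pick candidates and Levenshtein distance to rank them. Soundex stands in here for whichever phonetic code you prefer (Metaphone would slot in the same way), and fuzzy_match and the sample street names are made up for illustration.

```python
def soundex(word):
    """American Soundex: first letter plus three digits, e.g. Robert -> R163."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = digit
    return (result + "000")[:4]


def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_match(request, names):
    """Keep names sharing the request's Soundex code, closest first."""
    key = soundex(request)
    hits = [(levenshtein(request.lower(), name.lower()), name)
            for name in names if soundex(name) == key]
    return sorted(hits)


print(fuzzy_match("Peachtre", ["Peachtree", "Pecan", "Beechtree"]))
```

Soundex keeps the first letter as-is, so "Beechtree" never becomes a candidate for "Peachtre" even though the two are only a few edits apart. That is one reason to treat the phonetic code as a coarse filter and not as the final word.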
You need to be prepared to handle multiple results for a query. For
example, you search for Oak St. but only find North Oak Street and South
Oak Street.
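
One way to handle the Oak St. case is to return every plausible candidate along with a note on how it matched, and let the caller decide. The sketch below assumes names have already been standardized into space-separated tokens; find_candidates and the match labels are hypothetical.

```python
# Directional tokens in their (assumed) standardized form.
DIRECTIONALS = {"N", "S", "E", "W"}


def strip_directionals(tokens):
    """Return the tokens with any N/S/E/W directional removed."""
    return [tok for tok in tokens if tok not in DIRECTIONALS]


def find_candidates(request, vendor_names):
    """Return (name, match_kind) pairs instead of a single answer."""
    req = request.split()
    results = []
    for name in vendor_names:
        tokens = name.split()
        if tokens == req:
            results.append((name, "exact"))
        elif strip_directionals(tokens) == strip_directionals(req):
            results.append((name, "directional differs"))
    return results


vendor_names = ["N OAK ST", "S OAK ST", "ELM AVE"]
print(find_candidates("OAK ST", vendor_names))
```

With two "directional differs" results and no exact one, the code has no basis for picking a winner, so the ambiguity has to be surfaced to the user or resolved with other information such as a house number or postal code.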
And all this can only happen after you have tagged some text in a
document, if you are doing tagging. You mention accuracy is important.
Well, how do you determine what is "right"? Remember the Oak St. example
above.
Anyway this is a good place to discuss this topic.
-Stephen Woodbridge
http://imaptools.com/