[OSGeo-Discuss] The existence (and value of) "clean" geocoding tools?

Stephen Woodbridge woodbri at swoodbridge.com
Thu Sep 25 00:44:06 EDT 2008


David Dearing wrote:
> Hi.  I just recently stumbled across OSGeo and have poked around to try 
> and get a feel for the different projects, but still have a lingering 
> question.  Forgive me if this isn't the appropriate channel to be asking 
> this.
> 
> It seems that there is a solid focus on mapping, image manipulation, and 
> geometric processing at OSGeo.  And, in the more broad world including 
> non-open source projects, there are a lot of tools available for the 
> mass production of geotagged or geocoded documents.  However, the 
> accuracy of these systems, while good, doesn't seem sufficient when 
> accuracy is at a premium (from what I've seen they tend to focus on 
> volume).
> 
> Are there any existing tools that can be used to tag/code documents, 
> perhaps sacrificing the mass-produced aspect for better accuracy?  Have 
> I just missed/overlooked some existing tool(s) that meet this 
> description?  Or, am I in the minority in wanting to produce fewer 
> "clean" geocoded/tagged documents rather than many "pretty good" documents?

Have you looked at these?
http://ofb.net/~egnor/google.html
http://www.pagcgeo.org/


Geocoding is NOT exact. In fact, it deals with a very messy area of 
natural language parsing. While it is more constrained than free text, 
it still has to deal with typos, abbreviations, punctuation, etc., and 
then it has to match the user input to some vendor data.

For example: matching AL 44, Alabama 44, AL-44, Alabama Highway 44, 
Highway 44, State Highway 44, Rt 44, and various other abbreviations for 
Highway; simple typos; N, N., North, S, S., South, etc. designations 
added to the highway; Alt., Bus., Byp., etc. modifiers; and on it goes. 
You also need to deal with accented characters, which are sometimes 
entered without accents.
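
To make that concrete, here is a minimal Python sketch of collapsing 
those highway variants into one form. The lookup tables, the "HWY" 
output format, and the function names are all my own assumptions for 
illustration; a real standardizer needs much larger tables and would 
probably keep the state rather than throw it away.

```python
import re
import unicodedata

# Hypothetical lookup tables, for illustration only.
HIGHWAY_WORDS = {"al", "alabama", "state", "highway", "hwy", "route", "rt", "rte"}
DIRECTIONS = {"n": "N", "north": "N", "s": "S", "south": "S"}
MODIFIERS = {"alt": "ALT", "alternate": "ALT", "bus": "BUS",
             "business": "BUS", "byp": "BYP", "bypass": "BYP"}


def strip_accents(text):
    """Remove accents so that 'Montréal' and 'Montreal' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def standardize_highway(text):
    """Return a form like 'HWY 44 BUS N', or None if it is not a highway."""
    tokens = re.findall(r"[a-z]+|\d+", strip_accents(text).lower())
    number, direction, modifier = None, "", ""
    for tok in tokens:
        if tok.isdigit() and number is None:
            number = tok
        elif tok in DIRECTIONS:
            direction = DIRECTIONS[tok]
        elif tok in MODIFIERS:
            modifier = MODIFIERS[tok]
        elif tok not in HIGHWAY_WORDS:
            return None  # unknown word: not something this sketch handles
    if number is None:
        return None
    return " ".join(p for p in ("HWY", number, modifier, direction) if p)


variants = ["AL 44", "Alabama 44", "AL-44", "Alabama Highway 44",
            "Highway 44", "State Highway 44", "Rt 44"]
standard_forms = {standardize_highway(v) for v in variants}
```

All seven variants end up as the same string, which is the whole point. 
Note that this sketch says nothing about typos; those need the fuzzy 
techniques.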

In a geocoder, you typically have a standardizer that sorts out all that 
craziness. Then when you load the geocoder, you standardize the vendor 
data and store it in a standard form. When you get a geocode request, you 
standardize the incoming request and then try to match the standard form 
with the vendor data, which is also in standard form. As an alternative 
to a standardizer, some geocoders use statistical record-matching 
techniques.
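
Roughly, the load-time and request-time flow looks like the Python 
sketch below. The Geocoder class, the abbreviation table, and the 
coordinates are made up for illustration and are not taken from any 
particular geocoder.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation table; real ones are far larger.
ABBREVIATIONS = {"street": "ST", "st": "ST", "avenue": "AVE", "ave": "AVE",
                 "road": "RD", "rd": "RD", "north": "N", "n": "N",
                 "south": "S", "s": "S"}


def standardize(address):
    """Lowercase, drop punctuation, and map known words to one spelling."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(ABBREVIATIONS.get(tok, tok.upper()) for tok in tokens)


class Geocoder:
    """Toy geocoder: exact lookup on the standardized form only."""

    def __init__(self):
        self._index = defaultdict(list)

    def load(self, records):
        # Load time: standardize the vendor data and store it by that key.
        for name, lon, lat in records:
            self._index[standardize(name)].append((name, lon, lat))

    def geocode(self, request):
        # Request time: standardize the request the same way, then look up.
        return list(self._index.get(standardize(request), []))


geocoder = Geocoder()
# Made-up coordinates, not real vendor data.
geocoder.load([("North Oak Street", -86.0, 33.0),
               ("Elm Avenue", -86.1, 33.1)])
matches = geocoder.geocode("N. Oak St")
```

The key design point is that the same standardize() runs on both sides, 
so the vendor data and the request always meet in the same form.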

You can also use techniques like metaphone/soundex codes to do fuzzy 
searching and then use Levenshtein distance to score the possible 
matched results for how close they are to the request.
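
As a sketch of that idea in Python: use the soundex code as a coarse 
filter, then rank whatever survives by Levenshtein distance. This uses 
plain American Soundex (not metaphone), and the fuzzy_match function and 
the sample names are assumptions of mine, not code from any geocoder.

```python
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(word):
    """American Soundex code for a word, e.g. 'Robert' gives 'R163'."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of identical codes
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code counts again
    return (code + "000")[:4]


def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        row = [i]
        for j, cb in enumerate(b, 1):
            row.append(min(prev_row[j] + 1,                 # deletion
                           row[j - 1] + 1,                  # insertion
                           prev_row[j - 1] + (ca != cb)))   # substitution
        prev_row = row
    return prev_row[-1]


def fuzzy_match(request, names):
    """Keep names with the same soundex code, closest spelling first."""
    target = soundex(request)
    scored = [(levenshtein(request.lower(), name.lower()), name)
              for name in names if soundex(name) == target]
    return sorted(scored)


results = fuzzy_match("Woodbrige", ["Woodbridge", "Woodburn", "Oak"])
```

Here the misspelled "Woodbrige" still finds "Woodbridge" at distance 1. 
"Woodburn" also passes the soundex filter but scores worse, and "Oak" is 
filtered out entirely.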

You need to be prepared to handle multiple results for a query. For 
example, you search for Oak St. but only find North Oak Street and South 
Oak Street.
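
One way to handle that case, sketched in Python under my own assumptions 
(the parse and candidates helpers and their tables are hypothetical): 
treat the directional prefix as optional, prefer exact matches, and 
otherwise return every candidate so the caller can decide.

```python
import re

# Hypothetical tables, for illustration only.
DIRECTIONS = {"n": "N", "north": "N", "s": "S", "south": "S",
              "e": "E", "east": "E", "w": "W", "west": "W"}
SUFFIXES = {"st": "ST", "street": "ST", "ave": "AVE", "avenue": "AVE"}


def parse(name):
    """Split a street name into (direction, core), e.g. ('N', 'OAK ST')."""
    direction, core = "", []
    for tok in re.findall(r"[a-z0-9]+", name.lower()):
        if tok in DIRECTIONS and not direction:
            direction = DIRECTIONS[tok]
        else:
            core.append(SUFFIXES.get(tok, tok.upper()))
    return direction, " ".join(core)


def candidates(request, vendor_names):
    """Return exact matches if any, else all matches ignoring direction."""
    req_dir, req_core = parse(request)
    exact, partial = [], []
    for name in vendor_names:
        ven_dir, ven_core = parse(name)
        if ven_core != req_core:
            continue
        if ven_dir == req_dir:
            exact.append(name)
        elif not req_dir:
            partial.append(name)  # request gave no direction: ambiguous
    return exact or partial


streets = ["North Oak Street", "South Oak Street", "Elm Street"]
ambiguous = candidates("Oak St.", streets)
unambiguous = candidates("N Oak St", streets)
```

For "Oak St." this returns both Oak Streets and leaves the choice to the 
caller; for "N Oak St" it returns only North Oak Street.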

And all this can only happen after you have tagged some text in a 
document, if you are doing tagging. You mention that accuracy is 
important, but how do you determine what is "right"? Remember the Oak 
St. example above.

Anyway, this is a good place to discuss this topic.

-Stephen Woodbridge
  http://imaptools.com/

