[OSGeo-Discuss] Automatic geocoding of PDF documents

Tue Jan 17 19:00:42 PST 2012

On 1/17/2012 2:51 PM, Arnie Shore wrote:
> I wonder if someone can describe what's seen as the
> tall-pole-in-the-tent here, difficulty-wise.

Arnie,

I think that there is no simple answer to this because it is largely 
defined by the specific requirements.

If your problem is scanning text and extracting location references, 
then the questions are: how do you recognize locations in a text 
document? How do you deal with different languages? How do you use 
context to disambiguate locations? And then, how do you geocode them?
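
To make the recognition and disambiguation questions concrete, here is a 
minimal Python sketch of a gazetteer lookup with a crude context check. 
The gazetteer contents, the approximate coordinates, and the function name 
find_locations are all invented for illustration. It only handles 
single-word English names, which is exactly the kind of simplification 
that real requirements tend to break.

```python
import re

# Made-up gazetteer: place name -> candidates as (region, (lat, lon)).
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "springfield": [("Illinois", (39.80, -89.64)),
                    ("Massachusetts", (42.10, -72.59))],
    "boston": [("Massachusetts", (42.36, -71.06))],
}

def find_locations(text):
    """Return (name, region, coords) for each gazetteer name found in text."""
    words = re.findall(r"[a-z]+", text.lower())
    results = []
    # dict.fromkeys keeps first-seen order and drops repeated names.
    for name in dict.fromkeys(w for w in words if w in GAZETTEER):
        candidates = GAZETTEER[name]
        # Use context: prefer a candidate whose region is also mentioned.
        in_context = [c for c in candidates if c[0].lower() in words]
        # With no contextual hint, fall back to the first listed candidate.
        region, coords = (in_context or candidates)[0]
        results.append((name, region, coords))
    return results

print(find_locations("The meeting in Springfield, Massachusetts was moved to Boston."))
```

With no region mentioned in the text, the sketch just takes the first 
candidate, which is an arbitrary choice and not a real disambiguation 
strategy.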

For the geocoding part, what are you geocoding? E.g., addresses, 
intersections, placenames, postal codes, landmarks, parcel data, 
geographic names, historical names? Do you have good reference data 
for these? What is your reference data set? How accurate/complete is 
it? How do you standardize it? How do you standardize your input 
locations? Are there different standardization rules for different types 
of data? For addresses in different countries? Fuzzy searching is 
another area of expertise that can be deployed in this problem area, 
and it has its own set of issues with respect to the specific requirements.
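
As a rough illustration of standardization plus fuzzy searching, here is a 
small Python sketch using the standard library's difflib. The abbreviation 
table, the reference list, the 0.8 cutoff, and the function names 
standardize and fuzzy_lookup are assumptions made up for the example, not 
anything from a particular geocoder. Real standardization rules vary by 
country and by data type.

```python
import difflib
import re

# Hypothetical abbreviation table; real rules differ by country and data type.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd", "north": "n"}

def standardize(text):
    """Lowercase, replace punctuation with spaces, and abbreviate known words."""
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)

def fuzzy_lookup(query, reference, cutoff=0.8):
    """Return reference entries whose standardized form is close to the query."""
    # Standardize both sides so "Street" and "St." compare as equal.
    by_std = {standardize(name): name for name in reference}
    hits = difflib.get_close_matches(
        standardize(query), list(by_std), n=3, cutoff=cutoff
    )
    return [by_std[h] for h in hits]

# Made-up reference data.
REFERENCE = ["100 Main Street", "100 Maine Avenue", "12 North Oak Road"]

print(fuzzy_lookup("100 Mian St.", REFERENCE))
```

The cutoff is where the requirements bite: set it too low and "Main" 
starts matching "Maine", set it too high and ordinary typos stop matching.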

Dealing with natural language issues, idiomatic and slang 
references, local knowledge issues, spelling abbreviations and errors, 
reference data errors, and missing data, and with how all of these 
interact, is probably one of the harder issues.

I'm not sure there is one long pole, more like 5-6 long poles ;-)

-Steve