[OSGeo-Discuss] Automatic geocoding of PDF documents

slesage slesage at geo.gob.bo
Tue Jan 17 07:48:07 PST 2012


On 2012-01-14 15:59, Andrew Turner wrote:
> On Fri, Jan 13, 2012 at 6:00 PM, slesage <slesage at geo.gob.bo> wrote:
>> Hi,
>>
>> does anybody know of any open-source software dedicated to automatic
>> geocoding of text documents? The idea of that "black box" would be:
>> * give, as an input, a text document or a PDF,
>> * receive, as an output, a list of place names with their coordinates
>>   or a map of POIs corresponding to those places.
>>
>> Using the GeoNames database (http://www.geonames.org/), the solution
>> appears to be only a full-text search, which could be done using Lucene
>> (https://lucene.apache.org/java/docs/index.html).
>>
>> I found the MetaCarta solution
>> (http://www.metacarta.com/products-platform-geotag.htm) but couldn't
>> find any open-source solution.
>
> The reason that there isn't an open-source solution is because it is
> Very Difficult. Even geocoding is difficult and until a short while
> ago there weren't any decent open-source geocoders. So we worked with
> Schuyler (formerly of Metacarta) to build an open-source one [1].
>
> Your idea of using the GeoNames gazetteer with Apache Lucene is
> interesting and I think I've seen it suggested before. However, at best
> it will find location names but will be missing any logic for
> disambiguation of words or relative locations. So you could likely find
> that "Paris" was mentioned, but not be sure if it's Paris, France or
> Paris, Texas, US.
>
> Gisgraphy [2] is an open-source option that says it provides full-text
> searching. I don't know more about it though.
>
> Definitely share what else you find or try.
>
> Andrew

Thanks for the links, Andrew, I will investigate them. I had seen
Gisgraphy before, but did not quite understand what its purpose is.
Has anybody used it? It seems to be developed by only one person; do
you think the community is broader?

To refine my ideas on a geocoding tool: I think fully automatic
processing would be very difficult, because of disambiguation and the
need to fix false positives and false negatives. A semi-automatic
approach would certainly be much more efficient, with subsequent
validation by the user and a learning engine to record these decisions.
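
To make the idea more concrete, here is a minimal sketch in Python of a
gazetteer lookup combined with a memory of user decisions. Everything in
it is an assumption for illustration: the tiny in-memory GAZETTEER stands
in for a real GeoNames dump, the coordinates are approximate, and the
names find_candidates and DecisionMemory are hypothetical. It is not how
any existing tool works.

```python
import re

# Hypothetical in-memory gazetteer; a real one would be loaded from a
# GeoNames dump. Coordinates are approximate, for illustration only.
GAZETTEER = {
    "paris": [
        {"name": "Paris", "country": "FR", "lat": 48.86, "lon": 2.35},
        {"name": "Paris", "country": "US", "lat": 33.66, "lon": -95.56},
    ],
    "la paz": [
        {"name": "La Paz", "country": "BO", "lat": -16.50, "lon": -68.15},
    ],
}


def find_candidates(text, gazetteer=GAZETTEER):
    """Return {matched name: [candidate places]} for each gazetteer name in text."""
    lowered = text.lower()
    found = {}
    for key, places in gazetteer.items():
        # Whole-word match, so "paris" does not match inside "comparison".
        if re.search(r"\b" + re.escape(key) + r"\b", lowered):
            found[key] = places
    return found


class DecisionMemory:
    """Remembers which candidate the user picked for an ambiguous name."""

    def __init__(self):
        self._choices = {}

    def record(self, key, country):
        """Store the user's choice (here identified by country code)."""
        self._choices[key] = country

    def resolve(self, key, places):
        """Return a place if unambiguous or already decided, else None."""
        if len(places) == 1:
            return places[0]
        chosen = self._choices.get(key)
        for place in places:
            if place["country"] == chosen:
                return place
        return None  # still ambiguous: ask the user
```

With this sketch, "Paris" stays unresolved until the user chooses once;
after memory.record("paris", "US") the same name resolves without asking
again. A real learning engine would of course need to take the context
of the document into account, not only the name.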

I think that kind of processing would be most efficient as a plugin for
a text editor, allowing:
* geocoding of a word selected by the user (selection -> right-click ->
  georeference, etc.)
* geocoding of a whole text, with a bubble for each word and three
  buttons for post-validation: "OK", "disambiguate" (your example of
  Paris, Texas), "not a location"
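
The three validation buttons could be modelled roughly as follows. This
is only a sketch under my own assumptions: Verdict, Annotation and
apply_verdict are hypothetical names, and the candidate labels are plain
strings rather than real gazetteer records.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Verdict(Enum):
    """The three post-validation buttons."""
    OK = "ok"
    DISAMBIGUATE = "disambiguate"
    NOT_A_LOCATION = "not a location"


@dataclass
class Annotation:
    """One bubble: a word plus the places the geocoder proposes for it."""
    word: str
    candidates: List[str]
    selected: Optional[str] = None
    rejected: bool = False


def apply_verdict(annotation, verdict, choice=None):
    """Update an annotation according to the button the user pressed."""
    if verdict is Verdict.OK:
        if not annotation.candidates:
            raise ValueError("no candidate to accept")
        # Accept the geocoder's first proposal.
        annotation.selected = annotation.candidates[0]
        annotation.rejected = False
    elif verdict is Verdict.DISAMBIGUATE:
        if choice not in annotation.candidates:
            raise ValueError(f"unknown candidate: {choice!r}")
        annotation.selected = choice
        annotation.rejected = False
    elif verdict is Verdict.NOT_A_LOCATION:
        # False positive: the word is not a place name at all.
        annotation.selected = None
        annotation.rejected = True
    return annotation
```

For the Paris example, pressing "disambiguate" would amount to
apply_verdict(a, Verdict.DISAMBIGUATE, "Paris, Texas"), where a is the
annotation for the word "Paris".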

I don't know if that sounds interesting or not. But without a doubt,
that means a lot of development! In order not to reinvent the wheel,
could anybody give me more hints on the two initiatives you mentioned
(the open-source geocoder and Gisgraphy), so I can better determine
which one it would be better to contribute to?

Thanks

Sylvain Lesage


