[OSGeo-Discuss] Automatic geocoding of PDF documents
slesage
slesage at geo.gob.bo
Tue Jan 17 07:48:07 PST 2012
On 2012-01-14 15:59, Andrew Turner wrote:
> On Fri, Jan 13, 2012 at 6:00 PM, slesage <slesage at geo.gob.bo> wrote:
>> Hi,
>>
>> does anybody know of open-source software dedicated to automatic
>> geocoding of text documents? The idea of that "black box" would be:
>> * give, as an input, a text document or a PDF,
>> * receive, as an output, a list of place names with their
>> coordinates / a map of POI corresponding to those places.
>>
>> Using the geonames database (http://www.geonames.org/), the solution
>> appears to be only a fulltext search, that could be done using Lucene
>> (https://lucene.apache.org/java/docs/index.html).
>>
>> I found the metacarta solution
>> (http://www.metacarta.com/products-platform-geotag.htm) but couldn't
>> find any opensource solution.
>
> The reason that there isn't an open-source solution is because it is
> Very Difficult. Even geocoding is difficult and until a short while
> ago there weren't any decent open-source geocoders. So we worked with
> Schuyler (formerly of Metacarta) to build an open-source one [1].
>
> Your idea of using the Geonames gazetteer with Apache Lucene is
> interesting and I think I've seen it suggested before. However, at
> best it will find location names but will be missing any logic for
> disambiguation of words or relative locations. So you could likely
> find that "Paris" was mentioned, but not be sure if it's Paris, France
> or Paris, Texas, US.
>
> Gisgraphy [2] is an open-source option that says it provides
> Full-text searching. I don't know more about it though.
>
> Definitely share what else you find or try.
>
> Andrew
Thanks for the links, Andrew, I will investigate them. I had seen
Gisgraphy before, but did not really understand what its purpose is.
Has anybody used it? It seems to be developed by only one person; do
you think the community is broader than that?
To refine my ideas about a geocoding tool: I think fully automatic
processing would be very difficult, because of disambiguation and the
need to fix false positives and false negatives. A semi-automatic
approach would certainly be much more efficient, with the user
validating the results afterwards and a learning engine recording these
decisions.
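
To make the semi-automatic idea a bit more concrete, here is a minimal
sketch in Python. Everything in it is an assumption for illustration:
the tiny GAZETTEER dictionary only stands in for a real Geonames
extract (coordinates are approximate), and find_candidates, resolve and
the decisions dictionary are hypothetical names, not part of any
existing tool. The "learning engine" is reduced to a lookup of earlier
user decisions.

```python
import re

# Hypothetical toy gazetteer standing in for a Geonames extract:
# lowercase name -> list of (label, latitude, longitude), approximate values.
GAZETTEER = {
    "paris": [("Paris, France", 48.86, 2.35),
              ("Paris, Texas, US", 33.66, -95.56)],
    "la paz": [("La Paz, Bolivia", -16.50, -68.15)],
}


def find_candidates(text, gazetteer=GAZETTEER):
    """Return (name, candidates) for every gazetteer name found in text."""
    hits = []
    lowered = text.lower()
    for name, candidates in gazetteer.items():
        # Whole-word match so that "paris" does not match "comparison".
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            hits.append((name, candidates))
    return hits


def resolve(name, candidates, decisions):
    """Resolve a name using recorded user decisions when available.

    decisions maps a name to the label the user chose earlier,
    or to None when the user said "not a location".
    """
    if name in decisions:
        chosen = decisions[name]
        if chosen is None:
            return "rejected", None
        for candidate in candidates:
            if candidate[0] == chosen:
                return "resolved", candidate
    if len(candidates) == 1:
        return "resolved", candidates[0]
    # Several candidates and no recorded decision: ask the user.
    return "ambiguous", None
```

With an empty decisions dictionary, "Paris" comes back as ambiguous and
has to be shown to the user; once the user has chosen, the same name is
resolved without asking again.
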
I think that kind of processing would work best as a plugin for a text
editor, allowing:
* geocoding of a word selected by the user (selection -> right click ->
georeference, etc.)
* geocoding of a whole text, with a bubble for each word, and three
buttons for post-validation: "OK", "disambiguate" (your example of
Paris, Texas), "not a location"
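
The three post-validation buttons could be modelled roughly as below.
This is only a sketch under my own assumptions: Action, Annotation and
validate are hypothetical names, and a real plugin would also have to
store each decision so the learning engine can reuse it.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """The three post-validation buttons shown in each bubble."""
    OK = "OK"
    DISAMBIGUATE = "disambiguate"
    NOT_A_LOCATION = "not a location"


@dataclass
class Annotation:
    word: str
    candidates: list          # candidate place labels for this word
    selected: int = 0         # index of the currently proposed candidate
    status: str = "pending"   # "pending", "accepted" or "rejected"


def validate(annotation, action, choice=None):
    """Apply the user's button press to one annotation and return it."""
    if action is Action.OK:
        # The proposed candidate is correct.
        annotation.status = "accepted"
    elif action is Action.DISAMBIGUATE:
        # The user picks another candidate by its index.
        if choice is None or not 0 <= choice < len(annotation.candidates):
            raise ValueError("choice must index an existing candidate")
        annotation.selected = choice
        annotation.status = "accepted"
    elif action is Action.NOT_A_LOCATION:
        # False positive: the word is not a place name here.
        annotation.status = "rejected"
    return annotation
```

For the Paris example, pressing "disambiguate" and picking the second
candidate would be validate(annotation, Action.DISAMBIGUATE, choice=1).
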
I don't know whether that sounds interesting or not. But without a
doubt, it means a lot of development! In order not to reinvent the
wheel, could anybody give me more hints on the two initiatives you
mentioned (the open-source geocoder and Gisgraphy), so I could better
determine which one it would be better to contribute to?
Thanks
Sylvain Lesage