[QGIS-Developer] GeoSeer ogc services data harvesting
Jonathan Moules
jonathan-lists at lightpear.com
Tue Jun 9 04:46:29 PDT 2020
Hi Andreas,
Interesting.
Behind the scenes, GeoSeer one-way hashes the GetCapabilities documents
and that hash is used as the document key. Identical GetCapabilities
documents therefore get the same key and thus only appear once in the
final index. But if even a single character differs anywhere in the
document, the hash is completely different.
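A minimal sketch of hash-keyed de-duplication, to illustrate the idea
rather than GeoSeer's actual code. SHA-256, the in-memory dict, and the
example URLs and document bodies are all assumptions made for the example.

```python
import hashlib


def document_key(capabilities: bytes) -> str:
    """Return a one-way hash of a GetCapabilities document to use as its key."""
    # SHA-256 is an assumption; any stable cryptographic hash behaves the same way.
    return hashlib.sha256(capabilities).hexdigest()


def build_index(documents):
    """Index (url, body) pairs by hash so byte-identical bodies collapse to one entry."""
    index = {}
    for url, body in documents:
        # setdefault keeps the first URL seen for a given key and ignores repeats.
        index.setdefault(document_key(body), url)
    return index


# Hypothetical documents: the first two are byte-identical, the third
# differs by a single character ("roads" vs "Roads").
docs = [
    ("http://a.example/wms", b"<WMS_Capabilities>roads</WMS_Capabilities>"),
    ("http://alias.example/wms", b"<WMS_Capabilities>roads</WMS_Capabilities>"),
    ("http://b.example/wms", b"<WMS_Capabilities>Roads</WMS_Capabilities>"),
]
index = build_index(docs)
print(len(index))  # 2
```

The two identical documents share one key, while the one-character
variant gets a key of its own.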
There's also de-duplication at the endpoint, service, and dataset levels
using a similar mechanism. GeoSeer also de-duplicates across services.
I.e. if something is served from the same place as both WMS and WFS, we
glue them together.
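Roughly, the gluing could look like the sketch below. It is only an
illustration: treating "the same place" as host plus path with the query
string dropped is an assumption, and the URLs are made up.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def endpoint_key(url: str) -> str:
    """Reduce a service URL to host plus path, dropping scheme and query string."""
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path.rstrip("/")


def glue_services(records):
    """Group (url, service_type) pairs by endpoint so WMS and WFS at one place merge."""
    grouped = defaultdict(set)
    for url, service_type in records:
        grouped[endpoint_key(url)].add(service_type)
    return dict(grouped)


# Hypothetical records: one endpoint serving both WMS and WFS, one WMS-only.
records = [
    ("https://maps.example/ows?SERVICE=WMS&REQUEST=GetCapabilities", "WMS"),
    ("https://maps.example/ows?SERVICE=WFS&REQUEST=GetCapabilities", "WFS"),
    ("https://other.example/wms?SERVICE=WMS&REQUEST=GetCapabilities", "WMS"),
]
glued = glue_services(records)
print(sorted(glued["maps.example/ows"]))  # ['WFS', 'WMS']
```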
The problem with using DNS is that you get organisations the size of
NOAA/USGS and they have deployments across various subdomains that are
doing different (but similar) things. You also get a kind of opposite -
a single domain belonging to a geospatial "cloud" hosting provider that
has lots of layers that have the same names and similar metadata because
all their local-government customers are sharing their own
fire-stations/roads etc.
There are all manner of ways in which server admins and data custodians
make this more complicated than it seems. :-)
Cheers,
Jonathan
On 2020-06-09 12:25, Andreas Neumann wrote:
>
> Hi Jonathan,
>
> Thanks for sharing this information. I don't know anything better.
>
> While looking at some services that I know personally, I also found
> out that other services are listed twice, because a machine might
> have a DNS alias. That is also something to consider - perhaps sort
> out machines that have identical GetCapabilities responses and just
> the DNS name varies.
>
> I agree, the numbers probably wouldn't change significantly.
>
> Thanks and greetings,
>
> Andreas
>
> On 2020-06-09 13:14, Jonathan Moules wrote:
>
>> Hi Andreas,
>> Sure, happy to share.
>> There's a little on the About page: https://www.geoseer.net/about.php
>> and then scattered around blog posts (the ones with the "GeoSeer" tag
>> are probably best for that: https://www.geoseer.net/blog/?t=GeoSeer
>> ), but put simply - We scrape a lot of different sources and metadata
>> catalogs and get the services from them. Then we request not only the
>> GetCapabilities that was declared, but also make educated guesses as
>> to what else might be on the box and request those too.
>>
>> It's not perfect, but to the best of my knowledge it's by far the
>> largest such index in the world, and more importantly, it's
>> *current*. Everything in there responded with a valid GetCapabilities
>> document with at least one meaningful named dataset when it was last
>> scraped, which will have been within the last few weeks.
>>
>> Pertaining to your given services, GeoSeer has:
>> http://geoweb.so.ch/wms/sogis_natgef.wms? and a few others on that
>> sub-domain, as well as some on the subdomain:
>> http://www.sogis1.so.ch/cgi-bin/sogis/sogis_natgef.wms? - both are
>> now defunct, I see, which is why they're not in the database.
>>
>> Thanks for the URL, I've added it for scraping.
>>
>>> So I wonder how many other QGIS server installations may not be in
>>> your database?
>> Alas that's an "unknown unknown"; there's no way to know (I can't
>> think of a way to find out anyway; suggestions welcome). However the
>> vast majority of the time when I come across a new service manually
>> (i.e. from following various mailing lists like this), it turns out
>> it's already in the index, so I think it's reasonably comprehensive
>> at this point.
>>
>> While missing servers may change the absolute number of QGIS
>> installations, they're very unlikely to change the proportions. For a
>> sample-size this large I'd expect the proportions to remain largely
>> the same, certainly for deployments.
>>
>> Hope that's of interest and answers the question,
>> Cheers,
>> Jonathan
>>
>>
>> On 2020-06-09 10:45, Andreas Neumann wrote:
>>>
>>> Hi Jonathan,
>>>
>>> Can you share with us how you harvest your information on available
>>> public OGC services? You probably have that information published
>>> somewhere - so if you could point me towards this URL, it would help.
>>>
>>> I noticed that all of the services of our province (my employer)
>>> can't be found, as an example.
>>>
>>> Here is the start point:
>>>
>>> https://so.ch/verwaltung/bau-und-justizdepartement/amt-fuer-geoinformation/geoportal/geodienste/wms-web-map-service/
>>>
>>> and the GetCapabilities link:
>>>
>>> https://geo.so.ch/api/wms?SERVICE=WMS&REQUEST=GetCapabilities&VERSION=1.3.0
>>>
>>> So I wonder how many other QGIS server installations may not be in
>>> your database? Of course I know you don't claim full coverage, but
>>> it would still be good to know how you harvest your data.
>>>
>>> Thanks for clarifying and greetings,
>>>
>>> Andreas
>>>