[gdal-dev] Open source vector geoprocessing libraries?

Jason Roberts jason.roberts at duke.edu
Wed Jan 13 10:13:11 EST 2010


Mateusz,

Thank you very much for your insight. I have a few more questions I'm hoping
you could answer.

> An alternative is to try to divide the task:
> 1. Query features from the data source using the spatial index
> capability of the data source.
> 2. With only the subject features selected, apply the geometric
> processing.

That sounds like a reasonable approach. Considering just the simpler
scenarios, such as the one I mentioned, is it possible to implement it
efficiently with OGR compiled against GEOS? I believe OGR can pass
SQL directly to the data source driver, allowing the caller to submit SQL
containing spatial operators. In principle, one could submit a spatial query
to PostGIS or SpatiaLite and efficiently get back the features (including
geometry) that could possibly intersect a bounding box. Then one could use
the GEOS functions on OGRGeometry to do the actual intersecting. Is that
what you were suggesting?
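
For concreteness, here is roughly what I have in mind, using the OGR Python
bindings. This is only a sketch of that two-step approach; the connection
string, table and column names, and the bounding box are placeholders:

from osgeo import ogr

# Placeholder PostGIS connection and table/column names.
ds = ogr.Open("PG:dbname=gis host=localhost")

# Step 1: let PostGIS use its spatial index; the && operator compares only
# bounding boxes, so this is the cheap candidate selection.
sql = """
    SELECT gid, geom
    FROM gshhs_full
    WHERE geom && ST_MakeEnvelope(-82.0, 24.0, -79.0, 27.0, 4326)
"""
candidates = ds.ExecuteSQL(sql)   # passed through to PostgreSQL unchanged

# Step 2: exact geometric processing via the GEOS-backed OGR methods.
study_region = ogr.CreateGeometryFromWkt(
    "POLYGON((-82 24, -79 24, -79 27, -82 27, -82 24))")

clipped = []
feature = candidates.GetNextFeature()
while feature is not None:
    geom = feature.GetGeometryRef()
    if geom.Intersects(study_region):                    # GEOS
        clipped.append(geom.Intersection(study_region))  # GEOS
    feature = candidates.GetNextFeature()

ds.ReleaseResultSet(candidates)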

Of course, it may be that PostGIS or SpatiaLite can handle both steps 1 and
2 in a single query. If so, would it be best to do it that way?
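
If so, a sketch of that single-query variant, so that only the already-clipped
geometries come back from the database (same placeholder names as above):

from osgeo import ogr

ds = ogr.Open("PG:dbname=gis host=localhost")   # placeholder connection

# PostGIS does both the index scan (&&) and the exact clip (ST_Intersection);
# OGR only has to materialise the result layer.
sql = """
    SELECT gid,
           ST_Intersection(geom,
               ST_MakeEnvelope(-82.0, 24.0, -79.0, 27.0, 4326)) AS geom
    FROM gshhs_full
    WHERE geom && ST_MakeEnvelope(-82.0, 24.0, -79.0, 27.0, 4326)
"""
result = ds.ExecuteSQL(sql)
print(result.GetFeatureCount(), "clipped features returned")
ds.ReleaseResultSet(result)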

It appears that the OGR shapefile driver supports a spatial indexing scheme
(.qix file) that is respected by OGRLayer::SetSpatialFilter. The
documentation says that "Currently this test is may be inaccurately
implemented, but it is guaranteed that all features who's envelope (as
returned by OGRGeometry::getEnvelope()) overlaps the envelope of the spatial
filter will be returned." Therefore, it appears that the shapefile driver
can implement step 1 but not step 2. Is that correct?
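
If so, the two steps would look something like this with the Python bindings
(the file and layer names are placeholders). My understanding is that the
driver can even build the .qix itself through its CREATE SPATIAL INDEX SQL,
and the exact test stays with the caller:

from osgeo import ogr

ds = ogr.Open("gshhs_f_L1.shp", 1)   # placeholder shapefile, opened for update

# Ask the Shapefile driver to build the .qix index.
ds.ExecuteSQL("CREATE SPATIAL INDEX ON gshhs_f_L1")

layer = ds.GetLayer(0)

# Step 1: envelope-only candidate selection; with a .qix present the driver
# can use it instead of scanning every record.
layer.SetSpatialFilterRect(-82.0, 24.0, -79.0, 27.0)

# Step 2: the exact test is up to the caller, via the GEOS-backed methods.
study_region = ogr.CreateGeometryFromWkt(
    "POLYGON((-82 24, -79 24, -79 27, -82 27, -82 24))")

hits = 0
feature = layer.GetNextFeature()
while feature is not None:
    if feature.GetGeometryRef().Intersects(study_region):   # GEOS
        hits += 1
    feature = layer.GetNextFeature()
print(hits, "features really intersect the study region")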

> The problem with OGR and GEOS is the cost of translating an OGR
> geometry to a GEOS geometry. It can be a bottleneck.

Is it correct that this cost would only be incurred when you call OGR
functions implemented by GEOS, such as OGRGeometry::Intersects,
OGRGeometry::Disjoint, etc? 
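
If so, I assume one could keep that cost down by rejecting most candidates
with a pure-OGR envelope comparison before ever calling a GEOS-backed method.
A small sketch of what I have in mind (my assumption, not something OGR does
automatically):

from osgeo import ogr

def count_exact_hits(layer, study_region):
    """Count features that truly intersect study_region, converting to GEOS
    only for candidates whose envelopes overlap (my assumption about where
    the OGR-to-GEOS translation cost is paid)."""
    # GetEnvelope() returns (minx, maxx, miny, maxy) -- note the ordering.
    sminx, smaxx, sminy, smaxy = study_region.GetEnvelope()
    hits = 0
    feature = layer.GetNextFeature()
    while feature is not None:
        geom = feature.GetGeometryRef()
        gminx, gmaxx, gminy, gmaxy = geom.GetEnvelope()
        overlaps = not (gmaxx < sminx or gminx > smaxx or
                        gmaxy < sminy or gminy > smaxy)
        if overlaps and geom.Intersects(study_region):   # only this uses GEOS
            hits += 1
        feature = layer.GetNextFeature()
    return hits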

> There are plenty of combinations, and my point is that if performance
> (not only in terms of speed, but of any resource) is critical, it would
> be extremely difficult to provide an efficient implementation of such
> features in OGR with a guaranteed, or even determinable, degree of
> complexity. Without those guarantees, I see little use for such a
> solution.

Yes, I see what you mean. But I would suggest to the open source community
that there is still value in implementing such features, either as part of
OGR or in another library, even if optimal performance cannot be guaranteed
in all scenarios. The reason is that ArcGIS provides exactly these kinds of
generic tools (e.g. intersect/union/symmetric difference of layers,
regardless of the underlying storage). These geoprocessing tools are
considered the most basic capabilities of ArcGIS and are available in even
the cheapest versions of the software. IMHO, if the open source community
wants to win over a large number of ArcGIS users to open GIS systems, it
needs to provide parity with these basic tools.

Thanks again,

Jason

-----Original Message-----
From: Mateusz Loskot [mailto:mateusz at loskot.net] 
Sent: Tuesday, January 12, 2010 8:33 PM
To: Jason Roberts
Cc: 'gdal-dev'
Subject: Re: [gdal-dev] Open source vector geoprocessing libraries?

Jason Roberts wrote:
> Mateusz,
> 
> I'm not an expert in this area, but I think that big performance 
> gains can be obtained by using a spatial index.

Yes, likely true.

> For example, consider a situation where you want to clip out a study 
> region from the full resolution GSHHS shoreline database, a polygon 
> layer. The shoreline polygons have very large, complicated 
> geometries. It would be expensive to loop over every polygon, loading
>  its full geometry and calling GEOS. Instead, you would use the 
> spatial index to isolate the polygons that are likely to overlap with
>  the study region, then loop over just those ones.

GEOS, like JTS, provides support for various spatial indexes.
It is possible to index data and optimise the operation in the manner you
mention; in fact, GEOS uses an index internally in various operations.
The problem is that such an index is not persistent and is not serialised
anywhere, so it exists only in memory. And in fact, there are many more
problems than this one.
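
As a minimal sketch of what "in memory only" means, here is the GEOS STRtree
used through Shapely (assumed to be >= 2.0, and not part of OGR; it is just a
convenient wrapper for illustration). The tree has to be rebuilt from the
geometries on every run; nothing is written to disk:

from shapely.geometry import box
from shapely.strtree import STRtree

# Pretend these geometries came from reading a layer; in reality they would
# be parsed from WKB/WKT first, which is itself part of the cost.
geoms = [box(i, i, i + 1, i + 1) for i in range(100000)]

tree = STRtree(geoms)                       # built in memory only
study_region = box(10.5, 10.5, 20.5, 20.5)

# query() returns indices of geometries whose extents intersect the query
# geometry's extent; the exact test is still up to the caller.
candidates = [geoms[i] for i in tree.query(study_region)]
hits = [g for g in candidates if g.intersects(study_region)]
print(len(candidates), "candidates,", len(hits), "exact hits")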

BTW, PostGIS is an example of such index serialisation: its spatial
index is persisted in the database rather than rebuilt in memory.

OGR does not provide any spatial indexing layer common to the various
vector datasets. For many simple formats it falls back to a brute-force
selection.

An alternative is to try to divide the task:
1. Query features from the data source using the spatial index
capability of the data source.
2. With only the subject features selected, apply the geometric
processing.

I did it that way, actually.

> If OGR takes advantage of spatial indexes internally (e.g. if the 
> data source drivers can tell the core about these indexes, and the 
> core can use them when OGRLayer::SetSpatialFilter is called), then 
> many scenarios could be efficiently implemented by just OGR and GEOS 
> alone.

The problem with OGR and GEOS is the cost of translating an OGR geometry
to a GEOS geometry. It can be a bottleneck.

However, if such processing functionality were considered for building
into OGR, that would make sense, but I still see limitations.

Let's brainstorm a bit and assume OGR implements this operation:

OGRLayer OGR::SymDifference(OGRLayer layer1, OGRLayer layer2);

Depending on the data source, OGR could exploit its capabilities.
If both layers sit in the same PostGIS (or other spatial)
database, OGR just delegates the processing to PostGIS,
where ST_SymDifference is executed, and OGR only grabs the
results and generates an OGRLayer.
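
In fact, a caller can already do that delegation by hand today through
ExecuteSQL, which the PostgreSQL driver passes straight to the database.
A rough sketch, with made-up table names, pairing features by gid only to
keep the illustration simple:

from osgeo import ogr

# Both layers live in the same PostGIS database (made-up connection string
# and table names), so the whole operation can be pushed down; OGR only
# materialises the result layer.
ds = ogr.Open("PG:dbname=gis host=localhost")

sql = """
    SELECT a.gid, ST_SymDifference(a.geom, b.geom) AS geom
    FROM parcels_2009 AS a
    JOIN parcels_2010 AS b ON a.gid = b.gid
"""
# Joining on gid keeps the example simple; a real layer-level overlay
# would need a spatial join instead.
result = ds.ExecuteSQL(sql)        # executed by PostgreSQL, not by OGR
print(result.GetFeatureCount(), "result features")
ds.ReleaseResultSet(result)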

What if layer1 is a Shapefile and layer2 is an Oracle table?
Let's assume the Shapefile has a .qix file with a spatial index
and Oracle has its own index. What does OGR do?

Does it load the .qix into memory, then fetch layer2 and decide which
features to select from layer1?
Does it load the whole Shapefile into memory and use the Oracle index to
select the features from layer2 "masked" by layer1?
And how does it estimate which data is cheaper to transfer in which
direction, and so on?
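
For instance, the second strategy, done by hand with the current API, might
look roughly like this (the connection strings and names are made up); even
this naive sketch already issues one indexed Oracle query per Shapefile
feature:

from osgeo import ogr

shp_ds = ogr.Open("parcels.shp")                    # layer1, local Shapefile
ora_ds = ogr.Open("OCI:scott/tiger@orcl:PARCELS")   # layer2, made-up OCI string
layer1 = shp_ds.GetLayer(0)
layer2 = ora_ds.GetLayer(0)

pieces = []
feat1 = layer1.GetNextFeature()
while feat1 is not None:
    g1 = feat1.GetGeometryRef()
    minx, maxx, miny, maxy = g1.GetEnvelope()
    # One indexed request against Oracle per Shapefile feature: cheap per
    # call, but the total cost grows with the size of layer1.
    layer2.SetSpatialFilterRect(minx, miny, maxx, maxy)
    layer2.ResetReading()
    feat2 = layer2.GetNextFeature()
    while feat2 is not None:
        pieces.append(g1.SymDifference(feat2.GetGeometryRef()))   # GEOS
        feat2 = layer2.GetNextFeature()
    feat1 = layer1.GetNextFeature()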

Certainly, it depends on the number of elements, the algorithm used, the
direction in which the algorithm is applied (which layer is the subject
and which the object), and much more.

There are plenty of combinations, and my point is that if performance
(not only in terms of speed, but of any resource) is critical, it would
be extremely difficult to provide an efficient implementation of such
features in OGR with a guaranteed, or even determinable, degree of
complexity. Without those guarantees, I see little use for such a
solution.

Given that, my suggestion is to write a specialised application for your
needs, using the available tools like OGR and GEOS, optimised for the
specifics of your datasets, the type of processing, the system
requirements, etc.

> If not, then your suggestion may be as fast as any other. For 
> example, the idea of loading the features in to PostGIS or SpatiaLite
>  will require loading all of the full geometries, passing them to 
> another database system, etc, etc. It may be that shuffling all of 
> the data around will be hugely expensive and that just using OGR 
> functions with simple approaches like calling GEOS from nested loops 
> will be faster than shuffling the data to a system that implements a 
> more efficient approach once the data gets there.

It's never "just using". Performance is usualy a concern regarding large
datasets. Large datasets are unlikely to be stored in a simple
format, but in proper spatial data storage, like PostGIS.
It nicely combines all the elements necessary to perform geometrical
processing in usable and optimised form, with index.

> Is that basically what you are saying?

It is.

Best regards,
-- 
Mateusz Loskot, http://mateusz.loskot.net
Charter Member of OSGeo, http://osgeo.org


