[postgis-devel] (no subject)

Mon Jun 27 03:38:57 PDT 2005

Other analisys from the GeomUnion world.

We found that bottleneck for GeomUnion operation
is build of graph and noding of lineworks.

This means that reducing these operations will
speed up the query. Issuing a buffer(0) on the
collection of ALL input is the most performant
way to accomplish this.

This can already be done with all postgis releases
using buffer(collect(the_geom),0);

Now. This approach is memory hungry as the uniq graph
being build will contain the FULL topology set
(vertex + edges). 

Let's see how we can leverage this problem.

*GEOS improvements*

(1) When building the graph GEOS makes *copies* of elements instead
of using references to them. Improvements could be done here, using
a new Coordinates bucket to store intersections.

(2) GEOS Coordinates are 3d, even when input is 2d only. Graph
elements are composed of Coordinates, so this influences total
size of them. We could instead use 2d+arbitrary_number_of_dimensions
approach, something on which Martin is working within development
version of JTS.

(3) GEOS threats all binary operation in the same way, building
full graph, labeling it and then extracting required elements.
It could rather take into account requested operation thus
removing out-of-result elements as soon as they are detected.

*PostGIS improvements*

On the postgis side, the only thing we can do is group input geoms
in chunks and invoke graph operations for each chunk.

_IF_ the result of each graph operation will contain less points
then the sum of input points, we'd have reduced memory occupation.
The more points this operation will strip away, the less memory
next operation will require.

_ELSE_ we would only be building more graphs, spending more
CPU cycles and thrashing more.

Now, research topic is "how do we compose chunks to maximize the reduction
of points?"

A few observation:

	(1) Isolated geometries will keep all their points
	(2) Geometries fully contained in another geometry will disappear
	(3) Union of intersecting or touching sets of geometries will
	    contain less points then sum of input.

All statement actually say the same thing. Good formed chunks are
those rich of intersection!

We experimented SPATIAL ORDERING to augment probability
of intersection. Btree is simple to perform and can use
an index ( a btree index ) while GiST ordering requires clustering
of table.

About chunks size I'd add another observation:

	(4) Topologically correct geoms won't reduce in a buffer(0)
	    operation.

This is importat because we can define size of chunks by either
number of geometries (a) or number of input points (b).
I implemented and experimented both.

(a) Number of input points give us more control on the memory required to
store each chunk and graph.

(b) Number of input geometries give us more control on the number of 
geometries involved in each graph.

Since our main problem is memory occupation I think (a) is more
appropriate. Anyway we need a way to let the user set this size,
as it will greatly influence performances of the operation.

This could be a compile-time define or an information extracted
from the postgresql running system (is it possible Mark ?).
I'd like the second most.

*Conclusions*

Generally speaking I think our big problem here is GEOS.

Implementing chunked work in PostGIS is still worth it,
but I think we can get the most out of improving GEOS
performances.

I'd close the PostGIS work first so we close a step
and then would take care of GEOS.

For closing PostGIS work i see these steps:

	- Obsolete MemGeomUnion (do we still have a use for it?)

	- Modify GeomUnion definition
	  (requires dump/reload OR postgis_proc_upgrade.pl script
	   ability to replace it)

	- Add ordering tip in performance chapter of manual.

Comments welcome.

--strk;