[postgis-devel] brainstorming about topology polygonizer

Thu Sep 15 08:21:43 PDT 2016

As reported before, I'm experimenting with a function to
determine faces generated by correctly linked edges in
a topology.

Right now I'm using the "Arezzo UCS" dataset, which is composed of
16746 shells (CCW rings) and 1817 holes (CW rings), built from
a total of 47708 edges.

These numbers mean there will be 16746 faces in the end (one per shell),
with the total number of holes inside them being at most 1817-1=1816
(*at least* one "hole" must belong to the universe face, being the
outermost shell).

Now the current algorithm (which can be seen in
https://git.osgeo.org/gogs/strk/postgis/src/batch-topo)
goes as follows (pseudo-code):

 For each yet-to-visit edge-side:
   Compute edge-side ring (walking)
   If edge-side ring is a shell (ccw):
     - Create a face, register it in each of the ring edge sides
       (marking the edge side as visited)
     - Save the shell in a "shells container"
   Otherwise (it is a hole, clockwise):
     - Register each of the ring edge sides as being a "hole"
       (marking the edge side as being a hole, and thus visited)
     - Save the ring in a "holes container"

 For each of the elements in the "holes container":
   - Find face-shell containing an arbitrary vertex of the hole ring
     (from the "shells container")
   - Register it in each of the ring edge sides
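
The two geometric tests the pseudo-code relies on (is a ring a shell or
a hole, and does a shell contain a given hole vertex) can be sketched
in plain C as below. This is only an illustration, not the code in the
branch: the Pt type and the ring_* function names are made up for the
example, rings are assumed to be given without a repeated closing
vertex, and the containment test is a simple ray-casting check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vertex type, used only for this sketch */
typedef struct { double x, y; } Pt;

/* Twice the signed area of a closed ring (shoelace formula).
 * The ring has n vertices, with the first one NOT repeated at the end.
 * Positive result: counter-clockwise; negative result: clockwise. */
static double ring_signed_area2(const Pt *r, size_t n)
{
    double a = 0.0;
    assert(n >= 3);
    for (size_t i = 0; i < n; i++) {
        const Pt *p = &r[i];
        const Pt *q = &r[(i + 1) % n];
        a += p->x * q->y - q->x * p->y;
    }
    return a;
}

/* A ring is treated as a shell when it winds counter-clockwise */
static int ring_is_shell(const Pt *r, size_t n)
{
    return ring_signed_area2(r, n) > 0.0;
}

/* Even-odd ray casting: returns 1 if pt lies strictly inside the ring.
 * Points exactly on the ring boundary are not handled reliably. */
static int ring_contains(const Pt *r, size_t n, Pt pt)
{
    int inside = 0;
    assert(n >= 3);
    for (size_t i = 0, j = n - 1; i < n; j = i++) {
        /* Does the edge j->i straddle the horizontal line through pt? */
        if ((r[i].y > pt.y) != (r[j].y > pt.y)) {
            double xcross = r[j].x + (pt.y - r[j].y) *
                            (r[i].x - r[j].x) / (r[i].y - r[j].y);
            if (pt.x < xcross)
                inside = !inside;
        }
    }
    return inside;
}
```

Two caveats for real use: a hole vertex can sit inside several nested
shells, so the smallest containing shell is the one to pick, and the
vertices of a hole also lie on the shells of the faces it encloses,
which is exactly the boundary case this ray-casting test does not
handle.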

This is proving effective, but memory hungry (I stopped the process
when it was taking more than 20 GB of RAM).

Theoretically, holding "holes" and "shells" in memory should not
take much more than the size of all the face geometries, which
I've computed for this case to be ~228 MB.
Even considering the multiple representations of each face geometry
component (edges, polygon, geos, prepared) I could understand a 10x
increase in size, but this is close to a 100x increase (20000 MB
from 228 MB).

So my current theory is that the RAM is being used by the DETOASTed
geometries that the postgresql module converts during backend
callbacks. Right now the callback code to fetch and return geometries
to the library does something like this:

   geom = (GSERIALIZED *)PG_DETOAST_DATUM_COPY(dat);
   edge->geom = lwgeom_from_gserialized(geom);

The library will only clean up edge->geom, once it is done using
it, but what about the DETOAST_DATUM_COPY ?

Normally, all that memory would get released at the end of the
outer function scope. Not a big deal while the functions do a few
operations, but the "polygonize" function (both the new and the old)
can perform a lot of operations. Even the ST_CreateTopoGeo function could.

I'll try a different approach, along these lines:

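   /* Detoast into a private copy that we own */
   geom = (GSERIALIZED *)PG_DETOAST_DATUM_COPY(dat);
   lwg = lwgeom_from_gserialized(geom);
   /* Deep-copy so edge->geom no longer points into geom */
   edge->geom = lwgeom_clone_deep(lwg);
   lwgeom_free(lwg);
   /* Release the detoasted copy right away */
   pfree(geom);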

I'm afraid that doing so would still keep the Datum memory around
unless the memory context is switched, which I suspect is not the
case as we call SPI_connect only once for the whole lifetime of the
function.
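
To make the concern about contexts concrete, here is a toy model in
plain C of "everything allocated in one context stays alive until
that context is reset". It is not PostgreSQL code: ToyContext,
toy_alloc, toy_reset and simulate_peak are invented for the example,
and the per-edge size is an arbitrary number. It only compares peak
usage when a context lives for the whole function against one that is
reset after every callback.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a memory context: it tracks its allocations so that
 * they can all be released together. NOT the PostgreSQL API. */
typedef struct Chunk { struct Chunk *next; } Chunk;

typedef struct {
    Chunk *head;        /* linked list of live allocations */
    size_t live_bytes;  /* bytes currently allocated */
    size_t peak_bytes;  /* highest value live_bytes ever reached */
} ToyContext;

/* Allocate size bytes owned by the context (payload follows the header) */
static void *toy_alloc(ToyContext *cx, size_t size)
{
    Chunk *c = malloc(sizeof(Chunk) + size);
    assert(c != NULL);
    c->next = cx->head;
    cx->head = c;
    cx->live_bytes += size;
    if (cx->live_bytes > cx->peak_bytes)
        cx->peak_bytes = cx->live_bytes;
    return c + 1;
}

/* Release everything the context owns, keeping the peak counter */
static void toy_reset(ToyContext *cx)
{
    while (cx->head != NULL) {
        Chunk *next = cx->head->next;
        free(cx->head);
        cx->head = next;
    }
    cx->live_bytes = 0;
}

/* Simulate n callbacks, each allocating one detoasted edge copy of
 * edge_bytes. With reset_each_call set, the context is cleared after
 * every callback; otherwise only once, at the very end. */
static size_t simulate_peak(size_t n, size_t edge_bytes, int reset_each_call)
{
    ToyContext cx = { NULL, 0, 0 };
    size_t peak;
    for (size_t i = 0; i < n; i++) {
        (void) toy_alloc(&cx, edge_bytes);
        if (reset_each_call)
            toy_reset(&cx);
    }
    peak = cx.peak_bytes;
    toy_reset(&cx);  /* the "end of the outer function" */
    return peak;
}
```

With 1000 callbacks of 64 bytes each, the peak is 64000 bytes without
per-call resets and 64 bytes with them. In the backend the
corresponding tools would presumably be a short-lived context created
with AllocSetContextCreate, entered with MemoryContextSwitchTo and
cleared with MemoryContextReset; whether that fits the callback
structure here is something I have not tested.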

Enough for a first braindump.
I hope this is at least useful to spread some info about what
kind of algorithm I'm building :)

--strk;