RE: [postgis-users] Performance question (how many years will this take?)

Wood Brent pcreso at pcreso.com
Tue May 11 13:18:25 PDT 2004


> The $CELL_ID reference I don't understand ... seems like it might be either
> a variable with the value of cell_id or a typographic error.


Some background, as this thread seems to have sparked a bit of interest...

The analysis is of 200,000 fishing trawls for which we have start/finish
points.
Each trawl tow has about 25 attributes, in addition to the tow line stored as a
LINESTRING in a geometry table.

To implement a structure to facilitate the analysis I want to carry out, I have
created a grid comprising about 30,000 polygons in a separate table.

Each grid cell carries the polygon geometry plus a few attributes, one of
which is a unique cell ID.

The SQL to create the join of the lines/cells is the one being discussed in
this thread.
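
To make the SQL below easier to follow, the two tables look roughly like
this. The names, columns & SRID are illustrative assumptions, not the actual
schema; AddGeometryColumn & a GiST index are the standard PostGIS setup:

create table trawls (
  trawl_id   integer primary key,
  swept_area float8          -- one of the ~25 tow attributes
  -- ... remaining attributes elided
);
select AddGeometryColumn('trawls', 'geom', -1, 'LINESTRING', 2);

create table cells (
  cell_id integer primary key,   -- the unique cell ID
  cell_x  float8,                -- cell centre, used later for gridding
  cell_y  float8
);
select AddGeometryColumn('cells', 'geom', -1, 'POLYGON', 2);

-- GiST indices on both geometry columns, so && can use them
-- (gist_geometry_ops is the opclass in the PostGIS of the day):
create index trawls_geom_idx on trawls using gist (geom gist_geometry_ops);
create index cells_geom_idx  on cells  using gist (geom gist_geometry_ops);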



I was not aware that the intersects() function on its own fails to use the
GiST indices. That was performance problem number one.

When the two datasets were joined with a && where clause as well as
intersects(), to make use of the indices, I ran out of memory very quickly.
There are about 6,000,000,000 potential combinations to check (200,000 trawls
x 30,000 cells), and doing them all at once required more resources than my
system had.

So the final query was run from an iterative script, processing 100 cells per
iteration. It never even swapped, running in main memory the whole time, and
took about 45 minutes.

Along the lines of:

#!/bin/sh
# Batch the line/cell join, 100 cells per pass.
DB=mydb                     # database name is an assumption

START_ID=0
END_ID=`expr $START_ID + 100`
# "get max cell_id" in the original -- e.g. via psql:
LAST_ID=`psql -t -A -c "select max(cell_id) from cells" $DB`

while [ $START_ID -le $LAST_ID ] ; do
  psql $DB <<EOF
insert ...              -- target table & column list as in the real script
select ...
where cells.cell_id >= $START_ID and cells.cell_id < $END_ID
  and trawls.geom && cells.geom
  and intersects(trawls.geom, cells.geom);
EOF

  START_ID=$END_ID
  END_ID=`expr $START_ID + 100`
done


Basically doing the same work in a series of manageable bites instead of an
indigestible lump.

It worked for me :-)

I can now run queries to return XYZ tuples for gridding, like

select cell_x, cell_y, sum(swept_area)
from trawl_cells          -- the join table built above (name is an assumption)
group by cell_x, cell_y;

The Z value can be swept area, tow count, summed or average catch, etc., so a
very wide range of gridded models can be built & plotted, depicting many
aspects of the data. All the queries are aspatial, on indexed attributes; the
group by is in effect the spatial operator, but much faster :-)
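
For example, two of the variants, using the same assumed table & column names
as above:

select cell_x, cell_y, count(*)    -- tow count per cell
from trawl_cells
group by cell_x, cell_y;

select cell_x, cell_y, avg(catch)  -- avg catch; "catch" is an assumed column
from trawl_cells
group by cell_x, cell_y;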

The final result will be a report on the spatial & temporal distribution of a
range of active bottom fishing methods, as a precursor to an environmental
impact assessment.

All the spatial data is in PostGIS, using SQL as the analysis tool (as above),
QGIS & JUMP as data visualisation/browsing and GMT for gridding & publication
quality cartographic output. 
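
In case it's useful to anyone, the hand-off from SQL to GMT can be as simple
as piping psql's unaligned output into xyz2grd. The region, grid spacing &
palette below are placeholder values, and trawl_cells is the assumed join
table from earlier:

# one space-separated X Y Z tuple per line
psql -t -A -F' ' -c "select cell_x, cell_y, sum(swept_area) \
  from trawl_cells group by cell_x, cell_y" $DB > swept.xyz

# grid & plot with GMT (region/spacing/palette made up for the example)
xyz2grd swept.xyz -Gswept.grd -R165/185/-55/-30 -I0.2
makecpt -Crainbow -T0/100/10 > swept.cpt
grdimage swept.grd -JM15c -R165/185/-55/-30 -Cswept.cpt > swept.ps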

I'm looking at GRASS & R to provide any more analysis capability required, but
at this stage they don't seem to be needed.

I'm also looking at using MapServer, with QGIS generating the mapfiles & JUMP
or MapLab as front ends, to web-enable the system.

The interoperability of all these OpenSource products offers pretty exciting
synergies & possibilities.


There are two other projects I'm looking to use the same approach for: one on
dredge oysters & one on Antarctic toothfish.

Who says you can't have fun in your work :-)



My thanks to all the developers who've helped make this sort of thing
possible with open source software!


Brent Wood


