[postgis-users] Performance and memory problems

Thu Dec 4 02:48:18 PST 2003

Hi all,
yesterday I've started playing with postgis. Nice tool.
For the record, since I wanted to play with sources and did not
want to break a working machine, I've set up a Linux mandrake 9.1
in a VMWare virtual machine, got the pg 7.4, geos, postgis 0.8 sources 
and compiled everything (proj is available as an rpm), set up and loaded 
some shapefile into the dbms, created the r-tree indexes (oh, I also 
increased the postgres shared memory to around 64MB).

Now, loading speed is acceptable, index creation is fast (less than a 
second even with thousands of geometries), but when
using geos I start to feel the pain. Example: I have two shapefiles
downloaded from http://nationalatlas.gov/atlasftp.html, specifically
http://edcftp.cr.usgs.gov/pub/data/nationalatlas/countyp020.tar.gz
and http://edcftp.cr.usgs.gov/pub/data/nationalatlas/hucs00m020.tar.gz 
(but using only the polygonal file contained in this tar),
that is, conties and hydrologic units, that are about 14 MB each
(6000 counties and 5000 hydrologic units).

I want to overlay them with the following command:

select intersection(counties.the_geom, hydrounit_poly.the_geom), 
counties.county, hydrounit_poly.map_label, hydrounit_poly.reg_name
into hydro_counties
from counties, hydrounit_poly
where counties.the_geom && hydrounit_poly.the_geom

Explain says that the spatial indexes are used. After 30 seconds of 
intense computation the backend bombs out due to an out of memory...
why? The overlay process should keep in memory only the current 
reference geometry from conties and the intersecting ones from 
hydrounit_poly, that is, a few kylobytes...
Moreover I see that the postmaster actually doing the computation was
using 12 MB or resident memory and 6 MB of shared memory (top output),
so I don't see why it bombs in the first place...

Another example: I want to know which counties contain at least one 
abandoned mine. The mines file is downloaded at 
http://edcftp.cr.usgs.gov/pub/data/nationalatlas/abnminx020.tar.gz and
contains approximately 20000 points.

The query is:

select counties.county from counties where exists (select * from 
abnmines where abnmines.the_geom && counties.the_geom and 
contains(counties.the_geom, abnmines.the_geom))

Which takes 32seconds to execute first time (disk access involved),
25 seconds after (everything is in memory, no disk access), spatial 
index is used. I expected it to be faster given that it's running on a 
Athlon XP 2800+ and VMWare does not do any transcoding (cpu bound tasks 
run at about the same speed in the vmware vm as in a real pc AFAIK)).
It is a good time in your experience? Can I do anything to improve it?
Using distance <= 0 instead of contains lowers the time to 21secs.

Oh, another question: if I create the table with the above command, is 
there any simple way to update geometry_columns too (I don't understand
where the last fields in that table come from, that is, the fields
after the type)

Best regards
Andrea Aime