[postgis-users] question on gist performance

Thu Sep 6 00:39:29 PDT 2007

On Wed, 2007-09-05 at 18:12 +0200, Stefan Zweig wrote:
> hi list,
> 
> i have noticed a strange thing while selecting geometries from a table and using the && operator (gist):
> 
> i have a table which holds all countrys of the world (somewhat more than 200) an their boundaries in column the_geom
> the boundaries have a quite good resolution, so the total amount of points is huge.
> for better performance (if low resolution is needed) i created a second column 
> 
> the_geom_1 as:
> UPDATE table SET the_geom_1=simplify(the_geom,0.01)
> 
> which results in a less total amount of points compared to the original data.
> 
> i have set up a gist index on both columns
> 
> and now the strange thing:
> SELECT name FROM _g2965 WHERE the_geom && SetSrid('BOX(13.760675 51.171104,15.231802 51.83737)'::box2d,4326);
> takes 297ms
> 
> SELECT name FROM _g2965 WHERE the_geom_1 && SetSrid('BOX(13.760675 51.171104,15.231802 51.83737)'::box2d,4326);
> takes 15ms
> 
> actually i would have thought that the query using the gist index should take the same time on both columns, because gist index strategy is comparing bounding boxes of the geometries with the bounding box of my query, isn't it?
> can anybody give me a hint why there is such a massive time difference when selecting from different columns?
> 
> 
> thanks in advance, stefan

Hi Stefan,

I suspect that this because of your UPDATE statement. For each row that
is touched during an UPDATE, a new entry is created in the GiST index -
for various transactional reasons, the old version cannot be deleted.

For example: imagine I have your table with containing 200 countries of
the world. When first created, your heap/indices look like this:

table: 200 rows
the_geom_gist: 200 rows

Now you do your UPDATE table SET the_geom_1=simplify(the_geom,0.01):
this creates new versions of each row in both the heap and the index.

table: 400 rows (200 live)
the_geom_gist: 400 rows (200 live)

If you now build your second index:

table: 400 rows (200 live)
the_geom_gist: 400 rows (200 live)
the_geom_1_gist: 200 rows

The second index is smaller because the table only contained 200 live
rows, and so these were the only ones used to build the index. Now I see
from your schema that you've done this another two times which gives
this:

table: 800 rows (200 live)
the_geom_gist: 800 rows (200 live)
the_geom_1_gist: 600 rows (200 live)
the_geom_2_gist: 400 rows (200 live)
the_geom_3_gist: 200 rows

So as you can see the original index is 4 times the size of the original
which is why it takes longer to traverse. The quick solution would be to
rebuild all of the GiST indices on your table (if these tables are
fairly static you may wish to do a VACUUM FULL too). However, I would
probably create one table with just a primary key and geometry column
for each different resolution, and then perform a join onto your
original table based upon the primary key, dropping the geometry columns
from your original table.

HTH,

Mark.

-- 
ILande - Open Source Consultancy
http://www.ilande.co.uk