[postgis-users] Performace problems with 250.000.000 points

Brent Wood pcreso at pcreso.com
Tue Jul 24 14:23:48 PDT 2007


--- "Boehm, Andreas" <boehm at rif.fuedo.de> wrote:

A few options:

You have 2Gb,memory, up your sort/work mem values, as per the docs.

You may find partitioning your tree table speeds things up, if you can tile or
stripe the trees into separate partitions, or do something similar in a more
manual fashion as below. 

You may get a significant speed up by tiling your trees into (for example)
10000 "boxes", then add an indexed column to your trees table for the box_id
which assigns each tree to a box. Then you can try (ideally with a clustered
index on t.box_id in a tree table which is partitioned on t.box_id):

select x(t.geom),
       y(t.geom),
from land_parcel l,
     tree t,
     box b
where l.lp_id = <n>
  and l.parcel && b.box     
  and t.box_id = b.id  
  and t.geom && l.geom
  and contains(l.geom, t.geom);

The goal being to quickly reduce the number of trees you need to process in
your query via some imposed structure.

Your spatial operator is then constrained to only trees in overlapping boxes,
instead of all trees, via a relatively fast integer index. If you do try this,
it may be worth comparing the performance of tiles with varying numbers of
trees to see where the peak benefit is obtained.

You may also find some value in implementing a c-Squares approach to accessing
your data, although this is normally used where a spatial index isn't otherwise
available, see:  http://en.wikipedia.org/wiki/C-squares

c-squares won't apply directly to your data, but the principle still applies.


Plenty to try :-)

   Cheers,

     Brent Wood


> Hi list,

> 
> working with an huge data table I've massive performance problems.
> 
> Here's the scenario:
> There is one table trees, containing about random 250.000.000 points
> within the square (x = 100000 / y = 100000). SRID is -1. A second table
> land_parcel contains about 900.000 polygons (1 ring with 17 points).
> These polygons are uniformly distributed within the square from lower
> left to upper right.
> 
> The task is to read all the trees within a specific land_parcel.
> Something like
> 	select x(t.geom), y(t.geom) 
> 	from land_parcel as l, trees as t 
> 	where (l.lp_id = 123456) 
> 	and (l.geom && t.geom) and Contains(l.geom, t.geom)
> 
> The problem is that the execution of this query takes ages. Although the
> result set is very small (approximately 25 trees). 
> In the first attempts I measured an average computation time of 366 s
> for selecting one random land_parcel. The maximum computation time I
> measured was about 700 s. In the meantime I've got better results (see
> below).
> 
> My assumption is that the bottle neck is the geo index. Reading the 100
> first land_parcels in a loop I got the result for the 1st land_parcel
> after 12 s. The times for the following land_parcels are decreasing (7
> s, 6 s, ... 0.5 s). 
> There is an other thing, too. If you once picked a random land_parcel
> the computation for a neighbour land_parcel is much faster.
> 
> PostgreSQL is running on a windows machine (Win 2003 Server) with an
> Intel Core2 CPU 6700, 2.66 GHz, 2 GB RAM and a 1.1 TB RAID 5. CPU usage
> is low, I/O traffic is height, mem usage is about 80 MB. 
> 
> I've already walked through Mark's tuning guide. Here are some snippets
> of the configuration file:
> 	shared_buffers = 64MB
> 	effective_cache_size = 196MB
> 	random_page_cost = 4
> I've altered the tables:
> 	ALTER TABLE trees ALTER COLUMN geom SET STATISTICS 1000;
> 	ALTER TABLE trees ALTER COLUMN tr_id SET STATISTICS 1000;
> 	ALTER TABLE land_parcel ALTER COLUMN geom SET STATISTICS 1000;
> 	ALTER TABLE land_parcel ALTER COLUMN lp_id SET STATISTICS 1000;
> I've done several vacuum analyzes.
> I've done a reindex on trees and land_parcels.
> I've re-clusterd the table trees
> 	CLUSTER trees_index ON trees;
> 
> The clustering was successful. The average time for one land_parcel is
> 9.7 s. But this is still to slow. Running the scenario in ArcSDE /
> Oracle the average computation time is approximately 1.4 s.
> 
> Has anybody still an idea? Or are 250.000.000 points just too many
> items?
> Thanks
> 
> Andreas
> 
> ---
> 
> --
> -- Query plan
> --
> Nested Loop  (cost=99.47..5072.48 rows=19 width=21)
>   Join Filter: contains(l.geom, t.geom)
>   ->  Index Scan using land_parcel_pkey on land_parcel l
> (cost=0.00..9.25 rows=1 width=317)
>         Index Cond: (lp_id = 123456)
>   ->  Bitmap Heap Scan on trees t  (cost=99.47..5044.38 rows=1250
> width=21)
>         Filter: (l.geom && t.geom)
>         ->  Bitmap Index Scan on trees_index  (cost=0.00..99.16
> rows=1250 width=0)
>               Index Cond: (l.geom && t.geom)
> 
> --
> -- Table: trees
> --
> CREATE TABLE trees
> (
>   tr_id integer NOT NULL,
>   breed_id character varying(10) NOT NULL,
>   height double precision,
>   crown_diam double precision,
>   bhd double precision,
>   volume double precision,
>   plant_date character varying(10),
>   geom geometry,
>   CONSTRAINT trees_x_pkey PRIMARY KEY (tr_id),
>   CONSTRAINT enforce_dims_geom CHECK (ndims(geom) = 2),
>   CONSTRAINT enforce_geotype_geom CHECK (geometrytype(geom) =
> 'POINT'::text OR geom IS NULL),
>   CONSTRAINT enforce_srid_geom CHECK (srid(geom) = -1)
> ) 
> WITHOUT OIDS;
> ALTER TABLE trees OWNER TO postgres;
> 
> -- Index: trees_breed_ididx
> CREATE INDEX trees_breed_ididx
>   ON trees
>   USING btree
>   (breed_id);
> 
> -- Index: trees_heightidx
> CREATE INDEX trees_heightidx
>   ON trees
>   USING btree
>   (height);
> 
> -- Index: trees_index
> CREATE INDEX trees_index
>   ON trees
>   USING gist
>   (geom);
> 
> -- Index: trees_plant_dateidx
> CREATE INDEX trees_plant_dateidx
>   ON trees
>   USING btree
>   (plant_date);
> 
> --
> -- Table: land_parcel
> --
> CREATE TABLE land_parcel
> (
>   lp_id integer NOT NULL,
>   owner_id character varying(20) NOT NULL,
>   numerator integer NOT NULL,
>   denominator integer NOT NULL,
>   geom geometry,
>   CONSTRAINT land_parcel_pkey PRIMARY KEY (lp_id),
>   CONSTRAINT enforce_dims_geom CHECK (ndims(geom) = 2),
>   CONSTRAINT enforce_geotype_geom CHECK (geometrytype(geom) =
> 'POLYGON'::text OR geom IS NULL),
>   CONSTRAINT enforce_srid_geom CHECK (srid(geom) = -1)
> ) 
> WITHOUT OIDS;
> ALTER TABLE land_parcel OWNER TO postgres;
> 
> 
> -- Index: land_parcel_index
> CREATE INDEX land_parcel_index
>   ON land_parcel
>   USING gist
>   (geom);
> 
> -- Index: land_parcel_owner_ididx
> CREATE INDEX land_parcel_owner_ididx
>   ON land_parcel
>   USING btree
>   (owner_id);
> _______________________________________________
> postgis-users mailing list
> postgis-users at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-users
> 




More information about the postgis-users mailing list