[postgis-users] Improving Performance

Ralph Mason ralph.mason at telogis.com
Wed Feb 18 20:06:10 PST 2004


I had a little idea like this in December - a 'lean' version of postgis 
that supports only 2D geometries (and points), where all geometries 
are assumed to have the same projection.

I thought it would be possible to 'shrink' each geometry by 56 
bytes (plus 8 bytes for each 2D point). So in the example below your data 
could have been something like 900mb smaller as well - roughly 
a 40% reduction in data size.
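
For concreteness, the arithmetic behind that saving can be sketched like this (the 56-byte header saving and 8 bytes per point are the numbers above; the function name and example counts are mine):

```c
/* Rough arithmetic for the "lean" layout: each geometry loses a fixed
 * 56 bytes of header, plus 8 bytes per 2D point (e.g. storing float4
 * x/y instead of float8).  The constants come from the text above. */
long long bytes_saved(long long geometries, long long avg_points_per_geom)
{
    return geometries * (56LL + 8LL * avg_points_per_geom);
}
```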

I don't think the loss of resolution in the index is a big deal because 
it's only a bounding box - you really need to check the distance or 
the intersection after the bounding box has done its work to know 
whether you have a valid match.

Between these two, the disk activity and memory load for (what I assume 
is) the bulk of postgis work could be significantly reduced.

Maybe there could be a geom_lite type; specific functions (distance, 
intersects, etc.) could be ported to geom_lite, and the others could 
work via a conversion.  


Ralph


David Blasby wrote:

> I was recently doing some spatial joins between a table with 10,000 
> points and a table with 17,000,000 lines.
>
> Here's some of my thoughts on the matter.  In the middle of this 
> message, I give 4 ideas for improving speed.  The last part of the 
> message is a more in-depth analysis of one of these techniques.
>
>
> The query I was running looked something like:
>
>  EXPLAIN ANALYSE
>  SELECT  ...
>  FROM all_lake_points, csn_edges
>  WHERE
>    csn_edges.the_Geom && expand(all_lake_points.the_Geom,0.5)
>    ;
>
> I repeatedly ran the process on one of our machines, and got
> the following times (it returns about 40,000 rows):
>
> Total runtime: 113813.259 ms  (very low CPU utilization - <5%)
> Total runtime:  51843.412 ms
> Total runtime:  37013.140 ms
> Total runtime:  27280.564 ms
> Total runtime:  17084.181 ms
> Total runtime:  12869.891 ms
> Total runtime:  10867.587 ms
> Total runtime:  10803.591 ms (very high CPU utilization - >70%)
>
> On another machine:
>
>  Total runtime: 163636.58 msec
>  Total runtime:  65834.27 msec
>  Total runtime:   9506.53 msec
>  Total runtime:   7880.42 msec
>
>
> Starting and stopping the postmaster between queries doesn't affect
> these times, so the diminishing time is a result of the OS caching
> the spatial index and the actual csn_edges tuples.
>
> The size of the spatial index is approximately O(n log n),
> where n = the number of geometries being indexed.
>
> Since the index is composed of a BOX (xmin, ymin, xmax, ymax -- all 
> doubles) of size 32 bytes, the actual index size is:
>
> 17,000,000 rows * 32bytes/row  + 17,000,000 rows * GiST overhead +
> (index hiearchy overhead)
>
> = 520 mb + 340mb + hierarchy overhead
> = 860 mb + hierarchy overhead
>
> The GiST overhead is quite large - see ggeometry_compress() - it's 
> about 21 bytes per entry.  The index hierarchy accounts for the 
> "internal" nodes in the GiST tree.
>
> I looked in the postgresql data directory, and it appears that the 
> index is actually 1.2 Gb.
> The table csn_edges is about 14 Gb.
>
> The query actually works like this:
>
> get a row from all_lake_points
> make a bounding box
> Search csn_edges GiST index using the bounding box:
>    GiST init
>    Traverse GiST index
>    For each matching GiST entry
>        get tuple from csn_edges
>
>
>
> Experimentation (with explain analyse) leads me to believe that an 
> uncached index search takes about 100 to 400 ms.  A search with the 
> index mostly cached takes about 100ms, and with both the index and the 
> actual tuples cached, about 1.5ms.
> This is consistent with the timings given above.
>
> The very low CPU utilization numbers indicate that the uncached 
> version is IO bound, while the cached version is CPU bound.  Running 
> the same query several times encourages the OS to cache those disk pages.
>
> If a GiST entry is 32 bytes (box) + 21 bytes (GISTENTRY) = 53 bytes, 
> about 154 of them fit on an 8k disk page.
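>
> As a sanity check on the page and size arithmetic (8k pages, a 32-byte 
> BOX, and ~21 bytes of GiST overhead are the figures from this message; 
> the helper names are mine):

```c
/* Whole GiST entries that fit on an 8 kb disk page. */
int entries_per_page(int entry_bytes)
{
    return 8192 / entry_bytes;   /* integer division: whole entries only */
}

/* Leaf-level index bytes for a given row count and per-entry key size,
 * ignoring GiST per-entry and hierarchy overhead. */
long long leaf_bytes(long long rows, int entry_bytes)
{
    return rows * (long long)entry_bytes;
}
```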
>
>
> Immediately, I see four ways to improve performance:
>
> 1. Have the OS allocate more space to the disk cache
> 2. Have the index be smaller
> 3. Have the actual tuples be smaller on disk
> 4. Order the table so that nearby geometries are very close together 
> on the disk.
>
> #1 is up to the system administrator
> #3 we can change the size of a geometry by a small amount or go to a
>    WKB-type datatype like I've talked about in a few discussions.  This
>    will not be a major difference unless your individual geometries only
>    have a few points in them.
> #4 There's no easy way to do this.  One way would be to make a Quad-tree
>    type index for your geometries and insert them into the database on a
>    bucket-by-bucket basis.
>    (The PostGIS stats package does something like this in the
>    histogram2d functions)
>
>    This makes queries faster because it should minimize the number of
>    *different* disk pages a query will access.
>    It still reads the same number of tuples off the disk, but these
>    tuples are more likely to be on the same page.  For example, suppose
>    your tuples are 1kb long, the disk page size is 8kb, and you're
>    selecting 1000 tuples.  With random locations,
>    you'll likely read 1000 disk pages, but with a high degree of
>    auto-correlation you could read as few as 1000/8 = 125 pages.
>
>    This is definitely worth pursuing - but it's a fair bit of work and
>    would probably be an external program instead of an SQL command. It's
>    hard to say how much this would improve performance.
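>
> One concrete way to get that bucket-by-bucket ordering - not 
> necessarily what the histogram2d code does - is to sort rows by a 
> Z-order (Morton) key computed from each geometry's grid cell before 
> loading.  A minimal sketch (the 16-bit cell assumption is mine):

```c
#include <stdint.h>

/* Spread the low 16 bits of x so a zero bit sits between each pair,
 * i.e. bit i of the input moves to bit 2*i of the output. */
static uint32_t spread_bits(uint32_t x)
{
    x &= 0xFFFF;
    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;
    return x;
}

/* Interleave two 16-bit grid-cell coordinates into a 32-bit Z-order
 * key.  Sorting geometries by this key clusters spatially nearby rows
 * onto nearby disk pages. */
uint32_t morton_key(uint32_t cell_x, uint32_t cell_y)
{
    return spread_bits(cell_x) | (spread_bits(cell_y) << 1);
}
```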
>
>
>
> I'd like to talk more about #2 (smaller index).
>
> I'm proposing we change the index type (currently BOX - 32 bytes) to a:
>
> typedef struct BOX2DFLOAT_
> {
>     float xmin;
>     float xmax;
>     float ymin;
>     float ymax;
> } BOX2DFLOAT;
>
>
> This saves 16 bytes per entry, meaning we go from 154 GiST entries/page 
> to 221 (a 43% gain).  My original thought was that this would be a 50% 
> saving (16 vs 32 bytes), but the 21-byte GiST overhead reduces it. In 
> the end my GiST index goes from about 1.2 Gb to 840 Mb...
>
> We'll need to write a function that takes a BOX3D and converts it to a 
> BOX2DFLOAT.  The BOX2DFLOAT's xmin/ymin will have to be rounded down 
> (and its xmax/ymax rounded up) relative to the BOX3D's to ensure that 
> all the queries work properly.
>
> This will make all the search boxes "bigger" by about 1m if the 
> coordinates are in the 10,000,000 range (a float4 cannot represent 
> small differences in numbers as well as a float8). For boxes near the 
> origin, we'll only see a few micrometers of difference.
> This means the bounding box queries will always work correctly, but 
> they might grab a few more nearby "candidate" geometries.
>
> Next, we'll have to write a set of BOX2DFLOAT high-level GiST support 
> functions, and a set of low-level GiST support functions.  These are 
> all quite easy.
>
>
> At the end of this, the index query should be a bit faster, require 
> about 40% less disk space, and require about 40% less cache space.
>
> In the end, the biggest wait will be pulling individual csn_edges 
> tuples from the disk (see comments on #4, above).
>
>
> dave
>
> For DAVE's reference only:
> -------------------------
>
>  explain analyse  SELECT  all_lake_points.gid, 
> all_lake_points.the_Geom ,csn_edges.code, 
> csn_edges.edge_id,csn_edges.group_code 
> ,startpoint(csn_edges.the_geom), endpoint(csn_edges.the_geom)
>  FROM all_lake_points, csn_edges
>   WHERE
>    all_lake_points.gid >= 650001 and
>    all_lake_points.gid <= 660000 and
>    all_lake_points.code != 1400 and
>    csn_edges.the_Geom && expand(all_lake_points.the_Geom,0.5) and
>    csn_edges.code in (1000, 1050, 1100, 1150, 1200, 1250, 1300, 1350, 
> 1400, 1410, 1425, 1450, 1475, 2000, 2300, 6010)
>    ;
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-users
