[postgis-users] Large geometry issue
Gregory Williamson
Gregory.Williamson at digitalglobe.com
Thu Jul 5 16:00:13 PDT 2007
Paul --
Thanks for the reference and the info ... I was pretty much afraid of that. Given the nature of the test for whether a polygon is within another one entirely, iterating through a few thousand tests will simply take time.
I'll keep an eye on the list and we'll try the JTS stuff when it comes out -- we use JTS in our MiddleWare, but the postGIS side uses GEOS. So it's impossible for us to switch, or (I hope -- no commitment, you understand) pony up some geld.
I'll think about maybe breaking up the sparse multis -- I have some leeway to preprocess data and if I get even a 25% gain things will be a bit easier ... currently numbers show that postGIS is way faster than The Other RDBMS for small polygons, much slower for big ones. Since the immediate testing is directed at a business case in which the large polygons aren't as much of an issue, we're ok for now. But I want to take aim at some internal applications that do use large polygons, eventually.
Greg W.
-----Original Message-----
From: postgis-users-bounces at postgis.refractions.net on behalf of Paul Ramsey
Sent: Thu 7/5/2007 9:24 AM
To: PostGIS Users Discussion
Subject: Re: [postgis-users] Large geometry issue
The general shape of this problem is
"large complex geometry potentially (based on index filter) contains a
large number of smaller geometries, figure out which ones are actually
contained"
http://geotips.blogspot.com/2007/06/performance-and-contains.html
There's no fast way to do this in PostGIS right now. Contains() will be
maximally fast for the many-points-in-polygon, but still hardly optimal,
since our naive p-i-p is still O(N).
We are working on speeding this up for a client right now, and the
delivery will happen relatively soon, but the drawback is that it will
require using the JTS bindings instead of the GEOS ones, since the
solution required new JTS work that is not going to be in GEOS for some
time. The solution will build an internally-indexed version of the big
complex shape (the individual edges will have index entries) and then
keep that version around for re-use by each test of the
small-shape-vs-large-shape combinations.
I would suggest waiting for that release to hit CVS, see if it does what
you want, then if you are averse to using the JTS bindings buck up for
the GEOS port.
P.
(It's a shame, but large sparse objects like multi-* don't use the
over-all feauture-based index well, because the bbox covers the whole
complex of sub-shapes. If lots of your objects are sparse multis,
breaking them into simple polygons with a key relationship will help
things quite a bit, by reducing improper index hits.)
Burgholzer,Robert wrote:
> Gregory,
> I have been using within() to do the same, and have found many of the same performance issues. There are a couple of points that I think are useful:
>
> 1) the "&&" queries are generally fast, especially when using a GIST index, because it is a very effective index - nothing comparable exists (to my knowledge) for distance and within().
> 2) In my investigations it seems that MULTIPOLYGONS may be the culprit in slowing things down. That is, it is not the complexity of the geometry per se, but rather, the number of isolated geometries within a given shape column, as indicated by the command "select my_id, numGeometries(the_geom) from my_table group by my_id;". I believe that if you are operating on shapes where the "numGeometries" command returns a value of 1, they will perform within() queries very quickly, however, when that number becomes considerably greater than 1, the queries bog down. This occurs often when you have shoreline type geometries that are non-continuous, or geometries that have a number of "holes" in them.
>
> I think that if you can insure that your geometries are single continuous entities that you can improve performance. I have thought to also investigate the code that underlies the within() command, but have been unsuccessful in understanding the code. Some optimization here would be very worthwhile. If you were to be able to progress in this area, I would be interested in seeing what you can find, and perhaps contributing to the investigation.
>
> I would be interested to know if anyone who knows more about this topic thinks my conclusions are correct.
>
> Hope this is at least somewhat helpful,
> r.b.
>
> -----Original Message-----
> From: postgis-users-bounces at postgis.refractions.net [mailto:postgis-users-bounces at postgis.refractions.net] On Behalf Of Gregory Williamson
> Sent: Thursday, July 05, 2007 8:06 AM
> To: PostGIS Users Discussion; postgis-users at postgis.refractions.net
> Subject: RE: [postgis-users] Large geometry issue
>
> FWIW, this is PostgreSQL 8.2.4, POSTGIS="1.2.1" GEOS="3.0.0rc4-CAPI-1.3.3" USE_STATS
> on a linux box with 8 gigs o' RAM ...
> GSW
>
>
> -----Original Message-----
> From: postgis-users-bounces at postgis.refractions.net on behalf of Gregory Williamson
> Sent: Thu 7/5/2007 3:09 AM
> To: postgis-users at postgis.refractions.net
> Subject: [postgis-users] Large geometry issue
>
> Dear peoples,
>
> I have a problem with a query that uses an absurdly large geometry (2530 in a single polygon). This is srid -1 (part of a large test of postgres vs some other database product). Everything has been vacuumed and analyzed.
>
> The initial search to find candidates in a target table is quite fast:
> catest=# select count(*) from wtm_sub_cell w, order_line_item x WHERE x.bbox && w.geometry AND x.id_as_int = 114672;
> count
> -------
> 13168
> (1 row)
>
> Time: 9.472 ms
>
> Trying to get the list narrowed to geometries that are completely contained by the requested shape is slow:
> catest=# select count(*) from wtm_sub_cell w, order_line_item x WHERE x.bbox && w.geometry AND distance(x.geometry,w.geometry) = 0 and x.id_as_int = 114672;
> count
> -------
> 1112
> (1 row)
>
> Time: 69277.780 ms
>
> So I have two questions:
> a) anything better to use than "distance(x,y) = 0) ? I tried st_within -- it is about the same speed but returns no polys, which is strange to me, but I also haven't looked at these in detail yet. For example:
> catest=# select count(*) from wtm_sub_cell w, order_line_item x WHERE x.bbox && w.geometry AND st_within(x.geometry,w.geometry) and x.id_as_int = 114672;
> count
> -------
> 0
> (1 row)
>
> Time: 1173.185 ms
> (same results with st_within(w.geometry,x.geometry):
> catest=# select count(*) from wtm_sub_cell w, order_line_item x WHERE x.bbox && w.geometry AND st_within(w.geometry,x.geometry) and x.id_as_int = 114672;
> count
> -------
> 0
> (1 row)
>
>
> b) anything I can do to speed things up ? I have tried boosting work mem to 16 megs (from 1) and it made no apparent difference.
>
>
>
> I have a self contained test case that shows the same behavior -- the one large poly and all the candidates in another table. Apologies for the size; hopefully it's not been mangled in the transfers.
>
> Explain analyze of the sample (the sequential is sensible since there is only one row in the table):
> catest=# explain analyze select count(*) from wsc_candidates w, oli_req x WHERE w.geometry && x.bbox AND distance(w.geometry,x.oli_req_geom) > 0 AND x.oli_req_id = 114672;
> QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=20.28..20.29 rows=1 width=0) (actual time=77232.858..77232.859 rows=1 loops=1)
> -> Nested Loop (cost=0.00..9.30 rows=4389 width=0) (actual time=6.389..77221.506 rows=12056 loops=1)
> Join Filter: (distance(w.geometry, x.oli_req_geom) > 0::double precision)
> -> Seq Scan on oli_req x (cost=0.00..1.01 rows=1 width=40602) (actual time=0.007..0.009 rows=1 loops=1)
> Filter: (oli_req_id = 114672)
> -> Index Scan using wsc_c_spatial_ndx on wsc_candidates w (cost=0.00..8.27 rows=1 width=109) (actual time=0.022..25.991 rows=13168 loops=1)
> Index Cond: (w.geometry && x.bbox)
> Filter: (w.geometry && x.bbox)
> Total runtime: 77232.901 ms
> (9 rows)
>
> Time: 77233.773 ms
>
>
> And for the real thing:
> catest=# explain analyze select count(*) from wtm_sub_cell w, order_line_item x WHERE w.geometry && x.bbox AND distance(w.geometry,x.geometry) = 0 AND x.id_as_int = 114672;
> QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=141.83..141.84 rows=1 width=0) (actual time=77457.587..77457.588 rows=1 loops=1)
> -> Nested Loop (cost=5.99..141.83 rows=1 width=0) (actual time=15.682..77456.541 rows=1112 loops=1)
> Join Filter: (distance(w.geometry, x.geometry) = 0::double precision)
> -> Index Scan using oli_id_ndx on order_line_item x (cost=0.00..8.30 rows=1 width=383) (actual time=0.012..0.018 rows=1 loops=1)
> Index Cond: (id_as_int = 114672)
> -> Bitmap Heap Scan on wtm_sub_cell w (cost=5.99..132.97 rows=32 width=109) (actual time=2.988..21.796 rows=13168 loops=1)
> Filter: (w.geometry && x.bbox)
> -> Bitmap Index Scan on wsc_geom_idx1 (cost=0.00..5.98 rows=32 width=0) (actual time=2.828..2.828 rows=13168 loops=1)
> Index Cond: (w.geometry && x.bbox)
> Total runtime: 77457.633 ms
> (10 rows)
>
> Time: 77458.458 ms
>
>
> The tables involved by size:
> catest=# select count(*) from wsc_candidates;
> count
> -------
> 13168
> (1 row)
>
> Time: 2.586 ms
> catest=# select count(*) from oli_req;
> count
> -------
> 1
> (1 row)
>
> Time: 0.193 ms
> catest=# select count(*) from wtm_sub_cell;
> count
> ---------
> 6399928
> (1 row)
>
> Time: 1776.508 ms
> catest=# select count(*) from order_line_item;
> count
> --------
> 395921
> (1 row)
>
> Time: 176.083 ms
>
>
> Many thanks for your time and bandwidth!
>
> Greg Williamson
> Senior DBA
> GlobeXplorer LLC, a DigitalGlobe company
>
> Confidentiality Notice: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information and must be protected in accordance with those provisions. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message.
>
> (My corporate masters made me say this.)
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-users
--
Paul Ramsey
Refractions Research
http://www.refractions.net
pramsey at refractions.net
Phone: 250-383-3022
Cell: 250-885-0632
_______________________________________________
postgis-users mailing list
postgis-users at postgis.refractions.net
http://postgis.refractions.net/mailman/listinfo/postgis-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20070705/02711b4b/attachment.html>
More information about the postgis-users
mailing list