[postgis-users] Very long time for a bbox/within query
Paul Ramsey
pramsey at refractions.net
Sun Jul 25 15:52:21 PDT 2004
Gregory S. Williamson wrote:
> Whew !
>
> 1) I'll work with our sysads and give a try to the CVS versions; I am
> sure we are using a stable (older) release.
>
> 2) slow is ok, up to a point -- that's why we are doing this as a
> preprocessing step. I like the bb-box test idea and am fairly sure I
> can code enough of this to test.
>
> I found my self wondering if centroids of parcels -- already a thing
> we are calculating -- might be useful? I would guess that there are a
> few cases where the centroid of a polygon would actually be outside
> of a given county, but that chance of that being true for all of the
> parcels in a given batch is vanishing low. (Of course, we'd see that
> same centroid again in testing the next FIPS).
Centroids will never tell you the "real story" about interactions
between two layers, but your application is such that you might not
actually want the whole story. Like you said, you are interested in
whether there is "enough data to make a map" for a particular county.
Well, the answer to that is probably something like "are there more than
X features of type Y in this county". For rough calculations like that,
doing simple point-in-polygon tests against centroids could be more than
accurate enough:
select v_boundary.v_bndry_id, count(*)
from parcels,v_boundary
where
v_boundary.the_geom && parcels.the_geom
and
distance(v_boundary.the_geom, centroid(parcels.the_geom)) = 0
and
v_bndry_type = 'FIPS'
group by v_boundary.v_bndry_id;
Take all the counties with a feature count > X and that's your answer.
> In practice we only have to test this data as it comes into the
> system, probably in a load table. Currently we are swimming upstream
> against some accumulated data, so I do not want to get to far behind
> the curve.
>
> Short circuiting ... seems like we we have data where most cases will
> be in a county with no fuss -- any one of these would qualify the
> whole batch being loaded for being in that county. Then all we have
> to care about are the parcels along the boundary of the county to see
> if we are in spilling into any adjacent ones. Maybe I've been
> looking at this wrong ... (natch!)
>
> Thanks for the suggestions and info ... lots to chew on.
>
> Probably won't get a chance to play with any of this this afternoon;
> I've got to San Francisco and experience some fog.
>
> Thanks again !
>
> Greg
>
> ps should any of this get posted publically ? Might reassure other
> users that my yelp was heard, and may provide clues for others ...
>
> -----Original Message----- From: Paul Ramsey
> [mailto:pramsey at refractions.net] Sent: Sun 7/25/2004 3:27 PM To:
> Gregory S. Williamson Cc: Suzanne George; Chris G. Nicholas Subject:
> Re: [postgis-users] Very long time for a bbox/within query OK, two
> pronged approach:
>
> 1 - GEOS has gone through a major update in the last two months. The
> CVS cuts of PostGIS and GEOS are supposed to work together. You might
> find that things work better with the latest and greatest. (No
> guarantees, this is a particular problem (poorly optimized Within())
> that might not have been touched in the update.
>
> But
>
> 2 - You seem to have noticed that Within and GEOS functions in
> general are slow, and that is not completely unexpected. For each
> GEOS test we have to build up a C++ object from the PostGIS
> geometries, which is expensive. So one way of avoiding that would be
> to try and not use the GEOS functions except where needed. They are
> needed for the boundary cases, and not for anything else. If all four
> points of an object's bounding box fall within the bounds on another
> objects polygon, then it is inside the other polygon, guaranteed. You
> can hand-code that test in SQL (with the distance() tester), though
> it will be ugly (four tests for the price of one). Or we can write
> some short-circuit code in the Within() function that does that for
> you. When writing your SQL tests, consider that the 'AND' operator
> short-circuits, so you want to put the inexpensive stuff to the left
> and the expensive stuff to the right. The one thing missing for this
> is a proper constructor function so you can get POINT objects out of
> the extents. You could do so using the envelope() function and the
> ringn() and pointn() functions...
>
> select v_boundary.v_bndry_id, count(*) from parcels,v_boundary where
> v_boundary.the_geom && parcels.the_geom and
> distance(v_boundary.the_geom,
> pointn(exteriorring(envelope(parcels.the_geom))), 1)) = 0 and
> distance(v_boundary.the_geom,
> pointn(exteriorring(envelope(parcels.the_geom))), 2)) = 0 and
> distance(v_boundary.the_geom,
> pointn(exteriorring(envelope(parcels.the_geom))), 3)) = 0 and
> distance(v_boundary.the_geom,
> pointn(exteriorring(envelope(parcels.the_geom))), 4)) = 0 and
> within(v_boundary.the_geom,parcels.the_geom) and v_bndry_type =
> 'FIPS' group by v_boundary.v_bndry_id;
>
> The problem is that even with spatial indexing you still have to test
> every single parcel against at least one county shape. Slightly more
> tests than that, with overlap cases. So if Within() is inefficient,
> you pay big as the number of parcels goes up. It would be better to
> fix Within(), but maybe the construction above will be better(ish).
>
> The GEOS (and JTS) stuff has all been worked on for correctness, more
> so than performance, so there are definately places where it does
> brain-dead stuff, not doing shortcut tests to avoid unnecessary work.
> Add that on top of the impedance of converting PostGIS geometries to
> GEOS geometries for testing, and perhaps that is the genesis of your
> current problem.
>
> I cannot actually think of a faster way to do this, offhand. FME will
> be almost as bad, and will end up memory-bound in any event. The
> shear volume of things to test could be quite overwhelming over time.
> If you do go into production, a trigger based approach to incremental
> maintenance of the metadata would probably be a Good Thing (tm).
>
> Yours,
>
> Paul
>
> Gregory S. Williamson wrote:
>
>
>> Sunny here, too ! (Glad I don't live in San Francisco)
>>
>> We're constructing a cross-reference table so that at runtime a
>> rapid determination can be made of which vector data we have for a
>> particular area (right now we are doing counties, but this would
>> apply to MSAs, zip codes, etc.) I've run this process for various
>> point, line and polygon tables (various esri data, GDT streets,
>> some flood data, etc.). Actually, we care mostly about quickly
>> determining whether we have all of the ingredients (e.g. vector
>> layers) to make a map.
>>
>> I built a perl script that goes through the boundry table (FIPS
>> only, but has to be extensible) FIP by FIP and asks if there is an
>> overlap with data from a particular vector set. The stuff that
>> fails fails with "psql" as well so I don't think this is a perl
>> issue per se, but I am willing to be open to ideas.
>>
>> The parcels are unusual in that they are very dense, complicated
>> polyong/multipolygons, currently only representing a few counties
>> in the US (Florida and Texas, and less than a dozen total, I
>> think). The number of counties represented is expected to grow;
>> updates may be an issue as well, but for now we would expect
>> sporadic data for only a small percent of the 5000+ FIPS in the US
>> (a dozen this year maybe, 10 times that over the next year or two,
>> maybe; Suz ?).
>>
>> Suz or Chris might be better able to speak to the ultimate
>> consequences of having a FIPS county identified as having data when
>> there is only a tiny sliver (say, a little corner of some parcel
>> in county "B" extends into county "C" which we otherwise do not
>> have data for).
>>
>> My view is simple:
>>
>> If any part of a feature in the vector data of interest is within
>> or on the boundry of a county, it is considered to be in the county
>> and we put an entry in our cross-reference table for this
>> county/data type (collection). (I am willing to negotiate about the
>> "on the boundry of" clause ... my worry was excluding features that
>> follow county boundaries and might properly be considered to be in
>> both counties if they are physical objects like streams, roads,
>> etc.) If one feature is in multiple counties
>>
>> I notice that when I run this "within" test on the table we got the
>> shapes from (ESRI Detail County outlines) we get no entries in the
>> cross-reference table. My thinking is that within would not see an
>> identical polygon as being "within" itself ... perhaps I am using
>> the wrong function ? (This was done as a test -- we already know we
>> have county shapes in runtime so this would not be considered as an
>> optional overlay).
>>
>> Thanks for your time ... most appreciated!
>>
>> Greg
>>
>> -----Original Message----- From: Paul Ramsey
>> [mailto:pramsey at refractions.net] Sent: Sun 7/25/2004 2:33 PM To:
>> Gregory S. Williamson Cc: Suzanne George; Chris G. Nicholas
>> Subject: Re: [postgis-users] Very long time for a bbox/within query
>> Greg, Since we are both working on a Sunday (and a sunny one, at
>> that)... What exactly do you want to do? I have gathered that
>> basically this is an overlay, so you want to know all the items A
>> that are contained within shapes B. What do you want to do with
>> boundary conditions (A's that are within multiple B's)? Paul
>>
>>
>> Gregory S. Williamson wrote:
>>
>>
>>
>>> a) find another function than within that would tell me what I
>>> want to know (the postGIS documents are a little skimpy on
>>> specifying exactly what they test and the formal specs they refer
>>> to do not make much sense to me). I may well be using a not-quite
>>> right function for what I want.
>>
>>
>>
>>
>>
>
>
>
>
More information about the postgis-users
mailing list