[postgis-users] Finding Islands

Thu Nov 21 13:40:42 PST 2013

This is what I came up with:

SELECT * FROM table WHERE gid IN (
    SELECT DISTINCT unnest(array[a.gid, b.gid]) AS gid
    FROM table a JOIN table b
        ON (ST_Intersects(a.geom, b.geom) AND a.gid < b.gid)
    )

I've posted my original attempts, your suggestions, and the final version
to my blog (with copious commentary) at
http://geospatial.commons.gc.cuny.edu/2013/11/21/finding-islands-or-the-converse/

Best,
--Lee

On Thu, Nov 21, 2013 at 7:28 AM, Rémi Cura <remi.cura at gmail.com> wrote:

> Hey ,
> Oops,
> vocaulary error
> "because it's making (2 million)^2 comparisons." -> it's cartesian
> product.
>
> basically you want to compare every polygon to every other, this is a
> cartesian product, and this is what takes time.
> Choosing wel the function used (touches or intersects or ...)  won't
> change much computing time as long as it uses index.
> In the same way, the logic you do after is not what takes time.
>
> Maybe you should not use the NOT IN syntaxe, but instead a where, or
> better, an "except", this should work much better
> http://www.postgresql.org/docs/9.3/static/queries-union.html
>
>
> Cheers,
>
> Rémi-C
>
>
>
> 2013/11/20 Lee Hachadoorian <Lee.Hachadoorian+L at gmail.com>
>
>> Thank you for these suggestions. I haven't replied yet because I am
>> testing them, and the queries take an hour or more to run on the 2 million
>> plus table. I will report back with some results.
>>
>>
>> On Wed, Nov 20, 2013 at 4:46 AM, Rémi Cura <remi.cura at gmail.com> wrote:
>>
>>>
>>> Maybe you can try the stupidest way to go,
>>> anyway you have to do a inner product because you have to consider each
>>> pair of polygons.
>>>
>>
>>  You don't have to do inner product. `a LEFT JOIN b ... WHERE b.gid IS
>> NULL` is a typical unmatched query. By doing self join ON ST_Touches(), the
>> polygons with no neighbors are returned because the LEFT JOIN returns all
>> records from a, while ST_Touches() produces no match (because a polygon
>> doesn't touch itself) so b.gid IS NULL.
>>
>>
>>> CREATE TABLE result AS
>>> SELECT DISTINCT ON (a.gid) a.*
>>> FROM table AS a , table AS b
>>> WHERE ST_Intersects(a.geom, b.geom) = TRUE
>>> AND a.gid != b.gid
>>>
>>
>> Just to be clear, this produces the polygons *with* neighbors, which is
>> why I use this as a subselect with gid NOT IN (...)
>>
>> Still testing,
>> --Lee
>>
>>
>>
>>
>>> Now you can improve, because in the previous query you will compute
>>> (geom1,geom2) and (geom2,geom1), which is duplicate
>>> so you can try creating an index on gid , and
>>> CREATE TABLE result2 AS
>>> WITH direct_comp AS (SELECT a.gid AS agid, a.geom AS ageom , b.gid AS
>>> bgid, b.geom AS bgeom
>>> FROM table AS a , table AS b
>>> WHERE ST_Intersects(a.geom, b.geom) = TRUE
>>> AND a.gid < b.gid)
>>> SELECT agid AS gid, ageom AS geom
>>> FROM direct_comp
>>> UNION
>>> SELECT bgid AS gid, bgeom AS geom
>>> FROM direct_comp
>>>
>>>
>>>
>>>
>>> Now if you want to go really big, this will be difficult.
>>> Assuming your polygons bboxes have a homogenous size and/or are not too
>>> big reguarding the total area .
>>> The idea is to group polygons into same cells, so when you try to find
>>> which polygons intersects, instead of comparing a polygon to each other
>>> polygon, you compare a polygon to each other polygon in the same cell.
>>>
>>>
>>> create a grid of cells (big enough or you will have lot's of cells)
>>> index the table
>>>
>>> for each cell, get polygons intersecting it. This defines groups of
>>> polygons in the same cell, defined by a new id "group_id"
>>> index "group_id"
>>>
>>> Now you will have several options depending on your hardware/data
>>> 1) process a cell at a time if limited hardware, adding "WHERE
>>> group_id=XX"
>>>
>>> 2) do a massiv inner join on the group_id and compute all
>>>
>>> In fact you could avoid to create explicitly the cells, but it will be
>>> more complicated.
>>>
>>> Cheers,
>>> Rémi-C
>>>
>>>
>>> 2013/11/20 Brent Wood <pcreso at pcreso.com>
>>>
>>>> I figure you have spatially indexed the polygons already?
>>>>
>>>> Any way of pre-categorising your polygons - binning them in some way
>>>> that allows a non spatial test in the where clause to replace the spatial
>>>> test...
>>>>
>>>> eg: a boolean attribute set to indicate if a feature has any part <= a
>>>> particular x value or not.
>>>>
>>>> run this against the 2m features once to populate it, and assuming you
>>>> get a 50/50 split of T/F features, your 2m^2 query can instead include a
>>>> "where a.bool = b.bool", as we already know that if the boolean flag is
>>>> different, they cannot touch, so your query should involve a spatial test
>>>> only over 1m^2 instead... if you do this on both x & y, the boolean filter
>>>> will replace even more spatial calculations & drop them to 500,000^2
>>>> tests...
>>>>
>>>> If you can pre-classify features so that non-spatial tests can reduce
>>>> the spatial ones (esp for features with lots of vertices) in a where
>>>> clause, such queries do run much faster, but you have the trade-off of the
>>>> time taken to carry out the classification... which is only n*no_classes/2,
>>>> generally much faster than n^n
>>>>
>>>> Hope this makes sense...
>>>>
>>>> Brent Wood
>>>>   ------------------------------
>>>>  *From:* Lee Hachadoorian <Lee.Hachadoorian+L at gmail.com>
>>>> *To:* PostGIS Users <postgis-users at postgis.refractions.net>
>>>> *Sent:* Wednesday, November 20, 2013 8:52 PM
>>>> *Subject:* [postgis-users] Finding Islands
>>>>
>>>> I am trying to find "islands", polygons in a (multi)polygon layer which
>>>> are not connected to any other polygons in the same layer. What I came
>>>> up with runs in a couple of seconds on a layer with ~1000 geometries,
>>>> and a couple of minutes on a layer with ~23,000 geometries, but I want
>>>> to run it on a layer of 2 million+ and it's taking a L-O-O-O-NG time,
>>>> presumably because it's making (2 million)^2 comparisons.
>>>>
>>>> What I came up with is:
>>>>
>>>> SELECT a.*
>>>> FROM table a LEFT JOIN table b
>>>>     ON (ST_Touches(a.geom, b.geom))
>>>> WHERE b.gid is null
>>>>
>>>> or
>>>>
>>>> SELECT *
>>>> FROM table
>>>> WHERE gid NOT IN (
>>>>     SELECT DISTINCT a.gid
>>>>     FROM table a JOIN table b
>>>>         ON (ST_Intersects(a.geom, b.geom) AND a.gid != b.gid)
>>>>     )
>>>>
>>>> The first variant raises "NOTICE:  geometry_gist_joinsel called with
>>>> incorrect join type". So I thought I could improve performance with an
>>>> INNER JOIN instead of an OUTER JOIN, and came up with the second
>>>> variant, and it does seem to perform somewhat better. Any suggestions
>>>> for how to speed up either approach?
>>>>
>>>> Best,
>>>> --Lee
>>>>
>>>> --
>>>> Lee Hachadoorian
>>>> Assistant Professor in Geography, Dartmouth College
>>>> http://freecity.commons.gc.cuny.edu
>>>>
>>>>
>>>> _______________________________________________
>>>> postgis-users mailing list
>>>> postgis-users at lists.osgeo.org
>>>> http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> postgis-users mailing list
>>>> postgis-users at lists.osgeo.org
>>>> http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users
>>>>
>>>
>>>
>>> _______________________________________________
>>> postgis-users mailing list
>>> postgis-users at lists.osgeo.org
>>> http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users
>>>
>>
>>
>>
>> --
>> Lee Hachadoorian
>> Asst Professor of Geography, Dartmouth College
>> http://freecity.commons.gc.cuny.edu/
>>
>> _______________________________________________
>> postgis-users mailing list
>> postgis-users at lists.osgeo.org
>> http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users
>>
>
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users
>

-- 
Lee Hachadoorian
Asst Professor of Geography, Dartmouth College
http://freecity.commons.gc.cuny.edu/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20131121/85029d47/attachment.html>