[postgis-users] geomunion revisited....

Martin Davis mbdavis at refractions.net
Wed Nov 14 09:48:15 PST 2007


Lee,

Having done some more investigation, I can say that your dataset is a 
poster child for the use of Cascaded Union.  This is because it contains 
relatively simple geometries with a very high degree of overlap.  
Cascaded Union has the effect of quickly merging and discarding linework 
which doesn't appear in the final result.  This makes each individual 
union operation fairly performant.

In JTS-land my results were:

Cascaded Union: 20 sec
Iterated Union: 3 h 40 min !  (This is equivalent to the PostGIS 
geomunion aggregate)

I realize this doesn't help directly with your work, but it is a good 
example of why it would be nice to have CascadedUnion functionality in 
PostGIS.

I can also see why buffer was struggling with this data.  It has to deal 
with all the linework in the dataset at once, and with that many 
overlapping lines it is simply overwhelmed.   Using buffer() is a hack 
to get around the lack of true CascadedUnion.  It works in many cases, 
but does have its limits.

Martin

Martin Davis wrote:
> As I mentioned in my previous post, buffer has to pull all geometries 
> into memory at once and process them all together.  It creates a lot 
> of internal data structures in the course of processing.  I'm not that 
> suprised that it doesn't work on 30K geometries. Or, there may well be 
> a memory leak - our test cases don't actually include one with 30K 
> geoms  8^)  (Lee, I'd be interested to see if this works in JTS - can 
> you send me your dataset as a shapefile?)
>
> I can think of a couple of things to do:
> - Get the new version of JTS (from me) and use the CascadedUnion class
> - Use a simple grid index on your data to partition it, union each 
> partition separately, and then union the results.  This process would 
> look something like:    - pick an appropriate grid size (say 10x10 ?)
>    - compute the interior point of each geometry, determine which grid 
> cell it lies in, and save this in a new column
>    - union all geoms in each grid cell together
>    - union the result together
>
> Note: I haven't actually implemented this approach, but it's the only 
> way I can see to reduce the amount of geometry you are working with at 
> any given step.
>
> Lee Keel wrote:
>>> -----Original Message-----
>>> From: postgis-users-bounces at postgis.refractions.net 
>>> [mailto:postgis-users-
>>> bounces at postgis.refractions.net] On Behalf Of Kevin Neufeld
>>> Sent: Tuesday, November 13, 2007 12:40 AM
>>> To: PostGIS Users Discussion
>>> Subject: Re: [postgis-users] geomunion revisited....
>>>
>>> Also, I think you may be after the SQL syntax UNION ALL, not UNION, if
>>> you are simply after concatenating the resultset from table 2 to 
>>> table 1.
>>>
>>> select simplify(...
>>> UNION ALL
>>> select simplify(...
>>>
>>> Also, I too would be curious how buffer(collect(...), 0) fares for you
>>> instead of geomunion() for your final 5 hour query.
>>> Since you're doing this all at once anyway, it may work.  
>>> Geomunion() is
>>> an aggregate for the two geometry method.  Which means that it will
>>> slowly add one geometry at a time to an every growing resulting
>>> geometry.  Buffer(collect(...), 0) will decontruct the collection once,
>>> and rebuild it once to a union-ed geometry.
>>> -- Kevin
>>>
>>> Josh Livni wrote:
>>>    
>>>> I think running buffer(bunch_of_geoms,0) rather than
>>>> geomunion(bunch_of_geoms) might be faster.
>>>>
>>>> That said, still seems it takes a bit long.  If it were me I'd try
>>>> playing around by first creating a table with your results, eg:
>>>>
>>>> '''create table simple_buffers as (
>>>> select simplify(buffer(simplify(the_geom, 5), 400), 150) as the_geom
>>>> from table1
>>>> union
>>>> select simplify(buffer(simplify(the_geom, 5), 400), 150) as the_geom
>>>> from table2); '''
>>>>
>>>> then you could create a GiST index on the_geom of your new table, run
>>>> vacuum analyze on it, and then try the geomunion and buffer to test.
>>>> With 32k+ rows, depending on how much they overlap, that might help a
>>>> little, or at least help potentially isolate the issue if it does take
>>>> as long.
>>>> Also I assume you've modified your postgres config away from the
>>>> defaults to take advantage of more memory on your machine, etc.
>>>>
>>>>
>>>>
>>>>  -Josh
>>>>       
>> [Lee Keel]
>> Well, I have played with this some more and here is what I have found.
>>
>> I tried using the buffer(collect(the_geom), 0) on the largest table 
>> (about
>> 30K rows) and I get the following error:
>>
>> NOTICE:  St9bad_alloc
>>
>> ERROR: GEOS buffer() threw an error!
>> SQL state: XX000
>>
>>
>> This entire table is only 5872KB but when I monitor this buffer 
>> process the
>> memory gets over 1.7GB before it finally crashes out.  Even if you 
>> were to
>> triple the memory I can't see where it would ever get to 1.7GB.  
>> NOTE: When
>> monitoring the process of doing just the collect(the_geom) the memory
>> footprint gets up to about 26-28MB and takes about 2.5 minutes.  This 
>> leads
>> me to think there might be a memory leak in the buffer().
>>
>>
>> I tried using a test table.  This new table is 3704KB.  But I 
>> realized that
>> a gist index is not going to buy me anything because it is still 
>> going to
>> pull the entire table.
>>
>>
>> I have listed my config settings below.  I am not sure how much more 
>> I can
>> tweak them since I am running on 64bit Vista.  I have had a few posts 
>> in the
>> last week that have basically said that the windows version of this 
>> can't be
>> tweaked much more due to constraints by windows.
>>
>> shared_buffers = 256MB
>> temp_buffers = 32MB
>> max_prepared_transactions = 100
>> work_mem = 16MB
>> maintenance_work_mem = 256MB
>> max_stack_depth = 3MB
>>
>> Thanks for everyone's help and attention to this.
>> Lee
>>
>> This email and any files transmitted with it are confidential and 
>> intended solely for the use of the individual or entity to whom they 
>> are addressed. If you have received this email in error please notify 
>> the sender. This message contains confidential information and is 
>> intended only for the individual named. If you are not the named 
>> addressee you should not disseminate, distribute or copy this e-mail.
>> _______________________________________________
>> postgis-users mailing list
>> postgis-users at postgis.refractions.net
>> http://postgis.refractions.net/mailman/listinfo/postgis-users
>>
>>   
>

-- 
Martin Davis
Senior Technical Architect
Refractions Research, Inc.
(250) 383-3022




More information about the postgis-users mailing list