[GRASS-user] v.generalize: does it take forever?

Markus Metz markus.metz.giswork at gmail.com
Thu Jan 1 14:13:31 PST 2015


On Wed, Dec 31, 2014 at 5:20 PM, Fábio Dias <fabio.dias at gmail.com> wrote:
> On Wed, Dec 31, 2014 at 12:23 PM, Markus Neteler <neteler at osgeo.org> wrote:
>> On Sun, Dec 28, 2014 at 8:04 PM, Fábio Dias <fabio.dias at gmail.com> wrote:
>>> Hello all,
>>>
>>> Context: I've loaded some shp files into PostGIS, containing
>>> information on the Amazon forest. For reference, the SQL script is
>>> around 6 GB.
>>
>> How many polygons do you have there approximately?
>
> That depends. The information is separated by state. In this case, AP
> corresponds to the state of Amapá, which is the smallest one,
> data-wise, with only ~70 MB. The state of Pará has 2.3 GB.
> Ideally, I would generalize the information as a whole, not each state
> independently, so I don't get gaps etc.

Makes sense.

> The whole thing, for one of
> the 4 years available, has around 5M polygons (counting from PostGIS;
> I do not have the data imported into GRASS at the moment. I'm
> importing, but it will take a while). The other years have more
> polygons, and it wouldn't be unreasonable to expect around 10M.

I would avoid the PostGIS step and import the shapefiles directly into GRASS.
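
A minimal sketch of such a direct import (paths, map names, the snap
threshold and the dissolve column are placeholders; snap is given in
map units, i.e. degrees for latlong):

  # import, snapping boundary vertices within ~1e-7 degrees
  v.in.ogr input=/path/to/amazonia.shp output=amazonia snap=0.0000001
  # dissolve common boundaries by attribute (hypothetical column name)
  v.dissolve input=amazonia column=class output=amazonia_diss
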
>
>
>>> Problem: I managed to import, clean and dissolve properly, but when
>>> I run the generalization, by my estimates, it would take almost a
>>> year to complete.
>>
>> This will also depend on the generalization method you selected.
>
> Yes, but in a minor way, as I'll detail in the next part.
>
>>
>>> I also noticed that neither GRASS nor PostGIS is capable of parallel
>>> processing...
>> Yeah, hot topic for 2015 :-) Indeed, worth a thesis in my view!
>
> I dug into the v.generalize code, thinking about pthread
> parallelization. The geometry part of the code is *really* fast and
> could easily be parallelized to run even faster. But, according to
> the profile google-perftools gave me, the real bottleneck is inside
> the check_topo function (which uses static vars and inserts a new line
> into the vector, not only checks whether it breaks topology - I got
> stuck in there for a while because of the misnomer). More
> specifically, it is in the R-tree function used to check whether one
> line intersects other lines.

The check_topo function cannot be executed in parallel because 1)
topology must not be modified for several boundaries in parallel, and
2) data are written to disk, and disk IO is by nature not parallel.
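
For reference, a CPU profile like the one you describe can be collected
with gperftools without rebuilding; a sketch (library path, map names
and threshold are placeholders):

  LD_PRELOAD=/usr/lib/libprofiler.so CPUPROFILE=/tmp/vgen.prof \
    v.generalize input=amazonia_diss output=amazonia_gen \
      method=douglas threshold=0.0001
  # summarize the hottest functions
  google-pprof --text "$(which v.generalize)" /tmp/vgen.prof | head -20
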

>
> I commented out the check_topo call and it ran a whole lot faster. The
> result, obviously, was topologically broken, but it confirmed that
> check_topo is indeed the bottleneck.
>
>>> Question: Am I using the correct tool for that? Is there a way to
>>> speed up the processing?
>>>
>>> For reference, the commands I've used (GRASS 7.0, beta4, 22 Dec 2014):
>>
>> (Glad you use beta4, so we have a recent code base to check.)

A pity you use beta4; please use current trunk instead, because there
are a few improvements in trunk that are not available in beta4.
v.generalize should be quite a bit faster in trunk than in beta4.
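
A sketch of getting and building trunk (assuming the usual build
dependencies are already installed):

  svn checkout https://svn.osgeo.org/grass/grass/trunk grass_trunk
  cd grass_trunk
  ./configure && make
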

>>
>> I suppose that method=douglas is faster than method=reduction?

Yes.

>
> With the full dataset, both were painfully slow. And by slow, I mean
> more than 24 hours without even printing the 1% progress message.
>
>
>> What is the projection you are working with? Given the threshold and
>> assuming LatLong, I get a short distance.
>
> EPSG:4674. It is indeed latlong.

That does not matter.

> The idea is to have multiple generalizations as different tables in
> PostGIS and fetch data from the correct table using the current zoom
> level in the web interface (Google Maps based). I considered serving
> the map using WMS/GeoServer and also rendering on the client using
> node.js (io.js now, apparently) and TopoJSON.

As you mentioned above, the whole dataset should be generalized at
once to avoid gaps and overlapping parts.
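
A sketch of that zoom-level workflow on the full dataset (map and table
names are hypothetical, thresholds are in degrees for EPSG:4674, and
v.out.postgis parameter names may differ slightly between versions):

  # one generalization level per zoom range, smallest threshold first
  for l in 1:0.0001 2:0.001 3:0.01; do
    n=${l%%:*}; t=${l#*:}
    v.generalize input=amazonia_diss output=amazonia_gen_$n \
      method=douglas threshold=$t
    # export each level as its own PostGIS table
    v.out.postgis input=amazonia_gen_$n \
      output="PG:dbname=amazonia" output_layer=public.gen_level_$n
  done
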

>>
>> Probably you want to try a larger threshold first?

No, rather try a very small threshold first.
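
For example (map names are placeholders; 0.00001 degrees is roughly
1 m on the ground):

  v.generalize input=amazonia_diss output=amazonia_test \
    method=douglas threshold=0.00001
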

Markus M

