[GRASS-user] v.generalize: does it take forever?

Fábio Dias fabio.dias at gmail.com
Wed Dec 31 08:20:09 PST 2014


On Wed, Dec 31, 2014 at 12:23 PM, Markus Neteler <neteler at osgeo.org> wrote:
> On Sun, Dec 28, 2014 at 8:04 PM, Fábio Dias <fabio.dias at gmail.com> wrote:
>> Hello all,
>>
>> Context: I've loaded some shapefiles into PostGIS, containing
>> information covering the Amazon forest. For reference, the SQL script is
>> around 6 GB.
>
> How many polygons do you have there approximately?

That depends. The information is separated by state. In this case, AP
corresponds to the state of Amapá, which is the smallest one,
data-wise, with only about 70 MB. The state of Pará has 2.3 GB.
Ideally, I would generalize the information as a whole, not each state
independently, so I don't get gaps etc. The whole thing, for one of
the 4 years available, has around 5M polygons (counted from PostGIS;
I do not have the data imported into GRASS at the moment. I'm importing,
but it will take a while). The other years have more polygons, and it
wouldn't be unreasonable to expect around 10M.
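
For the record, I'm counting these straight from PostGIS with something
like the following (database, table and geometry column names here are
just placeholders for my actual schema):

  # rows in the table for one state/year
  psql -d amazonia -c "SELECT count(*) FROM ap10;"
  # individual polygons, in case the geometries are stored as multipolygons
  psql -d amazonia -c "SELECT sum(ST_NumGeometries(geom)) FROM ap10;"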


>> Problem: I managed to import, clean and dissolve properly, but when I
>> run the generalization, by my estimates, it would take almost a year
>> to complete.
>
> This will also depend on the generalization method you selected.

Yes, but only in a minor way, as I'll detail below.

>
>> I also noticed that neither grass nor postgis are capable of parallel
>> processing...
> Yeah, hot topic for 2015 :-) Indeed, worth a thesis in my view!

I poked around the v.generalize code, thinking about pthread
parallelization. The geometry part of the code is *really* fast and
could easily be parallelized to run even faster. But, according
to the profile google-perftools gave me, the real bottleneck is inside
the check_topo function (which uses static variables and actually
inserts the new line into the vector, it doesn't only check whether it
breaks topology - I got stuck there for a while because of the
misnomer). More specifically, it is the RTree function used to check
whether one line intersects other lines.
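
For reference, I collected the profile roughly like this (the gperftools
library path and binary names vary by distribution, so treat it as a
sketch):

  # run v.generalize with the gperftools CPU profiler preloaded
  CPUPROFILE=/tmp/vgen.prof LD_PRELOAD=/usr/lib/libprofiler.so \
      v.generalize input=ap10d output=ap10g method=douglas threshold=0.00025 --overwrite
  # print the hottest functions; check_topo and the RTree calls dominate here
  google-pprof --text `which v.generalize` /tmp/vgen.prof | head -n 20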

I commented out the check_topo call and it ran a whole lot faster. The
result, obviously, was really bad and topologically messed up, but it
confirmed that check_topo is indeed the bottleneck.

>> Question: Am I using the correct tool for that? Is there a way to
>> speed up the processing?
>>
>> For reference, the commands I've used (GRASS 7.0, beta4, 22 Dec 2014):
>
> (Glad you use beta4, so we have a recent code base to check.)
>
>> v.in.ogr -e --verbose input="pg:host=localhost (...)" layer=ap10
>> output=ap10 snap=1e-6
> --> Please tell us how many polygons or lines the layer "ap10" contains.

ap10 was just a 'toy' dataset to try out the script. It is
considerably smaller than the real dataset. The PostGIS table of this
data has 50k records/polygons.
The v.info output for it is at: http://pastebin.com/8RZELd8p
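
(The same counts can be printed inside GRASS, e.g.:

  # topology summary in shell-script style: boundaries, centroids, areas, ...
  v.info -t map=ap10
)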

>
>> v.clean -c --verbose input=ap10 output=ap10c tool=bpol,break,rmsa type=line
> --> Not sure but should type be "boundary" or "line"?

I tried combinations and variations; I'm not that sure either. My
PostGIS data is composed of polygons. It is land-use classification
data (or something like that; I'm not that familiar with the
geo nomenclature).
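
Since the data are areas, the variant I'd try next is something along
these lines (just a sketch, I'm still not sure it's the right cleaning
sequence):

  # clean boundaries rather than lines, since the input consists of polygons/areas
  v.clean -c --verbose input=ap10 output=ap10c tool=bpol,break,rmsa type=boundary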

>> v.dissolve --verbose input=ap10c column=tc_2010 output=ap10d --overwrite
> --> How many polygons does "ap10d" contain?

120k boundaries (http://pastebin.com/8RZELd8p)

>
>> Try #1 )   v.generalize --verbose --overwrite input=ap10d output=ap10r
>> method=reduction threshold=0.00025 --overwrite
>> Try #2 )   v.generalize --verbose --overwrite input=ap10d output=ap10g
>> method=douglas threshold=0.00025 --overwrite
>
> I suppose that method=douglas is faster than method=reduction?

With the full dataset, both were painfully slow. And by slow, I mean
more than 24 hours without even printing the 1% progress message.


> What is the projection you are working with? Given the threshold and
> assuming LatLong, I get a short distance:

EPSG:4674. It is indeed lat/long.
The idea is to store multiple generalizations as different tables in
PostGIS and fetch data from the appropriate table based on the current
zoom level in the web interface (Google Maps based). I also considered
serving the map with WMS/GeoServer, and rendering on the client using
node.js (io.js now, apparently) and TopoJSON.
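
The export step would then be something like this (connection string and
layer names are placeholders, and the parameter names are as I read them
in the 7.0 v.out.ogr manual, so please double-check):

  # write one generalization level back to PostGIS as its own table
  v.out.ogr input=ap10g type=area format=PostgreSQL \
      output="PG:host=localhost dbname=amazonia" output_layer=ap10_gen1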

>
> GRASS 7.1.svn (latlong):~ > g.region -g res=0.00025 -a
> n=4.15
> s=-16.369
> w=-76.23975
> e=-45.1125
> nsres=0.00025
> ewres=0.00025
> ...
>
> GRASS 7.1.svn (latlong):~ > g.region -m...
> nsres=27.64959657
> ewres=27.21883498
> ...
>
> Probably you want to try a larger threshold first?

Empirically, that value removed only the jagged edges, so it was a
good first generalization. My idea was that, afterwards, I'd increase
the threshold and generate coarser generalizations.
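
So the plan is basically a loop over increasing thresholds, roughly like
this (the larger values are guesses I'd still have to tune empirically):

  # generate progressively coarser generalizations for lower zoom levels
  i=0
  for t in 0.00025 0.001 0.004; do
      i=$((i+1))
      v.generalize input=ap10d output=ap10d_gen$i method=douglas \
          threshold=$t --overwrite
  done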

thanks again,
F

