[GRASS-user] v.generalize: does it take forever?

Fábio Dias fabio.dias at gmail.com
Sat Jan 10 10:23:22 PST 2015


> I have optimized the GRASS vector library in trunk r64032 and added
> another topology check to v.generalize in trunk r64033. The profile of
> v.generalize now shows that it is limited by disk I/O speed (on my
> laptop with a standard laptop-like spinning HDD), which means that the
> algorithms are, under the test conditions, close to their optimum.
> This picture might change as soon as you use a high-performance server
> or a SSD.


Then I should profile my current setup. My grassdata dir is
not on a disk but on a mounted ramdisk, which is, basically, RAM,
i.e. really, really fast. That should be interesting.
By the way, setting one up is really easy, at least on Linux, and it
should noticeably improve performance for big datasets. Obviously
you'd need a machine with plenty of RAM too, but well, a big nail
needs a big hammer.

cd ~
mkdir -p grassdata
# mount a RAM-backed tmpfs on the new directory; adjust size= as needed
sudo mount -t tmpfs -o size=512M tmpfs grassdata

In my case, the machine has 128 GB of RAM, so I made a 32 GB ramdisk.
Each vector directory takes about 6 GB, so that is plenty.
Of course, the data will be lost if you shut down or reboot the
machine, so extra care is needed.
I did not compare the results with and without the ramdisk, btw.


> The speed improvement is non-linear: for small datasets as in the
> official GRASS datasets, you won't notice a difference. For one tile
> of Terraclass, the processing speed should be about 2-4 times faster
> than before. For the full Terraclass dataset, the processing speed
> could be >10 times faster than before. You will need to wait until say
> 10% of the processing has been reached in order to estimate the total
> processing time. Simplifying each line takes its own time, therefore
> the processing time of 100% is not equal to 100 x the processing time
> of 1%.

Indeed, but it was a (very) rough approximation.

> Another user has applied v.generalize to NLCD2011 and it took nearly 2
> months. Your dataset is probably a bit smaller, but the Terraclass
> shapefiles are full of errors. If you want to fix these errors, this
> will take some time.

You know this dataset? The errors are really bugging me. They are
mostly due to the process/tools usually used to produce it. We have
passed along the request for a more topologically correct approach;
maybe on the next iteration. But I'll create another thread shortly
asking for advice regarding these errors :)

> I recommend to test the new v.generalize first on a subregion of
> Terraclass. Only if the processing speed and the results are
> acceptable, proceed with the full dataset. Otherwise, please report.

Testing before deploying? Where's the fun in that? :)
Joking aside, I did that before trying the full dataset. I did,
however, interrupt the processing to start over with the new trunk
version, because you said it would be faster. And indeed it is, thank
you very much.
By skipping the dissolve and the v.clean tool=break step on the
original data beforehand, I've reduced the processing time from more
than 30h for 1% to 24h for 11%. With the latest release, 9% in 18h.
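As a rough sanity check on those numbers, here is a naive linear extrapolation in a few lines of Python. As pointed out above, each line takes its own time to simplify, so this is only a (very) rough approximation, not a real prediction; the function name is my own, for illustration:

```python
def estimate_total_hours(hours_elapsed, percent_done):
    """Linearly extrapolate total runtime from partial progress.
    Assumes every percent of progress costs the same time, which
    v.generalize does not guarantee."""
    return hours_elapsed * 100.0 / percent_done

# 9% in 18 hours, as in the run above:
print(estimate_total_hours(18, 9))  # 200.0 hours if progress were perfectly linear
```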

However, this whole thing got me thinking about what you said in an earlier message:

> The check_topo function can not be executed in parallel because 1)
> topology must not be modified for several boundaries in parallel, 2)
> data are written to disk, and disk IO is by nature not parallel.

Well, there's not much we can do about disk I/O. But on high-end
servers (again, I'm thinking big hammers), between the disk speed and
the cache, it shouldn't really be a bottleneck, nor lock the threads
for long. Assuming the "vector access" functions are thread safe
(which I think they eventually will be; IMHO that would be the first
step toward making the whole software "parallel-capable"), we could
allow parallel changes to the topology by carefully choosing which
lines are considered at a time. One simple example would be lines
whose bounding boxes do not intersect. I'm not sure how much overhead
this would add, or whether it would be worth it.
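Just to make the idea concrete, here is a toy Python sketch of it. The BBox type and the greedy batching helper are my own illustration, not anything from the GRASS source; the point is only that lines whose bounding boxes are mutually non-intersecting could, in principle, have their topology updated concurrently:

```python
from collections import namedtuple

BBox = namedtuple("BBox", "xmin ymin xmax ymax")

def intersects(a, b):
    """Axis-aligned bounding-box overlap test."""
    return not (a.xmax < b.xmin or b.xmax < a.xmin or
                a.ymax < b.ymin or b.ymax < a.ymin)

def batch_non_overlapping(bboxes):
    """Greedily partition line bboxes (by index) into batches of
    mutually non-intersecting boxes; each batch would be safe to
    process in parallel under the assumption above. O(n^2) worst
    case, so the overhead concern is real."""
    batches = []
    for i, box in enumerate(bboxes):
        for batch in batches:
            if all(not intersects(box, bboxes[j]) for j in batch):
                batch.append(i)
                break
        else:
            batches.append([i])
    return batches

boxes = [BBox(0, 0, 1, 1), BBox(2, 2, 3, 3), BBox(0.5, 0.5, 2.5, 2.5)]
print(batch_non_overlapping(boxes))  # [[0, 1], [2]] -- box 2 overlaps both others
```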


Thanks again,

F
