[GRASS-user] v.generalize: does it take forever?

Sun Feb 15 09:45:14 PST 2015

The numbers I mention in the messages aren't really benchmark
material. I didn't do a proper comparison at the end, once things
started to really work. And most of the speedup was due to changes in
the code, which will benefit everyone automatically.

Most of what was discussed found its way into the docs, with the
possible exception of the poor performance of sqlite when
parallelizing jobs using & in bash. I say "possible" because it might
have made it in; I've been otherwise occupied and haven't checked the
docs lately.
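
For reference, here is a minimal Python sketch of the "several
v.generalize jobs at once" setup discussed in this thread. It is an
illustration, not the commands actually used: the output naming scheme
and the helper names build_jobs/run_parallel are made up, and it
assumes it is started inside a running GRASS session where
v.generalize can be found. It does nothing about the attribute
database backend, so the sqlite caveat above still applies.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_jobs(input_map, thresholds, method="douglas"):
    """Return one v.generalize command (as an argument list) per threshold."""
    jobs = []
    for t in thresholds:
        # Hypothetical naming scheme for the output maps, e.g. roads_gen_2_5.
        output_map = f"{input_map}_gen_{str(t).replace('.', '_')}"
        jobs.append([
            "v.generalize",
            f"input={input_map}",
            f"output={output_map}",
            f"method={method}",
            f"threshold={t}",
        ])
    return jobs


def run_parallel(jobs, max_workers=4, runner=subprocess.run):
    """Run the commands concurrently; return their exit codes in job order."""
    def run_one(cmd):
        # runner is swappable so the scheduling can be tried without GRASS.
        return runner(cmd).returncode

    # Threads are enough here: each one only waits on an external process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, jobs))
```

Something like run_parallel(build_jobs("terraclass_2012", [10, 50, 100]),
max_workers=3) would then start three jobs side by side (the map name
is hypothetical). max_workers caps how many run at once, which is the
main thing plain & in bash does not give you.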

F
-=--=-=-
Fábio Augusto Salve Dias
ICMC - USP
http://sites.google.com/site/fabiodias/

On Sun, Feb 15, 2015 at 2:14 AM, Vaclav Petras <wenzeslaus at gmail.com> wrote:
>
>
> On Mon, Feb 9, 2015 at 4:52 PM, Fábio Dias <fabio.dias at gmail.com> wrote:
>>
>> I switched to postgis for data storage and the v.generalize time went
>> down to 130ish minutes, all processes working in parallel.
>>
>> I'm happy now :) thank you guys very much.
>
>
> Thanks for reporting this back. What about a blog post, or something like
> that, on this topic? I believe there are a lot of people interested in some
> benchmarks.
>
> Vaclav
>
>>
>>
>> On Tue, Jan 27, 2015 at 8:50 PM, Fábio Dias <fabio.dias at gmail.com> wrote:
>> > Hi,
>> >
>> > I've kept iotop running in cumulative mode while the generalization
>> > runs. No disk I/O involved, just a couple of postgres stats. I believe the OS
>> > is keeping everything in RAM cache. I don't believe the disk is a
>> > bottleneck either, it is a 10 disk raid of 15k rpm disks, it's really
>> > fast.
>> >
>> > I interrupted the processing, moved everything into postgres and
>> > started over. I'm still loading the shapefiles (that I'm doing one at
>> > a time), I'll start the 15 processes as soon as it is loaded. As soon
>> > as that stabilizes, I'll report back.
>> >
>> >
>> > On a related note, wouldn't it be interesting to always try to
>> > simplify a line? As I understood the code, if the simplification fails
>> > for any reason, the original line is used. Might be interesting to
>> > have an option that makes the algorithm retry that line, with half the
>> > threshold, for instance. It's kind of weird for me to see one side of
>> > something really simplified while the other side really complicated :)
>> >
>> > F
>> >
>> >
>> > On Tue, Jan 27, 2015 at 7:56 PM, Markus Metz
>> > <markus.metz.giswork at gmail.com> wrote:
>> >> On Mon, Jan 26, 2015 at 3:54 PM, Fábio Dias <fabio.dias at gmail.com>
>> >> wrote:
>> >>> Hi,
>> >>>
>> >>> The machine has 128 GB of RAM. No matter what I do, I can't make a
>> >>> dent in it. Even with everything cached in RAM (shp files, database,
>> >>> the whole lot), there is still free memory.
>> >>
>> >> OK, it's not RAM.
>> >>
>> >>>
>> >>> I'm asking about the database because the behavior I'm seeing in 'top'
>> >>> looks like what you get when mutexes are involved. The processes
>> >>> don't all go to 100% at the same time (and the machine has 64
>> >>> processors, so no dent there either), except for the v.in.ogr.
>> >>
>> >> The v.generalize processes should be at 100% while generalizing,
>> >> unless the disk can not keep up with multiple simultaneous IO
>> >> requests. The tables are copied only after the generalization finished
>> >> (100% reached).
>> >>
>> >>> What it
>> >>> looks like is that something is locking each process and they are
>> >>> taking turns. Considering how 'lite' the sqlite appears to be, and the
>> >>> weird locking behavior that was mentioned in other threads (I'm not
>> >>> getting the locked message here... I did, when I was running 2
>> >>> parallel v.in.ogr), isn't it likely to be the culprit? Should I change
>> >>> it to a more 'non-lite' system, like postgres for instance?
>> >>
>> >> That could make sense when running multiple processes in parallel. An
>> >> alternative would be to create a separate mapset for each process and
>> >> at the end copy the results back to the main mapset.
>> >>
>> >> Technically, it is not possible that the new v.generalize version in
>> >> trunk (G71) is slower than the old version as in relbr70 because the
>> >> new version updates only those parts of the topology that actually get
>> >> changed. The old version also updates components that do not get
>> >> changed, which is quite time-consuming.
>> >>
>> >> I understand you like to go for the big nail immediately, but maybe it
>> >> is worth testing first on a smaller sample?
>> >>
>> >> Markus M
>> >>
>> >>>
>> >>> F
>> >>>
>> >>>
>> >>> On Mon, Jan 26, 2015 at 7:22 AM, Markus Metz
>> >>> <markus.metz.giswork at gmail.com> wrote:
>> >>>> On Mon, Jan 26, 2015 at 9:30 AM, Markus Metz
>> >>>> <markus.metz.giswork at gmail.com> wrote:
>> >>>>> On Sun, Jan 25, 2015 at 6:11 PM, Fábio Dias <fabio.dias at gmail.com>
>> >>>>> wrote:
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> Running r64249, with a couple of things in parallel using &. It
>> >>>>>> seems to be considerably slower.
>> >>>>>
>> >>>>> Very strange. Are you using trunk or GRASS 7.0?
>> >>>>
>> >>>> Here, v.generalize on a TerraClass tile is down from 25 minutes to 13
>> >>>> seconds.
>> >>>>
>> >>>>>
>> >>>>>> More than 100h, no 1% printed. To be fair,
>> >>>>>> I'm not entirely sure I'll see it when it prints, with 10 v.generalize
>> >>>>>> running (5 for each year) + 1 v.in.ogr for 2012. That v.in.ogr has been
>> >>>>>> running for almost 100h too. I'm loading the shps directly, as
>> >>>>>> advised way, way back in this thread.
>> >>>>>
>> >>>>> What exactly do you mean by "loading shps directly"? For
>> >>>>> v.generalize, you should import them with v.in.ogr.
>> >>>>>
>> >>>>> What about memory consumption on your system? With 10 v.generalize +
>> >>>>> 1 v.in.ogr on such a big dataset, quite a lot of memory would be used.
>> >>>>>
>> >>>>> Markus M
>> >>>>>
>> >>>>>>
>> >>>>>> AFAIK, no disk is being used, the whole thing is cached (after more
>> >>>>>> than 24h of processing, cumulative iotop shows only a few MB
>> >>>>>> written/read). I'm no longer using a ramdisk for the grassdata dir.
>> >>>>>>
>> >>>>>> However, it appears to be considerably slower, probably because of
>> >>>>>> the jobs running in parallel.
>> >>>>>>
>> >>>>>> My question then would be, considering the thread I saw about
>> >>>>>> sqlite, should I be using something else as the backend? When does
>> >>>>>> it start to make sense to change it?
>> >>>>>>
>> >>>>>> F
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Wed, Jan 14, 2015 at 1:06 PM, Markus Neteler <neteler at osgeo.org>
>> >>>>>> wrote:
>> >>>>>>> On Wed, Jan 14, 2015 at 3:54 PM, Fábio Dias <fabio.dias at gmail.com>
>> >>>>>>> wrote:
>> >>>>>>> ...
>> >>>>>>>> What would be the best way to do that in parallel? One mapset for
>> >>>>>>>> each year? Can I run multiple v.generalizes on the same input with
>> >>>>>>>> different outputs?
>> >>>>>>>
>> >>>>>>> Yes sure.
>> >>>>>>>
>> >>>>>>>> My first thought was to run completely separate grass processes
>> >>>>>>>> for each simplification, but I didn't find a way to make it look
>> >>>>>>>> somewhere other than .grass / .grass70 for the configuration
>> >>>>>>>> stuff....
>> >>>>>>>
>> >>>>>>> Maybe take a look at this approach
>> >>>>>>> http://grasswiki.osgeo.org/wiki/Parallel_GRASS_jobs#Grid_Engine
>> >>>>>>>
>> >>>>>>> but even sending different v.generalize jobs to background (&)
>> >>>>>>> should work if you have enough RAM.
>> >>>>>>>
>> >>>>>>> markusN