[GRASS-dev] HPC support and implementation options

Markus Neteler neteler at osgeo.org
Wed Sep 28 23:25:07 PDT 2016


(I take the liberty of forking this out from
Re: [GRASS-dev] Adding an expert mode to the parser
For the archive, see:
https://lists.osgeo.org/pipermail/grass-dev/2016-September/082520.html
)

On Sun, Sep 25, 2016 at 9:49 PM, Markus Neteler <neteler at osgeo.org> wrote:
> On Wed, Sep 28, 2016 at 10:51 PM, Markus Metz <markus.metz.giswork at gmail.com> wrote:
>> On Thu, Sep 29, 2016 at 12:03 AM, Sören Gebbert <soerengebbert at googlemail.com> wrote:
>> [snip]
>>>
>>> As an example, when aiming at processing all Sentinel-2 tiles
>>> globally, we are currently talking about 73000 scenes * up to 16
>>> tiles each, plus global data on top. Analysis on top of other
>>> global data is more complex when each job runs in its own mapset
>>> and is later reintegrated into a single target mapset than it
>>> would be if jobs could run in parallel in one mapset, simply by
>>> specifying the respective region to the command of interest.
>>> Yes, this is different from the current paradigm and not for G7.
>>
>> From our common experience, I would say that creating separate mapsets
>> is a safety feature. If anything goes wrong with a particular
>> processing chain, cleaning up is easy: simply delete that
>> mapset and run the job again, if possible on a different host/node
>> (assuming that failed jobs are logged). Anyway, I would be surprised
>> if the overhead of opening a separate mapset is measurable when
>> processing all Sentinel-2 tiles globally.

Generally I agree; in our MODIS experience it worked fine on a
"standalone" cluster system with local disks in each blade.

>> Reintegration into a single
>> target mapset could cause problems with regard to IO saturation, but
>> in a HPC environment temporary data always need to be copied to a
>> final target location at some stage.

Yes, with an internal connection of at least 10 Gb/s it worked decently.
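To illustrate the reintegration step: once a job's mapset has been
verified, copying its results into the final target mapset is a matter
of g.copy calls run from within the target mapset. A hedged sketch;
throttling concurrent copies against IO saturation is left to the job
scheduler:

    # merge verified results from a temporary mapset into the target
    # mapset; run with the target mapset as the current mapset
    import grass.script as gs

    def merge_mapset(tmp_mapset, rasters):
        # make the temporary mapset readable in this session
        gs.run_command("g.mapsets", operation="add", mapset=tmp_mapset)
        for name in rasters:
            gs.run_command("g.copy",
                           raster="%s@%s,%s" % (name, tmp_mapset, name))
        gs.run_command("g.mapsets", operation="remove", mapset=tmp_mapset)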

>> The HPC system you are using now
>> is most probably quite different from the one we used previously, so
>> this is a lot of guessing, particularly about the storage location of
>> temporary data (no matter if it is the same mapset or a separate
>> mapset).

Indeed, one of the current systems we are using is completely
virtualized, i.e. all disks are attached via a network which is, AFAIK,
itself virtualized. Hence there are no dedicated resources, but
competition with other, unknown users of the system.
I am still trying to understand how to optimize things there...

> Imagine you have a tool that is able to distribute the processing of a large
> time series of satellite images across a cluster. Each node in the cluster
> should process a stack of r.mapcalc, r.series or r.neighbors commands in a
> local temporary mapset that is later merged into a single one. A single
> stack of commands may contain hundreds of jobs that all run in a single
> temporary mapset. In this scenario you need separate region settings for
> each command in the stack, because of the different spatial extents of the
> satellite images. The size of the stack depends on the size of the time
> series (number of maps) and the number of available nodes.
>
> Having region setting options in the parser will make the implementation of
> such an approach much easier. Commands like t.rast.algebra and
> t.rast.neighbors will directly benefit from a region parser option, allowing
> the parallel processing of satellite image time series on a multi-core
> machine.

Yes - the key issue is that such virtualized cluster systems behave
quite differently from the bare-metal system we used to have in
Italy.

> Best regards
> Soeren

Best,
markusN
