[GRASS-dev] i.segment: possible to cache results of open_files() for several runs of i.segment ?

Markus Metz markus.metz.giswork at gmail.com
Thu Aug 3 05:09:32 PDT 2017


On Thu, Aug 3, 2017 at 10:40 AM, Moritz Lennert <mlennert at club.worldonline.be> wrote:
>
> On 03/08/17 10:11, Markus Metz wrote:
>>
>>
>>
>> On Thu, Aug 3, 2017 at 7:02 AM, Moritz Lennert <mlennert at club.worldonline.be> wrote:
>>  >
>>  > On 02/08/17 21:43, Markus Metz wrote:
>>  >>
>>  >> Hi Moritz,
>>  >>
>>  >> On Wed, Aug 2, 2017 at 2:52 PM, Moritz Lennert <mlennert at club.worldonline.be> wrote:
>>  >>  >
>>  >>  > Hi MarkusM,
>>  >>  >
>>  >>  > Working on segmentation parameter optimization with fairly large
images, we have stumbled upon some questions (ISTR that we've discussed this
before, but I cannot find traces of that discussion). As a reminder,
i.segment.uspo works by looping through a series of threshold parameter
values, segmenting a series of test regions at each parameter value, and
then comparing the results in order to identify the "optimal" threshold.
>>  >>  >
>>  >>  > Two issues have popped up:
>>  >>  >
>>  >>  > - One approach we tried was to optimize thresholds separately for
different types of morphological zones. For each type we have several
polygons distributed across the image. These polygons are used as input for
a mask. However, it seems that even if most of the image is masked,
open_files() takes a long time, as if it reads the entire image. Is this
expected/normal? Would it be possible to reduce the read time when most of
the area is masked?
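
A minimal sketch of the workflow described above, assuming the zone polygons
are in a (placeholder) vector map "zones" with a (placeholder) attribute
filter, and using placeholder group/output names:

r.mask vector=zones where="type = 'ridge'"   # set the raster mask from the polygons
i.segment group=mygroup output=seg_ridge threshold=0.05
r.mask -r                                    # remove the mask again
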
>>  >>
>>  >> You could reduce the read time by zooming to the current mask with
g.region zoom=current_mask
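
For example, a sketch assuming the active raster mask is the standard MASK
raster and placeholder group/output names:

g.region zoom=MASK   # shrink the region to the non-NULL cells of the mask
i.segment group=mygroup output=seg_masked threshold=0.05
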
>>  >
>>  >
>>  > Yes, but this doesn't help for situations where the mask areas are
distributed across the entire image, so that the region will be almost as
large as the original image.
>>  >
>>  >>
>>  >>  >
>>  >>  > - More generally: for every i.segment call, open_files() reads
the input files and, AFAIU, checks for min/max values and creates the
seglib temp files (+ possibly other operations). When segmenting the same
image several times, just using different thresholds, it would seem that
most of what open_files() does is repeated in exactly the same manner at
each call. Would it be possible to cache that information somehow and to
instruct i.segment to reuse that info each time it is called on the same
image and region?
>>  >>
>>  >> The most time- (and disk-space-)consuming part of open_files() is
creating temporary files for the input files under the current region and
mask. These files are temporary because too many things can change between
two consecutive runs of any module using the segment library: the input
files could change (same name, but different cell values), and the region
and mask settings could change.
>>  >
>>  >
>>  > Agreed, but here I'm talking about the situation where I run
i.segment multiple times in a loop with exactly the same input, with only
the threshold value (and possibly minsize) changing. So we hoped it would
be possible to reuse the segment library files.
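
The kind of loop in question might look like this (a sketch; group/output
names and threshold values are placeholders):

for thresh in 0.02 0.05 0.10; do
    i.segment group=mygroup output=seg_${thresh} threshold=${thresh} --overwrite
done
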
>>
>> One problem is that the contents of the temporary files are modified at
runtime and thus cannot be reused for a new run. This is done to save disk
space and memory; otherwise resource requirements would double if input and
output were kept separate.
>
>
> Ok, understood.
>
>>  >
>>  >>
>>  >>  >
>>  >>  > Just trying to crunch larger and larger images... :-)
>>  >>
>>  >> As in, it's working but a bit slow?
>>  >
>>  >
>>  > i.segment is definitely not slow compared to other similar software,
but in this specific case of looping, the accumulated time spent reading
the input files grows to a significant duration.
>>
>> Reading input can take some time, but I thought that most of the time is
spent on the actual segmentation, which takes substantially longer than
reading the input.
>
>
> Yes, sure, we were just wondering whether this might be low-hanging
fruit, but I now see it is quite the contrary.

There is one relatively easy way to speed up reading the input if the
input maps are compressed with ZLIB or BZIP2: recompress them with LZ4.
This speeds up reading because quite a bit of time is spent on
decompressing ZLIB- or BZIP2-compressed data.

export GRASS_COMPRESSOR=LZ4
r.compress map=<input_map>   # recompress the map in place using LZ4
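
To recompress all rasters of an imagery group in one go, something like the
following should work (a sketch; "mygroup" is a placeholder and this assumes
i.group -g lists the group members one per line):

export GRASS_COMPRESSOR=LZ4
for map in $(i.group group=mygroup -g); do
    r.compress map="$map"
done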

Markus M
>
> Related to this is the question of parallel processing: we've hit upon
the issue of several parallel processes all performing the same open_files()
step on the same file at the start. Obviously this quickly leads to a severe
bottleneck when you do not use a parallel file system, but we also thought
that maybe the information could be shared. I now understand that this is
not feasible since the info will be modified by each process.
>
>> Of course reading input maps does require some time, but I can't see a
reasonable solution for creating a permanent cache of the input data
without lots of sanity checks and increased resource requirements. I see
more potential in the actual segmentation part; maybe this could be further
optimized.
>
>
> We're always ready to test any such optimizations. But as I said, I
believe GRASS GIS is already quite fast in its region growing
segmentation...
>
> Moritz