[GRASS-dev] i.segment: possible to cache results of open_files() for several runs of i.segment ?

Moritz Lennert mlennert at club.worldonline.be
Thu Aug 3 01:40:35 PDT 2017


On 03/08/17 10:11, Markus Metz wrote:
> 
> 
> On Thu, Aug 3, 2017 at 7:02 AM, Moritz Lennert 
> <mlennert at club.worldonline.be> wrote:
>  >
>  > On 02/08/17 21:43, Markus Metz wrote:
>  >>
>  >> Hi Moritz,
>  >>
>  >> On Wed, Aug 2, 2017 at 2:52 PM, Moritz Lennert 
>  >> <mlennert at club.worldonline.be> wrote:
>  >>  >
>  >>  > Hi MarkusM,
>  >>  >
>  >>  > Working on segmentation parameter optimization with fairly 
> large images, we have stumbled upon some questions (ISTR that we've 
> discussed this before, but I cannot find traces of that discussion). 
> As a reminder, i.segment.uspo works by looping through a series of 
> threshold parameter values, segmenting a series of test regions at 
> each parameter value, and then comparing the results in order to 
> identify the "optimal" threshold.
>  >>  >
>  >>  > Two issues have popped up:
>  >>  >
>  >>  > - One approach we tried was to optimize thresholds separately 
> for different types of morphological zones. For each type we have 
> several polygons distributed across the image. These polygons are 
> used as input for a mask. However, it seems that even when most of 
> the image is masked out, open_files() takes a long time, as if it 
> reads the entire image. Is this expected/normal? Would it be possible 
> to reduce the read time when most of the area is masked?
>  >>
>  >> You could reduce the read time by zooming to the current mask with 
> g.region zoom=current_mask
>  >
>  >
>  > Yes, but this doesn't help in situations where the mask areas are 
> distributed across the entire image, so that the zoomed region is 
> almost as large as the original image.
>  >
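
For the record, the shortcut suggested above translates to roughly the 
following (a minimal grass.script sketch; "zone_polygons" and 
"imagery_group" are placeholder names, and a running GRASS session is 
assumed):

    import grass.script as gs

    # Set a mask from the zone polygons, then shrink the computational
    # region to the non-null cells of the resulting MASK raster before
    # segmenting.
    gs.run_command('r.mask', vector='zone_polygons')
    gs.run_command('g.region', zoom='MASK')
    gs.run_command('i.segment', group='imagery_group',
                   output='seg_masked', threshold=0.05)
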
>  >>
>  >>  >
>  >>  > - More generally: for every i.segment call, open_files() goes 
> through reading the input files and, AFAIU, checks the min/max values 
> and creates the seglib temp files (plus possibly other operations). 
> When segmenting the same image several times, just using different 
> thresholds, it would seem that most of what open_files() does is 
> repeated in exactly the same manner at each call. Would it be 
> possible to cache that information somehow and to instruct i.segment 
> to reuse it each time it is called on the same image and region?
>  >>
>  >> The most time- (and disk-space-)consuming part of open_files() is 
> creating temporary files for the input files, the current region and 
> the current mask. These files are temporary because too many things 
> can change between two consecutive runs of any module using the 
> segment library. First of all, the input files could change (same 
> name, but different cell values); the region and mask settings could 
> change as well.
>  >
>  >
>  > Agreed, but here I'm talking about the situation where I run 
> i.segment multiple times in a loop with exactly the same input, with 
> only the threshold value (and possibly minsize) changing. So we hoped 
> that it would be possible to reuse the segment library files.
> 
> One problem is that the contents of the temporary files are modified 
> at runtime and thus cannot be re-used for a new run. This is done to 
> save disk space and memory; otherwise resource requirements would 
> double if input and output were kept separate.

Ok, understood.
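
For context, the loop in question is essentially the following (a 
minimal grass.script sketch; "imagery_group" is a placeholder name). 
Each iteration triggers a full open_files() pass over the same input:

    import grass.script as gs

    # Same input group and region for every run; only the threshold
    # changes, yet each call re-reads the input and re-creates the
    # seglib temporary files in open_files().
    for threshold in [0.02, 0.05, 0.1, 0.2]:
        gs.run_command('i.segment',
                       group='imagery_group',
                       output='seg_' + str(threshold).replace('.', '_'),
                       threshold=threshold,
                       minsize=2)
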

>  >
>  >>
>  >>  >
>  >>  > Just trying to crunch larger and larger images... :-)
>  >>
>  >> As in, it's working but a bit slow?
>  >
>  >
>  > i.segment is definitely not slow compared to other similar 
> software, but in this specific case of looping, the accumulated time 
> spent reading the input files adds up to a significant duration.
> 
> Reading input can take some time, but I thought that most of the time 
> is spent on the actual segmentation, which takes substantially longer 
> than reading the input.

Yes, sure, we were just wondering whether this might be low-hanging 
fruit, but I now see it is quite the contrary.

Related to this is the question of parallel processing: we've hit upon 
the issue of several parallel processes all performing the same 
open_files() run on the same file at the beginning. Obviously this 
quickly leads to a severe bottleneck when you do not use a parallel 
file system, but we also thought that maybe the information could be 
shared. I now understand that this is not feasible, since the info 
will be modified by each process.
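
What we ran is roughly the following (a minimal sketch using Python's 
multiprocessing on top of grass.script; "imagery_group" is a 
placeholder name, and the per-process region/mask handling our real 
script does is left out). Every worker starts with its own full 
open_files() pass on the same input:

    import multiprocessing as mp

    import grass.script as gs

    def segment(threshold):
        # Each worker runs its own i.segment process, so every worker
        # repeats the open_files() phase on the same input group.
        gs.run_command('i.segment',
                       group='imagery_group',
                       output='seg_' + str(threshold).replace('.', '_'),
                       threshold=threshold)

    if __name__ == '__main__':
        pool = mp.Pool(4)
        pool.map(segment, [0.02, 0.05, 0.1, 0.2])
        pool.close()
        pool.join()
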

> Of course reading input maps does require some time, 
> but I can't see a reasonable solution for creating a permanent cache 
> of the input data without lots of sanity checks and increased 
> resource requirements. I see more potential in the actual 
> segmentation part; maybe this could be further optimized.

We're always ready to test any such optimizations. But as I said, I 
believe GRASS GIS is already quite fast in its region-growing 
segmentation...

Moritz

