[GRASS-dev] Avoiding maximum open files limit in r.series

Sören Gebbert soerengebbert at googlemail.com
Fri Oct 7 14:03:56 EDT 2011


Hello Glynn,

2011/10/6 Glynn Clements <glynn at gclements.plus.com>:
>
> Sören Gebbert wrote:
>
>> Dear devs,
>> just for your information, I have added support for input files
>> with newline-separated map names to r.series.
>> r.series now supports two input methods, file and input. Using the
>> <file> option is slower but avoids the open file descriptor limit.
>
> I've made some changes (mostly just clean-up) to this; can you test
> r48654?
>
> Main changes:
>
> Make opening/reading/closing maps for each row a separate feature
> (-x flag). This has a significant performance impact, may be
> unnecessary ("ulimit -n" is 1024 by default, but this can be changed
> if you have sufficient privilege; 100k open files is quite possible),
> and may be necessary even if map names are specified on the command
> line (via input=).

Neither my colleagues nor our system administrators know how to raise
the open file limit on Unix machines, and I don't know whether this
limit can be changed at all on Windows or Mac OS, so I thought the
file option would be a meaningful addition. I have also hit the
command line argument limit of the Python subprocess module when
running r.series via grass.script.run_command() and did not find a
solution ... except to patch r.series.
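
Just to make sure I understand the new per-row mode: my mental model
is roughly the sketch below, written against the GRASS 7 raster API
(the accumulation step is left out, and this is of course not the
actual code from r48654):

#include <grass/gis.h>
#include <grass/raster.h>

/* Rough sketch of the per-row open/read/close mode (the -x behaviour).
 * `names` holds the input map names; the accumulation step is elided. */
static void process_rows_reopen(char **names, int num_inputs)
{
    int row, i, fd;
    int nrows = Rast_window_rows();
    DCELL *buf = Rast_allocate_d_buf();

    for (row = 0; row < nrows; row++) {
        for (i = 0; i < num_inputs; i++) {
            /* open, read one row, close again: slow, but at most one
             * input map is open at any time */
            fd = Rast_open_old(names[i], "");
            Rast_get_d_row(fd, buf, row);
            /* ... accumulate buf into the per-row statistic ... */
            Rast_close(fd);
        }
        /* ... compute and write the aggregated output row ... */
    }
    G_free(buf);
}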

>
> Only read the file once, reallocating the array dynamically.
>
> Can't use G_check_input_output_name, as parm.output->multiple=YES.

Ohh, yes indeed. I will add some more tests to cover this r.series behavior.
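
Regarding reading the file only once with dynamic reallocation, I
picture something like the following sketch; read_names() is only an
illustrative helper built on the libgis functions, not the cleaned-up
r.series code:

#include <stdio.h>
#include <grass/gis.h>
#include <grass/glocale.h>

/* Read newline-separated map names in a single pass, growing the
 * array geometrically as needed. */
static char **read_names(const char *path, int *num)
{
    FILE *fp;
    char buf[1024];
    char **names = NULL;
    int n = 0, alloc = 0;

    fp = fopen(path, "r");
    if (!fp)
        G_fatal_error(_("Unable to open input file <%s>"), path);

    while (G_getl2(buf, sizeof(buf), fp)) {
        if (!*buf)
            continue;                 /* skip empty lines */
        if (n >= alloc) {
            alloc = alloc ? 2 * alloc : 16;
            names = G_realloc(names, alloc * sizeof(char *));
        }
        names[n++] = G_store(buf);
    }
    fclose(fp);

    *num = n;
    return names;
}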

>
> Don't use C99-specific features (specifically, variable declarations
> intermingled with statements).
>
> Move variables from function scope to block scope where possible.
>
>> I have tested r.series with ~6000 maps (ECA&D daily temperature data
>> from 1995-2010), each with ~100000 cells. Computation with method
>> average takes about 3 minutes on my (fast) machine.
>> The memory footprint is about 330MB of RAM. But this looks like a
>> memory leak to me, because the memory consumption rises linearly
>> with the processed rows of the output map. All the memory allocation
>> in r.series is done before the row processing ... ???
>
> I suspect that this will be due to re-opening the maps for each row.
> Normally, an overhead on each call to Rast_open_old() would be
> considered a per-map overhead, and we wouldn't worry about a few kB
> per map.
>
> Opening a map is quite an expensive operation, as it has to find which
> mapset contains the map, determine its type (CELL/FCELL/DCELL), read
> its cellhd (and possibly other files, e.g. reclass table), set up the
> column mapping, etc.
>
> For this particular case (and anything else like it), the process
> could be accelerated significantly by keeping the fileinfo structure
> around and just closing and re-opening (and re-positioning) the
> descriptors (one for the raster data, one for the null bitmap).
>
> One significant problem with doing this, however, is that raster maps
> are identified by the file descriptor for their data: the "fd"
> parameter to Rast_get_row() etc, and the index into the R__.fileinfo
> array, is the actual file descriptor.
>
> It wouldn't be a great deal of work to change this, so that the "fd"
> parameter was just the index into the R__.fileinfo array, and the
> fileinfo structure contained the actual fd. However, we would need to
> make sure that we catch every case where "fd" needs to be changed to
> e.g. R__.fileinfo[fd].data_fd.

So, if I understand correctly, we have two options to solve the
memory leak:
1.) correct the memory management when closing maps, or
2.) modify the raster map identification (see the sketch below).
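
If I understand option 2 correctly, the idea looks roughly like the
self-contained illustration below; in the library the table would of
course be R__.fileinfo, and apart from data_fd the field names are
only my guesses:

#include <unistd.h>

/* The handle returned to the module is only an index into a fileinfo
 * table; the real descriptors live inside the struct. */
struct fileinfo {
    int open;        /* slot in use?                              */
    int data_fd;     /* descriptor for the cell/fcell/dcell data  */
    int null_fd;     /* descriptor for the null bitmap            */
    /* ... cellhd, column mapping, reclass information, ... */
};

static struct fileinfo fileinfo_table[1024];

/* Every place that currently uses "fd" as a real descriptor would
 * have to go through the table instead, e.g.: */
static ssize_t read_raster_data(int fd, void *buf, size_t nbytes)
{
    return read(fileinfo_table[fd].data_fd, buf, nbytes);
}

The nice property would be that the library could close and re-open
the descriptors behind the scenes without invalidating the handle
that the module holds.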

Is it worth the effort to correct the memory management when closing
maps, or should we try to change the raster map identification?
Maybe we can provide additional functions which only initialize the
fileinfo structure but do not keep file descriptors open, so that a
call to Rast_open_old() only opens the file descriptors in case the
fileinfo is already set up? An additional Raster_close_fd_only() could
be added to close only the file descriptors. In this case only
r.series must be patched and the API stays consistent for other
modules.
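
A rough sketch of how r.series could then look is below. The
*_fd_only function is purely hypothetical, and the sketch glosses
over how the handle stays valid while its descriptors are closed,
which is exactly the identification problem you describe:

#include <grass/gis.h>
#include <grass/raster.h>

/* hypothetical addition to the raster API -- does not exist today */
void Raster_close_fd_only(int fd);

static void process_rows_keep_fileinfo(char **names, int *fd,
                                       DCELL **buf, int num_inputs)
{
    int row, i;
    int nrows = Rast_window_rows();

    /* setup: read cellhd, column mapping etc. once per map, then drop
     * the descriptors so the open file limit is never reached */
    for (i = 0; i < num_inputs; i++) {
        fd[i] = Rast_open_old(names[i], "");
        Raster_close_fd_only(fd[i]);
    }

    for (row = 0; row < nrows; row++) {
        for (i = 0; i < num_inputs; i++) {
            /* with the fileinfo already set up, this call would only
             * re-open and re-position the two descriptors */
            fd[i] = Rast_open_old(names[i], "");
            Rast_get_d_row(fd[i], buf[i], row);
            Raster_close_fd_only(fd[i]);
        }
        /* ... aggregate buf[0..num_inputs-1] into the output row ... */
    }
}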

However, it may be worthwhile to change the map identification and to
correct the memory management.

Best regards
Soeren
>
> --
> Glynn Clements <glynn at gclements.plus.com>
>

