[GRASS-dev] r.external performance

Glynn Clements glynn at gclements.plus.com
Wed Sep 22 06:42:57 EDT 2010


Alessandro Frigeri wrote:

> I've set up a small experiment to see how maps registered with
> r.external perform compared with native GRASS raster maps.  I've
> prepared a shell script that runs r.slope.aspect once on the DEM
> imported with r.in.gdal and once on the same map registered with
> r.external.  The execution time is measured with the GNU time
> utility and the differences are calculated.  I'm working with
> GeoTIFF maps (GTOPO30), GRASS 6.4.0+42329, GDAL 1.7.2.
> 
> Today I ran the first experiments, and in general it seems that for
> small maps (~1200x1500) native GRASS maps perform better (by about
> 10%), while for bigger maps (18000x43000) I have found better
> performance with externally registered maps.  Markus suggested that
> the speedup for big maps may be explained by GDAL's caching
> mechanism.  These first numbers support the idea that GRASS offers
> very good flexibility, so we can choose case by case the solution
> that best fits our needs (to me, 10% is not critical).  But I'm sure
> there are issues I did not take into account, for example working
> with compressed data, or others you might highlight.
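
For concreteness, I'll assume the comparison looks roughly like the
following (map and file names are mine, not from your script):

    r.in.gdal input=gtopo30.tif output=dem_native
    r.external input=gtopo30.tif output=dem_external
    g.region rast=dem_native

    /usr/bin/time -f "%e s" r.slope.aspect elevation=dem_native \
        slope=slope_nat aspect=aspect_nat
    /usr/bin/time -f "%e s" r.slope.aspect elevation=dem_external \
        slope=slope_ext aspect=aspect_ext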

First, bear in mind that the native GRASS format offers a choice of
compression methods: no compression, RLE or zlib. Higher compression
will require more CPU cycles (particularly when writing maps) but will
reduce the I/O bandwidth if the maps are read from disk (i.e. not
cached). Obviously, compressed maps also require less disk space.

Most modules generate compressed output by default; you can uncompress
an existing map using "r.compress -u ...". FP maps always use zlib
compression; integer maps use zlib if the environment variable
GRASS_INT_ZLIB exists (the value doesn't matter) and RLE otherwise.
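
For example, to switch an existing integer map from RLE to zlib
compression (the map name here is just a placeholder):

    r.compress -u map=dem_native    # uncompress
    GRASS_INT_ZLIB=1                # any value will do; only its existence matters
    export GRASS_INT_ZLIB
    r.compress map=dem_native       # recompress, now with zlib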

Also, the native GRASS format may have an advantage if maps are being
downsampled on read, as the use of an index means that rows can easily
be skipped, whereas a format which compresses the data as a single
stream must always decompress the entire stream.
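
For instance, reading a native map at a coarser resolution should take
roughly proportionally less time, as the intervening rows are skipped
via the index (the resolution value is just an example):

    g.region rast=dem_native            # full resolution
    /usr/bin/time -f "%e s" r.univar map=dem_native

    g.region rast=dem_native res=0:05   # coarser; only every Nth row is read
    /usr/bin/time -f "%e s" r.univar map=dem_native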

> I'd like to share the script and get some hints and ideas from you
> on how to improve it, in order to obtain more meaningful results.
> For example, the GNU time utility reports several times, and I've
> chosen 'real time', but I could be wrong.  Moreover, I noticed that
> the results of individual runs differ a lot, while the mean of
> several runs gives consistent values.  Why is that?

For repeatable results, use a system with sufficient memory (enough to
cache all input and output files, as well as any memory required by
the modules) which is otherwise idle. Ignore the first run (which will
require reading the data in from disk), and "sync" between runs (so
that disk writes from one run aren't competing with writes from a
previous run).
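
E.g. something along these lines (the module invocation is just an
example):

    # run the module several times; discard the first (cold-cache) timing
    for i in 1 2 3 4 5 6; do
        sync    # flush writes left over from the previous run
        /usr/bin/time -f "%e s" \
            r.slope.aspect --overwrite elevation=dem_native \
            slope=slp aspect=asp
    done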

This doesn't take into account the time required for disk I/O, but any
approach which does will inevitably have a high degree of
variability, as it's almost impossible to reconstruct any specific
caching state other than "everything is cached".

-- 
Glynn Clements <glynn at gclements.plus.com>

