[GRASS-dev] what is the meaning of: "Error reading raster data for row 239 of <MASK>"

Glynn Clements glynn at gclements.plus.com
Tue Jul 14 00:46:23 PDT 2015


Moritz Lennert wrote:

> >> I don't know how to debug this...
> >
> > Can you identify a repeatable test case?
> >
> > If I could make it happen, I could debug it.
> 
> You can get a location named TEST here:
> 
> http://tomahawk.ulb.ac.be/moritz/mask_bug_testlocation.tgz
> 
> This contains only a PERMANENT mapset.
> 
> In that mapset, launch the following command:
> 
> r.mask vect=hull; for map in $(g.list rast pat="firm_rate*"); do echo 
> $map ; r.mapcalc "temp_prob = float($map) / sum_rates" --o --q; done; 
> r.mask -r
> 
> I get the error arbitrarily for different firm_rate_* maps, sometimes 
> only for one, sometimes for many, but at each run it's for different 
> maps.

So it's non-deterministic (I'm getting one error for every 10-20
passes over the data, i.e. every 1200-2500 commands), and only applies
to r.mapcalc.

My first guess was a race condition related to pthreads. I tried

	export WORKERS=0

before running the test, and it hasn't happened since.

And actually I'm now fairly certain as to the specific cause.

When compiled with pthread support, r.mapcalc has a mutex for each map
to prevent concurrent access to a single map from multiple threads. 

Concurrent access to different maps (and to core lib/gis and
lib/raster functionality) from different threads is supposed to be
safe (see r34485 and the interval surrounding it), but the MASK was
overlooked.

If a MASK is in use, reading a row from any raster map will read the
corresponding row from the MASK, and there's nothing to prevent
different threads from concurrently accessing two different maps and
thus accessing the MASK.

So, in read_data_{compressed,uncompressed,fp_compressed} in
lib/raster/get_row.c we have code like:

    if (lseek(fcb->data_fd, (off_t) row * bufsize, SEEK_SET) == -1)
	G_fatal_error(_("Error reading raster data for row %d of <%s>"),
		      row, fcb->name);

    if (read(fcb->data_fd, data_buf, bufsize) != bufsize)
	G_fatal_error(_("Error reading raster data for row %d of <%s>"),
		      row, fcb->name);

If multiple threads execute this code concurrently, you can end up
with the calls being interleaved like so:

	Thread 1	Thread 2

	lseek
			lseek
			read
	read

meaning that the file offset has changed between the lseek() and the
read() (this is why X/Open and POSIX added pread(), but that's still
relatively new).

This only results in an error at the end of the file (the first read()
will leave the file offset at EOF, so the second read() fails), but in
other situations it's likely causing the wrong row of the MASK to be
read.
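
To illustrate the pread() point, here is a minimal sketch of a
positional row read. It is not the lib/raster code: read_row_at() is a
hypothetical helper, and it assumes the simplest case of a file made of
fixed-size rows (as in the uncompressed snippet above). Because pread()
takes the offset as an argument, there is no shared file offset for
another thread to move between the seek and the read.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of lib/raster): read row `row` of a
 * file made of fixed-size rows of `bufsize` bytes. pread() is given the
 * offset explicitly, so the descriptor's shared file offset is neither
 * consulted nor modified.
 * Returns 0 on success, -1 on I/O error or if the row is incomplete.
 */
int read_row_at(int fd, void *buf, size_t bufsize, int row)
{
    off_t start = (off_t) row * (off_t) bufsize;
    size_t done = 0;

    while (done < bufsize) {
        ssize_t n = pread(fd, (char *) buf + done, bufsize - done,
                          start + (off_t) done);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;          /* genuine I/O error */
        }
        if (n == 0)
            return -1;          /* hit EOF before the row was complete */
        done += (size_t) n;
    }
    return 0;
}
```

A caller would report a -1 return the same way the existing code does,
e.g. via G_fatal_error(). Two threads calling this on the same
descriptor cannot disturb each other's position.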

A possible quick fix:

	if (R__.auto_mask > 0)
	    putenv("WORKERS=0");

A slightly better fix would be to check for masking and if it's
enabled, have a single mutex which guards *all* raster reads so that
even concurrent access to different maps is blocked. Unlike the above
hack, this still allows computations to be executed in parallel.
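
A rough sketch of the single-mutex idea follows. Everything here is
illustrative: masking_enabled stands in for a check such as
R__.auto_mask > 0, and locked_read_row() stands in for whatever wrapper
the caller puts around the real raster read. The flag is assumed to be
set once, before any worker threads start.

```c
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/* One lock shared by every raster read, regardless of which map. */
pthread_mutex_t raster_read_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical flag: nonzero when a MASK is in use. Assumed to be set
 * before the worker threads are created and not changed afterwards. */
int masking_enabled = 0;

/*
 * Hypothetical wrapper: seek to a fixed-size row and read it. When
 * masking is enabled the seek+read pair runs under the global lock, so
 * no other thread can move a shared file offset in between.
 * Returns 0 on success, -1 on failure.
 */
int locked_read_row(int fd, void *buf, size_t bufsize, int row)
{
    int use_lock = masking_enabled;     /* sample the flag once */
    int ret = 0;

    if (use_lock)
        pthread_mutex_lock(&raster_read_mutex);

    if (lseek(fd, (off_t) row * (off_t) bufsize, SEEK_SET) == (off_t) -1)
        ret = -1;
    else if (read(fd, buf, bufsize) != (ssize_t) bufsize)
        ret = -1;

    if (use_lock)
        pthread_mutex_unlock(&raster_read_mutex);

    return ret;
}
```

Only the read is serialised; whatever each thread computes from the row
afterwards happens outside the lock.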

Better still would be to guard access to the MASK so that the other
aspects of raster input can be parallelised (raster I/O is still a
major bottleneck, and mostly because of processing rather than actual
disc access).

But that would involve either adding pthread code directly into the
base raster input code in lib/raster/get_row.c (undesirable) or at
least adding a mechanism to allow r.mapcalc to hook into it to provide
the mutex.
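
One possible shape for such a hook, purely as a sketch: none of these
names exist in lib/raster. The "library side" knows nothing about
pthreads and just calls optional lock/unlock callbacks around the MASK
read; the "application side" (what r.mapcalc might do) installs
callbacks backed by a mutex. demo_read_mask_row() is a stand-in for the
real MASK row read.

```c
#include <pthread.h>
#include <stddef.h>

/* ---- library side (hypothetical; no pthread dependency) ---- */

typedef void (*mask_lock_fn)(void *ctx);

static mask_lock_fn lock_hook = NULL;
static mask_lock_fn unlock_hook = NULL;
static void *hook_ctx = NULL;

/* Let the application register callbacks run around MASK access. */
void set_mask_lock_hooks(mask_lock_fn lock, mask_lock_fn unlock, void *ctx)
{
    lock_hook = lock;
    unlock_hook = unlock;
    hook_ctx = ctx;
}

/* Run the MASK read between the hooks; without hooks, just run it. */
int with_mask_locked(int (*read_mask_row)(void *), void *arg)
{
    int ret;

    if (lock_hook)
        lock_hook(hook_ctx);
    ret = read_mask_row(arg);
    if (unlock_hook)
        unlock_hook(hook_ctx);
    return ret;
}

/* ---- application side (hypothetical; uses pthreads) ---- */

pthread_mutex_t mask_mutex = PTHREAD_MUTEX_INITIALIZER;

static void app_lock(void *ctx)
{
    pthread_mutex_lock((pthread_mutex_t *) ctx);
}

static void app_unlock(void *ctx)
{
    pthread_mutex_unlock((pthread_mutex_t *) ctx);
}

/* Must be called before any worker threads are started. */
void install_pthread_mask_hooks(void)
{
    set_mask_lock_hooks(app_lock, app_unlock, &mask_mutex);
}

/* Stand-in for the real MASK row read: counts calls via `arg`. */
int demo_read_mask_row(void *arg)
{
    int *calls = arg;

    return ++*calls;
}
```

With this split, single-threaded modules pay nothing (the hooks stay
NULL), and only the MASK read is serialised, so reads of the other maps
can still proceed in parallel.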

-- 
Glynn Clements <glynn at gclements.plus.com>

