[GRASS-dev] [GRASS GIS] #2764: corrupt data written to FCELL and DCELL rasters, hard to re-produce

Tue Jan 9 12:41:58 PST 2018

#2764: corrupt data written to FCELL and DCELL rasters, hard to re-produce
---------------------+-------------------------
  Reporter:  dylan   |      Owner:  grass-dev@…
      Type:  defect  |     Status:  new
  Priority:  normal  |  Milestone:  7.2.3
 Component:  Raster  |    Version:  unspecified
Resolution:          |   Keywords:
       CPU:  x86-64  |   Platform:  Linux
---------------------+-------------------------

Comment (by mmetz):

 Replying to [comment:31 dylan]:
 > Replying to [comment:30 mmetz]:
 > > Replying to [comment:29 dylan]:
 > > >
 > > > [...] Note that I don't have any issues with any other GRASS
 commands, or (as far as I can tell) general usage on this machine. I only
 see these errors when working with GRASS commands that:
 > > >
 > > >   * take a long time to run: `r.sun` or `t.rast.mapcalc` ([http://osgeo-org.1560.x6.nabble.com/Error-reading-raster-data-for-row-xxx-only-when-using-r-series-and-t-rast-series-td5229569.html e.g. a couple of years ago])
 > > >   * operate on moderately large, floating-point maps
 > > >   * are done in parallel, either via GNU `parallel` or as
 implemented in the temporal suite of modules
 > > >
 > > > ...hence the extreme difficulty in recreating the errors or further
 debugging.
 > >
 > > Unfortunately, I was not able to recreate these errors with the
 provided test data and scripts.
 > >
 > > I still think this is some obscure disk IO error. You could try to use
 `nice`, e.g. `nice r.sun ...` and `nice r.mapcalc ...` in `daily-rad.sh`.
 At least this helps when running many GRASS modules in parallel on HPC
 systems where results are written out to one single storage device.
 >
 > Well thank you very much for all of your patience, patches, and testing.
 I'll try the `nice` option. For now, I think that I can tolerate the much
 lower frequency of errors after switching to LZ4 compression. Perhaps the
 faster speed of LZ4 lowers the probability of concurrent write operations.

 I don't think so, because with LZ4 more data need to be written,
 which takes longer.

 >
 > It is still quite puzzling that this kind of error has come up on
 several different machines while tracking GRASS trunk over a 10-year
 period. Maybe this is a subtle hint that it is time to build a new
 workstation...

 Markus Neteler in particular spent a lot of time fixing various systems
 for parallel execution of GRASS commands. GRASS itself was never the
 problem; instead, the multiple outputs being written to a single storage
 device were too much for that device.
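
 As a rough illustration of the mitigation discussed in this thread, the
 sketch below runs a list of commands at lowered priority with `nice` and
 also caps how many run at once, so that fewer processes write to the same
 storage device simultaneously. The concurrency cap is an assumption added
 here, not something tested in this ticket. The helper name `run_throttled`
 and the `MAX_JOBS` variable are hypothetical, and `xargs -P` is a GNU/BSD
 extension rather than POSIX.

```shell
#!/bin/sh
# Hypothetical helper (not part of GRASS): read one complete command per
# line from stdin and run each at lowered CPU priority, with at most
# MAX_JOBS commands running concurrently.
# Note: nice lowers CPU priority only; it does not throttle disk IO itself.
run_throttled() {
    # xargs -P caps the number of simultaneous processes;
    # -I{} substitutes each input line as the command string for sh -c.
    # Input lines are passed to a shell verbatim, so feed it trusted input only.
    xargs -P "${MAX_JOBS:-2}" -I{} nice -n 10 sh -c '{}'
}
```

 It could be used as `run_throttled < jobs.txt`, where `jobs.txt` is a
 hypothetical file holding one command per line (for example, one
 invocation of `daily-rad.sh` per tile). Whether this reduces the read
 errors reported here is untested.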
 >
 > I know this is a lot to ask, but did you try testing using ZLIB
 compression and running it multiple times? It took a couple of tiles
 before I noticed the error.

 I did use ZLIB compression when running the test with the data and scripts
 provided. Do you mean I should run the test several times with the same
 data?
 >
 > There will be an opportunity to test these same scripts in an HPC
 environment over the next two months. I'll be sure and report back any
 findings from those tests.

 An HPC environment basically consists of a job scheduler (e.g. Slurm) and
 any number of execution hosts. You can set up a minimal HPC environment on
 a single standard workstation. More execution hosts mean more (hardware)
 trouble.

--
Ticket URL: <https://trac.osgeo.org/grass/ticket/2764#comment:32>
GRASS GIS <https://grass.osgeo.org>