[GRASS-dev] [GRASS GIS] #2764: corrupt data written to FCELL and DCELL rasters, hard to re-produce
GRASS GIS
trac at osgeo.org
Tue Jan 9 12:41:58 PST 2018
#2764: corrupt data written to FCELL and DCELL rasters, hard to re-produce
---------------------+-------------------------
Reporter: dylan | Owner: grass-dev@…
Type: defect | Status: new
Priority: normal | Milestone: 7.2.3
Component: Raster | Version: unspecified
Resolution: | Keywords:
CPU: x86-64 | Platform: Linux
---------------------+-------------------------
Comment (by mmetz):
Replying to [comment:31 dylan]:
> Replying to [comment:30 mmetz]:
> > Replying to [comment:29 dylan]:
> > >
> > > [...] Note that I don't have any issues with any other GRASS
> > > commands, or (as far as I can tell) general usage on this machine. I only
> > > see these errors when working with GRASS commands that:
> > >
> > > * take a long time to run: `r.sun` or `t.rast.mapcalc`
> > >   ([http://osgeo-org.1560.x6.nabble.com/Error-reading-raster-data-for-row-xxx-only-when-using-r-series-and-t-rast-series-td5229569.html e.g. a couple of years ago])
> > > * operate on moderately large, floating-point maps
> > > * are done in parallel, either via GNU `parallel` or as
> > >   implemented in the temporal suite of modules
> > >
> > > ...hence the extreme difficulty in recreating the errors or further
> > > debugging.
> >
> > Unfortunately, I was not able to recreate these errors with the
> > provided test data and scripts.
> >
> > I still think this is some obscure disk IO error. You could try to use
> > `nice`, e.g. `nice r.sun ...` and `nice r.mapcalc ...` in `daily-rad.sh`.
> > At least this helps when running many GRASS modules in parallel on HPC
> > systems where results are written out to one single storage device.
>
> Well thank you very much for all of your patience, patches, and testing.
> I'll try the `nice` option. For now, I think that I can tolerate the much
> lower frequency of errors after switching to LZ4 compression. Perhaps the
> faster speed of LZ4 lowers the probability of concurrent write operations.
I don't think so, because with LZ4 more data need to be written, which
takes longer.
>
> It is still quite puzzling that this kind of error has come up on
> several different machines while tracking GRASS trunk over a 10-year
> period. Maybe this is a subtle hint that it is time to build a new
> workstation...
Markus Neteler in particular spent a lot of time fixing various systems
for parallel execution of GRASS commands. GRASS itself was never the
problem; instead, the main problem was that the multiple outputs being
written to a single storage device were too much for that device.
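As a rough illustration of that point, the sketch below caps how many jobs
run (and therefore write) at the same time and starts each one under `nice`.
It uses `xargs -P` rather than GNU `parallel` only to stay dependency-free;
`MAX_JOBS` and `run_throttled` are hypothetical names, and the `echo` call
stands in for the real module invocation, whose arguments are not shown here.
Treat it as a starting point under those assumptions, not a verified fix for
this ticket.

```shell
#!/usr/bin/env bash
# Sketch only: limit concurrent jobs and lower their CPU priority.
# MAX_JOBS and run_throttled are hypothetical names, not part of GRASS.

MAX_JOBS=2   # assumption: tune to what the storage device can sustain

# run_throttled CMD [ARGS...]
# Reads one item per line from stdin and runs "nice -n 10 CMD ARGS... ITEM"
# for each item, with at most MAX_JOBS processes running at once.
run_throttled() {
    xargs -n 1 -P "$MAX_JOBS" nice -n 10 "$@"
}

# Placeholder workload: "echo day N" stands in for a real per-day script.
# Output order may vary because the jobs run concurrently.
printf '%s\n' 1 2 3 4 | run_throttled echo day
```

With GNU `parallel`, the `-j` option plays the same role as `xargs -P` here.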
>
> I know this is a lot to ask, but did you try testing using ZLIB
> compression and running it multiple times? It took a couple of tiles
> before I noticed the error.
I did use ZLIB compression when running the test with the data and scripts
provided. Do you mean I should run the test several times with the same
data?
>
> There will be an opportunity to test these same scripts in an HPC
> environment over the next two months. I'll be sure and report back any
> findings from those tests.
An HPC environment basically consists of a job scheduler (e.g. Slurm) and
any number of execution hosts. You can set up a minimal HPC environment on
a single standard workstation. More execution hosts mean more (hardware)
trouble.
--
Ticket URL: <https://trac.osgeo.org/grass/ticket/2764#comment:32>
GRASS GIS <https://grass.osgeo.org>