[gdal-dev] GDAL internal error handling and implications for other language bindings

Sun Jul 17 07:08:36 PDT 2022

Hi all,

I've been thinking about how to surface GDAL errors in a better way for
Python programmers. I'm pretty sure that the approaches I'm looking at
generalize to GDAL's Python bindings and other language bindings. As well,
I'm wondering if we can't improve GDAL's internal error handling in some
core code. I'd love some feedback on my reasoning and design ideas from any
angle. For example, I know that there is some prior art in Thomas Bonfort's
Go modules, and expect there is some in rgdal. Please let me know what you
think.

I'm an author of the Rasterio project for Python. This project has a number
of problems handling GDAL errors. Generally, rasterio only checks the GDAL
error context after a GDAL function returns and so can only see the last
error that was set. Deeper errors may leak out to stderr, but a Python
programmer using rasterio can't do anything about them using Python
language features like try/except. This is a flaw in rasterio and stems
from some naive analysis on my part about how errors are handled internally
in GDAL. I assumed that functions in GDAL core and driver code consistently
handle errors set by the functions they call and then set an error that
describes exactly what a caller can do in the case of failure.

Consider OGRXLSDataSource::Open at
https://github.com/OSGeo/gdal/blob/35c07b18316b4b6d238f6d60b82c31e25662ad27/ogr/ogrsf_frmts/xls/ogrxlsdatasource.cpp#L116-L118.
The code resets the error context, pushes GDAL's silencing handler so that
no other handlers (like GDAL's default which prints to stderr) receive
error events, calls CPLRecode, and then executes more statements if
CPLRecode set an error. This looks to me like GDAL's equivalent of what
might be written in Python as

try:
    CPLRecode(...)
except:
    CPLGenerateTemporaryFilename(...)
    ...

In many ways, GDAL's error system is not unlike Python's at the C level.
Python extension code that fails is supposed to set an error and return a
particular value. When callers get that return value, they are to check for
a set error and should either return with an error-indicating value
(leaving the set error in place), or they can handle the error by clearing
it and continuing, maybe setting a new error if recovery isn't possible.
OGRXLSDataSource::Open does this. A rasterio user doesn't need to see
farther into OGRXLSDataSource::Open than the last error set. GDAL and
Python error reporting and handling are well aligned.
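
To make the parallel concrete, here is a toy Python model of the convention
described above. Nothing in it is GDAL or CPython API: ErrorContext, recode,
and open_datasource are hypothetical names invented for illustration. A
callee sets an error and returns a sentinel value; the caller checks the
context, clears the error, and takes a fallback path.

```python
class ErrorContext:
    """Toy stand-in for a 'last error' slot (hypothetical, not GDAL's)."""

    def __init__(self):
        self.last = None

    def reset(self):
        self.last = None

    def set(self, message):
        self.last = message


def recode(ctx, text, encoding):
    """Callee: on failure, set an error and return a sentinel (None)."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        ctx.set(f"cannot recode {text!r} to {encoding}")
        return None


def open_datasource(ctx, name):
    """Caller: handle the callee's error by clearing it and falling back."""
    ctx.reset()
    encoded = recode(ctx, name, "ascii")
    if ctx.last is not None:
        # The error is handled here, so clear it and use the fallback.
        ctx.reset()
        encoded = name.encode("utf-8")
    return encoded


ctx = ErrorContext()
result = open_datasource(ctx, "donn\u00e9es.xls")
```

Because open_datasource clears the error it handled, its own caller sees a
clean context and never learns that recode failed. That is the behavior I
assumed was universal.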

I see different behavior when rasterio calls GDALDatasetRasterIOEx to read
data from a GeoTIFF. The silencing handler is not used, so error events are
printed to stderr, but callers set new errors on top of the previous ones.
A rasterio user sees the deeper causes of I/O failure in their logs, but
can't react to them in their programs without extra work to parse errors
out of log messages.

Specifically, here's a snippet of errors printed to stderr that was
provided by a rasterio user recently. These result from a call to
GDALDatasetRasterIOEx.

ERROR 1: TIFFFillTile:No space for data buffer at scanline 4294967295
ERROR 1: TIFFReadEncodedTile() failed.
ERROR 1: /home/ubuntu/Documents/CDL_tiffs/2015_30m_cdls.tif, band 1: IReadBlock failed at X offset 189, Y offset 60: TIFFReadEncodedTile() failed.

"IReadBlock failed" is the last error set before GDALDatasetRasterIOEx
returns and is the only one that rasterio can currently surface as a Python
exception. It's specific about the block address at which a problem
occurred, but vague about the nature of the root problem. Was it a codec
error? Was it a memory allocation error? In this case it's a memory
allocation error. The user found that they could retry data reads and get
results the next time, presumably after their program's memory footprint
shrinks sufficiently. What if we could surface enough error detail to a
user that they could determine whether they could retry a read or not?

In
https://github.com/rasterio/rasterio/pull/2526/files#diff-a263c7288922a4c1ffd8318c15dfd3332babeb13edc7023662cb8cd7d69643b5R219
I am testing a hypothesis that the three consecutive, related errors above
might be usefully surfaced to a Python programmer in a chain of exceptions.
I've written a thing that records GDAL error events (intercepting them
before they go to stderr), links them together, and then raises the last
one. A Python programmer can catch RasterioIOError (what is raised in the
"IReadBlock failed" case) and in handling that exception can follow the
chain. At the very least, my experiment will show "CPLE_AppDefinedError:
TIFFFillTile:No space for data buffer at scanline 4294967295" in Python
tracebacks, which could be a big help for rasterio users who are debugging.
Information that would otherwise be only in their logs would now be in the
traceback.
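
The record-link-raise idea can be sketched in plain Python as follows. This
is not the code in the pull request: GDALErrorEvent, record_errors, and
fake_read are hypothetical names, and fake_read merely stands in for a GDAL
call whose error events have been intercepted.

```python
from contextlib import contextmanager


class GDALErrorEvent(Exception):
    """Hypothetical exception wrapping one recorded error event."""

    def __init__(self, err_no, message):
        super().__init__(message)
        self.err_no = err_no


@contextmanager
def record_errors():
    """Collect error events, then raise the last one chained to the rest."""
    events = []
    yield events.append  # handed to whatever emits the error events
    if not events:
        return
    # Link each event to the one recorded just before it.
    for earlier, later in zip(events, events[1:]):
        later.__cause__ = earlier
    raise events[-1]


def fake_read(emit):
    """Stand-in for a failing read that emits three error events."""
    emit(GDALErrorEvent(
        1, "TIFFFillTile:No space for data buffer at scanline 4294967295"))
    emit(GDALErrorEvent(1, "TIFFReadEncodedTile() failed."))
    emit(GDALErrorEvent(
        1, "/home/ubuntu/Documents/CDL_tiffs/2015_30m_cdls.tif, band 1: "
           "IReadBlock failed at X offset 189, Y offset 60: "
           "TIFFReadEncodedTile() failed."))


chain = []
try:
    with record_errors() as emit:
        fake_read(emit)
except GDALErrorEvent as err:
    link = err
    while link is not None:  # walk from the last error back to the root
        chain.append(str(link))
        link = link.__cause__
```

In this sketch the last error is raised and the earlier ones hang off it
through __cause__, so the root cause ends up at the far end of the chain.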

For example, here is the traceback we can get when trying to read a
deliberately corrupted COG:

(venv) seangillies at PF3675VY:~/projects/rasterio$ rio insp tests/data/corrupt.tif
Rasterio 1.4dev Interactive Inspector (Python 3.8.10)
Type "src.meta", "src.read(1)", or "help(src)" for more information.
>>> src.read()
rasterio._err.CPLE_AppDefinedError: TIFFFillTile:Read error at row 512, col 0, tile 3; got 38232 bytes, expected 47086

The above exception was the direct cause of the following exception:

rasterio._err.CPLE_AppDefinedError: TIFFReadEncodedTile() failed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "rasterio/_io.pyx", line 934, in rasterio._io.DatasetReaderBase._read
    io_multi_band(self._hds, 0, xoff, yoff, width, height, out, indexes_arr, resampling=resampling)
  File "rasterio/_io.pyx", line 166, in rasterio._io.io_multi_band
    with stack_errors():
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "rasterio/_err.pyx", line 245, in stack_errors
    raise last
rasterio._err.CPLE_AppDefinedError: /home/seangillies/projects/rasterio/tests/data/corrupt.tif, band 1: IReadBlock failed at X offset 1, Y offset 1: TIFFReadEncodedTile() failed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "rasterio/_io.pyx", line 610, in rasterio._io.DatasetReaderBase.read
    out = self._read(indexes, out, window, dtype, resampling=resampling)
  File "rasterio/_io.pyx", line 937, in rasterio._io.DatasetReaderBase._read
    raise RasterioIOError("Read or write failed. {}".format(cplerr)) from cplerr
rasterio.errors.RasterioIOError: Read or write failed. /home/seangillies/projects/rasterio/tests/data/corrupt.tif, band 1: IReadBlock failed at X offset 1, Y offset 1: TIFFReadEncodedTile() failed.

I think this could make communication in the Rasterio issue tracker much
more productive. More information about the causes of an error is right
there in the traceback instead of being split between traceback and stderr
(or other log stream). It could at least eliminate one round of asking for
more error detail in a bug report. And there's the ability to catch an
exception and go up the chain in code, potentially very powerful when you
need it.
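
As a rough illustration of going up the chain in code, here is a sketch that
walks __cause__ links and uses a message substring to guess whether a read
is worth retrying. The helpers iter_chain and looks_retryable are
hypothetical, the substring heuristic is only an assumption based on the
memory allocation case above, and plain RuntimeError stands in for the
rasterio exception classes.

```python
def iter_chain(exc):
    """Yield exc, then each exception reachable through __cause__."""
    while exc is not None:
        yield exc
        exc = exc.__cause__


def looks_retryable(exc, marker="No space for data buffer"):
    """Guess whether a failure is worth retrying (hypothetical heuristic)."""
    return any(marker in str(link) for link in iter_chain(exc))


# Build a two-link chain resembling the errors shown earlier.
try:
    try:
        raise RuntimeError(
            "TIFFFillTile:No space for data buffer at scanline 4294967295")
    except RuntimeError as root:
        raise RuntimeError("TIFFReadEncodedTile() failed.") from root
except RuntimeError as top:
    retry = looks_retryable(top)
    depth = len(list(iter_chain(top)))
```

Matching on message text is fragile, of course. It only shows that the
information would be reachable from an except block instead of from logs.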

The effectiveness of this error recorder and chainer could depend on how
many different styles of error handling exist in GDAL. I've pointed out two
kinds above. In OGRXLSDataSource::Open, we have error handling that
actively prevents error events from being emitted until the function gives
up on trying to handle errors. In IReadBlock, there doesn't seem to be any
such error handling involving the GDAL error context. I believe we've seen
cases of GDAL functions that set errors while returning a success code.
It's possible that some functions return a failure code without setting
any error. Lots of different cases could make the error recording and
chaining approach fruitless.
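
One way to keep these cases straight is to classify every call by two
facts: whether the return code indicated success, and whether any error
events were recorded. The sketch below is only a way of enumerating the
combinations; Outcome and classify are hypothetical names, not part of
rasterio or GDAL.

```python
from enum import Enum


class Outcome(Enum):
    CLEAN_SUCCESS = "succeeded, no error recorded"
    REPORTED_FAILURE = "failed, error recorded"
    NOISY_SUCCESS = "succeeded, but an error was recorded"
    SILENT_FAILURE = "failed, but no error was recorded"


def classify(succeeded, recorded_errors):
    """Map a return status and the recorded errors to one of four cases."""
    has_errors = bool(recorded_errors)
    if succeeded:
        return Outcome.NOISY_SUCCESS if has_errors else Outcome.CLEAN_SUCCESS
    return Outcome.REPORTED_FAILURE if has_errors else Outcome.SILENT_FAILURE


# Enumerate all four combinations.
cases = {
    (ok, errs): classify(ok, errs)
    for ok in (True, False)
    for errs in ((), ("IReadBlock failed",))
}
```

A recorder would probably need an explicit policy for the two inconsistent
cases, for example raising a generic exception on a silent failure and
downgrading a noisy success to a warning. That policy is a guess on my part,
not something I've settled.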

Are there other styles or paradigms in use? Are there GDAL modules that
will challenge the assumptions that I'm making as I write my error
recorder? If you know of any, I'd love to hear about them.

Here are a few links for reference:

* Exception handling in Python's C API:
https://docs.python.org/3/c-api/exceptions.html#exception-handling (I feel
like GDAL could use some documentation like this).
* Python exception chaining:
https://docs.python.org/3/tutorial/errors.html#exception-chaining
* On the difference between an exception raised while handling and "raise
from":
https://blog.ram.rachum.com/post/621791438475296768/improving-python-exception-chaining-with

-- 
Sean Gillies