<div dir="ltr"><div>Hi all,</div><div><br></div><div>I've been thinking about how to surface GDAL errors in a better way for Python programmers. I'm pretty sure that the approaches I'm looking at generalize to GDAL's Python bindings and other language bindings. As well, I'm wondering if we can't improve GDAL's internal error handling in some core code. I'd love some feedback on my reasoning and design ideas from any angle. For example, I know that there is some prior art in Thomas Bonfort's Go modules, and expect there is some in rgdal. Please let me know what you think.</div><div><br></div><div>I'm an author of the Rasterio project for Python. This project has a number of problems handling GDAL errors. Generally, rasterio only checks the GDAL error context after a GDAL function returns and so can only see the last error that was set. Deeper errors may leak out to stderr, but a Python programmer using rasterio can't do anything about them using Python language features like try/except. This is a flaw in rasterio and stems from some naive analysis on my part about how errors are handled internally in GDAL. I assumed that functions in GDAL core and driver code consistently handle errors set by the functions they call and then set an error that describes exactly what a caller can do in the case of failure.<br></div><div><br></div><div>Consider <span class="gmail-pl-en">OGRXLSDataSource::Open at </span><a href="https://github.com/OSGeo/gdal/blob/35c07b18316b4b6d238f6d60b82c31e25662ad27/ogr/ogrsf_frmts/xls/ogrxlsdatasource.cpp#L116-L118">https://github.com/OSGeo/gdal/blob/35c07b18316b4b6d238f6d60b82c31e25662ad27/ogr/ogrsf_frmts/xls/ogrxlsdatasource.cpp#L116-L118</a>. The code resets the error context, pushes GDAL's silencing
handler so that no other handlers (like GDAL's default which prints to stderr) receive error events, calls CPLRecode, and then executes more statements if CPLRecode set an error. This looks to me
like GDAL's equivalent of what might be written in Python as<br><div class="gmail-snippet-clipboard-content gmail-notranslate gmail-position-relative gmail-overflow-auto"><pre class="gmail-notranslate"><code>try:
    CPLRecode(...)
except:
    CPLGenerateTemporaryFilename(...)
    ...
</code></pre></div></div><div>In many ways, GDAL's error system is not unlike Python's at the C level. Python extension code that fails is supposed to set an error and return a particular value. When callers get that return value, they are expected to check for a set error and then either return with an error-indicating value (leaving the set error in place) or handle the error by clearing it and continuing, maybe setting a new error if recovery isn't possible. <span class="gmail-pl-en">OGRXLSDataSource::Open does this. A rasterio user doesn't need to see farther into <span class="gmail-pl-en">OGRXLSDataSource::Open than the last error set. In this case, GDAL and Python error reporting and handling are well aligned.<br></span></span></div><div><span class="gmail-pl-en"><br></span></div><div><span class="gmail-pl-en">I see different behavior when rasterio calls GDALDatasetRasterIOEx to read data from a GeoTIFF. The silencing handler is not used, so error events are printed to stderr, but callers set new errors on top of the previous ones. A rasterio user sees the deeper causes of I/O failure in their logs, but can't react to them in their programs without extra work to parse errors out of log messages.</span></div><div><span class="gmail-pl-en"><br></span></div><div>Specifically, here's a snippet of errors printed to stderr that was provided by a rasterio user recently. These result from a call to <span class="gmail-pl-en">GDALDatasetRasterIOEx.</span></div><div><pre class="gmail-notranslate"><code>ERROR 1: TIFFFillTile:No space for data buffer at scanline 4294967295
ERROR 1: TIFFReadEncodedTile() failed.
ERROR 1: /home/ubuntu/Documents/CDL_tiffs/2015_30m_cdls.tif, band 1: IReadBlock failed at X offset 189, Y offset 60: TIFFReadEncodedTile() failed.<br><br></code></pre>"IReadBlock failed" is the last error set before <span class="gmail-pl-en">GDALDatasetRasterIOEx returns and is the only one that rasterio can currently surface as a Python exception. It's specific about the block address at which a problem occurred, but vague about the nature of the root problem. Was it a codec error? Was it a memory allocation error? In this case it's a memory allocation error. The user found that they could retry data reads and get results the next time, presumably after their program's memory footprint had shrunk sufficiently. What if we could surface enough error detail for a user to determine whether a read can be retried?</span></div><div><span class="gmail-pl-en"><br></span></div><div><span class="gmail-pl-en">In <a href="https://github.com/rasterio/rasterio/pull/2526/files#diff-a263c7288922a4c1ffd8318c15dfd3332babeb13edc7023662cb8cd7d69643b5R219">https://github.com/rasterio/rasterio/pull/2526/files#diff-a263c7288922a4c1ffd8318c15dfd3332babeb13edc7023662cb8cd7d69643b5R219</a> I am testing a hypothesis that the three consecutive, related errors above might be usefully surfaced to a Python programmer in a chain of exceptions. I've written a thing that records GDAL error events (intercepting them before they go to stderr), links them together, and then raises the last one. A Python programmer can catch RasterioIOError (what is raised in the "IReadBlock failed" case) and in handling that exception can follow the chain. At the very least, my experiment will show "CPLE_AppDefinedError: </span>TIFFFillTile:No space for data buffer at scanline 4294967295" in Python tracebacks, which could be a big help for rasterio users who are debugging. 
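<div>The record, link, and raise idea described above can be sketched in pure Python. This is only an illustration under stated assumptions, not the code in the pull request: <code>RecordedError</code>, <code>chain_errors</code>, <code>iter_causes</code>, and <code>read_block</code> are hypothetical names, the messages are copied from the stderr snippet above, and matching on message text to decide whether a read is retryable is a placeholder for whatever policy a real program would use.</div>

```python
class RecordedError(Exception):
    """Hypothetical stand-in for an exception built from one GDAL error event."""


def chain_errors(messages):
    """Link recorded messages (oldest first) and return the newest exception.

    Each exception's __cause__ points at the one recorded just before it.
    """
    last = None
    for msg in messages:
        err = RecordedError(msg)
        err.__cause__ = last
        last = err
    return last


def iter_causes(exc):
    """Yield exc, then its cause, then that cause's cause, and so on."""
    while exc is not None:
        yield exc
        exc = exc.__cause__


# Messages in the order they were emitted, taken from the stderr snippet.
recorded = [
    "TIFFFillTile:No space for data buffer at scanline 4294967295",
    "TIFFReadEncodedTile() failed.",
    "/home/ubuntu/Documents/CDL_tiffs/2015_30m_cdls.tif, band 1: "
    "IReadBlock failed at X offset 189, Y offset 60: "
    "TIFFReadEncodedTile() failed.",
]


def read_block():
    """Hypothetical read that fails and raises the last recorded error."""
    raise chain_errors(recorded)


try:
    read_block()
except RecordedError as exc:
    chain = list(iter_causes(exc))
    root = chain[-1]
    # Illustrative policy only: treat a buffer allocation failure as retryable.
    retryable = "No space for data buffer" in str(root)
```

<div>With a chain like this, the caller sees the "IReadBlock failed" message first and can walk back to the TIFFFillTile message to decide what to do next.</div>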
Information that would otherwise be only in their logs would now be in the traceback.</div><div><br></div><div>For example, here is the traceback we can get when trying to read a deliberately corrupted COG:</div><div><pre><span class="gmail-pl-e">(venv) seangillies@PF3675VY:~/projects/rasterio</span>$ <span class="gmail-pl-s1">rio insp tests/data/corrupt.tif </span>
<span class="gmail-pl-c1">Rasterio 1.4dev Interactive Inspector (Python 3.8.10)</span>
<span class="gmail-pl-c1">Type "src.meta", "src.read(1)", or "help(src)" for more information.</span>
<span class="gmail-pl-c1">>>> src.read()</span>
<span class="gmail-pl-c1">rasterio._err.CPLE_AppDefinedError: TIFFFillTile:Read error at row 512, col 0, tile 3; got 38232 bytes, expected 47086</span>
<span class="gmail-pl-c1">The above exception was the direct cause of the following exception:</span>
<span class="gmail-pl-c1">rasterio._err.CPLE_AppDefinedError: TIFFReadEncodedTile() failed.</span>
<span class="gmail-pl-c1">The above exception was the direct cause of the following exception:</span>
<span class="gmail-pl-c1">Traceback (most recent call last):</span>
<span class="gmail-pl-c1"> File "rasterio/_io.pyx", line 934, in rasterio._io.DatasetReaderBase._read</span>
<span class="gmail-pl-c1"> io_multi_band(self._hds, 0, xoff, yoff, width, height, out, indexes_arr, resampling=resampling)</span>
<span class="gmail-pl-c1"> File "rasterio/_io.pyx", line 166, in rasterio._io.io_multi_band</span>
<span class="gmail-pl-c1"> with stack_errors():</span>
<span class="gmail-pl-c1"> File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__</span>
<span class="gmail-pl-c1"> next(self.gen)</span>
<span class="gmail-pl-c1"> File "rasterio/_err.pyx", line 245, in stack_errors</span>
<span class="gmail-pl-c1"> raise last</span>
<span class="gmail-pl-c1">rasterio._err.CPLE_AppDefinedError: /home/seangillies/projects/rasterio/tests/data/corrupt.tif, band 1: IReadBlock failed at X offset 1, Y offset 1: TIFFReadEncodedTile() failed.</span>
<span class="gmail-pl-c1">The above exception was the direct cause of the following exception:</span>
<span class="gmail-pl-c1">Traceback (most recent call last):</span>
<span class="gmail-pl-c1"> File "<console>", line 1, in <module></span>
<span class="gmail-pl-c1"> File "rasterio/_io.pyx", line 610, in rasterio._io.DatasetReaderBase.read</span>
<span class="gmail-pl-c1"> out = self._read(indexes, out, window, dtype, resampling=resampling)</span>
<span class="gmail-pl-c1"> File "rasterio/_io.pyx", line 937, in rasterio._io.DatasetReaderBase._read</span>
<span class="gmail-pl-c1"> raise RasterioIOError("Read or write failed. {}".format(cplerr)) from cplerr</span>
<span class="gmail-pl-c1">rasterio.errors.RasterioIOError: Read or write failed. /home/seangillies/projects/rasterio/tests/data/corrupt.tif, band 1: IReadBlock failed at X offset 1, Y offset 1: TIFFReadEncodedTile() failed.</span><br></pre></div>I think this could make communication in the Rasterio issue tracker much more productive. More information about the causes of an error is right there in the traceback instead of being split between traceback and stderr (or other log stream). It could at least eliminate one round of asking for more error detail in a bug report. And there's the ability to catch an exception and go up the chain in code, potentially very powerful when you need it.<br><div><span class="gmail-pl-en"><br></span></div><div><span class="gmail-pl-en">The effectiveness of this error recorder and chainer could depend on how many different styles of error handling exist in GDAL. I've pointed out two kinds above. In <span class="gmail-pl-en">OGRXLSDataSource::Open, we have error handling that actively prevents error events from being emitted until the function gives up on trying to handle errors. In IReadBlock, there doesn't seem to be any such error handling involving the GDAL error context. I believe we've seen cases of GDAL functions that set errors while returning a success error code. It's possible that some functions return a failed error code while not setting any error. Lots of different cases could make the error recording and chaining approach fruitless.<br></span></span></div><div><span class="gmail-pl-en"><span class="gmail-pl-en"><br></span></span></div><div><span class="gmail-pl-en"><span class="gmail-pl-en">Are there other styles or paradigms in use? Are there GDAL modules that will challenge the assumptions that I'm making as I write my error recorder? 
If you know of any, I'd love to hear about them.<br></span></span></div><div><span class="gmail-pl-en"><span class="gmail-pl-en"><br></span></span></div><div><span class="gmail-pl-en"><span class="gmail-pl-en">Here are a few links for reference:</span></span></div><div><span class="gmail-pl-en"><span class="gmail-pl-en"><br></span></span></div><div><span class="gmail-pl-en"><span class="gmail-pl-en">* Exception handling in Python's C API: <a href="https://docs.python.org/3/c-api/exceptions.html#exception-handling">https://docs.python.org/3/c-api/exceptions.html#exception-handling</a> (I feel like GDAL could use some documentation like this).<br></span></span></div><div><span class="gmail-pl-en"><span class="gmail-pl-en">* Python exception chaining: <a href="https://docs.python.org/3/tutorial/errors.html#exception-chaining">https://docs.python.org/3/tutorial/errors.html#exception-chaining</a></span></span></div><div><span class="gmail-pl-en"><span class="gmail-pl-en">* On the difference between an exception raised while handling and "raise from": <a href="https://blog.ram.rachum.com/post/621791438475296768/improving-python-exception-chaining-with">https://blog.ram.rachum.com/post/621791438475296768/improving-python-exception-chaining-with</a></span></span></div><div><br></div>-- <br><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">Sean Gillies</div></div>