[gdal-dev] GDAL internal error handling and implications for other language bindings

Fri Jul 22 14:02:58 PDT 2022

Even,

On Sun, Jul 17, 2022 at 12:41 PM Even Rouault <even.rouault at spatialys.com>
wrote:

> Sean,
>
> that's a good and extensive analysis of error handling/reporting in GDAL.
>
Thank you! It was fun to compare GDAL and Python.

> Consider OGRXLSDataSource::Open at
> https://github.com/OSGeo/gdal/blob/35c07b18316b4b6d238f6d60b82c31e25662ad27/ogr/ogrsf_frmts/xls/ogrxlsdatasource.cpp#L116-L118.
> The code resets the error context, pushes GDAL's silencing handler so that
> no other handlers (like GDAL's default which prints to stderr) receive
> error events, calls CPLRecode, and then executes more statements if
> CPLRecode set an error. This looks to me like GDAL's equivalent of what
> might be written in Python as
>
> try:
>     CPLRecode(...)
> except:
>     CPLGenerateTemporaryFilename(...)
>     ...
>
> Yes, such error is fully handled internally by GDAL and hidden from the
> user. Purely implementation detail
>
>
> I see different behavior when rasterio calls GDALDatasetRasterIOEx to read
> data from a GeoTIFF. The silencing handler is not used, so error events are
> printed to stderr, but callers set new errors on top of the previous ones.
> A rasterio users sees the deeper causes of I/O failure in their logs, but
> can't react to them in their programs without extra work to parse errors
> out of log messages.
>
> Specifically, here's a snippet of errors printed to stderr that was
> provided by a rasterio user recently. These result from a call to
> GDALDatasetRasterIOEx.
>
> ERROR 1: TIFFFillTile:No space for data buffer at scanline 4294967295
> ERROR 1: TIFFReadEncodedTile() failed.
> ERROR 1: /home/ubuntu/Documents/CDL_tiffs/2015_30m_cdls.tif, band 1: IReadBlock failed at X offset 189, Y offset 60: TIFFReadEncodedTile() failed.
>
>
> "IReadBlock failed" is the last error set before GDALDatasetRasterIOEx
> returns and is the only one that rasterio can currently surface as a Python
> exception. It's specific about the block address at which a problem
> occurred, but vague about the nature of the root problem. Was it a codec
> error? Was it a memory allocation error? In this case it's a memory
> allocation error. The user found that they could retry data reads and get
> results the next time, presumably after their program's memory footprint
> shrinks sufficiently. What if we could surface enough error detail to a
> user that they could determine whether they could retry a read or not?
>
> As mentioned in https://github.com/OSGeo/gdal/issues/6071, even for a
> memory allocation error, it is not obvious to know if it is makes sense to
> retry because some memory allocation errors are related to corrupted data.
>

Thank you for making this point. It's complicated.

>
> I think this could make communication in the Rasterio issue tracker much
> more productive. More information about the causes of an error is right
> there in the traceback instead of being split between traceback and stderr
> (or other log stream). It could at least eliminate one round of asking for
> more error detail in a bug report. And there's the ability to catch an
> exception and go up the chain in code, potentially very powerful when you
> need it.
>
> It looks to me that if GDAL's error handling could potentially be enhanced
> with an error chaining concept. New API would have to be devised (for error
> emitters and consumers), and relevant places modified to use it. Would that
> be useful?
>

I'm not sure whether it would be useful or not. It wouldn't help
gdal_translate users (for example), since all the errors go to stderr
already and there's no way to catch exceptions. The benefit would be for
scripting language bindings, and in those cases it's not too hard to hack
something together that makes up for the lack of native error chaining. I
think my recommendation would be to leave the GDAL error system as it is
while incrementally fixing small handling bugs and inconsistencies and
being mindful that language bindings might like to push an error handler
that records and links errors.

>> I believe we've seen cases of GDAL functions that set errors while
returning a success error code.

> There are 2 sub cases:
>
> a) the function emits a CPLError(CE_Failure, ...) , clears the error state
> and returns something successful
>
> b) the function emits a CPLError(CE_Failure, ...), doesn't clear the error
> state and returns something successful
>
> I'd say b) is probably mostly unwanted, and should be corrected to, at the
> very least, clear the error state (*).  But people could also say that as
> soon as a CE_Failure is emitted and reaches the error handling method
> (either the default one or a user-set), that it is it is not caught by GDAL
> internals, the user could assume that the global function would fail.
> Making such an assumption currently given the whole code base would be
> overly cautious.  Worth mentioning: the CPLTurnFailureIntoWarning()
> function (introduced in
> https://github.com/OSGeo/gdal/commit/f151e9f652b9d036cacdebf67ca88f59d5680cb3
> and from what I can see , it is pretty much only used at the 2 initial only
> places of gdal_misc.cpp of this commit), that as its name suggests can
> cause any CE_Failure to be turned into a CE_Warning. It could be handy to
> implement a "any CE_Failure that goes to the default or user-set error
> handling method means that the function failed" policy, but I'm not sure
> how we could implement this consistently in the code base. That would
> require analyzing all potential code paths where this could happen.
> Impractical (this also applies to the (*) at the beginning of this
> paragraph)
>
> (I guess we all agree that a successful function may emit one or several
> CPLError(CE_Warning, ...) during its execution)
>
I would prefer that no errors were emitted during a successful call, but
the next best thing is to have a better understanding of how the code
behaves now, and I feel like I have that. As long as no functions return a
successful error code while failing we're in pretty good shape!

> It's possible that some functions return a failed error code while not
> setting any error.
>
> Yes, sometimes error messages aren't really helpful when it is about
> corrupted data. e.g if you parse WKB geometries there are tons of potential
> error situations, and emitting a specific error for each one  would require
> a lot of coding effort, would bloat the code base & the size of the GDAL
> binary.
>
That makes sense.

Given all this, I think I'll modify my error recording approach to also use
the return values of functions. The combination of errors and return values
will be enough information to determine whether rasterio should raise a
Python exception.

Thanks!

-- 
Sean Gillies
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/gdal-dev/attachments/20220722/a82933d1/attachment.htm>