<div dir="ltr"><div>Even,<br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Jul 17, 2022 at 12:41 PM Even Rouault <<a href="mailto:even.rouault@spatialys.com">even.rouault@spatialys.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

  <div>

    <p>Sean,</p>

    <p>that's a good and extensive analysis of error handling/reporting

      in GDAL.<br></p></div></blockquote><div>Thank you! It was fun to compare GDAL and Python.<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><p>

    </p>

    <blockquote type="cite">

      <div dir="ltr">

        <div>Consider <span>OGRXLSDataSource::Open

            at </span><a href="https://github.com/OSGeo/gdal/blob/35c07b18316b4b6d238f6d60b82c31e25662ad27/ogr/ogrsf_frmts/xls/ogrxlsdatasource.cpp#L116-L118" target="_blank">https://github.com/OSGeo/gdal/blob/35c07b18316b4b6d238f6d60b82c31e25662ad27/ogr/ogrsf_frmts/xls/ogrxlsdatasource.cpp#L116-L118</a>.

          The code resets the error context, pushes GDAL's silencing

          handler so that no other handlers (like GDAL's default which

          prints to stderr) receive error events, calls CPLRecode, and

          then executes more statements if CPLRecode set an error. This

          looks to me like GDAL's equivalent of what might be written in

          Python as<br>

          <div>

            <pre><code>try:

    CPLRecode(...)

except:

    CPLGenerateTemporaryFilename(...)

    ...</code></pre>

          </div>

        </div>

      </div>

    </blockquote>

    <code>Yes, such error is fully handled internally by GDAL and hidden

      from the user. Purely implementation detail</code><br>

    <blockquote type="cite">

      <div dir="ltr"><br>

        <div><span>I see different behavior when

            rasterio calls GDALDatasetRasterIOEx to read data from a

            GeoTIFF. The silencing handler is not used, so error events

            are printed to stderr, but callers set new errors on top of

            the previous ones. A rasterio users sees the deeper causes

            of I/O failure in their logs, but can't react to them in

            their programs without extra work to parse errors out of log

            messages.</span></div>

        <div><span><br>

          </span></div>

        <div>Specifically, here's a snippet of errors printed to stderr

          that was provided by a rasterio user recently. These result

          from a call to <span>GDALDatasetRasterIOEx.</span></div>

        <div>

          <pre><code>ERROR 1: TIFFFillTile:No space for data buffer at scanline 4294967295

ERROR 1: TIFFReadEncodedTile() failed.

ERROR 1: /home/ubuntu/Documents/CDL_tiffs/2015_30m_cdls.tif, band 1: IReadBlock failed at X offset 189, Y offset 60: TIFFReadEncodedTile() failed.

</code></pre>

          "IReadBlock failed" is the last error set before <span>GDALDatasetRasterIOEx returns and is the

            only one that rasterio can currently surface as a Python

            exception. It's specific about the block address at which a

            problem occurred, but vague about the nature of the root

            problem. Was it a codec error? Was it a memory allocation

            error? In this case it's a memory allocation error. The user

            found that they could retry data reads and get results the

            next time, presumably after their program's memory footprint

            shrinks sufficiently. What if we could surface enough error

            detail to a user that they could determine whether they

            could retry a read or not?</span></div>

      </div>

    </blockquote>

    As mentioned in <a href="https://github.com/OSGeo/gdal/issues/6071" target="_blank">https://github.com/OSGeo/gdal/issues/6071</a>, even for

    a memory allocation error, it is not obvious to know if it is makes

    sense to retry because some memory allocation errors are related to

    corrupted data.<br></div></blockquote><div><br></div><div>Thank you for making this point. It's complicated. <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>

    <blockquote type="cite">

      <div dir="ltr">

        <div><span><br>

          </span></div>

        I think this could make communication in the Rasterio issue

        tracker much more productive. More information about the causes

        of an error is right there in the traceback instead of being

        split between traceback and stderr (or other log stream). It

        could at least eliminate one round of asking for more error

        detail in a bug report. And there's the ability to catch an

        exception and go up the chain in code, potentially very powerful

        when you need it.<br>

      </div>

    </blockquote>

    It looks to me that if GDAL's error handling could potentially be

    enhanced with an error chaining concept. New API would have to be

    devised (for error emitters and consumers), and relevant places

    modified to use it. Would that be useful?<br></div></blockquote><div dir="ltr"><br></div><div dir="ltr">I'm not sure whether it would be useful or not. It wouldn't help gdal_translate users (for example), since all the errors go to stderr already and there's no way to catch exceptions. The benefit would be for scripting language bindings, and in those cases it's not too hard to hack something together that makes up for the lack of native error chaining. I think my recommendation would be to leave the GDAL error system as it is while incrementally fixing small handling bugs and inconsistencies and being mindful that language bindings might like to push an error handler that records and links errors.<br></div><div dir="ltr"><br><div><span><span>>> I

              believe we've seen cases of GDAL functions that set errors

              while returning a success error code. </span></span></div>

      </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><blockquote type="cite">

    </blockquote>

    <p>There are 2 sub cases:</p>

    <p>a) the function emits a CPLError(CE_Failure, ...) , clears the

      error state and returns something successful</p>

    <p>b) the function emits a CPLError(CE_Failure, ...), doesn't clear

      the error state and returns something successful</p>

    <p>I'd say b) is probably mostly unwanted, and should be corrected

      to, at the very least, clear the error state (*).  But people

      could also say that as soon as a CE_Failure is emitted and reaches

      the error handling method (either the default one or a user-set),

      that it is it is not caught by GDAL internals, the user could

      assume that the global function would fail. Making such an

      assumption currently given the whole code base would be overly

      cautious.  Worth mentioning: the CPLTurnFailureIntoWarning()

      function (introduced in

<a href="https://github.com/OSGeo/gdal/commit/f151e9f652b9d036cacdebf67ca88f59d5680cb3" target="_blank">https://github.com/OSGeo/gdal/commit/f151e9f652b9d036cacdebf67ca88f59d5680cb3</a>

      and from what I can see , it is pretty much only used at the 2

      initial only places of gdal_misc.cpp of this commit), that as its

      name suggests can cause any CE_Failure to be turned into a

      CE_Warning. It could be handy to implement a "any CE_Failure that

      goes to the default or user-set error handling method means that

      the function failed" policy, but I'm not sure how we could

      implement this consistently in the code base. That would require

      analyzing all potential code paths where this could happen.

      Impractical (this also applies to the (*) at the beginning of this

      paragraph)<br>

    </p>

    <p>(I guess we all agree that a successful function may emit one or

      several CPLError(CE_Warning, ...) during its execution)<br></p></div></blockquote><div>I would prefer that no errors were emitted during a successful call, but the next best thing is to have a better understanding of how the code behaves now, and I feel like I have that. As long as no functions return a successful error code while failing we're in pretty good shape!<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><p>

    </p>

    <blockquote type="cite">

      <div dir="ltr">

        <div><span><span>It's

              possible that some functions return a failed error code

              while not setting any error.</span></span></div>

      </div>

    </blockquote>

    <p>Yes, sometimes error messages aren't really helpful when it is

      about corrupted data. e.g if you parse WKB geometries there are

      tons of potential error situations, and emitting a specific error

      for each one  would require a lot of coding effort, would bloat

      the code base & the size of the GDAL binary.</p></div></blockquote><div>That makes sense.</div><div><br></div><div>Given all this, I think I'll modify my error recording approach to also use the return values of functions. The combination of errors and return values will be enough information to determine whether rasterio should raise a Python exception.</div><div><br></div><div>Thanks!</div><br clear="all"></div>-- <br><div dir="ltr" class="gmail_signature">Sean Gillies</div></div>