<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Adam,</p>
<p>Automated performance regression testing is probably one of the
aspects of testing that could be enhanced. While the GDAL autotest
suite is functionally quite comprehensive, performance testing has
traditionally lagged behind. That said, this is an aspect we have
improved lately with the addition of a benchmark component to the
autotest suite:
<a class="moz-txt-link-freetext" href="https://github.com/OSGeo/gdal/tree/master/autotest/benchmark">https://github.com/OSGeo/gdal/tree/master/autotest/benchmark</a> .
This is admittedly quite minimalistic for now, but it tests some
scenarios involving the GTiff driver and gdalwarp.<br>
</p>
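<p>To give a flavour of what such a benchmark test measures, here is a
stand-alone sketch using only the Python standard library. It is not
GDAL's actual test code: the real tests exercise GDAL itself through
pytest, and the workload below is purely a placeholder.</p>

```python
# Minimal stand-alone sketch of a micro-benchmark: time a small
# workload repeatedly and keep the best-of-N wall time. The real
# tests in autotest/benchmark exercise GDAL (GTiff, gdalwarp);
# this workload is an illustrative stand-in.
import timeit

def workload():
    # Stand-in for e.g. reading a raster block through a GDAL driver.
    return sum(i * i for i in range(10_000))

# Taking the best of 5 samples of 20 runs each smooths out a little
# of the scheduler noise that makes single timings unreliable.
best = min(timeit.repeat(workload, number=20, repeat=5))
print(f"best of 5 samples: {best:.6f} s for 20 iterations")
```

<p>Even with repetition and best-of-N, wall-clock timings on shared
machines remain noisy, which is why tolerances matter (see below).</p>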
<p>To test non-regression for a pull request, we have a CI benchmark
configuration
(<a class="moz-txt-link-freetext" href="https://github.com/OSGeo/gdal/blob/master/.github/workflows/linux_build.yml#L111">https://github.com/OSGeo/gdal/blob/master/.github/workflows/linux_build.yml#L111</a>
+
<a class="moz-txt-link-freetext" href="https://github.com/OSGeo/gdal/tree/master/.github/workflows/benchmarks">https://github.com/OSGeo/gdal/tree/master/.github/workflows/benchmarks</a>)
that runs the benchmarks first against master, and then with the
pull request (during the same run, on the same worker). But we need
to allow a fairly large tolerance threshold (up to +20%) to account
for the fact that accurate timing measurements are extremely hard
to obtain on CI infrastructure (even locally, this is very
challenging for microbenchmarks). So this will mostly catch big
regressions, not subtle ones.</p>
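<p>Conceptually, the comparison boils down to a relative-tolerance
check like the sketch below. The function and variable names are
illustrative, not GDAL's actual CI scripts (those live in the linked
workflows directory).</p>

```python
# Sketch of a master-vs-PR benchmark comparison with a relative
# tolerance, as done conceptually by the CI benchmark job.
# All names here are illustrative, not GDAL's actual scripts.

def has_regressed(master_seconds: float, pr_seconds: float,
                  tolerance: float = 0.20) -> bool:
    """Return True if the pull-request timing exceeds the master
    timing by more than the given relative tolerance (20% default)."""
    return pr_seconds > master_seconds * (1.0 + tolerance)

# A 15% slowdown stays under the 20% threshold...
print(has_regressed(1.00, 1.15))  # False
# ...while a 25% slowdown is flagged as a regression.
print(has_regressed(1.00, 1.25))  # True
```

<p>The wide tolerance is the price of noisy CI timings: anything
tighter would flag spurious regressions on every other run.</p>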
<p>One of the difficulties with benchmark testing is that we don't
want the tests to run for hours, especially for pull requests, so
they need to be written carefully to still trigger the relevant
code paths &amp; mechanisms of the code base that are exercised by
real-world large datasets, while running in at most a few seconds
each. Typically those tests also autogenerate their test data, to
avoid the test suite depending on overly large datasets and to
keep the repository size as small as possible.</p>
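<p>A minimal sketch of the data-autogeneration idea, using only the
standard library. In the actual suite this kind of data would be
written out through the GDAL API (e.g. via the GTiff driver); here we
only build a deterministic raw pixel buffer to show the principle of
generating data at test time rather than committing large binaries.</p>

```python
# Illustrative sketch: autogenerate deterministic synthetic raster
# data at test time instead of shipping a large binary file in the
# repository. In GDAL's benchmark suite such data would be written
# through the GDAL API; this is a stdlib-only stand-in.
import zlib

def make_synthetic_band(width: int, height: int, seed: int = 0) -> bytes:
    """Deterministic pseudo-data: one byte per pixel, derived from
    the pixel coordinates so every run produces the same raster."""
    return bytes(((x * 7 + y * 13 + seed) & 0xFF)
                 for y in range(height) for x in range(width))

band = make_synthetic_band(256, 256)
print(len(band))                 # 65536 pixels, one byte each
print(len(zlib.compress(band)))  # highly compressible synthetic data
```

<p>Determinism matters here: the same data is regenerated identically
on master and on the pull request, so timings compare like with like.</p>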
<p>As you mention GPUs, we have had private contacts from a couple
of GPU makers in recent years about a potential GPU'ification of
GDAL, but this has led nowhere so far. Some mentioned that moving
data acquisition to the GPU could be interesting performance-wise,
but that seems to be a huge undertaking, basically porting the
GTiff driver, libtiff and its codecs to GPU code. And even if that
were done, how would the resulting code duplication be managed...
We aren't even able to properly keep the OpenCL warper, contributed
many years ago, in sync with the CPU warping code. We also lack the
GPU expertise in the current team to do that.<br>
</p>
<p>Even<br>
</p>
<div class="moz-cite-prefix">Le 25/02/2024 à 12:58, Adam Stewart via
gdal-dev a écrit :<br>
</div>
<blockquote type="cite"
cite="mid:9239DC13-93D4-4E8B-B362-96D87662D98C@tum.de">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<div
style="overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">
Hi,
<div><br>
</div>
<div><b>Background</b>: I'm the developer of the <a
href="https://github.com/microsoft/torchgeo"
moz-do-not-send="true">TorchGeo</a> software library.
TorchGeo is a machine learning library that heavily relies on
GDAL (via rasterio/fiona) for satellite imagery I/O.</div>
<div><br>
</div>
<div>One of our primary concerns is ensuring that we can load
data from disk fast enough to keep the GPU busy during model
training. Of course, satellite imagery is often distributed in
large files that make this challenging. We use various tricks
to optimize performance (COGs, windowed reading, caching,
compression, parallel workers, etc.). In our initial <a
href="https://arxiv.org/abs/2111.08872"
moz-do-not-send="true">paper</a>, we chose to create our own
arbitrary I/O benchmarking dataset composed of 100 Landsat
scenes and 1 CDL map. See Figure 3 for the results, and
Appendix A for the experiment details.</div>
<div><br>
</div>
<div><b>Question</b>: is there an official dataset that the GDAL
developers use to benchmark GDAL itself? For example, if
someone makes a change to how GDAL handles certain I/O
operations, I assume the GDAL developers will benchmark it to
see if I/O is now faster or slower. I'm envisioning
experiments similar
to <a class="moz-txt-link-freetext" href="https://kokoalberti.com/articles/geotiff-compression-optimization-guide/">https://kokoalberti.com/articles/geotiff-compression-optimization-guide/</a>
for various file formats, compression levels, block sizes,
etc.</div>
<div><br>
</div>
<div>If such a dataset doesn't yet exist, I would be interested
in creating one and publishing a paper on how this can be used
to develop libraries like GDAL and TorchGeo.</div>
<div><br>
<div>
<div><b>Dr. Adam J. Stewart</b></div>
<div>Technical University of Munich</div>
<div>School of Engineering and Design</div>
<div>Data Science in Earth Observation</div>
</div>
<br>
</div>
</div>
<br>
<fieldset class="moz-mime-attachment-header"></fieldset>
<pre class="moz-quote-pre" wrap="">_______________________________________________
gdal-dev mailing list
<a class="moz-txt-link-abbreviated" href="mailto:gdal-dev@lists.osgeo.org">gdal-dev@lists.osgeo.org</a>
<a class="moz-txt-link-freetext" href="https://lists.osgeo.org/mailman/listinfo/gdal-dev">https://lists.osgeo.org/mailman/listinfo/gdal-dev</a>
</pre>
</blockquote>
<pre class="moz-signature" cols="72">--
<a class="moz-txt-link-freetext" href="http://www.spatialys.com">http://www.spatialys.com</a>
My software is free, but my time generally not.</pre>
</body>
</html>