[gdal-dev] Official dataset for benchmarking GDAL I/O?

Even Rouault even.rouault at spatialys.com
Sun Feb 25 04:25:44 PST 2024


Adam,

Automated performance regression testing is probably one of the aspects 
of testing that could be enhanced. While the GDAL autotest suite is 
quite comprehensive functionally, performance testing has 
traditionally lagged behind. That said, this is an aspect we have 
improved lately with the addition of a benchmark component to the 
autotest suite: 
https://github.com/OSGeo/gdal/tree/master/autotest/benchmark . This is 
admittedly quite minimalistic for now, but it tests some scenarios 
involving the GTiff driver and gdalwarp.

To test non-regression for a pull request, we have a CI benchmark 
configuration 
(https://github.com/OSGeo/gdal/blob/master/.github/workflows/linux_build.yml#L111 
+ 
https://github.com/OSGeo/gdal/tree/master/.github/workflows/benchmarks) 
that runs the benchmarks first against master, and then with the pull 
request (during the same run of the same worker). But we need to allow 
quite a large tolerance threshold (up to +20%) to take into account that 
accurate timing measurements are extremely hard to get on CI 
infrastructure (even locally, this is very challenging for 
microbenchmarks). So this will mostly catch big regressions, not subtle ones.
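
As an illustration only, the comparison described above can be sketched in 
plain Python. This is not the script used by the GDAL CI; the function names 
median_runtime and is_regression are hypothetical, and only the 20% 
tolerance comes from the figure quoted above.

```python
import statistics
import time


def median_runtime(func, repeats=5):
    """Return the median wall-clock time (seconds) of func() over several runs.

    Taking the median damps the effect of occasional slow runs, which are
    common on shared CI workers.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def is_regression(baseline_s, candidate_s, tolerance=0.20):
    """Return True if the candidate is slower than the baseline by more
    than the given relative tolerance (0.20 means +20%)."""
    return candidate_s > baseline_s * (1.0 + tolerance)
```

With this sketch, a candidate timing of 1.15 s against a 1.00 s baseline 
is accepted, while 1.30 s is flagged. A slowdown of a few percent is 
therefore never reported, which is why only big regressions get caught.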

One of the difficulties with benchmark testing is that we don't want the 
tests to run for hours, especially for pull requests, so they need to be 
written carefully so that they still trigger the relevant code paths & 
mechanisms of the code base that are exercised by real-world large 
datasets while running in a few seconds each at most. Typically those 
tests autogenerate their test data too, to avoid the test suite 
depending on overly large datasets and to keep the repository size as 
small as possible.

As you mention GPUs, we have had private contacts from a couple of GPU 
makers in recent years about potential GPU'ification of GDAL, but this 
has led nowhere for now. Some mentioned that moving data acquisition 
to the GPU could be interesting performance-wise, but that seems to be a 
huge undertaking, basically porting the GTiff driver, libtiff and its 
codecs to GPU code. And even if done, how to manage the resulting code 
duplication... We aren't even able to properly keep the OpenCL warper, 
contributed many years ago, in sync with the CPU warping code. We also 
lack GPU expertise in the current team to do that.

Even

On 25/02/2024 at 12:58, Adam Stewart via gdal-dev wrote:
> Hi,
>
> *Background*: I'm the developer of the TorchGeo 
> <https://github.com/microsoft/torchgeo> software library. TorchGeo is 
> a machine learning library that heavily relies on GDAL (via 
> rasterio/fiona) for satellite imagery I/O.
>
> One of our primary concerns is ensuring that we can load data from 
> disk fast enough to keep the GPU busy during model training. Of 
> course, satellite imagery is often distributed in large files that 
> make this challenging. We use various tricks to optimize performance 
> (COGs, windowed reading, caching, compression, parallel workers, 
> etc.). In our initial paper <https://arxiv.org/abs/2111.08872>, we 
> chose to create our own arbitrary I/O benchmarking dataset composed of 
> 100 Landsat scenes and 1 CDL map. See Figure 3 for the results, and 
> Appendix A for the experiment details.
>
> *Question*: is there an official dataset that the GDAL developers use 
> to benchmark GDAL itself? For example, if someone makes a change to 
> how GDAL handles certain I/O operations, I assume the GDAL developers 
> will benchmark it to see if I/O is now faster or slower. I'm 
> envisioning experiments similar to 
> https://kokoalberti.com/articles/geotiff-compression-optimization-guide/ 
> for various file formats, compression levels, block sizes, etc.
>
> If such a dataset doesn't yet exist, I would be interested in creating 
> one and publishing a paper on how this can be used to develop 
> libraries like GDAL and TorchGeo.
>
> *Dr. Adam J. Stewart*
> Technical University of Munich
> School of Engineering and Design
> Data Science in Earth Observation

-- 
http://www.spatialys.com
My software is free, but my time generally not.