[gdal-dev] Querying many raster pixels efficiently from Python
Daniel.Evans at jbarisk.com
Thu Jan 23 05:44:19 PST 2020
I have a large (global, 30m resolution, 50GB+) GeoTIFF dataset, from which I need to read many (millions) of pixel values at given input coordinates. I've got reasonable performance out of the code, about a million queries over five minutes, but:
1. There are actually twelve separate datasets of this size to query, not just one, so it takes approximately an hour.
2. This is by far the slowest portion of the program, and the users demand speed!
3. The users would also like to move towards higher resolution datasets, which we see run about 5x slower.
4. When querying the data on a particular piece of network storage mounted as part of the local filesystem, we see a slowdown approaching two orders of magnitude - bulk file copies off the network storage are reasonable, but each IO request shows a significant overhead (up to a second), and GDAL is sending one for each coordinate queried.
The implementation is in Python, directly calling down to GDAL. The short, long-running snippet of code which performs the actual queries the dataset, having converted real-world coordinates to pixels, is:
value_arrays = (
xoff=coord - buffer_size,
yoff=coord - buffer_size,
) for coord in offsets
There are a few things that are probably worth noting:
1. It is not necessarily a single pixel that is being read - for each coordinate, the program may be asked to get all pixel values within a given radius (typically a couple of pixels), and use some function to summarise these into a single value (mean, median, ...). GDAL currently returns a numpy array for each query, which is passed to the user-specified function after the snippet above.
2. The dataset is made up of 2048x2048 LZW-Compressed tiles containing floating point data (essentially conforming to COG, but with no overviews), grouped together in a VRT (performance is identical with plain GeoTIFFs, though).
Multiprocessing has not been found to help - we actually lose throughput as the disk read head is moving back and forth constantly. Better hardware (especially SSDs) is known to help, but no one wants to pay for that.
We see no particular performance difference from setting GDAL_DISABLE_READDIR_ON_OPEN=TRUE, and GDAL_CACHEMAX is left at the default 5% (64GB+ RAM available).
Does the Python interface to GDAL provide a way to supply a large number of offsets and get blocks of pixels back, avoiding the need to come back up to Python after each query? (I suspect not)
Is there some way to optimise GDAL so that queries of files on the mounted network storage are more efficient?
Dr. Daniel Evans
Skype<sip:daniel.evans at jbarisk.com>
T +44 (0) 1756 799919
[Visit our website]<http://www.jbarisk.com> [http://www.jbagroup.co.uk/imgstore/JBA-Email-Sig-Icons-LINKEDIN.png] [Follow us on Twitter] <https://twitter.com/jbarisk>
Our postal address and registered office is JBA Risk Management Limited, 1 Broughton Park, Old Lane North, Broughton, Skipton, North Yorkshire BD23 3FD.
Find out more about us here: www.jbarisk.com<http://www.jbarisk.com/> and follow us on Twitter @JBARisk<http://twitter.com/JBARisk> and LinkedIn<https://www.linkedin.com/company/2370847?trk=tyah&trkInfo=clickedVertical%3Acompany%2CclickedEntityId%3A2370847%2Cidx%3A2-1-2%2CtarId%3A1447414259786%2Ctas%3AJBA%20RISK%20MANAGEMENT>
The JBA Group supports the JBA Trust.
All JBA Risk Management's email messages contain confidential information and are intended only for the individual(s) named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail.
Please notify the sender immediately by email if you have received this email by mistake and delete this email from your system.
JBA Risk Management Limited is registered in England, company number 07732946, 1 Broughton Park, Old Lane North, Broughton, Skipton, North Yorkshire, BD23 3FD, Telephone: +441756799919
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the gdal-dev