[gdal-dev] Querying many raster pixels efficiently from Python
Patrick Young
patrick.mckendree.young at gmail.com
Thu Jan 23 18:58:24 PST 2020
Putting something like mapserver/mapcache in front of your requests might
work, if I understand correctly. We serve COGs out of s3 via WMS like this
and the performance is pretty nice.
See
https://github.com/pedros007/mapserver-docker
for some discussion on such a setup.
Best,
Patrick
On Thu, Jan 23, 2020, 6:44 AM Daniel Evans <Daniel.Evans at jbarisk.com> wrote:
> Hi,
>
>
>
> I have a large (global, 30m resolution, 50GB+) GeoTIFF dataset, from which
> I need to read many (millions) of pixel values at given input coordinates.
> I’ve got reasonable performance out of the code, about a million queries
> over five minutes, but:
>
>
>
> 1. There are actually twelve separate datasets of this size to query,
> not just one, so it takes approximately an hour.
> 2. This is by far the slowest portion of the program, and the users
> demand speed!
> 3. The users would also like to move towards higher resolution
> datasets, which we see run about 5x slower.
> 4. When querying the data on a particular piece of network storage
> mounted as part of the local filesystem, we see a slowdown approaching two
> orders of magnitude – bulk file copies off the network storage are
> reasonable, but each IO request shows a significant overhead (up to a
> second), and GDAL is sending one for each coordinate queried.
>
>
>
> The implementation is in Python, directly calling down to GDAL. The short,
> long-running snippet of code which performs the actual queries the dataset,
> having converted real-world coordinates to pixels, is:
>
>
>
> value_arrays = (
>
> raster_ds.ReadAsArray(
>
> xoff=coord[0] - buffer_size,
>
> yoff=coord[1] - buffer_size,
>
> xsize=npix,
>
> ysize=npix
>
> ) for coord in offsets
>
> )
>
>
>
> There are a few things that are probably worth noting:
>
> 1. It is not necessarily a single pixel that is being read – for each
> coordinate, the program may be asked to get all pixel values within a given
> radius (typically a couple of pixels), and use some function to summarise
> these into a single value (mean, median, …). GDAL currently returns a numpy
> array for each query, which is passed to the user-specified function after
> the snippet above.
> 2. The dataset is made up of 2048x2048 LZW-Compressed tiles containing
> floating point data (essentially conforming to COG, but with no overviews),
> grouped together in a VRT (performance is identical with plain GeoTIFFs,
> though).
>
>
>
> Multiprocessing has not been found to help - we actually lose throughput
> as the disk read head is moving back and forth constantly. Better hardware
> (especially SSDs) is known to help, but no one wants to pay for that.
>
>
>
> We see no particular performance difference from setting
> GDAL_DISABLE_READDIR_ON_OPEN=TRUE, and GDAL_CACHEMAX is left at the default
> 5% (64GB+ RAM available).
>
>
>
>
>
>
>
> Does the Python interface to GDAL provide a way to supply a large number
> of offsets and get blocks of pixels back, avoiding the need to come back up
> to Python after each query? (I suspect not)
>
> Is there some way to optimise GDAL so that queries of files on the mounted
> network storage are more efficient?
>
>
>
>
>
>
>
> *Dr. Daniel Evans*
>
> Software Developer
>
>
>
> *Skype*
>
>
>
> *T* +44 (0) 1756 799919
> www.jbarisk.com
>
> [image: Visit our website] <http://www.jbarisk.com> [image: Follow us on
> Twitter] <https://twitter.com/jbarisk>
>
> Our postal address and registered office is JBA Risk Management Limited,
> 1 Broughton Park, Old Lane North, Broughton, Skipton, North Yorkshire BD23
> 3FD.
>
> *Find out more about us here: www.jbarisk.com <http://www.jbarisk.com/>
> and **follow us on Twitter @JBARisk <http://twitter.com/JBARisk> and
> LinkedIn
> <https://www.linkedin.com/company/2370847?trk=tyah&trkInfo=clickedVertical%3Acompany%2CclickedEntityId%3A2370847%2Cidx%3A2-1-2%2CtarId%3A1447414259786%2Ctas%3AJBA%20RISK%20MANAGEMENT>
> *
>
> The JBA Group supports the JBA Trust.
>
> All JBA Risk Management's email messages contain confidential information
> and are intended only for the individual(s) named. If you are not the named
> addressee you should not disseminate, distribute or copy this e-mail.
> Please notify the sender immediately by email if you have received this
> email by mistake and delete this email from your system.
>
>
> JBA Risk Management Limited is registered in England, company number
> 07732946, 1 Broughton Park, Old Lane North, Broughton, Skipton, North
> Yorkshire, BD23 3FD, Telephone: +441756799919
>
>
> _______________________________________________
> gdal-dev mailing list
> gdal-dev at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/gdal-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/gdal-dev/attachments/20200123/e41b547c/attachment-0001.html>
More information about the gdal-dev
mailing list