[gdal-dev] Querying many raster pixels efficiently from Python

Patrick Young patrick.mckendree.young at gmail.com
Thu Jan 23 18:58:24 PST 2020


Putting something like mapserver/mapcache in front of your requests might
work, if I understand correctly.  We serve COGs out of s3 via WMS like this
and the performance is pretty nice.

See

https://github.com/pedros007/mapserver-docker

for some discussion on such a setup.

Best,
Patrick


On Thu, Jan 23, 2020, 6:44 AM Daniel Evans <Daniel.Evans at jbarisk.com> wrote:

> Hi,
>
>
>
> I have a large (global, 30m resolution, 50GB+) GeoTIFF dataset, from which
> I need to read many (millions) of pixel values at given input coordinates.
> I’ve got reasonable performance out of the code, about a million queries
> over five minutes, but:
>
>
>
>    1. There are actually twelve separate datasets of this size to query,
>    not just one, so it takes approximately an hour.
>    2. This is by far the slowest portion of the program, and the users
>    demand speed!
>    3. The users would also like to move towards higher resolution
>    datasets, which we see run about 5x slower.
>    4. When querying the data on a particular piece of network storage
>    mounted as part of the local filesystem, we see a slowdown approaching two
>    orders of magnitude – bulk file copies off the network storage are
>    reasonable, but each IO request shows a significant overhead (up to a
>    second), and GDAL is sending one for each coordinate queried.
>
>
>
> The implementation is in Python, directly calling down to GDAL. The short,
> long-running snippet of code which performs the actual queries the dataset,
> having converted real-world coordinates to pixels, is:
>
>
>
> value_arrays = (
>
>             raster_ds.ReadAsArray(
>
>                         xoff=coord[0] - buffer_size,
>
>                         yoff=coord[1] - buffer_size,
>
>                         xsize=npix,
>
>                         ysize=npix
>
>             ) for coord in offsets
>
> )
>
>
>
> There are a few things that are probably worth noting:
>
>    1. It is not necessarily a single pixel that is being read – for each
>    coordinate, the program may be asked to get all pixel values within a given
>    radius (typically a couple of pixels), and use some function to summarise
>    these into a single value (mean, median, …). GDAL currently returns a numpy
>    array for each query, which is passed to the user-specified function after
>    the snippet above.
>    2. The dataset is made up of 2048x2048 LZW-Compressed tiles containing
>    floating point data (essentially conforming to COG, but with no overviews),
>    grouped together in a VRT (performance is identical with plain GeoTIFFs,
>    though).
>
>
>
> Multiprocessing has not been found to help - we actually lose throughput
> as the disk read head is moving back and forth constantly. Better hardware
> (especially SSDs) is known to help, but no one wants to pay for that.
>
>
>
> We see no particular performance difference from setting
> GDAL_DISABLE_READDIR_ON_OPEN=TRUE, and GDAL_CACHEMAX is left at the default
> 5% (64GB+ RAM available).
>
>
>
>
>
>
>
> Does the Python interface to GDAL provide a way to supply a large number
> of offsets and get blocks of pixels back, avoiding the need to come back up
> to Python after each query? (I suspect not)
>
> Is there some way to optimise GDAL so that queries of files on the mounted
> network storage are more efficient?
>
>
>
>
>
>
>
> *Dr. Daniel Evans*
>
> Software Developer
>
>
>
> *Skype*
>
>
>
> *T* +44 (0) 1756 799919
> www.jbarisk.com
>
> [image: Visit our website] <http://www.jbarisk.com>  [image: Follow us on
> Twitter] <https://twitter.com/jbarisk>
>
> Our postal address and registered office is JBA Risk Management Limited,
> 1 Broughton Park, Old Lane North, Broughton, Skipton, North Yorkshire BD23
> 3FD.
>
> *Find out more about us here: www.jbarisk.com <http://www.jbarisk.com/>
> and **follow us on Twitter @JBARisk <http://twitter.com/JBARisk> and
> LinkedIn
> <https://www.linkedin.com/company/2370847?trk=tyah&trkInfo=clickedVertical%3Acompany%2CclickedEntityId%3A2370847%2Cidx%3A2-1-2%2CtarId%3A1447414259786%2Ctas%3AJBA%20RISK%20MANAGEMENT>
> *
>
> The JBA Group supports the JBA Trust.
>
> All JBA Risk Management's email messages contain confidential information
> and are intended only for the individual(s) named. If you are not the named
> addressee you should not disseminate, distribute or copy this e-mail.
> Please notify the sender immediately by email if you have received this
> email by mistake and delete this email from your system.
>
>
> JBA Risk Management Limited is registered in England, company number
> 07732946, 1 Broughton Park, Old Lane North, Broughton, Skipton, North
> Yorkshire, BD23 3FD, Telephone: +441756799919
>
>
> _______________________________________________
> gdal-dev mailing list
> gdal-dev at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/gdal-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/gdal-dev/attachments/20200123/e41b547c/attachment-0001.html>


More information about the gdal-dev mailing list