<div dir="auto"><div>Putting something like mapserver/mapcache in front of your requests might work, if I understand correctly. We serve COGs out of s3 via WMS like this and the performance is pretty nice.</div><div dir="auto"> </div><div dir="auto">See </div><div dir="auto"><br></div><div dir="auto"><a href="https://github.com/pedros007/mapserver-docker">https://github.com/pedros007/mapserver-docker</a></div><div dir="auto"><br></div><div dir="auto">for some discussion on such a setup.</div><div dir="auto"><br></div><div dir="auto"><div dir="auto">Best,</div><div dir="auto">Patrick</div><br><br><div class="gmail_quote" dir="auto"><div dir="ltr" class="gmail_attr">On Thu, Jan 23, 2020, 6:44 AM Daniel Evans <<a href="mailto:Daniel.Evans@jbarisk.com">Daniel.Evans@jbarisk.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div lang="EN-GB" link="#0563C1" vlink="#954F72">
<div class="m_2993562117743878583WordSection1">
<p class="MsoNormal"><a name="m_2993562117743878583_x__MailAutoSig" rel="noreferrer"><span style="font-family:"Arial",sans-serif">Hi,<u></u><u></u></span></a></p>
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif"><u></u> <u></u></span></span></p>
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif">I have a large (global, 30m resolution, 50GB+) GeoTIFF dataset, from which I need to read many (millions of) pixel values at given input coordinates.
I’ve got reasonable performance out of the code, about a million queries in five minutes, but:<u></u><u></u></span></span></p>
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif"><u></u> <u></u></span></span></p>
<ol style="margin-top:0cm" start="1" type="1">
<li class="m_2993562117743878583MsoListParagraph" style="margin-left:0cm"><span><span style="font-family:"Arial",sans-serif">There are actually twelve separate datasets of this size to query, not just one, so it takes
approximately an hour.<u></u><u></u></span></span></li><li class="m_2993562117743878583MsoListParagraph" style="margin-left:0cm"><span><span style="font-family:"Arial",sans-serif">This is by far the slowest portion of the program, and the users demand speed!<u></u><u></u></span></span></li><li class="m_2993562117743878583MsoListParagraph" style="margin-left:0cm"><span><span style="font-family:"Arial",sans-serif">The users would also like to move towards higher resolution datasets, which we see run about
5x slower.<u></u><u></u></span></span></li><li class="m_2993562117743878583MsoListParagraph" style="margin-left:0cm"><span><span style="font-family:"Arial",sans-serif">When querying the data on a particular piece of network storage mounted as part of the local
filesystem, we see a slowdown approaching two orders of magnitude – bulk file copies off the network storage are reasonable, but each IO request shows a significant overhead (up to a second), and GDAL is sending one for each coordinate queried.<u></u><u></u></span></span></li></ol>
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif"><u></u> <u></u></span></span></p>
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif">The implementation is in Python, directly calling down to GDAL. The short but long-running snippet of code that actually queries the dataset,
after real-world coordinates have been converted to pixel offsets, is:<u></u><u></u></span></span></p>
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif"><u></u> <u></u></span></span></p>
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif">value_arrays = (<u></u><u></u></span></span></p>
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif"> raster_ds.ReadAsArray(<u></u><u></u></span></span></p>
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif"> xoff=coord[0] - buffer_size,<u></u><u></u></span></span></p>
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif"> yoff=coord[1] - buffer_size,<u></u><u></u></span></span></p>
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif"> xsize=npix,<u></u><u></u></span></span></p>
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif"> ysize=npix<u></u><u></u></span></span></p>
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif"> ) for coord in offsets<u></u><u></u></span></span></p>
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif">)<u></u><u></u></span></span></p>
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif"><u></u> <u></u></span></span></p>
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif">There are a few things that are probably worth noting:<u></u><u></u></span></span></p>
<ol style="margin-top:0cm" start="1" type="1">
<li class="m_2993562117743878583MsoListParagraph" style="margin-left:0cm"><span><span style="font-family:"Arial",sans-serif">It is not necessarily a single pixel that is being read – for each coordinate, the program may
be asked to get all pixel values within a given radius (typically a couple of pixels), and use some function to summarise these into a single value (mean, median, …). GDAL currently returns a numpy array for each query, which is passed to the user-specified
function after the snippet above.<u></u><u></u></span></span></li><li class="m_2993562117743878583MsoListParagraph" style="margin-left:0cm"><span><span style="font-family:"Arial",sans-serif">The dataset is made up of 2048x2048 LZW-Compressed tiles containing floating point data (essentially
conforming to COG, but with no overviews), grouped together in a VRT (performance is identical with plain GeoTIFFs, though).<u></u><u></u></span></span></li></ol>
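<p class="MsoNormal">As a rough illustration of the first note, here is a minimal sketch of how a window around one pixel could be sized and then reduced to a single value. The names window_bounds and summarise are hypothetical, and the relation npix = 2 * buffer_size + 1 is an assumption; the real program may define these differently. The window is a plain list of rows here, standing in for the numpy array that GDAL returns.</p>

```python
from statistics import mean, median


def window_bounds(col, row, buffer_size):
    """Return (xoff, yoff, xsize, ysize) for a square window centred on a pixel."""
    # Assumption: the window edge is twice the buffer plus the centre pixel.
    npix = 2 * buffer_size + 1
    return col - buffer_size, row - buffer_size, npix, npix


def summarise(window, func=mean):
    """Flatten a 2-D window (a sequence of rows) and reduce it to one value."""
    values = [value for window_row in window for value in window_row]
    return func(values)


# A made-up 3x3 window with one outlier, to show how the choice of
# summary function changes the result.
window = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 90.0]]
mean_value = summarise(window)            # pulled upwards by the outlier
median_value = summarise(window, median)  # unaffected by the outlier
```

<p class="MsoNormal">Note that this reads a square window, as the snippet above does, rather than a true circular radius.</p>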
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif"><u></u> <u></u></span></span></p>
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif">Multiprocessing has not been found to help - we actually lose throughput as the disk read head is moving back and forth constantly. Better hardware (especially
SSDs) is known to help, but no one wants to pay for that.<u></u><u></u></span></span></p>
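<p class="MsoNormal">One idea that might reduce the back-and-forth seeking is to reorder the queries so that all coordinates falling in the same 2048x2048 tile are read together. The sketch below shows only the reordering step, in plain Python with no GDAL calls; group_by_tile and tile_ordered are hypothetical helper names, and the row-major tile order is an assumption about how the tiles are laid out on disk. It has not been measured against this dataset, so any benefit is speculative.</p>

```python
from collections import defaultdict

TILE_SIZE = 2048  # tile edge in pixels, as described for this dataset


def group_by_tile(offsets, tile_size=TILE_SIZE):
    """Map (tile_col, tile_row) to the list of (index, col, row) queries in it."""
    groups = defaultdict(list)
    for index, (col, row) in enumerate(offsets):
        groups[(col // tile_size, row // tile_size)].append((index, col, row))
    return groups


def tile_ordered(offsets, tile_size=TILE_SIZE):
    """Yield (index, col, row) one tile at a time, in row-major tile order."""
    groups = group_by_tile(offsets, tile_size)
    # Sort tiles by row first, then column (assumed on-disk order).
    for key in sorted(groups, key=lambda tile: (tile[1], tile[0])):
        yield from groups[key]


# Made-up pixel offsets that jump between three different tiles.
offsets = [(5000, 10), (10, 10), (5001, 12), (20, 3000), (11, 11)]
ordered = list(tile_ordered(offsets))
```

<p class="MsoNormal">The original index is carried along so that results can be put back into input order afterwards. Windows that straddle a tile boundary would still touch a neighbouring tile.</p>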
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif"><u></u> <u></u></span></span></p>
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif">We see no particular performance difference from setting GDAL_DISABLE_READDIR_ON_OPEN=TRUE, and GDAL_CACHEMAX is left at the default 5% (64GB+ RAM available).<u></u><u></u></span></span></p>
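<p class="MsoNormal">To put the cache setting in context, a back-of-envelope calculation of how many decompressed tiles fit in the default cache may help. This assumes 4-byte (float32) samples and exactly 64 GiB of RAM, neither of which is stated precisely above; tiles_that_fit is a hypothetical helper name.</p>

```python
def tiles_that_fit(cache_bytes, tile_px=2048, bytes_per_sample=4):
    """Number of whole decompressed tiles that fit in a cache of this size."""
    tile_bytes = tile_px * tile_px * bytes_per_sample
    return cache_bytes // tile_bytes


ram_bytes = 64 * 2**30                 # assumed: exactly 64 GiB of RAM
default_cache = ram_bytes * 5 // 100   # the 5% default mentioned above
tile_mib = 2048 * 2048 * 4 // 2**20    # one float32 tile, decompressed, in MiB
cached_tiles = tiles_that_fit(default_cache)
```

<p class="MsoNormal">Under these assumptions each tile decompresses to 16 MiB and roughly 200 of them fit in the cache at once, so widely scattered queries could evict tiles before they are reused.</p>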
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif"><u></u> <u></u></span></span></p>
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif">Does the Python interface to GDAL provide a way to supply a large number of offsets and get blocks of pixels back, avoiding the need to come back up
to Python after each query? (I suspect not)<u></u><u></u></span></span></p>
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif">Is there some way to optimise GDAL so that queries of files on the mounted network storage are more efficient?<u></u><u></u></span></span></p>
<p class="MsoNormal"><span><span style="font-family:"Arial",sans-serif"><u></u> <u></u></span></span></p>
<p class="MsoNormal"><span><b><span style="font-family:"Arial",sans-serif;color:black">Dr. Daniel Evans</span></b></span><span></span><span style="font-family:"Arial",sans-serif;color:black"><u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-family:"Arial",sans-serif;color:#f6a124">Software Developer</span><span style="font-family:"Arial",sans-serif;color:black"><u></u><u></u></span></p>
</div>
</div>
_______________________________________________<br>
gdal-dev mailing list<br>
<a href="mailto:gdal-dev@lists.osgeo.org" target="_blank" rel="noreferrer">gdal-dev@lists.osgeo.org</a><br>
<a href="https://lists.osgeo.org/mailman/listinfo/gdal-dev" rel="noreferrer noreferrer" target="_blank">https://lists.osgeo.org/mailman/listinfo/gdal-dev</a></blockquote></div></div></div>