[pdal] Replacing Arbiter with GDAL's VSI?
Norman Barker
norman.barker at gmail.com
Wed Jan 29 15:15:01 PST 2025
Andrew,
Thanks for the details.
Writing to temp is a blocker for distributed processing of a region of a
LAS file (which, yes, you probably want to use COPC for if the source is
LAS), because each processing node has to download the whole file and may
not have the space for it. If the data is not well ordered, but the data
and the processing sit in the same region, then it is feasible to do the
extra file seeks at additional cost (time over the network, and dollars
as well in the cloud).
Would a good start be to add GDAL VSI to FileUtils for the open and
close methods, as an experiment? This would be harmless: if the path
starts with /vsi, use GDAL; if not, fall back to the existing access
methods. I can give this a try and put up a PR.
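Concretely, the dispatch could look something like this (a sketch only --
isVsiPath/openVsi/closeVsi are made-up names, but VSIFOpenL and
VSIFCloseL are the actual GDAL entry points):

    #include <string>
    #include <cpl_vsi.h>  // GDAL's virtual file system API

    // Hypothetical FileUtils hook: route /vsi paths to GDAL and let
    // everything else fall through to the existing open/close code.
    bool isVsiPath(const std::string& path)
    {
        return path.rfind("/vsi", 0) == 0;  // path starts with "/vsi"
    }

    VSILFILE* openVsi(const std::string& path)
    {
        // "rb" = read-only binary; works for /vsicurl/, /vsis3/, ...
        return VSIFOpenL(path.c_str(), "rb");
    }

    void closeVsi(VSILFILE* file)
    {
        if (file)
            VSIFCloseL(file);
    }
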
Norman
On Wed, Jan 29, 2025 at 2:11 PM Andrew Bell <andrew.bell.ia at gmail.com>
wrote:
>
>
> On Wed, Jan 29, 2025 at 2:02 PM Norman Barker via pdal <
> pdal at lists.osgeo.org> wrote:
>
>> Howard,
>>
>> I have been looking into this further and I would like to better
>> understand how you envision using FileSpec and VSI to move over from
>> Arbiter.
>>
>> Taking the LasReader as an example, and in particular this line -
>> https://github.com/PDAL/PDAL/blob/master/io/LasReader.cpp#L257 - where
>> Arbiter copies the LAS file locally, a direct replacement with VSI will
>> also download the whole file.
>>
>
> FileSpec doesn't really have anything to do with fetching remote files. It
> just provides a uniform way to communicate options that may be related to
> remote access or perhaps some other file operations.
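>
> Think of it as a filename plus a bag of options, something like this
> (illustrative only -- the exact keys here aren't a spec):
>
>     {
>         "type": "readers.las",
>         "filename": {
>             "path": "s3://bucket/data.las",
>             "headers": { "Authorization": "Bearer ..." },
>             "query": { "versionId": "..." }
>         }
>     }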
>
> I don't know that we really need/want a virtual file system in PDAL. This
> could end up being a hugely complicated thing that takes over. But it is
> useful to fetch blocks of data from remote storage. Right now we do this
> with a little helper class and Arbiter. This makes sense if you don't know
> the order in which you're going to fetch data or if you're going to need
> the entire file.
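>
> For comparison, that kind of block fetch through VSI is just a
> seek-and-read (a sketch, not PDAL code -- fetchBlock is made up;
> /vsicurl/ turns the read into an HTTP range request under the hood):
>
>     #include <cstdint>
>     #include <vector>
>     #include <cpl_vsi.h>
>
>     // Fetch one block of a remote file through GDAL's VSI layer.
>     std::vector<uint8_t> fetchBlock(const char* path,
>         vsi_l_offset offset, size_t len)
>     {
>         std::vector<uint8_t> buf(len);
>         VSILFILE* f = VSIFOpenL(path, "rb");
>         if (!f)
>             return {};
>         VSIFSeekL(f, offset, SEEK_SET);
>         size_t got = VSIFReadL(buf.data(), 1, len, f);  // bytes read
>         VSIFCloseL(f);
>         buf.resize(got);
>         return buf;
>     }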
>
> In most cases we do read all the data when dealing with a file, but it's
> not always sequential. LAS places metadata at the beginning and end of the
> file. You have to read those first before processing the actual data. But
> after that you (mostly) read the data sequentially. So in a perfect world
> you would pre-fetch data as you go. But this gets complex and the access
> pattern isn't absolutely sequential -- the code is threaded in order to
> have plenty of CPUs doing decompression, which is typically the gating
> factor. Certainly you could request blocks of data as necessary, but then
> you're waiting on a round-trip to the server. To some extent this issue
> could be alleviated with more threads or async I/O, but it's all
> complicated. COPC does random access on remote data, so
> pre-fetching probably doesn't make sense for it. EPT is another thing
> altogether in that you read entire files as you need them. I haven't done a
> survey of other readers.
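>
> To make that pattern concrete over a VSI handle (a sketch; the offsets
> assume a LAS 1.4 header and a little-endian host):
>
>     #include <cstdint>
>     #include <cstring>
>     #include <cpl_vsi.h>
>
>     // Sketch of the LAS metadata-then-data access pattern.
>     void readLasLayout(const char* path)
>     {
>         VSILFILE* f = VSIFOpenL(path, "rb");
>         if (!f)
>             return;
>
>         uint8_t header[375];                        // LAS 1.4 header size
>         VSIFReadL(header, 1, sizeof(header), f);    // metadata at the front
>
>         uint64_t evlrOffset;
>         std::memcpy(&evlrOffset, header + 235, 8);  // start of first EVLR
>         VSIFSeekL(f, evlrOffset, SEEK_SET);         // metadata at the back
>         // ... read EVLRs ...
>
>         uint32_t pointOffset;
>         std::memcpy(&pointOffset, header + 96, 4);  // offset to point data
>         VSIFSeekL(f, pointOffset, SEEK_SET);
>         // ... then (mostly) sequential reads of the points ...
>
>         VSIFCloseL(f);
>     }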
>
> Writing to temporary disk isn't necessarily a big deal. Data may never
> actually go to disk if you have sufficient memory to store the pages. The
> obvious disadvantage of the fetch-then-process model is that you need to
> wait until all the network traffic is done before you can start real
> processing. The advantage is simplicity.
>
> Anyway, I don't think doing this transparently in the code is going to
> yield the best outcome. Perhaps a VFS-thing would make sense, but I don't
> think it's an obvious win. I know that GDAL's VSI has various limitations.
> Probably all do. I think we have a certain strength in the simplicity of
> what's there now. I'd like to see some experimentation before we commit to
> anything.
>
> Note that this issue is different from providing more information to use
> when accessing remote sources (credentials, regions, chunk size, etc.).
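>
> (With GDAL, for instance, that information travels through config
> options rather than through the file API itself -- roughly:)
>
>     #include <cpl_conv.h>  // CPLSetConfigOption
>
>     // Remote-access details are passed out of band, not via the
>     // file handle; the values here are placeholders.
>     void configureRemoteAccess()
>     {
>         CPLSetConfigOption("AWS_REGION", "us-west-2");
>         CPLSetConfigOption("AWS_ACCESS_KEY_ID", "...");
>         CPLSetConfigOption("AWS_SECRET_ACCESS_KEY", "...");
>         // Bytes fetched per range request by /vsicurl/-backed reads:
>         CPLSetConfigOption("CPL_VSIL_CURL_CHUNK_SIZE", "1048576");
>     }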
>
> --
> Andrew Bell
> andrew.bell.ia at gmail.com
>