[pdal] Replacing Arbiter with GDAL's VSI?

Andrew Bell andrew.bell.ia at gmail.com
Wed Jan 29 12:10:50 PST 2025


On Wed, Jan 29, 2025 at 2:02 PM Norman Barker via pdal <pdal at lists.osgeo.org>
wrote:

> Howard,
>
> I have been looking into this further and I would like to better
> understand how you envision using FileSpec and VSI to move over from
> Arbiter.
>
> Taking the LasReader for example and in particular this line -
> https://github.com/PDAL/PDAL/blob/master/io/LasReader.cpp#L257 where
> arbiter copies the LAS file locally, a direct replacement with VSI will
> also download the whole file.
>

FileSpec doesn't really have anything to do with fetching remote files. It
just provides a uniform way to communicate options that may be related to
remote access or perhaps some other file operations.

I don't know that we really need/want a virtual file system in PDAL. This
could end up being a hugely complicated thing that takes over. But it is
useful to fetch blocks of data from remote storage. Right now we do this
with a little helper class and Arbiter. Fetching by block makes sense if you
don't know the order in which you're going to fetch data, or whether you're
going to need the entire file.
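
To make the block-fetch idea concrete, here is a minimal sketch of what such a
helper could look like. It is not PDAL's actual helper class and it does not
use Arbiter: BlockCache, fetch_range and block_size are hypothetical names, and
fetch_range stands in for whatever issues the remote ranged read.

```python
from typing import Callable, Dict


class BlockCache:
    """Hypothetical helper: serves byte ranges from fixed-size cached blocks."""

    def __init__(self, fetch_range: Callable[[int, int], bytes],
                 size: int, block_size: int = 1 << 20) -> None:
        if block_size <= 0:
            raise ValueError("block_size must be positive")
        self._fetch_range = fetch_range  # assumed: (offset, length) -> bytes
        self._size = size                # total size of the remote object
        self._block_size = block_size
        self._blocks: Dict[int, bytes] = {}
        self.fetch_count = 0             # number of remote requests issued

    def _block(self, index: int) -> bytes:
        # Fetch a block only the first time it is touched.
        if index not in self._blocks:
            start = index * self._block_size
            length = min(self._block_size, self._size - start)
            self._blocks[index] = self._fetch_range(start, length)
            self.fetch_count += 1
        return self._blocks[index]

    def read(self, offset: int, length: int) -> bytes:
        """Return up to `length` bytes starting at `offset`, in any order."""
        if offset < 0 or length < 0:
            raise ValueError("offset and length must be non-negative")
        end = min(offset + length, self._size)
        out = bytearray()
        pos = offset
        while pos < end:
            index, within = divmod(pos, self._block_size)
            block = self._block(index)
            take = min(end - pos, len(block) - within)
            if take <= 0:
                raise IOError("fetch_range returned a short block")
            out += block[within:within + take]
            pos += take
        return bytes(out)
```

The caller can read in whatever order it likes; each block costs one
round-trip the first time it is touched and nothing after that.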

In most cases we do read all the data when dealing with a file, but it's
not always sequential. LAS places metadata at the beginning and end of the
file. You have to read those first before processing the actual data. But
after that you (mostly) read the data sequentially. So in a perfect world
you would pre-fetch data as you go. But this gets complex and the access
pattern isn't absolutely sequential -- the code is threaded in order to
have plenty of CPUs doing decompression, which is typically the gating
factor. Certainly you could request blocks of data as necessary, but then
you're waiting on a round-trip to the server for each one. To some extent
this could be alleviated with more threads or async I/O, but that adds
complexity of its own. COPC does random access on remote data, so
pre-fetching probably doesn't make sense for it. EPT is another thing
altogether in that you read entire files as you need them. I haven't done a
survey of other readers.
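
As a rough illustration of the LAS-style access order described above (metadata
at both ends first, then the body mostly in sequence), the sketch below lists
the byte ranges in the order a reader might request them. The function name and
all sizes are hypothetical; real LAS header and trailing-record sizes come from
the file itself, and the threaded decompression mentioned above would make the
real pattern less tidy.

```python
from typing import Iterator, Tuple


def las_like_read_plan(file_size: int, head_size: int, tail_size: int,
                       block_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) ranges: head, then tail, then body blocks."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    head = min(head_size, file_size)
    tail = min(tail_size, file_size - head)
    if head:
        yield (0, head)                   # metadata at the start
    if tail:
        yield (file_size - tail, tail)    # metadata at the end
    pos = head
    body_end = file_size - tail
    while pos < body_end:                 # point data, in sequence
        length = min(block_size, body_end - pos)
        yield (pos, length)
        pos += length
```

A prefetcher could walk this plan a few entries ahead of the consumer, which
is where the complexity discussed above comes in.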

Writing to temporary disk isn't necessarily a big deal. Data may never
actually go to disk if you have sufficient memory to store the pages. The
obvious disadvantage of the fetch-then-process model is that you need to
wait until all the network traffic is done before you can start real
processing. The advantage is simplicity.
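
The tradeoff can be put in back-of-the-envelope form. The sketch below is a toy
model, not a measurement: it assumes n equal-sized blocks, a constant fetch
time and processing time per block, and perfect overlap of one fetch with one
processing step in the overlapped case. The function names are hypothetical.

```python
def fetch_then_process_time(n_blocks: int, t_fetch: int, t_proc: int) -> int:
    """Total time when every block is downloaded before processing starts."""
    return n_blocks * t_fetch + n_blocks * t_proc


def overlapped_time(n_blocks: int, t_fetch: int, t_proc: int) -> int:
    """Total time when block i+1 is fetched while block i is processed."""
    if n_blocks == 0:
        return 0
    # The first fetch and the last processing step cannot overlap anything;
    # every step in between is bounded by the slower of the two stages.
    return t_fetch + (n_blocks - 1) * max(t_fetch, t_proc) + t_proc
```

With 10 blocks, 2 time units per fetch and 1 per processing step, the model
gives 30 units for fetch-then-process and 21 for the overlapped case. The gain
is bounded by the slower stage, so it says nothing about whether the added
complexity is worth it.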

Anyway, I don't think doing this transparently in the code is going to
yield the best outcome. Perhaps a VFS-thing would make sense, but I don't
think it's an obvious win. I know that GDAL's VSI has various limitations.
Probably all do. I think we have a certain strength in the simplicity of
what's there now. I'd like to see some experimentation before we commit to
anything.

Note that this is a separate issue from providing more information to
use when accessing remote sources (credentials, regions, chunk size, etc.).

-- 
Andrew Bell
andrew.bell.ia at gmail.com