[pdal] pgpointcloud

Howard Butler howard at hobu.co
Tue Apr 16 05:41:44 PDT 2024



> On Apr 15, 2024, at 7:07 PM, Darrin Holst via pdal <pdal at lists.osgeo.org> wrote:
> 
> Hello all,
> 
> Hope this isn’t too off topic, but was curious if anyone has any experience with pgpointcloud.
> 
> We’ve been consuming the usgs lidar data for a while now, but now we’re in need of creating and maintaining a modest point cloud…~18 billion points.
> 
> I’m looking for any wins, losses, pitfalls, or tips from managing a point cloud in postgis.

Darrin,

Perfectly on topic...

My experience isn't definitive because I don't have a lot of practical experience with pgpointcloud itself. Instead, we spent nearly five years building out a significant data management system on Oracle Point Cloud before abandoning it for the Entwine (and subsequently COPC) approaches you've seen spin out of this project.

Here's my experience and recommendation about storing point cloud data in postgis/postgresql through pgpointcloud:

Don't.

Here's why...

* Except in front loaded data production scenarios, users typically read point cloud data many more times than they write it. 
* PostgreSQL brings transactions and ACID, but point clouds aren't financial data, and transactional lifecycles are extra overhead for point cloud data.
* Point cloud data are I/O-bound when you are trying to figure out which data you want and CPU-bound when you are trying to extract what you need from them. Index types that support the former work against the latter and vice versa.
* Putting the data in PostgreSQL means all of it must come through the database's often singular I/O stack. That's an annoying problem at terabyte scale and an inconveniently expensive one at petabyte scale – especially to satisfy the simple query of "give me the points inside this box at this resolution".

"But having access to my point clouds in SQL is super convenient." Yes indeed, but there are many ways nowadays to accomplish that. You could write an FDW to Entwine/COPC and manage the data much like you would a big raster pile. You could also write foreign functions that push queries down to a data source type of your choice. The key bit is that because you're only ever reading, these functions can be idempotent and simple.
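To make that "idempotent and simple" read function concrete, here is a minimal sketch of the box-at-resolution query pushed down to a COPC file via a PDAL pipeline. The function name, URL, and bounds are illustrative, not part of any existing API; executing the pipeline requires the PDAL Python bindings, so the snippet only builds the pipeline specification.

```python
import json

def box_query_pipeline(copc_url, xmin, ymin, xmax, ymax, resolution):
    """Build a PDAL pipeline that reads only the points inside a 2D box
    at a given resolution from a COPC file. Idempotent: same inputs,
    same pipeline, no transactional state to manage."""
    return json.dumps([
        {
            "type": "readers.copc",
            "filename": copc_url,
            # PDAL bounds syntax: ([xmin, xmax], [ymin, ymax])
            "bounds": f"([{xmin}, {xmax}], [{ymin}, {ymax}])",
            # Coarsen the read: only octree levels at or above this
            # point spacing are fetched.
            "resolution": resolution,
        }
    ])

# Hypothetical COPC URL -- substitute one of your own files.
spec = box_query_pipeline(
    "https://example.com/tiles/area1.copc.laz",
    637000, 850000, 638000, 851000, 2.0)

# Executing requires the pdal bindings (pip install pdal):
#   import pdal
#   pipeline = pdal.Pipeline(spec)
#   pipeline.execute()
#   points = pipeline.arrays[0]
```

A foreign function wrapping something like this is a thin, stateless layer over object storage, which is the point: the database delegates the I/O instead of owning it.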

I would be very interested to hear the counter argument to my case from the pgpointcloud enthusiasts. What makes it worth the hassle?

Howard

A measure of the Hobu team's bona fides on the topic: the Hobu team created the USGS 3DEP Entwine bucket at https://registry.opendata.aws/usgs-lidar/ With USGS' continued push of content to it, it is currently ~61 trillion points and nearly 320 TB in size (see its footprints at https://usgs.entwine.io). Not only is the data directly renderable in applications like Potree, QGIS, and our Cesium-based renderer called Eptium, but it is also possible to use it to do a two-stage, raster-like query to gather content and do whatever you need with it. See my notebook at https://colab.research.google.com/drive/1JQpcVFFJYMrJCfodqP4Nc_B0_w6p5WOV#scrollTo=qiKI1JD9VqIr for an example.

For the USGS lidar data, it is totally possible to do something like this in PG, but why would you when it already exists and costs nothing to use?
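As a sketch of that two-stage pattern against the public Entwine bucket: stage one is a coarse, raster-like skim to find the area of interest, stage two pulls the full-resolution points for just that box. The dataset name and bounds below are illustrative (browse https://usgs.entwine.io for real resource names); as above, only the pipeline specifications are built here, since executing them needs the PDAL bindings and network access.

```python
import json

# The usgs-lidar-public bucket exposes one ept.json per dataset; the
# dataset name here is an assumption -- pick one from usgs.entwine.io.
EPT_URL = ("https://s3-us-west-2.amazonaws.com/"
           "usgs-lidar-public/IA_FullState/ept.json")

def ept_read(bounds, resolution=None):
    """Build a readers.ept stage limited to a box, optionally coarsened."""
    stage = {"type": "readers.ept", "filename": EPT_URL, "bounds": bounds}
    if resolution is not None:
        stage["resolution"] = resolution  # point spacing in dataset units
    return stage

# Illustrative EPSG:3857 box; EPT bounds use ([xmin, xmax], [ymin, ymax]).
box = "([-10425000, -10423000], [5164000, 5166000])"

overview = json.dumps([ept_read(box, resolution=50.0)])  # stage 1: cheap skim
full = json.dumps([ept_read(box)])                       # stage 2: all points in box

# To execute (pip install pdal):
#   import pdal
#   p = pdal.Pipeline(overview); p.execute(); pts = p.arrays[0]
```

Because the overview read touches only the shallow octree levels, it stays cheap even against a multi-trillion-point collection; you only pay for the deep read where the overview says it matters.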

