[pgpointcloud] Questions about indexing and querying

Paul Ramsey pramsey at cleverelephant.ca
Sat Dec 14 17:05:55 PST 2013


On December 14, 2013 at 4:33:27 PM, Howard Butler (howard at hobu.co) wrote:
>  
>  
> On Dec 14, 2013, at 5:43 PM, Paul Ramsey wrote:
>  
> >> So I have endeavored to process and load the entire State of
> >> Iowa's LiDAR data into pgpointcloud using PDAL. I currently have
> >> 1.24 million patches at ~120k points per patch. I have some
> >> questions on how to use this holding more effectively.
> >>
> >> 1) Is there an equivalent to this when the pointcloud_postgis
> >> extension is enabled? Or did I miss something? The docs don't
> >> say too much about indexing other than using the chipper to
> >> create patches.
> >
> > Arg, yeah, there's a big hole in indexing currently. Basically
> > you have to enable pointcloud_postgis and then
> >
> > CREATE INDEX patches_idx ON cloud USING GIST (Geometry(patchcol));
> >
> > And since it's a functional index, you then need to make sure
> > that one of your terms is "Geometry(patchcol)" when trying to
> > invoke it.
>  
> Does this mean on-the-fly pc-to-geom conversion of every candidate?
> Is that expensive? Is it worth it to cache an explicit geometry
> for each patch/row in my table and simply interact with that spatially?

No, the index is independent of the rows, natch, so the index scan doesn't touch the table at all. Once you're evaluating the result set, then naturally you have to look at the real data.
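For what it's worth, the planner can only use that functional index if the query filter actually contains the Geometry(patchcol) term. A spatially-restricted query would look roughly like this (table and column names match the index example above; the envelope coordinates and SRID are just placeholders):

  SELECT Sum(PC_NumPoints(patchcol))
  FROM cloud
  WHERE Geometry(patchcol) && ST_MakeEnvelope(-94.0, 41.5, -93.5, 42.0, 4326);

The index scan narrows things down to candidate rows; only those candidates then get Geometry(patchcol) evaluated and their headers read for PC_NumPoints.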


> >> 2) Does a simple count, a la "select sum(PC_NumPoints(pa)) from
> >> cloud", mean unpacking every patch all the way?
> >
> > No, the npoints is stored in the header, so it should only mean
> > a peek into the front of the page.
>  
> I haven't gotten an answer to my PC_NumPoints query a few hours  
> later. Does fetching the header mean lighting up the entire object?  
> I ask because I wonder if I should instead use more rows/patches  
> to store the data instead of bigger, fuller patches. To use
> 400-point patches would mean ~390 million patches/rows for this
> data. Do tables with that many rows in them work out very well?
> For massive point cloud data, the choice is either lots of rows  
> or fat rows. Which way does PostgreSQL seem to favor?

Well, it’s possible something else terrible is going on in the abstraction layers. We are using the ‘external’ storage type, so the objects should be sitting there uncompressed, ready to be peeked into.
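If you want to double-check what the patch column is actually set to (column name pa assumed from your query above), the storage mode is visible in pg_attribute and can be forced with ALTER TABLE:

  -- 'e' = external (out-of-line, uncompressed); 'x' = extended (out-of-line, compressed)
  SELECT attname, attstorage
  FROM pg_attribute
  WHERE attrelid = 'cloud'::regclass AND attname = 'pa';

  -- Only affects newly written rows; existing rows keep their old storage.
  ALTER TABLE cloud ALTER COLUMN pa SET STORAGE EXTERNAL;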

> In Oracle land, we would want fat rows because the cost of
> lighting up the blob is rather high. Once you've got it woken up,
> you might as well pull down a lot of points. Each row in Oracle has
> a roughly fixed cost, so the bigger the data in the rows, the better
> off you generally are.

The downside to large objects in PgSQL is that they get stored in side (TOAST) tables, so there's a big overhead to accessing them. That said, given the number of rows you're looking at, I'd hope the overhead would still end up being small in a relative sense. (Not hours of processing, by any means.)

So there's a mystery to unravel here. I certainly noticed a big improvement in speed when I moved to just header-peeking in the numpoints code, so it's not doing *nothing*, but I don't think I've tested it in the TOASTed configuration, only with page-sized objects.
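A quick sanity check on the TOAST side (table name cloud assumed): comparing the heap size against the total relation size shows how much of the data is living out-of-line in toast tables and indexes.

  SELECT pg_size_pretty(pg_relation_size('cloud'))       AS heap,
         pg_size_pretty(pg_total_relation_size('cloud')
                        - pg_relation_size('cloud'))     AS toast_plus_indexes;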

> At CRREL, we also partition the table so that big chunks of the
> table (mapped to logical collections of data, like a date of
> collection or something) can be dropped/updated altogether. I'll
> explore what it would take to do that as well.

If the queries themselves tend to restrict across the partitions, then PgSQL partitioning and constraint exclusion should be very effective at making things both faster and easier to manage.
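A rough sketch of what that looks like with inheritance-based partitioning (the collected column, dates, and names are made up; each child carries its own CHECK constraint and its own GIST index):

  -- The parent stays empty; children hold the data and the constraints
  -- the planner uses for constraint exclusion.
  CREATE TABLE cloud_2009 (
      CHECK (collected >= '2009-01-01' AND collected < '2010-01-01')
  ) INHERITS (cloud);

  CREATE INDEX cloud_2009_gix ON cloud_2009 USING GIST (Geometry(pa));

  -- 'partition' is the default, but worth confirming so the planner
  -- can skip children whose CHECK constraints rule them out.
  SET constraint_exclusion = partition;

  -- Retiring a whole collection is then just: DROP TABLE cloud_2009;

Loading has to be routed to the right child (directly, or with a trigger on the parent), but dropping or reloading a collection becomes a cheap metadata operation.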

P.



