[pgpointcloud] RLE and SIGBITS heuristics

Howard Butler howard at hobu.co
Wed Apr 15 08:04:47 PDT 2015


Sandro,

I think your example is probably too artificial to be very instructive for real-world data. As Rémi noted, LASzip does very well in the geospatial point cloud scenario, but it is not generic enough by itself for use in the variable-schema world of databases. Building on LASzip, PDAL is using a refactoring of it, called lazperf, that we wrote to do a few things:

* Compile to efficient JavaScript using Emscripten
* Allow the encoder/decoder to be used dynamically
* Remove LASzip's virtual-function overhead

https://github.com/verma/laz-perf

PDAL is using lazperf to compress data before storing it in both the SQLite and Oracle drivers. Of course, Oracle itself can’t consume the data internally, but we’re just using Oracle as a bucket for chips anyway.

PDAL uses the dynamic compression capability of lazperf to construct an encoder/decoder from the schema of the input data. Rather than model the data “as LAZ”, we instead model each dimension individually. This gives us a point-major storage arrangement (and subsequent partial decompression) with dimension-major compression efficiency. It is not as efficient as LASzip on the same data would be, because the fields are compressed individually, with no model to shrink their bit deltas before they go through the arithmetic encoder. One could do smarter things, like modeling XYZ together as a unit and so on, to improve the compression, but that ends up adding complexity. We have thus far taken a very simple, per-field approach.
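
To make the “model each dimension individually” idea concrete, here is a minimal sketch of the shape of it. This is not lazperf’s actual API: the Point layout, the function names, and the bare delta transform (standing in for the real predictor plus arithmetic coder) are made up for illustration.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical fixed point record; in practice the schema is described at run time.
    struct Point { int32_t x, y, z; uint16_t intensity; };

    // Delta-encode one dimension independently of the others.  In a real
    // pipeline the residuals would feed an entropy coder; the transform alone
    // is enough to show the per-dimension structure.
    static std::vector<int32_t> delta_encode(const std::vector<int32_t>& dim)
    {
        std::vector<int32_t> out(dim.size());
        int32_t prev = 0;
        for (std::size_t i = 0; i < dim.size(); ++i) {
            out[i] = dim[i] - prev;   // small residuals compress well
            prev = dim[i];
        }
        return out;
    }

    // Points arrive point-major; each field is pulled out and modeled on its own.
    static void compress_dimensions(const std::vector<Point>& pts)
    {
        std::vector<int32_t> xs, ys, zs;
        for (const Point& p : pts) {
            xs.push_back(p.x);
            ys.push_back(p.y);
            zs.push_back(p.z);
        }
        std::vector<int32_t> dx = delta_encode(xs);
        std::vector<int32_t> dy = delta_encode(ys);
        std::vector<int32_t> dz = delta_encode(zs);
        // ... hand dx/dy/dz (and the intensity stream) to the entropy coder ...
        (void)dx; (void)dy; (void)dz;
    }

    int main()
    {
        std::vector<Point> pts = { {10, 20, 30, 1}, {11, 22, 33, 2}, {13, 25, 37, 3} };
        compress_dimensions(pts);
        return 0;
    }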

The result is that we get 2:1 to 5:1 compression without working hard. Maybe it would be possible to incorporate lazperf into pgpointcloud as a compression type? See [1] for how we are using it.

Howard 

[1]: https://github.com/PDAL/PDAL/blob/master/include/pdal/Compression.hpp


> On Apr 15, 2015, at 6:42 AM, Rémi Cura <remi.cura at gmail.com> wrote:
> 
> Maybe you would be interested in the paper from LAStools about how they compress the data:
> [the paper](http://lastools.org/download/laszip.pdf)
> 
> 2015-04-15 13:34 GMT+02:00 Sandro Santilli <strk at keybit.net>:
> On Thu, Apr 09, 2015 at 12:41:32PM +0200, Sandro Santilli wrote:
> > Reading the code I found that SIGBITS is used when it gains
> > a compression ratio of 1.6:1, while RLE is required to reach a 4:1
> > ratio (but the comment talks about a 4:1 for both).
> >
> > How were the ratios decided ?
> >
> > https://github.com/pgpointcloud/pointcloud/blob/v0.1.0/lib/pc_dimstats.c#L121-L137
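> 
> Spelled out, that decision reads roughly like this (a paraphrase in code of
> how I read it, not the actual pc_dimstats.c source; the names and the ratio
> inputs are made up):
> 
>   enum DimCompression { DIM_RLE, DIM_SIGBITS, DIM_ZLIB };
> 
>   // rle_ratio     : compression ratio promised by the run-length statistics
>   // sigbits_ratio : ratio promised by stripping the shared high-order bits
>   DimCompression pick_compression(double rle_ratio, double sigbits_ratio)
>   {
>       if (rle_ratio >= 4.0)        // RLE has to promise at least 4:1
>           return DIM_RLE;
>       if (sigbits_ratio >= 1.6)    // SIGBITS only needs 1.6:1
>           return DIM_SIGBITS;
>       return DIM_ZLIB;             // otherwise fall back to deflate
>   }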
> 
> As an experiment, I created a patch containing 260,000 points organized
> so that the data in each dimension is laid out in a way that makes the heuristic
> pick one of the 3 different encodings (sketched in code after the list below):
> 
>  All dimensions are of type int16_t
>  - 1st dimension alternates values 0 and 1
>  - 2nd dimension has value -32768 for the first 130k points, then 32767
>  - 3rd dimension alternates values -32768 and 32767
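> 
> In code, those per-dimension values look like this (just a sketch of the
> value patterns, not how the patch was actually built):
> 
>   #include <cstddef>
>   #include <cstdint>
>   #include <vector>
> 
>   int main()
>   {
>       const std::size_t n = 260000;
>       std::vector<int16_t> d1(n), d2(n), d3(n);
>       for (std::size_t i = 0; i < n; ++i) {
>           d1[i] = int16_t(i % 2);                       // 0,1,0,1,...
>           d2[i] = (i < n / 2) ? INT16_MIN : INT16_MAX;  // two long runs
>           d3[i] = (i % 2) ? INT16_MAX : INT16_MIN;      // alternating extremes
>       }
>       // d1 shares 15 high bits -> sigbits, d2 has only 2 runs -> rle,
>       // d3 defeats both -> zlib, matching the "auto" picks below.
>       return 0;
>   }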
> 
> Then I checked the size of the patch after applying different compression
> schemes, and here's what I got:
> 
>    size   |          compression
>  ---------+---------------------------
>      1680 | {zlib,zlib,zlib}
>      4209 | {zlib,auto,auto} <-- zlib much better than sigbits !
>     33656 | {auto,zlib,auto} <-- zlib better than rle
>     36185 | {auto,auto,auto} <----- DEFAULT, effectively {sigbits,rle,zlib}
>     36185 | {auto,auto,zlib}
>   1072606 | {sigbits,sigbits,sigbits}
>   1560073 | {uncompressed}   <------ UNCOMPRESSED size
>   1563148 | {rle,rle,rle}
> 
> Interestingly enough, "zlib" results in better compression than
> both "sigbits" and "rle", and we're supposedly talking about their
> best-case performance (only 2 runs for rle, 15 bits out of 16 in common for
> sigbits).
> 
> It might be a particularly lucky case for zlib too, given the very regular
> pattern of the value distributions, but now I'm wondering... how would zlib
> perform on real-world datasets out there?
> 
> If you're running the code from the current master branch, you can test this
> yourself with a query like this:
> 
>   \set c mycol -- set to the name of your column
>   \set t mytab -- set to the name of your table
>   SELECT sum(pc_memsize(
>               pc_compress(:c, 'dimensional', array_to_string(
>                array_fill('zlib'::text,ARRAY[100]), ','
>               ))))::float8 /
>          sum(pc_memsize(:c))
>   FROM :t;
> 
> It will tell you how much smaller your dataset could get by compressing
> it all with the dimensional/zlib scheme.
> 
> I get 0.046 with my test dataset above (260k points per patch),
> while it is 1.01 (i.e. bigger) on a dataset where patches have 1000 points each
> and 3 dimensions out of 12 are already compressed with zlib, another 3 with
> sigbits and the remaining 6 with rle.
> 
> How about you ?
> 
> --strk;


