[pgpointcloud] RLE and SIGBITS heuristics

Sandro Santilli strk at keybit.net
Wed Apr 15 04:34:24 PDT 2015


On Thu, Apr 09, 2015 at 12:41:32PM +0200, Sandro Santilli wrote:
> Reading the code I found that SIGBITS is used when it gains
> a compression ratio of 1.6:1 while RLE is required to get 4:1
> ratio (but the comment talks about a 4:1 for both).
> 
> How were the ratios decided ?
> 
> https://github.com/pgpointcloud/pointcloud/blob/v0.1.0/lib/pc_dimstats.c#L121-L137

As an experiment, I created a patch containing 260,000 points, organized
so that the data in each dimension is laid out to make the heuristic pick
a different one of the 3 encodings (a sketch for generating such a patch
in SQL follows the list):

 All dimensions are of type int16_t
 - 1st dimension alternates values 0 and 1
 - 2nd dimension has value -32768 for the first 130k points, then 32767
 - 3rd dimension alternates values -32768 and 32767
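
For reference, such a patch can be built with a query along these lines
(just a sketch: it assumes a dimensional schema registered under pcid 1
whose three dimensions are all int16_t, matching the layout above):

  SELECT PC_Patch(PC_MakePoint(1, ARRAY[
           (i % 2)::float8,                                 -- 1st dim: alternates 0 and 1
           CASE WHEN i < 130000 THEN -32768 ELSE 32767 END, -- 2nd dim: two long runs
           CASE WHEN i % 2 = 0 THEN -32768 ELSE 32767 END   -- 3rd dim: alternates extremes
         ]))
  FROM generate_series(0, 259999) i;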

Then I checked the size of the patch after applying different compression
schemes, and here's what I got:

   size   |          compression
 ---------+---------------------------
     1680 | {zlib,zlib,zlib}
     4209 | {zlib,auto,auto} <-- zlib much better than sigbits !
    33656 | {auto,zlib,auto} <-- zlib better than rle
    36185 | {auto,auto,auto} <----- DEFAULT, effectively {sigbits,rle,zlib}
    36185 | {auto,auto,zlib}
  1072606 | {sigbits,sigbits,sigbits}
  1560073 | {uncompressed}   <------ UNCOMPRESSED size
  1563148 | {rle,rle,rle}
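
For the record, each of the dimensional rows above is simply PC_MemSize
applied to a recompressed copy of the patch, along these lines (a sketch;
mypa and mytesttab are placeholder names for the test patch column and
table):

  SELECT PC_MemSize(PC_Compress(mypa, 'dimensional', 'zlib,auto,auto'))
  FROM mytesttab;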

Interestingly enough, "zlib" results in better compression than both
"sigbits" and "rle", even though the input is supposedly near-ideal for
each of them (only 2 runs for rle, 15 bits out of 16 in common for
sigbits).

It might also be a particularly lucky case for zlib, given the very regular
value distribution, but now I'm wondering... how would zlib perform on
real-world datasets out there?

If you're running the code from the current master branch, you can test
this yourself with a query like this:

  -- set these to the names of your column and table
  \set c mycol
  \set t mytab
  SELECT sum(pc_memsize(
              pc_compress(:c, 'dimensional', array_to_string(
               array_fill('zlib'::text,ARRAY[100]), ','
              ))))::float8 / 
         sum(pc_memsize(:c))
  FROM :t;

It will tell you how much smaller your dataset could get by compressing
it entirely with the dimensional/zlib scheme (the array_fill call just
builds a 'zlib,zlib,...' list long enough to cover any number of
dimensions).

I get 0.046 with my test dataset above (260k points per patch),
while I get 1.01 (slightly bigger) on a dataset where patches have 1000
points each and 3 dimensions out of 12 are already compressed with zlib,
another 3 with sigbits, and the remaining 6 with rle.
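
(If you wonder how to check which compression each dimension ended up
with: PC_Summary prints a JSON description of a patch, including the
per-dimension compression, assuming your build is recent enough to
include it. Reusing the psql variables set above:)

  SELECT PC_Summary(:c) FROM :t LIMIT 1;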

How about you ?

--strk;
