[pgpointcloud] RLE and SIGBITS heuristics
Sandro Santilli
strk at keybit.net
Wed Apr 15 04:34:24 PDT 2015
On Thu, Apr 09, 2015 at 12:41:32PM +0200, Sandro Santilli wrote:
> Reading the code I found that SIGBITS is used when it gains
> a compression ratio of 1.6:1, while RLE is required to get a 4:1
> ratio (but the comments talk about 4:1 for both).
>
> How were the ratios decided ?
>
> https://github.com/pgpointcloud/pointcloud/blob/v0.1.0/lib/pc_dimstats.c#L121-L137
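To make the question concrete, here is a plain-SQL paraphrase of how I read
that per-dimension selection (the real logic is in C; the stat names and the
check order below are my guesses, and I'm assuming sigbits only has to store
the non-common bits of each 16-bit word):

SELECT CASE
         WHEN avg_run_length >= 4.0                 THEN 'rle'     -- needs ~4:1 expected gain
         WHEN 16.0 / (16.0 - bits_in_common) >= 1.6 THEN 'sigbits' -- needs ~1.6:1 expected gain
         ELSE 'zlib'
       END AS recommended_compression
FROM (VALUES (1.0, 15.0)) AS dim_stats(avg_run_length, bits_in_common);

With the sample stats in the VALUES list (run length 1, 15 common bits) it
picks sigbits, which is what 'auto' effectively does for the 1st dimension
of the experiment below.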
As an experiment, I created a patch containing 260,000 points, organized
so that the data in each dimension is laid out in a way that makes the
heuristic pick a different one of the 3 encodings (a sketch of how such a
patch can be built follows the list):
All dimensions are of type int16_t
- 1st dimension alternates values 0 and 1
- 2nd dimension has value -32768 for the first 130k points, then 32767
- 3rd dimension alternates values -32768 and 32767
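For the record, this is roughly how such a patch can be built in SQL,
assuming pcid 1 is registered in pointcloud_formats with exactly those
3 int16_t dimensions (the pcid and schema are assumptions, the layout is
the one described above):

SELECT PC_Patch(
         PC_MakePoint(1, ARRAY[
           n % 2,                                           -- 1st dim: 0,1,0,1,...
           CASE WHEN n < 130000 THEN -32768 ELSE 32767 END, -- 2nd dim: low half, then high half
           CASE WHEN n % 2 = 0 THEN -32768 ELSE 32767 END   -- 3rd dim: alternating extremes
         ]::float8[])
         ORDER BY n)
FROM generate_series(0, 259999) AS n;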
Then I checked the size of the patch after applying different compression
schemes, and here's what I got:
size | compression
---------+---------------------------
1680 | {zlib,zlib,zlib}
4209 | {zlib,auto,auto} <-- zlib much better than sigbits !
33656 | {auto,zlib,auto} <-- zlib better than rle
36185 | {auto,auto,auto} <----- DEFAULT, effectively {sigbits,rle,zlib}
36185 | {auto,auto,zlib}
1072606 | {sigbits,sigbits,sigbits}
1560073 | {uncompressed} <------ UNCOMPRESSED size
1563148 | {rle,rle,rle}
Interestingly enough, "zlib" results in better compression than both
"sigbits" and "rle", and we're supposedly looking at their best cases
(only 2 runs for rle, 15 bits out of 16 in common for sigbits).
It might be a particularly lucky case for zlib too, given the very regular
pattern of the value distribution, but now I'm wondering... how would zlib
perform on real-world datasets out there ?
If you're running the code from the current master branch you can test this
yourself with a query like the following, after setting :c to the name of
your column and :t to the name of your table:

\set c mycol
\set t mytab
SELECT sum(pc_memsize(pc_compress(:c, 'dimensional',
         array_to_string(array_fill('zlib'::text, ARRAY[100]), ','))))::float8
       / sum(pc_memsize(:c))
FROM :t;
It will tell you how much smaller your dataset could get by compressing
it all with the dimensional/zlib scheme.
I get 0.046 with my test dataset above (the 260k-point patch), while it is
1.01 (slightly bigger) on a dataset where patches have 1000 points each and
3 dimensions out of 12 are already compressed with zlib, another 3 with
sigbits, and the remaining 6 with rle.
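If you want to see where the gain (or loss) comes from, a per-patch variant
of the same measure can help; "id" here stands for whatever key your table
has (an assumption, adjust as needed):

SELECT id,
       pc_memsize(pc_compress(:c, 'dimensional',
         array_to_string(array_fill('zlib'::text, ARRAY[100]), ',')))::float8
       / pc_memsize(:c) AS zlib_ratio
FROM :t
ORDER BY zlib_ratio
LIMIT 10;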
How about you ?
--strk;