[pdal] PDAL Oracle small benchmark

Oscar Martinez Rubi o.martinezrubi at tudelft.nl
Mon Apr 20 01:47:57 PDT 2015


Hi,

On 17-04-15 19:16, Howard Butler wrote:
>> On Apr 17, 2015, at 11:33 AM, Oscar Martinez Rubi <o.martinezrubi at tudelft.nl> wrote:
>>
>> Hi,
>>
>> After the latest fixes (thanks Howard, Connor, Andrew and the rest of the PDAL guys!) in the OCI writer and reader, and having found out about laz-perf, I have done a new small test with PDAL and Oracle to see how the two systems behave with different configurations. I tried with/without laz-perf, with point/dimension major orientation, with all columns or only xyz, with/without BLOB compression, and with/without offsets and scales (i.e. 64 or 32 bits per coordinate).
>>
>> There are 32 combinations, but since lazperf requires dimension major, there are "only" 24 valid combinations.
> IIRC, lazperf is point major only. Maybe we should tweak the PDAL writer to error if the user tries to set dimension major for the storage orientation and use lazperf as the compression.
I think PDAL already shows that error message, so it is fine!
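
For reference, the test matrix can be enumerated like this (a minimal
Python sketch; the switch names are mine, not PDAL option names):

    from itertools import product

    # Five boolean switches -> 2**5 = 32 raw combinations.
    switches = ['lazperf', 'dim_major', 'all_cols', 'blob_compression', 'offsets']

    valid = []
    for combo in product([False, True], repeat=len(switches)):
        cfg = dict(zip(switches, combo))
        # lazperf only supports one storage orientation (see Howard's
        # note above), so these 8 combinations are excluded.
        if cfg['lazperf'] and cfg['dim_major']:
            continue
        valid.append(cfg)

    print(len(valid))  # 24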
>
>> For each one I have loaded a single LAS file with 20M points (380 MB as LAS, 38 MB as LAZ) with the PDAL OCI writer, and I have queried a small rectangle (80000 points) with the PDAL OCI reader. I have run the query twice.
>>
>> From the attached table it is clear that using lazperf is very good in terms of size (almost a factor of 2 compared to LAZ), loading time and query time! The best approach is to use lazperf with scales and offsets and without BLOB compression.
> This confirms our performance characteristics with lazperf and Oracle as well. BLOB/securefile compression has a cost, and the win to be had is to push less i/o across the database, no matter how you do it. I’m curious as to why lazperf was 2x smaller than LAZ. Is this because you are only writing XYZ dimensions? LAZ will always be more efficient than the current lazperf implementation in PDAL. This is because LAZ compresses fields together, and our naive lazperf implementation treats them individually. It is the same encoder/decoder, but as a free-form schema, it seems hazardous to try to model the behavior of the fields. LAZ being mostly lidar gets that luxury.
As Peter pointed out, I meant that lazperf with PDAL/Oracle requires 82 MB 
while the LAZ file was 37 MB, so very decent in my opinion! I tested 
writing both all the columns and only xyz, and in both cases it gives 
82 MB. As Peter also suggested, that may be because this data only has 
meaningful values in xyz; the rest of the fields hold the default value 
(so I guess the lazperf compression can compress them perfectly). From 
your message I do not get whether you are saying that we should be able 
to squeeze it even more?
>> Regarding the loading times:
>> - Adding BLOB compression generally makes the loading slower.
> More CPU usage
>
>> - Using all columns instead of only xyz also makes it slower.
> Less i/o
>
>> - Adding offsets makes it faster.
> Data are quantized from 64-bit doubles to 32-bit integers.
>
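
To illustrate the quantization Howard describes, a minimal sketch (the
scale and offset values here are just examples, not the ones PDAL derives):

    import numpy as np

    # LAS-style convention: a 64-bit double coordinate is stored as a
    # 32-bit integer count of 'scale' units away from 'offset'.
    scale, offset = 0.01, 85000.0

    x = np.array([85123.4567, 85123.4589], dtype=np.float64)  # 8 bytes each
    xi = np.round((x - offset) / scale).astype(np.int32)      # 4 bytes each

    x_back = xi * scale + offset  # recovered to within half a scale unit
    print(xi, x_back)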
>> - Using lazperf does not seem to have a visible effect on loading time.
>> - The point/dimension major orientation does not have any visible effect.
> As mentioned above, there are some combinations that probably shouldn’t be allowed.
>
>> Regarding the queries, the only difference seems to be whether all columns or only xyz are used. In general, the queries are faster with only xyz instead of all columns.
>>
>> Regarding the size/storage, there are some strange issues in the numbers of the table:
>>
>> - When lazperf is used together with offsets, it does not matter in storage terms whether I specify BLOB compression or not (I guess BLOB compression just cannot squeeze the data any further, right? Or maybe it is somehow ignored?)
> I assume this is true. lazperf assumes the periodicity of the data (because we model it using the schema), and the fields are compressed individually without regard to each other. This removes a lot of easy-to-compress bit duplication from the data stream.
>
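
That would also explain why recompressing the blobs gains nothing:
entropy-coded output has little redundancy left for a second compressor
to find. A toy demonstration (zlib standing in for the BLOB compressor,
random bytes standing in for lazperf output):

    import os
    import zlib

    raw = os.urandom(1_000_000)   # stands in for entropy-coded lazperf data
    once = zlib.compress(raw, 9)  # "BLOB compression" on top of it
    twice = zlib.compress(once, 9)

    # Both compressed sizes come out at (or slightly above) the input
    # size: there is nothing left to squeeze.
    print(len(raw), len(once), len(twice))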
>> - When lazperf is used and offsets are not used, BLOB compression actually increases the size (it seems like BLOB compression messes up what lazperf did, which is strange though…)
> lazperf with doubles is not as efficient as it is with int32_t’s.
>
>> - When lazperf is used it does not matter in storage terms whether I specify only x,y,z or all columns. Why is this?
> The other dimensions compress very well with a differential encoder because they don’t change very fast.
>
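
A toy illustration of why constant or slowly-changing fields cost almost
nothing under a differential encoder (plain Python/numpy, not the actual
lazperf code):

    import numpy as np

    intensity = np.zeros(1_000_000, dtype=np.uint16)             # constant default
    returns = np.repeat(np.arange(100, dtype=np.uint8), 10_000)  # slowly changing

    # A delta encoder stores value[i] - value[i-1]: a constant field
    # becomes all zeros and a slowly-changing one becomes long runs of
    # zeros with rare nonzero deltas -- both nearly free to encode.
    print(np.count_nonzero(np.diff(intensity)))  # 0
    print(np.count_nonzero(np.diff(returns)))    # 99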
>> - When lazperf is not used, the difference in size between point and dimension orientation is only visible when using all the columns, BLOB compression and offsets
> Dimension major orientation puts a bunch of similar values next to each other in the stream, and BLOB compression can then easily squeeze them. In point major orientation, there are few runs to get.
>
>> - When lazperf is not used and only xyz are used, BLOB compression offers the same compression factor as using offsets (both give 278 MB), but in twice the time. Combining both gives 212 MB.
> Can you explain this one a bit more? I’m confused a bit by the wording.
I meant that if you compare these rows:

lazperf  dimOri  Cols  BlobCom  Offsets  LTime (s)  Size (MB)  QTime1 (s)  QTime2 (s)
False    True    xyz   True     True     46.09      212        0.39        0.36
False    True    xyz   False    True     28.53      278        0.38        0.42
False    True    xyz   True     False    57.39      278        0.39        0.35

It is curious to see that using the offsets without BLOB compression 
(second row) gives 278 MB, which is the same size as using BLOB 
compression without offsets (third row), but the latter takes 57 s while 
the former takes only 28 s. If you combine both, i.e. BLOB compression 
and offsets, then we squeeze a bit more, down to 212 MB... Anyway, this 
is not important, it was just a comment...

>> - When lazperf is not used and all columns are used, BLOB compression gives a better compression factor than using offsets (343 MB vs 538 MB).
>>
>> - If BLOB compression is used and offsets are not used, the lazperf compression adds nothing. In fact, in this case with only xyz, using lazperf is actually worse than not using it
>> (which makes me think that in lazperf all the columns are stored even if you do not want them).
> This might be true. We should look at the code to confirm.
After realizing that in the data I used, all the columns that are not 
xyz always hold the default values... maybe this is not the case, but I 
guess it is good to check anyway.
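
One quick way to check it (a sketch using the laspy package; the file
name is a placeholder and the dimension names assume a typical lidar
point format):

    import laspy
    import numpy as np

    las = laspy.read("input_20M.las")  # placeholder file name

    # If every non-xyz dimension is constant, lazperf has essentially
    # nothing to store for them, which would explain the identical sizes.
    for dim in ("intensity", "return_number", "classification", "gps_time"):
        values = np.asarray(las[dim])
        print(dim, "constant" if np.all(values == values[0]) else "varies")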
>> I am aware that estimating the size in Oracle can be tricky. I sum the size of the user segments related to my "blocks" table, i.e.:
>>
>> SELECT sum(bytes/1024/1024) size_in_MB
>> FROM user_segments
>> WHERE (segment_name LIKE 'blocks%'
>>    OR segment_name IN (
>>       SELECT segment_name
>>       FROM user_lobs
>>       WHERE table_name LIKE 'blocks%'
>>       UNION
>>       SELECT index_name
>>       FROM user_lobs
>>       WHERE table_name LIKE 'blocks%'
>>       )
>> );
>>
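
For anyone who wants to script the query above, something like this
works with the cx_Oracle package (credentials, DSN and the table name
pattern are placeholders):

    import cx_Oracle

    SIZE_SQL = """
    SELECT sum(bytes/1024/1024) size_in_mb
    FROM user_segments
    WHERE (segment_name LIKE 'blocks%'
       OR segment_name IN (
          SELECT segment_name FROM user_lobs WHERE table_name LIKE 'blocks%'
          UNION
          SELECT index_name FROM user_lobs WHERE table_name LIKE 'blocks%'
          )
    )
    """

    connection = cx_Oracle.connect("user/password@host:1521/service")  # placeholder
    cursor = connection.cursor()
    cursor.execute(SIZE_SQL)
    size_mb, = cursor.fetchone()
    print("blocks segments use", size_mb, "MB")
    connection.close()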
>> Kind Regards,
> Thanks for the great report!
Thanks!

O.
>
> Howard


