[pdal] PDAL Oracle small benchmark

Howard Butler howard at hobu.co
Fri Apr 17 10:16:38 PDT 2015


> On Apr 17, 2015, at 11:33 AM, Oscar Martinez Rubi <o.martinezrubi at tudelft.nl> wrote:
> 
> Hi,
> 
> After the latest fixes (thanks Howard, Connor, Andrew and the rest of the PDAL guys!) in the OCI writer and reader, and after finding out about laz-perf, I have done this new small test with PDAL and Oracle to see how the two systems behave with different configurations. I tried with/without laz-perf, with point/dimension major, with all columns or only xyz, with/without BLOB compression, and with/without offsets and scales (i.e. 64 or 32 bits per coordinate).
> 
> There are 32 combinations, but since laz-perf requires dimension major, there are "only" 24 valid combinations.

IIRC, lazperf is point major only. Maybe we should tweak the PDAL writer to error out if the user sets dimension major for the storage orientation while using lazperf as the compression.
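A sketch of the kind of writer-side guard this would be. The option names here are hypothetical and only loosely mirror PDAL's writer options; this is not actual PDAL code:

```python
# Hypothetical sketch of a writer-side validation guard; option names are
# illustrative, not PDAL's real option spellings.
def validate_storage_options(orientation: str, compression: str) -> None:
    """Reject option combinations that cannot work together."""
    if compression == "lazperf" and orientation == "dimension":
        raise ValueError(
            "lazperf compression requires point-major orientation; "
            "use orientation='point' or disable lazperf"
        )

validate_storage_options("point", "lazperf")  # fine
try:
    validate_storage_options("dimension", "lazperf")
except ValueError as err:
    print(err)
```

Erroring out at option-parsing time would turn the 8 invalid combinations into immediate failures instead of silently producing data in an unexpected layout.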

> For each one I have loaded a single LAS file with 20M points (size 380 MB, 38MB in LAZ) with the PDAL OCI writer and I have queried a small rectangle (80000 points) with PDAL OCI reader. I have done the query twice.
> 
> From the attached table it is clear that using laz-perf is very good in terms of size (almost a factor 2 compared to LAZ), loading time and query time! The best approach is to use laz-perf with scales and offsets and without BLOB compression.

This confirms the performance characteristics we have seen with lazperf and Oracle as well. BLOB/securefile compression has a cost, and the win to be had is to push less I/O through the database, no matter how you do it. I’m curious as to why lazperf was 2x smaller than LAZ. Is this because you are only writing XYZ dimensions? LAZ will always be more efficient than the current lazperf implementation in PDAL. This is because LAZ compresses fields together, and our naive lazperf implementation treats them individually. It is the same encoder/decoder, but with a free-form schema it seems hazardous to try to model the behavior of the fields. LAZ, being mostly lidar, gets that luxury.
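The core idea both coders share is differential (delta) encoding of each dimension. A minimal sketch, omitting the arithmetic coding and the cross-field context modeling that the real LAZ codec adds:

```python
# Toy delta coder: store the difference from the previous value instead of
# the value itself. Coordinates that change slowly produce small residuals
# that an entropy coder can pack into very few bits.
def delta_encode(values):
    prev = 0
    out = []
    for v in values:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    total = 0
    out = []
    for d in deltas:
        total += d
        out.append(total)
    return out

xs = [500000, 500012, 500025, 500037, 500049]  # scaled int32 X coordinates
deltas = delta_encode(xs)
print(deltas)  # [500000, 12, 13, 12, 12] -- small after the first value
assert delta_decode(deltas) == xs
```

The per-field variant compresses each dimension's residual stream on its own; LAZ additionally models how the lidar fields relate to one another, which is why it usually wins on full point records.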

> 
> Regarding the loading times:
> - Adding BLOB compression generally makes the loading slower.

More CPU usage

> - Using all columns instead of only xyz also makes it slower.

Only xyz means less I/O.

> - Adding offsets makes it faster.

Data are quantized from 64 bit doubles to 32 bit integers.
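A sketch of that scale/offset quantization: each 64-bit double becomes a 32-bit integer count of scale units from the offset. The scale and offset values here are illustrative, not taken from the benchmark data:

```python
# Scaled-integer quantization: halves the bytes per coordinate at the cost
# of precision beyond the chosen scale.
SCALE = 0.01        # centimeter resolution (illustrative)
OFFSET = 500000.0   # illustrative offset near the data

def quantize(x: float) -> int:
    q = round((x - OFFSET) / SCALE)
    assert -2**31 <= q < 2**31, "value out of int32 range"
    return q

def dequantize(q: int) -> float:
    return q * SCALE + OFFSET

x = 500123.4567
q = quantize(x)
print(q, round(dequantize(q), 2))  # 12346 500123.46 -- lossy past the scale
```

Besides halving the raw size, the small int32 values also delta-encode far better than raw doubles, which is why the offsets help lazperf too.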

> - Using lazperf does not seem to have a visible effect on loading time.
> - The point/dimension major choice does not have any visible effect

As mentioned above, there are some combinations that probably shouldn’t be allowed.

> Regarding the queries, the only visible difference seems to be whether all columns or only xyz are used. In general, having only xyz instead of all columns makes the queries faster.
> 
> Regarding the size/storage, there are some strange issues in the numbers of the table:
> 
> - When lazperf and offsets are both used, it does not matter in storage terms whether I specify BLOB compression or not (I guess BLOB compression just cannot squeeze the data any more, right? Or maybe it is somehow ignored?)

I assume this is true. lazperf exploits the regularity of the data (because we model it using the schema), and the fields are compressed individually without regard to each other. This removes a lot of easy-to-compress bit duplication from the data stream before the BLOB compressor ever sees it.
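A toy demonstration of the general effect, with zlib standing in for both codecs (not the actual lazperf or securefile algorithms): a second compression pass over already-compressed data gains essentially nothing, because the first pass left a near-random stream.

```python
import random
import zlib

random.seed(42)
# Redundant-ish data: only 16 distinct byte values, so plenty to compress.
raw = bytes(random.randrange(16) for _ in range(100_000))

once = zlib.compress(raw)    # first pass removes the easy redundancy
twice = zlib.compress(once)  # second pass finds nothing left to squeeze
print(len(raw), len(once), len(twice))
# The second pass typically ADDS a few bytes of container overhead.
```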

> 
> - When lazperf is used and offsets are not used the BLOB compression actually increases the size (Seems like the BLOB compression messes up what lazperf did, strange though…)

lazperf with doubles is not as efficient as it is with int32_t’s.

> - When lazperf is used it does not matter in storage terms whether I specify only x,y,z or all columns. Why is this?

The other dimensions compress very well with a differential encoder because they don’t change very fast.
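A sketch of why a slowly-changing dimension costs almost nothing under a differential coder, using an invented classification column and zlib as a stand-in entropy coder:

```python
import struct
import zlib

# 20k points, two classification runs -- typical of a small lidar block.
classification = [2] * 19000 + [5] * 1000

# Delta-encode: almost every residual is zero.
deltas = [classification[0]] + [
    b - a for a, b in zip(classification, classification[1:])
]
packed = b"".join(struct.pack("<i", d) for d in deltas)
print(len(packed), len(zlib.compress(packed)))  # 80000 bytes -> well under 1 KB
```

With residual streams like this, carrying the extra columns barely moves the compressed block size, which matches the observation that xyz-only and all-columns come out the same under lazperf.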

> - When lazperf is not used, the difference in size between point and dimension orientation is only visible when using all the columns, BLOB compression and offsets

Dimension major orientation puts a bunch of similar values next to each other in the stream, and BLOB compression can then easily squeeze them. In point major orientation, there are few runs to exploit.
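A toy comparison of the two layouts, with zlib standing in for BLOB/securefile compression and invented data where X varies a lot while Z and classification are constant, as in typical lidar blocks:

```python
import random
import struct
import zlib

random.seed(1)
# (X, Z, classification): one noisy column, two constant ones.
points = [(random.randrange(1 << 24), 100, 2) for _ in range(20_000)]

# Point major: fields interleaved per point.
point_major = b"".join(struct.pack("<iii", *p) for p in points)

# Dimension major: each field stored as one contiguous column.
dim_major = b"".join(
    struct.pack("<i", p[d]) for d in range(3) for p in points
)

pm = len(zlib.compress(point_major))
dm = len(zlib.compress(dim_major))
print(len(point_major), pm, dm)  # columnar: constant fields become pure runs
```

The constant columns collapse to almost nothing in the columnar layout, while in the interleaved layout every point record re-interrupts the runs.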

> 
> - When lazperf is not used and only xyz are used, BLOB compression offers the same compression factor as using offsets, both 278MB, but in twice the time. Combining both gives 212MB.

Can you explain this one a bit more? I’m confused a bit by the wording.

> 
> - When lazperf is not used and all columns are used, BLOB compression gives a better compression factor than using offsets (343MB vs 538MB).
> 
> - If BLOB compression is used and offsets are not used, the lazperf compression adds nothing. In fact, in this case with only xyz, using lazperf is actually worse than not using it

> (which makes me think that in lazperf all the columns are stored even if you do not want them)

This might be true. We should look at the code to confirm.

> 
> I am aware that estimating the size in Oracle can be tricky. I sum the sizes of the user segments related to my "blocks" table, i.e.:
> 
> SELECT sum(bytes/1024/1024) size_in_MB
> FROM user_segments
> WHERE (segment_name LIKE 'blocks%'
>   OR segment_name IN (
>      SELECT segment_name
>      FROM user_lobs
>      WHERE table_name LIKE 'blocks%'
>      UNION
>      SELECT index_name
>      FROM user_lobs
>      WHERE table_name LIKE 'blocks%'
>      )
> );
> 
> Kind Regards,

Thanks for the great report!

Howard
