[pdal] writers.gdal: median
Jim Klassen
klassen.js at gmail.com
Mon Jan 31 08:06:32 PST 2022
Thanks, nth_element does speed it up. The 18MB laz file that uses 1GB of memory for this completed in 9.9 seconds on my machine so CPU use wasn't my main concern. But, switching to nth_element drops that to 9.4 seconds. This is for 967311 cells (raster pixels with data) and 63269486 total points (across all cells), so on average sort is looking at 65 pts/cell. When I tried mean instead of median, it takes 7.5 seconds, so the median code with sort adds 2.4 seconds and the median code with nth_element adds 1.9 seconds. So nth_element is significantly faster than sort for that block, but it doesn't make a huge difference overall in performance. I suspect most of the time is spent in laz decompression and/or output image compression.
However, memory use is by far my main concern.
A hack to save half the memory would be to use std::vector<float> instead of std::vector<double>. The data I am working with is fine with 6 decimal digits of vertical precision (1cm at 1000m). I'd might feel differently about that if I was working in mountainous areas.
On the memory use front all the bounded memory median algorithms I have found either produce estimates of the median which can be wildly off for worst case datasets or follow a guess and check strategy where they only store a certain number of the values around the guessed median and only keep track of the counts of values that fall above and below the stored array. At the end the hope is the median is one of the numbers that was stored, but if it isn't then a new guess is made and it is restarted with another pass though the data. That is likely slower but more memory efficient than the naive implementation which potentially stores the same point multiple times (one point can be part of multiple cells depending on the radius chosen), but to support allowing additional passes through the points the stage couldn't be streamable anymore. And it seems to me that if the whole point cloud is in memory anyway (because the stage becomes non-streamable) that something more
efficient could be constructed by sorting the whole point cloud first than using one of the bounded memory median algorithms. Of course, writers.gdal should clearly remain streamable for the current use cases, so this would involve making this into a separate stage.
On 1/28/22 23:03, Charles Karney wrote:
> To find the median use nth_element, O(n), instead of sorting, O(n*log(n)).
>
> On 1/28/22 19:23, Jim Klassen wrote:
>> Is there any interest in adding a median (and possibly Q1 and Q3) statistic to writers.gdal?
>>
>> I was curious how median would look so I made a naive implementation based on storing each point that applies to each cell using Raster<std::vector<double> > instead of Raster<double> and then sorting and then picking the middle (or mean of the two middle) elements. It appears to work correctly as far as I can tell, but there is a big caveat in that memory usage easily becomes enormous (a 18MB LAZ required about 1GB of RAM to process with median that only took about 70MB with mean). Since with the default radius is resolution * sqrt(2), points by default are counted in multiple pixel cells, so it can need to store significantly more "Z" values (across all cells) than were in the original point cloud. The existing statistics where memory usage scales by # of bands written and number of pixels, this also scales with the number of times a point is considered for a cell (and so number of points, radius, and resolution).
>>
>> I'm not sure this memory limitation would be easy to document clearly and I presume this is why median isn't already implemented. I certainly would not include it by default in the "all" mode.
>>
>> There may be ways to make this more memory friendly if multiple passes through the point cloud would be allowed, but this is counter to how the existing writers.gdal stage is structured.
>>
>> Is there a better way to go about implementing this?
>>
>> https://github.com/klassenjs/PDAL/commit/a4786ff8aa0063ca531eea7dc58a3288516c768e
>> _______________________________________________
>> pdal mailing list
>> pdal at lists.osgeo.org
>> https://lists.osgeo.org/mailman/listinfo/pdal
>
More information about the pdal
mailing list