[Liblas-devel] Cached Reader bottleneck

Mateusz Loskot mateusz at loskot.net
Wed Oct 20 20:45:35 EDT 2010


On 20/10/10 23:20, Gary Huber wrote:
> Howard,
> 
> I've run into a problem trying to read a billion-point LAS file to test
> the index. It looks like I run out of memory in the
> CachedReaderImpl::m_mask while it is adding vector members trying to
> reach the number of points in the file. It's in the loop on line 88 of 
> cachedreader.cpp.

Gary,

I think I've found where the problem is, and I hope I've fixed it.

http://trac.liblas.org/changeset/5a272a57945c3e2383a63b87cd9356f1e402e2f6

Your use case allocates a mask of about 700 million bytes.
Now, if my finding is correct and the mask was sometimes incorrectly
doubled, that makes about 1400 million bytes.

An allocation failure does not surprise me here, as it requests a
contiguous memory block of ~1.5 GB for the mask array alone.
I'm pretty sure the failure happens due to heap fragmentation.

Anyway, as I've said, I believe there was a bug, and it has been fixed.
I'd appreciate it if you could rebuild and check (I don't have such a
huge LAS file to test with).

> Is there a way to override point caching or will I just hit a similar
> wall somewhere else if I do?

Practically speaking, the reader implementation is pluggable.
The cached reader is used by default, but it would be possible to add a
LASReader constructor that allows selecting the strategy at run time.

In the meantime, you can try to replace this line:

http://trac.liblas.org/browser/src/lasreader.cpp#L66

with:

m_pimpl(new detail::ReaderImpl(ifs)),

which should plug in the non-caching reader.

You will need to rebuild, of course.
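
As for selecting the strategy at run time, here is a rough, purely
hypothetical sketch of what such a constructor could look like (the class
names and bodies below are made up for illustration; this is not the
actual lasreader.cpp code):

#include <istream>

namespace detail {
    struct ReaderBase
    {
        virtual ~ReaderBase() {}
        // ReadNextPoint(), ReadPointAt(), etc. would live here
    };
    struct ReaderImpl : ReaderBase          // plain, non-caching reader
    {
        explicit ReaderImpl(std::istream&) {}
    };
    struct CachedReaderImpl : ReaderBase    // reader that caches points
    {
        explicit CachedReaderImpl(std::istream&) {}
    };
}

class LASReader
{
public:
    enum Strategy { ePlain, eCached };

    // The extra parameter selects the implementation; defaults to cached.
    explicit LASReader(std::istream& ifs, Strategy s = eCached)
        : m_pimpl(s == eCached
              ? static_cast<detail::ReaderBase*>(new detail::CachedReaderImpl(ifs))
              : static_cast<detail::ReaderBase*>(new detail::ReaderImpl(ifs)))
    {}

    ~LASReader() { delete m_pimpl; }

private:
    LASReader(LASReader const&);              // non-copyable
    LASReader& operator=(LASReader const&);

    detail::ReaderBase* m_pimpl;
};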

> I wonder if it wouldn't be faster and take
> less memory to resize that array once and then set the values in a loop.
> Having to realloc that array and copy repeatedly as it gets larger might
> not be a good way to go. I'm stalling after adding 689 million array
> elements. It seems to take a long time to hit that error and is probably
> taking twice the memory it needs to.

Eventually, this array is going to grow as large as the total point count
anyway, so I doubt the stepwise growth is the real source of the problem.
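
For reference, the two filling strategies you're comparing are roughly
these (a hypothetical sketch, not the actual m_mask code from
cachedreader.cpp):

#include <cstddef>
#include <vector>

// Growing element by element: each time capacity is exhausted the vector
// reallocates, and during the reallocation the old and the new buffer
// exist at the same time, so the transient peak can approach twice the
// final size, plus all the copying.
void fill_mask_push_back(std::vector<unsigned char>& mask, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        mask.push_back(0);
}

// Sizing once: a single contiguous allocation, values set in place.
void fill_mask_resize(std::vector<unsigned char>& mask, std::size_t n)
{
    mask.resize(n, 0);
}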

Let's see if my fix solves it, at least temporarily.

It is obvious to me that, for such large datasets, caching is
pointless due to memory constraints - it's not really feasible to find
a contiguous memory block of 2-3 GB :-)

An alternative solution could be to partition the cache: instead of 1 array,
manage it as N arrays (all of equal size) and compute the global mask index as:

(index of array * size of array) + index of mask in array

This would allow some degree of random access.
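
A very rough, hypothetical sketch of that idea (names made up; this is not
proposed code for cachedreader.cpp):

#include <cstddef>
#include <vector>

// Partitioned mask: N fixed-size chunks instead of one huge contiguous
// vector, so no single allocation has to be multiple GB.
class PartitionedMask
{
public:
    PartitionedMask(std::size_t total, std::size_t chunk_size)
        : m_chunk_size(chunk_size),
          m_chunks((total + chunk_size - 1) / chunk_size)
    {
        for (std::size_t i = 0; i < m_chunks.size(); ++i)
            m_chunks[i].resize(chunk_size, 0);
    }

    // global index = (index of array * size of array) + index within array,
    // so the pair is recovered with division and remainder.
    unsigned char& at(std::size_t global)
    {
        return m_chunks[global / m_chunk_size][global % m_chunk_size];
    }

private:
    std::size_t m_chunk_size;
    std::vector< std::vector<unsigned char> > m_chunks;
};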


> I made the change here and like I thought, much faster and I don't hit
> my memory limit. This would seem to be a way to speed up the reading of
> any large LAS file if you want me to check it in so you can look at it.

Yes, it's obvious that caching makes the engine run slower, but it has
some benefits... you don't have to read records from disk more than once.


Anyway, I'd appreciate it if you could check my last commit.
For now, I'm interested in confirming whether there was also a bug involved.


Best regards,
-- 
Mateusz Loskot, http://mateusz.loskot.net
Charter Member of OSGeo, http://osgeo.org
Member of ACCU, http://accu.org
