[pdal] questions about iterator and buffer semantics

Mon Jul 1 19:58:55 PDT 2013

On 1 July 2013 23:22, Howard Butler <hobu.inc at gmail.com> wrote:
>
> On Jun 30, 2013, at 7:54 PM, Chris Foster <chris.foster at roames.com.au> wrote:
>
>> On 29 June 2013 16:16, Chris Foster <chris.foster at roames.com.au> wrote:
>>> I've submitted a pull request which fixes the problem
>>> (https://github.com/PDAL/PDAL/pull/146)
>
> I want to note that the mosaic filter has not had much exercise, and there
> are very likely situations where it doesn't quite behave how we'd want. PDAL
> is still very much a work in progress, and there are plenty of pointy bits
> all about. I welcome any help you wish to contribute to file down those
> pointy bits into something less dangerous. While we/I try to make PDAL
> generic as possible, my group uses it for some very specific translation
> operations, and these are going to be more optimized and understood than
> many of the other possible combinations that are possible.

That's fine, some pieces will naturally be more battle hardened than others,
especially in a relatively new library like PDAL.

>> On further testing, I've found that disabling the dimension caching between
>> calls to read() does have a measurable performance impact in some cases.  When
>> reading an uncompressed las file in small chunks (buffer size of 1000) the
>> patch above causes a 50% performance degradation or so.
>
> One thing I want implement is a mechanism for a Stage to advertise whether
> changes the PointBuffer's schema. This would allow pipelines that are all
> simple reads to cache their dimension positions once.

Ok.  They would have to cache the dimensions by value though, unlike
the las reader which stores pointers: There's no way to know when the
PointBuffer from a previous call to read() may be deleted, which leads to
stale pointers to Dimensions even if the next PointBuffer has identical
layout.  This is what caused the crash in the Mosaic case.

Having a look at the way getDimension() is implemented I can see several
things which are convenient but probably make it go much slower than
necessary, so I think a lot of performance can be clawed back by simply
optimizing getDimension() and getDimensionOptional().  Things which stick out
to me:

* Exceptions are thrown as a matter of course (even in getDimensionOptional,
  where they are caught and used to return an empty optional).  I'm very
  suspicious of any code which uses exceptions as part of normal flow control.

* _Lots_ of memory management traffic happens behind the scenes:
  std::ostringstream is used to format an error message as a matter of course
  (rather than only in error situations), several std::string objects are
  created, etc.

* Complicated data structures are allocated for the dimension inheritance
  stuff.  I don't know how this works yet so I don't know whether it could be
  faster.

Ideally getDimension() can be made zero allocation, and getDimensionOptional
should never cause an exception to be thrown and gobbled behind the scenes.
I'll have a look at how to improve the performance there if I get a chance.

Side note: I expect it's too late to change now, but what benefit do we get by
making getDimensionOptional() return a boost::optional rather than a pointer?
It strikes me as very slightly safer but less convenient when you end up
caching raw pointers in places like the las reader anyway.

>> This is a bit surprising as it indicates the cost of dimension lookup in the
>> schema is rather large, unless there's something else I'm missing.  I'll have
>> to do some additional performance testing to try to get to the bottom of this.
>
> The cost of dimension lookup per-point would be prohibitive. It is
> essentially a std::map (actually boost::multi_array) traverse to find a
> dimension.

Definitely.  What I'm noting is that the cost is *still* noticeable
even when amortized over buffer sizes of 1000 points.  Something must be
suboptimal!

~Chris