[pdal] Feedback on driver development

Howard Butler hobu.inc at gmail.com
Tue Apr 3 11:32:23 EDT 2012


Thanks for the feedback, Edgar. Comments inline...

On Apr 2, 2012, at 5:51 PM, Edgar Ellis wrote:

> Hi All,
> 
> I recently wrote a (read-only) PDAL driver using the plugin API for our proprietary format. MPG was interested in some feedback, and it seemed appropriate for the mailing list.
> 
> * Documentation
> I didn't find the limited documentation to be a problem, for the most part the source was easy to follow. A couple areas that required some digging:
> - How should a driver create UUIDs? (How unique are they?)

This is open for debate and not settled yet. The UUIDs exist on Dimensions to support the following scenario. When you pull data through a pipeline, there may be intermediate stages, like a ScalingFilter, that take an existing Dimension like 'X' and derive a new 'X' dimension with new scaling properties. An end-stage Writer down the line doesn't want to have to know which 'X' stage it should ask for (it could get one explicitly by asking the PointBuffer for drivers.caris.reader.X or filters.scaling.X). It just wants to ask for "the last X", which in this case would be filters.scaling.X. The UUIDs exist to define the parent/child relationships between the 'X' dimensions in this scenario. The name/namespace of the dimension could also be used to define this relationship, but UUIDs made the implementation easier. filters.scaling.X sets drivers.caris.reader.X as its parent when it adds filters.scaling.X to the PointBuffer as part of Iterator creation.

My intent was for a reader to declare the UUIDs of the dimensions it creates at compile-time, but filters and other things that derive dimensions would create their UUID at runtime as determined by the circumstances. This only happens in filters.scaling right now AFAIK, but there may be other filters that want to derive dimensions. 
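A minimal sketch of the parent/child idea described above (the `Dim` struct and `lastDimension` helper here are illustrative inventions, not PDAL's actual API): each derived dimension records its parent's UUID, and "the last X" is simply the X whose UUID no other X dimension claims as a parent.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Dim {
    std::string name;       // e.g. "drivers.caris.reader.X"
    std::string uuid;       // declared at compile time (readers) or runtime (filters)
    std::string parentUuid; // empty for a root dimension
};

// Among dimensions that share a base name, find the leaf of the
// parent/child chain -- the one nobody else derives from.
std::string lastDimension(const std::vector<Dim>& dims)
{
    for (const Dim& d : dims) {
        bool isParentOfAnother = false;
        for (const Dim& other : dims)
            if (other.parentUuid == d.uuid)
                isParentOfAnother = true;
        if (!isParentOfAnother)
            return d.name; // no child derives from it: "the last X"
    }
    return "";
}
```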

> - Consistency of the Schema in Iterators. Is the PointBuffer passed to the Iterator guaranteed to have the same schema as the Reader?

No, it is my opinion that the PointBuffer's schema need not have any relationship to the Stage's schema. For this reason, I'd like to move away from Stage:: having a Schema:: at all, and simply have Stage be a provider of "a list of dimensions it provides, in an order that might make access faster". Instead, the PointBuffer's Schema says "here are the dimensions I want to fill; fill them if you've got them". For all writers, this means at least some sort of 'X', 'Y', and 'Z' being available, and in the case of the LAS writer, it simply zeros out any other dimensions that the PointBuffer doesn't have. Having Stage:: not provide Schema:: would make this intent clear, in my opinion.
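The "buffer's schema drives the fill" behavior described above can be sketched roughly as follows (the `Point` alias and `fillPoint` function are hypothetical stand-ins, not PDAL classes): the stage fills whatever dimensions the buffer requests and zeros the ones it cannot supply, as the LAS writer does.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// A point as a name->value map; a real implementation would use a
// packed buffer, but the contract is the same.
using Point = std::map<std::string, double>;

// 'wanted' comes from the PointBuffer's schema; 'available' is what the
// stage actually has. Missing dimensions are zeroed rather than rejected.
Point fillPoint(const std::vector<std::string>& wanted, const Point& available)
{
    Point out;
    for (const std::string& dim : wanted) {
        auto it = available.find(dim);
        out[dim] = (it != available.end()) ? it->second : 0.0;
    }
    return out;
}
```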

> - (Un)SignedByte seems to duplicate (Un)SignedInteger with a size of 1. Easy enough to handle, but is there a preferred form?

My original intent here was in the context of description of a Dimension. Having both forms allows the person defining the Dimension to be explicit about the interpretation of the dimension. Do I treat this field as an integer? Do I treat this field as byte data?  Maybe this distinction is not needed.  I can be easily persuaded here...
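For what it's worth, the interpretation question is observable with a single byte: the same raw value reads differently depending on whether the dimension declares it signed or unsigned (the helper names below are just for illustration).

```cpp
#include <cassert>
#include <cstdint>

// Same raw byte, two interpretations: the declared type decides.
int asUnsignedByte(unsigned char raw)
{
    return raw; // 0..255
}

int asSignedByte(unsigned char raw)
{
    return static_cast<std::int8_t>(raw); // -128..127, two's complement
}
```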

> * Converting between formats
> To get pc2pc to convert from our format to las I had to make sure I exposed the position as "X", "Y", and "Z" dimensions as ints, instead of the natively stored doubles. If we add write support this would mean we wouldn't get an exact copy.

Put a scaling filter in front of the writer to create XYZ dimensions that the LAS writer can use. The scaling filter will define the precision level of the dimensions for the LAS file.  There is no way, of course, to have indefinite precision coordinates in the LAS file, so you have to specify them.  I think this property is mostly related to LAS itself and file conversion, and not PDAL's inability to provide double/float data.  I think someone using PDAL as a reader would be able to get the float/double data out without issue.

Here's a pipeline document for another proprietary format that stores its data in floats:

> <?xml version="1.0" encoding="utf-8"?>
> <Pipeline version="1.0">
>     <Writer type="drivers.las.writer">
>         <Option name="filename">
>             written-from-bpf.las
>         </Option>
>     <Filter type="filters.scaling">
>         <Option name="dimension">
>             X
>             <Options>
>                 <Option name="scale">0.00001</Option>
>                 <Option name="offset">524974</Option>
>             </Options>
>         </Option>
>         <Option name="dimension">
>             Y
>             <Options>
>                 <Option name="scale">0.00001</Option>
>                 <Option name="offset">3866496</Option>
>             </Options>
>         </Option>
>         <Reader type="drivers.bpf.reader">
>             <Option name="filename">
>                 ./data/somefile.bpf
>             </Option>
>         </Reader>
>     </Filter>
>     </Writer>
> </Pipeline>
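The round-trip the scale/offset options above imply comes from the LAS format itself: coordinates are stored as 32-bit integers, and a reader reconstructs doubles as `coordinate = record * scale + offset`. Anything smaller than the scale step (0.00001 here) is lost, which is exactly the precision choice the scaling filter makes explicit.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Writing: quantize the double coordinate to an integer record.
std::int32_t toRecord(double coord, double scale, double offset)
{
    return static_cast<std::int32_t>(std::llround((coord - offset) / scale));
}

// Reading: reconstruct the double from the stored integer.
double fromRecord(std::int32_t record, double scale, double offset)
{
    return record * scale + offset;
}
```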

> Similarly I suspect there's going to be a need for users to map dimensions (eg write source dimension foo into destination dimension bar).

Yes, this is a filter that I've been meaning to write. Having this would remove the confounding of types that getField() has, and the user would have to be explicit about how they want to map dimensions and what their size/interpretation should be when doing so.
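The kind of explicit mapping such a filter would provide might look like this (a hypothetical sketch; no such filter exists yet, and `mapDimension` is an invented name): the user names the source dimension, the destination dimension, and nothing is converted implicitly.

```cpp
#include <cassert>
#include <map>
#include <string>

// Copy src[from] into dst[to]. In a real filter the user would also
// declare the destination's size/interpretation explicitly.
void mapDimension(const std::map<std::string, double>& src,
                  std::map<std::string, double>& dst,
                  const std::string& from, const std::string& to)
{
    auto it = src.find(from);
    if (it != src.end())
        dst[to] = it->second;
}
```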

> Our format isn't interleaved (it's banded/column major), there would be some performance benefit if we only had to read dimensions that the Writer was using. Not a major concern.

The format that the pipeline above describes isn't interleaved either. PDAL definitely bakes in the interleavedness :)  Maybe this could be solved by having the reader implementation mmap the file, with the read iterator creating pointers into that mapping. It was a strategy I was considering for the format above.
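A rough sketch of why banded layouts make per-dimension reads cheap (illustrative only; `readBanded` is not PDAL code, and an mmap'd file would replace the vector here): each dimension is one contiguous run, so an iterator needs only a base pointer per band and can skip bands the writer never asked for.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 'file' holds numPoints values for band 0, then numPoints for band 1, etc.
// With an mmap, 'file.data()' would simply be the mapped base address.
double readBanded(const std::vector<double>& file, std::size_t numPoints,
                  std::size_t bandIndex, std::size_t pointIndex)
{
    const double* band = file.data() + bandIndex * numPoints; // band start
    return band[pointIndex]; // no per-point stride over unwanted dimensions
}
```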

> It would be nice if file based plugins could be used directly with pcinfo/pc2pc without an xml pipeline file.

I think this could be done now that we have a global environment startup/shutdown facility.  I didn't do it before because the pipeline stuff is the only thing that wakes up all of the stages and does any magic to determine which reader to open things with and so on.

> getField() looks dangerous, sometimes it will convert the types, but for others will give you corrupt data. Also, its behavior differs from setField. I have an idea for a patch if there's interest in making this more robust.

Agreed. As I said above, getField() confounds dimension mapping, and this should be made explicit with a Filter:: that does the work for you. I would be very happy to take patches that clean this ball of guck up.
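The hazard being described can be reproduced in miniature (this is an illustration of the failure mode, not PDAL's getField() implementation): reinterpreting a field's raw bytes as another type silently produces garbage when the stored type is actually a double, whereas an explicit conversion at least does something defensible.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// The dangerous path: read the first 4 bytes of the double's object
// representation as if they were an int32 -- garbage, not a conversion.
std::int32_t rawReadAsInt32(double stored)
{
    std::int32_t out;
    std::memcpy(&out, &stored, sizeof(out));
    return out;
}

// The explicit path: a real (lossy but well-defined) numeric conversion.
std::int32_t convertedInt32(double stored)
{
    return static_cast<std::int32_t>(stored);
}
```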

> But all in all, development went smoothly; many thanks for adding the plugin support.

I'm excited things went well. I think you're one of the first developers outside of Michael and me to implement a driver, and the first to provide significant and helpful feedback on how things can improve.

Howard

