[Gdal-dev] RFC 11: Fast Format Identification

Tamas Szekeres szekerest at gmail.com
Mon Apr 30 17:49:21 EDT 2007


2007/4/30, Frank Warmerdam <warmerdam at pobox.com>:
>
> > Is it enough to cache only the filenames for the subsequent Identify
> > calls? For example, if we open a "secondary file" of a driver first,
> > wouldn't we want to retain the header bytes until the primary file
> > is identified by a driver?
>
> This is possible, but I am doubtful that it would gain much for a
> substantial increase in complexity.  I do plan to do some performance
> testing, including strace's watching system calls, to see what extra
> work is being done in directory scans after the new code is put in place.
> This might point out that such header-caching would be of substantial
> value, but I'm hesitant to take on that complexity yet.
>

I think in that case the proposed interface should be reworked
substantially, since the current approach does not provide a way to
retain the internal state of the search across subsequent Identify
invocations. Currently only the (read-only) string list is retained.
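
To make the idea of retained state concrete, here is a minimal sketch in
Python. Everything in it is hypothetical (SearchState, read_header and
header are illustrative names, not part of GDAL or of the proposal); it
only shows header bytes being read once and reused by later Identify
calls for the same directory.

```python
from typing import Callable, Dict, List


class SearchState:
    """Hypothetical state kept across Identify calls for one directory."""

    def __init__(self, siblings: List[str], read_header: Callable[[str], bytes]):
        self.siblings = list(siblings)         # the file names already proposed
        self._read_header = read_header        # assumed callback that reads a file's first bytes
        self._headers: Dict[str, bytes] = {}   # header bytes retained between calls

    def header(self, name: str) -> bytes:
        """Return the header of `name`, reading it at most once."""
        if name not in self._headers:
            self._headers[name] = self._read_header(name)
        return self._headers[name]
```

With something like this, a driver that first meets a secondary file could
ask the state for the primary file's header, and the bytes would still be
there when the primary file itself is identified.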

> > Theoretically, from the user's perspective, it feels a bit hacky to
> > pass the siblings in a filesystem when identifying a particular file.
> > Wouldn't it be more convenient to follow the FindFirst... FindNext...
> > approach on the supported filenames and drivers? An internal
> > searchhandle could be passed between these functions holding the
> > internal state of the search and eliminating the need for the user of
> > dealing with the potentially unsupported items.
>
> I'm afraid I don't see where the benefit in this approach would be.
> How would it be "eliminating the need for the user of dealing
> with the potentially unsupported items."?
>

The FindFirst... FindNext... methods would return only the successfully
identified items. The user should not have to deal with files that are
not relevant to gdal/ogr. The proposed implementation would require the
user to call GDALIdentifyDriver for every file found in the directory.
The other issue I raised is mostly theoretical: why should we force the
user to retain internal data for 'our' search? Normally he would not
want to deal with the list of filenames in a directory. Instead he
would want to pick up the supported files/formats one by one, so as to
populate a listview or a treeview in the user interface.
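
A rough sketch of what I mean, in Python for brevity. All names here
(DatasetSearch, find_next, iter_supported, TOY_DRIVERS) are hypothetical
and not an existing GDAL API, and the two toy drivers only check
well-known magic bytes. The point is that the handle owns the directory
listing and the user only ever sees identified files.

```python
import os
import tempfile
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical stand-in for a driver's Identify callback: given the file name,
# its first bytes and the sibling names, report whether the driver applies.
IdentifyFunc = Callable[[str, bytes, List[str]], bool]

# Toy drivers keyed on well-known magic bytes; real drivers check much more.
TOY_DRIVERS: List[Tuple[str, IdentifyFunc]] = [
    ("GTiff", lambda name, hdr, sib: hdr[:4] in (b"II*\x00", b"MM\x00*")),
    ("PNG", lambda name, hdr, sib: hdr.startswith(b"\x89PNG\r\n\x1a\n")),
]


class DatasetSearch:
    """Hypothetical search handle: the directory listing lives in here."""

    def __init__(self, directory: str, drivers: List[Tuple[str, IdentifyFunc]]):
        self._directory = directory
        self._drivers = drivers
        self._siblings = sorted(os.listdir(directory))  # read once, kept internally
        self._pos = 0

    def find_next(self) -> Optional[Tuple[str, str]]:
        """Return the next (file name, driver name) match, or None when done."""
        while self._pos < len(self._siblings):
            name = self._siblings[self._pos]
            self._pos += 1
            path = os.path.join(self._directory, name)
            if not os.path.isfile(path):
                continue  # skip subdirectories in this sketch
            with open(path, "rb") as f:
                header = f.read(1024)
            for driver_name, identify in self._drivers:
                if identify(name, header, self._siblings):
                    return name, driver_name
        return None


def iter_supported(
    directory: str, drivers: List[Tuple[str, IdentifyFunc]]
) -> Iterator[Tuple[str, str]]:
    """Yield only the identified files; unsupported ones never reach the caller."""
    search = DatasetSearch(directory, drivers)
    match = search.find_next()
    while match is not None:
        yield match
        match = search.find_next()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for fname, data in [("a.tif", b"II*\x00"), ("notes.txt", b"hello")]:
            with open(os.path.join(d, fname), "wb") as f:
                f.write(data)
        for fname, driver in iter_supported(d, TOY_DRIVERS):
            print(fname, driver)
```

A GUI would simply loop over the matches to fill its listview, and the
internals of the handle could change later without touching the caller.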

> > Moreover, later on,
> > you could easily reorganize the internal structure held by the
> > handle without affecting the interface itself if you find a more
> > performant approach to which information should be retained during the
> > search.
>
> I had contemplated requiring that the filename list be lexically ordered
> so that a binary search could be done for particular files instead of a
> linear scan.  Beyond this I find it hard to imagine that performance of
> the list of filenames is likely to be much of an issue.
>

Hmmm... Passing a lexically ordered list to GDALOpenInfo would be another
requirement the user would not want to bother with. The ordering should
take place internally when the search starts, and the ordered list
should be retained during the search.
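
As an illustration only, a small Python sketch of sorting once internally
and then doing binary-search lookups. SiblingList is a hypothetical name,
not something in GDAL; the caller may pass the names in any order.

```python
import bisect
from typing import Iterable, List


class SiblingList:
    """Hypothetical holder for a directory's file names, sorted once internally."""

    def __init__(self, names: Iterable[str]):
        # The caller's order does not matter; the sort happens here.
        self._names: List[str] = sorted(names)

    def __contains__(self, name: str) -> bool:
        # Binary search instead of a linear scan over the list.
        i = bisect.bisect_left(self._names, name)
        return i < len(self._names) and self._names[i] == name


siblings = SiblingList(["scene.img", "readme.txt", "scene.hdr"])
has_header = "scene.hdr" in siblings  # a driver probing for its secondary file
```

The sort costs something once per search, but every later probe for a
sibling file is a logarithmic lookup and the user never sees the ordering
requirement.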

> > Exposing a stringlist in GDALIdentifyDriver to SWIG is less efficient
> > than exposing an internal handle. Many of the languages would have to
> > reallocate the stringlist in the marshaling code every time
> > GDALIdentifyDriver is called.
>
> I have contemplated also offering a GDALIdentifyDriverForDirectory()
> function where you would provide a directory name, and it would
> return a list of all apparently supported files, and the driver that
> applies.  If this was done, it would be the most efficient entry point
> for swig bindings that want to avoid reallocating the filename list
> for each call.
>

Well, it's worth considering including this one in the current
proposal. I wonder if we could choose a return type that is easily
SWIGgable for every language. At this point I would discourage
returning arrays of classes.
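
One possible shape, sketched in Python and purely as an assumption on my
side: return a flat list of strings alternating file name and driver
name, so that only a plain string list has to cross the language
boundary. The function names and the extension-based toy identify below
are hypothetical; a real Identify would look at the file contents.

```python
import os
from typing import Callable, Iterable, List, Optional, Tuple


def identify_by_extension(name: str) -> Optional[str]:
    """Toy identify: maps an extension to a driver name, or None if unknown."""
    known = {".tif": "GTiff", ".png": "PNG"}
    return known.get(os.path.splitext(name)[1].lower())


def identify_directory_flat(
    names: Iterable[str], identify: Callable[[str], Optional[str]]
) -> List[str]:
    """Return [file1, driver1, file2, driver2, ...] for the supported files only."""
    flat: List[str] = []
    for name in names:
        driver = identify(name)
        if driver is not None:
            flat.extend([name, driver])
    return flat


def unflatten(flat: List[str]) -> List[Tuple[str, str]]:
    """Target-language helper: turn the flat list back into (file, driver) pairs."""
    return list(zip(flat[0::2], flat[1::2]))
```

The target language can regroup the pairs on its own side, which avoids
having to wrap a new class in every binding.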

> > CPLReadDir() should also be exposed to the SWIG interface to easily
> > construct the string list of the files in a directory. However, I would
> > prefer it if this were handled internally and only the root
> > directory had to be specified by the user (in the FindFirst... method
> > for example).
>
> In the past I've avoided exposing too many of the VSI and related functions
> through SWIG under the assumption that the languages all have their own
> POSIX io and filesystem operations.  However, as we move to use VSI as an
> abstraction layer for "in memory files" and such, it may be valuable to
> expose the VSI*L API and a few helper functions expected to work properly
> against any VSI type files.
>

The various languages might not be completely equal in this regard.
Although it might be possible to retrieve the list of files in a
directory from the target language, it is not as easy as calling a
function that has already been prepared for this purpose. The more
functionality we expose, the less extra code has to be added in the
target language to implement the desired features.

Best regards,

Tamas


