[Gdal-dev] RFC 11: Fast Format Identification

Frank Warmerdam warmerdam at pobox.com
Mon Apr 30 22:25:56 EDT 2007


Tamas Szekeres wrote:
> 2007/4/30, Frank Warmerdam <warmerdam at pobox.com>:
>>
>> > Is it enough to cache only the filenames for the subsequent Identify
>> > calls? For example if we open a "secondary file" of a driver first
>> > would we not want to retain the header bytes until the
>> > primary file is identified by a driver?
>>
>> This is possible, but I am doubtful that it would gain much for a
>> substantial increase in complexity.  I do plan to do some performance
>> testing, including watching system calls with strace, to see what extra
>> work is being done in directory scans after the new code is put in place.
>> This might point out that such header-caching would be of substantial
>> value, but I'm hesitant to take on that complexity yet.
>>
> 
> I think in that case the proposed interface should be reconstructed
> substantially, since the current approach does not provide the option
> of retaining the internal state of the search among the subsequent
> Identify invocations. Currently only the (readonly) stringlist is
> retained.

Tamas,

I agree that the current interface does not give the option of
caching file headers (short of pushing responsibility for this
down into some VSI*L magic).  However, I don't agree that this
possible optimization is worth complicating the interface.

>> > Theoretically from the user's perspective I feel a bit hacky to pass
>> > the siblings in a filesystem when identifying a particular file.
>> > Wouldn't it be more convenient to follow the FindFirst... FindNext...
>> > approach on the supported filenames and drivers. An internal
>> > searchhandle could be passed between these functions holding the
>> > internal state of the search and eliminating the need for the user of
>> > dealing with the potentially unsupported items.
>>
>> I'm afraid I don't see where the benefit in this approach would be.
>> How would it be "eliminating the need for the user of dealing
>> with the potentially unsupported items."?
>>
> 
> Only the successfully identified items would be retrieved by the
> FindFirst... FindNext... methods. The user should not deal with the
> files not relevant from the aspect of the gdal/ogr project. The
> proposed implementation would require the user to call
> GDALIdentifyDriver with every file that has been found in the directory.

It's only a for loop for heaven's sake!
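
For illustration, the loop in question might look roughly like the
following. This is only a sketch in Python: identify_driver() is a
hypothetical stand-in for GDALIdentifyDriver() that merely checks the
file extension, and supported_files() is an invented helper name.
Neither is part of GDAL.

```python
import os

def identify_driver(filename, siblings):
    """Hypothetical stand-in for GDALIdentifyDriver().

    Returns a driver name, or None if no driver claims the file.
    This toy version only looks at the extension and ignores siblings.
    """
    ext = os.path.splitext(filename)[1].lower()
    return {".tif": "GTiff", ".png": "PNG"}.get(ext)

def supported_files(directory, names):
    """Return (name, driver) pairs for the entries a driver recognizes.

    `names` is the directory listing, read once by the caller and passed
    to every identify call as the sibling list.
    """
    results = []
    for name in names:
        driver = identify_driver(os.path.join(directory, name), names)
        if driver is not None:
            results.append((name, driver))
    return results
```

An application would call it as supported_files(d, os.listdir(d)) and
use the result to populate its listview or treeview.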

> The other issue I raised is mostly theoretical, namely: Why
> should we force the user to retain internal data for 'our' search?
> Normally he would not want to deal with the list of the filenames in a
> directory. Instead he would want to pick up the supported
> files/formats one by one so as to populate a listview or a treeview on
> the user interface.

I only anticipate the identify features being used by perhaps a half
dozen applications in the future, so I don't think putting a little
work on the application developer is so terrible.  It isn't like this
is something that would be expected to be very widely used such that
it would be worth putting a great deal of extra work into convenience
functions.

>> > Moreover, later on,
>> > you could easily reorganize the internal structure held by the
>> > handle without affecting the interface itself if you find a more
>> > performant approach to which information should be retained during the
>> > search.
>>
>> I had contemplated requiring that the filename list be lexically ordered
>> so that a binary search could be done for particular files instead of a
>> linear scan.  Beyond this I find it hard to imagine that performance of
>> the list of filenames is likely to be much of an issue.
>>
> 
> Hmmm... That would be another requirement the user would not want to
> bother with. (Passing a lexically ordered list to GDALOpenInfo) The
> ordering should take place internally when starting the search and the
> ordered list should be retained during the search.

I would agree that the ordering would likely be taken care of within
GDALIdentifyDriver() to avoid making things too fragile.
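
As a rough sketch of that idea (Python, with hypothetical helper names
that are not GDAL functions): the list is sorted once internally, so the
caller may pass it in any order, and each later lookup is a binary
search rather than a linear scan.

```python
import bisect

def prepare_siblings(file_list):
    """Sort the caller-supplied list once; the input is left untouched."""
    return sorted(file_list)

def has_sibling(sorted_list, name):
    """Binary search for `name` in a lexically sorted list."""
    i = bisect.bisect_left(sorted_list, name)
    return i < len(sorted_list) and sorted_list[i] == name
```

A driver looking for a secondary file (say a .hdr next to an .img) would
then call has_sibling() instead of walking the whole list.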

>> > Exposing a stringlist in GDALIdentifyDriver to SWIG is less effective
>> > than exposing an internal handle. Many of the languages would have to
>> > reallocate the stringlist in the marshaling code every time
>> > GDALIdentifyDriver is called.
>>
>> I have also contemplated offering a GDALIdentifyDriverForDirectory()
>> function where you would provide a directory name, and it would
>> return a list of all apparently supported files, and the driver that
>> applies.  If this was done, it would be the most efficient entry point
>> for swig bindings that want to avoid reallocating the filename list
>> for each call.
>>
> 
> Well, it's worth considering including this one in the current
> proposal. I wonder if we could choose a type for the return
> value that is easily SWIGgable for every language. At this point I
> would discourage using arrays of classes as
> return value types.

The issue of how to return the results was one of the reasons I
was hesitant to implement such a function.  I am also not keen on
building a whole set of functions so that we can have a "results"
object, and iterators or query functions to pull each result from that.
I strongly dislike that sort of extra machinery except where it is
pretty valuable.
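
To make the return-type question concrete, one conceivable shape that
needs no result object or iterator is a flat list of "filename=driver"
strings. This is purely an illustrative assumption, not something the
RFC proposes; pack_results() and unpack_results() are hypothetical
names. A Python sketch:

```python
def pack_results(pairs):
    """Flatten (filename, driver) pairs into 'filename=driver' strings."""
    return [f"{filename}={driver}" for filename, driver in pairs]

def unpack_results(packed):
    """Recover the pairs by splitting on the LAST '='.

    Assumes driver names never contain '=', so filenames that do
    contain '=' still round-trip correctly.
    """
    out = []
    for entry in packed:
        filename, _, driver = entry.rpartition("=")
        out.append((filename, driver))
    return out
```

A plain string list like this is easy to marshal in any binding, at the
cost of some parsing on the caller's side.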

>> > CPLReadDir() should also be exposed to the SWIG interface to easily
>> > construct the string list of the files in a directory. However I would
>> > like more if this method was internally handled and only the root
>> > directory had to be specified by the user (in the FindFirst... method
>> > for example).
>>
>> In the past I've avoided exposing too many of the VSI and related
>> functions through SWIG under the assumption that the languages all have
>> their own POSIX io and filesystem operations.  However, as we move to
>> use VSI as an
>> abstraction layer for "in memory files" and such, it may be valuable to
>> expose the VSI*L API and a few helper functions expected to work properly
>> against any VSI type files.
>>
> 
> The various languages might not be completely equal in this regard.
> Although it might be possible to retrieve the list of the files in a
> directory, it is not as easy as calling a function that has already been
> prepared for this purpose. The more functionality we expose, the less
> extra code has to be added in the target language to implement the
> desired features.

Can you identify a language where it is hard to get a list of files
in a directory?  I know that in some languages it is a bit tiresome
because they force use of FindFirst()/FindNext() iterators but then
that just supports my other point, right?

I stand by my claim that the only reason to expose low level filesystem
functions from CPL is because applications might need access to the
VSI*L virtual io redirection capability of GDAL.  But whether that is a
good idea or not, I don't see it as closely related to the Identify()
operation, so I'm not keen on adding it to this RFC.

Best regards,
-- 
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up   | Frank Warmerdam, warmerdam at pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush    | President OSGeo, http://osgeo.org



