[Gdal-dev] RFC 11: Fast Format Identification

Frank Warmerdam warmerdam at pobox.com
Mon Apr 30 17:01:16 EDT 2007


Tamas Szekeres wrote:
> Frank,
> 
> Generally I like the idea of providing a fast identification of the
> files in the filesystem supported by gdal. I wish we had a similar
> capability for the ogr supported formats. Are you planning to extend
> this functionality to the ogr project as well?

Tamas,

I have no current plans to extend this to OGR.  In fact, OGR doesn't
have an equivalent to GDALOpenInfo, so checking formats in the Open
methods of many OGR drivers can be fairly expensive, since the header
gets re-read in each one.
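
To illustrate the difference, here is a minimal Python sketch (not GDAL
code) of an open-info object that reads the header once and hands the
same bytes to every driver check.  OpenInfo, HEADER_BYTES, the check
functions and the reads counter are all hypothetical names made up for
this example.

```python
import os


class OpenInfo:
    """Hypothetical stand-in for GDALOpenInfo: read the header only once."""

    HEADER_BYTES = 1024  # assumed header size for this sketch

    def __init__(self, filename):
        self.filename = filename
        self.exists = os.path.isfile(filename)
        self.header = b""
        self.reads = 0  # counts how often the file was actually read
        if self.exists:
            with open(filename, "rb") as f:
                self.header = f.read(self.HEADER_BYTES)
            self.reads += 1


def looks_like_png(info):
    return info.header.startswith(b"\x89PNG\r\n\x1a\n")


def looks_like_gif(info):
    return info.header[:6] in (b"GIF87a", b"GIF89a")


# Each "driver" inspects the cached header instead of reopening the file.
CHECKS = [("PNG", looks_like_png), ("GIF", looks_like_gif)]


def first_match(info, checks):
    """Return the name of the first check that accepts the cached header."""
    for name, check in checks:
        if check(info):
            return name
    return None
```

However many checks are tried, the file is read a single time; without
a shared object each check would have to open and read the file itself.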

At some point (GDAL/OGR 2.0?) it is my hope to re-unify GDAL and OGR
with them sharing a common driver manager, open function, metadata
model and so forth.  At that point OGR would presumably be upgraded to
use the GDAL identify and open mechanisms.  But it is unclear if or
when that will happen.

> The proposed implementation eliminates the need of rescanning the file
> names within a directory by accepting the stringlist of the filenames
> when calling the GDALOpenInfo constructor. However all of the files
> will eventually be opened, fstat-ed and the header bytes will be read
> by the constructor. Wouldn't it be reasonable to establish a primary
> test based only on the existence of the filenames and the extensions
> in the stringlist? I guess it would significantly increase the overall
> performance of the scan.

Many file formats cannot be accurately distinguished by filename, so
I don't see how we could do a good job with only the filenames, and not
having the header chunk.
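
As a small illustration, the Python sketch below compares a guess made
from the extension with a check of the leading bytes.  The magic
numbers for PNG, TIFF and GIF are the well-known ones; the function
names and the format labels are assumptions made up for this example
and are not part of GDAL.

```python
# Well-known magic numbers; the format labels are just illustrative strings.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"II*\x00", "TIFF"),  # little-endian TIFF
    (b"MM\x00*", "TIFF"),  # big-endian TIFF
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
]

EXTENSIONS = {"png": "PNG", "tif": "TIFF", "tiff": "TIFF", "gif": "GIF"}


def identify_by_header(header):
    """Return a format label based on the leading bytes, or None."""
    for magic, fmt in SIGNATURES:
        if header.startswith(magic):
            return fmt
    return None


def guess_by_extension(filename):
    """Return a format label based only on the extension, or None."""
    if "." not in filename:
        return None
    return EXTENSIONS.get(filename.rsplit(".", 1)[-1].lower())
```

A TIFF stored as "scene.dat" gets no answer from the extension, and a
GIF renamed to "logo.png" gets the wrong one, while the header check
handles both.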

> Is it enough to cache only the filenames for the subsequent Identify
> calls? For example if we open a "secondary file" of a driver first
> would not we want to retain the header bytes to that time when the
> primary file is identified by a driver?

This is possible, but I am doubtful that it would gain much for a
substantial increase in complexity.  I do plan to do some performance
testing, including using strace to watch system calls, to see what extra
work is being done in directory scans after the new code is put in place.
This might point out that such header-caching would be of substantial
value, but I'm hesitant to take on that complexity yet.

> Theoretically from the user's perspective I feel a bit hacky to pass
> the siblings in a filesystem when identifying a particular file.
> Wouldn't it be more convenient to follow the FindFirst... FindNext...
> approach on the supported filenames and drivers. An internal
> searchhandle could be passed between these functions holding the
> internal state of the search and eliminating the need for the user of
> dealing with the potentially unsupported items. 

I'm afraid I don't see where the benefit in this approach would be.
How would it be "eliminating the need for the user of dealing
with the potentially unsupported items"?

> Moreover, later on,
> you could easily reorganize the internal structure held by the
> handle without affecting the interface itself if you find a more
> performant approach of which information should be retained during the
> search.

I had contemplated requiring that the filename list be lexically ordered
so that a binary search could be done for particular files instead of a
linear scan.  Beyond this I find it hard to imagine that performance of
the list of filenames is likely to be much of an issue.
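
A minimal Python sketch of the sorted-list lookup, using the standard
bisect module.  The helper name has_sibling and the sample filenames
are made up for illustration, and the comparison here is
case-sensitive, which is an assumption.

```python
from bisect import bisect_left


def has_sibling(sorted_names, name):
    """Binary search for name; sorted_names must already be sorted."""
    i = bisect_left(sorted_names, name)
    return i < len(sorted_names) and sorted_names[i] == name


# The directory listing is sorted once, then queried many times.
siblings = sorted(["scene.hdr", "scene.bil", "notes.txt", "scene.prj"])
```

With this, a driver looking for a companion file such as "scene.hdr"
calls has_sibling(siblings, "scene.hdr") rather than walking the whole
list.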

> Exposing a stringlist in GDALIdentifyDriver to SWIG is less effective
> as exposing an internal handle. Many of the languages would require to
> reallocate the stringlist in the marshaling code every time when the
> GDALIdentifyDriver is called.

I have also contemplated offering a GDALIdentifyDriverForDirectory()
function where you would provide a directory name, and it would
return a list of all apparently supported files, and the driver that
applies.  If this was done, it would be the most efficient entry point
for swig bindings that want to avoid reallocating the filename list
for each call.

However, I'm just not at all certain that this is a significant
performance issue, and certainly the convenience function can be added
later if it is found to be valuable, without any changes to the
underlying GDALIdentifyDriver() call.
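
A rough Python sketch of what such a convenience wrapper might look
like follows.  Nothing here is an existing GDAL call:
identify_driver() is a toy stand-in that only knows two signatures,
and identify_drivers_for_directory() is a hypothetical name.  The
point is only that the directory is listed once and the same sibling
list is reused for every file.

```python
import os


def identify_driver(filename, siblings):
    """Toy identify: header check only; siblings are accepted but unused."""
    try:
        with open(filename, "rb") as f:
            header = f.read(8)
    except OSError:
        return None
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "PNG"
    if header[:6] in (b"GIF87a", b"GIF89a"):
        return "GIF"
    return None


def identify_drivers_for_directory(dirname):
    """Return (filename, driver name) pairs for files that are recognised."""
    siblings = sorted(os.listdir(dirname))  # the directory is scanned once
    results = []
    for name in siblings:
        path = os.path.join(dirname, name)
        if not os.path.isfile(path):
            continue
        driver = identify_driver(path, siblings)
        if driver is not None:
            results.append((name, driver))
    return results
```

A binding would then make one call per directory and get back only the
recognised files, so no filename list has to cross the language
boundary for each file.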

> CPLReadDir() should also be exposed to the SWIG interface to easily
> construct the string list of the files in a directory. However I would
> like more if this method was internally handled and only the root
> directory had to be specified by the user (in the FindFirst... method
> for example).

In the past I've avoided exposing too many of the VSI and related functions
through SWIG on the assumption that the languages all have their own
POSIX I/O and filesystem operations.  However, as we move to use VSI as an
abstraction layer for "in memory files" and such, it may be valuable to
expose the VSI*L API and a few helper functions expected to work properly
against any VSI type files.

> Are you planning to order the drivers when making the Identify calls?
> For example the more effective drivers (the driver really supports the
> Identify, by not calling the corresponding Open) should be called
> first.

pfnIdentify calls will be made in the same order as pfnOpen calls would
be made, which is the order in which the drivers are registered with
the GDALDriverManager.  I already make some effort to register expensive
or risky drivers after fast/safe drivers, though the application/user can
also influence ordering if they want to work at it.

I think it is important that identify be called in the same order as
open since we don't want identify to say one driver will get used, and
then have another driver kick in at open time because things were done
in a different order.
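
To show why the shared ordering matters, here is a minimal Python
sketch of a driver manager in which identify and open walk the same
registration-ordered list.  The Driver and DriverManager classes are
hypothetical and only mimic the idea; they are not the GDAL classes.

```python
class Driver:
    """Hypothetical driver: a name plus a header test."""

    def __init__(self, name, identify):
        self.name = name
        self.identify = identify  # callable(header) -> bool

    def open(self, header):
        # Open succeeds exactly when this driver's identify accepts the header.
        return f"{self.name} dataset" if self.identify(header) else None


class DriverManager:
    def __init__(self):
        self._drivers = []  # registration order is the only ordering

    def register(self, driver):
        self._drivers.append(driver)

    def identify_driver(self, header):
        """Name of the first registered driver that accepts the header."""
        for driver in self._drivers:
            if driver.identify(header):
                return driver.name
        return None

    def open(self, header):
        """Dataset from the first registered driver that can open the header."""
        for driver in self._drivers:
            dataset = driver.open(header)
            if dataset is not None:
                return dataset
        return None
```

If a specific driver and a catch-all driver both accept a file, the one
registered first wins in both calls, so identify never names a driver
other than the one open ends up using.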

Best regards,
-- 
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up   | Frank Warmerdam, warmerdam at pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush    | President OSGeo, http://osgeo.org



