[gdal-dev] Proposed patch for HUGE GML files

Even Rouault even.rouault at mines-paris.org
Tue Dec 6 10:10:49 EST 2011


Hi Sandro,

> I'm here to submit to your attention an alternative gml:xlink
> resolving method (GML driver) I've implemented for Tuscany Region.
>

As usual, very in-depth analysis and high quality patch.

Here are my comments :

  * #include "sqlite3.h" in ogr/ogrsf_frmts/gml/gmlreaderp.h is not appropriate
in that file, and should be only found in hugefileresolver.cpp. SQLite is a very
common dependency in most GDAL/OGR builds, but not a compulsory one. So some
adjustements will be needed in the GNUmakefile/makefile.vc to detect if sqlite
dependency is really available before trying to build hugefileresolver.cpp and
calling it from ogrgmldatasource.cpp

  * Did you consider just using the OGR API to abstract the access to the
temporary database, instead of directly using sqlite API ? This would allow to
use other DB backends, such as postgresql for example. Not a strong objection
however to justify reworking your patch. (A nice side-effect of it would be
that there would make build integration easy)

  * GML_GFS_TEMPLATE config option is general purpose, and could be made also
available even when xlink:href resolving is not asked.

  * I feel you could completely remove the GML_HUGE_SQLITE_UNSAFE config option,
and decide it to be TRUE. I don't see any reason why one would want to have
recovery guarantees for a temporary file. If the process is interrupted, I
suppose that the full resolving must be starting again and the temp file will be
recreated : correct ?

  * Same for GML_HUGE_TEMPFILE. Why would one want to keep the temporary sqlite
DB ? Or perhaps keep it undocumented for just debugging purposes, and make it
default to YES.

  * A bit correlated with the above : the use of int* pbOutIsTempFile in
hugeFileResolver() is confusing, because (contrary to the resolvexlinks.cpp
case) it is not used as an output parameter, but as an input one. And I am not
certain it only does what is documented (i.e. deleting the temporary .sqlite). I
somehow feel it will also delete the .resolved.gml file. See
GRGMLDataSource::~OGRGMLDataSource(). Perhaps I'm just confused...
My understanding of how resolvexlinks.cpp works is the following one : if for
some reason, we cannot create a .resolved.gml file in the same directory as the
.gml, we'll try to generate a temporary file somewhere else, in which case
ResolveXlinks() will set *pbOutIsTempFile  to TRUE, so that the destructor of
OGRGMLDataSource() can clean it when no longer used.

  * I'm wondering if the HUGE resolver could be automatically selected
if the GML file is bigger than a threshold ( 100 MB ? ) (and if sqlite is
available of course). The naming GML_SKIP_RESOLVE_ELEMS=HUGE sounds a bit
awckward to select a resolving algorithm and not very consistant with the
current allowed values (ALL / NONE / commma separated values), but I also find
the current logic a bit twisted too (GML_SKIP_RESOLVE_ELEMS=NONE means resolving
all elements...).

  * In hugefileresolver.cpp, I think that the hand-parsing of XML is unnecessary
in gmlHugeFileCheckXrefs(). The only caller of gmlHugeFileCheckXrefs() does a
gmlText = CPLSerializeXMLTree((CPLXMLNode *)pszNode) just before. Why not just
exploring the nodes through the CPLXMLNode objects ? This would be much cleaner
and robust (in case the XML serialization changes a bit), and likely a bit
faster too.

  * There might be security issues due to the use of char XXX[1024]. I see quite
a few strcpy() with no size checking. I'd suggest using CPLStrlcpy().

  * A few naming inconsistencies. The z in papszGeomList =
poFeature->GetGeometryList()  should be removed because no strings are used.
Same for pszNode = papszGeomList[i].

  * At line 1793 of hugefileresolver.cpp, VSIFPrintfL ( fp, "     
<%s>\%s</%s>\n",
--> the \ before %s looks wrong. And I see no use of XML escaping for the value
of the attribute
(use CPLEscapeString())

  * I see special cases for GML topology geometries in your code. Do you have
small GML samples that illustrate that and could be used for regression testing
?

Please open a ticket on GDAL Trac.

You mentionned the limitations of the XSD parser. That could be an interesting
area for improvements... Downloading XSD via URLs or supporting <xs:include>
doesn't look the more difficult part. The real challenge is dealing with complex
schema (complex being everything that doesn't fit into the Simple Feature Model
of "flat" attributes).

Best regards,

Even


More information about the gdal-dev mailing list