[gdal-dev] Proposed patch for HUGE GML files

Tue Dec 6 10:10:49 EST 2011

Hi Sandro,

> I'm here to submit to your attention an alternative gml:xlink
> resolving method (GML driver) I've implemented for Tuscany Region.
>

As usual, very in-depth analysis and high quality patch.

Here are my comments :

  * #include "sqlite3.h" in ogr/ogrsf_frmts/gml/gmlreaderp.h is not appropriate
in that file, and should be only found in hugefileresolver.cpp. SQLite is a very
common dependency in most GDAL/OGR builds, but not a compulsory one. So some
adjustements will be needed in the GNUmakefile/makefile.vc to detect if sqlite
dependency is really available before trying to build hugefileresolver.cpp and
calling it from ogrgmldatasource.cpp

  * Did you consider just using the OGR API to abstract the access to the
temporary database, instead of directly using sqlite API ? This would allow to
use other DB backends, such as postgresql for example. Not a strong objection
however to justify reworking your patch. (A nice side-effect of it would be
that there would make build integration easy)

  * GML_GFS_TEMPLATE config option is general purpose, and could be made also
available even when xlink:href resolving is not asked.

  * I feel you could completely remove the GML_HUGE_SQLITE_UNSAFE config option,
and decide it to be TRUE. I don't see any reason why one would want to have
recovery guarantees for a temporary file. If the process is interrupted, I
suppose that the full resolving must be starting again and the temp file will be
recreated : correct ?

  * Same for GML_HUGE_TEMPFILE. Why would one want to keep the temporary sqlite
DB ? Or perhaps keep it undocumented for just debugging purposes, and make it
default to YES.

  * A bit correlated with the above : the use of int* pbOutIsTempFile in
hugeFileResolver() is confusing, because (contrary to the resolvexlinks.cpp
case) it is not used as an output parameter, but as an input one. And I am not
certain it only does what is documented (i.e. deleting the temporary .sqlite). I
somehow feel it will also delete the .resolved.gml file. See
GRGMLDataSource::~OGRGMLDataSource(). Perhaps I'm just confused...
My understanding of how resolvexlinks.cpp works is the following one : if for
some reason, we cannot create a .resolved.gml file in the same directory as the
.gml, we'll try to generate a temporary file somewhere else, in which case
ResolveXlinks() will set *pbOutIsTempFile  to TRUE, so that the destructor of
OGRGMLDataSource() can clean it when no longer used.

  * I'm wondering if the HUGE resolver could be automatically selected
if the GML file is bigger than a threshold ( 100 MB ? ) (and if sqlite is
available of course). The naming GML_SKIP_RESOLVE_ELEMS=HUGE sounds a bit
awckward to select a resolving algorithm and not very consistant with the
current allowed values (ALL / NONE / commma separated values), but I also find
the current logic a bit twisted too (GML_SKIP_RESOLVE_ELEMS=NONE means resolving
all elements...).

  * In hugefileresolver.cpp, I think that the hand-parsing of XML is unnecessary
in gmlHugeFileCheckXrefs(). The only caller of gmlHugeFileCheckXrefs() does a
gmlText = CPLSerializeXMLTree((CPLXMLNode *)pszNode) just before. Why not just
exploring the nodes through the CPLXMLNode objects ? This would be much cleaner
and robust (in case the XML serialization changes a bit), and likely a bit
faster too.

  * There might be security issues due to the use of char XXX[1024]. I see quite
a few strcpy() with no size checking. I'd suggest using CPLStrlcpy().

  * A few naming inconsistencies. The z in papszGeomList =
poFeature->GetGeometryList()  should be removed because no strings are used.
Same for pszNode = papszGeomList[i].

  * At line 1793 of hugefileresolver.cpp, VSIFPrintfL ( fp, "     
<%s>\%s</%s>\n",
--> the \ before %s looks wrong. And I see no use of XML escaping for the value
of the attribute
(use CPLEscapeString())

  * I see special cases for GML topology geometries in your code. Do you have
small GML samples that illustrate that and could be used for regression testing
?

Please open a ticket on GDAL Trac.

You mentionned the limitations of the XSD parser. That could be an interesting
area for improvements... Downloading XSD via URLs or supporting <xs:include>
doesn't look the more difficult part. The real challenge is dealing with complex
schema (complex being everything that doesn't fit into the Simple Feature Model
of "flat" attributes).

Best regards,

Even