[gdal-dev] Proposed patch for HUGE GML files
a.furieri at lqt.it
Tue Dec 6 07:37:37 EST 2011
Hi All,
I'd like to submit for your attention an alternative gml:xlink
resolution method for the GML driver, which I've implemented for Tuscany Region.
Tuscany Region context and requirements:
========================================
Their brand-new official vector cartography is fully
topology-based, and GML is the main format adopted
as the primary source / long-term storage.
Any alternative source (SHP, spatial DBMSs such as
PostGIS and SpatiaLite) will later be derived from these
primary GML files.
And ogr2ogr is the main tool currently used to transfer data
between the different sources.
The whole regional cartography consists of several dozen
huge GML files (> 2GB each), and each single GML file
contains about a hundred layers and many tens of ancillary
data tables (several million features per single GML file).
The complete GML schema is formally defined by an accurate
but really complex XSD; it is a highly ramified one, because
it makes extensive use of <xs:include> elements.
Critical issues found in current ogr2ogr:
=========================================
Setting GML_SKIP_RESOLVE_ELEMS NONE does effectively
support xlink:href resolution.
a) Anyway, as the official documentation honestly admits,
"the resolution algorithm is not optimised for
large files. For files with more than a couple of
thousand xlink:href tags, the process can go beyond
a few minutes".
For real-world GML files (several million xlink:href
occurrences) Tuscany Region actually measured discouraging
timings (well in excess of 24 hours per single GML file).
b) Furthermore, a physical size limitation exists: the current
implementation of GML_SKIP_RESOLVE_ELEMS NONE builds an
in-memory GML node-tree, which is used for xlink:href
resolution.
In practice this means that no GML file bigger than approx.
1GB can ever be parsed successfully, simply because the
address space available to 32-bit software does not allow
allocating as much RAM as this intensively memory-based
method requires.
The proposed GML_SKIP_RESOLVE_ELEMS HUGE alternative:
=====================================================
I've carefully evaluated an alternative strategy that avoids
all of the above critical issues while still leaving the main
core of the current GML implementation absolutely untouched.
The only bottleneck I identified was the use of a memory-based
GML node-tree.
So I started developing an alternative method focused solely on
xlink:href resolution, one that does not require huge memory
allocations and possibly offers enhanced efficiency and better
performance when parsing huge GML files.
Please note well:
- GML_SKIP_RESOLVE_ELEMS NONE still supports the previous
xlink:href resolution method exactly as before
- GML_SKIP_RESOLVE_ELEMS HUGE activates the new alternative
resolution method
- any other facet of the GML driver is left completely
untouched; the only difference between the NONE and HUGE
methods is in xlink:href resolution and in xxx.resolved.gml
output generation (a minimal usage sketch follows).
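
Just to make the switch concrete, here is a minimal sketch using
the GDAL/OGR C API (the file name is purely a placeholder, and the
exact calls may vary slightly with the GDAL version in use):

    #include "cpl_conv.h"   /* CPLSetConfigOption() */
    #include "ogr_api.h"    /* OGRRegisterAll(), OGROpen() */

    int main(void)
    {
        OGRRegisterAll();

        /* select the new SQLite-based resolver; using "NONE"
         * here would select the old in-memory node-tree instead */
        CPLSetConfigOption("GML_SKIP_RESOLVE_ELEMS", "HUGE");

        /* opening the GML file triggers xlink:href resolution and
         * produces the corresponding xxx.resolved.gml output */
        OGRDataSourceH hDS = OGROpen("cartography.gml", FALSE, NULL);
        if (hDS == NULL)
            return 1;

        /* ... read layers / features exactly as usual ... */

        OGRReleaseDataSource(hDS);
        return 0;
    }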
Main design:
- using a temporary DB (SQLite) for xlink:href resolution
- then outputting an xxxx.resolved.gml file, exactly
as the old NONE method does
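
Just to sketch the idea (this is a deliberately simplified
illustration, not the actual code in hugefileresolver.cpp; the
schema and table names are invented for the example): while the
GML file is scanned sequentially, every element carrying a gml:id
goes into one table and every xlink:href occurrence into another,
and the actual matching is then delegated to SQLite via a single
indexed JOIN while the xxxx.resolved.gml output is rewritten:

    #include <sqlite3.h>

    /* simplified sketch of the temporary-DB approach;
     * schema and table names are illustrative only */
    static int sketch_resolver(sqlite3 *db)
    {
        sqlite3_stmt *stmt;

        /* 1) while scanning the GML file, store every element
         *    carrying a gml:id together with its serialized XML */
        sqlite3_exec(db,
            "CREATE TABLE nodes (gml_id TEXT PRIMARY KEY, xml TEXT)",
            NULL, NULL, NULL);

        /* 2) store every xlink:href occurrence found in the features */
        sqlite3_exec(db,
            "CREATE TABLE hrefs (feature_rowid INTEGER, href TEXT)",
            NULL, NULL, NULL);

        /* ... sequential GML scan populating both tables goes here ... */

        /* 3) resolve all references at once with a single JOIN
         *    (the PRIMARY KEY provides the index), instead of
         *    walking an in-memory node-tree */
        const char *sql =
            "SELECT h.feature_rowid, n.xml "
            "FROM hrefs AS h "
            "JOIN nodes AS n ON n.gml_id = ltrim(h.href, '#')";
        if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK)
            return 0;
        while (sqlite3_step(stmt) == SQLITE_ROW)
        {
            /* ... splice the referenced XML into the xxxx.resolved.gml
             *     output in place of the xlink:href attribute ... */
        }
        sqlite3_finalize(stmt);
        return 1;
    }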
Advantages:
- a DBMS is the ideal tool for resolving massive relational
pairs (JOIN): any DBMS is carefully optimized for exactly this
task, so there is no need to reinvent the wheel yet again.
- this method is completely file-system-based; a very low
memory footprint is required, even when processing impressively
huge GML files
- SQLite is absolutely simple and lightweight (and doesn't
require any installation/configuration at all).
- the auxiliary SQLite DB-file can be used in the simplest way,
exactly as if it were an ordinary temporary file.
- SQLite is already among the GDAL/OGR dependencies, so no
further complexity and/or dependencies are introduced.
Testing and measured timings:
=============================
- the current GML_SKIP_RESOLVE_ELEMS NONE method was
unable to open any GML file bigger than approx. 1GB
(insufficient memory)
- the alternative GML_SKIP_RESOLVE_ELEMS HUGE method is
able to successfully parse GML files as big as approx.
3GB (there is no theoretical size limit; we simply had
no bigger sample to test)
- the old method required 24+ hours to resolve a medium-sized
GML file.
- the new SQLite-based method requires about 1 hour to resolve
a huge 3GB GML file (containing a million+ xlink:href pairs).
Implementation details:
=======================
Some further configuration options are available when
GML_SKIP_RESOLVE_ELEMS HUGE is set:
- if GML_HUGE_TEMPFILE YES is set, the SQLite DB-file used
to resolve xlink:href / gml:id relations will be removed as
soon as it is no longer required (i.e. once the xxxx.resolved.gml
output has been generated).
Usually this auxiliary DB-file is a huge file too, and it is
completely useless once the resolved file has been generated;
removing it from the file-system helps to save a lot of precious
disk space when processing many really huge GML files one after
the other.
- if GML_HUGE_SQLITE_UNSAFE YES is set, the auxiliary SQLite DB-file
will be configured to enable every possible performance boost.
This includes disabling transaction support altogether: an
intrinsically unsafe setting (the DB cannot be recovered after a
fatal system crash), but absolutely harmless in this specific case,
since this is just a temporary file that will be recreated from
scratch on the next run (see the sketch below).
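
By way of illustration, the "unsafe" setup essentially boils down
to SQLite pragmas along these lines (the exact pragmas issued by
the patch may differ):

    #include <sqlite3.h>

    /* illustrative only: trading crash-safety for raw speed is
     * acceptable here because the DB-file is a throw-away temporary */
    static void set_unsafe_pragmas(sqlite3 *db)
    {
        sqlite3_exec(db, "PRAGMA journal_mode = OFF", NULL, NULL, NULL);
        sqlite3_exec(db, "PRAGMA synchronous = OFF",  NULL, NULL, NULL);
    }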
Finally, a GML_GFS_TEMPLATE path_to_template.gfs option was
introduced to complement GML_SKIP_RESOLVE_ELEMS HUGE (a combined
usage sketch follows the rationale below).
Rationale for GML_GFS_TEMPLATE:
-------------------------------
- the current OGR GML implementation attempts to 'sniff' the
data layout; in practice this means that subsequent runs of
ogr2ogr may easily guess different layouts for the same
class/layer (some attributes are optional, and many alphanumeric
keys will be wrongly recognized as integers ...).
This is a painful issue when you have to import many dozens of
distinct GML files into the same DBMS [-append].
- the current XSD support is insufficient, because it doesn't
allow downloading the XSD via URL, and because it doesn't
support <xs:include> directives.
- the current GFS support is fairly adequate, once you've
carefully hand-written a GFS file that exactly corresponds to
your complex XSD schema.
Being able to unconditionally select the same GFS file for many
subsequent import runs makes your life much easier, and ensures
a strongly consistent data layout everywhere.
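
Putting the pieces together, a typical setup for a bulk import run
could look like the sketch below (file names are placeholders; the
same options can also be passed to ogr2ogr via its --config
switches):

    #include "cpl_conv.h"
    #include "ogr_api.h"

    /* example setup for repeated imports of many huge GML files;
     * "regione_toscana.gfs" and "ct_sheet_001.gml" are placeholders */
    static OGRDataSourceH open_huge_gml(void)
    {
        OGRRegisterAll();

        CPLSetConfigOption("GML_SKIP_RESOLVE_ELEMS", "HUGE");
        CPLSetConfigOption("GML_GFS_TEMPLATE", "regione_toscana.gfs");
        CPLSetConfigOption("GML_HUGE_TEMPFILE", "YES");      /* drop the temp DB afterwards */
        CPLSetConfigOption("GML_HUGE_SQLITE_UNSAFE", "YES"); /* fastest temp-DB settings */

        return OGROpen("ct_sheet_001.gml", FALSE, NULL);
    }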
Proposed patch:
===============
Not too complex, after all ... please see the attached
svn diff and source:
a) -/ogr/ogrsf_frmts/gml/hugefileresolver.cpp
full implementation of GML_SKIP_RESOLVE_ELEMS HUGE
and related options (mainly SQLite handling stuff)
b) -/ogr/ogrsf_frmts/gml/gmlreader.h
-/ogr/ogrsf_frmts/gml/gmlreaderp.h
-/ogr/ogrsf_frmts/gml/ogrgmldatasource.cpp
changing a few C++ class definitions and implementing
GML_SKIP_RESOLVE_ELEMS HUGE activation
c) -/ogr/ogrsf_frmts/gml/GNUmakefile
building hugefileresolver.cpp
d) -/ogr/ogrsf_frmts/gml/drv_gml.html
updated documentation
e) -/ogr/ogrsf_frmts/nas/nasreaderp.h
-/ogr/ogrsf_frmts/nas/nasreader.cpp
just a few formal changes to the C++ class definitions so
as to maintain full compatibility with the GML base module.
Thanks for your attention and consideration,
Sandro Furieri
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gml_hugefile_patch.zip
Type: application/x-zip-compressed
Size: 16531 bytes
Desc: not available
Url : http://lists.osgeo.org/pipermail/gdal-dev/attachments/20111206/a14d1034/gml_hugefile_patch-0001.bin