[gdal-dev] Proposed patch for HUGE GML files
a.furieri at lqt.it
Tue Dec 6 07:37:37 EST 2011
Hi All,
I'd like to submit for your attention an alternative gml:xlink
resolution method for the GML driver, which I've implemented for Tuscany Region.
Tuscany Region context and requirements:
========================================
Their brand-new official vector cartography is fully
topology-based, and GML is the main format adopted
as the primary source / long-term storage.
Any alternative source (SHP, spatial DBMSs such as
PostGIS and SpatiaLite) will later be derived from these
primary GML files.
And ogr2ogr is the main tool currently used to transfer data
between the different sources.
The whole regional cartography consists of several dozen
huge GML files (> 2GB each), and each single GML file
contains about a hundred layers and many tens of ancillary
data tables (several million features per single GML file).
The complete GML schema is formally defined by an accurate
but really complex XSD; it is a highly ramified one, because
it makes extensive use of <xs:include> elements.
Critical issues found in current ogr2ogr:
=========================================
Setting GML_SKIP_RESOLVE_ELEMS NONE does effectively
support xlink:href resolution.
a) Anyway, as the official documentation honestly admits,
"the resolution algorithm is not optimised for
large files. For files with more than a couple of
thousand xlink:href tags, the process can go beyond
a few minutes".
For real-world GML files (several million xlink:href
occurrences) Tuscany Region actually measured discouraging
timings (well in excess of 24 hours per single GML file).
b) Furthermore, a physical size limitation exists: the current
implementation of GML_SKIP_RESOLVE_ELEMS NONE builds an
in-memory GML node-tree, which is used for xlink:href
resolution.
In practice this means that no GML file bigger than approx.
1GB can ever be parsed successfully, simply because the
address space available to 32-bit software does not allow
allocating as much RAM as this intensively memory-based
method requires.
The proposed GML_SKIP_RESOLVE_ELEMS HUGE alternative:
=====================================================
I've carefully evaluated an alternative strategy that avoids
all of the above critical issues while still leaving the main
core of the current GML implementation absolutely untouched.
The only bottleneck I identified was the use of a memory-based
GML node-tree.
So I started developing an alternative method focused solely on
xlink:href resolution, one that does not require huge memory
allocations and possibly offers enhanced efficiency and better
performance when parsing huge GML files.
Please note well:
- GML_SKIP_RESOLVE_ELEMS NONE still supports the previous
xlink:href resolution method exactly as before
- GML_SKIP_RESOLVE_ELEMS HUGE activates the new alternative
resolution method
- any other facet of the GML driver is left completely
untouched; the only difference between the NONE and HUGE
methods is in xlink:href resolution and in xxx.resolved.gml
output generation (a minimal usage sketch follows).
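
Just to make the switch concrete, here is a minimal sketch using
the GDAL/OGR C API (the file name is purely a placeholder, and the
exact calls may vary slightly with the GDAL version in use):

    #include "cpl_conv.h"   /* CPLSetConfigOption() */
    #include "ogr_api.h"    /* OGRRegisterAll(), OGROpen() */

    int main(void)
    {
        OGRRegisterAll();

        /* select the new SQLite-based resolver; using "NONE"
         * here would select the old in-memory node-tree instead */
        CPLSetConfigOption("GML_SKIP_RESOLVE_ELEMS", "HUGE");

        /* opening the GML file triggers xlink:href resolution and
         * produces the corresponding xxx.resolved.gml output */
        OGRDataSourceH hDS = OGROpen("cartography.gml", FALSE, NULL);
        if (hDS == NULL)
            return 1;

        /* ... read layers / features exactly as usual ... */

        OGRReleaseDataSource(hDS);
        return 0;
    }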
Main design:
- using a temporary DB (SQLite) for xlink:href resolution
- then outputting an xxxx.resolved.gml file, exactly
as the old NONE method does
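
Just to sketch the idea (this is a deliberately simplified
illustration, not the actual code in hugefileresolver.cpp; the
schema and table names are invented for the example): while the
GML file is scanned sequentially, every element carrying a gml:id
goes into one table and every xlink:href occurrence into another,
and the actual matching is then delegated to SQLite via a single
indexed JOIN while the xxxx.resolved.gml output is rewritten:

    #include <sqlite3.h>

    /* simplified sketch of the temporary-DB approach;
     * schema and table names are illustrative only */
    static int sketch_resolver(sqlite3 *db)
    {
        sqlite3_stmt *stmt;

        /* 1) while scanning the GML file, store every element
         *    carrying a gml:id together with its serialized XML */
        sqlite3_exec(db,
            "CREATE TABLE nodes (gml_id TEXT PRIMARY KEY, xml TEXT)",
            NULL, NULL, NULL);

        /* 2) store every xlink:href occurrence found in the features */
        sqlite3_exec(db,
            "CREATE TABLE hrefs (feature_rowid INTEGER, href TEXT)",
            NULL, NULL, NULL);

        /* ... sequential GML scan populating both tables goes here ... */

        /* 3) resolve all references at once with a single JOIN
         *    (the PRIMARY KEY provides the index), instead of
         *    walking an in-memory node-tree */
        const char *sql =
            "SELECT h.feature_rowid, n.xml "
            "FROM hrefs AS h "
            "JOIN nodes AS n ON n.gml_id = ltrim(h.href, '#')";
        if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK)
            return 0;
        while (sqlite3_step(stmt) == SQLITE_ROW)
        {
            /* ... splice the referenced XML into the xxxx.resolved.gml
             *     output in place of the xlink:href attribute ... */
        }
        sqlite3_finalize(stmt);
        return 1;
    }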
Advantages:
- a DBMS is the ideal tool for resolving massive relational
pairs (JOIN): any DBMS is carefully optimized for exactly this
task, so there is no need to reinvent the wheel yet again.
- this method is completely file-system-based; a very low
memory footprint is required, even when processing impressively
huge GML files
- SQLite is absolutely simple and lightweight (and doesn't
require any installation/configuration at all).
- the auxiliary SQLite DB-file can be used in the simplest way,
exactly as if it were an ordinary temporary file.
- SQLite is already among the GDAL/OGR dependencies, so no
further complexity and/or dependencies are introduced.
Testing and measured timings:
=============================
- the current GML_SKIP_RESOLVE_ELEMS NONE method was
unable to open any GML file bigger than approx. 1GB
(insufficient memory)
- the alternative GML_SKIP_RESOLVE_ELEMS HUGE method is
able to successfully parse GML files as big as approx.
3GB (there is no theoretical size limit; we simply had
no bigger sample to test)
- the old method required 24+ hours to resolve a medium-sized
GML file.
- the new SQLite-based method requires about 1 hour to resolve
a huge 3GB GML file (containing a million+ xlink:href pairs).
Implementation details:
=======================
Some further configuration options are available when
GML_SKIP_RESOLVE_ELEMS HUGE is set:
- if GML_HUGE_TEMPFILE YES is set, the SQLite DB-file used
to resolve xlink:href / gml:id relations will be removed as
soon as it is no longer required (i.e. once the xxxx.resolved.gml
output has been generated).
Usually this auxiliary DB-file is a huge file too, and it is
completely useless once the resolved file has been generated;
removing it from the file-system helps to save a lot of precious
disk space when processing many really huge GML files one after
the other.
- if GML_HUGE_SQLITE_UNSAFE YES is set, the auxiliary SQLite DB-file
will be configured to enable every possible performance boost.
This includes disabling transaction support altogether: an
intrinsically unsafe setting (the DB cannot be recovered after a
fatal system crash), but absolutely harmless in this specific case,
since this is just a temporary file that will be recreated from
scratch on the next run (see the sketch below).
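
By way of illustration, the "unsafe" setup essentially boils down
to SQLite pragmas along these lines (the exact pragmas issued by
the patch may differ):

    #include <sqlite3.h>

    /* illustrative only: trading crash-safety for raw speed is
     * acceptable here because the DB-file is a throw-away temporary */
    static void set_unsafe_pragmas(sqlite3 *db)
    {
        sqlite3_exec(db, "PRAGMA journal_mode = OFF", NULL, NULL, NULL);
        sqlite3_exec(db, "PRAGMA synchronous = OFF",  NULL, NULL, NULL);
    }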
Finally, a GML_GFS_TEMPLATE path_to_template.gfs option was
introduced to complement GML_SKIP_RESOLVE_ELEMS HUGE (a combined
usage sketch follows the rationale below).
Rationale for GML_GFS_TEMPLATE:
-------------------------------
- the current OGR GML implementation attempts to 'sniff' the
data layout; in practice this means that subsequent runs of
ogr2ogr may easily guess different layouts for the same
class/layer (some attributes are optional, and many alphanumeric
keys will be wrongly recognized as integers ...).
This is a painful issue when you have to import many dozens of
distinct GML files into the same DBMS [-append].
- the current XSD support is insufficient, because it doesn't
allow downloading the XSD via URL, and because it doesn't
support <xs:include> directives.
- the current GFS support is fairly adequate, once you've
carefully hand-written a GFS file that exactly corresponds to
your complex XSD schema.
Being able to unconditionally select the same GFS file for many
subsequent import runs makes your life much easier, and ensures
a strongly consistent data layout everywhere.
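
Putting the pieces together, a typical setup for a bulk import run
could look like the sketch below (file names are placeholders; the
same options can also be passed to ogr2ogr via its --config
switches):

    #include "cpl_conv.h"
    #include "ogr_api.h"

    /* example setup for repeated imports of many huge GML files;
     * "regione_toscana.gfs" and "ct_sheet_001.gml" are placeholders */
    static OGRDataSourceH open_huge_gml(void)
    {
        OGRRegisterAll();

        CPLSetConfigOption("GML_SKIP_RESOLVE_ELEMS", "HUGE");
        CPLSetConfigOption("GML_GFS_TEMPLATE", "regione_toscana.gfs");
        CPLSetConfigOption("GML_HUGE_TEMPFILE", "YES");      /* drop the temp DB afterwards */
        CPLSetConfigOption("GML_HUGE_SQLITE_UNSAFE", "YES"); /* fastest temp-DB settings */

        return OGROpen("ct_sheet_001.gml", FALSE, NULL);
    }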
Proposed patch:
===============
Not too complex, after all ... please see the attached
svn diff and source:
a) -/ogr/ogrsf_frmts/gml/hugefileresolver.cpp
full implementation of GML_SKIP_RESOLVE_ELEMS HUGE
and related options (mainly SQLite handling stuff)
b) -/ogr/ogrsf_frmts/gml/gmlreader.h
-/ogr/ogrsf_frmts/gml/gmlreaderp.h
-/ogr/ogrsf_frmts/gml/ogrgmldatasource.cpp
changing a few C++ class definitions and implementing
GML_SKIP_RESOLVE_ELEMS HUGE activation
c) -/ogr/ogrsf_frmts/gml/GNUmakefile
building hugefileresolver.cpp
d) -/ogr/ogrsf_frmts/gml/drv_gml.html
updated documentation
e) -/ogr/ogrsf_frmts/nas/nasreaderp.h
-/ogr/ogrsf_frmts/nas/nasreader.cpp
just a few formal changes to the C++ class definitions so
as to maintain full compatibility with the GML base module.
Thanks for your attention and consideration,
Sandro Furieri
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gml_hugefile_patch.zip
Type: application/x-zip-compressed
Size: 16531 bytes
Desc: not available
Url : http://lists.osgeo.org/pipermail/gdal-dev/attachments/20111206/a14d1034/gml_hugefile_patch-0001.bin