[gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

Andreas Oxenstierna ao at t-kartor.se
Mon Mar 11 00:12:26 PDT 2019


We see significant performance boost when clustering a large PostGIS 
table on the spatial index. This will rewrite the table in spatial index 
order.
CLUSTER <table> USING <spatial_index>
My understanding is that you need to periodically recluster the table 
when updated, which is an atomic exclusive rewrite of the complete table 
i.e. will take time.
You may ask the Postgre team about this, e.g. Magnus Hagander.
You may ask Sandro Santilli (PostGIS topology) about topological aspects.


> It would probably be possible to get substantial gains on existing 
> formats that have spatial indexing by making them more 
> optimal/balanced however I see the following major issues:
>
> * Not sure how by convention deal with non-updatable indexes 
> (packing/sorting on hilbert or sort-tile-recursive seem inherently 
> static) on formats that are assumed to be mutable
> * GeoPackage cannot benefit from clustering on spatial index (so it 
> will not be possible to "cloud" optimize)
> * Shapefile has several other drawbacks as a format perhaps making it 
> less interesting to invest into
>
> I have wondered several times about these things in relation to 
> PostgreSQL and PostGIS... first thought is why there is no way to mark 
> a table static/immutable in PostgreSQL, it seems to be there would be 
> several general optimization possibilities if it was known that a 
> table cannot be changed, not the least that it would make it possible 
> to construct similar packed R-trees as discussed above. Second thought 
> is how the R-tree is constructed/implemented via gist index (I have 
> not had the brain power/interest to dive into the source to understand 
> it yet) - unless it can be static it should probably optimally be 
> a R*-tree with topological split but I've got indications that it is not.
>
> /Björn
>
> Den mån 25 feb. 2019 kl 11:06 skrev Darafei "Komяpa" Praliaskouski 
> <me at komzpa.net <mailto:me at komzpa.net>>:
>
>     Hi Björn,
>
>     Would it make sense to implement it not as a separate format, but
>     as a special convention on top of any of already existing ones?
>     What happens if you sort by Hilbert curve a shapefile or
>     geopackage, and tweak the internals to build indexes in flattish
>     manner?
>
>     For example, here is a branch of Postgres where GiST index is
>     being built in flat manner. \
>     It does not require any change on user's side, but gets you all
>     the benefits by plainly building 10x faster.
>     https://github.com/glukhovn/postgres/tree/gist_btree_build
>
>
>     On Sun, Feb 24, 2019 at 4:59 PM Björn Harrtell
>     <bjorn.harrtell at gmail.com <mailto:bjorn.harrtell at gmail.com>> wrote:
>
>         I've now implemented so that the traversal of the R-tree reads
>         nodes on demand and as suspected it gives a significant boost:
>
>         > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>         gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
>         real0m0,378s (~ 20% improvement)
>
>         > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>         gis_osm_buildings_a_free_1_single.fgb
>         gis_osm_buildings_a_free_1_single; done
>         real0m0,790s (~ 30% improvement)
>
>         I'm now also faster on the reduced window (56 10.1 56.1)
>         beating gpkg:
>
>         > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
>         gis_osm_buildings_a_free_1_single.gpkg
>         gis_osm_buildings_a_free_1_single; done
>         real0m0,305s
>
>         > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
>         gis_osm_buildings_a_free_1_single.fgb
>         gis_osm_buildings_a_free_1_single; done
>         real0m0,242s
>
>         It still remains a TODO to further optimize I/O to read larger
>         blocks if possible (applies both to R-tree nodes and the
>         feature data itself) and I think that could prove a very
>         significant performance boost potentially leaving any
>         competition in the dust. :D
>
>         /Björn
>
>         Den lör 9 feb. 2019 kl 00:08 skrev Björn Harrtell
>         <bjorn.harrtell at gmail.com <mailto:bjorn.harrtell at gmail.com>>:
>
>             I've now implemented spatial filtering (at GDAL fork
>             https://github.com/bjornharrtell/gdal/tree/flatgeobuf) and
>             results are promising, for two test cases about 3,5-4
>             times faster than shapefile with the same content. I've
>             used these command lines to measure total time over 10
>             iterations for each format and dataset:
>
>             > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
>             56.2 gis_osm_roads_free_1.shp gis_osm_roads_free_1; done
>             real0m1,940s
>
>             > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
>             56.2 gis_osm_roads_free_1.gpkg gis_osm_roads_free_1; done
>             real0m0,750s
>
>             > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
>             56.2 gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
>             real0m0,461s
>
>             > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
>             56.2 gis_osm_buildings_a_free_1_single.shp
>             gis_osm_buildings_a_free_1_single; done
>             real0m4,381s
>
>             > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
>             56.2 gis_osm_buildings_a_free_1_single.gpkg
>             gis_osm_buildings_a_free_1_single; done
>             real0m1,624s
>
>             > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
>             56.2 gis_osm_buildings_a_free_1_single.fgb
>             gis_osm_buildings_a_free_1_single; done
>             real0m1,113s
>
>             The bounds 10 56 10.2 56.2 contains 21525 out of the
>             812547 total features for gis_osm_roads_free_1 and 57699
>             out of 2027890 features for gis_osm_buildings_a_free_1_single.
>
>             And there is definitely room for optimization. The
>             implementation is naive and reads the whole R-tree in
>             memory and does not yet read consecutive features (common
>             case as the R-tree is balanced) in larger I/O blocks which
>             I think can be significant. If I reduce the extent to 10
>             56 10.1 56.1 GPKG is actually fair bit faster which I
>             think is due to it being able to read the index partially.
>
>             /Björn
>
>             Den tis 22 jan. 2019 kl 18:43 skrev Björn Harrtell
>             <bjorn.harrtell at gmail.com <mailto:bjorn.harrtell at gmail.com>>:
>
>                 After posting about my experimental format I realized
>                 that I lack numbers on the potential performance, so I
>                 tried to make some more or less scientific measuring.
>                 The results was disappointing, reaching similar
>                 performance as shapefile for full sequential reads and
>                 so I lost interest for a while. But recently I found
>                 out that my method of measuring was flawed - I
>                 measured the time to read into a new shapefile using
>                 ogr2ogr, but it turns out that can be quite unfair due
>                 to the string field length dynamic resize that ogr2ogr
>                 will do if the strings from input feature attributes
>                 are of variable length. I've now made some new
>                 measurements using ogrinfo and the undocumented flag
>                 -qq to get a full sequential read.
>
>                 I've used the shapefiles "gis_osm_buildings_a_free_1"
>                 (exploded to single poly) and "gis_osm_roads_free_1"
>                 from Denmark at
>                 http://download.geofabrik.de/europe.html as the
>                 source, converted to respective format, then measured
>                 the time it takes to run ogrinfo -qq on them. Results
>                 (average over three runs):
>
>                 ## gis_osm_buildings_a_free_1 (2027890 Polygons)
>                 * SHP: 3800 ms
>                 * GPKG: 2700 ms
>                 * FlatGeobuf: 2200 ms
>
>                 ## gis_osm_roads_free_1 (812547 LineStrings)
>                 * SHP: 1600 ms
>                 * GPKG: 1350 ms
>                 * FlatGeobuf: 800 ms
>
>                 With that my hope and interest is rekindled. I believe
>                 I can fare even better against the competition with
>                 with spatially filtered searches, will hopefully soon
>                 have some results on that too.
>
>                 /Björn
>
>                 Den sön 9 dec. 2018 kl 20:36 skrev Björn Harrtell
>                 <bjorn.harrtell at gmail.com
>                 <mailto:bjorn.harrtell at gmail.com>>:
>
>                     Hi GDAL/OGR folks,
>
>                     In my spare time I've been working on a vector
>                     file format called FlatGeobuf (tentatively). The
>                     main reason, besides curiosity and learning, I'm
>                     putting time into it is that I think shapefile
>                     still rules the read/query static data performance
>                     game, which is kind of sad, and probably one of
>                     the reasons it is still so widely popular. Also,
>                     the main competitor (GeoPackage) isn't suitable
>                     for streaming access (AFAIK) which shapefiles also
>                     handles surprisingly (?) well.
>
>                     By using a performance focused write once binary
>                     encoding (flatbuffers), schema constraint and
>                     focusing on read/query performance by clustering
>                     on an optimal spatial index (Packed Hilbert
>                     R-Tree) I hope to be able to beat shapefile
>                     performance and at the same time be optimal for
>                     streaming/cloud access.
>
>                     I think I'm starting to get somewhere, more
>                     details and source is at
>                     https://github.com/bjornharrtell/flatgeobuf and I
>                     have an early proof of concept driver
>                     implementation at
>                     https://github.com/bjornharrtell/gdal/tree/flatgeobuf and
>                     results are already quite promising - it can do
>                     roundtrip read/write and is already quite a bit
>                     faster than shapefile. I also have implemented
>                     naive read only QGIS provider for experimental
>                     purposes.
>
>                     Basically I'm fishing for interest in this effort,
>                     hoping that others will see potential in it even
>                     if it's "yet another format" and far from
>                     final/stable yet. Any feedback is welcome. As I
>                     see it, GDAL is a good place for a format
>                     specification and reference implementation to
>                     incubate.
>
>                     Some additional food for thought that I've had
>                     during the experimentation:
>
>                     1. The main in memory representations of
>                     geometry/coordinates seem to be either separate
>                     arrays per dimension (GDAL (partially?) and QGIS)
>                     or a single array with dimension as stride. I've
>                     chosen the latter as of yet but I'm still a bit
>                     undecided. There is of course a substantial
>                     involved in transforming between the two
>                     representations so the situation with two
>                     competing models is a bit unfortunate. I'm also
>                     not sure about which of these models are
>                     objectively "better" than the other?
>
>                     2. One could get better space efficiency with
>                     protobuf instead of flatbuffers, but it has a
>                     performance cost. I have not gone far into
>                     investigating how much though and one could even
>                     reason about supporting both these encodings in a
>                     single format specification but it's taking it a
>                     bit far I think.
>
>                     3. Is there a reason to support different packing
>                     strategies for the R-Tree or is Packed Hilbert a
>                     good compromise (besides it being beautifully
>                     simple to implement)?
>
>                     4. FlatGeobuf is perhaps a too technical name, not
>                     catchy enough and has a bad/boring abbreviation.
>                     Other candidates I've considered are OpenGeoFile,
>                     OpenVectorFile or OpenSpatialFile but I'm
>                     undecided. Any ideas? :)
>
>                     Regards,
>
>                     Björn Harrtell
>
>         _______________________________________________
>         gdal-dev mailing list
>         gdal-dev at lists.osgeo.org <mailto:gdal-dev at lists.osgeo.org>
>         https://lists.osgeo.org/mailman/listinfo/gdal-dev
>
>
>
>     -- 
>     Darafei Praliaskouski
>     Support me: http://patreon.com/komzpa
>
>
> _______________________________________________
> gdal-dev mailing list
> gdal-dev at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/gdal-dev


-- 
Hälsningar

Andreas Oxenstierna
T-Kartor Geospatial AB
Olof Mohlins väg 12 Kristianstad
mobile: +46 733 206831
mailto: ao at t-kartor.se
http://www.t-kartor.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/gdal-dev/attachments/20190311/8aec40e8/attachment-0001.html>


More information about the gdal-dev mailing list