[gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

Björn Harrtell bjorn.harrtell at gmail.com
Sun Feb 24 05:59:34 PST 2019


I've now implemented on-demand reading of R-tree nodes during traversal and,
as suspected, it gives a significant boost.
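
To make this concrete, here is a minimal sketch of the idea (hypothetical
node layout and names, not the actual driver code): the packed tree is laid
out level by level, so the search can fetch each node from the stream only
when the query rectangle intersects its parent entry, instead of loading the
whole index up front.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <queue>
#include <vector>

struct NodeItem { double minX, minY, maxX, maxY; uint64_t offset; };

static bool Intersects(const NodeItem &a, const NodeItem &b) {
    return !(b.minX > a.maxX || b.minY > a.maxY ||
             b.maxX < a.minX || b.maxY < a.minY);
}

// Read a single packed node (nodeSize entries) at a byte offset, on demand.
static std::vector<NodeItem> ReadNode(std::FILE *f, uint64_t off,
                                      std::size_t nodeSize) {
    std::vector<NodeItem> items(nodeSize);
    std::fseek(f, (long)off, SEEK_SET);
    std::fread(items.data(), sizeof(NodeItem), nodeSize, f);
    return items;
}

// Breadth-first search; entries in nodes at or past leafOffset reference
// features, entries in earlier nodes reference child nodes.
static std::vector<uint64_t> Search(std::FILE *f, const NodeItem &query,
                                    uint64_t rootOffset, uint64_t leafOffset,
                                    std::size_t nodeSize) {
    std::vector<uint64_t> hits;
    std::queue<uint64_t> pending;
    pending.push(rootOffset);
    while (!pending.empty()) {
        const uint64_t off = pending.front();
        pending.pop();
        for (const NodeItem &item : ReadNode(f, off, nodeSize)) {
            if (!Intersects(item, query))
                continue;
            if (off >= leafOffset)
                hits.push_back(item.offset); // offset into the feature data
            else
                pending.push(item.offset);   // offset of a child node
        }
    }
    return hits;
}

The resulting timings: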

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
real 0m0,378s (~ 20% improvement)

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
done
real 0m0,790s (~ 30% improvement)

I'm now also faster on the reduced window (10 56 10.1 56.1), beating GPKG:

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single;
done
real 0m0,305s

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
done
real 0m0,242s

It still remains a TODO to further optimize I/O by reading larger blocks
where possible (this applies both to R-tree nodes and to the feature data
itself), and I think that could prove a very significant performance boost,
potentially leaving any competition in the dust. :D
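
As a rough sketch of what I have in mind for the feature side (a hypothetical
helper, not something implemented yet): because the features are clustered by
the index, the offsets coming back from a spatial search tend to be nearly
contiguous, so they can be coalesced into a few larger reads:

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Coalesce sorted feature offsets into contiguous byte ranges so that
// clustered hits become one large read instead of many small ones.
// featureSize is simplified to a constant here; maxGap is the largest
// hole worth reading through to keep a single request going.
static std::vector<std::pair<uint64_t, uint64_t>> // (start, length)
CoalesceRanges(std::vector<uint64_t> offsets,
               uint64_t featureSize, uint64_t maxGap) {
    std::sort(offsets.begin(), offsets.end());
    std::vector<std::pair<uint64_t, uint64_t>> ranges;
    for (const uint64_t off : offsets) {
        if (!ranges.empty() &&
            off <= ranges.back().first + ranges.back().second + maxGap) {
            // Extend the current range, absorbing any small gap.
            ranges.back().second = off + featureSize - ranges.back().first;
        } else {
            ranges.emplace_back(off, featureSize);
        }
    }
    return ranges;
}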

/Björn

On Sat 9 Feb 2019 at 00:08, Björn Harrtell <bjorn.harrtell at gmail.com> wrote:

> I've now implemented spatial filtering (at my GDAL fork,
> https://github.com/bjornharrtell/gdal/tree/flatgeobuf) and the results are
> promising: for two test cases it is about 3.5-4 times faster than shapefile
> with the same content. I've used these command lines to measure total time
> over 10 iterations for each format and dataset:
>
> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
> gis_osm_roads_free_1.shp gis_osm_roads_free_1; done
> real 0m1,940s
>
> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
> gis_osm_roads_free_1.gpkg gis_osm_roads_free_1; done
> real 0m0,750s
>
> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
> gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
> real 0m0,461s
>
> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
> gis_osm_buildings_a_free_1_single.shp gis_osm_buildings_a_free_1_single;
> done
> real 0m4,381s
>
> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
> gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single;
> done
> real 0m1,624s
>
> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
> done
> real 0m1,113s
>
> The bounds 10 56 10.2 56.2 contain 21525 of the 812547 total features
> for gis_osm_roads_free_1 and 57699 of the 2027890 features for
> gis_osm_buildings_a_free_1_single.
>
> And there is definitely room for optimization. The implementation is naive:
> it reads the whole R-tree into memory and does not yet read consecutive
> features (a common case, since the R-tree is balanced) in larger I/O blocks,
> which I think could be significant. If I reduce the extent to 10 56 10.1
> 56.1, GPKG is actually a fair bit faster, which I think is because it can
> read its index partially.
>
> /Björn
>
> On Tue 22 Jan 2019 at 18:43, Björn Harrtell <bjorn.harrtell at gmail.com> wrote:
>
>> After posting about my experimental format I realized that I lacked numbers
>> on its potential performance, so I tried to do some more or less scientific
>> measuring. The results were disappointing, reaching performance similar to
>> shapefile for full sequential reads, and so I lost interest for a while.
>> But recently I found out that my method of measuring was flawed: I measured
>> the time to read into a new shapefile using ogr2ogr, but it turns out that
>> can be quite unfair due to the dynamic string field length resizing that
>> ogr2ogr will do if the input feature attribute strings are of variable
>> length. I've now made some new measurements using ogrinfo and the
>> undocumented flag -qq to get a full sequential read.
>>
>> I've used the shapefiles "gis_osm_buildings_a_free_1" (exploded to single
>> polygons) and "gis_osm_roads_free_1" from Denmark at
>> http://download.geofabrik.de/europe.html as the source, converted them to
>> each format, and then measured the time it takes to run ogrinfo -qq on
>> them. Results (average over three runs):
>>
>> ## gis_osm_buildings_a_free_1 (2027890 Polygons)
>> * SHP: 3800 ms
>> * GPKG: 2700 ms
>> * FlatGeobuf: 2200 ms
>>
>> ## gis_osm_roads_free_1 (812547 LineStrings)
>> * SHP: 1600 ms
>> * GPKG: 1350 ms
>> * FlatGeobuf: 800 ms
>>
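>> For reference, conversion from the source shapefiles can be done with
>> ogr2ogr along these lines (assuming the driver in my fork registers under
>> the name "FlatGeobuf"):
>>
>> > ogr2ogr -f GPKG gis_osm_roads_free_1.gpkg gis_osm_roads_free_1.shp
>> > ogr2ogr -f FlatGeobuf gis_osm_roads_free_1.fgb gis_osm_roads_free_1.shp
>>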
>> With that, my hope and interest are rekindled. I believe I can fare even
>> better against the competition with spatially filtered searches, and will
>> hopefully have some results on that soon too.
>>
>> /Björn
>>
>> On Sun 9 Dec 2018 at 20:36, Björn Harrtell <bjorn.harrtell at gmail.com> wrote:
>>
>>> Hi GDAL/OGR folks,
>>>
>>> In my spare time I've been working on a vector file format tentatively
>>> called FlatGeobuf. The main reason I'm putting time into it, besides
>>> curiosity and learning, is that I think shapefile still rules the
>>> read/query performance game for static data, which is kind of sad and
>>> probably one of the reasons it is still so widely popular. Also, the main
>>> competitor (GeoPackage) isn't suitable for streaming access (AFAIK), which
>>> is something shapefile handles surprisingly (?) well.
>>>
>>> By using a performance-focused, write-once binary encoding (flatbuffers),
>>> constraining the schema, and clustering the features on an optimal spatial
>>> index (a packed Hilbert R-tree), I hope to beat shapefile read/query
>>> performance and at the same time be optimal for streaming/cloud access.
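>>>
>>> The clustering step itself is simple. As a minimal illustrative sketch
>>> (hypothetical Feature struct and a fixed 2^16 grid, not the actual
>>> implementation): sort the features by the Hilbert curve distance of their
>>> bounding box centers; packing the R-tree is then just slicing that order
>>> into fixed-size nodes, bottom-up.
>>>
>>> #include <algorithm>
>>> #include <cstdint>
>>> #include <utility>
>>> #include <vector>
>>>
>>> struct Feature { double minX, minY, maxX, maxY; /* geometry, attributes */ };
>>>
>>> static void Rotate(uint32_t n, uint32_t &x, uint32_t &y,
>>>                    uint32_t rx, uint32_t ry) {
>>>     if (ry == 0) {
>>>         if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }
>>>         std::swap(x, y);
>>>     }
>>> }
>>>
>>> // Classic Hilbert xy -> distance for an n x n grid (n a power of two).
>>> static uint64_t HilbertDistance(uint32_t n, uint32_t x, uint32_t y) {
>>>     uint64_t d = 0;
>>>     for (uint32_t s = n / 2; s > 0; s /= 2) {
>>>         const uint32_t rx = (x & s) ? 1 : 0;
>>>         const uint32_t ry = (y & s) ? 1 : 0;
>>>         d += (uint64_t)s * s * ((3 * rx) ^ ry);
>>>         Rotate(n, x, y, rx, ry);
>>>     }
>>>     return d;
>>> }
>>>
>>> // Sort features by the Hilbert distance of their bounding box centers,
>>> // mapped onto a 2^16 x 2^16 grid covering the dataset extent.
>>> static void HilbertSort(std::vector<Feature> &fs, double minX, double minY,
>>>                         double maxX, double maxY) {
>>>     const uint32_t n = 1u << 16;
>>>     auto key = [&](const Feature &f) {
>>>         const double cx = (f.minX + f.maxX) / 2;
>>>         const double cy = (f.minY + f.maxY) / 2;
>>>         const uint32_t x = (uint32_t)((cx - minX) / (maxX - minX) * (n - 1));
>>>         const uint32_t y = (uint32_t)((cy - minY) / (maxY - minY) * (n - 1));
>>>         return HilbertDistance(n, x, y);
>>>     };
>>>     std::sort(fs.begin(), fs.end(),
>>>               [&](const Feature &a, const Feature &b) {
>>>                   return key(a) < key(b);
>>>               });
>>> }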
>>>
>>> I think I'm starting to get somewhere; more details and source are at
>>> https://github.com/bjornharrtell/flatgeobuf and I have an early
>>> proof-of-concept driver implementation at
>>> https://github.com/bjornharrtell/gdal/tree/flatgeobuf. The results are
>>> already quite promising: it can do roundtrip read/write and is already
>>> quite a bit faster than shapefile. I have also implemented a naive
>>> read-only QGIS provider for experimental purposes.
>>>
>>> Basically I'm fishing for interest in this effort, hoping that others
>>> will see potential in it even if it's "yet another format" and far from
>>> final/stable yet. Any feedback is welcome. As I see it, GDAL is a good
>>> place for a format specification and reference implementation to incubate.
>>>
>>> Some additional food for thought that I've had during the
>>> experimentation:
>>>
>>> 1. The main in-memory representations of geometry/coordinates seem to be
>>> either separate arrays per dimension (GDAL (partially?) and QGIS) or a
>>> single array with the dimension count as stride. I've chosen the latter
>>> for now, but I'm still a bit undecided. There is of course a substantial
>>> cost involved in transforming between the two representations, so the
>>> situation with two competing models is a bit unfortunate. I'm also not
>>> sure whether either of these models is objectively "better" than the
>>> other.
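>>>
>>> To illustrate the two layouts and the transform between them (a
>>> hypothetical minimal example):
>>>
>>> #include <cstddef>
>>> #include <vector>
>>>
>>> // Layout A: separate arrays per dimension.
>>> struct SeparateXY { std::vector<double> x, y; };
>>>
>>> // Layout B: one interleaved array, dimension as stride: x0 y0 x1 y1 ...
>>> using Interleaved = std::vector<double>;
>>>
>>> static Interleaved ToInterleaved(const SeparateXY &s) {
>>>     Interleaved out(s.x.size() * 2);
>>>     for (std::size_t i = 0; i < s.x.size(); i++) {
>>>         out[i * 2] = s.x[i];     // the transform touches every
>>>         out[i * 2 + 1] = s.y[i]; // coordinate, hence the cost
>>>     }
>>>     return out;
>>> }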
>>>
>>> 2. One could get better space efficiency with protobuf instead of
>>> flatbuffers, but at a performance cost. I have not gone far into
>>> investigating how much, though. One could even consider supporting both
>>> encodings in a single format specification, but I think that would be
>>> taking it a bit far.
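>>>
>>> The core of that trade-off, as an illustration only: fixed-width values
>>> (the flatbuffers approach) can be read in place, while varints (the
>>> protobuf approach) are smaller for small values but must be decoded byte
>>> by byte:
>>>
>>> #include <cstdint>
>>> #include <cstring>
>>>
>>> // Fixed-width read: copy 8 little-endian bytes into a double
>>> // (matching host endianness assumed for brevity).
>>> static double ReadFixed64(const uint8_t *buf) {
>>>     double v;
>>>     std::memcpy(&v, buf, sizeof v);
>>>     return v;
>>> }
>>>
>>> // Varint read: compact for small values, but a branch per byte.
>>> static uint64_t ReadVarint(const uint8_t *&p) {
>>>     uint64_t v = 0;
>>>     for (int shift = 0; shift < 64; shift += 7) {
>>>         const uint8_t b = *p++;
>>>         v |= (uint64_t)(b & 0x7f) << shift;
>>>         if (!(b & 0x80)) break;
>>>     }
>>>     return v;
>>> }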
>>>
>>> 3. Is there a reason to support different packing strategies for the
>>> R-Tree or is Packed Hilbert a good compromise (besides it being beautifully
>>> simple to implement)?
>>>
>>> 4. FlatGeobuf is perhaps too technical a name, not catchy enough, and it
>>> has a bad/boring abbreviation. Other candidates I've considered are
>>> OpenGeoFile, OpenVectorFile or OpenSpatialFile, but I'm undecided. Any
>>> ideas? :)
>>>
>>> Regards,
>>>
>>> Björn Harrtell
>>>
>>