[gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

Björn Harrtell bjorn.harrtell at gmail.com
Fri Feb 8 15:08:16 PST 2019


I've now implemented spatial filtering (at GDAL fork
https://github.com/bjornharrtell/gdal/tree/flatgeobuf) and results are
promising, for two test cases about 3,5-4 times faster than shapefile with
the same content. I've used these command lines to measure total time over
10 iterations for each format and dataset:

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
gis_osm_roads_free_1.shp gis_osm_roads_free_1; done
real 0m1,940s

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
gis_osm_roads_free_1.gpkg gis_osm_roads_free_1; done
real 0m0,750s

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
real 0m0,461s

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
gis_osm_buildings_a_free_1_single.shp gis_osm_buildings_a_free_1_single;
done
real 0m4,381s

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single;
done
real 0m1,624s

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
done
real 0m1,113s

The bounds 10 56 10.2 56.2 contains 21525 out of the 812547 total features
for gis_osm_roads_free_1 and 57699 out of 2027890 features for
gis_osm_buildings_a_free_1_single.

And there is definitely room for optimization. The implementation is naive
and reads the whole R-tree in memory and does not yet read consecutive
features (common case as the R-tree is balanced) in larger I/O blocks which
I think can be significant. If I reduce the extent to 10 56 10.1 56.1 GPKG
is actually fair bit faster which I think is due to it being able to read
the index partially.

/Björn

Den tis 22 jan. 2019 kl 18:43 skrev Björn Harrtell <bjorn.harrtell at gmail.com
>:

> After posting about my experimental format I realized that I lack numbers
> on the potential performance, so I tried to make some more or less
> scientific measuring. The results was disappointing, reaching similar
> performance as shapefile for full sequential reads and so I lost interest
> for a while. But recently I found out that my method of measuring was
> flawed - I measured the time to read into a new shapefile using ogr2ogr,
> but it turns out that can be quite unfair due to the string field length
> dynamic resize that ogr2ogr will do if the strings from input feature
> attributes are of variable length. I've now made some new measurements
> using ogrinfo and the undocumented flag -qq to get a full sequential read.
>
> I've used the shapefiles "gis_osm_buildings_a_free_1" (exploded to single
> poly) and "gis_osm_roads_free_1" from Denmark at
> http://download.geofabrik.de/europe.html as the source, converted to
> respective format, then measured the time it takes to run ogrinfo -qq on
> them. Results (average over three runs):
>
> ## gis_osm_buildings_a_free_1 (2027890 Polygons)
> * SHP: 3800 ms
> * GPKG: 2700 ms
> * FlatGeobuf: 2200 ms
>
> ## gis_osm_roads_free_1 (812547 LineStrings)
> * SHP: 1600 ms
> * GPKG: 1350 ms
> * FlatGeobuf: 800 ms
>
> With that my hope and interest is rekindled. I believe I can fare even
> better against the competition with with spatially filtered searches, will
> hopefully soon have some results on that too.
>
> /Björn
>
> Den sön 9 dec. 2018 kl 20:36 skrev Björn Harrtell <
> bjorn.harrtell at gmail.com>:
>
>> Hi GDAL/OGR folks,
>>
>> In my spare time I've been working on a vector file format called
>> FlatGeobuf (tentatively). The main reason, besides curiosity and learning,
>> I'm putting time into it is that I think shapefile still rules the
>> read/query static data performance game, which is kind of sad, and probably
>> one of the reasons it is still so widely popular. Also, the main competitor
>> (GeoPackage) isn't suitable for streaming access (AFAIK) which shapefiles
>> also handles surprisingly (?) well.
>>
>> By using a performance focused write once binary encoding (flatbuffers),
>> schema constraint and focusing on read/query performance by clustering on
>> an optimal spatial index (Packed Hilbert R-Tree) I hope to be able to beat
>> shapefile performance and at the same time be optimal for streaming/cloud
>> access.
>>
>> I think I'm starting to get somewhere, more details and source is at
>> https://github.com/bjornharrtell/flatgeobuf and I have an early proof of
>> concept driver implementation at
>> https://github.com/bjornharrtell/gdal/tree/flatgeobuf and results are
>> already quite promising - it can do roundtrip read/write and is already
>> quite a bit faster than shapefile. I also have implemented naive read only
>> QGIS provider for experimental purposes.
>>
>> Basically I'm fishing for interest in this effort, hoping that others
>> will see potential in it even if it's "yet another format" and far from
>> final/stable yet. Any feedback is welcome. As I see it, GDAL is a good
>> place for a format specification and reference implementation to incubate.
>>
>> Some additional food for thought that I've had during the experimentation:
>>
>> 1. The main in memory representations of geometry/coordinates seem to be
>> either separate arrays per dimension (GDAL (partially?) and QGIS) or a
>> single array with dimension as stride. I've chosen the latter as of yet but
>> I'm still a bit undecided. There is of course a substantial involved in
>> transforming between the two representations so the situation with two
>> competing models is a bit unfortunate. I'm also not sure about which of
>> these models are objectively "better" than the other?
>>
>> 2. One could get better space efficiency with protobuf instead of
>> flatbuffers, but it has a performance cost. I have not gone far into
>> investigating how much though and one could even reason about supporting
>> both these encodings in a single format specification but it's taking it a
>> bit far I think.
>>
>> 3. Is there a reason to support different packing strategies for the
>> R-Tree or is Packed Hilbert a good compromise (besides it being beautifully
>> simple to implement)?
>>
>> 4. FlatGeobuf is perhaps a too technical name, not catchy enough and has
>> a bad/boring abbreviation. Other candidates I've considered are
>> OpenGeoFile, OpenVectorFile or OpenSpatialFile but I'm undecided. Any
>> ideas? :)
>>
>> Regards,
>>
>> Björn Harrtell
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/gdal-dev/attachments/20190209/e7d40709/attachment.html>


More information about the gdal-dev mailing list