[gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format
Björn Harrtell
bjorn.harrtell at gmail.com
Tue Jan 22 09:43:36 PST 2019
After posting about my experimental format I realized that I lack numbers
on the potential performance, so I tried to make some more or less
scientific measuring. The results was disappointing, reaching similar
performance as shapefile for full sequential reads and so I lost interest
for a while. But recently I found out that my method of measuring was
flawed - I measured the time to read into a new shapefile using ogr2ogr,
but it turns out that can be quite unfair due to the string field length
dynamic resize that ogr2ogr will do if the strings from input feature
attributes are of variable length. I've now made some new measurements
using ogrinfo and the undocumented flag -qq to get a full sequential read.
I've used the shapefiles "gis_osm_buildings_a_free_1" (exploded to single
poly) and "gis_osm_roads_free_1" from Denmark at
http://download.geofabrik.de/europe.html as the source, converted to
respective format, then measured the time it takes to run ogrinfo -qq on
them. Results (average over three runs):
## gis_osm_buildings_a_free_1 (2027890 Polygons)
* SHP: 3800 ms
* GPKG: 2700 ms
* FlatGeobuf: 2200 ms
## gis_osm_roads_free_1 (812547 LineStrings)
* SHP: 1600 ms
* GPKG: 1350 ms
* FlatGeobuf: 800 ms
With that my hope and interest is rekindled. I believe I can fare even
better against the competition with with spatially filtered searches, will
hopefully soon have some results on that too.
/Björn
Den sön 9 dec. 2018 kl 20:36 skrev Björn Harrtell <bjorn.harrtell at gmail.com
>:
> Hi GDAL/OGR folks,
>
> In my spare time I've been working on a vector file format called
> FlatGeobuf (tentatively). The main reason, besides curiosity and learning,
> I'm putting time into it is that I think shapefile still rules the
> read/query static data performance game, which is kind of sad, and probably
> one of the reasons it is still so widely popular. Also, the main competitor
> (GeoPackage) isn't suitable for streaming access (AFAIK) which shapefiles
> also handles surprisingly (?) well.
>
> By using a performance focused write once binary encoding (flatbuffers),
> schema constraint and focusing on read/query performance by clustering on
> an optimal spatial index (Packed Hilbert R-Tree) I hope to be able to beat
> shapefile performance and at the same time be optimal for streaming/cloud
> access.
>
> I think I'm starting to get somewhere, more details and source is at
> https://github.com/bjornharrtell/flatgeobuf and I have an early proof of
> concept driver implementation at
> https://github.com/bjornharrtell/gdal/tree/flatgeobuf and results are
> already quite promising - it can do roundtrip read/write and is already
> quite a bit faster than shapefile. I also have implemented naive read only
> QGIS provider for experimental purposes.
>
> Basically I'm fishing for interest in this effort, hoping that others will
> see potential in it even if it's "yet another format" and far from
> final/stable yet. Any feedback is welcome. As I see it, GDAL is a good
> place for a format specification and reference implementation to incubate.
>
> Some additional food for thought that I've had during the experimentation:
>
> 1. The main in memory representations of geometry/coordinates seem to be
> either separate arrays per dimension (GDAL (partially?) and QGIS) or a
> single array with dimension as stride. I've chosen the latter as of yet but
> I'm still a bit undecided. There is of course a substantial involved in
> transforming between the two representations so the situation with two
> competing models is a bit unfortunate. I'm also not sure about which of
> these models are objectively "better" than the other?
>
> 2. One could get better space efficiency with protobuf instead of
> flatbuffers, but it has a performance cost. I have not gone far into
> investigating how much though and one could even reason about supporting
> both these encodings in a single format specification but it's taking it a
> bit far I think.
>
> 3. Is there a reason to support different packing strategies for the
> R-Tree or is Packed Hilbert a good compromise (besides it being beautifully
> simple to implement)?
>
> 4. FlatGeobuf is perhaps a too technical name, not catchy enough and has a
> bad/boring abbreviation. Other candidates I've considered are OpenGeoFile,
> OpenVectorFile or OpenSpatialFile but I'm undecided. Any ideas? :)
>
> Regards,
>
> Björn Harrtell
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/gdal-dev/attachments/20190122/b446cf89/attachment.html>
More information about the gdal-dev
mailing list