[gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

Sun Dec 9 11:36:19 PST 2018

Hi GDAL/OGR folks,

In my spare time I've been working on a vector file format called
FlatGeobuf (tentatively). Besides curiosity and learning, the main reason
I'm putting time into it is that I think shapefile still rules the
read/query performance game for static data, which is kind of sad, and is
probably one of the reasons it is still so widely popular. Also, the main
competitor (GeoPackage) isn't suitable for streaming access (AFAIK), which
shapefile also handles surprisingly (?) well.

By using a performance-focused, write-once binary encoding (flatbuffers),
a constrained schema, and a focus on read/query performance through
clustering on an optimal spatial index (Packed Hilbert R-Tree), I hope to
be able to beat shapefile performance and at the same time be optimal for
streaming/cloud access.

I think I'm starting to get somewhere. More details and source are at
https://github.com/bjornharrtell/flatgeobuf and I have an early proof of
concept driver implementation at
https://github.com/bjornharrtell/gdal/tree/flatgeobuf and the results are
already quite promising: it can do roundtrip read/write and is already
quite a bit faster than shapefile. I have also implemented a naive
read-only QGIS provider for experimental purposes.

Basically I'm fishing for interest in this effort, hoping that others will
see potential in it even if it's "yet another format" and still far from
final/stable. Any feedback is welcome. As I see it, GDAL is a good
place for a format specification and reference implementation to incubate.

Some additional food for thought that I've had during the experimentation:

1. The main in-memory representations of geometry/coordinates seem to be
either separate arrays per dimension (GDAL (partially?) and QGIS) or a
single array with dimension as stride. I've chosen the latter so far but
I'm still a bit undecided. There is of course a substantial cost involved
in transforming between the two representations, so the situation with two
competing models is a bit unfortunate. I'm also not sure which of these
models is objectively "better" than the other.
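
To make the two layouts concrete, here is a minimal sketch of the
conversion in both directions. It is an illustration only, not code from
FlatGeobuf, GDAL or QGIS; the function names are hypothetical and
double-precision coordinates are assumed.

```python
from array import array


def interleaved_to_separate(coords, stride=2):
    """Split a flat [x0, y0, x1, y1, ...] array into one array per dimension."""
    if len(coords) % stride != 0:
        raise ValueError("coordinate count is not a multiple of the stride")
    # An extended slice picks every stride-th value, starting at dimension d.
    return [array("d", coords[d::stride]) for d in range(stride)]


def separate_to_interleaved(dims):
    """Merge per-dimension arrays back into a single flat array."""
    stride = len(dims)
    n = len(dims[0])
    out = array("d", [0.0]) * (n * stride)
    for d, values in enumerate(dims):
        # Raises ValueError if the per-dimension arrays differ in length.
        out[d::stride] = array("d", values)
    return out
```

Either direction copies every coordinate once, so the cost grows linearly
with the number of vertices and is paid each time data crosses between
libraries that use different layouts.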

2. One could get better space efficiency with protobuf instead of
flatbuffers, but it has a performance cost. I have not investigated how
large that cost is, though. One could even consider supporting both
encodings in a single format specification, but I think that is taking it
a bit far.

3. Is there a reason to support different packing strategies for the R-Tree
or is Packed Hilbert a good compromise (besides it being beautifully simple
to implement)?
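
For readers unfamiliar with the technique, the following is a minimal
sketch of Hilbert packing: sort the bounding boxes by the Hilbert curve
position of their centers, then build the tree bottom-up from fixed-size
runs. It is not the FlatGeobuf implementation; the function names, the
node size and the 16-bit grid are assumptions made for illustration, and
the input is assumed to be non-empty.

```python
def hilbert_index(x, y, order=16):
    """Distance along a Hilbert curve of integer grid cell (x, y)."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the curve stays continuous.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def union(boxes):
    """Bounding box of a list of (minx, miny, maxx, maxy) boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def pack_hilbert_rtree(boxes, node_size=16, order=16):
    """Return (item order, tree levels from leaves up to the root)."""
    minx, miny, maxx, maxy = union(boxes)
    width = (maxx - minx) or 1.0   # avoid division by zero
    height = (maxy - miny) or 1.0
    scale = (1 << order) - 1

    def key(i):
        b = boxes[i]
        # Map the box center onto the integer grid covering the extent.
        gx = int((b[0] + b[2]) / 2 - minx) if False else int(((b[0] + b[2]) / 2 - minx) / width * scale)
        gy = int(((b[1] + b[3]) / 2 - miny) / height * scale)
        return hilbert_index(gx, gy, order)

    item_order = sorted(range(len(boxes)), key=key)
    level = [boxes[i] for i in item_order]
    levels = [level]
    # Each parent covers node_size consecutive entries of the level below.
    while len(level) > 1:
        level = [union(level[i:i + node_size])
                 for i in range(0, len(level), node_size)]
        levels.append(level)
    return item_order, levels
```

Because the items are sorted once and never moved again, the features can
be written to the file in `item_order`, which is what makes the index
double as a clustering of the data.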

4. FlatGeobuf is perhaps too technical a name, not catchy enough, and has
a bad/boring abbreviation. Other candidates I've considered are
OpenGeoFile, OpenVectorFile or OpenSpatialFile, but I'm undecided. Any
ideas? :)

Regards,

Björn Harrtell