[gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

Even Rouault even.rouault at spatialys.com
Mon Dec 10 04:18:58 PST 2018


Björn,

> In my spare time I've been working on a vector file format called
> FlatGeobuf (tentatively). The main reason, besides curiosity and learning,
> I'm putting time into it is that I think shapefile still rules the
> read/query static data performance game, which is kind of sad, and probably
> one of the reasons it is still so widely popular. Also, the main competitor
> (GeoPackage) isn't suitable for streaming access (AFAIK)

I suspect that you could organize the layout of a sqlite3 file to be streaming-
friendly, but the sqlite3 library itself is probably not ready for that (or you 
would have to cheat by buffering a sufficiently large number of bytes in memory 
and using a custom VFS to read from that buffer; at that point, implementing 
your own dedicated sqlite3 reader might be better. Likely doable, but not 
trivial). Random access is possible (you can use /vsicurl/ etc. on a 
GeoPackage), but it might involve a number of seeks and small reads in the 
B-Tree / R-Tree.

That raises a major question: which use case(s) do you want to address 
specifically? From what I've understood, for network access:
- streaming access for progressive display as bytes come in
- efficient bbox queries with a minimized number of bytes and ranges in the 
file to read (minimizing the number of byte ranges is probably the most 
important criterion, since reading 1 byte or 1000 will take about the same time)

> I think I'm starting to get somewhere, more details and source is at
> https://github.com/bjornharrtell/flatgeobuf and I have an early proof of
> concept driver implementation at
> https://github.com/bjornharrtell/gdal/tree/flatgeobuf and results are
> already quite promising - it can do roundtrip read/write and is already
> quite a bit faster than shapefile. I have also implemented a naive read-only
> QGIS provider for experimental purposes.

What are the I and O in the I+O cost related to the R-Tree index?

Wondering if a 3D index could be an option in case you want to address the 
full 3D case at some point. But that might be something for later.

I'm not familiar with FlatBuffers, but is random access into the Feature table 
by feature index possible (without a preliminary scan pass), similar to a 
.shx file in a shapefile?
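For comparison, the .shx mechanism is just a table of fixed-size records, so 
feature i is reachable with a single seek and no scan. A stand-alone sketch of 
the same idea (a hypothetical layout for illustration, not the actual 
FlatGeobuf encoding):

```python
import io
import struct

def write_offset_table(buf, offsets):
    # One little-endian uint64 per feature, .shx-style: since every entry
    # has a fixed size, entry i lives at byte position i * 8.
    for off in offsets:
        buf.write(struct.pack('<Q', off))

def read_offset(buf, i):
    # Random access: seek straight to entry i, no preliminary scan pass.
    buf.seek(i * 8)
    return struct.unpack('<Q', buf.read(8))[0]
```

A reader would then seek to `read_offset(buf, i)` within the Feature section 
to fetch feature i directly.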

Just a detail: it could be nice to have a global flag in the header meaning 
"the index of any feature equals its FID, its FID - 1, or there is no 
particular correlation between feature index and feature FID".
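Such a flag could be a small enum in the header; a hypothetical sketch of the 
three states and how a reader might exploit them (names are mine, not from the 
spec):

```python
from enum import Enum

class FidIndexRelation(Enum):
    # Hypothetical header flag: how a feature's index relates to its FID.
    FID_EQUALS_INDEX = 0          # FID == feature index
    FID_EQUALS_INDEX_PLUS_1 = 1   # FID == feature index + 1 (1-based FIDs)
    UNCORRELATED = 2              # no particular correlation

def fid_to_index(fid, relation):
    # With a known correlation, FID lookups become direct index lookups.
    if relation is FidIndexRelation.FID_EQUALS_INDEX:
        return fid
    if relation is FidIndexRelation.FID_EQUALS_INDEX_PLUS_1:
        return fid - 1
    return None  # would need a scan or an explicit FID index
```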

For completeness of the attribute types, you could add date, time and binary. 
Can the concept of a null value, an empty value, or both be encoded for a 
field value?

The Column structure could also carry more optional info: a long description, 
a maximum length for strings, and optional precision/scale formatting for 
those nostalgic for decimal formatting.

If you want full genericity to express an SRS, consider allowing a WKT CRS 
string as an alternative to authority+code.

> 
> Basically I'm fishing for interest in this effort, hoping that others will
> see potential in it even if it's "yet another format" and far from
> final/stable yet. Any feedback is welcome. As I see it, GDAL is a good
> place for a format specification and reference implementation to incubate.
> 
> Some additional food for thought that I've had during the experimentation:
> 
> 1. The main in memory representations of geometry/coordinates seem to be
> either separate arrays per dimension (GDAL (partially?) and QGIS) or a
> single array with dimension as stride. I've chosen the latter as of yet but
> I'm still a bit undecided. There is of course a substantial cost involved in
> transforming between the two representations so the situation with two
> competing models is a bit unfortunate. I'm also not sure about which of
> these models are objectively "better" than the other?

Why not just use WKB encoding? It likely has similar size and performance 
characteristics to the FlatBuffers encoding, with the main advantage of being 
widely implemented and of supporting other geometry types like CircularString, 
PolyhedralSurface, etc., which you need if you want to fully compete with 
GeoPackage.
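WKB's per-geometry overhead is indeed small and fixed: a 2D point is 21 bytes 
(1 byte-order flag, a uint32 geometry type, two float64 coordinates). A 
minimal encoder/decoder for that simplest case:

```python
import struct

def wkb_point(x, y):
    # Little-endian WKB: byte-order flag (1 = little-endian),
    # geometry type (1 = Point), then two float64 coordinates.
    return struct.pack('<BIdd', 1, 1, x, y)

def parse_wkb_point(data):
    byte_order, geom_type, x, y = struct.unpack('<BIdd', data)
    assert byte_order == 1 and geom_type == 1
    return x, y
```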

> 
> 2. One could get better space efficiency with protobuf instead of
> flatbuffers, but it has a performance cost. I have not gone far into
> investigating how much though and one could even reason about supporting
> both these encodings in a single format specification but it's taking it a
> bit far I think.

This contradicts my above point a bit, but if you decide on a custom geometry 
encoding, why not allow int32 vertex values with an offset+scale setting 
(à la the OpenStreetMap PBF format)?
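The offset+scale idea is just linear quantization: store 
`round((value - offset) / scale)` as an integer and reconstruct on read, at 
the cost of precision bounded by the scale. A sketch (the scale value below is 
only an example, reminiscent of OSM's 100-nanodegree granularity):

```python
def quantize(coords, offset, scale):
    # Map float coordinates to integers small enough for int32 storage:
    # i = round((v - offset) / scale)
    return [round((v - offset) / scale) for v in coords]

def dequantize(ints, offset, scale):
    # Reconstruction is exact to within scale/2 per coordinate.
    return [i * scale + offset for i in ints]
```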

If the main use case you want to address is cloud use, I was wondering if it 
would make sense to add a compression layer (ZSTD-compress each Feature, or a 
group of features?) to reduce the time spent waiting for data from the 
network. Or perhaps not, and just let the transport layer do the compression 
(HTTP Content-Encoding: gzip).
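The trade-off is easy to measure application-side before committing it to the 
format; a sketch using zlib as a stdlib stand-in for ZSTD (which is not in the 
Python standard library):

```python
import zlib

def compress_feature_group(features, level=6):
    # Compressing a group of features together tends to beat per-feature
    # compression, since the dictionary is shared across features -- at the
    # cost of having to decompress the whole group for one feature.
    return zlib.compress(b''.join(features), level)

def decompress_feature_group(blob):
    # Returns the concatenated feature bytes; splitting them back into
    # individual features would need a separate offset table.
    return zlib.decompress(blob)
```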

> 4. FlatGeobuf is perhaps a too technical name, not catchy enough and has a
> bad/boring abbreviation. Other candidates I've considered are OpenGeoFile,
> OpenVectorFile or OpenSpatialFile but I'm undecided. Any ideas? :)

COF = Cloud Optimized Features?

Even

-- 
Spatialys - Geospatial professional services
http://www.spatialys.com
