[gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

Björn Harrtell bjorn.harrtell at gmail.com
Mon Dec 10 13:42:30 PST 2018


Thanks Even, answers inlined.

On Mon 10 Dec 2018 at 13:19, Even Rouault <even.rouault at spatialys.com>
wrote:

> Björn,
>
> > In my spare time I've been working on a vector file format called
> > FlatGeobuf (tentatively). The main reason, besides curiosity and
> > learning, I'm putting time into it is that I think shapefile still
> > rules the read/query static data performance game, which is kind of
> > sad, and probably one of the reasons it is still so widely popular.
> > Also, the main competitor (GeoPackage) isn't suitable for streaming
> > access (AFAIK)
>
> I suspect that you could organize the layout of a sqlite3 file to be
> streaming friendly, but the sqlite3 lib itself is probably not ready for
> that (or you have to cheat by buffering a sufficiently large number of
> bytes in memory and use a custom VFS to read in that buffer. At that
> point, implementing your own dedicated sqlite3 reader might be better.
> Likely doable, but not trivial).
> Random access is possible (you can use /vsicurl/ etc on a geopackage), but
> it might involve a number of seeks and small reads in the B-Tree / R-Tree.


> That raises a major question: which use case(s) do you want to address
> specifically? From what I've understood, for network access:
> - streaming access for progressive display as bytes come in
> - efficient bbox queries with minimized number of bytes and ranges in the
>   files to read (minimizing the number of ranges of bytes is probably the
>   most important criterion since reading 1 byte or 1000 will take the same
>   time)
>

The use case I have in mind is not the lossy, render-optimized one; for
that I think vector tiles are a good design. What I'm aiming for is
essentially a replacement for the shapefile, i.e. a lossless container of
simple features with the best possible performance for bbox query reads or
full dataset reads, via network or other means of I/O, without having to
deal with preprocessing into tiles, generalization etc.


> > I think I'm starting to get somewhere, more details and source is at
> > https://github.com/bjornharrtell/flatgeobuf and I have an early proof of
> > concept driver implementation at
> > https://github.com/bjornharrtell/gdal/tree/flatgeobuf and results are
> > already quite promising - it can do roundtrip read/write and is already
> > quite a bit faster than shapefile. I have also implemented a naive
> > read-only QGIS provider for experimental purposes.
>
> What are the I and O in the I+O related to the R-Tree index ?
>

I symbolizes the R-Tree nodes. O is a separate array with feature
offsets (the byte offset of each feature), so that together you can quickly
get the byte ranges that need to be fetched for a spatial query.
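
To illustrate how I and O could combine, here is a minimal sketch and not
the actual driver code. It assumes the R-Tree search has already produced a
list of matching feature indices, that the offsets are ascending, and that
features are stored back to back. The function name and the max_gap
parameter are hypothetical.

```python
from typing import List, Tuple


def feature_byte_ranges(
    offsets: List[int],
    data_size: int,
    hits: List[int],
    max_gap: int = 0,
) -> List[Tuple[int, int]]:
    """Return merged (start, end) byte ranges, end exclusive, for the hits."""
    ranges: List[Tuple[int, int]] = []
    for i in sorted(set(hits)):
        start = offsets[i]
        # A feature ends where the next one begins; the last ends at data_size.
        end = offsets[i + 1] if i + 1 < len(offsets) else data_size
        if ranges and start - ranges[-1][1] <= max_gap:
            # Close enough to the previous range: extend it rather than
            # issuing one more request.
            ranges[-1] = (ranges[-1][0], end)
        else:
            ranges.append((start, end))
    return ranges


offsets = [0, 100, 250, 400, 900]  # made-up byte offsets of five features
print(feature_byte_ranges(offsets, 1000, [1, 2, 4]))
# [(100, 400), (900, 1000)]
```

Raising max_gap trades a few wasted bytes for fewer ranges, which matters
when every range is a separate network request.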


> Wondering if a 3D index could be an option in case you want to address the
> full 3D case at some point. But might be something for later.
>
> I'm not familiar with flatbuf, but is random access in the Feature table
> by feature index possible (without a preliminary scan pass), similarly to
> a .shx file in a shapefile ?
>

That is the purpose of the O as explained above. Each feature is a separate
flatbuffer message and can be accessed directly.
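
A rough sketch of that kind of lookup, in the spirit of a .shx file. The
on-disk encoding used here (one little-endian uint64 per offset) is an
assumption for the example, not something the spec states, and
feature_slice is a hypothetical helper.

```python
import struct

OFFSET_SIZE = 8  # assumption: each offset is stored as a little-endian uint64


def feature_slice(index_bytes: bytes, feature_bytes: bytes, i: int) -> bytes:
    """Return the raw bytes of feature i without scanning earlier features."""
    count = len(index_bytes) // OFFSET_SIZE
    if not 0 <= i < count:
        raise IndexError(i)
    (start,) = struct.unpack_from("<Q", index_bytes, i * OFFSET_SIZE)
    if i + 1 < count:
        (end,) = struct.unpack_from("<Q", index_bytes, (i + 1) * OFFSET_SIZE)
    else:
        end = len(feature_bytes)  # the last feature runs to the end of the data
    return feature_bytes[start:end]


# Three fake "features" standing in for flatbuffer messages.
features = [b"aaa", b"bb", b"cccc"]
index_bytes = struct.pack("<3Q", 0, 3, 5)
feature_bytes = b"".join(features)
print(feature_slice(index_bytes, feature_bytes, 1))  # b'bb'
```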


> Just a detail: it could be nice to have some global flag in the header
> that would mean "index of any feature = its FID, its FID - 1, or no
> particular correlation between feature index and feature FID"
>

Not sure exactly what you mean, but I've considered having an optional FID
index to support fast random access by FID in the cases where FID is not
the same as the index of the feature, and I guess what you are saying is
that this should be explicit. This is not yet added to the spec.


> For completeness of the attribute types, you could have date, time and
> binary.
> Can the concept of null value or empty value or both for a field value be
> encoded ?
>

Yes, perhaps it would be useful to have dedicated types for date, time and
binary. I recently added datetime.

The columns/field definition is static for the layer. Values are required
to specify a column index. Null/missing values are represented by simply
omitting values for column indexes. An empty values array for a feature =
all values are null.
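
As a sketch of how a reader might interpret this (the function name and
the (column index, value) pair layout are assumptions for the example,
not the actual schema):

```python
from typing import Any, Dict, List, Optional, Tuple


def decode_row(
    columns: List[str], values: List[Tuple[int, Any]]
) -> Dict[str, Optional[Any]]:
    """Expand sparse (column_index, value) pairs into a full row."""
    # Every column starts out as null; only the listed indexes get a value.
    row: Dict[str, Optional[Any]] = {name: None for name in columns}
    for column_index, value in values:
        row[columns[column_index]] = value
    return row


columns = ["name", "population", "updated"]  # static for the whole layer
print(decode_row(columns, [(0, "Oslo"), (2, "2018-12-10")]))
# {'name': 'Oslo', 'population': None, 'updated': '2018-12-10'}
print(decode_row(columns, []))
# {'name': None, 'population': None, 'updated': None}
```

Under this reading an empty string would still be stored explicitly, so it
stays distinguishable from null.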

> The Column structure could also receive more optional info: a long
> description, maximum length for string, optional precision/scale
> formatting for those nostalgic of decimal formatting
>

Agreed.


> If you want full genericity to express a SRS, you could allow a WKT CRS
> string as an alternative to authority+code.
>

Agreed, I should consider it.


> >
> > Basically I'm fishing for interest in this effort, hoping that others
> > will see potential in it even if it's "yet another format" and far from
> > final/stable yet. Any feedback is welcome. As I see it, GDAL is a good
> > place for a format specification and reference implementation to
> > incubate.
> >
> > Some additional food for thought that I've had during the
> > experimentation:
> >
> > 1. The main in memory representations of geometry/coordinates seem to be
> > either separate arrays per dimension (GDAL (partially?) and QGIS) or a
> > single array with dimension as stride. I've chosen the latter as of yet
> > but I'm still a bit undecided. There is of course a substantial cost
> > involved in transforming between the two representations so the
> > situation with two competing models is a bit unfortunate. I'm also not
> > sure which of these models is objectively "better" than the other.
>
> Why not just use WKB encoding since it likely has similar size and
> performance characteristics to the flatbuf encoding, with the main
> advantage of being widely implemented and supporting other geometry types
> like CircularString, PolyhedralSurface, etc..., which you need if you want
> to fully compete with GeoPackage ?
>

I think I've (perhaps prematurely) ruled out WKB because I find it not very
well/accessibly specified in its details and existing implementations
rather complex, so you might be right here. I'm however not sure about
supporting any geometry types other than the ones I already do (similar to
shapefile), in order to constrain the complexity.


> >
> > 2. One could get better space efficiency with protobuf instead of
> > flatbuffers, but it has a performance cost. I have not gone far into
> > investigating how much though and one could even reason about supporting
> > both these encodings in a single format specification but it's taking
> > it a bit far I think.
>
> Contradicts a bit my above point, but if you decide for a custom geometry
> encoding, why not allow int32 for vertex values, with an offset+scale
> setting ? (à la OpenStreetMap PBF)
>

That's what geobuf does for space efficiency but it seems it can cause
drift in some corner cases (see https://github.com/mapbox/geobuf/issues/96)
so I decided not to dive into it as I also wanted to try and aim for a
zero-copy encoding.
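
For context, a minimal sketch of what an offset+scale int32 encoding
implies. It does not reproduce the linked geobuf issue; it only shows that
the round trip is lossy, with a per-coordinate error of up to about half
the scale, and that values need decoding rather than being readable in
place. The function names and the chosen scale are assumptions for the
example.

```python
def quantize(value: float, offset: float, scale: float) -> int:
    """Encode a coordinate as an integer number of scale steps from offset."""
    q = round((value - offset) / scale)
    if not -2**31 <= q < 2**31:
        raise OverflowError("value does not fit in int32 with this offset/scale")
    return q


def dequantize(q: int, offset: float, scale: float) -> float:
    """Decode an integer step count back into a coordinate."""
    return offset + q * scale


scale = 1e-7  # hypothetical resolution of 1e-7 degrees
x = 18.0686123456789
q = quantize(x, 0.0, scale)
restored = dequantize(q, 0.0, scale)
error = abs(restored - x)
print(q, restored, error)  # the restored value differs from x in the low digits
```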


> If the main use case you want to address is cloud use, I was wondering if
> it would make sense to add a compression layer (ZSTD compress each
> Feature ?, or a group of features) to reduce the time spent in waiting
> for data from the network. Or perhaps not, and just let the transport
> layer do the compression (HTTP Content-Encoding: gzip)
>

I have thought about it and container / transport layer compression seems
preferable to me.


> > 4. FlatGeobuf is perhaps a too technical name, not catchy enough and has
> > a bad/boring abbreviation. Other candidates I've considered are
> > OpenGeoFile, OpenVectorFile or OpenSpatialFile but I'm undecided. Any
> > ideas? :)
>
> COF = Cloud Optimized Features ?
>

Hmm, not bad :) I don't consider cloud use the main/only use case for the
format though; offline applications matter too, so I'm not entirely
convinced (yet).


>
> Even
>
> --
> Spatialys - Geospatial professional services
> http://www.spatialys.com
> _______________________________________________
> gdal-dev mailing list
> gdal-dev at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/gdal-dev