[gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

Björn Harrtell bjorn.harrtell at gmail.com
Mon Dec 10 14:13:45 PST 2018


Hi Benjamin,

Very interesting to read that you have experimented with similar things,
and about your positive experiences with FlatBuffers.

Den mån 10 dec. 2018 kl 18:18 skrev Benjamin Stadin <
benjamin.stadin at bridging-it.de>:

> Björn,
>
> Interesting to see there is some progress in this area. We wrote a
> FlatBuffers-based vector tile schema about 2 1/2 years ago, which is used
> in our rendering product (a 3D map rendering engine). I can share some
> thoughts about the general pros and cons, and what I'd do differently
> (from today's perspective).
>
> Just last week I also started experimenting with using our tile format
> within QGIS, which would work similarly to the MVT QGIS plugin [1] that
> uses web Mercator tiling. It will parse our "HSG" FlatBuffers cells (aka
> tiles), convert them on the fly into GeoJSON and pass them to QGIS.
>
> Some thoughts about your current implementation:
>
> 1) I second Even's question about the actual use case. I think a
> traditional tile-based approach will be more efficient for bbox querying
> and client-side visualization. I'll share my approach below.
>

I agree, and perhaps the spatial index part needs more consideration, but
it's an important point that the format I'm trying to make is not meant to
compete with vector tiles and render-optimized representations. My aim is
the case where you want or need to work with simple features source data
in a lossless representation, i.e. more or less the same case as when
directly using shapefiles.


> 2) 2 1/2 years and quite some projects later, I still prefer FlatBuffers
> over protobuf for a number of reasons, starting with direct memory access,
> easier client-side parser implementation (structs instead of strange
> parallel arrays and such) and performance. But it has one limitation with
> regard to memory usage and file size: FlatBuffers will require more memory
> (up to the size of the FlatBuffers raw data) when you create or read a
> large FlatBuffers file. This is due to the direct memory access
> capabilities. You can create a size-prefixed FlatBuffers file (I've asked
> a question about this here [2]), but this will have negative side effects
> for your R-tree based approach: you'll need to read larger chunks (thus
> more unrelated data) to stream individual features. For streaming bbox
> queries a tile-based approach is still more efficient, and simpler. Data
> in one area is packed into one memory region, and can be queried using
> common tile pyramid techniques (or variations like ours).
>

I haven't dived deep enough into this yet, but my hope is that since I
constrain my format to be non-updatable, I can construct, and assume I
have, a balanced R-tree with a known size and structure. As such I should
not need to read more of it than is needed for traversal into the relevant
part of the tree, and I can then use the offset index to get byte offsets
to each feature in a tree query result. Because I also require the
features to be written in the order the R-tree was built, they will be
clustered on R-tree nodes, so most queries should result in small sets of
sequential reads to fetch the features.
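
To make the access pattern concrete, here is a rough sketch (Python; the
node layout, fanout and function names are placeholders of mine, not a
settled part of the spec) of how a bbox query would touch only the nodes
on the path to matching leaves and then end up with mostly sequential
feature reads:

import struct

NODE_FMT = "<4dQ"                      # minx, miny, maxx, maxy, offset
NODE_SIZE = struct.calcsize(NODE_FMT)  # 40 bytes per packed node
FANOUT = 16                            # hypothetical branching factor

def intersects(a, b):
    return not (a[2] < b[0] or a[0] > b[2] or a[3] < b[1] or a[1] > b[3])

def query(read_range, tree_offset, level_counts, bbox):
    # read_range(offset, length) -> bytes, e.g. an HTTP range request.
    # level_counts holds the node count per tree level, root level first.
    hits = []
    stack = [(0, 0)]                   # (level, index within level)
    while stack:
        level, idx = stack.pop()
        pos = sum(level_counts[:level]) + idx
        raw = read_range(tree_offset + pos * NODE_SIZE, NODE_SIZE)
        minx, miny, maxx, maxy, off = struct.unpack(NODE_FMT, raw)
        if not intersects((minx, miny, maxx, maxy), bbox):
            continue
        if level == len(level_counts) - 1:
            hits.append(off)           # leaf: byte offset of a feature
        else:
            # children are contiguous because the tree is packed
            first = int(off)
            last = min(first + FANOUT, level_counts[level + 1])
            stack.extend((level + 1, i) for i in range(first, last))
    # features are stored in tree order, so sorted offsets translate
    # into a few sequential range reads
    return sorted(hits)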


> 3) The R-tree implementation might be sufficient for simple use cases. But
> when you later need to deal with 3D data and large data sets (like OSM),
> the implementation will become more complex and will require efficient
> disk-based indexing instead of reading the whole index into memory (I
> think the memory footprint for the index of the OSM data set will be in
> the region of 1GB++). In the end the advantage might not be so big
> compared to e.g. SQLite's R-tree virtual table. It's actually a quite
> fast implementation, considering its low memory footprint (at the cost of
> disk accesses).
>

Agreed that I might have underestimated the complexity of a partially
buffered tree traversal, but consider the answer above.
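
For a sense of scale, a back-of-the-envelope estimate using the node
layout I have in mind (4 float64 for the bbox plus an 8-byte offset, i.e.
40 bytes per entry, fanout 16; these numbers are my assumptions, nothing
is fixed yet):

ENTRY_BYTES = 4 * 8 + 8                # bbox as 4 float64 + offset
features = 25_000_000
leaf_bytes = features * ENTRY_BYTES    # 1.0 GB of leaf entries
total_bytes = leaf_bytes * 16 // 15    # ~1.07 GB incl. inner nodes
print(total_bytes / 1e9)

So an OSM-sized data set would indeed put the full index in the gigabyte
range, which is why I want traversal to touch only a small part of it.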


> 4) Seconding Even's suggestion about using WKB. Our implementation is also
> related to WKB types, but we defined the geometry types as part of the
> schema, similarly to yours. It works fine, but today I'd choose a byte
> array of WKB data for generic use and easier integration into existing
> software (no need to write a custom geometry parser for any language; WKB
> readers that do that for you already exist).
>

While I mostly agree I'm not entirely convinced; see my answer to Even.
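
That said, the integration argument is hard to dismiss. With WKB stored as
a plain byte array, existing readers do all the decoding; for example with
Shapely (geometry_bytes standing in for the raw WKB of one feature):

from shapely import wkb

geom = wkb.loads(geometry_bytes)
print(geom.geom_type, geom.bounds)

OGR, JTS, GEOS and friends offer the same one-liner in their respective
languages.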


> 5) In case the vector format is intended for client-side rendering, how
> do you handle zoom levels properly?
>

I view this as out of scope for the format.


> From an adoptability and maintenance point of view, I'd try everything
> possible to avoid recreating too much. I think an "integrative" approach is
> a better choice, e.g. creating a GeoPackage extension for the vector tiles
> format analogous to WebP, using WKB as a byte array, and referring to the
> common tile pyramid scheme for easier implementation and integration
> possibilities for client-side rendering.
>

While that sounds like an interesting idea for vector tile distribution,
which I agree is kind of poor today, it's not the aim of this format to
compete with vector tiles as I've hopefully been able to clarify now. :)


> Some words about our "Heidelberg Sinusoidal Grid" (HSG) implementation. We
> basically faced three problems with existing approaches for our use cases:
> - We need different projections for accurate indoor (and outdoor) models.
> Existing vector tile approaches are either optimized for rendering (e.g.
> projection-dependent and clipping geometry data at tile bounds, like MVT),
> or too proprietary / not widely adopted and thus more complex to integrate
> with common rendering approaches (like Google Earth's format, I've
> forgotten its name), which then limits their use in existing processing
> chains and editing tools.
> - We make heavy use of feature querying and interaction, where the full
> feature is required instead of drawing commands (like with MVT).
> - (New) Mixing a common xyz tiling scheme with "real feature" (not
> rendering-optimized or clipped) vector data will eventually allow for
> efficient streaming (bbox) and editing capabilities. E.g. I'm thinking of
> using our vector tiles and simplifying / stripping non-useful data at
> lower zoom levels, while using non-simplified data at upper zoom levels
> where it can be edited in-place, for example in QGIS.
>
> In brief this is how it works:
> To be independent of a projection, and to allow for storing the whole
> globe, we use a variation of the equal-area MODIS grid for the tile
> pyramid. So instead of using the actual projection, the client(s) use a
> helper routine to calculate the "directly" required HSG tiles, similar to
> how a web map would calculate the web Mercator tiles required for the
> current view.
>
> There are however two key differences. The returned tiles (we actually
> call them cells for a reason) only loosely relate to the projection in
> use: you will be provided the HSG cells (+ a small buffer) covering your
> current client view, with complete and unaltered features and geometry
> data regardless of what projection the geometries are actually in. So the
> grid is really used for bbox (tile-pyramid-like) indexing and querying,
> and does not care about the projection or data stored inside individual
> cells (the projection is considered once, when the storage is created and
> features get assigned to a cell).
>
> The other difference to common tiling is that you will also need to load
> implicitly required cells for features overlapping a cell in current
> direct sight: cells contain meta info about related cells (cells
> containing at least one feature overlapping this cell). Constructing the
> storage is therefore a bit more complex, but we can thereby avoid the need
> for a client-side index like an R-tree.
>
> I'm willing to open source that, but there hasn't been much interest in
> this. I'm interested in participating in defining a next-gen vector file
> format as well.
>

My spontaneous reaction to "HSG" is that it seems rather complex, and
perhaps that it is aimed at other use cases than the ones I had in mind.
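
For the record, this is how I picture the lookup you describe, as
pseudocode (all names here are hypothetical, so correct me if I've
misread it):

def cells_for_view(view_bbox, grid, storage):
    # directly covered cells, computed like xyz tile math but on the
    # sinusoidal grid
    direct = grid.cells_covering(view_bbox)
    needed = set(direct)
    for cell_id in direct:
        cell = storage.load(cell_id)
        # each cell lists the cells that contain at least one feature
        # overlapping it, so no client-side R-tree is needed
        needed.update(cell.related_cells)
    return [storage.load(c) for c in needed]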


>
> Cheers
> Ben
>
> [1] https://github.com/geometalab/Vector-Tiles-Reader-QGIS-Plugin
> [2]
> https://stackoverflow.com/questions/35322700/flatbuffers-how-to-write-giant-files
>
>
> Am 10.12.18, 13:19 schrieb "gdal-dev im Auftrag von Even Rouault" <
> gdal-dev-bounces at lists.osgeo.org im Auftrag von even.rouault at spatialys.com
> >:
>
>     Björn,
>
>     > In my spare time I've been working on a vector file format called
>     > FlatGeobuf (tentatively). The main reason, besides curiosity and
>     > learning, I'm putting time into it is that I think shapefile still
>     > rules the read/query static data performance game, which is kind of
>     > sad, and probably one of the reasons it is still so widely popular.
>     > Also, the main competitor (GeoPackage) isn't suitable for streaming
>     > access (AFAIK)
>
>     I suspect that you could organize the layout of a sqlite3 file to be
>     streaming friendly, but the sqlite3 lib itself is probably not ready
>     for that (or you have to cheat by buffering a sufficiently large
>     number of bytes in memory and use a custom VFS to read in that
>     buffer. at that point, implementing your own dedicated sqlite3 reader
>     might be better. likely doable, but not trivial).
>     Random access is possible (you can use /vsicurl/ etc on a
>     geopackage), but it might involve a number of seeks and small reads
>     in the B-Tree / R-Tree.
>
>     That raises a major question: which use case(s) do you want to
>     address specifically? From what I've understood, for network access:
>     - streaming access for progressive display as bytes come in
>     - efficient bbox queries with a minimized number of bytes and ranges
>     in the files to read (minimizing the number of ranges of bytes is
>     probably the most important criterion since reading 1 byte or 1000
>     will take the same time)
>
>     > I think I'm starting to get somewhere, more details and source is
>     > at https://github.com/bjornharrtell/flatgeobuf and I have an early
>     > proof of concept driver implementation at
>     > https://github.com/bjornharrtell/gdal/tree/flatgeobuf and results
>     > are already quite promising - it can do roundtrip read/write and is
>     > already quite a bit faster than shapefile. I have also implemented
>     > a naive read-only QGIS provider for experimental purposes.
>
>     What are the I and O in the I+O related to the R-Tree index ?
>
>     Wondering if a 3D index could be an option in case you want to
>     address the full 3D case at some point. But might be something for
>     later.
>
>     I'm not familiar with flatbuf, but is random access in the Feature
>     table by feature index possible (without a preliminary scan pass),
>     similarly to a .shx file in a shapefile?
>
>     Just a detail: it could be nice to have some global flag in the
>     header that would mean "index of any feature = its FID, its FID - 1,
>     or no particular correlation between feature index and feature FID"
>
>     For completeness of the attribute types, you could have date, time
>     and binary. Can the concept of a null value, an empty value, or both
>     for a field value be encoded?
>
>     The Column structure could also receive more optional info: a long
>     description, maximum length for string, optional precision/scale
>     formatting for those nostalgic of decimal formatting
>
>     If you want full genericity to express an SRS, allow a WKT CRS
>     string as an alternative to authority+code.
>
>     >
>     > Basically I'm fishing for interest in this effort, hoping that
>     > others will see potential in it even if it's "yet another format"
>     > and far from final/stable yet. Any feedback is welcome. As I see
>     > it, GDAL is a good place for a format specification and reference
>     > implementation to incubate.
>     >
>     > Some additional food for thought that I've had during the
>     > experimentation:
>     >
>     > 1. The main in-memory representations of geometry/coordinates seem
>     > to be either separate arrays per dimension (GDAL (partially?) and
>     > QGIS) or a single array with dimension as stride. I've chosen the
>     > latter as of yet but I'm still a bit undecided. There is of course
>     > a substantial cost involved in transforming between the two
>     > representations, so the situation with two competing models is a
>     > bit unfortunate. I'm also not sure which of these models is
>     > objectively "better" than the other?
>
>     Why not just use WKB encoding, since it likely has similar size and
>     performance characteristics to the flatbuf encoding, with the main
>     advantage of being widely implemented and supporting other geometry
>     types like CircularString, PolyhedralSurface, etc., which you need if
>     you want to fully compete with GeoPackage?
>
>     >
>     > 2. One could get better space efficiency with protobuf instead of
>     > flatbuffers, but it has a performance cost. I have not gone far
>     > into investigating how much though and one could even reason about
>     > supporting both these encodings in a single format specification
>     > but it's taking it a bit far I think.
>
>     Contradicts a bit my above point, but if you decide on a custom
>     geometry encoding, why not allow int32 for vertex values, with an
>     offset+scale setting? (a la OpenStreetMap PBF)
>
>     If the main use case you want to address is cloud use, I was
>     wondering if it would make sense to add a compression layer (ZSTD
>     compress each Feature?, or a group of features) to reduce the time
>     spent waiting for data from the network. Or perhaps not, and just let
>     the transport layer do the compression (HTTP Encoding: gzip)
>
>     > 4. FlatGeobuf is perhaps too technical a name, not catchy enough,
>     > and has a bad/boring abbreviation. Other candidates I've considered
>     > are OpenGeoFile, OpenVectorFile or OpenSpatialFile but I'm
>     > undecided. Any ideas? :)
>
>     COF = Cloud Optimized Features ?
>
>     Even
>
>     --
>     Spatialys - Geospatial professional services
>     http://www.spatialys.com
>     _______________________________________________
>     gdal-dev mailing list
>     gdal-dev at lists.osgeo.org
>     https://lists.osgeo.org/mailman/listinfo/gdal-dev
>
>