<div dir="ltr">Hi Benjamin,<div><br></div><div>Very interesting to read that you have experimented in similar things and of your positive experiences with flatbuffers.</div><div><br></div><div><div class="gmail_quote"><div dir="ltr">Den mån 10 dec. 2018 kl 18:18 skrev Benjamin Stadin <<a href="mailto:benjamin.stadin@bridging-it.de">benjamin.stadin@bridging-it.de</a>>:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Björn,<br>
<br>
Interesting to see there is some progress in this area. We wrote a Flatbuffers-based vector tile schema about 2 1/2 years ago, which is used in our rendering product (a 3D map rendering engine). I can share some thoughts about the general pros and cons, and what I'd do differently (from today's perspective). <br>
<br>
Just last week I also started experimenting with using our tile format within QGIS, which would work similarly to the MVT QGIS plugin [1], which uses web Mercator tiling. It will parse our "HSG" flatbuffers cells (aka tiles), convert them on the fly into GeoJSON and pass them to QGIS. <br>
<br>
Some thoughts about your current implementation:<br>
<br>
1) I second Even's question about the actual use case. I think a traditional tile-based approach will be more efficient for bbox querying and client-side visualization. I'll share my approach further below. <br></blockquote><div><br></div><div>I agree, and perhaps the spatial index part needs more consideration, but it's an important point that the format I'm trying to make is not meant to compete with vector tiles and render-optimized representations. My aim is the case where you want/need to work with simple features source data in a lossless representation, i.e. more or less the same case as when directly using shapefiles.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
2) 2 1/2 years and quite a few projects later, I still prefer Flatbuffers over protobuf for a number of reasons, starting with direct memory access, an easier client-side parser implementation (structs instead of strange parallel arrays and such), and performance. But it has one limitation with regard to memory usage and file size. Flatbuffers will require more memory (up to the size of the flatbuffers raw data) when you create or read a large flatbuffers file. This is due to the direct memory access capabilities. You can create a size-prefixed flatbuffers file (I've asked a question about this here [2]), but this will have negative side effects for your R-tree based approach: you'll need to read larger chunks (thus more unrelated data) to stream individual features. For streaming bbox queries a tile-based approach is still more efficient, and simpler. Data in one area is packed into one memory region, and can be queried using common tile pyramid techniques (or variations like ours). <br></blockquote><div><br></div><div>I haven't dived deep enough into this yet, but my hope is that since I constrain my format to be non-updatable, I can construct a balanced R-tree with a known size and structure. As such, I should not need to read more of it than is needed for traversal into the relevant part of the tree, and I can then use the offset index to get byte offsets to each feature in a tree query result. Because I also require the features to be written in the order the R-tree was built, they will be clustered on R-tree nodes, so most queries should result in small sets of sequential reads to get the features.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
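To make the clustering idea above concrete, here is a minimal, hypothetical sketch in Python (not the actual FlatGeobuf layout; the node size and all names are made up) of a packed, balanced R-tree with a known structure, plus an offset index that turns query hits into byte ranges:

# Sketch of the idea described above: features are written in the order of a
# packed, balanced R-tree with a known structure, and a separate offset index
# maps feature number -> byte offset, so a bbox query touches only a few
# index nodes plus a handful of mostly contiguous feature byte ranges.

NODE_SIZE = 16  # branching factor, known from the header in a real format

def build_packed_rtree(bboxes):
    """bboxes: list of (minx, miny, maxx, maxy), already in write order.
    Returns a list of levels, leaves first; each level is a list of bboxes."""
    levels = [list(bboxes)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        parents = []
        for i in range(0, len(prev), NODE_SIZE):
            group = prev[i:i + NODE_SIZE]
            parents.append((min(b[0] for b in group), min(b[1] for b in group),
                            max(b[2] for b in group), max(b[3] for b in group)))
        levels.append(parents)
    return levels

def intersects(a, b):
    return not (a[2] < b[0] or a[0] > b[2] or a[3] < b[1] or a[1] > b[3])

def query(levels, bbox):
    """Walk from the root down, descending only into intersecting nodes.
    Returns the indices of matching features (leaf level)."""
    hits = [0] if intersects(levels[-1][0], bbox) else []
    for level in reversed(levels[:-1]):
        hits = [i * NODE_SIZE + j
                for i in hits
                for j in range(min(NODE_SIZE, len(level) - i * NODE_SIZE))
                if intersects(level[i * NODE_SIZE + j], bbox)]
    return hits

# The offset index turns feature numbers into byte ranges; because features
# are stored in tree order, hits for a small bbox come out nearly contiguous.
bboxes = [(i, i, i + 1, i + 1) for i in range(100)]
offsets = [i * 200 for i in range(101)]  # feature i occupies offsets[i]:offsets[i+1]
levels = build_packed_rtree(bboxes)
for fid in query(levels, (10, 10, 13, 13)):
    print(fid, "bytes", offsets[fid], "-", offsets[fid + 1])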
3) The R-tree implementation might be sufficient for simple use cases. But when you later need to deal with 3D data and large data sets (like OSM), the implementation will become more complex and will require efficient disk-based indexing instead of reading the whole index into memory (I think the memory footprint for the index of the OSM data set will be in the region of 1 GB or more). In the end the advantage might not be so big compared to e.g. SQLite's R-Tree virtual table. It's actually quite a fast implementation, considering its low memory footprint (at the cost of disk accesses).<br></blockquote><div><br></div><div>Agreed that I may have underestimated the complexity of a partially buffered tree traversal, but consider the answer above.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
4) Seconding Even's suggestion about using WKB. Our implementation is also related to WKB types, but we defined the geometry types as part of the schema, similarly to yours. It works fine, but today I'd choose a byte array for WKB data for generic use and easier integration into existing software (no need to write a custom geometry parser for any language; WKB readers already exist which do that for you).<br></blockquote><div><br></div><div>While I mostly agree, I'm not entirely convinced; see my answer to Even.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
5) In case the vector format is intended for client-side rendering, how are zoom levels handled properly?<br></blockquote><div><br></div><div>I view this as out of scope for the format.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
From an adoptability and maintenance point of view, I'd try everything possible to avoid recreating too much. I think an "integrative" approach is a better choice, e.g. creating a GeoPackage extension for the vector tiles format analogous to WebP, using WKB as a byte array, and referring to the common tile pyramid scheme for easier implementation and integration possibilities for client-side rendering.<br></blockquote><div><br></div><div>While that sounds like an interesting idea for vector tile distribution, which I agree is kind of poor today, it's not the aim of this format to compete with vector tiles, as I've hopefully been able to clarify now. :)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Some words about our "Heidelberg Sinusoidal Grid" (HSG) implementation. We basically faced three problems with existing approaches for our use-cases:<br>
- We need different projections for accurate indoor (and outdoor) models. Existing vector tile approaches are either optimized for rendering (e.g. projection-dependent and clipping geometry data at tile bounds, like MVT), or too proprietary / not widely adopted and thus more complex to integrate with common rendering approaches (like Google Earth's format, I've forgotten its name), thereby limiting their use in existing processing chains and editing tools. <br>
- We make heavy use of feature querying and interaction, where the full feature is required instead of drawing commands (like with MVT). <br>
- (New) Mixing a common xyz tiling scheme with "real feature" (not rendering-optimized, not clipped) vector data will eventually allow for efficient streaming (bbox) and editing capabilities. E.g. I'm thinking of using our vector tiles and simplifying / stripping non-useful data at lower zoom levels, and using non-simplified data at upper zoom levels, where it can be edited in place, for example in QGIS. <br>
<br>
In brief this is how it works:<br>
To be independent of a projection, and to allow for storing the whole globe, we use a variation of the equal-area MODIS grid for the tile pyramid. So instead of using the actual projection, the client(s) use a helper routine to calculate the "directly" required HSG tiles, similar to how a web map would calculate the web Mercator tiles required for the current view. <br>
<br>
There are, however, two key differences. The returned tiles (we actually call them cells for a reason) only loosely relate to the projection actually used: you will be provided the HSG cells (+ a small buffer) covering your current client view, with complete and unaltered features and geometry data, regardless of what projection the geometries are actually in. So the grid is really used for bbox (tile-pyramid-like) indexing and querying, and does not care about the projection or data stored inside individual cells (the projection is considered once, when the storage is created and features get assigned to a cell). <br>
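For illustration only (this is not the actual HSG definition, just a guess at the kind of helper routine described; the radius, cell size and names are invented), a client-side cell calculation on a sinusoidal grid could look roughly like this in Python:

# Purely illustrative: given a lon/lat view bbox, compute which fixed-size
# cells of a sinusoidal (equal-area, MODIS-like) grid cover it, independent
# of the projection used by the data stored inside the cells.
import math

R = 6371007.181          # sphere radius used by the MODIS sinusoidal grid (m)
CELL_SIZE = 1111950.52   # roughly 10 degrees at the equator; made-up cell size

def sinusoidal(lon, lat):
    """Forward sinusoidal projection of lon/lat in degrees to metres."""
    lat_r = math.radians(lat)
    return R * math.radians(lon) * math.cos(lat_r), R * lat_r

def covering_cells(min_lon, min_lat, max_lon, max_lat, samples=16):
    """Sample points across the bbox, project them, and collect the touched
    cell ids. Approximate; a real implementation would cover the projected
    footprint exactly."""
    cells = set()
    for i in range(samples + 1):
        for j in range(samples + 1):
            lon = min_lon + (max_lon - min_lon) * i / samples
            lat = min_lat + (max_lat - min_lat) * j / samples
            x, y = sinusoidal(lon, lat)
            cells.add((math.floor(x / CELL_SIZE), math.floor(y / CELL_SIZE)))
    return cells

print(covering_cells(8.5, 49.3, 8.8, 49.5))  # cells covering the Heidelberg area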
<br>
The other difference from common tiling is that you will also need to load implicitly required cells for features overlapping a cell currently in direct sight: cells contain meta info about related cells (cells containing at least one feature overlapping this cell). Constructing the storage is therefore a bit more complex, but we thereby avoid the need for a client-side index like an R-tree. <br>
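And a sketch of the related-cells lookup as described (again with invented data structures, not the real HSG metadata):

# Each cell records the ids of other cells that contain at least one feature
# overlapping it, so the client expands its directly visible cell set instead
# of consulting an R-tree.

def cells_to_load(visible_cells, related_index):
    """visible_cells: set of cell ids covering the current view.
    related_index: dict mapping cell id -> set of related cell ids."""
    needed = set(visible_cells)
    for cell in visible_cells:
        needed |= related_index.get(cell, set())
    return needed

related_index = {
    (12, 7): {(11, 7)},   # a feature stored in cell (11, 7) overlaps (12, 7)
    (12, 8): set(),
}
print(cells_to_load({(12, 7), (12, 8)}, related_index))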
<br>
I'm willing to open source that, but there hasn't been much interest in it. I'm interested in participating in defining a next-gen vector file format as well. <br></blockquote><div><br></div><div>My spontaneous reaction to "HSG" is that it seems rather complex, and perhaps that it is aimed at other use case(s) than I had in mind.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
Cheers<br>
Ben<br>
<br>
[1] <a href="https://github.com/geometalab/Vector-Tiles-Reader-QGIS-Plugin" rel="noreferrer" target="_blank">https://github.com/geometalab/Vector-Tiles-Reader-QGIS-Plugin</a><br>
[2] <a href="https://stackoverflow.com/questions/35322700/flatbuffers-how-to-write-giant-files" rel="noreferrer" target="_blank">https://stackoverflow.com/questions/35322700/flatbuffers-how-to-write-giant-files</a><br>
<br>
<br>
On 10.12.18, 13:19, "gdal-dev on behalf of Even Rouault" <<a href="mailto:gdal-dev-bounces@lists.osgeo.org" target="_blank">gdal-dev-bounces@lists.osgeo.org</a> on behalf of <a href="mailto:even.rouault@spatialys.com" target="_blank">even.rouault@spatialys.com</a>> wrote:<br>
<br>
Björn,<br>
<br>
> In my spare time I've been working on a vector file format called<br>
> FlatGeobuf (tentatively). The main reason, besides curiosity and learning,<br>
> I'm putting time into it is that I think shapefile still rules the<br>
> read/query static data performance game, which is kind of sad, and probably<br>
> one of the reasons it is still so widely popular. Also, the main competitor<br>
> (GeoPackage) isn't suitable for streaming access (AFAIK)<br>
<br>
I suspect that you could organize the layout of a sqlite3 file to be streaming <br>
friendly, but the sqlite3 lib itself is probably not ready for that (or you <br>
have to cheat by buffering a sufficiently large number of bytes in memory and <br>
use a custom VFS to read from that buffer. At that point, implementing your own <br>
dedicated sqlite3 reader might be better. Likely doable, but not trivial). <br>
Random access is possible (you can use /vsicurl/ etc on a geopackage), but it <br>
might involve a number of seeks and small reads in the B-Tree / R-Tree.<br>
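As an illustration of that random-access path, something like the following (made-up URL; the /vsicurl/ prefix and OGR calls are the standard GDAL Python bindings) issues HTTP range requests against a remote GeoPackage, at the cost of many small seeks into the SQLite B-Tree / R-Tree pages:

# GDAL's /vsicurl/ virtual file system lets a bbox filter run against a
# remote GeoPackage without downloading the whole file.  The URL is made up.
from osgeo import ogr

ds = ogr.Open("/vsicurl/https://example.com/countries.gpkg")
layer = ds.GetLayer(0)
layer.SetSpatialFilterRect(5.0, 45.0, 15.0, 55.0)  # lon/lat bbox filter
for feature in layer:
    print(feature.GetFID(), feature.GetGeometryRef().GetGeometryName())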
<br>
That raises a major question. Which use case(s) do you want to address <br>
specifically? From what I've understood, for network access:<br>
- streaming access for progressive display as bytes come in<br>
- efficient bbox queries with minimized number of bytes and ranges in the <br>
files to read (minimizing the number of ranges of bytes is probably the most <br>
important criterion since reading 1 byte or 1000 will take the same time)<br>
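A small sketch of what minimizing the number of ranges can mean in practice (the gap threshold and names are arbitrary): coalesce the byte ranges a query selected into a few HTTP Range requests.

def coalesce_ranges(ranges, max_gap=4096):
    """ranges: list of (start, end) byte ranges, end exclusive.
    Merge ranges whose gap is small, since one request for 1 byte or for
    1000 bytes costs about the same."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(coalesce_ranges([(0, 200), (250, 400), (100000, 100500)]))
# -> [(0, 400), (100000, 100500)], i.e. two Range requests instead of three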
<br>
> I think I'm starting to get somewhere, more details and source is at<br>
> <a href="https://github.com/bjornharrtell/flatgeobuf" rel="noreferrer" target="_blank">https://github.com/bjornharrtell/flatgeobuf</a> and I have an early proof of<br>
> concept driver implementation at<br>
> <a href="https://github.com/bjornharrtell/gdal/tree/flatgeobuf" rel="noreferrer" target="_blank">https://github.com/bjornharrtell/gdal/tree/flatgeobuf</a> and results are<br>
> already quite promising - it can do roundtrip read/write and is already<br>
> quite a bit faster than shapefile. I have also implemented a naive read-only<br>
> QGIS provider for experimental purposes.<br>
<br>
What are the I and O in the I+O related to the R-Tree index?<br>
<br>
Wondering if a 3D index could be an option in case you want to address the <br>
full 3D case at some point. But might be something for later.<br>
<br>
I'm not familiar with flatbuf, but is random access in the Feature table by <br>
feature index possible (without a preliminary scan pass), similar to a <br>
.shx file in a shapefile?<br>
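As an illustration of such .shx-style random access (the layout below, a header followed by a fixed-width uint64 offset table and then the features, is hypothetical and not the actual FlatGeobuf layout):

# A fixed-width array of uint64 feature offsets right after the header lets a
# reader seek to feature i without scanning.  Layout and names are invented.
import struct

def read_feature_bytes(f, index, header_size, feature_count):
    """Return the raw bytes of feature `index` using the offset table."""
    table_start = header_size
    f.seek(table_start + index * 8)
    start, = struct.unpack("<Q", f.read(8))
    if index + 1 < feature_count:
        end, = struct.unpack("<Q", f.read(8))  # next entry follows directly
    else:
        f.seek(0, 2)           # last feature: runs to the end of the file
        end = f.tell()
    f.seek(start)
    return f.read(end - start)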
<br>
Just a detail: it could be nice to have some global flag in the header that <br>
would mean "index of any feature = its FID, its FID - 1, or no particular <br>
correlation between feature index and feature FID"<br>
<br>
For completeness of the attribute types, you could have date, time and binary. <br>
Can the concept of a null value or an empty value, or both, for a field value be <br>
encoded?<br>
<br>
The Column structure could also receive more optional info: a long <br>
description, maximum length for strings, optional precision/scale formatting <br>
for those nostalgic for decimal formatting. <br>
<br>
If you want full genericity to express an SRS, allow a WKT CRS string as an <br>
alternative to authority+code.<br>
<br>
> <br>
> Basically I'm fishing for interest in this effort, hoping that others will<br>
> see potential in it even if it's "yet another format" and far from<br>
> final/stable yet. Any feedback is welcome. As I see it, GDAL is a good<br>
> place for a format specification and reference implementation to incubate.<br>
> <br>
> Some additional food for thought that I've had during the experimentation:<br>
> <br>
> 1. The main in-memory representations of geometry/coordinates seem to be<br>
> either separate arrays per dimension (GDAL (partially?) and QGIS) or a<br>
> single array with dimension as stride. I've chosen the latter as of yet, but<br>
> I'm still a bit undecided. There is of course a substantial cost involved in<br>
> transforming between the two representations, so the situation with two<br>
> competing models is a bit unfortunate. I'm also not sure which of<br>
> these models is objectively "better" than the other?<br>
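To make the two layouts concrete, a small illustration (numpy used only for brevity; the copy is the transformation cost mentioned above):

import numpy as np

# single interleaved array, dimension as stride (x0, y0, x1, y1, ...)
interleaved = np.array([0.0, 0.0, 1.0, 1.0, 2.0, 0.5])

# separate array per dimension
xs, ys = interleaved[0::2].copy(), interleaved[1::2].copy()

# and back again
back = np.empty(xs.size * 2)
back[0::2], back[1::2] = xs, ys
assert np.array_equal(back, interleaved)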
<br>
Why not just use WKB encoding, since it likely has similar size and <br>
performance characteristics to the flatbuf encoding, with the main advantage <br>
of being widely implemented and supporting other geometry types like <br>
CircularString, PolyhedralSurface, etc., which you need if you want to fully <br>
compete with GeoPackage?<br>
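For example, a minimal 2D WKB point can be produced and parsed with nothing more than the struct module (Point only, little-endian, purely illustrative):

# WKB: one byte-order byte (1 = little-endian), a uint32 geometry type
# (1 = Point per the OGC spec), then the coordinate doubles.
import struct

def wkb_point(x, y):
    return struct.pack("<BIdd", 1, 1, x, y)

def parse_wkb_point(buf):
    byte_order, geom_type, x, y = struct.unpack("<BIdd", buf)
    assert byte_order == 1 and geom_type == 1
    return x, y

print(parse_wkb_point(wkb_point(11.97, 57.70)))  # (11.97, 57.7)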
<br>
> <br>
> 2. One could get better space efficiency with protobuf instead of<br>
> flatbuffers, but it has a performance cost. I have not gone far into<br>
> investigating how much though, and one could even consider supporting<br>
> both of these encodings in a single format specification, but that's taking it a<br>
> bit far, I think.<br>
<br>
This contradicts my above point a bit, but if you decide on a custom geometry <br>
encoding, why not allow int32 for vertex values, with an offset+scale <br>
setting? (à la OpenStreetMap PBF)<br>
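A sketch of what such an offset+scale setting could look like (the names and the 1e-7 resolution are invented, similar in spirit to OpenStreetMap PBF's granularity):

# Vertices stored as int32 relative to a per-file offset and scale, decoded
# back to doubles on read.  Lossy at the chosen resolution.
SCALE = 1e-7            # resolution of the stored coordinates, in degrees
OFFSET = (0.0, 0.0)     # could instead be the dataset's lower-left corner

def encode_vertex(x, y):
    return (round((x - OFFSET[0]) / SCALE), round((y - OFFSET[1]) / SCALE))

def decode_vertex(ix, iy):
    return (ix * SCALE + OFFSET[0], iy * SCALE + OFFSET[1])

ix, iy = encode_vertex(11.9746, 57.7089)
print(ix, iy, decode_vertex(ix, iy))   # the integers fit comfortably in int32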
<br>
If the main use case you want to address is cloud use, I was wondering if it <br>
would make sense to add a compression layer (ZSTD-compress each Feature, or <br>
a group of features?) to reduce the time spent waiting for data from the <br>
network. Or perhaps not, and just let the transport layer do the compression <br>
(HTTP Content-Encoding: gzip).<br>
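A rough sketch of the per-group trade-off, assuming the third-party zstandard Python package and fake feature payloads: compressing features in groups means a bbox query only decompresses the groups it touches, whereas transport-level gzip keeps the file itself seekable.

import zstandard

features = [(b"feature-%d " % i) * 20 for i in range(1000)]
GROUP = 100  # features per compression block

cctx = zstandard.ZstdCompressor(level=3)
blocks = [cctx.compress(b"".join(features[i:i + GROUP]))
          for i in range(0, len(features), GROUP)]

raw = sum(len(f) for f in features)
packed = sum(len(b) for b in blocks)
print("raw bytes:", raw, "zstd block bytes:", packed)

# Reading back one group only touches one block:
dctx = zstandard.ZstdDecompressor()
group0 = dctx.decompress(blocks[0])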
<br>
> 4. FlatGeobuf is perhaps too technical a name, not catchy enough, and has a<br>
> bad/boring abbreviation. Other candidates I've considered are OpenGeoFile,<br>
> OpenVectorFile or OpenSpatialFile, but I'm undecided. Any ideas? :)<br>
<br>
COF = Cloud Optimized Features ?<br>
<br>
Even<br>
<br>
-- <br>
Spatialys - Geospatial professional services<br>
<a href="http://www.spatialys.com" rel="noreferrer" target="_blank">http://www.spatialys.com</a><br>
_______________________________________________<br>
gdal-dev mailing list<br>
<a href="mailto:gdal-dev@lists.osgeo.org" target="_blank">gdal-dev@lists.osgeo.org</a><br>
<a href="https://lists.osgeo.org/mailman/listinfo/gdal-dev" rel="noreferrer" target="_blank">https://lists.osgeo.org/mailman/listinfo/gdal-dev</a><br>
<br>
</blockquote></div></div></div>