[gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

Mon Mar 11 15:07:09 PDT 2019

Den mån 11 mars 2019 kl 15:07 skrev Darafei "Komяpa" Praliaskouski <
me at komzpa.net>:

>
>
> On Mon, Mar 11, 2019 at 1:08 AM Björn Harrtell <bjorn.harrtell at gmail.com>
> wrote:
>
>> It would probably be possible to get substantial gains on existing
>> formats that have spatial indexing by making them more optimal/balanced
>> however I see the following major issues:
>>
>> * Not sure how by convention deal with non-updatable indexes
>> (packing/sorting on hilbert or sort-tile-recursive seem inherently static)
>> on formats that are assumed to be mutable
>>
>
> Why does Hilbert indexing sets a requirement for it to be non-updatable?
>

> If it's packed into same tree format, it's going to be updatable just the
> same as the previous version. Yeah, it's going to degrade a bit on changes,
> but if the tool to put sorting back in place is available then it's not a
> big issue?
>

I'm not entirely sure but it seems to me that R-tree algorithms are divided
into static or dynamic ones for good reasons. The static ones have a
predictable structure with near 100% space utilization and I believe
assumptions can be used for performance in the implementation as long as
the structure is kept intact. Dynamic ones are typically more complex
because they need a strategy for tree insertion/balancing.

See https://github.com/mourner/rbush and https://github.com/mourner/flatbush
for contemporary and minimal implementations of the dynamic / static cases
also with reference to the algorithms used.

>
>
>> * GeoPackage cannot benefit from clustering on spatial index (so it will
>> not be possible to "cloud" optimize)
>>
>
> Why? What should it get for it to be possible?
>

I'm taking Evens word for it (see earlier in this thread) but I shouldn't
have written cannot, he wrote it's likely possible but as I interpreted it
quite far from trivial. When Even writes something is not trivial, I
instinctively think it's near impossible. :)

> * Shapefile has several other drawbacks as a format perhaps making it less
>> interesting to invest into
>>
>
> Its killer feature is universal support.
>

Agreed, but I'm saying I don't want to see that being the case forever? :)

> I have wondered several times about these things in relation to PostgreSQL
>> and PostGIS... first thought is why there is no way to mark a table
>> static/immutable in PostgreSQL, it seems to be there would be several
>> general optimization possibilities if it was known that a table cannot be
>> changed, not the least that it would make it possible to construct similar
>> packed R-trees as discussed above.
>>
>
> You probably missed a link I posted,
> https://github.com/glukhovn/postgres/tree/gist_btree_buil
> <https://github.com/glukhovn/postgres/tree/gist_btree_build>d - here a
> GiST is built in similar way, without requiring read onliness of the table.
>
>
>
>> Second thought is how the R-tree is constructed/implemented via gist
>> index (I have not had the brain power/interest to dive into the source to
>> understand it yet) - unless it can be static it should probably optimally
>> be a R*-tree with topological split but I've got indications that it is not.
>>
>
> GiST is just *a* tree on top of support functions. More sophisticated
> versions of tree are limited by what callbacks Postgres provides.
>
>
> /Björn
>
> Den mån 25 feb. 2019 kl 11:06 skrev Darafei "Komяpa" Praliaskouski <
> me at komzpa.net>:
>
>> Hi Björn,
>>
>> Would it make sense to implement it not as a separate format, but as a
>> special convention on top of any of already existing ones?
>> What happens if you sort by Hilbert curve a shapefile or geopackage, and
>> tweak the internals to build indexes in flattish manner?
>>
>> For example, here is a branch of Postgres where GiST index is being built
>> in flat manner. \
>> It does not require any change on user's side, but gets you all the
>> benefits by plainly building 10x faster.
>> https://github.com/glukhovn/postgres/tree/gist_btree_build
>>
>>
>> On Sun, Feb 24, 2019 at 4:59 PM Björn Harrtell <bjorn.harrtell at gmail.com>
>> wrote:
>>
>>> I've now implemented so that the traversal of the R-tree reads nodes on
>>> demand and as suspected it gives a significant boost:
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>> gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
>>> real 0m0,378s (~ 20% improvement)
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
>>> done
>>> real 0m0,790s (~ 30% improvement)
>>>
>>> I'm now also faster on the reduced window (56 10.1 56.1) beating gpkg:
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
>>> gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single;
>>> done
>>> real 0m0,305s
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
>>> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
>>> done
>>> real 0m0,242s
>>>
>>> It still remains a TODO to further optimize I/O to read larger blocks if
>>> possible (applies both to R-tree nodes and the feature data itself) and I
>>> think that could prove a very significant performance boost potentially
>>> leaving any competition in the dust. :D
>>>
>>> /Björn
>>>
>>> Den lör 9 feb. 2019 kl 00:08 skrev Björn Harrtell <
>>> bjorn.harrtell at gmail.com>:
>>>
>>>> I've now implemented spatial filtering (at GDAL fork
>>>> https://github.com/bjornharrtell/gdal/tree/flatgeobuf) and results are
>>>> promising, for two test cases about 3,5-4 times faster than shapefile with
>>>> the same content. I've used these command lines to measure total time over
>>>> 10 iterations for each format and dataset:
>>>>
>>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>>> gis_osm_roads_free_1.shp gis_osm_roads_free_1; done
>>>> real 0m1,940s
>>>>
>>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>>> gis_osm_roads_free_1.gpkg gis_osm_roads_free_1; done
>>>> real 0m0,750s
>>>>
>>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>>> gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
>>>> real 0m0,461s
>>>>
>>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>>> gis_osm_buildings_a_free_1_single.shp gis_osm_buildings_a_free_1_single;
>>>> done
>>>> real 0m4,381s
>>>>
>>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>>> gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single;
>>>> done
>>>> real 0m1,624s
>>>>
>>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>>> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
>>>> done
>>>> real 0m1,113s
>>>>
>>>> The bounds 10 56 10.2 56.2 contains 21525 out of the 812547 total
>>>> features for gis_osm_roads_free_1 and 57699 out of 2027890 features for
>>>> gis_osm_buildings_a_free_1_single.
>>>>
>>>> And there is definitely room for optimization. The implementation is
>>>> naive and reads the whole R-tree in memory and does not yet read
>>>> consecutive features (common case as the R-tree is balanced) in larger I/O
>>>> blocks which I think can be significant. If I reduce the extent to 10 56
>>>> 10.1 56.1 GPKG is actually fair bit faster which I think is due to it being
>>>> able to read the index partially.
>>>>
>>>> /Björn
>>>>
>>>> Den tis 22 jan. 2019 kl 18:43 skrev Björn Harrtell <
>>>> bjorn.harrtell at gmail.com>:
>>>>
>>>>> After posting about my experimental format I realized that I lack
>>>>> numbers on the potential performance, so I tried to make some more or less
>>>>> scientific measuring. The results was disappointing, reaching similar
>>>>> performance as shapefile for full sequential reads and so I lost interest
>>>>> for a while. But recently I found out that my method of measuring was
>>>>> flawed - I measured the time to read into a new shapefile using ogr2ogr,
>>>>> but it turns out that can be quite unfair due to the string field length
>>>>> dynamic resize that ogr2ogr will do if the strings from input feature
>>>>> attributes are of variable length. I've now made some new measurements
>>>>> using ogrinfo and the undocumented flag -qq to get a full sequential read.
>>>>>
>>>>> I've used the shapefiles "gis_osm_buildings_a_free_1" (exploded to
>>>>> single poly) and "gis_osm_roads_free_1" from Denmark at
>>>>> http://download.geofabrik.de/europe.html as the source, converted to
>>>>> respective format, then measured the time it takes to run ogrinfo -qq on
>>>>> them. Results (average over three runs):
>>>>>
>>>>> ## gis_osm_buildings_a_free_1 (2027890 Polygons)
>>>>> * SHP: 3800 ms
>>>>> * GPKG: 2700 ms
>>>>> * FlatGeobuf: 2200 ms
>>>>>
>>>>> ## gis_osm_roads_free_1 (812547 LineStrings)
>>>>> * SHP: 1600 ms
>>>>> * GPKG: 1350 ms
>>>>> * FlatGeobuf: 800 ms
>>>>>
>>>>> With that my hope and interest is rekindled. I believe I can fare even
>>>>> better against the competition with with spatially filtered searches, will
>>>>> hopefully soon have some results on that too.
>>>>>
>>>>> /Björn
>>>>>
>>>>> Den sön 9 dec. 2018 kl 20:36 skrev Björn Harrtell <
>>>>> bjorn.harrtell at gmail.com>:
>>>>>
>>>>>> Hi GDAL/OGR folks,
>>>>>>
>>>>>> In my spare time I've been working on a vector file format called
>>>>>> FlatGeobuf (tentatively). The main reason, besides curiosity and learning,
>>>>>> I'm putting time into it is that I think shapefile still rules the
>>>>>> read/query static data performance game, which is kind of sad, and probably
>>>>>> one of the reasons it is still so widely popular. Also, the main competitor
>>>>>> (GeoPackage) isn't suitable for streaming access (AFAIK) which shapefiles
>>>>>> also handles surprisingly (?) well.
>>>>>>
>>>>>> By using a performance focused write once binary encoding
>>>>>> (flatbuffers), schema constraint and focusing on read/query performance by
>>>>>> clustering on an optimal spatial index (Packed Hilbert R-Tree) I hope to be
>>>>>> able to beat shapefile performance and at the same time be optimal for
>>>>>> streaming/cloud access.
>>>>>>
>>>>>> I think I'm starting to get somewhere, more details and source is at
>>>>>> https://github.com/bjornharrtell/flatgeobuf and I have an early
>>>>>> proof of concept driver implementation at
>>>>>> https://github.com/bjornharrtell/gdal/tree/flatgeobuf and results
>>>>>> are already quite promising - it can do roundtrip read/write and is already
>>>>>> quite a bit faster than shapefile. I also have implemented naive read only
>>>>>> QGIS provider for experimental purposes.
>>>>>>
>>>>>> Basically I'm fishing for interest in this effort, hoping that others
>>>>>> will see potential in it even if it's "yet another format" and far from
>>>>>> final/stable yet. Any feedback is welcome. As I see it, GDAL is a good
>>>>>> place for a format specification and reference implementation to incubate.
>>>>>>
>>>>>> Some additional food for thought that I've had during the
>>>>>> experimentation:
>>>>>>
>>>>>> 1. The main in memory representations of geometry/coordinates seem to
>>>>>> be either separate arrays per dimension (GDAL (partially?) and QGIS) or a
>>>>>> single array with dimension as stride. I've chosen the latter as of yet but
>>>>>> I'm still a bit undecided. There is of course a substantial involved in
>>>>>> transforming between the two representations so the situation with two
>>>>>> competing models is a bit unfortunate. I'm also not sure about which of
>>>>>> these models are objectively "better" than the other?
>>>>>>
>>>>>> 2. One could get better space efficiency with protobuf instead of
>>>>>> flatbuffers, but it has a performance cost. I have not gone far into
>>>>>> investigating how much though and one could even reason about supporting
>>>>>> both these encodings in a single format specification but it's taking it a
>>>>>> bit far I think.
>>>>>>
>>>>>> 3. Is there a reason to support different packing strategies for the
>>>>>> R-Tree or is Packed Hilbert a good compromise (besides it being beautifully
>>>>>> simple to implement)?
>>>>>>
>>>>>> 4. FlatGeobuf is perhaps a too technical name, not catchy enough and
>>>>>> has a bad/boring abbreviation. Other candidates I've considered are
>>>>>> OpenGeoFile, OpenVectorFile or OpenSpatialFile but I'm undecided. Any
>>>>>> ideas? :)
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Björn Harrtell
>>>>>>
>>>>> _______________________________________________
>>> gdal-dev mailing list
>>> gdal-dev at lists.osgeo.org
>>> https://lists.osgeo.org/mailman/listinfo/gdal-dev
>>
>>
>>
>> --
>> Darafei Praliaskouski
>> Support me: http://patreon.com/komzpa
>>
>
>
> --
> Darafei Praliaskouski
> Support me: http://patreon.com/komzpa
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/gdal-dev/attachments/20190311/2fd0ddc1/attachment-0001.html>