[gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format
Darafei "Komяpa" Praliaskouski
me at komzpa.net
Mon Mar 11 07:07:24 PDT 2019
On Mon, Mar 11, 2019 at 1:08 AM Björn Harrtell <bjorn.harrtell at gmail.com>
wrote:
> It would probably be possible to get substantial gains on existing formats
> that have spatial indexing by making them more optimal/balanced however I
> see the following major issues:
>
> * Not sure how by convention deal with non-updatable indexes
> (packing/sorting on hilbert or sort-tile-recursive seem inherently static)
> on formats that are assumed to be mutable
>
Why does Hilbert indexing sets a requirement for it to be non-updatable?
If it's packed into same tree format, it's going to be updatable just the
same as the previous version. Yeah, it's going to degrade a bit on changes,
but if the tool to put sorting back in place is available then it's not a
big issue?
> * GeoPackage cannot benefit from clustering on spatial index (so it will
> not be possible to "cloud" optimize)
>
Why? What should it get for it to be possible?
> * Shapefile has several other drawbacks as a format perhaps making it less
> interesting to invest into
>
Its killer feature is universal support.
> I have wondered several times about these things in relation to PostgreSQL
> and PostGIS... first thought is why there is no way to mark a table
> static/immutable in PostgreSQL, it seems to be there would be several
> general optimization possibilities if it was known that a table cannot be
> changed, not the least that it would make it possible to construct similar
> packed R-trees as discussed above.
>
You probably missed a link I posted,
https://github.com/glukhovn/postgres/tree/gist_btree_buil
<https://github.com/glukhovn/postgres/tree/gist_btree_build>d - here a GiST
is built in similar way, without requiring read onliness of the table.
> Second thought is how the R-tree is constructed/implemented via gist index
> (I have not had the brain power/interest to dive into the source to
> understand it yet) - unless it can be static it should probably optimally
> be a R*-tree with topological split but I've got indications that it is not.
>
GiST is just *a* tree on top of support functions. More sophisticated
versions of tree are limited by what callbacks Postgres provides.
/Björn
Den mån 25 feb. 2019 kl 11:06 skrev Darafei "Komяpa" Praliaskouski <
me at komzpa.net>:
> Hi Björn,
>
> Would it make sense to implement it not as a separate format, but as a
> special convention on top of any of already existing ones?
> What happens if you sort by Hilbert curve a shapefile or geopackage, and
> tweak the internals to build indexes in flattish manner?
>
> For example, here is a branch of Postgres where GiST index is being built
> in flat manner. \
> It does not require any change on user's side, but gets you all the
> benefits by plainly building 10x faster.
> https://github.com/glukhovn/postgres/tree/gist_btree_build
>
>
> On Sun, Feb 24, 2019 at 4:59 PM Björn Harrtell <bjorn.harrtell at gmail.com>
> wrote:
>
>> I've now implemented so that the traversal of the R-tree reads nodes on
>> demand and as suspected it gives a significant boost:
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>> gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
>> real 0m0,378s (~ 20% improvement)
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
>> done
>> real 0m0,790s (~ 30% improvement)
>>
>> I'm now also faster on the reduced window (56 10.1 56.1) beating gpkg:
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
>> gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single;
>> done
>> real 0m0,305s
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
>> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
>> done
>> real 0m0,242s
>>
>> It still remains a TODO to further optimize I/O to read larger blocks if
>> possible (applies both to R-tree nodes and the feature data itself) and I
>> think that could prove a very significant performance boost potentially
>> leaving any competition in the dust. :D
>>
>> /Björn
>>
>> Den lör 9 feb. 2019 kl 00:08 skrev Björn Harrtell <
>> bjorn.harrtell at gmail.com>:
>>
>>> I've now implemented spatial filtering (at GDAL fork
>>> https://github.com/bjornharrtell/gdal/tree/flatgeobuf) and results are
>>> promising, for two test cases about 3,5-4 times faster than shapefile with
>>> the same content. I've used these command lines to measure total time over
>>> 10 iterations for each format and dataset:
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>> gis_osm_roads_free_1.shp gis_osm_roads_free_1; done
>>> real 0m1,940s
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>> gis_osm_roads_free_1.gpkg gis_osm_roads_free_1; done
>>> real 0m0,750s
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>> gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
>>> real 0m0,461s
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>> gis_osm_buildings_a_free_1_single.shp gis_osm_buildings_a_free_1_single;
>>> done
>>> real 0m4,381s
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>> gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single;
>>> done
>>> real 0m1,624s
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
>>> done
>>> real 0m1,113s
>>>
>>> The bounds 10 56 10.2 56.2 contains 21525 out of the 812547 total
>>> features for gis_osm_roads_free_1 and 57699 out of 2027890 features for
>>> gis_osm_buildings_a_free_1_single.
>>>
>>> And there is definitely room for optimization. The implementation is
>>> naive and reads the whole R-tree in memory and does not yet read
>>> consecutive features (common case as the R-tree is balanced) in larger I/O
>>> blocks which I think can be significant. If I reduce the extent to 10 56
>>> 10.1 56.1 GPKG is actually fair bit faster which I think is due to it being
>>> able to read the index partially.
>>>
>>> /Björn
>>>
>>> Den tis 22 jan. 2019 kl 18:43 skrev Björn Harrtell <
>>> bjorn.harrtell at gmail.com>:
>>>
>>>> After posting about my experimental format I realized that I lack
>>>> numbers on the potential performance, so I tried to make some more or less
>>>> scientific measuring. The results was disappointing, reaching similar
>>>> performance as shapefile for full sequential reads and so I lost interest
>>>> for a while. But recently I found out that my method of measuring was
>>>> flawed - I measured the time to read into a new shapefile using ogr2ogr,
>>>> but it turns out that can be quite unfair due to the string field length
>>>> dynamic resize that ogr2ogr will do if the strings from input feature
>>>> attributes are of variable length. I've now made some new measurements
>>>> using ogrinfo and the undocumented flag -qq to get a full sequential read.
>>>>
>>>> I've used the shapefiles "gis_osm_buildings_a_free_1" (exploded to
>>>> single poly) and "gis_osm_roads_free_1" from Denmark at
>>>> http://download.geofabrik.de/europe.html as the source, converted to
>>>> respective format, then measured the time it takes to run ogrinfo -qq on
>>>> them. Results (average over three runs):
>>>>
>>>> ## gis_osm_buildings_a_free_1 (2027890 Polygons)
>>>> * SHP: 3800 ms
>>>> * GPKG: 2700 ms
>>>> * FlatGeobuf: 2200 ms
>>>>
>>>> ## gis_osm_roads_free_1 (812547 LineStrings)
>>>> * SHP: 1600 ms
>>>> * GPKG: 1350 ms
>>>> * FlatGeobuf: 800 ms
>>>>
>>>> With that my hope and interest is rekindled. I believe I can fare even
>>>> better against the competition with with spatially filtered searches, will
>>>> hopefully soon have some results on that too.
>>>>
>>>> /Björn
>>>>
>>>> Den sön 9 dec. 2018 kl 20:36 skrev Björn Harrtell <
>>>> bjorn.harrtell at gmail.com>:
>>>>
>>>>> Hi GDAL/OGR folks,
>>>>>
>>>>> In my spare time I've been working on a vector file format called
>>>>> FlatGeobuf (tentatively). The main reason, besides curiosity and learning,
>>>>> I'm putting time into it is that I think shapefile still rules the
>>>>> read/query static data performance game, which is kind of sad, and probably
>>>>> one of the reasons it is still so widely popular. Also, the main competitor
>>>>> (GeoPackage) isn't suitable for streaming access (AFAIK) which shapefiles
>>>>> also handles surprisingly (?) well.
>>>>>
>>>>> By using a performance focused write once binary encoding
>>>>> (flatbuffers), schema constraint and focusing on read/query performance by
>>>>> clustering on an optimal spatial index (Packed Hilbert R-Tree) I hope to be
>>>>> able to beat shapefile performance and at the same time be optimal for
>>>>> streaming/cloud access.
>>>>>
>>>>> I think I'm starting to get somewhere, more details and source is at
>>>>> https://github.com/bjornharrtell/flatgeobuf and I have an early proof
>>>>> of concept driver implementation at
>>>>> https://github.com/bjornharrtell/gdal/tree/flatgeobuf and results are
>>>>> already quite promising - it can do roundtrip read/write and is already
>>>>> quite a bit faster than shapefile. I also have implemented naive read only
>>>>> QGIS provider for experimental purposes.
>>>>>
>>>>> Basically I'm fishing for interest in this effort, hoping that others
>>>>> will see potential in it even if it's "yet another format" and far from
>>>>> final/stable yet. Any feedback is welcome. As I see it, GDAL is a good
>>>>> place for a format specification and reference implementation to incubate.
>>>>>
>>>>> Some additional food for thought that I've had during the
>>>>> experimentation:
>>>>>
>>>>> 1. The main in memory representations of geometry/coordinates seem to
>>>>> be either separate arrays per dimension (GDAL (partially?) and QGIS) or a
>>>>> single array with dimension as stride. I've chosen the latter as of yet but
>>>>> I'm still a bit undecided. There is of course a substantial involved in
>>>>> transforming between the two representations so the situation with two
>>>>> competing models is a bit unfortunate. I'm also not sure about which of
>>>>> these models are objectively "better" than the other?
>>>>>
>>>>> 2. One could get better space efficiency with protobuf instead of
>>>>> flatbuffers, but it has a performance cost. I have not gone far into
>>>>> investigating how much though and one could even reason about supporting
>>>>> both these encodings in a single format specification but it's taking it a
>>>>> bit far I think.
>>>>>
>>>>> 3. Is there a reason to support different packing strategies for the
>>>>> R-Tree or is Packed Hilbert a good compromise (besides it being beautifully
>>>>> simple to implement)?
>>>>>
>>>>> 4. FlatGeobuf is perhaps a too technical name, not catchy enough and
>>>>> has a bad/boring abbreviation. Other candidates I've considered are
>>>>> OpenGeoFile, OpenVectorFile or OpenSpatialFile but I'm undecided. Any
>>>>> ideas? :)
>>>>>
>>>>> Regards,
>>>>>
>>>>> Björn Harrtell
>>>>>
>>>> _______________________________________________
>> gdal-dev mailing list
>> gdal-dev at lists.osgeo.org
>> https://lists.osgeo.org/mailman/listinfo/gdal-dev
>
>
>
> --
> Darafei Praliaskouski
> Support me: http://patreon.com/komzpa
>
--
Darafei Praliaskouski
Support me: http://patreon.com/komzpa
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/gdal-dev/attachments/20190311/7f8798f2/attachment-0001.html>
More information about the gdal-dev
mailing list