[gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

Mon Feb 25 02:06:29 PST 2019

Hi Björn,

Would it make sense to implement it not as a separate format, but as a
special convention on top of any of already existing ones?
What happens if you sort by Hilbert curve a shapefile or geopackage, and
tweak the internals to build indexes in flattish manner?

For example, here is a branch of Postgres where GiST index is being built
in flat manner. \
It does not require any change on user's side, but gets you all the
benefits by plainly building 10x faster.
https://github.com/glukhovn/postgres/tree/gist_btree_build

On Sun, Feb 24, 2019 at 4:59 PM Björn Harrtell <bjorn.harrtell at gmail.com>
wrote:

> I've now implemented so that the traversal of the R-tree reads nodes on
> demand and as suspected it gives a significant boost:
>
> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
> gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
> real 0m0,378s (~ 20% improvement)
>
> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
> done
> real 0m0,790s (~ 30% improvement)
>
> I'm now also faster on the reduced window (56 10.1 56.1) beating gpkg:
>
> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
> gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single;
> done
> real 0m0,305s
>
> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
> done
> real 0m0,242s
>
> It still remains a TODO to further optimize I/O to read larger blocks if
> possible (applies both to R-tree nodes and the feature data itself) and I
> think that could prove a very significant performance boost potentially
> leaving any competition in the dust. :D
>
> /Björn
>
> Den lör 9 feb. 2019 kl 00:08 skrev Björn Harrtell <
> bjorn.harrtell at gmail.com>:
>
>> I've now implemented spatial filtering (at GDAL fork
>> https://github.com/bjornharrtell/gdal/tree/flatgeobuf) and results are
>> promising, for two test cases about 3,5-4 times faster than shapefile with
>> the same content. I've used these command lines to measure total time over
>> 10 iterations for each format and dataset:
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>> gis_osm_roads_free_1.shp gis_osm_roads_free_1; done
>> real 0m1,940s
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>> gis_osm_roads_free_1.gpkg gis_osm_roads_free_1; done
>> real 0m0,750s
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>> gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
>> real 0m0,461s
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>> gis_osm_buildings_a_free_1_single.shp gis_osm_buildings_a_free_1_single;
>> done
>> real 0m4,381s
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>> gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single;
>> done
>> real 0m1,624s
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
>> done
>> real 0m1,113s
>>
>> The bounds 10 56 10.2 56.2 contains 21525 out of the 812547 total
>> features for gis_osm_roads_free_1 and 57699 out of 2027890 features for
>> gis_osm_buildings_a_free_1_single.
>>
>> And there is definitely room for optimization. The implementation is
>> naive and reads the whole R-tree in memory and does not yet read
>> consecutive features (common case as the R-tree is balanced) in larger I/O
>> blocks which I think can be significant. If I reduce the extent to 10 56
>> 10.1 56.1 GPKG is actually fair bit faster which I think is due to it being
>> able to read the index partially.
>>
>> /Björn
>>
>> Den tis 22 jan. 2019 kl 18:43 skrev Björn Harrtell <
>> bjorn.harrtell at gmail.com>:
>>
>>> After posting about my experimental format I realized that I lack
>>> numbers on the potential performance, so I tried to make some more or less
>>> scientific measuring. The results was disappointing, reaching similar
>>> performance as shapefile for full sequential reads and so I lost interest
>>> for a while. But recently I found out that my method of measuring was
>>> flawed - I measured the time to read into a new shapefile using ogr2ogr,
>>> but it turns out that can be quite unfair due to the string field length
>>> dynamic resize that ogr2ogr will do if the strings from input feature
>>> attributes are of variable length. I've now made some new measurements
>>> using ogrinfo and the undocumented flag -qq to get a full sequential read.
>>>
>>> I've used the shapefiles "gis_osm_buildings_a_free_1" (exploded to
>>> single poly) and "gis_osm_roads_free_1" from Denmark at
>>> http://download.geofabrik.de/europe.html as the source, converted to
>>> respective format, then measured the time it takes to run ogrinfo -qq on
>>> them. Results (average over three runs):
>>>
>>> ## gis_osm_buildings_a_free_1 (2027890 Polygons)
>>> * SHP: 3800 ms
>>> * GPKG: 2700 ms
>>> * FlatGeobuf: 2200 ms
>>>
>>> ## gis_osm_roads_free_1 (812547 LineStrings)
>>> * SHP: 1600 ms
>>> * GPKG: 1350 ms
>>> * FlatGeobuf: 800 ms
>>>
>>> With that my hope and interest is rekindled. I believe I can fare even
>>> better against the competition with with spatially filtered searches, will
>>> hopefully soon have some results on that too.
>>>
>>> /Björn
>>>
>>> Den sön 9 dec. 2018 kl 20:36 skrev Björn Harrtell <
>>> bjorn.harrtell at gmail.com>:
>>>
>>>> Hi GDAL/OGR folks,
>>>>
>>>> In my spare time I've been working on a vector file format called
>>>> FlatGeobuf (tentatively). The main reason, besides curiosity and learning,
>>>> I'm putting time into it is that I think shapefile still rules the
>>>> read/query static data performance game, which is kind of sad, and probably
>>>> one of the reasons it is still so widely popular. Also, the main competitor
>>>> (GeoPackage) isn't suitable for streaming access (AFAIK) which shapefiles
>>>> also handles surprisingly (?) well.
>>>>
>>>> By using a performance focused write once binary encoding
>>>> (flatbuffers), schema constraint and focusing on read/query performance by
>>>> clustering on an optimal spatial index (Packed Hilbert R-Tree) I hope to be
>>>> able to beat shapefile performance and at the same time be optimal for
>>>> streaming/cloud access.
>>>>
>>>> I think I'm starting to get somewhere, more details and source is at
>>>> https://github.com/bjornharrtell/flatgeobuf and I have an early proof
>>>> of concept driver implementation at
>>>> https://github.com/bjornharrtell/gdal/tree/flatgeobuf and results are
>>>> already quite promising - it can do roundtrip read/write and is already
>>>> quite a bit faster than shapefile. I also have implemented naive read only
>>>> QGIS provider for experimental purposes.
>>>>
>>>> Basically I'm fishing for interest in this effort, hoping that others
>>>> will see potential in it even if it's "yet another format" and far from
>>>> final/stable yet. Any feedback is welcome. As I see it, GDAL is a good
>>>> place for a format specification and reference implementation to incubate.
>>>>
>>>> Some additional food for thought that I've had during the
>>>> experimentation:
>>>>
>>>> 1. The main in memory representations of geometry/coordinates seem to
>>>> be either separate arrays per dimension (GDAL (partially?) and QGIS) or a
>>>> single array with dimension as stride. I've chosen the latter as of yet but
>>>> I'm still a bit undecided. There is of course a substantial involved in
>>>> transforming between the two representations so the situation with two
>>>> competing models is a bit unfortunate. I'm also not sure about which of
>>>> these models are objectively "better" than the other?
>>>>
>>>> 2. One could get better space efficiency with protobuf instead of
>>>> flatbuffers, but it has a performance cost. I have not gone far into
>>>> investigating how much though and one could even reason about supporting
>>>> both these encodings in a single format specification but it's taking it a
>>>> bit far I think.
>>>>
>>>> 3. Is there a reason to support different packing strategies for the
>>>> R-Tree or is Packed Hilbert a good compromise (besides it being beautifully
>>>> simple to implement)?
>>>>
>>>> 4. FlatGeobuf is perhaps a too technical name, not catchy enough and
>>>> has a bad/boring abbreviation. Other candidates I've considered are
>>>> OpenGeoFile, OpenVectorFile or OpenSpatialFile but I'm undecided. Any
>>>> ideas? :)
>>>>
>>>> Regards,
>>>>
>>>> Björn Harrtell
>>>>
>>> _______________________________________________
> gdal-dev mailing list
> gdal-dev at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/gdal-dev

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/gdal-dev/attachments/20190225/6fca1f14/attachment-0001.html>