[gdal-dev] Understanding parquet vs adbc performance and SORT_BY_BBOX=YES

Even Rouault even.rouault at spatialys.com
Mon Jul 21 07:13:36 PDT 2025


On 20/07/2025 at 13:27, Michael Smith via gdal-dev wrote:
>
> Using GDAL 3.11.3:
>
> I have a dataset (Geometry: Point, Feature Count: 15546949) in Parquet 
> format (written using GDAL from an Oracle source). When doing a spatial 
> query using the GeoParquet driver, I see it accessing almost all the 
> row groups of the dataset (PARQUET: 155/156 row groups selected) with 
> a spatial filter fetching 12000 of the 15M points, and it takes 
> 0m18.794s. When accessing via ADBC and libduckdb, it takes 0m7.102s 
> (but it also uses 7x the CPU and about 10x the memory, judging from top).
>
> I then rewrote the dataset using -lco SORT_BY_BBOX=YES. The Parquet 
> driver then reports PARQUET: 9/238 row groups selected, and the time 
> drops to 0m1.412s. Using ADBC and libduckdb, the performance doesn’t 
> change.
>
> For proper performance with gdal, is SORT_BY_BBOX=YES always needed?
>
Yes, unless your features are already spatially sorted. It is a bit 
strange that you don't see improvements with the ADBC driver, as it does 
push the spatial filter bbox into the SQL request, so that's perhaps a 
limitation in how DuckDB itself deals with such filters.
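
To illustrate why the "row groups selected" count drops so sharply once the features are spatially sorted, here is a rough standalone sketch in plain Python. It is not GDAL code. It assumes a reader can skip a row group only when that group's bounding box lies entirely outside the filter bbox. The Morton (Z-order) key is just a convenient stand-in for a spatial sort; it is not necessarily what SORT_BY_BBOX=YES does internally. The point count, group size, and query window are made-up values.

```python
import random


def row_group_bboxes(points, group_size):
    """Split points into consecutive row groups; return each group's bbox."""
    boxes = []
    for i in range(0, len(points), group_size):
        chunk = points[i:i + group_size]
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


def intersects(a, b):
    """True if bboxes a and b (xmin, ymin, xmax, ymax) overlap."""
    return a[0] <= b[2] and a[2] >= b[0] and a[1] <= b[3] and a[3] >= b[1]


def count_selected(points, group_size, query):
    """Number of row groups whose bbox intersects the query bbox."""
    boxes = row_group_bboxes(points, group_size)
    return sum(1 for box in boxes if intersects(box, query))


def morton_key(p, bits=16):
    """Interleave the bits of quantized x and y (both assumed in [0, 1))."""
    x = int(p[0] * (1 << bits))
    y = int(p[1] * (1 << bits))
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


random.seed(0)
# Hypothetical data: uniformly scattered points written in arbitrary order.
points = [(random.random(), random.random()) for _ in range(100_000)]
query = (0.40, 0.40, 0.43, 0.43)  # small spatial filter window
group_size = 1000

unsorted_hits = count_selected(points, group_size, query)
sorted_hits = count_selected(sorted(points, key=morton_key), group_size, query)

total = len(row_group_bboxes(points, group_size))
print(f"unsorted: {unsorted_hits}/{total} row groups selected")
print(f"sorted:   {sorted_hits}/{total} row groups selected")
```

With randomly ordered points, nearly every row group's bbox spans the whole extent, so almost nothing can be skipped. After sorting, each group covers a compact area and only a few intersect the filter.
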
>
> -- 
>
> Michael Smith
>
> RSGIS Center – ERDC CRREL NH
>
> US Army Corps
>
>
> _______________________________________________
> gdal-dev mailing list
> gdal-dev at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/gdal-dev

-- 
http://www.spatialys.com
My software is free, but my time generally not.