[gdal-dev] Getting summary data from parquet with sql
Even Rouault
even.rouault at spatialys.com
Tue Feb 17 11:41:04 PST 2026
Fixed per https://github.com/OSGeo/gdal/pull/13941 . I would expect the
performance to be better than your GDAL Python snippet, because OGR SQL
sets ignored fields so that only the geometry and file_size columns are
read, which will reduce I/O if there are many attribute fields.
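
For illustration, here is a minimal sketch of both approaches in the GDAL Python bindings. It is not taken from the pull request. The helper names (fields_to_ignore, build_sum_sql, sum_by_iteration, sum_by_sql) are hypothetical, and the sketch assumes the layer reports the OLCIgnoreFields capability and that aoi_wkb is a plain bytes object. The ExecuteSQL variant is expected to work only on a GDAL build that includes the fix above; older builds raise the RuntimeError quoted below.

```python
def fields_to_ignore(all_names, keep):
    """Return the attribute names that are not needed for the computation."""
    return [name for name in all_names if name not in keep]


def build_sum_sql(layer_name, field):
    """Build an OGR SQL statement summing one field of one layer."""
    return f'SELECT SUM("{field}") FROM "{layer_name}"'


def sum_by_iteration(parquet_file, aoi_wkb, field="file_size"):
    """Sum `field` over features intersecting the AOI, skipping unused columns."""
    # Imported lazily so the pure helpers above work without GDAL installed.
    from osgeo import gdal, ogr

    gdal.UseExceptions()
    ds = gdal.OpenEx(f"PARQUET:{parquet_file}", gdal.OF_VECTOR)
    lay = ds.GetLayer()
    defn = lay.GetLayerDefn()
    names = [defn.GetFieldDefn(i).GetName() for i in range(defn.GetFieldCount())]
    # Ask the driver to skip every attribute except the one being summed.
    # The geometry is left alone because the spatial filter needs it.
    if lay.TestCapability(ogr.OLCIgnoreFields):
        lay.SetIgnoredFields(fields_to_ignore(names, {field}))
    lay.SetSpatialFilter(ogr.CreateGeometryFromWkb(aoi_wkb))
    return sum(feat.GetFieldAsInteger64(field) for feat in lay)


def sum_by_sql(parquet_file, aoi_wkb, field="file_size"):
    """Same sum through ExecuteSQL with a spatial filter (needs the fix)."""
    from osgeo import gdal, ogr

    gdal.UseExceptions()
    ds = gdal.OpenEx(f"PARQUET:{parquet_file}", gdal.OF_VECTOR)
    sql = build_sum_sql(ds.GetLayer().GetName(), field)
    res = ds.ExecuteSQL(sql, ogr.CreateGeometryFromWkb(aoi_wkb))
    try:
        # The result layer holds a single feature with the aggregate value.
        return res.GetNextFeature().GetField(0)
    finally:
        ds.ReleaseResultSet(res)
```

With a shapely geometry such as the aoi in the quoted message, a call would look like sum_by_sql(parquet_file, bytes(aoi.wkb)).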
On 17/02/2026 at 11:48, Michael Smith via gdal-dev wrote:
> I wanted to get a sum of the value of a column using a spatial filter on a parquet file. I can easily do this with duckdb but I was trying via gdal.
> I was able to do it by fetching features, but not with ExecuteSQL alone, as the spatial filter part wouldn’t find the geometry column unless it was part of the query.
>
> This worked:
> gf = gdal.OpenEx(f'PARQUET:{parquet_file}')
> lay = gf.GetLayer()
> lay.SetSpatialFilter(ogr.CreateGeometryFromWkb(aoi.wkb))
> totsize_bytes += sum([feat.GetFieldAsInteger64('file_size') for feat in lay])
>
> This didn’t:
>
> res = gf.ExecuteSQL('select sum(file_size) from "parquet-file"', ogr.CreateGeometryFromWkb(aoi.wkb))
> RuntimeError: Cannot set spatial filter: no geometry field present in layer.
>
> Is this just a limitation of OGR SQL?
>
> Via duckdb:
> wkb_bytes = aoi.wkb.tobytes()
> sql = f"select sum(file_size) from read_parquet('{str(parquet_file)}') where ST_Intersects_Extent(geometry, ST_GeomFromWKB(?))"
> params = [wkb_bytes]
>
> Performance difference:
> gdal: size: 1139758617, time: 0:01:37.471977
> duck: size: 1139758617, time: 0:00:15.171584
>
>
--
http://www.spatialys.com
My software is free, but my time generally not.