[gdal-dev] Attribute filter on remote Parquet file is slow

Daniel Baston dbaston at gmail.com
Wed Aug 28 16:36:12 PDT 2024


Hi Even,

Thanks for the suggestions. I've played around with it a bit more
without much success.

> No you didn't do anything obviously wrong. I'm not sure that in the
> ArrowDataset mode libarrow actually uses group statistics to filter out
> row groups, which might cause it to actually ingest the whole files

After instrumenting it a bit, it does appear to filter out some row
groups, but not very effectively: 59 record batches are downloaded,
only one of which contains the 14 desired features. The ranges
requested are not sequential and do not cover the entire file, yet the
fetching process is much slower than downloading the 1.2 GB file
directly and querying it locally.
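
One possible explanation for so many batches being kept (an assumption,
not something confirmed for this file) is that the filtered attribute
is not the column the file is ordered by, so the min/max statistics of
most row groups span the requested value. The sketch below is plain
Python, not GDAL or libarrow code; the function name and the statistics
are made up purely to illustrate how min/max pruning behaves.

```python
from typing import List, Tuple

# Hypothetical (min, max) statistics of the filtered column, one per row group.
RowGroupStats = Tuple[str, str]


def candidate_row_groups(stats: List[RowGroupStats], value: str) -> List[int]:
    """Return indices of row groups that min/max pruning cannot rule out
    for an equality filter `column = value`."""
    return [i for i, (lo, hi) in enumerate(stats) if lo <= value <= hi]


# File sorted by the filtered column: ranges are disjoint, pruning works well.
sorted_stats = [("A", "F"), ("G", "M"), ("N", "S"), ("T", "Z")]

# File ordered by something else (e.g. location): ranges overlap heavily,
# so almost every row group remains a candidate.
unsorted_stats = [("A", "Z"), ("B", "Y"), ("G", "M"), ("A", "X")]

print(candidate_row_groups(sorted_stats, "H"))
print(candidate_row_groups(unsorted_stats, "H"))
```

With the disjoint ranges only one row group survives the filter; with
the overlapping ranges all four do, even if only one actually contains
matching rows.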

>
> You may also try to tune the config options at
> https://github.com/OSGeo/gdal/blob/master/ogr/ogrsf_frmts/parquet/ogrparquetdatasetlayer.cpp#L522-L558
> do you observe a similar difference if you work with just a simple file
> like
> /vsis3/overturemaps-us-west-2/release/2024-08-20.0/theme=divisions/type=division_area/part-00000-5466202d-8cdf-48e5-9aee-886c73dafc5f-c000.zstd.parquet

I'm not seeing a performance difference by doing this, beyond the
(fairly high) run-to-run variability.

Dan

