[gdal-dev] Attribute filter on remote Parquet file is slow

Even Rouault even.rouault at spatialys.com
Wed Aug 28 10:01:19 PDT 2024


Dan,

No, you didn't do anything obviously wrong. I'm not sure that in the 
ArrowDataset mode libarrow actually uses row group statistics to filter 
out row groups, which might cause it to ingest the whole files.
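
To make the idea of statistics-based row group filtering concrete, here is a 
minimal sketch in plain Python. It is not how libarrow or GDAL implement it; 
the RowGroupStats class, the helper names and the min/max values are all 
hypothetical, made up for illustration. It only shows why a reader that 
consults per-row-group min/max statistics can skip most of a file for a 
filter such as region='US-VT'.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical min/max statistics of one column in one row group."""
    index: int
    min_value: str
    max_value: str


def may_contain(stats: RowGroupStats, value: str) -> bool:
    # A row group can only hold `value` if it lies within [min, max].
    return stats.min_value <= value <= stats.max_value


def row_groups_to_read(all_stats: list[RowGroupStats], value: str) -> list[int]:
    # Keep only the row groups whose statistics do not rule out the value.
    return [s.index for s in all_stats if may_contain(s, value)]


# Invented statistics for a "region" column, assuming the data is sorted.
stats = [
    RowGroupStats(0, "AR-A", "CA-ON"),
    RowGroupStats(1, "US-AK", "US-NY"),
    RowGroupStats(2, "US-OH", "US-WY"),
    RowGroupStats(3, "ZA-EC", "ZW-MW"),
]

selected = row_groups_to_read(stats, "US-VT")
print(selected)
```

With these invented values only one of the four row groups has to be 
fetched. A reader that ignores the statistics has to download and decode all 
of them before applying the filter.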

You may also try to tune the config options at 
https://github.com/OSGeo/gdal/blob/master/ogr/ogrsf_frmts/parquet/ogrparquetdatasetlayer.cpp#L522-L558

Do you observe a similar difference if you work with just a simple file 
like 
/vsis3/overturemaps-us-west-2/release/2024-08-20.0/theme=divisions/type=division_area/part-00000-5466202d-8cdf-48e5-9aee-886c73dafc5f-c000.zstd.parquet 
?
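
One way to run that comparison is sketched below in Python. It reuses the 
-select and -where arguments from the original command and only swaps the 
source for the single file. The helper names and the output path in the 
usage note are hypothetical, and ogr2ogr is assumed to be on the PATH with 
S3 access already configured.

```python
import subprocess
import time

# Single Parquet file suggested above, instead of the whole dataset directory.
SINGLE_FILE = (
    "/vsis3/overturemaps-us-west-2/release/2024-08-20.0/"
    "theme=divisions/type=division_area/"
    "part-00000-5466202d-8cdf-48e5-9aee-886c73dafc5f-c000.zstd.parquet"
)

# Same attribute filter as in the original ogr2ogr invocation.
WHERE = "subtype='county' AND country='US' AND region='US-VT'"


def build_ogr2ogr_command(source: str, destination: str) -> list[str]:
    """Build the ogr2ogr argument list for the given source and output."""
    return [
        "ogr2ogr", destination, source,
        "-select", "id,division_id,names.primary",
        "-where", WHERE,
        "--config", "CPL_DEBUG", "ON",  # keep the PARQUET debug messages
    ]


def time_command(command: list[str]) -> float:
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start
```

For example, time_command(build_ogr2ogr_command(SINGLE_FILE, 
"/tmp/vt_single.geojson")) gives the duration for the single file. That 
file may not contain any matching feature, so the output can be empty; the 
timing and the amount of data downloaded are what matter here.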

Even

On 28/08/2024 at 18:45, Daniel Baston via gdal-dev wrote:
> Hello,
>
> I'm trying to use ogr2ogr with an attribute filter to pull 14 polygons
> from Overture maps. Running the following command with CPL_DEBUG=ON
> tells me that "PARQUET: Attribute filter fully translated to Arrow"
> yet it takes about 7 minutes to complete, and appears to download
> quite a bit of data:
>
> ogr2ogr /tmp/vt.geojson
> "PARQUET:/vsis3/overturemaps-us-west-2/release/2024-08-20.0/theme=divisions/type=division_area"
> -select "id,division_id,names.primary" -where "subtype='county' AND
> country='US' AND region='US-VT'"
>
> Have I made a mistake in my ogr2ogr invocation? For comparison,
> running what I believe to be an equivalent query in DuckDB takes about
> 10 seconds:
>
> SELECT
>        id,
>        division_id,
>        names.primary,
>        ST_GeomFromWKB(geometry) as geometry
>        FROM
>            read_parquet('s3://overturemaps-us-west-2/release/2024-08-20.0/theme=divisions/type=division_area/*',
> hive_partitioning=1)
>        WHERE
>            subtype = 'county'
>            AND country = 'US'
>            AND region = 'US-VT';
>
> I am using GDAL master (e09d07a7) and libarrow 16.1.
>
> Thanks,
> Dan

-- 
http://www.spatialys.com
My software is free, but my time generally not.