[gdal-dev] Attribute filter on remote Parquet file is slow
Even Rouault
even.rouault at spatialys.com
Wed Aug 28 10:01:19 PDT 2024
Dan,
No, you didn't do anything obviously wrong. I'm not sure that, in
ArrowDataset mode, libarrow actually uses row group statistics to skip
row groups. If it doesn't, it may end up ingesting the whole files.
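To make the row group point concrete, here is a minimal sketch of what min/max pruning amounts to. It is not libarrow's implementation: the RowGroupStats class, the helper names and the statistics values are all made up for illustration, and it only covers an equality filter on one string column.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Min/max statistics of one string column for one row group (made-up values)."""
    index: int
    min_value: str
    max_value: str


def may_contain(stats: RowGroupStats, wanted: str) -> bool:
    # An equality filter can only match inside a row group whose
    # [min, max] range includes the wanted value.
    return stats.min_value <= wanted <= stats.max_value


def row_groups_to_read(all_stats, wanted):
    # Keep only the row groups that cannot be ruled out by their statistics.
    return [s.index for s in all_stats if may_contain(s, wanted)]


# Hypothetical statistics for a "region" column, not taken from the Overture files.
stats = [
    RowGroupStats(0, "AR-A", "CA-YT"),
    RowGroupStats(1, "US-AK", "US-WY"),
    RowGroupStats(2, "UY-AR", "ZW-MW"),
]

selected = row_groups_to_read(stats, "US-VT")
print(selected)
```

A reader that applies this check only has to fetch the surviving row groups. A reader that does not has to download and decode every row group before evaluating the filter, which would be consistent with the amount of data you see being transferred.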
You may also try to tune the config options at
https://github.com/OSGeo/gdal/blob/master/ogr/ogrsf_frmts/parquet/ogrparquetdatasetlayer.cpp#L522-L558
Do you observe a similar difference if you work with just a single file,
such as
/vsis3/overturemaps-us-west-2/release/2024-08-20.0/theme=divisions/type=division_area/part-00000-5466202d-8cdf-48e5-9aee-886c73dafc5f-c000.zstd.parquet
?
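For the single-file test, something like the following could be used to build the command line. It is only a sketch: it assumes the same -select and -where as in your original command, the output path is a placeholder, and I have not checked whether this particular part file contains the Vermont rows.

```python
import shlex

# Single file from the same release, as given above.
single_file = (
    "/vsis3/overturemaps-us-west-2/release/2024-08-20.0/"
    "theme=divisions/type=division_area/"
    "part-00000-5466202d-8cdf-48e5-9aee-886c73dafc5f-c000.zstd.parquet"
)

argv = [
    "ogr2ogr",
    "/tmp/vt_single.geojson",  # placeholder output path
    single_file,
    "-select", "id,division_id,names.primary",
    "-where", "subtype='county' AND country='US' AND region='US-VT'",
]

# Print a shell-safe command line that can be pasted into a terminal
# and timed, with CPL_DEBUG=ON set in the environment as before.
print(shlex.join(argv))
```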
Even
On 28/08/2024 at 18:45, Daniel Baston via gdal-dev wrote:
> Hello,
>
> I'm trying to use ogr2ogr with an attribute filter to pull 14 polygons
> from Overture maps. Running the following command with CPL_DEBUG=ON
> tells me that "PARQUET: Attribute filter fully translated to Arrow"
> yet it takes about 7 minutes to complete, and appears to download
> quite a bit of data:
>
> ogr2ogr /tmp/vt.geojson \
>   "PARQUET:/vsis3/overturemaps-us-west-2/release/2024-08-20.0/theme=divisions/type=division_area" \
>   -select "id,division_id,names.primary" \
>   -where "subtype='county' AND country='US' AND region='US-VT'"
>
> Have I made a mistake in my ogr2ogr invocation? For comparison,
> running what I believe to be an equivalent query in DuckDB takes about
> 10 seconds:
>
> SELECT
> id,
> division_id,
> names.primary,
> ST_GeomFromWKB(geometry) as geometry
> FROM
> read_parquet('s3://overturemaps-us-west-2/release/2024-08-20.0/theme=divisions/type=division_area/*',
> hive_partitioning=1)
> WHERE
> subtype = 'county'
> AND country = 'US'
> AND region = 'US-VT';
>
> I am using GDAL master (e09d07a7) and libarrow 16.1.
>
> Thanks,
> Dan
> _______________________________________________
> gdal-dev mailing list
> gdal-dev at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/gdal-dev
--
http://www.spatialys.com
My software is free, but my time generally not.