[gdal-dev] Attribute filter on remote Parquet file is slow

Scott public at postholer.com
Wed Aug 28 10:27:59 PDT 2024


I could be completely wrong here.

My understanding is that DuckDB uses httpfs, or possibly some variant of fsspec.

I believe /vsis3/ uses only libcurl, which doesn't *appear* to have
support for httpfs.

Again, I could be wildly wrong.
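
For anyone who wants to reproduce the timing, here is a minimal Python sketch that rebuilds the ogr2ogr call from the quoted message below as an argument list, with the mail-wrapped lines rejoined, and runs it only on request. The dataset path and the -select and -where values are copied from Daniel's message. The helper names (build_ogr2ogr_args, run_with_debug) are my own and hypothetical, not part of GDAL, and setting CPL_DEBUG through the environment is my assumption about how he enabled it.

```python
import os
import shlex
import subprocess

# Values copied from the quoted message; the helper names are hypothetical.
SOURCE = ("PARQUET:/vsis3/overturemaps-us-west-2/release/2024-08-20.0/"
          "theme=divisions/type=division_area")
WHERE = "subtype='county' AND country='US' AND region='US-VT'"


def build_ogr2ogr_args(dst="/tmp/vt.geojson"):
    """Return the ogr2ogr call as an argument list (no shell quoting needed)."""
    return [
        "ogr2ogr", dst, SOURCE,
        "-select", "id,division_id,names.primary",
        "-where", WHERE,
    ]


def run_with_debug(args):
    """Run the command with CPL_DEBUG=ON in the environment (assumed setup)."""
    env = dict(os.environ, CPL_DEBUG="ON")
    return subprocess.run(args, env=env, check=True)


if __name__ == "__main__":
    args = build_ogr2ogr_args()
    print(shlex.join(args))  # show the command as a single shell line
    # run_with_debug(args)   # uncomment to run it (needs GDAL and network access)
```

Passing the arguments as a list avoids any shell quoting problems with the single quotes inside the -where expression.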

On 8/28/24 09:45, Daniel Baston via gdal-dev wrote:
> Hello,
> 
> I'm trying to use ogr2ogr with an attribute filter to pull 14 polygons
> from Overture maps. Running the following command with CPL_DEBUG=ON
> tells me that "PARQUET: Attribute filter fully translated to Arrow"
> yet it takes about 7 minutes to complete, and appears to download
> quite a bit of data:
> 
> ogr2ogr /tmp/vt.geojson
> "PARQUET:/vsis3/overturemaps-us-west-2/release/2024-08-20.0/theme=divisions/type=division_area"
> -select "id,division_id,names.primary" -where "subtype='county' AND
> country='US' AND region='US-VT'"
> 
> Have I made a mistake in my ogr2ogr invocation? For comparison,
> running what I believe to be an equivalent query in DuckDB takes about
> 10 seconds:
> 
> SELECT
>        id,
>        division_id,
>        names.primary,
>        ST_GeomFromWKB(geometry) as geometry
>        FROM
>            read_parquet('s3://overturemaps-us-west-2/release/2024-08-20.0/theme=divisions/type=division_area/*',
> hive_partitioning=1)
>        WHERE
>            subtype = 'county'
>            AND country = 'US'
>            AND region = 'US-VT';
> 
> I am using GDAL master (e09d07a7) and libarrow 16.1.
> 
> Thanks,
> Dan
> _______________________________________________
> gdal-dev mailing list
> gdal-dev at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/gdal-dev

