[gdal-dev] Reading from (geo)parquet using mixed spatial and non-spatial filters
Ari Jolma
ari.jolma at gmail.com
Sun Jan 18 08:01:44 PST 2026
Even Rouault wrote on 18.1.2026 at 16:50:
> Ari,
>
>> I need to read from a large Parquet file (10-20 GB, in S3) features
>> using a set of user defined constraints that I can parse into
>> non-spatial SQL and polygon masks. My tests so far show good
>> performance with a single non-spatial constraint and (separately)
>> with a bbox.
>
> Do you mean you get bad performance when setting both
> SetAttributeFilter() and SetSpatialFilter[Rect]() ? I cannot explain
> that. Combining them should not be less performant.
No, I'm just looking for how best to mix spatial and non-spatial
filters/constraints when retrieving features from a Parquet file using GDAL.
>
> You don't mention if your geoparquet files have a covering bounding
> box column. For the default WKB encoding, this is essential to avoid
> full scan of the file.
I don't know about that - will check - but the basic
SetSpatialFilterRect on a GDAL Python layer works fine.
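
One way to check for the covering bounding box column is to look at the "geo" key in the Parquet file metadata. The sketch below assumes the GeoParquet 1.1 layout, where each geometry column may carry a "covering" entry whose "bbox" members point at the fields of a struct column. The function name is hypothetical, and the JSON string is assumed to have been fetched separately, for example with pyarrow via pyarrow.parquet.read_schema(path).metadata[b"geo"].

```python
import json


def bbox_covering_columns(geo_metadata_json):
    """Map each geometry column to the name of its bbox covering column.

    geo_metadata_json is the value of the "geo" key in the Parquet file
    metadata (str or bytes). Geometry columns without a bbox covering
    are left out of the result.
    """
    meta = json.loads(geo_metadata_json)
    result = {}
    for name, column in meta.get("columns", {}).items():
        bbox = column.get("covering", {}).get("bbox")
        if bbox:
            # Each bbox member is a path such as ["bbox", "xmin"];
            # the first element is the covering struct column.
            result[name] = bbox["xmin"][0]
    return result
```

An empty result would suggest that no geometry column declares a bbox covering, in which case a spatial filter on WKB geometries probably cannot skip row groups.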
>
>
>> However, I am not sure how to go forward with mixing non-spatial
>> constraints and perhaps multiple arbitrary polygons (which may be
>> non-adjacent).
> If you have something like attr_filter && (Intersects(geom, poly1) ||
> Intersects(geom, poly2)) , then you should do separately attr_filter
> && Intersects(geom, poly1) and then attr_filter && Intersects(geom,
> poly2)
Ok, so the attr_filter is not expensive even though it is applied twice.
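
A minimal sketch of that approach, as I understand it: set the attribute filter once, then run one spatial query per polygon and merge the results. The function name is hypothetical. It assumes the layer and polygons are OGR objects from the GDAL Python bindings (ogr.Layer, ogr.Geometry) and that feature IDs are stable across reads, so they can be used to drop duplicates when polygons overlap. The explicit Intersects() check is kept as a safeguard in case the spatial filter is only evaluated on bounding boxes.

```python
def features_matching(layer, attr_filter, polygons):
    """Yield features that pass attr_filter and intersect any polygon.

    layer is expected to behave like an ogr.Layer and each polygon like
    an ogr.Geometry; only methods of those objects are used here.
    """
    seen = set()  # FIDs already yielded, to avoid duplicates
    layer.SetAttributeFilter(attr_filter)
    try:
        for poly in polygons:
            # One spatial query per polygon, with the same attribute filter.
            layer.SetSpatialFilter(poly)
            layer.ResetReading()
            for feat in layer:
                fid = feat.GetFID()
                if fid in seen:
                    continue
                geom = feat.GetGeometryRef()
                # Exact test against the polygon itself.
                if geom is not None and geom.Intersects(poly):
                    seen.add(fid)
                    yield feat
    finally:
        # Leave the layer unfiltered for later use.
        layer.SetSpatialFilter(None)
        layer.SetAttributeFilter(None)
```

Usage would be something like features_matching(ds.GetLayer(0), "some_field > 10", [poly1, poly2]), where ds comes from ogr.Open() and some_field is a placeholder attribute name.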
>
>>
>> GDAL SQL docs tell me that with Spatialite built-in I could use
>> ST_Intersects but does that help with Parquet files?
> No, because that wouldn't translate as a SetSpatialFilter[Rect]()
> request, and thus you would get full scan of the file
Ok, I assumed that too.
>
>> How about constructing the non-spatial SQL query first, use that on
>> dataset, and then use SetSpatialFilterRect on the resulting layer
>> object possibly multiple times plus ogr.Geometry.Intersects on each
>> feature coming from the obtained layer? My intuition would tell me to
>> first do the spatial filtering as that (may) narrow down the search
>> considerably. But then I cannot use the non-spatial SQL as that
>> requires a dataset to be executed on.
>
> You could store the result of the spatial request in a temporary
> dataset (possibly in memory) and then apply the attribute filter. But
> as said above, I'm a bit surprised that combining the attribute filter
> and a (single geometry) spatial filter isn't efficient.
Maybe I was not clear: at this point I'm simply wondering how best to
combine the attribute filter and the spatial filter.
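
For reference, the temporary-dataset idea could look roughly like the sketch below. The function name and the default layer name are hypothetical. It assumes src_layer is an ogr.Layer on the Parquet file and tmp_ds is a writable OGR dataset, for example one created with the OGR in-memory driver (the driver name depends on the GDAL version). It relies on Dataset.CopyLayer() reading through the source layer, so that the spatial filter set on it is honoured.

```python
def spatial_subset_then_attr(src_layer, tmp_ds, bbox, attr_filter,
                             name="subset"):
    """Copy the features inside bbox into tmp_ds, then filter by attribute.

    bbox is (minx, miny, maxx, maxy). The returned layer belongs to
    tmp_ds, so the caller has to keep tmp_ds alive while using it.
    """
    minx, miny, maxx, maxy = bbox
    src_layer.SetSpatialFilterRect(minx, miny, maxx, maxy)
    try:
        # Only features passing the spatial filter are copied.
        tmp_layer = tmp_ds.CopyLayer(src_layer, name)
    finally:
        # Restore the source layer to its unfiltered state.
        src_layer.SetSpatialFilter(None)
    # The attribute filter now runs on the (smaller) temporary layer.
    tmp_layer.SetAttributeFilter(attr_filter)
    return tmp_layer
```

Whether this beats setting both filters directly on the Parquet layer is something I would have to measure.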
>
> Instead of the Parquet driver, you may also try with duckdb and the
> ADBC driver. The duckdb SQL engine generally outperforms
> libarrow/libparquet.
Hm, the Parquet files are a given at this point - I'm doing
consultancy/development for a client and Parquet is their choice, so I
guess I'm in the developer role now. :)
>
> Even
>
Thanks,
Ari