[gdal-dev] Reading from (geo)parquet using mixed spatial and non-spatial filters
Even Rouault
even.rouault at spatialys.com
Sun Jan 18 06:50:04 PST 2026
Ari,
> I need to read from a large Parquet file (10-20 GB, in S3) features
> using a set of user defined constraints that I can parse into
> non-spatial SQL and polygon masks. My tests so far show good
> performance with a single non-spatial constraint and (separately) with
> a bbox.
Do you mean you get bad performance when setting both
SetAttributeFilter() and SetSpatialFilter[Rect]() ? I cannot explain
that. Combining them should not be less performant.
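For reference, a minimal sketch of setting both filters on the same layer with the Python bindings. The function name read_with_both_filters and its parameters are illustrative, not part of GDAL; layer is assumed to be an ogr.Layer opened with the Parquet driver, and mask an ogr.Geometry in the layer's SRS.

```python
def read_with_both_filters(layer, attr_sql, mask):
    """Yield the features of `layer` that pass both filters.

    layer    -- an osgeo.ogr.Layer (assumed: opened with the Parquet driver)
    attr_sql -- an OGR SQL WHERE clause, e.g. "height > 10" (hypothetical field)
    mask     -- an osgeo.ogr.Geometry used as the spatial filter
    """
    # Both filters are active at the same time; the driver decides how
    # much of each it can push down to the file.
    layer.SetAttributeFilter(attr_sql)
    layer.SetSpatialFilter(mask)
    try:
        layer.ResetReading()
        for feature in layer:
            yield feature
    finally:
        # Clear the filters so later reads of the layer are unaffected.
        layer.SetAttributeFilter(None)
        layer.SetSpatialFilter(None)
```

With a real file, layer would typically come from something like gdal.OpenEx(path).GetLayer(0).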
You don't mention whether your GeoParquet files have a covering bounding
box column. For the default WKB encoding, this is essential to avoid a
full scan of the file.
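One way to check for the covering column is to look at the "geo" key of the Parquet file metadata, where GeoParquet 1.1 records a "covering" entry per geometry column. The sketch below only parses that JSON; the helper name bbox_covering_columns is hypothetical, and how you obtain the metadata string is left to the caller.

```python
import json


def bbox_covering_columns(geo_metadata):
    """Map each geometry column to its bbox covering spec, or None if absent.

    geo_metadata -- the value of the "geo" file metadata key (str or bytes)
    """
    meta = json.loads(geo_metadata)
    result = {}
    for name, column in meta.get("columns", {}).items():
        # GeoParquet 1.1 stores the covering as columns.<name>.covering.bbox
        result[name] = column.get("covering", {}).get("bbox")
    return result
```

With pyarrow installed, the metadata string could be read with pyarrow.parquet.read_schema(path).metadata[b"geo"]. A None value for a WKB-encoded column means there is no bounding box column to prune row groups with.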
> However, I am not sure how to go forward with mixing non-spatial
> constraints and perhaps multiple arbitrary polygons (which may be
> non-adjacent).
If you have something like attr_filter && (Intersects(geom, poly1) ||
Intersects(geom, poly2)), then you should issue two separate requests,
attr_filter && Intersects(geom, poly1) and then attr_filter &&
Intersects(geom, poly2), and merge the results.
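The per-polygon approach above can be sketched as one request per polygon, with the attribute filter kept in place and duplicates dropped by feature ID. The function name read_any_polygon is illustrative. The sketch assumes the layer reports the same feature ID for a given row across requests, which is worth verifying on your file, and it re-checks each geometry with Intersects() in case the layer's spatial filter is coarser than an exact test.

```python
def read_any_polygon(layer, attr_sql, polygons):
    """Return {fid: feature} for features that match attr_sql and
    intersect at least one geometry in `polygons`.

    layer    -- an osgeo.ogr.Layer
    attr_sql -- an OGR SQL WHERE clause
    polygons -- iterable of osgeo.ogr.Geometry masks
    """
    results = {}
    layer.SetAttributeFilter(attr_sql)
    try:
        for poly in polygons:
            # One spatially filtered request per polygon.
            layer.SetSpatialFilter(poly)
            layer.ResetReading()
            for feature in layer:
                fid = feature.GetFID()
                if fid in results:
                    continue  # already found through an earlier polygon
                geom = feature.GetGeometryRef()
                if geom is not None and geom.Intersects(poly):
                    results[fid] = feature
    finally:
        # Clear the filters so later reads of the layer are unaffected.
        layer.SetAttributeFilter(None)
        layer.SetSpatialFilter(None)
    return results
```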
>
> GDAL SQL docs tell me that with Spatialite built-in I could use
> ST_Intersects but does that help with Parquet files?
No, because that wouldn't translate into a SetSpatialFilter[Rect]()
request, and thus you would get a full scan of the file.
> How about constructing the non-spatial SQL query first, use that on
> dataset, and then use SetSpatialFilterRect on the resulting layer
> object possibly multiple times plus ogr.Geometry.Intersects on each
> feature coming from the obtained layer? My intuition would tell me to
> first do the spatial filtering as that (may) narrow down the search
> considerably. But then I cannot use the non-spatial SQL as that
> requires a dataset to be executed on.
You could store the result of the spatial request in a temporary dataset
(possibly in memory) and then apply the attribute filter. But as said
above, I'm a bit surprised that combining the attribute filter and a
(single geometry) spatial filter isn't efficient.
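A minimal sketch of the temporary-dataset route: copy the spatially filtered features into an in-memory dataset, then set the attribute filter on the copy. The function name spatial_then_attribute and the layer name "spatial_subset" are illustrative. The sketch assumes CopyLayer() copies only the features that pass the source layer's current spatial filter, and the in-memory driver name is an assumption about your GDAL version ("Memory" in older releases, "MEM" for vector data in recent ones).

```python
def spatial_then_attribute(src_layer, mask, attr_sql, mem_ds=None):
    """Copy features intersecting `mask` into memory, then filter by attribute.

    Returns (mem_ds, subset_layer). Keep a reference to mem_ds for as long
    as the layer is in use, otherwise the layer becomes invalid.
    """
    if mem_ds is None:
        # Imported lazily: only needed when no dataset is supplied.
        from osgeo import gdal
        driver = gdal.GetDriverByName("Memory") or gdal.GetDriverByName("MEM")
        mem_ds = driver.Create("", 0, 0, 0, gdal.GDT_Unknown)

    src_layer.SetSpatialFilter(mask)
    try:
        # Assumption: only features passing the spatial filter are copied.
        subset = mem_ds.CopyLayer(src_layer, "spatial_subset")
    finally:
        src_layer.SetSpatialFilter(None)

    # The attribute filter now runs against the (smaller) in-memory copy.
    subset.SetAttributeFilter(attr_sql)
    return mem_ds, subset
```

Iterating over the returned layer then yields only the features that pass both the spatial and the attribute constraint.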
Instead of the Parquet driver, you may also try DuckDB through the ADBC
driver. The DuckDB SQL engine generally outperforms libarrow/libparquet.
Even
--
http://www.spatialys.com
My software is free, but my time generally not.