[postgis-devel] Btree operators

Mon Apr 11 10:22:19 PDT 2016

On 11/04/16 15:47, Komяpa wrote:

> пн, 11 апр. 2016 г. в 17:36, Mark Cave-Ayland
> <mark.cave-ayland at ilande.co.uk <mailto:mark.cave-ayland at ilande.co.uk>>:
> 
>     On 11/04/16 12:06, Komяpa wrote:
> 
>     > Having counter-intuitive behaviour leads to realy sublte bugs that are
>     > nearly impossible to debug.
>     > SQL was meant as human-readable, things like this one really break the
>     > perceprion of it as of English text and break expectations.
>     >
>     > Distinct is (meant? felt?) to deduplicate identical records, that
>     > appeared possibly because of some join elsewhere in the flow (or
>     for any
>     > other reason). For floating point numbers there, I don't really expect
>     > "nearly equals" behaviour, so ::bytea/memcmp-like comparsion for
>     > geometries seems sane in the case.
>     >
>     > When I do SELECT DISTINCT geom, I want _distinct geometry_, not
>     > _geometry with distinct boxes_. For distinct boxes I'd write SELECT
>     > DISTINCT on (ST_Envelope(geom)) geom, and that's rather rare case.
>     These
>     > two being swapped really require mind-twisting, and the more
>     > mind-twisting it requires the less people can use it.
>     >
>     > How about sorting by zig-zag-encoded coordinates?
>     >
>     > Are there any showstoppers to implement this change, except of
>     everyone
>     > has to REINDEX?
> 
>     Bear in mind that DISTINCT isn't quite as simple as you may imagine -
>     for example would you consider two geometries with the same coordinates
>     but in reverse order the same?
> 
>  
> They're different.
>  
> 
>     Or how about two geometries that are
>     exactly the same but one with a 3rd dimension coordinate added to some
>     vertices?
> 
>  
> They're different. 
> 
>     Also note that memcmp() can't be directly used on floating point numbers
>     since it is possible to have multiple binary representations of the same
>     number (a quick search for memcmp floating point numbers will point you
>     towards some of the issues).
> 
>  
> That's fine.
> 
> It's no different form currently recommended ::bytea or ::text, but
> keeps more readability in the SQL code.

If you're thinking about binary representation, sure, but as a
counter-example some people may already rely on this behaviour, perhaps
unwittingly, as part of an existing merge process. Not that this would
give a consistent result, but it's a change in user-visible behaviour.

> If someone needs more sophisticated filtering, they can always do the
> filtering that fits them better, like ST_SnapToGrid or
> ST_CollectionHomogenize or ST_ForceRHR or ST_MakeValid or
>  ST_Area(ST_Intersection(geom1, geom2))/(ST_Area(geom1)+ST_Area(geom2))
> - whatever they like the best.
> 
> Current problem is that in current approach DISTINCT kills too much
> values that differ from one another.

Also if you're changing the geometry equals function then this also
affects the use of index scans in GROUP BY and DISTINCT (B-Tree
operators are used by PostgreSQL for sorts in these cases). You've also
got to think about whether any change in behaviour is intuitive and
consistent across all variants of R-Tree/B-Tree/heap scans and ST_Equals
- some functions will use = to cut the bounding box first as an index
optimisation before applying the more correct ST_Equals. There is also
the issue whereby if you're now pulling complete geometries vs. bounding
boxes then some index-only optimisations will now be unavailable to the
planner so there can be performance regressions on large datasets and
those with fewer but "wide" geometries.

Now there are probably ways to work around some of these issues but it
will require some thought. I suspect the main problem is that people
will have applications that depend upon the existing behaviour and so
such a change would need careful management across a major release.

ATB,

Mark.