[postgis-devel] Bugs in gserialized_gist_compress_2d

Sat Sep 14 14:26:00 PDT 2013

We talked about invalid keys, and we're going to need some. I'm going to have to backtrack and remember where/when we did this last time. Basically any instance of GEOMETRY EMPTY, or a geometry with NaN coordinates, or Inf coordinates, should generate an invalid key (is there any way to just leave it out of the index, since that's effectively what we want anyways...)

P 

-- 
Paul Ramsey
http://cleverelephant.ca
http://postgis.net

On Saturday, September 14, 2013 at 1:15 PM, Alexander Korotkov wrote:

> On Thu, Sep 12, 2013 at 12:57 AM, Paul Ramsey <pramsey at cleverelephant.ca (mailto:pramsey at cleverelephant.ca)> wrote:
> > I don't really see how the logic in gserialized_gist_compress_2d
> > differs from that in the core PgSQL code, like say gist_poly_compress.
> > In the event of things going terribly wrong, it's returning the
> > unaltered input.
> 
> 
> I don't see any case gist_poly_compress returning unaltered input when entry->leafkey is passed. It returns NULL on NULL input. But, it's just old way of handling NULLs. Now, it's only legacy code which never executed, because GiST never passes NULL datum to compress. 
> 
> It's easy to demonstrate how current checks in gserialized_gist_compress_2d cause inconsistent behaviour:
> 
> 1) Create some table with infinity coordinate point.
> 
> test=# create table gtest (g geometry);
> CREATE TABLE
> test=# insert into gtest values (geometry('(0,0)'::point)), (geometry('(0,inf)'::point));
> INSERT 0 2
> test=# create index gtest_idx on gtest using gist (g);
> CREATE INDEX
> 
> 2) Sequential scan gives us a predictable result.
> 
> test=# select * from gtest where g && geometry(polygon(box('(0,0)'::point, '(0,inf)'::point))); 
> g
> --------------------------------------------
> 010100000000000000000000000000000000000000
> 01010000000000000000000000000000000000F07F
> (2 rows)
> 
> 3) Index scan result differs. It's inconsistent behaviour, because index scan must give same result as sequential scan.
> 
> test=# set enable_seqscan = off;
> SET
> test=# select * from gtest where g && geometry(polygon(box('(0,0)'::point, '(0,inf)'::point)));
> g
> --------------------------------------------
> 010100000000000000000000000000000000000000
> (1 row)
> 
> 4) I wrote output function for box2df and installed gevel extension. We can see that second GiST entry contains garbage. 
> 
> test=# select * from gist_print('gtest_idx') as t(level int, valid bool, a box2df);
> level | valid | a
> -------+-------+------------------------------------------------------------------
> 1 | t | BOX2DF(0 0, 0 0)
> 1 | t | BOX2DF(8.26766093952e-44 3.58732406867e-43, 3.58732406867e-43 0)
> (2 rows)
> 
> 
> Current checks are worse than nothing. I would rather remove them than leave as is. 
> 
> Generally I think there are 3 ways:
> 1) Throw an error. But it seems to be a solution when we have good protection from invalid data in other function. In that case error should indicate some internal error or undiscovered way to input invalid data. However, protection is weak and index building is not best place to throw an error on invalid data.
> 2) Create invalid key which will be always skipped in the index scan. Then we have a risk that index scan result will differs from sequential scan. Actually, it's how index behaves now. 
> 3) Create invalid key which will be always rechecked. That could lead to some performance degradation, but will guarantee consistent index behaviour.
> 
> ------
> With best regards,
> Alexander Korotkov.
> 
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at lists.osgeo.org (mailto:postgis-devel at lists.osgeo.org)
> http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-devel