[postgis-devel] GiST Sorting

Tue Nov 30 11:20:09 PST 2021

Stats for standard index 

                gist_stat                 
------------------------------------------
 Number of levels:          3            +
 Number of pages:           664          +
 Number of leaf pages:      659          +
 Number of tuples:          93339        +
 Number of invalid tuples:  0            +
 Number of leaf tuples:     92676        +
 Total size of tuples:      2621460 bytes+
 Total size of leaf tuples: 2602836 bytes+
 Total size of index:       5439488 bytes+

Stats for sort-order index

                gist_stat                 
------------------------------------------
 Number of levels:          3            +
 Number of pages:           359          +
 Number of leaf pages:      356          +
 Number of tuples:          93034        +
 Number of invalid tuples:  0            +
 Number of leaf tuples:     92676        +
 Total size of tuples:      2609260 bytes+
 Total size of leaf tuples: 2599200 bytes+
 Total size of index:       2940928 bytes+

So these are both 3-level indexes, but I'd say, at least from a visual examination the standard index is just a Better Tree. Yes, it's also got sparser nodes, so it can spend less time iterating through those level 2 parents, but it also just has way less overlap in those nodes, so it's doing less spurious searching.

So this is not just about sort-support being a wonderful tree but overly packed, it's about it being a kind of Suboptimal  tree that is also overly packed. If those packed up parent nodes had less overlap and better structure, both trees have the same depth, so it seems likely it would be much harder to find performance gaps between them.

I think the best route forward (and please feel free to argue) is a whole 'nother opclass, gist_geometry_ops_2d_fastbuild, so that people can build a "fast to create but suboptimal to query" tree and all they need is to know the one magic keyword to get it. Also that way "fast'n'suboptimal" trees can coexist with "slow'n'good" trees and people don't have to constantly flip the sort-support option of the default index opclass back and forth.

P.

> On Nov 30, 2021, at 10:28 AM, Paul Ramsey <pramsey at cleverelephant.ca> wrote:
> 
> Here's a more zoomed in view. I think the hilbert curve is causing a lot of less-than-optimal overlap in the parent nodes
> <screenshot_770.png><screenshot_769.png>
> 
>> On Nov 30, 2021, at 10:20 AM, Paul Ramsey <pramsey at cleverelephant.ca> wrote:
>> 
>> I got gevel working on Pg14, and here's some pictures of the top two levels of the index. Blue is the 3.2 sort-support index, red is the 3.1 standard index. 
>> 
>> <screenshot_768.jpg><screenshot_766.jpg>
>> 
>>> On Nov 30, 2021, at 9:23 AM, Martin Davis <mtnclimb at gmail.com> wrote:
>>> 
>>> Just to get some clarity about how this works - the sorting/packing only happens when an index is created, right?  So insert/update records are added to the index in the normal way (i.e the same way the old code uses to build the index in the first place).
>>> 
>>> If so, will a packed index "fluff up" as more records are added?  Which should (slowly) bring query performance back in line with the standard index approach.
>>> 
>>> Of course, any reindex of the table will pack the index again, resulting in the observed query performance hit.
>>> 
>>> 
>>> On Tue, Nov 30, 2021 at 8:43 AM Bruce Rindahl <bruce.rindahl at gmail.com> wrote:
>>> I have to agree with Paul on this.  Indexes are usually created once but queries are done all the time.  The only benefit would be to a spatial dataset that is constantly updated.  Even then the update performance increase (if it happens) might be small compared to the query performance.  Most applications would do a massive bulk load (OSM data) then index, then constant queries.
>>> 
>>> On Tue, Nov 30, 2021 at 8:08 AM Paul Ramsey <pramsey at cleverelephant.ca> wrote:
>>> 
>>> 
>>> > On Nov 29, 2021, at 11:03 PM, Marco Boeringa <marco at boeringa.demon.nl> wrote:
>>> > 
>>> > Hi Paul,
>>> > 
>>> > Thanks for these insights. In fact, seeing some of the similar discussions in the past about build time versus query performance when Han was still working on this project, was one of the reasons I initially asked about this here on the list, and asked about real world experiences with the new versions of PostgreSQL and PostGIS.
>>> > 
>>> > However, your remark about "flat trees" raises another question for me:
>>> > 
>>> > - Is the issue you describe data size depended or not? E.g. does the 50% penalty on query performance possibly only occur on small datasets, but not - or much less so - on something like a Planet size extract of OpenStreetMap? Or is this a universal issue?
>>> 
>>> Note that, with optimal page filling (which the flatter trees get close to) you can fit a 250000 record table into an index with only 2 levels. These aren't just a little flat, they are really flat.
>>> I also tested with a 2.7M record table and found the sort-support index ran 30% slower. My mental model, which may be wrong, says that the lower levels of the tree tend to be the most full, so "smaller" tables (under a quarter million records!) bear the brunt of the downgrade, but even larger tables see it.
>>> 
>>> > Having a 50% decrease in query performance of primarily small datasets might be acceptable,
>>> 
>>> Uh, 100% abosutely not. Querying is something the database does all day, every day, so every inefficiency is multiplied over all the queries that are run on the table For All Time. Bulk indexing is something that happens Once.
>>> 
>>> There is no way that we should push this kind of regression out as the default setting. Either we should make a whole other "fast to build but slow to query" opclass or just leave the sort-support out of the opclass and let people with the "index build time is my constraint" people manually add the sort support function to the default opclass.
>>> 
>>> P
>>> 
>>> > if it means much larger datasets can be indexed much faster and have minimal impact of the same issue. That said, I understand your caution if the 50% decrease in query performance is a universal issue with the new implementation.
>>> > 
>>> > Marco
>>> > 
>>> > Op 30-11-2021 om 00:08 schreef Paul Ramsey:
>>> >> Sorry to be the rain on the parade, but I think that the results of testing on performance for the new index sorting mean we should NOT be enabling it by default for this release.
>>> >> 
>>> >> Basically we have multiple tests now that show that the index is slower for query performance, and we even have a convincing working theory for why (the improved packing into pages/nodes means that we have very flat trees which are expensive to query).
>>> >> 
>>> >> I think releasing with a "new improved" version that degrades most people's systems by 50% is not a good look. We can leave all the code in there and just take the sorting function out of the opclass. That way people with "build an index and use it once" use cases can still turn the feature on, but people with normal use patterns aren't going to silently get a kick in the teeth when they upgrade to 3.2.
>>> >> 
>>> >> If you would like to replicate my results, I provide them below.
>>> >> 
>>> >> P.
>>> >> 
>>> >> 
>>> >> Data: https://www.dropbox.com/s/e75g1y9ua1da6qu/roads_rdr.sql.gz?dl=0
>>> >> 
>>> >> create index roads_rdr_idx on roads_rdr using gist (geom);
>>> >> 
>>> >> 13/3.2/CREATE 1546
>>> >> 13/3.1/CREATE 1683
>>> >> 14/3.2/CREATE 200
>>> >> 14/3.1/CREATE 1705
>>> >> 
>>> >> select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom;
>>> >> 
>>> >> 13/3.2/QUERY  4600
>>> >> 13/3.1/QUERY 4540
>>> >> 14/3.2/QUERY 11200
>>> >> 14/3.1/QUERY 4700
>>> >> 
>>> >> select pg_relation_size('roads_rdr_idx');
>>> >> 
>>> >> 13/3.2/IDXSIZE  5414912
>>> >> 13/3.1/IDXSIZE 5496832
>>> >> 14/3.2/IDXSIZE 2940928
>>> >> 14/3.1/IDXSIZE 5439488
>>> >> _______________________________________________
>>> >> postgis-devel mailing list
>>> >> postgis-devel at lists.osgeo.org
>>> >> https://lists.osgeo.org/mailman/listinfo/postgis-devel
>>> > _______________________________________________
>>> > postgis-devel mailing list
>>> > postgis-devel at lists.osgeo.org
>>> > https://lists.osgeo.org/mailman/listinfo/postgis-devel
>>> 
>>> _______________________________________________
>>> postgis-devel mailing list
>>> postgis-devel at lists.osgeo.org
>>> https://lists.osgeo.org/mailman/listinfo/postgis-devel
>>> _______________________________________________
>>> postgis-devel mailing list
>>> postgis-devel at lists.osgeo.org
>>> https://lists.osgeo.org/mailman/listinfo/postgis-devel
>>> _______________________________________________
>>> postgis-devel mailing list
>>> postgis-devel at lists.osgeo.org
>>> https://lists.osgeo.org/mailman/listinfo/postgis-devel
>> 
>