<div dir="ltr">+1 to gist_geometry_ops_2d_fastbuild<br><div><br></div><div>A good thing to allow use-case-specific choice of index strategy.  Then real-world experience can show the future path to take.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Nov 30, 2021 at 2:33 PM Regina Obe <<a href="mailto:lr@pcorp.us">lr@pcorp.us</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">+1 to gist_geometry_ops_2d_fastbuild<br>

<br>

The only concern I have is if we figure out how to make this fast query for<br>

all query cases, can we easily make it alias for the default.  Kind of how<br>

<-> changed from bounding box to actual dist operator.<br>

<br>

But yah as I was yelling in IRC, I'm not convinced the poor performance is<br>

the norm.<br>

For the what I consider "Common use-cases" <br>

<br>

Point in (4 or 5 polys)<br>

Line in (4 or 5 polys)<br>

Make envelope && stuff<br>

<br>

The fastbuild one had the nice effect of a smaller index and often<br>

outperforming the regular (or no noticeable change).<br>

<br>

But I like the idea of testing out both in real envs to see if that is true<br>

and also waiting for PG to  / PostGIS to get better.<br>

<br>

<br>

<br>

> -----Original Message-----<br>

> From: postgis-devel [mailto:<a href="mailto:postgis-devel-bounces@lists.osgeo.org" target="_blank">postgis-devel-bounces@lists.osgeo.org</a>] On<br>

Behalf<br>

> Of Paul Ramsey<br>

> Sent: Tuesday, November 30, 2021 2:20 PM<br>

> To: PostGIS Development Discussion <<a href="mailto:postgis-devel@lists.osgeo.org" target="_blank">postgis-devel@lists.osgeo.org</a>><br>

> Subject: Re: [postgis-devel] GiST Sorting<br>

> <br>

> Stats for standard index<br>

> <br>

>                 gist_stat<br>

> ------------------------------------------<br>

>  Number of levels:          3            +<br>

>  Number of pages:           664          +<br>

>  Number of leaf pages:      659          +<br>

>  Number of tuples:          93339        +<br>

>  Number of invalid tuples:  0            +<br>

>  Number of leaf tuples:     92676        +<br>

>  Total size of tuples:      2621460 bytes+<br>

>  Total size of leaf tuples: 2602836 bytes+<br>

>  Total size of index:       5439488 bytes+<br>

> <br>

> Stats for sort-order index<br>

> <br>

>                 gist_stat<br>

> ------------------------------------------<br>

>  Number of levels:          3            +<br>

>  Number of pages:           359          +<br>

>  Number of leaf pages:      356          +<br>

>  Number of tuples:          93034        +<br>

>  Number of invalid tuples:  0            +<br>

>  Number of leaf tuples:     92676        +<br>

>  Total size of tuples:      2609260 bytes+<br>

>  Total size of leaf tuples: 2599200 bytes+<br>

>  Total size of index:       2940928 bytes+<br>

> <br>

> <br>

> So these are both 3-level indexes, but I'd say, at least from a visual<br>

examination<br>

> the standard index is just a Better Tree. Yes, it's also got sparser<br>

nodes, so it<br>

> can spend less time iterating through those level 2 parents, but it also<br>

just has<br>

> way less overlap in those nodes, so it's doing less spurious searching.<br>

> <br>

> So this is not just about sort-support being a wonderful tree but overly<br>

packed,<br>

> it's about it being a kind of Suboptimal  tree that is also overly packed.<br>

If those<br>

> packed up parent nodes had less overlap and better structure, both trees<br>

have<br>

> the same depth, so it seems likely it would be much harder to find<br>

> performance gaps between them.<br>

> <br>

> I think the best route forward (and please feel free to argue) is a whole<br>

'nother<br>

> opclass, gist_geometry_ops_2d_fastbuild, so that people can build a "fast<br>

to<br>

> create but suboptimal to query" tree and all they need is to know the one<br>

> magic keyword to get it. Also that way "fast'n'suboptimal" trees can<br>

coexist<br>

> with "slow'n'good" trees and people don't have to constantly flip the<br>

sort-<br>

> support option of the default index opclass back and forth.<br>

> <br>

> P.<br>

> <br>

> > On Nov 30, 2021, at 10:28 AM, Paul Ramsey <<a href="mailto:pramsey@cleverelephant.ca" target="_blank">pramsey@cleverelephant.ca</a>><br>

> wrote:<br>

> ><br>

> > Here's a more zoomed in view. I think the hilbert curve is causing a<br>

> > lot of less-than-optimal overlap in the parent nodes<br>

> > <screenshot_770.png><screenshot_769.png><br>

> ><br>

> >> On Nov 30, 2021, at 10:20 AM, Paul Ramsey <<a href="mailto:pramsey@cleverelephant.ca" target="_blank">pramsey@cleverelephant.ca</a>><br>

> wrote:<br>

> >><br>

> >> I got gevel working on Pg14, and here's some pictures of the top two<br>

levels<br>

> of the index. Blue is the 3.2 sort-support index, red is the 3.1 standard<br>

index.<br>

> >><br>

> >> <screenshot_768.jpg><screenshot_766.jpg><br>

> >><br>

> >>> On Nov 30, 2021, at 9:23 AM, Martin Davis <<a href="mailto:mtnclimb@gmail.com" target="_blank">mtnclimb@gmail.com</a>><br>

> wrote:<br>

> >>><br>

> >>> Just to get some clarity about how this works - the sorting/packing<br>

only<br>

> happens when an index is created, right?  So insert/update records are<br>

added<br>

> to the index in the normal way (i.e the same way the old code uses to<br>

build the<br>

> index in the first place).<br>

> >>><br>

> >>> If so, will a packed index "fluff up" as more records are added?<br>

Which<br>

> should (slowly) bring query performance back in line with the standard<br>

index<br>

> approach.<br>

> >>><br>

> >>> Of course, any reindex of the table will pack the index again,<br>

resulting in<br>

> the observed query performance hit.<br>

> >>><br>

> >>><br>

> >>> On Tue, Nov 30, 2021 at 8:43 AM Bruce Rindahl<br>

> <<a href="mailto:bruce.rindahl@gmail.com" target="_blank">bruce.rindahl@gmail.com</a>> wrote:<br>

> >>> I have to agree with Paul on this.  Indexes are usually created once<br>

but<br>

> queries are done all the time.  The only benefit would be to a spatial<br>

dataset<br>

> that is constantly updated.  Even then the update performance increase (if<br>

it<br>

> happens) might be small compared to the query performance.  Most<br>

> applications would do a massive bulk load (OSM data) then index, then<br>

> constant queries.<br>

> >>><br>

> >>> On Tue, Nov 30, 2021 at 8:08 AM Paul Ramsey<br>

> <<a href="mailto:pramsey@cleverelephant.ca" target="_blank">pramsey@cleverelephant.ca</a>> wrote:<br>

> >>><br>

> >>><br>

> >>> > On Nov 29, 2021, at 11:03 PM, Marco Boeringa<br>

> <<a href="mailto:marco@boeringa.demon.nl" target="_blank">marco@boeringa.demon.nl</a>> wrote:<br>

> >>> ><br>

> >>> > Hi Paul,<br>

> >>> ><br>

> >>> > Thanks for these insights. In fact, seeing some of the similar<br>

discussions<br>

> in the past about build time versus query performance when Han was still<br>

> working on this project, was one of the reasons I initially asked about<br>

this here<br>

> on the list, and asked about real world experiences with the new versions<br>

of<br>

> PostgreSQL and PostGIS.<br>

> >>> ><br>

> >>> > However, your remark about "flat trees" raises another question for<br>

me:<br>

> >>> ><br>

> >>> > - Is the issue you describe data size depended or not? E.g. does the<br>

50%<br>

> penalty on query performance possibly only occur on small datasets, but<br>

not -<br>

> or much less so - on something like a Planet size extract of<br>

OpenStreetMap? Or<br>

> is this a universal issue?<br>

> >>><br>

> >>> Note that, with optimal page filling (which the flatter trees get<br>

close to)<br>

> you can fit a 250000 record table into an index with only 2 levels. These<br>

aren't<br>

> just a little flat, they are really flat.<br>

> >>> I also tested with a 2.7M record table and found the sort-support<br>

index ran<br>

> 30% slower. My mental model, which may be wrong, says that the lower<br>

levels<br>

> of the tree tend to be the most full, so "smaller" tables (under a quarter<br>

million<br>

> records!) bear the brunt of the downgrade, but even larger tables see it.<br>

> >>><br>

> >>> > Having a 50% decrease in query performance of primarily small<br>

> >>> > datasets might be acceptable,<br>

> >>><br>

> >>> Uh, 100% abosutely not. Querying is something the database does all<br>

day,<br>

> every day, so every inefficiency is multiplied over all the queries that<br>

are run on<br>

> the table For All Time. Bulk indexing is something that happens Once.<br>

> >>><br>

> >>> There is no way that we should push this kind of regression out as the<br>

> default setting. Either we should make a whole other "fast to build but<br>

slow to<br>

> query" opclass or just leave the sort-support out of the opclass and let<br>

people<br>

> with the "index build time is my constraint" people manually add the sort<br>

> support function to the default opclass.<br>

> >>><br>

> >>> P<br>

> >>><br>

> >>> > if it means much larger datasets can be indexed much faster and have<br>

> minimal impact of the same issue. That said, I understand your caution if<br>

the<br>

> 50% decrease in query performance is a universal issue with the new<br>

> implementation.<br>

> >>> ><br>

> >>> > Marco<br>

> >>> ><br>

> >>> > Op 30-11-2021 om 00:08 schreef Paul Ramsey:<br>

> >>> >> Sorry to be the rain on the parade, but I think that the results of<br>

testing<br>

> on performance for the new index sorting mean we should NOT be enabling it<br>

> by default for this release.<br>

> >>> >><br>

> >>> >> Basically we have multiple tests now that show that the index is<br>

slower<br>

> for query performance, and we even have a convincing working theory for<br>

why<br>

> (the improved packing into pages/nodes means that we have very flat trees<br>

> which are expensive to query).<br>

> >>> >><br>

> >>> >> I think releasing with a "new improved" version that degrades most<br>

> people's systems by 50% is not a good look. We can leave all the code in<br>

there<br>

> and just take the sorting function out of the opclass. That way people<br>

with<br>

> "build an index and use it once" use cases can still turn the feature on,<br>

but<br>

> people with normal use patterns aren't going to silently get a kick in the<br>

teeth<br>

> when they upgrade to 3.2.<br>

> >>> >><br>

> >>> >> If you would like to replicate my results, I provide them below.<br>

> >>> >><br>

> >>> >> P.<br>

> >>> >><br>

> >>> >><br>

> >>> >> Data:<br>

> >>> >> <a href="https://www.dropbox.com/s/e75g1y9ua1da6qu/roads_rdr.sql.gz?dl=0" rel="noreferrer" target="_blank">https://www.dropbox.com/s/e75g1y9ua1da6qu/roads_rdr.sql.gz?dl=0</a><br>

> >>> >><br>

> >>> >> create index roads_rdr_idx on roads_rdr using gist (geom);<br>

> >>> >><br>

> >>> >> 13/3.2/CREATE 1546<br>

> >>> >> 13/3.1/CREATE 1683<br>

> >>> >> 14/3.2/CREATE 200<br>

> >>> >> 14/3.1/CREATE 1705<br>

> >>> >><br>

> >>> >> select count(*) from roads_rdr a, roads_rdr b where a.geom &&<br>

> >>> >> b.geom;<br>

> >>> >><br>

> >>> >> 13/3.2/QUERY  4600<br>

> >>> >> 13/3.1/QUERY 4540<br>

> >>> >> 14/3.2/QUERY 11200<br>

> >>> >> 14/3.1/QUERY 4700<br>

> >>> >><br>

> >>> >> select pg_relation_size('roads_rdr_idx');<br>

> >>> >><br>

> >>> >> 13/3.2/IDXSIZE  5414912<br>

> >>> >> 13/3.1/IDXSIZE 5496832<br>

> >>> >> 14/3.2/IDXSIZE 2940928<br>

> >>> >> 14/3.1/IDXSIZE 5439488<br>

> >>> >> _______________________________________________<br>

> >>> >> postgis-devel mailing list<br>

> >>> >> <a href="mailto:postgis-devel@lists.osgeo.org" target="_blank">postgis-devel@lists.osgeo.org</a><br>

> >>> >> <a href="https://lists.osgeo.org/mailman/listinfo/postgis-devel" rel="noreferrer" target="_blank">https://lists.osgeo.org/mailman/listinfo/postgis-devel</a><br>

> >>> > _______________________________________________<br>

> >>> > postgis-devel mailing list<br>

> >>> > <a href="mailto:postgis-devel@lists.osgeo.org" target="_blank">postgis-devel@lists.osgeo.org</a><br>

> >>> > <a href="https://lists.osgeo.org/mailman/listinfo/postgis-devel" rel="noreferrer" target="_blank">https://lists.osgeo.org/mailman/listinfo/postgis-devel</a><br>

> >>><br>

> >>> _______________________________________________<br>

> >>> postgis-devel mailing list<br>

> >>> <a href="mailto:postgis-devel@lists.osgeo.org" target="_blank">postgis-devel@lists.osgeo.org</a><br>

> >>> <a href="https://lists.osgeo.org/mailman/listinfo/postgis-devel" rel="noreferrer" target="_blank">https://lists.osgeo.org/mailman/listinfo/postgis-devel</a><br>

> >>> _______________________________________________<br>

> >>> postgis-devel mailing list<br>

> >>> <a href="mailto:postgis-devel@lists.osgeo.org" target="_blank">postgis-devel@lists.osgeo.org</a><br>

> >>> <a href="https://lists.osgeo.org/mailman/listinfo/postgis-devel" rel="noreferrer" target="_blank">https://lists.osgeo.org/mailman/listinfo/postgis-devel</a><br>

> >>> _______________________________________________<br>

> >>> postgis-devel mailing list<br>

> >>> <a href="mailto:postgis-devel@lists.osgeo.org" target="_blank">postgis-devel@lists.osgeo.org</a><br>

> >>> <a href="https://lists.osgeo.org/mailman/listinfo/postgis-devel" rel="noreferrer" target="_blank">https://lists.osgeo.org/mailman/listinfo/postgis-devel</a><br>

> >><br>

> ><br>

> <br>

> _______________________________________________<br>

> postgis-devel mailing list<br>

> <a href="mailto:postgis-devel@lists.osgeo.org" target="_blank">postgis-devel@lists.osgeo.org</a><br>

> <a href="https://lists.osgeo.org/mailman/listinfo/postgis-devel" rel="noreferrer" target="_blank">https://lists.osgeo.org/mailman/listinfo/postgis-devel</a><br>

<br>

_______________________________________________<br>

postgis-devel mailing list<br>

<a href="mailto:postgis-devel@lists.osgeo.org" target="_blank">postgis-devel@lists.osgeo.org</a><br>

<a href="https://lists.osgeo.org/mailman/listinfo/postgis-devel" rel="noreferrer" target="_blank">https://lists.osgeo.org/mailman/listinfo/postgis-devel</a><br>

</blockquote></div>