[postgis-devel] GiST Sorting

Regina Obe lr at pcorp.us
Tue Nov 30 14:33:06 PST 2021


+1 to gist_geometry_ops_2d_fastbuild

The only concern I have is if we figure out how to make this fast query for
all query cases, can we easily make it alias for the default.  Kind of how
<-> changed from bounding box to actual dist operator.

But yah as I was yelling in IRC, I'm not convinced the poor performance is
the norm.
For the what I consider "Common use-cases" 

Point in (4 or 5 polys)
Line in (4 or 5 polys)
Make envelope && stuff

The fastbuild one had the nice effect of a smaller index and often
outperforming the regular (or no noticeable change).

But I like the idea of testing out both in real envs to see if that is true
and also waiting for PG to  / PostGIS to get better.



> -----Original Message-----
> From: postgis-devel [mailto:postgis-devel-bounces at lists.osgeo.org] On
Behalf
> Of Paul Ramsey
> Sent: Tuesday, November 30, 2021 2:20 PM
> To: PostGIS Development Discussion <postgis-devel at lists.osgeo.org>
> Subject: Re: [postgis-devel] GiST Sorting
> 
> Stats for standard index
> 
>                 gist_stat
> ------------------------------------------
>  Number of levels:          3            +
>  Number of pages:           664          +
>  Number of leaf pages:      659          +
>  Number of tuples:          93339        +
>  Number of invalid tuples:  0            +
>  Number of leaf tuples:     92676        +
>  Total size of tuples:      2621460 bytes+
>  Total size of leaf tuples: 2602836 bytes+
>  Total size of index:       5439488 bytes+
> 
> Stats for sort-order index
> 
>                 gist_stat
> ------------------------------------------
>  Number of levels:          3            +
>  Number of pages:           359          +
>  Number of leaf pages:      356          +
>  Number of tuples:          93034        +
>  Number of invalid tuples:  0            +
>  Number of leaf tuples:     92676        +
>  Total size of tuples:      2609260 bytes+
>  Total size of leaf tuples: 2599200 bytes+
>  Total size of index:       2940928 bytes+
> 
> 
> So these are both 3-level indexes, but I'd say, at least from a visual
examination
> the standard index is just a Better Tree. Yes, it's also got sparser
nodes, so it
> can spend less time iterating through those level 2 parents, but it also
just has
> way less overlap in those nodes, so it's doing less spurious searching.
> 
> So this is not just about sort-support being a wonderful tree but overly
packed,
> it's about it being a kind of Suboptimal  tree that is also overly packed.
If those
> packed up parent nodes had less overlap and better structure, both trees
have
> the same depth, so it seems likely it would be much harder to find
> performance gaps between them.
> 
> I think the best route forward (and please feel free to argue) is a whole
'nother
> opclass, gist_geometry_ops_2d_fastbuild, so that people can build a "fast
to
> create but suboptimal to query" tree and all they need is to know the one
> magic keyword to get it. Also that way "fast'n'suboptimal" trees can
coexist
> with "slow'n'good" trees and people don't have to constantly flip the
sort-
> support option of the default index opclass back and forth.
> 
> P.
> 
> > On Nov 30, 2021, at 10:28 AM, Paul Ramsey <pramsey at cleverelephant.ca>
> wrote:
> >
> > Here's a more zoomed in view. I think the hilbert curve is causing a
> > lot of less-than-optimal overlap in the parent nodes
> > <screenshot_770.png><screenshot_769.png>
> >
> >> On Nov 30, 2021, at 10:20 AM, Paul Ramsey <pramsey at cleverelephant.ca>
> wrote:
> >>
> >> I got gevel working on Pg14, and here's some pictures of the top two
levels
> of the index. Blue is the 3.2 sort-support index, red is the 3.1 standard
index.
> >>
> >> <screenshot_768.jpg><screenshot_766.jpg>
> >>
> >>> On Nov 30, 2021, at 9:23 AM, Martin Davis <mtnclimb at gmail.com>
> wrote:
> >>>
> >>> Just to get some clarity about how this works - the sorting/packing
only
> happens when an index is created, right?  So insert/update records are
added
> to the index in the normal way (i.e the same way the old code uses to
build the
> index in the first place).
> >>>
> >>> If so, will a packed index "fluff up" as more records are added?
Which
> should (slowly) bring query performance back in line with the standard
index
> approach.
> >>>
> >>> Of course, any reindex of the table will pack the index again,
resulting in
> the observed query performance hit.
> >>>
> >>>
> >>> On Tue, Nov 30, 2021 at 8:43 AM Bruce Rindahl
> <bruce.rindahl at gmail.com> wrote:
> >>> I have to agree with Paul on this.  Indexes are usually created once
but
> queries are done all the time.  The only benefit would be to a spatial
dataset
> that is constantly updated.  Even then the update performance increase (if
it
> happens) might be small compared to the query performance.  Most
> applications would do a massive bulk load (OSM data) then index, then
> constant queries.
> >>>
> >>> On Tue, Nov 30, 2021 at 8:08 AM Paul Ramsey
> <pramsey at cleverelephant.ca> wrote:
> >>>
> >>>
> >>> > On Nov 29, 2021, at 11:03 PM, Marco Boeringa
> <marco at boeringa.demon.nl> wrote:
> >>> >
> >>> > Hi Paul,
> >>> >
> >>> > Thanks for these insights. In fact, seeing some of the similar
discussions
> in the past about build time versus query performance when Han was still
> working on this project, was one of the reasons I initially asked about
this here
> on the list, and asked about real world experiences with the new versions
of
> PostgreSQL and PostGIS.
> >>> >
> >>> > However, your remark about "flat trees" raises another question for
me:
> >>> >
> >>> > - Is the issue you describe data size depended or not? E.g. does the
50%
> penalty on query performance possibly only occur on small datasets, but
not -
> or much less so - on something like a Planet size extract of
OpenStreetMap? Or
> is this a universal issue?
> >>>
> >>> Note that, with optimal page filling (which the flatter trees get
close to)
> you can fit a 250000 record table into an index with only 2 levels. These
aren't
> just a little flat, they are really flat.
> >>> I also tested with a 2.7M record table and found the sort-support
index ran
> 30% slower. My mental model, which may be wrong, says that the lower
levels
> of the tree tend to be the most full, so "smaller" tables (under a quarter
million
> records!) bear the brunt of the downgrade, but even larger tables see it.
> >>>
> >>> > Having a 50% decrease in query performance of primarily small
> >>> > datasets might be acceptable,
> >>>
> >>> Uh, 100% abosutely not. Querying is something the database does all
day,
> every day, so every inefficiency is multiplied over all the queries that
are run on
> the table For All Time. Bulk indexing is something that happens Once.
> >>>
> >>> There is no way that we should push this kind of regression out as the
> default setting. Either we should make a whole other "fast to build but
slow to
> query" opclass or just leave the sort-support out of the opclass and let
people
> with the "index build time is my constraint" people manually add the sort
> support function to the default opclass.
> >>>
> >>> P
> >>>
> >>> > if it means much larger datasets can be indexed much faster and have
> minimal impact of the same issue. That said, I understand your caution if
the
> 50% decrease in query performance is a universal issue with the new
> implementation.
> >>> >
> >>> > Marco
> >>> >
> >>> > Op 30-11-2021 om 00:08 schreef Paul Ramsey:
> >>> >> Sorry to be the rain on the parade, but I think that the results of
testing
> on performance for the new index sorting mean we should NOT be enabling it
> by default for this release.
> >>> >>
> >>> >> Basically we have multiple tests now that show that the index is
slower
> for query performance, and we even have a convincing working theory for
why
> (the improved packing into pages/nodes means that we have very flat trees
> which are expensive to query).
> >>> >>
> >>> >> I think releasing with a "new improved" version that degrades most
> people's systems by 50% is not a good look. We can leave all the code in
there
> and just take the sorting function out of the opclass. That way people
with
> "build an index and use it once" use cases can still turn the feature on,
but
> people with normal use patterns aren't going to silently get a kick in the
teeth
> when they upgrade to 3.2.
> >>> >>
> >>> >> If you would like to replicate my results, I provide them below.
> >>> >>
> >>> >> P.
> >>> >>
> >>> >>
> >>> >> Data:
> >>> >> https://www.dropbox.com/s/e75g1y9ua1da6qu/roads_rdr.sql.gz?dl=0
> >>> >>
> >>> >> create index roads_rdr_idx on roads_rdr using gist (geom);
> >>> >>
> >>> >> 13/3.2/CREATE 1546
> >>> >> 13/3.1/CREATE 1683
> >>> >> 14/3.2/CREATE 200
> >>> >> 14/3.1/CREATE 1705
> >>> >>
> >>> >> select count(*) from roads_rdr a, roads_rdr b where a.geom &&
> >>> >> b.geom;
> >>> >>
> >>> >> 13/3.2/QUERY  4600
> >>> >> 13/3.1/QUERY 4540
> >>> >> 14/3.2/QUERY 11200
> >>> >> 14/3.1/QUERY 4700
> >>> >>
> >>> >> select pg_relation_size('roads_rdr_idx');
> >>> >>
> >>> >> 13/3.2/IDXSIZE  5414912
> >>> >> 13/3.1/IDXSIZE 5496832
> >>> >> 14/3.2/IDXSIZE 2940928
> >>> >> 14/3.1/IDXSIZE 5439488
> >>> >> _______________________________________________
> >>> >> postgis-devel mailing list
> >>> >> postgis-devel at lists.osgeo.org
> >>> >> https://lists.osgeo.org/mailman/listinfo/postgis-devel
> >>> > _______________________________________________
> >>> > postgis-devel mailing list
> >>> > postgis-devel at lists.osgeo.org
> >>> > https://lists.osgeo.org/mailman/listinfo/postgis-devel
> >>>
> >>> _______________________________________________
> >>> postgis-devel mailing list
> >>> postgis-devel at lists.osgeo.org
> >>> https://lists.osgeo.org/mailman/listinfo/postgis-devel
> >>> _______________________________________________
> >>> postgis-devel mailing list
> >>> postgis-devel at lists.osgeo.org
> >>> https://lists.osgeo.org/mailman/listinfo/postgis-devel
> >>> _______________________________________________
> >>> postgis-devel mailing list
> >>> postgis-devel at lists.osgeo.org
> >>> https://lists.osgeo.org/mailman/listinfo/postgis-devel
> >>
> >
> 
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-devel



More information about the postgis-devel mailing list