[postgis-devel] GiST Sorting

Martin Davis mtnclimb at gmail.com
Tue Nov 30 14:40:45 PST 2021


+1 to gist_geometry_ops_2d_fastbuild

A good thing to allow use-case-specific choice of index strategy.  Then
real-world experience can show the future path to take.

On Tue, Nov 30, 2021 at 2:33 PM Regina Obe <lr at pcorp.us> wrote:

> +1 to gist_geometry_ops_2d_fastbuild
>
> The only concern I have is if we figure out how to make this fast query for
> all query cases, can we easily make it alias for the default.  Kind of how
> <-> changed from bounding box to actual dist operator.
>
> But yah as I was yelling in IRC, I'm not convinced the poor performance is
> the norm.
> For the what I consider "Common use-cases"
>
> Point in (4 or 5 polys)
> Line in (4 or 5 polys)
> Make envelope && stuff
>
> The fastbuild one had the nice effect of a smaller index and often
> outperforming the regular (or no noticeable change).
>
> But I like the idea of testing out both in real envs to see if that is true
> and also waiting for PG to  / PostGIS to get better.
>
>
>
> > -----Original Message-----
> > From: postgis-devel [mailto:postgis-devel-bounces at lists.osgeo.org] On
> Behalf
> > Of Paul Ramsey
> > Sent: Tuesday, November 30, 2021 2:20 PM
> > To: PostGIS Development Discussion <postgis-devel at lists.osgeo.org>
> > Subject: Re: [postgis-devel] GiST Sorting
> >
> > Stats for standard index
> >
> >                 gist_stat
> > ------------------------------------------
> >  Number of levels:          3            +
> >  Number of pages:           664          +
> >  Number of leaf pages:      659          +
> >  Number of tuples:          93339        +
> >  Number of invalid tuples:  0            +
> >  Number of leaf tuples:     92676        +
> >  Total size of tuples:      2621460 bytes+
> >  Total size of leaf tuples: 2602836 bytes+
> >  Total size of index:       5439488 bytes+
> >
> > Stats for sort-order index
> >
> >                 gist_stat
> > ------------------------------------------
> >  Number of levels:          3            +
> >  Number of pages:           359          +
> >  Number of leaf pages:      356          +
> >  Number of tuples:          93034        +
> >  Number of invalid tuples:  0            +
> >  Number of leaf tuples:     92676        +
> >  Total size of tuples:      2609260 bytes+
> >  Total size of leaf tuples: 2599200 bytes+
> >  Total size of index:       2940928 bytes+
> >
> >
> > So these are both 3-level indexes, but I'd say, at least from a visual
> examination
> > the standard index is just a Better Tree. Yes, it's also got sparser
> nodes, so it
> > can spend less time iterating through those level 2 parents, but it also
> just has
> > way less overlap in those nodes, so it's doing less spurious searching.
> >
> > So this is not just about sort-support being a wonderful tree but overly
> packed,
> > it's about it being a kind of Suboptimal  tree that is also overly
> packed.
> If those
> > packed up parent nodes had less overlap and better structure, both trees
> have
> > the same depth, so it seems likely it would be much harder to find
> > performance gaps between them.
> >
> > I think the best route forward (and please feel free to argue) is a whole
> 'nother
> > opclass, gist_geometry_ops_2d_fastbuild, so that people can build a "fast
> to
> > create but suboptimal to query" tree and all they need is to know the one
> > magic keyword to get it. Also that way "fast'n'suboptimal" trees can
> coexist
> > with "slow'n'good" trees and people don't have to constantly flip the
> sort-
> > support option of the default index opclass back and forth.
> >
> > P.
> >
> > > On Nov 30, 2021, at 10:28 AM, Paul Ramsey <pramsey at cleverelephant.ca>
> > wrote:
> > >
> > > Here's a more zoomed in view. I think the hilbert curve is causing a
> > > lot of less-than-optimal overlap in the parent nodes
> > > <screenshot_770.png><screenshot_769.png>
> > >
> > >> On Nov 30, 2021, at 10:20 AM, Paul Ramsey <pramsey at cleverelephant.ca>
> > wrote:
> > >>
> > >> I got gevel working on Pg14, and here's some pictures of the top two
> levels
> > of the index. Blue is the 3.2 sort-support index, red is the 3.1 standard
> index.
> > >>
> > >> <screenshot_768.jpg><screenshot_766.jpg>
> > >>
> > >>> On Nov 30, 2021, at 9:23 AM, Martin Davis <mtnclimb at gmail.com>
> > wrote:
> > >>>
> > >>> Just to get some clarity about how this works - the sorting/packing
> only
> > happens when an index is created, right?  So insert/update records are
> added
> > to the index in the normal way (i.e the same way the old code uses to
> build the
> > index in the first place).
> > >>>
> > >>> If so, will a packed index "fluff up" as more records are added?
> Which
> > should (slowly) bring query performance back in line with the standard
> index
> > approach.
> > >>>
> > >>> Of course, any reindex of the table will pack the index again,
> resulting in
> > the observed query performance hit.
> > >>>
> > >>>
> > >>> On Tue, Nov 30, 2021 at 8:43 AM Bruce Rindahl
> > <bruce.rindahl at gmail.com> wrote:
> > >>> I have to agree with Paul on this.  Indexes are usually created once
> but
> > queries are done all the time.  The only benefit would be to a spatial
> dataset
> > that is constantly updated.  Even then the update performance increase
> (if
> it
> > happens) might be small compared to the query performance.  Most
> > applications would do a massive bulk load (OSM data) then index, then
> > constant queries.
> > >>>
> > >>> On Tue, Nov 30, 2021 at 8:08 AM Paul Ramsey
> > <pramsey at cleverelephant.ca> wrote:
> > >>>
> > >>>
> > >>> > On Nov 29, 2021, at 11:03 PM, Marco Boeringa
> > <marco at boeringa.demon.nl> wrote:
> > >>> >
> > >>> > Hi Paul,
> > >>> >
> > >>> > Thanks for these insights. In fact, seeing some of the similar
> discussions
> > in the past about build time versus query performance when Han was still
> > working on this project, was one of the reasons I initially asked about
> this here
> > on the list, and asked about real world experiences with the new versions
> of
> > PostgreSQL and PostGIS.
> > >>> >
> > >>> > However, your remark about "flat trees" raises another question for
> me:
> > >>> >
> > >>> > - Is the issue you describe data size depended or not? E.g. does
> the
> 50%
> > penalty on query performance possibly only occur on small datasets, but
> not -
> > or much less so - on something like a Planet size extract of
> OpenStreetMap? Or
> > is this a universal issue?
> > >>>
> > >>> Note that, with optimal page filling (which the flatter trees get
> close to)
> > you can fit a 250000 record table into an index with only 2 levels. These
> aren't
> > just a little flat, they are really flat.
> > >>> I also tested with a 2.7M record table and found the sort-support
> index ran
> > 30% slower. My mental model, which may be wrong, says that the lower
> levels
> > of the tree tend to be the most full, so "smaller" tables (under a
> quarter
> million
> > records!) bear the brunt of the downgrade, but even larger tables see it.
> > >>>
> > >>> > Having a 50% decrease in query performance of primarily small
> > >>> > datasets might be acceptable,
> > >>>
> > >>> Uh, 100% abosutely not. Querying is something the database does all
> day,
> > every day, so every inefficiency is multiplied over all the queries that
> are run on
> > the table For All Time. Bulk indexing is something that happens Once.
> > >>>
> > >>> There is no way that we should push this kind of regression out as
> the
> > default setting. Either we should make a whole other "fast to build but
> slow to
> > query" opclass or just leave the sort-support out of the opclass and let
> people
> > with the "index build time is my constraint" people manually add the sort
> > support function to the default opclass.
> > >>>
> > >>> P
> > >>>
> > >>> > if it means much larger datasets can be indexed much faster and
> have
> > minimal impact of the same issue. That said, I understand your caution if
> the
> > 50% decrease in query performance is a universal issue with the new
> > implementation.
> > >>> >
> > >>> > Marco
> > >>> >
> > >>> > Op 30-11-2021 om 00:08 schreef Paul Ramsey:
> > >>> >> Sorry to be the rain on the parade, but I think that the results
> of
> testing
> > on performance for the new index sorting mean we should NOT be enabling
> it
> > by default for this release.
> > >>> >>
> > >>> >> Basically we have multiple tests now that show that the index is
> slower
> > for query performance, and we even have a convincing working theory for
> why
> > (the improved packing into pages/nodes means that we have very flat trees
> > which are expensive to query).
> > >>> >>
> > >>> >> I think releasing with a "new improved" version that degrades most
> > people's systems by 50% is not a good look. We can leave all the code in
> there
> > and just take the sorting function out of the opclass. That way people
> with
> > "build an index and use it once" use cases can still turn the feature on,
> but
> > people with normal use patterns aren't going to silently get a kick in
> the
> teeth
> > when they upgrade to 3.2.
> > >>> >>
> > >>> >> If you would like to replicate my results, I provide them below.
> > >>> >>
> > >>> >> P.
> > >>> >>
> > >>> >>
> > >>> >> Data:
> > >>> >> https://www.dropbox.com/s/e75g1y9ua1da6qu/roads_rdr.sql.gz?dl=0
> > >>> >>
> > >>> >> create index roads_rdr_idx on roads_rdr using gist (geom);
> > >>> >>
> > >>> >> 13/3.2/CREATE 1546
> > >>> >> 13/3.1/CREATE 1683
> > >>> >> 14/3.2/CREATE 200
> > >>> >> 14/3.1/CREATE 1705
> > >>> >>
> > >>> >> select count(*) from roads_rdr a, roads_rdr b where a.geom &&
> > >>> >> b.geom;
> > >>> >>
> > >>> >> 13/3.2/QUERY  4600
> > >>> >> 13/3.1/QUERY 4540
> > >>> >> 14/3.2/QUERY 11200
> > >>> >> 14/3.1/QUERY 4700
> > >>> >>
> > >>> >> select pg_relation_size('roads_rdr_idx');
> > >>> >>
> > >>> >> 13/3.2/IDXSIZE  5414912
> > >>> >> 13/3.1/IDXSIZE 5496832
> > >>> >> 14/3.2/IDXSIZE 2940928
> > >>> >> 14/3.1/IDXSIZE 5439488
> > >>> >> _______________________________________________
> > >>> >> postgis-devel mailing list
> > >>> >> postgis-devel at lists.osgeo.org
> > >>> >> https://lists.osgeo.org/mailman/listinfo/postgis-devel
> > >>> > _______________________________________________
> > >>> > postgis-devel mailing list
> > >>> > postgis-devel at lists.osgeo.org
> > >>> > https://lists.osgeo.org/mailman/listinfo/postgis-devel
> > >>>
> > >>> _______________________________________________
> > >>> postgis-devel mailing list
> > >>> postgis-devel at lists.osgeo.org
> > >>> https://lists.osgeo.org/mailman/listinfo/postgis-devel
> > >>> _______________________________________________
> > >>> postgis-devel mailing list
> > >>> postgis-devel at lists.osgeo.org
> > >>> https://lists.osgeo.org/mailman/listinfo/postgis-devel
> > >>> _______________________________________________
> > >>> postgis-devel mailing list
> > >>> postgis-devel at lists.osgeo.org
> > >>> https://lists.osgeo.org/mailman/listinfo/postgis-devel
> > >>
> > >
> >
> > _______________________________________________
> > postgis-devel mailing list
> > postgis-devel at lists.osgeo.org
> > https://lists.osgeo.org/mailman/listinfo/postgis-devel
>
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-devel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-devel/attachments/20211130/07aa91d7/attachment-0001.html>


More information about the postgis-devel mailing list