[postgis-devel] GiST Sorting

Tue Nov 30 22:06:42 PST 2021

>
> I think the best route forward (and please feel free to argue) is a whole
> 'nother opclass, gist_geometry_ops_2d_fastbuild, so that people can build a
> "fast to create but suboptimal to query" tree and all they need is to know
> the one magic keyword to get it. Also that way "fast'n'suboptimal" trees
> can coexist with "slow'n'good" trees and people don't have to constantly
> flip the sort-support option of the default index opclass back and forth.
>

What I don't want is to get legacy APIs that we know how to remove. I say
if it is not the default method this year, then make the ALTER OPERATOR
CLASS the trick for now for people who need it, and aim at enabling it by
default next year with Postgres 15 improvements.

People who need it are middle class openstreetmap enthusiasts who want to
play with openstreetmap data with osm2pgsql but they also have a life and
sometimes turn off their computer at night. Planet scale osm2pgsql can take
a weekend easily, and index building is the most frustrating part of it.
These are the same people who will appreciate 2x smaller indexes.

This whole "space-filling curve indexing" endeavour started in 2.4 with
btree z-curve opclass and has been improving ever since. Current best
option to get such indexing is via ORDER BY geom and a BRIN. Right now
we've gone up the situation where Postgres does not provide enough/proper
infrastructure to implement the things we need.

The plan is:
 - get feature out as an ALTER OPERATOR CLASS so Postgres people can play
with it not on their toy datatypes.
 - build benchmark for the case so people on postgres side know what we
need [done, thanks Paul],
 - on postgres side, fix the leaf page sort to be not by Hilbert key but by
tid, so there's disk locality in search (issue found by Han),
 - on postgres side, implement tree balancing via picksplit. I suspect that
a good implementation will need at least one new function on our side, a
box2df overlap estimator - which is similar but not exactly inverse of
penalty.
Two above points discussed on pgsql-hackers:
https://www.postgresql.org/message-id/E5F64198-C592-4916-873E-74E33A7A935A@yandex-team.ru

 - after the above is done, I expect that the previous performance of gist
can be replicated via creation of index with fillfactor=50. I consider the
current implementation not being able to pack the tree to the required
fillfactor of 100 a bug really.

In the creation of gist_geometry_ops_2d_fastbuild we will deprecate that in
3.3 if all goes well in the above which does not seem great.

>
> P.
>
> > On Nov 30, 2021, at 10:28 AM, Paul Ramsey <pramsey at cleverelephant.ca>
> wrote:
> >
> > Here's a more zoomed in view. I think the hilbert curve is causing a lot
> of less-than-optimal overlap in the parent nodes
> > <screenshot_770.png><screenshot_769.png>
> >
> >> On Nov 30, 2021, at 10:20 AM, Paul Ramsey <pramsey at cleverelephant.ca>
> wrote:
> >>
> >> I got gevel working on Pg14, and here's some pictures of the top two
> levels of the index. Blue is the 3.2 sort-support index, red is the 3.1
> standard index.
> >>
> >> <screenshot_768.jpg><screenshot_766.jpg>
> >>
> >>> On Nov 30, 2021, at 9:23 AM, Martin Davis <mtnclimb at gmail.com> wrote:
> >>>
> >>> Just to get some clarity about how this works - the sorting/packing
> only happens when an index is created, right?  So insert/update records are
> added to the index in the normal way (i.e the same way the old code uses to
> build the index in the first place).
> >>>
> >>> If so, will a packed index "fluff up" as more records are added?
> Which should (slowly) bring query performance back in line with the
> standard index approach.
> >>>
> >>> Of course, any reindex of the table will pack the index again,
> resulting in the observed query performance hit.
> >>>
> >>>
> >>> On Tue, Nov 30, 2021 at 8:43 AM Bruce Rindahl <bruce.rindahl at gmail.com>
> wrote:
> >>> I have to agree with Paul on this.  Indexes are usually created once
> but queries are done all the time.  The only benefit would be to a spatial
> dataset that is constantly updated.  Even then the update performance
> increase (if it happens) might be small compared to the query performance.
> Most applications would do a massive bulk load (OSM data) then index, then
> constant queries.
> >>>
> >>> On Tue, Nov 30, 2021 at 8:08 AM Paul Ramsey <pramsey at cleverelephant.ca>
> wrote:
> >>>
> >>>
> >>> > On Nov 29, 2021, at 11:03 PM, Marco Boeringa <
> marco at boeringa.demon.nl> wrote:
> >>> >
> >>> > Hi Paul,
> >>> >
> >>> > Thanks for these insights. In fact, seeing some of the similar
> discussions in the past about build time versus query performance when Han
> was still working on this project, was one of the reasons I initially asked
> about this here on the list, and asked about real world experiences with
> the new versions of PostgreSQL and PostGIS.
> >>> >
> >>> > However, your remark about "flat trees" raises another question for
> me:
> >>> >
> >>> > - Is the issue you describe data size depended or not? E.g. does the
> 50% penalty on query performance possibly only occur on small datasets, but
> not - or much less so - on something like a Planet size extract of
> OpenStreetMap? Or is this a universal issue?
> >>>
> >>> Note that, with optimal page filling (which the flatter trees get
> close to) you can fit a 250000 record table into an index with only 2
> levels. These aren't just a little flat, they are really flat.
> >>> I also tested with a 2.7M record table and found the sort-support
> index ran 30% slower. My mental model, which may be wrong, says that the
> lower levels of the tree tend to be the most full, so "smaller" tables
> (under a quarter million records!) bear the brunt of the downgrade, but
> even larger tables see it.
> >>>
> >>> > Having a 50% decrease in query performance of primarily small
> datasets might be acceptable,
> >>>
> >>> Uh, 100% abosutely not. Querying is something the database does all
> day, every day, so every inefficiency is multiplied over all the queries
> that are run on the table For All Time. Bulk indexing is something that
> happens Once.
> >>>
> >>> There is no way that we should push this kind of regression out as the
> default setting. Either we should make a whole other "fast to build but
> slow to query" opclass or just leave the sort-support out of the opclass
> and let people with the "index build time is my constraint" people manually
> add the sort support function to the default opclass.
> >>>
> >>> P
> >>>
> >>> > if it means much larger datasets can be indexed much faster and have
> minimal impact of the same issue. That said, I understand your caution if
> the 50% decrease in query performance is a universal issue with the new
> implementation.
> >>> >
> >>> > Marco
> >>> >
> >>> > Op 30-11-2021 om 00:08 schreef Paul Ramsey:
> >>> >> Sorry to be the rain on the parade, but I think that the results of
> testing on performance for the new index sorting mean we should NOT be
> enabling it by default for this release.
> >>> >>
> >>> >> Basically we have multiple tests now that show that the index is
> slower for query performance, and we even have a convincing working theory
> for why (the improved packing into pages/nodes means that we have very flat
> trees which are expensive to query).
> >>> >>
> >>> >> I think releasing with a "new improved" version that degrades most
> people's systems by 50% is not a good look. We can leave all the code in
> there and just take the sorting function out of the opclass. That way
> people with "build an index and use it once" use cases can still turn the
> feature on, but people with normal use patterns aren't going to silently
> get a kick in the teeth when they upgrade to 3.2.
> >>> >>
> >>> >> If you would like to replicate my results, I provide them below.
> >>> >>
> >>> >> P.
> >>> >>
> >>> >>
> >>> >> Data:
> https://www.dropbox.com/s/e75g1y9ua1da6qu/roads_rdr.sql.gz?dl=0
> >>> >>
> >>> >> create index roads_rdr_idx on roads_rdr using gist (geom);
> >>> >>
> >>> >> 13/3.2/CREATE 1546
> >>> >> 13/3.1/CREATE 1683
> >>> >> 14/3.2/CREATE 200
> >>> >> 14/3.1/CREATE 1705
> >>> >>
> >>> >> select count(*) from roads_rdr a, roads_rdr b where a.geom &&
> b.geom;
> >>> >>
> >>> >> 13/3.2/QUERY  4600
> >>> >> 13/3.1/QUERY 4540
> >>> >> 14/3.2/QUERY 11200
> >>> >> 14/3.1/QUERY 4700
> >>> >>
> >>> >> select pg_relation_size('roads_rdr_idx');
> >>> >>
> >>> >> 13/3.2/IDXSIZE  5414912
> >>> >> 13/3.1/IDXSIZE 5496832
> >>> >> 14/3.2/IDXSIZE 2940928
> >>> >> 14/3.1/IDXSIZE 5439488
> >>> >> _______________________________________________
> >>> >> postgis-devel mailing list
> >>> >> postgis-devel at lists.osgeo.org
> >>> >> https://lists.osgeo.org/mailman/listinfo/postgis-devel
> >>> > _______________________________________________
> >>> > postgis-devel mailing list
> >>> > postgis-devel at lists.osgeo.org
> >>> > https://lists.osgeo.org/mailman/listinfo/postgis-devel
> >>>
> >>> _______________________________________________
> >>> postgis-devel mailing list
> >>> postgis-devel at lists.osgeo.org
> >>> https://lists.osgeo.org/mailman/listinfo/postgis-devel
> >>> _______________________________________________
> >>> postgis-devel mailing list
> >>> postgis-devel at lists.osgeo.org
> >>> https://lists.osgeo.org/mailman/listinfo/postgis-devel
> >>> _______________________________________________
> >>> postgis-devel mailing list
> >>> postgis-devel at lists.osgeo.org
> >>> https://lists.osgeo.org/mailman/listinfo/postgis-devel
> >>
> >
>
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-devel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-devel/attachments/20211201/2ab94333/attachment.html>