[postgis-devel] Issues about the GSoC project

Han Wang hanwgeek at gmail.com
Tue Jul 27 09:09:09 PDT 2021


Hi Raúl,

Thanks for your suggestions!

* Make sure Postgresql and Postgis are build with optimizations on. You
> can build with both -O2 (or -O3) and -g, so you get optimized code with
> symbols. Disable any extra logging.
>
Actually, I build Postgres and PostGIS with `-O0` to generate a detailed
function calling information in the `FlameGraph` before.
And I test the index with the same build version. I will change the build
later.

I also made a test document draft here((
https://docs.google.com/document/d/1m4oxBAsKCyjAnYmkCmQ0X_ltiid5tliFwF3rtdzlKsc/edit?usp=sharing
).
And I am not sure why the `no index` way of querying would only hit 18
buffers. I am working on that.

Best regards,
Han

On Tue, Jul 27, 2021 at 4:55 AM Raúl Marín <raul at rmr.ninja> wrote:

> Hi,
>
> As Paul is mentioning, it appears that you are assuming that having lots
> of shared buffer hits is good but it isn't. A shared buffer hit meant
> that you needed to read a page and it happened to be cached in memory;
> but indexes are useful because they avoid the need of reading
> unnecessary pages (in memory or in disk), so an ideal sort would be one
> that put the data you are looking in the same pages (so all the rows in
> the page are useful) and the ideal index would be one that knew which
> pages as fast as possible (also reading as few index pages as possible).
>
> It's good that you are measuring query times and it's clear that
> something odd is going on.
>
> If I had to debug this I would start by making sure I'm measuring things
> that are comparable:
> * For the queries running indexes, check that they are using the same
> plan and that they are always using the index in the queries.
> * Run multiple queries (repeatedly) for each index, not just once and
> get minimum / avg / mode of the timings.
> * Make sure Postgresql and Postgis are build with optimizations on. You
> can build with both -O2 (or -O3) and -g, so you get optimized code with
> symbols. Disable any extra logging.
> * Disable Postgresql JIT (if you haven't done it already) as it won't
> help you here and will make things harder to measure.
> * Confirm you are getting the same output for all indexes and queries.
>
> If you have that already, I would recommend focusing on just one query
> (I'd say && is the most used operator) and confirm that you see similar
> results with different input (changing the geometry you are querying
> against the index). It's not the same to run a query that looks for 1%
> of the data, that one that looks for 10% or 90% (where the index isn't
> useful and probably won't be used by Postgres).
> If you have something that is more or less reproducible then it's easier
> to debug with an smaller dataset and you can use anything from a
> profiler (like perf / Xcode Instruments), extra logs, etc to try to
> understand how many pages you are trying to read with the query and why
> it is so different between gist and hilbert presort.
>
> Regards,
> Raúl.
>
> On 26/7/21 20:55, Paul Ramsey wrote:
> > The column names in the table are a little confusing :)
> >
> > I don't have any thing I can think that would explain the execution time
> difference, so you'll probably need to gather more evidence. Comparing the
> call counts (gproff? callgrind?) might give a clue where the extra work is
> going in the sorted indexes. Actually, if your sort function was in fact
> "desorting" the inputs, that could explain both the extra shared buffer
> hits and the performance... like, if your whole index floats into memory
> then you'd expect lots of buffer hits, and the more work on the index, the
> more hits. So if you had a terrible index you'd get both lots of buffer
> hits and a lot of time spent.
> >
> > Basically, it shouldn't take a lot of extra steps on a sorted index than
> on a normal gist index, so if there *are* a lot of extra index accesses...
> something is very awry in the sorting.
> >
> > P.
> >
> >> On Jul 25, 2021, at 10:21 PM, Han Wang <hanwgeek at gmail.com> wrote:
> >>
> >> Hi Giuseppe and Hi Regina,
> >>
> >> After checking the paper of GiST and implementation in Postgres. I
> think the query performance should be considered besides the building
> process. In the larger data test scenario, the building time of different
> indexes are similar because Postgres just hashes the tuples and sorts them
> and packs them into pages, building the tree index from bottom to up. With
> a bad hash order definition, the building process cannot detect the poor
> index query performance. So it is necessary to test the index query
> performance. I have tested the query performance with the `EXPLAIN`
> operator, using the sql scripts like other indexes in the `/regress`. But I
> am not familiar with PL/pgSQL, so I handle the log with some python scripts.
> >>
> >> In this test, I focus on the buffer hits and execution time of
> different tasks of different index types including `No Index`, `Simple GiST
> index`, `X hash function`, `morton hash function` and `hilbert hash
> function`. And there are some results:
> >> Shared buffer hits:
> >>   Index      Create Time(ms) <<      &<      &&      &>      >>
> ~=      ~       @       &<|     <<|     |>>     |&>
> >> 0    No Index        0       18      18      18      18      18
> 18      18      18      18      18      18      18
> >> 1    GiST Index      25.249  40237   46085   3009    40297   46025
>  3009    3009    3009    41994   45295   41934   45355
> >> 2    X PreSort Index 8.829   443620  441235  3009    444501  440354
> 3009    3009    3009    441568  442503  440687  443384
> >> 3    Morton PreSort Index    16.885  447779  447446  4079    448669
> 446556  4079    4079    4079    445362  449428  444472  450318
> >> 4    Hilbert PreSort Index   16.824  446714  444058  3558    447600
> 443172  3558    3558    3558    446072  445394  445186  446280
> >> Execution time:
> >>
> >> Index        Create Time(ms) <<      &<      &&      &>      >>
> ~=      ~       @       &<|     <<|     |>>     |&>
> >> 0    No Index        0       567.251 565.720 481.618 561.877 563.452
> 480.128 478.275 478.518 567.144 572.191 563.255 556.281
> >> 1    GiST Index      25.249  289.255 293.143 28.838  289.002 291.296
> 28.336  28.597  26.947  295.394 297.556 293.988 299.760
> >> 2    X PreSort Index 8.829   440.861 445.630 37.960  439.564 440.535
> 37.594  37.979  37.741  386.662 384.635 385.166 388.832
> >> 3    Morton PreSort Index    16.885  421.999 413.427 77.002  422.939
> 412.130 77.415  75.102  76.056  416.205 446.599 410.614 434.613
> >> 4    Hilbert PreSort Index   16.824  417.539 415.962 56.583  421.226
> 414.553 56.320  55.600  55.338  416.639 421.243 418.550 417.094
> >> The number of shared buffer hits are far bigger than the original one.
> But what confuses me is that the execution times are worse. I am trying to
> figure out why this happened.
> >> What's more, I am not very clear about the relationship between query
> performance and the number of shared buffer hits.
> >>
> >> If you have any questions or suggestions, please let me know.
> >>
> >> Best regards,
> >> Han
> >>
> >> On Thu, Jul 8, 2021 at 1:32 AM Giuseppe Broccolo <
> g.broccolo.7 at gmail.com> wrote:
> >> Hi Han,
> >>
> >> Il giorno mer 7 lug 2021 alle ore 18:05 Han Wang <hanwgeek at gmail.com>
> ha scritto:
> >> Hi Regina and hi Giuseppe,
> >>
> >> Thanks for your reply!
> >>
> >> I am now checking the original paper and postgres's implement of gist.
> And now I think the query performance test after building gist index is
> necessary. Because as Darafei mentioned, the fast sorting building method
> may just pack the sorted tuples into pages regardless of which hash
> function it use. The correctness of index may be checked in the runtime
> query. At present, I am working on confirm the io access of hash functions.
> >>
> >> Feel free to give suggestions or questions.
> >>
> >> The correctness of the index should be covered by some of the
> regression tests in action for the GiST support in PostGIS - for instance,
> kNN searches etc. - for the moment I am really curious about the I/O access
> of the hash functions. This is something maybe it has not been checked even
> in PostgreSQL, and maybe it would be good to share the results with them as
> well.
> >>
> >> Keep us updated, and thank you for your help!
> >>
> >> Giuseppe.
> >> _______________________________________________
> >> postgis-devel mailing list
> >> postgis-devel at lists.osgeo.org
> >> https://lists.osgeo.org/mailman/listinfo/postgis-devel
> > _______________________________________________
> > postgis-devel mailing list
> > postgis-devel at lists.osgeo.org
> > https://lists.osgeo.org/mailman/listinfo/postgis-devel
>
>
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-devel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-devel/attachments/20210728/22c98a9e/attachment-0001.html>


More information about the postgis-devel mailing list