[postgis-devel] Issues about the GSoC project
Raúl Marín
raul at rmr.ninja
Mon Jul 26 13:44:42 PDT 2021
Hi,
As Paul is mentioning, it appears that you are assuming that having lots
of shared buffer hits is good but it isn't. A shared buffer hit meant
that you needed to read a page and it happened to be cached in memory;
but indexes are useful because they avoid the need of reading
unnecessary pages (in memory or in disk), so an ideal sort would be one
that put the data you are looking in the same pages (so all the rows in
the page are useful) and the ideal index would be one that knew which
pages as fast as possible (also reading as few index pages as possible).
It's good that you are measuring query times and it's clear that
something odd is going on.
If I had to debug this I would start by making sure I'm measuring things
that are comparable:
* For the queries running indexes, check that they are using the same
plan and that they are always using the index in the queries.
* Run multiple queries (repeatedly) for each index, not just once and
get minimum / avg / mode of the timings.
* Make sure Postgresql and Postgis are build with optimizations on. You
can build with both -O2 (or -O3) and -g, so you get optimized code with
symbols. Disable any extra logging.
* Disable Postgresql JIT (if you haven't done it already) as it won't
help you here and will make things harder to measure.
* Confirm you are getting the same output for all indexes and queries.
If you have that already, I would recommend focusing on just one query
(I'd say && is the most used operator) and confirm that you see similar
results with different input (changing the geometry you are querying
against the index). It's not the same to run a query that looks for 1%
of the data, that one that looks for 10% or 90% (where the index isn't
useful and probably won't be used by Postgres).
If you have something that is more or less reproducible then it's easier
to debug with an smaller dataset and you can use anything from a
profiler (like perf / Xcode Instruments), extra logs, etc to try to
understand how many pages you are trying to read with the query and why
it is so different between gist and hilbert presort.
Regards,
Raúl.
On 26/7/21 20:55, Paul Ramsey wrote:
> The column names in the table are a little confusing :)
>
> I don't have any thing I can think that would explain the execution time difference, so you'll probably need to gather more evidence. Comparing the call counts (gproff? callgrind?) might give a clue where the extra work is going in the sorted indexes. Actually, if your sort function was in fact "desorting" the inputs, that could explain both the extra shared buffer hits and the performance... like, if your whole index floats into memory then you'd expect lots of buffer hits, and the more work on the index, the more hits. So if you had a terrible index you'd get both lots of buffer hits and a lot of time spent.
>
> Basically, it shouldn't take a lot of extra steps on a sorted index than on a normal gist index, so if there *are* a lot of extra index accesses... something is very awry in the sorting.
>
> P.
>
>> On Jul 25, 2021, at 10:21 PM, Han Wang <hanwgeek at gmail.com> wrote:
>>
>> Hi Giuseppe and Hi Regina,
>>
>> After checking the paper of GiST and implementation in Postgres. I think the query performance should be considered besides the building process. In the larger data test scenario, the building time of different indexes are similar because Postgres just hashes the tuples and sorts them and packs them into pages, building the tree index from bottom to up. With a bad hash order definition, the building process cannot detect the poor index query performance. So it is necessary to test the index query performance. I have tested the query performance with the `EXPLAIN` operator, using the sql scripts like other indexes in the `/regress`. But I am not familiar with PL/pgSQL, so I handle the log with some python scripts.
>>
>> In this test, I focus on the buffer hits and execution time of different tasks of different index types including `No Index`, `Simple GiST index`, `X hash function`, `morton hash function` and `hilbert hash function`. And there are some results:
>> Shared buffer hits:
>> Index Create Time(ms) << &< && &> >> ~= ~ @ &<| <<| |>> |&>
>> 0 No Index 0 18 18 18 18 18 18 18 18 18 18 18 18
>> 1 GiST Index 25.249 40237 46085 3009 40297 46025 3009 3009 3009 41994 45295 41934 45355
>> 2 X PreSort Index 8.829 443620 441235 3009 444501 440354 3009 3009 3009 441568 442503 440687 443384
>> 3 Morton PreSort Index 16.885 447779 447446 4079 448669 446556 4079 4079 4079 445362 449428 444472 450318
>> 4 Hilbert PreSort Index 16.824 446714 444058 3558 447600 443172 3558 3558 3558 446072 445394 445186 446280
>> Execution time:
>>
>> Index Create Time(ms) << &< && &> >> ~= ~ @ &<| <<| |>> |&>
>> 0 No Index 0 567.251 565.720 481.618 561.877 563.452 480.128 478.275 478.518 567.144 572.191 563.255 556.281
>> 1 GiST Index 25.249 289.255 293.143 28.838 289.002 291.296 28.336 28.597 26.947 295.394 297.556 293.988 299.760
>> 2 X PreSort Index 8.829 440.861 445.630 37.960 439.564 440.535 37.594 37.979 37.741 386.662 384.635 385.166 388.832
>> 3 Morton PreSort Index 16.885 421.999 413.427 77.002 422.939 412.130 77.415 75.102 76.056 416.205 446.599 410.614 434.613
>> 4 Hilbert PreSort Index 16.824 417.539 415.962 56.583 421.226 414.553 56.320 55.600 55.338 416.639 421.243 418.550 417.094
>> The number of shared buffer hits are far bigger than the original one. But what confuses me is that the execution times are worse. I am trying to figure out why this happened.
>> What's more, I am not very clear about the relationship between query performance and the number of shared buffer hits.
>>
>> If you have any questions or suggestions, please let me know.
>>
>> Best regards,
>> Han
>>
>> On Thu, Jul 8, 2021 at 1:32 AM Giuseppe Broccolo <g.broccolo.7 at gmail.com> wrote:
>> Hi Han,
>>
>> Il giorno mer 7 lug 2021 alle ore 18:05 Han Wang <hanwgeek at gmail.com> ha scritto:
>> Hi Regina and hi Giuseppe,
>>
>> Thanks for your reply!
>>
>> I am now checking the original paper and postgres's implement of gist. And now I think the query performance test after building gist index is necessary. Because as Darafei mentioned, the fast sorting building method may just pack the sorted tuples into pages regardless of which hash function it use. The correctness of index may be checked in the runtime query. At present, I am working on confirm the io access of hash functions.
>>
>> Feel free to give suggestions or questions.
>>
>> The correctness of the index should be covered by some of the regression tests in action for the GiST support in PostGIS - for instance, kNN searches etc. - for the moment I am really curious about the I/O access of the hash functions. This is something maybe it has not been checked even in PostgreSQL, and maybe it would be good to share the results with them as well.
>>
>> Keep us updated, and thank you for your help!
>>
>> Giuseppe.
>> _______________________________________________
>> postgis-devel mailing list
>> postgis-devel at lists.osgeo.org
>> https://lists.osgeo.org/mailman/listinfo/postgis-devel
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-devel
More information about the postgis-devel
mailing list