[SoC] GSoC 2021 - Final Report: Implement pre-sorting methods before GiST index building

Han Wang hanwgeek at gmail.com
Mon Aug 23 06:54:32 PDT 2021


Hi Rajat,

I made some updates to the report. And I publish here for backup.

*Abstract*

GiST(Generalized Search Tree) is a generalization data structure of a
variety of disk-based height-balanced search trees. Under the high-level
API of GiST, structures like b-tree, r-tree can be implemented for data
management. PostgreSQL defines a set of process function APIs for elements
of the GiST index. Only with these function implementations can a data type
be indexed and managed by a GiST structure. In large data scenarios,
pre-sorting a batch of data fetched in memory may be a local approximation
to the global sorting method. Recent PostgreSQL patch shows that it should
speed up the build of a GiST index after some pre-sorting of the data which
needs to be indexed. In one fork, the author replaces the GIST_OPTIONS_PROC
with GIST_ORDER_PROC to try to define an order for data fetched in memory
to sort in order to speed up the subsequent index building process. And I
implemented pre-sorting methods in z-order pattern and Hilbert order
pattern, Alos tested and compared pre-sorting methods on various data.

*The state of the art BEFORE your GSoC*

The index building process does not change the tuple order in the page and
runs at a slow speed

*The addition value*

With the pre-sorting index, the time of building index reduce to the to
one-third to one-fifth of the original

*Links*

   - Code Base: https://github.com/postgis/postgis/pull/619
   - Wiki:
   https://trac.osgeo.org/postgis/wiki/ImplementSortingMethodsBeforeGistIndexBuilding
   - Test Results and Performance Comparison:
   https://docs.google.com/document/d/1_mY_F2hPDk3vmXH5PPp2z9BuQWt-ZMORk6KxtdVQ3HY/edit?usp=sharing

*Conclusion*

   - The Morton/Hilbert hash function for pre-sorting can accelerate the
   process of index building process and reduce the time-consuming to
   one-third to one-fifth of the original
   - The query operation for different data in the same area demonstrates
   more stable performance
   - The query performance of pre-sorting is about 30% slower than the
   original

*Future Work*

   - Try to figure out the reason for the slowness of the pre-sorting index
   - Implement a fast Morton/Hilbert hash function for n-dimension geometry
   objects

[image: pre-sort-gist.png]

                        The flow chart of pre-sort gist index

Thanks for your excellent work and assistance!

Best regards,

Han
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/soc/attachments/20210823/1aee43c7/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pre-sort-gist.png
Type: image/png
Size: 130846 bytes
Desc: not available
URL: <http://lists.osgeo.org/pipermail/soc/attachments/20210823/1aee43c7/attachment-0001.png>


More information about the SoC mailing list