[postgis-users] ST_Subdivide vertex limit and TOAST overhead (Follow up)

Marco Boeringa marco at boeringa.demon.nl
Sat Feb 19 08:42:31 PST 2022


Depends on what your interests are...

And there is definitely reason to subdivide this dataset. The 
generalization processing of these OpenStreetMap woodland involves 
dissolving, so some of resulting geometries of that process step are 
absolutely huge (well over 1M vertices), and need subdividing to avoid 
subsequent processing issues.

I was just interested to see if in a real world scenario like this, the 
'subdividing-to-avoid-TOAST' could also mean the data might be displayed 
quicker in a GIS like QGIS, which is always a welcome enhancement, as 
not having to go through de-toasting, despite a larger number of 
records. For this specific dataset, the answer is no. There may well be 
situations, datasets, and scenario's, e.g. the ones you point out, where 
it is useful to set such small limit so as to avoid TOAST.

I thought it might just be interesting for others to read.

Marco

Op 19-2-2022 om 17:04 schreef Darafei "Komяpa" Praliaskouski:
> Hi Marko,
>
> The scenarios you benchmarked don't hit any of the reasons to use the 
> Subdivide. The interesting ones are KNN K=1 and exists(where 
> ST_Intersects()).
>
> On Thu, Feb 17, 2022 at 9:49 PM Marco Boeringa 
> <marco at boeringa.demon.nl> wrote:
>
>     Hi all,
>
>     As a follow up to the mail that I attached below and that
>     unfortunately
>     presented bogus results due to a processing error, I have now
>     revisited
>     the issue and can now present some realistic figures for the case of
>     TOAST overhead versus optimal ST_Subdivide default (which as Paul
>     pointed out would be 96 vertices to avoid spilling geometries over to
>     TOAST).
>
>     The timings are for zooming to either the full data extent or a
>     partial
>     extent in QGIS.
>
>     The data represents generalized woodland polygons designed for a
>     display
>     scale of 1:100k to 1:250k. This is a custom generalization
>     routine, that
>     included smoothing as well, and gives more pleasing and
>     cartographically
>     sound results than just running ST_Simplify alone.
>
>     As you can see from the stats, the dataset subdivided at 96
>     vertices has
>     understandably a lot more records, almost triple the one
>     subdivided with
>     a 5000 vertices limit, although the disk size is only about 20%
>     larger
>     (as displayed by DBeaver).
>
>     Clearly, from the result, even though fetching a single non-toasted
>     record is significantly faster (almost 3x as much records are
>     retrieved
>     in only about 1,6x longer time period), compared to records that have
>     been subdivided with a much larger limit, the significantly larger
>     number of records to process does in fact mean that in day to day
>     usage
>     you may not see this benefit, and the net result of
>     'subdividing-to-avoid-TOAST' may in fact be negative in terms of
>     total
>     processing / display time, depending on the nature of the dataset.
>
>     Note that for the 5000 vertices dataset, there are only 191,005
>     records
>     (14%) with > 96 vertices, and only  141,899 records (11%) with > 128
>     vertices (another figure Paul mentioned as a possible
>     'spill-over-to-TOAST' limit), so the vast majority of records (86%
>     at 96
>     limit) are still below the TOAST limit even for the dataset
>     subdivided
>     with 5000 vertices limit.
>
>     Also note that the data resided on a NVMe based array, so access time
>     overhead for TOAST are likely limited compared to HDD.
>
>     The result are as follows:
>
>     * Stats: *
>     96 vertices: 3,754,257 records: 3.3GB disk size
>     5000 vertices: 1,332,258 records: 2.7GB disk size
>
>     * Zooming to full extent of the data: *
>     96 vertices: 68s / 68s /65s / 64s: 66s average
>     5000 vertices: 41s / 40s / 38s / 41: 40s average
>
>     * Zooming to partial extent of the data: *
>     96 vertices: 4,51s / 4,12s /4,30s / 4,32s
>     5000 vertices: 2,21s / 2,14s / 2,51s / 2,53s
>
>     Marco
>
>     Op 26-1-2022 om 10:32 schreef Marco Boeringa:
>     > Hi all,
>     >
>     > After Paul's remarks here on the list about the cost of TOAST in
>     > relation to the optimal default for ST_Subdivide's vertex limit (96
>     > according to Paul's tests), I got a bit fascinated and wanted to do
>     > some testing myself.
>     >
>     > Until Paul's remark, I never gave much thought about TOAST
>     overhead in
>     > relation to my OpenStreetMap database. I just simply assumed it
>     as a
>     > fact, as it was likely big geometries needed TOASTing in many cases.
>     >
>     > However, since Paul gave a clear guideline to prevent TOASTing,
>     I gave
>     > it a try and collected some rough statistics.
>     >
>     > The data is from generalized OpenStreetMap woodland polygons,
>     some of
>     > which are absolutely huge before ST_Subdivide kicks in in the
>     > generalization processing (> 100k vertices), as I amalgamate
>     them to
>     > bigger structures in the generalization processing.
>     >
>     > I now tested with two subdivide limits: the default 5000 I had been
>     > using up to now, which seemed a reasonable compromise between
>     limiting
>     > the number of vertices in a polygon and the number of output
>     polygons
>     > at the same time: not to small to generate large amounts of splits,
>     > but also not to big to cause issues with display times.
>     >
>     > Next, I used Paul's recommended "prevent TOAST" limit of 96
>     vertices.
>     > I subsequently looked at display times for the entire dataset in
>     QGIS
>     > by zooming to the dataset's extent and timing the display time.
>     >
>     > The result are as follows:
>     >
>     > 96 vertices: 1,996,226 records: 1.8GB disk size: 33s / 32s /33s
>     / 32s
>     > 5000 vertices: 1,332,258 records: 2.7GB disk size: 45s / 39s /
>     38s / 39s
>     >
>     > A few take aways:
>     >
>     > - What I never realised before, is also the disk size cost of
>     TOAST:
>     > as can be seen, the '5000' limit size, which requires many
>     geometries
>     > to be TOASTed, results in an almost 40% larger disk size for the
>     > relation according to DBeaver (2.7 versus 1.8 GB for '5000' versus
>     > '96' vertex limit).
>     >
>     > - Non-TOASTed records have an about 20-35% faster retrieval time,
>     > although it seems that especially the initial time for TOASTed
>     has a
>     > bigger delay (45s), I guess this is because the de-TOASTed
>     records are
>     > subsequently cached. Even taking that in account, the overhead
>     seems
>     > to plateau at 20% minimum.
>     >
>     > - Counter-intuitively, displaying almost 600k (non-TOASTed) records
>     > more due to much smaller ST_Subdivide vertex limit, is still
>     > considerably faster than the displaying the smaller (in terms of
>     > records) dataset that did get TOASTed.
>     >
>     > Does this all seem about right? And does this fit other users
>     > experiences?
>     >
>     > Of course, despite the gains of non TOASTing, you still have to
>     > evaluate for each dataset whether subdividing even makes sense:
>     it is
>     > usually the last step in processing, and if you actually need the
>     > entire polygon for e.g. labelling purposes in QGIS, than
>     subdividing
>     > in such small pieces as to prevent TOASTing, doesn't make sense
>     at all.
>     >
>     > Marco
>     >
>     > _______________________________________________
>     > postgis-users mailing list
>     > postgis-users at lists.osgeo.org
>     > https://lists.osgeo.org/mailman/listinfo/postgis-users
>     _______________________________________________
>     postgis-users mailing list
>     postgis-users at lists.osgeo.org
>     https://lists.osgeo.org/mailman/listinfo/postgis-users
>
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20220219/cd00f6e2/attachment.html>


More information about the postgis-users mailing list