[postgis-users] ST_Subdivide vertex limit and TOAST overhead (Follow up)

Thu Feb 17 10:49:03 PST 2022

Hi all,

As a follow up to the mail that I attached below and that unfortunately 
presented bogus results due to a processing error, I have now revisited 
the issue and can now present some realistic figures for the case of 
TOAST overhead versus optimal ST_Subdivide default (which as Paul 
pointed out would be 96 vertices to avoid spilling geometries over to 
TOAST).

The timings are for zooming to either the full data extent or a partial 
extent in QGIS.

The data represents generalized woodland polygons designed for a display 
scale of 1:100k to 1:250k. This is a custom generalization routine, that 
included smoothing as well, and gives more pleasing and cartographically 
sound results than just running ST_Simplify alone.

As you can see from the stats, the dataset subdivided at 96 vertices has 
understandably a lot more records, almost triple the one subdivided with 
a 5000 vertices limit, although the disk size is only about 20% larger 
(as displayed by DBeaver).

Clearly, from the result, even though fetching a single non-toasted 
record is significantly faster (almost 3x as much records are retrieved 
in only about 1,6x longer time period), compared to records that have 
been subdivided with a much larger limit, the significantly larger 
number of records to process does in fact mean that in day to day usage 
you may not see this benefit, and the net result of 
'subdividing-to-avoid-TOAST' may in fact be negative in terms of total 
processing / display time, depending on the nature of the dataset.

Note that for the 5000 vertices dataset, there are only 191,005 records 
(14%) with > 96 vertices, and only  141,899 records (11%) with > 128 
vertices (another figure Paul mentioned as a possible 
'spill-over-to-TOAST' limit), so the vast majority of records (86% at 96 
limit) are still below the TOAST limit even for the dataset subdivided 
with 5000 vertices limit.

Also note that the data resided on a NVMe based array, so access time 
overhead for TOAST are likely limited compared to HDD.

The result are as follows:

* Stats: *
96 vertices: 3,754,257 records: 3.3GB disk size
5000 vertices: 1,332,258 records: 2.7GB disk size

* Zooming to full extent of the data: *
96 vertices: 68s / 68s /65s / 64s: 66s average
5000 vertices: 41s / 40s / 38s / 41: 40s average

* Zooming to partial extent of the data: *
96 vertices: 4,51s / 4,12s /4,30s / 4,32s
5000 vertices: 2,21s / 2,14s / 2,51s / 2,53s

Marco

Op 26-1-2022 om 10:32 schreef Marco Boeringa:
> Hi all,
>
> After Paul's remarks here on the list about the cost of TOAST in 
> relation to the optimal default for ST_Subdivide's vertex limit (96 
> according to Paul's tests), I got a bit fascinated and wanted to do 
> some testing myself.
>
> Until Paul's remark, I never gave much thought about TOAST overhead in 
> relation to my OpenStreetMap database. I just simply assumed it as a 
> fact, as it was likely big geometries needed TOASTing in many cases.
>
> However, since Paul gave a clear guideline to prevent TOASTing, I gave 
> it a try and collected some rough statistics.
>
> The data is from generalized OpenStreetMap woodland polygons, some of 
> which are absolutely huge before ST_Subdivide kicks in in the 
> generalization processing (> 100k vertices), as I amalgamate them to 
> bigger structures in the generalization processing.
>
> I now tested with two subdivide limits: the default 5000 I had been 
> using up to now, which seemed a reasonable compromise between limiting 
> the number of vertices in a polygon and the number of output polygons 
> at the same time: not to small to generate large amounts of splits, 
> but also not to big to cause issues with display times.
>
> Next, I used Paul's recommended "prevent TOAST" limit of 96 vertices. 
> I subsequently looked at display times for the entire dataset in QGIS 
> by zooming to the dataset's extent and timing the display time.
>
> The result are as follows:
>
> 96 vertices: 1,996,226 records: 1.8GB disk size: 33s / 32s /33s / 32s
> 5000 vertices: 1,332,258 records: 2.7GB disk size: 45s / 39s / 38s / 39s
>
> A few take aways:
>
> - What I never realised before, is also the disk size cost of TOAST: 
> as can be seen, the '5000' limit size, which requires many geometries 
> to be TOASTed, results in an almost 40% larger disk size for the 
> relation according to DBeaver (2.7 versus 1.8 GB for '5000' versus 
> '96' vertex limit).
>
> - Non-TOASTed records have an about 20-35% faster retrieval time, 
> although it seems that especially the initial time for TOASTed has a 
> bigger delay (45s), I guess this is because the de-TOASTed records are 
> subsequently cached. Even taking that in account, the overhead seems 
> to plateau at 20% minimum.
>
> - Counter-intuitively, displaying almost 600k (non-TOASTed) records  
> more due to much smaller ST_Subdivide vertex limit, is still 
> considerably faster than the displaying the smaller (in terms of 
> records) dataset that did get TOASTed.
>
> Does this all seem about right? And does this fit other users 
> experiences?
>
> Of course, despite the gains of non TOASTing, you still have to 
> evaluate for each dataset whether subdividing even makes sense: it is 
> usually the last step in processing, and if you actually need the 
> entire polygon for e.g. labelling purposes in QGIS, than subdividing 
> in such small pieces as to prevent TOASTing, doesn't make sense at all.
>
> Marco
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users