[postgis-users] ST_Subdivide vertex limit and TOAST overhead
Marco Boeringa
marco at boeringa.demon.nl
Wed Jan 26 01:32:42 PST 2022
Hi all,
After Paul's remarks here on the list about the cost of TOAST in
relation to the optimal default for ST_Subdivide's vertex limit (96
according to Paul's tests), I got a bit fascinated and wanted to do some
testing myself.
Until Paul's remark, I had never given much thought to TOAST overhead in
relation to my OpenStreetMap database. I simply accepted it as a fact of
life, since many of the big geometries were likely to need TOASTing anyway.
However, since Paul gave a clear guideline for preventing TOASTing, I gave
it a try and collected some rough statistics.
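As a rough way to see whether individual geometries are likely to end up
TOASTed, something like this can be used (just a sketch: 'woodland' and
'geom' are placeholder names, and ~2 kB is the usual TOAST threshold under
default settings):

    -- Count values whose stored size exceeds the ~2 kB TOAST threshold:
    SELECT count(*) FILTER (WHERE pg_column_size(geom) > 2000) AS toast_candidates,
           count(*) AS total
    FROM woodland;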
The data is generalized OpenStreetMap woodland polygons, some of which are
absolutely huge (> 100k vertices) before ST_Subdivide kicks in, as the
generalization processing amalgamates them into bigger structures.
I now tested with two subdivide limits. The first is the default of 5000 I
had been using up to now, which seemed a reasonable compromise between
limiting the number of vertices per polygon and limiting the number of
output polygons: not too small, which would generate large numbers of
splits, but also not too big, which would cause issues with display times.
Next, I used Paul's recommended "prevent TOAST" limit of 96 vertices. I
subsequently looked at display times for the entire dataset in QGIS by
zooming to the dataset's extent and timing how long rendering took.
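For concreteness, the subdivided tables were created along these lines (a
sketch only; table and column names are placeholders, not my actual
schema):

    -- 96-vertex variant; swap 96 for 5000 for the other test case:
    CREATE TABLE woodland_96 AS
    SELECT id AS parent_id,
           ST_Subdivide(geom, 96) AS geom
    FROM woodland_generalized;

    CREATE INDEX ON woodland_96 USING gist (geom);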
The results are as follows:
96 vertices:   1,996,226 records, 1.8 GB on disk: 33s / 32s / 33s / 32s
5000 vertices: 1,332,258 records, 2.7 GB on disk: 45s / 39s / 38s / 39s
A few takeaways:
- What I never realised before is the disk size cost of TOAST: the '5000'
limit, which requires many geometries to be TOASTed, results in an almost
40% larger relation on disk according to DBeaver (2.7 GB versus 1.8 GB for
the '5000' versus '96' vertex limit; see the size query sketched after
this list).
- Non-TOASTed records display about 20-35% faster. The initial run of the
TOASTed variant shows the biggest delay (45s), presumably because the
de-TOASTed records are cached and reused on subsequent runs. Even taking
that into account, the overhead seems to plateau at around 20% minimum.
- Counter-intuitively, displaying almost 600k more (non-TOASTed) records,
due to the much smaller ST_Subdivide vertex limit, is still considerably
faster than displaying the dataset that is smaller in record count but did
get TOASTed.
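For the disk size comparison mentioned above, something along these lines
should reproduce the numbers (a sketch using PostgreSQL's built-in size
functions; 'woodland_96' is a placeholder table name):

    -- Main heap versus total size including TOAST (and FSM/VM forks):
    SELECT pg_size_pretty(pg_relation_size('woodland_96'))  AS heap_only,
           pg_size_pretty(pg_table_size('woodland_96'))     AS heap_plus_toast,
           pg_size_pretty(pg_table_size('woodland_96')
                          - pg_relation_size('woodland_96')) AS toast_etc;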
Does this all seem about right? And does this fit other users' experiences?
Of course, despite the gains from avoiding TOAST, you still have to
evaluate for each dataset whether subdividing even makes sense: it is
usually the last step in processing, and if you actually need the entire
polygon, e.g. for labelling purposes in QGIS, then subdividing into pieces
small enough to prevent TOASTing doesn't make sense at all.
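One possible way around the labelling issue, if you keep the parent id
during subdividing as sketched earlier, is to label from the original
polygons while drawing the subdivided ones (again just a sketch with
placeholder names):

    -- Label points derived from the un-subdivided polygons:
    CREATE VIEW woodland_labels AS
    SELECT id, name, ST_PointOnSurface(geom) AS geom
    FROM woodland_generalized;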
Marco