[postgis-devel] Code review of `Xmeans`

Darafei "Komяpa" Praliaskouski me at komzpa.net
Sat Apr 3 02:42:29 PDT 2021


Thank you so much for the review! Here are some of my other comments on
what's going on there:

XMeans is technically just an outer loop outside of main KMeans that picks
K for you by trying to split clusters in the output of KMeans by running
2-Means and checks that after split things became better, then polishes the
split with another run of KMeans. It's by definition slower than KMeans and
is a price you pay for not knowing your K.

The original paper is using BIC, Bayesian Information Criteria, to
understand that "things are good". Most of my doubts are in implementation
of that thing: I got it explained to me twice and still don't fully
understand how it works in our 2.5D spherical surface and what will be a
good choice of parameters (are we almost-2D as we're nearly flat surface of
the planet or must it be a 3D since we have x.y.z?). The thing is supposed
to measure how roundish the clusters are and prefer more roundish ones. I
would appreciate more attention to that part of the code or links to better
explanations on dimensionality of our world.

The other finding is that BIC is rather costly to calculate. I was using
Kontur Population dataset (
https://data.humdata.org/dataset/kontur-population-dataset) casted as
points and the algorithm converged in hours for me, on millions of points.
Clustering looks rather nice, but hours frighten me - still are useful for
some scenarios, it's a machine working not you picking the best K.

The useful piece was to check that other criterias can be used. max_radius
was simple to implement, converges quickly and has an easy to grasp meaning
- "no clusters larger than 500m". That's what I really needed for
walkability analysis in the past and I'm tempted to rip away the parts with
BIC scores and just add min_radius to KMeans if I don't get a chance to fix
this thing to perfection for 3.2.

Other thing is that XMeans lets you run KMeans in 2 ways: you can either
init all the clusters from the start, or just init some and let the others
be initialized by XMeans by splitting the ones you had initially. Second
way seems nice, as it's easy to parallelize (you're running lots of 2-means
splitting your clusters in 2 on separate CPUs) and code's structured to
allow that instrumentation to be easy, maybe using OpenMP comments. I've
seen some cases where the second way is faster already, since our KMeans
init is O(KN) and with high K the recursive XMeans can be lower.

Original paper also suggests that this whole thing should not be done as is
but should be done via tree lookups, then on KMeans loops computation
complexity will go even lower. That's also not implemented. I've done a
full-SQL XMeans in the past where it was done using Voronoi partition and
GiST lookups and that seemingly worked good enough too, but hard to pack
into PostGIS function.

On Sat, Apr 3, 2021 at 11:53 AM Han Wang <hanwgeek at gmail.com> wrote:

> Hi all,
> I just reviewed the code of `xmeans`(
> https://github.com/postgis/postgis/pull/592/commits) which plans to
> replace conventional kmeans. I would like to know if there is anyone to
> test its performance? I think it may costs multiple times than kmeans.
> Best regards,
> Han
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-devel

Darafei "Komяpa" Praliaskouski
OSM BY Team - http://openstreetmap.by/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-devel/attachments/20210403/dc203b9e/attachment.html>

More information about the postgis-devel mailing list