[postgis-tickets] [SCM] PostGIS branch master updated. 3.1.0rc1-461-g45e4912

git at osgeo.org git at osgeo.org
Sun Aug 29 23:32:50 PDT 2021


This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "PostGIS".

The branch, master has been updated
       via  45e491242d9d63beb7eac0961b16439d00c93bd4 (commit)
       via  00d9d4884cb10fe55545305a7626925d2cc9a44a (commit)
       via  d5660f37c74a5f7d805541ccfe3d7109b874a8d4 (commit)
       via  ea70141749ffc51eaf1616f14d1166e493406033 (commit)
       via  b0c650563376a85a95e4309b94fb55f940b41d6a (commit)
       via  9f494a5ec66f4f55f84631c2da9073e5e87166e1 (commit)
       via  24f3366406080fb0ef9900f90ece77a234852e3d (commit)
       via  45da46285f0745df8f5fe038739091be984e8569 (commit)
       via  fe750ed9ba37fa003728d6df94a92076534128bc (commit)
       via  76f02de9e275ec960cdd83c48fdfb1f6b728c30b (commit)
       via  d7925ddffb590e2c39561c3d379f23041e572c41 (commit)
       via  2dbae13e8ee216d3fd96d7a44d32d5d083f9957a (commit)
      from  24d601b4bcf019a2ccb8e65c37a36b6be7fe5b85 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
commit 45e491242d9d63beb7eac0961b16439d00c93bd4
Merge: 24d601b 00d9d48
Author: Darafei Praliaskouski <me at komzpa.net>
Date:   Mon Aug 30 00:59:11 2021 +0300

    ST_ClusterKMeans(max_radius)
    
    Closes #4808
    Closes https://github.com/postgis/postgis/pull/592


commit 00d9d4884cb10fe55545305a7626925d2cc9a44a
Author: Darafei Praliaskouski <me at komzpa.net>
Date:   Mon Aug 30 00:50:58 2021 +0300

    NEWS for ST_ClusterKMeans(max_radius)
    
    Closes #4808
    Closes https://github.com/postgis/postgis/pull/592

diff --git a/NEWS b/NEWS
index 4b3c98b..adb304f 100644
--- a/NEWS
+++ b/NEWS
@@ -39,7 +39,7 @@ PostGIS 3.2.0
   - #4924, Faster ST_RemoveRepeatedPoints on large multipoints, O(NlogN) instead of O(N^2)
            (Aliaksandr Kalenik, Darafei Praliaskouski)
   - #4925, fix ST_DumpPoints to not overlook points (Aliaksandr Kalenik)
-  - ST_Srid(topogeometry) override, to speedup lookups (Sandro Santilli)
+  - ST_SRID(topogeometry) override, to speedup lookups (Sandro Santilli)
   - #2175, Avoid creating additional nodes when adding same closed
            line to topology (Sandro Santilli)
   - #4974, Upgrade path for address_standardizer_data_us
@@ -48,7 +48,6 @@ PostGIS 3.2.0
   - #4981, ST_StartPoint support any geometry (Aliaksandr Kalenik)
   - #4799, Include srs in GeoJSON where it exists in spatial_ref_sys.
 
-
  * New features*
   - #4923, topology.ValidateTopologyRelation (Sandro Santilli)
   - #4933, topology.GetFaceContainingPoint (Sandro Santilli)
@@ -69,9 +68,10 @@ PostGIS 3.2.0
   - New postgis.gdal_vsi_options GUC allows out-db rasters on VSI network
     services to be accessed with authentication keys, etc. (Paul Ramsey)
   - ST_DumpSegments returns a set of segments of input geometry (Aliaksandr Kalenik)
-  - #4859, ST_Point, ST_PointZ, ST_PointM, ST_PointZM, constructors 
+  - #4859, ST_Point, ST_PointZ, ST_PointM, ST_PointZM, constructors
     with SRID parameter (Paul Ramsey)
-
+  - #4808, ST_ClusterKMeans now supports max_radius argument. Use it when you're not sure what is
+    the number of clusters but you know what the size of clusters should be. (Darafei Praliaskouski)
 
 PostGIS 3.1.0
 2020/12/18

commit d5660f37c74a5f7d805541ccfe3d7109b874a8d4
Merge: ea70141 e35fbe1
Author: Darafei Praliaskouski <me at komzpa.net>
Date:   Mon Aug 30 00:43:01 2021 +0300

    Merge remote-tracking branch 'upstream/master' into ticket-4808


commit ea70141749ffc51eaf1616f14d1166e493406033
Author: Darafei Praliaskouski <me at komzpa.net>
Date:   Mon Aug 30 00:35:14 2021 +0300

    KMeans max_radius: doc, tests, upgrade, examples.

diff --git a/doc/html/images/st_clusterkmeans03.png b/doc/html/images/st_clusterkmeans03.png
new file mode 100644
index 0000000..d27569d
Binary files /dev/null and b/doc/html/images/st_clusterkmeans03.png differ
diff --git a/doc/reference_cluster.xml b/doc/reference_cluster.xml
index 6d0a693..ce2508a 100644
--- a/doc/reference_cluster.xml
+++ b/doc/reference_cluster.xml
@@ -223,6 +223,9 @@ GEOMETRYCOLLECTION(LINESTRING(6 6,7 7))
 
 			<paramdef><type>integer </type>
 			<parameter>number_of_clusters</parameter></paramdef>
+
+            <paramdef><type>float </type>
+			<parameter>max_radius</parameter></paramdef>
 		  </funcprototype>
 		</funcsynopsis>
 	  </refsynopsisdiv>
@@ -230,12 +233,15 @@ GEOMETRYCOLLECTION(LINESTRING(6 6,7 7))
 	  <refsection>
       <title>Description</title>
 
-      <para>Returns 2D distance based
-        <ulink url="https://en.wikipedia.org/wiki/K-means_clustering">K-means</ulink>
+      <para>Returns <ulink url="https://en.wikipedia.org/wiki/K-means_clustering">K-means</ulink>
         cluster number for each input geometry. The distance used for clustering is the
         distance between the centroids for 2D geometries, and distance between bounding box centers for 3D geometries.
         For POINT inputs, M coordinate will be treated as weight of input and has to be larger than 0.
       </para>
+      <para><varname>max_radius</varname>, if set, will cause ST_ClusterKMeans to generate more clusters than
+        <varname>k</varname> ensuring that no cluster in output has radius larger than <varname>max_radius</varname>.
+        This is useful in reachability analysis. </para>
+      <para>Enhanced: 3.2.0 Support for <varname>max_radius</varname></para>
       <para>Enhanced: 3.1.0 Support for 3D geometries and weights</para>
       <para>Availability: 2.3.0</para>
     </refsection>
@@ -275,9 +281,8 @@ FROM
 						</mediaobject>
 					  </informalfigure>
 						<programlisting>SELECT ST_ClusterKMeans(geom, 5) OVER() AS cid, parcel_id, geom
-FROM parcels;
--- result
- cid | parcel_id |   geom
+FROM parcels;</programlisting>
+<screen> cid | parcel_id |   geom
 -----+-----------+---------------
    0 | 001       | 0103000000...
    0 | 002       | 0103000000...
@@ -286,18 +291,16 @@ FROM parcels;
    1 | 005       | 0103000000...
    2 | 006       | 0103000000...
    2 | 007       | 0103000000...
-(7 rows)</programlisting>
+(7 rows)</screen>
 					</para></entry>
 				  </row>
 			</tbody>
 			</tgroup>
 		</informaltable>
-
-		    <programlisting> -- Partitioning parcel clusters by type
-SELECT ST_ClusterKMeans(geom,3) over (PARTITION BY type) AS cid, parcel_id, type
-FROM parcels;
--- result
- cid | parcel_id |    type
+<para>Partitioning parcel clusters by type:</para>
+		    <programlisting>SELECT ST_ClusterKMeans(geom, 3) over (PARTITION BY type) AS cid, parcel_id, type
+FROM parcels;</programlisting>
+<screen> cid | parcel_id |    type
 -----+-----------+-------------
    1 | 005       | commercial
    1 | 003       | commercial
@@ -306,28 +309,34 @@ FROM parcels;
    1 | 004       | residential
    0 | 002       | residential
    2 | 006       | residential
-(7 rows)</programlisting>
-
-
-		    <programlisting> -- Clustering points around antimeridian can be done in 3D XYZ CRS, EPSG:4978:
-
-SELECT ST_ClusterKMeans(ST_Transform(ST_Force3D(geom), 4978), 3) over () AS cid, parcel_id, type
-FROM parcels;
--- result
-┌─────┬───────────┬─────────────┐
-│ cid │ parcel_id │    type     │
-├─────┼───────────┼─────────────┤
-│   1 │ 001       │ commercial  │
-│   2 │ 002       │ residential │
-│   0 │ 003       │ commercial  │
-│   1 │ 004       │ residential │
-│   0 │ 005       │ commercial  │
-│   2 │ 006       │ residential │
-│   0 │ 007       │ commercial  │
-└─────┴───────────┴─────────────┘
-(7 rows)
-</programlisting>
-
+(7 rows)</screen>
+
+
+<para>Clustering preaggregated planetary scale data like population dataset may require using 3D clusering and weighting.
+Let's try to idenify 20-ish meta-regions based on <ulink url="https://data.humdata.org/dataset/kontur-population-dataset">Kontur Population</ulink> that will not span more than 3000 km from their center:</para>
+		    <programlisting>create table kontur_population_3000km_clusters as
+select
+    geom,
+    ST_ClusterKMeans(
+        ST_Force4D(
+            ST_Transform(ST_Force3D(geom), 4978), -- cluster in 3D XYZ CRS
+            mvalue := population -- set clustering to be weighed by population
+        ),
+        20,                      -- aim to generate at least 20 clusters
+        max_radius := 3000000    -- but generate more to make each under 3000 km radius
+    ) over () as cid
+from
+    kontur_population;
+    </programlisting>
+    <para><informalfigure>
+    <mediaobject>
+        <imageobject>
+        <imagedata fileref="images/st_clusterkmeans03.png" />
+        </imageobject>
+        <caption><para>World population clustered to above specs: 46 resulting clusters. Greenland is one cluster, there are island clusters that span across antimeridian, clusters are centered at well-populated regions (New York, Moscow), and edges follow Earth's curvature.</para></caption>
+    </mediaobject>
+    </informalfigure>
+    </para>
 
     </refsection>
 
@@ -336,7 +345,10 @@ FROM parcels;
           <para>
               <xref linkend="ST_ClusterDBSCAN"/>,
               <xref linkend="ST_ClusterIntersecting" />,
-              <xref linkend="ST_ClusterWithin" />, <xref linkend="ST_Subdivide" />
+              <xref linkend="ST_ClusterWithin" />,
+              <xref linkend="ST_Subdivide" />,
+              <xref linkend="ST_Force_3D" />,
+              <xref linkend="ST_Force_4D" />,
           </para>
 	  </refsection>
 	</refentry>
diff --git a/liblwgeom/lwkmeans.c b/liblwgeom/lwkmeans.c
index 7a163c6..87863d0 100644
--- a/liblwgeom/lwkmeans.c
+++ b/liblwgeom/lwkmeans.c
@@ -65,20 +65,13 @@ improve_structure(POINT4D *objs,
 	uint32_t *temp_clusters = lwalloc(sizeof(uint32_t) * n);
 	double *temp_radii = lwalloc(sizeof(double) * n);
 	POINT4D *temp_centers = lwalloc(sizeof(POINT4D) * n);
-	double overshot = 0;
 
 	uint32_t new_k = k;
 
 	for (uint32_t cluster = first_cluster_to_split; cluster < k; cluster++)
 	{
-		if (overshot < 0)
-			overshot = 0;
-		if (radii[cluster] <= (max_radius_sq + overshot))
-		{
-			if (radii[cluster] > max_radius_sq)
-				overshot -= radii[cluster] - max_radius_sq;
+		if (radii[cluster] <= max_radius_sq)
 			continue;
-		}
 
 		/* copy cluster alone */
 		uint32_t cluster_size = 0;
@@ -102,8 +95,6 @@ improve_structure(POINT4D *objs,
 		centers[new_k] = temp_centers[1];
 		radii[cluster] = temp_radii[0];
 		radii[new_k] = temp_radii[1];
-
-		overshot += max_radius_sq - temp_radii[0] - temp_radii[1];
 		new_k++;
 	}
 	lwfree(temp_centers);
@@ -189,8 +180,13 @@ kmeans_init(POINT4D *objs, uint32_t n, POINT4D *centers, uint32_t k)
 	double max_dst = -1, current_distance;
 	double dst_p1, dst_p2;
 
-	/* k=0, k=1: "clustering" is just input validation */
-	assert(k > 1);
+	/* k=0, k=1: any point will do */
+	assert(n > 0);
+	if (k < 2)
+	{
+		centers[0] = objs[0];
+		return;
+	}
 
 	/* k >= 2: find two distant points greedily */
 	for (i = 1; i < n; i++)
@@ -298,6 +294,9 @@ kmeans(POINT4D *objs,
 	uint32_t cur_k = min_k;
 
 	kmeans_init(objs, n, centers, cur_k);
+	/* One iteration of kmeans needs to happen without shortcuts to fully initialize structures */
+	update_r(objs, clusters, n, centers, radii, cur_k);
+	update_means(objs, clusters, n, centers, cur_k);
 	for (uint32_t t = 0; t < KMEANS_MAX_ITERATIONS; t++)
 	{
 		/* Standard KMeans loop */
diff --git a/postgis/postgis_before_upgrade.sql b/postgis/postgis_before_upgrade.sql
index 49c566a..9c1a068 100644
--- a/postgis/postgis_before_upgrade.sql
+++ b/postgis/postgis_before_upgrade.sql
@@ -222,16 +222,7 @@ DROP FUNCTION IF EXISTS st_subdivide(geometry, integer); -- replaced in 3.1.0 by
 DROP FUNCTION IF EXISTS st_difference(geometry, geometry); -- replaced in 3.1.0 by 3 args version
 DROP FUNCTION IF EXISTS st_symdifference(geometry, geometry); -- replaced in 3.1.0 by 3 args version
 DROP FUNCTION IF EXISTS st_unaryunion(geometry); -- replaced in 3.1.0 by 3 args version
-
--- replaced in 3.2.0 by 3 args version
-SELECT _postgis_drop_function_if_needed
-    (
-    'st_clusterkmeans',
-    'geom geometry, k integer'
-    );
-
-
-
+DROP FUNCTION IF EXISTS ST_ClusterKMeans(geometry, integer); -- replaced in 3.2.0 by 3 args version
 
 -- geometry_columns changed parameter types so we verify if it needs to be dropped
 -- We check the catalog to see if the view (geometry_columns) has a column
diff --git a/regress/core/cluster.sql b/regress/core/cluster.sql
index 7358222..7d2c618 100644
--- a/regress/core/cluster.sql
+++ b/regress/core/cluster.sql
@@ -72,3 +72,7 @@ select '#4071', count(distinct a), count(distinct b), count(distinct c)  from
 	ST_ClusterKMeans(geom, 2) over () b,
 	ST_ClusterKMeans(geom, 3) over () c
 from (values (null::geometry), ('POINT(1 1)'), ('POINT EMPTY'), ('POINT(0 0)'), ('POINT(4 4)')) as g (geom)) z;
+
+
+select 'weight-and-limit-support-1', count(distinct cid) from (select ST_ClusterKMeans(ST_Force2D(geom), 1, 1) over () as cid from (values ('POINT(0 0 0 1)'::geometry), ('POINT(1 0 0 1)'), ('POINT(2 0 0 10000)')) g(geom)) kmeans;
+select 'weight-and-limit-support-2', count(distinct cid) from (select ST_ClusterKMeans(geom, 1, 1) over () as cid from (values ('POINT(0 0 0 1)'::geometry), ('POINT(1 0 0 1)'), ('POINT(2 0 0 10000)')) g(geom)) kmeans;
diff --git a/regress/core/cluster_expected b/regress/core/cluster_expected
index 287125b..46390c9 100644
--- a/regress/core/cluster_expected
+++ b/regress/core/cluster_expected
@@ -43,3 +43,5 @@ NOTICE:  lwgeom_cluster_kmeans: number of non-empty geometries (0) is less than
 NOTICE:  kmeans_init: there are at least 3 duplicate inputs, number of output clusters may be less than you requested
 3d_support-3|1
 #4071|2|3|4
+weight-and-limit-support-1|1
+weight-and-limit-support-2|2

commit b0c650563376a85a95e4309b94fb55f940b41d6a
Author: Darafei Praliaskouski <me at komzpa.net>
Date:   Sat Aug 7 12:09:58 2021 +0300

    Remove unused variable, fix signedness and memory leak

diff --git a/liblwgeom/lwkmeans.c b/liblwgeom/lwkmeans.c
index dfadd35..7a163c6 100644
--- a/liblwgeom/lwkmeans.c
+++ b/liblwgeom/lwkmeans.c
@@ -19,8 +19,13 @@
  */
 #define KMEANS_MAX_ITERATIONS 1000
 
-static uint32_t
-kmeans(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, double *radii, uint32_t min_k, double max_radius);
+static uint32_t kmeans(POINT4D *objs,
+		       uint32_t *clusters,
+		       uint32_t n,
+		       POINT4D *centers,
+		       double *radii,
+		       uint32_t min_k,
+		       double max_radius);
 
 inline static double
 distance3d_sqr_pt4d_pt4d(const POINT4D *p1, const POINT4D *p2)
@@ -35,7 +40,7 @@ distance3d_sqr_pt4d_pt4d(const POINT4D *p1, const POINT4D *p2)
 /* Split the clusters that need to be split */
 static uint32_t
 improve_structure(POINT4D *objs,
-		  int *clusters,
+		  uint32_t *clusters,
 		  uint32_t n,
 		  POINT4D *centers,
 		  double *radii,
@@ -49,22 +54,22 @@ improve_structure(POINT4D *objs,
 	double max_radius_sq = max_radius * max_radius;
 
 	/* Do we have the big clusters to split at all? */
-	int first_cluster_to_split = 0;
-	for (; first_cluster_to_split < (int)k; first_cluster_to_split++)
+	uint32_t first_cluster_to_split = 0;
+	for (; first_cluster_to_split < k; first_cluster_to_split++)
 		if (radii[first_cluster_to_split] > max_radius_sq)
 			break;
 	if (first_cluster_to_split == k)
 		return k;
 
 	POINT4D *temp_objs = lwalloc(sizeof(POINT4D) * n);
-	int *temp_clusters = lwalloc(sizeof(int) * n);
+	uint32_t *temp_clusters = lwalloc(sizeof(uint32_t) * n);
 	double *temp_radii = lwalloc(sizeof(double) * n);
 	POINT4D *temp_centers = lwalloc(sizeof(POINT4D) * n);
 	double overshot = 0;
 
 	uint32_t new_k = k;
 
-	for (int cluster = first_cluster_to_split; cluster < (int)k; cluster++)
+	for (uint32_t cluster = first_cluster_to_split; cluster < k; cluster++)
 	{
 		if (overshot < 0)
 			overshot = 0;
@@ -76,7 +81,7 @@ improve_structure(POINT4D *objs,
 		}
 
 		/* copy cluster alone */
-		int cluster_size = 0;
+		uint32_t cluster_size = 0;
 		for (uint32_t i = 0; i < n; i++)
 			if (clusters[i] == cluster)
 				temp_objs[cluster_size++] = objs[i];
@@ -110,10 +115,9 @@ improve_structure(POINT4D *objs,
 
 /* Refresh mapping of point to closest cluster */
 static uint8_t
-update_r(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, double *radii, uint32_t k)
+update_r(POINT4D *objs, uint32_t *clusters, uint32_t n, POINT4D *centers, double *radii, uint32_t k)
 {
 	uint8_t converged = LW_TRUE;
-	double max_radius = 0;
 	memset(radii, 0, sizeof(double) * k);
 
 	for (uint32_t i = 0; i < n; i++)
@@ -122,10 +126,10 @@ update_r(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, double *rad
 
 		/* Initialize with distance to first cluster */
 		double curr_distance = distance3d_sqr_pt4d_pt4d(&obj, &centers[0]);
-		int curr_cluster = 0;
+		uint32_t curr_cluster = 0;
 
 		/* Check all other cluster centers and find the nearest */
-		for (int cluster = 1; cluster < (int)k; cluster++)
+		for (uint32_t cluster = 1; cluster < k; cluster++)
 		{
 			double distance = distance3d_sqr_pt4d_pt4d(&obj, &centers[cluster]);
 			if (distance < curr_distance)
@@ -150,13 +154,13 @@ update_r(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, double *rad
 
 /* Refresh cluster centroids based on all of their objects */
 static void
-update_means(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_t k)
+update_means(POINT4D *objs, uint32_t *clusters, uint32_t n, POINT4D *centers, uint32_t k)
 {
 	memset(centers, 0, sizeof(POINT4D) * k);
 	/* calculate weighted sum */
 	for (uint32_t i = 0; i < n; i++)
 	{
-		int cluster = clusters[i];
+		uint32_t cluster = clusters[i];
 		centers[cluster].x += objs[i].x * objs[i].m;
 		centers[cluster].y += objs[i].y * objs[i].m;
 		centers[cluster].z += objs[i].z * objs[i].m;
@@ -282,7 +286,13 @@ cluster_centroid_cmp(const void *a, const void *b)
 }
 
 static uint32_t
-kmeans(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, double *radii, uint32_t min_k, double max_radius)
+kmeans(POINT4D *objs,
+       uint32_t *clusters,
+       uint32_t n,
+       POINT4D *centers,
+       double *radii,
+       uint32_t min_k,
+       double max_radius)
 {
 	uint8_t converged = LW_FALSE;
 	uint32_t cur_k = min_k;
@@ -305,7 +315,7 @@ kmeans(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, double *radii
 		/* XMeans-inspired improve_structure pass to split clusters bigger than limit into 2 */
 		if (max_radius)
 		{
-			for (int i = 0; i < cur_k; i++)
+			for (uint32_t i = 0; i < cur_k; i++)
 				centers[i].m = radii[i];
 			qsort(centers, cur_k, sizeof(POINT4D), cluster_centroid_cmp);
 			update_r(objs, clusters, n, centers, radii, cur_k);
@@ -430,8 +440,8 @@ lwgeom_cluster_kmeans(const LWGEOM **geoms, uint32_t n, uint32_t k, double max_r
 
 	if (k > 1 || max_radius > 0)
 	{
-		int *clusters_dense = lwalloc(sizeof(int) * num_non_empty);
-		memset(clusters_dense, 0, sizeof(int) * num_non_empty);
+		uint32_t *clusters_dense = lwalloc(sizeof(int) * num_non_empty);
+		memset(clusters_dense, 0, sizeof(uint32_t) * num_non_empty);
 		converged = kmeans(objs, clusters_dense, num_non_empty, centers, radii, k, max_radius);
 
 		if (converged)
@@ -439,7 +449,7 @@ lwgeom_cluster_kmeans(const LWGEOM **geoms, uint32_t n, uint32_t k, double max_r
 			uint32_t d = 0;
 			for (uint32_t i = 0; i < n; i++)
 				if (geom_valid[i])
-					clusters[i] = clusters_dense[d++];
+					clusters[i] = (int)clusters_dense[d++];
 				else
 					clusters[i] = KMEANS_NULL_CLUSTER;
 		}
@@ -469,6 +479,7 @@ lwgeom_cluster_kmeans(const LWGEOM **geoms, uint32_t n, uint32_t k, double max_r
 	lwfree(objs);
 	lwfree(centers);
 	lwfree(geom_valid);
+	lwfree(radii);
 
 	/* Good result */
 	if (converged)

commit 9f494a5ec66f4f55f84631c2da9073e5e87166e1
Author: Darafei Praliaskouski <me at komzpa.net>
Date:   Mon Jul 12 22:36:21 2021 +0300

    xmeans: remove debug output

diff --git a/liblwgeom/lwkmeans.c b/liblwgeom/lwkmeans.c
index 67ebb9a..dfadd35 100644
--- a/liblwgeom/lwkmeans.c
+++ b/liblwgeom/lwkmeans.c
@@ -71,28 +71,9 @@ improve_structure(POINT4D *objs,
 		if (radii[cluster] <= (max_radius_sq + overshot))
 		{
 			if (radii[cluster] > max_radius_sq)
-			{
-				lwnotice("%s: overshot %f, not splitting %d of radius2 %f into %d",
-					 __func__,
-					 overshot,
-					 cluster,
-					 radii[cluster],
-					 new_k);
 				overshot -= radii[cluster] - max_radius_sq;
-			}
 			continue;
 		}
-		else
-		{
-			//		overshot += max_radius_sq - radii[cluster]/2;
-		}
-
-		lwnotice("%s: overshot %f, splitting %d of radius2 %f into %d",
-			 __func__,
-			 overshot,
-			 cluster,
-			 radii[cluster],
-			 new_k);
 
 		/* copy cluster alone */
 		int cluster_size = 0;

commit 24f3366406080fb0ef9900f90ece77a234852e3d
Author: Darafei Praliaskouski <me at komzpa.net>
Date:   Mon Jul 12 22:35:45 2021 +0300

    XMeans: go more slowly towards the end to generate bigger clusters

diff --git a/liblwgeom/lwkmeans.c b/liblwgeom/lwkmeans.c
index 3614530..67ebb9a 100644
--- a/liblwgeom/lwkmeans.c
+++ b/liblwgeom/lwkmeans.c
@@ -40,57 +40,88 @@ improve_structure(POINT4D *objs,
 		  POINT4D *centers,
 		  double *radii,
 		  uint32_t k,
-		  uint32_t min_k,
 		  double max_radius)
 {
+	/* Input check: radius limit should be measurable */
+	if (max_radius <= 0)
+		return k;
+
+	double max_radius_sq = max_radius * max_radius;
+
+	/* Do we have the big clusters to split at all? */
+	int first_cluster_to_split = 0;
+	for (; first_cluster_to_split < (int)k; first_cluster_to_split++)
+		if (radii[first_cluster_to_split] > max_radius_sq)
+			break;
+	if (first_cluster_to_split == k)
+		return k;
+
 	POINT4D *temp_objs = lwalloc(sizeof(POINT4D) * n);
 	int *temp_clusters = lwalloc(sizeof(int) * n);
 	double *temp_radii = lwalloc(sizeof(double) * n);
 	POINT4D *temp_centers = lwalloc(sizeof(POINT4D) * n);
+	double overshot = 0;
 
 	uint32_t new_k = k;
 
-	for (int cluster = 0; cluster < (int)k; cluster++)
+	for (int cluster = first_cluster_to_split; cluster < (int)k; cluster++)
 	{
-		double radius = radii[cluster];
-		uint8_t force_split = (new_k < min_k) || (max_radius > 0 && radius > (max_radius * max_radius));
-		if (!force_split)
+		if (overshot < 0)
+			overshot = 0;
+		if (radii[cluster] <= (max_radius_sq + overshot))
+		{
+			if (radii[cluster] > max_radius_sq)
+			{
+				lwnotice("%s: overshot %f, not splitting %d of radius2 %f into %d",
+					 __func__,
+					 overshot,
+					 cluster,
+					 radii[cluster],
+					 new_k);
+				overshot -= radii[cluster] - max_radius_sq;
+			}
 			continue;
+		}
+		else
+		{
+			//		overshot += max_radius_sq - radii[cluster]/2;
+		}
+
+		lwnotice("%s: overshot %f, splitting %d of radius2 %f into %d",
+			 __func__,
+			 overshot,
+			 cluster,
+			 radii[cluster],
+			 new_k);
 
 		/* copy cluster alone */
 		int cluster_size = 0;
 		for (uint32_t i = 0; i < n; i++)
 			if (clusters[i] == cluster)
 				temp_objs[cluster_size++] = objs[i];
-
-		/* clustering intention is to have less clusters than points.
-		 * split only if there are at least 3 points or absolutely needed to get enough clusters. */
-		if (cluster_size <= 2 || (force_split && cluster_size <= 1))
+		if (cluster_size <= 1)
 			continue;
-		memset(temp_clusters, 0, sizeof(int) * cluster_size);
-		memset(temp_radii, 0, sizeof(double) * 2);
-		temp_centers[0] = centers[cluster];
 
-		uint32_t clusters_to_add =
-		    kmeans(temp_objs, temp_clusters, (uint32_t)cluster_size, temp_centers, temp_radii, 2, max_radius);
+		/* run 2-means on the cluster */
+		kmeans(temp_objs, temp_clusters, (uint32_t)cluster_size, temp_centers, temp_radii, 2, 0);
+
 		/* replace cluster with split */
 		uint32_t d = 0;
 		for (uint32_t i = 0; i < n; i++)
-		{
 			if (clusters[i] == cluster)
 				if (temp_clusters[d++])
-					clusters[i] = new_k + clusters_to_add - 1;
-		}
+					clusters[i] = new_k;
+
 		centers[cluster] = temp_centers[0];
+		centers[new_k] = temp_centers[1];
 		radii[cluster] = temp_radii[0];
-		for (int t = 1; t < clusters_to_add; t++)
-		{
-			centers[new_k + t - 1] = temp_centers[t];
-			radii[new_k + t - 1] = temp_radii[t];
-		}
-		new_k += clusters_to_add - 1;
+		radii[new_k] = temp_radii[1];
+
+		overshot += max_radius_sq - temp_radii[0] - temp_radii[1];
+		new_k++;
 	}
 	lwfree(temp_centers);
+	lwfree(temp_radii);
 	lwfree(temp_clusters);
 	lwfree(temp_objs);
 	return new_k;
@@ -101,6 +132,8 @@ static uint8_t
 update_r(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, double *radii, uint32_t k)
 {
 	uint8_t converged = LW_TRUE;
+	double max_radius = 0;
+	memset(radii, 0, sizeof(double) * k);
 
 	for (uint32_t i = 0; i < n; i++)
 	{
@@ -259,6 +292,14 @@ kmeans_init(POINT4D *objs, uint32_t n, POINT4D *centers, uint32_t k)
 	}
 }
 
+static int
+cluster_centroid_cmp(const void *a, const void *b)
+{
+	POINT4D *p1 = (POINT4D *)a;
+	POINT4D *p2 = (POINT4D *)b;
+	return p1->m - p2->m;
+}
+
 static uint32_t
 kmeans(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, double *radii, uint32_t min_k, double max_radius)
 {
@@ -272,19 +313,23 @@ kmeans(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, double *radii
 		for (uint32_t i = 0; i < KMEANS_MAX_ITERATIONS; i++)
 		{
 			LW_ON_INTERRUPT(break);
-			memset(radii, sizeof(double), 0);
 			converged = update_r(objs, clusters, n, centers, radii, cur_k);
 			if (converged)
 				break;
 			update_means(objs, clusters, n, centers, cur_k);
 		}
-		if (!converged)
+		if (!converged || !max_radius)
 			break;
 
 		/* XMeans-inspired improve_structure pass to split clusters bigger than limit into 2 */
 		if (max_radius)
 		{
-			uint32_t new_k = improve_structure(objs, clusters, n, centers, radii, cur_k, min_k, max_radius);
+			for (int i = 0; i < cur_k; i++)
+				centers[i].m = radii[i];
+			qsort(centers, cur_k, sizeof(POINT4D), cluster_centroid_cmp);
+			update_r(objs, clusters, n, centers, radii, cur_k);
+
+			uint32_t new_k = improve_structure(objs, clusters, n, centers, radii, cur_k, max_radius);
 			if (new_k == cur_k)
 				break;
 			cur_k = new_k;
diff --git a/postgis/postgis_before_upgrade.sql b/postgis/postgis_before_upgrade.sql
index dcd69aa..49c566a 100644
--- a/postgis/postgis_before_upgrade.sql
+++ b/postgis/postgis_before_upgrade.sql
@@ -223,6 +223,16 @@ DROP FUNCTION IF EXISTS st_difference(geometry, geometry); -- replaced in 3.1.0
 DROP FUNCTION IF EXISTS st_symdifference(geometry, geometry); -- replaced in 3.1.0 by 3 args version
 DROP FUNCTION IF EXISTS st_unaryunion(geometry); -- replaced in 3.1.0 by 3 args version
 
+-- replaced in 3.2.0 by 3 args version
+SELECT _postgis_drop_function_if_needed
+    (
+    'st_clusterkmeans',
+    'geom geometry, k integer'
+    );
+
+
+
+
 -- geometry_columns changed parameter types so we verify if it needs to be dropped
 -- We check the catalog to see if the view (geometry_columns) has a column
 -- with name `f_table_schema` and type `character varying(256)` as it was

commit 45da46285f0745df8f5fe038739091be984e8569
Author: Darafei Praliaskouski <me at komzpa.net>
Date:   Sun Jun 6 18:03:16 2021 +0300

    Add KMeans comments

diff --git a/liblwgeom/lwkmeans.c b/liblwgeom/lwkmeans.c
index ca1f725..3614530 100644
--- a/liblwgeom/lwkmeans.c
+++ b/liblwgeom/lwkmeans.c
@@ -32,6 +32,7 @@ distance3d_sqr_pt4d_pt4d(const POINT4D *p1, const POINT4D *p2)
 	return hside * hside + vside * vside + zside * zside;
 }
 
+/* Split the clusters that need to be split */
 static uint32_t
 improve_structure(POINT4D *objs,
 		  int *clusters,
@@ -95,6 +96,7 @@ improve_structure(POINT4D *objs,
 	return new_k;
 }
 
+/* Refresh mapping of point to closest cluster */
 static uint8_t
 update_r(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, double *radii, uint32_t k)
 {
@@ -132,10 +134,12 @@ update_r(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, double *rad
 	return converged;
 }
 
+/* Refresh cluster centroids based on all of their objects */
 static void
 update_means(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_t k)
 {
 	memset(centers, 0, sizeof(POINT4D) * k);
+	/* calculate weighted sum */
 	for (uint32_t i = 0; i < n; i++)
 	{
 		int cluster = clusters[i];
@@ -144,6 +148,7 @@ update_means(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_
 		centers[cluster].z += objs[i].z * objs[i].m;
 		centers[cluster].m += objs[i].m;
 	}
+	/* divide by weight to get average */
 	for (uint32_t i = 0; i < k; i++)
 	{
 		if (centers[i].m)
@@ -155,6 +160,7 @@ update_means(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_
 	}
 }
 
+/* Assign initial clusters centroids heuristically */
 static void
 kmeans_init(POINT4D *objs, uint32_t n, POINT4D *centers, uint32_t k)
 {
@@ -257,13 +263,12 @@ static uint32_t
 kmeans(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, double *radii, uint32_t min_k, double max_radius)
 {
 	uint8_t converged = LW_FALSE;
-	assert(max_k >= 2);
 	uint32_t cur_k = min_k;
 
 	kmeans_init(objs, n, centers, cur_k);
-
 	for (uint32_t t = 0; t < KMEANS_MAX_ITERATIONS; t++)
 	{
+		/* Standard KMeans loop */
 		for (uint32_t i = 0; i < KMEANS_MAX_ITERATIONS; i++)
 		{
 			LW_ON_INTERRUPT(break);
@@ -275,6 +280,8 @@ kmeans(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, double *radii
 		}
 		if (!converged)
 			break;
+
+		/* XMeans-inspired improve_structure pass to split clusters bigger than limit into 2 */
 		if (max_radius)
 		{
 			uint32_t new_k = improve_structure(objs, clusters, n, centers, radii, cur_k, min_k, max_radius);
@@ -358,7 +365,7 @@ lwgeom_cluster_kmeans(const LWGEOM **geoms, uint32_t n, uint32_t k, double max_r
 		}
 		else if (!lwgeom_has_z(geom))
 		{
-			/* For 2D, we can take a centroid*/
+			/* For 2D, we can take a centroid */
 			LWGEOM *centroid = lwgeom_centroid(geom);
 			if (!centroid)
 				continue;

commit fe750ed9ba37fa003728d6df94a92076534128bc
Author: Darafei Praliaskouski <me at komzpa.net>
Date:   Sun Jun 6 16:04:10 2021 +0300

    XMeans: remove the machinery and reduce to just max_radius argument

diff --git a/liblwgeom/cunit/cu_algorithm.c b/liblwgeom/cunit/cu_algorithm.c
index e6c16b3..67f3822 100644
--- a/liblwgeom/cunit/cu_algorithm.c
+++ b/liblwgeom/cunit/cu_algorithm.c
@@ -1743,7 +1743,7 @@ static void test_kmeans(void)
 		}
 	}
 
-	r = lwgeom_cluster_kmeans((const LWGEOM **)geoms, N, num_clusters);
+	r = lwgeom_cluster_kmeans((const LWGEOM **)geoms, N, num_clusters, 0.0);
 
 	// for (i = 0; i < k; i++)
 	// {
diff --git a/liblwgeom/liblwgeom.h.in b/liblwgeom/liblwgeom.h.in
index bbacc19..79f97e0 100644
--- a/liblwgeom/liblwgeom.h.in
+++ b/liblwgeom/liblwgeom.h.in
@@ -2534,8 +2534,9 @@ LWGEOM* lwgeom_voronoi_diagram(const LWGEOM* g, const GBOX* env, double toleranc
 * @param geoms the input array of LWGEOM pointers
 * @param ngeoms the number of elements in the array
 * @param k the number of clusters to calculate
+* @param max_radius maxmimum radius of cluster before it's split
 */
-int * lwgeom_cluster_kmeans(const LWGEOM **geoms, uint32_t ngeoms, uint32_t k);
+int * lwgeom_cluster_kmeans(const LWGEOM **geoms, uint32_t ngeoms, uint32_t k, double max_radius);
 
 #include "lwinline.h"
 
diff --git a/liblwgeom/lwkmeans.c b/liblwgeom/lwkmeans.c
index 304e31f..ca1f725 100644
--- a/liblwgeom/lwkmeans.c
+++ b/liblwgeom/lwkmeans.c
@@ -1,6 +1,6 @@
 /*-------------------------------------------------------------------------
  *
- * Copyright (c) 2018-2020, Darafei Praliaskouski <me at komzpa.net>
+ * Copyright (c) 2018-2021, Darafei Praliaskouski <me at komzpa.net>
  * Copyright (c) 2016, Paul Ramsey <pramsey at cleverelephant.ca>
  *
  *------------------------------------------------------------------------*/
@@ -19,16 +19,8 @@
  */
 #define KMEANS_MAX_ITERATIONS 1000
 
-static uint32_t kmeans(POINT4D *objs,
-		       int *clusters,
-		       uint32_t n,
-		       POINT4D *centers,
-		       double *radii,
-		       uint32_t min_k,
-		       uint32_t max_k,
-		       double max_radius,
-		       double max_weight,
-		       uint8_t we_had_a_split);
+static uint32_t
+kmeans(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, double *radii, uint32_t min_k, double max_radius);
 
 inline static double
 distance3d_sqr_pt4d_pt4d(const POINT4D *p1, const POINT4D *p2)
@@ -40,52 +32,6 @@ distance3d_sqr_pt4d_pt4d(const POINT4D *p1, const POINT4D *p2)
 	return hside * hside + vside * vside + zside * zside;
 }
 
-static double
-bayesian_information_criteria(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_t k)
-{
-	if (n < k)
-		return INFINITY;
-
-	/*	double total_weight = 0.0;
-		for (uint32_t i = 0; i < n; i++)
-			total_weight += objs[i].m;
-
-		if (total_weight < k)
-			return INFINITY;*/
-
-	/* estimation of the noise variance in the data set */
-	double sigma_sqrt = 0.0;
-	for (uint32_t i = 0; i < n; i++)
-		sigma_sqrt += distance3d_sqr_pt4d_pt4d(&objs[i], &centers[clusters[i]]);
-	sigma_sqrt /= (n - k);
-
-	/* in geography everything is flat */
-	double dimension = 2;
-	double p = (k - 1) + dimension * k + 1;
-
-	/* in case of the same points, sigma_sqrt can be zero */
-	double sigma_multiplier = 0.0;
-	if (sigma_sqrt <= 0.0)
-		sigma_multiplier = -INFINITY;
-	else
-		sigma_multiplier = dimension * 0.5 * log(sigma_sqrt);
-
-	/* splitting criterion */
-	double score = 0;
-	for (int cluster = 0; cluster < (int)k; cluster++)
-	{
-		double weight = centers[cluster].m;
-		/*double L = weight * log(weight) - weight * log(total_weight) - weight * 0.5 * log(2.0 * M_PI) - weight
-		 * * sigma_multiplier - (weight - k) * 0.5; */
-		double L = -weight * 0.5 * log(2.0 * M_PI) - weight * sigma_multiplier - (weight - k) / 2 +
-			   weight * log(weight) - weight * log((double)n);
-
-		/* BIC calculation */
-		score += L - p * 0.5 * log((double)n);
-	}
-	return score;
-}
-
 static uint32_t
 improve_structure(POINT4D *objs,
 		  int *clusters,
@@ -94,10 +40,7 @@ improve_structure(POINT4D *objs,
 		  double *radii,
 		  uint32_t k,
 		  uint32_t min_k,
-		  uint32_t max_k,
-		  double max_radius,
-		  double max_weight,
-		  uint8_t we_had_a_split)
+		  double max_radius)
 {
 	POINT4D *temp_objs = lwalloc(sizeof(POINT4D) * n);
 	int *temp_clusters = lwalloc(sizeof(int) * n);
@@ -108,19 +51,8 @@ improve_structure(POINT4D *objs,
 
 	for (int cluster = 0; cluster < (int)k; cluster++)
 	{
-		double mass = centers[cluster].m;
 		double radius = radii[cluster];
-		double initial_bic = 0, split_bic = 0;
-		// uint8_t force_split = (new_k < min_k && !we_had_a_split) || (max_weight>0 && mass > max_weight && !
-		// we_had_a_split) ||(max_weight>0 && mass > sqrt(2)*max_weight &&  we_had_a_split) || (!we_had_a_split
-		// && max_radius>0 && radius > (max_radius*max_radius)) ||(we_had_a_split && max_radius>0 && radius >
-		// (2*max_radius*max_radius)) ;
-		uint8_t force_split = (new_k < min_k && !we_had_a_split) || (max_weight > 0 && mass > max_weight) ||
-				      (max_radius > 0 && radius > (max_radius * max_radius));
-		if ((new_k >= max_k) && !force_split)
-			break;
-		if ((k >= cluster) && !force_split)
-			continue;
+		uint8_t force_split = (new_k < min_k) || (max_radius > 0 && radius > (max_radius * max_radius));
 		if (!force_split)
 			continue;
 
@@ -138,64 +70,24 @@ improve_structure(POINT4D *objs,
 		memset(temp_radii, 0, sizeof(double) * 2);
 		temp_centers[0] = centers[cluster];
 
-		/* calculate BIC for one cluster */
-		if (!force_split)
-			initial_bic = bayesian_information_criteria(
-			    temp_objs, temp_clusters, (uint32_t)cluster_size, temp_centers, 1);
-
-		/* 2-means the cluster */
-		/* if you split a perfect circle of r=1 in two parts, distance from centroid of halfcircle to edge will
-		 * be 1.2513529646364838 */
-		/// uint32_t clusters_to_add = kmeans(temp_objs, temp_clusters, (uint32_t)cluster_size, temp_centers,
-		/// temp_radii, 2, n-new_k, 1.2513529646364838*max_radius, max_weight, LW_TRUE);
-		uint32_t clusters_to_add = kmeans(temp_objs,
-						  temp_clusters,
-						  (uint32_t)cluster_size,
-						  temp_centers,
-						  temp_radii,
-						  2,
-						  n - new_k,
-						  max_radius,
-						  max_weight,
-						  LW_TRUE);
-
-		/* calculate BIC for 2-means */
-		if (!force_split)
-			split_bic = bayesian_information_criteria(
-			    temp_objs, temp_clusters, (uint32_t)cluster_size, temp_centers, 2);
-
-		/* replace cluster with split if needed */
-		if (force_split) // || split_bic > initial_bic)
+		uint32_t clusters_to_add =
+		    kmeans(temp_objs, temp_clusters, (uint32_t)cluster_size, temp_centers, temp_radii, 2, max_radius);
+		/* replace cluster with split */
+		uint32_t d = 0;
+		for (uint32_t i = 0; i < n; i++)
 		{
-			lwnotice(
-			    "%s: splitting cluster %u into %u, bic orig = %f, bic split = %f, size = %d, mass = %f, radius = %f, forced = %u",
-			    __func__,
-			    cluster,
-			    clusters_to_add,
-			    initial_bic,
-			    split_bic,
-			    cluster_size,
-			    mass,
-			    sqrt(radius),
-			    force_split);
-			uint32_t d = 0;
-			for (uint32_t i = 0; i < n; i++)
-			{
-				if (clusters[i] == cluster)
-					if (temp_clusters[d++])
-						clusters[i] = new_k + clusters_to_add - 1;
-			}
-			centers[cluster] = temp_centers[0];
-			radii[cluster] = temp_radii[0];
-			for (int t = 1; t < clusters_to_add; t++)
-			{
-				centers[new_k + t - 1] = temp_centers[t];
-				radii[new_k + t - 1] = temp_radii[t];
-			}
-			new_k += clusters_to_add - 1;
-			we_had_a_split = LW_TRUE;
-			//	cluster--;
+			if (clusters[i] == cluster)
+				if (temp_clusters[d++])
+					clusters[i] = new_k + clusters_to_add - 1;
+		}
+		centers[cluster] = temp_centers[0];
+		radii[cluster] = temp_radii[0];
+		for (int t = 1; t < clusters_to_add; t++)
+		{
+			centers[new_k + t - 1] = temp_centers[t];
+			radii[new_k + t - 1] = temp_radii[t];
 		}
+		new_k += clusters_to_add - 1;
 	}
 	lwfree(temp_centers);
 	lwfree(temp_clusters);
@@ -361,30 +253,12 @@ kmeans_init(POINT4D *objs, uint32_t n, POINT4D *centers, uint32_t k)
 	}
 }
 
-static int
-cluster_centroid_cmp(const void *a, const void *b)
-{
-	POINT4D *p1 = (POINT4D *)a;
-	POINT4D *p2 = (POINT4D *)b;
-	return p2->m - p1->m;
-}
-
 static uint32_t
-kmeans(POINT4D *objs,
-       int *clusters,
-       uint32_t n,
-       POINT4D *centers,
-       double *radii,
-       uint32_t min_k,
-       uint32_t max_k,
-       double max_radius,
-       double max_weight,
-       uint8_t we_had_a_split)
+kmeans(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, double *radii, uint32_t min_k, double max_radius)
 {
 	uint8_t converged = LW_FALSE;
-	assert(max_k >= min_k);
 	assert(max_k >= 2);
-	uint32_t cur_k = 2;
+	uint32_t cur_k = min_k;
 
 	kmeans_init(objs, n, centers, cur_k);
 
@@ -399,19 +273,15 @@ kmeans(POINT4D *objs,
 				break;
 			update_means(objs, clusters, n, centers, cur_k);
 		}
-		if (!converged || cur_k >= max_k)
+		if (!converged)
 			break;
-		if (cur_k < min_k)
+		if (max_radius)
 		{
-			/* Renumber clusters and make most massive one first. */
-			qsort(centers, cur_k, sizeof(POINT4D), cluster_centroid_cmp);
-			converged = update_r(objs, clusters, n, centers, radii, cur_k);
+			uint32_t new_k = improve_structure(objs, clusters, n, centers, radii, cur_k, min_k, max_radius);
+			if (new_k == cur_k)
+				break;
+			cur_k = new_k;
 		}
-		uint32_t new_k = improve_structure(
-		    objs, clusters, n, centers, radii, cur_k, min_k, max_k, max_radius, max_weight, we_had_a_split);
-		if (new_k == cur_k)
-			break;
-		cur_k = new_k;
 	}
 
 	if (!converged)
@@ -423,15 +293,14 @@ kmeans(POINT4D *objs,
 }
 
 int *
-lwgeom_cluster_kmeans(const LWGEOM **geoms, uint32_t n, uint32_t k)
+lwgeom_cluster_kmeans(const LWGEOM **geoms, uint32_t n, uint32_t k, double max_radius)
 {
 	uint32_t num_non_empty = 0;
 	uint8_t converged = LW_FALSE;
-	double max_radius = 0; // 400000;
-	double max_weight = 10000000;
 
 	assert(k > 0);
 	assert(n > 0);
+	assert(max_radius >= 0);
 	assert(geoms);
 
 	if (n < k)
@@ -526,20 +395,11 @@ lwgeom_cluster_kmeans(const LWGEOM **geoms, uint32_t n, uint32_t k)
 		k = num_non_empty;
 	}
 
-	if (k > 1)
+	if (k > 1 || max_radius > 0)
 	{
 		int *clusters_dense = lwalloc(sizeof(int) * num_non_empty);
 		memset(clusters_dense, 0, sizeof(int) * num_non_empty);
-		converged = kmeans(objs,
-				   clusters_dense,
-				   num_non_empty,
-				   centers,
-				   radii,
-				   k,
-				   num_non_empty,
-				   max_radius,
-				   max_weight,
-				   LW_FALSE);
+		converged = kmeans(objs, clusters_dense, num_non_empty, centers, radii, k, max_radius);
 
 		if (converged)
 		{
diff --git a/postgis/lwgeom_window.c b/postgis/lwgeom_window.c
index 107590b..a605bc1 100644
--- a/postgis/lwgeom_window.c
+++ b/postgis/lwgeom_window.c
@@ -185,6 +185,7 @@ Datum ST_ClusterKMeans(PG_FUNCTION_ARGS)
 	{
 		int       i, k, N;
 		bool      isnull, isout;
+		double max_radius = 0.0;
 		LWGEOM    **geoms;
 		int       *r;
 
@@ -206,11 +207,14 @@ Datum ST_ClusterKMeans(PG_FUNCTION_ARGS)
 			PG_RETURN_NULL();
 		}
 
+		/* Maximum cluster radius. 0 if not set*/
+		max_radius = DatumGetFloat8(WinGetFuncArgCurrent(winobj, 2, &isnull));
+		if (isnull || max_radius <= 0)
+			max_radius = 0.0;
+
 		/* Error out if N < K */
 		if (N<k)
-		{
 			lwpgerror("K (%d) must be smaller than the number of rows in the group (%d)", k, N);
-		}
 
 		/* Read all the geometries from the partition window into a list */
 		geoms = palloc(sizeof(LWGEOM*) * N);
@@ -232,7 +236,7 @@ Datum ST_ClusterKMeans(PG_FUNCTION_ARGS)
 		}
 
 		/* Calculate k-means on the list! */
-		r = lwgeom_cluster_kmeans((const LWGEOM **)geoms, N, k);
+		r = lwgeom_cluster_kmeans((const LWGEOM **)geoms, N, k, max_radius);
 
 		/* Clean up */
 		for (i = 0; i < N; i++)
diff --git a/postgis/postgis.sql.in b/postgis/postgis.sql.in
index 1ef5d29..ecc8bc4 100644
--- a/postgis/postgis.sql.in
+++ b/postgis/postgis.sql.in
@@ -4108,7 +4108,8 @@ CREATE AGGREGATE ST_MakeLine (geometry) (
 --------------------------------------------------------------------------------
 
 -- Availability: 2.3.0
-CREATE OR REPLACE FUNCTION ST_ClusterKMeans(geom geometry, k integer)
+-- Changed: 3.2.0 added max_radius parameter
+CREATE OR REPLACE FUNCTION ST_ClusterKMeans(geom geometry, k integer, max_radius float default null)
 	RETURNS integer
 	AS 'MODULE_PATHNAME', 'ST_ClusterKMeans'
 	LANGUAGE 'c' VOLATILE STRICT WINDOW

commit 76f02de9e275ec960cdd83c48fdfb1f6b728c30b
Author: Darafei Praliaskouski <me at komzpa.net>
Date:   Mon Dec 7 00:07:12 2020 +0300

    XMeans: bounds for weight and radius.

diff --git a/liblwgeom/lwkmeans.c b/liblwgeom/lwkmeans.c
index e101506..304e31f 100644
--- a/liblwgeom/lwkmeans.c
+++ b/liblwgeom/lwkmeans.c
@@ -19,7 +19,16 @@
  */
 #define KMEANS_MAX_ITERATIONS 1000
 
-static uint8_t kmeans(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_t k, uint32_t max_k);
+static uint32_t kmeans(POINT4D *objs,
+		       int *clusters,
+		       uint32_t n,
+		       POINT4D *centers,
+		       double *radii,
+		       uint32_t min_k,
+		       uint32_t max_k,
+		       double max_radius,
+		       double max_weight,
+		       uint8_t we_had_a_split);
 
 inline static double
 distance3d_sqr_pt4d_pt4d(const POINT4D *p1, const POINT4D *p2)
@@ -78,20 +87,42 @@ bayesian_information_criteria(POINT4D *objs, int *clusters, uint32_t n, POINT4D
 }
 
 static uint32_t
-improve_structure(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_t k, uint32_t max_k)
+improve_structure(POINT4D *objs,
+		  int *clusters,
+		  uint32_t n,
+		  POINT4D *centers,
+		  double *radii,
+		  uint32_t k,
+		  uint32_t min_k,
+		  uint32_t max_k,
+		  double max_radius,
+		  double max_weight,
+		  uint8_t we_had_a_split)
 {
 	POINT4D *temp_objs = lwalloc(sizeof(POINT4D) * n);
 	int *temp_clusters = lwalloc(sizeof(int) * n);
-	POINT4D *temp_centers = lwalloc(sizeof(POINT4D) * 2);
+	double *temp_radii = lwalloc(sizeof(double) * n);
+	POINT4D *temp_centers = lwalloc(sizeof(POINT4D) * n);
+
 	uint32_t new_k = k;
-	/* worst case we will get twice as much clusters but no more than max */
-	/*uint32_t new_size = (k*2 > max_k) ? max_k : k*2;
-	centers = lwrealloc(centers, sizeof(POINT4D) * new_size);*/
 
 	for (int cluster = 0; cluster < (int)k; cluster++)
 	{
-		if (new_k >= max_k)
+		double mass = centers[cluster].m;
+		double radius = radii[cluster];
+		double initial_bic = 0, split_bic = 0;
+		// uint8_t force_split = (new_k < min_k && !we_had_a_split) || (max_weight>0 && mass > max_weight && !
+		// we_had_a_split) ||(max_weight>0 && mass > sqrt(2)*max_weight &&  we_had_a_split) || (!we_had_a_split
+		// && max_radius>0 && radius > (max_radius*max_radius)) ||(we_had_a_split && max_radius>0 && radius >
+		// (2*max_radius*max_radius)) ;
+		uint8_t force_split = (new_k < min_k && !we_had_a_split) || (max_weight > 0 && mass > max_weight) ||
+				      (max_radius > 0 && radius > (max_radius * max_radius));
+		if ((new_k >= max_k) && !force_split)
 			break;
+		if ((k >= cluster) && !force_split)
+			continue;
+		if (!force_split)
+			continue;
 
 		/* copy cluster alone */
 		int cluster_size = 0;
@@ -99,47 +130,81 @@ improve_structure(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, ui
 			if (clusters[i] == cluster)
 				temp_objs[cluster_size++] = objs[i];
 
-		/* clustering intention is to have less objects than points, so split if we have at least 3 points */
-		if (cluster_size < 3)
+		/* clustering intention is to have less clusters than points.
+		 * split only if there are at least 3 points or absolutely needed to get enough clusters. */
+		if (cluster_size <= 2 || (force_split && cluster_size <= 1))
 			continue;
 		memset(temp_clusters, 0, sizeof(int) * cluster_size);
+		memset(temp_radii, 0, sizeof(double) * 2);
 		temp_centers[0] = centers[cluster];
 
-		/* calculate BIC for cluster */
-		double initial_bic =
-		    bayesian_information_criteria(temp_objs, temp_clusters, (uint32_t)cluster_size, temp_centers, 1);
+		/* calculate BIC for one cluster */
+		if (!force_split)
+			initial_bic = bayesian_information_criteria(
+			    temp_objs, temp_clusters, (uint32_t)cluster_size, temp_centers, 1);
+
 		/* 2-means the cluster */
-		kmeans(temp_objs, temp_clusters, (uint32_t)cluster_size, temp_centers, 2, 2);
+		/* if you split a perfect circle of r=1 in two parts, distance from centroid of halfcircle to edge will
+		 * be 1.2513529646364838 */
+		/// uint32_t clusters_to_add = kmeans(temp_objs, temp_clusters, (uint32_t)cluster_size, temp_centers,
+		/// temp_radii, 2, n-new_k, 1.2513529646364838*max_radius, max_weight, LW_TRUE);
+		uint32_t clusters_to_add = kmeans(temp_objs,
+						  temp_clusters,
+						  (uint32_t)cluster_size,
+						  temp_centers,
+						  temp_radii,
+						  2,
+						  n - new_k,
+						  max_radius,
+						  max_weight,
+						  LW_TRUE);
+
 		/* calculate BIC for 2-means */
-		double split_bic =
-		    bayesian_information_criteria(temp_objs, temp_clusters, (uint32_t)cluster_size, temp_centers, 2);
+		if (!force_split)
+			split_bic = bayesian_information_criteria(
+			    temp_objs, temp_clusters, (uint32_t)cluster_size, temp_centers, 2);
+
 		/* replace cluster with split if needed */
-		if (split_bic > initial_bic)
+		if (force_split) // || split_bic > initial_bic)
 		{
-			lwnotice("%s: splitting cluster %u into %u, bic orig = %f, bic split = %f, size = %d",
-				 __func__,
-				 cluster,
-				 new_k,
-				 initial_bic,
-				 split_bic,
-				 cluster_size);
+			lwnotice(
+			    "%s: splitting cluster %u into %u, bic orig = %f, bic split = %f, size = %d, mass = %f, radius = %f, forced = %u",
+			    __func__,
+			    cluster,
+			    clusters_to_add,
+			    initial_bic,
+			    split_bic,
+			    cluster_size,
+			    mass,
+			    sqrt(radius),
+			    force_split);
 			uint32_t d = 0;
 			for (uint32_t i = 0; i < n; i++)
 			{
 				if (clusters[i] == cluster)
 					if (temp_clusters[d++])
-						clusters[i] = new_k;
+						clusters[i] = new_k + clusters_to_add - 1;
 			}
 			centers[cluster] = temp_centers[0];
-			centers[new_k] = temp_centers[1];
-			new_k++;
+			radii[cluster] = temp_radii[0];
+			for (int t = 1; t < clusters_to_add; t++)
+			{
+				centers[new_k + t - 1] = temp_centers[t];
+				radii[new_k + t - 1] = temp_radii[t];
+			}
+			new_k += clusters_to_add - 1;
+			we_had_a_split = LW_TRUE;
+			//	cluster--;
 		}
 	}
+	lwfree(temp_centers);
+	lwfree(temp_clusters);
+	lwfree(temp_objs);
 	return new_k;
 }
 
 static uint8_t
-update_r(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_t k)
+update_r(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, double *radii, uint32_t k)
 {
 	uint8_t converged = LW_TRUE;
 
@@ -168,6 +233,9 @@ update_r(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_t k)
 			converged = LW_FALSE;
 			clusters[i] = curr_cluster;
 		}
+		if (radii)
+			if (radii[curr_cluster] < curr_distance)
+				radii[curr_cluster] = curr_distance;
 	}
 	return converged;
 }
@@ -293,35 +361,65 @@ kmeans_init(POINT4D *objs, uint32_t n, POINT4D *centers, uint32_t k)
 	}
 }
 
-static uint8_t
-kmeans(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_t k, uint32_t max_k)
+static int
+cluster_centroid_cmp(const void *a, const void *b)
+{
+	POINT4D *p1 = (POINT4D *)a;
+	POINT4D *p2 = (POINT4D *)b;
+	return p2->m - p1->m;
+}
+
+static uint32_t
+kmeans(POINT4D *objs,
+       int *clusters,
+       uint32_t n,
+       POINT4D *centers,
+       double *radii,
+       uint32_t min_k,
+       uint32_t max_k,
+       double max_radius,
+       double max_weight,
+       uint8_t we_had_a_split)
 {
 	uint8_t converged = LW_FALSE;
-	uint32_t cur_k = k;
+	assert(max_k >= min_k);
+	assert(max_k >= 2);
+	uint32_t cur_k = 2;
 
-	kmeans_init(objs, n, centers, k);
+	kmeans_init(objs, n, centers, cur_k);
 
 	for (uint32_t t = 0; t < KMEANS_MAX_ITERATIONS; t++)
 	{
 		for (uint32_t i = 0; i < KMEANS_MAX_ITERATIONS; i++)
 		{
 			LW_ON_INTERRUPT(break);
-			converged = update_r(objs, clusters, n, centers, cur_k);
+			memset(radii, sizeof(double), 0);
+			converged = update_r(objs, clusters, n, centers, radii, cur_k);
 			if (converged)
 				break;
 			update_means(objs, clusters, n, centers, cur_k);
 		}
-		if (!converged || k >= max_k)
+		if (!converged || cur_k >= max_k)
 			break;
-		uint32_t new_k = improve_structure(objs, clusters, n, centers, cur_k, max_k);
+		if (cur_k < min_k)
+		{
+			/* Renumber clusters and make most massive one first. */
+			qsort(centers, cur_k, sizeof(POINT4D), cluster_centroid_cmp);
+			converged = update_r(objs, clusters, n, centers, radii, cur_k);
+		}
+		uint32_t new_k = improve_structure(
+		    objs, clusters, n, centers, radii, cur_k, min_k, max_k, max_radius, max_weight, we_had_a_split);
 		if (new_k == cur_k)
 			break;
 		cur_k = new_k;
 	}
 
 	if (!converged)
+	{
 		lwerror("%s did not converge after %d iterations", __func__, KMEANS_MAX_ITERATIONS);
-	return converged;
+		return 0;
+	}
+	return cur_k;
 }
 
 int *
@@ -329,6 +427,8 @@ lwgeom_cluster_kmeans(const LWGEOM **geoms, uint32_t n, uint32_t k)
 {
 	uint32_t num_non_empty = 0;
 	uint8_t converged = LW_FALSE;
+	double max_radius = 0; // 400000;
+	double max_weight = 10000000;
 
 	assert(k > 0);
 	assert(n > 0);
@@ -357,6 +457,10 @@ lwgeom_cluster_kmeans(const LWGEOM **geoms, uint32_t n, uint32_t k)
 	POINT4D *centers = lwalloc(sizeof(POINT4D) * n);
 	memset(centers, 0, sizeof(POINT4D) * n);
 
+	/* An array of clusters radii for the algorithm. */
+	double *radii = lwalloc(sizeof(double) * n);
+	memset(radii, 0, sizeof(double) * n);
+
 	/* Prepare the list of object pointers for K-means */
 	for (uint32_t i = 0; i < n; i++)
 	{
@@ -426,7 +530,16 @@ lwgeom_cluster_kmeans(const LWGEOM **geoms, uint32_t n, uint32_t k)
 	{
 		int *clusters_dense = lwalloc(sizeof(int) * num_non_empty);
 		memset(clusters_dense, 0, sizeof(int) * num_non_empty);
-		converged = kmeans(objs, clusters_dense, num_non_empty, centers, k, k);
+		converged = kmeans(objs,
+				   clusters_dense,
+				   num_non_empty,
+				   centers,
+				   radii,
+				   k,
+				   num_non_empty,
+				   max_radius,
+				   max_weight,
+				   LW_FALSE);
 
 		if (converged)
 		{

commit d7925ddffb590e2c39561c3d379f23041e572c41
Author: Darafei Praliaskouski <me at komzpa.net>
Date:   Sat Dec 5 19:04:00 2020 +0300

    clang-format

diff --git a/liblwgeom/lwkmeans.c b/liblwgeom/lwkmeans.c
index 898d603..e101506 100644
--- a/liblwgeom/lwkmeans.c
+++ b/liblwgeom/lwkmeans.c
@@ -19,9 +19,7 @@
  */
 #define KMEANS_MAX_ITERATIONS 1000
 
-
-static uint8_t
-kmeans(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_t k, uint32_t max_k);
+static uint8_t kmeans(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_t k, uint32_t max_k);
 
 inline static double
 distance3d_sqr_pt4d_pt4d(const POINT4D *p1, const POINT4D *p2)
@@ -33,21 +31,22 @@ distance3d_sqr_pt4d_pt4d(const POINT4D *p1, const POINT4D *p2)
 	return hside * hside + vside * vside + zside * zside;
 }
 
-static double bayesian_information_criteria(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_t k)
+static double
+bayesian_information_criteria(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_t k)
 {
 	if (n < k)
 		return INFINITY;
 
-/*	double total_weight = 0.0;
-	for (uint32_t i = 0; i < n; i++)
-		total_weight += objs[i].m;
+	/*	double total_weight = 0.0;
+		for (uint32_t i = 0; i < n; i++)
+			total_weight += objs[i].m;
 
-	if (total_weight < k)
-		return INFINITY;*/
+		if (total_weight < k)
+			return INFINITY;*/
 
 	/* estimation of the noise variance in the data set */
-    double sigma_sqrt = 0.0;
-    for (uint32_t i = 0; i < n; i++)
+	double sigma_sqrt = 0.0;
+	for (uint32_t i = 0; i < n; i++)
 		sigma_sqrt += distance3d_sqr_pt4d_pt4d(&objs[i], &centers[clusters[i]]);
 	sigma_sqrt /= (n - k);
 
@@ -67,8 +66,10 @@ static double bayesian_information_criteria(POINT4D *objs, int *clusters, uint32
 	for (int cluster = 0; cluster < (int)k; cluster++)
 	{
 		double weight = centers[cluster].m;
-		/*double L = weight * log(weight) - weight * log(total_weight) - weight * 0.5 * log(2.0 * M_PI) - weight * sigma_multiplier - (weight - k) * 0.5; */
-		double L = - weight * 0.5 * log(2.0 * M_PI) - weight * sigma_multiplier - (weight - k)/2 + weight * log(weight) - weight * log((double)n);
+		/*double L = weight * log(weight) - weight * log(total_weight) - weight * 0.5 * log(2.0 * M_PI) - weight
+		 * * sigma_multiplier - (weight - k) * 0.5; */
+		double L = -weight * 0.5 * log(2.0 * M_PI) - weight * sigma_multiplier - (weight - k) / 2 +
+			   weight * log(weight) - weight * log((double)n);
 
 		/* BIC calculation */
 		score += L - p * 0.5 * log((double)n);
@@ -79,9 +80,9 @@ static double bayesian_information_criteria(POINT4D *objs, int *clusters, uint32
 static uint32_t
 improve_structure(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_t k, uint32_t max_k)
 {
-	POINT4D * temp_objs = lwalloc(sizeof(POINT4D)*n);
-	int * temp_clusters = lwalloc(sizeof(int)*n);
-	POINT4D * temp_centers = lwalloc(sizeof(POINT4D)*2);
+	POINT4D *temp_objs = lwalloc(sizeof(POINT4D) * n);
+	int *temp_clusters = lwalloc(sizeof(int) * n);
+	POINT4D *temp_centers = lwalloc(sizeof(POINT4D) * 2);
 	uint32_t new_k = k;
 	/* worst case we will get twice as much clusters but no more than max */
 	/*uint32_t new_size = (k*2 > max_k) ? max_k : k*2;
@@ -105,23 +106,23 @@ improve_structure(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, ui
 		temp_centers[0] = centers[cluster];
 
 		/* calculate BIC for cluster */
-		double initial_bic = bayesian_information_criteria(temp_objs, temp_clusters, (uint32_t)cluster_size, temp_centers, 1);
+		double initial_bic =
+		    bayesian_information_criteria(temp_objs, temp_clusters, (uint32_t)cluster_size, temp_centers, 1);
 		/* 2-means the cluster */
 		kmeans(temp_objs, temp_clusters, (uint32_t)cluster_size, temp_centers, 2, 2);
 		/* calculate BIC for 2-means */
-		double split_bic = bayesian_information_criteria(temp_objs, temp_clusters, (uint32_t)cluster_size, temp_centers, 2);
+		double split_bic =
+		    bayesian_information_criteria(temp_objs, temp_clusters, (uint32_t)cluster_size, temp_centers, 2);
 		/* replace cluster with split if needed */
 		if (split_bic > initial_bic)
 		{
-			lwnotice(
-		    "%s: splitting cluster %u into %u, bic orig = %f, bic split = %f, size = %d",
-		    __func__,
-		    cluster,
-			new_k,
-			initial_bic,
-			split_bic,
-			cluster_size
-			);
+			lwnotice("%s: splitting cluster %u into %u, bic orig = %f, bic split = %f, size = %d",
+				 __func__,
+				 cluster,
+				 new_k,
+				 initial_bic,
+				 split_bic,
+				 cluster_size);
 			uint32_t d = 0;
 			for (uint32_t i = 0; i < n; i++)
 			{

commit 2dbae13e8ee216d3fd96d7a44d32d5d083f9957a
Author: Darafei Praliaskouski <me at komzpa.net>
Date:   Sat Dec 5 18:56:11 2020 +0300

    Implement XMeans as is.

diff --git a/liblwgeom/lwkmeans.c b/liblwgeom/lwkmeans.c
index aaac0d0..898d603 100644
--- a/liblwgeom/lwkmeans.c
+++ b/liblwgeom/lwkmeans.c
@@ -19,6 +19,10 @@
  */
 #define KMEANS_MAX_ITERATIONS 1000
 
+
+static uint8_t
+kmeans(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_t k, uint32_t max_k);
+
 inline static double
 distance3d_sqr_pt4d_pt4d(const POINT4D *p1, const POINT4D *p2)
 {
@@ -29,6 +33,110 @@ distance3d_sqr_pt4d_pt4d(const POINT4D *p1, const POINT4D *p2)
 	return hside * hside + vside * vside + zside * zside;
 }
 
+static double bayesian_information_criteria(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_t k)
+{
+	if (n < k)
+		return INFINITY;
+
+/*	double total_weight = 0.0;
+	for (uint32_t i = 0; i < n; i++)
+		total_weight += objs[i].m;
+
+	if (total_weight < k)
+		return INFINITY;*/
+
+	/* estimation of the noise variance in the data set */
+    double sigma_sqrt = 0.0;
+    for (uint32_t i = 0; i < n; i++)
+		sigma_sqrt += distance3d_sqr_pt4d_pt4d(&objs[i], &centers[clusters[i]]);
+	sigma_sqrt /= (n - k);
+
+	/* in geography everything is flat */
+	double dimension = 2;
+	double p = (k - 1) + dimension * k + 1;
+
+	/* in case of the same points, sigma_sqrt can be zero */
+	double sigma_multiplier = 0.0;
+	if (sigma_sqrt <= 0.0)
+		sigma_multiplier = -INFINITY;
+	else
+		sigma_multiplier = dimension * 0.5 * log(sigma_sqrt);
+
+	/* splitting criterion */
+	double score = 0;
+	for (int cluster = 0; cluster < (int)k; cluster++)
+	{
+		double weight = centers[cluster].m;
+		/*double L = weight * log(weight) - weight * log(total_weight) - weight * 0.5 * log(2.0 * M_PI) - weight * sigma_multiplier - (weight - k) * 0.5; */
+		double L = - weight * 0.5 * log(2.0 * M_PI) - weight * sigma_multiplier - (weight - k)/2 + weight * log(weight) - weight * log((double)n);
+
+		/* BIC calculation */
+		score += L - p * 0.5 * log((double)n);
+	}
+	return score;
+}
+
+static uint32_t
+improve_structure(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_t k, uint32_t max_k)
+{
+	POINT4D * temp_objs = lwalloc(sizeof(POINT4D)*n);
+	int * temp_clusters = lwalloc(sizeof(int)*n);
+	POINT4D * temp_centers = lwalloc(sizeof(POINT4D)*2);
+	uint32_t new_k = k;
+	/* worst case we will get twice as much clusters but no more than max */
+	/*uint32_t new_size = (k*2 > max_k) ? max_k : k*2;
+	centers = lwrealloc(centers, sizeof(POINT4D) * new_size);*/
+
+	for (int cluster = 0; cluster < (int)k; cluster++)
+	{
+		if (new_k >= max_k)
+			break;
+
+		/* copy cluster alone */
+		int cluster_size = 0;
+		for (uint32_t i = 0; i < n; i++)
+			if (clusters[i] == cluster)
+				temp_objs[cluster_size++] = objs[i];
+
+		/* clustering intention is to have less objects than points, so split if we have at least 3 points */
+		if (cluster_size < 3)
+			continue;
+		memset(temp_clusters, 0, sizeof(int) * cluster_size);
+		temp_centers[0] = centers[cluster];
+
+		/* calculate BIC for cluster */
+		double initial_bic = bayesian_information_criteria(temp_objs, temp_clusters, (uint32_t)cluster_size, temp_centers, 1);
+		/* 2-means the cluster */
+		kmeans(temp_objs, temp_clusters, (uint32_t)cluster_size, temp_centers, 2, 2);
+		/* calculate BIC for 2-means */
+		double split_bic = bayesian_information_criteria(temp_objs, temp_clusters, (uint32_t)cluster_size, temp_centers, 2);
+		/* replace cluster with split if needed */
+		if (split_bic > initial_bic)
+		{
+			lwnotice(
+		    "%s: splitting cluster %u into %u, bic orig = %f, bic split = %f, size = %d",
+		    __func__,
+		    cluster,
+			new_k,
+			initial_bic,
+			split_bic,
+			cluster_size
+			);
+			uint32_t d = 0;
+			for (uint32_t i = 0; i < n; i++)
+			{
+				if (clusters[i] == cluster)
+					if (temp_clusters[d++])
+						clusters[i] = new_k;
+			}
+			centers[cluster] = temp_centers[0];
+			centers[new_k] = temp_centers[1];
+			new_k++;
+		}
+	}
+	return new_k;
+}
+
 static uint8_t
 update_r(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_t k)
 {
@@ -86,24 +194,6 @@ update_means(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_
 	}
 }
 
-static uint8_t
-kmeans(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_t k)
-{
-	uint8_t converged = LW_FALSE;
-
-	for (uint32_t i = 0; i < KMEANS_MAX_ITERATIONS; i++)
-	{
-		LW_ON_INTERRUPT(break);
-		converged = update_r(objs, clusters, n, centers, k);
-		if (converged)
-			break;
-		update_means(objs, clusters, n, centers, k);
-	}
-	if (!converged)
-		lwerror("%s did not converge after %d iterations", __func__, KMEANS_MAX_ITERATIONS);
-	return converged;
-}
-
 static void
 kmeans_init(POINT4D *objs, uint32_t n, POINT4D *centers, uint32_t k)
 {
@@ -202,6 +292,37 @@ kmeans_init(POINT4D *objs, uint32_t n, POINT4D *centers, uint32_t k)
 	}
 }
 
+static uint8_t
+kmeans(POINT4D *objs, int *clusters, uint32_t n, POINT4D *centers, uint32_t k, uint32_t max_k)
+{
+	uint8_t converged = LW_FALSE;
+	uint32_t cur_k = k;
+
+	kmeans_init(objs, n, centers, k);
+
+	for (uint32_t t = 0; t < KMEANS_MAX_ITERATIONS; t++)
+	{
+		for (uint32_t i = 0; i < KMEANS_MAX_ITERATIONS; i++)
+		{
+			LW_ON_INTERRUPT(break);
+			converged = update_r(objs, clusters, n, centers, cur_k);
+			if (converged)
+				break;
+			update_means(objs, clusters, n, centers, cur_k);
+		}
+		if (!converged || k >= max_k)
+			break;
+		uint32_t new_k = improve_structure(objs, clusters, n, centers, cur_k, max_k);
+		if (new_k == cur_k)
+			break;
+		cur_k = new_k;
+	}
+
+	if (!converged)
+		lwerror("%s did not converge after %d iterations", __func__, KMEANS_MAX_ITERATIONS);
+	return converged;
+}
+
 int *
 lwgeom_cluster_kmeans(const LWGEOM **geoms, uint32_t n, uint32_t k)
 {
@@ -232,8 +353,8 @@ lwgeom_cluster_kmeans(const LWGEOM **geoms, uint32_t n, uint32_t k)
 	memset(clusters, 0, sizeof(int) * n);
 
 	/* An array of clusters centers for the algorithm. */
-	POINT4D *centers = lwalloc(sizeof(POINT4D) * k);
-	memset(centers, 0, sizeof(POINT4D) * k);
+	POINT4D *centers = lwalloc(sizeof(POINT4D) * n);
+	memset(centers, 0, sizeof(POINT4D) * n);
 
 	/* Prepare the list of object pointers for K-means */
 	for (uint32_t i = 0; i < n; i++)
@@ -304,9 +425,7 @@ lwgeom_cluster_kmeans(const LWGEOM **geoms, uint32_t n, uint32_t k)
 	{
 		int *clusters_dense = lwalloc(sizeof(int) * num_non_empty);
 		memset(clusters_dense, 0, sizeof(int) * num_non_empty);
-
-		kmeans_init(objs, num_non_empty, centers, k);
-		converged = kmeans(objs, clusters_dense, num_non_empty, centers, k);
+		converged = kmeans(objs, clusters_dense, num_non_empty, centers, k, k);
 
 		if (converged)
 		{

-----------------------------------------------------------------------

Summary of changes:
 NEWS                                   |   8 +-
 doc/html/images/st_clusterkmeans03.png | Bin 0 -> 72110 bytes
 doc/reference_cluster.xml              |  82 +++++++------
 liblwgeom/cunit/cu_algorithm.c         |   2 +-
 liblwgeom/liblwgeom.h.in               |   3 +-
 liblwgeom/lwkmeans.c                   | 210 +++++++++++++++++++++++++++------
 postgis/lwgeom_window.c                |  10 +-
 postgis/postgis.sql.in                 |   3 +-
 postgis/postgis_before_upgrade.sql     |   1 +
 regress/core/cluster.sql               |   4 +
 regress/core/cluster_expected          |   2 +
 11 files changed, 243 insertions(+), 82 deletions(-)
 create mode 100644 doc/html/images/st_clusterkmeans03.png


hooks/post-receive
-- 
PostGIS


More information about the postgis-tickets mailing list