Continuing on clustering queries

Regina Obe lr at pcorp.us
Thu Mar 28 09:43:43 PDT 2024


> Thank you again; results are starting to come in.
> 
> Apologies for starting a new thread; but the other one was getting messy.
> 
> Here are the queries that I am testing:
> SELECT city, state, ST_ClusterDBScan( ST_Transform(geom,2163), eps :=
> 0.3048*5000, minpoints :=1) OVER () AS cluster_id INTO TEMP aa_zz FROM
> shapefiles.atp_filling_stations aa WHERE aa.amenity = 'fuel'
> AND aa.ctry_code = 'US' AND aa.geom IS NOT NULL;
> 
> SELECT cluster_id, COUNT(*) FROM aa_zz GROUP BY 1 ORDER BY 2 DESC
> LIMIT 10;
> 
> SELECT state,  cluster_id, COUNT(*) FROM aa_zz GROUP BY 2, 1 ORDER BY 3
> DESC LIMIT 20;
> 
> SELECT city, state,  cluster_id, COUNT(*) FROM aa_zz GROUP BY 3, 2, 1
> ORDER BY 4 DESC LIMIT 20;
> 
> 
> I receive the following results:
> SELECT 76023
>   cluster_id | count
> ------------+-------
>         8467 |   422
>         3888 |   371
>         5701 |   249
>         3351 |   177
>         1111 |   141
>         8352 |   120
>         1610 |   107
>         1812 |    90
>         1824 |    86
>        14232 |    81
> (10 rows)
> 
>   state | cluster_id | count
> -------+------------+-------
>   MI    |       8467 |   422
>   NY    |       3888 |   371
>   IL    |       5701 |   245
>   NY    |       3351 |   177
>   CO    |       8352 |   110
>   TX    |       1610 |   101
>   NY    |       1824 |    86
>   DC    |       1111 |    82
>   IN    |      14748 |    78
>   TX    |      14297 |    75
>   MO    |       3661 |    74
>   CA    |      11272 |    74
>   MI    |        462 |    71
>   TX    |      14232 |    70
>   TX    |      19326 |    69
>   TX    |       1812 |    68
>   FL    |      13693 |    68
>   TX    |      16824 |    68
>   CA    |      14006 |    66
>   IL    |       2727 |    64
> (20 rows)
> 
>       city     | state | cluster_id | count
> --------------+-------+------------+-------
>   DETROIT      | MI    |       8467 |   147
>   CHICAGO      | IL    |       5701 |   104
>   Austin       | TX    |       1610 |   101
>   Detroit      | MI    |       8467 |    85
>   BROOKLYN     | NY    |       3888 |    77
>   Chicago      | IL    |       5701 |    75
>   Fresno       | CA    |      11272 |    73
>   Brooklyn     | NY    |       3888 |    71
>   WASHINGTON   | DC    |       1111 |    66
>   HOUSTON      | TX    |      14297 |    64
>   St. Louis    | MO    |       3661 |    63
>   HOUSTON      | TX    |      14232 |    63
>   DENVER       | CO    |       8352 |    54
>   BRONX        | NY    |       3351 |    53
>   LOS ANGELES  | CA    |      14006 |    50
>   Bakersfield  | CA    |       1777 |    49
>   INDIANAPOLIS | IN    |      14748 |    44
>   Bronx        | NY    |       3351 |    42
>   Amarillo     | TX    |        622 |    41
>   Austin       | TX    |       1601 |    38
> (20 rows)
> 
> 
> Obviously, the case-based aspect of the city name can be cancelled with a
> function such as UPPER(city).
> 
> I also am generous in the radius dimension setting it at 5k feet.
> 
> But my next question is how do position the cluster_id on a map w/o
> referencing the city, and then tying that out to a geometry that I have
> stored for the city?
> 
> Thank you again,
> 
> Max

Depends how you want to show the cluster. If you want to show like a bubble
to represent the cluster, you can probably using ST_MinimumBoundingCircle
https://postgis.net/docs/en/ST_MinimumBoundingCircle.html  
First by keeping the point geometry

CREATE TEMP TABLE tmp_aa_zz AS 
SELECT city, state, ST_ClusterDBScan( ST_Transform(geom,2163), eps :=
 0.3048*5000, minpoints :=1) OVER () AS cluster_id, aa.geom 
FROM
 shapefiles.atp_filling_stations aa WHERE aa.amenity = 'fuel'
AND aa.ctry_code = 'US' AND aa.geom IS NOT NULL;


SELECT cluster_id , ST_MinimumBoundingCircle(ST_Collect(geom)) AS geom,
COUNT(*)
FROM tmp_aa_zz
GROUP BY cluster_id;

Or you might want to show the individual points in which case you can
dispense with the ST_MinimumBoundingCircle call or have it in addition to
like: 

SELECT cluster_id , ST_Collect(ST_MinimumBoundingCircle(ST_Collect(geom)),
ST_Collect(geom)) AS geom, COUNT(*)
FROM tmp_aa_zz
GROUP BY cluster_id;

Other alternative options you can try besides the minimum bounding circle
might be ST_ConvexHull  https://postgis.net/docs/en/ST_ConvexHull.html
, ST_ConcaveHull https://postgis.net/docs/en/ST_ConcaveHull.html, or maybe
even ST_GeometricMedian https://postgis.net/docs/en/ST_GeometricMedian.html
(if you want to show only one point to represent all the points or
ST_Centroid ( I think ST_GeometricMedian would be a more accurate
representation though, think of centroid as the average so would be highly
affected by outlier points where as the median would be less so)





More information about the postgis-users mailing list