[postgis-users] Optimizing PostGIS/Geoserver schema for huge dataset

Fri Mar 31 09:38:34 PDT 2017

Hi,

I'm trying to efficiently serve forecast streamflow WMS images from a
PostGIS database using Geoserver 2.9x. My dataset (currently > 5TB)
consists of static polylines (2.3 million of them) that each have a number
of values that vary over the time dimension, as well as a forecast time
dimension and a model run dimension.

To manage this dataset, I've set up a static table containing the
geometries of each polyline, with an id assigned to each polyline.
Additionally I've set up a 'times' table which assigns a unique id to each
time/forecast time/model run combination. Thirdly, I've put the data in a
huge partitioned table that includes the index of each polyline (into the
static table) as well as an index into the times table. The data table is
partitioned by the time id, so each partition contains only values
applicable at a particular time/forecast time/model combination.

So I have:

static table: (2.3 million rows)

                                                       Table
"public.staticflow"
   Column    |              Type              |
  Modifiers                          | Storage | Stats target |
Description --------------+--------------------------------+-------------------------------------------------------------+---------+--------------+-------------
id           | integer                        | not null default
nextval('staticflow_id_seq'::regclass) | plain   |              |
shape_length | double precision               |
                                     | plain   |              |
strorder     | integer                        |
                                     | plain   |              |
wkb_geometry | geometry(MultiLineString,4269) |
                                     | main    |              |
Indexes:
   "staticflow_pkey" PRIMARY KEY, btree (id)
   "staticflow_id_idx" btree (id)
   "staticflow_wkb_geometry_geom_idx" gist (wkb_geometry) CLUSTER

times table: (10,000 rows)

                                                     Table "public.flow_times"
  Column   |            Type             |
Modifiers                           | Storage  | Stats target |
Description -----------+-----------------------------+---------------------------------------------------------------+----------+--------------+-------------
id        | integer                     | not null default
nextval('flow_times_id_seq'::regclass)      | plain    |
|
validtime | timestamp without time zone |
                                 | plain    |              |
inittime  | timestamp without time zone |
                                 | plain    |              |
model     | character varying(30)       |
                                 | extended |              | Indexes:
    "flow_times_pkey" PRIMARY KEY, btree (id)
    "flow_times_id_idx" btree (id)
    "flow_times_inittime_idx" btree (inittime)
    "flow_times_model_idx" btree (model)
    "flow_times_validtime_idx" btree (validtime)

data table: (10,000 partitions, each with 2.3 million rows)

Table "public.flow"
   Column   |  Type   | Modifiers | Storage | Stats target |
Description ------------+---------+-----------+---------+--------------+-------------
 station_id | integer |           | plain   |              |
 timeid     | integer |           | plain   |              |
 streamflow | real    |           | plain   |              |
 velocity   | real    |           | plain   |              | Child
tables: flow_15192,
              flow_15193,
              flow_15194,
             ...

each partition has the index/constraints, e.g. for flow_15192: Indexes:
"flow_15192_station_id_idx" btree (station_id) Check constraints:
"flow_15192_timeid_check" CHECK (timeid = 15192) Inherits: flow

To serve this with GeoServer, I created a postgres view that puts this all
together, and use a CQL filter with the time id (obtained through a
separate query) to query the WMS against this view.

create flow_view SELECT staticflow.id,
    flow_times.id AS timeid,
    flow_times.validtime,
    flow_times.inittime,
    flow_times.model,
    staticflow.strorder,
    flow.streamflow,
    flow.velocity,
    staticflow.shape_length,
    staticflow.wkb_geometry
   FROM staticflow,
    flow_times,
    flow
  WHERE flow.station_id = staticflow.id AND flow_times.id = flow.timeid;

I've looked at the query plan for the queries that GeoServer is sending to
Postgres, e.g:

explain analyze SELECT
"streamflow",encode(ST_AsBinary(ST_Simplify(ST_Force2D("wkb_geometry"),
1.0357088025898521E-4)),'base64') as "wkb_geometry" FROM
"public"."flow_view" WHERE  ("timeid" = 81388 AND "timeid" IS NOT NULL
 AND "wkb_geometry" && ST_GeomFromText('POLYGON ((-115.13723373413087
41.01267720039345, -115.13723373413087 41.04660520510985,
-115.09225845336915 41.04660520510985, -115.09225845336915
41.01267720039345, -115.13723373413087 41.01267720039345))', 4269) AND
(("streamflow" > -1.0E-4 AND "streamflow" IS NOT NULL  AND
"streamflow" < 5.0) OR ("streamflow" > 4.9999 AND "streamflow" IS NOT
NULL  AND "streamflow" < 10.0) OR ("streamflow" > 9.9999 AND
"streamflow" IS NOT NULL  AND "streamflow" < 15.0) OR ("streamflow" >
14.9999 AND "streamflow" IS NOT NULL  AND "streamflow" < 20.0) OR
("streamflow" > 19.9999 AND "streamflow" IS NOT NULL  AND "streamflow"
< 25.0) OR ("streamflow" > 24.9999 AND "streamflow" IS NOT NULL  AND
"streamflow" < 30.0) OR ("streamflow" > 29.9999 AND "streamflow" IS
NOT NULL  AND "streamflow" < 100000.0)));

     QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.70..163.24 rows=11 width=465) (actual
time=0.101..0.200 rows=9 loops=1)
   ->  Index Only Scan using conus_flow_times_id_idx on
conus_flow_times  (cost=0.29..8.31 rows=1 width=4) (actual
time=0.020..0.020 rows=1 loops=1)
         Index Cond: ((id IS NOT NULL) AND (id = 81388))
         Heap Fetches: 1
   ->  Nested Loop  (cost=0.41..154.71 rows=11 width=469) (actual
time=0.045..0.104 rows=9 loops=1)
         ->  Index Scan using staticflow_wkb_geometry_geom_idx on
staticflow  (cost=0.41..52.62 rows=12 width=465) (actual
time=0.026..0.058 rows=9 loops=1)
               Index Cond: (wkb_geometry &&
'0103000020AD100000010000000500000001000070C8C85CC01F8C10689F81444001000070C8C85CC095E2CB28F785444001000090E7C55CC095E2CB28F785444001000090E7C55CC01F8C10689F81444001000070C8C85CC01F8C10689F814440'::geometry)
         ->  Append  (cost=0.00..8.49 rows=2 width=12) (actual
time=0.004..0.005 rows=1 loops=9)
               ->  Seq Scan on conus_flow  (cost=0.00..0.00 rows=1
width=12) (actual time=0.000..0.000 rows=0 loops=9)
                     Filter: ((streamflow IS NOT NULL) AND (timeid =
81388) AND (staticflow.id = station_id) AND (((streamflow >
(-0.0001)::double precision) AND (streamflow < 5::double precision))
OR ((streamflow > 4.9999::double precision) AND (streamflow <
10::double precision)) OR ((streamflow > 9.9999::double precision) AND
(streamflow < 15::double precision)) OR ((streamflow > 14.9999::double
precision) AND (streamflow < 20::double precision)) OR ((streamflow >
19.9999::double precision) AND (streamflow < 25::double precision)) OR
((streamflow > 24.9999::double precision) AND (streamflow < 30::double
precision)) OR ((streamflow > 29.9999::double precision) AND
(streamflow < 100000::double precision))))
               ->  Index Scan using conus_flow_81388_station_id_idx on
conus_flow_81388  (cost=0.43..8.49 rows=1 width=12) (actual
time=0.004..0.004 rows=1 loops=9)
                     Index Cond: (station_id = staticflow.id)
                     Filter: ((streamflow IS NOT NULL) AND (timeid =
81388) AND (((streamflow > (-0.0001)::double precision) AND
(streamflow < 5::double precision)) OR ((streamflow > 4.9999::double
precision) AND (streamflow < 10::double precision)) OR ((streamflow >
9.9999::double precision) AND (streamflow < 15::double precision)) OR
((streamflow > 14.9999::double precision) AND (streamflow < 20::double
precision)) OR ((streamflow > 19.9999::double precision) AND
(streamflow < 25::double precision)) OR ((streamflow > 24.9999::double
precision) AND (streamflow < 30::double precision)) OR ((streamflow >
29.9999::double precision) AND (streamflow < 100000::double
precision))))
 Planning time: 3041.998 ms
 Execution time: 1.192 ms

(the streamflow filter is for styling the polylines based on streamflow
value (using an SLD). Removing this does not seem to speed up the queries
significantly.)

My questions are

   1. It seems that for every session, there is a one-time penalty for the
   first query (several minutes) after which queries tend to run much quicker
   (about 10 seconds for all the tiles to be served). What is going on here?
   2. Is there a way to optimize GeoServer's queries against this schema,
   or a more efficient query to try?
   3. other postgres optimizations that might help?

I'm pretty new to both GeoServer and PostGIS and have a sinking feeling
that I could be structuring this dataset and queries more efficiently, but
I've run out of ideas and don't have any postgres experts at work to ask,
so I'm posting here.

Thanks for any insight!
-Andy
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20170331/15237d97/attachment.html>