[postgis-users] Optimizing PostGIS/Geoserver schema for huge dataset

Andrew Gaydos gaydos at ucar.edu
Fri Mar 31 20:00:21 PDT 2017


Yup, timeid, wkb_geometry and streamflow.

'streamflow' is a data value residing in one of the partitions (conus_flow
is the master table which is partitioned)
'wkb_geometry' is a static location geometry (a channel line) in the
'staticchannels' table.
'timeid' is a unique identifier that is also a key into another table which
provides a valid timestamp, forecast timestamp, and model configuration.
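
To make that concrete, here is roughly the layout as I understand it (a
sketch from memory, not the exact DDL -- the 'feature_id' join column and
the child-table name are guesses):

    -- master table, partitioned by timeid via inheritance
    CREATE TABLE conus_flow (
        timeid     integer NOT NULL,  -- key into the time lookup table
        feature_id integer NOT NULL,  -- hypothetical key into staticchannels
        streamflow real
    );

    -- one child table per timeid, e.g.
    CREATE TABLE conus_flow_101 (
        CHECK (timeid = 101)
    ) INHERITS (conus_flow);

    -- static channel geometries
    CREATE TABLE staticchannels (
        feature_id   integer PRIMARY KEY,
        wkb_geometry geometry
    );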

GeoServer queries the 'conus_flow_view' view using a timeid, which returns
the set of geometries joined with the streamflow values for that
valid/init/model configuration.
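
I believe the view is essentially this (again a sketch; the join condition
is my guess):

    CREATE VIEW conus_flow_view AS
    SELECT f.timeid, f.streamflow, c.wkb_geometry
    FROM conus_flow f
    JOIN staticchannels c ON c.feature_id = f.feature_id;

so GeoServer ends up issuing something like SELECT ... FROM conus_flow_view
WHERE timeid = <n>, plus a bounding-box filter on wkb_geometry.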

I imagine the EXPLAIN expands all of the joins in the view, which is where
conus_flow (and its partitions) comes into play.

My requirement is to maintain a month's worth of data, so once per day I
delete records older than 30 days. Other than that and the initial data
inserts (which happen in batches every two hours), the data can be
considered 'read only' (i.e. no additional deletes/updates/replaces).
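
As an aside: since each partition holds exactly one timeid, I have been
wondering whether dropping expired child tables outright would beat the
bulk DELETE. A sketch, assuming children named conus_flow_<timeid> and a
time lookup table called 'times' with a 'valid_time' column (both names
are guesses):

    -- what the daily cleanup does now, through the master table:
    DELETE FROM conus_flow
    WHERE timeid IN (SELECT timeid FROM times
                     WHERE valid_time < now() - interval '30 days');

    -- possible alternative: drop each expired partition, which removes
    -- its rows instantly and leaves no dead tuples behind to VACUUM
    DROP TABLE conus_flow_101;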

-Andy



On Fri, Mar 31, 2017 at 8:33 PM, Andy Colson <andy at squeakycode.net> wrote:

> This says 100:
> http://stackoverflow.com/questions/6104774/how-many-table-partitions-is-too-many-in-postgres
>
> This says 1000 is too many:
> http://dba.stackexchange.com/questions/95977/maximum-partitioning-postgresql
>
> Honestly you'd have to benchmark it, because I have no idea if there is a
> difference between 100 and 1000.
>
> That being said, I'm surprised a good index isn't fast enough.  Partitions
> do cut the index size down, which is good, but the planner still has to
> check every child table's constraint to see which ones match.
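>
> One thing worth checking is whether constraint exclusion is actually
> pruning the children.  A quick test (constraint_exclusion is the real
> setting name; 'partition' is its default value):
>
>     SET constraint_exclusion = partition;
>     EXPLAIN SELECT * FROM conus_flow WHERE timeid = 101;
>
> If the plan still lists every child table, the CHECK constraints aren't
> being used and you pay for all 10,000 children on every query.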
>
> Do you ever update or delete data from the flow table?
> Correct me if I'm wrong, but it looks like your where clause only uses
> fields: timeid, wkb_geometry and streamflow.  Yes?
>
>
> Your EXPLAIN ANALYZE includes the table conus_flow, which isn't in your
> query or view.  Are you sure that's the right EXPLAIN ANALYZE?
>
> -Andy
>
>
> On 03/31/2017 08:18 PM, Andrew Gaydos wrote:
>
>> Thanks for the help!
>>
>> I originally tried putting everything into a single non-partitioned table
>> but the performance was horrible! Since each set of 2.3M rows shares the
>> same timestamp, I thought this would be a good way to divide up the data
>> when partitioning - I set a constraint on each, e.g.
>>
>> table 1: constraint: timeid=101
>> table 2: constraint: timeid=102
>> etc.
>>
>> I could try grouping times into a single table, e.g.
>>
>> table 1: constraint: 100 <= timeid < 110
>> table 2: constraint: 110 <= timeid < 120
>> etc.
>>
>> so that would give me 1000 partitions of 24 million rows each.
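>>
>> In DDL terms I imagine each child would then look something like this
>> (a sketch, assuming the same inheritance scheme as before):
>>
>>     CREATE TABLE conus_flow_100_109 (
>>         CHECK (timeid >= 100 AND timeid < 110)
>>     ) INHERITS (conus_flow);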
>>
>> Is this what you were suggesting? What do you think the optimal balance
>> of partitions and rows would be? 100 partitions of 240 million rows each?
>> 10 partitions of 2.4 billion rows each? At some point, though, I think I
>> would run into the same terrible performance I was getting with a single
>> table.
>>
>> Actually, now that I check, the number of partitions is closer to 17,000
>> and the number of rows per partition is 2.7M, so about 46 billion rows
>> altogether...
>>
>> Thanks again!
>> -Andy
>>
>> On Fri, Mar 31, 2017 at 6:15 PM, Andy Colson <andy at squeakycode.net> wrote:
>>
>>     On 03/31/2017 11:38 AM, Andrew Gaydos wrote:
>>
>>         Hi,
>>
>>
>>
>>         My questions are:
>>
>>          1. It seems that for every session there is a one-time penalty
>>             for the first query (several minutes), after which queries
>>             tend to run much quicker (about 10 seconds for all the tiles
>>             to be served). What is going on here?
>>          2. Is there a way to optimize GeoServer's queries against this
>>             schema, or a more efficient query to try?
>>          3. Are there other Postgres optimizations that might help?
>>
>>         I'm pretty new to both GeoServer and PostGIS and have a sinking
>>         feeling that I could be structuring this dataset and queries more
>>         efficiently, but I've run out of ideas and don't have any postgres
>>         experts at work to ask, so I'm posting here.
>>
>>         Thanks for any insight!
>>
>>         -Andy
>>
>>
>>     Andys unite!
>>
>>     err.. anyway, here is the problem:
>>
>>         data table: (10,000 partitions, each with 2.3 million rows)
>>
>>
>>
>>     Lots of partitions will kill planning time. Look at the very bottom of:
>>     https://www.postgresql.org/docs/9.5/static/ddl-partitioning.html
>>
>>     Do you have your heart set on lots of partitions?  How'd you feel
>>     about 100? or maybe 1000?
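>>
>>     You can see exactly how many children you have from the pg_inherits
>>     catalog, something like:
>>
>>         SELECT count(*) FROM pg_inherits
>>         WHERE inhparent = 'conus_flow'::regclass;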
>>
>>     -Andy
>>
>>
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users