[postgis-users] OT: Generic postgresql Q about LARGE number of tables
Paragon Corporation
lr at pcorp.us
Sun Nov 28 19:32:07 PST 2010
> Leo, I have a question below ...
> This seems to imply either:
> 1. there are 26 tables one for each letter of the alphabet or 2. that
there is some way to alias a table so m_symbols, n_symbols and o_symbols all
point to the mno_symbols table.
> If the tables are split into ABC, DEF, GHI, JKL, MNO, PQR, STU, VWX, YZ,
then o_symbols and m_symbols would both need to point to the MNO table.
> So I'm assuming that in that example you were thinking of one table per
letter which probably works fine also and might be easier to make
assumptions on the table name based on the symbol.
> It is likely that I will encapsulate most standard queries in stored
procedures where you pass it the symbols and it generates the SQL for the
queries so I can hide the complexity of query generation.
Steve,
Sorry was typing too fast and had a lot of typos. There are 3 competing
factors:
Simplicity, Load Balancing, and Minimizing the number of tables that need to
be considered by the planner when making a plan.
The biggest pitfall with using inherited tables and leaving it up to the
planner to figure out which needs to be inspected is the last factor.
By Load Balancing I mean you don't want your tables too lop-sided e.g. you
have few records in Z table and too many in A table.
With that said there are 2 ways I was thinking of partitioning data. Regina
and I were discussing this, and she things that when you have a possibly
large number of tables, you should go with a nested hierarchy rather than a
flat one. Though we haven't done any tests to prove it is better.
So for example here are a couple of structures that come to mind
1) stocks
A B C D E .....
2) stocks
A B C-E ..... V-Z
3) stocks
A B ..
M
APPL Other MSFT
.. Others
4) stocks
Y1973_1980 Y1981_1985
A B C ... A B C ...
Etc,
Which is better really depends on the queries you do. If your queries
rarely involve date ranges, then 4 would not be an option. Why have
multiple hierarchies, Regina thinks it will minimize the number of tables
the planner will need to check with the assumption that
Since check constraints are inherited and you are say comparing APPL with
MSFT
The planner will need to make 1-26 checks and say 4 more checks after that
since its already discarded the other nodes by the first check. Because
there isn't any reason to check the B to .. Tables since their children
would have inherited the letter range constraint.
However if you have a lot more data for APPL and MSFT than you do for other
tables, then you save on having a smaller dataset that the planner finally
works with.
If your queries involve a lot of date ranges and you rarely touch older
years or your queries are such that you are always analyzing the same
periods, then partitioning by date range may make more sense.
Given the amount of data you have 80 million I recall. I think just
breaking up by letters A, B, C... Z with a decent speed box will probably
work fine for you and provide room for growth with minimal complexity.
The other benefit of separate tables is you can put them on different disks
and also back them up separately if needed. This holds true for both
inherited as well as non-inherited tables.
Leo and Regina
http://www.postgis.us
More information about the postgis-users
mailing list