[postgis-users] splitting a PostGIS db across multiple disks

Brian Modra brian at zwartberg.com
Tue Apr 6 13:12:15 PDT 2010


On 06/04/2010, P Kishor <punk.kish at gmail.com> wrote:
> On Tue, Apr 6, 2010 at 10:22 AM, Chris Hermansen
> <chris.hermansen at timberline.ca> wrote:
>> What about using RAID?  No data reorganization necessary.
>>
>
> I am already using a RAID, and conceptually, a RAID is the same as a
> single disk. My issue arises when I run out of space on RAID as well.
> While I could extend the RAID, and theoretically never have any space
> limitation, I would still have to use TABLESPACES, just so I can store
> part of the db in another location on the expanded RAID space, and
> also have to utilize table partitioning, because, at the db level, I
> am dealing with a single table that is growing beyond what is
> available on the RAID.
>
> I am assuming that adding more disks to a RAID is not a simple task,
> because the RAID set has to be migrated to incorporate the newly added
> disk... I tried a RAID migration yesterday, and it takes a long time.
> So, tablespaces are essential. Table partitioning would allow me to
> deal with a single conceptual table, so my application would not be
> riddled with junky logic (if querying this geog. area, query that
> table, if querying that geog. area, query that other table and such
> nonsense).
>
> Wrt the dataset itself... it is all point data. My current estimate is
> that the raw dataset is going to be about 125 billion rows in a table.
> In terms of disk space, it will be about 4 TB raw, that is, before it
> goes into the db. Indexes and other db overhead might 1.5x or 2x it,
> so, say, max 8 TB.

This may cause a lot of discussion... but in my opinion (though I
posted a few minutes ago answering a comment about RAID) ... my
opinion is that if you can design your own software, then design out
RAID.

RAID is required because software needs lots of space, and disks are
not big enough. Its a compromise. Then add more disks, you increase
the probability of failure, so you have to build in redundancy... you
need expensive hardware RAID controllers, you need 3U or bigger
servers... cost goes up...

If you can split your dataset up logically, across multiple disks, and
(even better) multiple hosts... then it will be faster, you isolate
the chance of failure to just one section of your service (rather than
the whole service), and you can use less expensive servers, or get a
blade cabinet and really squeeze them in to use less server room space
and therefore less rental costs.

But... this assumes that you are at the point where you can design
your own software.
I've been in this position (being able to design my own PostGis -
based service) and this is how I did it, and how I'd do it again
(except if anything, more so).

It makes sense to me to use small servers, (relatively) small disks,
and smaller service applications.

>
>> http://wiki.postgresql.org/wiki/Community_Disk_Tuning_Guide
>>
>> As the above notes, disk performance (eg spindle speed) is important.
>>
>> Also, to get good performance, you need to understand your application's
>> performance needs.  Do you have a lot of small queries, etc etc.  If you
>> don't understand your application's performance needs, then you may be
>> better off running your application for awhile and measuring its
>> bottlenecks before you decide on a storage strategy.
>>
>> P Kishor wrote:
>>> I asked this question yesterday, and received a very helpful pointer
>>> from Ben Madin re. TABLESPACES. As noted in my reply in that thread, I
>>> am also investigating the possibility of splitting a single table
>>> across multiple disks.
>>>
>>> However, I am going to post this question in a different way in this new
>>> thread.
>>>
>>> Suppose I have a table FOO0 that stores info about every state in the
>>> union. I know that some of these states will have mongo number of
>>> rows, but I don't have to build all the states immediately. So, I
>>> start with a few states' worth data, putting it in the default
>>> /usr/local/pgsql/data location.
>>>
>>> Then I start outgrowing that disk, and need to add another state, so I
>>> add another disk, create a new tablespace, and create a new table
>>> called FOO1 in this new tablespace. Then I can store the new states in
>>> FOO1. As long as I break up my table into FOO0, FOO1, FOO2, and so on,
>>> I can store each FOOn in a new tablespace. And, as long as I ensure
>>> that each FOOn table contains a geographically consistent spatial
>>> extent, I can build logic in my application to query the correct
>>> table.
>>>
>>> So, lets say 0 lon to -10 lon data are stored in FOO0, and -10 lon to
>>> -20 lon in FOO1, then if the user requests data for -5 lon to -15 lon,
>>> I will have to query both FOO0 and FOO1.
>>>
>>> More work for me, but it is doable, no? Any insights on how to handle
>>> something like this?
>>>
>>> A corollary question -- are their any speed advantages to actually
>>> creating multiple PostGIS instances, perhaps even splitting them
>>> across multiple machines? Of course, it is going to be a pain in the
>>> ass for me to maintain more than one instance of PostGres/PostGIS, so
>>> I am not thrilled at that possibility. I'd rather have a single
>>> instance just be managing data across multiple locations as required.
>>>
>>>
>>>
>>
>> --
>> Regards,
>>
>> Chris Hermansen    ·    mailto:chris.hermansen at timberline.ca
>> tel+1.604.714.2878 · fax+1.604.733.0631 · mob+1.778.840.4625
>> Timberline Natural Resource Group · http://www.timberline.ca
>> 401 · 958 West 8th Avenue  · Vancouver BC · Canada · V5Z 1E5
>>
>> _______________________________________________
>> postgis-users mailing list
>> postgis-users at postgis.refractions.net
>> http://postgis.refractions.net/mailman/listinfo/postgis-users
>>
>
>
>
> --
> Puneet Kishor http://www.punkish.org
> Carbon Model http://carbonmodel.org
> Charter Member, Open Source Geospatial Foundation http://www.osgeo.org
> Science Commons Fellow, http://sciencecommons.org/about/whoweare/kishor
> Nelson Institute, UW-Madison http://www.nelson.wisc.edu
> -----------------------------------------------------------------------
> Assertions are politics; backing up assertions with evidence is science
> =======================================================================
> _______________________________________________
> postgis-users mailing list
> postgis-users at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-users
>


-- 
Brian Modra   Land line: +27 23 5411 462
Mobile: +27 79 69 77 082
5 Jan Louw Str, Prince Albert, 6930
Postal: P.O. Box 2, Prince Albert 6930
South Africa
http://www.zwartberg.com/



More information about the postgis-users mailing list