[SAC] [support.osuosl.org #23649] [OSGeo] Failed disk in osgeo4.osuosl.bak

Justin Dugger via RT support at osuosl.org
Fri Apr 11 13:08:59 PDT 2014


Hey, an important followup!

It appears that osgeo4 has lost another drive (this time in slot 0) around 5:30AM:

05:39 PROBLEM: osgeo4.osuosl.bak/Dell RAID Array is CRITICAL, CRITICAL: 0:BBU Charged (100%):0:RAID-6:6 drives:557.75GB:Partially Drives:6 1 Bad Drives (88 Errors), Apr 11, 12:39 UTC

This will degrade I/O performance in the meantime: any read that touches the failed disk now requires reading the corresponding blocks on every surviving drive in the array to reconstruct the missing data.
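To illustrate why degraded reads are expensive: a minimal sketch of parity reconstruction. The array here is RAID-6 (two parity blocks per stripe), but the principle is easiest to show with single-parity RAID-5, where parity is a plain XOR of the data blocks. The block contents below are made up for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks))

# One stripe of a 4-drive RAID-5 array: three data blocks plus parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Suppose the drive holding data[1] fails. Serving a read of that block
# now means reading every surviving drive in the stripe and XORing --
# those extra reads are where the I/O penalty comes from.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == b"BBBB"
```

A rebuild is the same computation repeated for every stripe on the replacement drive, which is why it competes so heavily with normal traffic.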

On Tue Apr 08 09:20:21 2014, jldugger wrote:
> And another followup to document the results of the repair:
> 
> osgeo4 came back online complaining that one of its power supply
> units has failed. It also took quite a while for the VM qgis to fsck,
> and that ended up requiring a manual fsck to repair.
> 
> We've agreed to delay osgeo3's battery replacement until next week.
> 
> On Mon Apr 07 11:30:29 2014, jldugger wrote:
> > Just to confirm/document what was discussed on IRC:
> >
> > The RAID array rebuilt last week, but we discovered the cause of the
> > low throughput: the RAID card on osgeo4 detected a weak battery
> > state and transitioned to the slower, safer WriteThrough policy.
> >
> > We've received a pair of batteries and will be taking a planned
> > downtime to install them.
> >
> > On Thu Apr 03 09:25:58 2014, jldugger wrote:
> > > On Thu Apr 03 08:28:55 2014, ramereth wrote:
> > > > On Thu, Apr 3, 2014 at 12:04 AM, tech at wildintellect.com via RT <
> > > > support at osuosl.org> wrote:
> > > >
> > > > > Something seems amiss. The ProjectsVM stopped responding, high
> > > > > disk latency and iowait ( 10-11pm PST
> > > >
> > > > Rebuild Progress on Device at Enclosure 32, Slot 3 Completed 82%
> > > > in 200 Minutes.
> > > >
> > > > I've never seen a rebuild take this long before, but this
> > > > hardware is starting to show its age a little.
> > >
> > > The only time I've seen things go this slowly was the time I
> > > forgot to take our (very busy) FTP mirror out of rotation for the
> > > duration of a rebuild. Under RAID 5, recalculating a block on the
> > > replacement drive requires reading a block from all the other
> > > drives, so rebuilds can 'steal' a lot of I/O from a system that
> > > was already down one disk's worth of I/O requests per second.
> > > While you can sometimes tune the RAID firmware to rebuild at a
> > > lower priority, there's a balancing act between service latency
> > > and repairing the RAID array before a second drive fails.
> > >
> > > TL;DR: sorry this is taking so long; I didn't realize the services
> > > depending on it were quite so I/O bound.