[SAC] [support.osuosl.org #23649] [OSGeo] Failed disk in osgeo4.osuosl.bak
Justin Dugger via RT
support at osuosl.org
Fri Apr 11 13:08:59 PDT 2014
Hey, an important followup!
It appears that osgeo4 has lost another drive (this time in slot 0) around 5:30AM:
05:39 PROBLEM: osgeo4.osuosl.bak/Dell RAID Array is CRITICAL, CRITICAL: 0:BBU Charged (100%):0:RAID-6:6 drives:557.75GB:Partially Drives:6 1 Bad Drives (88 Errors), Apr 11, 12:39 UTC
This will degrade I/O performance in the usual way: any read that touches the failed disk requires reading the corresponding blocks from the other drives in the array to recalculate what the missing block should be.
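To make the cost of a degraded read concrete, here is a minimal sketch of parity reconstruction. It assumes simple XOR parity (as in RAID 5, or the first parity of RAID 6) and a made-up four-block stripe; the function names and data are hypothetical and have nothing to do with the actual controller firmware. RAID 6's second parity is not modeled.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct_missing(surviving_blocks):
    """Rebuild the one missing block of a stripe from all surviving blocks.

    With XOR parity, the missing block equals the XOR of every other
    block in the stripe (remaining data blocks plus the parity block).
    """
    return xor_blocks(surviving_blocks)


# Hypothetical stripe: four data blocks plus one XOR parity block.
data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb", b"\x0f\xf0"]
parity = xor_blocks(data)

# Pretend the drive holding data[0] has failed.
lost_index = 0
survivors = [blk for i, blk in enumerate(data) if i != lost_index] + [parity]

rebuilt = reconstruct_missing(survivors)
reads_needed = len(survivors)  # one read per surviving drive in the stripe
```

In this sketch a single logical read of the lost block turns into one read on every surviving drive in the stripe, which is where the extra latency comes from.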
On Tue Apr 08 09:20:21 2014, jldugger wrote:
> And another followup to document the results of the repair:
>
> osgeo4 came back online complaining that one of its Power Supply
> units has failed. It also took quite a while for the VM qgis to fsck,
> and that ended up requiring a manual fsck to repair.
>
> We've agreed to delay osgeo3's battery replacement until next week.
>
> On Mon Apr 07 11:30:29 2014, jldugger wrote:
> > Just to confirm/document what was discussed on IRC:
> >
> > The RAID array finished rebuilding last week, but we discovered that
> > the cause of the low throughput was that the RAID card on osgeo4
> > detected a weak battery state and transitioned to a slower, safer
> > WriteThrough policy.
> >
> > We've received a pair of batteries and will be taking a planned
> > downtime to install them.
> >
> > On Thu Apr 03 09:25:58 2014, jldugger wrote:
> > > On Thu Apr 03 08:28:55 2014, ramereth wrote:
> > > > On Thu, Apr 3, 2014 at 12:04 AM, tech at wildintellect.com via RT <
> > > > support at osuosl.org> wrote:
> > > >
> > > > > Something seems amiss. The ProjectsVM stopped responding, high disk
> > > > > latency and iowait ( 10-11pm PST
> > > >
> > > > Rebuild Progress on Device at Enclosure 32, Slot 3 Completed 82% in 200 Minutes.
> > > >
> > > > I've never seen a rebuild take this long before but this hardware is
> > > > starting to show its age a little.
> > >
> > > The only time I've seen things go this slowly was the time I forgot to
> > > take our (very busy) FTP mirror out of rotation for the duration of
> > > a rebuild. Under RAID 5, recalculating a block on the replacement
> > > drive requires reading a block from each of the other drives. So
> > > rebuilds can 'steal' a lot of I/O from a system that was already
> > > down one disk's worth of I/O requests per second. While you can
> > > sometimes tune the RAID firmware to rebuild at a lower priority,
> > > there's a balancing act between service latency and repairing the
> > > RAID array before a second drive fails.
> > >
> > > TL;DR: sorry this is taking so long; I didn't realize the services
> > > depending on it were quite so IO bound.
> >
> >
>
>