[SAC] [support.osuosl.org #23649] [OSGeo] Failed disk in osgeo4.osuosl.bak

Justin Dugger via RT support at osuosl.org
Tue Apr 8 09:20:22 PDT 2014


And another followup to document the results of the repair:

osgeo4 came back online complaining that one of its power supply units has failed. The qgis VM also took quite a while to fsck, and ended up requiring a manual fsck to repair.

We've agreed to delay osgeo3's battery replacement until next week.

On Mon Apr 07 11:30:29 2014, jldugger wrote:
> Just to confirm/document what was discussed on IRC:
> 
> The RAID array finished rebuilding last week, but we discovered the
> cause of the low throughput: the RAID card on osgeo4 detected a weak
> battery state and transitioned to the slower, safer WriteThrough policy.
> 
> We've received a pair of batteries and will be taking a planned
> downtime to install them.
> 
> On Thu Apr 03 09:25:58 2014, jldugger wrote:
> > On Thu Apr 03 08:28:55 2014, ramereth wrote:
> > > On Thu, Apr 3, 2014 at 12:04 AM, tech at wildintellect.com via RT <
> > > support at osuosl.org> wrote:
> > >
> > > > > Something seems amiss. The ProjectsVM stopped responding, high disk
> > > > > latency and iowait ( 10-11pm PST
> > >
> > > > Rebuild Progress on Device at Enclosure 32, Slot 3 Completed 82% in
> > > > 200 Minutes.
> > >
> > > > I've never seen a rebuild take this long before but this hardware is
> > > > starting to show its age a little.
> >
> > The only time I've seen things go this slowly was the time I forgot to
> > take our (very busy) FTP mirror out of rotation for the duration of a
> > rebuild. Under RAID 5, recalculating a block on the replacement drive
> > requires reading a block from each of the other drives. So rebuilds
> > can 'steal' a lot of I/O from a system that was already down one
> > disk's worth of I/O requests per second. While you can sometimes tune
> > the RAID firmware to rebuild at a lower priority, there's a balancing
> > act between service latency and repairing the RAID array before a
> > second drive fails.
> >
> > TL;DR: sorry this is taking so long; I didn't realize the services
> > depending on it were quite so I/O bound.
> 
> 
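
For anyone reading this ticket later, the RAID 5 point quoted above can be
illustrated with a small sketch. This is not what the RAID card firmware
actually runs; it only assumes the usual single-parity scheme where the parity
block is the XOR of the data blocks in a stripe. The 4-drive layout, the block
contents, and the helper name xor_blocks are all made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical stripe on a 4-drive RAID 5 set: three data blocks + one parity.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# Pretend the drive holding data[1] failed and was replaced.
# Rebuilding its block needs one read from every surviving drive.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
reads_needed = len(survivors)

print(rebuilt == data[1], reads_needed)
```

Every rebuilt block costs a read on each surviving drive, which is why a
rebuild competes so directly with normal service I/O on the same array.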




