[SAC] [support.osuosl.org #23649] [OSGeo] Failed disk in osgeo4.osuosl.bak

Justin Dugger via RT support at osuosl.org
Thu Apr 17 14:56:59 PDT 2014


I've received a pair of drives today marked ATTN: OSGEO. Let me know when you'd like to schedule a downtime and we'll get them installed.

Justin

On Fri Apr 11 14:29:56 2014, tech at wildintellect.com wrote:
> Justin,
> 
> Thanks, we suspected this when we did the battery replacement. Two
> drives have been ordered and should arrive early next week. Yes, we
> got a spare this time.
> 
> When they come in we should plan an outage window to turn off VMs to
> make the rebuild go faster.
> 
> Thanks,
> Alex
> 
> On 04/11/2014 01:08 PM, Justin Dugger via RT wrote:
> > Hey, an important followup!
> >
> > It appears that osgeo4 has lost another drive (this time in slot 0)
> > around 5:30 AM:
> >
> > 05:39 PROBLEM: osgeo4.osuosl.bak/Dell RAID Array is CRITICAL,
> > CRITICAL: 0:BBU Charged (100%):0:RAID-6:6 drives:557.75GB:Partially
> > Drives:6 1 Bad Drives (88 Errors), Apr 11, 12:39 UTC
> >
> > This will affect I/O performance in the obvious ways: any read
> > involving the affected disk will require reading the corresponding
> > blocks from all the other drives to calculate what the block should
> > be.
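The degraded-mode read cost described above comes down to parity reconstruction. A minimal Python sketch of the single-parity (XOR) case follows; this is illustrative only, not what the PERC firmware literally runs, and a RAID-6 array like this one also keeps a second, Reed-Solomon-style syndrome to survive double failures:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# A stripe across four drives: three data blocks plus one parity block.
data = [b"\x01\x02", b"\x0f\x00", b"\xaa\x55"]
parity = xor_blocks(data)

# "Drive 1" fails: every read of its block now has to touch all the
# surviving drives and XOR their blocks back together.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

Note that the reconstruction of one block reads a block from every other member of the stripe, which is why a single failed disk multiplies the I/O load instead of merely removing one spindle.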
> >
> > On Tue Apr 08 09:20:21 2014, jldugger wrote:
> >> And another followup to document the results of the repair:
> >>
> >> osgeo4 came back online complaining that one of its Power Supply
> >> Units has failed. It also took quite a while for the VM qgis to
> >> fsck, and that ended up requiring a manual fsck to repair.
> >>
> >> We've agreed to delay osgeo3's battery replacement until next week.
> >>
> >> On Mon Apr 07 11:30:29 2014, jldugger wrote:
> >>> Just to confirm/document what was discussed on IRC:
> >>>
> >>> The RAID array rebuilt last week, but we discovered the cause of
> >>> the low throughput: the RAID card on osgeo4 had detected a weak
> >>> battery state and transitioned to the slower, safer Write-Through
> >>> policy.
> >>>
> >>> We've received a pair of batteries and will be taking a planned
> >>> downtime to install them.
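On LSI/Dell PERC controllers the battery state and cache policy can be inspected from the OS with MegaCli. A sketch only: the binary name, install path, and exact flag spelling vary across MegaCli releases, so treat these invocations as assumptions to adapt:

```shell
# Assumed MegaCli invocations for an LSI/Dell PERC controller; flag
# spelling varies by MegaCli release.

# Show BBU state (a weak or charging battery forces Write-Through).
MegaCli -AdpBbuCmd -GetBbuStatus -aALL

# Show the current cache policy on all logical drives.
MegaCli -LDGetProp -Cache -LAll -aALL
```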
> >>>
> >>> On Thu Apr 03 09:25:58 2014, jldugger wrote:
> >>>> On Thu Apr 03 08:28:55 2014, ramereth wrote:
> >>>>> On Thu, Apr 3, 2014 at 12:04 AM, tech at wildintellect.com via RT <
> >>>>> support at osuosl.org> wrote:
> >>>>>
> >>>>>> Something seems amiss. The ProjectsVM stopped responding, with
> >>>>>> high disk latency and iowait (10-11pm PST).
> >>>>>
> >>>>> Rebuild Progress on Device at Enclosure 32, Slot 3 Completed
> >>>>> 82% in 200 Minutes.
> >>>>>
> >>>>> I've never seen a rebuild take this long before, but this
> >>>>> hardware is starting to show its age a little.
> >>>>
> >>>> The only time I've seen things go this slowly was the time I
> >>>> forgot to take our (very busy) FTP mirror out of rotation for the
> >>>> duration of a build. Under RAID 5, recalculating a block on the
> >>>> replacement drive requires reading in a block from all the other
> >>>> drives, so rebuilds can 'steal' a lot of I/O from a system that
> >>>> was already down one disk's worth of I/O requests per second.
> >>>> While you can sometimes tune the RAID firmware to rebuild at a
> >>>> lower priority, there's a balancing act between service latency
> >>>> and repairing the RAID array before a second drive fails.
> >>>>
> >>>> TL;DR: sorry this is taking so long; I didn't realize the
> >>>> services depending on it were quite so I/O bound.
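On PERC/LSI hardware the rebuild-priority knob mentioned above is exposed as a rebuild rate, a percentage of controller bandwidth. A sketch assuming the MegaCli tool is installed; exact path and flag spelling vary by release:

```shell
# Assumed MegaCli invocations; adjust for your MegaCli version.

# Show the current rebuild rate (the factory default is usually 30%).
MegaCli -AdpGetProp RebuildRate -aALL

# Drop it to 20% to favor foreground I/O. The rebuild then takes
# longer, widening the window in which a second drive failure is fatal.
MegaCli -AdpSetProp RebuildRate -20 -aALL
```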
> >>>
> >>>
> >>
> >>
> >
> >
> >
> > _______________________________________________
> > Sac mailing list
> > Sac at lists.osgeo.org
> > http://lists.osgeo.org/mailman/listinfo/sac
> >
> 




