[SAC] [support.osuosl.org #23649] [OSGeo] Failed disk in osgeo4.osuosl.bak

Justin Dugger via RT support at osuosl.org
Fri Apr 18 10:41:17 PDT 2014


I'll mark it on my calendar then ;)

Justin

On Thu Apr 17 21:15:27 2014, tech at wildintellect.com wrote:
> Friday 1pm PST?
> 
> Unless I hear screams from the community about some event happening,
> let's plan for that. We'll also plan to shut down most if not all of
> the VMs to make it go faster.
> 
> Thanks,
> Alex
> 
> On 04/17/2014 02:56 PM, Justin Dugger via RT wrote:
> > I've received a pair of drives today ATTN: OSGEO. Let me know when
> > you'd like to take a downtime and we'll get that in.
> >
> > Justin
> >
> > On Fri Apr 11 14:29:56 2014, tech at wildintellect.com wrote:
> >> Justin,
> >>
> >> Thanks, we suspected this when we did the battery replacement. 2 drives
> >> have been ordered and should arrive early next week. Yes, we got a spare
> >> this time.
> >>
> >> When they come in we should plan an outage window to turn off VMs to
> >> make the rebuild go faster.
> >>
> >> Thanks,
> >> Alex
> >>
> >> On 04/11/2014 01:08 PM, Justin Dugger via RT wrote:
> >>> Hey, an important followup!
> >>>
> >>> It appears that osgeo4 has lost another drive (this time in slot 0)
> >>> around 5:30AM:
> >>>
> >>> 05:39 PROBLEM: osgeo4.osuosl.bak/Dell RAID Array is CRITICAL,
> >>> CRITICAL: 0:BBU Charged (100%):0:RAID-6:6 drives:557.75GB:Partially
> >>> Drives:6 1 Bad Drives (88 Errors), Apr 11, 12:39 UTC
> >>>
> >>> This will affect I/O performance in the obvious ways: any read
> >>> involving the affected disk will require reading the corresponding
> >>> blocks from the other drives to calculate what the block should be.
> >>>
> >>> On Tue Apr 08 09:20:21 2014, jldugger wrote:
> >>>> And another followup to document the results of the repair:
> >>>>
> >>>> osgeo4 came back online complaining that one of its Power Supply
> >>>> units has failed. It also took quite a while for the VM qgis to fsck,
> >>>> and that ended up requiring a manual fsck to repair.
> >>>>
> >>>> We've agreed to delay osgeo3's battery replacement until next week.
> >>>>
> >>>> On Mon Apr 07 11:30:29 2014, jldugger wrote:
> >>>>> Just to confirm/document what was discussed on IRC:
> >>>>>
> >>>>> The RAID array rebuilt last week, but we discovered the cause of
> >>>>> the low throughput was that the RAID card on osgeo4 detected a weak
> >>>>> battery state and transitioned to a slower, safer WriteThrough
> >>>>> policy.
> >>>>>
> >>>>> We've received a pair of batteries and will be taking a planned
> >>>>> downtime to install them.
> >>>>>
> >>>>> On Thu Apr 03 09:25:58 2014, jldugger wrote:
> >>>>>> On Thu Apr 03 08:28:55 2014, ramereth wrote:
> >>>>>>> On Thu, Apr 3, 2014 at 12:04 AM, tech at wildintellect.com via RT
> >>>>>>> <support at osuosl.org> wrote:
> >>>>>>>
> >>>>>>>> Something seems amiss. The ProjectsVM stopped responding, high disk
> >>>>>>>> latency and iowait ( 10-11pm PST
> >>>>>>>
> >>>>>>> Rebuild Progress on Device at Enclosure 32, Slot 3 Completed 82% in
> >>>>>>> 200 Minutes.
> >>>>>>>
> >>>>>>> I've never seen a rebuild take this long before but this hardware is
> >>>>>>> starting to show its age a little.
> >>>>>>
> >>>>>> The only time I've seen things go this slowly was the time I forgot to
> >>>>>> take our (very busy) FTP mirror out of rotation for the duration of
> >>>>>> a rebuild. Under RAID 5, recalculating a block on the replacement
> >>>>>> drive requires reading a block from each of the other drives. So
> >>>>>> rebuilds can 'steal' a lot of I/O from a system that was already
> >>>>>> down 1 disk worth of I/O requests per second. While you can
> >>>>>> sometimes tune the RAID firmware to rebuild at a lower priority,
> >>>>>> there's a balancing act between service latency and repairing the
> >>>>>> RAID array before a second drive fails.
> >>>>>>
> >>>>>> TL;DR: sorry this is taking so long; I didn't realize the services
> >>>>>> depending on it were quite so IO bound.
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> Sac mailing list
> >>> Sac at lists.osgeo.org
> >>> http://lists.osgeo.org/mailman/listinfo/sac
> >>>
> >>
> >
> >
> >
> >
> 
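
For anyone following along, the RAID 5 rebuild cost described further down the thread can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not what the RAID firmware actually runs: the helper name xor_blocks, the 4-drive stripe, and the block contents are all made up for the example. It covers only the single-parity (RAID 5 style) case; RAID 6 adds a second, differently computed parity block that this sketch does not model.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical 4-drive stripe: three data blocks plus one parity block.
data = [b"osge", b"o4 r", b"aid!"]
parity = xor_blocks(data)

# Simulate losing the drive holding data[1], then recompute its block.
# Every surviving drive in the stripe has to be read to do this.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
reads_needed = len(survivors)
```

Each rebuilt block costs one read on every surviving drive, which is why a rebuild competes so heavily with normal service I/O on a busy array.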




