[SAC] [support.osuosl.org #23649] [OSGeo] Failed disk in osgeo4.osuosl.bak

Justin Dugger via RT support at osuosl.org
Wed Apr 23 17:09:53 PDT 2014


Resolving ticket.

On Mon Apr 21 10:35:17 2014, jldugger wrote:
> Looks like this disk rebuild finished moments ago, so your RAID array
> should be healthy and full of IOPS now. This should wrap up the disk
> replacements in osgeo3 for a while.
> 
> As a reminder, osgeo4 also has a failed Power Supply.
> 
> Justin
> 
> On Fri Apr 18 10:41:17 2014, jldugger wrote:
> > I'll mark it on my calendar then ;)
> >
> > Justin
> >
> > On Thu Apr 17 21:15:27 2014, tech at wildintellect.com wrote:
> > > Friday 1pm PST?
> > >
> > > Unless I hear screams from the community about some event happening,
> > > let's plan for that. We'll also plan to shut down most if not all of
> > > the VMs to make it go faster.
> > >
> > > Thanks,
> > > Alex
> > >
> > > On 04/17/2014 02:56 PM, Justin Dugger via RT wrote:
> > > > I've received a pair of drives today ATTN: OSGEO. Let me know when
> > > > you'd like to take a downtime and we'll get that in.
> > > >
> > > > Justin
> > > >
> > > > On Fri Apr 11 14:29:56 2014, tech at wildintellect.com wrote:
> > > >> Justin,
> > > >>
> > > > >> Thanks, we suspected this when we did the battery replacement. 2
> > > > >> drives have been ordered and should arrive early next week. Yes we
> > > > >> got a spare this time.
> > > > >>
> > > > >> When it comes in we should plan an outage window to turn off VMs to
> > > > >> make the rebuild go faster.
> > > >>
> > > >> Thanks,
> > > >> Alex
> > > >>
> > > >> On 04/11/2014 01:08 PM, Justin Dugger via RT wrote:
> > > >>> Hey, an important followup!
> > > >>>
> > > > >>> It appears that osgeo4 has lost another drive (this time in slot 0)
> > > > >>> around 5:30AM:
> > > > >>>
> > > > >>> 05:39 PROBLEM: osgeo4.osuosl.bak/Dell RAID Array is CRITICAL, CRITICAL: 0:BBU Charged (100%):0:RAID-6:6 drives:557.75GB:Partially Drives:6 1 Bad Drives (88 Errors), Apr 11, 12:39 UTC
> > > > >>>
> > > > >>> This will affect I/O performance in the obvious ways: any read
> > > > >>> involving the affected disk will require reading all other volumes to
> > > > >>> calculate what the block should be.
> > > >>>
> > > >>> On Tue Apr 08 09:20:21 2014, jldugger wrote:
> > > >>>> And another followup to document the results of the repair:
> > > >>>>
> > > > >>>> osgeo4 came back online complaining that one of its Power Supply
> > > > >>>> units has failed. It also took quite a while for the VM qgis to fsck,
> > > > >>>> and that ended up requiring a manual fsck to repair.
> > > > >>>>
> > > > >>>> We've agreed to delay osgeo3's battery replacement until next week.
> > > >>>>
> > > >>>> On Mon Apr 07 11:30:29 2014, jldugger wrote:
> > > >>>>> Just to confirm/document what was discussed on IRC:
> > > >>>>>
> > > > >>>>> The RAID array rebuilt last week, but we discovered the cause of the
> > > > >>>>> low throughput was the RAID card on osgeo4 detected a weak battery
> > > > >>>>> state and transitioned to a slower, safer WriteThrough policy.
> > > > >>>>>
> > > > >>>>> We've received a pair of batteries and will be taking a planned
> > > > >>>>> downtime to install them.
> > > >>>>>
> > > >>>>> On Thu Apr 03 09:25:58 2014, jldugger wrote:
> > > >>>>>> On Thu Apr 03 08:28:55 2014, ramereth wrote:
> > > > >>>>>>> On Thu, Apr 3, 2014 at 12:04 AM, tech at wildintellect.com via RT
> > > > >>>>>>> <support at osuosl.org> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Something seems amiss. The ProjectsVM stopped responding, high disk
> > > > >>>>>>>> latency and iowait ( 10-11pm PST
> > > >>>>>>>
> > > > >>>>>>> Rebuild Progress on Device at Enclosure 32, Slot 3 Completed 82% in
> > > > >>>>>>> 200 Minutes.
> > > > >>>>>>>
> > > > >>>>>>> I've never seen a rebuild take this long before but this hardware is
> > > > >>>>>>> starting to show its age a little.
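
For a rough sense of scale, the progress figure quoted above (82% in 200
minutes) can be extrapolated to a remaining time. This is only a sketch: it
assumes the rebuild proceeds at a constant rate, which a controller under
varying VM load is unlikely to do, and the function name is made up for
illustration.

```python
def estimate_remaining_minutes(percent_done: float, elapsed_minutes: float) -> float:
    """Linearly extrapolate how many minutes of rebuild are left.

    Assumes a constant rebuild rate (an assumption, not something the
    controller guarantees).
    """
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    # Projected total duration if the rate so far holds.
    total_minutes = elapsed_minutes * 100.0 / percent_done
    return total_minutes - elapsed_minutes


# Figures taken from the controller message: 82% completed in 200 minutes.
remaining = estimate_remaining_minutes(82, 200)
print(f"roughly {remaining:.0f} more minutes at the current rate")
```

Under that assumption the whole rebuild would come to a little over four
hours.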
> > > >>>>>>
> > > > >>>>>> The only time I've seen things go this slowly was the time I forgot to
> > > > >>>>>> take our (very busy) FTP mirror out of rotation for the duration of a
> > > > >>>>>> rebuild. Under RAID 5, recalculating a block on the replacement drive
> > > > >>>>>> requires reading a block from each of the other drives. So rebuilds can
> > > > >>>>>> 'steal' a lot of I/O from a system that was already down 1 disk worth
> > > > >>>>>> of I/O requests per second. While you can sometimes tune the RAID
> > > > >>>>>> firmware to rebuild at a lower priority, there's a balancing act
> > > > >>>>>> between service latency and repairing the RAID array before a second
> > > > >>>>>> drive fails.
> > > > >>>>>>
> > > > >>>>>> TL;DR: sorry this is taking so long; I didn't realize the services
> > > > >>>>>> depending on it were quite so IO bound.
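
The RAID 5 point above (rebuilding one block means reading a block from every
other drive) can be illustrated with plain XOR parity. This is a toy sketch,
not how the Dell controller firmware is implemented: the stripe contents, the
four-drive layout, and the helper name are all made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical 4-drive RAID 5: three data blocks plus parity.
data = [b"osge", b"o4.o", b"suos"]
parity = xor_blocks(data)

# Simulate losing the drive that held data[1]. Rebuilding its block needs
# one read from every surviving drive in the stripe, parity included.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
reads_needed = len(survivors)

assert rebuilt == data[1]
print(f"rebuilt {rebuilt!r} using {reads_needed} reads")
```

Every rebuilt stripe costs one read on each surviving drive, which is why a
rebuild competes so directly with normal VM traffic.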
> > > >>>>>
> > > >>>>>
> > > >>>>
> > > >>>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> _______________________________________________
> > > >>> Sac mailing list
> > > >>> Sac at lists.osgeo.org
> > > >>> http://lists.osgeo.org/mailman/listinfo/sac
> > > >>>
> > > >>
> > > >
> > > >
> > > >
> > > > _______________________________________________
> > > > Sac mailing list
> > > > Sac at lists.osgeo.org
> > > > http://lists.osgeo.org/mailman/listinfo/sac
> > > >
> > >
> >
> >
> 
> 




