[SAC] [support.osuosl.org #23649] [OSGeo] Failed disk in osgeo4.osuosl.bak

Justin Dugger via RT support at osuosl.org
Mon Apr 21 10:35:17 PDT 2014


Looks like this disk rebuild finished moments ago, so your RAID array should be healthy and full of IOPS now. This should wrap up the disk replacements in osgeo3 for a while.

As a reminder, osgeo4 also has a failed Power Supply. 

Justin

On Fri Apr 18 10:41:17 2014, jldugger wrote:
> I'll mark it on my calendar then ;)
> 
> Justin
> 
> On Thu Apr 17 21:15:27 2014, tech at wildintellect.com wrote:
> > Friday 1pm PST?
> > 
> > Unless I hear screams from the community about some event happening,
> > let's plan for that. We'll also plan to shut down most if not all of
> > the VMs to make it go faster.
> > 
> > Thanks,
> > Alex
> > 
> > On 04/17/2014 02:56 PM, Justin Dugger via RT wrote:
> > > I've received a pair of drives today ATTN: OSGEO. Let me know when
> > > you'd like to take a downtime and we'll get that in.
> > >
> > > Justin
> > >
> > > On Fri Apr 11 14:29:56 2014, tech at wildintellect.com wrote:
> > >> Justin,
> > >>
> > >> Thanks, we suspected this when we did the battery replacement. 2
> > >> drives have been ordered and should arrive early next week. Yes, we
> > >> got a spare this time.
> > >>
> > >> When it comes in we should plan an outage window to turn off VMs to
> > >> make the rebuild go faster.
> > >>
> > >> Thanks,
> > >> Alex
> > >>
> > >> On 04/11/2014 01:08 PM, Justin Dugger via RT wrote:
> > >>> Hey, an important followup!
> > >>>
> > >>> It appears that osgeo4 has lost another drive (this time in slot 0)
> > >>> around 5:30AM:
> > >>>
> > >>> 05:39 PROBLEM: osgeo4.osuosl.bak/Dell RAID Array is CRITICAL,
> > >> CRITICAL: 0:BBU Charged (100%):0:RAID-6:6 drives:557.75GB:Partially
> > >> Drives:6 1 Bad Drives (88 Errors), Apr 11, 12:39 UTC
> > >>>
> > >>> This will affect I/O performance in the obvious ways: any read
> > >>> involving the affected disk will require reading all other volumes
> > >>> to calculate what the block should be.
> > >>>
> > >>> On Tue Apr 08 09:20:21 2014, jldugger wrote:
> > >>>> And another followup to document the results of the repair:
> > >>>>
> > >>>> osgeo4 came back online complaining that one of its Power Supply
> > >>>> units has failed. It also took quite a while for the VM qgis to
> > >>>> fsck, and that ended up requiring a manual fsck to repair.
> > >>>>
> > >>>> We've agreed to delay osgeo3's battery replacement until next week.
> > >>>>
> > >>>> On Mon Apr 07 11:30:29 2014, jldugger wrote:
> > >>>>> Just to confirm/document what was discussed on IRC:
> > >>>>>
> > >>>>> The RAID array was rebuilt last week, but we discovered the cause
> > >>>>> of the low throughput was the RAID card on osgeo4 detected a weak
> > >>>>> battery state and transitioned to a slower, safer WriteThrough
> > >>>>> policy.
> > >>>>>
> > >>>>> We've received a pair of batteries and will be taking a planned
> > >>>>> downtime to install them.
> > >>>>>
> > >>>>> On Thu Apr 03 09:25:58 2014, jldugger wrote:
> > >>>>>> On Thu Apr 03 08:28:55 2014, ramereth wrote:
> > >>>>>>> On Thu, Apr 3, 2014 at 12:04 AM, tech at wildintellect.com via RT
> > <
> > >>>>>>> support at osuosl.org> wrote:
> > >>>>>>>
> > >>>>>>>> Something seems amiss. The ProjectsVM stopped responding, high
> > >>>>>>>> disk latency and iowait ( 10-11pm PST
> > >>>>>>>
> > >>>>>>> Rebuild Progress on Device at Enclosure 32, Slot 3 Completed 82%
> > >>>>>>> in 200 Minutes.
> > >>>>>>>
> > >>>>>>> I've never seen a rebuild take this long before but this hardware
> > >>>>>>> is starting to show its age a little.
> > >>>>>>
> > >>>>>> The only time I've seen things go this slowly was the time I forgot
> > >>>>>> to take our (very busy) FTP mirror out of rotation for the
> > >>>>>> duration of a rebuild. Under RAID 5, recalculating a block on the
> > >>>>>> replacement drive requires reading a block from all the other
> > >>>>>> drives. So rebuilds can 'steal' a lot of I/O from a system that
> > >>>>>> was already down 1 disk worth of I/O requests per second. While
> > >>>>>> you can sometimes tune the RAID firmware to rebuild at a lower
> > >>>>>> priority, there's a balancing act between service latency and
> > >>>>>> repairing the RAID array before a second drive fails.
> > >>>>>>
> > >>>>>> TL;DR: sorry this is taking so long; I didn't realize the services
> > >>>>>> depending on it were quite so IO bound.
> > >>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>
> > >>>
> > >>>
> > >>> _______________________________________________
> > >>> Sac mailing list
> > >>> Sac at lists.osgeo.org
> > >>> http://lists.osgeo.org/mailman/listinfo/sac
> > >>>
> > >>
> > >
> > >
> > >
> > >
> > 
> 
> 
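
The thread notes that, with a drive missing, any read involving the affected disk requires reading all the other drives to calculate what the block should be, and that a RAID 5 rebuild does the same for every block on the replacement drive. A minimal sketch of that idea follows, assuming simple single-parity XOR as in RAID 5. The block contents and names here are made up for illustration and are not taken from the osgeo4 controller. The array in the alert is reported as RAID-6, which keeps a second, differently computed parity block, so this is a simplification.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical stripe: three data blocks plus one parity block.
data = [b"osge", b"o4-r", b"aid!"]
parity = xor_blocks(data)

# Simulate losing the drive that held data[1]. To serve a read of that
# block, every surviving block in the stripe (data and parity) is read.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
reads_needed = len(survivors)

print(rebuilt, reads_needed)
```

One logical read of the missing block turns into a read on every surviving drive in the stripe, which is why a degraded or rebuilding array has less I/O capacity left for the VMs.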





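
The controller message quoted in the thread reports the rebuild at 82% after 200 minutes. A rough way to turn such a progress line into a time estimate is sketched below. It assumes the rebuild proceeds at a constant rate, which is unlikely to hold exactly on a busy host, and the function name is hypothetical.

```python
def estimate_rebuild(percent_done, minutes_elapsed):
    """Return (total_minutes, remaining_minutes), assuming a constant rate."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    total = minutes_elapsed * 100.0 / percent_done
    return total, total - minutes_elapsed


# Figures from the progress message quoted in the thread.
total, remaining = estimate_rebuild(82, 200)
print(f"about {total:.0f} minutes total, about {remaining:.0f} remaining")
```

Under that assumption the rebuild would take roughly 244 minutes in total, leaving about 44 minutes at the time of the message.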