[SAC] [support.osuosl.org #23649] [OSGeo] Failed disk in osgeo4.osuosl.bak

tech@wildintellect.com via RT support at osuosl.org
Fri Apr 11 14:29:56 PDT 2014


Justin,

Thanks, we suspected this when we did the battery replacement. 2 drives
have been ordered and should arrive early next week. Yes we got a spare
this time.

When they come in we should plan an outage window to turn off VMs to
make the rebuild go faster.

Thanks,
Alex

On 04/11/2014 01:08 PM, Justin Dugger via RT wrote:
> Hey, an important followup!
> 
> It appears that osgeo4 has lost another drive (this time in slot 0) around 5:30AM:
> 
> 05:39 PROBLEM: osgeo4.osuosl.bak/Dell RAID Array is CRITICAL, CRITICAL: 0:BBU Charged (100%):0:RAID-6:6 drives:557.75GB:Partially Drives:6 1 Bad Drives (88 Errors), Apr 11, 12:39 UTC
> 
> This will affect I/O performance in the obvious ways: any read involving
> the affected disk will require reading from all the other drives to
> calculate what the missing block should be.
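> 
> For the curious, a toy sketch of what that reconstruction looks like
> (illustrative Python only; it treats parity as a plain XOR over the
> data blocks, as in the single-parity case, but a single failed drive
> in RAID-6 can be rebuilt the same way):
> 
>     # The parity block is the XOR of all data blocks in a stripe, so
>     # the block on the failed drive is simply the XOR of every block
>     # that is still readable (surviving data blocks plus parity).
>     def reconstruct(surviving_blocks):
>         missing = bytearray(len(surviving_blocks[0]))
>         for block in surviving_blocks:
>             for i, byte in enumerate(block):
>                 missing[i] ^= byte
>         return bytes(missing)
> 
> Every such read costs a read on all the surviving drives, which is why
> latency will climb until the slot 0 drive is replaced and rebuilt.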
> 
> On Tue Apr 08 09:20:21 2014, jldugger wrote:
>> And another followup to document the results of the repair:
>>
>> osgeo4 came back online complaining that one of its power supply
>> units has failed. It also took quite a while for the qgis VM to fsck,
>> which in the end required a manual fsck to repair.
>>
>> We've agreed to delay osgeo3's battery replacement until next week.
>>
>> On Mon Apr 07 11:30:29 2014, jldugger wrote:
>>> Just to confirm/document what was discussed on IRC:
>>>
>>> The RAID array finished rebuilding last week, but we discovered that
>>> the cause of the low throughput was the RAID card on osgeo4: it had
>>> detected a weak battery state and transitioned to the slower, safer
>>> WriteThrough policy.
>>>
>>> We've received a pair of batteries and will be taking a planned
>>> downtime to install them.
>>>
>>> On Thu Apr 03 09:25:58 2014, jldugger wrote:
>>>> On Thu Apr 03 08:28:55 2014, ramereth wrote:
>>>>> On Thu, Apr 3, 2014 at 12:04 AM, tech at wildintellect.com via RT <
>>>>> support at osuosl.org> wrote:
>>>>>
>>>>>> Something seems amiss. The ProjectsVM stopped responding, with high
>>>>>> disk latency and iowait (10-11pm PST).
>>>>>
>>>>> Rebuild Progress on Device at Enclosure 32, Slot 3 Completed 82%
>>>>> in 200 Minutes.
>>>>>
>>>>> I've never seen a rebuild take this long before but this hardware
>>>>> is starting to show its age a little.
>>>>
>>>> The only time I've seen things go this slowly was the time I forgot
>>>> to take our (very busy) FTP mirror out of rotation for the duration
>>>> of a rebuild. Under RAID 5, recalculating a block on the replacement
>>>> drive requires reading a block from all the other drives, so rebuilds
>>>> can 'steal' a lot of I/O from a system that was already down one
>>>> disk's worth of I/O requests per second. While you can sometimes tune
>>>> the RAID firmware to rebuild at a lower priority, there's a balancing
>>>> act between service latency and repairing the RAID array before a
>>>> second drive fails.
>>>>
>>>> TL;DR: sorry this is taking so long; I didn't realize the services
>>>> depending on it were quite so I/O bound.
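>>>>
>>>> As a rough sense of scale, here is the arithmetic on the numbers
>>>> above (a sketch only; it uses single-parity math even though this
>>>> array is actually RAID-6):
>>>>
>>>>     # 82% complete after 200 minutes implies a total rebuild time of
>>>>     # roughly 200 / 0.82, i.e. about 244 minutes (~4 hours).
>>>>     total_minutes = 200 / 0.82
>>>>
>>>>     # With single parity across 6 drives, rebuilding one block means
>>>>     # reading the matching block from each of the 5 surviving drives,
>>>>     # which must keep serving normal client I/O at the same time.
>>>>     reads_per_rebuilt_block = 6 - 1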
>>>
>>>
>>
>>
> 
> 
> 
> _______________________________________________
> Sac mailing list
> Sac at lists.osgeo.org
> http://lists.osgeo.org/mailman/listinfo/sac
> 



