[SAC] [Hosting] Core network switch reboot

Lance Albertson lance at osuosl.org
Tue Sep 13 15:10:45 PDT 2022


This has been completed and everything seems to be working fine.

Now keep in mind, the troublesome switch could reboot again until we figure
out why it's happening. If it does, its impact should at least be smaller
than before.

Thanks!

On Tue, Sep 13, 2022 at 2:42 PM Lance Albertson <lance at osuosl.org> wrote:

> I have the "new" switch set up and ready to go. I'm currently planning on
> doing this cutover in about 20 minutes (3pm PDT). You will see a set of
> outages as I plan to do the following:
>
> 1. Move LinkOregon uplink to "new" switch
> 2. Move oslsw3 uplink to "new" switch
> 3. Move oslsw1 uplink to "new" switch
> 4. Move remaining backend 10g switches
>
> If anything goes wrong, I should be able to quickly revert the change.
>
> On Tue, Sep 13, 2022 at 11:08 AM Lance Albertson <lance at osuosl.org> wrote:
>
>> All,
>>
>> I wanted to pass along more information on where we're at and our current
>> plans to try and work around this issue.
>>
>> Without going deep into the history of our core network infrastructure,
>> we have two core "routers" that are both aging and we're in the process of
>> replacing them with something newer.
>>
>> Previously, our uplink was connected through our Cisco 6509. This switch
>> has several 1G line cards that half of our servers are directly connected
>> to.
>>
>> The other core switch is a Cisco Nexus 6001 with three fabric extenders
>> (FEX) that provide 1G connectivity to the other half of our servers.
>> When we migrated over to the LinkOregon network, we moved the uplink over
>> to this Nexus 6k as it was much easier to get LR optics for it.
>>
>> Unfortunately, this Nexus 6k has kernel panicked and rebooted multiple
>> times over the past several months, causing these outages. Most of our
>> downlink 10G switches are connected to this Nexus 6k, which means there's a
>> larger impact when it goes down.
>>
>> A few years ago a high-speed trading company donated a pallet full of
>> Arista switches to us, and I've been slowly adding them to our
>> infrastructure. Even though they are EOL, they still work very well and we
>> haven't had any problems with them. And since I have a lot of them, I can
>> easily replace one if it goes bad.
>>
>> My current plan is to set up one of these Arista switches and move all of
>> the current 10G connections to it. This way, at least we can reduce the
>> impact if/when this Nexus 6k switch reboots again. In theory, a reboot
>> should then only affect the servers directly connected to the FEX switches.
>>
>> I reached out to the OSU IT community and they graciously donated two
>> 10G-LR optical modules so that I can put this plan in place without having
>> to wait to ship modules.
>>
>> Current plan for today:
>> - Set up new Arista switch
>> - Move upstream connectivity to LinkOregon to it
>> - Move all downstream 10G links to this router
>>
>> I will send another email when I plan to do the actual outages for the
>> cut over.
>>
>> Longer term plans:
>> - Work with vendors to replace our aging core network infrastructure with
>> something that's still supported and we can afford
>> - Look into getting redundancy put into place so that we don't have this
>> issue anymore
>> - Migrate off of the older equipment
>>
>> If anyone on this list has connections to Arista or any other major edge
>> networking vendor, please let me know. That will certainly help our
>> situation in the long term!
>>
>> I had already started working on a plan to replace these systems but it
>> seems my time may have run out (at least for the Nexus 6k switch).
>>
>> Thanks all for your patience and support!
>>
>> On Mon, Sep 12, 2022 at 11:54 PM Lance Albertson <lance at osuosl.org>
>> wrote:
>>
>>> Sadly this just happened again about 50 minutes ago. We may need to do
>>> some emergency firmware patching tomorrow. As a backup, I'm also
>>> formulating a plan to add another switch to try and minimize the impact of
>>> this troublesome switch.
>>>
>>> Once I gather some additional information tomorrow morning, I'll send an
>>> update on what we're planning to do.
>>>
>>> Thanks again for your patience.
>>>
>>> On Mon, Sep 12, 2022 at 3:14 PM Lance Albertson <lance at osuosl.org>
>>> wrote:
>>>
>>>> This happened again at approximately 10AM PDT. Since we moved our
>>>> uplink to this switch, everything went down while the switch rebooted.
>>>>
>>>> We're still planning on doing an upgrade but don't have a date yet for
>>>> that. We'll hopefully get that going soon.
>>>>
>>>> Thanks for your patience.
>>>>
>>>> On Wed, Aug 24, 2022 at 7:40 AM Lance Albertson <lance at osuosl.org>
>>>> wrote:
>>>>
>>>>> Unfortunately this just happened again overnight. We may need to
>>>>> schedule another outage to perform a software upgrade on this switch so
>>>>> that this stops happening. We'll send an announcement out once we have
>>>>> everything in place to do that upgrade.
>>>>>
>>>>> Thanks-
>>>>>
>>>>> On Wed, May 25, 2022 at 11:22 PM Lance Albertson <lance at osuosl.org>
>>>>> wrote:
>>>>>
>>>>>> All,
>>>>>>
>>>>>> It appears that one of our core network switches had a kernel panic
>>>>>> and rebooted which caused widespread outages throughout our infrastructure.
>>>>>> As of right now, everything appears to be back to normal but please let me
>>>>>> know if that isn't the case by sending an email to support at osuosl.org.
>>>>>>
>>>>>> Apologies for the outage and we'll be looking into why this switch
>>>>>> had a kernel panic in the first place.
>>>>>>
>>>>>> Thanks-
>>>>>>
>>>>>
>> --
>> Lance Albertson
>> Director
>> Oregon State University | Open Source Lab
>>
>
>
> --
> Lance Albertson
> Director
> Oregon State University | Open Source Lab
>


-- 
Lance Albertson
Director
Oregon State University | Open Source Lab