[SAC] [Hosting] Core network switch reboot

Lance Albertson lance at osuosl.org
Tue Sep 13 14:42:41 PDT 2022


I have the "new" switch set up and ready to go. I'm currently planning on
doing the cutover in about 20 minutes (3pm PDT). You will see a series of
outages as I plan to do the following:

1. Move LinkOregon uplink to "new" switch
2. Move oslsw3 uplink to "new" switch
3. Move oslsw1 uplink to "new" switch
4. Move remaining backend 10G switches

If anything goes wrong, I should be able to quickly revert the change.
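
For anyone who wants to reason about the ordering, the plan above can be
sketched as a list of steps that are applied in order and undone in reverse
order if one fails. This is only an illustrative sketch: the actual cutover
is manual recabling and switch configuration, and run_cutover, the Step
tuple, and the apply/revert callables are hypothetical names, not tooling we
actually run.

```python
from typing import Callable, List, Tuple

# Each step is (description, apply, revert). The callables are hypothetical
# placeholders standing in for manual recabling or configuration changes.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_cutover(steps: List[Step]) -> List[str]:
    """Apply steps in order; if one fails, revert completed steps in reverse."""
    log: List[str] = []
    done: List[Step] = []
    for step in steps:
        name, apply, _revert = step
        try:
            apply()
        except Exception as exc:
            log.append(f"FAILED {name}: {exc}")
            # Undo the most recent successful move first.
            for prev_name, _apply, prev_revert in reversed(done):
                prev_revert()
                log.append(f"reverted {prev_name}")
            return log
        done.append(step)
        log.append(f"moved {name}")
    return log


def noop() -> None:
    """Placeholder action that always succeeds."""


# The four moves from the plan above, in the order listed.
PLAN: List[Step] = [
    ("LinkOregon uplink", noop, noop),
    ("oslsw3 uplink", noop, noop),
    ("oslsw1 uplink", noop, noop),
    ("remaining backend 10G switches", noop, noop),
]

if __name__ == "__main__":
    for line in run_cutover(PLAN):
        print(line)
```

Reverting in reverse order means the LinkOregon uplink, which is moved
first, is the last thing moved back.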

On Tue, Sep 13, 2022 at 11:08 AM Lance Albertson <lance at osuosl.org> wrote:

> All,
>
> I wanted to pass along more information on where we're at and our current
> plans to try and work around this issue.
>
> Without going deep into the history of our core network infrastructure, we
> have two core "routers" that are both aging and we're in the process of
> replacing them with something newer.
>
> Previously, our uplink was connected through our Cisco 6509. This switch
> has several 1G line cards that half of our servers are directly connected
> to.
>
> The other core switch is a Cisco Nexus 6001 which has three fabric
> extenders which provide 1G connectivity to the other half of our servers.
> When we migrated over to the LinkOregon network, we moved the uplink over
> to this Nexus 6k as it was much easier to get LR optics for it.
>
> Unfortunately, this Nexus 6k has kernel panicked and rebooted multiple
> times in the past several months, causing these outages. Many of our
> downlink 10G switches are connected to this Nexus 6k, which means there's
> a larger impact when it goes down.
>
> A few years ago a high speed trading company donated a pallet full of
> Arista switches to us, and I've been slowly adding them to our
> infrastructure. Even
> though they are EOL, they still work very well and we haven't had any
> problems with them. And since I have a lot of them, I can easily replace
> one if one goes bad.
>
> My current plan is to set up one of these Arista switches and move all of
> the current 10G connections to it. This way, at least we can reduce the
> impact if/when this Nexus 6k switch reboots again. In theory, it should
> only affect the servers directly connected to the FEX switches if it
> reboots again.
>
> I reached out to the OSU IT community and they graciously donated two
> 10G-LR optical modules so that I can put this plan in place without having
> to wait to ship modules.
>
> Current plan for today:
> - Set up the new Arista switch
> - Move upstream connectivity to LinkOregon to it
> - Move all downstream 10G links to this switch
>
> I will send another email when I plan to do the actual outages for the cut
> over.
>
> Longer term plans
> - Work with vendors to replace our aging core network infrastructure with
> something that's still supported and we can afford
> - Look into getting redundancy put into place so that we don't have this
> issue anymore
> - Migrate off of the older equipment
>
> If anyone on this list has connections to Arista or any other major edge
> networking vendor, please let me know. That will certainly help our
> situation in the long term!
>
> I had already started working on a plan to replace these systems but it
> seems my time may have run out (at least for the Nexus 6k switch).
>
> Thanks all for your patience and support!
>
> On Mon, Sep 12, 2022 at 11:54 PM Lance Albertson <lance at osuosl.org> wrote:
>
>> Sadly, this just happened again about 50 minutes ago. We may need to do
>> some emergency firmware patching tomorrow. As a backup, I'm also
>> formulating a plan to add another switch to try to minimize the impact of
>> this troublesome switch.
>>
>> Once I gather some additional information tomorrow morning, I'll send an
>> update on what we're planning to do.
>>
>> Thanks again for your patience.
>>
>> On Mon, Sep 12, 2022 at 3:14 PM Lance Albertson <lance at osuosl.org> wrote:
>>
>>> This happened again at approximately 10AM PDT. Since we moved our uplink
>>> to this switch, everything went down while the switch rebooted.
>>>
>>> We're still planning on doing an upgrade but don't have a date yet for
>>> that. We'll hopefully get that going soon.
>>>
>>> Thanks for your patience.
>>>
>>> On Wed, Aug 24, 2022 at 7:40 AM Lance Albertson <lance at osuosl.org>
>>> wrote:
>>>
>>>> Unfortunately, this just happened again overnight. We may need to
>>>> schedule another outage to perform a software upgrade on this switch so
>>>> that this stops happening. We'll send an announcement out once we have
>>>> everything in place to do that upgrade.
>>>>
>>>> Thanks-
>>>>
>>>> On Wed, May 25, 2022 at 11:22 PM Lance Albertson <lance at osuosl.org>
>>>> wrote:
>>>>
>>>>> All,
>>>>>
>>>>> It appears that one of our core network switches had a kernel panic
>>>>> and rebooted, which caused widespread outages throughout our infrastructure.
>>>>> As of right now, everything appears to be back to normal but please let me
>>>>> know if that isn't the case by sending an email to support at osuosl.org.
>>>>>
>>>>> Apologies for the outage and we'll be looking into why this switch had
>>>>> a kernel panic in the first place.
>>>>>
>>>>> Thanks-
>>>>>
>>>>
> --
> Lance Albertson
> Director
> Oregon State University | Open Source Lab
>


-- 
Lance Albertson
Director
Oregon State University | Open Source Lab