[SAC] [Hosting] Core network switch reboot
Lance Albertson
lance at osuosl.org
Tue Sep 13 11:08:41 PDT 2022
All,
I wanted to pass along more information on where we're at and our current
plans to try and work around this issue.
Without going deep into the history of our core network infrastructure, we
have two core "routers" that are both aging and we're in the process of
replacing them with something newer.
Previously, our uplink was connected through our Cisco 6509. This switch
has several 1G line cards that half of our servers are directly connected
to.
The other core switch is a Cisco Nexus 6001 with three fabric
extenders that provide 1G connectivity to the other half of our servers.
When we migrated over to the LinkOregon network, we moved the uplink over
to this Nexus 6k as it was much easier to get LR optics for it.
Unfortunately, this Nexus 6k has kernel panicked and rebooted multiple
times over the past several months, causing these outages. Many of our
downlink 10G switches are connected to this Nexus 6k, which means there's a
larger impact when it goes down.
A few years ago, a high-speed trading company donated a pallet full of
Arista switches to us, and I've been slowly adding them to our
infrastructure. Even though they are EOL, they still work very well and we
haven't had any problems with them. And since I have a lot of them, I can
easily replace one if it goes bad.
My current plan is to set up one of these Arista switches and move all of
the current 10G connections to it. This way, we can at least reduce the
impact if/when this Nexus 6k switch reboots again. In theory, a reboot
should then only affect the servers directly connected to the FEX switches.
I reached out to the OSU IT community and they graciously donated two
10G-LR optical modules so that I can put this plan in place without having
to wait for modules to ship.
Current plan for today:
- Set up new Arista switch
- Move upstream connectivity to LinkOregon to it
- Move all downstream 10G links to this router
I will send another email when I schedule the actual outages for the
cutover.
Longer-term plans:
- Work with vendors to replace our aging core network infrastructure with
something that's still supported and we can afford
- Look into getting redundancy put into place so that we don't have this
issue anymore
- Migrate off of the older equipment
If anyone on this list has connections to Arista or any other major edge
networking vendor, please let me know. That will certainly help our
situation in the long term!
I had already started working on a plan to replace these systems but it
seems my time may have run out (at least for the Nexus 6k switch).
Thanks all for your patience and support!
On Mon, Sep 12, 2022 at 11:54 PM Lance Albertson <lance at osuosl.org> wrote:
> Sadly this just happened again about 50 minutes ago. We may need to do
> some emergency firmware patching tomorrow. As a backup plan, I'm also
> formulating a plan to add another switch to try and minimize the impact of
> this troublesome switch.
>
> Once I gather some additional information tomorrow morning, I'll send an
> update on what we're planning to do.
>
> Thanks again for your patience.
>
> On Mon, Sep 12, 2022 at 3:14 PM Lance Albertson <lance at osuosl.org> wrote:
>
>> This happened again at approximately 10AM PDT. Since we moved our uplink
>> to this switch, everything went down while the switch rebooted.
>>
>> We're still planning on doing an upgrade but don't have a date yet for
>> that. We'll hopefully get that going soon.
>>
>> Thanks for your patience.
>>
>> On Wed, Aug 24, 2022 at 7:40 AM Lance Albertson <lance at osuosl.org> wrote:
>>
>>> Unfortunately this just happened again overnight. We may need to
>>> schedule another outage to perform some software upgrade on this switch so
>>> that this stops happening. We'll send an announcement out once we have
>>> everything in place to do that upgrade.
>>>
>>> Thanks-
>>>
>>> On Wed, May 25, 2022 at 11:22 PM Lance Albertson <lance at osuosl.org>
>>> wrote:
>>>
>>>> All,
>>>>
>>>> It appears that one of our core network switches had a kernel panic and
>>>> rebooted which caused widespread outages throughout our infrastructure. As
>>>> of right now, everything appears to be back to normal but please let me
>>>> know if that isn't the case by sending an email to support at osuosl.org.
>>>>
>>>> Apologies for the outage and we'll be looking into why this switch had
>>>> a kernel panic in the first place.
>>>>
>>>> Thanks-
>>>>
>>>
--
Lance Albertson
Director
Oregon State University | Open Source Lab