[Hosting] Post-mortem: Network connectivity issues during edge router upgrade

Lance Albertson lance at osuosl.org
Thu May 14 17:05:21 PDT 2026


Hi everyone,

*Date*: 2026-05-14
*Impact*: Intermittent IPv4 and IPv6 connectivity for some hosted services
for approximately 3 hours and 20 minutes beyond the planned maintenance
window.

Today, OSL performed scheduled maintenance to bring our second edge router
(sw-edge1) into active service alongside our existing edge router
(sw-edge2). The goals were active-active routing redundancy at our network
edge, elimination of long-standing traffic asymmetry, and the ability to do
future edge router maintenance without service interruption.

The maintenance hit two issues:

*1. An upstream LACP issue with our ISP (LinkOregon).*

Stale configuration on the interface facing our new switch — left over from
a pseudo-wire used during our data center migration earlier this year —
prevented the new uplink from forming an active LACP bundle. Because we had
already activated sw-edge1 as a Layer 3 router, traffic that hashed to
sw-edge1 had no clean path out and was disrupted until the bundle came up.
LinkOregon's team removed the legacy configuration once it was
identified.

*2. An ARP and IPv6 neighbor synchronization issue between our two edge
switches.*

After we resolved the LACP issues, some hosted services experienced
intermittent connectivity — some hosts were reachable, others were not,
with the pattern shifting over time. The root cause was a subtle
platform-specific behavior on our Arista switches: by default, MLAG (the
technology bonding our two edge routers into an active-active pair) does
not synchronize ARP and IPv6 neighbor state between peer switches unless an
additional software agent is active. We had been operating under the
assumption that this synchronization happened automatically — a widespread
assumption that turned out to be incorrect for our hardware platform.
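
To illustrate why unsynchronized neighbor state shows up as "some hosts
reachable, others not", here is a minimal toy model in Python. It is not
Arista's MLAG implementation: the function names (pick_peer, reachable,
sync), the octet-sum hash, and the addresses (documentation ranges) are all
assumptions made up for illustration. It assumes only that each flow is
hashed to one of the two peers and that a peer can deliver a packet only if
its own neighbor table has an entry for the destination.

```python
PEERS = ("sw-edge1", "sw-edge2")


def pick_peer(src_ip: str, dst_ip: str) -> str:
    """Toy per-flow hash: sum every IPv4 octet of src and dst, modulo 2.

    Real hardware hashes on more fields; this stand-in only needs to be
    deterministic so the same flow always lands on the same peer.
    """
    octets = [int(o) for ip in (src_ip, dst_ip) for o in ip.split(".")]
    return PEERS[sum(octets) % len(PEERS)]


def reachable(src_ip: str, dst_ip: str, neighbor_tables: dict) -> bool:
    """A flow succeeds only if the peer it hashed to knows the destination."""
    return dst_ip in neighbor_tables[pick_peer(src_ip, dst_ip)]


def sync(neighbor_tables: dict) -> dict:
    """Model neighbor synchronization: every peer gets the union of entries."""
    merged = set().union(*neighbor_tables.values())
    return {peer: set(merged) for peer in neighbor_tables}


if __name__ == "__main__":
    host = "192.0.2.10"  # hypothetical hosted service
    # Only sw-edge1 has learned the host; sw-edge2's table is empty.
    tables = {"sw-edge1": {host}, "sw-edge2": set()}
    for src in ("198.51.100.1", "198.51.100.2"):
        print(src, pick_peer(src, host),
              reachable(src, host, tables),
              reachable(src, host, sync(tables)))
```

In this model, which clients can reach a given host depends entirely on
which peer each flow hashes to and which peer happened to learn the entry,
so the pattern shifts as entries age out and are relearned. Once the tables
are synchronized, the hash no longer matters.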

We had reviewed the migration plan with both LinkOregon and Arista
beforehand, and neither of these failure modes was anticipated by anyone
involved. We're grateful that Arista's engineer was able to join us on
short notice: they had a meeting in ten minutes when we reached out, and
still provided a working fix within about twenty. Their fix
involved enabling a VXLAN configuration between our edge switches (used
purely to activate the synchronization agent, not to carry traffic) and
changing our IPv6 gateway addressing model to give each switch a unique
IPv6 address alongside the shared gateway. From the host perspective,
gateway addresses are unchanged.

The IPv4 fix was in place by 2:20 PM PDT; IPv6 SLAAC was fully restored by
approximately 3:20 PM PDT.

Thanks to Arista's engineer for the quick response, to LinkOregon's network
team for the fast turnaround, and to our hosted projects for their
patience. If you observed connectivity issues you'd like us to verify
against our timeline, please reach out via support at osuosl.org.
Thanks-

-- 
Lance Albertson
Director
Oregon State University | Open Source Lab
_______________________________________________
Hosting mailing list
Hosting at lists.osuosl.org
https://lists.osuosl.org/mailman/listinfo/hosting

