[SAC] [Hosting] Ganeti Production Rebuild - Dec 11-15 & 18-19, 2017

Wed Dec 6 11:36:49 PST 2017

Service(s) affected:

All VMs running on our production Ganeti cluster will need to be non-live
migrated to their secondary nodes (i.e. shutdown and start is required). We
expect the outages for each VM to be short (under 5 minutes each). To see a
list of VMs that are affected and when please see this page [1]. We will
ensure the VMs are pingable after the reboot, but you may want to check
that services started properly for any services we don't already monitor.

No OpenStack services will be affected by this outage.

Outage Window:

This is a multi-day outage which will impact one hypervisor per day with an
outage window of approximately two hours. If we run into an issue that
can't be resolved during the day of the planned outage, we will be pushing
back this schedule a day and notify you of the change.

Currently proposed schedule for the hypervisors:

- gprod6: 12/11/2017 9:00AM - 11:00 AM PST (1700 - 1900 UTC)
- gprod4: 12/12/2017 9:00AM - 11:00 AM PST (1700 - 1900 UTC)
- gprod3: 12/13/2017 9:00AM - 11:00 AM PST (1700 - 1900 UTC)
- gprod7: 12/14/2017 9:00AM - 11:00 AM PST (1700 - 1900 UTC)
- gprod2: 12/15/2017 9:00AM - 12:00 PM PST (1700 - 2000 UTC)
- gprod1: 12/11/2017 1:00PM - 3:00 PM PST (2100 - 2300 UTC)
- gprod8: 12/11/2017 9:00AM - 11:00 AM PST (1700 - 1900 UTC)

Reason for outage:

We're in the midst of rebuilding our Ganeti clusters to CentOS 7 and
managed via Chef. We finished our rebuild of the internal cluster this week
and are ready to proceed with rebuilding our production cluster. We have a
total of 8 hypervisors in this cluster, one of which has already been
migrated to Chef. All secondary instances attached to the affected node
will remain and be re-synced once the node has been rebuilt and added back
as a node. All VM data stored on nodes will remain intact during the
rebuild as only the OS partition will be rebuilt.

To minimize the impact of outages, we're going to ensure all VMs will be
migrated to a new Chef managed node so that we do not need to do another
downtime. Once all hosts have been migrated, we'll be re-balance the
cluster and use live-migration to move VMs so that no downtime will be
noticed. We cannot use live-migration during the migration due to KVM
version differences between the new and old hosts unfortunately. We're also
going to take advantage of this downtime to replace the RAID batteries on
these nodes.

If you have any questions or concerns please let us know.

Projects affected:

All hosted VMs on our production Ganeti cluster.

[1] https://goo.gl/QEQsyu

-- 
Lance Albertson
Director
Oregon State University | Open Source Lab
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/sac/attachments/20171206/c4d27bf0/attachment.html>
-------------- next part --------------
_______________________________________________
Hosting mailing list
Hosting at osuosl.org
https://lists.osuosl.org/mailman/listinfo/hosting