[SAC] [Hosting] Hypervisor reboots on production Ganeti cluster
Lance Albertson
lance at osuosl.org
Mon Aug 27 13:01:42 PDT 2018
All,
We've been having issues with our Ganeti cluster hypervisors randomly
rebooting. So far we've had this happen four times within the last month
(including this morning) at the following times (in UTC) and duration:
Event Start Time       Event End Time         Event Duration
2018-07-27 10:25:13    2018-07-27 10:29:29    0d 0h 4m 16s
2018-07-28 17:03:19    2018-07-28 17:06:57    0d 0h 3m 38s
2018-08-23 08:14:07    2018-08-23 08:18:00    0d 0h 3m 53s
2018-08-27 11:46:06    2018-08-27 11:48:44    0d 0h 2m 38s
The kernel log points to this upstream issue for RHEL [1], which has also
been reported on the CentOS forum [2] (we run CentOS 7 on our servers).
[1] https://access.redhat.com/solutions/3432391
[2] https://www.centos.org/forums/viewtopic.php?t=67170
The fix is listed as "In progress" upstream, so I expect it to be included
in the next kernel released for CentOS. Unfortunately, the only workarounds
are to run an older kernel or to wait. For now we're opting to wait for the
next release; however, if we see increased instability we'll look into
rolling back to an older kernel.
So far this has only affected our Ganeti cluster and hasn't been triggered
in either of our OpenStack clusters (x86 & ppc64le).
For some of our older VMs, which boot using an external kernel we build,
we've been hitting a different problem: dracut drops to an emergency shell
if there are any fsck warnings on boot. We've had to manually fsck those
filesystems and reboot the VMs to bring them back online. We're hoping to
move away from these external kernels soon, so let us know if you'd like to
do that. In the meantime, we're going to look at what changes we can make
to the initrd so that it doesn't drop into the emergency shell and instead
lets the OS handle the fsck. We recently updated those external kernels to
use an initrd, which is why this has started to happen.
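As a rough sketch of the kind of initrd change we're considering (this
assumes the initrd uses systemd's fsck handling, as CentOS 7's dracut does;
the exact options are candidates to test, not a confirmed fix -- see
dracut.cmdline(7) and systemd-fsck(8)):

```shell
# Hypothetical kernel command-line tweaks to keep the initrd from
# stopping in the emergency shell on fsck warnings. Behaviour depends
# on the dracut/systemd versions baked into the initrd.
#
# In /etc/default/grub, append to GRUB_CMDLINE_LINUX:
#   rd.shell=0        - never drop to the dracut emergency shell
#   fsck.repair=yes   - let systemd-fsck attempt repairs instead of failing
# e.g.:
#   GRUB_CMDLINE_LINUX="... rd.shell=0 fsck.repair=yes"

# Then regenerate the grub config (CentOS 7, BIOS boot):
grub2-mkconfig -o /boot/grub2/grub.cfg
```

For externally booted kernels the same options can instead be passed
directly on the kernel command line at VM creation.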
The RH solution page includes the following upstream commit message, which
explains why this bug happens in the kernel:
Because we drop cpu_base->lock around calling hrtimer::function, it is
possible for hrtimer_start() to come in between and enqueue the timer.
If hrtimer::function then returns HRTIMER_RESTART we'll hit the BUG_ON
because HRTIMER_STATE_ENQUEUED will be set.
Since the above is a perfectly valid scenario, remove the BUG_ON and
make the enqueue_hrtimer() call conditional on the timer not being
enqueued already.
NOTE: in that concurrent scenario its entirely common for both sites
to want to modify the hrtimer, since hrtimers don't provide
serialization themselves be sure to provide some such that the
hrtimer::function and the hrtimer_start() caller don't both try and
fudge the expiration state at the same time.
To that effect, add a WARN when someone tries to forward an already
enqueued timer, the most common way to change the expiry of self
restarting timers. Ideally we'd put the WARN in everything modifying
the expiry but most of that is inlines and we don't need the bloat.
If you have any questions or concerns please let us know.
Thanks!
--
Lance Albertson
Director
Oregon State University | Open Source Lab
_______________________________________________
Hosting mailing list
Hosting at osuosl.org
https://lists.osuosl.org/mailman/listinfo/hosting