[SAC] Munin notes

Tue Aug 21 19:02:32 PDT 2012

On 08/21/2012 03:52 PM, Frank Warmerdam wrote:
> On Tue, Aug 21, 2012 at 2:56 PM, Alex Mandel <tech_dev at wildintellect.com> wrote:
>> Due to the ongoing troubleshooting on Tracsvn I was paying a little more
>> attention to Munin.
>>
>> 1. We still need to decide where emails should be sent when Munin hits a
>> warning or critical value (e.g. disk is over 90%). I don't want to use
>> the SAC list because the number of emails per incident can get astronomical
>> quickly (1 email per chart even if only 1 chart is out of bounds, repeated
>> every 5 minutes until the issue is resolved). Do we have an svn-watch type
>> list or something similar that sends out mail but isn't really for discussion?
> 
> Alex,
> 
> Perhaps we could set up a sac-alert mailing list?  I'm just a bit
> concerned about alerting going nuts with lots of messages and
> also bogging down Mailman, filling up disks, etc.
> 
sac-alert would work. As an exception, we could host such a list with an
outside service. Each email is small; there are just a lot of them when
things go wrong.
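
For a rough sense of scale, here is a minimal Python sketch of the alert
volume, assuming the behaviour described above (one email per chart on
every 5-minute run until the issue clears). The function name and the
example chart count are hypothetical, not measured from our Munin setup.

```python
import math


def estimate_alert_emails(charts, outage_minutes, interval_minutes=5):
    """Estimate emails for one incident: one per chart per Munin run.

    Assumes every chart on the node triggers an email on each run,
    and that at least one run happens during the incident.
    """
    runs = max(1, math.ceil(outage_minutes / interval_minutes))
    return charts * runs


# Hypothetical example: 40 charts on a node, issue lasts 2 hours.
print(estimate_alert_emails(40, 120))  # 960
```

Even a modest incident lands in the hundreds of messages, which is the
argument for a dedicated list rather than the discussion list.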

>> 2. I'd like to increase the RAM allocation for the QGIS and the Projects
>> VMs, +2 GB to each, as both are using 70%+ of their current RAM on a
>> regular basis. Based on our notes we have 20 GB of unallocated RAM on
>> osgeo4 currently, so this should be no problem. Just wanted to get the
>> idea out before I ask osuosl to do it (might be able to do it with
>> ganeti web interface ourselves).
> 
> I can live with this change, but it doesn't leave us much more room
> if we want to spin up new machines or allocate more ram to existing
> ones.  What happens when the VMs go over the physical RAM?
> Does it degrade gracefully with some sort of swap equivalent?
> 
> Best regards,
> 

Yes, it uses the swap partition inside the VM's disk allocation. Whether
that counts as graceful depends on your definition, as going into swap
sometimes means the machine locks up for a while until it can clear the
swap or finish whatever it's doing. A common experience is that the
machine becomes unresponsive and no one can log in to check on it, leaving
two options: wait until it's done (which might be never if it's Apache) or
force a reboot.

We would still have 16 GB of RAM for the host and for fail-over of
secure/web (8 reserved for that, I think). I can audit to see if we can
take some away from other VMs that are underusing RAM. Would adding 1 GB
be more palatable? There's no reason we have to do increments of 2.
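
The arithmetic behind those numbers, as a small Python sketch. The 20 GB
unallocated figure comes from our notes and the 8 GB fail-over reserve is
my unconfirmed recollection; the function and constant names are only
illustrative.

```python
def remaining_ram_gb(unallocated_gb, increase_per_vm_gb, vm_count=2):
    """RAM left unallocated after giving each VM the same increase."""
    return unallocated_gb - increase_per_vm_gb * vm_count


UNALLOCATED_GB = 20       # from our notes for osgeo4
FAILOVER_RESERVE_GB = 8   # secure/web fail-over; unconfirmed figure

# Compare the two options for the QGIS and Projects VMs.
for increase in (1, 2):
    left = remaining_ram_gb(UNALLOCATED_GB, increase)
    spare = left - FAILOVER_RESERVE_GB
    print(f"+{increase} GB each: {left} GB left, "
          f"{spare} GB beyond the fail-over reserve")
```

With +2 GB each this gives the 16 GB mentioned above; with +1 GB each it
would be 18 GB.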

Long term there's still the idea of getting a build machine going
somewhere, which would take the load off the QGIS VM.

Thanks,
Alex