[SAC] Run away process on Projects VM

Christopher Schmidt crschmidt at crschmidt.net
Wed Mar 13 05:34:20 PDT 2013


On Wed, Mar 13, 2013 at 12:00:59PM +0100, Markus Neteler wrote:
> On Tue, Mar 12, 2013 at 7:08 PM, Alex Mandel <tech_dev at wildintellect.com> wrote:
> > It looks like it's apache memory consumption, or a memory leak in
> > someone's app (perhaps python or php). Looking around at options I think
> > we might need to adjust MaxChildRequests to kill slightly more often and
> > reset the memory.
> 
> Looking at:
> http://httpd.apache.org/docs/2.2/mod/mpm_common.html#maxrequestsperchild
> 
> -> Default: MaxRequestsPerChild 10000
> 
> but we have 0 - "If MaxRequestsPerChild is 0, then the process will
> never expire."
> 
> I'll set to 10000 and restart since Apache is blocked continuously and
>   http://webextra.osgeo.osuosl.org/munin/osgeo.org/projects.osgeo.org-apache_processes.html
> looks bad for some days (I wonder what happened).
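For reference, Markus's change corresponds to a one-line directive in the
prefork MPM section of the config (a sketch; the exact file layout on the
projects VM may differ):

```apache
# 0 means child processes never expire; 10000 recycles each child after
# 10000 requests, which caps any slow per-process memory leak.
<IfModule mpm_prefork_module>
    MaxRequestsPerChild 10000
</IfModule>
```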

So, I hopped on to look at this a bit this morning.

Here are the observations I made, and what I did to try to negate the problem.

 1. Load up http://projects.osgeo.osuosl.org/server-status and note that
    many connections appear to be stuck in the "Reading" state, which
    means they have not yet sent enough information to the server to make
    a request.
 2. Bump maxclients to the max of 256 -- no difference.
 3. Turn down the timeout from 120 seconds to 5 seconds -- server behavior
    improves, but is still consistent with many clients opening a
    connection and trying to hold it for > 5 seconds. Free slots exist,
    but any connection taking longer than 5s is killed.
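One quick way to quantify the "Reading" pileup is to count the R slots in
the scoreboard that mod_status exposes. A sketch: the scoreboard string
below is made up for illustration; on the live host you'd substitute the
real one from `curl -s "http://localhost/server-status?auto"`.

```shell
# Hypothetical scoreboard (R = reading request, W = sending reply, _ = idle);
# on the live host, get the real string from mod_status's ?auto output.
scoreboard="RRR_W_RR__W_R"

# Strip everything that is not an R, then count what remains (bash).
only_r=${scoreboard//[^R]/}
echo "workers in Reading state: ${#only_r}"
```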

At this point, it seems like the problem is that many clients are connecting,
but are not sending packets after that for a long time. Unfortunately, I can't
see any traffic pattern that would explain this.

One possible explanation is simply that we're network-bound on input, but that
is inconsistent with low-latency interactive ssh, and with the fact that
lowering the timeout seems to be an improvement. Another possible explanation
is a DoS of some sort, but I can't find any obvious evidence of that. (Of
course, running a webserver shared by many projects and accessed by a
worldwide network of users of many websites doesn't really look much different
than a DoS to begin with, so I'm hard pressed to deny the possibility
entirely.)
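The usual first check for a DoS is whether a handful of IPs own most of the
open connections. A sketch of that pipeline; the netstat output below is
fabricated for illustration -- on the live box you'd pipe in real
`netstat -tn` (or `ss -tn`) output instead:

```shell
# Sample `netstat -tn` lines (invented addresses, for illustration only).
netstat_sample='tcp 0 0 10.0.0.1:80 198.51.100.7:5011 ESTABLISHED
tcp 0 0 10.0.0.1:80 198.51.100.7:5012 ESTABLISHED
tcp 0 0 10.0.0.1:80 203.0.113.9:6022 ESTABLISHED'

# Field 5 is the remote endpoint; strip the port, then count per IP.
printf '%s\n' "$netstat_sample" \
  | awk '{split($5, a, ":"); print a[1]}' \
  | sort | uniq -c | sort -rn
```

A single IP at the top with hundreds of connections would point at an abusive
client; a long flat tail looks like ordinary worldwide traffic.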

After a bit more experimenting, it seems clear that whatever the timeout is,
roughly 10 new requests per second end up hanging in the 'Reading' state --
so the number stuck at any moment scales with the timeout. Bumping the
timeout from 5s -> 30s doesn't seem to increase the overall request
throughput (which is around 68-70/s at the moment).
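The arithmetic behind that observation, assuming the ~10/s arrival rate of
stuck connections holds (that rate is an estimate from watching
server-status, not a measured constant):

```shell
# Steady state: if ~10 connections/s enter 'Reading' and each is killed
# after $timeout seconds, roughly rate * timeout sit stuck at any moment.
rate=10
for timeout in 5 30 120; do
  echo "Timeout ${timeout}s -> ~$((rate * timeout)) connections stuck in Reading"
done
```

At the old 120s timeout that estimate exceeds even the raised MaxClients,
which would explain why the server kept jamming.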

As a tradeoff between a higher timeout and letting clients in, I have:

 - Bumped ServerLimit and MaxClients to 700
 - Put the timeout to 30s
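In apache2.conf terms, the result looks roughly like this (a sketch; note
that with prefork, ServerLimit must be raised -- and appear before
MaxClients -- for MaxClients to exceed the compiled-in 256):

```apache
Timeout 30

<IfModule mpm_prefork_module>
    ServerLimit 700
    MaxClients  700
</IfModule>
```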

This means that any connection which doesn't send data within 30s will be
closed by the server (vs 120s before). Given that most traffic on the
projects VM is static file requests, I think this is a possibly-unfortunate
but reasonable tradeoff.

The setting of 700 gives us, under the current operating load, headroom of
about 200 client slots. Observing for the past 20 minutes, this seems
relatively stable. From our historic munin data, it looks like 700 is about
100 more than our maximum at peak times.

I do not feel comfortable saying that this is a good fix: something looks
like it has changed in the usage pattern. But I can't find out what it is,
and I don't know what tool to use to debug the connections that are sitting
in 'waiting' state.

CPU, I/O utilization, and memory usage are reasonable, and requests are flowing
again.

As a first line of defense, if the problem becomes drastic again -- which
should be obvious from a flat 'busy servers' line at

 http://webextra.osgeo.osuosl.org/munin/osgeo.org/projects.osgeo.org-apache_processes.html

-- the first change to make is lowering the timeout in
/etc/apache2/apache2.conf to 5s and restarting the server. Beyond that,
I'm a bit lost as to what we can do here.
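For whoever picks this up under pressure, that recovery step can be
scripted; a sketch that edits a scratch copy of the config for safety --
point the sed at the real /etc/apache2/apache2.conf on the live host, then
restart Apache:

```shell
# Demonstrated on a temp copy; on the live server, run the sed against
# /etc/apache2/apache2.conf and follow with `/etc/init.d/apache2 restart`.
conf=$(mktemp)
printf 'Timeout 120\nKeepAlive On\n' > "$conf"
sed -i 's/^Timeout .*/Timeout 5/' "$conf"
grep '^Timeout' "$conf"   # -> Timeout 5
rm -f "$conf"
```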

Regards,
-- 
Christopher Schmidt
Web Developer

