[SAC] Run away process on Projects VM

Wed Mar 13 06:30:41 PDT 2013

On Wed, Mar 13, 2013 at 8:34 AM, Christopher Schmidt
<crschmidt at crschmidt.net> wrote:
> On Wed, Mar 13, 2013 at 12:00:59PM +0100, Markus Neteler wrote:
>> On Tue, Mar 12, 2013 at 7:08 PM, Alex Mandel <tech_dev at wildintellect.com> wrote:
>> > It looks like it's apache memory consumption, or a memory leak in
>> > someone's app (perhaps python or php). Looking around at options I think
>> > we might need to adjust MaxRequestsPerChild to kill slightly more often and
>> > reset the memory.
>>
>> Looking at:
>> http://httpd.apache.org/docs/2.2/mod/mpm_common.html#maxrequestsperchild
>>
>> -> Default: MaxRequestsPerChild 10000
>>
>> but we have 0 - "If MaxRequestsPerChild is 0, then the process will
>> never expire."
>>
>> I'll set to 10000 and restart since Apache is blocked continuously and
>>   http://webextra.osgeo.osuosl.org/munin/osgeo.org/projects.osgeo.org-apache_processes.html
>> looks bad for some days (I wonder what happened).
>
> So, I hopped on to look at this a bit this morning.
>
> Here are the observations I made, and what I did to try to negate the problem.
>
>  1. Load up http://projects.osgeo.osuosl.org/server-status , note that
>     many many connections appear to be stuck in "Reading" state, which
>     means they have not yet sent enough information to the server to make
>     a request.
>  2. Bump MaxClients to the max of 256 -- no difference.
>  3. Turn down timeout from 120 seconds to 5 seconds -- server behavior
>     improves, but behavior is still consistent with many clients opening a
>     connection and trying to hold it for > 5 seconds. Free clients exist,
>     but any client taking longer than 5s is killed.
>
> At this point, it seems like the problem is that many clients are connecting,
> but are not sending packets after that for a long time. Unfortunately, I can't
> see any traffic pattern that would explain this.
>
> One possible explanation is just that we're network-bound on input, but that is
> inconsistent with low latency interactive ssh and with the fact that lowering the
> timeout seems to have an improvement. Another possible explanation is a DOS of
> some sort, but I can't find any obvious evidence of that. (Of course, running
> a webserver shared by many projects and accessed by a worldwide network of users
> of many websites doesn't really look much different than a DOS to begin with,
> so I'm hard pressed to deny the possibility entirely.)
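
The "Reading" observation in step 1 above can also be checked from a script
rather than by eye. What follows is a minimal Python sketch, assuming the
machine-readable mod_status output (the server-status page requested with
"?auto") is enabled and has already been fetched into a string. The function
name and the state labels are illustrative and not from the original thread.

```python
from collections import Counter

# Scoreboard key as printed by Apache mod_status; labels are our own shorthand.
STATES = {
    "_": "waiting",
    "S": "starting",
    "R": "reading",
    "W": "sending",
    "K": "keepalive",
    "D": "dns",
    "C": "closing",
    "L": "logging",
    "G": "finishing",
    "I": "idle_cleanup",
    ".": "open_slot",
}


def scoreboard_counts(status_text):
    """Count worker states from the 'Scoreboard:' line of ?auto output."""
    for line in status_text.splitlines():
        if line.startswith("Scoreboard:"):
            board = line.split(":", 1)[1].strip()
            return Counter(STATES.get(ch, "unknown") for ch in board)
    # No scoreboard line found: return an empty tally.
    return Counter()
```

A large "reading" count alongside few "sending" workers would match the
symptom described above: clients that connect and then send nothing.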

After some more research, I found that the timed-out requests were all coming
from a narrow range of IPs which were trying to use OSGeo as a proxy server.
This behavior is ongoing in general -- about 5 requests per second hitting
OSGeo are trying, and failing, to use our service as a proxy server -- but this
block of IPs was new, and was causing more harm than good.
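
For anyone wanting to spot this kind of traffic in their own logs, here is a
rough Python sketch. It assumes the access log uses Apache's common or combined
format, and it treats a request as a proxy attempt when the method is CONNECT or
the request target is an absolute http(s) URL. Both the heuristic and the
function names are assumptions for illustration, not what was actually run on
the server.

```python
import re
from collections import Counter

# Client address plus method and target from the quoted request line of a
# common/combined-format access log entry.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(\S+) (\S+)')


def is_proxy_attempt(method, target):
    """Heuristic: CONNECT, or an absolute URL where a path is expected."""
    return method.upper() == "CONNECT" or target.lower().startswith(
        ("http://", "https://")
    )


def proxy_attempts_by_client(lines):
    """Return a Counter of proxy-style requests keyed by client address."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m and is_proxy_attempt(m.group(2), m.group(3)):
            counts[m.group(1)] += 1
    return counts
```

Calling `proxy_attempts_by_client(open(path))` on a log file and looking at
`most_common()` gives a quick view of which clients send the most such
requests; `path` here stands for wherever the access log lives.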

I have, for the time being, implemented the following iptables rules, using help
from petschge on IRC:

DROP       all  --  192.74.224.0/19      anywhere
DROP       all  --  142.4.96.0/19        anywhere
DROP       all  --  142.0.128.0/20       anywhere

These are from:

 http://bgp.he.net/AS54600#_prefixes

This brought us back down to our previous load levels without other obvious
negative consequences.
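
To check whether a given address from the logs falls inside the three blocked
ranges listed above, a small Python sketch using the standard ipaddress module
could look like this. The prefixes are the ones from the rules above; the
function name is just illustrative.

```python
import ipaddress

# The three prefixes dropped by the iptables rules.
BLOCKED = [
    ipaddress.ip_network(prefix)
    for prefix in ("192.74.224.0/19", "142.4.96.0/19", "142.0.128.0/20")
]


def is_blocked(addr):
    """Return True if addr lies inside any of the blocked prefixes."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKED)
```

This only mirrors the rule set for offline analysis (for example, filtering log
lines); it does not touch the firewall itself.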

I've left the Apache config mostly the same for now -- but bumped up the timeout
to 60s instead of 30s -- and it would be wise to watch it for a few days. The
problem seems to have simply been that a network which is somewhat known for
spamming started trying to use our server for spam. That won't actually work,
but since we had an open proxy a year ago, it seems that we still get regular
traffic trying to use us as a proxy server.

-- Chris