[Mapserver-dev] FCGI & Database Failure

Tue Oct 19 17:04:35 EDT 2004

The more I think about it, the more I think that "punt layers" is a good 
failure mode to have, once you get a system out of testing and into 
production. You can scream and yell in the logs, but don't just shut 
down if one of your data sources goes away for a while.

That still leaves "go down in flames" as a mode which is useful in 
development situations, so sadly we almost need a switch.

And of course that still leaves us with the problem of remediating bad 
database connections. Going down in flames gives us the option of rising 
like a phoenix. On the other hand, if there is nothing to rise to (if 
the database or remote WMS server is completely in the crapper), we have 
just taken the whole service offline.

As I talk about it, it sounds like a "nice" solution would be quite a 
complex mix of "punt layer" and "reset layer" with a dose of "die die 
die" stirred in for developer purposes. The maximum quantity of work :/

P.

Frank Warmerdam wrote:

> Paul Ramsey wrote:
> 
>> Now that we have FCGI, we have database connections that can persist 
>> for quite a while.
>>
>> This means that the issue of in-process database failure is no longer 
>> one we can ignore.
>>
>> So, for every connectiontype that is doing connection pooling, there 
>> needs to be a test right after the connection is obtained from the 
>> pool of "is this connection any good?". And then something useful has 
>> to be done if it is not. What should that useful thing be?
>> (a) try to reestablish the connection?
>> (b) shut down the whole process and let someone else clean up the mess?
>> (c) punt on this layer, but let other layers try and finish their work?
> 
> 
> Paul,
> 
> Well, dramatic death (i.e. (b)) would ensure that a clean FastCGI process
> gets started and it would go through the normal processes to acquire a new
> connection for the next request.  This approach provides for guaranteed
> self-cleanup as long as you are willing to have occasional map requests
> fail.
> 
> What sorts of situations are you trying to deal with?  A case where the
> database was restarted while the FastCGI process was running?
> 
> If we want to pursue (a), we would need some manner of letting the low
> level pooling API know that an existing handle is no longer valid and needs
> to be flushed from the pool.  I think this is the most desirable approach
> if we want to do a nice clean job.
> 
>> This brings up the general case of database/layer failure: what is 
>> the correct behavior when a data source fails? 
>> Should we be dying with prejudice, or trying to let the layers that 
>> *do* work finish their jobs?
> 
> My personal opinion would be to die dramatically when something goes wrong,
> with the possible exception of when we are depending on remote web services
> (like cascaded WMS layers) where we might well expect things to be lossy.
> But then I hate being fooled into thinking something wrong has succeeded
> more than I worry about having stuff blow up in front of me.  I may have
> a "developer
> bias" on this.  If we are going to die dramatically, of course it behooves
> us to explain as well as possible what has gone wrong so diagnosing the
> problems isn't as hard as it often is with MapServer.
> 
> Best regards,