[GRASS-dev] Some progress on Win32 attribute deadlock problem

Benjamin Ducke benjamin.ducke at ufg.uni-kiel.de
Tue Jul 10 11:02:55 EDT 2007


Glynn,

my last posting to the GRASS dev list overlapped with yours.
I'll try to find the time and do as you suggested.

Best,

Ben

Glynn Clements wrote:
> Benjamin Ducke wrote:
> 
>>> I'm referring to the situation where read/write return a short count,
>>> e.g. "write(fd, buf, count) < count". But, I've remembered that XDR
>>> uses stdio rather than POSIX I/O, and I don't think that fread/fwrite
>>> can return a short count (except for EOF).
>>>
>>> According to the MSVCRT documentation, the _O_NOINHERIT flag should be
>>> used when using _dup2(), e.g.:
>>>
>>> 	_pipe(p1, 250000, _O_BINARY|_O_NOINHERIT)
>>>
>>> Another suggestion: try changing the size passed to the _pipe()
>>> function in dbmi_client/start.c. If that affects the tendency to
>>> deadlock, it strongly suggests that the issue is related to the way
>>> that a full pipe is handled.
>>>
>>> Beyond that, the only thing which I can suggest is to instrument the
>>> XDR code with debug code to log all I/O operations (including the data
>>> which is read/written).
>> After hundreds of test runs with different Windows versions, these
>> are my conclusions:
>>
>> The problem has to do with the pipe mechanism in Windows.
>> I tried changing the pipe size as suggested, using extremely small (25)
>> and extremely high (250000000) values. On Windows 2000, with
>> the very small value, no module run makes it past 33 percent. So there
>> is a clear correlation. As soon as I set it to some "sane" value
>> (at least 25000), I get the same situation as before: ca. 4-6 out of
>> 50 runs complete. Increasing the value further makes no difference;
>> any variation is within measuring precision.
>>
>> This is no surprise, since the comment in dbmi_client/start.c states
>> that the pipe buffer value is not directly related to the pipe size.
>> Apparently, Windows chooses some fixed value as soon as the size
>> is greater than some threshold. The same thing happens when I set
>> the size to "0".
>>
>> However, the fact that I can block the piping effectively with
>> very small values leaves me believing that this is, as Glynn
>> suggests, the source of the troubles:
>> A full pipe gets stuck and no process ever takes anything out of
>> it to make some room, so the next bit of data cannot be pushed into
>> it. Puller waits for pusher, pusher never pushes, because nothing
>> gets pulled = deadlock. (I think...)
> 
> The reader won't be waiting for the writer if it already has a full
> pipe of data available.
> 
> The usual reason for co-processes (two processes connected via a
> read/write pair of pipes) to deadlock is that both are trying to write
> to full pipes. Neither process can continue until its write pipe is
> drained, which won't happen as each process is blocked.
> 
> [The situation where both are blocked trying to read from empty pipes
> is theoretically possible but uncommon in practice, as it indicates a
> fundamental design error, whereas the both-blocked-on-write case is
> usually due to a relatively simple oversight.]
> 
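To illustrate the "full pipe" condition discussed above, here is a minimal POSIX sketch (not Win32, and not GRASS code). It fills a pipe that nobody reads from and counts how many bytes go in before a blocking write() would stall. The helper name pipe_capacity and the 512-byte chunk limit are assumptions made for this example only; O_NONBLOCK is used so that the stall shows up as EAGAIN instead of hanging the process.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write chunk_size bytes at a time into a pipe that
 * is never read, and return the number of bytes accepted before the pipe
 * is full. Returns -1 on error. POSIX only. */
long pipe_capacity(size_t chunk_size)
{
    int fds[2];
    char chunk[512];
    long total = 0;
    int flags;

    assert(chunk_size > 0 && chunk_size <= sizeof chunk);

    if (pipe(fds) != 0)
        return -1;

    /* Non-blocking mode turns "would block forever" into EAGAIN. */
    flags = fcntl(fds[1], F_GETFL);
    if (flags == -1 || fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    memset(chunk, 'x', sizeof chunk);
    for (;;) {
        ssize_t n = write(fds[1], chunk, chunk_size);
        if (n > 0) {
            total += n;                 /* data accepted, keep going */
        } else if (n == -1 && errno == EINTR) {
            continue;                   /* interrupted, retry */
        } else {
            /* EAGAIN/EWOULDBLOCK means the pipe is full: a blocking
             * writer would be stuck here until someone reads. */
            if (!(n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)))
                total = -1;             /* any other outcome is an error */
            break;
        }
    }

    close(fds[0]);
    close(fds[1]);
    return total;
}
```

Calling pipe_capacity(512) returns the number of bytes the pipe buffered. If two co-processes each reach that limit on their outgoing pipe at the same time, both block in write() and neither drains the other.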
> Briefly, the drivers use a synchronous RPC mechanism:
> 
> 	Phase				Client		Driver
> 	
> 	1. Client sends request		write(1)	read(1)
> 	2. Driver reads request		read(2)		read(1)
> 	3. Driver processes request	read(2)		busy(2)
> 	4. Driver sends response	read(2)		write(3)
> 	5. Client reads response	read(2)		read(1)
> 
> Client:
> 
> 	do_request(...)
> 	{
> 	1:	send_request();
> 	2:	read_response();
> 	}
> 
> Driver:
> 
> 	main_loop(...)
> 	{
> 		while (!eof(...))
> 		{
> 		1:	read_request();
> 		2:	process_request();
> 		3:	send_response();
> 		}
> 	}
> 
> For this mechanism to work, each process (client and driver) must read
> exactly what it is sent before proceeding, and must send exactly what
> the client expects to receive. The driver must not send the response
> before it has read the entire request, even if it intends to discard
> it. Similarly, the client must read any response before sending the
> next request; it can't send a bunch of requests then read all of the
> responses later.
> 
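The request/response cycle described above can be sketched with POSIX pipes and fork(). This is only an illustration under stated assumptions: the real drivers go through XDR and stdio, and every name here (read_full, write_full, driver_loop, do_request, run_session) is hypothetical, as is the fixed-size int32_t message. The point is that each side reads exactly what it was sent before it writes again, and that short counts are retried.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Read exactly len bytes, retrying on short counts.
 * Returns 0 on success, -1 on error or premature EOF. */
static int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n == 0)
            return -1;
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Write exactly len bytes, retrying on short counts. */
static int write_full(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Driver: 1. read request, 2. process it, 3. send response; stop at EOF. */
static void driver_loop(int in, int out)
{
    int32_t req;
    while (read_full(in, &req, sizeof req) == 0) {
        int32_t resp = req * 2;         /* stand-in for real processing */
        if (write_full(out, &resp, sizeof resp) != 0)
            break;
    }
}

/* Client: 1. send the request, 2. block until the whole response is read. */
static int do_request(int out, int in, int32_t req, int32_t *resp)
{
    if (write_full(out, &req, sizeof req) != 0)
        return -1;
    return read_full(in, resp, sizeof *resp);
}

/* Start a driver, send requests 1..count one at a time, and return the
 * sum of the responses (or -1 on failure). */
long run_session(int count)
{
    int to_drv[2], from_drv[2];
    long sum = 0;
    pid_t pid;

    assert(count >= 0);
    if (pipe(to_drv) != 0)
        return -1;
    if (pipe(from_drv) != 0) {
        close(to_drv[0]); close(to_drv[1]);
        return -1;
    }
    pid = fork();
    if (pid < 0) {
        close(to_drv[0]); close(to_drv[1]);
        close(from_drv[0]); close(from_drv[1]);
        return -1;
    }
    if (pid == 0) {                     /* child acts as the driver */
        close(to_drv[1]); close(from_drv[0]);
        driver_loop(to_drv[0], from_drv[1]);
        _exit(0);
    }
    close(to_drv[0]); close(from_drv[1]);
    for (int i = 1; i <= count; i++) {
        int32_t resp;
        if (do_request(to_drv[1], from_drv[0], i, &resp) != 0) {
            sum = -1;
            break;
        }
        sum += resp;
    }
    close(to_drv[1]);                   /* EOF tells the driver to exit */
    close(from_drv[0]);
    waitpid(pid, NULL, 0);
    return sum;
}
```

With this stand-in driver, run_session(3) sends 1, 2 and 3 and adds up the doubled replies. Because the client never has more than one request outstanding, neither pipe can fill up in this sketch.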
> Having said all of that, there can't be any fundamental design
> problems if it works fine on Unix. One possibility is incorrect
> handling of an error condition which only occurs on Windows.
> 
> It's also possible that the Windows stdio implementation doesn't like
> _pipe(). If it doesn't handle short reads/writes correctly, that won't
> affect files, and it probably won't affect pipes which never fill up,
> but it may fall down on a full pipe.
> 
>> Another thing makes me believe that Windows itself is the culprit
>> here: I tested the same stuff on a Windows XP SP2 system, clean
>> install from scratch. On this system, almost all the runs (97%)
>> finished cleanly!
>>
>> Obviously MS made some improvements to process communication in that
>> release ...
> 
> Not necessarily. If the issue is related to timing, it could just be
> that everything runs more smoothly due to the clean install (rather
> than SP2).
> 
>> Setting the _O_NOINHERIT flag makes no difference.
>>
>> So, how are we going to go ahead?
> 
> Figure out how to debug the processes. If you can't get gdb to work, I
> can only suggest logging every significant event at the lowest level,
> i.e. log every read/write operation: the arguments, the return code,
> and the complete data (i.e. the buffer contents after read and before
> write). This is all done in the RPC/XDR library, in xdr_stdio.c. It
> will probably help to also log the beginning/end of each procedure
> call (i.e. lib/db/dbmi_base/xdrprocedure.c).
> 
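As a starting point for the logging suggested above, here is a rough sketch of stdio wrappers that record the arguments, the return value and the data of each call. The names logged_fwrite, logged_fread and log_bytes are made up for this example; they are not part of xdr_stdio.c, and how they would be hooked into the XDR code is left open. The data is dumped at the point where the buffer holds meaningful content, and the log is flushed after every call so that it survives a later hang.

```c
#include <assert.h>
#include <stdio.h>

/* Hex-dump len bytes of buf to the log stream, followed by a newline. */
static void log_bytes(FILE *log, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        fprintf(log, "%02x%s", buf[i], (i + 1 < len) ? " " : "");
    fputc('\n', log);
}

/* fwrite() wrapper: logs arguments and outgoing data, then the result. */
size_t logged_fwrite(FILE *log, const void *buf, size_t size, size_t n,
                     FILE *fp)
{
    size_t ret;

    assert(log != NULL && buf != NULL && fp != NULL);
    fprintf(log, "fwrite(size=%zu, n=%zu) data: ", size, n);
    log_bytes(log, buf, size * n);
    ret = fwrite(buf, size, n, fp);
    fprintf(log, "fwrite -> %zu%s\n", ret, ferror(fp) ? " (error)" : "");
    fflush(log);        /* keep the log intact if the process hangs later */
    return ret;
}

/* fread() wrapper: logs arguments, the result, and the data received. */
size_t logged_fread(FILE *log, void *buf, size_t size, size_t n, FILE *fp)
{
    size_t ret;

    assert(log != NULL && buf != NULL && fp != NULL);
    ret = fread(buf, size, n, fp);
    fprintf(log, "fread(size=%zu, n=%zu) -> %zu%s%s data: ", size, n, ret,
            feof(fp) ? " (eof)" : "", ferror(fp) ? " (error)" : "");
    log_bytes(log, buf, size * ret);    /* only the items actually read */
    fflush(log);
    return ret;
}
```

A separate log file per process (client and driver) would make it easier to line up the last operation each side attempted before the deadlock.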

-- 
Benjamin Ducke, M.A.
Archäoinformatik
(Archaeoinformation Science)
Institut für Ur- und Frühgeschichte
(Inst. of Prehistoric and Historic Archaeology)
Christian-Albrechts-Universität zu Kiel
Johanna-Mestorf-Straße 2-6
D 24098 Kiel
Germany

Tel.: ++49 (0)431 880-3378 / -3379
Fax : ++49 (0)431 880-7300
www.uni-kiel.de/ufg



