[GRASS-dev] Some progress on Win32 attribute deadlock problem

Glynn Clements glynn at gclements.plus.com
Tue Jul 10 11:00:00 EDT 2007


Benjamin Ducke wrote:

> > I'm referring to the situation where read/write return a short count,
> > e.g. "write(fd, buf, count) < count". But, I've remembered that XDR
> > uses stdio rather than POSIX I/O, and I don't think that fread/fwrite
> > can return a short count (except for EOF).
> > 
> > According to the MSVCRT documentation, the O_NOINHERIT flag should be
> > used when using _dup2(), e.g.:
> > 
> > 	_pipe(p1, 250000, _O_BINARY|_O_NOINHERIT)
> > 
> > Another suggestion: try changing the size passed to the _pipe()
> > function in dbmi_client/start.c. If that affects the tendency to
> > deadlock, it strongly suggests that the issue is related to the way
> > that a full pipe is handled.
> > 
> > Beyond that, the only thing which I can suggest is to instrument the
> > XDR code with debug code to log all I/O operations (including the data
> > which is read/written).
> 
> After hundreds of test runs with different Windows versions, these
> are my conclusions:
> 
> The problem has to do with the pipe mechanism in Windows.
> I tried changing the pipe size as suggested, using extremely small (25)
> and extremely high (250000000) values. On Windows 2000, with
> the very small value, no module run makes it past 33 percent. So there
> is a clear correlation. As soon as I set it to some "sane" value
> (at least 25000), I get the same situation: ca. 4-6 out of 50 runs
> complete. Increasing the value from here won't make a difference,
> the differences are always within measuring precision.
> 
> This is no surprise, since the comment in dbmi_client/start.c states
> that the pipe buffer value is not directly related to the pipe size.
> Apparently, Windows chooses some fixed value as soon as the size
> is greater than some threshold. The same thing happens when I set
> the size to "0".
> 
> However, the fact that I can block the piping effectively with
> very small values leaves me believing that this is, as Glynn
> suggests, the source of the troubles:
> A full pipe gets stuck and no process ever takes anything out of
> it to make some room, so the next bit of data cannot be pushed into
> it. Puller waits for pusher, pusher never pushes, because nothing
> gets pulled = deadlock. (I think...)

The reader won't be waiting for the writer if it already has a full
pipe of data available.

The usual reason for co-processes (two processes connected via a
read/write pair of pipes) to deadlock is that both are trying to write
to full pipes. Neither process can continue until its write pipe is
drained, which won't happen as each process is blocked.

[The situation where both are blocked trying to read from empty pipes
is theoretically possible but uncommon in practice, as it indicates a
fundamental design error, whereas the both-blocked-on-write case is
usually due to a relatively simple oversight.]
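
To make the "full pipe" condition concrete, here is a minimal POSIX
sketch (not Windows, and not GRASS code; pipe_capacity() is a
hypothetical helper). It fills a pipe that nobody reads from, using
non-blocking mode so that the point where a blocking write() would
stall shows up as EAGAIN instead:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Fill a fresh pipe without ever reading from it and return how many
 * bytes it accepted before a further write would have blocked.
 * Returns -1 on error. Hypothetical helper, POSIX only. */
long pipe_capacity(void)
{
    int fds[2];
    char chunk[512] = {0};
    long total = 0;
    long result = -1;
    int flags;

    if (pipe(fds) != 0)
        return -1;

    /* Non-blocking mode turns "block until drained" into EAGAIN. */
    flags = fcntl(fds[1], F_GETFL);
    if (flags != -1 && fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) != -1) {
        for (;;) {
            ssize_t n = write(fds[1], chunk, sizeof chunk);
            if (n > 0) {
                total += n;
                continue;
            }
            if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                result = total;     /* the pipe is now full */
            break;
        }
    }

    close(fds[0]);
    close(fds[1]);
    return result;
}
```

A blocking writer would simply stop at that byte count until the other
end reads something. If both co-processes reach that state at the same
time, neither ever reads, which is the deadlock described above.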

Briefly, the drivers use a synchronous RPC mechanism:

	Phase				Client		Driver
	
	1. Client sends request		write(1)	read(1)
	2. Driver reads request		read(2)		read(1)
	3. Driver processes request	read(2)		busy(2)
	4. Driver sends response	read(2)		write(3)
	5. Client reads response	read(2)		read(1)

Client:

	do_request(...)
	{
	1:	send_request();
	2:	read_response();
	}

Driver:

	main_loop(...)
	{
		while (!eof(...))
		{
		1:	read_request();
		2:	process_request();
		3:	send_response();
		}
	}

For this mechanism to work, each process (client and driver) must read
exactly what it is sent before proceeding, and must send exactly what
the other side expects to receive. The driver must not send the response
before it has read the entire request, even if it intends to discard
it. Similarly, the client must read any response before sending the
next request; it can't send a bunch of requests then read all of the
responses later.
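
As an illustration only, the lock-step protocol can be sketched on
POSIX with fork() and two pipes. Everything here is hypothetical
(rpc_demo(), the int-sized messages, the "double the request"
processing) and it is not the dbmi code. The messages are a single int
each, so this sketch relies on such small pipe writes being atomic; a
real implementation would have to loop on short counts.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Driver: read one request, process it, send one response, repeat
 * until the client closes its end (EOF). */
static void driver_loop(int in, int out)
{
    int request;

    while (read(in, &request, sizeof request) == (ssize_t)sizeof request) {
        int response = request * 2;     /* stand-in for real processing */

        if (write(out, &response, sizeof response) != (ssize_t)sizeof response)
            break;
    }
}

/* Client: send one request, then read its response before returning. */
static int do_request(int out, int in, int request, int *response)
{
    if (write(out, &request, sizeof request) != (ssize_t)sizeof request)
        return -1;
    if (read(in, response, sizeof *response) != (ssize_t)sizeof *response)
        return -1;
    return 0;
}

/* Run three request/response round trips; returns how many succeeded,
 * or -1 if the pipes or the child could not be created. */
int rpc_demo(void)
{
    int to_driver[2], from_driver[2];
    int ok = 0;
    int i;
    pid_t pid;

    if (pipe(to_driver) != 0)
        return -1;
    if (pipe(from_driver) != 0) {
        close(to_driver[0]);
        close(to_driver[1]);
        return -1;
    }

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {                     /* child: the driver */
        close(to_driver[1]);
        close(from_driver[0]);
        driver_loop(to_driver[0], from_driver[1]);
        _exit(0);
    }

    close(to_driver[0]);                /* parent: the client */
    close(from_driver[1]);
    for (i = 1; i <= 3; i++) {
        int response = 0;

        if (do_request(to_driver[1], from_driver[0], i, &response) == 0
            && response == 2 * i)
            ok++;
    }
    close(to_driver[1]);                /* EOF ends the driver loop */
    close(from_driver[0]);
    waitpid(pid, NULL, 0);
    return ok;
}
```

Because the client never issues a second request before reading the
first response, at most one message is in flight in each direction, so
neither pipe can fill up.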

Having said all of that, there can't be any fundamental design
problems if it works fine on Unix. One possibility is incorrect
handling of an error condition which only occurs on Windows.

It's also possible that the Windows stdio implementation doesn't like
_pipe(). If it doesn't handle short reads/writes correctly, that won't
affect files, and it probably won't affect pipes which never fill up,
but it may fall down on a full pipe.
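
For reference, "handling short reads/writes correctly" at the POSIX
level means looping until the whole buffer has been transferred. The
following is a generic sketch with hypothetical helper names
(write_all(), read_all()); it is not what MSVCRT or the XDR library
actually does internally:

```c
#include <errno.h>
#include <unistd.h>

/* Write exactly count bytes, retrying after short writes and EINTR.
 * Returns 0 on success, -1 on error. */
int write_all(int fd, const void *buf, size_t count)
{
    const char *p = buf;

    while (count > 0) {
        ssize_t n = write(fd, p, count);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;                 /* short write: send the rest next time */
        count -= (size_t)n;
    }
    return 0;
}

/* Read exactly count bytes, retrying after short reads and EINTR.
 * Returns 0 on success, -1 on error or if EOF arrives too early. */
int read_all(int fd, void *buf, size_t count)
{
    char *p = buf;

    while (count > 0) {
        ssize_t n = read(fd, p, count);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1;          /* EOF before the full message arrived */
        p += n;
        count -= (size_t)n;
    }
    return 0;
}
```

A stdio implementation has to do the equivalent internally; if it
treats a short count from the underlying pipe as an error or as EOF,
the caller sees exactly the kind of failure that only appears once the
pipe fills.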

> Another thing makes me believe that Windows itself is the culprit
> here: I tested the same stuff on a Windows XP SP2 system, clean
> install from scratch. On this system, almost all the runs (97%)
> finished cleanly!
> 
> Obviously MS did some improvements to process communication in that
> release ...

Not necessarily. If the issue is related to timing, it could just be
that everything runs more smoothly due to the clean install (rather
than SP2).

> Setting the _O_NOINHERIT flag makes no difference.
> 
> So, how are we going to go ahead?

Figure out how to debug the processes. If you can't get gdb to work, I
can only suggest logging every significant event at the lowest level,
i.e. log every read/write operation: the arguments, the return code,
and the complete data (i.e. the buffer contents after read and before
write). This is all done in the RPC/XDR library, in xdr_stdio.c. It
will probably help to also log the beginning/end of each procedure
call (i.e. lib/db/dbmi_base/xdrprocedure.c).
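
A rough sketch of the kind of logging I mean, as wrappers around
fread()/fwrite(). The names logged_fread() and logged_fwrite() and the
extra "log" argument are made up for illustration; how they would be
hooked into xdr_stdio.c is left open:

```c
#include <stdio.h>

/* Hex-dump len bytes of buf to the log stream. */
static void log_bytes(FILE *log, const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        fprintf(log, " %02x", buf[i]);
    fputc('\n', log);
}

/* fwrite() wrapper: log the arguments and the outgoing data before the
 * call, so a write which never returns is still visible in the log. */
size_t logged_fwrite(const void *buf, size_t size, size_t n,
                     FILE *fp, FILE *log)
{
    size_t ret;

    fprintf(log, "fwrite(size=%lu, n=%lu) data:",
            (unsigned long)size, (unsigned long)n);
    log_bytes(log, buf, size * n);
    fflush(log);

    ret = fwrite(buf, size, n, fp);

    fprintf(log, "fwrite -> %lu\n", (unsigned long)ret);
    fflush(log);
    return ret;
}

/* fread() wrapper: log the arguments before the call, then the return
 * value and the data which actually arrived. */
size_t logged_fread(void *buf, size_t size, size_t n,
                    FILE *fp, FILE *log)
{
    size_t ret;

    fprintf(log, "fread(size=%lu, n=%lu)\n",
            (unsigned long)size, (unsigned long)n);
    fflush(log);

    ret = fread(buf, size, n, fp);

    fprintf(log, "fread -> %lu data:", (unsigned long)ret);
    log_bytes(log, buf, size * ret);
    fflush(log);
    return ret;
}
```

The fflush() after every entry matters: if a process hangs, the last
line in its log shows which call it is stuck in. Each process should
write to its own log file.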

-- 
Glynn Clements <glynn at gclements.plus.com>



