[gdal-dev] Multithread deadlock

Francisco Javier Calzado francisco.javier.calzado at ericsson.com
Tue Sep 27 07:35:26 PDT 2016


Hi Even,

Your changes have fixed the issue with multiple threads; they can now write simultaneously without getting locked. I've run some tests with the GTiff format and everything went OK. However, with a custom-made driver for a new format (called Wizard), a new problem has arisen :(.

In that case, although the threads finish their tasks and generate their corresponding output rasters, 2 weird things happen randomly and (again) only when the block cache gets full during writing:
- Some blocks (lines) are missing, so they appear in the file with the no-data value.
- The resulting file format is corrupted.

It seems that 2 or more threads writing simultaneously (albeit to different datasets) mess up each other's blocks in the cache when trying to flush dirty blocks. I have tried to manually flush dirty blocks (calling either GDALRasterBlock::FlushDirtyBlocks() or block->Write()) every time a block is completely written in the cache, but it doesn't work; a sketch of that attempt follows below.
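For reference, the manual flush attempt looked roughly like this (a minimal sketch, not my literal code; poBand, nXBlock and nYBlock stand in for my actual loop variables):

// *********************************************************************
// Sketch of the manual flush attempt, done after each block is filled.
GDALRasterBlock *poBlock = poBand->GetLockedBlockRef(nXBlock, nYBlock);
if (poBlock != NULL)
{
	// ... fill poBlock->GetDataRef() with the pixel values ...
	poBlock->MarkDirty();

	// Attempt 1: write this block to disk immediately.
	poBlock->Write();

	// Attempt 2: flush every dirty block in the global cache instead.
	// GDALRasterBlock::FlushDirtyBlocks();

	poBlock->DropLock();
}
// *********************************************************************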

My driver is simple, nothing too complex; most of it was actually taken from other drivers already implemented. Check the IWriteBlock() method below, just in case I made a mistake.

Do you have any clue what is happening here?


// *********************************************************************

CPLErr WIZARDRasterBand::IWriteBlock(CPL_UNUSED int nBlockXOff, int nBlockYOff, void * pImage)
{
	CPLAssert(nBlockXOff == 0);
	CPLAssert(eAccess == GA_Update);

	// By default, the count of pixels we want to write will be the natural block
	// size in pixels, except for the case of a padded block.
	int nPixelCount = this->nBlockSizeInPixels;
	if (this->bBlockPadding && (nBlockYOff == this->nPaddedBlockYIndex))
	{
		nPixelCount = this->nPaddedBlockSizeInPixels;
	}
	
	// Seek to the first byte of the desired block. Cast to vsi_l_offset
	// so that the byte offset does not overflow for large rasters.
	if (VSIFSeekL(this->poWDS->fp,
		this->nHeaderSizeInBytes +
		static_cast<vsi_l_offset>(this->nBlockSizeInBytes) * nBlockYOff,
		SEEK_SET) != 0)
	{
		CPLError(CE_Failure, CPLE_FileIO,
			"Seek failed when writing block to Wizard driver: %s", VSIStrerror(errno));
		return CE_Failure;
	}

	// Now write. VSIFWriteL() returns a size_t, so compare against an
	// unsigned count.
	if (VSIFWriteL(pImage, this->nPixelSizeInBytes, nPixelCount, this->poWDS->fp) !=
		static_cast<size_t>(nPixelCount))
	{
		CPLError(CE_Failure, CPLE_FileIO,
			"Write failed when dumping block to Wizard driver: %s", VSIStrerror(errno));
		return CE_Failure;
	}

	return CE_None;
}
// *********************************************************************


Thanks,
BR. Javier.



-----Original Message-----
From: gdal-dev [mailto:gdal-dev-bounces at lists.osgeo.org] On Behalf Of Francisco Javier Calzado
Sent: 27 September, 2016 09:50
To: Even Rouault <even.rouault at spatialys.com>; gdal-dev at lists.osgeo.org
Subject: Re: [gdal-dev] Multithread deadlock

Hi Even,

Thanks for such a quick fix! I'm going to apply the patch and recompile GDAL, and I'll let you know :)

Keep in touch.
Best Regards,
Javier Calzado



-----Original Message-----
From: Even Rouault [mailto:even.rouault at spatialys.com]
Sent: 26 September, 2016 17:53
To: gdal-dev at lists.osgeo.org
Cc: Francisco Javier Calzado <francisco.javier.calzado at ericsson.com>; Andrew Bell <andrew.bell.ia at gmail.com>
Subject: Re: [gdal-dev] Multithread deadlock

Hi,

I admire Andrew's enthusiasm and would happily let him tackle the next bug reported in this area ;-)

I could reproduce the deadlock with the same stack trace and have pushed a fix per https://trac.osgeo.org/gdal/ticket/6661. This was actually not a typical deadlock situation, but undefined behaviour caused by trying to acquire a recursive mutex that had previously been released more times than it had been acquired.

Without this patch, a "workaround" would be to set the GDAL_ENABLE_READ_WRITE_MUTEX config option to NO to disable the per-dataset mutex. I had added this option since I wasn't really sure that the per-dataset mutex wouldn't introduce deadlock situations. But when setting it, you'll get undefined behaviour (= potentially crashing or causing corruptions) due to 2 threads potentially calling the IWriteBlock() method of the same dataset, which was the GDAL 1.X behaviour.
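For example, from C++ (a sketch; the option can equally be set as an environment variable):

#include "cpl_conv.h"

// Workaround only: disables the per-dataset read/write mutex, at the
// price of unsynchronized IWriteBlock() calls on a shared dataset.
CPLSetConfigOption("GDAL_ENABLE_READ_WRITE_MUTEX", "NO");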

Clearly, multi-threading scenarios that involve writing are the point where the global block cache mechanism plus the band-aid of the per-dataset R/W mutex show their limits in terms of design & maintenance complexity, and scalability. A per-dataset block cache would avoid such headaches (the drawback would be having to define a per-dataset block cache size).
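For context, the cache that fills up here is the single process-wide one; the only knob today is its global size (a sketch; 256 MB is an arbitrary illustration):

#include "gdal.h"

// The block cache is currently global: one size shared by every open
// dataset in the process.
GDALSetCacheMax64(256 * 1024 * 1024);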

Even

> Sure Andrew,
> 
> Here is the call stack from Visual Studio for both threads (I just
> copied the top frames where GDAL is involved, for easier reading; if
> you need the whole stack, just let me know):
> 
> THREAD 1:
> 
>   ntdll.dll!_NtWaitForSingleObject@12()    Unknown
>   ntdll.dll!_RtlpWaitOnCriticalSection@8()    Unknown
>   ntdll.dll!_RtlEnterCriticalSection@4()    Unknown
>   gdal201.dll!CPLAcquireMutex(_CPLMutex * hMutexIn, double dfWaitInSeconds) Line 806    C++
>   gdal201.dll!GDALDataset::EnterReadWrite(GDALRWFlag eRWFlag) Line 6102    C++
>   gdal201.dll!GDALRasterBand::EnterReadWrite(GDALRWFlag eRWFlag) Line 5290    C++
>   gdal201.dll!GDALRasterBlock::Write() Line 742    C++
>   gdal201.dll!GDALRasterBlock::Internalize() Line 917    C++
>   gdal201.dll!GDALRasterBand::GetLockedBlockRef(int nXBlockOff, int nYBlockOff, int bJustInitialize) Line 1126    C++
>   Test.exe!RasterBandPixelAccess::SetValueAtPixel<short>(const int & pX, const int & pY, const short & value) Line 180    C++
> 
> THREAD 2:
> 
>   ntdll.dll!_NtWaitForSingleObject@12()    Unknown
>   KernelBase.dll!_WaitForSingleObjectEx@12()    Unknown
>   kernel32.dll!_WaitForSingleObjectExImplementation@12()    Unknown
>   kernel32.dll!_WaitForSingleObject@8()    Unknown
>   gdal201.dll!CPLCondWait(_CPLCond * hCond, _CPLMutex * hClientMutex) Line 937    C++
>   gdal201.dll!GDALAbstractBandBlockCache::WaitKeepAliveCounter() Line 134    C++
>   gdal201.dll!GDALArrayBandBlockCache::FlushCache() Line 312    C++
>   gdal201.dll!GDALRasterBand::FlushCache() Line 865    C++
>   gdal201.dll!GDALDataset::FlushCache() Line 386    C++
>   gdal201.dll!GDALPamDataset::FlushCache() Line 159    C++
>   gdal201.dll!GTiffDataset::Finalize() Line 6180    C++
>   gdal201.dll!GTiffDataset::~GTiffDataset() Line 6135    C++
>   gdal201.dll!GTiffDataset::`scalar deleting destructor'(unsigned int)    C++
>   gdal201.dll!GDALClose(void * hDS) Line 2998    C++
>   Test.exe!main::__l2::<lambda>(std::basic_string<char,std::char_traits<char>,std::allocator<char> > sourcefilePath, std::basic_string<char,std::char_traits<char>,std::allocator<char> > targetFilePath, int threadID) Line 66    C++
> 
> From: Andrew Bell [mailto:andrew.bell.ia at gmail.com]
> Sent: 26 September, 2016 16:06
> To: Francisco Javier Calzado <francisco.javier.calzado at ericsson.com>
> Cc: gdal-dev at lists.osgeo.org
> Subject: Re: [gdal-dev] Multithread deadlock
> 
> Deadlocks are usually easy to debug if you can get a traceback when 
> deadlocked.  If you can attach with gdb (or run in the debugger) and 
> reproduce and post the stack at the time ('where' from gdb), it should 
> be no problem to fix.  Trying to reproduce on different hardware can 
> be difficult.
> 
> On Mon, Sep 26, 2016 at 9:33 AM, Francisco Javier Calzado
> <francisco.javier.calzado at ericsson.com> wrote:
> 
> Hi guys,
> 
> I am experiencing a deadlock with just 2 threads in a single-reader,
> multiple-writer scenario. That is, the threads read from the same input
> file (using different handles) and then write different output files.
> The deadlock comes when the block cache gets filled. The situation is the following:
> 
> 
> - T1 and T2 read datasets D1 and D2, both pointing to the same input
>   raster (GTiff).
> 
> - The block cache gets filled.
> 
> - T1 tries to lock one block in the cache to write data. But the cache
>   is full, so it tries to free dirty blocks belonging to T2 (as seen in
>   the Internalize() method). For that purpose, it apparently requires a
>   mutex from D2.
> 
> - However, T2 is in a state where it must wait for thread T1 to finish
>   working with T2's blocks. In this state, T2 holds a mutex acquired
>   from D2.
> 
> At least, that is what seems to be happening based on the source code.
> Maybe I'm wrong; I don't have a full picture of how GDAL works
> internally. The thing is that I can reproduce this issue with the
> following test code and dataset (a stripped-down sketch of the scenario
> follows below):
> https://drive.google.com/file/d/0B-OCl1FjBi0YSkU3RUozZjc5SnM/view?usp=sharing
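> In essence, the test does something like the following (a minimal
> sketch, not the actual test code: the file names and GDT_Int16 type are
> placeholders, and the real test writes pixel by pixel rather than line
> by line):
> 
> #include "gdal_priv.h"
> #include <thread>
> #include <vector>
> 
> // Each thread opens its own handle on the shared input and writes its
> // own output file, line by line, until the block cache fills up.
> static void CopyRaster(const char *pszSrc, const char *pszDst)
> {
>     GDALDataset *poSrc =
>         static_cast<GDALDataset *>(GDALOpen(pszSrc, GA_ReadOnly));
>     GDALRasterBand *poSrcBand = poSrc->GetRasterBand(1);
>     const int nX = poSrc->GetRasterXSize();
>     const int nY = poSrc->GetRasterYSize();
> 
>     GDALDriver *poDrv = GetGDALDriverManager()->GetDriverByName("GTiff");
>     GDALDataset *poDst = poDrv->Create(pszDst, nX, nY, 1, GDT_Int16, NULL);
>     GDALRasterBand *poDstBand = poDst->GetRasterBand(1);
> 
>     std::vector<short> aLine(nX);
>     for (int i = 0; i < nY; i++)
>     {
>         poSrcBand->RasterIO(GF_Read, 0, i, nX, 1, aLine.data(), nX, 1,
>                             GDT_Int16, 0, 0, NULL);
>         poDstBand->RasterIO(GF_Write, 0, i, nX, 1, aLine.data(), nX, 1,
>                             GDT_Int16, 0, 0, NULL);
>     }
>     GDALClose(poDst); // FlushCache() runs here: where T2 waits in the trace.
>     GDALClose(poSrc);
> }
> 
> int main()
> {
>     GDALAllRegister();
>     std::thread t1(CopyRaster, "input.tif", "out1.tif");
>     std::thread t2(CopyRaster, "input.tif", "out2.tif");
>     t1.join();
>     t2.join();
>     return 0;
> }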
> 
> Oddly enough, ticket #6163 is supposed to fix this, but it's failing in
> my case. I am working with GDAL 2.1.0 under a VS2015 (x32, Debug)
> build.
> 
> Even, what do you think?
> 
> Thanks!
> Javier C.
> 
> 
> 
> 
> 
> --
> Andrew Bell
> andrew.bell.ia at gmail.com

--
Spatialys - Geospatial professional services
http://www.spatialys.com

_______________________________________________
gdal-dev mailing list
gdal-dev at lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/gdal-dev

