[gdal-dev] Multithread deadlock

Francisco Javier Calzado francisco.javier.calzado at ericsson.com
Tue Sep 27 00:50:19 PDT 2016


Hi Even,

Thanks for such a quick fix! I'm going to apply the patch, recompile GDAL, and let you know how it goes :)

Keep in touch.
Best Regards,
Javier Calzado



-----Original Message-----
From: Even Rouault [mailto:even.rouault at spatialys.com] 
Sent: 26 September, 2016 17:53
To: gdal-dev at lists.osgeo.org
Cc: Francisco Javier Calzado <francisco.javier.calzado at ericsson.com>; Andrew Bell <andrew.bell.ia at gmail.com>
Subject: Re: [gdal-dev] Multithread deadlock

Hi,

I admire Andrew's enthusiasm and would happily let him tackle the next bug reported in this area ;-)

I could reproduce the deadlock with the same stack trace and have pushed a fix per https://trac.osgeo.org/gdal/ticket/6661. This was actually not a typical deadlock situation, but undefined behaviour caused by trying to acquire a recursive mutex that had previously been released more times than it had been acquired.
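
To illustrate the failure mode (this is not the actual GDAL fix, and every name below is hypothetical): releasing a std::recursive_mutex that the calling thread does not own is undefined behaviour in standard C++ as well. One common way to keep acquires and releases balanced is to let a scope guard own the release, roughly like this:

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-ins for illustration only; this is not GDAL code.
static std::recursive_mutex g_datasetMutex;
static int g_depth = 0;     // current nesting depth, only touched while locked
static int g_maxDepth = 0;  // deepest nesting observed so far

// Each call acquires the mutex once and releases it exactly once when the
// guard goes out of scope, including on early returns and exceptions.
int WriteBlock(int nestedCalls)
{
    std::lock_guard<std::recursive_mutex> guard(g_datasetMutex);
    ++g_depth;
    if (g_depth > g_maxDepth)
        g_maxDepth = g_depth;

    int written = 1;
    if (nestedCalls > 0)
        written += WriteBlock(nestedCalls - 1);  // re-entrant acquire, same thread

    --g_depth;
    return written;
}
```

Calling WriteBlock(2) nests three acquisitions on the same thread and unwinds them in reverse order, so the depth counter is back to zero afterwards and no release can happen without a matching acquire.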

Without this patch, a "workaround" would be to set the GDAL_ENABLE_READ_WRITE_MUTEX config option to NO to disable the per-dataset mutex. I had added this option since I wasn't really sure that the per-dataset mutex wouldn't introduce deadlock situations. But with the mutex disabled, you'll get undefined behaviour (= potentially crashing or causing corruption) due to 2 threads potentially calling the IWriteBlock() method of the same dataset, which was the GDAL 1.X behaviour.
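
For reference, a minimal sketch of setting that option from the program itself through the process environment, which GDAL consults when a config option has not been set explicitly; with the GDAL headers available, CPLSetConfigOption("GDAL_ENABLE_READ_WRITE_MUTEX", "NO") should be the more direct route. The helper names are hypothetical, and calling this before any dataset is opened is an assumption on my side:

```cpp
#include <cassert>
#include <stdlib.h>
#include <string>

// Hypothetical helper, not part of GDAL: sets an environment variable with
// either the Microsoft CRT or a POSIX C library.
bool SetEnvVar(const char *name, const char *value)
{
#ifdef _WIN32
    return _putenv_s(name, value) == 0;
#else
    return setenv(name, value, 1) == 0;  // 1 = overwrite an existing value
#endif
}

// Requests that the per-dataset read/write mutex be disabled.
// Assumption: called before any dataset is opened. Note that this brings
// back the unsafe concurrent IWriteBlock() situation described above.
bool DisableReadWriteMutex()
{
    return SetEnvVar("GDAL_ENABLE_READ_WRITE_MUTEX", "NO");
}
```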

Clearly, multi-threading scenarios involving writing are the point where the global block cache mechanism + the band-aid of the per-dataset R/W mutex are showing their limits in terms of design & maintenance complexity, and scalability. A per-dataset block cache would avoid such headaches (the drawback would be having to define a per-dataset block cache size).
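
To make the per-dataset idea a bit more concrete, here is a rough sketch only. It is not how GDAL's block cache is implemented; the class name, the LRU policy, and the limit expressed as a block count are all assumptions for illustration. Each dataset owns its cache and its lock, so evicting a block never needs another dataset's mutex:

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <mutex>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sketch, not GDAL's block cache: one instance per dataset.
class PerDatasetBlockCache
{
  public:
    explicit PerDatasetBlockCache(std::size_t maxBlocks) : m_maxBlocks(maxBlocks) {}

    // Stores a block and evicts this dataset's least recently used blocks
    // while the per-dataset limit is exceeded. Returns the number evicted.
    std::size_t Put(int blockId, std::vector<unsigned char> data)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = m_index.find(blockId);
        if (it != m_index.end())
        {
            it->second->second = std::move(data);            // refresh contents
            m_lru.splice(m_lru.begin(), m_lru, it->second);  // mark most recent
            return 0;
        }
        m_lru.emplace_front(blockId, std::move(data));
        m_index[blockId] = m_lru.begin();

        std::size_t evicted = 0;
        while (m_lru.size() > m_maxBlocks)
        {
            // A real cache would write a dirty block back to disk here.
            m_index.erase(m_lru.back().first);
            m_lru.pop_back();
            ++evicted;
        }
        return evicted;
    }

    bool Contains(int blockId) const
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_index.count(blockId) != 0;
    }

  private:
    using Entry = std::pair<int, std::vector<unsigned char>>;
    std::size_t m_maxBlocks;
    mutable std::mutex m_mutex;
    std::list<Entry> m_lru;  // front = most recently used
    std::unordered_map<int, std::list<Entry>::iterator> m_index;
};
```

With two instances, filling one up only ever evicts that instance's own blocks, which is the property that removes the cross-dataset locking. The cost is the one mentioned above: each dataset needs its own size limit.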

Even

> Sure Andrew,
> 
> Here is the call stack from Visual Studio for both threads (I just
> copied the top calls where GDAL is involved, for easier reading. If
> you need the whole stack, just let me know):
> 
> THREAD 1:
> 
>   ntdll.dll!_NtWaitForSingleObject@12()    Unknown
>   ntdll.dll!_RtlpWaitOnCriticalSection@8()    Unknown
>   ntdll.dll!_RtlEnterCriticalSection@4()    Unknown
>   gdal201.dll!CPLAcquireMutex(_CPLMutex * hMutexIn, double dfWaitInSeconds) Line 806    C++
>   gdal201.dll!GDALDataset::EnterReadWrite(GDALRWFlag eRWFlag) Line 6102    C++
>   gdal201.dll!GDALRasterBand::EnterReadWrite(GDALRWFlag eRWFlag) Line 5290    C++
>   gdal201.dll!GDALRasterBlock::Write() Line 742    C++
>   gdal201.dll!GDALRasterBlock::Internalize() Line 917    C++
>   gdal201.dll!GDALRasterBand::GetLockedBlockRef(int nXBlockOff, int nYBlockOff, int bJustInitialize) Line 1126    C++
> > Test.exe!RasterBandPixelAccess::SetValueAtPixel<short>(const int & pX, const int & pY, const short & value) Line 180    C++
> 
> THREAD 2:
> 
>   ntdll.dll!_NtWaitForSingleObject@12()    Unknown
>   KernelBase.dll!_WaitForSingleObjectEx@12()    Unknown
>   kernel32.dll!_WaitForSingleObjectExImplementation@12()    Unknown
>   kernel32.dll!_WaitForSingleObject@8()    Unknown
>   gdal201.dll!CPLCondWait(_CPLCond * hCond, _CPLMutex * hClientMutex) Line 937    C++
>   gdal201.dll!GDALAbstractBandBlockCache::WaitKeepAliveCounter() Line 134    C++
>   gdal201.dll!GDALArrayBandBlockCache::FlushCache() Line 312    C++
>   gdal201.dll!GDALRasterBand::FlushCache() Line 865    C++
>   gdal201.dll!GDALDataset::FlushCache() Line 386    C++
>   gdal201.dll!GDALPamDataset::FlushCache() Line 159    C++
>   gdal201.dll!GTiffDataset::Finalize() Line 6180    C++
>   gdal201.dll!GTiffDataset::~GTiffDataset() Line 6135    C++
>   gdal201.dll!GTiffDataset::`scalar deleting destructor'(unsigned int)    C++
>   gdal201.dll!GDALClose(void * hDS) Line 2998    C++
> > Test.exe!main::__l2::<lambda>(std::basic_string<char,std::char_traits<char>,std::allocator<char> > sourcefilePath, std::basic_string<char,std::char_traits<char>,std::allocator<char> > targetFilePath, int threadID) Line 66    C++
> 
> From: Andrew Bell [mailto:andrew.bell.ia at gmail.com]
> Sent: 26 September, 2016 16:06
> To: Francisco Javier Calzado <francisco.javier.calzado at ericsson.com>
> Cc: gdal-dev at lists.osgeo.org
> Subject: Re: [gdal-dev] Multithread deadlock
> 
> Deadlocks are usually easy to debug if you can get a traceback when 
> deadlocked.  If you can attach with gdb (or run in the debugger) and 
> reproduce and post the stack at the time ('where' from gdb), it should 
> be no problem to fix.  Trying to reproduce on different hardware can 
> be difficult.
> 
> On Mon, Sep 26, 2016 at 9:33 AM, Francisco Javier Calzado
> <francisco.javier.calzado at ericsson.com> wrote:
>
> Hi guys,
> 
> I am experiencing a deadlock with just 2 threads in a single reader &
> multiple writer scenario. That is, the threads read from the same input
> file (using different handles) and then write different output files.
> The deadlock occurs when the block cache gets filled. The situation is the following:
> 
> 
> - T1 and T2 read datasets D1 and D2, both pointing to the same input
>   raster (GTiff).
>
> - The block cache gets filled.
>
> - T1 tries to lock one block in the cache to write data. But the cache is
>   full, so it tries to free dirty blocks from T2 (as seen in the
>   Internalize() method). For that purpose, it apparently requires a mutex
>   from D2.
>
> - However, T2 is in a state where it must wait for thread T1 to finish
>   working with T2’s blocks. In this state, T2 holds a mutex acquired
>   from D2.
> 
> At least, that is what seems to be happening based on the source code.
> Maybe I’m wrong; I don’t have a full picture of how GDAL works
> internally. The thing is that I can reproduce this issue with the
> following test code and dataset:
> https://drive.google.com/file/d/0B-OCl1FjBi0YSkU3RUozZjc5SnM/view?usp=sharing
> 
> Oddly enough, ticket #6163 is supposed to fix this, but the fix is
> not working in my case. I am working with GDAL 2.1.0, compiled with
> VS2015 (x32, Debug).
> 
> Even, what do you think?
> 
> Thanks!
> Javier C.
> 
> 
> --
> Andrew Bell
> andrew.bell.ia at gmail.com

--
Spatialys - Geospatial professional services http://www.spatialys.com

