[gdal-dev] The pthread_atfork() issue

Even Rouault even.rouault at spatialys.com
Mon Dec 4 06:45:19 PST 2023


Carsten,

thanks for the detailed analysis.

There is definitely a cost to getpid() in some circumstances. The worst 
case I can imagine is on code like the one below that looks over 
OGRCoordinateTransformation::Transform() many times. For each call to 
that function, there's a OSRGetProjTLSContext() call which involves 
getpid() when pthread_atfork is not used. The overhead is ~ 15% with 
that reproducer, so definitely measurable.

In https://github.com/OSGeo/gdal/pull/8909, I've taken into account your 
suggestion to call pthread_atfork() only once.  This is true that the 
detection might not work well with threads, but :

- generally mixing fork() and threads is a very bad idea as this leads 
to many issues like mutexes that would be taken by a thread that is not 
the one forked being let in a state where the child process can't 
acquire them anymore. So the safer practice is that you use either 
threading , or fork() your main process quite at its beginning to have 
child processes

- but even if you have let's say 2 threads A and B, and A calls fork(). 
Then the child process will only have a main thread at its creation, so 
it will inherit only from the one TLS instance of OSRPJContextHolder 
duplicated from A. Hence having a global g_bForkOccurred shouldn't be an 
issue. At least I believe... Anyway as mentioned in my previous point, 
mixing both paradigm is generally a bad idea.

Even

#include "ogr_spatialref.h"

int main(int argc, char **argv) {
     OGRSpatialReference srs1;
     srs1.importFromEPSG(4326);
     OGRSpatialReference srs2;
     srs2.importFromEPSG(32631);
     auto poCT = OGRCreateCoordinateTransformation(&srs1, &srs2);
     for(int i = 0; i < 5 * 1000 * 1000; ++i)
     {
         double dfX = 49;
         double dfY = 2;
         poCT->TransformWithErrorCodes(1, &dfX, &dfY, nullptr, nullptr, 
nullptr);
     }
     delete poCT;
     return 0;
}

Le 04/12/2023 à 12:21, Carsten Schultz via gdal-dev a écrit :
> Hello everyone!
>
> Please let me know if you would have preferred to discuss this on Github.
>
> I have run into a problem with the use of pthread_atfork() in ogr_proj_p.cpp. I have since seen that the code has been disabled for macOS in 3.8 in response to https://github.com/OSGeo/gdal/issues/8497, and indeed I encountered the problem in 3.6.4 on macOS. The problems are however more general in my opinion.
>
>
> I have two concerns.
>
> 1) The call `pthread_atfork(nullptr, nullptr, ForkOccurred)` install a handler which sets a flag in a global variable in a forked child. This handler should be installed exactly once. (Indeed there is no provision to remove such a handler.) However, the call is done in the constructor of OSRPJContextHolder and since there is a thread local instance of this, that constructor can be called an unlimited number of times.
>
> 2) The static flag g_bForkOccurred that is set by ForkOccured is used to signal that a fork has occurred and OSRPJContextHolder will clear that flag when it acts upon it. This means that at most one OSRPJContextHolder instance will observe the flag. That is probably not a problem as long as there is only one static or thread_local instance of OSRPJContextHolder, but it doesn’t look conceptually clean to me.
>
>
> To see why (1) is a problem (except for the handler running several thousand time should a fork actually occur), there may be a limit to how many handlers can be actually registered, and pthread_atfork will return E_NOMEM if this limit is exceeded. Now since the code doesn’t check the return code and shouldn’t have registered the handler multiple times anyway, this will usually not be noticed. However, an unrelated part of the program may try to use pthread_atfork after GDAL has taken up all slots and react less nonchalantly to a failure. This is what happened to me. This is most likely the reason if a problem is noticed on macOS and not Linux, because on macOS the limit can be as low as 170 calls to pthread_atfork (4096 / 24, vm page size / size of 3 pointers) while I haven’t encountered such a limit on Linux.
>
> Additionally pthread_atfork() is most likely not thread safe, and it is called without synchronisation.
>
>
> Now all of this is of course easy to fix, but at first I wanted to ask whether you consider it actually worth the effort. If the getpid() version works fine, is that one system call expensive enough to warrant the effort of maintaining the pthread_atfork code?
>
> Thanks for your time,
>
> Carsten
> _______________________________________________
> gdal-dev mailing list
> gdal-dev at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/gdal-dev

-- 
http://www.spatialys.com
My software is free, but my time generally not.



More information about the gdal-dev mailing list