[gdal-dev] The pthread_atfork() issue
Carsten Schultz
carsten.schultz at pix4d.com
Mon Dec 4 09:07:43 PST 2023
Even,
Thanks for the quick reply. If the getpid() has actually been measured to be an overhead, then it’s even better that this will now be resolved so that the faster code can be re-enabled for the Macs.
Since you have already created a PR I will answer there.
Best
Carsten
> On 4. Dec 2023, at 15:45, Even Rouault <even.rouault at spatialys.com> wrote:
>
> Carsten,
>
> thanks for the detailed analysis.
>
> There is definitely a cost to getpid() in some circumstances. The worst case I can imagine is on code like the one below that looks over OGRCoordinateTransformation::Transform() many times. For each call to that function, there's a OSRGetProjTLSContext() call which involves getpid() when pthread_atfork is not used. The overhead is ~ 15% with that reproducer, so definitely measurable.
>
> In https://github.com/OSGeo/gdal/pull/8909, I've taken into account your suggestion to call pthread_atfork() only once. This is true that the detection might not work well with threads, but :
>
> - generally mixing fork() and threads is a very bad idea as this leads to many issues like mutexes that would be taken by a thread that is not the one forked being let in a state where the child process can't acquire them anymore. So the safer practice is that you use either threading , or fork() your main process quite at its beginning to have child processes
>
> - but even if you have let's say 2 threads A and B, and A calls fork(). Then the child process will only have a main thread at its creation, so it will inherit only from the one TLS instance of OSRPJContextHolder duplicated from A. Hence having a global g_bForkOccurred shouldn't be an issue. At least I believe... Anyway as mentioned in my previous point, mixing both paradigm is generally a bad idea.
>
> Even
>
> #include "ogr_spatialref.h"
>
> int main(int argc, char **argv) {
> OGRSpatialReference srs1;
> srs1.importFromEPSG(4326);
> OGRSpatialReference srs2;
> srs2.importFromEPSG(32631);
> auto poCT = OGRCreateCoordinateTransformation(&srs1, &srs2);
> for(int i = 0; i < 5 * 1000 * 1000; ++i)
> {
> double dfX = 49;
> double dfY = 2;
> poCT->TransformWithErrorCodes(1, &dfX, &dfY, nullptr, nullptr, nullptr);
> }
> delete poCT;
> return 0;
> }
>
> Le 04/12/2023 à 12:21, Carsten Schultz via gdal-dev a écrit :
>> Hello everyone!
>>
>> Please let me know if you would have preferred to discuss this on Github.
>>
>> I have run into a problem with the use of pthread_atfork() in ogr_proj_p.cpp. I have since seen that the code has been disabled for macOS in 3.8 in response to https://github.com/OSGeo/gdal/issues/8497, and indeed I encountered the problem in 3.6.4 on macOS. The problems are however more general in my opinion.
>>
>>
>> I have two concerns.
>>
>> 1) The call `pthread_atfork(nullptr, nullptr, ForkOccurred)` install a handler which sets a flag in a global variable in a forked child. This handler should be installed exactly once. (Indeed there is no provision to remove such a handler.) However, the call is done in the constructor of OSRPJContextHolder and since there is a thread local instance of this, that constructor can be called an unlimited number of times.
>>
>> 2) The static flag g_bForkOccurred that is set by ForkOccured is used to signal that a fork has occurred and OSRPJContextHolder will clear that flag when it acts upon it. This means that at most one OSRPJContextHolder instance will observe the flag. That is probably not a problem as long as there is only one static or thread_local instance of OSRPJContextHolder, but it doesn’t look conceptually clean to me.
>>
>>
>> To see why (1) is a problem (except for the handler running several thousand time should a fork actually occur), there may be a limit to how many handlers can be actually registered, and pthread_atfork will return E_NOMEM if this limit is exceeded. Now since the code doesn’t check the return code and shouldn’t have registered the handler multiple times anyway, this will usually not be noticed. However, an unrelated part of the program may try to use pthread_atfork after GDAL has taken up all slots and react less nonchalantly to a failure. This is what happened to me. This is most likely the reason if a problem is noticed on macOS and not Linux, because on macOS the limit can be as low as 170 calls to pthread_atfork (4096 / 24, vm page size / size of 3 pointers) while I haven’t encountered such a limit on Linux.
>>
>> Additionally pthread_atfork() is most likely not thread safe, and it is called without synchronisation.
>>
>>
>> Now all of this is of course easy to fix, but at first I wanted to ask whether you consider it actually worth the effort. If the getpid() version works fine, is that one system call expensive enough to warrant the effort of maintaining the pthread_atfork code?
>>
>> Thanks for your time,
>>
>> Carsten
>> _______________________________________________
>> gdal-dev mailing list
>> gdal-dev at lists.osgeo.org
>> https://lists.osgeo.org/mailman/listinfo/gdal-dev
>
> --
> http://www.spatialys.com
> My software is free, but my time generally not.
>
More information about the gdal-dev
mailing list