[gdal-dev] The pthread_atfork() issue

Carsten Schultz carsten.schultz at pix4d.com
Mon Dec 4 03:21:02 PST 2023


Hello everyone!

Please let me know if you would have preferred to discuss this on GitHub.

I have run into a problem with the use of pthread_atfork() in ogr_proj_p.cpp. I have since seen that the code has been disabled for macOS in 3.8 in response to https://github.com/OSGeo/gdal/issues/8497, and indeed I encountered the problem in 3.6.4 on macOS. The problems are however more general in my opinion.


I have two concerns.

1) The call `pthread_atfork(nullptr, nullptr, ForkOccurred)` installs a handler which sets a flag in a global variable in a forked child. This handler should be installed exactly once. (Indeed there is no provision to remove such a handler.) However, the call is made in the constructor of OSRPJContextHolder, and since there is a thread-local instance of this class, that constructor can be called an unlimited number of times.

2) The static flag g_bForkOccurred that is set by ForkOccurred is used to signal that a fork has occurred, and OSRPJContextHolder will clear that flag when it acts upon it. This means that at most one OSRPJContextHolder instance will observe the flag. That is probably not a problem as long as there is only one static or thread_local instance of OSRPJContextHolder, but it doesn’t look conceptually clean to me.
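
Both points can be illustrated with a minimal sketch. This is not the GDAL code: ContextHolderSketch, g_forkGeneration, g_handlerInstalls and ForkSeenSinceLastCheck are hypothetical names chosen for illustration. It assumes C++11 and registers the handler through std::call_once, so that it is installed a single time however many holders are constructed, and it replaces the clearable flag with a generation counter so that every holder can observe a fork independently.

```cpp
#include <pthread.h>

#include <atomic>
#include <mutex>

namespace {
// Hypothetical replacement for the boolean flag: incremented in every
// forked child and never cleared, so no observer can hide it from another.
std::atomic<unsigned> g_forkGeneration{0};
// Illustration only: counts successful pthread_atfork() registrations.
std::atomic<int> g_handlerInstalls{0};
std::once_flag g_installOnce;

// Child-side atfork handler.
void ForkOccurred() { g_forkGeneration.fetch_add(1); }
}  // namespace

// Hypothetical stand-in for OSRPJContextHolder.
class ContextHolderSketch {
  public:
    ContextHolderSketch() {
        // call_once serialises concurrent constructors and guarantees that
        // pthread_atfork() is invoked a single time per process.
        std::call_once(g_installOnce, [] {
            // The return value is checked: a non-zero result (for example
            // ENOMEM) means the handler was not registered.
            if (pthread_atfork(nullptr, nullptr, ForkOccurred) == 0)
                g_handlerInstalls.fetch_add(1);
        });
        seenGeneration_ = g_forkGeneration.load();
    }

    // Returns true once per fork for each holder, independently of the
    // other holders.
    bool ForkSeenSinceLastCheck() {
        const unsigned current = g_forkGeneration.load();
        if (current == seenGeneration_)
            return false;
        seenGeneration_ = current;
        return true;
    }

  private:
    unsigned seenGeneration_ = 0;
};
```

With this shape, constructing thousands of holders consumes one atfork slot, and a holder on another thread still sees a fork that a different holder has already acted upon.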


To see why (1) is a problem, beyond the handler running several thousand times should a fork actually occur: there may be a limit on how many handlers can be registered, and pthread_atfork will return ENOMEM if this limit is exceeded. Since the code doesn’t check the return code, and shouldn’t have registered the handler multiple times anyway, this will usually go unnoticed. However, an unrelated part of the program may try to use pthread_atfork after GDAL has taken up all slots and react less nonchalantly to a failure. This is what happened to me. It is most likely also the reason why the problem is noticed on macOS and not on Linux: on macOS the limit can be as low as 170 calls to pthread_atfork (4096 / 24, i.e. VM page size / size of 3 pointers), while I haven’t encountered such a limit on Linux.

Additionally, pthread_atfork() is most likely not thread-safe, and it is called without synchronisation.


Now all of this is of course easy to fix, but first I wanted to ask whether you consider it actually worth the effort. If the getpid() version works fine, is that one system call expensive enough to warrant the effort of maintaining the pthread_atfork code?
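
For comparison, a getpid()-based check can be sketched roughly as follows. This is only an illustration under my own assumptions, not the code in ogr_proj_p.cpp; PidForkDetector and ForkedSinceLastCheck are hypothetical names.

```cpp
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: detects a fork by noticing that the process id
// differs from the one recorded at the previous check.
class PidForkDetector {
  public:
    // Returns true the first time it is called in a process whose pid
    // differs from the recorded one, i.e. in a forked child.
    bool ForkedSinceLastCheck() {
        const pid_t now = getpid();  // one system call per check
        if (now == lastPid_)
            return false;
        lastPid_ = now;
        return true;
    }

  private:
    pid_t lastPid_ = getpid();
};
```

Each instance keeps its own state, so there is no global handler to register, no slot limit to run into, and nothing shared that would need synchronisation; the cost is the getpid() call on every check.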

Thanks for your time,

Carsten

