Threading shutdown race condition

This article is about a race condition in threading shutdown that I fixed in Python 3.9 in March 2019. I also forbid spawning daemon threads in subinterpreters to fix another related bug.

Drawing: #CoronaMaison by Julien Neel.

Race condition in threading shutdown

Random test failure noticed on FreeBSD buildbot

In March 2019, I noticed that test_threading.test_threads_join_2() was killed by SIGABRT on the FreeBSD CURRENT buildbot, bpo-36402:

Fatal Python error: Py_EndInterpreter: not the last thread

The test_threads_join_2() test failed randomly on buildbots when tests were run in parallel, but test_threading passed when it was re-run sequentially. Such failure was silently ignored, since the build was seen overall as a success.

The test test_threading.test_threads_join_2() was added by in 2013 commit 7b476993.

In 2016, I already reported the same test failure: bpo-27791 (same test, also on FreeBSD). And Christian Heimes reported a similar issue: bpo-28084. I simply closed these issues because I only saw the failure once in 4 months and I didn't have access to FreeBSD to attempt to reproduce the crash.

Reproduce the race condition

In 2019, I had a FreeBSD VM to attempt to reproduce the bug locally.

In June 2019, I found a reliable way to reproduce the bug by adding random sleeps to the test. With this patch, I was also able to reproduce the bug on Linux. I am way more comfortable to debug an issue on Linux with my favorite debugging tools!

I identified a race condition in the Python finalization. I also understood that the bug was not specific to subinterpreters:

The test shows the bug using subinterpreters (Py_EndInterpreter), but the bug also exists in Py_Finalize() which has the same race condition.

I wrote a patch for Py_Finalize() to help me to reproduce the bug without subinterpreters:

+    if (tstate != interp->tstate_head || tstate->next != NULL) {
+        Py_FatalError("Py_EndInterpreter: not the last thread");
+    }

threading._shutdown() race condition

threading._shutdown() uses threading.enumerate() which iterates on threading._active dictionary.

threading.Thread registers itself into threading._active when the thread starts. It unregisters itself from threading._active when it completes.

The bug occurs when the thread is unregistered whereas the underlying native thread is still running and the Python thread state is not deleted yet.

_thread._set_sentinel() creates a lock and registers a tstate->on_delete callback to release this lock. It's called by threading.Thread when the thread starts to set threading.Thread._tstate_lock. This lock is used by threading.Thread.join() method to wait until the thread completes.

_thread.start_new_thread() calls the C function t_bootstrap() which ends with:

tstate->interp->num_threads--;
PyThreadState_Clear(tstate);
PyThreadState_DeleteCurrent();
PyThread_exit_thread();

When the native thread completes, _PyThreadState_DeleteCurrent() is called: it calls tstate->on_delete() callback which releases threading.Thread._tstate_lock lock.

The root issue is that:

threading._shutdown() rely on threading._alive dictionary
Py_EndInterpreter() rely on the interpreter linked list of Python thread states of the interpreter (interp->tstate_head).

The lock on Python thread states (threading.Thread._tstate_lock) and PyThreadState.on_delete callback were added in 2013 by Antoine Pitrou to Python 3.4, commit 7b476993 of bpo-18808:

Issue #18808: Thread.join() now waits for the underlying thread state
to be destroyed before returning. This prevents unpredictable aborts
in Py_EndInterpreter() when some non-daemon threads are still running.

Fix threading._shutdown()

Finally in June 2019, I fixed the race condition in threading._shutdown() with commit 468e5fec:

bpo-36402: Fix threading._shutdown() race condition (GH-13948)

Fix a race condition at Python shutdown when waiting for threads.  Wait
until the Python thread state of all non-daemon threads get deleted
(join all non-daemon threads), rather than just wait until Python
threads complete.

The fix is to modify threading._shutdown() to wait until the Python thread state of all non-daemon threads get deleted, rather than calling the join() method of all non-daemon threads. The join() does not ensure that the Python thread state is deleted.

The Python finalization calls threading._shutdown() to wait until all threads complete. Only non-daemon threads are awaited: daemon threads can continue to run after threading._shutdown().

Py_EndInterpreter() requires that the Python thread states of all threads have been deleted. What about daemon threads? More about that in the next section ;-)

Note: This change introduced a regression (memory leak) which is not fixed yet: bpo-37788.

Forbid daemon threads in subinterpreters

In June 2019, while fixing the threading shutdown, I found a reliable way to trigger a bug with daemon threads when a subinterpreter is finalized:

Fatal Python error: Py_EndInterpreter: not the last thread

By design, daemon threads can run after a Python interpreter is finalized, whereas Py_EndInterpreter() requires that all threads completed.

I reported bpo-37266 to propose to forbid the creation of daemon threads in subinterpreters. I fixed the issue with commit 066e5b1a:

bpo-37266: Daemon threads are now denied in subinterpreters (GH-14049)

In a subinterpreter, spawning a daemon thread now raises an
exception. Daemon threads were never supported in subinterpreters.
Previously, the subinterpreter finalization crashed with a Pyton
fatal error if a daemon thread was still running.

The change adds this check to Thread.start():

if self.daemon and not _is_main_interpreter():
    raise RuntimeError("daemon thread are not supported "
                       "in subinterpreters")

I commented:

Daemon threads must die. That's a first step towards their death!

Antoine Pitrou created bpo-39812: Avoid daemon threads in concurrent.futures as a follow-up.

In February 2020, when rebuilding Fedora Rawhide with Python 3.9, Miro Hrončok of my team noticed that my change broke the python-jep project. I reported the bug upstream. It has been fixed by using regular threads, rather than daemon threads: commit.

Conclusion

A random failure on a FreeBSD buildbot was hiding a severe race condition in the threading shutdown. The bug existed since 2013, but was silently ignored since the test passed when re-run.

The race condition was that that the threading shutdown didn't ensure that the Python thread state of all non-daemon threads are deleted, whereas it is a Py_EndInterpreter() requirement.

I fixed the threading shutdown by waiting until the Python thread state of all non-daemon threads is deleted.

I also modified Thread.start() to forbid spawning daemon threads in Python subinterpreters to fix a related issue.

Victor Stinner blog 3

Victor Stinner