2017-06-29 18:33:40 UTC
After a lot of intensive debugging, I think I've identified the reason
the test suite for our python-llfuse-based driver sometimes deadlocks:
1. It gets stuck in PyEval_RestoreThread() attempting to acquire the
GIL. However, if the GIL is already acquired, this results in a
deadlock.
2. The function PyGILState_Ensure() is intended to determine if the
calling thread is the interpreter's "current" thread. This is
accomplished by testing the variable _PyThreadState_Current against
the ThreadState associated with this thread.
3. The ThreadState is stored in a linked list keyed on the thread id,
with the semantics that PyThread_set_key_value() returns the existing
value (but doesn't update it) if the key is already present.
4. Calls to PyGILState_Ensure() need to be paired with
PyGILState_Release() which maintains a gilstate_counter. When
gilstate_counter goes to zero, the thread context is cleaned up and
removed from the ThreadState list.
5. PyThread_get_thread_ident() uses pthread_self(). Pthreads specifies
that thread ids may be reused.
6. llfuse creates worker threads directly using pthread_create. These
threads are Cython code which call into Python.
7. llfuse also uses pthread_cancel() to terminate its worker threads.
8. If pthread_cancel() stops a thread that has a Python thread state,
PyGILState_Release() is never called, and that thread state is leaked.
9. If a new thread is created, it may get the same id as the previous
thread. Somehow, this can result in the creation of a new ThreadState
object instead of the new thread taking over the old one.
10. Later, the new thread calls PyGILState_Ensure(). The
_PyThreadState_Current is correct and indicates the GIL is locked, but
looking up the ThreadState for this thread using
PyThread_get_key_value returns a different struct. This causes Python
to believe that it needs to acquire the GIL and swap in the new
thread state.
11. Because the GIL is actually already locked, it deadlocks.
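
Steps 7 and 8 can be reproduced without Python at all. The sketch below
is only an illustration, not llfuse or CPython code: leak_fake_ensure()
and leak_fake_release() are hypothetical stand-ins for
PyGILState_Ensure() and PyGILState_Release() that merely count calls,
and the sleep() stands in for a worker blocked waiting for a request.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical stand-ins for PyGILState_Ensure()/PyGILState_Release().
 * They only count calls so that an unpaired "ensure" is observable. */
static atomic_int leak_ensure_calls;
static atomic_int leak_release_calls;
static atomic_int leak_worker_ready;

static void leak_fake_ensure(void)  { atomic_fetch_add(&leak_ensure_calls, 1); }
static void leak_fake_release(void) { atomic_fetch_add(&leak_release_calls, 1); }

static void *leak_worker(void *arg)
{
    (void)arg;
    leak_fake_ensure();
    atomic_store(&leak_worker_ready, 1);
    sleep(3600);          /* cancellation point: the cancel takes effect here */
    leak_fake_release();  /* skipped when the thread is cancelled */
    return NULL;
}

/* Start one worker, cancel it while it is blocked, and return the number
 * of "ensure" calls left without a matching "release" (-1 on error). */
int count_leaked_after_cancel(void)
{
    pthread_t tid;

    atomic_store(&leak_ensure_calls, 0);
    atomic_store(&leak_release_calls, 0);
    atomic_store(&leak_worker_ready, 0);

    if (pthread_create(&tid, NULL, leak_worker, NULL) != 0)
        return -1;
    while (!atomic_load(&leak_worker_ready))
        sched_yield();    /* wait until the worker has called "ensure" */
    if (pthread_cancel(tid) != 0 || pthread_join(tid, NULL) != 0)
        return -1;

    return atomic_load(&leak_ensure_calls) - atomic_load(&leak_release_calls);
}
```

Built with -pthread and default (deferred) cancellation,
count_leaked_after_cancel() should return 1: the "ensure" ran, but the
matching "release" never did because the thread was cancelled first.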
Possible fixes:
1. Don't call pthread_cancel(). This seems to eliminate the problem.
The main drawback is long-running request handlers could cause
problems of their own by delaying unmount/shutdown.
2. Add pthread_cleanup_push() to ensure that PyGILState_Release() gets
called when the thread is canceled. May require a bit of tinkering to
ensure Cython emits the right code.
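
A rough sketch of what option 2 looks like in plain C, assuming the same
kind of counting stand-ins for the PyGILState calls (all names here are
hypothetical and the Cython side is not shown). Note that
pthread_cleanup_push() and pthread_cleanup_pop() must appear as a pair
in the same lexical scope, which is probably where the Cython tinkering
comes in.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical stand-ins for PyGILState_Ensure()/PyGILState_Release();
 * they only count calls. */
static atomic_int safe_ensure_calls;
static atomic_int safe_release_calls;
static atomic_int safe_worker_ready;

static void safe_fake_ensure(void) { atomic_fetch_add(&safe_ensure_calls, 1); }

/* Cleanup handler: runs if the thread is cancelled while it is pushed,
 * and also on the normal path because of pthread_cleanup_pop(1). */
static void safe_release_on_cancel(void *arg)
{
    (void)arg;
    atomic_fetch_add(&safe_release_calls, 1);
}

static void *safe_worker(void *arg)
{
    (void)arg;
    safe_fake_ensure();
    pthread_cleanup_push(safe_release_on_cancel, NULL);
    atomic_store(&safe_worker_ready, 1);
    sleep(3600);             /* cancellation point */
    pthread_cleanup_pop(1);  /* non-zero: run the handler when not cancelled too */
    return NULL;
}

/* Start one worker, cancel it while it is blocked, and return the number
 * of "ensure" calls left without a matching "release" (-1 on error). */
int count_leaked_with_cleanup(void)
{
    pthread_t tid;

    atomic_store(&safe_ensure_calls, 0);
    atomic_store(&safe_release_calls, 0);
    atomic_store(&safe_worker_ready, 0);

    if (pthread_create(&tid, NULL, safe_worker, NULL) != 0)
        return -1;
    while (!atomic_load(&safe_worker_ready))
        sched_yield();       /* wait until the handler has been pushed */
    if (pthread_cancel(tid) != 0 || pthread_join(tid, NULL) != 0)
        return -1;

    return atomic_load(&safe_ensure_calls) - atomic_load(&safe_release_calls);
}
```

Here count_leaked_with_cleanup() should return 0, since the handler runs
during cancellation. The sketch does not address whether the real
PyGILState_Release() can safely be called from a cleanup handler in
every state the thread might be cancelled in; that would need checking
separately.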