[fuse-devel] python-llfuse deadlock due to pthread_cancel
Peter Amstutz
2017-06-29 18:33:40 UTC
Hi everyone,

After a lot of intensive debugging, I think I've identified the reason
the test suite for our python-llfuse-based driver sometimes deadlocks
in fuse_destroy():

1. The thread gets stuck in PyEval_RestoreThread() while attempting to
acquire the GIL. If the GIL is already held at that point, the call
never returns and the process deadlocks.

2. The function PyGILState_Ensure() has to determine whether the
calling thread is the interpreter's "current" thread. It does this by
comparing the variable _PyThreadState_Current against the ThreadState
associated with the calling thread.

3. The ThreadState is stored in a linked list keyed on the thread id.
The list has the semantics that PyThread_set_key_value() returns the
existing value (and does not update it) if the key is already present.

4. Calls to PyGILState_Ensure() need to be paired with
PyGILState_Release() which maintains a gilstate_counter. When
gilstate_counter goes to zero, the thread context is cleaned up and
removed from the ThreadState list.

5. PyThread_get_thread_ident() uses pthread_self(). Pthreads specifies
that thread ids may be reused.

6. llfuse creates worker threads directly using pthread_create. These
threads are Cython code which call into Python.

7. llfuse also uses pthread_cancel() to terminate its worker threads.

8. If pthread_cancel() stops a thread that has a Python thread state,
PyGILState_Release() is never called, and that thread state is leaked.

9. If a new thread is created, it may get the same id as the previous
thread. Somehow, it is possible for this to result in the creation of
a new ThreadState object rather than taking over the old one.

10. Later, the new thread calls PyGILState_Ensure(). The
_PyThreadState_Current is correct and indicates the GIL is locked, but
looking up the ThreadState for this thread using
PyThread_get_key_value returns a different struct. This causes Python
to believe that it needs to acquire the GIL and swap in the new
ThreadState.

11. Because the GIL is actually already locked, it deadlocks.
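
To illustrate step 8 in isolation, here is a minimal sketch using plain
pthreads only, with no Python involved. The counter
fake_gilstate_counter and the function names are hypothetical stand-ins
that I made up for the example: incrementing the counter plays the role
of PyGILState_Ensure() and decrementing it plays the role of
PyGILState_Release(). It assumes the default deferred cancellation
mode, under which sleep() is a cancellation point.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical stand-in for the per-thread gilstate_counter. */
static atomic_int fake_gilstate_counter;
static atomic_int worker_ready;

static void *worker(void *arg)
{
    (void)arg;
    /* Stand-in for PyGILState_Ensure(). */
    atomic_fetch_add(&fake_gilstate_counter, 1);
    atomic_store(&worker_ready, 1);

    /* Pretend this is a long-running request handler.
     * sleep() is a cancellation point. */
    sleep(3600);

    /* Stand-in for PyGILState_Release(); skipped if the thread is
     * canceled while it is inside sleep(). */
    atomic_fetch_sub(&fake_gilstate_counter, 1);
    return NULL;
}

/* Starts a worker, cancels it, and returns the counter value that is
 * left behind (or -1 if the thread could not be created). */
int count_after_cancel_without_cleanup(void)
{
    pthread_t tid;
    void *result = NULL;

    atomic_store(&fake_gilstate_counter, 0);
    atomic_store(&worker_ready, 0);
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;

    /* Wait until the worker has taken its "thread state". */
    while (!atomic_load(&worker_ready))
        sched_yield();

    pthread_cancel(tid);
    pthread_join(tid, &result);
    assert(result == PTHREAD_CANCELED);

    return atomic_load(&fake_gilstate_counter);
}
```

The counter is still nonzero after the canceled thread has been joined,
which corresponds to the leaked thread state described in step 8. The
sketch does not attempt to reproduce the thread id reuse from steps 9
and 10.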

Possible fixes:

1. Don't call pthread_cancel(). This seems to eliminate the problem.
The main drawback is long-running request handlers could cause
problems of their own by delaying unmount/shutdown.

2. Add pthread_cleanup_push() to ensure that PyGILState_Release() gets
called when the thread is canceled. This may require a bit of
tinkering to ensure Cython emits the right code.
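
As a rough sketch of fix 2, again with plain pthreads and no Python:
fake_gilstate_counter, release_state() and the other names are
hypothetical stand-ins, where decrementing the counter plays the role
of PyGILState_Release(). This is not what llfuse or Cython would
actually emit. A real handler would also have to account for whether
the thread holds the GIL at the point of cancellation, which this
sketch does not model.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical stand-in for the per-thread gilstate_counter. */
static atomic_int fake_gilstate_counter;
static atomic_int worker_ready;

/* Cleanup handler; stand-in for calling PyGILState_Release(). */
static void release_state(void *arg)
{
    (void)arg;
    atomic_fetch_sub(&fake_gilstate_counter, 1);
}

static void *worker(void *arg)
{
    (void)arg;
    /* Stand-in for PyGILState_Ensure(). */
    atomic_fetch_add(&fake_gilstate_counter, 1);

    /* Register the release so that it also runs on cancellation. */
    pthread_cleanup_push(release_state, NULL);
    atomic_store(&worker_ready, 1);

    /* Long-running request handler; sleep() is a cancellation point. */
    sleep(3600);

    /* Normal exit path: pop the handler and run it (nonzero argument). */
    pthread_cleanup_pop(1);
    return NULL;
}

/* Starts a worker, cancels it, and returns the counter value that is
 * left behind (or -1 if the thread could not be created). */
int count_after_cancel_with_cleanup(void)
{
    pthread_t tid;
    void *result = NULL;

    atomic_store(&fake_gilstate_counter, 0);
    atomic_store(&worker_ready, 0);
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;

    /* Wait until the worker has registered its cleanup handler. */
    while (!atomic_load(&worker_ready))
        sched_yield();

    pthread_cancel(tid);
    pthread_join(tid, &result);
    assert(result == PTHREAD_CANCELED);

    return atomic_load(&fake_gilstate_counter);
}
```

With the handler registered, the counter is back at zero after the
canceled thread has been joined. Note that pthread_cleanup_push() and
pthread_cleanup_pop() must appear as a pair in the same lexical scope,
which is part of why getting generated code to use them can take some
care.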


Thanks,
Peter
Nikolaus Rath
2017-07-02 07:44:43 UTC
Post by Peter Amstutz
Hi everyone,
After a lot of intensive debugging, I think I've identified the reason
the test suite for our python-llfuse-based driver sometimes deadlocks
[...]

For reference, this is now tracked at
https://bitbucket.org/nikratio/python-llfuse/issues/108/

Thanks to Peter for figuring this out!


Best,
-Nikolaus
--
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«