Detecting and Breaking Deadlock

by Ben Adida.

A system where multiple threads coexist in the same address space is extremely vulnerable to data corruption and deadlock. While there is a trade-off between these two problems, deadlock remains the ultimate bug, almost impossible to eradicate (numerous books have been written on this topic). One way to attack the problem is to have a runtime deadlock detector and breaker system.

To detect deadlock between threads, we need to track blocked threads and the mutexes they are blocked on, as well as be able to easily find the owner of a given mutex. This information needs to be updated every time a mutex lock is attempted (whether it succeeds or not).

Deadlock happens when two threads are waiting for a mutex owned by the other (circular deadlock between multiple threads is also possible). Therefore, we need to check for deadlock only when a thread fails to lock a mutex. At that point, the Thread Manager needs to suspend all threads and take over to perform a cycle check on mutex dependency. Finding such a cycle is easily done by performing a tree traversal of the dependencies, and marking threads and mutexes along the way. Using this method, we can detect deadlock and identify all threads and mutexes involved in the deadlock.

Now that deadlock is detected, we need to fix the problem. One way of fixing this problem is to have each thread perform a checkpoint of the address space to disk every time a mutex is grabbed. The checkpoint will also store the execution point of each thread that is present in the system at the time the mutex is grabbed. Then, as soon as our system detects deadlock, it can back out to the previous checkpointed state concerned (before one of the mutexes involved in the deadlock is grabbed), and schedule threads appropriately to avoid deadlock in the same place the next time around (that is not too complicated: schedule the thread that blocked on the second mutex until it has unlocked all mutexes involved, all before the other thread grabs its first mutex).

The problem with this idea is that any procedure call in the list of rolled-back commands that accessed an external system not controlled by the address space rollback (i.e. sending a packet across the network, launching a missile) will be executed twice, which could have catastrophic results under certain circumstances. Furthermore, checkpointing the address-space to disk at each mutex lock is very expensive, and could result in unacceptable speed trade-offs.

Another solution is to very simply kill the threads involved in the deadlock, after ensuring that the data invariants are enforced. This problem is not as easy as it sounds. We know that data invariants must be respected for a piece of data whose mutex is unlocked. What we need to do, then, when threads deadlock, is to make each thread involved skip the critical section that is in question, continue its execution until the last troublesome mutex is released, then kill the thread (possibly restarting it from the beginning). This forces the programmer to ensure that critical sections in any given thread do not overlap: any additional mutex locked within a critical section must be unlocked before that very critical section ends. This allows us to skip critical sections without creating additional mutex problems.

Of course, this solution has enormous problems, too. How can we safely skip critical sections? Although the data structures will comply with their invariants, the data contained might be erroneous.

It seems clear that there is no perfect solution to deadlock between same address-space threads. Each specific case must be handled differently. For mission-critical systems where the threads involved do not send external orders, and where losing a thread cannot happen, thread rollback is the best solution. On the other hand, for something like a webserver where it might be acceptable to drop a few requests on rare occasions, it is possible to simply kill the guilty thread, and expect the client to resubmit the request.

In conclusion, the very nature of deadlock makes any solution problematic in some way. There is no perfect solution, only solutions better adapted to specific systems.