Wait/Wound Deadlock-Proof Mutex Design
======================================

Please read mutex-design.txt first, as it applies to wait/wound mutexes too.

Motivation for WW-Mutexes
-------------------------

GPUs do operations that commonly involve many buffers. Those buffers
can be shared across contexts/processes, exist in different memory
domains (for example VRAM vs system memory), and so on. And with
PRIME / dmabuf, they can even be shared across devices. So there are
a handful of situations where the driver needs to wait for buffers to
become ready. If you think about this in terms of waiting on a buffer
mutex for it to become available, this presents a problem because
there is no way to guarantee that buffers appear in an execbuf/batch in
the same order in all contexts. That is directly under control of
userspace, and a result of the sequence of GL calls that an application
makes, which results in the potential for deadlock. The problem gets
more complex when you consider that the kernel may need to migrate the
buffer(s) into VRAM before the GPU operates on the buffer(s), which
may in turn require evicting some other buffers (and you don't want to
evict other buffers which are already queued up to the GPU), but for a
simplified understanding of the problem you can ignore this.

The algorithm that the TTM graphics subsystem came up with for dealing with
this problem is quite simple. For each group of buffers (execbuf) that need
to be locked, the caller would be assigned a unique reservation id/ticket,
from a global counter. In case of deadlock while locking all the buffers
associated with an execbuf, the one with the lowest reservation ticket (i.e.
the oldest task) wins, and the one with the higher reservation id (i.e. the
younger task) unlocks all of the buffers that it has already locked, and then
tries again.

In the RDBMS literature this deadlock handling approach is called wait/wound:
The older task waits until it can acquire the contended lock. The younger task
needs to back off and drop all the locks it is currently holding, i.e. the
younger task is wounded.
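
The decision rule itself is tiny. The following sketch is purely illustrative
(it is not the kernel implementation); it only assumes that tickets are handed
out from a monotonically increasing counter, so a numerically lower ticket
identifies the older task:

#include <stdbool.h>

/*
 * Illustrative only: decide what a task contending for a held lock must do.
 * Returns true if the contender is younger (higher ticket) and therefore has
 * to back off, drop all the locks it already holds and retry; false if it is
 * older (lower ticket) and may simply wait for the current holder.
 */
static bool contender_must_back_off(unsigned long holder_ticket,
                                    unsigned long contender_ticket)
{
        return contender_ticket > holder_ticket;
}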

Concepts
--------

Compared to normal mutexes two additional concepts/objects show up in the lock
interface for w/w mutexes:

Acquire context: To ensure eventual forward progress it is important that a task
trying to acquire locks doesn't grab a new reservation id, but keeps the one it
acquired when starting the lock acquisition. This ticket is stored in the
acquire context. Furthermore the acquire context keeps track of debugging state
to catch w/w mutex interface abuse.

W/w class: In contrast to normal mutexes the lock class needs to be explicit for
w/w mutexes, since it is required to initialize the acquire context.
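
As a minimal sketch of how the class ties everything together (my_class and
my_obj are made-up names for this example), both ww_mutex_init and
ww_acquire_init take the same class, so that the ticket machinery and the
debugging checks know the locks belong together:

static DEFINE_WW_CLASS(my_class);

struct my_obj {
        struct ww_mutex lock;
        /* obj data */
};

static void my_obj_init(struct my_obj *obj)
{
        ww_mutex_init(&obj->lock, &my_class);
}

static void start_locking(struct ww_acquire_ctx *ctx)
{
        /* grabs the reservation ticket used by all subsequent
         * ww_mutex_lock(..., ctx) calls of this acquisition */
        ww_acquire_init(ctx, &my_class);
}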

Furthermore there are three different classes of w/w lock acquire functions:

* Normal lock acquisition with a context, using ww_mutex_lock.

* Slowpath lock acquisition on the contending lock, used by the wounded task
  after having dropped all already acquired locks. These functions have the
  _slow postfix.

  From a simple semantics point-of-view the _slow functions are not strictly
  required, since simply calling the normal ww_mutex_lock functions on the
  contending lock (after having dropped all other already acquired locks) will
  work correctly. After all if no other ww mutex has been acquired yet there's
  no deadlock potential and hence the ww_mutex_lock call will block and not
  prematurely return -EDEADLK. The advantage of the _slow functions is in
  interface safety:

  - ww_mutex_lock has a __must_check int return type, whereas ww_mutex_lock_slow
    has a void return type. Note that since ww mutex code needs loops/retries
    anyway the __must_check doesn't result in spurious warnings, even though the
    very first lock operation can never fail.
  - When full debugging is enabled ww_mutex_lock_slow checks that all acquired
    ww mutexes have been released (preventing deadlocks) and makes sure that we
    block on the contending lock (preventing spinning through the -EDEADLK
    slowpath until the contended lock can be acquired).

* Functions to only acquire a single w/w mutex, which results in the exact same
  semantics as a normal mutex. This is done by calling ww_mutex_lock with a NULL
  context.

  Again this is not strictly required. But often you only want to acquire a
  single lock in which case it's pointless to set up an acquire context (and so
  better to avoid grabbing a deadlock avoidance ticket).

Of course, all the usual variants for handling wake-ups due to signals are also
provided.
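
As an illustrative snippet only (obj and contended are stand-in names), the
signal-handling variants mirror ww_mutex_lock and ww_mutex_lock_slow but may
additionally return -EINTR when the sleeping task is woken by a signal:

        ret = ww_mutex_lock_interruptible(&obj->lock, ctx);
        if (ret == -EINTR)
                return ret;     /* woken by a signal: unwind and bail out */

        /* slowpath on the contended lock, after dropping all other locks: */
        ret = ww_mutex_lock_slow_interruptible(&contended->lock, ctx);
        if (ret == -EINTR)
                return ret;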

Usage
-----

Three different ways to acquire locks within the same w/w class. Common
definitions for methods #1 and #2:

static DEFINE_WW_CLASS(ww_class);

struct obj {
        struct ww_mutex lock;
        /* obj data */
};

struct obj_entry {
        struct list_head head;
        struct obj *obj;
};

Method 1, using a list in execbuf->buffers that's not allowed to be reordered.
This is useful if a list of required objects is already tracked somewhere.
Furthermore the lock helper can propagate the -EALREADY return code back to
the caller as a signal that an object appears twice on the list. This is useful
if the list is constructed from userspace input and the ABI requires userspace
to not have duplicate entries (e.g. for a gpu commandbuffer submission ioctl).

int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
        struct obj *res_obj = NULL;
        struct obj_entry *contended_entry = NULL;
        struct obj_entry *entry;
        int ret;

        ww_acquire_init(ctx, &ww_class);

retry:
        list_for_each_entry (entry, list, head) {
                /* skip the lock we already acquired in the slowpath */
                if (entry->obj == res_obj) {
                        res_obj = NULL;
                        continue;
                }
                ret = ww_mutex_lock(&entry->obj->lock, ctx);
                if (ret < 0) {
                        contended_entry = entry;
                        goto err;
                }
        }

        ww_acquire_done(ctx);
        return 0;

err:
        list_for_each_entry_continue_reverse (entry, list, head)
                ww_mutex_unlock(&entry->obj->lock);

        if (res_obj)
                ww_mutex_unlock(&res_obj->lock);

        if (ret == -EDEADLK) {
                /* we lost out in a seqno race, lock and retry.. */
                ww_mutex_lock_slow(&contended_entry->obj->lock, ctx);
                res_obj = contended_entry->obj;
                goto retry;
        }

        ww_acquire_fini(ctx);
        return ret;
}

Method 2, using a list in execbuf->buffers that can be reordered. Duplicate
entries are detected via -EALREADY with the same semantics as in method 1
above, but the list reordering allows for a bit more idiomatic code.

int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
        struct obj_entry *entry, *entry2;
        int ret;

        ww_acquire_init(ctx, &ww_class);

        list_for_each_entry (entry, list, head) {
                ret = ww_mutex_lock(&entry->obj->lock, ctx);
                if (ret < 0) {
                        entry2 = entry;
                        list_for_each_entry_continue_reverse (entry2, list, head)
                                ww_mutex_unlock(&entry2->obj->lock);

                        if (ret != -EDEADLK) {
                                ww_acquire_fini(ctx);
                                return ret;
                        }

                        /* we lost out in a seqno race, lock and retry.. */
                        ww_mutex_lock_slow(&entry->obj->lock, ctx);

                        /*
                         * Move the contended entry to the head of the list:
                         * entry->next then points at the first unlocked
                         * entry, restarting the for loop.
                         */
                        list_del(&entry->head);
                        list_add(&entry->head, list);
                }
        }

        ww_acquire_done(ctx);
        return 0;
}

Unlocking works the same way for both methods #1 and #2:

void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
        struct obj_entry *entry;

        list_for_each_entry (entry, list, head)
                ww_mutex_unlock(&entry->obj->lock);

        ww_acquire_fini(ctx);
}
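
A caller of either method might then look roughly like this (a sketch only:
submit_execbuf and the surrounding list handling are made up for illustration;
the important part is that the ww_acquire_ctx lives on the caller's stack for
the whole lock/unlock cycle, and that on failure lock_objs has already called
ww_acquire_fini):

int submit_execbuf(struct list_head *buffers)
{
        struct ww_acquire_ctx ctx;
        int ret;

        ret = lock_objs(buffers, &ctx);
        if (ret)
                return ret;     /* e.g. -EALREADY for duplicate entries */

        /* ... queue the buffers to the hardware ... */

        unlock_objs(buffers, &ctx);
        return 0;
}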

Method 3 is useful if the list of objects is constructed ad-hoc and not upfront,
e.g. when adjusting edges in a graph where each node has its own ww_mutex lock,
and edges can only be changed when holding the locks of all involved nodes. w/w
mutexes are a natural fit for such a case for two reasons:

- They can handle lock-acquisition in any order, which allows us to start walking
  a graph from a starting point and then iteratively discovering new edges and
  locking down the nodes those edges connect to.
- Due to the -EALREADY return code signalling that a given object is already
  held there's no need for additional book-keeping to break cycles in the graph
  or to keep track of which locks are already held (when using more than one
  node as a starting point).

Note that this approach differs in two important ways from the above methods:

- Since the list of objects is dynamically constructed (and might very well be
  different when retrying due to hitting the -EDEADLK wound condition) there's
  no need to keep any object on a persistent list when it's not locked. We can
  therefore move the list_head into the object itself.
- On the other hand the dynamic object list construction also means that the
  -EALREADY return code can't be propagated.

Note also that methods #1 and #2 and method #3 can be combined, e.g. to first
lock a list of starting nodes (passed in from userspace) using one of the above
methods, and then lock any additional objects affected by the operations using
method #3 below. The backoff/retry procedure will be a bit more involved, since
when the dynamic locking step hits -EDEADLK we also need to unlock all the
objects acquired with the fixed list. But the w/w mutex debug checks will catch
any interface misuse for these cases.

Also, method 3 can't fail the lock acquisition step since it doesn't return
-EALREADY. Of course this would be different when using the _interruptible
variants, but that's outside of the scope of these examples here.

struct obj {
        struct ww_mutex ww_mutex;
        struct list_head locked_list;
};

static DEFINE_WW_CLASS(ww_class);

void __unlock_objs(struct list_head *list)
{
        struct obj *entry, *temp;

        list_for_each_entry_safe (entry, temp, list, locked_list) {
                /* need to do that before unlocking, since only the current
                 * lock holder is allowed to use the object */
                list_del(&entry->locked_list);
                ww_mutex_unlock(&entry->ww_mutex);
        }
}

void lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
        struct obj *obj;
        int ret;

        ww_acquire_init(ctx, &ww_class);

retry:
        /* re-init loop start state */
        loop {
                /* magic code which walks over a graph and decides which
                 * objects to lock */

                ret = ww_mutex_lock(&obj->ww_mutex, ctx);
                if (ret == -EALREADY) {
                        /* we have that one already, get to the next object */
                        continue;
                }
                if (ret == -EDEADLK) {
                        __unlock_objs(list);

                        ww_mutex_lock_slow(&obj->ww_mutex, ctx);
                        list_add(&obj->locked_list, list);
                        goto retry;
                }

                /* locked a new object, add it to the list */
                list_add_tail(&obj->locked_list, list);
        }

        ww_acquire_done(ctx);
}

void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
        __unlock_objs(list);
        ww_acquire_fini(ctx);
}

Method 4: Only lock one single object. In that case deadlock detection and
prevention is obviously overkill, since with grabbing just one lock you can't
produce a deadlock within just one class. To simplify this case the w/w mutex
api can be used with a NULL context.
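
For example (a sketch only, with obj standing in for any structure embedding a
ww_mutex; with a NULL context the lock call cannot return -EDEADLK or
-EALREADY):

        ww_mutex_lock(&obj->lock, NULL);        /* behaves like a normal mutex lock */
        /* ... access obj ... */
        ww_mutex_unlock(&obj->lock);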

Implementation Details
----------------------

Design:
  ww_mutex currently encapsulates a struct mutex; this means no extra overhead
  for normal mutex locks, which are far more common. As such there is only a
  small increase in code size if wait/wound mutexes are not used.

  In general, not much contention is expected. The locks are typically used to
  serialize access to resources for devices. The only way to make wakeups
  smarter would be at the cost of adding a field to struct mutex_waiter. This
  would add overhead to all cases where normal mutexes are used, and
  ww_mutexes are generally less performance sensitive.

Lockdep:
  Special care has been taken to warn for as many cases of api abuse
  as possible. Some common api abuses will be caught with
  CONFIG_DEBUG_MUTEXES, but CONFIG_PROVE_LOCKING is recommended.

  Some of the errors which will be warned about:
   - Forgetting to call ww_acquire_fini or ww_acquire_init.
   - Attempting to lock more mutexes after ww_acquire_done.
   - Attempting to lock the wrong mutex after -EDEADLK and
     unlocking all mutexes.
   - Attempting to lock the right mutex after -EDEADLK,
     before unlocking all mutexes.
   - Calling ww_mutex_lock_slow before -EDEADLK was returned.
   - Unlocking mutexes with the wrong unlock function.
   - Calling one of the ww_acquire_* twice on the same context.
   - Using a different ww_class for the mutex than for the ww_acquire_ctx.
   - Normal lockdep errors that can result in deadlocks.

  Some of the lockdep errors that can result in deadlocks:
   - Calling ww_acquire_init to initialize a second ww_acquire_ctx before
     having called ww_acquire_fini on the first.
   - 'normal' deadlocks that can occur.

FIXME: Update this section once we have the TASK_DEADLOCK task state flag magic
implemented.