This document provides "recipes", that is, litmus tests for commonly
occurring situations, as well as a few that illustrate subtly broken but
attractive nuisances.  Many of these recipes include example code from
v4.13 of the Linux kernel.

The first section covers simple special cases, the second section
takes off the training wheels to cover more involved examples,
and the third section provides a few rules of thumb.
Simple special cases
====================

This section presents two simple special cases, the first being where
there is only one CPU or where only one memory location is accessed, and
the second being use of that old concurrency workhorse, locking.
Single CPU or single memory location
------------------------------------

If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order.  There are (as
usual) some things to be careful of:

1.      Some aspects of the C language are unordered.  For example,
        in the expression "f(x) + g(y)", the order in which f and g are
        called is not defined; the object code is allowed to use either
        order or even to interleave the computations.

2.      Compilers are permitted to use the "as-if" rule.  That is, a
        compiler can emit whatever code it likes for normal accesses,
        as long as the results of a single-threaded execution appear
        just as if the compiler had followed all the relevant rules.
        To see this, compile with a high level of optimization and run
        the debugger on the resulting binary.

3.      If there is only one variable but multiple CPUs, that variable
        must be properly aligned and all accesses to that variable must
        be full sized.  Variables that straddle cachelines or pages void
        your full-ordering warranty, as do undersized accesses that load
        from or store to only part of the variable.

4.      If there are multiple CPUs, accesses to shared variables should
        use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
        tearing, load/store fusing, and invented loads and stores (a
        sketch illustrating this rule and its first exception follows
        this list).  There are exceptions to this rule, including:

        i.      When there is no possibility of a given shared variable
                being updated by some other CPU, for example, while
                holding the update-side lock, reads from that variable
                need not use READ_ONCE().

        ii.     When there is no possibility of a given shared variable
                being either read or updated by other CPUs, for example,
                when running during early boot, reads from that variable
                need not use READ_ONCE() and writes to that variable
                need not use WRITE_ONCE().
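
The following hypothetical fragment (the variable shared_counter and
both functions are made up for illustration, not taken from the kernel
sources) sketches rule 4 together with exception i: the lockless reader
must mark its load, while the lock-holding updater may use a plain read
but still marks its store:

        /* Hypothetical sketch; not from the kernel sources. */
        int shared_counter;

        void update(void)               /* Caller holds the update-side lock. */
        {
                /* Plain read suffices: the lock excludes other updaters. */
                WRITE_ONCE(shared_counter, shared_counter + 1);
        }

        int read_locklessly(void)       /* No locks held. */
        {
                /* READ_ONCE() prevents load tearing and load fusing. */
                return READ_ONCE(shared_counter);
        }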
Locking
-------

Locking is well-known and straightforward, at least if you don't think
about it too hard.  And the basic rule is indeed quite simple: Any CPU that
has acquired a given lock sees any changes previously seen or made by any
CPU before it released that same lock.  Note that this statement is a bit
stronger than "Any CPU holding a given lock sees all changes made by any
CPU during the time that CPU was holding this same lock".  For example,
consider the following pair of code fragments:
        /* See MP+polocks.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                spin_lock(&mylock);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(y);
                spin_unlock(&mylock);
                r1 = READ_ONCE(x);
        }
The basic rule guarantees that if CPU0() acquires mylock before CPU1(),
then both r0 and r1 must be set to the value 1.  This also has the
consequence that if the final value of r0 is equal to 1, then the final
value of r1 must also be equal to 1.  In contrast, the weaker rule would
say nothing about the final value of r1.

The converse to the basic rule also holds, as illustrated by the
following litmus test:
        /* See MP+porevlocks.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(y);
                spin_lock(&mylock);
                r1 = READ_ONCE(x);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                spin_unlock(&mylock);
                WRITE_ONCE(y, 1);
        }
This converse to the basic rule guarantees that if CPU0() acquires
mylock before CPU1(), then both r0 and r1 must be set to the value 0.
This also has the consequence that if the final value of r1 is equal
to 0, then the final value of r0 must also be equal to 0.  In contrast,
the weaker rule would say nothing about the final value of r0.

These examples show only a single pair of CPUs, but the effects of the
locking basic rule extend across multiple acquisitions of a given lock
across multiple CPUs, as sketched below.
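
For example, in the following hypothetical fragment (not an in-tree
litmus test), if the three CPUs acquire mylock in the order CPU0(),
CPU1(), CPU2(), then both r0 and r1 must have the final value 1, even
though CPU0() and CPU2() might never run concurrently:

        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(x);      /* Sees 1 if after CPU0(). */
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                spin_lock(&mylock);
                r1 = READ_ONCE(x);      /* Also sees 1 if after CPU1(). */
                spin_unlock(&mylock);
        }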
However, it is not necessarily the case that accesses ordered by
locking will be seen as ordered by CPUs not holding that lock.
Consider this example:
        /* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(y);
                WRITE_ONCE(z, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }
Counter-intuitive though it might be, it is quite possible to have
the final value of r0 be 1, the final value of z be 2, and the final
value of r1 be 0.  The reason for this surprising outcome is that
CPU2() never acquired the lock, and thus did not benefit from the
lock's ordering properties.

Ordering can be extended to CPUs not holding the lock by careful use
of smp_mb__after_spinlock():
        /* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                smp_mb__after_spinlock();
                r0 = READ_ONCE(y);
                WRITE_ONCE(z, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }
This addition of smp_mb__after_spinlock() strengthens the lock acquisition
sufficiently to rule out the counter-intuitive outcome.
Taking off the training wheels
==============================

This section looks at more complex examples, including message passing,
load buffering, release-acquire chains, and store buffering.

Many classes of litmus tests have abbreviated names, which may be found
here: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
Message passing (MP)
--------------------

The MP pattern has one CPU execute a pair of stores to a pair of variables
and another CPU execute a pair of loads from this same pair of variables,
but in the opposite order.  The goal is to avoid the counter-intuitive
outcome in which the first load sees the value written by the second store
but the second load does not see the value written by the first store.
In the absence of any ordering, this goal may not be met, as can be seen
in the MP+poonceonces.litmus litmus test.  This section therefore looks at
a number of ways of meeting this goal.
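
For reference, here is the unordered version, following the in-tree
MP+poonceonces.litmus test but rendered in the style of this document's
other examples:

        /* See MP+poonceonces.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                r1 = READ_ONCE(x);
        }

Here nothing prevents the counter-intuitive outcome in which the final
value of r0 is 1 and the final value of r1 is 0.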
Release and acquire
~~~~~~~~~~~~~~~~~~~

Use of smp_store_release() and smp_load_acquire() is one way to force
the desired MP ordering.  The general approach is shown below:
        /* See MP+pooncerelease+poacquireonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                r1 = READ_ONCE(x);
        }
The smp_store_release() macro orders any prior accesses against the
store, while the smp_load_acquire() macro orders the load against any
subsequent accesses.  Therefore, if the final value of r0 is the value 1,
the final value of r1 must also be the value 1.

The init_stack_slab() function in lib/stackdepot.c uses release-acquire
in this way to safely initialize a slab of the stack.  Working out
the mutual-exclusion design is left as an exercise for the reader.
Assign and dereference
~~~~~~~~~~~~~~~~~~~~~~

Use of rcu_assign_pointer() and rcu_dereference() is quite similar to the
use of smp_store_release() and smp_load_acquire(), except that both
rcu_assign_pointer() and rcu_dereference() operate on RCU-protected
pointers.  The general approach is shown below:
        /* See MP+onceassign+derefonce.litmus. */
        int z;
        int *y = &z;
        int x;

        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                rcu_assign_pointer(y, &x);
        }

        void CPU1(void)
        {
                rcu_read_lock();
                r0 = rcu_dereference(y);
                r1 = READ_ONCE(*r0);
                rcu_read_unlock();
        }
In this example, if the final value of r0 is &x then the final value of
r1 must be 1.

The rcu_assign_pointer() macro has the same ordering properties as does
smp_store_release(), but the rcu_dereference() macro orders the load only
against later accesses that depend on the value loaded.  A dependency
is present if the value loaded determines the address of a later access
(address dependency, as shown above), the value written by a later store
(data dependency), or whether or not a later store is executed in the
first place (control dependency).  Note that the term "data dependency"
is sometimes casually used to cover both address and data dependencies.
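
The following hypothetical fragment (gp, a, y, and q are made up for
illustration) shows one dependency of each type extending from a single
rcu_dereference():

        /* Hypothetical sketch; not from the kernel sources. */
        p = rcu_dereference(gp);
        r0 = READ_ONCE(p->a);   /* Address dependency: p determines the address. */
        WRITE_ONCE(y, r0);      /* Data dependency: r0 determines the value stored. */
        if (r0)                 /* Control dependency: r0 determines whether... */
                WRITE_ONCE(q, 1);       /* ...this store executes at all. */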
In lib/prime_numbers.c, the expand_to_next_prime() function invokes
rcu_assign_pointer(), and the next_prime_number() function invokes
rcu_dereference().  This combination mediates access to a bit vector
that is expanded as additional primes are needed.
Write and read memory barriers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is usually better to use smp_store_release() instead of smp_wmb()
and to use smp_load_acquire() instead of smp_rmb().  However, the older
smp_wmb() and smp_rmb() APIs are still heavily used, so it is important
to understand their use cases.  The general approach is shown below:
        /* See MP+fencewmbonceonce+fencermbonceonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_wmb();
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                smp_rmb();
                r1 = READ_ONCE(x);
        }
The smp_wmb() macro orders prior stores against later stores, and the
smp_rmb() macro orders prior loads against later loads.  Therefore, if
the final value of r0 is 1, the final value of r1 must also be 1.

The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
the following write-side code fragment:
        log->l_curr_block -= log->l_logBBsize;
        ASSERT(log->l_curr_block >= 0);
        smp_wmb();
        log->l_curr_cycle++;
And the xlog_valid_lsn() function in fs/xfs/xfs_log_priv.h contains
the corresponding read-side code fragment:
        cur_cycle = READ_ONCE(log->l_curr_cycle);
        smp_rmb();
        cur_block = READ_ONCE(log->l_curr_block);
Alternatively, consider the following comment in function
perf_output_put_handle() in kernel/events/ring_buffer.c:

 *   kernel                             user
 *
 *   if (LOAD ->data_tail) {            LOAD ->data_head
 *                      (A)             smp_rmb()       (C)
 *      STORE $data                     LOAD $data
 *      smp_wmb()       (B)             smp_mb()        (D)
 *      STORE ->data_head               STORE ->data_tail
 *   }
The B/C pairing is an example of the MP pattern using smp_wmb() on the
write side and smp_rmb() on the read side.

Of course, given that smp_mb() is strictly stronger than either smp_wmb()
or smp_rmb(), any code fragment that would work with smp_rmb() and
smp_wmb() would also work with smp_mb() replacing either or both of the
weaker barriers.
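
For example, the write side of the MP pattern shown earlier would remain
correct, if more expensive, with smp_mb() substituted for smp_wmb():

        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_mb();       /* Strictly stronger than smp_wmb(). */
                WRITE_ONCE(y, 1);
        }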
Load buffering (LB)
-------------------

The LB pattern has one CPU load from one variable and then store to a
second, while another CPU loads from the second variable and then stores
to the first.  The goal is to avoid the counter-intuitive situation where
each load reads the value written by the other CPU's store.  In the
absence of any ordering it is quite possible that this may happen, as
can be seen in the LB+poonceonces.litmus litmus test.

One way of avoiding the counter-intuitive outcome is through the use of a
control dependency paired with a full memory barrier:
        /* See LB+fencembonceonce+ctrlonceonce.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(x);
                if (r0)
                        WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r1 = READ_ONCE(y);
                smp_mb();
                WRITE_ONCE(x, 1);
        }
This pairing of a control dependency in CPU0() with a full memory
barrier in CPU1() prevents r0 and r1 from both ending up equal to 1.

The A/D pairing from the ring-buffer use case shown earlier also
illustrates LB.  Here is a repeat of the comment in
perf_output_put_handle() in kernel/events/ring_buffer.c, showing a
control dependency on the kernel side and a full memory barrier on
the user side:
 *   kernel                             user
 *
 *   if (LOAD ->data_tail) {            LOAD ->data_head
 *                      (A)             smp_rmb()       (C)
 *      STORE $data                     LOAD $data
 *      smp_wmb()       (B)             smp_mb()        (D)
 *      STORE ->data_head               STORE ->data_tail
 *   }
 *
 * Where A pairs with D, and B pairs with C.
The kernel's control dependency between the load from ->data_tail
and the store to data combined with the user's full memory barrier
between the load from data and the store to ->data_tail prevents
the counter-intuitive outcome where the kernel overwrites the data
before the user gets done loading it.
Release-acquire chains
----------------------

Release-acquire chains are a low-overhead, flexible, and easy-to-use
method of maintaining order.  However, they do have some limitations that
need to be fully understood.  Here is an example that maintains order:
        /* See ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                smp_store_release(&z, 1);
        }

        void CPU2(void)
        {
                r1 = smp_load_acquire(&z);
                r2 = READ_ONCE(x);
        }
In this case, if r0 and r1 both have final values of 1, then r2 must
also have a final value of 1.

The ordering in this example is stronger than it needs to be.  For
example, ordering would still be preserved if CPU1()'s smp_load_acquire()
invocation was replaced with READ_ONCE(), as sketched below.
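
Here is that weakened variant of CPU1() (a sketch, not corresponding to
any particular in-tree litmus test):

        void CPU1(void)
        {
                r0 = READ_ONCE(y);              /* Was smp_load_acquire(&y). */
                smp_store_release(&z, 1);
        }

Informally, CPU1()'s smp_store_release() still guarantees that any store
CPU1() has seen, including CPU0()'s store to x, propagates to any CPU
that sees CPU1()'s store to z.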
It is tempting to assume that CPU0()'s store to x is globally ordered
before CPU1()'s store to z, but this is not the case:
        /* See Z6.0+pooncerelease+poacquirerelease+fencembonceonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                smp_store_release(&z, 1);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }
One might hope that if the final value of r0 is 1 and the final value
of z is 2, then the final value of r1 must also be 1, but it really is
possible for r1 to have the final value of 0.  The reason, of course,
is that in this version, CPU2() is not part of the release-acquire chain.
This situation is accounted for in the rules of thumb below.

Despite this limitation, release-acquire chains are low-overhead as
well as simple and powerful, at least as memory-ordering mechanisms go.
Store buffering
---------------

Store buffering can be thought of as upside-down load buffering, so
that one CPU first stores to one variable and then loads from a second,
while another CPU stores to the second variable and then loads from the
first.  Preserving order requires nothing less than full barriers:
        /* See SB+fencembonceonces.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_mb();
                r0 = READ_ONCE(y);
        }

        void CPU1(void)
        {
                WRITE_ONCE(y, 1);
                smp_mb();
                r1 = READ_ONCE(x);
        }
Omitting either smp_mb() will allow both r0 and r1 to have final
values of 0, but providing both full barriers as shown above prevents
this counter-intuitive outcome.

This pattern most famously appears as part of Dekker's locking
algorithm, but it has a much more practical use within the Linux kernel
of ordering wakeups.  The following comment taken from waitqueue_active()
in include/linux/wait.h shows the canonical pattern:
 * CPU0 - waker                    CPU1 - waiter
 *
 *                                 for (;;) {
 *   @cond = true;                   prepare_to_wait(&wq_head, &wait, state);
 *   smp_mb();                       // smp_mb() from set_current_state()
 *   if (waitqueue_active(wq_head))  if (@cond)
 *     wake_up(wq_head);               break;
 *                                   schedule();
 *                                 }
 *                                 finish_wait(&wq_head, &wait);
On CPU0, the store is to @cond and the load is in waitqueue_active().
On CPU1, prepare_to_wait() contains both a store to wq_head and a call
to set_current_state(), which contains an smp_mb() barrier; the load is
"if (@cond)".  The full barriers prevent the undesirable outcome where
CPU1 puts the waiting task to sleep and CPU0 fails to wake it up.

Note that use of locking can greatly simplify this pattern.
Rules of thumb
==============

There might seem to be no pattern governing what ordering primitives are
needed in which situations, but this is not the case.  There is a pattern
based on the relation between the accesses linking successive CPUs in a
given litmus test.  There are three types of linkage:

1.      Write-to-read, where the next CPU reads the value that the
        previous CPU wrote.  The LB litmus-test patterns contain only
        this type of relation.  In formal memory-modeling texts, this
        relation is called "reads-from" and is usually abbreviated "rf".

2.      Read-to-write, where the next CPU overwrites the value that the
        previous CPU read.  The SB litmus test contains only this type
        of relation.  In formal memory-modeling texts, this relation is
        often called "from-reads" and is sometimes abbreviated "fr".

3.      Write-to-write, where the next CPU overwrites the value written
        by the previous CPU.  The Z6.0 litmus test pattern contains a
        write-to-write relation between the last access of CPU1() and
        the first access of CPU2().  In formal memory-modeling texts,
        this relation is often called "coherence order" and is sometimes
        abbreviated "co".  In the C++ standard, it is instead called
        "modification order" and often abbreviated "mo".  The sketch
        following this list labels all three link types on the Z6.0
        pattern.
The strength of memory ordering required for a given litmus test to
avoid a counter-intuitive outcome depends on the types of relations
linking the memory accesses for the outcome in question:

o       If all links are write-to-read links, then the weakest
        possible ordering within each CPU suffices.  For example, in
        the LB litmus test, a control dependency was enough to do the
        job.

o       If all but one of the links are write-to-read links, then a
        release-acquire chain suffices.  Both the MP and the ISA2
        litmus tests illustrate this case.

o       If more than one of the links are something other than
        write-to-read links, then a full memory barrier is required
        between each successive pair of non-write-to-read links.  This
        case is illustrated by the Z6.0 litmus tests, both in the
        locking and in the release-acquire sections.
However, if you find yourself having to stretch these rules of thumb
to fit your situation, you should consider creating a litmus test and
running it on the model, as sketched below.
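
For example, a litmus test for the SB pattern with full barriers might
look like the following, which follows the format of the tests in
tools/memory-model/litmus-tests/ (the in-tree version of this test is
SB+fencembonceonces.litmus):

        C SB+fencembonceonces

        {}

        P0(int *x, int *y)
        {
                int r0;

                WRITE_ONCE(*x, 1);
                smp_mb();
                r0 = READ_ONCE(*y);
        }

        P1(int *x, int *y)
        {
                int r0;

                WRITE_ONCE(*y, 1);
                smp_mb();
                r0 = READ_ONCE(*x);
        }

        exists (0:r0=0 /\ 1:r0=0)

Assuming that the herd7 tool and the memory model are installed as
described in tools/memory-model/README, running a command along the
lines of "herd7 -conf linux-kernel.cfg SB+fencembonceonces.litmus"
should report that the exists clause is never satisfied.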