12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989910010110210310410510610710810911011111211311411511611711811912012112212312412512612712812913013113213313413513613713813914014114214314414514614714814915015115215315415515615715815916016116216316416516616716816917017117217317417517617717817918018118218318418518618718818919019119219319419519619719819920020120220320420520620720820921021121221321421521621721821922022122222322422522622722822923023123223323423523623723823924024124224324424524624724824925025125225325425525625725825926026126226326426526626726826927027127227327427527627727827928028128228328428528628728828929029129229329429529629729829930030130230330430530630730830931031131231331431531631731831932032132232332432532632732832933033133233333433533633733833934034134234334434534634734834935035135235335435535635735835936036136236336436536636736836937037137237337437537637737837938038138238338438538638738838939039139239339439539639739839940040140240340440540640740840941041141241341441541641741841942042142242342442542642742842943043143243343443543643743843944044144244344444544644744844945045145245345445545645745845946046146246346446546646746846947047147247347447547647747847948048148248348448548648748848949049149249349449549649749849950050150250350450550650750850951051151251351451551651751851952052152252352452552652752852953053153253353453553653753853954054154254354454554654754854955055155255355455555655755855956056156256356456556656756856957057157257357457557657757857958058158258358458558658758858959059159259359459559659759859960060160260360460560660760860961061161261361461561661761861962062162262362462562662762862963063163263363463563663763863964064164264364464564664764864965065165265365465565665765865966066166266366466566666766866967067167267367467567667767867968068168268368468568668768868969069169269369469569669769869970070170270370470570670770870971071171271371471571671771871972072172272372472572672772872973073173273373473573673773873974074174274374474574674774874975075175275375475575675775875976076176276376476576676776876977077177277377477577677777877978078178278378478578678778878979079179279379479579679779879980080180280380480580680780880981081181281381481581681781881982082182282382482582682782882983083183283383483583683783883984084184284384484584684784884985085185285385485585685785885986086186286386486586686786886987087187287387487587687787887988088188288388488588688788888989089189289389489589689789889990090190290390490590690790890991091191291391491591691791891992092192292392492592692792892993093193293393493593693793893994094194294394494594694794894995095195295395495595695795895996096196296396496596696796896997097197297397497597697797897998098198298398498598698798898999099199299399499599699799899910001001100210031004100510061007100810091010101110121013101410151016101710181019102010211022102310241025102610271028102910301031103210331034103510361037103810391040104110421043104410451046104710481049105010511052105310541055105610571058105910601061106210631064106510661067106810691070107110721073107410751076107710781079108010811082108310841085108610871088108910901091109210931094109510961097109810991100110111021103110411051106110711081109111011111112111311141115111611171118111911201121112211231124112511261127112811291130113111321133113411351136113711381139114011411142114311441145114611471148114911501151115211531154115511561157115811591160116111621163116411651166116711681169117011711172117311741175117611771178117911801181118211831184118511861187118811891190119111921193119411951196119711981199120012011202120312041205120612071208120912101211121212131214121512161217121812191220122112221223122412251226122712281229123012311232123312341235123612371238123912401241124212431244124512461247124812491250125112521253125412551256125712581259126012611262126312641265126612671268126912701271127212731274127512761277127812791280128112821283128412851286128712881289129012911292129312941295129612971298129913001301130213031304130513061307130813091310131113121313131413151316131713181319132013211322132313241325132613271328132913301331133213331334 |
- <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
- "http://www.w3.org/TR/html4/loose.dtd">
- <html>
- <head><title>A Tour Through TREE_RCU's Data Structures [LWN.net]</title>
- <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
- <p>January 27, 2016</p>
- <p>This article was contributed by Paul E. McKenney</p>
- <h3>Introduction</h3>
- This document describes RCU's major data structures and their relationship
- to each other.
- <ol>
- <li> <a href="#Data-Structure Relationships">
- Data-Structure Relationships</a>
- <li> <a href="#The rcu_state Structure">
- The <tt>rcu_state</tt> Structure</a>
- <li> <a href="#The rcu_node Structure">
- The <tt>rcu_node</tt> Structure</a>
- <li> <a href="#The rcu_data Structure">
- The <tt>rcu_data</tt> Structure</a>
- <li> <a href="#The rcu_dynticks Structure">
- The <tt>rcu_dynticks</tt> Structure</a>
- <li> <a href="#The rcu_head Structure">
- The <tt>rcu_head</tt> Structure</a>
- <li> <a href="#RCU-Specific Fields in the task_struct Structure">
- RCU-Specific Fields in the <tt>task_struct</tt> Structure</a>
- <li> <a href="#Accessor Functions">
- Accessor Functions</a>
- </ol>
- At the end we have the
- <a href="#Answers to Quick Quizzes">answers to the quick quizzes</a>.
- <h3><a name="Data-Structure Relationships">Data-Structure Relationships</a></h3>
- <p>RCU is for all intents and purposes a large state machine, and its
- data structures maintain the state in such a way as to allow RCU readers
- to execute extremely quickly, while also processing the RCU grace periods
- requested by updaters in an efficient and extremely scalable fashion.
- The efficiency and scalability of RCU updaters is provided primarily
- by a combining tree, as shown below:
- </p><p><img src="BigTreeClassicRCU.svg" alt="BigTreeClassicRCU.svg" width="30%">
- </p><p>This diagram shows an enclosing <tt>rcu_state</tt> structure
- containing a tree of <tt>rcu_node</tt> structures.
- Each leaf node of the <tt>rcu_node</tt> tree has up to 16
- <tt>rcu_data</tt> structures associated with it, so that there
- are <tt>NR_CPUS</tt> number of <tt>rcu_data</tt> structures,
- one for each possible CPU.
- This structure is adjusted at boot time, if needed, to handle the
- common case where <tt>nr_cpu_ids</tt> is much less than
- <tt>NR_CPUs</tt>.
- For example, a number of Linux distributions set <tt>NR_CPUs=4096</tt>,
- which results in a three-level <tt>rcu_node</tt> tree.
- If the actual hardware has only 16 CPUs, RCU will adjust itself
- at boot time, resulting in an <tt>rcu_node</tt> tree with only a single node.
- </p><p>The purpose of this combining tree is to allow per-CPU events
- such as quiescent states, dyntick-idle transitions,
- and CPU hotplug operations to be processed efficiently
- and scalably.
- Quiescent states are recorded by the per-CPU <tt>rcu_data</tt> structures,
- and other events are recorded by the leaf-level <tt>rcu_node</tt>
- structures.
- All of these events are combined at each level of the tree until finally
- grace periods are completed at the tree's root <tt>rcu_node</tt>
- structure.
- A grace period can be completed at the root once every CPU
- (or, in the case of <tt>CONFIG_PREEMPT_RCU</tt>, task)
- has passed through a quiescent state.
- Once a grace period has completed, record of that fact is propagated
- back down the tree.
- </p><p>As can be seen from the diagram, on a 64-bit system
- a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout
- of 64 at the root and a fanout of 16 at the leaves.
- <table>
- <tr><th> </th></tr>
- <tr><th align="left">Quick Quiz:</th></tr>
- <tr><td>
- Why isn't the fanout at the leaves also 64?
- </td></tr>
- <tr><th align="left">Answer:</th></tr>
- <tr><td bgcolor="#ffffff"><font color="ffffff">
- Because there are more types of events that affect the leaf-level
- <tt>rcu_node</tt> structures than further up the tree.
- Therefore, if the leaf <tt>rcu_node</tt> structures have fanout of
- 64, the contention on these structures' <tt>->structures</tt>
- becomes excessive.
- Experimentation on a wide variety of systems has shown that a fanout
- of 16 works well for the leaves of the <tt>rcu_node</tt> tree.
- </font>
- <p><font color="ffffff">Of course, further experience with
- systems having hundreds or thousands of CPUs may demonstrate
- that the fanout for the non-leaf <tt>rcu_node</tt> structures
- must also be reduced.
- Such reduction can be easily carried out when and if it proves
- necessary.
- In the meantime, if you are using such a system and running into
- contention problems on the non-leaf <tt>rcu_node</tt> structures,
- you may use the <tt>CONFIG_RCU_FANOUT</tt> kernel configuration
- parameter to reduce the non-leaf fanout as needed.
- </font>
- <p><font color="ffffff">Kernels built for systems with
- strong NUMA characteristics might also need to adjust
- <tt>CONFIG_RCU_FANOUT</tt> so that the domains of the
- <tt>rcu_node</tt> structures align with hardware boundaries.
- However, there has thus far been no need for this.
- </font></td></tr>
- <tr><td> </td></tr>
- </table>
- <p>If your system has more than 1,024 CPUs (or more than 512 CPUs on
- a 32-bit system), then RCU will automatically add more levels to the
- tree.
- For example, if you are crazy enough to build a 64-bit system with 65,536
- CPUs, RCU would configure the <tt>rcu_node</tt> tree as follows:
- </p><p><img src="HugeTreeClassicRCU.svg" alt="HugeTreeClassicRCU.svg" width="50%">
- </p><p>RCU currently permits up to a four-level tree, which on a 64-bit system
- accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for
- 32-bit systems.
- On the other hand, you can set <tt>CONFIG_RCU_FANOUT</tt> to be
- as small as 2 if you wish, which would permit only 16 CPUs, which
- is useful for testing.
- </p><p>This multi-level combining tree allows us to get most of the
- performance and scalability
- benefits of partitioning, even though RCU grace-period detection is
- inherently a global operation.
- The trick here is that only the last CPU to report a quiescent state
- into a given <tt>rcu_node</tt> structure need advance to the <tt>rcu_node</tt>
- structure at the next level up the tree.
- This means that at the leaf-level <tt>rcu_node</tt> structure, only
- one access out of sixteen will progress up the tree.
- For the internal <tt>rcu_node</tt> structures, the situation is even
- more extreme: Only one access out of sixty-four will progress up
- the tree.
- Because the vast majority of the CPUs do not progress up the tree,
- the lock contention remains roughly constant up the tree.
- No matter how many CPUs there are in the system, at most 64 quiescent-state
- reports per grace period will progress all the way to the root
- <tt>rcu_node</tt> structure, thus ensuring that the lock contention
- on that root <tt>rcu_node</tt> structure remains acceptably low.
- </p><p>In effect, the combining tree acts like a big shock absorber,
- keeping lock contention under control at all tree levels regardless
- of the level of loading on the system.
- </p><p>The Linux kernel actually supports multiple flavors of RCU
- running concurrently, so RCU builds separate data structures for each
- flavor.
- For example, for <tt>CONFIG_TREE_RCU=y</tt> kernels, RCU provides
- rcu_sched and rcu_bh, as shown below:
- </p><p><img src="BigTreeClassicRCUBH.svg" alt="BigTreeClassicRCUBH.svg" width="33%">
- </p><p>Energy efficiency is increasingly important, and for that
- reason the Linux kernel provides <tt>CONFIG_NO_HZ_IDLE</tt>, which
- turns off the scheduling-clock interrupts on idle CPUs, which in
- turn allows those CPUs to attain deeper sleep states and to consume
- less energy.
- CPUs whose scheduling-clock interrupts have been turned off are
- said to be in <i>dyntick-idle mode</i>.
- RCU must handle dyntick-idle CPUs specially
- because RCU would otherwise wake up each CPU on every grace period,
- which would defeat the whole purpose of <tt>CONFIG_NO_HZ_IDLE</tt>.
- RCU uses the <tt>rcu_dynticks</tt> structure to track
- which CPUs are in dyntick idle mode, as shown below:
- </p><p><img src="BigTreeClassicRCUBHdyntick.svg" alt="BigTreeClassicRCUBHdyntick.svg" width="33%">
- </p><p>However, if a CPU is in dyntick-idle mode, it is in that mode
- for all flavors of RCU.
- Therefore, a single <tt>rcu_dynticks</tt> structure is allocated per
- CPU, and all of a given CPU's <tt>rcu_data</tt> structures share
- that <tt>rcu_dynticks</tt>, as shown in the figure.
- </p><p>Kernels built with <tt>CONFIG_PREEMPT_RCU</tt> support
- rcu_preempt in addition to rcu_sched and rcu_bh, as shown below:
- </p><p><img src="BigTreePreemptRCUBHdyntick.svg" alt="BigTreePreemptRCUBHdyntick.svg" width="35%">
- </p><p>RCU updaters wait for normal grace periods by registering
- RCU callbacks, either directly via <tt>call_rcu()</tt> and
- friends (namely <tt>call_rcu_bh()</tt> and <tt>call_rcu_sched()</tt>),
- there being a separate interface per flavor of RCU)
- or indirectly via <tt>synchronize_rcu()</tt> and friends.
- RCU callbacks are represented by <tt>rcu_head</tt> structures,
- which are queued on <tt>rcu_data</tt> structures while they are
- waiting for a grace period to elapse, as shown in the following figure:
- </p><p><img src="BigTreePreemptRCUBHdyntickCB.svg" alt="BigTreePreemptRCUBHdyntickCB.svg" width="40%">
- </p><p>This figure shows how <tt>TREE_RCU</tt>'s and
- <tt>PREEMPT_RCU</tt>'s major data structures are related.
- Lesser data structures will be introduced with the algorithms that
- make use of them.
- </p><p>Note that each of the data structures in the above figure has
- its own synchronization:
- <p><ol>
- <li> Each <tt>rcu_state</tt> structures has a lock and a mutex,
- and some fields are protected by the corresponding root
- <tt>rcu_node</tt> structure's lock.
- <li> Each <tt>rcu_node</tt> structure has a spinlock.
- <li> The fields in <tt>rcu_data</tt> are private to the corresponding
- CPU, although a few can be read and written by other CPUs.
- <li> Similarly, the fields in <tt>rcu_dynticks</tt> are private
- to the corresponding CPU, although a few can be read by
- other CPUs.
- </ol>
- <p>It is important to note that different data structures can have
- very different ideas about the state of RCU at any given time.
- For but one example, awareness of the start or end of a given RCU
- grace period propagates slowly through the data structures.
- This slow propagation is absolutely necessary for RCU to have good
- read-side performance.
- If this balkanized implementation seems foreign to you, one useful
- trick is to consider each instance of these data structures to be
- a different person, each having the usual slightly different
- view of reality.
- </p><p>The general role of each of these data structures is as
- follows:
- </p><ol>
- <li> <tt>rcu_state</tt>:
- This structure forms the interconnection between the
- <tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
- tracks grace periods, serves as short-term repository
- for callbacks orphaned by CPU-hotplug events,
- maintains <tt>rcu_barrier()</tt> state,
- tracks expedited grace-period state,
- and maintains state used to force quiescent states when
- grace periods extend too long,
- <li> <tt>rcu_node</tt>: This structure forms the combining
- tree that propagates quiescent-state
- information from the leaves to the root, and also propagates
- grace-period information from the root to the leaves.
- It provides local copies of the grace-period state in order
- to allow this information to be accessed in a synchronized
- manner without suffering the scalability limitations that
- would otherwise be imposed by global locking.
- In <tt>CONFIG_PREEMPT_RCU</tt> kernels, it manages the lists
- of tasks that have blocked while in their current
- RCU read-side critical section.
- In <tt>CONFIG_PREEMPT_RCU</tt> with
- <tt>CONFIG_RCU_BOOST</tt>, it manages the
- per-<tt>rcu_node</tt> priority-boosting
- kernel threads (kthreads) and state.
- Finally, it records CPU-hotplug state in order to determine
- which CPUs should be ignored during a given grace period.
- <li> <tt>rcu_data</tt>: This per-CPU structure is the
- focus of quiescent-state detection and RCU callback queuing.
- It also tracks its relationship to the corresponding leaf
- <tt>rcu_node</tt> structure to allow more-efficient
- propagation of quiescent states up the <tt>rcu_node</tt>
- combining tree.
- Like the <tt>rcu_node</tt> structure, it provides a local
- copy of the grace-period information to allow for-free
- synchronized
- access to this information from the corresponding CPU.
- Finally, this structure records past dyntick-idle state
- for the corresponding CPU and also tracks statistics.
- <li> <tt>rcu_dynticks</tt>:
- This per-CPU structure tracks the current dyntick-idle
- state for the corresponding CPU.
- Unlike the other three structures, the <tt>rcu_dynticks</tt>
- structure is not replicated per RCU flavor.
- <li> <tt>rcu_head</tt>:
- This structure represents RCU callbacks, and is the
- only structure allocated and managed by RCU users.
- The <tt>rcu_head</tt> structure is normally embedded
- within the RCU-protected data structure.
- </ol>
- <p>If all you wanted from this article was a general notion of how
- RCU's data structures are related, you are done.
- Otherwise, each of the following sections give more details on
- the <tt>rcu_state</tt>, <tt>rcu_node</tt>, <tt>rcu_data</tt>,
- and <tt>rcu_dynticks</tt> data structures.
- <h3><a name="The rcu_state Structure">
- The <tt>rcu_state</tt> Structure</a></h3>
- <p>The <tt>rcu_state</tt> structure is the base structure that
- represents a flavor of RCU.
- This structure forms the interconnection between the
- <tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
- tracks grace periods, contains the lock used to
- synchronize with CPU-hotplug events,
- and maintains state used to force quiescent states when
- grace periods extend too long,
- </p><p>A few of the <tt>rcu_state</tt> structure's fields are discussed,
- singly and in groups, in the following sections.
- The more specialized fields are covered in the discussion of their
- use.
- <h5>Relationship to rcu_node and rcu_data Structures</h5>
- This portion of the <tt>rcu_state</tt> structure is declared
- as follows:
- <pre>
- 1 struct rcu_node node[NUM_RCU_NODES];
- 2 struct rcu_node *level[NUM_RCU_LVLS + 1];
- 3 struct rcu_data __percpu *rda;
- </pre>
- <table>
- <tr><th> </th></tr>
- <tr><th align="left">Quick Quiz:</th></tr>
- <tr><td>
- Wait a minute!
- You said that the <tt>rcu_node</tt> structures formed a tree,
- but they are declared as a flat array!
- What gives?
- </td></tr>
- <tr><th align="left">Answer:</th></tr>
- <tr><td bgcolor="#ffffff"><font color="ffffff">
- The tree is laid out in the array.
- The first node In the array is the head, the next set of nodes in the
- array are children of the head node, and so on until the last set of
- nodes in the array are the leaves.
- </font>
- <p><font color="ffffff">See the following diagrams to see how
- this works.
- </font></td></tr>
- <tr><td> </td></tr>
- </table>
- <p>The <tt>rcu_node</tt> tree is embedded into the
- <tt>->node[]</tt> array as shown in the following figure:
- </p><p><img src="TreeMapping.svg" alt="TreeMapping.svg" width="40%">
- </p><p>One interesting consequence of this mapping is that a
- breadth-first traversal of the tree is implemented as a simple
- linear scan of the array, which is in fact what the
- <tt>rcu_for_each_node_breadth_first()</tt> macro does.
- This macro is used at the beginning and ends of grace periods.
- </p><p>Each entry of the <tt>->level</tt> array references
- the first <tt>rcu_node</tt> structure on the corresponding level
- of the tree, for example, as shown below:
- </p><p><img src="TreeMappingLevel.svg" alt="TreeMappingLevel.svg" width="40%">
- </p><p>The zero<sup>th</sup> element of the array references the root
- <tt>rcu_node</tt> structure, the first element references the
- first child of the root <tt>rcu_node</tt>, and finally the second
- element references the first leaf <tt>rcu_node</tt> structure.
- </p><p>For whatever it is worth, if you draw the tree to be tree-shaped
- rather than array-shaped, it is easy to draw a planar representation:
- </p><p><img src="TreeLevel.svg" alt="TreeLevel.svg" width="60%">
- </p><p>Finally, the <tt>->rda</tt> field references a per-CPU
- pointer to the corresponding CPU's <tt>rcu_data</tt> structure.
- </p><p>All of these fields are constant once initialization is complete,
- and therefore need no protection.
- <h5>Grace-Period Tracking</h5>
- <p>This portion of the <tt>rcu_state</tt> structure is declared
- as follows:
- <pre>
- 1 unsigned long gpnum;
- 2 unsigned long completed;
- </pre>
- <p>RCU grace periods are numbered, and
- the <tt>->gpnum</tt> field contains the number of the grace
- period that started most recently.
- The <tt>->completed</tt> field contains the number of the
- grace period that completed most recently.
- If the two fields are equal, the RCU grace period that most recently
- started has already completed, and therefore the corresponding
- flavor of RCU is idle.
- If <tt>->gpnum</tt> is one greater than <tt>->completed</tt>,
- then <tt>->gpnum</tt> gives the number of the current RCU
- grace period, which has not yet completed.
- Any other combination of values indicates that something is broken.
- These two fields are protected by the root <tt>rcu_node</tt>'s
- <tt>->lock</tt> field.
- </p><p>There are <tt>->gpnum</tt> and <tt>->completed</tt> fields
- in the <tt>rcu_node</tt> and <tt>rcu_data</tt> structures
- as well.
- The fields in the <tt>rcu_state</tt> structure represent the
- most current values, and those of the other structures are compared
- in order to detect the start of a new grace period in a distributed
- fashion.
- The values flow from <tt>rcu_state</tt> to <tt>rcu_node</tt>
- (down the tree from the root to the leaves) to <tt>rcu_data</tt>.
- <h5>Miscellaneous</h5>
- <p>This portion of the <tt>rcu_state</tt> structure is declared
- as follows:
- <pre>
- 1 unsigned long gp_max;
- 2 char abbr;
- 3 char *name;
- </pre>
- <p>The <tt>->gp_max</tt> field tracks the duration of the longest
- grace period in jiffies.
- It is protected by the root <tt>rcu_node</tt>'s <tt>->lock</tt>.
- <p>The <tt>->name</tt> field points to the name of the RCU flavor
- (for example, “rcu_sched”), and is constant.
- The <tt>->abbr</tt> field contains a one-character abbreviation,
- for example, “s” for RCU-sched.
- <h3><a name="The rcu_node Structure">
- The <tt>rcu_node</tt> Structure</a></h3>
- <p>The <tt>rcu_node</tt> structures form the combining
- tree that propagates quiescent-state
- information from the leaves to the root and also that propagates
- grace-period information from the root down to the leaves.
- They provides local copies of the grace-period state in order
- to allow this information to be accessed in a synchronized
- manner without suffering the scalability limitations that
- would otherwise be imposed by global locking.
- In <tt>CONFIG_PREEMPT_RCU</tt> kernels, they manage the lists
- of tasks that have blocked while in their current
- RCU read-side critical section.
- In <tt>CONFIG_PREEMPT_RCU</tt> with
- <tt>CONFIG_RCU_BOOST</tt>, they manage the
- per-<tt>rcu_node</tt> priority-boosting
- kernel threads (kthreads) and state.
- Finally, they record CPU-hotplug state in order to determine
- which CPUs should be ignored during a given grace period.
- </p><p>The <tt>rcu_node</tt> structure's fields are discussed,
- singly and in groups, in the following sections.
- <h5>Connection to Combining Tree</h5>
- <p>This portion of the <tt>rcu_node</tt> structure is declared
- as follows:
- <pre>
- 1 struct rcu_node *parent;
- 2 u8 level;
- 3 u8 grpnum;
- 4 unsigned long grpmask;
- 5 int grplo;
- 6 int grphi;
- </pre>
- <p>The <tt>->parent</tt> pointer references the <tt>rcu_node</tt>
- one level up in the tree, and is <tt>NULL</tt> for the root
- <tt>rcu_node</tt>.
- The RCU implementation makes heavy use of this field to push quiescent
- states up the tree.
- The <tt>->level</tt> field gives the level in the tree, with
- the root being at level zero, its children at level one, and so on.
- The <tt>->grpnum</tt> field gives this node's position within
- the children of its parent, so this number can range between 0 and 31
- on 32-bit systems and between 0 and 63 on 64-bit systems.
- The <tt>->level</tt> and <tt>->grpnum</tt> fields are
- used only during initialization and for tracing.
- The <tt>->grpmask</tt> field is the bitmask counterpart of
- <tt>->grpnum</tt>, and therefore always has exactly one bit set.
- This mask is used to clear the bit corresponding to this <tt>rcu_node</tt>
- structure in its parent's bitmasks, which are described later.
- Finally, the <tt>->grplo</tt> and <tt>->grphi</tt> fields
- contain the lowest and highest numbered CPU served by this
- <tt>rcu_node</tt> structure, respectively.
- </p><p>All of these fields are constant, and thus do not require any
- synchronization.
- <h5>Synchronization</h5>
- <p>This field of the <tt>rcu_node</tt> structure is declared
- as follows:
- <pre>
- 1 raw_spinlock_t lock;
- </pre>
- <p>This field is used to protect the remaining fields in this structure,
- unless otherwise stated.
- That said, all of the fields in this structure can be accessed without
- locking for tracing purposes.
- Yes, this can result in confusing traces, but better some tracing confusion
- than to be heisenbugged out of existence.
- <h5>Grace-Period Tracking</h5>
- <p>This portion of the <tt>rcu_node</tt> structure is declared
- as follows:
- <pre>
- 1 unsigned long gpnum;
- 2 unsigned long completed;
- </pre>
- <p>These fields are the counterparts of the fields of the same name in
- the <tt>rcu_state</tt> structure.
- They each may lag up to one behind their <tt>rcu_state</tt>
- counterparts.
- If a given <tt>rcu_node</tt> structure's <tt>->gpnum</tt> and
- <tt>->complete</tt> fields are equal, then this <tt>rcu_node</tt>
- structure believes that RCU is idle.
- Otherwise, as with the <tt>rcu_state</tt> structure,
- the <tt>->gpnum</tt> field will be one greater than the
- <tt>->complete</tt> fields, with <tt>->gpnum</tt>
- indicating which grace period this <tt>rcu_node</tt> believes
- is still being waited for.
- </p><p>The <tt>>gpnum</tt> field of each <tt>rcu_node</tt>
- structure is updated at the beginning
- of each grace period, and the <tt>->completed</tt> fields are
- updated at the end of each grace period.
- <h5>Quiescent-State Tracking</h5>
- <p>These fields manage the propagation of quiescent states up the
- combining tree.
- </p><p>This portion of the <tt>rcu_node</tt> structure has fields
- as follows:
- <pre>
- 1 unsigned long qsmask;
- 2 unsigned long expmask;
- 3 unsigned long qsmaskinit;
- 4 unsigned long expmaskinit;
- </pre>
- <p>The <tt>->qsmask</tt> field tracks which of this
- <tt>rcu_node</tt> structure's children still need to report
- quiescent states for the current normal grace period.
- Such children will have a value of 1 in their corresponding bit.
- Note that the leaf <tt>rcu_node</tt> structures should be
- thought of as having <tt>rcu_data</tt> structures as their
- children.
- Similarly, the <tt>->expmask</tt> field tracks which
- of this <tt>rcu_node</tt> structure's children still need to report
- quiescent states for the current expedited grace period.
- An expedited grace period has
- the same conceptual properties as a normal grace period, but the
- expedited implementation accepts extreme CPU overhead to obtain
- much lower grace-period latency, for example, consuming a few
- tens of microseconds worth of CPU time to reduce grace-period
- duration from milliseconds to tens of microseconds.
- The <tt>->qsmaskinit</tt> field tracks which of this
- <tt>rcu_node</tt> structure's children cover for at least
- one online CPU.
- This mask is used to initialize <tt>->qsmask</tt>,
- and <tt>->expmaskinit</tt> is used to initialize
- <tt>->expmask</tt> and the beginning of the
- normal and expedited grace periods, respectively.
- <table>
- <tr><th> </th></tr>
- <tr><th align="left">Quick Quiz:</th></tr>
- <tr><td>
- Why are these bitmasks protected by locking?
- Come on, haven't you heard of atomic instructions???
- </td></tr>
- <tr><th align="left">Answer:</th></tr>
- <tr><td bgcolor="#ffffff"><font color="ffffff">
- Lockless grace-period computation! Such a tantalizing possibility!
- </font>
- <p><font color="ffffff">But consider the following sequence of events:
- </font>
- <ol>
- <li> <font color="ffffff">CPU 0 has been in dyntick-idle
- mode for quite some time.
- When it wakes up, it notices that the current RCU
- grace period needs it to report in, so it sets a
- flag where the scheduling clock interrupt will find it.
- </font><p>
- <li> <font color="ffffff">Meanwhile, CPU 1 is running
- <tt>force_quiescent_state()</tt>,
- and notices that CPU 0 has been in dyntick idle mode,
- which qualifies as an extended quiescent state.
- </font><p>
- <li> <font color="ffffff">CPU 0's scheduling clock
- interrupt fires in the
- middle of an RCU read-side critical section, and notices
- that the RCU core needs something, so commences RCU softirq
- processing.
- </font>
- <p>
- <li> <font color="ffffff">CPU 0's softirq handler
- executes and is just about ready
- to report its quiescent state up the <tt>rcu_node</tt>
- tree.
- </font><p>
- <li> <font color="ffffff">But CPU 1 beats it to the punch,
- completing the current
- grace period and starting a new one.
- </font><p>
- <li> <font color="ffffff">CPU 0 now reports its quiescent
- state for the wrong
- grace period.
- That grace period might now end before the RCU read-side
- critical section.
- If that happens, disaster will ensue.
- </font>
- </ol>
- <p><font color="ffffff">So the locking is absolutely required in
- order to coordinate
- clearing of the bits with the grace-period numbers in
- <tt>->gpnum</tt> and <tt>->completed</tt>.
- </font></td></tr>
- <tr><td> </td></tr>
- </table>
- <h5>Blocked-Task Management</h5>
- <p><tt>PREEMPT_RCU</tt> allows tasks to be preempted in the
- midst of their RCU read-side critical sections, and these tasks
- must be tracked explicitly.
- The details of exactly why and how they are tracked will be covered
- in a separate article on RCU read-side processing.
- For now, it is enough to know that the <tt>rcu_node</tt>
- structure tracks them.
- <pre>
- 1 struct list_head blkd_tasks;
- 2 struct list_head *gp_tasks;
- 3 struct list_head *exp_tasks;
- 4 bool wait_blkd_tasks;
- </pre>
- <p>The <tt>->blkd_tasks</tt> field is a list header for
- the list of blocked and preempted tasks.
- As tasks undergo context switches within RCU read-side critical
- sections, their <tt>task_struct</tt> structures are enqueued
- (via the <tt>task_struct</tt>'s <tt>->rcu_node_entry</tt>
- field) onto the head of the <tt>->blkd_tasks</tt> list for the
- leaf <tt>rcu_node</tt> structure corresponding to the CPU
- on which the outgoing context switch executed.
- As these tasks later exit their RCU read-side critical sections,
- they remove themselves from the list.
- This list is therefore in reverse time order, so that if one of the tasks
- is blocking the current grace period, all subsequent tasks must
- also be blocking that same grace period.
- Therefore, a single pointer into this list suffices to track
- all tasks blocking a given grace period.
- That pointer is stored in <tt>->gp_tasks</tt> for normal
- grace periods and in <tt>->exp_tasks</tt> for expedited
- grace periods.
- These last two fields are <tt>NULL</tt> if either there is
- no grace period in flight or if there are no blocked tasks
- preventing that grace period from completing.
- If either of these two pointers is referencing a task that
- removes itself from the <tt>->blkd_tasks</tt> list,
- then that task must advance the pointer to the next task on
- the list, or set the pointer to <tt>NULL</tt> if there
- are no subsequent tasks on the list.
- </p><p>For example, suppose that tasks T1, T2, and T3 are
- all hard-affinitied to the largest-numbered CPU in the system.
- Then if task T1 blocked in an RCU read-side
- critical section, then an expedited grace period started,
- then task T2 blocked in an RCU read-side critical section,
- then a normal grace period started, and finally task 3 blocked
- in an RCU read-side critical section, then the state of the
- last leaf <tt>rcu_node</tt> structure's blocked-task list
- would be as shown below:
- </p><p><img src="blkd_task.svg" alt="blkd_task.svg" width="60%">
- </p><p>Task T1 is blocking both grace periods, task T2 is
- blocking only the normal grace period, and task T3 is blocking
- neither grace period.
- Note that these tasks will not remove themselves from this list
- immediately upon resuming execution.
- They will instead remain on the list until they execute the outermost
- <tt>rcu_read_unlock()</tt> that ends their RCU read-side critical
- section.
- <p>
- The <tt>->wait_blkd_tasks</tt> field indicates whether or not
- the current grace period is waiting on a blocked task.
- <h5>Sizing the <tt>rcu_node</tt> Array</h5>
- <p>The <tt>rcu_node</tt> array is sized via a series of
- C-preprocessor expressions as follows:
- <pre>
- 1 #ifdef CONFIG_RCU_FANOUT
- 2 #define RCU_FANOUT CONFIG_RCU_FANOUT
- 3 #else
- 4 # ifdef CONFIG_64BIT
- 5 # define RCU_FANOUT 64
- 6 # else
- 7 # define RCU_FANOUT 32
- 8 # endif
- 9 #endif
- 10
- 11 #ifdef CONFIG_RCU_FANOUT_LEAF
- 12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
- 13 #else
- 14 # ifdef CONFIG_64BIT
- 15 # define RCU_FANOUT_LEAF 64
- 16 # else
- 17 # define RCU_FANOUT_LEAF 32
- 18 # endif
- 19 #endif
- 20
- 21 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF)
- 22 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT)
- 23 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT)
- 24 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT)
- 25
- 26 #if NR_CPUS <= RCU_FANOUT_1
- 27 # define RCU_NUM_LVLS 1
- 28 # define NUM_RCU_LVL_0 1
- 29 # define NUM_RCU_NODES NUM_RCU_LVL_0
- 30 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 }
- 31 # define RCU_NODE_NAME_INIT { "rcu_node_0" }
- 32 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" }
- 33 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" }
- 34 #elif NR_CPUS <= RCU_FANOUT_2
- 35 # define RCU_NUM_LVLS 2
- 36 # define NUM_RCU_LVL_0 1
- 37 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
- 38 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
- 39 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
- 40 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" }
- 41 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" }
- 42 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" }
- 43 #elif NR_CPUS <= RCU_FANOUT_3
- 44 # define RCU_NUM_LVLS 3
- 45 # define NUM_RCU_LVL_0 1
- 46 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
- 47 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
- 48 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
- 49 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
- 50 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" }
- 51 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
- 52 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" }
- 53 #elif NR_CPUS <= RCU_FANOUT_4
- 54 # define RCU_NUM_LVLS 4
- 55 # define NUM_RCU_LVL_0 1
- 56 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
- 57 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
- 58 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
- 59 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
- 60 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 }
- 61 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" }
- 62 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" }
- 63 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" }
- 64 #else
- 65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
- 66 #endif
- </pre>
- <p>The maximum number of levels in the <tt>rcu_node</tt> structure
- is currently limited to four, as specified by lines 21-24
- and the structure of the subsequent “if” statement.
- For 32-bit systems, this allows 16*32*32*32=524,288 CPUs, which
- should be sufficient for the next few years at least.
- For 64-bit systems, 16*64*64*64=4,194,304 CPUs is allowed, which
- should see us through the next decade or so.
- This four-level tree also allows kernels built with
- <tt>CONFIG_RCU_FANOUT=8</tt> to support up to 4096 CPUs,
- which might be useful in very large systems having eight CPUs per
- socket (but please note that no one has yet shown any measurable
- performance degradation due to misaligned socket and <tt>rcu_node</tt>
- boundaries).
- In addition, building kernels with a full four levels of <tt>rcu_node</tt>
- tree permits better testing of RCU's combining-tree code.
- </p><p>The <tt>RCU_FANOUT</tt> symbol controls how many children
- are permitted at each non-leaf level of the <tt>rcu_node</tt> tree.
- If the <tt>CONFIG_RCU_FANOUT</tt> Kconfig option is not specified,
- it is set based on the word size of the system, which is also
- the Kconfig default.
- </p><p>The <tt>RCU_FANOUT_LEAF</tt> symbol controls how many CPUs are
- handled by each leaf <tt>rcu_node</tt> structure.
- Experience has shown that allowing a given leaf <tt>rcu_node</tt>
- structure to handle 64 CPUs, as permitted by the number of bits in
- the <tt>->qsmask</tt> field on a 64-bit system, results in
- excessive contention for the leaf <tt>rcu_node</tt> structures'
- <tt>->lock</tt> fields.
- The number of CPUs per leaf <tt>rcu_node</tt> structure is therefore
- limited to 16 given the default value of <tt>CONFIG_RCU_FANOUT_LEAF</tt>.
- If <tt>CONFIG_RCU_FANOUT_LEAF</tt> is unspecified, the value
- selected is based on the word size of the system, just as for
- <tt>CONFIG_RCU_FANOUT</tt>.
- Lines 11-19 perform this computation.
- </p><p>Lines 21-24 compute the maximum number of CPUs supported by
- a single-level (which contains a single <tt>rcu_node</tt> structure),
- two-level, three-level, and four-level <tt>rcu_node</tt> tree,
- respectively, given the fanout specified by <tt>RCU_FANOUT</tt>
- and <tt>RCU_FANOUT_LEAF</tt>.
- These numbers of CPUs are retained in the
- <tt>RCU_FANOUT_1</tt>,
- <tt>RCU_FANOUT_2</tt>,
- <tt>RCU_FANOUT_3</tt>, and
- <tt>RCU_FANOUT_4</tt>
- C-preprocessor variables, respectively.
- </p><p>These variables are used to control the C-preprocessor <tt>#if</tt>
- statement spanning lines 26-66 that computes the number of
- <tt>rcu_node</tt> structures required for each level of the tree,
- as well as the number of levels required.
- The number of levels is placed in the <tt>NUM_RCU_LVLS</tt>
- C-preprocessor variable by lines 27, 35, 44, and 54.
- The number of <tt>rcu_node</tt> structures for the topmost level
- of the tree is always exactly one, and this value is unconditionally
- placed into <tt>NUM_RCU_LVL_0</tt> by lines 28, 36, 45, and 55.
- The rest of the levels (if any) of the <tt>rcu_node</tt> tree
- are computed by dividing the maximum number of CPUs by the
- fanout supported by the number of levels from the current level down,
- rounding up. This computation is performed by lines 37,
- 46-47, and 56-58.
- Lines 31-33, 40-42, 50-52, and 62-63 create initializers
- for lockdep lock-class names.
- Finally, lines 64-66 produce an error if the maximum number of
- CPUs is too large for the specified fanout.
- <h3><a name="The rcu_data Structure">
- The <tt>rcu_data</tt> Structure</a></h3>
- <p>The <tt>rcu_data</tt> maintains the per-CPU state for the
- corresponding flavor of RCU.
- The fields in this structure may be accessed only from the corresponding
- CPU (and from tracing) unless otherwise stated.
- This structure is the
- focus of quiescent-state detection and RCU callback queuing.
- It also tracks its relationship to the corresponding leaf
- <tt>rcu_node</tt> structure to allow more-efficient
- propagation of quiescent states up the <tt>rcu_node</tt>
- combining tree.
- Like the <tt>rcu_node</tt> structure, it provides a local
- copy of the grace-period information to allow for-free
- synchronized
- access to this information from the corresponding CPU.
- Finally, this structure records past dyntick-idle state
- for the corresponding CPU and also tracks statistics.
- </p><p>The <tt>rcu_data</tt> structure's fields are discussed,
- singly and in groups, in the following sections.
- <h5>Connection to Other Data Structures</h5>
- <p>This portion of the <tt>rcu_data</tt> structure is declared
- as follows:
- <pre>
- 1 int cpu;
- 2 struct rcu_state *rsp;
- 3 struct rcu_node *mynode;
- 4 struct rcu_dynticks *dynticks;
- 5 unsigned long grpmask;
- 6 bool beenonline;
- </pre>
- <p>The <tt>->cpu</tt> field contains the number of the
- corresponding CPU, the <tt>->rsp</tt> pointer references
- the corresponding <tt>rcu_state</tt> structure (and is most frequently
- used to locate the name of the corresponding flavor of RCU for tracing),
- and the <tt>->mynode</tt> field references the corresponding
- <tt>rcu_node</tt> structure.
- The <tt>->mynode</tt> is used to propagate quiescent states
- up the combining tree.
- <p>The <tt>->dynticks</tt> pointer references the
- <tt>rcu_dynticks</tt> structure corresponding to this
- CPU.
- Recall that a single per-CPU instance of the <tt>rcu_dynticks</tt>
- structure is shared among all flavors of RCU.
- These first four fields are constant and therefore require not
- synchronization.
- </p><p>The <tt>->grpmask</tt> field indicates the bit in
- the <tt>->mynode->qsmask</tt> corresponding to this
- <tt>rcu_data</tt> structure, and is also used when propagating
- quiescent states.
- The <tt>->beenonline</tt> flag is set whenever the corresponding
- CPU comes online, which means that the debugfs tracing need not dump
- out any <tt>rcu_data</tt> structure for which this flag is not set.
- <h5>Quiescent-State and Grace-Period Tracking</h5>
- <p>This portion of the <tt>rcu_data</tt> structure is declared
- as follows:
- <pre>
- 1 unsigned long completed;
- 2 unsigned long gpnum;
- 3 bool cpu_no_qs;
- 4 bool core_needs_qs;
- 5 bool gpwrap;
- 6 unsigned long rcu_qs_ctr_snap;
- </pre>
- <p>The <tt>completed</tt> and <tt>gpnum</tt>
- fields are the counterparts of the fields of the same name
- in the <tt>rcu_state</tt> and <tt>rcu_node</tt> structures.
- They may each lag up to one behind their <tt>rcu_node</tt>
- counterparts, but in <tt>CONFIG_NO_HZ_IDLE</tt> and
- <tt>CONFIG_NO_HZ_FULL</tt> kernels can lag
- arbitrarily far behind for CPUs in dyntick-idle mode (but these counters
- will catch up upon exit from dyntick-idle mode).
- If a given <tt>rcu_data</tt> structure's <tt>->gpnum</tt> and
- <tt>->complete</tt> fields are equal, then this <tt>rcu_data</tt>
- structure believes that RCU is idle.
- Otherwise, as with the <tt>rcu_state</tt> and <tt>rcu_node</tt>
- structure,
- the <tt>->gpnum</tt> field will be one greater than the
- <tt>->complete</tt> fields, with <tt>->gpnum</tt>
- indicating which grace period this <tt>rcu_data</tt> believes
- is still being waited for.
- <table>
- <tr><th> </th></tr>
- <tr><th align="left">Quick Quiz:</th></tr>
- <tr><td>
- All this replication of the grace period numbers can only cause
- massive confusion.
- Why not just keep a global pair of counters and be done with it???
- </td></tr>
- <tr><th align="left">Answer:</th></tr>
- <tr><td bgcolor="#ffffff"><font color="ffffff">
- Because if there was only a single global pair of grace-period
- numbers, there would need to be a single global lock to allow
- safely accessing and updating them.
- And if we are not going to have a single global lock, we need
- to carefully manage the numbers on a per-node basis.
- Recall from the answer to a previous Quick Quiz that the consequences
- of applying a previously sampled quiescent state to the wrong
- grace period are quite severe.
- </font></td></tr>
- <tr><td> </td></tr>
- </table>
- <p>The <tt>->cpu_no_qs</tt> flag indicates that the
- CPU has not yet passed through a quiescent state,
- while the <tt>->core_needs_qs</tt> flag indicates that the
- RCU core needs a quiescent state from the corresponding CPU.
- The <tt>->gpwrap</tt> field indicates that the corresponding
- CPU has remained idle for so long that the <tt>completed</tt>
- and <tt>gpnum</tt> counters are in danger of overflow, which
- will cause the CPU to disregard the values of its counters on
- its next exit from idle.
- Finally, the <tt>rcu_qs_ctr_snap</tt> field is used to detect
- cases where a given operation has resulted in a quiescent state
- for all flavors of RCU, for example, <tt>cond_resched_rcu_qs()</tt>.
- <h5>RCU Callback Handling</h5>
- <p>In the absence of CPU-hotplug events, RCU callbacks are invoked by
- the same CPU that registered them.
- This is strictly a cache-locality optimization: callbacks can and
- do get invoked on CPUs other than the one that registered them.
- After all, if the CPU that registered a given callback has gone
- offline before the callback can be invoked, there really is no other
- choice.
- </p><p>This portion of the <tt>rcu_data</tt> structure is declared
- as follows:
- <pre>
- 1 struct rcu_head *nxtlist;
- 2 struct rcu_head **nxttail[RCU_NEXT_SIZE];
- 3 unsigned long nxtcompleted[RCU_NEXT_SIZE];
- 4 long qlen_lazy;
- 5 long qlen;
- 6 long qlen_last_fqs_check;
- 7 unsigned long n_force_qs_snap;
- 8 unsigned long n_cbs_invoked;
- 9 unsigned long n_cbs_orphaned;
- 10 unsigned long n_cbs_adopted;
- 11 long blimit;
- </pre>
- <p>The <tt>->nxtlist</tt> pointer and the
- <tt>->nxttail[]</tt> array form a four-segment list with
- older callbacks near the head and newer ones near the tail.
- Each segment contains callbacks with the corresponding relationship
- to the current grace period.
- The pointer out of the end of each of the four segments is referenced
- by the element of the <tt>->nxttail[]</tt> array indexed by
- <tt>RCU_DONE_TAIL</tt> (for callbacks handled by a prior grace period),
- <tt>RCU_WAIT_TAIL</tt> (for callbacks waiting on the current grace period),
- <tt>RCU_NEXT_READY_TAIL</tt> (for callbacks that will wait on the next
- grace period), and
- <tt>RCU_NEXT_TAIL</tt> (for callbacks that are not yet associated
- with a specific grace period)
- respectively, as shown in the following figure.
- </p><p><img src="nxtlist.svg" alt="nxtlist.svg" width="40%">
- </p><p>In this figure, the <tt>->nxtlist</tt> pointer references the
- first
- RCU callback in the list.
- The <tt>->nxttail[RCU_DONE_TAIL]</tt> array element references
- the <tt>->nxtlist</tt> pointer itself, indicating that none
- of the callbacks is ready to invoke.
- The <tt>->nxttail[RCU_WAIT_TAIL]</tt> array element references callback
- CB 2's <tt>->next</tt> pointer, which indicates that
- CB 1 and CB 2 are both waiting on the current grace period.
- The <tt>->nxttail[RCU_NEXT_READY_TAIL]</tt> array element
- references the same RCU callback that <tt>->nxttail[RCU_WAIT_TAIL]</tt>
- does, which indicates that there are no callbacks waiting on the next
- RCU grace period.
- The <tt>->nxttail[RCU_NEXT_TAIL]</tt> array element references
- CB 4's <tt>->next</tt> pointer, indicating that all the
- remaining RCU callbacks have not yet been assigned to an RCU grace
- period.
- Note that the <tt>->nxttail[RCU_NEXT_TAIL]</tt> array element
- always references the last RCU callback's <tt>->next</tt> pointer
- unless the callback list is empty, in which case it references
- the <tt>->nxtlist</tt> pointer.
- </p><p>CPUs advance their callbacks from the
- <tt>RCU_NEXT_TAIL</tt> to the <tt>RCU_NEXT_READY_TAIL</tt> to the
- <tt>RCU_WAIT_TAIL</tt> to the <tt>RCU_DONE_TAIL</tt> list segments
- as grace periods advance.
- The CPU advances the callbacks in its <tt>rcu_data</tt> structure
- whenever it notices that another RCU grace period has completed.
- The CPU detects the completion of an RCU grace period by noticing
- that the value of its <tt>rcu_data</tt> structure's
- <tt>->completed</tt> field differs from that of its leaf
- <tt>rcu_node</tt> structure.
- Recall that each <tt>rcu_node</tt> structure's
- <tt>->completed</tt> field is updated at the end of each
- grace period.
- </p><p>The <tt>->nxtcompleted[]</tt> array records grace-period
- numbers corresponding to the list segments.
- This allows CPUs that go idle for extended periods to determine
- which of their callbacks are ready to be invoked after reawakening.
- </p><p>The <tt>->qlen</tt> counter contains the number of
- callbacks in <tt>->nxtlist</tt>, and the
- <tt>->qlen_lazy</tt> contains the number of those callbacks that
- are known to only free memory, and whose invocation can therefore
- be safely deferred.
- The <tt>->qlen_last_fqs_check</tt> and
- <tt>->n_force_qs_snap</tt> coordinate the forcing of quiescent
- states from <tt>call_rcu()</tt> and friends when callback
- lists grow excessively long.
- </p><p>The <tt>->n_cbs_invoked</tt>,
- <tt>->n_cbs_orphaned</tt>, and <tt>->n_cbs_adopted</tt>
- fields count the number of callbacks invoked,
- sent to other CPUs when this CPU goes offline,
- and received from other CPUs when those other CPUs go offline.
- Finally, the <tt>->blimit</tt> counter is the maximum number of
- RCU callbacks that may be invoked at a given time.
- <h5>Dyntick-Idle Handling</h5>
- <p>This portion of the <tt>rcu_data</tt> structure is declared
- as follows:
- <pre>
- 1 int dynticks_snap;
- 2 unsigned long dynticks_fqs;
- </pre>
- The <tt>->dynticks_snap</tt> field is used to take a snapshot
- of the corresponding CPU's dyntick-idle state when forcing
- quiescent states, and is therefore accessed from other CPUs.
- Finally, the <tt>->dynticks_fqs</tt> field is used to
- count the number of times this CPU is determined to be in
- dyntick-idle state, and is used for tracing and debugging purposes.
- <h3><a name="The rcu_dynticks Structure">
- The <tt>rcu_dynticks</tt> Structure</a></h3>
- <p>The <tt>rcu_dynticks</tt> maintains the per-CPU dyntick-idle state
- for the corresponding CPU.
- Unlike the other structures, <tt>rcu_dynticks</tt> is not
- replicated over the different flavors of RCU.
- The fields in this structure may be accessed only from the corresponding
- CPU (and from tracing) unless otherwise stated.
- Its fields are as follows:
- <pre>
- 1 int dynticks_nesting;
- 2 int dynticks_nmi_nesting;
- 3 atomic_t dynticks;
- </pre>
- <p>The <tt>->dynticks_nesting</tt> field counts the
- nesting depth of normal interrupts.
- In addition, this counter is incremented when exiting dyntick-idle
- mode and decremented when entering it.
- This counter can therefore be thought of as counting the number
- of reasons why this CPU cannot be permitted to enter dyntick-idle
- mode, aside from non-maskable interrupts (NMIs).
- NMIs are counted by the <tt>->dynticks_nmi_nesting</tt>
- field, except that NMIs that interrupt non-dyntick-idle execution
- are not counted.
- </p><p>Finally, the <tt>->dynticks</tt> field counts the corresponding
- CPU's transitions to and from dyntick-idle mode, so that this counter
- has an even value when the CPU is in dyntick-idle mode and an odd
- value otherwise.
- <table>
- <tr><th> </th></tr>
- <tr><th align="left">Quick Quiz:</th></tr>
- <tr><td>
- Why not just count all NMIs?
- Wouldn't that be simpler and less error prone?
- </td></tr>
- <tr><th align="left">Answer:</th></tr>
- <tr><td bgcolor="#ffffff"><font color="ffffff">
- It seems simpler only until you think hard about how to go about
- updating the <tt>rcu_dynticks</tt> structure's
- <tt>->dynticks</tt> field.
- </font></td></tr>
- <tr><td> </td></tr>
- </table>
- <p>Additional fields are present for some special-purpose
- builds, and are discussed separately.
- <h3><a name="The rcu_head Structure">
- The <tt>rcu_head</tt> Structure</a></h3>
- <p>Each <tt>rcu_head</tt> structure represents an RCU callback.
- These structures are normally embedded within RCU-protected data
- structures whose algorithms use asynchronous grace periods.
- In contrast, when using algorithms that block waiting for RCU grace periods,
- RCU users need not provide <tt>rcu_head</tt> structures.
- </p><p>The <tt>rcu_head</tt> structure has fields as follows:
- <pre>
- 1 struct rcu_head *next;
- 2 void (*func)(struct rcu_head *head);
- </pre>
- <p>The <tt>->next</tt> field is used
- to link the <tt>rcu_head</tt> structures together in the
- lists within the <tt>rcu_data</tt> structures.
- The <tt>->func</tt> field is a pointer to the function
- to be called when the callback is ready to be invoked, and
- this function is passed a pointer to the <tt>rcu_head</tt>
- structure.
- However, <tt>kfree_rcu()</tt> uses the <tt>->func</tt>
- field to record the offset of the <tt>rcu_head</tt>
- structure within the enclosing RCU-protected data structure.
- </p><p>Both of these fields are used internally by RCU.
- From the viewpoint of RCU users, this structure is an
- opaque “cookie”.
- <table>
- <tr><th> </th></tr>
- <tr><th align="left">Quick Quiz:</th></tr>
- <tr><td>
- Given that the callback function <tt>->func</tt>
- is passed a pointer to the <tt>rcu_head</tt> structure,
- how is that function supposed to find the beginning of the
- enclosing RCU-protected data structure?
- </td></tr>
- <tr><th align="left">Answer:</th></tr>
- <tr><td bgcolor="#ffffff"><font color="ffffff">
- In actual practice, there is a separate callback function per
- type of RCU-protected data structure.
- The callback function can therefore use the <tt>container_of()</tt>
- macro in the Linux kernel (or other pointer-manipulation facilities
- in other software environments) to find the beginning of the
- enclosing structure.
- </font></td></tr>
- <tr><td> </td></tr>
- </table>
- <h3><a name="RCU-Specific Fields in the task_struct Structure">
- RCU-Specific Fields in the <tt>task_struct</tt> Structure</a></h3>
- <p>The <tt>CONFIG_PREEMPT_RCU</tt> implementation uses some
- additional fields in the <tt>task_struct</tt> structure:
- <pre>
- 1 #ifdef CONFIG_PREEMPT_RCU
- 2 int rcu_read_lock_nesting;
- 3 union rcu_special rcu_read_unlock_special;
- 4 struct list_head rcu_node_entry;
- 5 struct rcu_node *rcu_blocked_node;
- 6 #endif /* #ifdef CONFIG_PREEMPT_RCU */
- 7 #ifdef CONFIG_TASKS_RCU
- 8 unsigned long rcu_tasks_nvcsw;
- 9 bool rcu_tasks_holdout;
- 10 struct list_head rcu_tasks_holdout_list;
- 11 int rcu_tasks_idle_cpu;
- 12 #endif /* #ifdef CONFIG_TASKS_RCU */
- </pre>
- <p>The <tt>->rcu_read_lock_nesting</tt> field records the
- nesting level for RCU read-side critical sections, and
- the <tt>->rcu_read_unlock_special</tt> field is a bitmask
- that records special conditions that require <tt>rcu_read_unlock()</tt>
- to do additional work.
- The <tt>->rcu_node_entry</tt> field is used to form lists of
- tasks that have blocked within preemptible-RCU read-side critical
- sections and the <tt>->rcu_blocked_node</tt> field references
- the <tt>rcu_node</tt> structure whose list this task is a member of,
- or <tt>NULL</tt> if it is not blocked within a preemptible-RCU
- read-side critical section.
- <p>The <tt>->rcu_tasks_nvcsw</tt> field tracks the number of
- voluntary context switches that this task had undergone at the
- beginning of the current tasks-RCU grace period,
- <tt>->rcu_tasks_holdout</tt> is set if the current tasks-RCU
- grace period is waiting on this task, <tt>->rcu_tasks_holdout_list</tt>
- is a list element enqueuing this task on the holdout list,
- and <tt>->rcu_tasks_idle_cpu</tt> tracks which CPU this
- idle task is running, but only if the task is currently running,
- that is, if the CPU is currently idle.
- <h3><a name="Accessor Functions">
- Accessor Functions</a></h3>
- <p>The following listing shows the
- <tt>rcu_get_root()</tt>, <tt>rcu_for_each_node_breadth_first</tt>,
- <tt>rcu_for_each_nonleaf_node_breadth_first()</tt>, and
- <tt>rcu_for_each_leaf_node()</tt> function and macros:
- <pre>
- 1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
- 2 {
- 3 return &rsp->node[0];
- 4 }
- 5
- 6 #define rcu_for_each_node_breadth_first(rsp, rnp) \
- 7 for ((rnp) = &(rsp)->node[0]; \
- 8 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
- 9
- 10 #define rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) \
- 11 for ((rnp) = &(rsp)->node[0]; \
- 12 (rnp) < (rsp)->level[NUM_RCU_LVLS - 1]; (rnp)++)
- 13
- 14 #define rcu_for_each_leaf_node(rsp, rnp) \
- 15 for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
- 16 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
- </pre>
- <p>The <tt>rcu_get_root()</tt> simply returns a pointer to the
- first element of the specified <tt>rcu_state</tt> structure's
- <tt>->node[]</tt> array, which is the root <tt>rcu_node</tt>
- structure.
- </p><p>As noted earlier, the <tt>rcu_for_each_node_breadth_first()</tt>
- macro takes advantage of the layout of the <tt>rcu_node</tt>
- structures in the <tt>rcu_state</tt> structure's
- <tt>->node[]</tt> array, performing a breadth-first traversal by
- simply traversing the array in order.
- The <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> macro operates
- similarly, but traverses only the first part of the array, thus excluding
- the leaf <tt>rcu_node</tt> structures.
- Finally, the <tt>rcu_for_each_leaf_node()</tt> macro traverses only
- the last part of the array, thus traversing only the leaf
- <tt>rcu_node</tt> structures.
- <table>
- <tr><th> </th></tr>
- <tr><th align="left">Quick Quiz:</th></tr>
- <tr><td>
- What do <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> and
- <tt>rcu_for_each_leaf_node()</tt> do if the <tt>rcu_node</tt> tree
- contains only a single node?
- </td></tr>
- <tr><th align="left">Answer:</th></tr>
- <tr><td bgcolor="#ffffff"><font color="ffffff">
- In the single-node case,
- <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> is a no-op
- and <tt>rcu_for_each_leaf_node()</tt> traverses the single node.
- </font></td></tr>
- <tr><td> </td></tr>
- </table>
- <h3><a name="Summary">
- Summary</a></h3>
- So each flavor of RCU is represented by an <tt>rcu_state</tt> structure,
- which contains a combining tree of <tt>rcu_node</tt> and
- <tt>rcu_data</tt> structures.
- Finally, in <tt>CONFIG_NO_HZ_IDLE</tt> kernels, each CPU's dyntick-idle
- state is tracked by an <tt>rcu_dynticks</tt> structure.
- If you made it this far, you are well prepared to read the code
- walkthroughs in the other articles in this series.
- <h3><a name="Acknowledgments">
- Acknowledgments</a></h3>
- I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul
- Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn
- for helping me get this document into a more human-readable state.
- <h3><a name="Legal Statement">
- Legal Statement</a></h3>
- <p>This work represents the view of the author and does not necessarily
- represent the view of IBM.
- </p><p>Linux is a registered trademark of Linus Torvalds.
- </p><p>Other company, product, and service names may be trademarks or
- service marks of others.
- </body></html>
|