Cluster-wide Power-up/power-down race avoidance algorithm
=========================================================

This file documents the algorithm which is used to coordinate CPU and
cluster setup and teardown operations and to manage hardware coherency
controls safely.

The section "Rationale" explains what the algorithm is for and why it is
needed. "Basic model" explains general concepts using a simplified view
of the system. The other sections explain the actual details of the
algorithm in use.


Rationale
---------

In a system containing multiple CPUs, it is desirable to have the
ability to turn off individual CPUs when the system is idle, reducing
power consumption and thermal dissipation.

In a system containing multiple clusters of CPUs, it is also desirable
to have the ability to turn off entire clusters.

Turning entire clusters off and on is a risky business, because it
involves performing potentially destructive operations affecting a group
of independently running CPUs, while the OS continues to run. This
means that we need some coordination in order to ensure that critical
cluster-level operations are only performed when it is truly safe to do
so.

Simple locking may not be sufficient to solve this problem, because
mechanisms like Linux spinlocks may rely on coherency mechanisms which
are not immediately enabled when a cluster powers up. Since enabling or
disabling those mechanisms may itself be a non-atomic operation (such as
writing some hardware registers and invalidating large caches), other
methods of coordination are required in order to guarantee safe
power-down and power-up at the cluster level.

This document describes a coherent memory based protocol for performing
the needed coordination. It aims to be as lightweight as possible, while
providing the required safety properties.


Basic model
-----------

Each cluster and CPU is assigned a state, as follows:

    DOWN
    COMING_UP
    UP
    GOING_DOWN

        +---------> UP ----------+
        |                        v

    COMING_UP                GOING_DOWN

        ^                        |
        +--------- DOWN <--------+


DOWN:   The CPU or cluster is not coherent, and is either powered off or
    suspended, or is ready to be powered off or suspended.

COMING_UP: The CPU or cluster has committed to moving to the UP state.
    It may be part way through the process of initialisation and
    enabling coherency.

UP:     The CPU or cluster is active and coherent at the hardware
    level. A CPU in this state is not necessarily being used
    actively by the kernel.

GOING_DOWN: The CPU or cluster has committed to moving to the DOWN
    state. It may be part way through the process of teardown and
    coherency exit.


Each CPU has one of these states assigned to it at any point in time.
The CPU states are described in the "CPU state" section, below.

Each cluster is also assigned a state, but it is necessary to split the
state value into two parts (the "cluster" state and "inbound" state) and
to introduce additional states in order to avoid races between different
CPUs in the cluster simultaneously modifying the state. The cluster-
level states are described in the "Cluster state" section.

To help distinguish the CPU states from cluster states in this
discussion, the state names are given a CPU_ prefix for the CPU states,
and a CLUSTER_ or INBOUND_ prefix for the cluster states.

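The four-state cycle of the basic model can be sketched in C as follows.
This is an illustration only: the names basic_state, basic_next() and
basic_transition_ok() are hypothetical and are not taken from the kernel
sources.

```c
#include <stdbool.h>

/* Basic-model states, listed in the order in which they are visited. */
enum basic_state {
    DOWN,
    COMING_UP,
    UP,
    GOING_DOWN,
    NR_BASIC_STATES
};

/* Each state in the basic model has exactly one successor. */
static enum basic_state basic_next(enum basic_state s)
{
    return (enum basic_state)((s + 1) % NR_BASIC_STATES);
}

/* A transition is legal only if it follows the cycle in the diagram. */
static bool basic_transition_ok(enum basic_state from, enum basic_state to)
{
    return from < NR_BASIC_STATES && basic_next(from) == to;
}
```

With this encoding, UP to GOING_DOWN is accepted, while a direct jump
from UP to DOWN is rejected because it skips the teardown state.
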

CPU state
---------

In this algorithm, each individual core in a multi-core processor is
referred to as a "CPU". CPUs are assumed to be single-threaded:
therefore, a CPU can only be doing one thing at a single point in time.

This means that CPUs fit the basic model closely.

The algorithm defines the following states for each CPU in the system:

    CPU_DOWN
    CPU_COMING_UP
    CPU_UP
    CPU_GOING_DOWN

     cluster setup and
    CPU setup complete          policy decision
          +-----------> CPU_UP ------------+
          |                                v

    CPU_COMING_UP                   CPU_GOING_DOWN

          ^                                |
          +----------- CPU_DOWN <----------+
     policy decision           CPU teardown complete
    or hardware event


The definitions of the four states correspond closely to the states of
the basic model.

Transitions between states occur as follows.

A trigger event of "(spontaneous)" means that the CPU can transition to
the next state as a result of making local progress only, with no
requirement for any external event to happen.


CPU_DOWN:

    A CPU reaches the CPU_DOWN state when it is ready for
    power-down. On reaching this state, the CPU will typically
    power itself down or suspend itself, via a WFI instruction or a
    firmware call.

    Next state:     CPU_COMING_UP
    Conditions:     none

    Trigger events:

        a) an explicit hardware power-up operation, resulting
           from a policy decision on another CPU;

        b) a hardware event, such as an interrupt.


CPU_COMING_UP:

    A CPU cannot start participating in hardware coherency until the
    cluster is set up and coherent. If the cluster is not ready,
    then the CPU will wait in the CPU_COMING_UP state until the
    cluster has been set up.

    Next state:     CPU_UP
    Conditions:     The CPU's parent cluster must be in CLUSTER_UP.
    Trigger events: Transition of the parent cluster to CLUSTER_UP.

    Refer to the "Cluster state" section for a description of the
    CLUSTER_UP state.


CPU_UP:

    When a CPU reaches the CPU_UP state, it is safe for the CPU to
    start participating in local coherency.

    This is done by jumping to the kernel's CPU resume code.

    Note that the definition of this state is slightly different
    from the basic model definition: CPU_UP does not mean that the
    CPU is coherent yet, but it does mean that it is safe to resume
    the kernel. The kernel handles the rest of the resume
    procedure, so the remaining steps are not visible as part of the
    race avoidance algorithm.

    The CPU remains in this state until an explicit policy decision
    is made to shut down or suspend the CPU.

    Next state:     CPU_GOING_DOWN
    Conditions:     none
    Trigger events: explicit policy decision


CPU_GOING_DOWN:

    While in this state, the CPU exits coherency, including any
    operations required to achieve this (such as cleaning data
    caches).

    Next state:     CPU_DOWN
    Conditions:     local CPU teardown complete
    Trigger events: (spontaneous)

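The CPU-level transition rules above can be summarised as a single step
function. The following is a minimal sketch, not the kernel
implementation: struct cpu_inputs, its fields and cpu_step() are
hypothetical names introduced only to make the conditions and trigger
events explicit.

```c
#include <stdbool.h>

enum cpu_state { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };

/* Conditions and trigger events seen by one CPU (hypothetical names). */
struct cpu_inputs {
    bool wakeup;            /* power-up operation or hardware event */
    bool cluster_up;        /* parent cluster is in CLUSTER_UP */
    bool shutdown_request;  /* policy decision to shut down or suspend */
    bool teardown_done;     /* local CPU teardown complete */
};

/* Return the state the CPU is in after evaluating its inputs once. */
static enum cpu_state cpu_step(enum cpu_state s, const struct cpu_inputs *in)
{
    switch (s) {
    case CPU_DOWN:
        return in->wakeup ? CPU_COMING_UP : CPU_DOWN;
    case CPU_COMING_UP:
        /* Wait here until the parent cluster has been set up. */
        return in->cluster_up ? CPU_UP : CPU_COMING_UP;
    case CPU_UP:
        return in->shutdown_request ? CPU_GOING_DOWN : CPU_UP;
    case CPU_GOING_DOWN:
        /* Spontaneous: depends on local progress only. */
        return in->teardown_done ? CPU_DOWN : CPU_GOING_DOWN;
    }
    return s;
}
```

Only the CPU_COMING_UP case depends on state owned by another agent (the
parent cluster); every other transition is driven by a local event or by
local progress.
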

Cluster state
-------------

A cluster is a group of connected CPUs with some common resources.
Because a cluster contains multiple CPUs, it can be doing multiple
things at the same time. This has some implications. In particular, a
CPU can start up while another CPU is tearing the cluster down.

In this discussion, the "outbound side" is the view of the cluster state
as seen by a CPU tearing the cluster down. The "inbound side" is the
view of the cluster state as seen by a CPU setting the cluster up.

In order to enable safe coordination in such situations, it is important
that a CPU which is setting up the cluster can advertise its state
independently of the CPU which is tearing down the cluster. For this
reason, the cluster state is split into two parts:

    "cluster" state: The global state of the cluster; or the state
        on the outbound side:

        CLUSTER_DOWN
        CLUSTER_UP
        CLUSTER_GOING_DOWN

    "inbound" state: The state of the cluster on the inbound side.

        INBOUND_NOT_COMING_UP
        INBOUND_COMING_UP


    The different pairings of these states result in six possible
    states for the cluster as a whole:

                                CLUSTER_UP
              +==========> INBOUND_NOT_COMING_UP -------------+
              #                                               |
                                                              |
         CLUSTER_UP     <----+                                |
      INBOUND_COMING_UP      |                                v

              ^             CLUSTER_GOING_DOWN       CLUSTER_GOING_DOWN
              #              INBOUND_COMING_UP <=== INBOUND_NOT_COMING_UP

        CLUSTER_DOWN         |                                |
      INBOUND_COMING_UP <----+                                |
                                                              |
              ^                                               |
              +===========     CLUSTER_DOWN      <------------+
                           INBOUND_NOT_COMING_UP

    Transitions -----> can only be made by the outbound CPU, and
    only involve changes to the "cluster" state.

    Transitions ===##> can only be made by the inbound CPU, and only
    involve changes to the "inbound" state, except where there is no
    further transition possible on the outbound side (i.e., the
    outbound CPU has put the cluster into the CLUSTER_DOWN state).

    The race avoidance algorithm does not provide a way to determine
    which exact CPUs within the cluster play these roles. This must
    be decided in advance by some other means. Refer to the section
    "Last man and first man selection" for more explanation.


    CLUSTER_DOWN/INBOUND_NOT_COMING_UP is the only state where the
    cluster can actually be powered down.

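The split state and the power-down rule can be sketched as follows. This
is an illustration under stated assumptions: struct cluster_sync,
cluster_can_power_down() and count_power_down_pairings() are
hypothetical names, and the enum values are arbitrary rather than the
constants used by the kernel.

```c
#include <stdbool.h>

enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };

/* The two halves are kept separate so each side can advertise its own. */
struct cluster_sync {
    enum cluster_state cluster;  /* outbound side; inbound after CLUSTER_DOWN */
    enum inbound_state inbound;  /* inbound side only */
};

/* Power may be removed only in CLUSTER_DOWN/INBOUND_NOT_COMING_UP. */
static bool cluster_can_power_down(const struct cluster_sync *c)
{
    return c->cluster == CLUSTER_DOWN &&
           c->inbound == INBOUND_NOT_COMING_UP;
}

/* Walk all six pairings and count those in which power-down is allowed. */
static int count_power_down_pairings(void)
{
    int cl, in, n = 0;

    for (cl = CLUSTER_DOWN; cl <= CLUSTER_GOING_DOWN; cl++) {
        for (in = INBOUND_NOT_COMING_UP; in <= INBOUND_COMING_UP; in++) {
            struct cluster_sync c = {
                (enum cluster_state)cl, (enum inbound_state)in
            };
            if (cluster_can_power_down(&c))
                n++;
        }
    }
    return n;
}
```

Three "cluster" values combined with two "inbound" values give the six
pairings shown in the diagram, and exactly one of them permits
power-down.
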
    The parallelism of the inbound and outbound CPUs is observed by
    the existence of two different paths from CLUSTER_GOING_DOWN/
    INBOUND_NOT_COMING_UP (corresponding to GOING_DOWN in the basic
    model) to CLUSTER_DOWN/INBOUND_COMING_UP (corresponding to
    COMING_UP in the basic model). The second path avoids cluster
    teardown completely.

    CLUSTER_UP/INBOUND_COMING_UP is equivalent to UP in the basic
    model. The final transition to CLUSTER_UP/INBOUND_NOT_COMING_UP
    is trivial and merely resets the state machine ready for the
    next cycle.

    Details of the allowable transitions follow.

    The next state in each case is notated

        <cluster state>/<inbound state> (<transitioner>)

    where the <transitioner> is the side on which the transition
    can occur; either the inbound or the outbound side.


CLUSTER_DOWN/INBOUND_NOT_COMING_UP:

    Next state:     CLUSTER_DOWN/INBOUND_COMING_UP (inbound)
    Conditions:     none

    Trigger events:

        a) an explicit hardware power-up operation, resulting
           from a policy decision on another CPU;

        b) a hardware event, such as an interrupt.


CLUSTER_DOWN/INBOUND_COMING_UP:

    In this state, an inbound CPU sets up the cluster, including
    enabling of hardware coherency at the cluster level and any
    other operations (such as cache invalidation) which are required
    in order to achieve this.

    The purpose of this state is to do sufficient cluster-level
    setup to enable other CPUs in the cluster to enter coherency
    safely.

    Next state:     CLUSTER_UP/INBOUND_COMING_UP (inbound)
    Conditions:     cluster-level setup and hardware coherency complete
    Trigger events: (spontaneous)


CLUSTER_UP/INBOUND_COMING_UP:

    Cluster-level setup is complete and hardware coherency is
    enabled for the cluster. Other CPUs in the cluster can safely
    enter coherency.

    This is a transient state, leading immediately to
    CLUSTER_UP/INBOUND_NOT_COMING_UP. All other CPUs on the cluster
    should treat these two states as equivalent.

    Next state:     CLUSTER_UP/INBOUND_NOT_COMING_UP (inbound)
    Conditions:     none
    Trigger events: (spontaneous)


CLUSTER_UP/INBOUND_NOT_COMING_UP:

    Cluster-level setup is complete and hardware coherency is
    enabled for the cluster. Other CPUs in the cluster can safely
    enter coherency.

    The cluster will remain in this state until a policy decision is
    made to power the cluster down.

    Next state:     CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP (outbound)
    Conditions:     none
    Trigger events: policy decision to power down the cluster


CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP:

    An outbound CPU is tearing the cluster down. The selected CPU
    must wait in this state until all CPUs in the cluster are in the
    CPU_DOWN state.

    When all CPUs are in the CPU_DOWN state, the cluster can be torn
    down, for example by cleaning data caches and exiting
    cluster-level coherency.

    To avoid wasteful unnecessary teardown operations, the outbound
    CPU should check the inbound cluster state for asynchronous
    transitions to INBOUND_COMING_UP. Alternatively, individual
    CPUs can be checked for entry into CPU_COMING_UP or CPU_UP.


    Next states:

    CLUSTER_DOWN/INBOUND_NOT_COMING_UP (outbound)
        Conditions:     cluster torn down and ready to power off
        Trigger events: (spontaneous)

    CLUSTER_GOING_DOWN/INBOUND_COMING_UP (inbound)
        Conditions:     none

        Trigger events:

            a) an explicit hardware power-up operation,
               resulting from a policy decision on another
               CPU;

            b) a hardware event, such as an interrupt.

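The checks made by the outbound CPU while it waits in
CLUSTER_GOING_DOWN can be sketched as a polling function. This is a
single-threaded illustration, not the kernel code: outbound_poll() and
enum outbound_action are hypothetical names, the sketch assumes that the
outbound CPU skips its own entry (it is itself still in CPU_GOING_DOWN),
and it always chooses to abort when it sees a CPU coming up. Cache
maintenance and the actual waiting are left out.

```c
#include <stdbool.h>
#include <stddef.h>

enum cpu_state { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };

enum outbound_action {
    OUTBOUND_WAIT,      /* some other CPU is still in CPU_GOING_DOWN */
    OUTBOUND_TEARDOWN,  /* every other CPU has reached CPU_DOWN */
    OUTBOUND_ABORT      /* the inbound side or another CPU is coming up */
};

/* Decide what the outbound CPU `self` should do given a state snapshot. */
static enum outbound_action
outbound_poll(enum inbound_state inbound, const enum cpu_state *cpus,
              size_t ncpus, size_t self)
{
    bool waiting = false;
    size_t i;

    /* An inbound CPU has already started setting the cluster up again. */
    if (inbound == INBOUND_COMING_UP)
        return OUTBOUND_ABORT;

    for (i = 0; i < ncpus; i++) {
        if (i == self)
            continue;
        switch (cpus[i]) {
        case CPU_DOWN:
            break;
        case CPU_GOING_DOWN:
            waiting = true;     /* local teardown not finished yet */
            break;
        default:                /* CPU_COMING_UP or CPU_UP */
            return OUTBOUND_ABORT;
        }
    }
    return waiting ? OUTBOUND_WAIT : OUTBOUND_TEARDOWN;
}
```

A caller would repeat the poll while it returns OUTBOUND_WAIT, tear the
cluster down on OUTBOUND_TEARDOWN, and restore CLUSTER_UP on
OUTBOUND_ABORT.
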

CLUSTER_GOING_DOWN/INBOUND_COMING_UP:

    The cluster is (or was) being torn down, but another CPU has
    come online in the meantime and is trying to set up the cluster
    again.

    If the outbound CPU observes this state, it has two choices:

        a) back out of teardown, restoring the cluster to the
           CLUSTER_UP state;

        b) finish tearing the cluster down and put the cluster
           in the CLUSTER_DOWN state; the inbound CPU will
           set up the cluster again from there.

    Choice (a) permits the removal of some latency by avoiding
    unnecessary teardown and setup operations in situations where
    the cluster is not really going to be powered down.


    Next states:

    CLUSTER_UP/INBOUND_COMING_UP (outbound)
        Conditions:     cluster-level setup and hardware
                        coherency complete
        Trigger events: (spontaneous)

    CLUSTER_DOWN/INBOUND_COMING_UP (outbound)
        Conditions:     cluster torn down and ready to power off
        Trigger events: (spontaneous)

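The allowable cluster-level transitions listed in this section can be
collected into one table. The sketch below is for illustration only:
struct pairing, struct transition, enum side and transition_allowed()
are hypothetical names, and the table simply transcribes the "Next
state" entries given above.

```c
#include <stdbool.h>
#include <stddef.h>

enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };
enum side { INBOUND_SIDE, OUTBOUND_SIDE };

struct pairing {
    enum cluster_state cluster;
    enum inbound_state inbound;
};

struct transition {
    struct pairing from, to;
    enum side who;              /* the side allowed to make the transition */
};

/* One entry per "Next state" line in the text above. */
static const struct transition transitions[] = {
    { { CLUSTER_DOWN, INBOUND_NOT_COMING_UP },
      { CLUSTER_DOWN, INBOUND_COMING_UP }, INBOUND_SIDE },
    { { CLUSTER_DOWN, INBOUND_COMING_UP },
      { CLUSTER_UP, INBOUND_COMING_UP }, INBOUND_SIDE },
    { { CLUSTER_UP, INBOUND_COMING_UP },
      { CLUSTER_UP, INBOUND_NOT_COMING_UP }, INBOUND_SIDE },
    { { CLUSTER_UP, INBOUND_NOT_COMING_UP },
      { CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP }, OUTBOUND_SIDE },
    { { CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP },
      { CLUSTER_DOWN, INBOUND_NOT_COMING_UP }, OUTBOUND_SIDE },
    { { CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP },
      { CLUSTER_GOING_DOWN, INBOUND_COMING_UP }, INBOUND_SIDE },
    { { CLUSTER_GOING_DOWN, INBOUND_COMING_UP },
      { CLUSTER_UP, INBOUND_COMING_UP }, OUTBOUND_SIDE },
    { { CLUSTER_GOING_DOWN, INBOUND_COMING_UP },
      { CLUSTER_DOWN, INBOUND_COMING_UP }, OUTBOUND_SIDE },
};

/* Return true if side `who` may move the cluster from `from` to `to`. */
static bool transition_allowed(struct pairing from, struct pairing to,
                               enum side who)
{
    size_t i;

    for (i = 0; i < sizeof(transitions) / sizeof(transitions[0]); i++) {
        const struct transition *t = &transitions[i];

        if (t->who == who &&
            t->from.cluster == from.cluster &&
            t->from.inbound == from.inbound &&
            t->to.cluster == to.cluster &&
            t->to.inbound == to.inbound)
            return true;
    }
    return false;
}
```

In this table, the only inbound transition that changes the "cluster"
half starts from CLUSTER_DOWN, which matches the exception noted under
the state diagram.
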

Last man and First man selection
--------------------------------

The CPU which performs cluster tear-down operations on the outbound side
is commonly referred to as the "last man".

The CPU which performs cluster setup on the inbound side is commonly
referred to as the "first man".

The race avoidance algorithm documented above does not provide a
mechanism to choose which CPUs should play these roles.


Last man:

When shutting down the cluster, all the CPUs involved are initially
executing Linux and hence coherent. Therefore, ordinary spinlocks can
be used to select a last man safely, before the CPUs become
non-coherent.

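One possible way to select the last man under a lock is to track which
CPUs of the cluster are still in use. The sketch below is an assumption
made for illustration, not a description of any particular platform
back-end: struct cluster_use, MAX_CPUS_PER_CLUSTER as used here and
cpu_release_is_last_man() are hypothetical, and the spinlock itself is
not shown (the caller is assumed to hold it).

```c
#include <stdbool.h>

#define MAX_CPUS_PER_CLUSTER 4  /* assumption for this sketch */

struct cluster_use {
    /* Non-zero while the CPU is in use; protected by the caller's lock. */
    int use_count[MAX_CPUS_PER_CLUSTER];
};

/*
 * Called, with the lock held, by a CPU that is about to go down.
 * Marks the caller as unused and returns true if no other CPU in the
 * cluster is still in use, i.e. the caller is the last man.
 */
static bool cpu_release_is_last_man(struct cluster_use *u, unsigned int cpu)
{
    unsigned int i;

    u->use_count[cpu] = 0;
    for (i = 0; i < MAX_CPUS_PER_CLUSTER; i++)
        if (u->use_count[i])
            return false;
    return true;
}
```

Because the decision is made while the CPUs are still coherent, the
lock guarantees that at most one CPU sees an all-zero table and takes
the last man role.
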


First man:

Because CPUs may power up asynchronously in response to external wake-up
events, a dynamic mechanism is needed to make sure that only one CPU
attempts to play the first man role and do the cluster-level
initialisation: any other CPUs must wait for this to complete before
proceeding.

Cluster-level initialisation may involve actions such as configuring
coherency controls in the bus fabric.

The current implementation in mcpm_head.S uses a separate mutual exclusion
mechanism to do this arbitration. This mechanism is documented in
detail in vlocks.txt.


Features and Limitations
------------------------

Implementation:

    The current ARM-based implementation is split between
    arch/arm/common/mcpm_head.S (low-level inbound CPU operations) and
    arch/arm/common/mcpm_entry.c (everything else):

    __mcpm_cpu_going_down() signals the transition of a CPU to the
        CPU_GOING_DOWN state.

    __mcpm_cpu_down() signals the transition of a CPU to the CPU_DOWN
        state.

    A CPU transitions to CPU_COMING_UP and then to CPU_UP via the
        low-level power-up code in mcpm_head.S. This could
        involve CPU-specific setup code, but in the current
        implementation it does not.

    __mcpm_outbound_enter_critical() and __mcpm_outbound_leave_critical()
        handle transitions from CLUSTER_UP to CLUSTER_GOING_DOWN
        and from there to CLUSTER_DOWN or back to CLUSTER_UP (in
        the case of an aborted cluster power-down).

        These functions are more complex than the __mcpm_cpu_*()
        functions due to the extra inter-CPU coordination which
        is needed for safe transitions at the cluster level.

    A cluster transitions from CLUSTER_DOWN back to CLUSTER_UP via
        the low-level power-up code in mcpm_head.S. This
        typically involves platform-specific setup code,
        provided by the platform-specific power_up_setup
        function registered via mcpm_sync_init.

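The order in which a power-down path might use these functions can be
sketched with stand-ins that only record each call. Everything here is
hypothetical: the sim_*() functions are not the kernel functions, they
take simplified arguments, and the sequence is an assumption derived
from the descriptions above (in particular, the sketch assumes that a
failed attempt to enter the critical section has already restored
CLUSTER_UP, so the caller falls back to CPU-only teardown).

```c
#include <stdbool.h>
#include <string.h>

static char trace[128];         /* records the order of the steps taken */

static void log_step(const char *s)
{
    strcat(trace, s);
    strcat(trace, ";");
}

/* Stand-ins for the functions described above; they only log the call. */
static void sim_cpu_going_down(void) { log_step("cpu_going_down"); }
static void sim_cpu_down(void) { log_step("cpu_down"); }
static void sim_outbound_leave_critical(void) { log_step("leave_critical"); }

static bool sim_outbound_enter_critical(bool inbound_coming_up)
{
    log_step("enter_critical");
    return !inbound_coming_up;  /* false means the power-down was aborted */
}

/* Power-down path for one CPU; last_man is decided under a lock. */
static void sim_power_down(bool last_man, bool inbound_coming_up)
{
    trace[0] = '\0';
    sim_cpu_going_down();
    if (last_man && sim_outbound_enter_critical(inbound_coming_up)) {
        log_step("cluster_teardown");
        sim_outbound_leave_critical();  /* cluster is now CLUSTER_DOWN */
    } else {
        log_step("cpu_teardown");
    }
    sim_cpu_down();
}
```

For example, sim_power_down(true, false) records the full cluster
teardown sequence, while sim_power_down(false, false) records only the
CPU-level steps.
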

Deep topologies:

    As currently described and implemented, the algorithm does not
    support CPU topologies involving more than two levels (i.e.,
    clusters of clusters are not supported). The algorithm could be
    extended by replicating the cluster-level states for the
    additional topological levels, and modifying the transition
    rules for the intermediate (non-outermost) cluster levels.


Colophon
--------

Originally created and documented by Dave Martin for Linaro Limited, in
collaboration with Nicolas Pitre and Achin Gupta.

Copyright (C) 2012-2013 Linaro Limited
Distributed under the terms of Version 2 of the GNU General Public
License, as defined in linux/COPYING.