Automatically bind swap device to NUMA node
-------------------------------------------

If the system has more than one swap device and the swap devices have node
information, we can use that information in get_swap_pages() to decide which
swap device to use, for better performance.

How to use this feature
-----------------------

Each swap device has a priority, which decides the order in which the
devices are used. To make use of automatic binding, there is no need to
manipulate the priority settings of the swap devices. For example, on a
2-node machine, assume two swap devices, swapA and swapB, are going to be
swapped on, with swapA attached to node 0 and swapB attached to node 1.
Simply swap them on:

	# swapon /dev/swapA
	# swapon /dev/swapB

Then node 0 will use the two swap devices in the order swapA, then swapB,
and node 1 will use them in the order swapB, then swapA. Note that the order
in which the devices are swapped on doesn't matter.

A more complex example on a 4-node machine: assume 6 swap devices are going
to be swapped on. swapA and swapB are attached to node 0, swapC is attached
to node 1, swapD and swapE are attached to node 2 and swapF is attached to
node 3. The way to swap them on is the same as above:

	# swapon /dev/swapA
	# swapon /dev/swapB
	# swapon /dev/swapC
	# swapon /dev/swapD
	# swapon /dev/swapE
	# swapon /dev/swapF

Then node 0 will use them in the order of:

	swapA/swapB -> swapC -> swapD -> swapE -> swapF

swapA and swapB will be used in a round robin mode before any other swap
device.

node 1 will use them in the order of:

	swapC -> swapA -> swapB -> swapD -> swapE -> swapF

node 2 will use them in the order of:

	swapD/swapE -> swapA -> swapB -> swapC -> swapF

Similarly, swapD and swapE will be used in a round robin mode before any
other swap devices.

node 3 will use them in the order of:

	swapF -> swapA -> swapB -> swapC -> swapD -> swapE

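The per-node orderings in this example can be reproduced with a small model.
The Python sketch below is illustrative only and is not kernel code. It
assumes that automatic priorities are handed out in swapon order starting
from -2, and that a device is seen at priority -1 on the node it is attached
to. The name per_node_order and its data layout are hypothetical.

```python
from itertools import groupby


def per_node_order(devices, num_nodes):
    """Model the order in which each node uses the swap devices.

    devices: list of (name, node) tuples in swapon order, all using
    automatic (negative) priorities.
    Returns {node: [group, ...]}; each group is a list of device names
    that share a priority and are therefore used round robin.
    """
    # Assumption: automatic priorities are -2, -3, ... in swapon order.
    auto_prio = {name: -2 - i for i, (name, _) in enumerate(devices)}

    order = {}
    for node in range(num_nodes):
        # A device is promoted to -1 on the node it is attached to.
        seen = [(-1 if dev_node == node else auto_prio[name], name)
                for name, dev_node in devices]
        # Highest priority first; the sort is stable, so devices with
        # equal priority keep their swapon order.
        seen.sort(key=lambda item: -item[0])
        order[node] = [[name for _, name in group]
                       for _, group in groupby(seen, key=lambda item: item[0])]
    return order


devices = [("swapA", 0), ("swapB", 0), ("swapC", 1),
           ("swapD", 2), ("swapE", 2), ("swapF", 3)]
for node, groups in per_node_order(devices, 4).items():
    print(f"node {node}:", " -> ".join("/".join(g) for g in groups))
```

Under these assumptions the script prints the same four orderings as listed
above, with swapA/swapB and swapD/swapE grouped on nodes 0 and 2.
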
Implementation details
----------------------

The current code uses a priority based list, swap_avail_list, to decide
which swap device to use; if multiple swap devices share the same priority,
they are used round robin. This change replaces the single global
swap_avail_list with a per-NUMA-node list, i.e. each NUMA node sees its own
priority based list of available swap devices. A swap device's priority can
be promoted on its matching node's swap_avail_list.

Currently, a swap device's priority is set as follows: the user can set a
value >= 0, or the system will pick one, starting from -1 and going
downwards. The priority value in the swap_avail_list is the negated value of
the swap device's priority, because a plist is sorted from low to high. The
new policy doesn't change the semantics for priorities >= 0. The automatic
priorities that previously started from -1 and went downwards now start from
-2 and go downwards, and -1 is reserved as the promoted value. So if multiple
swap devices are attached to the same node, they will all be promoted to
priority -1 on that node's plist and will be used round robin before any
other swap devices.
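
The priority rules described here can be summarised in a few lines. The
following Python sketch is a model under stated assumptions, not the kernel
implementation. PriorityAllocator and plist_key are hypothetical names, and
the sketch assumes that promotion applies only to automatic (negative)
priorities, which is one reading of "doesn't change the semantics for
priorities >= 0".

```python
PROMOTED_PRIO = -1  # reserved for a device on its matching node


class PriorityAllocator:
    """Hands out swap device priorities as described in the text."""

    def __init__(self):
        # Starting point chosen so that the first automatic value is -2.
        self.least_priority = -1

    def assign(self, user_prio=None):
        # A user-supplied value >= 0 is used unchanged.
        if user_prio is not None and user_prio >= 0:
            return user_prio
        # Otherwise pick the next automatic value: -2, -3, ...
        self.least_priority -= 1
        return self.least_priority


def plist_key(prio, dev_node, node):
    """Key of a device on one node's available list (lower sorts first)."""
    # Assumption: only automatic priorities are promoted.
    if prio < 0 and dev_node == node:
        prio = PROMOTED_PRIO
    # Negated because the list is sorted from low to high.
    return -prio


alloc = PriorityAllocator()
swap_a = alloc.assign()    # automatic: -2
swap_b = alloc.assign(5)   # user-set:   5
swap_c = alloc.assign()    # automatic: -3

print(plist_key(swap_a, dev_node=0, node=0))  # 1: promoted on its own node
print(plist_key(swap_a, dev_node=0, node=1))  # 2: plain negated priority
print(plist_key(swap_b, dev_node=1, node=1))  # -5: user priority unchanged
```

In this model a device with a user-set priority keeps the same key on every
node, so it is still used ahead of all automatically prioritised devices,
promoted or not.
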