vfio-ccw: the basic infrastructure
==================================

Introduction
------------

Here we describe the vfio support for I/O subchannel devices for
Linux/s390. The motivation for vfio-ccw is to pass subchannels through
to a virtual machine; vfio is the means.

Unlike other hardware architectures, s390 defines a unified I/O access
method called Channel I/O. It has its own access patterns:
- Channel programs run asynchronously on a separate (co)processor.
- The channel subsystem will access any memory designated by the caller
  in the channel program directly, i.e. there is no iommu involved.
Thus when we introduce vfio support for these devices, we realize it
with a mediated device (mdev) implementation. The vfio mdev will be
added to an iommu group so that it can be managed by the vfio
framework. We add read/write callbacks for special vfio I/O regions to
pass the channel programs from the mdev to its parent device (the real
I/O subchannel device), which does further address translation and
performs the I/O instructions.

This document does not intend to explain the s390 I/O architecture in
every detail. More information can be found here:
- A good starting point for Channel I/O in general:
  https://en.wikipedia.org/wiki/Channel_I/O
- s390 architecture:
  s390 Principles of Operation manual (IBM Form. No. SA22-7832)
- The existing Qemu code, which implements a simple emulated channel
  subsystem, is also a good reference. It makes it easier to follow
  the flow.
  qemu/hw/s390x/css.c

For the vfio mediated device framework:
- Documentation/vfio-mediated-device.txt

Motivation of vfio-ccw
----------------------

Currently, a guest virtualized via qemu/kvm on s390 only sees
paravirtualized virtio devices via the "Virtio Over Channel I/O
(virtio-ccw)" transport. This makes virtio devices discoverable via
standard operating system algorithms for handling channel devices.

However, this is not enough. On s390, the majority of devices use the
standard Channel I/O based mechanism, and we also need to be able to
pass them through to a Qemu virtual machine.
This includes devices that don't have a virtio counterpart (e.g. tape
drives) or that have specific characteristics which guests want to
exploit.

For passing a device to a guest, we want to use the same interface as
everybody else, namely vfio. Thus, we would like to introduce vfio
support for channel devices, and we would like to name this new vfio
device "vfio-ccw".

Access patterns of CCW devices
------------------------------

The s390 architecture implements a so-called channel subsystem that
provides a unified view of the devices physically attached to the
system. Although the s390 hardware platform knows about a huge variety
of different peripheral attachments like disk devices (aka. DASDs),
tapes, communication controllers, etc., they can all be accessed by a
well-defined access method, and they all present I/O completion in a
unified way: I/O interruptions.

All I/O requires the use of channel command words (CCWs). A CCW is an
instruction to a specialized I/O channel processor. A channel program is
a sequence of CCWs which are executed by the I/O channel subsystem. To
issue a channel program to the channel subsystem, it is required to
build an operation request block (ORB), which specifies the format of
the CCWs and other control information to the system. The operating
system signals the I/O channel subsystem to begin executing the channel
program with a SSCH (start sub-channel) instruction. The central
processor is then free to proceed with non-I/O instructions until
interrupted. The I/O completion result is received by the interrupt
handler in the form of an interrupt response block (IRB).

Back to vfio-ccw, in short:
- ORBs and channel programs are built in the guest kernel (with guest
  physical addresses).
- ORBs and channel programs are passed to the host kernel.
- The host kernel translates the guest physical addresses to real
  addresses and starts the I/O by issuing a privileged Channel I/O
  instruction (e.g. SSCH).
- Channel programs run asynchronously on a separate processor.
- I/O completion is signaled to the host with I/O interruptions.
  The result is copied as an IRB to user space, which passes it back
  to the guest.

Physical vfio ccw device and its child mdev
-------------------------------------------

As mentioned above, we realize vfio-ccw with a mdev implementation.

Channel I/O does not have IOMMU hardware support, so the physical
vfio-ccw device does not have an IOMMU level translation or isolation.

Sub-channel I/O instructions are all privileged instructions. When
handling the I/O instruction interception, vfio-ccw polices and
translates the channel program in software before it gets sent to
hardware.

Within this implementation, we have two drivers for two types of
devices:
- The vfio_ccw driver for the physical subchannel device.
  This is an I/O subchannel driver for the real subchannel device. It
  realizes a group of callbacks and registers to the mdev framework as a
  parent (physical) device. As a consequence, mdev provides vfio_ccw a
  generic interface (sysfs) to create mdev devices. A vfio mdev can then
  be created by vfio_ccw and added to the mediated bus. It is the vfio
  device that is added to an IOMMU group and a vfio group.
  vfio_ccw also provides an I/O region to accept channel program
  requests from user space and to store I/O interrupt results for user
  space to retrieve. To notify user space of an I/O completion, it
  offers an interface to set up an eventfd for asynchronous signaling.

- The vfio_mdev driver for the mediated vfio ccw device.
  This is provided by the mdev framework. It is a vfio device driver for
  the mdev that is created by vfio_ccw.
  It realizes a group of vfio device driver callbacks, adds itself to a
  vfio group, and registers itself to the mdev framework as a mdev
  driver.
  It uses a vfio iommu backend that uses the existing map and unmap
  ioctls, but rather than programming them into an IOMMU for a device,
  it simply stores the translations for use by later requests. This
  means that a device programmed in a VM with guest physical addresses
  can have the vfio kernel convert that address to process virtual
  address, pin the page and program the hardware with the host physical
  address in one step.
  For a mdev, the vfio iommu backend will not pin the pages during the
  VFIO_IOMMU_MAP_DMA ioctl. The mdev framework will only maintain a
  database of the iova<->vaddr mappings in this operation. The vfio
  iommu backend exports the vfio_pin_pages and vfio_unpin_pages
  interfaces for the physical devices to pin and unpin pages on demand.
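
The "store the translations now, translate and pin later" idea can be
illustrated with a small user space model. This is only a sketch under
stated assumptions: struct dma_db, dma_db_map() and dma_db_translate()
are hypothetical names, the fixed-size array is a stand-in for whatever
data structure the vfio iommu backend really uses, and page pinning is
not modeled at all.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_MAPPINGS 16 /* arbitrary limit for this sketch */

/* One iova<->vaddr mapping, as recorded at map time. */
struct dma_mapping {
	uint64_t iova;  /* guest physical (I/O virtual) start address */
	uint64_t vaddr; /* process virtual start address */
	uint64_t size;  /* length of the mapping in bytes */
};

/* Hypothetical mapping database; nothing is pinned here. */
struct dma_db {
	struct dma_mapping map[MAX_MAPPINGS];
	size_t count;
};

/* Models the map operation: only remember the mapping. */
static int dma_db_map(struct dma_db *db, uint64_t iova, uint64_t vaddr,
		      uint64_t size)
{
	if (db->count == MAX_MAPPINGS || size == 0)
		return -1;
	db->map[db->count++] = (struct dma_mapping){ iova, vaddr, size };
	return 0;
}

/* Models the later lookup done on demand for a single address. */
static int dma_db_translate(const struct dma_db *db, uint64_t iova,
			    uint64_t *vaddr)
{
	for (size_t i = 0; i < db->count; i++) {
		const struct dma_mapping *m = &db->map[i];

		if (iova >= m->iova && iova - m->iova < m->size) {
			*vaddr = m->vaddr + (iova - m->iova);
			return 0;
		}
	}
	return -1; /* address was never mapped */
}
```

In this model, the lookup happens only when a channel program actually
references an address, which mirrors how pinning is deferred until the
physical device driver asks for it.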

Below is a high-level block diagram.

 +-------------+
 |             |
 | +---------+ | mdev_register_driver() +--------------+
 | |  Mdev   | +<-----------------------+              |
 | |  bus    | |                        | vfio_mdev.ko |
 | | driver  | +----------------------->+              |<-> VFIO user
 | +---------+ |    probe()/remove()    +--------------+    APIs
 |             |
 |  MDEV CORE  |
 |   MODULE    |
 |   mdev.ko   |
 | +---------+ | mdev_register_device() +--------------+
 | |Physical | +<-----------------------+              |
 | | device  | |                        |  vfio_ccw.ko |<-> subchannel
 | |interface| +----------------------->+              |     device
 | +---------+ |       callback         +--------------+
 +-------------+

The process of how these work together:
1. vfio_ccw.ko drives the physical I/O subchannel, and registers the
   physical device (with callbacks) to the mdev framework.
   When vfio_ccw probes the subchannel device, it registers the device
   pointer and callbacks to the mdev framework. Mdev-related file nodes
   under the device node in sysfs are then created for the subchannel
   device, namely 'mdev_create', 'mdev_destroy' and
   'mdev_supported_types'.
2. Create a mediated vfio ccw device.
   Using the 'mdev_create' sysfs file, we need to manually create one
   (and only one for our case) mediated device.
3. vfio_mdev.ko drives the mediated ccw device.
   vfio_mdev is also the vfio device driver. It will probe the mdev and
   add it to an iommu_group and a vfio_group. Then we can pass the mdev
   through to a guest.

vfio-ccw I/O region
-------------------

An I/O region is used to accept channel program requests from user
space and to store I/O interrupt results for user space to retrieve. The
definition of the region is:

struct ccw_io_region {
#define ORB_AREA_SIZE 12
	__u8	orb_area[ORB_AREA_SIZE];
#define SCSW_AREA_SIZE 12
	__u8	scsw_area[SCSW_AREA_SIZE];
#define IRB_AREA_SIZE 96
	__u8	irb_area[IRB_AREA_SIZE];
	__u32	ret_code;
} __packed;

While starting an I/O request, orb_area should be filled with the
guest ORB, and scsw_area should be filled with the SCSW of the Virtual
Subchannel.

irb_area stores the I/O result.

ret_code stores a return code for each access of the region.
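
As an illustration of the layout, the following user space sketch
mirrors the region definition with fixed-width C types and prepares it
for a request. The struct is copied from the definition above, with
__u8/__u32 and __packed replaced by their user space equivalents;
region_prepare() is a hypothetical helper, not part of any kernel or
Qemu interface.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ORB_AREA_SIZE  12
#define SCSW_AREA_SIZE 12
#define IRB_AREA_SIZE  96

/* User space mirror of the region layout described above. */
struct ccw_io_region {
	uint8_t  orb_area[ORB_AREA_SIZE];
	uint8_t  scsw_area[SCSW_AREA_SIZE];
	uint8_t  irb_area[IRB_AREA_SIZE];
	uint32_t ret_code;
} __attribute__((packed));

/*
 * Hypothetical helper: clear the region, then copy in the guest ORB and
 * the SCSW of the virtual subchannel. irb_area and ret_code are left
 * zeroed, since they are written by the kernel side.
 */
static void region_prepare(struct ccw_io_region *r,
			   const uint8_t orb[ORB_AREA_SIZE],
			   const uint8_t scsw[SCSW_AREA_SIZE])
{
	memset(r, 0, sizeof(*r));
	memcpy(r->orb_area, orb, ORB_AREA_SIZE);
	memcpy(r->scsw_area, scsw, SCSW_AREA_SIZE);
}
```

With this layout the region is 124 bytes long, irb_area starts at byte
offset 24 and ret_code at byte offset 120.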

vfio-ccw patches overview
-------------------------

For now, our patches are rebased on the latest mdev implementation.
vfio-ccw follows what vfio-pci did on the s390 platform and uses
vfio-iommu-type1 as the vfio iommu backend. It's a good start to launch
the code review for vfio-ccw. Note that the implementation is far from
complete yet; but we'd like to get feedback for the general
architecture.

* CCW translation APIs
- Description:
  These introduce a group of APIs (starting with 'cp_') to do CCW
  translation. The CCWs passed in by a user space program are
  organized with their guest physical memory addresses. These APIs
  will copy the CCWs into kernel space, and assemble a runnable
  kernel channel program by updating the guest physical addresses with
  their corresponding host physical addresses.
- Patches:
  vfio: ccw: introduce channel program interfaces

* vfio_ccw device driver
- Description:
  The following patches utilize the CCW translation APIs and introduce
  vfio_ccw, which is the driver for the I/O subchannel devices you want
  to pass through.
  vfio_ccw implements the following vfio ioctls:
    VFIO_DEVICE_GET_INFO
    VFIO_DEVICE_GET_IRQ_INFO
    VFIO_DEVICE_GET_REGION_INFO
    VFIO_DEVICE_RESET
    VFIO_DEVICE_SET_IRQS
  This provides an I/O region, so that the user space program can pass a
  channel program to the kernel, to do further CCW translation before
  issuing it to a real device.
  This also provides the VFIO_DEVICE_SET_IRQS ioctl to set up an event
  notifier that notifies the user space program of the I/O completion
  in an asynchronous way.
- Patches:
  vfio: ccw: basic implementation for vfio_ccw driver
  vfio: ccw: introduce ccw_io_region
  vfio: ccw: realize VFIO_DEVICE_GET_REGION_INFO ioctl
  vfio: ccw: realize VFIO_DEVICE_RESET ioctl
  vfio: ccw: realize VFIO_DEVICE_G(S)ET_IRQ_INFO ioctls

The user of vfio-ccw is not limited to Qemu, but Qemu is definitely a
good example for understanding how these patches work. Here is a little
more detail on how an I/O request triggered by the Qemu guest is
handled (without error handling).

Explanation:
Q1-Q7: Qemu side process.
K1-K6: Kernel side process.

Q1. Get I/O region info during initialization.
Q2. Set up event notifier and handler to handle I/O completion.

... ...

Q3. Intercept a ssch instruction.
Q4. Write the guest channel program and ORB to the I/O region.
    K1. Copy from guest to kernel.
    K2. Translate the guest channel program to a host kernel space
        channel program, which becomes runnable for a real device.
    K3. With the necessary information contained in the orb passed in
        by Qemu, issue the ccwchain to the device.
    K4. Return the ssch CC code.
Q5. Return the CC code to the guest.

... ...

    K5. Interrupt handler gets the I/O result and writes the result to
        the I/O region.
    K6. Signal Qemu to retrieve the result.
Q6. Get the signal; the event handler reads out the result from the I/O
    region.
Q7. Update the irb for the guest.
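
The user space half of this flow (Q4 and Q6) can be sketched with plain
file I/O on the vfio device file descriptor. This is a rough sketch,
not the Qemu implementation: region_submit() and region_wait_result()
are hypothetical helpers, the region is treated as an opaque buffer
whose size and ret_code offset are derived from the region definition
in this document, the region offset is assumed to come from
VFIO_DEVICE_GET_REGION_INFO (Q1), and the eventfd is assumed to have
been registered via VFIO_DEVICE_SET_IRQS (Q2).

```c
#include <stdint.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Sizes derived from struct ccw_io_region: ORB + SCSW + IRB + ret_code. */
#define CCW_IO_REGION_SIZE (12 + 12 + 96 + 4)
#define RET_CODE_OFFSET    (12 + 12 + 96)

/*
 * Q4: write the prepared region (ORB and SCSW filled in) to the device
 * file descriptor at the region offset. Returns 0 on success.
 */
static int region_submit(int fd, off_t region_off,
			 const uint8_t region[CCW_IO_REGION_SIZE])
{
	ssize_t n = pwrite(fd, region, CCW_IO_REGION_SIZE, region_off);

	return n == CCW_IO_REGION_SIZE ? 0 : -1;
}

/*
 * Q6: block until the eventfd is signaled (K6), then read the region
 * back and extract ret_code. The IRB is left in the buffer at byte
 * offset 24 for the caller to pass on to the guest (Q7).
 */
static int region_wait_result(int fd, off_t region_off, int efd,
			      uint8_t region[CCW_IO_REGION_SIZE],
			      uint32_t *ret_code)
{
	uint64_t events;

	if (read(efd, &events, sizeof(events)) != (ssize_t)sizeof(events))
		return -1;
	if (pread(fd, region, CCW_IO_REGION_SIZE, region_off) !=
	    CCW_IO_REGION_SIZE)
		return -1;
	memcpy(ret_code, region + RET_CODE_OFFSET, sizeof(*ret_code));
	return 0;
}
```

A real user such as Qemu would normally not block in read() but hand the
eventfd to its event loop; the blocking call is used here only to keep
the sketch short.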

Limitations
-----------

The current vfio-ccw implementation focuses on supporting only the basic
commands needed to implement block device functionality (read/write) of
DASD/ECKD devices. Some commands may need special handling in the
future, for example, anything related to path grouping.

DASD is a kind of storage device, while ECKD is a data recording format.
More information on DASD and ECKD can be found here:
https://en.wikipedia.org/wiki/Direct-access_storage_device
https://en.wikipedia.org/wiki/Count_key_data

Together with the corresponding work in Qemu, we can now bring the
passed-through DASD/ECKD device online in a guest and use it as a block
device.

Reference
---------

1. ESA/s390 Principles of Operation manual (IBM Form. No. SA22-7832)
2. ESA/390 Common I/O Device Commands manual (IBM Form. No. SA22-7204)
3. https://en.wikipedia.org/wiki/Channel_I/O
4. Documentation/s390/cds.txt
5. Documentation/vfio.txt
6. Documentation/vfio-mediated-device.txt
  259. 6. Documentation/vfio-mediated-device.txt