|
- Wei Yang <weiyang@linux.vnet.ibm.com>
- Benjamin Herrenschmidt <benh@au1.ibm.com>
- Bjorn Helgaas <bhelgaas@google.com>
- 26 Aug 2014
- This document describes the hardware requirements for PCI MMIO resource
- sizing and assignment on PowerKVM and how the generic PCI code handles
- them. The first two sections describe the concepts of Partitionable
- Endpoints and the implementation on P8 (IODA2). The next two sections
- discuss considerations for enabling SR-IOV on IODA2.
- 1. Introduction to Partitionable Endpoints
- A Partitionable Endpoint (PE) is a way to group the various resources
- associated with a device or a set of devices to provide isolation between
- partitions (i.e., filtering of DMA, MSIs etc.) and to provide a mechanism
- to freeze a device that is causing errors in order to limit the possibility
- of propagation of bad data.
- There is thus, in HW, a table of PE states that contains a pair of "frozen"
- state bits (one for MMIO and one for DMA, they get set together but can be
- cleared independently) for each PE.
- When a PE is frozen, all stores in any direction are dropped and all loads
- return all 1's. MSIs are also blocked. There is a bit more state that
- captures things like the details of the error that caused the freeze, but
- that is not critical here.
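The frozen-state behaviour described above can be modelled in a few lines of Python. This is only a toy model, not the hardware interface: the names `PEState`, `mmio_load` and `mmio_store` are invented for illustration, and a 32-bit access width is assumed.

```python
class PEState:
    """Toy model of one entry in the PE state table (hypothetical names)."""

    def __init__(self):
        self.mmio_frozen = False
        self.dma_frozen = False

    def freeze(self):
        # Both frozen bits are set together when an error is detected.
        self.mmio_frozen = True
        self.dma_frozen = True

    def clear_mmio(self):
        # The two bits can be cleared independently of each other.
        self.mmio_frozen = False

    def clear_dma(self):
        self.dma_frozen = False


ALL_ONES_32 = 0xFFFF_FFFF  # assumed 32-bit access width


def mmio_load(pe, backing, addr):
    """Return the register value, or all 1's if MMIO is frozen."""
    if pe.mmio_frozen:
        return ALL_ONES_32
    return backing.get(addr, 0)


def mmio_store(pe, backing, addr, value):
    """Perform the store, or silently drop it if MMIO is frozen."""
    if not pe.mmio_frozen:
        backing[addr] = value
```

Here `backing` is just a dict standing in for device registers; a driver that reads all 1's from a register that cannot legitimately hold that value has a hint that its PE may be frozen.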
- The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
- are matched to their corresponding PEs.
- The following section provides a rough description of what we have on P8
- (IODA2). Keep in mind that this is all per PHB (PCI host bridge). Each PHB
- is a completely separate HW entity that replicates the entire logic, so has
- its own set of PEs, etc.
- 2. Implementation of Partitionable Endpoints on P8 (IODA2)
- P8 supports up to 256 Partitionable Endpoints per PHB.
- * Inbound
- For DMA, MSIs and inbound PCIe error messages, we have a table (in
- memory but accessed in HW by the chip) that provides a direct
- correspondence between a PCIe RID (bus/dev/fn) and a PE number.
- We call this the RTT.
- - For DMA we then provide an entire address space for each PE that can
- contain two "windows", depending on the value of PCI address bit 59.
- Each window can be configured to be remapped via a "TCE table" (IOMMU
- translation table), which has various configurable characteristics
- not described here.
- - For MSIs, we have two windows in the address space (one at the top of
- the 32-bit space and one much higher) which, via a combination of the
- address and MSI value, will result in one of the 2048 interrupts per
- bridge being triggered. There's a PE# in the interrupt controller
- descriptor table as well which is compared with the PE# obtained from
- the RTT to "authorize" the device to emit that specific interrupt.
- - Error messages just use the RTT.
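As a rough sketch of the inbound path, the following Python models the RTT as a mapping from RID to PE# and shows the MSI "authorization" comparison. The real RTT layout and interrupt descriptor format are not described in this document, so the dictionaries `rtt` and `irq_pe` and their contents are made up; only the standard bus/dev/fn packing of the RID is taken as given.

```python
def make_rid(bus, dev, fn):
    """Pack bus/dev/fn into a 16-bit PCIe Requester ID."""
    assert 0 <= bus < 256 and 0 <= dev < 32 and 0 <= fn < 8
    return (bus << 8) | (dev << 3) | fn


# Hypothetical RTT contents: RID -> PE#.
rtt = {
    make_rid(0x01, 0, 0): 1,
    make_rid(0x01, 0, 1): 1,
    make_rid(0x02, 0, 0): 2,
}

# Hypothetical interrupt descriptor table: interrupt number -> owning PE#.
irq_pe = {100: 1, 101: 2}


def msi_authorized(rid, irq):
    """Accept an MSI only if the sender's PE# matches the descriptor's PE#."""
    return rid in rtt and irq_pe.get(irq) == rtt[rid]
```

With these tables, function 01:00.0 may raise interrupt 100 but not interrupt 101, which belongs to a different PE.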
- * Outbound. That's where the tricky part is.
- Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
- from the CPU address space to the PCI address space. There is one M32
- window and sixteen M64 windows. They have different characteristics.
- First what they have in common: they forward a configurable portion of
- the CPU address space to the PCIe bus, and they must be naturally
- aligned and a power of two in size. The rest is different:
- - The M32 window:
- * Is limited to 4GB in size.
- * Drops the top bits of the address (above the size) and replaces
- them with a configurable value. This is typically used to generate
- 32-bit PCIe accesses. We configure that window at boot from FW and
- don't touch it from Linux; it's usually set to forward a 2GB
- portion of address space from the CPU to PCIe addresses
- 0x8000_0000..0xffff_ffff. (Note: The top 64KB are actually
- reserved for MSIs, but this is not a problem at this point; we just
- need to ensure Linux doesn't assign anything there. The M32 logic
- ignores that reservation, however, and will forward accesses in
- that space if we try.)
- * Is divided into 256 segments of equal size. A table in the chip
- maps each segment to a PE#. That allows portions of the MMIO space
- to be assigned to PEs on a segment granularity. For a 2GB window,
- the segment granularity is 2GB/256 = 8MB.
- Now, this is the "main" window we use in Linux today (excluding
- SR-IOV). We basically use the trick of forcing the bridge MMIO windows
- onto a segment alignment/granularity so that the space behind a bridge
- can be assigned to a PE.
- Ideally we would like to be able to have individual functions in PEs
- but that would mean using a completely different address allocation
- scheme where individual function BARs can be "grouped" to fit in one or
- more segments.
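The M32 segment arithmetic and the bridge alignment trick can be sketched as follows. This is an illustration under stated assumptions: it works on PCI-side addresses in the usual 2GB window, the list `m32_segment_to_pe` stands in for the in-chip table, and the function names are invented here.

```python
M32_PCI_BASE = 0x8000_0000                # usual PCI-side base, per the text
M32_SIZE = 2 << 30                        # 2GB window
NUM_SEGMENTS = 256
SEGMENT_SIZE = M32_SIZE // NUM_SEGMENTS   # 2GB / 256 = 8MB

# Stand-in for the in-chip segment -> PE# table (all segments start in PE 0).
m32_segment_to_pe = [0] * NUM_SEGMENTS


def m32_segment(pci_addr):
    """Index of the M32 segment that contains a PCI address."""
    if not M32_PCI_BASE <= pci_addr < M32_PCI_BASE + M32_SIZE:
        raise ValueError("address outside the M32 window")
    return (pci_addr - M32_PCI_BASE) // SEGMENT_SIZE


def assign_bridge_window(start, end, pe):
    """Give every segment of a segment-aligned bridge window to one PE."""
    if (start - M32_PCI_BASE) % SEGMENT_SIZE or (end + 1 - M32_PCI_BASE) % SEGMENT_SIZE:
        raise ValueError("bridge window is not segment aligned")
    for seg in range(m32_segment(start), m32_segment(end) + 1):
        m32_segment_to_pe[seg] = pe


# A 16MB bridge window starting 8MB into the M32 window goes to PE 5.
assign_bridge_window(0x8080_0000, 0x817f_ffff, 5)
```

Because the bridge window is forced onto 8MB boundaries, it covers whole segments (1 and 2 in this example), so nothing behind another bridge can share a segment with it.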
- - The M64 windows:
- * Must be at least 256MB in size.
- * Do not translate addresses (the address on PCIe is the same as the
- address on the PowerBus). There is a way to also set the top 14
- bits which are not conveyed by PowerBus but we don't use this.
- * Can be configured to be segmented. When not segmented, we can
- specify the PE# for the entire window. When segmented, a window
- has 256 segments; however, there is no table for mapping a segment
- to a PE#. The segment number *is* the PE#.
- * Support overlaps. If an address is covered by multiple windows,
- there's a defined ordering for which window applies.
- We have code (fairly new compared to the M32 stuff) that exploits that
- for large BARs in 64-bit space:
- We configure an M64 window to cover the entire region of address space
- that has been assigned by FW for the PHB (about 64GB, not counting the
- space for the M32, which comes out of a different "reserve"). We
- configure it as segmented.
- Then we do the same thing as with M32, using the bridge alignment
- trick, to match to those giant segments.
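For a segmented M64 window there is no lookup table, so the PE# follows directly from the address. A minimal sketch, assuming a hypothetical 64GB window at an invented base address (the real base is whatever FW assigned to the PHB):

```python
MB = 1 << 20
GB = 1 << 30
NUM_SEGMENTS = 256


def m64_segment_pe(addr, window_base, window_size):
    """PE# for an address in a segmented M64 window (segment number = PE#)."""
    if window_size < 256 * MB or window_size & (window_size - 1):
        raise ValueError("M64 window must be a power of two, at least 256MB")
    if window_base % window_size:
        raise ValueError("M64 window must be naturally aligned")
    if not window_base <= addr < window_base + window_size:
        raise ValueError("address outside the window")
    return (addr - window_base) // (window_size // NUM_SEGMENTS)


# Hypothetical 64GB window; each of its 256 segments is 256MB.
WINDOW_BASE = 0x40_0000_0000
WINDOW_SIZE = 64 * GB
```

With a 64GB window the segments are 256MB each, which is why they are called "giant" above: any bridge window placed in this space has to be aligned to, and sized in multiples of, 256MB.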
- Since we cannot remap, we have two additional constraints:
- - We do the PE# allocation *after* the 64-bit space has been assigned
- because the addresses we use directly determine the PE#. We then
- update the M32 PE# for the devices that use both 32-bit and 64-bit
- spaces or assign the remaining PE# to 32-bit only devices.
- - We cannot "group" segments in HW, so if a device ends up using more
- than one segment, we end up with more than one PE#. There is a HW
- mechanism to make the freeze state cascade to "companion" PEs but
- that only works for PCIe error messages (typically used so that if
- you freeze a switch, it freezes all its children). So we do it in
- SW. We lose a bit of effectiveness of EEH in that case, but that's
- the best we found. So when any of the PEs freezes, we freeze the
- other ones for that "domain". We thus introduce the concept of
- "master PE" which is the one used for DMA, MSIs, etc., and "secondary
- PEs" that are used for the remaining M64 segments.
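The software freeze cascade for a "domain" of PEs can be pictured with a small model. The class below is purely illustrative: `PEDomain` and its methods are invented names and do not correspond to the actual EEH code.

```python
class PEDomain:
    """A master PE plus the secondary PEs covering a device's other M64 segments."""

    def __init__(self, master, secondaries):
        self.master = master                  # the PE used for DMA, MSIs, etc.
        self.members = {master, *secondaries}
        self.frozen = set()

    def on_freeze(self, pe):
        """Called when any member PE is found frozen; freeze the whole domain."""
        if pe not in self.members:
            raise ValueError("PE does not belong to this domain")
        # HW only cascades on PCIe error messages, so software freezes the rest.
        self.frozen |= self.members

    def is_frozen(self, pe):
        return pe in self.frozen
```

For a device whose 64-bit BARs span segments 4, 5 and 6, a freeze reported on PE 6 marks all three PEs frozen, so the device is treated as a single unit.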
- We would like to investigate using additional M64 windows in "single
- PE" mode to overlay over specific BARs to work around some of that, for
- example for devices with very large BARs, e.g., GPUs. It would make
- sense, but we haven't done it yet.
- 3. Considerations for SR-IOV on PowerKVM
- * SR-IOV Background
- The PCIe SR-IOV feature allows a single Physical Function (PF) to
- support several Virtual Functions (VFs). Registers in the PF's SR-IOV
- Capability control the number of VFs and whether they are enabled.
- When VFs are enabled, they appear in Configuration Space like normal
- PCI devices, but the BARs in VF config space headers are unusual. For
- a non-VF device, software uses BARs in the config space header to
- discover the BAR sizes and assign addresses for them. For VF devices,
- software uses VF BAR registers in the *PF* SR-IOV Capability to
- discover sizes and assign addresses. The BARs in the VF's config space
- header are read-only zeros.
- When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
- base address for all the corresponding VF(n) BARs. For example, if the
- PF SR-IOV Capability is programmed to enable eight VFs, and it has a
- 1MB VF BAR0, the address in that VF BAR sets the base of an 8MB region.
- This region is divided into eight contiguous 1MB regions, each of which
- is a BAR0 for one of the VFs. Note that even though the VF BAR
- describes an 8MB region, the alignment requirement is for a single VF,
- i.e., 1MB in this example.
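The VF BAR layout in this example reduces to simple arithmetic. The sketch below uses invented helper names and an arbitrary base address chosen to be 1MB aligned but not 8MB aligned, to match the note about alignment.

```python
MB = 1 << 20


def vf_bar_address(vf_bar_base, vf_bar_size, vf_index):
    """Address of the BAR for VF number vf_index inside the VF(n) BAR space."""
    return vf_bar_base + vf_index * vf_bar_size


def vf_bar_space_size(vf_bar_size, num_vfs):
    """Total size of the region described by one VF BAR in the PF."""
    return vf_bar_size * num_vfs


# Eight VFs with a 1MB VF BAR0, at a hypothetical base address.
BASE = 0x9010_0000
```

Here `vf_bar_address(BASE, MB, 3)` gives the BAR0 of VF3, three megabytes above the base, and the whole region is 8MB even though `BASE` is only 1MB aligned.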
- There are several strategies for isolating VFs in PEs:
- - M32 window: There's one M32 window, and it is split into 256
- equally-sized segments. The finest granularity possible is a 256MB
- window with 1MB segments. VF BARs that are 1MB or larger could be
- mapped to separate PEs in this window. Each segment can be
- individually mapped to a PE via the lookup table, so this is quite
- flexible, but it works best when all the VF BARs are the same size. If
- they are different sizes, the entire window has to be small enough that
- the segment size matches the smallest VF BAR, which means larger VF
- BARs span several segments.
- - Non-segmented M64 window: A non-segmented M64 window is mapped entirely
- to a single PE, so it could only isolate one VF.
- - Single segmented M64 window: A segmented M64 window could be used just
- like the M32 window, but the segments can't be individually mapped to
- PEs (the segment number is the PE#), so there isn't as much
- flexibility. A VF with multiple BARs would have to be in a "domain" of
- multiple PEs, which is not as well isolated as a single PE.
- - Multiple segmented M64 windows: As usual, each window is split into 256
- equally-sized segments, and the segment number is the PE#. But if we
- use several M64 windows, they can be set to different base addresses
- and different segment sizes. If we have VFs that each have a 1MB BAR
- and a 32MB BAR, we could use one M64 window to assign 1MB segments and
- another M64 window to assign 32MB segments.
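To compare the single-window and multiple-window strategies for the 1MB plus 32MB example, the helper below lists which segments (and therefore which PE#s) a BAR touches. The window base addresses and the function name are hypothetical.

```python
MB = 1 << 20


def segments_covered(bar_addr, bar_size, window_base, segment_size):
    """Segment numbers (PE#s in a segmented M64 window) that a BAR touches."""
    first = (bar_addr - window_base) // segment_size
    last = (bar_addr + bar_size - 1 - window_base) // segment_size
    return list(range(first, last + 1))


# Hypothetical windows: A has 1MB segments, B has 32MB segments.
WIN_A = 0x40_0000_0000
WIN_B = 0x42_0000_0000

# Two windows: both BARs of VF5 land in segment 5 of their own window.
small = segments_covered(WIN_A + 5 * MB, MB, WIN_A, MB)
large = segments_covered(WIN_B + 5 * 32 * MB, 32 * MB, WIN_B, 32 * MB)

# One window with 1MB segments: a single 32MB BAR spans 32 segments.
spread = segments_covered(WIN_A + 32 * MB, 32 * MB, WIN_A, MB)
```

In the two-window layout both BARs of a VF resolve to the same PE#, whereas in the single-window layout the 32MB BAR alone is spread over 32 PE#s.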
- Finally, the plan is to use M64 windows for SR-IOV, which is described
- in more detail in the next two sections. For a given VF BAR, we need to
- effectively reserve the entire 256 segments (256 * VF BAR size) and
- position the VF BAR to start at the beginning of a free range of
- segments/PEs inside that M64 window.
- The goal is of course to be able to give a separate PE for each VF.
- The IODA2 platform has 16 M64 windows, which are used to map MMIO
- ranges to PE#s. Each M64 window defines one MMIO range, and this range is
- divided into 256 segments, with each segment corresponding to one PE.
- We decided to use these M64 windows to map VFs to individual PEs, since
- SR-IOV VF BARs are all the same size.
- But doing so introduces another problem: total_VFs is usually smaller
- than the number of M64 window segments, so if we map one VF BAR directly
- to one M64 window, some part of the M64 window will map to another
- device's MMIO range.
- IODA supports 256 PEs, so a segmented window contains 256 segments. If
- total_VFs is less than 256, we have the situation in Figure 1.0, where
- segments [total_VFs, 255] of the M64 window may map to some MMIO range on
- other devices:
-     0      1                     total_VFs - 1
-     +------+------+-     -+------+------+
-     |      |      |  ...  |      |      |
-     +------+------+-     -+------+------+
-                VF(n) BAR space
-     0      1                     total_VFs - 1          255
-     +------+------+-     -+------+------+-      -+------+------+
-     |      |      |  ...  |      |      |   ...  |      |      |
-     +------+------+-     -+------+------+-      -+------+------+
-                              M64 window
-             Figure 1.0 Direct map VF(n) BAR space
- Our current solution is to allocate 256 segments even if the VF(n) BAR
- space doesn't need that much, as shown in Figure 1.1:
-     0      1                     total_VFs - 1          255
-     +------+------+-     -+------+------+-      -+------+------+
-     |      |      |  ...  |      |      |   ...  |      |      |
-     +------+------+-     -+------+------+-      -+------+------+
-                       VF(n) BAR space + extra
-     0      1                     total_VFs - 1          255
-     +------+------+-     -+------+------+-      -+------+------+
-     |      |      |  ...  |      |      |   ...  |      |      |
-     +------+------+-     -+------+------+-      -+------+------+
-                              M64 window
-             Figure 1.1 Map VF(n) BAR space + extra
- Allocating the extra space ensures that the entire M64 window will be
- assigned to this one SR-IOV device and none of the space will be
- available for other devices. Note that this only expands the space
- reserved in software; there are still only total_VFs VFs, and they only
- respond to segments [0, total_VFs - 1]. There's nothing in hardware that
- responds to segments [total_VFs, 255].
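The effect of reserving the extra space can be put in numbers. The helper below is a sketch with an invented name; it returns how much address space software reserves, how much of it is actually backed by VF BARs, and which segments have nothing behind them.

```python
MB = 1 << 20


def reserved_layout(vf_bar_size, total_vfs, num_segments=256):
    """Return (reserved bytes, bytes backed by VFs, unbacked segment numbers)."""
    if not 0 < total_vfs <= num_segments:
        raise ValueError("total_VFs must be between 1 and the segment count")
    reserved = num_segments * vf_bar_size
    backed = total_vfs * vf_bar_size
    return reserved, backed, range(total_vfs, num_segments)


# Example: a device with total_VFs = 64 and a 1MB VF BAR.
reserved, backed, unbacked = reserved_layout(MB, 64)
```

For this example the core reserves 256MB, of which only 64MB is decoded by VFs; segments 64 through 255 are reserved so that no other device can be placed there.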
- 4. Implications for the Generic PCI Code
- The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
- aligned to the size of an individual VF BAR.
- In IODA2, the MMIO address determines the PE#. If the address is in an M32
- window, we can set the PE# by updating the table that translates segments
- to PE#s. Similarly, if the address is in an unsegmented M64 window, we can
- set the PE# for the window. But if it's in a segmented M64 window, the
- segment number is the PE#.
- Therefore, the only way to control the PE# for a VF is to change the base
- of the VF(n) BAR space in the VF BAR. If the PCI core allocates the exact
- amount of space required for the VF(n) BAR space, the VF BAR value is fixed
- and cannot be changed.
- On the other hand, if the PCI core allocates additional space, the VF BAR
- value can be changed as long as the entire VF(n) BAR space remains inside
- the space allocated by the core.
- Ideally the segment size will be the same as an individual VF BAR size.
- Then each VF will be in its own PE. The VF BARs (and therefore the PE#s)
- are contiguous. If VF0 is in PE(x), then VF(n) is in PE(x+n). If we
- allocate 256 segments, VF0 can be placed in any PE# from 0 to
- (256 - numVFs), which gives (256 - numVFs + 1) choices for the PE# of VF0.
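When the segment size equals the VF BAR size, choosing the PE# of VF0 amounts to choosing an offset for the VF BAR inside the reserved window. A minimal sketch with invented function names and a hypothetical window base:

```python
NUM_SEGMENTS = 256
MB = 1 << 20


def vf_bar_base_for_pe(window_base, vf_bar_size, num_vfs, vf0_pe):
    """VF BAR value that places VF0 in PE# vf0_pe (segment size == VF BAR size)."""
    if vf0_pe < 0 or vf0_pe + num_vfs > NUM_SEGMENTS:
        raise ValueError("VF(n) BAR space would leave the reserved window")
    return window_base + vf0_pe * vf_bar_size


def vf_pe(vf0_pe, vf_index):
    """PE# of a given VF: the PE#s are contiguous, starting at VF0's PE#."""
    return vf0_pe + vf_index


# Eight VFs with 1MB BARs; put VF0 in PE 16 of a hypothetical window.
base = vf_bar_base_for_pe(0x40_0000_0000, MB, 8, 16)
```

Moving the VF BAR by one segment shifts every VF to the next PE#, which is how the platform can steer VFs onto PE#s that are currently free.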
- If the segment size is smaller than the VF BAR size, it will take several
- segments to cover a VF BAR, and a VF will be in several PEs. This is
- possible, but the isolation isn't as good, and it reduces the number of PE#
- choices: instead of consuming only numVFs segments, the VF(n) BAR space
- will consume (numVFs * n) segments, where n is the number of segments
- needed to cover one VF BAR. That means there aren't as many available
- segments for adjusting the base of the VF(n) BAR space.
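The cost of a segment size smaller than the VF BAR size can be worked through with a short example. Here n is taken to be the number of segments needed to cover one VF BAR (an interpretation of the (numVFs * n) expression above), and the function names are invented.

```python
MB = 1 << 20


def segments_per_vf(vf_bar_size, segment_size):
    """n: how many segments one VF BAR covers."""
    if vf_bar_size % segment_size:
        raise ValueError("VF BAR size must be a multiple of the segment size")
    return vf_bar_size // segment_size


def vf_pes(vf0_segment, vf_index, n):
    """PE#s (segment numbers) that one VF occupies."""
    first = vf0_segment + vf_index * n
    return list(range(first, first + n))


def free_segments(num_vfs, n, num_segments=256):
    """Segments left over for moving the base of the VF(n) BAR space."""
    return num_segments - num_vfs * n


# 16 VFs with 4MB BARs in a window with 1MB segments.
n = segments_per_vf(4 * MB, MB)
```

In this example each VF sits in four PEs, the VF(n) BAR space consumes 64 segments instead of 16, and only 192 segments remain for repositioning it.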
|