- Linux kernel driver for Elastic Network Adapter (ENA) family:
- =============================================================
- Overview:
- =========
- ENA is a networking interface designed to make good use of modern CPU
- features and system architectures.
- The ENA device exposes a lightweight management interface with a
- minimal set of memory mapped registers and extendable command set
- through an Admin Queue.
- The driver supports a range of ENA devices, is link-speed independent
- (i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc.), and has
- a negotiated and extendable feature set.
- Some ENA devices support SR-IOV. This driver is used for both the
- SR-IOV Physical Function (PF) and Virtual Function (VF) devices.
- ENA devices enable high speed and low overhead network traffic
- processing by providing multiple Tx/Rx queue pairs (the maximum number
- is advertised by the device via the Admin Queue), a dedicated MSI-X
- interrupt vector per Tx/Rx queue pair, adaptive interrupt moderation,
- and CPU cacheline optimized data placement.
- The ENA driver supports industry standard TCP/IP offload features such
- as checksum offload and TCP transmit segmentation offload (TSO).
- Receive-side scaling (RSS) is supported for multi-core scaling.
- The ENA driver and its corresponding devices implement health
- monitoring mechanisms such as watchdog, enabling the device and driver
- to recover in a manner transparent to the application, as well as
- debug logs.
- Some of the ENA devices support a working mode called Low-latency
- Queue (LLQ), which further reduces latency by several microseconds.
- Supported PCI vendor ID/device IDs:
- ===================================
- 1d0f:0ec2 - ENA PF
- 1d0f:1ec2 - ENA PF with LLQ support
- 1d0f:ec20 - ENA VF
- 1d0f:ec21 - ENA VF with LLQ support
- ENA Source Code Directory Structure:
- ====================================
- ena_com.[ch] - Management communication layer. This layer is
- responsible for handling all the management
- (admin) communication between the device and the
- driver.
- ena_eth_com.[ch] - Tx/Rx data path.
- ena_admin_defs.h - Definition of ENA management interface.
- ena_eth_io_defs.h - Definition of ENA data path interface.
- ena_common_defs.h - Common definitions for ena_com layer.
- ena_regs_defs.h - Definition of ENA PCI memory-mapped (MMIO) registers.
- ena_netdev.[ch] - Main Linux kernel driver.
- ena_sysfs.[ch] - Sysfs files.
- ena_ethtool.c - ethtool callbacks.
- ena_pci_id_tbl.h - Supported device IDs.
- Management Interface:
- =====================
- ENA management interface is exposed by means of:
- - PCIe Configuration Space
- - Device Registers
- - Admin Queue (AQ) and Admin Completion Queue (ACQ)
- - Asynchronous Event Notification Queue (AENQ)
- ENA device MMIO Registers are accessed only during driver
- initialization and are not involved in further normal device
- operation.
- AQ is used for submitting management commands, and the
- results/responses are reported asynchronously through ACQ.
- ENA introduces a very small set of management commands with room for
- vendor-specific extensions. Most of the management operations are
- framed in a generic Get/Set feature command.
- The following admin queue commands are supported:
- - Create I/O submission queue
- - Create I/O completion queue
- - Destroy I/O submission queue
- - Destroy I/O completion queue
- - Get feature
- - Set feature
- - Configure AENQ
- - Get statistics
- Refer to ena_admin_defs.h for the list of supported Get/Set Feature
- properties.
- The Asynchronous Event Notification Queue (AENQ) is a uni-directional
- queue used by the ENA device to send to the driver events that cannot
- be reported using ACQ. AENQ events are subdivided into groups. Each
- group may have multiple syndromes, as shown below.
- The events are:
- Group                 Syndrome
- Link state change     - X -
- Fatal error           - X -
- Notification          Suspend traffic
- Notification          Resume traffic
- Keep-Alive            - X -
- ACQ and AENQ share the same MSI-X vector.
- Keep-Alive is a special mechanism that allows monitoring of the
- device's health. The driver maintains a watchdog (WD) handler which,
- if fired, logs the current state and statistics then resets and
- restarts the ENA device and driver. A Keep-Alive event is delivered by
- the device every second. The driver re-arms the WD upon reception of a
- Keep-Alive event. A missed Keep-Alive event causes the WD handler to
- fire.
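The Keep-Alive handling described above can be sketched as a timestamp that is refreshed on every event and compared against a timeout. This is only an illustration: the structure, the function names and the 3000 ms timeout are assumptions made for the example and are not taken from the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed timeout for this sketch only; the driver's real value may differ. */
#define KEEP_ALIVE_TIMEOUT_MS 3000u

/* Hypothetical watchdog state. */
struct keep_alive_wd {
    uint64_t last_event_ms; /* time of the most recent Keep-Alive event */
};

/* Called when a Keep-Alive AENQ event arrives: re-arm the watchdog. */
void wd_rearm(struct keep_alive_wd *wd, uint64_t now_ms)
{
    wd->last_event_ms = now_ms;
}

/*
 * Returns true when no Keep-Alive event arrived within the timeout, which is
 * the point where the handler would log state and reset the device.
 */
bool wd_expired(const struct keep_alive_wd *wd, uint64_t now_ms)
{
    return now_ms - wd->last_event_ms > KEEP_ALIVE_TIMEOUT_MS;
}
```

A periodic timer would call wd_expired() and trigger the reset path when it returns true.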
- Data Path Interface:
- ====================
- I/O operations are based on Tx and Rx Submission Queues (Tx SQ and Rx
- SQ, respectively). Each SQ has a completion queue (CQ) associated
- with it.
- The SQs and CQs are implemented as descriptor rings in contiguous
- physical memory.
- The ENA driver supports two Queue Operation modes for Tx SQs:
- - Regular mode
- * In this mode the Tx SQs reside in the host's memory. The ENA
- device fetches the ENA Tx descriptors and packet data from host
- memory.
- - Low Latency Queue (LLQ) mode or "push-mode".
- * In this mode the driver pushes the transmit descriptors and the
- first 128 bytes of the packet directly to the ENA device memory
- space. The rest of the packet payload is fetched by the
- device. For this operation mode, the driver uses a dedicated PCI
- device memory BAR, which is mapped with write-combine capability.
- The Rx SQs support only the regular mode.
- Note: Not all ENA devices support LLQ, and this feature is negotiated
- with the device upon initialization. If the ENA device does not
- support LLQ mode, the driver falls back to the regular mode.
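To make the LLQ split concrete, the sketch below computes how many bytes of a packet would be pushed to device memory and how many the device would still fetch from host memory. The 128-byte figure comes from the text above; the type and function names are hypothetical, and pushing a shorter packet in full is an assumption of this example.

```c
#include <stddef.h>

/* From the text: the first 128 bytes of the packet are pushed. */
#define LLQ_PUSH_BYTES 128u

/* Hypothetical result type for this sketch. */
struct llq_split {
    size_t pushed;  /* bytes written to device memory with the descriptors */
    size_t fetched; /* bytes left in host memory for the device to fetch */
};

/* Assumption: a packet shorter than 128 bytes is pushed in full. */
struct llq_split llq_split_packet(size_t pkt_len)
{
    struct llq_split s;

    s.pushed = pkt_len < LLQ_PUSH_BYTES ? pkt_len : LLQ_PUSH_BYTES;
    s.fetched = pkt_len - s.pushed;
    return s;
}
```

For a 1500-byte packet this yields 128 pushed bytes and 1372 fetched bytes.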
- The driver supports multi-queue for both Tx and Rx. This has various
- benefits:
- - Reduced CPU/thread/process contention on a given Ethernet interface.
- - Cache miss rate on completion is reduced, particularly for data
- cache lines that hold the sk_buff structures.
- - Increased process-level parallelism when handling received packets.
- - Increased data cache hit rate, by steering kernel processing of
- packets to the CPU where the application thread consuming the
- packet is running.
- - In-hardware interrupt redirection.
- Interrupt Modes:
- ================
- The driver assigns a single MSI-X vector per queue pair (for both Tx
- and Rx directions). The driver assigns an additional dedicated MSI-X vector
- for management (for ACQ and AENQ).
- Management interrupt registration is performed when the Linux kernel
- probes the adapter, and it is de-registered when the adapter is
- removed. I/O queue interrupt registration is performed when the Linux
- interface of the adapter is opened, and it is de-registered when the
- interface is closed.
- The management interrupt is named:
- ena-mgmnt@pci:<PCI domain:bus:slot.function>
- and for each queue pair, an interrupt is named:
- <interface name>-Tx-Rx-<queue index>
- The ENA device operates in auto-mask and auto-clear interrupt
- modes. That is, once an MSI-X interrupt is delivered to the host, its
- Cause bit is automatically cleared and the interrupt is masked. The
- interrupt is unmasked by the driver after NAPI processing is complete.
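The vector count and the interrupt naming scheme described in this section can be illustrated as follows. The helper names are hypothetical and the code is a user-space sketch, not the driver's registration code; the PCI address is passed in as an already formatted string.

```c
#include <stdio.h>

/* One vector per Tx/Rx queue pair plus one for management (ACQ and AENQ). */
int ena_sketch_num_vectors(int num_queue_pairs)
{
    return num_queue_pairs + 1;
}

/* pci_addr is a "domain:bus:slot.function" string such as "0000:00:05.0". */
int ena_sketch_mgmnt_irq_name(char *buf, size_t len, const char *pci_addr)
{
    return snprintf(buf, len, "ena-mgmnt@pci:%s", pci_addr);
}

/* Builds "<interface name>-Tx-Rx-<queue index>" for one queue pair. */
int ena_sketch_io_irq_name(char *buf, size_t len, const char *ifname, int qid)
{
    return snprintf(buf, len, "%s-Tx-Rx-%d", ifname, qid);
}
```

With an interface called eth0 and queue index 3, the second helper produces eth0-Tx-Rx-3, which is the kind of name expected to show up in /proc/interrupts.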
- Interrupt Moderation:
- =====================
- The ENA driver and device can operate in conventional or adaptive
- interrupt moderation mode.
- In conventional mode, the driver instructs the device to postpone
- interrupt posting according to a static interrupt delay value. The
- interrupt delay value can be configured through ethtool(8). The
- following ethtool parameters are supported by the driver: tx-usecs and
- rx-usecs.
- In adaptive interrupt moderation mode the interrupt delay value is
- updated by the driver dynamically and adjusted every NAPI cycle
- according to the traffic nature.
- By default, the ENA driver applies adaptive coalescing on Rx traffic
- and conventional coalescing on Tx traffic.
- Adaptive coalescing can be switched on or off through the ethtool(8)
- adaptive-rx on|off parameter.
- The driver chooses the interrupt delay value according to the number
- of bytes and packets received between interrupt unmasking and
- interrupt posting. The driver uses an interrupt delay table that
- subdivides the range of received bytes/packets into 5 levels and
- assigns an interrupt delay value to each level.
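A five-level delay table of the kind described above could be looked up as in the sketch below. All thresholds and delay values are placeholders invented for the example, and the rule of picking the lowest level that covers both counters is an assumption; the driver's defaults and selection logic are not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_LEVELS 5

/* Hypothetical table entry for this sketch. */
struct moder_level {
    uint32_t max_pkts;  /* upper bound on packets for this level */
    uint32_t max_bytes; /* upper bound on bytes for this level */
    uint32_t delay_us;  /* interrupt delay applied at this level */
};

/* Placeholder thresholds and delays; these are NOT the driver's defaults. */
static const struct moder_level moder_tbl[NUM_LEVELS] = {
    {          4,       1024,   0 },
    {         16,       8192,  20 },
    {         64,      32768,  40 },
    {        128,      65536,  70 },
    { UINT32_MAX, UINT32_MAX, 110 },
};

/* Pick the lowest level whose bounds cover both the packet and byte counts. */
uint32_t pick_delay_us(uint32_t pkts, uint32_t bytes)
{
    size_t i;

    for (i = 0; i < NUM_LEVELS - 1; i++) {
        if (pkts <= moder_tbl[i].max_pkts && bytes <= moder_tbl[i].max_bytes)
            break;
    }
    return moder_tbl[i].delay_us;
}
```

The driver would feed in the counters gathered between interrupt unmasking and interrupt posting, then program the returned delay for the next cycle.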
- The user can enable/disable adaptive moderation, modify the interrupt
- delay table and restore its default values through sysfs.
- RX copybreak:
- =============
- The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK
- and can be configured by the ETHTOOL_STUNABLE command of the
- SIOCETHTOOL ioctl.
- SKB:
- The driver allocates the SKB for a received frame during Rx handling,
- in NAPI context. The allocation method depends on the size of the
- packet. If the frame length is larger than rx_copybreak,
- napi_get_frags() is used. Otherwise, netdev_alloc_skb_ip_align() is
- used, the buffer content is copied (by the CPU) to the SKB, and the
- buffer is recycled.
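The size-based choice between the two allocation paths can be reduced to a single comparison. The enum and function names below are hypothetical, and the sketch only models the decision; it does not call the kernel allocation functions.

```c
#include <stddef.h>

/* Hypothetical labels for the two paths described in the text. */
enum rx_skb_path {
    RX_SKB_COPY,  /* netdev_alloc_skb_ip_align() plus copy; buffer recycled */
    RX_SKB_FRAGS  /* napi_get_frags(); the Rx buffer is attached as a frag */
};

/* Frames strictly larger than rx_copybreak take the frags path. */
enum rx_skb_path rx_choose_path(size_t frame_len, size_t rx_copybreak)
{
    return frame_len > rx_copybreak ? RX_SKB_FRAGS : RX_SKB_COPY;
}
```

Raising rx_copybreak trades extra CPU copies for more buffer reuse, since more frames take the copy path.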
- Statistics:
- ===========
- The user can obtain ENA device and driver statistics using ethtool.
- The driver can collect regular or extended statistics (including
- per-queue stats) from the device.
- In addition, the driver logs the statistics to syslog upon device reset.
- MTU:
- ====
- The driver supports any MTU up to a maximum that is negotiated with
- the device. The driver configures MTU using the SetFeature command
- (ENA_ADMIN_MTU property). The user can change MTU via ip(8) and
- similar legacy tools.
- Stateless Offloads:
- ===================
- The ENA driver supports:
- - TSO over IPv4/IPv6
- - TSO with ECN
- - IPv4 header checksum offload
- - TCP/UDP over IPv4/IPv6 checksum offloads
- RSS:
- ====
- - The ENA device supports RSS that allows flexible Rx traffic
- steering.
- - Toeplitz and CRC32 hash functions are supported.
- - Different combinations of L2/L3/L4 fields can be configured as
- inputs for hash functions.
- - The driver configures RSS settings using the AQ SetFeature command
- (ENA_ADMIN_RSS_HASH_FUNCTION, ENA_ADMIN_RSS_HASH_INPUT and
- ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG properties).
- - If the NETIF_F_RXHASH flag is set, the 32-bit result of the hash
- function delivered in the Rx CQ descriptor is set in the received
- SKB.
- - The user can provide a hash key, hash function, and configure the
- indirection table through ethtool(8).
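For readers unfamiliar with RSS, the following is a generic software sketch of a Toeplitz hash followed by an indirection table lookup. It illustrates the technique in general and is not the ENA device's implementation; the function names are hypothetical, and using a modulo to index the table is an assumption (for power-of-two table sizes it equals masking the low bits).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toeplitz hash: for every set bit of the input (most significant bit first),
 * XOR in the 32-bit window of the key that starts at that bit position.
 * key_len must be at least 4; key bits past the end are treated as zero.
 */
uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
                       const uint8_t *data, size_t data_len)
{
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8) | (uint32_t)key[3];
    uint32_t hash = 0;
    size_t i;
    int bit;

    for (i = 0; i < data_len; i++) {
        for (bit = 7; bit >= 0; bit--) {
            if (data[i] & (1u << bit))
                hash ^= window;
            /* Slide the window one bit further along the key. */
            window <<= 1;
            if (i + 4 < key_len && (key[i + 4] & (1u << bit)))
                window |= 1u;
        }
    }
    return hash;
}

/* Map a hash value to an Rx queue through the indirection table. */
uint16_t rss_pick_queue(uint32_t hash, const uint16_t *tbl, size_t tbl_size)
{
    return tbl[hash % tbl_size];
}
```

The input bytes would be the concatenation of whichever header fields are configured as hash inputs, for example the source and destination addresses followed by the ports.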
- DATA PATH:
- ==========
- Tx:
- ---
- ena_start_xmit() is called by the stack. This function does the following:
- - Maps data buffers (skb->data and frags).
- - Populates ena_buf for the push buffer (if the driver and device are
- in push mode).
- - Prepares ena_bufs for the remaining frags.
- - Allocates a new request ID from the empty req_id ring. The request
- ID is the index of the packet in the Tx info. This is used for
- out-of-order TX completions.
- - Adds the packet to the proper place in the Tx ring.
- - Calls ena_com_prepare_tx(), an ENA communication layer function that
- converts the ena_bufs to ENA descriptors (and adds meta ENA
- descriptors as needed).
- * This function also copies the ENA descriptors and the push buffer
- to the device memory space (if in push mode).
- - Writes doorbell to the ENA device.
- - When the ENA device finishes sending the packet, a completion
- interrupt is raised.
- - The interrupt handler schedules NAPI.
- - The ena_clean_tx_irq() function is called. This function handles the
- completion descriptors generated by the ENA, with a single
- completion descriptor per completed packet.
- * req_id is retrieved from the completion descriptor. The tx_info of
- the packet is retrieved via the req_id. The data buffers are
- unmapped and req_id is returned to the empty req_id ring.
- * The function stops when the completion descriptors are completed or
- the budget is reached.
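The request ID bookkeeping in the Tx steps above can be pictured as a small ring of free IDs: the transmit path takes an ID from one end and the completion path returns IDs at the other, in whatever order completions arrive. The structure, names and ring size below are assumptions for illustration, and the sketch omits the full/empty checks a real driver needs.

```c
#include <stdint.h>

/* Hypothetical ring size for this sketch; must be a power of two here. */
#define TX_RING_SIZE 8u

struct tx_req_ids {
    uint16_t free_ids[TX_RING_SIZE]; /* ring of currently unused req_ids */
    uint16_t next_to_use;            /* slot the next allocation reads */
    uint16_t next_to_clean;          /* slot the next completion writes */
};

void tx_req_ids_init(struct tx_req_ids *r)
{
    uint16_t i;

    for (i = 0; i < TX_RING_SIZE; i++)
        r->free_ids[i] = i;
    r->next_to_use = 0;
    r->next_to_clean = 0;
}

/* Transmit path: take a free req_id; it indexes the packet's tx_info. */
uint16_t tx_req_id_get(struct tx_req_ids *r)
{
    uint16_t id = r->free_ids[r->next_to_use];

    r->next_to_use = (r->next_to_use + 1) & (TX_RING_SIZE - 1);
    return id;
}

/* Completion path: return the req_id found in the completion descriptor. */
void tx_req_id_put(struct tx_req_ids *r, uint16_t id)
{
    r->free_ids[r->next_to_clean] = id;
    r->next_to_clean = (r->next_to_clean + 1) & (TX_RING_SIZE - 1);
}
```

Because IDs are recycled in completion order rather than submission order, the ring contents become permuted over time, which is what makes out-of-order completions workable.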
- Rx:
- ---
- - A packet is received from the ENA device, which raises an interrupt.
- - The interrupt handler schedules NAPI.
- - The ena_clean_rx_irq() function is called. This function calls
- ena_rx_pkt(), an ENA communication layer function, which returns the
- number of descriptors used for a new unhandled packet, and zero if
- no new packet is found.
- - Then it calls the ena_eth_rx_skb() function.
- - ena_eth_rx_skb() checks packet length:
- * If the packet is small (len < rx_copybreak), the driver allocates
- an SKB for the new packet, and copies the packet payload into the
- SKB data buffer.
- - In this way the original data buffer is not passed to the stack
- and is reused for future Rx packets.
- * Otherwise the function unmaps the Rx buffer, then allocates the
- new SKB structure and hooks the Rx buffer to the SKB frags.
- - The new SKB is updated with the necessary information (protocol,
- checksum hw verify result, etc.), and then passed to the network
- stack, using the NAPI interface function napi_gro_receive().
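The shape of the Rx polling loop can be sketched in user space as below. The callback stands in for the communication-layer call that returns the number of descriptors used by the next packet, or zero when nothing is pending. The budget parameter reflects how NAPI polling is normally bounded and is an assumption here, since the Rx steps above do not mention it; all names, including the mock packet source, are hypothetical.

```c
#include <stddef.h>

/*
 * Stand-in for the communication-layer call: returns the number of
 * descriptors used by the next unhandled packet, or 0 if there is none.
 */
typedef int (*rx_pkt_fn)(void *ctx);

/*
 * Handle packets until none remain or the budget is used up.
 * Returns the number of packets handled.
 */
int rx_clean_sketch(rx_pkt_fn next_pkt, void *ctx, int budget)
{
    int done = 0;

    while (done < budget) {
        int descs = next_pkt(ctx);

        if (descs == 0)
            break;
        /* A driver would build the SKB here and pass it to the stack. */
        done++;
    }
    return done;
}

/* Mock source for illustration: *ctx packets pending, one descriptor each. */
int mock_next_pkt(void *ctx)
{
    int *remaining = ctx;

    if (*remaining == 0)
        return 0;
    (*remaining)--;
    return 1;
}
```

When the loop stops because the budget ran out, the remaining packets are left for the next poll.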