- Overview
- ========
- This readme tries to provide some background on the hows and whys of RDS,
- and will hopefully help you find your way around the code.
- In addition, please see this email about RDS origins:
- http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html
- RDS Architecture
- ================
- RDS provides reliable, ordered datagram delivery by using a single
- reliable connection between any two nodes in the cluster. This allows
- applications to use a single socket to talk to any other process in the
- cluster - so in a cluster with N processes you need N sockets, in contrast
- to N*N if you use a connection-oriented socket transport like TCP.
- RDS is not Infiniband-specific; it was designed to support different
- transports. The current implementation supports RDS over TCP as well
- as IB.
- The high-level semantics of RDS from the application's point of view are:
- * Addressing
- RDS uses IPv4 addresses and 16-bit port numbers to identify
- the end point of a connection. All socket operations that involve
- passing addresses between kernel and user space generally
- use a struct sockaddr_in.
- The fact that IPv4 addresses are used does not mean the underlying
- transport has to be IP-based. In fact, RDS over IB uses a
- reliable IB connection; the IP address is used exclusively to
- locate the remote node's GID (by ARPing for the given IP).
- The port space is entirely independent of UDP, TCP or any other
- protocol.
- * Socket interface
- RDS sockets work *mostly* as you would expect from a BSD
- socket. The next section will cover the details. At any rate,
- all I/O is performed through the standard BSD socket API.
- Some additions like zerocopy support are implemented through
- control messages, while other extensions use the getsockopt/
- setsockopt calls.
- Sockets must be bound before you can send or receive data.
- This is needed because binding also selects a transport and
- attaches it to the socket. Once bound, the transport assignment
- does not change. RDS will tolerate IPs moving around (e.g. in
- an active-active HA scenario), but only as long as the address
- doesn't move to a different transport.
- * sysctls
- RDS supports a number of sysctls in /proc/sys/net/rds.
- Socket Interface
- ================
- AF_RDS, PF_RDS, SOL_RDS
- AF_RDS and PF_RDS are the domain type to be used with socket(2)
- to create RDS sockets. SOL_RDS is the socket-level to be used
- with setsockopt(2) and getsockopt(2) for RDS specific socket
- options.
- fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
- This creates a new, unbound RDS socket.
- setsockopt(SOL_SOCKET): send and receive buffer size
- RDS honors the send and receive buffer size socket options.
- You are not allowed to queue more than SO_SNDBUF bytes to
- a socket. A message is queued when sendmsg is called, and
- it leaves the queue when the remote system acknowledges
- its arrival.
- The SO_RCVBUF option controls the maximum receive queue length.
- This is a soft limit rather than a hard limit - RDS will
- continue to accept and queue incoming messages, even if that
- takes the queue length over the limit. However, it will also
- mark the port as "congested" and send a congestion update to
- the source node. The source node is supposed to throttle any
- processes sending to this congested port.
- bind(fd, &sockaddr_in, ...)
- This binds the socket to a local IP address and port, and a
- transport, if one has not already been selected via the
- SO_RDS_TRANSPORT socket option.
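The socket() and bind() steps described above can be sketched in user space as follows. This is a minimal illustration, not code taken from the RDS sources: the helper name rds_open_bound() is hypothetical, and the fallback value 21 for AF_RDS is an assumption for libc headers that do not define it.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Assumption: fallback for older libc headers; Linux uses 21 for AF_RDS. */
#ifndef AF_RDS
#define AF_RDS 21
#endif
#ifndef PF_RDS
#define PF_RDS AF_RDS
#endif

/*
 * Hypothetical helper: create an RDS socket and bind it to ip:port.
 * Returns the new fd, or -1 with errno set (for example when the
 * kernel has no RDS support or the address is not local).
 */
int rds_open_bound(const char *ip, uint16_t port)
{
	struct sockaddr_in sin;
	int fd, saved;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;          /* RDS addresses are IPv4 sockaddr_in */
	sin.sin_port = htons(port);        /* RDS port, independent of TCP/UDP */
	if (inet_pton(AF_INET, ip, &sin.sin_addr) != 1) {
		errno = EINVAL;
		return -1;
	}

	fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
	if (fd < 0)
		return -1;

	/* Binding also attaches a transport if none was selected earlier. */
	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}
```

A caller would typically check for -1 and inspect errno before attempting any sendmsg() or recvmsg() on the returned descriptor.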
- sendmsg(fd, ...)
- Sends a message to the indicated recipient. The kernel will
- transparently establish the underlying reliable connection
- if it isn't up yet.
- An attempt to send a message that exceeds SO_SNDBUF will
- return with -EMSGSIZE.
- An attempt to send a message that would take the total number
- of queued bytes over the SO_SNDBUF threshold will return
- EAGAIN.
- An attempt to send a message to a destination that is marked
- as "congested" will return ENOBUFS.
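The three sendmsg() failure cases listed above call for different reactions from the application. The sketch below shows one way to tell them apart. The enum and the function classify_send_result() are hypothetical names introduced only for illustration; only the errno values come from the text.

```c
#include <errno.h>
#include <sys/types.h>

/* Hypothetical classification of a sendmsg() result on an RDS socket. */
enum send_outcome {
	SEND_OK,          /* message was queued */
	SEND_TOO_BIG,     /* EMSGSIZE: message alone exceeds the send buffer */
	SEND_QUEUE_FULL,  /* EAGAIN: send queue is full, retry after POLLOUT */
	SEND_CONGESTED,   /* ENOBUFS: destination port is marked congested */
	SEND_FAILED,      /* any other error */
};

/* ret is the sendmsg() return value, err the errno observed right after. */
enum send_outcome classify_send_result(ssize_t ret, int err)
{
	if (ret >= 0)
		return SEND_OK;

	switch (err) {
	case EMSGSIZE:
		return SEND_TOO_BIG;
	case EAGAIN:
		return SEND_QUEUE_FULL;
	case ENOBUFS:
		return SEND_CONGESTED;
	default:
		return SEND_FAILED;
	}
}
```

Keeping SEND_QUEUE_FULL and SEND_CONGESTED separate matters because only the first is resolved by waiting for POLLOUT. The second needs a congestion notification or a later retry.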
- recvmsg(fd, ...)
- Receives a message that was queued to this socket. The socket's
- recv queue accounting is adjusted, and if the queue length
- drops below SO_RCVBUF, the port is marked uncongested, and
- a congestion update is sent to all peers.
- Applications can ask the RDS kernel module to receive
- notifications via control messages (for instance, there is a
- notification when a congestion update arrives, or when an RDMA
- operation completes). These notifications are received through
- the msg.msg_control buffer of struct msghdr. The format of the
- messages is described in manpages.
- poll(fd)
- RDS supports the poll interface to allow the application
- to implement async I/O.
- POLLIN handling is pretty straightforward. When there's an
- incoming message queued to the socket, or a pending notification,
- we signal POLLIN.
- POLLOUT is a little harder. Since you can essentially send
- to any destination, RDS will always signal POLLOUT as long as
- there's room on the send queue (i.e. the number of bytes queued
- is less than the sendbuf size).
- However, the kernel will refuse to accept messages to
- a destination marked congested - in this case you will loop
- forever if you rely on poll to tell you what to do.
- This isn't a trivial problem, but applications can deal with
- this - by using congestion notifications, and by checking for
- ENOBUFS errors returned by sendmsg.
- setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
- This allows the application to discard all messages queued to a
- specific destination on this particular socket.
- This allows the application to cancel outstanding messages if
- it detects a timeout. For instance, if it tried to send a message,
- and the remote host is unreachable, RDS will keep trying forever.
- The application may decide it's not worth it, and cancel the
- operation. In this case, it would use RDS_CANCEL_SENT_TO to
- nuke any pending messages.
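A cancel request of this kind might look like the sketch below. The helper cancel_sends_to() is a hypothetical name. The fallback values for SOL_RDS (276) and RDS_CANCEL_SENT_TO (1) are assumptions meant to match <linux/rds.h> and should be checked against the header on the target system.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Assumed values, used only when the system headers do not provide them. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_CANCEL_SENT_TO
#define RDS_CANCEL_SENT_TO 1
#endif

/*
 * Hypothetical helper: discard every message this socket still has
 * queued for ip:port. Returns 0 on success, -1 with errno set.
 */
int cancel_sends_to(int fd, const char *ip, uint16_t port)
{
	struct sockaddr_in dest;

	memset(&dest, 0, sizeof(dest));
	dest.sin_family = AF_INET;
	dest.sin_port = htons(port);
	if (inet_pton(AF_INET, ip, &dest.sin_addr) != 1) {
		errno = EINVAL;
		return -1;
	}

	/* The option value is the destination address itself. */
	return setsockopt(fd, SOL_RDS, RDS_CANCEL_SENT_TO,
			  &dest, sizeof(dest));
}
```

An application would usually call this from its own timeout handling, after deciding that the peer is not worth waiting for.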
- setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
- getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
- Set or read an integer defining the underlying
- encapsulating transport to be used for RDS packets on the
- socket. When setting the option, the integer argument may be
- one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the
- value, RDS_TRANS_NONE will be returned on an unbound socket.
- This socket option may only be set exactly once on the socket,
- prior to binding it via the bind(2) system call. Attempts to
- set SO_RDS_TRANSPORT on a socket for which the transport has
- been previously attached explicitly (by SO_RDS_TRANSPORT) or
- implicitly (via bind(2)) will return an error of EOPNOTSUPP.
- An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will
- always return EINVAL.
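Selecting a transport before bind(2) could be sketched as below. The helper select_transport() is hypothetical. The numeric fallbacks (SOL_RDS 276, SO_RDS_TRANSPORT 8, RDS_TRANS_IB 0, RDS_TRANS_TCP 2, RDS_TRANS_NONE ~0) are assumptions meant to match <linux/rds.h> and should be verified there.

```c
#include <errno.h>
#include <sys/socket.h>

/* Assumed values, used only when the system headers do not provide them. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef SO_RDS_TRANSPORT
#define SO_RDS_TRANSPORT 8
#endif
#ifndef RDS_TRANS_IB
#define RDS_TRANS_IB 0
#endif
#ifndef RDS_TRANS_TCP
#define RDS_TRANS_TCP 2
#endif
#ifndef RDS_TRANS_NONE
#define RDS_TRANS_NONE (~0)
#endif

/*
 * Hypothetical helper: pick the transport for an unbound RDS socket.
 * Returns 0 on success, -1 with errno set. Per the text above, the
 * kernel reports EOPNOTSUPP if a transport is already attached.
 */
int select_transport(int fd, int transport)
{
	/* RDS_TRANS_NONE is never a valid choice; fail early like the kernel. */
	if (transport == RDS_TRANS_NONE) {
		errno = EINVAL;
		return -1;
	}
	return setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT,
			  &transport, sizeof(transport));
}
```

Because the option can be set only once, a caller that sees EOPNOTSUPP can read the current choice back with getsockopt() rather than retrying.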
- RDMA for RDS
- ============
- see rds-rdma(7) manpage (available in rds-tools)
- Congestion Notifications
- ========================
- see rds(7) manpage
- RDS Protocol
- ============
- Message header
- The message header is a 'struct rds_header' (see rds.h):
- Fields:
- h_sequence:
- per-packet sequence number
- h_ack:
- piggybacked acknowledgment of last packet received
- h_len:
- length of data, not including header
- h_sport:
- source port
- h_dport:
- destination port
- h_flags:
- CONG_BITMAP - this is a congestion update bitmap
- ACK_REQUIRED - receiver must ack this packet
- RETRANSMITTED - packet has previously been sent
- h_credit:
- indicate to other end of connection that
- it has more credits available (i.e. there is
- more send room)
- h_padding[4]:
- unused, for future use
- h_csum:
- header checksum
- h_exthdr:
- optional data can be passed here. This is currently used for
- passing RDMA-related information.
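For orientation, the field list above can be pictured as the following user-space struct. This is an illustrative sketch, not the kernel definition: the field widths, the 16-byte extension area, the resulting 48-byte size and the flag bit values are assumptions, and rds.h remains the authoritative source.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_EXTHDR_SPACE 16      /* assumed size of h_exthdr */

/* Assumed flag bit values for h_flags. */
#define EXAMPLE_FLAG_CONG_BITMAP   0x01
#define EXAMPLE_FLAG_ACK_REQUIRED  0x02
#define EXAMPLE_FLAG_RETRANSMITTED 0x04

/* Illustrative layout; multi-byte fields are big-endian on the wire. */
struct example_rds_header {
	uint64_t h_sequence;   /* per-packet sequence number */
	uint64_t h_ack;        /* piggybacked ack of last packet received */
	uint32_t h_len;        /* payload length, header excluded */
	uint16_t h_sport;      /* source port */
	uint16_t h_dport;      /* destination port */
	uint8_t  h_flags;      /* EXAMPLE_FLAG_* bits */
	uint8_t  h_credit;     /* send credits granted to the peer */
	uint8_t  h_padding[4]; /* unused */
	uint16_t h_csum;       /* header checksum */
	uint8_t  h_exthdr[EXAMPLE_EXTHDR_SPACE]; /* optional extension data */
};

/* The fields pack without compiler padding under the assumed widths. */
_Static_assert(sizeof(struct example_rds_header) == 48,
	       "unexpected header size");

/* True if the sender asked the receiver to ack this packet. */
static inline bool example_ack_required(const struct example_rds_header *h)
{
	return (h->h_flags & EXAMPLE_FLAG_ACK_REQUIRED) != 0;
}
```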
- ACK and retransmit handling
- One might think that with reliable IB connections you wouldn't need
- to ack messages that have been received. The problem is that IB
- hardware generates an ack message before it has DMAed the message
- into memory. This creates a potential message loss if the HCA is
- disabled for any reason between when it sends the ack and before
- the message is DMAed and processed. This is only a potential issue
- if another HCA is available for fail-over.
- Sending an ack immediately would allow the sender to free the sent
- message from their send queue quickly, but could cause excessive
- traffic to be used for acks. RDS piggybacks acks on sent data
- packets. Ack-only packets are reduced by only allowing one to be
- in flight at a time, and by the sender only asking for acks when
- its send buffers start to fill up. All retransmissions are also
- acked.
- Flow Control
- RDS's IB transport uses a credit-based mechanism to verify that
- there is space in the peer's receive buffers for more data. This
- eliminates the need for hardware retries on the connection.
- Congestion
- Messages waiting in the receive queue on the receiving socket
- are accounted against the socket's SO_RCVBUF option value. Only
- the payload bytes in the message are accounted for. If the
- number of bytes queued equals or exceeds rcvbuf then the socket
- is congested. All sends attempted to this socket's address
- should block or return -EWOULDBLOCK.
- Applications are expected to be reasonably tuned such that this
- situation very rarely occurs. An application that encounters this
- "back-pressure" is considered to have a bug.
- This is implemented by having each node maintain bitmaps which
- indicate which ports on bound addresses are congested. As the
- bitmap changes it is sent through all the connections which
- terminate in the local address of the bitmap which changed.
- The bitmaps are allocated as connections are brought up. This
- avoids allocation in the interrupt handling path which queues
- messages on sockets. The dense bitmaps let transports send the
- entire bitmap on any bitmap change reasonably efficiently. This
- is much easier to implement than some finer-grained
- communication of per-port congestion. The sender does a very
- inexpensive bit test to test if the port it's about to send to
- is congested or not.
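The per-port bit test mentioned above can be sketched with a flat bitmap, one bit per 16-bit port. This is a simplified user-space illustration. The names are hypothetical, and the kernel's struct rds_cong_map uses its own storage layout and bit operations.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define EXAMPLE_CONG_PORTS 65536u   /* one bit per possible RDS port */

/* Hypothetical flat congestion map: 65536 bits = 8 KiB. */
struct example_cong_map {
	uint8_t bits[EXAMPLE_CONG_PORTS / 8];
};

/* Mark a port congested (receive queue reached its limit). */
static inline void example_cong_set(struct example_cong_map *m, uint16_t port)
{
	m->bits[port >> 3] |= (uint8_t)(1u << (port & 7));
}

/* Mark a port uncongested again (receive queue drained). */
static inline void example_cong_clear(struct example_cong_map *m, uint16_t port)
{
	m->bits[port >> 3] &= (uint8_t)~(1u << (port & 7));
}

/* Cheap test a sender performs before queuing a message to a port. */
static inline bool example_cong_test(const struct example_cong_map *m,
				     uint16_t port)
{
	return (m->bits[port >> 3] >> (port & 7)) & 1u;
}
```

Because the map is dense and of fixed size, sending the whole map on every change is a single contiguous copy, which is the trade-off the text describes.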
- RDS Transport Layer
- ===================
- As mentioned above, RDS is not IB-specific. Its code is divided
- into a general RDS layer and a transport layer.
- The general layer handles the socket API, congestion handling,
- loopback, stats, usermem pinning, and the connection state machine.
- The transport layer handles the details of the transport. The IB
- transport, for example, handles all the queue pairs, work requests,
- CM event handlers, and other Infiniband details.
- RDS Kernel Structures
- =====================
- struct rds_message
- aka possibly "rds_outgoing", the generic RDS layer copies data to
- be sent and sets header fields as needed, based on the socket API.
- This is then queued for the individual connection and sent by the
- connection's transport.
- struct rds_incoming
- a generic struct referring to incoming data that can be handed from
- the transport to the general code and queued by the general code
- while the socket is awoken. It is then passed back to the transport
- code to handle the actual copy-to-user.
- struct rds_socket
- per-socket information
- struct rds_connection
- per-connection information
- struct rds_transport
- pointers to transport-specific functions
- struct rds_statistics
- non-transport-specific statistics
- struct rds_cong_map
- wraps the raw congestion bitmap, contains rbnode, waitq, etc.
- Connection management
- =====================
- Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
- ERROR states.
- The first time an attempt is made by an RDS socket to send data to
- a node, a connection is allocated and connected. That connection is
- then maintained forever -- if there are transport errors, the
- connection will be dropped and re-established.
- Dropping a connection while packets are queued will cause queued or
- partially-sent datagrams to be retransmitted when the connection is
- re-established.
- The send path
- =============
- rds_sendmsg()
- struct rds_message built from incoming data
- CMSGs parsed (e.g. RDMA ops)
- transport connection alloced and connected if not already
- rds_message placed on send queue
- send worker awoken
- rds_send_worker()
- calls rds_send_xmit() until queue is empty
- rds_send_xmit()
- transmits congestion map if one is pending
- may set ACK_REQUIRED
- calls transport to send either non-RDMA or RDMA message
- (RDMA ops never retransmitted)
- rds_ib_xmit()
- allocs work requests from send ring
- adds any new send credits available to peer (h_credit)
- maps the rds_message's sg list
- piggybacks ack
- populates work requests
- post send to connection's queue pair
- The recv path
- =============
- rds_ib_recv_cq_comp_handler()
- looks at write completions
- unmaps recv buffer from device
- no errors, call rds_ib_process_recv()
- refill recv ring
- rds_ib_process_recv()
- validate header checksum
- copy header to rds_ib_incoming struct if start of a new datagram
- add to ibinc's fraglist
- if completed datagram:
- update cong map if datagram was cong update
- call rds_recv_incoming() otherwise
- note if ack is required
- rds_recv_incoming()
- drop duplicate packets
- respond to pings
- find the sock associated with this datagram
- add to sock queue
- wake up sock
- do some congestion calculations
- rds_recvmsg()
- copy data into user iovec
- handle CMSGs
- return to application
- Multipath RDS (mprds)
- =====================
- Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
- (though the concept can be extended to other transports). The classical
- implementation of RDS-over-TCP is implemented by demultiplexing multiple
- PF_RDS sockets between any 2 endpoints (where endpoint == [IP address,
- port]) over a single TCP socket between the 2 IP addresses involved. This
- has the limitation that it ends up funneling multiple RDS flows over a
- single TCP flow, thus it
- (a) is upper-bounded to the single-flow bandwidth, and
- (b) suffers from head-of-line blocking for all the RDS sockets.
- Better throughput (for a fixed small packet size, MTU) can be achieved
- by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
- RDS (mprds). Each such TCP/IP flow constitutes a path for the rds/tcp
- connection. RDS sockets will be attached to a path based on some hash
- (e.g., of local address and RDS port number) and packets for that RDS
- socket will be sent over the attached path using TCP to segment/reassemble
- RDS datagrams on that path.
- Multipathed RDS is implemented by splitting the struct rds_connection into
- a common (to all paths) part, and a per-path struct rds_conn_path. All
- I/O workqs and reconnect threads are driven from the rds_conn_path.
- Transports such as TCP that are multipath capable may then set up a
- TCP socket per rds_conn_path, and this is managed by the transport via
- the transport private cp_transport_data pointer.
- Transports announce themselves as multipath capable by setting the
- t_mp_capable bit during registration with the rds core module. When the
- transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
- across multiple paths. The outgoing hash is computed based on the
- local address and port that the PF_RDS socket is bound to.
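The idea of hashing a bound socket onto one of several paths can be illustrated as below. The mixing function is a stand-in chosen for brevity and is not the hash the kernel uses. The function name and parameters are hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical path selection: map the (local address, local port)
 * pair a PF_RDS socket is bound to onto a path index in [0, npaths).
 * All messages from one socket therefore stay on one path.
 */
unsigned int example_pick_path(uint32_t laddr, uint16_t lport,
			       unsigned int npaths)
{
	uint32_t h;

	if (npaths <= 1)
		return 0;   /* single-path connection: nothing to choose */

	/* Simple integer mixing; an assumption, not the kernel's hash. */
	h = laddr ^ ((uint32_t)lport * 0x9e3779b1u);
	h ^= h >> 16;
	h *= 0x85ebca6bu;
	h ^= h >> 13;

	return h % npaths;
}
```

Since the input is the socket's own bound address and port, the result is stable for the lifetime of the socket. That keeps datagrams from a single socket ordered on one TCP flow.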
- Additionally, even if the transport is MP capable, we may be
- peering with some node that does not support mprds, or supports
- a different number of paths. As a result, the peering nodes need
- to agree on the number of paths to be used for the connection.
- This is done by sending out a control packet exchange before the
- first data packet. The control packet exchange must have completed
- prior to outgoing hash completion in rds_sendmsg() when the transport
- is multipath capable.
- The control packet is an RDS ping packet (i.e., packet to rds dest
- port 0) with the ping packet having a rds extension header option of
- type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the
- number of paths supported by the sender. The "probe" ping packet will
- get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>).
- The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
- be able to compute the min(sender_paths, rcvr_paths). The pong
- sent in response to a probe-ping should contain the rcvr's npaths
- when the rcvr is mprds-capable.
- If the rcvr is not mprds-capable, the exthdr in the ping will be
- ignored. In this case the pong will not have any exthdrs, so the sender
- of the probe-ping can default to single-path mprds.
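The outcome of the probe exchange described above reduces to a small decision, sketched here. The function example_negotiate_npaths() and its parameters are hypothetical. It models only the result of the exchange, not the packet handling.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the npaths agreement.
 *   local_npaths    - number of paths this node supports
 *   peer_has_exthdr - true if the peer's ping/pong carried an
 *                     RDS_EXTHDR_NPATHS extension header
 *   peer_npaths     - value carried in that header, if present
 * Returns the number of paths both sides will use.
 */
unsigned int example_negotiate_npaths(unsigned int local_npaths,
				      bool peer_has_exthdr,
				      unsigned int peer_npaths)
{
	/* A peer without mprds support sends no exthdr: use a single path. */
	if (!peer_has_exthdr || peer_npaths == 0 || local_npaths == 0)
		return 1;

	/* Otherwise both sides settle on min(sender_paths, rcvr_paths). */
	return local_npaths < peer_npaths ? local_npaths : peer_npaths;
}
```

Both ends reach the same answer independently, because each one applies the same minimum to the same pair of values.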