Multipath TCP (MPTCP) extends traditional TCP to allow reliable end-to-end delivery over multiple simultaneous TCP paths, and is coming as a tech preview on Red Hat Enterprise Linux 8.3. This is the first of two articles for users who want to practice with the new MPTCP functionality on a live system. In this first part, we show you how to enable the protocol in the kernel and let client and server applications use the MPTCP sockets. Then, we run diagnostics on the kernel in a sample test network, where endpoints are using a single subflow.
Multipath TCP in Red Hat Enterprise Linux 8
Multipath TCP is a relatively new extension for the Transmission Control Protocol (TCP), and its official Linux implementation is even more recent. Early users might want to know what to expect in RHEL 8.3. In this article, you will learn how to:
- Enable the Multipath TCP protocol in the kernel.
- Let an application open an IPPROTO_MPTCP socket.
- Use tcpdump to inspect MPTCP options with live traffic.
- Inspect the subflow status with ss.
Enabling Multipath TCP in the kernel
Multipath TCP registers as an upper-layer protocol (ULP) for TCP. Users can ensure that mptcp is available in the kernel by checking the available ULPs:
# sysctl net.ipv4.tcp_available_ulp
net.ipv4.tcp_available_ulp = espintcp mptcp
Unlike upstream Linux, MPTCP is disabled in the default Red Hat Enterprise Linux (RHEL) 8.3 runtime. To allow applications to create MPTCP sockets, system administrators need to enable it with the sysctl command:
# sysctl -w net.mptcp.enabled=1
# sysctl net.mptcp.enabled
net.mptcp.enabled = 1
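Both checks can also be scripted. The following is a minimal Python sketch, assuming the usual mapping of sysctl names to files under /proc/sys; the helper names (read_sysctl, mptcp_ulp_available, mptcp_enabled) are hypothetical and only serve the illustration.

```python
from pathlib import Path

# sysctl names map to files under /proc/sys (dots become slashes).
ULP_PATH = "/proc/sys/net/ipv4/tcp_available_ulp"
ENABLED_PATH = "/proc/sys/net/mptcp/enabled"


def read_sysctl(path):
    """Return the stripped content of a sysctl file, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def mptcp_ulp_available(path=ULP_PATH):
    """True if 'mptcp' is listed among the available TCP ULPs."""
    value = read_sysctl(path)
    return value is not None and "mptcp" in value.split()


def mptcp_enabled(path=ENABLED_PATH):
    """True or False for net.mptcp.enabled, None if the entry is missing."""
    value = read_sysctl(path)
    return None if value is None else value == "1"


if __name__ == "__main__":
    print("mptcp ULP available:", mptcp_ulp_available())
    print("net.mptcp.enabled:", mptcp_enabled())
```

A result of None from mptcp_enabled() suggests that the running kernel does not expose the MPTCP sysctl at all, which is different from MPTCP being present but disabled.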
Preparing the system for its first MPTCP socket
With MPTCP enabled in the RHEL 8.3 kernel, user-space programs have a new protocol available for the socket system call. There are two potential use cases for the new protocol.
Native MPTCP applications
Applications supporting MPTCP natively can open a SOCK_STREAM socket specifying IPPROTO_MPTCP as the protocol and AF_INET or AF_INET6 as the address family:
fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
After the application creates a socket, the kernel will operate one or more TCP subflows that carry the standard MPTCP option (TCP option kind 30, as assigned by IANA). Client and server semantics are the same as those used by a regular TCP socket (meaning that they will use bind(), listen(), connect(), and accept()).
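The same socket can be opened from a scripting language. The sketch below is one possible way to do it in Python, assuming a Linux host; the helper name open_stream_socket and the fallback to plain TCP are our own illustrative choices, not something the kernel or the article prescribes. Python 3.10 and later export socket.IPPROTO_MPTCP; on older versions the sketch falls back to 262, the value defined in the Linux headers.

```python
import socket

# Python >= 3.10 defines socket.IPPROTO_MPTCP on Linux; 262 is the
# value used by the Linux headers and serves as a fallback here.
IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)


def open_stream_socket(family=socket.AF_INET):
    """Return (sock, is_mptcp): try MPTCP first, then fall back to TCP.

    The fallback on any OSError is an assumption made for this sketch:
    the call can fail when the kernel lacks MPTCP or when
    net.mptcp.enabled is 0.
    """
    try:
        return socket.socket(family, socket.SOCK_STREAM, IPPROTO_MPTCP), True
    except OSError:
        return socket.socket(family, socket.SOCK_STREAM, socket.IPPROTO_TCP), False


if __name__ == "__main__":
    sock, is_mptcp = open_stream_socket()
    print("MPTCP socket" if is_mptcp else "plain TCP socket (fallback)")
    # From here on, bind(), listen(), connect() and accept() are used
    # exactly as with a regular TCP socket.
    sock.close()
```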
Legacy TCP applications converted to MPTCP
Most user-space applications have no knowledge of IPPROTO_MPTCP, nor would it be realistic to patch and rebuild all of them to add native support for MPTCP. Because of this, the community opted for using an eBPF program that wraps the socket() system call and overrides the value of protocol.
In RHEL 8.3, this program will run on control groups (cgroups) so that system administrators can specify which applications should run MPTCP while others continue with TCP. We will discuss the eBPF helper upstream in the coming weeks, but we want to support early RHEL 8.3 users who want to try their own applications with MPTCP.
You can use a systemtap script as a workaround to intercept calls to __sys_socket() in the kernel. You can then allow a kernel probe to replace IPPROTO_TCP with IPPROTO_MPTCP. You will need to add packages to install a probe in the kernel with stap. You'll also use the good old ncat tool from the nmap-ncat package to run the client and the server:
# dnf -y install \
> kernel-headers \
> kernel-devel \
> kernel-debuginfo \
> kernel-debuginfo-common_x86_64 \
> systemtap-client \
> systemtap-client-devel \
> nmap-ncat
Use the following command to start the systemtap script:
# stap -vg mptcp.stap
Protocol smoke test: A single subflow using ncat
The test network topology shown in Figure 1 consists of a client and a server that run in separate namespaces, connected through a virtual Ethernet device (veth).
Adding more IP addresses will simulate multiple L4 paths between the endpoints. First, the server opens a passive socket, listening on a TCP port:
# ncat -l 192.0.2.1 4321
Then, the client connects to the server:
# ncat 192.0.2.1 4321
From a functional point of view, the interaction is the same as using ncat with regular TCP: when the user writes a line in the client's standard input, the server displays that line in the standard output. Similarly, typing a line in the server's standard input results in transmitting it back to the client's standard output. In this example, we use ncat to send a "hello world (1)\n" message to the server. The client waits for a second, sends "hello world (2)\n", and then closes the connection.
Note: Current Linux MPTCP does not support mixed IPv4/IPv6 addresses. Therefore, all addresses involved in client/server connectivity must belong to the same family.
Capturing traffic and examining it with tcpdump
The Red Hat Enterprise Linux 8 version of tcpdump doesn't yet support dissecting MPTCP v1 suboptions in TCP headers. We can overcome this problem by building a binary from the upstream repository, or by replacing the packaged tool with a more recent binary. With either of those changes, it's possible to inspect the MPTCP suboptions.
Three-way handshake: The MP_CAPABLE suboption
During the three-way handshake, the client and server each send a 64-bit key in the MP_CAPABLE suboption. The keys are visible in the output of tcpdump in the braces ({}) after mptcp capable. These keys are used later to compute the DSN/DACK and the token. The MP_CAPABLE suboption that originates in the client is also present after a successful connection setup (packet 4 in the capture), and it stays present until the server explicitly acknowledges it using a data sequence signal (DSS) suboption:
# tcpdump -#tnnr capture.pcap
1 IP 192.0.2.2.44176 > 192.0.2.1.4321: Flags [S], seq 1721499445, win 29200, options [mss 1460,sackOK,TS val 33385784 ecr 0,nop,wscale 7,mptcp capable v1], length 0
2 IP 192.0.2.1.4321 > 192.0.2.2.44176: Flags [S.], seq 3341831007, ack 1721499446, win 28960, options [mss 1460,sackOK,TS val 4061152149 ecr 33385784,nop,wscale 7,mptcp capable v1 {0xbb206e3023b47a2d}], length 0
3 IP 192.0.2.2.44176 > 192.0.2.1.4321: Flags [.], ack 1, win 229, options [nop,nop,TS val 33385785 ecr 4061152149,mptcp capable v1 {0x41923206b75835f5,0xbb206e3023b47a2d}], length 0
4 IP 192.0.2.2.44176 > 192.0.2.1.4321: Flags [P.], seq 1:17, ack 1, win 229, options [nop,nop,TS val 33385785 ecr 4061152149,mptcp capable v1 {0x41923206b75835f5,0xbb206e3023b47a2d},nop,nop], length 16
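To make the key exchange easier to follow, here is a small Python sketch that pulls the keys out of the option strings of packets 1 to 3. The regular expression and the helper name mp_capable_keys are our own; they assume the text format printed by the tcpdump build used for this capture. The interpretation of packet 3 (sender's key first, then the receiver's key) follows RFC 8684.

```python
import re

# Option strings copied from packets 1-3 of the capture.
OPTIONS = {
    1: "mss 1460,sackOK,TS val 33385784 ecr 0,nop,wscale 7,mptcp capable v1",
    2: "mss 1460,sackOK,TS val 4061152149 ecr 33385784,nop,wscale 7,"
       "mptcp capable v1 {0xbb206e3023b47a2d}",
    3: "nop,nop,TS val 33385785 ecr 4061152149,"
       "mptcp capable v1 {0x41923206b75835f5,0xbb206e3023b47a2d}",
}

# Matches "mptcp capable v1" with an optional "{key[,key]}" part.
MP_CAPABLE = re.compile(r"mptcp capable v(\d+)(?: \{([^}]*)\})?")


def mp_capable_keys(options):
    """Return the list of keys in an MP_CAPABLE option, or None if absent."""
    match = MP_CAPABLE.search(options)
    if match is None:
        return None
    keys = match.group(2)
    return [int(key, 16) for key in keys.split(",")] if keys else []


syn_keys = mp_capable_keys(OPTIONS[1])      # SYN: no key yet
server_key = mp_capable_keys(OPTIONS[2])[0]  # SYN/ACK: server key
client_key, echoed_key = mp_capable_keys(OPTIONS[3])  # ACK: both keys

print(f"client key: {client_key:#018x}")
print(f"server key: {server_key:#018x}")
print("server key echoed by client:", echoed_key == server_key)
```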
MPTCP-level sequence numbers: The DSS suboption
After that, TCP segments will carry the DSS suboption that contains MPTCP sequence numbers. More specifically, we can observe the data sequence number (DSN) and data acknowledgment (DACK) values, as shown here:
5 IP 192.0.2.1.4321 > 192.0.2.2.44176: Flags [.], ack 17, win 227, options [nop,nop,TS val 4061152149 ecr 33385785,mptcp dss ack 1711754507747579648], length 0
6 IP 192.0.2.2.44176 > 192.0.2.1.4321: Flags [P.], seq 17:33, ack 1, win 229, options [nop,nop,TS val 33386778 ecr 4061152149,mptcp dss ack 1331650533424046587 seq 1711754507747579648 subseq 17 len 16,nop,nop], length 16
7 IP 192.0.2.1.4321 > 192.0.2.2.44176: Flags [.], ack 33, win 227, options [nop,nop,TS val 4061153142 ecr 33386778,mptcp dss ack 1711754507747579664], length 0
With a single subflow, DSN and DACK increase by the same amount as the TCP sequence and acknowledgment numbers. When the connection ends, the subflows are closed with a FIN packet, just like regular TCP flows would be. Because the application also closes the MPTCP socket, the data fin bit is set in the DSS suboption, as shown here:
8 IP 192.0.2.2.44176 > 192.0.2.1.4321: Flags [F.], seq 33, ack 1, win 229, options [nop,nop,TS val 33387798 ecr 4061153142,mptcp dss fin ack 1331650533424046587 seq 1711754507747579664 subseq 0 len 1,nop,nop], length 0
9 IP 192.0.2.1.4321 > 192.0.2.2.44176: Flags [.], ack 34, win 227, options [nop,nop,TS val 4061154203 ecr 33387798,mptcp dss ack 1711754507747579664], length 0
10 IP 192.0.2.1.4321 > 192.0.2.2.44176: Flags [F.], seq 1, ack 34, win 227, options [nop,nop,TS val 4061162156 ecr 33387798,mptcp dss fin ack 1711754507747579664 seq 1331650533424046587 subseq 0 len 1,nop,nop], length 0
11 IP 192.0.2.2.44176 > 192.0.2.1.4321: Flags [.], ack 2, win 229, options [nop,nop,TS val 33395793 ecr 4061162156,mptcp dss ack 1331650533424046587], length 0
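The single-subflow arithmetic can be checked directly against the numbers in the capture. The short Python sketch below copies the relevant values from packets 5 to 8; the variable names are ours.

```python
# Each ncat line in the test is 16 bytes long, matching "length 16".
MESSAGE = b"hello world (1)\n"

# (TCP ack, DACK) carried by the server's pure ACKs, packets 5 and 7.
ACKS = {
    5: (17, 1711754507747579648),
    7: (33, 1711754507747579664),
}

# DSS mapping carried by packet 6: DSN, relative subflow sequence, length.
PKT6 = {"dsn": 1711754507747579648, "subseq": 17, "len": 16}

# DSN of the DATA_FIN mapping in packet 8.
PKT8_FIN_DSN = 1711754507747579664

tcp_delta = ACKS[7][0] - ACKS[5][0]
dack_delta = (ACKS[7][1] - ACKS[5][1]) % 2**64  # DSNs wrap at 64 bits

# TCP ack and DACK advance by the same amount: the payload length.
assert tcp_delta == dack_delta == PKT6["len"] == len(MESSAGE)
# The DACK in packet 7 is the first DSN not yet received...
assert ACKS[7][1] == PKT6["dsn"] + PKT6["len"]
# ...which is also where the client places its DATA_FIN.
assert PKT8_FIN_DSN == ACKS[7][1]

print(tcp_delta, dack_delta)  # prints: 16 16
```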
Inspecting subflow data with ss
Because MPTCP uses TCP as a transport protocol, network administrators can query the kernel to retrieve information on the TCP connections that are being used by the main MPTCP socket. In this example, we're running ss on the client, filtering on the server's listening port, where information relevant to MPTCP can be read after tcp-ulp-mptcp:
# ss -nti '( dport :4321 )' dst 192.0.2.1
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB 0 0 192.0.2.2:44176 192.0.2.1:4321
cubic wscale:7,7 [...] bytes_sent:32 bytes_acked:33 [...]
tcp-ulp-mptcp flags:Mmec token:0000(id:0)/768f615c(id:0) seq:127af91ad1b321fb sfseq:1 ssnoff:c7304b5f maplen:0
ss command output explained
The tcp-ulp-mptcp line in the output above was printed by ss in the client namespace immediately after the transmission of packet 6 in the previous section. Its fields are:
- Each value of token is the truncated hash of the remote peer's key (in RFC 8684, the most significant 32 bits of its SHA-256 digest), which the client receives during the three-way handshake. Further MP_JOIN SYN packets will use that value to prove that they have not been spoofed. The id is the subflow identifier as specified in the RFC. For non-MP_JOIN sockets, only the local token and ID are available.
- flags is a bitmask containing information on the subflow state. For instance, M/m records the presence of the MP_CAPABLE suboption in the three-way handshake. The c means that the client received the server's key (that is, it acknowledged the SYN/ACK), while e means that the exchange of both MPTCP keys is complete.
- seq denotes the next MPTCP sequence number that the endpoint expects on reception, or, equivalently, the DACK value for the next transmitted packet.
- sfseq is the subflow sequence number, meaning that it is the current TCP ACK value for this subflow.
- ssnoff is the current difference between the TCP sequence number and the MPTCP sequence number for this subflow. If you are using a single subflow, this value will not change during the connection. If you are using more than one subflow to simultaneously carry data segments, then this value can increase or decrease depending on the path capacity.
- maplen indicates how many bytes are left to fill the current DSS map.
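For scripted monitoring, the fields described above can be split out of the ss line. The following Python sketch is a hypothetical parser written for this article's output format only. It assumes that seq, sfseq, ssnoff, and maplen are printed in hexadecimal (with the values shown here, sfseq and maplen read the same in either base), and it keeps the two token/id pairs in printed order without interpreting them.

```python
import re

# The MPTCP line copied from the ss output.
SS_LINE = (
    "tcp-ulp-mptcp flags:Mmec token:0000(id:0)/768f615c(id:0) "
    "seq:127af91ad1b321fb sfseq:1 ssnoff:c7304b5f maplen:0"
)

TOKEN_PAIR = re.compile(r"([0-9a-f]+)\(id:(\d+)\)")


def parse_mptcp_ulp(line):
    """Split a 'tcp-ulp-mptcp ...' line into a dictionary of fields."""
    words = line.split()
    if not words or words[0] != "tcp-ulp-mptcp":
        raise ValueError("not a tcp-ulp-mptcp line")
    raw = dict(word.split(":", 1) for word in words[1:])
    return {
        "flags": list(raw["flags"]),
        # (token, id) pairs in the order printed by ss.
        "tokens": [(int(tok, 16), int(ident))
                   for tok, ident in TOKEN_PAIR.findall(raw["token"])],
        # Assumption: these counters are printed in hexadecimal.
        "seq": int(raw["seq"], 16),
        "sfseq": int(raw["sfseq"], 16),
        "ssnoff": int(raw["ssnoff"], 16),
        "maplen": int(raw["maplen"], 16),
    }


info = parse_mptcp_ulp(SS_LINE)
print(info["flags"], info["tokens"])
print(info["seq"], info["sfseq"], info["ssnoff"], info["maplen"])
```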
Note that we can compute the value of seq by starting from the server key in the SYN/ACK (which is packet 2 of the capture): the server's Initial Data Sequence Number (IDSN) is obtained by truncating sha256(ntohll(bb206e3023b47a2d)) to the least-significant 64 bits, as specified by RFC 8684.
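The derivation can be sketched in a few lines of Python. This is a sketch under stated assumptions, not a reference implementation: it assumes that the key printed by tcpdump is already in network byte order, so its hex digits can be hashed as-is, and it takes the IDSN from the last 8 bytes of the SHA-256 digest and the token from the first 4 bytes, following RFC 8684. The helper names are ours. No expected output is listed, because the digest was not reproduced for this article; the printed values are meant to be compared with the seq and token fields reported by ss.

```python
import hashlib

# Server key from the SYN/ACK (packet 2), as printed by tcpdump.
SERVER_KEY_HEX = "bb206e3023b47a2d"


def key_digest(key_hex):
    """SHA-256 of the key bytes, taken in the printed (network) order."""
    return hashlib.sha256(bytes.fromhex(key_hex)).digest()


def idsn_from_key(key_hex):
    """Least-significant 64 bits of the digest, as an integer."""
    return int.from_bytes(key_digest(key_hex)[-8:], "big")


def token_from_key(key_hex):
    """Most-significant 32 bits of the digest, as an integer."""
    return int.from_bytes(key_digest(key_hex)[:4], "big")


if __name__ == "__main__":
    print(f"IDSN:  {idsn_from_key(SERVER_KEY_HEX):016x}")
    print(f"token: {token_from_key(SERVER_KEY_HEX):08x}")
```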
Also note that, because the client is not receiving any data from the server, seq remains equal to the IDSN throughout the connection's lifetime. For the same reason, the value of sfseq is constantly equal to 1 in the example. We can see the IDSN in the DSN of packet 10 and in the DACK of packets 6 and 8 (in decimal format: 1331650533424046587), as well as in the output of ss (in hex format: 127af91ad1b321fb). Similarly, in this example the SSN offset (c7304b5f in the ss output) is constantly equal to the initial TCP sequence number (3341831007 in the SYN/ACK, packet 2 of the capture output).
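Because tcpdump prints these numbers in decimal and ss prints them in hexadecimal, a quick conversion confirms that they are the same values. The Python snippet below only restates numbers that already appear in the capture and in the ss output; the constant names are ours.

```python
# Values as printed by ss (hexadecimal).
SS_SEQ_HEX = "127af91ad1b321fb"
SS_SSNOFF_HEX = "c7304b5f"

# Values as printed by tcpdump (decimal).
CAPTURE_DACK = 1331650533424046587   # DACK in packets 6 and 8
SYNACK_TCP_SEQ = 3341831007          # TCP seq of the SYN/ACK, packet 2

# seq in ss matches the DACK sent by the client.
assert int(SS_SEQ_HEX, 16) == CAPTURE_DACK
# ssnoff in ss matches the server's initial TCP sequence number.
assert int(SS_SSNOFF_HEX, 16) == SYNACK_TCP_SEQ

print(int(SS_SEQ_HEX, 16))     # prints: 1331650533424046587
print(int(SS_SSNOFF_HEX, 16))  # prints: 3341831007
```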
Conclusion and what's next
In realistic scenarios, MPTCP will generally use more than one subflow. In this way, sockets can preserve connectivity even after an event causes a failure in one of the L4 paths. In the next article, we will show you how to use iproute2 to configure multiple TCP paths on RHEL 8.3, and how to watch ncat doing multipath for real.