A 50 Gbps Connection With Multipath TCP
First time accepted submitter Olivier Bonaventure writes "The TCP protocol is closely coupled with the underlying IP protocol. Once a TCP connection has been established through one IP address, the other packets of the connection must be sent from this address. This makes mobility and load balancing difficult. Multipath TCP is a new extension that solves these old problems by decoupling TCP from the underlying IP. A Multipath TCP connection can send packets over several interfaces/addresses simultaneously while remaining backward compatible with existing TCP applications. Multipath TCP has several use cases, including smartphones that can use both WiFi and 3G, or servers that can pool multiple high-speed interfaces. Christoph Paasch, Gregory Detal and their colleagues who develop the implementation of Multipath TCP in the Linux kernel have achieved 50 Gbps for a single TCP connection [note: link has source code and technical details] by pooling together six 10 Gbps interfaces."
Doesn't SCTP provide for these scenarios (and many more)?
Matt
RFC 6182 if anyone is interested.
-- I have a private email server in my basement.
without every user making 3 connects to view their friend's cat picture.
Yes, this could have some good uses. No, don't release it to the drooling masses so 12 year old Tiffany can stream Justin Bieber videos while texting her buddy sitting right next to her "faster".
I remember getting dual-channel ISDN, which was 128k, but it was split into two 56k data channels and a 16k control channel. You could never download from any one site faster than 56k because a connection couldn't straddle more than one data channel.
Still, I could play EQ and surf at the same time on a different computer, a novel thing you young punks take for granted. Get off my lawn!
(-1: Post disagrees with my already-settled worldview) is not a valid mod option.
Wouldn't six 10 gig connections add up to one 60 gig connection instead of one 50 gig connection?
One of the barriers to this technology will be API support. Many APIs provide the IP address (on both sides) with the connection object. Implementors will have to make a choice about which IP to expose and remain backward compatible.
yes, I know etherchannel load balancing ... but maybe that would be easier to "fix" than inventing something that mostly exists. ... 10 years ago.
Are you also aware of today's mobile devices? Wouldn't it be nice to use both WiFi and mobile wireless communication at the same time without special equipment from Cisco? Even more: to have it be transparent to you when you step from one public WiFi hot-spot's coverage into another and are assigned a new IP address?
Questions raise, answers kill. Raise questions to stay alive.
I'm possibly missing the point here, but I'm struggling to understand how this would be put to use:
Servers that can pool multiple high-speed interfaces:
This capability has been around for years - port channeling/interface bonding/whatever vendors call it, do this already at L2. Aside from making it easier for those interfaces to sit on different subnets, why would you want to push this capability to L3/TCP? Seems like it just introduces additional complexity.
"Smart" devices (or any client really)
I suppose this makes a bit more sense, but given the order of magnitude latency and (typically) bandwidth differential between your typical 3/4G and a local wireless connection, this doesn't seem like a huge boost - 3/4G in particular is more hampered by latency than net available bandwidth anyway. Does a wider pipe really boost things that much?
On a side note - I thought TCP was all about guaranteed and ordered delivery - aren't you just stuck waiting for the slowest link in the chain anyway?
You disable the ones you do not want to use at any given time.
No, you don't. If I remember correctly, LACP will give you the maximum bandwidth provided by a single link, per connection. You can't just hook up LACP / LAGG / whatever your vendor calls it, fire up iSCSI, and magically have a 2 Gbps link to your SAN-- because iSCSI does a single connection per LUN, you will get a 1 Gbps connection even with LACP.
LACP gets you higher total capacity, so if you were running two iSCSI connections you could get 1 Gbps on each with no contention. If the summary is to be believed, this would give you a truly multi-Gbps link off of aggregated gigabit connections.
No need for expensive Cisco equipment. I have cheap Netgear switches that are completely LACP capable, and some cheap Realtek adapters support teaming now. It's more a driver issue than anything. If cell manufacturers designed their equipment and built the right drivers, you could easily dynamically team a cell and WiFi connection.
I understand the application sets up its sockets as usual, and the kernel adds TCP extensions for MPTCP and uses them to negotiate with the remote endpoint and start up the subflows. This is transparent for the application, but is there any way it can inspect what is going on? I am thinking about some kind of generalized getpeername(), which would allow the caller to get information about all subflows in use.
Does not work with NAT, unless you could BGP-announce a single IP address.
On second thought, not even then. Each connection has its own IP address. You would need to find a way to terminate a single TCP connection on two IPs. I assume this requires changes to the server's and client's software/OS.
If cell manufacturers designed their equipment and built the right drivers
And if Apple refuses to implement it, you will still be able to grab an Android, compile/install the MPTCP stack and do it, without waiting for Apple to resist the mobile providers' pressure not to support a feature that would hurt their bottom line (or, for that matter, waiting for the mobile providers to upgrade their towers and hurt their bottom line by themselves).
Questions raise, answers kill. Raise questions to stay alive.
If you want to use multiple links all at the same time, with the packets spread over them, you're supposed to get an Autonomous System number.
This is more akin to link aggregation than it is multihomed Internet connections. Any two hosts could use this. They could be in the same autonomous system. They could be on the same subnet. There's no need to get a separate AS number for each host.
Note that one of the other use cases suggested is for smartphones.
For those wanting to try, their install howto. Seems supported on:
1. Linux - either Debian binaries or compiling from source. Both kernel module and userspace ways.
2. Virtualized Linuxes - their example is provided for Amazon EC2
3. Mac OSX - but, obviously, not on iPhone (I estimate slim chances for this to happen in the near future - it's a technology disruptive for the mobile providers income, as it makes the multi-pathing over cell/WiFi hot-spots transparent to end user)
4. Android (Opinion: see? This is one of the reasons relying on "walled gardens" is bad: you have to wait for the mercy of the garden lord to benefit from something).
Questions raise, answers kill. Raise questions to stay alive.
Is anyone making network adapters with a built in programmable processor with an open specification these days? This particular protocol may well be compatible with some of the existing checksum offload implementations but GSO and GRO would seem to need something special especially for multiplexing across streams.
When you advertise something over BGP you essentially broadcast it to every core router in the world. Having every core router know about every device is just not going to happen. Having every router know about every ISP and large company is bad enough. Also, BGP is built on a mixture of trust and manually applied filters. So unless you want to open it up to every idiot breaking other people's traffic, it's not going to handle systems that regularly move around very well.
Also internet routing looks for the path with the least number of AS hops and will generally only use one connection for any particular pair of end systems. What you really want as an end user with multiple connections is to use both paths at once to connect to the same place.
note: I'm known as plugwash most places but I screwed up registering that here somehow in the past and now can't register
Then you set up your load balancing on your LACP links incorrectly for what you were trying to accomplish. Choose a different implementation for pathing and you'll get total aggregate.
The problem you're referring to is a problem with the implementation of LACP you're using, not the specification itself.
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
So they're able to get 50Gbps out of 6 10Gbps circuits/handoffs/etc... well, simple math would say why only 50 and not 60? OK, so it's cool if you want to tell me that I can simultaneously send data to a single destination via WiFi and 3/4G, simultaneously being the key word. That part is interesting because in most cases the public address is different. But the article also mentions "servers that can pool multiple high speed interfaces". What? Any decent firewall, switch or server has been providing this functionality for years. It's called teaming or bonding. You want to send a shitload of data to a destination but it takes too long? Not a problem, throw a couple quad NICs in those bitches and bond them up, problem solved, providing your network can support the throughput. What am I missing?
This would be a dream for small businesses and home users. Have 2-4 DSL/Cable/Wireless WAN ports and one port for your LAN/Router. Plug and play for instant redundant internet at a much higher speed than any one low cost connection could provide.
or maybe we could just filter comments based on length or number of links. >1000 words or >20 links
Not unless they changed something recently. Read http://www.ieee802.org/3/hssg/public/apr07/frazier_01_0407.pdf LACP requires that any conversation goes over only a single link at a time. Out-of-order packets can do some rather nasty things to TCP connections, and adding buffers to correct that does nasty things to VoIP / other latency-sensitive bits. Sure, Linux boxes have some non-standard modes that might work if you're sitting one switch away, but that's not conforming to the LACP spec. They also do not scale, as they require keeping state for every session running through them. What networking gear are you using?
No sir I dont like it.
Yes, you can do it at layer 2 with various technologies, some vendor specific, some vendor neutral. This new method is doing it at layer 4. If you can't see why this matters then I suggest learning why the layered model exists in the first place.
According to both the article which silas linked below (which is the original source for what I said), as well as a whole boatload of other documentation, that's not correct; it's an 802.3ad issue.
I did find this on serverfault, which indicates that ONLY balance-roundrobin can get you 2 Gbps on a single TCP connection; it also notes that some protocols don't like it, which means that it's not really a transparent bonding technology. All of the other methods of distributing packets rely on a hash of various values, for instance source MAC and destination MAC addresses, and regardless of method the hash will ALWAYS be the same on a single TCP connection, which means that the same single link will be used.
Regardless, the Linux bonding driver is NOT the same thing as LACP, and it's not something you implement on the switch.
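To make the hashing point concrete, here is a minimal Python sketch. The `pick_link` function, the choice of header fields, and the CRC32 hash are all assumptions for illustration; real switches and bonding drivers use their own, often configurable, policies. What it shows is only the general property: a hash over fields that never change during a connection always selects the same member link.

```python
import zlib

def pick_link(src_mac, dst_mac, src_ip, dst_ip, src_port, dst_port, n_links):
    """Map a flow to one member link by hashing its header fields.

    Hypothetical policy: the field list and CRC32 are illustrative
    assumptions, not what any particular vendor implements.
    """
    key = f"{src_mac}|{dst_mac}|{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % n_links

# One iSCSI-style TCP connection: every packet carries the same header
# fields, so every packet is mapped to the same link.
flow = ("aa:aa:aa:aa:aa:01", "bb:bb:bb:bb:bb:01", "10.0.0.1", "10.0.0.2", 51000, 3260)
links_used = {pick_link(*flow, n_links=2) for _ in range(1000)}
print(len(links_used))  # 1
```

Adding more links changes which link the flow lands on, but never spreads one flow over several of them.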
MPTCP has separate sequence-number spaces: one for the subflow, inside the regular TCP header, and the data sequence numbers, carried inside the TCP option space.
These data sequence numbers include data ACKs. So this is the "cross-subflow ack machinery" you mentioned.
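A rough Python sketch of the two sequence-number spaces follows. The `Segment` class, its field names, and the subflow labels are hypothetical, and the model is simplified (byte offsets start at 0, duplicates are not handled). It is only meant to show why the receiver orders data by the connection-level data sequence number rather than by the per-subflow one.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    subflow: str      # hypothetical subflow label, e.g. "wifi" or "3g"
    subflow_seq: int  # per-subflow sequence number (regular TCP header)
    data_seq: int     # connection-level data sequence number (TCP option)
    payload: bytes

def reassemble(segments):
    """Rebuild the byte stream by ordering on the data sequence number.

    Returns the contiguous stream and the next expected data sequence
    number, which plays the role of a cumulative data ACK.
    """
    stream = b""
    expected = 0
    for seg in sorted(segments, key=lambda s: s.data_seq):
        if seg.data_seq != expected:
            break  # gap in the data sequence space: stop at the hole
        stream += seg.payload
        expected += len(seg.payload)
    return stream, expected

# Segments arrive interleaved from two subflows, each with its own
# subflow sequence space starting at 0.
arrived = [
    Segment("3g", 0, 5, b" over"),
    Segment("wifi", 0, 0, b"hello"),
    Segment("wifi", 5, 10, b" mptcp"),
]
stream, data_ack = reassemble(arrived)
print(stream, data_ack)  # b'hello over mptcp' 16
```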
Has anyone actually thought about how The Onion Router (Tor) network might take advantage of this? This could potentially, or partially, fix one of Tor's biggest problems, which is the inherent low bandwidth due to one single TCP connection via several nodes in a serial manner. If Tor could leverage multiple nodes in parallel into a single connection, this would certainly make it faster... and even more anonymous than it is today.
I was doing TCP multipath in 2004 using iptables to get more upstream out of my box at home. I had two 1.5/384 connections and could end up with 768k upstream. All it took was a clever iptables script that marked alternating packets - even and odd if you will - and mangled odd packets to go out one interface, and even packets the other.
Obviously, an actual TCP extension for this is going to be more elegant since it's more scalable and easier to deal with, but the idea is not new.
I find it annoying that my app is disrupted when I leave a WiFi network and the (Android) phone needs a few seconds to connect to the 3G/4G network. But then it might just do that because it needs to save power on the network interface.
But then I'd buy the phone with double or triple the battery over the slim one anyway. If only they were available.
Waiting to see an apartment full of geeks and gamers rocking 10 cable modems all linked together and splitting the bill.
NFS on multipath is my interest, too.
In NFS (v4) on TCP, the endpoints frame messages on the bytestream, independently in each direction. As FireFury03 states above, we're basically (potentially large) packet based. We'd like help from new transports in framing those messages optimally, avoiding head-of-line blocking for entire messages.
In addition to solving HOL, it's been proposed that we could design message framing on SCTP so as to deliver messages and data chunks on different streams, and get some advantages of NFS on RDMA.
I think it's the same story with a lot of protocols, including HTTP. In fact, like most web servers, the ONC RPC stack I work on is in user space, so I have a/the more complex version of these problems.
So do MPTCP developers see solutions for any of these problems on the horizon?
Matt
So do MPTCP developers see solutions for any of these problems on the horizon?
I'm in no way affiliated with or knowledgeable about MPTCP - so I may be wrong - but from what I got from their presentation, MPTCP is not actually designed with the NFS-like use case in mind (even if it may be used for such), but with the more "common" use case of a mobile device able to use either/both cellular and WiFi networking.
The second thing that I saw as peculiar: it is not even supported by a network protocol (like IP is supporting TCP/UDP/SCTP/etc), but is supported by TCP. While it will have to deal somehow with reassembling a stream from packet streams over different paths, in itself it will be as prone to HOL blocking as the bunch of underlying individual TCP streams which support those different paths.
That is, assuming the head-of-line segment is sent on one path and that path involves HOL blocking, then it doesn't matter if the other paths have lower latencies; the entire original stream will be HOL-blocked.
Questions raise, answers kill. Raise questions to stay alive.
That is, assuming the head-of-line segment is sent on one path and that path involves HOL blocking, then it doesn't matter if the other paths have lower latencies; the entire original stream will be HOL-blocked.
The implementation includes a solution to overcome HOL-blocking by reinjecting the blocking data-segment on the lower-latency path. Have a look at our scientific paper, which explains this mechanism: http://inl.info.ucl.ac.be/publications/how-hard-can-it-be-designing-and-implementing-deployable-multipath-tcp
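As a rough illustration of why reinjection helps, here is a toy timing model in Python. It is not the scheduler from the Linux implementation (the paper linked above describes the real mechanism); the function names and all timings are made-up assumptions.

```python
def in_order_delivery_times(arrivals):
    """arrivals[i] is when segment i arrives at the receiver.

    Segment i can be handed to the application only once segments 0..i
    have all arrived, so its delivery time is a running maximum.
    """
    delivered, latest = [], 0.0
    for t in arrivals:
        latest = max(latest, t)
        delivered.append(latest)
    return delivered

def reinject(arrivals, blocking_index, reinject_at, fast_path_delay):
    """Resend the blocking segment on the fast path.

    The receiver keeps whichever copy of the segment arrives first.
    """
    patched = list(arrivals)
    patched[blocking_index] = min(arrivals[blocking_index],
                                  reinject_at + fast_path_delay)
    return patched

# Hypothetical timings in ms: segment 0 went over a slow path (300 ms),
# segments 1-3 over a fast path with a 20 ms one-way delay.
arrivals = [300.0, 20.0, 21.0, 22.0]
print(in_order_delivery_times(arrivals))  # [300.0, 300.0, 300.0, 300.0]

patched = reinject(arrivals, blocking_index=0, reinject_at=30.0, fast_path_delay=20.0)
print(in_order_delivery_times(patched))   # [50.0, 50.0, 50.0, 50.0]
```

Without reinjection everything waits on the slow path; with it, the stream is unblocked as soon as the duplicate copy arrives on the fast path.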
The implementation includes a solution to overcome HOL-blocking by reinjecting the blocking data-segment on the lower-latency path.
Have a look at our scientific paper, which explains this mechanism: http://inl.info.ucl.ac.be/publications/how-hard-can-it-be-designing-and-implementing-deployable-multipath-tcp
Oh, wow! Thanks.
Questions raise, answers kill. Raise questions to stay alive.
Hi,
Speed-reading the paper, there appears to be some implicit ammunition for the SCTP approach (if it worked!), at least for applications like HTTP, NFS, etc, which are characterized by multiplexing of large and small messages on the stream. I conclude this from section 5.3, which I think states MPTCP over 2 links was slower than ordinary TCP over one link, when message size was 30K.
(Apologies if I'm misreading.)
Thanks,
Matt
I conclude this from section 5.3, which I think states MPTCP over 2 links was slower than ordinary TCP over one link, when message size was 30K.
For very small flow sizes (like less than 30KB), MPTCP should not try to create additional subflows, because the whole data fits in the initial window of the first subflow. However, at the moment the Linux implementation always tries to establish new subflows. In the paper's stress-testing scenario these additional subflows just consumed CPU cycles, hence the "bad" results for MPTCP with very small flows.
An easy fix would be to delay the establishment of additional subflows until a certain threshold of data has been sent or a certain time has passed.
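The fix suggested above could look something like the following Python sketch. This is not code from the Linux implementation; the function name and both thresholds (30 KB and 0.5 s) are assumptions picked purely for illustration.

```python
def should_open_extra_subflows(bytes_sent, elapsed_s,
                               byte_threshold=30 * 1024,
                               time_threshold_s=0.5):
    """Decide whether a connection has earned additional subflows.

    Hypothetical policy: open extra subflows only once the connection
    has sent a certain amount of data or has lived for a certain time.
    Both default thresholds are illustrative assumptions.
    """
    return bytes_sent >= byte_threshold or elapsed_s >= time_threshold_s

# A short HTTP-style transfer never triggers the extra handshakes...
print(should_open_extra_subflows(bytes_sent=10 * 1024, elapsed_s=0.05))  # False
# ...while a bulk transfer or a long-lived connection does.
print(should_open_extra_subflows(bytes_sent=64 * 1024, elapsed_s=0.05))  # True
print(should_open_extra_subflows(bytes_sent=1024, elapsed_s=2.0))        # True
```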
Thanks, Christoph.
I think I was misreading; section 5.3 is discussing performance with short-lived connections, such as happens with HTTP 0.9 or 1.0. The question I would ask next is, how does MPTCP perform when HTTP 1.1 or similar channel multiplexing is used?
Matt
We have not yet done extensive experiments with HTTP 1.1.
But MPTCP benefits bandwidth-intensive connections (by increasing throughput) and long-lived connections (by increasing resilience against link failures).
So, I would say that HTTP 1.1 would benefit from MPTCP.
Cheers,
Christoph
TCP X2 now with Shotgun technology!
LACP uses various methods to choose which link to send frames over-- for example source port, source MAC, etc. Regardless of what you choose, a single TCP connection will end up using the same link even when LACP is implemented on the switch.
You might try reading the linked articles in my and silas' responses before arguing; particularly as one of them is a link to the IETF spec.
Yes, you have that total amount of bandwidth. If you were to have 4 iSCSI connections, each of them would get a full gigabit; if you had 8 connections each would get 500 Mbps.
However, a single connection from a single TCP port coming out of a single MAC address / IP address is going to get a single gigabit per second of traffic; there's not really a good workaround for this.
If you've found a way to get 4 Gbps on a single iSCSI connection using LACP, please do share, as a LOT of people would be interested to get that running.
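The arithmetic behind those numbers, as a small Python sketch. The helper name is hypothetical, and it assumes the best case: connections hash evenly across the member links and share a link equally.

```python
def per_connection_gbps(n_links, link_gbps, n_connections):
    """Best-case throughput per connection on an aggregated link group.

    Assumes connections are spread evenly over the links. A single
    connection is always capped at the speed of one member link.
    """
    if n_connections <= 0:
        raise ValueError("need at least one connection")
    aggregate = n_links * link_gbps
    return min(link_gbps, aggregate / n_connections)

print(per_connection_gbps(4, 1.0, 1))  # 1.0 (one connection: one link's worth)
print(per_connection_gbps(4, 1.0, 4))  # 1.0 (a full gigabit each)
print(per_connection_gbps(4, 1.0, 8))  # 0.5 (500 Mbps each)
```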
It's worse than that: depending on how things are hashing, you can have 4 connections and 2 of them (or all four) use the same link. It's really dependent on the networking kit you're using. LACP does not specify how you do this, just that you do something to ensure packets for a given "conversation" only go down one path at a time. Simpler networking kit might only look at one MAC address; smarter bits go all the way up to protocol ports.
For iSCSI the "easy" fix is to run multipath on it, assuming your SAN supports it. That makes LACP work much better, as it has more MACs, IPs and ports to hash against.
No sir I dont like it.
LACP does not need to care about L3 or L4, but it's free to do so. Really cheap networking kit (and some broken, really expensive bits) only ever uses MACs; smarter kit can do whatever it wants higher up in the protocol stack. How it distributes packets is not something that's negotiated; it's just fixed, or a setting on each end. If you plug the dumbest of the dumb LACP switches into the smartest switch, you will get good load balance in one direction and poor balance in the other.
No sir I dont like it.
Multipath TCP traverses NAT and other types of middleboxes without problems.
The main benefit of Multipath TCP in multihoming scenarios, compared to BGP-based multihoming, is that Multipath TCP capable hosts can use different paths simultaneously, while BGP-based multihoming would provide one path for each client-server pair.
With Multipath TCP, multihoming must be exposed to the server. For example, consider a small enterprise network connected to two different providers, A and B. With BGP-based multihoming, you assign address C to your server and advertise it through the two providers via BGP. BGP decides which path will be used and the ASPath metric used by BGP is far from being the most accurate metric to evaluate the quality of a path.
With Multipath TCP, you should request a block of addresses from both A and B and assign one address from each provider to your server, say A.1 and B.1. Both addresses are advertised in the DNS. Address A.1 is always reachable via provider A and B.1 is always reachable via provider B. When a TCP connection reaches the server, say over A.1, the server will also advertise address B.1 using Multipath TCP and a second subflow will be established. Multipath TCP will then regulate the usage of the two paths as a function of the amount of congestion on each path. If one path fails, Multipath TCP will perform failover automatically.
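The sequence described above (connect over A.1, learn about B.1, balance by congestion, fail over) can be sketched as a toy Python model. The `MultipathConnection` class, its methods, and the inverse-congestion split are all assumptions for illustration; the real protocol uses TCP options and a coupled congestion controller, not this arithmetic.

```python
class MultipathConnection:
    """Toy model of one connection to a server with several addresses."""

    def __init__(self, initial_addr, server_addrs):
        self.server_addrs = list(server_addrs)
        self.subflows = {initial_addr: True}  # address -> path is up

    def advertise_remaining(self):
        """Server advertises its other addresses; open a subflow to each."""
        for addr in self.server_addrs:
            self.subflows.setdefault(addr, True)

    def path_failed(self, addr):
        self.subflows[addr] = False

    def usable(self):
        return [a for a, up in self.subflows.items() if up]

    def split(self, congestion):
        """Share traffic in inverse proportion to a congestion score.

        The score is a hypothetical positive number per address; this is a
        crude stand-in for real congestion control.
        """
        weights = {a: 1.0 / congestion[a] for a in self.usable()}
        total = sum(weights.values())
        return {a: w / total for a, w in weights.items()}

conn = MultipathConnection("A.1", ["A.1", "B.1"])
conn.advertise_remaining()              # second subflow over provider B
print(conn.usable())                    # ['A.1', 'B.1']
print(conn.split({"A.1": 1.0, "B.1": 1.0}))  # equal congestion: even split
conn.path_failed("A.1")                 # provider A goes down
print(conn.split({"A.1": 1.0, "B.1": 1.0}))  # {'B.1': 1.0}
```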