Slashdot Mirror


Pushing the Limits of Network Traffic With Open Source (cloudflare.com)

An anonymous reader writes: CloudFlare's content delivery network relies on their ability to shuffle data around. As they've scaled up, they've run into some interesting technical limits on how fast they can manage this. Last month they explained how the unmodified Linux kernel can only handle about 1 million packets per second, when easily-available NICs can manage 10 times that. So, they did what you're supposed to do when you encounter a problem with open source software: they developed a patch for the Netmap project to increase throughput. "Usually, when a network card goes into the Netmap mode, all the RX queues get disconnected from the kernel and are available to the Netmap applications. We don't want that. We want to keep most of the RX queues back in the kernel mode, and enable Netmap mode only on selected RX queues. We call this functionality: 'single RX queue mode.'" With their changes, Netmap was able to receive about 5.8 million packets per second. Their patch is currently awaiting review.

55 comments

  1. What does this mean? by Anonymous Coward · · Score: 0

    If I have a 100Mb/s NIC, I'm only getting 10 MB/s on Linux? I doubt that.

    1. Re:What does this mean? by Knuckles · · Score: 3, Informative

      If I have a 100Mb/s NIC, I'm only getting 10 MB/s on Linux? I doubt that.

      Packets != Bytes

      --
      "When I first heard Daydream Nation it quite frankly scared the living shit out of me." -- Matthew Stearns
    2. Re:What does this mean? by luvirini · · Score: 4, Informative

      A packet is not a byte. A packet is a sequence of bits including an address, other header information and the actual payload.

      An IPv4 packet will, for example, have a 20-byte (160-bit) header and a maximum payload of 65,515 bytes (though often lower in practice).

      If you were to send a lot of packets with only a single byte of payload, then each packet would be 168 bits and your 100 Mb/s would carry about 600,000 packets per second. But on a gigabit connection the actual limit will start to hit for such strange traffic.

      Note that normally you would send more than a single byte of information per packet, so in most real applications you would need much higher speeds to hit the limit. At 105 bytes of information you would have a total length of 1000 bits, so you would be at about the limit on gigabit hardware. But still, most high-bandwidth traffic tends to have much more information in each packet and thus does not usually hit such limits.

      The limit has really started to hit due to the high availability of 10 gigabit and faster network cards coming down in price.
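
The arithmetic in this comment can be sketched in a few lines of Python. This is a rough model that, like the comment, counts only the 20-byte IPv4 header plus the payload and ignores Ethernet framing and any TCP/UDP header. The function name `packets_per_second` is made up for illustration.

```python
IPV4_HEADER_BYTES = 20  # minimum IPv4 header, no options


def packets_per_second(link_bits_per_second, payload_bytes,
                       header_bytes=IPV4_HEADER_BYTES):
    """Packets per second that fit on a link, counting only header + payload bits."""
    packet_bits = (header_bytes + payload_bytes) * 8
    return link_bits_per_second / packet_bits


# 1-byte payload on a 100 Mb/s link: 168-bit packets
print(round(packets_per_second(100e6, 1)))
# 105-byte payload on a 1 Gb/s link: 1000-bit packets
print(round(packets_per_second(1e9, 105)))
```

The second case lands on about one million packets per second, which is the kernel limit the story is talking about.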

    3. Re:What does this mean? by monkeyhybrid · · Score: 2

      Sounds about right. Even if you were to ignore TCP/IP overhead, the most you could hope to achieve is 12.5MB/s over a 100Mb/s link. But that has nothing to do with what this article is about.

    4. Re:What does this mean? by fustakrakich · · Score: 1

      I always thought the best was 12.5MB/s, and that it didn't matter what system you're using.

      But on the average, after all the other losses, 100Mbits/s is more or less about 10MBytes/s. At least that is what shows up on the display.

      --
      “He’s not deformed, he’s just drunk!”
    5. Re:What does this mean? by Anonymous Coward · · Score: 0

      With a 100Mbit/s NIC you can (in theory, assuming no overhead) get 12.5MB/s, regardless of OS. In practice, you'll need to consider overhead at various protocol layers, so you'll get slightly less, under ideal conditions.

      There's no one-to-one mapping of the above (amount of data transferred per unit of time) to packets/s, however. Raw packet-processing performance depends (also) on various overheads as well as the size of each packet.

      When your packets become smaller, your bandwidth tends to drop a bit. If your hardware/network stack can't keep up with the raw packet rate, your bandwidth tends to drop more than a bit.

      High-performance packet-processing stacks can give you decent bandwidth even for small packets.

      Scaling this up to really beefy pipes, like 10Gbit/s and up, you hit all sorts of limitations unless your stack is top-notch all the way from the hardware at the bottom up to and including any software stack involved in the processing.

      The work described here is in the "middle", of sorts, or just above it.

      It seems to be worthwhile work.

    6. Re:What does this mean? by haruchai · · Score: 1

      I've never seen better than about 8.5 MB/s sustained at any place I've worked.

      --
      Pain is merely failure leaving the body
    7. Re:What does this mean? by cciechad · · Score: 1

      The limit has really started to hit due to the high availability of 10 gigabit and faster network cards coming down in price.

      10G coming down in price? We're starting to get 40G pretty commonly now. I think the cards can be had for around 1k or less. Hell I wouldn't be surprised if we start seeing some 100G or a weird intermediate speed. With RoCE Ethernet is starting to be a viable alternative for MPI/RDMA instead of Infiniband and 40G switchports are amazingly cheap and can be broken into 4x10G interfaces to support legacy 10G server

      --
      https://www.fsf.org/associate/support_freedom
    8. Re: What does this mean? by Anonymous Coward · · Score: 0

      corporate shillery. just like /.

    9. Re:What does this mean? by Bengie · · Score: 1

      On my old AMD 2500xp with an integrated 100Mb nvidia NIC, I was getting over 11MiB/s via windows SMB on WinXP and with my current home computers, I get 114MiB/s over my 1Gb/s network.

    10. Re:What does this mean? by Anonymous Coward · · Score: 0

      Netmap is developed on, and runs best on, FreeBSD.
      So what it means is that if you want a serious networking platform, choose FreeBSD, not Linux.

    11. Re: What does this mean? by Anonymous Coward · · Score: 0

      Measuring and comparing penis sizes in Gb/s now, are we?

    12. Re: What does this mean? by Anonymous Coward · · Score: 0

      Do you have any figures on frame throughput to back that up, or is it hot air as usual from the BSD corner?

    13. Re: What does this mean? by Anonymous Coward · · Score: 0

      The limit I've noticed is 11.7 MB/s for HTTP streams.

    14. Re:What does this mean? by Anonymous Coward · · Score: 0

      Latency of 25G is better than 40G. This is because multiple transceivers are needed for 40G which need to be synchronised.

    15. Re: What does this mean? by greenfruitsalad · · Score: 1

      your question is in the area of: "you claim the sun is bright. show me your sources!!!"

      his claim is as common a knowledge as water being wet, earth being round, etc..

      http://bsd.slashdot.org/story/...

    16. Re:What does this mean? by Anonymous Coward · · Score: 0

      Dear AC. YOU are the reason people continue to say "raid is not backup". It takes a lot to get me riled up enough to post on Slashdot, but you have pushed it too far.
      and buy a freaking NIC that was made this century. Dopey Linux fan boi.

    17. Re:What does this mean? by SuricouRaven · · Score: 2

      The general rule is divide-by-ten: a 100Mb link means 10MB throughput. Divide by eight for the bit-to-byte conversion, but by ten to allow for overhead. It's not exact, but it's a good rule of thumb.
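
For comparison with the rule of thumb, here is a small sketch of the best-case TCP payload rate over Ethernet. It assumes a standard 1500-byte MTU, IPv4 and TCP headers of 20 bytes each with no options, and 38 bytes of per-frame Ethernet overhead (preamble, start-of-frame delimiter, header, FCS and inter-frame gap). The function name `tcp_goodput_mb_per_s` is hypothetical, and retransmits, ACK traffic and host overhead are ignored.

```python
def tcp_goodput_mb_per_s(link_mbit_per_s, mtu=1500):
    """Best-case TCP payload rate in MB/s (10**6 bytes) over Ethernet."""
    # preamble + SFD + Ethernet header + FCS + inter-frame gap
    ethernet_overhead = 7 + 1 + 14 + 4 + 12
    ip_tcp_headers = 20 + 20            # IPv4 + TCP, no options (assumption)
    payload = mtu - ip_tcp_headers      # 1460 bytes of application data
    wire_bytes = mtu + ethernet_overhead  # 1538 bytes actually on the wire
    return link_mbit_per_s / 8 * payload / wire_bytes


print(f"{tcp_goodput_mb_per_s(100):.2f} MB/s")
```

Under these assumptions a 100Mb link tops out a little under 12 MB/s, so divide-by-ten leaves some headroom for everything the model leaves out.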

    18. Re:What does this mean? by jabuzz · · Score: 1

      If you think that RoCE is a viable alternative for Infiniband/MPI then you have been on the crack pipe again.

      Sure, you might be able to replace QDR Infiniband with RoCE, but the Infiniband world has moved on, and replacing EDR with RoCE is a sick joke. While your RoCE gets maybe 1.5us latency, which is in the ballpark of QDR at 1.2us, EDR is doing 0.5us latency, and Infiniband is about latency as much as throughput. In addition, EDR Infiniband is a lot cheaper than RoCE at 100Gbps.

    19. Re: What does this mean? by Lotharus · · Score: 1

      your question is in the area of: "you claim the sun is bright. show me your sources!!!"

      No, his question is in the area of "you claim the sun illuminates this patch of ground better than this 12kW arc lamp. show me your sources!!!" Sure, the sun is bright. So is a 12kW arc lamp. Making the claim that one illuminates an area better than another requires supporting evidence in the form of luminous intensity measurements.

      Since you put it another way, I will too:
      "Water is wet." Sure, but is water wetter than alcohol? Ferrocene? Sodium laureth sulfate? His claim is that it's the best, which is, as the saying goes, an extraordinary claim which requires extraordinary evidence.

  2. This patch and its effects by Anonymous Coward · · Score: 2, Insightful

    must be thoroughly considered. CloudFlare is the greatest Man-in-the-Middle on the Internet, and don't think for a second they're not collaborating with U.S. agencies who want to get at sensitive data going through their systems.

    1. Re:This patch and its effects by Anonymous Coward · · Score: 1

      "Prince and his team were inspired to start the company after a call from the Department of Homeland Security."(quote from article, not my opinion)
      http://exiledonline.com/isucker-big-brother-internet-culture/

      Interesting take on it?

    2. Re:This patch and its effects by Anonymous Coward · · Score: 0

      .... sorry, the techcrunch article that is linked in this one is where the quote was.

  3. This is what routers and switches are for by JoeyRox · · Score: 3, Interesting

    If they only need to "shuffle" packets around (i.e., not crack open the frames and actually interpret the data beyond making routing decisions) then routers/switches are better suited for this. If they actually need to do something more with the data then that quoted rate of 5.8 million packets/sec will drop very quickly with every line of code they add that does anything with the data.

    1. Re:This is what routers and switches are for by raxx7 · · Score: 2

      Their goal is to receive the packets into their own user space analysis software and drop most of them (as being a flood attack).
      Their problem is that, using the existing methods, they can't get more than ~1 M packets/s into their software.

      I guess they are not using dedicated router hardware because there's no way to run their software on it.
      At which point, maybe they need a piece of kit based on Cavium's chips (lots of low performance cores).

    2. Re:This is what routers and switches are for by bananaquackmoo · · Score: 1

      You do realize that routers are made out of software too, right?

    3. Re:This is what routers and switches are for by Anonymous Coward · · Score: 0

      Wait wait wait, you're saying routers have a soul?

    4. Re:This is what routers and switches are for by Anonymous Coward · · Score: 0

      What?

      The hardware of current routers handles all packet routing. Any "softwareish" aspect of routing acts on headers, not the contents of body, and is more like a hardware pipeline specification than software. (For instance, it's not Turing complete.)

    5. Re:This is what routers and switches are for by INT_QRK · · Score: 4, Funny

      Yes we have souls you insensitive dolt. If you SYN me do I not ACK?

    6. Re:This is what routers and switches are for by AaronW · · Score: 4, Interesting

      I work at Cavium on the SDK team (I do all the bootloader stuff for their MIPS chips). The Ubiquiti Edgerouter Lite uses one of our old (2nd gen CN5020) low-end dual core chips and is able to handle 1M packets/second by running the packet processing on a dedicated core and Linux on the other core. Our current generation (4th gen) is far faster. I work with chips from 4 up to 48x2 cores (48 cores, 2 chips running in NUMA). There's a lot of support for offloading packet processing in our chips, for example, directing packet flows to different groups of CPU cores. There's also various engines built-in to the chips for things like compression, pattern matching, deep packet inspection, encryption, RAID calculations and more. We also are selling NIC cards (Liquid I/O) which can run Linux on the NIC card as well as dedicated software that can offload a lot. For example, it can perform all the SSL, VPN and firewall stuff on the NIC. I'm working on some of the new ones now. I'd love to see some inexpensive eval boards available, especially with our CN73XX or even CN70xx chip. Even our low-end quad core CN71xx can handle 10Gbps of traffic.

      --
      This post is encrypted twice with ROT-13. Documenting or attempting to crack this encryption is illegal.
    7. Re:This is what routers and switches are for by JoeyRox · · Score: 1

      High-performance routers implement their routing/switching logic in hardware.

    8. Re:This is what routers and switches are for by AaronW · · Score: 1

      Higher-end routers have hardware dedicated to doing things like deep packet inspection and modification with less software overhead. For example, I work at Cavium and the CPUs I work with have a lot of dedicated packet processing hardware designed to offload much of that processing to the hardware which has many dedicated engines.

      --
      This post is encrypted twice with ROT-13. Documenting or attempting to crack this encryption is illegal.
    9. Re:This is what routers and switches are for by cciechad · · Score: 1

      I'm curious. I just looked and I don't see any LiquidIO 40G adapters. Am I just missing them on the website? The ones I found seem nice, but the Mellanox 40G with FPGA chip seems nicer, as 10G is kind of out of date at this point in the server market.

      --
      https://www.fsf.org/associate/support_freedom
    10. Re:This is what routers and switches are for by AaronW · · Score: 1

      Look up the 410Nv. It has 4 SFP+ ports on it.

      --
      This post is encrypted twice with ROT-13. Documenting or attempting to crack this encryption is illegal.
  4. Let's use CloudFlare and Tor! by Anonymous Coward · · Score: 0

    Both created and heavily funded by the CIA/NSA

  5. SystemD by El_Muerte_TDS · · Score: 3, Funny

    Wouldn't it just be easier to put this in systemd?

    1. Re:SystemD by Anonymous Coward · · Score: 0

      Aren't there enough threads already with 800+ replies and pointless flamewars. There will be another one soon: it will go something like "systemd takes over X" - start shouting! Have patience, my friend.

    2. Re:SystemD by Anonymous Coward · · Score: 0

      Aren't there enough threads already with 800+ replies and pointless flamewars. There will be another one soon: it will go something like "systemd takes over X" - start shouting! Have patience, my friend.

      Systemd wouldn't have that 'hard' of a time if they actually had the full functionality of the tool they are replacing before pushing to replace the next.

  6. "technical limits" by Anonymous Coward · · Score: 0

    As they've scaled up, they've run into some interesting technical limits on how fast they can manage this.

    Yeah, no kidding. Anyone who's tried to use 4chan in the last few weeks has experienced Cloudflare's technical limitations first-hand.

    1. Re: "technical limits" by Anonymous Coward · · Score: 0

      Wtf are you doing on 4chan to start with?

    2. Re: "technical limits" by Anonymous Coward · · Score: 0

      Trolling, it's more fun there than it is here.

  7. My company addresses this by AaronW · · Score: 4, Interesting

    My employer deals with this on their multi-core MIPS processors. What we do is we can run Linux on one set of cores and dedicated applications on other cores. These applications offload most of the TCP/IP stack and only pass the relevant traffic to the kernel. The Ubiquiti EdgeRouter Lite uses one of our lowest-end chips and handles 1M packets/second. Our higher-end chips can easily handle far more packets. Then again, the dedicated cores are also able to take much better advantage of the hardware offload support for forwarding and filtering. Even without using the dedicated special application we can handle 40Gbps or more of traffic on the high-end chips. We can also handle stuff like IPSec at these rates due to built-in encryption and hashing instructions if coded properly.

    Having the right NIC card can also help since some NIC cards can offload things like TCP/IP segmentation and reassembly. I've also dealt with small gigabit switch chips that can offload stuff like NAT but Linux can't really take advantage of that as-is.

    There's a lot of room for improvement. Some years ago I was doing performance analysis for Atheros with respect to CPU cache utilization. The biggest bottleneck was the fact that the transmit path in the Linux networking stack would only pass a single packet at a time. Batch processing of packets for WiFi makes a HUGE difference since groups of packets need to be aggregated for 802.11N. It also would allow for more efficient packet processing for non-wireless as well. There are a lot of other areas that also could be improved.

    --
    This post is encrypted twice with ROT-13. Documenting or attempting to crack this encryption is illegal.
    1. Re:My company addresses this by Bengie · · Score: 1

      A 900MHz single core x86 CPU can handle 14 mil pps, but only if using Netmap or some other decent network API/stack.

    2. Re:My company addresses this by SuricouRaven · · Score: 1

      You can see it clearly if you take a managed switch apart. There are usually two large chips. A very big one that connects to all the interfaces and does the actual switching logic with specialised silicon, and a much smaller x86 or ARM processor that runs the management software.

  8. My Understanding Is Limited by Anonymous Coward · · Score: 0

    My understanding is limited. But this sounds very similar to the earlier Slashdot story about the BBC bypassing the kernel to improve UHD throughput. It's a different, even opposite, solution intended to overcome a similar or the same limitation.

  9. What am I missing? by Anonymous Coward · · Score: 0

    A recent blog post from Red Hat details how they're able to get three times those numbers on a single CPU and able to process over 12 million packets per second (http://rhelblog.redhat.com/2015/09/29/pushing-the-limits-of-kernel-networking/). What is so different about the workloads?

  10. Why not license PF_RING by Anonymous Coward · · Score: 0

    I'm sure Luca Deri would license it to them.

  11. FreeBSD anyone? by Anonymous Coward · · Score: 0

    Why not use FreeBSD?

    1. Re:FreeBSD anyone? by Bengie · · Score: 1

      Pfft, FreeBSD. That's for SysAdmins, not DevOps. Dev Ops for life!

    2. Re:FreeBSD anyone? by Anonymous Coward · · Score: 0

      Pfft, FreeBSD. That's for SysAdmins, not DevOps. Dev Ops for life!

      So true and sad at the same time. DevOps shouldn't be a one way road. The knowledge of the SysAdmins is valuable and shouldn't be pushed aside, because it makes life easier (until you hit a previously known wall).

  12. linux isn't the bottleneck by nimbius · · Score: 1

    Real switching, high speed carrier grade stuff, is more about hardware ASICs than software. It's comparatively exhaustingly expensive to route subnet or VLAN traffic because the CPU on most machines isn't quick enough with bus overhead. Cisco and others own a monopoly on ultra high speed ASIC-enabled hardware used by CloudFlare and others. Modern virtual switching hardware is fast enough to crush practically any consumer hardware.

    --
    Good people go to bed earlier.
  13. Maxxing the NIC card by Taco+Cowboy · · Score: 2

    From TFA:

    ... the unmodified Linux kernel can only handle about 1 million packets per second, when easily-available NICs can manage 10 times that ...

    ... which implies that NICs can easily manage more than 10 million packets per second, right?

    ... With their changes, Netmap was able to receive about 5.8 million packets per second ...

    5.8 million packets per second might be fast, but it is still _ much lower _ than the theoretical >10 million packets per second max speed ...

    I am curious, has any software (no matter if it's open source, or proprietary) successfully achieved the >10 million packets per second threshold yet?

    --
    Muchas Gracias, Señor Edward Snowden !
    1. Re:Maxxing the NIC card by SuricouRaven · · Score: 1

      10000000000/(64*8) = 19,531,250

      That's a maximum of 19m packets/second - assuming every frame is the minimum size, and you've a ten-gigabit ethernet interface.

      So 10mp/s isn't realistic in most situations, but it is possible - and you might hit it if you're trying to monitor traffic on a major backbone link, which is exactly the sort of thing netmap may be used for.
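
The figure above counts only the 64 bytes of the frame itself. On the wire, each Ethernet frame is also preceded by 8 bytes of preamble and start-of-frame delimiter and followed by a 12-byte inter-frame gap, which is where the commonly quoted line rate of about 14.88 million packets/second for 10-gigabit Ethernet comes from. A minimal sketch of both calculations (the function name `max_frames_per_second` is made up for illustration):

```python
def max_frames_per_second(link_bits_per_second, frame_bytes=64,
                          include_wire_overhead=True):
    """Upper bound on Ethernet frames per second for a given frame size."""
    # 8 bytes preamble/SFD + 12 bytes inter-frame gap per frame on the wire
    overhead_bytes = 8 + 12 if include_wire_overhead else 0
    bits_per_frame = (frame_bytes + overhead_bytes) * 8
    return link_bits_per_second // bits_per_frame


TEN_GIG = 10_000_000_000

# Frame bytes only, as in the calculation above
print(max_frames_per_second(TEN_GIG, include_wire_overhead=False))
# Including preamble and inter-frame gap
print(max_frames_per_second(TEN_GIG))
```

Either way, the 5.8 million packets/second from the story is well short of what a 10G link can deliver with minimum-size frames.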