Have Sockets Run Their Course?

wrong by jipn4 · 2009-05-12 18:07 · Score: 4, Interesting

Although the addition of a single system call to a loop would not seem to add much of a burden, this is not the case

Really? For a lot of networking code that's in use these days, I don't see that the system call overhead is the bottleneck. On clients you usually have network bandwidth as the limiting step (rather than system calls). On servers, it usually seems to be disk access or HLL interpreters.

Each system call requires arguments to be marshaled and copied into the kernel, as well as causing the system to block the calling process and schedule another.

That's easy to fix without changing the socket API: just add a system call that can return multiple packets from multiple streams simultaneously, a cross between select and readv. If there's a lot of data buffered in the kernel, it can then return that with a single system call.

Solving this problem requires inverting the communication model between an application and the operating system.

Not only does it not require that, inversion of control doesn't even solve it, since you still have the context switches.

Re:wrong by jipn4 · 2009-05-12 18:20 · Score: 2, Interesting

Oops... left out half of it...
That's easy to fix without changing the socket API: just add a system call that can return multiple packets from multiple streams simultaneously, a cross between select and readv. If there's a lot of data buffered in the kernel, it can then return that with a single system call. The user mode socket library can use that system call internally and still present every caller with the regular select/poll/socket abstraction; when callers request data, it first returns data that's already buffered in the process without another system call, and when it runs out of that, then it calls back into the kernel.
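
The buffering half of this idea can be sketched in user space. The following is a minimal illustration, not the proposed multi-stream system call (which is hypothetical here): the class name `BufferedSocketReader` and its `syscalls` counter are assumptions made up for the example. One large `recv()` fills a process-local buffer, and later small reads are served from it without entering the kernel again.

```python
import socket


class BufferedSocketReader:
    """Hypothetical user-space wrapper: one large recv() serves many small reads."""

    def __init__(self, sock, chunk_size=65536):
        self.sock = sock
        self.chunk_size = chunk_size
        self.buffer = bytearray()
        self.syscalls = 0  # number of recv() calls, i.e. trips into the kernel

    def read(self, n):
        """Return up to n bytes, calling recv() only when the buffer is empty."""
        if not self.buffer:
            self.syscalls += 1
            data = self.sock.recv(self.chunk_size)
            if not data:
                return b""  # peer closed the connection
            self.buffer.extend(data)
        out = bytes(self.buffer[:n])
        del self.buffer[:n]
        return out


def demo():
    # A connected pair of sockets stands in for a real network connection.
    a, b = socket.socketpair()
    a.sendall(b"x" * 1000)
    reader = BufferedSocketReader(b)
    total = 0
    while total < 1000:
        total += len(reader.read(100))  # ten or so small application reads
    a.close()
    b.close()
    return total, reader.syscalls
```

The caller still sees an ordinary read interface; only the number of kernel entries changes, and it depends on how much data the kernel had buffered when `recv()` was called.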
Re:wrong by jipn4 · 2009-05-12 18:44 · Score: 4, Interesting

even if it's completely academic, i think it's interesting to look at the user kernel boundary and try to refactor things which have negative structural impacts.
And you think that 2009 is the first time people think about this? System call overhead used to be a much bigger issue. UNIX and Linux have the current set of interfaces because they are a good compromise between simplicity and efficiency.
And these issues are constantly being evaluated implicitly: people who write network servers benchmark their code and find the bottlenecks. If the bottleneck is some system call, they complain to the kernel mailing list and maybe roll up their sleeves and come up with something new. If that turns out to be useful, more and more people ask for it to be put into the kernel, and eventually it becomes standard.
What motivates kernel developers is real benchmarks and the needs of important, real-world applications, not fluff pieces that express generic displeasure with the way things are done.
Re:wrong by convolvatron · 2009-05-12 18:51 · Score: 4, Interesting

no. in fact i can remember having discussions myself about this more than 20 years ago, and those were hardly the first.
unix has these interfaces as a matter of historical accident, and it was an excellent design at the time. it's hardly the only good point in the space.
you might find that it helps to think about these things, even when developing important, real-world applications. why shouldn't the kernel be able to call into userspace safely and transfer ownership of a buffer? is that really so terrible to consider?
Re:wrong by Darinbob · 2009-05-12 19:17 · Score: 4, Interesting

But socket-like interfaces exist on systems without any user/kernel interface, i.e., embedded systems. Many of those have implementations that do a good job of avoiding extra data copying, and yet still have an API that resembles sockets. I wonder if people are confusing the general idea of "sockets" with the specific "Berkeley Sockets" implementation and specification?
Re:wrong by RAMMS+EIN · 2009-05-12 22:24 · Score: 4, Interesting

``Windows' solution is pretty nice. You can pass a pre-created socket handle to accept_ex, which automatically accepts an incoming connection using that socket handle, so that you don't have to use two system calls (select and accept). You can also pre-accept multiple sockets, instead of having to make the system calls under load.
Sockets can also be closed with a "re-use" flag, which leaves the handle valid and saves making a system call to create another.
You then associate the sockets with an "IO completion port", which as best as I can tell is a multithreaded-safe linked list for really fast kernel to user program communication.''
I don't know. To me, it all just sounds like kludges to work around the facts that system calls are slow and that the implementation of the Berkeley API causes many system calls. You are adapting the structure of your program to code around the problems, instead of fixing the problems that cause the natural style of your program to lead to slowness.
There is nothing in the Berkeley socket API that mandates system calls or context switches. At worst, some copying is necessary (because the API lets the caller specify where data are to be stored, instead of letting the callee return a pointer to where data are actually stored).
The reason we have system calls and context switches, I claim, is that we are using unsafe languages. Because of this, applications could contain code that overwrites other programs' memory. We don't want that, and we have taken to separate address spaces to avoid it. The separate address spaces are enforced by the hardware, but this has a price, especially on x86. Perhaps it is time to rethink the whole "C is fast" credo. As the number of work instructions that can be executed in the time it takes to do a context switch increases, so does the relative performance of systems that do not need context switches, but of course we can only do away with context switches if we can provide safety guarantees in another way. One way would be to have the compiler enforce them. But that is outside the scope of Berkeley sockets, of course.

--
Please correct me if I got my facts wrong.

Couldn't this be like a flag, rather than new API? by tjstork · 2009-05-12 18:16 · Score: 4, Interesting

The recently developed SCTP (Stream Control Transmission Protocol) [4] incorporates support for multihoming at the protocol level, but it is impossible to export this support through the sockets API

The word that bugs me there, is "impossible". The question is, why? If you have to do something with sockets under the hood, then so be it, but it would seem to me that you could just add a few more fields to socket address to take into account multiple homes.

We've already had alternative APIs to sockets and for quite some time. sockets won. There were named pipes, ipx/spx, and the seemingly stupid idea of treating a network resource as a file has trumped every time.

--
This is my sig.

SCTP an interesting example by isj · 2009-05-12 18:17 · Score: 5, Interesting

I am developing SCTP applications and have contributed to the Linux implementation, and I think that one of the advantages of the socket API is that it is usable with select() and poll(), i.e., sockets are file descriptors you can pass around.

But for SCTP there are things that don't fit nicely into the socket API, especially when using one-to-many socket types. For instance for retrieving options for an association you have to piggyback data in a getsockopt() call by using the output buffer also for input. It works, but it is not nice. Also, for sending/receiving messages you have to use sendmsg/recvmsg with all the features including control data, and the ugly control data parsing.

Hmm... by fozzy1015 · 2009-05-12 18:40 · Score: 3, Interesting

In my experience the way the socket API can slow down a processor is having to monitor many thousands of socket descriptors using select() or poll(), like in a web server. On Linux, epoll was created for this scenario.
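
As a rough sketch of readiness notification across many descriptors, Python's standard `selectors` module picks the best mechanism the platform offers (epoll on Linux). The helper name `ready_payloads` and the use of local socket pairs in place of real clients are assumptions for illustration only.

```python
import selectors
import socket


def ready_payloads(pairs, timeout=1.0):
    """Register every reader once, then ask the kernel which ones are readable."""
    sel = selectors.DefaultSelector()  # epoll on Linux, kqueue on BSD/macOS
    for idx, (_writer, reader) in enumerate(pairs):
        sel.register(reader, selectors.EVENT_READ, data=idx)
    results = {}
    # Only sockets that actually have data are returned; idle ones cost nothing here.
    for key, _events in sel.select(timeout):
        results[key.data] = key.fileobj.recv(4096)
    sel.close()
    return results


def demo():
    pairs = [socket.socketpair() for _ in range(50)]  # 50 mostly idle "clients"
    pairs[7][0].sendall(b"hello")
    pairs[42][0].sendall(b"world")
    results = ready_payloads(pairs)
    for a, b in pairs:
        a.close()
        b.close()
    return results
```

With select() or poll() the full descriptor set is handed to the kernel on every call; with epoll the set is registered once and each wait returns only the ready descriptors.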

Re:RFC 1925 by dbIII · 2009-05-12 19:06 · Score: 3, Interesting

There are some situations where it isn't the best choice. In very simple clustering there just may not be enough sockets. For instance one package uses "rsh" up to around 512 hosts beyond which it doesn't work reliably unless you use "ssh" and a single socket. Of course "rsh" access scares people for plenty of other good reasons but that's a point best discussed elsewhere.

User level networking and the last copy by wdebruij · 2009-05-12 19:31 · Score: 4, Interesting

This is hardly news and partly mistaken.

The statement that sockets limit throughput by copying between kernel and application processes is a bit simplistic. The copy of Rx data to an application usually primes the cache. If data isn't touched and loaded into the cache at this point, it will have to be loaded shortly, anyway. Granted, for Tx this trick does not hold.

Second, the interface is not the implementation. Just because sockets are traditionally implemented as system calls does not mean that they have to be. User level networking is a well known alternative to OS services for high-bandwidth and low-latency communication (e.g., U-Net, developed around '96). I know, because I myself built a network stack with large shared buffers that implements the socket API through local function calls (blatant plug, but on topic. The implementation is still shoddy, but good enough for UDP benchmarking).

User level networking can also offer low latency. My implementation doesn't, but U-Net does.

This leaves the third point of the article, on multihoming. As sockets abstract away IP addresses and network interfaces, I don't see why they cannot support multihoming behind the socket interface. Note that IP addresses do not have to be mapped 1:1 onto NICs. Operating systems generally support load-balancing or fail-over behind the interface through virtual interfaces (in IRIX) or some other means (Netfilter in Linux).

No need to replace sockets just yet.

Re:Couldn't this be like a flag, rather than new A by phantomfive · 2009-05-12 19:41 · Score: 4, Interesting

The word that bugs me there, is "impossible". The question is, why? If you have to do something with sockets under the hood, then so be it, but it would seem to me that you could just add a few more fields to socket address to take into account multiple homes.

Especially since SCTP actually does use the sockets API. You have to use recvmsg() instead of recv() if you want to do multi-homing, but in using SCTP I was actually impressed by how flexible the BSD socket API actually is. I can't say I particularly like it, and everyone who uses it ends up writing a wrapper around most of the send and recv calls, but flexibility is definitely its strong point. If we ever do get routing by carrier pigeon, the BSD socket API will be able to adapt to it.

--
Qxe4

Re:Structured Stream Transport by ace123 · 2009-05-12 20:26 · Score: 2, Interesting

I definitely agree with you. In fact byte streams being a fundamental part of POSIX is one thing I love and make use of every day, for example piping output between programs/sockets. My post was not very clear, but I was trying to say that users developing application protocols should not be using BSD sockets directly any more--people usually write or use libraries for that sort of thing.

As far as new protocols go, you can build basically anything using UDP (and UDP is far less likely to be firewalled than any custom IP-level protocol you make up). I think such a protocol could only ever be practically implemented as a user-space library anyway.

I would be curious what the article thinks is so fundamentally wrong with the sockets paradigm.

It's not sockets, its bind() by argent · 2009-05-12 21:14 · Score: 5, Interesting

The socket API... or rather the UNIX file descriptor API... has been extended many times. Sockets are already one such extension, and there's no reason you couldn't do something like mmap() a socket to map the buffers into user space directly. Heck, udp sockets already diverge from the read/write paradigm.

The problem with sockets is at a higher level. They're not mapped into the file system name space. You should be able to open a socket by calling open() on something like "/dev/tcp/address-or-name/port-or-name" and completely hide the details of gethostbyname(), bind(), and so on from the application layer. If they'd done that we'd already be using IPv6 for everything because applications wouldn't have to know about the details of addresses because they'd just be arbitrary strings like file names already are.
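
A minimal sketch of what such a name-based open could look like from the application's side, written as an ordinary user-space function rather than a real file system entry. The function name `open_net_path` and the exact path layout are assumptions for illustration; bash offers a similar `/dev/tcp/host/port` form in its redirections. Because `socket.create_connection()` resolves the name itself, the caller never sees whether the address is IPv4 or IPv6.

```python
import socket


def open_net_path(path, timeout=5.0):
    """Hypothetical helper: connect to '/dev/tcp/<address-or-name>/<port-or-name>'."""
    parts = path.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "dev" or parts[1] != "tcp":
        raise ValueError("expected /dev/tcp/<host>/<port>, got %r" % path)
    host, service = parts[2], parts[3]
    # The port may be numeric or a service name such as "http".
    port = int(service) if service.isdigit() else socket.getservbyname(service, "tcp")
    # create_connection() calls getaddrinfo() and tries each result in turn,
    # so address-family details stay hidden from the caller.
    return socket.create_connection((host, port), timeout=timeout)


def demo():
    # A throwaway local listener stands in for a remote server.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    client = open_net_path("/dev/tcp/127.0.0.1/%d" % port)
    server, _addr = listener.accept()
    client.sendall(b"ping")
    data = server.recv(4)
    for s in (client, server, listener):
        s.close()
    return data
```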

Re:RFC 1925 by dbIII · 2009-05-12 22:24 · Score: 2, Interesting

Yes, that's why I said "some". Just like the guys that wrote clustering software that is really just "rsh" and couldn't imagine anyone running it on a couple of thousand nodes, it looks like the author hit a case where it really should have been done another way. Good answer above; however, what I really was doing was trying to show a way that sockets can be used badly or used well.

Re:Which sockets API? by Anonymous Coward · 2009-05-12 23:36 · Score: 5, Interesting

The Berkeley socket API has stood up very well against the tests of time, and it is fairly lean and quite versatile, but yeah, there's definitely room for newcomers.

For example, when it comes to high packet rates - say, thousands of VoIP RTP streams - the length of the typical path a packet takes through the kernel layers becomes quite prohibitive.

I've been trying to reach gigabit ethernet saturation with G711 VoIP RTP streams (that is, 172-byte UDP packets @ 50Hz per stream), which works out to a theoretical maximum of 10500 streams - 525000 packets/second. My initial speed tests, with minor tweaking, got me around 1/10th of that, thanks to all the kernel overhead, and the lack of control over how and when packets will be sent.

So I wrote my own socket-> UDP-> IP-> ARP-> Ethernet abstraction which hooks directly into the PACKET_MMAP API (as used by libpcap), with the TX Ring patch, and with all the corner-cutting I managed to achieve 10000 streams (500k packets/sec) which equates to about 95% of the theoretical peak.
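
The figures above can be reproduced with a short calculation. This is a sketch under stated assumptions: the 172 bytes are taken to be the UDP payload (160 bytes of G.711 audio plus a 12-byte RTP header), carried over IPv4 on untagged Ethernet, with preamble and inter-frame gap counted against the link. None of these framing details are spelled out in the comment; they are chosen because they match its numbers.

```python
# Per-packet sizes in bytes. Everything except PAYLOAD is an assumption
# about the framing, not something stated in the comment.
PAYLOAD = 172          # RTP header (12) + 20 ms of G.711 audio (160)
UDP_HEADER = 8
IPV4_HEADER = 20
ETH_HEADER = 14        # untagged Ethernet, no VLAN
ETH_FCS = 4
PREAMBLE_SFD = 8
INTERFRAME_GAP = 12

LINK_BPS = 1_000_000_000   # gigabit Ethernet
PACKETS_PER_STREAM = 50    # 50 Hz, one packet every 20 ms


def wire_bytes_per_packet():
    """Bytes of link time consumed by one packet, including framing overhead."""
    return (PAYLOAD + UDP_HEADER + IPV4_HEADER + ETH_HEADER
            + ETH_FCS + PREAMBLE_SFD + INTERFRAME_GAP)


def max_packets_per_second():
    return LINK_BPS // (wire_bytes_per_packet() * 8)


def max_streams():
    return max_packets_per_second() // PACKETS_PER_STREAM


def utilisation(streams):
    """Fraction of the theoretical packet rate used by the given stream count."""
    return streams * PACKETS_PER_STREAM / max_packets_per_second()
```

Under these assumptions each packet occupies 238 bytes on the wire, which gives roughly 525,000 packets per second and about 10,500 streams; 10,000 streams is then about 95% of that peak, in line with the comment.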

In short, we probably need more widespread support for different network programming APIs which address more specific needs - BSD sockets are too generalised sometimes.

Re:RFC 1925 by ThePhilips · 2009-05-12 23:38 · Score: 2, Interesting

Of course, if someone can actually produce some real-world benchmarks that validate the "let's ditch Sockets" claim...

There are really few real world examples where you can do something better than sockets.

BSD sockets are quite a versatile API. I have programmed them on both sides - implementing my own protocol/address family and actually using them in a program - and hardly see how one can do it better while maintaining the level of guarantees provided by the API. And that level of guarantees is what makes it possible to develop applications that behave reliably and predictably under ever-varying conditions - and not lose your sanity in the process.

What many novices also forget is that sockets support a number of assertions an application can make about sync/async error handling. IOW, one can easily improve the performance of BSD sockets by simply removing error handling. But something tells me that no one's gonna do it.

--
All hope abandon ye who enter here.

Re:Which sockets API? by LSD-OBS · 2009-05-12 23:38 · Score: 3, Interesting

Stupid thing posted me anonymously despite being logged in!

--
Today's weirdness is tomorrow's reason why. -- Hunter S. Thompson

Real problems, or.... by countach · 2009-05-13 00:20 · Score: 2, Interesting

It seems to me that all the issues the author mentions could be solved with some library written over the top of sockets (and potentially other primitives like threads). Sockets are meant to be a low level interface, not to solve every problem.

The multi-home problem is real, but could be fixed with a relatively minor extension to the API, much as IPv6 support was added.

Re:RFC 1925 by Anonymous Coward · 2009-05-13 01:49 · Score: 1, Interesting

[*1] As with you, this is totally ignoring the security implications, etc.

If you can break our firewall and then escape with any significant fraction of our petabytes of information because our use of rsh is a security problem, then we will thank you and give you a job.

[*2] In no way is this a personal attack at you; I mean it in a purely academic sense. It's a very tall claim to say that decades of networking history, and thousands of talented engineers were wrong.

As implemented, sockets have limitations. On large scales, we run out of them. The number of file descriptors used to be an issue, now it's the number of sockets.

ssh is NOT an option, because the handshaking and key exchanges are orders of magnitude too slow to be of any use at large scale. And does anybody believe that ssh w/o a passphrase on the private key is more secure than hostbased rsh authentication on a private network?

Netbook standards do not apply to petascale computing.

Re:Which sockets API? by Luyseyal · 2009-05-13 02:52 · Score: 2, Interesting

Sounds like a new achievement "Too much karma: Enlightenment to Anonymous Cowardom"

-l

--
Help cure AIDS, cancer, and more. Donate your unused computer time to worldcommunitygrid.org. Join Team Slashdot!

Re:Open Transport, Part II by wealthychef · 2009-05-13 03:02 · Score: 3, Interesting

I found Open Transport to be a nightmare in practice. It did everything under the sun, so in order to just open a connection, send data, and tear it down, you had to do a bunch of stuff that I really could not understand as a beginning programmer. Maybe the documentation and usability have gotten better since then, or maybe I just wasn't smart enough. At any rate, sockets are easy to use, so I was glad when they switched to a Unix with sockets.

--
Currently hooked on AMP

Re:Structured Stream Transport by Panaflex · 2009-05-13 03:27 · Score: 2, Interesting

if you are loading a website over HTTP and you get stuck loading a huge image, you have no choice but to open up another socket connection or else wait

I think you're confusing the HTTP protocol with BSD sockets. Your example is an HTTP 1.0 limitation, check out HTTP pipelining.

A socket is at its most basic a read/write file handle. You can implement asynchronous handling, write your own protocol and do lots of extreme goodness. If you choose to be protocol stupid about how you transport your data then you live with the consequences.

As a network protocol engineer, you must look at minimum guaranteed latency, pick an average guaranteed bandwidth and tailor your protocol & packet sizes as necessary.

Writing a protocol is difficult when you care about performance and error handling.

IMHO, HTTP should have allowed a UDP pipelined transport mode. The overhead savings would have been worth the hassle.

--
I said no... but I missed and it came out yes.

Re:Structured Stream Transport by lewiscr · 2009-05-13 08:54 · Score: 2, Interesting

I'll always be NATing my home connection, even with IPv6. I assume my cable provider will charge me for those "extra" IPv6 IPs that I would be using. And if this one doesn't, the cable provider that buys this one will.

Horrible for multiple connections by harlows_monkeys · 2009-05-13 11:33 · Score: 3, Interesting

Sockets are very annoying when you have a lot of clients being served by one server. Consider, for instance, a chat server, with 25000 clients connected. You have 25000 sockets, one per client (plus a listen socket for new clients to connect to).

Whenever data arrives, the system has to somehow notify you that one of your sockets is ready to read. That generally involves some kind of polling, with select or poll, or some kind of interrupt mechanism, such as a signal. I'm leaving out some options, but regardless of how you get notified, you then read the data from the appropriate socket.

Then guess what happens? Most likely you take that data, wrap it in a data structure that tells you which client it was for, and stick it on a work queue, where the main thread or threads pull things to process.

Step back and look at what happened here:

The data from all 25000 clients comes in on a single interface.
The kernel goes to great effort to process this stream of data and split it up into separate streams for the 25000 clients.
You have to deal with that data coming into your server application via 25000 different sockets.
You put it all back into a single stream (your work queue) as a bunch of messages.
You pull the items from the work queue to process.

That's just insane! The kernel demultiplexed the incoming data, and the server just remultiplexed it when it put it onto the work queue. Demultiplexing belongs in the server application, not the kernel.

What I want is a single stream between my code and the kernel that delivers all the data for all 25000 clients. Whenever any client has data, I want to be able to read from that, and get back a message, that identifies which client it is from, and gives me that data.

The kernel should just be parsing the incoming TCP stream enough to recognize what port a given packet is for, and what client it came from, and then should queue it up into a single stream for the server handling that port. (The kernel has enough information from that to keep track, on a per client basis, of how much data is pending in the queue for the server app, so has what it needs to manage flow control).
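
For comparison, a single UDP socket already behaves this way: every read returns one message tagged with the sender's address, and the application does its own per-client bookkeeping. The sketch below is only an analogy under that assumption. It uses UDP on the loopback interface, so it has none of TCP's ordering, reliability, or flow control, and it is not the TCP interface the comment asks for.

```python
import socket


def demo(num_clients=3):
    # One server socket receives the traffic of every client.
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))
    server.settimeout(2.0)
    server_addr = server.getsockname()

    clients = [socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
               for _ in range(num_clients)]
    expected = {}
    for i, client in enumerate(clients):
        client.bind(("127.0.0.1", 0))
        message = ("hello from %d" % i).encode()
        client.sendto(message, server_addr)
        expected[client.getsockname()] = message

    # A single read loop: each message arrives with its sender attached,
    # so it can go straight onto a work queue keyed by client.
    seen = {}
    for _ in range(num_clients):
        data, sender = server.recvfrom(2048)
        seen[sender] = data

    for s in clients + [server]:
        s.close()
    return seen, expected
```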
