Scalable, Fault-Tolerant TCP Connections?

← Back to Stories (view on slashdot.org)

Scalable, Fault-Tolerant TCP Connections?

Posted by Cliff on Tuesday January 8, 2002 @08:02PM from the more-sockets-than-you-can-shake-a-server-at dept.

pauljlucas asks: "My company is developing custom server software for an instant messaging type server (under Solaris). Every client maintains a TCP connection to the server when it is 'logged in'; the server maintains state of who's logged in where. For large-scale deployment, there are two problems: scalability and fault-tolerance. A single server can handle at most around 64000 open sockets. To go beyond this, you need many servers. Another way would be to 'fake' a TCP stack in user-space (by reading/writing raw TCP packets) thus not having one real socket per connection. For fault-tolerance, ideally one would like N servers to maintain the exact state, at least for the server process, so that if one goes off line, the other(s) can pick up seamlessly. I'm thinking that both of these issues must have already been solved without having to write lots of custom software. Is anybody aware of off-the-shelf software and/or hardware solutions (either commercial or freeware)?"

13 of 41 comments (clear)

Min score:

Reason:

Sort:

LocalDirector by Huusker · 2002-01-08 22:53 · Score: 3, Interesting

You can get a TCP load balancer like Cisco LocalDirector or one its competing clones. They are expensive tho ($20,000)
1. Re:LocalDirector by Jeremiah+Cornelius · 2002-01-09 05:47 · Score: 2
  
  NO!!!!!
  Local Director is a horrible device. Unlike modern loadbalancers from F5, Alteon, Foundry, etc. the Local Defector is a layer-2 bridge. It cannot have more than one path to a given target, causes all the problems that bridges introduce into switched networks, and allows for potential security breaches, becuse it is commonly used to bridge between differing subnets!
  All of the above vendors provide a proxy/switch style solution for layer 3 and above. If you can afford F5's BigIP HA+ in fail-over, this is a dream! Host-based on Intel, with a customized *BSD. Unless you are a freak for IOS-type management, Unix admins will love this.
  Check out O'Reilly on Bridge-Path vs. Route-Path Server Load Balancing
  
  --
  "Flyin' in just a sweet place,
  Never been known to fail..."
Your actual problems are somewhat different by Twylite · 2002-01-09 00:06 · Score: 5, Insightful

You indicate that your scalability problem kicks in at around 64000 simultaneous clients. Having developed high-performance scalable servers I would recommend taking a look at The C10k Problem, which is rather sobering: handling 10000 simultaneous connections is already a bitch.

Basically if you are using select() or poll() and have 10000 connections, you can expect 30% of your timeslice (on a fast machine) to be taken scanning the connection list for available data. Your performance also goes down the drain after about 250 connections (empirical observation; our server handles async requests which are offloaded to a hardware device for processing, so most server overhead is packet handling).

To get more connections than that you need to look at OS-specific methods: IOCompletion ports on NT, /dev/poll on Linux & Solaris, and kqueue on FreeBSD. These scale must more linearly out to 10000 connections (depending of course on if your server can handle the total load), but still don't give you the ability to scale endlessly.

Finally, I know of no intrinsic reason (although OS implementations may have arbitrary limits) why a server should be restricted to 64k connections. A TCP/IP connection is defined by two endpoints, each being an IP address/port cobination. Your server's ip/port is always one of those endpoints, and the client are unique. Client connections on the other hand are limited by the number of available port on the computer, i.e. 64k.

For fault tolerance, your best bet is to look at the architecture of Enterprise Java servers. Effectively you have a load balancer up front which redirects packets to one of several application servers; any persistent information must be saved off to a database server, and if necessary two or more database servers must be configured to synchronise with each other on a continual basis.

I am not aware of any software that can magically do this for an arbitrary application, although you may find network shared memory libraries which can more or less accomplish this. But if you are concerned about performance, you are unlikely to find COTS software to do the job (since the current business model is to throw more computing power at it).

--
i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
1. Re:Your actual problems are somewhat different by markj02 · 2002-01-10 08:25 · Score: 2
  
  You are wrong: the 64k limit is intrinsic to TCP, both on the server and on the client side. When the server accepts a connection, it needs to keep track of which packets belong to which connection. Otherwise, the packets from multiple clients running on the same client machine couldn't be assigned to the correct stream. That bookkeeping is done with a 16bit number. Take a look at the protocols. If you actually complied with TCP/IP timeouts as originally specified, you can't even sustain anywhere near the connection rates that modern web servers achieve, and the shortened timeouts actually are rather bad from the point of view of robustness.
  In any case, the fact that many kernels fall on their face for far smaller numbers of connections is a result of simplistic data structures (linear lists, bit arrays, etc.). Why do kernel developers choose simplistic data structures? Beats me. Perhaps it's related to the fact that implementing and reusing good data structure libraries in C is just such a pain, but it's hard to say whether that's the cause of the problem or merely the consequence of the general mindset of kernel developers. In any case, there is no point in whining about the poor abstraction in many operating system kernels--obviously, nobody else wants to do the work. As long as kernel developers fix these problems when they come up in whatever way they like, everybody is happy. And they do fix them. Several UNIX kernels changed over from lists to hash tables in their network-related data structures when they hit performance limits, so whatever is wrong in your favorite kernel can be fixed by your favorite kernel hackers as well.
2. Re:Your actual problems are somewhat different by Twylite · 2002-01-11 01:08 · Score: 2
  
  Hi. No bug in the OS or the kernel, but in assumptions ;) Our server is tailored for performance under load, not for massive amounts of idle connections. In such a scenario the time taken by select() has a serious impact on the overall performance.
  
  In instant message by contrast, I will agree with you that the lighter processing load will be less affected by scanning (mostly idle) descriptors.
  
  --
  i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
3. Re:Your actual problems are somewhat different by Twylite · 2002-01-11 01:13 · Score: 2
  
  Hi. I'm afraid I still fail to understand how this shortcoming is intrinsic to TCP w.r.t server connections.
  
  When a client makes outbound connections, each connection must be from a unique port. A client cannot share a port used by a server, and port is a 16 bit value, so there are (65536 - server ports - 1 [0 is not addressable]) client connections permitted from any given machine.
  
  Now when it comes to the server, the situation is different. A server connection is identified by a single port on the server, irrespective of the client. The server distinguishes the client connection based on the client's IP and source port. That gives an intrinsic limitation of about 64k connections per client IP, and about 4G IPs.
  
  I see no reason why this 'bookkeeping' (when applied to server connections) has to be done with a 16 bit number. That sounds like an arbitrary operating system limitation. It would imply that an incoming packet is examined for IP and source port, that looked up in a table with max 64k entries, and the result taken to be the connection id. This makes no sense -- the kernel would look up the ip:port combination and get a unique stream identifier, which is an int (32 bit).
  
  Could you please explain clearly (in theory, or in terms of OS implementation) why this isn't the case? Thanks.
  
  --
  i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
4. Re:Your actual problems are somewhat different by Twylite · 2002-01-14 00:02 · Score: 2
  
  Hi. Just FYI, out performance testing results. We have three implementations of this server: NT IO-Completion ports, select() and multi-threaded.
  
  Testing was done using n identical clients, all flooding synchronous requests. i.e. each client sits in a "send, wait for reply" loop. The bottleneck is network bandwidth - we get sub-optimum performance except on Gb ethernet.
  
  The nature of our server is such that max. performance is reached at at 8-12 conenctions. The multi-threaded server has the best performance out to 20-25 connections. IO completion ports scales out to 500 connections, dropping only 3% in performance (compared to 12 conenctions). select() never tops the scales, but is consistently about 1% behind IO completion ports out to 120 connections, then drops back to 10% behind out to 250 connections, then drops off horribly.
  
  I will readily admit, however, that our implementation using select() could be more efficient. Even so, in architectural benchmarks it doesn't fare well beyond about 400 connections in a server doing IPC/RPC type transactions.
  
  --
  i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
You want linuxvirtualserver by pthisis · 2002-01-09 00:41 · Score: 3, Informative

First off, the Kegel's c10k page referenced earlier is definitely worth a read. And if you're under the impression that having 2^16 TCP port numbers limits you to 2^16 connections, that's not accurate. You can have hundreds of thousands of connections to one machine, presuming you manage them properly (as the c10k page points out).

More importantly, you should check out http://www.linuxvirtualserver.org/ ; it's aim is exactly what you want:

"The Linux Virtual Server is a highly scalable and highly available server built on a cluster of real servers, with the load balancer running on the Linux operating system. The architecture of the cluster is transparent to end users. End users only see a single virtual server."

Sounds like a perfect match.

Sumner

--
rage, rage against the dying of the light
1. Re:You want linuxvirtualserver by pthisis · 2002-01-09 00:44 · Score: 2
  
  Just to make it clear:
  
  "The advantage of the virtual server via NAT is that real servers can run any operating system that supports TCP/IP protocol, real servers can use private Internet addresses, and only an IP address is needed for the load balancer"
  
  So you can keep your application on Solaris with LVS.
  
  Sumner
  
  --
  rage, rage against the dying of the light
Did you consider UDP? by john@iastate.edu · 2002-01-09 02:46 · Score: 2

or did you just not mention why you rejected it?
If you're thinking of going to the trouble of simulating TCP with raw sockets, UDP seems a simpler alternative to that.

--
Shut up, be happy. The conveniences you demanded are now mandatory. -- Jello Biafra
He also asked for redundancy by jmaslak · 2002-01-09 03:17 · Score: 2

A load balancer does not give redundancy. With a load balancer, if a server dies, NEW connections are sent to a different server instead, but the existing connections to the down server all are closed - an external non-OS integrated solution like load balancing does not give transparent failover on TCP connections. It works for HTTP because browsers are used to connections suddenly dieing and will simply retry. But, if the client isn't smart enough to reconnect, it won't work.

The way to do this is to build a custom TCP stack and integrate it tightly into your app. A lot of work and hard to get right.

I would ask, "Do we REALLY" need this when our application already has to handle things like network failures? You might, though - I don't know what your application is.

Also, don't forget to use redundant routers, redundant firewalls, etc. If you use NAT, that imposes one more problem - transparently moving the connection table between the failed firewall and the working one.
Re:Forgot: clients talk UDP peer-to-peer by Ayende+Rahien · 2002-01-09 06:45 · Score: 2

Okay, let's see if I understand.

You've a instant messaging application, which is composed of a lot of clients and a central server.
The clients talk to the server only on start-up, but presumbely you want to keep the TCP connection open so you would know when a client goes offline.

That is not a good way to do it, not because you'll have problems with ports, you can have as many connections open as you want, after all, a TCP connection is identified by source_ip:port + dest_ip:port. But because of the overhead you'll encounter maintaining so many connections.

It's also not good to use UDP for peer-to-peer connections, why complicate your life with things that TCP already offers?

Persumbely, on start-up, a client calls the server, log-in, register its state (online,busy,D/A, etc), and asks for a list of names/ID of friends that it has.
You want to send it the IPs of those who are online, so it can send messages to them directly.
The way to do it, in my idea, is to handle it so:

Have a database of your clients, which would include username & password, an IP & state.
Another field should be a list of names that should be notified when this client goes online/change status.

When the clients calls in, authenticate it, and register it state in the DB, send it the list of IPs & states of the people online that he requested.
And register its state in the DB. Notify everyone that requested to be notified and is currently online that the user went online.
Cut the TCP connection. **
When the clients closes, have it send a message saying "I'm going offline".
Then change the state of it in the DB & inform the users linked to it. **

That way, the only connections that you've is of clients starting, changing status & shuting down.
That should lower your load considerably.

As for scalability & fault tolerance, just put it behind a load balancer, that way, if a server is busy or down, the request goes to another server, all of them are linked to the same DB, so the state is being preserved.
If a server goes down in the middle of a request, then the client should be smart enough to recover, and try again (that is what browsers do, and why load balancers works so well for HTTP).
Be sure to make the client stop after a couple of failed tries, though, you don't want to overload the network in case all your servers down for some reason.

What about errors, you ask? If a client is being terminate or disconnected without having a chance to inform you?

Well, the way I would do it is let other clients discover that.
If a client can't form a direct connection to another client, it should tell the server about this.
The server would try to reach the client by himself, and if he fails, would register that client as offline, store the request for when that user goes online again/discard it, and tell all the clients that are linked to the failed client about it [***].
I think it's much better than have the server poll at possibly hundreds of thousands of connections. (Or more, if you are lucky.)
After all, it doesn't matter if a client that no one is trying to call is offline while it's marked online. #

[**]
If you care about bandwidth/load, have the clients maintain a list of the people/or give it to him during log-on, and have the *client* do the notifications. On start-up, shut-down & status chagne. You'll have to handle the errors yourself, though. ***

[***]
You can be really nasty and have the client that discovered the error inform the rest of the clients that are linked to the failed client that it's gone. But that would require that client to have the list of people that want the failed client's link, which can be bad from privacy point of view.

[#]
If it does matter to you, have the clients poll the other clients in their list every hour or so, it would balance out so you wouldn't have too much lost connections marked as alive.

--

--
Two witches watched two watches.
Which witch watched which watch?
Re:Forgot: clients talk UDP peer-to-peer by Ayende+Rahien · 2002-01-13 07:55 · Score: 2

> No, we want to keep the TCP connection open so a client knows when it has an incoming message instantly.

Why do it via the *server*, anyway? If you have a message from one client to another, then transfer it directly from one client to another, not through the server. That way, the client is aware instantly, and you aren't wasting bandwidth by transferring the data.

--

--
Two witches watched two watches.
Which witch watched which watch?