Scalable, Fault-Tolerant TCP Connections?
pauljlucas asks:
"My company is developing custom server software for an instant
messaging type server (under Solaris). Every client maintains a
TCP connection to the server when it is 'logged in'; the server
maintains state of who's logged in where. For large-scale
deployment, there are two problems: scalability and fault-tolerance.
A single server can handle at most around 64000 open sockets. To go
beyond this, you need many servers. Another way would be to 'fake' a
TCP stack in user-space (by reading/writing raw TCP packets) thus not
having one real socket per connection. For fault-tolerance, ideally
one would like N servers to maintain the exact state, at least for
the server process, so that if one goes off line, the other(s) can
pick up seamlessly. I'm thinking that both of these issues must have
already been solved without having to write lots of custom software.
Is anybody aware of off-the-shelf software and/or hardware solutions
(either commercial or freeware)?"
It would be possible to implement a server-side solution, using TCP, but you'd need to keep track of things like sequence numbers and window sizes to move a TCP connection seamlessly between two computers (meaning you'd probably need to write a TCP stack).
You can get a TCP load balancer like Cisco LocalDirector or one its competing clones. They are expensive tho ($20,000)
You indicate that your scalability problem kicks in at around 64000 simultaneous clients. Having developed high-performance scalable servers I would recommend taking a look at The C10k Problem, which is rather sobering: handling 10000 simultaneous connections is already a bitch.
Basically if you are using select() or poll() and have 10000 connections, you can expect 30% of your timeslice (on a fast machine) to be taken scanning the connection list for available data. Your performance also goes down the drain after about 250 connections (empirical observation; our server handles async requests which are offloaded to a hardware device for processing, so most server overhead is packet handling).
To get more connections than that you need to look at OS-specific methods: IOCompletion ports on NT, /dev/poll on Linux & Solaris, and kqueue on FreeBSD. These scale must more linearly out to 10000 connections (depending of course on if your server can handle the total load), but still don't give you the ability to scale endlessly.
Finally, I know of no intrinsic reason (although OS implementations may have arbitrary limits) why a server should be restricted to 64k connections. A TCP/IP connection is defined by two endpoints, each being an IP address/port cobination. Your server's ip/port is always one of those endpoints, and the client are unique. Client connections on the other hand are limited by the number of available port on the computer, i.e. 64k.
For fault tolerance, your best bet is to look at the architecture of Enterprise Java servers. Effectively you have a load balancer up front which redirects packets to one of several application servers; any persistent information must be saved off to a database server, and if necessary two or more database servers must be configured to synchronise with each other on a continual basis.
I am not aware of any software that can magically do this for an arbitrary application, although you may find network shared memory libraries which can more or less accomplish this. But if you are concerned about performance, you are unlikely to find COTS software to do the job (since the current business model is to throw more computing power at it).
i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
No matter what you do with the software to handle a giant number of connections, you still have the physcial limits of the machine don't you? The NIC and CPU can only do so much; so isn't that going to be a bottel neck for you?
It also seems like keeping the state information about 64K+ connections for whatever they are being used for has to involve some kind of overhead as well. You'd really need some efficent way to deal with organziing it all so its reasonably easy and fast to access.
You can't handle each connection with its own thread, but even if you break it up into several threads or several processes each handling a couple thousand connections (polling or something similar), thats got to have a lot of latency as well if you expect a large number of these connections to be active most of the time.
I could be wrong, I don't know exactly what you're trying to do. Maybe the connections are mostly idle. But I think that you are probably looking at more than one bottle neck forcing a single computer to do all the work.
In general, trying to customize a single machine, or single program to scale isn't usually a good solution. You'd probably be better to find a way to design the software to work with a number of machines. Most large websites have load balancers that distribute the requests to several machines. Large services typically do this, or rely on the fact that all users probably won't be connected all at once.
-- Eric
...but you should feel free to tell me why it sucks.
If the server must keep state on all connections simultaneously, and you must have TCP connections, and you don't want to keep the state information in a database, why not create sub-servers which take client TCP connections and aggregate them into a few server-to-server streams?
A load balancer could be used in this configuration, but if you can throw cheap Netra X1's at it, maybe you could get by with random DNS choice balancing instead (I bet CNN does a part of load-balancing this way -- look at all the A records for www.cnn.com).
Just add a new Netra X1 for every 2500 users or so, and you'll be set.
Now, the rest of you: Tell me why this is stupid.
dw
Fault-detecting switches (Alteon, Big5, Cisco Local Director) do transparent switching - as do software HA solutions (Rainfinity, Linux Virtual Server).
BUT...
...when the connection is being switched to a new server (for load balancing or failure switchover), the server's application does not know the connection and thus the switched one will be rejected. You will need to program your application to keep and distribute state of the connection. If you do stateless applications (e.g. web-based), then at least the (next) IP-connection will have a clean switchover on the srever side.
First off, the Kegel's c10k page referenced earlier is definitely worth a read. And if you're under the impression that having 2^16 TCP port numbers limits you to 2^16 connections, that's not accurate. You can have hundreds of thousands of connections to one machine, presuming you manage them properly (as the c10k page points out).
More importantly, you should check out http://www.linuxvirtualserver.org/ ; it's aim is exactly what you want:
"The Linux Virtual Server is a highly scalable and highly available server built on a cluster of real servers, with the load balancer running on the Linux operating system. The architecture of the cluster is transparent to end users. End users only see a single virtual server."
Sounds like a perfect match.
Sumner
rage, rage against the dying of the light
Hence, the server sees very little traffic in comparison.
If you reply, do so only to what I explicitly wrote. If I didn't write it, don't assume or infer it.
Why not just have the clients send 'keep alives' every 20 seconds or so? And I'd definitely suggest using some of the above mentioned hardware TCP limiters.....
If you're thinking of going to the trouble of simulating TCP with raw sockets, UDP seems a simpler alternative to that.
Shut up, be happy. The conveniences you demanded are now mandatory. -- Jello Biafra
A load balancer does not give redundancy. With a load balancer, if a server dies, NEW connections are sent to a different server instead, but the existing connections to the down server all are closed - an external non-OS integrated solution like load balancing does not give transparent failover on TCP connections. It works for HTTP because browsers are used to connections suddenly dieing and will simply retry. But, if the client isn't smart enough to reconnect, it won't work.
The way to do this is to build a custom TCP stack and integrate it tightly into your app. A lot of work and hard to get right.
I would ask, "Do we REALLY" need this when our application already has to handle things like network failures? You might, though - I don't know what your application is.
Also, don't forget to use redundant routers, redundant firewalls, etc. If you use NAT, that imposes one more problem - transparently moving the connection table between the failed firewall and the working one.
A single server can handle at most around 64000 open sockets.
If my memory serves me correctly one can easily break the 64K limit by using multiple processes. This wasn't on Solaris, though.
Is anybody aware of off-the-shelf software and/or hardware solutions (either commercial or freeware)?
Assuming you have control over the protocol, I've written such a server. I don't have the code, but my non-compete agreement has expired.
I have personally opened over 1.6 million
simultaneous connections to a well tuned FreeBSD 4.3 machine.
I don't understand your "64,000" limit. Are you perhaps talking about outbound connections not bound to a particular outbound IP address? It's true that the port space would be limited to 64K connection, in that case, but the easy answer to that is to bind the connections to a particular IP address on the way out, and then us multiple IP addresses...
For fault-tolerance (at least, network fault tolerance), you should consider using SCTP. The Stream Control Transmission Protocol is designed to utilize multiple network paths to accomodate network failures. It doesn't quite have the ubiquity that TCP does, but there is work to include it in the Linux Kernel.
Okay, why not use jabber =) It's even Open Source, and also a commercial version if I remember right. Clients for just about ANY platform, and you can even store messages and do authentication from an LDAP server. It even supports SMS paging =) www.jabber.org or www.jabber.com