Writing High-Availability Services?

Posted by Cliff on Tuesday April 15, 2003 @10:23AM from the keeping-your-services-from-falling-over dept.

bigattichouse asks: "I have a project coming up that will require some serious load capabilities accepting socket connections. While I have a design that can be distributed over multiple servers (using queued reads/writes to the db) and is as low-overhead as I can make it, I am concerned about falling into common problems that may have been overcome in many other projects. What strategies (threading, forks, etc.) give the best capability? What common pitfalls should I avoid?"

7 of 21 comments

One common pitfall ... by Anonymous Coward · 2003-04-15 10:30 · Score: 3, Informative

... is attempting to parallelize a program that would otherwise have been more efficient had it just been kept serial.

All too often I've read the argument: "Oh, performance isn't good, so I'll parallelize it". That doesn't hold much weight, as not all things are efficiently parallelizable.

So, before anyone suggests that you start pthread_create()ing threads everywhere, give some serious thought as to maxing out the serial performance first.
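
One way to act on this advice is to measure before deciding. The following is a minimal sketch, not a benchmark: `checksum` is a hypothetical stand-in for a cheap per-request computation, and the pool size of 4 is an arbitrary assumption. It runs the same work serially and through a thread pool so the two timings can be compared on your own workload. In CPython, CPU-bound work like this generally gains nothing from threads, which makes the overhead easy to see.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def checksum(chunk: bytes) -> int:
    """Hypothetical unit of work: a cheap per-request computation."""
    return sum(chunk) % 65521


def run_serial(chunks):
    """Process every chunk in the calling thread."""
    return [checksum(c) for c in chunks]


def run_threaded(chunks, workers=4):
    """Process the same chunks through a thread pool (worker count is an assumption)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(checksum, chunks))


def timed(fn, *args):
    """Return (result, elapsed seconds) for one call of fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


chunks = [bytes([i % 256]) * 64 for i in range(2000)]
serial_result, serial_secs = timed(run_serial, chunks)
threaded_result, threaded_secs = timed(run_threaded, chunks)
print(f"serial:   {serial_secs:.4f}s")
print(f"threaded: {threaded_secs:.4f}s")
```

The numbers will differ from machine to machine; the point is only that both versions produce the same results, so the timings are a fair comparison.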
1. Re:One common pitfall ... by gnuadam · 2003-04-18 04:44 · Score: 2, Informative
  
  I wouldn't say this is entirely good advice.
  
  When you carefully optimize your code to achieve maximum serial performance, you get just that: maximum serial performance.
  
  The algorithm that achieves maximal parallel performance, in my experience, is often quite different. What you really need to do is to carefully plan your code to get the maximum benefit from the resources you have available.
  
  If you want to design a parallel code, start with that assumption, not from the standpoint of parallelizing a serial code.
  
  --
  You say :wq, I say ZZ. Why can't we all just get along?
Beware slow connections by linuxwrangler · 2003-04-15 10:42 · Score: 4, Informative

In a former job we totally hammered an app on our internal LAN and got many times the request rate we would need in the real world.

Fat, dumb and happy, we figured that the real world couldn't hammer us as hard as we could internally. Wrong! Slow connections require maintaining connection resources much longer than on an internal network, where the response can be created and dispensed with almost instantly.

Maintaining all those simultaneous connections depleted our resources and the app went into full meltdown mere seconds after being released on the public servers.

We beat a hasty retreat to the old code, licked our wounds, and learned a valuable lesson.
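
A load test can reproduce this effect by throttling the client side. The sketch below is an illustration under assumptions, not the poster's setup: `slow_send`, `read_exactly` and `measure_hold_time` are hypothetical helpers, and the chunk size and delay are arbitrary. It pushes a payload through a local socket pair in small, delayed chunks and reports how long the receiving side stays tied up.

```python
import socket
import threading
import time


def slow_send(sock, payload, chunk_size=16, delay=0.01):
    """Send payload in small chunks with a pause after each, like a slow link."""
    for i in range(0, len(payload), chunk_size):
        sock.sendall(payload[i:i + chunk_size])
        time.sleep(delay)


def read_exactly(sock, n):
    """Read until n bytes have arrived or the peer closes the connection."""
    buf = bytearray()
    while len(buf) < n:
        data = sock.recv(n - len(buf))
        if not data:
            break
        buf.extend(data)
    return bytes(buf)


def measure_hold_time(payload, chunk_size, delay):
    """Return (received bytes, seconds the receiving side was occupied)."""
    client, server = socket.socketpair()
    try:
        sender = threading.Thread(
            target=slow_send, args=(client, payload, chunk_size, delay)
        )
        start = time.perf_counter()
        sender.start()
        received = read_exactly(server, len(payload))
        elapsed = time.perf_counter() - start
        sender.join()
        return received, elapsed
    finally:
        client.close()
        server.close()
```

Multiplying the measured hold time by the expected number of simultaneous clients gives a rough idea of how many connections the server must keep open at once, which is the resource that ran out in the story above.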

--

~~~~~~~
"You are not remembered for doing what is expected of you." - Atul Chitnis
The C10K problem by Panoramix · 2003-04-15 10:55 · Score: 4, Informative

You probably know about this paper already, but just in case you don't:

The C10K problem

The paper deals with web servers handling ten thousand simultaneous TCP connections. Most of it is not specific to HTTP or web problems, though; it covers more general socket I/O issues, particularly the ways of dealing with readiness/error notifications (e.g. select(), poll(), asynchronous signals, etc.). It also discusses other kinds of limits (threads, processes, descriptors).

It is quite enlightening. It may be a bit outdated --I remember reading it about the time Netcraft was doing all that noise about Windows being faster than Linux as a web server-- but I'm sure most of it is very relevant.
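
The descriptor limit mentioned above is easy to check before a deployment. This is a small sketch assuming a Unix-like system (Python's `resource` module is not available on Windows); `descriptor_headroom` and the `reserved` margin of 64 descriptors for logs, database handles and the like are hypothetical choices, not figures from the paper.

```python
import resource


def descriptor_headroom(expected_connections, reserved=64):
    """Compare the per-process file descriptor limit with the expected load.

    `reserved` is an assumed allowance for non-socket descriptors.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = expected_connections + reserved
    fits = soft == resource.RLIM_INFINITY or soft >= needed
    return {"soft": soft, "hard": hard, "needed": needed, "fits": fits}


info = descriptor_headroom(10_000)
print(info)
```

If `fits` is false, the soft limit has to be raised (up to the hard limit) before the process can hold that many sockets open, regardless of which readiness mechanism it uses.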
Multiple strategies for HA systems by isj · 2003-04-15 13:06 · Score: 3, Informative
The goal of HA is usually that the end user or the client applications never detect that part of the system has been down. One strategy is:
- Separate the system into components
- For each component:
  Devise a mechanism for dealing with the situation where the component is unavailable for several hours. If that is not possible, you must implement redundancy.
Another (or additional) strategy is to implement self-monitoring. The components should monitor themselves for faults, and optionally monitor other components and restart them if necessary. The gotcha here is not to mask any errors from any higher-level monitoring system.
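
The restart-without-masking idea can be sketched in a few lines. This is an illustration only: `supervise`, `component` and `report` are hypothetical names, and a real supervisor would typically manage separate processes and add a back-off between restarts. The point shown is that every failure is passed to `report` before the restart, so a higher-level monitor still sees it.

```python
def supervise(component, max_restarts, report):
    """Run component(); on failure, report the error and then restart it.

    Returns the component's result. Once the restart budget is spent,
    the last exception is re-raised instead of being swallowed.
    """
    attempts = 0
    while True:
        try:
            return component()
        except Exception as exc:
            attempts += 1
            # Report first: restarting must not hide the fault from monitoring.
            report(attempts, exc)
            if attempts > max_restarts:
                raise
```

A usage example under the same assumptions: `supervise(run_worker, max_restarts=3, report=log_failure)`, where `run_worker` and `log_failure` are whatever callables your system provides.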
You also need error detection and recovery in all components.
One thing that sometimes really bites you with TCP is the long time it takes to detect that a connection is broken. You need application-layer keep-alives to detect this rapidly. Changing the kernel parameters for TCP timeouts may be necessary too.
Finally, you may want to have a look at Self-healing servers
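
The application-layer keep-alives mentioned in this post can be sketched as a last-seen table with a timeout. Everything here is an assumption for illustration: the `HeartbeatMonitor` class, the peer names and the timeout values are hypothetical, and the wire format of the heartbeat itself is left out. The second helper shows per-socket TCP keep-alive options as one alternative to changing kernel-wide parameters; the finer-grained options are only set where the platform exposes them.

```python
import socket
import time


class HeartbeatMonitor:
    """Track when each peer was last heard from and flag the silent ones."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence before a peer counts as dead
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}

    def beat(self, peer):
        """Record that a heartbeat (or any message) arrived from peer."""
        self.last_seen[peer] = self.clock()

    def dead_peers(self):
        """Return the peers that have been silent for longer than the timeout."""
        now = self.clock()
        return sorted(p for p, t in self.last_seen.items() if now - t > self.timeout)


def enable_tcp_keepalive(sock, idle=30, interval=5, probes=3):
    """Turn on TCP keep-alive for one socket; values are assumed defaults."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These constants exist on Linux; other platforms may lack some of them.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        if hasattr(socket, name):
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)
```

In a server loop, `beat` would be called whenever a message arrives and `dead_peers` polled on a timer, closing or failing over any connection it returns.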
use erlang by sesquiped · 2003-04-15 17:49 · Score: 2, Informative

Erlang makes writing applications like this much much easier than in any other language or framework I've seen.

Check out this tutorial on making a fault-tolerant server in Erlang.
Slow connections, and lots of 'em! by Pierre+Phaneuf · 2003-04-16 03:17 · Score: 3, Informative

I like a design with a single thread/process per CPU, where each thread/process uses event-driven I/O to operate. A few things to keep in mind:

Never forget how a lot of idle connections can kill you, for example a thousand people connecting to your fast server over 56k modems, sucking only a packet now and then. If you have a thread/process-per-connection design, like Apache, you'll get screwed real hard when you have a bazillion threads/processes doing *almost* (but not quite) nothing, swamping the I/O scheduler and context switching like mad. If you use a select/poll-based approach, scanning all these inactive file descriptors to find the few that are readable/writable wastes a lot of time. Check out the new epoll stuff or Ben LaHaise's callback-based AIO interface.

You should use something like libevent or liboop to abstract your event loop, so that you can use select/poll on old or unpatched kernels, but still use epoll and other fancy event dispatching mechanisms on your production servers.
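
As a rough illustration of that abstraction, here is a single-threaded echo server sketch in Python, whose standard `selectors.DefaultSelector` picks epoll or kqueue where available and falls back to poll/select otherwise. It is not libevent or liboop; `make_listener`, `serve`, `service`, the buffer size and the poll timeout are all assumptions made for the example.

```python
import selectors
import socket


def make_listener(host="127.0.0.1", port=0):
    """Create a non-blocking listening socket (port 0 lets the OS choose)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    sock.setblocking(False)
    return sock


def drop(sel, conn):
    """Stop watching a connection and close it."""
    sel.unregister(conn)
    conn.close()


def service(sel, key, events):
    """Handle readiness on one client; key.data holds its pending output."""
    conn, outbuf = key.fileobj, key.data
    if events & selectors.EVENT_READ:
        try:
            data = conn.recv(4096)
        except BlockingIOError:
            data = None                      # spurious wakeup, nothing to read
        except ConnectionError:
            data = b""
        if data == b"":                      # peer closed or reset
            drop(sel, conn)
            return
        if data:
            outbuf.extend(data)              # echo: queue what we received
    if outbuf:
        try:
            sent = conn.send(outbuf)
            del outbuf[:sent]
        except BlockingIOError:
            pass                             # send buffer full, retry when writable
        except ConnectionError:
            drop(sel, conn)
            return
    # Ask for write readiness only while output is still pending.
    wanted = selectors.EVENT_READ | (selectors.EVENT_WRITE if outbuf else 0)
    sel.modify(conn, wanted, data=outbuf)


def serve(listener, sel, should_stop, poll_timeout=0.05):
    """Run the event loop until should_stop() returns True."""
    sel.register(listener, selectors.EVENT_READ, data=None)
    try:
        while not should_stop():
            for key, events in sel.select(timeout=poll_timeout):
                if key.data is None:         # the listening socket is readable
                    try:
                        conn, _addr = key.fileobj.accept()
                    except BlockingIOError:
                        continue
                    conn.setblocking(False)
                    sel.register(conn, selectors.EVENT_READ, data=bytearray())
                else:
                    service(sel, key, events)
    finally:
        for key in list(sel.get_map().values()):
            sel.unregister(key.fileobj)
            if key.fileobj is not listener:
                key.fileobj.close()
```

An idle connection costs this loop one registered descriptor and an empty buffer rather than a thread, which is the property the post is arguing for. Running one such loop per CPU, as suggested above, would mean starting several processes that each call `serve`.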

Here are a few URLs for you:

http://kegel.com/c10k.html
http://pl.atyp.us/content/tech/servers.html