Writing High-Availability Services?

Posted by Cliff on Tuesday April 15, 2003 @10:23AM from the keeping-your-services-from-falling-over dept.

bigattichouse asks: "I have a project coming up that will require some serious load capabilities accepting socket connections. While I have a design that can be distributed over multiple servers (using queued reads/writes to the db) and is as low-overhead as I can make it, I am concerned about falling into common problems that have already been overcome in many other projects. What strategies (threading, forks, etc.) give the best capability? What common pitfalls should I avoid?"

21 comments

NIH? by cpeterso · 2003-04-15 10:26 · Score: 1, Insightful

Why do you need to reinvent the wheel? There are plenty of other high-performance web/application servers that connect to databases.

--
cpeterso
1. Re:NIH? by Pseudonym · 2003-04-15 18:36 · Score: 1
  
  I re-read the article several times, and not once did I see the original poster say s/he was writing a web server. Did I miss something?
  
  --
  sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
2. Re:NIH? by demian031 · 2003-04-17 04:09 · Score: 1
  
  you're exactly right. this has been solved before. since it's not the gay-90s i suggest you look at an application server or corba or some other proven distributed solution.
  instead of worrying about connection pools, socket protocols and the like you could do something 'nutty' like solving business problems. ...just an idea.
One common pitfall ... by Anonymous Coward · 2003-04-15 10:30 · Score: 3, Informative

... is attempting to parallelize a program that would otherwise have been more efficient had it just been kept serial.

All too often I've read the argument: "Oh, performance isn't good, so I'll parallelize it". That doesn't hold much weight, as not all things are efficiently parallelizable.

So, before anyone suggests that you start pthread_create()ing threads everywhere, give some serious thought as to maxing out the serial performance first.
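One rough way to sanity-check this before threading anything is Amdahl's law, which the comment doesn't name but which is the usual back-of-envelope bound. The sketch below assumes you have already measured what fraction of the run time could run in parallel; the function name and the sample fractions are made up for illustration.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of the work can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Illustrative fractions only; measure your own program to get real ones.
    for fraction in (0.5, 0.9, 0.99):
        for workers in (2, 8, 64):
            print(f"{fraction:.0%} parallel, {workers:2d} workers: "
                  f"{amdahl_speedup(fraction, workers):5.2f}x at best")
```

If half the run time is inherently serial, no number of threads gets you to 2x, so the serial half is where the effort should go first.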
1. Re:One common pitfall ... by Anonymous Coward · 2003-04-15 18:56 · Score: 0
  
  Yes, but once you parallelize it, you can get another speed boost by serialising it!
2. Re:One common pitfall ... by gnuadam · 2003-04-18 04:44 · Score: 2, Informative
  
  I'd not say this is perfectly good advice.
  
  When you carefully optimize your code to achieve maximum serial performance, you get just that, maximum serial performance.
  
  The algorithm that achieves maximal parallel performance, in my experience, is often quite different. What you really need to do is to carefully plan your code for maximum benefit in the resources you have available.
  
  If you want to design a parallel code, start with that assumption, not from the standpoint of parallelizing a serial code.
  
  --
  You say :wq, I say ZZ. Why can't we all just get along?
Beware slow connections by linuxwrangler · 2003-04-15 10:42 · Score: 4, Informative

In a former job we totally hammered an app on our internal LAN and got many times the request rate we would need in the real world.

Fat, dumb and happy, we figured that the real world couldn't hammer us as hard as we could internally. Wrong! Slow connections require maintaining connection resources much longer than on an internal network, where the response can be created and dispensed with almost instantly.

Maintaining all those simultaneous connections depleted our resources and the app went into full meltdown mere seconds after being released on the public servers.

We beat a hasty retreat to the old code, licked our wounds, and learned a valuable lesson.
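The arithmetic behind that meltdown can be sketched with Little's law (average connections open = arrival rate x time each connection is held). The comment doesn't mention the law, and every number below is hypothetical; it only illustrates why a LAN test can look fine at ten times the real request rate and still say nothing about slow clients.

```python
def connections_in_flight(requests_per_second: float, hold_time_ms: float) -> float:
    """Average number of simultaneously open connections (Little's law)."""
    return requests_per_second * hold_time_ms / 1000.0


if __name__ == "__main__":
    # Hypothetical numbers: a LAN test at a high rate vs. slow real-world clients.
    lan = connections_in_flight(requests_per_second=2000, hold_time_ms=5)
    wan = connections_in_flight(requests_per_second=200, hold_time_ms=8000)
    print(f"LAN test: about {lan:.0f} connections open at once")
    print(f"Slow clients: about {wan:.0f} connections open at once")
```

So a load test should throttle or delay some of its clients on purpose, not just crank up the request rate.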

--

~~~~~~~
"You are not remembered for doing what is expected of you." - Atul Chitnis
The C10K problem by Panoramix · 2003-04-15 10:55 · Score: 4, Informative

You probably know about this paper already, but just in case you don't:

The C10K problem

The paper deals with web servers handling ten thousand simultaneous TCP connections. Most of it, though, is not specific to HTTP or the web; it covers more general socket I/O issues, particularly the ways of dealing with readiness/error notifications (e.g. select(), poll(), asynchronous signals, etc.). It also discusses other kinds of limits (threads, processes, descriptors).

It is quite enlightening. It may be a bit outdated --I remember reading it about the time Netcraft was doing all that noise about Windows being faster than Linux as a web server-- but I'm sure most of it is very relevant.
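For anyone who hasn't seen readiness notification in practice, here is a minimal sketch using Python's standard selectors module. It is not taken from the paper; the echo behaviour and the function name are just placeholders, and local socket pairs stand in for real client connections.

```python
import selectors
import socket


def serve_ready_sockets(selector: selectors.BaseSelector, timeout: float = 0.1) -> int:
    """Run one pass of the event loop; return how many sockets were serviced."""
    serviced = 0
    for key, _events in selector.select(timeout):
        conn = key.fileobj
        data = conn.recv(4096)
        if data:
            # Echo back; a real server would parse a request here. A large
            # reply on a non-blocking socket would also need write buffering.
            conn.sendall(data)
        else:
            # An empty read means the peer closed the connection.
            selector.unregister(conn)
            conn.close()
        serviced += 1
    return serviced


if __name__ == "__main__":
    selector = selectors.DefaultSelector()
    clients = []
    for _ in range(100):
        client, server_side = socket.socketpair()
        server_side.setblocking(False)
        selector.register(server_side, selectors.EVENT_READ)
        clients.append(client)

    clients[7].sendall(b"ping")  # only one of the 100 connections is active
    print("serviced:", serve_ready_sockets(selector))
    print("reply:", clients[7].recv(4096))
```

One thread watches all hundred connections and only touches the one that has data, which is the whole point of the readiness-based designs the paper compares.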
What kind of system? by Pyromage · 2003-04-15 11:25 · Score: 2, Interesting

In general, there are many things you can do. Pooling, caching, etc. can help in many situations. But what situation are you in?

Are you writing a web app where you have to hold session data across TCP connections?

Are you writing an app that will have sustained connections (more than one request per connection)?

These different situations require different strategies.

Are DB reads or writes more common? How big's the difference?

What kind of system is your target? Can you trade memory for speed (caching)?

Take a look at SEDA http://seda.sourceforge.net. While you probably won't be rewriting your app to use this framework, many of the strategies may be useful and applicable to your app.

Also, just note the difference between efficient and scalable: some designs will take longer than others on short loads, but many of those make tradeoffs that are only noticeable under high stress. Consider what tradeoffs you've made so far: some may be good or bad, and more may need to be made.

All this was said without knowledge of what your app is other than a DB app. I am not an expert, but I doubt an expert could say all that much with that little information.
Multiple strategies for HA systems by isj · 2003-04-15 13:06 · Score: 3, Informative
The goal of HA is usually that the end user or the client applications never detect that part of the system has been down. One strategy is:

- Separate the system into components.
- For each component, devise a mechanism for dealing with the situation where the component is unavailable for several hours. If that is not possible, you must implement redundancy.

Another (or additional) strategy is to implement self-monitoring. The components should monitor themselves for faults, and optionally monitor other components and restart them if necessary. The gotcha here is not to mask any errors from any higher-level monitoring system.

You also need error detection and recovery in all components.

One thing that sometimes really bites you with TCP is the long time it takes to detect that a connection is broken. You need application-layer keep-alives to detect this rapidly. Changing the kernel parameters for TCP timeouts can be necessary too.

Finally, you may want to have a look at Self-healing servers.
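The application-layer keep-alive idea mentioned above can be sketched as a small bookkeeping class: note the time whenever anything arrives from a peer, and treat a peer as dead once it has been silent for longer than a timeout you choose. The class name, method names and the 5-second timeout are all hypothetical; how the keep-alive messages are actually sent is left to the protocol.

```python
import time


class HeartbeatMonitor:
    """Tracks when each peer was last heard from and flags the silent ones."""

    def __init__(self, timeout_seconds: float, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = {}

    def heard_from(self, peer: str) -> None:
        """Call this whenever any message, including a keep-alive, arrives."""
        self._last_seen[peer] = self._clock()

    def silent_peers(self) -> list:
        """Peers that have sent nothing for longer than the timeout."""
        now = self._clock()
        return sorted(peer for peer, seen in self._last_seen.items()
                      if now - seen > self.timeout_seconds)

    def forget(self, peer: str) -> None:
        """Stop tracking a peer once its connection has been torn down."""
        self._last_seen.pop(peer, None)


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=5.0)
    monitor.heard_from("worker-1")
    print(monitor.silent_peers())  # nothing has had time to go silent yet
```

The timeout is then under the application's control (seconds) instead of depending on kernel TCP defaults, which can take far longer to notice a dead peer.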
In one case.... by OwnerOfWhinyCat · 2003-04-15 16:06 · Score: 2, Interesting

Though Pyromage's criteria requests are vital to making good suggestions, I had a high-burst rate problem for a server application that I solved slightly "out of the box." Since I wrote the client as well, I switched from the "connectioned" TCP interface to the "connectionless" UDP one. Since my application had to track the state of every pending request in any case, going to the connectionless protocol only meant adding 4 more states. This cut the kernel overhead significantly, and the total packet counts went down by half.
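A rough sketch of what request tracking over UDP can look like, for illustration only: each datagram carries a request id, the client keeps a table of pending requests, and a reply is matched (or a timeout noticed) against that table. The 4-byte header, the function names and the upper-casing toy server are all invented here and are not the poster's protocol.

```python
import socket
import struct

HEADER = struct.Struct("!I")  # 4-byte request id, network byte order (made-up wire format)


def send_request(sock, server_addr, request_id, payload, pending):
    """Send one datagram and remember it so it can be matched or retried."""
    sock.sendto(HEADER.pack(request_id) + payload, server_addr)
    pending[request_id] = payload


def receive_reply(sock, pending, timeout=1.0):
    """Wait for one reply; return (request_id, body), or None on timeout."""
    sock.settimeout(timeout)
    try:
        datagram, _addr = sock.recvfrom(65535)
    except socket.timeout:
        return None  # caller may resend whatever is still pending
    (request_id,) = HEADER.unpack_from(datagram)
    if pending.pop(request_id, None) is None:
        return None  # duplicate or unknown reply: ignore it
    return request_id, datagram[HEADER.size:]


def answer_one(sock):
    """Toy server step: read one request and echo it back upper-cased."""
    datagram, client_addr = sock.recvfrom(65535)
    (request_id,) = HEADER.unpack_from(datagram)
    sock.sendto(HEADER.pack(request_id) + datagram[HEADER.size:].upper(), client_addr)


if __name__ == "__main__":
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    pending = {}
    send_request(client, server.getsockname(), 1, b"hello", pending)
    answer_one(server)
    print(receive_reply(client, pending), pending)
```

The catch is that loss, duplication and reordering become the application's problem, which is why this only pays off when you already track per-request state and control both ends.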

If you can supply the rest of the data it's likely that other good tradeoffs can be suggested.
use erlang by sesquiped · 2003-04-15 17:49 · Score: 2, Informative

Erlang makes writing applications like this much much easier than in any other language or framework I've seen.

Check out this tutorial on making a fault-tolerant server in Erlang.
Less fluff, more detail by Twylite · 2003-04-15 19:47 · Score: 1

You haven't given any detail about the nature of the application. You also appear more concerned with achieving high performance than high availability (which you only mention in the title). If this is such a big application why are you even talking about socket connections?

I must assume that you are developing an enterprise application, given your performance and availability needs. Contemporary systems of this nature fall loosely into one of two categories: web technology based, or not.

If you're basing your application on web technology, get someone with appropriate skills (consultant, contract or permanent staff). There is firewall, routing and load balancing hardware available to deal with redundancy and hot failover; leaving you with a farm of application web servers talking to a high availability database (which you can set up as a cluster system or on redundant hardware like Stratus).

If you're not using web technology, then you should be looking at an alternative enterprise technology, not rolling your own and asking about sockets. DCOM, .NET, Java/RMI, EJB, CORBA and MOM are your primary options. Of those there is an increasing leaning towards MOM (Message Oriented Middleware) in enterprise systems, as it offers scalability and ease of integration that the other technologies don't.

So investigate appropriate middleware, including the fault tolerant options that are offered. IBM's WebSphere MQ for example has failover support, and MSMQ can be run as part of a cluster.

You also need to ask yourself why such a high load is required. Do you have a huge number of clients? Does each client send/request a large amount of data? How can you restructure the system to reduce the number and/or size of requests/responses, or at least distribute them so that you don't have a single choke-point?

I can't decide whether you're asking this question because you don't really have the experience necessary to design a system of this nature, or because you have enough experience to be comfortable about asking others. Either way, you're probably best off identifying areas of possible technical deficiency, and hiring a domain expert to look at the issues.

--
i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
1. Re:Less fluff, more detail by mcdrewski42 · 2003-04-15 22:27 · Score: 1
  
  I must assume that you are developing an enterprise application, given your performance and availability needs. Contemporary systems of this nature fall loosely into one of two categories: web technology based, or not.
  
  All of these technologies are well and good at an enterprise level in which latency is not an issue, but move into (say) telecoms and suddenly your drivers are:
  1) Availability
  2) Latency
  
  When I pick up my prepaid cell phone, dial a number and press send there are milliseconds for the entire network to work out where the call is going, work out how much $ I have and then handle trying to actually connect the call. 500ms is a noticeable delay. 1000ms is unacceptable. Think about me being in Guatemala and my phone company being in Abu Dhabi and then tell me CORBA/.NET/J2EE/XYZ is the right solution.
  
  --
  /* affect != effect */ void affect(int *thing,int effect) { *thing += effect; }
2. Re:Less fluff, more detail by PsiComa · 2003-04-15 22:42 · Score: 1
  
  Jesus, when is this nonsense of putting CORBA and .NET in the same sentence going to end?! CORBA is a distributed systems technology; .NET is a software platform. And yes, if you're talking about web services, those are indeed included in .NET, but they are no more related to it than they are to, say, PHP or C++. Except of course MS has 70% or so of the committee working on web services :).
3. Re:Less fluff, more detail by Twylite · 2003-04-15 22:47 · Score: 2, Interesting
  
  Ah, telecoms :) Is this the industry/application in question, or just hypothetical? There was mention of throughput (indirectly) and availability, but not of latency in the original question. Also there was mention of queuing queries to a back-end database ... this doesn't sound like a minimal-latency scenario.
  
  Anyway, the technologies you mention are not likely to be acceptable in such a scenario -- but MOM is quite likely to be appropriate. In fact many cellular services are based on MOM (conceptually, although they may not use commercial MOM products), as you receive requests and send responses based on discrete messages from/to the handset.
  
  --
  i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
4. Re:Less fluff, more detail by mattc58 · 2003-04-16 02:31 · Score: 1
  
  .NET Remoting is a direct answer to CORBA for distributed systems. WebServices can also be used as such. So I think the practice of putting CORBA and .NET in the same sentence is valid.
5. Re:Less fluff, more detail by ivan256 · 2003-04-16 04:16 · Score: 1
  
  I guess this is a good spot for a plug...
  
  Though you wouldn't know it from our horribly out-of-date website, the primary product at the company I work for (Mission Critical Linux) is a high availability middleware product that can be tightly integrated with custom software so that you don't have to reinvent the wheel when it comes to HA clustering services. I'm talking about things like inter-node communications, distributed lock management, heartbeating, service location management... If you have a tight schedule or budget, or you just don't know whether you know what the problems that need to be solved are (hint: you haven't thought of everything until it's done), we'll help. We don't just hand you the tools and libraries and leave you on your own. We'll hook you up with an experienced HA engineer and hold your hand as necessary. (The more hand holding required, the higher the price, but what do you expect...) If you're concerned about performance (you obviously are), we can help there too. Our stuff is 10 times faster than any other HA middleware vendor out there right now. Really. It's not necessarily just for Linux, either.
Avoid select() by Anonymous Coward · 2003-04-16 01:35 · Score: 0

Just don't use standard select() on the sockets. A number of solutions exist for efficient socket connections on Linux and other platforms, e.g. the much-hyped NT completion ports, but select() ain't one of 'em.
Slow connections, and lots of 'em! by Pierre+Phaneuf · 2003-04-16 03:17 · Score: 3, Informative

I like a single thread/process per CPU design, where each thread/process uses event-driven I/O to operate. A few things to keep in mind:

Never forget how a lot of idle connections can kill you: for example, a thousand people connecting to your fast server over 56k modems, each sucking down only a packet now and then. If you have a thread/process-per-connection design, like Apache, you'll get screwed real hard when you have a bazillion threads/processes doing *almost* (but not quite) nothing, swamping the I/O scheduler and context switching like mad. If you use a select/poll-based approach, scanning all these inactive file descriptors for the few that are readable/writable wastes a lot of time. Check out the new epoll stuff or Ben LaHaise's callback-based AIO interface.

You should use something like libevent or liboop to abstract your event loop, so that you can fall back to select/poll on old or unpatched kernels but still use epoll and other fancy event dispatching mechanisms on your production servers.
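To make the "abstract your event loop" point concrete, here is a small sketch in Python rather than C, using the standard selectors module in place of libevent or liboop. The preference order and the function name are this sketch's own choices; Python's selectors.DefaultSelector already performs a similar pick for you, so the explicit loop is only there to show the fallback idea.

```python
import selectors
import socket


def make_selector() -> selectors.BaseSelector:
    """Return the most scalable readiness mechanism this platform offers."""
    # Classes missing on the current OS simply are not defined in the module,
    # so getattr() with a default lets us fall through to the next choice.
    for name in ("EpollSelector", "KqueueSelector", "PollSelector", "SelectSelector"):
        selector_class = getattr(selectors, name, None)
        if selector_class is not None:
            return selector_class()
    raise RuntimeError("no selector implementation available")


if __name__ == "__main__":
    selector = make_selector()
    print("using", type(selector).__name__)
    reader, writer = socket.socketpair()
    selector.register(reader, selectors.EVENT_READ)
    writer.send(b"x")
    print(len(selector.select(timeout=1)), "socket(s) ready")
```

The rest of the server only ever talks to the register/select interface, so the same code runs on a dev box with plain select() and on a production kernel with epoll.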

Here are a few URLs for you:

http://kegel.com/c10k.html
http://pl.atyp.us/content/tech/servers.html
When to parallelize by samjam · 2003-04-17 09:22 · Score: 1

We had a redundant story distribution system pushing to many hosts. The sum of the per-host latencies was so high that, when there were lots of stories, it couldn't keep up even though the CPU was idle.

Parallelizing it WAS the answer and it ran like a dream from then onwards; arrivals were more synchronized and end-to-end time was much less and CPU was more utilized.
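A toy illustration of why this case parallelizes well: when the time goes to waiting on each host rather than to computing, the serial cost is the sum of the latencies while the parallel cost is roughly the slowest one. Everything below is hypothetical (the host names, the 50 ms latency, and a sleep standing in for the real network send); it is not the original system.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def push_story(host: str, latency_seconds: float) -> str:
    """Stand-in for a network send; the sleep models per-host round-trip latency."""
    time.sleep(latency_seconds)
    return host


def distribute_serially(hosts, latency_seconds):
    """Send to one host after another: total time is the sum of the latencies."""
    return [push_story(host, latency_seconds) for host in hosts]


def distribute_in_parallel(hosts, latency_seconds, max_workers=16):
    """Send to all hosts at once; threads suit work that waits on the network."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda host: push_story(host, latency_seconds), hosts))


if __name__ == "__main__":
    hosts = [f"host-{n}" for n in range(10)]  # hypothetical host names
    for distribute in (distribute_serially, distribute_in_parallel):
        start = time.perf_counter()
        distribute(hosts, 0.05)
        print(f"{distribute.__name__}: {time.perf_counter() - start:.2f}s")
```

This is the opposite situation from CPU-bound code, where adding threads just adds contention.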

Sam

--
blog.sam.liddicott.com