How Far Can Large Commercial Applications Scale?
clusteroid81 asks: "I've been working with customers who run large commercial applications on big iron (16-32 way symmetric multi-processor systems with 64GB or more of memory). There are always numerous other front-end servers involved, but the application on the back-end server is often difficult to spread across multiple systems or clusters due to the application architecture. Scaling is done by increasing memory and processor counts. As things progress, the bottleneck is usually contention within the application or operating system. Are there folks here on Slashdot who work with large single-system commercial applications? What kind of processor counts and memory do the applications have, and how well do they scale?"
...but have you considered trying to contact the EVE-Online developers at CCP?
Their game is little more than a MASSIVE database application supporting tens of thousands of simultaneous users... They have lag issues but, on the whole, seem to be scaling bloody well.
Anyhow, they started out on a 4-way machine and had scaled up to the 64-way without many code changes. If it had been cost effective, they would have kept on scaling upwards.
Do you even lift?
These aren't the 'roids you're looking for.
Your description gives very little to go on for suggesting solutions ...
You have to tell us many, many specific things before we can suggest specific solutions. All we know is that the application runs on a 32-CPU system with 64 GB of memory. That is all about the hardware. The application is a "large commercial application", and there is "contention within the application or the operating system". We do not even know what the hardware is, or what operating system it runs.
Anyway, here are some generic suggestions from past experience, most of it on UNIX systems, many with Oracle, and most with commercial non-web systems.
- Is the application CPU bound, memory bound, or I/O bound? If you do not know then you have to find out first, then attack the area of
- Is the application transactional in nature or batch? Is it an operational system, or a decision support type of application?
- Does the application use a database (probably does)? Is the database on the same box that runs the application? If so, moving the database to a separate box with a fast connection (FDDI or Gigabit Ethernet) may help things.
- Does the application use queues or message passing? Do these queues fill up at certain peak hours, causing slowdowns?
- Can you benchmark/load test the application on a similar box? If you have transaction generation/injection tools, then you can simulate the real load and then run tools for profiling, performance and the like in real time (e.g. sar, vmstat, top, etc. if you are on a *NIX type of system).
Performance tuning is an iterative process that is more of an art than a science. Start with the 80/20 rule, and get the low-hanging fruit (attack the easiest and most obvious area that would gain you some performance, then move to the next area, etc. until you hit the area of diminishing returns).
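To make the first question in the list (CPU bound, or waiting on something else?) a bit more concrete, here is a rough sketch in Python. It is not a replacement for sar or vmstat, which look at the whole system; it only compares the CPU time and the wall-clock time of a single call inside one process. The function names and the 0.8 threshold are made up for illustration.

```python
import time

def cpu_fraction(work, *args):
    """Run work(*args) once and return cpu_seconds / wall_seconds.

    A value near 1.0 suggests the call kept a CPU busy; a value near 0.0
    suggests it spent most of its time waiting (I/O, locks, sleeping).
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()   # user + system CPU time of this process
    work(*args)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0

def classify(fraction, threshold=0.8):
    """Label a CPU fraction. The threshold is an arbitrary assumption."""
    return "CPU bound" if fraction >= threshold else "mostly waiting"

def busy_loop(n):
    """Stand-in for CPU-heavy work."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def waiting(seconds):
    """Stand-in for work that blocks on I/O or a lock."""
    time.sleep(seconds)

if __name__ == "__main__":
    for name, fn, arg in (("busy_loop", busy_loop, 2_000_000),
                          ("waiting", waiting, 0.5)):
        frac = cpu_fraction(fn, arg)
        print(f"{name}: cpu/wall = {frac:.2f} -> {classify(frac)}")
```

On a busy box the numbers will be noisier, so treat this as a first hint about where to look, not as a measurement.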
2bits.com, Inc: Drupal, WordPress, and LAMP performance tuning.
One place I used to work had a system that scaled up to well over 20 Sun boxes, each with 10 or more CPUs. It all depends on having the design right. For example, if you have a batch job, you architect the job to follow a master/worker paradigm where a master process doles out chunks of work to worker processes that may or may not be running on the same machine (think SETI@Home). Not every job can be redesigned to do this, but it's a fairly easy way to do a large number of different tasks. Further, there's no reason that this design couldn't be used by Linux/PostgreSQL or some other Free Software stack rather than Solaris/Oracle. There are also other paradigms. Perhaps you should do a search on scholarly comp sci papers instead of asking /.. The problem of scaling is not exactly new. Quite a few papers have been written on various ways to solve the problem depending on what sort of computational tasks you have to accomplish.
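For anyone who has not seen the master/worker layout, a minimal sketch in Python follows. It is only the shape of the idea, not the system described above: threads and an in-process queue stand in for worker processes on other machines, the "work" (summing squares) is a placeholder, and all names are invented. In CPython, threads will not give real CPU parallelism, so a real version would use separate processes or a network queue.

```python
import queue
import threading

def worker(tasks, results):
    """Pull chunks from the task queue until the stop sentinel arrives."""
    while True:
        chunk = tasks.get()
        if chunk is None:          # sentinel: the master has no more work
            break
        # Placeholder job: sum of squares of the chunk.
        results.put(sum(x * x for x in chunk))

def run_master(data, chunk_size=1000, num_workers=4):
    """Split data into chunks, hand them to workers, combine the results."""
    tasks = queue.Queue()
    results = queue.Queue()

    workers = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for w in workers:
        w.start()

    # The master doles out work one chunk at a time.
    num_chunks = 0
    for start in range(0, len(data), chunk_size):
        tasks.put(data[start:start + chunk_size])
        num_chunks += 1

    # One sentinel per worker tells each of them to stop.
    for _ in workers:
        tasks.put(None)
    for w in workers:
        w.join()

    # Combine the partial results, one per chunk.
    return sum(results.get() for _ in range(num_chunks))

if __name__ == "__main__":
    print(run_master(list(range(10_000)), chunk_size=500, num_workers=4))
```

The point of the design is that the master never cares where a worker runs or how many there are, which is what lets you add boxes instead of CPUs.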
Do you mean to ask how far things can scale "vertically", by buying progressively bigger individual machines? That's an easy one: never far enough.
Even if you can magically get a single system that's big enough for your needs forever, you'll still pay orders of magnitude too much money for it, and get no added reliability through redundancy.
Any application that requires a solitary, unique, big server is just definitionally broken. It needs to be redesigned to allow it to be spread over an arbitrary number of small systems in geographically diverse locations. For reliability, your serving infrastructure needs to be at least n+1 at every layer to allow for planned maintenance, unexpected failures, and site-destroying disasters. And for scale, it needs to allow you to continue to plug in more batches of cheap little machines and get more throughput.
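The n+1 rule is easy to put into numbers. Here is a back-of-the-envelope sketch in Python; the function names and the load figures in the example are hypothetical, and it assumes load spreads evenly across identical machines, which real systems only approximate.

```python
import math

def machines_needed(peak_load, per_machine_capacity, spares=1):
    """Machines required in one layer to carry peak_load with n+spares sizing."""
    if per_machine_capacity <= 0:
        raise ValueError("per_machine_capacity must be positive")
    base = max(math.ceil(peak_load / per_machine_capacity), 1)
    return base + spares

def survives_failures(machines, peak_load, per_machine_capacity, failures=1):
    """True if the layer still carries peak_load after losing `failures` machines."""
    return (machines - failures) * per_machine_capacity >= peak_load

if __name__ == "__main__":
    # Hypothetical layer: 950 requests/s at peak, 100 requests/s per machine.
    n = machines_needed(950, 100)
    print(n, survives_failures(n, 950, 100))
```

Run the same calculation for each layer (front end, application, database) and for each site if you also want to survive losing a whole location.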