How Far Can Large Commercial Applications Scale?

Posted by ryuzaki0 on Tuesday April 18, 2006 @12:45PM from the taming-the-performance-curve dept.

clusteroid81 asks: "I've been working with customers who run large commercial applications on big iron (16- to 32-processor symmetric multiprocessor systems with 64GB or more of memory). There are always numerous other front-end servers involved, but the application on the back-end server is often difficult to spread across multiple systems or clusters due to the application architecture. Scaling is done by increasing memory and processor counts. As things progress, the bottleneck is usually contention within the application or operating system. Are there folks here on Slashdot who work with large single-system commercial applications? What kind of processor counts and memory do the applications have, and how well do they scale?"
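
One rough way to reason about the contention described in the question is Amdahl's law, which treats the contended part of the work as a serial fraction that extra processors cannot speed up. The sketch below is only an illustration under that assumption: the function name `amdahl_speedup` and the 5% serial fraction are hypothetical and were not measured on any system mentioned in this thread.

```python
def amdahl_speedup(processors, serial_fraction):
    """Ideal speedup on `processors` CPUs when `serial_fraction` of the
    work cannot run in parallel (Amdahl's law)."""
    if processors < 1:
        raise ValueError("processors must be >= 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Assumed value for illustration only: 5% of the work is serialized
    # by lock or resource contention.
    assumed_serial_fraction = 0.05
    for cpus in (16, 32, 64):
        speedup = amdahl_speedup(cpus, assumed_serial_fraction)
        print(f"{cpus} CPUs: at most {speedup:.1f}x faster than 1 CPU")
```

Under this model the speedup can never exceed 1 / serial_fraction, however many processors are added, which is one way to read "the bottleneck is usually contention".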

7 of 56 comments

It may seem offtopic.... by riprjak · 2006-04-18 13:05 · Score: 2, Interesting

...but have you considered trying to contact the EVE-Online developers at CCP?

Their game is little more than a MASSIVE database application supporting tens of thousands of simultaneous users... They have lag issues but, on the whole, seem to be scaling bloody well.
yes by larry bagina · 2006-04-18 13:14 · Score: 2, Interesting

I did some freelance work a few years back for a client. They were converting some custom in-house applications from a 64-processor Cray Superserver 6400 to a cluster-based approach. I can't comment on what they were doing, but they needed all the RAM and cycles they could get hold of.
Anyhow, they started out on a 4-way machine and had scaled up to the 64-way without many code changes. If it had been cost effective, they would have kept on scaling upwards.

--
Do you even lift?
These aren't the 'roids you're looking for.
Posting Anon to save my ass by Anonymous Coward · 2006-04-18 13:40 · Score: 1, Interesting

I can certainly say that applications scale nicely to at least 128 processors; I have personally seen it. That is, if the application is designed and implemented to allow it.
That is just the database server which handles approx 40,000+ user sessions at one time.
Of course, in front of that you have your liberal sprinkling of app servers, database proxy servers and whatnot, amounting to about 100 other separate systems.
As others have noted, you need lots of enterprise, which costs money.
The flip side is there is only one core environment to maintain, which reduces staff and administration requirements.*
Going to a single-system environment eases the pains of disaster recovery, hot standby and failover by a large margin. But that one large environment (server chassis, CPUs, memory, SAN, etc.) will almost always cost more than many smaller systems of the same aggregate capacity. That means it costs a _lot_ more to have a spare hot failover in the datacenter and another system located at a disaster recovery site. Plus you will need to have a copy for Dev/Test, maybe two.
So you never end up with just one big system. Following the enterprise ruleset, *you'll need at least 5 if not more.
The moral of the story: Once you pass the 8-CPU mark, things get really, really, really expensive in a hurry. Your best bet is to find a way to load balance a large app across many smaller servers, if that is at all possible.
Very little to go by ... by kbahey · 2006-04-18 15:05 · Score: 3, Interesting

Your description gives very little to go on for suggesting solutions ...

You have to tell us many, many specific things before we can suggest specific solutions. All we know is that the application runs on a 32-CPU system and has 64 GB. This is all about the hardware. The application is a "large commercial application", and there is "contention within the application or the operating system". We do not even know what the hardware is, nor what operating system it is.

Anyways, here are some generic suggestions from past experience, most of it on UNIX systems, many with Oracle, and most with commercial non-web systems.

- Is the application CPU bound, memory bound, or I/O bound? If you do not know then you have to find out first, then attack the area of

- Is the application transactional in nature or batch? Is it an operational system, or a decision support type of application?

- Does the application use a database (probably does)? Is the database on the same box that runs the application? If so moving the database to a separate box with a fast connection (FDDI or Gigabit Ethernet) may help things.

- Does the application use queues or message passing? Do these queues fill up at certain peak hours, causing slowdowns?

- Can you benchmark/load test the application on a similar box? If you have transaction generation/injection tools, then you can simulate the real load and run profiling and performance tools in real time (e.g. sar, vmstat, top, etc., if you are on a *NIX type of system).

Performance tuning is an iterative process that is more of an art than a science. Start with the 80/20 rule and get the low-hanging fruit: attack the easiest and most obvious area that would gain you some performance, then move to the next area, and so on until you hit diminishing returns.
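
As a rough illustration of the first question in the list above (CPU bound, memory bound, or I/O bound), here is a minimal sketch that guesses at the bottleneck from one sample of vmstat-style counters. It assumes the column names of Linux vmstat (`us`, `sy`, `wa`, `si`, `so`); other UNIX systems label their output differently. The function name `classify_sample` and both thresholds are hypothetical, chosen only for the example, so treat the result as a hint about where to look first and not as a diagnosis.

```python
def classify_sample(sample, cpu_busy=80, io_wait=20):
    """Guess the likely bottleneck from one vmstat-style sample.

    `sample` is a dict with these keys (Linux vmstat column names):
      us, sy : percent of CPU time in user / kernel code
      wa     : percent of CPU time spent waiting for I/O
      si, so : memory swapped in / out per second
    The thresholds are illustrative assumptions, not tuned values.
    """
    # Any sustained swapping suggests the box is short of memory.
    if sample["si"] > 0 or sample["so"] > 0:
        return "memory-bound (swapping)"
    # A high I/O wait share means CPUs sit idle waiting on disk or network.
    if sample["wa"] >= io_wait:
        return "I/O-bound"
    # Otherwise, check whether the CPUs are simply busy.
    if sample["us"] + sample["sy"] >= cpu_busy:
        return "CPU-bound"
    return "no obvious bottleneck"


if __name__ == "__main__":
    # Made-up sample values, for demonstration only.
    example = {"us": 70, "sy": 20, "wa": 2, "si": 0, "so": 0}
    print(classify_sample(example))
```

A single sample says little; in practice you would look at many samples taken under realistic load, as the load-testing suggestion above describes.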

--
2bits.com, Inc: Drupal, WordPress, and LAMP performance tuning.
My experience with Solaris/Oracle by brokeninside · 2006-04-18 15:10 · Score: 3, Interesting

One place I used to work had a system that scaled up to well over 20 Sun boxes, each with 10 or more CPUs. It all depends on having the design right. For example, if you have a batch job, you architect the job to follow a master/worker paradigm where a master process doles out chunks of work to worker processes that may or may not be running on the same machine (think SETI@Home). Not every job can be redesigned to do this, but it's a fairly easy way to do a large number of different tasks. Further, there's no reason that this design couldn't be used by Linux/PostgreSQL or some other Free Software stack rather than Solaris/Oracle. There are also other paradigms. Perhaps you should do a search on scholarly comp sci papers instead of asking /.. The problem of scaling is not exactly new. Quite a few papers have been written on various ways to solve the problem depending on what sort of computational tasks you have to accomplish.
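
The master/worker paradigm described above can be sketched in a few lines. This is a minimal single-machine illustration, not the system from the comment: threads stand in for worker processes that could just as well run on other machines, and the names `run_master`, `worker` and `chunked` are hypothetical. There is no error handling; a worker that raises would leave the master waiting.

```python
import queue
import threading


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def worker(tasks, results, process_chunk):
    """Pull chunks until a None sentinel arrives; emit one result per chunk."""
    while True:
        item = tasks.get()
        if item is None:
            break
        index, chunk = item
        results.put((index, process_chunk(chunk)))


def run_master(items, process_chunk, num_workers=4, chunk_size=3):
    """Master: dole out chunks, wait for the workers, return results in chunk order."""
    tasks, results = queue.Queue(), queue.Queue()
    chunks = chunked(items, chunk_size)
    threads = [
        threading.Thread(target=worker, args=(tasks, results, process_chunk))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for index, chunk in enumerate(chunks):
        tasks.put((index, chunk))
    for _ in threads:
        tasks.put(None)  # one stop sentinel per worker
    for t in threads:
        t.join()
    collected = [results.get() for _ in chunks]
    # Sort by chunk index so output order does not depend on worker timing.
    return [value for _, value in sorted(collected)]


if __name__ == "__main__":
    # Sum the numbers 0..9 in chunks of three.
    print(run_master(list(range(10)), sum))
```

The workers share nothing except the two queues, which is what makes it plausible to replace the in-process queues with a network message queue and move the workers to other boxes.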
Not far enough. by Onan · 2006-04-18 15:11 · Score: 2, Interesting

Do you mean to ask how far things can scale "vertically", by buying progressively bigger individual machines? That's an easy one: never far enough.

Even if you can magically get a single system that's big enough for your needs forever, you'll still pay orders of magnitude too much money for it, and get no added reliability through redundancy.

Any application that requires a solitary, unique, big server is just definitionally broken. It needs to be redesigned to allow it to be spread over an arbitrary number of small systems in geographically diverse locations. For reliability, your serving infrastructure needs to be at least n+1 at every layer to allow for planned maintenance, unexpected failures, and site-destroying disasters. And for scale, it needs to allow you to continue to plug in more batches of cheap little machines and get more throughput.
Problems scale too by Hyperhaplo · 2006-04-18 18:08 · Score: 1, Interesting

We had a small (OS/390) Dev box that was upgraded recently. One thing we noticed was that one application (SAS Websrv) was taking 30% CPU at some times. When we upgraded the box (and moved to z/OS) this was much more noticeable. (Please don't ask why it wasn't noticed before the upgrade.) Funnily enough, no one really noticed on the older machine, but we noticed pretty quickly on the new one.

The moral of the story is:
You're not just scaling up your efficiency / work load. You are also scaling up the other variables as well.

If you think it's fun watching a little OS/390 LPAR flogging itself silly (max 30% for the Websrv task), just wait until you see the look on the Performance Team's faces when that 30% finally gets noticed :-) **

** FYI: A task hogging 30% of an LPAR all of the time after the application has crashed has quite a significant effect on your budget if left unchecked!

--
You have a sick, twisted mind. Please subscribe me to your newsletter.