How Far Can Large Commercial Applications Scale?

← Back to Stories (view on slashdot.org)

How Far Can Large Commercial Applications Scale?

Posted by ryuzaki0 on Tuesday April 18, 2006 @12:45PM from the taming-the-performance-curve dept.

clusteroid81 asks: "I've been working with customers who run large commercial applications on big iron (16-32 symmetric multi-processor systems - 64GB or more memory ). There are always numerous other front-end servers involved, but the application on the back end server is often difficult to spread across multiple systems or clusters due to the application architecture. Scaling is done by increasing memory and processor counts. As things progress, the bottleneck is usually contention within the application or operating system. Are there folks here on Slashdot who work with large single system commercial applications? What kind of processor counts and memory do the applications have and how well do they scale?"

7 of 56 comments (clear)

Min score:

Reason:

Sort:

It all depends on the applications... by georgewilliamherbert · 2006-04-18 12:57 · Score: 3, Insightful

I've run oracle on 32 processor Sun E10Ks with reasonably linear speedup from few-processor performance, back in the Solaris 7 days.

I've run (now obsolete) ATG Dynamo on the same, with similar results.

I've run Apache (1.3.x) on the same, with similar results.

I've seen applications which stopped scaling well at much less than that.

"Large business applications" isn't specific enough.
Enterprise by Procyon101 · 2006-04-18 13:00 · Score: 4, Funny

It depends on how much enterprise you have in them. Enterprise is expensive, but when added liberally you can scale to huge amounts.

I like to add a couple hundred enterprise myself.
Vague question... Vague answers by subreality · 2006-04-18 13:21 · Score: 4, Insightful

Different problems in computer science scale differently. You haven't given us enough data to really know what problem you're solving, so you're really not going to get a reasonable answer.

I work for a company that has a large commercial application. We knew we needed to scale our data set and processing power to be huge, so we made sure from the start that the heavy lifting could be divided into little chunks, and thrown to the cluster. For our purposes, back end scalability is basically linear. When we need more, we just bring another rack of little 1U critters online. There are a few theoretical bottlenecks, but we'll never see them before we have our own nuclear power plant to run the data centers.

For other applications we use, there is *no* scalability. The algorithm has to be single threaded. It doesn't matter if I run it on a cluster, or a machine bristling with CPUs. So we basically buy the data center equivalent of a gaming PC: The fastest processor and memory that fits our budget.

So there are the ends of the spectrum. Your scalability will be somewhere between zero and infinity, depending on the problem at hand.
Re:scale by hashing by dgatwood · 2006-04-18 13:50 · Score: 3, Informative

There are many ways to divide up a set of queries. It all depends on what the application is, how much data sharing is needed, etc.
One way divide the data is per-user or per-group. Divide data according to its owner so that each user account is hosted on a given machine and has first-class access to his/her own data and his/her group's data, but second-class (network-based) access to everyone else's data.
Another way, as you mention, is to do hashing based on some well-defined key, but for this to be useful requires that the front end be thoroughly abstracted from the back end so that multiple front ends share multiple back end stores. Otherwise, you are probably just moving the bottleneck around. It also requires that this key be known in advance, which means that it doesn't generally work well if, for example, you need to do a join on two tables and one of those tables is scattered across multiple machines. The only way that it would work for such use would be if either the key being used for the join is the hashed key or if each machine has a table index that spans multiple machines' content, at which point, you are going to have cache coherency problems.
Which brings us to a fairly nice compromise solution: a replicated database with each of the outer-ring database servers being read-only caches with some sort of built-in cache consistency protocol, and the central database accepting write queries from clients, but with all the read queries directed to the outer ring. Makes for seriously scalable database access.
This, of course, assumes that the app in question is a front-end for a database. If you're doing some other sort of application, then all bets are off. Give us more information.

--
Check out my sci-fi/humor trilogy at PatriotsBooks.
Very little to go by ... by kbahey · 2006-04-18 15:05 · Score: 3, Interesting

Your description is very little to go about suggesting solutions ...

You have to tell us many many specific things before we can suggest specific solutions. All we know is that the application runs on a 32 cPU system, and has 64 GB. This is all about the hardware. The application is a "large commercial application", and there is "contention within the application or the operating system". We do not even know what the hardware is, nor what operating system it is.

Anyways, here are some generic suggestions form past experience, most of it on UNIX systems, many with Oracle, and most with commerical non-web systems.

- Is the application CPU bound, memory bound, or I/O bound? If you do not know then you have to find out first, then attack the area of

- Is the application transactional in nature or batch? Is it an operational system, or a decision support type of application?

- Does the application use a database (probably does)? Is the database on the same box that runs the application? If so moving the database to a separate box with a fast connection (FDDI or Gigabit Ethernet) may help things.

- Does the application uses queues or message passing? Do these queues fill up at certain peak hours causing slow downs?

- Can you benchmark/load test the application on a similar box? If you have transaction generation/injection tools, then you can simulate the real load and then run tools for profiling, performance and the like in real time (e.g. sar, vmstat, top, ....etc. if you are on a *NIX type of system).

Performance tuning is an iterative process that is more of an art than a science. Start with the 80/20 rule, and get the low hanging fruit (attack the easiest and most obvious area that would gain you some performance, then move to the next area, ...etc until you hit the diminishing returns areas).

--
2bits.com, Inc: Drupal, WordPress, and LAMP performance tuning.
My experience with Solaris/Oracle by brokeninside · 2006-04-18 15:10 · Score: 3, Interesting

One place I used to work had a system that scaled up to well over 20 Sun boxes each with 10 more CPUs. It all depends on having the design right. For example, if you have a batch job, you architect the job to follow a master/worker paradigm where a master process doles out chunks of works to worker processes that may or may not be running on the same machine (think SETI@Home). Not every job can be redesigned to to this, but it it's a fairly easy way to do a large number of different tasks. Further, there's no reason that this design couldn't be used by Linux/PostgreSQL or some other Free Software stack rather than Solaris/Oracle. There are also other paradigms. Perhaps you should do a search on scholarly comp sci papers instead of asking /.. The problem of scaling is not exactly new. Quite a few papers have been written on various way to solve the problem depending on what sort of computational tasks you have to accomplish.
Re:It's the network! by multimediavt · 2006-04-18 16:21 · Score: 3, Informative

I'm gonna go ahead and disagree with you there. The network alone is not to blame. Also, keep in mind that the latency differences between most 10GigE implementations and Myrinet are radically different especially once you get above the hardware and protocol levels. They are getting better, Force10's new 10GigE switches being good examples, but they're not that close when you put something like MPI and then a poorly implemented-algorithm wise-application on top of that. Another thing to keep in mind is that there are other interconnect technologies like Infiniband and Quadrics that may give you better performance.

The real scaling issues (in a lot of cases) are within the application itself. Some applications scale really well. I'll use scientific codes as examples. For instance, we've gotten LAMPSS (a molecular dynamics code) to scale very well across our 1024 node, 2048 processor cluster. It is capable of using the entire system to process jobs; all 2048 processors with an Infiniband interconnect and MVAPICH. However, applications like AMBER, another molecular dynamics code, don't scale at all well beyond 256 processors on our system. It's not a fault of the hardware, the network, or the message passing interface in a lot of cases. It's simply that the algorithm used in the code just doesn't scale well beyond a certain point. The code just isn't optimized well, or it just won't scale, period. There are other code bases that are being used by our researchers that do well in an SMP, shared-memory architecture, but simply won't run at all in a distributed memory, cluster architecture. Some because they require a large memory footprint, others simply because the problem the code needs to solve cannot be decomposed and spread across nodes in a cluster. As far as performance goes, we've actually seen some codes, like the quadrature code (APREC) run by David Bailey of LBL, actually achieve super-linear gains. He ran a series of jobs in his quest to do the largest one-dimensional quadrature calculation (which he achieved and published at SC04) starting with one processor and scaling to 512 nodes (1024 processors). At the 16, 64, and 256 processor range, his code actually got 17.66, 69.79, and 270.17 times speed up over a single processor, respectively. Now this is not typical behavior. Typically, you don't get this kind of speed up (usually you do see significantly lower efficiency; in the range of 15 to 20 percent in a lot of cases), and his code did fall off to 919.22 times speed up for 1024 processors. My point is, the application itself has as much impact on performance as the architecture it is being run on. And, don't forget compiler differences, but this could go on for days.

I would strongly urge the original poster to talk to the vendors that develop the software you use and simply ask them if the reason they don't make a cluster version of the software is due to economic reasons, or simply because the application just won't work in that architecture. Remember, computing is a right-tool-for-the-right-job arena. There's no single platform that will do everything for everybody.