How Far Can Large Commercial Applications Scale?
clusteroid81 asks: "I've been working with customers who run large commercial applications on big iron (16-32 symmetric multi-processor systems - 64GB or more memory ). There are always numerous other front-end servers involved, but the application on the back end server is often difficult to spread across multiple systems or clusters due to the application architecture. Scaling is done by increasing memory and processor counts. As things progress, the bottleneck is usually contention within the application or operating system. Are there folks here on Slashdot who work with large single system commercial applications? What kind of processor counts and memory do the applications have and how well do they scale?"
I've run oracle on 32 processor Sun E10Ks with reasonably linear speedup from few-processor performance, back in the Solaris 7 days.
I've run (now obsolete) ATG Dynamo on the same, with similar results.
I've run Apache (1.3.x) on the same, with similar results.
I've seen applications which stopped scaling well at much less than that.
"Large business applications" isn't specific enough.
I'm still studying computer science with little practical experience, but you can divide certain aspects of your application by hashing---you hash datasets or queries. This distributes the workload across a cluster of computers. However, implementing hashing requires you to make intrusive changes to your code, and maybe most companies aren't willing to do so. Hashing generally has to be implemented from the very beginning, which requires foresight. Google is the one company that does it well.
I once had a signature.
Different problems in computer science scale differently. You haven't given us enough data to really know what problem you're solving, so you're really not going to get a reasonable answer.
I work for a company that has a large commercial application. We knew we needed to scale our data set and processing power to be huge, so we made sure from the start that the heavy lifting could be divided into little chunks, and thrown to the cluster. For our purposes, back end scalability is basically linear. When we need more, we just bring another rack of little 1U critters online. There are a few theoretical bottlenecks, but we'll never see them before we have our own nuclear power plant to run the data centers.
For other applications we use, there is *no* scalability. The algorithm has to be single threaded. It doesn't matter if I run it on a cluster, or a machine bristling with CPUs. So we basically buy the data center equivalent of a gaming PC: The fastest processor and memory that fits our budget.
So there are the ends of the spectrum. Your scalability will be somewhere between zero and infinity, depending on the problem at hand.
My company has developed a large software project on a server cluster for the backend. Our server-side architecture is (in theory) scalable as large as we want to go. We use BEA Tuxedo to assign different applications to different servers, and all the databases are available via a SAN. The Unix servers use are currently configured with 4 to 8 CPUs each, and 8 to 16 GB memory. The server cluster is currently configured between 2 and 10 servers for our current deployments, though we could scale larger simply by rearranging the tuxedo configuration files if we needed to.
Now, some server-side apps in our system are architected to scale very well, and some we have had to spend the last few months tweaking the code as we grow with our current customer's deployment. In general though, our system tends towards lots of specific apps running simultaneously to handle individual tasks, rather than a small number of large, monolithic apps. I think it is very much making sure you have large system scalability in mind from the beginning, and not starting small and then realizing "Oh no! We never realized we'd have to handle THIS much traffic!" Our project is a perfect example of learning that lesson over and over as we've had to tweak or rewrite pieces of it as we add more and more clients to our customers' deployment. It can be done, but depending on how you've written your apps, it may not be easy.
In my experience, the custom applications I deal with seem to be built with not just incorrect assumptions regarding load, but *no* assumptions regarding load. When I first fired up one particular application in a production environment, we were seeing 6000 incoming messages per second. I asked the lead developer what we should be expecting to see. He had no idea.
This is caused by short sighted project management, which translates into short sighted programming. The necessary questions about throughput aren't asked, because it all works fine on the developers' PC with a test load. In our case, we eventually got the application running OK, but changes that have been made since have not taken into account anything to do with I/O, so the fact that our CPU usage is not maxing out seems to indicate to the development team that we are not bound by the server performance, and hence have not reached any scalability thresholds.
Obviously this is madness. If one was to investigate the scalability of this application properly, one should be looking at where I/O happens, where interprocess communication happens, where object creation and destruction happens, and so on... There is no other way to scale an application -- you have to define what the "load" is, find what happens when you increase it, work out where any bottleneck is, and how parallelisable this bottleneck is. Anything less is no more than buzzwords.
Quidquid latine dictum sit, altum videtur.