Ask Slashdot: Capacity Planning and Performance Management?

Posted by Soulskill on Monday August 10, 2015 @05:39AM from the throw-servers-at-it-until-the-alerts-stop dept.

An anonymous reader writes: When shops mostly ran on mainframes, it was relatively easy to do capacity planning because systems and programs were mostly monolithic. But today is very different; we use a plethora of technologies and systems are more distributed. Many applications are decentralized, running on multiple servers either for redundancy or because of multi-tiering architecture. Some companies run legacy systems alongside bleeding-edge technologies. We're also seeing many innovations in storage, like compression, deduplication, clones, snapshots, etc.

Today, with many projects, the complexity makes it pretty difficult to foresee resource usage. This makes it hard to budget for hardware that can fulfill capacity and performance requirements in the long term. It's even tougher when the project is still in the planning stages. My question: how do you do capacity planning and performance management for such decentralized systems with diverse technologies? Who is responsible for capacity planning in your company? Are you mostly reactive in adding resources (CPU, memory, IO, storage, etc.) or are you able to plan it out well beforehand?

Spend Money on the Right Tools by dave562 · 2015-08-10 06:46 · Score: 3, Informative

These days capacity planning comes down to having the right tool set for the job. I like VMturbo. There are a few others out there that will get the job done. VMturbo is nice because it is platform agnostic and can help you decide where to place workloads not only based on pure performance numbers, but also on resource cost. (For example, Hyper-V is likely less expensive than VMware in most situations.)
It is also worth considering an Application Performance Monitoring (APM) tool. Being able to identify exactly where the application is slow, and whether the issue is with the code or the underlying OS / infrastructure, will save a lot of time during troubleshooting, and will also help identify where to proactively allocate resources to head off potential bottlenecks.
On a similar subject, a tool that provides deep visibility into the database layer helps a lot for the same reasons. A lot of junior admins make the mistake of assuming that high database server utilization is indicative of under-provisioned hardware. In reality, poorly written queries will bring down even the beefiest of database servers. While you get some information from the built-in management tools, a dedicated monitoring platform (like Spotlight from Dell, for example) will help you develop historical trends, while at the same time providing real-time troubleshooting capabilities.
Most of the time the network is the last bottleneck. In Cisco shops you can utilize NetFlow to figure out where the problems are. Or if the company you are working for has money to burn, the UCS infrastructure stack is very robust and comes with a whole slew of management and monitoring tools that can be leveraged to discover latencies before they impact production environments too severely.
Simplify the problem, use a metrics based approach by ArijitMukherji · 2015-08-10 08:38 · Score: 3, Informative
This is exactly the situation we ran into when we launched our SaaS platform SignalFx to general availability. Internally it is composed of 15-20 different micro-services, making capacity planning a big challenge. We blogged about our experience here: Metrics based approach to capacity planning. SignalFx is a metrics based monitoring platform, so in a meta way, we used SignalFx to capacity-plan for SignalFx's launch.
tl;dr version of our lessons and suggestions:
1. Design your architecture to be loosely coupled, so that it is possible to capacity-plan for each sub-component independently. Break a complex problem into N simpler ones
2. Identify the 'limiting system resource' for each component individually (i.e. what will hit the wall first - CPU, memory, network etc.). You can do this through a combination of experimentation and plain and simple reasoning based on an understanding of how it works
3. Identify a business metric that correlates with the utilization of the limiting resource (e.g. api calls per second, number of logged in users, or whatever)
4. Use analytics/math to project the capacity of the system, and how much free capacity you have (make sure to leave enough buffer, e.g. most services won't run very well at 99.99% cpu)
At the end, you'll have something like this for each component of the system - e.g. "if I'm CPU bound on component X, and CPU of X goes up linearly with API calls/sec, and I'm currently at 5000 API/sec at 50% CPU, then I have total capacity for 9000 API/sec (with a 10% buffer) and free capacity for another 4000 API/sec."
Now divide and conquer - give each component owner the responsibility to manage the capacity of their system based on the business needs provided to them.
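
The arithmetic in the worked example above can be sketched in a few lines of Python. This is only an illustration under the same assumption the comment states (utilization of the limiting resource scales linearly with the business metric); the function name `project_capacity` and its parameters are hypothetical and not part of SignalFx or any other tool mentioned here.

```python
def project_capacity(current_load, current_utilization, buffer=0.10):
    """Project usable and free capacity for one component.

    Assumes utilization of the limiting resource grows linearly with load.
    current_load:        business metric right now (e.g. API calls/sec)
    current_utilization: fraction of the limiting resource in use (0 < u <= 1)
    buffer:              fraction of the resource held back as headroom
    """
    if not 0 < current_utilization <= 1:
        raise ValueError("current_utilization must be in (0, 1]")
    if not 0 <= buffer < 1:
        raise ValueError("buffer must be in [0, 1)")

    # Load at which the resource would be fully saturated.
    load_at_saturation = current_load / current_utilization
    # Usable capacity once the safety buffer is held back.
    total_capacity = load_at_saturation * (1 - buffer)
    # Additional load the component can absorb before reaching that limit.
    free_capacity = total_capacity - current_load
    return total_capacity, free_capacity


# The numbers from the comment: 5000 API calls/sec at 50% CPU, 10% buffer.
total, free = project_capacity(5000, 0.50, buffer=0.10)
print(f"total: {total:.0f} API/sec, free: {free:.0f} API/sec")
```

With the comment's inputs this gives a total of 9000 API/sec and 4000 API/sec free. If the relationship between the metric and the resource is not linear for a given component, the projection step would need a different model.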