Ask Slashdot: Capacity Planning and Performance Management?
An anonymous reader writes: When shops mostly ran on mainframes, it was relatively easy to do capacity planning because systems and programs were mostly monolithic. But today is very different; we use a plethora of technologies and systems are more distributed. Many applications are decentralized, running on multiple servers either for redundancy or because of multi-tiering architecture. Some companies run legacy systems alongside bleeding-edge technologies. We're also seeing many innovations in storage, like compression, deduplication, clones, snapshots, etc.
Today, with many projects, the complexity makes it pretty difficult to foresee resource usage. This makes it hard to budget for hardware that can fulfill capacity and performance requirements in the long term. It's even tougher when the project is still in the planning stages. My question: how do you do capacity planning and performance management for such decentralized systems with diverse technologies? Who is responsible for capacity planning in your company? Are you mostly reactive in adding resources (CPU, memory, IO, storage, etc.) or are you able to plan it out well beforehand?
Speed of implementation in various organizations (or even departments, divisions, etc.) runs a spectrum from "do stuff on more or less a whim" to "go through eight years of planning meetings to discuss the possibility of actually doing something." At the former end of the spectrum you buy extra capacity. At the latter end of the spectrum it doesn't matter, because you won't get the budget to buy extra capacity.
Help save the critically endangered Blue Iguana
I used to work for the (late, lamented) Sun Microsystems, and when we needed to give a credible answer to a price-sensitive customer, we used Teamquest Model. It pulls time-based info out of production-systems stats, so it doesn't add to the load, and then off-line does a classic queuing-system model of the system, working all in time units. That then allows the customer (really meaning me!) to ask what to expect from some specific configuration, and compare different systems for their price-performance tradeoffs.
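For readers who have not seen a queuing model, here is a minimal sketch of the general idea, working in time units as described above. It is the textbook M/M/c (Erlang C) formula, not a description of how Teamquest Model works internally, and the function names and the numbers in the example (arrival rate, service time, server counts) are made up for illustration.

```python
from math import factorial


def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that an arriving request has to wait (Erlang C formula).

    offered_load is arrival_rate * service_time, i.e. the average number
    of busy servers the workload needs.
    """
    utilization = offered_load / servers
    if utilization >= 1.0:
        raise ValueError("system is saturated: utilization >= 100%")
    top = offered_load ** servers / factorial(servers)
    below = sum(offered_load ** k / factorial(k) for k in range(servers))
    return top / ((1.0 - utilization) * below + top)


def mean_response_time(arrival_rate: float, service_time: float, servers: int) -> float:
    """Mean response time (service + queueing) for an M/M/c system.

    arrival_rate is in requests per second, service_time in seconds.
    """
    offered_load = arrival_rate * service_time
    p_wait = erlang_c(servers, offered_load)
    mean_wait = p_wait * service_time / (servers - offered_load)
    return service_time + mean_wait


if __name__ == "__main__":
    # Hypothetical workload: 8 requests/s, 100 ms of service time each.
    for cpus in (1, 2, 4):
        r = mean_response_time(arrival_rate=8.0, service_time=0.1, servers=cpus)
        print(f"{cpus} server(s): mean response time {r * 1000:.1f} ms")
```

Feeding measured arrival rates and service times from production stats into something like this lets you ask "what if" questions about a configuration before buying it. A real tool models far more than one queue, so treat this only as an illustration of the principle.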
For common setups, we have spreadsheets based on what Model said, so the salespeople typically don't know there's a cool mathematical model behind the scenes (;-)) That's probably true of other vendors who use TQ models: it runs on anything modern, so lots of vendors use it.
I have nothing to do with the company: they just allowed me to save $1.2 million once for a new datacenter, so I'm really really impressed by them.
--dave
davecb@spamcop.net
That depends: are you getting the information you need?
Are your business analysts/architects even able to answer questions such as how many net new users and concurrent users to expect, or to summarize the typical workload? Back in the day (and I'm only in my early 40s) this stuff used to be well defined. We used to have large documents that went down to the level of expected network load. So either, as you said, it's too difficult given the diversity of the systems, or they just don't know how to do their jobs anymore. I honestly think it's about a 20/80 split. Yes, the environments are more difficult to manage, but BAs/architects haven't adapted or frankly just don't care.
BAs can't give me any information that would help me foresee or estimate how much load a project/change is going to put on the environment. So when I'm asked if we need new hardware, I usually just tell them to make sure they plan a proper load test and be prepared to spend money.
In my company, it's my job to make sure lights-on operations run well and to highlight any issues related to capacity. For new projects, it's the job of the project team, which I may or may not be a part of.
Storage, for us, seems to be the largest constraint, with memory and CPU coming in behind. Since we can't get much information, we just make sure all our servers are hooked up to a large SAN so we can quickly provision more space.
"Thanks to the remote control I have the attention span of a gerbil."
It isn't hard to convince the business end; it is a function of money. IT is a cost center; it doesn't generate revenue. Therefore, by default there is a desire to hold costs down, which means limited IT budgets. Trust me, the business end understands; they just don't care about IT the way IT cares about IT.
That being said, it is EASY to get either money or absolution for the problems that the business end creates by not funding IT properly. You get them to sign off on the responsibility for when the shit hits the fan because of shortsighted budget concerns.
"If airplanes crash into this building, and 9/11 happens to us, how much data can you afford to lose?"
"If a hacker gains access to our database, how much would that cost the company?"
"How much does IT downtime cost this company?"
People incapable of answering these questions (and a thousand more) should not be making IT decisions until they can.
"Good IT is expensive. Bad IT is costly"
Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
You should be using your monitoring system to gather performance data, and then analyzing that data.
I am partial to check_mk right now, but I've done this kind of thing on Nagios with pnp4nagios. When you have your monitoring system gathering network interface data, disk usage, CPU utilization, etc., and storing it in some kind of database like RRD, InfluxDB, or Graphite, it isn't that much of a stretch to examine that data in aggregate and graph trends. It really is amazing all the stuff you can figure out with this technique.
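As a rough illustration of that kind of trend analysis, here is a minimal sketch that fits a straight line to disk-usage samples and estimates how long until a volume fills up. It assumes you have already exported the samples from whatever store you use; the function names, the sample values, and the 200 GB capacity are all hypothetical, and a plain least-squares line is only one simple choice of trend model.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until_full(days, used_gb, capacity_gb):
    """Estimate days remaining after the last sample until capacity is hit.

    Returns None when usage is flat or shrinking, since the fitted line
    never reaches the capacity in that case.
    """
    slope, intercept = linear_fit(days, used_gb)
    if slope <= 0:
        return None
    full_day = (capacity_gb - intercept) / slope
    return max(0.0, full_day - days[-1])


if __name__ == "__main__":
    # Hypothetical daily samples exported from the monitoring system.
    days = [0, 1, 2, 3, 4]
    used_gb = [100, 110, 120, 130, 140]
    remaining = days_until_full(days, used_gb, capacity_gb=200)
    print(f"Estimated days until full: {remaining}")
```

The same approach works for CPU or network utilization. Real usage is rarely this linear, so it is worth eyeballing the graph before trusting the projection.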