Slashdot Mirror


Ask Slashdot: Capacity Planning and Performance Management?

An anonymous reader writes: When shops mostly ran on mainframes, it was relatively easy to do capacity planning because systems and programs were mostly monolithic. But today is very different; we use a plethora of technologies and systems are more distributed. Many applications are decentralized, running on multiple servers either for redundancy or because of multi-tiering architecture. Some companies run legacy systems alongside bleeding-edge technologies. We're also seeing many innovations in storage, like compression, deduplication, clones, snapshots, etc.

Today, with many projects, the complexity makes it pretty difficult to foresee resource usage. This makes it hard to budget for hardware that can fulfill capacity and performance requirements in the long term. It's even tougher when the project is still in the planning stages. My question: how do you do capacity planning and performance management for such decentralized systems with diverse technologies? Who is responsible for capacity planning in your company? Are you mostly reactive in adding resources (CPU, memory, IO, storage, etc.) or are you able to plan it out well beforehand?

64 comments

  1. We use a very very high tech bleeding edge system by Anonymous Coward · · Score: 2, Funny

    Pay attention. This is VERY complicated....

    We ask our users what their plans are.

  2. Monitoring, instrumentation by Anonymous Coward · · Score: 0

    You cannot save yourself from understanding the architecture. Then you need data. Then you combine the two.

    1. Re:Monitoring, instrumentation by plopez · · Score: 1

      That's backwards, data first then everything else flows from that.

      --
      putting the 'B' in LGBTQ+
  3. Enterprise Architecture by Maxwell · · Score: 0

    is this a trick question?

    1. Re:Enterprise Architecture by Anonymous Coward · · Score: 0

      It is called Enterprise Architecture, and not IT Architecture anymore, for a reason. That reason is that you can not have a proper IT architecture without an Enterprise Architecture.
      Problem is that it is damned hard to convince the business end of this. Therefore, most of us just have to wing it.

      So yes, you answered the question, but at the same time managed to not answer it at all.

    2. Re:Enterprise Architecture by Archangel+Michael · · Score: 5, Interesting

      It isn't hard to convince business end, it is a function of money. IT is a cost center, it doesn't generate revenue. Therefore, by default there is a desire to hold costs down, which means limited IT budgets. Trust me, the business end understands, they just don't care about IT the way IT cares about IT.

      That being said, it is EASY to get either money or absolution for the problems that Business end creates by not funding IT properly. You get them to sign off on the responsibility for when the shit hits the fan because of shortsighted budget concerns.

      "If airplanes crash into this building, and 9/11 happens to us, how much data can you afford to lose".
      "If a hacker gains access to our database, how much would that cost the company"
      "How much does IT downtime cost this company"

      People incapable of answering these questions (and a thousand more), should not be making IT decisions, until they can.

      "Good IT is expensive. Bad IT is costly"

      --
      Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
    3. Re:Enterprise Architecture by dave562 · · Score: 1

      This is a good point. I have come to realize that as an IT professional, often times the only thing that I have power to do is to generate options, build the business case for those options (including the risks of not doing them *very key to do this step*) and then present them to the business. My job is to help the business leaders make informed decisions.

      If they ultimately decide to avoid the costs and accept the risks of doing so, they only have themselves to blame if / when the risk materializes.

      Most vendors who offer management / performance tuning toolsets understand these challenges. Therefore they have metrics that help demonstrate what the ROI for the tools is, and how those tools will reduce CapEx and/or drive down OpEx costs. Some are straightforward, like "Implement VMturbo and increase your VM density by 25%." Others require some analysis, like quantifying how implementing an Application Performance Management (APM) tool will drive faster incident resolution and require less staff to troubleshoot problems. In a real life example, I managed to justify a $250,000 spend on an APM tool because it allows our sales team to prove to prospective clients that our SaaS application is better managed and has better uptime than anyone else in our market segment. They have closed a couple multi-year, multi-multi-million dollar deals by being able to show that we have a 99.99%+ transaction success rate.

    4. Re:Enterprise Architecture by Anonymous Coward · · Score: 0
    5. Re:Enterprise Architecture by PPalmgren · · Score: 1

      A good idea seems to be incorporating IT, in some form, into risk management. Risk Management people who understand IT will fight for you, and the finance department will certainly listen to Risk Management if you can't get through to them yourselves.

      I always wondered if there's some sort of cost model out there that uses multiplier factors. Like, yes, IT may be a cost center, but IT effectiveness is essentially a multiplier on every other department. If it sucks, it slows down every department, and if it's great it adds to every other department. Would be hard to map/quantify for sure, but it'd be a sure-fire way to get number-obsessed management types to actually understand the value of IT.

      I also think a lot of companies fail to properly allocate certain IT expenses to various departments and such. If Finance knows 10% of the IT budget is a baseline operating expense for department A through P, allocates the cost as such, and understands the cost baseline/scaling, maybe they'd be less likely to axe it randomly.

    6. Re:Enterprise Architecture by Maxwell · · Score: 1

      I'll stick with Enterprise Architecture as the answer as they do exactly what the OP asks. There are plenty of books on the subject to get one started. Failure to properly implement in your organization is a different discussion.

  4. WTB: Crystal Ball by Anonymous Coward · · Score: 0

    My powers of prediction are inadequate; can anyone sell me a good, working crystal ball? Or, better, maybe /. can just give me one...

    1. Re:WTB: Crystal Ball by bobbied · · Score: 1

      My powers of prediction are inadequate; can anyone sell me a good, working crystal ball? Or, better, maybe /. can just give me one...

      Oh plenty of vendors are more than willing to take your money and claim to give you a crystal ball... That's not unique..

      --
      "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
  5. Isn't capacity nearly disposable now? by Anonymous Coward · · Score: 0

    Maybe if you're Netflix or Facebook you worry about this sort of thing. But at most enterprises I work with this is becoming less of an issue. Most CPUs sit idle, memory is plentiful, and the only item that is always running out is disk capacity, but even that is so cheap to replenish that it costs more to chase someone to clean up than to simply keep adding disk.

    I know this doesn't answer your question, but how much capacity are we talking about here? For reference, I'm talking at least some fortune 500 shops I've dealt with who aren't terribly compute intense. Less than 10PB disk, ~1500 physical hosts, ~10,000 VMs.

  6. Capacity Planning by Anonymous Coward · · Score: 0

    This question is why scale out architecture has become so popular. It is virtually impossible to plan for 2-3 years down the road in most environments. Since you cannot plan, it is paramount that you can add compute/storage/hyperconverged nodes as needed.

  7. Uses a mainframe by StaffanEricsson · · Score: 1

    It is simple, I use a mainframe and buy capacity when needed.

  8. Why bother? by Anonymous Coward · · Score: 0

    Capacity planning is a rather anachronistic idea from the days when hardware was expensive and procurement cycles were extraordinarily long. The better approach these days is to craft application level service level agreements and let that drive capacity needs. Virtualization, cloud resources and on demand computing have in my opinion rendered most of the art/guesswork of capacity planning irrelevant.

  9. ha by Anonymous Coward · · Score: 0

    in Soviet Amerika, Capacity Plans YOU!

    1. Re:ha by Anonymous Coward · · Score: 0

      sadly, this is closer to the truth than you'd think...

  10. Go Big or Go Home by Anonymous Coward · · Score: 0

    Massively overestimate HW because the beancounters will cut it back even if only to prove they can.

    Explain the following, repeatedly, until it penetrates the thick skull of whoever fails to understand it: Hardware is CHEAP, data is EXPENSIVE. It is cheaper to buy all the HW now than to try and process or recreate all the unprocessed or lost data the lack of HW caused.

    Get somebody to sign off on the final configuration. It will almost certainly be suboptimal for the task, and that signature will give you the protection you need when scapegoat(s) are required.

  11. Depends on the corporate culture involved by ErikTheRed · · Score: 4, Interesting

    Speed of implementation in various organizations (or even departments, divisions, etc) runs a spectrum of "do stuff on more or less a whim" to "go through eight years of planning meetings to discuss the possibility of actually doing something." On the former end of the spectrum you buy extra capacity. At the latter end of the spectrum it doesn't matter, because you won't get the budget to buy extra capacity.

    --

    Help save the critically endangered Blue Iguana
    1. Re:Depends on the corporate culture involved by Anonymous Coward · · Score: 0

      +1 made my day

  12. VMs duh by Anonymous Coward · · Score: 0

    Planning? That's what VMs are for. Just throw more hardware at the resource pool.

  13. Only you can decide that. by mlts · · Score: 5, Insightful

    There are a lot of tools you can use to help with capacity, be it VM farms, SANs/NASes, cloud providers, chassis/blades. Only a few points of advice:

    1: Everyone will sell their product as the silver hammer, where each target is a nail. The VNX guys will sell their SAN as a be all and end all, even if you just use CIFS/SMB. The security vendor will be selling you exotic appliances for encryption for your tape silos. The PC guy will be selling you tons of 1U racks and try to convince you that the onboard drive array is better than a SAN if they don't have a SAN product, otherwise, how slick their HBAs work when used with their SAN.

    2: Don't forget security. It may be cheaper to have one VM cluster for everything, but it may be wiser to keep one client's hyper-sensitive stuff on one VMWare datacenter [1], while the other client who is running some backend stuff for an app would be in a different container.

    3: Before committing to purchase something, grab manuals and documentation, and read on the device. You might find it doesn't do what you want. Don't forget to take into account type of I/O and other items. I have had to deal with a terabyte/hour of random writes, and the only solution for that was going with either a caching HBA that had that much SSD so it would turn the random writes into an easy to digest sequential stream for the SAN, or go pure SSD. Sequential I/O is a lot easier and a lot cheaper to deal with than random I/O. Similar with I/O that is often cached versus I/O that never is reused.

    [1]: A datacenter is a VMWare object type. Can't vMotion across it, and is intended to provide distinct separation between items.

    1. Re:Only you can decide that. by lsatenstein · · Score: 1

      There are a lot of tools you can use to help with capacity, be it VM farms, SANs/NASes, cloud providers, chassis/blades. Only a few points of advice:

      1: Everyone will sell their product as the silver hammer, where each target is a nail. The VNX guys will sell their SAN as a be all and end all, even if you just use CIFS/SMB. The security vendor will be selling you exotic appliances for encryption for your tape silos. The PC guy will be selling you tons of 1U racks and try to convince you that the onboard drive array is better than a SAN if they don't have a SAN product, otherwise, how slick their HBAs work when used with their SAN.

      2: Don't forget security. It may be cheaper to have one VM cluster for everything, but it may be wiser to keep one client's hyper-sensitive stuff on one VMWare datacenter [1], while the other client who is running some backend stuff for an app would be in a different container.

      3: Before committing to purchase something, grab manuals and documentation, and read on the device. You might find it doesn't do what you want. Don't forget to take into account type of I/O and other items. I have had to deal with a terabyte/hour of random writes, and the only solution for that was going with either a caching HBA that had that much SSD so it would turn the random writes into an easy to digest sequential stream for the SAN, or go pure SSD. Sequential I/O is a lot easier and a lot cheaper to deal with than random I/O. Similar with I/O that is often cached versus I/O that never is reused.

      [1]: A datacenter is a VMWare object type. Can't vMotion across it, and is intended to provide distinct separation between items.

      Why not do a calibration between Sales, Inventory, Purchasing, Logistics as x= a*s+b*i+c*p+d* vs No CPUs + No Network + No Transactions + support staff + some critical resource at a known level of x. Do not assume linear growth, but some growth that is proportional to x**3 or x**5 where x = the

      + a+b+c+d=100% and all a,b,c,d >0

      Typically you will have to include some measures such as a max of 1 second response time (or 0.1 seconds response time). Don't try to go cutting hairs in 4 endwise.

      --
      Leslie Satenstein Montreal Quebec Canada
  14. For anything expen$sive, we use TQ by davecb · · Score: 5, Interesting

    I used to work for the (late, lamented) Sun Microsystems, and when we needed to give a credible answer to a price-sensitive customer, we used Teamquest Model. It pulls time-based info out of production-systems stats, so it doesn't add to the load, and then off-line does a classic queuing-system model of the system, working all in time units. That then allows the customer (really meaning me!) to ask what to expect from some specific configuration, and compare different systems for their price-performance tradeoffs.
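    A minimal sketch of the kind of queueing arithmetic such a model rests on. This is not TeamQuest's actual method: it assumes the simplest textbook case, a single-server M/M/1 queue, where mean response time is R = S / (1 - U) for service time S and utilization U. The function names and the numbers in the usage note are illustrative assumptions.

```python
def mm1_response_time(service_time_s: float, arrival_rate_per_s: float) -> float:
    """Mean response time (seconds) of an M/M/1 queue: R = S / (1 - U)."""
    utilization = arrival_rate_per_s * service_time_s  # U = lambda * S
    if utilization >= 1.0:
        raise ValueError("offered load saturates the server (utilization >= 1)")
    return service_time_s / (1.0 - utilization)


def max_arrival_rate(service_time_s: float, target_response_s: float) -> float:
    """Highest arrival rate (per second) that still meets the response-time target."""
    if target_response_s <= service_time_s:
        raise ValueError("target is below the bare service time")
    # Solve R = S / (1 - lambda * S) for lambda.
    return (1.0 - service_time_s / target_response_s) / service_time_s


if __name__ == "__main__":
    # Hypothetical comparison: a box with 20 ms service time vs one with 10 ms.
    for name, service in (("config A", 0.020), ("config B", 0.010)):
        rate = max_arrival_rate(service, target_response_s=0.100)
        print(f"{name}: about {rate:.0f} requests/s within a 100 ms target")
```

    Working in seconds throughout is what lets you compare two configurations directly: swap in the measured service time for each candidate and read off how much load it carries before the response-time target is missed.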

    For common setups, we have spreadsheets based on what Model said, so the salespeople typically don't know there's a cool mathematical model behind the scenes (;-)) That's probably true of other vendors who use TQ models: it runs on anything modern, so lots of vendors use it.

    I have nothing to do with the company: they just allowed me to save $1.2 million once for a new datacenter, so I'm really really impressed by them.

    --dave

    --
    davecb@spamcop.net
    1. Re:For anything expen$sive, we use TQ by Archangel+Michael · · Score: 1

      allowed me to save $1.2 million once for a new datacenter

      That's a lot. Or that isn't much. It depends on what the whole build out actually was. In other words, was that 50% cost reduction, or 1%.

      --
      Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
    2. Re:For anything expen$sive, we use TQ by turbidostato · · Score: 1

      "allowed me to save $1.2 million once for a new datacenter

      That's a lot. Or that isn't much. It depends on what the whole build out actually was. In other words, was that 50% cost reduction, or 1%."

      That's a lot in any case. Even if it's 1%, it's still 1.2 million out of an investment of (whatever the wages of that guy X time expended).

      You might have questioned whether that kind of savings was the biggest at hand, i.e. whether he could have invested his time to save 2.4 million somewhere else. But the way you put it, it reminds me of Everett Dirksen's (seemingly misattributed, as per wikipedia) quote: "A billion here, a billion there, pretty soon, you're talking real money."

    3. Re:For anything expen$sive, we use TQ by Anonymous Coward · · Score: 0

      once

  15. Ask your vendors by Anonymous Coward · · Score: 0

    Ask your vendors for their capacity modeling and planning tools. Hopefully they can provide something as easy as a spreadsheet.
    Although I'm sure the answer would always be "you need to buy more of our stuff", at least that will help you ask the right questions and get semi-reasonable answers. GIGO.

    1. Re:Ask your vendors by davecb · · Score: 1

      They'll do the spreadsheet calculation for you, if you're proposing something mildly profitable. Don't expect to get a copy of the spreadsheet, the salespeople think it high-tech (;-))

      --
      davecb@spamcop.net
  16. Or just use NoSQL. by Anonymous Coward · · Score: 2, Funny

    You don't even have to ask your users. Just use NoSQL, for everything. Use it for storing the data, use it for the back end business logic, use it for the middleware, and use it for the front end. Thanks to the CAP Theorem and JavaScript (all NoSQL uses JavaScript), you don't need to worry about scaling at all. That's the beauty of NoSQL: effortless and infinite scalability.

    1. Re:Or just use NoSQL. by Hognoxious · · Score: 1

      Also, it's webscale.

      Oh hang on, I'm confusing it with MongoDB.

      Umm, look, a CLOUD!

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
  17. Re:We use a very very high tech bleeding edge syst by __aaclcg7560 · · Score: 1

    According to ancient lore, a pair of low-level HP engineers put in a request for a Saturn V rocket and a vice president killed their request. HP would have been a different company if it had listened to its users.

  18. Honest answer? by Anonymous Coward · · Score: 0

    It's mostly a crapshoot, and often the PHBs don't take your advice anyway.

  19. Use amazon by Anonymous Coward · · Score: 0

    and let them figure out how to make it work. why bother?

  20. That depends, are you getting the information ... by Stone316 · · Score: 4, Interesting

    That depends: are you getting the information you need?

    Are your business analysts/architects even able to answer questions such as how many net new users and concurrent users to expect, or to summarize the typical workload? Back in the day, and I'm only in my early 40s, this stuff used to be well defined. We used to have large documents which went down to the level of expected network load. So it's either, as you said, too difficult given the diversity of the systems, or they just don't know how to do their jobs anymore. I honestly think it's about a 20/80 split. Yes, the environments are more difficult to manage, but BAs/architects haven't adapted or frankly just don't care.

    BAs can't give me any information which would help me foresee or estimate how much load a project/change is going to put on the environment. So when I'm asked if we need new hardware, I usually just tell them to make sure they plan a proper load test and be prepared to spend money.

    In my company, it's my job to make sure lights-on runs well and to highlight any issues related to capacity. For new projects, it's the job of the project team, which I may or may not be a part of.

    Storage, for us, seems to be the largest constraint, with memory and CPU coming in behind. Since we can't get much information, we just make sure we have all our servers hooked up to a large SAN so we can quickly provision more space.

    --
    "Thanks to the remote control I have the attention span of a gerbil."
  21. Translation: resumes with Capacity Planning by Anonymous Coward · · Score: 1, Funny

    Translation:

    "Dear Product^WReaders, we at Dice want to know if 'Capacity Planning' is the new buzzword. Please provide anecdotal evidence to the approximate value of resume candidates with these keywords."

  22. A good monitoring system helps by wezelboy · · Score: 3, Interesting

    You should be using your monitoring system to gather performance data, and then analyzing that data.

    I am partial to check_mk right now, but I've done this kind of thing on nagios with pnp4nagios. When you have your monitoring system gathering network interface data, disk usage, cpu utilization, etc, and storing it in some kind of database like rrd, influxdb, or graphite, it isn't that much of a stretch to examine that data as an aggregate and graph trends. It really is amazing all the stuff you can figure out with this technique.
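    As a rough sketch of the "graph trends" step: assuming you have already exported (day, used GB) samples from whatever store you use, a plain least-squares line gives a first estimate of when a volume fills up. The function names and sample numbers are made up for illustration, and a straight line is an assumption that real growth often violates.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(days, used_gb, capacity_gb):
    """Days after the last sample until the fitted trend reaches capacity.

    Returns None when usage is flat or shrinking, since no fill date exists.
    """
    slope, intercept = fit_line(days, used_gb)
    if slope <= 0:
        return None
    day_full = (capacity_gb - intercept) / slope
    return max(0.0, day_full - days[-1])


if __name__ == "__main__":
    # Hypothetical daily samples for one filesystem.
    days = [0, 1, 2, 3, 4]
    used = [100, 110, 120, 130, 140]
    print(days_until_full(days, used, capacity_gb=200))
```

    The same two functions work for any metric your monitoring system records against time (interface throughput, CPU utilization), as long as you pick a sensible ceiling for it.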

  23. Anybody get fired for buying too much? by swb · · Score: 3, Insightful

    Too much network bandwidth? Too much storage capacity? Too much CPU?

    Usually the drill seems to involve a lot of begging and pleading for money from management. The intermediate levels get dinged if they have to go back to the well, but they don't seem to get dinged if budgeted money ends up buying unused capacity.

    I don't doubt there are places which do heavy audits and ask hard questions about why you have a SAN with a bunch of free space or why your 10 gig NICs are running at sub-gigabit utilization and cause all manner of pain and suffering for excess capacity already budgeted and bought.

    But usually it doesn't seem to happen that way. Management barely supplies enough resources to meet their running demands and line management buys as much excess capacity as they can beg, borrow or steal because they know they will be punished for buying too little.

  24. Spend Money on the Right Tools by dave562 · · Score: 3, Informative

    These days capacity planning comes down to having the right tool set for the job. I like VMturbo. There are a few others out there that will get the job done. VMturbo is nice because it is platform agnostic and can help you decide where to place workloads not only based on pure performance numbers, but also on resource cost. (For example, HyperV is likely less expensive than VMware in most situations).

    It is also worth considering an Application Performance Monitoring (APM) tool. Being able to identify exactly where the application is slow, and whether or not it is an issue with the code or the underlying OS / infrastructure, will save a lot of time during troubleshooting, and also help identify room to proactively allocate resources to head off potential bottlenecks.

    On a similar subject, a tool that provides deep visibility into the database layer helps a lot for the same reasons. A lot of junior admins make the mistake of assuming that high database server utilization is indicative of under provisioned hardware. In reality, poorly written queries will bring down even the beefiest of database servers. While you get information with the built in management tools, a dedicated monitoring platform (like Spotlight from Dell for example) will help you develop historical trends, while at the same time providing real time troubleshooting capabilities.

    Most of the time the network is the last bottleneck. In Cisco shops you can utilize NetFlow to figure out where the problems are. Or if the company you are working for has money to burn, the UCS infrastructure stack is very robust and comes with a whole slew of management and monitoring tools that can be leveraged to discover latencies before they impact production environments too severely.

  25. Just use AWS by Cyberax · · Score: 2

    Just use AWS and scale out as needed. Your capacity planning then becomes more of a question which reserved instances to buy. AWS is not suitable for 100% of applications, but if you ask how to do capacity planning then your use-case is unlikely to be that 1% that doesn't fit the AWS model.

    1. Re:Just use AWS by Jaime2 · · Score: 1

      Not a magic bullet.

      Often what you want is a projection of the next five years' cost. Sure, AWS is good at making sure you only pay for what you need, but it doesn't help you make the Go/NoGo decision on a project. You can still easily get into a "this thing costs me more in usage than it's saving me" situation.

    2. Re:Just use AWS by Cyberax · · Score: 2

      You can buy reserved instances for 3 year periods, this locks-in the price and guarantees availability. And 5-year projections actually don't make much sense - hardware is likely to go through a couple of generations during this time. I worked for many companies and all the long-range cost projections I've seen were nothing but a pipe dream (or actually a checkbox document written by engineers eager to do actual work instead of generating tons of useless paperwork).

    3. Re:Just use AWS by Jaime2 · · Score: 1

      You can buy reserved instances for 3 year periods, this locks-in the price and guarantees availability.

      But that doesn't guarantee that the capacity you reserved will provide the performance you need. See the TeamQuest Model posts above for an example of a tool that can help you predict how much capacity you'll need to scale up from a pilot to a full implementation.

  26. Capacity Pros and Cons by DarthVain · · Score: 1

    Not sure how it is done elsewhere, but I think we tend to balance load tenuously. Adding more applications and users slowly. When things start slowing down, users will start to complain, bring on a few more servers, repeat.

    However, one thing I will say, there is a danger in overcapacity. Managing multiple servers can be difficult (apparently), and it can cause some pretty hard to nail down issues in applications, particularly in legacy applications.

    I've had a couple of instances where a couple of (not all) servers were configured differently, and applications would perform differently depending on which server you were randomly connected to. That made user complaints really hard to decipher, as sometimes it would work fine and other times not. Like having the default system date set to MM/DD/YYYY vs. DD/MM/YYYY with legacy systems that just accept whatever... I've also had a few instances where certain services were turned off on some but not all servers... again with similar results, in that performance really depended on which servers you were assigned to and what the specific application was doing...

    1. Re:Capacity Pros and Cons by digsbo · · Score: 1

      I've had a couple of instances where a couple (not all) servers were configured differently, and applications would perform differently, depending on what server you were currently randomly connected to

      That's a system management issue, not a capacity planning issue.

    2. Re:Capacity Pros and Cons by DarthVain · · Score: 1

      Agreed. However, I was just saying that it seems the more capacity you add, the more complex a system can get, which can make management more difficult, so simply adding capacity for capacity's sake isn't all that great an idea either.

  27. Re:For anything expensive, we use TQ by davecb · · Score: 1

    It was a customer's center so I'm being vague, but it was more than 10%

    --
    davecb@spamcop.net
  28. OFFS by multimediavt · · Score: 1

    Today, with many projects, the complexity makes it pretty difficult to foresee resource usage. This makes it hard to budget for hardware that can fulfill capacity and performance requirements in the long term.

    How is it any harder than ten years ago, twenty years ago, thirty years ago? Either you know what you have and how it is performing today or you don't. Either you know what the user demand(s) is/are or you don't. Either you know what options you have for hardware, software and services needed or you don't. Heaven forbid you do your job and find out BEFORE you start a new project plan.

    I know why this person posted this Ask /. anonymously. Either they're completely incompetent or their organization is when it comes to IT project planning and basic knowledge of their existing systems and users. OR, this is just another stupid Dice plant to get data from us for their recruiting metrics. Just seems like somebody is either out of their depths or is pulling a fast one.

    1. Re:OFFS by multimediavt · · Score: 1

      Sorry, these stupid fucking questions are really beginning to piss me off, especially coming from an AC!

  29. Follow the gospel of Holmes by Anonymous Coward · · Score: 0

    I just imitate Holmes on Homes and overbuild everything. Eventually you outgrow everything, but it extends useful lifetime by quite a lot.

  30. Capacity Planning by Anonymous Coward · · Score: 0

    You might try your question on the Capacity Planning and Performance Evaluation group on LinkedIn instead. There are lots of lurkers there who may be of assistance.

  31. Simplify the problem, use a metrics based approach by ArijitMukherji · · Score: 3, Informative
    This is exactly the situation we ran into when we launched our SaaS platform SignalFx to general availability. Internally it is composed of 15-20 different micro-services, making capacity planning a big challenge. We blogged about our experience in "Metrics based approach to capacity planning". SignalFx is a metrics based monitoring platform, so in a meta way, we used SignalFx to capacity-plan for SignalFx's launch.

    tl;dr version of our lessons and suggestions

    1. Design your architecture to be loosely coupled, so that it is possible to capacity-plan for each sub-component independently. Break a complex problem into N simpler ones
    2. Identify the 'limiting system resource' for each component individually (i.e. what will hit the wall first - CPU, memory, network etc.). You can do this through a combination of experimentation and plain and simple reasoning based on understanding of how it works
    3. Identify a business metric that correlates with the utilization of the limiting resource (e.g. api calls per second, number of logged in users, or whatever)
    4. Use analytics/math to project the capacity of the system, and how much free capacity you have (make sure to leave enough buffer, e.g. most services won't run very well at 99.99% cpu)

    At the end, you'll have something like this for each component of the system, e.g. "if I'm CPU bound on component X, and CPU of X goes up linearly with API calls/sec, and I'm currently at 5000 API/sec at 50% CPU, then I have total capacity for 9000 API/sec (with a 10% buffer) and free capacity for another 4000 API/sec."

    Now divide and conquer: give each component owner the responsibility to manage the capacity of their system based on the business needs provided to them.
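    The worked example above can be sketched in a few lines. This assumes, as the example does, that the limiting resource scales linearly with the business metric; the function name and parameters are illustrative, not part of any SignalFx API.

```python
def capacity_headroom(current_rate, current_utilization, buffer_fraction=0.10):
    """Project usable and free capacity from one (rate, utilization) observation.

    Assumes utilization grows linearly with the business metric (e.g. API calls/s).
    Returns (usable_rate, free_rate) in the same units as current_rate.
    """
    if not 0.0 < current_utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    if not 0.0 <= buffer_fraction < 1.0:
        raise ValueError("buffer_fraction must be in [0, 1)")
    rate_at_full = current_rate / current_utilization   # rate at 100% of the resource
    usable_rate = rate_at_full * (1.0 - buffer_fraction)  # hold back the safety buffer
    return usable_rate, usable_rate - current_rate


if __name__ == "__main__":
    # The numbers from the comment: 5000 API/sec at 50% CPU with a 10% buffer.
    usable, free = capacity_headroom(5000, 0.50, buffer_fraction=0.10)
    print(f"usable: {usable:.0f} API/sec, free: {free:.0f} API/sec")
```

    Each component owner would feed in their own limiting resource and business metric; if the relationship is not linear, the projection needs a fitted curve instead of a single division.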

  32. products are out there.. by Anonymous Coward · · Score: 0

    One I've seen is Fujitsu's QoS when I worked there. Monitors windows, linux, solaris (apparently hpux and aix) and has nice capacity reporting and alerting. Not sure if they have a public demo. I've only seen it with MS they've done and as per usual most of their internal products aren't marketed well. Have a look.

  33. Hire a specialist by LDAPMAN · · Score: 1

    Take a look around your organization... do you have anyone who has a firm understanding of the breadth of the technologies used in your systems? Does that same person have experience with performance testing and capacity analysis? Since you're asking Slashdot, the answer is very likely NO, so go find a consultant who has that knowledge and hire him/her ASAP. Capacity analysis of distributed systems is not something to learn on the fly. Hire a pro.

  34. How we do by Anonymous Coward · · Score: 0

    The key is good automated reporting. We have reporting and alerting when we are getting close to capacity. We keep these usage reports which means every year we can forecast capacity for the year based on the previous year. At the start of the year we forecast and make purchase recommendations. Extra capacity is purchased and installed. Simples.
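
    As a rough sketch of what "forecast capacity for the year based on the previous year" might look like, here is a plain least-squares trend line in Python. The comment does not say which forecasting method is used, so the linear fit is an assumption, and the monthly figures and installed capacity below are invented purely for illustration.

```python
def linear_forecast(usage, periods_ahead):
    """Fit a least-squares line to equally spaced usage samples and
    extrapolate it periods_ahead steps past the last sample."""
    n = len(usage)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(usage) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, usage))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + periods_ahead)


# Hypothetical monthly storage usage in TB taken from last year's reports.
monthly_tb = [40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62]
installed_tb = 80  # hypothetical installed capacity

projected_tb = linear_forecast(monthly_tb, periods_ahead=12)
shortfall_tb = max(0.0, projected_tb - installed_tb)
print(f"projected usage in 12 months: {projected_tb:.1f} TB")
print(f"capacity to purchase: {shortfall_tb:.1f} TB")
```

    A straight line only works when growth is steady; seasonal or step changes in usage would need a different model.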

  35. The solution is cloud computing. by Anonymous Coward · · Score: 0

    Migrate to the cloud, either public or private.

    1. Re:The solution is cloud computing. by manu0601 · · Score: 1

      Migrate to the cloud, either public or private.

      Yes, I heard it would make you rich, make your wife come back home, and moreover it cures cancer.

  36. Re:Simplify the problem, use a metrics based appro by davecb · · Score: 1

    Convert your metrics into time units so you can say something like "I need 6 CPU/S per 100 users at no more than 80% utilization". The math is less weird than trying to work in percentages of something you're going to replace with a CPU that's 12% faster (;-))
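
    A small Python sketch of why time units make the math easier. It reads "6 CPU/S per 100 users" as 6 CPU-seconds of work per wall-clock second for every 100 users, which is one interpretation and an assumption here; the function name and the 1000-user figure are also made up for illustration.

```python
import math


def cores_needed(users, cpu_sec_per_100_users=6.0, target_util=0.80, speedup=1.0):
    """Cores required to serve `users` while staying at or below target_util.

    cpu_sec_per_100_users: CPU-seconds consumed per wall-clock second
                           per 100 users, measured on the current CPU
    speedup: relative speed of the replacement CPU (1.12 = 12% faster)
    """
    demand = users / 100 * cpu_sec_per_100_users  # CPU-seconds per second, old CPU
    demand_on_new_cpu = demand / speedup          # same work takes less time
    raw = demand_on_new_cpu / target_util
    # Round before ceil so float noise cannot push an exact value up a core.
    return math.ceil(round(raw, 9))


current = cores_needed(1000)                 # on today's CPUs
upgraded = cores_needed(1000, speedup=1.12)  # on CPUs that are 12% faster
print(f"current CPUs: {current} cores, 12% faster CPUs: {upgraded} cores")
```

    Because the demand is expressed as work rather than as a percentage of one particular box, swapping in faster hardware is a single division.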

    --
    davecb@spamcop.net
  37. PCA by Anonymous Coward · · Score: 0

    At Nokia Mobile Phones I was the principal performance engineer in the PCA group for the proxy browser department - PCA meaning Performance, Capacity, and Analytics. Before Microsoft took over the division we were most of the way through moving our systems from our own data centers to the Amazon EC2 cloud. Why? Because adding capacity to existing data centers with our own hardware had a number of major costs - hardware, networking, power, cooling, and space. Adding a rack to an existing data center was a serious capital cost and took time for approval, specifications, purchase, installation, etc. Then we needed staff to see to the maintenance of the gear. Moving this to AWS meant that we could easily scale our capacity up or down as needs changed, even hour by hour, and only pay for the capacity that we actually needed. The savings were HUGE! With our own gear, we had to plan for worst case (highest) usage, and even when it wasn't being used, the cost was still there. With AWS we could compute what our needs were going to be in a few hours, and spin servers up or down as appropriate. If we weren't using them, we weren't paying for them.

    As for performance, I designed and built tools in C++ that ran on each server (both ours and Amazon's) and pushed 10 billion data points per day to both a Hadoop cluster in AWS for real-time performance/reliability analytics, and an AWS Redshift database for click-stream analytics.

  38. if you don't know your needs up front... by Chirs · · Score: 1

    then cloud computing is pretty much aimed at you. (At least if your need is likely to be variable.)

    The whole point of the cloud is that you only pay for what you use. If your needs are wildly variable from one month (or day) to the next, it might make sense to rent time/storage/throughput via the cloud.

    If your needs generally only increase, and increase at a predictable pace, then it probably makes more sense to buy your own hardware.

  39. Re: Simplify the problem, use a metrics based appr by ArijitMukherji · · Score: 1

    Exactly. That is one of the things we consider in the blog.

  40. Planning Tips by John.Miecielica · · Score: 1

    I ran the Capacity Management practice for a leading provider of financial data servers for 10 years. We had a dedicated team doing the capacity planning for the company. However, a key part of my practice was making the performance data available to the application analysts as well as the centralized capacity planning team. By fostering this partnership with the application teams, we were able to understand exactly what was going on with each system and develop capacity plans together. We settled on the TeamQuest set of products (for full disclosure, I am now working for TeamQuest as a Product Manager). There are many tools out there that can help you with some of the initial tasks, like gathering performance metrics, in order to move from Chaotic to Reactive. However, as you move up the maturity curve to Proactive, Service and Value, the available technology really starts to thin out. Some of the key elements you want to pay attention to are:

    - Data Granularity. Some tools only go to the 15 minute level. TeamQuest can collect data down to 1 second intervals, which some of our customers find to be essential. For many customers 1 minute or 5 minute granularity is sufficient. You will also want to view performance data at the process level to find the culprit(s) for high CPU utilization (for example).
    - Problem Resolution. You will want an easy to use, flexible interface to view the performance data in fine granularity to assist in problem determination. Automated correlation analysis is a huge plus in this area, as the problem you are looking at may actually be a victim of something else going on in your infrastructure.
    - Prediction. Performance of computing systems, unfortunately, is not linear. So simple linear trending will only take you so far. You will need more sophisticated modeling technology to really understand when response time and throughput are going to suffer.
    - Guidance. You need a set of analytic tools which tell you exactly what will cause performance issues and when, and help you understand what it will take to prevent the problems from occurring in the first place.
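
    To illustrate the Prediction point (performance is not linear), here is a minimal Python sketch using the textbook single-server M/M/1 queue, where mean response time is service time divided by (1 - utilization). This is a deliberately simple model chosen for illustration, not a description of how TeamQuest or any other product models systems, and the 0.1 second service time is a made-up figure.

```python
def mm1_response_time(service_time, utilization):
    """Mean response time of an M/M/1 queue: R = S / (1 - rho).

    service_time: mean time to serve one request with no queueing (seconds)
    utilization:  fraction of time the server is busy, must be in [0, 1)
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


SERVICE_TIME = 0.1  # hypothetical: 100 ms per request on an idle server

# Response time grows slowly at first, then sharply as utilization nears 100%.
for rho in (0.50, 0.80, 0.90, 0.95):
    r = mm1_response_time(SERVICE_TIME, rho)
    print(f"utilization {rho:.0%}: mean response time {r:.2f} s")
```

    Going from 50% to 90% utilization does not add 40% to response time in this model; it multiplies it by five, which is why a straight trend line through utilization data can miss the point where users start to notice.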