The Trouble with Virtualization - Cranky IT Staffs
lgmac writes "A new survey on the results of Enterprise use of virtualization shows that the process is seeing wide and appreciative use. Technical hurdles are obviously the biggest problem facing corporate IT shops. Just the same, political squabbles among IT staffers fighting for turf after being forced to work together in new ways seem to be a going concern as well. 'Technical woes rank higher--to be expected when CIOs deploy a new technology such as virtualization. However, the politics pain many of you. Remember, virtualization not only asks people to cede some control over their physical server kingdoms, but also asks IT experts from different realms to work more closely together.'"
34% of surveyed companies have been running virtualized desktops? Putting aside that that number doesn't seem to square with the "Virtual Desktops a Hard Sell" table below, does that seem likely?!?
My company's biggest problem with virtualization at this point has been backing up running copies of virtual servers without interruption. Anyone have some insight on this?
Technology is continually changing. Those who adapt will be the most successful. Those who don't will eventually be pushed aside. Fighting over turf won't get you far in a corporate environment in the long term.
In my experience as a systems engineer, the biggest problem we've had with virtualization is that too many people who don't understand it well view it as a magic wand that you can wave to make all your capacity & provisioning problems disappear.
"Hey! We need a new server to run Blah version 3.0!"
"No problem! Sammy can create a new virtual server!"
"Oh wait - my bad. We actually need a whole farm."
"That's okay, he can whip up a whole batch of them!"
Ad nauseam. About the worst I've heard was a clueless manager asking me if the resource requirements for Oracle 10g could be relaxed because we were running it on VMware. I actually found myself calling a "come to Jesus" meeting in which I explained, in as simple terms as I could, that "making the system virtual" doesn't mean that hardware requirements go away. Very, very few applications get faster when you put them on equivalent hardware, only virtualized.
I'd imagine that one of the big problems with virtualization is clueless IT managers/staff who don't understand that you basically are dividing a server down into sub-servers. I've encountered a few people who seem to think that virtualization multiplies the server resources. That is, everyone using a VM basically gets the full specs of the host machine--all at once! Ugh! Maroons!
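To put numbers on the "dividing, not multiplying" point, here is a minimal sketch. The host size and VM allocations are made-up figures, and it assumes no overcommit at all (real hypervisors can overcommit CPU and memory, but that only shares the same physical resources more thinly).

```python
from dataclasses import dataclass


@dataclass
class Host:
    cores: int
    ram_gb: int


def remaining_capacity(host, vms):
    """Return (cores, ram_gb) left on the host after every VM takes its slice.

    vms is a list of (cores, ram_gb) allocations. Negative results mean the
    host has been promised more than it physically has.
    """
    used_cores = sum(cores for cores, _ in vms)
    used_ram = sum(ram for _, ram in vms)
    return host.cores - used_cores, host.ram_gb - used_ram


# Hypothetical 8-core, 32 GB host carved into three VMs.
host = Host(cores=8, ram_gb=32)
vms = [(2, 8), (2, 8), (4, 16)]
print(remaining_capacity(host, vms))
```

With those three VMs the host is fully spoken for. A fourth VM does not get "the full specs of the host"; it pushes the totals negative.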
My company works with several shops that are working on large-scale virtualization and common platform projects. I would say the biggest single issue is simply politics, because much of the initial work is affecting older platforms that are the biggest win technically and financially to replace. For instance, one shop has a significant investment in Alpha systems, and still has production servers that are 15+ years old running a huge chunk of their revenue producing systems. The folks working directly on the Alpha servers have considerable clout, since they've been the golden children for many, many years. Their bosses know how to play politics, and, considering that Alpha/VMS experience is one of those IT areas where there is little new blood from younger IT staff members, they are quite adept at finding reasons why it won't work to serve their own ends.
Not only that, but virtualization will result in lost jobs at some point. Many IT staffers are afraid, whether rightly or wrongly, of losing their jobs. In a sense, they are outsourcing a good chunk of their day-to-day duties. I remember when this particular company went to SANs over the last half-decade, and you would have thought, from the way the Alpha guys were fighting it, that the world was ending. They created road-block after road-block about how they wouldn't be able to keep the systems running, how it wouldn't work in "their" environment, etc, etc.
And, because of the compartmentalization that often occurs in large enterprise, many of these guys have very little idea about anything outside their own box. I know guys who have architected corporate platform migrations who are so narrow in their focus that they have *NO* experience outside their box, be it a particular OS, a server type, a network type, whatever. When the box becomes a cloud of equipment, they are lost and often have little or no ability to work with the other layers involved. Learning new troubleshooting skills in these environments is a painstaking process, and not one that many people are comfortable with.
In the end, these various factors are creating far larger artificial roadblocks for implementing virtualization than any technical challenges. To top it off, much of this is being driven by financials. The CFO and CTO are desperately trying to find ways to cut costs. By the time this message percolates down to the workers, they feel threatened rather than empowered, and have little incentive (and generally no training, either) to be complicit in what they feel is a threat.
Bill
It may be more practical to back up the system from within the VM, i.e. treat it as if it weren't a VM. By definition this will be on a live system.
Another option:
Have your VM use a checkpoint disk. Once a day, shut down the VM, merge the accumulated changes into the checkpoint disk, and restart the VM. The downtime may be anywhere from a few minutes to tens of minutes. Then back up the checkpoint-disk image.
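The ordering of that checkpoint-disk routine matters more than the tooling, so here is a rough sketch of it. The four callables are hypothetical placeholders for whatever your virtualization product actually provides; none of them is a real API. Restarting the VM even when the merge fails is my own assumption about what you would want, not something the procedure above specifies.

```python
def checkpoint_backup(shut_down, merge_changes, restart, back_up_image):
    """Run the checkpoint-disk backup steps in order.

    Each argument is a zero-argument callable supplied by the caller
    (hypothetical hooks into your own hypervisor tooling).
    """
    shut_down()
    try:
        merge_changes()
    finally:
        # Assumption: bring the VM back up even if the merge blew up,
        # so a failed backup does not also become an outage.
        restart()
    # Only reached when the merge succeeded; the image is now consistent.
    back_up_image()


# Dry run that just records the order of operations.
log = []
checkpoint_backup(
    lambda: log.append("shut_down"),
    lambda: log.append("merge_changes"),
    lambda: log.append("restart"),
    lambda: log.append("back_up_image"),
)
print(log)
```

The backup of the image happens after the VM is running again, so the outage is limited to the merge itself.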
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
How do you ensure that the VM supervisor fairly and efficiently allocates resources to the VMs? The mainframe people put a great deal of work into this area. One badly behaved VM shouldn't be able to degrade the performance of the other VMs.
Mea navis aericumbens anguillis abundat
This is a problem with management and/or the IT staff.
Management should run the company in a way that cooperation is rewarded not punished. Consolidation to save money shouldn't result in harm to those who are making it happen or anyone else for that matter.
The IT staff as well as all of the other employees and officers should have the attitude that if it's good for the company and not bad for anyone else it's the right thing to do.
Yes, well, naturally the problem is us "peons" can't work together. It has nothing to do with the fact that our bosses don't have a fucking clue about how to use the technology.
With HA and Clustering capabilities offered by many virtualization solutions you could end up taking a physical server and resources that weren't redundant and through consolidation efforts end up with more redundancy than before. It's all in how the solution is designed and knowing when to use virtualization and when not to use virtualization.
The problem is not that we don't want to work together. It is that you often cede control when you virtualize. And most of us don't love giving up control.
With virtualization in some corporations, you have to ask for another of the 32 processors, instead of just having the headroom all the time. (Work that one through a bureaucratic organization; it can take months.)
Say you need to add another fax board (or whatever) to the virtualized x86 server, only to find that they stuck some mission-critical virtual environment on the server and it CAN'T come down for another 2 weeks.
Yep, it saves hardware, but multiplies headaches in some situations. It is no wonder some fear it.
Items that need to be redundant should not be virtualized on shared hardware. I've heard people want to virtualize redundant instances of directory services, databases, proxy servers...etc. I call this "putting all your eggs in one central-point-of-failure hardware basket".
If you're doing something stupid like putting clusters or redundant servers on the same virtualization host, then I would agree. High availability loses its meaning if all your nodes have a single point of failure.
However, there's absolutely no reason you can't make your virtualization implementation highly available itself. Right now, I have clusters running in VMware VI3, that are running on separate hosts. Even with DRS, which balances all your VMs across an entire pool of servers, I can ensure that redundant servers and clusters don't end up running on the same piece of physical hardware. And when you add HA into the mix, you also provide a level of high availability to systems that you might not otherwise have been able to justify the expense on.
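For anyone wondering what "don't end up running on the same piece of physical hardware" looks like as a rule, here is a toy placement sketch. To be clear, this is not how DRS works internally; it is just a greedy illustration of an anti-affinity constraint, and the host and VM names are invented.

```python
def place(vms, hosts):
    """Assign each VM to a host, never co-locating VMs from the same group.

    vms:   list of (vm_name, redundancy_group) pairs; group may be None
           for VMs with no redundancy partner.
    hosts: list of host names.
    Picks the least-loaded eligible host; raises ValueError if a redundancy
    group has more members than there are hosts to spread them across.
    """
    placement = {}
    load = {h: 0 for h in hosts}
    groups_on_host = {h: set() for h in hosts}
    for name, group in vms:
        candidates = [
            h for h in hosts
            if group is None or group not in groups_on_host[h]
        ]
        if not candidates:
            raise ValueError(f"no host left for {name} (group {group})")
        target = min(candidates, key=lambda h: load[h])
        placement[name] = target
        load[target] += 1
        if group is not None:
            groups_on_host[target].add(group)
    return placement


vms = [("db1", "db"), ("db2", "db"), ("web1", "web"), ("web2", "web")]
print(place(vms, ["esx1", "esx2"]))
```

The two database nodes land on different hosts, and so do the two web nodes. Ask for a third database node with only two hosts and the function refuses, which is the right answer: at that point you need more hardware, not a cleverer scheduler.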
An effect I've noticed many times is that when you ask IT staff to vote, the Windows IT staff always outnumber the Unix and Mac IT staff. Thus one man, one vote favors the Windows fire-drill fix-it jockeys over the more talented kernel of Unix and Mac support gurus. Yes, I realize that's flamebait, but it's actually true. By and large Windows has so many problems to keep functioning that it takes a large staff of low-paid trained monkeys on hand. The revenge of the C-students is that they outnumber the A-students who run the Linux servers.
You want to watch a fight? Get the Windows Server sysadmins and the UNIX/LINUX sysadmins and ask each group which server OS should be the "Native" operating system under which the other runs....fun...
"when CIOs deploy a new technology"
That could be your problem right there. When a specific technology or whoop-do-doo product is pushed from the top down, rather than the bottom up, it's a problem. That's not the same as management saying "Get this done", so much as it's "Use this fancy thingy I read about in the newspaper... who cares what it does or if there is something better, I'm the decider!"
This is a classic sign of a broken IT department. One place I worked, if you (well, if I) needed to increase the size of a database table, I had to get sign-offs from
Net result? Nothing ever got agreed. The simplest changes took forever and cost a fortune. The operation is now outsourced.
Who's to blame? Probably not the techies, they just pressed buttons. Quite likely the team-leaders for turning it political, definitely the IT managers who allowed the situation to continue.
Who kept their jobs?
Yup, the managers! You've been warned: infighting only hurts the foot-soldiers; the generals aren't affected. Sort it out yourselves or you'll have to start learning Chinese.
politicians are like babies' nappies: they should both be changed regularly and for the same reasons
It's happened twice to me at two different companies.
Whenever I need a machine scratch-pad, I boot up a VMware machine, test the software or do whatever I need to do, and delete it. But while it's running, it broadcasts itself on the local net. Admins really freak out when a machine named //FAKEOUT or //BOGUS suddenly shows up on their net.
I've given two different IT guys at two different companies cardiac events over it.
Sorry, fellas.
Weaselmancer
Ridiculous.
Many large virtualized deployments include very advanced technologies such as shared SANs, shared infrastructure, and complex virtualization tools.
Frankly, I would argue that you are probably just redeploying people resources into different roles and responsibilities, while probably saving on hardware and energy costs for the infrastructure through consolidation.
Lindsay Blanton
RadioReference.com
I think your math is flawed. Your 500 physical servers just went to 500 virtual servers. Each one has a dependency that is now harder to dissect and is more abstract from the hardware.
Sure, you may have a smaller room - but you definitely have more complexity. Now you have servers moving all around, because you really can't tell what their true capacity is on the physical server - it just looks like CPU or memory is getting busy. Now you need to move it. In pops VMotion. This is very fancy ooo-aaah stuff. You start doing this once, twice, eighty times a day (obviously depending on the size of your environment). However, soon you come down with what I call "VMotion Sickness".
See, VMotions create two problems. 90% of the time they are just moving a problem resource somewhere else, creating a never-ending game of whack-a-mole, AND at the same time they consume a serious amount of resources across your entire virtualization environment.
What you've done is add COMPLEXITY to those 500 servers by pushing them onto 100 physical boxes. You've also magnified risk of physical hardware problems so maintenance must be much more rigorous. And you've now added a huge learning curve to your entire team to learn how to triage any problems and avoid whack-a-mole.
Full disclosure: I work for a systems management software company, http://www.hyperic.com/, that specializes in managing virtualized environments, and I talk to these shops every day. Also, while I am at it - one of our customers, http://www.mosso.com/, a clustered hosting provider that is a division of Rackspace, built a great case study on managing a 100% virtualized environment. And for the record, they were able to keep their staff the same, which they thought was a big achievement. They had actually thought that virtualization would add so much complexity, they would have to ADD staff to maintain SLAs. The case study can be found here: http://download.hyperic.com/pdf/Hyperic-CS-Mosso.pdf
The one thing I have been able to rip from users is certain services, like an Oracle/MySQL/Postgres server. In the past the users felt they had to maintain it. Now we have one server and they use it as a service. The cluster handles keeping it up. This only works well with RedHat and only if you know what you are doing. Now the end users are relieved. They don't have to worry about the database, server configuration and maintenance that used to dog them with Solaris, Windows, BSD and SUSE boxes. Windows being the biggest PIA because Microsoft does things to you if you update it. Then the other issues the /. crowd is used to.
I have a load balancer that is managed by a gang of web servers as a clustered service, so it never goes down. The web servers are highly available so I can reboot whenever I want. The database is also highly available. People just upload stuff to a virtual address and a different port and it is just there. It gets updated very quickly when a patch comes out. In short I don't have to even schedule down time anymore unless we have a power outage. Just be sure you have a place to test updates first. If something goes wrong with the clustering software, it can really go wrong. Then it is like having 100 dishes up in the air. Instead of dropping one dish, you drop 100.
The thing I hate about it is trying to explain it to end users and even guys I used to think were technical. They just don't grasp the concept of a gang of servers, virtual servers and virtual databases. They think that if someone gets a form from one machine, it must return the data to that machine. As if the server is like a logon session. Maybe it is that "logon to www.sitename" bullshit they put out there in the news. They should say "visit site www.sitename", leave "logon" out of it entirely. Anyhow, eyes glaze over and it's a bitch to get them back. Sometimes now I just tell them we are moving them off of their old machine and let it go at that. They don't have a need to know. MUCH easier that way. The only PIA is when they ask what the serial number is of "their" machine.
Still, there are some people that just don't want to give anything up. I do agree that this environment requires more cognitive abilities from the IT staff. I don't think you can be average and get by anymore. The IT staff needs to have bright people now. People that can learn. Otherwise they are left behind and it can be brutal.
We recently moved everything into virtua-land, complete with a huge SAN, Fibre Channel switches, blade servers - the whole nine yards.
While I do think the move was a net positive, the complication of 60 physical servers was more or less replaced by the complication of all the new SAN/Bladecenter components and their interdependency.
One particular thing we've run into is "firmware hell", where you have several components in the chain that all require firmware updates and all depend on each other.
I don't always use unix-like operating systems; but when I do, I prefer FreeBSD.
Justification for replacing the BSD box was that "I was the only person who knew how to fix it". The fact that the box had never even hiccuped in two and a half years, and there was ample documentation on how to get mail flowing temporarily in case of a failure if I was gone, apparently meant nothing. The Barracuda is ok, but it didn't solve the support problem at all. The Barracuda is not as accurate, and it has 'hiccuped' a few times, causing minor mail flow issues and both times we were stuck sitting with our thumbs up our asses while waiting for support.
Now we are running VMware ESX, so I get to come to the rescue every time the GUI management tools fail and the need to hit the bare console comes up. Five bucks says we'll be replacing VMware with Microsoft's virtual solution in a few years!
Anyway, enough with the rant. Thanks for the advice on CentOS. I'll keep that in the back of my mind.