The Decline and Fall of System Administration

Hyperviser by betterunixthanunix · 2011-03-02 01:55 · Score: 1, Informative

Someone still has to maintain the machines that are actually running the VMs.

--
Palm trees and 8

Re:Hyperviser by Larryish · 2011-03-02 01:57 · Score: 2

WHOOSH!
Re:Hyperviser by Baki · 2011-03-02 02:17 · Score: 2

With bare metal virtualization, there is not that much to maintain, and there is pointy clicky software to do that. No real admin skills required.
Re:Hyperviser by Anonymous Coward · 2011-03-02 02:18 · Score: 1

It's VMs all the way down.
Re:Hyperviser by Anonymous Coward · 2011-03-02 02:31 · Score: 4, Interesting

Because pointing and clicking inherently takes more skill than using CLI, right? Never mind that most CLI commands will readily assist you with syntax if you format something incorrectly, whereas documentation for a GUI, if it exists at all, is often useless.
Re:Hyperviser by bberens · 2011-03-02 02:40 · Score: 1

This is the natural progression of technology across all industries. We'll be migrating to needing a very small number of highly skilled people and a lot of "sysadmin" drones who mostly do point and click things.

--
Check out my lame java blog at www.javachopshop.com
Re:Hyperviser by digitalchinky · 2011-03-02 02:48 · Score: 1

Given your low UID I find your comment rather bewildering. Setting up a server so that it does exactly what you want is a complex task - add in a good bit of security and you're so far away from the mouse that it is utterly absurd to make this claim.
Someone still has to make the images that the point and click types use. That requires real sys-admin work.
Re:Hyperviser by __aamnbm3774 · 2011-03-02 02:57 · Score: 5, Insightful

This whole argument is retarded. I always pick the most appropriate response to the problem at hand. If your server is hosed and not booting, I don't have time to mess around with some Knoppix DVD, trying to figure out exactly where in the boot process it is dying. Especially if you have nightly backups! Sometimes a clean sweep and restore is perfectly acceptable and reasonable. Why even sacrifice downtime trying to troubleshoot an issue that could be resolved within minutes?!

Now, if it happens again the following night, you do have a deeper problem and should investigate it further, because constantly restoring the machine is now the inefficient part in the process.
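
The rule of thumb above (restore on an isolated failure, dig deeper when it repeats) could be sketched as a tiny policy function. This is only an illustration: the name `choose_response`, the 48-hour window and the threshold of two failures are assumptions made for the example, not anything stated in the thread.

```python
from datetime import datetime, timedelta

# Hypothetical policy values; tune them to the service in question.
REPEAT_WINDOW = timedelta(hours=48)
REPEAT_THRESHOLD = 2


def choose_response(failure_times, now):
    """Return "restore" for an isolated failure, "investigate" for a repeat."""
    # Only failures inside the window count as "it happened again".
    recent = [t for t in failure_times if now - t <= REPEAT_WINDOW]
    if len(recent) >= REPEAT_THRESHOLD:
        return "investigate"
    return "restore"


if __name__ == "__main__":
    now = datetime(2011, 3, 2, 3, 0)
    print(choose_response([now], now))
    print(choose_response([now - timedelta(days=1), now], now))
```

The point of writing it down is only that the threshold becomes an explicit decision rather than a matter of ego either way.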

It's like we've lost common sense in favor of our technical ego.
Re:Hyperviser by causality · 2011-03-02 02:58 · Score: 1

Someone still has to make the images that the point and click types use. That requires real sys-admin work.
Someone still has to write the programs that average end-users run. That requires real programming skill.
Yet I don't see too many average end-users who are skilled programmers.
Point is, you only need one person with actual sysadmin skill to make and maintain an image. Hundreds of point-and-click types can then use that image. It happens in large organizations all the time. Why pay for a hundred skilled, experienced sysadmins when you only need one skilled, experienced sysadmin and 99 paper MCSEs? For many businesses this is an easy decision.

--
It is a miracle that curiosity survives formal education. - Einstein
Re:Hyperviser by jc42 · 2011-03-02 03:00 · Score: 4, Insightful

... documentation for a GUI, if it exists at all, is often useless.
How true. The popular explanation of the difference between a CLI and a GUI is that CLIs are so complicated that you need a manual to use them, whereas GUIs are so simple and intuitively obvious that no manual is needed.
Of course, the reality is that this attitude allows vendors to supply GUIs without any "unnecessary" manuals, but make the nested tree of windows and menus so deep and complex that nobody can ever remember where everything has been hidden, and there are no good tools to help you find something that you know is in there somewhere.
Meanwhile, the people who build the CLI know that nobody can ever remember it all, so they include tools for finding your way around. They also tend to make the defaults for the commands fit the most common cases, so you don't have to use the manuals all that often. And most tools have a -help option (though they can't quite agree on how to spell it), to provide quick reminders. And the CLI includes a current directory, search paths and aliasing, so you don't have to remember full paths to everything.
One of the ongoing frustrations with every GUI is constantly seeing a new window pop up, which is positioned back at the root directories, and I have to laboriously poke at things to get down to the directory that I'm working in. Then, when I do what the window was opened for, it closes, all that navigation is lost, and I have to do it all over again the next time I want to access a file in the same directory.
GUIs may have some aesthetic appeal (aka "pretty pictures"), but they remain the slowest, clumsiest way to use a computer that we've yet developed. But I trust that people are working on finding ways to make it even clumsier and slower. This seems to be happening with the "cloud" approach, for example.

--
Those who do study history are doomed to stand helplessly by while everyone else repeats it.
Re:Hyperviser by __aamnbm3774 · 2011-03-02 03:03 · Score: 1

*Note: I am primarily referring to Webservers and 'Stock' machines, which have nothing on them except Apache or IIS and the Website, or some other generic/easily-replaceable application.
If this is your Database Server, you might run into issues with data-loss, so again, pick the appropriate response. (which still might be formatting and restoring)
Re:Hyperviser by Courageous · 2011-03-02 03:03 · Score: 3, Insightful

Someone still has to maintain the machines that are actually running the VMs.
This is true. What's also true is that those admins can be fairly intensely busy running those machines. The summary mixes the concepts of the growing age of virtualization with "marginal admins." The summarizer doesn't really know what's going on, I think. In intensive virtualization operations, the talent pool is shrinking, but growing more concentrated. Cross training is now becoming more common, with the few critical people one has for the core operation being trained in operating systems (both Windows and Linux), storage administration, and network administration.
These admins are often far too busy to spend a great deal of time on a specific VM. There might be literally thousands of virtual machines in a large operation. For just one VM to draw their attention, it has to be something important and shared. Domain controllers, DNS systems, RADIUS servers, or other shared production systems will often get close attention, but if a quick reboot might resolve things and isn't any more disruptive than the current problem, of course you are going to do that.
What I think the summarizer isn't really grokking is that in this growing age of virtualization, the number of admins per server is going down a lot, and the focus of these admins has changed.
C//
Re:Hyperviser by jc42 · 2011-03-02 03:09 · Score: 1

It's VMs all the way down.
Could be. One of my favorite cosmological theories is that our universe is a simulation. In the "real" universe, there's a big computer that has a data object for every elementary particle in our universe. The simulation software (probably massively parallel) "steps" through the simulation, by calculating the position and velocity of each particle after the next time quantum. The beings running the simulation can stop it, do a bit of editing, and restart, which explains the religious "miracles" that have been so often reported.
It's hard to imagine how we could test this hypothesis. If we were to do a successful test, the simulation could just be stopped, reloaded from backup, and edited so our test came out inconclusive.
Of course, if this is valid, then we should also consider that the simulation might itself be running in a simulated universe ...

--
Those who do study history are doomed to stand helplessly by while everyone else repeats it.
Re:Hyperviser by Anonymous Coward · 2011-03-02 03:10 · Score: 0

Easy, just automate the reimaging process!
Re:Hyperviser by Anonymous Coward · 2011-03-02 03:13 · Score: 0

It's almost as if there should be a public documentation project or something.
Re:Hyperviser by hitmark · 2011-03-02 03:24 · Score: 1

So in the end, the cloud sits on an assembly line.

--
comment first, facts later. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
Re:Hyperviser by TheLink · 2011-03-02 03:25 · Score: 1

If the virtualization is perfect AND hidden by design, you can't test it. There would be no way for the stuff inside to "break out" or detect an "outside" without the help of those already outside (those outside might be able to copy or move stuff out).

Of course the virtualization could be perfect but the design might intentionally leave clues of an "outside", just like some virtual machines will indicate "vmware" as the brand of some "hardware".

I believe the mathematicians and physicists have already realized all this long ago. Can't remember the reference or what it's called though.
--
- Too many replies beneath your current threshold
Re:Hyperviser by causality · 2011-03-02 03:39 · Score: 3, Informative

It's VMs all the way down.
Could be. One of my favorite cosmological theories is that our universe is a simulation. In the "real" universe, there's a big computer that has a data object for every elementary particle in our universe. The simulation software (probably massively parallel) "steps" through the simulation, by calculating the position and velocity of each particle after the next time quantum. The beings running the simulation can stop it, do a bit of editing, and restart, which explains the religious "miracles" that have been so often reported.
It's hard to imagine how we could test this hypothesis. If we were to do a successful test, the simulation could just be stopped, reloaded from backup, and edited so our test came out inconclusive.
Of course, if this is valid, then we should also consider that the simulation might itself be running in a simulated universe ...
That's really not too far from Hermetic thought, which is quite ancient. What follows is an oversimplification I hope is still useful. The main difference could just be that they didn't have computers thousands of years ago. Rather than imagining that the simulation is running on a highly advanced computer that's basically a machine of the ultimate sophistication, they conceive the simulation (the "software") to be thoughts in the mind of God. It's also an explanation for how God could be transcendental, beyond the Universe, omniscient and omnipresent, but not some old man in the clouds you could shake hands with like the more childish notions of God.
The Matrix is based on some very old ideas.
I also think it's fascinating to wonder ... if you could see the Universe as a whole, in its entirety, all at once, like perhaps from the perspective of another Universe, what would it look like? Would it look like a single living being, recognizable as such? Would it look sort of like a man even, as in the "we are made in the 'image of God'" idea? What fascinates me about that is the notion of galaxies being like cells in its body, which are made of stars, which have planets, which have organisms, which are made of cells, which are made of molecules, which are made of atoms, which are made of subatomic particles, etc, potentially to infinity. It could be infinite both ways, scaling ever smaller and also scaling ever larger. It's like the fractal Universe idea.
That, in turn, reminds me of the holographic Universe idea. It's a notion of such a fractal nature in terms of interrelatedness. It's an analogy for how the "parts contain the whole". Basically, if you take a glass photographic plate and take an ordinary photograph on it, and then break that plate ... you get something like a jigsaw puzzle. Each piece has an incomplete fraction of the total information. If you put a hologram onto a photographic plate and then break that plate into pieces, you get something quite different. You don't get a jigsaw puzzle at all. If it breaks into 10 pieces, then you get 10 complete holograms containing the full information of the original, just with each of them 1/10th the size of the original.
It's like the notion that truly understanding yourself would require truly understanding the Universe. Carl Sagan may or may not have been thinking something like that when he said, "in order to make an apple pie from scratch, you must first invent the Universe."

--
It is a miracle that curiosity survives formal education. - Einstein
Re:Hyperviser by GooberToo · 2011-03-02 03:40 · Score: 1

And when it's the third time this has happened and you still have no explanation as to why it's happening or how to avoid loss and downtime in the future, you absolutely should be fired - with malice.
Re:Hyperviser by Lumpy · 2011-03-02 04:03 · Score: 1

No real admin skills required....
so an MCSE can do it then?

--
Do not look at laser with remaining good eye.
Re:Hyperviser by __aamnbm3774 · 2011-03-02 04:05 · Score: 1

Right, as I mentioned, after three times in a row, restoring would not be an appropriate response.
Re:Hyperviser by MightyMartian · 2011-03-02 04:14 · Score: 2

To my mind restoring from image isn't a replacement for system administration, but it can buy you precious time. Too many times in the past I've had a gun to my head over trying to figure out why this database server or that mail server was barfing, and if I could have just kept things going while I tested on a sandboxed copy, it would have been a lifesaver. VM images and other types of OS images are tools, nothing more and nothing less. At the end of the day you still have to have some skill in troubleshooting, otherwise even with these powerful tools your system will be down the tubes soon enough.

--
The world's burning. Moped Jesus spotted on I50. Details at 11.
Re:Hyperviser by drsmithy · 2011-03-02 04:17 · Score: 1

With bare metal virtualization, there is not that much to maintain, and there is pointy clicky software to do that. No real admin skills required.
When you have a non-trivial virtualisation environment, there's still plenty to be done.
Re:Hyperviser by drsmithy · 2011-03-02 04:23 · Score: 3, Insightful

How true. The popular explanation of the difference between a CLI and a GUI is that CLIs are so complicated that you need a manual to use them, whereas GUIs are so simple and intuitively obvious that no manual is needed.

No, the difference is that a CLI is nearly impossible to use if you aren't familiar with it - the semantics and syntax are as, if not more, important than the concepts - whereas a GUI requires much less focussing on the "how", allowing much more focussing on the "what".

GUIs may have some aesthetic appeal (aka "pretty pictures"), but they remain the slowest, clumsiest way to use a computer that we've yet developed.

Ridiculously untrue, particularly in the context of non-specialised, non-expert users.
Re:Hyperviser by locofungus · 2011-03-02 04:26 · Score: 3, Insightful

Of course, the reality is that this attitude allows vendors to supply GUIs without any "unnecessary" manuals, but make the nested tree of windows and menus so deep and complex that nobody can ever remember where everything has been hidden, and there are no good tools to help you find something that you know is in there somewhere.
I think it's even worse than that.
If you have a problem to solve with a CLI then you might spend several days trying to make something work the way you need it to, but, once it's sorted it's very easy to document it for next time. (where next time might be several years down the line)
With GUI it's almost impossible to know what you've actually done at the time, let alone several years later.
Need to change a config file - take a copy, make changes, experiment etc. Once you've worked out what it is you actually need to do, restore the copy and then make the required changes. (Or just diff the original with the new version, "Hmmm, don't think I should have changed that setting, I'll change it back".)
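
The copy-then-diff habit described here can be sketched with Python's standard `difflib` module. The file name `app.conf` and the settings in the demo are made up for illustration; the function simply compares the saved copy against the edited version.

```python
import difflib


def config_diff(original_text, modified_text, name="app.conf"):
    """Return a unified diff between the saved copy and the edited config."""
    diff = difflib.unified_diff(
        original_text.splitlines(keepends=True),
        modified_text.splitlines(keepends=True),
        fromfile=name + ".orig",
        tofile=name,
    )
    return "".join(diff)


if __name__ == "__main__":
    # Hypothetical config contents before and after an experiment.
    original = "port = 8080\nworkers = 4\nlog_level = info\n"
    modified = "port = 8080\nworkers = 8\nlog_level = debug\n"
    print(config_diff(original, modified))
```

Every setting touched during the "try this, try that" phase shows up as a `-`/`+` pair, which is exactly the record a GUI session does not leave behind.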
With a GUI that "try this, try that" means that you have no idea what you might inadvertently/incorrectly have changed on your way to fixing the issue that you were really interested in.
And five years later when you need to do it again - CLI, all the options have changed subtly but your notes immediately give you a point to google and half an hour later you've worked out the correct set of switches to achieve what you need with the current version.
With the GUI, even if you've got perfect notes on what you did back then if it's even slightly non-obvious then it's very likely that the configuration option you need doesn't even exist any more (but no way to tell that of course).
Tim.

--
God said, "div D = rho, div B = 0, curl E = -@B/@t, curl H = J + @D/@t," and there was light.
Re:Hyperviser by tehcyder · 2011-03-02 04:29 · Score: 1

Er, I think GP's use of "pointy clicky software" shows which he thinks is more skilful.

--
To have a right to do a thing is not at all the same as to be right in doing it
Re:Hyperviser by ron_ivi · 2011-03-02 04:30 · Score: 2

> with a CLI ... it's very easy to document it for next time.
Indeed - just run "script" before starting typing.
Show me the equivalent of that for any GUI too.
And once you've cleaned up your document (changing 'vi filename' to 'sed .... filename') you can usually get to the point where you can just run your documentation with /bin/sh the next time you need it.
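
As a rough illustration of that cleanup step (replacing an interactive `vi filename` with a scripted edit), here is a sketch in Python rather than `sed`. The helper `set_option` and the `key = value` config format are assumptions made for the example.

```python
import re
import tempfile
from pathlib import Path


def set_option(path, key, value):
    """Rewrite 'key = ...' lines in a config file; return True if the file changed."""
    path = Path(path)
    old = path.read_text()
    # Match the whole "key = something" line, keeping the "key = " prefix.
    pattern = re.compile(r"^([ \t]*" + re.escape(key) + r"[ \t]*=[ \t]*).*$", re.MULTILINE)
    new = pattern.sub(lambda m: m.group(1) + value, old)
    if new != old:
        path.write_text(new)
    return new != old


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conf = Path(tmp) / "example.conf"  # hypothetical file
        conf.write_text("port = 8080\nworkers = 4\n")
        print(set_option(conf, "workers", "8"))  # first run changes the file
        print(set_option(conf, "workers", "8"))  # second run finds nothing to change
        print(conf.read_text())
```

Because the edit is idempotent, the cleaned-up notes can be rerun safely, which is the property that lets the documentation double as the procedure.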
Re:Hyperviser by ron_ivi · 2011-03-02 04:33 · Score: 1

> Someone still has to make the images that the point and click types use. That requires real sys-admin work.
Not really. Amazon reduced this to just "save this image" as well, so a luser can create an install of Linux that is just as sloppy as one of Windows and, faster than a real sysadmin ever could ('cause the sysadmin would spend a moment thinking), create their own image.
Re:Hyperviser by ron_ivi · 2011-03-02 04:39 · Score: 3, Interesting

> a clean sweep and restore is perfectly acceptable and reasonable
NNNOOOOOOO!
Often a glitch like that is the only evidence you'll have that a machine had been compromised or that hardware is failing.
If you must do a clean sweep, do that on a standby system, and keep an image of the failed one until you can investigate the exact reason for the failure.
Re:Hyperviser by cinderellamanson · 2011-03-02 04:42 · Score: 0

And I'd like to point out that this is in regards to a rather standard and simplified system. If the server provides a generalized service, like central servers for smaller operations, then firing the admin means getting rid of the one person with the experience to troubleshoot the system in place. So, in the example above you can quickly remove the admin who provides a basic specialized service with higher tolerances than that in a generalized service with lower tolerances.
I think the original argument is a little goofy, I reboot when I don't need to at times, but only because I can - the system is not providing a live service. In fact failover provides support for this sort of thing and is hardly a step away from proper administration.
My OpenBSD laptop basically works under the assumption that the system will be reinstalled, from disk, on release. This is a good thing as it ensures proper installation of the new system and makes a proper backup strategy a necessity.

--
Hey buddy, can i bum a karma? ~}CinderellaManson{~
Re:Hyperviser by Belial6 · 2011-03-02 04:55 · Score: 1

I think that the other thing that the summarizer is also missing is that there have always been crappy system admins. I have met many a gray beard back in the day that had no idea what was going on in the big picture. They had been trained to type in a few commands based on certain requests. It looked like they knew what they were doing only because the commands they had been taught to type in were so cryptic. They were like illiterate monks copying bibles that they couldn't read.

Then like now, there were certainly those that did know what was going on, but not as many as your average user thought. This whole article reads like when people reminisce over how much better video games used to be. They forget about all of the really bad crap that was unremarkable.
Re:Hyperviser by cayenne8 · 2011-03-02 04:58 · Score: 3, Funny

...so an MCSE can do it then?
I always thought it was MCRE..??
Microsoft Certified Reboot Engineer...?
;)

--
Light travels faster than sound. This is why some people appear bright until you hear them speak.........
Re:Hyperviser by _Sprocket_ · 2011-03-02 05:09 · Score: 2

Why even sacrifice downtime trying to troubleshoot an issue that could be resolved within minutes?! Now, if it happens again the following night, you do have a deeper problem and should investigate it further, because constantly restoring the machine is now the inefficient part in the process. It's like we've lost common sense in favor of our technical ego.
You make a fair point. However, the fundamental question is more complex than you're giving it credit for. There's always the question of tradeoffs between immediate, fast fixes and long-term advantage. That has to be balanced with the situation at hand, of course. But there are times when the initial time / effort investment pays off in the long run. And that trade-off is as much a philosophical question within the admin world as a technical one.
Quite a few years ago, we were migrating our institutional firewalls from one product to an entirely different product. Large institution. Very large and complex rule sets consisting of a lot of legacy. We pared down the rules a bit by taking advantage of the effort to audit out some legacy cruft. But we still had a pretty impressive configuration to convert between legacy and new environments. Eventually the rules got split between two of us - I got one firewall boundary and a co-worker took the other.
My co-worker got immediately to work on his portion of the rule-set. He was a very hands-on kind of guy. His tactic was to read a given rule from the legacy system and manually write up the equivalent rule (including various objects, groups, etc.) for the new system.
My tactic was different. I created a few test files based on a sampling of legacy objects. I then went to work creating several scripts that could be run in sequence to do specific tasks that would convert our legacy configuration files to a configuration file for the new environment as well as a simple expect script that would load that configuration into the target devices when the time came.
I have to admit that I was knocking off a fair bit of rust during my scripting exercise; my script development was far from efficient. So I wasn't too surprised that my co-worker was churning through configuration well before I had a functional, error-free script. It was a little disconcerting when he announced his config. file was complete before I had my script. Which had me questioning whether I was doing the Right Thing by spending time developing scripts instead of just banging out a config. file. But shortly after my co-worker's announcement, I had my script converting legacy to new configurations. Even if I had wasted time writing scripts, I hadn't wasted TOO much time.
Then came the sanity checking. We swapped config files and went over each other's configurations with new eyes. He manually spot-checked my work. I ran his legacy config file through my scripts and then compared my script's config to his manually written config. That was the first dividend. I uncovered numerous typos very quickly.
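
The cross-check described above (run the legacy rules through the conversion script, then compare against the hand-written config) might look roughly like the sketch below. Both rule syntaxes are invented for the example, since the post does not name the firewall products, and `convert_rule` and `compare_configs` are hypothetical helpers.

```python
def convert_rule(legacy_line):
    """Convert one rule from a made-up legacy syntax to a made-up new syntax.

    Legacy (assumed): "allow tcp 10.0.0.5 -> 192.168.1.10:443"
    New (assumed):    "permit proto=tcp src=10.0.0.5 dst=192.168.1.10 port=443"
    """
    action, proto, src, arrow, target = legacy_line.split()
    if arrow != "->":
        raise ValueError("unexpected rule syntax: " + legacy_line)
    dst, port = target.rsplit(":", 1)
    verb = {"allow": "permit", "deny": "drop"}[action]
    return f"{verb} proto={proto} src={src} dst={dst} port={port}"


def compare_configs(legacy_lines, manual_lines):
    """Return (rules only in the generated config, rules only in the manual one)."""
    generated = {convert_rule(line) for line in legacy_lines if line.strip()}
    manual = {line.strip() for line in manual_lines if line.strip()}
    return sorted(generated - manual), sorted(manual - generated)


if __name__ == "__main__":
    legacy = ["allow tcp 10.0.0.5 -> 192.168.1.10:443"]
    # A hand-typed rule with a transposed digit in the port.
    manual = ["permit proto=tcp src=10.0.0.5 dst=192.168.1.10 port=434"]
    only_generated, only_manual = compare_configs(legacy, manual)
    print("missing from manual config:", only_generated)
    print("unexpected in manual config:", only_manual)
```

Any rule that appears on only one side is either a typo in the manual file or a bug in the converter, so the comparison checks both pieces of work at once.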
Then came the implementation. I won't go into details and make an already long story longer. But in the middle of a massive down-time, we discovered some fundamental mistakes in how the firewall was being deployed. We would have to rework the firewall configurations. We were already past the half-way mark, everyone was tired, and it seemed like we'd have to pull the plug and go back to legacy while we re-grouped and scheduled another major outage to try again at some future date. It wasn't Fun Happy Time. I pondered over the situation. I realized that if I made a few adjustments to the outputs of my configurations in between running the various stages of my conversion scripts, we'd have our new configuration adapted to the new reality. I ran my scripts. And despite the limited time and our fatigue, I was able to produce the config we needed to press forward. The deployment was a success.
In the end, there were two competing strateg
Re:Hyperviser by ifiwereasculptor · 2011-03-02 05:14 · Score: 2

No, the difference is that a CLI is nearly impossible to use if you aren't familiar with it - the semantics and syntax are as, if not more, important than the concepts - whereas a GUI requires much less focussing on the "how", allowing much more focussing on the "what".
Yes. The only problem is the "where" gets a little jumbled up every now and then, but that's the result of a sloppy implementation, not a flaw inherent to GUIs.
Re:Hyperviser by GooberToo · 2011-03-02 05:17 · Score: 2

In fact failover provides support for this sort of thing and is hardly a step away from proper administration.
This is actually a very good thing. Far too often people have automatic failover but never test it, and are shocked to find that it doesn't actually fail over; the fault went undetected because the failover was never exercised.
Re:Hyperviser by causality · 2011-03-02 05:19 · Score: 1

Ridiculously untrue, particularly in the context of non-specialised, non-expert users.
There is a difference between "easy to use" and "easy to learn".
For example, Linux is extremely easy to use -- if you understand it. Windows is a hell of a lot easier to learn but knowing all about it won't make it much easier to use.
Your comment there describes what is easy to learn.
The CLI appeals to people who are willing to learn, who like learning new things and consider it worthwhile. Once they achieve a level of understanding, the learning is then a one-time investment that continues to pay off into the future, in the form of a system that is easy to use, simple but not oversimplified, elegant, easily automated, that does what you tell it to do but nothing more and nothing less. This is why many Linux users who use a pretty, feature-packed GUI like KDE still keep a terminal window open that they frequently use. The terminal is for non-trivial tasks.
The average Windows user who views learning as an unreasonable burden that should never be expected of anyone who wants to use a complex machine ... they avoid the up-front investment of learning to understand the system. Instead, they can jump in and start using the system right now. But they continuously pay for it over time in the form of enjoying few or none of those advantages.
It's like the difference between people who live within their means and use plastic only as a form of payment, saving up until they can actually afford something before they purchase it, versus those who live all the time on credit. The person living on credit gets the stuff they want right now but ultimately pays quite a bit more for it and can quickly find themselves in over their head. The discipline and delayed gratification that the latter is trying so hard to avoid is something that the former considers to be virtues worth cultivating.

--
It is a miracle that curiosity survives formal education. - Einstein
Re:Hyperviser by jc42 · 2011-03-02 05:23 · Score: 5, Interesting

No, the difference is that a CLI is nearly impossible to use if you aren't familiar with it - the semantics and syntax are as, if not more, important than the concepts - whereas a GUI requires much less focussing on the "how", allowing much more focussing on the "what".
While there's a certain truth to this, GUIs are in general a lot less "intuitive" than people tend to believe. Without documentation and training, most users are unaware of most of their GUI's capabilities, and have great difficulty in learning much more than the basics.
An example I've read a number of warnings about in web-design documents is that a significant number (often estimated at around 50%) of "non-geek" users don't understand scroll bars. This is usually mentioned along with the advice to put the important part of your web pages close to the top, because the non-scrolling users won't be able to see anything below that.
Yes, I was dubious when I first read this. But over the years, I've run into several clear examples. I've been involved in building web sites for some very non-geeky organizations. The orgs' leaders generally want a lot of stuff on their main page, and at the top they usually want some text about the organization, its purposes, its main activities, etc. They also agree that it's good to have a list of upcoming public events on the main page, and inevitably that's positioned below the introductory text, so it's often not visible unless the user has a rather large window.
In each case, there were eventually meetings with discussions of how to improve the web site. One thing that would come up was suggestions from users (including members) that the home page should have a list of upcoming events. The leaders have always been dumfounded by this. "But, but, ... There is such a list on the home page." "What?? No, there isn't."
Eventually, I have to interrupt, and explain to the org's leaders that they're hearing from people who don't understand scrollbars, have never seen the events table because they don't scroll down to see it. The users are, of course, confused; they know that there's no such table because they've never seen it. We bring up the site on a handy machine (preferably a laptop or tablet with a small screen), and I show the users that it's there by scrolling down to it. Their response again is confusion, because they don't know what I did or how I did it. "Why's it hidden like that?"
So I teach them about scrollbars, and a few users have learned something useful. But this has a more important effect: It gets across to the leaders why their design was wrong, as I'd been telling them, and they'll have a better web site if they'll let me fix it.
One instance of this happened just last week. The org's web site now has that block of extensive history and purpose in a separate box at the bottom of the page, and the table of coming events is positioned near the top, just below the logo bar, where non-geek users will see it and be able to read at least the first few entries.
Examples like this abound in GUI design. Many of the common widgets are not at all intuitive to most people. Even if they accidentally poke at things and trigger the actions, it's often difficult to grasp what the effect was. You see things change, but the changes don't make sense, and have no obvious relation to the icon that you clicked on. Often the icons don't look like anything that most users can name. The result is that most of the GUI is unusable to most of the users.
I wish I knew good ways around this. But truly making a GUI obvious is very difficult, and takes a lot of time studying the users and learning about their misconceptions. I very rarely have the time to do this, and in many cases the people paying me have expressly forbidden wasting time with dumb users.
And that's something that's very difficult to program around. ;-)

--
Those who do study history are doomed to stand helplessly by while everyone else repeats it.
Re:Hyperviser by Anonymous Coward · 2011-03-02 05:25 · Score: 0

GUIs may have some aesthetic appeal (aka "pretty pictures"), but they remain the slowest, clumsiest way to use a computer that we've yet developed. But I trust that people are working on finding ways to make it even clumsier and slower. This seems to be happening with the "cloud" approach, for example.
Of course they're slow and clumsy. Something has to use up the cycles on the next generation of processors.
Re:Hyperviser by Anonymous Coward · 2011-03-02 05:26 · Score: 0

Clearly we need a slower, clumsier way so you don't feel too bad. We should work on the Pavlov operating system at once!
Re:Hyperviser by SuperQ · 2011-03-02 05:31 · Score: 1

I work on large cluster computing systems. I deal with this every day at work. One machine that's doing strange things like causing every third job to SIGSEGV is annoying and I take it out of production, wipe it, run it through memory and CPU tests and then put it back. Of course this work is not really something I have to think about, I just flag it and automation takes over.
My real job is when this machine comes back from testing still broken. I dig in and find out what is wrong: it could be CPU, memory, or some other random hardware defect on the mainboard. Once it's root caused and testable, the test can be added to the automation and I don't have to do anything but whack-a-mole for similar problems in the future.
And that's just one aspect of what's going on day to day.
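A rough sketch of the "root cause it once, then let automation catch it" loop described above. Everything here is an assumption for illustration: the `Triage` class, the check names and the machine names are hypothetical, not anyone's real tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A check takes a machine name and returns True when the machine passes.
Check = Callable[[str], bool]


@dataclass
class Triage:
    """Hypothetical triage harness; all names are illustrative."""
    checks: Dict[str, Check] = field(default_factory=dict)

    def add_check(self, name: str, check: Check) -> None:
        # Once a failure has been root caused, register a test for it so
        # the next occurrence is caught without a human digging in.
        self.checks[name] = check

    def failed_checks(self, machine: str) -> List[str]:
        # Run every registered check and collect the names that fail.
        return [name for name, check in self.checks.items() if not check(machine)]

    def handle(self, machine: str) -> str:
        failures = self.failed_checks(machine)
        if failures:
            return "repair: " + ", ".join(failures)
        return "return to production"
```

The human only gets involved when a flagged machine passes every known check and still misbehaves; whatever is found then becomes one more `add_check` call.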
Re:Hyperviser by Anonymous Coward · 2011-03-02 05:40 · Score: 0

Everyone goes through the process of learning how to use a new piece of software, and during that period of initial learning you are absolutely correct. A GUI provides a relatively easy way to discover what the software can do, and at the same time it provides the means to do it.
But here's the problem - we don't stay newbies forever.
I am experienced enough that for the vast majority of applications I use, I know exactly how to do the things I commonly do, and I know exactly how to discover/re-discover things that aren't so common. And in that context, CLI beats GUI hands down, for all the reasons posted by the GP.
I have no problem with a GUI interface for applications that are rarely used. What I do have a problem with is that for many applications intended for constant use *coughMicrosoftOfficecough* there is no expert/CLI mode to eliminate the hugely redundant pointing and clicking it takes to get anything done.
Re:Hyperviser by tbannist · 2011-03-02 05:54 · Score: 1

Well you could be talking about Brain in a Vat, A Bunch of Rocks or The Allegory of the Cave among many other similar thought experiments.

--
Fanatically anti-fanatical
Re:Hyperviser by drsmithy · 2011-03-02 06:03 · Score: 5, Insightful

For example, Linux is extremely easy to use -- if you understand it. Windows is a hell of a lot easier to learn but knowing all about it won't make it much easier to use.
That, is entirely a matter of opinion.

Your comment there describes what is easy to learn.
No, it doesn't. Your comment assumes that an interface should *have* to be learnt, to be easy to use.

The CLI appeals to people who are willing to learn, who like learning new things and consider it worthwhile.
No, the CLI appeals primarily to people who like to focus on memorising semantic minutiae and believe that doing so is, in and of itself, a productive endeavour.

The terminal is for non-trivial tasks.
The implication that GUIs are only used for "trivial" tasks is ridiculous on its face.

The average Windows user who views learning as an unreasonable burden that should never be expected of anyone who wants to use a complex machine ... they avoid the up-front investment of learning to understand the system. Instead, they can jump in and start using the system right now. But they continuously pay for it over time in the form of enjoying few or none of those advantages.
There is nothing unique to Windows, or even computers, about this. Do you know the intricacies of how your car works ? How about your blender or oven ? Could you fabricate a new bed or sofa from raw materials, and without modern tools ? Do you grow your own produce ? Could you butcher a cow or chicken ? Could you set a complex fracture or create your own painkillers ? Can you brew your own beer ?

It's like the difference between people who live within their means and use plastic only as a form of payment, saving up until they can actually afford something before they purchase it, versus those who live all the time on credit. The person living on credit gets the stuff they want right now but ultimately pays quite a bit more for it and can quickly find themselves in over their head. The discipline and delayed gratification that the latter is trying so hard to avoid is something that the former considers to be virtues worth cultivating.
No, it's nothing like that at all. One is an example of financial irresponsibility and the other is simply realising that you do not need a deep and intricate understanding of a given thing to use or take advantage of the services or benefits it provides.
Re:Hyperviser by EdIII · 2011-03-02 06:06 · Score: 1

What I think the summarizer isn't really grokking is that in this growing age of virtualization, the number of admins per server is going down a lot, and the focus of these admins has changed.
The summarizer does not understand the real world. It's arrogance and ego. I MUST understand the problem fully. I MUST be able to fix it.
This is what happens when you are young, stupid, and arrogant. I know this from experience too. "Suddenly somehow rational"?
In the real world you have to prioritize your time. It is finite, and if you're lucky, you get to balance it with a life. When I was young, I would spend 18 hours without a break "digging deeper" to find the problem. Now? Re-image the damn thing because I have a million other responsibilities today.
That is not clueless. It's understanding the bigger picture and the difference between an arrogant grunt (although highly skilled) and a manager and leader who is overseeing all those grunts.
Furthermore, when you do realize the bigger picture, you can start to design systems around those principles. That way you can re-image safely and keep production systems up and running because the individual parts are easily replaced, highly redundant, and load balanced.
If you have the time and resources you can always "dig deeper", but I would love for a bunch of highly skilled system administrators to tell me that they really have time for that.
P.S - The guys that came up with the idea and design for re-imaging probably did so because of large time constraints on their jobs too.
Re:Hyperviser by houstonbofh · 2011-03-02 06:15 · Score: 1

But you can get out... http://www.imdb.com/title/tt0139809/
Re:Hyperviser by drsmithy · 2011-03-02 06:27 · Score: 3, Interesting

While there's a certain truth to this, GUIs are in general a lot less "intuitive" than people tend to believe. Without documentation and training, most users are unaware of most of their GUI's capabilities, and have great difficulty in learning much more than the basics.

Sure, but the point is with a CLI and no understanding of its syntax and semantics, you're pretty much dead in the water from the get-go. You could have a deep understanding of networking, but if you're unfamiliar with the syntax of iptables, you're not going to be able to configure a Linux firewall.
Your scrollbar example is actually a good one, because it highlights the key differences between a GUI and a CLI. In a GUI, there is both a visual indicator that the content is larger than a single page, positive feedback from the UI element if the user tries to interact with it (ie: it reacts to a mouse click), and secondary feedback that the UI element is important even if it is triggered "accidentally" (ie: it moves if the user presses page down, space, or in some other way makes the page scroll).
In a CLI, you would simply be presented with a single page of text. Advancing to the next page would require knowing which key(s) to press to do so. If you don't know the key, you're screwed. Some CLIs may present a "press space to continue", or similar, message, but that's starting to blur the line between CLI and GUI, IMHO.
Further, the new knowledge those users have about the scrollbar is now applicable to pretty much any GUI they use in the future, even ones running on completely different OSes (I recognise this doesn't apply to all UI elements, but the fundamentals - buttons, menus, scrollbars, selection boxes, etc - are pretty consistently implemented in similar ways across the board). The knowledge they have gained about the CLI interaction is probably specific to that CLI only (how many different ways in different CLIs do you know of to trigger a page down ?).

Examples like this abound in GUI design. Many of the common widgets are not at all intuitive to most people. Even if they accidentally poke at things and trigger the actions, it's often difficult to grasp what the effect was. You see things change, but the changes don't make sense, and have no obvious relation to the icon that you clicked on. Often the icons don't look like anything that most users can name. The result is that most of the GUI is unusable to most of the users.

Sure, but the point is that there *ARE* things there to "poke at" and there is feedback that something actually happened. A CLI has neither - you need to know the commands in advance to do anything, and often the only feedback from a command is to indicate an error (and frequently said feedback is not useful at all in understanding what the error was).
Human cognition is highly dependent on visualisation, context and feedback. A CLI interface lacks - or typically has very minimal implementations of - all of those.
Re:Hyperviser by rgviza · 2011-03-02 06:32 · Score: 1

To me GUIs are more complicated. I can't ever find the function I need buried in those Ouija boards. They change so often it's impossible to keep up. The CLI, on the other hand, hasn't changed much in the 15 years I've been working on Unix/Linux.
I usually smack people's hands when I see them installing a GUI on a Linux server. 2+ GB of complete waste of time. Further, they increase the attack surface of whatever you install them on. GUIs (like Ouija boards) are bad juju for lots of reasons.
Server GUIs are an attempt to make administration simple. They often do more harm than good because they enable people who would otherwise have no idea how to break a machine to break important system settings with point and click. They provide the illusion of simplicity. This is dangerous, especially when the GUI is broken in some subtle way (which is often the case) that's not readily apparent to the person using it.
At the end of the day you can't escape the fact that even with a "simple" point and click GUI, you still need to know what you are doing, and if you know what you are doing, you don't need a GUI. If you are new to the game, sit down at a terminal and learn CLI. It will improve your life.
As well, if you have a problem and are talking to an old salt, he'll understand what you are trying to tell him if you describe what you do in the CLI. If you are explaining some visual thing you are pointing and clicking on in a GUI, there's a good chance that he won't be able to help you.

--
Don't kid yourself. It's the size of the regexp AND how you use it that counts.
Re:Hyperviser by Anonymous Coward · 2011-03-02 06:34 · Score: 0

I have met many a gray beard back in the day that had no idea what was going on in the big picture. They had been trained to type in a few commands based on certain requests.
Those people were not graybeards. They were junior sysadmins with delusions of grandeur.
Re:Hyperviser by drsmithy · 2011-03-02 06:41 · Score: 1

I have no problem with a GUI interface for applications that are rarely used. What I do have a problem with is that for many applications intended for constant use *coughMicrosoftOfficecough* there is no expert/CLI mode to eliminate the hugely redundant pointing and clicking it takes to get anything done.
Your example is ridiculous. Office, along with pretty much every other Microsoft product, has extensive keyboard accessibility, both in the form of direct keyboard shortcuts and via the menus.
That's before even getting into VBA...
Re:Hyperviser by Anonymous Coward · 2011-03-02 06:48 · Score: 0

Edit html in a plain text editor,
Edit html in a WYSIWYG editor. ...
Re:Hyperviser by dkf · 2011-03-02 06:51 · Score: 1

Sure, but the point is that there *ARE* things there to "poke at"
Sure, but a lot of users don't poke at things "in case they break something". To them, it's the magical mystery machine.
Doing a good user interface, whether a CLI or a GUI, is difficult. It requires thinking about what users actually do and how they actually think. The advantage that many CLIs have is merely that the people developing them are part of the target community of users. (That's a huge advantage!)

--
"Little does he know, but there is no 'I' in 'Idiot'!"
Re:Hyperviser by causality · 2011-03-02 06:58 · Score: 1

That, is entirely a matter of opinion.
It's difficult to beat the economy of expression and precision the command line offers to those who know how to use it. It's really difficult to beat the ease of automation.

No, it doesn't. Your comment assumes that an interface should *have* to be learnt, to be easy to use.
My comment acknowledges that there is often a trade-off to be made. It also accepts the reality that it is far easier for a person to adapt to the needs of a machine than it is for a machine to adapt to the needs of a person.

There is nothing unique to Windows, or even computers, about this. Do you know the intricacies of how your car works ? How about your blender or oven ? Could you fabricate a new bed or sofa from raw materials, and without modern tools ? Do you grow your own produce ? Could you butcher a cow or chicken ? Could you set a complex fracture or create your own painkillers ? Can you brew your own beer ?
In each of those cases I hire experts who are skilled in those trades. Most non-farmers don't operate farm equipment or raise livestock for food. Most people who are not mechanics wouldn't attempt to rebuild their car engines. Most people who lack medical training don't set broken bones or create pharmaceuticals. Those diverse trades you mention have one thing in common: people who are unskilled at them don't generally try to do them.
Most people who are not technicians do use computers. That's what makes it unlike those things. That's why your analogy there is irreparably flawed. Computing is one of the only areas where people routinely operate a highly complex piece of equipment they don't remotely understand and still expect everything to go smoothly. Predictably and unsurprisingly, they often experience problems. It's something of a miracle they don't have a lot more problems than they do.
What I call basic competence is not like having enough skill as a mechanic to rebuild your car's engine (that would be expertise, not merely competence). It's more like knowing how to drive, understanding what defensive driving is, and understanding that the vehicle needs periodic maintenance in order to remain road-worthy.
If someone who has never driven a car before and does not understand how to drive safely gets behind the wheel and causes an accident, no one finds that surprising. No one blames that on the car being too difficult to use. They understand that as a machine it is only doing what the driver told it to do. If someone who is equally unskilled and lacks basic competency operates a computer and has problems, we blame the computer. I think the only reason this faulty thinking is so widespread is simply that misusing a computer doesn't typically lead to serious injury or death, otherwise more people would feel the necessity of recognizing this mentality as the set of unrealistic expectations that it is.
All I'm asking for is a little consistency.

No, it's nothing like that at all. One is an example of financial irresponsibility and the other is simply realising that you do not need a deep and intricate understanding of a given thing to use or take advantage of the services or benefits it provides.
It is an imperfect analogy to be sure, though not a fatally flawed one. The reason? It's a comparison of one type of willingness to invest in good results with another.
I never claimed a deep and intricate understanding was necessary either. That'd be more like the mechanic who can rebuild an engine with confidence. You know what would be a drastic improvement? If average users took a little time to learn some best practices, even if all they did was to memorize them with no real understanding of why they are best practices. That'd be more like knowing how to drive safely.

--
It is a miracle that curiosity survives formal education. - Einstein
Re:Hyperviser by internettoughguy · 2011-03-02 07:00 · Score: 1

GUIs may have some aesthetic appeal (aka "pretty pictures"), but they remain the slowest, clumsiest way to use a computer that we've yet developed.
Ridiculously untrue, particularly in the context of non-specialised, non-expert users.

I think you missed the "system administration" part, I don't think anyone is suggesting that graphic design turtle necks should be CLI-ing their illustrations.
Re:Hyperviser by operagost · 2011-03-02 07:00 · Score: 1

Like this?

--

Gamingmuseum.com: Give your 3D accelerator a rest.
Re:Hyperviser by jedidiah · 2011-03-02 07:03 · Score: 1

The GUI for an inherently complicated task will still be complicated.
This has been true since the 80s as soon as GUI applications started to get byzantine.

--
A Pirate and a Puritan look the same on a balance sheet.
Re:Hyperviser by Sigma+7 · 2011-03-02 07:07 · Score: 1

One of the ongoing frustrations with every GUI is constantly seeing a new window pop up, which is positioned back at the root directories, and I have to laboriously poke at things to get down to the directory that I'm working in.
This depends on the GUI. For example, Windows XP and later are getting better at fixing it (e.g. remembering the last used directory), while pre-XP may place you in either the root, the current working directory for the app (e.g. the directory containing the executable), or somewhere else. However, there's still the worst offender, a dialogue box that requires you to manually pick a folder by browsing an extra-small window.

GUIs may have some aesthetic appeal (aka "pretty pictures"), but they remain the slowest, clumsiest way to use a computer that we've yet developed
I've seen worse. In particular, one CLI application treats running in Windows as a "dumb" terminal, simply because I haven't edited some configuration file that requires an excessive amount of hunting, and in spite of me already doing fancy ASCII graphics using the Windows API. End result is that some other CLI utility doesn't treat less as a viable pager and uses a weaker one that doesn't support paging backward.
There's also GUI apps that don't support cut-and-paste (i.e. don't react) for actions where it would be expected.
I'd say it's time to switch paradigms. Perhaps a hybrid GUCLI or CLGUI approach would be much better.
Re:Hyperviser by jedidiah · 2011-03-02 07:08 · Score: 1

> No, the CLI appeals primarily to people who like to focus on
> memorising semantic minutiae and believe that doing so is,
> in and of itself, a productive endeavour.
No. The CLI appeals primarily to people who know and understand exactly what they want.
One key feature of a good CLI is that I don't have to "memorize" anything. I can encapsulate a small bit of research (that I would have needed to do for a GUI anyways) into an alias or simple script that saves me time if I have to do anything more than once.
The CLI appeals to people that don't want to be bothered babysitting a pretty GUI and expect the computer to chug along by itself with minimal direction.
If you need to understand something, a pretty picture or some check boxes won't help you remain willfully ignorant.
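To make the "encapsulate a bit of research into an alias" point concrete, here is a minimal sketch. The `ALIASES` table and the `expand` helper are hypothetical names made up for this example; the only command used is plain `du -sk`.

```python
import shlex
from typing import Dict, List

# Hypothetical aliases: short names for command lines worked out once.
ALIASES: Dict[str, str] = {
    "usage": "du -sk {path}",
}


def expand(alias: str, **params: str) -> List[str]:
    """Turn a remembered short name into the full argument list."""
    template = ALIASES[alias]
    # Quote each parameter so paths with spaces survive the split below.
    quoted = {key: shlex.quote(value) for key, value in params.items()}
    return shlex.split(template.format(**quoted))
```

The resulting list could be handed to something like `subprocess.run`; the point is only that the lookup happens once and the short name is all you type afterwards.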

--
A Pirate and a Puritan look the same on a balance sheet.
Re:Hyperviser by tendrousbeastie · 2011-03-02 07:13 · Score: 1

Now come on old chap. It isn't really on to compare a good CLI with a bad GUI. Either compare the best of each or the worst of each.
Not all GUIs are "nested tree of windows and menus so deep and complex that nobody can ever remember where everything has been hidden, and there are no good tools to help you find something that you know is in there somewhere" ...and not all CLIs are "include tools for finding your way around. They also tend to make the defaults for the commands fit the most common cases, so you don't have to use the manuals all that often. And most tools have a -help option (though they can't quite agree on how to spell it), to provide quick reminders. And the CLI includes a current directory, search paths and aliasing, so you don't have to remember full paths to everything".
Re:Hyperviser by jedidiah · 2011-03-02 07:13 · Score: 1

> Sure, but the point is with a CLI and no understanding of its syntax
> and semantics, you're pretty much dead in the water from the get-go.
Not really. The real killer is a total lack of desire to explore or learn.
The exact nature of the interface is largely irrelevant.
If you aren't willing to explore, you won't get much out of ANY interface. Usually, the people that are willing to explore a GUI are the same ones that are willing to explore and use other interfaces.
It's not the tool. It's the user.

--
A Pirate and a Puritan look the same on a balance sheet.
Re:Hyperviser by causality · 2011-03-02 07:23 · Score: 1

The implication that GUIs are only used for "trivial" tasks is ridiculous on its face.
I forgot to respond to that point and I wanted to be comprehensive.
I made no such claim. I was describing what Linux users who are skilled with the command line yet enjoy featureful GUIs often do. When I said "the terminal is for non-trivial tasks" that is not the same thing as claiming "a GUI cannot perform a non-trivial task". The GUI can do that. The terminal is simply a better tool for the job in many cases. Saying that "A can perform C" is not the same thing as saying "Only A can perform C, therefore B cannot perform C". This is basic reasoning.
So, within the context of describing what experienced Linux users like to do, that is why they often retain a terminal or three even though they may be running a full-blown, feature-packed GUI that provides graphical methods to accomplish most of the same tasks.
That context thing is important. Quoting things out of context to portray them in the most unfavorable fashion possible may seem like an easy way to score points in a discussion or a shortcut to declaring the other guy wrong but the objections raised this way are trivial to invalidate.

--
It is a miracle that curiosity survives formal education. - Einstein
Re:Hyperviser by drsmithy · 2011-03-02 07:23 · Score: 1

I think you missed the "system administration" part, I don't think anyone is suggesting that graphic design turtle necks should be CLI-ing their illustrations.
Things like interconnections and dependencies between, say, virtualisation hosts, networking equipment and storage systems are _vastly_ easier to see and understand in graphical form.
Re:Hyperviser by alexborges · 2011-03-02 07:32 · Score: 1

Such a stupid point all around because not all servers are needed for the same things. Yes, if you want a reverse proxy cluster you can do one and multiply with VM software, but when the fucker goes down you still need to know which machine to iron out and how to configure it (even plain bare metal needs some after-tweaking or some serious storage).
Even then, there are boxes that are so large that fixing them and knowing how to do it is way, way easier and more time-productive than ironing from the net or the SAN.
So it depends on how serious your admin work is and what the fuck you're doing in your particular job.

--
NO SIG
Re:Hyperviser by alexborges · 2011-03-02 07:34 · Score: 1

BTW, I was not referring to YOUR point, which is in tune with what I've posted, but to the main point of this discussion.

--
NO SIG
Re:Hyperviser by drsmithy · 2011-03-02 07:44 · Score: 1

It's difficult to beat the economy of expression and precision the command line offers to those who know how to use it. It's really difficult to beat the ease of automation.
It's difficult to beat the interactiveness and feedback presented in a graphical interface, to say nothing of the density and depth of information available, or its ability to behave dynamically based on context.

My comment acknowledges that there is often a trade-off to be made. It also accepts the reality that it is far easier for a person to adapt to the needs of a machine than it is for a machine to adapt to the needs of a person.
Yet the whole point of having the machine in the first place is to free up the person from mundane and unproductive tasks so they can do something useful with their time. We should absolutely be trying to adapt interfaces to people, not vice versa.

Those diverse trades you mention have one thing in common: people who are unskilled at them don't generally try to do them.

Most people who are not technicians do use computers.

Most people who are not mechanics drive (some even for fun or competition). Most people who are not farmers eat. Most people who are not doctors need medical attention. That's exactly what makes using a computer just like all those other things - you do not need to know how a steak gets from a cow to your supermarket to eat it.

If someone who has never driven a car before and does not understand how to drive safely gets behind the wheel and causes an accident, no one finds that surprising. No one blames that on the car being too difficult to use. They understand that as a machine it is only doing what the driver told it to do. If someone who is equally unskilled and lacks basic competency operates a computer and has problems, we blame the computer.
No, the computer is blamed when people who do have basic competency have problems. Similarly with things like cars, planes, and other devices that are designed to remove certain aspects of required expertise.
I'm not aware of anyone who would blame "the computer" if someone who had literally never touched one sat down and didn't know how to use it. I do know lots of people blame "the computer" when they have learnt the basics but they run into problems because the interface is poorly designed, inconsistent, or simply nonexistent.
Re:Hyperviser by drsmithy · 2011-03-02 07:48 · Score: 1

Sure, but a lot of users don't poke at things "in case they break something". To them, it's the magical mystery machine.
That's not really relevant.

Doing a good user interface, whether a CLI or a GUI, is difficult. It requires thinking about what users actually do and how they actually think. The advantage that many CLIs have is merely that the people developing them are part of the target community of users. (That's a huge advantage!)
The advantage that many CLIs have is that the people using them often take *pride* in how difficult they are to use, and boast about their mastery of various arcane and complicated uses of same.
Re:Hyperviser by drsmithy · 2011-03-02 08:14 · Score: 1

That context thing is important.
Indeed, which is why in the context of a whole bunch of other comments clearly suggesting that CLIs are awesome and GUIs suck, the implication that GUIs are only used for trivial tasks (and its corollary - only the inexperienced and ignorant who have trivial tasks to do, use GUIs) was pretty clear.
If you had said "the terminal is for tasks best done in a CLI", then I wouldn't have even noticed.
Re:Hyperviser by icebraining · 2011-03-02 08:26 · Score: 1

Yeah, but the tool shouldn't display them. It should output a report, and a second tool should read it and generate an image file from it, which can then be displayed by a proper image viewer.
Bootchart is a good example.
Turning the system administration tool into a GUI for graph visualization is ruining its potential for extensibility and automation for no good reason.
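A minimal sketch of that "second tool", under a made-up assumption about the report: one dependency per line, written as two whitespace-separated names, with `#` comments allowed. The function name and the report format are hypothetical; the output is ordinary Graphviz DOT text.

```python
def report_to_dot(report: str) -> str:
    """Convert a plain-text dependency report into Graphviz DOT source.

    Assumed report format (hypothetical): each non-blank, non-comment
    line holds exactly two names, "source target".
    """
    lines = ["digraph dependencies {"]
    for raw in report.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Raises ValueError if a line does not have exactly two fields.
        source, target = line.split()
        lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)
```

The DOT text can then go to a renderer such as Graphviz's `dot -Tpng`, and the picture opens in whatever image viewer you like, so the admin tool itself never needs to draw anything.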

--
Dilbert RSS feed
Re:Hyperviser by causality · 2011-03-02 08:48 · Score: 1

That context thing is important.
Indeed, which is why in the context of a whole bunch of other comments clearly suggesting that CLIs are awesome and GUIs suck, the implication that GUIs are only used for trivial tasks (and its corollary - only the inexperienced and ignorant who have trivial tasks to do, use GUIs) was pretty clear.
If you had said "the terminal is for tasks best done in a CLI", then I wouldn't have even noticed.
I can see that settling this matter less directly didn't work. One step deeper it is, then. You said "the implication that GUIs are only used for "trivial" tasks is ridiculous on its face."
That means you had a choice. When a statement is "ridiculous on its face" there are two potential reasons. You could assume that I'm careless/stupid/ridiculous/insert-negative-adjective-here. Or you could assume that if you feel that way, you must not have correctly understood what I was saying.
You "wouldn't have noticed" if I had more explicitly spelled it out instead of relying on your ability to interpret something within the given context. That's because if I had done that, it would have left you no room to make an assumption. So I left you a bit of wiggle room there. What did you do with that? Did you at least ask me if that's really what I meant? No, you instantly attribute to me a statement I did not make even though it flies in the face of basic reasoning (a claim that such a user would have terminals for non-trivial tasks is not a claim that a GUI cannot perform a non-trivial task).
You lack the grace and the willingness to extend benefit of doubt, or failing all of that, the awareness that such a glaring omission would be inconsistent with the way I articulated everything else I said, to assume that such a trivial objection indicates you have misunderstood me. That's because in your mind you disagree with my position, therefore I must be wrong. It follows that everything I say must be interpreted in the way that most efficiently reflects on my wrongness, since you've already decided that, even if you must ignore both context and basic reasoning in order to do it.
People have egos, in other words. Egos are quite ingenious at worming their way into a discussion while appearing superficially reasonable. They just don't stand up to critical examination, for that acknowledges the importance of things like context and sound reasoning. A lot of people have a very deep-seated need to feel right, only it's not enough that they feel right, someone else must also be wrong. That's really all this was. Because it's a mostly unconscious process that you didn't deliberately plan on, you may be tempted to think that just because you didn't intend it then you could not have done it.
What you're doing there is something I call playing the hostile audience. It's not about really understanding what I believe and where I am coming from so that you can better explain why you have a different view. Instead, it's about "anything you say can and will be used against you." There's only one thing it really accomplishes: much hair-splitting and much discussion that doesn't elucidate anyone's view or further anyone's understanding of the actual subject. But hey, at least for a brief time between the moment you wrote your "objection" and now, you got to feel like you made an easy slam-dunk and caught me committing something rather stupid in writing. That's what matters, right?

--
It is a miracle that curiosity survives formal education. - Einstein
Re:Hyperviser by snuf23 · 2011-03-02 08:49 · Score: 1

The cloud gui I use has version control, history and version diffing for all scripts. If for whatever reason I don't want to use the gui anymore I can just pull the scripts and roll it the old way.

--
Sometimes my arms bend back.
Re:Hyperviser by NotSanguine · 2011-03-02 09:13 · Score: 1

As an administrator who has been implementing and managing Windows and Unix/Linux systems since 1991, I have to disagree with drsmithy on this.

"No, the CLI appeals primarily to people who like to focus on memorising semantic minutiae and believe that doing so is, in and of itself, a productive endeavour."
I don't memorize (yes I'm an American) semantic minutiae for the CLI.
I use the commands the way they were designed -- with built in help (gee, they have that with GUIs too don't they?) and the program documentation.
The same goes for GUIs. I'm not concerned with knowing everything -- I'm concerned with knowing how to find out what I need to know in a timely fashion. That goes for CLI and GUI.
In the Windows environment:
Using the GUI Active Directory tools is much simpler for one-off tasks.
Scripting with ADSI, Perl/Net::LDAP or command line tools is much simpler for automating repetitive tasks.
In the Unix/Linux environment GUI tools are available, but are used less (at least by experienced admins) because they generally hide the more sophisticated functionality of the CLI.
That said, both GUI and CLI are useful and can be worthwhile. This whole GUI vs CLI thing is just dumb.
I'll stop now. Real admins have *real* work to do.
I still don't understand why people keep posting Paul Venezia's crappy articles.

--
No, no, you're not thinking; you're just being logical. --Niels Bohr
Re:Hyperviser by Rufty · 2011-03-02 09:21 · Score: 1

Yet the whole point of having the machine in the first place is to free up the person from mundane and unproductive tasks so they can do something useful with their time.

And sometimes the way to do that is the command line. I had to move some files from an old mac to a server. It had a graphical ftp client on it.
Click "File"
Click "Open"
Click "Desktop"
Click "workfolder"
Scroll to filename
Have another go to see if multiple file select might just work now
Double-click filename
Click "upload button"
Click "browse network"
Double-click server
Click "OK"
Repeat for remaining ~1800 files.
Or for a cli ftp client: "mput *"
And sometimes the way to do that is the gui. I had 3 pdf files to combine into 1 (text, color figures, appendix from a different author using different software). Spent ages with pdftk. Then tried preview.app. Drag and drop the pages into place.
I'm still waiting for a one-size-for-all toolset!
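
For what it's worth, the bulk-upload half of that story can be sketched in a few lines of script. This is only a rough illustration using Python's standard `ftplib`; the function name `upload_all`, the host, the credentials, and the folder name are all made up for the example and are not anything from the original setup.

```python
from ftplib import FTP  # used in the usage note at the bottom
from pathlib import Path


def upload_all(ftp, folder, pattern="*"):
    """Upload every regular file in `folder` matching `pattern`; return the names sent."""
    sent = []
    for path in sorted(Path(folder).glob(pattern)):
        if not path.is_file():
            continue  # this sketch only sends plain files, not sub-directories
        with path.open("rb") as fh:
            # STOR is the standard FTP upload command; storbinary streams the file.
            ftp.storbinary(f"STOR {path.name}", fh)
        sent.append(path.name)
    return sent


# Usage (hypothetical host and credentials, replace before running):
#
#     with FTP("ftp.example.com") as ftp:
#         ftp.login("user", "password")
#         upload_all(ftp, "workfolder")
```

The loop is the whole point: once the action is text, "repeat for the remaining ~1800 files" costs nothing extra.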

--
Red to red, black to black. Switch it on, but stand well back.
Re:Hyperviser by h4rm0ny · 2011-03-02 09:27 · Score: 1

If the virtualization is perfect AND hidden by design, you can't test it
Ah, but you presuppose the possibility of perfect virtualisation, which may not be so.

--

Aide-toi, le Ciel t'aidera - Jeanne D'Arc.
Re:Hyperviser by h4rm0ny · 2011-03-02 09:31 · Score: 1

The holographic universe idea is one of my current favourites. But one thing that occurred to me a while back was that if you were creating a simulation, you wouldn't necessarily want to track every part of it, and indeed, maybe couldn't. Instead, you would resolve things as needed, much like a POV swinging around in a computer game: you only render what is appearing on the monitor. Or to put it another way, you only collapse the probabilities when they are observed. Sound familiar from anywhere?

--

Aide-toi, le Ciel t'aidera - Jeanne D'Arc.
Re:Hyperviser by causality · 2011-03-02 09:34 · Score: 1

It's difficult to beat the economy of expression and precision the command line offers to those who know how to use it. It's really difficult to beat the ease of automation.
It's difficult to beat the interactiveness and feedback presented in a graphical interface, to say nothing of the density and depth of information available, or its ability to behave dynamically based on context.

My comment acknowledges that there is often a trade-off to be made. It also accepts the reality that it is far easier for a person to adapt to the needs of a machine than it is for a machine to adapt to the needs of a person.
Yet the whole point of having the machine in the first place is to free up the person from mundane and unproductive tasks so they can do something useful with their time. We should absolutely be trying to adapt interfaces to people, not vice versa.

Those diverse trades you mention have one thing in common: people who are unskilled at them don't generally try to do them.

Most people who are not technicians do use computers.
Most people who are not mechanics drive (some even for fun or competition). Most people who are not farmers eat. Most people who are not doctors need medical attention. That's exactly what makes using a computer just like all those other things - you do not need to know how a steak gets from a cow to your supermarket to eat it.

If someone who has never driven a car before and does not understand how to drive safely gets behind the wheel and causes an accident, no one finds that surprising. No one blames that on the car being too difficult to use. They understand that as a machine it is only doing what the driver told it to do. If someone who is equally unskilled and lacks basic competency operates a computer and has problems, we blame the computer.
No, the computer is blamed when people who do have basic competency have problems. Similarly with things like cars, planes, and other devices that are designed to remove certain aspects of required expertise.
I'm not aware of anyone who would blame "the computer" if someone who had literally never touched one sat down and didn't know how to use it. I do know lots of people blame "the computer" when they have learnt the basics but they run into problems because the interface is poorly designed, inconsistent, or simply non-existent.
I appreciate that you have some solid reasons for seeing this the way that you do. I still view this differently but it is most interesting to gain some understanding, however imperfect, of another valid perspective.
It's possible that our differing views on this can be reconciled. I'll offer up a possible way to do that.
I think the argument could be made that computing for the masses is not yet a mature technology. We haven't had desktop computers for a hundred years like we have had cars for a hundred years. The early cars produced by Henry Ford were not nearly so "ready for prime time", not nearly so reliable, not nearly so easy to use as it is to drive a modern car today. You just about had to be your own mechanic back then. Even starting the engine, which was done with a hand-turned crank, was potentially dangerous to the operator and certainly more difficult than turning a key to activate an electric starter motor.
To really use a computer effectively, not so much in terms of getting work done but rather in terms of avoiding foreseeable problems, in terms of not succumbing to malware infections and other threats, you just about have to be your own technician now. Those who are not often have frustrations and infections. That is becoming less and less true as time passes, but it is not yet a distant memory either. The main advantage corporations and other organizations have is that they employ dedicated IT staff who are expect

--
It is a miracle that curiosity survives formal education. - Einstein
Re:Hyperviser by starfishsystems · 2011-03-02 09:44 · Score: 1

Much more important, in my experience, is that a CLI provides essential modularity and composability that a GUI does not.

That's a win for documentation, because a precise example in the use of a CLI can be easily transcribed. Almost equivalently, it can be wrapped in a script for accurate reuse, or composed as part of a complete automated solution. You can build regression tests into it along the way.

You can't do any of that with a GUI. All you can do is train people to click and click over again, never quite the same way twice, telling them to watch out for this or that indicator on the screen, which may or may not be visible in the scrolling window at any particular time, all in the name of repeatability. It's absolutely perverse. And very few people who have exclusively used GUIs have any interest in learning that there could be a better way.

I get that, to them, clicking on a GUI makes them feel useful and involved. They would have done well in the middle ages, copying books by hand.
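
To make the composability point concrete, here is a minimal sketch in Python. It chains small commands the way a shell pipe would and then checks the result, so the documented procedure doubles as its own regression test. The helper `pipeline` and the two stand-in filters are invented for this example; they are written as Python one-liners only so the sketch does not depend on which Unix tools are installed.

```python
import subprocess
import sys


def pipeline(stages, data):
    """Feed `data` through each command in turn, like `a | b | c` in a shell."""
    for argv in stages:
        result = subprocess.run(
            argv, input=data, capture_output=True, text=True, check=True
        )
        data = result.stdout
    return data


# Two tiny stand-in filters (hypothetical; any real CLI tools would do).
UPPER = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
SORT = [sys.executable, "-c", "import sys; sys.stdout.writelines(sorted(sys.stdin))"]


def regression_test():
    """The same transcript that documents the procedure also checks it."""
    assert pipeline([UPPER, SORT], "pear\napple\n") == "APPLE\nPEAR\n"
```

Each stage knows nothing about the others, which is what lets you reorder, reuse, or test them separately.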

--
Parity: What to do when the weekend comes.
Re:Hyperviser by Requiem18th · 2011-03-02 10:00 · Score: 1

My dream GUI would be for all windows to have a (collapsible) "command window" printing a history of commands after every GUI event, commands that would be reproduced later by copy pasting.
Think like selecting a snippet in a word processor and choosing the option "Bold" from the "Format" menu, a command would be printed at the bottom like:
>>> selection = mouse.select(x=200, y=300)
>>> selection.format.bold()
This of course would be optional and off by default.
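
A toy version of that idea, just to show it is not far-fetched: every GUI action goes through a decorator that appends a replayable line of text to a history list. Everything here (`recorded`, `select`, `bold`, the `document` dict) is hypothetical and stands in for a real toolkit's event handlers.

```python
import functools

history = []  # the "command window": one replayable line per GUI action


def recorded(func):
    """Log each call as a line of source text that could be pasted back in."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        parts = [repr(a) for a in args] + [f"{k}={v!r}" for k, v in kwargs.items()]
        history.append(f"{func.__name__}({', '.join(parts)})")
        return func(*args, **kwargs)
    return wrapper


document = {"selection": None, "bold": []}  # toy stand-in for editor state


@recorded
def select(x, y):
    document["selection"] = (x, y)


@recorded
def bold():
    document["bold"].append(document["selection"])


def replay(lines):
    """Re-run recorded lines against the unlogged functions (no double logging)."""
    env = {"select": select.__wrapped__, "bold": bold.__wrapped__}
    for line in list(lines):
        eval(line, env)


# Simulate the two GUI events from the example: drag-select, then Format > Bold.
select(x=200, y=300)
bold()
print("\n".join(history))
```

Pasting the printed lines back (or calling `replay(history)`) repeats the same edits, which is the copy-paste reproducibility described above.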

--
But... the future refused to change.
Re:Hyperviser by jefe7777 · 2011-03-02 10:43 · Score: 0

you're wrong. mainstream educated, left brained sheeple, flock to windows exactly because they think linearly, they don't see the big picture, and they can't abstract in their head. same damn people who can't imagine the file system as a tree. they need forward and backward buttons. it's the CLI people who are right brained, see the big picture, can think visually and abstractly, are able to NOT memorize minutiae for precisely those reasons. visual thinkers don't need crudely drawn GUIs. it's the overly left brained with stunted imaginations that need the clickity click click bobby anti-septic shock go babooom!!! sad thing is, once you get good at the command line, you tend to understand much more closely how programmers think, then if we NEED to use a gui, we usually use them better than the gui people.
Re:Hyperviser by turgid · 2011-03-02 10:51 · Score: 1

No, the difference is that a CLI is nearly impossible to use if you aren't familiar with it
Have you ever used VMS? I used it once or twice to edit, compile and run a FORTRAN program in 1994. In c1999/2000 I was presented with a Micro VAX and asked "can you find out what the IP address of that system is?"
All I could remember was "DIR" and "HELP." After typing HELP and going through the topics presented, I found the command that gave me the IP address.
No GUI required, virtually no prior knowledge required.
If a CLI responds to obvious commands in the user's native language (like "help" for example) and if it's done right, it shouldn't be that difficult.
Ridiculously untrue, particularly in the context of non-specialised, non-expert users.
I shall leave you to ponder the wisdom of Master Foo on this very subject.

--
Stick Men
Re:Hyperviser by Hatta · 2011-03-02 11:23 · Score: 1

I very rarely have the time to do this, and in many cases the people paying me have expressly forbidden wasting time with dumb users.
Good for them! If someone is too stupid to use a computer that is their problem, the rest of us should not have to deal with it.

--
Give me Classic Slashdot or give me death!
Re:Hyperviser by drsmithy · 2011-03-02 11:26 · Score: 1

If a CLI responds to obvious commands in the user's native language (like "help" for example) and if it's done right, it shouldn't be that difficult.

That's a mighty big "if", and the problem is in many CLIs, the answer is "it doesn't".
Re:Hyperviser by aczisny · 2011-03-02 11:31 · Score: 1

Show me the equivalent of that for any GUI too.
Actually, Windows 7 has a feature called "Problem Steps Recorder" that essentially does that. It takes a screenshot on every click to show people exactly how you're using the GUI, and as the name implies it was meant to record the steps leading to a bug/error. It will let you record anything you do on your computer, though, so it can also be used (and is, at the company I work at) to make training material on how to use a particular piece of software. Microsoft has a page with more about it here. I've done the equivalent before by taking screenshots myself to post as a cheatsheet on training wikis; PSR just makes it incredibly easy.

--
Now, landing thrusters.. landing thrusters, hmm. Now if I were a landing thruster, which one of these would I be?
Re:Hyperviser by pyrr · 2011-03-02 11:36 · Score: 1

Sure, but the point is with a CLI and no understanding of its syntax and semantics, you're pretty much dead in the water from the get-go. You could have a deep understanding of networking, but if you're unfamiliar with the syntax of iptables, you're not going to be able to configure a Linux firewall.
Well, there is such a thing as man iptables. Further, an app that would GUI-tize the full capabilities of this CLI utility would probably be more difficult to use. A user who has an advanced knowledge of networking and wants to do something complex and elegant with the firewall is probably going to have a lot more success reading the man pages and using iptables than trying to muddle through the limited choices a GUI generally offers.
As I see it, the primary advantage to a GUI is mostly when it comes to viewing the configuration, since it's WAY easier to lay-out the gestalt of complex data through a GUI, where a CLI has more difficulty formatting the data. The next advantage to a GUI, which is not applicable to an expert user, is to provide a substantially dumbed-down interface that would allow a novice to configure a halfway-functional firewall without being burdened with "difficult" decisions...as with Windows Firewall. You know, situations where an adequately-configured, simple solution is more desirable than a poorly-configured, complex solution. That's simply because a good GUI will generally have some level of confirmation and error-checking, and will also limit the user's ability to make wrong choices.
Re:Hyperviser by laddiebuck · 2011-03-02 14:04 · Score: 1

No, the CLI appeals primarily to people who like to focus on memorising semantic minutiae and believe that doing so is, in and of itself, a productive endeavour.
BS. Have you ever compared people who use vim or emacs with people who use TextMate or notepad++ or Eclipse? People don't learn the former two for the hell of it, they learn them because it makes them faster and less stressed.
You find people who learn to manually configure their Gentoo driver settings and build flags and whatnot. But personally I've seen more people who trick out their Windows desktop to look like a Mac, replete with window decorations and docks, or for that matter, a hell of a lot of people who decorate their monitors with little dolls and paper their offices with printouts of cartoons or trite sayings. People waste their time and it has nothing to do with CLI vs GUI.
Fundamentally, given a GUI and CLI that are equally good at a specific task, the CLI is likely to be more automatable and is less likely to get in the way of an expert user: partly due to the inherently different design and partly due to the different philosophy of the people who write each kind of tool.
Re:Hyperviser by TheLink · 2011-03-02 14:05 · Score: 1

Uh, go read Godel's theorem again. It says nothing about the impossibility of perfect virtualization.
And the "unable to prove" situations are all from the perspective of being _within_ the system. Not outside.
http://en.wikipedia.org/wiki/Godel's_incompleteness_theorems

The first incompleteness theorem states that no consistent system of axioms whose theorems can be listed by an "effective procedure" (essentially, a computer program) is capable of proving all facts about the natural numbers. For any such system, there will always be statements about the natural numbers that are true, but that are unprovable within the system.

The second incompleteness theorem shows that if such a system is also capable of proving certain basic facts about the natural numbers, then one particular arithmetic truth the system cannot prove is the consistency of the system itself.
--
- Too many replies beneath your current threshold
Re:Hyperviser by Anonymous Coward · 2011-03-02 15:11 · Score: 0

Do you know the intricacies of how your car works ? How about your blender or oven ? Could you fabricate a new bed or sofa from raw materials, and without modern tools ? Do you grow your own produce ? Could you butcher a cow or chicken ? Could you set a complex fracture or create your own painkillers ? Can you brew your own beer ?
1) Yes, ASE certified and have 3 race vehicles I built ground up.
2) Cmon dude, those are the simplest appliances in the kitchen, you could have at least said, microwave and refrigerator. But then any true geek should know how those work.
3) Pretty easily. But what do you call modern? Do I at least get to use bronze age tools? Are you telling me, you'll give me fresh hardwood and leather and down to make a couch? Where do I sign up? Appropriation of materials is more of a time consuming affair now than making something. (At least good quality materials).
4) Yes, I have had some form of a square foot garden since I was 9. Back then it was actually huge since we lived on 60 ac of land. Now, living in the desert, it's more of a challenge, but I learned from a master about what grows best and I still have 120sqft of home grown produce.
5) Omg, no... lol... that's where I draw the line and turn vegetarian. Fishing is about it. Slaying of 4-legged creatures, turns me into a capitalistic wimp... where's the Walmart?
6) Yes, had to... hiking. Really? Painkillers is all u got? We've been making painkillers from natural plants for thousands of years. Depending on the part of the world you are in... you may be luckier than others. Try antibiotics. That's where living off the land fails. That's why lifespan was in the 30's before being able to treat simple infections.
7) Yep... beer is actually kinda easy. How bout making your own diesel? And I learned to grow mushrooms too, thanks to my dad.
I hope most everyone here could accomplish half of those at least.
Maybe since I grew up in the spectre of what could have been "red insurgency" and cold war scare tactics of our government, and since my dad fought in the 3 big wars between 40 and 70... I learned everything that was necessary to be self-sufficient. If we were to have a social meltdown, I would be the crazy bad-ass Mad Max mofo runnin around knowin what to do and how to do it.
Was there a point to this part of your post?
-@|
Re:Hyperviser by crafty.munchkin · 2011-03-02 15:27 · Score: 1

(Top posting because I'm a douchebag) In my experience, I've always wanted to work out the underlying reason that it's failed in the first place - but often it's not practical. Hear me out. When you have a senior management douchebag (who out-douchebags you by a factor of n^32) on your case to get this fixed and back up and running yesterday (who also wants status updates every 30 minutes as to why it's not fixed), and your arse (or ass if you're in the US) that's looking at getting fired because of the downtime, do you really want to get fired so you can understand why this server failed? Or do you put in place a mechanism that gets the business running again in 10-15 minutes? My bank manager won't understand that I got fired and can't meet my mortgage repayments because I wanted to know why this system had failed.

--
... wait, what?
Re:Hyperviser by mjwx · 2011-03-02 16:26 · Score: 1

Hear me out. When you have a senior management douchebag (who out-douchebags you by a factor of n^32) on your case to get this fixed and back up and running yesterday
This is why sysadmins will never disappear.

Pointy Haired Douchebag 1: Yay, now that we've got the new Cyberdyne Systems 6000 series automated server room set up and fired all the sysadmins all is good.
Pointy Haired Douchebag 2: Someone still needs to change the tapes and check the logs.
Pointy Haired Douchebag 1: I bill $795 an hour, I'm far too important.
Pointy Haired Douchebag 2: I bill $995 an hour, I'm even more important.
Pointy Haired Douchebag 1: /blinks.
Pointy Haired Douchebag 2: I'll just put out an ad for a sysadmin.

Sysadmins will be required as long as computer systems are too complex for the dumbest and laziest of its operators.

--
Calling someone a "hater" only means you can not rationally rebut their argument.
Re:Hyperviser by drsmithy · 2011-03-02 16:53 · Score: 1

You "wouldn't have noticed" if I had more explicitly spelled it out instead of relying on your ability to interpret something within the given context.
No, I "wouldn't have noticed" if you'd said something _different_.

That's because if I had done that, it would have left you no room to make an assumption. So I left you a bit of wiggle room there. What did you do with that? Did you at least ask me if that's really what I meant? No, you instantly attribute to me a statement I did not make even though it flies in the face of basic reasoning (a claim that such a user would have terminals for non-trivial tasks is not a claim that a GUI cannot perform a non-trivial task).
It most certainly, however, is a strong implication of that. Particularly in the context of the post it was in.

People have egos, in other words. Egos are quite ingenious at worming their way into a discussion while appearing superficially reasonable.
Ain't that the truth.
Re:Hyperviser by drsmithy · 2011-03-02 17:07 · Score: 1

Well, there is such a thing as man iptables.
That's kind of missing the point.

Further, an app that would GUI-tize the full capabilities of this CLI utility would probably be more difficult to use.
Why ? There is no inherent reason it must be.

That's simply because a good GUI will generally have some level of confirmation and error-checking, and will also limit the user's ability to make wrong choices.
Why do you think error and sanity checking is not an equally useful feature for both advanced and ignorant users ?
Re:Hyperviser by Skal+Tura · 2011-03-03 00:21 · Score: 1

Exactly.
And due to the ease of use of the VM systems there are more and more "sysadmins" who have never dug deep into the internals, or are not sysadmins in the traditional sense, i.e. the barrier has become lower.
Pretty much the same thing as with PHP: we have tons of "web developers" who are pretty much clueless. The entry barrier is so low, and they think of themselves as capable while having no idea of basic internals, good practices, etc.

--
Pulsed Media Seedboxes
Re:Hyperviser by __aamnbm3774 · 2011-03-03 02:29 · Score: 1

I like your point. And if I had unlimited time and resources, this would definitely be the preferred route.
But it is not practical in my environment.
Sometimes, I have to put the idealism aside and focus on getting that server up as quick as possible.
Re:Hyperviser by wkcole · 2011-03-03 03:41 · Score: 1

> with a CLI ... it's very easy to document it for next time.
Indeed - just run "script" before starting typing.
Show me the equivalent of that for any GUI too.
Limited scope answer...
Purely for documentation, i.e. making visual recordings showing how to do something, there are multiple tools for capturing a video from the screen of doing something in a GUI. I use Snapz Pro X when I need that.

And once you've cleaned up your document (changing 'vi filename' to 'sed .... filename') you can usually get to the point where you can just run your documentation with /bin/sh the next time you need it.
Indeed, logical recordings are much more than documentation. A CLI is much better for making a replayable logical recording of user activity than a GUI because the CLI has a much less complex universe of actions. For example: on my ornately configured personal system I have 2390 executables in my $PATH, but the main display on the same machine has 3686400 logical cursor locations on each of 6 desktops. If the basic action of a CLI is a command and the basic action of a GUI is a mousedown, the GUI starts out with 3 orders of magnitude more possibilities to track at that gross level. The CLI can't lose the contest of what is easier to record in a reproducible way.
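
The "3 orders of magnitude" figure is easy to check from the numbers quoted above. A quick sketch (the variable names are mine; the three input numbers are the ones given in the post):

```python
import math

executables = 2390          # commands on $PATH, from the post
cursor_locations = 3686400  # logical cursor positions per desktop, from the post
desktops = 6

# Ratio of possible GUI "basic actions" (mousedown positions) to CLI commands.
per_desktop = cursor_locations / executables
all_desktops = cursor_locations * desktops / executables

print(f"one desktop:  {per_desktop:,.0f}x  (10^{math.log10(per_desktop):.1f})")
print(f"six desktops: {all_desktops:,.0f}x  (10^{math.log10(all_desktops):.1f})")
```

One desktop alone gives a ratio of roughly 1,500, which is where the three orders of magnitude come from; counting all six desktops pushes it to roughly 9,000, close to four.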
That said, MacOS has had recordability to replayable scripts as a feature for a long time (even before OS X,) but it works best with application support and many apps don't bother. That means that it is not a universally available tool, although admins do use it (and the related Automator/AppleScript/OSA plumbing) to do the same sort of integration/automation work that Unix sysadmins are used to doing with shell scripts. It's a bit like scripting in csh: there's lots of intrinsic breakage but a lot of admins manage to work around it.
Re:Hyperviser by urusan · 2011-03-03 04:51 · Score: 1

No, the CLI appeals primarily to people who like to focus on memorising semantic minutiae and believe that doing so is, in and of itself, a productive endeavour.
Um, the semantic minutiae is an unfortunate obstacle that anyone who needs to use a CLI needs to overcome to benefit from it, not the point of the thing. Most CLI users memorize those things once and that's it. It's much like learning the semantic minutiae of calculus so that one may acquire a useful tool.
I use both GUI and CLI interfaces everyday. I tend to use GUIs more because most of the tools I use all the time are GUI-based and they're easy to use for human-centric tasks (web browsing, playing video, e-mail, document writing, etc.).
However, CLIs have several advantages that GUIs can't match (at least without mimicking or indirectly using CLIs). They're extremely easy to program for as the user interface is simple and extremely consistent...and most programming languages are set up to program for CLIs by default. Automation is much easier as you can naturally write scripts that use other programs, whereas a GUI-only interface would require a human to walk through those steps (or a program to blindly click through, making those nice pictures useless and error checking problematic). CLI commands can be very concise and powerful, saving time for the user who knows what they're doing compared to using a GUI. GUI programs have to be heavyweight with virtually every automation task included as a pre-programmed feature (and of course the developers won't be able to fit them all in so some users will lose out), whereas individual CLI programs can be small and do one relatively simple thing as they can be chained together. etc.
That said, CLIs are most advantageous to people who can program. If one can't program then their ability to automate via scripts and write simple but useful programs doesn't exist anyway, so the advantages of GUIs (low learning curve, beauty, human-centric design, etc.) far outweigh the seemingly esoteric advantages of CLIs. Plus, even programmers can benefit from well designed GUI programs, so it's not like it's just for the "masses".
However, CLIs do still have advantages for the general public so they should not be knocked too hard. When programmers and other advanced users can increase their own productivity or even put together something that can benefit all users that know enough to type a simple command or install a plugin, then we all benefit. Learning the CLI is also often a first step toward becoming a programmer, and we need more of them. Therefore, we should maintain CLI functionality alongside GUI functionality as we move forward.
Re:Hyperviser by drsmithy · 2011-03-03 05:09 · Score: 1

Um, the semantic minutiae is an unfortunate obstacle that anyone who needs to use a CLI needs to overcome to benefit from it, not the point of the thing.
My point is that CLIs have a significant appeal to those kinds of people, even when a GUI would serve them just as well, if not better.

However, CLIs have several advantages that GUIs can't match (at least without mimicking or indirectly using CLIs). They're extremely easy to program for as the user interface is simple and extremely consistent...
I have to disagree with some of that. Most CLIs - *especially* UNIX ones - are horribly inconsistent, with syntax (same switches to commands doing different things, different switches to do the same things, different ways of taking arguments, etc) being the most obvious area where that is true.
At no point have I even suggested that CLIs are not useful, should be deprecated, or anything similar. My complaint is with the near ubiquitous belief that only CLIs are used for "serious" or "non-trivial" work, and anyone using (or preferring) a GUI is inherently stupid and/or incapable.
Re:Hyperviser by Muros · 2011-03-03 09:23 · Score: 1

Eventually, I have to interrupt, and explain to the org's leaders that they're hearing from people who don't understand scrollbars, have never seen the events table because they don't scroll down to see it. The users are, of course, confused; they know that there's no such table because they've never seen it. We bring up the site on a handy machine (preferably a laptop or tablet with a small screen), and I show the users that it's there by scrolling down to it. Their response again is confusion, because they don't know what I did or how I did it. "Why's it hidden like that?"
You know, I had a thought when I read that. I imagine that a person from 2500 years ago, used to reading scrolls, if presented with a computer with a web page open, would find the scrollbar intuitively obvious. They might however have a great deal of trouble with a link at the bottom saying "Next Page".
Re:Hyperviser by Aspomwell · 2011-03-03 11:30 · Score: 1

Actually in Windows 7 it's called PSR.exe (or Problem Steps Recorder). It's designed to document the steps to allow an admin to reproduce a problem but there's no reason it can't be used to document procedures as well. It will record all your steps and allow you to add comments too.
Re:Hyperviser by jc42 · 2011-03-03 14:51 · Score: 1

I imagine that a person from 2500 years ago, used to reading scrolls, if presented with a computer with a web page open, would find the scrollbar intuitively obvious. They might however have a great deal of trouble with a link at the bottom saying "Next Page".
Youtube has a funny video on this very topic.

--
Those who do study history are doomed to stand helplessly by while everyone else repeats it.
Re:Hyperviser by jp10558 · 2011-03-04 01:44 · Score: 1

The problem I have with Windows is mostly around wanting to automate things. In Linux, the way I automate something is generally the way I *do* something. At the simplest level, pretty much all OS functions can be done via a simple bash script.
In Windows, there's a lot I can't do at the command line even if I tried, and the command line, even powershell, certainly isn't the way I'm taught to do it via MS docs or whatever. Many apps don't even have command line interfaces. So even simple things like a scripted install can be near impossible. I can try AutoIt, but scripting GUIs seems to be very unreliable. I can learn yet another tool to try and make a .mst if the product supports that. Some have config tools, but many don't. Some that are msi's launch other msi's for the install, but don't pass the parameters like /qn to them, so you can't make a silent install. Finally I could use snapshot based packagers, but somehow they still tend to miss stuff and often don't work.
A little off topic, but how does something that grabs all changes on disk, registry and drivers still miss "something" so the install doesn't work? Where in the ether are these installers hiding critical settings or code?
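
For reference, the simple case (a single well-behaved MSI) is scriptable in a few lines. This is only a sketch: `silent_install_command` and `install` are names made up here, the package path is hypothetical, and it does nothing about the nested-MSI problem described above, where the outer installer fails to pass `/qn` along.

```python
import subprocess
import sys


def silent_install_command(msi_path, log_path=None):
    """Build an msiexec command line for an unattended install."""
    # /i = install, /qn = no UI, /norestart = do not reboot afterwards
    cmd = ["msiexec", "/i", msi_path, "/qn", "/norestart"]
    if log_path is not None:
        cmd += ["/l*v", log_path]  # verbose log, useful when a silent install fails
    return cmd


def install(msi_path, log_path=None):
    """Run the install; only meaningful on Windows, where msiexec exists."""
    if sys.platform != "win32":
        raise RuntimeError("msiexec is only available on Windows")
    return subprocess.run(silent_install_command(msi_path, log_path), check=True)
```

Keeping the command builder separate from the runner means the command line can be inspected or logged on any machine before it is ever executed.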

--
Opera, Proxomitron-Grypen,GPG 0x0A1C6EE3
Re:Hyperviser by ploxiln · 2011-03-04 09:53 · Score: 1

The knowledge they have gained about the CLI interaction is probably specific to that CLI only (how many different ways in different CLIs do you know of to trigger a page down ?)
Funny you should ask - there's this funny key on the keyboard with "Page Down" written on it, and it virtually always works, even in "vi" and "less". There's also a down arrow key, which also virtually always works, even in "vi" and "less".
I think the parent's point was that a scrollbar on the screen is not necessarily more intuitive than an up and down arrow key on the keyboard, with his evidence being that these sorts of novice users who everyone seems so desperate to cater to and empower can't figure out the graphical scrollbar, even though it's used in every graphical application, even when shown how to use it. They're going to have to learn something, and it might as well be the more efficient CLI.
Re:Hyperviser by Courageous · 2011-03-07 06:03 · Score: 1

Yes.
Consider our ESX hosts. Unless a problem is recurring, I have no need or incentive to root cause it. I just put the host in maintenance mode, live migrate the VMs off, and reboot. Voila. Even better, I can effectively do this with hardware. Hardware looking flaky? Live migrate stuff off, and...
And you're right about the load balanced, clustered stuff. We reboot domain controllers all the time. And so forth.
C//

Clone my car! by hart · 2011-03-02 01:57 · Score: 2

TFA concludes with "But if all it takes is a few clicks of a mouse in vSphere's Windows-based client to pop out a cloned server instance (ostensibly built by someone who knew what they were doing), then what does it matter? It's all very convenient and cool, right? Wrong. If you don't understand the underpinnings, you're missing the point. Anyone can drive the car, but if it doesn't start for some reason, you're helpless. That's a problem if you're paid to know how to fix the car." While I agree in principle, the analogy here is off. If the car doesn't start in this case, I can just throw it away and clone a working one.

Re:Clone my car! by shawb · 2011-03-02 02:07 · Score: 5, Insightful

The real solution? Reimage the production server to just get it working, then you dig around on the dev server until you find out what's actually going on.

--
I'll never make that mistake again, reading the experts' opinions. - Feynman
Re:Clone my car! by commodore6502 · 2011-03-02 02:13 · Score: 0

>>>If you don't understand the underpinnings, you're missing the point.
Where does a person go to learn those underpinnings and become a Unix or Linux server expert?

--
Information wants to be expensive AND wants to be free. So you have Value vs. Cheap distribution fighting each other.
Re:Clone my car! by Anonymous Coward · 2011-03-02 02:15 · Score: 1

I would also say that the guy cloning the server is NOT a sysadmin. While it's a job that traditionally would have fallen to the sysadmin, advances in technology allow low level techs to now handle that kind of job. However, that does not mean you don't need a sysadmin, he's very likely the guy who knew what he was doing and built the server.
It also depends quite a bit on whether you can reimage or just build a new server. For a webserver or something like that where there's dozens or even hundreds of identical servers all with the exact same config, it doesn't make sense to troubleshoot each problematic server. You can spend hours tracking down the problem or minutes reimaging, and even if you find the problem it may take hours longer to fix it. You still need someone who knows what they're doing though, for several scenarios. Most commonly:
a) Server A is faulty, you reimage. The next day server B shows the same symptoms. You reimage, but ultimately need to spend some real time finding the root cause. If the next day Server A's problems resurface, or server X starts showing the same symptoms, then your reimaging monkeys damn well better have a good admin to call before the problem gets out of hand.
b) Not all servers are clusters. Citrix farms, web farms, db farms, yea, you take one down, reimage and move on. Core application servers you generally don't have that luxury, and without your core app servers working, your massive farms that feed data into and out of them aren't worth shit.
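The two scenarios above boil down to a simple triage rule, sketched below. The class name, the seven-day window, and the server names are all assumptions made for illustration, not anything from an actual ops tool: farm members get reimaged on first sight, a repeated symptom anywhere in the window triggers a root-cause hunt, and anything outside a farm goes straight to a real admin.

```python
from collections import defaultdict
from datetime import datetime, timedelta


class IncidentLog:
    """Toy triage: reimage farm nodes, but escalate repeated symptoms."""

    def __init__(self, farm, window=timedelta(days=7)):
        self.farm = set(farm)            # servers that are safe to reimage
        self.window = window             # how far back a repeat counts
        self._seen = defaultdict(list)   # symptom -> [(time, server), ...]

    def record(self, server, symptom, when):
        # Earlier incidents with the same symptom, on any server, inside the window.
        recent = [t for (t, _srv) in self._seen[symptom]
                  if abs(when - t) <= self.window]
        self._seen[symptom].append((when, server))
        if server not in self.farm:
            return "troubleshoot"        # scenario b: no luxury of reimaging
        if recent:
            return "reimage+root-cause"  # scenario a: same symptom again
        return "reimage"


log = IncidentLog(farm={"web-a", "web-b"})
day1 = datetime(2011, 3, 1)
print(log.record("web-a", "oom", day1))                      # reimage
print(log.record("web-b", "oom", day1 + timedelta(days=1)))  # reimage+root-cause
print(log.record("core-1", "oom", day1))                     # troubleshoot
```

Matching on the symptom rather than the server is deliberate: server B showing server A's problem a day later is exactly the signal that the image itself is suspect.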
Re:Clone my car! by tsm_sf · 2011-03-02 02:32 · Score: 2

Well, pre-Xbox attention spans it was digging through man pages. I don't know how you're supposed to find that kind of focus now, when everything in your house either blinks, beeps, or vibrates. Good luck.

--
Literalism isn't a form of humor, it's you being irritating.
Re:Clone my car! by bsDaemon · 2011-03-02 02:33 · Score: 3, Insightful

Traditionally? College. Way back when, long before I was born, system admins tended to be graduate students in computer science or other department staff, and those in industry did it in college first. System administration itself wasn't taught, but that's not the point. The point is several technologies grew up together and are generally described in terms of one another: Unix, C, TCP/IP, etc. -- You don't really get what's going on with one without the others in most cases.
C, of course, is the foundational building block. Unix is the cathedral and TCP/IP is the road that connects each building together. Most of the so-called system admins I've seen in the past have been "web developers" who have been put in over their head and forced to deal with things they don't fully understand. I learned C and Unix concurrently, starting by teaching myself in jr. and high school. Try explaining an mbuf to some kid who only knows PHP some time -- it's painful.
The lack of fundamental understanding which would enable them to be competent admins is the same lack of fundamental understanding which keeps them from writing secure code, debugging network issues, etc. But, because there is a large influx of semi-skilled people who think that the fact they installed Ubuntu on their PC at home makes them a server admin, employers are less willing to offer up the salaries necessary to attract competent admins, and frankly the salaries need to be even higher to make dealing with idiots less of a hassle.
I'm so glad I'm not in web hosting anymore I can't possibly overstate it.
Re:Clone my car! by Ephemeriis · 2011-03-02 02:34 · Score: 5, Insightful

The real solution? Reimage the production server to just get it working, then you dig around on the dev server until you find out what's actually going on.
Exactly.
If the machine is in production it needs to be working. You don't have time to dig around and find the root cause. You need it to work. Now. If you've got a virtualized environment it is trivial to bring up a new VM, throw an image at it, and migrate the data.
Then you take your old, malfunctioning VM into a development environment and dig for the root cause, so that you don't see the same problem crop up on your new production machine.

--
"Work is the curse of the drinking classes." -Oscar Wilde
Re:Clone my car! by Anonymous Coward · 2011-03-02 02:35 · Score: 0

I agree. With a VM, you can take a snapshot to investigate later AND reboot to get things running in a pristine state immediately.
Re:Clone my car! by Isca · 2011-03-02 02:35 · Score: 4, Insightful

That's assuming your new tool that's vitally important actually has a man page. Very little is documented as well as it was 10 years ago.
Re:Clone my car! by camperdave · 2011-03-02 02:50 · Score: 1

The problem with cloning is that if there is a flaw in the master, then there is a flaw in every single clone. We see it all the time in cars. A faulty part leads to a recall. Sure, the circumstances under which the flaw causes problems may never happen to you. On the other hand, if your processes and procedures cause you to hit the flaw, then replacing the server instance isn't going to help. I mean, imagine how Star Wars would have turned out if Jango Fett had been a good marksman.

--
When our name is on the back of your car, we're behind you all the way!
Re:Clone my car! by Anonymous Coward · 2011-03-02 02:54 · Score: 1

Some of the worst sysadmins I've met have been computer scientists. They don't tend to think about things like scalability of applications or reliability of services.
I can't count the number of times I've had to tell CS people that webrick is not a viable basis for a production service, or that backup is just as important as getting a new gee-wiz service running.
Re:Clone my car! by digitalchinky · 2011-03-02 02:54 · Score: 1

Assuming your image of the production server is not borked as well :-)
Re:Clone my car! by Culture20 · 2011-03-02 02:56 · Score: 1

Well, pre-Xbox attention spans it was digging through man pages.
$man man
NAME
man - an interface to the on-line reference manuals

DESCRIPTION
man is the system's manual pager.

SEE ALSO
The full documentation for man is maintained as a Texinfo manual. If
the info and man programs are properly installed at your site, the
command

info man

should give you access to the complete manual.

And no, this isn't really what man man says, but I expect it to eventually. I hate info and its hypertext-ified, hiding-stuff-behind-links format.
Re:Clone my car! by KnownIssues · 2011-03-02 02:58 · Score: 1

If only I was given the time to dink around on the dev server for a few days to find out why some obscure problem was resolved by rebooting the server. And that assumes your dev environment is equivalent to prod enough to repro the issue in the first place. Don't get me wrong, I'd seriously like to be able to do that. I understand the value of spending two days of my time to avoid wasting three days of my future time. But very few for-profit organizations are going to want their employees finding root cause to prevent something that might happen in the future, when they have projects they want completed yesterday.
Re:Clone my car! by bigstrat2003 · 2011-03-02 03:02 · Score: 1

But that wouldn't allow you to write self-righteous rants about how reimaging is for hacks!

--
"16MB (fuck off, MiB fascists)" - The Mighty Buzzard
Re:Clone my car! by Culture20 · 2011-03-02 03:04 · Score: 1

I don't know how you're supposed to find that kind of focus now, when everything in your house either blinks, beeps, or vibrates.
Oh, and: You're supposed to drink copious quantities of caffeinated beverages so that you vibrate and blink in sync with everything else.
http://www.youtube.com/watch?v=hiTF4_sDgPo
Re:Clone my car! by scamper_22 · 2011-03-02 03:18 · Score: 1

While you try and be sarcastic.
Yes, the computer world allows you to clone a car in good working order.
So, yes, you can be an ignorant car driver who doesn't change the oil, rear ends everyone, throws it in reverse while the car is in motion... but unlike real life... when the car stops working... you can very quickly... restore the car to a clone in good working order.
You're not paid to fix the car... you're paid to keep the service running. Two very different things.
At some point... it does become cheaper to just replace things... than fix them. This has happened in many fields. There used to be people called TV repairmen? Who does that anymore? You just buy a new one as the cost is less.
It's a perfectly valid thing to do. If there's something that keeps going wrong, then you call in an expert to figure it out. But for the average sys admin... restore image!
This is all part of automation which is very good. A good architect can setup the system, backup... and the average sys admin just restores. You only need a few of these expert architects. Heck, your company might not even employ them. They might just be consultants.
Re:Clone my car! by Canazza · 2011-03-02 03:18 · Score: 1

I was once a kid who only knew PHP.
I wouldn't dream of telling anyone I knew how to Sysadmin, nor look for a job doing it. I blame your company, partially, and the applicants, slightly, for not knowing their limits.

--
It pays to be obvious, especially if you have a reputation for being subtle.
Re:Clone my car! by hedwards · 2011-03-02 03:25 · Score: 1

Yes, but what happens if for some reason the reimaging doesn't go well? That's really why you ought to be paying for a competent sysadmin in the first place, you don't hire them under the assumption that it's always going to work as designed, you hire them because sometimes it doesn't work. In that case it might be that some lunkhead changed configuration files on the VM and not the image or that the machine itself isn't configured correctly as a result of an error that nobody noticed in the imaging procedure. A good sysadmin has probably seen or foreseen most of the problems that can be encountered or at least has some idea where to look and whom to call.
But, if you really want to see why you need a qualified professional, I think backups are really the place to look. Sure a lot of that is point and click, but you're going to seriously regret not having somebody that knows what they're doing when you're choosing a system, implementing said system and most of all when you're needing to recover the system.
Re:Clone my car! by falcon5768 · 2011-03-02 03:32 · Score: 2

"In that case it might be that some lunkhead changed configuration files on the VM and not the image or that the machine itself isn't configured correctly as a result of an error that nobody noticed in the imaging procedure."
In both cases that depends highly on your organizational structure as well as your dev environment.
Where I work, only the system admins would ever be able to change config files on the VM, and no image of the machine would ever be deployed without going through at least 2-3 months of testing before it was even rolled out. Here the rule is get it working any way possible, if it means imaging then do it, then go back and figure out WTF went wrong. It's because of this procedure that we have been able to have uptime better than most other companies in our field.

--
"Slashdot, where telling the truth is overrated but lying is insightful."
Re:Clone my car! by ibbie · 2011-03-02 03:48 · Score: 1

But, because there is a large influx of semi-skilled people who think that the fact they installed Ubuntu on their PC at home makes them a server admin
While I won't say that it does make them a server admin, everyone has to start somewhere, and a lot of schools these days leave out a lot when it comes to technology. A friend of mine graduated with a Comp Sci degree a few years ago, and had barely touched anything *nix at all. I think they might have had them log into an old RHEL VM and use pico, perhaps start and stop apache, but that was it. This isn't to say that he wasn't smart, it's just that they didn't teach him anything outside of VB.NET and (how to use it to work with) XML.
I've since met and worked with others who had similar experience, if you s/VB.NET/Java/
Not precisely what I'd consider a broad range of education.

--
The wise follow a damned path, for to know is to be forsaken.
Re:Clone my car! by MattSausage · 2011-03-02 03:51 · Score: 1

I'm guessing the water tasted better in your day, and the damn bananas could peel easier, but at the same time it snowed up to your elbows every winter and you had to walk everywhere you went, even down to the general store!

Sometimes things change guys.. and if it's a change for the more efficient or more profitable, it's a change that sticks. Management only cares things are back up as soon as possible, and Sys Admins get paid the same no matter what they do. So there.
Re:Clone my car! by Poppageorgio · 2011-03-02 03:57 · Score: 1

The real solution? Reimage the production server to just get it working, then you dig around on the dev server until you find out what's actually going on.
Exactly.
If the machine is in production it needs to be working. You don't have time to dig around and find the root cause. You need it to work. Now. If you've got a virtualized environment it is trivial to bring up a new VM, throw an image at it, and migrate the data.
Then you take your old, malfunctioning VM into a development environment and dig for the root cause, so that you don't see the same problem crop up on your new production machine.
I ditto this. It's really hard to troubleshoot a problem when users are ringing your phone continuously because a critical server is down. Much easier to pull it into dev and troubleshoot at your leisure. The key being going back to figure out the problem. Not just "forgetting" about it and moving on.

--
Me fail English? That's unpossible!
Re:Clone my car! by Lumpy · 2011-03-02 04:07 · Score: 2

Problem is Companies don't want to PAY for college educated Sysadmins or IT people in general. They want to pay $16-$18 an hour instead of the $32-$42 my BSEE and BSCS deserves. The Cert mills churn out the useless certified IT people that gladly lap up the low wages.
THAT is the demise of the educated Sysadmin. Companies that want to pay the IT department less than the custodial department.

--
Do not look at laser with remaining good eye.
Re:Clone my car! by minvaren · 2011-03-02 04:24 · Score: 2

You don't have time to dig around and find the root cause. You need it to work. Now.
On reflection, this is a good analogy for modern society in general.

--
Big! Strong! Wow! Tada-O!
Re:Clone my car! by drsmithy · 2011-03-02 04:33 · Score: 1

The real solution? Reimage the production server to just get it working, then you dig around on the dev server until you find out what's actually going on.
It'd be nice if a dev environment could genuinely mirror production, but very few people are that lucky.
Re:Clone my car! by Ephemeriis · 2011-03-02 04:35 · Score: 1

The key being going back to figure out the problem. Not just "forgetting" about it and moving on.
Assuming, of course, that you can do this and don't wind up being shoved on to some other project because this one is "fixed".

--
"Work is the curse of the drinking classes." -Oscar Wilde
Re:Clone my car! by hood8263 · 2011-03-02 04:43 · Score: 1

I agree. Computer scientists, at least around here, are taught computer programming. Well, mostly how to program, but not actually with a programming language. I graduated from a computer systems technology course which went over everything a bit. Well, everything as in server admin for Windows/Linux, programming, databases, etc. The Computer Scientists are pure code: no knowledge of servers or anything other than how to create good algorithms. I'm happy with my degree; it at least taught me something. It was also a pain in the ass to get through. We lost around 50% of the class due to people just unable to handle the workload. I can also out-program most of the Computer Science students 8/10 times. When it comes to web programming that's about 9.75/10 times.
Re:Clone my car! by pnutjam · 2011-03-02 05:05 · Score: 1

might be right about bananas...

--
Cheap storage VM.
Re:Clone my car! by surgen · 2011-03-02 05:13 · Score: 1

While I agree in principle, the analogy here is off. If the car doesn't start in this case, I can just throw it away and clone a working one.
Yeah, that's the trouble with the analogy, but yours is off too. You can clone an older version of that car, but cloning a working one all depends on your definition of "working".
Let's say your car doesn't start and you don't know why. It's actually just out of gas, but you restore from a backup image of the car from when it was running on fumes. Now maybe that old image is good enough to get you where you were going, but it's still risky to reset and forget. (Gas being an expendable resource this isn't a very good analogy, but you get the point.)
Re:Clone my car! by cayenne8 · 2011-03-02 05:34 · Score: 1

It'd be nice if a dev environment could genuinely mirror production, but very few people are that lucky.
Or to even have a dev environment PERIOD.
On so many things I've seen...even govt/DoD...I've seen things where the dev machine becomes the production machine as soon as it is working good enough to go. Back then, we found that when buying the machine for dev...to try to get as much hardware as we could because we knew there was a great chance that it would also turn into production.

--
Light travels faster than sound. This is why some people appear bright until you hear them speak.........
Re:Clone my car! by SuperQ · 2011-03-02 05:36 · Score: 1

True. But that kind of sysadmin work is boring. It's no better than being a janitor.
The real fun stuff these days is not just doing sysadmin work, but working on automation and monitoring that could replace 1000s of Cert mill morons.
Re:Clone my car! by Hooya · 2011-03-02 05:37 · Score: 1

Or you could implement some sort of redundancy/failover so that when your production machine goes down, you have another one pick it up instantly. Then you don't even need to respond quickly to restore a VM. But alas, redundancy/failover are so last decade and VM is all the rage now with the new kids.
Re:Clone my car! by houstonbofh · 2011-03-02 06:25 · Score: 1

You work in a place that give you time to do this? You can devote time to "fixing" a system that is up and running? Must be nice... (Only somewhat sarcastic)
Re:Clone my car! by houstonbofh · 2011-03-02 06:43 · Score: 1

houstonbofh@tc-us-dev01:~$ man gui-tool
No manual entry for gui-tool
houstonbofh@tc-us-dev01:~$

Damn!
Re:Clone my car! by AK+Marc · 2011-03-02 06:59 · Score: 1

And if you got a comp sci degree in 98 from a particular large public university, you'd have never needed to touch a single PC or server. You could complete the degree with mainframe access via VT 100 terminal only. Oh, and no VB .NET, PHP, HTML, XML, required, just a little C+ but focusing on others like FORTRAN.

--
Learn to love Alaska
Re:Clone my car! by mikael_j · 2011-03-02 09:19 · Score: 1

If anything causes problems with development and staging systems my experience tells me that nine out of ten times it's net connectivity issues (for any kind of networked server). Nothing like having your dev environment on a 10.x.x.x/24 subnet that requires a proxy and other magic to talk to anything, then a staging environment that's stuck "behind" the backend servers for the live system (with a firewall in between). And then of course there are tunnels going all over the place to make sure that it's actually possible to test automatic file synchronization with a partner company's server (or simply another server belonging to your organization).
I dream of an environment where you could just create an exact image of the production environment, copy it to the dev environment, build your software, package it, test a deploy on a staging server and know that if it worked in dev and staging it should work in production (more likely it will first fail on the staging system and then on the production server as well, maybe not spectacularly but in some way there will be a problem).

--
Greylisting is to SMTP as NAT is to IPv4
Re:Clone my car! by ppanon · 2011-03-02 11:01 · Score: 1

So if the problem is the accelerator or brakes fail due to a design flaw and the car wraps itself around a tree, seriously injuring or killing the driver and passengers, you just re-image the car and don't tell the new driver and passengers what happened to the previous users? How is that working out for Toyota?

--
Laissez lire, et laissez danser; ces deux amusements ne feront jamais de mal au monde. - Voltaire
Re:Clone my car! by Anonymous Coward · 2011-03-02 12:00 · Score: 0

I hate to agree but it does seem like computer scientists make terrible "computing professionals", and especially terrible sysadmins. They do occasionally make good programmers. Much like most programmers, they rarely seem to know how an actual computer actually works or is actually used by actual users. I know these are offensive stereotypes but I've been in the industry long enough now to see that they hold up. The people who make the best "computing professionals" are rarely the same people who studied computer science in college.
I will say that I have never held a CS degree against someone, but when the choice is down to hiring a BS in CS with 3 years experience or a BA in Liberal Arts with 2 years experience, I'm going with the BA. He might not get our jokes about binary, but he will be better at nearly every task we throw at him AND less likely to piss anyone off. Even better, I can hire them at least 10% cheaper and they will be happy with it.

Sad but smart by Anrego · 2011-03-02 01:58 · Score: 4, Interesting

I’m not a system admin but I don’t see how this is a bad approach.

I see value in finding out what the problem is and why it happened.. if you just blindly re-image then the problem might pop up again at a less opportune time.

But if you know what the problem is... and you have an image of the server in a working state, or a documented procedure on how to set up the server in its intended configuration then why would anyone waste time trying to repair it.

I think you have this kind of problem in most jobs. New approaches that make more sense but require less skill (and imply less e-pene) are always hated by people who have already learnt how to do it “the hard way”.

I see this as a programmer all the time and have been a victim of it. I’ve seen a huge chunk of my chosen industry migrate from meat and potato problem solving to gluing libraries together and sprinkling in business logic.

I’ve been fortunate to land in a job where there’s still a lot of “from the ground up” work, but these jobs are getting scarcer as even the components that everyone uses are made from other components. And executable UML (or something of its ilk) is probably going to be the next thing to cut the legs off us.

Re:Sad but smart by darjen · 2011-03-02 02:07 · Score: 1

I see value in finding out what the problem is and why it happened.. if you just blindly re-image then the problem might pop up again at a less opportune time.
That's why you have backup servers. Sometimes it simply isn't worth the time or effort to dig deeper. Re-imaging is completely rational from a business perspective.
Re:Sad but smart by TheRaven64 · 2011-03-02 02:24 · Score: 4, Interesting

Add to that - no one (outside of the IT department) cares what the problem is, they care about the downtime. If you have some redundancy, stuff can fail periodically without the users noticing. An 'admin' capable of keeping it running can be someone paid to do something else who has responsibility for clicking the button every few months if required. An admin who can actually address the problem will cost, what, $60,000/year minimum (including associated costs, not just salary)? Is having ten minutes of downtime every few months costing your business $60,000/year? If not, then it's not worth the cost of doing it properly. It may be for a bigger company, but for a small business that would eat most of their profits. This is the advantage of a Windows or Mac server, with its pointy-clicky interface: it may be less reliable, and more expensive, but the cost saving from not needing to employ anyone who actually understands what's going on outweighs it. Especially if you buy a support contract, where the vendor will send someone competent out for the couple of times a year where something goes seriously wrong.
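The break-even in that argument is easy to put a number on. A rough sketch, reading "every few months" as four incidents a year (that figure is an assumption; the $60,000 and the ten minutes are from the comment above):

```python
def break_even_cost_per_minute(admin_cost_per_year, incidents_per_year, minutes_per_incident):
    """Downtime cost per minute at which a full-time admin pays for themselves."""
    downtime_minutes = incidents_per_year * minutes_per_incident
    if downtime_minutes <= 0:
        raise ValueError("need a positive amount of downtime to compare against")
    return admin_cost_per_year / downtime_minutes


# $60,000/year admin vs. four 10-minute outages a year (assumed frequency).
print(break_even_cost_per_minute(60_000, 4, 10))  # 1500.0
```

So under those assumptions, downtime has to cost the business about $1,500 a minute before the admin is the cheaper option, and that is before counting whatever else the admin would do with the rest of the year.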

--
I am TheRaven on Soylent News
Re:Sad but smart by causality · 2011-03-02 02:31 · Score: 3, Insightful

I’m not a system admin but I don’t see how this is a bad approach.
I see value in finding out what the problem is and why it happened.. if you just blindly re-image then the problem might pop up again at a less opportune time.
But if you know what the problem is... and you have an image of the server in a working state, or a documented procedure on how to set up the server in its intended configuration then why would anyone waste time trying to repair it.
I think the issue here is that the need for a business to get a production system back up and operational with as little downtime as possible can sometimes conflict with the principles that most effectively assure sound system administration.
Unix/Linux systems don't just break for no reason, particularly servers with enterprise hardware. The idea that a system just breaks for no apparent reason and a reboot, reset, or re-image is going to actually fix the cause and somehow prevent a future reoccurrence is alien to this realm. That's a mentality that comes from running Windows (esp. previous incarnations) on commodity hardware.
Something on that "known working" image is faulty or capable of breaking. Otherwise, normal use would not have led to a state of system breakage.
The ideal course of action would be to do whatever is necessary to get the system back online, which may include re-imaging, and then discover what is wrong with the "known working" image that eventually broke. That could be greatly assisted, of course, by saving the data (at least the logs) from the known-faulty system prior to re-imaging.
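Saving the logs before the re-image takes only a few lines. A minimal sketch using Python's standard tarfile module; the function name, the host label, and the paths are placeholders, and a real setup would copy the archive off the box before wiping it:

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def preserve_logs(log_dir, dest_dir, host):
    """Archive log_dir into a timestamped .tar.gz under dest_dir and return its path."""
    log_dir = Path(log_dir)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = dest_dir / f"{host}-logs-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store entries under <host>/<dirname>/ so archives from
        # different machines can be unpacked side by side.
        tar.add(log_dir, arcname=f"{host}/{log_dir.name}")
    return archive
```

Something like preserve_logs("/var/log", "/mnt/evidence", "web01") would run right before the re-image (both paths hypothetical), so the root-cause hunt can happen later without holding up production.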

--
It is a miracle that curiosity survives formal education. - Einstein
Re:Sad but smart by y86 · 2011-03-02 02:43 · Score: 1

Re-imaging is completely rational from a business perspective.
So is treating cancer vs curing it.
Re:Sad but smart by Anrego · 2011-03-02 02:58 · Score: 0

That analogy seems kind of backwards in this case.. because what's important (and why re-imaging seems to be more popular) is uptime. The same could be said with cancer.
If a doctor told you.. "ok.. we have this drug that'll fix that right up for you.. but I'd really like to know why this happened so I'm gonna dig around and try and fix it myself" you'd probably tell him to give you the damn drug.
Actually it's an interesting analogy because the sys admins are kind of like the pharmaceuticals here. It's in a sysadmin's best interest to spend hours doing skilled work to fix a problem, and to push that as the best option, rather than let people know there is a magical button that gets you back up and running in an hour.
Re:Sad but smart by Anonymous Coward · 2011-03-02 03:00 · Score: 0

Did you just say "Mac server"?
An admin who can't troubleshoot won't know how he got in trouble in the first place. If it's a bad NIC, faulty wiring, or an issue with a software configuration step, they will just keep doing the same thing OVER AND OVER.
That will eat away at your profits a whole lot faster, because you'll just buy a new "Mac server" or something :)
Re:Sad but smart by bsDaemon · 2011-03-02 03:05 · Score: 1

No, the re-imaging is more like if you have cancer and you decide to commit suicide in hopes that you get re-born in a body less susceptible to cancer, with the intention of using past-life regression hypnosis to remember everything from before. Re-imaging means shit when the problem turns out to be bad sectors in the swap partition causing read errors and spiking CPU load due to IO waits. I've seen that before.
Re:Sad but smart by drsmithy · 2011-03-02 03:06 · Score: 1

Unix/Linux systems don't just break for no reason [...]
They can, however, break for a reason that is beyond your level of knowledge, skill, or simply free time to discover.
Re:Sad but smart by L4t3r4lu5 · 2011-03-02 03:10 · Score: 1

So come downtime you chkdsk / fsck the drive, mark the sectors as bad / replace the drive, and Bob's your mother's brother's daughter's lover.

The question here is: Do you spend the time diagnosing the fault, or do you re-image to a known working state? If the software isn't bad, the hardware must be. Two hours of tinkering might give you the same answer, but 30 minutes of rebuild definitely does.

--
Finally had enough. Come see us over at https://soylentnews.org/
Re:Sad but smart by Anonymous Coward · 2011-03-02 03:11 · Score: 0

There's great value in ensuring you have a good image or can rebuild the system. Before jumping ship to a Windows developer role, I was a *nix admin and the older admins thought it was odd I spent so much time ensuring I could kickstart the new Linux servers (mail, file, network, and ssh). Since we didn't do whole system backups, I wanted to be up and running quickly if we lost the building (there were real science labs in the building and the building next to us was designed to have the roof blown off in the event of a big explosion).
With that said, there's great value in fixing the actual problem. If you re-image only to have the same issues, that was wasted time. Even restarting the system can be a waste of time as often you can restart the service. Sometimes restarting a service and watching it go bad can help find the underlying cause.
Re:Sad but smart by gravis777 · 2011-03-02 03:18 · Score: 1

This isn't just for system admins, but the entire IT department. I have moved from desktop support to desktop security. Let's take the simple example of a user getting a virus (hopefully rare, hopefully your enterprise virus software is functioning correctly, but it still happens). Now, let's say the virus is one I hadn't seen before, and it's a really nasty little bugger too. Now, I can spend time crawling through Windows files, registry keys, configuration files, and whatnot, trying to get rid of it, with some help through Google. This could take several hours - possibly a couple of days.
Point is, I know the problem, I now have a fix. In the meantime, the user has been on a loaner laptop for two days. It takes, what, a few minutes to a few hours to back up the user's data, about 15-45 minutes to reimage, and maybe another couple of hours to reinstall programs (depending on your environment). Point is, I can come in and pick up a laptop at 7:30, and have it reimaged and back to the user by their 10AM meeting with a reimage.
Now, let's say it's a REALLY nasty virus, and it's infecting multiple people - like we are getting 10-20 of these a day. Now, it may be worth the time to research the issue and find a fix. However, I know one virus I dealt with about 3 years ago that I was getting about 2 a day on. I had a fix, it completely got rid of the virus, and I could get away without doing a reimage. However, it took about 4-5 hours a computer to get rid of, as I was going through multiple, time-consuming steps. I had the knowledge of how the virus worked, I knew EXACTLY what was causing the virus to come in, and I was yelling and screaming at both our Antivirus software company and Microsoft for not patching it. Good for me. However, I could reimage said computer and have it back to the user in under 2 hours.
So, please, tell me, what is wrong with a reimage?
Re:Sad but smart by bsDaemon · 2011-03-02 03:18 · Score: 1

We moved all the data over to a hot swap physical machine then replaced the disk in the original machine later. Everyone else was trying to just suspend this customer for resource abuse. It's not that his shit was /that/ bad, but it was just bad in a way that made him swap to disk about once every 2 hours, with a 75% chance that his shit would hit the bad blocks. The machine needed to be moved, no doubt about it. When I found that out, I also noticed a few other indicators of impending disk failure that none of the other people in my department had noticed. I like my new job, though. We don't have any idiots here and I'm not in ops anymore. It's totally money.
Re:Sad but smart by theBully · 2011-03-02 03:19 · Score: 1

I can definitely see where you're coming from. I live in both the worlds of software development and system administration, and on both the Windows and Linux platforms. In both worlds there are more and more tools that make it possible for someone without core skills to intervene. Someone (I can't remember who) was saying "Java was made so that anyone can be a programmer. The problem with it is that everyone can be a programmer." I have seen a whole lot of software, both OSS and proprietary, developed using what I call "The Lego Model". Its quality speaks for itself. As an example, I recently did some development around Drupal (which happens to be a fairly big and well supported project). There are still major flaws in its core design and architecture. I am not trying to badmouth OSS developments as I have seen the same problems in proprietary applications as well. What I'm trying to say is, with the appropriate tools many can nowadays become "programmers" or "sys-admins". But take Eclipse, or Netbeans or VisualStudio away and give them vi. They'll take a day to write 5 lines of crappy code, if they can figure out how to use vi. Same for "sys-admins". They'll be fine as long as the core of the system works and the problem matches their "fix it script". Put them in front of a non-trivial, non-scripted issue and they're lost. This is when it becomes apparent that there's a difference between sys-admin and tech-support level 1 or 2. The way I see it is that certain tasks that used to be sys-admin tasks are moving down a level or two because there are tools to make that possible. Is this making the sys-admin redundant? I think not. Same for developers. There are certain tasks that you can hand down a level or two at this point. Is this going to make the experienced and skillful developer redundant? I think not.
It will simply allow him to focus where he is needed most: at the core, design, architecture and algorithm development, while in the meantime the rest of the team can translate UML or algorithms into code. (I would certainly enforce thorough review of the code just as I would certainly have a sys-admin supervise the tech-support staff that handles imaging.) I don't see a problem with all this at all. It simply shifts tasks from one side to another and, as many have already said here, it addresses the issue of reducing downtime to a minimum. And yes, the sys-admin can then take the time to find a fix or a solution offline without the management team behind his back going: "Are we there yet?". People's jobs have been changing, being created, and being made redundant one way or another all along. I remember a time, not too long ago, when someone could be a "computer operator". Hopefully we will adapt to the changes instead of staying and screaming "they've cut my legs off".
Re:Sad but smart by swalve · 2011-03-02 03:23 · Score: 0

Except that sometimes they do. Yes, they are a lot more "set it and forget it" than other OSs, but weird shit sometimes happens. Oddball bit flips, a file that was written funny due to an unknown and unreproducible sequence of events, and so on. Computers are supposed to make things easier. Blowing away and reimaging is often the easiest solution. Only when problems re-occur does further investigation become necessary.
The reimage as troubleshooting method is also an investment. You take time at the beginning to set up a system that allows this to happen painlessly BECAUSE you want problems to be solved quickly, and in a standardized manner. For desktops, it assures that systems are periodically refreshed to the proper config. For servers, same reason. All the other servers running off the image are working fine, you know the software is good. If the same software continually fails on a particular machine, you know the machine is bad. It is insanity to think that reinventing the wheel every time a problem crops up is the right way to solve problems. Nuke it from orbit.
Re:Sad but smart by Eivind · 2011-03-02 03:32 · Score: 1

True. But they can break due to freak one-off "accidents", i.e. it's possible for a server to crash, but then upon reboot, run perfectly for a year with no problems exhibited whatsoever.
Of course the problem that caused the crash once *can* do so again, but it might not be worth the time to investigate it. Thus it can be quite sensible to go with "once is accidental - twice is suspicious - three times warrants investigation".
Because sometimes "once" really does happen only once, and doesn't repeat itself in the foreseeable future, at which point knowing *why* might not be worth the time it takes to find out.
Pragmatism wins in the real world, and that's not a bad thing. The trick is to find the right balance.
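
The "once is accidental - twice is suspicious - three times warrants investigation" rule above amounts to a small escalation policy. A minimal Python sketch of it, assuming crashes are counted per host; the function name and the action strings are made up for illustration and do not come from any real tool:

```python
def triage(crash_count: int) -> str:
    """Map how many times a host has crashed to a suggested response.

    Thresholds follow the rule quoted in the comment above; the wording of
    each action is hypothetical.
    """
    if crash_count < 0:
        raise ValueError("crash_count cannot be negative")
    if crash_count == 0:
        return "nothing to do"
    if crash_count == 1:
        return "accidental: restart and move on"
    if crash_count == 2:
        return "suspicious: keep the logs and watch the host"
    return "investigate the root cause"


if __name__ == "__main__":
    for count in range(4):
        print(count, "->", triage(count))
```

Where the thresholds sit is the judgment call the comment describes; the sketch only makes the escalation explicit.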
Re:Sad but smart by hedwards · 2011-03-02 03:32 · Score: 1

Indeed, in all the time I've run FreeBSD I can't recall the OS ever crashing completely without cause, every once in a long while it'll panic because something goes horribly wrong, but it's not something that just happens randomly, there's always a cause. Likewise I rarely if ever see a program drop core just because, and the few applications I've seen do that have typically been a configuration error or mismatched library, and definitely not in any software that could be considered mission critical.
The other issue is that if you're not extremely careful reimaging can make things a lot worse. That might be your last chance to pull usable data off the disk, and you've just spent it on reimaging the disk. Or worse, there's a subtle RAM or HDD fault which isn't going to become obvious in the short term. Sure it should be backed up, but I know better than to assume that a given business is doing that properly. Not to mention that if you've got a problem with either RAM or the HDD you might not even notice the problem until you've already wiped the backups.
Re:Sad but smart by D+iz+a+n+k+Meister · 2011-03-02 03:33 · Score: 1

It is smart, but I don't think it's very sad.

If you subscribe to the ITIL or Visible Operations ideal, you probably believe that 80% of all IT related outages are self-inflicted due to change.

So for 80% of all IT outages, it does make sense to have a strategy where it is cheaper to rebuild (revert the change) than to repair.

But for the other 20%, it does make sense to investigate further. A virtualization strategy where you could redeploy the offending server while saving the old one out of service for investigation seems ideal.

I agree with investigation in principle, but the blog post seems quite sensational and misleading.

--

He painted a unicorn in outer space. I'm askin' ya, what's it breathin'?
Re:Sad but smart by egomaniac · 2011-03-02 03:44 · Score: 1

Did you just say "Mac server"? ... That will eat away at your profits a whole lot faster, because you'll just buy a new "Mac server" or something :)
So you've evidently got no issue with a Windows server, but you take specific exception to Macs? That's odd, considering one of those two is a reliable Unix flavor, and the other is... well, Windows. What exactly is the issue with a Mac server?

--
ZFS: because love is never having to say fsck
Re:Sad but smart by SilentStaid · 2011-03-02 03:44 · Score: 1

That is one of the simplest and yet most poignant comments I've ever read on Slashdot. I wish I had mod points for you.

I do have to say though, I agree with the sentiment that you need everything up as fast as possible generally speaking - if it's not up, you're losing money/work/time and that's something most employers are not okay with. That being said, if you don't have the luxury of working for a company that can afford a lot of redundancy - say a non-profit, then I'd say that curing becomes more of a necessity.

In the end, all I'm trying to say, I guess, is that different situations require different solutions, and in my opinion at least, that is a large part of what defines your abilities as a sysadmin: finding the best solution, regardless of the technical expertise required.
Re:Sad but smart by Anonymous Coward · 2011-03-02 03:49 · Score: 0

The IT/Unix Administrator would never just reimage. A true admin would delve deep and discover the issue, what caused it, then create a solution to solve it (a solution, not a band aid).
The IT Manager would evaluate the time needed to troubleshoot and implement a fix vs the time to just reimage. He must get the business back up and running as fast as possible.
That is the difference between the admin and the manager. 2 different jobs, 2 different people.
Re:Sad but smart by Anonymous Coward · 2011-03-02 03:54 · Score: 0

Unix systems break, too, for both hardware and software reasons. Even the $50k+ servers arrive DOA from time to time. Every new model takes at least 6 months so you can find a version of the firmware that actually works. Where I work, we are reporting bugs/issues to Oracle for Solaris all the time. My Linux (Ubuntu) desktop has been up 141 days, and there have been 56 kernel updates (applied via ksplice) since it first booted. That's about one kernel problem fixed every 60 hours.
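
The "one kernel problem fixed every 60 hours" figure above is easy to check. A quick Python sketch using only the two numbers quoted in the comment (141 days of uptime, 56 kernel updates); the function name is just for illustration:

```python
UPTIME_DAYS = 141      # uptime quoted in the comment above
KERNEL_UPDATES = 56    # ksplice-applied updates quoted in the comment above


def hours_between_updates(uptime_days: int, updates: int) -> float:
    """Average number of hours between updates over the uptime window."""
    return uptime_days * 24 / updates


if __name__ == "__main__":
    hours = hours_between_updates(UPTIME_DAYS, KERNEL_UPDATES)
    print(f"{hours:.1f} hours per update")  # prints "60.4 hours per update"
```

So 141 days is 3384 hours, and 3384 / 56 comes out at roughly 60 hours, which matches the comment.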
Re:Sad but smart by causality · 2011-03-02 03:58 · Score: 1

It is smart, but I don't think it's very sad.
If you subscribe to the ITIL or Visible Operations ideal, you probably believe that 80% of all IT related outages are self-inflicted due to change.
So for 80% of all IT outages, it does make sense to have a strategy where it is cheaper to rebuild (revert the change) than to repair.
But for the other 20%, it does make sense to investigate further. A virtualization strategy where you could redeploy the offending server while saving the old one out of service for investigation seems ideal.
I agree with investigation in principle, but the blog post seems quite sensational and misleading.
That often happens when a generalized article is written about a scenario that differs from place to place. By that I mean, which approach makes the most sense will depend on the actual problem, the downtime you are facing, the preparations you have made, and the needs of the business or organization. It's not such a one-size-fits-all deal though of course it can be spoken of in general terms.
I think it should be appreciated that a great number of problems are preventable, either through best practices or through redundancy. Wherever it is possible, a good sysadmin would rather invest a little effort into foresight up-front than fail to do so and end up having to perform crisis management.

--
It is a miracle that curiosity survives formal education. - Einstein
Re:Sad but smart by jaymz666 · 2011-03-02 03:58 · Score: 1

But if you know what the problem is... and you have an image of the server in a working state, or a documented procedure on how to set up the server in its intended configuration then why would anyone waste time trying to repair it?
How do you know that the issue that occurred wasn't because of how the server was initially setup and reimaging is just a time bomb waiting to recur?
Re:Sad but smart by darjen · 2011-03-02 04:02 · Score: 1

No, not really. Anyone who actually discovered a cure for cancer could make billions.
Re:Sad but smart by Anonymous Coward · 2011-03-02 04:31 · Score: 0

I’m not a system admin but I don’t see how this is a bad approach.
I see value in finding out what the problem is and why it happened.. if you just blindly re-image then the problem might pop up again at a less opportune time.
But if you know what the problem is... and you have an image of the server in a working state, or a documented procedure on how to set up the server in its intended configuration then why would anyone waste time trying to repair it?
I think the issue here is that the need for a business to get a production system back up and operational with as little downtime as possible can sometimes conflict with the principles that most effectively assure sound system administration.
Unix/Linux systems don't just break for no reason, particularly servers with enterprise hardware. The idea that a system just breaks for no apparent reason and a reboot, reset, or re-image is going to actually fix the cause and somehow prevent a future reoccurrence is alien to this realm. That's a mentality that comes from running Windows (esp. previous incarnations) on commodity hardware.
Something on that "known working" image is faulty or capable of breaking. Otherwise, normal use would not have led to a state of system breakage.
The ideal course of action would be to do whatever is necessary to get the system back online, which may include re-imaging, and then discover what is wrong with the "known working" image that eventually broke. That could be greatly assisted, of course, by saving the data (at least the logs) from the known-faulty system prior to re-imaging.
Nowadays Windows Server administration is very complex. If you think that just clicking on an interface is system administration you are wrong.
When you click you actually have to know what you are doing. So the point that Windows involves less system administration is not valid. The only
difference between a nix sysadmin and a windows one is that windows sysadmins get paid 10-20% less - but add the licensing fee etc and you come
out at the same price.
sd
Re:Sad but smart by anyGould · 2011-03-02 04:37 · Score: 1

Re-imaging is completely rational from a business perspective.
Particularly when you consider what the costs of downtime are. Around here, when the server goes down that leaves a hundred or so people in the warehouse completely idle. 100 people x $CRAZY dollars per hour (+ $OVERTIME at the end of the shift to catch up) means that it is almost *always* more cost-effective to the company to punt, reboot, get the system (and people) working again, and then troubleshoot later.
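
The "100 people x $CRAZY dollars per hour" estimate above can be written out as a back-of-the-envelope calculation. A rough Python sketch; the hourly rate, the overtime multiplier, and the assumption that every idle hour is later made up in overtime are placeholders for illustration, not figures from the comment:

```python
def downtime_cost(idle_workers: int, hourly_rate: float, outage_hours: float,
                  overtime_multiplier: float = 1.5) -> float:
    """Cost of staff sitting idle during an outage, plus overtime to catch up.

    Assumes (hypothetically) that every idle hour is worked again later at
    the overtime rate.
    """
    idle = idle_workers * hourly_rate * outage_hours
    catch_up = idle_workers * hourly_rate * overtime_multiplier * outage_hours
    return idle + catch_up


if __name__ == "__main__":
    # Placeholder numbers: 100 warehouse staff at $20/hour.
    print(downtime_cost(100, 20.0, 0.5))  # half-hour punt and reboot -> 2500.0
    print(downtime_cost(100, 20.0, 2.0))  # two hours of live debugging -> 10000.0
```

With these placeholder numbers, every extra hour of live troubleshooting costs $5,000, which is the pressure the comment is describing.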
Re:Sad but smart by hood8263 · 2011-03-02 05:02 · Score: 1

You also have to think about some things when handing code down. The lower level programmers may not be as efficient or knowledgeable in their coding and could add very inefficient code which can slow the program down. Also, if you program using any advanced techniques (thinking of a co-student of mine) they could easily be lost. I know people who could do a massive conditional statement in one line and it would work perfectly. You show that to someone who has minimal knowledge and their head would explode.
Re:Sad but smart by hood8263 · 2011-03-02 05:03 · Score: 1

A real sys admin would have a backup server to failover to and then fix the real problem on the main machine :)
Re:Sad but smart by Anonymous Coward · 2011-03-02 05:16 · Score: 0

It is essential to perform root cause analysis of the issue instead of simply re-imaging. As an example, if your service keeps crashing with an "Illegal instruction at 0x41414141" don't assume "it just happens." If you don't know what that means, you should investigate.
In this example, you have a service with a buffer overflow vulnerability that someone is attempting to exploit to compromise your machine. If you don't investigate this "strange behavior" you might receive a phone call from law enforcement 10 months later informing you about your breach that occurred months ago. If you blindly re-image, you would erase clues to the problem. Also, this crash might point out other issues with the infrastructure such as your patches are not being applied correctly, you have a problem with your firewall configuration (so intruders are getting in), etc.
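
Why 0x41414141 is a red flag: 0x41 is the ASCII code for "A", so that address is just the string "AAAA" read as a pointer, which suggests (but does not prove) that attacker-supplied filler overwrote a return address. A small Python sketch that decodes a 32-bit address this way; the function names are made up for illustration:

```python
def address_as_ascii(address: int) -> str:
    """Render a 32-bit address as the four ASCII characters it would spell."""
    raw = address.to_bytes(4, "big")
    return raw.decode("ascii", errors="replace")


def looks_like_filler(address: int) -> bool:
    """True if all four bytes are the same printable ASCII character."""
    raw = address.to_bytes(4, "big")
    return len(set(raw)) == 1 and 0x20 <= raw[0] <= 0x7E


if __name__ == "__main__":
    print(address_as_ascii(0x41414141))   # prints "AAAA"
    print(looks_like_filler(0x41414141))  # prints "True"
    print(looks_like_filler(0x0804A010))  # prints "False"
```

A heuristic like this only flags the crudest case; a crash address that does not look like text says nothing either way.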
Re:Sad but smart by fwarren · 2011-03-02 05:25 · Score: 1

If "treating cancer" was a pill without any major side effects that cost like 10 cents a day to take, and would keep a cancer growth at its current size, or shrink it to some degree and prevent it from spreading, whereas actually curing it is either not possible or costs like $100,000 or more, most people would be happy with just treating cancer.
It is all a matter of defining your terms. Even a competent sysadmin with hundreds of machines to maintain who has been tasked with other priorities may end up just reimaging a production server, because the first job they are being paid to do is to keep the business users that are paying for the system working. Some systems just cannot be shut down, and if reimaging a VM is what keeps the 99.9999 uptime, you do it.
It is called pragmatism.

--
vi + /etc over regedit any day of the week.
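
To put the "99.9999 uptime" figure from the comment above in perspective, here is a quick Python sketch of the downtime it allows. It assumes the number is a percentage measured over a 365-day year, which the comment does not state:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year, an assumption


def downtime_budget_seconds(uptime_percent: float,
                            period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Seconds of downtime allowed per period at the given uptime percentage."""
    return period_seconds * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    print(f"{downtime_budget_seconds(99.9999):.1f} s per year")       # about 31.5 s
    print(f"{downtime_budget_seconds(99.9) / 3600:.2f} h per year")   # about 8.76 h
```

At roughly half a minute of allowed downtime per year there is no room for live debugging on the production host, which is the comment's point about reimaging a VM.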
Re:Sad but smart by Anonymous Coward · 2011-03-02 05:32 · Score: 0

The case of a virus is a bit like a hacked server. In each case, you reimage to a clean state because you know exactly what the problem is AND you apply patches / better anti-virus to the image before you redeploy.
Would you reimage users' machines and send them back out into the wild exactly as vulnerable to the same infection or security hole as they were in the first place? Now let's say that the virus is running around your corporate network. A reimage is just going to buy the user a couple of hours before they're infected again. Reimage and redeploy a server? Well, too bad you didn't patch the security hole, because the box will just get compromised again.
The problem isn't reimaging; it's just reimaging and calling it good. Imaging is a great tool for solving problems; it's also a great tool for being a lazy hack.
Re:Sad but smart by secretcurse · 2011-03-02 05:38 · Score: 1

No, the person that finds a drug that will allow people with a broad range of cancers to live for about 80% of the average life expectancy for people without cancer, but live that 80% with a comfortable standard of living, will make billions. Provided the patient has to keep taking the drug for the rest of their life, of course.

--
I'm using all of my mod points to mod ancient memes down. Please join me.
Re:Sad but smart by Anonymous Coward · 2011-03-02 05:42 · Score: 0

Are you from the accounting department???
You are right though. All companies need server and email, but if all you have is 30 people, you cannot afford the full spread of an IT staff, or an Accountant or Lawyer. That is where outsourcing/consultants come in.
Somewhere on the scale to 5000+ users there is going to be a breakpoint for your specific company to bring in someone full time to admin various parts.
Re:Sad but smart by Anonymous Coward · 2011-03-02 05:45 · Score: 0

I see value in finding out what the problem is and why it happened.. if you just blindly re-image then the problem might pop up again at a less opportune time.
I can't tell you how many times in my life people have derisively called me "perfect" or told me that everything goes perfectly for me because of luck. No one wants to hear about how nothing goes wrong because every time something does go wrong I'm willing to spend hours or days figuring out WHY it went wrong so it won't happen again. You try that for a while and notice that your life gets easier as fewer things go wrong.
Re:Sad but smart by tnk1 · 2011-03-02 06:07 · Score: 1

I don't buy the fact that I have to be in perpetual tension with my manager's goals to get the right balance or to be a true sysadmin. I know better than the manager what will get production up and running in the fastest and safest manner. If I am constantly trying to keep the system down as long as I can to get a root cause, as opposed to determining the route to getting the host back in service as fast as possible, then what will happen is the manager telling me what he wants me to do, which will be sub-optimal because he's a manager. My work will also suffer because I am being constantly stressed and interrupted for status updates.
In fact, it would be my guess that half of the "reimages" that are being employed are because managers direct them to be executed because it is a simple one-word fix-all operation. Much like the other R-word: "reboot". It's my opinion that if the managers can actually trust a sysadmin to understand the need to keep up service levels, then there will be fewer boneheaded directives like rebooting or reimaging going on.
There is no question that a root cause is required, but in most cases, you have recorded the crash dumps and logs for that before your service restoration action. You can then work in the time that your band-aid has bought you to come up with a complete solution for the issue without being under pressure. Failing that, you can devise a test plan and logging scheme so that you can collect more information the next time it crashes. All of that while your company is making money and your managers aren't alternately pulling their hair out and pulling *your* hair out.
Don't get me wrong, reimaging is not a standard solution that I would imagine employing for production issues. If you have good change control, or even any sort of short-term memory at all, you probably know exactly what you changed to screw up your host. You just backtrack and change it back. The worst that you'd have to do is reboot in there somewhere. If your host worked once before, it will usually work again when you restore it to that previous incremental state, no reimage necessary (unless there is data corruption).
Nevertheless, if a reimage *could* provisionally get my host back in service and it was the fastest, safest thing to do, I'd do it in a heartbeat. I would thank Baby Jesus for making VMs so I could now reimage super-fast. The increased speed may even bump up the "reimage" option on the list of actions to take since it *is* much faster than an old-style reimage. The previously unthinkable is now not so bad.
People rely on your service to *work*, that's why they pay you. If you absolutely need to keep the system down to get the information you need to fix it permanently, then you do what you have to do, but don't consider that to be the ideal situation or even standard operating procedure.
In my opinion, the best sysadmins out there have their services back in service before their managers even realize it was down. They also make sure it never happens again, if possible. Those are not mutually exclusive goals.
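
The "record the crash dumps and logs before your service restoration action" step in the comment above can be scripted so that it costs seconds rather than minutes. A minimal Python sketch using the standard-library tarfile module; the function name, the archive naming scheme, and the example paths are all hypothetical:

```python
import tarfile
import time
from pathlib import Path


def preserve_evidence(sources, dest_dir, label="pre-reimage"):
    """Bundle the given log or crash-dump paths into one compressed archive.

    Paths that do not exist on this host are skipped. Returns the path of
    the archive that was written.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / f"{label}-{time.strftime('%Y%m%dT%H%M%S')}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for src in map(Path, sources):
            if src.exists():
                tar.add(src, arcname=src.name)
    return archive


if __name__ == "__main__":
    # Hypothetical locations; substitute whatever your hosts actually use.
    print(preserve_evidence(["/var/log/syslog", "/var/crash"], "/tmp/evidence"))
```

The archive should be written somewhere that survives the reimage (another host or network storage), otherwise the exercise is pointless.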
Re:Sad but smart by tnk1 · 2011-03-02 06:14 · Score: 1

How do you know that the issue that occurred wasn't because of how the server was initially setup and reimaging is just a time bomb waiting to recur?
You don't, but if you can intelligently determine that the initial configuration worked for X amount of time before the crash, and you can expect it to work for a non-trivial amount of time again until the next crash, then you can bring the image back up and you now have X amount of time to devise a permanent solution while being in service at the same time.
Obviously, reimaging instantly at the merest hint of trouble is colossally stupid, just like any other snap action would be. But if you understand the right circumstances where it could be employed to your advantage, then it doesn't have to be the best fix, it just needs to create the best outcome for you, and your employer.
Re:Sad but smart by Asmodae · 2011-03-02 06:23 · Score: 1

You're pushing the time you save back on the user who has to get his/her PC back to working order again. This can sometimes take days of tweaking and installing to get productivity back to where it was, depending on the tools and complexity of configuring them. If your re-image makes the user's computer exactly the way it was but sans virus, nobody will care but that's also unlikely to be true. But if you're just saving yourself time at the expense of the other guy getting his stuff back up and running... well then there could be plenty wrong with a re-image in this case. Especially when you consider every hour the user spends getting situated and back up to speed with an essentially new PC is an hour that everyone depending on that user now loses.
Re:Sad but smart by jaymz666 · 2011-03-02 06:36 · Score: 1

Unless of course condition Y that has never been encountered or tested before was the cause and that condition is going to be met again once the re-imaged server is brought online.
Re:Sad but smart by jgrahn · 2011-03-02 07:45 · Score: 1

But if you know what the problem is... and you have an image of the server in a working state, or a documented procedure on how to set up the server in its intended configuration then why would anyone waste time trying to repair it?
If you know what the problem is, you can generally fix it quickly, without downtime. On Unix, at least.
Re:Sad but smart by rcamans · 2011-03-02 08:07 · Score: 1

Smaller companies that cannot afford their own sysadmin should be renting rack space.
Come to think of it, most companies, small or big, should rent space (outsource).

--
wake up and hold your nose
Re:Sad but smart by gravis777 · 2011-03-02 08:24 · Score: 1

You are assuming that a company only has one image, or an individual user gets special treatment. Each person's configuration is due to which department they work in. If you create an image for the department, there should be no reason to custom configure the machine. Granted, you then have multiple images you have to maintain, but seriously, all you got to do is, if you need to update something, grab the old image, update it, sysprep, you got your new image. Saves tons of time.
Re:Sad but smart by gravis777 · 2011-03-02 08:27 · Score: 1

Hopefully your A/V vendor will have released a new dat file before you send the user back out there. In the example I gave, it took our vendor about 3 weeks to release a fix, which is why we were yelling and screaming.
Re:Sad but smart by Asmodae · 2011-03-02 09:45 · Score: 1

My perspective is that of an electrical engineer. During my daily job I interact with dozens of tools. Each part vendor and sometimes each part has a different tool to write code with, debug with, do boundary scan with, etc. We have people that are productive with VI, EMACS, Ultraedit, or even Notepad++. Each program (work project) mandates either Subversion or Clearcase for config management (both Linux and Windows versions). All these are different tools, and then you have VHDL simulation, VHDL synthesis, board schematic capture, FPGA timing analysis, board timing analysis, power consumption, and signal integrity, which all use different tools. In many of these cases there are multiple tools that do the job: we are evaluating them, or one tool is slightly better at one aspect than another and we need that capability, or one program requires an older version due to legacy support while another newer program wants the latest tools for integration and debug support. So now you have all these different tools to do different aspects of work in our workflow and often no less than 5-8 versions of each of those tools on the network.

There's no way you can come up with image sets that encompass all those use cases that then don't require extensive installation of extra software. And even if you could, all these tools create a workflow and not everyone does it quite the same way so your images require further tweaking by the user to get back to what they need in order to continue working in an efficient manner. Not to mention that any time they spend working on computer issues costs the company quite a bit more than the IT guys time.

Sure, office drones that do nothing but use Word might be able to get by with just a re-image and push, but if they are that restricted/limited why do they even need a full blown PC anyway? They could do their job on a cell phone with a docking station. :)
Re:Sad but smart by AK+Marc · 2011-03-02 11:05 · Score: 1

Unix/Linux systems don't just break for no reason,
Neither do Windows computers. The issue isn't whether there is some "reason" to why they break. The question is whether uptime is better served by rebooting and restoring from an image if the reboot doesn't work, as opposed to poking around in a live server that's down until you find the reason. Unix/Linux/Windows is irrelevant to this issue.

What is relevant is that many of the Unix/Linux admins come from the "poke at home as a hobby" group and the Windows ones come from a "do what the business wants" background. And the personal desires of the admin may be in conflict with what's in the best interests of the company. Act in a manner that minimizes downtime. Now, you tell me whether troubleshooting on a down server or imaging it and doing forensics later is better for a company's bottom line.

--
Learn to love Alaska
Re:Sad but smart by Leolo · 2011-03-02 11:52 · Score: 1

Especially if you buy a support contract, where the vendor will send someone competent out for the couple of times a year where something goes seriously wrong.
That sounds like an argument FOR using Linux.
Re:Sad but smart by tnk1 · 2011-03-02 13:23 · Score: 1

That is always a risk, but if it only takes 5 minutes to reimage, you have little to lose if you fail as soon as it comes up again or soon after. And in addition, you have taken an action which can then be pointed out as being a serious attempt to get things working again as fast as possible. In my experience bosses tend to give you more leeway if you can prove that the "quick and dirty" didn't work and you need the extra time to get to the bottom of it.
I'm definitely not promoting a reimage as a catch-all tactic or even a standard one. Most UNIX hosts don't need a reboot to fix issues, let alone a reimage, but if I can identify where it might legitimately get me running again faster, I will definitely consider its use. Running production isn't some sort of academic exercise, you don't get to determine the best solution irrespective of external requirements.
Re:Sad but smart by toddestan · 2011-03-02 15:29 · Score: 1

The main problem with the Mac server is that the hardware available from Apple is a joke and you can't run OS X server (legally) on anything else.
Re:Sad but smart by Anonymous Coward · 2011-03-02 21:11 · Score: 0

"I’m not a system admin but I don’t see how this is a bad approach."
You said it, you are not a system admin! >-(

Cost and primary business by xzvf · 2011-03-02 01:59 · Score: 1

An expensive part of most IT budgets is people costs. Unfortunately, if your primary business is not IT, it is also the easiest one to cut.

Re:Cost and primary business by trickyD1ck · 2011-03-02 02:52 · Score: 1

Unfortunately, if your primary business is not IT, it is also the easiest one to cut.
Fortunately, if your primary business is not IT, it is also the easiest one to cut.

FTFY

From personal experience by Xacid · 2011-03-02 01:59 · Score: 5, Insightful

"they punt and rebuild the server from scratch rather than dig deeper."

From personal experience this is normally due to management jumping down our throats to simply "get it done" which unfortunately runs counter to our inquisitive desires to actually solve the problem.

I suspect it's the end result of pressure to get more bang for their bucks in a tight economy, but that's pure speculation. It really could be a trend of the times.

Re:From personal experience by Anonymous Coward · 2011-03-02 02:04 · Score: 0

Exactly. My cousin recently got a virus on his laptop and needed it removed asap so he could travel with it, and he got it in the same way we told him multiple times not to do anymore. So, rather than spend an entire night getting it working with work in the morning, it was nuked with the recovery disk.
This served two purposes - it was running in the desired timespan, and since he lost all his files, it will hopefully keep him from repeating his mistakes in the future.
I might just be overly optimistic here, though, on the second part.
captcha: stress
Re:From personal experience by RocketRay · 2011-03-02 02:04 · Score: 1

Reminds me of our Windoze guy in my previous job. I had run into some kind of problem with XP at home and spent a good amount of time banging my head against it. Finally I told him what was going on and asked what he recommended. "Reinstall XP" was the answer. I thought he was joking, nope he was serious. :P
Re:From personal experience by VolciMaster · 2011-03-02 02:06 · Score: 1, Interesting

"they punt and rebuild the server from scratch rather than dig deeper."
From personal experience this is normally due to management jumping down our throats to simply "get it done" which unfortunately runs counter to our inquisitive desires to actually solve the problem.
I suspect it's the end result of pressure to get more bang for their bucks in a tight economy, but that's pure speculation. It really could be a trend of the times.
Having witnessed this type of behavior across myriad companies and industries, I can say the rebuild/clone/redeploy approach is used NOT because of "pressure to get more bang for their bucks" - it's that it is inherently easier to do this approach than to deep-dive perhaps for days to find The Answer(tm). In an environment of thousands of servers (or even dozens), deep-diving into a problem [generally] is a waste of time. While it is interesting intellectually, there is no other benefit.

--
antipaucity
Re:From personal experience by Anonymous Coward · 2011-03-02 02:13 · Score: 3, Insightful

To a small degree, you are correct. The bigger problem is that in *nix pretty much all the tools you need are available to you, but in the Windows world everything costs money. So often the solution comes down to either spend money to fix it or spend time to rebuild it. Since management thinks computers are simple push button things, "just reboot" becomes the go-to solution.
Re:From personal experience by Anonymous Coward · 2011-03-02 02:18 · Score: 0

it's that it is inherently easier to do this approach than to deep-dive perhaps for days to find The Answer(tm).

"42". HA! Got it first!
Re:From personal experience by Nerdfest · 2011-03-02 02:19 · Score: 4, Insightful

As I've said below, there is a benefit ... you can actually investigate and fix the problem rather than the symptom. The bonus with VMs though is that you can frequently do both. You can create a copy of the VM to dig into, and create a new fresh instance for production to get them working again.
Re:From personal experience by laffer1 · 2011-03-02 02:29 · Score: 3, Insightful

There is benefit because there is downtime even if only a few minutes to restore the VM. What if the software running in the VM is old and someone has been attacking it? Restoring will result in the same problem a few days or hours later.
If there is a bug in a specific kernel version that's not playing nice with the VM, it will cause stability problems again.
Redeploying and finding the problem is the only real answer. In the long run, it may save work.

--
MidnightBSD: The BSD for Everyone
Re:From personal experience by Anonymous Coward · 2011-03-02 02:31 · Score: 0

very true - and please remember that the computers/network/I.T. department is there to help the company do whatever it does. I.T. is not an end to itself.
Knowing what happened/broke is essential so that we can prevent it from happening again, but "real server admin skills" include the ability to minimize downtime (*cough* re-image *cough*) ...
Re:From personal experience by Darth_brooks · 2011-03-02 02:34 · Score: 4, Insightful

....and his was the right answer. With XP, you're almost certainly talking about a client machine. Why bother dicking with it? It's a hundred dollar OS on a four hundred dollar piece of hardware. Wipe, reload, move on to big boy problems. Even if you're talking about a problem that ends up affecting a number of users, and it happens to be a client side problem, you're farther ahead to nuke and reload.
In my last position I was the only end user support guy for 150 to 200 people. If I sat around and fucked with every little nuance of XP and its associated ills, I'd have ended up even farther behind than I was when I left. I wrote up a quick backup script that grabbed anything the user didn't (against company policy) store on the network drive, grabbed their local e-mail (Notes), then nuked the machine and reloaded. I could take a user who was dead in the water and have them back up and running in 15-20 minutes. If they had a lot of data to restore, maybe 35-45. Spending an hour 'troubleshooting' was a waste of company time, and my time.
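For anyone wondering what a pre-wipe backup script like that might look like, here is a rough sketch. It is not the script described above, just one way to do it, and everything in it (the function name `backup_local_data`, the folder list you pass in, the destination) is made up for illustration. It assumes Python 3.8 or newer.

```python
import shutil
from pathlib import Path


def backup_local_data(sources, dest_root):
    """Copy each existing source directory under dest_root.

    Returns the list of backup directories that were written.
    Note: two sources with the same folder name would be merged
    into one target; a real script would want to handle that.
    """
    dest_root = Path(dest_root)
    dest_root.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, sources):
        if not src.is_dir():
            continue  # this user does not have that folder, skip it
        target = dest_root / src.name
        # dirs_exist_ok lets a re-run refresh an earlier backup (Python 3.8+)
        shutil.copytree(src, target, dirs_exist_ok=True)
        copied.append(target)
    return copied
```

You would call it with whatever local folders matter on your machines (desktop, documents, local mail store) and a per-user folder on the network share, then check the returned list before wiping anything.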

--
There are some people that if they don't know, you can't tell 'em.
Re:From personal experience by jcoy42 · 2011-03-02 02:35 · Score: 3, Insightful

deep-diving into a problem [generally] is a waste of time. While it is interesting intellectually, there is no other benefit.
There can be a benefit. I generally try to get the system working first, then figure out what went wrong. And sometimes it takes a few days of poking at it to figure it out, but when a problem like that comes up again, I'm ready for it.
That's the benefit of an experienced system administrator. Anyone can just make it work again, but someone who has been doing that for a few years is going to be used to writing scripts that hunt for said issues and either correct the problem on the fly or send a notification with some details about where to look first.
I've seen the "make it work and move on" approach result in systems that become increasingly unstable because no one ever tracks down the root problem.
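A minimal sketch of the kind of script meant here, assuming the problems you have already tracked down leave a recognizable line in a log. The two signatures and their hints are placeholders I made up; a real list would come from your own incidents.

```python
import re

# Hypothetical signatures: regex for a known problem -> where to look first.
KNOWN_ISSUES = {
    r"No space left on device": "check disk usage on the affected filesystem",
    r"Out of memory": "look for a leaking process and review memory limits",
}


def scan_log(lines, known_issues=KNOWN_ISSUES):
    """Return (line_number, hint) for every line matching a known signature."""
    compiled = [(re.compile(p), hint) for p, hint in known_issues.items()]
    findings = []
    for number, line in enumerate(lines, start=1):
        for pattern, hint in compiled:
            if pattern.search(line):
                findings.append((number, hint))
    return findings
```

Feed it the lines of whatever log you care about and mail or page the findings; the point is only that each root cause you dig out once becomes a check that runs forever after.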

--
Never trust an atom. They make up everything.
Re:From personal experience by J4 · 2011-03-02 02:35 · Score: 1

You win a fine cigar!
Re:From personal experience by Ephemeriis · 2011-03-02 02:36 · Score: 0

While it is interesting intellectually, there is no other benefit.
Well, it'd be nice to find the root cause so that you don't see the problem pop up again on your new server image...
But, yeah. If you can get it up and running with a re-image in a matter of minutes/hours instead of digging around for hours/days... Don't waste the time.

--
"Work is the curse of the drinking classes." -Oscar Wilde
Re:From personal experience by Xacid · 2011-03-02 02:37 · Score: 1

Valid point, but it does have its merits if it's a recurring problem. A wise manager will know when to call for deeper inspection.
And to be fair - I'm fine with reimaging a system to fix a problem if it's not recurring as the downtime typically isn't worth it.
Re:From personal experience by Tom · 2011-03-02 02:46 · Score: 4, Insightful

In an environment of thousands of servers (or even dozens), deep-diving into a problem [generally] is a waste of time. While it is interesting intellectually, there is no other benefit.
Except, of course, finding what the heck was wrong in the first place and fixing it, preventing future outages.
Sometimes, rebuilding is faster than fixing, and in some contexts, it makes sense. Even then, the original machine should still be examined and the "root cause" (if you need a management buzzword) identified. At the very least, a reasonable amount of time should be given towards the attempt. It's true that it is pointless to dig around for days and days - but that is not a reason to not at least start looking, as it might turn out you only need a few hours. And more often than not, finding the real problem tells you something that helps you
a) fix other bugs,
b) avoid the same problem on the next server,
c) avoid a repeat performance,
d) makes you realize what you thought was a random server crash was really a break-in / hardware failure / systematic problem and other, additional steps need to be taken.
All of the above have happened before, you would by far not be the first.
A proper incident management process does allocate resources towards follow-up examination. The right thing to do is not suppress it with generic blabla about wasted time, but to set the proper amount of resources for your organisation. Maybe it's half an hour and no money, so some sysadmin can check the logs and do a quick check-up. Maybe it's a full-out forensics analysis. That depends on your needs, your resources, your environment and context.

--
Assorted stuff I do sometimes: Lemuria.org
Re:From personal experience by causality · 2011-03-02 02:51 · Score: 5, Insightful

Reminds me of our Windoze guy in my previous job. I had run into some kind of problem with XP at home and spent a good amount of time banging my head against it. Finally I told him what was going on and asked what he recommended. "Reinstall XP" was the answer. I thought he was joking, nope he was serious. :P
Not only was he completely serious, he probably can't understand why you might have thought he was joking.
The idea that it's a black box and you shouldn't expect to understand how or why something happened is definitely one of the more subtle costs of Microsoft systems. It lends credibility to the (false) notion, so common among average users, that you're either a completely unskilled newbie or a serious expert who can discern the inner workings of the mysterious black box. It discourages middle ground for intermediary skill levels, the kind of thing that would otherwise occur naturally as users gain experience over time.
Most of all, it supports the falsehood that it's unreasonable to expect the most basic competence from non-experts.

--
It is a miracle that curiosity survives formal education. - Einstein
Re:From personal experience by foobsr · 2011-03-02 02:53 · Score: 1

I suspect it's the end result of pressure to get more bang for their bucks in a tight economy, but that's pure speculation. It really could be a trend of the times.
Both? Maybe interaction?
CC.

--
TaijiQuan (Huang, 5 loosenings)
Re:From personal experience by Anonymous Coward · 2011-03-02 02:54 · Score: 0

With windows you *can* mess it up that badly. In linux it is harder to do as everything is not in a central registry. In linux/unix I can nuke just the offending spot in the system and fix it with negligible downtime. With windows you get a driver in there and windows refuses to load anything but that one... And the info is sprayed across 50 places in the registry. I can even take the offending spot out of the working system and move it to a sandbox and figure out what is wrong.
From a business perspective, though, I can spend 2-5 days figuring out why something is busted, or reimage the box in under an hour. If it takes me 5 days to figure it out, and I can reimage 10 boxes a day, it had better be worth the (2 to 5)*8 hours of work, or nearly 50 boxes with downtime on them. If it is my home computer it may be worth digging in. It is a cost tradeoff. Both ways cost money. It is a matter of whether it is worth the time to fix it or not. Especially when you've got the CEO's pet group sitting around surfing the web because they cannot get at the servers to do work, and barking at him every 10-15 mins.
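The arithmetic behind that, spelled out as a quick sketch. The 8-hour day and the 10 boxes a day are the figures from the post; the function names are just for illustration.

```python
HOURS_PER_DAY = 8  # working day assumed in the "(2 to 5)*8" figure


def investigation_hours(days):
    """Hours spent digging into one problem for the given number of days."""
    return days * HOURS_PER_DAY


def boxes_reimaged_instead(days, boxes_per_day=10):
    """Boxes that could have been reimaged in the same number of days."""
    return days * boxes_per_day


for days in (2, 5):
    print(f"{days} days: {investigation_hours(days)} h of digging, "
          f"or {boxes_reimaged_instead(days)} boxes reimaged")
# 2 days: 16 h of digging, or 20 boxes reimaged
# 5 days: 40 h of digging, or 50 boxes reimaged
```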
Re:From personal experience by Anonymous Coward · 2011-03-02 02:55 · Score: 0

From my experience it's due to management insisting on hiring IT workers with no real IT experience and all they know to do is either reboot or re-image.
Re:From personal experience by bberens · 2011-03-02 02:59 · Score: 1

If you'd been making proper backups then restoring to a previous "good state" should be simple. Also, if you count how many hours you'd already spent on it you probably could have gotten it going again from scratch in that time.

--
Check out my lame java blog at www.javachopshop.com
Re:From personal experience by Midnight+Thunder · 2011-03-02 03:00 · Score: 1

it's that it is inherently easier to do this approach than to deep-dive perhaps for days to find The Answer(tm).
"42". HA! Got it first!
But what was the question?

--
Jumpstart the tartan drive.
Re:From personal experience by IICV · 2011-03-02 03:14 · Score: 1

It lends credibility to the (false) notion, so common among average users, that you're either a completely unskilled newbie or a serious expert who can discern the inner workings of the mysterious black box.
How is that false? In Windows, moving beyond the pretty clicky clicky click interfaces is deep, dark fucking Voodoo that nobody actually understands! There are no serious experts in Windows.
I mean for instance: at my last job, the domain server was going crazy. I wasn't a sysadmin, so I didn't fix it, but I did spend a couple of long nights helping the guy who did - and he was convinced it was because our desktop imaging procedure (which he had written, natch) didn't include changing the desktop computer SIDs so we had to go around one night and run NewSID on every single computer in the company.
And then, like six months later, Mark Russinovich himself says that in fact Windows doesn't actually use the client SID for anything, and it's totally okay to not change it - and in fact, they've deprecated NewSID for that very reason.
So yeah, even people who are theoretically Windows "experts" - even Mark Russinovich, who supposedly understood Windows so much better than anyone at Microsoft that it got him hired there - don't really know how Windows works. What sort of a chance does anyone else have?
Re:From personal experience by drinkypoo · 2011-03-02 03:20 · Score: 1

In my last PC support type position I spent lots of time both cleaning and reimaging systems. If there was data I'd try to clean them, then I'd try to port data and wipe. No data on the client machine, all on the network where it belongs? Wipe that thing with impunity. You do have filesystem images, right? Right?

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:From personal experience by Junior+J.+Junior+III · 2011-03-02 03:23 · Score: 1

Having witnessed this type of behavior across myriad companies and industries, I can say the rebuild/clone/redeploy approach is used NOT because of "pressure to get more bang for their bucks" - it's that it is inherently easier to do this approach than to deep-dive perhaps for days to find The Answer(tm). In an environment of thousands of servers (or even dozens), deep-diving into a problem [generally] is a waste of time. While it is interesting intellectually, there is no other benefit.
I'd disagree with "there is no other benefit."
Off the top of my head, there are at least two benefits:
1) Finding and fixing the problem once and for all prevents the problem from recurring and/or cascading. Even if you can fix the problem by re-imaging the system, and this costs very little in labor (you kick off a script and wait) there are still costs associated with re-imaging, primarily the system downtime incurred. How often are you re-imaging? If it gets to be frequent enough, fixing the root cause of the problem is quite possibly cheaper in the long run.
2) Oftentimes these deeper problems may have security implications, which are better to fix than leave latent.
Whether the benefits are worth the expense (time and resources) to realize them, or whether the same time and resources could bring better benefit if directed elsewhere is the real question; it's not that diving deep and fixing root cause problems in systems lacks any benefit beyond that it's intellectually interesting.

--
You see? You see? Your stupid minds! Stupid! Stupid!
Re:From personal experience by Wheely · 2011-03-02 03:32 · Score: 1

Trouble is, for you to determine it is a recurring problem, then it will have to have failed at least twice. Personally I like to try and avoid the second time.
For those of us who have been doing this for years and years, we find that Unix boxes rarely fail for everyone at once unless you are doing the one application per VM cost FAIL. It is usually not acceptable to take down finance because the system is not working for HR. Problems that affect the entire machine are usually easy enough to track down and work around, if not fix.
Re:From personal experience by swalve · 2011-03-02 03:37 · Score: 0

That is because management understands that IT is in the business of providing a service. The problem that needs solving is getting the users back to productivity. Accomplishing that in the most efficient manner possible is the right answer. Only when things start to repeat does "why" become the problem and figuring that out becomes the more efficient answer.
I used to work for a client that ran Novell across hundreds of desktops. They would send out updates via the App Loader / Zenworks mechanism. Every time, a couple percent of them would fail. What does that say? "The update caused it!" No. The update worked on 98% of the machines. What it really said is that 2% of your machines are fucked up, and probably have been for quite some time. Going out and reimaging them fixes most of the problems, and exposes hard drive failures, spyware infestations, memory or network issues.
It's like sending in arson investigators before the fire is put out. One in a hundred times they might recover evidence that would otherwise burn up. Unfortunately, one in five times the investigator gets killed. Same thing here. Knowing "why" a one-off problem happens is not valuable. Conserving resources is. If it is a trend of the times, it is about time. A low-level tech might get his first boner because he figured out that miniscule.dll or /usr/library/version.last.week/ridiculous.c is corrupted, but that's not what he is there for. If figuring that out took him all day, and a reimage would have taken an hour, he's wrong.
Re:From personal experience by hedwards · 2011-03-02 03:39 · Score: 1

And in the meantime you've got a VM of dubious reliability mucking around in customer records. Sure you've probably wiped any trojans that might be infecting the machine, but it doesn't mean that you're not at an increased risk of losing a significant amount of data to an unspecified problem. Sure it does work and probably most of the time, but is it really wise to assume that it isn't a very serious problem just because you can restore the VM?
Plus if the application is that critical there should be redundancy already or some other contingency plan to handle the time it takes to figure out what the problem is. Probably the only thing that angers employees more than a service outage is finding out that some of their work has been lost and has to be done over. Which depending upon the backup situation could be enough to kill a company.
Yes, that's damn near a worst case scenario, but I do think that those people who assume that they can just reimage the VM and be OK need to seriously consider whether or not they've considered the full range of implications. In practice I doubt the scenario I put forward is very common.
Re:From personal experience by Anonymous Coward · 2011-03-02 03:46 · Score: 0

The problem occurred from this "good state". Unless it was a mistake done by the admin after this point, whatever caused the problem in the first place is still in that "good state". If you don't know what triggered it, you don't know when it will happen again.
Re:From personal experience by Fastolfe · 2011-03-02 03:55 · Score: 1

I generally agree, but only at small scales. At truly large scales, even a well-designed system will suffer from things like random bit flips in networking/storage layers (weak checksum collisions do occur). At large scales, it actually becomes counter-productive to even give a token effort at investigating every problem that crops up. Document (statistically) the failure, re-image, and if the same event happens multiple times (and it will, if it's a legitimate bug, if you're running at a large scale), *then* spend some time investigating it.
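One way to read "document, re-image, investigate on repeat" as code: keep a tally per failure signature and only flag it for a human once it recurs. This is a sketch under my own assumptions; the class name, the idea of a string "signature", and the default threshold of 3 are all invented for the example.

```python
from collections import Counter


class FailureLog:
    """Tally failures by signature and flag the ones that keep coming back."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # hypothetical cut-off, tune per fleet size
        self.counts = Counter()

    def record(self, signature):
        """Count one failure; return True once it has recurred enough to investigate."""
        self.counts[signature] += 1
        return self.counts[signature] >= self.threshold
```

A one-off bit flip gets recorded and forgotten; a real bug crosses the threshold quickly at scale and earns its investigation time.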
Re:From personal experience by jaymz666 · 2011-03-02 04:00 · Score: 1

While that may make sense on one level, the time it takes to setup the software on that machine is not zero.
Re:From personal experience by Attila+Dimedici · 2011-03-02 04:07 · Score: 1

I have spent a lot of time cleaning up XP problems. With one or two exceptions, the time spent tracking down and then fixing the problem is not worth the time involved. It is usually much faster to format the box and reinstall/restore from backup. My home machines are all imaged to an external hard drive to make doing so a breeze (even my Linux machine, although that is for when I am experimenting with configuration, software and/or hardware and want a quick and easy way for putting it back the way it was)

--
The truth is that all men having power ought to be mistrusted. James Madison
Re:From personal experience by causality · 2011-03-02 04:12 · Score: 2

How is that false? In Windows, moving beyond the pretty clicky clicky click interfaces is deep, dark fucking Voodoo that nobody actually understands! There are no serious experts in Windows.
That's the problem I am talking about, yes. You've restated it more succinctly than I did.
The notion that there are no intermediary skill levels between "drooling noob" and "serious expert" is the false part. That's easy to explain: some Windows users are more skilled than others. If you need a concrete example, some are much better about avoiding malware and such than others. That one in particular isn't so much a matter of what you do, but how you do it.
The thing is, the design of Windows and the culture surrounding it tends to encourage people to believe that they should never, ever have to learn anything about it. The truth is, the users who don't resist gradually acquiring more knowledge over time as they gain experience have a better experience with fewer problems compared to the users I like to call "permanent newbies".
If you expect a permanent newbie to know how to perform a very basic office-type e-mail task on the grounds that he's been using that same e-mail program (i.e. Lotus Notes) since its release in 1989, he switches to resentment mode and falls back on excuses like "I am not a computer expert!" Well that's good, because this task doesn't require an expert. Yet somehow in over 20 years of use the user never learned anything about the program except the one or two features he most commonly uses.
To be that resistant to even accidentally noticing readily accessible pieces of information that you witness on a daily basis ... well, it takes a lot more effort than just reading the fucking manual. Really. I'm amazed they can accomplish this at all without daily use of amnesia-inducing drugs. Anyone who has worked front-line tech support or helpdesk jobs has seen this.

--
It is a miracle that curiosity survives formal education. - Einstein
Re:From personal experience by Anonymous Coward · 2011-03-02 04:13 · Score: 0

It really could be a trend of the times.
Then I argue that, with the new trend of the times, they relabel these positions to something other than 'System Administrator'. With all the tools available on linux to figure out a problem, the solution is to reinstall without investigation? So knowing the ins and outs of linux and FOSS is becoming a phased out practice? Yea right! All goes back to scale of infrastructure I guess.
Re:From personal experience by rocker_wannabe · 2011-03-02 04:18 · Score: 1

Yes indeed! Ignorance is bliss! And you can always rationalize that it's not your JOB to actually understand the problems, just make them go away.
Until, of course, that ignorance has led you down the garden path to your own demise. I certainly wish it was true that "what you don't know won't hurt you". Unfortunately, in IT in particular, that has never held true for me.

--
"Meaningless!, Meaningless!" says the Teacher. "Utterly meaningless!"
Re:From personal experience by ArsonSmith · 2011-03-02 04:19 · Score: 1

Hmm, 15-20 minutes times 150-200 people is a lot more than an hour figuring out what the root problem is. Especially when in 3 weeks you're doing the same thing over again.

--
Paying taxes to buy civilization is like paying a hooker to buy love.
Re:From personal experience by Ultra64 · 2011-03-02 04:26 · Score: 1
"Wipe, reload, move on to big boy problems. "

You mean:
- Wipe
- Reload
- Windows updates
- Install antivirus
- Windows updates
- Find installation discs for all your programs
- Windows updates
- Remember where the license keys are for your programs.
- Windows updates
- Realize that you forgot to back up some important file you had in a folder somewhere.
Re:From personal experience by drsmithy · 2011-03-02 04:47 · Score: 1

How is that false? In Windows, moving beyond the pretty clicky clicky click interfaces is deep, dark fucking Voodoo that nobody actually understands! There are no serious experts in Windows.
Of course there are.
Are there people who know every aspect of every component of Windows inside and out ? Probably not. However, the same is true of every platform these days, they're all too big for any one person to comprehensively understand top to bottom.

So yeah, even people who are theoretically Windows "experts" - even Mark Russinovich, who supposedly understood Windows so much better than anyone at Microsoft that it got him hired there - don't really know how Windows works. What sort of a chance does anyone else have?
It's hard to see how you've reached that conclusion.
Re:From personal experience by causality · 2011-03-02 05:01 · Score: 1

Yes indeed! Ignorance is bliss! And you can always rationalize that it's not your JOB to actually understand the problems, just make them go away.
Until, of course, that ignorance has led you down the garden path to your own demise. I certainly wish it was true that "what you don't know won't hurt you". Unfortunately, in IT in particular, that has never held true for me.
A lot of people have a type of intellectual laziness which they are quite eager to justify. Sometimes they can successfully justify it by placing emphasis on the effort required to truly resolve a matter (while downplaying what would be gained). Sometimes it's obvious (to everyone but them) that they're just making excuses for a personal shortcoming.
The "tell" is that when investigation of a problem really must be done and there's no way around that, they respond with annoyance and disappointment instead of fascination and curiosity.

--
It is a miracle that curiosity survives formal education. - Einstein
Re:From personal experience by hood8263 · 2011-03-02 05:06 · Score: 1

This is what backup servers are for. You should always have a way to failover to another machine in case something like this happens.
Re:From personal experience by hood8263 · 2011-03-02 05:12 · Score: 1

To be that resistant to even accidentally noticing readily accessible pieces of information that you witness on a daily basis ... well, it takes a lot more effort than just reading the fucking manual. Really. I'm amazed they can accomplish this at all without daily use of amnesia-inducing drugs. Anyone who has worked front-line tech support or helpdesk jobs has seen this.
ugh... Don't remind me. I see this all the time... I end up face palming way too much
Re:From personal experience by anyGould · 2011-03-02 05:27 · Score: 1

very true - and please remember that the computers/network/I.T. department is there to help the company do whatever it does. I.T. is not an end to itself.
Knowing what happened/broke is essential so that we can prevent it from happening again, but "real server admin skills" include the ability to minimize downtime (*cough* re-image *cough*) ...
Or, to phrase another way, think ROI - how many "reimages" will you have to prevent to break even on the amount of time spent solving the problem?
I agree a post-mortem is never a bad idea, but it's not always an effective use of time.
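The break-even question can be put as a one-liner. A rough sketch, assuming you only count the admin's time (user downtime and security risk are left out on purpose); the function name and the sample numbers are mine.

```python
import math


def break_even_reimages(investigation_minutes, reimage_minutes):
    """Smallest number of avoided reimages that repays the investigation time."""
    if reimage_minutes <= 0:
        raise ValueError("reimage_minutes must be positive")
    return math.ceil(investigation_minutes / reimage_minutes)


# Hypothetical numbers: a full 8-hour day of digging vs. a 20-minute reimage.
print(break_even_reimages(8 * 60, 20))  # 24
```

So with those made-up figures, a day of root-cause work pays for itself only if it prevents two dozen reimages. Plug in your own numbers.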
Re:From personal experience by Darth_brooks · 2011-03-02 05:49 · Score: 1

See, this is why you have A. images. B. Up to date images with windows updates. C. Images with the necessary programs already loaded so you don't have to go reloading. D. Anti-virus software that can be pushed automatically and E. backups.

--
There are some people that if they don't know, you can't tell 'em.
Re:From personal experience by Darth_brooks · 2011-03-02 05:51 · Score: 1

When the root problem doesn't affect more than one person, it makes sense. When it affects everyone, it's likely not a client problem or there's a far easier solution than reimaging.

--
There are some people that if they don't know, you can't tell 'em.
Re:From personal experience by toadlife · 2011-03-02 06:31 · Score: 1

I have everything after the second step automated.

--
I don't always use unix-like operating systems; but when I do, I prefer FreeBSD.
Re:From personal experience by tnk1 · 2011-03-02 06:33 · Score: 1

Hmm, 15-20 minutes times 150-200 people is a lot more than an hour figuring out what the root problem is. Especially when in 3 weeks you're doing the same thing over again.
Completely correct, although optimization for efficiency does not need to stop at imaging. If you see a problem extremely often, it may not be such a bad idea to actually spend an hour or two determining a point solution for it. If the volume is high enough, you will make up that research time with interest, even if it only saves you 5 minutes here and there. It will also have the psychological side effect of making people appreciate you as something other than an IT janitor or image monkey if you can actually explain the issue and how it is fixed.
Hell, if you're really lucky, you can document it and have the user correct it themselves, something that can't happen if your only answer is to reimage.
Still, you are absolutely right, it's hard to go wrong in a desktop support situation by reimaging. Even small shops should have network storage and re-imageable workstations as a support strategy. It is a huge force multiplier and it's got a lot of support these days.
Re:From personal experience by Osgeld · 2011-03-02 06:36 · Score: 1

this is 99% eliminated with 99cent shareware, or in windows7 it comes built in
besides, let windows update run in the background, you don't need to be there as it downloads next to the clock
Re:From personal experience by Anonymous Coward · 2011-03-02 06:38 · Score: 0

I think it's a little of both.
Re:From personal experience by zero0ne · 2011-03-02 07:17 · Score: 1

No, he means wipe, reload with proper baseline image, and go onto other problems.
It really isn't that hard these days to PXE boot a machine and create and deploy images. Hell there are multiple freeware packages out there that help you do this if you don't have the cash to purchase a solution.
On my end, it literally is a few clicks (or running a batch script that kicks off the process) and the PC will:
- Reboot
- Boot up to WinPE for imaging
- image deploys to station
- script runs to dynamically load correct hardware drivers (Hardware independent imaging)
- PC reboots
- sysprep takes over and reloads any other drivers (the image was preloaded with DriverPacks, so sysprep scans the driverpacks directory and can dynamically load those too - IE sound, video, chipset etc).
- PC reboots
- PC gets joined to domain and reboots again
- Software gets installed (assuming the image didn't already have software on it)
- PC reboots
User can now use PC.
Of course this takes time, but working at a call center, it makes it so easy to wipe an entire group of 50-100+ stations at one time. I can have a new image deployed within 4 hours to a large group of computers.
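The sequence above can be written down as an ordered list, which is roughly what a kickoff script has to walk through. This is only a dry-run sketch in Python: the step names paraphrase the list, the station name is hypothetical, and nothing here talks to WinPE, sysprep, or a real deployment tool.

```python
# Steps in the order described above; "reboot" entries are kept separate
# so they can be counted.
STEPS = [
    "reboot",
    "boot into WinPE for imaging",
    "deploy image to station",
    "run script to load hardware drivers",
    "reboot",
    "sysprep loads remaining drivers from DriverPacks",
    "reboot",
    "join domain",
    "reboot",
    "install software",
    "reboot",
]


def dry_run(station, steps=STEPS):
    """Return the log lines a dry run would produce for one station."""
    return [f"{station}: step {n}: {step}" for n, step in enumerate(steps, start=1)]


def reboot_count(steps=STEPS):
    """Count how many times the station restarts during the pipeline."""
    return sum(1 for step in steps if step == "reboot")


for line in dry_run("station-01"):
    print(line)
print("reboots per station:", reboot_count())
```

Counting the reboots makes it obvious where the wall-clock time goes, and why doing 50-100 stations in parallel matters more than shaving any single step.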
Re:From personal experience by Anonymous Coward · 2011-03-02 07:28 · Score: 0

Necessity is the mother of invention.
Your support methods make perfect sense to me. I used to do something similar repairing the old green-screen terminals at a newspaper where I worked. I was by far the least knowledgeable electronics technician, but I was really productive fixing the terminals. While the really experienced techs would spend hours troubleshooting each board with a scope down to component level, I'd just swap out the 8 or so electrolytic caps on the circuit board and see if it fixed the problem. I could get through several an hour like that.
In truth, I know I was "cheating" but the end result was that we always had repaired spares ready to swap out when an aging terminal went "foom" (which happened with alarming frequency)
Re:From personal experience by sysrammer · 2011-03-02 07:34 · Score: 1

Jokes about windows & noobs aside, I've noticed that windows has lowered the bar in several ways. One way is that windows problems can be solved by a reboot. As long as that does not become too disruptive, people are more or less fine with that, because you don't need to be an expert or call an expert "to fix your problem". When it does become too disruptive, then an expert can be hired for a short period to come in & troubleshoot.
Do I like it? Eh, I live with it. It does remind me of someone that had an old Volkswagen. He loved it because it was so easy to work on. I made the mistake of going on a road trip with him once. To my regret, I found out why it needed to be easy to work on.
Ok, so what's my point? I guess I'm just venting/rambling. It is what it is.
sr

--
His ignorance covered the whole earth like a blanket, and there was hardly a hole in it anywhere. - Mark Twain
Re:From personal experience by internettoughguy · 2011-03-02 07:56 · Score: 1

The GP was talking about imaging the PC, not reinstalling, so your entire list is irrelevant, save the last entry. The last entry is easily solved with a fairly simple script, as the GP also pointed out.
Re:From personal experience by Ultra64 · 2011-03-02 08:05 · Score: 1

Oops, I wasn't thinking about the corporate side of computer systems.
I work at an ISP/computer repair shop so we deal with a lot of home computers which aren't as amenable to the imaging solution.
Re:From personal experience by Anonymous Coward · 2011-03-02 09:19 · Score: 0

You win a fine cigar!
Thank you Mr. Clinton! Don't mind if I do!
Re:From personal experience by Anonymous Coward · 2011-03-02 09:41 · Score: 0

Or it could be..... a damn good idea. Virtualisation makes returning a system back to where it was at the last snapshot *dead* easy - and frankly you know what? I might be able to do all of the Unix gee whiz troubleshooting - and I also tend to troubleshoot Windows rather than reboot - part of being fucking good is about bringing the systems back online FAST so ....... spend 8 hours searching for a problem or one click, return a snapshot in 10 minutes?
And if I want to troubleshoot? Clone the system, put it off to one side, return to last good snapshot, no one knows any better and I can work out WTF went wrong while the system is working elsewhere. Honestly, that's the bloody point of VMs - snapshots and cloning make my life better and make me look awesome because I look like I have fixed the problem from the user's POV - they have a working system, which is the point for the company paying me.
The clueless admins are the ones who don't take advantage of the possibilities of virtualisation to do what they are paid to do, and that's to have fuck all downtime. And even in the team I lead, there are good Unix guys who haven't gotten the message, and when I have had a frustrated CEO on the phone asking what is going on, I sigh, go to the admin's desk, stop what they are doing, clone, revert snapshot, CEO is ecstatic, Unix admin gets a bitchslap and it's all done in 15 minutes. I'll explore what went wrong when the phones are quiet and then if there's something I need to attend to in the production VM, I'll snapshot, put the fix in, test and voilà.
I'm not a real fan of virtualisation but I do see the advantages if done right. The Unix admins are the ones who have it wrong now and don't understand that when you don't have tin, you do things differently. However I do not in any way think virtualisation reduces needed skills, in fact I think to do it right you actually need to be f****** good.
Re:From personal experience by AK+Marc · 2011-03-02 11:16 · Score: 1

Sadly, I did a business analysis on employee computers, and it was cheaper to discard computers with issues and buy a new one. Even cheaper than reimaging, because reimaging might not fix it (failing HD or power supply causing blue screens, etc.) and the cost of employee time in running diagnostics, packing up the computer, shipping it off, and receiving it exceeded the cost of just throwing it away and ordering a new one.

It's not the most environmentally responsible option, but it is cheapest, depending on the cost of employees and the price of the computers you buy. There, the lowest tech had been there over 10 years, and was not cheap, and we had a deal with Dell where the computers came with our corporate image already on them, loaded at the factory. But, like most places, if we were bored, we'd pull out a graveyard machine and play with it, usually ending up as second computers for someone in the IT department.

--
Learn to love Alaska
Re:From personal experience by Anonymous Coward · 2011-03-02 22:04 · Score: 0

Well, IMHO I think the right approach would be a mix of the two.
Re-imaging or restoring from a working previous backup/snapshot/whatever would help you in the short term and would give the experienced administrator time to investigate the problem and solve it the right way, without the pressure of urgency.
Not all work environments and IT setups allow this; it depends a lot on the quality of your infrastructure.
We had a similar problem with a clustered Windows 2008 Server some time ago.
Suddenly it stopped offering its services: when it was loading the user profile it would just sit there looking for something and didn't let the user log in. Just loading the profile forever and ever....
We tried to understand the problem in a 10-15 minute time frame and, since the problem was bigger than expected, we quickly restored a previous snapshot (thanks to NetApp snapshots) and resumed offering the service to users. Meanwhile users were working on the other server, so they experienced only degraded service speed, not a total loss of service.
After restoring the snapshot we troubleshot the faulty image that we had set aside, discovered and solved the issue, and backported the change to the running instances.
Anyway, again IMHO, in our field it is *necessary* to have the knowledge and troubleshooting expertise to understand what the problem is and then to solve it.

Time is money by Anonymous Coward · 2011-03-02 01:59 · Score: 0

If the cost of re-imaging a machine in a production environment is less than digging deeper, guess which one I'm going to do?

Re:Time is money by Nerdfest · 2011-03-02 02:05 · Score: 1

If it's something that happens repeatedly, it's nice to dig in, find the cause and fix it. The nice part is that with VMs, you can create a copy of the problem environment and have the best of both worlds.
Re:Time is money by d3ac0n · 2011-03-02 02:41 · Score: 1

If it's something that happens more than once in a small enough time period, then of course one would immediately dig deeper. However, if it's a one-off problem or a repeated but reasonably rare issue then either restore from backup, or nuke the server and rebuild.
Most of the admins I know (myself included) will still dig on the issue afterward, even if we've had to restore or rebuild from image. But the first responsibility is to get the system back up and running, not spend hours on bug hunts.

--
Official Heretic from the "Church of Global Warming". Proven right thanks to whistle blowers. AGW = Flat Earth Theory

Gee, ya think? by edremy · 2011-03-02 02:00 · Score: 0, Flamebait

Let's see. When I have a security or performance issue I can

A: Pay a bearded guy in suspenders for hours while he incants various arcane phrases like "sudo" and "grep" and hope that he actually manages to clean up the problem at the end, or

B: Press a button and have a factory fresh install in seconds.

Assuming that you have a decent build done first (Pay the bearded guy big for that) why on earth would *anyone* pick A? It's hardly just Unix- we're a Windows shop and we're heavily virtualized because it makes sense from so many different angles- security, load balance/failover, ease of setup, etc.

--
"Seven Deadly Sins? I thought it was a to-do list!"

Re:Gee, ya think? by rhsanborn · 2011-03-02 02:04 · Score: 5, Insightful

There are a lot of cases where pressing the button means that the problem will go away...for a few weeks. It will work right until you hit the same conditions that caused the problem in the first place. Suddenly, you're using the refresh to cover up either a poor implementation, or a standing bug, and it isn't going to go away until you call that guy in suspenders.
Re:Gee, ya think? by Anonymous Coward · 2011-03-02 02:08 · Score: 0

The only problem with paying the bearded guy to build the decent system from the start is that bearded guy will not have incentive to actually build it properly from the beginning...even though you are paying him...because he doesn't have to actually maintain the thing. Just a thought...I've seen this many times in IT (and had to clean it up).
Re:Gee, ya think? by Anonymous Coward · 2011-03-02 02:13 · Score: 0

Problem is, if you apply that factory install without doing some root cause analysis and fixing the problem you will at some point have the performance issue again and you HAVE the (core) security problem RIGHT NOW.
That said there are lots of things that can go wrong with a box that might never happen again and are well understood. Suppose you had a power failure for a really long time over a weekend and the generator ran out of fuel, the box went down hard. You need to figure out a refueling procedure and notification system for sure. You might look at why the box won't boot for a few moments and see if a fix is trivial, but then you know exactly why this happened. The FS got hosed when the machine lost power.
In that situation you could spend hours repairing the file system and selectively replacing corrupt streams, or you could dump your image back and restore the data from that known good (tested) backup you have, all in a few moments. What possible justification would you have for not re-imaging?
So like anything, the key here is understanding and using your brain. A re-image is not a hammer and not every problem is a nail, but sometimes it is.
Re:Gee, ya think? by Anonymous Coward · 2011-03-02 02:18 · Score: 0

Despite your five digit UID, I don't believe you frequent Slashdot enough to have missed Don't reboot UNIX boxes.
Mr. Beard is going to ensure your server remains online and connected without disrupting the workflow at your company. He's going to unload modules, install patches, reload them. He's going to do a few hours of real work but for the rest of the month (or year if he's worth his salt) Beardo is going to be warming a seat and making sure everything is perfectly running because he knows his stuff. He earned it unlike some guy who got the job from his frat brother because he knows Linux and reinstalls everything if they get "shpchp 0000:00:01.0: Cannot reserve MMIO region" upon booting up for the first time.
Re:Gee, ya think? by tsm_sf · 2011-03-02 02:38 · Score: 1

Let's see. When I have a security or performance issue I can
A: Pay a bearded guy in suspenders for hours while he incants various arcane phrases like "sudo" and "grep" and hope that he actually manages to clean up the problem at the end, or
B: Press a button and have a factory fresh install in seconds.
Assuming that you have a decent build done first (Pay the bearded guy big for that) why on earth would *anyone* pick A? It's hardly just Unix- we're a Windows shop and we're heavily virtualized because it makes sense from so many different angles- security, load balance/failover, ease of setup, etc.
Well part of it too is that nobody gives a fuck if your Etsy-inspired e-store is working well. The beards are there for critical systems.

--
Literalism isn't a form of humor, it's you being irritating.
Re:Gee, ya think? by prefect42 · 2011-03-02 02:41 · Score: 1

So you make a cluster of these things, where regular failures are normal but tolerated. Then when the cluster starts acting weird, you make a cluster of clusters...

--
jh
Re:Gee, ya think? by pnuema · 2011-03-02 02:41 · Score: 1

If the problem only comes up every few weeks, press the damn button again. I see similar mentalities in my little corner of IT - testing automation. Most test automaters want to write test scripts that are robust and will be re-usable from build to build. My experience is that the amount of effort required to make scripts robust enough to last is exponentially greater than just doing the job over again, quick and dirty. I am looked down upon by the serious scripters - but I have three times the productivity. I am not looked down on by management. :)
A lot of times, the instinct to do a job "right" - to the best of your abilities - actually runs counter to the needs of the business. In that case, you are not actually doing the job "right". Doing the job "right" means getting the goal accomplished with the least effort possible over the medium term. If you end up rebuilding the server 5 times over the course of the year, at 2 hours per pop, you have spent less time than if you spent two days to fix the problem permanently.
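For what it's worth, the comparison works out like this. This is a small Python sketch, and the 8-hour working day is an assumption, since the comment only says "two days".

```python
HOURS_PER_WORKDAY = 8  # assumption: "two days" means two 8-hour working days


def rebuild_hours(rebuilds, hours_each):
    """Hours spent rebuilding over some period."""
    return rebuilds * hours_each


def fix_hours(days, hours_per_day=HOURS_PER_WORKDAY):
    """Hours spent on a one-time permanent fix."""
    return days * hours_per_day


yearly_rebuilds = rebuild_hours(5, 2)
one_time_fix = fix_hours(2)
print("rebuilding for a year:", yearly_rebuilds, "hours")
print("permanent fix:", one_time_fix, "hours")

# How many years of rebuilding at this rate cost as much as the fix.
print("break-even after", one_time_fix / yearly_rebuilds, "years")
```

So the claim holds for the first year; whether it still holds "over the medium term" depends on how long the server lives and whether the failure rate stays at five a year.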
Re:Gee, ya think? by Polumna · 2011-03-02 03:00 · Score: 1

> Let's see. When I have a security or performance issue I can...

If you explicitly mention a security issue as a possibility and your first inclination is just to revert to an undoubtedly less-patched state, you do indeed really need to be paying [perhaps someone who isn't in] that stereotype. Even if it's just to keep your images properly updated.

Obviously you don't believe this applies to you, but my primary issue with this article is that it takes one set of a large industry (virtualization) and applies it to UNIX administration as a whole. That is ludicrous. Not only does it ignore that things can happen outside a virtualized instance (what happens when that security issue is on the hypervisor or dom0 or whatever your container-container calls itself?), but it ignores the obvious fact that there are still industries with customer facing servers, virtualized or not, with SLAs.

When you have four minutes of downtime a month, or even forty, before you start losing money rebooting (ESPECIALLY when the instances have to then boot), "oh just reimage it" isn't really an option. The fall of system administration indeed.
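To see what those downtime budgets mean as availability figures, a quick back-of-the-envelope sketch in Python follows. It assumes a 30-day month, and the function name is only illustrative.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def availability_percent(downtime_minutes, period_minutes=MINUTES_PER_MONTH):
    """Availability implied by a given amount of downtime in the period."""
    return 100.0 * (1 - downtime_minutes / period_minutes)


for budget in (4, 40):
    print(f"{budget} min/month -> {availability_percent(budget):.3f}% availability")
```

Four minutes a month is roughly "four nines", and a single reimage plus boot can easily eat that entire budget.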
Re:Gee, ya think? by rhsanborn · 2011-03-02 03:03 · Score: 1

Sounds like a cluster...
Re:Gee, ya think? by corbettw · 2011-03-02 03:35 · Score: 1

OK, so you hit the refresh button two or three times, and if the problem keeps coming up then you know something serious is going on and then you investigate. There's no point wasting time or resources fixing something that might not even be a problem.

--
God invented whiskey so the Irish would not rule the world.
Re:Gee, ya think? by PlusFiveTroll · 2011-03-02 04:26 · Score: 2

If you end up rebuilding the server 5 times over the course of the year, at 2 hours per pop,
W T F. If you rebuild a server more than once every 3 years...
..Then your hardware sucks and you need better equipment.
..Then your applications suck and need to quit dicking with the operating system.
..Then your admins suck and need to be fired.
While I've only been doing admin work since '95, I can say that any modern server operating system is not going to fall over and die unless there is an underlying issue that needs addressed. I guess this is why I'm the bearded guy that comes in and fixes messes. I also say once every three years because by that time Windows can become a mess of security patches that will run better from a fresh install.
Re:Gee, ya think? by Anonymous Coward · 2011-03-02 04:32 · Score: 0

it isn't going to go away until you call that guy in suspenders.
I love the differences between US and UK English. Brightened my day, this has.
Re:Gee, ya think? by drsmithy · 2011-03-02 04:53 · Score: 1

Well part of it too is that nobody gives a fuck if your Etsy-inspired e-store is working well. The beards are there for critical systems.
"Critical systems" are the ones *most likely* to be simply nuked and reimaged, because they're the ones where minimising downtime is more important than messing about for hours trying to find some obscure cause to a problem that crops up maybe once every 6 months.
Re:Gee, ya think? by Anonymous Coward · 2011-03-02 05:05 · Score: 0

Also if you are not careful you may have regulatory and customer business agreements that require you to confirm if this is a data breach, other type of attack, or fraud and you would destroy your best chance of figuring that out.
I have also seen some unstable environments caused by poor design and just pressing the button does truly resolve the issue.
Re:Gee, ya think? by anyGould · 2011-03-02 05:36 · Score: 1

The only problem with paying the bearded guy to build the decent system from the start is that bearded guy will not have incentive to actually build it properly from the beginning...even though you are paying him...because he doesn't have to actually maintain the thing. Just a thought...I've seen this many times in IT (and had to clean it up).
And it probably doesn't help that you're breathing down his neck to get it done NOW NOW NOW because your boss' boss' boss picked a deadline based on his golf schedule.
Re:Gee, ya think? by tsm_sf · 2011-03-02 20:50 · Score: 1

You can't have an unknown problem floating around a critical system. You're confusing marketing critical with engineering critical. One needs to be up as often as possible, the other needs to be right as often as possible. When one fails you push a button, the other fails and your IT department has to cancel vacations.

--
Literalism isn't a form of humor, it's you being irritating.

Not a decline, but a reflection of the new normal by zerofoo · 2011-03-02 02:01 · Score: 2

As hosted services become more and more popular, sysadmins have less interest in spending the time to diagnose and solve a problem - this goes for Windows, Mac OS and Linux/Unix. When a fix is needed RIGHT NOW - the quickest way back up sometimes is a re-image.

When I was a small business IT consultant, I asked clients if they wanted to spend $125 per hour for me to diagnose and fix their system - with the understanding that it could take many hours to research and solve the problem - or if they wanted to spend ONE hour re-imaging the system to a known good point.

Almost everyone chose the "fix it now in under an hour" solution.

-ted

Blah blah blah by L4t3r4lu5 · 2011-03-02 02:01 · Score: 1

Yet another story about how the old way was better.

What's better is whatever keeps your employer's company making money for the most time. If re-imaging the server every weekend gives them 100% uptime during the week, do it. If you can inject patches into the app during runtime, more bully to you, but I can't, so I'm going with "re-image to working state and roll forward." If that costs my employer less than you cost your employer, I know who's all of a sudden more employable!

Might want to shave off those neckbeards, folks.

--
Finally had enough. Come see us over at https://soylentnews.org/

Re:Blah blah blah by Mr.+Shotgun · 2011-03-02 02:26 · Score: 2

Till the problem occurs in the middle of the work week and you still don't know what the actual problem is. Then you're looking at an hour of downtime during business hours while you re-image yet again with your boss asking what the hell the problem is and what you were doing on the weekend if you weren't solving the actual problem.
Covering up a problem is not the same as solving it.

--
Of all tyrannies, a tyranny sincerely exercised for the (supposed) good of its victims may be the most oppressive
Re:Blah blah blah by d3ac0n · 2011-03-02 03:03 · Score: 1

Or alternately, you could split the difference. Reimage on the weekends WHILE you research the problem. Again, this is the advantage of VMs. You can take a snapshot of the failed system before over-writing it with the known good image, and then troubleshoot the problem during the week, running the broken system in an isolated virtual network. Once you have a proper fix, implement it during the next scheduled downtime and get your weekends back.
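The order of operations is the whole trick: capture the broken state first, then overwrite. Here is a toy Python model of that. `ToyVM` is an in-memory stand-in invented for illustration, not any real hypervisor's API.

```python
import copy


class ToyVM:
    """In-memory stand-in for a VM; purely hypothetical, no hypervisor involved."""

    def __init__(self, name, disk):
        self.name = name
        self.disk = dict(disk)
        self.network = "production"

    def clone(self, name):
        """Copy the current disk state into a new VM on an isolated network."""
        twin = ToyVM(name, copy.deepcopy(self.disk))
        twin.network = "isolated"
        return twin

    def restore(self, image):
        """Overwrite the disk with a known good image."""
        self.disk = copy.deepcopy(image)


def recover(vm, known_good_image):
    """Keep a copy of the failed system, then put production back on the good image."""
    broken_copy = vm.clone(vm.name + "-forensics")  # capture the evidence first
    vm.restore(known_good_image)                    # only then overwrite
    return broken_copy


good_image = {"config": "ok"}
prod = ToyVM("app01", {"config": "corrupt"})
evidence = recover(prod, good_image)
print(prod.disk, prod.network)
print(evidence.disk, evidence.network)
```

If the two calls inside `recover` were swapped, the clone would just be a second copy of the good image and there would be nothing left to troubleshoot during the week.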

--
Official Heretic from the "Church of Global Warming". Proven right thanks to whistle blowers. AGW = Flat Earth Theory
Re:Blah blah blah by Anonymous Coward · 2011-03-02 04:31 · Score: 0

And you just worked seven days straight, assumably incurring OT in the process. You'll also burn yourself out working seven days a week re-imaging systems.
All of this work put in and you have added no value to the business at all. Matter of fact several vendors make software that can replace the average Windows helpdesk jockey who does nothing but image systems and reboot printers all day.
The reality is that "good enough IT" isn't really "good enough"; it's just shifting the IT employment landscape so that the majority of interesting positions are with the vendors and everything else ends up in outsourcing call centers in the Midwest/India/China or the IT people are relegated to "mailroom super clerk" type roles at most businesses. Considering that IT was never a major profit garnering or competitive edge for most companies this isn't really that earth shattering.
Re:Blah blah blah by Jonner · 2011-03-02 08:13 · Score: 1

Yeah, you're right. Just because you have absolutely no idea what caused a problem, you can still determine that it happens at most once a week. It's not as if businesses actually need their servers to be up on the weekend.

Expediency wins! by Infernal+Device · 2011-03-02 02:02 · Score: 1

Seriously, which way gets the job done faster?

Being a sysadmin is not about you and the system and your marvelous detecting and repair skills, it's *always and only* about your users. If VM technology improves the speed of recovery so the users can get back to what they were doing (probably messing up your carefully architected system), then so be it.

--
"My God...it's full of trolls!"

Re:Expediency wins! by Anonymous Coward · 2011-03-02 02:45 · Score: 0

You cannot "carefully architect a system" unless you can investigate, learn and constantly improve (often this may even mean doing work in your own time!). Users should not be able to " mess this up" and if they can you have done something wrong.
The reimage reboot approach is for admin grunts - the sort that adds new users and fixes printing problems, carries out upgrades, and installs stuff - your average Mumbai shoe shiner has these skills.
Overall it is about attitude and approach to problems - operating system has little to do with it.
Re:Expediency wins! by ajlitt · 2011-03-02 04:03 · Score: 1

Well put. A good sysadmin is like a ninja: if they're doing their job well you won't even know they're there. The problem for most is that management often sees the lack of constant firefighting in a well admin'd shop as either laziness or an opportunity to pile on a heavier workload.
This is how BOFHs are made.
Re:Expediency wins! by Anonymous Coward · 2011-03-02 04:24 · Score: 0

"it's *always and only* about your users."
No it isn't. It's about making money.
Re:Expediency wins! by Anonymous Coward · 2011-03-02 04:33 · Score: 0

By encouraging the easy route, all you're doing is guaranteeing fewer graybeards in the market. So when re-imaging doesn't work anymore there will be virtually no one who knows how to fix the problem. The few that do definitely won't be cheap. That also applies to the continuous bug fixing that goes on, as it takes graybeards to do that as well. Fewer graybeards=fewer bugs fixed=larger and more catastrophic failures. Enjoy your system reliability while it lasts.
Re:Expediency wins! by geekoid · 2011-03-02 05:06 · Score: 1

"Well put. A good sysadmin is like a ninja"
A myth?

--
The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
Re:Expediency wins! by Anonymous Coward · 2011-03-02 05:15 · Score: 0

Re-imaging just encourages stupid behavior of the laziest kind, especially from management. What, it's broken again? Just do what you did the last time, as it didn't cost anything. If you're hearing "again" a lot you need a better solution. It also guarantees the IT department will be completely clueless when a real problem comes along where re-imaging doesn't work, as the graybeards have already left or been tossed as irrelevant and expensive.

The decline of language skills? by Krakadoom · 2011-03-02 02:02 · Score: 1

I just thought it was amusing to post a headline with decline and fall both in the same sentence, when they are clearly the same thing in this instance. Should it actually have been "the rise and fall of ..." or "the decline of ..."?

Re:The decline of language skills? by EnsilZah · 2011-03-02 02:18 · Score: 2

http://en.wikipedia.org/wiki/The_decline_and_fall_of_the_roman_empire
Re:The decline of language skills? by Anonymous Coward · 2011-03-02 02:22 · Score: 0

No, "The Decline and Fall" do not mean the same thing: a decline has a progressive aspect, while a fall has a perfect aspect. Also, the title is an allusion to Gibbon's *Decline and Fall of the Roman Empire*, and frankly Gibbon's English was a HELL of a lot better than yours.
Re:The decline of language skills? by satch89450 · 2011-03-02 02:22 · Score: 1

There was a book by Will Cuppy (1894-1949) titled The Decline and Fall of Practically Everybody (1950; http://www.amazon.com/Decline-Fall-Practically-Everybody-Nonpareil/dp/0879235144) that was an absolutely funny take on history. Will Cuppy's style was to write very straightforward articles, but pepper them liberally with very funny footnotes. I remember seeing a paperback version of this as a kid, and got hooked.
Actually, the phrase "decline and fall" describes the shape of a drop-off not unlike the shallow slope leading to a cliff. Perfectly good English.
Re:The decline of language skills? by VolciMaster · 2011-03-02 02:23 · Score: 1

I just thought it was amusing to post a headline with decline and fall both in the same sentence, when they are clearly the same thing in this instance. Should it actually have been "the rise and fall of ..." or "the decline of ..."?
No - you can "decline" and not "fall", so the headline is fine.

--
antipaucity
Re:The decline of language skills? by lymond01 · 2011-03-02 04:45 · Score: 1

"The Decline and Fall..." is the history of the end of something. You could write about the "fall" which might be its last days or you could write about the decline and fall which portrays a fuller history of why it actually collapsed. This also allows the decline to recover if things go well. Fall gives that sense of permanence.
Perfectly fair use of language.
Re:The decline of language skills? by Jonner · 2011-03-02 08:18 · Score: 1

Apparently, literary allusions are lost on you.

I can't tell you how many times I have heard this. by Noryungi · 2011-03-02 02:04 · Score: 5, Interesting

Many times, what I hear as "solutions" are simply variations on the theme: "Why can't we reboot the server?" or "Why can't we reinstall the server from scratch?".

And my answer usually was: "Listen, I don't care how many times you do this on a Windows machine, but this is UNIX - I'll only reboot this machine if I absolutely need to. In the meantime, watch and learn as I kill the offending processes. Oh, and re-installing the machine means 24h of downtime".

These days, I help run a (very) large application, which runs on top of a (very) large "enterprise" SQL database for a (very) large company. The only problem is: enterprise application does not manage database very well, and leaves zombie processes on the database server. After a while, the database server just crashes (hard) and takes down the application server with it. Logical solution (and the one recommended by sysadmins): upgrade application to version X, which is supposed to have a much better database management.

What do you think the PHB/management solution is? Ask the DBAs to write a script that will monitor zombie processes, so the sysadmins will be warned in advance... Like, around 20 minutes before the application crashes. Just enough time to tell all users to save their work, because we need to reboot everything. Just like under Windows.
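For the curious, a watcher like that is only a few lines. Below is a minimal Python sketch, assuming the "zombies" are OS-level processes in state Z (they may really be stale database sessions, which would need a query against the database instead) and assuming a procps-style `ps` that accepts `-eo pid,stat,comm`. The threshold is a made-up number.

```python
import subprocess

WARN_THRESHOLD = 50  # hypothetical; the real crash point is not stated


def find_zombies(ps_output):
    """Return (pid, command) pairs for processes whose state starts with 'Z'."""
    zombies = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # skip the header row and malformed lines
        pid, stat, comm = parts
        if stat.startswith("Z"):
            zombies.append((int(pid), comm))
    return zombies


def current_zombies():
    """Ask ps for the process table and pick out the zombies."""
    # Assumes a procps-style ps; other platforms may need different options.
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return find_zombies(result.stdout)


def should_warn(zombies, threshold=WARN_THRESHOLD):
    """True once the zombie count reaches the warning threshold."""
    return len(zombies) >= threshold
```

Calling `should_warn(current_zombies())` from cron every minute would give the advance warning. It does nothing about the application that leaks the processes in the first place, which is the point.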

Did I mention the application is considered mission-critical and runs 24x7? And that downtime can cost more than 6 figures to said (nameless) company?

And, since you asked, yes, I am looking for another job. (Clueless admins and pointy-haired bosses: a match made in...)

--
The right to offend is far more important than the right not to be offended. (Rowan Atkinson)

Faster by Anonymous Coward · 2011-03-02 02:04 · Score: 0

Often times resolving the issue will take longer than the time to re-image something. This is the benefit of running virtualized infrastructure, quick build up and tear down.

The OS itself shouldn't matter and I've been doing this since I was able to snapshot stuff. Often times it will allow me to go back and work on the broken image while the new image is running, but honestly from a Management view - the admin is there to make stuff work - they don't care how he/she does it. They are interested in quick resolution.

I'll probably get flamed for this by youn · 2011-03-02 02:04 · Score: 1

But what's wrong with having images of servers ready as a viable disaster recovery strategy?

Yes, I agree it is good to know the system inside out. Yes, I agree that a simple minor server process configuration screwup is no reason to reimage the whole server... but sometimes it may be time saving at a point where users need the servers immediately. Sometimes it might actually be more secure and stable to restore from an image that has been tested for months rather than making tons of changes under the hood... especially if it is a system that has not been documented, where the last changes were made years ago... by diagnosing the server under time constraints, it is possible to mess things up even more. It's not necessarily a pissing contest... "well, I can fix my server without re-imaging" in this case.

Now, if the problem occurs regularly and reimaging is just putting blinders on the problem... then yes, I agree imaging is wrong. Yes, it is a good thing to know what is happening and find the problem... and most problems don't necessarily need reimaging.

my point is it is not necessarily a bad thing to restore a server from an image if you do things right... it may save time, be more secure and save tons in productivity/money.

--
Never anthropomorphize computers, they do not like that :p

Re:I'll probably get flamed for this by betterunixthanunix · 2011-03-02 02:18 · Score: 1
Nothing is wrong with it, as long as the following conditions are met:
1. Spawning a fresh instance will happen quickly, faster than actually solving the problem (this is true of VMs, or situations where a backup system is available).
2. The problem will not affect the fresh instance. What is the point of reimaging, if the problem was a faulty piece of hardware or some poorly designed software (e.g. software from decades ago that assumed an 8 bit counter was enough, on a day when you had to count higher than 255)?
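To illustrate the second condition, here is a minimal Python sketch of that kind of bug. The `Counter8` class and `count_events` helper are hypothetical, not taken from any real software. An 8-bit counter holds 0 through 255, so the 256th increment wraps back to zero, and a fresh image running the same code wraps in exactly the same way.

```python
class Counter8:
    """Hypothetical 8-bit counter, as old software might have used."""

    def __init__(self):
        self.value = 0

    def increment(self):
        # Keep only the low 8 bits, like a single unsigned byte would.
        self.value = (self.value + 1) & 0xFF
        return self.value


def count_events(n):
    """Count n events on a brand-new ("reimaged") counter."""
    counter = Counter8()
    for _ in range(n):
        counter.increment()
    return counter.value


print(count_events(255))  # 255
print(count_events(256))  # 0: the counter wrapped around
```

Every call to `count_events` starts from a clean counter, which is the point: reimaging resets the state but not the design flaw.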
--
Palm trees and 8

Re-imaging != bad administration by chrishillman · 2011-03-02 02:04 · Score: 2

Sure it was cool, back in the day, to spend 72 hours working on "the server" because even rebooting was not an option. Back then I had 3 servers, 10 years later I had 15. I didn't have the time to get into why each little snowflake of a problem was happening, I knew reinstalling and upgrading components would be a more prudent use of time. If I can rebuild a server and restore a data backup in 4 hours or I can spend an infinite amount of time "fixing" the existing install, which option do you think my PHB would prefer? It is not bad administration, it is just different.

Re:Re-imaging != bad administration by Anonymous Coward · 2011-03-02 02:08 · Score: 0

It is just necessary, unfortunately. Like you essentially said, time is money. You're not usually fixing these servers for the vendors to make them better, you're fixing them for a business that only needs it to work, fast, and as cost effectively as possible.
Re:Re-imaging != bad administration by Ephemeriis · 2011-03-02 02:58 · Score: 1

Sure it was cool, back in the day, to spend 72 hours working on "the server" because even rebooting was not an option. Back then I had 3 servers, 10 years later I had 15.
We've got about 30 servers to worry about, and this is a small hospital.
And downtime is basically never an option.

If I can rebuild a server and restore a data backup in 4 hours or I can spend an infinite amount of time "fixing" the existing install, which option do you think my PHB would prefer? It is not bad administration, it is just different.
Yup.
72 hours to dig out a problem on some machine that's being cranky? Yeah, that's not gonna happen. We'll restore a snapshot or provision a new VM and be back up and running within hours. Hell, even if we have to rebuild a physical box and restore from tape we can get it up and running in a day.

--
"Work is the curse of the drinking classes." -Oscar Wilde
Re:Re-imaging != bad administration by chrishillman · 2011-03-02 03:18 · Score: 1

Exactly, I have since moved on from that job a few years ago. Server density has gone up dramatically since then, I would not be surprised to see 50 servers at that location. Waay, way back spending your time "figuring things out" was more important. We should all face it: software is more complex now but easier to manage. Thanks to tools like Google and automatic updates, administrators' lives are very different now.

I am thinking this writer is a shill who is trying to drum up controversy to increase page views on his magazine web site.
Re:Re-imaging != bad administration by Anonymous Coward · 2011-03-02 04:47 · Score: 0

It's not administration, it's helpdesk. Systems administration implies that the person has a technical capability and understanding of the systems being administered. If your reasoning is really that 15 systems is too much of a workload then either your design is flawed or your skill is inadequate to perform your job. Skirting this fact by claiming "speed" is ignorant because the fastest solution to a problem is ALWAYS to prevent the problem from occurring.
Re:Re-imaging != bad administration by toxonix · 2011-03-02 06:02 · Score: 1

We have thousands of Linux boxes. Our admins either slow roll, fast roll, kick or flash them individually or in groups, depending on the situation. Maintaining the servers and apps on them individually doesn't scale in terms of human resources. If an app is leaking resources or otherwise misbehaving, the admins take a snapshot of the app state (heap dump etc) and roll the app. Flashing happens infrequently, usually for upgrades. Patching is just as slow and tedious, so flashing makes sense when you can just schedule an operation to happen on a few hundred servers at a time. Otherwise we'd have to do things by hand incredibly slowly over SSH. We used to do it this way, but that doesn't scale either.

It will just get worse (depending on your view) by Rooked_One · 2011-03-02 02:07 · Score: 1

As VM's are virtualized and taking snapshots of them becomes so easy, why would you bother troubleshooting anything when you can just restore to a snap that is an hour old? Better than the server being down and spending who knows how long trying to figure out what's wrong.

Obviously employers (if they wake up to this) will realize "Hey, I can pay a kid to restore snapshots" instead of "Hey... I need to hire this super expensive IT veteran."

Re:It will just get worse (depending on your view) by vlm · 2011-03-02 02:44 · Score: 3, Interesting

As VM's are virtualized and taking snapshots of them becomes so easy, why would you bother troubleshooting anything when you can just restore to a snap that is an hour old?
The security exploit that cracked the old image in less than a second, will crack the "identical" new image in less than a second. Or data sample #1213 which overflowed the buffer and crashed image A will simply overload and crash image B.
What it really brings up is a class distinction in sysadmins. There are the guys who actually fix systems, like patching security holes in system libraries to work around app bugs, redesigning firewall ACLs to avoid a new threat, and doing scalability assessments before the overload crashes something, and there are the guys who fix individual things like motherboards and hard drives but don't administer systems, basically help desk people with the fancy sysadmin job title. Virtualization means the helpdesk board swappers with the cool job titles are outta here, but the real sysadmins have little if anything to fear.

--
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
Re:It will just get worse (depending on your view) by Ephemeriis · 2011-03-02 03:01 · Score: 1

As VM's are virtualized and taking snapshots of them becomes so easy, why would you bother troubleshooting anything when you can just restore to a snap that is an hour old? Better than the server being down and spending who knows how long trying to figure out what's wrong.
Hell... If you can get your data off of the machine itself and store it on a SAN or a separate database or something, you can just boot from a clean, read-only image every day. Makes it really easy to keep the system happy.

--
"Work is the curse of the drinking classes." -Oscar Wilde
Re:It will just get worse (depending on your view) by CBravo · 2011-03-02 06:56 · Score: 1

I guess there will be two kinds of admins: Those who control the borg and those who replace dead parts.

--
nosig today
Re:It will just get worse (depending on your view) by Alex+Belits · 2011-03-02 18:33 · Score: 1

And then your system (maintained by hacks and idiots) corrupts the data on that SAN. And does that every time you restore data from backup (minus all transaction that were completed after last backup) and re-image the system. Now what?
Unix sysadmins are supposed to be smart not because they are maintaining a more complex system. They are supposed to be smart because they should use all the tools available to them to keep the system running reliably and efficiently, and fix the problems that appear or are discovered while running it. "Restoring" a broken system from the state when its brokenness is discovered to the state before the brokenness is discovered, is for VMware jockeys, not admins.

--
Contrary to the popular belief, there indeed is no God.

Nothing new by bryan1945 · 2011-03-02 02:07 · Score: 1

There are always people who are excellent, competent, and flat-out bad at their job. Unfortunately, the numbers of each group skew towards the lower end (well, not everyone is a genius). If this makes for an acceptable solution for the less-skilled, so be it. I hate to reward incompetence, but I hate down time even more. I want my servers running so my employees can do their work.

--
Vote monkeys into Congress. They are cheaper and more trustworthy.

Re:Nothing new by Alex+Belits · 2011-03-02 18:36 · Score: 1

lol wut

--
Contrary to the popular belief, there indeed is no God.

To be honest by TheRealFixer · 2011-03-02 02:08 · Score: 4, Informative

It sounds like this guy is just upset that technology has progressed to the point where we don't need to pay out the nose for some high-priced UNIX consultant to spend 3 days troubleshooting an issue that can be fixed in minutes or hours.

Just because you might learn more by spending days chasing down an issue instead of using your available tools to quickly redeploy the server and get the business back up and running, doesn't make that the correct decision. If you really want to dig into the root cause, clone the broken VM off and research it after you get a fresh one deployed from template.

Re:To be honest by LWATCDR · 2011-03-02 04:21 · Score: 1

It is called penny wise pound foolish.
Why live with the problem and the down time if you can fix it once? Yes, it takes more time to fix it right, but that really is one of the great things about using virtualization. You make an image of the problem system, restore the image if that will fix it quickly, and then run the problem image on your test system, find the cause and fix it for real.

--
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
Re:To be honest by Anonymous Coward · 2011-03-02 04:48 · Score: 0

shhhh you're bringing *common sense* into this argument.
I predict Slashdot is going to become very vocal with the baby boomer techs of the old days and we will see more anger out of them as they become unemployed and replaced by cheaper methods.
Re:To be honest by Anonymous Coward · 2011-03-02 04:52 · Score: 0

3 days troubleshooting an issue that can be fixed in minutes or hours.
My question to you is this: are you sure the issue is actually FIXED?
Re:To be honest by TheQuantumShift · 2011-03-02 05:32 · Score: 1

Depends on your definition of "Fixed". Discovering root cause to prevent future downtime is the point of support personnel. I do also understand the need to get stuff back up ASAP. Fortunately in *nix if you need to reboot to get stuff up, you can go back in later and actually find out what happened. Windows will just give you the finger and spawn useless messages about "Unexpected Errors" citing hex codes that MS has never seen, that is if you're lucky enough to not get "The Event Log is corrupt and could not be read"...

--

Shift happens. Fire it up.
Re:To be honest by Anonymous Coward · 2011-03-02 06:51 · Score: 0

Heh. Let's see what that high priced admin (unix or otherwise) brings you...
You have a critical app that needs to run 24 hours a day, seven days a week. There are no maintenance windows in normal course because users all over the world need to access the application. Rebooting is not an option.
Before that application is deployed, those high priced admins will be tuning the performance. Resource issues lead to downtime, after all. They will be configuring HA. You will need maintenance and updates, after all. They will be configuring the security of the system. After all, if someone hacks the system it will need to be taken offline.
The app hums along for months... Maybe some code issue develops. The high priced admin will then determine why it failed, and will do so in a minimum amount of time. All this while the app is running.
If the application fails in regular course of events, those high priced admins will be the ones held accountable.
My gross annual salary is over $200K. I maintain AIX, Linux, Solaris and HP systems. At any given time a single outage on an application can cost the company $5K US *per minute*. I am responsible for those applications. So hell yes, my salary is *nothing* in comparison to the insurance having me on the payroll brings.
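As a back-of-the-envelope check of that claim, the sketch below uses only the two figures quoted above, treating "over $200K" as exactly $200,000 (an assumption; the real salary is higher). The function name is made up for illustration.

```python
ANNUAL_SALARY_USD = 200_000          # floor of "over $200K" from the comment
OUTAGE_COST_PER_MINUTE_USD = 5_000   # "$5K US per minute"


def break_even_minutes(salary, cost_per_minute):
    """Minutes of outage whose cost equals one year of salary."""
    return salary / cost_per_minute


print(break_even_minutes(ANNUAL_SALARY_USD, OUTAGE_COST_PER_MINUTE_USD))  # 40.0
```

On those numbers, preventing roughly 40 minutes of outage in a year covers the salary.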
Re:To be honest by Anonymous Coward · 2011-03-02 07:49 · Score: 0

Just because you might learn more by spending days chasing down an issue instead of using your available tools to quickly redeploy the server and get the business back up and running, doesn't make that the correct decision. If you really want to dig into the root cause, clone the broken VM off and research it after you get a fresh one deployed from template.
And how do you gain knowledge and experience if you don't ever explore a system, both when it is working and when it is not? This way when things go totally south, and a reboot / re-image does not work, you'll have a picture in your head of how the entire thing fits together to start from.
If you don't spend time debugging, you don't exercise your debugging skills, and so they may not be there when they're needed.
And a UNIX consultant is high-priced often because they know their shit. So sure, if a reboot fixes it, that's great for you. But if it doesn't, you'll want/need someone around who knows what's going on.
Re:To be honest by Jonner · 2011-03-02 08:23 · Score: 1

Paul Venezia didn't say that virtualization is an inherently bad thing or that rebooting is always wrong. He did say that if you don't know what's causing a problem on a production system and reboot to fix it, that's the wrong approach.
Re:To be honest by Anonymous Coward · 2011-03-02 12:02 · Score: 0

But your premise is false... "an issue that can be fixed in minutes or hours."
You haven't fixed the problem. You have (possibly) temporarily restored the system to operation. That is the basic problem with punt and rebuild. FAR too many people equate a system being online with "problem fixed".
If you do not get to the root cause, you haven't "fixed" anything.
Why did it go down? Malicious activity, hardware problem, software problem?
When will it go down again?
How often will it go down?
How much do I lose each time it goes down?
How can you possibly answer these questions accurately if you don't know the root cause of the problem?
I'm not saying that you shouldn't use a VM image to bring a system back online immediately. I'm just saying that if you go no further, you haven't done 95% of your job. Having a professional competent in UNIX identify the root issue, and resolve it, is precisely what should be done....whether you have VMs or not.
"pay out the nose for some high-priced UNIX consultant" is an entirely different problem, and is based on the incompetence of the business management in conducting their business.
Re:To be honest by Anonymous Coward · 2011-03-02 22:37 · Score: 0

...to spend 3 days troubleshooting an issue that can be fixed in minutes or hours.
Correction - "...to spend 3 days troubleshooting an issue that can be band-aided in minutes or hours."

Just because you might learn more by spending days chasing down an issue instead of using your available tools to quickly redeploy the server and get the business back up and running, doesn't make that the correct decision. If you really want to dig into the root cause, clone the broken VM off and research it after you get a fresh one deployed from template.
Correction - "Don't bother trying to understand what is going on or explaining to the business why it is a good idea to have people who understand the systems that are running their business. Just hope that you don't find yourself with a problem that can't be fixed by a reboot, because then you and the business are fucked."

Not surprised at all by Stenchwarrior · 2011-03-02 02:09 · Score: 1

It's funny how many admins out there can't even set permissions in *NIX. I was working with a guy who was very well-versed in the VM world. Several certs after his name, in fact. But when he had to actually set permissions on the .vmdk files on the ESX host from the command line, he was clueless. I explained to him the whole rwxX and how each numerical value changes the bit for that permission and it was a completely wasted effort. I guess Veeam will take care of all that from a GUI.

Still, seems like they would teach the basics.

--
Loading...

Re:Not surprised at all by Ephemeriis · 2011-03-02 03:14 · Score: 1

It's funny how many admins out there can't even set permissions in *NIX. I was working with a guy who was very well-versed in the VM world. Several certs after his name, in fact. But when he had to actually set permissions on the .vmdk files on the ESX host from the command line, he was clueless. I explained to him the whole rwxX and how each numerical value changes the bit for that permission and it was a completely wasted effort.
I've been using *NIX systems of various kinds for over 10 years now... I don't have any certificates though, and I wouldn't really call myself anything more than a power user...
I know you can set all the permissions with one command by using the numeric code... Rather than having to go through several passes of u+rw, g-rx, whatever... And maybe I'm just a lousy sysadmin because I can't do the whole binary thing in my head... But it'll take me longer to figure out the numeric code than it'll take me to type out the several passes of u/g/o/+/-/=/whatever.
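For anyone who would rather not do the binary in their head, the mapping is mechanical: within each of the user, group and other triplets, r counts 4, w counts 2 and x counts 1, and the three sums are the three octal digits. A minimal Python sketch of that conversion follows; the `mode_to_octal` helper is made up for illustration and is not part of any standard tool.

```python
def mode_to_octal(symbolic):
    """Convert a 9-character mode like 'rw-r-----' to digits like '640'."""
    if len(symbolic) != 9:
        raise ValueError("expected 9 characters, e.g. 'rwxr-xr-x'")
    digits = []
    for start in range(0, 9, 3):           # user, group, other
        triplet = symbolic[start:start + 3]
        value = 0
        for char, flag, bit in zip(triplet, "rwx", (4, 2, 1)):
            if char == flag:
                value += bit                # permission is granted
            elif char != "-":
                raise ValueError("unexpected character: " + char)
        digits.append(str(value))
    return "".join(digits)


print(mode_to_octal("rw-r-----"))  # 640
print(mode_to_octal("rwxr-xr-x"))  # 755
```

The result is what you would pass to chmod as the numeric mode, e.g. `chmod 640 somefile`.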

--
"Work is the curse of the drinking classes." -Oscar Wilde
Re:Not surprised at all by Stenchwarrior · 2011-03-02 03:25 · Score: 1

Yeah, but you could if you wanted to. You at least know it's binary and that's already better than many. Yes, it's a great skill to be able to setup a Cisco from the CLI, but it takes 1/10th the time from the GUI.
I get what the post is getting at: It's a dying art and the base skill seems to be dwindling. Ultimately, time is money and if it's faster to re-image, then fine. So long as the ability to set it all back up from scratch is there in case there are no good images.

--
Loading...
Re:Not surprised at all by Anonymous Coward · 2011-03-02 10:22 · Score: 0

If you need to configure just one, it may be faster; if you need to configure hundreds, no way. Besides, I have stopped counting the times that I have had to log in to a Cisco managed switch through telnet/ssh to fix a problem another admin could not fix from the nice web UI.
This whole discussion is getting ridiculous. Even Microsoft is pushing the CLI as hard as they can (PowerShell, anyone?) because they know that in order to have the same set of things you need to automate them, and what better way to automate stuff than to script it through commands. Automation is all about using a shell, be it in Unix or in Windows or whatever.
PS: I really like nice interfaces for casual stuff, but no way I am going to click through all the menus to install an MSI across our whole server park. I have better things to do.
Re:Not surprised at all by laptop006 · 2011-03-02 11:50 · Score: 1

Yes, it's a great skill to be able to setup a Cisco from the CLI, but it takes 1/10th the time from the GUI.
Er, no. For anything larger than a trivial config the CLI is much faster, and that's if you type it. Use templates (as most larger configs will) and it's not even close.

--
/* FUCK - The F-word is here so that you can grep for it */

Virtualization != marginalization of skills... by Shoeler · 2011-03-02 02:14 · Score: 4, Interesting

This seems to me to be a philosophical question. Indeed, if the uptime and more importantly availability is higher by the purported crash and burn (taking liberties with the slash and burn deforestation technique) method, who is to say it is less useful or less valid? Indeed, to espouse skills over delivering for the client seems to be missing the point. It seems to be standing on some pedagogical imperative that knowledge is somehow of more value in the workplace than delivery.

Now - having said that - don't get me wrong. I have seen entirely too many *nix sysadmins (full disclosure: I got an RHCE in 2003) who don't know where the network config files are because they only know the GUI, and are hired by a team of people who have never logged into a *nix box. However, I think the ill that is most egregious is not that it sets some moral and ethical imperative for fixing rather than reloading (or in this case, recovering from a VM image) a server, but the fact that it misses the point that there has been a dearth of qualified IT candidates since the dawn of our industry and that the fixes to this don't have to do with how we fix a server, but how we hire and more importantly who we hire. As is everything in IT, garbage in == garbage out.

Finally - I absolutely agree with the Infoworld argument. It assumes an unexpected failure within the server, not some external thing that needs to be diagnosed and fixed. If your app crashes because the SQL table isn't there on the SQL server you don't control, rebooting ain't going to do a hill of beans worth of good.

Re:Virtualization != marginalization of skills... by visualight · 2011-03-02 02:42 · Score: 2

The problem is that the new "crop" of developers don't have any real problems to solve. They've all been solved, and solved well. So now we're adding unnecessary abstraction layers that hide what's really going on.
People that spent 3 days figuring out how to burn a CD back in the 90's tend to know how everything works, but the "kids" coming up in recent years only know (and only care to know) the flashy point-and-click abstraction layers, and only program within "frameworks".
Years ago I used to talk to people about the Windows approach vs the Unix approach, but sadly the people currently working at Redhat and Novell are working hard to make a liar out of me.

--
Samsung took back my unlocked bootloader because Google wants me to rent movies. They're both evil.

Fun at scale. by Hawke · 2011-03-02 02:14 · Score: 1

You have 1000 servers. You need to upgrade them to RHEL 6. Do you put a DVD in each of 1000 DVD drives?

NO!

You use an image server. Kickstart. Cobbler. Figure out what the new image looks like, and then pxeboot 1000 servers. That goes much faster. (to the sysadmin above, reimaging a server should take 25 minutes, most of which is spent surfing slashdot, not an hour).

So now, you've got a server that's misbehaving. One of 1000. Out of pure coincidence, honest, the one server you were manually futzing with last week, but that can't possibly be connected. Fixing that server yourself will cause more "configuration drift", and leave you with one server that's still different than the 999 other servers. And hey, that image server is still on your network. Just reimage the thing.

It's popular because it's the answer that scales. kthxbye.

Re:Fun at scale. by Anonymous Coward · 2011-03-02 02:25 · Score: 0

you should probably isolate that one server and figure out what's wrong. Then you can have one cake and keep 999. kthxbye.
Re:Fun at scale. by buchanmilne · 2011-03-02 02:49 · Score: 1

Maybe if RH bothered to ship rpm-4.6.x to RHEL5, you would only need to reboot once during an upgrade from RHEL5 to RHEL6 ...
Like you can on other distros, including other RPM-based distros.
If you used a VCS or a configuration automation tool (cfengine, puppet etc.), then you wouldn't need to re-image or re-install a server to get its config in-line ...
Re:Fun at scale. by Hawke · 2011-03-02 03:35 · Score: 1

You should.
And then, once you figure out what's wrong, you should reimage the box to fix it. Yes, I'm dead serious. Manually futzing with one box ("configuration drift") is farther up the list of reasons why things break than you would believe.
Re:Fun at scale. by Anonymous Coward · 2011-03-02 03:39 · Score: 0

You have 1000 servers.
There's your first problem... Time to break away from the PC mentality and consider purchasing real hardware, from real vendors!
Re:Fun at scale. by Hawke · 2011-03-02 03:41 · Score: 1

cfengine, puppet, chef et al. are in the set of acceptable solutions. And if you have per-host information you care about keeping, superior to blindly reimaging.
But why do you have per-host information? Per-host information (log files, or important data on local storage) is an inherent management pain. The best answer is to keep that to the minimum set of hosts possible, and use coarse tools on the majority. Then you're manually managing 2 hosts, and bulk managing 998. Which is a cubic ton better than manually managing most of 1000 hosts. (remote syslog is your friend.)
(Upgrade? Really? Um, no. Reinstall. Again, you have to be able to reinstall quickly and accurately. And since you can do that, why not do that?)
Re:Fun at scale. by Hawke · 2011-03-02 03:48 · Score: 1
A second specific comment
The configuration of a system is much more complex than most configuration management tools consider. The tools generally limit themselves to the list of things a "sane" person would change.
The list of things that actually affect the running of your system is much, much larger.
- Libraries. Did you hand-jam in a specific openssl version for some application?
- Programs. Did you hand-upgrade openssh on one system?
- /usr/local. Is it in the path of a shell script used to launch a service? Is everything under it managed?
- Permissions. Did someone do "chmod -r" somewhere they should not have?
If you write rules in puppet to handle all of that, your set of rules blows up to be insanely detailed, long, and completely unmanageable.
But the reinstall handles it all. In an automated, scripted fashion that allows you to easily change what you need.
Seriously people. Cobbler & similar install servers. They need to be part of any large scale host management. And since they are already there, are easy to leverage into being a large part of your large scale host management. And then reinstalling the server is the sane solution.
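To make "configuration drift" concrete, here is a minimal Python sketch that compares what one host reports against the known-good baseline from the image. Everything in it is hypothetical: the `find_drift` helper, the package names and the version strings are invented for illustration, and a real inventory would come from the package manager rather than a hand-written dict.

```python
def find_drift(baseline, host):
    """Return {name: (expected, actual)} for every package that differs."""
    drift = {}
    for name in sorted(set(baseline) | set(host)):
        expected = baseline.get(name)   # None means "not in the image"
        actual = host.get(name)         # None means "missing on the host"
        if expected != actual:
            drift[name] = (expected, actual)
    return drift


# Hypothetical inventories: the image baseline and one hand-tweaked host.
baseline = {"openssl": "1.0.0", "openssh": "5.3", "rpm": "4.8"}
host = {"openssl": "1.0.1-handbuilt", "openssh": "5.3", "rpm": "4.8", "strace": "4.5"}

for name, (expected, actual) in find_drift(baseline, host).items():
    print(name, "expected:", expected, "actual:", actual)
```

A non-empty result is the cue to reinstall rather than to write one more special-case rule for that host.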
Re:Fun at scale. by Hawke · 2011-03-02 03:55 · Score: 1

So google's doing it wrong? So is Yahoo? They shouldn't have so many computers? How about Amazon? Supercomputers are all doing it wrong?
It's called "running a successful service at internet scale". And it's a really good gig if you can get it
Re:Fun at scale. by geekoid · 2011-03-02 04:58 · Score: 1

"You have 1000 servers. You need to upgrade them to RHEL 6. Do you put a DVD in each of 1000 DVD drives?"
YES!
because it's funny watching server admins do stupid shit over a weekend.
I mean, no, you shouldn't but I wouldn't stop you. I also don't pay you, so keep that in mind.

--
The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
Re:Fun at scale. by anyGould · 2011-03-02 06:06 · Score: 1

I think the implication was that what's wrong is "you were manually futzing with" it - zap it back in conformity with the rest, and if it still misbehaves, you have 999 reasons to believe it's a hardware issue.
Re:Fun at scale. by Anonymous Coward · 2011-03-02 08:50 · Score: 0

If you used a VCS or a configuration automation tool (cfengine, puppet etc.), then you wouldn't need to re-image or re-install a server to get its config in-line ...
there is no reason why you could not combine both (unattended installations + config management, that is). We do this all the time.

Here we go again by cpct0 · 2011-03-02 02:14 · Score: 1

Is this the old geezer versus the new wet diapers yet again? (trying to be as evil on both sides ;) )

There are new technologies and we should embrace them. I am not a proponent of VMs, I don't like them in general, but I do see their uses and they're very effective. Like in C++, you've got the STL, with very similar and nearly interchangeable std::vector, std::list, std::deque and so on (and not talking about boost or 3rd parties here). You need to know when to apply them or else you'll get problems. Well, in the '10s, you have the same ridiculous amount of technologies available to sysadmins, and you need to know when to apply them. That's the new Sysadmin job: not only knowing that you can code one in bash with grep, awk, echo, while read, pipes and rsync, but actually knowing there is a package all neatly made for you, available at your fingertips with a simple apt-get (or yum).

I keep my computer tidied-up, I love to know what runs where. Even then, I do a "spring cleaning" once every year, reinstalling everything. And incredibly, my computer runs faster and more efficiently. Why? new /etc defaults, new parameters, new software, old clinging software, things that are nearly impossible to update. Same for the files. Seriously, in today's computers, we get hundreds of thousands of files, most of which have some arcane use we couldn't care less about, but are necessary for some kind of weird reason. I'm a sysadmin, and I don't pretend to want to know all these files.

I read the article, and yes, there are things that are changing, and seriously, I do respect the One person who can understand the Sendmail configuration files... oh I'd even be impressed with the M4. :) And when there is a problem, I want to know why, because I love to learn. But then ... there are prerogatives, time constraints, servers need to be up, people need to work, and we have all these magnificent tools that will enable every computer to be segregated in their private little VM world (to return to that main article). So should we simply shrug, laugh and go back to The Ancient Ways? You can keep your "vi" editor, leave me my "vim", please. :)

Re:I can't tell you how many times I have heard th by Junta · 2011-03-02 02:14 · Score: 1

Oh, and re-installing the machine means 24h of downtime

I am with you except here. If re-installing a machine incurs 24h of downtime, you do not have a suitable contingency plan. Most environments I deal with are 15-20 minutes from offline to production on reinstall at the long end.

--
XML is like violence. If it doesn't solve the problem, use more.

Time is Money by sheehaje · 2011-03-02 02:16 · Score: 1

I used to scoff at reformatting and reinstalling, but today it's a simple calculation. Will the fix take longer than either reverting from a snapshot or cloning from a template? Many may cringe at that as a solution, but the bottom line is time is money. It used to be that reinstalling, restoring from backup simply took too long, and it was better to fix the problem at the console if possible. Today, that isn't so with automatic snapshots of virtual machines, SAN replication, etc. I don't scoff at it though, it means we can spend more time being proactive rather than reactive.

The fastest fix by Anonymous Coward · 2011-03-02 02:16 · Score: 0

I'm going to get flamed for this, but what the hell.

I've always thought that it is more important to get a server back up and operational as quickly as possible, than it is to keep the server down until you find the problem. Now don't get me wrong, you still need to find the ultimate problem, or at least find out if the problem is repeatable, and then find the answer to it.

So I'm in favour of any method that helps me in getting the system back up and running; be it re-imaging or anything else.

Re:The fastest fix by vlm · 2011-03-02 02:48 · Score: 1

I'm going to get flamed for this, but what the hell.
I've always thought that it is more important to get a server back up and operational as quickly as possible, than it is to keep the server down until you find the problem. Now don't get me wrong, you still need to find the ultimate problem, or at least find out if the problem is repeatable, and then find the answer to it.
So I'm in favour of any method that helps me in getting the system back up and running; be it re-imaging or anything else.
Doesn't apply to anything that outputs reports to management. Any chance that you're giving them provably wrong data that dude gets shut down till fixed.

--
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger

Faster is nice, but... by Junta · 2011-03-02 02:17 · Score: 1

Sometimes a one-off mistake happens, and reinstall makes sense. Many other times, the reason you had to reinstall is due to a more persistent problem (program/script systematically messing up or an admin that just needs to not be doing admin work), and skipping root cause analysis means you'll lose more time in the aggregate.

--
XML is like violence. If it doesn't solve the problem, use more.

Re:Faster is nice, but... by Ephemeriis · 2011-03-02 03:18 · Score: 1

Sometimes a one-off mistake happens, and reinstall makes sense. Many other times, the reason you had to reinstall is due to a more persistent problem (program/script systematically messing up or an admin that just needs to not be doing admin work), and skipping root cause analysis means you'll lose more time in the aggregate.
So you re-image and get that new VM into production. And then take the old, cranky VM into development and find the root cause with little to no downtime. And then you incorporate the fix for that root cause into a new image. And then you re-image and put the new and improved VM into production with little to no downtime.

--
"Work is the curse of the drinking classes." -Oscar Wilde
Re:Faster is nice, but... by Fastolfe · 2011-03-02 04:10 · Score: 2

At seriously large scales, events previously considered so improbable that you'd likely never see them in your career become likely. TCP checksums are weak. Cosmic rays cause bit flips. Sometimes those bit flips mutate data on the way to the disk, so you never notice unless you've also checksummed the data, read it back, and re-checked it after writing it.
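As a rough illustration of the checksum-then-read-back idea, here is a minimal Python sketch. The function name is made up for the example, and SHA-256 is just one reasonable choice of checksum. Note the caveat in the comments: a read immediately after a write may be served from the OS page cache rather than the disk, so this alone does not prove the bits on the platter are right.

```python
import hashlib
import os


def write_verified(path: str, data: bytes) -> str:
    """Write data, read it back, and compare SHA-256 digests.

    Hypothetical helper for illustration. Returns the hex digest on success
    and raises IOError if the read-back copy does not match.
    """
    expected = hashlib.sha256(data).hexdigest()
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device
    # Caveat: this read may come from the page cache, not the physical disk.
    with open(path, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    if actual != expected:
        raise IOError(f"checksum mismatch for {path}: {actual} != {expected}")
    return expected
```

Storing the returned digest alongside the data lets a later scrub job re-check it once the cache has long since been evicted.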
At these scales, it's fruitless to try and root cause every problem that happens, because you will hit problems like these that most sysadmins simply aren't likely to ever figure out. Document (statistically) the problem and re-image, without a second thought. In fact, write automation to collect some data and re-image for you when this situation occurs. Once you have a few repeats of the same event, or the statistics you've been collecting show a disturbing trend, THEN try to do some root-cause analysis. Otherwise, you're just wasting your time chasing things that you're not likely to figure out, or meaningfully fix if you do figure it out.
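The "document it, re-image, and only investigate repeats" policy could be sketched like this. The class name, the failure "signature" strings, and the threshold of three are all assumptions for the sake of the example.

```python
from collections import Counter


class IncidentLog:
    """Count failures by signature and decide between re-imaging and RCA.

    Hypothetical sketch: a "signature" is any string that identifies a class
    of failure (for example an error message or a failing health check).
    """

    def __init__(self, rca_threshold: int = 3):
        self.rca_threshold = rca_threshold
        self.counts = Counter()

    def record(self, signature: str) -> str:
        """Record one occurrence and return the suggested action."""
        self.counts[signature] += 1
        if self.counts[signature] >= self.rca_threshold:
            return "investigate"  # repeated often enough to justify root-cause work
        return "reimage"          # treat as a one-off: log it and move on


if __name__ == "__main__":
    log = IncidentLog(rca_threshold=3)
    for _ in range(3):
        print(log.record("fs-corruption on boot"))
```

In practice the counts would need to persist somewhere outside the machine being re-imaged, otherwise the statistics disappear with the evidence.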
Re:Faster is nice, but... by Alex+Belits · 2011-03-02 18:45 · Score: 1

It NEVER works.
1. Unless the problem is trivial, it only shows itself under certain conditions, that are unlikely to be imitated outside of the production environment.
2. Whatever the problem is, it's usually related to data -- your current production data.

--
Contrary to the popular belief, there indeed is no God.

Re:I can't tell you how many times I have heard th by Anonymous Coward · 2011-03-02 02:19 · Score: 0

It costs them less to pay the DBA to write the script and inconvenience their users than it does to upgrade the system.

That all important profit margin is what gets in the way of things being done the right way.

captcha: income

Doesn't always work by Baki · 2011-03-02 02:21 · Score: 2

Sometimes a server is gradually degrading due to some issue. During that time, things are being modified. If you learn that the problem started a few months ago, you can't just re-image an old state and loose everything that had changed since then.

Of course to make app servers as stateless as possible helps against this problem. One of the reasons that my company enforces that data are kept on physically separate DB servers, and (virtualized) app server instances should be as dedicated to a single app as possible.

Re:Doesn't always work by Anonymous Coward · 2011-03-02 04:07 · Score: 0

'lose' not 'loose'

Save a buck by Goboxer · 2011-03-02 02:24 · Score: 1

How outrageous that these people don't explore the complex and time consuming issues. Don't they realize that the pursuit of knowledge is *way* more important than getting it done quickly? I mean, in a world where time equals money they shouldn't look at it as tossing money into a hole; they should look at it like investing in their collection of potential Jeopardy trivia.

I know if I had a boss hovering over me, not understanding what was wrong, and just pressuring me to get it done I would tell him to shove off so I could learn. Who cares that every minute I spend working on the issue is a minute I can't spend on other problems. Who cares that I could be replaced by a system admin who would get it done quickly. Knowledge and what other system admins think of me is what is important. After all, those pay the bills. /sarcasm

Re:Save a buck by visualight · 2011-03-02 02:56 · Score: 1

You better know what went wrong in the first place or it will happen again and you will (appropriately) look like an idiot.
If you work for me you better *want* to know what went wrong, even if I don't give you time and make you re-image, or I will (appropriately) think you're lazy to learn and can't be trusted to provide the best solution in all cases.
If someone comes to you with a problem or an idea, do you give them a menu of canned solutions or do you say "Tell me exactly what you want and how you want it"? With your attitude I expect the former -and (to speak to your adolescent economic reasoning) that reputation is why I make more money than you.

--
Samsung took back my unlocked bootloader because Google wants me to rent movies. They're both evil.
Re:Save a buck by Anonymous Coward · 2011-03-02 04:02 · Score: 0

i work at an online retailer, and beyond a simple 'that's what's wrong' we're never given an opportunity to go very deep because with most problems going deeper is a waste of money. mainly because most things don't repeat if you just rebuild. the ones that do, the second time will give you a quicker lead to what went wrong. as long as it's not catastrophic there is no harm in narrowing down the possibilities by letting it repeat. the parent made no remark about desire, but only about what his boss wants him to do. you also assume a lot about the poster in order to lash out (unless you do know him, and his wage, in which case i apologize).
Re:Save a buck by Goboxer · 2011-03-02 04:24 · Score: 1

If you believe that spending time to hunt down every bug is a cost effective solution to every problem then I really don't know how to respond to you. The fact of the matter is I try to minimize company costs, regardless of my desires. If I feel that rebuilding is more cost effective, I'm doing it. And a good chunk of the time that is going to be the cheapest course of action, both short and long term. As for why you make more money, money is not 100% indicative of good ideas or skills. It's called failing up.
Re:Save a buck by Anonymous Coward · 2011-03-02 05:51 · Score: 0

You don't know how to respond because your "cost effective" mantra is a LIE. You are rationalizing your own laziness and disinterest. The word you're really looking for is *expedient*. Which in the long term is usually *not* cost effective.

The real problem with the Clone approach by Anonymous Coward · 2011-03-02 02:27 · Score: 0

If you ask me, a major drawback is that fewer eyeballs are looking at the code -> fewer bug reports -> buggier software.

outsourcing by roc97007 · 2011-03-02 02:30 · Score: 1

I think part of this phenomenon might be due to outsourcing, which puts a layer of call center personnel armed with loose-leaf binders of procedures between you and the one or two remaining competent sysadmins, who are then relegated to firefighting. In this world, there isn't time to diagnose problems because the level of expertise and admin/customer ratio are kept purposefully low.

--
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.

Re:outsourcing by Anonymous Coward · 2011-03-02 03:07 · Score: 0

I think some of this is also due to the general acceptance of Windows as a server platform. The mentality that "I don't know what the problem was but rebooting fixed it" is easier than finding the real problem. If you read the responses, it also appears that outsourced programming might be at fault just as much as poor sysadmin'ing. Most programmers use Windows for their development platform and have little knowledge of the administration of their workstation (since it's locked down in most cases). When their program goes wild, their only choice is to reboot since they don't have the permissions needed to kill processes.
So outsourcing and Windows work hand-in-hand to make "starting over" the option of choice when a server is misbehaving.

I series by Anonymous Coward · 2011-03-02 02:31 · Score: 0

Run an I series, they're like a tank. Slow and cumbersome but they just don't stop.

Re:I series by CompMD · 2011-03-02 04:38 · Score: 1

...Until there's an app problem and some idiot admin decides a reset is necessary...then everyone waits an hour for the monster just to IPL. If you're gonna have a /400, make sure you have someone who knows what to do with it.
Re:I series by anyGould · 2011-03-02 06:17 · Score: 1

Run an I series, they're like a tank. Slow and cumbersome but they just don't stop.
Oh, they'll stop if you don't take care of them properly. We have them here, and when they fall, they fall *hard*.
On the plus side, it's nice to know it's the same hardware that IBM uses to play Jeopardy!

Re:I can't tell you how many times I have heard th by vlm · 2011-03-02 02:36 · Score: 1

The only problem is: enterprise application does not manage database very well, and leaves zombie processes on the database server. After a while, the database server just crashes (hard) and takes down the application server with it.

Did I mention the application ... runs 24x7

So which is it, it crashes "often" enough to be a problem, or it never crashes ever?

The obvious solution is to reload it every day at the least inconvenient time.

If they will not "permit" a controlled reboot, then work around it by running health testing scripts that just happen to knock it out, sort of a euthanasia approach.

The next "solution" is a (caching?) sql proxy server in the middle, no one will notice if the reboot is fast.

Is the upgrade suggested by the admins themselves, who have tested it under load on a test server so they know it'll work, or suggested by the vendor dazzled by the vision of fat commission checks? "It'll work great, sure, it'll work great, great at paying for my sports car, yeah it'll work great"

--
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger

endless cycle by roc97007 · 2011-03-02 02:37 · Score: 4, Insightful

I'm not sure I buy everything in TFA, but have to admit to a certain extent this phenomenon is real. I've noticed, however, a tendency to regenerate an instance, and when it doesn't work regen it again, and again and again, because the purposely overextended and/or undertrained admin doesn't have time to figure out that the problem is in his template or due to something external like a dup IP. Come to think of it, this type of endless cycle seems to be fairly common in the Windows world. I guess we've caught up.
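For the duplicate-IP case specifically, a check against whatever inventory you keep is cheap to run before the third regen. A minimal sketch, assuming the inventory is available as (hostname, ip) pairs; the function name and the sample hosts are invented for the example.

```python
from collections import defaultdict


def find_duplicate_ips(inventory):
    """Return {ip: sorted hostnames} for every IP claimed by more than one host.

    `inventory` is any iterable of (hostname, ip) pairs (an assumed format).
    """
    by_ip = defaultdict(set)
    for host, ip in inventory:
        by_ip[ip].add(host)
    return {ip: sorted(hosts) for ip, hosts in by_ip.items() if len(hosts) > 1}


if __name__ == "__main__":
    hosts = [
        ("web1", "10.0.0.5"),
        ("web2", "10.0.0.6"),
        ("clone-of-web1", "10.0.0.5"),  # cloned from a template with a static IP
    ]
    print(find_duplicate_ips(hosts))
```

This only catches conflicts the inventory knows about; a rogue device on the wire needs a network-level check instead.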

Sometimes the user has to diagnose the problem themselves, which is a win for the IT manager because the time didn't come out of the IT budget.

I'm hoping that at some point these practices will be recognized as the false economics they are. But I'm not holding my breath.

--
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.

Re:endless cycle by Anonymous Coward · 2011-03-02 04:57 · Score: 0

Doesn't anyone realize how and where many bugs are found? Most of the time it's after deployment when out in the field. Those bugs are found, fixed, and applied upstream to make the core system more reliable and robust. I can see where this is going as fewer bugs are found and reported, fewer will be fixed, and the relative software reliability that we all depend upon will go down the tubes. Plus it will take more iterations for said software to evolve to a necessary stable state costing more in licensing for more versions and there will be more issues due to lesser reliability.
Re:endless cycle by roc97007 · 2011-03-02 06:54 · Score: 1

Agreed. So eventually we'll be rebooting Unix every 30 days or BSOD whichever comes first. Cherish those 500+ day uptimes, it's not going to be the norm.

--
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.

I am not on Unix by Shivetya · 2011-03-02 02:39 · Score: 1

but know the teams that implement/admin them and I am constantly amazed.

Amazed in that all I read here and elsewhere points to incredibly resilient systems, yet I have never been anywhere where they don't have scheduled downtime on at minimum a quarterly basis, and every major outage relied on a reload. So which is it? They make fun of the windows guys and just hope the windows guys don't look at their statistics (and no I am not on Windows either, think IBM Z and I).

My serious question is, is there a certain size of system that reloads are valid on? When does it stop being a valid solution? When you get to enterprise level systems what are your options then?

I read all these articles but the one thing that is never clear is, are these large systems or just small servers (small being PC class hardware)?

--
* Winners compare their achievements to their goals, losers compare theirs to that of others.

Re:I am not on Unix by spydum · 2011-03-02 03:26 · Score: 2

Sounds like you have poor unix admins that are exactly the reason this mindset is prevalent. I can tell you from 15+ years as a Unix admin, the only times I have "needed" to reboot were: upgrades (OS or hardware), hardware failure, and testing of init scripts. Real, stable, properly administered systems don't need rebooting. I even think this is fair to say of Windows. The problem is, as already described: there are not many good Windows Admins.
Re:I am not on Unix by causality · 2011-03-02 04:47 · Score: 1

Sounds like you have poor unix admins that are exactly the reason this mindset is prevalent. I can tell you from 15+ years as a Unix admin, the only times I have "needed" to reboot were: upgrades (OS or hardware), hardware failure, and testing of init scripts. Real, stable, properly administered systems don't need rebooting. I even think this is fair to say of Windows. The problem is, as already described: there are not many good Windows Admins.
Unfortunately Windows is not a terribly open system and one of the biggest selling points of Windows is that less-skilled people can run it. It's not like the Unix command line where you're just going to be lost if you don't understand it, if you're missing basic skills or don't grasp first principles, if you don't have a solid foundation for your knowledge. Less-skilled people don't have the payroll expense of skilled people and that appeals to the PHB types.
It's surprising this works out as well as it does, all things considered, except that the systems require more maintenance, they're more difficult to automate (Windows has *nothing* on even a simple shell script) and they often suffer problems that should have been preventable.

--
It is a miracle that curiosity survives formal education. - Einstein

Servers became a smaller piece of the puzzle by Anonymous Coward · 2011-03-02 02:40 · Score: 0

I don't see why reimaging/rebooting a VM instance is different from restarting a service that is misbehaving. Now "services" are VMs, that's all.

You were very happy as a sysadmin of a couple big servers, and now you have to administer several dozens of VMs. The skill set is slightly different, that doesn't mean we're "losing skills". Your Unix wizardry will come in handy anyway. The base concepts about OS operation will be there too.

Things change. Learn and deal with it.

Re:Servers became a smaller piece of the puzzle by AdrianKemp · 2011-03-02 03:35 · Score: 1

I hate to jump to unfair conclusions, so I'm hoping that you'll reply and explain yourself. Given that comment though, you seem to be exactly the type of person this article/summary is about. Rebooting a misbehaving server was never the correct answer, and still isn't. When you do that you lose valuable diagnostic information. It could mean the difference between days of downtime when the failure comes back in a more catastrophic manner, or being immune to the problem going forward.
VM is amazing because it allows you to do both without any other special redundancy/failover/etc. You can migrate the running broken server, replacing it with a working copy without losing that vital data. The only difference here is that you don't have to have downtime to figure out the problem anymore.
As I said, please do respond and correct me if I've misjudged you... But based on your comment you sound like a very bad sysadmin.

anonymous by Anonymous Coward · 2011-03-02 02:45 · Score: 0

The decline and fall? Can you decline and fall at the same time?
Where was the incline and rise?

OK Slashdot - I get it... by acoustix · 2011-03-02 02:45 · Score: 1

...I'm a poor, lowly Windows admin who doesn't know my ass from a hole in the ground. ALL HAIL THE 1337 *NIX H4X0R5!

Seriously...how long is this windows admin vs *nix admin comparison going to last? I can't help it that there are apps that absolutely need to run in a Windows environment. The job needs to get done. If I could run my industry specific software on Linux, I would. I would love to save my company money from licensing.

Now if you'll excuse me, I need to go back to flinging poo all over my server room walls.

--
"A plan fiendishly clever in its intricacies"- Homer Simpson

Re:OK Slashdot - I get it... by swordgeek · 2011-03-02 04:41 · Score: 1

Hey, I've worked with some excellent Windows admins! People I considered to be skilled, competent, responsible, and insightful. They're not all that common, but they do exist.
Part of the problem is that Windows is (still!) a fundamentally flawed system, and us old-timers resent the fact that it has dragged the entire computing world down to a much lower level overall. If you're supporting Windows, you're part of the problem. (It's not fair, but it's a fairly prevalent attitude).
The other problem is that too many of your comrades actually think that rebooting or reinstalling is an acceptable way of fixing things; and in the server world, it just isn't.

--

"People who do stupid things with hazardous materials often die." -- Jim Davidson on alt.folklore.urban
Re:OK Slashdot - I get it... by anyGould · 2011-03-02 06:22 · Score: 1

Seriously...how long is this windows admin vs *nix admin comparison going to last?
Oh, forever. Although the blame lies more with the software than the people in my opinion.
Unix systems (as a class) are generally designed to be user-serviceable - so the admins learn how to do that. Windows tends to encourage black-box thinking (here's your Server (tm), don't touch it). That trains their admins to find the solutions that work for them.
So while Windows admins make poor Unix admins, I'd argue the reverse is true as well - a Unix admin will drive himself crazy trying to fix a box labeled "no user-serviceable parts".
Re:OK Slashdot - I get it... by Anonymous Coward · 2011-03-02 07:47 · Score: 0

Seriously...how long is this windows admin vs *nix admin comparison going to last? I can't help it that there are apps that absolutely need to run in a Windows environment. The job needs to get done. If I could run my industry specific software on Linux, I would. I would love to save my company money from licensing.
Now if you'll excuse me, I need to go back to flinging poo all over my server room walls.
It will continue as long as the Windows admin's answer to an issue is reboot, re-install, re-configure.

Re:I can't tell you how many times I have heard th by powerlord · 2011-03-02 02:45 · Score: 3, Insightful

Oh, and re-installing the machine means 24h of downtime

I am with you except here. If re-installing a machine incurs 24h of downtime, you do not have a suitable contingency plan. Most environments I deal with are 15-20 minutes from offline to production on reinstall at the long end.

I agree that if the system is as critical as they say, they should have a better failover in place, however in a lot of companies, very little importance is placed on Live Failover systems. More than likely he's including lots more than the OS/Application build in that 24 hour timeframe.

Probably database reload/recovery time, or file system initialization (inadequate RAID controller to Disk design?).

--
This space for rent. All reasonable inquiries will be entertained at proprietors discretion.

Sounds like naive old schoolism by Anonymous Coward · 2011-03-02 02:51 · Score: 0

As a system administrator I don't understand why any option to make the quickest and biggest win should be ruled out, even if it would be in conflict with tradition. Sometimes the problem is just that the biggest, quickest and easiest fixes are not the same fix. Knowing which one to choose, and when, makes a good system administrator.

Re:I can't tell you how many times I have heard th by Ephemeriis · 2011-03-02 02:52 · Score: 1

Logical solution (and the one recommended by sysadmins): upgrade application to version X, which is supposed to have a much better database management.

What do you think the PHB/management solution is? Ask the DBAs to write a script that will monitor zombie processes, so the sysadmins will be warned in advance... Like, around 20 minutes before the application crashes. Just enough time to tell all users to save their work, because we need to reboot everything. Just like under Windows.

The new version costs money. And, no matter how important everyone thinks this application is, they obviously don't think it's worth that price. They're willing to deal with a reboot rather than spend the money. I'd recommend the upgrade, too... But I don't write the checks. Nor do I really use the app. I just keep it running. And if you tell me you can live without the app for 10-15 minutes while the server reboots, and you'd rather save $X instead of buying the new version, that's what we're going to do.

Listen, I don't care how many times you do this on a Windows machine, but this is UNIX - I'll only reboot this machine if I absolutely need to. In the meantime, watch and learn as I kill the offending processes.

That's great when you can get away with it... But sometimes it just isn't worth the trouble. Even on a UNIX system.

Yeah, I hate rebooting to fix problems. Seems like a crude approach. Especially when you've got so many nice tools at your disposal on a UNIX system.

And, I guess, I'm kind of wondering why it needs to be rebooted in your situation. You've got a script monitoring zombie processes... And those processes can apparently be killed manually... So why not have that script kill the processes instead of just monitoring them? Or write a second script to fire off a batch of zombie kills?
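Something along these lines is what I have in mind, as a sketch only. It assumes the "zombies" are literal Unix zombies (process state Z) and that you can get output in the shape of `ps -eo pid,ppid,stat,comm`; the function name is made up. A true zombie cannot be killed directly, since it is already dead and only goes away when its parent reaps it, so the useful thing to find is the parent.

```python
def zombie_parents(ps_output: str) -> dict:
    """Map parent PID -> list of zombie child PIDs.

    Expects text shaped like the output of `ps -eo pid,ppid,stat,comm`
    (an assumption; adjust the columns for your platform).
    """
    parents = {}
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 3:
            continue  # ignore malformed lines
        pid, ppid, stat = fields[0], fields[1], fields[2]
        if stat.startswith("Z"):  # Z (optionally with flags like "Z+") marks a zombie
            parents.setdefault(int(ppid), []).append(int(pid))
    return parents
```

From there the script could signal or restart the offending parent rather than rebooting the whole box. If the "zombies" are really leftover database sessions rather than Z-state processes, the same idea applies but the list would come from the database's own session view instead of `ps`.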

But sometimes it just isn't worth the time/effort involved. You can spend a couple hours digging for the problem while your users are without their app... Spend a couple hours developing and testing your script while your users are without their app... Spend a few days patching code while your users are without their app... Or you can just reboot the thing and go on with your life.

Oh, and re-installing the machine means 24h of downtime

This seems wrong to me. Or, at least, completely unrelated to the subject of re-imaging in a virtualized environment.

It takes maybe 5 minutes to provision a new VM complete with OS and default config/apps/whatever.

If I had a system that was as essential as what you describe, I'd have a base image of it stored and ready to go. Just bring up the new image, migrate the data, and make it live. That's what we do with all of our truly essential systems. And we can be running off a new image within about 30 minutes if we're able to migrate data off the old system. If we have to go to tape it'll take longer.

If you actually incur 24 hours of downtime to re-image a server, what's your plan if that machine simply dies? What if it takes more than a simple re-image to get it back up and running?

--
"Work is the curse of the drinking classes." -Oscar Wilde

It shouldn't be a Windows vs. Unix issue... by Lifyre · 2011-03-02 02:54 · Score: 1

It is a cost comparison issue. When the time cost to "punt and reload" is lower than the time costs of further troubleshooting that is the correct solution to the problem. Having virtual servers makes it easier and quicker to reload a server by having a default image on stand-by so it makes less troubleshooting worth the time.

That said, only a very poor admin would discard the old image without discovering the root cause of the issues in order to prevent it from happening again, thus saving future troubleshooting costs, and doing so in an offline environment. That's what dev servers are there for.

--
I'll meet you at the intersection of "Should be" and "Reality"

Time to train for a different job by Anonymous Coward · 2011-03-02 02:55 · Score: 1

Most of you people don't seem to get the point.

When re-imaging is quick, cheap, and will work, the need for esoteric diagnostic skills will cease to exist.

Put yourself in management's place : you have two options. One takes longer and costs more and results in more downtime in almost all cases. The other option takes less time, costs less, and minimizes downtime. In the real world, the second option is the overwhelmingly logical choice, and it is the choice that's going to be made.

The need for truly expert sysadmins will drop as a result. Ignore this at your peril, if you work as a sysadmin.

Old days by Anonymous Coward · 2011-03-02 02:56 · Score: 0

I've been fixing Windows machines for my friends and family since Windows 95 (I use Linux exclusively). A possible solution is to reformat the disk and re-load the OS -- however I've never had to resort to this last resort. From viruses to rootkits to buggy drivers, all can be corrected with the help of some good tools (typically stuff written by Mark Russinovich). I do find that the reformat solution seems to be the first choice whenever a computer is taken to a repair shop. It makes perfect sense, you see. People in general are lazy and feeble-minded and we nearly always prefer a simple, quick solution to a correct, more time consuming one. After all, why test our own patience when we could be posting pictures on the facebook, or experiencing the deep, profound joy of limiting ourselves to 140 micro-blogging characters. It's our nature. It's why we're fat, why our marriages fall apart and why we don't rise at work.

After years of digging through the Windows OS my skills are pretty decent at this point. I can fix most things. But it's definitely a much more difficult road. Is it any wonder that sys admins with real *nix skills are going to be cast aside by business, replaced by inexpensive new-comers who re-image rather than explore, diagnose and understand? Know that your finances, banking data, tax info, social security information, etc, are always being maintained by the cheapest sys admins on the cheapest computers available.

Perpetual Beta by Anonymous Coward · 2011-03-02 02:57 · Score: 0

I roll out a beta of a completely new server when my server goes down. EVERYTHING changes. Sometimes even the topic of the entire website might change. All tech is replaced, nothing is the same. Where there was flash-animations, there would now be a java-applet. And I just call it a surprising revamp of the website. And the people keep falling for it. I always develop the next version. I think Google has been doing the same. It is a very strange development-practice. It is utterly confusing. But, it works. It actually makes you seem more hip and cool. I call it "the perpetual beta strategy".

Consarn new-fangled technology! by Scutter · 2011-03-02 02:59 · Score: 1

I don't understand why people are using lathes and milling machines to make high-quality, cheap, easy to use tools when they could be carving their own stone axes and axe handles. When their tool breaks, they just set it aside and buy a new one instead of spending days of downtime repairing it! Are we losing the skills needed to carve our own tools by hand in the interest of saving time and money?! This makes absolutely no sense to me and I cast derision on anyone who would do such a thing!

--

"Tell me doctor, with all of your defenses, are there any provisions for an attack by killer bees?"

Re:Consarn new-fangled technology! by geekoid · 2011-03-02 04:55 · Score: 1

"Are we losing the skills needed to carve our own tools by hand in the interest of saving time and money?! "
Why spend precious time building your own tools when someone else makes them? Unless you are a tool maker.
I could go out and make a new handle for my hammer, but that's time I'm not doing something I enjoy... smacking folk singers in the head with my hammer.

--
The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect

Re:I can't tell you how many times I have heard th by Noryungi · 2011-03-02 03:02 · Score: 1

The only problem is: enterprise application does not manage database very well, and leaves zombie processes on the database server. After a while, the database server just crashes (hard) and takes down the application server with it.

Did I mention the application ... runs 24x7

So which is it, it crashes "often" enough to be a problem, or it never crashes ever?

It runs 24x7... until it crashes. And that's often enough that it is fast becoming a huge problem.

The obvious solution is to reload it every day at the least inconvenient time.

Easier said than done: we have users in (almost) every time zone under the sun. The only time for our (regular) interventions is on Saturday and Sunday. And said "enterprise" application is bad enough that it takes literally hours to restart. And that is on top-of-the-line major vendor iron too -- we are talking about dozens of CPUs and GB of memory here.

The next "solution" is a (caching?) sql proxy server in the middle, no one will notice if the reboot is fast.

That is, perhaps, one solution. On the other hand, I am not really sure this would work with said application.

Is the upgrade suggested by the admins themselves, who have tested it under load on a test server so they know it'll work, or suggested by the vendor dazzled by the vision of fat commission checks? "It'll work great, sure, it'll work great, great at paying for my sports car, yeah it'll work great"

Trust me on this one: this was born out of desperation, knowing full well the management would not allow the upgrade to be budgeted. And, no, no one in the admin group actually got any "fat commission checks" from any vendor -- as a matter of fact, the only people who are wined and dined by the vendors are the top management, way, way, way up above yours truly and the rest of his team (aka "the peons").

--
The right to offend is far more important than the right not to be offended. (Rowan Atkinson)

Re:Not a decline, but a reflection of the new norm by rhsanborn · 2011-03-02 03:05 · Score: 1

It's a question of scale. Reimaging a PC is almost always more economical than finding the root cause, unless it's very repetitive, assuming a proper backup solution is in place. It's a different beast when the computer in question is mission critical and the client can't accept a random downtime every few weeks while you rebuild.

I would agree with this assessment by Anonymous Coward · 2011-03-02 03:07 · Score: 0

I work on the vendor side and visit numerous customers on a daily basis. One thing I've often found myself remarking to many is that the art of system administration has been lost. I look out at the people who are now considered "senior" and compare that to what used to be senior many years ago, and there is no comparison. Hands down, the talent pool has been diluted. Having said that, I view that statement, against this article, as a very different issue altogether. What I read here is a growing trend of people not spending time troubleshooting problems in favor of rebuilding failing systems. I think there are two real reasons this approach has fallen into favor. Firstly, the scale of these deployments is larger than it was a decade ago. Secondly, the admin-to-server ratio has gone way down. This means that even if the admins were of equal talent to those of a decade ago, they don't bother doing deeper analysis of problems because they have many more things to do and rebuilding is faster. It might be best said that all of these points have a cause-and-effect relationship with each other.

Dropping CIOs, VM, slashed IT staff... by rAiNsT0rm · 2011-03-02 03:11 · Score: 1

This field is nothing like what I entered 15 years ago. Not in the usual technology progress way either, but in a steady downward spiral. Fortune 500 companies are beginning to drop CIOs altogether and putting IT in the hands of business depts., VMs are used as a band-aid for everything and as a result requests and demands and the number of servers to be maintained has exploded... all this while staff is cut to the bone. There used to be "Computer Science" and real professionalism and respect, now none exists. We are mostly to blame for it ourselves. For such intelligent people, we aren't smart. Ego and personality traits have been exploited to force us into 24x7 drones that are lowly, subservient, and basically whipping boys.

I have had some great experiences and I have also witnessed the decline first-hand, when I move on from my current position I will not be re-entering the IT workforce. I hate to throw away a lifetime's work and passion, but there is no real upside I can foresee... I only see it continuing to be minimized. People respect and understand tangible skills and products or revenue generating depts., which have always been tough selling points of IT. Knowledge and unseen aspects are hard to convey to non-technical folks, now that things have been abstracted one more layer with VMs and even virtual switching/routing, forget it.

--
http://teasphere.wordpress.com - A little spot of tea

Re:Dropping CIOs, VM, slashed IT staff... by anyGould · 2011-03-02 06:42 · Score: 1
We are mostly to blame for it ourselves. For such intelligent people, we aren't smart. Ego and personality traits have been exploited to force us into 24x7 drones that are lowly, subservient, and basically whipping boys.
While we don't help our own case, I think there's larger forces at work:
1. IT has evolved into a commodity - a computer isn't the magical and mysterious box that it was even ten years ago.
2. IT is a "cost center", so there is continual pressure to do things cheaper/faster/with less.
3. "Computers" was the field that everyone was pushing their kids into during the 90s.
So, what does that leave us with?
- A field that has lost its air of mystery - which means bosses can assume it's not that hard, and thus not requiring expensive expertise. (Compare with "marketing", which luckily has the right skills built-in to keep itself mysterious and hard to pin down - which keeps the implication that it's Complicated and Difficult, thus Expensive.)
- A glut of people "trained with computers", combined with a push to minimize costs, means that they can push people into low-paying high-effort work, because there's always a new class of "up to speed with latest technology" recruits willing to work for entry-level wages. It doesn't help that many of us are here "for the love of the job", which employers mentally tally as a "job benefit".
Re:Dropping CIOs, VM, slashed IT staff... by deek · 2011-03-02 15:08 · Score: 1

Some good points, but I disagree with the idea that the computer is becoming less magical and mysterious. Actually, I believe the unsaid point of the article is that people understand much less about how a computer works, and are thus unable to cope beyond the number of fixit techniques in their toolbox.
To really diagnose a problem, you need a good understanding of the underlying system. Most people just don't have that understanding. Thus the computer is actually more of a mystery than ever. I have no clue about the bootup process for Windows, other than some ntldr things and some Run/RunOnce/RunServices registry entries. It's all hidden from me. It's magical, as far as I'm concerned. I'm fairly lost if Windows has a boot problem. In fact, I have had issues, and basically tried blindly to fix things, like booting in safe mode and uninstalling software, booting up the Windows recovery console to allow it to automatically restore system files, even googling the problem and following whatever suggestions that people come up with.
With Linux, I can see the whole process on the console. I can examine the startup scripts, and see how they've been ordered. I can look into the inittab file and see how everything has been arranged. I know the linux kernel parameters and see them in my local lilo.conf/menu.lst/grub.conf file. It is not magical to me, and thus I can figure out bootup problems, and can even work around, and have worked around, just about any booting issue. I'm not saying I have complete understanding, but it's certainly deep enough to know how it all fits together.
That sort of knowledge is not a commodity, and to my mind, is essential for coping with system issues and maximising uptime.

These options are not mutually exclusive by spiffmastercow · 2011-03-02 03:12 · Score: 1

Okay, so let's assume you've got a big cluster of servers for some random task, and one of them breaks. Should you diddle with the individual server, which brings it out of sync with the others? No! You re-image it based on the standard for that cluster. But if it happens 10 times, THEN you diddle with the server, and make a new, better server image to deploy across your cluster.
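The policy described above (re-image on failure, but dig in once the same failure keeps recurring) can be sketched roughly as follows. This is only an illustration: the threshold of 10 comes from the comment, while the function name, the symptom strings and the action labels are made up for the example.

```python
from collections import Counter

# Threshold taken from the comment above ("if it happens 10 times").
INVESTIGATE_AFTER = 10


def next_action(failure_counts: Counter, symptom: str) -> str:
    """Record one more failure with this symptom and suggest a response.

    Failures are counted per symptom across the whole cluster, on the
    assumption that all servers run the same image.
    """
    failure_counts[symptom] += 1
    if failure_counts[symptom] >= INVESTIGATE_AFTER:
        # Recurring problem: fix the root cause and build a new image.
        return "investigate-and-rebuild-image"
    # Rare problem: restore the standard image and move on.
    return "reimage"


counts = Counter()
# Hypothetical symptom string, repeated ten times.
actions = [next_action(counts, "hangs-on-nfs-mount") for _ in range(10)]
print(actions[0], actions[-1])
```

The first nine failures are answered with a plain re-image; only the tenth triggers the investigation that produces a new cluster image.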

Re:These options are not mutually exclusive by Alex+Belits · 2011-03-02 18:58 · Score: 1

Okay, so let's assume you've got a big cluster of servers for some random task, and one of them breaks.
You find the problem and fix it on all servers -- because if one is broken and all are identical, then all of them are broken.

--
Contrary to the popular belief, there indeed is no God.
Re:These options are not mutually exclusive by spiffmastercow · 2011-03-03 01:30 · Score: 1

Okay, so let's assume you've got a big cluster of servers for some random task, and one of them breaks.
You find the problem and fix it on all servers -- because if one is broken and all are identical, then all of them are broken.
And how do you fix it on all the servers? By fixing it once, making an image, and deploying that image to all the servers. The point is that those servers should be identical apart from their ip addresses, so imaging absolutely makes sense in a large cluster environment.
Re:These options are not mutually exclusive by Alex+Belits · 2011-03-03 07:10 · Score: 1

By using package manager and rsync, you idiot!

--
Contrary to the popular belief, there indeed is no God.

Rebooting fixes almost everything. by indyogb · 2011-03-02 03:19 · Score: 1

What rebooting can't fix, formatting (imaging) can. A few more years, and maybe someone can write a program to re-image my virtual servers automatically, and then I can go flip burgers somewhere. :)

reddit community talked about this by kondor6c · 2011-03-02 03:19 · Score: 0

I remember that the reddit community talked about this not too long ago http://www.reddit.com/r/sysadmin/comments/fhnai/are_we_being_phased_out/ it was about the same idea of virtualization.

Productivity or Accuracy, that is the question by ThinkDifferently · 2011-03-02 03:19 · Score: 1

In the corporate world, it's always been a battle between productivity (less time to fix the problem) and accuracy (more time to fix the problem). It's a judgement call. In our environment, most of the troubleshooting is done by system integrators. The SysAdmins simply keep the back end up and running, and as quickly as possible.

Turn around time by Anonymous Coward · 2011-03-02 03:20 · Score: 0

I think the main thing here is that if a server is down, your goal is to get it up as quickly as possible. So you try a few fixes and they don't work. Then you have to decide whether it is going to be quicker to dig deeper for a solution or to just re-image the machine. Sure, everyone would probably rather figure the problem out, but the truth is the longer a server is down, the more users it will probably affect.

time by Vorpix · 2011-03-02 03:20 · Score: 1

this all comes down to time. i can reapply the data to a fresh VM image in a matter of hours and have it back up and running, pretty much without variation. hunting down a deep, dark problem can take 30 minutes or it might take days, and depending on the problem, that may simply be unacceptable.

the real skill is knowing when to pull the trigger on a rebuild vs knowing when it's something you can find and fix. hunting down problems and fixing them is something many sysadmins crave. at least VM's give us the ability to investigate the broken machine at our leisure, while a working VM can jump into production.

unless management wants to rely solely on rebuilds and the time investment it takes to do them every time, there will always be a need for sysadmins to analyze problems and figure out the "whys" and "hows" that caused them.

--
frog blast the vent core

Errrm , not quite by Viol8 · 2011-03-02 03:20 · Score: 1

"While I agree in principle, the analogy here is off. If the car doesn't start in this case, I can just throw it away and clone a working one."

Except if there's a hidden problem it won't be working for long and soon the new one won't start and you're back at square one. That's no good if you've got a load of database data on your VM that'll also be hosed if you revert the VM.

Re:Errrm , not quite by AK+Marc · 2011-03-02 07:03 · Score: 1

That's no good if you've got a load of database data on your VM that'll also be hosed if you revert the VM.
If that's the case, then you fail as an admin before the failure ever happened. Yes, if you are an incompetent admin, you'll complicate your job later on.

--
Learn to love Alaska
Re:Errrm , not quite by Viol8 · 2011-03-02 23:11 · Score: 1

It's not usually the Unix admin's job to back up the database. That's what DBAs are for.
Re:Errrm , not quite by AK+Marc · 2011-03-03 07:28 · Score: 1

Since you don't even know how the job differentiation works, I'll just dismiss you as an armchair IT guy that either doesn't work in the industry, or does and shouldn't.

--
Learn to love Alaska

Exactly like google! by Saint+Stephen · 2011-03-02 03:22 · Score: 1

Wow, this technique of "if it's not working, reimage" is such a lame idea that Google, Amazon and every cloud in the world does it! Must be a dumb idea.

Re:Exactly like google! by AdrianKemp · 2011-03-02 03:44 · Score: 1

Except they don't.
All of these places use a tactic very similar to mine, and any other competent sysadmin. They restore a working copy of it so that users are uninterrupted and then perform forensics on the system that went down so they can solve the problem.
You just don't see the rest of it
Re:Exactly like google! by Alex+Belits · 2011-03-02 19:01 · Score: 1

All of these places use a tactic very similar to mine, and any other competent sysadmin. They restore a working copy of it so that users are uninterrupted and then perform forensics on the system that went down so they can solve the problem.
No. They keep THE REST OF SERVERS running, and apply the fix to all of them once the problem is found. One server being offline means nothing for them.

--
Contrary to the popular belief, there indeed is no God.

Let's be fair here by AdrianKemp · 2011-03-02 03:23 · Score: 1

I don't know about all that many other linux server admins, but I could easily be misrepresented as one of these "redeploy solves everything" people if someone wasn't paying attention.

When a server goes down, my responsibility is not to figure out why, it is to get it the hell back up. Virtualization allows me to do that very, very quickly by restoring a backup or redeploying an identical server. As far as management/users/etc are concerned that's the sum total of what I do when something happens. In reality I'm taking the existing server out of our production pool and replacing it with a working version. I then spend as long as it takes figuring out exactly what went wrong with the broken server so that I can fix it/prevent it in the future.

There is absolutely *nothing* wrong with getting a new server up and running immediately. Anyone that would spend time finding the root of a problem before doing at least basic damage control shouldn't have a job. VM lets me "damage control" by getting the new server up and running in about 2 minutes, so that I can get to the hard part without people breathing down my neck.
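The workflow described here (swap a fresh clone into production, keep the broken VM aside for analysis) might look something like this as a toy model. The `Pool` class and the server names are invented for illustration and do not correspond to any real hypervisor API.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Toy model of a production pool (hypothetical, not a real API)."""
    active: set = field(default_factory=set)
    quarantined: set = field(default_factory=set)

    def swap_out(self, broken: str, replacement: str) -> None:
        """Put a working clone into service and keep the broken VM."""
        if broken not in self.active:
            raise ValueError(f"{broken} is not in the production pool")
        self.active.remove(broken)
        self.active.add(replacement)
        # The broken VM is kept, not destroyed, so the root cause can
        # still be tracked down once users are back in business.
        self.quarantined.add(broken)


pool = Pool(active={"web-1", "web-2"})
pool.swap_out("web-2", "web-2-clone")
print(sorted(pool.active), sorted(pool.quarantined))
```

The point of the sketch is the ordering: service is restored first, and the failed machine stays available for the slower root-cause work.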

Re:Let's be fair here by Alex+Belits · 2011-03-02 19:02 · Score: 1

If you have failures of that kind, you are doing something terribly, terribly wrong. Servers do not go down by themselves.

--
Contrary to the popular belief, there indeed is no God.
Re:Let's be fair here by AdrianKemp · 2011-03-03 01:20 · Score: 1

Hahahahahaha
Come back when you've administered a server in reality. We'll talk then.
Re:Let's be fair here by Alex+Belits · 2011-03-03 07:11 · Score: 1

I did, and I do. As opposed to you, I know what I am doing.

--
Contrary to the popular belief, there indeed is no God.

Everyone is correct, in a fashion by Yaddoshi · 2011-03-02 03:25 · Score: 1

Look, everyone has a preferred method of doing things when it comes to IT, and everyone has an opinion on best practice that is based on a number of different things. No one opinion is the best, and every problem shouldn't be resolved the same way.

I was introduced to UNIX while in college from a user's perspective. I played with LINUX as a desktop platform for the first time, also while in college. I also was exposed to the Mac OS of the 90s because that was the computer of choice at SU while I attended and was the typical system found in every computer lab, with the occasional IBM running Windows 3.11 found here and there. I acquired a 286 running DOS which I used to access BBS and MUDs via telnet. I later upgraded to a Windows 95 box, and after college followed a career path of personal computer repair for the next decade, which means I've had my hands in every Windows OS at some point or another, including 2000 server and 2003 server.

On the side I've been maintaining a LINUX server for the past 5 years, running Ubuntu. For the duration that I've owned the server, I've only "reimaged" it once, because I switched from a Pentium 3 class system to a Pentium 4. Any issues that it has had during that time I've been able to resolve with research, patience and a little trial and error. I restart it whenever security updates prompt me to, which is typically after a kernel upgrade. When a new LTS distro is released, I do a distribution upgrade, and there's usually stuff that needs changed/fixed afterward for everything to continue working as expected. It can be a total pain in the neck at times, and it drives my wife nuts on occasion, but I've learned more about computer systems this way, in my spare time, that in the long haul will be more useful to me in my career than I managed to pick up in a decade of PC repair.

I understand that this environment is completely different than a live environment that a business depends upon, and I fully sympathize with the gentleman who pointed out that when management is jumping down your throat to make something work, you tend to pick the fastest solution available to you. The only problem with this is that you have not figured out the cause of the problem, which means it could return.

There are a fair number of weird, unexplainable problems that have nothing to do with software, configuration error or hardware failure that can crop up from time to time. These are rare. They only happen once, maybe twice, and cannot be duplicated. A reboot will resolve these. But most of the time the source of the problem is human error of some kind, which means a reboot is a temporary fix.

So it ultimately becomes a longevity issue. If you're wiping out and redoing a server once a month, you probably ought to spend some time tracking down the source of the problem because the downtime during re-imaging over the course of a year will match or exceed the time spent finding the source of the trouble and correcting it. If you are running several servers this problem could affect some, many or all of them, so fixing one will allow you to fix all and the time will be negligible on the remaining servers, which then more than justifies the time invested in researching the problem. Furthermore, if you are experiencing trouble due to hardware beginning to fail, finding and replacing the defective part before it fails under scheduled maintenance is a much better solution than waiting until it fails under load when your company needs that server the most.

If, however, the issues only crop up maybe once a year, spending 72 hours finding a fix is probably not a good investment of time, because the equipment will be replaced/upgraded before the issue is likely to become a serious problem. In these cases I would recommend re-imaging. In the case of Windows operating systems I would be inclined to re-image anyway because lengthy support calls to Microsoft or the server vendor would potentially be required to resolve the problem, and sitting on hold is generally not a system administrator's best use of time.
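The trade-off in the last two paragraphs comes down to simple arithmetic, which can be sketched as below. The 72-hour diagnosis figure is from the text; the 4-hour re-image time, the 40-hour diagnosis and the service lifetimes are made-up numbers for illustration only.

```python
def cheaper_to_diagnose(reimage_hours: float, failures_per_year: float,
                        diagnosis_hours: float, years_in_service: float) -> bool:
    """Return True when a one-off diagnosis costs less time than
    re-imaging after every failure for the rest of the service life."""
    total_reimage_hours = reimage_hours * failures_per_year * years_in_service
    return diagnosis_hours < total_reimage_hours


# Monthly failures: 4 h x 12 x 1 year = 48 h of re-imaging vs 40 h to diagnose.
monthly = cheaper_to_diagnose(reimage_hours=4, failures_per_year=12,
                              diagnosis_hours=40, years_in_service=1)

# Yearly failures: 4 h x 1 x 3 years = 12 h of re-imaging vs 72 h to diagnose.
yearly = cheaper_to_diagnose(reimage_hours=4, failures_per_year=1,
                             diagnosis_hours=72, years_in_service=3)

print(monthly, yearly)
```

With these assumed numbers, diagnosis wins for the server that fails monthly and re-imaging wins for the one that fails once a year. The model ignores the chance that the same fault hits other servers, which would tilt things further toward diagnosis.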

Please bear in mind I am not a professional system administrator, but I've had the chance to observe them and dabble on both sides of the fence.

Re:Everyone is correct, in a fashion by geekoid · 2011-03-02 04:52 · Score: 1

But the great thing about VM is that you can move the failing image to a test server, re-image the production server to get it up. Then go study the problem.

--
The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect

Re:Not a decline, but a reflection of the new norm by L4t3r4lu5 · 2011-03-02 03:30 · Score: 1

Devices that important should have redundancy.

--
Finally had enough. Come see us over at https://soylentnews.org/

Old greybeard Unix admin here... by Anonymous Coward · 2011-03-02 03:37 · Score: 1

From personal experience this is normally due to management jumping down our throats to simply "get it done" which unfortunately runs counter to our inquisitive desires to actually solve the problem.

Old greybeard Unix admin here.... who has to admin Windows systems also, because I do need a steady paycheck. ;-)

I'm not paid to seek out and identify the academic curiosities of obscure Windows system problems, I'm paid to keep the systems up and running to do useful work for the end-users.

Sometimes it's way more effective to just "nuke it from orbit" and rebuild a Windows server, because you can spend a stupid amount of time and effort trying to debug what's going wrong inside a failing Windows installation, and it will still be a losing battle, because Windows is closed source, and many of its internals are deliberately kept hidden from you. Sometimes rebuild and reload can get you back to production in very short order.

My aeons of Unix administration experience have taught me well, even in the Windows world, how to be an effective system administrator, in that I am always 1000% (that's one-effing-thousand percent) prepared to deal with Windows administration.... and know that every time a Windows box craps out that it's a potential disaster recovery situation and you need to be able to recognize when it is and when it's a simple fix. Therefore I maintain backups, snapshots, bare-metal restore images and detailed documentation out the wazoo.

Yes, I do also run many Unix and Linux servers too, and a couple of them have uptimes in the hundreds of days, which strokes my ego just fine, but I also run many Windows boxes, such as some which are Citrix servers, that need rebooted every few days just to clear out all of the "my head is full of fuck" from the running Windows kernel that progressively self-destructs internally while running weird shit like Citrix.

I guess to sum it all up, if you're going to admin Windows boxen for a living... always keep your pimp hand strong.

Re:I can't tell you how many times I have heard th by corbettw · 2011-03-02 03:37 · Score: 1

Um, if you can monitor the zombie processes, why can't the same script kill those processes?

--
God invented whiskey so the Irish would not rule the world.

Re:I can't tell you how many times I have heard th by inglorion_on_the_net · 2011-03-02 03:37 · Score: 1

And, I guess, I'm kind of wondering why it needs to be rebooted in your situation. You've got a script monitoring zombie processes... And those processes can apparently be killed manually... So why not have that script kill the processes instead of just monitoring them? Or write a second script to fire off a batch of zombie kills?

How would you get rid of the zombies? Killing them won't help: zombies are processes that are already dead, but whose exit status has not yet been collected by their parent. They are removed from the process table when the parent calls wait/waitpid for them or, if the parent itself exits, when init adopts and reaps them. Until that happens, they clutter up the process table, which only has a limited number of entries (often about 32000). If you create zombies faster than they are reaped, you will eventually run out of process descriptors, and calls to fork will fail.
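For what it's worth, here is a minimal sketch of how a zombie actually goes away, assuming a Unix-like system where Python's `os.fork` is available. The function name is made up for the example. The child exits at once and sits in the process table as a zombie until the parent collects its status with `os.waitpid`.

```python
import os


def spawn_and_reap(exit_code: int) -> int:
    """Fork a child that exits immediately, then reap it.

    Returns the child's exit code as seen by the parent.
    """
    pid = os.fork()
    if pid == 0:
        # Child: exit right away. Until the parent waits for it,
        # the kernel keeps its process-table entry (a zombie).
        os._exit(exit_code)
    # Parent: waitpid() collects the exit status, which is what
    # lets the kernel free the zombie's process-table entry.
    _, status = os.waitpid(pid, 0)
    if not os.WIFEXITED(status):
        raise RuntimeError("child did not exit normally")
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(spawn_and_reap(7))
```

So a script cannot kill zombies, but it can go after the parent that fails to wait for them: fixing or restarting that parent is what clears the table, since orphaned children are adopted and reaped by init.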

--
Please correct me if I got my facts wrong.

Wh...Wh...WHAT?! by Anonymous Coward · 2011-03-02 03:40 · Score: 0

I seriously can't even believe this is a discussion. I read briefly through the comments here and a portion are in favor of actually using the "Well, it's broken, let's re-image" approach. This simply does not work except on the most basic of servers, and even then you have to wonder at the ability of the person installing the server in the first place, as the basics are all that are necessary.

First of all, as a VMware ESXi user I can tell you that there are in fact limits to what enterprise virtual machines can and cannot do. For example, ESX does not appear to play well with anything above a 50GB database at all. Now I am sure that in the future this limitation will go away, but that is how it is today. As such I would not ever even think about putting a database server regardless of size on a VMware installation simply because if it grows to that point I don't want to be caught with my pants around my ankles. Thus, you CAN'T use this rebuild-it approach....

The only time this approach is even remotely possible is when a server runs a single function and that function is configured in some very basic way. For example, an apache2 server whose whole purpose in life is to be one of the n members of a load balanced front end for a web-pool. Then removing the single member from the pool and rebuilding it is POTENTIALLY faster than finding the issue. However, a good computer technician (not just a systems administrator) is going to know when critical mass for "Is this worth fixing or is it taking up more time than necessary" is reached.

Now as for the idea that Systems Administrators are a dying breed, you are very right. But it's not because of virtualization, it is because that is what our job is. It is my job as a systems administrator to ensure that my SERVICES have as close to a 100% uptime as possible. I write scripts that use tools such as bash, perl, ruby, php, etc. to fix issues before I have to be involved. For the past 50 years people like me have been doing the exact same thing and none of us believe in reinventing the wheel, so we use and improve upon the tools that our predecessors have given us to more effectively manage and administer our systems. I, the systems administrator, am doing such a good job that I am killing myself off. Those declining skills are due to the fact that Joe Schmoe wrote a shell script back in '96 that did what I want to do, but it hasn't been updated since '02 and as such the script doesn't work the same and needs updated, and though I know the concept, I have never had to do it by hand until now thanks to Joe's wonder-tool. Worse yet, I, having had my job for the past ten years, want to get paid #n money so that I can afford to feed my family something other than ramen. But the average CS coming out of college today has 1/5th the skills I do (so he can't figure out Joe's wonder-tool and how to make it work in his environment), he gets paid 1/5th what I do, and his answer to "Why is the web server not responding" is "I don't know, but I will re-image it immediately". You get what you pay for.

Skilled admins will save you $$$ by bl8n8r · 2011-03-02 03:42 · Score: 1

Because a skilled unix admin possesses the knowledge to turn a downloaded iso image into a hardened firewall, web server, db server, reverse proxy, network sniffer, VPN, router, iSCSI target, computing cluster, spam filter, XMPP, SMTP, FTP, SNMP, DHCP, NTP, BOOTP, TFTP, SMB (ad nauseam) servers, and/or NID/IPS device. All without Cisco, Oracle, Windows, Barracuda, VMware or other site licenses, seat licenses, or maintenance contracts. Most admins I know do not possess these skills, nor do they possess the interest in obtaining them. Perhaps there isn't necessarily a decline and fall of the system admin, but a rise in ubiquity of first-tier administration.

--
boycott slashdot February 10th - 17th check out: altSlashdot.org

Re:Skilled admins will save you $$$ by Anonymous Coward · 2011-03-02 08:22 · Score: 0

Been there done that, would do it again :)

Some of its plain lazy and poor skills by Anonymous Coward · 2011-03-02 03:44 · Score: 0

No, it's not bad, but it makes for poor skills. I have seen cases where just re-imaging does not work, and the admin had no clue how to fix the problem. The company was ready to trash a whole system once because all the admins knew how to do was try to re-image the system, which was not working. If they had more skills or were not in this type of mindset they could have read some documentation and fixed it.

Most companies I worked for don't even have admins anymore, they make the developers do it.

Database servers by boristdog · 2011-03-02 03:48 · Score: 1

Um...AFAIK Virtualization is really not recommended for database servers.

So I get to keep my arcane knowledge.

Re:Database servers by geekoid · 2011-03-02 04:49 · Score: 1

or not:
http://www.oracle.com/technetwork/topics/virtualization/whatsnew/index.html
http://www.oracle.com/technetwork/database/enterprise-edition/db-virtualization-support-133757.pdf

--
The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
Re:Database servers by Alex+Belits · 2011-03-02 19:06 · Score: 1

Virtualization is good for two things:
1. Running Windows.
2. Software development.
Anyone who is running VMware or its ilk as a replacement for version control system, backups or package manager, is incompetent.

--
Contrary to the popular belief, there indeed is no God.

Rebuilding won't fix stupid by whitelabrat · 2011-03-02 03:48 · Score: 1

A server rebuild won't necessarily fix anything. It could be a good recovery strategy, but when you run into a performance or functionality issue who is going to be there to find that and fix it? A rebuild won't help you there. No a good systems administrator probably isn't needed in a "we don't care" one size fits all commodity environment, but you can't expect the same level of service that a skilled professional can provide.

Reboot frequency = Rebuild frequency by shoppa · 2011-03-02 03:49 · Score: 2

My average Unix (in the past decade, Linux) system uptime between reboots is now 3 to 4 years.

Not surprisingly, most of the reboots are there exactly for installation (aka "rebuild") of an updated OS usually on the next generation of server hardware. Major package upgrades (e.g. MySQL, Apache) almost never require any tinkering with the OS.

I compare that to typical Windows servers in my group, where reboots happen in many cases nightly as a preventative measure, and the system is still some crufty old version of Windows (e.g. Windows NT), the application packages are deeply tied to DLL's and drivers, and I suspect that the statistics and attitudes are apples vs oranges.

Re:Reboot frequency = Rebuild frequency by Huckleberry_Hell_Raz · 2011-03-02 04:47 · Score: 1

If you are only rebooting (meaning upgrading kernels) every 3-4 years, let me have your IP addresses and an account on it to verify uptime. I promise I won't root your box...unless you allow absolutely no connectivity to your box, and no physical access to it, you are really doing a disservice by not keeping up on upgrades any better than 3-4 years. You are aware that kernel exploits exist, even in *nix, that will result in you losing your box, right?
Re:Reboot frequency = Rebuild frequency by geekoid · 2011-03-02 04:47 · Score: 1

"My average Unix (in the past decade, Linux) system uptime between reboots is now 3 to 4 years. "
that reeks of improper maintenance.

--
The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
Re:Reboot frequency = Rebuild frequency by shoppa · 2011-03-02 09:51 · Score: 1

It's a pretty sad commentary when uptimes approaching a good chunk of a decade, are taken as evidence of improper maintenance.

I compare with my car's engine... if I maintain it right, then it can very well last for 20 years between overhauls. Why shouldn't a computer (or a computer's software) be at least as reliable, as that? Shouldn't the length of time between overhaul (aka "reinstalling the OS") be evidence of very good maintenance (as well as fundamentally good design and configuration?)

And keep in mind that the applications, especially applications open to the big bad outside world over the net, are kept updated. Just that updating them never requires a reboot and certainly not an OS reinstall.

The 3-R's mentality of Windows (Retry, Reboot, and Reinstall) I did not expect would have affected slashdotters. The machines aren't there to suck up my time maintaining them. They're there, to do their job.

Re:I can't tell you how many times I have heard th by Anonymous Coward · 2011-03-02 03:52 · Score: 0

And, since you asked, yes, I am looking for another job. (Clueless admins and pointy-haired bosses: a match made in...)

Good luck with your job search. From your description, your current employer is doing it wrong.

No Question... by Daley_G · 2011-03-02 03:53 · Score: 1

We've seen this trend in just about everything in our daily lives as well. Back in the '50s and '60s, Service Stations were just that - SERVICE. There was a knowledgeable individual that would check the oil and such when you gassed-up your car. Today, none of this exists because it's cheaper to pay some pimply-faced kid to man the cash-register vs. paying someone with the knowledge to actually service an automobile. Now before you make the argument that cars are more technologically-advanced nowadays, consider the fact that you're still spending YOUR money on your car - just that the "service" station isn't spending THEIR money on it anymore - your money is going into the up-front cost of the automobile instead of spreading it out over the life of the product in the way of maintenance and upkeep. The same holds true for technology. You pay an engineer to design the system correctly first, and the cost over time goes down because you don't need to pay an engineer to maintain it - the tools available today do a pretty good job of replicating what you paid the engineer to do, at a fraction of the cost. Apply this same logic to fast-food. Nobody at McDonalds really knows how to cook or prepare a meal. All the "engineering" is already done, so you only have to pay the minimum-wage folks to replicate what's already been done. This list goes on and on and includes everything from your car to your house to your meals to the subject at hand. How many of us still know how to perform repairs that need to be done around the house? I'd bet that's a small number, because we paid for a "system" that doesn't need much maintenance, and when it does it's fairly modular (nobody sweats copper anymore - it's all plastic tubing that snaps together. Nobody adds electrical outlets because the house is pre-wired).

Re:I can't tell you how many times I have heard th by Ephemeriis · 2011-03-02 03:58 · Score: 1

How would you get rid of the zombies?

I dunno. It isn't my server. Maybe I mis-understood the OP...

I thought that his "In the meantime, watch and learn as I kill the offending processes" was a reference to those zombie processes that eat his server. I figured he had some method of cleaning them out, and was wondering why that method couldn't simply be automated.

But if that was a reference to something else entirely, and he's got no magic method for killing zombies, then I suppose it makes sense that you'd have to reboot.

--
"Work is the curse of the drinking classes." -Oscar Wilde

To a point... by CAIMLAS · 2011-03-02 04:04 · Score: 1

There comes a point where it makes sense to replace a system (OS) or rebuild it. Yes, Windows admins jump on that bandwagon more often - but more often than not, it's more appropriate there than elsewhere, too.

Depending on what the cause is, and the system in question, it makes sense. An early FreeBSD 5 machine with no documentation, highly customized ports, many running services, and so on is likely better to piecemeal out to different machines ('rebuild' it) than it is to disrupt service as you figure out how to get the thing upgraded. Improperly removed programs or registry corruption/errors in Windows often means the same thing.

Ultimately, what it comes down to, is time. How long would a rebuild take, and how much downtime is being accrued from ghosts in the machine? How much is that time worth? It may take a day or two to do a rebuild, but even determining the cause of a peculiar problem can take several or more. By eliminating the software idiosyncrasies of the existing install (and often getting things to the most recent patch version in the process) you've eliminated one possible cause as to the problem: either it works now, or it was hardware/firmware/a driver/etc. that's causing the problem.

Sure, wanton reinstalls aren't a good fix. However, they're often a cost-saving measure, and in many applications, appropriate. The hardware and software on most machines is not worth the time invested to "do it the old Unix way".

--
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers

Lake Wobegon Problem by dkleinsc · 2011-03-02 04:04 · Score: 1

Not all sysadmins can be above average. By definition, some of them will suck. As more stuff moves onto Linux-based systems, the admins who suck will be working on Linux. Ergo, bad admins will do less-than-ideal work.

--
I am officially gone from /. Long live http://www.soylentnews.com/

It's only a symptom by Anonymous Coward · 2011-03-02 04:23 · Score: 0

Virtualisation is definitely a very solid example of the degradation of Admin skills, but isn't the cause in and of itself.

Here is the problem: Poor practices can get into production faster.

Anyone can slap an OS onto a system. Building an OS for an enterprise with certain requirements and demands may involve a lot more work, which of course means time and money up front - a slower time to deployment. Ask a manager if he wants to wait an extra week (or two) to get a server out the door, and the answer will of course be "no!" However, here is where the false economy lies: That generic DVD-install may well be slower, less stable, less reliable, and less secure than the one that was tweaked and properly configured. Time not spent up-front will lead to a less stable environment.
Now when a system blew up before, rebuilding it would take a day or two, unless the admin was able to say "I told you so!" and get his week to set it up properly. Now, with VMs, it takes half an hour to get back into production, so why bother working on it? Who cares if the environment is shitty, unstable, and badly-designed, if it can be rebuilt in bits and pieces in minutes?

The thing is, you WILL be rebuilding it - constantly - and ultimately there's a decent chance that the entire pile of crap will implode on you (or at least run into a dead-end), requiring a complete re-architecture. Of course by that point, the people who pushed for and deployed the entire unsustainable environment will have been promoted to management because of their amazing speed to production, and they encourage the same thing.

In other words, VMs aren't a problem, they're a facilitator for problem behaviour.

separate issue is separate by Anonymous Coward · 2011-03-02 04:24 · Score: 0

UNIX/Linux machines do not magically break without changes. restoring an image will restore that image, of course without the change that broke it. if the change was needed for something, you are still going to have to figure out how to make that change without breaking something.

might i add, 'duh'

Re:I can't tell you how many times I have heard th by Anonymous Coward · 2011-03-02 04:27 · Score: 0

What do you think the PHB/management solution is? Ask the DBAs to write a script that will monitor zombie processes, so the sysadmins will be warned in advance... Like, around 20 minutes before the application crashes. Just enough time to tell all users to save their work, because we need to reboot everything. Just like under Windows.

Wait, if you can monitor the zombies, can't you kill them? Just use that 20m window to kill the oldest zombies and you should be ok.
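
One wrinkle, for what it's worth: a zombie has already exited, so sending it a signal does nothing. Its process-table entry goes away only when the parent reaps it with wait(), or when the parent itself exits and init adopts and reaps the child. So the useful thing for a monitoring script to report is which parent is leaking them. A minimal sketch in Python, assuming input shaped like the output of `ps -eo pid,ppid,stat,comm`; the function name and the sample data are made up for illustration, not taken from the OP's server:

```python
from collections import Counter


def zombie_parents(ps_output: str) -> Counter:
    """Count zombie children per parent PID.

    Expects text shaped like `ps -eo pid,ppid,stat,comm` output:
    one header line, then one process per line.
    """
    parents = Counter()
    for line in ps_output.strip().splitlines()[1:]:  # skip the header
        fields = line.split(None, 3)
        if len(fields) < 3:
            continue  # ignore malformed lines
        ppid, stat = fields[1], fields[2]
        if stat.startswith("Z"):  # Z = defunct ("zombie") process
            parents[int(ppid)] += 1
    return parents


# Hypothetical sample, not from any real machine.
SAMPLE = """\
  PID  PPID STAT COMMAND
    1     0 Ss   init
  812     1 Sl   dbserver
  901   812 Z    worker <defunct>
  902   812 Z    worker <defunct>
  950     1 S    sshd
"""

if __name__ == "__main__":
    for ppid, count in zombie_parents(SAMPLE).most_common():
        print(f"parent {ppid} has {count} unreaped children")
```

Whether restarting just the offending parent is acceptable depends on the application, which is probably why the thread keeps ending up at "reboot".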

Re-imagining? by Anonymous Coward · 2011-03-02 04:29 · Score: 0

Is that where you hire Edward James Olmos to be your sysadmin?

It's only a symptom by swordgeek · 2011-03-02 04:32 · Score: 1

(Note: Reposting this while logged in - why did they get rid of the 'login at post' option?)

Virtualisation is definitely a very solid example of the degradation of Admin skills, but isn't the cause in and of itself.

Here is the problem: Poor practices can get into production faster.

Anyone can slap an OS onto a system. Building an OS for an enterprise with certain requirements and demands may involve a lot more work, which of course means time and money up front - a slower time to deployment. Ask a manager if he wants to wait an extra week (or two) to get a server out the door, and the answer will of course be "no!" However, here is where the false economy lies: That generic DVD-install may well be slower, less stable, less reliable, and less secure than the one that was tweaked and properly configured. Time not spent up-front will lead to a less stable environment.
Now when a system blew up before, rebuilding it would take a day or two, unless the admin was able to say "I told you so!" and get his week to set it up properly. Now, with VMs, it takes half an hour to get back into production, so why bother working on it? Who cares if the environment is shitty, unstable, and badly-designed, if it can be rebuilt in bits and pieces in minutes?

The thing is, you WILL be rebuilding it - constantly - and ultimately there's a decent chance that the entire pile of crap will implode on you (or at least run into a dead-end), requiring a complete re-architecture. Of course by that point, the people who pushed for and deployed the entire unsustainable environment will have been promoted to management because of their amazing speed to production, and they encourage the same thing.

In other words, VMs aren't a problem, they're a facilitator for problem behaviour.

--

"People who do stupid things with hazardous materials often die." -- Jim Davidson on alt.folklore.urban

Re:I can't tell you how many times I have heard th by Anonymous Coward · 2011-03-02 04:33 · Score: 0

I could have been in a conference call with this unnamed company. I pointed out that they had a lot of zombie processes on the database machine. The manager on their side said he was going to ask the DBA, and then returned and told us that the DBA had said "That's not a problem" and that was the end of that investigation.

To add to this, the unnamed big database vendor had removed the product patch from the available downloads on their main site. So it was impossible for us to test with the same version of the database that this unnamed company was using in production. The only reason I can come up with for a big database company to remove a version from the historical download section is that it's so horribly broken that it's dangerous to run.

And still they did use it.
When I read your comment, I wonder if the manager actually asked the DBA or if he went out of the room and returned and just told us "That's not a problem".

It changes the problems you bother to solve. by gestalt_n_pepper · 2011-03-02 04:34 · Score: 1

I admin about 100+ VMs on 14 separate servers. I used to admin about 50 real physical machines. I can tell you that the physical machines had many more quirky, one-off problems that, quite frankly, weren't worthy of further investigation in terms of business cost-effectiveness. They were inevitably reformatted. All virtualization did was to speed up the process so that a new machine can be created in an hour instead of in two days.

As much as it might be intellectually satisfying to dig down into a problem, most systems are there to serve a business, and make money, not to solve the intellectual curiosity of a system admin who vaguely believes that the world is a better place if he/she can just take the time to solve every little system quirk.

You keep them working so the company can make money. That's all. That's the main priority. And thus we grow up.

--
Please do not read this sig. Thank you.

Re:It changes the problems you bother to solve. by Alex+Belits · 2011-03-02 19:10 · Score: 1

Congratulations, you are an incompetent!

--
Contrary to the popular belief, there indeed is no God.
Re:It changes the problems you bother to solve. by gestalt_n_pepper · 2011-03-03 01:30 · Score: 1

Congratulations. You've shown that you are unable to distinguish between mental masturbation and practical work. Just the kind of guys I've fired.
Believe me, things will look different when you're out of high school.
Cheers!

--
Please do not read this sig. Thank you.
Re:It changes the problems you bother to solve. by Alex+Belits · 2011-03-03 07:15 · Score: 1

I have been doing software development and system administration in a professional capacity for 20 years already.
You, on the other hand, "know computers".

--
Contrary to the popular belief, there indeed is no God.
Re:It changes the problems you bother to solve. by gestalt_n_pepper · 2011-03-03 07:39 · Score: 1

If true, it's unfortunate that you still misunderstand the difference between masturbation and money and sound like a frustrated adolescent who can't quite figure out why the rest of the world doesn't recognize his genius. I'm sure you're also still fascinated by science fiction and think next year is the year of the Linux desktop too.
Look, life is full of people at all skill levels. I deal with C++ programmers who have never cracked a case, and have only a vague familiarity with the relationship between voltage, current, and power and have never touched assembly. Guess what? They can still program in C++ and C# competently, and we all make money. Do they work hard to engage in "best practices?" No. Our current attempt at this has been "Agile" which too will be undermined by human nature. Has the business been viable since 1989? Yes. Will the software one day become unmanageable? It has been that way for the last decade. We patch it up, overlay it with .net and wpf, and muddle through. Nobody cares. We'll all be gone by the time it all falls over, just like every other software company.
Art is not eternal. Neither is software. Embrace the chaos. It keeps us all in our jobs, eh?

--
Please do not read this sig. Thank you.
Re:It changes the problems you bother to solve. by Alex+Belits · 2011-03-03 19:11 · Score: 1

Look, life is full of people at all skill levels. I deal with C++ programmers who have never cracked a case, and have only a vague familiarity with the relationship between voltage, current, and power and have never touched assembly. Guess what? They can still program in C++ and C# competently, and we all make money.
And I have seen various other people who "made money" without any knowledge of what was supposed to be their job. They are called frauds, and what they did was called fraud.
You would have some semblance of a point if highly skilled and knowledgeable people were extremely rare. They are not. They just drown in the sea of scammers and idiots such as yourself. Die in a fire and don't forget to kill all your friends.

--
Contrary to the popular belief, there indeed is no God.

I hate that it's the truth by RulerOf · 2011-03-02 04:39 · Score: 1

Point is, you only need one person with actual sysadmin skill to make and maintain an image. Hundreds of point-and-click types can then use that image. It happens in large organizations all the time. Why pay for a hundred skilled, experienced sysadmins when you only need one skilled, experienced sysadmin and 99 paper MCSEs? For many businesses this is an easy decision.

THIS.

It's a problem I run into myself a lot, really, as well. With the rise of virtualization, operating systems have gone from the tool that allows you to maintain your hardware such that it effectively delivers many applications to users to more of a vehicle on top of which single applications sit. But now, that vehicle, in turn, rides on top of your virtualization platform, which is basically designed to be as blown out and expendable as possible. While a given piece of hardware effectively delivers the same number of applications to end users, the real "Systems" part of administration is no longer the true integral piece of the puzzle that directly converts "small iron" into "line of business."

Why should a company waste their money on my time spent digging through event logs, flexing the google-fu, and possibly coming up with the answer of "This would take so long to fix that I could probably rebuild the server and reinstall its single application faster than the problem could be resolved manually," when they can get a good enough result by skipping the investigation and just doing that in the first place?

It's extremely unfortunate that it works this way, especially as I feel I learn so much every time I encounter and solve a new problem that's preventing a system from running correctly. While it may be more intellectually stimulating and personally enriching to do things from the "advanced" perspective, on the whole, it usually ends up taking as much as if not more time than just blowing a system out in the event that you've never solved the given type of problem before.

Perhaps I've just got more learning to do though I suppose. It might be a different story with Linux! (where, ironically, I've simply reinstalled my test systems many times rather than actually solve problems :P)

--
Boot Windows, Linux, and ESX over the network for free.

Re:I hate that it's the truth by AK+Marc · 2011-03-02 06:43 · Score: 1

It's an ego issue. Rebuilding a server is "giving up without knowing the cause." Finding the issue is "demonstrating prowess." So those that do it as a hobby for fun and knowledge should aim for the second. Those under a business deadline might lean towards the first.

The problem is what happens when the hobbyist becomes the paid admin. Should he find the issue, even if it takes two days? Or take 30 minutes to scrape up the backup image, push it out, and restore the data?

--
Learn to love Alaska
Re:I hate that it's the truth by RulerOf · 2011-03-02 06:53 · Score: 1

Or take 30 minutes to scrape up the backup image, push it out, and restore the data?
Precisely!

Being paid (anything but handsomely... sigh) to do systems admin work is wonderful in how fulfilling it can be. Knowing when to give up and just fix the damn thing already... that definitely takes experience to master.

In the meantime when it happens, which isn't often anymore, I've had much luck with my co-workers letting me know when I'm being ridiculous. In turn, they too appreciate when I can tell them how to fix a problem they would usually reimage or reinstall to get out of :)

--
Boot Windows, Linux, and ESX over the network for free.
Re:I hate that it's the truth by ppanon · 2011-03-02 10:08 · Score: 1

I think it depends on what the root cause is. If there is a fundamental configuration issue and the problem is just going to periodically resurface in the restored images (or you may have hundreds of potentially affected images), then it may be more cost effective in the long run to fix it properly this time than to do a rollback patch multiple times. It depends on how frequently that problem rears its ugly head.
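
To put rough numbers on "how frequently", here is a back-of-the-envelope sketch in Python. It assumes the only cost that matters is hours spent, and it reads the "two days" upthread as two 8-hour working days; both are assumptions for illustration, and the function name is made up:

```python
import math


def break_even_recurrences(fix_hours: float, reimage_hours: float) -> int:
    """Smallest number of recurrences at which re-imaging every time
    has cost at least as much as fixing the root cause once."""
    if fix_hours <= 0 or reimage_hours <= 0:
        raise ValueError("durations must be positive")
    return math.ceil(fix_hours / reimage_hours)


if __name__ == "__main__":
    # 16 h = "two days" read as two 8-hour working days (an assumption);
    # 0.5 h = the 30-minute restore mentioned upthread.
    print(break_even_recurrences(16, 0.5))
```

With those inputs the break-even is 32 restores. With hundreds of clones of the same image, 32 incidents across the fleet may not take long to accumulate, and the sketch ignores the cost of the downtime itself.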

--
Laissez lire, et laissez danser; ces deux amusements ne feront jamais de mal au monde. - Voltaire
Re:I hate that it's the truth by AK+Marc · 2011-03-02 10:51 · Score: 1

If there's a fundamental configuration issue, then you should have used more testing. Also, this is obviously for "a problem" not "the continuing errors of the same type and issue that occur frequently." And if you have it affecting hundreds of images, then you'd have no need at all to troubleshoot the one affected machine while causing an outage. You can take an image of the broken one, then put on a working image, then restore the broken image to a test machine. In any case, it's almost never the best course to cause an outage poking around a live computer, rather than just putting a working image on it and putting it back in service.

--
Learn to love Alaska
Re:I hate that it's the truth by ppanon · 2011-03-02 21:28 · Score: 1

If there's a fundamental configuration issue, then you should have used more testing.
Maybe it's something that only shows up under really heavy and steady load conditions, like a logging level that is just a little too verbose, or that's triggered only a few times in your testing but many times in production conditions. Because unless you're automating an existing process, for instance if you're selling a range of "cloud" software services, the actual production mix may be quite different from your planning and expectations.

Also, this is obviously for "a problem" not "the continuing errors of the same type and issue that occur frequently."
Well, unless the "problem" is due to a stray cosmic ray that flipped the wrong bit(s), didn't get caught by ECC because it was in the processor, not the RAM, and caused your machine to corrupt itself before it crashed, there probably is an underlying hardware or software fault that will cause the problem to re-occur. If the system ran OK for years prior to the crash then either a) it's a software fault such as a very rare race condition that probably won't happen again for years, b) it's due to a patch/change you recently deployed, or c) it's an indication that your hardware is getting flaky. The probability of a) is much lower than that of b) or c). If it's c) on a virtualization platform, then the next affected VM might not be one you can recover as easily.

And if you have it affecting hundreds of images, then you'd have no need at all to troubleshoot the one affected machine while causing an outage.

Well, if it's affecting any of hundreds of clones of the same image, then maybe you've got a cloud or web farm that most of the time will have excess capacity. That means that keeping that image the way it is for investigation probably won't be affecting the provided service significantly. Or maybe a worm or a virus infection with a trojan payload is making its way through your unpatched VM farm and restoring a backed up VM just sets it up to be reinfected.
If something is so critical that you can't afford the downtime, then design the system with proper redundancy and failover. If it's not that critical, then you should at least do a minimum of investigation to figure out what the root cause is before re-spawning the server VM. Time-box it based on business needs, certainly, but if those needs can't support a certain amount of problem investigation during a failure, then they probably dictate a different architecture to provide better reliability.

--
Laissez lire, et laissez danser; ces deux amusements ne feront jamais de mal au monde. - Voltaire
Re:I hate that it's the truth by AK+Marc · 2011-03-03 11:28 · Score: 1

Your "solution" requires that there be no business needs which aren't important enough to justify redundancy but still important. Perhaps on paper one could make that argument, but in practice, you are 100% wrong. Businesses want the most effect for the minimum cost. As such, they will have servers with no backup that cost real money while down where everyone in the business (except you, obviously) would want it imaged and up in 30 minutes, rather than spending hours trying to determine why it failed.

And you prove my point where it's an ego thing to find causes, rather than to actually do what the employer wants. Or are you going to pull a Terry Childs and go to jail because you will do what you think your employer needs to do, even if contrary to what they state they want? You can't have it both ways. Sometimes being an admin means choosing between being a good employee and being a theoretically good admin. And the theory doesn't work when your boss walks in mad after the server has been down an hour and you are poking around when a reboot would have "fixed" the problem.

--
Learn to love Alaska
Re:I hate that it's the truth by ppanon · 2011-03-08 21:29 · Score: 1

I'm perfectly willing to a) do the reboot when the employer wants it with b) the observation that they may be hiding a problem that will come to bite them harder on the derriere later. Preferably with an e-mail "paper" trail so that, if it does recur later, my derriere is covered. The customer is right until Nature shows that she's got the last word.

--
Laissez lire, et laissez danser; ces deux amusements ne feront jamais de mal au monde. - Voltaire

IN a lot of respects by geekoid · 2011-03-02 04:44 · Score: 1

that is the rational solution. It's quicker and easier.

The smart admin copies the failing image, reimages, and then installs the copy of the failing image on an offline machine to study.
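
A minimal sketch of that order of operations in Python, assuming the VM is already shut down and its disk is an ordinary file that can be copied. The function name, paths, and file layout are hypothetical; a real hypervisor will have its own snapshot and clone tooling:

```python
import shutil
import time
from pathlib import Path


def preserve_and_reimage(live_disk: Path, golden_image: Path, quarantine_dir: Path) -> Path:
    """Copy the failing disk image aside, then overwrite it with the golden image.

    Returns the path of the preserved copy. Assumes the VM is powered off
    and that disks are plain files (hypothetical layout).
    """
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    preserved = quarantine_dir / f"{live_disk.stem}-failed-{stamp}{live_disk.suffix}"
    shutil.copy2(live_disk, preserved)     # keep the evidence first
    shutil.copy2(golden_image, live_disk)  # then restore service
    return preserved
```

The point is only the ordering: the failing image is saved before anything overwrites it, so service comes back quickly and the broken copy is still there to boot on an offline box.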

--
The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect

Re:IN a lot of respects by Alex+Belits · 2011-03-02 19:13 · Score: 1

Study how? If the problem did not happen within 10 seconds of boot-up, it had to be somehow triggered, likely by things that only happen in production environment.
It's also very unlikely that VMware jockey keeps an exact copy of production network and a copy of all production data, that are necessary to run a production server's snapshot anywhere outside of the production environment in the first place.

--
Contrary to the popular belief, there indeed is no God.

Good for stable environments by guruevi · 2011-03-02 04:47 · Score: 1

Big shops like Google can do a simple re-imaging job because they have enough cheap servers so they can just throw a server out if it misbehaves, they know it's not their software because it runs fine on millions of other computers and if it misbehaves, usually it's the machine going bad. Only when multiple machines start having the same issue do they look into it as a possible bug, fix it and roll out an update to all their systems.
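
The "only look when multiple machines show it" rule can be sketched in a few lines of Python. This is an illustration of the idea only, not a description of what Google or anyone else actually runs; the function name, the (hostname, signature) input shape, and the threshold of three hosts are all assumptions:

```python
from collections import defaultdict


def suspected_bugs(failures, min_hosts=3):
    """Return error signatures seen on at least `min_hosts` distinct hosts.

    `failures` is an iterable of (hostname, signature) pairs. A signature
    seen on only one or two hosts is treated as that machine going bad.
    """
    hosts_by_sig = defaultdict(set)
    for host, sig in failures:
        hosts_by_sig[sig].add(host)
    return sorted(sig for sig, hosts in hosts_by_sig.items() if len(hosts) >= min_hosts)


if __name__ == "__main__":
    # Hypothetical failure log.
    log = [("a", "oom"), ("a", "oom"), ("b", "oom"), ("c", "oom"), ("d", "disk")]
    print(suspected_bugs(log))
```

Repeated failures on the same host count once, so one flaky box cannot trip the threshold on its own.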

In a smaller shop usually, there is no space to have multiple downtimes because it will just re-image the same problem over and over again. The sysadmin is also the programmer and the help desk and simply doesn't have time to make a super stable system and usually has to use some 'legacy software' which basically means a custom developed piece of crap that nobody has the source code to. Virtualization has caused some idiot sysadmins to think they have a Google-like infrastructure by using virtualization on one or two boxes as an imitation datacenter while running some unstable software.

A good sysadmin does not have to nuke their server installation from orbit every time something goes wrong. I can understand imaging desktops because users will do some modification that makes it crash but they're never able to tell you exactly what they did. But a server is (or should be) well documented and has only few items that can go wrong. Finding out why your SCSI bus does a reset after a few weeks will be much more advantageous than rebooting or re-imaging it because eventually it will reset the wrong way and you'll end up with a corrupted RAID array.

--
Custom electronics and digital signage for your business: www.evcircuits.com

Kickstart... by Heretic2 · 2011-03-02 04:48 · Score: 1

So there are a couple philosophies on backing up your systems. If you can tightly control the imaging process and automate it so that it only takes 10-20 minutes, re-imaging may actually be not only a viable solution but an elegant solution. Especially when dealing with clouds where instances are essentially newly provisioned images. If you're logging to a centralized system and storing persistent data elsewhere, re-imaging may be OK. However, it doesn't replace engineering (define/design/implement/test cycle) a good imaging process. If there's a problem across all your machines, you'll obviously need to resolve that in the imaging process. I typically expect imaging processes to be complete with automated application deployment and configuration as well.

Symptom of greed... by fallen1 · 2011-03-02 04:49 · Score: 1

I see this as a symptom of greed coupled with ... not necessarily stupidity, but something close. I also agree with several other posters who have said "time" is a major factor in the "just reboot it and get it running again" scenario.

Why do I say greed and stupidity? If a system administrator (and whoever else set up the initial system, if it wasn't just the admin) has done their job correctly, the majority of their time should be taken up doing - technically - nothing. In reality, a good admin will always try to keep their skills up-to-date, learn new skills or methods to help them on the job, and so on. Their normal routine of monitoring the systems and/or network should not take all day (unless the admin is the only one for hundreds of systems) and that leaves them "open" - which, in my opinion, is the correct way for system admins to be. "Open" means they are able to respond to a user's service call if they need to show in-person, they can instantly respond to an issue with a server or the network, they can respond if there is a down router/cable modem/phone system/other component, and above all if they are "open" it means _management_ has not arbitrarily decided "Oh, since you don't appear to be doing anything we're going to assign you task(s) X, Y, and Z -- even though they aren't in your area of expertise."

So, the company assigns them other "work" to do because they don't _appear_ busy (greed) which in turn removes their ability to be "open" to respond in a timely manner (stupidity). There are times when a down system can take someone a while to repair which then means the other "work" added on to the admin doesn't get done and suddenly the admin is getting a bad review -- for work not attached to the job they were hired to do. And so on and so forth.

Basically, if you know less about system administration/network administration than the person you hired to do the job and your systems are running smoothly and efficiently with little to no downtime then fuck off and let them do their job -- even if they don't appear busy.

--

Dream as if you'll live forever.
Live as if you'll die tomorrow.
~Anonymous~

Re:Symptom of greed... by MadMaverick9 · 2011-03-02 16:01 · Score: 1

A long time ago (almost 15 years) a manager at a big American company gave his team members a t-shirt with the following printing:
Of course I don't look busy - I did it right the first time.
This was after successful completion of a project.
I still have this t-shirt in my closet and wear it at times. A tribute to the only manager I've had who understood this.

Failure of good design by NitroWolf · 2011-03-02 04:52 · Score: 1

This is ultimately a failure of good design. Yes, Linux suffers from this as well (All OS's do to one extent or another). Windows has always suffered from this because of the various windows installer packages available to developers. Linux suffers more and more from this because most distributions have a package management system now, which has the same problems as the Windows installers.

If you install an application, either with a package management system (apt, rpm, etc...) or the Windows Installer, there's really no telling what it does to your system... flinging files here and there, modifying configuration files, etc... Yes, you could potentially get the manifest for most package installers on Linux and do some forensics on what it's supposed to be doing. Oftentimes, though, the time it would take to do this far exceeds the time it would take to rebuild the computer. Throw in neophyte users or "system administrators" and this option is completely useless, so again you're back to reinstall. Good luck finding out what an installer has done in Windows.

So the only way to avoid this with the current designs is to go back to a time when it required heavy system knowledge to even install the OS... Obviously, this is not desirable nor is it going to happen. It's a product of our times, man. We are stuck with it, until someone comes up with a better design solution than we've currently got in the Linux and Windows world.

Re:Failure of good design by Anonymous Coward · 2011-03-02 06:52 · Score: 0

That's completely wrong. APT, yum, and even Portage have tools to a) list what files a certain package put on your system and b) manage configuration-file upgrades safely. If you use the CLI tools instead of some point-n-click retard-mode assistant (and yes, I feel the disdain is thoroughly warranted when it comes to servers) none of this junk happens.
Re:Failure of good design by Anonymous Coward · 2011-03-02 09:46 · Score: 0

I do not know where you get the idea that knowing what a package manager does to your system is difficult. Maybe for a computer user, yes, of course, but for a sysadmin? For rpm it is dead easy, just run rpm --query --list nameofpackage and you get a nice list of what the package has done to your system (everything is a file, remember?); in debian based systems, you accomplish the same with dpkg --listfiles nameofpackage ; no sweat.
Likewise, finding what package installed a specific file is also very easy with a package manager. So if you need to know what package installed /etc/passwd, in a redhat system you would rpm --query --file /path/to/file; with dpkg you would use dpkg --search /path/to/file
I think you really should read the rpm and dpkg manual pages to see all you need is already there at your disposal with very little effort.
Finding out what a msi ships is a bit more involved, but doable as well.
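
If you'd rather script the rpm and dpkg queries above than type them, here is a minimal Python sketch. The flags are the ones quoted in this comment; the helper names, the lookup tables, and the "first tool found on PATH" detection are assumptions for illustration (a box with both tools installed would need something smarter):

```python
import shutil
import subprocess

# argv prefixes for the two queries quoted above.
LIST_FILES = {
    "rpm": ["rpm", "--query", "--list"],
    "dpkg": ["dpkg", "--listfiles"],
}
OWNER_OF = {
    "rpm": ["rpm", "--query", "--file"],
    "dpkg": ["dpkg", "--search"],
}


def detect_tool():
    """Return "rpm" or "dpkg", whichever is found on PATH first, else None."""
    for tool in ("rpm", "dpkg"):
        if shutil.which(tool):
            return tool
    return None


def build_cmd(table, tool, target):
    """Build the argv list for one query without running anything."""
    return table[tool] + [target]


def run_query(table, target):
    """Run the query with the detected tool and return its output lines."""
    tool = detect_tool()
    if tool is None:
        raise RuntimeError("neither rpm nor dpkg found on PATH")
    result = subprocess.run(
        build_cmd(table, tool, target),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

Usage would be something like run_query(LIST_FILES, "nameofpackage") or run_query(OWNER_OF, "/path/to/file"); check=True makes it raise if the package or file is unknown to the package database.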

Missing the point of modern systems administration by bbasgen · 2011-03-02 04:53 · Score: 1

I think the author is missing the point of modern systems administration. I wonder what the average number of servers a system administrator manages is today, versus ten years ago? I would guess it has increased by a factor of around 10, particularly with the rise of 1U commodity servers, virtualization, etc. Sysadmins just don't have the time to treat our OS like a zen garden. The OS, especially with modern *nix, has become a kind of commodity, while the bulk of system admin work has moved to higher levels of application management, systems integration, etc.

This is where I think the author fails most prominently, by implying that sysadmins who simply re-image (a claim that is a straw man) are somehow not as sophisticated and nuanced. Consider instead that they may be working at a higher, more complex level. This whole argument reminds me of the old debates System V admins would have with the rising Linux admins: this notion that package management was for weenies who don't "understand" the intricacies of dependency resolution. I remember incredibly excruciating debates where these folks would insist that spending hours resolving dependency hell was "good" for the craft because, after all, you should know and configure every last component on your system! God forbid it is done automatically for you, with literally tens of packages being installed with somewhat perfunctory knowledge, so that you could move onwards to accomplish the actual task at hand.

Sorry, sysadmins don't have time for nostalgia. Save the sob stories of a bygone era for an industry that isn't based on constant change.

Re:Not a decline, but a reflection of the new norm by Anonymous Coward · 2011-03-02 04:53 · Score: 0

researching the problem is not billable activity.

Re:I can't tell you how many times I have heard th by geekoid · 2011-03-02 05:05 · Score: 1

Why don't they just kill the zombie process they find?

I mean, even that is less than optimal, but making people log off? WTF?

You know, there is an opportunity there. Get into a position where you can afford to be out of work for a couple of months. Then become a hard-ass flag bearer for the technology. Go past your manager and make an appointment with the CIO or CEO. Be forceful. When you get the meeting, have a 3-4 page doc with the costs and time savings as well as the risks. High level. Tell them their systems are going to crash and cost them a lot of money.
If you have a share, go to the shareholder meeting and talk about it.

This has two outcomes:
Let go
Big ass promotion.

If you are let go, so what? Use what you were doing to help get you into a decision-making position at another company.

If you get the promotion, you get to drive the change and begin changing how management handles technology.

A company can be a master of its own destiny, or it can let its technology master its destiny.

In the end, no matter what happens, you're going to have some good stories to tell.

--
The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect

Re:"fat commission checks" by DocSavage64109 · 2011-03-02 05:07 · Score: 1

I think you've misread the parent. Basically, you can't trust the vendors to have actually solved the database problems as they'll say anything for their "fat commission checks" on any upgrades.

Sounds like there's a link by Anonymous Coward · 2011-03-02 05:08 · Score: 0

From what I read, there actually isn't a "Windows admins are dumb, *nix admins are smart" pattern. It's more about how mainstream you are, and thus how many potential admin candidates you attract (or more importantly, how many open jobs there are).

Re:Sounds like there's a link by Alex+Belits · 2011-03-02 19:26 · Score: 1

Then you read bullshit.
Windows admins are dumb. Their job is dumb because Windows provides no tools or interfaces to perform its administration in an intelligent way. Mark Russinovich is probably the closest thing to a smart Windows admin/developer that exists, but he is dumb, too (and he works for Microsoft).
Problems happen when Windows admins try to work on non-Windows systems -- they bring the dumb Windows way of administering a server, and they are still nothing but Windows admins. They can be easily recognized by their use of VMware products to run Linux in a production environment.

--
Contrary to the popular belief, there indeed is no God.

Blame the VMs by Anonymous Coward · 2011-03-02 05:22 · Score: 0

Face it, more often than not, a Linux server in the workplace means it's running on top of a virtual machine. Your server is one of a dozen that were popped up (like popcorn), are rarely maintained, and receive little ongoing administration, but can magically be reset to the last snapshot on a whim. The valued skills have changed. Personally, I miss the days of having only a few servers that cost an arm and a leg and you milked them for everything they'd give you.

Re:Blame the VMs by Alex+Belits · 2011-03-02 19:29 · Score: 1

This only happens when the server is being administered by a Windows admin. And if it's Linux running on top of a VM running on Windows, it is not a Linux server at all, it's an idiotic contraption that has absolutely no excuse to exist.

--
Contrary to the popular belief, there indeed is no God.

Realities of Systems Administration and Imaging by Anonymous Coward · 2011-03-02 05:30 · Score: 0

There have been several very good arguments about virtual servers and re-imaging vs. spending time to re-configure and/or fix the issue without rebooting. There are a few realities that apply to proper Wintel and *NIX administration when imaging is used. Most organizations that do imaging as a normal recovery procedure do not take some of the realities below into consideration:

1) If the server image has been built properly with all services working and tested appropriately, it is completely normal to "save your work" for a rainy day. Most experienced sysadmins would agree that certain server and application settings are better set up once, as the installation and package management can be very painful for very customized apps. Having a good copy to re-image saves having to re-invent the wheel. HOWEVER, this argument does not hold if the original server build has not been properly implemented and tested. Image management is very, very important. In my experience most experienced *NIX sysadmins have a good grasp of where virtualized/hosted *NIX systems can save cost and time to deploy.

2) Most *NIX and Wintel devices can be hacked or have junior sysadmins (read: less competent people) make "changes" that can cause issues. Yes, sudo and other access mechanisms can mitigate the problem, but the reality is that stuff gets &^%#ed up either intentionally or otherwise. Rather than trying to do a historical review of what caused the problem and playing the "blame game" while the system is down, do what was suggested in other posts: take an image copy of the broken system, re-image the production system, and then play the post-failure investigation/blame game with the comfort of knowing the system is back on-line and functioning. Senior management is much less likely to fire someone if the outage time is minimal and the error in the old system is an "honest mistake". The longer a system is down, the more criticism a whole systems support team can face from senior managers.

3) Most people that make images do not test them enough to call them production images. These production images are sometimes buggier than the existing production problem. It is very difficult for the image creator to test their own work. It is more appropriate for a team of two to work in tandem. One person to create the image and another knowledgeable team member to test and report issues; which get corrected before preserving the image. The art of proper image and backup/restore management is truly under-rated.

4) Images have to be updated on a regular basis to account for system, performance and most importantly security patches and changes. An image created 8 months ago for a server will not likely be very useful if it has security holes discovered a year ago.
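
Point 4 can be turned into a simple check. The following is only a sketch: the function names, the 90-day maximum age, and the idea that you have a list of patch release dates to compare against are all assumptions for illustration, not anything a particular imaging product provides.

```python
from datetime import date


def missing_patches(image_built, patch_dates):
    """Return the patch release dates that are newer than the image."""
    return sorted(d for d in patch_dates if d > image_built)


def image_needs_refresh(image_built, patch_dates, today, max_age_days=90):
    """Flag an image that is too old or predates any known patch.

    max_age_days is an arbitrary illustrative threshold.
    """
    too_old = (today - image_built).days > max_age_days
    return too_old or bool(missing_patches(image_built, patch_dates))
```

An image that passes this check still needs the testing described in point 3; the date comparison only tells you the image is not obviously stale.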

change in ideas by Anonymous Coward · 2011-03-02 05:34 · Score: 0

I think what a lot of people fail to comprehend is that there is a radical shift from 'server' to 'service'. In the 'olden days' you would get a Unix 'server', and that machine would handle many tasks and would typically be the core of your entire computing system. Sometimes you would get a second server for redundancy, and as you scaled up you would get many servers to handle your workload, but you would still have some 'core' servers running most services and you would separate out the heavy services like the db or directory services etc.

The change is that with VMs you will have a few servers that require very little configuration, as they just host VMs, and what they host is functionally 'services', not servers: small, special or specific-use servers which each provide a single service. You run a few of these VMs redundantly, able to live migrate between the real hardware. Now when one of these 'services' goes down, you simply redeploy a new one from the template.

Re:change in ideas by Alex+Belits · 2011-03-02 19:36 · Score: 1

There is no "radical shift". Multiple services were running on Unix servers since the beginning of Unix development. The idea of "single service per server" is entirely from Windows system administration, and combined with VMs it can be implemented -- but it's an extremely stupid idea.
Unix design is the opposite of this -- clusters of servers may be involved in implementing a single service, and the design of such a service will take into account a particular model of distributing the load, keeping shared state, locking, etc. However, at the same time one server may handle multiple related or unrelated services. In the Windows world such a design is unthinkable -- so VMs are used as crutches to allow huge multicore servers to run multiple single-threaded piece-of-shit applications, each in its special padded room with its own copy of the OS, its own emulated hardware up to the keyboard controller and SB16 sound card to emit beeps that no one is going to hear. The last thing we need is this model spreading to sanely designed systems.

--
Contrary to the popular belief, there indeed is no God.

The state of the industry by jacobsm · 2011-03-02 05:37 · Score: 1

Companies don't want it good, they want it cheap. Training...read a book. Test hardware and software that a sysadmin can break and fix, sorry too expensive, can't have it.

Just a week or so ago I wanted to take a resource I controlled and partition it into two separate resources on rare occasions. Clueless management wouldn't allow me to do so. Reason, in their opinion it didn't make sense to do so. Beat head against the nearest wall.

Production enviroments by DerekLyons · 2011-03-02 05:46 · Score: 5, Insightful

Some of the comments here remind me of a post on a woodworking board a few months back. Essentially, the poster was lamenting because he had to fire a guy because he couldn't afford to keep him... Not because of the economy, but because the guy was an absolutely inflexible perfectionist. He'd spend $300 worth of time on what should have been a $60 job... The guy was a hell of a woodworker, at home in his own shop, but just couldn't adapt to a production environment.

This isn't about Windows vs. Unix. This is about admins not understanding their job is to get production rolling again, not to satisfy their obsessive need to understand every problem or their need to satisfy their ego. ("I'm a UNIX admin dammit, I refuse to use habits that make me look like a Windows admin" or its equivalent is a refrain modded up again and again here on Slashdot.) If a reboot or a re-imaging fixes the problem, that's the right solution. If it doesn't, *then* you dig deeper.

Re:Production enviroments by Anonymous Coward · 2011-03-02 06:31 · Score: 0

Brush your teeth if you don't want to pay the dentist bill.
Virtualization is a nice tool, but it's just a tool. It will never replace an admin.
Re:Production enviroments by dave562 · 2011-03-02 06:38 · Score: 2

You brought up a similar point to the one I was going to make. In a production environment, down time costs money. Often times the quickest way to get an application back into production is to restore the machine to a known good state. With virtualization that is trivial to do. If the problem keeps recurring then you need to dig deeper to figure out what is going on.
Re:Production enviroments by Anonymous Coward · 2011-03-02 06:46 · Score: 0

I disagree. I think that the problem is the attitude that the quick fix is the complete fix. Sure, a quick fix often solves a short term problem, but what about the long term problem (if any)? You don't know unless you problem solve. By all means you should do the quick fix in an emergency but the biggest problem is that management then thinks it's done and won't give you the time to follow up on it afterwards. Then you're just expected to pick it up at a later time and fix it when it's convenient for them. The problem is that the convenient time never arrives and you end up with so many pending investigations that you can't even remember what's what anymore. Better to take something on fully from start to finish. Human beings aren't machines and shouldn't be expected to function like them.
Re:Production enviroments by Anonymous Coward · 2011-03-02 06:55 · Score: 0

> This isn't about Windows vs. Unix. This is about admins not understanding their job is to get production rolling again, not to satisfy their obsessive need to understand every problem or their need to satisfy their ego. ("I'm a UNIX admin dammit, I refuse to use habits that make me look like a Windows admin" or its equivalent is a refrain modded up again and again here on Slashdot.) If a reboot or a re-imaging fixes the problem, that's the right solution. If it doesn't, *then* you dig deeper.
I'll add to this: if the problem is not understood, image the BAD server, then employ the quickest possible recovery method and deploy the bad image to a VM or spare for analysis afterwards. Understanding the problem is important, but an admin who lets a service languish while there is a perfectly valid recovery procedure is a fail.
Re:Production enviroments by BoldAndBusted · 2011-03-02 07:08 · Score: 1

> Some of the comments here remind me of a post on a woodworking board a few months back. Essentially, the poster was lamenting because he had to fire a guy because he couldn't afford to keep him... Not because of the economy, but because the guy was an absolutely inflexible perfectionist. He'd spend $300 worth of time on what should have been a $60 job... The guy was a hell of a woodworker, at home in his own shop, but just couldn't adapt to a production environment.
> This isn't about Windows vs. Unix. This is about admins not understanding their job is to get production rolling again, not to satisfy their obsessive need to understand every problem or their need to satisfy their ego. ("I'm a UNIX admin dammit, I refuse to use habits that make me look like a Windows admin" or its equivalent is a refrain modded up again and again here on Slashdot.) If a reboot or a re-imaging fixes the problem, that's the right solution. If it doesn't, *then* you dig deeper.
The trouble is that the job of "get production rolling again" is not really a System Administration problem. That job belongs to a "System Operator", in old 70s and 80s parlance. Many people who are labeled "Administrators" are administrators in name only, and in practice do not have the authority to make the decisions which are a core part of being an administrator ("systems" or otherwise). The fact that most people with this title don't realize this, and demand the authority, makes it hard for them and for the rest of us.
Re:Production enviroments by Anonymous Coward · 2011-03-02 07:46 · Score: 0

> If a reboot or a re-imaging fixes the problem, that's the right solution. If it doesn't, *then* you dig deeper.
No, it isn't. Especially if your image is infected, and that is the cause of the problem. The right solution is to UNDERSTAND the problem, then apply a solution. In effect, you have just validated the entire "we wipe/reboot and forget about it" mentality.
Nuke from orbit is the last resort, not the first one. You are reaching for a hammer to crack eggs. I pity the org you work for.
Re:Production enviroments by Anonymous Coward · 2011-03-02 10:10 · Score: 0

But what if, every time that problem comes up, a figure somewhere in the system gets 2 or 200 added to it or a plus becomes a minus?
And six months down the line someone asks where $X million dollars is or what they should do with the 400 cases of widget Y that just arrived.
I have seen systems where things went awry and just rebooting would have made things worse, where the system might not even come back up. You only re-image if you know what has gone wrong. If you don't know what has gone wrong, who's to say that re-imaging isn't just replacing one damaged copy with another?
Re:Production enviroments by cas2000 · 2011-03-02 10:32 · Score: 1

the trouble is that it *doesn't* fix the problem, it just hides it.
rebooting - or even re-imaging - can be fine as a quick-and-dirty fix to get critical infrastructure back up and running as quickly as possible, but it IS NOT and CAN NOT be a substitute for actually figuring out what caused the problem and fixing it so that it doesn't happen again.
Re:Production enviroments by Leolo · 2011-03-02 12:09 · Score: 1

> If a reboot or a re-imaging fixes the problem, that's the right solution.

The thing is it probably doesn't fix the problem. It might fix the symptom, but the problem will reoccur.
Re:Production enviroments by Anonymous Coward · 2011-03-02 21:22 · Score: 0

"If a reboot or a re-imaging fixes the problem ..." it never does, you just postpone the problem (be sure it will happen again if you don't solve it!) and eventually make it worse next time... "use your head for more than carrying your hat" used to be wise!
Re:Production enviroments by Anonymous Coward · 2011-03-03 21:35 · Score: 0

> couldn't adapt to a production environment
Wow, the comments of north-american/UK/AU people are usually so fucked up that I am not surprised at the thought of the USA being a third world country in disguise. You certainly seem so from a european perspective. Your biggest industries now are a) waging war on things and b) shuffling papers around that say I.O.U.
You can keep advocating for profit-driven idiot managers with more bravado than sense dictate how a job should be done. Keep being a corporate-trained pet if that's your thing (or the trap you can't get out by now even if you know it and want to get out) but leave the competent UNIX people (and managers, they exist, all four of them) out of your dirty argumentations. This 'FIX RIGHT NOW DAMNIT' mentality is only ignorance on both the pointy haired retard and the sysadmin retard sides. At the very least a highly available infrastructure should be in place. Specially for a small business that wants UPTIME 100% DAMNIT AND CHEAP. A second server is fucking cheap, virtualization makes it all even more fucking easy. I see only ignorant sysadmins here who failed to stand their ground. But then again, you can be fired on the spot if you open your mouth, don't you? Poor things.

Re:I can't tell you how many times I have heard th by anyGould · 2011-03-02 05:51 · Score: 1

Oh, and re-installing the machine means 24h of downtime".

Did I mention the application is considered mission-critical and runs 24x7? And that downtime can cost more than 6 figures to said (nameless) company?

And those are two very good reasons to not nuke/pave (although I would trust you realize that this is a bit of a non-standard situation).

In this situation, I'd expect to see a backup on hot standby - something that you fail-over to while you troubleshoot the main.

Re:I can't tell you how many times I have heard th by surgen · 2011-03-02 06:07 · Score: 1

Also, unlike normal processes, the kill command has no effect on a zombie process.

http://en.wikipedia.org/wiki/Zombie_process

Its no longer troubleshooting, just breakfix by 6502_C64 · 2011-03-02 06:11 · Score: 0

As a former Windows SMS Administrator and Desktop Engineer, I found imaging a vital tool for desktop breakfix. That's provided you have an image that can be updated and software packaged for ease of profile install. On one-off problems, you can re-ghost and re-install packaged software and have the user back in business within an hour. Troubleshooting was reserved for company-wide issues. My motto used to be: when in doubt, ghost!

Re:Its no longer troubleshooting, just breakfix by Alex+Belits · 2011-03-02 19:39 · Score: 1

As a former Windows SMS Administrator and Desktop Engineer
...you should better shut up when people are discussing real servers.

--
Contrary to the popular belief, there indeed is no God.

See StenchWarrior RUN... lmao! apk by Anonymous Coward · 2011-03-02 06:14 · Score: 0

http://yro.slashdot.org/comments.pl?sid=2015772&cid=35358632

LMAO!

APK

See StenchWarrior RUN away... lol! apk by Anonymous Coward · 2011-03-02 06:19 · Score: 0

http://yro.slashdot.org/comments.pl?sid=2015772&cid=35358632

APK

P.S.=> LOL, just "too, Too, TOO EASY... just '2EZ'"... apk

Incompetence by Anonymous Coward · 2011-03-02 06:26 · Score: 0

I just hope that none of these people are administering any system with any of my personal data on them. This is only marginally better than my personal data being held on a Windows, or even worse, a MacOS X system.
I wouldn't make the analogy of driving a car vs car maintenance, being in any way similar to server maintenance. Clearly, driving a car is something that one might expect any average peasant to be able to achieve, whereas, one may expect said peasant to become hopelessly and helplessly confused, when confronted with a UNIX terminal and a keyboard. It would be reasonable to expect any employer to exercise due diligence when recruiting competent employees.
If I ever discovered that my personal details had been leaked or lost, as a result of the negligent recruiting practices of some incompetent halfwit manager, I would most definitely be seeking legal recourse.
This is most certainly not about getting a production system 'back up' in short time. If this is the aim, the system was designed without sufficient redundancy, and again, the competence of those responsible for its management would be seriously questionable.

Re:I can't tell you how many times I have heard th by Anonymous Coward · 2011-03-02 06:39 · Score: 0

Gawd, the stories I could tell...

We've seen a bunch of memory pressure issues. An app starts gobbling up memory and eventually other procs start getting killed. At some point, even ssh is killed so that this app can continue. We end up having to reboot the server to recover.

In every other organization we can set per-process caps on memory utilization. Not so here. The DBAs demanded that their app/db run unlimited.
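
For readers who have not set one up, here is a minimal sketch of a per-process memory cap using the POSIX address-space limit from Python's standard `resource` module. It is Linux-oriented and illustrative only: `run_capped` and `_cap_address_space` are made-up names, and nothing here is meant to describe how the poster's other organizations actually do it.

```python
import resource
import subprocess
import sys


def _cap_address_space(max_bytes):
    """Runs in the child just before exec: lower the soft RLIMIT_AS."""
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        # Never ask for more than the existing hard limit allows.
        max_bytes = min(max_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))


def run_capped(code, max_bytes):
    """Run a Python snippet in a child process with capped address space."""
    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=lambda: _cap_address_space(max_bytes),
        capture_output=True,
        text=True,
    )
```

A child that tries to allocate past the cap gets an allocation failure (a `MemoryError` in Python) and exits on its own, so the rest of the box, ssh included, is left alone.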

We have the security department telling us what we can and cannot run. This leads to bizarre requirements such as ntpd and inetd cannot be enabled.

We recently had an application upgrade to a website. The network resources ballooned six times. Processor utilization jumped 500%. The devs are convinced that it's a hardware error. Or a network problem. Or an OS configuration issue. The latter was interesting. It was a thread-limited app so was using only 4 of 8 processors. We showed them the graphs, but they are dead set on the idea that some magical OS tuning will unlock their apps to work on all cores.
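
The arithmetic behind that last point, as a tiny sketch (the function name is invented for illustration, and it assumes each thread can keep at most one core busy):

```python
def max_cpu_fraction(runnable_threads, cores):
    """Upper bound on the share of the machine an app can keep busy."""
    return min(runnable_threads, cores) / cores


# A 4-thread app on an 8-core box tops out at half the machine.
print(max_cpu_fraction(4, 8))
```

No OS setting changes that bound; only more runnable threads in the app would.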

Decline of sysadmin skills? Hardly. We're just sick of the bullshit.

news? by kwoff · 2011-03-02 06:42 · Score: 1

In what way is this "news"? It's like the 3rd time this guy's blog was linked to in the last week or two. A few paragraphs of opinion. Are there any anti-blog tech sites, especially ones where the latest "products" aren't advertised in the form of articles?

Re:I can't tell you how many times I have heard th by orient · 2011-03-02 06:45 · Score: 1

Why not write a script to kill zombies? That's what I did and the server never crashes...

--
Laudele lor desigur m-ar mahni peste masura.

Reimage? by Anonymous Coward · 2011-03-02 06:55 · Score: 0

Reinstalling the OS is almost never the right move. Any time people have suggested it to me (I've been working on Linux and Unix for a long time), it's usually a stab in the dark. If you hear me suggest it, it's because the OS is corrupted.

Unix is highly observable with tools like strace, truss (or even dtrace if you're lucky enough to have that).... it's difficult for me to imagine a scenario where re-imaging is better than finding the error and fixing it.

And for those people defending their intellectually lazy response... you have a lot to learn about Unix and troubleshooting.

Unacceptable by operagost · 2011-03-02 07:03 · Score: 1

If I suggested wiping an OpenVMS server to correct a problem, I would be laughed at... at best.

--

Gamingmuseum.com: Give your 3D accelerator a rest.

Re:I can't tell you how many times I have heard th by corbettw · 2011-03-02 07:29 · Score: 1

Never trust Wikipedia. You can absolutely kill zombies either with kill -9, or rm -rf on the process tree for that process (e.g., for pid 666, "rm -rf /proc/666").

--
God invented whiskey so the Irish would not rule the world.

Reimaging UNIX? by Anonymous Coward · 2011-03-02 07:45 · Score: 0

This has been my SOP since 1978 when VMS became available!!!

The TV repairman by Skapare · 2011-03-02 08:48 · Score: 1

Remember the TV repairman (if you are old enough)? A dying (well, by now probably completely dead) breed. When the TV went on the fritz, he (or she in a rare few cases) would diagnose the problem and apply a fix. Usually it was just a tube that needed replacement, but sometimes a capacitor. Occasionally something would have burned out like a resistor. As transistor TVs came along, the failures went down, but not to zero. Transistors could die, too, and were harder to replace. And they were more susceptible to lightning surges from the antenna (something that back then got TV signals for free).

Nowadays, if a TV goes bad, we just junk the whole thing and get a new one. If it was in warranty, we might get the new one for free. Too often it would die just 3 days after the warranty expired. Just 3 years ago I had a relatively new TV (a digital one, with a VGA input, too) go bad. I could tell it was the power supply delivering unstable or low voltages after it warmed up. Fortunately, it was in warranty. So it was shipped to the manufacturer. About a week later a box comes back with a replacement. This was not the one I sent in, though it was the same model. At least it worked (and has been ever since). But I still wonder if someone replaced the power supply in the one I sent in. And I wonder if someone replaced the component inside that power supply that caused it to fail or if they scrapped the whole thing.

So why should failing software be any different? As a system administrator myself, I do like to at least find out what failed. But being practical, I also quota the time I spend on "failure forensics". If I can't figure it out in a few minutes for first-time problems, I just reboot. If the problem happens again, then I justify more effort. If it never happens again, I never even think about it anymore. While I love a good diagnostic challenge, it just doesn't make business sense to put much effort into that (unless it's something we design and manufacture).
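
The time-quota rule at the end of that post can be written down as a small decision function. This is a sketch only: the name `next_action`, the return labels, and the 10-minute default (standing in for "a few minutes") are all assumptions.

```python
def next_action(times_seen, minutes_spent, budget_minutes=10):
    """Decide what to do with a failure under a forensics time quota."""
    if times_seen > 1:
        # A recurring problem justifies more effort.
        return "investigate"
    if minutes_spent < budget_minutes:
        # First occurrence and still inside the quota: keep looking.
        return "diagnose"
    # First occurrence, quota used up: restore service and move on.
    return "reboot"
```

The useful part is less the code than the fact that the quota and the "seen before" count are written down, so the decision does not depend on who happens to be on call.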

--
now we need to go OSS in diesel cars

And pragmatism is important by Sycraft-fu · 2011-03-02 09:17 · Score: 1

You have a limited amount of time in a day to deal with shit and you need to prioritize if you want it all done. Dealing with problems can take far too long sometimes and a reinstall is just faster (and cleaner).

For example: I work at a university and we have a "kind of managed" environment meaning you get things like professors who have laptops that they have admin on. They get viruses and spyware, of course, since they don't pay attention. Our normal strategy is to run automated tools and if they can't clean it up, reinstall. Why? Because it takes less time. Installing Windows 7 takes all of 40-60 minutes and I've modified the image to include the most common apps you need. Usually one of our students can have a system reinstalled and running in a couple hours.

However, cleaning it up? That can take days. I can do it; given time I can track down all of the stuff and eliminate it. However, some of this spyware is extremely problematic. It has watcher processes everywhere, sets itself up in all kinds of locations and so on. Also it isn't like they get one infection; they get tons and then finally bring it in. So it is a painstaking process of looking for shit, disabling it, checking to see if it stays gone, cleaning up problems (things like when it modifies the hosts file or LSPs or executable handler), and so on until everything is clean and works right. This also isn't something our students are good at; it is fairly complex and takes some experience, so I (or another staff member) have to do it.

It is just not worth the time. We end up having to do it sometimes because professors just refuse a reinstall but it is a huge waste. We can backup data, and reinstall, in far less time. That guarantees all the shit is gone (a manual method always leaves room for doubt, I could make a mistake).

Remember that with troubleshooting the objective is to fix the problem. It isn't to prove you are a toughguy, it is to make things work. So you need to determine the most efficient manner to do that, and the manner that results in the least downtime. What that is varies. For our LDAP server? No a reinstall would not be the best idea (in most cases). However for a client system? Often it is.

The only time... by rnturn · 2011-03-02 09:42 · Score: 1

... I've ever resorted to rebuilding a UNIX system from scratch was a system that I inherited from a previous admin when I took over his job. The broken system was a member of a cluster and, after running checks of all the files on both members, I could not figure out just which files had gotten corrupted that were preventing the system from believing it was a member of the cluster. Luckily, support for the OS version was about to be sunsetted and it made sense to reinstall the OS on both members. This was about a dozen years ago. Except for that one instance of doing a reinstallation, I haven't resorted to that means of solving a UNIX problem. Ever.

System disk failures are another story. I have had to do a couple of those on Linux systems when the system disk failed. That's over the 15-16 years I've been running Linux.

So that's three UNIX/Linux reinstallations over more than a couple of decades. I know Windows admins who've done that many reinstalls in a week.

--
CUR ALLOC 20195.....5804M

Re:I can't tell you how many times I have heard th by DarwinSurvivor · 2011-03-02 10:06 · Score: 1

I'm not sure if you fully understand what a "zombie" process is. A zombie process is a process that has ended, but the parent process (high MISSION CRITICAL application) has not "closed" the process yet. Killing the process tree would take down the mission-critical app. rm'ing the process would allow a new process (possibly not owned by said app) to start and if the app tried to eventually close or even check that process, it would segfault the entire app WITHOUT that 10-15 minutes notice.
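
Since a zombie only goes away when its parent collects it, the practical first step is finding out which parent is leaking them. A minimal sketch, assuming output in the shape produced by `ps -eo pid,ppid,stat,comm` on Linux, where the STAT column starts with "Z" for zombies (the function name is made up for illustration):

```python
def zombies_by_parent(ps_output):
    """Group zombie PIDs by parent PID from `ps -eo pid,ppid,stat,comm` text."""
    groups = {}
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 3:
            continue  # ignore malformed or empty rows
        pid, ppid, stat = fields[0], fields[1], fields[2]
        if stat.startswith("Z"):
            groups.setdefault(int(ppid), []).append(int(pid))
    return groups
```

Feed it the text captured from `ps`; the keys of the result are the processes that are not reaping their children, which is where any fix has to happen.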

To all the people against open source (probably few in this crowd), this is a PRIME example of why closed-source is bad. Even if this guy's company was not allowed to redistribute the software (like a normal software license), had they been given the code, he probably could have fixed the bug in a fraction of the time he's spent dealing with it. And the next time the system was forced to reboot, BOOM, throw in the fixed binaries!

Re:I can't tell you how many times I have heard th by Anonymous Coward · 2011-03-02 22:35 · Score: 0

It takes maybe 5 minutes to provision a new VM complete with OS and default config/apps/whatever.

If I had a system that was as essential as what you describe, I'd have a base image of it stored and ready to go. Just bring up the new image, migrate the data, and make it live. That's what we do with all of our truly essential systems. And we can be running off a new image within about 30 minutes if we're able to migrate data off the old system.

More so, just run several concurrent instances, with a health check to kill off ones which are showing fatigue. Respawn when everything stabilizes. You can also use this same process to scale up and down the resources. Smooth out that end of day (whichever timezone) rush to do work.
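
A minimal sketch of that kill-and-respawn loop, with the health check and the provisioning step passed in as plain callables. Everything named here (`reconcile`, `is_healthy`, `spawn`) is hypothetical; a real setup would call whatever your hypervisor or orchestration tooling provides.

```python
def reconcile(instances, is_healthy, desired, spawn):
    """Drop unhealthy instances, then spawn replacements up to `desired`."""
    survivors = [i for i in instances if is_healthy(i)]
    # Scaling down: if more than `desired` are healthy, keep the first ones.
    survivors = survivors[:desired]
    # Scaling up or replacing: provision until the pool is full again.
    while len(survivors) < desired:
        survivors.append(spawn())
    return survivors
```

Run it on a timer and change `desired` through the day, and the same function handles both replacing fatigued instances and scaling for the end-of-day rush.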

-@|

My experence by charlieclark22 · 2011-03-02 22:42 · Score: 1

I have administered Linux, Windows and Mac servers and from reading the comments above I must agree (in part at least) with everyone. I think a balance needs to be struck between downtime and finding the cause of the problem. The cause of the problem is certainly important, but so is availability! Different situations require different solutions. Personally I always prefer to get things working as soon as possible, but while troubleshooting I take steps (such as backing up logs) so that once the issue has been fixed it is still possible to find out what went wrong, unless you find out along the way. If you can get the system back up in a reasonable time and tell management why/how/what happened then this is the best situation for everyone. With Windows this is more difficult and you are a lot more likely to encounter a problem which seems to have no reason (I know this does not really happen but it appears that way because of the rubbish logging that Windows produces). So I guess what I am saying is a competent systems administrator will know how to react in certain situations and these systems administrators are nearly always worth the money (unless, of course, the company's systems aren't that important).

About the imaging route, I've always disliked this solution. It requires you to have a spare machine of exactly the same spec. I prefer to set up an automated install (over tftp) for all of the different types of machines I administer. With the Unix-based OSes I then use puppet to configure them fully (the client should be available for Windows soon also), so if a reinstall is needed I could use a completely different set of hardware (if needed) and get things up and running in a short amount of time.
This takes time to get set up initially but it means that anyone who can figure out how to PXE boot a machine on the network can effectively start an install for any type of machine using only 5 minutes of their time and the systems administrator is more free to figure out ways to increase efficiency of the systems and network (every system can always be improved). I do not think this is the end of systems administrators, it just means that we need to up our game.

Slashdot Mirror

500 comments