The Decline and Fall of System Administration

Sad but smart by Anrego · 2011-03-02 01:58 · Score: 4, Interesting

I’m not a system admin but I don’t see how this is a bad approach.

I see value in finding out what the problem is and why it happened. If you just blindly re-image, the problem might pop up again at a less opportune time.

But if you know what the problem is... and you have an image of the server in a working state, or a documented procedure on how to set up the server in its intended configuration, then why would anyone waste time trying to repair it?

I think you have this kind of problem in most jobs. New approaches that make more sense but require less skill (and imply less e-pene) are always hated by people who have already learnt how to do it “the hard way”.

I see this as a programmer all the time and have been a victim of it. I’ve seen a huge chunk of my chosen industry migrate from meat-and-potatoes problem solving to gluing libraries together and sprinkling in business logic.

I’ve been fortunate to land in a job where there’s still a lot of “from the ground up” work, but these jobs are getting scarcer as even the components that everyone uses are made from other components. And executable UML (or something of its ilk) is probably going to be the next thing to cut the legs off us.

Re:Sad but smart by TheRaven64 · 2011-03-02 02:24 · Score: 4, Interesting

Add to that - no one (outside of the IT department) cares what the problem is, they care about the downtime. If you have some redundancy, stuff can fail periodically without the users noticing. An 'admin' capable of keeping it running can be someone paid to do something else who has responsibility for clicking the button every few months if required. An admin who can actually address the problem will cost, what, $60,000/year minimum (including associated costs, not just salary)? Is having ten minutes of downtime every few months costing your business $60,000/year? If not, then it's not worth the cost of doing it properly. It may be for a bigger company, but for a small business that would eat most of their profits. This is the advantage of a Windows or Mac server, with its pointy-clicky interface: it may be less reliable, and more expensive, but the cost saving from not needing to employ anyone who actually understands what's going on outweighs it. Especially if you buy a support contract, where the vendor will send someone competent out for the couple of times a year where something goes seriously wrong.
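The break-even point behind that question is easy to work out. Here is a rough sketch in Python, using the $60,000/year and ten-minute figures from the comment; reading "every few months" as four outages a year is an assumption, and the function name is made up for illustration.

```python
def break_even_cost_per_minute(admin_cost_per_year, outage_minutes, outages_per_year):
    """Downtime cost per minute at which a dedicated admin pays for themselves."""
    downtime_minutes_per_year = outage_minutes * outages_per_year
    return admin_cost_per_year / downtime_minutes_per_year


# Figures from the comment: $60,000/year and ten-minute outages.
# "Every few months" is ASSUMED here to mean four outages a year.
rate = break_even_cost_per_minute(60_000, 10, 4)
print(f"Break-even downtime cost: ${rate:,.0f} per minute")
```

Under those assumptions, downtime would have to cost about $1,500 a minute before the dedicated admin pays for themselves, which is the point being made about small businesses.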

--
I am TheRaven on Soylent News

From personal experience by Xacid · 2011-03-02 01:59 · Score: 5, Insightful

"they punt and rebuild the server from scratch rather than dig deeper."

From personal experience this is normally due to management jumping down our throats to simply "get it done" which unfortunately runs counter to our inquisitive desires to actually solve the problem.

I suspect it's the end result of pressure to get more bang for their bucks in a tight economy, but that's pure speculation. It really could be a trend of the times.

Re:From personal experience by Nerdfest · 2011-03-02 02:19 · Score: 4, Insightful

As I've said below, there is a benefit ... you can actually investigate and fix the problem rather than the symptom. The bonus with VMs though is that you can frequently do both. You can create a copy of the VM to dig into, and create a new fresh instance for production to get them working again.

Re:From personal experience by Darth_brooks · 2011-03-02 02:34 · Score: 4, Insightful

....and his was the right answer. With XP, you're almost certainly talking about a client machine. Why bother dicking with it? It's a hundred dollar OS on a four hundred dollar piece of hardware. Wipe, reload, move on to big boy problems. Even if you're talking about a problem that ends up affecting a number of users, and it happens to be a client side problem, you're farther ahead to nuke and reload.

In my last position I was the only end user support guy for 150 to 200 people. If I sat around and fucked with every little nuance of XP and its associated ills, I'd have ended up even farther behind than I was when I left. I wrote up a quick backup script that grabbed anything the user didn't (against company policy) store on the network drive, grabbed their local e-mail (Notes), then nuked the machine and reloaded. I could take a user who was dead in the water and have them back up and running in 15-20 minutes. If they had a lot of data to restore, maybe 35-45. Spending an hour 'troubleshooting' was a waste of company time, and my time.
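The original script isn't shown, so the following is only a minimal sketch of what such a pre-wipe backup might look like in Python. The suffix list, the function name, and the choice to match on file extension are all assumptions; .nsf is included on the assumption that local Notes mail lives in files with that extension.

```python
import shutil
from pathlib import Path

# HYPOTHETICAL list of file types worth saving; the real script's rules are not given.
KEEP_SUFFIXES = {".doc", ".xls", ".ppt", ".pdf", ".txt", ".nsf"}


def backup_profile(profile_dir, dest_dir, keep_suffixes=KEEP_SUFFIXES):
    """Copy matching files from profile_dir to dest_dir, keeping relative paths.

    Returns the list of relative paths that were copied.
    dest_dir should live outside profile_dir (e.g. on a network share).
    """
    profile_dir, dest_dir = Path(profile_dir), Path(dest_dir)
    copied = []
    for src in sorted(profile_dir.rglob("*")):
        if src.is_file() and src.suffix.lower() in keep_suffixes:
            rel = src.relative_to(profile_dir)
            target = dest_dir / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, target)  # copy2 also tries to preserve timestamps
            copied.append(rel)
    return copied
```

Something like `backup_profile(user_profile, network_share)` would run before the wipe, with both paths supplied by whoever is doing the reload; restoring is the same call with the arguments swapped.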

--
There are some people that if they don't know, you can't tell 'em.

Re:From personal experience by Tom · 2011-03-02 02:46 · Score: 4, Insightful

In an environment of thousands of servers (or even dozens), deep-diving into a problem [generally] is a waste of time. While it is interesting intellectually, there is no other benefit.

Except, of course, finding what the heck was wrong in the first place and fixing it, preventing future outages.

Sometimes, rebuilding is faster than fixing, and in some contexts, it makes sense. Even then, the original machine should still be examined and the "root cause" (if you need a management buzzword) identified. At the very least, a reasonable amount of time should be given towards the attempt. It's true that it is pointless to dig around for days and days - but that is not a reason to not at least start looking, as it might turn out you only need a few hours. And more often than not, finding the real problem tells you something that helps you
a) fix other bugs,
b) avoid the same problem on the next server,
c) avoid a repeat performance,
d) realize that what you thought was a random server crash was really a break-in / hardware failure / systematic problem and that other, additional steps need to be taken.

All of the above have happened before; you would be far from the first.

A proper incident management process does allocate resources towards follow-up examination. The right thing to do is not suppress it with generic blabla about wasted time, but to set the proper amount of resources for your organisation. Maybe it's half an hour and no money, so some sysadmin can check the logs and do a quick check-up. Maybe it's a full-out forensics analysis. That depends on your needs, your resources, your environment and context.

--
Assorted stuff I do sometimes: Lemuria.org

Re:From personal experience by causality · 2011-03-02 02:51 · Score: 5, Insightful

Reminds me of our Windoze guy in my previous job. I had run into some kind of problem with XP at home and spent a good amount of time banging my head against it. Finally I told him what was going on and asked what he recommended. "Reinstall XP" was the answer. I thought he was joking, nope he was serious. :P

Not only was he completely serious, he probably can't understand why you might have thought he was joking.

The idea that it's a black box and you shouldn't expect to understand how or why something happened is definitely one of the more subtle costs of Microsoft systems. It lends credibility to the (false) notion, so common among average users, that you're either a completely unskilled newbie or a serious expert who can discern the inner workings of the mysterious black box. It discourages middle ground for intermediary skill levels, the kind of thing that would otherwise occur naturally as users gain experience over time.

Most of all, it supports the falsehood that it's unreasonable to expect the most basic competence from non-experts.

--
It is a miracle that curiosity survives formal education. - Einstein

I can't tell you how many times I have heard this. by Noryungi · 2011-03-02 02:04 · Score: 5, Interesting

Many times, what I hear as "solutions" are simply variations on the theme: "Why can't we reboot the server?" or "Why can't we reinstall the server from scratch?".

And my answer usually was: "Listen, I don't care how many times you do this on a Windows machine, but this is UNIX - I'll only reboot this machine if I absolutely need to. In the meantime, watch and learn as I kill the offending processes. Oh, and re-installing the machine means 24h of downtime".

These days, I help run a (very) large application, which runs on top of a (very) large "enterprise" SQL database for a (very) large company. The only problem is: enterprise application does not manage database very well, and leaves zombie processes on the database server. After a while, the database server just crashes (hard) and takes down the application server with it. Logical solution (and the one recommended by sysadmins): upgrade application to version X, which is supposed to have much better database management.

What do you think the PHB/management solution is? Ask the DBAs to write a script that will monitor zombie processes, so the sysadmins will be warned in advance... Like, around 20 minutes before the application crashes. Just enough time to tell all users to save their work, because we need to reboot everything. Just like under Windows.
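The requested monitor is not shown anywhere in the thread. As a rough sketch only, here is what a zombie counter could look like in Python, assuming the database server runs Linux and that "zombie processes" means OS-level processes in state Z (it may instead mean stale database sessions, which would need a query against the database). The function names are invented for illustration.

```python
from pathlib import Path


def parse_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so take whatever follows the LAST ')'.
    after_comm = stat_line[stat_line.rfind(")") + 1:]
    return after_comm.split()[0]


def count_zombies(proc_root="/proc"):
    """Count processes under proc_root whose state is 'Z' (zombie)."""
    zombies = 0
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # only numeric directory names are processes
        try:
            line = (entry / "stat").read_text()
        except OSError:
            continue  # process exited while we were scanning
        if parse_state(line) == "Z":
            zombies += 1
    return zombies
```

A cron job could call `count_zombies()` every minute and raise the alarm above some threshold. Where that threshold sits, and whether it really buys the 20 minutes of warning mentioned above, would have to be measured on the actual server.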

Did I mention the application is considered mission-critical and runs 24x7? And that downtime can cost more than 6 figures to said (nameless) company?

And, since you asked, yes, I am looking for another job. (Clueless admins and pointy-haired bosses: a match made in...)

--
The right to offend is far more important than the right not to be offended. (Rowan Atkinson)

Re:Gee, ya think? by rhsanborn · 2011-03-02 02:04 · Score: 5, Insightful

There are a lot of cases where pressing the button means that the problem will go away...for a few weeks. It will work right until you hit the same conditions that caused the problem in the first place. Suddenly, you're using the refresh to cover up either a poor implementation or a standing bug, and it isn't going to go away until you call that guy in suspenders.

Re:Clone my car! by shawb · 2011-03-02 02:07 · Score: 5, Insightful

The real solution? Reimage the production server to just get it working, then you dig around on the dev server until you find out what's actually going on.

--
I'll never make that mistake again, reading the experts' opinions. - Feynman

To be honest by TheRealFixer · 2011-03-02 02:08 · Score: 4, Informative

It sounds like this guy is just upset that technology has progressed to the point where we don't need to pay out the nose for some high-priced UNIX consultant to spend 3 days troubleshooting an issue that can be fixed in minutes or hours.

Just because you might learn more by spending days chasing down an issue instead of using your available tools to quickly redeploy the server and get the business back up and running, doesn't make that the correct decision. If you really want to dig into the root cause, clone the broken VM off and research it after you get a fresh one deployed from template.

Virtualization != marginalization of skills... by Shoeler · 2011-03-02 02:14 · Score: 4, Interesting

This seems to me to be a philosophical question. Indeed, if the uptime and more importantly availability is higher by the purported crash and burn (taking liberties with the slash and burn deforestation technique) method, who is to say it is less useful or less valid? Indeed, to espouse skills over delivering for the client seems to be missing the point. It seems to be standing on some pedagogical imperative that knowledge is somehow of more value in the workplace than delivery.

Now - having said that - don't get me wrong. I have seen entirely too many *nix sysadmins (full disclosure: I got an RHCE in 2003) who don't know where the network config files are because they only know the GUI, and are hired by a team of people who have never logged into a *nix box. However, I think the ill that is most egregious is not that it sets some moral and ethical imperative of fixing rather than reloading (or in this case, recovering from a VM image) a server, but the fact that it misses the point that there has been a dearth of qualified IT candidates since the dawn of our industry and that the fixes to this don't have to do with how we fix a server, but how we hire and more importantly who we hire. As is everything in IT, garbage in == garbage out.

Finally - I absolutely agree with the Infoworld argument. It assumes an unexpected failure within the server, not some external thing that needs to be diagnosed and fixed. If your app crashes because the SQL table isn't there on the SQL server you don't control, rebooting ain't going to do a hill of beans worth of good.

Re:Hyperviser by Anonymous Coward · 2011-03-02 02:31 · Score: 4, Interesting

Because pointing and clicking inherently takes more skill than using CLI, right? Never mind that most CLI commands will readily assist you with syntax if you format them incorrectly, whereas documentation for a GUI, if it exists at all, is often useless...

Re:Clone my car! by Ephemeriis · 2011-03-02 02:34 · Score: 5, Insightful

The real solution? Reimage the production server to just get it working, then you dig around on the dev server until you find out what's actually going on.

Exactly.

If the machine is in production it needs to be working. You don't have time to dig around and find the root cause. You need it to work. Now. If you've got a virtualized environment it is trivial to bring up a new VM, throw an image at it, and migrate the data.

Then you take your old, malfunctioning VM into a development environment and dig for the root cause, so that you don't see the same problem crop up on your new production machine.

--
"Work is the curse of the drinking classes." -Oscar Wilde

Re:Clone my car! by Isca · 2011-03-02 02:35 · Score: 4, Insightful

That's assuming your new tool that's vitally important actually has a man page. Very little is documented as well as it was 10 years ago.

endless cycle by roc97007 · 2011-03-02 02:37 · Score: 4, Insightful

I'm not sure I buy everything in TFA, but have to admit to a certain extent this phenomenon is real. I've noticed, however, a tendency to regenerate an instance, and when it doesn't work regen it again, and again and again because the purposely overextended and/or undertrained admin doesn't have time to figure out that the problem is in his template or due to something external like a duplicate IP. Come to think of it, this type of endless cycle seems to be fairly common in the Windows world. I guess we've caught up.

Sometimes the user has to diagnose the problem themselves, which is a win for the IT manager because the time didn't come out of the IT budget.

I'm hoping that at some point these practices will be recognized as the false economics they are. But I'm not holding my breath.

--
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.

Re:Hyperviser by __aamnbm3774 · 2011-03-02 02:57 · Score: 5, Insightful

This whole argument is retarded. I always pick the most appropriate response to the problem at hand. If your server is hosed and not booting, I don't have time to mess around with some Knoppix DVD, trying to figure out exactly where in the boot process it is dying. Especially if you have nightly backups! Sometimes a clean sweep and restore is perfectly acceptable and reasonable. Why even sacrifice downtime trying to troubleshoot an issue that could be resolved within minutes?!

Now, if it happens again the following night, you do have a deeper problem and should investigate it further, because constantly restoring the machine is now the inefficient part in the process.

It's like we've lost common sense in favor of our technical ego.

Re:Hyperviser by jc42 · 2011-03-02 03:00 · Score: 4, Insightful

... documentation for a GUI, if it exists at all, is often useless...

How true. The popular explanation of the difference between a CLI and a GUI is that CLIs are so complicated that you need a manual to use them, whereas GUIs are so simple and intuitively obvious that no manual is needed.

Of course, the reality is that this attitude allows vendors to supply GUIs without any "unnecessary" manuals, but make the nested tree of windows and menus so deep and complex that nobody can ever remember where everything has been hidden, and there are no good tools to help you find something that you know is in there somewhere.

Meanwhile, the people who build the CLI know that nobody can ever remember it all, so they include tools for finding your way around. They also tend to make the defaults for the commands fit the most common cases, so you don't have to use the manuals all that often. And most tools have a -help option (though they can't quite agree on how to spell it), to provide quick reminders. And the CLI includes a current directory, search paths and aliasing, so you don't have to remember full paths to everything.

One of the ongoing frustrations with every GUI is constantly seeing a new window pop up, which is positioned back at the root directories, and I have to laboriously poke at things to get down to the directory that I'm working in. Then, when I do what the window was opened for, it closes, all that navigation is lost, and I have to do it all over again the next time I want to access a file in the same directory.

GUIs may have some aesthetic appeal (aka "pretty pictures"), but they remain the slowest, clumsiest way to use a computer that we've yet developed. But I trust that people are working on finding ways to make it even clumsier and slower. This seems to be happening with the "cloud" approach, for example.

--
Those who do study history are doomed to stand helplessly by while everyone else repeats it.

Re:Hyperviser by jc42 · 2011-03-02 05:23 · Score: 5, Interesting

No, the difference is that a CLI is nearly impossible to use if you aren't familiar with it - the semantics and syntax are as, if not more, important than the concepts - whereas a GUI requires much less focussing on the "how", allowing much more focussing on the "what".

While there's a certain truth to this, GUIs are in general a lot less "intuitive" than people tend to believe. Without documentation and training, most users are unaware of most of their GUI's capabilities, and have great difficulty in learning much more than the basics.

An example I've read a number of warnings about in web-design documents is that a significant number (often estimated at around 50%) of "non-geek" users don't understand scroll bars. This is usually mentioned along with the advice to put the important part of your web pages close to the top, because the non-scrolling users won't be able to see anything below that.

Yes, I was dubious when I first read this. But over the years, I've run into several clear examples. I've been involved in building web sites for some very non-geeky organizations. The orgs' leaders generally want a lot of stuff on their main page, and at the top they usually want some text about the organization, its purposes, its main activities, etc. They also agree that it's good to have a list of upcoming public events on the main page, and inevitably that's positioned below the introductory text, so it's often not visible unless the user has a rather large window.

In each case, there were eventually meetings with discussions of how to improve the web site. One thing that would come up was suggestions from users (including members) that the home page should have a list of upcoming events. The leaders have always been dumfounded by this. "But, but, ... There is such a list on the home page." "What?? No, there isn't."

Eventually, I have to interrupt, and explain to the org's leaders that they're hearing from people who don't understand scrollbars and have never seen the events table because they don't scroll down to see it. The users are, of course, confused; they know that there's no such table because they've never seen it. We bring up the site on a handy machine (preferably a laptop or tablet with a small screen), and I show the users that it's there by scrolling down to it. Their response again is confusion, because they don't know what I did or how I did it. "Why's it hidden like that?"

So I teach them about scrollbars, and a few users have learned something useful. But this has a more important effect: It gets across to the leaders why their design was wrong, as I'd been telling them, and they'll have a better web site if they'll let me fix it.

One instance of this happened just last week. The org's web site now has that block of extensive history and purpose in a separate box at the bottom of the page, and the table of coming events is positioned near the top, just below the logo bar, where non-geek users will see it and be able to read at least the first few entries.

Examples like this abound in GUI design. Many of the common widgets are not at all intuitive to most people. Even if they accidentally poke at things and trigger the actions, it's often difficult to grasp what the effect was. You see things change, but the changes don't make sense, and have no obvious relation to the icon that you clicked on. Often the icons don't look like anything that most users can name. The result is that most of the GUI is unusable to most of the users.

I wish I knew good ways around this. But truly making a GUI obvious is very difficult, and takes a lot of time studying the users and learning about their misconceptions. I very rarely have the time to do this, and in many cases the people paying me have expressly forbidden wasting time with dumb users.

And that's something that's very difficult to program around. ;-)

--
Those who do study history are doomed to stand helplessly by while everyone else repeats it.

Production environments by DerekLyons · 2011-03-02 05:46 · Score: 5, Insightful

Some of the comments here remind me of a post on a woodworking board a few months back. Essentially, the poster was lamenting because he had to fire a guy because he couldn't afford to keep him... Not because of the economy, but because the guy was an absolutely inflexible perfectionist. He'd spend $300 worth of time on what should have been a $60 job... The guy was a hell of a woodworker, at home in his own shop, but just couldn't adapt to a production environment.

This isn't about Windows vs. Unix. This is about admins not understanding their job is to get production rolling again, not to satisfy their obsessive need to understand every problem or their need to satisfy their ego. ("I'm a UNIX admin dammit, I refuse to use habits that make me look like a Windows admin" or its equivalent is a refrain modded up again and again here on Slashdot.) If a reboot or a re-imaging fixes the problem, that's the right solution. If it doesn't, *then* you dig deeper.

Re:Hyperviser by drsmithy · 2011-03-02 06:03 · Score: 5, Insightful

For example, Linux is extremely easy to use -- if you understand it. Windows is a hell of a lot easier to learn but knowing all about it won't make it much easier to use.

That is entirely a matter of opinion.

Your comment there describes what is easy to learn.

No, it doesn't. Your comment assumes that an interface should *have* to be learnt, to be easy to use.

The CLI appeals to people who are willing to learn, who like learning new things and consider it worthwhile.

No, the CLI appeals primarily to people who like to focus on memorising semantic minutiae and believe that doing so is, in and of itself, a productive endeavour.

The terminal is for non-trivial tasks.

The implication that GUIs are only used for "trivial" tasks is ridiculous on its face.

The average Windows user who views learning as an unreasonable burden that should never be expected of anyone who wants to use a complex machine ... they avoid the up-front investment of learning to understand the system. Instead, they can jump in and start using the system right now. But they continuously pay for it over time in the form of enjoying few or none of those advantages.

There is nothing unique to Windows, or even computers, about this. Do you know the intricacies of how your car works ? How about your blender or oven ? Could you fabricate a new bed or sofa from raw materials, and without modern tools ? Do you grow your own produce ? Could you butcher a cow or chicken ? Could you set a complex fracture or create your own painkillers ? Can you brew your own beer ?

It's like the difference between people who live within their means and use plastic only as a form of payment, saving up until they can actually afford something before they purchase it, versus those who live all the time on credit. The person living on credit gets the stuff they want right now but ultimately pays quite a bit more for it and can quickly find themselves in over their head. The discipline and delayed gratification that the latter is trying so hard to avoid is something that the former considers to be virtues worth cultivating.

No, it's nothing like that at all. One is an example of financial irresponsibility and the other is simply realising that you do not need a deep and intricate understanding of a given thing to use or take advantage of the services or benefits it provides.
