The Decline and Fall of System Administration
snydeq writes "Deep End's Paul Venezia questions whether server virtualization technologies are contributing to the decline of real server administration skills, as more and more sysadmins argue in favor of re-imaging as a solution to Unix server woes. 'This has always been the (many times undeserved) joke about clueless Windows admins: They have a small arsenal of possible fixes, and once they've exhausted the supply, they punt and rebuild the server from scratch rather than dig deeper. On the Unix side of the house, that concept has been met with derision since the dawn of time, but as Linux has moved into the mainstream — and the number of marginal Linux admins has grown — those ideas are suddenly somehow rational.'"
Someone still has to maintain the machines that are actually running the VMs.
Palm trees and 8
TFA concludes with "But if all it takes is a few clicks of a mouse in vSphere's Windows-based client to pop out a cloned server instance (ostensibly built by someone who knew what they were doing), then what does it matter? It's all very convenient and cool, right? Wrong. If you don't understand the underpinnings, you're missing the point. Anyone can drive the car, but if it doesn't start for some reason, you're helpless. That's a problem if you're paid to know how to fix the car." While I agree in principle, the analogy here is off. If the car doesn't start in this case, I can just throw it away and clone a working one.
I’m not a system admin but I don’t see how this is a bad approach.
I see value in finding out what the problem is and why it happened. If you just blindly re-image, the problem might pop up again at a less opportune time.
But if you know what the problem is, and you have an image of the server in a working state or a documented procedure for setting up the server in its intended configuration, then why would anyone waste time trying to repair it?
I think you have this kind of problem in most jobs. New approaches that make more sense but require less skill (and imply less e-pene) are always hated by people who have already learnt how to do it “the hard way”.
I see this as a programmer all the time and have been a victim of it. I’ve seen a huge chunk of my chosen industry migrate from meat and potato problem solving to gluing libraries together and sprinkling in business logic.
I’ve been fortunate to land in a job where there’s still a lot of “from the ground up” work, but these jobs are getting scarcer as even the components that everyone uses are made from other components. And executable UML (or something of its ilk) is probably going to be the next thing to cut the legs off us.
An expensive part of most IT budgets is people costs. Unfortunately, if your primary business is not IT, it is also the easiest one to cut.
"they punt and rebuild the server from scratch rather than dig deeper."
From personal experience this is normally due to management jumping down our throats to simply "get it done" which unfortunately runs counter to our inquisitive desires to actually solve the problem.
I suspect it's the end result of pressure to get more bang for their bucks in a tight economy, but that's pure speculation. It really could be a trend of the times.
If the cost of re-imaging a machine in a production environment is less than the cost of digging deeper, guess which one I'm going to do?
A: Pay a bearded guy in suspenders for hours while he incants various arcane phrases like "sudo" and "grep" and hope that he actually manages to clean up the problem at the end, or
B: Press a button and have a factory fresh install in seconds.
Assuming that you have a decent build done first (Pay the bearded guy big for that) why on earth would *anyone* pick A? It's hardly just Unix- we're a Windows shop and we're heavily virtualized because it makes sense from so many different angles- security, load balance/failover, ease of setup, etc.
"Seven Deadly Sins? I thought it was a to-do list!"
As hosted services become more and more popular, sysadmins have less interest in spending the time to diagnose and solve a problem - this goes for Windows, Mac OS and Linux/Unix. When a fix is needed RIGHT NOW - the quickest way back up sometimes is a re-image.
When I was a small business IT consultant, I asked clients if they wanted to spend $125 per hour for me to diagnose and fix their system - with the understanding that it could take many hours to research and solve the problem - or if they wanted to spend ONE hour re-imaging the system to a known good point.
Almost everyone chose the "fix it now in under an hour" solution.
-ted
Yet another story about how the old way was better.
What's better is whatever keeps your employer's company making money for the most time. If re-imaging the server every weekend gives them 100% uptime during the week, do it. If you can inject patches into the app during runtime, more bully to you, but I can't, so I'm going with "re-image to working state and roll forward." If that costs my employer less than you cost your employer, I know who's all of a sudden more employable!
Might want to shave off those neckbeards, folks.
Finally had enough. Come see us over at https://soylentnews.org/
Seriously, which way gets the job done faster?
Being a sysadmin is not about you and the system and your marvelous detecting and repair skills, it's *always and only* about your users. If VM technology improves the speed of recovery so the users can get back to what they were doing (probably messing up your carefully architected system), then so be it.
"My God...it's full of trolls!"
I just thought it was amusing to post a headline with decline and fall both in the same sentence, when they are clearly the same thing in this instance. Should it actually have been "the rise and fall of ..." or "the decline of ..."?
Many times, what I hear as "solutions" are simply variations on the theme: "Why can't we reboot the server?" or "Why can't we reinstall the server from scratch?".
And my answer usually was: "Listen, I don't care how many times you do this on a Windows machine, but this is UNIX - I'll only reboot this machine if I absolutely need to. In the meantime, watch and learn as I kill the offending processes. Oh, and re-installing the machine means 24h of downtime".
These days, I help run a (very) large application, which runs on top of a (very) large "enterprise" SQL database for a (very) large company. The only problem is: enterprise application does not manage database very well, and leaves zombie processes on the database server. After a while, the database server just crashes (hard) and takes down the application server with it. Logical solution (and the one recommended by sysadmins): upgrade application to version X, which is supposed to have a much better database management.
What do you think the PHB/management solution is? Ask the DBAs to write a script that will monitor zombie processes, so the sysadmins will be warned in advance... Like, around 20 minutes before the application crashes. Just enough time to tell all users to save their work, because we need to reboot everything. Just like under Windows.
Did I mention the application is considered mission-critical and runs 24x7? And that downtime can cost more than 6 figures to said (nameless) company?
And, since you asked, yes, I am looking for another job. (Clueless admins and pointy-haired bosses: a match made in...)
The right to offend is far more important than the right not to be offended. (Rowan Atkinson)
Often times resolving the issue will take longer than the time to re-image something. This is the benefit of running virtualized infrastructure, quick build up and tear down.
The OS itself shouldn't matter and I've been doing this since I was able to snapshot stuff. Often times it will allow me to go back and work on the broken image while the new image is running, but honestly from a Management view - the admin is there to make stuff work - they don't care how he/she does it. They are interested in quick resolution.
But what's wrong with having images of servers ready as a viable disaster recovery strategy?
Yes, I agree it is good to know the system inside out. Yes, I agree that a simple, minor configuration screwup in one server process is no reason to reimage the whole server. But sometimes reimaging saves time at a point where users need the servers immediately. Sometimes it might actually be more secure and stable to restore from an image that has been tested for months rather than making tons of changes under the hood, especially if it is a system that has not been documented, where the last changes were made years ago. By diagnosing the server under time constraints, it is possible to mess things up even more. It's not necessarily a pissing contest ("well, I can fix my server without re-imaging") in this case.
Now, if the problem occurs regularly and reimaging just puts blinders on the problem, then yes, I agree imaging is wrong. Yes, it is a good thing to know what is happening and to find the problem, and most problems don't necessarily require reimaging.
My point is that it is not necessarily a bad thing to restore a server from an image if you do things right. It may save time, be more secure, and save tons in productivity and money.
Never anthropomorphize computers, they do not like that
Sure it was cool, back in the day, to spend 72 hours working on "the server" because even rebooting was not an option. Back then I had 3 servers, 10 years later I had 15. I didn't have the time to get into why each little snowflake of a problem was happening, I knew reinstalling and upgrading components would be a more prudent use of time. If I can rebuild a server and restore a data backup in 4 hours or I can spend an infinite amount of time "fixing" the existing install, which option do you think my PHB would prefer? It is not bad administration, it is just different.
As VM's are virtualized and taking snapshots of them becomes so easy, why would you bother troubleshooting anything when you can just restore to a snap that is an hour old? Better than the server being down and spending who knows how long trying to figure out what's wrong.
Obviously employers (if they wake up to this) will realize "Hey, I can pay a kid to restore snapshots" instead of "Hey... I need to hire this super expensive IT veteran."
There are always people who are excellent, competent, and flat-out bad at their job. Unfortunately, the numbers of each group skew towards the lower end (well, not everyone is a genius). If this makes for an acceptable solution for the less-skilled, so be it. I hate to reward incompetence, but I hate down time even more. I want my servers running so my employees can do their work.
Vote monkeys into Congress. They are cheaper and more trustworthy.
It sounds like this guy is just upset that technology has progressed to the point where we don't need to pay out the nose for some high-priced UNIX consultant to spend 3 days troubleshooting an issue that can be fixed in minutes or hours.
Just because you might learn more by spending days chasing down an issue instead of using your available tools to quickly redeploy the server and get the business back up and running, doesn't make that the correct decision. If you really want to dig into the root cause, clone the broken VM off and research it after you get a fresh one deployed from template.
It's funny how many admins out there can't even set permissions in *NIX. I was working with a guy who was very well-versed in the VM world. Several certs after his name, in fact. But when he had to actually set permissions on the .vmdk files on the ESX host from the command line, he was clueless. I explained to him the whole rwxX and how each numerical value changes the bit for that permission and it was a completely wasted effort. I guess Veeam will take care of all that from a GUI.
Still, seems like they would teach the basics.
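For anyone who never got taught the basics, here is a minimal sketch of the numeric scheme described above: each of the nine rwx positions is one bit, so every owner/group/other triplet is an octal digit with r=4, w=2, x=1. The function name `mode_from_symbolic` is made up for illustration, and the sketch deliberately ignores the capital `X` (conditional execute) form that chmod accepts, as well as setuid/setgid/sticky bits.

```python
import stat


def mode_from_symbolic(symbolic: str) -> int:
    """Convert a 9-character permission string such as 'rw-r-----' to a numeric mode.

    Hypothetical helper for illustration; not part of any standard tool.
    """
    if len(symbolic) != 9:
        raise ValueError("expected 9 characters, e.g. 'rwxr-x---'")
    mode = 0
    for position, char in enumerate(symbolic):
        expected = "rwx"[position % 3]  # the letter allowed at this position
        if char == expected:
            mode |= 1 << (8 - position)  # leftmost character is the highest bit
        elif char != "-":
            raise ValueError(f"unexpected {char!r} at position {position}")
    return mode


if __name__ == "__main__":
    for text in ("rw-------", "rw-r-----", "rwxr-xr-x"):
        mode = mode_from_symbolic(text)
        # stat.filemode() goes the other way: numeric mode back to 'ls -l' style text
        print(text, oct(mode), stat.filemode(stat.S_IFREG | mode))
```

The resulting number is what you would hand to `chmod` on the command line (for example the octal digits 600 for owner-only read/write).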
This seems to me to be a philosophical question. Indeed, if the uptime and more importantly availability is higher by the purported crash and burn (taking liberties with the slash and burn deforestation technique) method, who is to say it is less useful or less valid? Indeed, to espouse skills over delivering for the client seems to be missing the point. It seems to be standing on some pedagogical imperative that knowledge is somehow of more value in the workplace than delivery.
Now - having said that - don't get me wrong. I have seen entirely too many *nix sysadmins (full disclosure: I got an RHCE in 2003) who don't know where the network config files are because they only know the GUI, and are hired by a team of people who have never logged into a *nix box. However, I think the ill that is most egregious is not that it sets some moral and ethical imperative of fixing rather than reloading (or in this case, recovering from a VM image) a server, but the fact that it misses the point that there has been a dearth of qualified IT candidates since the dawn of our industry and that the fixes to this don't have to do with how we fix a server, but how we hire and more importantly who we hire. As is everything in IT, garbage in == garbage out.
Finally - I absolutely agree with the Infoworld argument. It assumes an unexpected failure within the server, not some external thing that needs to be diagnosed and fixed. If your app crashes because the SQL table isn't there on the SQL server you don't control, rebooting ain't going to do a hill of beans worth of good.
You have 1000 servers. You need to upgrade them to RHEL 6. Do you put a DVD in each of 1000 DVD drives?
NO!
You use an image server. Kickstart. Cobbler. Figure out what the new image should look like, and then pxeboot 1000 servers. That goes much faster. (To the sysadmin above: reimaging a server should take 25 minutes, most of which is spent surfing slashdot, not an hour.)
So now, you've got a server that's misbehaving. One of 1000. Out of pure coincidence, honest, the one server you were manually futzing with last week, but that can't possibly be connected. Fixing that server yourself will cause more "configuration drift", and leave you with one server that's still different than the 999 other servers. And hey, that image server is still on your network. Just reimage the thing.
It's popular because it's the answer that scales. kthxbye.
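To make the "configuration drift" point concrete, here is a rough sketch of how one might spot the server that differs from the other 999: hash the config files on a known-good image, hash them on the suspect box, and compare. Everything here is an assumption for illustration (the function names, the idea of a "golden" manifest, the choice of SHA-256); real shops would more likely lean on their config management tool for this.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (relative path) to its digest."""
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_drift(golden: dict[str, str], actual: dict[str, str]) -> dict[str, list[str]]:
    """Compare a server's manifest against the golden image's manifest."""
    shared = golden.keys() & actual.keys()
    return {
        "changed": sorted(k for k in shared if golden[k] != actual[k]),
        "missing": sorted(golden.keys() - actual.keys()),  # in image, gone on server
        "extra": sorted(actual.keys() - golden.keys()),    # hand-added on server
    }


if __name__ == "__main__":
    # Hypothetical manifests; in practice build_manifest(Path("/etc")) on each side.
    golden = {"hosts": "aaa", "ntp.conf": "bbb", "motd": "ccc"}
    suspect = {"hosts": "aaa", "ntp.conf": "zzz", "cron.d/tmpfix": "ddd"}
    print(find_drift(golden, suspect))
```

If the report comes back non-empty, that is the manual futzing showing up, and reimaging resets all three lists to empty in one step.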
Is this the old geezer versus the new wet diapers yet again? (trying to be as evil on both sides ;) )
There are new technologies and we should embrace them. I am not a proponent of VMs, I don't like them in general, but I do see their uses and they're very effective. Like in C++, you've got the STL, with very similar and nearly interchangeable std::vector, std::list, std::deque and so on (and not talking about boost or 3rd parties here). You need to know when to apply them or else you'll get problems. Well, in the '10s, you have the same ridiculous amount of technologies available to sysadmins, and you need to know when to apply them. That's the new sysadmin job: not only to know that you can code one in bash with grep, awk, echo, while read, pipes and rsync, but to actually know there is a package all neatly made for you, available at your fingertips with a simple apt-get (or yum).
I keep my computer tidied-up, I love to know what runs where. Even then, I do a "spring cleaning" once every year, reinstalling everything. And incredibly, my computer runs faster and more efficiently. Why? new /etc defaults, new parameters, new software, old clinging software, things that are nearly impossible to update. Same for the files. Seriously, in today's computers, we get hundred of thousands of files, most of which have some arcane use we couldn't care less, but are necessary for some kind of weird reason. I'm a sysadmin, and I don't pretend to want to know all these files.
I read the article, and yes, there are things that are changing, and seriously, I do respect the One person who can understand the Sendmail configuration files... oh I'd even be impressed with the M4. :) And when there is a problem, I want to know why, because I love to learn. But then ... there are prerogatives, time constraints, servers need to be up, people need to work, and we have all these magnificent tools that will enable every computer to be segregated in their private little VM world (to return to that main article). So should we simply shrug, laugh and go back to The Ancient Ways? You can keep your "vi" editor, leave me my "vim", please. :)
Oh, and re-installing the machine means 24h of downtime
I am with you except here. If re-installing a machine incurs 24h of downtime, you do not have a suitable contingency plan. Most environments I deal with are 15-20 minutes from offline to production on reinstall at the long end.
XML is like violence. If it doesn't solve the problem, use more.
I used to scoff at reformatting and reinstalling, but today it's a simple calculation. Will the fix take longer than either reverting from a snapshot or cloning from a template? Many may cringe at that as a solution, but the bottom line is time is money. It used to be that reinstalling, restoring from backup simply took too long, and it was better to fix the problem at the console if possible. Today, that isn't so with automatic snapshots of virtual machines, SAN replication, etc. I don't scoff at it though, it means we can spend more time being proactive rather than reactive.
I'm going to get flamed for this, but what the hell.
I've always thought that it is more important to get a server back up and operational as quickly as possible than it is to keep the server down until you find the problem. Now don't get me wrong, you still need to find the ultimate problem, or at least find out if the problem is repeatable, and then find the answer to it.
So I'm in favour of any method that helps me get the system back up and running, be it re-imaging or anything else.
Sometimes a one-off mistake happens, and reinstall makes sense. Many other times, the reason you had to reinstall is due to a more persistent problem (program/script systematically messing up or an admin that just needs to not be doing admin work), and skipping root cause analysis means you'll lose more time in the aggregate.
XML is like violence. If it doesn't solve the problem, use more.
It costs them less to pay the DBA to write the script and inconvenience their users than it does to upgrade the system.
That all important profit margin is what gets in the way of things being done the right way.
captcha: income
Sometimes a server is gradually degrading due to some issue. During that time, things are being modified. If you learn that the problem started a few months ago, you can't just re-image an old state and lose everything that had changed since then.
Of course to make app servers as stateless as possible helps against this problem. One of the reasons that my company enforces that data are kept on physically separate DB servers, and (virtualized) app server instances should be as dedicated to a single app as possible.
I know if I had a boss hovering over me, not understanding what was wrong, and just pressuring me to get it done I would tell him to shove off so I could learn. Who cares that every minute I spend working on the issue is a minute I can't spend on other problems. Who cares that I could be replaced by a system admin who would get it done quickly. Knowledge and what other system admins think of me is what is important. After all, those pay the bills. /sarcasm
If you ask me, a major drawback is that fewer eyeballs are looking at the code -> less bugreports -> buggier software.
I think part of this phenomenon might be due to outsourcing, which puts a layer of call center personnel armed with loose-leaf binders of procedures between you and the one or two remaining competent sysadmins, who are then relegated to firefighting. In this world, there isn't time to diagnose problems because the level of expertise and admin/customer ratio are kept purposefully low.
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
Run an I series, they're like a tank. Slow and cumbersome but they just don't stop.
The only problem is: enterprise application does not manage database very well, and leaves zombie processes on the database server. After a while, the database server just crashes (hard) and takes down the application server with it.
Did I mention the application ... runs 24x7
So which is it, it crashes "often" enough to be a problem, or it never crashes ever?
The obvious solution is to reload it every day at the least inconvenient time.
If they will not "permit" a controlled reboot, then work around it by running health testing scripts that just happen to knock it out, sort of a euthanasia approach.
The next "solution" is a (caching?) sql proxy server in the middle, no one will notice if the reboot is fast.
Is the upgrade suggested by the admins themselves, who have tested it under load on a test server so they know it'll work, or suggested by the vendor dazzled by the vision of fat commission checks? "It'll work great, sure, it'll work great, great at paying for my sports car, yeah it'll work great"
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
I'm not sure I buy everything in TFA, but have to admit to a certain extent this phenomenon is real. I've noticed, however, a tendency to regenerate an instance, and when it doesn't work regen it again, and again and again, because the purposely overextended and/or undertrained admin doesn't have time to figure out that the problem is in his template or due to something external like a duplicate IP. Come to think of it, this type of endless cycle seems to be fairly common in the Windows world. I guess we've caught up.
Sometimes the user has to diagnose the problem themselves, which is a win for the IT manager because the time didn't come out of the IT budget.
I'm hoping that at some point these practices will be recognized as the false economics they are. But I'm not holding my breath.
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
but know the teams that implement/admin them and I am constantly amazed.
Amazed in all that I read here and elsewhere points to incredibly resilient systems yet I have never been anywhere where they don't have scheduled down time on at minimum a quarterly basis and every major outage relied on a reload. So which is it? They make fun of the windows guys and just hope the windows guys don't look at their statistics (and no I am not on Windows either, think IBM Z and I).
My serious question is, is there a certain size of system that reloads are valid on? When does it stop being a valid solution? When you get to enterprise level systems what are your options then?
I read all these articles but the one thing never clear is, are these large systems or just small servers (small being PC class hardware)
* Winners compare their achievements to their goals, losers compare theirs to that of others.
I don't see why reimaging/rebooting a VM instance is different from restarting a service that is misbehaving. Now "services" are VMs, that's all.
You were very happy as a sysadmin of a couple big servers, and now you have to administer several dozens of VMs. The skill set is slightly different, that doesn't mean we're "losing skills". Your Unix wizardry will come in handy anyway. The base concepts about OS operation will be there too.
Things change. Learn and deal with it.
The decline and fall? Can you decline and fall at the same time?
Where was the incline and rise?
...I'm a poor, lowly Windows admin who doesn't know my ass from a hole in the ground. ALL HAIL THE 1337 *NIX H4X0R5!
Seriously...how long is this windows admin vs *nix admin comparison going to last? I can't help it that there are apps that absolutely need to run in a Windows environment. The job needs to get done. If I could run my industry specific software on Linux, I would. I would love to save my company money from licensing.
Now if you'll excuse me, I need to go back to flinging poo all over my server room walls.
"A plan fiendishly clever in its intricacies"- Homer Simpson
Oh, and re-installing the machine means 24h of downtime
I am with you except here. If re-installing a machine incurs 24h of downtime, you do not have a suitable contingency plan. Most environments I deal with are 15-20 minutes from offline to production on reinstall at the long end.
I agree that if the system is as critical as they say, they should have a better failover in place, however in a lot of companies, very little importance is placed on Live Failover systems. More than likely he's including lots more than the OS/Application build in that 24 hour timeframe.
Probably database reload/recovery time, or file system initialization (inadequate RAID controller to Disk design?).
This space for rent. All reasonable inquiries will be entertained at proprietors discretion.
As a system administrator I don't understand why any option that offers the quickest and biggest win should be ruled out, even if it conflicts with tradition. Sometimes the problem is just that the biggest, quickest and easiest fixes are not the same fix. Knowing which one to choose, and when, makes a good system administrator.
Logical solution (and the one recommended by sysadmins): upgrade application to version X, which is supposed to have a much better database management.
What do you think the PHB/management solution is? Ask the DBAs to write a script that will monitor zombie processes, so the sysadmins will be warned in advance... Like, around 20 minutes before the application crashes. Just enough time to tell all users to save their work, because we need to reboot everything. Just like under Windows.
The new version costs money. And, no matter how important everyone thinks this application is, they obviously don't think it's worth that price. They're willing to deal with a reboot rather than spend the money. I'd recommend the upgrade, too... But I don't write the checks. Nor do I really use the app. I just keep it running. And if you tell me you can live without the app for 10-15 minutes while the server reboots, and you'd rather save $X instead of buying the new version, that's what we're going to do.
Listen, I don't care how many times you do this on a Windows machine, but this is UNIX - I'll only reboot this machine if I absolutely need to. In the meantime, watch and learn as I kill the offending processes.
That's great when you can get away with it... But sometimes it just isn't worth the trouble. Even on a UNIX system.
Yeah, I hate rebooting to fix problems. Seems like a crude approach. Especially when you've got so many nice tools at your disposal on a UNIX system.
And, I guess, I'm kind of wondering why it needs to be rebooted in your situation. You've got a script monitoring zombie processes... And those processes can apparently be killed manually... So why not have that script kill the processes instead of just monitoring them? Or write a second script to fire off a batch of zombie kills?
But sometimes it just isn't worth the time/effort involved. You can spend a couple hours digging for the problem while your users are without their app... Spend a couple hours developing and testing your script while your users are without their app... Spend a few days patching code while your users are without their app... Or you can just reboot the thing and go on with your life.
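As a rough sketch of what the monitoring half of such a script might look like, assuming the problem really is processes in the kernel's zombie ("Z") state and that you feed it the text output of `ps -eo pid,ppid,stat,comm`. The function names and the threshold are made up for illustration. One caveat: a true zombie is already dead and cannot be killed with a signal; it only disappears when its parent reaps it (or the parent itself exits), so any "kill" step would have to target the parent PID reported here. If the "zombies" are really lingering database sessions, the equivalent check would live in the database, not in `ps`.

```python
def find_zombies(ps_output: str) -> list[tuple[int, int, str]]:
    """Return (pid, ppid, command) for every process whose state starts with 'Z'.

    Expects the text produced by `ps -eo pid,ppid,stat,comm`, header included.
    """
    zombies = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)         # command may contain spaces
        if len(fields) < 4:
            continue
        pid, ppid, state, command = fields
        if state.startswith("Z"):
            zombies.append((int(pid), int(ppid), command))
    return zombies


def should_warn(ps_output: str, threshold: int = 50) -> bool:
    """True once the zombie count reaches the (hypothetical) warning threshold."""
    return len(find_zombies(ps_output)) >= threshold


# Made-up sample output, for illustration only.
SAMPLE_PS_OUTPUT = """  PID  PPID STAT COMMAND
    1     0 Ss   init
  812     1 Sl   dbserver
  901   812 Z    worker <defunct>
  902   812 Z+   worker <defunct>
"""

if __name__ == "__main__":
    for pid, ppid, command in find_zombies(SAMPLE_PS_OUTPUT):
        print(f"zombie pid={pid} parent={ppid} command={command}")
    print("warn:", should_warn(SAMPLE_PS_OUTPUT, threshold=2))
```

In production you would get the input from `subprocess.run(["ps", "-eo", "pid,ppid,stat,comm"], capture_output=True, text=True).stdout` and run it from cron.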
Oh, and re-installing the machine means 24h of downtime
This seems wrong to me. Or, at least, completely unrelated to the subject of re-imaging in a virtualized environment.
It takes maybe 5 minutes to provision a new VM complete with OS and default config/apps/whatever.
If I had a system that was as essential as what you describe, I'd have a base image of it stored and ready to go. Just bring up the new image, migrate the data, and make it live. That's what we do with all of our truly essential systems. And we can be running off a new image within about 30 minutes if we're able to migrate data off the old system. If we have to go to tape it'll take longer.
If you actually incur 24 hours of downtime to re-image a server, what's your plan if that machine simply dies? What if it takes more than a simple re-image to get it back up and running?
"Work is the curse of the drinking classes." -Oscar Wilde
It is a cost comparison issue. When the time cost to "punt and reload" is lower than the time costs of further troubleshooting that is the correct solution to the problem. Having virtual servers makes it easier and quicker to reload a server by having a default image on stand-by so it makes less troubleshooting worth the time.
That said, only a very poor admin would discard the old image without discovering the root cause of the issues, in an offline environment, in order to prevent them from happening again and so save future troubleshooting costs. That's what dev servers are there for.
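The comparison itself is just arithmetic. A back-of-the-envelope sketch, where the function name, the rates and the hour estimates are all hypothetical placeholders you would swap for your own numbers:

```python
def cheaper_option(
    hourly_rate: float,
    troubleshoot_hours: float,
    reimage_hours: float,
    downtime_cost_per_hour: float = 0.0,
) -> str:
    """Pick whichever path costs less in admin time plus downtime.

    Assumes the service is down for the full duration of either path.
    """
    per_hour = hourly_rate + downtime_cost_per_hour
    troubleshoot_cost = troubleshoot_hours * per_hour
    reimage_cost = reimage_hours * per_hour
    return "reimage" if reimage_cost < troubleshoot_cost else "troubleshoot"


if __name__ == "__main__":
    # Hypothetical figures: a 4-hour dig versus a 1-hour reload.
    print(cheaper_option(hourly_rate=125, troubleshoot_hours=4, reimage_hours=1))
    # A quick 30-minute fix beats the same 1-hour reload.
    print(cheaper_option(hourly_rate=125, troubleshoot_hours=0.5, reimage_hours=1))
```

This leaves out the cost of the later root-cause work on the preserved image, which gets paid either way if you do the job properly.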
I'll meet you at the intersection of "Should be" and "Reality"
Most of you people don't seem to get the point.
When re-imaging is quick, cheap, and will work, the need for esoteric diagnostic skills will cease to exist.
Put yourself in management's place: you have two options. One takes longer and costs more and results in more downtime in almost all cases. The other option takes less time, costs less, and minimizes downtime. In the real world, the second option is the overwhelmingly logical choice, and it is the choice that's going to be made.
The need for truly expert sysadmins will drop as a result. Ignore this at your peril, if you work as a sysadmin.
I've been fixing Windows machines for my friends and family since Windows 95 (I use Linux exclusively). A possible solution is to reformat the disk and re-load the OS; however, I've never had to resort to this last resort. From viruses to rootkits to buggy drivers, all can be corrected with the help of some good tools (typically stuff written by Mark Russinovich). I do find that the reformat solution seems to be the first choice whenever a computer is taken to a repair shop. It makes perfect sense, you see. People in general are lazy and feeble-minded and we nearly always prefer a simple, quick solution to a correct, more time-consuming one. After all, why test our own patience when we could be posting pictures on the facebook, or experiencing the deep, profound joy of limiting ourselves to 140 micro-blogging characters. It's our nature. It's why we're fat, why our marriages fall apart and why we don't rise at work.
After years of digging through the Windows OS my skills are pretty decent at this point. I can fix most things. But it's definitely a much more difficult road. Is it any wonder that sys admins with real *nix skills are going to be cast aside by business, replaced by inexpensive new-comers who re-image rather than explore, diagnose and understand? Know that your finances, banking data, tax info, social security information, etc, are always being maintained by the cheapest sys admins on the cheapest computers available.
I roll out a beta of a completely new server when my server goes down. EVERYTHING changes. Sometimes even the topic of the entire website might change. All tech is replaced, nothing is the same. Where there was flash-animations, there would now be a java-applet. And I just call it a surprising revamp of the website. And the people keep falling for it. I always develop the next version. I think Google has been doing the same. It is a very strange development-practice. It is utterly confusing. But, it works. It actually makes you seem more hip and cool. I call it "the perpetual beta strategy".
I don't understand why people are using lathes and milling machines to make high-quality, cheap, easy to use tools when they could be carving their own stone axes and axe handles. When their tool breaks, they just set it aside and buy a new one instead of spending days of downtime repairing it! Are we losing the skills needed to carve our own tools by hand in the interest of saving time and money?! This makes absolutely no sense to me and I cast derision on anyone who would do such a thing!
"Tell me doctor, with all of your defenses, are there any provisions for an attack by killer bees?"
The only problem is: the enterprise application does not manage the database very well, and leaves zombie processes on the database server. After a while, the database server just crashes (hard) and takes down the application server with it.
Did I mention the application ... runs 24x7
So which is it, it crashes "often" enough to be a problem, or it never crashes ever?
It runs 24x7... until it crashes. And that's often enough that it is fast becoming a huge problem.
The obvious solution is to reload it every day at the least inconvenient time.
Easier said than done: we have users in (almost) every time zone under the sun. The only time for our (regular) interventions is on Saturday and Sunday. And said "enterprise" application is bad enough that it takes literally hours to restart. And that is on top-of-the-line major vendor iron too -- we are talking about dozens of CPUs and GB of memory here.
The next "solution" is a (caching?) sql proxy server in the middle, no one will notice if the reboot is fast.
That is, perhaps, one solution. On the other hand, I am not really sure this would work with said application.
Is the upgrade suggested by the admins themselves, who have tested it under load on a test server so they know it'll work, or suggested by the vendor dazzled by the vision of fat commission checks? "It'll work great, sure, it'll work great, great at paying for my sports car, yeah it'll work great"
Trust me on this one: this was born out of desperation, knowing full well the management would not allow the upgrade to be budgeted. And, no, no one in the admin group actually got any "fat commission checks" from any vendor -- as a matter of fact, the only people who are wined and dined by the vendors are the top management, way, way, way up above yours truly and the rest of his team (aka "the peons").
The right to offend is far more important than the right not to be offended. (Rowan Atkinson)
It's a question of scale. Reimaging a PC is almost always more economical than finding the root cause, unless it's very repetitive, assuming a proper backup solution is in place. It's a different beast when the computer in question is mission critical and the client can't accept a random downtime every few weeks while you rebuild.
I work on the vendor side and visit numerous customers on a daily basis. One thing I've often found myself remarking to many is that the art of system administration has been lost. I look out at the people who are now considered "senior" and compare that to what used to be senior many years ago, and there is no comparison. Hands down, the talent pool has been diluted. Now, having said that, I view that statement, set against this article, as a very different issue altogether. What I read here is a growing trend of people not spending time troubleshooting problems, in favor of rebuilding failing systems. I think there are two reasons this approach has fallen into favor. Firstly, the scale of these deployments is larger than it was a decade ago. Secondly, the ratio of admins to servers has gone way down. That means that even if the admins were of equal talent to those of a decade ago, they don't bother doing deeper analysis of problems because they have many more things to do and rebuilding is faster. It might be best to say that all of these points are both cause and effect of each other.
This field is nothing like what I entered 15 years ago. Not in the usual technology progress way either, but in a steady downward spiral. Fortune 500 companies are beginning to drop CIOs altogether and putting IT in the hands of business depts., VMs are used as a band-aid for everything and as a result requests and demands and the number of servers to be maintained has exploded... all this while staff is cut to the bone. There used to be "Computer Science" and real professionalism and respect, now none exists. We are mostly to blame for it ourselves. For such intelligent people, we aren't smart. Ego and personality traits have been exploited to force us into 24x7 drones that are lowly, subservient, and basically whipping boys.
I have had some great experiences and I have also witnessed the decline first-hand, when I move on from my current position I will not be re-entering the IT workforce. I hate to throw away a lifetime's work and passion, but there is no real upside I can foresee... I only see it continuing to be minimized. People respect and understand tangible skills and products or revenue generating depts., which have always been tough selling points of IT. Knowledge and unseen aspects are hard to convey to non-technical folks, now that things have been abstracted one more layer with VMs and even virtual switching/routing, forget it.
http://teasphere.wordpress.com - A little spot of tea
Okay, so let's assume you've got a big cluster of servers for some random task, and one of them breaks. Should you diddle with the individual server, which brings it out of sync with the others? No! You re-image it based on the standard for that cluster. But if it happens 10 times, THEN you diddle with the server, and make a new, better server image to deploy across your cluster.
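The rule described above (re-image by default, but dig in and build a better image once the same failure keeps coming back) can be sketched as a small counter. This is only an illustration: the class name, the method names and the image label "web-v1" are made up for the example, and the threshold of 10 is simply the number mentioned in the comment.

```python
from collections import Counter


class ReimagePolicy:
    """Hypothetical helper: decide between re-imaging and investigating."""

    def __init__(self, threshold=10):
        self.threshold = threshold      # failures tolerated per image
        self.failures = Counter()       # image name -> failures seen so far

    def on_failure(self, image_name):
        """Record one failure of a server built from image_name."""
        self.failures[image_name] += 1
        if self.failures[image_name] >= self.threshold:
            # The same image keeps breaking: stop re-imaging blindly.
            return "investigate"
        return "reimage"

    def new_image_released(self, image_name):
        """Reset the count once a fixed image has been rolled out."""
        self.failures[image_name] = 0


if __name__ == "__main__":
    policy = ReimagePolicy(threshold=10)
    for attempt in range(1, 11):
        print(attempt, policy.on_failure("web-v1"))
```

The count is kept per image rather than per server, since the point of the comment is that a recurring failure is a defect in the standard image, not in any one box.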
What rebooting can't fix, formatting (imaging) can. A few more years, and maybe someone can write a program to re-image my virtual servers automatically, and then I can go flip burgers somewhere. :)
I remember that the reddit community talked about this not too long ago http://www.reddit.com/r/sysadmin/comments/fhnai/are_we_being_phased_out/ it was about the same idea of virtualization.
In the corporate world, it's always been a battle between productivity (less time to fix the problem) and accuracy (more time to fix the problem). It's a judgement call. In our environment, most of the troubleshooting is done by system integrators. The SysAdmins simply keep the back end up and running, and as quickly as possible.
I think the main thing here is that if a server is down, your goal is to get it up as quickly as possible. So you try a few fixes and they don't work. Then you have to decide whether it is going to be quicker to dig deeper for a solution or to just re-image the machine. Sure, everyone would probably rather figure the problem out, but the truth is the longer a server is down, the more users it will probably affect.
this all comes down to time. i can reapply the data to a fresh VM image in a matter of hours and have it back up and running, pretty much without variation. hunting down a deep, dark problem can take 30 minutes or it might take days, and depending on the problem, that may simply be unacceptable.
the real skill is knowing when to pull the trigger on a rebuild vs knowing when it's something you can find and fix. hunting down problems and fixing them is something many sysadmins crave. at least VM's give us the ability to investigate the broken machine at our leisure, while a working VM can jump into production.
unless management wants to rely solely on rebuilds and the time investment it takes to do them every time, there will always be a need for sysadmins to analyze problems and figure out the "whys" and "hows" that caused them.
frog blast the vent core
"While I agree in principle, the analogy here is off. If the car doesn't start in this case, I can just throw it away and clone a working one."
Except if there's a hidden problem it won't be working for long, and soon the new one won't start and you're back at square one. That's no good if you've got a load of database data on your VM that'll also be hosed if you revert the VM.
Wow, this technique of "if it's not working, reimage" is such a lame idea that Google, Amazon and every cloud in the world does it! Must be a dumb idea
I don't know about all that many other linux server admins, but I could easily be misrepresented as one of these "redeploy solves everything" people if someone wasn't paying attention.
When a server goes down, my responsibility is not to figure out why, it is to get it the hell back up. Virtualization allows me to do that very, very quickly by restoring a backup or redeploying an identical server. As far as management/users/etc are concerned that's the sum total of what I do when something happens. In reality I'm taking the existing server out of our production pool and replacing it with a working version. I then spend as long as it takes figuring out exactly what went wrong with the broken server so that I can fix it/prevent it in the future.
There is absolutely *nothing* wrong with getting a new server up and running immediately. Anyone that would spend time finding the root of a problem before doing at least basic damage control shouldn't have a job. VM lets me "damage control" by getting the new server up and running in about 2 minutes, so that I can get to the hard part without people breathing down my neck.
Look, everyone has a preferred method of doing things when it comes to IT, and everyone has an opinion on best practice that is based on a number of different things. No one opinion is the best, and every problem shouldn't be resolved the same way.
I was introduced to UNIX while in college from a user's perspective. I played with LINUX as a desktop platform for the first time, also while in college. I also was exposed to the Mac OS of the 90s because that was the computer of choice at SU while I attended and was the typical system found in every computer lab, with the occasional IBM running Windows 3.11 found here and there. I acquired a 286 running DOS which I used to access BBS and MUDs via telnet. I later upgraded to a Windows 95 box, and after college followed a career path of personal computer repair for the next decade, which means I've had my hands in every Windows OS at some point or another, including 2000 server and 2003 server.
On the side I've been maintaining a LINUX server for the past 5 years, running Ubuntu. For the duration that I've owned the server, I've only "reimaged" it once, because I switched from a Pentium 3 class system to a Pentium 4. Any issues that it has had during that time I've been able to resolve with research, patience and a little trial and error. I restart it whenever security updates prompt me to, which is typically after a kernel upgrade. When a new LTS distro is released, I do a distribution upgrade, and there's usually stuff that needs changed/fixed afterward for everything to continue working as expected. It can be a total pain in the neck at times, and it drives my wife nuts on occasion, but I've learned more about computer systems this way, in my spare time, that in the long haul will be more useful to me in my career than I managed to pick up in a decade of PC repair.
I understand that this environment is completely different than a live environment that a business depends upon, and I fully sympathize with the gentleman who pointed out that when management is jumping down your throat to make something work, you tend to pick the fastest solution available to you. The only problem with this is that you have not figured out the cause of the problem, which means it could return.
There are a fair number of weird, unexplainable problems that have nothing to do with software, configuration error or hardware failure that can crop up from time to time. These are rare. They only happen once, maybe twice, and cannot be duplicated. A reboot will resolve these. But most of the time the source of the problem is human error of some kind, which means a reboot is a temporary fix.
So it ultimately becomes a longevity issue. If you're wiping out and redoing a server once a month, you probably ought to spend some time tracking down the source of the problem because the downtime during re-imaging over the course of a year will match or exceed the time spent finding the source of the trouble and correcting it. If you are running several servers this problem could affect some, many or all of them, so fixing one will allow you to fix all and the time will be negligible on the remaining servers, which then more than justifies the time invested in researching the problem. Furthermore, if you are experiencing trouble due to hardware beginning to fail, finding and replacing the defective part before it fails under scheduled maintenance is a much better solution than waiting until it fails under load when your company needs that server the most.
If, however, the issues only crop up maybe once a year, spending 72 hours finding a fix is probably not a good investment of time, because the equipment will be replaced/upgraded before the issue is likely to become a serious problem. In these cases I would recommend re-imaging. In the case of Windows operating systems I would be inclined to re-image anyway because lengthy support calls to Microsoft or the server vendor would potentially be required to resolve the problem, and sitting on hold is generally not a system administrator's best use of time.
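The break-even reasoning in the two paragraphs above can be put into rough numbers. This is only a sketch: the 72-hour investigation figure comes from the comment, while the 6 hours of downtime per re-image, the fleet size of 5 and the function names are assumptions picked for illustration.

```python
def annual_reimage_hours(reimages_per_year, hours_per_reimage, servers=1):
    """Downtime spent on re-imaging over one year, across all servers."""
    return reimages_per_year * hours_per_reimage * servers


def worth_investigating(reimages_per_year, hours_per_reimage,
                        investigation_hours, servers=1):
    """True when a year of re-imaging matches or exceeds one investigation."""
    yearly = annual_reimage_hours(reimages_per_year, hours_per_reimage, servers)
    return yearly >= investigation_hours


if __name__ == "__main__":
    # One server rebuilt monthly: 12 * 6 = 72 hours, which matches the
    # 72-hour investigation, so tracking down the cause pays off.
    print(worth_investigating(12, 6, 72))
    # Same fault on a fleet of five: 360 hours a year, clearly worth it.
    print(worth_investigating(12, 6, 72, servers=5))
    # Problem shows up once a year: 6 hours against 72, just re-image.
    print(worth_investigating(1, 6, 72))
```

The fleet multiplier is where the argument gets strongest: the investigation is paid for once, but the re-imaging cost is paid on every affected server.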
Please bear in mind I am not a professional system administrator, but I've had the chance to observe them and dabble on both sides of the fence.
Devices that important should have redundancy.
Finally had enough. Come see us over at https://soylentnews.org/
From personal experience this is normally due to management jumping down our throats to simply "get it done" which unfortunately runs counter to our inquisitive desires to actually solve the problem.
Old greybeard Unix admin here.... who has to admin Windows systems also, because I do need a steady paycheck. ;-)
I'm not paid to seek out and identify the academic curiosities of obscure Windows system problems, I'm paid to keep the systems up and running to do useful work for the end-users.
Sometimes it's way more effective to just "nuke it from orbit" and rebuild a Windows server, because you can spend a stupid amount of time and effort trying to debug what's going wrong inside a failing Windows installation, and it will still be a losing battle, because Windows is closed source, and many of its internals are deliberately kept hidden from you. Sometimes rebuild and reload can get you back to production in very short order.
My aeons of Unix administration experience have taught me well, even in the Windows world, how to be an effective system administrator, in that I am always 1000% (that's one-effing-thousand percent) prepared to deal with Windows administration.... and know that every time a Windows box craps out it's a potential disaster recovery situation, and you need to be able to recognize when it is and when it's a simple fix. Therefore I maintain backups, snapshots, bare-metal restore images and detailed documentation out the wazoo.
Yes, I do also run many Unix and Linux servers too, and a couple of them have uptimes in the hundreds of days, which strokes my ego just fine, but I also run many Windows boxes, such as some which are Citrix servers, that need rebooted every few days just to clear out all of the "my head is full of fuck" from the running Windows kernel that progressively self-destructs internally while running weird shit like Citrix.
I guess to sum it all up: if you're going to admin Windows boxen for a living... always keep your pimp hand strong.
Um, if you can monitor the zombie processes, why can't the same script kill those processes?
God invented whiskey so the Irish would not rule the world.
And, I guess, I'm kind of wondering why it needs to be rebooted in your situation. You've got a script monitoring zombie processes... And those processes can apparently be killed manually... So why not have that script kill the processes instead of just monitoring them? Or write a second script to fire off a batch of zombie kills?
How would you get rid of the zombies? Killing them won't help: zombies are processes that are already dead, but whose parent has not yet collected their exit status. They are cleaned up once the parent calls wait/waitpid for them, or once the parent itself exits and init adopts and reaps them, but until that happens, they will clutter up the process table, which only has a limited number of entries (often about 32000). If you create zombies faster than they get reaped, you will eventually run out of process descriptors, and calls to fork will fail.
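The claim that killing a zombie does nothing can be checked with a small experiment. The sketch below is an illustration only, not anything taken from the application being discussed: it assumes a POSIX system and Python's standard os module, and the function names are invented for the example. It forks a child that exits at once, sends SIGKILL to the resulting zombie, and shows that the process-table entry only disappears after the parent calls waitpid.

```python
import os
import signal
import time


def entry_exists(pid):
    """True while the kernel still holds a process-table entry for pid."""
    try:
        os.kill(pid, 0)          # signal 0 only checks for existence
        return True
    except ProcessLookupError:
        return False


def zombie_demo():
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        os._exit(0)              # child: exit at once, nobody has waited yet

    os.close(write_end)
    os.read(read_end, 1)         # returns EOF once the child is exiting
    os.close(read_end)
    time.sleep(0.1)              # let the kernel finish the child's exit

    os.kill(pid, signal.SIGKILL)         # try to "kill" the zombie
    time.sleep(0.1)
    after_kill = entry_exists(pid)       # the entry is still there

    os.waitpid(pid, 0)                   # the parent reaps the child
    after_wait = entry_exists(pid)       # now the entry is gone
    return after_kill, after_wait


if __name__ == "__main__":
    print(zombie_demo())
```

In other words, the fix has to happen in the parent (it must call wait/waitpid, or it must exit so that init inherits and reaps the children); a script that sends signals to the zombies themselves cannot free the slots.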
Please correct me if I got my facts wrong.
I seriously can't even believe this is a discussion. I read briefly through the comments here and a portion are in favor of actually using the "Well, it's broken, let's re-image" approach. This simply does not work except on the most basic of servers, and even then you have to wonder at the ability of the person installing the server in the first place, as the basics are all that are necessary.
First of all, as a VMware ESXi user I can tell you that there are in fact limits to what enterprise virtual machines can and cannot do. For example, ESX does not appear to play well with anything above a 50GB database at all. Now I am sure that in the future this limitation will go away, but that is how it is today. As such I would not ever even think about putting a database server, regardless of size, on a VMware installation, simply because if it grows to that point I don't want to be caught with my pants around my ankles. Thus, you CAN'T use this rebuild-it approach....
The only time this approach is even remotely possible is when a server runs a single function and that function is configured in some very basic way. For example, an apache2 server whose whole purpose in life is to be one of the n members of a load-balanced front end for a web pool. Then removing the single member from the pool and rebuilding it is POTENTIALLY faster than finding the issue. However, a good computer technician (not just a systems administrator) is going to know when critical mass for "Is this worth fixing or is it taking up more time than necessary" is reached.
Now as for the idea that Systems Administrators are a dying breed, you are very right. But it's not because of virtualization, it is because that is what our job is. It is my job as a systems administrator to ensure that my SERVICES have as close to a 100% uptime as possible. I write scripts that use tools such as bash, perl, ruby, php, etc. to fix issues before I have to be involved. For the past 50 years people like me have been doing the exact same thing, and none of us believe in reinventing the wheel, so we use and improve upon the tools that our predecessors have given us to more effectively manage and administer our systems. I, the systems administrator, am doing such a good job that I am killing myself off. Those declining skills are due to the fact that Joe Schmoe wrote a shell script back in '96 that did what I want to do, but it hasn't been updated since '02, and as such the script doesn't work the same and needs updating, and though I know the concept, I have never had to do it by hand until now thanks to Joe's wonder-tool. Worse yet, I, having had my job for the past ten years, want to get paid #n money so that I can afford to feed my family something other than ramen. But the average CS coming out of college today has 1/5th the skills I do (so he can't figure out Joe's wonder-tool and how to make it work in his environment), he gets paid 1/5th what I do, and his answer to "Why is the web server not responding" is "I don't know, but I will re-image it immediately." You get what you pay for.
Because a skilled unix admin possesses the knowledge to turn a downloaded iso image into a hardened firewall, web server, db server, reverse proxy, network sniffer, VPN, router, iSCSI target, computing cluster, spam filter, XMPP, SMTP, FTP, SNMP, DHCP, NTP, BOOTP, TFTP, SMB (ad nauseam) servers, and/or NID/IPS device. All without Cisco, Oracle, Windows, Barracuda, VMware or other site licenses, seat licenses, or maintenance contracts.
Most admins I know do not possess these skills, nor do they possess the interest in obtaining them. Perhaps there isn't necessarily a decline and fall of the system admin, but a rise in ubiquity of first-tier administration.
boycott slashdot February 10th - 17th check out: altSlashdot.org
No, it's not bad, but it makes for poor skills. I have seen cases where just re-imaging does not work, and the admin had no clue how to fix the problem. The company was ready to trash a whole system once because all the admins knew how to do was try to re-image the system, which was not working. If they had more skills or were not in this type of mindset they could have read some documentation and fixed it.
Most companies I worked for don't even have admins anymore, they make the developers do it.
Um...AFAIK Virtualization is really not recommended for database servers.
So I get to keep my arcane knowledge.
A server rebuild won't necessarily fix anything. It could be a good recovery strategy, but when you run into a performance or functionality issue, who is going to be there to find that and fix it? A rebuild won't help you there. No, a good systems administrator probably isn't needed in a "we don't care" one-size-fits-all commodity environment, but you can't expect the same level of service that a skilled professional can provide.
My average Unix (in the past decade, Linux) system uptime between reboots is now 3 to 4 years.
Not surprisingly, most of the reboots are there exactly for installation (aka "rebuild") of an updated OS usually on the next generation of server hardware. Major package upgrades (e.g. MySQL, Apache) almost never require any tinkering with the OS.
I compare that to typical Windows servers in my group, where reboots happen in many cases nightly as a preventative measure, and the system is still some crufty old version of Windows (e.g. Windows NT), the application packages are deeply tied to DLL's and drivers, and I suspect that the statistics and attitudes are apples vs oranges.
And, since you asked, yes, I am looking for another job. (Clueless admins and pointy-haired bosses: a match made in...)
Good luck with your job search. From your description, your current employer is doing it wrong.
We've seen this trend in just about everything in our daily lives as well. Back in the '50s and '60s, Service Stations were just that - SERVICE. There was a knowledgeable individual that would check the oil and such when you gassed-up your car. Today, none of this exists because it's cheaper to pay some pimply-faced kid to man the cash-register vs. paying someone with the knowledge to actually service an automobile. Now before you make the argument that cars are more technologically-advanced nowadays, consider the fact that you're still spending YOUR money on your car - just that the "service" station isn't spending THEIR money on it anymore - your money is going into the up-front cost of the automobile instead of spreading it out over the life of the product in the way of maintenance and upkeep. The same holds true for technology. You pay an engineer to design the system correctly first, and the cost over time goes down because you don't need to pay an engineer to maintain it - the tools available today do a pretty good job of replicating what you paid the engineer to do, at a fraction of the cost. Apply this same logic to fast-food. Nobody at McDonalds really knows how to cook or prepare a meal. All the "engineering" is already done, so you only have to pay the minimum-wage folks to replicate what's already been done. This list goes on and on and includes everything from your car to your house to your meals to the subject at hand. How many of us still know how to perform repairs that need to be done around the house? I'd bet that's a small number, because we paid for a "system" that doesn't need much maintenance, and when it does it's fairly modular (nobody sweats copper anymore - it's all plastic tubing that snaps together. Nobody adds electrical outlets because the house is pre-wired).
How would you get rid of the zombies?
I dunno. It isn't my server. Maybe I mis-understood the OP...
I thought that his "In the meantime, watch and learn as I kill the offending processes" was a reference to those zombie processes that eat his server. I figured he had some method of cleaning them out, and was wondering why that method couldn't simply be automated.
But if that was a reference to something else entirely, and he's got no magic method for killing zombies, then I suppose it makes sense that you'd have to reboot.
"Work is the curse of the drinking classes." -Oscar Wilde
There comes a point where it makes sense to replace a system (OS) or rebuild it. Yes, Windows admins jump on that bandwagon more often - but more often than not, it's more appropriate there than elsewhere, too.
Depending on what the cause is, and the system in question, it makes sense. An early FreeBSD 5 machine with no documentation, highly customized ports, many running services, and so on is likely better to piecemeal out to different machines ('rebuild' it) than it is to disrupt service as you figure out how to get the thing upgraded. Improperly removed programs or registry corruption/errors in Windows often means the same thing.
Ultimately, what it comes down to, is time. How long would a rebuild take, and how much downtime is being accrued from ghosts in the machine? How much is that time worth? It may take a day or two to do a rebuild, but even determining the cause of a peculiar problem can take several or more. By eliminating the software idiosyncrasies of the existing install (and often getting things to the most recent patch version in the process) you've eliminated one possible cause as to the problem: either it works now, or it was hardware/firmware/a driver/etc. that's causing the problem.
Sure, wanton reinstalls aren't a good fix. However, they're often a cost-saving measure, and in many applications, appropriate. The hardware and software on most machines is not worth the time invested to "do it the old Unix way".
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
Not all sysadmins can be above average. By definition, some of them will suck. As more stuff moves onto Linux-based systems, those admins who suck will be working on Linux. Ergo, bad admins will do less-than-ideal work.
I am officially gone from
Virtualisation is definitely a very solid example of the degradation of Admin skills, but isn't the cause in and of itself.
Here is the problem: Poor practices can get into production faster.
Anyone can slap an OS onto a system. Building an OS for an enterprise with certain requirements and demands may involve a lot more work, which of course means time and money up front - a slower time to deployment. Ask a manager if he wants to wait an extra week (or two) to get a server out the door, and the answer will of course be "no!" However, here is where the false economy lies: That generic DVD-install may well be slower, less stable, less reliable, and less secure than the one that was tweaked and properly configured. Time not spent up-front will lead to a less stable environment.
Now when a system blew up before, rebuilding it would take a day or two, unless the admin was able to say "I told you so!" and get his week to set it up properly. Now, with VMs, it takes half an hour to get back into production, so why bother working on it? Who cares if the environment is shitty, unstable, and badly-designed, if it can be rebuilt in bits and pieces in minutes?
The thing is, you WILL be rebuilding it - constantly - and ultimately there's a decent chance that the entire pile of crap will implode on you (or at least run into a dead-end), requiring a complete re-architecture. Of course by that point, the people who pushed for and deployed the entire unsustainable environment will have been promoted to management because of their amazing speed to production, and they encourage the same thing.
In other words, VMs aren't a problem, they're a facilitator for problem behaviour.
UNIX/Linux machines do not magically break without changes. restoring an image will restore that image, of course without the change that broke it. if the change was needed for something, you are still going to have to figure out how to make that change without breaking something.
might i add, 'duh'
What do you think the PHB/management solution is? Ask the DBAs to write a script that will monitor zombie processes, so the sysadmins will be warned in advance... Like, around 20 minutes before the application crashes. Just enough time to tell all users to save their work, because we need to reboot everything. Just like under Windows.
Wait, if you can monitor the zombies, can't you kill them? Just use that 20m window to kill the oldest zombies and you should be ok.
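Since a zombie cannot be killed directly, about the most useful thing a monitoring script can do is report which parent process is failing to reap its children. A minimal sketch of that idea follows; it assumes a ps that accepts `-eo pid=,ppid=,stat=,comm=` (the trailing `=` suppresses the header line), and the function names and sample data are invented for the example, not taken from the setup described in the thread.

```python
import subprocess
from collections import Counter


def zombies_by_parent(ps_output):
    """Count zombie processes per parent PID from 'pid ppid stat comm' lines."""
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 3:
            continue                      # skip blank or malformed lines
        _pid, ppid, stat = fields[:3]
        if stat.startswith("Z"):          # Z marks a zombie (defunct) process
            counts[int(ppid)] += 1
    return counts


def snapshot():
    """Take a live process listing (assumes the ps flags described above)."""
    result = subprocess.run(
        ["ps", "-eo", "pid=,ppid=,stat=,comm="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Made-up listing for illustration: PID 4242 has two unreaped children.
SAMPLE_PS = """\
    1     0 Ss   init
 4242     1 Sl   appserver
 5001  4242 Z    worker
 5002  4242 Z+   worker
 6000     1 S    sshd
 6100     1 Z    helper
"""

if __name__ == "__main__":
    for ppid, count in zombies_by_parent(SAMPLE_PS).most_common():
        print(f"parent {ppid}: {count} zombie(s)")
```

Restarting or fixing the offending parent is then a much smaller intervention than rebooting the whole database server, assuming that parent can be restarted at all.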
Is that where you hire Edward James Olmos to be your sysadmin?
"People who do stupid things with hazardous materials often die." -- Jim Davidson on alt.folklore.urban
I could have been in a conference call with this unnamed company. I pointed out that they had a lot of zombie processes on the database machine. The manager on their side said he was going to ask the DBA, and then returned and told us that the DBA had said "That's not a problem" and that was the end of that investigation.
To add to this, the unnamed big database vendor had removed the product patch from the available downloads on their main site. So it was impossible for us to test with the same version of the database as this unnamed company was using for production. The only reason I can come up with for a big database company to remove a version from the historical download section is that it's so horribly broken that it's dangerous to run on it.
And still they did use it.
When I read your comment, I wonder if the manager actually asked the DBA or if he went out of the room and returned and just told us "That's not a problem".
I admin about 100+ VMs on 14 separate servers. I used to admin about 50 real physical machines. I can tell you that the physical machines had many more quirky, one-off problems that, quite frankly, weren't worthy of further investigation in terms of business cost-effectiveness. They were inevitably reformatted. All virtualization did was to speed up the process so that a new machine can be created in an hour instead of in two days.
As much as it might be intellectually satisfying to dig down into a problem, most systems are there to serve a business, and make money, not to solve the intellectual curiosity of a system admin who vaguely believes that the world is a better place if he/she can just take the time to solve every little system quirk.
You keep them working so the company can make money. That's all. That's the main priority. And thus we grow up.
Please do not read this sig. Thank you.
Point is, you only need one person with actual sysadmin skill to make and maintain an image. Hundreds of point-and-click types can then use that image. It happens in large organizations all the time. Why pay for a hundred skilled, experienced sysadmins when you only need one skilled, experienced sysadmin and 99 paper MCSEs? For many businesses this is an easy decision.
THIS.
It's a problem I run into myself a lot, really, as well. With the rise of virtualization, operating systems have gone from the tool that allows you to maintain your hardware such that it effectively delivers many applications to users to more of a vehicle on top of which single applications sit. But now that vehicle, in turn, rides on top of your virtualization platform, which is basically designed to be as blown out and expendable as possible. While a given piece of hardware effectively delivers the same number of applications to end users, the real "Systems" part of administration is no longer the true integral piece of the puzzle that directly converts "small iron" into "line of business."
Why should a company waste their money on my time spent digging through event logs, flexing the google-fu, and possibly coming up with the answer of "This would take so long to fix that I could probably rebuild the server and reinstall its single application faster than the problem could be resolved manually," when they can get a good enough result by skipping the investigation and just doing that in the first place?
It's extremely unfortunate that it works this way, especially as I feel I learn so much every time I encounter and solve a new problem that's preventing a system from running correctly. While it may be more intellectually stimulating and personally enriching to do things from the "advanced" perspective, on the whole, it usually ends up taking as much as if not more time than just blowing a system out in the event that you've never solved the given type of problem before.
Perhaps I've just got more learning to do, though, I suppose. It might be a different story with Linux! (where, ironically, I've simply reinstalled my test systems many times rather than actually solve problems :P)
That is the rational solution. It's quicker and easier.
The smart admin copies the failing image, re-images, and then installs the copy of the failing image on an offline machine to study.
The Kruger Dunning explains most post on
Big shops like Google can do a simple re-imaging job because they have enough cheap servers that they can just throw a server out if it misbehaves. They know it's not their software, because it runs fine on millions of other computers; if one misbehaves, it's usually the machine going bad. Only when multiple machines start having the same issue do they look into it as a possible bug, fix it, and roll out an update to all their systems.
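The "only look into it when multiple machines show the same issue" rule can be sketched roughly as below. This is a minimal illustration, not anything Google is known to run: the `triage` function, the `(host, symptom)` report format, and the threshold of 2 hosts are all assumptions made up for the example.

```python
from collections import defaultdict

def triage(failures, threshold=2):
    """Split failure reports into hosts to re-image and symptoms to investigate.

    failures: iterable of (host, symptom) tuples (a hypothetical report format).
    A symptom seen on fewer than `threshold` distinct hosts is treated as a
    one-off and its hosts are simply re-imaged; a symptom seen on `threshold`
    or more distinct hosts is flagged as a possible software bug.
    """
    hosts_by_symptom = defaultdict(set)
    for host, symptom in failures:
        hosts_by_symptom[symptom].add(host)

    reimage, investigate = [], []
    for symptom, hosts in sorted(hosts_by_symptom.items()):
        if len(hosts) >= threshold:
            investigate.append(symptom)
        else:
            reimage.extend(sorted(hosts))
    return reimage, investigate
```

With a single box, every symptom stays below the threshold forever, which is roughly the small-shop trap described in the next paragraph: the same problem gets re-imaged over and over and never investigated.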
In a smaller shop usually, there is no space to have multiple downtimes because it will just re-image the same problem over and over again. The sysadmin is also the programmer and the help desk and simply doesn't have time to make a super stable system and usually has to use some 'legacy software' which basically means a custom developed piece of crap that nobody has the source code to. Virtualization has caused some idiot sysadmins to think they have a Google-like infrastructure by using virtualization on one or two boxes as an imitation datacenter while running some unstable software.
A good sysadmin does not have to nuke their server installation from orbit every time something goes wrong. I can understand imaging desktops because users will do some modification that makes it crash but they're never able to tell you exactly what they did. But a server is (or should be) well documented and has only few items that can go wrong. Finding out why your SCSI bus does a reset after a few weeks will be much more advantageous than rebooting or re-imaging it because eventually it will reset the wrong way and you'll end up with a corrupted RAID array.
So there are a couple of philosophies on backing up your systems. If you can tightly control the imaging process and automate it so that it only takes 10-20 minutes, re-imaging may actually be not only a viable solution but an elegant one, especially when dealing with clouds, where instances are essentially newly provisioned images. If you're logging to a centralized system and storing persistent data elsewhere, re-imaging may be OK. However, it doesn't replace engineering (the define/design/implement/test cycle) a good imaging process. If there's a problem across all your machines, you'll obviously need to resolve that in the imaging process. I typically expect imaging processes to come complete with automated application deployment and configuration as well.
I see this as a symptom of greed coupled with ... not necessarily stupidity, but something close. I also agree with several other posters who have said "time" is a major factor in the "just reboot it and get it running again" scenario.
Why do I say greed and stupidity? If a system administrator (and whoever else set up the initial system, if it wasn't just the admin) has done their job correctly, the majority of their time should be taken up doing - technically - nothing. In reality, a good admin will always try to keep their skills up-to-date, learn new skills or methods to help them on the job, and so on. Their normal routine of monitoring the systems and/or network should not take all day (unless the admin is the only one for hundreds of systems) and that leaves them "open" - which, in my opinion, is the correct way for system admins to be. "Open" means they are able to respond to a user's service call if they need to show in-person, they can instantly respond to an issue with a server or the network, they can respond if there is a down router/cable modem/phone system/other component, and above all if they are "open" it means _management_ has not arbitrarily decided "Oh, since you don't appear to be doing anything we're going to assign you task(s) X, Y, and Z -- even though they aren't in your area of expertise."
So, the company assigns them other "work" to do because they don't _appear_ busy (greed) which in turn removes their ability to be "open" to respond in a timely manner (stupidity). There are times when a down system can take someone a while to repair which then means the other "work" added on to the admin doesn't get done and suddenly the admin is getting a bad review -- for work not attached to the job they were hired to do. And so on and so forth.
Basically, if you know less about system administration/network administration than the person you hired to do the job and your systems are running smoothly and efficiently with little to no downtime then fuck off and let them do their job -- even if they don't appear busy.
Dream as if you'll live forever.
Live as if you'll die tomorrow.
~Anonymous~
This is ultimately a failure of good design. Yes, Linux suffers from this as well (All OS's do to one extent or another). Windows has always suffered from this because of the various windows installer packages available to developers. Linux suffers more and more from this because most distributions have a package management system now, which has the same problems as the Windows installers.
If you install an application, either with a package management system (apt, rpm, etc...) or the Windows Installer, there's really no telling what it does to your system... flinging files here and there, modifying configuration files, etc... Yes, you could potentially get the manifest for most package installers on Linux and do some forensics on what it's supposed to be doing. Oftentimes, though, the time it would take to do this far exceeds the time it would take to rebuild the computer. Throw in neophyte users or "system administrators" and this option is completely useless, so again you're back to reinstall. Good luck finding out what an installer has done in Windows.
So the only way to avoid this with the current designs is to go back to a time when it required heavy system knowledge to even install the OS... Obviously, this is not desirable nor is it going to happen. It's a product of our times, man. We are stuck with it, until someone comes up with a better design solution than we've currently got in the Linux and Windows world.
I think the author is missing the point of modern systems administration. I wonder what the average number of servers a system administrator manages is today, versus ten years ago? I would guess it has increased by a factor of around 10, particularly with the rise of 1U commodity servers, virtualization, etc. Sysadmins just don't have the time to treat our OS like a zen garden. The OS, especially with modern *nix, has become a kind of commodity, while the bulk of system admin work has moved to higher levels of application management, systems integration, etc.
This is where I think the author fails most prominently, by implying that sysadmins who simply re-image (a claim that is a straw man) are somehow not as sophisticated and nuanced. Consider instead that they may be working at a higher, more complex level. This whole argument reminds me of the old debates System V admins would have with the rising Linux admins: this notion that package management was for weenies who don't "understand" the intricacies of dependency resolution. I remember incredibly excruciating debates where these folks would insist that spending hours resolving dependency hell was "good" for the craft because, after all, you should know and configure every last component on your system! God forbid it is done automatically for you, with literally tens of packages being installed with somewhat perfunctory knowledge, so that you could move onwards to accomplish the actual task at hand.
Sorry, sysadmins don't have time for nostalgia. Save the sob stories of a bygone era for an industry that isn't based on constant change.
researching the problem is not billable activity.
Why don't they just kill the zombie processes they find?
I mean, even that is less than optimal, but making people log off? WTF?
You know, there is an opportunity there. Get into a position where you can afford to be out of work a couple of months. Then become a hard-ass flag bearer for the technology. Go past your manager; make an appointment with the CIO or CEO. Be forceful. When you get the meeting, have a 3-4 page doc with the costs and time savings as well as the risks. High level. Tell them their systems are going to crash and cost them a lot of money.
If you have a share, go to the shareholder meeting and talk about it.
This has two outcomes:
Let go
Big ass promotion.
If you are let go, so what? Use what you were doing to help get you into a decision-making position at another company.
If you get the promotion, you get to drive the change and begin changing how management handles technology.
A company can be a master of its own destiny, or it can let its technology master its destiny.
In the end, no matter what happens, you're going to have some good stories to tell.
I think you've misread the parent. Basically, you can't trust the vendors to have actually solved the database problems as they'll say anything for their "fat commission checks" on any upgrades.
From what I read, there actually isn't a "Windows admins are dumb, *nix admins are smart" pattern. It's more about how mainstream you are, and thus how many potential admin candidates you attract (or more importantly, how many open jobs there are).
Face it, more often than not, a Linux server in the workplace means it's running on top of a virtual machine; your server is one of dozens that were popped up (like popcorn), are rarely maintained, and receive little ongoing administration, but magically can be reset to the last snapshot on a whim. The valued skills have changed. Personally, I miss the days of having only a few servers that cost an arm and a leg and you milked them for everything they'd give you.
There have been several very good arguments about virt-servers and re-imaging vs spending time to re-configure and/or fix the issue without rebooting. There are a couple of realities that apply to proper Wintel and *NIX type administration when imaging is used. Most organizations that do imaging as a normal recovery procedure do not take some of the realities below into consideration:
1) If the server image has been built properly with all services working and tested appropriately, it is completely normal to "save your work" for a rainy day. Most experienced sysadmins would agree that certain server and application settings are better set up once, as the installation and package management can be very painful for very customized apps. Having a good copy to re-image saves having to re-invent the wheel. HOWEVER, this argument does not hold if the original server build has not been properly implemented and tested. Image management is very, very important. In my experience most experienced *NIX sysadmins have a good grasp of where virtualized/hosted *NIX systems can save cost and time to deploy.
2) Most *NIX and Wintel devices can be hacked or have junior sysadmins (read: less competent people) make "changes" that can cause issues. Yes, sudo and other access mechanisms can mitigate the problem, but the reality is that stuff gets &^%#ed up either intentionally or otherwise. Rather than trying to do a historical review of what caused the problem and playing the "blame game" while the system is down, do what was suggested in other posts: take an image copy of the broken system, re-image the production system, and then play the post-failure investigation/blame game with the comfort of knowing the system is back on-line and functioning. Senior management is much less likely to fire someone if the outage time is minimal and the error in the old system is an "honest mistake". The longer a system is down, the longer a whole systems support team can face a lot of criticism from senior managers.
3) Most people that make images do not test them enough to call them production images. These production images are sometimes buggier than the existing production problem. It is very difficult for the image creator to test their own work. It is more appropriate for a team of two to work in tandem. One person to create the image and another knowledgeable team member to test and report issues; which get corrected before preserving the image. The art of proper image and backup/restore management is truly under-rated.
4) Images have to be updated on a regular basis to account for system, performance and most importantly security patches and changes. An image created 8 months ago for a server will not likely be very useful if it has security holes discovered a year ago.
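Point 4 lends itself to a simple automated check. The following is only a sketch under assumptions: the `missing_patches` function, the image build date, and the patch names and dates are all hypothetical, and a real shop would pull this data from its own patch-tracking system.

```python
from datetime import date

def missing_patches(image_built, patches):
    """Return the names of patches released after the image was built.

    image_built: datetime.date the image was captured.
    patches: dict mapping patch name -> release date (datetime.date).
    Anything released later than the build date cannot be in the image,
    so a non-empty result means the image needs a rebuild before use.
    """
    return sorted(name for name, released in patches.items()
                  if released > image_built)

# Hypothetical data for illustration only.
image_built = date(2010, 7, 1)
patches = {
    "kernel-fix": date(2010, 5, 20),   # already in the image
    "openssl-fix": date(2010, 11, 3),  # released after the image was built
}
stale = missing_patches(image_built, patches)
```

Running a check like this on a schedule would at least flag an old image before someone restores it into production.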
I think what a lot of people fail to comprehend is that there is a radical shift from 'server' to 'service'. In the 'olden days' you would get a Unix 'server', and that machine would handle many tasks and would typically be the core of your entire computing system. Sometimes you would get a second server for redundancy, and as you scaled up you would get many servers to handle your workload but would still have some 'core' servers running most services, and you would separate out the heavy services like the db or directory services etc.
The change is that with VMs you will have a few servers that require very little configuration, as they just host VMs, and what they host is functionally 'services', not servers: small, special- or specific-use servers which each provide a single service. You run a few of these VMs redundantly, able to live-migrate between the real hardware. Now when one of these 'services' goes down, you simply redeploy a new one from the template.
Companies don't want it good, they want it cheap. Training...read a book. Test hardware and software that a sysadmin can break and fix, sorry too expensive, can't have it.
Just a week or so ago I wanted to take a resource I controlled and partition it into two separate resources on rare occasions. Clueless management wouldn't allow me to do so. Reason, in their opinion it didn't make sense to do so. Beat head against the nearest wall.
Some of the comments here remind me of a post on a woodworking board a few months back. Essentially, the poster was lamenting because he had to fire a guy because he couldn't afford to keep him... Not because of the economy, but because the guy was an absolutely inflexible perfectionist. He'd spend $300 worth of time on what should have been a $60 job... The guy was a hell of a woodworker, at home in his own shop, but just couldn't adapt to a production environment.
This isn't about Windows vs. Unix. This is about admins not understanding their job is to get production rolling again, not to satisfy their obsessive need to understand every problem or their need to satisfy their ego. ("I'm a UNIX admin dammit, I refuse to use habits that make me look like a Windows admin" or its equivalent is a refrain modded up again and again here on Slashdot.) If a reboot or a re-imaging fixes the problem, that's the right solution. If it doesn't, *then* you dig deeper.
Oh, and re-installing the machine means 24h of downtime".
Did I mention the application is considered mission-critical and runs 24x7? And that downtime can cost more than 6 figures to said (nameless) company?
And those are two very good reasons to not nuke/pave (although I would trust you realize that this is a bit of a non-standard situation).
In this situation, I'd expect to see a backup on hot standby - something that you fail-over to while you troubleshoot the main.
Also, unlike normal processes, the kill command has no effect on a zombie process.
http://en.wikipedia.org/wiki/Zombie_process
As a former Windows SMS Administrator and Desktop Engineer, I found imaging a vital tool for desktop break/fix. That's provided you have an image that can be updated and software packaged for easy profile install. On one-off problems, you can re-ghost and re-install packaged software and have the user back in business within an hour. Troubleshooting was reserved for company-wide issues. My motto used to be: when in doubt, ghost!
http://yro.slashdot.org/comments.pl?sid=2015772&cid=35358632
LMAO!
APK
P.S.=> LOL, just "too, Too, TOO EASY... just '2EZ'"... apk
I just hope that none of these people are administering any system with any of my personal data on them. This is only marginally better than my personal data being held on a Windows, or even worse, a MacOS X system.
I wouldn't make the analogy of driving a car vs car maintenance, being in any way similar to server maintenance. Clearly, driving a car is something that one might expect any average peasant to be able to achieve, whereas, one may expect said peasant to become hopelessly and helplessly confused, when confronted with a UNIX terminal and a keyboard. It would be reasonable to expect any employer to exercise due diligence when recruiting competent employees.
If I ever discovered that my personal details had been leaked or lost, as a result of the negligent recruiting practices of some incompetent halfwit manager, I would most definitely be seeking legal recourse.
This is most certainly not about getting a production system 'back up' in short time. If this is the aim, the system was designed without sufficient redundancy, and again, the competence of those responsible for its management would be seriously questionable.
Gawd, the stories I could tell...
We've seen a bunch of memory pressure issues. An app starts gobbling up memory and eventually other procs start getting killed. At some point, even ssh is killed so that this app can continue. We end up having to reboot the server to recover.
In every other organization we can set per-process caps on memory utilization. Not so here. The DBAs demanded that their app/db run unlimited.
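For readers who have not set one up, a per-process memory cap can be sketched with Python's standard `resource` module on Linux. This is only one mechanism (cgroups are another), and nothing here reflects how the poster's organization does it: the `run_capped` helper and the 1 GiB figure are assumptions for illustration.

```python
import resource
import subprocess
import sys

def run_capped(cmd, max_bytes):
    """Run cmd as a child process whose address space is capped at max_bytes."""
    def apply_cap():
        # Runs in the child between fork and exec, so only the child
        # (and anything it spawns) inherits the limit.
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
    return subprocess.run(cmd, preexec_fn=apply_cap,
                          capture_output=True, text=True)

# Hypothetical cap: 1 GiB of address space.
cap = 1024 ** 3
probe = "import resource; print(resource.getrlimit(resource.RLIMIT_AS)[0])"
result = run_capped([sys.executable, "-c", probe], cap)
```

Here the child just reports the limit it sees. A child that tried to allocate beyond the cap would have its allocation refused rather than pushing the rest of the box, ssh included, into the OOM killer.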
We have the security department telling us what we can and cannot run. This leads to bizarre requirements such as ntpd and inetd cannot be enabled,
We recently had an application upgrade to a website. The network resources ballooned six times. Processor utilization jumped 500%. The devs are convinced that it's a hardware error. Or a network problem. Or an OS configuration issue. The latter was interesting. It was a thread-limited app so was using only 4 of 8 processors. We showed them the graphs, but they are dead set on the idea that some magical OS tuning will unlock their apps to work on all cores.
Decline of sysadmin skills? Hardly. We're just sick of the bullshit.
In what way is this "news"? It's like the 3rd time this guy's blog was linked to in the last week or two. A few paragraphs of opinion. Are there any anti-blog tech sites, especially ones where the latest "products" aren't advertised in the form of articles?
Why not write a script to kill zombies? That's what I did and the server never crashes...
Their praise would, of course, grieve me beyond measure.
Reinstalling the OS is almost never the right move. Any time people have suggested it to me (I've been working on Linux and Unix for a long time), it's usually a stab in the dark. If you hear me suggest it, it's because the OS is corrupted.
Unix is highly observable with tools like strace, truss (or even dtrace if you're lucky enough to have that).... it's difficult for me to imagine a scenario where re-imaging is better than finding the error and fixing it.
And for those people defending their intellectually lazy response... you have a lot to learn about Unix and troubleshooting.
If I suggested wiping an OpenVMS server to correct a problem, I would be laughed at... at best.
Never trust Wikipedia. You can absolutely kill zombies either with kill -9, or rm -rf on the process tree for that process (e.g., for pid 666, "rm -rf /proc/666").
God invented whiskey so the Irish would not rule the world.
This has been my SOP since 1978 when VMS became available!!!
Remember the TV repairman (if you are old enough)? A dying (well, by now probably completely dead) breed. When the TV went on the fritz, he (or she in a rare few cases) would diagnose the problem and apply a fix. Usually it was just a tube that needed replacement, but sometimes a capacitor. Occasionally something would have burned out, like a resistor. As transistor TVs came along, the failures went down, but not to zero. Transistors could die, too, and were harder to replace. And they were more susceptible to lightning surges from the antenna (something that back then got TV signals for free).
Nowadays, if a TV goes bad, we just junk the whole thing and get a new one. If it was in warranty, we might get the new one for free. Too often it would die just 3 days after the warranty expired. Just 3 years ago I had a relatively new TV (a digital one, with a VGA input, too) go bad. I could tell it was the power supply delivering unstable or low voltages after it warmed up. Fortunately, it was in warranty. So it was shipped to the manufacturer. About a week later a box came back with a replacement. This was not the one I sent in, though it was the same model. At least it worked (and has ever since). But I still wonder if someone replaced the power supply in the one I sent in. And I wonder if someone replaced the component inside that power supply that caused it to fail or if they scrapped the whole thing.
So why should failing software be any different? As a system administrator myself, I do like to at least find out what failed. But being practical, I also quota the time I spend on "failure forensics". If I can't figure it out in a few minutes for first-time problems, I just reboot. If the problem happens again, then I justify more effort. If it never happens again, I never even think about it anymore. While I love a good diagnostic challenge, it just doesn't make business sense to put much effort into that (unless it's something we design and manufacture).
now we need to go OSS in diesel cars
You have a limited amount of time in a day to deal with shit and you need to prioritize if you want it all done. Dealing with problems can take far too long sometimes and a reinstall is just faster (and cleaner).
For example: I work at a university and we have a "kind of managed" environment meaning you get things like professors who have laptops that they have admin on. They get viruses and spyware, of course, since they don't pay attention. Our normal strategy is to run automated tools and if they can't clean it up, reinstall. Why? Because it takes less time. Installing Windows 7 takes all of 40-60 minutes and I've modified the image to include the most common apps you need. Usually one of our students can have a system reinstalled and running in a couple hours.
However, cleaning it up? That can take days. I can do it; given time I can track down all of the stuff and eliminate it. However, some of this spyware is extremely problematic. It has watcher processes everywhere, sets itself up in all kinds of locations, and so on. Also, it isn't like they get an infection; they get tons and then finally bring it in. So it is a painstaking process of looking for shit, disabling it, checking to see if it stays gone, cleaning up problems (things like when it modifies the hosts file or LSPs or executable handler), and so on until everything is clean and works right. This also isn't something our students are good at; it is fairly complex and takes some experience, so I (or another staff member) have to do it.
It is just not worth the time. We end up having to do it sometimes because professors just refuse a reinstall but it is a huge waste. We can backup data, and reinstall, in far less time. That guarantees all the shit is gone (a manual method always leaves room for doubt, I could make a mistake).
Remember that with troubleshooting the objective is to fix the problem. It isn't to prove you are a tough guy; it is to make things work. So you need to determine the most efficient manner to do that, and the manner that results in the least downtime. What that is varies. For our LDAP server? No, a reinstall would not be the best idea (in most cases). However, for a client system? Often it is.
... I've ever resorted to rebuilding a UNIX system from scratch was a system that I inherited from a previous admin when I took over his job. The broken system was a member of a cluster and, after running checks of all the files on both members, I could not figure out just which files had gotten corrupted and were preventing the system from believing it was a member of the cluster. Luckily, support for the OS version was about to be sunsetted and it made sense to reinstall the OS on both members. This was about a dozen years ago. Except for that one instance of doing a reinstallation, I haven't resorted to that means of solving a UNIX problem. Ever.
System disk failures are another story. I have had to do a couple of those on Linux systems when the system disk failed. That's over the 15-16 years I've been running Linux.
So that's three UNIX/Linux reinstallations over more than a couple of decades. I know Windows admins who've done that many reinstalls in a week.
CUR ALLOC 20195.....5804M
I'm not sure if you fully understand what a "zombie" process is. A zombie process is a process that has ended, but the parent process (high MISSION CRITICAL application) has not "closed" the process yet. Killing the process tree would take down the mission-critical app. rm'ing the process would allow a new process (possibly not owned by said app) to start and if the app tried to eventually close or even check that process, it would segfault the entire app WITHOUT that 10-15 minutes notice.
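The behaviour is easy to check on a Linux box. The sketch below is an illustration only (the `proc_state` and `zombie_demo` names are made up, and it assumes a Linux `/proc` filesystem): it forks a child that exits immediately, deliberately delays reaping it, sends it SIGKILL, and only then has the parent collect it with `waitpid`.

```python
import os
import signal
import time

def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the pid is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # Format is "pid (comm) state ppid ..."; comm may itself contain ")".
    return stat.rpartition(")")[2].split()[0]

def zombie_demo(timeout=5.0):
    pid = os.fork()
    if pid == 0:
        os._exit(0)                # child: end immediately
    # Parent: wait (without reaping) until the child shows up as a zombie.
    deadline = time.monotonic() + timeout
    while proc_state(pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
    before = proc_state(pid)
    os.kill(pid, signal.SIGKILL)   # accepted, but there is nothing left to kill
    time.sleep(0.05)
    after_kill = proc_state(pid)
    os.waitpid(pid, 0)             # the parent "closes" the child
    after_wait = proc_state(pid)
    return before, after_kill, after_wait
```

On a typical Linux system the entry should stay in state `Z` after the SIGKILL and disappear only once the parent calls `waitpid`, which is why the fix has to come from the parent application (or from the parent going away so that init adopts and reaps the child).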
To all the people against open source (probably few in this crowd), this is a PRIME example of why closed-source is bad. Even if this guy's company was not allowed to redistribute the software (like a normal software license), had they been given the code, he probably could have fixed the bug in a fraction of the time he's spent dealing with it. And the next time the system was forced to reboot, BOOM, throw in the fixed binaries!
It takes maybe 5 minutes to provision a new VM complete with OS and default config/apps/whatever.
If I had a system that was as essential as what you describe, I'd have a base image of it stored and ready to go. Just bring up the new image, migrate the data, and make it live. That's what we do with all of our truly essential systems. And we can be running off a new image within about 30 minutes if we're able to migrate data off the old system.
Moreso, just run several concurrent instances, with a health check to kill off
ones which are showing fatigue. Respawn when everything stabilizes. You
can also use this same process to scale up and down the resources. Smooth
out that end of day (whichever timezone) rush to do work.
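A rough sketch of that kill-and-respawn loop, with the same pass also handling scale up and down. Everything here is hypothetical: `reconcile`, the `is_healthy` callback, and the `spawn` callback stand in for whatever health check and provisioning call a given virtualization platform offers.

```python
import itertools

def reconcile(pool, is_healthy, spawn, desired):
    """Return (new_pool, retired) after one health-check pass.

    Unhealthy instances are retired, surplus healthy ones are retired when
    scaling down, and fresh instances are spawned until `desired` are running.
    """
    healthy, retired = [], []
    for inst in pool:
        (healthy if is_healthy(inst) else retired).append(inst)

    retired.extend(healthy[desired:])   # scale down: drop the surplus
    healthy = healthy[:desired]
    while len(healthy) < desired:       # scale up or replace the fatigued
        healthy.append(spawn())
    return healthy, retired

# Hypothetical usage: vm2 fails its health check and is replaced by vm4.
counter = itertools.count(4)
spawn = lambda: f"vm{next(counter)}"
pool, retired = reconcile(["vm1", "vm2", "vm3"], lambda i: i != "vm2", spawn, 3)
```

Raising `desired` ahead of the end-of-day rush and lowering it afterwards is the same call with a different number.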
-@|
I have administered Linux, Windows and Mac servers, and from reading the comments above I must agree (in part at least) with everyone. I think a balance needs to be struck between downtime and finding the cause of the problem. The cause of the problem is certainly important, but so is availability! Different situations require different solutions. Personally I always prefer to get things working as soon as possible, but while troubleshooting the problem I take steps (such as backing up logs) so that once the issue has been fixed it is possible to find out what went wrong (unless you find out along the way). If you can get the system back up in a reasonable time and tell management why/how/what happened, then this is the best situation for everyone. With Windows this is more difficult, and you are a lot more likely to encounter a problem which seems to have no reason (I know this does not really happen, but it appears that way because of the rubbish logging that Windows produces). So I guess what I am saying is that a competent systems administrator will know how to react in certain situations, and these systems administrators are nearly always worth the money (unless, of course, the company's systems aren't that important).
About the imaging route, I've always disliked this solution. It requires you to have a spare machine of exactly the same spec. I prefer to set up an automated install (over tftp) for all of the different types of machines I administer. With the Unix-based OSes I then use puppet to configure them fully (the client should be available for Windows soon also), so if a reinstall is needed I could use a completely different set of hardware (if needed) and get things up and running in a short amount of time.
This takes time to get set up initially but it means that anyone who can figure out how to PXE boot a machine on the network can effectively start an install for any type of machine using only 5 minutes of their time and the systems administrator is more free to figure out ways to increase efficiency of the systems and network (every system can always be improved). I do not think this is the end of systems administrators, it just means that we need to up our game.