Slashdot Mirror


Ask Slashdot: Unattended Maintenance Windows?

grahamsaa writes: Like many others in IT, I sometimes have to do server maintenance at unfortunate times. 6AM is the norm for us, but in some cases we're expected to do it as early as 2AM, which isn't exactly optimal. I understand that critical services can't be taken down during business hours, and most of our products are used 24 hours a day, but for some things it seems like it would be possible to automate maintenance (and downtime).

I have a maintenance window at about 5AM tomorrow. It's fairly simple — upgrade CentOS, remove a package, install a package, reboot. Downtime shouldn't be more than 5 minutes. While I don't think it would be wise to automate this window, I think with sufficient testing we might be able to automate future maintenance windows so I or someone else can sleep in. Aside from the benefit of getting a bit more sleep, automating this kind of thing means that it can be written, reviewed and tested well in advance. Of course, if something goes horribly wrong having a live body keeping watch is probably helpful. That said, we do have people on call 24/7 and they could probably respond capably in an emergency. Have any of you tried to do something like this? What's your experience been like?

25 of 265 comments

  1. Puppet. by Anonymous Coward · · Score: 4, Informative

    Learn and use Puppet.

    1. Re:Puppet. by bwhaley · · Score: 3, Interesting

      Puppet is a great tool for automation but does not solve problems like patching and rebooting systems without downtime.

      --
      "I either want less corruption, or more chance
      to participate in it." -- Ashleigh Brilliant
    2. Re:Puppet. by sjames · · Score: 3, Informative

      How, exactly, do you snapshot and test the production VM before the maintenance window and guarantee you won't affect (and by "affect", I mean anything that changes behavior in any way that is not expected by the users) any services running on that VM?

      Clone it. Upgrade the clone and make sure it works. If so, wipe the clone, snapshot the production VM and upgrade it. If it fails, roll back. Make sure your infrastructure is set up so the clone CAN be properly tested. Yes, sometimes you will have to do that rollback, but with an adequate test setup, frequently you won't.
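
      A minimal sketch of that sequence, assuming your hypervisor tooling can be wrapped in plain callables. The names clone, destroy, snapshot, rollback, upgrade and health_check are placeholders for whatever your platform provides, not a real API:

```python
from typing import Callable


def guarded_upgrade(
    prod: str,
    clone: Callable[[str], str],
    destroy: Callable[[str], None],
    snapshot: Callable[[str], str],
    rollback: Callable[[str, str], None],
    upgrade: Callable[[str], None],
    health_check: Callable[[str], bool],
) -> str:
    """Return 'upgraded', 'clone-failed' or 'rolled-back'."""
    # Rehearse on a throwaway clone first; production is untouched so far.
    test_vm = clone(prod)
    try:
        upgrade(test_vm)
        rehearsal_ok = health_check(test_vm)
    except Exception:
        rehearsal_ok = False
    finally:
        destroy(test_vm)  # wipe the clone whether or not the rehearsal passed
    if not rehearsal_ok:
        return "clone-failed"

    # Rehearsal passed: snapshot production, then upgrade it.
    snap = snapshot(prod)
    try:
        upgrade(prod)
        prod_ok = health_check(prod)
    except Exception:
        prod_ok = False
    if not prod_ok:
        rollback(prod, snap)
        return "rolled-back"
    return "upgraded"
```

      The useful property is that a failed rehearsal never reaches production, and a failed production upgrade always ends in a rollback. How good health_check is decides how much any of this is worth.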

    3. Re:Puppet. by dnavid · · Score: 3, Interesting

      So it's someone else's fault your test environment doesn't match production?

      People often fail to try hard enough to make the test environment (assuming they even have one) match the production environment, but for some problems test never matches production, and essentially never can: some problems only reveal themselves under production *conditions*. For example, I recently spent a significant amount of time troubleshooting a kernel bug that only arose under a very specific (and still not fully characterized) set of disk loads. Test loads, including some several times higher than the production load, did not uncover the bug, which caused kernel faults; the faults randomly started occurring about a week after the software patch went live.

      You should try to keep test as close as possible to production so testing on it has any validity at all, but you should never assume that testing on the test environment *guarantees* success on production. It's for that reason that, responding to the OP, I have never attempted to do any serious production upgrades in an automated and unattended fashion, and not while I'm alive will any such thing happen on any system I have authority over. As far as I'm concerned, if you decide to automate and go to sleep, make sure your resume is up to date before you do because you might not have a job when you wake up, if you guess wrong.

      Even if you guess right, I might decide to fire you anyway if anyone working for me decided to do that without authorization.

  2. And if it doesn't work? by Anonymous Coward · · Score: 5, Insightful

    Support for off-hour work is part of the job. Don't like it? Find another job where you don't have to do that. Can't find another job? Improve yourself so you can.

  3. Murphy says no. by wbr1 · · Score: 5, Insightful

    You should always have a competent tech on hand for maintenance tasks. Period. If you do not, Murphy will bite you, and then, instead of having it back up by peak hours you are scrambling and looking dumb. In your current scenario, say the patch unexpectedly breaks another critical function of the server. It happens, if you have been in IT any time you have seen it happen. Bite the bullet and have a tech on hand to roll back the patch. Give them time off at another point, or pay them extra for night hours, but thems the breaks when dealing with critical services.

    --
    Silence is a state of mime.
    1. Re: Murphy says no. by CanHasDIY · · Score: 4, Insightful

      This guy probably is the tech but is wanting to spend more time with his family or something.

      Probably settled down too fast and can't get a better job now. My advice: don't settle down and quit using your wife and children as excuses for your career failures because they'll grow to hate you for it.

      OR, if you want to have a family life, don't take a job that requires you to do stuff that's not family-life-oriented.

      That's the route I've taken - no on-call phone, no midnight maintenance, no work-80-hours-get-paid-for-40 bullshit. Pay doesn't seem that great, until you factor in the wage dilution of those guys working more hours than they get paid for. Turns out, hour-for-hour I make just as much as a lot of the managers around here, and don't have to deal with half the crap they do.

      The rivers sure have been nice this year... and the barbecues, the lazy evenings relaxing on the porch, the weekends to myself... yea. I dig it.

      --
      An enigma, wrapped in a riddle, shrouded in bacon and cheese
    2. Re:Murphy says no. by bwhaley · · Score: 5, Interesting

      The right answer to this is to have redundant systems so you can do the work during the day without impacting business operations.

      --
      "I either want less corruption, or more chance
      to participate in it." -- Ashleigh Brilliant
    3. Re: Murphy says no. by PvtVoid · · Score: 5, Funny

      This guy probably is the tech but is wanting to spend more time with his family or something.

      Probably settled down too fast and can't get a better job now. My advice: don't settle down and quit using your wife and children as excuses for your career failures because they'll grow to hate you for it.

      Congratulations! You're management material!

    4. Re:Murphy says no. by David_Hart · · Score: 4, Informative

      Here is what I have done in the past with network gear:

      1. Make sure that you have a test environment that is as close to your production environment as possible. In the case of network gear, I test on the exact same switches with the exact same firmware and configuration. For servers, VMWare is your friend....

      2. Build your script, test, and document the process as many times as necessary to ensure that there are no gotchas. This is easier for network gear as there are fewer prompts and options.

      3. Build in a backup job in your script, schedule a backup with enough time to complete before your script runs, or make your script dependent on the backup job completing successfully. A good backup is your friend. Make a local backup if you have the space.

      4. Schedule your job.

      5. Get up and check that the job completed successfully, either when the job is scheduled to be completed or before the first user is expected to start using the system. Leave enough time to perform a restore, if necessary.

      As you can probably tell, doing this in an automated fashion would take more time and effort than baby sitting the process yourself. However, it is worth it if you can apply the same process to a bunch of systems (i.e. you have a bunch of UNIX boxes on the same version and you want to upgrade them all). In our environment we have a large number of switches, etc. that are all on the same version. Automation is pretty much the only option given our scope.
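
      Step 3 (making the script dependent on the backup) could be sketched roughly as below. This is only an illustration: the backup script path in the usage note is hypothetical, and the run parameter exists so the logic can be exercised without touching a real system.

```python
import subprocess
from typing import Callable, Sequence


def run_step(cmd: Sequence[str]) -> bool:
    """Run one command; True only if it exits with status 0."""
    return subprocess.run(list(cmd)).returncode == 0


def maintenance(
    backup_cmd: Sequence[str],
    maintenance_cmds: Sequence[Sequence[str]],
    run: Callable[[Sequence[str]], bool] = run_step,
) -> str:
    """Run the backup first and refuse to change anything if it fails."""
    if not run(backup_cmd):
        return "aborted: backup failed"
    for cmd in maintenance_cmds:
        if not run(cmd):
            # Stop at the first failure so the morning check shows where it broke.
            return "failed at: " + " ".join(cmd)
    return "ok"
```

      A scheduled job might call maintenance(["/usr/local/bin/backup.sh"], [["yum", "-y", "update"]]) and mail the returned string to whoever does the check in step 5.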

    5. Re: Murphy says no. by smash · · Score: 4, Insightful

      This is why you build a test environment. VLANs, virtualization, SAN snapshots. There's no real excuse. Articulate the risks that a lack of a test environment entails for the business, and ask them if they want you doing shit without being able to test to see if it breaks things. Do some actual calculations on cost of system failure, and explain to them ways in which it can be mitigated. Putting your head in the sand and just breaking shit in live... well, that's one way to do it, but I fucking guarantee you: it WILL bite you in the ass, hard, one day, whether it is automated or not. If you have a test environment, you can automate the shit out of your process, TEST it, and TEST a backout plan before going live.

      --
      I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
    6. Re:Murphy says no. by Zenin · · Score: 3, Insightful

      In general, don't do anything that isn't your core business. Or another way of saying it, Do What Only You Can Do.

      If you are an insurance company, is building and maintaining hardware your business? No, not in the slightest. You have no more business maintaining computer hardware than you have maintaining printing presses to print your own claims forms.

      Maintaining hardware and the rest of the infrastructure stack however, is the business of Amazon AWS, Windows Azure, etc. The "fantasy" you're referring to is the crazy idea that you, as some kind of God SysAdmin, can out-perform the world's top infrastructure providers at maintaining infrastructure. Even if you were the best SysAdmin alive on the planet, you can't scale very far.

      Sure, any of those providers can (and do, frequently) fail. Still, they are better than you can ever hope to be, especially once you scale past a handful of servers. If you are concerned that they still fail, that's good, yet it's still a problem worst addressed by taking the hardware in house. A much better solution is to build your deployments to be cloud vendor agnostic: Be able to run on AWS or Azure (or both, and maybe a few other friends too) either all the time by default or at the flip of a (frequently tested) switch.

      Even building in multi-cloud redundancy is far easier, cheaper, and more reliable than you could ever hope to build from scratch on your own. That's just the reality of modern computing.

      There are reasons to build on premises still, but they are few and far between. Especially now that cloud providers are becoming PCI, SOX, and even HIPAA capable and certified.

      --
      My /. uid is better then your /. uid
  4. I've toyed with this concept.. by grasshoppa · · Score: 5, Interesting

    ...and while I'm reasonably sure I can execute automated maintenance windows with little to no impact to business operations, I'm not certain. So I don't do it.

    If there were more at stake, if the risk vs benefits were tipped more in my company's favor, I might test implement it. But just to catch an extra hour or two of sleep? Not worth it; I want a warm body watching the process in case it goes sideways. 9 times out of 10, that warm body is me.

    --
    Mod me down with all of your hatred and your journey towards the dark side will be complete!
    1. Re:I've toyed with this concept.. by mlts · · Score: 3, Insightful

      Even on fairly simple things (yum updates from mirrors, AIX PTFs, Solaris patches, or Windows patches released from WSUS), I like babysitting the job.

      There is a lot that can happen. A backup can fail, then the update can fail. Something relatively simple can go ka-boom. A kernel update doesn't "take" and the box falls back to the wrong kernel.

      Even something as stupid as having a bootable CD in the drive and the server deciding it wants to run the OS from that rather than from the FCA or onboard drives. Being physically there so one can rectify that mistake is a lot easier when planned as opposed to having to get up and drive to work at a moment's notice... and by that time, someone else likely has discovered it and is sending scathing E-mails to you, CC: 5 tiers of management.
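
      For the kernel case specifically, the babysitting check can at least be written down. A small sketch, assuming you know the release string the update was meant to install (the version in the usage note is made up):

```python
import platform
from typing import Optional


def kernel_took(expected_release: str, running_release: Optional[str] = None) -> bool:
    """True if the box is running the kernel release the update was meant to install."""
    if running_release is None:
        # On Linux this is the same string `uname -r` prints.
        running_release = platform.release()
    return running_release == expected_release
```

      After the reboot, something like kernel_took("3.10.0-123.el7.x86_64") returning False is the cue to look at the bootloader before anyone else notices. It does nothing for the box that never came back up at all.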

  5. Automated troubleshooting? by HBI · · Score: 5, Insightful

    Maintenance windows are at off-hours to accommodate real work happening. If every action was painless and produced the desired result, you could do it over lunch or something like that. But that's not the real world.

    This begs the question of how the hell are you going to fix unexpected problems in an automated fashion? The answer is, you aren't. Therefore, you have to be up at 2am.

    --
    HBI's Law: Frequency of calling others Nazis is directly correlated with the likelihood of the accuser being Communist.
  6. Attended automation by Anonymous Coward · · Score: 3, Interesting

    Attended automation is the way to go. You gain all the advantages of documentation, testing etc. If the automation goes smoothly, you only have to watch it for 5 mins. If it doesn't, then you can fix it immediately.

  7. Offshore by pr0nbot · · Score: 4, Insightful

    Offshore your maintenance jobs to someone in the correct timezone!

  8. Sounds like a bad idea ... by gstoddart · · Score: 4, Insightful

    You don't monitor maintenance windows for when everything goes well and is all boring. You monitor them for when things go all to hell and someone needs to correct it.

    In any organization I've worked in, if you suggested that, you'd be more or less told "too damned bad, this is what we do".

    I'm sure your business users would love to know that you're leaving it to run unattended and hoping it works. No, wait, I'm pretty sure they wouldn't.

    I know lots of people who work off hours shifts to cover maintenance windows. My advice to you: suck it up, princess, that's part of the job.

    This just sounds like risk taking in the name of being lazy.

    --
    Lost at C:>. Found at C.
  9. This is why you need.. by arse+maker · · Score: 3, Insightful

    Load balanced or mirrored systems. You can upgrade part of it any time, validate it, then swap it over to the live system when you are happy.

    Having someone with little or no sleep doing critical updates is not really the best strategy.

    1. Re:This is why you need.. by Shoten · · Score: 5, Insightful

      Load balanced or mirrored systems. You can upgrade part of it any time, validate it, then swap it over to the live system when you are happy.

      Having someone with little or no sleep doing critical updates is not really the best strategy.

      First off, you can't mirror everything. Lots of infrastructure and applications are either prohibitively expensive to do in a High Availability (HA) configuration or don't support one. Go around a data center and look at all the Oracle database instances that are single-instance...that's because Oracle rapes you on licensing, and sometimes it's not worth the cost to have a failover just to reach a shorter RTO target that isn't needed by the business in the first place. As for load balancing, it normally doesn't do what you think it does...with virtual machine farms, sure, you can have N+X configurations and take machines offline for maintenance. But for most load balancing, the machines operate as a single entity...maintenance on one requires taking them all down because that's how the balancing logic works and/or because load has grown to require all of the systems online to prevent an outage. So HA is the only thing that actually supports the kind of maintenance activity you propose.
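
      The "load has grown to require all of the systems online" case is simple arithmetic. A rough sketch, assuming identical nodes and a known peak load in the same units (both are simplifications):

```python
def can_take_offline(
    node_count: int,
    per_node_capacity: int,
    peak_load: int,
    offline: int = 1,
) -> bool:
    """True if the nodes left in service can still carry the peak load."""
    remaining_capacity = (node_count - offline) * per_node_capacity
    return remaining_capacity >= peak_load
```

      With four nodes of 1000 requests/s each, can_take_offline(4, 1000, 2500) is True, while can_take_offline(4, 1000, 3500) is False: the farm is N+0, and pulling one node for maintenance is already an outage.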

      Second, doing this adds a lot of work. Failing from primary to secondary on a high availability system is simple for some things (especially embedded devices like firewalls, switches and routers) but very complicated for others. It's cheaper and more effective to bump the pay rate a bit and do what everyone does, for good reason...hold maintenance windows in the middle of the night.

      Third, guess what happens when you spend the excess money to make everything HA, go through all the trouble of doing failovers as part of your maintenance...and then something goes wrong during that maintenance? You've just gone from HA to single-instance, during business hours. And if that application or device is one that warrants being in a HA configuration in the first place, you're now in a bit of danger. Roll the dice like that one too many times, and someday there will be an outage...of that application/device, followed immediately after by an outage of your job. It does happen, it has happened, I've seen it happen, and nobody experienced who runs a data center will let it happen to them.

      --

      For your security, this post has been encrypted with ROT-13, twice.
  10. Slashdot is a Bad Place to Ask This by terbeaux · · Score: 4, Interesting

    Everyone here is going to tell you that a human needs to be there because that is their livelihood. Any task can be automated at a cost. I am guessing that it is not your current task to automate maintenance tasks otherwise you wouldn't be asking. Somewhere up your chain they decided that for the uptime / quality of service it is more cost effective to have a human do it. That does not mean that you can not present a case showing otherwise. I highly suggest that you win approval and backing before taking time to try to automate anything.

    Out of curiosity, are they VMs?

  11. It depends on the size of your operation... by jwthompson2 · · Score: 4, Interesting

    If you really want to automate this sort of thing you should have redundant systems with working and routinely tested automatic fail-over and fallback behavior. With that in place you can more safely set up scheduled maintenance windows for routine stuff and/or pre-written maintenance scripts. But, if you are dealing with individual servers that aren't part of a redundancy plan then you should babysit your maintenance. Now, I say babysit because you should test and automate the actual maintenance with a script to prevent typos and other human errors when you are doing the maintenance on production machines. The human is just there in case something goes haywire with your well-tested script.

    Fully automating these sorts of things is out of reach for many small to medium sized firms because they don't want to, or can't, invest in the added hardware to build out redundant setups that can continue operating when one participant is offline for maintenance. So the size of your operation and how much your company is willing to invest to "do it the right way" are the limiting factors in how much you are going to be able to effectively automate this sort of task.

    --
    Even if I knew that tomorrow the world would go to pieces, I would still plant my apple tree. -Martin Luther
  12. Similar experiences ... by psergiu · · Score: 4, Insightful

    A friend of mine lost his job over a similar "automation" task on Windows.

    The upgrade script was tested in a lab environment that was supposed to be exactly like production (it turned out it wasn't: someone had tested something earlier without telling anyone and did not revert it). The upgrade script was scheduled to run on production during the night.

    Result - \windows\system32 dir deleted from all the "upgraded" machines. Hundreds of them.

    On the Linux side I personally had RedHat doing some "small" changes on the storage side and PowerPath getting disabled at next boot after patching. Unfortunate event, since all Volume Groups were using /dev/emcpower devices. Or RedHat doing some "small" changes in the clustering software from one month to the other. No budget for test clusters. Production clusters refusing to mount shared filesystems after patching. Thankfully in both cases the admins were up & online at 1AM when the patching started and we were able to fix everything in time.

    Then you can have glitchy hardware/software deciding not to come back up after reboot. RHEL GFS clusters are known to randomly hang/crash at reboot. HP Blades sometimes have to be physically removed & reinserted to boot.

    Get the business side to tell you how much the downtime is going to cost the company until:
    - Monitoring software detects that something is wrong;
    - Alert reaches sleeping admin;
    - Admin wakes up and is able to reach the servers.
    Then see if you can risk it.
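
    As a back-of-the-envelope sketch of that calculation (every number in the usage note is invented for illustration; get the real ones from the business side):

```python
def unattended_outage_cost(
    cost_per_minute: int,
    detect_min: int,
    alert_min: int,
    respond_min: int,
) -> int:
    """Cost of the outage before anyone can even start fixing it."""
    # Monitoring notices + alert reaches the admin + admin reaches the servers.
    exposure_minutes = detect_min + alert_min + respond_min
    return cost_per_minute * exposure_minutes
```

    For example, at 100 per minute of downtime, with 5 minutes to detect, 10 for the alert to wake someone and 45 to get to a console, unattended_outage_cost(100, 5, 10, 45) comes to 6000 before the actual repair starts. An attended window removes most of that exposure.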

    --
    1% APY, No fees, Online Bank https://captl1.co/2uIErYq Don't let your $$$ sit in a no-interest acct.
  13. Think of it a slightly different way by thecombatwombat · · Score: 3, Informative

    First: I do something like this all the time, and it's great. Generally, I _never_ log into production systems. Automation tools developed in pre-prod do _everything_. However, it's not just a matter of automating what a person would do manually.

    The problem is that your maintenance for simple things like updating a package is requiring downtime. If you have better redundancy, you can do 99% of normal boring maintenance with zero downtime. I say if you're in this situation you need to think about two questions:

    1) Why do my systems require downtime for this kind of thing? I should have better redundancy.
    2) How good are my dry runs in pre-prod environments? If you use a system like Puppet for *everything* you can easily run through your Puppet code as you like in non-production, then in a maintenance window you merge your Puppet code, and simply watch it propagate to your servers. I think you'll find reliability goes way up. A person should still be around, but unexpected problems will virtually vanish.

    Address those questions, and I bet you'll find your business is happy to let you do "maintenance" at more agreeable times. It may not make sense to do it in the middle of the business day, but deploying Puppet code at 7 PM and monitoring is a lot more agreeable to me than signing on at 5 AM to run patches. I've embraced this pattern professionally for a few years now. I don't think I'd still be doing this kind of work if I hadn't.

  14. Reboot? - Load Balancers and multiple systems by Taelron · · Score: 3, Insightful

    Unless you are updating the kernel there are few times you need to reboot a CentOS box. Unless your app has a memory leak.

    The better way to go about it has already been pointed out above. Have several systems, load balance them in a pool, take one node out of the pool, work on it, return it to the pool then repeat for each remaining system. - No outage time and users are none the wiser to the update.
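
    A rough sketch of that loop, assuming the load balancer and the update procedure can be driven through simple callables. drain, update, healthy and restore are hypothetical hooks, not any particular balancer's API:

```python
from typing import Callable, List, Optional, Tuple


def rolling_update(
    pool: List[str],
    drain: Callable[[str], None],
    update: Callable[[str], None],
    healthy: Callable[[str], bool],
    restore: Callable[[str], None],
) -> Tuple[List[str], Optional[str]]:
    """Update one node at a time; return (updated nodes, first failed node or None)."""
    updated: List[str] = []
    for node in pool:
        drain(node)    # take the node out of the pool
        update(node)   # work on it while it serves no traffic
        if not healthy(node):
            # Leave the bad node out and stop; the untouched nodes keep serving.
            return updated, node
        restore(node)  # back into rotation before touching the next one
        updated.append(node)
    return updated, None
```

    Stopping at the first unhealthy node means a bad package costs you one node of capacity rather than the whole pool. It only works if the remaining nodes can carry the load while one is out.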