Ask Slashdot: Unattended Maintenance Windows?
grahamsaa writes: Like many others in IT, I sometimes have to do server maintenance at unfortunate times. 6AM is the norm for us, but in some cases we're expected to do it as early as 2AM, which isn't exactly optimal. I understand that critical services can't be taken down during business hours, and most of our products are used 24 hours a day, but for some things it seems like it would be possible to automate maintenance (and downtime).
I have a maintenance window at about 5AM tomorrow. It's fairly simple — upgrade CentOS, remove a package, install a package, reboot. Downtime shouldn't be more than 5 minutes. While I don't think it would be wise to automate this window, I think with sufficient testing we might be able to automate future maintenance windows so I or someone else can sleep in. Aside from the benefit of getting a bit more sleep, automating this kind of thing means that it can be written, reviewed and tested well in advance. Of course, if something goes horribly wrong having a live body keeping watch is probably helpful. That said, we do have people on call 24/7 and they could probably respond capably in an emergency. Have any of you tried to do something like this? What's your experience been like?
Learn and use Puppet.
Support for off-hour work is part of the job. Don't like it? Find another job where you don't have to do that. Can't find another job? Improve yourself so you can.
quit complaining.
You should always have a competent tech on hand for maintenance tasks. Period. If you do not, Murphy will bite you, and then, instead of having it back up by peak hours you are scrambling and looking dumb. In your current scenario, say the patch unexpectedly breaks another critical function of the server. It happens, if you have been in IT any time you have seen it happen. Bite the bullet and have a tech on hand to roll back the patch. Give them time off at another point, or pay them extra for night hours, but thems the breaks when dealing with critical services.
Silence is a state of mime.
Phew, thought you were going to ask about it on Windows. Linux, go for it!
...and while I'm reasonably sure I can execute automated maintenance windows with little to no impact to business operations, I'm not sure. So I don't do it.
If there were more at stake, if the risk vs benefits were tipped more in my company's favor, I might test implement it. But just to catch an extra hour or two of sleep? Not worth it; I want a warm body watching the process in case it goes sideways. 9 times out of 10, that warm body is me.
Mod me down with all of your hatred and your journey towards the dark side will be complete!
Maintenance windows are at off-hours to accommodate real work happening. If every action were painless and produced the desired result, you could do it over lunch or something like that. But that's not the real world.
This begs the question of how the hell are you going to fix unexpected problems in an automated fashion? The answer is, you aren't. Therefore, you have to be up at 2am.
HBI's Law: Frequency of calling others Nazis is directly correlated with the likelihood of the accuser being Communist.
If you have a high availability system with more than one backup node then daytime maintenance becomes very doable.
Attended automation is the way to go. You gain all the advantages of documentation, testing etc. If the automation goes smoothly, you only have to watch it for 5 mins. If it doesn't, then you can fix it immediately.
You just need to schedule some of your days as offset days. Work from 4pm to midnight some days so that you can get some work done when others aren't around. Some days require you being around people, some days command you be alone.
Or you can just work 16hour days like the rest of us and wear it with a badge of honor.
If you are your own boss and do this, you can earn enough money to take random weeks off from work with little to no notice so that you can travel the world, and do some recruiting while doing it so that you can write the expenses off on the company.
Sig: I stole this sig.
I do this for a lot of clients. Automatic Deployment Rules in Configuration Manager, scripts, cron jobs etc. For test / dev, it absolutely makes sense, as I usually have a monitoring system that goes into Maintenance Mode during the updates. If things take too long or if services aren't restored post update, the monitoring system gives me a shout that something needs to be remediated. For production, it varies with the expected impact. If it's something I tested in pilot with zero issues and the application isn't something with an insane SLA, sure, I'll use an automatic deployment. When I'm working on hospital equipment such as servers processing imaging or vitals monitoring for surgery, that gets nixed no matter what due to the liability concerns. I usually suggest building up trust / experience by automating the less critical systems and phasing in more sensitive systems until you've both gained a lot of experience with it and have more management support to do so. When crap goes down, it's easier to say "this is a tested process we've been using for years" vs "yeah, oops, new script, sorry that knocked down our ERP system".... Resume generating event right there... So, I guess it depends, just another tool for the toolbox and it's up to the carpenter to know when to pull it out.
I like ansible... a lot. Chef, salt, something else if that is your preference. In any event, yes, an automated deployment framework allows you to test the maintenance procedure out, throttle the number of servers that get managed at one time, and bail (and/or text you) if there is a problem.
Done right it can be run continuously so that you are always confident about the state of your servers and their maintenance procedures.
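For anyone wondering what "throttle and bail" looks like in practice, here is a minimal Python sketch of the idea. It is not how ansible or any other framework implements it; `rolling_maintenance`, the `maintain` callback and the host names are all hypothetical stand-ins for whatever actually patches a box.

```python
from typing import Callable, Iterable, List


def rolling_maintenance(
    hosts: Iterable[str],
    maintain: Callable[[str], bool],
    batch_size: int = 2,
) -> List[str]:
    """Run maintain() on hosts in small batches; stop after the first failing batch.

    maintain(host) is a hypothetical callback that returns True on success.
    Returns the hosts that were maintained successfully.
    """
    hosts = list(hosts)
    done: List[str] = []
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        results = [(host, maintain(host)) for host in batch]
        done.extend(host for host, ok in results if ok)
        failed = [host for host, ok in results if not ok]
        if failed:
            # Bail out: remaining hosts stay untouched until a human has looked.
            print(f"Stopping rollout, failed hosts: {failed}")
            break
    return done
```

The point of the batch size is that a bad patch takes out at most one batch before the rollout stops, instead of the whole fleet.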
We do it all the time... Schedule a snapshot, push patches, verify things are up, and if not throw an alarm... Using Shavlik on, horror of horrors, Windows...
I've rolled back 2 or 3 in the past 5 years, usually due to Microsoft's inability to consistently write a patch that doesn't break something, and once because the vendor couldn't see to be hyper sensitive to .Net patch levels...
- cfengine
- puppet
- chef
- ansible
- salt
All should be able to do the work.
Offshore your maintenance jobs to someone in the correct timezone!
You don't monitor maintenance windows for when everything goes well and is all boring. You monitor them for when things go all to hell and someone needs to correct it.
In any organization I've worked in, if you suggested that, you'd be more or less told "too damned bad, this is what we do".
I'm sure your business users would love to know that you're leaving it to run unattended and hoping it works. No, wait, I'm pretty sure they wouldn't.
I know lots of people who work off hours shifts to cover maintenance windows. My advice to you: suck it up, princess, that's part of the job.
This just sounds like risk taking in the name of being lazy.
Lost at C:>. Found at C.
Support for off-hour work is part of the job. Don't like it? Find another job where you don't have to do that. Can't find another job? Improve yourself so you can.
This is the correct answer. I promise you that at some point, something will fail, and you will have failed by not being there to fix it immediately.
Load balanced or mirrored systems. You can upgrade part of it any time, validate it, then swap it over to the live system when you are happy.
Having someone with little or no sleep doing critical updates is not really the best strategy.
If these are as critical services as you say, I would assume you have some sort of redundancy, at least a 2nd server somewhere. If so, treat each as "throw away", build out what you need on the alternative server, swing DNS and be done. Rinse and repeat for the next 'upgrade'. Then do your work in the middle of the day. See Immutable Servers: http://martinfowler.com/bliki/...
Jesus saves souls and redeems them for valuable cash prizes
Why would you want to automate someone or yourself out of a job? I realized years ago that Microsoft was working hard to automate me out of my contracts. It's almost done, why accelerate the inevitable?
Setting aside the wisdom (or lack thereof) of automating maintenance, you should also have some process external to the maintained machines that confirms that the maintenance worked. That confirmation could be something like testing that a Web server continues to serve the expected pages, some port provides expected information, etc. If this external process notes a discrepancy, it would page/text/call you.
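A rough Python sketch of that kind of external check, meant to run from a machine other than the one being maintained. The function names, the URL and the `alert` callback are assumptions for illustration; how you actually page/text/call someone depends on your setup.

```python
import urllib.error
import urllib.request
from typing import Callable


def page_is_healthy(url: str, expected_text: str, timeout: float = 10.0) -> bool:
    """Return True if url answers HTTP 200 and the body contains expected_text."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status == 200 and expected_text in body
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, HTTP error or malformed URL.
        return False


def check_and_alert(
    url: str,
    expected_text: str,
    alert: Callable[[str], None],
    fetch: Callable[[str, str], bool] = page_is_healthy,
) -> bool:
    """Check the page and call alert(message) if it does not look right."""
    healthy = fetch(url, expected_text)
    if not healthy:
        alert(f"Post-maintenance check failed for {url}")
    return healthy
```

Usage would be something like `check_and_alert("http://www.example.com/", "Welcome", send_page)` from cron a few minutes after the window, where `send_page` is whatever wakes you up.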
I am "the guy". The guy that your boss calls when your simple maintenance outage goes all sideways (and I like your idea). Positioning oneself so that any problem becomes a lingering outage that shakes your company's faith in your IT Director's ability to do their job competently is always a great idea. If you can cron it from work, why not cron it from an outsourced location? I mean, either it goes well and they don't need you or it goes sideways and they need me. Either way, you are screwed. PRO TIP: do not store your resume on any system that you cron after-hours updates to.
By far the better solution is to figure out why that one specific server can't be taken offline. Regardless of the tests and validations, it's far safer to work on a server that's not supposed to be running than on one that is. It obviously takes a lot of work, but all your critical/important services should be running in some sort of HA scenario. If you can't take a 5 minute outage just after normal business hours, you absolutely cannot take a failure in the service due to any sort of hardware failure (which will happen). This is coming from years of experience in a Software as a Service company.
Everyone here is going to tell you that a human needs to be there because that is their livelihood. Any task can be automated at a cost. I am guessing that it is not your current task to automate maintenance tasks otherwise you wouldn't be asking. Somewhere up your chain they decided that for the uptime / quality of service it is more cost effective to have a human do it. That does not mean that you can not present a case showing otherwise. I highly suggest that you win approval and backing before taking time to try to automate anything.
Out of curiosity, are they VMs?
What's the impact if it all goes wrong and you're not there? If impact is huge and you're fired if it all goes bad, be there. If it doesn't matter and it can fail with no consequences, script it.
Disclaimer: Today (Friday) I found out my company is doing a DR exercise from 10PM tonight to 9AM tomorrow. I'm an ITSec manager, and they wanted to know if they could make a "few firewall changes if they need to". I said no, and told them I would stay up late to review and approve any emergency changes they want but they were NOT getting "carte blanche" with no ITSec oversight, as that would be really irresponsible and break SOX, etc... You do what you have to in order to get the job done properly!
(Posting Anonymous so !humblebragging.)
If you were to set up a hardware remote console, you could do it from home. So yeah, it's 15 minutes out of bed, but then it's right back to bed.
...service bounces that are happening all the time. When it occurs and/or if any other issues, I can send myself a mail. My blackberry has filters which allow an alarm to go off which can wake me during the night. That would seem to meet your needs.
Bukowski said it. I believe it. That settles it.
Although I do feel this is the nature of the beast when working in a true IT position where businesses rely on their systems nearly 100% of the time, there are some smart ways to go about it. I'm not exactly sure what type of environment you're using, but if you use something like VMware's vSphere product, or Microsoft's Hyper-V, both allow for "live migrations". Why not virtualize all of your servers first of all, make a snapshot, perform the maintenance, and live migrate the VMs? You could do it right in the middle of the day and nobody would even know. This kind of setup takes a lot of planning, however. I personally wouldn't want any maintenance performed on my servers without manual approval. Unattended maintenance sounds a bit too scary for my liking, and in my experience with even small security updates for both Linux and Windows servers, there's bound to be a point where something would fail, and you could potentially get in a lot of legal trouble if you fail to meet your SLA or cause a loss of profit for a business due to downtime.
*plays the Apogee theme song music*
If you really want to automate this sort of thing you should have redundant systems with working and routinely tested automatic fail-over and fallback behavior. With that in place you can more safely setup scheduled maintenance windows for routine stuff and/or pre-written maintenance scripts. But, if you are dealing with individual servers that aren't part of a redundancy plan then you should babysit your maintenance. Now, I say babysit because you should test and automate the actual maintenance with a script to prevent typos and other human errors when you are doing the maintenance on production machines. The human is just there in case something goes haywire with your well-tested script.
Fully automating these sorts of things is out of reach for many small to medium sized firms because they don't want to, or can't, invest in the added hardware to build out redundant setups that can continue operating when one participant is offline for maintenance. So the size of your operation and how much your company is willing to invest to "do it the right way" are the limiting factors in how much you are going to be able to effectively automate this sort of task.
Even if I knew that tomorrow the world would go to pieces, I would still plant my apple tree. -Martin Luther
A friend of mine lost his job over a similar "automation" task on Windows.
The upgrade script was tested in a lab environment that was supposed to be exactly like production (but it turned out it wasn't: someone had tested something earlier without telling anyone and did not revert it). The upgrade script was scheduled to be run on production during the night.
Result - \windows\system32 dir deleted from all the "upgraded" machines. Hundreds of them.
On the Linux side I personally had RedHat doing some "small" changes on the storage side and PowerPath getting disabled at next boot after patching. Unfortunate event, since all Volume Groups were using /dev/emcpower devices. Or RedHat doing some "small" changes in the clustering software from one month to the other. No budget for test clusters. Production clusters refusing to mount shared filesystems after patching. Thankfully in both cases the admins were up & online at 1AM when the patching started and we were able to fix everything in time.
Then you can have glitchy hardware/software deciding not to come back up after reboot. RHEL GFS clusters are known to randomly hang/crash at reboot. HP Blades have sometimes to be physically removed & reinserted to boot.
Get the business side to tell you how much the downtime is going to cost the company until:
- Monitoring software detects that something is wrong;
- Alert reaches sleeping admin;
- Admin wakes up and is able to reach the servers.
Then see if you can risk it.
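That comparison is just arithmetic once the business gives you a number. A back-of-the-envelope Python sketch, where the function name and every figure in the usage example are made-up placeholders, not data from anyone's environment:

```python
def extra_unattended_cost(
    detect_min: float,
    alert_min: float,
    respond_min: float,
    cost_per_min: float,
) -> float:
    """Cost of the extra downtime incurred because nobody was watching.

    detect_min:  time until monitoring notices something is wrong
    alert_min:   time until the alert reaches the sleeping admin
    respond_min: time until the admin is awake and on the servers
    """
    extra_minutes = detect_min + alert_min + respond_min
    return extra_minutes * cost_per_min


# Hypothetical figures: 5 min to detect, 2 min to alert, 30 min to respond,
# at 100 currency units per minute of downtime.
print(extra_unattended_cost(5, 2, 30, 100.0))
```

If that number is small compared to what it costs to keep someone awake for every window, unattended starts to look reasonable. If it isn't, you have your answer.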
1% APY, No fees, Online Bank https://captl1.co/2uIErYq Don't let your $$$ sit in a no-interest acct.
One way to prepare for failure is to have someone there who can at least recognize the failure and wake someone up in time to fix it.
Another way to prepare for failure is to have a system that is redundant enough that a part could go down and it wouldn't be more than a minor annoyance to users or management.
There are other ways to prepare for failure, but these are two common ones.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
Can't you make some kind of setup that triggers if the update fails and alerts you / wakes you up with noise from your smartphone etc.
Or like the other poster who beat me to it - off-load your work to someone in a country where your 5am is mid-day in their country.
Waterfox - a Firefox fork with legacy extension support, security updates and better privacy by default.
While just about everybody does availability over security, it just depends on what you sell your clients.
We do at least all Security-Updates fully-automatically without review or maintenance: whenever cron-apt picks up a new Debian security update, it gets installed automatically. If something goes wrong, our customers understand, as that is what they want or need: Security, even if Availability stays behind.
I'm seeing this increasingly often......misuse of the phrase "begs the question". Why don't you look it up?
By proving that your job can be largely automated, you are eroding the reasons to keep you employed.
Sure, we all know it's a bad idea to set things on autopilot because eventually something will break badly. But do your managers know that?
We do some automation for maintenance, but the end result has to be able to be tested thoroughly automatically. If the automated tests succeed, I stay asleep. If they fail, I get paged and wake up to deal with it. 90% of the time, it works and I get to sleep through the night. But we can only really do this for simple maintenance.
This has always been a contention. Some systems can be automated through SMS, System Center or even a vb script. However, I've had windows updates corrupt IIS web servers before requiring me to uninstall all .net frameworks, reinstall IIS, and reinstall the .net framework. This is one of those situations you don't want to wake up in Monday morning with customers down. For critical systems, I always manually test on test systems, push to production and test after updates applied to make sure everything is running as intended. For low impact updates like ccleaner, automated pushes are much more viable because of the impact to the system is relatively low. So as the subject says, "Depends". Hope this helps with your inquiry.
If you do your maintenance out of hours and something goes wrong, who's going to fix it? Some bleary-eyed administrator who has had 2 hours' sleep? If they need to escalate it, who are they going to call at 2am? Also, are these guys being paid double time to work these hours, given time off to compensate, or just expected to suck it up and work a normal day shift after working half the night? Whichever way you look at it, it's full of problems.
Instead, rearchitect your solution. If you care about a service enough to not take planned downtime in working hours, you probably care enough that unplanned downtime in working hours should not be business affecting either. So you should double up on servers (which should be pretty cheap if you're running everything in a virtualised environment) and arrange for services to fail over to a secondary if the primary is unavailable. If you're doing this in Windows (my condolences to you) it should mostly support this anyway. If you're doing it in something unix-like you can use things like keepalived to fail a service from one node to another.
Once you have a solution like this, maintenance is easy - you patch/upgrade/reboot your backup server, check that it's OK, then promote it to primary, then do the other server, and then promote it back again. You do it *all* in working hours so that (a) people get a decent night's sleep and (b) if something goes wrong you can call on your support provider without having to pay over the odds for 24x7 support.
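The ordering of those steps is the part that matters, so here is a small Python sketch of it. The `patch`, `healthy` and `promote` callbacks are hypothetical hooks for whatever your environment uses (keepalived, a load balancer API, etc.); nothing here talks to a real cluster.

```python
from typing import Callable


def patch_ha_pair(
    primary: str,
    backup: str,
    patch: Callable[[str], None],
    healthy: Callable[[str], bool],
    promote: Callable[[str], None],
) -> str:
    """Patch a two-node pair one node at a time; return the final primary."""
    # Step 1: patch the node that is not serving traffic.
    patch(backup)
    if not healthy(backup):
        raise RuntimeError(f"{backup} unhealthy after patching; {primary} still serving")
    # Step 2: move the service over, which frees up the old primary.
    promote(backup)
    # Step 3: patch the old primary while the backup carries the load.
    patch(primary)
    if not healthy(primary):
        raise RuntimeError(f"{primary} unhealthy after patching; {backup} still serving")
    # Step 4: promote the original primary back again.
    promote(primary)
    return primary
```

Note that at every point where it can raise, one known-good node is still serving, which is what makes doing this in working hours tolerable.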
In my experience testing, no matter how thorough you think it is, will fail to account for all possibilities. That one possibility you missed will bite you in the ass when you automate your maintenance.
It's wise to maintain human-on-site because it maintains the employer's idea that you are worth keeping rather than outsourcing your position to someone who can do the job from a distance.
I work for money, not some blithering ideal of efficiency which may not include me.
When I trim my trees I don't cut off the branch I'm sitting on.
Simple.
You stipulate that for every maintenance, there has to be full regression testing of any affected applications. You will require the application owner, QA folks, and any other affected personnel online during and after the maintenance to test and ensure everything is working. Bonus points: require them to be on a conference call, and breathe heavily into the mic the entire time (maybe occasionally say "Oops"). When you have enough other people complaining about the 2 am times instead of just you, they magically get moved to more sensible times in the late afternoon.
Your best bet is to get out of Managed Services and into Professional Services. You just build out new environments / servers / apps and hand them off to the MS guys. Once it's off your hands, you never have to worry about a server crashing, maintenance windows, or being on call. Plus, you are generally paid more.
I think automating maintenance is a smart move but still requires you be awake and available for it. The question is do you want to be awake at work for 10 minutes or 2 hours? Plan accordingly.
Do you plan on automating the end-user testing and validation as well?
Countless system administrators have confirmed the system was operational after change without throwing it to real live testers only to find that, well, it wasn't.
Every second you save automating the task, will be taken out of your backside when it goes wrong (see the recent article where a university SCCM server formatted itself and EVERY OTHER MACHINE on campus) and you're not around to stop it or fix it.
Honestly? It's not worth it.
Work out of normal hours, or schedule downtime windows in the middle of the day.
First: I do something like this all the time, and it's great. Generally, I _never_ log into production systems. Automation tools developed in pre-prod do _everything_. However, it's not just a matter of automating what a person would do manually.
The problem is that your maintenance for simple things like updating a package is requiring downtime. If you have better redundancy, you can do 99% of normal boring maintenance with zero downtime. I say if you're in this situation you need to think about two questions:
1) Why do my systems require downtime for this kind of thing? I should have better redundancy.
2) How good are my dry runs in pre-prod environments? If you use a system like Puppet for *everything* you can easily run through your puppet code as you like in non-production, then in a maintenance window you merge your Puppet code, and simply watch it propagate to your servers. I think you'll find reliability goes way up. A person should still be around, but unexpected problems will virtually vanish.
Address those questions, and I bet you'll find your business is happy to let you do "maintenance" at more agreeable times. It may not make sense to do it in the middle of the business day, but deploying Puppet code at 7 PM and monitoring is a lot more agreeable to me than signing on at 5 AM to run patches. I've embraced this pattern professionally for a few years now. I don't think I'd still be doing this kind of work if I hadn't.
We have many thousands of linux and windows desktop clients hosted in data centers accessed by thin client protocols. With linux, no problem. We have our update schedule and everything pretty much works.
However on the windows side we need to use a bunch of custom tools to try and beat the systems into line. We often have things blocked by pending reboots, windows updates and advertised software being pushed to the systems. So we get such a different mix of systems to deal with, it's not always working well.
Please note we also do not have access to the SCCM backend (this has been outsourced). Any suggestions? Except for maybe having a monthly window where we disable wsus and sms host agent, reboot, do our updates, re-enable and reboot again. It's clunky.
GF
If the machine is a VM, why not bring it down, take a snapshot, boot it up and do your update, etc and then reboot. If the machine is not up within 10 minutes or so, boot up the snapshot you made. You can do all of this via an external machine and use the Perl API to vmware or use the standard KVM/Xen virt tools. This way, if your maintenance fails, you can come in the next morning and figure out what went wrong. I think VMWare actually provides a script called "snapshotmanager.pl" in its Perl SDK so you don't need to write your own. (If you're using VMWare)
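The control flow for that, run from the external machine, could look roughly like the Python sketch below. It deliberately does not call the VMware Perl SDK or any hypervisor API; `take_snapshot`, `run_update`, `is_up` and `revert` are hypothetical hooks you would wire up to your own tooling, and the 10 minute default comes from the suggestion above.

```python
import time
from typing import Any, Callable


def update_with_rollback(
    take_snapshot: Callable[[], Any],
    run_update: Callable[[], None],
    is_up: Callable[[], bool],
    revert: Callable[[Any], None],
    timeout_s: float = 600.0,
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Snapshot, update, then wait for the machine; revert if it never comes back.

    Returns True if the machine came up in time, False if it was reverted.
    """
    snapshot = take_snapshot()
    run_update()  # expected to include the reboot
    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_up():
            return True
        sleep(poll_s)
    # Still down after the timeout: boot the snapshot and investigate in the morning.
    revert(snapshot)
    return False
```

`is_up` should test the actual service, not just ping, or you will happily keep a machine that booted but serves nothing.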
You're trading caution for convenience.
I have automated some things such as patch installation overnight only to wake up to a broken server despite the patches being heavily tested and known to work in 100% of the cases before only to not have them work when nobody was watching.
I urge you to only consider unattended automation overnight when it's for a system that can reasonably incur unexpected downtime without jeopardizing your job and/or the organization. If it's critical -- DO NOT AUTOMATE.
You've been warned.
Maybe back when the maintenance window was created it was created for a valid technical reason, BUT technology moved on and management didn't.
In other words, in some environments, the technical people won't have a sympathetic ear if they ask to cancel the off-hours maintenance window simply because of local politics or the local management, BUT if the maintenance gets botched and services are still down or under-performing through normal business hours, nobody outside of IT will notice.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
Just write a simple Perl script to handle it, it would take about 1 hour to develop and test and you'd be good to go.
Hello OP,
First things first, we need to discuss some of the less than fun stuff about maintenance windows in production environments. For starters, what is the process that your company follows exactly? Do you test on a dev box beforehand? Do you have a rollback plan in case it goes to hell?
Now for the less than fun part... you will want to be on-site, or at least have someone on-site, whenever you are doing any type of maintenance work on a production system. Not so much a problem nowadays, but a good example from the olden days was: what if someone left a disk in the A drive? Well, that means your box most likely won't boot, because it's most likely set to boot from A before C. Just an example, but it goes to show that having hands and eyes on-site is almost non-negotiable (there are some exceptions like completely virtualized environments where you can use the controller to do whatever you need but even then... like another poster said... Murphy will hunt you down and make you regret not having someone on site).
Now moving on to the next part, Lean Six Sigma... if you have no training in it, ask your company to pay for your courses. It's a really fun course and it applies to EVERYTHING. I know of a major VoIP solution that uses LSS (Lean Six Sigma) approaches to update systems that are on the 5 9s level (99.999% up-time). You end up breaking down your maintenance into a couple of different steps. Step 1 is about 7 days before the update (moving any required files to the machine, putting them in the right spots, blah blah blah). Step 2 is the day before: generally making sure no updates were released for the files that you moved the previous week (most likely will not apply to you, but still a step nevertheless). This step normally also includes checking and documenting system health to make sure everything is up to snuff (for example, you wouldn't want to start an update if the raid array is down and dirty, or with various other things like low disk space, high mem/cpu usage, or missing user accounts for maintenance). Step 3 is the actual update, plus checking to make sure everything got the update and started back up. Step 4 is confirming the end result works as intended.
Really you can make these into as many steps as you want, but the goal is to have as much ready and done before the actual update so that your workload at 4am is as small as possible.
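The day-before health check in step 2 is easy to script. A minimal Python sketch covering just two of the items mentioned (disk space and CPU load); the function names and thresholds are assumptions, and things like RAID state or maintenance accounts would need checks specific to your systems.

```python
import os
import shutil
from typing import List, Tuple


def find_problems(
    free_fraction: float,
    load_per_cpu: float,
    min_free_fraction: float = 0.10,
    max_load_per_cpu: float = 1.0,
) -> List[str]:
    """Compare readings against thresholds; an empty list means OK to proceed."""
    problems = []
    if free_fraction < min_free_fraction:
        problems.append(f"low disk space: {free_fraction:.0%} free")
    if load_per_cpu > max_load_per_cpu:
        problems.append(f"high load: {load_per_cpu:.2f} per CPU")
    return problems


def read_system(path: str = "/") -> Tuple[float, float]:
    """Read free disk fraction and 1-minute load per CPU (Unix-like systems only)."""
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total
    load_per_cpu = os.getloadavg()[0] / (os.cpu_count() or 1)
    return free_fraction, load_per_cpu
```

Calling `find_problems(*read_system())` the day before and again right before the window, and logging the result, also gives you the "documenting system health" part for free.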
While you can't quite afford to do it fully unattended, you can spin up another (presumably close to identical) machine with everything that needs to be on there, prepare a last sync, let it sit until maintenance, do the last sync, swap out the boxes. Test, if not ok swap back until next time. That way all the hard stuff gets moved out of the dark hours and the rest you can do whenever.
Of course, this requires extra hardware or at least allocated virtualised resources, but since these are regarded as close to free these days, and spun-down instances can be re-used next time or for some other task, well, you know.
I'm not familiar with CentOS or Redhat, but in Debian it's not uncommon to get the odd update that requires configuration wizards. There's no shortcut to those, and in the event of it happening, you are gonna have some early risers complaining.
And even the supposedly safe, unattended updates aren't that safe: For example I updated to the latest linux-image from Debian's repos yesterday. I didn't expect some core services to depend on a computer reboot to start working again, but 5 minutes in people were complaining a Jboss web app wasn't working.
Are you talking about servers/services? If so, every service should have some sort of failover strategy to other hardware. That way anything you need to work on can be failed over during business hours and brought back.
I am updating your outward facing mail server, the update fails, where is your god/email now?
I've worked IT at all levels, from mom and pop shops to 5 9s (99.999% availability).
If it's a production system, it's pretty important to have hands and eyes on-site in case something goes wrong. ANYTHING can go wrong... KVM has a hiccup and all of a sudden the PC thinks someone is pressing space bar 24x7... guess what isn't booting? Last tech forgot to change the KVM to another input and has a disk in the included disk drive... guess what server may not be booting....
There is plenty you can do without people on-site, but when you label something as production critical it is because you can't afford to wait that 30 minutes to an hour for someone to wake up and get on-site.
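To put the 5 9s figure next to that 30 minutes to an hour, the arithmetic is short enough to write down. A quick Python sketch (the function name is made up; the only inputs are the availability percentage and the length of a year):

```python
def downtime_budget_minutes(availability_percent: float, days: float = 365.0) -> float:
    """Minutes of downtime allowed over the period at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# 99.999% over a year works out to a bit over five minutes in total.
print(round(downtime_budget_minutes(99.999), 2))
```

So at five nines, a single 30 minute wait for someone to wake up and drive in burns several years' worth of allowed downtime.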
Sometimes, an event happens which begs (for) the question of why nobody planned for it.
This raises the question of why people don't just avoid the pedantic bickering by saying "raises the question".
I get paid for cleaning up after things that don't work right the first time.
That way, when things go south, you have time to right the ship before the early birds start logging in at 5:30.
Do you use VMs? ALL of our servers are now running on VMware at remote locations. I can't automate maintenance, but it does not matter if I do it from the office or at home as I am remoting either way... Set up a snapshot to roll back if there is a problem, and you can at least make it a bit more comfortable if you have to be up at odd hours...
Most of these patches are happening on systems that are in some remote data center that's not in your physical office location anyway. So I see no difference connecting remotely to the servers from your house vs connecting remotely from your office
If you want to progress in your IT career, you need to figure out how to automate basic system operations like maintenance and patching. Having to actually be awake at 2:00am to apply patches is rookie status. Sometimes it is unavoidable, but it should not be the default stance.
My environment is virtual, so our workflow is basically snapshot VM, patch, test. If the test fails, rollback the snapshot and try again (if time is available) or delay until later. If the test is successful, we hold onto the snapshot for three days just in case users find something that we missed. If everything is good after three days, we delete the snapshot.
We have a dev environment that mirrors production that we can use for patch testing, upgrade testing, etc. Due to testing, we rarely have problems with production changes. If we do, the junior guys escalate to someone who can sort it out. Our SLAs are defined to give us plenty of time to resolve issues that occur within the allocated window. (Typically ~4 hours)
In the grand scheme of things, my environment is pretty small. We have ~1500 VMs. We manage it with three people and a lot of automation.
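For anyone who wants to see the snapshot, patch, test workflow above written down, here is a minimal Python sketch. It is not the poster's tooling: `patch_with_snapshot`, `MaintenanceResult`, and the four callables are hypothetical names, and the hypervisor calls are passed in as plain functions so that no real VMware API is assumed. The three-day retention comes from the comment; `max_attempts` stands in for "try again if time is available".

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, List, Optional

# Retention period taken from the comment above ("three days").
SNAPSHOT_RETENTION = timedelta(days=3)


@dataclass
class MaintenanceResult:
    succeeded: bool
    attempts: int
    # When the pre-patch snapshot may be deleted; None if we rolled back.
    delete_snapshot_after: Optional[datetime] = None
    log: List[str] = field(default_factory=list)


def patch_with_snapshot(
    take_snapshot: Callable[[], str],
    apply_patch: Callable[[], None],
    run_tests: Callable[[], bool],
    rollback: Callable[[str], None],
    now: datetime,
    max_attempts: int = 2,
) -> MaintenanceResult:
    """Snapshot, patch, test; roll back and retry on failure."""
    result = MaintenanceResult(succeeded=False, attempts=0)
    snapshot_id = take_snapshot()
    result.log.append(f"snapshot {snapshot_id} taken")

    for attempt in range(1, max_attempts + 1):
        result.attempts = attempt
        try:
            apply_patch()
            passed = run_tests()
        except Exception as exc:  # any error counts as a failed attempt
            result.log.append(f"attempt {attempt} raised {exc!r}")
            passed = False

        if passed:
            # Keep the snapshot around in case users find something later.
            result.succeeded = True
            result.delete_snapshot_after = now + SNAPSHOT_RETENTION
            result.log.append(f"attempt {attempt} passed")
            return result

        rollback(snapshot_id)
        result.log.append(f"attempt {attempt} failed, rolled back")

    # Out of attempts: the VM is back on the snapshot, delay until later.
    return result
```

If every attempt fails, the VM ends up back on the pre-patch snapshot and `succeeded` stays `False`, which is the "delay until later" case.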
I have done it, on a large scale (20,000+ systems) using OS packaging for configuration management, Solaris JumpStart(TM), and Solaris Flash(TM) and Sun Cluster.
I have also done it with RPM's and Kickstart on CentOS and SuSE Linux Enterprise server, with AutoYaST and Kiwi.
Not only is it doable, it works beautifully. We are a 100% Puppet, CFengine and Chef-free environment - we do not use any such solution.
We can take a node down and "re-flash" it automatically while the other nodes in the cluster continue to provide the service without any interruptions whatsoever.
The entire environment supports rolling upgrades so we no longer need service windows. The entire environment is completely hands-off, automated. System administrators have no need to log in, unless they are doing software RAID administration or servicing hardware failure(s).
Snapshots are great, but they assume all your data is on the snapshot. It's harder to roll back if your new version goes ahead and corrupts some database or something on the NAS.
It's even harder to roll back if your data stores are on some multi-clustered beast that wasn't designed to be rolled back.
Of course, you should have caught that in test, right?
Is it that difficult to juggle schedules for your IT staff?
Consider all of the tasks that you do as a part of your job. Identify which ones should absolutely never be automated -- maybe they're too dangerous, maybe the risk is too great, maybe they're too much fun. I'd bet that upgrading the OS would be pretty well the top of your never-automate-this list.
cf engine or puppet will be best.
Server virtualization, redundancy, test and development environments, backups or snapshots before upgrades, testing, and automation with monitoring should allow you to either do the updates during work hours or sleep during the maintenance window in some update scenarios. Configure an external monitoring system to verify functionality when it's complete and notify you if anything fails. The notification should default to failure if anything goes wrong.
Unless you are updating the kernel there are few times you need to reboot a CentOS box. Unless your app has a memory leak.
The better way to go about it has already been pointed out above. Have several systems, load balance them in a pool, take one node out of the pool, work on it, return it to the pool then repeat for each remaining system. - No outage time and users are none the wiser to the update.
Yeah, I second that. You can automate testing also, but it should be thorough. An external system should be used for this and it should default to fail. Any failure should notify you so you can wake up and fix it.
Whenever you reboot, you have to take fsck into account? If the uptime has been more than 180 days, you'll need an fsck, if it is a small file system sure just a few minutes, but I'd schedule an hour just in case. Larger file system, we could be talking 4-6 hours of down time.
I definitely want to automate this... but not automate alone... I would build in some monitoring and notification... such as result of each step summed into an email/SMS report.
I would also use a remote host, to send an alert if the host in maintenance doesn't come back within the expected window....
And of course.. even though I'm not gonna be doing the maintenance activity manually/live, but I would still want to find a proper way to know/confirm that the maintenance window was successful before sleeping all the way through.
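A rough Python sketch of the "result of each step summed into a report" idea, combined with the default-to-fail rule mentioned a few comments up. The names `run_maintenance` and `Step` are made up for illustration, and each step is just a callable returning `True` on success. Sending the report by email or SMS is left out.

```python
from typing import Callable, List, Tuple

# A step is a label plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_maintenance(steps: List[Step]) -> Tuple[str, str]:
    """Run steps in order and return (status, report).

    The status starts as "FAILED" and only becomes "OK" once every step
    has run and returned True, so a crash or an early exit is never
    reported as success.
    """
    status = "FAILED"
    lines = []
    completed = 0
    for name, action in steps:
        try:
            ok = action() is True
        except Exception as exc:
            lines.append(f"[FAIL] {name}: {exc!r}")
            break
        if not ok:
            lines.append(f"[FAIL] {name}")
            break
        lines.append(f"[ OK ] {name}")
        completed += 1

    # Everything after the failing step was never attempted.
    for name, _ in steps[completed + 1:]:
        lines.append(f"[SKIP] {name}")

    if steps and completed == len(steps):
        status = "OK"
    report = f"maintenance {status}\n" + "\n".join(lines)
    return status, report
```

Stopping at the first failure means a broken package removal never leads into the reboot step, and the report shows exactly where things stopped.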
My last job at a UK Military Contractor (With 340+ workstations) was as IT Manager, though I had to take all direction from the IT Executive (A git with no actual IT knowledge) and we all would be told to DO only what he asked us to do, until then we just had to sit around doing nothing. One day the main server lost its RAID board and the IT executive comes storming into the office demanding to know why we were not fixing it as the whole company was down.... We replied that under your expressed guidance we have not been told to look at the issue which drove him mad and then the MD turned up to enquire what was happening and the IT executive started to brown nose the MD so we got out the department guidelines to show the MD that the IT executive was lying. Then we were told by the MD to order a new RAID board and I responded that a new RAID board was a 3 week lead time (which really upset the IT executive and MD) and the entire department was asked to convene in the IT executive office to continue this now heated discussion. When in the IT executive office I asked the MD to look into the IT Executive "in-tray" and said he would find a requisition form for said RAID controller board and it was dated 4 weeks ago. The MD gave me the requisition form after signing it and we were all asked to leave the office. We never did see the IT executive again and now I could sign requisitions and run the department...
Sad but true.
And it's almost 100% likely that the night you decide to try LSD for the first time will be the night that your automation script fails.
Continuous Deployment is what you want to do in some degree. Puppet, already mentioned, is one tool to be used in such scenarios. However, it is not sufficient to run automatic updates and reconfigurations. If you also develop your own software, you should look at Jenkins continuous deployment extensions and in general into DevOps.
Maintenance Windows sounds like upgrading to Windows 8.1 - this is one of the most unfortunate headlines of the year.
How about...
Ask Slashdot: Best time to do unattended maintenance on servers?
If you aren't phasing your patches and updates through a full-cycle test in Dev, testing in Test, fixing what needs to be done in Dev again, and eventually making it bullet proof enough for Prod, you're doing it wrong.
Puppet, Chef, Ansible aren't the answer.
They may enable you to manage the components, but they won't solve your problem. You need environments that exactly mirror your Prod.
If you've managed to bring this up, proper Disaster Recovery is a recipe/manifest away.
We use Ansible, it seems to fit well with our needs, but others use Puppet or Chef.
I really shouldn't have used someone else's email address for this account.
All sorts of automated security updates and patches during the regularly scheduled maintenance window. Couple of key things that make it work: 1. A valid and representative DEV environment or host(s) to vet and test deploy the updates using the same methods as production hosts. 2. A solid alerting system for when the inevitable couple of hosts fail and needs help to get running again. 3. A qualified and responsive on call person to review the results at or near the end of the maintenance window to make sure everything came back online properly and take action where necessary. It doesn't so much eliminate the after hours work as to reduce the volume of the after hours work to a level manageable by a single qualified tech.
Get yourself a VPN to your workstation and do it from home, in bed. If you can get a quicky while you're at it, good on you.
The OP is missing the point. Of *course* you can automate updates. You don't even need an automation system. It can be as simple as writing a bash script.
The point is... what happens when something goes wrong? If all goes well, then there's no problem. But if something does go wrong, you no longer have anyone able to respond because nobody's paying attention. So you come in the next morning with a down server and a clusterf__k on your hands.
The last place I worked at had redundancy both within the data center and across data centers. That is they could survive the loss of a data center. If the service you are supplying is so critical you should have redundancy. This will give you a little more leeway on when maintenance is done.
To the original poster: It is entirely possible, but you're going to need to learn a lot about modern automation and configuration management tools appropriate to the types of maintenance you're looking to automate. You also need solid vision and alignment on how you're going to achieve this level of automation across multiple parts of your business -- Development, IT, the Business, everyone. They all have to buy in and commit, because all of those folks have the ability to fuck it all up if everyone isn't on the same page. You can't do it alone on the admin side. As a start, I would suggest learning about Continuous Integration/Continuous Delivery and Agile and Devops methodologies to get on the road to where you want to be.
To the rest of you:
The original comment ("Learn and use Puppet") is grossly oversimplified -- there is a lot more to it -- but with proper implementation of configuration management software (Chef, Puppet, Salt, etc), proper automated testing (think Jenkins, Teamcity, etc) and a real commitment in your organization to Continuous Integration and Delivery practices, you can easily do regular automated maintenance. Yes -- sometimes it will break and you'll have to clean it up. But properly and thoughtfully implemented in policy and practice, those times when it breaks will be the exception that proves the rule.
Forgive the argument from authority, but at our firm (International, thousands of primarily linux servers across 14 countries and 40+ datacenters, mostly bare-iron, some virtualization) we have regular daily and weekly automated maintenance. We handle all sorts of significant change -- driver updates, software upgrades, network switch configuration, even forklift OS upgrades involving the full re-imaging of a bare iron system combined with re-deployment of software (including things like databases and hadoop clusters) -- automatically and without human intervention on a regular basis. And by regular I mean daily.
The attitude that "Murphy always wins" or "something will fail and you will have failed by not being there to fix it immediately" is a relic of a time when the tools available to manage large scale infrastructure were inadequate or unavailable. Again, there are failures that will require manual intervention, but if you are doing your jobs well as developers, network admins, systems admins, 'devops' [NOTE: I strongly object to that term being used as a job title, but that's how folks have started using it] then you should be able to conduct automated hands-free production change at 2am on a Saturday and sleep like a baby knowing that when you check your upgrade report in the morning 99% of the time everything will have gone off without a hitch.
Frankly if you approach complex infrastructure management with that defeatist viewpoint of "things will always fail", you are doing yourself and your employer a disservice, and you are severely restricting your career prospects. My company is not in any way unique in our ability to automate and manage our infrastructure, and maintaining that type of outdated attitude is going to cause lots of doors to be slammed in your face. Do you really believe the Googles, Facebooks and Amazons of the world rely on having a human being white-knuckling every change in their infrastructure?
One additional note: If your infrastructure is designed such that you cannot push change without guaranteed downtime or the risk of downtime then you have failed to design your infrastructure properly.
5 and 6 am maintenance windows would be a blessing after working 21 years for a large Telco/ISP. We usually start at midnight or 2 am and there is typically one or two events going on in a week...
Always go in with a well considered plan, and be there when it happens.
Even if your planning is awesome, you'll look unprofessional not being in a position to fix a problem when it is most likely to occur.
If something does happen, and you're not there... There will be crankiness.
You put all your shit behind load balancers. Then you tell the target machine to stop accepting new connections. When the existing connections die, you are free to do whatever upgrades are necessary. Swap the tested box back in, and move to the next target.
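Here is a minimal Python sketch of that drain, upgrade, swap back in loop. `Pool` is a toy in-memory stand-in rather than any real load balancer API, and `rolling_upgrade`, `upgrade`, and `healthy` are hypothetical names. A real pool would wait for connections to close instead of zeroing a counter.

```python
from typing import Callable, Dict, List


class Pool:
    """Toy in-memory stand-in for a load balancer pool (hypothetical)."""

    def __init__(self, connections: Dict[str, int]) -> None:
        self.connections = dict(connections)  # node -> open connections
        self.accepting = {node: True for node in connections}

    def drain(self, node: str) -> None:
        # Stop new connections, then let the existing ones finish.
        self.accepting[node] = False
        self.connections[node] = 0  # a real pool would wait here

    def enable(self, node: str) -> None:
        self.accepting[node] = True


def rolling_upgrade(
    pool: Pool,
    upgrade: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Upgrade one node at a time; stop at the first unhealthy node."""
    upgraded = []
    for node in list(pool.accepting):
        others = [n for n in pool.accepting if n != node and pool.accepting[n]]
        if not others:
            raise RuntimeError(f"refusing to drain {node}: no other node in service")
        pool.drain(node)
        upgrade(node)
        if not healthy(node):
            # Leave the bad node out of the pool and halt the rollout.
            break
        pool.enable(node)
        upgraded.append(node)
    return upgraded
```

The rollout halts at the first node that fails its health check and leaves it out of rotation, so the remaining untouched nodes keep serving while someone looks at it.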
At the very least, you should configure two boxes such that the backup box has the same MAC so you can easily swap it in without anyone noticing. Then upgrade the backup, swap it in, upgrade the former primary. Probably good to leave things as-is so both boxes get equal time as primary.
Thanks for all of the feedback -- it's useful.
:)
A couple clarifications: we do have redundant systems, on multiple physical machines with redundant power and network connections. If a VM (or even an entire hypervisor) dies, we're generally OK. Unfortunately, some things are very hard to make HA. If a primary database server needs to be rebooted, generally downtime is required. We do have a pretty good monitoring setup, and we also have support staff that work all shifts, so there's always someone around who could be tasked with 'call me if this breaks'. We also have a senior engineer on call at all times. Lately it's been pretty quiet because stuff mostly just works.
Basically, up to this point we haven't automated anything that will / could be done during a maintenance window that causes downtime on a public facing service, and I can understand the reasoning behind that, but we also have lab and QA environments that are getting closer to what we have in production. They're not quite there yet, but when we get there, automating something like this could be an interesting way to go. We're already starting to use Ansible, but that's not completely baked in yet and will probably take several months.
My interest in doing this is partly that sleep is nice, but really, if I'm doing maintenance at 5:30 AM for a window that has to be announced weeks ahead of time, I'm a single point of failure, and I don't really like that. Plus, considering the number of systems we have, the benefits of automating this particular scenario are significant. Proper testing is required, but proper testing (which can also be automated) can be used to ensure that our lab environments do actually match production (unit tests can be baked in). Initially it will take more time, but in the long run anything that can eliminate human error is good, particularly at odd hours.
Somewhat related, about a year ago, my cat redeployed a service. I was up for an early morning window and pre staged a few commands chained with &&'s, went downstairs to make coffee and came back to find that the work had been done. Too early. My cat was hanging out on the desk. The first key he hit was "enter" followed by a bunch of garbage, so my commands were faithfully executed. It didn't cause any serious trouble, but it could have under different circumstances. Anyway, thanks for the useful feedback
Facts have a liberal bias.
The whole reason we used to get paid extra to provide support was to provide support. That meant weird hours, weekends, and late nights.
If you don't like it, get another job.
I do not fail; I succeed at finding out what does not work.
Automating server maintenance is something most IT departments try to do. But you never want to make it too automated.
You said you had 24/7 personnel on call. Let's refer to them as the NOC.
Are they trained to type and follow commands, along with basic (i mean basic..) skills in *nix?
You could cut the cost of investment in automation which can be costly, and focus more on well documented (and tested as much as possible) steps or instructions that they can perform since they're up anyway.. You can use some of the allocated cost in training the NOC a bit more on *nix, scripting etc..
If anything goes wrong then they can call you and you can follow up. Depending on how good your steps are (and a bit of luck) you might end up waking up less than usual.
Of course I ask that silly question up top because you don't want to be awakened at the start of the maintenance by someone saying "Hello sir, your MOP failed at `ls /vra/log`".
Sleep in?
I don't understand. This just means you swing by and do the update after they close the bar and throw you out.
That's SOP around here.
Have gnu, will travel.
I am updating your outward facing mail server, the update fails, where is your god/email now?
If at least some part of your paging and monitoring system isn't independent from your servers then you're doing it wrong.
We use multiple third party companies to monitor our website. It's high-level checks but one of the checks is to check that our internal monitoring software is working. You can purchase third party monitoring software or spin up an instance somewhere like Amazon or DigitalOcean for a few dollars a month. Depending on how critical your systems are you could spin up a few dozen. The point is that you should be monitoring your servers from outside your network for multiple reasons. The first being that it doesn't really matter if everything is up if the outside world can't connect to it and the second being that you still want to be paged if your entire datacenter goes up in smoke.
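The "check that our internal monitoring software is working" part can be sketched as a heartbeat check run from the outside box. This is only an illustration in Python: `monitor_is_alive`, `watch_once`, `fetch_heartbeat`, and `page` are hypothetical names, and the 300 second threshold is an assumption, not something from the comment.

```python
import time
from typing import Callable, Optional


def monitor_is_alive(
    last_heartbeat: Optional[float],
    now: float,
    max_age_seconds: float = 300.0,
) -> bool:
    """Return True only if the internal monitor reported in recently."""
    if last_heartbeat is None:
        return False
    age = now - last_heartbeat
    return 0 <= age <= max_age_seconds


def watch_once(
    fetch_heartbeat: Callable[[], Optional[float]],
    page: Callable[[str], None],
    now: Optional[float] = None,
) -> bool:
    """Run one external check; page someone if the monitor looks dead."""
    now = time.time() if now is None else now
    try:
        heartbeat = fetch_heartbeat()
    except Exception:
        heartbeat = None  # an unreachable monitor counts as a dead one
    alive = monitor_is_alive(heartbeat, now)
    if not alive:
        page("internal monitoring has not reported in; check the datacenter")
    return alive
```

Because an unreachable monitor is treated the same as a silent one, the whole datacenter going dark still produces a page from outside.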
that to me is a red flag all by itself.
the hard truth is:
if you can't be sure that your production system will work after the patch it means that you don't know its state for sure. Which means someone has done manual updates prior to that. Which means your processes are generally not up for a production environment.
What you need to have is a fully automated build, integration and deployment, for the full system, from OS to configuration to apps, database setup etc. Basically you should be able to issue a "deploy" command from some machine, targeting some other machine and then be able to walk away in confidence. Post installation, production test scripts should verify everything that you would by hand, then let you know of the results. Only then you might maybe do some basic manual checks just on the off-chance something went horribly wrong and fooled the test scripts.
Btw, of course you will have tested the deploy commands plus all scripts many times before even thinking to deploy to production.
I know this is far from being a standard approach, but really there is no excuse any more for not doing it this way.
uh.... one word. CLOUD. My idiot boss thinks there is a nebulous magic button called the cloud that can do it all...
Only automate tasks on systems that can be quickly snapshotted and simply QC'd using scripts.
For instance, if you have a web server you want to update weekly, then setup a script on the virtual host that snapshots the virtual machine before the upgrades and then runs a series of checks on the web server after the upgrades. If the web server does not respond as expected to the post-upgrade checks, the virtual host can revert back to the pre-update snapshot and send a message to you notifying you of the upgrade failure. You could also snapshot the failed virtual machine, spin it up on another machine or instance without networking to check the logs for any errors that occurred during upgrades.
If the virtual machine is *nix based, you could mount the snapshot directly on the host and browse the logs as well, or even automate the collection of failed logs too.
Any upgrade procedure that cannot be easily scripted or delayed in such a fashion should be done manually and well attended by someone knowledgeable.
You should have more than 1 admin anyways in case 1 has to call in sick.
1 on west coast and one on east coast would give you 3 hours extra window to avoid the 2 am and 7 am windows.
Or one in Europe, and another in Asia.
Still, automating as much of it as you can is a good idea, allowing you to test out the procedure before performing the maintenance, minimizing your chance of screwing something up.
I do stuff like this a lot at my job.
What I'd do is this:
-Write a script to do the package stuff and the reboot...
-Write another script that's running on a completely different machine/VM... whatever... that pings/wgets/curls/nmaps whatever you need to see that the machine has indeed rebooted and the services you're expecting to be up are operating... like wget through part of your website for example...
-If the script detects an issue via the monitoring machine then it sends out an email to your email address and texts your cell phone
-set this up to run every 5 minutes in cron on your monitoring machine... and if you want text you every 5 minutes at your house while you're sleeping to wake you up....
-get your ass to work if you need to, but if all goes well you get to sleep in most times...
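The second script in that list, the one on the separate machine, could look roughly like this in Python using only the standard library. The function names (`port_open`, `http_ok`, `default_checks`, `run_checks`) and the choice of checking SSH plus the front page are assumptions for illustration. Delivering the alert by email or text is left as a callable you supply.

```python
import socket
import urllib.request
from typing import Callable, List, Tuple

# A check is a label plus a probe that returns True when the service is up.
Check = Tuple[str, Callable[[], bool]]


def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        return False  # any error is treated as "down"


def default_checks(host: str) -> List[Check]:
    """Assumed checks: SSH is reachable and the front page loads."""
    return [
        ("ssh", lambda: port_open(host, 22)),
        ("website", lambda: http_ok(f"http://{host}/")),
    ]


def run_checks(checks: List[Check], alert: Callable[[str], None]) -> List[str]:
    """Run every probe and send one alert listing whatever failed."""
    failures = [name for name, probe in checks if not probe()]
    if failures:
        alert("maintenance check failed: " + ", ".join(failures))
    return failures
```

A small wrapper that calls `run_checks(default_checks(host), alert)` can then go into cron on the monitoring machine with a `*/5 * * * *` schedule, which matches the every 5 minutes suggestion above.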
With Virtualization you should have no real need to do server upgrades out of hours. If you need to upgrade a package/service on offer you should just spin up a brand new instance, have some type of automation piece install and configure everything that the instance needs, have some auto testing application confirm that it's all added, then just add the instance to the load balancer, and decommission the old instance. No more out of hours work unless dealing with hardware issues and with HA these issues usually can be dealt with during business hours. If you are restricted by a limit on resources you should at least be using products like Docker or Solaris Zones to isolate guests from the core OS and separate out application vs core OS needs (the bulk of change usually happens in the application layer so this separation again means less downtime out of hours). Need to update the hypervisor? Live migrate the guests to another piece of hardware and do the maintenance again during business hours. If you don't have the budget you can always spin these kinda solutions up. (DRBD/KVM work a treat). Or as others have said host everything in the cloud.
The best would be to have a system that is highly available like exchange dag for example that allows you to take one server offline without an outage.
If not, tell them you will quit.
When they call your bluff, quit.
Accept a 50% raise the next day.
Or Ansible, Chef, Rexify, SaltStack, ....
Sometimes any specific tool is worse than the root problem. In many locations, that applies to puppet. None of these tools are perfect, but they do support automation for UNIX-like OSes.
For Windows - I only have 1 suggestion. format c:
This is exactly why I don't do IT any more. All the responsibility to keep things working, no authority to make users or departments not screw things up, none of the credit when things go smoothly, but all of the blame when anything goes wrong (no matter who caused it), every department's poorly planned extra computer expenses come out of your budget, all that unpaid overtime means you are barely making minimum wage, you are constantly reminded that your job is hanging in the balance, AND you are expected to keep taking expensive certification classes on your own dime just for the privilege of bending over for one more year.
It's interesting how people are not really discussing utilising off-shore resources. Most large organisation utilise employees in different time zones because of the cost savings and the possibility to have at least some of the team working at any given hour of the day. So you can be in bed in New York at midnight your time, and a worker in India is doing your maintenance activities because it is only 9:30am their time.
I work for a monitoring team. We are 24/7 and I can guarantee you from experience this is a terrible idea. The first time the servers drop out of the monitoring suppression and suddenly a half dozen alarms are going off because your automated server program decided to drag down a series of other servers, or kill the switches at the office, and I get to call you at 4AM, you are going to wonder why you didn't just catch a nap and go back in. Anytime we get an e-mail from a "Senior Server Manager" stating that "a change will be made this weekend at 2AM but will not affect system uptime," we note it in our shift logs because as sure as we are sitting there Murphy will creep up and jump on that manager's back and chew until someone can beat him off. Usually to minimize the damage to our team, we will politely e-mail that manager and ask exactly what systems and what times will this happen as a warning that we really do not want to have to go through the late night procedures to alert someone. Most managers who have experience will actually send us a separate e-mail saying "server XYZ123 will be down from 1AM to 3AM, if we get it up sooner we will call to verify it is up on your end." We monitor 10 different companies of all sizes from a single server room to worldwide systems, and Murphy is a board member for every one and always gets a vote.
"If stupid things work...then they are not stupid."
You should always have a competent tech on hand for maintenance tasks.
I agree with this, but who does maintenance at 1am anymore? What's the point in it? Users are worldwide, and 1am in the US is prime business hours in Asia, so why bother patching/upgrading in the middle of the night?
I haven't done a late-night maintenance in at least a decade. It's all about rolling upgrades. Any problems? Rollback. Need to upgrade infrastructure? Take the entire datacenter offline and serve from your other datacenters. Every single upgrade I've done for as long as I can remember has been at 10am, which is the earliest I can get my lazy-ass junior devs to stumble into the office.
OP needs a process upgrade.
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
Unattended maintenance is not a good idea. Although in many cases you can automate everything, including the step that verifies that the maintenance was successful, you can't automate tshooting all issues when it's unsuccessful. However, automation wherever you can is almost always a good idea, including attended upgrades.
Work towards building a redundant (and/or highly available if possible) production and test infrastructure that minimizes downtime for users, regardless of whether the downtime was caused by unplanned outages, or planned maintenance. With that in place, you can build management confidence that upgrades can be performed during regular hours, since the expected user impact is minimal if any. However, in IT, there is always maintenance that will need to be performed during non peak/after hours, either because that particular user service is that critical, or because the maintenance is that risky, or because the expected downtime for the maintenance is too great.