Ask Slashdot: Unattended Maintenance Windows?
grahamsaa writes: Like many others in IT, I sometimes have to do server maintenance at unfortunate times. 6AM is the norm for us, but in some cases we're expected to do it as early as 2AM, which isn't exactly optimal. I understand that critical services can't be taken down during business hours, and most of our products are used 24 hours a day, but for some things it seems like it would be possible to automate maintenance (and downtime).
I have a maintenance window at about 5AM tomorrow. It's fairly simple — upgrade CentOS, remove a package, install a package, reboot. Downtime shouldn't be more than 5 minutes. While I don't think it would be wise to automate this window, I think with sufficient testing we might be able to automate future maintenance windows so I or someone else can sleep in. Aside from the benefit of getting a bit more sleep, automating this kind of thing means that it can be written, reviewed and tested well in advance. Of course, if something goes horribly wrong having a live body keeping watch is probably helpful. That said, we do have people on call 24/7 and they could probably respond capably in an emergency. Have any of you tried to do something like this? What's your experience been like?
...and while I'm reasonably sure I can execute automated maintenance windows with little to no impact on business operations, I'm not certain. So I don't do it.
If there were more at stake, if the risk vs benefits were tipped more in my company's favor, I might test implement it. But just to catch an extra hour or two of sleep? Not worth it; I want a warm body watching the process in case it goes sideways. 9 times out of 10, that warm body is me.
Mod me down with all of your hatred and your journey towards the dark side will be complete!
Attended automation is the way to go. You gain all the advantages of documentation, testing, etc. If the automation goes smoothly, you only have to watch it for 5 minutes. If it doesn't, you can fix it immediately.
Everyone here is going to tell you that a human needs to be there because that is their livelihood. Any task can be automated at a cost. I am guessing that automating maintenance is not currently part of your job, otherwise you wouldn't be asking. Somewhere up your chain they decided that, for the uptime / quality of service required, it is more cost effective to have a human do it. That does not mean that you cannot present a case showing otherwise. I highly suggest that you win approval and backing before taking time to try to automate anything.
Out of curiosity, are they VMs?
If you really want to automate this sort of thing you should have redundant systems with working and routinely tested automatic fail-over and fallback behavior. With that in place you can more safely set up scheduled maintenance windows for routine stuff and/or pre-written maintenance scripts. But if you are dealing with individual servers that aren't part of a redundancy plan, then you should babysit your maintenance. I say babysit because you should still test and automate the actual maintenance with a script, to prevent typos and other human errors when you are working on production machines. The human is just there in case something goes haywire with your well-tested script.
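For what it's worth, here is a rough Python sketch of what a "babysat" maintenance script could look like: an ordered list of steps that stops at the first failure so the person watching can step in. It is only an illustration, not anyone's actual procedure. The package names `old-package` and `new-package`, the step list, and the `run_steps` helper are all made up for the example; the commands assume a CentOS box with `yum`. It defaults to a dry run, so nothing is executed unless you ask for it.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

# A step is a human-readable name plus the command to run.
Step = Tuple[str, Sequence[str]]

# Hypothetical steps mirroring the window described in the submission.
# "old-package" and "new-package" are placeholder names.
STEPS: List[Step] = [
    ("upgrade", ["yum", "-y", "update"]),
    ("remove package", ["yum", "-y", "remove", "old-package"]),
    ("install package", ["yum", "-y", "install", "new-package"]),
    ("reboot", ["shutdown", "-r", "now"]),
]


def run_steps(
    steps: Sequence[Step],
    runner: Callable = subprocess.run,
    dry_run: bool = True,
) -> List[str]:
    """Run steps in order and stop at the first failure.

    Returns the names of the steps that completed, so the operator
    can see exactly where the run stopped.
    """
    completed: List[str] = []
    for name, cmd in steps:
        if dry_run:
            # Only show what would be run; execute nothing.
            print(f"[dry-run] {name}: {' '.join(cmd)}")
            completed.append(name)
            continue
        result = runner(list(cmd), capture_output=True, text=True)
        if result.returncode != 0:
            # Do not continue past a failed step; leave it to the human.
            print(f"FAILED at '{name}': {result.stderr.strip()}")
            break
        print(f"ok: {name}")
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Safe by default: prints the plan without touching the system.
    run_steps(STEPS)
```

Run it as a dry run first and have the plan reviewed, then run it with `dry_run=False` during the window while someone watches the output.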
Fully automating these sorts of things is out of reach for many small to medium-sized firms because they won't, or can't, invest in the added hardware to build out redundant setups that can continue operating when one participant is offline for maintenance. So the size of your operation, and how much your company is willing to invest to "do it the right way", is the limiting factor in how much of this sort of task you will be able to automate effectively.
Even if I knew that tomorrow the world would go to pieces, I would still plant my apple tree. -Martin Luther
The right answer to this is to have redundant systems so you can do the work during the day without impacting business operations.
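To make that concrete, here is a minimal Python sketch of doing maintenance one node at a time across a redundant pool. Everything in it is hypothetical: `drain`, `maintain`, `is_healthy` and `restore` stand in for whatever your load balancer and tooling actually provide, and `min_in_service` is an assumed safety threshold, not a recommendation.

```python
from typing import Callable, Iterable, List


def rolling_maintenance(
    nodes: Iterable[str],
    drain: Callable[[str], None],
    maintain: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    restore: Callable[[str], None],
    min_in_service: int = 1,
) -> List[str]:
    """Maintain nodes one at a time, keeping the rest in service.

    Stops (by raising) as soon as taking a node out would leave too
    few healthy peers, or a node fails its health check afterwards.
    Returns the list of nodes that were maintained and restored.
    """
    pool = list(nodes)
    done: List[str] = []
    for node in pool:
        # Count healthy peers before removing this node from rotation.
        peers_up = sum(1 for n in pool if n != node and is_healthy(n))
        if peers_up < min_in_service:
            raise RuntimeError(
                f"refusing to drain {node}: only {peers_up} healthy peer(s)"
            )
        drain(node)      # take the node out of rotation
        maintain(node)   # patch, reboot, etc.
        if not is_healthy(node):
            # Leave the node drained so it cannot serve bad traffic.
            raise RuntimeError(f"{node} failed its health check; left drained")
        restore(node)    # put the node back into rotation
        done.append(node)
    return done
```

The point of the structure is that a failure on one node halts the run while the remaining nodes keep serving, which is what makes daytime maintenance tolerable.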
"I either want less corruption, or more chance
to participate in it." -- Ashleigh Brilliant
Puppet is a great tool for automation but does not solve problems like patching and rebooting systems without downtime.
"I either want less corruption, or more chance
to participate in it." -- Ashleigh Brilliant
So it's someone else's fault your test environment doesn't match production?
People often fail to try hard enough to make the test environment (assuming they even have one) match the production environment, but for some problems test never matches production, and essentially never can: some problems only reveal themselves under production *conditions*. For example, I recently spent a significant amount of time troubleshooting a kernel bug that only arose under a very specific (and still not fully characterized) set of disk loads. Test loads, including tests with loads several times higher than production, did not uncover the bug, which caused kernel faults; the faults started occurring at random about a week after the software patch went live.
You should try to keep test as close as possible to production so that testing on it has any validity at all, but you should never assume that testing on the test environment *guarantees* success on production. It's for that reason that, responding to the OP, I have never attempted to do any serious production upgrades in an automated and unattended fashion, and not while I'm alive will any such thing happen on any system I have authority over. As far as I'm concerned, if you decide to automate and go to sleep, make sure your resume is up to date before you do, because if you guess wrong you might not have a job when you wake up.
Even if you guess right, I might decide to fire you anyway if anyone working for me decided to do that without authorization.