Ask Slashdot: Unattended Maintenance Windows?
grahamsaa writes: Like many others in IT, I sometimes have to do server maintenance at unfortunate times. 6AM is the norm for us, but in some cases we're expected to do it as early as 2AM, which isn't exactly optimal. I understand that critical services can't be taken down during business hours, and most of our products are used 24 hours a day, but for some things it seems like it would be possible to automate maintenance (and downtime).
I have a maintenance window at about 5AM tomorrow. It's fairly simple — upgrade CentOS, remove a package, install a package, reboot. Downtime shouldn't be more than 5 minutes. While I don't think it would be wise to automate this window, I think with sufficient testing we might be able to automate future maintenance windows so I or someone else can sleep in. Aside from the benefit of getting a bit more sleep, automating this kind of thing means that it can be written, reviewed and tested well in advance. Of course, if something goes horribly wrong having a live body keeping watch is probably helpful. That said, we do have people on call 24/7 and they could probably respond capably in an emergency. Have any of you tried to do something like this? What's your experience been like?
I have a maintenance window at about 5AM tomorrow. It's fairly simple — upgrade CentOS, remove a package, install a package, reboot. Downtime shouldn't be more than 5 minutes. While I don't think it would be wise to automate this window, I think with sufficient testing we might be able to automate future maintenance windows so I or someone else can sleep in. Aside from the benefit of getting a bit more sleep, automating this kind of thing means that it can be written, reviewed and tested well in advance. Of course, if something goes horribly wrong having a live body keeping watch is probably helpful. That said, we do have people on call 24/7 and they could probably respond capably in an emergency. Have any of you tried to do something like this? What's your experience been like?
Support for off-hour work is part of the job. Don't like it? Find another job where you don't have to do that. Can't find another job? Improve yourself so you can.
You should always have a competent tech on hand for maintenance tasks. Period. If you do not, Murphy will bite you, and then, instead of having it back up by peak hours you are scrambling and looking dumb. In your current scenario, say the patch unexpectedly breaks another critical function of the server. It happens, if you have been in IT any time you have seen it happen. Bite the bullet and have a tech on hand to roll back the patch. Give them time off at another point, or pay them extra for night hours, but thems the breaks when dealing with critical services.
Silence is a state of mime.
Maintenance windows are at off-hours to accomodate real work happening. If every action was painless and produced the desired result, you could do it over lunch or something like that. But that's not the real world.
This begs the question of how the hell are you going to fix unexpected problems in an automated fashion? The answer is, you aren't. Therefore, you have to be up at 2am.
HBI's Law: Frequency of calling others Nazis is directly correlated with the likelihood of the accuser being Communist.
Offshore your maintenance jobs to someone in the correct timezone!
You don't monitor maintenance windows for when everything goes well and is all boring. You monitor them for when things go all to hell and someone needs to correct it.
In any organization I've worked in, if you suggested that, you'd be more or less told "too damned bad, this is what we do".
I'm sure your business users would love to know that you're leaving it to run unattended and hoping it works. No, wait, I'm pretty sure they wouldn't.
I know lots of people who work off hours shifts to cover maintenance windows. My advise to you: suck it up, princess, that's part of the job.
This just sounds like risk taking in the name of being lazy.
Lost at C:>. Found at C.
Support for off-hour work is part of the job. Don't like it? Find another job where you don't have to do that. Can't find another job? Improve yourself so you can.
This is the correct answer. I promise you that at some point, something will fail, and you will have failed by not being there to fix it immediately.
Load balanced or mirrored systems. You can upgrade part of it any time, validate it, then swap it over to the live system when you are happy.
Having someone with little or no sleep doing critical updates is not really the best strategy.
A friend of mine lost his job over a simmilar "automation" task on windows.
Upgrade script was tested on lab environement who was supposed to be exactly like production (but it turns out it wasn't - someone tested something before without telling anyone and did not reverted). Upgrade script was scheduled to be run on production during the night.
Result - \windows\system32 dir deleted from all the "upgraded" machines. Hundreds of them.
On the Linux side i personally had RedHat doing some "small" changes on the storage side and PowerPath getting disabled at next boot after patching. Unfortunate event, since all Volume Groups were using /dev/emcpower devices. Or RedHat doing some "small" changes in the clustering software from one month to the other. No budget for test clusters. Production clusters refusing to mount shared filesystems after patching. Thankfuly on both cases the admins were up & online at 1AM when the patching started and we were able to fix everything in time.
Then you can have glitchy hardware/software deciding not to come back up after reboot. RHEL GFS clusters are known to randomly hang/crash at reboot. HP Blades have sometimes to be physically removed & reinserted to boot.
Get the business side to tell you how much is going to cost the company for the downtime until:
- Monitoring software detects that something is wrong;
- Alert reaches sleeping admin;
- Admin wakes up and is able to reach the servers.
Then see if you can risk it.
1% APY, No fees, Online Bank https://captl1.co/2uIErYq Don't let your $$$ sit in a no-interest acct.
Even on fairly simple things (yum updates from mirrors, AIX PTFs, Solaris patches, or Windows patches released from WSUS), I like babysitting the job.
There is a lot that can happen. A backup can fail, then the update can fail. Something relatively simple can go ka-boom. A kernel update doesn't "take" and the box falls back to the wrong kernel.
Even something stupid as having a bootable CD in the drive and the server deciding it wants to run the OS from that rather than from the FCA or onboard drives. Being physically there so one can rectify that mistake is a lot easier when planned as opposed to having to get up and drive to work at a moment's notice... and by that time, someone else likely has discovered it and is sending scathing E-mails to you, CC:5 tiers of management.
Unless you are updating the Kernel there are few times you need to reboot a centos box. Unless your app has a memory leak.
The better way to go about it has already been pointed out above. Have several systems, load balance them in a pool, take one node out of the pool, work on it, return it to the pool then repeat for each remaining system. - No outage time and users are none the wiser to the update.