Ask Slashdot: Unattended Maintenance Windows?

Puppet. by Anonymous Coward · 2014-07-11 04:25 · Score: 4, Informative

Learn and use Puppet.

Re:Puppet. by Rhys · 2014-07-11 04:35 · Score: 1

That's a failure to test* your code-as-infrastructure, not a puppet failure.
*: Exempting a small subset of physical device issues, though even those can be ignored if you're talking about a VM, so that the physical hardware is never actually in a not-live state.

--
Slashdot Patriotism: We Support our Dupes!
Re:Puppet. by Anonymous Coward · 2014-07-11 04:40 · Score: 1

And a kernel update has never blown up a grub install on a VM...
Nope..never has happened.
Re:Puppet. by bwhaley · 2014-07-11 05:47 · Score: 3, Interesting

Puppet is a great tool for automation but does not solve problems like patching and rebooting systems without downtime.

--
"I either want less corruption, or more chance
to participate in it." -- Ashleigh Brilliant
Re:Puppet. by Rhys · 2014-07-11 06:43 · Score: 1

So... you didn't test... and you have only yourself to blame?
Especially with VMs, it is so easy to snapshot and test things.

--
Slashdot Patriotism: We Support our Dupes!
Re:Puppet. by Lumpy · 2014-07-11 07:12 · Score: 2

Just having a proper IT infrastructure works even better.
Patch and reboot the secondary server at 11am. If everything checks out, put it online and promote it to primary. All done. Now migrate the changes to the backup, pack up the laptop and head home at 5pm... not a problem. Our SQL setup has 3 servers: we upgrade one and promote it, then upgrade #2. #3 stays at the previous revision until 5 days have passed, so we have a rollback. Yes, data is synced across all three; worst case, if TWO servers were to explode, we would lose 15 minutes of data entry.
I NEVER do late at night or weekend maintenance anymore. Servers are dirt fricking cheap to not have redundants always running and ready to drop in.

--
Do not look at laser with remaining good eye.
Re:Puppet. by nabsltd · 2014-07-11 07:19 · Score: 1

Especially with VMs, it is so easy to snapshot and test things.
How, exactly, do you snapshot and test the production VM before the maintenance window and guarantee you won't affect (and by "affect", I mean anything that changes behavior in any way that is not expected by the users) any services running on that VM?
If you meant "clone" instead of "snapshot", that doesn't help either, as the clone will have to have a different IP address, can't connect to the production database, etc.
We've had VMs that have become corrupt in very strange ways so that they would not reboot. The corruption didn't affect any running services, but existed for at least six weeks (we had to go back that far to get a backup that didn't have the issue). Testing a kernel patch that requires a reboot wouldn't have revealed this corruption, as the dev and staging servers didn't have the problem. Testing it on the production server would have revealed it, but we would have to do that during scheduled maintenance anyway....
Re:Puppet. by viperidaenz · 2014-07-11 07:44 · Score: 2

So it's someone else's fault your test environment doesn't match production?
Re:Puppet. by x0 · 2014-07-11 08:26 · Score: 1

I NEVER do late at night or weekend maintenance anymore. Servers are dirt fricking cheap to not have redundants always running and ready to drop in.
Sure, hardware is cheap - software licenses, not so much. (And no, I don't have the option to use free/oss replacements.) When it costs my company $25,000 per license, deploying a primary and two 'backup' servers is not really an option.

m

--
In the immortal words of Socrates, who said; 'I drank what?'
Re:Puppet. by sjames · 2014-07-11 09:27 · Score: 3, Informative

How, exactly, do you snapshot and test the production VM before the maintenance window and guarantee you won't affect (and by "affect", I mean anything that changes behavior in any way that is not expected by the users) any services running on that VM?
Clone it. Upgrade the clone and make sure it works. If so, wipe the clone, snapshot the production VM and upgrade it. If it fails, roll back. Make sure your infrastructure is set up so the clone CAN be properly tested. Yes, sometimes you will have to do that rollback, but with an adequate test setup, frequently you won't.
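A minimal sketch of that clone-then-snapshot sequence, in Python. The thread names no particular hypervisor, so the VM operations are passed in as plain callables; every name here (`upgrade_with_rollback`, `clone`, `snapshot`, `rollback`, `healthy`, and the `"prod"` label) is hypothetical and stands in for whatever your platform actually provides.

```python
from typing import Callable


def upgrade_with_rollback(
    clone: Callable[[], str],
    destroy: Callable[[str], None],
    snapshot: Callable[[], str],
    rollback: Callable[[str], None],
    upgrade: Callable[[str], None],
    healthy: Callable[[str], bool],
    production: str = "prod",  # hypothetical VM identifier
) -> str:
    """Return 'clone-failed', 'rolled-back', or 'upgraded'."""
    # 1. Rehearse the upgrade on a throwaway clone.
    test_vm = clone()
    try:
        upgrade(test_vm)
        clone_ok = healthy(test_vm)
    except Exception:
        clone_ok = False
    finally:
        destroy(test_vm)  # wipe the clone whether or not it worked
    if not clone_ok:
        return "clone-failed"  # production was never touched

    # 2. Snapshot production, upgrade it, roll back on any failure.
    snap = snapshot()
    try:
        upgrade(production)
        prod_ok = healthy(production)
    except Exception:
        prod_ok = False
    if not prod_ok:
        rollback(snap)
        return "rolled-back"
    return "upgraded"
```

How much this buys you depends on `healthy`: a check that only confirms the VM booted will miss exactly the kind of subtle breakage discussed elsewhere in this thread.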
Re:Puppet. by Architect_sasyr · 2014-07-11 12:04 · Score: 2

Talk to your supplier. I've never had an issue getting "spare" licensing for testing servers. Regularly audited for it, sure, but never any real trouble getting the licenses.

--
Me failed English...
FreeBSD over Linux. If my comments seem odd, this may explain...
Re:Puppet. by dnavid · 2014-07-11 12:07 · Score: 3, Interesting

So it's someone else's fault your test environment doesn't match production?
People often fail to try hard enough to make the test environment (assuming they even have one) match the production environment, but for some problems test never matches production, and essentially never can: some problems only reveal themselves under production *conditions*. For example, I recently spent a significant amount of time involved in the troubleshooting of a kernel bug that only arose under a very specific (and still not fully characterized) set of disk loads. Test loads including tests involving loads several times higher than the production load did not uncover the bug, which caused kernel faults, and the faults randomly started occurring about a week after the software patch went live.
You should try to keep test as close as possible to production so testing on it has any validity at all, but you should never assume that testing on the test environment *guarantees* success on production. It's for that reason that, responding to the OP, I have never attempted to do any serious production upgrades in an automated and unattended fashion, and not while I'm alive will any such thing happen on any system I have authority over. As far as I'm concerned, if you decide to automate and go to sleep, make sure your resume is up to date before you do because you might not have a job when you wake up, if you guess wrong.
Even if you guess right, I might decide to fire you anyway if anyone working for me decided to do that without authorization.
Re:Puppet. by Anonymous Coward · 2014-07-11 12:53 · Score: 1

Yup, I'm the same Anonymous Coward as the previous post, but I decided to break things up a bit.
Please, help me understand your perspective here.
The question at the end of the parent post was rhetorical. I assume you don't know that much about the poster. The next questions are not. I'm not just trying to provoke, or blow steam. It's something I've genuinely wondered for decades. Back in the 90s, I was in high school and I just figured I'd understand that more when I got a job in the tech industry. But now that I have some experience and have answered many other questions I had earlier in life, this still eludes me. This post just seemed like the perfect example for me to use to pose this question. I would sincerely appreciate any insight that could be shared about this.
smash's post indicates that people should be using "SAN snapshots. There's no real excuse." Sure, we'll get right on that. After a SAN is used. How about if the organization has never invested in a product called a SAN? That sounds, to me, like an excellent excuse to not be implementing SAN snapshots.
When you're posting to a public board like this, why do you just assume that such advice is feasible for readers to implement? Are you just completely blind to realizing just how impractical such an approach is for some of the less privileged environments? I know that sounds like flame-bait, but I've really wondered that, or assumed that is the case, from a lot of comments that I've read and heard.
Yeah, I expect that such procedures should be implemented by the companies that afford teams of full-time IT staff members because those companies are large ones that have names I recognize from the national stock price ticker. This is great advice for some people. But are you urbanites who serve those organizations just so disconnected from the experience of organizations (like commercial companies, or even lower-budget charities), that you've never conceived of how things are done in other organizations where technology is a non-focus?
When I grew up, some students started vegetarian diets when they learned more details about the extent of animal cruelty. I've heard that today, some students start vegetarian diets when schools teach them that meat comes from animals, because they never knew where meat comes from. (Where the heck else do they think the ingredients of "chicken", or a "fish" sandwich, come from?) I do get the benefits of redundant equipment. But, even with that understanding, I wonder if the advice that everybody should be doing such things is a symptom of this same sort of isolation from how many other people actually operate.
Again, I re-iterate, I'm not trying to attack here. I'm trying to build a bridge, by increasing my understanding. Please, I'm humbly requesting, help me to get your point of view.
Re:Puppet. by upuv · 2014-07-12 13:55 · Score: 1

Puppet is not orchestration. This problem is an orchestration problem. A very simple one but still orchestration.
Puppet is declarative which can mean it has no order to events. Most people make use of some screwball dependency chain in puppet giving the illusion of orchestration.
Use something like Ansible if you want to orchestrate a change.
Re:Puppet. by upuv · 2014-07-12 13:58 · Score: 1

This pattern only works for single nodes.
if you have a complex infrastructure you can't rely on this pattern alone.
Re:Puppet. by sjames · 2014-07-12 14:28 · Score: 2

But the solution will be just a more complex variant on this theme. Consider also that you might have allowed complex to become Rube Goldberg.
Re:Puppet. by nabsltd · 2014-07-13 02:38 · Score: 1

Make sure your infrastructure is set up so the clone CAN be properly tested.
If the clone isn't on the same VLAN, accessing the same data-gathering hardware, there is no way the infrastructure can have a test match production. The data-gathering hardware I'm talking about costs $20-70K in supplies to complete a run, and the hardware itself is in the $250K range, so there is no way to duplicate it.
I'm not saying that we don't try (use data from an old run, etc.), but there is no way to truly duplicate everything, and sometimes you just have to live with that.
Re:Puppet. by sjames · 2014-07-13 02:47 · Score: 1

If the clone isn't on the same VLAN, accessing the same data-gathering hardware, there is no way the infrastructure

Sounds like you should make sure the clone is on the same vlan and accessing the same data gathering hardware, doesn't it?

I'm not saying that we don't try (use data from an old run, etc.), but there is no way to truly duplicate everything, and sometimes you just have to live with that.
So, use the clone with that setup to verify as much as you can. Then use the snapshot to allow you to roll back if things don't work out in production.
If you still can't afford the risk, then you don't need a Maintenance window at all. You need to airgap the network and never update anything on the expensive side.

And if it doesn't work? by Anonymous Coward · 2014-07-11 04:27 · Score: 5, Insightful

Support for off-hour work is part of the job. Don't like it? Find another job where you don't have to do that. Can't find another job? Improve yourself so you can.

Re:And if it doesn't work? by Anonymous Coward · 2014-07-11 06:05 · Score: 2, Insightful

Support for off-hour work is part of the job. Don't like it? Find another job where you don't have to do that. Can't find another job? Improve yourself so you can.
This is the correct answer. I promise you that at some point, something will fail, and you will have failed by not being there to fix it immediately.
Use of monitoring and alerting can alleviate this - access to the system through VPN can provide near-immediate access. It also helps if critical services can be made not to be single points of failure.
Re:And if it doesn't work? by 0123456 · 2014-07-11 06:13 · Score: 1

I promise you that at some point, something will fail, and you will have failed by not being there to fix it immediately.
Yeah, but this way, you won't be the one who has to fix it :).
Of course, you might have to start looking at job ads the next day...
Re:And if it doesn't work? by dreamchaser · 2014-07-11 06:37 · Score: 1

Exactly, and when it comes to maintenance windows one should never forget Murphy. If something can go wrong it will, and being there with a console cable and a laptop or tablet to get into a problem device is a good thing.
Re:And if it doesn't work? by Karl+Cocknozzle · 2014-07-11 06:57 · Score: 1

Support for off-hour work is part of the job. Don't like it? Find another job where you don't have to do that. Can't find another job? Improve yourself so you can.
He might just need a better boss--it sounds like this one expects the guy to stay up all night for maintenance, then come in at 9am sharp, as if he didn't just do a full day's work in the middle of the night.
Rather than automating, he should be lobbying for the right to sleep on maintenance days by shifting his work schedule so that his "maintenance time" IS his workday. "Off-hour work" doesn't mean "Work all day Monday, all night Monday night Tuesday morning, and all day Tuesday." Or, at least, it shouldn't.

--
Who did what now?
Re: And if it doesn't work? by Redbehrend · 2014-07-11 07:53 · Score: 2

I agree we get paid REALLY well, and it's part of the job. Either develop it yourself, pay a minion or find a new job. Lol
Re:And if it doesn't work? by nine-times · 2014-07-11 08:09 · Score: 1

No offense, but that's not a very sensible response. Your job may require off-hours work, but that depends largely on the needs of the company you're supporting, and what you negotiate your job to be. Regardless, there's no reason why you shouldn't try to diminish the amount of off-hours work, and make it as painless as possible.
For example, let's say I have to do server updates similar to what this guy is describing, and my maintenance window is 5am-9am. The updates consist of running a few commands to kick the updates off, waiting for everything to download and install, rebooting, then checking to make sure everything was successful. Because the updates are large and the internet is slow, it sometimes takes 3 hours to perform the updates, but only 10 minutes to roll things back.
It's an exaggerated scenario, but given that basic outline, why wouldn't I just script the update process, and roll in at 8:30 with plenty of time to confirm success and roll things back if needed? What, I should still come in at 5am just because an Anonymous Coward on the Internet decided it was "part of the job"?
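To put numbers on that scenario, here is a small worked calculation in Python. The window, the 3-hour worst case, the 10-minute rollback and the 8:30 arrival all come from the post above; the calendar date and the function name `verification_slack` are arbitrary and only there so the arithmetic runs.

```python
from datetime import datetime, timedelta


def verification_slack(window_start, window_end, update_worst_case,
                       rollback_time, arrival):
    """Time left to confirm success before a rollback must begin."""
    update_done_by = window_start + update_worst_case
    latest_rollback_start = window_end - rollback_time
    # Checking cannot start before the update itself has finished.
    check_from = max(arrival, update_done_by)
    return latest_rollback_start - check_from


day = (2014, 7, 11)  # arbitrary date, only the times matter
slack = verification_slack(
    window_start=datetime(*day, 5, 0),
    window_end=datetime(*day, 9, 0),
    update_worst_case=timedelta(hours=3),
    rollback_time=timedelta(minutes=10),
    arrival=datetime(*day, 8, 30),
)
print(slack)  # 0:20:00
```

So even in the worst case the update is done by 8:00, a rollback has to start by 8:50, and arriving at 8:30 leaves 20 minutes to check things over. Whether 20 minutes is "plenty" depends on how thorough the check needs to be.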
Re:And if it doesn't work? by mjwalshe · 2014-07-11 08:39 · Score: 1

employers don't want to pay for having professionals on call
Re:And if it doesn't work? by sjames · 2014-07-11 09:44 · Score: 1

Sure, but there's no good reason not to minimize it.
Re:And if it doesn't work? by Neil+Boekend · 2014-07-13 21:23 · Score: 1

access to the system through VPN
Unless the error brings down your VPN server.

--
Well, I might have a way, but it only works on a semi spherical planet in a vacuum.

Murphy says no. by wbr1 · 2014-07-11 04:28 · Score: 5, Insightful

You should always have a competent tech on hand for maintenance tasks. Period. If you do not, Murphy will bite you, and then, instead of having it back up by peak hours you are scrambling and looking dumb. In your current scenario, say the patch unexpectedly breaks another critical function of the server. It happens, if you have been in IT any time you have seen it happen. Bite the bullet and have a tech on hand to roll back the patch. Give them time off at another point, or pay them extra for night hours, but thems the breaks when dealing with critical services.

--
Silence is a state of mime.

Re: Murphy says no. by CanHasDIY · 2014-07-11 04:45 · Score: 4, Insightful

This guy probably is the tech but is wanting to spend more time with his family or something.
Probably settled down too fast and can't get a better job now. My advice: don't settle down and quit using your wife and children as excuses for your career failures because they'll grow to hate you for it.
OR, if you want to have a family life, don't take a job that requires you to do stuff that's not family-life-oriented.
That's the route I've taken - no on-call phone, no midnight maintenance, no work-80-hours-get-paid-for-40 bullshit. Pay doesn't seem that great, until you factor in the wage dilution of those guys working more hours than they get paid for. Turns out, hour-for-hour I make just as much as a lot of the managers around here, and don't have to deal with half the crap they do.
The rivers sure have been nice this year... and the barbecues, the lazy evenings relaxing on the porch, the weekends to myself... yea. I dig it.

--
An enigma, wrapped in a riddle, shrouded in bacon and cheese
Re: Murphy says no. by gbjbaanb · 2014-07-11 04:49 · Score: 2

so once a week you have to get up early and do some work.
big deal.
The benefit is that you get to go home early too - and that means you're there to pick up little johnny from school instead of seeing him when you drag your sorry arse in from a full day of meetings and emails and stuff.
Frankly, I wouldn't want to do it every day, but I can't see how the occasional early is anything but a good thing for family life.
Re:Murphy says no. by bwhaley · 2014-07-11 04:50 · Score: 5, Interesting

The right answer to this is to have redundant systems so you can do the work during the day without impacting business operations.

--
"I either want less corruption, or more chance
to participate in it." -- Ashleigh Brilliant
Re: Murphy says no. by PvtVoid · 2014-07-11 04:55 · Score: 5, Funny

This guy probably is the tech but is wanting to spend more time with his family or something.
Probably settled down too fast and can't get a better job now. My advice: don't settle down and quit using your wife and children as excuses for your career failures because they'll grow to hate you for it.
Congratulations! You're management material!
Re:Murphy says no. by David_Hart · 2014-07-11 05:00 · Score: 4, Informative

Here is what I have done in the past with network gear:
1. Make sure that you have a test environment that is as close to your production environment as possible. In the case of network gear, I test on the exact same switches with the exact same firmware and configuration. For servers, VMWare is your friend....
2. Build your script, test, and document the process as many times as necessary to ensure that there are no gotchas. This is easier for network gear as there are fewer prompts and options.
3. Build in a backup job in your script, schedule a backup with enough time to complete before your script runs, or make your script dependent on the backup job completing successfully. A good backup is your friend. Make a local backup if you have the space.
4. Schedule your job.
5. Get up and check that the job completed successfully either when the job is scheduled to be completed or before the first user is expected to start using the system. Leave enough time to perform a restore, if necessary.
As you can probably tell, doing this in an automated fashion would take more time and effort than baby sitting the process yourself. However, it is worth it if you can apply the same process to a bunch of systems (i.e. you have a bunch of UNIX boxes on the same version and you want to upgrade them all). In our environment we have a large number of switches, etc. that are all on the same version. Automation is pretty much the only option given our scope.
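Steps 3 to 5 above could be sketched roughly like this in Python: the update only runs if the backup command exits cleanly, and the outcome goes into a status file to look at in the morning. The function name, the command lists and the status path are placeholders supplied by the caller, not anything from the post.

```python
import json
import subprocess
import time
from pathlib import Path


def run_gated_update(backup_cmd, update_cmd, status_path):
    """Run update_cmd only if backup_cmd exits 0; record what happened."""
    status = {"started": time.time(), "backup": None, "update": "skipped"}

    # Step 3: the update depends on a successful backup.
    backup = subprocess.run(backup_cmd)
    status["backup"] = backup.returncode
    if backup.returncode == 0:
        update = subprocess.run(update_cmd)
        status["update"] = update.returncode

    # Step 5: leave a record to check before the first user arrives.
    status["finished"] = time.time()
    Path(status_path).write_text(json.dumps(status))
    return status
```

Step 4 is left to whatever scheduler is already in use (cron, at, or similar). An exit code of 0 from the backup tool only says the tool thinks it succeeded; it does not prove the backup restores.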
Re: Murphy says no. by hodet · 2014-07-11 05:03 · Score: 1

you've just described my life. amen brother.
Re:Murphy says no. by Vellmont · 2014-07-11 05:10 · Score: 1

say the patch unexpectedly breaks another critical function of the server. It happens, if you have been in IT any time you have seen it happen

Yes, this happens all the time. And really it's a case for doing the upgrade when people are actually using the system. If the patch happens at 2am (chosen because nobody is using it at 2am), nobody is going to notice it until the morning. The morning, when the guy who put in the patch is still trying to recover from having to work at 2am. At the very least groggy, and not performing at his/her best.

--
AccountKiller
Re:Murphy says no. by wisnoskij · 2014-07-11 05:13 · Score: 1

This. No matter what you do, this maintenance and downtime is hundreds of times more likely to go wrong than normal running times. What is the point of even employing IT if they are not around for this window?

--
Troll is not a replacement for I disagree.
Re:Murphy says no. by mshieh · 2014-07-11 05:14 · Score: 1

Can't agree enough, regular downtime is the root of the problem.
Usually you still want to do it off-peak just in case you're caught with reduced capacity.
Re: Murphy says no. by ranelen · 2014-07-11 05:15 · Score: 1

exactly. it doesn't really matter if you are there or not. eventually something is going to break in a new and interesting way that can't be fixed without a significant amount of work.
generally we try to have at least three systems for any production service so that we can still have redundancy while doing maintenance.
that said, I rarely come in for patching anymore. I just make sure I'm available in case something doesn't come up afterwards. (no binge drinking on patch nights!)
redundancy and proper monitoring make life much, much nicer.

--
--jcbender
Re:Murphy says no. by bwhaley · 2014-07-11 05:18 · Score: 1

Yup. Very dependent on the business, the application, the usage patterns, etc.

--
"I either want less corruption, or more chance
to participate in it." -- Ashleigh Brilliant
Re:Murphy says no. by Anonymous Coward · 2014-07-11 05:20 · Score: 1

Yeah, no shit. I love how everyone is all like: "quit whining and babysit that shit" When babysitting isn't necessary if you have a redundant system in place to recover from failure edge-cases. Get a steady state redundant system then maintain them on alternating weeks. If either one fails during automated maintenance then you can just switch over to the backup until you've had breakfast. Better yet, use Amazon EC2 for your infrastructure so you can spool up as many redundant systems as necessary.
Re:Murphy says no. by smash · 2014-07-11 05:25 · Score: 1

Yup. Although, that said, if you have a proper test environment, like say, a snap-clone of your live environment and an isolated test VLAN, you can do significant testing on copies of live systems and be pretty confident it will work. You can figure out your back-out plan, which may be as simple as rolling back to a snapshot (or possibly not).
Way too many environments have no test environment, but these days with the mass deployment of FAS/SAN and virtualization, you owe it to your team to get that shit set up.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Re: Murphy says no. by LordLimecat · 2014-07-11 05:26 · Score: 1

At least where I work maintenance is a once a month thing; I'm led to believe this is normal by anecdotal evidence on the internet.
Your average work week ends up at like 42 hours if you factor that in; it's really not that onerous.
Re: Murphy says no. by master_kaos · 2014-07-11 05:27 · Score: 1

yup same here, while my yearly salary isn't great I work 35 hour weeks, 4 weeks vacation, 10 sick days, multiple breaks per day, rarely ever any OT (and while we are salaried and don't get OT pay we instead get time in time and a half off). Hour-for-hour I probably make more than a lot of managers as well. Would I like to make double what I am making? Sure, but I would NOT be willing to put in double the work.
Re:Murphy says no. by Culture20 · 2014-07-11 05:28 · Score: 1

say the patch unexpectedly breaks another critical function of the server.
When this happens, it usually takes a lot longer to fix than it takes to drive in to work, because the way it breaks is unexpected. The proper method is to have an identical server get upgraded with this automatic maintenance window method the day before while you're at work or at least hours before the primary system so that you can halt the automatic method remotely before it screws up the primary system. If the service isn't important enough, let your monitoring software wake you up if there's a failure or ignore it until you get in at your normal time. Most of the time, having a regularly well-rested sysadmin is more important to a company than having "light-switch monitoring server three" running between 4AM and 8AM.
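One way to picture the "identical server first, primary later, with a remote brake" idea is the Python sketch below. It is only an illustration: `staged_rollout`, `patch`, `healthy` and `halted` are hypothetical names, and the hours-long gap between stages that the post relies on is left out (in practice each stage would more likely be its own scheduled job).

```python
from typing import Callable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    patch: Callable[[str], None],
    healthy: Callable[[str], bool],
    halted: Callable[[], bool],
) -> List[str]:
    """Patch hosts in order; return those that were patched and passed.

    Stops before touching the next host if halted() is true, or if the
    host that was just patched fails its health check.
    """
    completed: List[str] = []
    for host in hosts:
        if halted():
            break  # someone pulled the brake remotely
        patch(host)
        if not healthy(host):
            break  # leave the remaining hosts untouched
        completed.append(host)
    return completed
```

Called with the identical test server listed ahead of the primary, a failed check on the first host means the primary is never patched. `halted` could be as simple as checking whether a flag file exists, which gives you something to create from your phone without logging in to fix anything.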
Re: Murphy says no. by smash · 2014-07-11 05:30 · Score: 4, Insightful

This is why you build a test environment. VLANs, virtualization, SAN snapshots. There's no real excuse. Articulate the risks that a lack of a test environment entails to the business, and ask them if they want you doing shit without being able to test to see if it breaks things. Do some actual calculations on cost of system failure, and explain to them ways in which it can be mitigated. Putting your head in the sand and just breaking shit in live... well, that's one way to do it, but I fucking guarantee you: it WILL bite you in the ass, hard, one day, whether it is automated or not. If you have a test environment, you can automate the shit out of your process, TEST it, and TEST a backout plan before going live.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Re: Murphy says no. by Vlad_the_Inhaler · 2014-07-11 05:37 · Score: 1
I have no idea if once a week is realistic, it sounds far too high. I have around 5-10 such windows a year, some are stuff I can do from home (with support from the guys on shift) and some entail me being physically there, so there have been none of the second kind this year.
Major Outages of one of our production systems have been featured on national news and Slashdot before, although it requires an outage of several hours to cross that threshold. Our windows are at around 02:00 to 03:00 depending on which system is affected.
Murphy has really bitten us in the ass a few times:
- Someone making an update (on a test system) which meant that the system did not come up properly after the next reboot which was days later. The symptoms made it look as though the test "window update" caused the problems. It was an accident but very annoying.
- A weird error on one switchable hardware unit rendered it unusable on our main production system. That unit was one of 32 and the allocation system automatically only used it on other machines, the next reboot would have cleared the problem anyway. Someone decided to use *that* unit for a critical update and brought it up manually for that purpose. The update failed and our main system was down. I drove in at 03:30 and (I thought) fixed things by falling back. Shortly after I left again, one application stopped working and dragged the rest down with it. I went back in again and did the original update cleanly - over initial management objections - after which things were fine.
There have been others but they were even more arcane. The absolute worst cases we had were with virtually everyone there. They made the news, two of them made it to Slashdot. Different causes in each case.
--
Mielipiteet omiani - Opinions personal, facts suspect.
Re:Murphy says no. by Anonymous Coward · 2014-07-11 05:45 · Score: 1

A few things to add:
6: Have some sort of rollback procedure. For example, if someone is doing maintenance on a database server, either export the non-system volumes or back the whole thing up so it can be bare-metal restored [1]. For VMs, this is easy. Snapshot the sucker, upgrade, test to make sure everything is hunky-dory, and perhaps move the snapshot off for an archive. For stuff in a SAN, it can be easy, or difficult. Snap a 4TB+ LUN on a VNX, and there may be a chance that the drive controller might just pop its top (which is why you have MPIO.) Have neither? Do a backup, both on the application/DB level, and the system level. You have backups of production machines, right? If not, stop what you are doing and get one. A downed production machine with no backups can kill an entire company, and you don't want to be the guy whose head that mountain falls on.
7: Have a changelog somewhere. Doesn't have to be an exact history, but when changing a setting like virtual CPUs in a VM, RAM allocated, VM priority, for the love of $DEITY, write it down and save it some place accessible. Even better, print it, stick it in a paper notebook so it is accessible in a lights-off scenario.
8: Let people know what you are doing. An admin decides to do a driver update. Little does he know that the application uses specialized low-level drivers on some hardware for performance reasons. Ka-blooey happens, and all the fingers point to the poor admin who just clicked a box in Windows Update. Having fingers pointed in your direction when downtime happens is not good for the career. It usually means that you get first in line to be booted.
9: Don't do too many things at once. That way, when something breaks, it isn't that difficult to find the culprit.
10: Make sure you have the hardware vendor, OS vendor, DB vendor, and app vendor's support IDs ready to go. This is production hardware, so it goes without saying it is on a support contract. Even better, make sure the contracts are current before the downtime window, so you don't have to wake up the bloke with the purse strings at 2:00 AM to renew before you can create a sev-1, all-is-down ticket.
11: If part of the upgrade is a power down, consider powering down the hardware completely. As in, after powering it down via the usual commands, walking to the box, and yanking the cords. Then let it sit for a minute or so, plug it back in and kick it off. Some pieces of hardware get into futzy states unless power cycled. I've encountered a disk array which would lose its management head access unless power cycled every 18 months.
12: Again, keep a log of events. If you have to get some support on the phone, a piece of paper or a Notepad document is better than your memory at 3:00 in the morning.
13: Don't do too much in a window. Better to schedule multiple outages than to try to do a major thing all at once. For example, if you need to upgrade the OS, application, DB, firmware, drive firmware, and DB updates... try small chunks, so you know they work, and if something does happen, it can be rolled back fairly quickly. Save yourself a margin of time.
14: It is assumed all of this has been tested in a test environment as similar to production as possible. If not, document that.
15: Did I mention documenting anything odd, big, or small? The machine seeing a blip with a transitory RAM error should be at least noted. The more documentation, the better cover for the derriere should something break and the blamestorming starts.
16: For network gear, did you save the firmware configs on every network device? A lot of people forget to move run config to start config... and things break in a spectacular manner. Are the configs stashed somewhere safe, accessible... but still secure?
17: Have you tested restore capability? I have seen virtually every backup program (be it enterprise utilities or whatever stuff someone digs up for "free") out there perform backups with no errors, but come restore
Re:Murphy says no. by gewalker · 2014-07-11 06:07 · Score: 2

It's even more fun when the CEO stops by, in person, to see how long it is going to take to get things working again. Though it was not my fault either time, I've been there actually fixing the problem, and it certainly is attention getting. Neither CEO was being a jerk, he just really needed to know what was going on without any b/s filters by intermediate management. Try imagining that visit if you had just been running an automated script to apply the patch.
So yeah, if it is important, you need to be there, and if drive time is a potential issue, you need to physically be there, not just monitoring the change from home.
Re:Murphy says no. by NotSanguine · 2014-07-11 06:13 · Score: 2, Insightful

...Better yet, use Amazon EC2 for your infrastructure so you can spool up as many redundant systems as necessary.
Exactly. Because if Amazon screws up, they won't blame you. That fantasy and a couple bucks will get you a Starbucks latte.
Using someone else's servers is always a bad idea for critical systems. Virtualization is definitely the way to go, but use your own hardware. Yes, that means you need to maintain that hardware, but that's a small (or not so small, in a large environment -- but worth it) price to pay because Murphy was an optimist.

--
No, no, you're not thinking; you're just being logical. --Niels Bohr
Re: Murphy says no. by CanHasDIY · 2014-07-11 06:33 · Score: 2

Would I like to make double what I am making? Sure, but I would NOT be willing to put in double the work.
Not for these fuckers, anyway.
Were I to strike out on my own, I don't think I'd mind all the extra hours, but it's easy to see things differently when you're your own boss.

--
An enigma, wrapped in a riddle, shrouded in bacon and cheese
Re:Murphy says no. by bmimatt · 2014-07-11 06:41 · Score: 1

Yes.
Also, this is one of these scenarios, where virtualization pays. You can simply spin up a new set of boxes (ideally via puppet,chef, whatever) and cut over to it once the new cluster has been thoroughly tested and tested some more. Human eye watching/managing the cutover still recommended, if not required.
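A rough Python sketch of that "test, test some more, then cut over" gate. Everything here is hypothetical: `ready_for_cutover`, `cut_over`, `is_healthy` and `switch_traffic` are made-up names, not part of puppet, chef or any other tool, and the real health check and traffic switch would be whatever your environment uses.

```python
from typing import Callable, Iterable, List


def ready_for_cutover(hosts: Iterable[str],
                      is_healthy: Callable[[str], bool],
                      required_passes: int = 3) -> bool:
    """True only if every new host passes the check required_passes times in a row."""
    for host in hosts:
        for _ in range(required_passes):
            if not is_healthy(host):
                return False
    return True


def cut_over(new_hosts: Iterable[str],
             is_healthy: Callable[[str], bool],
             switch_traffic: Callable[[List[str]], None],
             required_passes: int = 3) -> bool:
    """Switch traffic to the new cluster only after it passes the gate.

    Returns True if traffic was switched, False if the old cluster was left live.
    """
    hosts = list(new_hosts)
    # An empty cluster is never a valid cutover target.
    if not hosts or not ready_for_cutover(hosts, is_healthy, required_passes):
        return False
    switch_traffic(hosts)
    return True
```

The point of returning a boolean instead of raising is that the human watching the cutover decides what happens next; the script only refuses to switch when a check fails.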
Re: Murphy says no. by nabsltd · 2014-07-11 07:25 · Score: 1

so once a week you have to get up early and do some work.
I don't think that the "2am" listed in TFS is "getting up early". Instead, it's more like "staying up late".
For me, it's not really a problem, but I have had to do that kind of maintenance as a team, and some people are just useless if they stay up that long, or even got a short nap. My current job gives us all day one Saturday a month for maintenance, so you can sleep like normal and get up when appropriate (one hour worth of work, start at 2 in the afternoon if you want...7 hours of work, better start before noon). A lot fewer mistakes seem to be made with this sort of schedule.
Re:Murphy says no. by Zenin · 2014-07-11 08:31 · Score: 1

Or it's not at all dependent on those factors.
It's much more a matter of how much someone cares to put redundancy in place. Doing it right affects the entire stack: Code architecture, deployment tooling, infrastructure architecture and costing.
It's a large reason why PaaS is gaining momentum: This is all assumed and it ends up being easier to do it the right way (that includes all this) from the start than doing it any other way, given that most all of the boiler plate aspects are already built.
If you're building services that still require "regular maintenance windows" in 2014, you're doing it wrong.

--
My /. uid is better then your /. uid
Re:Murphy says no. by bwhaley · 2014-07-11 08:39 · Score: 1

If you're building services that still require "regular maintenance windows" in 2014, you're doing it wrong.
This is a really nice sentiment but is in fact somewhat disconnected from reality.
In the web world, building zero downtime services that don't require maintenance is doable. In many enterprise IT environments with legacy or bloated software (hospitals, education, government) it's a non-starter. The staff do not have the skill, the applications don't have the support, and the political will within the organization is not there. Database migrations alone can be a major source of downtime, and that's largely true even for web services.

--
"I either want less corruption, or more chance
to participate in it." -- Ashleigh Brilliant
Re:Murphy says no. by Zenin · 2014-07-11 08:57 · Score: 3, Insightful

In general, don't do anything that isn't your core business. Or another way of saying it, Do What Only You Can Do.
If you are an insurance company, is building and maintaining hardware your business? No, not in the slightest. You have no more business maintaining computer hardware as you have maintaining printing presses to print your own claims forms.
Maintaining hardware and the rest of the infrastructure stack however, is the business of Amazon AWS, Windows Azure, etc. The "fantasy" you're referring to is the crazy idea that you, as some kind of God SysAdmin, can out-perform the world's top infrastructure providers at maintaining infrastructure. Even if you were the best SysAdmin alive on the planet, you can't scale very far.
Sure, any of those providers can (and do, frequently) fail. Still, they are better than you can ever hope to be, especially once you scale past a handful of servers. If you are concerned that they still fail, that's good, yet it's still a problem worst addressed by taking the hardware in house. A much better solution is to build your deployments to be cloud vendor agnostic: Be able to run on AWS or Azure (or both, and maybe a few other friends too) either all the time by default or at the flip of a (frequently tested) switch.
Even building in multi-cloud redundancy is far easier, cheaper, and more reliable than you could ever hope to build from scratch on your own. That's just the reality of modern computing.
There are reasons to build on premises still, but they are few and far between. Especially now that cloud providers are becoming PCI, SOX, and even HIPAA capable and certified.

--
My /. uid is better then your /. uid
Re:Murphy says no. by NotSanguine · 2014-07-11 09:42 · Score: 1

In general, don't do anything that isn't your core business. Or another way of saying it, Do What Only You Can Do.
If you are an insurance company, is building and maintaining hardware your business? No, not in the slightest. You have no more business maintaining computer hardware as you have maintaining printing presses to print your own claims forms.
Maintaining hardware and the rest of the infrastructure stack however, is the business of Amazon AWS, Windows Azure, etc. The "fantasy" you're referring to is the crazy idea that you, as some kind of God SysAdmin, can out-perform the world's top infrastructure providers at maintaining infrastructure. Even if you were the best SysAdmin alive on the planet, you can't scale very far.
Sure, any of those providers can (and do, frequently) fail. Still, they are better than you can ever hope to be, especially once you scale past a handful of servers. If you are concerned that they still fail, that's good, yet it's still a problem worst addressed by taking the hardware in house. A much better solution is to build your deployments to be cloud vendor agnostic: Be able to run on AWS or Azure (or both, and maybe a few other friends too) either all the time by default or at the flip of a (frequently tested) switch.
Even building in multi-cloud redundancy is far easier, cheaper, and more reliable than you could ever hope to build from scratch on your own. That's just the reality of modern computing.
There are reasons to build on premises still, but they are few and far between. Especially now that cloud providers are becoming PCI, SOX, and even HIPAA capable and certified.
Yes. AWS, Azure, etc. are focused on (and are actually pretty good at) providing compute services (whether that be PaaS or straight-up VMs). However, what they are not is contractually responsible for the safekeeping or integrity of your data.
There are definitely use cases for using "someone else's servers." Use them for external-facing resources like a web presence, customer portal, extranet services or even email. But when it comes to business critical systems and data, no one has a more compelling motive to secure and maintain them than an internal IT staff.
I imagine you'll disagree with me, which is fine. I would point out that despite the costs of implementing and maintaining a highly available internal virtualization environment, many of those costs are significantly offset by the usage and maintenance contracts as well as network connectivity required to support internal access to "someone else's servers."
In the end, it's a matter of balancing the costs against the criticality and confidentiality of the data, IMHO.
Assuming it would require me to provide personal information, remind me not to do business with whatever company you work for. Then again, if you're a shill for a "cloud" (marketing-speak for "someone else's servers"), I understand. Either way, carry on.

--
No, no, you're not thinking; you're just being logical. --Niels Bohr
Re:Murphy says no. by dnavid · 2014-07-11 12:15 · Score: 1

The right answer to this is to have redundant systems so you can do the work during the day without impacting business operations.
The right answer is you build in as much redundancy as you can, but you still do the work in as careful a manner as possible during downtime windows when necessary so that you don't waste the redundancy you have. You will look like the world's biggest idiot if you spend a huge amount of money and design resources making sure you have two of everything for redundancy, and while you're cavalierly upgrading the B systems because you have redundancy the A systems go down. Which they will, precisely when you bring down B for middle of the day upgrades, because the god of maintenance hates you, always has hated you, and always will hate you.
If you can afford N+2 or N+3 redundancy *everywhere* then you shouldn't be asking anyone else for availability advice.
Re:Murphy says no. by Vellmont · 2014-07-13 10:57 · Score: 1

I don't believe I mentioned the number of people, merely that upgrading when nobody was using the system creates another risk that you won't know about till much later.
People in IT seem to want the "perfect" solution, which doesn't exist, or at the very least a black/white kind of thinking. Everything is tradeoffs and it's important to understand what those tradeoffs are. I've also seen people seem to think all situations and organizations are the same. (Obviously very, very wrong).
But I will say this. In some cases the best solution might be to upgrade the system when people are still using it, so that it can be switched back quickly.

--
AccountKiller
Re:Murphy says no. by Slashdot+Parent · 2014-07-14 05:24 · Score: 1

Even our network upgrades we do in the middle of the workday. After all testing is done, we just take a whole datacenter down and upgrade it. Once it's back online, we do the next datacenter until it's all done.
There's really no reason in 2014 to do 1am maintenance.
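For illustration, the one-datacenter-at-a-time pattern can be sketched in Python as below. This is an assumption about how such a rollout might be scripted, not a description of the poster's tooling; `rolling_upgrade`, `upgrade` and `is_healthy` are hypothetical names.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def rolling_upgrade(sites: Sequence[str],
                    upgrade: Callable[[str], None],
                    is_healthy: Callable[[str], bool]
                    ) -> Tuple[List[str], Optional[str]]:
    """Upgrade sites one at a time, stopping at the first unhealthy result.

    Returns (sites upgraded and verified, site that failed or None).
    Sites after a failure are left untouched on the old version.
    """
    verified: List[str] = []
    for site in sites:
        upgrade(site)
        if not is_healthy(site):
            # Stop here so the remaining sites keep carrying the load.
            return verified, site
        verified.append(site)
    return verified, None
```

Stopping at the first failure is what keeps this safe in the middle of the workday: at most one site is ever in an unknown state.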

--
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock

I've toyed with this concept.. by grasshoppa · 2014-07-11 04:29 · Score: 5, Interesting

...and while I'm reasonably sure I can execute automated maintenance windows with little to no impact to business operations, I'm not sure. So I don't do it.

If there were more at stake, if the risk vs benefits were tipped more in my company's favor, I might test implement it. But just to catch an extra hour or two of sleep? Not worth it; I want a warm body watching the process in case it goes sideways. 9 times out of 10, that warm body is me.

--
Mod me down with all of your hatred and your journey towards the dark side will be complete!

Re:I've toyed with this concept.. by mlts · 2014-07-11 05:56 · Score: 3, Insightful

Even on fairly simple things (yum updates from mirrors, AIX PTFs, Solaris patches, or Windows patches released from WSUS), I like babysitting the job.
There is a lot that can happen. A backup can fail, then the update can fail. Something relatively simple can go ka-boom. A kernel update doesn't "take" and the box falls back to the wrong kernel.
Even something as stupid as having a bootable CD in the drive and the server deciding it wants to run the OS from that rather than from the FCA or onboard drives. Being physically there so one can rectify that mistake is a lot easier when planned as opposed to having to get up and drive to work at a moment's notice... and by that time, someone else likely has discovered it and is sending scathing E-mails to you, CC: 5 tiers of management.
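The "kernel update didn't take" case is at least cheap to detect after the reboot. A minimal Python sketch, assuming you recorded the release string you expected before rebooting; `kernel_matches` is a made-up helper and the version strings in the usage lines are only examples.

```python
import platform
from typing import Optional


def kernel_matches(expected: str, running: Optional[str] = None) -> bool:
    """Compare the running kernel release against the one the update should have installed.

    running defaults to platform.release(), which reports the release string
    of the kernel that is booted right now.
    """
    if running is None:
        running = platform.release()
    return running == expected


# Example values only: the box fell back to the old kernel.
print(kernel_matches("3.10.0-123.4.2.el7.x86_64", running="3.10.0-123.el7.x86_64"))
```

A check like this tells you the box came back on the wrong kernel; it does nothing for the case where the box never came back at all, which is the argument for babysitting.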
Re:I've toyed with this concept.. by pr0fessor · 2014-07-11 06:57 · Score: 1

I always test in advance, have a roll back plan, only automate low risk maintenance, test the results remotely, and have a warm body on back up should the need arise. Saves a little sleep since I don't babysit the entire process just the result. I don't have physical access to most of the equipment since it's scattered across multiple data centers so I do most of my work remotely anyway.
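As a sketch of what "have a roll back plan" can look like once it is automated, here is a small Python step runner. It is illustrative only: `run_steps` and the (name, do, undo) tuple layout are assumptions made for this example, and real steps would call your actual patch and rollback commands.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None], Callable[[], None]]  # (name, do, undo)


def run_steps(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; on the first failure, undo completed steps in reverse.

    Returns (success, log lines) so a human can review what happened.
    """
    log: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, do, undo in steps:
        try:
            do()
        except Exception as exc:
            log.append(f"FAILED {name}: {exc}")
            # Roll back only what actually ran, newest first.
            for done_name, done_undo in reversed(completed):
                done_undo()
                log.append(f"rolled back {done_name}")
            return False, log
        completed.append((name, undo))
        log.append(f"ok {name}")
    return True, log
```

The returned log is the part to check remotely afterwards; if an undo step itself raises, the exception propagates, which is a deliberate "wake the warm body" outcome rather than something the script tries to hide.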

Automated troubleshooting? by HBI · 2014-07-11 04:29 · Score: 5, Insightful

Maintenance windows are at off-hours to accommodate real work happening. If every action was painless and produced the desired result, you could do it over lunch or something like that. But that's not the real world.

This begs the question of how the hell are you going to fix unexpected problems in an automated fashion? The answer is, you aren't. Therefore, you have to be up at 2am.

--
HBI's Law: Frequency of calling others Nazis is directly correlated with the likelihood of the accuser being Communist.

Re:Automated troubleshooting? by HBI · 2014-07-11 04:54 · Score: 2

How about looking up "pedantry".

--
HBI's Law: Frequency of calling others Nazis is directly correlated with the likelihood of the accuser being Communist.
Re:Automated troubleshooting? by gstoddart · 2014-07-11 05:05 · Score: 1

I'm seeing this increasingly often......misuse of the phrase "begs the question". Why don't you look it up?
There are now two distinct phrases in the English language:
There is the logical fallacy of begging the question.
Sometimes, an event happens which begs (for) the question of why nobody planned for it.
You might think you sound all clever and stuff, but you're wrong. They sound similar, but they aren't the same. The second one has been in common usage for decades now, and has nothing to do with the logical fallacy.

--
Lost at C:>. Found at C.
Re:Automated troubleshooting? by mshieh · 2014-07-11 05:15 · Score: 1

If you have proper monitoring, you don't need to be up at 2am. You just need to be willing to answer the phone at 2am.
Re:Automated troubleshooting? by HBI · 2014-07-11 05:40 · Score: 1

That is not true. If my job is important and my systems are important, I'm on site to make sure that change is successful.
When I was with IBM, our policy was to open up a conference call and have all the requisite support staff on the call until the change window closed. You paid through the nose for that kind of support, but our downtime was minimal and some customers needed that.
When I am working in theater on critical systems in wartime, I don't sit in my fucking hooch and use automated tools. My ass is in front of the boxes in question to respond instantly. The alternative is broken tactical systems meaning bad information being used to make decisions meaning dead people.
Your slack attitude doesn't cut it in the places I work.

--
HBI's Law: Frequency of calling others Nazis is directly correlated with the likelihood of the accuser being Communist.
Re:Automated troubleshooting? by CAIMLAS · 2014-07-11 08:04 · Score: 1

Or, chances are (if you're the ONLY sysadmin on staff), other people could stand not working for a while at 8pm once every other week while you do your maintenance at a saner hour. If you're not big enough to have multiple sysadmins and/or multiple tiers of redundancy, chances are you aren't big enough to justify 365/24 uptime. Someone else can not work so you can get work done, to enable them to keep working.
They probably work too much, anyway. No need for that to make you work too much, too.

--
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers

This is why n+2 and Vmware are so useful. by Anonymous Coward · 2014-07-11 04:30 · Score: 1

If you have a high availability system with more than one backup node then daytime maintenance becomes very doable.

Attended automation by Anonymous Coward · 2014-07-11 04:30 · Score: 3, Interesting

Attended automation is the way to go. You gain all the advantages of documentation, testing etc. If the automation goes smooth, you only have to watch it for 5 mins. If it doesn't, then you can fix it immediately.

Schedule some days as offset days by ModernGeek · 2014-07-11 04:30 · Score: 1

You just need to schedule some of your days as offset days. Work from 4pm to midnight some days so that you can get some work done when others aren't around. Some days require you being around people, some days command you be alone.

Or you can just work 16hour days like the rest of us and wear it with a badge of honor.

If you are your own boss and do this, you can earn enough money to take random weeks off from work with little to no notice so that you can travel the world, and do some recruiting while doing it so that you can write the expenses off on the company.

--
Sig: I stole this sig.

Re:Schedule some days as offset days by DarkOx · 2014-07-11 04:44 · Score: 1

Pretty much this. If your company is big enough or drives enough revenue from its IT systems that require routine off hours maintenance they should staff for that.
That is not to say that if it's just Patch Tuesdays they need to; or the occasional rare major internal code deployment that happens a couple of times a year or so. For that you as the admin should suck it up, and roll out of bed early once in a while. Hopefully your bosses are nice and let you have some flextime for it. Knock out at 3p on Fridays those weeks or something.
If there is a regular maintenance window that is frequently used, say at least twice a week, then they need to make that the regularly scheduled working hours for some employee(s). Maybe some junior admin who can follow deployment instructions works 3a-10a Tuesdays and Wednesdays; but let's be fair to that person: they have a life outside of work and deserve to have a predictable schedule. They should still work those hours even if there is nothing going on that week, and just use the time to do whatever else they do: update documentation, test out new software versions, inventory, etc.

--
Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
Re:Schedule some days as offset days by CanHasDIY · 2014-07-11 04:47 · Score: 1

Or you can just work 16hour days like the rest of us and wear it with a badge of honor.
IMO, there is no honor in working more hours than you're actually being paid to work. Not only are you hurting yourself, you're keeping someone else from being able to take that job.
If you've got 80 hours worth of work to do at your company, and one guy with a 40-hour-a-week contract, you need to hire another person, not convince the existing guy that he should be proud to be enslaved. Morally speaking.

--
An enigma, wrapped in a riddle, shrouded in bacon and cheese
Re:Schedule some days as offset days by QRDeNameland · 2014-07-11 04:57 · Score: 2

Or you can just work 16hour days like the rest of us and wear it with a badge of sucker.
FTFY

--
Momentarily, the need for the construction of new light will no longer exist.
Re:Schedule some days as offset days by rikkards · 2014-07-11 05:22 · Score: 1

Not only that but a company that lets someone do that is shooting themself in the foot. Sooner or later 80 hour a week guy is going to leave, good luck getting someone that is
A: willing to do it coming in
B: not taking the job until something better comes along.
It's not a badge of honor, just an example of rationalization for a crappy job.
Re: Schedule some days as offset days by jtmach · 2014-07-11 05:36 · Score: 1

In that case, you'll probably make silly spelling mistakes.
Re: Schedule some days as offset days by smash · 2014-07-11 05:37 · Score: 1

Work smarter, not harder. This is the difference between an IT muppet and someone who actually goes places in this industry.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Re:Schedule some days as offset days by Lumpy · 2014-07-11 07:20 · Score: 1

"Or you can just work 16hour days like the rest of us and wear it with a badge of honor."
Keep telling yourself that, someday you will believe it.
Want to know what another badge of honor is? Accepting a lower wage for those 16 hours. A real IT pro would take $16.00 an hour and sleep under their desk.

--
Do not look at laser with remaining good eye.
Re:Schedule some days as offset days by Lumpy · 2014-07-11 07:27 · Score: 1

Actually that 80 hour a week guy will quit without notice really fucking the company.
Hell I took my vacation time left and called in sick for the last days I had in my sick bank while I worked at my new job.. the day I was supposed to be back at my old job I walked in at 8 am, dropped my badge and keys on my boss's desk and said, "I quit, hope you can hire the 3 guys you will need to replace me quickly." I had informed HR that I was quitting in writing that morning at 7am when they got there.
It screwed them over hard, really hard. I got a call from the VP begging for me to come back with a 30% pay increase 2 hours after I walked out the door. I said he doesn't have enough money in the world for me to work there anymore, I wished them luck replacing me.
Funny thing, 6 months later the 2008 crash happened and they closed up for good.

--
Do not look at laser with remaining good eye.
Re:Schedule some days as offset days by Lumpy · 2014-07-13 02:06 · Score: 1

What moron would sign a contract like that? Are you really that dumb?

--
Do not look at laser with remaining good eye.
Re:Schedule some days as offset days by Wolfrider · 2014-07-13 16:31 · Score: 1

--I'll give you an AMEN on that!!

--
.
== WolfriderV6 == I'm willing to admit that *I just might* be wrong... Are you??

Depends on the Application layer / patch applied by slacklinejoe · 2014-07-11 04:31 · Score: 1

I do this for a lot of clients. Automatic Deployment Rules in Configuration Manager, Scripts, Cron jobs etc. For test / dev, it absolutely makes sense as I usually have a monitoring system that goes into Maintenance Mode during the updates. If things take too long or if services aren't restored post update, the monitoring system gives me a shout that something needs to be remediated.

For production, it varies on the expected impact. If it's something I tested in pilot with zero issues and the application isn't something with an insane SLA, sure, I'll use an automatic deployment. When I'm working on hospital equipment such as servers processing imaging or vitals monitoring for surgery, that gets nix'ed no matter what due to the liability concerns.

I usually suggest building up trust / experience by automating the less critical systems and phasing in more sensitive systems until you've both gained a lot of experience with it and have more management support to do so. When crap goes down, it's easier to say "this is a tested process we've been using for years" vs "yeah, oops, new script, sorry that knocked down our ERP system".... Resume generating event right there... So, I guess it depends, just another tool for the toolbox and it's up to the carpenter to know when to pull it out.
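The "shout if services aren't restored in time" part boils down to polling with a deadline. A hedged Python sketch follows; `wait_for_recovery` is a hypothetical helper, `check` stands in for whatever your monitoring system actually probes, and the clock and sleep functions are injectable purely so the logic can be tested without waiting.

```python
import time
from typing import Callable


def wait_for_recovery(check: Callable[[], bool],
                      timeout_s: float,
                      interval_s: float = 5.0,
                      clock: Callable[[], float] = time.monotonic,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll check() until it returns True or timeout_s has elapsed.

    Returns True if the service came back, False if someone should be paged.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller would leave Maintenance Mode on a True result and raise an alert on False, so the decision to automate stays separate from the decision about who gets woken up.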

Offshore by pr0nbot · 2014-07-11 04:34 · Score: 4, Insightful

Offshore your maintenance jobs to someone in the correct timezone!

Re:Offshore by smash · 2014-07-11 05:40 · Score: 1

Yup. Company I work for currently has only a 4 hour window per day where we don't have active users actually on the clock. And if we win a job in say, South America (we're a mining company), that goes out the window entirely. VMotion, virtual networking, virtual filers/writable snapshots, are all beautiful things.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.

Sounds like a bad idea ... by gstoddart · 2014-07-11 04:34 · Score: 4, Insightful

You don't monitor maintenance windows for when everything goes well and is all boring. You monitor them for when things go all to hell and someone needs to correct it.

In any organization I've worked in, if you suggested that, you'd be more or less told "too damned bad, this is what we do".

I'm sure your business users would love to know that you're leaving it to run unattended and hoping it works. No, wait, I'm pretty sure they wouldn't.

I know lots of people who work off hours shifts to cover maintenance windows. My advice to you: suck it up, princess, that's part of the job.

This just sounds like risk taking in the name of being lazy.

--
Lost at C:>. Found at C.

Re:Sounds like a bad idea ... by pr0fessor · 2014-07-11 07:17 · Score: 1

I automate low risk maintenance, it doesn't alleviate the responsibility for prior testing, a roll back plan, or monitoring the results, but it does save time. If you refuse to automate any of your work you would never make a deadline and wouldn't last very long where I work.
Re:Sounds like a bad idea ... by gstoddart · 2014-07-11 07:23 · Score: 1

Oh, I automate deployments, and I automate some monitoring. Don't get me wrong, I'm not opposed to automation.
Like all programmers, I'm lazy and would rather code it once instead of doing it by hand many many times.
That doesn't mean I'd walk away from it and leave it unattended. To me, that's just asking to get bit in the ass.
These days, anything which is low risk maintenance is stuff I do during the daytime because it's not Production. For our Production environments, everything is considered high risk because the systems are mission critical. Any change at all is high risk, because if it breaks, it costs the company large amounts of money to be down.
You have to understand what your threshold of risk is, and what your actual risks are before you do any automation. Some systems you can play fast and loose with. Others, not so much.

--
Lost at C:>. Found at C.

And if it doesn't work? by Anonymous Coward · 2014-07-11 04:34 · Score: 2, Insightful

Support for off-hour work is part of the job. Don't like it? Find another job where you don't have to do that. Can't find another job? Improve yourself so you can.

This is the correct answer. I promise you that at some point, something will fail, and you will have failed by not being there to fix it immediately.

This is why you need.. by arse+maker · 2014-07-11 04:35 · Score: 3, Insightful

Load balanced or mirrored systems. You can upgrade part of it any time, validate it, then swap it over to the live system when you are happy.

Having someone with little or no sleep doing critical updates is not really the best strategy.

Re:This is why you need.. by Shoten · 2014-07-11 04:45 · Score: 5, Insightful

Load balanced or mirrored systems. You can upgrade part of it any time, validate it, then swap it over to the live system when you are happy.
Having someone with little or no sleep doing critical updates is not really the best strategy.
First off, you can't mirror everything. Lots of infrastructure and applications are either prohibitively expensive to do in a High Availability (HA) configuration or don't support one. Go around a data center and look at all the Oracle database instances that are single-instance...that's because Oracle rapes you on licensing, and sometimes it's not worth the cost to have a failover just to reach a shorter RTO target that isn't needed by the business in the first place. As for load balancing, it normally doesn't do what you think it does...with virtual machine farms, sure, you can have N+X configurations and take machines offline for maintenance. But for most load balancing, the machines operate as a single entity...maintenance on one requires taking them all down because that's how the balancing logic works and/or because load has grown to require all of the systems online to prevent an outage. So HA is the only thing that actually supports the kind of maintenance activity you propose.
Second, doing this adds a lot of work. Failing from primary to secondary on a high availability system is simple for some things (especially embedded devices like firewalls, switches and routers) but very complicated for others. It's cheaper and more effective to bump the pay rate a bit and do what everyone does, for good reason...hold maintenance windows in the middle of the night.
Third, guess what happens when you spend the excess money to make everything HA, go through all the trouble of doing failovers as part of your maintenance...and then something goes wrong during that maintenance? You've just gone from HA to single-instance, during business hours. And if that application or device is one that warrants being in a HA configuration in the first place, you're now in a bit of danger. Roll the dice like that one too many times, and someday there will be an outage...of that application/device, followed immediately after by an outage of your job. It does happen, it has happened, I've seen it happen, and nobody experienced who runs a data center will let it happen to them.

--

For your security, this post has been encrypted with ROT-13, twice.
Re:This is why you need.. by CWCheese · 2014-07-11 04:52 · Score: 1

Several posts have alluded to high-availability, mirrored, load balanced, etc etc as being the solution to simply updating systems. The problem from a management point of view is to remain on guard when a patch or upgrade goes bad. Having turned into one of those 'old-guys', I'm quite sobered by the bad maintenance windows I've been a party to and will never consider unattended maintenance windows for my teams. It's better for me to schedule the work and let my folks adjust their work days to get to the maintenance fully alert and aware, and in full attendance for that time when things don't go as planned.

--
Have a Day!
Re:This is why you need.. by CanHasDIY · 2014-07-11 04:58 · Score: 1

Load balanced or mirrored systems. You can upgrade part of it any time, validate it, then swap it over to the live system when you are happy.
Having someone with little or no sleep doing critical updates is not really the best strategy.
Oh my $deity, this!
I've worked in environments with test-to-live setups, and ones without, and the former is always, always a smoother running system than the latter.

--
An enigma, wrapped in a riddle, shrouded in bacon and cheese
Re:This is why you need.. by MondoGordo · 2014-07-11 05:09 · Score: 1

In my experience, if your load-balancing solution requires all your nodes to be available, and you can't remove one or more nodes without affecting the remainder, it's a piss-poor load balancing solution. Good load balancing solutions are fault tolerant up to, and including, absent or non-responsive nodes and any load balanced system that suffers an outage due to removing a single node is seriously under-resourced.
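The "under-resourced" test is plain arithmetic, sketched here in Python. The function names and the load figures are made up for the example; the only assumption is that each node can carry a known share of peak load.

```python
import math


def nodes_needed(peak_load: float, per_node_capacity: float) -> int:
    """Smallest number of nodes that can carry peak_load."""
    if per_node_capacity <= 0:
        raise ValueError("per_node_capacity must be positive")
    return math.ceil(peak_load / per_node_capacity)


def spare_nodes(total_nodes: int, peak_load: float,
                per_node_capacity: float, in_maintenance: int = 0) -> int:
    """Nodes that can still fail while carrying peak load.

    Zero means the next failure is an outage; negative means you are
    already short, even before anything else breaks.
    """
    return total_nodes - in_maintenance - nodes_needed(peak_load, per_node_capacity)


# Example: peak of 900 requests/s, 250 requests/s per node, 5 nodes.
print(spare_nodes(5, 900, 250))                     # 1 spare with everything up
print(spare_nodes(5, 900, 250, in_maintenance=1))   # 0 spare during maintenance
```

With those numbers, four nodes are needed, so a five-node pool is N+1: pulling one node for maintenance works, but it leaves no margin for an unplanned failure while you are patching.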
Re:This is why you need.. by smash · 2014-07-11 05:43 · Score: 1

Yeah, don't get me wrong (I've been posting about setting up a test lab using vSphere, vFilters and vlans) - you can't replace the need to have someone on call or watching in case it all fucks up. But you can generally reduce the outage window and risk significantly by actually testing (both the roll out and roll back) first. And if you've got it to the point where you can reliably test, you can work on your automation scripts, test the shit out of them, and having been tested with a copy of live using a copy of live data, be reasonably confident that they will work.
If they don't? Snapshot the breakage, roll back to pre-fuckup, and examine at your leisure. Then re-schedule once you know wtf happened.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Re:This is why you need.. by mlts · 2014-07-11 06:16 · Score: 1

There is also the fact that some failure modes will take both sides down. I've seen disk controllers overwrite shared LUNs, hosing both sides of the HA cluster (which is why I try to at least quiesce the DB or application so RTO/RPO in case of that failure mode is acceptable.)
HA can also be located on different points on the stack. For example, an Oracle DB server. It can be clustered on the Oracle application level (active/active or active/passive), or it can be sitting in a VMWare instance, clustered using vSphere HA, where the DB itself thinks it is a single instance, but in reality, it is sitting active/passive on two boxes.
Even if the backup stays up, failing back can be an issue. I've seen HA systems where it will happily drop to the backup node... but failing back to the primary can require a lot of downtime. For active/active setups, it can require a performance hit for resyncing.
Re:This is why you need.. by Slashdot+Parent · 2014-07-14 05:29 · Score: 1

Go around a data center and look at all the Oracle database instances that are single-instance...that's because Oracle rapes you on licensing
Then stop using Oracle if you can't afford RAC/GoldenGate/TAF/whatever. Use what you can afford in order to architect a proper redundant system. Running a database on a single instance is malpractice in 2014.

--
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock

Immutable Servers by skydude_20 · 2014-07-11 04:35 · Score: 2

If these are as critical services as you say, I would assume you have some sort of redundancy, at least a 2nd server somewhere. If so, treat each as "throw away", build out what you need on the alternative server, swing DNS and be done. Rinse and repeat for the next 'upgrade'. Then do your work in the middle of the day. See Immutable Servers: http://martinfowler.com/bliki/...

--
Jesus saves souls and redeems them for valuable cash prizes

Automate Out by whipnet · 2014-07-11 04:36 · Score: 2

Why would you want to automate someone or yourself out of a job? I realized years ago that Microsoft was working hard to automate me out of my contracts. It's almost done, why accelerate the inevitable?

Re:Automate Out by smash · 2014-07-11 05:45 · Score: 2

This is why you move the fuck on and adapt. If your job is relying on stuff that can be done by a shell script, you need to up-skill and find another job. Because if you don't do it, someone like myself will.
And we'll be getting paid more due to being able to work at scale (same shit for 10 machines or 10,000 machines), doing less work and being much happier, doing it.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Re:Automate Out by whipnet · 2014-07-11 06:47 · Score: 1

Thank you, I did. I only have my baby toe dipped into the IT world now in a business ownership role. Good to know there are still people out there that haven't worked in IT long enough to realize that IT IS NOT A CRAFT and you will never perfect anything and will continue to have to learn things that don't mean a pile of beans just a couple of years later. Great gig for me in my 20's, not so much in my 40's and approaching 50's. I'm old school IT and done with it. Enjoy.

Automate successful execution as well by Boawk · 2014-07-11 04:36 · Score: 1

Setting aside the wisdom (or lack thereof) of automating maintenance, you should also have some process external to the maintained machines that confirms that the maintenance worked. That confirmation could be something like testing that a Web server continues to serve the expected pages, some port provides expected information, etc. If this external process notes a discrepancy, it would page/text/call you.
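
A minimal sketch of such an external check, assuming it runs on a box other than the one being maintained. The URLs, the expected text and the `notify` callback (whatever pages/texts/calls you) are all hypothetical placeholders, not anything from the original post.

```python
import urllib.request


def _http_get(url, timeout):
    """Fetch a page; returns (status, body). Raises OSError subclasses on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def page_looks_right(url, expected_text, timeout=10.0, fetch=None):
    """Return (ok, detail) for one external check of a web page."""
    fetch = fetch or _http_get
    try:
        status, body = fetch(url, timeout)
    except OSError as exc:  # URLError, HTTPError and socket timeouts all land here
        return False, f"{url}: request failed ({exc})"
    if status != 200:
        return False, f"{url}: HTTP {status}"
    if expected_text not in body:
        return False, f"{url}: expected text missing"
    return True, f"{url}: ok"


def run_checks(checks, notify, fetch=None):
    """Run every (url, expected_text) check; call notify once if anything failed."""
    failures = []
    for url, expected in checks:
        ok, detail = page_looks_right(url, expected, fetch=fetch)
        if not ok:
            failures.append(detail)
    if failures:
        notify("Post-maintenance check failed:\n" + "\n".join(failures))
    return failures
```

Run it from cron shortly after the window closes; `notify` could wrap mail, an SMS gateway or a pager API, whichever you already have.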

Re:Automate successful execution as well by Neil+Boekend · 2014-07-13 22:12 · Score: 1

Disclaimer: I am not an IT professional.
Why not automate the deployment and go in via VPN afterwards to check if all is well?
Of course this should be done within driving range so that you can get there a couple of hours before business hours to fix the horrible, horrible mistakes that will be made from time to time. Or when the VPN doesn't respond.

--
Well, I might have a way, but it only works on a semi spherical planet in a vacuum.

Fixing the wrong problem by Zanthras · 2014-07-11 04:38 · Score: 1

By far the better solution is to figure out why that one specific server can't be offlined. It's far safer, regardless of the tests and validations, to work on a server that's not supposed to be running vs one that is. It obviously takes a lot of work, but all your critical/important services should be running in some sort of HA scenario. If you can't take a 5 minute outage just after normal business hours, you absolutely cannot take a failure in the service due to any sort of hardware failure (which will happen). This is coming from years of experience in a Software as a Service company.

Slashdot is a Bad Place to Ask This by terbeaux · 2014-07-11 04:39 · Score: 4, Interesting

Everyone here is going to tell you that a human needs to be there because that is their livelihood. Any task can be automated at a cost. I am guessing that it is not your current task to automate maintenance tasks otherwise you wouldn't be asking. Somewhere up your chain they decided that for the uptime / quality of service it is more cost effective to have a human do it. That does not mean that you can not present a case showing otherwise. I highly suggest that you win approval and backing before taking time to try to automate anything.

Out of curiosity, are they VMs?

Re:Slashdot is a Bad Place to Ask This by gstoddart · 2014-07-11 05:35 · Score: 1

Everyone here is going to tell you that a human needs to be there because that is their livelihood.
No, many of us will tell you a human needs to be there because we've been in the IT industry long enough to have seen stuff go horribly wrong, and have learned to plan for the worst because it makes good sense.
I had the misfortune of working with a guy once who would make major changes to live systems in the middle of the day because he was a lazy idiot. He once took several servers offline for a few days because of this. I consider that kind of behavior lazy and incompetent, because I've seen the consequences of it.
If you consider "doing our jobs correctly, and mitigating business risk" to be job security, you're right. If you think we do these things simply to make ourselves look useful, you're clueless about what it means to maintain production systems which are business critical.
Part of my job is to minimize business risk. And people keep me around because I actually do that.

--
Lost at C:>. Found at C.
Re:Slashdot is a Bad Place to Ask This by smash · 2014-07-11 05:56 · Score: 1

Alternatively, perhaps somewhere up the chain they have no idea what can be done (this IT shit isn't their area of expertise), and are not being told by their IT department how to actually fix the problem properly. Rather, they are just applying band-aid after band-aid for breakage that happens.
It is my experience that if you outline the risks, the costs and the possible mitigation strategies to eliminate the risk, most sensible businesses are all ears. At the very least, if they don't agree on the spot, they are at least aware of what is possible and when the inevitable happens, be more keen to fix the problem next time.
Downtime cost adds up pretty fucking quickly. For example, my company: We have 650 PC users. Pay rate probably ranges from 25 bucks an hour to 100 bucks an hour or more. Let's say the average is probably somewhere around 45 per hr.
1 hour of downtime, by 650 users, by 45 bucks per hour = $29,250 in lost productivity. Plus the embarrassment of not being able to deal with clients, etc. Plus potentially other flow on effects (e.g., in our case, possibly: maintenance scheduling for our mining equipment - trucks, drills, etc. didn't run. Plant therefore didn't get serviced properly, $500k engine dies).
If you fuck something up and are down for a day? Well... you can do the math.
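
For anyone who wants the math spelled out, a rough sketch using the figures above. The 8-hour working day is an assumption added here, and this only counts wages, not the flow-on effects.

```python
def lost_productivity(users, avg_hourly_rate, hours_down):
    """Wage cost of everyone sitting idle for the length of the outage."""
    return users * avg_hourly_rate * hours_down


print(lost_productivity(650, 45, 1))  # 29250
print(lost_productivity(650, 45, 8))  # 234000, assuming an 8-hour working day
```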

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Re:Slashdot is a Bad Place to Ask This by grahamsaa · 2014-07-11 06:36 · Score: 1

OP here. Yes, they are VMs in most cases. The only machines we don't virtualize are database servers.

--
Facts have a liberal bias.
Re:Slashdot is a Bad Place to Ask This by Slashdot+Parent · 2014-07-14 05:36 · Score: 1

Oh, a human definitely needs to be there for maintenance. You can't automate fixing up a screwup in the automation.
I just see no reason why maintenance windows have to be done at 1am. In today's world of redundancy and failover, there is just no reason for it. Every upgrade my team has done for as long as I can remember has been at 10am local time because we don't allow downtimes anyway. Why work at 1am?

--
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock

I have automated maintenances in the form of ... by spads · 2014-07-11 04:40 · Score: 2

...service bounces that are happening all the time. When one occurs, and/or if there are any other issues, I can send myself a mail. My blackberry has filters which allow an alarm to go off which can wake me during the night. That would seem to meet your needs.

--
Bukowski said it. I believe it. That settles it.

Nature of the beast by Danzigism · 2014-07-11 04:42 · Score: 1

Although I do feel this is the nature of the beast when working in a true IT position where businesses rely on their systems nearly 100% of the time, there are some smart ways to go about it. I'm not exactly sure what type of environment you're using, but if you use something like VMware's vSphere product, or Microsoft's Hyper-V, both allow for "live migrations". Why not virtualize all of your servers first of all, make a snapshot, perform the maintenance, and live migrate the VMs? You could do it right in the middle of the day and nobody would even know. This kind of setup takes a lot of planning however. I personally wouldn't want any maintenance performed on my servers without manual approval. Unattended maintenance sounds a bit too scary for my liking, and in my experience with even small security updates for both Linux and Windows servers, there's bound to be a point where something would fail and you could potentially get in a lot of legal trouble if you fail to meet your SLA, or cause a loss-of-profit due to downtime with a business.

--
*plays the Apogee theme song music*

It depends on the size of your operation... by jwthompson2 · 2014-07-11 04:43 · Score: 4, Interesting

If you really want to automate this sort of thing you should have redundant systems with working and routinely tested automatic fail-over and fallback behavior. With that in place you can more safely setup scheduled maintenance windows for routine stuff and/or pre-written maintenance scripts. But, if you are dealing with individual servers that aren't part of a redundancy plan then you should babysit your maintenance. Now, I say babysit because you should test and automate the actual maintenance with a script to prevent typos and other human errors when you are doing the maintenance on production machines. The human is just there in case something goes haywire with your well-tested script.

Fully automating these sorts of things is out of reach for many small to medium sized firms because they don't want to, or can't, invest in the added hardware to build out redundant setups that can continue operating when one participant is offline for maintenance. So the size of your operation, and how much your company is willing to invest to "do it the right way", is the limiting factor in how much you are going to be able to effectively automate this sort of task.

--
Even if I knew that tomorrow the world would go to pieces, I would still plant my apple tree. -Martin Luther

Simmilar experiences ... by psergiu · 2014-07-11 04:43 · Score: 4, Insightful

A friend of mine lost his job over a similar "automation" task on Windows.

The upgrade script was tested on a lab environment that was supposed to be exactly like production (but it turns out it wasn't - someone had tested something before without telling anyone and did not revert it). The upgrade script was scheduled to be run on production during the night.

Result - \windows\system32 dir deleted from all the "upgraded" machines. Hundreds of them.

On the Linux side I personally had RedHat doing some "small" changes on the storage side and PowerPath getting disabled at next boot after patching. Unfortunate event, since all Volume Groups were using /dev/emcpower devices. Or RedHat doing some "small" changes in the clustering software from one month to the other. No budget for test clusters. Production clusters refusing to mount shared filesystems after patching. Thankfully in both cases the admins were up & online at 1AM when the patching started and we were able to fix everything in time.

Then you can have glitchy hardware/software deciding not to come back up after reboot. RHEL GFS clusters are known to randomly hang/crash at reboot. HP blades sometimes have to be physically removed & reinserted to boot.

Get the business side to tell you how much the downtime is going to cost the company until:
- Monitoring software detects that something is wrong;
- Alert reaches sleeping admin;
- Admin wakes up and is able to reach the servers.
Then see if you can risk it.
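
A back-of-the-envelope sketch of that question. The three delays are the ones listed above; the hourly cost and the minute values in the example are made-up inputs that the business side would have to supply.

```python
def unattended_exposure(cost_per_hour, detect_min, alert_min, respond_min):
    """Cost of the outage before anyone can even start fixing it.

    detect_min:  minutes until monitoring notices something is wrong
    alert_min:   minutes until the alert actually wakes the admin
    respond_min: minutes until the admin is able to reach the servers
    """
    minutes = detect_min + alert_min + respond_min
    return cost_per_hour * minutes / 60


# Hypothetical numbers: 6000/hour, 10 + 5 + 45 minutes of delay.
print(unattended_exposure(6000, 10, 5, 45))  # 6000.0
```

Compare that figure with what it costs to have someone awake and logged in for the window.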

--
1% APY, No fees, Online Bank https://captl1.co/2uIErYq Don't let your $$$ sit in a no-interest acct.

Re:Simmilar experiences ... by SuiteSisterMary · 2014-07-11 07:24 · Score: 1

I've had 'yum update' do things like change completely where data files for a service are stored, update the configuration, but not move, link or otherwise do anything with the existing data. I've also had 'yum update' introduce kernel level file system bugs that result in data corruption. Both on vanilla Centos installs with no extra repos.

--
Vintage computer games and RPG books available. Email me if you're interested.

Prepare for failure by davidwr · 2014-07-11 04:43 · Score: 1

One way to prepare for failure is to have someone there who can at least recognize the failure and wake someone up in time to fix it.

Another way to prepare for failure is to have a system that is redundant enough that a part could go down and it wouldn't be more than a minor annoyance to users or management.

There are other ways to prepare for failure, but these are two common ones.

--
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.

Re:Prepare for failure by gstoddart · 2014-07-11 04:58 · Score: 1

Some of us would argue that doing maintenance unattended is preparing for failure -- or at least giving yourself the best possible chance of failure.
I work in an industry where if we did our maintenance badly, and there was an outage it would literally cost millions of dollars/hour.
If what you're doing is so unimportant you can leave the maintenance unattended, there's probably no reason you couldn't do the outage in the middle of the day.
If it is important, you don't leave it to chance.

--
Lost at C:>. Found at C.

Set alarms by MrL0G1C · 2014-07-11 04:44 · Score: 1

Can't you make some kind of setup that triggers if the update fails and alerts you / wakes you up with noise from your smartphone etc.?

Or like the other poster who beat me to it - off-load your work to someone in a country where your 5am is mid-day in their country.

--
Waterfox - a Firefox fork with legacy extension support, security updates and better privacy by default.

Perception of Necessity by bengoerz · 2014-07-11 04:47 · Score: 1

By proving that your job can be largely automated, you are eroding the reasons to keep you employed.

Sure, we all know it's a bad idea to set things on autopilot because eventually something will break badly. But do your managers know that?

Re:Perception of Necessity by smash · 2014-07-11 06:07 · Score: 1

Automating shit that can be automated so that you can actually do things that benefit the business instead of simply maintaining the status-quo is not a bad thing. Doing automate-able drudge work when it could be automated is just stupid. Muppets who can click next through a Windows installer or run apt-get, etc. are a dime a dozen. IT staff who can get rid of that shit so they can actually help people get their own jobs done better are way more valuable.
The job of IT is to enable the business to continue to function and improve. Never forget that. People don't spend up big on computer stuff just because. They do it in order to save money by improving process. Improving process is where you should be focused, anything to do with general maintenance of the status quo is dead time.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Re:Perception of Necessity by bengoerz · 2014-07-11 07:03 · Score: 1

My point is not that you should never automate things. Rather, when you automate things, you should make sure your managers know (1) that you were smart enough to improve processes and are therefore valuable to future projects and (2) that the things you automate could one day break (process changes, etc.).

At very least, the poster should be able to articulate a better reason for automation than "I wanted to sleep in".

Good for the Goose by Cigamit · 2014-07-11 04:56 · Score: 1

Simple.

You stipulate that for every maintenance, there has to be a full regression testing of any affected applications. You will require the application owner, QA folks, and any other affected personnel online during and after the maintenance to test and ensure everything is working. Bonus points, require them to be on a conference call, and breathe heavily into the mic the entire time (maybe occasionally say "Oops"). When you have enough other people complaining about the 2 am times instead of just you, they magically get moved to more sensible times in the late afternoon.

Your best bet is to get out of Managed Services and into Professional Services. You just build out new environments / servers / apps and hand them off to the MS guys. Once it's off your hands, you never have to worry about a server crashing, maintenance windows, or being on call. Plus, you are generally paid more.

Re:Good for the Goose by gstoddart · 2014-07-11 05:57 · Score: 1

Your best bet is to get out of Managed Services and into Professional Services. You just build out new environments / servers / apps and hand them off to the MS guys. Once it's off your hands, you never have to worry about a server crashing, maintenance windows, or being on call. Plus, you are generally paid more.
In my experience (personal and professional), those people do a half assed job of building those systems, have no concept of what will be required to maintain them, and are then subsequently unavailable when their stuff falls apart.
They're hit and run artists.
But, they sure do get paid lots of money.

--
Lost at C:>. Found at C.

Its your network by sasquatch989 · 2014-07-11 04:58 · Score: 1

I think automating maintenance is a smart move but still requires you be awake and available for it. The question is do you want to be awake at work for 10 minutes or 2 hours? Plan accordingly.

Testing. Validation. by mythosaz · 2014-07-11 05:00 · Score: 2

Do you plan on automating the end-user testing and validation as well?

Countless system administrators have confirmed the system was operational after change without throwing it to real live testers only to find that, well, it wasn't.

Nope. by ledow · 2014-07-11 05:01 · Score: 1

Every second you save automating the task, will be taken out of your backside when it goes wrong (see the recent article where a university SCCM server formatted itself and EVERY OTHER MACHINE on campus) and you're not around to stop it or fix it.

Honestly? It's not worth it.

Work out of normal hours, or schedule downtime windows in the middle of the day.

Re:Nope. by smash · 2014-07-11 06:10 · Score: 2

That example was due to incompetence, not due to automation. Whilst recovering from that would be a pain in the ass, if you are unable to recover at all, you have a major DR oversight.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.

Think of it a slightly different way by thecombatwombat · 2014-07-11 05:01 · Score: 3, Informative

First: I do something like this all the time, and it's great. Generally, I _never_ log into production systems. Automation tools developed in pre-prod do _everything_. However, it's not just a matter of automating what a person would do manually.

The problem is that your maintenance for simple things like updating a package is requiring downtime. If you have better redundancy, you can do 99% of normal boring maintenance with zero downtime. I say if you're in this situation you need to think about two questions:

1) Why do my systems require downtime for this kind of thing? I should have better redundancy.
2) How good are my dry runs in pre-prod environments? If you use a system like Puppet for *everything* you can easily run through your puppet code as you like in non-production, then in a maintenance window you merge your Puppet code, and simply watch it propagate to your servers. I think you'll find reliability goes way up. A person should still be around, but unexpected problems will virtually vanish.

Address those questions, and I bet you'll find your business is happy to let you do "maintenance" at more agreeable times. It may not make sense to do it in the middle of the business day, but deploying Puppet code at 7 PM and monitoring is a lot more agreeable to me than signing on at 5 AM to run patches. I've embraced this pattern professionally for a few years now. I don't think I'd still be doing this kind of work if I hadn't.

Re:Think of it a slightly different way by pnutjam · 2014-07-11 06:14 · Score: 1

Sounds awesome. I embraced solutions like that before I ended up in the over segmented large company. Now, not so much. I have to open a ticket to scratch my ass.

--
Cheap storage VM.
Re:Think of it a slightly different way by 0123456 · 2014-07-11 06:16 · Score: 1

1) Why do my systems require downtime for this kind of thing? I should have better redundancy.
True. Last year we upgraded all our servers to a new OS with a wipe and reinstall, and the only people who noticed were the ones who could see the server monitoring screens. The standby servers took over and handled all customer traffic while we upgraded the others.

Convenience in place of Caution by div_2n · 2014-07-11 05:18 · Score: 1

You're trading caution for convenience.

I have automated some things such as patch installation overnight only to wake up to a broken server despite the patches being heavily tested and known to work in 100% of the cases before only to not have them work when nobody was watching.

I urge you to only consider unattended automation overnight when it's for a system that can reasonably incur unexpected downtime without jeopardizing your job and/or the organization. If it's critical -- DO NOT AUTOMATE.

You've been warned.

Sometimes the reasons aren't technical by davidwr · 2014-07-11 05:18 · Score: 1

Maybe back when the maintenance window was created it was created for a valid technical reason, BUT technology moved on and management didn't.

In other words, in some environments, the technical people won't have a sympathetic ear if they ask to cancel the off-hours maintenance window simply because of local politics or the local management, BUT if the maintenance gets botched and services are still down or under-performing through normal business hours, nobody outside of IT will notice.

--
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.

Re:Sometimes the reasons aren't technical by gstoddart · 2014-07-11 05:27 · Score: 1

BUT if the maintenance gets botched and services are still down or under-performing through normal business hours, nobody outside of IT will notice
Then you're maintaining trivial, boring, and unimportant systems that nobody will notice. If your job is to do that ... well, your job is trivial and unimportant.
The stuff that I maintain, if it was down or under-performing during normal business hours ... we would immediately start getting howls from the users, and the company would literally be losing vast sums of money every hour. Because our stuff is tied into every aspect of the business, and is deemed to be necessary for normal operations.
Sorry, but some of us actually maintain stuff which is mission critical to the core business, and people would definitely notice it.
As one of the technical people who does cover after hours maintenance ... if a technical person suggested we automate our changes and not monitor them, they wouldn't get a sympathetic ear from me either.
There may be systems like you describe. And, as I said before, if that's the case, do your maintenance windows in the middle of the day.

--
Lost at C:>. Found at C.
Re:Sometimes the reasons aren't technical by davidwr · 2014-07-11 06:18 · Score: 1

As I said, sometimes the problems are not technical in nature.

--
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.

Perl by Murdoch5 · 2014-07-11 05:19 · Score: 1

Just write a simple Perl script to handle it; it would take about 1 hour to develop and test, and you'd be good to go.

Updates can't be left unattended by cloud.pt · 2014-07-11 05:31 · Score: 1

I'm not familiar with CentOS or Redhat, but in Debian it's not uncommon to get the odd update that requires configuration wizards. There's no shortcut to those, and in the event of it happening, you are gonna have some early risers complaining.

And even the supposedly safe, unattended updates aren't that safe: For example I updated to the latest linux-image from Debian's repos yesterday. I didn't expect some core services to depend on a computer reboot to start working again, but 5 minutes in people were complaining a Jboss web app wasn't working.

No single points of failure by jader3rd · 2014-07-11 05:34 · Score: 1

Are you talking about servers/services? If so, every service should have some sort of failover strategy to other hardware. That way anything you need to work on can be failed over during business hours and brought back.

Re:windows by smash · 2014-07-11 05:34 · Score: 2

OS choice is irrelevant. I've seen plenty of critical linux fuck ups in my day, and OS choice doesn't account for human error. And, being human, you WILL make human errors. You need a test environment and a backout plan. If you don't at least have a back-out plan and an estimate of how much the fuckup will cost BEFORE proceeding (and balancing that against the cost/risk of leaving it the fuck alone), you should not be carrying out the work.

Sure, that sounds like management speak, but seriously... cover your fucking ass. Because one day it will fuck up (whatever, the OS, this isn't just a Linux or Windows problem) and whilst the fuck up may not necessarily be your fault, the extended downtime because you have not tested and have no backout plan will be.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.

Raises the question by tepples · 2014-07-11 05:42 · Score: 1

Sometimes, an event happens which begs (for) the question of why nobody planned for it.

This raises the question of why people don't just avoid the pedantic bickering by saying "raises the question".

Re:Raises the question by gstoddart · 2014-07-11 05:50 · Score: 2

This raises the question of why people don't just avoid the pedantic bickering by saying "raises the question".
Because, generally speaking, pedants are tedious and annoying, and nobody else cares about the trivial minutia they like to get bogged down in because it's irrelevant to the topic at hand.
At least, that's what my wife tells me. ;-)

--
Lost at C:>. Found at C.
Re:Raises the question by NotSanguine · 2014-07-11 06:19 · Score: 1

This raises the question of why people don't just avoid the pedantic bickering by saying "raises the question".
Because, generally speaking, pedants are tedious and annoying, and no one else cares about the trivial minutiae in which pedants like to get bogged down. It's irrelevant to the topic at hand.
At least, that's what my wife tells me. ;-)
There. FTFY. Pedantry and grammar nazism all in one pretty package. You're welcome.

--
No, no, you're not thinking; you're just being logical. --Niels Bohr

I don't get paid for things that work right by prgrmr · 2014-07-11 05:44 · Score: 1

I get paid for cleaning up after things that don't work right the first time.

3 am is better by djupedal · 2014-07-11 05:50 · Score: 1

That way, when things go south, you have time to right the ship before the early birds start logging in at 5:30.

VMs help by normanjd · 2014-07-11 05:52 · Score: 1

Do you use VMs? ALL of our servers are now running on VMware at remote locations. I can't automate maintenance, but it does not matter if I do it from the office or at home as I am remoting either way... Set up a snapshot to roll back if there is a problem, and you can at least make it a bit more comfortable if you have to be up at odd hours...

Re: I have automated maintenances in the form of . by drummerboybac · 2014-07-11 05:54 · Score: 1

Most of these patches are happening on systems that are in some remote data center that's not in your physical office location anyway. So I see no difference connecting remotely to the servers from your house vs connecting remotely from your office

Automation is necessary by dave562 · 2014-07-11 05:55 · Score: 2

If you want to progress in your IT career, you need to figure out how to automate basic system operations like maintenance and patching. Having to actually be awake at 2:00am to apply patches is rookie status. Sometimes it is unavoidable, but it should not be the default stance.

My environment is virtual, so our workflow is basically snapshot VM, patch, test. If the test fails, rollback the snapshot and try again (if time is available) or delay until later. If the test is successful, we hold onto the snapshot for three days just in case users find something that we missed. If everything is good after three days, we delete the snapshot.
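
A rough sketch of that workflow, not the actual tooling described here. The four callables (`take_snapshot`, `apply_patches`, `run_tests`, `rollback`) are hypothetical hooks standing in for whatever your hypervisor and patch tooling expose; only the three-day retention comes from the post.

```python
from datetime import datetime, timedelta


def patch_with_snapshot(vm, take_snapshot, apply_patches, run_tests, rollback,
                        now=None, keep_days=3):
    """Snapshot, patch, test; roll back on failure, else keep the snapshot a while."""
    now = now or datetime.now()
    snap = take_snapshot(vm)
    try:
        apply_patches(vm)
        passed = run_tests(vm)
    except Exception:
        # A crash while patching or testing counts as a failed test.
        passed = False
    if not passed:
        rollback(vm, snap)
        return {"vm": vm, "status": "rolled_back", "snapshot": snap,
                "delete_after": None}
    # Keep the snapshot around in case users find something the tests missed.
    return {"vm": vm, "status": "patched", "snapshot": snap,
            "delete_after": now + timedelta(days=keep_days)}
```

A separate scheduled job would then delete any snapshot whose `delete_after` has passed; retrying or delaying after a rollback is left to the caller.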

We have a dev environment that mirrors production that we can use for patch testing, upgrade testing, etc. Due to testing, we rarely have problems with production changes. If we do, the junior guys escalate to someone who can sort it out. Our SLAs are defined to give us plenty of time to resolve issues that occur within the allocated window. (Typically ~4 hours)

In the grand scheme of things, my environment is pretty small. We have ~1500 VMs. We manage it with three people and a lot of automation.

Snapshots are great, for some things by mveloso · 2014-07-11 05:59 · Score: 1

Snapshots are great, but they assume all your data is on the snapshot. It's harder to roll back if your new version goes ahead and corrupts some database or something on the NAS.

It's even harder to roll back if your data stores are on some multi-clustered beast that wasn't designed to be rolled back.

Of course, you should have caught that in test, right?

Re:No. Do your maintenance *in* working hours. by smash · 2014-07-11 06:08 · Score: 1

Yup. Same reason changes on the weekend are bad, as are changes on (in my opinion) Thursdays and Fridays.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.

Name things that shouldn't be automated. by holophrastic · 2014-07-11 06:09 · Score: 1

Consider all of the tasks that you do as a part of your job. Identify which ones should absolutely never be automated -- maybe they're too dangerous, maybe the risk is too great, maybe they're too much fun. I'd bet that upgrading the OS would be pretty well the top of your never-automate-this list.

Re:Having problems with this on windows. by smash · 2014-07-11 06:11 · Score: 1

Get access to SCCM or get your outsourcer to fix the fucking problems they created.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.

Re:If the machine is virtual.... by smash · 2014-07-11 06:13 · Score: 1

"It boots" does not necessarily constitute success. You really need a test environment. There's no real getting around it.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.

Reboot? - Load Balancers and multiple systems by Taelron · 2014-07-11 06:29 · Score: 3, Insightful

Unless you are updating the kernel, there are few times you need to reboot a CentOS box, unless your app has a memory leak.

The better way to go about it has already been pointed out above. Have several systems, load balance them in a pool, take one node out of the pool, work on it, return it to the pool then repeat for each remaining system. - No outage time and users are none the wiser to the update.
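The pool rotation described above might look something like this in Python. The `drain`, `update`, `healthy` and `restore` callables are hypothetical hooks for your own load balancer and patching tools. Stopping at the first unhealthy node is my assumption about the safest default, not something stated in the comment.

```python
from typing import Callable, List


def rolling_update(
    nodes: List[str],
    drain: Callable[[str], None],
    update: Callable[[str], None],
    healthy: Callable[[str], bool],
    restore: Callable[[str], None],
) -> List[str]:
    """Update pool members one at a time.

    Returns the nodes that were updated and put back in the pool.
    Stops at the first node that fails its health check.
    """
    done: List[str] = []
    for node in nodes:
        drain(node)    # take the node out of the pool
        update(node)   # work on it while it serves no traffic
        if not healthy(node):
            # Leave the bad node out of the pool and stop, so the
            # untouched nodes keep serving on the old version.
            break
        restore(node)  # return it to the pool
        done.append(node)
    return done
```

If the returned list is shorter than the input, the first node missing from it is the one that needs a human.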

It can be done by sentiblue · 2014-07-11 06:46 · Score: 1

I definitely want to automate this... but not automate alone... I would build in some monitoring and notification... such as result of each step summed into an email/SMS report.
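Something like the following could collect the result of each step into one report. It is only a sketch: the step names are made up, and actually sending the text by email or SMS is left to whatever gateway you already have.

```python
from typing import Callable, List, Tuple


def run_steps(steps: List[Tuple[str, Callable[[], None]]]) -> Tuple[bool, str]:
    """Run steps in order, stopping at the first one that raises.

    Returns (success, report), where report has one line per attempted
    step plus a final RESULT line.
    """
    lines: List[str] = []
    success = True
    for name, action in steps:
        try:
            action()
            lines.append(f"OK    {name}")
        except Exception as exc:  # any exception counts as a failed step
            lines.append(f"FAIL  {name}: {exc}")
            success = False
            break
    lines.append("RESULT: " + ("success" if success else "failed"))
    return success, "\n".join(lines)
```

Steps after a failure are not attempted, so the report also shows how far the maintenance got.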

I would also use a remote host, to send an alert if the host in maintenance doesn't come back within the expected window....
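A possible shape for that remote watchdog, using only the Python standard library. The TCP port probe and the polling interval are my assumptions. The `clock` and `sleep` parameters exist only so the loop can be tested without waiting.

```python
import socket
import time
from typing import Callable


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_host(
    is_up: Callable[[], bool],
    window_seconds: float,
    poll_seconds: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_up() until it returns True or the window has expired."""
    deadline = clock() + window_seconds
    while True:
        if is_up():
            return True
        if clock() >= deadline:
            return False  # caller should raise the alert here
        sleep(poll_seconds)
```

Run from a different machine, a call such as `wait_for_host(lambda: port_open("db01.example.com", 22), window_seconds=900)` returning False would be the trigger for the alert (the hostname is a placeholder).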

And of course... even though I'm not gonna be doing the maintenance activity manually/live, I would still want to find a proper way to know/confirm that the maintenance window was successful before sleeping all the way through.

And if it doesn't work? by Anonymous Coward · 2014-07-11 06:58 · Score: 1

Sad but true.
And it's almost 100% likely that the night you decide to try LSD for the first time will be the night that your automation script fails.

Ansible by ewieling · 2014-07-11 08:00 · Score: 1

We use Ansible, it seems to fit well with our needs, but others use Puppet or Chef.

--
I really shouldn't have used someone else's email address for this account.

We do this all the time by Bugler412 · 2014-07-11 08:13 · Score: 1

All sorts of automated security updates and patches during the regularly scheduled maintenance window. Couple of key things that make it work: 1. A valid and representative DEV environment or host(s) to vet and test deploy the updates using the same methods as production hosts. 2. A solid alerting system for when the inevitable couple of hosts fail and need help to get running again. 3. A qualified and responsive on-call person to review the results at or near the end of the maintenance window to make sure everything came back online properly and take action where necessary. It doesn't so much eliminate the after hours work as reduce the volume of the after hours work to a level manageable by a single qualified tech.

VPN by Crizzam · 2014-07-11 08:23 · Score: 1

Get yourself a VPN to your workstation and do it from home, in bed. If you can get a quicky while you're at it, good on you.

Missing the point by ilsaloving · 2014-07-11 08:45 · Score: 2

The OP is missing the point. Of *course* you can automate updates. You don't even need an automation system. It can be as simple as writing a bash script.

The point is... what happens when something goes wrong? If all goes well, then there's no problem. But if something does go wrong, you no longer have anyone able to respond because nobody's paying attention. So you come in the next morning with a down server and a clusterf__k on your hands.

You have redundancy right? by UrsaMajor987 · 2014-07-11 08:59 · Score: 1

The last place I worked at had redundancy both within the data center and across data centers. That is, they could survive the loss of a data center. If the service you are supplying is that critical, you should have redundancy. This will give you a little more leeway on when maintenance is done.

And if it doesn't work? by Pogie · 2014-07-11 09:30 · Score: 1

To the original poster: It is entirely possible, but you're going to need to learn a lot about modern automation and configuration management tools appropriate to the types of maintenance you're looking to automate. You also need solid vision and alignment on how you're going to achieve this level of automation across multiple parts of your business -- Development, IT, the Business, everyone. They all have to buy in and commit, because all of those folks have the ability to fuck it all up if everyone isn't on the same page. You can't do it alone on the admin side. As a start, I would suggest learning about Continuous Integration/Continuous Delivery and Agile and Devops methodologies to get on the road to where you want to be.

To the rest of you:

The original comment ("Learn and use Puppet") is grossly oversimplified -- there is a lot more to it -- but with proper implementation of configuration management software (Chef, Puppet, Salt, etc), proper automated testing (think Jenkins, Teamcity, etc) and a real commitment in your organization to Continuous Integration and Delivery practices, you can easily do regular automated maintenance. Yes -- sometimes it will break and you'll have to clean it up. But properly and thoughtfully implemented in policy and practice, those times when it breaks will be the exception that proves the rule.

Forgive the argument from authority, but at our firm (International, thousands of primarily linux servers across 14 countries and 40+ datacenters, mostly bare-iron, some virtualization) we have regular daily and weekly automated maintenance. We handle all sorts of significant change -- driver updates, software upgrades, network switch configuration, even forklift OS upgrades involving the full re-imaging of a bare iron system combined with re-deployment of software (including things like databases and hadoop clusters) -- automatically and without human intervention on a regular basis. And by regular I mean daily.

The attitude that "Murphy always wins" or "something will fail and you will have failed by not being there to fix it immediately" is a relic of a time when the tools available to manage large scale infrastructure were inadequate or unavailable. Again, there are failures that will require manual intervention, but if you are doing your jobs well as developers, network admins, systems admins, 'devops' [NOTE: I strongly object to that term being used as a job title, but that's how folks have started using it] then you should be able to conduct automated hands-free production change at 2am on a Saturday and sleep like a baby knowing that when you check your upgrade report in the morning 99% of the time everything will have gone off without a hitch.

Frankly if you approach complex infrastructure management with that defeatist viewpoint of "things will always fail", you are doing yourself and your employer a disservice, and you are severely restricting your career prospects. My company is not in any way unique in our ability to automate and manage our infrastructure, and maintaining that type of outdated attitude is going to cause lots of doors to be slammed in your face. Do you really believe the Googles, Facebooks and Amazons of the world rely on having a human being white-knuckling every change in their infrastructure?

One additional note: If your infrastructure is designed such that you cannot push change without guaranteed downtime or the risk of downtime then you have failed to design your infrastructure properly.

Dont let me down Bruce.. by tempest69 · 2014-07-11 10:35 · Score: 1

Always go in with a well considered plan, and be there when it happens.

Even if your planning is awesome, you'll look unprofessional not being in a position to fix a problem when it is most likely to occur.

If something does happen, and you're not there.. There will be crankiness.

Thanks for the feedback (OP response) by grahamsaa · 2014-07-11 10:48 · Score: 2

Thanks for all of the feedback -- it's useful.

A couple clarifications: we do have redundant systems, on multiple physical machines with redundant power and network connections. If a VM (or even an entire hypervisor) dies, we're generally OK. Unfortunately, some things are very hard to make HA. If a primary database server needs to be rebooted, generally downtime is required. We do have a pretty good monitoring setup, and we also have support staff that work all shifts, so there's always someone around who could be tasked with 'call me if this breaks'. We also have a senior engineer on call at all times. Lately it's been pretty quiet because stuff mostly just works.

Basically, up to this point we haven't automated anything that will / could be done during a maintenance window that causes downtime on a public facing service, and I can understand the reasoning behind that, but we also have lab and QA environments that are getting closer to what we have in production. They're not quite there yet, but when we get there, automating something like this could be an interesting way to go. We're already starting to use Ansible, but that's not completely baked in yet and will probably take several months.

My interest in doing this is partly that sleep is nice, but really, if I'm doing maintenance at 5:30 AM for a window that has to be announced weeks ahead of time, I'm a single point of failure, and I don't really like that. Plus, considering the number of systems we have, the benefits of automating this particular scenario are significant. Proper testing is required, but proper testing (which can also be automated) can be used to ensure that our lab environments do actually match production (unit tests can be baked in). Initially it will take more time, but in the long run anything that can eliminate human error is good, particularly at odd hours.

Somewhat related, about a year ago, my cat redeployed a service. I was up for an early morning window and pre staged a few commands chained with &&'s, went downstairs to make coffee and came back to find that the work had been done. Too early. My cat was hanging out on the desk. The first key he hit was "enter" followed by a bunch of garbage, so my commands were faithfully executed. It didn't cause any serious trouble, but it could have under different circumstances. Anyway, thanks for the useful feedback :)

--
Facts have a liberal bias.

You're a whiny-assed bitch by msobkow · 2014-07-11 10:59 · Score: 1

The whole reason we used to get paid extra to provide support was to provide support. That meant weird hours, weekends, and late nights.

If you don't like it, get another job.

--
I do not fail; I succeed at finding out what does not work.

How about meeting it half way with MOPs? by xushi · 2014-07-11 12:10 · Score: 1

You said you had 24/7 personnel on call. Let's refer to them as the NOC.

Are they trained to type and follow commands, along with basic (i mean basic..) skills in *nix?

You could cut the cost of investment in automation which can be costly, and focus more on well documented (and tested as much as possible) steps or instructions that they can perform since they're up anyway.. You can use some of the allocated cost in training the NOC a bit more on *nix, scripting etc..

If anything goes wrong then they can call you and you can follow up. Depending on how good your steps are (and a bit of luck) you might end up waking up less than usual.

Of course I ask that silly question up top because you don't want to be awakened at the start of the maintenance by someone saying "Hello sir, your MOP failed at `ls /vra/log`."

2 AM? by PPH · 2014-07-11 12:16 · Score: 1

Sleep in?

I don't understand. This just means you swing by and do the update after they close the bar and throw you out.

That's SOP around here.

--
Have gnu, will travel.

Re:I have automated maintenances in the form of .. by Wycliffe · 2014-07-11 12:29 · Score: 1

I am updating your outward facing mail server, the update fails, where is your god/email now?

If at least some part of your paging and monitoring system isn't independent from your servers then you're doing it wrong. We use multiple third party companies to monitor our website. They're high-level checks, but one of the checks is to check that our internal monitoring software is working. You can purchase third party monitoring software or spin up an instance somewhere like Amazon or DigitalOcean for a few dollars a month. Depending on how critical your systems are you could spin up a few dozen. The point is that you should be monitoring your servers from outside your network for multiple reasons. The first being that it doesn't really matter if everything is up if the outside world can't connect to it, and the second being that you still want to be paged if your entire datacenter goes up in smoke.
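For the cheap-instance route, an outside probe can be very small. This Python sketch uses only the standard library. The target names and URLs in the usage note are placeholders, and treating any exception as "down" is my assumption about what a probe should do.

```python
import urllib.request
from typing import Callable, Dict, List


def http_ok(url: str, timeout: float = 10.0) -> bool:
    """True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        # DNS failure, refused connection, timeout, HTTP error status:
        # from the outside they all mean "not reachable".
        return False


def failed_checks(
    targets: Dict[str, str],
    probe: Callable[[str], bool] = http_ok,
) -> List[str]:
    """Return the names of the targets whose probe failed."""
    return [name for name, url in targets.items() if not probe(url)]
```

Calling `failed_checks({"website": "https://www.example.com/", "internal-monitoring": "https://monitor.example.com/health"})` from a box outside your network covers both the public site and the "is our own monitoring alive" check. A non-empty result is what should page someone.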

Automate but cover your bases by MatthiasF · 2014-07-11 15:37 · Score: 1

Only automate tasks on systems that can be quickly snapshotted and simply QC'd using scripts.

For instance, if you have a web server you want to update weekly, then setup a script on the virtual host that snapshots the virtual machine before the upgrades and then runs a series of checks on the web server after the upgrades. If the web server does not respond as expected to the post-upgrade checks, the virtual host can revert back to the pre-update snapshot and send a message to you notifying you of the upgrade failure. You could also snapshot the failed virtual machine, spin it up on another machine or instance without networking to check the logs for any errors that occurred during upgrades.
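In Python, the host-side script could be sketched roughly as below. Every callable is a hypothetical hook for your virtualization and messaging tools, and the check names in the usage note are invented for illustration.

```python
from typing import Callable, Dict, List


def upgrade_with_checks(
    snapshot: Callable[[], str],
    upgrade: Callable[[], None],
    checks: Dict[str, Callable[[], bool]],
    revert: Callable[[str], None],
    notify: Callable[[str], None],
) -> List[str]:
    """Snapshot, upgrade, then run every named check.

    If any check fails, revert to the pre-upgrade snapshot and send a
    message naming the failed checks. Returns the list of failed checks.
    """
    snapshot_id = snapshot()
    upgrade()
    # Run all checks rather than stopping at the first failure, so the
    # notification lists everything that looked wrong.
    failures = [name for name, check in checks.items() if not check()]
    if failures:
        revert(snapshot_id)
        notify("Upgrade reverted; failed checks: " + ", ".join(failures))
    return failures
```

For a web server, `checks` might hold entries such as `"homepage"` and `"login"`, each a small function that requests a page and compares the response with what is expected.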

If the virtual machine is *nix based, you could mount the snapshot directly on the host and browse the logs as well, or even automate the collection of failed logs too.

Any upgrade procedure that cannot be easily scripted or delayed in such a fashion should be done manually and well attended by someone knowledgeable.

You are just doing it wrong by elmer+at+web-axis · 2014-07-11 17:44 · Score: 1

With virtualization you should have no real need to do server upgrades out of hours. If you need to upgrade a package/service on offer, you should just spin up a brand new instance, have some type of automation piece install and configure everything that the instance needs, have some auto testing application confirm that it's all added, then just add the instance to the load balancer and decommission the old instance. No more out of hours work unless dealing with hardware issues, and with HA these issues usually can be dealt with during business hours. If you are restricted by a limit on resources you should at least be using products like Docker or Solaris Zones to isolate guests from the core OS and separate out application vs core OS needs (the bulk of change usually happens in the application layer, so this separation again means less downtime out of hours). Need to update the hypervisor? Live migrate the guests to another piece of hardware and do the maintenance, again during business hours. If you don't have the budget you can always spin these kinda solutions up (DRBD/KVM work a treat). Or, as others have said, host everything in the cloud.

Do you get paid overtime? by Ryanrule · 2014-07-11 18:04 · Score: 1

If not, tell them you will quit.
When they call your bluff, quit.
Accept a 50% raise the next day.

Momas don't let your babies grow up to do IT. by GrantRobertson · 2014-07-12 03:04 · Score: 1

This is exactly why I don't do IT any more. All the responsibility to keep things working, no authority to make users or departments not screw things up, none of the credit when things go smoothly, but all of the blame when anything goes wrong (no matter who caused it), every department's poorly planned extra computer expenses come out of your budget, all that unpaid overtime means you are barely making minimum wage, you are constantly reminded that your job is hanging in the balance, AND you are expected to keep taking expensive certification classes on your own dime just for the privilege of bending over for one more year.

Monitoring Team here by weweedmaniii · 2014-07-13 08:43 · Score: 1

I work for a monitoring team. We are 24/7 and I can guarantee you from experience this is a terrible idea. The first time the servers drop out of the monitoring suppression and suddenly a half dozen alarms are going off because your automated server program decided to drag down a series of other servers or kill the switches at the office, and I get to call you at 4AM, you are going to wonder why you didn't just catch a nap and go back in. Anytime we get an e-mail from a "Senior Server Manager" stating that "a change will be made this weekend at 2AM but will not affect system uptime," we note it in our shift logs, because as sure as we are sitting there Murphy will creep up and jump on that manager's back and chew until someone can beat him off. Usually, to minimize the damage to our team, we will politely e-mail that manager and ask exactly what systems and what times this will happen, as a warning that we really do not want to have to go through the late night procedures to alert someone. Most managers who have experience will actually send us a separate e-mail saying "server XYZ123 will be down from 1AM to 3AM, if we get it up sooner we will call to verify it is up on your end." We monitor 10 different companies of all sizes, from a single server room to worldwide systems, and Murphy is a board member for every one and always gets a vote.

--
"If stupid things work...then they are not stupid."

Do upgrades during the day by Slashdot+Parent · 2014-07-14 05:20 · Score: 1

You should always have a competent tech on hand for maintenance tasks.

I agree with this, but who does maintenance at 1am anymore? What's the point in it? Users are worldwide, and 1am in the US is prime business hours in Asia, so why bother patching/upgrading in the middle of the night?

I haven't done a late-night maintenance in at least a decade. It's all about rolling upgrades. Any problems? Rollback. Need to upgrade infrastructure? Take the entire datacenter offline and serve from your other datacenters. Every single upgrade I've done for as long as I can remember has been at 10am, which is the earliest I can get my lazy-ass junior devs to stumble into the office.

OP needs a process upgrade.

--
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
