Scheduling Large Scale Server Upgrades/Outages?
thesandbender asks: "I've inherited my company's DST patching project and I have to schedule upgrades for 7000+ servers over the course of the next few weeks. Of course each group inside the company has different SLAs and outage windows. I need to somehow turn the pile of spreadsheets I have into a database and create a schedule that spreads the load over our pool of system administrators. There is no way I can reasonably accomplish this by hand, and even software for other industries/applications that could take a few steps out of the process would be appreciated. Does anyone know of a rule-based scheduling system where I provide the available outage windows and a priority ranking for each system and the scheduler will recommend the order in which they should be upgraded?"
I think if I had to do this, I'd establish a priority ranking of the systems, taking into consideration critical path and cascading dependencies, and then assign the highest priority ones first. When you finish that, come back to the pool for the next high priority job. When you're out of high priority jobs in the pool, move on to mid-priority, and so on. Trying to keep a bunch of inter-related steps in sync will drive you, and your sysadmins, crazy. Set priorities and let the big boys and girls do their job.
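For what it's worth, the "pull the next highest priority job from the pool" idea can be sketched in a few lines of Python. This is only a rough illustration: the hostnames, admin names and the "lower number = more urgent" convention are made up, admins are served round-robin, and dependencies between systems are not modelled at all.

```python
import heapq

def assign_from_pool(servers, admins):
    """Hand out servers in priority order, round-robin across admins.

    servers: iterable of (priority, hostname); a lower number means more
    urgent (an assumed convention, not something from the original post).
    """
    pool = list(servers)
    heapq.heapify(pool)  # the smallest priority number is popped first
    assignments = {admin: [] for admin in admins}
    turn = 0
    while pool:
        _, hostname = heapq.heappop(pool)
        assignments[admins[turn % len(admins)]].append(hostname)
        turn += 1
    return assignments

# Hypothetical data for illustration only.
example = assign_from_pool(
    [(2, "web01"), (1, "db01"), (3, "test01"), (1, "auth01")],
    ["alice", "bob"],
)
```

A real version would hand the next job to whichever admin finishes first rather than strictly alternating.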
Then plug it back in real quick.
Deleted
shutdown -h now
Fuck the users! They exist solely to bemuse the sysadmin! Odds are they've been getting uppity lately and need to be taught a lesson, anyway.
1. Stop using acronyms that nobody knows, slashdotters hate that!
2. never ever ever use spreadsheets ever to hold data ever because you'll eventually want to do database operations on it
3. As for the specialized software suites that do all that logic and notification and stuff, it'd take longer to set up and configure one than to do it manually, and it would cost an ungodly amount of money for licensing. Plus it never gets the logic right, because tons of human reasoning is involved in which to drop when, and computers can't handle that. If I were you, I'd stick some blank transparencies in the printer and print color-coded, graphical timeline-style outage window schedules from each department, and then just lay them on top of each other in logical ways until you come up with something that works. The main object is to go through the first day and pack as many possible downtimes together in a row as you can, then go to the next day and do the same thing until they all have a scheduled time for when they are allowed to be down. Make sure every single upgrade time has at least one secondary possible time in case the one before it takes longer than it should (which will happen a lot). If you have them in overlapping transparencies that way, so they can be easily rearranged and examined visually, it's better than any computer program, except that you're doing all the logic. But that really shouldn't take much longer than an hour or two if you use the logical pattern I described. Hope that made sense cuz it did in my head lol.
Google's Super Secret Search Algorithm: SELECT @search_results FROM internet WHERE @search_results = 'good'
If you just put this off for a few months, the problem will probably just go away...
"Not an actor, but he plays one on TV."
>I have to schedule upgrades for 7000+ servers ... pile of spreadsheets ...
Somebody bought 7000 servers with no plan for upgrades?
(Patching for DST, get a new OS...)
My little Linux and tech blog
This used to be my problem ... for the DST change, we have thousands of servers and workstations to deal with. I was getting worried, but instead of taking it on, we found a PM and now it's their problem.
The moral of the story: never try.
We emerge from our mother's womb an unformatted diskette; our culture formats us. - Douglas Coupland
If you have 7000+ Windows Servers you should already be running a software patching solution such as SMS, WSUS, etc...
Sure, you'll spend a large amount of time sorting out which servers (or server groups) should be patched when, but once that is done you should be able to schedule them within your chosen solution.
Take WSUS for example. Organise your servers into groups, approve the update and set each group's Windows Update GP properties appropriately.
I interviewed a while ago with a company called BladeLogic. They provide a suite of products for these types of tasks and all types of data center management. I would definitely give them a look; they could help out on this project and many, many more in the future. http://www.bladelogic.com/
B
Ok, so you have spreadsheets with admins, windows, servers, priorities, etc., in them, and you're just looking for a way to schedule everything? Can you just export the spreadsheets to CSV and write a script to do it for you?
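A rough sketch of what that script could start from, using Python's csv module. The column names and the sample rows are invented for illustration; the real spreadsheets will certainly look different and need cleaning first.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export; real column names and dates will differ.
SAMPLE_CSV = """hostname,group,priority,window_start,window_end
web01,sales,2,2007-02-20T22:00,2007-02-21T02:00
db01,finance,1,2007-02-21T01:00,2007-02-21T03:00
web02,sales,3,2007-02-20T22:00,2007-02-21T02:00
"""

def load_servers(csv_file):
    """Read rows into dicts, grouped by outage window, sorted by priority."""
    by_window = defaultdict(list)
    for row in csv.DictReader(csv_file):
        row["priority"] = int(row["priority"])
        by_window[(row["window_start"], row["window_end"])].append(row)
    for rows in by_window.values():
        rows.sort(key=lambda r: r["priority"])
    return dict(by_window)

# With a real export this would be open("servers.csv", newline="") instead.
windows = load_servers(io.StringIO(SAMPLE_CSV))
```

Once everything is grouped by window like this, splitting each group across the available admins is a much smaller problem.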
Unpleasantries.
Check out MP2. Our maintenance guys use it to schedule and track maintenance of everything in the plant. They swear by it. I believe you could use it for server maintenance, but I haven't tried it.
I don't know much about it, but I found one site that discusses it here.
Hi, I can't help you. I've no knowledge at all about this field. However, could someone make me a little bit less stupid and explain those acronyms to me? acronymattic has 197 TLAs for "DST" but I couldn't find the one which would fit for sure! SLA, that was standing for "Site Level Aggregator", right?
http://www.microsoft.com/windowsserversystem/updateservices/default.mspx
That, along with proper scripting of "shutdown -r /m \\computername" should get you through it.
It's a simple timezone change, why on earth do the servers need a reboot or any downtime?
Ok maybe a custom app freaks out but the OS should not be affected by the change.
Hi. DST changeover is in early March. If you aren't already halfway done with your 7000 server project, and they all require downtime, you are hosed. Find a new job.
The good news is most Linux systems don't require a reboot for this change, so they can be done sans outage.
When computers get overloaded with work like this (host lookups, for example) they ask for help from other computers. As my stupid first try, how about asking each sysadmin to run a spreadsheet column of hostnames through an md5hash and let them convert servers with a '1' on the first day, 'a' on the tenth day, etc.?
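The hashing idea is easy to try out. This sketch generalises it slightly: instead of looking only at the first hex digit, it takes the whole MD5 digest modulo the number of patch days, so any number of days works. The 16-day default and the function names are assumptions for illustration.

```python
import hashlib

def patch_day(hostname, num_days=16):
    """Map a hostname to a day number in 1..num_days, deterministically."""
    digest = hashlib.md5(hostname.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_days + 1

def group_by_day(hostnames, num_days=16):
    """Build {day: [hostnames]} so each admin can see what is due when."""
    schedule = {}
    for name in hostnames:
        schedule.setdefault(patch_day(name, num_days), []).append(name)
    return schedule
```

Every admin running this gets the same answer for the same hostname, so no coordination is needed, but note that it ignores outage windows and priorities completely.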
"Provided by the management for your protection."
I would think that a company managing 7000+ servers would have an automated patch scheduling system similar to BMC Marimba, Altiris, or Opsware. You surely don't have time to purchase and install one of these monsters now, but it might be wise to pursue in the future.
There are also some GPL things that may work. Can't think of them right off hand. If these are *nix desktops/servers, you have plenty of time to write a perl/bash/python script to accomplish the task. Some other slashdot user is going to have to give you advice for a windows environment at this stage of the game you are in.
Best way to do it is to eliminate downtime altogether. Back up your server. Replicate the server over to a second server at a different IP, running in parallel and being updated in real time. When the upgrade time comes, switch the DNS over to the spare server. Patch the first server, then replicate the changes to the data content of the second server (not OS!!!) back to the first server. Switch the DNS back. This way you get a complete backup of all your servers *and* you get your patching done with no disruption. By the 2nd thousand servers you should be able to do this in your sleep.
Holy chit!
Dude you are so fucked. Actually, if they are Unix-based, you're probably okay, but still....
I recommend what the guy(s) above said: bunch them into high-level groups, and delegate each group to SOMEONE ELSE. Problem solved.
Also, in case you haven't figured it out yet, you should do the LEAST IMPORTANT servers first. Use those to fine-tune your scripts or procedures.
I have fewer than 100 FreeBSD servers to deal with, and what I usually do is create a robust upgrade script on the dev servers, copy it out to all the other servers, and run it in parallel. I have a set of scripts to capture all the output and run commands in parallel, so it's pretty easy once the script is done. Takes a couple hours to write the script.
Good luck!
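Not the poster's actual scripts, but the "run in parallel and capture all the output" part might look something like this in Python. To keep the sketch runnable anywhere, the command is executed locally with the hostname appended as an argument; a real version would presumably wrap it in ssh, and the host names here are made up.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_on_host(host, command):
    """Run command with the host appended and capture its output.

    Assumption: runs locally as a stand-in; a real version would go via ssh.
    """
    result = subprocess.run(command + [host], capture_output=True, text=True)
    return host, result.returncode, result.stdout.strip()

def run_everywhere(hosts, command, max_workers=10):
    """Run the command for all hosts in parallel, results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda h: run_on_host(h, command), hosts))

# Stand-in for the upgrade script: a one-liner that echoes the hostname.
demo_command = [sys.executable, "-c", "import sys; print('patched', sys.argv[1])"]
results = run_everywhere(["bsd01", "bsd02"], demo_command)
```

Keeping the return code next to the captured output makes it easy to list only the hosts that failed and rerun just those.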
Tablix is a free software package for solving various types of scheduling problems. If you have enough time on your hands to write the necessary modules for your particular problem I'm sure it can schedule your upgrades in the most efficient way.
>There are also some GPL things that may work. Can't think of them right off hand. If these are *nix desktops/servers, you have plenty of time to write a perl/bash/python script to accomplish the task. Some other slashdot user is going to have to give you advice for a windows environment at this stage of the game you are in.
Hi, I'm "some other Slashdot user," and my advice for the Windows environment is the same as for Linux. Well...almost. If you are running Windows XP on the desktop or 2003 on the server (or later) then Microsoft already has released a patch that should have been part of your regular patch cycle. If not, it's time to dig out WSUS (it's free from Microsoft) or whatever patch management system that you are using to manage your 7000 servers. If you truly have 7000 servers and no patch management system in place, then you are not only screwed, but you are stupid as well.
Now, for anything that is Windows 2000 or older, you will have to manually patch the system, and without the benefit of a patch from Microsoft. No problem. Just hit their Technet article about the issue here and read up on what it entails. Basically, you manually patch one machine of each OS type, export the relevant registry keys, and then import them on the rest of the machines of the matching OS type. Or you can script the install. The referenced site even provides the batch files necessary, but if you want to get fancy you could script it with VBS, Perl, or Javascript (assuming that all of the machines to be patched have WSH installed). You could spend a couple days perfecting the technique and then let the patching script run until it is finished. It shouldn't take too long.
And as far as I'm aware, none of the DST patches (or registry fixes) requires a reboot to complete. All it does is change the date that the DST shift occurs.
This is a political problem.
The best you can do is come up with a realistic schedule for the actual timeframe you have available. And by realistic, I mean working off-hours. Then whoever is at the top of the chain tells everyone else that the upgrade happens at this time, and that's that.
Once upon a time I worked in operations for a Very Large Telecommunications Company (TM). One of my primary duties was to compile an onerous weekly report on server uptimes and send it to one of the directors, via his secretary. One day I found out that his secretary was moving to a different department, so I stopped sending them, to see what would happen. No one ever asked me about those reports again.
"Not an actor, but he plays one on TV."
With all of the comments, only ocbwilg knew or bothered to set this guy straight? The DST change is simple, even if you do not have a patch management system. Nothing a simple script, a file with all of the server names and some time to let it run won't take care of. No reboot is required, so SLAs do not need to be considered here. I do agree that any company with seven thousand servers needs patch management. In fact, I call bullshit. There is no way in hell they even operate without one.
>I would think that a company managing 7000+ servers would have an automated patch scheduling system
Nah. From personal experience I would say that most of them are pretty disorganized. And since they are very much cost driven they don't have cash for luxuries such as automated patch/upgrade tools. I mean, spreadsheets are free as is overtime for salaried employees, right?
putting the 'B' in LGBTQ+
I hate to say it, but Microsoft Access fits your needs almost perfectly, in this case. It can import the data from your spreadsheets, if they're properly formatted. (And they'd have to be, if you wanted to have software make your schedule for you.)
Once your data is in place, you write a query that includes a calculated field for the heuristics you're looking for. Run a query against that that checks against a table containing your available time slots, and you'll have the data you're looking for. (Or, at least, something that will do most of the work for you.)
You've got to patch 7000 servers in four weeks. Do you really want to spend a few days learning a new software package that will do everything when you could take a piece of software you probably already know and simplify the problem in only a day?
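The same two-step idea (a calculated heuristic field, then a join against the available time slots) can be tried without Access. This sketch uses SQLite through Python's sqlite3 module as a stand-in; the table layout, the sample rows and the scoring formula are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE servers (hostname TEXT, grp TEXT, priority INTEGER, dependents INTEGER);
CREATE TABLE slots (grp TEXT, slot_start TEXT);
INSERT INTO servers VALUES
    ('db01', 'finance', 1, 5),
    ('web01', 'sales', 2, 0),
    ('web02', 'sales', 2, 3);
INSERT INTO slots VALUES
    ('finance', '2007-02-21 01:00'),
    ('sales', '2007-02-20 22:00');
""")

# "score" is a made-up heuristic: lower sorts first, and servers with more
# dependents are pulled forward within the same priority level.
rows = conn.execute("""
SELECT s.hostname, t.slot_start, s.priority * 10 - s.dependents AS score
FROM servers AS s
JOIN slots AS t ON t.grp = s.grp
ORDER BY t.slot_start, score
""").fetchall()
```

The result is a flat list of (hostname, slot, score) rows that can be exported straight back into a spreadsheet for the admins.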
Microsoft (from what I've heard from my desktop folks at work) do have a patch for Windows 2000 - it's just not exactly published yet.
Let's just say the company I work for doesn't have more than 1% WinXP....
Karnal
Some people, as I post this, have sort of strongly hinted at this, but nobody else has directly asked this yet.
What are you already using to patch your 7000+ servers? By the time you reach 7000+, this should have been a problem long solved. Hell, I'd expect it to be solved by the 100+ point.
What's so special about this DST patch that your current process can't handle it?
Because if the answer is "we have no process", you've long since lost, and good odds your systems are already seething piles of unpatched, compromised machines.
If you do have a process but it's inadequate, and Slashdot might actually be able to help you, you'll need to be a little more clear on exactly what the problem is, if it isn't "we have no process".
(What is it with people lobbing questions onto Ask Slashdot and almost, but not quite, never following up? Is the lead on Ask Slashdot so long that people die before it gets posted, or just give up? Obviously I ask this before I can tell whether "thesandbender" is one of the rare exceptions... as of this writing, no, unless (s)he's been modded into oblivion.)
no redundancy?
If you had that number of servers you can just take one, upgrade, test, move onto the next and keep on going. There should be 0% downtime.
However if you have crapware that cannot cope in such situations maybe you should be badgering the vendor so that it can be rolled out in a more sensible manner.
>Microsoft (from what I've heard from my desktop folks at work) do have a patch for Windows 2000 - it's just not exactly published yet.
>Let's just say the company I work for doesn't have more than 1% WinXP....
Yes, the word is that there is a "patch" for Windows 2000. But since Windows 2000 is out of mainstream support, Microsoft is only making it available to companies that have purchased extended support agreements for their Windows 2000 systems. Yes, it probably is part of Microsoft's strategy to push customers into upgrading to Windows XP/2003/Vista/Longhorn. Yes, Microsoft will undoubtedly take some heat for it, but they are also freely providing documentation on how to manually resolve the issue and script the fix, and that should be more than sufficient for any admin worth half the title to be able to fix it.
But let's be honest here, we all know (or should know) that if you are running a Microsoft OS that is two or more generations old then there are going to be some issues. If you are still running Windows 2000 in your environment (and my company is, so I speak from experience), then this is undoubtedly not the first issue that you've run into that required a work around, nor will it be the last. Fixing them is part of what we get paid to do. Eventually there will be a point where it becomes more cost-effective to upgrade, and that's what we'll do.
Managers must manage.
You don't have the time to put in a system, but you can craft a one off solution.
Your solution starts by sub-dividing your 7k servers into groups based on business units. Poke around to find out what their SLA is, and then _tell_ them that you are going to bend the SLA a little in order to get this 'OMG CRITICAL PATCH' onto your farm.
No offense, but I have found scripting abilities in Unix/Linux shops to be of a lot higher quality than in Windows shops. Nevertheless, you do have some talent whether you know it or not. Enlist this talent and use scripting for a lot of the nitty gritty details.
Quest Fastlane Reporter, Winbatch, and native WMI are great ways to report on pre and post conditions of servers.
Delegate, delegate, delegate. Let your team plan the methods and schedules for each business unit's servers
Once over the crisis, use the information you have gathered to generate a requirements document and go shopping.
Remember, the key to delegating is trust. You are in charge of managing the 7k servers; you are not in charge of doing the individual upgrades/patches.
I'm sorry to take a bit of a condescending tone, but I'm trying to be clear, not flatter your ego. To reiterate, the bottom line here is that with the time you have, you will be doing an automated manual upgrade. You may find that the process you cobble together will actually become a great plan B when critical patches need to be made; especially if you design with that goal in mind.
Use the 'scare' from the event quickly to get budget money for a Real Patch System(TM).
Good luck!
Move to Arizona, where we don't have Daylight Savings Time.
Scheduling using the earliest deadline first algorithm from the real time computing field might work. Based on the maintenance windows you should have different deadlines for different systems.
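A minimal sketch of that idea, assuming a single admin working sequentially and treating the end of each maintenance window as the deadline. The job data is invented, and jobs that would overrun their deadline are flagged instead of scheduled, which is a simplification rather than textbook real-time scheduling.

```python
from datetime import datetime, timedelta

def edf_schedule(jobs, start):
    """Order jobs by earliest deadline; one admin, one job at a time.

    jobs: list of (hostname, duration_minutes, deadline).
    Returns (timeline, missed): timeline holds (hostname, begin, finish).
    """
    timeline, missed = [], []
    clock = start
    for host, minutes, deadline in sorted(jobs, key=lambda j: j[2]):
        finish = clock + timedelta(minutes=minutes)
        if finish > deadline:
            missed.append(host)  # cannot finish inside its window
            continue
        timeline.append((host, clock, finish))
        clock = finish
    return timeline, missed

# Hypothetical jobs for illustration only.
start = datetime(2007, 2, 20, 22, 0)
jobs = [
    ("web01", 30, datetime(2007, 2, 21, 2, 0)),
    ("db01", 60, datetime(2007, 2, 20, 23, 30)),
    ("app01", 45, datetime(2007, 2, 20, 23, 0)),
]
timeline, missed = edf_schedule(jobs, start)
```

Anything that lands in the missed list needs a second admin, a longer window, or a conversation with whoever owns the SLA.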
Take a package like Minkowsky, or another group calendar package, enter each of the groups you have an SLA with, and block out their you-can't-do-maintenance-here windows as "meetings" for them.
Then try to schedule a "meeting" with as many of them as possible to do the upgrade, and a second meeting with as many as possible of the remaining batch, etc.
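If no calendar package is handy, the "biggest meeting first" trick is simple enough to sketch directly. This is a greedy approximation under made-up data: each group lists the slots it has blocked, and the loop repeatedly picks the slot that the most remaining groups can attend.

```python
def batch_by_free_slot(blocked, slots):
    """Greedily pick the slot most remaining groups can make, then repeat.

    blocked: {group: set of slots when maintenance is NOT allowed}.
    Returns (plan, leftover): plan is a list of (slot, [groups]).
    """
    remaining = set(blocked)
    plan = []
    while remaining:
        best = max(slots, key=lambda s: sum(s not in blocked[g] for g in remaining))
        attendees = sorted(g for g in remaining if best not in blocked[g])
        if not attendees:
            break  # the groups left over have no free slot at all
        plan.append((best, attendees))
        remaining -= set(attendees)
    return plan, sorted(remaining)

# Hypothetical groups and slots for illustration only.
slots = ["Mon 22:00", "Tue 22:00", "Wed 22:00"]
blocked = {
    "sales": {"Mon 22:00"},
    "finance": {"Tue 22:00", "Wed 22:00"},
    "hr": {"Mon 22:00"},
    "ops": {"Mon 22:00", "Tue 22:00", "Wed 22:00"},
}
plan, leftover = batch_by_free_slot(blocked, slots)
```

It does not cap how many servers the admins can handle per slot, so a real schedule would need a capacity limit on each "meeting".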
- "History shows again and again how nature points out the folly of men" -- Blue Oyster Cult, 'Godzilla'
Any decent, current PM system (Altiris PM, MS-WSUS, MS-SMS, etc.) - using a SQL or other database back end - should have a method to identify the devices to patch and build a collection and allow you to specify a certain time frame for applying the patch to the selected groupings on separate schedules and perform any necessary reboots all in an automated fashion. (Sorry about the run-on sentence).
Altiris (or any other vendor, this is just the one I am most familiar with) would probably LOVE to have the opportunity to get their products in the door by doing a limited Proof-Of-Concept PM implementation. Or, you could download a demo copy of Altiris' Client Management Suite ( http://www.altiris.com/Download.aspx ) with a 30-day set of licenses and POC it yourself (1GB Mem Windows Server 2kN with SQL Server 2kN required). (PS - They also have a Server Management suite with plugins that are appropriate for Dell and HP Servers.)
Of course one would hope that with 7k devices you would already have such a system in place.
Best of luck.
Mod +1 Snarky
Yeah, because the cost of WSUS ($0) is just too much to turn a profit when factoring in the jobs they are creating.
For the Acronym Illiterate WSUS = Windows Server Update Services
$diff terrorists hippies
$
$rm -rf *terrorists *hippies
You know you could...oh I don't know...maybe patch ONE server and then write a script that would sync the other 6999 servers with it!
You're using her as bait, Master!
Reboot, Deny, Deny works well.
*Admin reboots server*
User: I'm getting an Outlook error.
Admin: Reboot your computer.
User: Okay, it's working now.
Admin: Must have been your workstation.
*Click*
For sale: Signature. One owner. Low miles. Always garaged. New punctuation, just installed!
We are doing this with Tivoli for some very large shops. See capitalsoftware.com/Forums. We have written a compliance report GUI and we can queue thousands of machines at one time. Also, for Sun and IBM, all you have to do (in most cases) is update the TZD files using their utilities. You can contact me at john_williscapitalsoftware.com