Best Practices For Infrastructure Upgrade?
An anonymous reader writes "I was put in charge of an aging IT infrastructure that needs a serious overhaul. Current services include the usual suspects, i.e. www, ftp, email, dns, firewall, DHCP, and some more. In most cases, each service runs on its own hardware, some of them for the last seven years straight. The machines still can (mostly) handle the load that ~150 people in multiple offices put on them, but there's hardly any fallback if any of the services die or an office is disconnected. Now, as the hardware must be replaced, I'd like to buff things up a bit: distributed instances of services (at least one instance per office) and a fallback/load-balancing scheme (either to an instance in another office or a duplicated one within the same). Services running on virtualized servers hosted by a single reasonably-sized machine per office (plus one for testing and a spare) seem to recommend themselves. What's your experience with virtualization of services and implementing fallback/load-balancing schemes? What's Best Practice for an update like this? I'm interested in your success stories and anecdotes, but also pointers and (book) references. Thanks!"
No, the budget question comes later.
The first questions are: What are your business's requirements regarding your IT infrastructure? How long can you do business without it? How fast does something need to be restored?
Starting from those requirements, you can work out possible designs that fit them. For example, if the requirement is that a machine must be operational again within a week of a crash, you can build computers from random spare parts and hope that they'll work. If the requirement is that it should be up and running in two days, you will need to buy servers from a Tier 1 vendor like HP or IBM with appropriate service contracts. If the requirement is that everything must be up and running again in 4 hours, you'll need backups, clusters, site resilience, replicated SAN, etc.
The question of Budget comes into play much later.
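To make that concrete, here is a rough Python sketch of the "requirement first, design second" mapping. The function name and the cut-offs between the three example points (a week, two days, four hours) are my own assumptions; the examples above only give those three points, so anything tighter than two days is conservatively lumped in with the 4-hour class.

```python
# Hypothetical helper: maps a tolerated outage to the class of design
# described above. The boundaries between tiers are assumptions.
WEEK_HOURS = 7 * 24
TWO_DAYS_HOURS = 2 * 24


def design_for_recovery_time(max_downtime_hours: float) -> str:
    """Return a design class for the longest outage the business tolerates."""
    if max_downtime_hours < 0:
        raise ValueError("downtime cannot be negative")
    if max_downtime_hours >= WEEK_HOURS:
        return "machines built from spare parts"
    if max_downtime_hours >= TWO_DAYS_HOURS:
        return "Tier 1 vendor servers with service contracts"
    # Assumption: anything tighter than two days is treated like the
    # 4-hour requirement.
    return "backups, clusters, site resilience, replicated SAN"


if __name__ == "__main__":
    for hours in (200, 48, 4):
        print(f"{hours} h -> {design_for_recovery_time(hours)}")
```

Only once you know which line of that output applies to you does it make sense to ask what it costs.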
don't touch anything if it's been up and running for the past 7 years. if you really must replicate then get some more cheap boxes and replicate. it's cheaper and faster than virtual anything. if you must. but 150 users doesn't warrant anything in my opinion. I'd rather invest in backup links (from different companies) between offices. you can bond them for extra throughput.
there's hardly any fallback if any of the services dies or an office is disconnected. Now, as the hardware must be replaced, I'd like to buff things up a bit: distributed instances of services (at least one instance per office) and a fallback/load-balancing scheme (either to an instance in another office or a duplicated one within the same).
Is that really necessary? I know that we all would like to have bullet-proof services. However, is the network service to the various offices so unreliable that it justifies the added complexity of instantiating services at every location? Or even introducing redundancy at each location? If you were talking about thousands or tens of thousands of users at each location, it might make sense just because you would have to distribute the load in some way.
What you need to do is evaluate your connectivity and its reliability. For example:
Once you answer at least those questions, then you have the information you need in order to make a sensible decision.
You know, you could've started with a bit more details - what operating system are you running on the servers? What OS are the clients running? What level of service are you trying to achieve? How many people work in your shop? What's their level of expertise?
If you're asking this on Slashdot now, it means you don't have enough experience with this yet - so my first advice would be to get someone involved who does. Someone with many people with lots of experience and knowledge of the platform you work on. This means you'll have backup in case something goes south and your network design will benefit from their experience.
As for other advice, make sure you get the requirements from the higher-ups in writing. Sometimes they have ridiculous ideas regarding the availability they want and how much they're willing to pay for it.
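One way to keep that conversation honest is to translate an availability percentage into hours of permitted downtime before anyone signs off on it. A minimal sketch in Python (the function name is mine, and it assumes a plain 365-day year with no planned-maintenance exclusions):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours per year a service may be down at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% -> {hours:.2f} hours of downtime per year")
```

Going from 99% to 99.9% cuts the yearly allowance from roughly 87.6 hours to roughly 8.8 hours, which is usually where the "how much are you willing to pay" part of the discussion gets interesting.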
If you're like most IT managers, you probably have a budget. Which is probably wholly inadequate for immediately and elegantly solving your problems.
Look at your company's business, and how the different offices interact with each other, and with your customers. By just upgrading existing infrastructure, you may be putting some of the money and time where it's not needed, instead of just shutting down a service or migrating it to something more modern or easier to manage. Free is not always better, unless your time has no value.
Pick a few projects to help you get a handle on the things that need more planning, and try and put out any fires as quickly as possible, without committing to a long-term technology plan for remediation.
Your objective is to make the transition as boring as possible for the end users, except for the parts where things just start to work better.
redundancy.
For services running on Linux, OpenVZ can be used as a jail with migration capabilities instead of a full-on VM.
DISCLAIMER: I don't have a job so I've read about this but not used it in a pro environment yet
Complexity is bad. I work in a department of similar size. Long long ago, things were simple. But then due to plans like yours, we ended up with quadruple replicated dns servers with automatic failover and load balancing, a mail system requiring 12 separate machines (double redundant machines at each of 4 stages: front end, queuing, mail delivery, and mail storage), a web system built from 6 interacting machines (caches, front end, back end, script server, etc.) plus redundancy for load balancing, plus automatic failover. You can guess what this is like: it sucks. The thing was a nightmare to maintain, very expensive, slow (mail traveling over 8 queues to get delivered), and impossible to debug when things went wrong.
It has taken more than a year, but we are slowly converging to a simple solution. 150 people do not need multiply redundant load balanced dns servers. One will do just fine, with a backup in case it fails. 150 people do not need 12+ machines to deliver mail. A small organization doesn't need a cluster to serve web pages.
My advice: go for simplicity. Measure your requirements ahead of time, so you know if you really need load balanced dns servers, etc. In all likelihood, you will find that you don't need nearly the capacity you think you do, and can make do with a much simpler, cheaper, easier to maintain, more robust, and faster setup. If you can call that making do, that is.
The system you have works solidly, and has worked solidly for seven years.
I, personally, am TOTALLY in agreement with the ethos of whoever designed it, a single box for each service.
Frankly, with the cost of modern hardware, you could triple the capacity of what you have now just by gradually swapping out for newer hardware over the next few months, and keeping the shite old boxen for fallback.
Virtualisation is, IMHO, *totally* inappropriate for 99% of cases where it is used, ditto *cloud* computing.
It sounds to me like you are more interested in making your own mark, than actually taking an objective view. I may of course be wrong, but usually that is the case in stories like this.
In my experience, everyone who tries to make their own mark actually degrades a system, and simply discounts the ways that they have degraded it as being "obsolete" or "no longer applicable".
Frankly, based on your post alone, I'd sack you on the spot, because you sound like the biggest threat to the system to come along in seven years.
These are NOT your computers, if you want a system just so, build it yourself with your own money in your own home.
This advice / opinion is of course worth exactly what it cost.
Apologies in advance if I have misconstrued your approach. (but I doubt that I have)
YMMV.
http://slashdot.org/~GuyFawkes/journal
I'd say that everyone has mentioned the big-picture points already, except for one: what kind of users?
150 file clerks or accountants and you'll spend more time worrying about the printer that the CIO's secretary just had to have, which conveniently doesn't have reliable drivers or documentation, even if it had whatever neat feature she wanted and now can't use.
150 programmers can put a mild to heavy load on your infrastructure, depending on what kind of software they're developing and testing (more a function of what kind of environment are they coding for and how much gear they need to test it).
150 programmers and processors of data (financial, medical, geophysical, whatever) can put an extreme load on your infrastructure. Like to the point where it's easier to ship tape media internationally than fuck around with a stable interoffice file transfer solution (I've seen it as a common practice - "hey, you're going to the XYZ office, we're sending a crate of tapes along with you so you can load it onto their fileservers").
Define your environment, then you know your requirements, find the solutions that meet those requirements, then try to get a PO for it. Have fun.
The low-budget solution: buy one server (like a PowerEdge 2970) with like 16GB RAM, a combination of 15k and 7.2k RAID1 arrays, and 4hr support. Install a free hypervisor like VMware Server or Xen, and P2V your oldest hardware onto it. Later on you can spend $$$$$ on clustering, HA, SANs, and clouds. But P2V of your old hardware onto new hardware is a cost-effective way to start.
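Before buying, it is worth a quick sanity check that the old boxes actually fit on the new one. A back-of-the-envelope sketch in Python: the 16GB host figure comes from the suggestion above, but the per-guest RAM numbers and the 2GB hypervisor reserve are made-up placeholders, so substitute what your own machines really use.

```python
# Hypothetical RAM figures (GB) for the old physical machines; replace
# these with measured values from your own servers.
OLD_SERVERS_GB = {"www": 2, "ftp": 1, "email": 4, "dns": 1, "dhcp": 1}


def fits_on_host(guest_ram_gb, host_ram_gb=16.0, reserve_gb=2.0):
    """True if all guests plus a hypervisor reserve fit in host RAM.

    reserve_gb is an assumed allowance for the hypervisor itself.
    """
    return sum(guest_ram_gb) + reserve_gb <= host_ram_gb


if __name__ == "__main__":
    needed = sum(OLD_SERVERS_GB.values())
    verdict = "fits" if fits_on_host(OLD_SERVERS_GB.values()) else "does not fit"
    print(f"Guests need {needed} GB; on a 16 GB host this {verdict}.")
```

RAM is only one dimension (disk I/O on those RAID1 arrays is often the real bottleneck), but it is the cheapest thing to check first.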
So let's see if I understand: you want to take a simple, straightforward, easy-to-understand architecture with no single points of failure that would be very easy to recover in the event of a problem and extremely easy to recreate at a different site in a few hours in the event of a disaster, and replace it with a vastly more complex system that uses tons of shiny new buzzwords. All to serve 150 end users for whom you have quantified no complaints related to the architecture other than it might need to be sped up a bit (or perhaps find a GUI interface for the ftp server, etc).
This should turn out well.
sPh
As for a "distributed redundant system", I strongly suggest you read Moans Nogood's essay "You Don't Need High Availability" and think very deeply about it before proceeding.
Except of course that management ALREADY HAS that because they've been very lucky for 7 years. Why spend money on what works (never mind that we can't upgrade or replace any of it because it's so old)?
I think what the article is really asking is what's a good model to start all this stuff. You're looking at one or two servers per location (or maybe even network appliances at remote sites). We read all this stuff on Slashdot and in the deluges of magazines and marketing material...where do we start to make it GO?
If the current system has been acceptable for 7 years, I'm guessing the users' needs aren't something so mindbogglingly critical that risk must be removed at any cost. Equally, if that was the case, the business would be either bringing in an experienced team or writing a blank cheque to an external party, not giving it to the guy who changes passwords and has spent the last week putting together a jigsaw of every enterprise option out there, and getting an "n+1" tattoo inside his eyelids.
Finally, 7 years isn't exactly old. We've got a subsidiary company of just that size (150 users, 10 branches) running on Proliant 1600/2500/5500 gear (i.e. '90s) which we consider capable for the job, which includes Oracle 8, Citrix MF plus a dozen or so more apps and users on current hardware. We have the occasional hardware fault, which a maintenance provider can address same day and bill us for at ad-hoc rates, yet we still see only a couple of thousand dollars a year in maintenance, leaving us content that this old junk is still appropriate no matter which way we look at it.
I think what the article is really asking is what's a good model to start all this stuff. You're looking at one or two servers per location (or maybe even network appliances at remote sites).
I totally agree with your premise. In my experience taking something that appears to work (when you realize you've really just been lucky) requires some time to bring about the change that the business really needs.
Now, as for having two servers per location, that heavily depends on how those sites are connected. Are they using a dedicated line or a VPN? That's important since that'll affect what hardware needs to be located where. It's possible (even if unlikely) that some sites would only need a VPN appliance... But since the poster seems to want general advice:
VMware ESXi is a pretty good starting place for getting going on virtualization. I've had a great experience with it for testing. When you feel like you've got a good handle, get the ESX licenses.
If SAN isn't in your budget, I still recommend some sort of external storage for the critical stuff... Preferably replicated to another site... But you can run the OS on local storage, especially in the early stages. But you'll need to get everything onto external storage to implement the VMotion services and instant failover. Get a good feel for P2V conversion. It'll save you tons of time when it works... It doesn't always, but that's why you'll always test, test and test.
As for the basic services you stated above (www, ftp, email, dns, firewall, dhcp):
Firewall (IMHO) is best done on an appliance, which should be anywhere you have an internet connection coming in. I'm sure you knew that already, but I'm trying to be thorough.
Email is usually going to be on its own instance (guest, cluster, whatever)... But I find that including it in the virtualization strategy has been quite alright. In fact, my experience with virtualization has been quite good except when there is a specific hardware requirement for an application (a custom card, or something like that). USB has been much less of a headache since VMware has support for it now, but there are also network based USB adapters (example: USBAnywhere) that provide a port for guest OSes in case you don't use VMware.
That's a little harsh, don't you think?
There are untold numbers of us in this guy's position. Asking Slashdot is a damn good start at finding a new methodology. Everyone has an opinion, some of them quite intelligent, a few might even work. It's ok for the Fortune 500 cube dwellers to jump on the phone and call in a long standing contractor to 'handle it' - the rest of us have to slog through the marketdroid crap and translate the latest buzzword infestations to human speak - then just hope we don't screw it up or waste money.
So far the best suggestions appear to be to figure out how critical things are first (which will shape the hardware requirements), budget second. All the while this is encompassed by the usual core job functions that still need to get done.
So rather than point out the redundant, how about using your fingers to provide a potential solution.
1) Buy a comprehensive insurance policy
2) Write a detailed implementation plan that you copied from a Google search
3) Wait the 3-6 months the plan calls out before actual "work" begins
4) Burn down the building using a homeless person as the shill
5) Submit an emergency "continuity" plan that you wanted to deploy all along
6) implement the new plan in one third the time of the original plan
7) come in under budget by 38.3%
8) hire a whole new help desk at half the budgeted payroll (52.7% savings)
9) speak at the board meeting: challenges you overcame in saving the company
10) Graciously accept the position of CIO
(send all paychecks and bonuses to numbered bank account and retire to a non-extradition country) :)