Best Practices For Infrastructure Upgrade?
An anonymous reader writes "I was put in charge of an aging IT infrastructure that needs a serious overhaul. Current services include the usual suspects, i.e. www, ftp, email, dns, firewall, DHCP — and some more. In most cases, each service runs on its own hardware, some of them for the last seven years straight. The machines still can (mostly) handle the load that ~150 people in multiple offices put on them, but there's hardly any fallback if any of the services die or an office is disconnected. Now, as the hardware must be replaced, I'd like to buff things up a bit: distributed instances of services (at least one instance per office) and a fallback/load-balancing scheme (either to an instance in another office or a duplicated one within the same). Services running on virtualized servers hosted by a single reasonably-sized machine per office (plus one for testing and a spare) seem to recommend themselves. What's you experience with virtualization of services and implementing fallback/load-balancing schemes? What's Best Practice for an update like this? I'm interested in your success stories and anecdotes, but also pointers and (book) references. Thanks!"
No, the budget questions comes later.
The first questions are: What are your businesses requirements regarding your IT infrastructure? How long can you do business without it? How fast does something need to be restored?
Starting with those requirements, you can start with possible designs that fit those solutions - for example, if the requirement is that a machine must be operational at last a week after a crash, you can build computers from random spare parts and hope that they'll work. If the requirement is that it should be up and running in two days, you will need to buy servers from a Tier 1 vendor like HP or IBM with appropriate service contracts. If the requirement is that everything must be up and running again in 4 hours, you'll need backups, clusters, site resilience, replicated SAN, etc. pp.
The question of Budget comes into play much later.
there's hardly any fallback if any of the services dies or an office is disconnected. Now, as the hardware must be replaced, I'd like to buff things up a bit: distributed instances of services (at least one instance per office) and a fallback/load-balancing scheme (either to an instance in another office or a duplicated one within the same).
Is that really necessary? I know that we all would like to have bullet-proof services. However, is the network service to the various offices so unreliable that it justifies the added complexity of instantiating services at every location? Or even introducing redundancy at each location? If you were talking about thousands or tens of thousands of users at each location, it might make sense just because you would have to distribute the load in some way.
What you need to do is evaluate your connectivity and its reliability. For example:
Once you answer at least those questions, then you have the information you need in order to make a sensible decision.
You know, you could've started with a bit more details - what operating system are you running on the servers? What OS are the clients running? What level of service are you trying to achieve? How many people work in your shop? What's their level of expertise?
If you're asking this on Slashdot now, it means you don't enough experience with this yet - so my first advice would be to get someone involved who does. Someone with many people with lots of experience and knowledge on the platform you work on. This means you'll have backup in case something goes south and your network design will benefit from their experience.
As for other advise, make sure you get the requirements from the higher-ups in writing. Sometimes they have ridiculous ideas regarding they availability they want and how much they're willing to pay for it.
If you're like most IT managers, you probably have a budget. Which is probably wholly inadequate for immediately and elegantly solving your problems.
Look at your company's business, and how the different offices interact with each other, and with your customers. By just upgrading existing infrastructure, you may be putting some of the money and time where it's not needed, instead of just shutting down a service or migrating it to something more modern or easier to manage. Free is not always better, unless your time has no value.
Pick a few projects to help you get a handle on the things that need more planning, and try and put out any fires as quickly as possible, without committing to a long-term technology plan for remediation.
Your objective is to make the transition as boring as possible for the end users, except for the parts where things just start to work better.
-- lk t lv ll th vwls t f wrds. T svs lts f tm t wrt bt ts pn n th ss t rd nd mks m lk lk cmplt dpsht.
Complexity is bad. I work in a department of similar size. Long long ago, things were simple. But then due to plans like yours, we ended up with quadruple replicated dns servers with automatic failover and load balancing, a mail system requiring 12 separate machines (double redundant machines at each of 4 stages: front end, queuing, mail delivery, and mail storage), a web system built from 6 interacting machines (caches, front end, back end, script server, etc.) plus redundancy for load balancing, plus automatic failover. You can guess what this is like: it sucks. The thing was a nightmare to maintain, very expensive, slow (mail traveling over 8 queues to get delivered), and impossible to debug when things go wrong.
It has taken more than a year, but we are slowly converging to a simple solution. 150 people do not need multiply redundant load balanced dns servers. One will do just fine, with a backup in case it fails. 150 people do not need 12+ machines to deliver mail. A small organization doesn't need a cluster to serve web pages.
My advice: go for simplicity. Measure your requirements ahead of time, so you know if you really need load balanced dns servers, etc. In all likelihood, you will find that you don't need nearly the capacity you think you do, and can make due with a much simpler, cheaper, easier to maintain, more robust, and faster setup. If you can call that making due, that is.
The system you have works solidly, and has worked solidly for seven years.
I, personally, am TOTALLY in agreement with the ethos of whoever designed it, a single box for each service.
Frankly, with the cost of modern hardware, you could triple the capacity of what you have now just by gradually swapping out for newer hardware over the next few months, and keeping the shite old boxen for fallback.
Virtualisation is, IMHO, *totally* inappropriate for 99% of cases where it is used, ditto *cloud* computing.
It sounds to me like you are more interested in making your own mark, than actually taking an objective view. I may of course be wrong, but usually that is the case in stories like this.
In my experience, everyone who tries to make their own mark actually degrades a system, and simply discounts the ways that they have degraded it as being "obsolete" or "no longer applicable"
Frankly, based on your post alone, I'd sack you on the spot, because you sound like the biggest threat to the system to come along in seven years.
These are NOT your computers, if you want a system just so, build it yourself with your own money in your own home.
This advice / opinion is of course worth exactly what it cost.
Apologies in advance if I have misconstrued your approach. (but I doubt that I have)
YMMV.
http://slashdot.org/~GuyFawkes/journal
I'd say that everyone has mentioned that big picture points already, except for one : what kind of users?
150 file clerks or accountants and you'll spend more time worrying about the printer that the CIO's secretary just had to have which conveniently doesn't have reliable drivers or documentation, even if it had what neat feature that she wanted and now can't use.
150 programmers can put a mild to heavy load on your infrastructure, depending on what kind of software they're developing and testing (more a function of what kind of environment are they coding for and how much gear they need to test it).
150 programmers and processors of data (financial, medical, geophysical, whatever) can put an extreme load on your infrastructure. Like to the point where it's easier to ship tape media internationally than fuck around with a stable interoffice file transfer solution (I've seen it as a common practice - "hey, you're going to the XYZ office, we're sending a crate of tapes along with you so you can load it onto their fileservers").
Define your environment, then you know your requirements, find the solutions that meet those requirements, then try to get a PO for it. Have fun.
PC moderators can suck my White pierced, tattooed dick. If you think pride == hate, s/dick/Aryan meat mallet/g.
So let's see if I understand: you want to take a simple, straightforward, easy-to-understand architecture with no single points of failure that would be very easy to recover in the event of a problem and extremely easy to recreate at a different site in a few hours in the event of a disaster, and replace it will a vastly more complex system that uses tons of shiny new buzzwords. All to serve 150 end users for whom you have quantified no complaints related to the architecture other than it might need to be sped up a bit (or perhaps find a GUI interface for the ftp server, etc).
This should turn out well.
sPh
As far as "distributed redundant system", strongly suggested you read Moans Nogood's essay "You Don't Need High Availability" and think very deeply about it before proceeding.
Except of course that management ALREADY HAS that because they've been very lucky for 7 years. Why spend money for what works (never mind we can't upgrade or replace any of it because it's so old)
I think what the article is really asking is what's a good model to start all this stuff. Your looking at one or two servers per location (or maybe even network appliances at remote sites) We read all this stuff on Slashdot and in the deluges of magazines and marketing material...where do we start to make it GO?
That's a little harsh don't you think?
There are untold numbers of us in this guys position. Asking slashdot is a damn good start at finding a new methodology. Everyone has an opinion, some of them quite intelligent, a few might even work. It's ok for the fortune 500 cube dwellers to jump on the phone and call in a long standing contractor to 'handle it' - the rest of us have to slog through the marketdroid crap and translate the latest buzzword infestations to human speak - then just hope we don't screw it up or waste money.
So far the best suggestions appear to be to figure out how critical things are first (which will shape the hardware requirements), budget second. All the while this is encompassed by the usual core job functions that still need to get done.
So rather than point out the redundant, how about using your fingers to provide a potential solution.