Best Practices For Infrastructure Upgrade?
An anonymous reader writes "I was put in charge of an aging IT infrastructure that needs a serious overhaul. Current services include the usual suspects, i.e. www, ftp, email, dns, firewall, DHCP, and some more. In most cases, each service runs on its own hardware, some of them for the last seven years straight. The machines still can (mostly) handle the load that ~150 people in multiple offices put on them, but there's hardly any fallback if any of the services dies or an office is disconnected. Now, as the hardware must be replaced, I'd like to buff things up a bit: distributed instances of services (at least one instance per office) and a fallback/load-balancing scheme (either to an instance in another office or a duplicated one within the same). Services running on virtualized servers hosted by a single reasonably-sized machine per office (plus one for testing and a spare) seem to recommend themselves. What's your experience with virtualization of services and implementing fallback/load-balancing schemes? What's Best Practice for an update like this? I'm interested in your success stories and anecdotes, but also pointers and (book) references. Thanks!"
Maybe the first question should really be: what's your budget?
I've been looking at HP c3000 chassis office-size blade servers, which may serve as your production+backup+testing setup and scale up moderately for what you need. They are compact, easily manageable remotely, and, if you're good about looking around, not terribly overpriced. Identical blades make a nice starting point for hosting identical VM images.
Why virtual servers? If you are going to run multiple services on one machine (and that's fine if it can handle the load) just do it.
No, the budget question comes later.
The first questions are: What are your business's requirements regarding your IT infrastructure? How long can you do business without it? How fast does something need to be restored?
Starting with those requirements, you can work out possible designs that fit them. For example, if the requirement is that a machine must be operational again at most a week after a crash, you can build computers from random spare parts and hope that they'll work. If the requirement is that it should be up and running in two days, you will need to buy servers from a Tier 1 vendor like HP or IBM with appropriate service contracts. If the requirement is that everything must be up and running again in 4 hours, you'll need backups, clusters, site resilience, replicated SAN, etc.
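To make the "requirements first" point concrete, here is a minimal Python sketch that maps a tolerated downtime to the three rough design tiers named above. The function name and the exact cut-off values (4 hours, 48 hours) are assumptions taken from the examples in this comment, not an established standard.

```python
def design_tier(max_downtime_hours: float) -> str:
    """Return the rough design tier for a tolerated downtime.

    The thresholds mirror the three examples in the comment above;
    they are illustrative assumptions, not industry rules.
    """
    if max_downtime_hours <= 4:
        # Hours, not days: redundancy has to already be in place.
        return "backups, clusters, site resilience, replicated SAN"
    if max_downtime_hours <= 48:
        # Two days: rely on a vendor service contract for repairs.
        return "Tier 1 vendor servers with service contracts"
    # About a week: rebuilding from spare parts may be acceptable.
    return "machines built from spare parts"


if __name__ == "__main__":
    for hours in (4, 48, 168):
        print(f"{hours:>4} h tolerated downtime -> {design_tier(hours)}")
```

The point of writing it down this way is that the input is a business decision (how long can we be down?), and the hardware shopping list is only the output.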
The question of Budget comes into play much later.
Don't touch anything if it's been up and running for the past 7 years. If you really must replicate, then get some more cheap boxes and replicate; it's cheaper and faster than virtual anything. But 150 users doesn't warrant anything, in my opinion. I'd rather invest in backup links (from different companies) between offices. You can bond them for extra throughput.
Let's cut to the chase: how much MONEY do you have? It's all well to ask pie-in-the-sky questions, but then reality sets in and we find you can't afford it.
Why don't you start with what you CAN afford, and then go from there (because you know that's what your PHB and Bean Counters are going to tell you).
there's hardly any fallback if any of the services dies or an office is disconnected. Now, as the hardware must be replaced, I'd like to buff things up a bit: distributed instances of services (at least one instance per office) and a fallback/load-balancing scheme (either to an instance in another office or a duplicated one within the same).
Is that really necessary? I know that we all would like to have bullet-proof services. However, is the network service to the various offices so unreliable that it justifies the added complexity of instantiating services at every location? Or even introducing redundancy at each location? If you were talking about thousands or tens of thousands of users at each location, it might make sense just because you would have to distribute the load in some way.
What you need to do is evaluate your connectivity and its reliability. For example:
Once you answer at least those questions, then you have the information you need in order to make a sensible decision.
Beware of load balancing, because it will tempt you into getting too little capacity for mission-critical work. You need enough capacity to handle the entire load with multiple nodes down, or you will be courting a cascade failure. Load balancing is better than fallback, because you will be constantly testing all of the hardware and software setups and will discover problems before an emergency strikes; but do make sure you've got the overcapacity needed to take up the slack when bad things happen.
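The overcapacity rule above can be checked with simple arithmetic. The following Python sketch assumes identical nodes and a single "peak load" figure in the same unit as node capacity (requests per second, mailboxes, whatever you measure); the function names are made up for illustration.

```python
import math


def survives_failures(node_capacity: int, node_count: int,
                      peak_load: int, nodes_down: int) -> bool:
    """True if the surviving nodes can still carry the whole peak load."""
    remaining_capacity = (node_count - nodes_down) * node_capacity
    return remaining_capacity >= peak_load


def nodes_needed(node_capacity: int, peak_load: int, nodes_down: int) -> int:
    """Nodes to buy so the peak load is covered with `nodes_down` nodes failed."""
    return math.ceil(peak_load / node_capacity) + nodes_down


if __name__ == "__main__":
    # Three nodes of 100 units each look fine for a load of 250 ...
    print(survives_failures(100, 3, 250, nodes_down=0))
    # ... but losing one node leaves 200 units for a load of 250.
    print(survives_failures(100, 3, 250, nodes_down=1))
    print(nodes_needed(100, 250, nodes_down=2))
```

The trap the comment warns about is sizing the pool for the first case only: a balanced pool that is just big enough with every node up will overload the survivors as soon as one node fails.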
You know, you could've started with a bit more detail: what operating system are you running on the servers? What OS are the clients running? What level of service are you trying to achieve? How many people work in your shop? What's their level of expertise?
If you're asking this on Slashdot now, it means you don't have enough experience with this yet, so my first advice would be to get someone involved who does: ideally a firm with many people who have lots of experience and knowledge of the platform you work on. This means you'll have backup in case something goes south, and your network design will benefit from their experience.
As for other advice, make sure you get the requirements from the higher-ups in writing. Sometimes they have ridiculous ideas regarding the availability they want and how much they're willing to pay for it.
If you're like most IT managers, you probably have a budget. Which is probably wholly inadequate for immediately and elegantly solving your problems.
Look at your company's business, and how the different offices interact with each other, and with your customers. By just upgrading existing infrastructure, you may be putting some of the money and time where it's not needed, instead of just shutting down a service or migrating it to something more modern or easier to manage. Free is not always better, unless your time has no value.
Pick a few projects to help you get a handle on the things that need more planning, and try and put out any fires as quickly as possible, without committing to a long-term technology plan for remediation.
Your objective is to make the transition as boring as possible for the end users, except for the parts where things just start to work better.
I disagree: when you have a budget of $800 and some shoestrings, it eliminates a lot of questions ;)
I am still in the process of upgrading a "legacy" infrastructure in a smaller (less than 50) office but I feel your pain.
First, it's not "tech sexy", but you've got to get the current infrastructure all written down (or typed up - but then you have to burn to cd just in case your "upgrade" breaks everything).
You should also "interview" users (preferably by email but sometimes if you need an answer you have to just call them or... face to face even...) to find out what services they use - you might be surprised to find something that you didn't even know your Dept was responsible for (oh, that Panasonic PBX that runs the whole phone system is in the locked closet they forgot to tell you about...)
Your next step is prioritizing what you actually need/want to do... remember that you're in a business environment so having redundant power supplies for the dedicated cd burning computer may not actually improve your workplace (but yes, it might be cool to have an automated coffee maker that can run on solar power...)
So now that you know pretty much what you have and what you want to change...
Technology wise, Virtualization is definitely your answer... and there's a learning curve:
VMWare is pretty nice and pretty expensive.
VirtualBox (which I use) is free but doesn't have as many enterprise features (e.g. automatic failover)
Xen with Remus or HA is the thinking man's setup
All of the above will depend on reliable hardware - that means at least RAID 1, and yes you can go with SAN but be aware that it's a level of complexity you might not need (for FTP, DNS, etc.)
Reading what you've listed as "services" it almost sounds like you want a single linux VM running all of those things with Xen and Remus...
Good luck, and TEST IT before you deploy it as a production setup.
Note that he did say VMWare on a cluster. I have an idiot at my office trying to do VMWare all on one server and failing to realize this still creates one point of failure. If you are going to do virtualization, the only benefit comes when you invest in a cluster; otherwise don't do it at all.
For services running on Linux, OpenVZ can be used as a jail with migration capabilities instead of a full-on VM.
DISCLAIMER: I don't have a job so I've read about this but not used it in a pro environment yet
Complexity is bad. I work in a department of similar size. Long long ago, things were simple. But then due to plans like yours, we ended up with quadruple replicated dns servers with automatic failover and load balancing, a mail system requiring 12 separate machines (double redundant machines at each of 4 stages: front end, queuing, mail delivery, and mail storage), a web system built from 6 interacting machines (caches, front end, back end, script server, etc.) plus redundancy for load balancing, plus automatic failover. You can guess what this is like: it sucks. The thing was a nightmare to maintain, very expensive, slow (mail traveling over 8 queues to get delivered), and impossible to debug when things go wrong.
It has taken more than a year, but we are slowly converging to a simple solution. 150 people do not need multiply redundant load balanced dns servers. One will do just fine, with a backup in case it fails. 150 people do not need 12+ machines to deliver mail. A small organization doesn't need a cluster to serve web pages.
My advice: go for simplicity. Measure your requirements ahead of time, so you know if you really need load balanced dns servers, etc. In all likelihood, you will find that you don't need nearly the capacity you think you do, and can make do with a much simpler, cheaper, easier to maintain, more robust, and faster setup. If you can call that making do, that is.
Outsource everything to "de cloud", because that way when everything fails spectacularly it isn't your fault.
Don't forget the network as well: the switches, and maybe the cables too. Also, if you find any hubs, get rid of them ASAP.
Also, the servers should be linked to each other with gig-E.
A lot of Windows software can make virtualization a necessity, since running certain components on the same machine may create an unsupported configuration or be a security nightmare. For example, a Terminal Server and DC on the same machine is a security nightmare.
The system you have works solidly, and has worked solidly for seven years.
I, personally, am TOTALLY in agreement with the ethos of whoever designed it, a single box for each service.
Frankly, with the cost of modern hardware, you could triple the capacity of what you have now just by gradually swapping out for newer hardware over the next few months, and keeping the shite old boxen for fallback.
Virtualisation is, IMHO, *totally* inappropriate for 99% of cases where it is used, ditto *cloud* computing.
It sounds to me like you are more interested in making your own mark, than actually taking an objective view. I may of course be wrong, but usually that is the case in stories like this.
In my experience, everyone who tries to make their own mark actually degrades a system, and simply discounts the ways that they have degraded it as being "obsolete" or "no longer applicable"
Frankly, based on your post alone, I'd sack you on the spot, because you sound like the biggest threat to the system to come along in seven years.
These are NOT your computers, if you want a system just so, build it yourself with your own money in your own home.
This advice / opinion is of course worth exactly what it cost.
Apologies in advance if I have misconstrued your approach. (but I doubt that I have)
YMMV.
I'd say that everyone has mentioned the big-picture points already, except for one: what kind of users?
150 file clerks or accountants and you'll spend more time worrying about the printer that the CIO's secretary just had to have, which conveniently doesn't have reliable drivers or documentation, even if it had whatever neat feature she wanted and now can't use.
150 programmers can put a mild to heavy load on your infrastructure, depending on what kind of software they're developing and testing (more a function of what kind of environment are they coding for and how much gear they need to test it).
150 programmers and processors of data (financial, medical, geophysical, whatever) can put an extreme load on your infrastructure. Like to the point where it's easier to ship tape media internationally than fuck around with a stable interoffice file transfer solution (I've seen it as a common practice - "hey, you're going to the XYZ office, we're sending a crate of tapes along with you so you can load it onto their fileservers").
Define your environment, then you know your requirements, find the solutions that meet those requirements, then try to get a PO for it. Have fun.
The low-budget solution: buy one server (like a Poweredge 2970) with like 16GB RAM, a combination of 15k and 7.2k RAID1 arrays, and 4hr support. Install a free hypervisor like Vmware Server or Xen, and P2V your oldest hardware onto it. Later on you can spend $$$$$ on clustering, HA, SANs, and clouds. But P2V of your old hardware onto new hardware is a cost-effective way to start.
Yes, but for example management wanting 24/7 2 hour up&running SLA and having hired a single guy with a budget of 800$ will not work - this is important to get sorted out early. Management needs to know what they want and what they'll get.
Yeah, I thought this was obvious, but until a few weeks ago our head office (which I only visit occasionally) had been using a non-switched hub to connect about 10 PCs together, plus the internet router. Big face-palm!! As soon as I realised that I went out and bought a $25 switch to replace it. Suddenly their database didn't experience slowdowns anymore. Surprise!
Really, what you're being unspecific about is the difference between an upgrade and an overhaul.
From the floor up (power, cooling, cabling, footprint) is an overhaul.
If you want a phased or otherwise piecemeal approach, you still have to consider each phase a small overhaul within a larger system.
7-year-old equipment is likely not going to be cascaded, so really you're considering it as a candidate for a heart transplant, which means building some sort of life support while the new system (the heart) is brought online in parallel. This is very expensive in time, budget, and resources.
You're really going to need to know your business's processes over the course of more than a "business year" so as to do everything without problems.
Business moments like tax time, EOY reports, monthly invoicing periods, HR/payroll are to be expected and must still function.
Unpredictables like supporting business audits (like having to pull up old records, on systems that no longer read them?) and changes in executive leadership would also impact an upgrade/overhaul.
At no time did you ever mention disaster recovery plan, regular offsite backup strategy or a business continuity plan. These are often overlooked or dealt with inappropriately during normal business times and should be verified prior to beginning. A major overhaul or upgrade could or ought to trigger any one of these at any moment.
I have been there, and I have been there when everyone in the room craps in their pants when the tapes have been found to be lost or unreadable or blank.
How did you get put in charge of such a project when it is obvious that you have no clue on carrying out the tasks?
That's not true. Running as a VM guest makes it easy to move an image to another machine as time and budget allow. Just because you don't have a cluster right now, doesn't mean it's stupid to go that path.
Wow, someone who really seems to know what they are talking about. You sure you meant to post here? Couldn't agree with you more, requirements come first (although I've seen them often get revised down during the budgeting phase).
So let's see if I understand: you want to take a simple, straightforward, easy-to-understand architecture with no single points of failure that would be very easy to recover in the event of a problem and extremely easy to recreate at a different site in a few hours in the event of a disaster, and replace it with a vastly more complex system that uses tons of shiny new buzzwords. All to serve 150 end users for whom you have quantified no complaints related to the architecture other than it might need to be sped up a bit (or perhaps find a GUI interface for the ftp server, etc).
This should turn out well.
sPh
As far as "distributed redundant system", strongly suggested you read Moans Nogood's essay "You Don't Need High Availability" and think very deeply about it before proceeding.
Why have the headaches? Why not have it hosted? Companies like Rackspace make it easy and simple. You can also use their cloud services, which are cheap and easy: set up a server in less than 5 minutes and pay only for the memory and bandwidth you need. Need more? It's just a few mouse clicks away.
..if it aint broke..
Also not completely true...
When your new cluster comes in and it is not the same architecture (e.g. UltraSPARC instead of your current x86 box), you're not going anywhere with your shiny VM.
You should make sure the application itself can be scaled, not the machine it is running on.
Sometimes that means using virtualization because the application is a bitch...
But a lot of applications can be scaled without virtualization.
The administrator that uses virtualization for his fileserver should be fired because he is incompetent. His data itself can easily be moved from his old single cpu box to the new SAN array.
Except of course that management ALREADY HAS that because they've been very lucky for 7 years. Why spend money for what works (never mind we can't upgrade or replace any of it because it's so old)
I think what the article is really asking is what's a good model to start all this stuff. You're looking at one or two servers per location (or maybe even network appliances at remote sites). We read all this stuff on Slashdot and in the deluges of magazines and marketing material... where do we start to make it GO?
not really, you can split your VMs between 2-3 servers and do the migrations manually in the beginning. Once you make the virtual images the hard work is done, even if you just run 2 images per server, you've saved money or increased reliability. Now that you have VMs you can reinstall from backup tapes to another configured server so you have a start at disaster recovery. Once that part is done it's a function of how much money you are allowed to throw at the solution (blades, clusters, sans, etc)
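As a rough illustration of "split your VMs between 2-3 servers and migrate manually", the Python sketch below spreads VM images round-robin over a few hosts and then works out where the images would go if one host died. Host and VM names are hypothetical, and the code only plans the move; actually copying and starting an image is whatever your hypervisor requires.

```python
def place_vms(vms: list[str], hosts: list[str]) -> dict[str, list[str]]:
    """Spread VM images across hosts round-robin."""
    placement: dict[str, list[str]] = {host: [] for host in hosts}
    for index, vm in enumerate(vms):
        placement[hosts[index % len(hosts)]].append(vm)
    return placement


def evacuate(placement: dict[str, list[str]],
             failed_host: str) -> dict[str, list[str]]:
    """Plan a manual migration: move the failed host's VMs to the
    surviving host that currently runs the fewest VMs."""
    survivors = [host for host in placement if host != failed_host]
    if not survivors:
        raise ValueError("no surviving host to migrate to")
    new_placement = {host: list(placement[host]) for host in survivors}
    for vm in placement[failed_host]:
        target = min(survivors, key=lambda host: len(new_placement[host]))
        new_placement[target].append(vm)
    return new_placement


if __name__ == "__main__":
    plan = place_vms(["www", "ftp", "mail", "dns"], ["host-a", "host-b"])
    print(plan)
    print(evacuate(plan, "host-a"))
```

Note that the plan only works if the surviving hosts have the spare RAM and disk to take the extra images, which is the same overcapacity question raised for load balancing.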
If you are considering virtualisation and high availability, check with a vendor like Zeus (www.zeus.com) to get a software version of a load balancer (both local and global) that can run in a virtual environment.
If the current system has been acceptable for 7 years, I'm guessing the users' needs aren't something so mindbogglingly critical that risk must be removed at any cost. Equally, if that was the case, the business would be either bringing in an experienced team or writing a blank cheque to an external party, not giving it to the guy who changes passwords and has spent the last week putting together a jigsaw of every enterprise option out there, and getting an "n+1" tattoo inside his eyelids.
Finally, 7 years isn't exactly old. We've got a subsidiary company of just that size (150 users, 10 branches) running on Proliant 1600/2500/5500 gear (i.e. '90s) which we consider capable for the job, which includes Oracle 8, Citrix MF plus a dozen or so more apps and users on current hardware. We have the occasional hardware fault which a maintenance provider can address same day and bill us at ad-hoc rates, yet we still see only a couple of thousand dollars a year in maintenance, leaving us content that this old junk is still appropriate no matter which way we look at it.
From your post, you are not looking at this with the right perspective, not asking the right questions, nor asking them of the right people. You state that you have been put in charge of "maintaining" and never once mention anything about your company's predicted growth, development plans, future computation needs, near and long term service offerings, uptime requirements, security requirements or so forth. You have to do a requirements analysis that extends to between five and ten years and design a system that can grow seamlessly with your employer, meeting their current and expected needs in all pertinent areas.
If you can develop a system that does what is required on paper, the next step is to implement it in parallel with the existing system, and transition services and users over in phases. After all services have been transitioned, you can decommission the old infrastructure piece by piece.
I'm a systems admin at a small college with about 1000 desktop machines in the buildings. We were a strictly Sun/Solaris shop for a long time, but in the last couple years we've invested in some 1U dual processor Xeon boxes. These run Ubuntu Server and Xen. We're in the process of moving services from physical Solaris servers to virtual Xen servers. Two x86 servers can basically replace our old 16 server Sun rack. We'll likely keep our storage array around for a while, but so far LDAP, email, and web services have been migrated. DHCP and DNS could easily be migrated and if you buy 2U servers with enough large hard drives, a separate storage array probably wouldn't be necessary.
The tension between budget and business requirements can be useful but it is largely a paper tiger. A budget without a business requirement is a recipe for failure. The budget can help you refine the requirement but ultimately if you cannot pay for what you require, you're not likely to be in business very long. Putting budget first is wasteful and likely to lead to a network that doesn't fit the needs of the business.
Understand the requirements first and plan to meet them. If there is extra budget then consider adding more or better hardware and services. If there is not enough budget; if the requirements are firm, the network plan efficient and the infrastructure has to be replaced all at once, then start looking for another job. Otherwise plan for replacements over several years.
Why would you buy a cluster not the same architecture? You don't know what you're talking about. VMs generally aren't used to change architecture like that. In a Virtualized Cluster the "OS" is just another data file too! Just point an available CPU to your file server image on the SAN and start it back up... that's smart, not lazy!
Most people need virtualization because managing crappy old apps on old server OSes is a bitch. The old busted apps are doing mission critical work, customized to the point the manufacturer won't support them and management doesn't want to pay out for the new version... or the new version doesn't support the old equipment. The leading purpose for VMs is to get new shiny hardware with a modern OS and backup methods to segregate your old hard to maintain configurations to instances. Then the old and busted doesn't crash the core services anymore. Instances that used to be on dedicated, busted hardware that used to require a call-out can be rebooted from your couch in your jammies! (I vote VNC on iPhone as the killer admin app!) VMs include backup at the VM level, so those old machines that refused to support backup can be backed up "in spite of" the software trying to prevent it.
Unless you have power problems or financial restrictions you're better off with dedicated boxes. I currently run 3 old computers. Ubuntu, Windows XP, Windows 2003 with Apache on XP running PHP sites and doing reverse proxy for the IIS server on the 2003 box. Ubuntu handles memcache. Because I'm not made out of money I'm going to virtualize all three systems onto one quad core system which will cost around $600 rather than $1800 for three new systems. It'll also cut down on power usage.
Slowness can be caused by any number of issues. An old harddrive can cause a system to be sluggish. Just imaging the existing systems onto brand new drives could make things better. Upgrading the network to 1Gbit or just making sure the switches you have are performing could help. Putting more memory into existing systems could also speed things up.
Make sure the power supplies are running well, fans aren't clogged with dust, and that proper cooling is in place.
If all else is not sufficient, progressively purchase new systems to replace old ones and give the old ones to charity after 6 months to make sure everything is good.
Lots of other people have already pointed this out, but I'll chime in: don't mess with what works.
Unless you have a huge influx of people coming in or a change in the way the network will be used, stick to the current set up. Do not go virtual or load balance and complicate things. That may even void your support contracts if you have any. Assuming you have to upgrade, try this:
1. Buy new servers for each service, just like it was before.
2. Buy at least one extra server. Maybe more.
3. Set up one new server at a time, keeping the old one on hand, in case something on the new server doesn't work perfectly. You should always always be able to revert back during the transition.
4. Make images of the new servers. Use clonezilla or something similar. Then, if one server dies, you have an image that can be transferred to a spare machine (see #2).
The big things here are that you should keep things simple, have a backup in case of hardware/software failure, and do one service at a time. That ensures that if something goes wrong, you know which server caused the problem.
www, ftp, email, dns, firewall, dhcp
Decide what truly needs to be distributed: DNS, DHCP, firewall. What likely does not need to be distributed: WWW, FTP, email.
DNS can be replicated with BIND, or you can run a DNS server that uses MySQL and replicate the MySQL database. DHCP must run at each site, but you need to decide if you want DNS updated with DHCP. If so, you need to decide if you want those hostnames available across the network. DHCP can update DNS when a client requests an address, DNS can then replicate between each site's DNS server, and in the end you could access that machine from anywhere on the network that is permitted by your firewall rules.
For the firewall, consider just using iptables and a bash script to download the current config and then replace some placeholders in the file with the local IP information. I have done this where I keep a copy of the firewall config on an internal webpage and just download the file, sed out my LOCALIPADDRESS and WANIPADDRESS with the local IPs, and write that data to iptables on a schedule with cron. That way you can make a broad change to the firewalls at each site in a single file.
Email doesn't like to be distributed. Consider simply keeping a hot spare, even at a remote site, using something like DRBD to keep the email store in sync. Because you already have DNS everywhere, you can quickly adjust the DNS entries for the email server. Use low TTL numbers so downtime is minimized. Then you can ssh into the remote machine and mount the store, then start the email services, and you are in business.
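The placeholder substitution described above is done with bash and sed in the original setup; the sketch below shows the same idea in Python, as an illustration only. The two placeholder names come from the comment, while the template rules and the example addresses (taken from the documentation-only ranges) are made up. Downloading the template and loading the result into iptables are deliberately left out.

```python
# A tiny shared template in iptables-restore style; the rules are
# illustrative placeholders, not a recommended policy.
TEMPLATE = """\
*filter
-A INPUT -d LOCALIPADDRESS -p tcp --dport 22 -j ACCEPT
-A INPUT -d WANIPADDRESS -j DROP
COMMIT
"""


def render_firewall(template: str, local_ip: str, wan_ip: str) -> str:
    """Fill the per-site placeholders into the shared firewall template."""
    return (template
            .replace("LOCALIPADDRESS", local_ip)
            .replace("WANIPADDRESS", wan_ip))


if __name__ == "__main__":
    # Each site runs this with its own addresses (hypothetical values here).
    print(render_firewall(TEMPLATE, "192.0.2.10", "203.0.113.5"))
```

Keeping the per-site values out of the template is what makes the single-file change possible: the shared file carries the policy, and each site only supplies its own two addresses.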
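Why the low TTL matters can be shown with a little arithmetic. In this Python sketch, the detection and switch-over times are hypothetical numbers you would have to measure yourself; the only firm point is that a resolver which cached the old record just before the change may keep using it for up to one TTL.

```python
def worst_case_outage_seconds(detect_s: int, switch_s: int, ttl_s: int) -> int:
    """Rough upper bound on how long a client may keep hitting the dead
    mail server: time to notice the failure, time to mount the store,
    start services and change DNS, plus one full TTL of cached answers."""
    return detect_s + switch_s + ttl_s


if __name__ == "__main__":
    # Assumed: 5 minutes to notice, 10 minutes of manual switch-over work.
    for ttl in (300, 86400):
        total = worst_case_outage_seconds(300, 600, ttl)
        print(f"TTL {ttl:>6} s -> up to {total / 60:.0f} minutes")
```

With a one-day TTL the DNS cache dominates everything else; with a five-minute TTL the manual work does, which is where effort on the failover procedure starts to pay off.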
Once you have met your legal and other regulatory minimum requirements, the rest of the upgrade programme is down to your decision makers. For example: some prefer not to implement hot-standby (relying instead on perhaps a third-party, or business insurance), some make it a 100% absolute requirement for each and every server they possess, you can't just make a statement in isolation, you'll need guidance from the people who control the money - as that's what it all boils down to.
Once you have the answers to two questions:
- what do you value
- how much are you willing to spend for what degree of risk
You can start to make plans. All the best practices I have come across appear to have been written by or for government departments where budgets are effectively infinite, and the worst possible scenario is to open yourself to criticism from your peers and rivals. In the real world neither of these conditions exists. Further, while it's not always good to re-invent the wheel, blindly following one scheme without understanding its values, shortcomings or benefits means you will certainly not get the best value for your organisation and will not provide a solution that is best for their circumstances.
There is however one best practice you should follow: get everything (esp. from your own people) in writing - who said what, when and to whom.
If you have heard of Small Business Server, Microsoft just released a 3 server solution for businesses of your size called EBS. It will do everything you just outlined including setting the foundation for branch office scenarios with redundancy. With EBS, you get SharePoint, Exchange, Fax serving, AD, DNS, DHCP, firewall, FTP, IIS for web serving all included. Because it is built on Windows Server 2008, you get access to all the services that it provides. It will be a huge leap in user experience for your end-users and you'll finally stop fire fighting and actually allow time to deal with the real IT/Business challenges.
Rather than pushing the features, the real work you need to do is to identify business requirements and map them to features, implementation costs, and upkeep costs.
Once you have a sane, self-managing system in place, you can start to roll out self-service IT systems for your users so they don't bother you for password resets. Some would say that you're putting yourself out of a job by doing this, but if you play your cards right and plan out the technical and the social aspects of the project, you will really be a hero and you'll probably be seen in a more respectable light.
visit http://www.microsoft.com/ebs
If the administration 'team' has equal access to all the services today on disparate servers, I don't think virtualization is necessarily a good idea; the services can be consolidated in a single OS instance.
In terms of HA, put two relatively low end boxes in each branch (you said 7 year old servers were fine, so high end is overkill). Read up on Linux-HA, which is free, and use DRBD to get total redundancy in your storage as well as a cheap software mirror or RAID 5. Some may rightfully question the need for HA, but this approach is pretty dirt cheap at low scale.
I think what the article is really asking is what's a good model to start all this stuff. You're looking at one or two servers per location (or maybe even network appliances at remote sites).
I totally agree with your premise. In my experience taking something that appears to work (when you realize you've really just been lucky) requires some time to bring about the change that the business really needs.
Now, as for having two servers per location, that heavily depends on how those sites are connected. Are they using a dedicated line or a VPN? That's important since that'll affect what hardware needs to be located where. It's possible (even if unlikely) that some sites would only need a VPN appliance... But since the poster seems to want general advice:
VMWare ESXi is a pretty good starting place for getting going on virtualization. I've had a great experience with it for testing. When you feel like you've got a good handle, get the ESX licenses.
If SAN isn't in your budget, I still recommend some sort of external storage for the critical stuff... Preferably replicated to another site... But you can run the OS on local storage, especially in the early stages. But you'll need to get everything onto external storage to implement the VMotion services and instant failover. Get a good feel for P2V conversion. It'll save you tons of time when it works... It doesn't always, but that's why you'll always test, test and test.
As for the basic services you stated above (www, ftp, email, dns, firewall, dhcp):
Firewall (IMHO) is best done on appliance. Which should be anywhere you have an internet connection coming in. I'm sure you knew that already, but I'm trying to be thorough.
Email is usually going to be on its own instance (guest, cluster, whatever)... But I find that including it in the virtualization strategy has been quite alright. In fact, my experience with virtualization has been quite good except when there is a specific hardware requirement for an application (a custom card, or something like that). USB has been much less of a headache since VMWare has support for it now, but there are also network based USB adapters (example: USBAnywhere) that provide a port for guest OSes in case you don't use VMWare.
Don't forget, with all the shiny new servers, to have some sort of backup fabric in place for each and every one of them.
I'd focus on four backup levels:
Level 1, quick local "oh shit" image based restores: A drive attached to the machine where it can do images of the OS and (if the data is small) data volumes. Then set up a backup program (the built in one in Windows Server 2008 is excellent). This way, if the machine tanks, you can do a fast bare metal by booting the OS CD, pointing it to the backup volume, pointing out the new OS volume, click "restore", walk off.
Level 2, a network backup server: The server would be a machine with a large amount of disk, and a tape autochanger. It would run at the low end Retrospect or Backup Exec, upper end, Networker, ArcServe, or TSM. And it would do d2d2t backups, so grabbing the data from machines is fast so you can do the most with a backup window. Then, with the tape array, make a rotation system factoring offsites to Iron Mountain, as well as onsite backups. Of course, this server would handle archiving, perhaps with a dedicated DLT-ICE (or similar WORM tech) drive for backups that can't be tampered with.
Level 3, offsite strategy: If you need to have stuff up 24/7, consider a hot or warm site that can take over should something happen to the main site. Even if you don't need an offsite server room, you do need offsite backup storage and rotation planning. Usually this is Iron Mountain's domain, but it can't hurt to also have a tape safe on some leased company property only known by the top IT brass just in case.
Level 4, the cloud: Cloud storage is costly. There are also security issues with it. However, the advantage is that if your data center gets completely obliterated, the data is still accessible. I'd recommend having some form of encryption (PGP comes to mind, perhaps on the cheap, TrueCrypt containers), and storing your core business tax data (Quickbooks/Peachtree) here. You want to store what you need to recover the business, but you don't want to store too much because you are paying lots of cash for it. Last time I checked, buying an external 1TB drive every month was cheaper than what a cloud provider charges per month for a terabyte of storage. But you are paying for cloud storage's SLA and reliability.
I know backup fabric is usually the last thing on an IT department's minds, but it is VERY important, and may mean the company exists or doesn't exist when (not if) something happens.
Tailor this to your requirements and budget, of course.
You got a lot of posts pointing out the error of your ways; basically what people are saying - it sounds gung ho, there is no clear reasoning in the post justifying your shift.
Maybe they are a bit strong but note there is a lot of experience behind them.
Having said that, I would like to take a kinder gentler tone. Once you go through your fundamental reasons for wanting change, I'd suggest you choose ONE big thing that you want to do. Changing everything at once is usually not so hot.
So what could be a goal that would make your users happier and you a hero? Well, I don't know, but I can tell you what is typical in many such cases:
- lowering capital costs (less spending on physical servers and their maintenance) while keeping everything running is one; cloud computing may help on that
- faster performance is one, but only in those places where users are actually complaining. Making a list of those places and fixing them one at a time would be an approach.
- new business needs is another one, but for that - leave everything that works alone and focus on solving very well the new business need. Your partners are your CEO, CFO, marketing etc...
For example, seems from your post that the overall architecture of the system is actually quite decent. So you may want to just repeat that same architecture in an updated way in a cloud computing approach, save some money and prepare for the next computing trend. If you decide that is for you, move one server at a time, arrange fail-over in the cloud, and prove one-at-a-time that it works as fast as the old stuff.
Bit of advice: don't just do virtualization without knowing why. If the business reason is economics, then jump over virtualization to the next trend, cloud computing. If it isn't economics, don't bother with virtualization at all.
Consider your goals and choose ONE. 'Nuff said.
If you have external access at your offices, leave everything as-is. Image everything, and use Amazon as a backup machine. Simple, low-cost, and basically on-demand.
More info about the setup would be good, but if everything's been running, don't touch it - back it up.
where do we start to make it GO?
It can be helpful to engage an independent VAR. Not all, but some, offer presales assistance that includes needs assessment and design for free or at low cost. They do this with the hope that by demonstrating their technical prowess you will be more comfortable with buying from them, and in the hope that you'll engage their engineering teams for best-practice deployment consulting.
It sounds like the organization in the fine article doesn't have a lot of experience with this. Modern systems can be complex and a single configuration error can lead to downtime, wide-open security, and more. Ask slashdot is nice, but it's not a dialog with a certified professional with years of experience who's on your site and has spent some time understanding your network and needs.
Help stamp out iliturcy.
At least for external services like www. Big red buttons do get pushed. I worked at one company where the big red button in the data centre got pushed, all power went off immediately (the big red button is for fire safety and must cut ALL power) and the Oracle DB got trashed, taking them off air for four days; their customers were not happy. They got religion about redundancy.
Redundancy is one of those things like backups, support contracts, software freedom, etc. that management don't realise how much you need until you get bitten in the arse by the lack of it. You clearly get it, which is good.
(I have a similar problem at present: an important dev machine has (a) no service redundancy (b) no disk redundancy. (b) is unlikely, (a) requires duplicating all services including a proprietary version control system onto another box. I'm going to have to switch on an old Ultra 60 that's been decommissioned. Argh.)
http://rocknerd.co.uk
1) don't screw up. This is a great opportunity to make huge improvements and gain the trust and respect of your managers and clients. Don't blow it.
2) Make sure you have good back ups. Oh you have them? When was the last time you tested them?
3) Go gradually. Don't change too many things at once. This makes recovering easier and isolating the cause easier.
4) Put together a careful plan. Identify what you need to change first. Set priorities.
5) Always have a fallback position. Take the old systems offline, cut over to the new system. If the new system fails, roll back. And leave the old systems available for a while until you feel assured the new ones are stable.
6) Don't drink the koolaid. Any product purporting to help migrations should be avoided unless people you trust have used it and/or you are very familiar with it.
7) Always remember point number 1. Being conservative and careful are your best tools.
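Point 2 (testing your backups) can be sketched in a few lines. This is only an illustration under my own assumptions, not a feature of any backup product: restore a backup into a scratch directory with whatever tool you use, then compare checksums against the live data. The names `tree_digests` and `compare_trees` are hypothetical.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads the whole file into memory; hash in chunks for large files.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            digests[path.relative_to(root).as_posix()] = digest
    return digests


def compare_trees(original, restored):
    """Return (missing, changed): files absent from, or different in, the restore."""
    a, b = tree_digests(original), tree_digests(restored)
    missing = sorted(set(a) - set(b))
    changed = sorted(name for name in set(a) & set(b) if a[name] != b[name])
    return missing, changed
```

A restore test passes when both lists come back empty. Files that changed on the live system after the backup ran will show up as "changed", so compare against a snapshot or a quiet data set.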
putting the 'B' in LGBTQ+
The question is not about hardware or configuration. It is about best practices. This is a higher level process question. Not an implementation question.
putting the 'B' in LGBTQ+
Here's how we do it:
- Run your services in a few vservers on the same physical server:
* DNS + DHCP
* mail
* ftp
* www
- Have a backup server where your stuff is rsynced daily. This allows for quick restores in case of disaster.
Vservers are great because they isolate you from the hardware. Server becomes too small? Buy another one, move your vservers to it and you're done. Need to upgrade a service? Copy the vserver, upgrade, test, swap it with the old one when you are set. It's a great advantage to be able to move stuff easily from one box to another.
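For what it's worth, the daily rsync to the backup server can be wrapped in a small script and run from cron. This is a rough sketch, not our actual setup: the host name `backup01` and the paths are made up, and `build_rsync_command` and `run_backup` are hypothetical helpers. The only rsync options used are `-a`, `--delete` and `--dry-run`.

```python
import subprocess


def build_rsync_command(source_dir, backup_host, dest_dir, dry_run=False):
    """Build an rsync command line that mirrors source_dir to the backup server."""
    cmd = ["rsync", "-a", "--delete"]
    if dry_run:
        cmd.append("--dry-run")
    # Trailing slash: copy the directory's contents, not the directory itself.
    cmd += [source_dir.rstrip("/") + "/", f"{backup_host}:{dest_dir}"]
    return cmd


def run_backup(source_dir, backup_host, dest_dir):
    """Run the sync; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(build_rsync_command(source_dir, backup_host, dest_dir), check=True)


if __name__ == "__main__":
    # Hypothetical vserver path and backup host, printed rather than executed.
    print(" ".join(build_rsync_command("/vservers/mail", "backup01", "/backup/mail/")))
```

Keep in mind that `--delete` mirrors deletions too, so a plain daily mirror gives you a quick restore of yesterday's state but no history.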
If MS is going to astroturf, you need to at least learn to be a bit more subtle about it. That post couldn't have been more obviously marketing drivel if it tried. Regardless of technical merit of the solution (which I can't discuss authoritatively).
The post history of the poster is even more amusingly obvious. No normal person is a shill for one specific cause in every single point of every post they ever make.
To all companies: please keep your advertising in the designated ad locations and pay for them, don't post marketing material posing as just another user.
XML is like violence. If it doesn't solve the problem, use more.
Heh. I worked in a small office once where their backbone was a 24-port hub. Better yet, they were using thin clients for everything, so they were slamming that hub every single second of every single day. Once the hub was replaced, it was amazing how many of their "performance issues" disappeared...
And when you need it back up within 5 minutes, and no data loss (other than data that didn't occur due to downtime)?
I don't see what all the fuss is about VMs. It allows you to continue to run one service per "box" and cut down on the number of servers. Using VMs has allowed us to consolidate numerous slightly used, dedicated boxes. In turn, we have improved our failovers with VMware's management console and snapshots saved to a SAN. Near-instantaneous recovery without all the headaches. We still do tape and spinning-disk backups depending on how critical the machine's mission is. There are still a lot of services that best practices require to have their own box, infrastructure services being the critical one. All the rest do just fine virtualized. As for the remote offices, they shouldn't need more than slaved DHCP, DNS, LDAP/Active Directory, a gateway, and a firewall unless you're using the remote location for load balancing on web, connection redundancy, etc. We use MPLS to one of our remote offices for this ourselves. HTH, will
It's not a super config, and a lot of people will argue that it's not a true setup, but it's sufficient for our needs. I think we hit 4% CPU utilization across all the nodes the other day.
With VMWare, watch the 2TB filesystem limit. We ran in to that with our SATA array. Basically you have to slice it in to 2TB chunks to get VMware to accept it as a datastore.
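The slicing is simple arithmetic. Here is a quick sketch for planning it, assuming a flat limit of 2048 GB per datastore slice (check the exact limit for your VMware version; that figure and the `slice_array` helper are my own assumptions) and whole-number sizes in GB.

```python
def slice_array(total_gb, max_slice_gb=2048):
    """Split an array of total_gb into slices no larger than max_slice_gb.

    Sizes are whole numbers of GB; the 2048 GB default is an assumed limit.
    """
    if total_gb <= 0 or max_slice_gb <= 0:
        raise ValueError("sizes must be positive")
    full, remainder = divmod(total_gb, max_slice_gb)
    slices = [max_slice_gb] * full
    if remainder:
        slices.append(remainder)
    return slices


if __name__ == "__main__":
    # Hypothetical 7000 GB SATA array.
    print(slice_array(7000))
```

Each element of the returned list is one LUN to present to VMware as its own datastore.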
As far as networking goes, we have a couple of gigE switches running the traffic. Our SANs are redundant, as we clone all of the machines from our SAS "SAN" to our SATA. If the "production" SAN goes down we can start up the clone from the SATA box in minutes. After the primary SAN comes back up we can VMotion it across to the other data store.
"Maybe the first question should really be: what's your budget?"
Maybe the first question should really be: you are in charge for the transition but you are clueless about how to do it. What the heck?
"The tension between budget and business requirements can be useful but it is largely a paper tiger."
Yes indeed, but not because of the reasons you highlight. There is no tension between budget and requirements since the budget is just a natural outcome of the requirements themselves: you don't need 24x7 services; you lose XXX dollars per hour when the service is down. Once you factor in the risk management is willing to take, your budget is just a matter of multiplication: it's XXX dollars per downtime hour multiplied by the risk you are accepting. You lose 10.000 per downtime hour and you don't want to lose more than 100.000 on a risk you measured to have a 10% chance (a ten-hour downtime)? Then your allowed up-front cost for this is 30.000 (for iron under three years' amortization).
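The arithmetic behind that 30.000 figure can be spelled out. The sketch below assumes the 10% is a yearly probability, which is my reading of the example rather than something stated outright; `allowed_upfront_cost` is a hypothetical name.

```python
def allowed_upfront_cost(loss_per_hour, outage_hours, annual_probability,
                         amortization_years):
    """Expected loss avoided over the amortization period.

    Assumes annual_probability is the chance of one such outage per year.
    """
    loss_per_outage = loss_per_hour * outage_hours
    expected_annual_loss = loss_per_outage * annual_probability
    return expected_annual_loss * amortization_years


if __name__ == "__main__":
    # Figures from the example: 10.000 per hour, ten hours, 10%, three years.
    print(allowed_upfront_cost(10_000, 10, 0.10, 3))
```

With those inputs the loss per outage is 100.000, the expected yearly loss is 10.000, and three years of amortization gives the 30.000 ceiling. Spend more than that up front and the mitigation costs more than the risk it removes.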
I'm used to hearing "I want uber-redundancy and 24x7 availability" "well, that'll cost you XXX" "But I can't pay that!" That means that you don't earn that much from that system. It's never "I can't afford it" but "it doesn't get me so much".
1) Thank you, thank you for thinking of best practices before taking serious action. 2) ITIL is your friend. http://en.wikipedia.org/wiki/Information_Technology_Infrastructure_Library When implemented deliberately and properly, ITIL makes an IT admin darn near *comfy*. Just remember that ITIL != bureaucracy, ITIL == Best Practices.
Assuming you have seven year old Microsoft OS boxes, then switching over to a fewer number of latest Linux OS boxes would be an improvement. Many of the services you list can run in the same Linux box just as happily - without VMing them. Others you may want a dedicated box (email server with big HDD arrays). For a small facility having only 150 users you've got a small budget and insignificant system loads.
However, if you want to make a more significant dent in operations, equipment costs and IT maintenance, look into client-server setups using LTSP.org - transfer all fat-client based 150 users to thin clients (stripped down current machines or new thin clients the size of desk phones) running on a few back-room servers. Switch over the office phone system to something like Asterisk etc. Look into FreeNAS and m0n0wall/pfSense. Set up a Drupal or Wordpress system to publish internal documents and/or to the Web. Lots to keep you busy and productive besides those few old workhorses.
If there is extra budget then consider adding more or better hardware and services.
Shouldn't that be... if there's extra budget, think a little farther ahead, about how your requirements are likely to increase in the future?
So you can try to meet the anticipated future requirements earlier, to be more efficient, in saving money on delaying future upgrades that will otherwise be needed.
To have true high-availability, even 2 VMware servers isn't enough, you need a reliable shared storage system that both servers can access.
Even then, the storage chassis itself will be a central point of failure. To have true HA you need a pair of independent shared storage units with continuous synchronous replication and some reliable mechanism of failover.
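A back-of-the-envelope calculation shows why the single storage chassis matters. This is only a sketch: it assumes independent failures and perfect failover, and the 99% figures are invented for illustration. `redundant` and `in_series` are hypothetical helper names.

```python
from math import prod


def redundant(availability, copies):
    """Availability of `copies` independent units where any one suffices."""
    return 1 - (1 - availability) ** copies


def in_series(*availabilities):
    """Availability of a chain where every component must be up."""
    return prod(availabilities)


# Hypothetical figures: each host and each storage unit is up 99% of the time.
host, storage = 0.99, 0.99

# Two hosts sharing one storage chassis: capped by the chassis.
single_storage = in_series(redundant(host, 2), storage)

# Two hosts plus a replicated pair of storage units.
dual_storage = in_series(redundant(host, 2), redundant(storage, 2))

if __name__ == "__main__":
    print(f"one chassis:  {single_storage:.6f}")
    print(f"two chassis:  {dual_storage:.6f}")
```

However many hosts you add, the first figure can never exceed the availability of the one storage unit; only duplicating the storage lifts that ceiling.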
But even without HA...
There are still benefits of running only on one server and using virtualization. Getting higher utilization of a smaller volume of hardware still saves money, since you aren't running 10 servers sitting at 10% load all the time.
You can run multiple OSes.
You can run applications that require their own OS install. For example, a domain controller can run on its own without other apps running on the DC. The major apps each have their own server.
Finally, there are security benefits of isolating apps to their own server. If one server is compromised, it can be taken out of service without affecting the other apps.
You can run the bleeding edge server OS version only for the app that needs it, and run more stable code for other apps.
If one server crashes due to an OS bug, the others keep running.
The hypervisor itself is a thin OS, and if run on proper hardware is highly stable. Driver issues are unlikely to bring down your servers, especially when utilizing advanced CPU features such as processor VT and IOMMU which provide sophisticated I/O and device isolation functions.
Of course, your hardware is a single point of failure. But backups/disaster recovery is easier to manage in a virtual environment: you just use VCB and make regular copies of your VMDKs to a secondary piece of metal to prevent data loss.
Switching an existing deployment from x86 to UltraSPARC is nuts. Also, SPARC is dying, it's extremely unlikely your new cluster will be SPARC. Almost certainly you will pick x86, x86_64, or Itanium. Itanium is also a niche market, however, and it's unlikely your app will suddenly need it.
Better to in fact virtualize the fileserver. So you can run multiple things on the box. Virtualization basically guarantees you can move the application with minimal work, when you scale up the storage infrastructure later.
If you ever get a SAN, attach FC, iSCSI LUNs with the files to the fileserver, or serve a VMDK with the data from the NAS, problem solved. Let the SAN serve data to servers. Let servers serve data to users. Don't let users come within 100 feet of the SAN or other hard-to-upgrade device (security nightmare).
That's a little harsh don't you think?
There are untold numbers of us in this guy's position. Asking slashdot is a damn good start at finding a new methodology. Everyone has an opinion, some of them quite intelligent, a few might even work. It's ok for the fortune 500 cube dwellers to jump on the phone and call in a long standing contractor to 'handle it' - the rest of us have to slog through the marketdroid crap and translate the latest buzzword infestations to human speak - then just hope we don't screw it up or waste money.
So far the best suggestions appear to be to figure out how critical things are first (which will shape the hardware requirements), budget second. All the while this is encompassed by the usual core job functions that still need to get done.
So rather than point out the redundant, how about using your fingers to provide a potential solution.
But you can run the OS on local storage, especially in the early stages.
I would just like to point out that I've been in several environments where people "started out" with their virtualized servers on local storage, then moved them over to SAN.
Let me just say that unless you're willing to drop some *serious* coin on SAN, avoid this...unless you're going to run top-end EMC of some kind, and keep your virtualized servers down to 20 or so per SAN.
Basically, it comes down to this: ever see the throughput 30+ servers pull down from a dual-port fiber connection when they all boot? It's ugly, my friend.
If you decide to go that route regardless, invest in at least 2 dual-port 4Gb Fibre Channel HBAs per host server. You'll be thankful in the long run.
"When I am king, you will be first against the wall..."
1) Buy a comprehensive insurance policy
2) Write a detailed implementation plan that you copied from a Google search
3) Wait the 3-6 months the plan calls out before actual "work" begins
4) Burn down the building using a homeless person as the schill
5) Submit an emergency "continuity" plan that you wanted to deploy all along
6) implement the new plan in one third the time of the original plan
7) come in under budget by 38.3%
8) hire a whole new help desk at half the budgeted payroll (52.7% savings)
9) speak at the board meeting: challenges you overcame to save the company
10) Graciously accept the position of CIO
(send all paychecks and bonuses to numbered bank account and retire to a non-extradition country) :)
So does a cluster, of course. The back-end storage array required for virtual host migration, or the Veritas clustering tools you may use for service clustering, also form single points of failure. And Veritas has historically been extremely unstable under load: it's often misconfigured, it's often mishandled entirely, and it often mistakes having a "high reliability filesystem" for having a highly reliable failover system, when that filesystem itself may be corrupted by the actual software. This is a very serious problem for Oracle systems, by the way. Far too many installers mistake "clustering" software for having a master/slave, and mistake master/slave setups for having actual backups.
Except of course that management ALREADY HAS that because they've been very lucky for 7 years
Whoa there - so using this logic we can assume the company has no fire insurance, etc, because they've been lucky and not had their building burn down in 7 years? Managers might not understand technical issue but one thing managers worth the title CAN do is manage risk ie: balance cost of risk mitigation against risk. I can well imagine a company of 150 people that actually doesn't have any mission critical servers worth spending a lot on redundancy, etc. I can also imagine a company that has gotten lucky while at the same time, the IT person(s) haven't explained IT risks/costs in proper terms because they assume the managers just aren't technical.
The original questioner definitely needs to do a proper risk / cost analysis and present it to the managers. (But right now his "ideas" are WAY too vague and not business need driven) A prompt, proper analysis and plan/alternate plan(s) for risk and risk avoidance is seriously wanted. It will CYA for that magic moment any day now when these 7 year old systems start failing.
One thing I'm struck by (over, and over, and over again) is just how frequently "solutions" to keep critical systems from "ever failing" don't. I've personally witnessed a solution costing tens of millions of dollars come crashing down due to a single failed server. And I'm not talking about something that was whomped up in the back office by the team; I'm talking Major Vendors (you'd know the names if I could say them, but I can't; please don't ask), vendors that are not thought of as simple lightweights (as some other, also nameless vendors are). And in the case I'm thinking of, it wasn't a single point of failure. There were over two dozen other servers able to accept the virtual instance - but none did. So the whole house of cards came down. It was the final acceptance demo. Boy, was there a LOT of egg on faces.
About the only "highly available" services that I've really seen work are geo-separated Xiotech SANs, geo-separated Stratus systems - the old, old ones, running Motorola 680x0 chips, (8098 for example), IBM RS-6000's (with Oracle replicated databases), and (shudder) Sperry V-77's, hand built for wagering. (My GHU! People really still use Z80s!) My own private testing of 10 Linux systems running in a cluster was more favorable than any major OEM's Windows/Intel solution, but as the creator of the demo, I can't claim to be completely unbiased. However, even with 5 of the 10 servers having had the power plug pulled (or SCSI card cable yanked, or in one memorable case, the mobo hit with a Taser - I hated that hardware and wanted to get rid of it), it did keep running just fine. Most times, the user did not have to authenticate again and the transaction was preserved, but in a few tests this didn't work: the user had to log in again, and the transaction was rolled back and not completed.
I've never seen a "solution" put together with WinTel platforms that were absolutely reliable. They may be out there, but I've never witnessed one tested by the "Back Room Guys" that passed with flying colors. Perhaps this is because I'm stupid, ignorant, and can't construct a valid test. I'm open to being corrected... but so far, all I've ever heard are whines and nitpicks.
In a few cases, I wanted to tell the vendor "go put on your man pants and try again."
Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves.
Also, SPARC is dying
Beg your pardon, sir, but do you care to offer a citation for that reference? Perhaps a Netcraft confirmation of some sort?
Didn't think so....
For the record, it's just a tad over your head.
And for the record, SPARC isn't out of the game just yet...
...according to their 1st quarter results from FY2010, things are looking up.
"When I am king, you will be first against the wall..."
I work with VMware daily, so I am biased but also experienced.
The problems you have currently;
-can't get replacement parts for 7year old servers
-if something fails you can't buy a server like the one you have to restore onto; the data is still retrievable it'll just take longer
-you have no like-hardware test environment
How a virtual environment will help you; lets say 2 current model servers and a piece of shared disk
-p2v is more efficient than re-installing on bare metal
-2 servers provide redundancy for ALL the virtual machines
-disaster recovery is now hardware independent
-you can snap-shot and roll-back upgrades that fail
-you can add more resources (cpu) by adding another server
-you can provision new virtual servers easily
-you can fix the hardware during business hours!
Personally I would recommend a mainframe implementation for the workloads you are suggesting.
If you're gonna over cook this, may as well go all in, eh
Two points have already been mentioned before;
1. What kind of users are we talking here? Globally diverse store managers? Scientists? Wall Street? Web developers? Each one of these groups will have different ideas of what "Reliability" means. Which brings me to;
2. Tiers. What are your critical (never-down) services? Typically this translates to cost; how much will a company-wide email outage cost you per day? Hour? Minute? DNS/DHCP/WINS (shudder) and all your "infrastructure" services will probably fall under this category. But which Applications do you provide, and what are the users expectations? This is a great chance to start having "User Group" meetings with the various sections of your user-base, and start fleshing-out requirements.
3. Plans. Everyone with a tie will love to see a black-and-white document outlining things like Backups, Disaster Recovery, Risk Analysis, Acceptable Use Policy, and so forth. However, most small networks (10-20 servers) don't have anything like this. Heck, even if it's "boot the old systems", it's still a plan. Write one up, use a template, Google has a few dozen last I checked.
4. Migration Plan. One thing you can bet on: if *anyone* non-IT has had free rein inside the network, there will be little files, scripts, cron jobs, applications, firewall settings, etc that have been tweaked and long forgotten. Before you "Decommission" anything, make sure it survives a reboot, and make an image of the filesystems.
Word to the wise: set up a Linux box somewhere with a good chunk of space and throw all of them on there, then make sure that system is backed up. Try to avoid mentioning this to anyone, as it increases the "awe" factor and cuts down on unnecessary retrieval requests.
5. Blog, wiki-fy, etc. *Anything* that the users can take a look at and "see" what you're doing. Being an I.T. techie is like being a ninja; If you do your job right, no one even knows you're there. But screw up, and everyone will have a torch and pitchfork with your name on it. Sometimes having things out in the open will negate that (maybe they just bring flashlights, instead of actual torches).
6. Go Slow. Take a look at what servers you have, inventory what all is running on them, and guestimate how long it would take to set that up. Then multiply that by a Scotty factor and state that in your paperwork.
Remember, small-time IT guys seldom leave peacefully, they're typically ridden out on a rail. (This coming from someone who's been the exception to that, narrowly at times).
"When I am king, you will be first against the wall..."
This isn't impossible except for the official SLA bit, it's kind of how it's done in my office, and I suspect many others. We've got a number of servers all built with standard off the shelf components from an internet parts shop that happens to also be locally based. We've got one spare server, and if anything other than hard discs fail, we just move the discs into the spare server and switch it straight back on. If the hard discs fail, someone switches on the appropriate services on the spare and sets it going (it's got every service configured up but switched off).
Not necessarily saying this is a perfect idea - we recently had a run of hardware failures, which to me suggests that while you can build a server for £200, you probably shouldn't. We're now I believe looking at better standard motherboards, and proper hard discs for all the servers. It will push up the costs to about £400 / server, excluding KVMs which were got in the computer room budget. And it's all run by one person (thankfully, not me!)
Services running on virtualized servers hosted by a single reasonably sized machine per office seem to recommend themselves.
If your services have started to recommend themselves, they have achieved self-awareness. My advice is to do whatever they ask, and try not to antagonise them.
You need a mod point and I wish I had one.
Anyhow, you can estimate what load anonymous is dealing with - 150 users, 7 year old hardware and that it is not maxed out.
to feet incoming web requests
I think you meant "two feet incoming web requests", which probably means "manually submitted web requests" - mind you, I'm going out on a limb here, could mean roughly 61 cm.
When in the hell has management ever been reasonable? "Do X on a budget of less than half what you need, twice as fast as possible, I went to a training seminar, I know it can be done" seems pretty familiar.
-1 disagree is not a modifier for a reason. -1 troll, flaimbait, redundant, overrated are NOT acceptable substitutes.
Go Microsoft. Easiest way
Well, it's not like I haven't had any clashes with our management (or even that of some of our customers), but I've found it to be the better approach to actually talk things through in the hope of getting a better understanding on both sides.
From my technical standpoint, it's very important that management _exactly_ knows what they're getting for their money. This also means saying "No". Yes, you can lose customers if you don't promise them 99.999999999% availability for 50$ a month, but the real question is whether you actually wanted those customers in the first place - they may find some idiot who agrees to work with them, but that's their loss.
Even internally, as we also run our internal infrastructure, it's important to say "no" to unreasonable tasks and stupid ideas. Either management trusts you to actually do the job they've hired you to do, or they don't - then you'll need to find a new job.
Here's what I would recommend. I'm not affiliated with any company either way, but it's what worked for us.
- If you have money to spend: go with HP servers and an EqualLogic or LeftHand SAN, running VMware ESX.
- If budget is tight (I would do this either way): get Dell servers as hosts (R710s with Nehalems; they work better than older quad-processor boxes). Sign up for Sun Startup Essentials; you get more than 30% off retail. Get their 2510 iSCSI SANs or 2530 SAS SANs; they go for 8 grand for 6 TB.
- Have two copies in two locations running VMware ESX.
Even worse, statistically speaking the chance of failure will increase the longer things don't have a failure.
Everything will fail eventually. If something hasn't failed yet, the chance it will happen 'soon' increases.
New things are always on the horizon
VM's are not a security feature. More code means more bugs, which increases the chance of more security problems.
Buy 3 machines. Put all the services on each and put one in each of your 2 appropriate locations. Everything you list can run on a single Linux box. Use the 3rd for your sandbox.
I have done this before !
Simple: 2 machines at the main site, which will host all the services and act as backups for one another.
Then 1 machine per each external site hosting all the services too.
Virtualization would only be recommended if security is crucial, and only for the services accessible from outside. But it's a complexity add-on !
Recommendation for system : Mac Mini Servers ! with Snow Leopard Server !
Ritchie
From your question, I'd say you're on the verge of a huge screw-up.
You must be young. Don't set out to make your mark. On the contrary, set out to make yourself entirely forgettable, which is what people want from their IT infrastructure.
First, look to replacing what's currently there, and nothing more. There don't seem to be any requests for added features.
If you can do that within budget, look at what is lacking. It may be ease of use, reliability, redundancy, backups, disaster recovery, speed, room to grow, features...
If you want to be really smart, do just what's asked of you, under budget, under deadline, with no hassle. But plan ahead for the next few requests, and document that. When those requests come up, you'll be able to turn around and say: I knew it, I planned for it already. THAT earns you points. Not trying to force any random feature that catches your fancy down management's and users' throats.
The Cloud - because you don't care if your apps and data are up in the air.
Creeping complexity was the bane of my last job - we went from a single-box mail system to a load-balanced front end separate from the mailstore because they wanted "disaster recovery" in case the Tier 1 datacenter we ran our rack of gear at lost all connectivity. Even though none of our customers paid for that level of uptime. It also had a lot more problems than the single-box solution - some that were extremely difficult to fix.
If you're worried about failover, and have the budget, VMware ESX and VMotion, with a cheap replicated SAN, will give you what you're looking for in hardware redundancy. It's painfully expensive, but if they want redundancy, there's no way to do it short of paying a lot of money. Laying out the cost of that 99.999% uptime to management normally serves to get their expectations in line with reality - if it doesn't, then it's time to update that resume, because you'll get blamed for not delivering.
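To put a number on what an uptime target actually permits, here is a rough calculation. It is only a sketch: the function name is made up for illustration, and it assumes a 365-day year with no planned maintenance windows excluded.

```python
def allowed_downtime_minutes(availability_percent: float, days: float = 365.0) -> float:
    """Minutes of downtime per period that an availability target still permits."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100.0)


# Compare a few common targets; "five nines" leaves only about five minutes a year.
for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% uptime allows {allowed_downtime_minutes(target):.1f} minutes of downtime per year")
```

Showing management that 99.999% leaves roughly five minutes a year, against more than three and a half days for 99%, is usually enough to start the cost conversation.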
There is no such thing as high availability, easy to use software. It's all complex, and hiring people to work on that shiny new load balanced system just became more difficult - the vast majority of IT types don't have enterprise experience, and those with the experience are going to be working on similar systems for companies that pay a heck of a lot more. The easier you make your architecture, the easier it is to hire help.
Why can't I mod "-1 Idiot"?
mod parent up.
The first step is to find out what the business wants, and how much it is willing to pay. THEN you go out to find out what tech is appropriate/affordable to do it.
Ask the heads of each office and the main business managers what they want the tech to do now, in a year and in three years. Do you have a business continuity plan that has to be allowed for? If you don't have a BC plan, now's a good time to have one done, before you buy a load of kit that may not do the job.
Once you have a list of business needs, and put them in a prioritised list (again the managers set the priority), you go out and look at what can do the job. Assuming you find a reasonable solution within budget, you need to plan the migration.
Protip: do not attempt to migrate everything in one go. Do it in steps, with breaks in between.
Proprotip: whatever your migration, be able to revert to the original solution in less than 8 hours - ie one working day.
Migration is the biggest gotcha - plan, plan and plan again. Do a dry run. Start with the least critical services. You do have backups, right? Fully tested backups, from ground zero? You do have all your network and infrastructure accurately and completely mapped out, and all configuration settings / files stored on paper and independent machines?
Both arguments for VM and KISS have their place - only you can decide. But when you do decide, make sure it's based on evidence, and will end up making the business better.
Don't forget Total Cost of Ownership - the shiny boxes may run faster, but will you have to hire two more techs to keep them running, or a new maintenance contract?
Don't forget training - for you, your staff and the end users. If you're putting shiny newness in place, people will need to know how to use it, and do their jobs at least as quickly as on the old solution. No use putting in shiny web4.0 uber cloud goodness, if the users end up spending an hour doing a job that used to take 5 minutes, because they don't know how to use it properly, or the interface doesn't easily work with their business processes.
good luck
Can I suggest OpenVZ or Linux VServer? That is, if you do want to separate the services for maintainability. Not if it's overkill, of course (DNS and DHCP, for example, can run fine on the same machine).
150 people depend on the questioner's responsibility for a living, and the post seems like he's into a new hobby.
The question of budget would seem paramount in this case.
Yes, but if you have a requirement for 4-hour return to service and they only give you an $800 budget, you should run screaming from the room ;-)
Virtualization isn't always done for redundancy. Virtualization all on one server makes perfect sense if the goal is server consolidation & energy savings. Just make sure that management understands that.
I prefer rogues to imbeciles because they sometimes take a rest.
If you want a virtual environment, which in my experience is really easy to administer, you need some sort of SAN or iSCSI environment. Then you have a base for attaching the needed computing power to this storage solution. It will be costly to start up, mostly because of the rather powerful switches you need to get. Those are easily 10K apiece.
We just set up a brand new virtual environment at my work (university IT department serving about 5k people). The trick is really to get the infrastructure in place first: network connectivity, backbone and power redundancy, etc. Then we are adding Dell R710 boxes with 50GB RAM (we are upgrading all 5 of them to 128GB next year) and 2x quad-core Xeons; those are cheap, only about 7k apiece. The processing power of those new Nehalem Xeons is awesome! Can definitely recommend.
For a not too expensive SAN I would recommend Dell's EqualLogic boxes. They have all the new features while being robust and built with redundancy (2 storage controllers, PSUs, etc.); the basic box with 40TB is about 70k.
Since the main concern in my eyes is your aging hardware, you need to migrate one way or the other. Just P2V'ing the old stuff to a VM may not be desirable if you need to update all the software anyway. Otherwise it is an easy way to move your old servers in a convenient and safe way.
good luck.
Before doing my upgrade, I wanted to be sure my infrastructure would be up-to-date with current standards. The following 2-part document first qualifies the person giving advice and then presents 25 questions I needed that person to answer.
(As each of the 25 questions is covered on the CISSP exam, a competent consultant should be able to guide you in the right direction.)
Feel free to adjust the estimates of person-hours for each task. The estimates below are for a company with about 50 servers, 50 network devices, and a WAN / MPLS covering a dozen offices across the US.
Good luck!
RFQ Goal: THE COMPANY desires to contract with a consultant who will, on an annual basis, review THE COMPANY’s compliance with its own security policies and standards. The consultant will summarize their findings in a brief report, including any recommendations for future improvement. In addition, as planning for a major upgrade is underway, additional recommendations for the upgraded system are expected.
Consultant Background: The consultant will be an individual skilled and experienced in this task. The consultant will have no less than five years experience in the information security field.
Credentials: The consultant must have at least one of the following credentials and furnish verification that the credential is current:
* Certified Information Systems Security Professional (CISSP)
* Certified Information Systems Auditor (CISA)
* Certified Information Security Manager (CISM)
Work to be Performed:
* THE COMPANY will send the consultant a Purchase Order authorizing the start of the engagement. Depending on consultant availability, the engagement is expected to take from four to ten weeks to complete.
* Supporting material review: Within two weeks of receiving a purchase order authorizing work to begin, the consultant will spend 6 to 8 hours reviewing any supporting materials provided by THE COMPANY (typically answers to prior security assessments) and developing follow-up questions.
* Estimated consulting time: 8 hours.
* Follow-up questions: Within four weeks of receiving a purchase order authorizing work to begin, the consultant will then email those questions to a designated contact at THE COMPANY and then read any answers that are returned.
* Estimated consulting time: 2 hours.
* Within six weeks of receiving a purchase order authorizing work to begin, the consultant will then spend up to 4 hours on-site at THE COMPANY’s data center, asking questions to validate readings.
* Estimated consulting and travel time: 8 hours.
* Within six weeks of receiving a purchase order authorizing work to begin, the consultant will use an industry standard tool of their choosing and at their cost, to attempt a penetration test of THE COMPANY’s system.
* Estimated consulting time: 16 hours.
* Within eight weeks of receiving a purchase order authorizing work to begin, the consultant will then use Microsoft Word to fill in a twenty-five question survey with their observations and recommendations and email their report to their contact at THE COMPANY. Any question not applicable to a security assessment may be left blank.
* Estimated consulting time: 2 hours.
* Within nine weeks of receiving a purchase order authorizing work to begin, the consultant will conduct a conference call reviewing their findings.
* Within ten weeks of receiving a purchase order authorizing work to begin, the consultant will forward to THE COMPANY copies of all supporting documents and other working papers and products performed on behalf of THE COMPANY, and also provide THE COMPANY with an invoice for the amount agreed to in the Purchase Order. THE COMPANY will pay the invoice within fifteen days.
Live Long and Prosper - Thanks Leonard. You are missed.
I have vmware machines on one server at home. There are still benefits even though it's not a cluster. So it's not that stupid.
It is easier to move the virtual servers to another machine or O/S. This is useful when upgrading or when hardware fails or when growing (move from one real server to two or more real servers). There's no need to reinstall stuff because the drivers are different etc.
You can snapshot virtual machines and then back them up while they are running. Backup and restore is not that hard that way. So even if you have a single point of failure, if you have recent image back ups, you could buy a machine with preinstalled O/S, install vmware, and get back up and running rather quickly.
And when power fails and the UPS runs low on battery, I have a script that suspends all virtual machines then powers the server down. That's more convenient too than setting up lots of UPS agents on multiple machines and hoping they all shutdown in time.
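For anyone wanting to try the same thing, a sketch of what such a script could look like follows. This is not the poster's script. It assumes VMware's `vmrun` command-line tool is installed on the host, that the host runs Linux, and that your UPS monitoring software can call a script on low battery. The `.vmx` paths and function names are hypothetical.

```python
import subprocess
from typing import List


def build_shutdown_plan(vmx_paths: List[str], vmrun: str = "vmrun") -> List[List[str]]:
    """Suspend every listed VM first, then halt the host."""
    plan = [[vmrun, "suspend", path] for path in vmx_paths]
    plan.append(["shutdown", "-h", "now"])  # assumes a Linux host
    return plan


def run_plan(plan: List[List[str]], dry_run: bool = True) -> None:
    """Execute the plan in order; with dry_run=True only print what would happen."""
    for command in plan:
        if dry_run:
            print("would run:", " ".join(command))
        else:
            # Keep going even if one VM fails to suspend, so the host still powers off.
            subprocess.run(command, check=False)


# Hypothetical VM locations, for illustration only.
example_vms = ["/vms/mail/mail.vmx", "/vms/web/web.vmx"]
run_plan(build_shutdown_plan(example_vms), dry_run=True)
```

Suspending before the host halts is what keeps this to a single UPS agent: the guests never need to know that power is failing.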
DB performance sucks in a vmware guest though, so where DB/IO performance is important, use "real" stuff. Things may be better with other virtualization tech/software.
First back everything up.
Second test the backups.
Third ensure there is good monitoring on everything important.
Only then should you think about upgrades.
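As a starting point for the "test the backups" step, here is a sketch that compares checksums between a source tree and its backup copy. The function names are made up for illustration, and it assumes the backup is a plain file copy reachable as a directory. It is not a substitute for an actual restore test.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir: Path, backup_dir: Path) -> List[str]:
    """Return relative paths that are missing from the backup or differ in content."""
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        relative = src.relative_to(source_dir)
        copy = backup_dir / relative
        if not copy.is_file() or sha256_of(src) != sha256_of(copy):
            problems.append(str(relative))
    return problems
```

An empty list only means the copy matches the source right now; restoring onto a spare machine is still the real test.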
I can't believe nobody else has said this.
In my uninformed opinion, blades are mainly a way for hardware vendors to extract more money from suckers.
They probably have niche uses. But when you get to the details they're not so great. Yes the HP iLO stuff is cool etc... When it works.
Many of the HP blades don't come with optical drives. You have to mount CD/DVD images via the blade software. Which seemed to only work reliably on IE6 on XP. OK so maybe we should have tried it with more browsers, than IE8, but who has time? Especially see below why you don't have time:
So far I haven't seen any mention in HP documentation that the transfer rate of the mounted CD/DVD image (or folder) between your laptop to the iLO software to a blade that you're trying to install stuff on is a measly 500 kilobytes per second. But that's what we encountered in practice.
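To see what that rate means in practice, a quick back-of-the-envelope calculation. The 500 kilobytes per second figure is from the post above; the image sizes (a 700 MB CD and a 4.7 GB single-layer DVD) are my assumptions.

```python
def transfer_hours(image_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to move an image at a constant transfer rate."""
    return image_bytes / rate_bytes_per_s / 3600


RATE = 500e3       # 500 kilobytes per second, as observed in the post
CD_BYTES = 700e6   # assumed CD image size
DVD_BYTES = 4.7e9  # assumed single-layer DVD image size

print(f"CD image:  {transfer_hours(CD_BYTES, RATE) * 60:.0f} minutes")
print(f"DVD image: {transfer_hours(DVD_BYTES, RATE):.1f} hours")
```

That works out to a bit over 20 minutes for a CD and roughly two and a half hours for a full DVD, per blade, before the installer has done anything else.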
Yes you can attach the blade network to another network and install it over the network, but if you can do that, doesn't that make the fancy HP iLO stuff less important? You might as well just get a network KVM right? That KVM will work with Dell/IBM/WhiteBoxServer so you can tell HP to fuck off and die if you want.
Which brings us to the next important point: Fancy Vendor X enclosures will only work with current and near future Vendor X blades. In 3-5 years time they might start charging you a lot more to buy new but obsolete Vendor X blades. Whoopee. What are the odds you can use the latest blades in your old enclosure? So you pay a premium for vendor lock-in and to be screwed in the future.
I doubt Google, etc use blades. And they seem to be able to manage hundreds of thousands of servers. OK so most of the servers might be running the same image/thing... So that makes it easy.
BUT if you are having very different servers do you really want them in a few blade enclosures? Then when you need to service that enclosure you'd be bringing down all the different blades...
Instead of configuring a complicated redundant network for such a small number of users, I think you would have better luck implementing a backup/disaster recovery service similar to this: http://www.zenitharca.com/
In the past I have worked in a place that had around the same problem as you say.
I had a very small budget, so I was hosting services on commodity PCs, with outdated systems, no virtualization (no dual cores back then), with as much as 3 to 4 services running in the same machine with no kind of sandboxing.
All was running fine.
Then, I got a small budget to buy a newer system. It was a Dual Core system, and I managed to get two hard drives which I put on simple mirroring RAID (low storage was the main problem that allowed me to buy new hardware). That's when the problems started arising.
I was young back then, and was seeing all the "good stuff" around to speed up machines, so I fell for that RAID thing, since it supposedly would almost double read speed and automatically create backups. It ran fine until some weeks after I set it up, when some files simply "vanished" from the file server. Nobody knew where they were. I didn't know where they were or what happened, but since we were small, most files were stored on the users' workstations (even though that was not "a good practice (tm)"). Because each user had their own backups locally, we managed to keep going without the files.
Then it happened again. Many files went missing again! But this time I noticed that some files (that vanished in the first incident) appeared again, and the missing ones now were the newer ones added after the first incident. So, I naturally traced it to the raid array and noticed it wasn't in sync. Then I saw that it was not mirroring correctly, and at each boot of the server the active drive could be "swapped".
In the end, I chose the simple path: I disabled RAID and used cron to back up from one drive to the other at the end of each day. Problem solved, everybody was happy. From what I've heard, this setup hasn't broken again (since nobody dared mess with it after I left). Lesson learned: follow Occam's razor ("The simplest answer is usually the correct answer."). By the way, as far as availability is concerned, all I had to do was move one of the drives to another machine and boot up, as I was able to do when lightning fried the motherboard even with correct grounding and a UPS.
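A sketch of what such a nightly job could look like, for illustration only. The poster does not say what command the cron job ran, so the script name, paths and schedule here are all hypothetical. It needs Python 3.8 or later for `dirs_exist_ok`.

```python
import shutil
import sys
from pathlib import Path


def mirror(source: Path, destination: Path) -> None:
    """Copy the source tree onto the destination, overwriting files that already exist.

    Files deleted from the source are NOT removed from the destination, so an
    accidental deletion does not propagate to the copy on the next run.
    """
    shutil.copytree(source, destination, dirs_exist_ok=True)


# Hypothetical crontab entry, running at 23:00 every day:
#   0 23 * * * /usr/bin/python3 /usr/local/bin/mirror.py /srv/files /mnt/disk2/files
if __name__ == "__main__" and len(sys.argv) == 3:
    mirror(Path(sys.argv[1]), Path(sys.argv[2]))
```

Unlike the broken mirror described above, a one-way daily copy always has an unambiguous "good" side, at the cost of losing up to a day of changes if the first drive dies.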
Why would you split your dollars amongst vendors like that? If you're going to recommend vendors, at least try to keep everything within the same company or within strategic partnerships to maximize savings.
Here is another recommendation that might save more money: Dell servers and EqualLogic SANs. Since Dell owns EqualLogic, you'll get additional cost savings if you have a halfway decent sales representative.
My Sysadmin Blog
ModularIT is what you are looking for. Every service runs in a different virtual machine on one or more physical servers, there is a web interface and you can move machines between physical servers. Open source. Developers are friendly and based in Canary Islands (Spain).
Since it wasn't mentioned by the OP as being done, I'll assume it wasn't. Before you even touch anything in the office:
1) get secondary/tertiary DNS set up offsite, preferably with 2+ different providers. There are many cheap (and even free) services to accomplish this, such as DynDNS. It's cheap, it's easy, and it'll save your bacon if you **** up the primary while you're working. It may seem like overkill, but at my ~80 person place I have a primary and secondary onsite, and use two inexpensive offsite services as backup secondaries.
2) get secondary/tertiary MX "spool" hosting; again, DynDNS for instance offers this cheaply, as do many others. For the same reasons as above: if you screw up the primary mailserver, you have an automatic backup that will spool any incoming mail for you. At our place I have a primary and secondary onsite, and 2 spool (aka "forwarder") backups at MX level 30 at 2 different providers.
Now if you A) screw something up, or B) lose that internet connection you at least have something covering your buns for these critical, essential services. We also have an offsite IT email (a basic GMail account) with all participating employee personal email addresses in it, so that if SHTF and an office goes offline an IT person can log into the GMail from any computer and send an alert to the company that something is wrong.
Regardless of what you do with the onsite scenario, create offsite backup scenarios and CYA before anything else. And test them. :)
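To illustrate why the backup MX hosts sit at a higher number: sending servers try the lowest MX preference value first and only fall back to higher values when those hosts are unreachable. A small sketch, with made-up hostnames standing in for the setup described above:

```python
from typing import List, Tuple


def delivery_order(mx_records: List[Tuple[int, str]]) -> List[str]:
    """Hosts in the order a sending server would try them: lowest preference first.

    Ties are sorted by hostname here only to keep the output stable; real
    senders are expected to spread load across hosts of equal preference.
    """
    return [host for _, host in sorted(mx_records)]


# Hypothetical records: two onsite servers plus two offsite spools at level 30.
records = [
    (30, "spool.provider-b.example"),
    (10, "mail1.example.com"),
    (30, "spool.provider-a.example"),
    (20, "mail2.example.com"),
]
print(delivery_order(records))
```

With both onsite servers down, mail lands on one of the level 30 spools and is forwarded once the primary is reachable again.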
Your employer has put you in charge of an information systems infrastructure overhaul or upgrade and you are posting to /. asking for advice? Tell your employer to hire someone capable of doing their own research. What are you, an MCSE?
It also saves on HW - assuming you are the one service per OS/instance type.
Home Office (in this context): Dual vmware servers, each having generally the VM instances:
System:
Guest #1: Windows 2008: Domain controller, DHCP, DNS, WINS
Guest #2: CentOS: Radius
Guest #3: CentOS: WWW, FTP
Network: a dual link running BGP, with VPNs to each of the remote sites, which have their own server for DNS (a slave) and DHCP (in case the VPN link is down).
Using VMWare for services that aren't redundant as well. All VMs back up to the other VMWare server (with Ranger) so I can bring up guest VMs if their VMWare server fails. Virtualization gives me very easy DR (instead of having to recover an OS, I only have to recover a VM), easy hardware upgrades (migrate the VM), and generally the services are redundant for OS and hardware maintenance so I can patch and reboot without disrupting most services.
More complex than that in practice, but you get the idea.
(It's so sad to see all the egos and superiority complexes here.)
My only advice is to use a good brand of server hardware. We integrate our software product into Dell, HP and IBM servers, and in our experience (tens of thousands of integrations), IBM provides very poor quality products that take double the setup time, double the maintenance time and have double the failure rate of both Dell and HP. I have no preference between the other 2 brands. They are both quite good.
It's a pity IBM servers have such a good reputation. It really is undeserved.
Running under VMware means there is less code running that is subject to attack (smaller attack surface), because there are fewer apps per guest, and the risk of a bug in VMware itself is approximately equal to the risk of a bug in your CPU microcode, due to the application of VT.
The cut-over point was Feb 2003 for the European Union with a number of other countries following suit shortly thereafter. ROHS-compliant Equipment built after that point may be subject to age-and-use related failures irrespective of whether there are rotating components or unstable environments involved.
Used equipment still running after 7 years? Will probably be reliable. Used equipment slightly newer than PP described? Borderline, I think. You'll have to consider hardware redundancy more carefully with the newer stuff.
Do not mock my vision of impractical footwear
I was put in your exact position four years ago at the place where I currently work. Here are some things I suggest:
1- Make a plan. These things can't be fixed in a day. My boss, the CIO, said, "Rome wasn't built in a day." He was right on with that one. It took me three years to get things to where they needed to be. One piece at a time.
2- Make sure you break things up and prioritize them. What is the 'oldest' equipment, and what are the pain points? Is the network holding up? Connectivity is the most important part. Make sure you have your network running well before you mess with other parts of the system or put additional strain on the system.
3- Make sure you have the right people on board. I call this checks and balances. You need to have firepower behind your decisions, especially when it comes to making the budget.
4- Remember the phrase: KISS. Burn it in your mind... It means, keep it simple, stupid. Don't bow to salesmen, brochures, 'white papers' or peer pressure. Experience and checks and balances are essential.
And finally, be cautious and move slowly. Systems don't all just fall apart at once. Once you've prioritized, gotten the right people on board and have your ducks in a row, things will run smoothly. If management gets in your way, refer back to the checks and balances you set up and force it down their throats. It's kind of sad to say that this is just like playing chess, but when management doesn't trust IT in general, you have to prove yourself. Following the above steps will help. Good luck.
The benefits of virtualisation are massive. We went from 25 physical servers down to 6, and I'm not done virtualising yet. All the existing hardware was old and due for both hardware and software refresh... 25x 3-4k AU for physical hardware worked out to be pretty damn close in terms of cost to 3 physical hosts, a SAN plus an ESX "acceleration pack" including VirtualCenter. Benefits we got? SAN storage (instead of local disks everywhere), high availability (VMware HA, VMware FT if we need it later), roll-back to snapshot for failed upgrades, right-click cloning/deploy from template of VMs and, down the track, the ability to add on VDI virtual desktops, etc.
Another benefit is that we have standard virtual hardware everywhere. Never again do we need to rebuild an OS simply due to a hardware upgrade.
With ESX, you need nowhere near as much hardware as you would for physical hosts. You can easily separate services out onto different VMs, and not pay as big a hardware cost due to ESX's ability to share memory pages between VMs running the same OS. Rather than running multiple services on one physical server, and having a run-away process kill everything on the server, you can split the task out into multiple VMs and use resource pools to ensure that any resource contention issues are taken care of.
In short, we went ESX and I'm not looking back. Having the ability to upgrade the physical hardware (adding NICs and memory) at 10am during the day with ZERO downtime to the VM services (vmotion them off the single host I am upgrading then vmotion them back to upgrade the next host) running on top of the cluster is awesome.
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
If you decide to go that route regardless, invest in at least 2 dual-port 4Gb Fibre Channel HBAs per host server. You'll be thankful in the long run.
Do you have any idea how much disk you need on the back end to exceed the throughput of even a single-port 4Gb HBA, for more than a second or two ?
There are a lot of places to spend money before worrying about how fat your server-side SAN connections are. A *LOT*. 99% of the time, a single 1Gb iSCSI port is more than enough from a performance perspective.
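A rough way to check this against your own numbers is sketched below. It uses nominal link rates and ignores protocol and encoding overhead, so real usable throughput is lower. The per-disk figure is purely an assumption for illustration, since sustained throughput under random, VM-style I/O varies widely by drive and workload.

```python
import math


def link_mb_per_s(gigabits_per_s: float) -> float:
    """Nominal link rate in megabytes per second, ignoring all overhead."""
    return gigabits_per_s * 1000 / 8


def disks_to_saturate(gigabits_per_s: float, per_disk_mb_per_s: float) -> int:
    """Number of disks needed before the link, not the disks, becomes the bottleneck."""
    return math.ceil(link_mb_per_s(gigabits_per_s) / per_disk_mb_per_s)


ASSUMED_DISK_MB_PER_S = 5.0  # hypothetical per-disk throughput under random I/O

for name, rate in (("1Gb iSCSI", 1.0), ("4Gb FC", 4.0)):
    print(f"{name}: about {link_mb_per_s(rate):.0f} MB/s nominal, "
          f"{disks_to_saturate(rate, ASSUMED_DISK_MB_PER_S)} disks to fill it")
```

Under that assumption it takes on the order of a hundred spindles to keep a single 4Gb port busy, which is the point being made above: the disks run out long before the link does.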
If you are going to do virtualization, the only benefit comes when you invest in a cluster otherwise don't do it at all.
This is not true at all. Indeed, the benefits of virtualisation are such that even for a single service on a single server, it's generally better to make it a VM.
I use blade servers every day and I love them. You are correct in saying that blade servers are not the right choice for every installation. We calculated that after 7 servers it is cheaper to buy a blade chassis than separate 1U servers.
- blade servers don't work for any application that requires a special card (e.g. a T1 card for a VoIP server)
- the CD-ROM etc. collects dust for the 5 years after the OS is installed
- iLO lets you attach the CD/DVD from the workstation you are iLO-ing from to the server to install the OS
- you can also install from a USB CD drive
- I can go 2 months between physical visits to our datacenter. This is important if you have a hot site or colocation for disaster recovery.
A single Mac Mini easily serves multiple (in my case 72) domains, e-mail/calendaring, VPN, DNS, FTP, chat, iPhone push notification, web services, SMB domain controller and more. I don't think the CPU utilization has ever gone over 10%. Easy to cluster together since it uses Dovecot for e-mail.
Unless you have an app that sucks CPU, you don't need a blade chassis. That kind of thing's for special-purpose apps, not generic infrastructure.
To have true high availability, even 2 VMware servers aren't enough; you need a reliable shared storage system that both servers can access.
Even then, the storage chassis itself will be a central point of failure.
With Linux, DRBD, GFS and either KVM or Xen, you don't need shared storage, as DRBD does the replication for you between physical nodes, GFS does the "VMFS"-type concurrently accessible filesystem, and you get live migration free.
To have true HA you need a pair of independent shared storage units with continuous synchronous replication and some reliable mechanism of failover.
If you're looking at that level, most decent storage arrays have redundant controllers, you shouldn't need a second array for HA, mainly for DR (where D in DR stands for disaster, the kind where nothing in the vicinity of the first array works).
you shouldn't need a second array for HA
Until a power supply on the array blows up, taking both controllers, a critical control board, part of the backplane, or a bunch of the drives with it.
The biggest problem I've found with blades is that you can't fill a rack with them. Several of the datacenters I've come across have been unable to fit more than one bladecenter per rack. Cooling and power being the problem.
At the moment, a rack full of 1U boxes looks like the highest density to me.
Spoken like someone who has had to experience all 3 formats themselves. I would mod you informative, but have no points left.
Anyone is clueless until they manage the first transition. I do wish schools - at any level - would offer such in-depth training, or that all companies would behave responsibly by sending their techs to specialised "seminars" (or whatever they are called) for important tasks where they know the employee is insecure in his/her knowledge before taking the plunge, but it doesn't work like this in the real world. Sometimes, you can attribute to malice what seems like incompetence.
I might agree with you if this guy's question was about having trouble convincing his management to upgrade, but it isn't. They've already decided to upgrade even though it isn't broken yet, and they are asking for a plan to accomplish it. In my experience, companies proactive enough to do this are making their own luck.
I see a company who took a calculated risk of a possible occasional day of downtime, probably in exchange for getting up and running faster and cheaper. Now that they are in a position to do so, they are exiting that risk in a controlled manner. This company will now have their desired infrastructure and the advantages of having had cheaper infrastructure costs when it mattered most. The OP doesn't say. Maybe they have had problems with downtime, but the cost was obviously worth it or they wouldn't still be in business. The overly cautious often misattribute success to luck, when it was really a conscious assessment that assuming a risk would have more benefits than drawbacks.
This space intentionally left blank.
We have a hundred or so VMs on a 4-node ESX cluster attached to an EMC Clariion CX3-80 SAN with 8 2TB datastores (dual-port 2/4Gb FC connected). The same SAN also has another 6 or so physical 2-node clustered MS servers and a few TB of ATA for disk backups as well. That same setup is duplicated in four of our offices around the country. No performance issues at all from the MS or the ESX side.
We are currently testing some 2-node ESX clusters for use in our remote offices, with about 30 VMs on an HP 2012FC SAN. It's going well so far. If it works out, which I think it will, we will be deploying about 10 setups just like it to replace our standalone ESX servers using local storage in those places.
I don't know what our network manager's hang-up is with iSCSI, but he refuses to even acknowledge that it is a valid, usable alternative. We tested an HP LeftHand solution we borrowed from a vendor and it worked fine and met our requirements on paper and in our tests, but he refused to commit to it and gave it back. The only reason he gave us was "It worked but I want FC, not iSCSI". Seems odd to be deploying new FC setups in 2009 for a "small" office, but oh well, I'm one of only two people at the company who can do switch zoning, so that's one up for me.
"Anyone is clueless until they manage the first transition."
WRONG!!!
As stupid as saying that anyone is clueless until they manage the first brain surgery.
"it doesn't work like this in the real world."
And then you get the results you paid for.