Testing Network Changes When No Test Labs Exist?
vvaduva writes "The ugly truth is that many network guys secretly work on production equipment all the time, or test things on production networks when they face impossible deadlines. Management often expects us to get a job done but refuses to provide funds for expensive lab equipment and test circuits, or reasonable time to get testing done before moving equipment or configs into production. How do most of you handle such situations, and what recommendations do you have for creating a network test lab on the cheap, especially when core network devices are vendor-centric, like Cisco?"
It's perhaps not the best solution, since a lot of the problems I've faced since I moved from software configuration and web server administration into networking have come from bad cables rather than bad IOS settings, but virtualization can help you create test situations on the cheap. Specifically, GNS3 allows you to create test networks in a virtual environment, then import software images for your Cisco routers, switches, PIX firewalls, Juniper hardware, etc., all run on hypervisor technology.
:) I'm still pretty new to networking myself, and I use it to make little test labs for myself when I need to do more than I can with the two 3600 and the 2600-series routers I got to take home for experimenting with. I actually copied the IOS images off of them via TFTP and then can replicate them as many times as I need to, but I can claim I have whatever interfaces I need, plus it will (thankfully) simulate the ATM switch for me as well.
You can also use QEMU to create virtual network nodes. If you have enough RAM, this can at least help you get the logical issues worked out and the software configurations square. Then you just need to do the real work.
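One way to check that the configurations are square is to diff the config you validated in the virtual lab against what production is running before you touch anything. This is only a rough sketch using Python's standard difflib; the function name `config_diff` and both sample configs are made up for illustration, and how you actually pull the configs off the devices is left out.

```python
import difflib

def config_diff(prod_config, lab_config):
    """Return unified-diff lines showing what the lab config changes in production."""
    return list(difflib.unified_diff(
        prod_config.splitlines(),
        lab_config.splitlines(),
        fromfile="production",
        tofile="lab",
        lineterm="",
    ))

# Hypothetical configs, for illustration only.
prod = "hostname core1\ninterface Fa0/0\n ip address 10.0.0.1 255.255.255.0\n"
lab = prod + " ip ospf cost 10\n"

for line in config_diff(prod, lab):
    print(line)
```

Lines starting with `+` are what the lab version would add, and lines starting with `-` are what it would remove.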
I would suggest asking your vendors for demo or evaluation equipment. Cisco, Juniper and 3Com have pools of demo equipment as do the resellers like PC Connection and CDW.
I've done deployments of new switching infrastructure based on work I've done with loaners from my vendors. It can be tough because the typical evaluation period is 30 days, although you can get 45 and even 60 days.
If you have a good relationship with your sales rep, it should be easy to push them to get you the necessary items to do basic testing and get the concepts down of how you need to deploy. Then save the config files so that when you do buy what you need, you're 85% of the way there.
Cisco have many (large) labs located around the world. Sign up for some time in one of them.
You do not mention that this has ever made shit hit a fan. I conclude that so far this has not occurred.
Consequently, you have proved that you are able to work without expensive test equipment through a combination of motivation and elbow grease. Congratulations!
Now, what is the logic for someone with a finite pool of money to provide equipment for someone who obviously does not NEED it? Yes, None At All!
You can therefore:
1) Wait until shit hits a fan and say "well, that's what happens when we don't have test equipment". You will then get test equipment OR get fired.
2) Make the shit hit the fan yourself. This is quite difficult to do inconspicuously, so you'll probably get fired and a shit reference.
3) Look around for jobs as well paid as yours but with test equipment.
4) Someone mentioned asking vendors for test equipment - maybe that might work? Note: sales reps have a quota of favours they can call in, so it helps if you have some steady business with them.
Make sure that every change request implementation documents that this change is being placed into the production environment for testing. Document the impact, ranging from total network failure to moderate inconvenience, and include rollout timetables. The rollout needs to include travel times, such as the drive to site B or a cross-country flight.
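To make that concrete, here is a rough sketch of what such a change-request record could look like. Every name and number in it (`ChangeRequest`, `RolloutStep`, the sample steps and durations) is made up for illustration; it is not any particular ticketing system's format.

```python
from dataclasses import dataclass, field

@dataclass
class RolloutStep:
    description: str
    minutes: int  # include travel, not just keyboard time

@dataclass
class ChangeRequest:
    summary: str
    tested_in_lab: bool
    best_case_impact: str
    worst_case_impact: str
    steps: list = field(default_factory=list)

    def total_minutes(self):
        """Sum every step so the timetable covers travel as well as the change."""
        return sum(step.minutes for step in self.steps)

    def render(self):
        lines = [f"Change: {self.summary}"]
        if not self.tested_in_lab:
            # State plainly that production is doubling as the test environment.
            lines.append("Testing: no lab available; this change is tested in production")
        lines.append(f"Impact range: {self.best_case_impact} to {self.worst_case_impact}")
        for step in self.steps:
            lines.append(f"  {step.minutes:4d} min  {step.description}")
        lines.append(f"Total rollout time: {self.total_minutes()} min")
        return "\n".join(lines)

# Hypothetical example values.
cr = ChangeRequest(
    summary="Replace uplink config on site B distribution switch",
    tested_in_lab=False,
    best_case_impact="moderate inconvenience",
    worst_case_impact="total network failure",
    steps=[
        RolloutStep("Drive to site B", 45),
        RolloutStep("Apply change", 15),
        RolloutStep("Verify and stand by for rollback", 30),
    ],
)
print(cr.render())
```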
Of course the downside of this is that management may go out and hire someone who knows, or at least pretends to know, how to drop changes into place without whining about ignorance and making customers uncomfortable.
See how much approval you have to get when the network is down because of a failed GBIC.
Redundancies against component failure are very good for the enterprise, but also make it harder for engineers to do their job, since "nobody notices that something has gone wrong".
Perhaps the real redundancies should be reserved for the absolute most business-critical things.
Make sure less important things are non-redundant and arranged so that if any link or GBIC does fail, something noticeable to management will stop working and cannot be restored without fixing the broken thing.
Note, step 4 may get you fired instead. Business decisions are somewhat nondeterministic.
And that's what happened to me.
I was forced into making changes in the production environment, and caused an outage that affected 2 people. Once I realized what happened, I quickly fixed it; however due to internal politics I was terminated the next day.
Initially I was in shock. 10 years, 2 months employed in a single company. Gone. I have a stay-at-home wife and 3 kids; which made things look even bleaker.
In hindsight, it may be one of the better things to happen to me. I had spoken with a recruiter a few days beforehand to start looking for work. When this happened, I was able to dedicate myself full time to job searching. I was also off for hunting season, and able to do many things with my family that I normally wouldn't be able to do. The environment I had been working in was just awful. Several former co-workers have left since my special day. The CTO is a psychopath. He has 2 sayings he likes to use - the first is 'do the job right the 1st time'. The second is a Mario Andretti quote: 'If you don't feel like you are out of control, then you aren't going fast enough'. These sayings are mutually exclusive, but logic doesn't apply.
I start a new position on Jan 5th (but it is only a 6 month contract position). It is a bit more money, and I have about 1/2 the commute. It is also a much better work environment.
Things I learned:
- Stockholm syndrome is apparently real. I didn't want to leave because 'it's not that bad'. It was bad. Worse.
- I hate job hunting.
- Employment law in Ontario, Canada is not what I thought it was. Pretty much everything I thought I knew was wrong.
- The economy here in Ontario is poor, but improving (but vastly better than the US).
- Legal advice in Ontario is tax deductible (at least in reference to employment issues).
- A certain CTO is a complete and total prick.
(ha - my captcha word is 'inaction')
3) If possible, test network changes on the production equipment at 2am so that the impact on users will be less
That's dangerous. You leave it apparently running and crawl back to sleep at 4:30AM, to get an angry call at 7:05AM when the first users to log in report something essential is fucked up.
Prepare and test at 2AM, then roll back to original. Then re-apply around lunch break and wait with your fingers on roll-back for the first reports of failure.
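The "test, then always roll back to the original" half of that can be sketched as a context manager. This is just an illustration under big assumptions: the config is an in-memory dict standing in for a device, and `trial_change` is a made-up name. On real gear you would save and restore the config with whatever your vendor tooling provides.

```python
import copy
from contextlib import contextmanager

@contextmanager
def trial_change(device_config, change):
    """Apply `change` for the duration of the block, then restore the original."""
    snapshot = copy.deepcopy(device_config)  # taken before anything is touched
    device_config.update(change)
    try:
        yield device_config
    finally:
        # Roll back even if the checks inside the block raise an exception.
        device_config.clear()
        device_config.update(snapshot)

# Hypothetical config, for illustration only.
running = {"vlan10": "10.0.10.1/24"}

with trial_change(running, {"vlan20": "10.0.20.1/24"}) as cfg:
    change_was_live = "vlan20" in cfg  # run the 2AM checks here

print(change_was_live, running)
```

The same change dict can then be re-applied for real at lunch, with the snapshot kept handy for the roll-back.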
I'm a call-center telephony engineer. Kinda the same thing as network engineer in that you're routing calls instead of packets.
Back around '01, I was working for First Union (which later became Wachovia). They had this massive corporate push for anyone and everyone in IT to roll out a standardized Software Configuration Management, and of course we were included. The big problem was the lab. The corporate standard was to test changes in a lab environment and then move to production (duh).
For a telephony environment, we had a pretty good lab that could duplicate most of our production scenarios, but not all. Another problem was that there were a LOT of people with their fingers in the lab since so many groups were involved: e.g. the IVR team is in there because you have to have IVRs in the system. Same with call routing, call recording, desktop software, Q&A, etc.
So the lab was in a constant state of flux with multiple products, multiple teams, and different software cycles and endless testing always occurring. We made it work by testing the stuff we weren't sure about in the lab, only doing changes in prod after hours, and having really good testing and back-out plans.
So when the corporate overlords started telling us we couldn't make any changes to production without running everything through the lab first, we basically laughed and told them we'd need around 500 million for the lab and dedicated resources to run it. I ended up telling them that to duplicate the production environment, we'd need another bank as our "test bank", and we could test changes on the test bank and then put them in the production bank.
As with so many things in that IT department, it went from being a priority to fading away when something else became a priority.
"reload in 10" on a core router or switch (e.g. a massive switch that also has routing duties) is insane. If you accidentally lock yourself out (but don't otherwise impact anything) and fail to cancel that reload, it will probably impact the entire network for 20-30 minutes.
In addition, reload is risky, and the equipment may fail to come back up correctly.
Sorry, it's not anywhere close to comparable to the configuration management features in JunOS.
"Reload in X" is a bad answer, and should never be done, except on equipment that doesn't matter that much, or at a time when an hour of downtime is completely expected and acceptable.
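The alternative being pointed at is a dead-man's switch that reverts the config instead of rebooting the box; Junos, for instance, offers `commit confirmed` for this. Below is a toy sketch of the idea only, not how Junos or any vendor implements it. The class name `ConfirmedChange` and the `apply`/`rollback` callables are hypothetical stand-ins for whatever actually pushes and reverts config on your gear.

```python
import threading

class ConfirmedChange:
    """Apply a change, then undo it unless confirm() is called before the timeout."""

    def __init__(self, apply, rollback, timeout_s):
        self._rollback = rollback
        self._lock = threading.Lock()
        self._done = False
        self.rolled_back = False
        apply()
        # The timer only reverts the change; nothing gets rebooted.
        self._timer = threading.Timer(timeout_s, self._expire)
        self._timer.daemon = True
        self._timer.start()

    def _expire(self):
        with self._lock:
            if self._done:
                return
            self._done = True
        self._rollback()
        self.rolled_back = True

    def confirm(self):
        """Keep the change. Returns False if the rollback already fired."""
        with self._lock:
            if self._done:
                return False
            self._done = True
        self._timer.cancel()
        return True

    def wait(self, timeout=None):
        """Block until the timer thread has finished or been cancelled."""
        self._timer.join(timeout)

# Hypothetical in-memory "device" for illustration.
state = {"acl": "old"}

def push_new_acl():
    state["acl"] = "new"

def restore_old_acl():
    state["acl"] = "old"

change = ConfirmedChange(push_new_acl, restore_old_acl, timeout_s=0.1)
change.wait(2)  # simulate being locked out: nobody calls confirm()
print(state["acl"], change.rolled_back)
```

If you lock yourself out, the old config comes back on its own and the rest of the network never notices; if everything looks fine, you call `confirm()` and the timer is cancelled.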