Slashdot Mirror


Testing Network Changes When No Test Labs Exist?

vvaduva writes "The ugly truth is that many network guys secretly work on production equipment all the time, or test things on production networks when they face impossible deadlines. Management often expects us to get a job done but refuses to provide funds for expensive lab equipment, test circuits and for reasonable time to get testing done before moving equipment or configs into production. How do most of you handle such situations, and what recommendations do you have for creating a network test lab on the cheap, especially when core network devices are vendor-centric, like Cisco?"

40 of 164 comments

  1. The tag says it all by Lord Byron II · · Score: 4, Insightful

    There are zero replies and the story is already tagged with "youreboned". That's the truth. If your higher-ups won't front the money for proper test equipment and expect you to roll out production-ready equipment on the first go, then you really are boned. Of course, you can mitigate this by simple pen-and-paper analysis. What should each piece of equipment do? Are the products we've selected appropriate for the roles we're going to put them in? These sorts of questions can find a lot of bugs without any sort of testing. If you think, "what would I do if it was the 1980s?" then you'll be fine.

    1. Re:The tag says it all by DigiShaman · · Score: 5, Insightful

      Not all changes are a one-way trip. Having a rollback plan is also important. Should something very unexpected happen, be prepared to roll back any and all changes to undo what has just been done.

      --
      Life is not for the lazy.
    2. Re:The tag says it all by Anonymous Coward · · Score: 2, Informative

      Not Pushing Juniper gear, but their Commit functions in JUNOS, and commands like "rollback" are serious things to consider in these scenarios. JUNOS also does things like refusing to perform a commit if you've done something obviously stupid (it does basic checking of your config when you commit).

      Label me a shill. Whatever. JUNOS is a lot better from an operator POV.

    3. Re:The tag says it all by BiggerIsBetter · · Score: 4, Insightful

      Not all changes are a one-way trip. Having a rollback plan is also important. Should something very unexpected happen, be prepared to roll back any and all changes to undo what has just been done.

      Couldn't agree more, except to say, don't assume you'll be rolling back from a known state. I've seen roll-back plans that assume they're undoing the changes just put in, not reverting to the state before the changes. Yes, there's a difference between the two! Eg, if your install fails, maybe you can't un-install. Yes, this might mean additional resources and the overhead of FS and DB snapshots, and complete copies of config files, but better that than the alternative.
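The difference between "undoing the changes" and "reverting to the prior state" can be shown with a small sketch. This is only an illustration: the config keys and values are invented, and a real device config is not a flat dictionary.

```python
import copy

# State before the change window. Keys and values are made up for illustration.
running = {"vlan10": "up", "ospf_area": "0", "mtu": "1500"}
snapshot = copy.deepcopy(running)  # complete copy taken BEFORE touching anything

# The planned changes...
running["mtu"] = "9000"
running["ospf_area"] = "1"
# ...plus a side effect of a failed install that nobody wrote down.
running.pop("vlan10")

# "Undo" plan: reverse only the steps we know we took.
undone = copy.deepcopy(running)
undone["mtu"] = "1500"
undone["ospf_area"] = "0"

# "Revert" plan: restore the snapshot wholesale.
reverted = copy.deepcopy(snapshot)

print(undone == snapshot)    # False: vlan10 is still missing
print(reverted == snapshot)  # True
```

The undo plan only works if the list of changes is complete, which is exactly the assumption that breaks when an install fails halfway.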

      --
      Forget thrust, drag, lift and weight. Airplanes fly because of money.
    4. Re:The tag says it all by mysidia · · Score: 2, Informative

      My personal favorite thing about JunOS is "commit confirmed 10"

      This can be a lifesaver: if you fat-finger something and break even your ability to access the device, your transaction should roll back in 10 minutes.

      If nothing goes wrong, you have 9 minutes to do some simple sanity checks, make sure your LAN is still working, and then get back to your CLI session and confirm the change.
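The idea behind a confirmed commit can be modelled in a few lines. This is a toy sketch, not JunOS behaviour in detail: the class name, the `tick` method, the injected `now` timestamps (in seconds) and the ACL strings are all assumptions made for illustration.

```python
class ConfirmedCommit:
    """Toy model of a commit that rolls back unless confirmed in time."""

    def __init__(self, active_config):
        self.active = dict(active_config)
        self._previous = None   # config to fall back to
        self._deadline = None   # time (seconds) at which rollback fires

    def commit_confirmed(self, candidate, now, minutes=10):
        # Keep the oldest known-good config if a commit is already pending.
        if self._deadline is None:
            self._previous = dict(self.active)
        self.active = dict(candidate)
        self._deadline = now + minutes * 60

    def tick(self, now):
        """Called periodically; returns True if a rollback happened."""
        if self._deadline is not None and now >= self._deadline:
            self.active = self._previous
            self._previous = None
            self._deadline = None
            return True
        return False

    def confirm(self, now):
        """Make the pending change permanent; returns False if too late."""
        if self.tick(now) or self._deadline is None:
            return False
        self._previous = None
        self._deadline = None
        return True


# Case 1: the change locks the operator out, so nobody confirms.
locked_out = ConfirmedCommit({"mgmt_acl": "permit 10.0.0.0/8"})
locked_out.commit_confirmed({"mgmt_acl": "deny any"}, now=0, minutes=10)
still_pending = locked_out.tick(now=300)   # five minutes in
rolled_back = locked_out.tick(now=600)     # deadline reached

# Case 2: sanity checks pass and the operator confirms in time.
happy = ConfirmedCommit({"mgmt_acl": "permit 10.0.0.0/8"})
happy.commit_confirmed({"mgmt_acl": "permit 10.0.0.0/8 log"}, now=0)
confirmed = happy.confirm(now=120)
```

The useful property is that the safe outcome requires no action: if the operator can no longer reach the device, doing nothing restores the previous config.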

    5. Re:The tag says it all by afidel · · Score: 4, Insightful

      This is networking equipment. Other than transitory information like peer maps and MAC tables, which can be re-learned, you should always be able to revert to the previous state as far as the software and configuration go.

      My comments are that out-of-band management is the networking guy's best friend, and POTS is the best OOB available. Also learn how to change the running config without affecting the saved config; that way the worst case is that you have to power cycle (which can be done with the correct OOB config, or you can pre-schedule a reboot that you cancel if everything goes well). Oh, and downtime windows might seem like a luxury, but unless you are Google or Amazon the business needs to be made aware that they are necessary and critical to the smooth functioning of its IT infrastructure, so you should be making these changes during the downtime window, when everyone is aware that things might break.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    6. Re:The tag says it all by karnal · · Score: 2, Informative

      You bring up a good point regarding changing the running config vs the saved config.

      What I'll do if I'm changing a remote system - POTS or no - is set up a reboot of the device in 15 minutes, after verifying the clock. Then, if something in the config causes an unforeseen issue, you just need to wait a little for the switch/router to come back online with its original config.

      Obviously, this can extend the outage window - however, always plan for worst case...

      --
      Karnal
    7. Re:The tag says it all by POTSandPANS · · Score: 2, Informative

      On a Cisco, you can just do "reload in 10" and "reload cancel". If you don't know about those commands, you really shouldn't be working on a production network unsupervised.

      As for the original question: Either use similar low end equipment, or use your spares. (please say you keep spare parts around)
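The ordering matters more than the commands themselves: schedule the reload first, change only the running config, verify, and only then cancel the reload and save. A minimal sketch of that ordering as a checklist generator follows. The function name and the example interface commands are hypothetical, and the output is meant to be read by a person, not pasted blindly into a device.

```python
def guarded_change_plan(change_commands, verify_commands, reload_minutes=10):
    """Return an ordered checklist for a change protected by a scheduled reload."""
    if reload_minutes < 1:
        raise ValueError("reload_minutes must be at least 1")
    plan = [
        "show clock",                     # verify the clock before scheduling
        f"reload in {reload_minutes}",    # safety net goes in BEFORE the change
        "configure terminal",
        *change_commands,                 # touches the running config only
        "end",
        *verify_commands,                 # sanity checks while the timer runs
        "reload cancel",                  # only once the checks have passed
        "copy running-config startup-config",  # save last, never earlier
    ]
    return plan


# Illustrative change: the interface name and description are made up.
plan = guarded_change_plan(
    change_commands=["interface GigabitEthernet0/1", "description uplink-test"],
    verify_commands=["show ip interface brief"],
)
for step in plan:
    print(step)
```

If the change cuts off access, the checklist simply stops partway and the device reloads with its saved config, at the cost of the reboot outage discussed elsewhere in this thread.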

    8. Re:The tag says it all by eggoeater · · Score: 3, Interesting

      I'm a call-center telephony engineer. Kinda the same thing as network engineer in that you're routing calls instead of packets.
      Back around '01, I was working for First Union (which later became Wachovia). They had this massive corporate push for anyone and everyone in IT to roll out a standardized Software Configuration Management, and of course we were included. The big problem was the lab. The corporate standard was to test changes in a lab environment and then move to production (duh).
      For a telephony environment, we had a pretty good lab that could duplicate most of our production scenarios, but not all. Another problem was there were a LOT of people with their fingers in the lab since so many groups were involved: eg. The IVR team is in there because you have to have IVRs in the system. Same with call routing, call recording, desktop software, Q&A, etc.etc.
      So the lab was in a constant state of flux with multiple products, multiple teams, and different software cycles and endless testing always occurring. We made it work by testing the stuff we weren't sure about in the lab, only doing changes in prod after hours, and having really good testing and back-out plans.
      So when the corporate overlords started telling us we couldn't make any changes to production without running everything through the lab first, we basically laughed and told them we'd need around 500 million for the lab and dedicated resources to run it. I ended up telling them that to duplicate the production environment, we'd need another bank as our "test bank", and we could test changes on the test bank and then put them in the production bank.

      As with so many things in that IT department, it went from being a priority to fading away when something else became a priority.

    9. Re:The tag says it all by afidel · · Score: 2

      My favorite ultimate backup for rebooting a device is a DTMF controlled PDU, call into the OOB number and hit a magic number sequence and the device reboots =)

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    10. Re:The tag says it all by mysidia · · Score: 2, Interesting

      "reload in 10" on a core router or switch (eg a massive switch that also has routing duties) is insane, and will probably impact the entire network, for 20-30 minutes, if you accidentally lock yourself out (but don't otherwise impact anything) and fail to cancel that reload.

      In addition, reload is risky, and the equipment may fail to come back up correctly.

      Sorry, it's not anywhere close to comparable to the configuration management features in JunOS.

      "Reload in X" is a bad answer, and should never be done, except on equipment that doesn't matter that much, or at a time when an hour of downtime is completely expected and acceptable.

    11. Re:The tag says it all by Grail · · Score: 2, Informative

      If you truly believe that a simple reversion of a configuration will cause a reversion to a previous state, you're sorely mistaken.

      Once the device you're working on starts misbehaving, other devices around it will start misbehaving too. As an example, one change to a network I'm involved with was supposed to simply prioritise VoIP traffic for one customer. The change was successful, the engineer went home. Then three hours later a major network router failed, because the higher priority voice traffic which was now flowing over the router tripped some magic number of MACs that it could remember, at which point the card had to keep referring routing decisions back to the CPU.

      The router's CPU became overloaded, other routes started dropping packets, and we ended up trying to resolve the problem by rebooting that router (because that's what was broken). The router on starting up was immediately overloaded and crashed again. Overall, it took about four hours to get to the problem resolved, which required reverting the VoIP change and turning off some customer networks to allow the core router to start up without the huge packet load. The customer networks were down for about three and a half hours.

      In this instance, simply reverting the change to the VoIP services would not have resolved the problem. Once the camel's back was broken, removing a straw would not have fixed it.

  2. Could be worse by 7213 · · Score: 4, Insightful

    The best bet is to be ready to blame the vendor when things go south ;-)

    Seriously, I'm right there with you. If management does not want to provide for a test lab & reasonable time to test. Then it's clear they've made a 'business decision' that the network is not of sufficient value / risk is not great enough for such investments.

    This may change quickly once something goes south (assuming they understand why it did) but you're gonna be talking to a brick wall until then.

    It could be worse: you could have management that are afraid of their own shadows & who freak out at the idea of replacing redundant components after a HW failure. (Ever had to get VP approval to replace a failed GBIC? Oh, I have & yes, I hate my life).

    1. Re:Could be worse by mysidia · · Score: 2, Interesting

      See how much approval you have to get when the network is down because of a failed GBIC.

      Redundancies against component failure are very good for the enterprise, but also make it harder for engineers to do their job, since "nobody notices that something has gone wrong".

      Perhaps the real redundancies should be reserved for the absolute most business-critical things.

      Make sure less important things are non-redundant and arranged so that if any link or GBIC does fail, something noticeable to management will stop working and cannot be restored without fixing the broken thing.

    2. Re:Could be worse by hazem · · Score: 2, Insightful

      That reminds me of an article by Nelson Repenning, "Nobody ever gets credit for fixing problems that never happened". It's quite an interesting read... The guy who "saves the day" during an emergency always seems to get credit and reward, but what about the guy who keeps the emergency from ever happening?

  3. Virtualization? by bsDaemon · · Score: 4, Interesting

    It's perhaps not the best solution, as a lot of problems I've faced since I started getting more into networking stuff than software configuration and web server administration have been related to bad cables rather than bad IOS settings, but virtualization can help you create test situations on the cheap. Specifically, GNS3 allows you to create test networks in a virtual environment, then import software images for your Cisco routers, switches, PIX firewalls, Juniper hardware, etc, all run on hypervisor technology.

    You can also use QEMU to create virtual network nodes. If you have enough RAM, then this can help at least get the logical issues worked out and the software configurations square. Then you just need to do the real work :) I'm still pretty new to networking myself, and I use it to make little test labs for myself when I need to do more than I can with the two 3600 and the 2600-series routers I got to take home for experimenting with. I actually copied the IOS images off of them via TFTP and then can replicate them as many times as I need to, but I can claim I have whatever interfaces I need, plus it will (thankfully) simulate the ATM switch for me as well.

    1. Re:Virtualization? by loki_ninboy · · Score: 2, Informative

      I'm using the GNS3 software with some IOS stuff to help prepare for the CCNA exam. Sure beats paying the money for the extra hardware laying around the house just for learning and testing purposes.

    2. Re:Virtualization? by value_added · · Score: 4, Informative

      Specifically, GNS3 allows you to create test networks in a virtual environment, then import software images for your Cisco routers, switches, PIX firewalls, Juniper hardware, etc, all run on hypervisor technology.

      For anyone unfamiliar with GNS3, a link to the website. There are versions available for Windows, Linux, and OS X. FreeBSD already has it in ports.

      As a side note, I'd add that maintaining a home lab (to the extent practicable and useful) is one way to side-step limitations of what your employer provides. Consider it a combination of "Ongoing Professional Education" and "Proactive Job Security Measures" (i.e., "I better test this shit to save my ass tomorrow").

    3. Re:Virtualization? by Bios_Hakr · · Score: 2, Informative

      If you work a pure Cisco environment, talk to your Cisco guy about getting Packet Tracer. Emulates a few routers and a lot of switches. It works really well. Plus, 5.1 adds virtual networking. You can design several networks on several laptops and then join those networks over a virtual internet.

      --
      I'd rather you do it wrong, than for me to have to do it at all.
  4. Document and test at night by jdigriz · · Score: 5, Informative

    Step 1) Make a formal request for the test lab. Make it as detailed as possible. Explain the impact to business if various components fail. Make a plain-language executive summary calling out risks.
    Step 2) Once the request is denied, make sure you have a paper trail of the rejection.
    Step 3) If possible, test network changes on the production equipment at 2am so that impact on users will be less.
    Step 4) Once the inevitable failure occurs, haul out the paper trail and get the bean counter fired. Repeat until test lab is approved.

    Note, step 4 may get you fired instead. Business decisions are somewhat nondeterministic.

    1. Re:Document and test at night by Keruo · · Score: 3, Informative

      step 3) If possible test network changes on the production equipment at 2am so that impact on users will be less

      Been there, done that. Sadly the only way to see how your setup works is to try it in production.
      Sure it helps if you can test it beforehand, but sometimes your lab might not reflect what happens in real network when you roll something out.
      Just make sure you can clock those am hours as overtime/nighttime work.
      And remember to back up the running config twice so you can restore the production network if something goes fubar.
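One way to make "back up twice" more than a habit is to write both copies and read each one back before touching anything. The sketch below assumes the running config has already been fetched as text; the function name, file name and directories are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def backup_twice(config_text, primary_dir, secondary_dir, name="running-config.txt"):
    """Write the config to two locations and verify each copy by checksum."""
    payload = config_text.encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    copies = []
    for directory in (primary_dir, secondary_dir):
        target = Path(directory) / name
        target.write_bytes(payload)
        # Read the file back so a silent write failure is caught now,
        # not during the restore.
        if hashlib.sha256(target.read_bytes()).hexdigest() != digest:
            raise OSError(f"backup at {target} does not match the source")
        copies.append(target)
    return digest, copies


# Temporary directories stand in for, say, a local disk and an off-box share.
with tempfile.TemporaryDirectory() as local_dir, \
        tempfile.TemporaryDirectory() as offbox_dir:
    digest, copies = backup_twice("hostname edge-1\n", local_dir, offbox_dir)
    print(len(copies), "verified copies")
```

Keeping the second copy somewhere that does not depend on the network being changed is the point of doing it twice.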

      --
      There are no atheists when recovering from tape backup.
    2. Re:Document and test at night by SethJohnson · · Score: 4, Funny

      If it goes smoothly anyway, you might look like a whiner that didn't need the expensive toys to keep on the shelf.

      Hence, you have the plug to the main router beneath your own desk. When the sailing looks smooth, you kick out the cord. While everyone freaks out, you open up a terminal window and begin typing nonsensical commands. Say, "Ahaaah!" as you plug the router back in.

      Job security.

      Seth

    3. Re:Document and test at night by Anonymous Coward · · Score: 3, Interesting

      Note, step 4 may get you fired instead. Business decisions are somewhat nondeterministic.

      And that's what happened to me.

      I was forced into making changes in the production environment, and caused an outage that affected 2 people. Once I realized what happened, I quickly fixed it; however due to internal politics I was terminated the next day.

      Initially I was in shock. 10 years, 2 months employed in a single company. Gone. I have a stay-at-home wife and 3 kids; which made things look even bleaker.

      In hindsight, it may be one of the better things to happen to me. I had spoken with a recruiter a few days before hand to start looking for work. When this happened, I was able to dedicate myself full time for job-searching. I was also off for hunting season, and able to do many things with my family that I normally wouldn't be able to do. The environment where I was was just awful. Several former co-workers have left since my special day. The CTO is a psychopath. He has 2 sayings he likes to use - the first is 'to do the job right the 1st time'. The second is a Mario Andretti quote of 'If you don't feel like you are out of control, then you aren't going fast enough'. These sayings are mutually exclusive, but logic doesn't apply.

      I start a new position on Jan 5th (but it is only a 6 month contract position). It is a bit more money, and I have about 1/2 the commute. It is also a much better work environment.

      Things I learned:

      - Stockholm syndrome is apparently real. I didn't want to leave because 'it's not that bad'. It was bad. Worse.
      - I hate job hunting.
      - Employment law in Ontario, Canada is not what I thought it was. Pretty much everything I thought I knew was wrong.
      - The economy here in Ontario is poor, but improving (but vastly better than the US).
      - Legal advice in Ontario is tax deductible (at least in reference to employment issues).
      - A certain CTO is a complete and total prick.

      (ha - my captcha word is 'inaction')

    4. Re:Document and test at night by SharpFang · · Score: 2, Interesting

      3) If possible test network changes on the production equipment at 2am so that impact on users will be less

      That's dangerous. You leave it apparently running and crawl back to sleep at 4:30AM, to get an angry call at 7:05AM when the first users to log in report something essential is fucked up.

      Prepare and test at 2AM, then roll back to original. Then re-apply around lunch break and wait with your fingers on roll-back for the first reports of failure.

      --
      45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2
    5. Re:Document and test at night by tlhIngan · · Score: 2, Informative

      In some environments, that is frustrated by other (lazy) technical staff, who immediately start automatically blaming _every_ problem they find for the next few weeks, on that one change, without even doing any helpful troubleshooting, or finding any reason at all to suggest it might be the case.

      The problem is unrelated and would happen anyways, but because they heard of a recent change, there is a cognitive bias towards immediately suspecting the new change, just because it's a change they know about.

      "I didn't change anything, so if I just started getting a few problem reports it must be your change"

      Which is why you announce the change will happen on X, but actually wait a week or two before actually committing the change. Then any bellyaching that happens, you can file as their problem. If any real issues happen, you can even hold off doing the change in case your change might aggravate the problem.

      It's the same when new cell towers or other equipment are installed - people will complain of headaches and other crap caused by the tower right after it's "turned on", when in reality, it's been running months beforehand, or hasn't even been turned on yet.

  5. My last resort by tchdab1 · · Score: 5, Funny

    I call my buddies at RIM and test my mods on their system.

  6. Re: Testing Network Changes When No Test Labs Exi by droz037 · · Score: 2, Interesting

    I would suggest asking your vendors for demo or evaluation equipment. Cisco, Juniper and 3Com have pools of demo equipment as do the resellers like PC Connection and CDW.

    I've done deployments of new switching infrastructure based on work I've done with loaners from my vendors. It can be tough because the typical evaluation period is 30 days. Although you can get 45 and even 60 days.

    If you have a good relationship with your sales rep, it would be easy to push them to get the necessary items to do basic testing and get the concepts down of how you need to deploy. Then get the config files so that when you do buy what you need you're 85% of the way there.

  7. Packet Life by z4ns4stu · · Score: 3, Informative

    Stretch, over at Packet Life, has a great lab set up that anyone who needs to test Cisco configurations can sign up for and use.

    --
    The whole moon and the entire sky are reflected in one dewdrop on the grass. - Dogen
  8. Tools by Tancred · · Score: 5, Informative

    Here are a few tools:

    GNS3 - http://www.gns3.net/ - free network simulator, based on Dynamips Cisco emulator
    Opnet - http://www.opnet.com/ - detailed planning of networks, from scratch
    Traffic Explorer - http://packetdesign.com/ - plan changes to an existing network

  9. lots of little things by wintermute000 · · Score: 2, Informative

    Older Cisco equipment can function just as well as newer for 95% of lab scenarios. You are very unlikely to be needing to use all the newer features.

    Anything that can run IOS 12.3 and is less than a decade old can do a lot more than you think. We do all our BGP testing on a stack of 2600s and 3600s and have never had an issue, even though in production it's 2800s, 3800s etc.
    Granted there are features that you do need the newer kit esp when syntax changes (e.g. IP SLA commands, newer netflow commands, class map based QoS to name three off the top of my head) but none of the core routing and switching features/commands has changed much since the introduction of CEF - they all do ACLs, route maps, OSPF, BGP, EIGRP, vlans, spanning tree, rapid spanning tree, IPSEC vpns. I'm speaking from an enterprise POV not a service provider but I'd imagine if you are in a telco environment you wouldn't be lacking gear.

    For many minor test scenarios, you can pick a test branch office and use the good old 'reload in XYZ' command to ensure that no matter how badly you stuff it up, everything will bounce and come back (just remember NOT TO COPY RUN START lol).

    Then there's the sleight of hand methods:
    - always ordering more for projects than you really need. Par for the course really esp as most project managers haven't a clue when it comes to the nuts and bolts of a big cisco order.
    - pushing for EOL replacements as early as possible, intentionally conflate end of sale with end of life.
    - getting stuff in for projects as early as possible, then you have a month or two to use it as test gear.
    - remember that your lab need not mirror reality, scale down as much as possible. e.g. to simulate a pair of 4506 multilayer switches running in VRRP, use a pair of 3560s. Use your CCO login and flash away to your heart's content (I know it's breaching licencing but for test scenarios, meh).

  10. Re:Pretty simple, really by symbolset · · Score: 5, Funny

    Oh, no. We do this all the time. Around the holidays we rewire the production server racks so their ethernet cables droop over the aisles, so we can hang up Christmas cards. Jimmy has a script that blinks the blue UID lights for a festive holiday display.

    --
    Help stamp out iliturcy.
  11. Go virtual! by leegaard · · Score: 3, Informative

    If you are unable to recycle old equipment into your testlab you should go virtual.

    For Cisco routers, GNS3/Dynamips (www.gns3.net) is your friend. Any recent PC or laptop will allow you to build a large and complex topology that will satisfy most experiments and even support you when doing certification preparation. It will only work for routers so switch-based platforms are out (like the 3570, 6500 and 7600). The good news is that the features are more or less the same and they more or less behave the same way. If "more or less" is not close enough you need a replica of your production network or at least a few devices of each to test what can be labelled as critical.

    For Juniper routers, google juniper Olive. It will run a juniper router the same way dynamips runs a Cisco router.

    In both cases a proactive partnership deal with the vendor will be a good idea. Both Cisco and Juniper (and I am sure all other major network vendors) have programs where they will more or less advise, test and prepare the configurations for you. If you run a critical network this is money well spent.

    In the end it comes down to the level of risk your management is willing to take. Ask them if they will allow the network to be less up since you are unable to properly test your changes before implementation.

  12. Out of hours changes, and change management by anti-NAT · · Score: 2, Informative

    For any sort of medium to large network, you can't fully simulate it. That means you're always going to be making "untested" changes. So, you make very few changes rather than lots, you make sure after each change that it has had the desired effect, and you have backout plans.
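The small-steps approach above can be sketched as a loop that applies one change, checks health, and backs out at the first failure. Everything here is illustrative: the network is modelled as a plain dictionary, and the change names and health check are invented.

```python
def apply_incrementally(state, changes, is_healthy):
    """Apply changes one at a time; back out and stop at the first failure.

    Each change is a (name, apply_fn, backout_fn) tuple.
    Returns (names_applied, name_of_failed_change_or_None).
    """
    applied = []
    for name, apply_fn, backout_fn in changes:
        apply_fn(state)
        if not is_healthy(state):
            backout_fn(state)       # undo only the step that just failed
            return applied, name
        applied.append(name)
    return applied, None


network = {"vlans": {10, 20}, "mtu": 1500}

changes = [
    ("add vlan 30", lambda s: s["vlans"].add(30), lambda s: s["vlans"].discard(30)),
    ("jumbo frames", lambda s: s.update(mtu=9000), lambda s: s.update(mtu=1500)),
    ("add vlan 40", lambda s: s["vlans"].add(40), lambda s: s["vlans"].discard(40)),
]


def is_healthy(state):
    # Pretend a downstream link cannot carry jumbo frames.
    return state["mtu"] <= 1500


applied, failed = apply_incrementally(network, changes, is_healthy)
print(applied, failed)
```

Because the loop stops at the first unhealthy check, the failing change is unambiguous, which is the main benefit of making few changes at a time.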

    --
    The Internet's nature is peer to peer - 20050301_cs_profs.pdf
  13. Borrow a lab! by jimpop · · Score: 3, Interesting

    Cisco have many (large) labs located around the world. Sign up for some time in one of them.

  14. Paper Trail by tengu1sd · · Score: 3, Interesting
    >>>refuse to provide funds for expensive lab equipment, test circuits and for reasonable time to get testing done before moving equipment or configs into production.

    Make sure that every change request implementation documents that this change is being placed into the production environment for testing. Document impact ranging from total network failure to moderate inconvenience and include rollout timetables. The rollout needs to include travel times, such as the drive to site B or a cross-country flight.

    Of course the downside of this is that management may go out and hire someone who knows, or at least pretends to know, how to drop changes into place without whining about ignorance and making customers uncomfortable.

  15. Don't forget SOX by jackb_guppy · · Score: 2, Informative

    1) You should not be making any direct changes to the network without correct design, test and sign-off.

    2) You should already have a redundant network structure, so "half" can be lost without any loss to network operations. This way the change can be tested in parallel.

    3) You should always report to the SOX officer when a request outside correct operations and management is made. It makes it their responsibility to solve the legal issues of not following their written standards before you begin.

  16. Re:Pretty simple, really by i.r.id10t · · Score: 2

    Oh, a variation on blinkenlights?

            ACHTUNG!
            ALLES TURISTEN UND NONTEKNISCHEN LOOKENPEEPERS!
            DAS KOMPUTERMASCHINE IST NICHT FÜR DER GEFINGERPOKEN UND MITTENGRABEN! ODERWISE IST EASY TO SCHNAPPEN DER SPRINGENWERK, BLOWENFUSEN UND POPPENCORKEN MIT SPITZENSPARKSEN.
            IST NICHT FÜR GEWERKEN BEI DUMMKOPFEN. DER RUBBERNECKEN SIGHTSEEREN KEEPEN DAS COTTONPICKEN HÄNDER IN DAS POCKETS MUSS.
            ZO RELAXEN UND WATSCHEN DER BLINKENLICHTEN.

    Some more stuff to not trip the lameness filter, I hope...

    --
    Don't blame me, I voted for Kodos
  17. Comment removed by account_deleted · · Score: 2, Insightful

    Comment removed based on user account deletion

  18. Re:Download vyatta by itwerx · · Score: 2, Funny

    Tired of the VI vs EMACS war? Try the new Vyatta vs pfSense conflict instead! :) (pfSense is great...)

  19. Re:Pretty simple, really by lukas84 · · Score: 3, Insightful

    Everyone has a test environment. But not everyone has a production environment.