Trying to Help a Troubled Network with Linux?

Posted by Cliff on Tuesday October 25, 2005 @11:25AM from the penguin-packet-protection dept.

vmehta asks: "I was recently put in a situation where I am trying to help a troubled network with many students accessing it. There are issues with broadcast packets and random outages which seem to be plaguing the network. What tools and methods are the best practice when trying to use Linux and Open Source to analyze and fix a network?"

Re:Assess the problem by SpaceLifeForm · 2005-10-25 11:42 · Score: 1, Informative

And then clean up the windows boxes. It sure sounds like there are many pwned machines.

--
You are being MICROattacked, from various angles, in a SOFT manner.
Re:Assess the problem by tverbeek · 2005-10-25 12:32 · Score: 4, Informative

Did you read the part of the question where he explained that he was looking for tools to analyze and fix the problem? And did you notice that he didn't mention or imply any kind of migration?
Here's an idea: Before you blunder in with an answer, the first step is to work out what the question is. :)

--
http://alternatives.rzero.com/
map, isolate, trend by grattwood · 2005-10-25 12:38 · Score: 5, Informative

Step 1) Map the network both logically (which networks, what is the routing, etc.) and physically... the "tug test". Label everything, and put it all in a spreadsheet. Tools are nmap, pen and paper, and a label printer. Access to the routers, or being friendly with the router admin, is a must.

Step 2) Isolate the problem protocols and hosts. Be on the lookout for AppleTalk, IPX, or old NetBIOS, all very chatty protocols. Look for old hubs and replace them with switches. Look for compromised boxen. Try to VLAN things logically (by department or by usage, whichever is best for the environment). Tools are snort, ethereal, ntop, and syslog (any managed switches should be sending to a syslog server; I've used syslog-ng).

Step 3) Trend as much as you can. Even before the network is cleaned up, start to collect statistics from the switches and/or hosts on your network. Any gateways should be monitored as well. This will let you see if there are problems correlated to a particular time of day, if you're going over your bandwidth, etc. Tools are MRTG, or for more depth try Cacti http://www.cacti.net/
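
For anyone wondering what "trending" boils down to before MRTG or Cacti is set up, here is a rough Python sketch of the arithmetic those tools do on interface counters. It is an illustration under assumptions, not how either tool is implemented: it assumes 32-bit octet counters (such as ifInOctets) that wrap at 2**32, at most one wrap between polls, and the function names and sample readings are made up.

```python
# Sketch only: turns raw interface octet counters into throughput figures.
# Assumes 32-bit counters (such as ifInOctets) that wrap at 2**32 and at most
# one wrap between consecutive polls.

COUNTER32_MODULUS = 2 ** 32


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Octets transferred between two readings, allowing for a single wrap."""
    return (current - previous) % modulus


def throughput(samples, link_bps):
    """Convert (unix_seconds, octet_counter) samples into per-interval rates.

    Returns a list of (end_time, bits_per_second, fraction_of_link).
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        interval = t1 - t0
        if interval <= 0:
            continue  # skip duplicate or out-of-order polls
        bps = counter_delta(c0, c1) * 8 / interval
        rates.append((t1, bps, bps / link_bps))
    return rates


# Hypothetical readings taken five minutes apart on a 100 Mbit port;
# the counter wraps between the first and second poll.
polls = [(0, 4294967000), (300, 3749704), (600, 378749704)]
for end_time, bps, load in throughput(polls, link_bps=100_000_000):
    print(f"t={end_time:>4}s  {bps / 1e6:8.2f} Mbit/s  {load:6.1%} of link")
```

Logging these rates with a timestamp is enough to spot the time-of-day patterns mentioned in Step 3; a port that sits near 100% of link speed at the same hour every day is a good place to start looking.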

There is much more after you get to this point, but people will be much happier the faster you get here.

Good luck
Take a step back by tmasky · 2005-10-25 12:45 · Score: 3, Informative

You're attempting to help diagnose a (presumably) large network. Very honourable, but attempting to do this gung-ho with a few responses from slashdot is very silly.

Grab a consultant from a local small Linux shop for a few days. Someone with good knowledge about system/network architecture.

Get them to poke around on your network. Provide all documentation you have available.

After the first day, you should have all the information necessary to write up a document regarding your existing issues. Make notes while he's using tools to investigate. From there you work with the consultant to come up with a separate document for resolutions with a criticality rating.

From there, you want systems in place to monitor the health of your network. Have a chat to him about it, but I'd be inclined to build a solution which was centered around using Nagios.

While consultants can (and frequently do) suck when you come to specifics, they are a valuable resource for pointing you in the right direction. And experience counts! They've done this stuff before, they know the pitfalls and proven solutions.
The 10 step Universal Troubleshooting Process by Anonymous Coward · 2005-10-25 13:04 · Score: 2, Informative
The 10 step Universal Troubleshooting Process
1. Get the Attitude
2. Get a complete and accurate symptom description
3. Make damage control plan
4. Reproduce the symptom
5. Do the appropriate general maintenance
6. Narrow it down to the root cause
7. Repair or replace the defective component
8. Test
9. Take pride in your solution
10. Prevent future occurrence of this problem
Re:OSS? Linux? WHY? by Anonymous Coward · 2005-10-25 15:33 · Score: 3, Informative

From the readme.win32:
If you want to use a PC running Ethereal to monitor 802.11 traffic to or from other machines, rather than using Ethereal only to look at traffic to and from the machine on which you're running Ethereal, you should seriously consider running it on a recent version of Linux or of one of the free-software BSDs, rather than on Windows.
top 75 list by LinuxGeekMobile · 2005-10-25 16:31 · Score: 2, Informative

More tools than you could learn in a reasonable timeframe can be found here: http://www.insecure.org/tools.html

I would have posted sooner, but T-Mobile's data coverage has been spotty since Wilma hit. Still no power or fuel, but at least I can get my geek-fix now. :) (at least until my battery dies)

--
- Posted via Danger HipTop2 / T-Mobile Sidek!ck II -
Re:Assess the problem by moro_666 · 2005-10-25 17:06 · Score: 2, Informative

You can attempt a scan & sniff at first; there is plenty of stuff to choose from. But if your 100 Mbit network is being overloaded, it's quite difficult to isolate the individual machines responsible.

I guess you will probably end up doing this:

1) Get rid of cheap hubs (made in paiwan) and get some real network switches in place, like those from SMC. Having an old buggy hub talking to several cheap NICs in several machines ends up in massive packet collisions, resulting in a network that doesn't carry much but is totally jammed.

2) Scan all those windows machines attached to the network; they are probably bloated with viruses and spyware. Sniff on the windows machines for a 24h period: some viruses/spyware only work at certain hours, and a midday scan doesn't show anything. If the windows machines can't be healed of their stuff, unplug them (yes, as simple as that).

3) If none of the above helps, isolate the subnets and firewall them all, creating the corresponding rules on internal firewalls, so that the flooding stops.

4) Most important, tell the windows users to scan their machines regularly. Unless they do, you will have to start all of the above again the day after tomorrow.
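
If it does come to point 3, it helps to write the subnet plan down before touching any firewall. A minimal Python sketch using the standard ipaddress module is below. The address range, the group names and the helper names are all hypothetical; it only carves up address space and says nothing about the firewall rules themselves.

```python
import ipaddress


def plan_subnets(campus_cidr, groups, new_prefix):
    """Assign one subnet of the given prefix length to each group, in order."""
    network = ipaddress.ip_network(campus_cidr)
    available = network.subnets(new_prefix=new_prefix)
    plan = {}
    for group in groups:
        try:
            plan[group] = next(available)
        except StopIteration:
            raise ValueError(f"{campus_cidr} has no /{new_prefix} left for {group}") from None
    return plan


def group_for(address, plan):
    """Return the group whose subnet contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for group, subnet in plan.items():
        if ip in subnet:
            return group
    return None


# Hypothetical layout: one /24 per group out of a private /16.
plan = plan_subnets("10.20.0.0/16", ["labs", "dorms", "staff"], new_prefix=24)
for group, subnet in plan.items():
    print(f"{group:6s} {subnet}  ({subnet.num_addresses - 2} usable hosts)")
print(group_for("10.20.1.77", plan))
```

With a plan like this, a flooding host seen in a capture can be mapped straight back to the group (and the internal firewall) it sits behind.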

I recently found a virus on a windows box in my own office sending me the "worm" emails ... luckily thunderbird on my ubuntu laptop didn't quite figure out what to do with "report.pif" files ... Anyway, the biggest problem was how to draw the "big red picture" for the whole office: that they *have* to scan their machines all the time. I found out that 66% of our machines were infected by worms, some of them even by multiple worms. After the cleanout the network performance increased dramatically and I could even do non-lagging X sessions from my home to the office (which I couldn't do before).

Long story short: Replace old crap, scan & sniff, isolate subnets, unplug m$ powered machines.

--

I'd tell you the chances of this story being a dupe, but you wouldn't like it.
read your network by graf0z · 2005-10-25 20:39 · Score: 2, Informative

Troubleshooting a network is a matter of experience, not of some particular tools. But these things help:

* Put your box on the monitor/mirror/analysing port of the switch and read the traffic with tcpdump/tethereal/ethereal (if you just want to check the broadcasts, it does not have to be a monitoring port). Edit the packet filter expression until you no longer see the legal/uninteresting traffic but only the suspects. (They are students? Have fun filtering all the p2p traffic ;-) Let ethereal make statistics over the traffic.

* Watch out for ICMP errors, especially ICMP-redirects. Watch out for TCP-resets. Watch out for fragments. Watch out for malicious Spanning-Tree packets. Watch for SMTP to many IPs (spamming trojans), IRC (zombies), and weird packets, e.g. fragmented UDP (zombies attacking a target).

* Check the MAC addresses in the ethernet frame header ('tcpdump -e'): are they constant? If there are packets IP_A<-->IP_B, are the corresponding MACs really MAC_A<-->MAC_B, or MAC_A-->MAC_B and MAC_B-->MAC_C instead?

* Install an arpwatcher. Stealing the default-gateway's MAC is an effective DoS attack on a network.

* Put 2 NICs into a fast linux box, bridge ('brctl') them together, put this linuxbridge in front of the default-gateway. Dump again. Install a snort on it and let it see the traffic - what does the snort log say?

* Do the switches have the feature to log to a remote syslog daemon? Do so and read those logs! Check all the snmp-variables on the switches, especially the "errors". Read the logs of the default-gateway.

* Watch the amount of traffic (snmpget the port-counters of the switches and make mrtg-graphs of the results). Maybe the problem only strikes if some switch ports are under high load?

* Scan the network with nessus. Maybe you'll find some bindshells.

* ...
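
Two of the checks above (who is sending all the broadcasts, and whether an IP keeps the same MAC) can be roughed out in a few lines of Python over saved 'tcpdump -e -n' text output. Treat this as a sketch under assumptions: the regular expressions assume the line layout printed by recent tcpdump builds (older versions print the link-level header differently), the function names are made up, and the sample lines are invented for illustration.

```python
import re
from collections import Counter, defaultdict

MAC = r"[0-9a-f]{2}(?::[0-9a-f]{2}){5}"

# Assumed shape: "<time> <src mac> > <dst mac>, ethertype ..." as printed by
# recent tcpdump with -e -n. Without -n the destination may read "Broadcast".
FRAME = re.compile(rf"^\S+\s+(?P<src>{MAC})\s+>\s+(?P<dst>{MAC}|Broadcast),", re.IGNORECASE)
ARP_REPLY = re.compile(rf"Reply (?P<ip>\d+\.\d+\.\d+\.\d+) is-at (?P<mac>{MAC})", re.IGNORECASE)
BROADCAST = {"ff:ff:ff:ff:ff:ff", "broadcast"}


def broadcast_talkers(lines):
    """Count broadcast frames per source MAC."""
    counts = Counter()
    for line in lines:
        m = FRAME.match(line)
        if m and m.group("dst").lower() in BROADCAST:
            counts[m.group("src").lower()] += 1
    return counts


def ip_mac_conflicts(lines):
    """Return {ip: set of MACs} for IPs claimed by more than one MAC in ARP replies."""
    seen = defaultdict(set)
    for line in lines:
        m = ARP_REPLY.search(line)
        if m:
            seen[m.group("ip")].add(m.group("mac").lower())
    return {ip: macs for ip, macs in seen.items() if len(macs) > 1}


# Invented capture lines: one chatty host, and two MACs answering for 10.0.0.1.
SAMPLE = [
    "10:15:32.000001 00:0c:29:aa:bb:cc > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.0.0.1 tell 10.0.0.50, length 46",
    "10:15:32.000412 00:11:22:33:44:55 > 00:0c:29:aa:bb:cc, ethertype ARP (0x0806), length 60: Reply 10.0.0.1 is-at 00:11:22:33:44:55, length 46",
    "10:15:33.120077 00:0c:29:aa:bb:cc > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 92: 10.0.0.50.137 > 10.0.0.255.137: UDP, length 50",
    "10:15:34.530900 de:ad:be:ef:00:01 > 00:0c:29:aa:bb:cc, ethertype ARP (0x0806), length 60: Reply 10.0.0.1 is-at de:ad:be:ef:00:01, length 46",
]

for mac, frames in broadcast_talkers(SAMPLE).most_common(10):
    print(f"{frames:6d} broadcast frames from {mac}")
for ip, macs in ip_mac_conflicts(SAMPLE).items():
    print(f"WARNING: {ip} is claimed by {sorted(macs)}")
```

Two MACs answering for the default gateway's IP is exactly the situation an arpwatcher is meant to flag; this only shows the idea on a saved capture and is no substitute for running one continuously.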

Hope this helps.

g.