Trying to Help a Troubled Network with Linux?
vmehta asks: "I was recently put in a situation where I am trying to help a troubled network with many students accessing it. There are issues with broadcast packets and random outages which seem to be plaguing the network. What tools and methods are the best practice when trying to use Linux and Open Source to analyze and fix a network?"
The first step isn't to blunder in and migrate. The first step is to work out what's causing the outages. Use ethereal or some other packet sniffer to establish where the broadcast floods are coming from, use nmap to find insecure hosts, and investigate what kind of routers are being used and what rules they apply.
Basically, OSS/Linux are great, but don't rush in without establishing the issues first.
Almost any time I see this, it's some random box flooding the network. Just go to your switches...the light that is on solid continuously will point you in the right direction.
No use fixing symptoms; go after the root cause.
What's next, "How do I produce PDF files, using Linux and Open Source?" "How can I leverage Open Source to surf the web?"
Christ, this is like the late 90's, when everything suddenly had "e" in front of it. Dude, get Ethereal, slap it on any Windows box, and be done. No need to get nerdy with Linux. If you know enough that it's broadcast traffic, you're halfway there.
I want to delete my account but Slashdot doesn't allow it.
The first step in troubleshooting is knowing the network topology. How are network segments separated? How are they connected? Where are routers, hubs, switches, etc.? Which switches are managed, and how are the VLANs set up on them? Where are the DHCP servers, and what do they serve? Where are all your network drops?
Do your network segments have multiple subnets attached to them?
Is everything subnetted properly?
The first set of questions are ones YOU should be able to answer. After all, it's YOUR network, and YOU should know how it's set up. The last two are harder to deal with, because these settings may be on computers not in your control.
Answer the first questions first, then when you are looking at packet traces, TCP/IP dumps, logs, etc. and you see a problem, you'll have a better idea where the problem is physically located, saving much time and energy.
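For the "is everything subnetted properly?" question, a quick sanity check is whether two addresses that are supposed to share a segment really fall in the same network for the prefix length you think you are using. A minimal sketch in plain sh follows; the function names ip_to_int and same_subnet are made up for illustration, and it only handles well-formed dotted-quad IPv4 addresses.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a single integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split "a.b.c.d" into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed (exit 0) when both addresses fall in the same network
# for the given prefix length, e.g. 24 for 255.255.255.0.
# Arguments: address1 address2 prefix_length
same_subnet() {
    mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    [ $(( a & mask )) -eq $(( b & mask )) ]
}
```

Something like same_subnet 10.1.2.3 10.1.3.4 24 || echo "different networks" would flag a host whose address does not belong on the segment it is plugged into.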
And then there's the "dumb questions" I shouldn't have to ask: Do you have a loop? Are your cables wired to T568A or T568B standards? Are all your cables in good repair?
Give me my freedom, and I'll take care of my own security, thank you.
The best thing you can do is use a tool such as Ethereal to find the IP of the system or systems causing it, and subject them to a good cleanup.
For a good toolset, check out the Auditor Security Tools LiveCD for a collection of tools you can take with you wherever you go...
Without any more information, you've got a bad NIC, almost certainly. Look on the switch for the port whose light is always on. As you've described it, software has almost nothing to do with it. This is a NIC, or a bad switch, or bad cabling, or something.
"He who would learn astronomy, and other recondite arts, let him go elsewhere. " -- John Calvin, commenting on Genesis 1
See man command for further info on these commands.
....
Use ping ip-address to see if you can get to the router and beyond. Make sure "allow ICMP" is enabled in the router.
Use traceroute -n ip-address to see where the traffic is failing.
Is it a DNS problem? Try host some.host.name to make sure you can resolve names.
Is it a DHCP problem? Try dhclient to see if you can get an IP address. (maybe pump on some systems.)
Connect a hub (not a switch) to some strategic place on the network. Give yourself an IP address and check for excessive traffic with iptraf. This will give you a breakdown of what bandwidth is being used by what services.
You can use commands like nc and telnet to connect to specific ports, e.g. nc -v dns-server 53 to see if the DNS server accepts TCP connections on port 53.
You can also automate these commands in a bash script run via a cronjob every minute. Something like:
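for x in router1 router2 router3
do
    # -c 1: send one probe so ping returns; -W 2: wait up to 2 seconds (Linux iputils)
    ping -c 1 -W 2 "$x" >/dev/null 2>&1 || echo "$(date) $x unreachable" >>/tmp/failures.txt
done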
See man bash for details.
Good luck.
Step 1) Map the network both logically (which networks, what is the routing, etc.) and physically... the "tug test". Label everything, and put it all in a spreadsheet. Tools are nmap, pen and paper, and a label printer. Access to the routers, or being friendly with the router admin, is a must.
Step 2) Isolate the problem protocols and hosts. Be on the lookout for AppleTalk, IPX, or old NetBIOS. All very chatty protocols. Look for old hubs and replace them with switches. Look for compromised boxen. Try to VLAN things logically (by department or by usage, whichever is best for the environment). Tools are snort, ethereal, ntop, and syslog (any managed switches should be sending to a syslog server; I've used syslog-ng).
Step 3) Trend as much as you can. Even before the network is cleaned up, start to collect statistics from the switches and/or hosts on your network. Any gateways should be monitored as well. This will let you see if there are problems correlated to a particular time of day, if you're going over your bandwidth, etc. Tools are MRTG, or for more depth try Cacti http://www.cacti.net/
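If you want to see what the trending tools are doing under the hood: they read an octet counter from the switch twice and turn the difference into a rate. Here is a rough sketch of that arithmetic, assuming a 32-bit counter such as ifInOctets and at most one wrap-around between readings. The function name counter_bps is made up for illustration.

```shell
#!/bin/sh
# Average bits per second between two readings of a 32-bit octet counter
# (such as ifInOctets), allowing for one wrap-around between readings.
# Arguments: previous_count current_count seconds_between_readings
counter_bps() {
    delta=$(( $2 - $1 ))
    if [ "$delta" -lt 0 ]; then
        delta=$(( delta + 4294967296 ))   # counter wrapped past 2^32
    fi
    echo $(( delta * 8 / $3 ))
}
```

For example, counter_bps 1000 2000 10 gives 800 bits per second. On a busy fast link a 32-bit counter can wrap more than once between polls, in which case this undercounts, so treat it as an illustration rather than a replacement for MRTG or Cacti.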
There is much more after you get to this point, but people will be much happier the faster you get here.
Good luck
You're attempting to help diagnose a (presumably) large network. Very honourable, but attempting to do this gung-ho with a few responses from slashdot is very silly.
Grab a consultant from a local small Linux shop for a few days. Someone with good knowledge about system/network architecture.
Get them to poke around on your network. Provide all documentation you have available.
After the first day, you should have all the information necessary to write up a document regarding your existing issues. Make notes while he's using tools to investigate. From there you work with the consultant to come up with a separate document for resolutions with a criticality rating.
From there, you want systems in place to monitor the health of your network. Have a chat to him about it, but I'd be inclined to build a solution which was centered around using Nagios.
While consultants can (and frequently do) suck when you come to specifics, they are a valuable resource for pointing you in the right direction. And experience counts! They've done this stuff before, they know the pitfalls and proven solutions.
Bingo. First thing I thought of when I heard this.
While it's *possible* this is a virus (as others have said), I'd look at hardware first. A bad transceiver will generate more bad traffic than a virus could ever hope to.
" when everything suddenly had "e" in front of it. Dude, get Ethereal,"
:-)
you mean eThereal don't you
Find out what is connected to what and how. More than 90% of the "network problems" I encounter are basic cable issues.
Remember, when a NIC is connected to a switch, auto-negotiation only works if both ends are set to auto-negotiate. If someone forces a speed/duplex setting on one end and the other end doesn't match it, you get a duplex mismatch, with a lot more collisions and errors.
Make sure that your collision domain is set up correctly. Pay attention to the length of the cables. This is where the physical map comes in. You can check each section to make sure it's good. Then move to the next.
Start at the physical layer and work your way up.
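On the Linux side you can check what a NIC actually negotiated with ethtool. A small sketch that reads ethtool's output and flags the two settings that usually go with a duplex mismatch is below. The function name check_link is made up, and it assumes the usual "Duplex:" and "Auto-negotiation:" lines in the output.

```shell
#!/bin/sh
# Read the output of "ethtool <interface>" on stdin and report settings
# that commonly go with a duplex mismatch.
check_link() {
    awk -F': ' '
        /Duplex:/           { duplex = $2 }
        /Auto-negotiation:/ { autoneg = $2 }
        END {
            if (duplex == "Half") print "WARN half duplex"
            if (autoneg == "off") print "WARN auto-negotiation off"
        }'
}
```

Usage would be something like ethtool eth0 | check_link, where eth0 is just an example interface name. No output means neither warning applied. Half duplex is perfectly normal on a hub, so a warning is a hint to go look at the switch port, not proof of a fault.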
It certainly could be any of the things you mention. With the vagueness of the original post, it could even be a layer 7 problem (i.e. a crappy Windows server.) But with the piss-poor information provided, my money is still on an NIC.
Low-tech is often a faster and more efficient way to find these sorts of problems. For surveillance and diagnosis, I recommend walking around and watching over students' shoulders. For corrective measures, a couple of taps with a ball-peen hammer usually suffices.
--
Twoflower
Start using tcpdump along with ethereal. Put the Linux box on different parts of the network to see what is happening. If you're in a switched environment, you will see mostly broadcasts. Some broadcasts are required and good (like the necessary ARP requests and possibly DHCP requests when a computer boots and initializes its network devices). However, unnecessary broadcasts are very bad for network performance and can cause "packet storms" which cause outages.
Start tracking those broadcasts down and find out what's going on in this network: which machines are sending the broadcasts, and why. Learn why various services broadcast and what services are available to minimize the broadcasting. For example, Windows boxes configured for a workgroup will typically broadcast until you set up a WINS system. Once properly configured, all the Windows boxes talk to the WINS box directly without broadcasting.
The same is true with Windows domains. If all the boxes are joined to an Active Directory domain, the workstations should switch from broadcasting to unicasting (talking directly) to the Active Directory servers.
This is also true of Service Location Protocol. If you've got a bunch of boxes trying to use SLP, they're probably multicasting (or broadcasting if the network switches and routers don't properly support multicast). Once you start an SLP Directory Agent, all the servers register their services with the DA and all the clients ask the DA where to find those services -- all with unicast instead of broadcast.
Certain older protocols are very "chatty" -- AppleTalk, IPX, NetBEUI, and NetBIOS are good examples. Work toward eliminating these protocols. In a properly configured network, you should be able to do everything you need without them.
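To get a quick "who is shouting the most" list out of a capture, you can count broadcast frames per source MAC. A minimal sketch follows. It assumes the usual text layout of tcpdump -e output (timestamp, source MAC, ">", destination MAC, ...), and the function name count_broadcast_sources is made up for illustration.

```shell
#!/bin/sh
# Read "tcpdump -e" text output on stdin and count broadcast frames
# per source MAC, busiest sender first.
# Assumes the layout: timestamp, source MAC, ">", destination MAC, ...
count_broadcast_sources() {
    awk '$3 == ">" && $4 ~ /^(ff:ff:ff:ff:ff:ff|Broadcast)/ { n[$2]++ }
         END { for (mac in n) print n[mac], mac }' | sort -k1,1nr
}
```

Something like tcpdump -e -n -c 1000 -i eth0 ether broadcast | count_broadcast_sources (eth0 being an example interface) gives a count next to each MAC. A handful of ARP requests per host is normal; one MAC with most of the thousand frames is the box to go visit.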
Ouch! The truth hurts!
If you want to use a PC running Ethereal to monitor 802.11 traffic to or from other machines, rather than using Ethereal only to look at traffic to and from the machine on which you're running Ethereal, you should seriously consider running it on a recent version of Linux or of one of the free-software BSDs, rather than on Windows.
Go on, mod me 'insightfull' or mod me 'flamebait', it's one or the other.
"A language that doesn't affect the way you think about programming, is not worth knowing" - Alan Perlis
The first step is to document and baseline the systems.
For baselining, I'd enable SNMP for all the managed devices. Then use something like MRTG with RRDtool and chart every port on every switch for a week or so.
While that's happening in the background, start mapping your LAN. Use something like Visio on a laptop and start visiting switches and routers. Confirm the connections between all the routers and switches. Then use good labels (no, not scotch tape and paper) to document those connections with FROM: and TO: information.
FROM:bldg1024 rm201 sw3 p4
TO:  bldg2048 rm906 sw17 p33
Now, labeling to and from may seem dumb at first. But the first time you unplug something to move it and then forget where it was supposed to go, you'll thank me.
Once everything is labeled and documented, you can go back to your MRTG graphs and start analyzing the data.
Look at your core switches. Which ports have the highest graphs? Look at your documentation and see what switch is connected to that port. On that switch, which port is highest? Wash, rinse, repeat.
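If you have the per-port numbers as text rather than graphs, the "which port is highest" step is a one-liner. A tiny sketch, assuming one "port-name bits-per-second" pair per line on stdin; the function name busiest_ports and the input format are made up for illustration.

```shell
#!/bin/sh
# Read "port-name bits-per-second" lines on stdin and print the N busiest
# ports, highest rate first (N defaults to 5).
busiest_ports() {
    sort -k2,2nr | head -n "${1:-5}"
}
```

Run it against the core switch's numbers, look up what hangs off the top port in your documentation, then run it again against that switch.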
Once you have the access-device that is concentrating all the bad data, set up a clone port and then use a packet sniffer (I use Sniffer Pro) to figure out what the bad data is.
Anyway, after you "shave off the peaks", you can re-baseline the system and start again. Once traffic is semi-reasonable, then it's time for hardware analysis.
Using MRTG, look at the CPU, memory, and other nifty stats from the switches and routers themselves. Target devices in need of an upgrade. One word of caution: Cisco switches always have high CPU and memory usage. Just because a device shows 85%CPU does not mean it's working hard. Look at a switch with nothing connected to see what I mean.
1. baseline
2. document
3. analyze
4. fix
5. upgrade
6. re-baseline
7. re-document
I'd rather you do it wrong, than for me to have to do it at all.
More tools than you could learn in a reasonable timeframe can be found here: http://www.insecure.org/tools.html
I would have posted sooner, but T-Mobile's data coverage has been spotty since Wilma hit. Still no power or fuel, but at least I can can get my geek-fix now.
- Posted via Danger HipTop2 / T-Mobile Sidek!ck II -
I thought this was slashdot, where news topics are discussed, not community support forum.
You'll have to read the descriptions to decide which ones to try.
Beta is broken and the link to classic doesn't work. Stop wasting our time or there won't be anybody left here.
Troubleshooting a network is a matter of experience, not of some particular tools. But these things help:
...
* Put your box on the monitor/mirror/analysing port of the switch and read the traffic with tcpdump/tethereal/ethereal (if you just want to check the broadcasts, it does not have to be a monitoring port). Edit the packet filter expression until you no longer see the legal/uninteresting traffic but only the suspects. (They are students? Have fun filtering all the p2p traffic ;-) Let ethereal make statistics over the traffic.
* Watch out for ICMP errors, especially ICMP-redirects. Watch out for TCP-resets. Watch out for fragments. Watch out for malicious Spanning-Tree packets. Watch for SMTP to many IPs (spamming trojans), IRC (zombies), weird packets eg. fragmented UDP (zombies attacking a target)
* Check the MAC addresses in the etherframe-header ('tcpdump -e'): are they constant? If there are packets IP_A<-->IP_B, are the corresponding MACs really MAC_A<-->MAC_B, or MAC_A-->MAC_B and MAC_B-->MAC_C instead?
* Install an arpwatcher. Stealing the default-gateway's MAC is an effective DoS attack on a network.
* Put 2 NICs into a fast linux box, bridge ('brctl') them together, put this linuxbridge in front of the default-gateway. Dump again. Install a snort on it and let it see the traffic - what does the snort log say?
* Do the switches have the feature to log to a remote syslog daemon? Do so and read those logs! Check all the snmp-variables on the switches, especially the "errors". Read the logs of the default-gateway.
* Watch the amount of traffic (snmpget the port-counters of the switches and make mrtg-graphs of the results). Maybe the problem only strikes if some switch ports are under high load?
* Scan the network with nessus. Maybe you'll find some bindshells.
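The MAC consistency and arpwatch points above boil down to one check: has any IP address been seen with more than one MAC? A minimal sketch of that check, assuming you have collected "ip-address mac-address" pairs one per line. The function name ips_with_changing_mac is made up for illustration, and this is not how arpwatch itself is implemented.

```shell
#!/bin/sh
# Read "ip-address mac-address" observations on stdin (one pair per line)
# and print every IP that has been seen with more than one MAC,
# followed by the MACs it was seen with.
ips_with_changing_mac() {
    sort -u | awk '{ count[$1]++; macs[$1] = macs[$1] " " $2 }
                   END { for (ip in count) if (count[ip] > 1) print ip macs[ip] }' | sort
}
```

One way to collect the pairs on a Linux box is to append the output of ip neigh | awk '$4 == "lladdr" { print $1, $5 }' to a file every few minutes and feed that file to the function. A hit is not always an attack (a replaced NIC or a reassigned DHCP lease looks the same), but if the default gateway's IP shows up in the list, look at that first.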
Hope this helps.
g.
Grab a consultant from a local small Linux shop for a few days. Someone with good knowledge about system/network architecture.
You should read between the lines. He said: I was recently put in a situation...
Which means he is the consultant. Of course, thanks to a fake CV made up by the sales representative of the consultancy firm, they sent him even though he has no clue about network administration.
I have a better idea. Get Linux and slap it on all your windows boxes and be done. For good.
The difference between Canada and the USA is that in Canada healthcare is a right and gun ownership is a privilege.
Step 1
Get someone who does this for a living. I am sure there are a few in your local linux shop. Someone who works at an ISP should have experience with the problems you cite.
Step 2
Follow his/her recommendations (which will probably be splitting the network in more l3 domains) get a 6500, or a few 3750, or if you really can't afford much a few 3550 switches (which will leave you out of luck when ipv6 starts getting used, but otherwise is a fine choice).
This is about having L3 switches closer to the end user than you have now, as far as I know there are no acceptable products that are cheaper.
(probably) You should split up the l2 network into a lot of separate l3 domains. Do not implement firewalling and none of the students will mind. Get an IGP running between the l3 domains, and provide multiple, geographically separate uplinks (10 * adsl exporting 0.0.0.0/0's in the igp is a lot better than 1 E3 if you don't really know what you're doing)
In short, if you have to ask, you don't know how to fix this, no analysis tools can help you without a serious and deep understanding of the technology. If you don't want to pay someone for this, a lot of people will say they can fix it, but you'll need to be extremely lucky to actually find someone.
Perhaps if you decide to go the cheap route, go the old-fashioned way, trust someone with a CCNP and a CS degree more than a 17-year-old.
"This is a NIC, or a bad switch, or bad cabling, or something"
Gee, thanks for pinpointing the problem.
Make sure your students haven't started looping back the cables from one network socket to another. Always make sure an unused network point isn't connected at the patch panel/switch end; leaving it live is just asking for trouble. The more physical restrictions you have on your network, the easier the rest will be to manage. Rogue access points may also be a downfall of your network, check for them!
You bought her a Kentucky Fried Chicken Franchise!!!
Actually, another cause could be a looped network connection. We have problems with students who will connect two network jacks together, thus creating a loopback in the switch they are connected to. Generates a whole lot of network traffic. Basically they were doing this when they had exams requiring computers, because bringing down the network ensured no exam...
Tell me. What problems, exactly, would this solve?
Always check the physical layer first.
Just this summer I tracked down an error that was caused by a cisco wireless access point trying to pull electricity from the cat5. It was UNPLUGGED from the power! It took down a whole segment of the network.
The way we found it was from the solid light on the switch.
If you are going to try to write up docs....
If they run Cisco equipment, a show cdp neighbor will help you a lot. Keeping up to date documentation on a network (especially a large one) is a difficult task, but it will make solving future problems much easier.
Here are some tools I use for just about the same thing you're about to do, and a brief reason why I use each. Start with one, then once mastered move to another.
Be patient. It could just be that the LAN/WAN/Internet connection is too slow for what the population expects. University and college management rarely understand this until a prof or bigwig can't watch a football game or something.