Network Monitoring Options?
Nom du Keyboard asks: "We have a LAN network of 7 servers and about 400 PCs. Every so often I'll notice immense slowdowns, from minutes to occasional delays of a couple hours, while getting data from various servers, and it happens from more than just my PC. So far we haven't had any way of determining if a server has suddenly gotten tied up, or if there is some failure in the communications backbone. Without a lot of money to spend on this (I think it's more important than others right now), what cheap or free monitoring options are there available that can map and isolate problems in a network of this size?"
These are the ones I have more recent experience with. All of them require some reading and planning before you set them up.
OpenNMS - Probably the most trouble-free NMS I've found so far. No, not "trouble-free". But the closest to it.
Nagios - The most flexible, but also the biggest royal pain in the ass to set up & maintain. Almost infinitely scalable, though, if you are willing to take the time to write some Perl scripts to automate most administrative tasks and divide the monitoring work up: several "slave" hosts can harvest monitoring data for a subset of your network and push it to your central Nagios server, which greatly lessens the load on your main monitoring server. Some really great monitoring possibilities are out there if you look into NRPE with Nagios.
OpManager - We bought this commercial solution at my last job. Great for monitoring Windows servers. A real pain in the ass to monitor anything else with any level of sophistication. It also has some fatal bugs that cause it to quietly orphan nodes if it misses a scheduled poll!
Then NTOP (http://www.ntop.org/) is your best bet. It breaks down all traffic on your network and should let you see who's being naughty and who's being nice.
what cheap or free monitoring options are there available . . .
If the network is the issue, the cheapest and simplest is a good laptop running Ethereal or Snort. Also pick up (or scrounge up) a dumb hub and if possible a fiber tap, since you're probably running in a mixed-media switched infrastructure (or maybe you're not - hence the problems :) ). If you want to get fancy you can buy SPAN or RSPAN capable switches which will let you mirror traffic from individual ports or VLANs to a single management station port (in which case you can just use a desktop).
This should go without saying, but those packet captures will be useless unless you know WHERE each MAC address is on the network. That said:
1) Maintain reliable L1/L2/L3 mappings
2) Tag both ends of long cables and make sure all wallports are numbered, and
3) beat the shit out of anyone who brings personal equipment in and plugs it in. It screws up your records and is probably less secure.
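A minimal sketch of what such a mapping could look like in practice, so that a MAC address seen in a capture can be traced to a switch port, a wall port and an owner. The record layout, switch names, addresses and the owner are all made up for illustration; a real inventory would more likely live in a database or spreadsheet.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PortRecord:
    switch: str     # L2: switch the host is attached to (hypothetical name)
    port: str       # L2: switch port
    wall_port: str  # L1: numbered wall port / cable tag
    ip: str         # L3: address assigned to the host
    owner: str      # who to talk to about this machine


def normalize_mac(mac: str) -> str:
    """Reduce a MAC address to 12 lowercase hex digits so that
    00:11:22:33:44:55, 00-11-22-33-44-55 and 0011.2233.4455 compare equal."""
    digits = "".join(ch for ch in mac.lower() if ch in "0123456789abcdef")
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits


def locate(inventory: Dict[str, PortRecord], mac: str) -> Optional[PortRecord]:
    """Return the record for a MAC address, or None if it is unknown
    (for example, personal equipment nobody registered)."""
    return inventory.get(normalize_mac(mac))


# Example inventory with a single, invented entry.
INVENTORY = {
    normalize_mac("00:11:22:33:44:55"): PortRecord(
        switch="sw-floor2", port="Gi0/14", wall_port="W2-114",
        ip="10.0.2.37", owner="jdoe",
    ),
}
```

An unknown MAC coming back as None is exactly the case point 3 is about: something is plugged in that the records do not cover.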
Traffic other than ICMP can cause problems on a network. If something is eating all the bandwidth on a LAN, ping is useless, because everything you try to ping will be slow.
Using ping to check whether servers are up is also a bad idea. A mission-critical service could go down while the server still answers pings, and the first you would hear of it would be from one of your users. Using a tool such as Nagios to check all services on a server, including ICMP requests, would be the way to go. To get a general view of all traffic on the network I would use NTOP, and to look at one item on your network I would use tcpdump.
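To illustrate the difference between "the host answers ping" and "the service is reachable", here is a rough sketch of a TCP-level check using only Python's standard library. It is not how Nagios does it internally; the function name and the default timeout are my own choices, and a successful connect only shows that something is listening on the port, not that the application behind it is healthy.

```python
import socket


def tcp_service_up(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    A host can answer ICMP echo while this returns False, which is the
    case a ping-only check misses.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

You would call it per service, for example tcp_service_up("fileserver", 445) for SMB or port 25 for mail (the host name here is a placeholder), and alert when it returns False.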
ntop
Nagios
MRTG
Cacti
If your intent is to detect network troubles, I recommend using a system like Cricket or MRTG to graph the interfaces, as well as the errors on the interfaces, within the network. This may require some finesse to set up the first time.
Aside from that, Sysmon was written primarily to monitor hosts and host-based services, but was later extended to monitor networks as well. It may fit your needs, as you can set up SNMP thresholds for network errors and other things.
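As a sketch of what a threshold on network errors amounts to: take two readings of an interface's error and packet counters (for instance IF-MIB's ifInErrors and ifInUcastPkts), and flag the port when the share of errored packets in that interval is too high. This is not Sysmon's configuration syntax; the function names and the 0.1% default threshold are assumptions for illustration, and counter wrap is ignored here.

```python
def error_ratio(errors_before: int, errors_after: int,
                packets_before: int, packets_after: int) -> float:
    """Fraction of received packets that were errored between two polls.

    Errored packets are not included in the delivered-packet counter,
    so the total is the sum of both deltas.
    """
    d_err = errors_after - errors_before
    d_pkt = packets_after - packets_before
    total = d_err + d_pkt
    if d_err < 0 or d_pkt < 0 or total == 0:
        # Counter reset/wrap or an idle port: nothing meaningful to report.
        return 0.0
    return d_err / total


def exceeds_threshold(ratio: float, threshold: float = 0.001) -> bool:
    """True if the error ratio is above the (assumed) 0.1% threshold."""
    return ratio > threshold
```

A port that sits well above the threshold poll after poll usually points at a bad cable, a failing NIC or a duplex mismatch rather than at load.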
If you want to be super-lazy, I would download the trial of Intermapper. It has auto-discovery and may be able to find these troubles for you if you can SNMP poll the devices. I've not used it in a while, so hopefully it has support for the platforms that you are using.
That's a pretty vague question, and you didn't provide enough information to really answer it right, but here are some recommendations.
Assuming you have managed switches, collecting per-port data with SNMP is a great first step. I think Cricket (http://cricket.sourceforge.net/) is a great system for collecting this data, but I prefer Drraw (http://web.taranis.org/drraw) for graphing the data. For an example of the power available by combining these two tools, see http://stats.net.cmu.edu/
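For a feel of what the graphing tools do with per-port data, here is a small sketch of turning two readings of a 32-bit octet counter (such as IF-MIB's ifInOctets) into a utilization percentage. It does no SNMP itself; the function names are mine, and it assumes the counter wrapped at most once between polls, which is why slow polling of fast links with 32-bit counters gives wrong numbers.

```python
COUNTER32_MODULUS = 2 ** 32


def counter_delta(before: int, after: int,
                  modulus: int = COUNTER32_MODULUS) -> int:
    """Difference between two readings of a wrapping counter.

    Assumes at most one wrap between the readings.
    """
    return (after - before) % modulus


def utilization_percent(octets_before: int, octets_after: int,
                        interval_seconds: float, link_bps: float) -> float:
    """Average link utilization over the polling interval, in percent."""
    if interval_seconds <= 0 or link_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    bits = counter_delta(octets_before, octets_after) * 8
    return 100.0 * bits / (interval_seconds * link_bps)
```

For example, 375,000,000 octets in a 300-second interval on a 100 Mbit/s port works out to 10% average utilization. An uplink pinned near 100% during the slowdowns would be a strong hint.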
Once you've got that, install Net-SNMP's snmpd on your hosts and collect & graph interface stats for your Unix servers as well. If you don't have managed switches this may be good enough on its own. You can also graph load average, memory usage, etc.
For actually analyzing your network traffic I suggest Argus, http://www.qosient.com/argus. It's a network traffic auditing tool; think of it as tcpdump for flows instead of packets, or as NetFlow on crack. You can easily record complete flow statistics for your entire network for later perusal. All you need is a network topology that allows you to sniff most/all of the traffic. A SPAN port on a switch is usually sufficient. If you've already got a Snort server and it has enough processing capacity you can just run Argus on the same host.
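To show the kind of question flow records let you answer after the fact, here is a sketch that sums bytes per source host and lists the biggest senders. The (source, destination, bytes) tuple is a simplified, hypothetical record format, not Argus's native output; you would have to export or convert the real flow data into something like it first.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical simplified flow record: (source IP, destination IP, bytes).
Flow = Tuple[str, str, int]


def top_talkers(flows: Iterable[Flow], n: int = 3) -> List[Tuple[str, int]]:
    """Return the n source hosts that sent the most bytes, largest first."""
    totals: Counter = Counter()
    for src, _dst, nbytes in flows:
        totals[src] += nbytes
    return totals.most_common(n)


# Invented sample data for illustration.
SAMPLE_FLOWS: List[Flow] = [
    ("10.0.0.5", "10.0.0.1", 500),
    ("10.0.0.7", "10.0.0.1", 300),
    ("10.0.0.5", "10.0.0.2", 400),
]
```

Restricting the input to flows recorded during a slowdown, then looking at the top few sources, is usually enough to tell whether one machine is saturating the network.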
Speaking of which, if you don't have a Snort server you probably want one. Nessus as well.
For monitoring/alerting I recommend Mon (http://www.kernel.org/software/mon), but then I'm biased.
And once you've tracked down what machine(s) are causing the problem, do you have records of which machines belong to which users? (Insert plug here for CMU's NetReg system for management of DNS and DHCP, which provides that. (http://www.net.cmu.edu/netreg) I'm biased on this one as well...)
Oh, and my money would be on poorly timed overlapping network backups, saturating a switch uplink. Just a guess...
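If you want to test that guess on paper before touching the network, a quick sketch like the following lists which backup jobs run at the same time. The job names and times are invented; the real windows would come from your backup software's logs or schedule.

```python
from datetime import datetime
from itertools import combinations
from typing import Dict, List, Tuple

Window = Tuple[datetime, datetime]  # (start, end) of one backup run


def overlapping_jobs(jobs: Dict[str, Window]) -> List[Tuple[str, str]]:
    """Return every pair of job names whose time windows overlap."""
    pairs = []
    for (a, (a_start, a_end)), (b, (b_start, b_end)) in combinations(
            sorted(jobs.items()), 2):
        # Two intervals overlap when each starts before the other ends.
        if a_start < b_end and b_start < a_end:
            pairs.append((a, b))
    return pairs


# Invented schedule for illustration.
SCHEDULE = {
    "db": (datetime(2024, 1, 1, 1, 0), datetime(2024, 1, 1, 3, 0)),
    "files": (datetime(2024, 1, 1, 2, 30), datetime(2024, 1, 1, 4, 0)),
    "mail": (datetime(2024, 1, 1, 5, 0), datetime(2024, 1, 1, 6, 0)),
}
```

If the overlapping pairs share a switch uplink and the overlap lines up with the reported slowdowns, staggering the jobs is a cheap thing to try.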
Are all servers affected? Have you bothered measuring the load on your servers? The problem might not have anything to do with the network.