Slashdot Mirror


Network Monitoring Options?

Nom du Keyboard asks: "We have a LAN network of 7 servers and about 400 PCs. Every so often I'll notice immense slowdowns, from minutes to occasional delays of a couple hours, while getting data from various servers, and it happens from more than just my PC. So far we haven't had any way of determining if a server has suddenly gotten tied up, or if there is some failure in the communications backbone. Without a lot of money to spend on this (I think it's more important than others right now), what cheap or free monitoring options are there available that can map and isolate problems in a network of this size?"

10 of 42 comments

  1. some options by Yonder+Way · · Score: 3, Informative

    Some of the ones I have more recent experience with. All of these require some reading and planning before you set them up.

    OpenNMS - Probably the most trouble-free NMS I've found so far. No, not "trouble-free". But the closest to it.

    Nagios - The most flexible, but also the biggest royal pain in the ass to set up & maintain. Almost infinitely scalable, though, if you are willing to take the time to write some perl scripts to automate most administrative tasks and divide the monitoring work up (several "slave" hosts can harvest monitoring data for a subset of your network and push it to your central Nagios server which greatly lessens the load on your main monitoring server). Some really great monitoring possibilities are out there if you look into NRPE with Nagios.

    OpManager - We bought this commercial solution at my last job. Great for monitoring Windows servers. A real pain in the ass to monitor anything else with any level of sophistication. It also has some fatal bugs that cause it to quietly orphan nodes if it misses a scheduled poll!

    1. Re:some options by Blkdeath · · Score: 2, Informative
      Some of the ones I have more recent experience with. All of these require some reading and planning before you set them up.

      Before you get into network monitoring software, start at layer 1. Look at the physical topology of the network. Do you have network/switch maps? If not, get some. If there are none, make some. How is your network configured? Is it a high speed backbone (1G? 10G?) with low or high speed desktop connections (10Mbit? 100Mbit?) Is Wi-Fi in play? Are you using VLANs? Are you connected to a WAN? Is spanning-tree involved, or is it a flat hierarchical topology? Could you have a redundant path between switches set up somewhere? Are your switches managed? For that matter, are you using hubs? Is there a wiring fault/short somewhere? When this slowdown occurs, do any ports show unusually high activity, or is it simply a server overload situation? You have 7 servers; could you balance the load between them? Where are the servers located within your topology? Are they connected to the backbone, or are they dispersed locally (e.g. close to a select group of users)? What protocols are in use? Could you pare them down (e.g. do you run IPX? Do you need it? What about NETBEUI/NETBIOS?) Do you have any viral/trojan activity? Is your Internet link saturated, or only your internal connections?

      On a network that size (relatively small) I'd presume you're running a flat topology. If this is the case, try to balance your high(er) traffic computers across the switches. For example, if you have 24 port, 100BaseTX switches fed by a 1000Mbit backbone connection, try to ensure that you don't have an entire high-demand section of the building connected exclusively to one switch. Divide out the wiring so that half the switch is high traffic while the other half is low traffic (or add a second uplink).
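
      A rough sketch of the arithmetic behind that advice, using the 24 port / 100Mbit / 1000Mbit figures from the example above. The function name is made up for illustration, and the result is a worst case that assumes every port transmits at line rate at once, which real desktops rarely do.

```python
def oversubscription_ratio(ports: int, port_mbit: float,
                           uplink_mbit: float, uplinks: int = 1) -> float:
    """Worst-case edge demand divided by total uplink capacity."""
    return (ports * port_mbit) / (uplink_mbit * uplinks)


# Figures taken from the example in the comment above.
one_uplink = oversubscription_ratio(24, 100, 1000)
two_uplinks = oversubscription_ratio(24, 100, 1000, uplinks=2)

print(f"one uplink:  {one_uplink:.1f}:1")
print(f"two uplinks: {two_uplinks:.1f}:1")
```

      A ratio above 1:1 is normal on access switches. It only hurts when the busy machines all sit behind the same uplink, which is why spreading them out or adding a second uplink helps.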

      If your servers serve particular segments of the network, connect them to a high(er) speed port on a switch close to the main traffic source to prevent excess traffic needlessly traversing your backbone. If your servers serve the entire network equally, consider localizing the services, or else connect them to your backbone with either a high speed NIC or dual NICs.

      If it's a server-load problem, divide off one/some of the services to some of the servers with a lighter load, else upgrade the hardware accordingly. (Low on RAM? Borrow a stick from a lower demand server)

      If you're using spanning-tree, or even if you're not, check the configuration. IBM switches are famous for delaying transactions on their fibre switches if ST is enabled but not utilized. If there are no situations where more than one switch-path exists, disable ST on all switches.

      To check your network activity and isolate protocol chatter, I could suggest any number of traffic sniffers. It may be something as simple as your DHCP leases are too short and your 400 machines are looking for the server(s) and requesting new addresses on a regular basis. Lengthen the leases to a week, or if your network is fairly static make it 2. Are you running a mix of Win'98/ME and 2k/XP machines? If you're running a hybrid NETBEUI / NetBIOS over TCP/IP you may have a lot of broadcast traffic looking for daddy. Look to reconfigure your client/server and peer relationships. On machines that don't require file and/or print sharing, disable the protocols entirely (else they'll broadcast constantly looking for everybody else).
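
      To put numbers on the DHCP lease point, here is a minimal sketch. It assumes each client renews once at half the lease time (the default T1 timer in the DHCP spec) and ignores reboots and failed renewals. The 400 clients come from the original question; the lease lengths are only examples.

```python
def renewals_per_hour(clients: int, lease_hours: float,
                      renew_fraction: float = 0.5) -> float:
    """Average renewal exchanges per hour across all clients, steady state."""
    renew_interval_hours = lease_hours * renew_fraction
    return clients / renew_interval_hours


for label, lease_hours in [("1 hour", 1), ("1 day", 24),
                           ("1 week", 168), ("2 weeks", 336)]:
    rate = renewals_per_hour(400, lease_hours)
    print(f"{label:>8} lease: {rate:8.2f} renewals/hour")
```

      This only shows how the renewal rate scales with lease length. Whether DHCP chatter is actually your problem is something a sniffer has to confirm.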

      You're really going to have to isolate where exactly the problem is occurring before a solid recommendation can be made - even to the monitoring aspect of things. Remember that most network admin work is done by good old fashioned legwork. Solving computer/network problems is all about the process of elimination. If your network structure is solid it'll make your job a helluva lot easier. Meanwhile you need to isolate if it's hardware, software, malicious, or network before we can begin to solve your problem for you.

      --
      BD Phone Home!

      Shameless plug. Like you weren't expecting it.

    2. Re:some options by mjhuot · · Score: 2, Informative

      Let me start by saying I work on the OpenNMS project. You could use OpenNMS very easily to accomplish your goals. OpenNMS does many things, but the features that would be most useful to you for this problem are service polling, service response time graphs, SNMP performance graphs, and thresholding. Here is a quick rundown on each of these -

      Service Polling -
      OpenNMS can be configured to poll services on your servers. It will do checks for many protocols such as HTTP, SMTP, FTP, HTTPS, DNS, NTP, RADIUS, and others. Some of the pollers are more advanced than others, but they all at the very least do a TCP SYN, SYN ACK, ACK type thing.

      Service Response Time Graphs -
      OpenNMS will collect the response times from the polls it completes. The data is stored so that it can be displayed from the web interface for any device/service/date selection. OpenNMS uses RRDTool or jRobin to store the data. RRDTool is MRTG's big brother, and jRobin is their cousin written in java. These are the basis for most of the statistical data storage in OpenNMS.
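
      For readers who want to see what the most basic kind of poll amounts to, here is a minimal sketch of a TCP connect check that also records the response time. This is not OpenNMS code; the function names are hypothetical, and a real poller would go on to speak the protocol rather than stop at the handshake.

```python
import socket
import time
from typing import Dict, Iterable, Optional, Tuple


def tcp_response_time(host: str, port: int,
                      timeout: float = 3.0) -> Optional[float]:
    """Seconds taken to complete a TCP connect, or None if it failed."""
    start = time.monotonic()
    try:
        # create_connection performs the SYN, SYN ACK, ACK handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return None


def poll_services(
    targets: Iterable[Tuple[str, int]],
) -> Dict[Tuple[str, int], Optional[float]]:
    """Poll every (host, port) pair once and collect the response times."""
    return {(host, port): tcp_response_time(host, port)
            for host, port in targets}
```

      Calling `poll_services([("fileserver", 445), ("mailserver", 25)])` on a schedule (host names made up here) and storing the results would give you the raw data for a response time graph.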

      SNMP Performance Graphs -
      SNMP is a wonderful system for presenting performance data about network attached devices. Unfortunately it has gotten a bad reputation over the years. SNMP can seem overly complex, but OpenNMS makes it easier. If devices have SNMP turned on and are configured with default values (not recommended from a security perspective), OpenNMS will be able to discover the SNMP data and will begin collecting it. OpenNMS will store the data it collects in RRDTool/jRobin format for display in the web interface. What can I get from SNMP, you might ask. It all depends on the device being monitored. Devices have SNMP MIBs that contain information on what data they will provide via SNMP. OpenNMS has many of the most common MIBs already set up. The basic things you should be able to see for the server side will be network utilization, load, CPU, memory and some disk information. Being a network geek, I would also make sure your network infrastructure is set up for collection as well. In many cases you can watch how the traffic is flowing through the network to determine the source of the problem.
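
      As an illustration of what a collector does with interface data, the sketch below turns two samples of an octet counter (such as ifInOctets) into a utilization percentage. It assumes a 32-bit counter that wraps at most once between samples; the function names and sample numbers are invented for the example and this is not OpenNMS code.

```python
COUNTER32_MODULUS = 2 ** 32


def counter_delta(previous: int, current: int,
                  modulus: int = COUNTER32_MODULUS) -> int:
    """Difference between two counter samples, allowing for one wrap."""
    return (current - previous) % modulus


def utilization_percent(prev_octets: int, curr_octets: int,
                        interval_seconds: float,
                        link_bits_per_second: float) -> float:
    """Average link utilization over the polling interval, in percent."""
    bits = counter_delta(prev_octets, curr_octets) * 8
    return 100.0 * bits / (interval_seconds * link_bits_per_second)


# Invented samples: a 100 Mbit port polled 300 seconds apart.
print(utilization_percent(0, 1_875_000_000, 300, 100_000_000))
# The counter wrapped between these two samples.
print(counter_delta(4_294_967_000, 704))
```

      On fast links a 32-bit counter can wrap more than once between polls, so shorter intervals or the 64-bit counters are the safer choice where the device offers them.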

      Thresholding -
      OpenNMS thresholding could be used to look at the data collected and send you an alert when a threshold is crossed.
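
      The general idea can be sketched as a high threshold with a separate rearm level, so a value hovering near the limit does not alert on every sample. This is a generic illustration with made-up names and numbers, not OpenNMS's implementation or configuration format.

```python
from typing import Iterable, List, Tuple


def high_threshold_events(samples: Iterable[float], trigger_value: float,
                          rearm_value: float) -> List[Tuple[int, str]]:
    """Return (sample index, event) pairs for a high threshold with rearm."""
    armed = True
    events = []
    for index, value in enumerate(samples):
        if armed and value >= trigger_value:
            events.append((index, "exceeded"))
            armed = False  # stay quiet until the value drops to the rearm level
        elif not armed and value <= rearm_value:
            events.append((index, "rearmed"))
            armed = True
    return events


# Invented CPU percentages: alert at 80, rearm at 70.
cpu = [40, 85, 92, 88, 60, 95]
print(high_threshold_events(cpu, trigger_value=80, rearm_value=70))
```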

      In addition, OpenNMS now supports NRPE. To be honest I am not sure to what extent, but if that type of functionality is needed it is there. If there is something needed for NRPE it can always be added.
      There are many more features to OpenNMS; check out the web site for more information. For me the big things are performance, stability and scalability. Given enough hardware I don't think there are too many networks that OpenNMS could not handle. Once you understand it you will see how endless the possibilities are.

  2. Just network? by HavokDevNull · · Score: 4, Informative

    Then NTOP (http://www.ntop.org/) is your best bet. It breaks down all traffic on your network and should allow you to see who's being naughty and who's being nice.

    --
    Sig
  3. Cheap = ethereal and a hub by jgaynor · · Score: 4, Informative

    what cheap or free monitoring options are there available . . .

    If the network is the issue, the cheapest and simplest is a good laptop running Ethereal or Snort. Also pick up (or scrounge up) a dumb hub and if possible a fiber tap, since you're probably running in a mixed-media switched infrastructure (or maybe you're not - hence the problems :) ). If you want to get fancy you can buy SPAN or RSPAN capable switches which will let you mirror traffic from individual ports or VLANs to a single management station port (in which case you can just use a desktop).

    This should go without saying, but those packet captures will be useless unless you know WHERE each MAC address is on the network. That said:

    1) maintain reliable L1/L2/L3 mappings
    2) Tag both ends of long cables and make sure all wallports are numbered, and
    3) beat the shit out of anyone who brings personal equipment in and plugs it in. It screws up your records and is probably less secure.
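
    Why the records matter can be sketched in a few lines: total up bytes per source MAC from a capture, then look each MAC up in your inventory. The record layout, MAC addresses and location strings below are all made up for illustration; a real capture would have to be exported from your sniffer first.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def top_talkers(frames: Iterable[Tuple[str, int]], locations: Dict[str, str],
                limit: int = 3) -> List[Tuple[str, int, str]]:
    """frames: (source MAC, frame bytes). locations: lowercase MAC -> where it is."""
    totals = Counter()
    for mac, size in frames:
        totals[mac.lower()] += size
    return [(mac, total, locations.get(mac, "UNKNOWN - not in records"))
            for mac, total in totals.most_common(limit)]


# Hypothetical inventory and capture summary.
locations = {"00:11:22:33:44:55": "switch 2, port 14, wallport 2-117"}
frames = [("00:11:22:33:44:55", 1500),
          ("AA:BB:CC:DD:EE:FF", 9000),
          ("00:11:22:33:44:55", 500)]

for mac, total, where in top_talkers(frames, locations):
    print(f"{mac}  {total:>6} bytes  {where}")
```

    The busiest MAC in this made-up data is the one with no record, which is exactly the situation the three rules above are meant to prevent.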

  4. Re:Pinging by HavokDevNull · · Score: 2, Informative

    There is more traffic on a network than ICMP that could cause problems. If something is taking all the bandwidth on a LAN then ping would be useless, because everything you try to ping would be slow.

    Using ping to check whether servers are up or down is a bad idea as well. A mission critical service could go down while the server still returns a ping, and you would find out from one of your users instead. Using a tool such as Nagios to check all services on a server, including ICMP requests, would be the way to go. To get a general view of all traffic on the network I would use NTOP, and then tcpdump to look at one item on your network.

    --
    Sig
  5. Try these tools by Matt+Perry · · Score: 2, Informative
    --
    Slashdot: Failed Car Analogies. Amateur Lawyering. Anecdote Battles.
  6. network or hosts? by jaredmauch · · Score: 2, Informative
    You didn't make it perfectly clear which you were attempting to isolate, the host related issues or the network related ones. There are a lot of monitoring systems out there from NAGIOS to Sysmon (author disclosure) as well as the previously mentioned OpenNMS.

    If your intent is to detect network troubles, I recommend using some system like Cricket or MRTG to graph the interfaces as well as the Errors on the interfaces within the network. This may require some finesse in setting up for the first time.

    Aside from that, Sysmon was written primarily to monitor hosts and host based services, but was later extended to monitor networks as well. It may fit your needs, as you can set up SNMP thresholds on network errors and other things.

    If you want to be super-lazy, I would download the trial of Intermapper. It has auto-discovery and may be able to find these troubles for you if you can SNMP poll the devices. I've not used it in a while, so hopefully it has support for the platforms that you are using.

  7. Many tools, many types of monitoring by vitroth · · Score: 2, Informative

    That's a pretty vague question, and you didn't provide enough information to really answer it right, but here are some recommendations.

    Assuming you have managed switches, collecting per-port data with SNMP is a great first start. I think Cricket (http://cricket.sourceforge.net/) is a great system for collecting this data, but I prefer Drraw (http://web.taranis.org/drraw) for graphing the data. For an example of the power available by combining these two tools, see http://stats.net.cmu.edu/

    Once you've got that, install Net-SNMP's snmpd on your host and collect & graph interface stats for your unix servers as well. If you don't have managed switches this may be good enough on its own. You can also graph load average, memory usage, etc.

    For actually analyzing your network traffic I suggest Argus, http://www.qosient.com/argus. It's a network traffic auditing tool, think of it as tcpdump for flows instead of packets, or as netflow on crack. You can easily record complete flow statistics for your entire network for later perusal. All you need is a network topology that allows you to sniff most/all of the traffic. A span port on a switch is usually sufficient. If you've already got a snort server and it has enough processing capacity you can just run argus on the same host.
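
    To show what "flows instead of packets" means, here is a toy sketch that collapses individual packets into per-flow packet and byte counts keyed on the usual 5-tuple. It is not Argus code or its record format, the addresses are invented, and a real flow tool tracks a great deal more than this.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Packet(NamedTuple):
    src: str
    sport: int
    dst: str
    dport: int
    proto: str
    size: int


FlowKey = Tuple[str, int, str, int, str]


def summarize_flows(packets: Iterable[Packet]) -> Dict[FlowKey, Dict[str, int]]:
    """Aggregate packets into one-directional flows keyed by 5-tuple."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for p in packets:
        key = (p.src, p.sport, p.dst, p.dport, p.proto)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p.size
    return dict(flows)


# Invented capture: two packets of one transfer plus a DNS query.
capture = [
    Packet("10.0.0.5", 40000, "10.0.0.1", 445, "tcp", 1500),
    Packet("10.0.0.5", 40000, "10.0.0.1", 445, "tcp", 1500),
    Packet("10.0.0.9", 53000, "10.0.0.2", 53, "udp", 80),
]
for key, stats in summarize_flows(capture).items():
    print(key, stats)
```

    Storing one record per flow rather than one per packet is what makes it practical to keep statistics for the whole network for later perusal.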

    Speaking of which, if you don't have a snort server you probably want one. Nessus as well.

    For monitoring/alerting I recommend Mon (http://www.kernel.org/software/mon), but then I'm biased.

    And once you've tracked down what machine(s) are causing the problem, do you have records of which machines belong to which users? (Insert plug here for CMU's NetReg system for management of DNS and DHCP, which provides that. (http://www.net.cmu.edu/netreg) I'm biased on this one as well...)

    Oh, and my money would be on poorly timed overlapping network backups, saturating a switch uplink. Just a guess...
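
    If you want to test that guess before installing anything, a quick check of the backup schedule for overlapping windows is enough. The sketch below assumes you can write each job down as start and end minutes after midnight on the same night; the job names and times are invented.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def overlapping_jobs(jobs: Dict[str, Tuple[int, int]]) -> List[Tuple[str, str, int]]:
    """jobs: name -> (start minute, end minute). Returns (job, job, overlap minutes)."""
    overlaps = []
    for (a, (a_start, a_end)), (b, (b_start, b_end)) in combinations(sorted(jobs.items()), 2):
        start, end = max(a_start, b_start), min(a_end, b_end)
        if start < end:
            overlaps.append((a, b, end - start))
    return overlaps


# Hypothetical schedule: 01:00-04:00, 03:00-05:00 and 05:00-06:00.
schedule = {
    "fileserver": (60, 240),
    "mailserver": (180, 300),
    "database": (300, 360),
}
print(overlapping_jobs(schedule))
```

    Any pair that overlaps and shares a switch uplink is worth comparing against the times the slowdowns are reported.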

  8. holy vague questions, batman! by Clover_Kicker · · Score: 2, Informative

    Are all servers affected? Have you bothered measuring the load on your servers? The problem might not have anything to do with the network.