Slashdot Mirror


Best Linux Hardware Diagnostics?

An anonymous reader asks: "I've been running Linux for a little while and usually hardware problems have shown up quite easily - kernel panic, no module, no networking, etc. - but recently I've encountered some problems with network disk access causing very high load, which I think might be hardware related. Under Windows I'd fire up SANDRA or the like and run a full system scan. I did a quick search and nothing really stood out. I was wondering if any Linux gurus out there would like to share their expertise on Linux diagnostics?"

42 comments

  1. If you need to ask.... by PDXNerd · · Score: 1

    Well, /var/log/messages is a good place to start. Check your nfs/smb logs, and if all else fails, use a kernel debugger. I think this is what you are asking, if not please clarify.

    1. Re:If you need to ask.... by pilot1 · · Score: 1

      perhaps you could recommend a kernel debugger for him

    2. Re:If you need to ask.... by PDXNerd · · Score: 1

      Once again, if you need to ask...... Well..

      OK, gdb and kdb are the only debuggers I know of - not to say there aren't more but these are the old standards. Take a gander here for more information. There is too much to go over in a post.

    3. Re:If you need to ask.... by MerlynEmrys67 · · Score: 1
      To mod this as funny, or insightful

      Why dear god won't they put a real debugger into Linux - it is the only thing missing to make it a world class operating system

      --
      I have mod points and I am not afraid to use them
    4. Re:If you need to ask.... by karnal · · Score: 3, Funny

      Actually, I would think "World Class" would mean the terminal would scream at you if you had an issue:

      "HEY, YOUR HARD DRIVE IS FUCKED!"

      That would truly rock. Instead, today you can only sift through "/dev/hda: device not ready" errors in the logs... :)

      --
      Karnal
    5. Re:If you need to ask.... by Nasarius · · Score: 1

      I know. It would make diagnosing a strange network problem of mine (which only appears on Linux, not Windows or *BSD) much, much easier. I remember reading that Linus doesn't want it built-in for some reason or another. *sigh*

      --
      LOAD "SIG",8,1
    6. Re:If you need to ask.... by JWSmythe · · Score: 2, Interesting

      Aw, come on, the /dev/hd? errors are a pretty good clue the drive is screwed. :)

          We've been doing data center moves this month, and checking/refurbishing every machine as it gets shipped over. I take any /dev/hd? error as "the machine got dropped, and the drive is dead". If it doesn't turn on or it kernel panics for no apparent reason, I consider it a dead motherboard. For most of our machines, that's a fairly good guess, since everything's integrated.

          We have no expectation that any machine will arrive in one piece. The data is already transferred off to a new machine before we put it in a box. Ooohh, the wonders of 0 downtime. :)

          We've been changing every hard drive anyways, because they all have a few years on them. We've only had one machine not turn on, and it's too old to mess with. The only real odd-ball case was one machine on which when we ran ifconfig, it would kernel panic. We're using identical images on every machine to start, and only 1 of 60 has done that, so I'm fairly sure it's a bad motherboard. Too bad, it was a nice dual 1.4Ghz machine. Oh wait, that's old now, isn't it? :)

          We had two drives in an array fail, but that's the TSA's fault. That array was carried on an airplane, to be "sure" it was safe. It had more damage than any other piece of hardware, even the ones that UPS drop-kicked. I'm amazed by some of the damage I've seen. I haven't figured out how they bent some of the metal, since the boxes looked fine, and they were well packed.

          Needless to say, I was knee deep in packing peanuts again tonight.

      --
      Serious? Seriousness is well above my pay grade.
    7. Re:If you need to ask.... by itwerx · · Score: 1

      So, uh, whatcha doin' with the old hardware...? Need a place to get rid of it? :)

    8. Re:If you need to ask.... by JWSmythe · · Score: 1

      I'm gathering a nice pile of 500Mhz machines with 256Mb ram. I'm thinking of making a beowulf cluster in my living room. :) We're keeping anything newer than 1Ghz online. That's my cutoff for this upgrade cycle.

          Actually, I have plenty of things to try out on 'em. Sometimes it's nice to have a few dozen old boxen laying around. The power company loves it, and my girlfriend ... well ... she has words for my own personal server farm in the living room. I can't repost most of them here. :) She should just wait til I get some racks installed..

          I've been seriously considering a swarm of MythTV boxes.. :) I'd show you my big tv, but it seems to be on a server that happens to be unplugged. For some reason my priorities are screwed up, and personal machines aren't important during server moves. Anyways, it's 8' tall. It's a DLP projector, showing against a plain white wall in a darkened room. It's really fun, but now I really see the need for Hi-Def everything.

      --
      Serious? Seriousness is well above my pay grade.
    9. Re:If you need to ask.... by lanswitch · · Score: 1

      I can't understand why you would want to use a kernel debugger to troubleshoot network problems. Wouldn't a sniffer be a better idea? That way you can check out what's happening on the network level, where the problem seems to be.

    10. Re:If you need to ask.... by Nasarius · · Score: 1

      No. The problem is with the kernel (ie, the networking stack). The exact same hardware has no problems with other OSes, as I said.

      --
      LOAD "SIG",8,1
    11. Re:If you need to ask.... by legirons · · Score: 1

      "Actually, I would think "World Class" would mean the terminal would scream at you if you had an issue: "HEY, YOUR HARD DRIVE IS FUCKED!"

      I actually managed to derive that information from watching a friend's Windows PC boot... was right as well, but people always assume it's a software problem.

    12. Re:If you need to ask.... by legirons · · Score: 1

      "That array was carried on an airplane, to be "sure" it was safe. It had more damage than any other piece of hardware, even the ones that UPS drop-kicked. I'm amazed by some of the damage I've seen. I haven't figured out how they bent some of the metal, since the boxes looked fine, and they were well packed."

      Are you sure you don't want to work for UPS? I'm sure some companies pay for better reviews than "was drop-kicked"

      (yes, I'd agree with you for every parcel carrier I've used!)

    13. Re:If you need to ask.... by itwerx · · Score: 1

      Kinda figured that might be the answer... [grin]

    14. Re:If you need to ask.... by lanswitch · · Score: 1

      You could be right. Another possibility: the NIC drivers. Excuse me for asking, but did you check things like network speed and duplex settings? There might be a difference after operating system installs.
      My point was that if you use a sniffer you can see the communication on the network itself and where and why it fails. That's more informative than staring at kernels.

    15. Re:If you need to ask.... by JWSmythe · · Score: 1

      They're pretty lucky I'm not posting the pictures of the damage we've gotten. By far, the internal damage done by the TSA has been the worst, but some of it's pretty good.

          The Promise VTrak 15100 arrays are very large, heavy, and sturdy boxes. They have very thick plastic handles on the front to ease putting it in the rack (theoretically). In all reality, you can lift them from the floor by the handles, but if you can manage to hold over 100 pounds horizontally just by the handles, you could arm-wrestle Arnold Schwarzenegger and win. :)

          All four handles (two per unit) were crushed, smashed, or otherwise rendered nonexistent in shipping.

          One of the brand-new 1U machines, in the manufacturer's box, had what appeared to be the hole created by a fork lift, in the side of it. You could see straight into the box. Luckily, whatever pierced the box missed the actual machine by a fraction of an inch.. :) That one, I'm not sure UPS was the shipping company, so I won't blame them on that one.

          I have worked in a warehouse before, and know how stuff gets handled. I know, most stuff makes it to its destination in one piece, but I've also seen 28" TVs fall 30+ feet to their demise, when the forklift operator "thought" it was in place, only to find he missed by a little bit. I rescued a few of them, where they hadn't quite fallen, and the equipment I was driving was the only one that could get a person close enough to grab it. :)

          I've also seen what happens when a forklift goes through the side of a pallet full of laundry detergent. That has to be the messiest stuff in the world. :)

          I was only responsible for one piece of broken merchandise. I kinda bumped a pallet, by hand, which had another pallet behind it, which was no longer stacked properly. A reciprocating saw fell about 25' to the warehouse floor. Ya, when you get a broken item from the store, something like that may have happened to it. No visible damage, it goes to the store.

          At least, working in a warehouse, the boxes are clearly marked to the contents. Shipping companies are generally screwed as far as that goes. They know it's cardboard, they know it weighs x pounds, but they have no clue what the contents are.

      --
      Serious? Seriousness is well above my pay grade.
  2. Windows programs != Linux Programs by e133tc1pher · · Score: 1

    There isn't really a "suite" that I know of like SANDRA; however, using your system logger (sysklogd, metalog, etc.) you can get a really good view of what is going on with your system. You may want to enable some debugging settings in kconfig and recompile the kernel so you get more info in the log. If you have any programming experience you can try profiling the kernel, and any problems you find you can attempt to correct or post to the LKML. But really it sounds to me like you're in need of a distro change. Try Debian or Gentoo.

  3. Damn, close. by FLAGGR · · Score: 1

    I thought the article was titled "Best Linux Distribution" when my eyes passed by it. Wouldn't that have been a fun discussion?

    1. Re:Damn, close. by SeeTheLight · · Score: 1

      Those slashdot-enhanced sunglasses must be working well.

    2. Re:Damn, close. by bill_mcgonigle · · Score: 1

      Imagine if all million Slashdot subscribers posted a comment every time they misread a story title.

      Oh, wait, they do...

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
  4. A few programs... by Anonymous Coward · · Score: 0

    lshw (-X gives a basic gui), lspci -vvv and lsusb give a lot of detailed information about your setup.

    Also cat /proc/cpuinfo gives some good CPU info. Try cat on a few of the other files under /proc too.
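
A small sketch of boiling the /proc/cpuinfo output down to one line. The name `cpu_summary` is made up for illustration, and it reads the text on stdin so it can be tried on a saved copy. The `processor` and `model name` labels are what x86 kernels print; other architectures may label things differently, so treat this as an assumption.

```shell
# cpu_summary (hypothetical helper): read /proc/cpuinfo-style text on stdin
# and print the number of logical CPUs and the first model name seen.
cpu_summary() {
    awk -F': *' '
        /^processor[ \t]*:/ { n++ }
        /^model name[ \t]*:/ && model == "" { model = $2 }
        END { printf "%d x %s\n", n, (model == "" ? "unknown" : model) }
    '
}

# Typical use on a live system:
#   cpu_summary < /proc/cpuinfo
```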

  5. A couple of suggestions... by jschmerge · · Score: 2, Informative

    /var/log/messages or dmesg

    Both should display kernel messages from boot-up. Kernel boot messages usually contain the information you need to track down IRQ conflicts.

    MemTest86

    Not really a Linux program, but something I usually stick as a boot option in grub. Does a great job at detecting bad RAM. MemTest86 can also be booted from a floppy.

    BadBlocks

    This utility can be used to find bad blocks on a disk partition. I've used it before to check disks.

    You might also want to check out a system monitoring utility like GKrellM, since that gives a generally complete picture of IRQ/interrupt usage, bandwidth utilization, and memory and CPU utilization.

    I don't think I've ever had a hardware problem that couldn't be diagnosed using the aforementioned utilities.
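
For the /var/log/messages and dmesg step, a rough filter along these lines can cut the noise. The name `hw_errors` and the pattern list are assumptions picked for illustration (a starting point, not a complete set of kernel messages), and the function reads from stdin so it works on either source.

```shell
# hw_errors (hypothetical helper): read kernel log text on stdin and print
# only the lines that commonly point at failing hardware.
# The pattern list is an assumption, not an exhaustive catalogue.
hw_errors() {
    grep -iE 'i/o error|not ready|irq.*(conflict|nobody cared)|machine check|media error'
}

# Typical use on a live system:
#   dmesg | hw_errors
#   hw_errors < /var/log/messages
```

If nothing matches, grep exits non-zero and prints nothing, which is the good outcome here.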

    1. Re:A couple of suggestions... by superpulpsicle · · Score: 1

      LILO used to be the best check against bad kernel entries. It seems like GRUB is now the standard, though you used to just run "LILO". If your entries are bad, you'll know asap from the slew of errors in your telnet window.

    2. Re:A couple of suggestions... by versus · · Score: 1
      BadBlocks - This utility can be used to find bad blocks on a disk partition. I've used it before to check disks.
      Use smartmontools to get S.M.A.R.T disk info (smartctl -a /dev/hdX). Nowadays hard disks substitute unreadable sectors with spare ones - transparently to I/O subsystem.
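
Since the drive remaps bad sectors behind the I/O subsystem's back, the remap count is the number to watch. A minimal sketch, assuming the attribute table layout that `smartctl -A` prints for typical ATA drives (attribute name in column 2, raw value in column 10); the helper name is made up, and vendors do vary in which attributes they report.

```shell
# reallocated_sectors (hypothetical helper): read `smartctl -A` output on
# stdin and print the raw value of the Reallocated_Sector_Ct attribute.
# Assumes the usual 10-column ATA attribute table.
reallocated_sectors() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Typical use (needs root; /dev/hdX is the same placeholder as above):
#   smartctl -A /dev/hdX | reallocated_sectors
```

A value that keeps climbing between checks is a reasonable cue to replace the disk.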
      --
      Brain is my second favorite organ.
    3. Re:A couple of suggestions... by JWSmythe · · Score: 1

      I had to switch a RedHat machine from Grub to Lilo. That was a lot of fun. The version of lilo that they included didn't actually work right. For some reason, it had no concept of SCSI drives. I don't ask why... I replaced it with one I had compiled from scratch and packaged when I was trying to build my own distro. Sometimes that practice comes in very handy. :)

          Really, I like lilo much better, just as long as you don't do something silly like "lilo ; shutdown -r now". Always make sure it kicked back a friendly message. :)

      --
      Serious? Seriousness is well above my pay grade.
    4. Re:A couple of suggestions... by threephaseboy · · Score: 2, Informative
      Really, I like lilo much better, just as long as you don't do something silly like "lilo ; shutdown -r now". Always make sure it kicked back a friendly message. :)

      Duh. That's why you do "lilo && shutdown -n now".
      --
      .
    5. Re:A couple of suggestions... by Anonymous Coward · · Score: 0

      Wow this got modded insightful. Hey mods, can we get a (Score:5, Funny) here to keep someone from actually doing this?
      #1 shutdown -n (actually a typo on my part) kills the system real fast like:
      -n: do not go through "init" but go down real fast.
      #2 lilo could possibly return true but still have some kind of problem, in which case && is no better than ;
      -tpb (OT thus anon)
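
The difference between `;` and `&&` in this sub-thread can be seen without touching a bootloader. In this sketch `failing_install` is a made-up stand-in for a lilo run that fails, and `echo` stands in for the reboot; none of it is a real lilo invocation.

```shell
# Stand-in for a bootloader install that fails; `false` always exits non-zero.
failing_install() { false; }

# With ';' the second command runs regardless of the first one's exit status.
with_semicolon() { failing_install ; echo "rebooting"; }

# With '&&' the second command runs only if the first exited with status 0.
with_and() { failing_install && echo "rebooting"; }

# Try it:
#   with_semicolon   # prints "rebooting" even though the install failed
#   with_and         # prints nothing
```

As point #2 above says, `&&` only protects you when the first command actually reports its failure through the exit status, so reading the output still matters.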

  6. Heard of knoppix? by mnmn · · Score: 4, Informative

    We keep Knoppix CDs just for this purpose: hardware diagnostics. dmesg and /var/log/messages provide information that is otherwise hard to obtain from Windows 2000 or XP, especially if you can't boot Windows.

    Another crucial thing is lspci, which is absent from Windows. Say you do a fresh install of Windows, which does not detect the network card. How do you know what card it is to obtain the drivers for? In Windows you just can't get the PCI information so easily. Enter Knoppix.

    I have also used memtest in Knoppix and found memory issues before, where Windows simply acted up. The problem with Windows is you have to boot the entire OS and take ~130MB of RAM and resolve all IRQs before you can run Sandra or the like. Memory issues, disk issues or IRQ issues will prevent you from even booting.

    Knoppix when booted in single-user mode takes little memory, and you can boot it not to use ACPI, not to use HLT instruction, not to detect SCSI that might freeze the system etc. Then you can diagnose the system. Just get a CD and read the man pages of various tools on the CD.

    --
    "Give orange me give eat orange me eat orange give me eat orange give me you." -Nim Chimpsky
  7. And SANDRA would have shown nothing by Gothmolly · · Score: 2, Interesting

    If network access is causing "load" and not "cpu usage", you need to look at kernel stuff - drivers, TCP/UDP windows, ethernet statistics, etc.
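
For the ethernet statistics part, the kernel's per-interface counters live in /proc/net/dev. A sketch that lists only interfaces with non-zero error or drop counters; `net_errors` is a made-up name, and the column positions assume the usual /proc/net/dev layout (two header lines, then eight receive and eight transmit counters per interface).

```shell
# net_errors (hypothetical helper): read /proc/net/dev-style text on stdin and
# print "iface rx_errs rx_drop tx_errs tx_drop" for every interface where at
# least one of those counters is non-zero.
net_errors() {
    awk 'NR > 2 {
        sub(/:/, " ")   # "eth0:123" and "eth0: 123" both split cleanly
        if ($4 + $5 + $12 + $13 > 0)
            print $1, $4, $5, $12, $13
    }'
}

# Typical use on a live system:
#   net_errors < /proc/net/dev
```

Counters that grow while the slow disk access is happening would point at the NIC, cable or driver rather than the disks.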

    --
    I want to delete my account but Slashdot doesn't allow it.
  8. good tools by krakrjak · · Score: 4, Informative

    lspci
    cat /proc/cpuinfo
    lsusb
    cat /proc/scsi/scsi
    ls /dev (if using udev)
    dmesg|less (or more depending on your PAGER)
    free

    These usually are enough to determine if BIOS thinks your hardware exists. And also this should help determine if the kernel has loaded a driver and given a device node to your hardware. If you need to know if a harddrive is bad (or partition) you can use the old standby:
    dd if=/dev/ of=/dev/null

    That will tell you if you can read all the data on the device or not. Hope that helps.
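
A sketch of wrapping that dd read test so the result is easy to act on. `read_test` is a made-up name and the path is a placeholder argument you supply (a whole disk or a partition); the block size is an arbitrary choice for this example.

```shell
# read_test (hypothetical helper): read every block of the given device or
# file and throw the data away. The exit status is dd's own, so non-zero
# means dd could not read everything.
read_test() {
    dd if="$1" of=/dev/null bs=65536 2>/dev/null
}

# Typical use (needs read access to the device; "$device" is a placeholder):
#   read_test "$device" && echo "read OK" || echo "read errors, check dmesg"
```

It only proves the sectors are readable right now; it writes nothing and says nothing about how many sectors the drive has already remapped.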

  9. hard drive monitoring by runswithd6s · · Score: 4, Informative

    Just to add this to the suggested list of applications:

    smartmontools: control and monitor storage systems using S.M.A.R.T.
    lmbench: Utilities to benchmark UNIX systems
    memtest86: Test your memory on x86 platforms
    nictools-nopci: Diagnostic tools for many non-PCI ethernet cards
    nictools-pci: Diagnostic tools for many PCI ethernet cards
    lm-sensors: utilities to read temperature/voltage/fan sensors
    mbmon: Hardware monitoring without kernel dependencies (text client)
    sensord: hardware sensor information logging daemon
    crashme: Stress tests operating system stability
    fuzz: stress-test programs by giving them random input
    spew: I/O performance measurement and load generation tool
    stress: A tool to impose load on and stress test a computer system
    cpuburn: a collection of programs to put heavy load on CPU
    ltp: The Linux Test Project test suite

    --
    assert(expired(knowledge)); /* core dump */
  10. UBCD by Jsutton1027w · · Score: 4, Informative

    The Ultimate Boot CD: It's basically a compilation of different boot disks, all put in a nice menu system on a freely-downloadable ISO image. While it's not really Linux (though it contains a number of Linux-based boot disks), it is one of the best utility CDs that I've ever encountered for testing hardware.

    Also, Knoppix is another one that I would suggest, though I use it more for data recovery these days. ;)

  11. Sandra? by Anonymous Coward · · Score: 0

    What is the big deal about SiSoft Sandra? It's not even that amazingly useful for benchmarking, let alone diagnostics. Everybody in the Windows world raves about it but I fail to see the attraction. It's vaguely useful for system inventory/information but that's about it.

  12. PC-Check by AmiMoJo · · Score: 1

    Eurosoft do a product called PC-Check. It's not cheap (£150) but it works very well. You get a bootable floppy (which you can copy) that tests just about everything in your PC.

    Best of all, you just slap it in a machine, let it run for an hour, and come back to see the results.

    --
    const int one = 65536; (Silvermoon, Texture.cs)
    SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    1. Re:PC-Check by ChrisF79 · · Score: 1

      Bootable floppy is great for those that still have a floppy drive.

      --
      Finance tutorials and more! Understandfinance
    2. Re:PC-Check by AmiMoJo · · Score: 1

      We keep a couple of USB floppy drives on hand.

      You can also make bootable CDs quite easily from a bootable floppy disk (in Nero anyway). The only downside to a bootable CD is that you can't save and later print the log files.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
  13. HDD Tools by Anonymous Coward · · Score: 0

    If it's an hdd bottleneck then check/set your hdd settings with "hdparm" and monitor the performance with "iostat".

  14. And few more commands by vivekg · · Score: 1

    less /var/log/debug
    less /var/log/dmesg or dmesg | less

    Various files under /proc

    I prefer less as it gives more options such as moving, searching, etc.

    Also you can write your own custom script to dig out information not just from one Linux server but from other Linux/BSD servers and email/page back the results.

    --
    The important thing is not to stop questioning --Albert Einstein.
  15. Monitoring I/O by Anonymous Coward · · Score: 0

    Is there some program similar to 'top' which shows which process is doing the most I/O? Sometimes I just hear a lot of seeking on my hard drive and would like to have a way to find out which process is actually causing it.

  16. Windows and lspci. by Grendel+Drago · · Score: 1

    I remember having to dig into the registry entries to get PCI IDs of devices, then looking them up on sourceforge. But those days, astonishingly, are past. Windows XP's Device Manager has a nice bit where it has a "Details" tab, with "Hardware IDs". For instance, the 3C905-TX in this computer reads as PCI\VEN_10B7&DEV_9200&SUBSYS_100010B7&REV_6C. And yes, it does this for unknown devices as well. So no more registry digging.
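
The VEN_ and DEV_ fields in that Hardware IDs string are the same vendor and device numbers that `lspci -n` prints on the Linux side, so the two can be matched up. A sketch of the conversion; `hwid_to_pciid` is a made-up helper name, and it assumes four hex digits in each field as in the example string.

```shell
# hwid_to_pciid (hypothetical helper): turn a Windows "Hardware IDs" string
# into the lowercase vendor:device pair used in `lspci -n` output.
hwid_to_pciid() {
    printf '%s\n' "$1" |
        sed -n 's/.*VEN_\([0-9A-Fa-f]\{4\}\)&DEV_\([0-9A-Fa-f]\{4\}\).*/\1:\2/p' |
        tr 'A-F' 'a-f'
}

# Try it with the string quoted above (single quotes keep the backslash and & literal):
#   hwid_to_pciid 'PCI\VEN_10B7&DEV_9200&SUBSYS_100010B7&REV_6C'
```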

    The problem with their "recovery mode" being seriously weaker than the equivalent linux console boot... well, that's a whole 'nother issue.

    --grendel drago

    --
    Laws do not persuade just because they threaten. --Seneca
    1. Re:Windows and lspci. by sykjoke · · Score: 1

      It's ludicrously easy to scan the pci bus. Windows users, no imagination.