Slashdot Mirror


Hope For Fixing Longstanding Linux I/O Wait Bug

DaGoodBoy writes "There has been a long standing performance bug in Linux since 2.6.18 that has been responsible for lagging interactivity and poor system performance across all architectures. It has been notoriously difficult to qualify and isolate, but in the last few days someone has finally gotten a repeatable test case! Turns out the problem may not even be disk related, since the test case triggers the bug only by transferring data either between two processes or threads. The test results are very revealing. The developer ran regressions all the way back to version 2.6.15 that demonstrate this bug has more than doubled the time to run the test in 2.6.28. Many, many people working at improving the desktop performance of Linux will be very happy to see this bug die. I know that I, personally, will find a way to send the guy that found this test case his beverage of choice in thanks. Please spread the word and bring some attention to this issue so we can get it fixed!"
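
For readers who want a feel for the kind of workload described, here is a minimal sketch that pushes data from one thread to another through a pipe and times it. It is not the test case attached to the bug report; the function name `transfer_between_threads` and its parameters are made up for illustration, and a single run says nothing about a regression unless the same script is timed on different kernels.

```python
import os
import threading
import time
from typing import Tuple


def transfer_between_threads(total_bytes: int, chunk_size: int = 4096) -> Tuple[int, float]:
    """Send total_bytes from a writer thread to the calling thread over a pipe.

    Returns (bytes_received, elapsed_seconds). Hypothetical helper, not the
    reproducer from the kernel bug report.
    """
    read_fd, write_fd = os.pipe()
    chunk = b"x" * chunk_size

    def writer() -> None:
        remaining = total_bytes
        try:
            while remaining > 0:
                # os.write may write fewer bytes than requested, so track the count.
                written = os.write(write_fd, chunk[:min(chunk_size, remaining)])
                remaining -= written
        finally:
            os.close(write_fd)  # closing the write end signals end-of-data

    start = time.perf_counter()
    thread = threading.Thread(target=writer)
    thread.start()

    received = 0
    while True:
        data = os.read(read_fd, chunk_size)
        if not data:  # an empty read means the writer closed its end
            break
        received += len(data)

    thread.join()
    os.close(read_fd)
    return received, time.perf_counter() - start
```

Calling `transfer_between_threads(64 * 1024 * 1024)` returns the byte count and the wall-clock time for a 64 MiB transfer; comparing that time across kernel versions on the same machine is the sort of regression run the summary refers to.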

45 of 180 comments

  1. Dang!! by camperdave · · Score: 5, Funny

    Dang! I was going for First Post, but my machine was stuck in some weird I/O wait state.

    --
    When our name is on the back of your car, we're behind you all the way!
    1. Re:Dang!! by Anthony_Cargile · · Score: 2, Funny

      Damn futex_wait states!

    2. Re:Dang!! by Aphoxema · · Score: 2, Insightful

      It was funny to me

      --
      "Most people, I think, don't even know what a rootkit is, so why should they care about it?"
  2. Is this bug currently affecting .... by whoever57 · · Score: 4, Funny

    bugzilla.kernel.org?

    --
    The real "Libtards" are the Libertarians!
    1. Re:Is this bug currently affecting .... by 2Bits · · Score: 3, Funny

      With the current response time, obviously, yes.

    2. Re:Is this bug currently affecting .... by NekoXP · · Score: 2, Insightful

      Yes, by spreading the word and asking people to go look into fixes we crashed the bug tracker so nobody doing kernel development can file new bugs or new bug fixes for anything else today.

      Awesome plan. Really awesome.

    3. Re:Is this bug currently affecting .... by ArsonSmith · · Score: 3, Funny

      Given enough eyeballs, all bug tracking software is fragile

      --
      Paying taxes to buy civilization is like paying a hooker to buy love.
  3. Re:funny by Anthony_Cargile · · Score: 2, Interesting

    Anyone else notice the article 404ing from the front page? I'd say /. needs to fix some bugs/user errors rather than speak about a Linux IO latency most users don't even notice. Just an observation, and if you can read this, they either fixed it or you doctored up a query string like I did :D.

  4. KTorrent by Anonymous Coward · · Score: 2, Interesting

    I'm not sure if this is related, but has anyone else noticed KTorrent can really bog your system down without showing any excessive resource usage in KSysGuard? For all I know, it may be passing information between one thread and another, and it's disk I/O intensive.

    1. Re:KTorrent by Nuitari+The+Wiz · · Score: 2, Informative

      There was a bug in KTorrent that caused an infinite loop when UDP trackers were present in a torrent file; check whether you have the latest version.

  5. Longstanding...Since 2.6.18 by akpoff · · Score: 3, Interesting

    Right. I had to get up in the morning at ten o'clock at night, half an hour before I went to bed, eat a lump of cold poison, work twenty-nine hours a day down mill, and pay mill owner for permission to come to work, and when we got home, our Dad would kill us, and dance about on our graves singing "Hallelujah." --Monty Python: Four Yorkshiremen

    Been waiting all of 2 years and change for your precious bug fix, 'ave you? You almost had my eyes tearing up there I tell ya: 25 Year Old BSD Bug.

  6. Desktop??? by corychristison · · Score: 4, Insightful

    I'm not sure about anybody else here, but I was surprised to see that they mentioned that this will benefit 'Desktop' users.

    I think that when it comes to the performance spectrum, Servers would be where this fix is the most needed. Admittedly if you are running a solid server, you should know to use older gen hardware and software that has been proven to be stable. However, some of this 'shiny new' tech coming out is appealing.

    How about the Seagate 1500GB drive hang error? To my understanding Windows has been fixed, but the problem still persists in Linux. Could this potentially make a difference? I've been looking to build myself a nice NAS and those 1500GB drives are _cheap_. I can pick one up for about $160. I remember not too long ago that could only get me 80GB.

    1. Re:Desktop??? by Anonymous Coward · · Score: 4, Informative

      I believe the 1.5tb Seagate linux hang has been fixed. We're using a lot of them (100's) where I work on Ubuntu Hardy servers and haven't had hangs.

    2. Re:Desktop??? by adolf · · Score: 2, Interesting

      Disk-to-disk operations would then bypass the kernel and asynchronous I/O would consume no primary resources. This was fashionable on some systems (most notably drives that used the IEEE 488 bus) in the 70s and was done to some degree with SCSI, but there's really no excuse for not providing such a capability on any modern drive.

      I bought that line, hook line and sinker, in the late 90's with a bunch of IBM 9ES ultra-wide SCSI disks and a good controller.

      It never was clear to me that, at any time, Linux was actually telling the drives to copy data directly from one disk to any other without the kernel in the middle.

      And now that we live in a world of point-to-point serial buses (SATA, SAS) linking disks to seemingly independent controllers: Is it even theoretically possible anymore?

    3. Re:Desktop??? by Compholio · · Score: 2, Interesting

      I'm actually pretty sure that I've spotted the results of this in "everyday" use. I've noticed that every once in a long while my hard-drive activity kicks up (it's happened when I'm just scrolling on an already-loaded web page and I'm using absolutely zero swap) and literally everything stops responding for a good 5 seconds. My guess would be that the slocate or "tracker" program spawns off on recently added and removed files, but it's not something I've put a lot of effort into figuring out.

    4. Re:Desktop??? by NormalVisual · · Score: 3, Funny

      not HAL-9000 intelligence, which would be bad for data anyway

      HAL-9K intelligence doesn't pose any problems to the data - it's the *operators* that need to be concerned, especially when giving the system instructions that could potentially conflict with each other.

      --
      Please stand clear of the doors, por favor mantenganse alejado de las puertas
    5. Re:Desktop??? by cowbutt · · Score: 4, Informative

      How about the Seagate 1500GB drive hang error? To my understanding Windows has been fixed, but the problem still persists in Linux.

      The ST31500341AS requires a firmware update from Seagate to something newer than revision SD19 (more info). In the meantime, if you're using a drive which hasn't been updated to fixed firmware, there's a blacklist in the current development kernel to disable NCQ on affected models as a workaround.

    6. Re:Desktop??? by BlackCreek · · Score: 2, Interesting

      I'm not sure about anybody else here, but I was surprised to see that they mentioned that this will benefit 'Desktop' users.

      They mentioned it because it does hit the desktop: https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.22/+bug/131094

    7. Re:Desktop??? by jd · · Score: 2, Insightful

      The cost of RAM is not that great, compared to the cost of a high-end motherboard on a good server, and is absolutely insignificant compared to even a single hour of downtime in any kind of datacentre. If you want genuine 5N's reliability or better (and you can go a lot better than that), you want as little strain on mechanical components as you can get. There's little point in, say, using Carrier-Grade Linux if the practical lifetime of the hard drive due to usage means your hardware cannot maintain a comparable level of reliability.

      RAM prices matter for home usage, sure, but since when do home users actually have true data servers? (For that matter, when was the last time you used a Carrier-Grade Linux distro at home?) Most home users have one or two computers, but they don't usually designate a box as a NAS. And even then, most home computers these days have at least a gig of RAM. If you generate more than a gig of long-term data per hard disk read on your home machine, you're using it weird.

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  7. Killing kernel.org server isn't very nice... by Anonymous Coward · · Score: 3, Funny

    I'm sure kernel.org appreciates these links. Now instead of fixing the bug they're putting out fires in the data center...great job slashdot.

    1. Re:Killing kernel.org server isn't very nice... by statusbar · · Score: 4, Funny


      I'm sure kernel.org appreciates these links. Now instead of fixing the bug they're putting out fires in the data center...great job slashdot.

      Well, maybe the kernel developers or bugzilla developers could use the practice in making a reliable scalable system out of the systems that they design.

      --jeffk++

      --
      ipv6 is my vpn
  8. Windows Port? by Al+Al+Cool+J · · Score: 4, Funny

    If this gets resolved, is there any chance the fix could get ported to Windows? I just had my Dad's XP laptop completely freeze after I plugged in a bog-basic USB thumbdrive. The desktop sprang to life only after I unplugged it. I wish some of the AC Windows fanboys who were hassling me here last week were around to see it. "Ready for the desktop" my ass.

    1. Re:Windows Port? by troll8901 · · Score: 2, Insightful

      And I'm going to hassle you again.

      (Oops, forgot to check the AC option!)

      Never mind, carry on ...

      (I also have problems with U3 flash drives. I had to use basic flash drives - thus missing out on all the app portability features.)

      So THAT's why we don't have Year of the Linux Desktop! It has performance problems ... just like Vista has performance problems!

  9. Re:Just upgrade by martinw89 · · Score: 3, Funny

    OS not fast enough? Just upgrade your hardware components, preferably to a new, top-of-the-line system.

    Oh wait... that's the Windows way of doing things.

    Yeah, exactly, that's why volunteers have been hard at work to find and fix the (published, admitted) bug. Just like Win... Oh, wait.

  10. this is bad even for /. by Harik · · Score: 5, Informative

    wow, not just badsummary, utterly worthless summary. Here's the relevant discussion from LKML. Yes, this is all of it.

    Peter Zijlstra

    Andrew Morton
    In http://bugzilla.kernel.org/show_bug.cgi?id=12309 the reporters have
    identified what appears to be a sched-related performance regression.
    A fairly long-term one - post-2.6.18, perhaps.

    Testcase code has been added today. Could someone please take a look
    sometime?

    There appear to be two different bug reports in there. One about iowait,
    and one I'm not quite sure what it is about.

    The second thing shows some numbers and a test case, but I fail to see
    what the problem is with it.

    This somewhat deflates the excitement evident in the OP. I mean, I know what he's talking about, these apparently random 1-2 second FREEZES while working, but if the guys on LKML aren't talking about it, it's probably not really being worked on.

    1. Re:this is bad even for /. by Anonymous Coward · · Score: 2, Interesting

      The fscking freezes are in HAL. They have been driving me nuts for more than a year. In my case, the solution is to unplug the CDROM drive.

    2. Re:this is bad even for /. by haifastudent · · Score: 2, Funny

      This somewhat deflates the excitement evident in the OP. I mean, I know what he's talking about, these apparently random 1-2 second FREEZES while working, but if the guys on LKML aren't talking about it, it's probably not really being worked on.

      I know, it looks like someone's pet bug made the cover of /. today. For the record, here is my pet bug: https://launchpad.net/ubuntu/+bug/1

      --
      Thank for reading to the sig. You may stop reading now. It is safe. There is no more content. Why are you still reading?
    3. Re:this is bad even for /. by bjourne · · Score: 4, Interesting

      If you haven't used Linux regularly within the last two years, you probably have not noticed that the system has gotten significantly slower with more recent releases. The probable symptom was discussed here. Many Ubuntu users, including me, have noticed that the latency of desktop operations got significantly larger around the time Gutsy was released, which coincides with the Completely Fair Scheduler and kernel upgrade from 2.6.18.

      Since it is most likely a latency issue, the problem is extremely hard to diagnose. Alt-tabbing between programs seems a little slower, and keyboard input might lag somewhat. You can't measure desktop latency easily.
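
One rough proxy, sketched here under the assumption that scheduler wake-up delay is part of what is being felt, is to sleep for a fixed interval many times and record how late each wake-up arrives. The function name `measure_wakeup_overshoot` is hypothetical, and this does not capture the rest of the desktop stack (X server, window manager, toolkit), so treat it as an indicator at best.

```python
import time
from typing import List


def measure_wakeup_overshoot(interval_s: float, samples: int) -> List[float]:
    """Return, for each sample, how much longer a sleep took than requested.

    Large or erratic values while the machine is busy hint at wake-up
    latency; small steady values hint that the scheduler is keeping up.
    """
    overshoots = []
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(interval_s)
        elapsed = time.perf_counter() - start
        overshoots.append(elapsed - interval_s)
    return overshoots
```

Running `max(measure_wakeup_overshoot(0.01, 1000))` once on an idle system and once during a heavy file copy gives two numbers to compare, which is more than the "feels slower" reports usually offer.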

    4. Re:this is bad even for /. by Anonymous Coward · · Score: 2, Interesting

      It's very easy to trigger, just unrar an iso from a torrent. Regardless of CPU cores, copious amounts of RAM, and no other real system activity, your desktop experience will grind to a miserable halt until the archive process has completed. renicing makes very little difference. Linux has had this problem for years, certainly more than two. Memory suggests it came along with SATA.

    5. Re:this is bad even for /. by CAIMLAS · · Score: 4, Interesting

      Yep, this is a pretty big problem - an easily reproducible one - and it's been around for a really long time. I don't remember when exactly it came about, but I moved from Debian Sid to Ubuntu 7.x about 8 months ago. I didn't have any problem under Debian, and I'm uncertain whether the 7.x Ubuntu releases had the problem, but I certainly noticed it in the 8.x releases.

      I do recall a gradual decline in desktop performance, though, going all the way back to the later 2.0 kernels. Back then, the schedulers would let a relatively slow machine (for the time) run a fairly bloaty window manager (like E16) responsively while untarring an archive and running a kernel build at the same time - provided there was 100+ MB or so of RAM for the process, of course. Even if you dipped into swap, the UI would remain pretty responsive. Not anymore.

      The way things sit now, the Linux I/O scheduler results in desktop performance similar to Windows XP during I/O ops. That is completely unacceptable.

      Part of me thinks this is due to a server-centric focus in development (being as the people doing kernel dev largely work for corporations who want server kernels), but I'm not really in the know. If that's the case, we really need to pull one of the old desktop schedulers out of retirement and use that instead of what we've got now, at least for the desktop, and maintain two different-focus schedulers within the kernel instead of just having a couple generally-suited schedulers.

      --
      ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
    6. Re:this is bad even for /. by kwabbles · · Score: 3, Informative

      "Many Ubuntu users, including me, have noticed that the latency of desktop operations got significantly larger around the time Gutsy was released, which coincides with the Completely Fair Scheduler and kernel upgrade from 2.6.18."

      Uhh.. I didn't see anything in there about Completely Fair Queuing - you just mentioned the Completely Fair Scheduler, then kernel 2.6.18.

      "Feisty had the 2.6.18 kernel and was quite responsive, so CFQ is in the clear. Gutsy featured 2.6.23 with CFS and was much slower which means it is a possible suspect."

      This performance bug has been reported since 2.6.18.

      --
      Just disrupt the deflector shield with a tachyon burst.
  11. Looks like also affects servers, not just desktops by trolltalk.com · · Score: 2, Funny

    That's because you're not transferring data between yourself and another thread.

    It must also affect servers, because none of the links is transferring data either.

  12. Re:funny by Waffle+Iron · · Score: 2, Funny

    That's because you're not transferring data between yourself and another thread.

    But he is transferring data between himself and another sockpuppet.

  13. Re:funny by iluvcapra · · Score: 3, Funny

    I trrrrrrrrrrrrrrranssssssssfer data betwwwwwwwwwwwwwwwwwwwwwween threads alllllll the time......

    --
    Don't blame me, I voted for Baltar.
  14. Re:Just upgrade by El+Lobo · · Score: 4, Insightful

    Sure, because every Windows developer is a lazy motherfucker that doesn't like his work and plays Solitaire the whole day long, and never ever works on fixing things for the love of the art. Hard-working, enthusiastic developers are a Linuzz monopoly.

    --
    It's time to realise that Abble's products are the biggest abomination these days. Just say NO to the dumb iAbble way!!
  15. Re:Just upgrade by Erik+Hensema · · Score: 2, Interesting

    It's because people don't want to wait for a bugfix for over 2 years. They need fast systems NOW, and when a performance bug which doesn't get fixed can be solved by buying faster hardware, that's what they do.

    --

    This is your sig. There are thousands more, but this one is yours.

  16. I second this by waslap · · Score: 5, Interesting

    I am overjoyed that my suspicions have finally been vindicated. I've been working 10+ hours a day on Linux for the last 13 years, and you tend to get in tune with your environment (I can still recite the DOS bootup tune of my XT even though I haven't worked on it for 20 years :-). Some time ago, after installing a new flavour of Linux, I immediately started complaining to fellow workers that something had gone wrong in the kernel, but it was not annoying enough to really do something about it; you start living with it. It manifests sometimes when I compile - my system simply locks up for 20-30 seconds, which is something I never experienced before. I'd say it happens once out of every 50 compiles of the same program with gcc. During such occurrences I can't access anything on my desktop, which annoys me because I typically switch to another kterm session to prepare to run the build whilst compiling (to keep up the productivity and all that). I have also seen strange ratios of I/O wait to CPU in 'top' nowadays, but can probably ascribe that to CPUs that just became ridiculously fast and the way top calculates its scores. Nevertheless, I've mumbled about and lambasted I/O wait in Linux ever since a very specific time in the past, and even though I haven't noted the exact date, I'm sure it's related to this. Anyway, I found this intriguing enough to create a Slashdot account after years to share my joy that the bug's days are hopefully numbered now.
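
For anyone curious where an I/O wait percentage can come from, here is a minimal sketch that derives it from two samples of the aggregate `cpu` line in `/proc/stat`, whose leading fields are user, nice, system, idle, iowait, irq, softirq and steal (in clock ticks). The helper names are hypothetical, and this is an approximation of what a monitoring tool might compute, not a copy of top's code.

```python
import time
from typing import List


def parse_cpu_line(line: str) -> List[int]:
    """Return the first eight counters (user..steal) from the aggregate cpu line."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    # guest and guest_nice (fields 9 and 10) are already counted in user and
    # nice, so they are left out to avoid double counting.
    return [int(value) for value in fields[1:9]]


def iowait_fraction(before: str, after: str) -> float:
    """Fraction of CPU time spent in iowait between two samples of the cpu line."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    if total == 0:
        return 0.0
    return deltas[4] / total  # index 4 is the iowait counter


def sample_iowait_fraction(interval_s: float = 1.0, path: str = "/proc/stat") -> float:
    """Read the cpu line twice, interval_s apart, and return the iowait share (Linux only)."""
    def read_first_line() -> str:
        with open(path) as stat_file:
            return stat_file.readline()

    before = read_first_line()
    time.sleep(interval_s)
    return iowait_fraction(before, read_first_line())
```

Because the result is a ratio of tick deltas, a very fast CPU that finishes its work quickly and then waits on the disk will show a high iowait share even when nothing is wrong, which is one reason the figure is hard to interpret on its own.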

  17. Problem is Real by Anonymous Coward · · Score: 5, Informative

    For what it is worth, the problem is real.

    We have experienced massive negative effects with our MySQL server; downgrading to an earlier Linux kernel solves the problem. This has been very difficult to debug as we never guessed that the OS would be a factor... we figured it had to be something we were doing. Only by chance did we try another distro / kernel, only to find that everything starts working fine when you downgrade.

    1. Re:Problem is Real by Bert64 · · Score: 2, Interesting

      What version do you need to downgrade to? And does downgrading open you up to any security flaws or incompatibility?

      --
      http://spamdecoy.net - free throwaway anonymous email - avoid spam!
    2. Re:Problem is Real by Harik · · Score: 4, Insightful

      If you can reproduce it, do a git-bisect. You'll find the change that caused it pretty quickly.
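
The reason bisecting is quick: it is a binary search over history, so each build-and-test step halves the range of suspects. Below is a minimal sketch of that idea in Python; it is not git itself (the real workflow uses `git bisect start`, `git bisect good` and `git bisect bad`), and the version list and the `is_bad` check in the usage note are hypothetical stand-ins for "build this kernel and run the reproducer".

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Find the first bad entry in an ordered history.

    Assumes versions[0] is known good, versions[-1] is known bad, and the
    history switches from good to bad exactly once.
    """
    if len(versions) < 2:
        raise ValueError("need at least one known-good and one known-bad version")
    good, bad = 0, len(versions) - 1  # indices of the known-good and known-bad ends
    while bad - good > 1:
        middle = (good + bad) // 2
        if is_bad(versions[middle]):
            bad = middle   # regression is at or before the midpoint
        else:
            good = middle  # regression is after the midpoint
    return versions[bad]
```

For example, with `versions = [f"2.6.{n}" for n in range(15, 29)]` and a made-up check that flags everything from 2.6.19 onward, `first_bad` needs only three tests to land on "2.6.19". Bisecting individual commits between two releases works the same way, just with a longer list.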

  18. Re:Karlan Mitchell by Ash-Fox · · Score: 2, Funny

    You should enable DMA.

    --
    Change is certain; progress is not obligatory.
  19. Re:funny by BlackCreek · · Score: 4, Insightful
  20. This is what happens... by Builder · · Score: 2

    ...when you insist on doing development in the 'stable' kernel tree and expect vendors to stablise it.

    Genius!

  21. whereis bugzilla.kernel.org .. by rs232 · · Score: 2, Informative
    --
    davecb5620@gmail.com
  22. Re:I am NOT experiencing this bug by Heather+D · · Score: 2, Interesting

    I am getting it. This is on Ubuntu running the 2.6.20-generic kernel that came from the distro. My backups (~19GB) are responsive but I am currently running Ben Gamari's suggested method to reproduce it and it appears to be showing up. I get 'small' freezes of ~1-3 seconds when entering text as well as larger freezes of ~5-15 seconds upon maximizing a minimized program.

    It only seems to cause a problem for maximizing minimized programs when it happens at the same time as you maximize the window. It doesn't seem to happen very much, but when it does it's pretty noticeable.

    I never really noticed this before. I suppose I just expected it after hearing about how bad IDE drives are for anything involving heavy multitasking.

    Yep, I've left it running and it just did it again.