Slashdot Mirror


The Exact Cause of the Zune Meltdown

An anonymous reader writes "The Zune 30 failure became national news when it happened just three days ago. The source code for the bad driver leaked soon after, and now, someone has come up with a very detailed explanation for where the code was bad as well as a number of solutions to deal with it. From a coding/QA standpoint, one has to wonder how this bug was missed if the quality assurance team wasn't slacking off. Worse yet: this bug affects every Windows CE device carrying this driver."

20 of 465 comments

  1. Re:Warning, Y2.1K bug. by LostCluster · · Score: 5, Informative

    Here's your 500 year plan:

    1900 - multiple of 100, not a multiple of 400, no leap day.
    2000 - was a multiple of 100, but also a multiple of 400 so we still had a leap day.
    2100 - multiple of 100, not a multiple of 400, no leap day (same as 1900).
    2200 - not a multiple of 400, no leap day.
    2300 - not a multiple of 400, no leap day
    2400 - multiple of 400, so have the leap day anyway.
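
    The plan above boils down to a three-step test. A minimal C sketch; the name is_leap_year is made up for illustration and is not taken from the driver:

```c
#include <stdbool.h>

/* Gregorian leap-year rule: every 4th year is a leap year, except
   century years, which must also be divisible by 400. */
bool is_leap_year(int year)
{
    if (year % 4 != 0)
        return false;           /* ordinary year */
    if (year % 100 != 0)
        return true;            /* divisible by 4, not a century year */
    return year % 400 == 0;     /* century year: 2000 and 2400 yes, 1900 and 2100 no */
}
```

    Calling is_leap_year(2100) returns false and is_leap_year(2400) returns true, matching the list.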

  2. Import calendar? by TurtleBlue · · Score: 5, Insightful

    "From a coding/QA standpoint, one has to wonder how this bug was missed if the quality assurance team wasn't slacking off."

    I can't remember the last time a QA department was asked to test date functions... but then again, I can't remember the last time anyone wrote their own Leap Year calendaring calculator from scratch.

    I'm sure there are a hundred reasons to do it (licensing being one of them) but really, when was the last time you didn't just import calendaring from another library and call it a day?

    Please clarify to me if this is something at the hardware driver level: I honestly don't know. If this were me, my own bosses wouldn't ask "Why didn't QA catch this", as much as "why are you wasting time writing your own calendar code? And then why didn't you flag it as functionality that needed to be tested?"

    1. Re:Import calendar? by Anonymous Coward · · Score: 5, Informative

      It is driver code supplied by the manufacturer of the hardware platform on which the Zune and a couple of other devices are built. This platform includes a real-time clock which counts seconds since midnight and days since 1/1/1980. Considering that hardware component prices are cut-throat, there is probably no quality management for the software whatsoever. If it appears to work, it ships.

    2. Re:Import calendar? by TurtleBlue · · Score: 5, Insightful

      Thanks - that makes a tad more sense. I see everyone running around blaming Microsoft for the code since their name is on the product, even if it was a 3rd party vendor. They certainly are still liable for all the busted Zunes, but I couldn't imagine Microsoft didn't have *some* C leap-year code sitting around that actually worked, and could be compiled for any chip they wanted.

      Microsoft still has to take the hit up front, but then they'll sue or "renegotiate contracts" with the vendor that supplied the bad driver code, based on what it costs them.

      I'm still shocked that the manufacturer couldn't dig up *some* free/open calendaring code that was around pre-2004. But hey, at least we know they were honest about not ripping off some other source code and calling it their own.

    3. Re:Import calendar? by nato10 · · Score: 5, Informative

      This is kernel-level code -- part of the OEM Abstraction Layer -- that is used to read the current time from the RTC, hence it is hardware-specific. RTCs on other processors, or Freescale-based devices using external RTCs, may implement the OemGetRealTime () function differently than Freescale has done here (the buggy ConvertDays () function is just a helper function).

  3. "Leaked"...? by Anonymous Coward · · Score: 5, Informative

    It's an open source driver from Freescale.

  4. Bigger bugs have gotten through on Windows CE by msgmonkey · · Score: 5, Interesting

    For example, I had some code I developed on Windows CE 4.2 .NET which kept hanging when calling the FindWindow() function.

    It turns out that trying to find a window by class name will hang (this version of) CE every time, even though you would have thought it's a very heavily used function call whose failure would have been caught in testing.

    So no I'm not surprised at all that this bug got through.

  5. Regardless of whatever code in it is faulty by scourfish · · Score: 5, Funny

    Lines 122, 521, 690, 710, and 748 scare me; gotos in C code...

    1. Re:Regardless of whatever code in it is faulty by concernedadmin · · Score: 5, Interesting

      Lines 122, 521, 690, 710, and 748 scare me; gotos in C code...

      They've used one form of a goto that's actually quite readable and useful. Would you rather have:

      if (condition1 && condition2) {
          /* boilerplate code with a return */
      }

      if (issue1 || issue2) {
          /* same repeated boilerplate code with a return */
      }

      or

      if (condition1 && condition2) {
          goto cleanup;
      }

      if (issue1 || issue2) {
          goto cleanup;
      }

      cleanup:
          /* just one instance of this code,
             no need for duplication of efforts */

      Believe it or not, there are useful reasons to use goto, and Microsoft happened to use goto for the right reason here. The Linux kernel also happens to use this practice to boost the readability of the code.
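
      For a compilable version of the same idea, here is a minimal sketch of the goto-cleanup pattern. The function read_first_line and its parameters are hypothetical, invented for illustration; nothing here is taken from the driver:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the driver): copies the first line of the
   file at path into out. Returns 0 on success, -1 on any failure. */
int read_first_line(const char *path, char *out, size_t out_size)
{
    int rc = -1;            /* assume failure until every step succeeds */
    FILE *fp = NULL;
    char *buf = NULL;

    if (path == NULL || out == NULL || out_size == 0)
        goto cleanup;

    fp = fopen(path, "r");
    if (fp == NULL)
        goto cleanup;

    buf = malloc(out_size); /* scratch buffer, so there are two resources to release */
    if (buf == NULL)
        goto cleanup;

    if (fgets(buf, (int)out_size, fp) == NULL)
        goto cleanup;

    strcpy(out, buf);       /* fits: fgets stored at most out_size bytes including the NUL */
    rc = 0;

cleanup:
    /* single exit path, safe whether or not each resource was acquired */
    free(buf);              /* free(NULL) is a no-op */
    if (fp != NULL)
        fclose(fp);
    return rc;
}
```

      Every failure path jumps to the same label, so the release code is written once, and adding a third resource later only touches the cleanup block.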

  6. Re:Why write any date/time code? by p0tat03 · · Score: 5, Informative

    This was written by the Freescale guys, not MS, where it would make sense for the device manufacturer to ship their own date/time code.

  7. Re:Old by larry+bagina · · Score: 5, Insightful

    Comments in the last zune slashdot story (yesterday?) were just as detailed as this "story". Maybe slashdot editors should read their own site. Or maybe I should start submitting all +5 comments for their own story.

    --
    Do you even lift?

    These aren't the 'roids you're looking for.

  8. MOD PARENT UP Re:Why write any date/time code? by exphose · · Score: 5, Insightful

    Exactly. It just goes to show the danger of not QA'ing the whole codebase, including supplied drivers. You can't trust your own code, so you QA it; why should you trust your partner's code?

  9. Probably Not A Widespread Issue by nato10 · · Score: 5, Informative

    This code is actually from the Windows CE OAL (OEM Abstraction Layer), part of the code that reads the current time from the RTC. As such, the implementation is hardware-dependent, which is why there isn't a standard implementation of this function for Windows CE.

    In addition, this code is in a portion of Windows CE source code provided by a device's BSP developer, not by Microsoft. In most cases, Windows CE BSP developers start with sample BSPs written by a processor's manufacturer -- in this case, Freescale -- and then improve it.

    It turns out that this bug is specific to Freescale's BSP -- sample Windows CE BSPs for other processors don't have it -- and other Freescale devices using Windows CE will only have this issue if their developers used this code verbatim. Since sample BSPs provided by processor manufacturers are often of poor quality, many Windows CE developers rewrite such functions. In other words, the impact of this particular bug may be quite limited, which may be why there haven't been reports of this issue on other devices.

    In this particular case, though, Microsoft (or a contractor) was the Zune's BSP developer, so they certainly should have caught this.

  10. Re:Sad code, sad article by chalkyj · · Score: 5, Informative

    I think slashdot ate your < in the breaking line.

  11. Re:Sad code, sad article by xlv · · Score: 5, Funny

    for (;;) {
            int daysInYear = IsLeapYear (year) ? 366 : 365;
            if (day = daysInYear) break;
            day -= daysInYear; year += 1;
    }

    This is what Knuth called an "N + 1/2" loop

    No, this is what Knuth would call an infinite loop as there's no way to terminate the loop except on the last day of each year...
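
    Assuming the lost < made the original test read day <= daysInYear, the loop terminates for every day count, including day 366 of a leap year. A minimal C sketch of that reading; ORIGIN_YEAR, is_leap_year, convert_days, and the 1-based day count (1 = 1 January 1980) are illustrative assumptions, not the driver's actual code:

```c
#include <stdbool.h>

#define ORIGIN_YEAR 1980    /* the thread says the RTC counts days since 1/1/1980 */

static bool is_leap_year(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

/* Convert a 1-based day count (assumption: 1 = 1 January 1980) into a
   calendar year and a 1-based day within that year. */
void convert_days(int day, int *year_out, int *day_of_year_out)
{
    int year = ORIGIN_YEAR;

    for (;;) {
        int days_in_year = is_leap_year(year) ? 366 : 365;
        if (day <= days_in_year)
            break;              /* day falls inside this year, leap day included */
        day -= days_in_year;    /* otherwise consume one whole year */
        year += 1;
    }

    *year_out = year;
    *day_of_year_out = day;
}
```

    Each pass either breaks or subtracts at least 365 from a positive day count, so the loop cannot spin forever on the last day of a leap year.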

  12. Re:Wow. by Anonymous Coward · · Score: 5, Funny

    No they didn't no they didn't lalalala I can't hear you That piece of shit was definitely not any Metallica I know.

  13. Re:Warning, Y2.1K bug. by Anonymous Coward · · Score: 5, Funny

    For Slashdotters, you lot seem pretty confident the Zune is going to be around for a while.

  14. Re:Warning, Y2.1K bug. by PIBM · · Score: 5, Insightful

    Actually, it's far from being good. In 99% of the cases you will do 3 modulo operations, in 0.75% you will do 2, and in 0.25% you will do 1, for an average cost of 2.9875 modulos per run.

    With the initial solution, you have 1 modulo in 75% of the cases, 2 modulos in 24% of the cases, and 3 modulos in 1% of the cases, for an average cost of 1.26 modulos per run.
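
    Those averages can be checked by counting. The sketch below assumes the parent's version tested % 400 first and the initial solution tested % 4 first; neither is quoted here, so both functions and their names are illustrative reconstructions:

```c
#include <stdbool.h>

/* Each checker reports leap-ness and adds the number of % operations
   it performed to *mods. */
bool leap_check_4_first(int year, int *mods)
{
    *mods += 1;
    if (year % 4 != 0)
        return false;
    *mods += 1;
    if (year % 100 != 0)
        return true;
    *mods += 1;
    return year % 400 == 0;
}

bool leap_check_400_first(int year, int *mods)
{
    *mods += 1;
    if (year % 400 == 0)
        return true;
    *mods += 1;
    if (year % 100 == 0)
        return false;
    *mods += 1;
    return year % 4 == 0;
}

/* Total modulo operations over one full 400-year Gregorian cycle. */
int total_mods(bool (*check)(int, int *))
{
    int mods = 0;
    for (int year = 2000; year < 2400; year++)
        check(year, &mods);
    return mods;
}
```

    Over the 400-year cycle, the % 4-first order costs 300*1 + 96*2 + 4*3 = 504 modulos (1.26 per call) and the % 400-first order costs 1*1 + 3*2 + 396*3 = 1195 (2.9875 per call).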

  15. Re:Warning, Y2.1K bug. by kybred · · Score: 5, Informative

    No need to hard-code, there's an established algorithm for computing this.

    Why not call it by its name: Zeller's Congruence.
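
    For reference, Zeller's congruence computes the day of the week for a Gregorian date; it does not by itself decide whether a year is a leap year. A minimal C sketch for positive years, with a made-up function name:

```c
/* Zeller's congruence for the Gregorian calendar.
   Returns 0 = Saturday, 1 = Sunday, 2 = Monday, ..., 6 = Friday. */
int zeller_day_of_week(int year, int month, int day)
{
    /* January and February count as months 13 and 14 of the previous year. */
    if (month < 3) {
        month += 12;
        year -= 1;
    }

    int k = year % 100;     /* year within the century */
    int j = year / 100;     /* zero-based century */

    return (day + (13 * (month + 1)) / 5 + k + k / 4 + j / 4 + 5 * j) % 7;
}
```

    For example, zeller_day_of_week(2000, 1, 1) returns 0, a Saturday.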

  16. Re:Warning, Y2.1K bug. by Anonymous Coward · · Score: 5, Funny

    I can't help but imagine how I would be directed by work to "solve" this problem.

    First, they would tell me that it's too difficult, expensive, and complicated to implement the correct solution. Even if I gave them a working prototype, they wouldn't change their minds.

    Then they would tell me "just assume every 100th year is not a leap year." So I would do that instead. In the time from 2100 to 2400, they would say that "a better solution is due to come out next quarter." They would say this every quarter for 299 years.

    In 2399, they would finally give me permission to fix the problem. But the leap year-calculating code works, and they don't want me to mess with it. Instead, they'd tell me to add a test when the program starts to see what year it is. If it's 2400, then it will refuse to run. (We'll definitely have a better solution in place by Q1. Definitely.)

    But the program often runs for an extended period of time without being restarted, so it's possible that someone will start it in December 2399 and it will still be running in February/March 2400. Management has a simple fix for this one: calculate the average run time for the program, add a margin of error, and use that to determine the actual "upper limit" on when the program is allowed to start. My boss would be really excited about this, because it would allow us to refine our earlier not-after-January-1st estimate to be "completely accurate."

    Unfortunately, we don't know the average run time for this program. So I'm told to add code to it to track when it starts and ends and store the results in a file. When the program starts, it examines that file (in addition to recording its own start time), calculates the average run time, adds 10% (there are still director-level meetings about whether we should round up to the nearest hour or day), and subtracts that value from February 28th, 2400. If the current timestamp is greater than or equal to the result we got from that, the program won't start.

    That's pretty good, but my boss would be worried about the program crashing. If that happens, after all, we won't know the program's end time -- never mind that it's November by now and there's no chance of getting useful data no matter what -- so instead of logging an end time, the program logs a heartbeat every minute. Now, you can determine when the program ended -- to within a minute! -- simply by looking at the heartbeat timestamps. When you encounter a gap of more than 1 minute (plus a small margin of error), you know the program ended. This has the bonus, my boss tells me, of simplifying the design by only requiring you to log one type of message to the file. He also assures me that this "telemetry data" has the potential to be "really useful for data mining." He talks about adding information on CPU time consumed, memory in use, I/Os, all sorts of stuff, then putting it in a database to be retrieved later. I manage to talk him out of it by pointing out that "the better solution [with which I am completely uninvolved] will be out in just a few months, so you should just make sure it makes it into that instead."

    Not that I'm bitter.