The Exact Cause of the Zune Meltdown
An anonymous reader writes "The Zune 30 failure became national news when it happened just three days ago. The source code for the bad driver leaked soon after, and now, someone has come up with a very detailed explanation for where the code was bad as well as a number of solutions to deal with it. From a coding/QA standpoint, one has to wonder how this bug was missed if the quality assurance team wasn't slacking off. Worse yet: this bug affects every Windows CE device carrying this driver."
It wasn't a bug! It was an unexpected feature!
Microsoft is taking a stance against teenagers blowing their ears out with loud music.
Just before anybody claims to have a foolproof solution to leap years, make sure you test against the year 2100. It's a multiple of four, but also a multiple of 100 that's not a multiple of 400... and therefore NOT a leap year.
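For reference, the full Gregorian rule fits in one expression. A minimal C sketch follows; the name is_leap_year is my own and is not taken from the driver.

```c
/* Gregorian leap-year rule: divisible by 4, except century years,
   which must also be divisible by 400. Returns 1 for leap years, else 0. */
int is_leap_year(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}
```

With this rule 2000 and 2008 are leap years, while 1900 and 2100 are not.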
"From a coding/QA standpoint, one has to wonder how this bug was missed if the quality assurance team wasn't slacking off."
I can't remember the last time a QA department was asked to test date functions... but then again, I can't remember the last time anyone wrote their own Leap Year calendaring calculator from scratch.
I'm sure there are a hundred reasons to do it (licensing being one of them) but really, when was the last time you didn't just import calendaring from another library and call it a day?
Please clarify for me whether this is something at the hardware driver level: I honestly don't know. If this were me, my own bosses wouldn't ask "Why didn't QA catch this?" so much as "Why are you wasting time writing your own calendar code? And then why didn't you flag it as functionality that needed to be tested?"
It's an open source driver from Freescale.
Amazon link eh? meh.
Try this link for your "sampling" : Deep C Secrets.
Took only 15 seconds for that link. Enjoy.
Yep, but it deserves to be covered so that everybody hears it. It's not just a laugh-at-Microsoft story, but also a lesson to aspiring programmers to watch their step when it comes to timekeeping. Gotta get a mention to the people who look at /. at work, gotta get a mention to the people who visit weeknights, gotta mention it for the weekend crowd.
From a coding/QA standpoint, one has to wonder how this bug was missed if the quality assurance team wasn't slacking off.
MSFT's QA team hasn't been slacking off. They haven't slacked on since about the mid 90s.
This kind of bug is where TDD shines. If you don't write any code unless you have a test that forces you to, it's very hard to produce this bug type.
(TDD = Test Driven Development)
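As a rough illustration of what "a test that forces you to" might look like here, this is a table of year-boundary cases in C. Everything in it is an assumption for the sake of the sketch: the convert_fn signature, the function names, and the convention that day 1 is 1 Jan 1980. The two converters are only there to make the example self-contained.

```c
#include <stddef.h>

/* Hypothetical signature for the routine under test: takes a 1-based day
   count since 1 Jan 1980, writes the year and the 1-based day of that year. */
typedef void (*convert_fn)(long days, int *year, int *day_of_year);

struct boundary_case { long days; int year; int day_of_year; };

/* Cases chosen around year boundaries, where off-by-one bugs live. */
static const struct boundary_case CASES[] = {
    {     1L, 1980,   1 },  /* first day of the epoch */
    {   366L, 1980, 366 },  /* last day of leap year 1980 */
    {   367L, 1981,   1 },  /* first day after a leap year */
    { 10593L, 2008, 366 },  /* 31 Dec 2008 */
    { 10594L, 2009,   1 },  /* 1 Jan 2009 */
};

/* Runs every case through `convert` and returns how many cases failed. */
int count_boundary_failures(convert_fn convert)
{
    int failures = 0;
    size_t i;
    for (i = 0; i < sizeof CASES / sizeof CASES[0]; ++i) {
        int year = 0, day = 0;
        convert(CASES[i].days, &year, &day);
        if (year != CASES[i].year || day != CASES[i].day_of_year)
            ++failures;
    }
    return failures;
}

/* Straightforward leap-aware converter used as a reference. */
void reference_convert(long days, int *year, int *day_of_year)
{
    int y = 1980;
    for (;;) {
        int leap = (y % 4 == 0 && y % 100 != 0) || (y % 400 == 0);
        int len = leap ? 366 : 365;
        if (days <= len) break;
        days -= len;
        ++y;
    }
    *year = y;
    *day_of_year = (int)days;
}

/* Deliberately wrong converter that ignores leap years. */
void naive_convert(long days, int *year, int *day_of_year)
{
    *year = 1980 + (int)((days - 1) / 365);
    *day_of_year = (int)((days - 1) % 365) + 1;
}
```

The reference converter passes all five cases and the naive one fails four of them. A converter that loops forever on day 366 would hang the test run, which is also hard to miss.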
For example, I had some code I developed on Windows CE 4.2 .NET which kept hanging when calling FindWindow().
It turns out that trying to find a window by class name will hang (this version of) CE every time, even though you would have thought it's a heavily used function and that the hang would have been caught.
So no I'm not surprised at all that this bug got through.
Windows Mobile is best described as a subset of platforms based on a Windows CE underpinning. Currently, Pocket PC (now called Windows Mobile Classic), SmartPhone (Windows Mobile Standard), and PocketPC Phone Edition (Windows Mobile Professional) are the three main platforms under the Windows Mobile umbrella. Each platform utilizes different components of Windows CE, as well as supplemental features and applications suited for their respective devices.
So, every smartphone/PDA that currently uses Windows Mobile uses some form of CE.
Looking at that code, it never had effective code review or Q/A. If I were the manager responsible I would be looking up those who signed off on the code in the last review. I spotted not one but four issues in that code, and I don't doubt more exist. Second, there are much simpler ways of doing this in the C libraries, and simplicity has value.
But the design, I suspect, is very flawed. Why not use asctime() and rely on its more proven calculations of leap year and the like via the OS libraries?
And when you see something like this, you know someone's brain was in the off position:
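On leaning on the C library: asctime() itself only formats a struct tm as text, but standard mktime() normalises out-of-range fields, which is enough to turn a day count into a year. The sketch below assumes a conforming <time.h> is available, which may not be true inside a Windows CE OAL, and the function name is my own invention.

```c
#include <time.h>

/* days: 1-based day count since 1 Jan 1980 (day 1 == 1 Jan 1980).
   Writes the calendar year and the 1-based day of that year.
   Returns 0 on success, -1 if mktime() cannot represent the date. */
int year_from_days_libc(long days, int *year, int *day_of_year)
{
    struct tm t = {0};
    t.tm_year  = 80;         /* years since 1900, i.e. 1980 */
    t.tm_mon   = 0;          /* January */
    t.tm_mday  = (int)days;  /* out of range on purpose; mktime() normalises it */
    t.tm_hour  = 12;         /* midday keeps clear of DST transitions */
    t.tm_isdst = -1;         /* let the library decide about DST */

    if (mktime(&t) == (time_t)-1)
        return -1;

    *year = t.tm_year + 1900;
    *day_of_year = t.tm_yday + 1;  /* tm_yday is 0-based */
    return 0;
}
```

Day 10593 (31 Dec 2008 under this numbering) comes back as year 2008, day 366.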
556 day -= 366;
557 year += 1;
Lines 122, 521, 690, 710, and 748 scare me; gotos in C code...
This was written by the Freescale guys, not MS, where it would make sense for the device manufacturer to ship their own date/time code.
Neither the original code nor the various corrections in the article capture what the algorithm is supposed to do, so they all end up more complicated than necessary.
The essence of the algorithm is this: We start with number of days since 1/Jan/1980, with the first day having the number one. We want to end up with the correct year, with a day number relative to the first day of that year, with the first day again having the number one. So we set year = 1980. And as long as day is greater than the number of days in that year, we can't have the right value yet, so we change day and year accordingly. This produces a very simple loop:
for (;;) {
    int daysInYear = IsLeapYear (year) ? 366 : 365;
    if (day <= daysInYear) break;
    day -= daysInYear; year += 1;
}
This is what Knuth called an "N + 1/2" loop: A loop pattern where a more or less substantial bit of code has to be executed at the beginning of the loop before we can decide whether the loop needs exiting or continuing. By following the "N+1/2 loop" pattern we avoid repeating the same code (with possible small changes) completely. And that exactly was the problem here: The same code was used twice but slightly differently (one set number of days = 365, the other made it dependent on whether the year was a leap year or not). The solutions given in the article all contain repeated code; either two loop exits, or a duplicated calculation of the number of days in a year.
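Wrapped into a complete function, the loop might look like the sketch below. The names ConvertDaysToYear and the IsLeapYear helper are stand-ins of my own; the driver's actual routines are not reproduced here.

```c
#include <assert.h>

/* Hypothetical helper implementing the Gregorian leap-year rule. */
static int IsLeapYear(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}

/* day: 1-based day count since 1 Jan 1980.
   Returns the year and writes the 1-based day within that year. */
int ConvertDaysToYear(long day, int *dayOfYear)
{
    int year = 1980;
    assert(day >= 1);
    for (;;) {
        /* The "half" iteration: compute the length of the current year... */
        int daysInYear = IsLeapYear(year) ? 366 : 365;
        /* ...then decide, in exactly one place, whether we are done. */
        if (day <= daysInYear) break;
        day -= daysInYear;
        year += 1;
    }
    *dayOfYear = (int)day;
    return year;
}
```

Because the year length is computed once per iteration and the exit test is the only comparison, there is no second copy of the logic that can disagree with the first.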
      integer function f_isleap(Year)
c
c Purpose: Return 0 if a year is NOT a leap year and 1 otherwise.
c
c Description: Every fourth year is a leap year.
c But NOT when divisible by 100, except if the year is divisible by 400.
c
      IMPLICIT NONE
      integer Year
      if ((MOD(Year,400).eq.0) .or.
     %    ((MOD(Year,4).eq.0) .and.
     %     (MOD(Year,100).ne.0))) then
         f_isleap=1
      else
         f_isleap=0
      endif
      return
      end
But of course FORTRAN is not fancy enough for super cool C# coders.
Comments in the last zune slashdot story (yesterday?) were just as detailed as this "story". Maybe slashdot editors should read their own site. Or maybe I should start submitting all +5 comments for their own story.
Do you even lift?
These aren't the 'roids you're looking for.
Exactly, it just goes to show the dangers of not QA'ing the whole codebase, including supplied drivers. You can't trust your own code, so you QA it; why should you trust your partner's code?
This code is actually from the Windows CE OAL (OEM Abstraction Layer), part of the code that reads the current time from the RTC. As such, the implementation is hardware-dependent, which is why there isn't a standard implementation of this function for Windows CE.
In addition, this code is in a portion of Windows CE source code provided by a device's BSP developer, not by Microsoft. In most cases, Windows CE BSP developers start with sample BSPs written by a processor's manufacturer -- in this case, Freescale -- and then improve it.
It turns out that this bug is specific to Freescale's BSP -- sample Windows CE BSPs for other processors don't have it -- and other Freescale devices using Windows CE will only have this issue if their developers used this code verbatim. Since sample BSPs provided by processor manufacturers are often of poor quality, many Windows CE developers typically rewrite such functions. In other words, the impact of this particular bug may be quite limited, which may be why there haven't been reports of this issue on other devices.
In this particular case, though, Microsoft (or a contractor) was the Zune's BSP developer, so they certainly should have caught this.
Wow. That link is to a book from a good web site: Free eBooks.
Other free books about C and C++: Free C and C++ books
I highly recommend that in cases like this, programmers be good Catholics and abide by the decree of Pope Gregory XIII. Software written to work with modern dates should use Gregorian, not Julian. Or did you mean ordinal?
From the article you linked to: The use of Julian date to refer to the day-of-year (ordinal date) is usually considered to be incorrect, however it is widely used that way in the earth sciences and computer programming.
http://xkcd.com/376/
What I see here is a really convoluted piece of code to perform a really simple task. There are a lot of constants that are written as literals. If there is a #define ORIGINYEAR, then why not #define daysperyear and #define daysperleapyear? The first is used only once, while the rest are used twice.
In any case, the fundamental problem is not encapsulating data. This is quite a common error in code architecture. In this case, this function knows a lot of things it does not need to know. It knows about leap years and the number of days, and all this confuses the reader. The layout of the function already has the overhead of a function call, so why not let this overhead work for us by returning not the proxy leap year boolean but what we actually want, which is the number of days in this year?
In this case all days per year information and leap year information is encapsulated in a single function, and the top function does not need to know about either. This, I think, is writing quality into code, and not depending on QA to catch mistakes common to novice programmers. No guarantee this will work as is; it is just pseudocode, and I have not even checked the logic completely.
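One way to read that suggestion in C is sketched below. The macro and function names are illustrative guesses on my part, not the poster's pseudocode and not the driver's identifiers.

```c
#define ORIGIN_YEAR         1980
#define DAYS_PER_YEAR        365
#define DAYS_PER_LEAP_YEAR   366

/* The only place that knows about leap years and year lengths. */
int DaysInYear(int year)
{
    int leap = (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
    return leap ? DAYS_PER_LEAP_YEAR : DAYS_PER_YEAR;
}

/* *day: in, 1-based day count since 1 Jan of ORIGIN_YEAR;
         out, 1-based day within the returned year.
   The caller-facing loop never mentions 365, 366, or leap years. */
int YearFromDays(long *day)
{
    int year = ORIGIN_YEAR;
    while (*day > DaysInYear(year)) {
        *day -= DaysInYear(year);
        year += 1;
    }
    return year;
}
```

The loop condition and the subtraction both go through DaysInYear, so they cannot drift apart the way two hand-written constants can.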
"She's a scientist and a lesbian. She's not going to let it slide." Orphan Black
Ah, thank you. This explains better why the 2nd-gen and 3rd-gen Zunes didn't suffer this problem; they were completely designed and developed in-house.
There's no place I could be, since I've found Serenity...
The proper way to do this would be with division and modulus, which gives you a nice constant time solution even if you're still using your Zune in 2108. They ought to read Calendrical Calculations by Nachum Dershowitz and Ed Reingold and learn how to do this properly.
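A sketch of the division-and-modulus approach, using the usual 400/100/4/1-year cycle decomposition, is below. This is my own illustration and not code from the book. It assumes day 1 is 1 Jan 1980, shifts the count to a 1 Jan 1601 origin so the cycles line up, and uses names I made up.

```c
#include <assert.h>

#define DAYS_1601_TO_1980 138426L  /* days from 1 Jan 1601 to 1 Jan 1980 */
#define DAYS_PER_400Y     146097L  /* 400 years including 97 leap days */
#define DAYS_PER_100Y      36524L  /* 100 years including 24 leap days */
#define DAYS_PER_4Y         1461L  /* 4 years including 1 leap day */

/* days: 1-based day count since 1 Jan 1980.
   Returns the year and writes the 1-based day of that year.
   No loop: the work is the same for 1980 and for 2108. */
int year_from_days_const(long days, int *day_of_year)
{
    long n, q400, q100, q4, q1;

    assert(days >= 1);
    n = (days - 1) + DAYS_1601_TO_1980;  /* 0-based days since 1 Jan 1601 */

    q400 = n / DAYS_PER_400Y;
    n   %= DAYS_PER_400Y;

    q100 = n / DAYS_PER_100Y;
    if (q100 == 4) q100 = 3;             /* last day of a 400-year cycle */
    n   -= q100 * DAYS_PER_100Y;

    q4   = n / DAYS_PER_4Y;
    n   %= DAYS_PER_4Y;

    q1   = n / 365;
    if (q1 == 4) q1 = 3;                 /* last day of a leap year */
    n   -= q1 * 365;

    *day_of_year = (int)n + 1;
    return (int)(1601 + 400 * q400 + 100 * q100 + 4 * q4 + q1);
}
```

Day 10593 maps to day 366 of 2008, and day 44195 maps to day 365 of 2100, so the century exception is handled as well.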
Way back in the pre-Cambrian days when I actually was a decent C programmer, there was a book chock-full of algorithms. I can't remember now if it was the "Stevens" book or the "Stevenson" book. It was our bible. Our guide. The holiest book in our bookshelf. Whenever we got the yen to do some programming, we always took out the "Stevens" book and asked ourselves "What Would Stevens Do?"
In this day, is there not one such book or place where someone says "Gee, I have to write some code that will calculate the date, day of week, and year from a fixed day. I wonder if I can look up this bit of code in some reference book, and do it right the first time?"
And then the second question: Why in the heck does the Zune care a fig about today's date? I believe there's some other device on the market that rhymes with "Shapple ShiPod" that does something similar to the Zune and yet doesn't care one whit about today's date. I won't claim that particular device is error free, but I bet you a couple of doughnuts that it won't freeze up the day before a big holiday because it doesn't realize that 2008 has 366 days in it.
First to finish is not always a good thing. Just ask your girlfriend.
You'd seriously use this for doing calculations between two dates on a modern calendar? You'd convert beginning-of-the-day-midnight to middle-of-the-day-midnight and back again? You'd flip a coin and decide whether or not to store dates internally in a common timezone? You'd add in your own leap years when necessary? (which brings us back to this bug - please look at what exactly the faulty source code was trying to do!)
There are very good reasons for internally storing dates as ordinal. But unless there is a good reason not to, please use your operating system's (or SQL database's, etc) native format/epoch for it, and please use their code, not your own, for calculating those dates. And if you do find yourself in a position where you're the one writing that code for others' benefit, please be at least as pedantic as I have been in this thread. Society at large is counting on you.
http://www.anythingbutipod.com/archives/2009/01/zune-bug-actually-a-freescale-bug-affecting-toshiba-gigabeats-too.php
http://www.popularculturegaming.com -- my blog about the culture of videogame players
"evidence of QA.. slacking off"
These comments routinely come from two groups:
1) Software Developers
2) Joe the Plumber
Or put another way: elitism or ignorance.
If a software division is letting QA "test" all on their own, that's a recipe for disaster... and it's the head of engineering at fault.
See, software testing does not occur in a vacuum, no more than developers code without a list of requirements from Sales or Marketing.
Engineering takes the requirements and uses them to produce an agreed-upon set of specifications.
QA follows the same model... they take the software specs and derive a set of effective tests.... tests which are agreed upon by Engineering, and signed off on.
When I did QA, it was mostly for startups who lacked this kind of process. The result was QA was always 2 steps behind software that continually morphed: hardware changed, or the customer changed their mind. I'm not placing the blame on any 1 group here... I come from Support, then QA, and now develop. Startups can be rough.
But at the end of the day, not documenting and agreeing on what the product and tests should be will cost you big time.. maybe 7 out of 10 times.
This kind of bug is where TDD shines.
I'm not so sure. Let's look at the timeline without TDD:
1) Microsoft writes method (say, one hour).
2) Microsoft discovers bug on December 31st, 2008
3) Microsoft spends one hour fixing bug (assuming documentation and source control and test of fix)
Now let's look at that timeline with TDD:
1) Microsoft writes method (let's say one hour again)
2) Microsoft writes test for method. Test includes random dates but not December 31st, 2008. One hour.
3) Microsoft discovers bug on December 31st, 2008
4) Microsoft spends one hour fixing bug (assuming documentation and source control and test of fix)
5) Microsoft updates test (one more hour to make sure all cases are caught)
Basically they spent more time on both ends, but very likely would not have prevented the error anyway. You could perhaps say the fix would take less time since there's less testing to be done, but that's not really true as you have to verify the (also simple) changes to the test suite are correct as well...
The only advantage that TDD would have given is one more chance for the developer to think about the possible edge case after the method was written. But I would argue that with anything that fundamental more time should have gone into initial development, and TDD is the death of a thousand cuts in terms of time to write and maintain tests. Over time that gets unwieldy - I'm a believer in tests when they are meaningful and light and do not detract too much from time spent improving the code instead of tests.
Indeed I can also see where TDD could well have caused this bug. Many TDD proponents would write the test first, and then code to it - which is just the kind of thinking that lets you relax when you should be at your most vigilant, actually writing the code that does the work. I find it a lot easier to consider possibilities when I am staring at a piece of code that does some work as opposed to compiling and coding a list of potential issues into a test.
Another potential issue is that tests tend to be written by the programmer who produced the original code, and of course the natural urge is to produce tests that fit the code as is, since the general thinking is to prevent bugs from future changes. I'm a huge believer (from experience) in the value of having a QA department that does nothing but write test code and makes sure the code always passes that. It works far better than programmers managing code and really produces quality efforts. Unfortunately it has the same ossifying effect where refactoring is harder as you go along because tests must be altered along with code.
"There is more worth loving than we have strength to love." - Brian Jay Stanley
Microsoft is famous for "stack-ranking" employees at review time. This means that somebody in every group will get a "worst of the group" review and somebody will get "best of the group". If you are in the bottom 10% at MSFT, you are never going to get a raise and you will eventually quit.
When has Microsoft ever actually done that? Apple has released updates that DELIBERATELY bricked devices (jailbroken iphones for one), but that's ok, yet when a Microsoft device breaks due to a very obvious bug (obvious in that it's obvious it IS a bug, not obvious in that it really should have been noticed - bugs do happen in pretty much ALL software) that has a stupidly simple fix (Let it drain the battery then turn it on again), suddenly the Conspiracy theories are out in full force and they're once again branded as the most Evil Corporation on the planet? Please.
There's so much you can bash Microsoft for (legitimately), why do you feel the need to actually make shit up?
Besides, from all the reports I've read so far, Windows 7 is actually looking to be a worthy Upgrade (if you're a windows user, that is - for anyone else, your mileage may vary) and I don't just mean from Vista, I mean from XP as well.
But no, it's easier to just hate the large, monolithic, rich company than accept that sometimes shit just happens.
+1 IDisagreeSoHeMustBeATrollOrAnAstroturferOrAShill
Ok, I called your bluff. I actually went and searched for it.
The VERY top link is this slashdot article which states:
"We've all heard the story of Microsoft's battle cry of "DOS ain't done till Lotus won't run". Adam Barr investigates the myth, interviewing various Microsoft and Lotus old-timers (including Mitch Kapor), and finds no basis for its legitimacy or any case of 1-2-3 actually not running. Whom to blame for Lotus Notes is not discussed."
I checked the next few links and they pretty much all pointed to the same article, namely this one. One site even described it as a "complete and utter annihilation of the myth".
I actually thought you were disagreeing with me, but now I see you were pointing out that people have been claiming the same thing for years and it was just as unfounded then as it is now. Thank you, I couldn't have said it better myself.
+1 IDisagreeSoHeMustBeATrollOrAnAstroturferOrAShill
Lots of things use Windows CE, which is fine.
The problem is with the Freescale Semiconductor's* RTC driver. So if you aren't using that specific chip and driver then CE is unaffected.
* No, this doesn't excuse MS from proper QA.
3laws: No freebies, no backsies, GTFO.
Ok, I'm getting sick of this claim. There is no proof that Apple has ever deliberately bricked devices. This is completely unfounded.
In fact, go back and look at the reports of iPhones breaking, and you'll see that most of them started working again with a later OS release. About the only thing that happens on upgrade with jailbroken phones these days is that they are locked again.
Check out DRM-free movies at http://www.bside.com
This should be the proper version of the quote:
I know from the actual Novell developers (I worked for Novell in 1991-92) that on multiple occasions, Microsoft modified a new DOS version between the last beta and the actual release, in such a way that Novell's NetWare client drivers stopped working.
Terje
"almost all programming can be viewed as an exercise in caching"
A lot more than Apple or Linux.
;).
With Linux, you can't even be sure that your hardware which was working fine on 2.4.x will continue to work fine in 2.4.y.
In contrast I believe many viruses written in the 1990s will still run fine on Windows XP in 2008
How about a much simpler reason - it plain well sucks giant donkey balls.
We're talking about a device which only works with Windows, is only available in a small number of countries (I don't give a shit about the music service - you can put music on it without a fucking music service so the need to 'roll out the service' is a bullshit excuse) and the software sucks balls.
It's a top to bottom epic failure - and it's in the mold of Microsoft NEVER to learn from these failures or, more correctly, learn from its rivals who are making gains. Then again, Microsoft is kinda like a mini-America: the world uses metric, the US uses imperial. The world uses 240V, and the US uses 110V, etc. etc.
Couldn't you just do this:
if((condition1 && condition2) || (issue1 || issue2)){
}
Most human behaviour can be explained in terms of identity.
How exactly do you think that divisions are implemented? Even in silicon? Did you realize that the number of cycles for a DIV instruction is high and dependent on the operand size?
Ever wondered why the x86 DIV was 14 cycles for 8-bit operands, 22 cycles for 16-bit and 38 cycles for 32-bit? (Hint: 6 cycles constant data access + 1 cycle per bit in the subtract/shift loop.)
And if the time your teacher told you that was a few years ago, before processors had a hardware divide instruction that implemented the loop in silicon, then the Pascal runtime had to implement division by a series of subtractions and halvings...
Now, if he told you that it just subtracted (without halving) then, yeah, he was wrong...
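For anyone who hasn't seen it, the subtract/shift loop can be sketched in C as below. This is the textbook restoring-division scheme with one iteration per bit; it is not a model of any particular CPU's microcode, and the function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 32-bit division by shift and subtract.
   Returns the quotient and writes the remainder to *rem.
   The divisor must be non-zero. */
uint32_t shift_subtract_div(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0;
    uint64_t r = 0;  /* 64 bits so the shifted remainder cannot overflow */
    int bit;

    assert(d != 0);
    for (bit = 31; bit >= 0; --bit) {
        /* Shift the next dividend bit into the running remainder. */
        r = (r << 1) | ((n >> bit) & 1u);
        /* If the divisor fits, subtract it and set this quotient bit. */
        if (r >= d) {
            r -= d;
            q |= (uint32_t)1 << bit;
        }
    }
    *rem = (uint32_t)r;
    return q;
}
```

The loop always runs once per operand bit, which is why the cost scales with operand width and not with the size of the quotient.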
John is that you?!
Has the boss decided what we should do if the thing has run out of storage and we can't log?
If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.
First of all, this was a braindead stupid bug. Unbelievably poor implementation of what should have been a fairly simple thing leads to an infinite loop on special days. Just looking at the damned loop without actually tracing through every possibility reveals an infinite loop at first glance. This was mindbogglingly stupid.
Second...Apple didn't "deliberately brick" devices. Your bias here is unbelievable. What Apple did was fix a bug that was allowing people to jailbreak and that caused problems for jailbroken phones. They fixed a security flaw, which caused something that took advantage of that security flaw to cease to function correctly. Now, personally I would like it if the iPhone didn't require jailbreaking to open it up, but fixing the flaw that allows people to break your security model is not "deliberately bricking". WGA is deliberately bricking, where it arbitrarily decides that you are invalid and shuts you off. In both cases it is incorrect usage of the word "brick" since either device can be easily recovered. So...to recap. Apple fixed a security flaw that caused bad news for people jailbreaking. Microsoft told your computer to call home every day so they could arbitrarily decide if you were valid or not and then shut you off if you weren't.
It is easier to hate the large monolithic rich company that uses illegal business practices, breaks the standards, and buys off the DoJ to avoid punishment (Go look at MS political contributions to either party before the trial...virtually nil...then the year they get busted...they contribute big bucks to both sides and walk away with a wrist slap). Trust me...big time criminals don't need cheerleaders like you to help them out. People like you are like the wife that gets her ass kicked and says "no, but he really loves me, he really is a good guy".
The only change I can believe in is what I find in my couch cushions.
So MACOS X is bug free? What about any licensed version of Unix? What about...oh I don't know...just about every piece of commercial software out there? Bugs happen, it's a sad shame, but it happens and no software is ever "bug free". This is why they measure bugs not by raw number, but by number per XXX lines of code - if you can keep your average below a set number, you're doing well.
I'm sure you're happy to let Linux be a tad buggy because it's open source and thus "free", but there's quite a few licensed distros you're supposed to pay for, what about those?
You should have properly researched your choice for buying Windows 95 at the time. Maybe it wasn't the right software for you, maybe you should have bought a Mac or installed Linux, but from what you're saying, you expect a piece of software to be absolutely perfect when you pay for it, so perfect that future, better versions of the OS should never be needed, so I'm fairly sure you would be bitching and complaining now that you had to upgrade no matter what you decided to buy.
+1 IDisagreeSoHeMustBeATrollOrAnAstroturferOrAShill