Slashdot Mirror


Azure Failure Was a Leap Year Glitch

judgecorp writes "Microsoft's Windows Azure cloud service was down much of yesterday, and the cause was a leap year bug as the service failed to handle the 29th day of February. Faults propagated making this a severe outage for many customers, including the UK Government's recently launched G-cloud service."

22 of 247 comments

  1. Who could have foreseen a leap year coming? by elrous0 · · Score: 5, Funny

    Seriously, if my American high school education taught me nothing else, it was that those things only come along like every 100 years or something.

    --
    SJW: Someone who has run out of real oppression, and has to fake it.
    1. Re:Who could have foreseen a leap year coming? by Kamiza+Ikioi · · Score: 5, Funny

      In all fairness, Microsoft never figured anyone would still be using this service by the time a leap year rolled around.

      --
      I8-D
    2. Re:Who could have foreseen a leap year coming? by Anonymous Coward · · Score: 5, Funny

      In all fairness, Microsoft never figured anyone would still be using this service by the time a leap year rolled around.

      I work on the Azure team and I can confirm this.

    3. Re:Who could have foreseen a leap year coming? by jc42 · · Score: 5, Insightful

      Actually every hundredth year is when a leap year doesn't come along. (unless it's divisible by 400, then it does)

      Right; and I wonder how many computer failures will happen on the first of March, 2100, due to part of the software thinking it's the 29th of February, causing random problems while talking to other software that knows the correct date.

      We all know it's gonna happen ...
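
      For anyone who wants the rule spelled out, here is a minimal sketch in Python. The function names is_leap and naive_is_leap are made up for illustration; calendar.isleap is the standard library's version and is only used as a cross-check.

```python
import calendar
from datetime import date, timedelta

def is_leap(year: int) -> bool:
    """Gregorian rule: every 4th year, except century years,
    unless the century year is divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def naive_is_leap(year: int) -> bool:
    """The shortcut that happens to be right from 1901 through 2099."""
    return year % 4 == 0

# Cross-check the hand-written rule against the standard library.
for y in (1900, 2000, 2012, 2100, 2400):
    assert is_leap(y) == calendar.isleap(y)

# The two rules disagree in 2100, which is the failure described above.
disagree_2100 = is_leap(2100) != naive_is_leap(2100)

# Correct date arithmetic goes straight from Feb 28 to Mar 1 in 2100.
day_after = date(2100, 2, 28) + timedelta(days=1)
```

      Code built on the naive rule believes there is a Feb 29, 2100, while anything using a proper date library lands on March 1.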

      --
      Those who do study history are doomed to stand helplessly by while everyone else repeats it.
    4. Re:Who could have foreseen a leap year coming? by Alsee · · Score: 5, Funny

      Microsoft has solved the problem and applied a patch to their systems.
      The new patch is anticipated to keep the service up and stable for at least 4 years.

      --
      - - You can't take something off the Internet! That's like trying to take pee out of a swimming pool.
  2. Re:TCO TIC by tripleevenfall · · Score: 5, Funny

    Obviously you didn't inform yourself with the very helpful and informative "Get The Facts" materials Microsoft provided us with a few years ago. If you had you would know how much higher the TCO of Linux on the server is even after a massive outage.

  3. Same Story / Different Day by DownWithTheMan · · Score: 5, Funny

    Didn't this happen last leap year to the Zunes... oh yeah...

    1. Re:Same Story / Different Day by firex726 · · Score: 5, Interesting

      What is with MS and their apparent inability to cope with leap years?

    2. Re:Same Story / Different Day by UnknowingFool · · Score: 5, Interesting

      No it came from Freescale in a driver that Toshiba used. Not many know that the original Zune was a Toshiba Gigabeat with a new UI and outer shell.

      --
      Well, there's spam egg sausage and spam, that's not got much spam in it.
    3. Re:Same Story / Different Day by John+Hasler · · Score: 5, Funny

      Well, we knew it was a Microsoft product so we knew they bought it from someone.

      --
      Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
    4. Re:Same Story / Different Day by Dhalka226 · · Score: 5, Insightful

      I had a similar thought about code reuse, but an entirely different conclusion: I thought that they weren't re-using good code since the same problem has cropped up at least two times. That sounds more like a case of re-rolling things that definitely shouldn't be re-rolled (date/time handling) to me.

      In either event, they're not using particularly good practices. Either they are constantly reinventing the wheel and apparently in error-prone ways, or they are re-using code but paying no attention to keeping that external code up to date.

      The only other thing I can think of is that Azure is somehow so drastically different than anything else they have ever done that they had to do the code again from scratch -- which is probably a problem all by itself.

    5. Re:Same Story / Different Day by jc42 · · Score: 5, Interesting

      What is with MS and their apparent inability to cope with leap years?

      I would like to know the same thing. This seems to be systemic.

      Yeah; it's systemic. Or at least it used to be a few years back, and I wouldn't be surprised if they haven't fixed the basic problem yet. The problem is fairly simple: Windows' internal clock is in local time.

      To a programmer with experience writing date/time code, I've found that this is all you need to tell them. Any software whose internal clock is in local time will be buggy, and it will never be completely fixed. Attempts to fix bugs will merely introduce bugs elsewhere in the chains of date/time handling. The sensible solution is to adopt a "universal time" internally, and convert at the last stage when you present the date/time to a human user. Yes, you theoretically can work with local time internally, but (teams of) humans can't actually make this work in practice. The best they can do is make it work in the "normal" cases. Bug fixes then tend to just move the time bugs around to different places in the code. But it can be very difficult to get management to accept this and agree to UT-only internally.
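
      A rough sketch of the "UT internally, convert at the last stage" approach, in Python. The helper names (now_utc, to_display) and the fixed UTC-5 offset are assumptions for illustration only; a real application would more likely pass a zoneinfo.ZoneInfo object so that DST rules are applied.

```python
from datetime import datetime, timedelta, timezone, tzinfo

def now_utc() -> datetime:
    """Timestamps are created timezone-aware and always in UTC."""
    return datetime.now(timezone.utc)

def to_display(ts: datetime, tz: tzinfo) -> str:
    """Convert to the user's zone only when formatting for a human."""
    return ts.astimezone(tz).strftime("%Y-%m-%d %H:%M %Z")

# Hypothetical display zone: a fixed UTC-5 offset labelled "EST".
EST = timezone(timedelta(hours=-5), "EST")

# All storage and arithmetic happen on the UTC value.
stored = datetime(2012, 2, 29, 23, 30, tzinfo=timezone.utc)
later = stored + timedelta(hours=1)   # crosses into March 1 in UTC

shown = to_display(later, EST)        # still Feb 29 for this user
```

      The arithmetic never sees a local date, so neither DST switches nor the user's offset can change what "one hour later" means; only the final string depends on the zone.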

      Java also used to specify local time internally (and may still do so, but I haven't used it in years). I worked on a number of projects where, after repeated date/time disasters at every switch to/from DST and every Feb 29, Java was abandoned and everything was rewritten in a language (usually C++) whose libraries supported a UT timestamp and didn't have all those time bugs.

      Does anyone know if MS Windows has introduced a UT internal time yet? If not, then we can reliably predict that such bugs will continue to plague their users.

      --
      Those who do study history are doomed to stand helplessly by while everyone else repeats it.
  4. 28 days by ichthus · · Score: 5, Funny

    Well, this is all because 28 days in February ought to be enough for everyone.

    --
    sig: sauer
    1. Re:28 days by davidbrit2 · · Score: 5, Funny

      I always remember to put DEVICEHIGH=FEB.SYS into my config.sys every four years.

  5. In a new press conference.. by Anonymous Coward · · Score: 5, Funny

    Microsoft has told the press that they don't expect the Azure cloud service to fail again for years. In an unrelated schedule change, a down-for-maintenance slot was scheduled 4 years in advance.

  6. office in the cloud by Anonymous Coward · · Score: 5, Funny

    It's sold as Office 365 not Office 366

  7. Only Happens Every 4 Years by trongey · · Score: 5, Funny

    It's not Microsoft's fault; they're a publicly traded company so they can't think about multi-year events. They're prohibited from considering anything that is beyond the next fiscal quarter.

    --
    You never really know how close to the edge you can go until you fall off.
  8. Single Point of Failure by Bicx · · Score: 5, Insightful

    This points out a serious flaw in the whole idea of cloud reliability by redundancy. You may have a million servers running across multiple countries, but if the distributed software for each virtual server has a bug, every server across the globe is affected. That's a single point of failure.

  9. A leap year issue? Are you SERIOUS? by msobkow · · Score: 5, Insightful

    Given how many DECADES leap year calculations have had to be done and how many years it's been since we fixed the Y2K issues (at great expense, I might add), it is absolutely UNACCEPTABLE for someone to blame a leap year calculation for down time.

    The DIRECTOR of the service division at Microsoft should be FIRED for this failure.

    Expect lawsuits from customers, Microsoft. Because this was a problem you KNEW about and should have written code to deal with.

    What a pathetic excuse for planning and testing on Microsoft's part.

    --
    I do not fail; I succeed at finding out what does not work.
  10. It wasn't just Microsoft... by Anonymous Coward · · Score: 5, Interesting

    ...they just had the most publicly catastrophic failure. I just noticed that all of the Google Chat messages I received yesterday were sent to me at various times on December 31, 1969.

    And it also seems that I didn't even receive any of them until today, March 1, implying that they were incapable of even sending them yesterday.

  11. Attention Microsoft: by Howard+Beale · · Score: 5, Funny

    The following are leap years: 2016 2020 2024 2028 2032 2036 2040 You have been warned. After that, I'll probably be dead, so I won't care (unless Microsoft starts making pacemakers, which may end it for me...).

  12. Some of the most common leap-year bugs by tillerman35 · · Score: 5, Informative

    Some of the common leap year bugs that I've seen over the years:

    1. A matrix with the number of days per month:
    e.g. smallint dayspermonth[12]={31,28,31,30,31,30,31,31,30,31,30,31};
    Indexing into the matrix for February (index 1) ignores leap years.

    2. A matrix with 365 elements to represent a year's worth of something:
    e.g. smallint hightemps[365];
    This usually doesn't fail until Dec 31, when hightemps[mydate.dayofyear()-1] points to a non-existent element.
    Of course, if dayofyear is calculated using the matrix in the prior bug, it will fail invisibly since that will be incorrect
    as well.

    3. Quick-n-dirty subtract one year math:
    e.g. Convert date to char in YYYYDDMM format, convert char to int, subtract 10000, convert back to a char and then date.
    Why people do this when dateadd(year,mydate,-1) is that easy, I have no clue. But it breaks horridly when
    you use it to determine "one year ago today" from Feb 29.
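
    The three bugs above and one way around each can be sketched in Python. All names here (DAYS_PER_MONTH, days_in_month, days_in_year, naive_one_year_before, one_year_before) are made up for illustration, the integer trick assumes a YYYYMMDD layout, and falling back to Feb 28 is just one possible policy for "a year before Feb 29".

```python
import calendar
from datetime import date

# Bug 1: a fixed table has no way to know about leap years.
DAYS_PER_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

def days_in_month(year: int, month: int) -> int:
    # calendar.monthrange returns (weekday of the 1st, days in month).
    return calendar.monthrange(year, month)[1]

# Bug 2: size per-day arrays from the actual year, not a literal 365.
def days_in_year(year: int) -> int:
    return 366 if calendar.isleap(year) else 365

high_temps = [0.0] * days_in_year(2012)
last_index = date(2012, 12, 31).timetuple().tm_yday - 1  # day-of-year is 1-based

# Bug 3: integer math on a formatted date yields a non-existent Feb 29.
def naive_one_year_before(d: date) -> date:
    n = int(d.strftime("%Y%m%d")) - 10000
    s = str(n)
    return date(int(s[:4]), int(s[4:6]), int(s[6:]))  # ValueError for Feb 29

def one_year_before(d: date) -> date:
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        # Only Feb 29 can fail here; assumed policy: use Feb 28 instead.
        return d.replace(year=d.year - 1, day=28)
```

    Calling naive_one_year_before(date(2012, 2, 29)) raises ValueError, while one_year_before returns Feb 28, 2011. Whether Feb 28 or Mar 1 is the right answer depends on the application.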