Slashdot Mirror


Azure Failure Was a Leap Year Glitch

judgecorp writes "Microsoft's Windows Azure cloud service was down for much of yesterday, and the cause was a leap year bug: the service failed to handle the 29th day of February. Faults propagated, making this a severe outage for many customers, including the UK Government's recently launched G-cloud service."

8 of 247 comments

  1. Re:Same Story / Different Day by g0bshiTe · · Score: 4, Insightful

    You would think that they would have remembered, or that some brilliant mind would have said "hey, don't forget leap days". They should have asked the janitor. Those guys know everything.

    --
    I am Bennett Haselton! I am Bennett Haselton!
  2. Everything MS does as "me too" sucks. by scorp1us · · Score: 4, Insightful

    It seems that all of MS's copied products (Hotmail, Azure, Zune) are done with a "me too" attitude of just having something so that they don't get left behind. They don't really try to make these "me too" products into industry leaders. But here's the catch: I know plenty of IT people who will always choose MS's offering because, as I was told, "you don't get in trouble for choosing MS". And that knowledge seems to be built into MS's offerings.

    --
    Slashdot's rate-of-post filter: Preventing you from posting too many great ideas at once.
  3. Single Point of Failure by Bicx · · Score: 5, Insightful

    This points out a serious flaw in the whole idea of cloud reliability by redundancy. You may have a million servers running across multiple countries, but if the distributed software for each virtual server has a bug, every server across the globe is affected. That's a single point of failure.

    1. Re:Single Point of Failure by Bert64 · · Score: 4, Insightful

      That's a flaw in the idea of a monoculture. True redundancy has different software implementing the same basic standards...
      Like how the Internet is built from routers made by different vendors: Cisco, Juniper, software-based Linux/BSD devices, etc. When a new DoS vulnerability is found in one vendor's kit, it doesn't take down the whole Internet, because other vendors are immune.

      --
      http://spamdecoy.net - free throwaway anonymous email - avoid spam!
  4. A leap year issue? Are you SERIOUS? by msobkow · · Score: 5, Insightful

    Given how many DECADES leap year calculations have had to be done and how many years it's been since we fixed the Y2K issues (at great expense, I might add), it is absolutely UNACCEPTABLE for someone to blame a leap year calculation for down time.

    The DIRECTOR of the service division at Microsoft should be FIRED for this failure.

    Expect lawsuits from customers, Microsoft. Because this was a problem you KNEW about and should have written code to deal with.

    What a pathetic excuse for planning and testing on Microsoft's part.

    --
    I do not fail; I succeed at finding out what does not work.
  5. Re:Dumb people never learn by Anonymous Coward · · Score: 1, Insightful

    Never trust Microsoft for anything.

    Never trust any vendor for anything.

    FTFY, blah blah blah....

    You whippersnappers don't remember what it was like doing business with IBM back when they ruled the world, and some of the things I see about Oracle here on Slashdot make me cringe.

  6. Re:Who could have foreseen a leap year coming? by jc42 · · Score: 5, Insightful

    Actually, every hundredth year is when a leap year doesn't come along (unless it's divisible by 400, then it does).

    Right; and I wonder how many computer failures will happen on the first of March, 2100, due to part of the software thinking it's the 29th of February, causing random problems while talking to other software that knows the correct date.

    We all know it's gonna happen ...
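
    For anyone who wants the rule spelled out, here is a minimal Python sketch of the Gregorian rule quoted above. The function name is_leap_year is just an illustrative choice, and the loop cross-checks it against the standard library's calendar.isleap.

```python
import calendar
from datetime import date, timedelta


def is_leap_year(year: int) -> bool:
    """Gregorian rule: every 4th year, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# Cross-check the hand-written rule against the standard library.
for y in (1900, 2000, 2012, 2100, 2400):
    assert is_leap_year(y) == calendar.isleap(y)

# 2100 is a century year not divisible by 400, so it has no 29 February:
# the day after 28 February 2100 is 1 March.
print(date(2100, 2, 28) + timedelta(days=1))  # 2100-03-01
```

    Software that only checks "divisible by 4" would call that day the 29th of February instead.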

    --
    Those who do study history are doomed to stand helplessly by while everyone else repeats it.
  7. Re:Same Story / Different Day by Dhalka226 · · Score: 5, Insightful

    I had a similar thought about code reuse, but an entirely different conclusion: I thought that they weren't re-using good code since the same problem has cropped up at least two times. That sounds more like a case of re-rolling things that definitely shouldn't be re-rolled (date/time handling) to me.

    In either event, they're not using particularly good practices. Either they are constantly reinventing the wheel and apparently in error-prone ways, or they are re-using code but paying no attention to keeping that external code up to date.

    The only other thing I can think of is that Azure is somehow so drastically different than anything else they have ever done that they had to do the code again from scratch -- which is probably a problem all by itself.
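
    To illustrate the "don't re-roll date handling" point, here is a small Python sketch. It is an assumption for illustration only; nothing in this thread says what the Azure code actually did. The helper name add_years is hypothetical. The idea is to let the date library reject an impossible date and then handle that one case on purpose, instead of bumping the year field by hand.

```python
from datetime import date


def add_years(d: date, years: int) -> date:
    """Return d shifted by `years`, clamping 29 Feb to 28 Feb in non-leap years."""
    try:
        # The library validates the result, so 29 Feb of a non-leap year raises.
        return d.replace(year=d.year + years)
    except ValueError:
        # For in-range years, only 29 February can fail here, so fall back to 28 Feb.
        return d.replace(year=d.year + years, day=28)


print(add_years(date(2012, 2, 29), 1))  # 2013-02-28
```

    Whether to clamp to 28 February or roll over to 1 March is a policy decision. Either way it gets made explicitly rather than discovered in production.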