Slashdot Mirror


Azure Failure Was a Leap Year Glitch

judgecorp writes "Microsoft's Windows Azure cloud service was down much of yesterday, and the cause was a leap year bug as the service failed to handle the 29th day of February. Faults propagated making this a severe outage for many customers, including the UK Government's recently launched G-cloud service."

42 of 247 comments

  1. Who could have foreseen a leap year coming? by elrous0 · · Score: 5, Funny

    Seriously, if my American high school education taught me nothing else, it was that those things only come along like every 100 years or something.

    --
    SJW: Someone who has run out of real oppression, and has to fake it.
    1. Re:Who could have foreseen a leap year coming? by tripleevenfall · · Score: 4, Funny

      Save us, Captain Obvious! *swoons* :P

    2. Re:Who could have foreseen a leap year coming? by Kamiza Ikioi · · Score: 5, Funny

      In all fairness, Microsoft never figured anyone would still be using this service by the time a leap year rolled around.

      --
      I8-D
    3. Re:Who could have foreseen a leap year coming? by Anonymous Coward · · Score: 5, Funny

      In all fairness, Microsoft never figured anyone would still be using this service by the time a leap year rolled around.

      I work on the Azure team and I can confirm this.

    4. Re:Who could have foreseen a leap year coming? by jc42 · · Score: 5, Insightful

      Actually every hundredth year is when a leap year doesn't come along (unless it's divisible by 400, then it does).

      Right; and I wonder how many computer failures will happen on the first of March, 2100, due to part of the software thinking it's the 29th of February, causing random problems while talking to other software that knows the correct date.

      We all know it's gonna happen ...
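
      For reference, the Gregorian rule quoted above fits in one line. This is a minimal sketch in Python; the function name is_leap_year is only illustrative, and the standard library's calendar.isleap applies the same rule.

```python
import calendar


def is_leap_year(year: int) -> bool:
    """Gregorian rule: every 4th year, except century years not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# Compare the hand-written rule with the standard library for a few years.
for year in (1900, 2000, 2012, 2100):
    assert is_leap_year(year) == calendar.isleap(year)
    print(year, is_leap_year(year))
```

      Code that only checks "divisible by 4" agrees with this for every year from 1901 through 2099, which is why the 2100 case is so easy to miss.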

      --
      Those who do study history are doomed to stand helplessly by while everyone else repeats it.
    5. Re:Who could have foreseen a leap year coming? by VortexCortex · · Score: 3, Interesting

      In all fairness, Microsoft never figured anyone would still be using this service by the time a leap year rolled around.

      Ah, that explains why Zunes went dark on New-Years 2009...

      Think about this. You're a software dev, and you use a MS C++ compiler. They wrote their standard libs, including the "time.h" / <ctime> code... you use their time libraries.
      Now two things:
      0. MS employs some real nut-jobs that can't even use the standard time functions and instead write their own for each project...
      or
      1. MS doesn't even trust their own compiler / libraries to do the right thing?

      It scares me to think that MS makes operating systems... IMHO, they should get back to BASICs.

    6. Re:Who could have foreseen a leap year coming? by Alsee · · Score: 5, Funny

      Microsoft has solved the problem and applied a patch to their systems.
      The new patch is anticipated to keep the service up and stable for at least 4 years.

      -

      --
      - - You can't take something off the Internet! That's like trying to take pee out of a swimming pool.
  2. Re:TCO TIC by tripleevenfall · · Score: 5, Funny

    Obviously you didn't inform yourself with the very helpful and informative "Get The Facts" materials Microsoft provided us with a few years ago. If you had you would know how much higher the TCO of Linux on the server is even after a massive outage.

  3. Same Story / Different Day by DownWithTheMan · · Score: 5, Funny

    Didn't this happen last leap year to the Zunes... oh yeah...

    1. Re:Same Story / Different Day by g0bshiTe · · Score: 4, Insightful

      You would think that they would have remembered, or some brilliant mind would have said "hey don't forget leap days", they should have asked the janitor. Those guys know everything.

      --
      I am Bennett Haselton! I am Bennett Haselton!
    2. Re:Same Story / Different Day by firex726 · · Score: 5, Interesting

      What is with MS and their apparent inability to cope with leap years?

    3. Re:Same Story / Different Day by robthebloke · · Score: 3, Funny

      Like how to brush the problem under the carpet for another 4 years?

    4. Re:Same Story / Different Day by UnknowingFool · · Score: 5, Interesting

      No it came from Freescale in a driver that Toshiba used. Not many know that the original Zune was a Toshiba Gigabeat with a new UI and outer shell.

      --
      Well, there's spam egg sausage and spam, that's not got much spam in it.
    5. Re:Same Story / Different Day by John Hasler · · Score: 5, Funny

      Well, we knew it was a Microsoft product so we knew they bought it from someone.

      --
      Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
    6. Re:Same Story / Different Day by tlhIngan · · Score: 3, Informative

      No it came from Freescale in a driver that Toshiba used. Not many know that the original Zune was a Toshiba Gigabeat with a new UI and outer shell.

      Yeah, it was a really stupid bug, especially when you consider the OS provides a very useful set of APIs for dealing with it (basically convert a SYSTEMTIME (day/month/year/mm/hh/ss) into a FILETIME (64-bit unsigned int similar to time_t), do your math (the compiler will handle the 64-bit computations for you) and convert it back). Two OS calls.

      If you're having to do leap year calculations or even any sort of date calculations, stop. The OS or library will probably already have a set of functions for doing date calculations without you having to do it manually. Given how easy they are to screw up, far better to leave it to someone else.

      Hell, given Windows worked fine, I don't even want to know what Azure is doing - the fundamental OS and runtimes all handle leap year date calculations with aplomb. Heck, that might be some of the oldest code in the kernel these days because it was written a long time ago, works well and has been thoroughly debugged through the decades.
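
      The "convert, do the math, convert back" pattern described above is not specific to Windows. Here is a rough Python analogue, not the Win32 calls themselves: the helper name add_days is made up for illustration, and UTC is assumed so that every day is exactly 86400 seconds long.

```python
from datetime import datetime, timezone


def add_days(year: int, month: int, day: int, days: int) -> tuple:
    """Turn calendar fields into a linear timestamp, add, and convert back."""
    start = datetime(year, month, day, tzinfo=timezone.utc)
    seconds = start.timestamp() + days * 86400  # plain arithmetic on a linear count
    end = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return end.year, end.month, end.day


print(add_days(2012, 2, 28, 1))  # (2012, 2, 29)
print(add_days(2013, 2, 28, 1))  # (2013, 3, 1)
```

      The calling code never mentions leap years at all; the library conversion in each direction takes care of the calendar.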

    7. Re:Same Story / Different Day by UnknowingFool · · Score: 4, Interesting

      According to the details I know it had to do with certificate validation. So part of Azure is using some code that doesn't use standard Windows APIs. Not shocking is that MS does not conform to standards. Shocking is that they don't conform to their own standards.

      --
      Well, there's spam egg sausage and spam, that's not got much spam in it.
    8. Re:Same Story / Different Day by Dhalka226 · · Score: 5, Insightful

      I had a similar thought about code reuse, but an entirely different conclusion: I thought that they weren't re-using good code since the same problem has cropped up at least two times. That sounds more like a case of re-rolling things that definitely shouldn't be re-rolled (date/time handling) to me.

      In either event, they're not using particularly good practices. Either they are constantly reinventing the wheel and apparently in error-prone ways, or they are re-using code but paying no attention to keeping that external code up to date.

      The only other thing I can think of is that Azure is somehow so drastically different than anything else they have ever done that they had to do the code again from scratch -- which is probably a problem all by itself.

    9. Re:Same Story / Different Day by jc42 · · Score: 5, Interesting

      What is with MS and their apparent inability to cope with leap years?

      I would like to know the same thing. This seems to be systemic.

      Yeah; it's systemic. Or at least it used to be a few years back, and I wouldn't be surprised if they haven't fixed the basic problem yet. The problem is fairly simple: Windows' internal clock is in local time.

      To a programmer with experience writing date/time code, I've found that this is all you need to tell them. Any software whose internal clock is in local time will be buggy, and it will never be completely fixed. Attempts to fix bugs will merely introduce bugs elsewhere in the chains of date/time handling. The sensible solution is to adopt a "universal time" internally, and convert at the last stage when you present the date/time to a human user. Yes, you theoretically can work with local time internally, but (teams of) humans can't actually make this work in practice. The best they can do is make it work in the "normal" cases. Bug fixes then tend to just move the time bugs around to different places in the code. But it can be very difficult to get management to accept this and agree to UT-only internally.
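
      A small Python sketch of that "universal time inside, local time only at the edge" approach follows. The fixed UTC-8 display offset and the helper names are assumptions for illustration; a real program would take the user's zone from a time zone database.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical display zone: a fixed offset, used only when formatting for a human.
DISPLAY_ZONE = timezone(timedelta(hours=-8), "UTC-8")


def elapsed_seconds(start: datetime, end: datetime) -> float:
    """All arithmetic happens on UTC values, so no DST or offset rules apply."""
    return (end - start).total_seconds()


def format_for_user(stamp: datetime, zone: timezone = DISPLAY_ZONE) -> str:
    """Convert to local time at the last step, purely for presentation."""
    return stamp.astimezone(zone).strftime("%Y-%m-%d %H:%M %Z")


start = datetime(2012, 2, 29, 23, 30, tzinfo=timezone.utc)
end = datetime(2012, 3, 1, 0, 30, tzinfo=timezone.utc)
print(elapsed_seconds(start, end))  # 3600.0
print(format_for_user(start))       # 2012-02-29 15:30 UTC-8
```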

      Java also used to specify local time internally (and may still do so, but I haven't used it in years). I worked on a number of projects where, after repeated date/time disasters at every switch to/from DST and every Feb 29, Java was abandoned and everything was rewritten in a language (usually C++) whose libraries supported a UT timestamp and didn't have all those time bugs.

      Does anyone know if MS Windows has introduced a UT internal time yet? If not, then we can reliably predict that such bugs will continue to plague their users.

      --
      Those who do study history are doomed to stand helplessly by while everyone else repeats it.
    10. Re:Same Story / Different Day by ekimminau · · Score: 3, Funny

      According to Microsoft all time started on Jan 1, 0001.

      http://msdn.microsoft.com/en-us/library/system.datetime.ticks.aspx

      No fricking wonder the "system idle process" uses 19% of a cpu. The OS is counting to a billion every second.

      Ooops. But they still do lots of things in 32-bit land, too.
      http://msdn.microsoft.com/en-us/library/system.datetime.aspx
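
      For scale: the Ticks property linked above counts 100-nanosecond intervals since midnight, January 1, 0001. The same count can be reproduced in Python as a sketch; to_ticks is an illustrative name, not a .NET API.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1, 1, 1)     # midnight, January 1, 0001
TICKS_PER_MICROSECOND = 10    # one tick is 100 nanoseconds


def to_ticks(moment: datetime) -> int:
    """Number of 100 ns intervals between the year-1 epoch and `moment`."""
    micros = (moment - EPOCH) // timedelta(microseconds=1)
    return micros * TICKS_PER_MICROSECOND


print(to_ticks(datetime(1970, 1, 1)))  # 621355968000000000
```

      A count that large needs a 64-bit integer, and it is derived from the clock when asked for, not incremented in a loop.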

      --
      Armaments, 2-9-21 And Saint Attila raised the hand grenade up on high, saying, 'O Lord, bless this Thy hand grenade' N
  4. 28 days by ichthus · · Score: 5, Funny

    Well, this is all because 28 days in February ought to be enough for everyone.

    --
    sig: sauer
    1. Re:28 days by davidbrit2 · · Score: 5, Funny

      I always remember to put DEVICEHIGH=FEB.SYS into my config.sys every four years.

  5. What is it with Microsoft and Leap Year? by madsci1016 · · Score: 3, Informative

    Anyone remember trying to turn on their Zune 3.5 years ago? That didn't work so well either.

    1. Re:What is it with Microsoft and Leap Year? by Kozz · · Score: 4, Informative

      Now, I'm not necessarily a Microsoft apologist, but I have to point out that it wasn't so long ago that other things near and dear to us geeks were experiencing similar problems.

      I was trying to run some ant scripts yesterday that interact with an FTP server to delete some files. Those damned files wouldn't get deleted. They weren't even returned from a listing command. As it turns out, I was using a particularly old version of Apache Commons-Net library (this jar file was from 2005) which had a leap-year bug. It simply would not show me files with modification dates of 2/29. I was looking at the FTP server configuration, logging in with other clients, moving and renaming files, and all about ready to break out Wireshark... and then it occurred to me that it was leap day. Hoo-fucking-ray. "touch"ed the file, and sure enough, it was suddenly available. Those are a few hours of my life I'll never get back.
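
      One way a listing parser can end up hiding Feb 29 files is sketched below. This is an assumption for illustration, not a description of the actual Commons-Net defect: Unix-style FTP listings often give recent files a "Mon DD HH:MM" stamp with no year, and a parser that fills in a fixed non-leap year cannot represent Feb 29.

```python
from datetime import datetime


def parse_listing_time(text: str, year: int) -> datetime:
    """Parse a year-less 'Mon DD HH:MM' stamp by supplying the year explicitly."""
    return datetime.strptime(f"{text} {year}", "%b %d %H:%M %Y")


# With the real year, the leap-day stamp parses normally.
print(parse_listing_time("Feb 29 10:15", 2012))  # 2012-02-29 10:15:00

# With a placeholder year that is not a leap year, the entry is rejected.
try:
    parse_listing_time("Feb 29 10:15", 1970)
except ValueError as err:
    print("rejected:", err)
```

      If the caller silently skips entries that fail to parse, the file simply vanishes from the listing, which matches the symptom described above.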

      --
      I only post comments when someone on the internet is wrong.
  6. In a new press conference.. by Anonymous Coward · · Score: 5, Funny

    Microsoft has told the press that they don't expect the Azure cloud service to fail again for years. In an unrelated schedule change, a down-for-maintenance slot was scheduled 4 years in advance.

  7. office in the cloud by Anonymous Coward · · Score: 5, Funny

    It's sold as Office 365 not Office 366

  8. Prepared for future by gmuslera · · Score: 3, Informative

    If they can't handle an exception that has been around for 2k years, what about newer exceptions? It would be interesting to see what happens next June 30.

    1. Re:Prepared for future by SuricouRaven · · Score: 4, Informative

      The leap year specification was only written in 1582. So it isn't 2k years old.

  9. Everything MS does as "me too" sucks. by scorp1us · · Score: 4, Insightful

    It seems that all of MS's copied products - Hotmail, Azure, Zune - are done with a "me too" attitude of just having something so that they don't get left behind. They don't really try to make these "me too" products into industry leaders. But here's the catch. I know plenty of IT people who will always choose MS's offering because, as I was told, "you don't get in trouble for choosing MS". And that knowledge seems to be built into MS's offerings.

    --
    Slashdot's rate-of-post filter: Preventing you from posting too many great ideas at once.
  10. Only Happens Every 4 Years by trongey · · Score: 5, Funny

    It's not Microsoft's fault; they're a publicly traded company so they can't think about multi-year events. They're prohibited from considering anything that is beyond the next fiscal quarter.

    --
    You never really know how close to the edge you can go until you fall off.
  11. Single Point of Failure by Bicx · · Score: 5, Insightful

    This points out a serious flaw in the whole idea of cloud reliability by redundancy. You may have a million servers running across multiple countries, but if the distributed software for each virtual server has a bug, every server across the globe is affected. That's a single point of failure.

    1. Re:Single Point of Failure by Bert64 · · Score: 4, Insightful

      That's a flaw in the idea of a monoculture; true redundancy has different software implementing the same basic standards...
      Like how the Internet is built from routers made by different vendors: Cisco, Juniper, software-based Linux/BSD devices, etc. When new DoS vulnerabilities are found in one vendor's kit, it doesn't take down the whole Internet, because other vendors are immune.

      --
      http://spamdecoy.net - free throwaway anonymous email - avoid spam!
  12. A leap year issue? Are you SERIOUS? by msobkow · · Score: 5, Insightful

    Given how many DECADES leap year calculations have had to be done and how many years it's been since we fixed the Y2K issues (at great expense, I might add), it is absolutely UNACCEPTABLE for someone to blame a leap year calculation for down time.

    The DIRECTOR of the service division at Microsoft should be FIRED for this failure.

    Expect lawsuits from customers, Microsoft. Because this was a problem you KNEW about and should have written code to deal with.

    What a pathetic excuse for planning and testing on Microsoft's part.

    --
    I do not fail; I succeed at finding out what does not work.
    1. Re:A leap year issue? Are you SERIOUS? by Anonymous Coward · · Score: 3, Funny

      Shouldn't 'pathetic' be in uppercase?

  13. It wasn't just Microsoft... by Anonymous Coward · · Score: 5, Interesting

    ...they just had the most publicly catastrophic failure. I just noticed that all of the Google Chat messages I received yesterday were sent to me at various times on December 31, 1969.

    And it also seems that I didn't even receive any of them until today, March 1, implying that they were incapable of even sending them yesterday.

  14. Arthur David Olson is my hero by Bruce Perens · · Score: 4, Informative

    30 years ago, Arthur David Olson started engineering a solution to this problem that persists to this day, and which he supported personally for all but the last few months. The systems I have that run his software have never even burped through legislative changes of the calendar, leap-seconds, and the Century leap-year day, which is a separate cycle from the 4-year one.

  15. Re:Dumb people never learn by Tanktalus · · Score: 4, Funny

    Hey! My MS4000 keyboard and MS mouse are working jut fine.

  16. Attention Microsoft: by Howard Beale · · Score: 5, Funny

    The following are leap years: 2016 2020 2024 2028 2032 2036 2040 You have been warned. After that, I'll probably be dead, so I won't care (unless Microsoft starts making pacemakers, which may end it for me...).

    1. Re:Attention Microsoft: by forkfail · · Score: 3, Funny

      The thought of an MS pacemaker EULA is pretty scary....

      --
      Check your premises.
  17. Some of the most common leap-year bugs by tillerman35 · · Score: 5, Informative

    Some of the common leap year bugs that I've seen over the years:

    1. A matrix with the number of days per month:
    e.g. smallint dayspermonth[12]={31,28,31,30,31,30,31,31,30,31,30,31};
    Indexing into the matrix for February (index 1) ignores leap years.

    2. A matrix with 365 elements to represent a year's worth of something:
    e.g. smallint hightemps[365];
    This usually doesn't fail until Dec 31, when hightemps[mydate.dayofyear()-1] points to a non-existent element.
    Of course, if dayofyear is calculated using the matrix in the prior bug, it will fail invisibly since that will be incorrect
    as well.

    3. Quick-n-dirty subtract one year math:
    e.g. Convert date to char in YYYYDDMM format, convert char to int, subtract 10000, convert back to a char and then date.
    Why people do this when dateadd(year,mydate,-1) is that easy, I have no clue. But it breaks horridly when
    you use it to determine "one year ago today" from Feb 29.
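
    For comparison, each of the three bugs above has a library-backed counterpart. This is a sketch in Python with illustrative helper names; clamping Feb 29 to Feb 28 is one possible policy, not the only defensible one.

```python
import calendar
from datetime import date


def days_in_month(year: int, month: int) -> int:
    """Ask the calendar instead of indexing a fixed 12-entry table."""
    return calendar.monthrange(year, month)[1]


def new_year_table(year: int) -> list:
    """One slot per day of the given year: 365 or 366 entries."""
    return [None] * (366 if calendar.isleap(year) else 365)


def one_year_before(day: date) -> date:
    """Same calendar day a year earlier; Feb 29 is clamped to Feb 28."""
    try:
        return day.replace(year=day.year - 1)
    except ValueError:  # Feb 29 has no counterpart in a non-leap year
        return day.replace(year=day.year - 1, day=28)


print(days_in_month(2012, 2))  # 29

high_temps = new_year_table(2012)
high_temps[date(2012, 12, 31).timetuple().tm_yday - 1] = 40  # index 365 exists
print(len(high_temps))  # 366

print(one_year_before(date(2012, 2, 29)))  # 2011-02-28
```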

  18. Re:Dumb people never learn by MadKeithV · · Score: 4, Funny

    Hey! My MS4000 keyboard and MS mouse are working jut fine.

    I see what you did there.

  19. Microsoft Never Has Been Good At Time by Greyfox · · Score: 4, Interesting
    Dealing with time is hard, but it's been amusing to watch them experience problems solved by UNIX decades earlier. Daylight savings time was a constant problem for them in the early days, though they seem to have mostly got that ironed out. Every so often they seem to have a regression for a piece of new hardware. Maybe they'll eventually get it right.

    Funnily enough, I used to work at IBM doing OS/2 tech support. OS/2 and Windows NT share a common heritage, so a lot of the behind-the-scenes problems I witnessed in OS/2 were (and sometimes still are) problems with Windows. I'm not sure if this is one of them, but I got a call once from a guy who was trying to use his OS/2 system to track satellites. The problem was, the OS/2 timer API specified that you could set milliseconds but it didn't seem to work. I tracked it down to a timing driver which tracked two separate interrupts. The first interrupt happened every few milliseconds and would update the clock millis when that happened. However, if the system was busy it was possible to not handle that interrupt. There was also a system periodic interrupt every 1 second. When that occurred, the system hard-reset the milli time and incremented the seconds. So you could set the millis, but the clock would become inaccurate 1 second later. Just one example of how time has been a thorn in my side for my entire career. I wrote an APAR up on it which was promptly closed "Working as Designed." Dunno if he ever got it fixed...

    --

    I'm trying to teach myself to set people on fire with my mind... Is it hot in here?

  20. Re:What a shame by TheCRAIGGERS · · Score: 4, Funny

    We still see this kind of XXXX coming up every leap year.

    We're all adults (or close enough to it, anyway) here. I think we're all capable of seeing the word "shit" without our faces melting like that nazi who peeped in the ark.

    My apologies to everyone who is now having their face melt off after reading that previous sentence.