Azure Failure Was a Leap Year Glitch
judgecorp writes "Microsoft's Windows Azure cloud service was down much of yesterday, and the cause was a leap year bug as the service failed to handle the 29th day of February. Faults propagated making this a severe outage for many customers, including the UK Government's recently launched G-cloud service."
Seriously, if my American high school education taught me nothing else, it was that those things only come along like every 100 years or something.
SJW: Someone who has run out of real oppression, and has to fake it.
Obviously you didn't inform yourself with the very helpful and informative "Get The Facts" materials Microsoft provided us with a few years ago. If you had, you would know how much higher the TCO of Linux on the server is, even after a massive outage.
Didn't this happen last leap year to the Zunes... oh yeah...
Well, this is all because 28 days in February ought to be enough for everyone.
sig: sauer
Anyone remember trying to turn on their Zune 3.5 years ago? That didn't work so well either.
Microsoft has told the press that they don't expect the Azure cloud service to fail again for years. In an unrelated schedule change, a down-for-maintenance slot was scheduled 4 years in advance.
It's sold as Office 365 not Office 366
If they can't handle an exception that has been around for 2,000 years, what about newer exceptions? It would be interesting to see what happens next June 30.
Correct me if I'm wrong, but they could have avoided this by relying on the UNIX epoch. Same with Y2K. But beware Y2K38, you 32-bit users!
~theCzar
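For anyone who wants to check the Y2K38 warning above, here is a minimal sketch. It assumes time is stored as a signed 32-bit count of seconds since 1970-01-01 00:00:00 UTC; the helper names `epoch_to_utc` and `wrap_int32` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_INT32 = 2**31 - 1  # largest value a signed 32-bit counter can hold


def epoch_to_utc(seconds: int) -> datetime:
    """Convert seconds since the UNIX epoch to an aware UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


def wrap_int32(n: int) -> int:
    """Reinterpret an integer as a signed 32-bit value (illustrative helper)."""
    return (n + 2**31) % 2**32 - 2**31


# Last second a signed 32-bit counter can represent.
last_moment = epoch_to_utc(MAX_INT32)
# One second later the counter wraps around to the most negative value.
after_wrap = epoch_to_utc(wrap_int32(MAX_INT32 + 1))

print(last_moment.isoformat())  # 2038-01-19T03:14:07+00:00
print(after_wrap.isoformat())   # 1901-12-13T20:45:52+00:00
```

A 64-bit seconds counter does not hit this limit for any practical date.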
It seems that all of MS's copied products (Hotmail, Azure, Zune) are done with a "me too" attitude of just having something so that they don't get left behind. They don't really try to make these "me too" products into industry leaders. But here's the catch. I know plenty of IT people who will always choose MS's offering because, as I was told, "you don't get in trouble for choosing MS". And that knowledge seems to be built into MS's offerings.
Slashdot's rate-of-post filter: Preventing you from posting too many great ideas at once.
It's not Microsoft's fault; they're a publicly traded company so they can't think about multi-year events. They're prohibited from considering anything that is beyond the next fiscal quarter.
You never really know how close to the edge you can go until you fall off.
This points out a serious flaw in the whole idea of cloud reliability by redundancy. You may have a million servers running across multiple countries, but if the distributed software for each virtual server has a bug, every server across the globe is affected. That's a single point of failure.
Given how many DECADES leap year calculations have had to be done and how many years it's been since we fixed the Y2K issues (at great expense, I might add), it is absolutely UNACCEPTABLE for someone to blame a leap year calculation for downtime.
The DIRECTOR of the service division at Microsoft should be FIRED for this failure.
Expect lawsuits from customers, Microsoft. Because this was a problem you KNEW about and should have written code to deal with.
What a pathetic excuse for planning and testing on Microsoft's part.
I do not fail; I succeed at finding out what does not work.
...they just had the most publicly catastrophic failure. I just noticed that all of the Google Chat messages I received yesterday were sent to me at various times on December 31, 1969.
And it also seems that I didn't even receive any of them until today, March 1, implying that they were incapable of even sending them yesterday.
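One plausible reading of the December 31, 1969 dates above (an assumption, not something confirmed here) is that the messages carried a UNIX timestamp of zero and were displayed in a US time zone. A minimal sketch, using a fixed UTC-5 offset as a stand-in for the reader's zone:

```python
from datetime import datetime, timedelta, timezone

# UNIX timestamp 0 is midnight UTC on January 1, 1970.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Assumption: a fixed UTC-5 offset stands in for a US Eastern reader.
US_EASTERN_STANDARD = timezone(timedelta(hours=-5))

zero_timestamp_local = EPOCH.astimezone(US_EASTERN_STANDARD)
print(zero_timestamp_local.isoformat())  # 1969-12-31T19:00:00-05:00
```

Any zone west of UTC shows a zero timestamp as some time on December 31, 1969.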
30 years ago, Arthur David Olson started engineering a solution to this problem that persists to this day, and which he supported personally for all but the last few months. The systems I have that run his software have never even burped through legislative changes of the calendar, leap-seconds, and the Century leap-year day, which is a separate cycle from the 4-year one.
Bruce Perens.
True
Apparently MS only hires people with no concept of the future.
See the Zune event, see this. Absolute disregard of date concepts (and testing).
Now, Linux development worries about this THOROUGHLY (I mean the kernel and main libs; of course a sw developer can get this wrong on Linux as well).
No counter wrap-around fsck-ups, date exceptions, etc (people are watching this)
Remember the Windows bug where it would crash after 15 days or so?
how long until
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
Hey! My MS4000 keyboard and MS mouse are working jut fine.
The following are leap years: 2016 2020 2024 2028 2032 2036 2040 You have been warned. After that, I'll probably be dead, so I won't care (unless Microsoft starts making pacemakers, which may end it for me...).
Some of the common leap year bugs that I've seen over the years:
1. A matrix with the number of days per month:
e.g. smallint dayspermonth[12]={31,28,31,30,31,30,31,31,30,31,30,31};
Indexing into the matrix for February (index 1) ignores leap years.
2. A matrix with 365 elements to represent a year's worth of something:
e.g. smallint hightemps[365];
This usually doesn't fail until Dec 31, when hightemps[mydate.dayofyear()-1] points to a non-existent element.
Of course, if dayofyear is calculated using the matrix in the prior bug, it will fail invisibly since that will be incorrect
as well.
3. Quick-n-dirty subtract one year math:
e.g. Convert date to char in YYYYMMDD format, convert char to int, subtract 10000, convert back to a char and then date.
Why people do this when dateadd(year,mydate,-1) is that easy, I have no clue. But it breaks horridly when
you use it to determine "one year ago today" from Feb 29.
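The three bugs above can be sidestepped roughly like this. It is a sketch only, in Python rather than the SQL-ish pseudocode of the examples, and the function names are made up for illustration. Mapping Feb 29 to Feb 28 of the previous year is an assumed policy; some applications may prefer Mar 1.

```python
import calendar
from datetime import date


def is_leap_year(year: int) -> bool:
    """Gregorian rule: every 4th year, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def days_in_month(year: int, month: int) -> int:
    """Bug 1: ask the calendar instead of a fixed 12-element table."""
    # monthrange returns (weekday of the 1st, number of days in the month)
    return calendar.monthrange(year, month)[1]


def days_in_year(year: int) -> int:
    """Bug 2: size per-day arrays from the year, not a hard-coded 365."""
    return 366 if is_leap_year(year) else 365


def one_year_before(d: date) -> date:
    """Bug 3: subtract a year, with an explicit policy for Feb 29."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        # Feb 29 has no counterpart in the previous year.
        # Assumed policy: fall back to Feb 28.
        return d.replace(year=d.year - 1, day=28)


hightemps = [None] * days_in_year(2012)
last_index = date(2012, 12, 31).timetuple().tm_yday - 1  # 365, still in range
print(days_in_month(2012, 2), len(hightemps), one_year_before(date(2012, 2, 29)))
```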
Hey! My MS4000 keyboard and MS mouse are working jut fine.
I see what you did there.
Funnily enough, I used to work at IBM doing OS/2 tech support. OS/2 and Windows NT share a common heritage, so a lot of the behind-the-scenes problems I witnessed in OS/2 were (And sometimes still are) problems with Windows. I'm not sure if this is one of them, but I got a call once from a guy who was trying to use his OS/2 system to track satellites. The problem was, the OS/2 timer API specified that you could set milliseconds but it didn't seem to work. I tracked it down to a timing driver which tracked two separate interrupts. The first interrupt happened every few milliseconds and would update the clock millis when that happened. However, if the system was busy it was possible to not handle that interrupt. There was also a system periodic interrupt every 1 second. When that occurred, the system hard-reset the milli time and incremented the seconds. So you could set the millis, but the clock would become inaccurate 1 second later. Just one example of how time has been a thorn in my side for my entire career. I wrote an APAR up on it which was promptly closed "Working as Designed." Dunno if he ever got it fixed...
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
We still see this kind of XXXX coming up every leap year.
We're all adults (or close enough to it, anyway) here. I think we're all capable of seeing the word "shit" without our faces melting like that Nazi who peeped in the Ark.
My apologies to everyone who is now having their face melt off after reading that previous sentence.
It was 49.7 days: http://news.cnet.com/Windows-may-crash-after-49.7-days/2100-1040_3-222391.html
And still inexcusable.
I remember that one - it wasn't a crash in the usual sense, where something stops working completely. It was far more insidious than that. Everything still looked as if it was working; the cursor moved when you moved the mouse, icons would highlight if single-clicked, but double-click would refuse to play...
IIRC, the 49.7 days is 2^32 milliseconds
"She's furniture with a pulse"
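The 49.7-day figure can be checked with quick arithmetic. This sketch assumes the counter in question is an unsigned 32-bit count of milliseconds; a 16-bit count of seconds is included only for comparison.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000
SECONDS_PER_HOUR = 60 * 60

# Time until an unsigned 32-bit millisecond counter wraps around.
days_until_32bit_ms_wrap = 2**32 / MS_PER_DAY

# For comparison: a 16-bit counter of seconds wraps in well under a day.
hours_until_16bit_s_wrap = 2**16 / SECONDS_PER_HOUR

print(f"{days_until_32bit_ms_wrap:.1f} days")   # 49.7 days
print(f"{hours_until_16bit_s_wrap:.1f} hours")  # 18.2 hours
```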