Azure Failure Was a Leap Year Glitch
judgecorp writes "Microsoft's Windows Azure cloud service was down much of yesterday, and the cause was a leap year bug as the service failed to handle the 29th day of February. Faults propagated making this a severe outage for many customers, including the UK Government's recently launched G-cloud service."
Seriously, if my American high school education taught me nothing else, it was that those things only come along like every 100 years or something.
SJW: Someone who has run out of real oppression, and has to fake it.
Obviously you didn't inform yourself with the very helpful and informative "Get The Facts" materials Microsoft provided us with a few years ago. If you had you would know how much higher the TCO of Linux on the server is even after a massive outage.
Didn't this happen last leap year to the Zunes... oh yeah...
Well, this is all because 28 days in February ought to be enough for everyone.
sig: sauer
We still see this kind of XXXX coming up every leap year.
Colorless green Cthulhu waits dreaming furiously.
Anyone remember trying to turn on their Zune 3.5 years ago? That didn't work so well either.
This is probably part of the reason why the cloud really hasn't taken off in the corporate sector, and it's no wonder why.
I am Bennett Haselton! I am Bennett Haselton!
Microsoft has told the press that they don't expect the Azure cloud service to fail again for years. In an unrelated schedule change, a down-for-maintenance slot was scheduled 4 years in advance.
It's sold as Office 365 not Office 366
If they can't handle an exception that has been around for 2,000 years, what about newer exceptions? It will be interesting to see what happens next June 30.
Oh Man! That's so embarrassing! :0)
The purpose of existence is to make money.
Correct me if I'm wrong, but they could have avoided this by relying on the UNIX epoch. Same with Y2K. But beware Y2K38, you 32-bit users!
~theCzar
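For anyone wondering where the Y2K38 limit comes from, here is a minimal sketch in Python. It assumes time is stored as a signed 32-bit count of seconds since the UNIX epoch; `wrap_int32` is a hypothetical helper that only simulates the overflow.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_INT32 = 2 ** 31 - 1  # largest value a signed 32-bit time_t can hold

# The last second a signed 32-bit counter can represent.
last_moment = EPOCH + timedelta(seconds=MAX_INT32)
print(last_moment.isoformat())  # 2038-01-19T03:14:07+00:00

def wrap_int32(value: int) -> int:
    """Simulate signed 32-bit overflow (two's complement wraparound)."""
    return (value + 2 ** 31) % 2 ** 32 - 2 ** 31

# One second later the counter wraps to the most negative value,
# which lands in December 1901 instead of January 2038.
wrapped = wrap_int32(MAX_INT32 + 1)
after_wrap = EPOCH + timedelta(seconds=wrapped)
print(after_wrap.year)  # 1901
```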
Never trust Microsoft for anything.
If we could peek into one of the container-trailers at the M$ server-farm in Oregon would we find stacks of 386-AT&T-clones running DOS 4.4 and Lotus 1-2-3 doing the calculations?
It seems that all of MS's copied products - hotmail, Azure, Zune are all done with a "me too" attitude of just having something so that they don't get left behind. They don't really try to make these "me too" products as industry leaders. But here's the catch. I know plenty of IT people who will always choose MS's offering because, as I was told "you don't get in trouble for choosing MS". And that knowledge seems to be built into MS's offerings.
Slashdot's rate-of-post filter: Preventing you from posting too many great ideas at once.
If MS starts making excuses for this mini fiasco, they will only manage to make Amazon, Google and Apple look like fucking geniuses
"The agriculture ministry is not in charge of Gundam" - Japanese ministry official.
It's not Microsoft's fault; they're a publicly traded company so they can't think about multi-year events. They're prohibited from considering anything that is beyond the next fiscal quarter.
You never really know how close to the edge you can go until you fall off.
Simply inexcusable.
The Kruger Dunning explains most post on
This points out a serious flaw in the whole idea of cloud reliability by redundancy. You may have a million servers running across multiple countries, but if the distributed software for each virtual server has a bug, every server across the globe is affected. That's a single point of failure.
Given how many DECADES leap year calculations have had to be done and how many years it's been since we fixed the Y2K issues (at great expense, I might add), it is absolutely UNACCEPTABLE for someone to blame a leap year calculation for down time.
The DIRECTOR of the service division at Microsoft should be FIRED for this failure.
Expect lawsuits from customers, Microsoft. Because this was a problem you KNEW about and should have written code to deal with.
What a pathetic excuse for planning and testing on Microsoft's part.
I do not fail; I succeed at finding out what does not work.
I thought we fixed all the clock stuff back in 2000? Guess not.
...they just had the most publicly catastrophic failure. I just noticed that all of the Google Chat messages I received yesterday were sent to me at various times on December 31, 1969.
And it also seems that I didn't even receive any of them until today, March 1, implying that they were incapable of even sending them yesterday.
Haha, my "Get The Facts" materials came with a free subscription to Wired if I do recall. The funny part is, I never actually got around to reading the Microsoft stuff - our receptionist recycled it without hesitation. I knew this had happened when I saw only a copy of Wired on my desk.
Done right, the date/time implementation being used should have had no problems automatically adjusting to a leap year.
This is shoddy coding, pure and simple.
This does not reflect well on those developing on MS platforms.
In the year 2012, it is unthinkable for a date/time implementation to not handle this gracefully.
Did they hire an intern to code up some half baked date class?
30 years ago, Arthur David Olson started engineering a solution to this problem that persists to this day, and which he supported personally for all but the last few months. The systems I have that run his software have never even burped through legislative changes of the calendar, leap-seconds, and the Century leap-year day, which is a separate cycle from the 4-year one.
Bruce Perens.
...one giant leap for the rest of us...?
(edit: seriously, my captcha on this is "doomsday" :-D)
The story yesterday said that they were having a problem with certificate validation. The routine they were using to validate certificate expiration must not have been able to handle the leap year. I wonder what non-standard API they were using to process the expiration date. That reminds me of another article that I read yesterday.
Bet it won't happen again next year!
Idiots!
Aren't they using any unit testing? If they are, isn't a leap year an obvious test case?
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
While we jab at MS for the Zune fiasco, in their defense, they didn't write the subroutine that caused the problem, and the most that happened was that some of their customers could not play music on their Zunes for a while. Not a mission-critical situation. But what the hell kind of calendar system is MS using in their mission-critical software that cannot deal with leap years, which come every 4 years?
Well, there's spam egg sausage and spam, that's not got much spam in it.
I seem to remember auditors at my first job (mid 1990's) telling me about needing to account for Excel's leap year problem when they used it. I can only find this issue today:
http://en.wikipedia.org/wiki/Year_1900_problem
The following are leap years: 2016 2020 2024 2028 2032 2036 2040 You have been warned. After that, I'll probably be dead, so I won't care (unless Microsoft starts making pacemakers, which may end it for me...).
NTR
Some of the common leap year bugs that I've seen over the years:
1. An array with the number of days per month:
e.g. smallint dayspermonth[12]={31,28,31,30,31,30,31,31,30,31,30,31};
Indexing into the array for February (index 1) ignores leap years.
2. An array with 365 elements to represent a year's worth of something:
e.g. smallint hightemps[365];
This usually doesn't fail until Dec 31 of a leap year, when hightemps[mydate.dayofyear()-1] points to a non-existent element.
Of course, if dayofyear is calculated using the array in the prior bug, it will fail invisibly since that will be incorrect as well.
3. Quick-n-dirty subtract-one-year math:
e.g. Convert date to char in YYYYMMDD format, convert char to int, subtract 10000, convert back to a char and then date.
Why people do this when dateadd(year,mydate,-1) is that easy, I have no clue. But it breaks horridly when you use it to determine "one year ago today" from Feb 29.
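The three bugs above can be avoided by asking the calendar instead of hard-coding it. This is a rough sketch in Python rather than the SQL-like dialect of the snippets above; the function names are made up for illustration, and clamping Feb 29 to Feb 28 is an assumed policy that may not suit every application.

```python
import calendar
from datetime import date

def days_in_month(year: int, month: int) -> int:
    """Look up the month length for a specific year instead of a fixed table."""
    return calendar.monthrange(year, month)[1]

def days_in_year(year: int) -> int:
    """Size per-day arrays from this, not from a hard-coded 365."""
    return 366 if calendar.isleap(year) else 365

def one_year_before(d: date) -> date:
    """Subtract one calendar year; Feb 29 is clamped to Feb 28 (assumed policy)."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        # Only Feb 29 fails here, because the previous year has no such day.
        return d.replace(year=d.year - 1, day=28)

high_temps = [0] * days_in_year(2012)                  # 366 slots in a leap year
last_index = date(2012, 12, 31).timetuple().tm_yday - 1
high_temps[last_index] = 5                             # index 365 is valid here
print(one_year_before(date(2012, 2, 29)))              # 2011-02-28
```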
Funnily enough, I used to work at IBM doing OS/2 tech support. OS/2 and Windows NT share a common heritage, so a lot of the behind-the-scenes problems I witnessed in OS/2 were (And sometimes still are) problems with Windows. I'm not sure if this is one of them, but I got a call once from a guy who was trying to use his OS/2 system to track satellites. The problem was, the OS/2 timer API specified that you could set milliseconds but it didn't seem to work. I tracked it down to a timing driver which tracked two separate interrupts. The first interrupt happened every few milliseconds and would update the clock millis when that happened. However, if the system was busy it was possible to not handle that interrupt. There was also a system periodic interrupt every 1 second. When that occurred, the system hard-reset the milli time and incremented the seconds. So you could set the millis, but the clock would become inaccurate 1 second later. Just one example of how time has been a thorn in my side for my entire career. I wrote an APAR up on it which was promptly closed "Working as Designed." Dunno if he ever got it fixed...
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
"Oh Hai, yeah we just carried the one! It's all fixed now. Nothing to worry about, back to work"
Join the Slashcott! Feb 10 thru Feb 17!
As I keep telling my fellow software developers, time is one of those things in software that tend to go wrong. Few developers give it the attention it deserves. Between different formats, timezones, changing timezones (including DST), leap seconds, and limits on what can be represented, there is plenty of opportunity for errors. And contrary to what you might hope, using an existing library to handle time does not absolve you from having to think about it, nor does that library always get it right.
Please correct me if I got my facts wrong.
This just in, the Vatican is raided by the FBI in search of XVI century documents and the evil mastermind of today's attack: Pope Gregory. Joining the FBI are Seals Team Six and Chuck Norris, who brought an abacus. The team sprays Gregory's coffin with bullets. Obama announces the permanent change to a non-leap calendar. The world is safe. Cue in Aerosmith.
I've been accommodating leap years into my date-handling code since 1982. The computation is quite simple, but it has to be done in order to convert a julian date/time value to calendar date/time, or to do simple date arithmetic. At this point in time, there is absolutely zero, zilch, no excuse for this sort of bug to exist in ANY production-quality code. So, I guess that MS software is NOT production-ready? :rolleyes:
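For reference, the Gregorian leap year rule fits in one line. A minimal sketch in Python (`is_leap_year` is a name chosen here for illustration), cross-checked against the standard library's `calendar.isleap`:

```python
import calendar

def is_leap_year(year: int) -> bool:
    """Gregorian rule: every 4th year, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

# Cross-check the hand-written rule against the standard library.
for y in (1900, 2000, 2011, 2012, 2100):
    assert is_leap_year(y) == calendar.isleap(y)
```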
on how to handle date functions with a claim specifically for leap year. lol
Then watch for its approval and subsequent issuance.
LoB
"Anyone who stands out in the middle of a road looks like roadkill to me." --Linus
It never ceases to amaze me that software developers have not learned the one simple lesson of time: every internal representation and computation on time needs to be done using seconds-since-epoch. Always. No exceptions. The *only* time that one ever deals with a human-formatted time is when reading in a value or printing out a value. It is never, ever stored or manipulated in any form other than seconds-since-epoch.
Do that, and you never have these sorts of bugs.
But oddly enough, this is still a debate that I have to run through over and over again with our new developers.
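A small sketch of that discipline in Python, assuming UTC input in a fixed "YYYY-MM-DD HH:MM:SS" format (the format and the function names are illustrative assumptions, not anything from a particular codebase). Human-readable strings exist only at the edges; everything in between is integer arithmetic.

```python
from datetime import datetime, timezone

FORMAT = "%Y-%m-%d %H:%M:%S"

def parse_to_epoch(text: str) -> int:
    """Input boundary: parse a UTC timestamp string into epoch seconds."""
    dt = datetime.strptime(text, FORMAT).replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

def format_epoch(seconds: int) -> str:
    """Output boundary: render epoch seconds as a UTC string."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(FORMAT)

DAY = 86_400  # seconds in a UTC day, ignoring leap seconds as POSIX time does

start = parse_to_epoch("2012-02-28 12:00:00")
print(format_epoch(start + DAY))      # 2012-02-29 12:00:00
print(format_epoch(start + 2 * DAY))  # 2012-03-01 12:00:00
```

Note that this only covers fixed-length durations. Calendar-sized steps such as "one month later" still need a calendar-aware conversion at the boundary.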
Some code I knocked up a few months ago had a problem with the leap-day just gone.
Difference being, that script wasn't meant to run for more than a few days, was knocked up in an hour, was untested, and didn't run anything critical at all (it moved scanned PDF files into an archive folder for a scanner used by precisely two people).
Seriously, Microsoft, you have a system that you expect government to use and you can't even work around a leap-day in advance?
...until the next day?
...of "The Trustworthy Computing Initiative". HAVE to believe them - hey, Microsoft's spent a lot of money in advertising to tell us how good they are, eh?
YankDownUnder Veni, Vidi, volo in domum redire
Wasn't this the same bug that affected Zune 4 years ago????? My god, MS won't learn.
Let's bet: which MS product will fail in 2016?
They need to figure out how to better test it!
Microsoft had a computer date bug? Amateur hour!
I'm always amazed by the typical beginner mistake where people actually store dates as something other than timestamps in so many places, when there's absolutely zero reason to do so.
Most programmers aren't aware that a minute is not made of 60 seconds (hint: they can be 59 or 61 seconds). Most programmers aren't aware that a day isn't 24 hours. In short: they get their invariants wrong.
The real tragedy is that in most applications users don't even need to *see* any exact date. Date should always be stored as a point in time, stored for example in seconds or milliseconds from the epoch.
But, no, they get it all wrong: they store dates as complex objects in their DB, then they make even more complex computations like: "if we're in this or that month add 30 days to get to next month, otherwise add 31 days, or if we're in February add 28 days". And it's totally silly because in most applications / billing schemes, the user doesn't give a flying F**K if there's a one or two day inaccuracy (e.g. I get billed monthly for my electricity: sometimes the bill is dated from the 4th of the month, sometimes it's the 5th, then back to the 4th, etc. and NOBODY GIVES A FLYING F*CK).
That said I didn't expect anything else from the geniuses at MS...
That's all I can say. Which is what you get with a bunch of low bid contractor's monkey programmers.
It reminds me of a post from a few years ago (I wish I could find it) where the poster stated that the software running on some MS froze up about every 50 days, requiring a reboot. No one could figure out why. The 32-bit timeGetTime function rolls over, you guessed it, about every 49.71 days. Defensive coding people!
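The 49.71 figure is just 2^32 milliseconds expressed in days. A quick sketch in Python of the arithmetic and of the usual defensive fix, which is to compare differences modulo 2^32 rather than absolute tick values; both function names are hypothetical and the tick readings are simulated integers, not real API calls.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000
WRAP = 2 ** 32  # a 32-bit millisecond counter holds values 0 .. 2**32 - 1

print(round(WRAP / MS_PER_DAY, 2))  # 49.71 days until the counter wraps

def elapsed_ms(start: int, now: int) -> int:
    """Elapsed time between two 32-bit tick readings, correct across one wrap."""
    return (now - start) % WRAP

def naive_expired(start: int, now: int, timeout: int) -> bool:
    """Buggy pattern: comparing absolute tick values breaks after a wrap."""
    return now >= start + timeout

# A timer started 100 ms before the wrap, checked 50 ms after it.
start, now = WRAP - 100, 50
print(elapsed_ms(start, now))           # 150
print(naive_expired(start, now, 120))   # False, although 150 ms have passed
```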
As it is, MS Excel does not handle leap years, IIRC. You have to "roll your own" even when doing spreadsheet dates.
Another anecdote. We were migrating an application and discovered a bunch of financial transactions which occurred on June 31st. Defensive coding people! Verify your input! Verify your date arithmetic!
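A one-function sketch of that kind of input check in Python, relying on the standard library's `datetime.date` constructor to reject days that do not exist (`parse_date` is a name made up for this example):

```python
from datetime import date
from typing import Optional

def parse_date(year: int, month: int, day: int) -> Optional[date]:
    """Return a date, or None when the combination does not exist on the calendar."""
    try:
        return date(year, month, day)
    except ValueError:
        return None

print(parse_date(2011, 6, 31))  # None: June has only 30 days
print(parse_date(2012, 2, 29))  # 2012-02-29
```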
putting the 'B' in LGBTQ+
I have my own computer/human based leap year story. This was probably human error, but still.
I was contracted at a company for the month of February, and was granted network access until then. I left the office on the 28th, having left the computer to do a run (it would take 8 hours, I wasn't waiting, and there was no remote access). At midnight, the system revoked my access, my jobs failed to complete, and I walked in in the morning to find that I had effectively wasted my second-to-last day on my contract.
I mean, seriously, who revokes system access at midnight? What happened to a grace period at the end? (If they really didn't want me in, all they had to do was physically deny me access.) And seriously, if you are giving access for the month of February (or March, or April, etc.), and you really, really have to revoke access as soon as the person leaves, revoke it on the 1st of the next month.
Rant over.
Ain't gonna happen. They won't introduce a UTC internal time. And as we move to UEFI-based machines, we will all be forced to at least indirectly use local time (though UEFI does keep track of the timezone and daylight saving mode). So if there is a bug in the UEFI timezone code, good luck to you...
The most obvious way to mitigate these failures is to enact the first international holiday, Leap Day. The world should slow down every ~4 years and take the day off. This will allow those who are date challenged to quietly fix their problems. Unfortunately, this means no special events and maybe we should just shut down the Internet too for a "make and mend" day.
HA HA HA HA HA HA HA HA HA HA HA HA HA HA !
And you think the 'cloud' is safe?!
HA HA HA HA HA HA HA HA HA HA HA HA HA HA !
That you don't have parents who thought it useful to only celebrate a birthday when it actually occurred.
Will never look at a blue sky the same.
My Casio watch handled Feb 29th without any problems at all (which actually surprised me - I thought I might have to have adjusted the date manually this year). M$ must have their heads in the "clouds"! ;-)
This failure is another example of how MS does little regression testing on their software. It is cheaper for MS to let the customers (users) find the bugs than for MS to release a quality product. MS is still issuing patches and bug fixes for WinXP 10 years after the product was introduced.
...it's hardly surprising. Seriously, Microsoft has numerous date formats within their system with which to work. Want to know the current time? You have to use SYSTEMTIME. Need to know the date-time stamp for a file? Then you need to use FILETIME; but why would you need to have a file dated in the 17th century?
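The 17th-century remark refers to the FILETIME epoch: a FILETIME value counts 100-nanosecond intervals since January 1, 1601 (UTC). A rough conversion sketch in Python, treating the value as a plain integer rather than calling any Windows API; the function names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)
EPOCH_DIFF_SECONDS = 11_644_473_600  # seconds from 1601-01-01 to 1970-01-01

def filetime_to_datetime(filetime: int) -> datetime:
    """Convert a FILETIME tick count (100 ns units since 1601) to a UTC datetime."""
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)

def filetime_to_unix(filetime: int) -> int:
    """Convert a FILETIME tick count to whole seconds since the UNIX epoch."""
    return filetime // 10_000_000 - EPOCH_DIFF_SECONDS

unix_zero = 116_444_736_000_000_000  # the FILETIME value of 1970-01-01 00:00:00 UTC
print(filetime_to_unix(unix_zero))   # 0
print(filetime_to_datetime(0).year)  # 1601
```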
Truth is like the sun. You can shut it out for a time, but it ain't goin' away. - Elvis Presley (source: imdb.com)