Leap Second Bug Causes Crashes
An anonymous reader writes in with a Wired story about the problems caused by the leap second last night. "Reddit, Mozilla, and possibly many other web outfits experienced brief technical problems on Saturday evening, when software underpinning their online operations choked on the “leap second” that was added to the world’s atomic clocks. On Saturday, at midnight Greenwich Mean Time, as June turned into July, the Earth’s official time keepers held their clocks back by a single second in order to keep them in sync with the planet’s daily rotation, and according to reports from across the web, some of the net’s fundamental software platforms — including the Linux operating system and the Java application platform — were unable to cope with the extra second."
And I didn't do anything special, just kept their software up-to-date.
Interesting. I wonder what conditions had to have been met for a crash to happen, none of my servers had so much as a hick-up.
I'm a Linux admin at a fairly large hosting company. The only thing that I'm personally aware of happening this time around was that the time change triggered a bug in the OpenManage software on Dell servers, causing it to use 100% CPU. The solution was to resync the time and restart OpenManage. It wasn't really a fault of Linux itself, but of OpenManage on Linux. Lots of datacenters use Dell hardware and I'm sure most use OpenManage, so I'm sure the problem was widespread.
>hick-up.
The hick up watching the servers when the leap second came was you.
I'm uncertain why these reports keep referring to some monolithic "Linux" that is supposed to have had issues - Red Hat's the biggest Linux vendor, and certainly their "Linux" handled it just fine.
What distros had issues?
So far all I've heard about is affected Linux systems. Did Windows and OS X do just fine?
From my own machines and comparing notes with some other people (all in all, about 3k servers) the bug seems to affect machines randomly. Known facts:
There's a kernel patch that fixes the supposed issue: https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d
Affects Debian stable a lot.
Affects Java and Virtualbox (starts using too much CPU).
Affected my browser (iceweasel on debian testing).
Affects SOME mysql installs (5.1 and 5.5, but not all, and of two identical installs one might be affected, the other not).
The fix has been posted in a lot of places: /etc/init.d/ntp stop; date; date `date +"%m%d%H%M%C%y.%S"`; date; /etc/init.d/ntp start
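For anyone puzzling over the one-liner above: the inner `date +"%m%d%H%M%C%y.%S"` prints the current time as MMDDhhmmCCYY.ss, which is the form `date` accepts for setting the clock, so the command sets the clock to the time it already shows. Here is a minimal Python sketch of just that formatting step. `date_set_argument` is a made-up helper name, and `%Y` stands in for `%C%y` on the assumption of a four-digit year.

```python
from datetime import datetime

def date_set_argument(now):
    """Format a datetime as MMDDhhmmCCYY.ss, the string the one-liner feeds back to `date`."""
    # %Y (four-digit year) is equivalent to %C%y for years 1000-9999.
    return now.strftime("%m%d%H%M%Y.%S")

# Five seconds after the leap second, UTC:
print(date_set_argument(datetime(2012, 7, 1, 0, 0, 5)))  # 070100002012.05
```

The sketch only builds the string; it does not touch the system clock.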
(I'm all for switching unix time to a simple counter and leaving it to the calendar libs to put the leap seconds where necessary)
There was a Linux kernel bug. See
http://news.ycombinator.com/item?id=4183122
http://marc.info/?l=linux-kernel&m=134110635328824&w=2
and
https://lkml.org/lkml/2012/6/30/122
It's like the Y2K bug, but every few years.
The hard system lock bug due to a leap second was patched in 2.6.29, so either you've got some weird related bug, or something is very wrong.
We will keep having these kinds of issues for as long as some people fail to understand that time of day is an arbitrary number whose main utility lies in it being composed of predictable periods and divided into homogenous units. It should have no relation whatsoever to whatever time the sun happens to rise or set at any particular location, and above all it should not be changed to accommodate fluctuations in the orbit of a rock circling an arbitrary star. Abominations like leap seconds or daylight savings make the whole system less useful by merely existing.
But personally I wouldn't be surprised if people off the equator were to get summer minutes composed of 120 seconds during daytime (or even better, a scale!) to ensure the sun rises and sets at the same time year around. Or, hey, why not simply make the seconds longer? Or a combination of both plus we can define pi to be 3 to make things simpler.
Configuration of the system to only accept 23:59:59 and not 23:59:60
You will be baked, and there will be cake.
As it turns out, my biggest problem was customer-supplied software that uses its own Java JREs. We install a JRE by default and update it whenever possible, but some software (Adeptia, VLTrader, Alfresco) comes with its own ancient JRE and scripts to call that over the system-supplied Java.
Not a single machine crashed (we are very explicitly in charge of which OS version is running), but a lot of Java processes locked up and had to be restarted.
I can even see a small bump in the power-usage around two o' clock (0:00 GMT).
Now that Linux hit the same type of hurdle, we're all of a sudden being very nuanced about the definition of code quality? Typical.
Wow. You're still pissed over Azure failing, your Xbox disabling itself, your Zune crashing for a full day and your Outlook manhandling your appointments (on more than one occasion)?
Talk about carrying a grudge..
Why not bundle them and apply them every 10 or 20 years?
And apparently I'm not alone:
http://en.wikipedia.org/wiki/Leap_second#Proposal_to_abolish_leap_seconds
Hogwash. Astronomers can find coping mechanisms; it's either that or these ridiculous levels of stress for systems admins.
and above all it should not be changed to accommodate fluctuations in the orbit of a rock circling an arbitrary star.
That is precisely the point of keeping track of the time of day, or day of the year.
time of day is an arbitrary number whose main utility lies in it being composed of predictable periods and divided into homogenous units.
You do not need a complex system like date-time, comprised of seconds, minutes, hours, weeks, months, and years, if you just want to measure time in a convenient homogenous unit. Define a time-zero and just count milliseconds from that to whatever arbitrary distance into the past and future you want. Measure it in kilo-seconds, mega-seconds, giga-seconds, etc.
The entire point of date/time is because we do in fact care a lot about how that "arbitrary counter" lines up with when we will be awake or asleep or eating at various points -- that's what makes it useful.
What we should have is what I've described above, time-zero and a counter. And translations from that to localized date time should be handled by a library.
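To make the "time-zero plus a counter, with a library doing the translation" idea concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the time-zero of 2000-01-01 is arbitrary, `to_utc_label` is a made-up name, and the table holds only the three leap seconds inserted between that time-zero and mid-2012 (end of 2005, end of 2008, end of June 2012). A real library would need the full, maintained table.

```python
from datetime import datetime, timedelta

EPOCH = datetime(2000, 1, 1)  # arbitrary time-zero chosen for this sketch

# UTC midnights that immediately follow an inserted leap second.
# Only the three insertions between EPOCH and mid-2012 are listed.
LEAP_ENDS = [datetime(2006, 1, 1), datetime(2009, 1, 1), datetime(2012, 7, 1)]

def to_utc_label(count):
    """Translate elapsed seconds since EPOCH into a UTC label string."""
    leaps_passed = 0
    for i, end in enumerate(LEAP_ENDS):
        # Counter value occupied by this leap second (23:59:60):
        # calendar seconds up to the midnight, plus the i earlier leap seconds.
        leap_at = int((end - EPOCH).total_seconds()) + i
        if count == leap_at:
            day = end - timedelta(days=1)
            return day.strftime("%Y-%m-%d") + " 23:59:60"
        if count > leap_at:
            leaps_passed = i + 1
    label = EPOCH + timedelta(seconds=count - leaps_passed)
    return label.strftime("%Y-%m-%d %H:%M:%S")

# The counter just ticks; only the label makes room for the extra second.
leap_2012 = int((datetime(2012, 7, 1) - EPOCH).total_seconds()) + 2
for c in (leap_2012 - 1, leap_2012, leap_2012 + 1):
    print(c, to_utc_label(c))
```

The three printed labels are 2012-06-30 23:59:59, 2012-06-30 23:59:60 and 2012-07-01 00:00:00, while the counter itself never repeats or skips a value.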
I run Arch Linux with kernel 3.4.4 and it went haywire. My machine was very heavily loaded at the time and when the leap second happened mysqld, firefox, and ksoftirq processes started consuming 100% CPU. The load factor was well over 10 and the machine was grinding along. It didn't actually fail but it was loaded down.
Even restarting the processes didn't fix it. The high load would go away once I stopped the processes but as soon as I started them again the load would come right back. I had Firefox open on a blank page not doing anything and it was slammed at 100% CPU, and I had a couple of ksoftirq tasks slammed at 100% CPU each too.
I had to reboot the machine to get it back to normal.
I have Ubuntu and Debian servers that for whatever reason did not add the leap second so they were fine. Their time was a second off today though (at least until ntp slowly corrected it or I manually intervened).
I'm managing a cluster of 2,400 nodes running FreeBSD, and AFAICS, none was tripped up by leap second NTP adjustments. On the other hand, 4 out of 180 Linux nodes crashed simultaneously at that very moment. All this is exceedingly weird, but may indeed point to a subtle bug in the Linux kernel (only?). I've never witnessed this behavior in the past.
cpghost at Cordula's Web.
About 5 seconds after midnight GMT a Java server app running on my Debian Squeeze server decided it was going to eat-up ALL THE THINGS and for some reason, the server rebooted itself. Glad to know I wasn't alone in shitting myself over odd behaviours.
You don't need a leap second in order for that to happen. Firefox does that regularly.
Google official blog: "Time, technology and leaping seconds" (sept 2011)
http://googleblog.blogspot.in/2011/09/time-technology-and-leaping-seconds.html
I wonder if the leap second has anything to do with the labs Chubby paper / site currently being offline..
Restarting ntp wasn't enough for me, I had to reset the date with:
date -s "`date`"
Only one machine went haywire though.
Have you read the source for /bin/sh ?
What we should have is what I've described above, time-zero and a counter. And translations from that to localized date time should be handled by a library.
Which, sadly, POSIX doesn't let you have as "UNIX time":
If there were a UN*X API to get a count of seconds since the Epoch (in addition to, or instead of, a call to get "seconds since the Epoch"), and a UN*X API to convert those to UTC and local time labels, that would get what you want. Modulo making it work with NTP, the former could be implemented with less difficulty than a call to get "seconds since the Epoch", and the latter is called "the Olson code complete with the leap seconds database".
However, that would then require some mechanism to allow code to schedule something to happen at a given UTC label; simply calculating the UNIX time for that UTC label, getting the current UNIX time, and scheduling it for then-now seconds in the future is insufficient, as the UNIX time for a given UTC label in the future might change if a leap second is scheduled between then and now. (Note that supporting scheduling something to happen at a given local civil time label would already require correction of that sort to handle DST rule changes.) This would also have to do something if you schedule an event for YYYY-DD-MM 23:59:59 and a negative leap second occurs so that there is no 23:59:59 on YYYY-DD-MM; "something" might be "let somebody know and ask them to correct it" or "do it at 00:00:00 on the next day", perhaps depending on the reason why it's scheduled.
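A small Python sketch of why a precomputed "then-now" delay goes stale: the same UTC label maps to a different counter value once a new leap second is added to the table. The time-zero, the helper names `count_for_label` and `delay_until`, and the two-entry starting table are all assumptions made up for this illustration.

```python
from datetime import datetime

EPOCH = datetime(2000, 1, 1)  # arbitrary time-zero chosen for this sketch

def count_for_label(label, leap_ends):
    """Counter value (elapsed seconds since EPOCH) for a UTC label."""
    calendar_seconds = int((label - EPOCH).total_seconds())
    # Each leap second that ends at or before the label pushes it one tick later.
    return calendar_seconds + sum(1 for end in leap_ends if end <= label)

def delay_until(label, now_count, leap_ends):
    """Seconds to wait from now_count until the UTC label, per the current table."""
    return count_for_label(label, leap_ends) - now_count

known = [datetime(2006, 1, 1), datetime(2009, 1, 1)]
now = count_for_label(datetime(2012, 1, 1), known)
target = datetime(2012, 7, 1)  # job scheduled for 2012-07-01 00:00:00 UTC

before = delay_until(target, now, known)
# The June 2012 leap second is announced and added to the table.
after = delay_until(target, now, known + [datetime(2012, 7, 1)])
print(after - before)  # 1
```

So a scheduler along these lines would have to keep the UTC label and recompute the delay whenever the table changes, rather than storing the delay itself.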
The hard system lock bug due to a leap second was patched in 2.6.29, so either you've got some weird related bug, or something is very wrong.
Well, the weird related bug would arguably count as something being wrong. Apparently there is a bug in the handling of the insertion of positive leap seconds that could cause weird behavior with futexes, and that bug appears not to have been fixed until at least July 1, 2012 (I'm guessing John Stultz has worked up a patch).
Considering leap-seconds happen every now and then, it seems odd that such fundamental things as Linux and Java cannot handle it. AFAIK, it was just about four years ago that we last had a leap-second.
Perhaps the bug that was mentioned in the lkml thread that started with this message was introduced less than four years ago, so the code in question had never gotten exposed to a leap second except perhaps in testing (I don't know how hard it is to reproduce it; John Stultz wasn't initially able to reproduce it in his testing, but eventually succeeded).
The bug has already been fixed for months now
A bug might have been fixed for months now, but I don't think that's the bug here.
the difference being this bug was patched already; it only affected systems that were not kept up to date.
A bug, perhaps. This bug, perhaps not.
If that actually happened, then they should have just made it do 23:59:59 twice instead of crashing all the computers. I would like somebody to give me a concrete reason why any computer system should actually crash because of a lost second.
This linux-kernel mailing list thread discusses a kernel bug that causes futexes to repeatedly time out, so that code using them (which might include POSIX mutexes and condition variables, if that's what glibc uses for them on Linux) might spin.
That's not the kernel leap-second-handling bug that was fixed back in March, so it's not as if a properly-patched kernel wouldn't get hit by this (unless you define "properly-patched" as "includes the patch John Stultz came up with on July 1, 2012").
So, yes, this particular bug is Linux-specific (i.e., there's a reason why it hit Linux servers), and might not be the fault of the userland code running atop it (so it might not, for example, be Java's fault).
Mods -- please mod this up to a thousand. kwardroid's fix fixed this for all the affected machines I've found so far.
Data: She brought me closer to humanity than I ever thought possible, and for a time...I was tempted by her offer.
Jean-Luc Picard: How long a time?
Data: Zero point six eight seconds, sir. For an android, that is nearly an eternity.
When NTP knows that a leap second is to be added, it (on Linux at least) sets a flag in the kernel to say that at 23:59:59, please continue to 23:59:60 before going to 00:00:00. NTP can set this at any time on the day that the leap second is due, which is why a server running NTP on Linux would know that TODAY a leap second is due (because they should always be applied at the 23:59:59 cross-over).
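The flag behaviour described above can be pictured with a toy Python clock. This is only a model of the description in this comment, not the kernel's actual code; `next_second` and its `insert_leap` argument are names invented for the sketch.

```python
def next_second(h, m, s, insert_leap):
    """Advance a wall-clock label by one second.

    Returns (h, m, s, insert_leap). The flag is cleared once the extra
    second 23:59:60 has been produced.
    """
    if (h, m, s) == (23, 59, 59) and insert_leap:
        return 23, 59, 60, False  # the extra second; flag is consumed
    if s >= 59:  # ordinary rollover from :59, or from the leap second :60
        s = 0
        m += 1
        if m == 60:
            m = 0
            h = (h + 1) % 24
        return h, m, s, insert_leap
    return h, m, s + 1, insert_leap

state = (23, 59, 58, True)  # flag set earlier in the day
for _ in range(3):
    state = next_second(*state)
    print("%02d:%02d:%02d" % state[:3])
```

With the flag set, the three printed labels are 23:59:59, 23:59:60 and 00:00:00; with the flag clear, 23:59:59 rolls straight over to 00:00:00.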
I work at a fairly large international outfit, with data feeds coming and going to the far ends of the Earth. Everything we do is time-sensitive. Processing messages that depend on prior messages already being processed means we can't gracefully handle things coming in out of order.
We spent lots of time and money studying this problem, hired a high-priced consulting outfit to advise us and spun up lots of projects to mitigate the "risk" of the leap second. There were far too many meetings and conference calls with vendors, VARs and other people that wanted us to pay them for their time. What was determined was that we couldn't guarantee that nothing would crash or (gasp!) that messages wouldn't be discarded or processed incorrectly, which was a risk we weren't willing to take. We run a full gamut of OSes, from HP/UX, Solaris, Linux, z/TPF, z/OS, DB2, etc. You get the idea. Too many variables and too many systems to update and test with the limited funds and limited timeframe given.
In the end, we avoided the problem by shutting down all (and I do mean ALL) processing and flushing all the transactional systems to disk and suspending EVERYTHING from a minute before until a minute after the leap second. (Was that two minutes or two minutes PLUS one second? Clock math has always eluded me.) Shutting down all these interconnected systems in the correct order was a precision dance that, in the end, we didn't perform very well. Messages did end up being discarded. At precisely :20 seconds after the leap second, we began syncing all our systems with our internal NTP server and then at precisely one minute after, we slowly started everything back up. There were some systems that required a restart. We manually reprocessed those earlier discarded messages just as fast as our little fingers could type. In all it took us about 15 minutes to get everything spun back up, and all that time is getting charged to our SLA, which affects ALL our evaluations and year-end bonuses.
Lots of work was done, overtime was paid and buckets of money were given to lots of high-priced consultants and I personally will take a hit to my paycheck, all over ONE GODDAMNED SECOND.
Let's not do that again, okay?
If that actually happened, then they should have just made it do 23:59:59 twice instead of crashing all the computers. I would like somebody to give me a concrete reason why any computer system should actually crash because of a lost second.
If you send 23:59:59 twice, you have the same second in the system twice, which can potentially cause issues with logs. If everything is timestamped to the second/millisecond, how can you be sure an event happened in the first 23:59:59 second, or the second (or subsequent) 23:59:59 second?
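Here is a tiny Python sketch of that log problem. The entries and messages are invented for illustration; the clock is assumed to show 23:59:59 twice, and each entry carries a hypothetical sequence number recording the order in which things really happened.

```python
# Hypothetical log entries as (sequence number, timestamp label, message).
# The sequence number reflects the true order; the clock repeats 23:59:59.
log = [
    (1, "2012-06-30 23:59:58.700", "request received"),
    (2, "2012-06-30 23:59:59.900", "payment authorised"),  # first 23:59:59
    (3, "2012-06-30 23:59:59.100", "payment captured"),    # repeated 23:59:59
    (4, "2012-07-01 00:00:00.200", "receipt sent"),
]

by_timestamp = [msg for _, _, msg in sorted(log, key=lambda e: e[1])]
by_sequence = [msg for _, _, msg in sorted(log, key=lambda e: e[0])]

print(by_timestamp)  # "payment captured" now appears before "payment authorised"
print(by_sequence)   # original order preserved
```

Sorting on the timestamp alone swaps the two events that fell in the doubled second. A separate, strictly increasing sequence number is one way to keep the ordering recoverable, though whether that is practical depends on the system.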