Leap Second Bug Causes Crashes

Posted by samzenpus on Sunday July 1, 2012 @08:00AM from the slip-it-in-there dept.

An anonymous reader writes in with a Wired story about the problems caused by the leap second last night. "Reddit, Mozilla, and possibly many other web outfits experienced brief technical problems on Saturday evening, when software underpinning their online operations choked on the “leap second” that was added to the world’s atomic clocks. On Saturday, at midnight Greenwich Mean Time, as June turned into July, the Earth’s official time keepers held their clocks back by a single second in order to keep them in sync with the planet’s daily rotation, and according to reports from across the web, some of the net’s fundamental software platforms — including the Linux operating system and the Java application platform — were unable to cope with the extra second."

What about Windows and Mac? by kthreadd · 2012-07-01 08:13 · Score: 4, Interesting

So far all I've heard about is affected Linux systems. Did Windows and OS X do just fine?
1. Re:What about Windows and Mac? by Guy+Harris · 2012-07-01 11:37 · Score: 3, Interesting
  As far as I can tell, all current operating systems handled it fine. It's applications that have problems, mainly server-type apps that actually use the clock for important things.
  Linux being heavily affected is just a side-effect of most servers running Linux (although apparently some older versions don't handle leap seconds so cleanly - maybe that has something to do with it?).
  Yes, at least one of the problems appears to be a Linux kernel problem. However, as that thread indicates, the consequence of this isn't a kernel crash; it causes futexes with timeouts to time out repeatedly. I'm guessing, perhaps incorrectly, that this might mean that code waiting for a futex gets a kernel wakeup due to a timeout, checks whether the condition being waited for has happened, discovers that it hasn't, and sleeps in the futex again. Lather, rinse, repeat: it makes no progress and chews up tons of CPU.
  If so, then:
  
  this particular problem is specific to systems running Linux kernels with the problem (and hence specific to Linux);
  
  applications that don't themselves have issues with leap seconds might be affected by this;
  so Linux being heavily affected might also be a side-effect of, well, some versions of the Linux kernel having a bug that's triggered by leap seconds.
  However, unless an application happens to use futexes in a fashion that trips over the bug, it won't be affected. Server applications might be the most likely to do so, meaning that you might not see it on, say, a desktop or handheld Linux machine, or even on some servers.
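The wakeup loop guessed at above can be sketched in Python. This is an illustration only, not the kernel's code or any affected application's: `timed_wait_loop`, `broken_wait`, and `demo` are hypothetical names, and `broken_wait` simply pretends that every timed wait returns at once as "timed out", which is an assumption about how the bug might look from user space.

```python
import threading


def timed_wait_loop(cond, is_ready, timeout, wait, max_wakeups):
    """Classic timed-wait loop: sleep, wake up, re-check the condition.

    `wait` is the function that sleeps; `max_wakeups` caps the loop so the
    sketch always terminates. Returns how many times the loop woke up.
    """
    wakeups = 0
    with cond:
        while not is_ready() and wakeups < max_wakeups:
            wait(timeout)  # returns on a notify OR on a timeout
            wakeups += 1
    return wakeups


def broken_wait(timeout):
    # Hypothetical stand-in for the suspected bug: the timeout "fires"
    # immediately, so the caller never actually sleeps.
    return False


def demo():
    cond = threading.Condition()
    state = {"ready": False}

    def set_ready():
        with cond:
            state["ready"] = True
            cond.notify_all()

    # Healthy case: each wait really sleeps up to 50 ms, and the condition
    # becomes true after about 200 ms, so only a handful of wakeups occur.
    timer = threading.Timer(0.2, set_ready)
    timer.start()
    healthy = timed_wait_loop(cond, lambda: state["ready"], 0.05, cond.wait, 10_000)
    timer.join()

    # Buggy case: waits return instantly and the condition never becomes
    # true, so the loop spins until it hits the cap.
    spinning = timed_wait_loop(cond, lambda: False, 0.05, broken_wait, 10_000)
    return healthy, spinning


if __name__ == "__main__":
    healthy, spinning = demo()
    print("wakeups with a working timeout:", healthy)
    print("wakeups with an instantly expiring timeout:", spinning)
```

Without the cap, the second loop would never sleep and would keep a core busy, which would be consistent with the 100% CPU reports if the guess about futex timeouts is right.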
Re:Linux kernel unable to cope? I think not. by Anonymous Coward · 2012-07-01 09:09 · Score: 5, Interesting

I run Arch Linux with kernel 3.4.4 and it went haywire. My machine was very heavily loaded at the time and when the leap second happened mysqld, firefox, and ksoftirq processes started consuming 100% CPU. The load factor was well over 10 and the machine was grinding along. It didn't actually fail but it was loaded down.
Even restarting the processes didn't fix it. The high load would go away once I stopped the processes, but as soon as I started them again the load would come right back. I had Firefox open on a blank page not doing anything and it was slammed at 100% CPU, and I had a couple of ksoftirq tasks slammed at 100% CPU each too.
I had to reboot the machine to get it back to normal.
I have Ubuntu and Debian servers that for whatever reason did not add the leap second so they were fine. Their time was a second off today though (at least until ntp slowly corrected it or I manually intervened).
Only Linux affected? by cpghost · 2012-07-01 09:15 · Score: 4, Interesting

I'm managing a cluster of 2,400 nodes running FreeBSD, and AFAICS, none was tripped up by leap second NTP adjustments. On the other hand, 4 out of 180 Linux nodes crashed simultaneously at that very moment. All this is exceedingly weird, but may indeed point to a subtle bug in the Linux kernel (only?). I've never witnessed this behavior in the past.

--
cpghost at Cordula's Web.
Google on how they fixed that.. by Barryke · 2012-07-01 09:43 · Score: 3, Interesting

Google official blog: "Time, technology and leaping seconds" (Sept. 2011)
http://googleblog.blogspot.in/2011/09/time-technology-and-leaping-seconds.html
I wonder if the leap second has anything to do with the labs Chubby paper / site currently being offline..

--
Hivemind harvest in progress..
Re:All of my servers were fine by thePowerOfGrayskull · 2012-07-01 10:42 · Score: 3, Interesting

Our problem was with a third-party monitoring solution: its daemon process brought every single one of our servers to a near halt by consuming all available CPU cycles at the stroke of GMT midnight.
The OS itself was fine.
This monitoring software is common enough that it likely was behind a lot of the issues seen around the 'net.
Re:I always thought leap seconds were stupid by at10u8 · 2012-07-01 11:01 · Score: 3, Interesting

except that BIPM, the providers of TAI, have published this http://www.bipm.org/cc/CCTF/Allowed/18/CCTF_09-27_note_on_UTC-ITU-R.pdf wherein the CCTF "stresses that TAI is the uniform time scale underlying UTC, and that it should not be considered as an alternative time reference." This appears to indicate that the CCTF and BIPM are not comfortable with the notion that operational systems might be employing TAI as their time scale. At the end of that paper they also discuss the possibility that TAI could cease to exist.
Re: by Guy+Harris · 2012-07-01 11:51 · Score: 3, Interesting

The hard system lock bug due to a leap second was patched in 2.6.29, so either you've got some weird related bug, or something is very wrong.
Well, the weird related bug would arguably count as something being wrong. Apparently there is a bug in the handling of the insertion of positive leap seconds that could cause weird behavior with futexes, and that bug appears not to have been fixed until at least July 1, 2012 (I'm guessing John Stultz has worked up a patch).