Leap Second Bug Causes Crashes
An anonymous reader writes in with a Wired story about the problems caused by the leap second last night. "Reddit, Mozilla, and possibly many other web outfits experienced brief technical problems on Saturday evening, when software underpinning their online operations choked on the “leap second” that was added to the world’s atomic clocks. On Saturday, at midnight Greenwich Mean Time, as June turned into July, the Earth’s official time keepers held their clocks back by a single second in order to keep them in sync with the planet’s daily rotation, and according to reports from across the web, some of the net’s fundamental software platforms — including the Linux operating system and the Java application platform — were unable to cope with the extra second."
So far all I've heard about is affected Linux systems, did Windows and OS X just fine?
I run Arch Linux with kernel 3.4.4 and it went haywire. My machine was very heavily loaded at the time and when the leap second happened mysqld, firefox, and ksoftirq processes started consuming 100% CPU. The load factor was well over 10 and the machine was grinding along. It didn't actually fail but it was loaded down.
Even restarting the processes didn't fix it. The high load would go away once I stopped the processes but as soon as I started them again the load would come right back. I had Firefox open on a blank page not doing anything and it was slammed at 100% CPU and had a could ksoftirq tasks slammed at 100% CPU each too.
I had to reboot the machine to get it back to normal.
I have Ubuntu and Debian servers that for whatever reason did not add the leap second so they were fine. Their time was a second off today though (at least until ntp slowly corrected it or I manually intervened).
I'm managing a cluster of 2,400 nodes running FreeBSD, and AFAICS, none was tripped off by leap second NTP adjustments. On the other hand, 4 out of 180 Linux nodes crashed simultaneously at that very moment. All this is exceedingly weird, but may indeed point to a subtle bug in the Linux kernel (only?). I've never witnessed this behavior in the past.
cpghost at Cordula's Web.
Google official blog: "Time, technology and leaping seconds" (sept 2011)
http://googleblog.blogspot.in/2011/09/time-technology-and-leaping-seconds.html
I wonder if the leap second has anything to do with the labs Chubby paper / site currently being offline..
Hivemind harvest in progress..
Our problem was with a third party monitoring solution - its daemon process brought every single one of our servers to a near halt by consuming all available cpu cycles at the stroke of gmt midnight.
The OS itself was fine.
This monitoring software is common enough that it likely was behind a lot of the issues seen around the 'net.
except that BIPM, the providers of TAI, have published this http://www.bipm.org/cc/CCTF/Allowed/18/CCTF_09-27_note_on_UTC-ITU-R.pdf wherein the CCTF "stresses that TAI is the uniform time scale underlying UTC, and that it should not be considered as an alternative time reference." This appears to indicate that the CCTF and BIPM are not comfortable with the notion that operational systems might be employing TAI as their time scale. At the end of that paper they also discuss the possibility that TAI could cease to exist.
The hard system lock bug due to a leap second was patched in 2.6.29, so either you've got some weird related bug, or something is very wrong.
Well, the weird related bug would arguably count as something being wrong. Apparently there is a bug in the handling of the insertion of positive leap seconds that could cause weird behavior with futexes, and that bug appears not to have been fixed until at least July 1, 2012 (I'm guessing John Stultz has worked up a patch).