Leap Second Bug Causes Crashes
An anonymous reader writes in with a Wired story about the problems caused by the leap second last night. "Reddit, Mozilla, and possibly many other web outfits experienced brief technical problems on Saturday evening, when software underpinning their online operations choked on the “leap second” that was added to the world’s atomic clocks. On Saturday, at midnight Greenwich Mean Time, as June turned into July, the Earth’s official time keepers held their clocks back by a single second in order to keep them in sync with the planet’s daily rotation, and according to reports from across the web, some of the net’s fundamental software platforms — including the Linux operating system and the Java application platform — were unable to cope with the extra second."
And I didn't do anything special, just kept their software up-to-date.
I'm a Linux admin at a fairly large hosting company. The only thing that I am personally aware of happening this time around was that the time change triggered a bug in the OpenManage software on Dell servers, causing it to use 100% CPU. The solution was to resync the time and restart OpenManage. It wasn't really a fault in Linux itself, but in OpenManage on Linux. Lots of datacenters use Dell hardware and I'm sure most use OpenManage, so I'm sure the problem was widespread.
>hick-up.
The hick up watching the servers when the leap second came was you.
I'm uncertain why these reports keep referring to some monolithic "Linux" that is supposed to have had issues - Red Hat's the biggest Linux vendor, and certainly their "Linux" handled it just fine.
What distros had issues?
So far all I've heard about is affected Linux systems; did Windows and OS X do just fine?
From my own machines and comparing notes with some other people (all in all, about 3k servers) the bug seems to affect machines randomly. Known facts:
There's a kernel patch that fixes the supposed issue: https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d
Affects Debian stable a lot.
Affects Java and Virtualbox (starts using too much CPU).
Affected my browser (iceweasel on debian testing).
Affects SOME mysql installs (5.1 and 5.5, but not all, and of two identical installs one might be affected, the other not).
The fix has been posted in a lot of places: /etc/init.d/ntp stop; date; date `date +"%m%d%H%M%C%y.%S"`; date; /etc/init.d/ntp start
(I'm all for switching unix time to a simple counter and leaving it to the calendar libs to put the leap seconds where necessary)
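For anyone puzzling over the format string in that one-liner: the inner `date` call only prints the current time as MMDDhhmmCCYY.ss, and the outer `date` then sets the clock to that same value. A rough Python sketch of the string being built follows. The function name `date_set_argument` is made up for illustration, `%Y` stands in for `%C%y` (identical for four-digit years), and nothing here touches the system clock.

```python
from datetime import datetime


def date_set_argument(now: datetime) -> str:
    """Build the MMDDhhmmCCYY.ss string that the shell one-liner feeds to `date`.

    Hypothetical helper for illustration only; it formats a timestamp and
    does not set the clock.
    """
    # %Y (four-digit year) is used here in place of the shell's %C%y pair.
    return now.strftime("%m%d%H%M%Y.%S")


if __name__ == "__main__":
    # Five seconds after the leap second boundary, as seen in UTC.
    print(date_set_argument(datetime(2012, 7, 1, 0, 0, 5)))
```

In other words the workaround sets the clock to what it already reads; the thread's reports suggest that the act of setting it is what clears the stuck state.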
There was a Linux kernel bug. See
http://news.ycombinator.com/item?id=4183122
http://marc.info/?l=linux-kernel&m=134110635328824&w=2
and
https://lkml.org/lkml/2012/6/30/122
As it turns out, my biggest problem was customer-supplied software which uses its own Java JREs. We install a JRE by default and update it whenever possible, but some software (Adeptia, VLTrader, Alfresco) comes with its own ancient JRE and scripts to call that over the system-supplied Java.
Not a single machine crashed (we are very explicitly in charge of which OS version is running) but a lot of Java processes locked up and had to be restarted.
I can even see a small bump in the power-usage around two o' clock (0:00 GMT).
and above all it should not be changed to accommodate fluctuations in the orbit of a rock circling an arbitrary star.
That is precisely the point of keeping track of the time of day, or day of the year.
Time of day is an arbitrary number whose main utility lies in it being composed of predictable periods and divided into homogeneous units.
You do not need a complex system like date-time, comprised of minutes, hours, seconds, months, weeks, and years, if you just want to measure time in a convenient homogeneous unit. Define a time-zero, and just count milliseconds from that to whatever arbitrary distance into the past and future you want. Measure it in kilo-seconds, mega-seconds, giga-seconds... etc.
The entire point of date/time is because we do in fact care a lot about how that "arbitrary counter" lines up with when we will be awake or asleep or eating at various points -- that's what makes it useful.
What we should have is what I've described above, time-zero and a counter. And translations from that to localized date time should be handled by a library.
I run Arch Linux with kernel 3.4.4 and it went haywire. My machine was very heavily loaded at the time, and when the leap second happened the mysqld, firefox, and ksoftirqd processes started consuming 100% CPU. The load average was well over 10 and the machine was grinding along. It didn't actually fail but it was loaded down.
Even restarting the processes didn't fix it. The high load would go away once I stopped the processes, but as soon as I started them again the load would come right back. I had Firefox open on a blank page not doing anything and it was slammed at 100% CPU, and a couple of ksoftirqd tasks were slammed at 100% CPU each too.
I had to reboot the machine to get it back to normal.
I have Ubuntu and Debian servers that for whatever reason did not add the leap second so they were fine. Their time was a second off today though (at least until ntp slowly corrected it or I manually intervened).
I'm managing a cluster of 2,400 nodes running FreeBSD, and AFAICS, none was tripped up by leap second NTP adjustments. On the other hand, 4 out of 180 Linux nodes crashed simultaneously at that very moment. All this is exceedingly weird, but may indeed point to a subtle bug in the Linux kernel (only?). I've never witnessed this behavior in the past.
cpghost at Cordula's Web.
You don't need a leap second in order for that to happen. Firefox does that regularly.
Restarting ntp wasn't enough for me, I had to reset the date with:
date -s "`date`"
Only one machine went haywire though.
> Why not bundle them and apply them every 10 or 20 years?
The problem we have here is that leap seconds are rare. Things that are common are tested for, and quickly found if broken. Having something which only happens every 20 years is a recipe for disaster every 20 years.
My view is that NTP is at fault, because the 61st second is a brittle way to handle it. NTP should use the same method as Google and smear the leap second out over, e.g., an hour: http://googleblog.blogspot.dk/2011/09/time-technology-and-leaping-seconds.html
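To make "smearing" concrete, here is a minimal sketch of the idea: instead of inserting a 61st second, the clock absorbs the extra second gradually over a window. The cosine-shaped ramp, the one-hour default window, and the name `smear_fraction` are assumptions chosen for illustration, not a description of what any particular NTP server actually does.

```python
import math


def smear_fraction(seconds_into_window: float, window: float = 3600.0) -> float:
    """Return how much of the leap second (0.0 to 1.0) has been absorbed so far.

    Hypothetical helper: the ramp starts and ends gently so the clock rate
    never jumps, and it reaches exactly 1.0 at the end of the window.
    """
    # Clamp so times before or after the window give 0.0 or 1.0.
    t = min(max(seconds_into_window, 0.0), window)
    return (1.0 - math.cos(math.pi * t / window)) / 2.0


if __name__ == "__main__":
    for minutes in (0, 15, 30, 45, 60):
        absorbed = smear_fraction(minutes * 60.0)
        print(f"{minutes:2d} min into window: {absorbed:.3f} s absorbed")
```

Clients that follow a clock adjusted this way never see 23:59:60, only seconds that are very slightly longer than usual for the length of the window.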
What we should have is what I've described above, time-zero and a counter. And translations from that to localized date time should be handled by a library.
Which, sadly, POSIX doesn't let you have as "UNIX time":
If there were a UN*X API to get a count of seconds since the Epoch (in addition to, or instead of, a call to get "seconds since the Epoch"), and a UN*X API to convert those to UTC and local time labels, that would get what you want. Modulo making it work with NTP, the former could be implemented with less difficulty than a call to get "seconds since the Epoch", and the latter is called "the Olson code complete with the leap seconds database".
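As a rough illustration of the split described above (a plain counter plus a library that applies a leap second table), here is a sketch in Python. Everything in it is hypothetical: `LEAP_BOUNDARIES` is a deliberately incomplete one-entry table covering only the 2012-06-30 leap second, the counter is assumed to tick once per real second from 1970-01-01T00:00:00 UTC, and `count_to_utc_label` is an invented name, not an existing API.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def posix_seconds(dt: datetime) -> int:
    """POSIX-style seconds since the Epoch (leap seconds not counted)."""
    return int((dt - EPOCH).total_seconds())


# Hypothetical, incomplete table: the POSIX timestamp of the first second
# AFTER each inserted leap second. A real table would list every leap second.
LEAP_BOUNDARIES = [posix_seconds(datetime(2012, 7, 1, tzinfo=timezone.utc))]


def count_to_utc_label(count: int) -> str:
    """Convert a true elapsed-seconds counter into a UTC label."""
    leaps_so_far = 0
    for boundary in LEAP_BOUNDARIES:
        # Counter value at which the 23:59:60 second itself occurs.
        leap_count = boundary + leaps_so_far
        if count < leap_count:
            break
        if count == leap_count:
            last_normal = EPOCH + timedelta(seconds=boundary - 1)
            return last_normal.strftime("%Y-%m-%dT%H:%M:") + "60Z"
        leaps_so_far += 1
    moment = EPOCH + timedelta(seconds=count - leaps_so_far)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    start = LEAP_BOUNDARIES[0] - 1
    for count in range(start, start + 3):
        print(count, count_to_utc_label(count))
```

The counter never repeats or skips a value; only the label side needs to know that a 23:59:60 existed.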
However, that would then require some mechanism to allow code to schedule something to happen at a given UTC label; simply calculating the UNIX time for that UTC label, getting the current UNIX time, and scheduling it for then-now seconds in the future is insufficient, as the UNIX time for a given UTC label in the future might change if a leap second is scheduled between then and now. (Note that supporting scheduling something to happen at a given local civil time label would already require correction of that sort to handle DST rule changes.) This would also have to do something if you schedule an event for YYYY-DD-MM 23:59:59 and a negative leap second occurs so that there is no 23:59:59 on YYYY-DD-MM; "something" might be "let somebody know and ask them to correct it" or "do it at 00:00:00 on the next day", perhaps depending on the reason why it's scheduled.
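The scheduling problem can be shown with a small sketch under the same assumptions as a true elapsed-seconds counter: the counter value for a future UTC label depends on which leap seconds are known when you compute it. The names `label_to_count` and `posix_seconds` are hypothetical, and the leap table is passed in explicitly so the effect of a table update is visible.

```python
from datetime import datetime, timezone
from typing import Iterable

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def posix_seconds(dt: datetime) -> int:
    """POSIX-style seconds since the Epoch (leap seconds not counted)."""
    return int((dt - EPOCH).total_seconds())


def label_to_count(label: datetime, leap_boundaries: Iterable[int]) -> int:
    """Map a UTC label to a true elapsed-seconds counter value.

    leap_boundaries holds the POSIX timestamp of the first second after
    each known inserted leap second (a hypothetical table format).
    """
    p = posix_seconds(label)
    # Every leap second inserted at or before the label pushes the counter up by one.
    return p + sum(1 for boundary in leap_boundaries if boundary <= p)


if __name__ == "__main__":
    target = datetime(2012, 7, 1, 12, 0, tzinfo=timezone.utc)
    # Computed before the June 2012 leap second was announced...
    before = label_to_count(target, [])
    # ...and again after the table gained that entry.
    leap = posix_seconds(datetime(2012, 7, 1, tzinfo=timezone.utc))
    after = label_to_count(target, [leap])
    print(after - before)  # the stored wake-up count is now one second early
```

A scheduler built this way would have to keep the UTC label rather than the computed count, and recompute whenever the table changes.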