Leap Second Bug Causes Crashes

Posted by samzenpus on Sunday July 1, 2012 @08:00AM from the slip-it-in-there dept.

An anonymous reader writes in with a Wired story about the problems caused by the leap second last night. "Reddit, Mozilla, and possibly many other web outfits experienced brief technical problems on Saturday evening, when software underpinning their online operations choked on the “leap second” that was added to the world’s atomic clocks. On Saturday, at midnight Greenwich Mean Time, as June turned into July, the Earth’s official time keepers held their clocks back by a single second in order to keep them in sync with the planet’s daily rotation, and according to reports from across the web, some of the net’s fundamental software platforms — including the Linux operating system and the Java application platform — were unable to cope with the extra second."

Re:All of my servers were fine by Sir_Sri · 2012-07-01 08:05 · Score: 4, Informative

That can be hard for some people.
Linux by Anonymous Coward · 2012-07-01 08:08 · Score: 4, Informative

I'm a Linux admin at a fairly large hosting company. The only thing that I am personally aware of happening this time around was that the time change triggered a bug in the OpenManage software on Dell servers, causing it to use 100% CPU. The solution was to resync the time and restart OpenManage. It wasn't really a fault of Linux itself, but of OpenManage on Linux. Lots of datacenters use Dell hardware and I'm sure most use OpenManage, so I'm sure the problem was widespread.
1. Re:Linux by Anonymous Coward · 2012-07-01 08:21 · Score: 5, Informative
  
  What you describe is a bug in the Linux kernel that causes problems for the Java VM that OpenManage uses.
  It is not a bug in OpenManage at all.
Re:All of my servers were fine by Anonymous Coward · 2012-07-01 08:10 · Score: 3, Informative

Agreed. Patches that aren't required to solve an ongoing incident impacting customer traffic require about 2 weeks advance notice to pass through change control, and that's if everything is perfect. A single error in a ticket can push that ticket out another week, and another, and so on.
Generally, we shoot for 3 weeks before we are allowed to install a patch. On average, it's about right.
Re:All of my servers were fine by Anonymous Coward · 2012-07-01 08:15 · Score: 5, Informative

the patch was posted back in March.
https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d
Extremely weird by Anonymous Coward · 2012-07-01 08:18 · Score: 5, Informative

From my own machines and comparing notes with some other people (all in all, about 3k servers) the bug seems to affect machines randomly. Known facts:
There's a kernel patch that fixes the supposed issue: https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d
Affects Debian stable a lot.
Affects Java and Virtualbox (starts using too much CPU).
Affected my browser (iceweasel on debian testing).
Affects SOME mysql installs (5.1 and 5.5, but not all, and of two identical installs one might be affected, the other not).
The fix has been posted in a lot of places: /etc/init.d/ntp stop; date; date `date +"%m%d%H%M%C%y.%S"`; date; /etc/init.d/ntp start
(I'm all for switching unix time to a simple counter and leaving it to the calendar libs to put the leap seconds where necessary)
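For anyone wondering what the one-liner above actually does: the inner `date +"%m%d%H%M%C%y.%S"` prints the current time in the MMDDhhmmCCYY.ss form that `date` accepts as a set-time argument, so the clock is simply set to what it already reads. A minimal Python sketch of that string follows. `date_set_argument` is a hypothetical helper name, and `%Y` stands in for `%C%y` on the assumption of four-digit years.

```python
from datetime import datetime

def date_set_argument(now: datetime) -> str:
    """Build the MMDDhhmmCCYY.ss string that the one-liner feeds back to `date`.

    Hypothetical helper for illustration only; %Y replaces %C%y, which is
    equivalent for four-digit years.
    """
    return now.strftime("%m%d%H%M%Y.%S")

if __name__ == "__main__":
    # Example input: one second after the 2012 leap second.
    print(date_set_argument(datetime(2012, 7, 1, 0, 0, 1)))
```

The sketch only formats a string; it does not touch the system clock.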
1. Re:Extremely weird by burne · 2012-07-01 09:09 · Score: 4, Informative
  
  It's a race-condition, either crashing your ancient kernel or causing software using certain kernel-calls to effectively lock up. In both cases load seems to be a factor.
  Over here the race-condition coincided with the actual leap-second and the start of the first batch of cronjobs at 02:00 local time.
  
  (I'm all for switching unix time to a simple counter and leaving it to the calendar libs to put the leap seconds where necessary)
  Bad idea. It would have prevented kernels affected by the race-condition from crashing, but would have meant most of your running software would have been either hit by this bug or would have been at the mercy of a 17 year old pimple-faced coder.
  I think I prefer a crash over the mayhem caused by banking-software not handling a leap-second correctly. That could bankrupt whole countries.
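For context on the counter debate: Unix time as POSIX defines it already is a counter of sorts, but one that gives every day exactly 86,400 seconds, so the inserted second has no value of its own. A minimal sketch using Python's `calendar.timegm`, which does plain arithmetic on the fields; it is used here only as an illustration and says nothing about what any kernel does internally.

```python
import calendar

# Last ordinary second of June 2012, the inserted leap second,
# and the first second of July (all UTC).
before_leap = calendar.timegm((2012, 6, 30, 23, 59, 59))
leap_second = calendar.timegm((2012, 6, 30, 23, 59, 60))
after_leap = calendar.timegm((2012, 7, 1, 0, 0, 0))

# The counter advances by one across what was really two elapsed seconds...
print(after_leap - before_leap)
# ...because 23:59:60 and 00:00:00 map to the same counter value.
print(leap_second == after_leap)
```

A counter that never repeats a value would need the calendar libraries to carry a leap-second table, which is the trade-off the two posts above are arguing about.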
Re:Our Red Hat servers had no issues at all by Nutria · 2012-07-01 08:22 · Score: 4, Informative

TFA mentioned that the RHEL6 kernel had the bug, but not RHEL5.
It appears also that system load was a big factor, so if your systems aren't busy on Saturday then they might not have crashed even if running an affected kernel.

--
"I don't know, therefore Aliens" Wafflebox1
Re:Linux kernel unable to cope? I think not. by Anonymous Coward · 2012-07-01 08:33 · Score: 4, Informative

There was a Linux kernel bug. See
http://news.ycombinator.com/item?id=4183122
http://marc.info/?l=linux-kernel&m=134110635328824&w=2
and
https://lkml.org/lkml/2012/6/30/122
Re: by Anonymous Coward · 2012-07-01 08:39 · Score: 2, Informative

The hard system lock bug due to a leap second was patched in 2.6.29, so either you've got some weird related bug, or something is very wrong.
Re:Our Red Hat servers had no issues at all by Anonymous Coward · 2012-07-01 08:44 · Score: 4, Informative

Red Hat had a lot of issues.
https://access.redhat.com/knowledge/articles/15145
https://access.redhat.com/knowledge/solutions/154713
It depended entirely on your load. The buggy kernel code ran every 17 minutes for the 24-hour period leading up to the leap-second insertion.
If you had enough load, your chance of deadlocking your system increased significantly.
Solution: strip the leap-second flag by manually setting your time.
You probably don't do much Java, then by burne · 2012-07-01 08:51 · Score: 5, Informative

As it turns out, my biggest problem was customer-supplied software which uses its own Java JREs. We install a JRE by default and update it whenever possible, but some software (Adeptia, VLTrader, Alfresco) comes with its own ancient JRE and scripts to call that instead of the system-supplied Java.
Not a single machine crashed (we are very explicitly in charge of which OS version is running), but a lot of Java processes locked up and had to be restarted.
I can even see a small bump in the power usage around two o'clock (0:00 GMT).
1. Re:You probably don't do much Java, then by Guy Harris · 2012-07-01 11:43 · Score: 4, Informative
  
  So are you saying that, in addition to the Linux kernel glitch in question (which appears to cause some userland processes to spin)
  Actually, I'm not sure that's the case. John Stultz's mail from July 1, 2012 speaks of a bug where clock_was_set() wasn't called after the leap second was added, and of a patch he was working on, so the bug in question might not have been fixed in March.
2. Re:You probably don't do much Java, then by archont · 2012-07-01 23:51 · Score: 3, Informative
  
  What is this, 1990? All modern CPUs have protection against overheating and disabling that protection requires, at the very least, some crafty soldering or flashing a 3rd party BIOS. If you're capable enough to do that you're probably running some sci-fi prototype rig from the future using pressurized mercury phase transition cooling or something.
  So no, I don't see how any properly set-up rig can make the CPU cook itself.
Re:All of my servers were fine by lister king of smeg · 2012-07-01 09:50 · Score: 3, Informative

And, all that aside, do you even know if the patch released to fix this problem is included in your distribution-released kernel? If you're not rolling your own kernel it can be nigh to impossible to know what's included and what's not -- in that case it doesn't even matter if it's up-to-date.
Well you could read through the change log and release notes to find out.

--
---Saying gnome 3 is better than windows 8 not so much a compliment as it is damning with light praise.
Re:Linux kernel unable to cope? I think not. by kwardroid · 2012-07-01 10:12 · Score: 5, Informative

Restarting ntp wasn't enough for me, I had to reset the date with:
date -s "`date`"
Only one machine went haywire though.
Re:What about Windows and Mac? by Guy Harris · 2012-07-01 10:46 · Score: 4, Informative

My guess is that Windows simply ignored it, so there never was a 61st second in a minute.
Well, if Microsoft's documentation of the SYSTEMTIME structure reflects the implementation, GetSystemTime(), the claim in that man page^W^WMSDN page that "The system time is expressed in Coordinated Universal Time (UTC)" notwithstanding, cannot acknowledge the existence of a 61st second in a minute ("The second. The valid values for this member are 0 through 59.", as the SYSTEMTIME page says).
But, just as on UN*X, you have "counter" and "human-style label" times (time_t, struct timeval, struct timespec are examples of the former, and a struct tm as returned by, for example, gmtime() is an example of the latter, on UN*X), with the Windows versions of those being FILETIME and SYSTEMTIME respectively. That page on FILETIME says nothing about leap seconds - does it just keep counting over a positive leap second or does it stop or what? And, if it doesn't just keep counting over a positive leap second, does it just freeze for a whole second, or does it slow down over some period of time so that it eventually syncs up, or what?
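To make the counter/label distinction concrete, here is a small Python sketch. `time.gmtime` plays the role of the UN*X label conversion, and `unix_to_filetime` is a hypothetical helper that assumes FILETIME's documented definition (100-nanosecond intervals since 1601-01-01 UTC). It shows only the arithmetic and does not answer what Windows does during a leap second.

```python
import time

# Seconds between the FILETIME epoch (1601-01-01) and the Unix epoch (1970-01-01).
EPOCH_DIFF_SECONDS = 11_644_473_600

def unix_to_filetime(unix_seconds: int) -> int:
    """Hypothetical helper: convert a Unix counter value to FILETIME 100 ns ticks."""
    return (unix_seconds + EPOCH_DIFF_SECONDS) * 10_000_000

# Counter values on either side of the June 30, 2012 leap second.
for counter in (1341100799, 1341100800):
    label = time.gmtime(counter)  # "human-style label", like struct tm
    print(counter, time.strftime("%Y-%m-%d %H:%M:%S", label), unix_to_filetime(counter))
```

Both counters step by one unit of their own resolution here; neither leaves room for a label of 23:59:60.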
As for NTP, Microsoft has a page on "How the Windows Time service treats a leap second", which says

When the Windows Time service is working as a Network Time Protocol (NTP) client
The Windows Time service does not indicate the value of the Leap Indicator when the Windows Time service receives a packet that includes a leap second. (The Leap Indicator indicates whether an impending leap second is to be inserted or deleted in the last minute of the current day.) Therefore, after the leap second occurs, the NTP client that is running Windows Time service is one second faster than the actual time. This time difference is resolved at the next time synchronization.
(the author of which needs to be told what "inserted or deleted" implies - do they mean that, regardless of whether a leap second is inserted or deleted, the NTP client that is running Windows Time service is one second faster than the actual time?)
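For reference, the Leap Indicator is the top two bits of the first byte of an NTP packet header (per RFC 5905). A minimal Python sketch of decoding it follows. `leap_indicator` and the sample byte are illustrative assumptions, and this is not how the Windows Time service is implemented.

```python
# Leap Indicator meanings from the NTP header definition (RFC 5905).
LEAP_MEANINGS = {
    0: "no warning",
    1: "last minute of the day has 61 seconds (leap second inserted)",
    2: "last minute of the day has 59 seconds (leap second deleted)",
    3: "unknown (clock not synchronized)",
}

def leap_indicator(packet: bytes) -> int:
    """Return the 2-bit Leap Indicator from a raw NTP packet."""
    if not packet:
        raise ValueError("empty NTP packet")
    return packet[0] >> 6

# Illustrative first byte: LI=1, version=4, mode=4 (server), then zero padding
# up to the 48-byte header size.
sample = bytes([0b01100100]) + bytes(47)
print(LEAP_MEANINGS[leap_indicator(sample)])
```

A client that drops this field has no advance warning of the extra second and can only correct itself at the next synchronization, as the quoted page describes.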
And then there's one more question: if there's anything in the NT kernel that deals with leap seconds, does any version have a glitch, as some versions of the Linux kernel do?
If not, then many of the other problems might not exist on Windows. This email from John Stultz, the author of the fix linked to in the previous paragraph, seems to indicate that at least some of the problems, if not all of them, stem from a kernel bug, so it might be that Java and company might be Just Fine on systems that don't have a kernel glitch of that sort (so they might work fine on at least some non-Linux systems, as well as on Linux systems with the bug fixed).
Re:All of my servers were fine by Guy Harris · 2012-07-01 11:47 · Score: 4, Informative

Our problem was with a third party monitoring solution - its daemon process brought every single one of our servers to a near halt by consuming all available cpu cycles at the stroke of gmt midnight.
The OS itself was fine.
Well, if you're talking about a Linux kernel, the part of the OS that dealt with leap seconds was not OK, and was "not OK" in a fashion that could cause processes using futexes to spin and consume all available CPU cycles when a leap second is introduced.

This monitoring software is common enough that it likely was behind a lot of the issues seen around the 'net.
...perhaps by virtue of either using futexes (in what I'm presuming is a legitimate fashion) or using something that uses futexes.
Re:FUD? by Guy Harris · 2012-07-01 12:08 · Score: 3, Informative

The bug has already been fixed for months now
A bug might have been fixed for months now, but I don't think that's the bug here.
Re:All of my servers were fine by Gil-galad55 · 2012-07-01 12:26 · Score: 5, Informative

They lost commercial power due to the big storm system that went through the DC area.

--
To follow knowledge like a sinking star, / Beyond the utmost bound of human thought. ("Ulysses", Tennyson)
Re:What about Windows and Mac? by magamiako1 · 2012-07-01 13:45 · Score: 3, Informative

In an Active Directory domain, the computer with the FSMO PDC Emulator role is not only a proper NTP server, but you can sync your devices to it.

Also, look up the command: w32tm
Please read this lkml thread before commenting by Guy Harris · 2012-07-01 14:42 · Score: 3, Informative

This linux-kernel mailing list thread discusses a kernel bug that causes futexes to repeatedly time out, so that code using them (which might include POSIX mutexes and condition variables, if that's what glibc uses for them on Linux) might spin.
That's not the kernel leap-second-handling bug that was fixed back in March, so it's not as if a properly-patched kernel wouldn't get hit by this (unless you define "properly-patched" as "includes the patch John Stultz came up with on July 1, 2012").
So, yes, this particular bug is Linux-specific (i.e., there's a reason why it hit Linux servers), and might not be the fault of the userland code running atop it (so it might not, for example, be Java's fault).
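A toy model of why early timeouts turn into spinning, under the assumption (taken from the description in this thread, not from the kernel source) that the timer subsystem's idea of wall time ran one second ahead of the real clock after the leap second. `count_wakeups` and its parameters are hypothetical, and nothing here calls a real futex.

```python
def count_wakeups(timer_skew: float, timeout: float, max_iters: int = 1000) -> int:
    """Count how often a 'wait until deadline' loop wakes up before it finishes.

    timer_skew: how far ahead of the real clock the timer subsystem believes
                it is (0.0 for a healthy system, 1.0 in this toy model of the bug).
    timeout:    how long the caller wants to wait, in seconds.
    max_iters:  safety cap so the simulated spin terminates.
    """
    wall = 0.0                      # simulated real clock
    deadline = wall + timeout       # absolute deadline the caller asked for
    wakeups = 0
    while wall < deadline and wakeups < max_iters:
        timer_now = wall + timer_skew
        # The timer fires once *its* clock reaches the deadline; if it thinks
        # the deadline has already passed, it fires immediately.
        wall += max(0.0, deadline - timer_now)
        wakeups += 1
        # The caller sees wall < deadline, assumes a spurious wakeup, and retries.
    return wakeups

print(count_wakeups(timer_skew=0.0, timeout=0.1))  # healthy: a single wakeup
print(count_wakeups(timer_skew=1.0, timeout=0.1))  # skewed: hits the safety cap
```

In the skewed case the simulated clock never advances, which is the model's stand-in for a process burning CPU in a retry loop until something resets the time.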
We took the coward's way out... by ElVee · 2012-07-02 01:34 · Score: 3, Informative

I work at a fairly large international outfit, with data feeds coming and going to the far ends of the Earth. Everything we do is time-sensitive. Processing messages that depend on prior messages already being processed means we can't gracefully handle things coming in out of order.
We spent lots of time and money studying this problem, hired a high-priced consulting outfit to advise us and spun up lots of projects to mitigate the "risk" of the leap second. There were far too many meetings and conference calls with vendors, VARs and other people that wanted us to pay them for their time. What was determined was that we couldn't guarantee that nothing would crash or (gasp!) that messages wouldn't be discarded or processed incorrectly, which was a risk we weren't willing to take. We run a full gamut of systems: HP-UX, Solaris, Linux, z/TPF, z/OS, DB2, etc. You get the idea. Too many variables and too many systems to update and test with the limited funds and limited timeframe given.
In the end, we avoided the problem by shutting down all (and I do mean ALL) processing and flushing all the transactional systems to disk and suspending EVERYTHING from a minute before until a minute after the leap second. (Was that two minutes or two minutes PLUS one second? Clock math has always eluded me.) Shutting down all these interconnected systems in the correct order was a precision dance that, in the end, we didn't perform very well. Messages did end up being discarded. At precisely :20 seconds after the leap second, we began syncing all our systems with our internal NTP server and then at precisely one minute after, we slowly started everything back up. There were some systems that required a restart. We manually reprocessed those earlier discarded messages just as fast as our little fingers could type. In all it took us about 15 minutes to get everything spun back up, and all that time is getting charged to our SLA, which affects ALL our evaluations and year-end bonuses.
Lots of work was done, overtime was paid and buckets of money were given to lots of high-priced consultants and I personally will take a hit to my paycheck, all over ONE GODDAMNED SECOND.
Let's not do that again, okay?

--
- Pithy comment goes here.