Leap Second Bug Causes Crashes
An anonymous reader writes in with a Wired story about the problems caused by the leap second last night. "Reddit, Mozilla, and possibly many other web outfits experienced brief technical problems on Saturday evening, when software underpinning their online operations choked on the “leap second” that was added to the world’s atomic clocks. On Saturday, at midnight Greenwich Mean Time, as June turned into July, the Earth’s official time keepers held their clocks back by a single second in order to keep them in sync with the planet’s daily rotation, and according to reports from across the web, some of the net’s fundamental software platforms — including the Linux operating system and the Java application platform — were unable to cope with the extra second."
And I didn't do anything special, just kept their software up-to-date.
Interesting. I wonder what conditions had to have been met for a crash to happen, none of my servers had so much as a hick-up.
I'm a Linux admin at a fairly large hosting company. The only thing that I am personally aware of happening this time around was that the time change triggered a bug in the OpenManage software on Dell servers, causing it to use 100% CPU. The solution was to resync the time and restart OpenManage. It wasn't really a fault of Linux itself, but of OpenManage on Linux. Lots of datacenters use Dell hardware and I'm sure most use OpenManage, so I'm sure the problem was widespread.
>hick-up.
The hick up watching the servers when the leap second came was you.
For starters, you need a kernel no more recent than 2.6.28, a kernel so old my Debian stable box is four revisions past it!
I'm uncertain why these reports keep referring to some monolithic "Linux" that is supposed to have had issues - Red Hat's the biggest Linux vendor, and certainly their "Linux" handled it just fine.
What distros had issues?
Considering leap seconds happen every now and then, it seems odd that such fundamental things as Linux and Java cannot handle them. AFAIK, it was just about four years ago that we last had a leap second.
So far all I've heard about is affected Linux systems. Did Windows and OS X do just fine?
"some of the net’s fundamental software platforms — including the Linux operating system and the Java application platform — were unable to cope with the extra second."
No opinion about java, and no doubt there's plenty of dodgy software running on Linux, but the part about Linux not coping is BS.
From last night's logs....
Jun 30 19:59:59 thabto kernel: Clock: inserting leap second 23:59:60 UTC
I don't know, but the article reads as FUD. Sure, there might have been problems, but then, aren't there always problems, everywhere? It's just a matter of picking the right ones and you've got a 'Linux and Java = bad' article? Or am I being a fanboy now?
From my own machines and comparing notes with some other people (all in all, about 3k servers) the bug seems to affect machines randomly. Known facts:
There's a kernel patch that fixes the supposed issue: https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d
Affects Debian stable a lot.
Affects Java and Virtualbox (starts using too much CPU).
Affected my browser (iceweasel on debian testing).
Affects SOME mysql installs (5.1 and 5.5, but not all, and of two identical installs one might be affected, the other not).
The fix has been posted in a lot of places: /etc/init.d/ntp stop; date; date `date +"%m%d%H%M%C%y.%S"`; date; /etc/init.d/ntp start
(I'm all for switching unix time to a simple counter and leaving it to the calendar libs to put the leap seconds where necessary)
Why has this "story" been posted twice in one day?
Do you guys think we are incapable of remembering things that have happened in the last 24 hours?
Well, that explains it. I'm running nothing less than 3.3.8.
From the looks of it the kernel must be running on multiple cpus for the livelock to occur. This is probably one of the reasons why none of my servers had any issue.
Not true.
I had a number of boxes running 2.6.32 getting bit by this bug.
It's like the Y2K bug, but every few years.
The hard system lock bug due to a leap second was patched in 2.6.29, so either you've got some weird related bug, or something is very wrong.
Configuration of the system to only accept 23:59:59 and not 23:59:60
As it turns out, my biggest problem was customer-supplied software that ships its own Java JREs. We install a JRE by default and update it whenever possible, but some software (Adeptia, VLTrader, Alfresco) comes with its own ancient JRE and scripts that call it instead of the system-supplied Java.
Not a single machine crashed (we are very explicitly in charge of which OS version is running), but a lot of Java processes locked up and had to be restarted.
I can even see a small bump in the power usage around two o'clock (0:00 GMT).
We had many servers with this issue, mostly RHEL 6 servers running JBoss. The only symptom is high load. If you are not actively monitoring your server load, you may not even know that there's an issue yet.
Why not bundle them and apply them every 10 or 20 years?
And apparently I'm not alone:
http://en.wikipedia.org/wiki/Leap_second#Proposal_to_abolish_leap_seconds
Hogwash, Astronomers can find coping mechanisms, it's either that or these ridiculous levels of stress for systems admins.
I'm managing a cluster of 2,400 nodes running FreeBSD, and AFAICS, none was tripped off by leap second NTP adjustments. On the other hand, 4 out of 180 Linux nodes crashed simultaneously at that very moment. All this is exceedingly weird, but may indeed point to a subtle bug in the Linux kernel (only?). I've never witnessed this behavior in the past.
About 5 seconds after midnight GMT a Java server app running on my Debian Squeeze server decided it was going to eat-up ALL THE THINGS and for some reason, the server rebooted itself. Glad to know I wasn't alone in shitting myself over odd behaviours.
Hogwash, Astronomers can find coping mechanisms, it's either that or these ridiculous levels of stress for systems admins.
The same can be said for leap years. They've been around for a few hundred years and people still can't cope with them. Why don't we just go back to the Julian calendar and drop the Gregorian one?
Ditto for Daylight Saving Time which, IMHO, is completely arbitrary and not tied to any physical need or phenomenon, and which we're still "stuck" with nonetheless.
I will never write a program that correctly handles seconds=60. Period. EVAR!
Google official blog: "Time, technology and leaping seconds" (sept 2011)
http://googleblog.blogspot.in/2011/09/time-technology-and-leaping-seconds.html
I wonder if the leap second has anything to do with the labs Chubby paper / site currently being offline..
Nobody has posted that it also took thepiratebay down, something that the MAFIAA has been trying to do for the last umpteen years?
MySQL started spiking my CPU when the leap second hit. Only MySQL, and nothing else. It was odd.
Clock: inserting leap second 23:59:60 UTC
No problem whatsoever on my Gentoo server, with a 3.3.1 hardened (Linux) kernel.
I had a lot of programs (none Java-based though) taking up an inordinate amount of CPU, and high system CPU usage. Couldn't figure out the cause, and a reboot fixed it. In retrospect, I think it was around midnight UTC.
Why can't the guys behind NTP.org provide a leap second smear option like that used by Google as an alternative? Have people who deal with time, deal with the problem in a way that most people wouldn't notice in a centralized, but optional manner?
So, a leap smear NTP capable pool, and a standard NTP pool?
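For anyone wondering what a "leap smear" would look like in practice, here is a minimal sketch. It assumes a plain linear ramp over a fixed window before the leap second; the function name `smear_offset`, the 20-hour window, and the linear shape are all assumptions made for illustration, not a description of what Google or NTP.org actually run.

```python
def smear_offset(seconds_until_leap: float, window: float = 72000.0) -> float:
    """Return how much of the leap second (in seconds) has been absorbed so far.

    seconds_until_leap: time remaining until the leap second (hypothetical input).
    window: length of the smear window in seconds (assumed here to be 20 hours).
    """
    if seconds_until_leap >= window:
        return 0.0   # smear has not started yet
    if seconds_until_leap <= 0:
        return 1.0   # the whole extra second has been absorbed
    # Linear ramp: absorb the extra second evenly across the window.
    return 1.0 - seconds_until_leap / window


if __name__ == "__main__":
    for remaining in (80000.0, 72000.0, 36000.0, 0.0):
        print(remaining, smear_offset(remaining))
```

A server handing out smeared time would add this offset to the time it reports, so clients never see a 23:59:60 or a repeated second, at the cost of being up to a second away from true UTC during the window.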
The hard system lock bug due to a leap second was patched in 2.6.29, so either you've got some weird related bug, or something is very wrong.
Well, the weird related bug would arguably count as something being wrong. Apparently there is a bug in the handling of the insertion of positive leap seconds that could cause weird behavior with futexes, and that bug appears not to have been fixed until at least July 1, 2012 (I'm guessing John Stultz has worked up a patch).
Well that post might be a candidate for the super rare +5 offtopic mod. (a mod even rarer than +5 troll)
If that actually happened, then they should have just made it do 23:59:59 twice instead of crashing all the computers. I would like somebody to give me a concrete reason why any computer system should actually crash because of a lost second.
Considering how much most of the hardware clocks on the hardware I support drift as it is, a leap second ain't nothing compared to the six-hour ntpdate updates.
some of the net’s fundamental software platforms — including the Linux operating system and the Java application platform
Nice troll. How did my half dozen continuously running Linux systems including a server and a router cope with it then?
On all my Linux systems, Chrome plus some kernel threads pulled 100% CPU until I exited Chrome (which worked fine with Shift-Ctrl-Q).
On one system Chrome refuses to start now. It restores the tabs but every tab is an "Aw, snap!" page, even if I move the configuration directory away.
not sure if it was related, but I noticed a load avg of well over 10 on my amd e350 myth server box. I don't normally watch for load counts but I did notice that java was eating a lot of cpu, too; and I don't run java directly, some other 'stuff' on my system must be doing that. (sigh, I hate java...)
a reboot fixed things. I hate saying that, too. but the system was very slow and I needed to remove a pci card, anyway (lol).
kernel was 3.0.something
Yes, for example, there were a number of anecdotes of MySQL databases using 100% CPU time, and the article mentions the similar Java/Cassandra problem. I suspect these two probably account for the majority of issues encountered.
http://blog.mozilla.org/it/2012/06/30/mysql-and-the-leap-second-high-cpu-and-the-fix/
This linux-kernel mailing list thread discusses a kernel bug that causes futexes to repeatedly time out, so that code using them (which might include POSIX mutexes and condition variables, if that's what glibc uses for them on Linux) might spin.
That's not the kernel leap-second-handling bug that was fixed back in March, so it's not as if a properly-patched kernel wouldn't get hit by this (unless you define "properly-patched" as "includes the patch John Stultz came up with on July 1, 2012").
So, yes, this particular bug is Linux-specific (i.e., there's a reason why it hit Linux servers), and might not be the fault of the userland code running atop it (so it might not, for example, be Java's fault).
I'm mildly fascinated (by mildly, I mean if someone here has a good answer, I'll read it, but that's as far as I'm going) -- I have a few linux servers and they are apparently either too old or updated enough to have not had the problem, because they didn't crash.
My first question was, "How the hell would my servers even KNOW that someone had inserted a "Leap Second" into the time unless they happened to do their ntp updates at exactly that time?" That of course would be followed by "why would they care?"
As far as I knew, NTP just says "Hey, server, what time is it" and gets back "It's f*****ing 3 o'clock, exactly 1 hour since the last time you asked. Now go away and quit bothering me!" (only it says it really fast using a single big number) At which point the software makes sure my server knows it's 3 o'clock.
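On the "how would my servers even KNOW" question: NTP replies carry a two-bit leap indicator in the first byte of the packet header, so a client can be warned about a pending leap second ahead of time rather than only at the moment it asks. The kernel is then told about it and inserts the second itself, which is what the "Clock: inserting leap second 23:59:60 UTC" log line quoted elsewhere in this thread shows. Below is a minimal sketch of decoding that first byte; the function name `parse_ntp_flags` and the dictionary layout are hypothetical, and this only decodes a byte, it does not talk to a real server.

```python
# Meaning of the two-bit leap indicator field in an NTP header.
LEAP_MEANING = {
    0: "no warning",
    1: "last minute of the day has 61 seconds",
    2: "last minute of the day has 59 seconds",
    3: "clock not synchronized",
}


def parse_ntp_flags(first_byte: int) -> dict:
    """Split the first header byte into leap indicator, version and mode."""
    leap = (first_byte >> 6) & 0b11      # top two bits
    version = (first_byte >> 3) & 0b111  # next three bits
    mode = first_byte & 0b111            # bottom three bits
    return {
        "leap": leap,
        "leap_meaning": LEAP_MEANING[leap],
        "version": version,
        "mode": mode,
    }


if __name__ == "__main__":
    # 0x64 = 0b01_100_100: leap warning set, version 4, mode 4 (server).
    print(parse_ntp_flags(0x64))
```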
http://releases.jhu.edu/2011/12/27/time-for-a-change-johns-hopkins-scholars-say-calendar-needs-serious-overhaul/
Proposed permanent calendar has a predictable 91-day quarterly pattern of two months of 30 days and a third month of 31 days,
The calendar - http://henry.pha.jhu.edu/ccct.calendar.html
FAQ - http://henry.pha.jhu.edu/calendar.html
Check out the other threads pointing out an issue with futexes. There's an easy workaround, just manually set the time on your system and the problem will go away until the next leap second.
Data: She brought me closer to humanity than I ever thought possible, and for a time...I was tempted by her offer.
Jean-Luc Picard: How long a time?
Data: Zero point six eight seconds, sir. For an android, that is nearly an eternity.
Why didn't everyone just schedule a reboot for 11:58pm?
NOW can we just collectively pat ourselves on the back for Y2K?
I still talk to people who believe Y2K was all a hoax perpetrated by computer consultancy companies to scare upgrade cash from large customers. :)
Now at least I have some ammunition to shoot back with
And hopefully we can start getting people to take the coming 32-bit epoch end seriously too
That's what I thought too. I don't understand why it's any different to having a manual or automatic clock update for DST or any other reason. If there really was a text version that came across as 23:59:60 that's utterly laughable.
The only system in the world that would accept such a thing is a MySQL "database".
Know your clocks. Check clock_gettime() and the other clock_* functions, and all the clock types in POSIX.
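To make the "know your clocks" point concrete, here is a small sketch using Python's `time` module, which wraps the platform clocks. The helper name `timed` is made up for illustration. The idea is simply that intervals should be measured on a monotonic clock, which is not stepped when the wall clock is set, while the wall clock is only for "what time is it" questions.

```python
import time


def timed(work) -> float:
    """Return how long work() took, measured on the monotonic clock."""
    start = time.monotonic()
    work()
    return time.monotonic() - start


# Ask the runtime how each clock is implemented on this platform.
wall = time.get_clock_info("time")
mono = time.get_clock_info("monotonic")

if __name__ == "__main__":
    print("time.time():      adjustable =", wall.adjustable)
    print("time.monotonic(): adjustable =", mono.adjustable)
    print("elapsed:", timed(lambda: time.sleep(0.01)))
```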
Many pieces of software assume 86400 seconds for a day. I just did a quick check and BIND 9.6 has logic using this as an estimate for zone refresh times, etc. The code tries to deal with leap years, but not seconds.
Perl's POSIX module defines a day as 86400 seconds. Perl File::Stat too.
The pw and tcpdump commands in MidnightBSD and at least FreeBSD 7.x make a similar assumption.
kern_shutdown.c in FreeBSD ...
This assumption is everywhere.
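The same assumption is baked into POSIX timestamps themselves, which is easy to check. The sketch below uses Python's `calendar.timegm`, which does plain calendar arithmetic, on the day of this leap second; the variable names are just for illustration.

```python
import calendar

SECONDS_PER_DAY = 86400  # the assumption discussed above

# POSIX timestamps for the start of 2012-06-30 and of 2012-07-01 (UTC).
start = calendar.timegm((2012, 6, 30, 0, 0, 0))
end = calendar.timegm((2012, 7, 1, 0, 0, 0))

# According to the timestamps, the day was exactly 86400 seconds long...
posix_day_length = end - start

# ...although the UTC day 2012-06-30 really lasted one second more.
actual_day_length = SECONDS_PER_DAY + 1

if __name__ == "__main__":
    print("POSIX says:", posix_day_length)
    print("Actually:  ", actual_day_length)
```

So code that multiplies by 86400 agrees with the system's own timestamps; it only goes wrong when it is treated as a measure of real elapsed time.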
I work at a fairly large international outfit, with data feeds coming and going to the far ends of the Earth. Everything we do is time-sensitive. Processing messages that depend on prior messages already being processed means we can't gracefully handle things coming in out of order.
We spent lots of time and money studying this problem, hired a high-priced consulting outfit to advise us and spun up lots of projects to mitigate the "risk" of the leap second. There were far too many meetings and conference calls with vendors, VARS and other people that wanted us to pay them for their time. What was determined was that we couldn't guarantee that nothing would crash or (gasp!) messages might be discarded or processed incorrectly, which was a risk we weren't willing to take. We run a full gamut of OSes, from HP/UX, Solaris, Linux, z/TPF, z/OS, DB2 etc etc.. You get the idea. Too many variables and too many systems to update and test with the limited funds and limited timeframe given.
In the end, we avoided the problem by shutting down all (and I do mean ALL) processing and flushing all the transactional systems to disk and suspending EVERYTHING from a minute before until a minute after the leap second. (Was that two minutes or two minutes PLUS one second? Clock math has always eluded me.) Shutting down all these interconnected systems in the correct order was a precision dance that, in the end, we didn't perform very well. Messages did end up being discarded. At precisely :20 seconds after the leap second, we began syncing all our systems with our internal NTP server and then at precisely one minute after, we slowly started everything back up. There were some systems that required a restart. We manually reprocessed those earlier discarded messages just as fast as our little fingers could type. In all it took us about 15 minutes to get everything spun back up, and all that time is getting charged to our SLA, which affects ALL our evaluations and year-end bonuses.
Lots of work was done, overtime was paid and buckets of money were given to lots of high-priced consultants and I personally will take a hit to my paycheck, all over ONE GODDAMNED SECOND.
Let's not do that again, okay?
This is so silly. One second. How many ways, for those who CLAIM they needed to account for this second, could problems have been avoided? Hmm. Heaven forbid they simply IGNORE the extra second (all except those oh-so-crucial banking connections who SUPPOSEDLY need to be perfectly in sync) and let the system either adjust its time however it normally would, or perhaps write a script to pause services for 5-15 seconds while the time is adjusted, or get fancy and write something that slowly took away nanoseconds so that over the course of a minute, hour, day, etc, the second was accounted for.
This is just beyond silly. At least Y2K had logical concerns that people had to deal with (even though THOSE were blown completely out of proportion as well).
If that actually happened, then they should have just made it do 23:59:59 twice instead of crashing all the computers. I would like somebody to give me a concrete reason why any computer system should actually crash because of a lost second.
If you send 23:59:59 twice, you have the same second in the system twice, which can potentially cause issues with logs. If everything is timestamped to the second/millisecond, how can you be sure an event happened in the first 23:59:59 second, or the second (or subsequent) 23:59:59 second?
Should I feel a second Older or Younger?
Well, it's not a lost second.. it's an extra second..
The entire issue is that there are so many uses for time that any one strategy does not work in all cases. Consider a simple logger that outputs some value once per second.. well, you don't want that logger to output 23:59:59 twice.. that could easily create problems.. and you don't want the logger to miss a tick either because that could cause other problems..
So to solve that problem we create the abnormal 23:59:60.. but because it's abnormal it can easily look like 00:00:00 after simple time manipulation operations, causing 00:00:00 to be seen twice instead.. the same problem we were trying to avoid..
I propose the following solution: Stop fucking with time in abnormal ways such as leap seconds.. the subset of problem domains where syncing to some abstract ideal celestial clock matters is rather small, and it's far easier to let those problem domains handle conversion from system time to abstract celestial time than it is to make everything else work well with edge cases.
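The collision described above, where 23:59:60 ends up looking like 00:00:00 after simple time arithmetic, is easy to reproduce. This is a minimal sketch using Python's `calendar.timegm` and `time.gmtime`; the helper name `to_epoch` is made up for illustration.

```python
import calendar
import time


def to_epoch(year, month, day, hour, minute, second) -> int:
    """Convert a UTC calendar time to a POSIX timestamp by plain arithmetic."""
    return calendar.timegm((year, month, day, hour, minute, second))


leap = to_epoch(2012, 6, 30, 23, 59, 60)   # the inserted leap second
midnight = to_epoch(2012, 7, 1, 0, 0, 0)   # the following midnight

# Converting back loses the distinction: the leap second comes out as midnight.
round_trip = time.gmtime(leap)

if __name__ == "__main__":
    print(leap, midnight, leap == midnight)
    print(time.strftime("%Y-%m-%d %H:%M:%S", round_trip))
```

Two distinct instants map to one timestamp, so anything keyed on the timestamp sees 00:00:00 twice.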
Check out what date does...
From man date we have:
(...)
%M minute (00..59)
(...)
%S second (00..60)
So clearly date is able to print 23:59:60 as a valid date
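Not every library is as forgiving as date, though. As a small illustration (the helper name `datetime_accepts_second_60` is made up), Python's C-style `time.strptime` parses a seconds value of 60, while its `datetime` type refuses to construct one:

```python
import time
from datetime import datetime

# The C-style parser accepts 60 as a seconds value.
parsed = time.strptime("2012-06-30 23:59:60", "%Y-%m-%d %H:%M:%S")


def datetime_accepts_second_60() -> bool:
    """Check whether datetime can represent 23:59:60 at all."""
    try:
        datetime(2012, 6, 30, 23, 59, 60)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    print("strptime tm_sec:", parsed.tm_sec)
    print("datetime accepts :60?", datetime_accepts_second_60())
```

So a timestamp string that one layer happily emits or parses can still blow up in the next layer.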
We have a small set of Xen servers, about 40 VM's, all running CentOS 6.2 (all installed about 30 days ago) running with a back-end of FreeBSD/HAST disk failover.
I mirrored this in development, only smaller of course.
Within 10 minutes of 4:59:59 PM PST, all servers had their CPUs maxed out with the error:
kernel: BUG: soft lockup - CPU#0 stuck for 92s! [ksoftirqd/0:4]
When I say all, I mean all of the servers, development included which is on a completely separate network.
All of our servers are set to run the newest release of Java and all of our applications are written in-house.
The two machines I actually got updated to the newest kernel that was released last week (I think) didn't have the problem.
The Xen Hypervisors did not have the problem, but the VM's were so wedged I had to force reboot them.
The FreeBSD boxes did not have any problems, and my ancient Solaris installs did not have a problem. A terribly freaky event, as I had missed that an issue like this could even happen.
Just posting post-mortem, not that it helps now.
Windows had zero problems.
How on earth do you screw up so bad that your system crashes if the time changes?
Ok, this is going to hurt but every time I try Linux as a desktop, I'm reminded why I use Windows. And it's not just the ugly fonts, and KDE/Gnome and Firefox making everything huge, and the lack of software support, and the cryptic stuff you have to do on the command line to get basic stuff working like video drivers, or the kernel updates that put a cryptic boot choice in the list that I had no idea what it means or which is which, or the problems with the sound, or printers, or other hardware devices that have no drivers, or how the desktops keep going backwards and change things I suppose just for the sake of change, or how nothing is really documented and it's just good luck on Google (and the list goes on and on and on)...
It's the fan boys who claim there's nothing wrong. And then we have a leap second and it crashes. Shoulda' got that patch that was out months (not years - wow, really?) ago! Nice.
No... really... Grandma could even use it. *cough*
Small adjustments to the system clock are generally applied gradually, by increasing or decreasing the rate of the system clock over a period of time (slewing). That is done in order to maintain a degree of sanity within the time domain for applications using the clock. Applications rightly assume that the clock advances over time. On our Linux systems that was not the case. The clock simply stopped for 1 second! Any application making the perfectly normal assumption that time passes could fail! A simple rate counter sampling some value and dividing by the elapsed time could get a zero division if the 'sleep()' call returns without time actually having passed. As an example, a programmer would generally make the reasonable assumption that 'sleep(100)' would return after at least 100ms has elapsed. Not so when the leap second was applied by brute force. Had NTP been used to adjust the time, it would automatically have happened gradually with no ill effect.
We have considered the cost of changing our software to take into account that time may not advance during a sleep() call. That would not only be very expensive but would not solve the problem for libraries that we use. Also, the problem has already been solved by NTP.
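The rate-counter failure described above is simple to show, along with the defensive version that costs so much to retrofit everywhere. This is only a sketch: the function name `rate` and the choice to return `None` when no time has passed are assumptions for illustration.

```python
from typing import Optional


def rate(count_delta: float, t_start: float, t_end: float) -> Optional[float]:
    """Events per second between two clock readings.

    Returns None when the clock did not advance (or went backwards),
    instead of raising ZeroDivisionError or reporting a negative rate.
    """
    elapsed = t_end - t_start
    if elapsed <= 0:
        return None  # the clock stalled or was stepped back
    return count_delta / elapsed


if __name__ == "__main__":
    print(rate(10, 0.0, 2.0))   # normal case
    print(rate(10, 5.0, 5.0))   # clock stood still across the sample
```

Taking the two readings from a monotonic clock rather than the wall clock avoids most of the problem in the first place, but as noted above that does not help with libraries you do not control.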
FYI, running CentOS 6.2 (RHEL 2.6.32-220.el6.x86_64)
Noticed constant high CPU in tomcat6 and qpidd - leap second bug was the problem.
Keeping the system up to date doesn't mean that old bugs are not there.
in our case it was many systems running RHEL6 with "2.6.32-220.4.1.el6.x86_64 #1 SMP Thu Jan 19 14:50:54 EST 2012"
see http://blog.admintoon.com/?p=336 for the fun I had on a Sunday