Leap Second Bug Causes Crashes
An anonymous reader writes in with a Wired story about the problems caused by the leap second last night. "Reddit, Mozilla, and possibly many other web outfits experienced brief technical problems on Saturday evening, when software underpinning their online operations choked on the “leap second” that was added to the world’s atomic clocks. On Saturday, at midnight Greenwich Mean Time, as June turned into July, the Earth’s official time keepers held their clocks back by a single second in order to keep them in sync with the planet’s daily rotation, and according to reports from across the web, some of the net’s fundamental software platforms — including the Linux operating system and the Java application platform — were unable to cope with the extra second."
And I didn't do anything special, just kept their software up-to-date.
Interesting. I wonder what conditions had to have been met for a crash to happen, none of my servers had so much as a hick-up.
I'm a Linux admin at a fairly large hosting company. The only thing that I am personally aware of happening this time around was that the time change triggered a bug in the OpenManage software on Dell servers, causing it to use 100% CPU. The solution was to resync the time and restart OpenManage. It wasn't really a fault of Linux itself, but of OpenManage on Linux. Lots of datacenters use Dell hardware and I'm sure most use OpenManage, so I'm sure the problem was widespread.
>hick-up.
The hick up watching the servers when the leap second came was you.
I'm uncertain why these reports keep referring to some monolithic "Linux" that is supposed to have had issues - Red Hat's the biggest Linux vendor, and certainly their "Linux" handled it just fine.
What distros had issues?
#DeleteChrome
Considering leap seconds happen every now and then, it seems odd that such fundamental things as Linux and Java cannot handle it. AFAIK, it was just about four years ago that we last had a leap second.
So far all I've heard about is affected Linux systems. Did Windows and OS X do just fine?
"some of the net’s fundamental software platforms — including the Linux operating system and the Java application platform — were unable to cope with the extra second."
No opinion about java, and no doubt there's plenty of dodgy software running on Linux, but the part about Linux not coping is BS.
From last night's logs....
Jun 30 19:59:59 thabto kernel: Clock: inserting leap second 23:59:60 UTC
I don't know, but the article reads as FUD. Sure, there might have been problems, but then, aren't there always problems, everywhere? It's just a matter of picking the right ones and you've got a 'Linux and Java = bad' article? Or am I being a fanboy now?
From my own machines and comparing notes with some other people (all in all, about 3k servers) the bug seems to affect machines randomly. Known facts:
There's a kernel patch that fixes the supposed issue: https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d
Affects Debian stable a lot.
Affects Java and Virtualbox (starts using too much CPU).
Affected my browser (iceweasel on debian testing).
Affects SOME mysql installs (5.1 and 5.5, but not all, and of two identical installs one might be affected, the other not).
The fix has been posted in a lot of places: /etc/init.d/ntp stop; date; date `date +"%m%d%H%M%C%y.%S"`; date; /etc/init.d/ntp start
(I'm all for switching unix time to a simple counter and leaving it to the calendar libs to put the leap seconds where necessary)
Well that explains it. I'm running nothing less than 3.3.8
It's like the Y2K bug, but every few years.
I want justice. Next time they take away a second from the day, I want one of these "stories" to be expunged.
The hard system lock bug due to a leap second was patched in 2.6.29, so either you've got some weird related bug, or something is very wrong.
24 hours? I assume you mean 1 day and that is NOT 24 hours, but a second more.
Don't fight for your country, if your country does not fight for you.
Configuration of the system to only accept 23:59:59 and not 23:59:60
You will be baked, and there will be cake.
As it turns out, my biggest problem was customer-supplied software that uses its own Java JREs. We install a JRE by default and update it whenever possible, but some software (Adeptia, VLTrader, Alfresco) comes with its own ancient JRE and scripts that call it instead of the system-supplied Java.
Not a single machine crashed (we are very explicitly in charge of which OS version is running) but a lot of Java processes locked up and had to be restarted.
I can even see a small bump in the power usage around two o'clock (0:00 GMT).
Why not bundle them and apply them every 10 or 20 years?
And apparently I'm not alone:
http://en.wikipedia.org/wiki/Leap_second#Proposal_to_abolish_leap_seconds
Hogwash, Astronomers can find coping mechanisms, it's either that or these ridiculous levels of stress for systems admins.
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
I'm managing a cluster of 2,400 nodes running FreeBSD, and AFAICS, none was tripped up by leap second NTP adjustments. On the other hand, 4 out of 180 Linux nodes crashed simultaneously at that very moment. All this is exceedingly weird, but may indeed point to a subtle bug in the Linux kernel (only?). I've never witnessed this behavior in the past.
cpghost at Cordula's Web.
About 5 seconds after midnight GMT a Java server app running on my Debian Squeeze server decided it was going to eat-up ALL THE THINGS and for some reason, the server rebooted itself. Glad to know I wasn't alone in shitting myself over odd behaviours.
The game.
Google official blog: "Time, technology and leaping seconds" (sept 2011)
http://googleblog.blogspot.in/2011/09/time-technology-and-leaping-seconds.html
I wonder if the leap second has anything to do with the labs Chubby paper / site currently being offline..
Hivemind harvest in progress..
MySQL started spiking my CPU when the leap second hit. Only MySQL, and nothing else. It was odd.
Clock: inserting leap second 23:59:60 UTC
No problem whatsoever on my Gentoo server, with a 3.3.1 hardened (Linux) kernel.
I had a lot of programs (none Java-based though) taking up an inordinate amount of CPU, and high system CPU usage. Couldn't figure out the cause, and a reboot fixed it. In retrospect, I think it was around midnight UTC.
The hard system lock bug due to a leap second was patched in 2.6.29, so either you've got some weird related bug, or something is very wrong.
Well, the weird related bug would arguably count as something being wrong. Apparently there is a bug in the handling of the insertion of positive leap seconds that could cause weird behavior with futexes, and that bug appears not to have been fixed until at least July 1, 2012 (I'm guessing John Stultz has worked up a patch).
Well that post might be a candidate for the super rare +5 offtopic mod. (a mod even rarer than +5 troll)
If that actually happened, then they should have just made it do 23:59:59 twice instead of crashing all the computers. I would like somebody to give me a concrete reason why any computer system should actually crash because of a lost second.
Considering how much most of the hardware clocks on the hardware I support drift as it is, a leap second ain't nothing compared to the six-hour ntpdate updates.
some of the net’s fundamental software platforms — including the Linux operating system and the Java application platform
Nice troll. How did my half dozen continuously running Linux systems including a server and a router cope with it then?
When all you have is a hammer, every problem starts to look like a thumb.
On all my Linux systems, Chrome plus some kernel threads pulled 100% CPU until I exited Chrome (which worked fine with Shift-Ctrl-Q).
On one system Chrome refuses to start now. It restores the tabs but every tab is an "Aw, snap!" page, even if I move the configuration directory away.
not sure if it was related, but I noticed a load avg of well over 10 on my amd e350 myth server box. I don't normally watch for load counts but I did notice that java was eating a lot of cpu, too; and I don't run java directly, some other 'stuff' on my system must be doing that. (sigh, I hate java...)
a reboot fixed things. I hate saying that, too. but the system was very slow and I needed to remove a pci card, anyway (lol).
kernel was 3.0.something
--
"It is now safe to switch off your computer."
Yes, for example, there were a number of anecdotes of MySQL databases using 100% CPU time, and the article mentions the similar Java/Cassandra problem. I suspect these two probably account for the majority of issues encountered.
http://blog.mozilla.org/it/2012/06/30/mysql-and-the-leap-second-high-cpu-and-the-fix/
This linux-kernel mailing list thread discusses a kernel bug that causes futexes to repeatedly time out, so that code using them (which might include POSIX mutexes and condition variables, if that's what glibc uses for them on Linux) might spin.
That's not the kernel leap-second-handling bug that was fixed back in March, so it's not as if a properly-patched kernel wouldn't get hit by this (unless you define "properly-patched" as "includes the patch John Stultz came up with on July 1, 2012").
So, yes, this particular bug is Linux-specific (i.e., there's a reason why it hit Linux servers), and might not be the fault of the userland code running atop it (so it might not, for example, be Java's fault).
http://releases.jhu.edu/2011/12/27/time-for-a-change-johns-hopkins-scholars-say-calendar-needs-serious-overhaul/
The proposed permanent calendar has a predictable 91-day quarterly pattern of two months of 30 days and a third month of 31 days.
The calendar - http://henry.pha.jhu.edu/ccct.calendar.html
FAQ - http://henry.pha.jhu.edu/calendar.html
Check out the other threads pointing out an issue with futexes. There's an easy workaround, just manually set the time on your system and the problem will go away until the next leap second.
Data: She brought me closer to humanity than I ever thought possible, and for a time...I was tempted by her offer.
Jean-Luc Picard: How long a time?
Data: Zero point six eight seconds, sir. For an android, that is nearly an eternity.
When NTP knows that a leap second is to be added, it (on Linux at least) sets a flag in the kernel to say that at 23:59:59, please continue to 23:59:60 before going to 00:00:00. This is set by NTP anytime on the day that the leap second is due to be implemented, hence why a server running NTP on Linux would know that TODAY a leap second is due (cause they should always be posted at the 23:59:59 cross-over)
Oh, and what you thought of as NTP sounds more like how NTPDATE works (i.e., one-shot "what's the time, Mr Server" style clock updates)
NTP is /far/ more complicated and does stuff like working out the time delay between you and the server(s), the skew of /your/ clock (so it knows if your clock tends to run a bit fast/slow and adjusts for that) and lots of other clever "make time of day clocks work better" stuff (and sometimes even updating the HW TOD clock if needed)
NOW can we just collectively pat ourselves on the back for Y2K?
I still talk to people who believe Y2K was all a hoax perpetrated by computer consultancy companies to scare upgrade cash from large customers. :)
Now at least I have some ammunition to shoot back with
And hopefully we can start getting people to take the coming 32-bit epoch end seriously too
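For reference, the date of that 32-bit rollover can be worked out directly. A minimal Python sketch, assuming a signed 32-bit counter of seconds since 1970-01-01 00:00:00 UTC (the names MAX_INT32 and last_moment are made up for illustration):

```python
from datetime import datetime, timezone

# Largest value a signed 32-bit counter can hold (assumed time_t width).
MAX_INT32 = 2**31 - 1

# Interpret that count as seconds since the Unix epoch, in UTC.
last_moment = datetime.fromtimestamp(MAX_INT32, tz=timezone.utc)

print(last_moment.isoformat())  # 2038-01-19T03:14:07+00:00
```

One second later a signed 32-bit counter would wrap to a negative value, which is why systems still using that width need attention before then.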
Business/App ideas are like arseholes: everyone's got one, they're mostly shit, but very rarely they contain a diamond
That's what I thought too. I don't understand why it's any different to having a manual or automatic clock update for DST or any other reason. If there really was a text version that came across as 23:59:60 that's utterly laughable.
The only system in the world that would accept such a thing is a MySQL "database".
I work at a fairly large international outfit, with data feeds coming and going to the far ends of the Earth. Everything we do is time-sensitive. Processing messages that depend on prior messages already being processed means we can't gracefully handle things coming in out of order.
We spent lots of time and money studying this problem, hired a high-priced consulting outfit to advise us and spun up lots of projects to mitigate the "risk" of the leap second. There were far too many meetings and conference calls with vendors, VARS and other people that wanted us to pay them for their time. What was determined was that we couldn't guarantee that nothing would crash or (gasp!) messages might be discarded or processed incorrectly, which was a risk we weren't willing to take. We run a full gamut of OSes, from HP/UX, Solaris, Linux, z/TPF, z/OS, DB2 etc etc.. You get the idea. Too many variables and too many systems to update and test with the limited funds and limited timeframe given.
In the end, we avoided the problem by shutting down all (and I do mean ALL) processing and flushing all the transactional systems to disk and suspending EVERYTHING from a minute before until a minute after the leap second. (Was that two minutes or two minutes PLUS one second? Clock math has always eluded me.) Shutting down all these interconnected systems in the correct order was a precision dance that, in the end, we didn't perform very well. Messages did end up being discarded. At precisely :20 seconds after the leap second, we began syncing all our systems with our internal NTP server and then at precisely one minute after, we slowly started everything back up. There were some systems that required a restart. We manually reprocessed those earlier discarded messages just as fast as our little fingers could type. In all it took us about 15 minutes to get everything spun back up, and all that time is getting charged to our SLA, which affects ALL our evaluations and year-end bonuses.
Lots of work was done, overtime was paid and buckets of money were given to lots of high-priced consultants and I personally will take a hit to my paycheck, all over ONE GODDAMNED SECOND.
Let's not do that again, okay?
- Pithy comment goes here.
This is so silly. One second. How many ways, for those who CLAIM they needed to account for this second, could problems have been avoided? Hmm. Heaven forbid they simply IGNORE the extra second (all except those oh-so-crucial banking connections who SUPPOSEDLY need to be perfectly in sync) and let the system either adjust its time however it normally would, or perhaps write a script to pause services for 5-15 seconds while the time is adjusted, or get fancy and write something that slowly took away nanoseconds so that over the course of a minute, hour, day, etc, the second was accounted for.
This is just beyond silly. At least Y2K had logical concerns that people had to deal with (even though THOSE were blown completely out of proportion as well).
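The "slowly took away nanoseconds" option above is essentially what the Google blog post linked elsewhere in this thread calls a leap smear. A rough Python sketch of the idea, assuming a plain linear ramp over a chosen window; this is a simplification for illustration, not Google's actual method, and the function name and the 24-hour default are hypothetical:

```python
def smeared_offset(elapsed: float, window: float = 86400.0) -> float:
    """Return how much of the leap second (0.0 to 1.0 s) has been absorbed.

    elapsed: seconds since the smear window opened.
    window:  total length of the smear window in seconds (assumed 24 h).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    # Clamp so the offset stays at 0 before the window and at 1 after it.
    clamped = min(max(elapsed, 0.0), window)
    return clamped / window


# Halfway through a 24-hour window, half a second has been absorbed.
print(smeared_offset(43200.0))  # 0.5
```

A time server doing this would subtract the offset from the time it hands out, so clients never see 23:59:60 or a repeated second; the price is that every clock is slightly wrong during the window.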
AMMalena (www.Malena.net) "The avalanche has already begun. It is too late for the pebbles to vote." (Kosh, B5)
Is there a system call that actually could return 23:59:60 as a valid time???
If that actually happened, then they should have just made it do 23:59:59 twice instead of crashing all the computers. I would like somebody to give me a concrete reason why any computer system should actually crash because of a lost second.
If you send 23:59:59 twice, you have the same second in the system twice, which can potentially cause issues with logs. If everything is timestamped to the second/millisecond, how can you be sure an event happened in the first 23:59:59 second, or the second (or subsequent) 23:59:59 second?
Well, it's not a lost second.. it's an extra second..
The entire issue is that there are so many uses for time that any one strategy does not work in all cases. Consider a simple logger that outputs some value once per second.. well, you don't want that logger to output 23:59:59 twice.. that could easily create problems.. and you don't want the logger to miss a tick either, because that could cause other problems..
So to solve that problem we create the abnormal 23:59:60.. but because it's abnormal it can easily look like 00:00:00 after simple time manipulation operations, causing 00:00:00 to be seen twice instead.. the same problem we were trying to avoid..
I propose the following solution: Stop fucking with time in abnormal ways such as leap seconds.. the subset of problem domains where syncing to some abstract ideal celestial clock matters is rather small, and it's far easier to let those problem domains handle conversion from system time to abstract celestial time than it is to make everything else work well with edge cases.
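The "23:59:60 can easily look like 00:00:00" point is easy to reproduce. A small Python sketch, using the date of this leap second as example input; it only shows how two standard-library routines behave and makes no claim about what any of the affected software actually did:

```python
import calendar
from datetime import datetime

# Broken-down UTC times: the leap second itself and the midnight after it.
leap = (2012, 6, 30, 23, 59, 60)
midnight = (2012, 7, 1, 0, 0, 0)

# calendar.timegm does plain arithmetic on the fields, so the leap second
# and the following midnight collapse onto the same seconds-since-Epoch value.
leap_ts = calendar.timegm(leap)
midnight_ts = calendar.timegm(midnight)
print(leap_ts == midnight_ts)  # True

# datetime takes the other route and refuses second=60 outright.
try:
    datetime(*leap)
    accepted = True
except ValueError:
    accepted = False
print(accepted)  # False
```

So depending on which routine a program uses, the leap second either silently becomes a duplicate of 00:00:00 or turns into an exception.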
"His name was James Damore."
If you examine your linux logs, you'd see an extra second inserted Sunday morning -- 1 minute had 61 seconds so instead of rolling over at 59, it rolled over to 60 and then hit 0 (at least in 3.2.X)...
Know your clocks. Check clock_gettime() and the other clock_* functions, and all clock types in POSIX.
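As a rough illustration of why the clock type matters, here is a Python sketch that times an interval with CLOCK_MONOTONIC rather than CLOCK_REALTIME. It assumes a Unix-like system where Python exposes time.clock_gettime; the helper name elapsed_monotonic is made up:

```python
import time


def elapsed_monotonic(work) -> float:
    """Run work() and return how long it took, in seconds.

    Uses CLOCK_MONOTONIC, which is not stepped when the wall clock is
    set or when a leap second is inserted, so the result is never negative.
    """
    start = time.clock_gettime(time.CLOCK_MONOTONIC)
    work()
    return time.clock_gettime(time.CLOCK_MONOTONIC) - start


# CLOCK_REALTIME is the wall clock; it can jump, so use it for timestamps only.
wall_now = time.clock_gettime(time.CLOCK_REALTIME)
took = elapsed_monotonic(lambda: sum(range(1000)))
print(f"wall clock: {wall_now:.0f}, elapsed: {took:.6f}s")
```

Timeouts and intervals computed from the wall clock are the ones that go wrong when the clock is stepped back by a second.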
You mean the clock types such as:
Check out what date does...
From man date we have:
(...)
%M minute (00..59)
(...)
%S second (00..60)
So clearly date is able to print 23:59:60 as a valid date
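The same can be checked from Python, whose time.strftime also accepts a seconds field of 60. A small sketch; the tuple is filled in by hand for this leap second (Saturday is weekday 5, day 182 of a leap year) rather than read from the system clock:

```python
import time

# struct_time-style tuple: year, month, day, hour, minute, second,
# weekday (Monday=0), day of year, DST flag.
leap_second = (2012, 6, 30, 23, 59, 60, 5, 182, 0)

formatted = time.strftime("%Y-%m-%d %H:%M:%S", leap_second)
print(formatted)  # 2012-06-30 23:59:60
```

This only shows that the formatting layer can represent the value; whether a running system ever hands 23:59:60 to an application is a separate question.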
Higuita
When NTP knows that a leap second is to be added, it (on Linux at least) sets a flag in the kernel to say that at 23:59:59, please continue to 23:59:60 before going to 00:00:00.
Where does the Linux kernel know about "23:59:59" and "23:59:60" rather than "N seconds since the Epoch", other than when dealing with real time clock hardware that maintains year/month/day/hour/minute/second rather than a count of ticks?
It looks as if the stuff in kernel/time/ntp.c adjusts the "N seconds and M whatevers since the Epoch" counter so that it reflects "seconds since the Epoch" rather than the number of seconds that have elapsed since the Epoch.
As far as I knew, NTP just says "Hey, server, what time is it" and gets back "It's f*****ing 3 o'clock, exactly 1 hour since the last time you asked. Now go away and quit bothering me!" (only it says it really fast using a single big number) At which point the software makes sure my server knows it's 3 o'clock.
No, as others have noted, NTP does a lot more - including saying "hey, a {positive or negative} leap second is coming up!" (look for "leap indicator" in RFC 5905). What the NTP client does with that is up to the client; I guess Linux is trying to do what POSIX specifies, i.e. having "seconds since the Epoch" be something other than a count of the seconds that have elapsed since the Epoch.
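For the curious, the leap indicator is the top two bits of the first byte of an NTP packet header, per RFC 5905. A minimal Python sketch of decoding it; the function and table names are made up, and the example bytes are constructed by hand rather than captured from a real server:

```python
# Meanings of the 2-bit leap indicator (LI) field, per RFC 5905.
LEAP_INDICATOR = {
    0: "no warning",
    1: "last minute of the day has 61 seconds",
    2: "last minute of the day has 59 seconds",
    3: "unknown (clock unsynchronized)",
}


def leap_indicator(first_byte: int) -> int:
    """Extract LI from the first header byte (layout: LI:2, VN:3, Mode:3)."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("expected a single byte value")
    return (first_byte >> 6) & 0b11


# 0b01_100_100: LI=1, version 4, mode 4 (server).
li = leap_indicator(0b01100100)
print(li, LEAP_INDICATOR[li])  # 1 last minute of the day has 61 seconds
```

What a client does after seeing LI=1 is up to the client, which is where the operating systems discussed in this thread differ.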
And it does this at the (human readable version) 23:59:59 to 00:00:00 handover to make it happen at the end of the day.
I appreciate that Linux manages its TOD clock as xxx ticks etc, but what I wrote is accurate from the point of view of a user watching the result. Trying to explain how NTP does it (on all the different OSes it runs on) is a bit much for a Slashdot post (well, at least it's more time than I have for posting!)
FWIW NTP on z/OS just spins the clock so the last second runs really slow to make sure that any apps don't ever see the ::60 (or try and react to it at least).
This seems like the kind of problem made by one of those "super-bright" children that got hired for way too much money in the late 90's and were given offices with dog beds and room for their skateboards while they ignored all the great original standards that made the internet possible, and built insanely over-engineered new ones in the hope of making a fortune.
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
This seems like the kind of problem made by one of those "super-bright" children that got hired for way too much money in the late 90's and were given offices with dog beds and room for their skateboards while they ignored all the great original standards that made the internet possible, and built insanely over-engineered new ones in the hope of making a fortune.
I'm not sure what "this" refers to there, but, as per RFC 958, the "leap indicator" dates back at least to 1985, and, at least if I remember correctly, the rule that "seconds since the Epoch" doesn't mean "seconds that have elapsed since the Epoch" dates back to the original 1988 POSIX.
On our Red Hat box the clock stopped for 1 second. Java's System.currentTimeMillis() returned the same value for an entire second. Not good.