Leap Second Bug Causes Crashes

All of my servers were fine by Anonymous Coward · 2012-07-01 08:03 · Score: 5, Insightful

And I didn't do anything special, just kept their software up-to-date.

Re:All of my servers were fine by Sir_Sri · 2012-07-01 08:05 · Score: 4, Informative

That can be hard for some people.
Re:All of my servers were fine by Anonymous Coward · 2012-07-01 08:10 · Score: 3, Informative

Agreed. Patches that aren't required to solve an ongoing incident impacting customer traffic require about 2 weeks advance notice to pass through change control, and that's if everything is perfect. A single error in a ticket can push that ticket out another week, and another, and so on.
Generally, we shoot for 3 weeks before we are allowed to install a patch. On average, it's about right.
Re:All of my servers were fine by Anonymous Coward · 2012-07-01 08:15 · Score: 5, Informative

the patch was posted back in March.
https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d
Re:All of my servers were fine by 0123456 · 2012-07-01 08:55 · Score: 2

One of ours (running Java on Linux) started throwing out NTP alarms at 10 seconds after midnight, but it seems to have stayed up. However, the software on that particular system is especially vulnerable to leap second issues so we'd tested it pretty well beforehand.
Otherwise no-one has complained about any other systems going down so I presume they're OK.
Re:All of my servers were fine by nmb3000 · 2012-07-01 09:26 · Score: 4, Insightful

And I didn't do anything special, just kept their software up-to-date.
That's a nice ideal, but the reality is that many up-to-date "stable" distribution releases are still using kernels which are susceptible to the leap second problem (and haven't had the patch back-ported to them). Ubuntu 8.04 LTS server is supposed to be supported until April 2013, and on my (updated!) system,
# uname -r
2.6.24-28-server
I like the idea of stable releases, but this is a glaring problem with the entire idea. Everyone extols the wondrous virtues of package managers for Linux-based systems, but the dirty secret is that unless you stay bleeding-edge (which is usually the opposite of "server"), you'd better be happy with the 4-year-old version of Apache, PHP, MySQL, and the Linux kernel you're running. Sure, it's possible to manually download and install packages from a newer release (assuming you can get past the dependency hell usually associated with it). Sure, it's possible to try and splice in (or "pin", in Debian parlance) packages from a newer repository. Sure, it's possible to install from source, compiling and installing everything by hand. But once you do any of these you've given up 90% of what makes the package manager useful and are just asking for dependency problems in the future.
And, all that aside, do you even know if the patch released to fix this problem is included in your distribution-released kernel? If you're not rolling your own kernel it can be nigh to impossible to know what's included and what's not -- in that case it doesn't even matter if it's up-to-date.

--
"What do you despise? By this are you truly known." --Princess Irulan, Manual of Muad'Dib
/)
Re:All of my servers were fine by lister+king+of+smeg · 2012-07-01 09:50 · Score: 3, Informative

And, all that aside, do you even know if the patch released to fix this problem is included in your distribution-released kernel? If you're not rolling your own kernel it can be nigh to impossible to know what's included and what's not -- in that case it doesn't even matter if it's up-to-date.
Well you could read through the change log and release notes to find out.

--
---Saying gnome 3 is better than windows 8 not so much a compliment as it is damning with light praise.
Re:All of my servers were fine by Anonymous Coward · 2012-07-01 09:55 · Score: 0, Troll

It's pretty sad that a protocol as well defined and well known as UT* times even requires to be patched in any kernel.
Re:All of my servers were fine by RNLockwood · 2012-07-01 10:42 · Score: 2

NASA's Astronomy Picture of the Day, http://antwrp.gsfc.nasa.gov/apod/astropix.html, has apparently been down all day; wonder if this is the cause.
Anyone heard from the Space Lab today?

--
Nate
Re:All of my servers were fine by thePowerOfGrayskull · 2012-07-01 10:42 · Score: 3, Interesting

Our problem was with a third party monitoring solution - its daemon process brought every single one of our servers to a near halt by consuming all available cpu cycles at the stroke of gmt midnight.
The OS itself was fine.
This monitoring software is common enough that it likely was behind a lot of the issues seen around the 'net.
Re:All of my servers were fine by Ruie · 2012-07-01 11:19 · Score: 2

All *.gsfc.nasa.gov sites I tried to access are down - Fermi data, some catalogs, etc.
Re:All of my servers were fine by Guy+Harris · 2012-07-01 11:44 · Score: 1

the patch was posted back in March.
https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d
Or not.
Re:All of my servers were fine by Guy+Harris · 2012-07-01 11:47 · Score: 4, Informative

Our problem was with a third party monitoring solution - its daemon process brought every single one of our servers to a near halt by consuming all available cpu cycles at the stroke of gmt midnight.
The OS itself was fine.
Well, if you're talking a Linux kernel, the part of the OS that dealt with leap seconds was not OK, and was "not OK" in a fashion that could cause processes using futexes to spin and consume all available CPU cycles when a leap second is introduced.

This monitoring software is common enough that it likely was behind a lot of the issues seen around the 'net.
...perhaps by virtue of either using futexes (in what I'm presuming is a legitimate fashion) or using something that uses futexes.
Re:All of my servers were fine by Gil-galad55 · 2012-07-01 12:26 · Score: 5, Informative

They lost commercial power due to the big storm system that went through the DC area.

--
To follow knowledge like a sinking star, / Beyond the utmost bound of human thought. ("Ulysses", Tennyson)
Re:All of my servers were fine by Guy+Harris · 2012-07-01 12:54 · Score: 2

That's a nice ideal, but the reality is that many up-to-date "stable" distribution releases are still using kernels which are susceptible to the leap second problem (and haven't had the patch back-ported to them).
To which of the, apparently, two or more leap second problems are you referring? (The latest one, causing the bogus futex timeouts and subsequent CPU-eating spinfests, is, apparently, having a fix developed today, July 1, 2012, so getting that patched would be a little difficult - especially getting it patched before the leap second is introduced. :-))
Re:All of my servers were fine by Tough+Love · 2012-07-01 13:39 · Score: 2

That can be hard for some people.
And also not necessary on Linux, with the exception of security updates. Even my machines with ancient images like Ubuntu 8 were completely unbothered. Probably you're ok with any Linux younger than 25 years. Most probably, Linux 2.0 would have been fine except for the security update question.

--
When all you have is a hammer, every problem starts to look like a thumb.
Re:All of my servers were fine by Anonymous Coward · 2012-07-01 14:36 · Score: 0

Avamar?
I've seen this crap itself at my workplace today..
Re:All of my servers were fine by thePowerOfGrayskull · 2012-07-01 14:50 · Score: 1

Thanks for posting, I came across this info after my earlier post. I suspect you're correct.
Re:All of my servers were fine by rlseaman · 2012-07-01 14:57 · Score: 2

NASA Goddard is near Baltimore. They lost power in the storm and are operating under "Code Red": http://www.nasa.gov/centers/goddard/
Quite likely other misbehavior blamed on the leap second is actually the result of the storm (or like Pirate Bay, some unrelated crash).
Re:All of my servers were fine by jaymemaurice · 2012-07-01 15:49 · Score: 2

pesky software engineers, writing code for no reason.

--
120 characters ought to be enough for anyone
Re:All of my servers were fine by Sir_Sri · 2012-07-01 17:14 · Score: 1

http://serverfault.com/questions/403732/anyone-else-experiencing-high-rates-of-linux-server-crashes-during-a-leap-second
Seems like this problem didn't do any favours for several linux installs either.
http://arstechnica.com/business/2012/07/one-day-later-the-leap-second-v-the-internet-scorecard/
Re:All of my servers were fine by Bronster · 2012-07-01 18:18 · Score: 1

And not backported to any stable kernels where the bnx2 driver is non-broken :(
Re:All of my servers were fine by foxed · 2012-07-01 18:47 · Score: 1

Two words for you: Debian backports.
Re:All of my servers were fine by Vegemeister · 2012-07-01 21:39 · Score: 1

And yet I had a minor CPU usage issue with all of my Mozilla and Java applications. I'm running 3.2.
Re:All of my servers were fine by Anonymous Coward · 2012-07-01 21:46 · Score: 1

Any Linux younger than 25 years? I'd like to see a Linux older than 25 years ;)
Re:All of my servers were fine by Anonymous Coward · 2012-07-01 23:52 · Score: 1

One answer to your question, and all problems are solved with one stone killing the entire species of birds you dislike:
Run FreeBSD. The ports collection runs on a rolling development model, while the OS runs on two development models at the same time: one release, one rolling. The release of the OS has freebsd-update to bring it up to patchlevel, or you can track the stable branch of the version you like (8 or 9).
For example, the oldest version: 7.4 and 7 stable both use the same ports collection as the current version (9.0 and 9 stable). So even though you are being Mr. Stable server and only applying security patches to the OS, you are still able to run the latest stable versions of Apache/PHP and whatever else.
Nice, huh? No wonder that 1/3 of the traffic on the internet is served from FreeBSD (data based on Netflix being a FreeBSD house and that they account for 1/3 of the internet's traffic).
Re:All of my servers were fine by tehcyder · 2012-07-02 01:12 · Score: 2

Nice, huh? No wonder that 1/3 of the traffic on the internet is served from FreeBSD (data based on Netflix being a FreeBSD house and that they account for 1/3 of the internet's traffic).
But I thought that Netflix confirms BSD is dying?

--
To have a right to do a thing is not at all the same as to be right in doing it
Re:All of my servers were fine by Anonymous Coward · 2012-07-02 03:26 · Score: 0

Man. That's amazing. The leap second spawned destructive thunderstorms?
That's it. No more leap seconds. We may not survive the next one!
Re:All of my servers were fine by Tough+Love · 2012-07-02 06:08 · Score: 2

No question there was a kernel bug, a race condition to be precise. Now fixed. The chance of hitting it is pretty small, but if you have enough servers or the right load, some of them will. Kernel hang was possible but livelock with high CPU was more likely. The workaround: set the time.
The chance of your home server hitting this was vanishingly small. You're more likely to get a power outage.

--
When all you have is a hammer, every problem starts to look like a thumb.
Re:All of my servers were fine by Tough+Love · 2012-07-02 06:13 · Score: 1

Just checking to see who's paying attention ;)
Correct age limit: 16 years, before that I don't have any data.

--
When all you have is a hammer, every problem starts to look like a thumb.
Re:All of my servers were fine by Anonymous Coward · 2012-07-09 02:20 · Score: 0

prior to 2.6.29...LENNY...LEENY is DATED.
I use Wheezy.
custom kernel (ATOM) 3.5. This shows how dated Linux kernels are and how SLACKING ADMINS are. Get the latest every six months and recompile it custom. The performance enhancement is worth it.

Re: by bleedingsamurai · 2012-07-01 08:05 · Score: 2

Interesting. I wonder what conditions had to have been met for a crash to happen, none of my servers had so much as a hick-up.

Linux by Anonymous Coward · 2012-07-01 08:08 · Score: 4, Informative

I'm a Linux admin at a fairly large hosting company. The only thing that I'm personally aware of happening this time around was that the time change triggered a bug in the OpenManage software on Dell servers, causing it to use 100% CPU. The solution was to resync the time and restart OpenManage. It wasn't really a fault of Linux itself, but in OpenManage on Linux. Lots of datacenters use Dell hardware and I'm sure most use OpenManage, so I'm sure the problem was widespread.

Re:Linux by Anonymous Coward · 2012-07-01 08:21 · Score: 5, Informative

What you describe is a bug in the Linux kernel that causes problems for the Java VM that OpenManage uses.
It is not a bug in OpenManage at all.
Re:Linux by X0563511 · 2012-07-02 01:51 · Score: 2

It blew up Virtualbox for me as well. Guests were eating 100% CPU even though they were not aware of it, and after killing them the CPU load transferred to another Virtualbox service. Odd.
Reboot and it was working normally again.

--
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
Re:Linux by CompMD · 2012-07-02 03:35 · Score: 1

I witnessed this also. Our IT department is mostly a Dell shop, and dsm_om_connsvcd went completely bonkers.

Re: by Anonymous Coward · 2012-07-01 08:10 · Score: 5, Funny

>hick-up.

The hick up watching the servers when the leap second came was you.

Re: by Anonymous Coward · 2012-07-01 08:12 · Score: 0

For starters, you need a kernel no more recent than 2.6.28, a kernel so old my Debian stable box is four revisions past it!

Our Red Hat servers had no issues at all by 93+Escort+Wagon · 2012-07-01 08:12 · Score: 4, Insightful

I'm uncertain why these reports keep referring to some monolithic "Linux" that is supposed to have had issues - Red Hat's the biggest Linux vendor, and certainly their "Linux" handled it just fine.

What distros had issues?

--
#DeleteChrome

Re:Our Red Hat servers had no issues at all by Nutria · 2012-07-01 08:22 · Score: 4, Informative

TFA mentioned that the RHEL 6 kernel had the bug, but not RHEL 5.
It appears also that system load was a big factor, so if your systems aren't busy on Saturday then they might not have crashed even if running an affected kernel.

--
"I don't know, therefore Aliens" Wafflebox1
Re:Our Red Hat servers had no issues at all by Anonymous Coward · 2012-07-01 08:27 · Score: 1

The bug is related to kernel version, IIRC (introduced somewhere in the 2.6 series, resolved in 3.2 or somesuch). So it depends what kernel the distros ran.
Re:Our Red Hat servers had no issues at all by 93+Escort+Wagon · 2012-07-01 08:28 · Score: 1

TFA mentioned that the RHEL 6 kernel had the bug, but not RHEL 5. -- It appears also that system load was a big factor, so if your systems aren't busy on Saturday then they might not have crashed even if running an affected kernel.
Ah, ok - thanks, I managed to miss that. Most of our servers are still on RHEL 5 because of some odd issues we've experienced with LDAP under RHEL 6.
I've got a test/catch-all machine on RHEL 6, but that doesn't generally have to work very hard.

--
#DeleteChrome
Re:Our Red Hat servers had no issues at all by Anonymous Coward · 2012-07-01 08:44 · Score: 4, Informative

Red Hat had a lot of issues.
https://access.redhat.com/knowledge/articles/15145
https://access.redhat.com/knowledge/solutions/154713
It depended entirely on your load. The buggy kernel code ran every 17 minutes for the 24hr period leading up to the leap-second insertion.
If you had enough load, your chance of dead-locking your system increased significantly.
Solution: strip the leap-second flag by manually setting your time.
Re:Our Red Hat servers had no issues at all by MightyMartian · 2012-07-01 08:45 · Score: 4, Funny

Sorry can't remember the name. It's the one that takes the credit for the work of others.
Windows?

--
The world's burning. Moped Jesus spotted on I50. Details at 11.
Re:Our Red Hat servers had no issues at all by drinkypoo · 2012-07-01 08:49 · Score: 4, Funny

Sorry can't remember the name. It's the one that takes the credit for the work of others.
You must be talking about SCO, but if you're still running CND you should probably upgrade.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:Our Red Hat servers had no issues at all by antdude · 2012-07-01 08:57 · Score: 1

I think my Debian stable box's latest rbot build's launch_here.rb was acting weird from the leap bug because the CPU was going high even when idled. I rebooted after 55 days of uptime and it was fine.

--
Ant(Dude) @ Quality Foraged Links (AQFL.net) & The Ant Farm (antfarm.ma.cx / antfarm.home.dhs.org).
Re:Our Red Hat servers had no issues at all by Phroggy · 2012-07-01 11:37 · Score: 2

I've got a Slackware 12.0 box running 2.6.21.5 that crashed. Slackware 12.1 (2.6.24.5) and 12.2 (2.6.27.31) did not crash, but it sounds like these versions are vulnerable as well, I just got lucky.

--
$x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
$x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
Re:Our Red Hat servers had no issues at all by Anonymous Coward · 2012-07-01 12:00 · Score: 0

No, iOS
Re:Our Red Hat servers had no issues at all by Anonymous Coward · 2012-07-01 13:06 · Score: 0

Actually, RH was the one that was affected the worst.
RH 5 and below on 2.6., and RH is still supporting at least 4 with extended support.
RH's crappy package manager in 4.x and below, didn't patch kernels by default, so even _if_ RH had issued a patch, it probably wouldn't have been applied on these systems.
We run a bunch of unix and tons of linux, but only a few of those systems are RH, and those are considered the bastard stepchildren. Shitty package management, no in place upgrades (RH recommends against WTF?!), dearth of packaged software, etc. RH is probably pretty kick-ass if all you have ever known is windows or AIX (but even windows / AIX can do an in-place upgrade), but if you know better, can't see why you would run RH unless a proprietary software vendor requires it.
Re:Our Red Hat servers had no issues at all by lucifuge31337 · 2012-07-01 13:21 · Score: 1

Ah, ok - thanks, I managed to miss that. Most of our servers are still on RHEL 5 because of some odd issues we've experienced with LDAP under RHEL 6.
Because goddamn sudoers doesn't work with LDAP since 6.1, when it used to work just fine in 6.0 and now nslcd pukes on the config you need?

Yeah....this is FINALLY patched in 6.3 (a week ago or so). Be aware that you need to change some things and add an additional conf file to make it work. What a pain in the ass, but it's finally over (or will be for me once CentOS gets it downstream).

https://bugzilla.redhat.com/show_bug.cgi?id=760843

--
Do not fold, spindle or mutilate.
Re:Our Red Hat servers had no issues at all by Anonymous Coward · 2012-07-01 14:22 · Score: 0

In my case, two RHEL6 boxes, both running 2.6.32-220.13.1.el6.x86_64. Judging from other comments, it looks like I got unlucky with those two, since others with the same software setups didn't puke. The sucky part is that one of the two crashed machines was a VM server. The unsucky part is most stuff failed over to the secondary VM server, and the primary one kdump'ed then rebooted itself as well as its VM hosts.
Re:Our Red Hat servers had no issues at all by Anonymous Coward · 2012-07-01 14:41 · Score: 0

Seriously? +5 Funny for faggotry?
Unlikely, since you're still modded at zero.
Re:Our Red Hat servers had no issues at all by Guy+Harris · 2012-07-01 19:10 · Score: 2

The bug is related to kernel version, IIRC (introduced somewhere in the 2.6 series, resolved in 3.2 or somesuch). So it depends what kernel the distros ran.
More like resolved yesterday (today being July 2, 2012 where I'm typing this).

Why now? by Anonymous Coward · 2012-07-01 08:12 · Score: 1

Considering leap-seconds happen every now and then, it seems odd that such fundamental things as Linux and Java can not handle it. AFAIK, it was just about four years ago since we last had a leap-second.

Re:Why now? by Znork · 2012-07-01 08:41 · Score: 2, Interesting

We will keep having these kinds of issues for as long as some people fail to understand that time of day is an arbitrary number whose main utility lies in it being composed of predictable periods and divided into homogenous units. It should have no relation whatsoever to whatever time the sun happens to rise or set at any particular location and above all it should not be changed to accommodate fluctuations in the orbit of a rock circling an arbitrary star. Abominations like leap seconds or daylight savings make the whole system less useful by merely existing.
But personally I wouldn't be surprised if people off the equator were to get summer minutes composed of 120 seconds during daytime (or even better, a scale!) to ensure the sun rises and sets at the same time year around. Or, hey, why not simply make the seconds longer? Or a combination of both plus we can define pi to be 3 to make things simpler.
Re:Why now? by vux984 · 2012-07-01 09:09 · Score: 4, Insightful

and above all it should not be changed to accommodate fluctuations in the orbit of a rock circling an arbitrary star.
That is precisely the point of keeping track of the time of day, or day of the year.
time of day is an arbitrary number whose main utility lies in it being composed of predictable periods and divided into homogenous units.
You do not need a complex system like date/time, comprised of minutes, hours, seconds, months, weeks, and years, if you just want to measure time in a convenient homogenous unit. Define a time-zero, and just count milliseconds from that to whatever arbitrary distance into the past and future you want. Measure it in kilo-seconds, mega-seconds, giga-seconds... etc.
The entire point of date/time is because we do in fact care a lot about how that "arbitrary counter" lines up with when we will be awake or asleep or eating at various points -- that's what makes it useful.
What we should have is what I've described above, time-zero and a counter. And translations from that to localized date time should be handled by a library.
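A minimal Python sketch of the "time-zero and a counter" idea. The function name label_for and the choice of the Unix epoch as time-zero are assumptions made for illustration; nothing in the thread prescribes them. Python's datetime knows nothing about leap seconds, so a real translation library would also need a leap second table.

```python
from datetime import datetime, timedelta, timezone

# Assumed time-zero for this sketch: the Unix epoch.
TIME_ZERO = datetime(1970, 1, 1, tzinfo=timezone.utc)


def label_for(counter_seconds, utc_offset_hours=0):
    """Translate a plain seconds counter into a localized date/time label.

    The counter itself carries no calendar knowledge; all of that lives
    in the library call below. Leap seconds are ignored here.
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    instant = TIME_ZERO + timedelta(seconds=counter_seconds)
    return instant.astimezone(tz).isoformat()


# The same counter value, labeled for two different localities.
print(label_for(1341100800))      # 2012-07-01T00:00:00+00:00
print(label_for(1341100800, -7))  # 2012-06-30T17:00:00-07:00
```

The counter is what gets stored and compared; the label is only computed at the edge, when a human needs to read it.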
Re:Why now? by Guy+Harris · 2012-07-01 11:24 · Score: 4, Funny

What we should have is what I've described above, time-zero and a counter. And translations from that to localized date time should be handled by a library.
Which, sadly, POSIX doesn't let you have as "UNIX time":

4.15 Seconds Since the Epoch
A value that approximates the number of seconds that have elapsed since the Epoch. A Coordinated Universal Time name (specified in terms of seconds ( tm_sec ), minutes ( tm_min ), hours ( tm_hour ), days since January 1 of the year ( tm_yday ), and calendar year minus 1900 ( tm_year )) is related to a time represented as seconds since the Epoch, according to the expression below.
If the year is <1970 or the value is negative, the relationship is undefined. If the year is >=1970 and the value is non-negative, the value is related to a Coordinated Universal Time name according to the C-language expression, where tm_sec , tm_min , tm_hour , tm_yday , and tm_year are all integer types:
tm_sec + tm_min*60 + tm_hour*3600 + tm_yday*86400 + (tm_year-70)*31536000 + ((tm_year-69)/4)*86400 - ((tm_year-1)/100)*86400 + ((tm_year+299)/400)*86400
The relationship between the actual time of day and the current value for seconds since the Epoch is unspecified.
How any changes to the value of seconds since the Epoch are made to align to a desired relationship with the current actual time is implementation-defined. As represented in seconds since the Epoch, each and every day shall be accounted for by exactly 86400 seconds.
Note:
The last three terms of the expression add in a day for each year that follows a leap year starting with the first leap year since the Epoch. The first term adds a day every 4 years starting in 1973, the second subtracts a day back out every 100 years starting in 2001, and the third adds a day back in every 400 years starting in 2001. The divisions in the formula are integer divisions; that is, the remainder is discarded leaving only the integer quotient.
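The quoted expression can be tried out directly. Below is a hedged Python transcription; the function name posix_seconds is made up for this sketch, and // stands in for C integer division, which gives the same result here because every numerator is non-negative for years from 1970 on.

```python
import calendar


def posix_seconds(tm_year, tm_yday, tm_hour, tm_min, tm_sec):
    """Seconds since the Epoch per the quoted POSIX expression.

    tm_year is the calendar year minus 1900 and tm_yday is days since
    January 1, as in struct tm. Only meaningful for years >= 1970.
    """
    return (tm_sec + tm_min * 60 + tm_hour * 3600 + tm_yday * 86400
            + (tm_year - 70) * 31536000
            + ((tm_year - 69) // 4) * 86400
            - ((tm_year - 1) // 100) * 86400
            + ((tm_year + 299) // 400) * 86400)


# 2012-07-01 00:00:00 UTC: tm_year = 112, tm_yday = 182 (2012 is a leap year).
after = posix_seconds(112, 182, 0, 0, 0)
# The leap second itself, 2012-06-30 23:59:60 UTC.
leap = posix_seconds(112, 181, 23, 59, 60)

print(after == calendar.timegm((2012, 7, 1, 0, 0, 0)))  # agrees with the stdlib
print(leap == after)  # the two labels collapse onto one value
```

Because every day is accounted for by exactly 86400 seconds, the label 23:59:60 and the following 00:00:00 map to the same number, which is why the value can only approximate a true count of elapsed seconds.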
If there were a UN*X API to get a count of seconds since the Epoch (in addition to, or instead of, a call to get "seconds since the Epoch"), and a UN*X API to convert those to UTC and local time labels, that would get what you want. Modulo making it work with NTP, the former could be implemented with less difficulty than a call to get "seconds since the Epoch", and the latter is called "the Olson code complete with the leap seconds database".
However, that would then require some mechanism to allow code to schedule something to happen at a given UTC label; simply calculating the UNIX time for that UTC label, getting the current UNIX time, and scheduling it for then-now seconds in the future is insufficient, as the UNIX time for a given UTC label in the future might change if a leap second is scheduled between then and now. (Note that supporting scheduling something to happen at a given local civil time label would already require correction of that sort to handle DST rule changes.) This would also have to do something if you schedule an event for YYYY-DD-MM 23:59:59 and a negative leap second occurs so that there is no 23:59:59 on YYYY-DD-MM; "something" might be "let somebody know and ask them to correct it" or "do it at 00:00:00 on the next day", perhaps depending on the reason why it's scheduled.
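One way to picture such a mechanism is to keep the UTC label as the source of truth and re-derive the delay repeatedly, instead of computing then-now once. The Python sketch below is only an illustration: seconds_until, wait_for_label and the to_unix parameter are hypothetical names, and the default calendar.timegm has no leap second knowledge, so it stands in for whatever leap-second-aware conversion a real system would supply.

```python
import calendar
import time


def seconds_until(utc_label, now, to_unix=calendar.timegm):
    """Remaining delay, recomputed from the UTC label on every call.

    utc_label is a (year, month, day, hour, minute, second) tuple.
    to_unix is a placeholder for a leap-second-aware converter.
    """
    return to_unix(utc_label) - now


def wait_for_label(utc_label, max_chunk=60.0,
                   clock=time.time, sleep=time.sleep):
    """Sleep in bounded chunks, re-deriving the target each time.

    If the mapping from label to counter changes while waiting (a newly
    announced leap second, say), the next iteration picks it up.
    """
    while True:
        remaining = seconds_until(utc_label, now=clock())
        if remaining <= 0:
            return
        sleep(min(remaining, max_chunk))
```

The chunk size bounds how stale the computed delay can get. This sketch does not address the negative leap second case, where the requested label never occurs.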
Re:Why now? by fluffy99 · 2012-07-01 11:39 · Score: 1

Or maybe just pick a time standard that doesn't have leap seconds? There's at least 14 different time standards and I believe only UTC uses leap seconds. One or two even track the variations in the rotation of the earth itself (for astronomy stuff).
Re:Why now? by Guy+Harris · 2012-07-01 11:54 · Score: 2

Considering leap-seconds happen every now and then, it seems odd that such fundamental things as Linux and Java can not handle it. AFAIK, it was just about four years ago since we last had a leap-second.
Perhaps the bug that was mentioned in the lkml thread that started with this message was introduced less than four years ago, so the code in question had never gotten exposed to a leap second except perhaps in testing (I don't know how hard it is to reproduce it; John Stultz wasn't initially able to reproduce it in his testing, but eventually succeeded).
Re:Why now? by Anonymous Coward · 2012-07-01 12:42 · Score: 0

[...] and above all it should not be changed to accommodate fluctuations in the orbit of a rock circling an arbitrary star.
We should make the year 360 days because it makes the math easier. Also, we should get rid of leap years, because who cares if the solstices and equinoxes arrive on the same day every year or not? Furthermore, who cares whether the "seasons" occur at the same period of the year: they're just social conventions. We're going to have "summer" (i.e., a period of warm weather) regardless, so does it really matter whether it happens from June to August, or whether from February to April? If we get rid of leap years everything will start shifting, but who cares? We're just accommodating the orbit of a rock circling an arbitrary star.
Re:Why now? by PuZZleDucK · 2012-07-01 15:48 · Score: 1

plus we can define pi to be 3 to make things simpler.
Too late: http://tauday.com/

--
Can a person program a new solution to a problem? Why should anyone be able to stop such a thing? -Richard Stallman
Re:Why now? by Anonymous Coward · 2012-07-01 20:18 · Score: 0

Anyone remember Swatch internet time? No hours or minutes just a 'beat' worth 86.4 seconds. That made 1000 beats per 24 hours. And every clock was set to GMT time, not a time-zone that approximated solar time in their locale.
Of course when someone says it's 14:00 there, one has an idea, of what most people are doing.
Converting beats to seconds is one calculation. The problem is rolling over a second into the next beat. Essentially any replacement time system has to work in multiples of 3 so the current unit of time, the second, remains.
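The beat arithmetic described above is small enough to sketch in Python. The function names are invented for this example, and the input is taken to be seconds since midnight on whatever reference meridian the clock uses.

```python
SECONDS_PER_DAY = 86400
BEATS_PER_DAY = 1000


def beats(seconds_into_day):
    """Fractional beats elapsed for a given number of seconds since midnight."""
    return seconds_into_day * BEATS_PER_DAY / SECONDS_PER_DAY


def beat_start_seconds(beat):
    """Seconds since midnight at which a given whole beat begins."""
    return beat * SECONDS_PER_DAY / BEATS_PER_DAY


print(beats(43200))           # noon is beat 500.0
print(beat_start_seconds(1))  # 86.4, not a whole second
print(beat_start_seconds(5))  # 432.0, the first boundary on a whole second
```

Since one beat is 86.4 seconds, a beat boundary coincides with a whole second only on every fifth beat, which is the rollover awkwardness the comment points at.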
Re:Why now? by Dodgy+G33za · 2012-07-01 22:08 · Score: 1

What I don't understand is why operating systems should even care about a second here or there. Surely it is sufficient for them to ignore the fact that a leap second is going to occur and just set their clocks to the correct time next time they synch to their time source?
Re:Why now? by michelcolman · 2012-07-01 22:32 · Score: 1

The entire point of date/time is because we do in fact care a lot about how that "arbitrary counter" lines up with when we will be awake or asleep or eating at various points -- that's what makes it useful.
I care about the fact that the sun comes up in the morning and goes down in the evening. I don't care about the fact that it happens to cross some particular meridian going through some English village at precisely 1:00 pm or 12:00 pm (depending on daylight saving time) on exactly two days of the year (thanks to the elliptical orbit of our rock). I'll sleep just fine if it's 12:01 instead of 12:00, so you can skip at least the next 60 leap seconds.
Re:Why now? by vux984 · 2012-07-03 06:54 · Score: 1

fair enough. But a small minority of people do care. And in any case, if we don't have leap seconds then sooner or later we need a leap minute... or leap hour... the problem doesn't go away.
Re:Why now? by michelcolman · 2012-07-03 07:16 · Score: 1

Then one year in the very distant future (several millennia from now) we'll just not switch to daylight savings time and then carry on as before. I still don't see the problem.
As for the small minority of people who do care: I assume they're astronomers, and they can just apply a correction factor. The majority of people shouldn't be affected by the needs of a small minority, especially since the small minority can easily get around the "problem".
Re:Why now? by vux984 · 2012-07-03 08:00 · Score: 1

What exactly is "the problem" ?
To my mind, almost all of the "problems" that occur with leap seconds already occur with normal clock drift and DST changes. Both of which result in adjustments to the current time, resulting in a particular time of day being repeated, a particular time of day being skipped, etc.
Personally I think leap seconds should be exposed to the rest of the system in terms of time as "repeating" a second -- essentially the same way DST is handled. I think it should be done as it's 23:59:59, and then it's 23:59:59 again, and then it's 00:00:00, rather than ever being reported as 23:59:60.
Re:Why now? by badkarmadayaccount · 2012-07-06 23:23 · Score: 1

IAT, the new R6RS Scheme standard uses it, IIRC.

--
I know tobacco is bad for you, so I smoke weed with crack.

What about Windows and Mac? by kthreadd · 2012-07-01 08:13 · Score: 4, Interesting

So far all I've heard about is affected Linux systems, did Windows and OS X do just fine?

Re:What about Windows and Mac? by Anonymous Coward · 2012-07-01 08:16 · Score: 1

Doesn't Windows just sync its clock once a day? I don't remember it having a proper NTP daemon.
Re:What about Windows and Mac? by 93+Escort+Wagon · 2012-07-01 08:17 · Score: 1

So far all I've heard about is affected Linux systems, did Windows and OS X do just fine?
No problems on a couple of OS X machines that were on during the leap second - one running 10.7 Lion, the other 10.6 Snow Leopard (my laptop, which I was actively using).

--
#DeleteChrome
Re:What about Windows and Mac? by Anonymous Coward · 2012-07-01 08:20 · Score: 0

So far all I've heard about is affected Linux systems, did Windows and OS X do just fine?
The glitch mostly affected POSIX compliant operating systems as POSIX specifies a day as 86400 seconds.
Re:What about Windows and Mac? by Anonymous Coward · 2012-07-01 08:21 · Score: 0

Maybe they are not that accurate at all.
Re:What about Windows and Mac? by bickerdyke · 2012-07-01 08:23 · Score: 1

My guess is that Windows simply ignored it, so there never was a 61st second in a minute.
Being correct, on the other hand, might come as a surprise to more than one piece of software.

--
bickerdyke
Re:What about Windows and Mac? by godrik · 2012-07-01 08:24 · Score: 1

well, none of my machines (all running Linux) were affected by the problem. I guess the bug only appeared in some systems.
Re:What about Windows and Mac? by Anonymous Coward · 2012-07-01 08:24 · Score: 1

Mac OS X is POSIX compliant and certified UNIX. None of my OS X systems or Solaris (also POSIX and certified UNIX) had any problem with the extra second. This appears to be a problem for some Linux (not a certified UNIX system) systems.
Re:What about Windows and Mac? by gman003 · 2012-07-01 08:24 · Score: 1

As far as I can tell, all current operating systems handled it fine. It's applications that have problems, mainly server-type apps that actually use the clock for important things.
Linux being heavily affected is just a side-effect of most servers running Linux (although apparently some older versions don't handle leap seconds so cleanly - maybe that has something to do with it?).
Re:What about Windows and Mac? by Nutria · 2012-07-01 08:27 · Score: 2

And apparently neither did any desktop Linux systems.

--
"I don't know, therefore Aliens" Wafflebox1
Re:What about Windows and Mac? by 0123456 · 2012-07-01 08:59 · Score: 1

Ditto. My Ubuntu server is still running fine (the only indication of the leap second is a message in dmesg output) and my Ubuntu laptop had no problems either.
Re:What about Windows and Mac? by Anonymous Coward · 2012-07-01 09:17 · Score: 0

So far all I've heard about is affected Linux systems. Did Windows and OS X do just fine?
Yes, as did any other BSD system (which OS X is a variant of), as did the commercial Unixes (AIX, Solaris (and its variants, e.g., illumos)).
Re:What about Windows and Mac? by Guy+Harris · 2012-07-01 09:39 · Score: 2

So far all I've heard about is affected Linux systems. Did Windows and OS X do just fine?
The glitch mostly affected POSIX compliant operating systems, as POSIX specifies a day as 86400 seconds.
So you're saying the glitch could affect OS X (or, at least, OS X Snow Leopard - although Leopard was also registered - but I'll bet Lion behaves, and Mountain Lion will behave, the same way)?
Re:What about Windows and Mac? by Alrescha · 2012-07-01 10:37 · Score: 1, Funny

"And apparently neither did any desktop Linux systems."
There are desktop Linux systems?
(ducks)
A.

--
...bringing you cynical quips since 1998
Re:What about Windows and Mac? by Guy+Harris · 2012-07-01 10:46 · Score: 4, Informative

My guess is that Windows simply ignored it, so there never was a 61st second in a minute.
Well, if Microsoft's documentation of the SYSTEMTIME structure reflects the implementation, GetSystemTime(), the claim in that man page^W^WMSDN page that "The system time is expressed in Coordinated Universal Time (UTC)" notwithstanding, cannot acknowledge the existence of a 61st second in a minute ("The second. The valid values for this member are 0 through 59.", as the SYSTEMTIME page says).
But, just as on UN*X, you have "counter" and "human-style label" times (time_t, struct timeval, struct timespec are examples of the former, and a struct tm as returned by, for example, gmtime() is an example of the latter, on UN*X), with the Windows versions of those being FILETIME and SYSTEMTIME respectively. That page on FILETIME says nothing about leap seconds - does it just keep counting over a positive leap second or does it stop or what? And, if it doesn't just keep counting over a positive leap second, does it just freeze for a whole second, or does it slow down over some period of time so that it eventually syncs up, or what?
As for NTP, Microsoft has a page on "How the Windows Time service treats a leap second", which says

When the Windows Time service is working as a Network Time Protocol (NTP) client
The Windows Time service does not indicate the value of the Leap Indicator when the Windows Time service receives a packet that includes a leap second. (The Leap Indicator indicates whether an impending leap second is to be inserted or deleted in the last minute of the current day.) Therefore, after the leap second occurs, the NTP client that is running Windows Time service is one second faster than the actual time. This time difference is resolved at the next time synchronization.
(the author of which needs to be told what "inserted or deleted" implies - do they mean that, regardless of whether a leap second is inserted or deleted, the NTP client that is running Windows Time service is one second faster than the actual time?)
And then there's one more question: if there's anything in the NT kernel that deals with leap seconds, does any version have a glitch, as some versions of the Linux kernel do?
If not, then many of the other problems might not exist on Windows. This email from John Stultz, the author of the fix linked to in the previous paragraph, seems to indicate that at least some of the problems, if not all of them, stem from a kernel bug, so it might be that Java and company might be Just Fine on systems that don't have a kernel glitch of that sort (so they might work fine on at least some non-Linux systems, as well as on Linux systems with the bug fixed).
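To make the "counter" versus "human-style label" distinction concrete, here is a rough Python sketch. It relies on the documented definition of FILETIME as a count of 100-nanosecond intervals since January 1, 1601 (UTC); the helper names are made up for illustration, and nothing here says what Windows actually does with the counter during a leap second.

```python
import time

# Seconds between 1601-01-01 (FILETIME epoch) and 1970-01-01 (Unix epoch).
EPOCH_DIFF_SECONDS = 11644473600
TICKS_PER_SECOND = 10_000_000  # FILETIME counts 100-nanosecond intervals

def filetime_to_unix(filetime_ticks):
    """Convert a FILETIME-style counter to whole Unix seconds (hypothetical helper)."""
    return filetime_ticks // TICKS_PER_SECOND - EPOCH_DIFF_SECONDS

def unix_to_label(unix_seconds):
    """Turn a counter into a human-style label, like gmtime() filling a struct tm."""
    return time.gmtime(unix_seconds)

# Counter value for 2012-07-01 00:00:00 UTC under the 86400-seconds-per-day rule.
ticks = (1341100800 + EPOCH_DIFF_SECONDS) * TICKS_PER_SECOND
label = unix_to_label(filetime_to_unix(ticks))
previous = unix_to_label(filetime_to_unix(ticks) - 1)

print(label.tm_year, label.tm_mon, label.tm_mday, label.tm_hour, label.tm_min, label.tm_sec)
print(previous.tm_hour, previous.tm_min, previous.tm_sec)
```

Stepping the counter back by one lands on 23:59:59, never 23:59:60, which is the same limitation the SYSTEMTIME page describes for its seconds field.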
Re:What about Windows and Mac? by Anonymous Coward · 2012-07-01 11:07 · Score: 0

Yes. Apparently, many eyes on the sources is not the only way to avoid bugs. (happy dance)
Re:What about Windows and Mac? by Nutria · 2012-07-01 11:11 · Score: 1

Lots, on an absolute scale, but few relative to the number of Windows and OSX desktops. :(

--
"I don't know, therefore Aliens" Wafflebox1
Re:What about Windows and Mac? by Guy+Harris · 2012-07-01 11:37 · Score: 3, Interesting
As far as I can tell, all current operating systems handled it fine. It's applications that have problems, mainly server-type apps that actually use the clock for important things.
Linux being heavily affected is just a side-effect of most servers running Linux (although apparently some older versions don't handle leap seconds so cleanly - maybe that has something to do with it?).
Yes, at least one of the problems appears to be a Linux kernel problem. However, as that thread indicates, the consequence of this isn't a kernel crash; it causes futexes to repeatedly time out (or, at least, causing futexes with timeouts to repeatedly time out). I'm guessing, perhaps incorrectly, that this might mean that code waiting for a futex gets a kernel wakeup due to a timeout, checks whether the condition being waited for has happened, discovers that it hasn't, sleeps in the futex again, gets a kernel wakeup due to a timeout, checks whether the condition being waited for has happened, discovers that it hasn't, sleeps in the futex again, lathers, rinses, repeats, so it makes no progress and chews up tons of CPU.
If so, then:
- this particular problem is specific to systems running Linux kernels with the problem (and hence specific to Linux);
- applications that don't themselves have issues with leap seconds might be affected by this;
so Linux being heavily affected might also be a side-effect of, well, some versions of the Linux kernel having a bug that's triggered by leap seconds.
However, unless an application happens to use futexes in a fashion that trips over the bug, they won't be affected. It might be server applications that are most likely to do so, meaning that you might not see it on, say, a desktop or handheld Linux machine, or even on some servers.
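The spin described above can be illustrated with a toy model in Python. This is not kernel code and not a claim about the actual mechanism. It assumes, purely for illustration, that the timer side's idea of wall-clock time ends up one second ahead of the clock the caller reads, so any timeout shorter than that skew expires at once. All names and the millisecond bookkeeping are hypothetical.

```python
def count_wakeups(timeout_ms, skew_ms, condition_true_at_ms, step_ms=1):
    """Toy model: count wakeups of a timed-wait loop until the condition holds.

    timeout_ms: relative timeout the caller asks for.
    skew_ms: how far the (hypothetical) timer clock is ahead of the caller's clock.
    condition_true_at_ms: time at which the awaited condition becomes true.
    step_ms: cost of one wakeup-and-recheck round in the model.
    """
    now = 0
    wakeups = 0
    while now < condition_true_at_ms:
        deadline = now + timeout_ms       # caller's absolute deadline
        timer_now = now + skew_ms         # what the timer side believes
        if deadline <= timer_now:
            now += step_ms                # timer fires immediately; loop spins
        else:
            now = deadline - skew_ms      # sleep until the timer really expires
        wakeups += 1
    return wakeups

healthy = count_wakeups(timeout_ms=100, skew_ms=0, condition_true_at_ms=1000)
skewed = count_wakeups(timeout_ms=100, skew_ms=1000, condition_true_at_ms=1000)
print(healthy, skewed)
```

In this model a loop that would normally wake ten times in a second wakes a thousand times instead, which is the kind of behaviour that would show up as a process pinned at 100% CPU while making no progress.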
Re:What about Windows and Mac? by magamiako1 · 2012-07-01 13:45 · Score: 3, Informative

In an Active Directory domain, the computer with the FSMO PDC Emulator role is not only a proper NTP server, but you can sync your devices to it.

Also, look up the command: w32tm
Re:What about Windows and Mac? by Ingenium13 · 2012-07-01 15:11 · Score: 1

As others have stated, it affected apps on Linux desktops. For me it was Chrome using all the CPU, which required me to restart the computer to fix, though it now seems that I could have just updated the time and fixed it that easily instead.
Re:What about Windows and Mac? by Anonymous Coward · 2012-07-01 15:35 · Score: 0

And then there's one more question: if there's anything in the NT kernel that deals with leap seconds, does any version have a glitch, as some versions of the Linux kernel do?
Windows uses a TickCount rather than the clock for all its mutexes and other WaitFor(Single|Multiple)Object[s] APIs (including Sleep). Stupid time keeping crap like adding extra seconds doesn't affect the TickCount which is a strictly monotonic 100Hz counter.
Windows maintains its "real time" separately from its behavioural time keeping; as you noted, it uses SYSTEMTIME for this which is hokey but gets the job done. [I'm not familiar with the internals, it may actually use a seconds-since-epoch counter and convert to SYSTEMTIME which would be more efficient]
Frankly, I'm confused by the POSIX API design; it uses the system clock (CLOCK_REALTIME instead of CLOCK_MONOTONIC) for all waits which is just downright stupid. If I set a wait period of 100 seconds on Windows then change the system clock forward an hour, nothing happens. Do that on Linux and the wait will fire immediately. Why? That's useless, fragile and undesirable; you're almost always waiting for some other part of the system to do something, not just because you arbitrarily felt like waiting 10 wall clock seconds.
Re:What about Windows and Mac? by Guy+Harris · 2012-07-01 19:16 · Score: 1

Frankly, I'm confused by the POSIX API design; it uses the system clock (CLOCK_REALTIME instead of CLOCK_MONOTONIC) for all waits which is just downright stupid. If I set a wait period of 100 seconds on Windows then change the system clock forward an hour, nothing happens. Do that on Linux and the wait will fire immediately. Why? That's useless, fragile and undesirable; you're almost always waiting for some other part of the system to do something, not just because you arbitrarily felt like waiting 10 wall clock seconds.
Yes. At, for example, 07:00 on June 30, 2012, "I want this to happen 24 hours from now" and "I want this to happen at 07:00 on July 1, 2012" should be treated as different requests. The former should happen after 86400 seconds have elapsed, regardless of whether any of those seconds were leap seconds or not. The latter should happen at 07:00 on July 1, 2012, even if that happens to mean it happens 86401 seconds later courtesy of a leap second inserted at the end of June 30, 2012.
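One way to keep the two kinds of request apart is to measure them against different clocks. A minimal sketch in Python, using simulated clocks so the effect of a clock step can be shown without touching the real system clock; the class and function names are illustrative only. In real code the two clocks would be something like `time.monotonic()` and `time.time()`.

```python
class FakeClocks:
    """Pair of simulated clocks: elapsed time and wall-clock time (illustrative)."""

    def __init__(self, wall_start):
        self.elapsed = 0.0
        self.wall = wall_start

    def advance(self, seconds):
        # Real time passing moves both clocks.
        self.elapsed += seconds
        self.wall += seconds

    def step_wall(self, seconds):
        # An administrator or a time daemon stepping the clock moves only wall time.
        self.wall += seconds

def interval_expired(clocks, started_at, duration):
    """'N seconds from now': compare against the elapsed-time clock."""
    return clocks.elapsed - started_at >= duration

def wall_time_reached(clocks, target_wall):
    """'At a given time of day': compare against the wall clock."""
    return clocks.wall >= target_wall

clocks = FakeClocks(wall_start=1341039600.0)   # 2012-06-30 07:00:00 UTC
start = clocks.elapsed
target = clocks.wall + 3600                    # a wall-clock time one hour later

clocks.advance(10)                             # ten real seconds pass
clocks.step_wall(3600)                         # wall clock is set forward an hour

interval_done = interval_expired(clocks, start, 100)
wall_done = wall_time_reached(clocks, target)
print(interval_done, wall_done)
```

After the step, the wall-clock request is due while the 100-second interval is not, which matches the behaviour argued for above: an interval wait should not care what the wall clock is doing.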
Re:What about Windows and Mac? by michelcolman · 2012-07-01 22:34 · Score: 1

You mean crashing with every leap second is the "proper" way of doing things, as opposed to being one second off but working just fine?
Re:What about Windows and Mac? by MobyDisk · 2012-07-02 02:01 · Score: 1

Windows server admins don't find intermittent lockups and reboots to be unusual. :-)
Re:What about Windows and Mac? by Anonymous Coward · 2012-07-02 04:19 · Score: 0

And apparently neither did any desktop Linux systems.

Sigh, have to post anonymous because I don't want to undo a whole heap of moderating.
That is an unwarranted assumption. At 0000 UTC I was running a fully up to date RHEL6 on my desktop and both Firefox and Thunderbird went to 100% CPU and became non-responsive, VirtualBox got strange, and the entire desktop went to a low level of responsiveness as a consequence. Neither Firefox nor Thunderbird would close; I had to use kill on both of them. I elected to just reboot and be done with any lingering effects, and the system was fine after that.
What I saw exactly met this description.

Linux kernel unable to cope? I think not. by Anonymous Coward · 2012-07-01 08:16 · Score: 1

"some of the net’s fundamental software platforms — including the Linux operating system and the Java application platform — were unable to cope with the extra second."

No opinion about java, and no doubt there's plenty of dodgy software running on Linux, but the part about Linux not coping is BS.

From last night's logs....
Jun 30 19:59:59 thabto kernel: Clock: inserting leap second 23:59:60 UTC

Re:Linux kernel unable to cope? I think not. by Anonymous Coward · 2012-07-01 08:33 · Score: 4, Informative

There was a Linux kernel bug. See
http://news.ycombinator.com/item?id=4183122
http://marc.info/?l=linux-kernel&m=134110635328824&w=2
and
https://lkml.org/lkml/2012/6/30/122
Re:Linux kernel unable to cope? I think not. by Anonymous Coward · 2012-07-01 08:51 · Score: 0

Possibly refers to some of the issues covered here: https://access.redhat.com/knowledge/articles/15145?amp
Re:Linux kernel unable to cope? I think not. by Anonymous Coward · 2012-07-01 08:56 · Score: 0

Why is it even the kernel's business to manipulate the clock this way?
Re:Linux kernel unable to cope? I think not. by Anonymous Coward · 2012-07-01 09:09 · Score: 5, Interesting

I run Arch Linux with kernel 3.4.4 and it went haywire. My machine was very heavily loaded at the time and when the leap second happened mysqld, firefox, and ksoftirq processes started consuming 100% CPU. The load factor was well over 10 and the machine was grinding along. It didn't actually fail but it was loaded down.
Even restarting the processes didn't fix it. The high load would go away once I stopped the processes but as soon as I started them again the load would come right back. I had Firefox open on a blank page not doing anything and it was slammed at 100% CPU and had a couple of ksoftirq tasks slammed at 100% CPU each too.
I had to reboot the machine to get it back to normal.
I have Ubuntu and Debian servers that for whatever reason did not add the leap second so they were fine. Their time was a second off today though (at least until ntp slowly corrected it or I manually intervened).
Re:Linux kernel unable to cope? I think not. by burne · 2012-07-01 09:16 · Score: 1

You would have been fine if you stopped ntpd before restarting the offending processes.
Re:Linux kernel unable to cope? I think not. by lister+king+of+smeg · 2012-07-01 10:00 · Score: 1

yes, an old one that was patched before this became an issue. the issue is for un-updated/unpatched versions of Linux and shoddily written apps and java

--
---Saying gnome 3 is better than windows 8 not so much a compliment as it is damning with light praise.
Re:Linux kernel unable to cope? I think not. by kwardroid · 2012-07-01 10:12 · Score: 5, Informative

Restarting ntp wasn't enough for me, I had to reset the date with:
date -s "`date`"
Only one machine went haywire though.
Re:Linux kernel unable to cope? I think not. by Guy+Harris · 2012-07-01 12:02 · Score: 1

yes, an old one that was patched before this became an issue.
And a new one that is either in the process of being patched today (July 1, 2012) or that was patched today, as per the lkml thread that starts here.

the issue is for un-updated/unpatched versions of Linux
"Unpatched" with a patch that didn't exist before the problem showed up...

and shoddily written apps and java
Where "shoddily written" means "using futexes or using something that uses futexes"? I'm not sure I'd be so harsh about using futexes; something that lets you do locking mostly in userland doesn't seem like a bad idea offhand....
Re:Linux kernel unable to cope? I think not. by Guy+Harris · 2012-07-01 12:06 · Score: 1

Possibly refers to some of the issues covered here: https://access.redhat.com/knowledge/articles/15145?amp
In particular, to this issue, which apparently first materialized with the recent leap second, and probably not to this issue, which might be the one fixed by this patch.
Re:Linux kernel unable to cope? I think not. by rrohbeck · 2012-07-01 14:12 · Score: 1

Same here (Debian wheezy) but with Chrome.

--
thegodmovie.com - watch it
Re:Linux kernel unable to cope? I think not. by Anonymous Coward · 2012-07-01 18:39 · Score: 0

I had Firefox open on a blank page not doing anything and it was slammed at 100% CPU and had a couple of ksoftirq tasks slammed at 100% CPU each too.
That's normal Firefox behaviour. Any relevant evidence?
Re:Linux kernel unable to cope? I think not. by Rakarra · 2012-07-01 19:45 · Score: 2

Mods -- please mod this up to a thousand. kwardroid's fix fixed this for all the affected machines I've found so far.
Re:Linux kernel unable to cope? I think not. by Anonymous Coward · 2012-07-02 21:36 · Score: 0

Your tip just saved me from rebooting my home server. Thank you!

FUD? by jimshatt · 2012-07-01 08:18 · Score: 1, Insightful

I don't know, but the article reads as FUD. Sure, there might have been problems, but then, aren't there always problems, everywhere? It's just a matter of picking the right ones and you've got a 'Linux and Java = bad' article? Or am I being a fanboy now?

Re:FUD? by sjames · 2012-07-01 08:24 · Score: 1

It was a genuine bug but that doesn't make Linux or Java 'bad', all software has bugs. Good or Bad will depend on how many bugs, how often they bite and how badly.
The bug has already been fixed for months now, so the systems having trouble were the ones that weren't kept up to date.
Re:FUD? by Anonymous Coward · 2012-07-01 08:41 · Score: 0

> It was a genuine bug but that doesn't make Linux or Java 'bad', all software has bugs.
#!/bin/sh
echo "hello universe"
Look! It's an example of bug-free software!
Re:FUD? by dissy · 2012-07-01 08:41 · Score: 0

Ironically I have an old server down in the basement running a vulnerable 2.6.18.8 kernel, where the only reason I leave it even running is the uptime being at
16:34:29 up 832 days, 21:00
It survived the leap second just fine and is still running along nicely.
I hear a lot of the problems might have been around a vulnerable ntpd on a vulnerable kernel, and I do not run ntpd.
I also have another machine here that started off life as a 2.6.18 kernel but has been kpatched to the latest version, and it too survived the leap second, and it does run ntpd. I can only assume the fix was already kspliced in shortly after release.
All in all the whole thing seemed like a nonissue.
Re:FUD? by Anonymous Coward · 2012-07-01 08:41 · Score: 1

It was a genuine bug but that doesn't make Linux or Java 'bad', all software has bugs.

I seem to recall Microsoft suffering from some leap year bugs. Boy, the Slashdot comments section lit up about how even a kindergarten programmer would catch the mistake and how it's indicative of the software quality coming out of Microsoft. Now that Linux hit the same type of hurdle, we're all of a sudden being very nuanced about the definition of code quality? Typical.
Re:FUD? by Anonymous Coward · 2012-07-01 08:50 · Score: 0

Well, if you're not running ntpd, then there's no way for the kernel to register the leap second. These things aren't on a regular schedule, so the kernel can only know about them when it gets the information from the outside. No leap second, no problem.
Re:FUD? by burne · 2012-07-01 08:59 · Score: 2

Now that Linux hit the same type of hurdle, we're all of a sudden being very nuanced about the definition of code quality? Typical.
Wow. You're still pissed over Azure failing, your Xbox disabling itself, your Zune crashing for a full day and your Outlook manhandling your appointments (on more than one occasion)?
Talk about carrying a grudge..
Re:FUD? by sjames · 2012-07-01 09:09 · Score: 1

You'll need to talk with the people who made those comments. Of course, I did provide a few more criteria you might want to consider.
Re:FUD? by Anonymous Coward · 2012-07-01 09:24 · Score: 0

And there I was thinking that my 3.2.19 kernel was fairly up to date...
Actually I didn't realize I was affected by this bug until a few minutes ago, when I used strace to see why firefox was using up all the time on one of my cores.
I think there are lots of people that didn't realize as well.
Re:FUD? by kasperd · 2012-07-01 09:43 · Score: 4, Funny

Actually I didn't realize I was affected by this bug until a few minutes ago, when I used strace to see why firefox was using up all the time on one of my cores.
You don't need a leap second in order for that to happen. Firefox does that regularly.

--

Do you care about the security of your wireless mouse?
Re:FUD? by lister+king+of+smeg · 2012-07-01 10:12 · Score: 1, Insightful

I seem to recall Microsoft suffering from some leap year bugs. Boy, the Slashdot comments section lit up about how even a kindergarten programmer would catch the mistake and how it's indicative of the software quality coming out of Microsoft. Now that Linux hit the same type of hurdle, we're all of a sudden being very nuanced about the definition of code quality? Typical.
the difference being this bug was patched already; it only affected systems that were not kept up to date. Microsoft however did not patch or fix their software until after the problem had already occurred. the fault here lies not with Linux but with system/server admins not updating their servers, whereas in Microsoft's case it was their direct fault for not handling a well known date issue. with the Linux bug, no one had heard of leap seconds until yesterday, yet it already had a patch out there, so it is not their fault or problem.

--
---Saying gnome 3 is better than windows 8 not so much a compliment as it is damning with light praise.
Re:FUD? by dotgain · 2012-07-01 10:17 · Score: 2

Have you read the source for /bin/sh ?
Re:FUD? by fluffy99 · 2012-07-01 11:47 · Score: 1

the difference being this bug was patched already; it only affected systems that were not kept up to date.
I would believe that except some of the recent Linux kernels did NOT properly handle the leap second. https://lkml.org/lkml/2012/7/1/19 It was this improper handling of the time change associated with the leap second that sent some software into a tizzy, with the most common side effect being heavy CPU consumption. Some software seems to have issues regardless of this bug as well.
I agree with the original statement that if MS had done this, the tone of this article would be different.
Re:FUD? by Guy+Harris · 2012-07-01 12:08 · Score: 3, Informative

The bug has already been fixed for months now
A bug might have been fixed for months now, but I don't think that's the bug here.
Re:FUD? by Guy+Harris · 2012-07-01 12:10 · Score: 2

the difference being this bug was patched already; it only affected systems that were not kept up to date.
A bug, perhaps. This bug, perhaps not.
Re:FUD? by Guy+Harris · 2012-07-01 12:11 · Score: 1

And there I was thinking that my 3.2.19 kernel was fairly up to date...
At this point, I'm not sure which Linux kernels, other than perhaps the one on John Stultz's machine, are sufficiently up to date.
Re:FUD? by Hognoxious · 2012-07-02 00:47 · Score: 1

Same here, 2.6.18-164.6.1.el5.centos.plus or so it says.
Was running at the time, though probably not under much load. No problems.

--
Confucius say, "Find worm in apple - bad. Find half a worm - worse."

Extremely weird by Anonymous Coward · 2012-07-01 08:18 · Score: 5, Informative

From my own machines and comparing notes with some other people (all in all, about 3k servers) the bug seems to affect machines randomly. Known facts:

There's a kernel patch that fixes the supposed issue: https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d

Affects Debian stable a lot.

Affects Java and Virtualbox (starts using too much CPU).

Affected my browser (iceweasel on debian testing).

Affects SOME mysql installs (5.1 and 5.5, but not all, and of two identical installs one might be affected, the other not).

The fix has been posted in a lot of places: /etc/init.d/ntp stop; date; date `date +"%m%d%H%M%C%y.%S"`; date; /etc/init.d/ntp start

(I'm all for switching unix time to a simple counter and leaving it to the calendar libs to put the leap seconds where necessary)
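For anyone puzzled by the inner `date +"%m%d%H%M%C%y.%S"` in the stop/set/start command above: it prints the current time in the `MMDDhhmmCCYY.ss` form that `date` accepts as a set-time argument, so the outer `date` sets the clock to the value it already has. A small Python sketch of the same string, assuming a four-digit year so that `%Y` is equivalent to `%C%y`; the function name is made up for illustration:

```python
from datetime import datetime

def date_set_argument(moment):
    """Format a datetime the way `date +"%m%d%H%M%C%y.%S"` would print it."""
    # %Y stands in for %C%y (century followed by two-digit year).
    return moment.strftime("%m%d%H%M%Y.%S")

example = date_set_argument(datetime(2012, 7, 1, 0, 0, 5))
print(example)
```

The point of the command is the act of setting the clock at all, not the value being set.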

Re:Extremely weird by burne · 2012-07-01 09:09 · Score: 4, Informative

It's a race-condition, either crashing your ancient kernel or causing software using certain kernel-calls to effectively lock up. In both cases load seems to be a factor.
Over here the race-condition coincided with the actual leap-second and the start of the first batch of cronjobs at 02:00 local time.

(I'm all for switching unix time to a simple counter and leaving it to the calendar libs to put the leap seconds where necessary)
Bad idea. It would have prevented kernels affected by the race-condition from crashing, but would have meant most of your running software would have been either hit by this bug or would have been at the mercy of a 17 year old pimple-faced coder.
I think I prefer a crash over the mayhem caused by banking-software not handling a leap-second correctly. That could bankrupt whole countries.
Re:Extremely weird by Anonymous Coward · 2012-07-01 10:57 · Score: 0

timezone data is already added or subtracted from the time after
time() is called. so i don't buy this argument.
unfortunately it doesn't matter. due to clock drift and the need to
peg the clock representing local time to a standard, we can't prevent
the local clock from running fast or slow.
so the real solution, if you're interested in a time interval, is to use
a local clock that is not adjusted to fit a master.
Re:Extremely weird by Guy+Harris · 2012-07-01 12:21 · Score: 2

From my own machines and comparing notes with some other people (all in all, about 3k servers) the bug seems to affect machines randomly. Known facts:
There's a kernel patch that fixes the supposed issue: https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d
I don't think that's the issue. The issue discussed in this LKML thread is a different issue with, presumably, a different John Stultz fix.

The fix has been posted in a lot of places: /etc/init.d/ntp stop; date; date `date +"%m%d%H%M%C%y.%S"`; date; /etc/init.d/ntp start
Presumably meaning "workaround" rather than "fix".

(I'm all for switching unix time to a simple counter and leaving it to the calendar libs to put the leap seconds where necessary)
Sounds good to me, but I thought that was a good idea back in the late '80's; the POSIX people thought otherwise, so....
At least as I read RFC 5905, time stamps in NTP packets are essentially "simple counters", and count positive leap seconds and don't count seconds removed with negative leap seconds. I'm not sure what an NTP implementation is supposed to do with the "leap indicator"; that might be dependent on what sort of time the system is supposed to provide to applications. I don't know whether the Linux kernel giving a damn about leap seconds is due to it trying to supply "POSIX time", i.e. time represented as "seconds since the Epoch" rather than as the number of seconds that have elapsed since the Epoch (yes, the two are different), or if it has to do that to function as an NTP client.
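For reference, the leap indicator mentioned above is the top two bits of the first octet of an NTP packet, followed by a 3-bit version and a 3-bit mode. A minimal Python sketch of decoding that octet, with the meanings as given in RFC 5905; the function name and the sample byte are just for illustration, and what a client should then do with the value is exactly the open question:

```python
# Leap Indicator values as defined in RFC 5905.
LEAP_INDICATOR = {
    0: "no warning",
    1: "last minute of the day has 61 seconds",
    2: "last minute of the day has 59 seconds",
    3: "unknown (clock unsynchronized)",
}

def parse_first_byte(first_byte):
    """Split the first octet of an NTP packet into (LI, version, mode)."""
    leap = (first_byte >> 6) & 0x3      # top 2 bits
    version = (first_byte >> 3) & 0x7   # next 3 bits
    mode = first_byte & 0x7             # low 3 bits
    return leap, version, mode

# 0x64 is binary 01 100 100: LI=1, version 4, mode 4 (server).
leap, version, mode = parse_first_byte(0x64)
print(LEAP_INDICATOR[leap], version, mode)
```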
Re:Extremely weird by Guy+Harris · 2012-07-01 12:34 · Score: 1

Bad idea. It would have prevented kernels affected by the race-condition from crashing, but would have meant most of your running software would have been either hit by this bug or would have been at the mercy of a 17 year old pimple-faced coder.
I think I prefer a crash over the mayhem caused by banking-software not handling a leap-second correctly. That could bankrupt whole countries.
OK, I'm all for having UN*X kernels (including but not limited to Linux kernels) keep their internal time value as a counter initialized to (as best an approximation as possible of) the number of seconds that have elapsed between the Epoch and the time the counter is initialized, and have those calls that are expected to return "seconds since the Epoch" do so by converting a count of seconds that have elapsed since the Epoch into "seconds since the epoch" by subtracting out positive leap seconds and adding in negative leap seconds (preferably in userland). Then the 17-year-old pimple-faced coders can use the POSIX calls and pretend leap seconds don't exist, and the kernel can presumably not have to care about leap seconds and thus not have to worry about the insertion of leap seconds.
Oh, and "this bug" appears, from this LKML thread, to be due to the kernel caring about leap seconds, so it's not as if your software would have been hit by this bug if the stuff that caused the bug didn't happen to exist in the kernel in the first place.
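The conversion proposed above can be sketched in a few lines of Python. This is a simplified illustration, not how any kernel or libc does it: the table holds only the June 30, 2012 leap second, the counter is assumed (hypothetically) to agree with POSIX time up to that point, and the names are invented.

```python
# Elapsed-seconds count at which each positive leap second begins.
# Only the 2012-06-30 leap second is listed; a real table would hold them all.
LEAP_SECOND_STARTS = [1341100800]

def elapsed_to_posix(elapsed):
    """Convert a pure elapsed-seconds counter into POSIX 'seconds since the Epoch'."""
    # Subtract one for every positive leap second that has begun by now.
    leaps_passed = sum(1 for start in LEAP_SECOND_STARTS if elapsed >= start)
    return elapsed - leaps_passed

# The counter never stops or jumps; the POSIX value repeats across the leap second.
around_leap = [elapsed_to_posix(e) for e in range(1341100798, 1341100802)]
print(around_leap)
```

In this sketch the leap second reuses the POSIX value of the second before it; other mappings are possible. Either way the counter underneath keeps ticking uniformly, so interval timers based on it would never see a discontinuity.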
Re:Extremely weird by xenobyte · 2012-07-01 19:15 · Score: 1

From my own machines and comparing notes with some other people (all in all, about 3k servers) the bug seems to affect machines randomly. Known facts:
There's a kernel patch that fixes the supposed issue: https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d
Affects Debian stable a lot.
Affects Java and Virtualbox (starts using too much CPU).
Affected my browser (iceweasel on debian testing).
Affects SOME mysql installs (5.1 and 5.5, but not all, and of two identical installs one might be affected, the other not).
My own Debian Stable server running 2.6.32 and MySQL 5.1.63 had zero problems. It is also an NTP pool stratum 2 server but it worked just fine.
More interestingly, none of the 1.600+ servers at work had any problems. About 400 of these are Linux, mostly Debian (various distributions) but also some RHEL and some CentOS. There are also about 20 OpenBSD and FreeBSD servers which also performed flawlessly.

--
"For every complex problem, there is a solution that is simple, neat, and wrong." -- H.L. Mencken (1880-1956) --

Second story in one day? by Anonymous Coward · 2012-07-01 08:18 · Score: 0

Why has this "story" been posted twice in one day?

Do you guys think we are incapable of remembering things that have happened in the last 24 hours?

Re:Second story in one day? by Teresita · 2012-07-01 08:39 · Score: 1

I want justice. Next time they take away a second from the day, I want one of these "stories" to be expunged.
Re:Second story in one day? by houghi · 2012-07-01 08:40 · Score: 1

24 hours? I assume you mean 1 day and that is NOT 24 hours, but a second more.

--
Don't fight for your country, if your country does not fight for you.

Re: by bleedingsamurai · 2012-07-01 08:21 · Score: 1

Well that explains it. I'm running nothing less than 3.3.8

Re: by Anonymous Coward · 2012-07-01 08:24 · Score: 0

From the looks of it the kernel must be running on multiple cpus for the livelock to occur. This is probably one of the reasons why none of my servers had any issue.

Re: by Anonymous Coward · 2012-07-01 08:25 · Score: 0

Not true.
I had a number of boxes running 2.6.32 getting bit by this bug.

Dreaded S60 bug... by mschaffer · 2012-07-01 08:35 · Score: 2

It's like the Y2K bug, but every few years.

Re:Dreaded S60 bug... by PolygamousRanchKid+ · 2012-07-01 09:32 · Score: 1

Hey, it's an excellent opportunity to drum up bogus consulting work!
"Are your old C programs able to handle a leap second! Think of how much money your company will lose when that one extra second of interest gets calculated on your bank accounts! You need me to check your code for you!"
"Thanks, see you around, for the next leap second!"
The IT industry definitely needs leap seconds.

--
Schroedinger's Brexit: The UK is both in and out of the EU at the same time!

Re: by Anonymous Coward · 2012-07-01 08:39 · Score: 2, Informative

The hard system lock bug due to a leap second was patched in 2.6.29, so either you've got some weird related bug, or something is very wrong.

Re: by Admiral+Justin · 2012-07-01 08:47 · Score: 2

Configuration of the system to only accept 23:59:59 and not 23:59:60

--
You will be baked, and there will be cake.

You probably don't do much Java, then by burne · 2012-07-01 08:51 · Score: 5, Informative

As it turns out, my biggest problem was customer-supplied software that ships its own Java JREs. We install a JRE by default and update it whenever possible, but some software (Adeptia, VLTrader, Alfresco) comes with its own ancient JRE and scripts that call it instead of the system-supplied Java.

Not a single machine crashed (we are very explicitly in charge of which OS version is running), but a lot of Java processes locked up and had to be restarted.

I can even see a small bump in the power-usage around two o' clock (0:00 GMT).

Re:You probably don't do much Java, then by Guy Harris · 2012-07-01 10:50 · Score: 1

As it turns out, my biggest problem was customer-supplied software that ships its own Java JREs. We install a JRE by default and update it whenever possible, but some software (Adeptia, VLTrader, Alfresco) comes with its own ancient JRE and scripts that call it instead of the system-supplied Java.
Not a single machine crashed (we are very explicitly in charge of which OS version is running), but a lot of Java processes locked up and had to be restarted.
So are you saying that, in addition to the Linux kernel glitch in question (which appears to cause some userland processes to spin), there are purely-userland problems? Or, if you're running on a Linux that doesn't have John Stultz's fix, is it that some JREs are vulnerable to the Linux kernel glitch and others aren't?
Re:You probably don't do much Java, then by thegarbz · 2012-07-01 11:42 · Score: 4, Funny

I can even see a small bump in the power-usage around two o' clock (0:00 GMT).
Leap seconds contribute to global warming. We need to raise this at the next G8 summit.
Re:You probably don't do much Java, then by Guy Harris · 2012-07-01 11:43 · Score: 4, Informative

So are you saying that, in addition to the Linux kernel glitch in question (which appears to cause some userland processes to spin)
Actually, I'm not sure that's the case. John Stultz's mail from July 1, 2012 speaks of a bug where clock_was_set() wasn't called after the leap second was added, and of a patch he was working on, so the bug in question might not have been fixed in March.
Re:You probably don't do much Java, then by Lennie · 2012-07-01 14:04 · Score: 2

Looked to me like it was only 64-bit Java, not 32-bit Java

--
New things are always on the horizon
Re:You probably don't do much Java, then by Anonymous Coward · 2012-07-01 22:06 · Score: 0

Yes, due to the spinning. I think a lot of people will *think* their systems are fine till their processor cooks. On multi processor machines it might be that many processors keep going. That's how it appeared to be till I read the output of dmesg and noticed that it said the processor was overheating and it was scaling back processor speed.
I wonder how many processors will overheat to the point of destruction? Bound to hit someone.
Re:You probably don't do much Java, then by archont · 2012-07-01 23:51 · Score: 3, Informative

What is this, 1990? All modern CPUs have protection against overheating and disabling that protection requires, at the very least, some crafty soldering or flashing a 3rd party BIOS. If you're capable enough to do that you're probably running some sci-fi prototype rig from the future using pressurized mercury phase transition cooling or something.
So no, I don't see how any properly set-up rig can make the CPU cook itself.

RHEL + JBOSS = LOAD by Anonymous Coward · 2012-07-01 08:54 · Score: 0

We had many servers with this issue, mostly RHEL 6 servers running JBoss. The only symptom is high load. If you are not actively monitoring your server load, you may not even know that there's an issue yet.

I always thought leap seconds were stupid by circletimessquare · 2012-07-01 09:08 · Score: 2, Interesting

Why not bundle them and apply them every 10 or 20 years?

And apparently I'm not alone:

http://en.wikipedia.org/wiki/Leap_second#Proposal_to_abolish_leap_seconds

Hogwash, Astronomers can find coping mechanisms, it's either that or these ridiculous levels of stress for systems admins.

--
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it

Re:I always thought leap seconds were stupid by burne · 2012-07-01 09:19 · Score: 2

Hogwash, Astronomers can find coping mechanisms, it's either that or these ridiculous levels of stress for systems admins.
TAI doesn't know about leapseconds, and it's the coping mechanism of choice for astronomy.
Re:I always thought leap seconds were stupid by circletimessquare · 2012-07-01 09:21 · Score: 1

so there you go. there's no good reason for leap seconds. bundle them and apply them once a decade or more.

--
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
Re:I always thought leap seconds were stupid by at10u8 · 2012-07-01 11:01 · Score: 3, Interesting

except that BIPM, the providers of TAI, have published this http://www.bipm.org/cc/CCTF/Allowed/18/CCTF_09-27_note_on_UTC-ITU-R.pdf wherein the CCTF "stresses that TAI is the uniform time scale underlying UTC, and that it should not be considered as an alternative time reference." This appears to indicate that the CCTF and BIPM are not comfortable with the notion that operational systems might be employing TAI as their time scale. At the end of that paper they also discuss the possibility that TAI could cease to exist.
Re:I always thought leap seconds were stupid by thue · 2012-07-01 11:09 · Score: 4, Insightful

> Why not bundle them and apply them every 10 or 20 years?
The problem we have here is that leap seconds are rare. Things that are common are tested for, and quickly found if broken. Having something which only happens every 20 years is a recipe for disaster every 20 years.
My view is that NTP is at fault, because the 61st second is a brittle way to handle it. NTP should use the same method as Google and smear the leap second out over, e.g., an hour: http://googleblog.blogspot.dk/2011/09/time-technology-and-leaping-seconds.html
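For anyone who wants to see the smearing idea in code, here is a minimal sketch. The one-hour window, the cosine ramp and the function names are my own assumptions for illustration, not Google's actual parameters and not anything ntpd ships.

```python
import math

LEAP_EPOCH = 1341100800   # 2012-07-01 00:00:00 UTC as a POSIX timestamp
WINDOW = 3600.0           # assumed smear window, in clock seconds, before the leap


def smear_offset(elapsed, window=WINDOW):
    """How much of the extra second has been absorbed so far (0.0 to 1.0).

    elapsed is real seconds since the smear began; the smear spans
    window + 1 real seconds because it has to swallow the leap second.
    """
    span = window + 1.0
    if elapsed <= 0.0:
        return 0.0
    if elapsed >= span:
        return 1.0
    # Cosine ramp: starts and ends gently, steepest in the middle.
    return (1.0 - math.cos(math.pi * elapsed / span)) / 2.0


def smeared_clock(elapsed, window=WINDOW):
    """Clock value handed to clients, elapsed real seconds into the smear."""
    start = LEAP_EPOCH - window
    return start + elapsed - smear_offset(elapsed, window)


if __name__ == "__main__":
    for elapsed in (0.0, 900.0, 1800.5, 2700.0, 3601.0):
        print(elapsed, smeared_clock(elapsed), smear_offset(elapsed))
```

The smeared clock never repeats or skips a value, so clients never see a 61st second. The catch is that halfway through the window it is half a second away from true UTC.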
Re:I always thought leap seconds were stupid by Yoda222 · 2012-07-01 12:15 · Score: 0

I think you are wrong. Basically, your proposal is to introduce a bug in NTP (the aim of NTP is to keep the computer's clock synchronised with UTC, and UTC introduces a leap second as an additional 23:59:60 second, not by stretching time over one or more hours).
And you want to introduce a bug in one piece of software to fix bugs in several others. Yeah, great idea, but how many bugs will this cheat introduce in other software which relies on a good system time?
We should not work around bugs by introducing other bugs, especially in widely distributed software. Maybe it could be added as an option, with precise documentation, but not as a default behaviour.
Re:I always thought leap seconds were stupid by Anonymous Coward · 2012-07-01 17:34 · Score: 0

My view is that NTP is at fault, because the 61st second is a brittle way to handle it. NTP should use the same method as Google and smear the leap second out over, e.g., an hour: http://googleblog.blogspot.dk/2011/09/time-technology-and-leaping-seconds.html
That would break the main goal of NTP, which is to provide high accuracy time to computers. Many systems, such as telescope control systems, financial trading software etc, depend on NTP to regulate the computer clock at the millisecond or microsecond level, and this accuracy would be lost during a google-style smearing operation.
Re:I always thought leap seconds were stupid by thue · 2012-07-02 00:23 · Score: 1

> That would break the main goal of NTP, which is to provide high accuracy time to computers. Many systems, such as telescope control systems, financial trading software etc, depend on NTP to regulate the computer clock at the millisecond or microsecond level, and this accuracy would be lost during a google-style smearing operation.
Anybody who really cares about reliable time, such as telescopes, should use TAI and not UTC (and I think they do).
To me it is incomprehensible why Unix uses UTC instead of TAI for the hardware clock - TAI is the obviously correct choice.
Re:I always thought leap seconds were stupid by Anonymous Coward · 2012-07-02 01:05 · Score: 0

Hogwash, Astronomers can find coping mechanisms, it's either that or these ridiculous levels of stress for systems admins.
Or maybe stop using poorly tested code in the linux kernel.

Only Linux affected? by cpghost · 2012-07-01 09:15 · Score: 4, Interesting

I'm managing a cluster of 2,400 nodes running FreeBSD, and AFAICS, none was tripped off by leap second NTP adjustments. On the other hand, 4 out of 180 Linux nodes crashed simultaneously at that very moment. All this is exceedingly weird, but may indeed point to a subtle bug in the Linux kernel (only?). I've never witnessed this behavior in the past.

--
cpghost at Cordula's Web.

Re:Only Linux affected? by Anonymous Coward · 2012-07-01 10:50 · Score: 1

IIRC, one of the main FreeBSD dudes is a bit of an atomic timekeeping hobbyist..
I'd actually _expect_ that OS at least to work properly re. leap seconds :-)
Re:Only Linux affected? by Guy Harris · 2012-07-01 12:37 · Score: 2

I'm managing a cluster of 2,400 nodes running FreeBSD, and AFAICS, none was tripped off by leap second NTP adjustments. On the other hand, 4 out of 180 Linux nodes crashed simultaneously at that very moment. All this is exceedingly weird, but may indeed point to a subtle bug in the Linux kernel (only?)
Could be, if "crashed" means "had some processes start spinning like mad". If it was a kernel-mode crash, that might be another bug.
Re:Only Linux affected? by Spazmania · 2012-07-02 07:52 · Score: 1

In my case, crashed meant that a bunch of processes got stuck in an I/O wait ("ps" reported state "D")

--
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
Re:Only Linux affected? by Guy Harris · 2012-07-02 09:36 · Score: 1

In my case, crashed meant that a bunch of processes got stuck in an I/O wait ("ps" reported state "D")
Sounds like some other bug, then - the bug discussed in the LKML thread seems to produce CPU spins.

Debian + Java = Issues by thatskinnyguy · 2012-07-01 09:17 · Score: 2

About 5 seconds after midnight GMT a Java server app running on my Debian Squeeze server decided it was going to eat-up ALL THE THINGS and for some reason, the server rebooted itself. Glad to know I wasn't alone in shitting myself over odd behaviours.

--
The game.

Re:I always thought leap [years] were stupid by Anonymous Coward · 2012-07-01 09:20 · Score: 0

Hogwash, Astronomers can find coping mechanisms, it's either that or these ridiculous levels of stress for systems admins.

The same can be said for leap years. They've been around for a few hundred years and people still can't cope with them. Why don't we just go back to the Julian calendar and drop the Gregorian one?

Ditto for Daylight Saving Time which, IMHO, is completely arbitrary and not tied to any physical need or phenomenon, and we're still "stuck" with nonetheless.

No program I ever write will be able to cope by Anonymous Coward · 2012-07-01 09:21 · Score: 0

I will never write a program that correctly handles seconds=60. Period. EVAR!

Google on how they fixed that.. by Barryke · 2012-07-01 09:43 · Score: 3, Interesting

Google official blog: "Time, technology and leaping seconds" (sept 2011)
http://googleblog.blogspot.in/2011/09/time-technology-and-leaping-seconds.html

I wonder if the leap second has anything to do with the labs Chubby paper / site currently being offline..

--
Hivemind harvest in progress..

Re:Google on how they fixed that.. by Anonymous Coward · 2012-07-01 10:02 · Score: 0

That blog post could have been about 1% the size it was:
We implemented slew mode in our NTP servers.

hmm by Nocturnal Deviant · 2012-07-01 09:59 · Score: 0

Nobody has posted that it also took thepiratebay down, something that the MAFIAA has been trying to do for the last umpteen years?

--
-Noc

MySQL had issues only. by Zombie Ryushu · 2012-07-01 10:01 · Score: 1

MySQL started spiking my CPU when the leap second hit. Only MySQL, and nothing else. It was odd.

Re:Gentoo by miknix · 2012-07-01 10:20 · Score: 1

Clock: inserting leap second 23:59:60 UTC

No problem whatsoever on my Gentoo server, with a 3.3.1 hardened (Linux) kernel.

Hmm, could this have been the cause of my issues? by psm321 · 2012-07-01 10:49 · Score: 1

I had a lot of programs (none Java-based though) taking up an inordinate amount of CPU, and high system CPU usage. Couldn't figure out the cause, and a reboot fixed it. In retrospect, I think it was around midnight UTC.

leapsmear.pool.ntp.org? by Anonymous Coward · 2012-07-01 11:39 · Score: 0

Why can't the guys behind NTP.org provide a leap second smear option like that used by Google as an alternative? Have people who deal with time, deal with the problem in a way that most people wouldn't notice in a centralized, but optional manner?

So, a leap smear NTP capable pool, and a standard NTP pool?

Re: by Guy Harris · 2012-07-01 11:51 · Score: 3, Interesting

The hard system lock bug due to a leap second was patched in 2.6.29, so either you've got some weird related bug, or something is very wrong.

Well, the weird related bug would arguably count as something being wrong. Apparently there is a bug in the handling of the insertion of positive leap seconds that could cause weird behavior with futexes, and that bug appears not to have been fixed until at least July 1, 2012 (I'm guessing John Stultz has worked up a patch).

Re:Your E-Book Is Reading You by micheas · 2012-07-01 12:27 · Score: 1

Well that post might be a candidate for the super rare +5 offtopic mod. (a mod even rarer than +5 troll)

--
Work bio at MMWD

Re: by AmberBlackCat · 2012-07-01 12:34 · Score: 2

If that actually happened, then they should have just made it do 23:59:59 twice instead of crashing all the computers. I would like somebody to give me a concrete reason why any computer system should actually crash because of a lost second.

Non-Tokyo drift by LoadWB · 2012-07-01 12:39 · Score: 1

Considering how much most of the hardware clocks on the hardware I support drift as it is, a leap second ain't nothing compared to the six-hour ntpdate updates.

Re:Non-Tokyo drift by Areyoukiddingme · 2012-07-01 15:26 · Score: 1

I've got a machine with the same problem. In my case, I believe it's caused by a missing CMOS battery (the bracket is broken).
The problem is the 'tick' kernel variable is receiving an invalid value during startup. Use the 'tickadj' program to display and correct this value. On my machine, a value of 10000 eliminates the drift. Your (and my) machine is getting a value that's too high, causing the huge drift.
I have not found a way to make this fix persist across power-loss/reboots.
Once tick has the correct value, the six hour ntpdate updates will go away and ntpd will be able to run. (When a clock is drifting that badly, ntpd refuses to even attempt to discipline the clock.)
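To put numbers on that, here is a rough back-of-the-envelope sketch. It assumes 'tick' is the number of microseconds added to the clock per timer tick, with a nominal value of 10000 (100 ticks per second); the function name is made up for illustration.

```python
NOMINAL_TICK_US = 10000  # assumed nominal tick: 10000 microseconds (100 ticks per second)


def drift_seconds_per_day(tick_us, nominal_us=NOMINAL_TICK_US):
    """Approximate clock drift per day if the kernel uses tick_us instead of nominal_us.

    Positive means the clock runs fast, negative means it runs slow.
    """
    return (tick_us - nominal_us) / nominal_us * 86400


if __name__ == "__main__":
    for tick in (9999, 10000, 10001, 10050):
        print(tick, round(drift_seconds_per_day(tick), 2), "s/day")
```

Under that assumption each unit of 'tick' is 100 ppm, or about 8.64 seconds per day. As far as I know ntpd can only correct frequency errors up to roughly 500 ppm, which would explain why it gives up when tick is badly off.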
Re:Non-Tokyo drift by Anonymous Coward · 2012-07-02 08:40 · Score: 0

That's what you should be using ntpd for, to correct for drift instead of running ntpdate every 6 hours.
Re:Non-Tokyo drift by LoadWB · 2012-07-02 11:03 · Score: 1

Notice I said "hardware." I don't have that level of access to all of the hardware I run without "rooting" or otherwise hacking the firmware. My Solaris boxes are one thing, and Windows handles time-skew fairly well. But it's "black box" hardware which is different. None of my WAPs or routers crashed over the weekend.

Nice troll by Tough Love · 2012-07-01 13:34 · Score: 1

some of the net’s fundamental software platforms — including the Linux operating system and the Java application platform

Nice troll. How did my half dozen continuously running Linux systems including a server and a router cope with it then?

--
When all you have is a hammer, every problem starts to look like a thumb.

Re:Nice troll by Anonymous Coward · 2012-07-01 16:35 · Score: 0

Because they probably aren't doing fuck all. Judging that you couldn't be bothered to read the article or the discussion, I'm going to guess you maintain a network nobody really gives a rats ass about. Come talk to me when you have hundreds, thousands or more Linux servers, actually doing more than idling and then we'll talk.
Re:Nice troll by Tough Love · 2012-07-01 19:23 · Score: 1

Because they probably aren't doing fuck all
You would be wrong about that, but you're just some random loudmouth anonymous asshole, so who cares what you think.

--
When all you have is a hammer, every problem starts to look like a thumb.
Re:Nice troll by Anonymous Coward · 2012-07-01 21:09 · Score: 0

It's quite easy to explain. You fail to realize that just because a bug exists doesn't mean that every single person is going to encounter it. I'd recommend attempting to read bug reports and the technical discussions around it, but I know you wouldn't understand anything given your limited mental capacity for such things.

What's with Chrome? by rrohbeck · 2012-07-01 13:43 · Score: 1

On all my Linux systems, Chrome plus some kernel threads pulled 100% CPU until I exited Chrome (which worked fine with Shift-Ctrl-Q).
On one system Chrome refuses to start now. It restores the tabs but every tab is an "Aw, snap!" page, even if I move the configuration directory away.

--
thegodmovie.com - watch it

mythtv and maybe mysql went nuts for me by TheGratefulNet · 2012-07-01 13:55 · Score: 1

not sure if it was related, but I noticed a load avg of well over 10 on my amd e350 myth server box. I don't normally watch for load counts but I did notice that java was eating a lot of cpu, too; and I don't run java directly, some other 'stuff' on my system must be doing that. (sigh, I hate java...)

a reboot fixed things. I hate saying that, too. but the system was very slow and I needed to remove a pci card, anyway (lol).

kernel was 3.0.something

--
"It is now safe to switch off your computer."

Re: mainly applications by neonsignal · 2012-07-01 14:30 · Score: 1

Yes, for example, there were a number of anecdotes of MySQL databases using 100% CPU time, and the article mentions the similar Java/Cassandra problem. I suspect these two probably account for the majority of issues encountered.
http://blog.mozilla.org/it/2012/06/30/mysql-and-the-leap-second-high-cpu-and-the-fix/

Please read this lkml thread before commenting by Guy Harris · 2012-07-01 14:42 · Score: 3, Informative

This linux-kernel mailing list thread discusses a kernel bug that causes futexes to repeatedly time out, so that code using them (which might include POSIX mutexes and condition variables, if that's what glibc uses for them on Linux) might spin.

That's not the kernel leap-second-handling bug that was fixed back in March, so it's not as if a properly-patched kernel wouldn't get hit by this (unless you define "properly-patched" as "includes the patch John Stultz came up with on July 1, 2012").

So, yes, this particular bug is Linux-specific (i.e., there's a reason why it hit Linux servers), and might not be the fault of the userland code running atop it (so it might not, for example, be Java's fault).

I agree... by CFD339 · 2012-07-01 16:45 · Score: 0

I'm mildly fascinated (by mildly, I mean if someone here has a good answer, I'll read it, but that's as far as I'm going) -- I have a few linux servers and they are apparently either too old or updated enough to have not had the problem, because they didn't crash.

My first question was, "How the hell would my servers even KNOW that someone had inserted a "Leap Second" into the time unless they happened to do their ntp updates at exactly that time?" That of course would be followed by "why would they care?"

As far as I knew, NTP just says "Hey, server, what time is it" and gets back "It's f*****ing 3 o'clock, exactly 1 hour since the last time you asked. Now go away and quit bothering me!" (only it says it really fast using a single big number) At which point the software makes sure my server knows it's 3 o'clock.

--
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln

Re:I agree... by cmdrbuzz · 2012-07-01 21:10 · Score: 3, Insightful

When NTP knows that a leap second is to be added, it (on Linux at least) sets a flag in the kernel to say that at 23:59:59, please continue to 23:59:60 before going to 00:00:00. NTP can set this flag at any time on the day the leap second is due, which is why a server running NTP on Linux would know that a leap second is due TODAY (because they should always be applied at the 23:59:59 cross-over).
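Here is a toy model of what that looks like from the clock's point of view. This is not the kernel's code, just a sketch under the assumption that the "insert" flag (STA_INS via adjtimex(), as I understand it) makes the seconds counter step back by one when it reaches midnight UTC. The function name is hypothetical.

```python
def realtime_values(start, ticks, leap_at=None):
    """Return (real_elapsed, clock_value) pairs for a seconds-since-the-Epoch clock.

    If leap_at is given, the clock is stepped back one second the first time
    it reaches that value, so the preceding value is reported twice.
    """
    clock = start
    pending = leap_at is not None
    out = []
    for elapsed in range(ticks):
        out.append((elapsed, clock))
        clock += 1
        if pending and clock == leap_at:
            clock -= 1        # the inserted second: repeat the last value
            pending = False
    return out


if __name__ == "__main__":
    midnight = 1341100800     # 2012-07-01 00:00:00 UTC
    for elapsed, clock in realtime_values(midnight - 2, 4, leap_at=midnight):
        print(elapsed, clock)
```

Four real seconds go by but the counter only advances by two: the value for 23:59:59 shows up a second time, and that second occurrence is what gets displayed as 23:59:60.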
Re:I agree... by cmdrbuzz · 2012-07-01 21:15 · Score: 1

Oh and what you thought of as NTP sounds more like how NTPDATE works (i.e one shot "whats the time Mr Server" style clock updates)
NTP is /far/ more complicated and does stuff like working out the time delay between you and the server(s), the skew of /your/ clock (so it knows if your clock tends to run a bit fast/slow and adjusts for that) and lots of other clever "make time of day clocks work better" stuff (and sometimes even updating the HW TOD clock if needed)
Re:I agree... by MobyDisk · 2012-07-02 02:00 · Score: 1

Is there a system call that actually could return 23:59:60 as a valid time???
Re:I agree... by Anonymous Coward · 2012-07-02 05:13 · Score: 0

Is there a system call that actually could return 23:59:60 as a valid time???
Yes, strftime() is just one of them. From the man page:
%S The second as a decimal number (range 00 to 60). (The range is up to 60 to allow for occasional leap seconds.)
And here's what the man page for ctime() has to say about tm_sec in a struct tm:
tm_sec The number of seconds after the minute, normally in the range 0 to 59, but can be up to 60 to allow for leap seconds.
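Same thing from Python, as a quick sketch: time.strftime() is a thin wrapper over the C function and accepts a seconds field of 60, while the datetime class does not. The tuple below is hand-built for the 2012 leap second (Python's tm_yday is 1-based).

```python
import time
from datetime import datetime

# struct_time-style tuple: (year, month, day, hour, min, sec, wday, yday, isdst)
# 2012-06-30 was a Saturday (wday=5) and day 182 of the year.
leap = (2012, 6, 30, 23, 59, 60, 5, 182, 0)
formatted = time.strftime("%Y-%m-%d %H:%M:%S", leap)

try:
    datetime(2012, 6, 30, 23, 59, 60)
    datetime_accepts_60 = True
except ValueError:
    datetime_accepts_60 = False

print(formatted)
print("datetime accepts second=60:", datetime_accepts_60)
```

So whether you can even represent 23:59:60 depends on which API you pick, even within one language.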
Re:I agree... by Guy Harris · 2012-07-02 07:26 · Score: 1

When NTP knows that a leap second is to be added, it (on Linux at least) sets a flag in the kernel to say that at 23:59:59, please continue to 23:59:60 before going to 00:00:00.
Where does the Linux kernel know about "23:59:59" and "23:59:60" rather than "N seconds since the Epoch", other than when dealing with real time clock hardware that maintains year/month/day/hour/minute/second rather than a count of ticks?
It looks as if the stuff in kernel/time/ntp.c adjusts the "N seconds and M whatevers since the Epoch" counter so that it reflects "seconds since the Epoch" rather than the number of seconds that have elapsed since the Epoch.
Re:I agree... by Guy Harris · 2012-07-02 07:32 · Score: 1

As far as I knew, NTP just says "Hey, server, what time is it" and gets back "It's f*****ing 3 o'clock, exactly 1 hour since the last time you asked. Now go away and quit bothering me!" (only it says it really fast using a single big number) At which point the software makes sure my server knows it's 3 o'clock.
No, as others have noted, NTP does a lot more - including saying "hey, a {positive or negative} leap second is coming up!" (look for "leap indicator" in RFC 5905). What the NTP client does with that is up to the client; I guess Linux is trying to do what POSIX specifies, i.e. having "seconds since the Epoch" be something other than a count of the seconds that have elapsed since the Epoch.
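To make the "seconds since the Epoch" point concrete, here is the POSIX formula as I read it from the Base Definitions, transcribed into Python. The function name is mine; tm_year is years since 1900 and tm_yday is 0-based, as in C's struct tm.

```python
def posix_seconds(tm_year, tm_yday, tm_hour, tm_min, tm_sec):
    """POSIX 'seconds since the Epoch' for a broken-down UTC time.

    tm_year is years since 1900; tm_yday is days since January 1 (0-based).
    """
    return (tm_sec + tm_min * 60 + tm_hour * 3600 + tm_yday * 86400
            + (tm_year - 70) * 31536000
            + ((tm_year - 69) // 4) * 86400
            - ((tm_year - 1) // 100) * 86400
            + ((tm_year + 299) // 400) * 86400)


if __name__ == "__main__":
    leap_second = posix_seconds(112, 181, 23, 59, 60)   # 2012-06-30 23:59:60
    next_day = posix_seconds(112, 182, 0, 0, 0)         # 2012-07-01 00:00:00
    print(leap_second, next_day)
```

Every day contributes exactly 86400 in that formula, so the leap second and the following midnight map to the same number. That is the sense in which the value is not a count of elapsed seconds.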
Re:I agree... by cmdrbuzz · 2012-07-02 08:20 · Score: 1

And it does this at the (human readable version) 23:59:59 to 00:00:00 handover to make it happen at the end of the day.
I appreciate that Linux manages its TOD clock as xxx ticks etc., but what I wrote is accurate from the point of view of a user watching the result. Trying to explain how NTP does it (on all the different OSes it runs on) is a bit much for a Slashdot post (well, at least it's more time than I have for posting!)
FWIW NTP on z/OS just spins the clock so the last second runs really slow to make sure that any apps don't ever see the ::60 (or try and react to it at least).
Re:I agree... by CFD339 · 2012-07-02 09:13 · Score: 1

This seems like the kind of problem made by one of those "super-bright" children that got hired for way too much money in the late 90's and were given offices with dog beds and room for their skateboards while they ignored all the great original standards that made the internet possible, and built insanely over-engineered new ones in the hope of making a fortune.

--
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
Re:I agree... by Guy Harris · 2012-07-02 12:02 · Score: 1

This seems like the kind of problem made by one of those "super-bright" children that got hired for way too much money in the late 90's and were given offices with dog beds and room for their skateboards while they ignored all the great original standards that made the internet possible, and built insanely over-engineered new ones in the hope of making a fortune.
I'm not sure what "this" refers to there, but, as per RFC 958, the "leap indicator" dates back at least to 1985, and, at least if I remember correctly, the "seconds since the Epoch" doesn't mean "seconds that have elapsed since the Epoch" dates back to the original 1988 POSIX.

Time to follow Hanke-Henry Calendar by Anonymous Coward · 2012-07-01 18:08 · Score: 1

http://releases.jhu.edu/2011/12/27/time-for-a-change-johns-hopkins-scholars-say-calendar-needs-serious-overhaul/
Proposed permanent calendar has a predictable 91-day quarterly pattern of two months of 30 days and a third month of 31 days,

The calendar - http://henry.pha.jhu.edu/ccct.calendar.html
FAQ - http://henry.pha.jhu.edu/calendar.html

likely the futex issue by Chirs · 2012-07-01 18:24 · Score: 1

Check out the other threads pointing out an issue with futexes. There's an easy workaround, just manually set the time on your system and the problem will go away until the next leap second.
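The shell version people passed around was along the lines of date -s "$(date)". A hedged Python equivalent is sketched below; the function name and the dry-run default are my own, and actually applying it needs root on a Unix system. My understanding is that any explicit clock set is enough to make the kernel refresh its timers, but treat that as an assumption.

```python
import time


def reset_clock_to_itself(apply=False):
    """Re-set CLOCK_REALTIME to its current value.

    With apply=False (the default) nothing is changed and the value that
    would be written is returned. With apply=True this needs root.
    """
    now = time.clock_gettime(time.CLOCK_REALTIME)
    if apply:
        time.clock_settime(time.CLOCK_REALTIME, now)
    return now


if __name__ == "__main__":
    print("would set clock to", reset_clock_to_itself())
```

Setting the clock to (almost exactly) what it already says loses a few microseconds at most, which is a lot cheaper than a reboot.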

Re:likely the futex issue by rrohbeck · 2012-07-01 18:45 · Score: 1

Yeah, I saw in the meantime that Firefox and Chrome are affected in the same way.

--
thegodmovie.com - watch it

What did the computers do with that extra second? by RivenAleem · 2012-07-01 19:55 · Score: 2

Data: She brought me closer to humanity than I ever thought possible, and for a time...I was tempted by her offer.
Jean-Luc Picard: How long a time?
Data: Zero point six eight seconds, sir. For an android, that is nearly an eternity.

Reboot by Anonymous Coward · 2012-07-01 21:09 · Score: 0

Why didn't everyone just schedule a reboot for 11:58pm?

Well done everyone by ghostdoc · 2012-07-01 21:22 · Score: 1

NOW can we just collectively pat ourselves on the back for Y2K?

I still talk to people who believe Y2K was all a hoax perpetrated by computer consultancy companies to scare upgrade cash from large customers.
Now at least I have some ammunition to shoot back with :)
And hopefully we can start getting people to take the coming 32-bit epoch end seriously too
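For reference, the rollover date is easy to work out. A small sketch, assuming a signed 32-bit counter of seconds since the Epoch:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1

# Last second a signed 32-bit counter can represent.
last_good = EPOCH + timedelta(seconds=INT32_MAX)

# One more tick wraps the counter around to the most negative value.
wrapped = (INT32_MAX + 1 + 2**31) % 2**32 - 2**31
after_wrap = EPOCH + timedelta(seconds=wrapped)

print(last_good.isoformat())
print(after_wrap.isoformat())
```

So the last representable moment is in January 2038, and a naive wraparound lands in December 1901.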

--
Business/App ideas are like arseholes: everyone's got one, they're mostly shit, but very rarely they contain a diamond

Re: by LoztInSpace · 2012-07-01 22:00 · Score: 1

That's what I thought too. I don't understand why it's any different to having a manual or automatic clock update for DST or any other reason. If there really was a text version that came across as 23:59:60 that's utterly laughable.
The only system in the world that would accept such a thing is a MySQL "database".

Yes, POSIX lets you do it by Anonymous Coward · 2012-07-02 00:23 · Score: 0

Know your clocks. Check clock_gettime() and the other clock_* functions, and all the clock types in POSIX.

Re:Yes, POSIX lets you do it by Guy Harris · 2012-07-02 05:21 · Score: 1
Know your clocks. Check clock_gettime() and the other clock_* functions, and all the clock types in POSIX.
You mean the clock types such as:
- CLOCK_REALTIME, which "[represents] the amount of time (in seconds and nanoseconds) since the Epoch", where at least in the Rationale, they say it's "a higher resolution version of the clock that maintains POSIX.1 system time", so that it's "seconds and nanoseconds since the Epoch" rather than a count of the number of seconds-and-nanoseconds that have elapsed since the Epoch (i.e., it has the same problem as time()), and
- CLOCK_MONOTONIC, which "represents the amount of time (in seconds and nanoseconds) since an unspecified point in the past", rendering it useless as a time-of-day clock (and they don't even appear to guarantee that the clock won't freeze during a positive leap second - they just say it won't move backwards - and they explicitly indicate that it can jump forward and thus could jump forward during a negative leap second)?
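In practice the split usually comes down to "wall clock for timestamps, monotonic clock for intervals". A minimal sketch on a Unix system follows; the timed() helper is made up for illustration, and the caveats above about what POSIX actually guarantees still apply.

```python
import time


def timed(fn):
    """Run fn() and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.monotonic()
    result = fn()
    return result, time.monotonic() - start


# Seconds since the Epoch; can be stepped by NTP, an admin, or a leap second.
wall = time.clock_gettime(time.CLOCK_REALTIME)
# Seconds since an unspecified point; useless as a time of day.
mono = time.clock_gettime(time.CLOCK_MONOTONIC)

if __name__ == "__main__":
    result, elapsed = timed(lambda: sum(range(1000)))
    print(result, elapsed, wall, mono)
```

An interval measured this way cannot come out negative when the wall clock is stepped back, which is the failure mode you get from subtracting two CLOCK_REALTIME readings.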

86400 by Anonymous Coward · 2012-07-02 00:49 · Score: 0

Many pieces of software assume 86400 seconds for a day. I just did a quick check and BIND 9.6 has logic using this as an estimate for zone refresh times, etc. The code tries to deal with leap years, but not seconds.

Perl's posix module defines a day as 86400. Perl File::Stat too.

The pw and tcpdump commands in MidnightBSD and at least FreeBSD 7.x make a similar assumption.

kern_shutdown.c in FreeBSD ...

This assumption is everywhere.
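A quick illustration of the assumption, sketched in Python. Its datetime arithmetic ignores leap seconds entirely; the LEAP_INSERT_DATES set and utc_day_length() are hypothetical, and the set lists only the one date from this story.

```python
from datetime import date, datetime

# Assumption: a hand-maintained set of UTC dates that end with an inserted
# leap second. Only the 2012 one from this thread is listed here.
LEAP_INSERT_DATES = {date(2012, 6, 30)}


def utc_day_length(day):
    """Length of a UTC day in SI seconds, given the table above."""
    return 86400 + (1 if day in LEAP_INSERT_DATES else 0)


# What the standard library thinks elapsed between the two midnights.
naive = (datetime(2012, 7, 1) - datetime(2012, 6, 30)).total_seconds()

if __name__ == "__main__":
    print(naive, utc_day_length(date(2012, 6, 30)))
```

The library answer is 86400 while the real day was 86401 seconds long. For refresh timers and the like that is harmless, which is probably why nobody bothers.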

We took the coward's way out... by ElVee · 2012-07-02 01:34 · Score: 3, Informative

I work at a fairly large international outfit, with data feeds coming and going to the far ends of the Earth. Everything we do is time-sensitive. Processing messages that depend on prior messages already being processed means we can't gracefully handle things coming in out of order.

We spent lots of time and money studying this problem, hired a high-priced consulting outfit to advise us and spun up lots of projects to mitigate the "risk" of the leap second. There were far too many meetings and conference calls with vendors, VARS and other people that wanted us to pay them for their time. What was determined was that we couldn't guarantee that nothing would crash or (gasp!) messages might be discarded or processed incorrectly, which was a risk we weren't willing to take. We run a full gamut of OSes, from HP/UX, Solaris, Linux, z/TPF, z/OS, DB2 etc etc.. You get the idea. Too many variables and too many systems to update and test with the limited funds and limited timeframe given.

In the end, we avoided the problem by shutting down all (and I do mean ALL) processing and flushing all the transactional systems to disk and suspending EVERYTHING from a minute before until a minute after the leap second. (Was that two minutes or two minutes PLUS one second? Clock math has always eluded me.) Shutting down all these interconnected systems in the correct order was a precision dance that, in the end, we didn't perform very well. Messages did end up being discarded. At precisely :20 seconds after the leap second, we began syncing all our systems with our internal NTP server and then at precisely one minute after, we slowly started everything back up. There were some systems that required a restart. We manually reprocessed those earlier discarded messages just as fast as our little fingers could type. In all it took us about 15 minutes to get everything spun back up, and all that time is getting charged to our SLA, which affects ALL our evaluations and year-end bonuses.

Lots of work was done, overtime was paid and buckets of money were given to lots of high-priced consultants and I personally will take a hit to my paycheck, all over ONE GODDAMNED SECOND.

Let's not do that again, okay?

--
- Pithy comment goes here.

Hoopla Over One Second by AMMalena · 2012-07-02 01:47 · Score: 1

This is so silly. One second. How many ways, for those who CLAIM they needed to account for this second, could problems have been avoided? Hmm. Heaven forbid they simply IGNORE the extra second (all except those oh-so-crucial banking connections who SUPPOSEDLY need to be perfectly in sync) and let the system either adjust its time however it normally would, or perhaps write a script to pause services for 5-15 seconds while the time is adjusted, or get fancy and write something that slowly took away nanoseconds so that over the course of a minute, hour, day, etc, the second was accounted for.

This is just beyond silly. At least Y2K had logical concerns that people had to deal with (even though THOSE were blown completely out of proportion as well).
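
For what it's worth, the "slowly take away nanoseconds" idea can be sketched in a few lines. This is only an illustration under assumed parameters: the function name `smeared_offset`, the linear ramp and the 24-hour window are choices made for the example, not what any particular time service does.

```python
def smeared_offset(seconds_until_leap: float, window: float = 86_400.0) -> float:
    """Fraction of the extra second already absorbed, using a linear ramp.

    seconds_until_leap: time remaining until the leap second (hypothetical input).
    window: length of the smear window in seconds (assumed here: 24 hours).
    """
    if seconds_until_leap >= window:
        return 0.0  # the smear has not started yet
    if seconds_until_leap <= 0:
        return 1.0  # the whole second has been absorbed
    return 1.0 - seconds_until_leap / window


# Halfway through the window, half of the extra second has been absorbed.
print(smeared_offset(43_200.0))
```

With a 24-hour window the clock only has to run slow by 1/86400 of a second per second, roughly 11.6 microseconds, which most applications would never notice.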

--
AMMalena (www.Malena.net) "The avalanche has already begun. It is too late for the pebbles to vote." (Kosh, B5)

Re: by coolmadsi · 2012-07-02 03:25 · Score: 2

If that actually happened, then they should have just made it do 23:59:59 twice instead of crashing all the computers. I would like somebody to give me a concrete reason why any computer system should actually crash because of a lost second.

If you send 23:59:59 twice, you have the same second in the system twice, which can potentially cause issues with logs. If everything is timestamped to the second/millisecond, how can you be sure an event happened in the first 23:59:59 second, or the second (or subsequent) 23:59:59 second?

Older or Younger? by colin_faber · 2012-07-02 03:46 · Score: 0

Should I feel a second Older or Younger?

Re:Older or Younger? by lpq · 2012-07-02 04:32 · Score: 1

If you examine your Linux logs, you'll see an extra second inserted Sunday morning: one minute had 61 seconds, so instead of rolling over at 59, it went on to 60 and then hit 0 (at least in 3.2.X)...

Re: by Rockoon · 2012-07-02 03:51 · Score: 1

Well, it's not a lost second. It's an extra second.

The entire issue is that there are so many uses for time that no single strategy works in all cases. Consider a simple logger that outputs some value once per second. You don't want that logger to output 23:59:59 twice, because that could easily create problems, and you don't want the logger to miss a tick either, because that could cause other problems.

So to solve that problem we create the abnormal 23:59:60. But because it's abnormal, it can easily look like 00:00:00 after simple time manipulation operations, causing 00:00:00 to be seen twice instead: the same problem we were trying to avoid.
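
A quick illustration of that collapse, as a sketch in Python. It uses the standard library's `calendar.timegm`, which does plain arithmetic on the date fields and, like POSIX time, does not count leap seconds:

```python
import calendar

# 2012-06-30 23:59:60 UTC, the inserted leap second
leap = calendar.timegm((2012, 6, 30, 23, 59, 60, 0, 0, 0))
# 2012-07-01 00:00:00 UTC, the second that follows it
midnight = calendar.timegm((2012, 7, 1, 0, 0, 0, 0, 0, 0))

# Both wall-clock labels collapse to the same seconds-since-epoch value.
print(leap == midnight)
```

Anything that stores or compares times as seconds since the epoch therefore cannot tell the two apart.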

I propose the following solution: stop fucking with time in abnormal ways such as leap seconds. The subset of problem domains where syncing to some abstract ideal celestial clock actually matters is rather small, and it's far easier to let those problem domains handle the conversion from system time to abstract celestial time than it is to make everything else work well with edge cases.

--
"His name was James Damore."

Check out what date do by higuita · 2012-07-02 06:19 · Score: 1

Check out what date does...

From man date we have:

(...)
%M minute (00..59)
(...)
%S second (00..60)

So clearly date is able to print 23:59:60 as a valid time.
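
The same allowance exists elsewhere. As a small sketch (Python chosen only for illustration), the standard `time` module documents the `%S` range as 0 to 61, so a seconds field of 60 can be parsed and printed:

```python
import time

# Parse a timestamp whose seconds field is 60 (the leap second label).
parsed = time.strptime("2012-06-30 23:59:60", "%Y-%m-%d %H:%M:%S")

# Format it back; the seconds field is preserved rather than normalized.
formatted = time.strftime("%H:%M:%S", parsed)
print(parsed.tm_sec, formatted)
```

Being able to format the value is a separate matter from how the kernel or an application behaves when the clock actually reaches it.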

--
Higuita

Re:Check out what date do by mattack2 · 2012-07-02 06:50 · Score: 1

I presume you mean date(1)?
I see
ss Seconds, a number from 0 to 61 (59 plus a maximum of two leap seconds).
Re:Check out what date do by higuita · 2012-07-03 02:03 · Score: 1

Your version is surely newer than mine :)
I'm using Debian wheezy
$ date --version
date (GNU coreutils) 8.13
Copyright (C) 2011 Free Software Foundation, Inc.

--
Higuita

I should have watched by Anonymous Coward · 2012-07-02 07:09 · Score: 0

We have a small set of Xen servers, about 40 VMs, all running CentOS 6.2 (all installed about 30 days ago) running with a back-end of FreeBSD/HAST disk failover.
I mirrored this in development, only smaller of course.
Within 10 minutes of 4:59:59 PM PDT, all servers had their CPUs maxed out with the error:
kernel: BUG: soft lockup - CPU#0 stuck for 92s! [ksoftirqd/0:4]

When I say all, I mean all of the servers, development included which is on a completely separate network.

All of our servers are set to run the newest release of Java and all of our applications are written in-house.

The two machines I actually got updated to the newest kernel that was released last week (I think) didn't have the problem.

The Xen hypervisors did not have the problem, but the VMs were so wedged I had to force reboot them.

The FreeBSD boxes did not have any problems, and my ancient Solaris installs did not have a problem. A terribly freaky event, as I had missed that an issue like this could even happen.

Just posting post-mortem, not that it helps now.

The TRUTH about Linux by Anonymous Coward · 2012-07-02 09:07 · Score: 0

Windows had zero problems.

How on earth do you screw up so bad that your system crashes if the time changes?

Ok, this is going to hurt, but every time I try Linux as a desktop, I'm reminded why I use Windows. And it's not just the ugly fonts, and KDE/Gnome and Firefox making everything huge, and the lack of software support, and the cryptic stuff you have to do on the command line to get basic stuff like video drivers working, or the kernel updates that put a cryptic boot choice in the list when I have no idea what it means or which is which, or the problems with the sound, or printers, or other hardware devices that have no drivers, or how the desktops keep going backwards and changing things, I suppose just for the sake of change, or how nothing is really documented and it's just good luck on Google (and the list goes on and on and on)...

It's the fan boys who claim there's nothing wrong. And then we have a leap second and it crashes. Shoulda' got that patch that was out months (not years - wow, really?) ago! Nice.

No... really... Grandma could even use it. *cough*

Unfair to blame 3rd party software (?) by Anonymous Coward · 2012-07-02 17:01 · Score: 0

Small adjustments to the system clock are generally applied gradually by increasing or decreasing the rate of the system clock over a period of time (slewing). That is done to keep the time domain sane for applications using the clock. Applications rightly assume that the clock advances over time. On our Linux systems that was not the case: the clock simply stopped for 1 second! Any application making the perfectly normal assumption that time passes could fail. A simple rate counter that samples some value and divides by the elapsed time could hit a division by zero if the 'sleep()' call returns without any time actually having passed. For example, a programmer would generally make the reasonable assumption that 'sleep(100)' returns only after at least 100 ms have elapsed. Not so when the leap second was applied by brute force. Had NTP been used to adjust the time, it would have happened gradually with no ill effect. We have considered the cost of changing our software to take into account that time may not advance during a sleep() call. That would not only be very expensive but would also not solve the problem for the libraries that we use. Besides, the problem has already been solved by NTP.
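
For illustration only, here is a minimal sketch of the rate-counter case in Python. The helper `rate_per_second` is hypothetical; the sketch assumes the interval is timed with `time.monotonic()`, which is not affected by system clock updates, and shows a guard for the zero-elapsed case described above.

```python
import time
from typing import Optional


def rate_per_second(count: int, elapsed: float) -> Optional[float]:
    """Return events per second, or None when no time has elapsed."""
    if elapsed <= 0:
        # A stalled or stepped-back wall clock would otherwise
        # lead to a division by zero (or a negative rate).
        return None
    return count / elapsed


start = time.monotonic()  # monotonic clock: cannot go backwards
time.sleep(0.1)           # sleep for about 100 ms
elapsed = time.monotonic() - start
print(rate_per_second(50, elapsed))
```

This does nothing for third-party libraries that read the wall clock themselves, which is the poster's point.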

Re:Unfair to blame 3rd party software (?) by KenLarsen · 2012-07-02 17:15 · Score: 1

On our Red Hat systems the clock stopped for 1 second. Java's System.currentTimeMillis() returned the same value for an entire second. Not good.

CentOS 6.2 by Anonymous Coward · 2012-07-02 18:11 · Score: 0

FYI, running CentOS 6.2 (RHEL 2.6.32-220.el6.x86_64)

Noticed constant high CPU in tomcat6 and qpidd - leap second bug was the problem.

Leap second kernel bug by Anonymous Coward · 2012-07-03 21:53 · Score: 0

Keeping the system up to date doesn't mean that old bugs are gone.
In our case it was many systems running RHEL6 with "2.6.32-220.4.1.el6.x86_64 #1 SMP Thu Jan 19 14:50:54 EST 2012"

see http://blog.admintoon.com/?p=336 for the fun I had on a Sunday

Slashdot Mirror

230 comments