Slashdot Mirror


Leap Second Bug Causes Crashes

An anonymous reader writes in with a Wired story about the problems caused by the leap second last night. "Reddit, Mozilla, and possibly many other web outfits experienced brief technical problems on Saturday evening, when software underpinning their online operations choked on the “leap second” that was added to the world’s atomic clocks. On Saturday, at midnight Greenwich Mean Time, as June turned into July, the Earth’s official time keepers held their clocks back by a single second in order to keep them in sync with the planet’s daily rotation, and according to reports from across the web, some of the net’s fundamental software platforms — including the Linux operating system and the Java application platform — were unable to cope with the extra second."

29 of 230 comments

  1. All of my servers were fine by Anonymous Coward · · Score: 5, Insightful

    And I didn't do anything special, just kept their software up-to-date.

    1. Re:All of my servers were fine by Sir_Sri · · Score: 4, Informative

      That can be hard for some people.

    2. Re:All of my servers were fine by Anonymous Coward · · Score: 5, Informative

      the patch was posted back in March.

      https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d

    3. Re:All of my servers were fine by nmb3000 · · Score: 4, Insightful

      And I didn't do anything special, just kept their software up-to-date.

      That's a nice ideal, but the reality is that many up-to-date "stable" distribution releases are still using kernels which are susceptible to the leap second problem (and haven't had the patch back-ported to them). Ubuntu 8.04 LTS server is supposed to be supported until April 2013, and on my (updated!) system,

      # uname -r
      2.6.24-28-server

      I like the idea of stable releases, but this is a glaring problem with the entire idea. Everyone extols the wondrous virtues of package managers for Linux-based systems, but the dirty secret is that unless you stay bleeding-edge (which is usually the opposite of "server"), you'd better be happy with the 4-year-old version of Apache, PHP, MySQL, and the Linux kernel you're running. Sure, it's possible to manually download and install packages from a newer release (assuming you can get past the dependency hell usually associated with it). Sure, it's possible to try and splice in packages (or "pin" them, in Debian parlance) from a newer repository. Sure, it's possible to install from source, compiling and installing everything by hand. But once you do any of these you've given up 90% of what makes the package manager useful and are just asking for dependency problems in the future.

      And, all that aside, do you even know if the patch released to fix this problem is included in your distribution-released kernel? If you're not rolling your own kernel it can be nigh impossible to know what's included and what's not -- in that case it doesn't even matter if it's up-to-date.

      --
      "What do you despise? By this are you truly known." --Princess Irulan, Manual of Muad'Dib
    4. Re:All of my servers were fine by Guy+Harris · · Score: 4, Informative

      Our problem was with a third party monitoring solution - its daemon process brought every single one of our servers to a near halt by consuming all available CPU cycles at the stroke of midnight GMT.

      The OS itself was fine.

      Well, if you're talking a Linux kernel, the part of the OS that dealt with leap seconds was not OK, and was "not OK" in a fashion that could cause processes using futexes to spin and consume all available CPU cycles when a leap second is introduced.

      This monitoring software is common enough that it likely was behind a lot of the issues seen around the 'net.

      ...perhaps by virtue of either using futexes (in what I'm presuming is a legitimate fashion) or using something that uses futexes.

    5. Re:All of my servers were fine by Gil-galad55 · · Score: 5, Informative

      They lost commercial power due to the big storm system that went through the DC area.

      --

      To follow knowledge like a sinking star, / Beyond the utmost bound of human thought. ("Ulysses", Tennyson)

  2. Linux by Anonymous Coward · · Score: 4, Informative

    I'm a Linux admin at a fairly large hosting company. The only thing that I am personally aware of happening this time around was that the time change triggered a bug in the OpenManage software on Dell servers, causing it to use 100% CPU. The solution was to resync the time and restart OpenManage. It wasn't really a fault of Linux itself, but of OpenManage on Linux. Lots of datacenters use Dell hardware and I'm sure most use OpenManage, so I'm sure the problem was widespread.

    1. Re:Linux by Anonymous Coward · · Score: 5, Informative

      What you describe is a bug in the Linux kernel that causes problems for the Java VM that OpenManage uses.
      It is not a bug in OpenManage at all.

  3. Re: by Anonymous Coward · · Score: 5, Funny

    >hick-up.

    The hick up watching the servers when the leap second came was you.

  4. Our Red Hat servers had no issues at all by 93+Escort+Wagon · · Score: 4, Insightful

    I'm uncertain why these reports keep referring to some monolithic "Linux" that is supposed to have had issues - Red Hat's the biggest Linux vendor, and certainly their "Linux" handled it just fine.

    What distros had issues?

    --
    #DeleteChrome
    1. Re:Our Red Hat servers had no issues at all by Nutria · · Score: 4, Informative

      TFA mentioned that the RHEL6 kernel had the bug, but not RHEL5.

      It appears also that system load was a big factor, so if your systems aren't busy on Saturday then they might not have crashed even if running an affected kernel.

      --
      "I don't know, therefore Aliens" Wafflebox1
    2. Re:Our Red Hat servers had no issues at all by Anonymous Coward · · Score: 4, Informative

      Red Hat had a lot of issues.
      https://access.redhat.com/knowledge/articles/15145
      https://access.redhat.com/knowledge/solutions/154713

      It depended entirely on your load. The buggy kernel code ran every 17 minutes for the 24hr period leading up to the leap-second insertion.
      If you had enough load, your chance of dead-locking your system increased significantly.

      Solution: strip the leap-second flag by manually setting your time.

    3. Re:Our Red Hat servers had no issues at all by MightyMartian · · Score: 4, Funny

      Sorry can't remember the name. It's the one that takes the credit for the work of others.

      Windows?

      --
      The world's burning. Moped Jesus spotted on I50. Details at 11.
    4. Re:Our Red Hat servers had no issues at all by drinkypoo · · Score: 4, Funny

      Sorry can't remember the name. It's the one that takes the credit for the work of others.

      You must be talking about SCO, but if you're still running CND you should probably upgrade.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
  5. What about Windows and Mac? by kthreadd · · Score: 4, Interesting

    So far all I've heard about is affected Linux systems. Did Windows and OS X do just fine?

    1. Re:What about Windows and Mac? by Guy+Harris · · Score: 4, Informative

      My guess is that Windows simply ignored it, so there never was a 61st second in a minute.

      Well, if Microsoft's documentation of the SYSTEMTIME structure reflects the implementation, GetSystemTime(), the claim in that man page^W^WMSDN page that "The system time is expressed in Coordinated Universal Time (UTC)" notwithstanding, cannot acknowledge the existence of a 61st second in a minute ("The second. The valid values for this member are 0 through 59.", as the SYSTEMTIME page says).

      But, just as on UN*X, you have "counter" and "human-style label" times (time_t, struct timeval, struct timespec are examples of the former, and a struct tm as returned by, for example, gmtime() is an example of the latter, on UN*X), with the Windows versions of those being FILETIME and SYSTEMTIME respectively. That page on FILETIME says nothing about leap seconds - does it just keep counting over a positive leap second or does it stop or what? And, if it doesn't just keep counting over a positive leap second, does it just freeze for a whole second, or does it slow down over some period of time so that it eventually syncs up, or what?
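The "counter" versus "label" distinction can be sketched in a few lines of Python. This is only an illustration under the assumption that the platform's gmtime() follows the usual POSIX behaviour; `labels_around` is a hypothetical helper name, and nothing here says anything about how FILETIME behaves.

```python
import calendar
import time

# "Counter" time: seconds since the Epoch for 2012-07-01 00:00:00 UTC,
# the instant right after the June 2012 leap second.
midnight = calendar.timegm((2012, 7, 1, 0, 0, 0))


def labels_around(counter, before=2, after=2):
    """Convert consecutive counter values to (hour, minute, second) labels."""
    out = []
    for c in range(counter - before, counter + after):
        tm = time.gmtime(c)  # "label" time, comparable to a struct tm
        out.append((tm.tm_hour, tm.tm_min, tm.tm_sec))
    return out


labels = labels_around(midnight)
for label in labels:
    print("%02d:%02d:%02d" % label)
```

On a system with POSIX semantics the labels should step from 23:59:59 straight to 00:00:00, so the counter has no value of its own for 23:59:60.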

      As for NTP, Microsoft has a page on "How the Windows Time service treats a leap second", which says

      When the Windows Time service is working as a Network Time Protocol (NTP) client

      The Windows Time service does not indicate the value of the Leap Indicator when the Windows Time service receives a packet that includes a leap second. (The Leap Indicator indicates whether an impending leap second is to be inserted or deleted in the last minute of the current day.) Therefore, after the leap second occurs, the NTP client that is running Windows Time service is one second faster than the actual time. This time difference is resolved at the next time synchronization.

      (the author of which needs to be told what "inserted or deleted" implies - do they mean that, regardless of whether a leap second is inserted or deleted, the NTP client that is running Windows Time service is one second faster than the actual time?)
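For reference, the Leap Indicator is the top two bits of the first byte of an NTP packet. A minimal sketch of reading it, assuming a raw packet is already in hand (the function name and the sample bytes below are made up for illustration, and this is not how the Windows Time service is implemented):

```python
# Meanings of the 2-bit Leap Indicator field in the NTP header.
LEAP_INDICATOR = {
    0: "no warning",
    1: "last minute of the day has 61 seconds",
    2: "last minute of the day has 59 seconds",
    3: "alarm condition (clock not synchronized)",
}


def leap_indicator(packet: bytes) -> int:
    """Return the Leap Indicator from the first byte of an NTP packet."""
    if not packet:
        raise ValueError("empty NTP packet")
    # First byte layout: LI (2 bits) | version (3 bits) | mode (3 bits).
    return packet[0] >> 6


# Hypothetical 48-byte server replies; only the first byte matters here.
insert_reply = bytes([0b01_100_100]) + bytes(47)  # LI=1, version 4, mode 4
delete_reply = bytes([0b10_100_100]) + bytes(47)  # LI=2, version 4, mode 4

print(LEAP_INDICATOR[leap_indicator(insert_reply)])
print(LEAP_INDICATOR[leap_indicator(delete_reply)])
```

Values 1 and 2 are the "inserted" and "deleted" cases, which push a client that ignores the flag in opposite directions.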

      And then there's one more question: if there's anything in the NT kernel that deals with leap seconds, does any version have a glitch, as some versions of the Linux kernel do?

      If not, then many of the other problems might not exist on Windows. This email from John Stultz, the author of the fix linked to in the previous paragraph, seems to indicate that at least some of the problems, if not all of them, stem from a kernel bug, so it might be that Java and company might be Just Fine on systems that don't have a kernel glitch of that sort (so they might work fine on at least some non-Linux systems, as well as on Linux systems with the bug fixed).

  6. Extremely weird by Anonymous Coward · · Score: 5, Informative

    From my own machines and comparing notes with some other people (all in all, about 3k servers) the bug seems to affect machines randomly. Known facts:

    There's a kernel patch that fixes the supposed issue: https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d

    Affects Debian stable a lot.

    Affects Java and Virtualbox (starts using too much CPU).

    Affected my browser (iceweasel on debian testing).

    Affects SOME mysql installs (5.1 and 5.5, but not all, and of two identical installs one might be affected, the other not).

    The fix has been posted in a lot of places: /etc/init.d/ntp stop; date; date `date +"%m%d%H%M%C%y.%S"`; date; /etc/init.d/ntp start

    (I'm all for switching unix time to a simple counter and leaving it to the calendar libs to put the leap seconds where necessary)

    1. Re:Extremely weird by burne · · Score: 4, Informative

      It's a race-condition, either crashing your ancient kernel or causing software using certain kernel-calls to effectively lock up. In both cases load seems to be a factor.

      Over here the race-condition coincided with the actual leap-second and the start of the first batch of cronjobs at 02:00 local time.

      (I'm all for switching unix time to a simple counter and leaving it to the calendar libs to put the leap seconds where necessary)

      Bad idea. It would have prevented kernels affected by the race-condition from crashing, but would have meant most of your running software would have been either hit by this bug or would have been at the mercy of a 17 year old pimple-faced coder.

      I think I prefer a crash over the mayhem caused by banking-software not handling a leap-second correctly. That could bankrupt whole countries.

  7. Re:Linux kernel unable to cope? I think not. by Anonymous Coward · · Score: 4, Informative

    There was a Linux kernel bug. See
    http://news.ycombinator.com/item?id=4183122
    http://marc.info/?l=linux-kernel&m=134110635328824&w=2
    and
    https://lkml.org/lkml/2012/6/30/122

  8. You probably don't do much Java, then by burne · · Score: 5, Informative

    As it turns out my biggest problem was customer-supplied software which uses its own Java JREs. We install a JRE by default and update it whenever possible, but some software (Adeptia, VLTrader, Alfresco) comes with its own ancient JRE and scripts to call that instead of the system-supplied Java.

    Not a single machine crashed (we are very explicitly in charge of what OS-version there's running) but a lot of java locked up and had to be restarted.

    I can even see a small bump in the power-usage around two o' clock (0:00 GMT).

    1. Re:You probably don't do much Java, then by thegarbz · · Score: 4, Funny

      I can even see a small bump in the power-usage around two o' clock (0:00 GMT).

      Leap seconds contribute to global warming. We need to raise this at the next G8 summit.

    2. Re:You probably don't do much Java, then by Guy+Harris · · Score: 4, Informative

      So are you saying that, in addition to the Linux kernel glitch in question (which appears to cause some userland processes to spin)

      Actually, I'm not sure that's the case. John Stultz's mail from July 1, 2012 speaks of a bug where clock_was_set() wasn't called after the leap second was added, and of a patch he was working on, so the bug in question might not have been fixed in March.

  9. Re:Why now? by vux984 · · Score: 4, Insightful

    and above all it should not be changed to accommodate fluctuations in the orbit of a rock circling an arbitrary star.

    That is precisely the point of keeping track of the time of day, or day of the year.

    time of day is an arbitrary number whose main utility lies in it being composed of predictable periods and divided into homogenous units.

    You do not need a complex system like date/time, comprised of seconds, minutes, hours, weeks, months, and years, if you just want to measure time in a convenient homogeneous unit. In that case, define a time-zero and just count milliseconds from that to whatever arbitrary distance into the past and future you want. Measure it in kilo-seconds, mega-seconds, giga-seconds... etc.

    The entire point of date/time is because we do in fact care a lot about how that "arbitrary counter" lines up with when we will be awake or asleep or eating at various points -- that's what makes it useful.

    What we should have is what I've described above, time-zero and a counter. And translations from that to localized date time should be handled by a library.

  10. Re:Linux kernel unable to cope? I think not. by Anonymous Coward · · Score: 5, Interesting

    I run Arch Linux with kernel 3.4.4 and it went haywire. My machine was very heavily loaded at the time and when the leap second happened mysqld, firefox, and ksoftirq processes started consuming 100% CPU. The load factor was well over 10 and the machine was grinding along. It didn't actually fail but it was loaded down.

    Even restarting the processes didn't fix it. The high load would go away once I stopped the processes but as soon as I started them again the load would come right back. I had Firefox open on a blank page not doing anything and it was slammed at 100% CPU, and I had a couple of ksoftirq tasks slammed at 100% CPU each too.

    I had to reboot the machine to get it back to normal.

    I have Ubuntu and Debian servers that for whatever reason did not add the leap second so they were fine. Their time was a second off today though (at least until ntp slowly corrected it or I manually intervened).

  11. Only Linux affected? by cpghost · · Score: 4, Interesting

    I'm managing a cluster of 2,400 nodes running FreeBSD, and AFAICS, none was tripped off by leap second NTP adjustments. On the other hand, 4 out of 180 Linux nodes crashed simultaneously at that very moment. All this is exceedingly weird, but may indeed point to a subtle bug in the Linux kernel (only?). I've never witnessed this behavior in the past.

    --
    cpghost at Cordula's Web.
  12. Re:FUD? by kasperd · · Score: 4, Funny

    Actually I didn't realize I was affected by this bug until a few minutes ago, when I used strace to see why firefox was using up all the time on one of my cores.

    You don't need a leap second in order for that to happen. Firefox does that regularly.

    --

    Do you care about the security of your wireless mouse?
  13. Re:Linux kernel unable to cope? I think not. by kwardroid · · Score: 5, Informative

    Restarting ntp wasn't enough for me, I had to reset the date with:
    date -s "`date`"
    Only one machine went haywire though.

  14. Re:I always thought leap seconds were stupid by thue · · Score: 4, Insightful

    > Why not bundle them and apply them every 10 or 20 years?

    The problem we have here is that leap seconds are rare. Things that are common are tested for, and quickly found if broken. Having something which only happens every 20 years is a recipe for disaster every 20 years.

    My view is that NTP is at fault, because the 61st second is a brittle way to handle it. NTP should use the same method as Google for smearing the leap second out over, e.g., an hour: http://googleblog.blogspot.dk/2011/09/time-technology-and-leaping-seconds.html
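A rough sketch of what smearing means in practice: instead of repeating a second, the served time falls behind gradually over a window before the leap. The linear ramp, the one-hour window and the names below are all assumptions made for illustration; the linked post describes Google's own scheme, which this does not try to reproduce.

```python
def smear_offset(t: float, leap_at: float, window: float = 3600.0) -> float:
    """Fraction of the leap second already absorbed at time t (0.0 to 1.0).

    t and leap_at are seconds on a uniform counter; the ramp is linear
    over the `window` seconds leading up to the leap.
    """
    start = leap_at - window
    if t <= start:
        return 0.0
    if t >= leap_at:
        return 1.0
    return (t - start) / window


def smeared_time(t: float, leap_at: float, window: float = 3600.0) -> float:
    """Time handed out to clients: true count minus the absorbed fraction."""
    return t - smear_offset(t, leap_at, window)


# Hypothetical leap instant two hours into the counter.
LEAP_AT = 7200.0
for t in (3600.0, 5400.0, 7200.0, 7201.0):
    print(t, smeared_time(t, LEAP_AT))
```

The served time never jumps or repeats; each served second is simply a little longer than a real one during the window.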

  15. Re:Why now? by Guy+Harris · · Score: 4, Funny

    What we should have is what I've described above, time-zero and a counter. And translations from that to localized date time should be handled by a library.

    Which, sadly, POSIX doesn't let you have as "UNIX time":

    4.15 Seconds Since the Epoch

    A value that approximates the number of seconds that have elapsed since the Epoch. A Coordinated Universal Time name (specified in terms of seconds ( tm_sec ), minutes ( tm_min ), hours ( tm_hour ), days since January 1 of the year ( tm_yday ), and calendar year minus 1900 ( tm_year )) is related to a time represented as seconds since the Epoch, according to the expression below.

    If the year is <1970 or the value is negative, the relationship is undefined. If the year is >=1970 and the value is non-negative, the value is related to a Coordinated Universal Time name according to the C-language expression, where tm_sec , tm_min , tm_hour , tm_yday , and tm_year are all integer types:

    tm_sec + tm_min*60 + tm_hour*3600 + tm_yday*86400 + (tm_year-70)*31536000 + ((tm_year-69)/4)*86400 - ((tm_year-1)/100)*86400 + ((tm_year+299)/400)*86400

    The relationship between the actual time of day and the current value for seconds since the Epoch is unspecified.

    How any changes to the value of seconds since the Epoch are made to align to a desired relationship with the current actual time is implementation-defined. As represented in seconds since the Epoch, each and every day shall be accounted for by exactly 86400 seconds.

    Note:

    The last three terms of the expression add in a day for each year that follows a leap year starting with the first leap year since the Epoch. The first term adds a day every 4 years starting in 1973, the second subtracts a day back out every 100 years starting in 2001, and the third adds a day back in every 400 years starting in 2001. The divisions in the formula are integer divisions; that is, the remainder is discarded leaving only the integer quotient.
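The quoted expression is easy to try out. Below is a direct Python transcription (the function name is made up; `//` stands in for C integer division, which gives the same result here because every numerator is non-negative for years from 1970 on).

```python
import calendar


def posix_seconds(tm_sec, tm_min, tm_hour, tm_yday, tm_year):
    """Seconds since the Epoch per the POSIX expression (years >= 1970)."""
    return (tm_sec + tm_min * 60 + tm_hour * 3600 + tm_yday * 86400
            + (tm_year - 70) * 31536000
            + ((tm_year - 69) // 4) * 86400
            - ((tm_year - 1) // 100) * 86400
            + ((tm_year + 299) // 400) * 86400)


# 2012-07-01 00:00:00 UTC: tm_year = 112, tm_yday = 182 (2012 is a leap year).
midnight = posix_seconds(0, 0, 0, 182, 112)

# 2012-06-30 23:59:60 UTC, the leap second itself: tm_yday = 181, tm_sec = 60.
leap = posix_seconds(60, 59, 23, 181, 112)

print(midnight, leap, calendar.timegm((2012, 7, 1, 0, 0, 0)))
```

Because every day is accounted for by exactly 86400 seconds, the leap second and the midnight that follows it map to the same value.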

    If there were a UN*X API to get a count of seconds since the Epoch (in addition to, or instead of, a call to get "seconds since the Epoch"), and a UN*X API to convert those to UTC and local time labels, that would get what you want. Modulo making it work with NTP, the former could be implemented with less difficulty than a call to get "seconds since the Epoch", and the latter is called "the Olson code complete with the leap seconds database".

    However, that would then require some mechanism to allow code to schedule something to happen at a given UTC label; simply calculating the UNIX time for that UTC label, getting the current UNIX time, and scheduling it for then-now seconds in the future is insufficient, as the UNIX time for a given UTC label in the future might change if a leap second is scheduled between then and now. (Note that supporting scheduling something to happen at a given local civil time label would already require correction of that sort to handle DST rule changes.) This would also have to do something if you schedule an event for YYYY-DD-MM 23:59:59 and a negative leap second occurs so that there is no 23:59:59 on YYYY-DD-MM; "something" might be "let somebody know and ask them to correct it" or "do it at 00:00:00 on the next day", perhaps depending on the reason why it's scheduled.
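A toy version of the "plain counter plus conversion library" idea, to make the discussion concrete. Everything here is an assumption for illustration: the counter is a hypothetical uniform count of elapsed seconds, and the leap table holds only the June 2012 entry, whereas a real table (such as the one shipped with the Olson code) has many more.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

# Toy leap-second table: POSIX time of the midnight right after each
# inserted leap second. Only 2012-06-30 is listed; earlier ones are ignored.
LEAPS = [1341100800]


def to_utc_label(counter: int):
    """Convert a uniform elapsed-seconds counter to a UTC label tuple."""
    inserted = 0
    for leap_posix in sorted(LEAPS):
        leap_counter = leap_posix + inserted  # counter value of 23:59:60
        if counter == leap_counter:
            t = EPOCH + timedelta(seconds=leap_posix - 1)
            return (t.year, t.month, t.day, t.hour, t.minute, 60)
        if counter < leap_counter:
            break
        inserted += 1
    t = EPOCH + timedelta(seconds=counter - inserted)
    return (t.year, t.month, t.day, t.hour, t.minute, t.second)


for c in (1341100799, 1341100800, 1341100801):
    print(c, to_utc_label(c))
```

With this toy table the counter never stalls: it gets a 23:59:60 label for one tick and is one second ahead of POSIX time afterwards. Adding a future entry to the table shifts every later label by a second, which is exactly why an event scheduled by precomputed counter value would need correcting.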