Slashdot Mirror


How the Leap Second Bug Led Facebook To Build DCIM Tools

miller60 writes "On July 1, 2012 the leap second time-handling bug caused many Linux servers to get stuck in a loop. Large data centers saw power usage spike, sometimes by megawatts. The resulting "server storm" prompted Facebook to develop new software for data center infrastructure management (DCIM) to manage its infrastructure, providing real-time data on everything from the servers to the generators. The incident also offered insights into the value of flexible power design in its server farmss, which kept the status updates flowing as the company nearly maxed out its power capacity."

46 comments

  1. DCIM by AK+Marc · · Score: 5, Insightful

    My digital camera already has DCIM tools (as does the computer I plug it in to). I hate re-used acronyms.

    1. Re:DCIM by Anonymous Coward · · Score: 1

      I always got so confused when I tried to get cash from Adobe Type Manager.

    2. Re:DCIM by ttucker · · Score: 3, Insightful

      The point is, make up a different acronym than one which is used ubiquitously in almost every computer related field.

    3. Re:DCIM by NatasRevol · · Score: 2

      If domain controllers had DCIM, it'd be a trifecta!

      --
      There are two types of people in the world: Those who crave closure
    4. Re:DCIM by Just+Some+Guy · · Score: 3, Funny

      I do my online banking over Asynchronous Transfer Mode networks.

      --
      Dewey, what part of this looks like authorities should be involved?
    5. Re:DCIM by Anonymous Coward · · Score: 1

      I hate re-used acronyms.

      I assume you actually hate initialisms (unless, of course, you choose to pronounce "DCIM" as "dickim," in which case I can't help you)

    6. Re:DCIM by leafdevolvqnov · · Score: 0

      Good CPU really bad choices, too many traps.http://onlyhermes.com .

    7. Re:DCIM by AK+Marc · · Score: 1

      I've only heard it pronounced "deesim".

    8. Re:DCIM by Culture20 · · Score: 1

      Not me, I used to use a KVM to switch between my two KVM hosts.

    9. Re:DCIM by mewsenews · · Score: 2

      26*26*26*26 = 456976

      That's basically half a million four letter combinations that companies are able to choose from, all nimbly-bimbly. Yet these assholes decide to use an existing term and mutilate the wikipedia page that was around for four and a half years because of their arrogance

    10. Re:DCIM by gl4ss · · Score: 1

      they could just have called this dcm.

      facebook does overlap in use of dcim and dcim. they got machines doing dcim analyzing which are controlled with dcim.. and this is a story about fb using dcim.

      --
      world was created 5 seconds before this post as it is.
    11. Re:DCIM by Anonymous Coward · · Score: 0

      You're like one of those unsavvy, wannabe tech kiddies who says "ess see ess eye" instead of "scuzzy" or "ess kyoo el" instead of "sequel".

    12. Re:DCIM by oodaloop · · Score: 0

      Yeah, it's hard to imagine why they didn't use ZZXQ, KKKZ, or OOOO. Those are perfectly good acronyms!

      --
      Tic-Tac-Toe, Global Thermonuclear War, and relationships all have the same winning move.
    13. Re:DCIM by Anonymous Coward · · Score: 0

      You're like one of those unsavvy, wannabe tech kiddies who says "ess see ess eye" instead of "scuzzy" or "ess kyoo el" instead of "sequel".

      Only Micro-softheads and kiddies say "sequel" for SQL. Real Programmers call Structured Query Language squeal ... for example the Tandem (now HP) product is pronounced "non stop squeal" and Oracle's implementation (with broken DATE and no TIME) is called "Larry's squeal".

      Now go learn datalog and QDE and get off my damn lawn, Junior.

    14. Re:DCIM by GuB-42 · · Score: 1

      I hate re-used acronyms.

      At work, I witnessed a heated argument about a "VMS". It took several minutes before they realized they weren't talking about the same thing!

    15. Re:DCIM by Anonymous Coward · · Score: 0

      I do my online banking over Asynchronous Transfer Mode networks.

      At the moment, anyway.

    16. Re:DCIM by AK+Marc · · Score: 1

      Sounds like airport codes for new airports. All the "good" ones are taken, so the small and newer ones get letters that don't match the human name for it. George Bush Intercontinental Airport: IAH (likely a hold over for Intercontinental/International Airport of Houston, which was never its name)

    17. Re:DCIM by atom1c · · Score: 1

      The point is, make up a different acronym than one which is used ubiquitously in almost every computer related field.

      Yeah, that's pretty obvious when they stated, "I hate re-used acronyms." HOWEVER, DCIM referring to data center information management is NOT a new acronym/term/concept. It has been around since the dawn of data centers... which, arguably, predate the digital camera image standards. Thus, I would argue that associating the term DCIM with digital images confuses its initial usage related to data centers.

    18. Re:DCIM by ttucker · · Score: 1

      "Data Center Infrastructure Management (DCIM) is an emerging (2012) form of data center management which extends the more traditional systems and network management approaches to now include the physical and asset-level components. DCIM leverages the integration of information technology (IT) and facility management disciplines to centralize monitoring, management and intelligent capacity planning of a data center's critical systems. Essentially it provides a significantly more comprehensive view of ALL of the resources within the data center."

      Data centers predate digital cameras. That particular business buzzword acronym, for that particular business buzzword phrase, does not. I envision some manager looking at a DCIM dashboard somewhere with gauge images and stuff. It just seems like pretty blatant namespace pollution, even in a different domain.

  2. Server farmss by Anonymous Coward · · Score: 1

    Managed by Gollumses.

    1. Re:Server farmss by Anonymous Coward · · Score: 0

      Seems the /. editors are also stuck in a loop.

    2. Re:Server farmss by wonkey_monkey · · Score: 1

      They're still using Tolkien Ring networks.

      Thankyouverymuch!

      --
      systemd is Roko's Basilisk.
  3. System QoS by atom1c · · Score: 3, Insightful

    How often does the leap second bug recur? If it is known to occur, then why would such platforms be relied upon instead of patching it ahead of time?

    It seems to me that developing new DCIM solutions is a bit of a stretch to solve the leap second issue. Or is that just an excuse to fund new DCIM solutions (in other words, a solution in search of a problem)?

    1. Re:System QoS by fullmetal55 · · Score: 1

      I don't think the leap second bug recurring is the reason, but it revealed a gap in their management abilities. By actively monitoring everything, you can have baselines, notice when something is out of the ordinary, and help pinpoint the exact cause. It's possible they thought they were patched up, and whoops... Now with the new DCIM they can more accurately tell when server XYZ in datafarm B is running at 100% CPU and drawing more power than necessary, and maybe even disconnect it from the network and shut it down for repairs.

      So, the individual issue isn't the reason, but it was an epiphany moment of "hey, we need to prevent this from happening again"... not the same bug, not even the same symptoms. I applaud them for seeing a problem and finding a fix for it. If anything it shows there's still more innovation in Facebook than just more ways to serve up ads and to better target ads... and waste their customers' time ;)
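
The baseline idea in the comment above can be sketched in a few lines of Python. This is only an illustration: the function name `is_out_of_ordinary`, the three-sigma threshold, and the sample readings are all assumptions made for the example, not anything Facebook has described.

```python
from statistics import mean, stdev


def is_out_of_ordinary(history, current, sigmas=3.0):
    """Return True if `current` deviates from the baseline in `history`.

    `history` is a list of past readings (e.g. CPU percent or watts) for
    one host. The 3-sigma default is an arbitrary choice for illustration.
    """
    if len(history) < 2:
        # Not enough data to form a baseline, so do not raise an alarm.
        return False
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is unusual.
        return current != baseline
    return abs(current - baseline) > sigmas * spread


# Hypothetical CPU readings (percent) for one host.
cpu_history = [30, 32, 29, 31, 30, 33, 28]
print(is_out_of_ordinary(cpu_history, 100))  # a host stuck in a loop
print(is_out_of_ordinary(cpu_history, 31))   # a normal reading
```

A real system would also need to handle seasonality and slow drift, which this sketch ignores.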

    2. Re:System QoS by Anonymous Coward · · Score: 0

      Leap seconds are a huge pain in the ass; when you're reliant on accurate timing, they break things in all kinds of ways. One of the previous systems I worked on required weeks of development and testing and special firmware for the GPS receiver to ensure it would survive a leap second. The most recent system I worked on survived the leap second, but all the users logged off because the software on their end thought the time was out of sync.

    3. Re:System QoS by buchner.johannes · · Score: 1

      How often does the leap second bug recur?

      When there is a leap second. I think the last time this bug was covered on Slashdot, the article said it occurred on a sizeable number of servers in 2012, and on several in 2013.

      If it is known to occur, then why would such platforms be relied upon instead of patching it ahead of time?

      Are you seriously asking why not all systems are up to date? We are talking routers, mainframes, computers running legacy software, ...
      You cannot just update everything. If you are a business, updates have to be tested to make sure the software still runs. And if it does not, or if your distribution is not releasing an update, you are limited by resources. This should be obvious.

      It seems to me that developing new DCIM solutions is a bit of a stretch to solve the leap second issue. Or is that just an excuse to fund new DCIM solutions (in other words, a solution in search of a problem)?

      I assume it is a form of fall-back. In case yet-unknown bugs of the same class occur (servers go into a loop), the problem can be detected. Restarting helps.

      --
      NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
    4. Re:System QoS by Anonymous Coward · · Score: 0

      How often does the leap second bug recur?

      So far 25 leap seconds have been added. The first one was added at the end of June 1972. The last one (so far) was added at the end of June 2012.

      If it is known to occur, then why would such platforms be relied upon instead of patching it ahead of time?

      It isn't predictable, but when it needs to happen, IERS sends out a notice about 6 months in advance. That doesn't leave much time for patching and testing.

    5. Re:System QoS by Anonymous Coward · · Score: 0

      Leap seconds can be introduced into the stream twice a year.

      http://www.nist.gov/pml/div688/grp50/leapsecond.cfm

      There will be no leap second modification for December 2013.

    6. Re:System QoS by Anonymous Coward · · Score: 0

      That doesn't leave much time for patching and testing.

      So wait... you don't test for a condition you *know* occurs about every 18 months on average?

    7. Re:System QoS by tconnors · · Score: 4, Informative

      How often does the leap second bug recur?

      That one? Once. I've seen plenty of different leap second bugs. Too many: leap seconds should be a relatively easy calculation, but we only get to test them once every 3 years or so, and in real time, because it's kinda hard to convince a global timekeeping system that a fake leap second is about to happen for testing. Still, I'd rather we fixed the software than do stupid things like get rid of UTC like some idiots are proposing. But one that causes a futex loop in Java processes (and the Opera web browser)? Just the once, and mostly only on RHEL6 and Debian ~wheezy kernels at the time.

      If it is known to occur, then why would such platforms be relied upon instead of patching it ahead of time?

      The point of bugs is that they're not known to occur beforehand. This particular one was quite neat in that it wasn't the leap second code itself that was at fault, but the mechanism ntp used within the kernel to inform the kernel that a leap second was coming up. At least it didn't happen over the public holiday New Year period this time. I knew Monday was going to be a busy day in the datacentre when I saw my 3 laptops at home exhibit the problem on Sunday morning though.

      It seems to me that developing new DCIM solutions is a bit of a stretch to solve the leap second issue. Or is that just an excuse to fund new DCIM solutions (in other words, a solution in search of a problem)?

      Anything can cause a kernel or userland software to suddenly enter a hard loop burning through CPU cycles and thus power. And in a large homogeneous environment, that bug can be triggered in many locations all at the one exact moment in time. Another good example might be the RHEL6 bug that affected us around the same time last year - the old "uptime has reached a hundred and something days, let's overflow a counter and kernel PANIC now!" bug. We found out about that bug after patching all of our systems, found out that it only applied to the version of the patch we managed to apply, and had to start planning to bring the next patching cycle forward (but at least we knew about it). You'd think these were the kinds of bugs that we learnt about in 1995 and that we were never stupid enough to put back into the kernel, but it seems every generation must learn about it for themselves instead of reading their Operating System text books.

      The point of these bugs is that anything might cause a large fraction of your machines to start chewing through electricity. In an overprovisioned environment (VMs, power, thin storage, whatever), you want to know about them before you trip your fuses, run out of memory, or fill up all your disks.
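
As a rough illustration of "knowing before you trip your fuses", the sketch below sums per-host power draw against a provisioned capacity and lists hosts that look stuck at full CPU. Everything here is hypothetical: the `HostSample` fields, the 95% CPU threshold, the 90% power alert level, and the sample numbers are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class HostSample:
    name: str
    cpu_percent: float
    watts: float


def fleet_report(samples, capacity_watts, cpu_threshold=95.0,
                 power_alert_fraction=0.9):
    """Summarise power headroom and pegged hosts for one power domain."""
    total_watts = sum(s.watts for s in samples)
    pegged = [s.name for s in samples if s.cpu_percent >= cpu_threshold]
    return {
        "total_watts": total_watts,
        "headroom_watts": capacity_watts - total_watts,
        "pegged_hosts": pegged,
        "pegged_fraction": len(pegged) / len(samples) if samples else 0.0,
        # Alert once draw reaches the chosen fraction of provisioned capacity.
        "power_alert": total_watts >= power_alert_fraction * capacity_watts,
    }


samples = [
    HostSample("a", 100.0, 300.0),
    HostSample("b", 20.0, 150.0),
    HostSample("c", 99.0, 310.0),
    HostSample("d", 15.0, 140.0),
]
report = fleet_report(samples, capacity_watts=950.0)
print(report)
```

In a homogeneous fleet, a sudden jump in `pegged_fraction` across many hosts at once is the signature of the kind of simultaneous bug described above.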

    8. Re:System QoS by Anonymous Coward · · Score: 0

      ...waste their customers' time...

      Please state the productive functions of facebook.

  4. an adticle from Facebook, this time? by Joining+Yet+Again · · Score: 1

    I don't get the point here? What is Facebook doing that's new for a datacentre?

    1. Re:an adticle from Facebook, this time? by Anonymous Coward · · Score: 0

      Slashdot: News for nerds! Software company writes some software, film at 11.

  5. So what? by Anonymous Coward · · Score: 0

    What exactly is earth-shatteringly new about this? Facebook develops some software? Or somebody develops yet-another server monitoring suite of software?

    Wow! Somebody has developed some software? Really?! Goodness me, what will somebody do next? Reinvent the wheel - yet again?

  6. Server farmss by Anonymous Coward · · Score: 0

    The serverses! My preciousss!

  7. Re:What triggered the bug anyway? by rduke15 · · Score: 1

    When I heard last year that there may have been problems with the leap second, I checked the few Linux servers I take care of, and all seemed to be fine. They sync their time to NTP servers.

    What was that problem, anyway? Or did it only affect some very busy servers? Or only in some very special circumstances? Last year's leap second wasn't anything really new either. There had been occasional leap seconds for many years. (But usually on Dec 31).

  8. Re:What triggered the bug anyway? by 0123456 · · Score: 3, Interesting

    That was the one that caused Java processes to run away and use 100% CPU, wasn't it? From what I remember, it was only in a small subset of recent kernels, and older ones were fine.

  9. Why does my camera... by macraig · · Score: 1

    ... have Data Center Infrastructure Management? At least now I know what the name of that subfolder means. Is this another NSA thing, is the NSA or Facebook snarfing my photos right off the camera?

  10. Leap Seconds Are Old News by DERoss · · Score: 4, Informative

    Before 1972, "leaps" were fractions of a second; a UTC second (Coordinated Universal Time) did not have the same duration as a TAI second (the French acronym for International Atomic Time); and "leaps" occurred as often as four times a year. The current form of leap-seconds has been in effect since 1972. By then, software (mostly on mainframes) handled leap-seconds quite easily.

    The reason for leap-seconds is that the earth's rotation is gradually slowing while many critical operations require precise time indicators. Thus, noon at Greenwich -- even average noon, which takes into account annual and semi-annual variations in the earth's rotation -- cannot be used. Instead, those critical operations use TAI. TAI is a uniform, never-varying time system while UTC is coordinated with noon at Greenwich. Since 1972, however, a UTC second has exactly the same duration as a TAI second; and a UTC clock ticks its seconds exactly at the same time as a TAI clock. If this continued indefinitely, noon on a UTC clock would gradually deviate from noon at Greenwich. Since 1972, if the deviation approaches a whole second, an extra second -- a leap-second -- is added to a UTC clock at the end of the last minute of either 30 June or 31 December.

    All this became a problem in 2006. During the 7 years from 1 January 1999 until 1 January 2006, the slowing of the earth's rotation was so slight that there were no leap-seconds. Too many young software engineers and other technologists failed to learn about leap-seconds and thus ignored them (just as the Y2K issue was ignored until it was almost too late). A situation that was handled quite well in the 1970s, 1980s, and 1990s was no longer handled at all in new systems. But on 1 January 2006, there was indeed a leap-second. By then, many of those who were familiar with leap-seconds and how to handle them had retired (including me).
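
The TAI/UTC bookkeeping described above can be written down as a small lookup. This is a sketch under stated assumptions: the date table is transcribed from the published IERS leap-second history through mid-2012 and should be checked against the official list before being relied on, the 10-second starting offset is the TAI-UTC difference at 1 January 1972, and the function name `tai_minus_utc` is made up for this example.

```python
from datetime import date

# Days at the end of which a leap second was inserted, 1972 through 2012.
# Transcribed for illustration; verify against the official IERS list.
LEAP_SECOND_DATES = [
    date(1972, 6, 30), date(1972, 12, 31), date(1973, 12, 31),
    date(1974, 12, 31), date(1975, 12, 31), date(1976, 12, 31),
    date(1977, 12, 31), date(1978, 12, 31), date(1979, 12, 31),
    date(1981, 6, 30), date(1982, 6, 30), date(1983, 6, 30),
    date(1985, 6, 30), date(1987, 12, 31), date(1989, 12, 31),
    date(1990, 12, 31), date(1992, 6, 30), date(1993, 6, 30),
    date(1994, 6, 30), date(1995, 12, 31), date(1997, 6, 30),
    date(1998, 12, 31), date(2005, 12, 31), date(2008, 12, 31),
    date(2012, 6, 30),
]

INITIAL_OFFSET_SECONDS = 10  # TAI - UTC on 1 January 1972


def tai_minus_utc(day):
    """Whole seconds by which TAI is ahead of UTC at the start of `day`.

    Only meaningful from 1 January 1972 up to the end of this table.
    """
    inserted = sum(1 for d in LEAP_SECOND_DATES if d < day)
    return INITIAL_OFFSET_SECONDS + inserted


print(len(LEAP_SECOND_DATES))           # number of leap seconds in the table
print(tai_minus_utc(date(1999, 1, 1)))  # start of the seven-year gap
print(tai_minus_utc(date(2006, 1, 1)))  # just after the gap ended
print(tai_minus_utc(date(2012, 7, 1)))  # just after the 2012 leap second
```

The table makes the seven-year gap visible: nothing between the end of 1998 and the end of 2005.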

    1. Re:Leap Seconds Are Old News by delt0r · · Score: 2

      What I don't get is what really breaks with a one-second error or jump. I often just do the ntpdate thing and my clocks are shifted a lot more than a second. As long as I am not compiling, I haven't had any issues.

      I understand some secure protocols need accurate global and difficult to forge time. But outside that? I mean so what if the time on a wall post is out by a second?

      --
      If information wants to be free, why does my internet connection cost so much?
    2. Re:Leap Seconds Are Old News by mattack2 · · Score: 1

      Too many young software engineers and other technologists failed to learn about leap-seconds and thus ignored them

      That's fine. Zuckerberg thinks it's a good thing to "break stuff" and uses that as a slogan.

  11. facebook and digital cameras images? by MadMaverick9 · · Score: 2

    The filesystem in a digital camera contains a DCIM (Digital Camera IMages) directory.

    Can y'all stop re-using abbreviations, please.

  12. Google did this with NTP "leap smear" by Anonymous Coward · · Score: 1

    Time, technology and leaping seconds

    The solution we came up with came to be known as the "leap smear." We modified our internal NTP servers to gradually add a couple of milliseconds to every update, varying over a time window before the moment when the leap second actually happens. This meant that when it became time to add an extra second at midnight, our clocks had already taken this into account, by skewing the time over the course of the day. All of our servers were then able to continue as normal with the new year, blissfully unaware that a leap second had just occurred. We plan to use this "leap smear" technique again in the future, when new leap seconds are announced by the IERS.
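
A minimal sketch of the smearing idea quoted above, assuming a simple linear ramp over a 24-hour window that ends at the leap second. The linear shape, the window length, and the function names are assumptions for illustration; the quoted text does not specify the curve Google's NTP servers actually used.

```python
def smear_fraction(seconds_until_leap, window=86400.0):
    """Fraction of the extra second already absorbed, from 0.0 to 1.0.

    `seconds_until_leap` is the true time remaining until the leap second.
    Before the window opens nothing has been absorbed; at or after the leap
    the full second has been.
    """
    if seconds_until_leap >= window:
        return 0.0
    if seconds_until_leap <= 0:
        return 1.0
    return (window - seconds_until_leap) / window


def smeared_lag(seconds_until_leap, window=86400.0):
    """Seconds by which the smeared clock lags an unsmeared UTC clock.

    The lag grows to exactly one second, so the smeared clock never needs
    to show 23:59:60 or step backwards.
    """
    return 1.0 * smear_fraction(seconds_until_leap, window)


for remaining in (90000.0, 86400.0, 43200.0, 0.0):
    print(remaining, smeared_lag(remaining))
```

Because the adjustment per update is tiny, clients see an ordinary, slightly slow clock rather than a repeated or extra second.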