How the Leap Second Bug Led Facebook To Build DCIM Tools
miller60 writes "On July 1, 2012 the leap second time-handling bug caused many Linux servers to get stuck in a loop. Large data centers saw power usage spike, sometimes by megawatts. The resulting "server storm" prompted Facebook to develop new software for data center infrastructure management (DCIM) to manage its infrastructure, providing real-time data on everything from the servers to the generators. The incident also offered insights into the value of flexible power design in its server farmss, which kept the status updates flowing as the company nearly maxed out its power capacity."
My digital camera already has DCIM tools (as does the computer I plug it in to). I hate re-used acronyms.
Learn to love Alaska
Managed by Gollumses.
How often does the leap second bug recur? If it is known to occur, then why would such platforms be relied upon instead of patching it ahead of time?
It seems to me that developing new DCIM solutions is a bit of a stretch to solve the leap second issue. Or is that just an excuse to fund new DCIM solutions (in other words, a solution in search of a problem)?
I don't get the point here? What is Facebook doing that's new for a datacentre?
What exactly is earth-shatteringly new about this? Facebook develops some software? Or somebody develops yet-another server monitoring suite of software?
Wow! Somebody has developed some software? Really?! Goodness me, what will somebody do next? Reinvent the wheel - yet again?
The serverses! My preciousss!
When I heard last year that there may have been problems with the leap second, I checked the few Linux servers I take care of, and all seemed to be fine. They sync their time to NTP servers.
What was that problem, anyway? Or did it only affect some very busy servers? Or only in some very special circumstances? Last year's leap second wasn't anything really new either. There had been occasional leap seconds for many years. (But usually on Dec 31).
That was the one that caused Java processes to run away and use 100% CPU, wasn't it? From what I remember, it was only in a small subset of recent kernels, and older ones were fine.
... have Data Center Infrastructure Management? At least now I know what the name of that subfolder means. Is this another NSA thing, is the NSA or Facebook snarfing my photos right off the camera?
Before 1972, "leaps" were fractions of a second; a UTC second (Coordinated Universal Time) did not have the same duration as a TAI second (the French acronym for International Atomic Time); and "leaps" occurred as often as four times a year. The current form of leap-seconds has been in effect since 1972. By then, software (mostly on mainframes) handled leap-seconds quite easily.
The reason for leap-seconds is that the earth's rotation is gradually slowing while many critical operations require precise time indicators. Thus, noon at Greenwich -- even average noon, which takes into account annual and semi-annual variations in the earth's rotation -- cannot be used. Instead, those critical operations use TAI. TAI is a uniform, never-varying time system while UTC is coordinated with noon at Greenwich. Since 1972, however, a UTC second has exactly the same duration as a TAI second; and a UTC clock ticks its seconds exactly at the same time as a TAI clock. If this continued indefinitely, noon on a UTC clock would gradually deviate from noon at Greenwich. Since 1972, if the deviation approaches a whole second, an extra second -- a leap-second -- is added to a UTC clock at the end of the last minute of either 30 June or 31 December.
All this became a problem in 2006. During the 7 years from 1 January 1999 until 1 January 2006, the slowing of the earth's rotation was so slight that there were no leap-seconds. Too many young software engineers and other technologists failed to learn about leap-seconds and thus ignored them (just as the Y2K issue was ignored until it was almost too late). A situation that was handled quite well in the 1970s, 1980s, and 1990s was no longer handled at all in new systems. But at the end of 31 December 2005, just before 1 January 2006 began, there was indeed a leap-second. By then, many of those who were familiar with leap-seconds and how to handle them had retired (including me).
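For anyone who wants the bookkeeping above in concrete form, here is a minimal Python sketch. It is only an illustration: the function names and the table layout are made up for this example, and the table is deliberately partial (it starts at 1 January 1999, so earlier dates are rejected). It looks up TAI minus UTC for a given UTC day and, from that, reports whether the day is 86400 or 86401 seconds long.

```python
from bisect import bisect_right
from datetime import date, timedelta

# Partial table (assumption: starts in 1999 for brevity).
# Each entry: (UTC date the offset takes effect, TAI - UTC in seconds).
LEAP_TABLE = [
    (date(1999, 1, 1), 32),
    (date(2006, 1, 1), 33),  # first leap second after the 7-year gap
    (date(2009, 1, 1), 34),
    (date(2012, 7, 1), 35),  # the leap second discussed in this story
    (date(2015, 7, 1), 36),
    (date(2017, 1, 1), 37),
]


def tai_minus_utc(day):
    """Return TAI - UTC in whole seconds for a UTC calendar day."""
    starts = [start for start, _ in LEAP_TABLE]
    i = bisect_right(starts, day) - 1
    if i < 0:
        raise ValueError("date is before the first entry in this partial table")
    return LEAP_TABLE[i][1]


def seconds_in_utc_day(day):
    """Return 86400 normally, 86401 if the day ends with a leap second."""
    next_day = day + timedelta(days=1)
    return 86400 + tai_minus_utc(next_day) - tai_minus_utc(day)


if __name__ == "__main__":
    for d in (date(2005, 12, 31), date(2012, 6, 30), date(2012, 7, 1)):
        print(d, tai_minus_utc(d), seconds_in_utc_day(d))
```

The point of the second function is that a leap-second day simply has one more second in it (23:59:60), which is exactly the case that software assuming 86400-second days never exercises.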
The filesystem in a digital camera contains a DCIM (Digital Camera IMages) directory.
Can y'all stop re-using abbreviations, please.
Time, technology and leaping seconds