Slashdot Mirror


CloudFlare Was Hit By Leap Second, Causing Its RRDNS Software To 'Panic' (silicon.co.uk)

Reader Mickeycaskill writes: The extra leap second added on to the end of 2016 may not have had an effect on most people, but it did catch out a few web companies who failed to factor it in. Web services and security firm CloudFlare was one such example. A small number of its servers went down at midnight UTC on New Year's Day due to an error in its RRDNS software, a domain name service (DNS) proxy that was written to help scale CloudFlare's DNS infrastructure, which limited web access for some of its customers. As CloudFlare explained, a number went negative in the software when it should have been zero, causing RRDNS to "panic" and affect the DNS resolutions to some websites. The issue was confirmed by the company's engineers at 00:34 UTC on New Year's Day and the fix -- which involved patching the clock source to ensure it normalises if time ever skips backwards -- was rolled out to the majority of the affected data centres by 02:50 UTC. Cloudflare said the outage only hit customers who use CNAME DNS records with its service. Google works around leap seconds with a so-called "smearing" technique -- running clocks slightly slower than usual on its Network Time Protocol servers.
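As a rough illustration of the "smearing" idea mentioned in the summary, the sketch below spreads a single leap second evenly over a window instead of inserting it all at once. The linear shape, the 20-hour window, and the name `smearOffset` are assumptions made for illustration; the summary does not describe how Google's servers actually compute the adjustment.

```go
package main

import (
	"fmt"
	"time"
)

// smearOffset returns how much of one leap second has been absorbed
// after `elapsed` time inside a smear window of length `window`.
// A linear smear is assumed here purely for illustration.
func smearOffset(elapsed, window time.Duration) time.Duration {
	if window <= 0 || elapsed <= 0 {
		return 0
	}
	if elapsed >= window {
		return time.Second
	}
	// Fraction of the window completed, applied to one second.
	frac := float64(elapsed) / float64(window)
	return time.Duration(float64(time.Second) * frac)
}

func main() {
	window := 20 * time.Hour // hypothetical window length
	for _, h := range []int{0, 5, 10, 20} {
		fmt.Println(h, smearOffset(time.Duration(h)*time.Hour, window))
	}
}
```

Because the adjustment is spread thinly, clients following such a server never see a clock reading that repeats or moves backwards.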

52 of 119 comments

  1. Was the Go prog lang at fault? Would Rust help? by Anonymous Coward · · Score: 1, Flamebait

    The blog post about this incident says:

    RRDNS is written in Go and uses Go’s time.Now() function to get the time. Unfortunately, this function does not guarantee monotonicity. Go currently doesn’t offer a monotonic time source (see issue 12914 for discussion).

    and then later it says:

    When RRDNS selects an upstream to resolve a CNAME it uses a weighted selection algorithm. The code takes the upstream time values and feeds them to Go’s rand.Int63n() function. rand.Int63n promptly panics if its argument is negative. That's where the RRDNS panics were coming from.

    So to me it sounds like this incident was at least partially due to limitations with the Go programming language and its libraries.

    Would this incident still have happened if this software were written in the Rust programming language?
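A minimal sketch of guarding a latency measurement against a wall clock that steps backwards, assuming the measurement is the difference between two clock readings. `elapsedNonNegative` and `demo` are hypothetical names, not Cloudflare's code. Go releases from 1.9 onward also attach a monotonic reading to values returned by time.Now(), which addresses the limitation quoted above for durations computed between two such values.

```go
package main

import (
	"fmt"
	"time"
)

// elapsedNonNegative returns end-start, clamped to zero when the wall
// clock appears to have gone backwards between the two readings.
func elapsedNonNegative(start, end time.Time) time.Duration {
	d := end.Sub(start)
	if d < 0 {
		return 0
	}
	return d
}

// demo simulates one backwards step and one normal measurement.
func demo() (backwards, forwards time.Duration) {
	start := time.Date(2017, 1, 1, 0, 0, 0, 0, time.UTC)
	backwards = elapsedNonNegative(start, start.Add(-500*time.Millisecond))
	forwards = elapsedNonNegative(start, start.Add(30*time.Millisecond))
	return backwards, forwards
}

func main() {
	b, f := demo()
	fmt.Println(b, f)
}
```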

    1. Re:Was the Go prog lang at fault? Would Rust help? by Anonymous Coward · · Score: 3, Insightful

      I don't know if you can blame the language, the devs should have added their own checks if the language didn't have a guarantee.

    2. Re:Was the Go prog lang at fault? Would Rust help? by Anonymous Coward · · Score: 3, Insightful

      Why would you even think of switching programming languages because of the simple and sadly common 'bug' of programmers not verifying that parameters match a function's documented pre-conditions? My only guess is you're paid to promote Rust. Lazy programmers will write bugs in every language.

    3. Re:Was the Go prog lang at fault? Would Rust help? by scamper_22 · · Score: 1

      Part of the fault can go to the Go programming language for their API design.

      But most of the blame goes to the developers.
      I haven't coded in Go, but I googled this quickly.
      https://golang.org/pkg/math/ra...
      The Go documentation clearly says it panics if n = 0.

      They could have
      1. validated their inputs, or
      2. handled the panic and assigned a default value (I am assuming this is possible in Go; I have never used it).

      In the end, it seems like this is just used to distribute requests. Worst case, it should log the error and then assign say the 1st upstream (default value).

      But I guess then you're in the exception handling debate on whether you swallow the error and keep going or have your application crash so that you detect the weird condition.

      I'm a defensive, keep-the-system-going developer.
      But others prefer to be more exact.
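Handling the panic and assigning a default is indeed possible in Go, via `defer` and `recover`. The sketch below is one way to do it; `pickOffset` and its `def` parameter are hypothetical names, and whether swallowing the panic is the right policy is exactly the debate described above.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickOffset wraps rand.Int63n and falls back to def if the call
// panics, which math/rand does when n <= 0.
func pickOffset(n, def int64) (v int64) {
	defer func() {
		if r := recover(); r != nil {
			v = def // swallow the panic and use the default
		}
	}()
	return rand.Int63n(n)
}

func main() {
	fmt.Println(pickOffset(-5, 0)) // invalid argument: falls back to 0
	fmt.Println(pickOffset(10, 0)) // valid argument: some value in [0, 10)
}
```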

    4. Re:Was the Go prog lang at fault? Would Rust help? by Obfuscant · · Score: 4, Insightful

      The Go documentation clearly says it panics if n = 0.

      And it says it panics if n is less than 0.

      If you write a library function that requires positive input always, and returns positive output always, then use unsigned input and output variables. A good compiler will flag the attempt at sending such a function a signed input as a warning at least. Pedantic compilers will fail -- better than the production program failing.

      And while it seems stupid, the proper action when asked for a random number between 0 and 0 is to return 0, not panic. (I believe the [ on the range means "including", but I could be wrong. If it didn't mean "including", then the documentation should be '(1,n)'.)

      But then, the test cases for the DNS code should have included 0 and negative, so this should have been caught when the function was tested.

    5. Re:Was the Go prog lang at fault? Would Rust help? by scamper_22 · · Score: 1

      Yep.

      I meant to type n (less than or equal to) 0.
      Not sure if slashdot escaped it or something.

      In any case, yes, the Go API was not the best taking in a signed int when negatives are invalid.

    6. Re:Was the Go prog lang at fault? Would Rust help? by WaffleMonster · · Score: 2

      I don't know if you can blame the language, the devs should have added their own checks if the language didn't have a guarantee.

      Given that math/rand is part of the standard Go library, and that more rigorous compile-time checking would have prevented this, blaming the language seems like a no-brainer.

    7. Re:Was the Go prog lang at fault? Would Rust help? by Waffle+Iron · · Score: 1

      So to me it sounds like this incident was at least partially due to limitations with the Go programming language and its libraries.

      For now, you could use a platform-specific workaround. (Just like you would have to do if you were coding in C). For Linux:

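      import "syscall"

      // Uptime returns the number of seconds since boot (Linux only).
      func Uptime() (int64, error) {
          var si syscall.Sysinfo_t
          err := syscall.Sysinfo(&si)
          // The Uptime field is narrower than int64 on some architectures.
          return int64(si.Uptime), err
      }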

      I'm too lazy to look up whether Windows has a similar feature.

    8. Re:Was the Go prog lang at fault? Would Rust help? by Anonymous Coward · · Score: 1

      Part of the fault can go to the Go programming language for their API design.

      Indeed - a random number function that panics. Now that is useful.

      You want servers to keep running. The better approach would be to return 0 and, in the interest of debugging, spam the syslog with "bad parameter -x to rand.Int63n()". Complaints put some pressure on devs to fix things.
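The "return 0 and complain loudly" approach can be sketched as a thin wrapper around the standard function. `safeInt63n` is a hypothetical name, and the standard `log` package stands in for syslog here.

```go
package main

import (
	"log"
	"math/rand"
)

// safeInt63n validates n before calling rand.Int63n. For n <= 0 it
// logs the bad parameter and returns 0 instead of letting the call panic.
func safeInt63n(n int64) int64 {
	if n <= 0 {
		log.Printf("bad parameter %d to rand.Int63n()", n)
		return 0
	}
	return rand.Int63n(n)
}

func main() {
	log.Println(safeInt63n(-3))  // logs the complaint, then 0
	log.Println(safeInt63n(100)) // some value in [0, 100)
}
```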

  2. My internet died... by ckatko · · Score: 3, Funny

    ...at exactly midnight, while I was playing Chivalry. I kept getting laggier... and laggier... and then everyone "froze" and the client-side prediction took over. I was recording video and it was pretty funny. Everyone just kept walking forward, until they were in a wall, and kept trying to walk forwards.

    It was interesting what the client prediction would let you do. You could change weapons. You could swing your weapon. You could throw axes (of which you have two) and they flew through the air, stuck in people, and even knocked helmets off. BUT, your axe counter never actually decreased. So you could just keep throwing hundreds of axes. The animation timings / speeds were unaffected. You couldn't "chant" or grunt. You obviously couldn't damage anyone.

    Anyway, my internet was down until the next morning and even then, it still required a cable modem reset to fix the connection.

    1. Re:My internet died... by Falos · · Score: 1

      Specifics vary, but some games have interesting latency tolerances that they're willing to resolve without a Nope shrug (rejection). Mounted WoW players often glide past "no fly zone" triggers, sometimes remaining airborne long enough to reach the unreachable. Usually inconsequential places, by design.

      PoGo players (and Ingress, I'd guess) have been known to capture indoor critters by dashing at building walls in the meatspace, then huddling over their phone to block signal. The game extrapolates position, and they can trigger a now-in-range critter engagement client-side. They resume the signal, and the server resolves it accepted.

      These are amusing examples, but if I was big on FPSs or fighters I'd probably be more bitchy and whining about gritty specifics. I know I didn't enjoy having to run in front of people to cast Backstab in WoW.

      OT: I've heard that, on the whole, these by-hand NTP solutions actually end up working out pretty good all around. I'm too ignorant to have my own opinion.

    2. Re:My internet died... by ckatko · · Score: 1

      I've actually been researching network game architecture lately and I was actually planning on doing some video-recorded analysis of various commercial-game network models when latency, jitter, out-of-order, and other errors occur. Extreme latency is a great way to "reveal" what's going on under-the-hood.

      So this time, I got video and I didn't even have to set up artificial lag!

    3. Re:My internet died... by antdude · · Score: 1

      Do you still have that video recorded to upload and share with us? ;)

      --
      Ant(Dude) @ Quality Foraged Links (AQFL.net) & The Ant Farm (antfarm.ma.cx / antfarm.home.dhs.org).
    4. Re: My internet died... by buchanmilne · · Score: 1

      "Anyway, my internet was down until the next morning and even then, it still required a cable modem reset to fix the connection."

      Some network equipment vendors sent out field notices about 2 weeks in advance of the leap second, recommending operators to use leap-second smearing (as implemented in chronyd for example) if they had affected versions of network device firmware deployed that could crash as a result.

      (We didn't have affected versions deployed, and it would have been non-trivial - at this time of year - to get all our NTP servers upgraded. It's not recommended to use non-smearing and smearing NTP sources on the same device.)

    5. Re:My internet died... by ckatko · · Score: 1

      Since you asked, sure.

      It'll take awhile to upload the 4K video (2.75 GB for a mere 2.5 minutes) and YouTube to reprocess it. But the link, when live, will be here:

      https://youtu.be/KIeM1y9S5Mo

    6. Re:My internet died... by antdude · · Score: 1

      Wow, can't you encode to smaller video before uploading? Do we really need to see it in 4K? Hahaha.

      --
      Ant(Dude) @ Quality Foraged Links (AQFL.net) & The Ant Farm (antfarm.ma.cx / antfarm.home.dhs.org).
  3. the gift that keeps taking by Thud457 · · Score: 1

    2016 says "Hi, remember me, beeotches?!!"

    --

    the preceding comment is my own and in no way reflects the opinion of the Joint Chiefs of Staff

  4. Do we really need to compensate? by Anonymous Coward · · Score: 1

    We lose or gain a second here or there, who cares? The difference has been so far 27 seconds over the past 44 years or, extrapolated out, 1 MINUTE over 97 YEARS.
    Are we really going to notice if the sun goes down a minute earlier every century? We already have to screw around with daylight savings & leap years why not just make February the 29th 24 hours and 1 minute long once a century and have done with it.

  5. Re:Unit test those edge cases by unrtst · · Score: 4, Interesting

    Read the article then. It shows it pretty plainly: https://blog.cloudflare.com/ho...
    I was going to try to guess what they were doing, but they have some actual code snippets.

    AFAICT, a unit test wouldn't have caught this either (unless they planned for this sort of error, in which case the code wouldn't have been broken either). From TFA:

    RRDNS doesn’t just keep a single measurement for each resolver, it takes many measurements and smoothes them. So, the single measurement wouldn’t cause RRDNS to think the resolver was working in negative time, but after a few measurements the smoothed value would eventually become negative.

    So, a unit test with one negative example (which may have been difficult to mimic anyway, due to the direct usage of time.Now()) probably wouldn't have triggered the issue on its own.
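To see how a smoothed value can stay positive after one bad sample yet turn negative after a few, here is a small sketch. It assumes the smoothing is an exponentially weighted moving average, which the quoted article text does not confirm, and the weight and sample values are made up for illustration.

```go
package main

import "fmt"

// smooth applies one step of an exponentially weighted moving average.
// alpha is the weight given to the newest sample (hypothetical value).
func smooth(prev, sample, alpha float64) float64 {
	return alpha*sample + (1-alpha)*prev
}

// samplesUntilNegative returns the 1-based index of the first sample
// that drives the smoothed value below zero, or 0 if none does.
func samplesUntilNegative(start float64, samples []float64, alpha float64) int {
	v := start
	for i, s := range samples {
		v = smooth(v, s, alpha)
		if v < 0 {
			return i + 1
		}
	}
	return 0
}

func main() {
	// A smoothed RTT of 20 ms followed by samples of -50 ms each.
	bad := []float64{-50, -50, -50, -50}
	fmt.Println(smooth(20, -50, 0.125))                // still positive after one sample
	fmt.Println(samplesUntilNegative(20, bad, 0.125)) // index of the sample that tips it over
}
```

A test that feeds in a single negative measurement would pass under these numbers, which matches the point above about why a simple unit test would probably have missed the bug.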

    IMHO, blaming a misconception of time always going forward is just convenient here. The fix was changing this bit:
    if rttMax == 0 {
          rttMax = DefaultTimeout
    }

    They just changed "==" to "<=". There was no reason not to have it as "<=" to begin with, even if one ignores where rttMax comes from. Any time I check if something is == to something else, and I don't have else conditions covering the other cases, I ask myself what should happen in those other else cases and ensure I'm covered. That may still have caused it to break, but it could have done:
    if rttMax == 0 {
          rttMax = DefaultTimeout
    } else if rttMax < 0 {
          panic("What the fuck happened to rttMax to make it negative!?!")
    }
    ...though it probably would have been better to just log that somewhere and set it to the DefaultTimeout.
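The "log it and fall back to the default" variant might look like the sketch below. The type of `rttMax`, the value of `DefaultTimeout`, and the helper name `normalizeRTTMax` are assumptions; the thread only shows the comparison itself.

```go
package main

import (
	"log"
	"time"
)

// DefaultTimeout is a placeholder; the real value is not given in the thread.
const DefaultTimeout = 2 * time.Second

// normalizeRTTMax treats zero and negative values alike, logging the
// negative case so the odd condition is still visible to operators.
func normalizeRTTMax(rttMax time.Duration) time.Duration {
	if rttMax < 0 {
		log.Printf("rttMax is negative (%v); using DefaultTimeout", rttMax)
	}
	if rttMax <= 0 {
		return DefaultTimeout
	}
	return rttMax
}

func main() {
	log.Println(normalizeRTTMax(-15 * time.Millisecond))
	log.Println(normalizeRTTMax(0))
	log.Println(normalizeRTTMax(40 * time.Millisecond))
}
```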

    Anyway, I think it's a great example of a one character bug that only triggers on very obscure events under significant load.

  6. Re:Unit test those edge cases by skids · · Score: 4, Funny

    I'm still left wondering whether the decision to put a leap second on the night tech support staff are most likely to be over halfway through a bottle of JD was A) some intentional attempt to catch edge cases where leap seconds happen during a year change or B) some tinfoil conspiracy where we'll find out billions of dollars were stolen from a system where that particular edge case could be exploited or C) just made by people so socially isolated that they don't realize just how hard it is to fix crashed boxen over a crappy 3G connection in a dive bar bathroom using a phone covered in some chick's vomit while trying to keep down that pretzel you just washed down with sparkling water.

  7. Echoes of time changes days gone by by Archfeld · · Score: 1

    I always remember time changes as busy nights in support when I worked for a large bank. The spring forward was usually a breeze, just a matter of a lot of server verifications and log checks, but the fall back was usually a messy night. Much harder to deal with and resolve issues involving duplicate timed log entries and transaction logs. I don't really miss those days...

    --
    errr....umm...*whooosh* *whoosh* Is this thing on ?
    1. Re:Echoes of time changes days gone by by Anonymous Coward · · Score: 2, Interesting

      UUIDs are also fun to deal with. Especially with VM images that are copied by those who don't understand how you can have a duplicate UUID.

    2. Re:Echoes of time changes days gone by by caseih · · Score: 1

      So the bank's systems didn't store transaction times in UTC or some other timezone-neutral format? How did they deal with transactions originating from other time zones?

    3. Re:Echoes of time changes days gone by by Anonymous Coward · · Score: 2, Informative

      I've seen too many companies, even large multinational companies, who insist that their servers run on the HQ's local time to be surprised by this anymore.

      It ain't the IT folks doing it, I'll tell you that. The smart ones quietly say "fuck that", set the system clocks to UTC and then set a TZ environment variable everywhere.
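A short sketch of why UTC storage avoids the duplicate-timed log entries of a "fall back" night: two events an hour apart show the same local wall-clock time, while their UTC timestamps stay distinct. The fixed offsets stand in for US Eastern daylight and standard time on 2016-11-06, and `fallBackPair` is a hypothetical helper; a real system would use proper zone rules only when formatting for display.

```go
package main

import (
	"fmt"
	"time"
)

// fallBackPair returns the local wall-clock strings and the UTC
// timestamps of two events one hour apart across a fall-back change.
func fallBackPair() (wallA, wallB, utcA, utcB string) {
	edt := time.FixedZone("EDT", -4*60*60) // before the change
	est := time.FixedZone("EST", -5*60*60) // after the change

	first := time.Date(2016, 11, 6, 5, 30, 0, 0, time.UTC)
	second := first.Add(time.Hour)

	wallA = first.In(edt).Format("15:04")
	wallB = second.In(est).Format("15:04")
	utcA = first.Format(time.RFC3339)
	utcB = second.Format(time.RFC3339)
	return wallA, wallB, utcA, utcB
}

func main() {
	wallA, wallB, utcA, utcB := fallBackPair()
	fmt.Println("local:", wallA, wallB) // identical wall-clock readings
	fmt.Println("UTC:  ", utcA, utcB)   // distinct, ordered timestamps
}
```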

  8. Re: Unit test those edge cases by Anonymous Coward · · Score: 1

    Millennial idiot coders are too busy refactoring everything every three months to bother testing anything ever. Mature codebases are too old, dudebro, old is bad, mmkay.

  9. Goes to show the old adage is true by SuperKendall · · Score: 2

    Don't use services whose names are terribly ironic in times of failure.

    Flare, Flame, Burn, Drop, etc., etc.

    The universe just loves to throw a wrench at such forms of un-intentional hubris just for the LOLs.

    --
    "There is more worth loving than we have strength to love." - Brian Jay Stanley
  10. Re: Was the Go prog lang at fault? Would Rust help by fubarrr · · Score: 4, Insightful

    >RRDNS is written in Go

    Their bugs are in HR department.

    Who in the world hired people who are dumb enough to use an experimental language in production?

  11. Re: Was the Go prog lang at fault? Would Rust help by Streetlight · · Score: 1

    Could also be a financial consideration. The smart folks wanted too much money. Or, maybe an unpaid summer high school intern was the choice.

    --
    In a time of universal deceit, telling the truth is a revolutionary act. George Orwell
  12. Re:pure insanity by Highdude702 · · Score: 1

    you apparently have no idea how computers and the internet and encryption type shit works do you? time is very important, if it wasn't why would you do anything right?

  13. Re:Unit test those edge cases by Obfuscant · · Score: 1

    I'm still left wondering whether the decision to put a leap second on the night tech support staff are most likely to be (conspiracy options elided) ...

    Or most likely, the people who put the leap second where it was knew there was no perfect time to do it, and assumed that anyone who was writing software that was so time-critical that it cared if there was a leap second would properly handle the issue in their code.

    It's not their fault that some developers using an off-beat language that has a library that panics if a parameter is invalid (and was written so that there could BE an invalid parameter, which they could have avoided) didn't bounds check their parameters to such a function.

  14. The Leap Seconds issue is much bigger by GreaterNinja · · Score: 1

    The issue with leap seconds is much bigger than just Cloudflare. I’ve found there are differences among at least 4 types of time: Google Time (their unique version of NTP time), International Atomic Time (TAI), Coordinated Universal Time (UTC), and the time served by multiple NTP servers. Currently there is a difference of 37 seconds between International Atomic Time (TAI) and Coordinated Universal Time (UTC). https://www.timeanddate.com/ti... When I checked the time sync of time.windows.com against time.is, I noticed there was a ~33.4 second difference. Last I checked, there are hundreds of NTP servers that have out-of-sync times https://community.ntppool.org/ It seems a significant part of the world is out of sync and there is no absolute consensus on what the time should be.

  15. Re:Hosts hardcodes avoid DNS issues by Ash-Fox · · Score: 3, Insightful

    Do not trust APK's software, APK is a criminal , he is blatantly violating the Computer Fraud and Abuse Act by posting here as he is banned from Slashdot.

    APK's ban evasion has led to more restrictive filters being placed on Slashdot that hinder good discussions.

    --
    Change is certain; progress is not obligatory.
  16. Re:Hosts hardcodes avoid DNS issues by Ash-Fox · · Score: 1

    Stop with your criminal spam. You are violating the Computer Fraud and Abuse Act.

    --
    Change is certain; progress is not obligatory.
  17. Re:Unit test those edge cases by Aighearach · · Score: 2

    There aren't a billion edges, there is only an edge where your code establishes a limit.

    Some minutes have 60 seconds, some have other amounts. That doesn't cause an edge case. An edge case is caused by when your code assumes that the number of seconds has some specific value. So if in the code I say "if ( seconds assumes that the value will be valid, and they don't do something useful when the values are wrong. So they crash and burn. You want to either not care what the value is, in which case you don't want to even create an edge by testing it, or else make sure that you have a valid code path for all possible values. Did you establish an upper bound? You have to test what happens when the data exceeds it.

    Basic stuff, which is why when there is a leap second, just one piece of junky code stopped working and nothing else had any problem. There were probably large numbers of applications that actually have leap second bugs; careful log analysis might indicate that things that happened during the leap second were recorded as having happened at the start of that minute. So instead of crash-and-burn, all your things that would have happened at 23:59:60 would be listed as having happened at 23:59:00. That's because competent programmers do something useful when they get bad data instead of just crashing and burning.

  18. Re:Root cause of leap sec issues: Unix time by PPH · · Score: 1

    Unix time would be monotonically increasing and equivalent to TAI or GPS time with an integer seconds offset. Simple. Exact.

    Not a bad idea. But you still have to account for system clocks that drift. And need to be bumped a few seconds one way or the other periodically. If my correcting the system clock occasionally (either manually or via a cron job from an NTP server automatically) causes apps to blow chunks, these are bad apps.

    --
    Have gnu, will travel.
  19. Re: Was the Go prog lang at fault? Would Rust help by Anonymous Coward · · Score: 1

    Blaming the Language is like blaming a toaster for shocking someone in the bath because they felt a wee bit on the hungry side.

    Tools, just like features in languages, should not be made idiot proof.
    A warning is fine. In this case, someone clearly never checked the spec or put in their own check just for leap seconds. Doesn't need to be in the code forever, just that event then comment-out and recompile.
    Putting these checks in by default would add extra overhead. One extra check adds up even after a day. That's extra money down the line because some idiot wanted toast in a bath.

  20. Re: Was the Go prog lang at fault? Would Rust help by K.+S.+Kyosuke · · Score: 1

    A conservative programming language in a version numbered as 1.7 hardly fits any sane person's definition of "experimental".

    --
    Ezekiel 23:20
  21. Re:Ash-Fox "jailhouse lawyer" wrong (not banned) by Ash-Fox · · Score: 1

    Stop involving us in your crimes APK. You are violating the Computer Fraud and Abuse Act.

    --
    Change is certain; progress is not obligatory.
  22. Re:Root cause of leap sec issues: Unix time by PPH · · Score: 1

    a UNIX clock is guaranteed to never run backwards

    By design, yes. But I've manually reset the time forwards or backwards by many minutes on a few systems (using the date command). And I can't recall breaking anything. A few apps have raised warnings about objects being in the future, but nothing that a click on 'Continue Anyway' didn't fix. I guess I just don't use shitty apps.

    --
    Have gnu, will travel.
  23. Re:Unit test those edge cases by unrtst · · Score: 1

    So if in the code I say "if ( seconds assumes that the value ...

    Kinda amusing that your post is an example of unexpected (though well known) data causing an incorrect outcome. IE. slashdot ate part of your comment (I'm hoping that assumption is correct. Otherwise, your brain ate part of it). Sadly, we have to manually escape < (ie: &lt;) and friends here (and I have no idea what all must be escaped).

  24. If ... by cwsumner · · Score: 1

    "If Engineers built buildings the way Programmers write programs, the first woodpecker that came along would destroy civilization!"

    And if you did it that way because your "pointy-haired boss" said to, then it is still your fault... ;-)

  25. Re:If whipslash asks me to go I'm gone by Ash-Fox · · Score: 1

    APK committing criminal activity under the Computer Fraud and Abuse Act

    Respectfully, stop involving me and others in your criminal activities. You're ruining Slashdot with your illegal spam posts and illegal comments.

    --
    Change is certain; progress is not obligatory.
  26. Re:Assfux got shafted & likes it, lol... apk by Ash-Fox · · Score: 1

    You've been told to stop involving me and others in your criminal activities. You are in direct violation of the Computer Fraud and Abuse Act. Slashdot is not a platform for your illegal spam and illegal comments. Your activities have only caused Slashdot, in trying to deal with you, to tighten its filters to the point that insightful commentary is now difficult.

    You have spammed this article so many times, it's ridiculous!

    You have previously violated privacy rights on Slashdot, promoted offers without the express written consent of Slashdot Media, posted content that is destructive because of what has happened with Slashdot's filters, and embedded advertising without the express written consent of Slashdot Media. All of these are against Slashdot's "Terms of Use", and in turn you have violated the Computer Fraud and Abuse Act.

    Your criminal activities are unacceptable, and your continued persistence after being advised of such means you willfully and intentionally violate the Computer Fraud and Abuse Act and Slashdot's "Terms of Use" to further propagate your spam without a care that you are responsible for further ruining discourse on Slashdot.

    You've been asked to stop, you've been told to stop, you've even been banned, and you continue. Your persistence in unethical and criminal behaviour is disgusting.

    --
    Change is certain; progress is not obligatory.
  27. Re:AssFox quit stalking me... apk by Ash-Fox · · Score: 1

    Cease and desist your criminal activities immediately.

    --
    Change is certain; progress is not obligatory.
  28. Re:Take your own advice "count stalkula" by Ash-Fox · · Score: 1

    Your sock puppeting is still in violation of the Computer Fraud and Abuse Act, cease and desist your unethical and criminal activities immediately.

    --
    Change is certain; progress is not obligatory.
  29. Re:Take your own advice "count stalkula" by Ash-Fox · · Score: 1

    This sock puppeting is still in violation of the Computer Fraud and Abuse Act, cease and desist your disgusting unethical and criminal activities immediately, APK. You are knowingly violating the Computer Fraud and Abuse Act and Slashdot's "Terms of Use".

    --
    Change is certain; progress is not obligatory.
  30. Re:Take your own advice "count stalkula" by Ash-Fox · · Score: 1

    Your continued, knowing repetition of unethical and criminal activities shows what kind of disgusting person you are, APK. Cease and desist your unethical and criminal activities immediately.

    --
    Change is certain; progress is not obligatory.
  31. Re:Take your own advice "count stalkula" by Ash-Fox · · Score: 1

    Cease these criminal activities.

    --
    Change is certain; progress is not obligatory.
  32. Re:Take your own advice "count stalkula" by Ash-Fox · · Score: 1

    Immediately cease these criminal activities.

    --
    Change is certain; progress is not obligatory.
  33. Re:Take your own advice "count stalkula" by Ash-Fox · · Score: 1

    Your persistent willful criminal activities have revealed exactly what kind of person you are. Stop involving Slashdot and others in your crimes.

    --
    Change is certain; progress is not obligatory.
  34. Re:Take your own advice "count stalkula" by Ash-Fox · · Score: 1

    Cease your criminal activities immediately, APK.

    --
    Change is certain; progress is not obligatory.
  35. Re:Take your own advice "count stalkula" by Ash-Fox · · Score: 1

    You have been told to cease your criminal acts and you still persist, APK.

    --
    Change is certain; progress is not obligatory.