Slashdot Mirror


CloudFlare Was Hit By Leap Second, Causing Its RRDNS Software To 'Panic' (silicon.co.uk)

Reader Mickeycaskill writes: The extra leap second added to the end of 2016 may not have affected most people, but it did catch out a few web companies that failed to factor it in. Web services and security firm CloudFlare was one such example. A small number of its servers went down at midnight UTC on New Year's Day because of an error in RRDNS, a Domain Name System (DNS) proxy written to help scale CloudFlare's DNS infrastructure. The failure limited web access for some of its customers. As CloudFlare explained, a number in the software went negative when it should have been zero, causing RRDNS to "panic" and disrupting DNS resolution for some websites. The company's engineers confirmed the issue at 00:34 UTC on New Year's Day, and the fix -- which involved patching the clock source so that it normalises if time ever skips backwards -- was rolled out to the majority of the affected data centres by 02:50 UTC. CloudFlare said the outage only hit customers who use CNAME DNS records with its service. Google works around leap seconds with a so-called "smearing" technique -- running clocks slightly slower than usual on its Network Time Protocol servers.

1 of 119 comments

  1. Was the Go prog lang at fault? Would Rust help? by Anonymous Coward · Score: 1, Flamebait

    The blog post about this incident says:

    RRDNS is written in Go and uses Go’s time.Now() function to get the time. Unfortunately, this function does not guarantee monotonicity. Go currently doesn’t offer a monotonic time source (see issue 12914 for discussion).

    and then later it says:

    When RRDNS selects an upstream to resolve a CNAME it uses a weighted selection algorithm. The code takes the upstream time values and feeds them to Go’s rand.Int63n() function. rand.Int63n promptly panics if its argument is negative. That's where the RRDNS panics were coming from.

    So to me it sounds like this incident was at least partially due to limitations with the Go programming language and its libraries.

    Would this incident still have happened if this software were written in the Rust programming language?