CloudFlare Was Hit By Leap Second, Causing Its RRDNS Software To 'Panic' (silicon.co.uk)
Reader Mickeycaskill writes: The extra leap second added on to the end of 2016 may not have had an effect on most people, but it did catch out a few web companies who failed to factor it in. Web services and security firm CloudFlare was one such example. A small number of its servers went down at midnight UTC on New Year's Day due to an error in its RRDNS software, a domain name service (DNS) proxy that was written to help scale CloudFlare's DNS infrastructure, which limited web access for some of its customers. As CloudFlare explained, a number went negative in the software when it should have been zero, causing RRDNS to "panic" and affect the DNS resolutions to some websites. The issue was confirmed by the company's engineers at 00:34 UTC on New Year's Day and the fix -- which involved patching the clock source to ensure it normalises if time ever skips backwards -- was rolled out to the majority of the affected data centres by 02:50 UTC. Cloudflare said the outage only hit customers who use CNAME DNS records with its service. Google works around leap seconds with a so-called "smearing" technique -- running clocks slightly slower than usual on its Network Time Protocol servers.
Read the article then. It shows it pretty plainly: https://blog.cloudflare.com/ho...
I was going to try to guess what they were doing, but they have some actual code snippets.
AFAICT, a unit test wouldn't have caught this either (unless they planned for this sort of error, in which case the code wouldn't have been broken either). From TFA:
RRDNS doesn’t just keep a single measurement for each resolver, it takes many measurements and smoothes them. So, the single measurement wouldn’t cause RRDNS to think the resolver was working in negative time, but after a few measurements the smoothed value would eventually become negative.
So, a unit test with one negative example (which may have been difficult to mimic anyway, due to the direct usage of Time.Now()) probably wouldn't have triggered the issue on its own.
IMHO, blaming a misconception of time always going forward is just convenient here. The fix was changing this bit:
...though it probably would have been better to just log that somewhere and set it to the DefaultTimeout.
if rttMax == 0 {
rttMax = DefaultTimeout
}
They just changed "==" to "<=". There was no reason not to have it as "<=" to begin with, even if one ignores where rttMax comes from. Any time I check if something is == to something else, and I don't have else conditions covering the other cases, I ask myself what should happen in those other else cases and ensure I'm covered. That may still have caused it to break, but it could have done:
if rttMax == 0 {
rttMax = DefaultTimeout
} else if rttMax < 0 {
panic("What the fuck happened to rttMax to make it negative!?!")
}
Anyway, I think it's a great example of a one character bug that only triggers on very obscure events under significant load.
UUIDs are also fun to deal with. Especially with VM images that are copied by those who don't understand how you can have a duplicate UUID.