CloudFlare Was Hit By Leap Second, Causing Its RRDNS Software To 'Panic' (silicon.co.uk)
Reader Mickeycaskill writes: The extra leap second added on to the end of 2016 may not have had an effect on most people, but it did catch out a few web companies who failed to factor it in. Web services and security firm CloudFlare was one such example. A small number of its servers went down at midnight UTC on New Year's Day due to an error in its RRDNS software, a domain name service (DNS) proxy that was written to help scale CloudFlare's DNS infrastructure, which limited web access for some of its customers. As CloudFlare explained, a number went negative in the software when it should have been zero, causing RRDNS to "panic" and affect the DNS resolutions to some websites. The issue was confirmed by the company's engineers at 00:34 UTC on New Year's Day and the fix -- which involved patching the clock source to ensure it normalises if time ever skips backwards -- was rolled out to the majority of the affected data centres by 02:50 UTC. Cloudflare said the outage only hit customers who use CNAME DNS records with its service. Google works around leap seconds with a so-called "smearing" technique -- running clocks slightly slower than usual on its Network Time Protocol servers.
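For readers wondering what "normalising" a clock that skips backwards could look like, here is a minimal sketch in Go (the language RRDNS is written in). It is not CloudFlare's actual patch: monotonicClock and fakeSource are hypothetical names invented for illustration, and the approach shown (clamping to the last value returned) is just one plausible way to do it. Note that since Go 1.9, time.Now() also carries a monotonic reading that time.Sub uses automatically.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// monotonicClock is a hypothetical wrapper around a wall-clock source.
// It never returns a time earlier than one it has already returned.
type monotonicClock struct {
	mu   sync.Mutex
	now  func() time.Time // underlying source, e.g. time.Now
	last time.Time
}

// Now returns the source's time, clamped so it never moves backwards.
func (c *monotonicClock) Now() time.Time {
	c.mu.Lock()
	defer c.mu.Unlock()
	t := c.now()
	if t.Before(c.last) {
		return c.last // the source skipped backwards; repeat the last reading
	}
	c.last = t
	return t
}

// fakeSource returns a time source that replays the given Unix seconds,
// repeating the final value once the list is exhausted.
// It expects at least one value.
func fakeSource(secs ...int64) func() time.Time {
	i := 0
	return func() time.Time {
		s := secs[i]
		if i < len(secs)-1 {
			i++
		}
		return time.Unix(s, 0)
	}
}

func main() {
	// The third reading (100) simulates the clock stepping back one second.
	c := &monotonicClock{now: fakeSource(100, 101, 100, 102)}
	start := c.Now()
	for i := 0; i < 3; i++ {
		fmt.Println(c.Now().Sub(start)) // elapsed time is never negative
	}
}
```

Taking the time source as a function also makes the backwards-step case easy to reproduce in a test, which is hard to do when code calls time.Now() directly.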
...at exactly midnight, while I was playing Chivalry. I kept getting laggier... and laggier... and then everyone "froze" and the client-side prediction took over. I was recording video and it was pretty funny. Everyone just kept walking forward, until they were in a wall, and kept trying to walk forwards.
It was interesting what the client prediction would let you do. You could change weapons. You could swing your weapon. You could throw axes (of which you have two) and they flew through the air, stuck in people, and even knocked helmets off. BUT, your axe counter never actually decreased. So you could just keep throwing hundreds of axes. The animation timings / speeds were unaffected. You couldn't "chant" or grunt. You obviously couldn't damage anyone.
Anyway, my internet was down until the next morning and even then, it still required a cable modem reset to fix the connection.
I don't know if you can blame the language, the devs should have added their own checks if the language didn't have a guarantee.
Read the article then. It shows it pretty plainly: https://blog.cloudflare.com/ho...
I was going to try to guess what they were doing, but they have some actual code snippets.
AFAICT, a unit test wouldn't have caught this either (unless they planned for this sort of error, in which case the code wouldn't have been broken either). From TFA:
RRDNS doesn’t just keep a single measurement for each resolver, it takes many measurements and smoothes them. So, the single measurement wouldn’t cause RRDNS to think the resolver was working in negative time, but after a few measurements the smoothed value would eventually become negative.
So, a unit test with one negative example (which may have been difficult to mimic anyway, due to the direct usage of time.Now()) probably wouldn't have triggered the issue on its own.
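To make the "one bad sample isn't enough" point concrete, here is a rough sketch. It assumes the smoothing is an exponentially weighted moving average, which the quoted text does not actually say, and the function names and numbers (a 200 ms baseline, -800 ms samples, a weight of 0.02) are made up for illustration.

```go
package main

import "fmt"

// smooth is an assumed exponentially weighted moving average;
// the thread does not say which smoothing RRDNS actually uses.
func smooth(prev, sample, alpha float64) float64 {
	return (1-alpha)*prev + alpha*sample
}

// samplesUntilNegative feeds the same sample repeatedly and returns how many
// updates it takes for the smoothed value to drop below zero, or -1 if that
// does not happen within limit updates.
func samplesUntilNegative(start, sample, alpha float64, limit int) int {
	v := start
	for i := 1; i <= limit; i++ {
		v = smooth(v, sample, alpha)
		if v < 0 {
			return i
		}
	}
	return -1
}

func main() {
	// Illustrative values in milliseconds, not measurements from RRDNS.
	fmt.Println(smooth(200, -800, 0.02) > 0)                // one bad sample: still positive
	fmt.Println(samplesUntilNegative(200, -800, 0.02, 100)) // how many bad samples it takes
}
```

With these assumed numbers the smoothed value stays positive after one negative sample and only crosses zero after 12 of them, so a test that injects a single negative round-trip time would pass.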
IMHO, blaming a misconception of time always going forward is just convenient here. The fix was changing this bit:
    if rttMax == 0 {
        rttMax = DefaultTimeout
    }
They just changed "==" to "<=". There was no reason not to have it as "<=" to begin with, even if one ignores where rttMax comes from. Any time I check if something is == to something else, and I don't have else conditions covering the other cases, I ask myself what should happen in those other else cases and ensure I'm covered. That may still have caused it to break, but it could have done:
    if rttMax == 0 {
        rttMax = DefaultTimeout
    } else if rttMax < 0 {
        panic("What the fuck happened to rttMax to make it negative!?!")
    }
...though it probably would have been better to just log that somewhere and set it to the DefaultTimeout.
Anyway, I think it's a great example of a one character bug that only triggers on very obscure events under significant load.
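For what it's worth, the log-and-fall-back variant might look something like the sketch below. The type of rttMax and the value of DefaultTimeout are assumptions (the snippets in this thread show neither), and sanitizeRTTMax is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"log"
	"time"
)

// DefaultTimeout is an assumed value; the thread does not give the real one.
const DefaultTimeout = 2 * time.Second

// sanitizeRTTMax is a hypothetical helper. It returns rttMax unchanged when
// it is positive, and DefaultTimeout otherwise. A negative value is logged
// because it should never occur.
func sanitizeRTTMax(rttMax time.Duration) time.Duration {
	if rttMax < 0 {
		log.Printf("rttMax is negative (%v); falling back to DefaultTimeout", rttMax)
	}
	if rttMax <= 0 {
		return DefaultTimeout
	}
	return rttMax
}

func main() {
	fmt.Println(sanitizeRTTMax(-5 * time.Millisecond)) // logs, then falls back
	fmt.Println(sanitizeRTTMax(0))                     // falls back silently
	fmt.Println(sanitizeRTTMax(50 * time.Millisecond)) // passed through
}
```

The service keeps running on the "impossible" input, and the log line leaves a trail for whoever has to work out why the value went negative.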
I'm still left wondering whether the decision to put a leap second on the night tech support staff are most likely to be over halfway through a bottle of JD was A) some intentional attempt to catch edge cases where leap seconds happen during a year change or B) some tinfoil conspiracy where we'll find out billions of dollars were stolen from a system where that particular edge case could be exploited or C) just made by people so socially isolated that they don't realize just how hard it is to fix crashed boxen over a crappy 3G connection in a dive bar bathroom using a phone covered in some chick's vomit while trying to keep down that pretzel you just washed down with sparkling water.
Someone had to do it.
Don't use services whose names are terribly ironic in times of failure.
Flare, Flame, Burn, Drop, etc., etc.
The universe just loves to throw a wrench at such forms of unintentional hubris just for the LOLs.
"There is more worth loving than we have strength to love." - Brian Jay Stanley
Why would you even think of switching programming languages due to the simple and sadly common 'bug' of programmers not verifying parameters match a function's documented pre-conditions? My only guess is you're paid to promote Rust. Lazy programmers will write bugs in every language.
>RRDNS is written in Go
Their bugs are in HR department.
Who in the world hired people who are dumb enough to use an experimental language in production?
UUIDs are also fun to deal with. Especially with VM images that are copied by those who don't understand how you can have a duplicate UUID.
I've seen too many companies, even large multinational companies, who insist that their servers run on the HQ's local time to be surprised by this anymore.
It ain't the IT folks doing it, I'll tell you that. The smart ones quietly say "fuck that", set the system clocks to UTC and then set a TZ environment variable everywhere.
The Go documentation clearly says it panics if n = 0.
And it says it panics if n is less than 0.
If you write a library function that requires positive input always, and returns positive output always, then use unsigned input and output variables. A good compiler will flag the attempt at sending such a function a signed input as a warning at least. Pedantic compilers will fail -- better than the production program failing.
And while it seems stupid, the proper action when asked for a random number between 0 and 0 is to return 0, not panic. (I believe the [ on the range means "including", but I could be wrong. If it didn't mean "including", then the documentation should be '(1,n)'.)
But then, the test cases for the DNS code should have included 0 and negative, so this should have been caught when the function was tested.
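The documented behaviour is easy to check. The sketch below uses the real math/rand function rand.Int63n, which returns a value in the half-open interval [0,n) and panics if n <= 0. The helpers panics and int63nOrZero are hypothetical names; the second is just one way to get the "return 0 instead of panicking" behaviour suggested above, not something the standard library provides.

```go
package main

import (
	"fmt"
	"math/rand"
)

// panics reports whether calling rand.Int63n(n) panics.
func panics(n int64) (p bool) {
	defer func() {
		if recover() != nil {
			p = true
		}
	}()
	rand.Int63n(n)
	return false
}

// int63nOrZero is a hypothetical wrapper that returns 0 for n <= 0
// instead of panicking, and otherwise defers to rand.Int63n.
func int63nOrZero(n int64) int64 {
	if n <= 0 {
		return 0
	}
	return rand.Int63n(n)
}

func main() {
	for _, n := range []int64{-1, 0, 1} {
		fmt.Printf("Int63n(%d) panics: %v\n", n, panics(n))
	}
	fmt.Println(int63nOrZero(-1))
}
```

A table of inputs like -1, 0 and 1 is roughly what a boundary test for the calling code would need to cover.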
I don't know if you can blame the language, the devs should have added their own checks if the language didn't have a guarantee.
Given that math/rand is part of the standard Go library, and that more rigorous compile-time checking would have prevented this, blaming the language seems like a no-brainer.
Do not trust APK's software. APK is a criminal; he is blatantly violating the Computer Fraud and Abuse Act by posting here, as he is banned from Slashdot.
APK's ban evasion has led to more restrictive filters being placed on Slashdot that hinder good discussions.
Change is certain; progress is not obligatory.
There aren't a billion edges, there is only an edge where your code establishes a limit.
Some minutes have 60 seconds, some have other amounts. That doesn't cause an edge case. An edge case is caused by when your code assumes that the number of seconds has some specific value. So if in the code I say "if ( seconds assumes that the value will be valid, and they don't do something useful when the values are wrong. So they crash and burn. You want to either not care what the value is, in which case you don't want to even create an edge by testing it, or else make sure that you have a valid code path for all possible values. Did you establish an upper bound? You have to test what happens when the data exceeds it.
Basic stuff, which is why, when there was a leap second, just one piece of junky code stopped working and nothing else had any problem. There were probably large numbers of applications that actually have leap second bugs; careful log analysis might indicate that things that happened during the leap second were recorded as having happened at the start of that minute. So instead of crash-and-burn, all your things that would have happened at 23:59:60 would be listed as having happened at 23:59:00. That's because competent programmers do something useful when they get bad data instead of just crashing and burning.
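As one illustration of accepting an out-of-range seconds value without falling over, Go's time.Date normalizes its arguments rather than rejecting them. This is only an example of the general idea. It rolls 23:59:60 forward to the next minute instead of recording it as 23:59:00 as described above, and leapSecondInstant is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"time"
)

// leapSecondInstant builds 2016-12-31 23:59:60 UTC. time.Date accepts the
// out-of-range seconds value and normalizes it into the following minute.
func leapSecondInstant() time.Time {
	return time.Date(2016, time.December, 31, 23, 59, 60, 0, time.UTC)
}

func main() {
	fmt.Println(leapSecondInstant().Format(time.RFC3339)) // 2017-01-01T00:00:00Z
}
```

Whether rolling forward or clamping back is the right answer depends on the application; the point is that every input has a defined result.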