Why Browsers Blamed DNS For Facebook Outage
Julie188 writes "That was probably the only time 'DNS' will ever be a trending term on Twitter. The cause was Facebook's 2.5 hour outage on Thursday, which incorrectly told users trying to access the site that a DNS error was to blame. In truth, experts who've read Facebook's explanation say the site went down because Facebook gave itself a distributed denial-of-service attack when a system admin misconfigured a database. So why was DNS blamed? The 27-year-old communications protocol has been known to cause other, somewhat similar outages."
So why was DNS blamed?
From http://www.facebook.com/note.php?note_id=431441338919&id=9445547199&ref=mf&_fb_noscript=1
The way to stop the feedback cycle was quite painful - we had to stop all traffic to this database cluster, which meant turning off the site.
I'm, uh, taking a wild guess that simply shutting off port 80 is not going to allow for a controllable ramp up... they could redirect to another site, Orkut or myspace would have been mildly humorous. I am mildly surprised they don't have a simple emergency box with a simple static "undergoing repair" page, but, whatever ...
So, other than zapping the A records and waiting, what are they supposed to do? Bonus points if they were doing DNS based load balancing and simply unplugged their (dns based) load balancer.
I have no dog in the fight, having deleted my facebook account months ago. It is kind of funny that a page of technobabble is described as "technical details" as if folks like us/me would find it to be a complete description rather than pretty vague. Then again we're dealing with farmville addicts and you can't reason with addicts.
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
The 27-year-old communications protocol
So? TCP/IP is 36 years old.
Then stop buying Dells. :P
I read TFA and all I got was this lousy cookie
The confusion might have come from the fact that when I looked, there seemed to also be some DNS problem.
Basically, when asking directly, the servers that are authoritative for the zone were giving me a CNAME for the 'ANY' query, but not the associated A records, which it should, since the CNAME was pointing to a host name within the same authority. At this point, any sensible resolver stops asking !
This only lasted for a little while though - so it might have been a glitch or possibly a deliberate action related to how they were trying to fix the underlying issue itself - possibly averting traffic until they actually solved the actual problem.
--Ivan
It didn't fail, they turned it off. This was the easiest way to "shut off the entire site" as their post-mortem describes. The DNS errors users saw were being generated by the front-end HTTP proxies, not by client browsers, which caused most of this confusion. Once the database issue cleared, they reactivated the DNS entries for the back-end servers one cluster at a time and the site came back.