Why Browsers Blamed DNS For Facebook Outage

Posted by timothy on Sunday September 26, 2010 @03:44AM from the three-letters-bad dept.

Julie188 writes "That was probably the only time 'DNS' will ever be a trending term on Twitter. The cause was Facebook's 2.5-hour outage on Thursday, during which users trying to access the site were incorrectly told that a DNS error was to blame. In truth, experts who've read Facebook's explanation say the site went down because Facebook effectively launched a distributed denial-of-service attack on itself when a system admin misconfigured a database. So why was DNS blamed? The 27-year-old communications protocol has been known to cause other, somewhat similar outages."

Duh by vlm · 2010-09-26 04:00 · Score: 5, Insightful

So why was DNS blamed?

From http://www.facebook.com/note.php?note_id=431441338919&id=9445547199&ref=mf&_fb_noscript=1

The way to stop the feedback cycle was quite painful - we had to stop all traffic to this database cluster, which meant turning off the site.
I'm, uh, taking a wild guess that simply shutting off port 80 is not going to allow for a controllable ramp-up... they could redirect to another site; Orkut or MySpace would have been mildly humorous. I am mildly surprised they don't have a simple emergency box with a simple static "undergoing repair" page, but, whatever ...
So, other than zapping the A records and waiting, what are they supposed to do? Bonus points if they were doing DNS-based load balancing and simply unplugged their (DNS-based) load balancer.
I have no dog in the fight, having deleted my Facebook account months ago. It is kind of funny that a page of technobabble is described as "technical details", as if folks like us/me would find it to be a complete description rather than pretty vague. Then again, we're dealing with FarmVille addicts and you can't reason with addicts.

--
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
Ageism by Vahokif · 2010-09-26 04:05 · Score: 5, Informative

The 27-year-old communications protocol
So? TCP/IP is 36 years old.
1. Re:Ageism by morgan_greywolf · 2010-09-26 05:03 · Score: 4, Insightful
  
  Really? DNS is broken? So typing, say, http://slashdot.org/ doesn't work for you?
  No. DNS has a few security issues, but they're mostly minor. The fact that DNS works for millions of people every day without issue at least 99% of the time proves that DNS is a successful design, even if it could use some security updating.
  
  --
  My blog
2. Re:Ageism by kasperd · 2010-09-26 05:22 · Score: 5, Insightful
  
  Some people think technology should be replaced just because it is old. But really, it should be replaced only if it doesn't suit our needs and there is a different technology that does.
  
  It is better to replace a 1 year old technology that does not suit our needs than to replace a 50 year old one that does. Usually when replacing, you want to replace with something newer. But in some cases it may turn out to be better to replace a new and misdesigned technology with an older and proven one.
  
  That said, there are improvements to both IP and DNS which should be rolled out because they fix real problems. The rollouts are not happening as fast as they ought to, mainly because it is problematic to roll out a change to the entire Internet, especially when not everybody involved is cooperating.
  
  But I don't think that really has anything to do with this outage.
  
  --
  
  Do you care about the security of your wireless mouse?
Re:DNS? by Mitchell314 · 2010-09-26 04:19 · Score: 3, Funny

Then stop buying Dells. :P

--
I read TFA and all I got was this lousy cookie
There WERE some DNS issues too! by ivan_w · 2010-09-26 05:06 · Score: 3, Informative

The confusion might have come from the fact that, when I looked, there also seemed to be a DNS problem.
Basically, when queried directly, the servers that are authoritative for the zone were giving me a CNAME for the 'ANY' query, but not the associated A records, which they should have included, since the CNAME was pointing to a host name within the same authority. At this point, any sensible resolver stops asking!
This only lasted for a little while though - so it might have been a glitch, or possibly a deliberate action related to how they were trying to fix the underlying issue, perhaps to keep traffic away until they had solved the actual problem.
--Ivan
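
To make the symptom above concrete, here is a minimal Python sketch of the check being described: given the records in one answer, find CNAME targets that have no A record alongside them. The record shape (owner, type, value), the function name and the example.com data are all assumptions for illustration; they are not the records Facebook's servers actually returned, and no real DNS library is involved.

```python
# Hypothetical record shape: (owner_name, record_type, value).
def unresolved_cnames(answer):
    """Return CNAME targets that have no A record in the same answer."""
    # Normalise names so "Web.Example.com." and "web.example.com" compare equal.
    a_owners = {
        name.lower().rstrip(".")
        for name, rtype, _ in answer
        if rtype == "A"
    }
    missing = []
    for name, rtype, value in answer:
        if rtype == "CNAME":
            target = value.lower().rstrip(".")
            if target not in a_owners:
                missing.append(target)
    return missing


# Made-up example data (documentation address range), not real Facebook records.
incomplete = [("www.example.com.", "CNAME", "web.example.com.")]
complete = incomplete + [("web.example.com.", "A", "192.0.2.10")]

print(unresolved_cnames(incomplete))  # ['web.example.com']
print(unresolved_cnames(complete))    # []
```

An answer like `incomplete` leaves the client with an alias but no address, which is the situation the comment describes.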
Re:Did Facebook have an internal DNS failure? by rekoil · 2010-09-26 06:21 · Score: 3, Informative

It didn't fail, they turned it off. This was the easiest way to "shut off the entire site" as their post-mortem describes. The DNS errors users saw were being generated by the front-end HTTP proxies, not by client browsers, which caused most of this confusion. Once the database issue cleared, they reactivated the DNS entries for the back-end servers one cluster at a time and the site came back.
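
A rough Python sketch of the mechanism described above, under stated assumptions: a toy in-memory "zone" stands in for the internal DNS, and a toy front-end proxy reports a DNS error when no back-end name resolves. The class, function, host names and error strings are all hypothetical; nothing here reflects Facebook's actual setup beyond what the comment says.

```python
class BackendDNS:
    """Toy in-memory stand-in for an internal DNS zone (hypothetical)."""

    def __init__(self, records):
        self._records = dict(records)  # hostname -> IP address
        self._enabled = set()          # hostnames currently published

    def enable(self, host):
        self._enabled.add(host)

    def disable_all(self):
        self._enabled.clear()

    def resolve(self, host):
        if host not in self._enabled:
            raise LookupError(f"no DNS record for {host}")
        return self._records[host]


def proxy_request(dns, backends):
    """Front-end proxy: forward to the first back-end whose name resolves."""
    for host in backends:
        try:
            return f"200 OK via {dns.resolve(host)}"
        except LookupError:
            continue
    # The proxy, not the user's browser, is what reports the DNS failure.
    return "502 DNS error: backend lookup failed"


backends = ["cluster-a.internal", "cluster-b.internal"]
dns = BackendDNS({"cluster-a.internal": "10.0.0.1",
                  "cluster-b.internal": "10.0.0.2"})

dns.disable_all()                    # "turn the site off"
print(proxy_request(dns, backends))  # 502 DNS error: backend lookup failed

dns.enable("cluster-b.internal")     # bring one cluster back
print(proxy_request(dns, backends))  # 200 OK via 10.0.0.2
```

Re-enabling names one at a time is what gives the controlled ramp-up: each newly published cluster starts taking traffic while the rest stay dark.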