Why Browsers Blamed DNS For Facebook Outage
Julie188 writes "That was probably the only time 'DNS' will ever be a trending term on Twitter. The cause was Facebook's 2.5 hour outage on Thursday, which incorrectly told users trying to access the site that a DNS error was to blame. In truth, experts who've read Facebook's explanation say the site went down because Facebook gave itself a distributed denial-of-service attack when a system admin misconfigured a database. So why was DNS blamed? The 27-year-old communications protocol has been known to cause other, somewhat similar outages."
It wasn't your browser having a DNS error; it was the user-facing servers at Facebook reporting DNS problems talking to whoever they talk to. Maybe when they decided the way to fix the problem was to take down the site, they just removed the back-end server cluster from their internal DNS.
So why was DNS blamed?
From http://www.facebook.com/note.php?note_id=431441338919&id=9445547199&ref=mf&_fb_noscript=1
The way to stop the feedback cycle was quite painful - we had to stop all traffic to this database cluster, which meant turning off the site.
I'm, uh, taking a wild guess that simply shutting off port 80 is not going to allow for a controllable ramp up... they could redirect to another site, Orkut or myspace would have been mildly humorous. I am mildly surprised they don't have a simple emergency box with a simple static "undergoing repair" page, but, whatever ...
So, other than zapping the A records and waiting, what are they supposed to do? Bonus points if they were doing DNS based load balancing and simply unplugged their (dns based) load balancer.
I have no dog in the fight, having deleted my facebook account months ago. It is kind of funny that a page of technobabble is described as "technical details" as if folks like us/me would find it to be a complete description rather than pretty vague. Then again we're dealing with farmville addicts and you can't reason with addicts.
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
The 27-year-old communications protocol
So? TCP/IP is 36 years old.
Then stop buying Dells. :P
I read TFA and all I got was this lousy cookie
I found the genuine panic from many Facebook users to this outage very amusing.
http://rs79.vrx.net/works/photoblog/2010/Sep/23/
Notice the page, being served from facebook.com, saying "bad DNS". Think about that for a second.
The confusion might have come from the fact that when I looked, there seemed to also be some DNS problem.
Basically, when I queried them directly, the servers that are authoritative for the zone were giving me a CNAME for the 'ANY' query, but not the associated A records, which they should have included, since the CNAME was pointing to a host name within the same authority. At that point, any sensible resolver stops asking!
This only lasted for a little while, though, so it might have been a glitch or possibly a deliberate action related to how they were trying to fix the underlying issue, perhaps keeping traffic away until they had solved the actual problem.
--Ivan
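The condition described above (a CNAME whose target is inside the same zone, with no address record for that target in the same answer) can be expressed as a small check. This is only a sketch: the `Record` type, the `dangling_cnames` helper, and the example.com names are hypothetical, and it works on records that have already been parsed rather than sending any real query.

```python
from typing import NamedTuple


class Record(NamedTuple):
    name: str   # owner name of the record
    rtype: str  # record type, e.g. "A" or "CNAME"
    value: str  # address, or CNAME target


def _norm(name):
    # Compare names without the trailing dot and case-insensitively.
    return name.rstrip(".").lower()


def dangling_cnames(records, zone):
    """Return in-zone CNAME targets that have no A/AAAA record in `records`."""
    zone = _norm(zone)
    have_addr = {_norm(r.name) for r in records if r.rtype in ("A", "AAAA")}
    missing = []
    for r in records:
        if r.rtype != "CNAME":
            continue
        target = _norm(r.value)
        in_zone = target == zone or target.endswith("." + zone)
        if in_zone and target not in have_addr:
            missing.append(target)
    return missing


# Hypothetical answer section: a CNAME with no address for its target.
answer = [Record("www.example.com.", "CNAME", "web.example.com.")]
print(dangling_cnames(answer, "example.com"))
```

A target outside the zone is ignored here, since another server would be responsible for answering for it.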
I can understand why that may cause people to think the problem is with DNS. The error message looks like it came from an HTTP proxy. That would suggest that either the user had a proxy configured or Facebook were using a reverse proxy. If it was the latter, the DNS "problem" would be inside their network.
No. Facebook doesn't do data-mining, and they don't serve ads. They simply pull money out of their ass.
"linux is just DOS with a UNIX like syntax" -- Galactic Dominator (944134)
It didn't fail, they turned it off. This was the easiest way to "shut off the entire site" as their post-mortem describes. The DNS errors users saw were being generated by the front-end HTTP proxies, not by client browsers, which caused most of this confusion. Once the database issue cleared, they reactivated the DNS entries for the back-end servers one cluster at a time and the site came back.
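To illustrate how a front end could end up showing a "DNS error" page that has nothing to do with the visitor's own resolver, here is a minimal sketch. It is not Facebook's actual setup: `proxy_status`, the status codes, the message text, and the host name `db.internal.example` are all assumptions made for the example.

```python
import socket


def proxy_status(backend_host, resolve=socket.getaddrinfo):
    """Return the (status, body) a front end might send for one request.

    `resolve` is injectable so the behaviour can be shown without a network.
    """
    try:
        # The proxy, not the browser, looks up the back-end name.
        resolve(backend_host, 80)
    except socket.gaierror:
        # The lookup failed inside the data center; the visitor only
        # sees the resulting error page.
        return 502, f"DNS error: cannot resolve {backend_host}"
    return 200, f"would forward request to {backend_host}"


def removed_from_dns(host, port):
    # Stand-in resolver for a back end whose DNS entry was pulled.
    raise socket.gaierror("name not known")


print(proxy_status("db.internal.example", removed_from_dns))
```

Putting the entries back makes the same front end start forwarding again, which fits the cluster-by-cluster recovery described above.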
So is Slashdot.
I don't know that finger-pointing is necessarily healthy; it tends to suggest CYA and childish blame games. But on a technical, IT-focused web site, one might suppose that a lessons-learned exercise on the root cause of the failure of a massive website would be of interest and hopefully even an educational experience.
Look at my own /etc/hosts file. From time to time I manage to bite myself on the ass with my block-list.
#Below is my custom DNS blocklist
127.0.0.1 om.nom.nom.
user@localhost:~$ ping om.nom.nom.
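For anyone who hasn't used a hosts-file block-list, the idea is simply a local name-to-address table that pins unwanted names to 127.0.0.1. A rough sketch follows; `parse_hosts` and `lookup` are hypothetical helpers, and stripping the trailing dot and lowercasing names is a simplification for the example, not necessarily what any particular resolver library does.

```python
def parse_hosts(text):
    """Build a {name: address} table from hosts-file style text."""
    table = {}
    for line in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # First entry for a name wins in this sketch.
            table.setdefault(name.rstrip(".").lower(), address)
    return table


def lookup(name, table):
    """Return the pinned address for `name`, or None if it is not listed."""
    return table.get(name.rstrip(".").lower())


HOSTS = """\
#Below is my custom DNS blocklist
127.0.0.1 om.nom.nom.
"""

table = parse_hosts(HOSTS)
print(lookup("om.nom.nom.", table))
```

Any name that lands in the table resolves to the local machine, which is exactly how a stale entry can make a working site look broken.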
Yet you failed to notice that /. is a site for nerds.
Many nerds do not strive to cultivate their social skills.
Checking their friends' status on social networks might not be at the top of their agendas.
So: the event was notable, but not very important to many Slashdotters.