Why Browsers Blamed DNS For Facebook Outage
Julie188 writes "That was probably the only time 'DNS' will ever be a trending term on Twitter. The cause was Facebook's 2.5 hour outage on Thursday, which incorrectly told users trying to access the site that a DNS error was to blame. In truth, experts who've read Facebook's explanation say the site went down because Facebook gave itself a distributed denial-of-service attack when a system admin misconfigured a database. So why was DNS blamed? The 27-year-old communications protocol has been known to cause other, somewhat similar outages."
All I received was bad gateway. Just saying..
Is contagious, it seems.
The dangers of knowledge trigger emotional distress in human beings.
because Chrome stopped at "resolving host"
It wasn't your browser having a DNS error, it was the user facing servers at Facebook reporting DNS problems talking to whoever they talk to. Maybe when they decided the way to fix the problem was to take down the site, they just removed the back end server cluster from their internal DNS.
So why was DNS blamed?
From http://www.facebook.com/note.php?note_id=431441338919&id=9445547199&ref=mf&_fb_noscript=1
The way to stop the feedback cycle was quite painful - we had to stop all traffic to this database cluster, which meant turning off the site.
I'm, uh, taking a wild guess that simply shutting off port 80 is not going to allow for a controllable ramp up... they could redirect to another site, Orkut or myspace would have been mildly humorous. I am mildly surprised they don't have a simple emergency box with a simple static "undergoing repair" page, but, whatever ...
So, other than zapping the A records and waiting, what are they supposed to do? Bonus points if they were doing DNS based load balancing and simply unplugged their (dns based) load balancer.
I have no dog in the fight, having deleted my facebook account months ago. It is kind of funny that a page of technobabble is described as "technical details" as if folks like us/me would find it to be a complete description rather than pretty vague. Then again we're dealing with farmville addicts and you can't reason with addicts.
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
The 27-year-old communications protocol
So? TCP/IP is 36 years old.
Terrible article. What "DNS error"? Is Facebook running its own DNS servers that do something funny, or what?
As for DNS "moving to the cloud", DNS is already far more distributed than any of the "cloud" systems. Which is a good thing.
Yeah...from reading that Facebook note, it's pretty clear that DNS had nothing to do with the outage. Do you guys think the outage would've been better or worse had it been one?
I found the genuine panic from many Facebook users to this outage very amusing.
It's an advertisement platform that rides solely on the ignorance of its users. So people had two and a half hours to take a break from their narcissism... this is something worthy of finger pointing?
Sorry for posting Off-Topic, but is there a way to hide all stories tagged "facebook"?
Who cares? Enough about the facebook outage already.
With all the adblocking software, no scripting, and other misc Firefox plugins I have no ads from facebook, let alone any other page. Firefox is set to delete cookies and BS on exit and I keep my machine clean with bleachbit.
Facebook allows me to connect with lots of people I never would see, like buddies in the Army based in Japan for instance or friends in New York etc, etc.
I hide all the annoying spam ads for people's stupid farms, I have convinced many of my Facebook friends and family to stop playing those games and to take similar precautions.
Every company is evil, hell even my Linux distro has political agendas, ( damn Mint Linux!!! ) but what does it really matter? It is the cost of using technology. Until we change our idea about advancing ourselves and not our pocket books nothing will change, so stop with the Q_Q and pew pew, and just deal with and be smart about what you do.
Visit my Forums?
No explanation for DNS is offered.
My guess is Facebook runs its own DNS servers and they were swamped by a DDoS of Facebook's own making.
The confusion might have come from the fact that when I looked, there seemed to also be some DNS problem.
Basically, when asking directly, the servers that are authoritative for the zone were giving me a CNAME for the 'ANY' query, but not the associated A records, which they should, since the CNAME was pointing to a host name within the same authority. At this point, any sensible resolver stops asking!
This only lasted for a little while though - so it might have been a glitch or possibly a deliberate action related to how they were trying to fix the underlying issue itself - possibly averting traffic until they actually solved the actual problem.
--Ivan
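The check described in the comment above can be sketched roughly as follows. This is only an illustration: the answer is modeled as a plain list of (owner, type, value) tuples rather than a real DNS response, and the function name, the example.com host names, and the 192.0.2.10 address are all made up for the example.

```python
def missing_cname_targets(answer, zone):
    """Return in-zone CNAME targets that have no A record in the same answer.

    `answer` is a list of (owner_name, record_type, value) tuples, a
    simplified stand-in for a parsed DNS response (an assumption of this
    sketch, not a real resolver API).
    """
    def norm(name):
        # Compare names case-insensitively and without the trailing dot.
        return name.rstrip(".").lower()

    zone = norm(zone)
    names_with_a = {norm(owner) for owner, rtype, _ in answer if rtype == "A"}

    missing = []
    for owner, rtype, value in answer:
        if rtype != "CNAME":
            continue
        target = norm(value)
        in_zone = target == zone or target.endswith("." + zone)
        if in_zone and target not in names_with_a:
            missing.append(target)
    return missing


# Hypothetical answer: a CNAME whose in-zone target comes back with no A record.
answer = [("www.example.com.", "CNAME", "www-lb.example.com.")]
print(missing_cname_targets(answer, "example.com."))
```

An empty result would mean every in-zone CNAME target in the answer came with an address; a non-empty one matches the situation the comment describes.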
The summary's wrong. The problem was caused by one looping server, hence making it a DOS, not a DDOS.
Yet m.facebook.com worked the whole time this was going on.
Cause browsers is stoopid!
That was obvious, it showed symptoms of a DDoS attack, not a DNS problem. I find it funny it was caused by their own error.
To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover.
Even when the database has a valid value, if failures to get a value from the database can create a growing cascade of errors, then this design is still poised for a future failure from simple things like a partial outage of the databases or of network access to them. Ideally, once the data was valid, the number of clients not getting a valid value should gradually decrease as more and more get valid values and don't have to requery. But if the scale was such that none could get anything when all were trying (and hence required a shutdown and slow start), it can all happen again with many other classes of failure. It can even happen with a transient error, if the transient is long enough to trip a certain threshold of clients.
So leaving that configuration correction system off for now makes sense. I would suggest a combination of a push system (originate at the database and push new values out ... but watch for security) and a randomizing of delay times inserted for the data pulls.
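The "randomizing of delay times" idea, combined with not throwing away the cached value when a query merely fails, might look something like the sketch below. It is not Facebook's code: the class, its parameters, and the doubling backoff with full jitter are assumptions chosen for illustration.

```python
import random


class CachedConfigClient:
    """Hypothetical client showing two mitigations: keep the cached value
    when a query fails, and retry after a randomized (jittered) delay."""

    def __init__(self, fetch, base_delay=0.5, max_delay=60.0, rng=None):
        self.fetch = fetch            # callable(key) -> value; raises on failure
        self.base_delay = base_delay  # seconds
        self.max_delay = max_delay    # upper bound on any single delay
        self.rng = rng or random.Random()
        self.cache = {}
        self.failures = {}            # key -> consecutive failure count

    def next_delay(self, key):
        """Pick a delay uniformly in [0, min(max_delay, base_delay * 2**failures)]."""
        cap = min(self.max_delay,
                  self.base_delay * 2 ** self.failures.get(key, 0))
        return self.rng.uniform(0.0, cap)

    def refresh(self, key):
        """Try to refresh one key; return how many seconds to wait before retrying."""
        try:
            value = self.fetch(key)
        except Exception:
            # A failed query is not treated as an invalid value:
            # the cached entry is left in place.
            self.failures[key] = self.failures.get(key, 0) + 1
            return self.next_delay(key)
        self.cache[key] = value
        self.failures[key] = 0
        return 0.0
```

Because each client picks its own random delay, retries spread out over time instead of all arriving at the database at the same moment, which is the property the comment is after.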
now we need to go OSS in diesel cars
What percentage of slashdotters actually noticed the facebook outage when it happened? As opposed to merely participating in the post-hoc commentary after they read about it. It should have been posted to slashdot's idle category.
Those who can make you believe absurdities can make you commit atrocities. - Voltaire
... and browsers will think their DNS is dead ... because, well, it is ... and is the first thing a browser needs to access.
The "error page" is clearly a Facebook server reporting a DNS failure within Facebook's own network. Facebook requests are processed by user-facing servers which make RPC calls (not HTTP) into Facebook's internal network. Machines in multiple locations may be involved in generating a single Facebook page. If their in-house DNS system for organizing their internal network failed, they might produce messages like that.
Browsers are fucking software. They don't blame anything for anything.
Facebook was down. That's the only thing that matters to most people.
"Fake sidebar"
Lookup whoosh in the dictionary. That's not a mirror.
look at my own /etc/hosts file. From time to time I manage to bite myself on the ass with my block-list
#Below is my custom DNS blocklist
127.0.0.1 om.nom.nom.
user@localhost:~$ ping om.nom.nom.
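For anyone wondering how a single block-list line can bite, here is a rough sketch of reading hosts-format text into a name-to-address map. The function name is made up, and two simplifications are assumptions of the sketch rather than documented resolver behavior: the first entry for a name wins, and a trailing dot is stripped before comparing names.

```python
def parse_hosts(text):
    """Map each hostname in hosts-format text to the address on its line."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # Assumption: the first entry for a name wins; trailing dot ignored.
            mapping.setdefault(name.rstrip(".").lower(), address)
    return mapping


hosts_text = """\
#Below is my custom DNS blocklist
127.0.0.1 om.nom.nom.
"""
print(parse_hosts(hosts_text))
```

Any name that ends up in that map is answered locally, so a blocked entry never reaches a DNS server at all.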
Boot Windows, Linux, and ESX over the network for free.
Wouldn't this just be DOS or have the two separate terms become synonymous?
It is the most used website in the world (more user-hours/month spent on Facebook than any other site), the fastest growing internet community (when measured in new users/month), etc... And as such it is an engineering masterpiece (in software engineering and probably in several other areas, too). When it goes down for several hours, it is a newsworthy event.
For us who work for advertising agencies, FB downtime is also a financially notable event.
it might be a buzz acronym on Twitter anyway, 'cause it also means DNA in German
There is a whole market devoted to handling high delay TCP connections. It works. It's what I do. Well, part of it.
Replacing the protocol for this reason would be kind of lame.
HBI's Law: Frequency of calling others Nazis is directly correlated with the likelihood of the accuser being Communist.
Someone digging in Farmville didn't call before digging and cut an underground cable.
How much easier would fighting spam be if SMTP had a strong authentication system for sent messages?
There is one, called OpenPGP. There is another one, called S/MIME. Implementation of these in real-world MUAs awaits a decision on best practices for how strong the authentication needs to be. Stronger authentication has two downsides. First, the cost of obtaining a digital ID goes up with strength; even with the OpenPGP web of trust, travel to a key signing party hundreds of km away is not free. Second, requiring strong digital ID makes it difficult for someone living under a government that suppresses speech to express politically unpopular ideas.
If everyone had as a personal policy "only read OpenPGP-signed mail, and distrust mail signed with a key I haven't personally downloaded from a key server"
Then it would still fall under the "Requires immediate total cooperation from everybody at once" line of the well-known copypasta, and possibly "Mailing lists and other legitimate email uses would be affected" and "Many email users cannot afford to lose business or alienate potential employers" depending on how it is implemented.