Why Browsers Blamed DNS For Facebook Outage

DNS? by Anonymous Coward · 2010-09-26 03:46 · Score: 0

All I received was bad gateway. Just saying..

Re:DNS? by Mitchell314 · 2010-09-26 04:19 · Score: 3, Funny

Then stop buying Dells. :P

--
I read TFA and all I got was this lousy cookie
Re:DNS? by rs79 · 2010-09-26 04:44 · Score: 2, Informative

http://rs79.vrx.net/works/photoblog/2010/Sep/23/
Notice the page, being served from facebook.com, saying "bad DNS". Think about that for a second.

--
Need Mercedes parts ?
Re:DNS? by kasperd · 2010-09-26 05:08 · Score: 2, Interesting

Notice the page, being served from facebook.com, saying "bad DNS".
I can understand why that may cause people to think the problem is with DNS. The error message looks like it came from an HTTP proxy. That would suggest that either the user had a proxy configured or Facebook was using a reverse proxy. If it was the latter, the DNS "problem" would be inside their network.

--

Do you care about the security of your wireless mouse?
Re:DNS? by rekoil · 2010-09-26 06:12 · Score: 1

Easy. They absolutely do use reverse proxies - every large site does, because you just can't scale a web site to Facebook's size without them.
In the post-mortem, they mention the need to effectively "turn off" the entire site, and the easiest way to do that is to remove its DNS. In this case, however, it was most likely more effective to remove the DNS entries for the back-end hosts that the proxies forward queries to, rather than the entries for www.facebook.com. This is most likely what generated the DNS errors that users saw.
Re:DNS? by billcopc · 2010-09-26 15:47 · Score: 1

Bingo! The DNS issue was internal to Facebook's load-balancing cluster. Anyone who's hosted a busy web site should be familiar with this kind of setup. Internal DNS is often used for such purposes, as it can transparently provide round-robin functionality. Every time you resolve the hostname, you get a different IP (caches notwithstanding), so while the back-end servers need to be conscious of the load-balancing in a generic fashion, the actual distribution of work is trivial, and adding more back-end nodes merely requires a painless addition to a DNS zone file.
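A rough Python sketch of the round-robin idea described above. The zone class, the hostname, and the addresses are all made up for illustration, and real DNS also involves TTLs and resolver caches, which this ignores.

```python
from collections import deque


class RoundRobinZone:
    """Toy stand-in for an internal DNS zone with several A records per name."""

    def __init__(self):
        self._records = {}  # hostname -> deque of IP strings

    def add_a_record(self, hostname, ip):
        # Adding a back-end node is just one more record in the zone.
        self._records.setdefault(hostname, deque()).append(ip)

    def resolve(self, hostname):
        # Return the full record set, then rotate so the next query
        # sees a different address first.
        pool = self._records.get(hostname)
        if not pool:
            raise LookupError(f"no A records for {hostname}")
        answer = list(pool)
        pool.rotate(-1)
        return answer


zone = RoundRobinZone()
for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):
    zone.add_a_record("db.internal.example", ip)

# Clients that always take the first address end up spread across nodes.
first_choices = [zone.resolve("db.internal.example")[0] for _ in range(4)]
print(first_choices)
```

If every record for the name is removed from such a zone, `resolve` raises, which is roughly the situation a front end would report as a DNS error.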

--
-Billco, Fnarg.com
Re:DNS? by rs79 · 2010-09-26 20:19 · Score: 1

I think they changed their internal DNS config, screwed it up, and when their front facing webservers tried to lookup their database servers and failed, they tried the backup/rollover db servers, failed... these cascading errors caused their internal DNS servers to melt down.
After they'd been down for a while, because it spun down slowly over about half an hour, somebody in charge asked "WHY ARE WE DOWN" and was told "DNS error" and then changed the front facing webservers to spit up HTML that said "DNS ERROR", a simple web page communicating something is better than dead air.
Pedants will note that when http://facebook.com/ says "DNS error" clearly it isn't a DNS error - it was able to use the DNS to find facebook.com, no? Therefore it had to be an internal DNS error.
Facebook's own explanation of the fault speaks vaguely of cached and persistent data. Classic DNS screwup.

--
Need Mercedes parts ?
Re:DNS? by sortius_nod · 2010-09-26 20:54 · Score: 1

If it was doing a DDoS of their SQL servers, wouldn't taking down the DNS be useless? I've always been taught to use IP rather than hostname when building any database servers.
Re:DNS? by c6gunner · 2010-09-27 00:09 · Score: 1

Off topic here, but FYI your "orange plane" is probably a flare, especially if you spotted it over Mountain View.
Re:DNS? by InfiniteWisdom · 2010-09-27 08:17 · Score: 1

I've always been taught to use IP rather than hostname when building any database servers.
Definitely time for some reeducation then. Using IP addresses instead of DNS is just asking for trouble and headaches.
Re:DNS? by Anonymous Coward · 2010-09-27 20:49 · Score: 0

Agreed - the only time I can think of using only IPs is on a completely isolated segment, where DNS is unavailable and undesirable, like a cluster interconnect. Even on private backup segments we resolve names through the frontend and route the traffic through the backend.

Human Error by mfh · 2010-09-26 03:52 · Score: 1

Is contagious, it seems.

--
The dangers of knowledge trigger emotional distress in human beings.

Re:Human Error by BrokenHalo · 2010-09-26 05:06 · Score: 1

Indeed. I don't do Facebook, but if I had got such a message, my first response would be to look at my own /etc/hosts file. From time to time I manage to bite myself on the ass with my block-list, but I can live with that...
Re:Human Error by froggymana · 2010-09-26 05:47 · Score: 1

People are always looking to blame someone else for their problems or someone else's. It's just human nature.

--
"To prevent this day from getting any worse, I'll just read ERROR as GOOD THING" 1GJU8xLuDKDxEs4KLf8fAGyptoDsqvEsBT

I think it was DNS by mhh91 · 2010-09-26 03:52 · Score: 1

because Chrome stopped at "resolving host"

Re:I think it was DNS by morgan_greywolf · 2010-09-26 04:56 · Score: 1

Except that 'nslookup facebook.com', et al. worked with no issues. RTFS.

--
My blog

Message saying DNS error by Anonymous Coward · 2010-09-26 03:58 · Score: 2, Interesting

It wasn't your browser having a DNS error, it was the user facing servers at Facebook reporting DNS problems talking to whoever they talk to. Maybe when they decided the way to fix the problem was to take down the site, they just removed the back end server cluster from their internal DNS.

Duh by vlm · 2010-09-26 04:00 · Score: 5, Insightful

So why was DNS blamed?

From http://www.facebook.com/note.php?note_id=431441338919&id=9445547199&ref=mf&_fb_noscript=1

The way to stop the feedback cycle was quite painful - we had to stop all traffic to this database cluster, which meant turning off the site.

I'm, uh, taking a wild guess that simply shutting off port 80 is not going to allow for a controllable ramp up... they could redirect to another site, Orkut or myspace would have been mildly humorous. I am mildly surprised they don't have a simple emergency box with a simple static "undergoing repair" page, but, whatever ...

So, other than zapping the A records and waiting, what are they supposed to do? Bonus points if they were doing DNS based load balancing and simply unplugged their (dns based) load balancer.

I have no dog in the fight, having deleted my facebook account months ago. It is kind of funny that a page of technobabble is described as "technical details" as if folks like us/me would find it to be a complete description rather than pretty vague. Then again we're dealing with farmville addicts and you can't reason with addicts.

--
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger

Re:Duh by kasperd · 2010-09-26 05:36 · Score: 1

I'm, uh, taking a wild guess that simply shutting off port 80 is not going to allow for a controllable ramp up...
Both approaches allow for a controllable ramp up given the right software on their servers. And I think with the typical off the shelf software neither of them allow for a controllable ramp up.

But did they even need a controllable ramp up of user requests? It sounded like the overloaded system was overloaded by internal requests, that were unrelated to the number of requests they got from end users.

My guess is they simply did what they found the easiest way to get their internal systems working again and not worrying about what errors users would see in the meantime.

--

Do you care about the security of your wireless mouse?
Re:Duh by Anonymous Coward · 2010-09-26 07:27 · Score: 0

Without going into specifics, we use global load balancing infrastructure driven by dns to shift traffic between VIPs in realtime. Turn that to 0 with no fallback set and you get the described behavior. Painful but in this case, necessary.
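A minimal Python sketch of what "turn that to 0 with no fallback" could look like in a DNS-driven load balancer. This is not any vendor's API; the function name, the weights, and the documentation-range addresses are assumptions made for the example.

```python
import random


def pick_vip(weights, fallback=None, rng=random):
    """Choose a VIP in proportion to its weight.

    weights: dict mapping VIP address -> non-negative integer weight.
    Returns `fallback` (None by default) when every weight is 0, which
    is the "nothing to hand out" case.
    """
    live = {vip: w for vip, w in weights.items() if w > 0}
    if not live:
        return fallback
    vips = list(live)
    return rng.choices(vips, weights=[live[v] for v in vips], k=1)[0]


normal = {"192.0.2.10": 70, "192.0.2.20": 30}
drained = {vip: 0 for vip in normal}

print(pick_vip(normal))                          # one of the two VIPs
print(pick_vip(drained))                         # no answer at all
print(pick_vip(drained, fallback="192.0.2.99"))  # e.g. a static "down" page
```

With the weights drained and no fallback, whoever asked for the name gets nothing back, and reports that as a lookup failure.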
Re:Duh by PaganRitual · 2010-09-26 14:34 · Score: 2, Funny

This whole situation does explain why my mother appeared to be sick on the couch at my parents' place on Thursday afternoon when I paid them a visit. With all the shaking and huddling under the covers and looking pale-faced I presumed she had come down with the flu or something.

Then again we're dealing with farmville addicts and you can't reason with addicts.
They aren't addicts, that's patently unfair. They can stop any time they want. What is most admirable about them is that they are simply so time-savvy that they coincide those times at which they wish to stop with the periods during which their crops have to be left to grow. Once the crops are ready for harvest, they desire to play again. It's really very simple and implies no addiction whatsoever.
Seriously though, 2.5 hours? The experience I have with Farmville gives me a vague recollection that there are a fair few crops that have a growth period of an hour or less, and the fact that the crops wither and become unusable in the same time they take to complete their growth makes me wonder how many people petitioned Zynga for free ... well, the game is free so technically (and literally) nothing of value was lost, but still, I'm sure they were crying about something.
Now shut-up, it's nearly 4:01 server time and my rogue still needs the Brewfest boss' dagger to drop for it. 5 times and all I've seen is the mace which I can buy for fuck all anyway. My warlock has had two daggers already; maybe it's payback for the Midsummer event when my rogue got the staff twice and my warlock never saw it. THIS IS SUCH BULLSHIT.
Re:Duh by vlm · 2010-09-26 23:10 · Score: 1

But did they even need a controllable ramp up of user requests? It sounded like the overloaded system was overloaded by internal requests, that were unrelated to the number of requests they got from end users.
When you hear hoofs, think horses not zebras.
Seeing my servers spike to 100% CPU or 100% I/O and stay there, I'd look outside first before looking inside... So my first goal would be to act for a controllable ramp up of user requests. If the systems are so overloaded I can't troubleshoot at 100% of users, maybe I COULD log in and troubleshoot at 50 or 90 % load.
Also, I've worked at places that won't upgrade until the cost of outages due to high utilization is some large multiple of the cost of upgrading; this would be a strong indication that Facebook works the same way.

--
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger

Ageism by Vahokif · 2010-09-26 04:05 · Score: 5, Informative

The 27-year-old communications protocol

So? TCP/IP is 36 years old.

Re:Ageism by Anonymous Coward · 2010-09-26 04:13 · Score: 0

Yeah, but TCP/IP at least shows up for work
DNS however...
Re:Ageism by Anonymous Coward · 2010-09-26 04:15 · Score: 1, Funny

PILFS!
Re:Ageism by Anonymous Coward · 2010-09-26 04:32 · Score: 0

And broken in practice. We've had that new version for 12 years now, but not many have gotten around to implementing it yet.
Re:Ageism by morgan_greywolf · 2010-09-26 05:03 · Score: 4, Insightful

Really? DNS is broken? So typing say, http://slashdot.org/ doesn't work for you?
No. DNS has a few security issues, but they're mostly minor. The fact that DNS works for millions of people every day without issue at least 99% of the time proves that DNS is a successful design, even if it could use some security updating.

--
My blog
Re:Ageism by BrokenHalo · 2010-09-26 05:10 · Score: 1

Who are those? Programmers I'd like to fuck? Sorry, that doesn't compute.
Re:Ageism by kasperd · 2010-09-26 05:14 · Score: 2, Interesting

I think that comment was referring to the fact that some recent announcement said there are now 5 billion devices on the internet, and IPv4 supports only up to 3.7 billion devices.

--

Do you care about the security of your wireless mouse?
Re:Ageism by kasperd · 2010-09-26 05:22 · Score: 5, Insightful

Some people think technology should be replaced just because it is old. But really, it should be replaced if it doesn't suit our needs and there is a different technology that does suit it.

It is better to replace a 1 year old technology that does not suit our needs than to replace a 50 year old one that does. Usually when replacing, you want to replace with something newer. But in some cases it may turn out to be better to replace a new and misdesigned technology with an older and proven one.

That said, there are improvements to both IP and DNS which should be rolled out because they fix real problems. The rollouts are not happening as fast as they ought to, mainly because it is problematic to roll out a change to the entire Internet, especially when not everybody involved is cooperating.

But I don't think that really has anything to do with this outage.

--

Do you care about the security of your wireless mouse?
Re:Ageism by oldspewey · 2010-09-26 05:26 · Score: 2, Funny

So? TCP/IP is 36 years old.
Yeah, but it still lives in its parents' basement.

--
If libertarians are so opposed to effective government, why don't they all move to Somalia?
Re:Ageism by morgan_greywolf · 2010-09-26 05:31 · Score: 2, Informative

What does IPv4 have to do with DNS? (hint: nothing. Modern DNS servers support IPv6)

--
My blog
Re:Ageism by Anonymous Coward · 2010-09-26 05:43 · Score: 0

What does IPv4 have to do with DNS?
Maybe you should answer that question. After all you were the one writing a comment about DNS in reply to a comment saying that IPv4 is broken.
Re:Ageism by the_womble · 2010-09-26 06:49 · Score: 1

The people who think that are
1) The people with patents on the new technology, or who are planning to sell stuff for it.
2) The people who have been convinced by the marketing budgets made possible by 1)
Re:Ageism by Anonymous Coward · 2010-09-26 06:56 · Score: 0

Different AC here - I guess the "P" stands for "Protocols".
Re:Ageism by dlgeek · 2010-09-26 07:01 · Score: 2, Insightful

And is definitely showing its age. There's been a big cry for years from those working at the really high end of networking that we need to replace (really just extend) TCP because it doesn't work well with high bandwidth-delay-product links. This is because the max window size and ramp-up algorithm (slow start) don't allow you to saturate the pipe quickly enough or even at all. There are several proposed extensions floating around to fix the problem but none of them have widespread adoption.
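To put numbers on the window-size point, here is a small worked calculation in Python. It assumes the classic 16-bit window field with no window scaling; the 100 ms round trip and the 1 Gbit/s link are example figures, not measurements.

```python
def max_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on TCP throughput with one window in flight per round trip."""
    return window_bytes * 8 / rtt_seconds


def bandwidth_delay_product_bytes(link_bps, rtt_seconds):
    """Bytes that must be in flight to keep the link full."""
    return link_bps * rtt_seconds / 8


CLASSIC_MAX_WINDOW = 65535  # 16-bit window field, no window scaling
rtt = 0.100                 # assumed 100 ms round trip
link = 1_000_000_000        # assumed 1 Gbit/s link

ceiling = max_throughput_bps(CLASSIC_MAX_WINDOW, rtt)
needed = bandwidth_delay_product_bytes(link, rtt)
print(f"ceiling: {ceiling / 1e6:.2f} Mbit/s")
print(f"window needed to fill the link: {needed / 1e6:.1f} MB")
```

Under those assumptions a single connection tops out at about 5.24 Mbit/s, while filling the link would need roughly 12.5 MB in flight.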

This actually is the case with a lot of our old networking protocols - yes, they were incredibly well designed at the time, but many are showing that they need to be upgraded to reflect modern technology. Back to our original case, the original DNS protocol does have a lot of problems that have surfaced lately (think about the sequence number prediction stuff from a couple years back) which inspired the roll-out of DNSSEC. IPv4 is hitting its limits, but we're having trouble rolling out IPv6. How much easier would fighting spam be if SMTP had a strong authentication system for sent messages? Even HTTP, which has undergone several revisions, is again showing limitations, hence Google rolling out SPDY which allows predictive pushes, stream parallelism, etc.

I don't think anyone seeks to criticize the designers of these protocols, and the protocols have excelled and scaled far, far beyond anyone's wildest expectations. That being said, they have been showing cracks lately as technology has grown, and nothing looks like it did back when they were written. However, we have hit a point where the difficulty in upgrading or replacing them is actually starting to hold us back.
Re:Ageism by Anonymous Coward · 2010-09-26 07:14 · Score: 0

The comment was about TCP/IP, not IPv4/6. He's right, get over it. If he meant IP he should've said IP.
Re:Ageism by A+beautiful+mind · 2010-09-26 07:31 · Score: 1

I'll take the quality of design of IP or DNS over what passes for "The Web" these days. The browser as a concept is bending towards its breaking point as it tries to cope with the fact it's treated as a clown car.

I guess it's historical legacy that we started with HTML and crap like that for browser interaction and everything sort of grew from there, but we're doing the whole "web as an applications platform" wrong.

--
It takes a man to suffer ignorance and smile
Be yourself no matter what they say
Re:Ageism by Anonymous Coward · 2010-09-26 08:03 · Score: 0

Which still has nothing to do with DNS, never mind that what you said makes no sense since you can't pair TCPv6 with IPv4, you have TCP/IP(v4) and TCP/IPv6, only the first one of these is 36 years old.
Re:Ageism by morgan_greywolf · 2010-09-26 10:47 · Score: 1

1. TFA is about DNS.
2. There is no "TCPv6". There is TCP over IPv4 and TCP over IPv6. They are, however, the same TCP.
3. TCP/IP is also used as a broad term for the entire network stack. For example, DNS is an application-level protocol implemented on top of TCP and UDP over IP. But the entire thing is, loosely speaking, TCP/IP technologies.
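
To illustrate DNS as an application-level protocol, here is a short Python sketch that encodes a one-question query by hand. The query ID and the hostname are arbitrary picks for the example; nothing is sent over the network.

```python
import struct

QTYPE_A = 1      # IPv4 address record
QTYPE_AAAA = 28  # IPv6 address record


def build_query(hostname, qtype=QTYPE_A, query_id=0x1234):
    """Encode a single-question DNS query as it would go on the wire."""
    # Header: ID, flags (only RD set), QDCOUNT=1, ANCOUNT, NSCOUNT, ARCOUNT.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, ending with a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE followed by QCLASS=1 (IN).
    question = qname + struct.pack("!HH", qtype, 1)
    return header + question


packet = build_query("slashdot.org", QTYPE_AAAA)
print(len(packet), packet.hex())
```

The same bytes can be carried in a UDP datagram, or over TCP with a two-byte length prefix, and the only thing that changes when asking for an IPv6 address is the record type in the question.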

--
My blog
Re:Ageism by Anonymous Coward · 2010-09-26 11:20 · Score: 0

Thanks for proving my point. The OP was talking about TCP, and some idiot started mentioning IPv6/DNS.
Re:Ageism by definate · 2010-09-26 13:18 · Score: 1

AAAAhhh since when has DNS supported IPv6? I call shenaaaanigans!

--
This is my footer. There are many like it, but this one is mine.
Re:Ageism by bill_mcgonigle · 2010-09-26 17:28 · Score: 1

So? TCP/IP is 36 years old.
And can't even cope with lossy network connections (i.e. mobile).

--
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
Re:Ageism by Anonymous Coward · 2010-09-26 23:45 · Score: 0

"but we're doing the whole "web as an applications platform" wrong."
I agree there but this is a symptom of one click install being an impossibility in 2010.
People don't use the web as an app platform because it's better, easier to develop for, or provides control over the system.... It's one of the worst places to do your coding but you get one beautiful thing for all the trouble.
JoeBlow typing yourdumpcompany.com and instantly running an app without libraries, install media, or waiting......
Re:Ageism by godefroi · 2010-09-27 02:18 · Score: 1

On the contrary; it copes very well with lossy network connections. The real problem is YOU and your insistence on receiving everything that was sent, and in the correct order even.
If you were willing to see half-pages and miss images, then UDP would be a splendid protocol for you, and you wouldn't have to wait for timeouts and retransmissions.

--
Karma: Poor (Mostly affected by lame karma-joke sigs)
Re:Ageism by bill_mcgonigle · 2010-09-27 03:07 · Score: 1

On the contrary; it copes very well with lossy network connections
I suspect this is going for Funny, but just in case: the basic problem is that TCP congestion control sees a lossy network as busy and backs off on transmission speed.
It's an open research topic, and currently handled in L2 on mobile networks since TCP can't cope.
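A toy Python model of the back-off behaviour described above, using plain additive-increase/multiplicative-decrease. The starting window, the step sizes, and the loss pattern are assumptions chosen for illustration; real TCP stacks are considerably more elaborate.

```python
def aimd(loss_events, start=10.0, increase=1.0, decrease=0.5, floor=1.0):
    """Trace a congestion window across rounds.

    loss_events: iterable of booleans, True if a loss was seen that round.
    The sender cannot tell a radio glitch from a full queue, so any loss
    multiplies the window by `decrease`.
    """
    cwnd = start
    trace = []
    for lost in loss_events:
        if lost:
            cwnd = max(floor, cwnd * decrease)
        else:
            cwnd += increase
        trace.append(cwnd)
    return trace


clean_link = aimd([False] * 8)        # no loss at all
lossy_link = aimd([False, True] * 4)  # a loss every other round
print(clean_link[-1], lossy_link[-1])
```

After eight rounds the clean link's window has grown to 18 segments while the lossy one has shrunk to about 1.6, even though nothing in the model says the lossy link was ever congested.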

--
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)

DNS? Huh? by Animats · 2010-09-26 04:14 · Score: 1

Terrible article. What "DNS error"? Is Facebook running its own DNS servers that do something funny, or what?

As for DNS "moving to the cloud", DNS is already far more distributed than any of the "cloud" systems. Which is a good thing.

Couldn't be a DNS outage. by MrCrassic · 2010-09-26 04:21 · Score: 1

Yeah...from reading that Facebook note, it's pretty clear that DNS had nothing to do with the outage. Do you guys think the outage would've been better or worse had it been one?

Not mission critical! by j_col · 2010-09-26 04:30 · Score: 2, Insightful

I found the genuine panic from many Facebook users to this outage very amusing.

Re:Not mission critical! by Anonymous Coward · 2010-09-26 04:43 · Score: 1

I found the genuine panic from many Facebook users to this outage very amusing.
I, too, laugh at the misfortune of others.
Re:Not mission critical! by Anonymous Coward · 2010-09-26 05:15 · Score: 0

Misfortune?? This is a blessing!!
Re:Not mission critical! by kiwimate · 2010-09-26 07:21 · Score: 1, Flamebait

I suppose if I were an angst-ridden bitter friendless teenager I may have found it amusing too. Luckily, I'm an adult. (How sad that this comment is currently marked insightful.)
And - really? Genuine panic? I think that says more about the specific subset of Facebook users within your anecdote set than anything else. Or do you also extrapolate out from the frequent racist troll comments on Slashdot?
Re:Not mission critical! by definate · 2010-09-26 13:21 · Score: 1

LOL Yeah it was hilarious when people were complaining about being unable to get on Facebook. So funny that people need services to keep in contact with others, it's like why don't you just talk to them in person? I mean like, HELLO, am I the only one getting this? Geeze. If it's so important to you then you should be more redundant with your services, like, everyone knows that!

--
This is my footer. There are many like it, but this one is mine.
Re:Not mission critical! by Skylinux · 2010-09-26 23:24 · Score: 1

Des einen Leid ist des anderen Freud. -- One man's sorrow is another man's joy.

--
Everyone who buys Wild Hunt will receive 16 specially prepared DLCs absolutely for free, regardless of platform.

So what? Big Whoop! by WarpedCore · 2010-09-26 04:32 · Score: 1, Flamebait

It's an advertisement platform that rides solely on the ignorance of its users. So people had two and a half hours to take a break from their narcissism... this is something worthy of finger pointing?

Re:So what? Big Whoop! by Anonymous Coward · 2010-09-26 05:02 · Score: 1

high and mighty non-trend followers are pretty trendy, just saying...
Re:So what? Big Whoop! by _Shad0w_ · 2010-09-26 05:04 · Score: 1

There are adverts on there?

--
Yeah, I had a sig once; I got bored of it.
Re:So what? Big Whoop! by Anonymous Coward · 2010-09-26 05:09 · Score: 0

I'm way cooler because I participate in the trend, but only at a level that would barely be called surface. Once again my lack of dedication and inability to commit have paid off by inserting myself into an elitist class that hovers above all else. Unfortunately due to said attributes I am unwilling and unable to use this to any social advantage.
Re:So what? Big Whoop! by Sir_Lewk · 2010-09-26 06:09 · Score: 2, Funny

No. Facebook doesn't do data-mining, and they don't serve ads. They simply pull money out of their ass.

--
"linux is just DOS with a UNIX like syntax" -- Galactic Dominator (944134)
Re:So what? Big Whoop! by kiwimate · 2010-09-26 07:30 · Score: 2, Insightful

So is Slashdot.
I don't know that finger pointing is necessarily healthy - that tends to suggest CYA and childish blame games. But on a technical IT focused web site, one might suppose that a lessons learned exercise on the root cause of the failure of a massive website would be of interest and hopefully even an educational experience.
Re:So what? Big Whoop! by Skylinux · 2010-09-26 23:32 · Score: 1

But on a technical IT focused web site, one might suppose that a lessons learned exercise on the root cause of the failure of a massive website would be of interest and hopefully even an educational experience.
I can't remember ever seeing an article about a major outage from some big website where they delivered enough information that one could learn from it.
So what did you learn from this article? Don't fuck up when you are an admin or maybe to create a better error page.
Yes very informative and of interest to nerds, indeed.

--
Everyone who buys Wild Hunt will receive 16 specially prepared DLCs absolutely for free, regardless of platform.

OT: I don't care. by Anonymous Coward · 2010-09-26 04:40 · Score: 0

Sorry for posting Off-Topic, but is there a way to hide all stories tagged "facebook"?

FFS by Anonymous Coward · 2010-09-26 04:47 · Score: 0

Who cares? Enough about the facebook outage already.

They get no monies from me! by chucklebutte · 2010-09-26 04:57 · Score: 1, Interesting

With all the adblocking software, no scripting, and other misc Firefox plugins I have no ads from facebook, let alone any other page. Firefox is set to delete cookies and BS on exit and I keep my machine clean with bleachbit.

Facebook allows me to connect with lots of people I never would see, like buddies in the Army based in Japan for instance or friends in New York etc, etc.

I hide all the annoying spam ads for people's stupid farms, I have convinced many of my Facebook friends and family to stop playing those games and to take similar precautions.

Every company is evil, hell even my Linux distro has political agendas, ( damn Mint Linux!!! ) but what does it really matter? It is the cost of using technology. Until we change our idea about advancing ourselves and not our pocket books nothing will change, so stop with the Q_Q and pew pew, and just deal with it and be smart about what you do.

--
Visit my Forums?

Re:They get no monies from me! by BrokenHalo · 2010-09-26 05:18 · Score: 1

hell even my Linux distro has political agendas, ( damn Mint Linux!!! )

I had never heard of them (I'm an old Slackware hand, and more recently Arch), but Mint's webpage is so incredibly slow to load, it's impossible to see what that agenda is. It doesn't inspire much confidence in them. :-|
Re:They get no monies from me! by Anonymous Coward · 2010-09-26 07:08 · Score: 1, Informative

You won't find it on the home page. It was a post by a developer on the dev blog. He later removed it and apparently moved it to his personal blog.

Palestine Written by Clem on Sunday, May 3rd, 2009 @ 12:34 am | Main Topics
This is not the place to talk about this but I am deeply touched by what is happening over there. I feel disgust and guilt with us passively witnessing it and our money and weapons supporting it. I don't want to use my name or this project to push my own ideas about this but I spend a lot of time working and giving away, sharing and receiving to and from a lot of people.
I'm only going to ask for one thing here. If you do not agree I kindly ask you not to use Linux Mint and not to donate money to it.
I hope for these people to be able to live decently in the future and for me not to have anything to do with the misery they're in at the moment.
I promise not to talk about this anymore. I don't want any money or help coming from Israel or people who support the action of their current government.
Thank you for your understanding. This is very important to me.

Where's the beef? by Anonymous Coward · 2010-09-26 05:02 · Score: 0

No explanation for DNS is offered.

My guess is Facebook runs its own DNS servers and they were swamped by a DDoS of Facebook's own making.

There WAS some DNS issues too ! by ivan_w · 2010-09-26 05:06 · Score: 3, Informative

The confusion might have come from the fact that when I looked, there seemed to also be some DNS problem.

Basically, when asking directly, the servers that are authoritative for the zone were giving me a CNAME for the 'ANY' query, but not the associated A records, which it should, since the CNAME was pointing to a host name within the same authority. At this point, any sensible resolver stops asking !

This only lasted for a little while though - so it might have been a glitch or possibly a deliberate action related to how they were trying to fix the underlying issue itself - possibly averting traffic until they actually solved the actual problem.

--Ivan

Re:There WAS some DNS issues too ! by Skapare · 2010-09-27 23:03 · Score: 1

This kind of thing can happen when records are being changed (say from A to CNAME) and the A record has not expired from your cache yet. Did you do a "dig +trace" around the cache to verify?

--
now we need to go OSS in diesel cars

no by Anonymous Coward · 2010-09-26 05:20 · Score: 0

The summary's wrong. The problem was caused by one looping server, hence making it a DOS, not a DDOS.

Anonymous Coward. by Anonymous Coward · 2010-09-26 05:33 · Score: 0

Yet m.facebook.com worked the whole time this was going on.

Why DNS? by darth+dickinson · 2010-09-26 05:47 · Score: 1

Cause browsers is stoopid!

It wasn't DNS. by meerling · 2010-09-26 05:47 · Score: 0, Redundant

That was obvious, it showed symptoms of a DDoS attack, not a DNS problem. I find it funny it was caused by their own error.

Another failure waiting to happen by Skapare · 2010-09-26 05:59 · Score: 1

To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover.

Even when the database has a valid value, if failures to get a value from the database can create a growing cascade of errors, then this design is still poised for a future failure for simple things like a partial outage of databases or network access to them. Ideally, once the data was valid, the number of clients not getting a valid value should gradually decrease as more and more get valid values and don't have to requery. But if the scale was such that none could get anything when all were trying (and hence require a shutdown and slow start), it can all happen again with many other classes of failure. It can even happen with a transient error, if the transient is long enough to trip a certain threshold of clients.

So leaving that configuration correction system off for now makes sense. I would suggest a combination of a push system (originate at the database and push new values out ... but watch for security) and a randomizing of delay times inserted for the data pulls.
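
A minimal Python sketch of the pull side of that suggestion: wait a random delay before re-querying, and treat a failed query differently from an invalid value. Keeping the cached entry on failure is this sketch's own assumption about a safer design, not something taken from Facebook's note, and every name here is hypothetical.

```python
import random


class BackendError(Exception):
    """The database could not answer (overload, timeout, and so on)."""


def refresh_key(cache, key, query_db, is_valid, max_delay=5.0,
                rng=random, sleep=None):
    """Refresh one cache entry without amplifying a database failure.

    Returns True if the cache now holds a freshly validated value.
    `sleep` is injectable so the sketch can be exercised without waiting.
    """
    delay = rng.uniform(0, max_delay)  # spread clients out over time
    if sleep is not None:
        sleep(delay)
    try:
        value = query_db(key)
    except BackendError:
        # A failed query says nothing about the cached value: keep it.
        return False
    if not is_valid(value):
        cache.pop(key, None)  # only a genuinely invalid value is dropped
        return False
    cache[key] = value
    return True


def overloaded(key):
    raise BackendError("database overloaded")


cache = {"config": "old"}
refreshed = refresh_key(cache, "config", overloaded, is_valid=bool)
print(refreshed, cache)
```

In real use `sleep` would be `time.sleep`; the point is that an overloaded database leaves the cache alone instead of sending every other client back to the origin.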

--
now we need to go OSS in diesel cars

And somebody here actually cares??? by AliasMarlowe · 2010-09-26 06:00 · Score: 1

What percentage of slashdotters actually noticed the facebook outage when it happened? As opposed to merely participating in the post-hoc commentary after they read about it. It should have been posted to slashdot's idle category.

--
Those who can make you believe absurdities can make you commit atrocities. - Voltaire

Re:And somebody here actually cares??? by ryanov · 2010-09-26 18:23 · Score: 1

I noticed, and saw the DNS message when it was there. When I read this, I said to myself "umm, why did people blame DNS? That's what the message said!"
Re:And somebody here actually cares??? by Anonymous Coward · 2010-09-27 02:47 · Score: 0

I noticed when half of my Friends suddenly appeared on AIM in the middle of the day.

Take down any company's entire network access ... by Skapare · 2010-09-26 06:05 · Score: 1

... and browsers will think their DNS is dead ... because, well, it is ... and is the first thing a browser needs to access.

--
now we need to go OSS in diesel cars

Did Facebook have an internal DNS failure? by Animats · 2010-09-26 06:07 · Score: 1

The "error page" is clearly a Facebook server reporting a DNS failure within Facebook's own network. Facebook requests are processed by user-facing servers which make RPC calls (not HTTP) into Facebook's internal network. Machines in multiple locations may be involved in generating a single Facebook page. If their in-house DNS system for organizing their internal network failed, they might produce messages like that.

Re:Did Facebook have an internal DNS failure? by rekoil · 2010-09-26 06:21 · Score: 3, Informative

It didn't fail, they turned it off. This was the easiest way to "shut off the entire site" as their post-mortem describes. The DNS errors users saw were being generated by the front-end HTTP proxies, not by client browsers, which caused most of this confusion. Once the database issue cleared, they reactivated the DNS entries for the back-end servers one cluster at a time and the site came back.
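A hedged Python sketch of why the error came from the proxy rather than the browser. The handler, the back-end hostname, and the 502 status are assumptions for illustration, not Facebook's actual setup.

```python
def handle_request(path, resolve_backend, fetch):
    """Return (status, body) for one proxied request.

    resolve_backend: callable(hostname) -> IP, raises LookupError on failure.
    fetch: callable(ip, path) -> response body from the back end.
    Both are hypothetical hooks, not a real proxy API.
    """
    backend_host = "app.backend.example"  # made-up internal name
    try:
        ip = resolve_backend(backend_host)
    except LookupError:
        # The browser already resolved and reached the proxy; the lookup
        # that failed is the proxy's own, inside the site's network.
        return 502, f"DNS failure resolving {backend_host}"
    return 200, fetch(ip, path)


def dns_removed(hostname):
    raise LookupError(hostname)


def dns_ok(hostname):
    return "10.1.2.3"


def fake_fetch(ip, path):
    return f"page {path} from {ip}"


print(handle_request("/home", dns_removed, fake_fetch))
print(handle_request("/home", dns_ok, fake_fetch))
```

Either way the client gets an HTTP response from the front end, which is why a lookup of the public name from the user's own machine would still succeed.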
Re:Did Facebook have an internal DNS failure? by Skapare · 2010-09-26 11:02 · Score: 1

You seem informed. Maybe you can explain why it is that clients would not be picking up the corrected info and reducing their "attack" on the database servers (more so than everything being turned off and back on).

--
now we need to go OSS in diesel cars
Re:Did Facebook have an internal DNS failure? by rekoil · 2010-10-04 17:31 · Score: 1

This is explained in the post-mortem. Basically, the problem was that clients were reacting to corrupt data being served up by the origin DB cluster the same way that they reacted to bad data coming from the memcached cluster - by deleting the offending entry in memcached and re-sending the query to the origin DB. So a client queried the origin, got bad data, and then deleted the key from memcached - resulting in every other client (tens of thousands of them, most likely) then querying the cluster for the same key* at the same time. Instant meltage ensued.
Now think about what happens when you have tens of thousands of boxes all querying the same cluster for the same keys all at the same time. Some clients will get the answer, but others will get an invalid response back from a melting mysql box. And when that happens, what does the client do? Exactly what started the mess in the first place - it *deletes the key from memcached*. So if any other clients were happily using the cached copy of the key data, they aren't anymore...and back to the origin they go. Lather, rinse, repeat until someone hits the Big Red Button and restarts the whole shebang in an orderly fashion (i.e. only re-activating a few racks at a time).
* More likely, many keys were corrupted on the origin. A single key would only impact one memcached instance and most likely only one mysql server (read about consistent hashing for more detail) and not cause this level of chaos.
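For anyone who wants to see the loop rather than read about it, here is a toy model of the behaviour described above. It is a sketch under loud assumptions: a plain dict stands in for memcached, the Origin class stands in for the DB cluster, and the key name, the CORRUPT marker and all function names are invented for illustration. It only counts origin queries; it does not model load or timeouts.

```python
CORRUPT = "<corrupt>"  # hypothetical marker for a value that fails validation


class Origin:
    """Stand-in for the origin DB cluster; counts how often it is queried."""

    def __init__(self, corrupt_keys=()):
        self.corrupt_keys = set(corrupt_keys)
        self.queries = 0

    def get(self, key):
        self.queries += 1
        return CORRUPT if key in self.corrupt_keys else f"value-of-{key}"


def is_valid(value):
    return value != CORRUPT


def client_read(key, cache, origin):
    """One client read using the delete-and-requery policy from the post."""
    value = cache.get(key)
    if value is not None and is_valid(value):
        return value  # healthy cached copy, origin is not touched
    value = origin.get(key)
    if is_valid(value):
        cache[key] = value
    else:
        # Bad data from the origin: purge the key, so every other
        # client is sent back to the origin as well.
        cache.pop(key, None)
    return value


def origin_queries(n_clients, corrupt):
    """Count origin queries when n_clients read the same key in turn."""
    key = "profile:42"
    origin = Origin(corrupt_keys={key} if corrupt else ())
    cache = {}
    for _ in range(n_clients):
        client_read(key, cache, origin)
    return origin.queries


if __name__ == "__main__":
    print("healthy key:", origin_queries(1000, corrupt=False))
    print("corrupt key:", origin_queries(1000, corrupt=True))
```

With a healthy key the first read fills the cache and the origin sees one query for 1000 reads; with a corrupt key nothing ever stays cached, so all 1000 reads land on the origin. Scale that up to many keys and tens of thousands of clients and you get the meltdown.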

Clumsy title by Haiyadragon · 2010-09-26 07:02 · Score: 1

Browsers are fucking software. They don't blame anything for anything.

Facebook was down. That's the only thing that matters to most people.

Re:DNS? Huh? by Anonymous Coward · 2010-09-26 07:06 · Score: 0

"Fake sidebar"

Look up whoosh in the dictionary. That's not a mirror.

You must have interesting firewall logs... by RulerOf · 2010-09-26 07:47 · Score: 2, Funny

look at my own /etc/hosts file. From time to time I manage to bite myself on the ass with my block-list

#Below is my custom DNS blocklist
127.0.0.1 om.nom.nom.

user@localhost:~$ ping om.nom.nom.

--
Boot Windows, Linux, and ESX over the network for free.

DDOS? by Anonymous Coward · 2010-09-26 07:56 · Score: 0

Wouldn't this just be DOS or have the two separate terms become synonymous?

I disagree by Anonymous Coward · 2010-09-26 08:09 · Score: 1, Informative

It is the most used website in the world (more user-hours/month spent on Facebook than on any other site), the fastest growing internet community (when measured in new users/month), etc... And as such it is an engineering masterpiece (in software engineering and probably in several other areas, too). When it goes down for several hours, it is a newsworthy event.

For those of us who work for advertising agencies, FB downtime is also a financially notable event.

Re:I disagree by Anonymous Coward · 2010-09-26 08:47 · Score: 0

So you don't know and instead went on some tangent about something else? I have no doubt you are a diehard facebook user.
Re:I disagree by Kvasio · 2010-09-26 08:54 · Score: 2, Interesting

Yet, you failed to notice that /. is a site for nerds.
Many nerds do not strive to cultivate their social skills.
Checking their friends' status on a social network might not be at the top of their agendas.
So: the event was notable, but not very important to many slashdotters.

dns by xander19 · 2010-09-26 08:33 · Score: 1

it might be a buzz acronym on twitter anyway, 'cause it also means DNA in German

PEP by HBI · 2010-09-26 08:52 · Score: 1

There is a whole market devoted to handling high delay TCP connections. It works. It's what I do. Well, part of it.

Replacing the protocol for this reason would be kind of lame.

--
HBI's Law: Frequency of calling others Nazis is directly correlated with the likelihood of the accuser being Communist.

The REAL Reason Facebook went down! by Anonymous Coward · 2010-09-26 12:07 · Score: 0

Someone digging in Farmville didn't call before digging and cut an underground cable.

Strong authentication of senders: 2 drawbacks by tepples · 2010-09-27 06:20 · Score: 1

How much easier would fighting spam be if SMTP had a strong authentication system for sent messages?

There is one, called OpenPGP. There is another one, called S/MIME. Implementation of these in real-world MUAs awaits a decision on best practices for how strong the authentication needs to be. Stronger authentication has two downsides. First, the cost of obtaining a digital ID goes up with strength; even with the OpenPGP web of trust, travel to a key signing party hundreds of km away is not free. Second, requiring strong digital ID makes it difficult for someone living under a government that suppresses speech to express politically unpopular ideas.

Re:Strong authentication of senders: 2 drawbacks by jgrahn · 2010-09-27 08:36 · Score: 1

How much easier would fighting spam be if SMTP had a strong authentication system for sent messages?
There is one, called OpenPGP. There is another one, called S/MIME. Implementation of these in real-world MUAs awaits a decision on best practices for how strong the authentication needs to be. Stronger authentication has two downsides. First, the cost of obtaining a digital ID goes up with strength; even with the OpenPGP web of trust, travel to a key signing party hundreds of km away is not free.
You wouldn't have to go to such extremes though. If everyone had as a personal policy "only read OpenPGP-signed mail, and distrust mail signed with a key I haven't personally downloaded from a key server", spam and mail worms would be less of a problem.
(Not that it will ever happen.)

Spam solutions copypasta by tepples · 2010-09-27 13:23 · Score: 1

If everyone had as a personal policy "only read OpenPGP-signed mail, and distrust mail signed with a key I haven't personally downloaded from a key server"

Then it would still fall under the "Requires immediate total cooperation from everybody at once" line of the well-known copypasta, and possibly "Mailing lists and other legitimate email uses would be affected" and "Many email users cannot afford to lose business or alienate potential employers" depending on how it is implemented.
