Confirmed Gmail / Google App Outage
mbone writes "Earlier today there was a confirmed Google outage which got a lot of attention from network operators. From a post to NANOG after everything calmed down: 'Google ack'd a maintenance on their core network did not go as planned-Forced traffic to one peer link that was unable to handle all the traffic. Maintenance has been rolled back. Issue has been restored.' This is exactly what makes me nervous about cloud computing and data storage. It's bad enough when I screw up a config and it takes down my mail, but what about when it happens to the entire globe at once?" Several readers also point to CNET's coverage of the outage.
Update: 05/14 19:25 GMT by T : CWmike adds this: "Steven J. Vaughan-Nichols writes that what may be happening is a massive DDoS attack. Based on the size of the attack that would be needed to interfere with Google, I believe that it's quite likely to be the result of an attack from the controllers of the Windows worm, Conficker. Another theory that has been put about — that the problem was due to AT&T NOC routing problems — does not appear to hold water, writes Steven."
Update: 05/14 21:01 GMT by T : Google's put up a low-detail explanation on their blog that says "An error in one of our systems caused us to direct some of our web traffic through Asia, which created a traffic jam. As a result, about 14% of our users experienced slow services or even interruptions."
In comments from Google Admins, they said "oops." :)
Serious? Seriousness is well above my pay grade.
My Google voice account went all sorts of haywire.
1) Text messages sent from the web got duplicated. One person got near 10 duplicates in quick succession. I also got duplicate messages back.
2) My number doesn't work. If you call it you get a "Currently unavailable" message.
3) A few calls that came in before the outage aren't showing up in the Received/Missed calling list.
...and take a stroll to the great big place known as "outside".
call me....
And yet somehow miraculously we are all still alive. The sky is not falling!
When it's just your mail server down, everyone else gets annoyed at you because you're not {gett,receiv}ing mail they're {sending, expecting from} you. When the cloud is down, everyone can just chill and be thankful that they're not going to log on to find a whole stream of new emails.
This sucks for docs, though using a completely cloud-based doc solution is a bit mental. Even if you're mobile it's best to have a local copy to save on battery life.
Nick
If everybody goes down, nothing happens and you just go outside (beyond the doors, out into the bright white light) and enjoy your day until 'they' fix it.
What's not to like?
Faster! Faster! Faster would be better!
If it bothers you then use a mail client to download your mail from Google. As someone who has been using my Gmail account all week, I didn't even notice a problem; the whole thing seems overblown.
Considering the amount of usage google sees, a minor interruption like today's issue is nothing that worries me much at all.
It's not like, oh, say, Comcast, who left me without an internet connection for a month because their technician was drunk and rammed his truck into the large metal junction box where my apartment's internet connection tied into everyone else's. It only took them a month to replace the box and rewire everything.
Sig Follows: "Suppose you were an idiot. And suppose you were a member of Congress. But I repeat myself." -- Mark Twain
Having run my own mail server, and used mail servers run by companies I work for, I'll -gladly- take GMail's track record for reliability. Even with no 'guarantee', it's been a hell of a lot better than anything else I've experienced.
And what's -really- the difference between a server going down locally that affects you and a server going down globally that affects you? Nothing.
"If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
Take a good look, kids. Google was down and Twitter was up. This only happens once in every 3,271 days. You probably won't see it again, at least in Twitter's lifetime...
Anyone who has ever used or administered a mail server has experienced a mail server going down. This is not news.
What is news is that Google Mail has been up for so long until now. And current accounts seem to indicate the outage lasted about one hour.
One hour of down time after five years of steady service is good enough for me. It is better than any other mail server I have ever used.
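For a rough sense of scale, the arithmetic behind "one hour of downtime in five years" can be sketched as below. The one-hour and five-year figures come from the comment above; the assumption that this was the only outage in that period is mine and is not established anywhere in the thread.

```python
# Back-of-the-envelope availability, assuming a single one-hour outage
# over five years of otherwise continuous service (an assumption).
HOURS_PER_YEAR = 365 * 24  # leap days ignored for simplicity


def availability(downtime_hours: float, years: float) -> float:
    """Return the fraction of time the service was up."""
    total_hours = years * HOURS_PER_YEAR
    return 1.0 - downtime_hours / total_hours


pct = availability(downtime_hours=1, years=5) * 100
print(f"{pct:.4f}%")  # prints 99.9977%
```

Under those assumptions that is a little better than "four nines", which is a bar few self-run mail servers would clear.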
If no life is lost, there are no worries with cloud computing (hence, cloud computing should be used for non-life-critical services; Gmail is a perfect example).
Of course, VCs may have lost revenue, Capitalists may sweat over lost stock trades, teenagers may lose that one twitter about how cool Miley is to them, some adult may not get that date tonight from craigslist, you may miss that one Hulu commercial, some K-12 kid may not be able to send out his homework, some college kid can't access his pirate bay music lists, or the USPoTC may miss that extra minute to promote his stimulus bill.
In the end, I hope cloud services show us that we are not slaves to time. The human race has advanced enough to know that already. And really, if "the cloud" is down for an hour, maybe you should go outside and enjoy the wonders of nature and peace for once, or talk to someone physically. It raises the question: "can it wait?"
"It's bad enough when I screw up a config and it takes down my mail, but what about when it happens to the entire globe at once?"
That's much better for you. Instead of having to explain to everybody that the dog ate your homework or whatever, you can sit back and let them explain it to you...
It will suck when everything's on the Cloud because I won't be able to claim my server's been down all day while I'm out playing golf.
When things came back up this afternoon it was an old backup version and several of my settings had been rolled back.
I guess this is one instance where Google's perpetual beta status really applied - those using Voice for mission critical communications were up a creek.
Ah, we all get our power from the "electrical cloud". We all need private generators. Ah! Ah!
If we're talking about the same outage that caused google advertisements to hang forever this morning, it caused access to many unrelated websites to hang, including slashdot itself. This seems like a really bad single-point-of-failure issue. If a site can't display ads, shouldn't it come up anyway?
It's bad enough that I have to wait tens of seconds for Captcha content to pop up long after a login page has loaded.
This is starting to get annoying. If this is "cloud computing", I'd rather stay on earth.
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
When done correctly, the "cloud" is the internet itself. Google has network design issues; some of their key services only have a couple of ingresses into Tier-1 providers:
http://en.wikipedia.org/wiki/Tier_1_carrier
I don't work for them, I don't hold their stock, and I am not (currently) a customer, so I have no skin in their game, but Internap as a BUSINESS MODEL becomes more important.
If you are a major company that comes to rely HEAVILY on Cloud Services, you want to ensure that you have on-ramps into several Tier-1 providers ALL AT ONCE, without having to contract individually with 4 or 5 of them yourself. I predict more companies will mimic this model of aggregation, essentially handling the business of BGP optimization for customers, and handing customers 2 redundant pipes and saying "hey, don't worry if San Fran has an earthquake and these peering points blow up, we'll get you out via this Tier-1 backbone over to your cloud computing provider's service via this backbone within seconds. Let us handle that."
Especially with ISPs that get into pissing matches, like when Cogent and Telia got into it, and cut each other off. If you had Cogent as your only ISP, you were screwed if you wanted to get to a bunch of Swedish sites, because Cogent's CEO was trying to play chicken over some tariff rates. The cloud computing model will no longer tolerate that, it's not just some website, it's a BUSINESS function.
that's my take at least.
while I was trying to get work done today. This was pretty scary. I mean, besides not being able to search google and check my email, there are other sites that wouldn't work. Some apache projects and also nabble use google analytics apparently, so I couldn't even load those pages. Also, I couldn't load slashdot's main page because it apparently uses googleads or something like that. What suggestions do people have for this? What other sites were not accessible during the outage?
Many sites rely on Google in ways that aren't immediately evident - for instance, during the outage, Google Analytics connections were lagged, which meant that all of our sites that incorporate Analytics were ALSO lagged.
What's amazing is the extent to which an outage on a single entity can bring down ALL of the other entities that surround it -- not just those who rely more visibly, e.g., Google Docs, on their services.
Yikes!
--Dave
Everything is fine. It was simply a glitch in the holo-matrix. The Doctor has been tinkering with his program again and caused a feedback loop between the holo-emitters and EPS conduits on deck five. Seven has corrected the malfunction.
Notes or contacts causing important meetings to be missed or leaving attendees un/less prepared. It's easy to say back everything up, but in the real world under stress (or laziness, or stupidity) you tend to stick with simpler work-flows. I like SaaS for non-critical applications; maybe it's an age thing, or maybe critical service/hosted solutions are simply still new enough that the kinks in reliability haven't been fully worked out.
Quack, quack.
It's bad enough when I screw up a config and it takes down my mail, but what about when it happens to the entire globe at once?
I was reading this comment and it occurred to me that the latter is actually preferred. With the first option, your systems are messed up, but everyone else wants you to continue to conduct business. With the latter situation, your systems are down and so are the people who would normally be trying to reach you.
The Cylons are coming, the Cylons are coming!!!
This speculation from the ComputerWorld blog doesn't belong in the post. Even the blog author says it's conjecture. It is especially ridiculous since the NANOG post in the second link already explained that the problem was a routing error at Google.
Hence the reason why we need a whole storm of clouds... and some APIs for submitting the same jobs to multiple clouds. If one goes down you start sending them off someplace else (maybe someplace slower or more expensive) for the duration of the outage.
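A minimal sketch of that "storm of clouds" idea, assuming each provider can be wrapped in a plain Python callable that raises on failure. The provider names and the `submit_with_failover` helper are hypothetical, made up for illustration; no real cloud API is being shown here.

```python
from typing import Any, Callable, List, Sequence, Tuple

# A provider is (name, submit_function). Both are hypothetical stand-ins
# for whatever client library a real provider would offer.
Provider = Tuple[str, Callable[[Any], Any]]


class AllProvidersFailed(Exception):
    """Raised when every provider in the list rejected the job."""


def submit_with_failover(job: Any, providers: Sequence[Provider]) -> Tuple[str, Any]:
    """Try providers in preference order; return (name, result) from the first that works."""
    errors: List[Tuple[str, Exception]] = []
    for name, submit in providers:
        try:
            return name, submit(job)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append((name, exc))
    raise AllProvidersFailed(errors)


# Stub providers: the preferred one is "down", the slower fallback works.
def primary(job):
    raise ConnectionError("primary cloud unreachable")


def fallback(job):
    return f"done: {job}"


used, result = submit_with_failover("resize-images", [("primary", primary), ("fallback", fallback)])
print(used, result)  # prints: fallback done: resize-images
```

Ordering the list by preference (cheapest or fastest first) gives the "someplace slower or more expensive for the duration of the outage" behaviour for free. A real version would also need idempotent jobs, since a provider may fail after partially doing the work.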
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
E-mail is supposed to be reliable because of its distributed nature. It is not supposed to be on a single "cloud"; distributed machines should be caring for it. It is just like XMPP vs. old-fashioned MSN/AIM etc. junk.
:)
Let me show what I see with the "cloud" (which is one of the worst abused terms) right now:
(wget)
s3.amazonaws.com[72.21.207.242]
Saving to: `423.dmg'
10% [===> ] 4109203 5,54K/s eta 79m 33s
So, a highly successful Mac shareware product which I love couldn't deal with bandwidth issues and offloaded its downloads to Amazon S3. Amazon S3, on the other hand, while showing that it works perfectly (on its status page), has a 450 ms ping response, and I am back to 56K speed on a 4 Mbit ADSL line. It looks like something is wrong with the Level3 hops.
The cloud is not offloading all mail to one central server, nor putting all files on Amazon S3; it doesn't even exist yet. When people do 10x realtime h264 encoding with their Xgrid-enabled portable Macs running Snow Leopard and store the file anonymously on thousands of other machines, that would be some kind of "cloud". Right now, "cloud" is just an icon for that overpriced me.com (dotmac) service.
Will this have any impact on internet porn?!
Strange. My Firefox 3.0.10 somehow got affected by this outage. It just refused to open! It loaded about 30 MB of data into RAM but went nowhere from there. The browser window never appeared, and I tried to relaunch it several times, but to no avail! Very odd... Did anyone else have problems with it? Opera, although not able to open Google.com, opened fine!
Is Firefox tied to Google like E.T. was tied to Elliot?
Right when it started I had trouble loading /. even. It kept stalling while loading Google ads. However that seemed to only last for 10 minutes while my Gmail, iGoogle, etc. was slow for 1/2 an hour or so. Maybe they fixed the ads quickly...
The Google status page at http://www.google.com/appsstatus# has some live info.
ZDNet is reporting that any traffic that is routed through AT&T was not able to get to Google
http://blogs.zdnet.com/BTL/?p=18064
Google says that a traffic overload in Asia was the problem:
http://googleblog.blogspot.com/2009/05/this-is-your-pilot-speaking-now-about.html
So it looks like a switch/router issue caused a long packet path, which caused timeouts, which caused unhappy users.
I blinked a few times over the extra update, until I read that it was just Steven J. Vaughan-Nichols' speculation. Apparently, the claim that Google got DDoS'ed is not confirmed.
This makes me wonder whether Google is the "too big to fail" of the technology world, as Citibank/Bank of America/AIG are to the financial world. Would the US government have to prop up Google and its services some day with massive bailouts because the failure of Google could be catastrophic for the general public? The cost of this failure to individual users may not be high (a few minutes of lost access to mail, a hit to efficiency because you cannot search, etc.) but the cumulative cost across the globe could be very high.
I post, therefore I am
Am I missing something? What is "ack'd a maintenance"?
Sounds like someone's watched the new Star Trek a few times too many...
To reign is to serve.
MITM!!! Quick, change your passwords. The Asians sniffed them all! LOL
Companies expecting to do mission critical work over the Net need dedicated lines, dedicated machines, and somebody from THEIR company overseeing the system.
Relying on other people is a sure route to disaster. It's hard enough relying on your OWN people.
The Net is NOT fault-tolerant - unless YOU make it so.
Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
And what's -really- the difference between a server going down locally that affects you and a server going down globally that affects you? Nothing.
The difference is that you don't have a global meltdown of every web-based service that is dependent on Google.
At work I experienced the issue, but I could remote in to my home computer and load gmail etc no problem...I live 3km from my office.
/. loaded slowly for me while the issue was occurring, and gmail was totally inaccessible.
Even
I was going to email you all about this earlier but I couldn't send email for some reason.
Cloud Cloud cloud cloud? Cloud cloud "cloud cloud cloud?"
Cloud cloud cloud's cloud cloud cloud Cloud Computing cloud cloud cloud cloud cloud...
For those with a significant investment in the cloud, how do you plan for disaster recovery?
Somebody must have typed "google" into Google. It's the only possible explanation.
Instead of 'all your data in the cloud', how about all your data on a portable device that you plug into a ROM-type device that provides basic screen, mouse, keyboard and Internet functionality? The only thing out there 'in the cloud' would be a set of servers providing identity and virtual location information, as in Skype, where the server keeps a telephone directory but the communication is end-to-end. That means if one service fails I can fall back to the others.
"We purchased a 25K euros firewall last month with which we had some issues"
..
What for? All you needed was a redundant PC and SmoothWall, not that a firewall is much good in this day and age of RPC over HTTP and various apps allowed to open most any high port. Firewalls were only really useful when the original nix system only allowed 'root' to open low ports for sending, so any packets received (nix-to-nix) from one of these ports were deemed semi-validated. Whatever, read what an expert has to say on firewalls and security.
"using firefox to type adresses in the search bar, nothing was responding"
Why not have a heartbeat applet running on the firewall that SMSes your phone in the event of an outage? That way you don't have to set up camp in the server room, clicking on things.
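A rough sketch of such a heartbeat check, assuming a plain HTTP probe is a good enough signal and that SMS delivery is handled by some gateway you already have. The `alert` callback is a hypothetical placeholder for that gateway, and the threshold and check counts are arbitrary illustration values.

```python
import time
import urllib.request
from typing import Callable


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return False


def heartbeat(probe: Callable[[], bool],
              alert: Callable[[str], None],
              failures_before_alert: int = 3,
              max_checks: int = 10,
              interval: float = 0.0) -> int:
    """Run probe() repeatedly; call alert() once per outage. Returns the alert count."""
    consecutive = 0
    alerted = False
    alerts = 0
    for _ in range(max_checks):
        if probe():
            consecutive = 0
            alerted = False  # recovered, so the next outage may alert again
        else:
            consecutive += 1
            if consecutive >= failures_before_alert and not alerted:
                alert(f"outage: {consecutive} consecutive failed checks")
                alerted = True
                alerts += 1
        time.sleep(interval)
    return alerts


# Demo with a canned probe sequence instead of the network, and a list
# standing in for the (hypothetical) SMS gateway.
outcomes = iter([True, False, False, False, False, True, False])
sent = []
count = heartbeat(lambda: next(outcomes), sent.append, max_checks=7)
print(count, sent)
```

In real use the probe would be something like `lambda: is_reachable("http://www.google.com/")` with an interval of a minute or so. Requiring several consecutive failures before alerting avoids a text message for every dropped packet.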
There is a cool graph on the outage from Wired and Arbor Networks.