Slashdot Mirror


Keeping an Eye Out When Sites Go Down

miller60 writes "Are major web sites going down more often? Or are outages simply more noticeable? The New York Times looks at the recent focus on downtime at services like Twitter, and the services that have sprung up to monitor outages. When a site goes down, word spreads rapidly, fueled by blogs and forums. But there have also been a series of outages with real-world impact, affecting commodities exchanges, thousands of web sites and online stores."

15 of 77 comments

  1. Short version... by MRe_nl · · Score: 4, Insightful

    Is downtime really more frequent? Or is it just more visible?
    The answer is both.

    --
    "Kill 'em all and let Root sort 'em out"
    1. Re:Short version... by arth1 · · Score: 5, Insightful

      I think monopolization plays a role too.
      Back when people jumped between Altavista, Hotbot, Jeeves and other engines, one of them going down wasn't so bad -- you just used another, and a day later, you wouldn't even remember that one of them had been down. But these days, everyone and his dog uses Google, and if Google goes down, people won't know what to do. Similar for other sites and hubs -- they've become too big, and users have become too reliant on them.

      So even if uptime has increased, the impact of downtime has become larger, in part due to the larger reliance on single systems.

  2. New sites are more complicated... by Anonymous Coward · · Score: 4, Interesting

    So they're more likely to suffer downtime as any one of the many pieces can break, causing it to all go down. Look at a site like Drudge Report that gets massive traffic, but is really VERY simple to run. Then look at a site like Twitter or YouTube or something like that, which has many more services to operate and keep running together.
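
    A back-of-the-envelope sketch of that point, assuming every piece must be up for the site to work and that pieces fail independently. The 99.9% figure and the component counts are made-up illustrations, not measurements of any real site.

```python
import math

HOURS_PER_YEAR = 24 * 365


def serial_availability(availabilities):
    """Availability of a site that needs every listed component up at once."""
    return math.prod(availabilities)


def expected_downtime_hours(availability):
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical: one simple server vs. ten interdependent services,
# each individually available 99.9% of the time.
simple_site = serial_availability([0.999])
complex_site = serial_availability([0.999] * 10)

print(f"simple:  {expected_downtime_hours(simple_site):.1f} h/year")
print(f"complex: {expected_downtime_hours(complex_site):.1f} h/year")
```

    Under those assumptions the ten-piece site is down roughly ten times as long per year (about 87 hours against about 9), even though no single piece got any worse.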

  3. The twitter factor by ximenes · · Score: 5, Insightful

    Twitter's infrastructure is notoriously poorly thought out, and I sort of doubt they employed any systems administrators (or service engineers, or operations engineers, or whatever) up until recently.

    I think the barrier to entry from an engineering standpoint has been lowered such that you can more easily make a site that appears to be pretty decent and attracts an audience. What is often missing is the behind-the-scenes work which ensures that the service is:

    - Deployed properly, with testing and staging environments that actually mirror production.
    - Fault-tolerant at every practical level. This gets expensive, so you see datacenter failures take down large swaths of sites that don't have multiple locations.
    - Constantly monitored, including performance metrics, to find issues quickly or even before they happen.

    This is the kind of work that always seems to take a back seat to development due to resource constraints, but it really needs to occur in tandem with the development process.

    If you don't design a site from the ground up to be redundant and highly performing, it's pretty difficult to flip a switch and make it that way later. Which is basically what Twitter has found out. Whether or not this mentality is taking over the Interworld is another story though.

    1. Re:The twitter factor by jnovek · · Score: 5, Insightful

      "If you don't design a site from the ground up to be redundant and highly performing, it's pretty difficult to flip a switch and make it that way later. Which is basically what Twitter has found out."

      And really, that's OK.

      Sites like Twitter are popping up precisely because the bar is very low to get your idea out on the 'net and compete. Sure, the cost in dollars and person hours is much higher to refactor for stability later, but would Twitter have even come into existence if that was a requirement from the start? Would its founders have considered it a worthwhile risk?

      Jason

    2. Re:The twitter factor by dubl-u · · Score: 3, Insightful

      "Sites like Twitter are popping up precisely because the bar is very low to get your idea out on the 'net and compete. Sure, the cost in dollars and person hours is much higher to refactor for stability later, but would Twitter have even come into existence if that was a requirement from the start? Would its founders have considered it a worthwhile risk?"

      That's a common after-the-fact excuse for not thinking at all about performance, but I've concluded that it's mostly bullshit.

      Sure, if you consider these questions up front and know what you're doing, it's completely possible to defer most of the work until things start to pick up. That's a very legitimate business decision, and if you get a big surprise in your growth curve, it's possible to get crushed. But with a little load testing, responsible development practices, and a little forethought, you've got a very good chance of avoiding a disaster. And none of that needs to be a big barrier to just getting something out.

      On the other hand, if you just don't think about those questions at all, building things willy nilly with no preparation for refactoring and growth down the road, then that's just idiotic. You are in effect betting that you will fail, in that your site will work only if it doesn't get popular. And with something like Twitter, where the network effect is king and you could only make money with a shitload of traffic, massive growth is the only way to succeed.

      From what I can tell, Twitter is firmly in that second camp. They've been going for nearly two years, and they've been shaky for most of it. One black eye from a sudden surge is acceptable, and for some is even a badge of honor. But more than a year of load-based suckage, to the point where you are an international joke, is a sign of plain incompetence. Although it hasn't killed Twitter, it has killed other businesses, and Twitter is not out of the woods yet.

    3. Re:The twitter factor by dubl-u · · Score: 3, Insightful

      "By the time you get big enough to really have to worry about scalability more than just turning on caching, you ought to be able to produce enough revenue to reimplement the site. If not, obviously you aren't relevant (or you aren't clever enough.) :)"

      I've heard this theory a lot. With regrettable frequency, it's part of noob entrepreneur business plans. I see three big problems with it.

      1. If a sudden surge in popularity is forcing you to work on scalability, that's exactly the point that you don't want to work on scalability. Finally, people care about your site! So now you want to give them cool new features regularly, so they don't go away again. Plus, they discover (and create) problems that you need to solve with new code.
      2. Scaling is much harder to do when you're behind than when you're ahead. If you're already creaking under load, you run around doing a lot of quick fixes that do nothing for the long term. All of the budget you planned for that rebuild can quickly get eaten up just keeping things from catching on fire.
      3. Per-user margins have been steadily declining for pretty much the life of the web. Decreased hardware and bandwidth costs mask some of that. And the vast growth of the internet audience makes up for the rest. But over time you have needed larger and larger numbers of people to have a viable web business. So you need to serve a lot more people to support a staff than you did early on.

      Twitter is a good example of all of these problems. They surely started out saying they would worry about scaling later. Then later came, and they had other things to do: new features, dealing with abusers, setting up a customer support infrastructure. Their quick scaling fixes kept their heads barely above water, but they didn't do much for the long term. And they are still in the "grow big, grow fast" stage, so they don't have any revenue and would rather wait a while longer to deal with that.

  4. Re:no... by Nick+Fel · · Score: 5, Funny

    I've seen Google down. Not completely unreachable, but not working. It was terrifying.

  5. Re:no... by Koiu+Lpoi · · Score: 3, Interesting

    Agreed. Google and Slashdot are the two (depending on my mood) sites I test to see if I have an internet connection. If I can't reach one, I don't even bother testing the other - I assume it's on my end, and I've not yet been wrong.

  6. Blackstart capability by Animats · · Score: 4, Interesting

    What with the "software as a service" and "outsourcing system administration" fads, more sites are relying on other sites being up when they power up. This could become a problem in bringing a site back up after an outage. It's important to know which sites have "black start" capability; they can start up without any resources from the outside.

    You can save money by outsourcing Linux system administration to Tomsk, Russia, or Lotus system administration to India. "Remote System Administration for your Lotus Notes/Domino Servers, Infrastructure". But can you then restart your data center from a cold start, when the offshore admin people can't yet get in?
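
    One way to audit this, sketched under the assumption that you can write down what each service needs in order to start. The service names, the dependency map and the list of outside providers below are all hypothetical.

```python
def external_dependencies(service, depends_on, external):
    """Return the outside services that `service` transitively needs to start."""
    found, seen, stack = set(), set(), [service]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against dependency cycles
        seen.add(current)
        if current in external:
            found.add(current)
            continue  # we cannot see inside someone else's service
        stack.extend(depends_on.get(current, []))
    return found


def can_black_start(service, depends_on, external):
    """True if the service can come up with no resources from the outside."""
    return not external_dependencies(service, depends_on, external)


# Hypothetical data center: what each service needs before it can start.
depends_on = {
    "web": ["app", "dns"],
    "app": ["db", "hosted_auth"],
    "db": [],
    "dns": [],
}
external = {"hosted_auth", "offshore_admin_vpn"}

for name in depends_on:
    print(name, can_black_start(name, depends_on, external))
```

    In this made-up layout the database and DNS can cold start on their own, but the web tier cannot, because the app server behind it waits on a hosted login service.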

    1. Re:Blackstart capability by dubl-u · · Score: 3, Insightful

      An important, related issue is the loss of local knowledge.

      If you did a web startup ten years ago, you pretty much had to hire a sysadmin. If you had a good one, they would yell at your developers about their retarded, unscalable designs. Having a scary bearded man threaten you with defenestration has its downsides, but it does give you an incentive to consider the impact to operations.

      The ever-lower cost of hosting is also a problem. If you tried to just throw $250k of hardware at a scaling issue back then, hopefully some executive would come by and ask some WTF-ish questions. (Unless you were at Boo.com or Webvan, natch.) But now, monthly rental on equivalent computing power is circa $400. Who'd bitch about that? Which allows you to really settle in to a totally unscalable architecture.

  7. Thanks, Grisoft by FilterMapReduce · · Score: 4, Funny

    "Are major web sites going down more often?"

    A bit more often now thanks to AVG?

  8. Slashdot uncertainty principle by CrazyJim1 · · Score: 5, Funny

    We're not sure if the sites are already dead, or if the observers changed the outcome.

  9. Re:or... by Buran · · Score: 3, Insightful

    So don't go there, don't click on links to it, and stop bitching about it. It only annoys you if you let it.

    Or do you just like to whine?

    Yes, they got a mention, because they can't fucking make the damn thing stop dying. If you want to be that prominent you need to get your shit together, or take the flak.

  10. Re:But what happens by Geak · · Score: 3, Interesting

    I can't really trust those network monitoring sites. They aren't accurate. All they can tell is that the site is down "from their location". I work for a webhosting company, and I've run into numerous cases where a customer is screaming that his website is down because the network monitoring site sent him a report saying so. The truth of the matter was that the site was up the entire time (even the customer could get to the site when I had them actually try). If a node goes down anywhere between the monitoring site and the user's website, they get a false positive.

    On top of that, you have to wonder if any of these monitoring sites are also deliberately sending false reports. Back when I was working for an ISP, I remember there was some kind of network monitoring software that came out, and a number of people were installing it on their computers. It would start warning customers that their "network connection was saturated - blah blah blah" and customers would call in blaming us. Within a few days I started seeing reviews on the net about the product, and some research showed that it was deliberately generating false reports for anybody that wasn't with a certain large coaster shipping ISP. Apparently the software company was a shareholder. I can't remember the name of the product, however; this was back in the old dialup days.
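
    A minimal sketch of how a monitor could avoid the single-location false positive described above, assuming it can probe from several independent locations. The location names and the majority threshold are illustrative assumptions, not how any particular monitoring service works.

```python
def classify_outage(reports, quorum=0.5):
    """Classify a site from per-location probe results.

    `reports` maps a probe location to True if the site answered from there.
    The site is called "down" only when more than `quorum` of locations fail.
    """
    if not reports:
        return "unknown"
    failures = sum(1 for reachable in reports.values() if not reachable)
    if failures == 0:
        return "up"
    if failures / len(reports) > quorum:
        return "down"
    # Some locations fail but most succeed: likely a routing problem
    # between those probes and the site, not the site itself.
    return "partial"


# Hypothetical probe results: one bad path out of three locations.
one_bad_path = {"london": False, "dallas": True, "tokyo": True}
all_failing = {"london": False, "dallas": False, "tokyo": False}

print(classify_outage(one_bad_path))
print(classify_outage(all_failing))
```

    With a single probe location there is no way to tell these two cases apart, which is why the customer gets a "down" report for a site that never went anywhere.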