Slashdot Mirror


Keeping an Eye Out When Sites Go Down

miller60 writes "Are major web sites going down more often? Or are outages simply more noticeable? The New York Times looks at the recent focus on downtime at services like Twitter, and the services that have sprung up to monitor outages. When a site goes down, word spreads rapidly, fueled by blogs and forums. But there has also been a series of outages with real-world impact, affecting commodities exchanges, thousands of web sites and online stores."

9 of 77 comments

  1. Short version... by MRe_nl · · Score: 4, Insightful

    Is downtime really more frequent? Or is it just more visible?
    The answer is both.

    --
    "Kill 'em all and let Root sort 'em out"
    1. Re:Short version... by arth1 · · Score: 5, Insightful

      I think monopolization plays a role too.
      Back when people jumped between Altavista, Hotbot, Jeeves and other engines, one of them going down wasn't so bad -- you just used another, and a day later, you wouldn't even remember that one of them had been down. But these days, everyone and his dog uses Google, and if Google goes down, people won't know what to do. Similar for other sites and hubs -- they've become too big, and users have become too reliant on them.

      So even if uptime has increased, the impact of downtime has become larger, in part due to the larger reliance on single systems.

  2. New sites are more complicated... by Anonymous Coward · · Score: 4, Interesting

    So they're more likely to suffer downtime, as any one of the many pieces can break and take the whole thing down. Look at a site like Drudge Report that gets massive traffic, but is really VERY simple to run. Then look at a site like Twitter or YouTube or something like that, which has many more services to operate and keep running together.
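
    A back-of-the-envelope sketch of that argument, assuming the page only works when every piece is up and that the pieces fail independently (both simplifications). The 99.9% per-component figure is made up for illustration, not measured from any real site.

```python
HOURS_PER_YEAR = 24 * 365


def serial_availability(component_availabilities):
    """Availability of a system that needs every component up at once.

    Assumes failures are independent, so the availabilities multiply.
    """
    total = 1.0
    for availability in component_availabilities:
        total *= availability
    return total


def expected_downtime_hours(availability):
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical numbers: one simple server vs. ten interdependent services,
# each individually up 99.9% of the time.
simple_site = serial_availability([0.999])
complex_site = serial_availability([0.999] * 10)

print(f"1 component:   {expected_downtime_hours(simple_site):.1f} hours/year down")
print(f"10 components: {expected_downtime_hours(complex_site):.1f} hours/year down")
```

    Under those assumptions the ten-piece site is down roughly ten times as long per year, even though no single piece got any worse.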

  3. The twitter factor by ximenes · · Score: 5, Insightful

    Twitter's infrastructure is notoriously poorly thought out, and I sort of doubt they employed any systems administrators (or service engineers, or operations engineers, or whatever) up until recently.

    I think the barrier to entry from an engineering standpoint has been lowered such that you can more easily make a site that appears to be pretty decent and attracts an audience. What is often missing is the behind-the-scenes work which ensures that the service is:

    - Deployed properly, with testing and staging environments that actually mirror production.
    - Fault-tolerant at every practical level. This gets expensive, so you see datacenter failures take down large swaths of sites that don't have multiple locations.
    - Constantly monitored, including performance metrics, to find issues quickly or even before they happen.

    This is the kind of work that always seems to take a back seat to development due to resource constraints, but it really needs to occur in tandem with the development process.
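
    For the monitoring item in the list above, a minimal sketch of a single probe might look like this. It is not how Twitter or any monitoring service does it; the function name `check`, the `slow_after` threshold, and the `opener` parameter are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def check(url, timeout=5.0, slow_after=2.0, opener=urllib.request.urlopen):
    """Probe one URL and return (state, seconds), state being "ok", "slow" or "down".

    "down" means the request failed or returned an HTTP error status;
    "slow" means it answered but took longer than slow_after seconds.
    Both thresholds are arbitrary illustrative defaults.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            healthy = response.status < 400
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        healthy = False
    elapsed = time.monotonic() - start
    if not healthy:
        return "down", elapsed
    return ("slow" if elapsed > slow_after else "ok"), elapsed
```

    Running something like `check("https://example.com/")` on a schedule and alerting on anything other than "ok" would be about the crudest possible version. The "slow" state is the point of the performance-metrics remark: it can flag trouble before the site is actually down. The `opener` parameter is there only so the function can be exercised without a network.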

    If you don't design a site from the ground up to be redundant and highly performing, it's pretty difficult to flip a switch and make it that way later. Which is basically what Twitter has found out. Whether or not this mentality is taking over the Interworld is another story though.

    1. Re:The twitter factor by jnovek · · Score: 5, Insightful

      "If you don't design a site from the ground up to be redundant and highly performing, it's pretty difficult to flip a switch and make it that way later. Which is basically what Twitter has found out."

      And really, that's OK.

      Sites like Twitter are popping up precisely because the bar is very low to get your idea out on the 'net and compete. Sure, the cost in dollars and person-hours is much higher to refactor for stability later, but would Twitter have even come into existence if that had been a requirement from the start? Would its founders have considered it a worthwhile risk?

      Jason

  4. Re:no... by Nick+Fel · · Score: 5, Funny

    I've seen Google down. Not completely unreachable, but not working. It was terrifying.

  5. Blackstart capability by Animats · · Score: 4, Interesting

    What with the "software as a service" and "outsourcing system administration" fads, more sites are relying on other sites being up when they power up. This could become a problem in bringing a site back up after an outage. It's important to know which sites have "black start" capability; they can start up without any resources from the outside.

    You can save money by outsourcing Linux system administration to Tomsk, Russia, or Lotus system administration to India. "Remote System Administration for your Lotus Notes/Domino Servers, Infrastructure". But can you then restart your data center from a cold start, when the offshore admin people can't yet get in?
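
    One way to audit black start capability is to write down what each service needs before it can start and see which ones end up waiting on something outside the building. A rough sketch, assuming a hand-maintained dependency map; the service names and the "ext:" prefix for outside resources are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical inventory: each service maps to what must be up before it starts.
# Names prefixed "ext:" stand for resources outside the data center.
DEPENDS_ON = {
    "dns": [],
    "database": ["dns"],
    "auth": ["dns", "ext:hosted-login"],
    "web": ["database", "auth"],
    "status-page": ["dns"],
}


def black_start_plan(depends_on):
    """Return (start_order, blocked_services).

    start_order lists services that can come up with no outside help,
    dependencies first. blocked_services lists those that directly or
    indirectly need an external resource.
    """
    # static_order() yields every dependency before the things that need it,
    # and raises graphlib.CycleError if the map contains a cycle.
    order = list(TopologicalSorter(depends_on).static_order())
    blocked = set()
    for node in order:
        deps = depends_on.get(node, [])
        if node.startswith("ext:") or any(dep in blocked for dep in deps):
            blocked.add(node)
    start_order = [node for node in order if node not in blocked]
    blocked_services = sorted(n for n in blocked if not n.startswith("ext:"))
    return start_order, blocked_services


start_order, blocked_services = black_start_plan(DEPENDS_ON)
print("can cold start, in order:", start_order)
print("waiting on the outside world:", blocked_services)
```

    In this made-up inventory the web tier cannot come back on its own, because its login dependency lives at someone else's site. A cycle in the map is a different black start problem: two services that each need the other before they can start.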

  6. Thanks, Grisoft by FilterMapReduce · · Score: 4, Funny

    Are major web sites going down more often?

    A bit more often now thanks to AVG?

  7. Slashdot uncertainty principle by CrazyJim1 · · Score: 5, Funny

    We're not sure if the sites are already dead, or if the observers changed the outcome.