Slashdot Mirror


Handling the Loads

On Tuesday, something terrible happened. The effects rippled through the world. And Slashdot was hit with more traffic than ever before as people grabbed at any open line of communication. When many news sites collapsed under the load, we managed to keep stumbling along. Countless people have asked me questions about how Slashdot handled the gigantic load spike. I'm going to try to answer a few of these questions now. Keep reading if you're interested.

I woke up and it seemed like a normal day. Around 8:30 I got to the office and made a pot of coffee. I hopped on IRC, started rummaging through the submissions bin, and of course, began reading my mail. Within minutes someone told me on IRC what had happened just moments after the impact of the first plane. Just a minute or 2 later, submissions started streaming into the bin. And at 9:12 a.m. Eastern Time, I made the decision to cancel Slashdot's normal daily coverage of "News for Nerds, Stuff that Matters," and instead focus on something more important than anything we had ever covered.

I couldn't get to CNN, and MSNBC loaded only enough to show me my first picture of the tragedy. I posted whatever facts we had: these were coming from random links over the net, and from Howard Stern, who is syndicated live from NY, even to my town. Over the next hour I updated the story as events happened. I updated when the towers collapsed. And the number of comments exploded as readers expressed their outrage, sadness, and confusion following the tragedy.

Not surprisingly, the load on Slashdot began to swell dramatically. Normally at 9:30 a.m., Slashdot is serving 18-20 pages a second. By 10 we were up to 30 and spiking to 40. This is when we started having problems.

At this point Jamie and Pudge were online and we started trying to sort out what we could do. The database crashed and Jamie went into action bringing it back up. I called Krow: he's on Western time, but he knows the DB best, and I had to wake him up. But worst of all, I had to tell him what had happened in New York. It was one of the strangest things I've ever done: it still hadn't settled in. I had seen a few grainy photos but I don't have a TV in my office and hadn't yet seen any of the footage. After I hung up the phone I almost broke down. It was the first time, but not the last.

The DB problem was a known bug and the decision was made to switch to the backup box. This machine was a replicated mirror of Slashdot, but running a newer version of MySQL. We hadn't switched the live box simply because it meant taking the site down for a few minutes. Well, we were down anyway, and the box was a complete replica of the live DB, so we quickly moved.

At this point the DB stopped being a bottleneck, and we started to notice new rate limits on the performance of the 6 web servers themselves. Recently we fixed a glitch with Apache::SizeLimit: functionally, it kills httpd processes that use more than a certain amount of memory, but the size limit was too low and processes were dying after serving just a few requests. This was complicated by the fact that the first story quickly swelled to more than a thousand comments ... we've tuned our caching to Slashdot's normal traffic: 5000-6000 comments a day, with stories having 200-500 comments. And this was definitely not the normal story. Our cache simply wasn't ready to handle this.
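
To illustrate why a size limit that is too low hurts, here is a minimal Python sketch of the general idea (it is not the Perl module itself). It assumes the size check runs once after each request, and the starting size and per-request cache growth are hypothetical numbers picked only for illustration.

```python
def requests_before_kill(start_mb, growth_mb_per_request, limit_mb):
    """Count the requests a worker serves before a post-request size check kills it.

    Models a generic size limit: after each request the process size is
    compared to the limit, and the process exits once it is over.
    """
    if growth_mb_per_request <= 0:
        raise ValueError("growth must be positive or the worker never exits")
    size = start_mb
    served = 0
    while True:
        size += growth_mb_per_request  # cache grows while serving the request
        served += 1
        if size > limit_mb:            # the check happens after the request
            return served

# Hypothetical worker: starts at 24 MB and caches about 2 MB per request.
low_limit = requests_before_kill(24, 2, 30)
high_limit = requests_before_kill(24, 2, 45)
print(low_limit, high_limit)
```

With these made-up numbers the worker with the lower limit is recycled after a handful of requests, so it throws away its cache almost as soon as it has built it.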

Our httpd processes cache a lot of data: this reduces hits to the database and just generally makes everything better. We turned down the number of httpd processes (from 60 on each machine to 40) and increased the RAM that each process could use (from 30 to 40 and later 45 megs). We also turned off reverse hostname lookups, which we use for geotargeting ads: the time required to do the rDNS is fine under normal load, but under huge loads we need that extra second to keep up with the primary job: spitting out pages as fast as possible.
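
The trade-off between process count and per-process memory can be checked with simple arithmetic. This sketch assumes the worst case where every process grows all the way to its limit; the function name is made up for the example, and the machines' actual RAM is not stated here.

```python
def worst_case_mb(processes, limit_mb):
    """Upper bound on httpd memory per machine if every process hits its limit."""
    return processes * limit_mb

before = worst_case_mb(60, 30)    # original settings
after_40 = worst_case_mb(40, 40)  # first adjustment
after_45 = worst_case_mb(40, 45)  # later adjustment
print(before, after_40, after_45)
```

Under that assumption, fewer but larger processes fit in roughly the same memory budget per machine while giving each process more room for its cache.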

This was around noon or so. I was keeping a close eye on the DB and we noticed a few queries that were taking a little too long. Jamie went in and switched our search from our own internal search, to hitting Google: Search is a somewhat expensive call on our end right now, and this was necessary just to make sure that we could keep up. We were serving 40-50 pages/second ... twice our usual peak loads of around "Just" 25 pages a second. I drove the 10 minutes to get home so I could watch CNN and keep up better with what was happening.

We trimmed a few minor functions out temporarily just to reduce the number of updates going to frequently read tables. But it was just not enough: The database was now beginning to be overworked and page views were slowing down. The homepage was full of discussions that were 3-4x the average size. The solution was to drop a few boxes from generating dynamic pages to serving static ones.

Let me explain: most people (around 60-70%) view the same content. They read the homepage and the 15 or so stories on the homepage. And they never mess with thresholds and filters and logins. In fact, when we have technical problems, we serve static pages. They don't require any database load, and the Apache processes use very little memory. So for the next few hours, we ran with 4 of our boxes serving dynamic pages, and 2 serving static. This meant that 60-70% of people would never notice, and the others would only be affected when they tried to save something ... and then they would only notice if they hit a static box, which would happen only one time in 3. It's not the ideal solution, but at this point we were serving 60-70 pages a second: 3x our usual traffic, and twice what we designed the system for. We got a lot of good data and found a lot of bottlenecks, so next time something causes our traffic to triple, we'll be much more prepared.
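
The "one in 3" figure and its overall impact can be worked out in a few lines. This is a rough sketch that assumes requests are spread evenly across all six boxes, which the story does not actually state; the function name is hypothetical.

```python
def share_hitting_static(dynamic_boxes, static_boxes, needs_dynamic):
    """Share of all requests that need dynamic content but land on a static box."""
    p_static = static_boxes / (dynamic_boxes + static_boxes)
    return needs_dynamic * p_static

# 60-70% of readers are fine with static pages, so 30-40% need dynamic ones.
low = share_hitting_static(4, 2, 0.30)
high = share_hitting_static(4, 2, 0.40)
print(f"{low:.0%} to {high:.0%} of requests")
```

Under that even-spread assumption, only about a tenth of all requests would both need a dynamic page and land on a static box.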

At the end of the day we had served nearly 3 million pages -- almost twice our previous record of 1.6M, and far more than our daily average of 1.4M. During the peak hours, average page serving time slowed by just 2 seconds per page ... and over 8000 comments were posted in about 12 hours, and 15,000 in 48 hours.
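
For comparison with the pages-per-second figures quoted earlier, the daily totals can be converted to an average rate. This is just arithmetic on the numbers above, treating "nearly 3 million" as a round 3 million.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def pages_per_second(pages_per_day):
    """Average serving rate over a full 24-hour day."""
    return pages_per_day / SECONDS_PER_DAY

totals = [
    ("Tuesday", 3_000_000),
    ("previous record", 1_600_000),
    ("daily average", 1_400_000),
]
for label, pages in totals:
    print(f"{label}: {pages_per_second(pages):.1f} pages/second")
```

The averages sit well below the peaks because traffic is concentrated in the daytime hours.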

On Wed. we started to put additional web servers into the pool, but that ended up not being necessary. We stayed dynamic and had no real problems on all 6 boxes all day. We peaked at around 35-40 pages/second. We served about 2 million pages. Thursday traffic loads were high, but relatively normal.

Summary

So here is what we learned from the experience.

  • We have great readers. I had only one single flame emailed to me in 24 hours, and countless notes of thanks and appreciation. We were all frazzled over here and your words of encouragement meant so much. You'll never know.
  • Slashteam kicks butt. Jamie, Pudge, Krow, Yazz, Cliff, Michael, Jamie, Timothy, CowboyNeal, you guys all rocked. From collecting links to monitoring servers, to fixing bits of code in real time. It was good seeing the team function together so well ... I can't begin to describe the strangeness of seeing 2 separate discussions in our channel: one about keeping servers working, and another about bombs, terrorists, and war. But through it all these guys each did their part.
  • Slash is getting really excellent. With tweaks that we learned from this, I think that our setup will soon be able to handle a quarter million pages an hour. In other words, it should handle 3x Slashdot's usual load, without any additional hardware. And with a more monstrous database, who knows how far it could scale.
  • Watch out for Apache::SizeLimit if you are doing caching.
  • Writing and reading to the same InnoDB MySQL tables can be done since it does row-level locking. But as load increases, it can start being less than desirable.
  • A layer of proxy is desirable so we could send static requests to a box tuned for static pages. For a long time now we've known that this was important, but it's a tricky task. But it is super necessary for us to increase the size of caches in order to ease DB load and speed up page generation time ... but along with that we need to make sure that pages that don't use those caches don't hog precious Apache forks that have them. Currently only images are served separately, but anonymous homepages, xml, rdf, and many other pages could easily be handled by a stripped down process.
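
The proxy idea in the last point could be sketched roughly like this. It is a hypothetical routing rule written in Python for illustration, not Slashdot's setup: the extension list, the homepage paths, the `/comments.pl` example, and the `logged_in` flag are all assumptions.

```python
# Hypothetical list of file types a stripped-down static process could serve.
STATIC_EXTENSIONS = (".gif", ".jpg", ".png", ".xml", ".rdf")

def choose_pool(path, logged_in):
    """Return 'static' or 'dynamic' for a request (illustrative rule only)."""
    if path.lower().endswith(STATIC_EXTENSIONS):
        return "static"    # images and feeds never need the big caches
    if path in ("/", "/index.html") and not logged_in:
        return "static"    # the anonymous homepage is the same for everyone
    return "dynamic"       # everything else goes to a full, cache-heavy process

print(choose_pool("/", logged_in=False))
print(choose_pool("/", logged_in=True))
print(choose_pool("/comments.pl", logged_in=False))
```

The point of a rule like this is that cheap requests never occupy one of the large, cache-heavy processes.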

What happened on Tuesday was a terrible tragedy. I'm not a very emotional person, but I still keep getting choked up when I see some new heartbreaking photo or a new camera angle, learn some new bit of heartbreaking information, or read about something wonderful that somebody has done. This whole thing has shaken me like nothing I can remember. But I'm proud of everyone involved with Slashdot for working together to keep a line of communication open for a lot of people during a crisis. I'm not kidding myself by thinking that what we did is as important as participating in the rescue effort, but I think our contribution was still important. And thanks to the countless readers who have written me over the last few days to thank us for providing them with what, for many, was their only source of news during this whole thing. And thanks to the whole team who made it happen. I'm proud of all of you.

7 of 890 comments

  1. The Community Was Served. by pgrote · Score: 5, Interesting

    Slashdot did provide a very valuable service the day of the attack.

    Take into consideration that during the day at some point all major media web sites died.

    Many people found Slashdot as their only source of updated information that was staying up.

    This sentiment was echoed in pieces by Salon and Wired writers that mentioned Slashdot specifically as a site that had what people were looking for.

    You should be proud and satisfied that what you have created did provide a needed service. Thanks, again.

  2. CNN's problems by crow · Score: 5, Interesting

    CNN's main problem was that they had canceled their contract with Akamai a month or two ago to save money. Akamai works by having servers at or near most major ISPs so that the majority of traffic is served locally.

    While the load was heavy, it wasn't anything Akamai wasn't prepared to handle.

    Unfortunately, Akamai's co-founder was one of the passengers flying out of Boston on a hijacked flight Tuesday. I have friends who work at Akamai for whom he was not just a boss, but a friend.

  3. Re:A request by hansk · Score: 5, Interesting

    Speaking of religious fanatics, we have our own here in the US:

    God Gave U.S. 'What We Deserve,' Falwell Says

    Jerry Falwell and Pat Robertson blaming the events on liberals, feminists, etc. etc. etc.

    Sick.
  4. Re:A request by Rimbo · Score: 5, Interesting

    Any white Christian who starts seeing those of other ethnicities or religions as "Them" is not only a poor excuse for a Christian, but ought to be considered as bad as the terrorists themselves.

    Speaking as a white Christian...bingo. You just hit the nail on the head. In fact, it's that very attitude that allowed these terrorists to believe that what they were doing was somehow God's Will.

    Christianity, Judaism, and Islam are all filled with references to people who, though they weren't Official Churchgoing Believers, represented God's will better than the average Believer. And an ongoing theme in both the Talmud and the Bible (I can't speak for the Koran, although I've been told Mohammed's teachings are very tolerant of other religions) is the failure of church leaders.

    It's ironic. All of these religions which these misguided fundamentalist-whacko "leaders" (such as Osama Bin Laden and Jerry Falwell) supposedly follow condemn the most the Bin Ladens and Falwells of the world, who use God's Name to mislead people, or cause people to commit terrible atrocities.

  5. Re:/.ers: Don't get too cocky... by Kiaser Zohsay · Score: 5, Interesting

    Props to Taco and team, not only for their hard work keeping the site up, but for this behind-the-scenes look at what it took to do so.

    CNN was peaking at about an estimated 50,000 hits per second.

    I also noticed after CNN came back up that they seemed to be in a sort of stripped-down static-only "combat mode". A talk with the guys behind CNN's site during the height of Tuesday's events would make for a great slashdot interview.

    --
    I am not your blowing wind, I am the lightning.
  6. Has Slashdot's own search been removed for good? by kjj · Score: 5, Interesting

    I was wondering if Slashdot was going to stick with "google search" indefinitely or is Slashdot going to bring back their own search engine. I really hope the regular Slashdot search comes back. Google just doesn't cut it when searching for something specific. I wanted to go back to a story about a benchmark and review of DDR motherboards used with Linux. So I tried the following search: linux ddr motherboards
    and this is what I got:

    Slashdot | Pentium 4 Under Linux
    ... Under Linux, I would not buy a P4 ... Re:Why didn't they use DDR RAM on the AMD? by Splork ... someone
    out there selling G4 motherboards with standard form factors and ...
    www.slashdot.org/articles/01/07/15/209215.shtml - 69k - Cached - Similar pages

    Slashdot | Linux Intel Chipset Comparison
    ... it in march, and i run linux on it, and it performs ... Re:Athlon Motherboards... (Score:1)
    by Diabolus (troy ... until we start seeing DDR mobos hit the shelves (any ...
    www.slashdot.org/articles/00/12/18/056248.shtml - 46k - Cached - Similar pages

    Slashdot | AMD Athlon Multi-Processor Under Linux
    ... on several single-CPU motherboards; check your favourite vendor's ... Quake3 demo benchmarks
    under linux on the following boards ... with 256 meg ddr sdram running at ...
    www.slashdot.org/articles/01/07/12/1838238.shtml - 101k - Cached - Similar pages

    Slashdot | Intel To Drop Rambus Exclusivity, Support SDRAM
    ... problems with the newest linux kernels - but widespread - well ... the cost of producing
    motherboards and chipsets, but ... need two seperate 400MHz DDR channels to get ...
    www.slashdot.org/articles/01/07/26/1153225.shtml - 89k - Cached - Similar pages

    That is just the first few but I looked through a number of them and I couldn't find the story I was looking for.

  7. Christianity is at its worst... by FreeUser · Score: 5, Interesting

    ... with reactions such as you describe. Christianity is (even more than Islam) an evangelical religion, in which there are strong pressures built into the belief system to convert others to one's own way of thinking, generally under the guise of "saving their soul." I have personally experienced this sort of psychological assault from Christian sects ranging from Catholic to Mormonism (yes, they do qualify as Christian in that they worship Christ, even if the other sects won't claim them).

    I won't go into a long diatribe at the offensiveness of this mindset or this behavior, but rather reference it in order to point out that, as a genre of religion which is bent on conversion, i.e. selling their viewpoint to others, Christian sects tend to be obsessed with appearance as much as substance. Whether it is cloaked as "setting a good example to others," "representing your faith/church to others," or "demonstrating through actions what it is to be a good Christian," none of which are as blatant as the Mormon adage of "avoid the appearance of evil," the underlying message is clear: appearances are at least as important as substance. With a mindset like that, reinforced every Sunday from one's spiritual leaders, is it any surprise that people who look even a little non-mainstream garner the reactions like you describe?

    We should kick ass and eradicate our enemies. Not in the name of God, not in the name of some religion, but in the name of our country and our people, which have been attacked and shall be avenged. Keep church and state where they belong, separate, and obliterate the bastards who committed these atrocities last Tuesday in the name of our secular, democratic institutions, leaving each of us to pray, and to grieve, in our own fashion, according to our own beliefs. And never make the mistake that just because someone doesn't share your beliefs, ethnic background, or skin color that they are in any respect less capable of grieving than you.

    --
    The Future of Human Evolution: Autonomy