Slashdot Mirror


Huge Traffic On Wikipedia's Non-Profit Budget

miller60 writes "'As a non-profit running one of the world's busiest web destinations, Wikipedia provides an unusual case study of a high-performance site. In an era when Google and Microsoft can spend $500 million on one of their global data center projects, Wikipedia's infrastructure runs on fewer than 300 servers housed in a single data center in Tampa, Fla.' Domas Mituzas of MySQL/Sun gave a presentation Monday at the Velocity conference that provided an inside look at the technology behind Wikipedia, which he calls an 'operations underdog.'"

22 of 240 comments

  1. Impressive by locokamil · · Score: 4, Insightful

    Given that their topic sites are generally in the top three for any search engine query, the volume of traffic they're dealing with (and the budget that they have!) is very impressive. I always thought that they had much beefier infrastructure than the article says.

    1. Re:Impressive by mcrbids · · Score: 2, Insightful

      As somebody who has been serving the Internet for a good length of time, I remember when busy web servers serving a 10 Mb stream were "ultra-high capacity" with a Pentium II 350 MHz chip and 256 MB of RAM.

      The reality is that today, if you pay any attention at all to performance and a reasonable architecture, modern commodity hardware has just utterly incredible delivery capacity. A cheap, 1U 4-core x86 with 8 GB of RAM and a couple of SCSI 10k drives can easily saturate a 1 Gb stream of static pages, or even dynamic pages if the core algo is reasonable. This server can cost about $2500 without too much trouble, and even with heavily database-driven applications, a couple of these can deliver an insane amount of traffic.

      As an example, I use LAMP stack software to serve school districts. I went into one larger school with our software, and they had a half-dozen higher-end systems to serve a FileMaker Pro based application to their several hundred staff. Delays of 5 minutes or more were commonplace. Our computing cluster, consisting of four 4-core servers with SCSI drives, satisfied all their needs much faster than their existing solution, while simultaneously serving almost 100 other schools and school districts. Our software was cleaner and more efficient, and got a much bigger job done with greatly reduced resources.

      LAPP (Linux/Apache/Postgres/PHP) can be damned efficient if you do it right.

      So it really doesn't take much anymore to serve a huge audience if you pay attention to systemic efficiency. That Wikipedia can do so much with just 300 systems actually seems heavy to me - I'm surprised that they need that many! I'd personally guessed something like 20-50 servers total, with dynamic pages heavily cached with static files and some kind of expiration algorithm, along with some spendy communications hardware.

      --
      I have no problem with your religion until you decide it's reason to deprive others of the truth.
  2. I've always wondered... by mnslinky · · Score: 4, Insightful

    It would be neat to have a deeper look at their budget to see how I can save money and boost performance at work. It's always nice having the newest/fastest systems out there, but it's rarely the reality.

  3. The power of low standards by Itninja · · Score: 4, Insightful

    From TFA: "But losing a few seconds of changes doesn't destroy our business."

    Our organization's databases (also a non-profit) get several thousand writes per second. Losing 'a few seconds' would mean potentially hundreds of users' record changes were lost. If that happened here, it would be a huge deal. If it happened regularly, it would destroy the business.

    --
    I judt got a nre Kinesis keybiartf so please excusr ant egregiou typos.
    1. Re:The power of low standards by robbkidd · · Score: 5, Insightful

      Okay. So pay attention to the sentence before the one you quoted which read, "I'm not suggesting you should follow how we do it."

    2. Re:The power of low standards by Anonymous Coward · · Score: 5, Insightful

      Don't be too harsh -- the standards are dependent on the application. Your application, by the nature of the information and its purposes, requires a different standard of reliability than Wikipedia does. You're certainly entitled to be proud of yourself for maintaining that standard.

      But don't let that turn into being derogatory about the Wikipedia operation. Wikipedia has identified the correct standard for their application, and by doing so they have successfully avoided the costs and hassle of over-engineering. To each his own...

    3. Re:The power of low standards by Waffle+Iron · · Score: 2, Insightful

      Indeed. Some of us are old enough to remember the days of "banker's hours" and before ATMs, when banks used to make their customers deal with less than "one two" (20%) availability.

    4. Re:The power of low standards by astrotek · · Score: 2, Insightful

      That's amazing considering I get an error page on Bank of America around 5% of the time if I move too quickly through the site.

    5. Re:The power of low standards by Anonymous Coward · · Score: 1, Insightful

      > A bank requires "six nines" of performance (i.e., right 99.9999% of the time)

      Wrong. When I worked for Wachovia, a successful week was only one hour of downtime. Management was very happy with a 99.4% uptime average. Our scheduled maintenance windows were one hour per day which is a 96% uptime. As long as no mistakes were made with data and the downtime didn't happen during the day, management really just didn't care.

      I now work for American Express, and our customer service system and customer web sites are commonly down for more than an hour per week. A few weeks ago, we had more than forty hours of downtime on a customer-facing web site, and no one lost their job. No one even got yelled at for it.

      Slashdotters just have unrealistic expectations for uptime. In the real world, you have weekly maintenance schedules and a lot of downtime. Also, the cost of achieving five 9's is so great that it makes business sense to have reasonable requirements and expectations for availability.
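
The uptime figures quoted above are easy to check. A minimal sketch, assuming availability is simply the fraction of a period not spent down (the function name is made up for illustration):

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours


def availability(downtime_hours: float, period_hours: float) -> float:
    """Return availability as a percentage for a given downtime within a period."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    return 100.0 * (1.0 - downtime_hours / period_hours)


# One hour of downtime in a week.
weekly = availability(1, HOURS_PER_WEEK)
# A one-hour maintenance window every day.
daily = availability(1, 24)

print(f"{weekly:.1f}% weekly, {daily:.1f}% daily")
```

That works out to roughly 99.4% and 95.8%, which matches the 99.4% and (rounded) 96% figures in the comment.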

    6. Re:The power of low standards by PMBjornerud · · Score: 3, Insightful

      If there's a 30-second period per year when data doesn't properly move, and that requires manual cleanup, that's acceptable. And if there is a 1-hour downtime, EVER, you just blew through the scheduled downtime for the next 120 years.

      "Six nines" is meaningless. Unrealistic.

      It is a promise that you cannot be hit by a single accident, fuckup, pissed-off-employee or act of god.
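
For anyone who wants to see where the "120 years" comes from, here is a rough sketch. It assumes a 365.25-day year and treats "N nines" as an unavailability of 10^-N; the helper names are invented for this example.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def downtime_budget_seconds(nines: int) -> float:
    """Allowed downtime per year, in seconds, for an availability of `nines` nines."""
    unavailability = 10.0 ** (-nines)  # e.g. 6 nines -> 0.000001
    return SECONDS_PER_YEAR * unavailability


def years_consumed(outage_seconds: float, nines: int) -> float:
    """How many years of downtime budget a single outage uses up."""
    return outage_seconds / downtime_budget_seconds(nines)


six_nines_budget = downtime_budget_seconds(6)  # about 31.6 seconds per year
one_hour_cost = years_consumed(3600, 6)        # about 114 years of budget

print(f"{six_nines_budget:.1f} s/year, one hour = {one_hour_cost:.0f} years")
```

With the exact budget of about 31.6 seconds per year, a one-hour outage eats roughly 114 years of allowance; rounding the budget down to 30 seconds gives the 120 years mentioned above.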

      --
      I lost my sig.
    7. Re:The power of low standards by Qzukk · · Score: 2, Insightful

      You can achieve 100% service availability by clustering

      Is that where when I run "DROP TABLE reallyimportanttable;" it drops it on all the servers at once?

      --
      If I have been able to see further than others, it is because I bought a pair of binoculars.
  4. Re:Some thoughts by TheLazySci-FiAuthor · · Score: 4, Insightful

    "... you need to focus on a handful of highly-talented IT people rather than an army of droids."

    This is so true; I've always said, "you get what you pay for."

    Do you want to pay for software, or do you want to pay for people?

    Only one can create the other.

  5. Re:Some thoughts by bsDaemon · · Score: 2, Insightful

    Which is different from any other open source project how?

  6. Works great because it's not "Web 2.0" by Animats · · Score: 5, Insightful

    Most of Wikipedia is a collection of static pages. Most users of Wikipedia are just reading the latest version of an article, to which they were taken by a non-Wikipedia search engine. So all Wikipedia has to do for them is serve a static page. No database work or page generation is required.

    Older revisions of pages come from the database, as do the versions one sees during editing and previewing, the history information, and such. Those operations involve the MySQL databases. There are only about 10-20 updates per second taking place in the editing end of the system. When a page is updated, static copies are propagated out to the static page servers after a few tens of seconds.
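
The read path described above can be sketched as a render cache that is purged when an article changes. This is only an illustration of the idea, not MediaWiki's actual code: the `PageCache` class, the `articles` dict standing in for the database, and the `render` function are all hypothetical, and the propagation delay to separate cache servers is not modeled.

```python
class PageCache:
    """Serve rendered pages from memory; re-render only after a purge."""

    def __init__(self, render):
        self._render = render  # expensive function: title -> HTML
        self._pages = {}       # title -> cached HTML
        self.renders = 0       # how many times we had to hit the backend

    def get(self, title):
        if title not in self._pages:
            self._pages[title] = self._render(title)
            self.renders += 1
        return self._pages[title]

    def purge(self, title):
        # Called after an edit so the next read picks up the new revision.
        self._pages.pop(title, None)


# Stand-in for the database of current article text.
articles = {"Tampa": "A city in Florida."}


def render(title):
    return f"<h1>{title}</h1><p>{articles[title]}</p>"


cache = PageCache(render)
first = cache.get("Tampa")
second = cache.get("Tampa")        # served from cache, no re-render
renders_before_edit = cache.renders

articles["Tampa"] = "A city in Florida with a data center."
cache.purge("Tampa")               # the edit invalidates the cached copy
after_edit = cache.get("Tampa")
```

Reads vastly outnumber edits, so nearly every request is answered without touching the backend at all.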

    Article editing is a check-out/check-in system. When you start editing a page, you get a version token, and when you update the page, the token has to match the latest revision or you get an edit conflict. It's all standard form requests; there's no need for frantic XMLHttpRequest processing while you're working on a page.
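
The version-token check is a form of optimistic concurrency control. A minimal sketch, assuming the token is just the revision number (the `Article` and `EditConflict` names are invented here, and the real system surely carries more state):

```python
class EditConflict(Exception):
    """Raised when a save is based on a revision that is no longer the latest."""


class Article:
    def __init__(self, text):
        self.text = text
        self.revision = 1

    def start_edit(self):
        # Hand the editor the current text plus a token naming its revision.
        return self.revision, self.text

    def save(self, token, new_text):
        # Accept the save only if nobody else has saved since the checkout.
        if token != self.revision:
            raise EditConflict(
                f"based on revision {token}, latest is {self.revision}"
            )
        self.text = new_text
        self.revision += 1
        return self.revision


page = Article("Original text.")
token_a, _ = page.start_edit()   # editor A checks out revision 1
token_b, _ = page.start_edit()   # editor B checks out revision 1 too

page.save(token_a, "Edit by A.")  # succeeds, page is now revision 2

try:
    page.save(token_b, "Edit by B.")
    conflict = False
except EditConflict:
    conflict = True               # B's token is stale
```

No locks are held while someone is typing, which is why plain form posts are enough: the only check happens at save time.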

    Because there are no ads, there's no overhead associated with inserting variable ad info into the pages. No need for ad rotators, ad trackers, "beacons" or similar overhead.

  7. Confused by the title by Just+Some+Guy · · Score: 5, Insightful

    What does "Non-Profit Budget" mean, anyway? There are non-profits bigger than the company I work for. Non-profit isn't the same as poorly financed.

    --
    Dewey, what part of this looks like authorities should be involved?
  8. Nonsense. Wikipedia is THE web 2.0 by Nicolas+MONNET · · Score: 4, Insightful

    Web 2.0 is not just about flashy Ajax or what not, it's about user generated dynamic content. WP's "everything is a wiki" architecture might /look/ a bit archaic compared to fancy schmancy dynamic rotating animated gradient-filled forums, but it's much more powerful.
    Moreover, WP is not a collection of static pages; if you're logged in, at least, every page is dynamically generated, and every page's history is updated within a few seconds.

  9. Re:What is the role of Open Source by guruevi · · Score: 2, Insightful

    I don't know what else but open source you could use especially on the database side. You have only a few choices:

    Microsoft ($$$) (approx. $50,000 per server per year in licensing costs since it's a public (unlimited CAL) enterprise-level site)
    IBM ($$) (approx. $500,000 per year for leasing the whole operation, another load for support)
    Oracle ($) (approx. $20,000 per backend and about 30 contractors for the next 5 years for the implementation)
    Linux, MySQL, PHP (Free)

    Not to mention, with Microsoft you'll need more servers to handle the same amount of load, especially if you use a Microsoft-based software package for the frontend as well (ASP.NET, MS CRM or SharePoint).

    For IBM you'll have special hardware that nobody can handle but IBM certified support personnel.

    For Oracle you're pretty much on your own anyway and you'll have to find a frontend.

    --
    Custom electronics and digital signage for your business: www.evcircuits.com
  10. Re:I was just thinking that by dubl-u · · Score: 4, Insightful

    > But why would they think it was a bad thing to expose? The whole "Look what we can do with so little" angle seems appealing; efficiency is something to boast about nowadays.

    Turn it around. What does Google gain from exposing data about their internal performance?

    Maybe they do well because they are amazingly CPU-efficient on a per-query basis. Maybe it's the opposite; they may be masters at lavishing CPU on every query, but know how to do that very cheaply. Most likely, it's a clever mix of the two.

    Regardless, Google's engineering-fu and operations-fu are mighty, and a major competitive advantage. Releasing detailed data doesn't boost their reputation, as everybody already knows they are great. But it does give potential competitors an idea of what works well, making it easier for them to catch up with Google. As a rule, expect that any details you see from inside Google are old, boring, or vague. As Intel's Andy Grove said, "Only the paranoid survive."

  11. Re:What amazes me... by HarvardAce · · Score: 2, Insightful

    (link to Alexa graph) One problem with Alexa is that it only gathers statistics from those who install the Alexa toolbar... I would tend to think that the Slashdot crowd would be a group that predominantly avoids installing that sort of thing. I actually think there was a discussion on this on Slashdot many months ago.

    That said, I'm sure that the traffic to Wikipedia is probably several orders of magnitude higher than that of Slashdot.
    --
    Note to self: Stop putting jokes in my insightful comments so I can get something other than +1 Funny!
  12. Re:What amazes me... by dubl-u · · Score: 3, Insightful

    > Slashdot is great at taking down sites on crappy shared hosting, but anything with a decently configured dedicated server will likely survive just fine.

    Sounds right to me. I don't have any terribly recent data on a slashdotting, but I think the Slashdot-as-server-killer meme is pretty stale.

    Looking at some old data and extrapolating, I'd guess a modern slashdotting would peak at 200 pageviews/min, or ~3 pv/sec. Get mentioned on Good Morning America or Oprah, on the other hand, and you're looking at 20-200 pageviews/sec. I'd guess that getting on Digg's front page is somewhere in the 20-40 pv/sec range.

    A slashdotting was a big deal back when every nerd used it and the Internet was mainly nerds. Neither is true anymore.

  13. Re:It's easy... by Hillgiant · · Score: 3, Insightful

    Why? If you want search, go to Google. If you want an encyclopedia, go to Wikipedia. It's pretty simple, really.

    --
    -
  14. Where are the PHP/MySQL doom criers? by trawg · · Score: 2, Insightful

    I notice they are conspicuously absent in the comments. They tend to jump up and down in any other post about PHP and MySQL. This is such a great example of the scalability and performance of it WHEN USED CORRECTLY.