Slashdot Mirror


Huge Traffic On Wikipedia's Non-Profit Budget

miller60 writes "'As a non-profit running one of the world's busiest web destinations, Wikipedia provides an unusual case study of a high-performance site. In an era when Google and Microsoft can spend $500 million on one of their global data center projects, Wikipedia's infrastructure runs on fewer than 300 servers housed in a single data center in Tampa, Fla.' Domas Mituzas of MySQL/Sun gave a presentation Monday at the Velocity conference that provided an inside look at the technology behind Wikipedia, which he calls an 'operations underdog.'"

50 of 240 comments

  1. Impressive by locokamil · · Score: 4, Insightful

    Given that their topic sites are generally in the top three for any search engine query, the volume of traffic they're dealing with (and the budget that they have!) is very impressive. I always thought that they had much beefier infrastructure than the article says.

    1. Re:Impressive by VeNoM0619 · · Score: 4, Funny

      Yes, and seeing how slashdot decided to try and slashdot them also helps...

      --
      Disclaimer: I am not god.
      We may not be created equal
      But we can be treated equal.
    2. Re:Impressive by sm62704 · · Score: 3, Interesting

      I was always impressed with how fast pages loaded; after seeing how small their operation is, I'm even more impressed now!

      Go to any newspaper, from the NYT to one in a smaller city (say, Springfield's State Journal-Register), and the difference in load times is HUGE. It probably has to do with all the ads served from third-party servers in the newspapers. What's the use of having a humungous server with giant pipes if your readers' pages have to wait for a Flash ad served from a 486 powered by gerbils?

      If I link to the SJR from one of my journals, it slows down! I mean, I can see it if it's a front-page slashdotting of a little paper like that, but come on, a user journal?

      And Wikipedia isn't all their servers serve; iinm the uncyclopedia shares servers. Impressive, indeed.

      --
      mcgrew's razor: Never attribute to stupidity that which can be explained by greedy self-interest
    3. Re:Impressive by Bandman · · Score: 3, Interesting

      Yea, a single datacenter seems really risky, especially considering some of the shenanigans that have been going on

    4. Re:Impressive by Achromatic1978 · · Score: 4, Informative

      Except there's not. There are data centers in Europe and Asia, too, including one at some Yahoo facilities - at least on this note, the article (or summary) is utterly wrong. Single datacenter? No.

    5. Re:Impressive by David+Gerard · · Score: 5, Informative

      No, actually - the Wikimedia servers serve all Wikimedia projects (all the Wikipedias, Wikimedia Commons, all the other projects), but Uncyclopedia is part of Wikia, which is a private company owned by Jimmy Wales to do wikis and isn't actually linked to the Wikimedia Foundation in any way.

      --
      http://rocknerd.co.uk
    6. Re:Impressive by David+Gerard · · Score: 4, Informative

      Single database, though. All the databases for all the projects are in Tampa - one master for English Wikipedia and two for all the other 700+ Wikimedia projects.

      (They tried running the databases for Asian languages from the Yahoo!-sponsored datacentre in Seoul for a while, but it didn't actually work much faster than it did with everything in Tampa.)

      --
      http://rocknerd.co.uk
    7. Re:Impressive by kv9 · · Score: 4, Informative

      I was always impressed with how fast pages loaded, after seeing how small their operation is I'm even more impressed now!

      You can skip TFA entirely and look here for detailed info on their servers, locations, pictures, software, pretty graphs and charts. And lots more; just keep clicking.
  2. I've always wondered... by mnslinky · · Score: 4, Insightful

    It would be neat to have a deeper look at their budget to see how I can save money and boost performance at work. It's always nice having the newest/fastest systems out there, but it's rarely the reality.

    1. Re:I've always wondered... by Anonymous Coward · · Score: 5, Funny

      "It would be neat to have a deeper look at their budget to see how I can save money and boost performance at work."

      Since they are using LAMP, obviously they could save money by following Microsoft's "Get The Facts" advice!

    2. Re:I've always wondered... by midom · · Score: 5, Informative

      I covered most of the Wikipedia technology bits in last year's MySQL Conference presentation: http://dammit.lt/uc/workbook2007.pdf (that's quite a detailed report)

  3. The power of low standards by Itninja · · Score: 4, Insightful

    From TFA: "But losing a few seconds of changes doesn't destroy our business."

    Our organization's databases (we're also a non-profit) get several thousand writes per second. Losing 'a few seconds' would mean potentially hundreds of users' record changes were lost. If that happened here, it would be a huge deal. If it happened regularly, it would destroy the business.

    --
    I judt got a nre Kinesis keybiartf so please excusr ant egregiou typos.
    1. Re:The power of low standards by robbkidd · · Score: 5, Insightful

      Okay. So pay attention to the sentence before the one you quoted which read, "I'm not suggesting you should follow how we do it."

    2. Re:The power of low standards by Anonymous Coward · · Score: 5, Insightful

      Don't be too harsh -- the standards are dependent on the application. Your application, by the nature of the information and its purposes, requires a different standard of reliability than Wikipedia does. You're certainly entitled to be proud of yourself for maintaining that standard.

      But don't let that turn into being derogatory about the Wikipedia operation. Wikipedia has identified the correct standard for their application, and by doing so they have successfully avoided the costs and hassle of over-engineering. To each his own...

    3. Re:The power of low standards by WaltBusterkeys · · Score: 4, Interesting

      Exactly. A bank requires "six nines" of performance (i.e., right 99.9999% of the time) and probably wants even better than that. Six nines works out to about 30 seconds of downtime per year.

      It seems like Wikipedia is getting things right 99% of the time, or maybe even 99.9% of the time ("three nines"). That's a pretty low standard relative to how most companies do business.
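
      The "nines" arithmetic above can be checked with a minimal sketch. It assumes a 365-day year, and the function name is made up for illustration:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; leap years ignored


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 6 nines -> 0.000001
    return SECONDS_PER_YEAR * unavailability


for n in (2, 3, 6):
    print(f"{n} nines: {allowed_downtime_seconds(n):,.1f} seconds/year")
```

      Six nines comes to about 31.5 seconds a year, three nines to roughly 8.8 hours, and two nines to about 3.65 days.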

    4. Re:The power of low standards by Nkwe · · Score: 5, Informative

      A bank requires "six nines" of performance (i.e., right 99.9999% of the time) and probably wants even better than that.

      Banks don't require "six nines"; banks require that no data (data being money), once committed, get lost. The "nines" rating refers to the percentage of time a system is online, working, and available to its users. It does not refer to the percentage of acceptable data loss. It is acceptable for bank systems to have downtime, scheduled maintenance, or "closing periods" -- all of these eat into a "nines" rating, none of which lead to data loss.
    5. Re:The power of low standards by AK+Marc · · Score: 5, Funny

      Screw that, I want a bank with six twos of performance. 22.2222%. Of course, any number of nines is easy to achieve. Want six nines? 9.99999% is easy.

    6. Re:The power of low standards by PMBjornerud · · Score: 3, Insightful

      If there's a 30-second period per year when data doesn't properly move, and that requires manual cleanup, that's acceptable. And if there is a 1-hour downtime, EVER, you just blew through the scheduled downtime for the next 120 years.

      "Six nines" is meaningless. Unrealistic.

      It is a promise that you cannot be hit by a single accident, fuckup, pissed-off-employee or act of god.

      --
      I lost my sig.
    7. Re:The power of low standards by az-saguaro · · Score: 3, Interesting

      Your reasoning may be a bit specious. If your databases get "several thousand writes per second", it sounds like this may be massive underuse of your bandwidth - i.e. your servers or databases may be able to handle hundreds of thousands or millions of writes per second. If a few seconds were lost or went down, then the incoming traffic might get cached or queued, waiting for services to come back online. Once the connection is re-established, the write backlog might take only a few seconds or a few fractions of a second to catch up and be back to real time. Users might be unaware of the whole thing, or they would re-log and try again, and there would be no perceptible throttle or bottleneck to data logging.

      Any system that presses its bandwidth limits, any system that walks dangerously close to its top capacity, with no capacitances or reserves, is likely to be down quite a bit. A system such as yours, which hardly taxes its bandwidth at all (I am guessing), could certainly tolerate lost seconds. Admittedly, your system may have had problems like this in the past, and the system was upgraded to handle higher capacity... which is why Wikipedia no longer runs on just one machine.

      It does sound as though Wikipedia may have found a sweet spot, balancing load against reserve capacity or bandwidth, for robust up-time versus economic efficiency. I am sure that this is a topic that computer and network engineers have studied exhaustively - perhaps someone else knows?
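
      The catch-up argument can be put in numbers with a rough sketch. The rates below are hypothetical, not anyone's real figures, and the model assumes writes are queued rather than dropped during the outage:

```python
def catch_up_seconds(write_rate: float, capacity: float, outage_seconds: float) -> float:
    """Time to drain the write backlog that built up during an outage.

    write_rate: incoming writes per second (they keep arriving during catch-up)
    capacity:   writes per second the database can sustain
    """
    if capacity <= write_rate:
        raise ValueError("no spare capacity: the backlog never drains")
    backlog = write_rate * outage_seconds  # writes queued while the database was down
    spare = capacity - write_rate          # capacity left over for draining the queue
    return backlog / spare


# Hypothetical: 3,000 writes/sec incoming, a 5-second outage.
print(catch_up_seconds(3_000, 300_000, 5))  # lots of headroom
print(catch_up_seconds(3_000, 3_300, 5))    # only 10% headroom
```

      With 100x headroom the backlog clears in a fraction of a second; with 10% headroom the same 5-second outage takes 50 seconds to work off.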

  4. Easy to Increase the budget or add servers by Subm · · Score: 5, Funny

    How hard can it be to increase the budget or add more servers?

    Just go to the Wikipedia page with those numbers and change them. You don't even need to have an account.

    1. Re:Easy to Increase the budget or add servers by elrous0 · · Score: 5, Funny

      In their defense, if you're going to run your entire site off a single server farm, a coastal city in Florida is the logical place to put it.

      --
      SJW: Someone who has run out of real oppression, and has to fake it.
  5. Maybe... by nakajoe · · Score: 3, Funny

    Datacenterknowledge.com might want to take lessons from Wikipedia as well. Slashdotted...

  6. Note to self by Anita+Coney · · Score: 5, Funny

    If you ever find yourself in a flamewar on Wikipedia you cannot win, bomb Tampa, Florida out of existence.

    --
    If someone says he and his monkey have nothing to hide, they almost certainly do.
    1. Re:Note to self by canajin56 · · Score: 5, Funny

      That's your solution to everything.

      --
      ASCII stupid question, get a stupid ANSI
    2. Re:Note to self by Ron+Bennett · · Score: 4, Interesting

      Or do a hurricane dance, and let nature do its thing...

      Having all their servers in Tampa, FL (of all places given hurricanes, frequent lightning, flooding, etc there) doesn't seem too smart - I would have thought, given Wikipedia's popularity, their servers would be geographically spread out in multiple locations.

      Though doing that adds a level of complexity and cost that even many for-profit ventures, such as Slashdot, likely can't afford / justify; Slashdot's servers are in one place - Chicago ... to digress a bit, I notice this site's accessibility has been spotty (i.e., more page-not-found errors and timeouts lately) since the server move.

      Ron

    3. Re:Note to self by OverlordQ · · Score: 4, Informative

      They're not all in Tampa; they have a bunch in the Netherlands and a few more in South Korea.

      --
      Your hair look like poop, Bob! - Wanker.
  7. Re:Some thoughts by TheLazySci-FiAuthor · · Score: 4, Insightful

    "... you need to focus on a handful of highly-talented IT people rather than an army of droids."

    This is so true; I've always said, "you get what you pay for."

    Do you want to pay for software, or do you want to pay for people?

    Only one can create the other.

  8. More importantly by wolf12886 · · Score: 5, Interesting

    I don't care how few servers they have; what's more interesting to me is that they run an ultra-high-traffic site, which they aren't having trouble paying for, and do it without ads.

  9. Off-topic, I know, but...what about /.'s hardware? by kiwimate · · Score: 5, Interesting

    I.e. the promised follow-up to this story about moving to the new Chicago datacenter? You know, the one where Mr. Taco promised a follow-up story "in a few days" about the "ridiculously overpowered new hardware".

    I was quite looking forward to that, but it never eventuated, unless I missed it. It's certainly not filed under Topics->Slashdot.

  10. Re:Some thoughts by morgan_greywolf · · Score: 5, Funny

    Do you want to pay for software, or do you want to pay for people?

    Only one can create the other.

    Oh, gods, let's hope so!
  11. Works great because it's not "Web 2.0" by Animats · · Score: 5, Insightful

    Most of Wikipedia is a collection of static pages. Most users of Wikipedia are just reading the latest version of an article, to which they were taken by a non-Wikipedia search engine. So all Wikipedia has to do for them is serve a static page. No database work or page generation is required.

    Older revisions of pages come from the database, as do the versions one sees during editing and previewing, the history information, and such. Those operations involve the MySQL databases. There are only about 10-20 updates per second taking place in the editing end of the system. When a page is updated, static copies are propagated out to the static page servers after a few tens of seconds.
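
    A toy sketch of that read path, assuming a simple purge-on-edit cache. The class and names are invented for illustration, and the real propagation to the static page servers is described above as taking tens of seconds rather than being immediate:

```python
class PageCache:
    """Toy read-through cache in front of a (simulated) database."""

    def __init__(self, database: dict[str, str]) -> None:
        self.database = database            # stands in for the MySQL databases
        self.rendered: dict[str, str] = {}  # stands in for the static page servers
        self.renders = 0                    # counts how often a page was generated

    def _render(self, title: str) -> str:
        self.renders += 1
        return f"<h1>{title}</h1><p>{self.database[title]}</p>"

    def read(self, title: str) -> str:
        # Most reads are served from the cached copy; no database work needed.
        if title not in self.rendered:
            self.rendered[title] = self._render(title)
        return self.rendered[title]

    def edit(self, title: str, text: str) -> None:
        # An edit writes to the database and purges the stale static copy.
        self.database[title] = text
        self.rendered.pop(title, None)


cache = PageCache({"Tampa": "A city in Florida."})
for _ in range(3):
    cache.read("Tampa")
print(cache.renders)  # three reads, one page generation
cache.edit("Tampa", "A coastal city in Florida.")
cache.read("Tampa")
print(cache.renders)  # the edit forced one more
```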

    Article editing is a check-out/check in system. When you start editing a page, you get a version token, and when you update the page, the token has to match the latest revision or you get an edit conflict. It's all standard form requests; there's no need for frantic XMLHttpRequest processing while you're working on a page.
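
    The check-out/check-in scheme is a form of optimistic concurrency control. A minimal sketch, with class and method names that are hypothetical rather than MediaWiki's actual code:

```python
class EditConflict(Exception):
    """Raised when an edit was based on a revision that is no longer the latest."""


class Article:
    def __init__(self, text: str) -> None:
        self.text = text
        self.revision = 1

    def start_edit(self) -> tuple[int, str]:
        """Check out: return the version token and the current text."""
        return self.revision, self.text

    def save(self, token: int, new_text: str) -> int:
        """Check in: accept the edit only if the token names the latest revision."""
        if token != self.revision:
            raise EditConflict(f"based on revision {token}, latest is {self.revision}")
        self.text = new_text
        self.revision += 1
        return self.revision


article = Article("Tampa is a city.")
alice_token, _ = article.start_edit()
bob_token, _ = article.start_edit()

article.save(alice_token, "Tampa is a city in Florida.")  # first save wins
try:
    article.save(bob_token, "Tampa is a port city.")       # stale token
except EditConflict as err:
    print("edit conflict:", err)
```

    Nothing is locked while someone is editing; the only check happens at save time, which is why plain form requests are enough.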

    Because there are no ads, there's no overhead associated with inserting variable ad info into the pages. No need for ad rotators, ad trackers, "beacons" or similar overhead.

  12. Re:What is the role of Open Source by KokorHekkus · · Score: 4, Interesting

    The wiki software, MediaWiki, was written for Wikipedia and is licensed under the GPL (http://www.mediawiki.org/wiki/How_does_MediaWiki_work%3F). According to Wikipedia they use MySQL as their database and run it all on Linux servers.

  13. Confused by the title by Just+Some+Guy · · Score: 5, Insightful

    What does "Non-Profit Budget" mean, anyway? There are non-profits bigger than the company I work for. Non-profit isn't the same as poorly financed.

    --
    Dewey, what part of this looks like authorities should be involved?
  14. Link to wikipedia? by Luyseyal · · Score: 4, Funny

    The summary was wrong to include a link to the Wikipedia homepage without a Wikipedia link about Wikipedia in case you don't know what Wikipedia is. I myself had to Google Wikipedia to find out what Wikipedia was so I am providing the Wikipedia link about Wikipedia in case others were likewise in the dark regarding Wikipedia.

    -l

    P.s., Wikipedia.

    --
    Help cure AIDS, cancer, and more. Donate your unused computer time to worldcommunitygrid.org. Join Team Slashdot!
    1. Re:Link to wikipedia? by hansamurai · · Score: 4, Funny

      Wait, what's this Google thing you're talking about?

  15. Simplicity by wsanders · · Score: 5, Interesting

    Although much of the MediaWiki software is a hideous twitching blob of PHP Hell, the base functionality is fairly simple and can run perpetually and scale massively as long as you don't mess with it.

    What spoils a lot of projects like this is the constant need for customization. MediaWiki essentially can't be customized (except for plugins obviously, which you install at your own peril) and that is a big reason why it scales so massively.

    As for Wikipedia itself, I suspect it is massively weighted in favor of reads. That simplifies circumstances a lot.

    --
    Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
  16. Re:What amazes me... by ceejayoz · · Score: 4, Interesting

    Slashdot is great at taking down sites on crappy shared hosting, but anything with a decently configured dedicated server will likely survive just fine.

    Wikipedia's probably getting hit with hundreds of times the traffic Slashdot is at all times.

  17. Sure they do it without ads... by DerekLyons · · Score: 3, Informative

    Sure they do without ad income. But they also do it without having to pay salaries, or colocation fees, or bandwidth costs... (I know they pay some of those, but they also get a metric buttload of contributions in kind.)

    When your costs are lower, and your standard of service (and content) malleable, it is easy to live on a smaller income.

  18. Nonsense. Wikipedia is THE web 2.0 by Nicolas+MONNET · · Score: 4, Insightful

    Web 2.0 is not just about flashy Ajax or what not, it's about user generated dynamic content. WP's "everything is a wiki" architecture might /look/ a bit archaic compared to fancy schmancy dynamic rotating animated gradient-filled forums, but it's much more powerful.
    Moreover, WP is not a collection of static pages: if you're logged in, at least, every page is dynamically generated, and every page's history is updated within a few seconds.

  19. Wikipedia = much more traffic than slashdot by Anonymous Coward · · Score: 5, Interesting

    Slashdot does .. what? 40 Mbit of traffic at peak? Wikipedia is roughly 100 times larger. (And WP has three datacenters, not one.)

    Slashdot traffic hasn't created noticeable blips on Wikipedia's radar for years.

    OTOH, if Wikipedia linked slashdot on every page slashdot would go down, if due to nothing else but bandwidth exhaustion.

    1. Re:Wikipedia = much more traffic than slashdot by hostyle · · Score: 5, Funny

      OTOH, if Wikipedia linked slashdot on every page slashdot would go down, if due to nothing else but bandwidth exhaustion.

      Sounds like a dare to me. Gentlemen, start your packets!
      --
      Caesar si viveret, ad remum dareris.
    2. Re:Wikipedia = much more traffic than slashdot by beav007 · · Score: 3, Funny

      bandwidth exhaustion
      Welcome to ************ broadband tech support. How can I help?

      "My internet is running very slowly tonight. Why is that?"

      Well sir, it looks like you've been downloading from the other side of the continent. I'd say that your packets are just very tired by the time they reach you...
    3. Re:Wikipedia = much more traffic than slashdot by BooRolla · · Score: 5, Funny

      If only there were some way to put links on to Wikipedia!

  20. Re:I was just thinking that by Chris+Burke · · Score: 4, Interesting

    I don't actually know anything about the total computing power Google employs, but I do know that they will purchase on the order of 1,000-10,000 processors merely to evaluate them prior to making a real purchase.

    --

    The enemies of Democracy are
  21. Re:I was just thinking that by dubl-u · · Score: 4, Insightful

    But why would they think it was a bad thing to expose? The whole "Look what we can do with so little" angle seems appealing; efficiency is something to boast about nowadays. Turn it around. What does Google gain from exposing data about their internal performance?

    Maybe they do well because they are amazingly CPU-efficient on a per-query basis. Maybe it's the opposite; they may be masters at lavishing CPU on every query, but know how to do that very cheaply. Most likely, it's a clever mix of the two.

    Regardless, Google's engineering-fu and operations-fu are mighty, and a major competitive advantage. Releasing detailed data doesn't boost their reputation, as everybody already knows they are great. But it does give potential competitors an idea of what works well, making it easier for them to catch up with Google. As a rule, expect that any details you see from inside Google are old, boring, or vague. As Intel's Andy Grove said, "Only the paranoid survive."

  22. Re:What amazes me... by dubl-u · · Score: 3, Insightful

    Slashdot is great at taking down sites on crappy shared hosting, but anything with a decently configured dedicated server will likely survive just fine.

    Sounds right to me. I don't have any terribly recent data on a slashdotting, but I think the Slashdot-as-server-killer meme is pretty stale.

    Looking at some old data and extrapolating, I'd guess a modern slashdotting would peak at 200 pageviews/min, or ~3 pv/sec. Get mentioned on Good Morning America or Oprah, on the other hand, and you're looking at 20-200 pageviews/sec. I'd guess that getting on Digg's front page is somewhere in the 20-40 pv/sec range.
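
    For scale, the unit conversion behind those guesses, as a quick sketch. The figures are the estimates above, not measurements:

```python
def per_second(per_minute: float) -> float:
    """Convert a per-minute rate to a per-second rate."""
    return per_minute / 60


slashdot_peak = per_second(200)  # the guessed 200 pageviews/min
tv_low, tv_high = 20, 200        # the guessed range for a TV mention, in pv/sec

print(f"slashdotting: {slashdot_peak:.1f} pv/sec")
print(f"TV mention: {tv_low / slashdot_peak:.0f}x to {tv_high / slashdot_peak:.0f}x the slashdot peak")
```

    On these numbers a TV mention would be roughly 6 to 60 times the load of a slashdotting.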

    A slashdotting was a big deal back when every nerd used it and the Internet was mainly nerds. Neither is true anymore.

  23. What about the Internet Archive by Xtifr · · Score: 5, Informative

    Wikipedia's pretty impressive, but how about the Internet Archive? Also a non-profit that doesn't run ads, and not only do they, like Google and Yahoo, "download the Internet" on a regular basis, but the Archive makes backups! Plus, they have huge amounts of streaming audio and video (pd or creative-commons). The first time I ever heard the word "Petabyte" being discussed in practical, real world terms (as in, "we're taking delivery next month") was in connection with the Internet Archive. Several years ago. And it was being used in the plural! :)

    They may not have as much incoming traffic as Wikipedia, but the sheer volume of data they manage is truly staggering. (Heck, they have multiple copies of Wikipedia!) When I do download something from there, it's typically in the 80-150 MB range, and 1 or 2 GB in a pop isn't unusual, and I know I'm not the only one downloading, so their bandwidth bills must still be pretty impressive.

    The fact that these two sites manage to survive and thrive the way they do never ceases to amaze me.

  24. Re:It's easy... by Hillgiant · · Score: 3, Insightful

    Why? If you want search, go to google. If you want an encyclopedia, go to wikipedia. It's pretty simple, really.

    --
    -
  25. Re:I was just thinking that by kiwimate · · Score: 3, Interesting

    You know what I thought was interesting? This story (which was linked to from this /. story titled A Look At the Workings of Google's Data Centers) contained the following snippets.

    On the one hand, Google uses more-or-less ordinary servers. Processors, hard drives, memory--you know the drill.

    and

    While Google uses ordinary hardware components for its servers...

    But this was immediately followed by:

    it doesn't use conventional packaging. Google required Intel to create custom circuit boards.

    For some reason I'd always believed they used pretty much standard components in everything.