Slashdot Mirror


Huge Traffic On Wikipedia's Non-Profit Budget

miller60 writes "'As a non-profit running one of the world's busiest web destinations, Wikipedia provides an unusual case study of a high-performance site. In an era when Google and Microsoft can spend $500 million on one of their global data center projects, Wikipedia's infrastructure runs on fewer than 300 servers housed in a single data center in Tampa, Fla.' Domas Mituzas of MySQL/Sun gave a presentation Monday at the Velocity conference that provided an inside look at the technology behind Wikipedia, which he calls an 'operations underdog.'"

17 of 240 comments

  1. Re:Impressive by Achromatic1978 · · Score: 4, Informative

    Except there's not. There are data centers in Europe and Asia, too, including one at a Yahoo facility - at least on this point, the article (or summary) is utterly wrong. Single datacenter? No.

  2. Re:The power of low standards by MinuteElectron · · Score: 2, Informative

    Changes are never just lost. When an error does happen and the action cannot be completed, it is rejected and the user is notified so they can try again. You have vastly overstated the severity of such issues.

    --
    MinuteElectron
  3. Re:Works great because it's not "Web 2.0" by Anonymous Coward · · Score: 1, Informative

    There is practically no such thing as a static page in Wikipedia. We're running 2 small Wikipedia mirror clusters, and it's quite obvious that if you don't run memcached alongside Apache, all pages are rendered from the database on demand, for every single request. Large and complex pages (e.g. on Hydrogen or Gold) take more than 1 second to render even on the fastest CPUs available.
    You make things sound cheap and simple, but without the memcached and the squid clusters Wikipedia is using, the whole thing would require significantly more hardware than the foundation could afford.
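
A minimal sketch of the cache-aside pattern described above, assuming a plain in-process dict as a stand-in for memcached and a hypothetical `slow_render` function in place of the real page renderer. The class and function names are illustrative only and are not taken from MediaWiki.

```python
import time
from typing import Callable, Dict, Tuple


class RenderCache:
    """Cache-aside wrapper; a plain dict stands in for memcached (assumption)."""

    def __init__(self, render: Callable[[str], str], ttl_seconds: float = 300.0):
        self._render = render
        self._ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, str]] = {}  # title -> (expiry, html)
        self.hits = 0
        self.misses = 0

    def get(self, title: str) -> str:
        now = time.monotonic()
        entry = self._store.get(title)
        if entry is not None and entry[0] > now:
            # Fresh cached copy: skip the expensive render entirely.
            self.hits += 1
            return entry[1]
        # Miss or expired entry: render once and remember the result.
        self.misses += 1
        html = self._render(title)
        self._store[title] = (now + self._ttl, html)
        return html

    def invalidate(self, title: str) -> None:
        # Called after an edit so the next request re-renders the page.
        self._store.pop(title, None)


render_calls = []


def slow_render(title: str) -> str:
    # Hypothetical stand-in for rendering a page from the database.
    render_calls.append(title)
    return f"<h1>{title}</h1>"


cache = RenderCache(slow_render)
cache.get("Hydrogen")         # miss: rendered
cache.get("Hydrogen")         # hit: served from the cache
cache.invalidate("Hydrogen")  # simulate an edit to the page
cache.get("Hydrogen")         # miss again: re-rendered
```

With the cache in place, the renderer runs only on a miss or after an invalidation, which is why dropping the cache layer multiplies the rendering load.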

  4. Sure they do it without ads... by DerekLyons · · Score: 3, Informative

    Sure, they do it without ad income. But they also do it without having to pay salaries, or colocation fees, or bandwidth costs... (I know they pay some of those, but they also get a metric buttload of contributions in kind.)

    When your costs are lower, and your standard of service (and content) is malleable, it is easy to live on a smaller income.

  5. Re:The power of low standards by Nkwe · · Score: 5, Informative

    A bank requires "six nines" of performance (i.e., right 99.9999% of the time) and probably wants even better than that.

    Banks don't require "six nines"; banks require that no data (data being money), once committed, get lost. The "nines" rating refers to the percentage of time a system is online, working, and available to its users. It does not refer to the percentage of acceptable data loss. It is acceptable for bank systems to have downtime, scheduled maintenance, or "closing periods" -- all of these eat into a "nines" rating, and none of them leads to data loss.
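
To put the "nines" in concrete terms, here is a small sketch that converts an availability rating into the downtime it permits per year. It assumes a 365-day year and the usual reading that N nines means an unavailable fraction of 10^-N; the function name is made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines (e.g. 3 -> 99.9%)."""
    unavailability = 10 ** (-nines)  # fraction of time the system may be down
    return SECONDS_PER_YEAR * unavailability


for n in range(2, 7):
    seconds = allowed_downtime_seconds(n)
    print(f"{n} nines: {seconds:,.1f} s/year ({seconds / 3600:.2f} h)")
```

Under these assumptions six nines leaves roughly 31.5 seconds of downtime per year, while three nines leaves about 8.76 hours. Either way the figure measures availability only and says nothing about whether committed data can be lost.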
  6. Re:Note to self by OverlordQ · · Score: 4, Informative

    They're not all in Tampa; they have a bunch in the Netherlands and a few more in South Korea.

    --
    Your hair look like poop, Bob! - Wanker.
  7. Re:Works great because it's not "Web 2.0" by Tweenk · · Score: 2, Informative

    If you haven't noticed, "Web 2.0" is a long-established buzzword - which means it carries little meaning, but it looks good in advertising. Just like "information superhighway", "enterprise feature" or "user friendly".

    --
    Those who would give up liberty to obtain working drivers, deserve neither liberty nor working drivers.
  8. Re:Out like a light by timstarling · · Score: 2, Informative

    We've never lost external power while we've been at Tampa, but if we did, there are diesel generators. Not that it would be a big deal if we lost power for a day or two. There's no serious problem as long as there's no physical damage to the servers, which we're assured is essentially impossible even with a direct hurricane strike, since the building is well above sea level and there are no external windows.

  9. Servers and locations by Anonymous Coward · · Score: 2, Informative

    According to http://meta.wikimedia.org/wiki/Wikimedia_servers Wikimedia (and by extension, Wikipedia):

    "About 300 machines in Florida, 26 in Amsterdam, 23 in Yahoo!'s Korean hosting facility."

    also: http://meta.wikimedia.org/wiki/Wikimedia_partners_and_hosts

  10. Re:I've always wondered... by midom · · Score: 5, Informative

    I covered most of the Wikipedia technology bits in my MySQL Conference presentation last year: http://dammit.lt/uc/workbook2007.pdf (that's quite a detailed report)

  11. Re:Tampa? by midom · · Score: 2, Informative

    Add power costs, difficulty of travel, possible flooding, etc. It is all for historical reasons; we can't just migrate datacenters at will, since that requires quite a high investment. And the datacenter was chosen simply because the founder lived there in 2001, when all we needed was a single server. --Domas

  12. Re:Impressive by David Gerard · · Score: 5, Informative

    No, actually - the Wikimedia servers serve all Wikimedia projects (all the Wikipedias, Wikimedia Commons, all the other projects), but Uncyclopedia is part of Wikia, which is a private company owned by Jimmy Wales to do wikis and isn't actually linked to the Wikimedia Foundation in any way.

    --
    http://rocknerd.co.uk
  13. Re:Impressive by David Gerard · · Score: 4, Informative

    Single database, though. All the databases for all the projects are in Tampa - one master for English Wikipedia and two for all the other 700+ Wikimedia projects.

    (They tried running the databases for Asian languages from the Yahoo!-sponsored datacentre in Seoul for a while, but it didn't actually work much faster than it did with everything in Tampa.)

    --
    http://rocknerd.co.uk
  14. What about the Internet Archive by Xtifr · · Score: 5, Informative

    Wikipedia's pretty impressive, but how about the Internet Archive? Also a non-profit that doesn't run ads, and not only do they, like Google and Yahoo, "download the Internet" on a regular basis, but the Archive makes backups! Plus, they have huge amounts of streaming audio and video (public domain or Creative Commons). The first time I ever heard the word "Petabyte" being discussed in practical, real-world terms (as in, "we're taking delivery next month") was in connection with the Internet Archive. Several years ago. And it was being used in the plural! :)

    They may not have as much incoming traffic as Wikipedia, but the sheer volume of data they manage is truly staggering. (Heck, they have multiple copies of Wikipedia!) When I do download something from there, it's typically in the 80-150 MB range, and 1 or 2 GB at a pop isn't unusual, and I know I'm not the only one downloading, so their bandwidth bills must still be pretty impressive.

    The fact that these two sites manage to survive and thrive the way they do never ceases to amaze me.

  15. Re:Wikipedia = much more traffic than slashdot by Haoie · · Score: 2, Informative

    That's pretty obvious, because Wikipedia has, literally, millions of topics covering every possible field, whereas /. is very limited in scope.

    --
    If each mistake being made is a new one, then progress is being made.
  16. Re:Impressive by kv9 · · Score: 4, Informative

    I was always impressed with how fast pages loaded, after seeing how small their operation is I'm even more impressed now!

    You can skip TFA entirely and look here for detailed info on their servers, locations, pictures, software, pretty graphs and charts, and lots more. Just keep clicking.
  17. Re:Cached on servers all over the interweb? by adri · · Score: 2, Informative

    It exists. It's called "validators". There are strong and weak validators. You can Vary on your validators, and thus have multiple copies of the same object but in different forms (so given a text document, you can have it in different languages, compressed/uncompressed, etc.)

    Your browser will then quite happily ask the origin server (which may not be the "origin" origin) for an object and provide validators (Last-Modified -> If-Modified-Since; ETag -> If-None-Match), which the origin (or the cache which is pretending to be the origin) can check against its local copy and then return a "yes, use your local copy" or "no, don't bother."

    It's all there, right now, in HTTP/1.1. I swear. People just don't have a clue how to use caching, they've been bitten by the difference between "expiry" and "revalidation", and they just turn off all hope of caching. Maybe they're scared; maybe their job is to sell bits; maybe they're just clueless about it and turning off caching fixed an obscure problem. In any case, it's right there in HTTP/1.1 and you can use it any time you like.

    Adrian

    (I'm a Squid developer.)
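
The revalidation step described above can be sketched as a pure function that decides between 304 (use your local copy) and 200 (send the full object). This is a simplified illustration and not Squid's implementation: the function names are hypothetical, header lookup is assumed to be exact-case, and the If-None-Match list is split naively on commas.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Dict


def _opaque(tag: str) -> str:
    """Drop the weak prefix so W/"v1" and "v1" compare equal (weak comparison)."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def revalidate(request_headers: Dict[str, str], etag: str,
               last_modified: datetime) -> int:
    """Return 304 if the client's copy is still valid, else 200."""
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # When present, the ETag validator takes precedence over the date.
        candidates = [c.strip() for c in if_none_match.split(",")]
        if "*" in candidates or any(_opaque(c) == _opaque(etag) for c in candidates):
            return 304
        return 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # unparseable date: ignore the validator
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        if last_modified <= since:
            return 304

    return 200


# Hypothetical stored object: its ETag and last modification time.
stored_etag = '"v1"'
stored_mtime = datetime(2008, 6, 24, 12, 0, 0, tzinfo=timezone.utc)

status = revalidate({"If-None-Match": '"v1"'}, stored_etag, stored_mtime)
print(status)
```

A 304 response carries no body, so a successful revalidation costs one small round trip instead of a full transfer of the object.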