Huge Traffic On Wikipedia's Non-Profit Budget
miller60 writes "'As a non-profit running one of the world's busiest web destinations, Wikipedia provides an unusual case study of a high-performance site. In an era when Google and Microsoft can spend $500 million on one of their global data center projects, Wikipedia's infrastructure runs on fewer than 300 servers housed in a single data center in Tampa, Fla.' Domas Mituzas of MySQL/Sun gave a presentation Monday at the Velocity conference that provided an inside look at the technology behind Wikipedia, which he calls an 'operations underdog.'"
How hard can it be to increase the budget or add more servers?
Just go to the Wikipedia page with those numbers and change them. You don't even need to have an account.
If you ever find yourself in a flamewar on Wikipedia you cannot win, bomb Tampa, Florida out of existence.
If someone says he and his monkey have nothing to hide, they almost certainly do.
I don't care how few servers they have; what's more interesting to me is that they run an ultra-high-traffic site, which they aren't having trouble paying for, and do it without ads.
Okay. So pay attention to the sentence before the one you quoted which read, "I'm not suggesting you should follow how we do it."
I.e. the promised follow-up to this story about moving to the new Chicago datacenter? You know, the one where Mr. Taco promised a follow-up story "in a few days" about the "ridiculously overpowered new hardware".
I was quite looking forward to that, but it never eventuated, unless I missed it. It's certainly not filed under Topics->Slashdot.
Most of Wikipedia is a collection of static pages. Most users of Wikipedia are just reading the latest version of an article, to which they were taken by a non-Wikipedia search engine. So all Wikipedia has to do for them is serve a static page. No database work or page generation is required.
Older revisions of pages come from the database, as do the versions one sees during editing and previewing, the history information, and such. Those operations involve the MySQL databases. There are only about 10-20 updates per second taking place in the editing end of the system. When a page is updated, static copies are propagated out to the static page servers after a few tens of seconds.
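To make the split between the two read paths concrete, here is a minimal sketch. It is a toy model, not MediaWiki's actual code: the `WikiSketch` class, its dictionaries, and the `db_reads` counter are all hypothetical stand-ins, and the static copy is refreshed immediately on save rather than after the few tens of seconds described above.

```python
from dataclasses import dataclass, field


@dataclass
class WikiSketch:
    """Toy model of the read path; every name here is hypothetical."""
    # title -> list of revision texts (stands in for the MySQL databases)
    revisions: dict = field(default_factory=dict)
    # title -> rendered HTML (stands in for the static page servers)
    static_cache: dict = field(default_factory=dict)
    # counts how often a view had to fall back to the "database"
    db_reads: int = 0

    def render(self, title, text):
        return f"<h1>{title}</h1><p>{text}</p>"

    def save(self, title, text):
        """Store a new revision and refresh the static copy of the page."""
        self.revisions.setdefault(title, []).append(text)
        # Assumption: propagation is immediate here; the real delay is not modeled.
        self.static_cache[title] = self.render(title, text)

    def view(self, title, revision=None):
        """Latest version comes from the static copy; old revisions hit the database."""
        if revision is None and title in self.static_cache:
            return self.static_cache[title]
        self.db_reads += 1
        history = self.revisions[title]
        text = history[-1] if revision is None else history[revision]
        return self.render(title, text)


wiki = WikiSketch()
wiki.save("Tampa", "first draft")
wiki.save("Tampa", "second draft")
wiki.view("Tampa")              # served from the static copy, no database read
wiki.view("Tampa", revision=0)  # old revision, one database read
```

The point of the sketch is only that the common case (latest version of an article) never touches the revision store.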
Article editing is a check-out/check-in system. When you start editing a page, you get a version token, and when you update the page, the token has to match the latest revision or you get an edit conflict. It's all standard form requests; there's no need for frantic XMLHttpRequest processing while you're working on a page.
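The token check can be sketched in a few lines. This is an illustration of the general optimistic-concurrency idea under my own assumptions, not the real MediaWiki implementation: `Page`, `start_edit`, `submit`, and `EditConflict` are hypothetical names, and the token is simply the revision number.

```python
class EditConflict(Exception):
    """Raised when the submitted token no longer matches the latest revision."""


class Page:
    def __init__(self, text=""):
        self.text = text
        self.revision = 0  # hypothetical token: just a revision counter

    def start_edit(self):
        """Check-out: hand the editor the current token and text."""
        return self.revision, self.text

    def submit(self, token, new_text):
        """Check-in: accept the edit only if nobody saved in the meantime."""
        if token != self.revision:
            raise EditConflict(
                f"page is at revision {self.revision}, edit was based on {token}"
            )
        self.text = new_text
        self.revision += 1
        return self.revision


page = Page("original text")
alice_token, _ = page.start_edit()
bob_token, _ = page.start_edit()

page.submit(alice_token, "Alice's version")   # accepted, page moves to revision 1
try:
    page.submit(bob_token, "Bob's version")   # Bob's token is stale
except EditConflict as err:
    print("edit conflict:", err)
```

Nothing is locked while an editor is typing; the server does one comparison at submit time, which is why plain form requests are enough.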
Because there are no ads, there's no overhead associated with inserting variable ad info into the pages. No need for ad rotators, ad trackers, "beacons" or similar overhead.
Don't be too harsh -- the standards are dependent on the application. Your application, by the nature of the information and its purposes, requires a different standard of reliability than Wikipedia does. You're certainly entitled to be proud of yourself for maintaining that standard.
But don't let that turn into being derogatory about the Wikipedia operation. Wikipedia has identified the correct standard for their application, and by doing so they have successfully avoided the costs and hassle of over-engineering. To each his own...
What does "Non-Profit Budget" mean, anyway? There are non-profits bigger than the company I work for. Non-profit isn't the same as poorly financed.
Dewey, what part of this looks like authorities should be involved?
Although much of the MediaWiki software is a hideous twitching blob of PHP Hell, the base functionality is fairly simple and can run perpetually and scale massively as long as you don't mess with it.
What spoils a lot of projects like this is the constant need for customization. MediaWiki essentially can't be customized (except for plugins obviously, which you install at your own peril) and that is a big reason why it scales so massively.
As for Wikipedia itself, I suspect it is massively weighted in favor of reads. That simplifies circumstances a lot.
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
"It would be neat to have a deeper look at their budget to see how I can save money and boost performance at work."
Since they are using LAMP, obviously they could save money by following Microsoft's "Get The Facts" advice!
A bank requires "six nines" of performance (i.e., right 99.9999% of the time) and probably wants even better than that.
Banks don't require "six nines"; banks require that no data (data being money), once committed, get lost. The "nines" rating refers to the percentage of time a system is online, working, and available to its users. It does not refer to the percentage of acceptable data loss. It is acceptable for bank systems to have downtime, scheduled maintenance, or "closing periods" -- all of these eat into a "nines" rating, none of which lead to data loss.
Slashdot does .. what? 40 mbit of traffic at peak? Wikipedia is roughly 100 times larger. (And WP has three datacenters, not one)
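On the "nines" point a couple of comments up: since the rating is about uptime, it translates directly into allowed downtime per year. A quick sketch of that arithmetic, assuming a 365-day year; the function names are mine and purely illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def availability(nines):
    """Fraction of time the system must be up, e.g. 3 nines -> 0.999."""
    return 1 - 10 ** -nines


def downtime_seconds_per_year(nines):
    """Downtime budget implied by an N-nines availability target."""
    return SECONDS_PER_YEAR * 10 ** -nines


for n in (3, 4, 5, 6):
    print(f"{n} nines: {downtime_seconds_per_year(n):.1f} seconds of downtime per year")
```

Six nines works out to roughly 31.5 seconds a year, and three nines to about 8.76 hours. Neither figure says anything about how much committed data may be lost, which is a separate requirement.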
Slashdot traffic hasn't created noticeable blips on Wikipedia's radar for years.
OTOH, if Wikipedia linked to Slashdot on every page, Slashdot would go down, if due to nothing else but bandwidth exhaustion.
I covered most of the Wikipedia technology bits in my MySQL Conference presentation last year: http://dammit.lt/uc/workbook2007.pdf (that's quite a detailed report)
Nevermind, found it:
http://www.google.com/search?q=google
Reviewing just the first hour of video games.
No, actually - the Wikimedia servers serve all Wikimedia projects (all the Wikipedias, Wikimedia Commons, all the other projects), but Uncyclopedia is part of Wikia, which is a private company owned by Jimmy Wales to do wikis and isn't actually linked to the Wikimedia Foundation in any way.
http://rocknerd.co.uk
Wikipedia's pretty impressive, but how about the Internet Archive? Also a non-profit that doesn't run ads, and not only do they, like Google and Yahoo, "download the Internet" on a regular basis, but the Archive makes backups! Plus, they have huge amounts of streaming audio and video (pd or creative-commons). The first time I ever heard the word "Petabyte" being discussed in practical, real world terms (as in, "we're taking delivery next month") was in connection with the Internet Archive. Several years ago. And it was being used in the plural! :)
They may not have as much incoming traffic as Wikipedia, but the sheer volume of data they manage is truly staggering. (Heck, they have multiple copies of Wikipedia!) When I do download something from there, it's typically in the 80-150 MB range, and 1 or 2 GB a pop isn't unusual, and I know I'm not the only one downloading, so their bandwidth bills must still be pretty impressive.
The fact that these two sites manage to survive and thrive the way they do never ceases to amaze me.
Screw that, I want a bank with six twos of performance. 22.2222%. Of course, any number of nines is easy to achieve. Want six nines? 9.99999% is easy.
Learn to love Alaska