Huge Traffic On Wikipedia's Non-Profit Budget
miller60 writes "'As a non-profit running one of the world's busiest web destinations, Wikipedia provides an unusual case study of a high-performance site. In an era when Google and Microsoft can spend $500 million on one of their global data center projects, Wikipedia's infrastructure runs on fewer than 300 servers housed in a single data center in Tampa, Fla.' Domas Mituzas of MySQL/Sun gave a presentation Monday at the Velocity conference that provided an inside look at the technology behind Wikipedia, which he calls an 'operations underdog.'"
Given that their topic sites are generally in the top three for any search engine query, the volume of traffic they're dealing with (and the budget that they have!) is very impressive. I always thought that they had much beefier infrastructure than the article says.
It would be neat to have a deeper look at their budget to see how I can save money and boost performance at work. It's always nice having the newest/fastest systems out there, but it's rarely the reality.
From TFA: "But losing a few seconds of changes doesn't destroy our business."
Our organization's databases (we're also a non-profit) get several thousand writes per second. Losing 'a few seconds' would mean potentially hundreds of users' record changes were lost. If that happened here, it would be a huge deal. If it happened regularly, it would destroy the business.
I judt got a nre Kinesis keybiartf so please excusr ant egregiou typos.
Whenever I Google something, Wikipedia comes up near the top most of the time. Maybe that's why Google doesn't want to disclose its processing power: it may very well be a lot smaller than people assume.
How hard can it be to increase the budget or add more servers?
Just go to the Wikipedia page with those numbers and change them. You don't even need to have an account.
Datacenterknowledge.com might want to take lessons from Wikipedia as well. Slashdotted...
If you ever find yourself in a flamewar on Wikipedia you cannot win, bomb Tampa, Florida out of existence.
If someone says he and his monkey have nothing to hide, they almost certainly do.
This is so true; I've always said, "you get what you pay for."
Do you want to pay for software, or do you want to pay for people?
Only one can create the other.
Which is somehow different from any other open source project how?
I don't care how few servers they have. What's more interesting to me is that they run an ultra-high-traffic site, which they aren't having trouble paying for, and do it without ads.
I.e. the promised follow-up to this story about moving to the new Chicago datacenter? You know, the one where Mr. Taco promised a follow-up story "in a few days" about the "ridiculously overpowered new hardware".
I was quite looking forward to that, but it never eventuated, unless I missed it. It's certainly not filed under Topics->Slashdot.
Most of Wikipedia is a collection of static pages. Most users of Wikipedia are just reading the latest version of an article, to which they were taken by a non-Wikipedia search engine. So all Wikipedia has to do for them is serve a static page. No database work or page generation is required.
Older revisions of pages come from the database, as do the versions one sees during editing and previewing, the history information, and such. Those operations involve the MySQL databases. There are only about 10-20 updates per second taking place in the editing end of the system. When a page is updated, static copies are propagated out to the static page servers after a few tens of seconds.
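A minimal sketch of the read path described above, assuming a simple in-memory cache. The names `PageCache`, `render` and `purge` are hypothetical and are not taken from MediaWiki or Squid; the sketch also drops the cached copy immediately on edit, whereas the comment describes fresh static copies being propagated after a few tens of seconds.

```python
from typing import Callable, Dict


class PageCache:
    """Read-through cache: serve stored HTML, render only on a miss."""

    def __init__(self, render: Callable[[str], str]) -> None:
        self._render = render              # expensive path (database + page generation)
        self._pages: Dict[str, str] = {}   # title -> cached HTML
        self.renders = 0                   # how many times the expensive path ran

    def get(self, title: str) -> str:
        """Return the cached page, rendering it first if it is not cached."""
        if title not in self._pages:
            self._pages[title] = self._render(title)
            self.renders += 1
        return self._pages[title]

    def purge(self, title: str) -> None:
        """Drop the cached copy after an edit so the next read re-renders."""
        self._pages.pop(title, None)


# Hypothetical article store standing in for the database.
articles = {"Tampa": "revision 1"}
cache = PageCache(lambda title: f"<html>{articles[title]}</html>")
```

With a read-heavy workload, almost every `get` is a dictionary lookup; only the first read after an edit pays for rendering.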
Article editing is a check-out/check-in system. When you start editing a page, you get a version token, and when you update the page, the token has to match the latest revision or you get an edit conflict. It's all standard form requests; there's no need for frantic XMLHttpRequest processing while you're working on a page.
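The token check can be sketched roughly as follows, assuming the token is simply the revision number. `Article`, `start_edit`, `save` and `EditConflict` are hypothetical names for illustration, not MediaWiki's actual code.

```python
class EditConflict(Exception):
    """Raised when a save is based on a revision that is no longer the latest."""


class Article:
    def __init__(self, text: str) -> None:
        self.text = text
        self.revision = 1

    def start_edit(self) -> int:
        """Check out: hand the editor a token naming the current revision."""
        return self.revision

    def save(self, token: int, new_text: str) -> int:
        """Check in: accept the edit only if the token still matches."""
        if token != self.revision:
            raise EditConflict(
                f"edit based on revision {token}, latest is {self.revision}"
            )
        self.text = new_text
        self.revision += 1
        return self.revision
```

If two editors check out the same revision, the first `save` succeeds and the second raises `EditConflict`. The server holds no locks or per-editor state in between, which is why plain form requests are enough.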
Because there are no ads, there's no overhead associated with inserting variable ad info into the pages. No need for ad rotators, ad trackers, "beacons" or similar overhead.
The wiki software, MediaWiki, was written for Wikipedia and is licensed under the GPL (http://www.mediawiki.org/wiki/How_does_MediaWiki_work%3F). According to Wikipedia they use MySQL as their database and run it all on Linux servers.
Not to mention hurricanes and faulty electronic voting machines.... ;-)
"Question everything, including this!" - http://technoracle.blogspot.com/
What does "Non-Profit Budget" mean, anyway? There are non-profits bigger than the company I work for. Non-profit isn't the same as poorly financed.
Dewey, what part of this looks like authorities should be involved?
The summary was wrong to include a link to the Wikipedia homepage without a Wikipedia link about Wikipedia in case you don't know what Wikipedia is. I myself had to Google Wikipedia to find out what Wikipedia was so I am providing the Wikipedia link about Wikipedia in case others were likewise in the dark regarding Wikipedia.
-l
P.s., Wikipedia.
Although much of the MediaWiki software is a hideous twitching blob of PHP Hell, the base functionality is fairly simple and can run perpetually and scale massively as long as you don't mess with it.
What spoils a lot of projects like this is the constant need for customization. MediaWiki essentially can't be customized (except for plugins, obviously, which you install at your own peril), and that is a big reason why it scales so massively.
As for Wikipedia itself, I suspect it is massively weighted in favor of reads. That simplifies circumstances a lot.
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
Slashdot is great at taking down sites on crappy shared hosting, but anything with a decently configured dedicated server will likely survive just fine.
Wikipedia's probably getting hit with hundreds of times the traffic Slashdot is at all times.
Sure, they do it without ad income. But they also do it without having to pay salaries, or colocation fees, or bandwidth costs... (I know they pay some of those, but they also get a metric buttload of contributions in kind.)
When your costs are lower, and your standard of service (and content) malleable, it is easy to live on a smaller income.
Web 2.0 is not just about flashy Ajax or what not, it's about user generated dynamic content. WP's "everything is a wiki" architecture might /look/ a bit archaic compared to fancy schmancy dynamic rotating animated gradient-filled forums, but it's much more powerful.
Moreover, WP is not a collection of static pages. If you're logged in, at least, every page is dynamically generated, and every page's history is updated within a few seconds.
Slashdot does... what? 40 Mbit of traffic at peak? Wikipedia is roughly 100 times larger. (And WP has three datacenters, not one.)
Slashdot traffic hasn't created noticeable blips on Wikipedia's radar for years.
OTOH, if Wikipedia linked Slashdot on every page, Slashdot would go down, if due to nothing else but bandwidth exhaustion.
We've never lost external power while we've been at Tampa, but if we did, there are diesel generators. Not that it would be a big deal if we lost power for a day or two. There's no serious problem as long as there's no physical damage to the servers, which we're assured is essentially impossible even with a direct hurricane strike, since the building is well above sea-level and there are no external windows.
According to http://meta.wikimedia.org/wiki/Wikimedia_servers, Wikimedia (and by extension, Wikipedia) has:
"About 300 machines in Florida, 26 in Amsterdam, 23 in Yahoo!'s Korean hosting facility."
also: http://meta.wikimedia.org/wiki/Wikimedia_partners_and_hosts
Add power costs, difficulty of travel, possible flooding, etc. It's all for historical reasons; we can't just migrate datacenters at will, since that requires quite a high investment. The datacenter choice was simply because the founder lived there in 2001, when all we needed was a single server. --Domas
I don't know what you could use other than open source, especially on the database side. You have only a few choices:
Microsoft ($$$) (approx. $50,000 per server per year in licensing costs since it's a public (unlimited CAL) enterprise-level site)
IBM ($$) (approx. $500,000 per year for leasing the whole operation, another load for support)
Oracle ($) (approx. $20,000 per backend and about 30 contractors for the next 5 years for the implementation)
Linux, MySQL, PHP (Free)
Not to mention, with Microsoft you'll need more servers to handle the same amount of load, especially if you use a Microsoft-based software package for the frontend as well (ASP.NET, MS CRM or SharePoint).
For IBM you'll have special hardware that nobody can handle but IBM certified support personnel.
For Oracle you're pretty much on your own anyway and you'll have to find a frontend.
That said, I'm sure that the traffic to Wikipedia is probably several orders of magnitude higher than that of Slashdot.
Note to self: Stop putting jokes in my insightful comments so I can get something other than +1 Funny!
Looking at some old data and extrapolating, I'd guess a modern slashdotting would peak at 200 pageviews/min, or ~3 pv/sec. Get mentioned on Good Morning America or Oprah, on the other hand, and you're looking at 20-200 pageviews/sec. I'd guess that getting on Digg's front page is somewhere in the 20-40 pv/sec range.
A slashdotting was a big deal back when every nerd used it and the Internet was mainly nerds. Neither is true anymore.
Wikipedia's pretty impressive, but how about the Internet Archive? Also a non-profit that doesn't run ads, and not only do they, like Google and Yahoo, "download the Internet" on a regular basis, but the Archive makes backups! Plus, they have huge amounts of streaming audio and video (pd or creative-commons). The first time I ever heard the word "Petabyte" being discussed in practical, real world terms (as in, "we're taking delivery next month") was in connection with the Internet Archive. Several years ago. And it was being used in the plural! :)
They may not have as much incoming traffic as Wikipedia, but the sheer volume of data they manage is truly staggering. (Heck, they have multiple copies of Wikipedia!) When I do download something from there, it's typically in the 80-150 MB range, and 1 or 2 GB in a pop isn't unusual, and I know I'm not the only one downloading, so their bandwidth bills must still be pretty impressive.
The fact that these two sites manage to survive and thrive the way they do never ceases to amaze me.
Why? If you want search, go to Google. If you want an encyclopedia, go to Wikipedia. It's pretty simple, really.
-
"Believe me!" -- Donald Trump
I notice they are conspicuously absent in the comments. They tend to jump up and down in any other post about PHP and MySQL. This is such a great example of how well both scale and perform WHEN USED CORRECTLY.
It exists. It's called "validators". There are strong and weak validators. You can Vary on your validators, and thus have multiple copies of the same object but in different forms (so given a text document, you can have it in different languages, compressed/uncompressed, etc.)
Your browser will then quite happily ask the origin server (which may not be the "origin" origin) for an object and provide validators (Last-Modified -> If-Modified-Since; ETag -> If-None-Match), which the origin (or the cache which is pretending to be the origin) can check against its local copy and then return a "yes, use your local copy" or "no, here is the current version."
It's all there, right now, in HTTP/1.1. I swear. People just don't have a clue how to use caching, they've been bitten by the difference between "expiry" and "revalidation", and they just turn off all hope of caching. Maybe they're scared; maybe their job is to sell bits; maybe they're just clueless about it and turning off caching fixed an obscure problem. In any case, it's right there in HTTP/1.1 and you can use it any time you like.
Adrian
(I'm a Squid developer.)
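For anyone who wants to see the ETag / If-None-Match revalidation described above in concrete form, here is a minimal sketch of the server side. `make_etag` and `respond` are hypothetical helpers, not part of Squid or any HTTP library, and the sketch assumes the client sends back exactly one ETag (it ignores `*` and comma-separated lists, which real HTTP/1.1 allows).

```python
import hashlib
from typing import Dict, Tuple


def make_etag(body: bytes) -> str:
    """Derive a strong validator from the exact bytes of the response body."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(
    body: bytes, request_headers: Dict[str, str]
) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for a GET that may carry a validator."""
    etag = make_etag(body)
    headers = {"ETag": etag}
    if request_headers.get("If-None-Match") == etag:
        # "Yes, use your local copy": no body is sent.
        return 304, headers, b""
    # The client has no copy, or a stale one: send the current version.
    return 200, headers, body
```

A first request with no validator gets a 200 and the full body plus an `ETag`. Repeating the request with that value in `If-None-Match` gets a 304 with an empty body, until the content changes and the tag no longer matches.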