Huge Traffic On Wikipedia's Non-Profit Budget
miller60 writes "'As a non-profit running one of the world's busiest web destinations, Wikipedia provides an unusual case study of a high-performance site. In an era when Google and Microsoft can spend $500 million on one of their global data center projects, Wikipedia's infrastructure runs on fewer than 300 servers housed in a single data center in Tampa, Fla.' Domas Mituzas of MySQL/Sun gave a presentation Monday at the Velocity conference that provided an inside look at the technology behind Wikipedia, which he calls an 'operations underdog.'"
Given that their topic sites are generally in the top three for any search engine query, the volume of traffic they're dealing with (and the budget that they have!) is very impressive. I always thought that they had much beefier infrastructure than the article says.
It would be neat to have a deeper look at their budget to see how I can save money and boost performance at work. It's always nice having the newest/fastest systems out there, but it's rarely the reality.
From TFA: "But losing a few seconds of changes doesn't destroy our business."
Our organization's databases (we're also a non-profit) get several thousand writes per second. Losing 'a few seconds' would mean potentially hundreds of users' record changes were lost. If that happened here, it would be a huge deal. If it happened regularly, it would destroy the business.
I judt got a nre Kinesis keybiartf so please excusr ant egregiou typos.
Every time I Google something, Wikipedia comes near the top most of the time. Maybe that's why Google doesn't want to disclose its processing power: it may very well be a lot smaller than people assume.
read up on the Roman Republic now before Wikipedia gets Slashdotted
How hard can it be to increase the budget or add more servers?
Just go to the Wikipedia page with those numbers and change them. You don't even need to have an account.
More and more companies should look into approaches like this. Seriously. In tight economic times, a more ad-hoc approach saves money. People snubbed Google's approach to IT, and now it's becoming the standard in high availability for big dollar projects. But what about the small dollar approach? As economies slide into recession, you need to focus on a handful of highly talented IT people rather than an army of droids.
Interesting to know, but I wish the article was more substantial than a list of tangential statistics. Also, although Wikipedia receives a hell of a lot of traffic, I bet it's at least an order of magnitude smaller than Google's.
If someone knows where we can find a good comparison between Wikipedia and others, as far as cost to traffic ratio, please speak up.
Datacenterknowledge.com might want to take lessons from Wikipedia as well. Slashdotted...
If you ever find yourself in a flamewar on Wikipedia you cannot win, bomb Tampa, Florida out of existence.
If someone says he and his monkey have nothing to hide, they almost certainly do.
I don't care how few servers they have. What's more interesting to me is that they run an ultra-high traffic site, which they aren't having trouble paying for, and do it without ads.
I wonder how much of a role open source software is playing in Wikipedia's operations. How much is it? Anyone in the know?
I lost power for about a week when that happened and I only live about 15 miles from Tampa, right over the Courtney Campbell Causeway actually.
Wanna fight ? Bend over, stick your head up your ass, and fight for air.
What amazes me is that not only do they manage all this traffic on such a small infrastructure, but even with them being on the front page of /. the site is still up.
I.e. the promised follow-up to this story about moving to the new Chicago datacenter? You know, the one where Mr. Taco promised a follow-up story "in a few days" about the "ridiculously overpowered new hardware".
I was quite looking forward to that, but it never eventuated, unless I missed it. It's certainly not filed under Topics->Slashdot.
Does anyone see the lack of planning that resulted in the placement of a major data center in the thunderstorm and lightning-strike capital of the world?
Most of Wikipedia is a collection of static pages. Most users of Wikipedia are just reading the latest version of an article, to which they were taken by a non-Wikipedia search engine. So all Wikipedia has to do for them is serve a static page. No database work or page generation is required.
Older revisions of pages come from the database, as do the versions one sees during editing and previewing, the history information, and such. Those operations involve the MySQL databases. There are only about 10-20 updates per second taking place in the editing end of the system. When a page is updated, static copies are propagated out to the static page servers after a few tens of seconds.
Article editing is a check-out/check-in system. When you start editing a page, you get a version token, and when you update the page, the token has to match the latest revision or you get an edit conflict. It's all standard form requests; there's no need for frantic XMLHttpRequest processing while you're working on a page.
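To make the token check concrete, here is a minimal sketch of that check-out/check-in flow. It is not MediaWiki's actual code: the `Page` class, its method names, and the use of a plain revision number as the token are all assumptions made for illustration.

```python
class EditConflict(Exception):
    """Raised when the submitted token no longer matches the latest revision."""


class Page:
    # Hypothetical in-memory page; a real wiki would keep revisions in a database.
    def __init__(self, text):
        self.revisions = [text]  # list index doubles as the revision number

    def start_edit(self):
        # "Check out": return a token naming the current revision, plus its text.
        token = len(self.revisions) - 1
        return token, self.revisions[token]

    def submit(self, token, new_text):
        # "Check in": accept the edit only if no newer revision was saved meanwhile.
        latest = len(self.revisions) - 1
        if token != latest:
            raise EditConflict(f"edited revision {token}, latest is {latest}")
        self.revisions.append(new_text)
        return latest + 1  # number of the revision just created
```

If two people call `start_edit()` on the same revision, the first `submit()` succeeds and the second raises `EditConflict`, which is where the manual merge would come in. Nothing has to happen on the server between those two requests.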
Because there are no ads, there's no overhead associated with inserting variable ad info into the pages. No need for ad rotators, ad trackers, "beacons" or similar overhead.
How the hell are we supposed to read the text with an ad hiding the text? What idiot decided that it was a good decision to go to the hard work to create content only to hide it?
What does "Non-Profit Budget" mean, anyway? There are non-profits bigger than the company I work for. Non-profit isn't the same as poorly financed.
Dewey, what part of this looks like authorities should be involved?
The summary was wrong to include a link to the Wikipedia homepage without a Wikipedia link about Wikipedia in case you don't know what Wikipedia is. I myself had to Google Wikipedia to find out what Wikipedia was so I am providing the Wikipedia link about Wikipedia in case others were likewise in the dark regarding Wikipedia.
-l
P.s., Wikipedia.
Help cure AIDS, cancer, and more. Donate your unused computer time to worldcommunitygrid.org. Join Team Slashdot!
Thoughts, everyone?
A-Bomb
Although much of the MediaWiki software is a hideous twitching blob of PHP Hell, the base functionality is fairly simple and can run perpetually and scale massively as long as you don't mess with it.
What spoils a lot of projects like this is the constant need for customization. MediaWiki essentially can't be customized (except for plugins obviously, which you install at your own peril) and that is a big reason why it scales so massively.
As for Wikipedia itself, I suspect it is massively weighted in favor of reads. That simplifies circumstances a lot.
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
In the early days of the WWW the idea with popular pages was that they could be cached all over the internet. Your server checks with their server and if it has the page in cache already then that is what gets served up. What happened to that idea, and why can't Wikipedia work like that, with only obscure and new pages getting served up from Florida?
Those 300 servers are one of the wonders of the world and if you have never made an edit then you should. There must be something you can add to the whole.
There has been much talk of other encyclopaedias but I am still waiting.
Sure, they do it without ad income. But they also do it without having to pay salaries, or colocation fees, or bandwidth costs... (I know they pay some of those, but they also get a metric buttload of contributions in kind.)
When your costs are lower, and your standard of service (and content) malleable, it is easy to live on a smaller income.
I take it that "Works great because it's not "Web 2.0" " means it's fast and dynamic, whereas Web 2.0 generally means slow and dynamic.
The technology behind it is irrelevant: if content is provided by users then it's Web 2.0 (as I understand the term), so Wikipedia definitely is Web 2.0. It's just that they have some fancy caching mechanism to get the best of both worlds. If only more systems were built in a pragmatic way instead of worrying about what they're "supposed" to be.
Web 2.0 is not just about flashy Ajax or what not, it's about user generated dynamic content. WP's "everything is a wiki" architecture might /look/ a bit archaic compared to fancy schmancy dynamic rotating animated gradient-filled forums, but it's much more powerful.
Moreover, WP is not a collection of static pages. If you're logged in, at least, every page is dynamically generated, and every page's history is updated within a few seconds.
Slashdot does .. what? 40 mbit of traffic at peak? Wikipedia is roughly 100 times larger. (And WP has three datacenters, not one.)
Slashdot traffic hasn't created noticeable blips on Wikipedia's radar for years.
OTOH, if Wikipedia linked slashdot on every page slashdot would go down, if due to nothing else but bandwidth exhaustion.
Remember when CmdrTaco called wikipedia a fad and said they couldn't scale? It was during the last (only?) slashdot IRC "interview" a few years back. Just before wikipedia overtook /. in traffic.
Do you even lift?
These aren't the 'roids you're looking for.
According to http://meta.wikimedia.org/wiki/Wikimedia_servers Wikimedia (and by extension, Wikipedia):
"About 300 machines in Florida, 26 in Amsterdam, 23 in Yahoo!'s Korean hosting facility."
also: http://meta.wikimedia.org/wiki/Wikimedia_partners_and_hosts
If wikipedia is anything to go by, you just don't include a decent search engine.
1. Millions of static pages can be served at a very high rate from a single modern server.
2. Editing is basically (a) get token (b) edit page (c) submit revisions with token (d) hope you didn't conflict with someone else's edits, in which case you've got to manually fix things.
3. Lack of in-order human oversight. Wikipedia is powered by a gaggle of zealots, not organised humans, and the rule is "latest change produces current page". That's way easier to implement than a system which involves some sort of review process.
4. Wikipedia operates like a religion with volunteer ministers and one charismatic leader. To paraphrase Bush, it's a whole lot easier to run a group when there's just one dictator and everyone's working toward his whims. "Lowest common denominator fits all" is very easy to engineer but rarely produces progress.
5. Because Wikipedia is operated as a religion rather than a business or charity, no-one gets hurt (except the charismatic leader) if there's data loss or failure, and volunteers are very tolerant of what they're given. It's unnecessary to implement the kind of safeguards against financial loss that any site of Wikipedia's size would normally have to implement.
In other news, a modern desktop can have n people logged in simultaneously typing `less ObjectivismIsAboutFreeWorkers.txt' while another n/100 are in the middle of `vi ObjectivismIsAboutFreeWorkers.txt'.
This is not the first article on Wikipedia's infrastructure to grace Slashdot.
I seem to remember some data distribution (DB replicants) in other parts of the world.
I could be wrong!
Those who would give up liberty to obtain working drivers, deserve neither liberty nor working drivers.
Obviously you can pay much less outside Silicon Valley. If you want investment capital & lots of customers you have to be physically in Silicon Valley and pay the millions of dollars. Even Kiwipedia had to move its office to San Francisco & the data center is going to follow if they can get enough donations.
http://www.goldmark.org/netrants/webstats/
Browser Cache
Local site cache
Local regional cache
Large regional cache
ummm.. by the way, you /could/ use mediawiki as a quick-and-dirty source code versioning system as long as there are only a few members in the team and/or the code is small - maybe a few tens of thousands of lines of code in total.
Wonderful history and diff built in, web access, and you can fit in documents wherever you want. Effective in certain situations.
Hackers have long memories. It works both ways.
Wikipedia was in theory impossible, and unproven in practice. Even now, the main difference is that most people just accept that it works without understanding how.
Wikipedia's pretty impressive, but how about the Internet Archive? Also a non-profit that doesn't run ads, and not only do they, like Google and Yahoo, "download the Internet" on a regular basis, but the Archive makes backups! Plus, they have huge amounts of streaming audio and video (pd or creative-commons). The first time I ever heard the word "Petabyte" being discussed in practical, real world terms (as in, "we're taking delivery next month") was in connection with the Internet Archive. Several years ago. And it was being used in the plural! :)
They may not have as much incoming traffic as Wikipedia, but the sheer volume of data they manage is truly staggering. (Heck, they have multiple copies of Wikipedia!) When I do download something from there, it's typically in the 80-150 MB range, and 1 or 2 GB a pop isn't unusual, and I know I'm not the only one downloading, so their bandwidth bills must still be pretty impressive.
The fact that these two sites manage to survive and thrive the way they do never ceases to amaze me.
Wikipedia has a version in Volapük (a conlang with just 20 speakers), which has over 116,000 articles generated by a bot.
I wanted pictures :(
You posted to the wrong article. You meant to post to this one.
Dammit! Got my tabs confused.
Thanks!
-Russ
Me
"Believe me!" -- Donald Trump
Wikipedia has many sites besides FL: http://meta.wikimedia.org/wiki/Wikimedia_servers
MediaWiki doesn't literally generate static HTML pages because it doesn't need to, since it's designed to be used with the rest of the infrastructure. The "static pages" are the ones served by the squid clusters, which is simpler architecturally (and more distributed) than having the core software literally generate static HTML pages. And the vast majority of Wikipedia pageviews are these static pages served out of squids.
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
"Almost all" Wikipedia pageviews are cached static HTML served up by a squid proxy, because there are orders of magnitude more non-logged-in readers than logged-in users, and many orders of magnitude more reads than edits. Only a small minority of traffic hits the database at all.
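As a rough illustration of why so little traffic reaches the database, here is a toy read-through cache in the spirit of the setup described above. It is a sketch only, not how Squid or MediaWiki are actually implemented; the `CachingFrontEnd` class, the `render` callback, and the `backend_hits` counter are all hypothetical names.

```python
class CachingFrontEnd:
    # Toy stand-in for a caching proxy sitting in front of the wiki software.
    def __init__(self, render):
        self.render = render    # expensive call that would hit PHP and the database
        self.cache = {}         # title -> rendered HTML shared by anonymous readers
        self.backend_hits = 0   # how many requests reached the backend

    def view(self, title, logged_in=False):
        if logged_in:
            # Assumption: logged-in users always get a freshly generated page.
            self.backend_hits += 1
            return self.render(title)
        if title not in self.cache:
            # First anonymous view of this page: render once, then reuse.
            self.backend_hits += 1
            self.cache[title] = self.render(title)
        return self.cache[title]

    def purge(self, title):
        # Called after an edit so the next anonymous reader gets the new version.
        self.cache.pop(title, None)
```

A thousand anonymous views of the same article cost one backend render here; only logged-in views and the first view after a `purge()` go any further.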
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
Wikipedia handles about 5x the number of mbit/sec of the Internet Archive, and since Wikipedia's pages are tiny it takes Wikipedia a lot more work for every bit sent. Wikipedia also does it with something like 1/10th the budget.
Maps.Google.com now includes a Wikipedia article layer along with other layers like traffic, terrain, streets and satellites. The layer is referred to as "Wikipedia"; the articles are shown as Wikipedia's trademarked logo icon on the map. Click an icon and its linked article content pops up right in the browser (or Google Earth, if that's the viewer you're using) window.
That's fair use of Wikipedia's open content, so Google isn't required to pay Wikipedia a license fee or anything. But Google is obviously getting a huge value out of including Wikipedia content in Google's app and UI, including the Wikipedia logo, for which Google is making $BILLIONS a year, and its place in the stock market protected by cobranding with Wikipedia. I see no sign that Google is paying Wikipedia for all that traffic Google gets paid for which Wikipedia must pay to support on Wikipedia's servers.
Google's Maps pages all say at their bottom "", but Wikipedia isn't even mentioned. The "Where does Google Maps get its information?" Help page credits NAVTEQ, TeleAtlas, DigitalGlobe and MDA Federal, but not Wikipedia. The detailed instructions on using the Wikipedia layer and others don't credit Wikipedia; they just take credit for exposing it.
That's an excellent feature of Google Maps, and probably completely blows away competitors like MapQuest and Yahoo Maps. Google should pay Wikipedia whatever it costs to operate the servers that are making Google so many $billions, and even more to keep Wikipedia the excellent resource that Google exploits so well. Probably at least $50 million a year would be good, and just another investment in Google's auxiliary infrastructure.
Or Google could just be evil and get it for free while millions of other people pay Google's tab.
--
make install -not war
But then, this "article" was really one of the most pointless things I've read in a long time, anyway - all it consisted of were some numbers (interesting, admittedly, but not for more than a few seconds), a description of Wikipedia that sounds like it was written by a third-grader ("This is Wikipedia. Wikipedia runs on MySQL. Run, Wikipedia, run!"), and some links to actual presentations.
Why not cut out this middleman and directly link to those? Oh, wait, they're from last year, so this isn't even news. My bad.
Internet archive low traffic compared to Wiki?
{{fact}}?
+1
Setting aside the arguments that maybe you can have that kind of uptime with certain setups and clauses, I did think that these requirements were also often so that it'd be clear to both sides when a sales company owed a user company some kind of compensation. I don't think either would expect it to be reliable, but it gives a measurement system for deciding just how much money is owed, as long as there's agreement on how to interpret it. I.e., if it's down for longer than 30 seconds this year, B will pay A ((#seconds-30) * $some-amount). If it's down for longer than 5 minutes, perhaps they'll switch to a different scale.
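The first tier of that formula is simple enough to write down. A sketch, where the function name and the per-second rate are placeholders (the "$some-amount" above is never specified), and the second scale past 5 minutes is left out because it isn't defined either:

```python
def sla_penalty(downtime_seconds, rate_per_second, allowance_seconds=30):
    # Compensation B owes A: only downtime beyond the yearly allowance is billed.
    # rate_per_second stands in for the unspecified "$some-amount".
    excess = max(0, downtime_seconds - allowance_seconds)
    return excess * rate_per_second
```

With a made-up rate of 5 per second, 20 seconds of downtime costs nothing and 90 seconds costs (90 - 30) * 5 = 300.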
I think you'd find that a lot of companies are prepared for the system not to live up to that particular requirement, no matter which side of the deal they're on.
I notice they are conspicuously absent in the comments. They tend to jump up and down in any other post about PHP and MySQL. This is such a great example of the scalability and performance of it WHEN USED CORRECTLY.
Pretty ironic, when you think about it. A site as incredibly useful as Wikipedia scales nicely, Twitter not so much. I like that kind of irony.
Well.. maybe. Or Maybe not. But Definitely not sort of.
Citation needed here, I think. While I visit Wikipedia a lot more often than archive.org, I've downloaded a few 4GB films from archive.org, and so the total amount of traffic I've generated to them dwarfs the total wikipedia usage of most people I know (and I know a few other people who have downloaded public domain films from archive.org).
I am TheRaven on Soylent News
Wikipedia is all user-generated content.
Web 1.0 contains only marginal amounts of user-generated content.
If you take a $20 a month VPS (or for that matter a $5 a month GoDaddy shared hosting account) serving static CSS/HTML with a few images, on Apache, then you can take a Slashdotting straight to the head without any issue whatsoever. Apache will put your 200 kb of content into memory, it gets served out as fast as the connections come in, you win.
Then point the same Slashdotting at, e.g., a page which requires a minor .1 second hop to the database to render and BAM Slashdotting. Similarly, if you've got a heavy media object (e.g. videos hosted locally), you'll probably saturate your bandwidth.
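Some back-of-envelope arithmetic shows why the .1 second hop hurts. This is only a sketch: the worker count is an assumed figure (nothing above says how many requests the host can process at once), and it ignores queueing, keep-alives and everything else a real server does.

```python
def max_requests_per_second(workers, seconds_per_request):
    # Upper bound on throughput when each worker is tied up for the whole request.
    return workers / seconds_per_request


# Assumption: a small host running 10 concurrent PHP workers.
# 0.1 s is the database hop mentioned above.
dynamic_ceiling = max_requests_per_second(workers=10, seconds_per_request=0.1)
```

Under those assumptions the dynamic page tops out at roughly 100 requests per second no matter how fast the CPU is, and anything arriving beyond that waits in a queue or times out. The static page has no such hop in its path.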
Ironically Slashdot is probably more capable of taking out sites these days than it was previously, not because servers are slower (they're much, much faster) or server code is worse (it's much, much better) but because the average complexity of the typical website is growing.
Compare a default Wordpress install (which doesn't cache anything because, hey, who needs to cache operations that inexpensive? It's not like you were expecting to get popular...) to a static HTML page written in notepad, which was the standard I-can't-believe-it's-not-blog format in 1996. If you fire a slashdotting at the Wordpress install, PHP will cause your RAM utilization to go to "lots" and you will likely either get killed by your host or see the majority of visitors get timed out. If you fire it at the static HTML page, no worries.
Help poke pirates in the eyepatch, arr.
hey, are you THE bloodninja? the famous bloodninja from the cybersex logs? if so, AHAH!