Wikipedia On the Brink? Or Crying Wolf?
netbuzz writes "Might Wikipedia 'disappear' three or four months from now absent a major infusion of cash donations? The suggestion has been made by Florence Devouard, chairwoman of the Wikimedia Foundation. And while her spokesperson has since backpedaled off that dire prediction, there can be little doubt that the encyclopedia anyone can edit could use a few more benefactors to go along with all those editors."
Downloads of all the Wikimedia Projects. You need to do a lot of DB work (XML -> SQL conversion, importing, rebuilding tables, etc.)
The issue is simply that massive servers are not cheap. Wikimedia is already at 100+ servers, and they are barely getting by. They could spend half a million on servers and still have a wish-list. And bandwidth isn't cheap. They get a charity discount, and a bulk discount, but it's still gigabytes and gigabytes a day.
http://download.wikipedia.org/ is what you are looking for; you can get monthly database dumps for all the wikis, containing XML files with the articles (or other meta-data, depending on what you are looking for).
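For anyone attempting the XML side of that DB work, here is a minimal Python sketch of streaming titles and article text out of a dump. The sample XML is a simplified, made-up stand-in (real dumps declare an XML namespace and carry many more fields), and `iter_pages` and `local_name` are hypothetical helper names, not part of any Wikimedia tool.

```python
import io
import xml.etree.ElementTree as ET

# Simplified, made-up sample; real dump files are far larger and
# declare an XML namespace on the root element.
SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>Squid (software)</title>
    <revision><text>Squid is a caching proxy.</text></revision>
  </page>
  <page>
    <title>Memcached</title>
    <revision><text>Memcached is an in-memory cache.</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) pairs without loading the whole file into memory."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # free the finished page's subtree


pages = list(iter_pages(io.BytesIO(SAMPLE_DUMP.encode("utf-8"))))
```

Each pair could then be handed to whatever SQL import step you use; the actual MediaWiki table layout is not shown here.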
Zorglub
Why don't you save yourself some time and just get a Wikipedia search bar for your browser? I used to do the same thing, but got tired of going through a Google search just to wind up clicking on the Wikipedia entry link anyway. Might as well spare yourself the extra steps and have a direct Wikipedia search in the corner of your browser window.
For Firefox:
https://addons.mozilla.org/search-engines.php
For Opera:
http://widgets.opera.com/search/?search=wikipedia&x=0&y=0&scope=all
For Internet Explorer:
http://www.google.com/search?q=help+me+i'm+still+using+internet+explorer&btnG=Google+Search
brandelf: invalid ELF type 'KEEBLER'
Whatever happened to that?
I'm sure he meant this page: http://en.wikipedia.org/wiki/The_Star_Wars_Holiday_Special
How is it that one careless match can start a forest fire, but it takes a whole box to start a campfire?
It looks like hardware is their single largest expense, at $190,000. Personnel takes a distant second place at $33,000. Bandwidth (well, hosting) takes third, at $24,000.
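To put "distant second" in numbers, a quick Python sketch using only the three figures quoted above. Note the shares are relative to these three items alone, not to the whole budget, which has other lines.

```python
# Figures quoted above (USD); these are only the three largest items,
# not the whole budget.
top_items = {"hardware": 190_000, "personnel": 33_000, "hosting": 24_000}

subtotal = sum(top_items.values())
# Percentage of the three-item subtotal, rounded to one decimal place.
shares = {name: round(100 * cost / subtotal, 1) for name, cost in top_items.items()}
hardware_vs_personnel = round(top_items["hardware"] / top_items["personnel"], 1)
```

Hardware comes out at roughly 77% of the three-item subtotal, nearly six times the personnel line.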
Also, a note at the bottom: So far this is little more than a minimal budget, meaning a budget designed to pretty much just keep the foundation going. What is not included are special projects (content and/or software). Please include ideas for that on the talk page. --Daniel Mayer 22:39, 18 September 2005 (UTC)
tasks(723) drafts(105) languages(484) examples(29106)
Old adage: you have to spend money in order to get people to give you the money that they made.
It's punchier in the original Klingon, I grant you.
If you were blocking sigs, you wouldn't have to read this.
Bandwidth is cheap as dirt.
So you have experience with very popular web sites, do you? When you need high-performance, consistent bandwidth, it is not cheap. I worked on a popular site whose bill was in the tens of thousands of dollars a month. Wikipedia is extremely fast, so you can bet they're paying top dollar.
Developers: We can use your help.
Alternatively, for Opera, you could go to Wikipedia, right-click inside the search box, then select "create search". Once you have done that, to search Wikipedia simply enter "w" followed by the search terms in the address bar.
I do already contribute *plenty* to citizendium, by contributing articles and edits and money to wikipedia to fund you guys mirroring their content.
You do not, because we do not mirror Wikipedia's content. We unforked weeks ago.
The data is available as XML, but to clone the site you need the MediaWiki app.
--
make install -not war
Citizendium unforked from Wikipedia some weeks ago. And no articles will pass the approval process on Citizendium unless they can stand up to the rigour and consistency that scholars are used to in their professional work, which means that most Approved articles will bear little resemblance to their Wikipedia counterparts, inconsistent with regards to tone, styling, and references even in the best of cases.
Re-read his comment: he never claimed that they don't spend tens of thousands of dollars on bandwidth. He said they're doing something wrong when they spend tens of thousands of dollars on bandwidth.
Where and how you procure bandwidth is a business decision, and business folks aren't exactly the brightest of folks when it comes to technology. Yes, I have worked for an internet company that went through insane amounts of data and yes, they paid dearly for bandwidth and yes, they could easily have gotten the same amount for 1/10th of the price. But the business manager knew someone who swore Rackspace was the bomb and thus Rackspace it was, at Rackspace's prices. Never mind that there's a rash of very cheap data centers much closer to the backbone who'll give you unmetered BW for a factor of five less than what we paid.
(Of course that "friend" ran a business that relied much more on HW uptime than data throughput and for him Rackspace might have been the right choice. For us, it wasn't.)
I agree with the statement that you're doing something wrong if you're paying tens of thousands of $$ for BW a month. That's just not what it costs.
We're all born with nothing.
If you die in debt, you're ahead.
That budget is a year and a half old; Wikipedia's traffic has increased more than tenfold over that period.
Remember, there were no nuclear weapons before women were allowed to vote.
Whoa there -- either we're living in entirely different worlds or there's a real ambiguity in the term "bandwidth" here. Where I come from, BW was never measured in "per second" or any such thing. A number like "GB/s" would have been called "throughput". When we used the term "bandwidth" it meant something like the aggregate amount of data shipped in or out over the course of a month. In essence the integral over the number you're quoting.
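The two usages are tied together by exactly that integral; at a constant rate it reduces to volume = rate x time. A small Python sketch of the conversion, assuming decimal units (1 TB = 10^12 bytes) and a 30-day month; the function names and the 100 Mbit/s figure are made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def monthly_transfer_tb(avg_mbit_per_s, days=30):
    """Total data moved (decimal TB) at a sustained average rate in Mbit/s."""
    bytes_per_second = avg_mbit_per_s * 1_000_000 / 8
    return bytes_per_second * SECONDS_PER_DAY * days / 1e12


def average_mbit_per_s(transfer_tb, days=30):
    """Average rate (Mbit/s) needed to move transfer_tb within the period."""
    bits = transfer_tb * 1e12 * 8
    return bits / (SECONDS_PER_DAY * days) / 1_000_000


# Hypothetical example: a link kept busy at 100 Mbit/s for a 30-day month.
example_tb = monthly_transfer_tb(100)
```

A sustained 100 Mbit/s works out to 32.4 TB over a 30-day month, so a per-second figure and a per-month figure describe the same traffic in different units. Real traffic is bursty, so peak rates will sit above the average this gives.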
I've never dealt with a company that put limits on the amount of data I can move around "per second" or some such -- that's home-broadband thinking. Or has the business changed so much that businesses are now running their own servers in their own buildings and are paying for the connection from there to the trunk? I don't know about Wikimedia, but most "popular websites" aren't in the business of running hardware - they're in the business of running a business.
Is this another of those fabled "paradigm shifts" of the last couple years?
They paid more than $100,000 a year in salaries to their growing staff? And built up a $500,000 cash reserve? http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2006-12-11/Financial_audit
http://upload.wikimedia.org/wikipedia/foundation/2/28/Wikimedia_2006_fs.pdf
What you're seeing is an empty references section followed by a navigation box. Yes, this is a defect in the article. I just added a tag to the page to bring this specific defect to other editors' attention.
A page on the Wikimedia Foundation's site indicates around 200 terabytes a month, but it is marked as outdated, and I have no idea how much traffic has grown since then. So yes, it's not cheap, but you have to wonder whether they couldn't get a couple of hundred corporate sponsors to commit to $500 or so a month to pay for 1-3 servers each and get their logo as a sponsor on the pages served from that server. I know a lot of people don't want ads on Wikipedia, but a single, discreet, static and easily blockable image wouldn't be very intrusive. 3TB/month of bandwidth plus a server can be leased for a few hundred dollars.
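Back-of-the-envelope version of that proposal in Python, using only the numbers in this post. "A couple of hundred" is taken as 200, and the traffic figure is the outdated one, so treat the results as rough.

```python
import math

MONTHLY_TRAFFIC_TB = 200      # figure from the outdated page cited above
TB_PER_LEASED_SERVER = 3      # bandwidth bundled with one leased server
SPONSORS = 200                # "a couple of hundred", taken as 200
PLEDGE_PER_SPONSOR = 500      # USD per month

# Leased servers needed if bandwidth were the only constraint.
servers_for_traffic = math.ceil(MONTHLY_TRAFFIC_TB / TB_PER_LEASED_SERVER)
monthly_pledges = SPONSORS * PLEDGE_PER_SPONSOR
yearly_pledges = monthly_pledges * 12
```

That gives 67 leased servers to cover the bundled bandwidth and $100,000 a month in pledges. It counts bandwidth only and says nothing about whether that many small boxes could carry the database and rendering load.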
Yo. Well, it's pretty much as you say. Wikipedia currently has about 200 servers, most of which are dedicated to a single task. There's a web cluster running Apache with PHP (with eAccelerator, I believe) that runs the MediaWiki software and serves requests. (That is about 100 servers, if I remember right.) There's a database cluster which runs the MySQL databases: one cluster is English, a few other languages have dedicated boxes (Chinese, Korean, Spanish, I think...), and another cluster handles all other languages. There's also a Nagios box somewhere in there that monitors the whole shebang. Everything is situated behind a set of Squids, like you suggested. In fact, three of four requests to Wikipedia are hitting a Squid, not an Apache server. Also, some of the Apaches have memcached.
Wikipedia is indeed just text and images, but even with the cache, the entire thing has to run a disturbingly large number of edits through a database and then retrieve any one of over 1.6 million articles anytime it's requested. The scale on which the software runs hurts my head, and I would imagine the guys at Wikimedia's server place have similar headaches hourly.
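A toy Python sketch of that layering, to show why the cache absorbs most reads while every edit still reaches the back end. This is not how Squid or MediaWiki are implemented; `Backend` and `CachingFrontend` are hypothetical stand-ins, and a real deployment would also handle expiry, memory limits, and logged-in users.

```python
class Backend:
    """Stands in for the Apache/PHP + database tier; counts expensive renders."""

    def __init__(self):
        self.articles = {}
        self.renders = 0

    def render(self, title):
        self.renders += 1
        return "<html>" + self.articles[title] + "</html>"


class CachingFrontend:
    """Stands in for the Squid layer: serve cached pages, purge on edit."""

    def __init__(self, backend):
        self.backend = backend
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, title):
        if title in self.cache:
            self.hits += 1
            return self.cache[title]
        self.misses += 1
        page = self.backend.render(title)
        self.cache[title] = page
        return page

    def edit(self, title, new_text):
        # An edit always goes to the backend and invalidates the cached copy.
        self.backend.articles[title] = new_text
        self.cache.pop(title, None)


backend = Backend()
frontend = CachingFrontend(backend)
frontend.edit("Squid", "A caching proxy.")
for _ in range(4):
    frontend.get("Squid")          # 1 miss, then 3 hits
frontend.edit("Squid", "A caching proxy, revised.")
latest = frontend.get("Squid")     # miss again after the purge
```

In the first four reads, three are served from the cache, matching the three-in-four figure above; the edit then forces a fresh render, which is the part no amount of caching removes.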
~ C.
I got that you didn't like MediaWiki. Yeah, it is slow, but it is slow because of what it does.
As far as the actual studies involved, I would have to at the moment refer you to the Wikimedia development team directly, although I've seen some published values that do go into some details.
For general statistics of Wikimedia projects, I would have to refer you to http://stats.wikimedia.org/ that goes into some depth about individual projects and what the general demands on them are, including statistical summaries of leading contributors, growth of content, and edit counts that would certainly be of general interest in terms of trying to compare to other Wiki environments.
I also would like to mention that Erik Zachte, the person who has written this statistical summary mentioned above, has also gone into depth regarding general usage data where he has been given direct access to the Apache server logs and has noted areas that were critical for Wikimedia projects. Brion Vibber is also actively involved with these reviews, and several of these statistical summaries were noted among the internal developers lists, with hints of these studies being mentioned from time to time on other foundation mailing lists.
There have also been formal requests for performing this sort of statistical analysis by several university research teams that have been eager to get such a statistical set, which also prompted the WMF to establish specific guidelines for obtaining this sort of raw data.
Is this specific enough? I don't know right off hand besides these direct studies, but I do know there are others that do exist as well. Wikipedia is a heavily studied topic in part because much of the data is open and available, which gives some interesting sociological interpretations as well if studied through the lens of a statistical review. And there is enough raw data to come to conclusions that may not fit the traditional orthodoxy, so you can also tweak some noses at the same time.
The reason I mention MediaWiki's feature set is that you are (I'm presuming here) claiming that the Wikimedia Foundation is running out of money in part because they are foolishly spending money on server resources that could be run more cheaply had they only selected the proper wiki software. I am offering a rebuttal: this is hardly the case, and almost any other wiki software package (almost, because I can't claim absolute knowledge here) would die a horrible and nearly instant death if it had to deal with the same feature set and bandwidth issues that currently confront Wikipedia. Either that, or the other software packages lack so many of the essential requirements for running Wikipedia that there is hardly room to justify a comparison based on just one measure, content distribution bandwidth on the CPU.