Researcher's Wikipedia Big Data Project Shows Globalization Rate
Nerval's Lobster writes "Wikipedia, which features nearly 4 million articles in English alone, is widely considered a godsend for high school students on a tight paper deadline. But for University of Illinois researcher Kalev Leetaru, Wikipedia's volumes of crowd-sourced articles are also an enormous dataset, one he mined for insights into the history of globalization. He made use of Wikipedia's 37GB of English-language data — in particular, the evolving connections between various locations across the globe over a period of years. 'I put every coordinate on a map with a date stamp,' Leetaru told The New York Times. 'It gave me a map of how the world is connected.' You can view the time lapse/data visualization on YouTube."
Come on, 37G isn't big data. You'd have a hard time arguing 37TB is big data.
Cool stuff though.
looks exponential :)
If you're using Wikipedia as a metric to measure anything, you're insane.
From reading the NYT article, I understand this is a study of the English version of Wikipedia. That alone should raise a red flag about the significance of the study beyond being a survey of the interests or obsessions of Wikipedia editors.
It's useful only as a survey of a clearly unrepresentative sample of the world population. It's biased against those who can't write English, and those who can write it are themselves a much smaller subset of those who can claim some fluency in it.
It tells us less about history and more about present attitudes toward history. It's pretty much like compiling a list of the 100 greatest sci-fi movies by surveying Facebook users. Movies produced within the last decade or so will outrank the "classic" movies of the '70s and '80s. Avatar will likely be a more "significant" movie than Blade Runner or Aliens.
I'd be interested in how this guy parses "positive" statements vs "negative" statements. English nuance is a tricky wicket, and unlike trying to analyze text from Twitter or Facebook ("Eeewwww, the Civil War is teh Suxorrzz") Wikipedia articles tend to maintain a neutral tone.
After reading the article (yeah, I know) and viewing the video, it seems like "negative" entries appear most often around periods of time when there's a lot of war. Interesting and obvious... but I'd like to know if periods of religious persecution or large-scale social upheaval/conflict are accounted for, or show up on whatever "sentiment meter" he's using.
As a side note, I've been thinking about rolling my own. The "free online" sentiment tools like sentiment140.com tend to miss a lot of nuance. Does anyone here have any recommendations?
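For anyone else thinking about rolling their own, here is a minimal sketch of the simplest approach I know of: a word-list ("lexicon") scorer with crude negation handling. To be clear about the assumptions: the word list and weights below are made up for illustration, the names LEXICON, NEGATORS, tokenize and sentiment are hypothetical, and none of this is claimed to be what Leetaru or sentiment140.com actually does.

```python
import re

# Hypothetical toy lexicon. The words and weights are illustrative
# assumptions, not taken from any published sentiment lexicon.
LEXICON = {
    "war": -2.0, "massacre": -3.0, "famine": -2.5, "persecution": -2.5,
    "defeat": -1.5, "bad": -1.0,
    "peace": 2.0, "prosperity": 2.0, "victory": 1.5, "treaty": 1.0,
    "good": 1.0,
}

# Words that flip the polarity of a lexicon hit that follows closely.
NEGATORS = {"not", "no", "never", "without"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment(text, window=2):
    """Return the mean weight of all lexicon hits in `text`.

    A hit has its sign flipped when a negator appears within `window`
    tokens before it. Text with no lexicon hits scores 0.0 (neutral).
    """
    tokens = tokenize(text)
    scores = []
    for i, tok in enumerate(tokens):
        if tok not in LEXICON:
            continue
        weight = LEXICON[tok]
        # Look back a few tokens for a negator such as "not" or "without".
        if any(t in NEGATORS for t in tokens[max(0, i - window):i]):
            weight = -weight
        scores.append(weight)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    for line in ("The treaty brought peace and prosperity.",
                 "The war ended in famine and massacre.",
                 "Not good."):
        print(f"{sentiment(line):+.2f}  {line}")
```

This also shows why neutral-toned text is hard for this kind of tool: a sentence such as "the war ended" still scores negative just because the word "war" is in it, which may be part of why war-heavy periods look so "negative" in the video. Handling sarcasm, context or domain-specific wording would need a lot more than a word list.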