How Journalists Data-Mined the Wikileaks Docs
meckdevil writes "Associated Press developer-journalist extraordinaire Jonathan Stray gives a brilliant explanation of the use of data-mining strategies to winnow and wring journalistic sense out of massive numbers of documents, using the Iraq and Afghanistan war logs released by Wikileaks as a case in point. The concepts for focusing on certain groups of documents and ignoring others are hardly new; they underlie the algorithms used by the major Web search engines. Their use in a journalistic context is on a cutting edge, though, and it raises a fascinating quandary: By choosing the parameters under which documents will be considered similar enough to pay attention to, journalist-programmers actually choose the frame in which a story will be told. This type of data mining holds great potential for investigative revelation — and great potential for journalistic abuse."
Worked miracles once I got around the ugly HTML format they use to release all that information. Still, there was very little new or worthwhile in that heap of news clips and rumour aggregations. Frankly, the more I grep it, the less it looks like the "largest leak in history", and the more it seems like "the largest controlled release of information" in history.
/ takes off conspiracy theory hat // flame on
The fact that there's a media narrative is hardly news. The purpose is to provide ratings. Anything that will lead to scandal, corruption, or supporting national politics is the name of the game. Fox does this to support Republicans; all the others support the Democrats. I suppose this is news to those who don't already know it, however. And this "taking sides" by the national media is nothing new at all. Very old hat in American history.
Ask any budding journalist why they want to be in this industry. Sometimes, you will hear a common theme of "To change the world for a better place". Generally that implies a motive with bias. No, their job is to REPORT the news in its purest form. I'll tell ya, that can both end wars and create them. But oh no, we can't have that now, can we? They should report the good, the bad, and the ugly with impartiality. The BBC comes the closest to doing that. Perhaps I'm giving them too much credit, however.
Life is not for the lazy.
If memory serves, and I'm not missing something in my quick re-read of the Wikipedia page, the leaked cables were not all made available to everyone. They were distributed to five major news organizations so more than one editorial staff could reasonably decide which material was newsworthy and which was too sensitive to publish (sarcastic example: the GPS coordinates of Obama's real long-form birth certificate). This is a reasonably good idea, but it does mean that there are only a handful of people who have access to all the documents.
Have you ever heard that when you find something it's always in the last place you look? That's because you stop looking for it once you're satisfied. Similarly, an editor searching for terms that might confirm a previously-unsubstantiated rumor he's got tucked away in a story on the shelf may find what he's looking for, but he won't find the really juicy stuff he didn't know to look for.
In a perfect world, the system would correct for this because some enterprising young journalists who are willing to "pound the pavement" and read the whole thing would uncover the stuff they missed. But because of the limited set of people who have access, that won't happen for a decade or two at the earliest. It's a necessary evil to prevent information like the locations of and personnel at sensitive sites from falling into the wrong hands.
Actually, if you watch the video, that's not what Stray is talking about. Rather than doing targeted searches, he's talking about processing the whole dataset and using algorithms to establish connections. The narrative that makes sense of those clusters is what would (hopefully) be the reasoned analysis.
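To make the difference from a targeted search concrete, here is a minimal sketch of whole-corpus processing. It assumes TF-IDF weighting and cosine similarity, which are the kind of search-engine concepts the summary alludes to; the exact method Stray used is not stated in this thread. The function names, the toy reports, and the threshold values are all made up for illustration and are not taken from the war logs.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Turn each document into a sparse {term: weight} TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append(
            {t: count * math.log(n_docs / doc_freq[t]) for t, count in term_freq.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_pairs(docs, threshold):
    """Return index pairs of documents whose similarity meets the threshold."""
    vectors = tfidf_vectors(docs)
    return [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


# Hypothetical toy reports, invented for this example.
toy_docs = [
    "ied explosion convoy route north",
    "ied explosion patrol route south",
    "checkpoint detained vehicle search",
    "checkpoint detained vehicle patrol",
]

print(similar_pairs(toy_docs, 0.5))  # [(2, 3)]
print(similar_pairs(toy_docs, 0.3))  # [(0, 1), (2, 3)]
print(similar_pairs(toy_docs, 0.1))  # [(0, 1), (1, 3), (2, 3)]
```

Nobody typed a search term here; every report is compared against every other one. But the threshold is still a choice: at 0.5 there is one cluster, at 0.1 the two groups are joined through a shared word. That is the quandary from the summary in miniature, since whoever picks the parameter picks which connections exist.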