How Journalists Data-Mined the Wikileaks Docs

Posted by samzenpus on Sunday June 12, 2011 @03:15PM from the reading-between-a-million-lines dept.

meckdevil writes "Associated Press developer-journalist extraordinaire Jonathan Stray gives a brilliant explanation of the use of data-mining strategies to winnow and wring journalistic sense out of massive numbers of documents, using the Iraq and Afghanistan war logs released by Wikileaks as a case in point. The concepts for focusing on certain groups of documents and ignoring others are hardly new; they underlie the algorithms used by the major Web search engines. Their use in a journalistic context is on a cutting edge, though, and it raises a fascinating quandary: By choosing the parameters under which documents will be considered similar enough to pay attention to, journalist-programmers actually choose the frame in which a story will be told. This type of data mining holds great potential for investigative revelation — and great potential for journalistic abuse."

2 of 59 comments (clear)

Min score:

Reason:

Sort:

I just used grep -P by siddesu · 2011-06-12 15:23 · Score: 3, Interesting

Worked miracles after I've gotten around the ugly HTML format they use to release all those INFORMATIONS. Still, there was very little new or worthwhile in the heap of those news clips and rumour aggregations. Frankly, the more I grep it, the less it looks like the "largest leak in history", and the more it seems like "the largest controlled release of information" in history.
/ takes off conspiracy theory hat // flame on
Re:But this isn't reasoned analysis by MimeticLie · 2011-06-12 18:24 · Score: 4, Interesting

Actually, if you watch the video, that's not what Stray is talking about. Rather than doing targeted searches, he's talking about processing the whole dataset and using algorithms to establish connections. The narrative that makes sense of those clusters is what would (hopefully) be the reasoned analysis.