How Journalists Data-Mined the Wikileaks Docs
meckdevil writes "Associated Press developer-journalist extraordinaire Jonathan Stray gives a brilliant explanation of the use of data-mining strategies to winnow and wring journalistic sense out of massive numbers of documents, using the Iraq and Afghanistan war logs released by Wikileaks as a case in point. The concepts for focusing on certain groups of documents and ignoring others are hardly new; they underlie the algorithms used by the major Web search engines. Their use in a journalistic context is on a cutting edge, though, and it raises a fascinating quandary: By choosing the parameters under which documents will be considered similar enough to pay attention to, journalist-programmers actually choose the frame in which a story will be told. This type of data mining holds great potential for investigative revelation — and great potential for journalistic abuse."
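To make the "parameters choose the frame" point concrete, here is a minimal sketch of one common way to score document similarity: TF-IDF weighting plus cosine similarity, with a cutoff deciding which pairs count as related. It is not a reconstruction of the AP's actual pipeline; the function names, the whitespace tokenizer, the sample reports, and the threshold values are all assumptions made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * math.log(n_docs / df[t]) for t, count in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_pairs(docs, threshold):
    """Index pairs (i, j) of documents whose similarity meets the threshold."""
    vectors = tfidf_vectors(docs)
    return [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


# Hypothetical stand-ins for war-log entries, not real data.
sample_docs = [
    "ied explosion convoy route",
    "ied explosion patrol route",
    "detainee transfer report",
    "detainee abuse report",
    "convoy patrol detainee",
]

strict = similar_pairs(sample_docs, 0.5)
loose = similar_pairs(sample_docs, 0.25)
print("threshold 0.50:", strict)
print("threshold 0.25:", loose)
```

With these made-up reports, the stricter cutoff links only the two IED entries, while the looser one also links the two detainee entries and ties the mixed report to the IED entries. The data is identical in both runs; only the parameter changed, which is the quandary the summary describes.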
Worked miracles once I got around the ugly HTML format they use to release all that INFORMATION. Still, there was very little new or worthwhile in the heap of those news clips and rumour aggregations. Frankly, the more I grep it, the less it looks like the "largest leak in history", and the more it seems like "the largest controlled release of information" in history.
/ takes off conspiracy theory hat // flame on
Whilst - purely coincidentally - completely avoiding paying Wikileaks anything for any of these files, from Afghanistan to the US State Dept's opposition to the Haitian government daring to propose a raise in their minimum wage up to 59/hour. (Not by that much, but to that much)
Isn't that one of the major reasons we have journalism? To synthesize and contextualize information? If the contextualized (or perhaps editorialized, depending on your point of view) information was the only kind available, then yes that is an issue. But with Wikileaks, the data is there for anyone who wants to parse it.
This strikes me as being similar to when Anderson Cooper was criticized for calling Mubarak a liar. Or the behavior that Colbert mocked the White House press corps for at the correspondents' dinner. Pretending that journalists are free of bias doesn't make it so, and saying that they should just regurgitate facts and talking points verbatim is counter-productive. Reasoned analysis should be encouraged.
Great video. Can we at least get a better FA?
sysadmins and parents of newborns get the same amount of sleep.
Worked for me... *shrug*
I wonder what this program would do to my extensive volume of email.
I've got thousands of emails going back over a decade.
Would love to see where the correlations are.
This signature has Super Cow Powers
How is this different than the current trend of deciding what facts to publish and what to ignore?
I would see this as a threat to journalistic integrity only if there was such a thing anymore.
This is not cutting edge in the slightest: machine learning researchers have been clustering documents, let alone other objects, designing similarity measures, and constructing visualization schemes for years. The fact that cluster tendency assessment was used in a journalistic context isn't newsworthy.
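For anyone who hasn't seen the kind of document clustering this comment refers to, a minimal sketch follows. It uses Jaccard similarity on word sets and single-link grouping (connected components over the "similar enough" graph). This is one textbook approach chosen for brevity, not the method used in the presentation; the names `jaccard` and `cluster`, the sample reports, and the thresholds are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two word sets: size of overlap over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(docs, threshold):
    """Group document indices into single-link clusters.

    Two documents land in the same cluster if a chain of pairwise
    similarities, each at or above the threshold, connects them.
    """
    tokens = [set(doc.lower().split()) for doc in docs]
    parent = list(range(len(docs)))  # union-find forest, one root per cluster

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if jaccard(tokens[i], tokens[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two clusters

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical stand-ins for report summaries, not real data.
reports = [
    "ied explosion convoy route",
    "ied explosion patrol route",
    "detainee transfer report",
    "detainee abuse report",
    "weather delay flight",
]

print(cluster(reports, 0.5))
print(cluster(reports, 0.55))
```

The pairwise loop is quadratic in the number of documents, so this only illustrates the idea; a corpus the size of the war logs would need something more scalable.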
Terrorists and foreign intelligence services will also be doing this to use against the United States and its allies, not just journalists. Wikileaks has provided the raw material for data mining to find things the US doesn't even realize about itself, or its allies. There is no surprise that Bradley Manning has been charged with aiding the enemy.
The fallout continues; hopefully it won't be literal.
Al-Qaeda Already Using Wikileaks Material Against Us
Taliban Study WikiLeaks to Hunt Informants
Wikileaks: US will have to reshuffle diplomats following revelations
'They're informants... if they get killed, they deserve it': New book reveals shocking disregard of Julian Assange towards Afghans named in WikiLeaks cables
Since I can anticipate the follow ups:
No, Wikileaks didn't do an adequate job of scrubbing the documents of names at various points, which is why they are useful to the Taliban and other groups building death lists.
Yes, I have seen reports of people being killed due to Wikileaks publishing their names; you just have to dig a lot to find them. For some reason it doesn't seem to be a popular news item. Go figure.
Oversight of US diplomacy, military, and intelligence activity is the role of the Congress elected by voters.
Even if nobody was killed, Wikileaks has resulted in a significant disruption to US diplomacy and antiterrorism efforts. (You pull out informants due to their cover being blown and you lose valuable intelligence.)
Poll finds that more Americans oppose WikiLeaks
much of left-wing thought is a kind of playing with fire by people who don't even know that fire is hot - George Orwell
The fact that there's a media narrative is hardly news. The purpose is to provide ratings. Anything that will lead to scandal, corruption, or supporting national politics is the name of the game. Fox does this to support Republicans, all the others support the Democrats. I suppose this is news to those that don't already know this however. And this "taking sides" of the national media is nothing new at all. Very old hat in American history.
Ask any budding journalist why they want to be in this industry. Sometimes you will hear a common theme: "To change the world for the better". Generally that implies a motive with bias. No, their job is to REPORT the news in its purest form. I'll tell ya, that can both end wars and create them. But oh no, we can't have that now, can we? They should report the good, the bad, and the ugly with impartiality. The BBC is as close as it comes to doing that. Perhaps I'm giving them too much credit, however.
Life is not for the lazy.
The visualisations look like they were generated using Gephi. Interesting use. I wonder if the search for "search terms" was initially refined by graphing the raw data and continuing from there.
But yer phone is up his ass, so how that work?
Reasoned analysis would mean taking the entire corpus of documents, and coming up with stories based on what's in them.
This process, on the other hand, is coming up with stories, then doing targeted searches of the documents to find material to back them up.
Mark Twain summed up the central problem of journalism with his epigram, "Get your facts first... then you can distort 'em as much as you please". But, amusing as it is, this completely misses the point! In the very process of "getting your facts" you have the opportunity - indeed, the obligation - of selecting them from among the infinite number of facts that you could choose. Having selected the facts that you think are most important, there is no longer the slightest need to distort them. The work is already done.
Suppose you are the New York Times, and you are reporting on events in Afghanistan. You have a certain amount of space, so do you write up the IED explosion which killed a couple of NATO soldiers and put a few more in hospital - or do you describe the NATO helicopter raid that killed a dozen villagers and wounded another few dozen? Well, your readers are far more interested in the fate of NATO people (especially if they are from the USA); moreover, they don't particularly want to read about how their glorious forces have accidentally (or otherwise) killed a lot of civilians. So it's a no-brainer - you write up the IED event. After a few years of such a policy, consistently followed, readers get the idea that all that happens in Afghanistan is that NATO soldiers occasionally get blown up. Yes the NYT has accurately reported the facts. It hasn't reported all of them, but its editors could argue that such an attempt would be physically impossible. The only practical way of giving a more balanced impression would be to read, as well as the NYT, a newspaper that takes an anti-NATO, pro-Afghan point of view. But no such newspaper can survive commercially in the US market, because it wouldn't sell enough copies (even if it were allowed to go on operating for long).
Indeed, the Wikileaks documents currently under discussion are subject to such a filtering effect too. Remember, all those documents were written by American officials, for US government consumption. You won't find many mentions in there of atrocities by our forces - even if the US authorities in Afghanistan or Washington were aware of such atrocities, they wouldn't put them into messages with such a low level of security. What you can expect to find is a fairly high level of unguarded opinions - either honest or carefully angled to make a particular desired impression.
I am sure that there are many other solipsists out there.
Arrest him.
If it uses US military terms, then there will be a significant bias as they declare that all dead Iraqis are insurgents.
"This type of data mining holds great potential for investigative revelation — and great potential for journalistic abuse"
I don't think so ...
Why is this a link to (presumably) the submitter's blog, rather than to the actual presentation, which is available here: http://curiositycounts.com/post/6455747293/jonathan-stray-of-the-associated-press-on
"By choosing the parameters under which documents will be considered similar enough to pay attention to, journalist-programmers actually choose the frame in which a story will be told."
Journalists already choose the frame in which a story will be told. They always have. That's not new.
http://www.geoffreylandis.com
One thing we know, whatever the corps decide not to cover, if it's in the main body of documents, Jesse Ventura will find and make a book out of it. He was smart about that. And it probably got a bunch of people to read WikiLeaks info that otherwise would not have.