Data Mining Rescues Investigative Journalism
John Mecklin sends in word of initiatives through which the digital revolution that has been undermining in-depth reportage may be ready to give something back, through a new academic and professional discipline known as "computational journalism." "James Hamilton, director of the DeWitt Wallace Center for Media and Democracy at Duke University, is in the process of filling an endowed chair with a professor who will develop sophisticated computing tools that enhance the capabilities — and, perhaps more important in this economic climate, the efficiency — of journalists and other citizens who are trying to hold public officials and institutions accountable. The goal: Computer algorithms that can sort through the huge amounts of databased information available on the Internet, providing public-interest reporters with sets of potential story leads they otherwise might never have found. Or, in short, data mining in the public interest."
As someone who does investigative journalism for a living, I can tell you that data mining won't get you squat. I have done this job for 5+ years and am very familiar with data mining, and the two cross paths so rarely that it rounds to zero.
Why? Because if it is in minable form, it doesn't take any digging to find. If you can run a Google search and get even a tidbit about what you need, you don't need investigative journalism.
Of the stories I have gotten, little ones like the P4 going 64 bits, its never reaching 4GHz, Dell's exploding laptops (an assist on that one), and more recently the Nvidia bump cracking problem(s), none would have been possible through data mining.
If it is out there, it doesn't need an investigative journalist. If it isn't, then data mining won't help. The end.
-Charlie
The Cline Center for Democracy at UIUC has been running a data mining project, scanning archives and contents of newspapers around the world for reports of political disturbances such as riots, etc. The project, a collaboration between the center and the UIUC CS department, is meant to facilitate research on domestic stability and the like. Currently it's focused primarily on English-language papers, but efficiency and completeness will dictate searches in other languages sooner or later.
Information can be suppressed or 'spun', but at least this will ensure that the data's available for such evaluations instead of paying some graduate student peanuts for years and years to put it together.
Of course it does mean that I'm sort of out of a job...
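For anyone curious what the crudest version of that kind of scan looks like, here is a minimal sketch. It is not the Cline Center's method; the keyword list, the `flag_articles` helper, and the sample articles are all made up for illustration, and a real project would need far more than keyword matching.

```python
import re

# Hypothetical keyword list, chosen only for this example.
DISTURBANCE_TERMS = ("riot", "protest", "strike", "coup", "unrest")

# Match any term as a whole word, with an optional plural "s".
PATTERN = re.compile(
    r"\b(" + "|".join(DISTURBANCE_TERMS) + r")s?\b", re.IGNORECASE
)


def flag_articles(articles):
    """Return (article_id, sorted matched terms) for each article with a hit."""
    flagged = []
    for article_id, text in articles:
        hits = {m.group(1).lower() for m in PATTERN.finditer(text)}
        if hits:
            flagged.append((article_id, sorted(hits)))
    return flagged


# Made-up sample articles.
SAMPLE = [
    ("a1", "Riots broke out in the capital after the election."),
    ("a2", "The council approved a new budget for road repairs."),
    ("a3", "A general strike and street protests entered a second week."),
]

if __name__ == "__main__":
    for article_id, terms in flag_articles(SAMPLE):
        print(article_id, terms)
```

Keyword matching like this will flag plenty of false positives (a "strike" in a baseball story, say), which is part of why a person still has to read what comes out.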
If you're in the world of investigative journalism, I'd encourage you to take a look at a new class of semantic data generation tools. New capabilities like Calais (www.opencalais.com) from Thomson Reuters allow you to ingest unstructured text (news articles, press releases, FOIA documents, whatever) and automatically extract semantic metadata like people, companies, management changes, natural disasters and hundreds of others. You can take the output of these tools and load it directly into a database to query. You could take news stories and build a social network of family relationships, then play news events against the network. We're already seeing some initial uses in the area of investigative journalism and would love to see more. Jump in and give it a try.
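To make the "load it into a database and query it" idea concrete, here is a rough sketch. It does not call Calais or reproduce its API; `extract_entities` is a hypothetical stand-in that just looks up a fixed list of invented names, and the table layout, `build_db`, and `co_mentions` are likewise assumptions made for the example.

```python
import sqlite3

# Hypothetical stand-in for an extraction service: a fixed lookup of
# invented names. A real tool would return far richer metadata.
KNOWN_ENTITIES = {
    "Acme Corp": "company",
    "Jane Doe": "person",
    "John Roe": "person",
}


def extract_entities(text):
    """Return (name, kind) pairs for known names that appear in the text."""
    return [(name, kind) for name, kind in KNOWN_ENTITIES.items() if name in text]


def build_db(docs):
    """Load one row per (document, entity) mention into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mentions (doc_id TEXT, entity TEXT, kind TEXT)")
    for doc_id, text in docs:
        rows = [(doc_id, name, kind) for name, kind in extract_entities(text)]
        conn.executemany("INSERT INTO mentions VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def co_mentions(conn, entity):
    """Count how often other entities share a document with the given one."""
    return conn.execute(
        "SELECT b.entity, COUNT(*) FROM mentions a "
        "JOIN mentions b ON a.doc_id = b.doc_id AND a.entity != b.entity "
        "WHERE a.entity = ? "
        "GROUP BY b.entity ORDER BY COUNT(*) DESC, b.entity",
        (entity,),
    ).fetchall()


# Made-up documents.
DOCS = [
    ("d1", "Jane Doe was named chief executive of Acme Corp."),
    ("d2", "Acme Corp settled a lawsuit brought by John Roe."),
    ("d3", "Jane Doe and Acme Corp declined to comment."),
]

if __name__ == "__main__":
    conn = build_db(DOCS)
    print(co_mentions(conn, "Acme Corp"))
```

The co-mention counts are the simplest form of the network idea: each shared document is an edge between two entities, and you can then line up dated events against those edges.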