Brown Dog: a Search Engine For the Other 99 Percent (of Data)

Posted by Soulskill on Tuesday October 7, 2014 @02:18PM from the because-it-fetches-data dept.

aarondubrow writes: We've all experienced the frustration of trying to access information on websites, only to find that the data is trapped in outdated, difficult-to-read file formats and that metadata — the critical data about the data, such as when and how and by whom it was produced — is nonexistent. Led by Kenton McHenry, a team at the National Center for Supercomputing Applications is working to change that. Recipients in 2013 of a $10 million, five-year award from the National Science Foundation, the team is developing software that allows researchers to manage and make sense of vast amounts of digital scientific data that is currently trapped in outdated file formats. The NCSA team recently demonstrated two publicly-available services to make the contents of uncurated data collections accessible.

Hmm, no, can't say I have, no. by Anonymous Coward · 2014-10-07 14:21 · Score: 0, Insightful

Sorry, who is this addressed to?

oblig xkcd by irussel · 2014-10-07 14:26 · Score: 2, Insightful

http://xkcd.com/979/

The problem isn't the format of the data... by Vesvvi · 2014-10-07 15:48 · Score: 4, Insightful

Although you have a point, you don't understand the realities of science, data, and publishing.
Journal articles never contain sufficient information to replicate an experiment. That's been reported multiple times and also discussed here previously indirectly: in particular there was the study about how difficult/impossible it is to reproduce research. Many jumped into the fray with the fraud claims when that report hit, but the reality is that it's just not possible to lay out every little detail in a publication, and those details matter a LOT. As a consequence, it takes a highly trained individual to carefully interpret the methods described in a journal article, and even then their success rate in reproducing the protocols won't be terrific.
The data is not hidden behind paywalls: there is minimal useful data in the publications. Of course, the paywalls do hide the methods descriptions, which is pretty bad.
There are two major obstacles to dissemination of useful data. The first is that the metadata is nearly always absent or incomplete, and the format issue is a subset of this problem. The second is that data is still "expensive" enough that we can't trivially just have a copy of all of it. This means that it requires careful stewardship if it's going to be archived, and no one is paying for that.