Brown Dog: a Search Engine For the Other 99 Percent (of Data)
aarondubrow writes: We've all experienced the frustration of trying to access information on websites, only to find that the data is trapped in outdated, difficult-to-read file formats and that metadata (the critical data about the data, such as when, how, and by whom it was produced) is nonexistent. Led by Kenton McHenry, a team at the National Center for Supercomputing Applications is working to change that. Recipients in 2013 of a $10 million, five-year award from the National Science Foundation, the team is developing software that allows researchers to manage and make sense of vast amounts of digital scientific data that is currently trapped in outdated file formats. The NCSA team recently demonstrated two publicly available services to make the contents of uncurated data collections accessible.
Sorry, who is this addressed to?
http://xkcd.com/979/
The problem is that 99%* of data is actually trapped behind paywalls...
Which is more of a problem than the format. If the data was available without the paywall, then the format probably wouldn't matter as much.
GrpA
*99% is a made-up statistic - just like the original article. I assume it means "lots..."
Enjoy science fiction? "Turing Evolved" - AI, Mecha, Androids and rail-gun battles. What more could you want?
Isn't gathering, indexing, and trying to make heads or tails of data what Splunk is designed for? It is a commercial utility, and not cheap by any means... but at least this is one software package meant to sift through and generate reports/graphs/etc on stuff.
Disclaimer: Not associated with them, but have ended up using their products at multiple installations with very good results (mainly keeping customers happy with a morning PDF report that all is well, with the charts to prove it.)
What the fuck is happening to Slashdot? Has it switched to CloudFlare? It's all fucked up for me. I couldn't even post comments earlier.
wow
I'm still struggling with the first 99 percent, and now you tell me there's more?
This must be the Dark Matter that the rumors are about. Oh, how the elusive tendrils of reality converge on the delicate neurons of the deranged mind.
...omphaloskepsis often...
What will this project do for me? How do I get old, worn-out data files converted out of dead proprietary formats into something usable or useful? Or is this project only for certain types of researcher? (aka those with oodles of money)
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Although you have a point, you don't understand the realities of science, data, and publishing.
Journal articles never contain sufficient information to replicate an experiment. That's been reported multiple times and also discussed here previously indirectly: in particular there was the study about how difficult/impossible it is to reproduce research. Many jumped into the fray with the fraud claims when that report hit, but the reality is that it's just not possible to lay out every little detail in a publication, and those details matter a LOT. As a consequence, it takes a highly trained individual to carefully interpret the methods described in a journal article, and even then their success rate in reproducing the protocols won't be terrific.
The data is not hidden behind paywalls: there is minimal useful data in the publications. Of course, the paywalls do hide the methods descriptions, which is pretty bad.
There are two major obstacles to dissemination of useful data. The first is that the metadata is nearly always absent or incomplete, and the format issue is a subset of this problem. The second is that data is still "expensive" enough that we can't trivially just have a copy of all of it. This means that it requires careful stewardship if it's going to be archived, and no one is paying for that.
I honestly don't recall that ever being a problem. It may have happened to me, but it must have been so long ago and so infrequent I seriously can't recollect not finding something I was expecting to find.
i used to make my living doing just such a thing! and hope to again one day...
i was sort of an SPSS jockey and data interpreter for a geospatial HCI research project
the supervising professor set the parameters for the study, i came on b/c he had absolutely no idea what to do with all his data...it was 8 years of a study that changed formats 3 years through, and added/changed questions every year...that was on top of constant device usage monitoring data...as in we developed our own field monitoring/data recording device to have undergrad research assistants literally follow research subjects *all day* and record what they did on their phones...and computer...everything...not content of emails but they'd record the fact that they were emailing on a laptop...it was that in depth and undergrad students volunteered!!! it was kind of like being in a reality show for them...idk...but i had to collate/make sense of/explain to others all this data
it took me a long time...since i first started reading research journals as an undergrad in '97...but finally i have accepted that what you say that I've quoted above is true
i hate to admit it but you're right, and anyone in academia should take a long look in the mirror...it's been like this for a while, but now, in these times of "pop science" where western culture has an infatuation with tech and TED Talk-level scientific discourse, the difference between good research and total shit is so abstract and up to the researcher's choice that they **must** make those decisions known and defend them in the literature as needed! it's really academic dishonesty!
Thank you Dave Raggett
I've been kicking around the internet since before the web, since one was delighted by the capabilities of Gopher, etc.
And:
"We've all experienced the frustration of trying to access information on websites, only to find that the data is trapped in outdated, difficult-to-read file formats and that metadata..."
Nope, not even once.
-Styopa
Has this class of data been termed "Dark Data" yet?
http://alternatives.rzero.com/
So if I have a YAML or MP3 file, I can convert it...?
Well, I hope governments will soon make ODF a default standard. .doc(x) and other proprietary MS formats are a time bomb as well. What will you open your documents with in 50 years?