Brown Dog: a Search Engine For the Other 99 Percent (of Data)

Posted by Soulskill on Tuesday October 7, 2014 @02:18PM from the because-it-fetches-data dept.

aarondubrow writes: We've all experienced the frustration of trying to access information on websites, only to find that the data is trapped in outdated, difficult-to-read file formats and that metadata — the critical data about the data, such as when and how and by whom it was produced — is nonexistent. Led by Kenton McHenry, a team at the National Center for Supercomputing Applications is working to change that. Recipients in 2013 of a $10 million, five-year award from the National Science Foundation, the team is developing software that allows researchers to manage and make sense of vast amounts of digital scientific data that is currently trapped in outdated file formats. The NCSA team recently demonstrated two publicly-available services to make the contents of uncurated data collections accessible.

23 comments

Hmm, no, can't say I have, no. by Anonymous Coward · 2014-10-07 14:21 · Score: 0, Insightful

Sorry, who is this addressed to?
oblig xkcd by irussel · 2014-10-07 14:26 · Score: 2, Insightful

http://xkcd.com/979/
The problem isn't the format of the data... by GrpA · 2014-10-07 14:55 · Score: 4, Informative

The problem is that 99%* of data is actually trapped behind paywalls...
Which is more of a problem than the format. If the data was available without the paywall, then the format probably wouldn't matter as much.
GrpA
*99% is a made-up statistic - just like the original article. I assume it means "lots..."

--
Enjoy science fiction? "Turing Evolved" - AI, Mecha, Androids and rail-gun battles. What more could you want?
1. Re:The problem isn't the format of the data... by Anonymous Coward · 2014-10-07 15:38 · Score: 0
  
  That time was months ago. We've all been over at SoylentNews instead.
2. Re:The problem isn't the format of the data... by Anonymous Coward · 2014-10-08 21:53 · Score: 1
  
  And you keep coming back to spam here. Can't be a very successful site. But I guess you've earned the right to spam and troll Slashdot now that you've left. Being a part of the problem is so cool, eh?
Isn't this what Splunk is for? by mlts · 2014-10-07 14:59 · Score: 4, Informative

Isn't gathering, indexing, and trying to make heads or tails of data what Splunk is designed for? It is a commercial utility, and not cheap by any means... but at least this is one software package meant to sift through and generate reports/graphs/etc on stuff.
Disclaimer: Not associated with them, but have ended up using their products at multiple installations with very good results (mainly keeping customers happy with a morning PDF report that all is well, with the charts to prove it.)
What the fuck is happening to Slashdot? by Anonymous Coward · 2014-10-07 14:59 · Score: 0

What the fuck is happening to Slashdot? Has it switched to CloudFlare? It's all fucked up for me. I couldn't even post comments earlier.
"the Other 99 Percent" by swell · 2014-10-07 15:24 · Score: 2, Funny

wow
I'm still struggling with the first 99 percent, and now you tell me there's more?
This must be the Dark Matter that the rumors are about. Oh, how the elusive tendrils of reality converge on the delicate neurons of the deranged mind.

--
...omphaloskepsis often...
I have plenty of old scientific data files by jd · 2014-10-07 15:25 · Score: 1

What will this project do for me? How do I get old, worn-out data files converted out of dead proprietary formats into something usable or useful? Or is this project only for certain types of researcher? (aka those with oodles of money)

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
1. Re:I have plenty of old scientific data files by globaljustin · 2014-10-07 19:15 · Score: 1
  
  i can point you to Systems Science research projects that are based on testing high level data analysis algorithms and the data set they need is immaterial...it just has to fit certain parameters
  the data in TFA could be used for just such a task
  in one example, a PhD researcher was developing a speech recognition algorithm improvement and he actually used the entire digital catalogue of some musician as his data set...it produced interesting results that were not part of testing his hypothesis at all...hard to explain w/o digging very deep in my note pile but it was like a pandora for speech recognition and the musician's catalogue was perfect to optimize upon for some reason
  my point is, there is use for big random old data sets, potentially...
  
  --
  Thank you Dave Raggett
The problem isn't the format of the data... by Vesvvi · 2014-10-07 15:48 · Score: 4, Insightful

Although you have a point, you don't understand the realities of science, data, and publishing.
Journal articles never contain sufficient information to replicate an experiment. That's been reported multiple times and also discussed here previously indirectly: in particular there was the study about how difficult/impossible it is to reproduce research. Many jumped into the fray with the fraud claims when that report hit, but the reality is that it's just not possible to lay out every little detail in a publication, and those details matter a LOT. As a consequence, it takes a highly trained individual to carefully interpret the methods described in a journal article, and even then their success rate in reproducing the protocols won't be terrific.
The data is not hidden behind paywalls: there is minimal useful data in the publications. Of course, the paywalls do hide the methods descriptions, which is pretty bad.
There are two major obstacles to dissemination of useful data. The first is that the metadata is nearly always absent or incomplete, and the format issue is a subset of this problem. The second is that data is still "expensive" enough that we can't trivially just have a copy of all of it. This means that it requires careful stewardship if it's going to be archived, and no one is paying for that.
We all have? by jader3rd · 2014-10-07 16:53 · Score: 3, Interesting

I honestly don't recall that ever being a problem. It may have happened to me, but it must have been so long ago and so infrequent I seriously can't recollect not finding something I was expecting to find.
academic dishonesty by globaljustin · 2014-10-07 19:09 · Score: 2

"it takes a highly trained individual to carefully interpret the methods described in a journal article, and even then their success rate in reproducing the protocols won't be terrific."
i used to make my living doing just such a thing! and hope to again one day...
i was sort of an SPSS jockey and data interpreter for a geospatial HCI research project
the supervising professor set the parameters for the study, i came on b/c he had absolutely no idea what to do with all his data...it was 8 years of a study that changed formats 3 years through, and added/changed questions every year...that was on top of constant device usage monitoring data...as in we developed our own field monitoring/data recording device to have undergrad research assistants literally follow research subjects *all day* and record what they did on their phones...and computer...everything...not content of emails but they'd record the fact that they were emailing on a laptop...it was that in depth and undergrad students volunteered!!! it was kind of like being in a reality show for them...idk...but i had to collate/make sense of/explain to others all this data
it took me a long time...since i first started reading research journals as an undergrad in '97...but finally i have accepted that what you say that I've quoted above is true
i hate to admit it but you're right, and anyone in academia should take a long look in the mirror...it's been like this for awhile but now in these times of "pop science" where western culture has an infatuation with tech and TED Talk level scientific discourse the difference between good research and total shit is so abstract and up to the researcher's choice that they **must** make those decisions known and defend them in the literature as needed! it's really academic dishonesty!

--
Thank you Dave Raggett
huh? by argStyopa · 2014-10-07 23:36 · Score: 1

I've been kicking around the internet since before the web, since one was delighted by the capabilities of Gopher, etc.
And:
"We've all experienced the frustration of trying to access information on websites, only to find that the data is trapped in outdated, difficult-to-read file formats and that metadata..."
Nope, not even once.

--
-Styopa
1. Re:huh? by CastrTroy · 2014-10-08 00:40 · Score: 1
  
  I think the thing that bothers me most that might be closely related is finding information on the internet, and not having a date attached to it. You'll search for something, find something that looks like a news article, and there are no dates. No information about when it happened.
  
  --
  
  Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
2. Re:huh? by argStyopa · 2014-10-09 01:35 · Score: 1
  
  This times a BILLION.
  Just to have a crawlable time/date stamp on pages consistently....delightful.
  IIRC there's a Google command-line tag like sort:date or something that will give you freshest pages first, but I've never found it useful - dunno, maybe changing ad-content 'refreshes' page ages to the point of meaninglessness?
  
  --
  -Styopa
Dark Data by tverbeek · 2014-10-08 01:11 · Score: 1

Has this class of data been termed "Dark Data" yet?

--
http://alternatives.rzero.com/
1. Re:Dark Data by neo-mkrey · 2014-10-08 01:55 · Score: 1
  
  It has now ;-)
How do you spell microfiche again? by mrego · 2014-10-08 03:18 · Score: 1

So if I have a YAML or MP3 file, I can convert it...?
open document format by Anonymous Coward · 2014-10-10 18:44 · Score: 0

Well, I hope governments will soon make ODF a default standard. .doc(x) and other MS proprietary formats are a time bomb as well. What will you open your documents with in 50 years?