Bringing Open Source To Biomedicine

So long as there is money to be made by KBentley57 · 2011-04-11 10:58 · Score: 1

there will not be any "openness" of any kind. There is just too much financial gain at stake (not that it is a good thing).

Re:So long as there is money to be made by TaoPhoenix · 2011-04-11 11:03 · Score: 1

We can fight to have pockets of open. Hopefully about 1000 academic articles and 1000 drugs and 1000 genomes to study.
(Starting small)

--
My first Journal Entry ever, in 8 years! http://slashdot.org/journal/365947/aphelion-scifi-fantasy-horror-poetry-webzine
Re:So long as there is money to be made by arun84h · 2011-04-11 11:05 · Score: 1

You mean "no openness of any kind...besides the ones listed in TFA who are already sharing" right?
Re:So long as there is money to be made by sexconker · 2011-04-11 11:08 · Score: 2

I can tell you didn't RTFA, or even RTFS.
TFH is fucking misleading.
There is very little to do with open source, or openness in general.
Some guy is simply trying to get various players to buy into his system, with money and data, so he can then go back and run a few queries, maybe make a little graph, etc., and sell that data to others (for the price of money and more data).
It's basically stone soup, but he demands money as well as all the work. (And if he's not demanding money now, just wait until the date draws nearer.)
But this will never happen. The reason these companies are so tight lipped with their data is not because they don't see a benefit in sharing and accessing data, but because they don't dare let others see their dirty laundry, lest they expose themselves as liable for their fuckups.
Re:So long as there is money to be made by KBentley57 · 2011-04-11 11:09 · Score: 1

I think it was apparent what I meant, but I'll spell it out. Big companies will not be willing to open source everything while they are raking in profits. The comment was meant to set aside those mentioned.
Re:So long as there is money to be made by tqk · 2011-04-11 15:49 · Score: 1

I fail to see why this is a problem for anyone. My idea? Every individual gets a unique alphanumeric ID, that matches a tuple in a nationally maintained database that contains that anonymous citizen's data. You don't need to know the individual's name. You just want his raw data to aggregate with the rest of the population.
What's wrong with this? Simple? Make sure the link is secure and keep the lawyers out. :-|

--
"Tongue tied and twisted, just an Earth bound misfit ..." -- Pink Floyd.
Re:So long as there is money to be made by ldobehardcore · 2011-04-11 18:21 · Score: 1

I really wish the world could work that way... I don't see how any company with a national database of "Anonymous" user IDs could resist doing a cross correlation and using that for direct medical advertising.
The thing is, what you're suggesting is that rights to medical privacy will be revoked from the patients, and their information will be commercialized at the highest rate possible.
I understand the value of having a dataset like this, but it gives me chills to think of the consequences of its implementation.

--
Hectice, baby, Mercator says hello to you
Re:So long as there is money to be made by tqk · 2011-04-11 18:45 · Score: 1

The thing is, what you're suggesting is that rights to medical privacy will be revoked from the patients
No. It's anonymized.
You're arguing against early warning systems.

--
"Tongue tied and twisted, just an Earth bound misfit ..." -- Pink Floyd.
Re:So long as there is money to be made by rtb61 · 2011-04-11 21:22 · Score: 1

They will also not open up anything that remotely hints of greed-driven culpability. In fact, that will be the driving factor not only for the pharmaceutical companies refusing to upload data, but also for attacking Sage Commons with claims that any negative data is false, and for suing to cripple the database.
The only counter will be foreign governments with universal health care who are directly financially affected by poorly performing drugs and who will fight to protect the billions at stake.

--
Chaos - everything, everywhere, everywhen
Re:So long as there is money to be made by ldobehardcore · 2011-04-12 15:10 · Score: 1

There's really no such thing as anonymized data when it comes to large aggregate databases. For example, your Facebook page can be matched to your Netflix account 9 out of 10 times because of the sheer volume of similarity in the data pertaining to your advertising profile.
I'm all for early warning systems based on genetic markers, but I think it is a horrible idea to make a national system for it. I know a small company in the Seattle area where I live that does the testing on their own premises, and doesn't keep any data about the samples they process. There is no database that my information is stored in. If I want to knock $50 off the charge for my genetic processing, I can let them do a little experimentation on the DNA, but they don't hold on to individual results.
I understand your point in that an anonymized national database would make a good standard. But the thing is, any database useful for tracking a personal medical record can, and probably will, be exploited.
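
To make the aggregation worry concrete, here is a minimal sketch of a linkage attack in Python. Everything in it is hypothetical: the field names (zip, birth_year, sex), the record layout, and the function name are assumptions for illustration, not anything taken from a real system. The idea is only that a record with no name on it can still be tied to a person when its combination of ordinary attributes is unique in some other dataset that does carry names.

```python
from collections import defaultdict

def reidentify(anonymized, public, keys):
    """Map anonymized record IDs to names wherever the combination of
    quasi-identifiers in `keys` points at exactly one person in `public`."""
    # Index the named dataset by its quasi-identifier tuple.
    index = defaultdict(list)
    for person in public:
        index[tuple(person[k] for k in keys)].append(person["name"])

    matches = {}
    for record in anonymized:
        candidates = index.get(tuple(record[k] for k in keys), [])
        # Only a unique candidate counts as a re-identification.
        if len(candidates) == 1:
            matches[record["id"]] = candidates[0]
    return matches
```

The more columns the "anonymous" table carries, the more combinations become unique, which is roughly why large aggregate databases are hard to keep anonymous.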

--
Hectice, baby, Mercator says hello to you
Re:So long as there is money to be made by tqk · 2011-04-12 18:28 · Score: 1

There's really no such thing as anonymized data when it comes to large aggregate databases.
Well, I have to admit that's true. Once you start aggregating, ...
Still, anonymize, anonymize, anonymize, anonymize, anonymize, ... Give it to the Secret Service, or NCIS. "We just want your data, we don't care who you are. Honest, it's all just going into this big pot. We're shooting lawyers on sight."

--
"Tongue tied and twisted, just an Earth bound misfit ..." -- Pink Floyd.
Re:So long as there is money to be made by ldobehardcore · 2011-04-13 13:09 · Score: 1

Heh Heh.
Shooting lawyers on sight.
Just appreciating the irony that of any group to shoot with wanton abandon, lawyers would probably be the best equipped to ruin you financially and have you put in jail if you do.

--
Hectice, baby, Mercator says hello to you

Re:Incorporating this "Standard" by blair1q · 2011-04-11 11:13 · Score: 1

What standard are you talking about? And what software?

TFA is about sharing data that companies are keeping secret or are too lazy to publish.

Re:Incorporating this "Standard" by TwistedPants · 2011-04-11 11:14 · Score: 1

I really wish this wasn't an article about "sage commons" but one for Life Sciences & the semantic web - http://www.w3.org/blog/hcls

Tons of money involved by kvvbassboy · 2011-04-11 11:20 · Score: 1

Thing is, there is a lot of money involved in biomedicine. Research institutes can hope to gain a lot of funding by selling their results to pharmaceutical companies. It would be the equivalent of Microsoft open sourcing the datasets used for their multibillion dollar speech and language technologies.

I can see this happening at universities though with a "GPLv2 equivalent" license on the database.

Re:Tons of money involved by oldhack · 2011-04-11 12:19 · Score: 1

The mixture of medicine with so much moneyed interest produces a noxious stench like no other. And no other field is propped up with so much public funding - look up NIH funding vs. any other research funding.
Soon enough, it will also implode our finances, both public and private.

--
Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.

Re:Incorporating this "Standard" by Samantha+Wright · 2011-04-11 11:21 · Score: 3, Informative

Well, although you're right, there is still something that I believe is usually called a "clusterfuck" when it comes to data transfer formats for biology and chemistry, and it's not helping the open-ification process any. (Note that this list seems to omit most of the proprietary formats, at least a dozen of which I can name off the top of my head.) It's symptomatic of the commercial land-grab that took place in biomedical computing (mostly) in the nineties.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!

Twitter and Facebook .. sharing? by ackthpt · 2011-04-11 11:32 · Score: 1

I'd say it's more than sharing - it's about exposing one's thoughts in the desire for acknowledgement or acceptance .. I am not alone.

Also, useful for slagging others to prop up one's own self esteem and plugging one's own site/service/content or a film one has participated in.

As for medicine .. I think it would be great to get more people on-line with folk remedies, to see if any actually have merit, e.g. chewing willow bark helped relieve my headache (this is the origin of aspirin, which is derived from salicylic acid). The only worries I have, aside from a flood of quackery, would be some billion dollar pharma concern buying up the tract of land where the useful plant or bug lives and trying to patent the heck out of the DNA or synthesis to corner the market.

--

A feeling of having made the same mistake before: Deja Foobar

Re:Incorporating this "Standard" by HiggsBison · 2011-04-11 11:33 · Score: 1

Contrary to popular belief scientists don't just sit down and code programs on their off time for fun...

I am a counter-example to your assertion.

--
My other car is a 1984 Nark Avenger.

Adverse drug events and duplicity by rune.w · 2011-04-11 11:48 · Score: 1

Drugmakers are already required to keep track of adverse drug events that arise during clinical testing. Much of this information is reported to regulatory agencies on almost a daily basis and there's a lot of work going on behind the scenes to make sure the information is reliable and consistent and preserves patient privacy.

I can understand to some extent why drugmakers aren't too keen to jump into this. There is little use in adding yet another database into an already busy workflow. This new database is guaranteed to be different from many in-house solutions currently in use, so you will need to train people, get them used to the new process, etc. just to input the same data the regulator already receives. IMO this won't be worth the effort in the eyes of many drugmakers unless you get regulatory agencies involved.

I am not saying in general this is not a worthy cause. We currently have more data derived from genomics (and all the other -omics) than we can analyze. However, to be successful these guys need to make sure they aren't duplicating the functionality of the myriad public databases already out there.

Re:Adverse drug events and duplicity by Daniel+Dvorkin · 2011-04-11 13:35 · Score: 1

Adverse event reporting covers only a tiny fraction of the data gathered in any clinical trial. There's an enormous amount of information that would be useful for future research locked up in clinical trials databases, and as we move into the "genomic medicine" era, this will be ever more the case. Having gone back and forth between bioinformatics and clinical research, and being persistently annoyed at how difficult it is to access data in the latter field, I say that anything that can bring bioinformatics' generally more open approach to the clinical research world is a good thing. Obviously there are privacy concerns when working with human data that don't exist when working with model organisms, but most of what keeps clinical data out of public research databases is plain old inertia, and it sounds like Sage Commons is working to overcome that -- good for them.

--
The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.

Two questions by wrencherd · 2011-04-11 12:14 · Score: 1

TFT&TFS are as misleading as others have noted; this is about "open-data", not really "open-source".

I am skeptical and have two questions:

(1) In terms of research, isn't this what peer review and publication are supposed to accomplish?

(2) How is "biomedicine" different from "medicine"?

Re:Two questions by Daniel+Dvorkin · 2011-04-11 14:45 · Score: 1

(1) Peer review is a lot more powerful when you can review the data itself, not just what the paper says about the data. In bioinformatics, we've known this for years, which is why you absolutely can't publish a paper concerning a microarray experiment without making the raw data available in GEO or a similar repository.
(1.5) Any high-throughput experiment generates enormous amounts of data (that's pretty much the definition of "high-throughput") and that data is very often useful for answering questions other than the specific one the experimenter was asking. Public availability of data has proven an enormous boon to basic biology, and an awful lot of people would like to see that carry over into medical research.
(2) Generally speaking, "medicine" refers to clinical practice, and "medical research" to research in that practice, while "biomedical research" refers to research in the biology underlying disease and the treatment of disease. "Biomedicine" is, more or less, best defined as "what biomedical researchers do." For example, if you have cancer, your oncologist may prescribe chemotherapy (medicine) but before that, a pharmacologist designed the drug you're now being administered (biomedicine) and a biostatistician analyzed the results of the clinical trials on the drug (medical research). Ideally, data sharing will help close the loop: more biostatisticians can analyze the results of your and other patients' treatments, and more pharmacologists can use the results of that analysis to design the next generation of treatments.

--
The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.
Re:Two questions by innerweb · 2011-04-11 15:03 · Score: 1

Marketing.

--
Freud might say that Intelligent Design is religion's ID.

Are you shitting me? by Anonymous Coward · 2011-04-11 14:51 · Score: 1

Sage is a spinoff of Merck as Rosetta Inpharmatics. Rosetta died and Sage emerged from it. The spin was that Merck has deposited thousands of clinical mouse strains that are supposedly worth tens of millions USD. I don't buy it.

I know Stephen Friend has been "promoting" his idea of pooling genetic data. The pitch is that by pooling, his company can offer "better" analysis. However, "pooling" means for his company's (Sage) use and NOT for public good. This article is absolutely misleading! I'm speaking as someone who has dealt with Sage. So, Sage has been acting like a piggy bank. It's easy to deposit data, but it's really hard to get ANYTHING out. The reason is simple: they're ALWAYS citing NDA and privacy concerns (HIPAA among others).

Their supposed algorithms are lousy and shrouded with mysticism. They ALWAYS cite patents and/or proprietary rights. No source code has been released so far. Read their papers / publications. They are FILLED with too many buzzwords and little detail. The important parts are handwaved really vigorously into the paper. It bothers me why people trust them so much. Yes, they can produce good results, no doubt. But are they for open science? Definitely NOT!

Cause and effect by Anubis+IV · 2011-04-11 15:01 · Score: 1

Facebook and Twitter may have proven that humans have a deep-seated desire for sharing

If anything, sharing is merely a byproduct of the actual desires that drive those sites. Desires and tendencies such as showing off, egoism, seeking acceptance, seeking affirmation, or gathering information. And more so in the case of Facebook than Twitter, given the studies that repeatedly indicate that Twitter's graph is structured as a news graph rather than as a social graph (what was the last statistic? That the top 1% of Twitter users produce 98% of the tweets that get retweeted?).

To suggest that sharing is the driving desire behind those sites is to give humanity far more credit than it is due.

Re:Incorporating this "Standard" by tqk · 2011-04-11 16:24 · Score: 1

... "clusterfuck" when it comes to data transfer formats for biology and chemistry

Oh, come on. That's what computers do if you've someone with smarts on the keyboard. Filters and data conversion's simple stuff. Know your input, know what you want out, figure out what you need to do that.

I know, bleeding edge versions of some software can't even read data from their previous versions. Well, build a box with the old version installed and ... Sometimes it's both a data and system problem. You need a better geek. :-)

I used to work for Atomic Energy Canada. You would not believe the bizarre, proprietary system they bought into for documenting the project we were in. Imagine Lotus Notes ca. '70.

I wrote *my* dox in LaTeX (which was refused), and bailed soon afterwards (into contracting :-).

--
"Tongue tied and twisted, just an Earth bound misfit ..." -- Pink Floyd.

Re:Incorporating this "Standard" by tqk · 2011-04-11 16:38 · Score: 1

Q: How did the regular expression cross the road?
A: ^.*$

Admitting my ignorance, would you please explain your .sig? Pretty please?

--
"Tongue tied and twisted, just an Earth bound misfit ..." -- Pink Floyd.

Re:Incorporating this "Standard" by Samantha+Wright · 2011-04-11 17:40 · Score: 1

As programmers, I would like to think we are positioned to criticize those who don't respect applicable standards. Simply because a brain-dead decision can be accommodated doesn't mean it deserves to live!

And these are simple things, very often—dozens of different metadata and header formats for wrapping and annotating DNA, for example. Totally bogus.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!

Re:Incorporating this "Standard" by Samantha+Wright · 2011-04-11 17:49 · Score: 1

Regular expressions are a rudimentary programming language (not Turing-complete) most commonly used for matching strings based on patterns. This is similar to the * and ? wildcards used by Unix- and Windows-derived/inspired operating systems for filenames, but more powerful. The answer to the joke consists of a regular expression containing the following four symbols:

^ indicates the start of a line
. indicates any character other than a linebreak
* indicates "zero or more repetitions of the previous character"
$ indicates the end of a line

Thus, the regular expression is starting at one "side", "crossing" any characters it passes over, and stopping at the other "side". This particular road-crossing joke is unique in that it completely describes the method by which the subject crosses the road, instead of just a brief summary of the goal or, as in the few other jokes that use "how" instead of "why", a vague descriptor of the manner in which the road was crossed.
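
For anyone who wants to watch it cross, a minimal sketch in Python using the standard re module. The function name and the sample strings are made up for illustration; the pattern is the one from the joke.

```python
import re

# The pattern from the joke: start of line, any characters, end of line.
ROAD_CROSSER = re.compile(r"^.*$")

def cross(road: str) -> str:
    """Return the text the pattern passes over, or "" if it cannot cross."""
    match = ROAD_CROSSER.match(road)
    # '.' does not match a linebreak, so a "road" with a linebreak in the
    # middle cannot be crossed in one go (without extra regex flags).
    return match.group(0) if match else ""

print(cross("one side to the other"))
```

Note that a string with a linebreak in the middle does not match at all here, which is the "other than a linebreak" caveat in the list above.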

Now hand in your geek card, forever.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!

Re:Incorporating this "Standard" by tqk · 2011-04-11 17:59 · Score: 1

... dozens of different metadata and header formats for wrapping and annotating DNA, for example

So, you need a DBA who understands data formats.

I've been building stuff like this ever since I got into computing. This is geek heaven. "file $Blah". "apropos $Blah". ...

I don't understand why this is so hard.

--
"Tongue tied and twisted, just an Earth bound misfit ..." -- Pink Floyd.

Re:Incorporating this "Standard" by tqk · 2011-04-11 18:25 · Score: 1

... Thus, the regular expression is starting at one "side", "crossing" any characters it passes over, and stopping at the other "side".

That's when I whooped out laughing. Damn, this's well written and composed! Thanks.

I do know the values of all those special chars you mention above, but damn, you do put a brilliant spin on them.

No, you'll not get my geek card, except from my cold, dead hands ...! :-|

I'm still giggling. Fun meeting you. Carry on, thanks.

--
"Tongue tied and twisted, just an Earth bound misfit ..." -- Pink Floyd.

Re:Incorporating this "Standard" by Samantha+Wright · 2011-04-11 19:12 · Score: 1

It isn't, but they're still jerks for doing it in the first place. Also, your assumptions about the organization sizes involved are a bit high—often we're talking about labs with two or three PhDs and a handful of masters students. Not a major resource of deep computer expertise, or large enough to have a DBA. That they have to export all of their old material and re-import it into a new format when they upgrade their software is an obstacle (albeit a defeatable one!) to getting things done, and before you know it, you're wasting time and company/grant money.

On top of that, you have the same format obsolescence problem we see with physical media: if DNAStar goes out of business, and everyone switches to MacVector, then Microsoft discontinues support for 32-bit executables in Windows 12, how do we interpret what header bytes 8-12 mean in their proprietary SBD format when we need to access Professor (Emeritus) Recently Deceased's early graduate work on a cancer cure, the programmers have been dead for fifty years, and no format documentation was released because people were expected to export to FASTA first? We may be able to recover the sequence from the file (it's stored in lower-case ASCII) but not the annotations. Laboratory work must be redone to confirm hypotheses about the precise format of the binary-encoded addresses, and this could cost months of work and tens of thousands of dollars (today) if Prof. Deceased was working in mammalian cells, which require very expensive techniques to transform with modified DNA.
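
To illustrate the recoverable part: a minimal sketch in Python, assuming (as described above) that the sequence sits in the file as lowercase ASCII. The function name and the minimum run length are made up for illustration, and nothing here reflects the actual layout of the proprietary format. It only scans a binary blob for long runs of nucleotide letters; the annotations stay out of reach this way.

```python
import re

def find_sequence_runs(blob: bytes, min_len: int = 20) -> list:
    """Return every run of lowercase nucleotide letters (a, c, g, t, n)
    at least `min_len` bytes long found in a binary blob."""
    # min_len filters out short chance matches in the surrounding binary data.
    pattern = re.compile(rb"[acgtn]{%d,}" % min_len)
    return [m.group(0).decode("ascii") for m in pattern.finditer(blob)]
```

You get the letters back, but not where a gene starts, which features were marked, or what anyone meant by them.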

In short, the hacker's approach fails here, and hard. Your technique is valid for sensible things like firewall scripts that are all well-commented, but the quantity of file formats in this world that are undocumented (and not self-explanatory) is far greater than that of those which are generally understood. This is the whole point of formats like FASTA and GenBank, and even the hacker's arch-nemesis XML, which are ASCII-encoded and easy to comprehend, but there are many programs that continue to store their material in obscure binary structures for convenience and legacy compatibility, and those companies have yet to cough up any scrap of documentation—in the aforementioned example, MacVector can't read DNAStar's native format, and the manufacturer recommends exporting from the LaserGene suite into a more common format first. Again, hours of headache for semi-computer-literate experimentalists, and potentially months of headache for people digging into historical archives.

Do you understand now?

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!

Why no sharing? by Syberz · 2011-04-11 23:58 · Score: 1

Easy, this is why: $

Except for the companies providing the publishing service, nobody makes money with your tweets. On the other hand, complex drug interaction and related biomedical info is a potential source of great return which is gathered at great risk for the company (they can spend millions on R&D and get nothing currently usable out of it). Biomed companies will not share info that a competitor could use.

--
~Syberz

Ahhh yes open source in Bio med... by Schmyz · 2011-04-12 03:53 · Score: 1

...soon to usher in the "lawsuit era for the bio-med hack" I can see it now...somewhere somehow some yo-yo is going to state open source led to a hack into some company's personal bottom line. (i.e. personnel info, pay scale, profits info, secret formula for some new ED med...etc)

--
Joe Investor

Re:Incorporating this "Standard" by tqk · 2011-04-12 06:16 · Score: 1

I don't understand why this is so hard.
It isn't, but they're still jerks for doing it in the first place.

Which is why I wrote my dox in LaTeX. I knew they'd reject it. I didn't care.

... labs with two or three PhDs and a handful of masters students. Not a major resource of deep computer expertise, or large enough to have a DBA.

See, this's the part that pees me off. This is complicated !@#$ for average mortals, and you sciency types, hell everyone in and out of research, have to learn to budget for this specialized computing expertise. My last big client couldn't wait to ship my position off to three guys in Brazil. If that's how much local expertise is appreciated, who'd want to be in this business?

We can do this, but you've got to fund us as much as your sponsors are funding you.

If you've got exotic data to deal with, do you hand it to an intern, or to someone who already knows how to handle it? That's your choice. How soon do you want it?

--
"Tongue tied and twisted, just an Earth bound misfit ..." -- Pink Floyd.

Re:Incorporating this "Standard" by tqk · 2011-04-12 06:35 · Score: 1

Contrary to popular belief scientists don't just sit down and code programs on their off time for fun...
I am a counter-example to your assertion.

Not to mention, there's a few geeks out here who love to dabble in sciency stuff. Show me an interesting problem, and you won't find it easy to get me off it. I live for the "three pipe problem."

--
"Tongue tied and twisted, just an Earth bound misfit ..." -- Pink Floyd.

Re:Incorporating this "Standard" by Samantha+Wright · 2011-04-12 10:40 · Score: 1

You're ignoring my point completely in order to make a stand for your own job security, like an assembly line worker insisting that car-building robots can't cope with the unexpected, and thus car-building robots should be banned. The entire problem can be eliminated by making the data more consistent in the first place. Also, in this case, it's possible that in a few years the people who "already know how to handle it" are dead because the format specs were never released.

I am actually working as a DBA right now supporting a very fucked up genealogy database that uses numbers for table and column names for deliberate obfuscation reasons. This job sucks, because the vendor's shit isn't remotely fucking extensible, and it's a huge amount of work to find the data in its back-end (an old version of Sybase) and manipulate it externally. But at the same time, this database platform provides features that are patent-encumbered and can't be reimplemented, even if we had the money for hiring the developers required. So we have to cope with it. My predecessor left me a book correlating column numbers, table numbers, and data, that had to be reverse-engineered by probing over an SQL connection. We still don't know where significant portions of the input from the UI are stored. All of this was created to prevent customers from migrating away, but that doesn't matter because there's nothing to migrate to.

Your fantasy world that geeks + money = results ignores the amount of pain and suffering that these bad designs are creating in the first place. The whole point of computer technology is to simplify people's lives and work, and data standardization is critical to that, just like quality control of parts was critical to automating assembly lines. Yes, there's still a place for experts, and someone always needs to know how to keep the machines running, but would you personally rather be doing that, or programming the next generation of better automation tools?

Your argument is essentially that of the Luddite. I remind you that there are still artisan textile workers in the world, and suggest that you start your own business pursuing your dreams.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!

Re:Incorporating this "Standard" by tqk · 2011-04-12 12:04 · Score: 1

I'm sorry you've come to the conclusion that we're in hopeless disagreement with each other. I assure you, you're jumping to conclusions. I've always railed against the proliferation of proprietary, opaque file formats & etc. (remember, I used LaTeX against orders, ffs).

You're ignoring my point completely in order to make a stand for your own job security

I'll cop to the job security charge, but in my defense, you're the one with the vast, complex problem to solve. I'm someone who (theoretically :-) can solve it.

The entire problem can be eliminated by making the data more consistent in the first place.

I'm one of the loudest advocates for this.

What you fail to understand is this is as it is! This is IT. *We* didn't create this clusterfuck, but this is our reality! It is among the youngest sciences out there. We're going to have to go through a lot of !@#$ before it's as solid as other professions. COBOL programmers are still valuable, ffs. There's going to be a lot of deadwood to wade through on the way, and more's created as we speak. I've been fighting this crap since '75 or so. With respect, suck it up. This is IT. Idiots out there created a mess. For our own reasons, we choose to work within that mess. These are our dragons.

dbi_list_schema.pl

Cry me a river.

Your fantasy world that geeks + money = results ignores the amount of pain and suffering that these bad designs are creating in the first place.

You appear to be blaming this on me. Why? I'm well aware there's a vast amount of dumbth in IT. Not every geek is worth the air they breathe. I know of Sun Certified engineers who can't use ls to list a directory. Such is life.

Your argument is essentially that of the Luddite.

If that's really the impression you got, then I've obviously failed to express myself cogently. My apology.

All I'm asking is, is it really worth six months of a postdoc's time to bang their head on this to figure it out, or might a competent specialist manage to get that data for you in a week? When do you want it? How cheap are you? What's the postdoc really want to do? Bang their head on data conversion for six months, or do something with the data?

Wouldn't it be smarter to budget for data conversion specialists in the first place?

--
"Tongue tied and twisted, just an Earth bound misfit ..." -- Pink Floyd.

Re:Incorporating this "Standard" by Samantha+Wright · 2011-04-12 12:31 · Score: 1

It may in theory be smarter to budget for data conversion specialists, yes, given their flexibility (we'll assume for simplicity that they're all worth their paychecks), but it's just not practical to do on the scale we're looking at: The average small university would need one per biology/biochemistry/life sciences department, and institutions of that kind of bureaucratic gravitas are hard to move. Rather than pushing for embedding my classmates in every biomedical sciences department in the world, (even though I personally feel that postdocs are staggeringly computer-illiterate sometimes and really need some technically-minded adults to supervise them) it's much more practical to lobby vendors to open their stuff up, so that we can move everything into future-proof formats once, and never have to deal with it ever again.

A lot of biotechnology companies are already appreciating the OSS movement and go so far as to document their file formats in the user's manual for their hardware (e.g. Applied Biosystems's DNA sequencing hardware) but in general, biomedical software companies, such as MacVector and DNAStar from my first example, are still in the "Let's emulate Microsoft!" mindset when it comes to data storage.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!

Re:Incorporating this "Standard" by tqk · 2011-04-12 13:46 · Score: 1

The average small university would need one per biology/biochemistry/life sciences department ...

No. It would need a process implemented by a specialist. I'd do an inventory of all the data, all the file formats that need to be dealt with, then I'd start building tools/filters that handle those types of data. Once built, those tools can be used institution-wide. Soon, you would be batch processing data in the background automatically. Your postdocs would see the output in their email every morning.
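
A rough sketch of what such a set of filters might look like, in Python. The two input formats here ("raw" and "tsv") and all the function names are hypothetical stand-ins, not real vendor formats; the point is only the shape: one small converter per known format, registered in a table, all feeding one common output.

```python
from typing import Callable, Dict, Tuple

# Each converter turns one input format into a (title, sequence) pair.
Converter = Callable[[str], Tuple[str, str]]
CONVERTERS: Dict[str, Converter] = {}

def register(fmt: str):
    """Decorator that adds a converter to the registry under `fmt`."""
    def wrap(fn: Converter) -> Converter:
        CONVERTERS[fmt] = fn
        return fn
    return wrap

@register("raw")
def from_raw(text: str) -> Tuple[str, str]:
    # Hypothetical format: bare nucleotides, possibly split across lines.
    return "untitled", "".join(text.split()).lower()

@register("tsv")
def from_tsv(text: str) -> Tuple[str, str]:
    # Hypothetical format: a single "title<TAB>sequence" line.
    title, seq = text.strip().split("\t", 1)
    return title.strip(), seq.strip().lower()

def to_fasta(fmt: str, text: str) -> str:
    """Convert `text` in a registered format to a simple FASTA record."""
    if fmt not in CONVERTERS:
        raise ValueError(f"no converter registered for format: {fmt}")
    title, seq = CONVERTERS[fmt](text)
    return f">{title}\n{seq}\n"
```

Adding a format means writing one more small function; the hard part, of course, is knowing what the format is in the first place.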

... it's much more practical to lobby vendors to open their stuff up ...

I very much doubt that! Since when has that sort of thing been in their interest?

--
"Tongue tied and twisted, just an Earth bound misfit ..." -- Pink Floyd.

Re:Incorporating this "Standard" by tqk · 2011-04-12 14:17 · Score: 1

The average small university would need one per biology/biochemistry/life sciences department ...
No. It would need a process implemented by a specialist.

That person ought to be employed by your computing centre. You shouldn't even need to budget for them. This is like having someone on hand to do backups, or configure the firewall. You need data conversion on a regular basis. It's an essential service your entire institution needs. What's wrong with your IT dept?

Out-sourced to Brazil?

--
"Tongue tied and twisted, just an Earth bound misfit ..." -- Pink Floyd.

Re:Incorporating this "Standard" by Samantha+Wright · 2011-04-13 00:13 · Score: 1

Deciphering some of these formats is, as I've said, non-trivial. Your "start building tools/filters" step is where I find fault, especially when some combinations of closed tools can produce files that have already lost information, e.g. a Windows metafile of a graph embedded in a FileMaker Pro database. How do you get the data points back out of the graph?

It also doesn't stop the world from continuing to produce files in formats with non-open specifications, even if you've fixed the institutions that have hired you, because you're only treating the symptoms, not the root problem. It ultimately is in the best interest of vendors to be compatible and open, because it's far more convenient for users, and it's what they want. (And many organizations and companies are already moving this way, so it's not like no one's ever thought about it.) Consider that this same situation has happened in a number of IT arenas, video encoding being a recent prominent example. When there are open alternatives close enough in quality to closed software, and they use widely-supported formats, people tend to prefer them by default. There's no de facto closed standard here, unlike Microsoft Office documents, which is why we often shuffle DNA around in the very simple FASTA format (one line starting with a > for the title of the sequence, followed by one or more lines containing the nucleotides), even though it lacks many useful features.
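To show just how simple FASTA is, here is a minimal reader sketch. The function name and the example record are invented for illustration; real files often wrap the nucleotides over several lines, so the sketch joins them, and it ignores everything a production parser would have to care about (comments, validation, huge files).

```python
def parse_fasta(text):
    """Return a list of (title, sequence) pairs from FASTA-formatted text."""
    records = []
    title = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                records.append((title, "".join(chunks)))
            title = line[1:].strip()
            chunks = []
        elif title is not None:
            # Sequence lines belonging to the current record.
            chunks.append(line)
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


# Hypothetical example data, not taken from any real sequence.
example = """>seq1 hypothetical example record
ACGTACGT
TTGA
>seq2
GGCC
"""
```

Everything else (annotations, quality scores, provenance) has to live somewhere outside the file, which is the "lacks many useful features" part.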

As to your other comment: most university IT departments aren't well-prepared for application-specific material. They do things like make sure everyone has a network connection, that the computer labs all work, that every department has its own web-accessible site (which most departments write and maintain themselves), that course scheduling proceeds as normal, etc. Professors are too self-important—and university IT staff are too content to focus on their own material—for their paths to ever cross. The computer situation in most labs I've been to resembles a home LAN, and is generally completely under the control of the lab staff. They wouldn't generally tolerate externally-managed machines, as the time to resolve complications would mean a significant hit to productivity.

To make your batch idea work, you'd have to do the conversions as part of a nightly backup process, requiring no intervention on the part of the user to produce the record. You then have to hope to the gods that you get informed whenever a professor adds a new obscure format to his or her roster, and then personally know enough field-specific information to interpret the format involved. This is a great way to ensure you remain employed forever, but it's not a solution to the problem. And you can bet that running overnight wouldn't be good enough for their everyday conversion needs: many labs are open 24/7 so that staff can get exclusive access to equipment, just like the hackers of the seventies staying up to wait for mainframe access. We need to have a file format flag day, but there's too much mass to do so efficiently.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!

Re:Incorporating this "Standard" by tqk · 2011-04-13 06:14 · Score: 1

To make your batch idea work, you'd have to do the conversions as part of a nightly backup process, requiring no intervention on the part of the user to produce the record. You then have to hope to the gods that you get informed whenever a professor adds a new obscure format to his or her roster, and then personally know enough field-specific information to interpret the format involved.

This is an old problem, one that we've been dealing with forever! At a shell prompt on any *nix box, type "apropos 2". On my Linux box, that spits out stuff like:

po2debconf
pod2html
pod2latex
pod2man
pod2text
pod2usage
ps2ascii
ps2epsi
ps2pdf

We've been building and using specialised data conversion tools since forever! Anyone with any shell/perl/python/... scripting foo can build a tool that'll loop over the contents of $INCOMING, detect what sort of file it is, pass it through the correct filter, or bail and scream "Exception!", and go on to the next.
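A rough sketch of that loop, in Python. The detection rules, the filter, and all the names here are hypothetical placeholders; a real setup would need far more careful detection and one filter per format in the inventory.

```python
from pathlib import Path


class UnknownFormat(Exception):
    """Raised when a file cannot be identified or has no filter."""


def detect_format(path):
    """Guess a format name from the first bytes (minimal, illustrative rules)."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head.startswith(b">"):
        return "fasta"
    if head == b"%PDF":
        return "pdf"
    if head == b"%!PS":
        return "postscript"
    raise UnknownFormat(f"unrecognised format: {path.name}")


def fasta_to_plain(path):
    """Placeholder filter: drop title lines, keep only the sequence letters."""
    lines = Path(path).read_text().splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">"))


# Only one filter is registered in this sketch; others would be added here.
FILTERS = {"fasta": fasta_to_plain}


def process_incoming(incoming):
    """Run every file in `incoming` through its filter; collect the failures."""
    converted, failed = {}, {}
    for path in sorted(Path(incoming).iterdir()):
        if not path.is_file():
            continue
        try:
            fmt = detect_format(path)
            if fmt not in FILTERS:
                raise UnknownFormat(f"no filter for {fmt}: {path.name}")
            converted[path.name] = FILTERS[fmt](path)
        except UnknownFormat as err:
            failed[path.name] = str(err)  # scream "Exception!", then move on
    return converted, failed
```

The failed dictionary is the interesting output: it is the list of files somebody still has to look at by hand.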

As for your "new obscure format", shouldn't you have policies in place to handle this? If $NEWFILEFORMAT is non-portable, submission refused, rework and resubmit, damnit!

--
"Tongue tied and twisted, just an Earth bound misfit ..." -- Pink Floyd.

Re:Incorporating this "Standard" by Samantha+Wright · 2011-04-13 07:23 · Score: 1

1. There are lots of cases where the only currently-existing way to get data produced is in a patent-encumbered or indecipherably complex format. There's no reworking or resubmitting; there's just one vendor-specific program that does the magic, its storage format, and an export feature that only captures one viewpoint of the data. The only solution in such a case is getting this original storage format changed.

2. In general, I think you're out of touch with the culture at universities. Labs are typically self-managed and do not have anything remotely resembling a general IT department that they run software purchasing decisions by. There's a very good reason for this: maximum independence and self-direction enable maximum efficiency in producing worthwhile results. That's partially to blame for the problems that TFA is about (people not sharing research with each other), but what you're proposing is impossible on many levels.

We're talking about people running experiments, here. They may be inventing new kinds of data (and needing to bring in new software tools) on a very regular basis. A university that encumbers this process with red tape about tools of choice is harming its ability to compete as a research institution, which is its ultimate goal. In practice, no university has policies regarding what labs can run on their computers (which, further, are almost always self-purchased by the labs), much less any restrictions on file formatting. The amount of work that would be required to track all of the tools and utilities used by hundreds of graduate students, postdocs and professors is far greater than you seem to assume (particularly since it would require understanding of the research, in many cases), and would be extremely invasive and disruptive to productivity.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!

Re:Incorporating this "Standard" by tqk · 2011-04-13 08:53 · Score: 1

There are lots of cases where the only currently existing way to get data produced is in a patent encumbered or indecipherably complex format. There's no reworking or resubmitting; there's just one vendor-specific program that does the magic, its storage format, and an export feature that only captures one viewpoint of the data.

I don't know what sort of IT people you have to put up with, but from my perspective this is not the rocket science you appear to believe it is. Any of my friends would be able to sit down and analyze those bizarre file formats and come up with *some* sort of process to handle them. Some will be finicky and demand hands-on treatment. However, you ought to be able to automate the majority of them.

At the very least, insist this cryptic data is submitted in as many forms as possible: raw binary, export, backup data files, screenshots, email attachments, fax, ... All I'm doing here is advocating for doing it smarter. IT is a young field, but we have learned a few basic laws. Redundancy's an important one.

I'm also suggesting that you're missing the value of divide and conquer. Leave the IT to the geeks. Leave the researchers to research. Don't try to teach a pig to sing; they're not good at it, and it annoys the pig.

In general, I think you're out of touch with the culture at universities.

Absolutely. I've never been to one (I'm primarily a self-taught hacker:-).

Computers can make life easier for everyone involved, but only if the sharp end's focussed upon. I enjoy implementing solutions that make problems disappear forever. Not all of your labs or projects should have to fight with every problem, ffs! Automate what you can institution-wide, and deal with the rest when you run into them. Iterate.

Not to belittle your burden, but this is what I've been doing for two decades. Some problems are intractable exceptions; however, most *can* be handled if you know how. I still say you need a better geek. :-)

--
"Tongue tied and twisted, just an Earth bound misfit ..." -- Pink Floyd.

Slashdot Mirror
