Bringing Open Source To Biomedicine
waderoush writes "'Facebook and Twitter may have proven that humans have a deep-seated desire for sharing, [but] this impulse is still widely suppressed in biomedicine,' biotech reporter Luke Timmerman observes in this column on Sage Bionetworks founder Stephen Friend. Friend is working to convince drugmakers and academic researchers to pool their experimental genomic data in a shared database called the Sage Commons. The database could be used to track adverse drug events, or to 'visually display network models of disease that connect the dots between genes, proteins, and clinical manifestations of disease in ways that [scientific] journals are not equipped to handle,' Timmerman says. Researchers from Stanford, Columbia, UCSF, and UCSD are already contributing to the Sage Commons, and Friend is now calling for a community effort by drugmakers, academic scientists, doctors, regulators, insurers, and patients to 'grab this platform and run with it on their own.'"
There will not be any "openness" of any kind. There is just too much financial gain at stake (not that it is a good thing).
What standard are you talking about? And what software?
TFA is about sharing data that companies are keeping secret or are too lazy to publish.
I really wish this wasn't an article about "sage commons" but one for Life Sciences & the semantic web - http://www.w3.org/blog/hcls
Thing is, there is a lot of money involved in biomedicine. Research institutes can hope to gain a lot of funding by selling their results to pharmaceutical companies. It would be the equivalent of Microsoft open sourcing the datasets used for their multibillion dollar speech and language technologies.
I can see this happening at universities though with a "GPLv2 equivalent" license on the database.
Well, although you're right, there is still something that I believe is usually called a "clusterfuck" when it comes to data transfer formats for biology and chemistry, and it's not helping the open-ification process any. (Note that this list seems to omit most of the proprietary formats, at least a dozen of which I can name off the top of my head.) It's symptomatic of the commercial land-grab that took place in biomedical computing (mostly) in the nineties.
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
I'd say it's more than sharing - it's about exposing one's thoughts in the desire for acknowledgement or acceptance .. I am not alone.
Also, useful for slagging others to prop up one's own self-esteem and plugging one's own site/service/content or a film one has participated in.
As for medicine .. I think it would be great to get more people on-line with folk remedies, to see if any actually have merit, e.g. chewing willow bark helped relieve my headache (this is the origin of Aspirin as salicylic acid.) The only worries I have, aside from a flood of quackery, would be some billion dollar pharma concern buying up the tract of land where the useful plant or bug lives and trying to patent the heck out of the DNA or synthesis to corner the market.
A feeling of having made the same mistake before: Deja Foobar
Contrary to popular belief scientists don't just sit down and code programs on their off time for fun...
I am a counter-example to your assertion.
My other car is a 1984 Nark Avenger.
Drugmakers are already required to keep track of adverse drug events that arise during clinical testing. Much of this information is reported to regulatory agencies on almost a daily basis and there's a lot of work going behind the scenes to make sure the information is reliable, consistent and keeps patient privacy.
I can understand to some extent why drugmakers aren't too keen to jump into this. There is little use in adding yet another database into an already busy workflow. This new database is guaranteed to be different from many in-house solutions currently in use, so you will need to train people, get them used to the new process, etc. just to input the same data the regulator already receives. IMO this won't be worth the effort in the eyes of many drugmakers unless you get regulatory agencies involved.
I am not saying in general this is not a worthy cause. We currently have more data derived from genomics (and all the other -omics) than we can analyze. However, to be successful these guys need to make sure they aren't duplicating the functionality of the myriad of public databases already out there.
TFT&TFS are as misleading as others have noted; this is about "open-data", not really "open-source".
I am skeptical and have two questions:
(1) In terms of research, isn't this what peer review and publication are supposed to accomplish?
(2) How is "biomedicine" different from "medicine"?
Sage is a spinoff of Merck as Rosetta Inpharmatics. Rosetta died and Sage emerged from it. The spin was that Merck has deposited thousands of clinical mouse strains that are supposedly worth tens of millions USD. I don't buy it.
I know Stephen Friend has been "promoting" his idea of pooling genetic data. The pitch is that by pooling, his company can offer "better" analysis. However, "pooling" here means for his company's (Sage) use and NOT for public good. This article is absolutely misleading! I'm speaking as someone who has dealt with Sage. So, Sage has been acting like a piggy bank. It's easy to deposit data, but it's really hard to get ANYTHING out. The reason is simple: they're ALWAYS citing NDA and privacy concerns (HIPAA among others).
Their supposed algorithms are lousy and shrouded with mysticism. They ALWAYS cite patents and/or proprietary rights. No source code has been released so far. Read their papers / publications. They are FILLED with too many buzzwords and little detail. The important parts are handwaved really vigorously into the paper. It bothers me why people trust them so much. Yes, they can produce good results, no doubt. But are they for open science? Definitely NOT!
Facebook and Twitter may have proven that humans have a deep-seated desire for sharing
If anything, sharing is merely a byproduct of the actual desires that drive those sites. Desires and tendencies such as showing off, egoism, seeking acceptance, seeking affirmation, or gathering information. And more so in the case of Facebook than Twitter, given the studies that repeatedly indicate that Twitter's graph is structured as a news graph rather than as a social graph (what was the last statistic? That the top 1% of Twitter users produce 98% of the tweets that get retweeted?).
To suggest that sharing is the driving desire behind those sites is to give humanity far more credit than it is due.
Oh, come on. That's what computers do if you've someone with smarts on the keyboard. Filters and data conversion's simple stuff. Know your input, know what you want out, figure out what you need to do that.
I know, bleeding edge versions of some software can't even read data from their previous versions. Well, build a box with the old version installed and ... Sometimes it's both a data and system problem. You need a better geek. :-)
I used to work for Atomic Energy Canada. You would not believe the bizarre, proprietary system they bought into for documenting the project we were in. Imagine Lotus Notes ca. '70.
I wrote *my* dox in LaTeX (which was refused), and bailed soon afterwards (into contracting :-).
"Tongue tied and twisted, just an Earth bound misfit
Admitting my ignorance, would you please explain your .sig? Pretty please?
"Tongue tied and twisted, just an Earth bound misfit
As programmers, I would like to think we are positioned to criticize those who don't respect applicable standards. Simply because a brain-dead decision can be accommodated doesn't mean it deserves to live!
And these are simple things, very often—dozens of different metadata and header formats for wrapping and annotating DNA, for example. Totally bogus.
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Regular expressions are a rudimentary programming language (not Turing-complete) most commonly used for matching strings based on patterns. This is similar to the * and ? wildcards used by Unix- and Windows-derived/inspired operating systems for filenames, but more powerful. The answer to the joke consists of a regular expression containing the following four symbols:
. indicates any character other than a linebreak
^ indicates the start of a line
* indicates zero or more repetitions of the previous character
$ indicates the end of a line
Thus, the regular expression is starting at one "side", "crossing" any characters it passes over, and stopping at the other "side". This particular road-crossing joke is unique in that it completely describes the method by which the subject crosses the road, instead of just a brief summary of the goal or, as in the few other jokes that use "how" instead of "why", a vague descriptor of the manner in which the road was crossed.
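For anyone who wants to watch the crossing happen, here is a minimal sketch using Python's standard re module. The helper names crosses and lines_crossed are made up for illustration; the only thing taken from the explanation above is the four-symbol pattern itself.

```python
import re

# The punchline: start of line, any run of non-linebreak characters, end of line.
PATTERN = re.compile(r"^.*$")

def crosses(line: str) -> bool:
    """True if the pattern spans the whole string, from one side to the other."""
    match = PATTERN.match(line)
    return match is not None and match.group(0) == line

def lines_crossed(text: str) -> list:
    """With re.MULTILINE, ^ and $ anchor at every line, so each line is crossed in turn."""
    return re.findall(r"^.*$", text, re.MULTILINE)

print(crosses("the road"))          # a single line is crossed end to end
print(crosses("two\nlines"))        # . refuses to step over the linebreak
print(lines_crossed("two\nlines"))  # one crossing per line
```

Note that without re.MULTILINE the pattern only anchors at the two ends of the whole string, which is why the two-line input fails in crosses but works in lines_crossed.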
Now hand in your geek card, forever.
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
So, you need a DBA who understands data formats.
I've been building stuff like this ever since I got into computing. This is geek heaven. "file $Blah". "apropos $Blah". ...
I don't understand why this is so hard.
"Tongue tied and twisted, just an Earth bound misfit
That's when I whooped out laughing. Damn, this's well written and composed! Thanks.
I do know the values of all those special chars you mention above, but damn, you do put a brilliant spin on them.
No, you'll not get my geek card, except from my cold, dead hands ...! :-|
I'm still giggling. Fun meeting you. Carry on, thanks.
"Tongue tied and twisted, just an Earth bound misfit
It isn't, but they're still jerks for doing it in the first place. Also, your assumptions about the organization sizes involved are a bit high—often we're talking about labs with two or three PhDs and a handful of masters students. Not a major resource of deep computer expertise, or large enough to have a DBA. That they have to export all of their old material and re-import it into a new format when they upgrade their software is an obstacle (albeit a defeatable one!) to getting things done, and before you know it, you're wasting time and company/grant money.
On top of that, you have the same format obsolescence problem we see with physical media: if DNAStar goes out of business, and everyone switches to MacVector, then Microsoft discontinues support for 32-bit executables in Windows 12, how do we interpret what header bytes 8-12 mean in their proprietary SBD format when we need to access Professor (Emeritus) Recently Deceased's early graduate work on a cancer cure, the programmers have been dead for fifty years, and no format documentation was released because people were expected to export to FASTA first? We may be able to recover the sequence from the file (it's stored in lower-case ASCII) but not the annotations. Laboratory work must be redone to confirm hypotheses about the precise format of the binary-encoded addresses, and this could cost months of work and tens of thousands of dollars (today) if Prof. Deceased was working in mammalian cells, which require very expensive techniques to transform with modified DNA.
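To make the "sequence yes, annotations no" point concrete, here is a rough sketch of the salvage job in Python. Everything about the file layout is assumed for illustration: the demo blob is invented, not a real SBD file, and recover_sequences is a hypothetical helper. All it relies on is the one fact stated above, that the sequence sits in the file as lower-case ASCII.

```python
import re

def recover_sequences(blob: bytes, min_len: int = 20) -> list[str]:
    """Pull out long runs of lower-case nucleotide letters from an opaque binary blob.

    min_len filters out short accidental runs in header or annotation bytes.
    """
    pattern = re.compile(("[acgtn]{%d,}" % min_len).encode("ascii"))
    return [m.group(0).decode("ascii") for m in pattern.finditer(blob)]

# Invented stand-in for an undocumented file: opaque header bytes, the sequence,
# then an annotation block whose binary encoding we cannot interpret.
blob = (
    b"\x00\x01HDR\x08\x09\x0a\x0b"
    + b"acgtacgtacgtacgtacgtacgt"
    + b"\xff\xfeannot\x00\x10\x00\x2a"
)

print(recover_sequences(blob))
```

The bytes around the sequence come back as nothing at all, which is the problem: the letters survive, but whatever the header and annotation bytes meant is gone unless someone documents the format.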
In short, the hacker's approach fails here, and hard. Your technique is valid for sensible things like firewall scripts that are all well-commented, but the quantity of file formats in this world that are undocumented (and not self-explanatory) is far greater than that of those which are generally understood. This is the whole point of formats like FASTA and GenBank, and even the hacker's arch-nemesis XML, which are ASCII-encoded and easy to comprehend, but there are many programs that continue to store their material in obscure binary structures for convenience and legacy compatibility, and those companies have yet to cough up any scrap of documentation—in the aforementioned example, MacVector can't read DNAStar's native format, and the manufacturer recommends exporting from the LaserGene suite into a more common format first. Again, hours of headache for semi-computer-literate experimentalists, and potentially months of headache for people digging into historical archives.
Do you understand now?
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Easy, this is why: $
Except for the companies providing the publishing service, nobody makes money with your tweets. On the other hand, complex drug interaction and related biomedical info is a potential source of great return which is gathered at great risk for the company (they can spend millions on R&D and get nothing currently usable out of it). Biomed companies will not share info that a competitor could use.
~Syberz
...soon to usher in the "lawsuit era for the bio-med hack" I can see it now...somewhere somehow some yo-yo is going to state open source led to a hack into some company's personal bottom line. (i.e. personnel info, pay scale, profits info, secret formula for some new ED med...etc)
Joe Investor
Which is why I wrote my dox in LaTeX. I knew they'd reject it. I didn't care.
See, this's the part that pees me off. This is complicated !@#$ for average mortals, and you sciency types, hell everyone in and out of research, have to learn to budget for this specialized computing expertise. My last big client couldn't wait to ship my position off to three guys in Brazil. If that's how much local expertise is appreciated, who'd want to be in this business?
We can do this, but you've got to fund us as much as your sponsors are funding you.
If you've got exotic data to deal with, do you hand it to an intern, or to someone who already knows how to handle it? That's your choice. How soon do you want it?
"Tongue tied and twisted, just an Earth bound misfit
Not to mention, there's a few geeks out here who love to dabble in sciency stuff. Show me an interesting problem, and you won't find it easy to get me off it. I live for the "three pipe problem."
"Tongue tied and twisted, just an Earth bound misfit
You're ignoring my point completely in order to make a stand for your own job security, like an assembly line worker insisting that car-building robots can't cope with the unexpected, and thus car-building robots should be banned. The entire problem can be eliminated by making the data more consistent in the first place. Also, in this case, it's possible that in a few years the people who "already know how to handle it" are dead because the format specs were never released.
I am actually working as a DBA right now supporting a very fucked up genealogy database that uses numbers for table and column names for deliberate obfuscation reasons. This job sucks, because the vendor's shit isn't remotely fucking extensible, and it's a huge amount of work to find the data in its back-end (an old version of Sybase) and manipulate it externally. But at the same time, this database platform provides features that are patent-encumbered and can't be reimplemented, even if we had the money for hiring the developers required. So we have to cope with it. My predecessor left me a book correlating column numbers, table numbers, and data, that had to be reverse-engineered by probing over an SQL connection. We still don't know where significant portions of the input from the UI is stored. All of this was created to prevent customers from migrating away, but that doesn't matter because there's nothing to migrate to.
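For the curious, the "probing over an SQL connection" part looks roughly like the sketch below. It is an illustration only: it uses Python's built-in sqlite3 with an in-memory database instead of the Sybase back-end, and the numbered tables, the data, and the find_value helper are all made up. The idea is the same, though: type a known value into the UI, then hunt through every numbered table and column to see where it landed.

```python
import sqlite3

def find_value(conn: sqlite3.Connection, needle: str) -> list:
    """Return (table, column) pairs in which `needle` appears."""
    hits = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # An empty SELECT still populates cursor.description with column names.
        cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT 0')
        columns = [desc[0] for desc in cursor.description]
        for column in columns:
            query = f'SELECT 1 FROM "{table}" WHERE "{column}" = ? LIMIT 1'
            if conn.execute(query, (needle,)).fetchone():
                hits.append((table, column))
    return hits

# Invented schema with deliberately meaningless numeric names.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "17" ("3" TEXT, "4" TEXT)')
conn.execute('CREATE TABLE "42" ("1" TEXT)')
conn.execute('INSERT INTO "17" VALUES (?, ?)', ("Ada Lovelace", "1815"))
conn.execute('INSERT INTO "42" VALUES (?)', ("London",))

print(find_value(conn, "1815"))
```

Each hit becomes one more line in the correlation book. It only works for values stored verbatim, which is one reason parts of the UI input can stay unaccounted for.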
Your fantasy world that geeks + money = results ignores the amount of pain and suffering that these bad designs are creating in the first place. The whole point of computer technology is to simplify people's lives and work, and data standardization is critical to that, just like quality control of parts was critical to automating assembly lines. Yes, there's still a place for experts, and someone always needs to know how to keep the machines running, but would you personally rather be doing that, or programming the next generation of better automation tools?
Your argument is essentially that of the Luddite. I remind you that there are still artisan textile workers in the world, and suggest that you start your own business pursuing your dreams.
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
I'm sorry you've come to the conclusion that we're in hopeless disagreement with each other. I assure you, you're jumping to conclusions. I've always railed against the proliferation of proprietary, opaque file formats & etc. (remember, I used LaTeX against orders, ffs).
I'll cop to the job security charge, but in my defense, you're the one with the vast, complex problem to solve. I'm someone who (theoretically :-) can solve it.
I'm one of the loudest advocates for this.
What you fail to understand is this is as it is! This is IT. *We* didn't create this clusterfuck, but this is our reality! It is among the youngest sciences out there. We're going to have to go through a lot of !@#$ before it's as solid as other professions. COBOL programmers are still valuable, ffs. There's going to be a lot of deadwood to wade through on the way, and more's created as we speak. I've been fighting this crap since '75 or so. With respect, suck it up. This is IT. Idiots out there created a mess. For our own reasons, we choose to work within that mess. These are our dragons.
dbi_list_schema.pl
Cry me a river.
You appear to be blaming this on me. Why? I'm well aware there's a vast amount of dumbth in IT. Not every geek is worth the air they breathe. I know of Sun Certified engineers who can't use ls to list a directory. Such is life.
If that's really the impression you got, then I've obviously failed to express myself cogently. My apology.
All I'm asking is, is it really worth six months of a postdoc's time to bang their head on this to figure it out, or might a competent specialist manage to get that data for you in a week? When do you want it? How cheap are you? What's the postdoc really want to do? Bang their head on data conversion for six months, or do something with the data?
Wouldn't it be smarter to budget for data conversion specialists in the first place?
"Tongue tied and twisted, just an Earth bound misfit
It may in theory be smarter to budget for data conversion specialists, yes, given their flexibility (we'll assume for simplicity that they're all worth their paychecks), but it's just not practical to do on the scale we're looking at: The average small university would need one per biology/biochemistry/life sciences department, and institutions of that kind of bureaucratic gravitas are hard to move. Rather than pushing for embedding my classmates in every biomedical sciences department in the world, (even though I personally feel that postdocs are staggeringly computer-illiterate sometimes and really need some technically-minded adults to supervise them) it's much more practical to lobby vendors to open their stuff up, so that we can move everything into future-proof formats once, and never have to deal with it ever again.
A lot of biotechnology companies are already appreciating the OSS movement and go so far as to document their file formats in the user's manual for their hardware (e.g. Applied Biosystems's DNA sequencing hardware) but in general, biomedical software companies, such as MacVector and DNAStar from my first example, are still in the "Let's emulate Microsoft!" mindset when it comes to data storage.
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
No. It would need a process implemented by a specialist. I'd do an inventory of all the data, all the file formats that need to be dealt with, then I'd start building tools/filters that handle those types of data. Once built, those tools can be used institution-wide. Soon, you would be batch processing data in the background automatically. Your postdocs would see the output in their email every morning.
I very much doubt that! Since when has that sort of thing been in their interest?
"Tongue tied and twisted, just an Earth bound misfit
That person ought to be employed by your computing centre. You shouldn't even need to budget for them. This's like having someone on hand to do backups, or configure the firewall. You need data conversion on a regular basis. It's an essential service your entire institution needs, institution-wide. What's wrong with your IT dept?
Out-sourced to Brazil?
"Tongue tied and twisted, just an Earth bound misfit
Deciphering some of these formats is, as I've said, non-trivial. Your "start building tools/filters" step is where I find fault, especially when some combinations of closed tools can produce files that aren't lossless, e.g. a Windows metafile of a graph embedded in a FileMaker Pro database. How do you get the data points back out of the graph?
It also doesn't stop the world from continuing to produce files in formats with non-open specifications, even if you've fixed the institutions that have hired you, because you're only treating the symptoms, not the root problem. It ultimately is in the best interest of vendors to be compatible and open, because it's far more convenient for users, and what they want. (And, many organizations and companies are already moving this way, so it's not like no one's ever thought about it.) Consider that this same situation has happened in a number of IT arenas: video encoding being a recent prominent example. When there are open alternatives close enough in quality to closed software—which use widely-supported formats—people tend to prefer them by default. There's no de facto closed standard here, unlike Microsoft Office documents, which is why we often shuffle DNA around in the very simple FASTA format (one line starting with a > for the title of the sequence, and then another line containing the nucleotides, which lacks many useful features.)
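To show just how simple FASTA is, here is a minimal reader sketched in Python. The function name parse_fasta and the sample records are made up for illustration. It follows the description above (a > title line, then the nucleotides), and it also joins sequences that have been wrapped across several lines, which FASTA files commonly do.

```python
def parse_fasta(text: str) -> list:
    """Return a list of (title, sequence) pairs from FASTA-formatted text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        records.append((title, "".join(chunks)))
    return records

sample = ">seq1 demo fragment\nACGT\nTTGA\n>seq2\nGGCC\n"
print(parse_fasta(sample))
```

A title and a string of letters is all you get. There is nowhere to put features, coordinates, or provenance, which is exactly the "lacks many useful features" trade-off.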
As to your other comment: most university IT departments aren't well-prepared for application-specific material. They do things like make sure everyone has a network connection, that the computer labs all work, that every department has its own web-accessible site (which most departments write and maintain themselves), that course scheduling proceeds as normal, etc. Professors are too self-important—and university IT staff are too content to focus on their own material—for their paths to ever cross. The computer situation in most labs I've been to resembles a home LAN, and is generally completely under the control of the lab staff. They wouldn't generally tolerate externally-managed machines, as the time to resolve complications would mean a significant hit to productivity.
To make your batch idea work, you'd have to do the conversions as part of a nightly backup process, requiring no intervention on the part of the user to produce the record. You then have to hope to the gods that you get informed whenever a professor adds a new obscure format to his or her roster, and then personally know enough field-specific information to interpret the format involved. This is a great way to ensure you remain employed forever, but it's not a solution to the problem. And you can bet that running overnight wouldn't be good enough for their every-day conversion needs—many labs are open 24/7 so that staff can get exclusive access to equipment, just like the hackers of the seventies staying up to wait for mainframe access. We need to have a file format flag day, but there's too much mass to do so efficiently.
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
This is an old problem, one that we've been dealing with forever! At a shell prompt on any *nix box, type "apropos 2". On my Linux box, that spits out stuff like:
We've been building and using specialised data conversion tools since forever! Anyone with any shell/perl/python/... scripting foo can build a tool that'll loop over the contents of $INCOMING, detect what sort of file it is, pass it through the correct filter, or bail and scream "Exception!", and go on to the next.
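Something like this, as a bare-bones Python sketch. The sniffing rules, the two toy filters, and every name in it (sniff, FILTERS, process_incoming, UnknownFormat) are assumptions for illustration, not a real tool. It only recognises FASTA by its leading > and GenBank flat files by their leading LOCUS line; anything else gets the "Exception!" treatment and the loop moves on.

```python
from pathlib import Path
import tempfile

class UnknownFormat(Exception):
    """Raised when a file's leading bytes match no known format."""

def sniff(data: bytes) -> str:
    if data.startswith(b">"):
        return "fasta"
    if data.startswith(b"LOCUS"):
        return "genbank"
    raise UnknownFormat("unrecognised header")

def summarize_fasta(data: bytes) -> str:
    count = sum(1 for line in data.splitlines() if line.startswith(b">"))
    return f"fasta: {count} record(s)"

def summarize_genbank(data: bytes) -> str:
    tokens = data.split()
    name = tokens[1].decode("ascii", "replace") if len(tokens) > 1 else "?"
    return f"genbank: {name}"

# One filter per detected format; real filters would convert, not summarise.
FILTERS = {"fasta": summarize_fasta, "genbank": summarize_genbank}

def process_incoming(incoming: Path):
    """Run every file in `incoming` through its filter; collect the rejects."""
    converted, rejected = {}, []
    for path in sorted(incoming.iterdir()):
        if not path.is_file():
            continue
        data = path.read_bytes()
        try:
            converted[path.name] = FILTERS[sniff(data)](data)
        except UnknownFormat:
            rejected.append(path.name)  # scream, then go on to the next
    return converted, rejected

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.fa").write_bytes(b">s1\nACGT\n>s2\nGG\n")
    (root / "c.bin").write_bytes(b"\x00\x01\x02")
    print(process_incoming(root))
```

The rejected list is the interesting output: that is the pile someone has to look at by hand before a new filter gets written.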
As for your "new obscure format", shouldn't you have policies in place to handle this? If $NEWFILEFORMAT is non-portable, submission refused, rework and resubmit, damnit!
"Tongue tied and twisted, just an Earth bound misfit
1. There are lots of cases where the only currently-existing way to get data produced is in a patent-encumbered or indecipherably complex format. There's no reworking or resubmitting; there's just one vendor-specific program that does the magic, its storage format, and an export feature that only captures one viewpoint of the data. The only solution in such a case is getting this original storage format changed.
2. In general, I think you're out of touch with the culture at universities. In general, labs are self-managed and do not have anything remotely resembling a general IT department that they run software purchasing decisions by. There's a very good reason for this: maximum independence and self direction enables maximum efficiency in producing worthwhile results. That's partially to blame for the problems that TFA is about—people not sharing research with each other—but what you're proposing is impossible on many levels.
We're talking about people running experiments, here. They may be inventing new kinds of data (and needing to bring in new software tools) on a very regular basis. A university that encumbers this process with red tape about tools of choice is harming its ability to compete as a research institution, which is its ultimate goal. In practice, no university has policies regarding what labs can run on their computers (which, further, are almost always self-purchased by the labs) much less any restrictions on file formatting. The amount of work that would be required to track all of the tools and utilities used by hundreds of graduate students, postdocs and professors is far greater than you seem to assume, (particularly since it would require understanding of the research, in many cases) and would be extremely invasive and disruptive to productivity.
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
I don't know what sort of IT people you have to put up with, but from my perspective this is not the rocket science you appear to believe it is. Any of my friends would be able to sit down and analyze those bizarre file formats and come up with *some* sort of process to handle them. Some will be finicky and demand hands-on treatment. However, you ought to be able to automate the majority of them.
At the very least, insist this cryptic data is submitted in as many forms as possible: raw binary, export, backup data files, screenshots, email attachments, fax, ... All I'm doing here is advocating for doing it smarter. IT is a young field, but we have learned a few basic laws. Redundancy's an important one.
I'm also suggesting that you're missing the value of divide and conquer. Leave the IT to the geeks. Leave the researchers to research. Don't try to teach a pig to sing; they're not good at it, and it annoys the pig.
Absolutely. I've never been to one (I'm primarily a self-taught hacker:-).
Computers can make life easier for everyone involved, but only if the sharp end's focussed upon. I enjoy implementing solutions that make problems disappear forever. Not all of your labs or projects should have to fight with every problem, ffs! Automate what you can institution-wide, and deal with the rest when you run into them. Iterate.
Not to belittle your burden, but this's what I've been doing for two decades. Some problems are intractable exceptions, however most *can* be handled if you know how. I still say you need a better geek. :-)
"Tongue tied and twisted, just an Earth bound misfit