Bioinformatics in the Post-Genomic Era
Bioinformatics is the science of biological information, namely sequences and metadata about organisms and sequences. What interests many people about this field, both inside the sciences and outside them, is the large volume of data that gets analyzed and the results that emerge on a daily basis. Driven by medical advances and a rapidly growing life sciences business, a complex field has developed over the past ten years or so. Following the sequencing of the human genome, new challenges have arisen for everyone involved. Augen's Bioinformatics provides a good introduction to this new field of research for students in the sciences and for anyone with a decent undergraduate education in modern biology. I think the accessibility of the material is one of the book's biggest selling points.
After an introduction to the book and the subject area of bioinformatics (chapters 1 and 2), Augen begins at the level of the structure of a gene (chapter 3). Here, anyone with an undergraduate-level understanding of genetics or molecular biology can begin using the book and bridging the gap to the new areas of modern bioinformatics. Augen then describes how basic sequence analysis is performed at the DNA sequence level (in chapter 4). The material covers some of the higher-level methods for sequence analysis, including hidden Markov models, neural networks, and pattern discovery, and introduces some of the common algorithms used for this analysis.
Chapter 5 then covers transcription, the process of going from DNA to mRNA. Beginning with the biology behind this activity (the transcription machinery and the larger "transcriptome"), Bioinformatics then describes how you would perform transcriptional analysis. Here, Augen shows how you go from a wet lab to a computational lab and describes what classes of experiments you perform to gather data and what kinds of analysis you then perform on it. This chapter introduces some of the more common clustering techniques for data aggregation and understanding.
The next step in the DNA -> RNA -> protein chain is found in chapter 6, which covers the translation process. Coupled with chapter 7, which describes protein structure prediction and searching, these two chapters bridge the next gap between laboratory data and computational analysis. Protein folding and structure analysis were among my pet areas of study as a graduate student, and Augen's text offers a decent summary of the field to date. The resources listed and techniques described are definitely on par with the common practices in the field.
Finally, Bioinformatics gets into the next major area of bioinformatics, medical databases. Augen's bridge from genetics to medical science is complete, and he discusses how medical professionals utilize databases and can begin to predict disease, for example, based on data mining. The final chapter, "New Themes in Bioinformatics," covers exactly that, but also what Augen refers to as "workflow computing," or basically going about being a bioinformatics scientist. One of my favorite emerging areas in bioinformatics, metabolic pathway elucidation, is also covered briefly.
I've shared this book with a few friends who are all either studying computer science or practicing computer scientists. I did so because Augen's material does a good job of explaining my background and introducing them to some of the forms of analysis I use in my own work, and it gets them quite excited. Bioinformatics really bridges a number of fascinating areas of computer science, including data mining and high performance algorithms. Augen's Bioinformatics is a good introduction to the field for them, and really for anyone who has taken a couple of biology courses in college.
The book's shortcomings, however, fall into two main areas. The first is Augen's presentation of the algorithms. While the method used to describe computational algorithms in Bioinformatics is common among non-computer scientists, it's completely unusable for computer scientists, who are used to a specific algorithm presentation style that looks more like pseudocode than rambling text. The ambiguities this presents for a technical reader are unfortunate, especially if anyone studying bioinformatics is supposed to be computer science literate. The book itself assumes a life science literacy, so this isn't an unreasonable expectation of the reader.
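To illustrate the kind of compact, pseudocode-like presentation meant here, the following is a minimal sketch of global sequence alignment scoring using the standard Needleman-Wunsch dynamic programming recurrence. It is not taken from Augen's book; the function name and the scoring values (match, mismatch, gap) are assumptions chosen purely for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b (Needleman-Wunsch).

    The scoring parameters are illustrative assumptions, not values
    recommended by any particular source.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "AGT"))
```

Stated this way, the recurrence, the boundary conditions, and the cost parameters are all explicit, which is exactly the information that tends to get lost in a prose-only description.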
The second area where the book consistently falls short is the utility of the information given. While I am significantly happier with the quality and depth of material presented in Augen's book than in the O'Reilly bioinformatics series, the book fails to show the reader how to actually use the data they gather. After all, the book shows various sequence analysis algorithms and discusses tools available to do this work, but it devotes only a few pages (out of over 370 in total) to a workflow that can be used. Also, the book sometimes fails to point the reader to very worthwhile web resources, including meta sites like the SDSC Biology Workbench site, and just says "some Perl scripts" for local data analysis. As such, you'll have to go a few extra miles on your own to make use of the data sources.
I guess my third complaint about the book is that Augen has ignored or omitted significant bodies of research that fit squarely into the scope of the book. For example, Ken Dill's research into protein folding models, as well as Martin Karplus' work on the subject, receives no mention, nor does the topic of Bayesian network analysis when Augen discusses time series data analysis. These aren't new; they've been around for many years and have influenced most of the field. The book's spotty coverage in places like these is noticeable.
Bioinformatics does a few things well, but overall reads too much like a biology textbook to be useful to the average computer scientist. More emphasis on the practice of bioinformatics and data analysis would have made this book stronger and complemented the substantive background material well. Finally, using an approach more similar to the computer science approach would have been a tremendous benefit, since the material really is computer science in part. That said, I think this is probably the best introduction to this exciting area of science that I have yet seen.
You can purchase Bioinformatics in the Post-Genomic Era from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.
Uh, genomics isn't going anywhere.
The ambiguities this presents for a technical reader are unfortunate, especially if anyone studying bioinformatics is supposed to be computer science literate. The book itself assumes a life science literacy, so this isn't an unreasonable expectation of the reader.
In bioinformatics, science literacy is so much more important than computer literacy. Computer scientists rarely become good bioinformaticians. This is the primary reason almost every single piece of commercial bioinformatics software is a complete piece of shit, and why the free stuff is hacky but gets the job done. The free stuff was written by life scientists; the commercial stuff was written by computer scientists with no domain knowledge of the question they were trying to answer.
Bioinformatics is not something you 'just get into.' And it is not a natural path to go from CS to bioinformatics.
bioinformatics is more bio than informatics...
---- Where is my mind?
I'm a medical informaticist, and I don't completely agree with part of the above. I, too, read the Wikipedia entry on bioinformatics and saw that my field is lumped in w/ bioinformatics, which is something I don't agree with. Perhaps to a layperson the difference between "bio" and "medical" is not a big one, but practically speaking it is quite big. (The parent of this didn't lump them together, just mentioned it in passing, but I wanted to comment on it.)
Basically, someone like myself might not be too knowledgeable about what I refer to as bioinformatics, which I consider to be everything from DNA->proteins->cell cycles. Bioinformatics focuses on solving problems of data management of the vast amount of information in the above fields, which is a huge undertaking. It also happens to deal more with vast databases and data mining.
A bioinformaticist, on the other hand, is probably not very aware of what I do in my job, which is quite different. I deal on a daily basis with how we manage large datasets of patient-specific data, with the goal of improving medical care. We also deal w/ data mining and database design and all that, but it has a very different focus and uses very different tools. Our solutions are largely focused around caring for patients.
That being said, there is a middle ground of largely research projects that attempt to span that gap. There are some groups working on making data available from the bioinformatics world merge w/ the medical informatics world to be useful in some way to clinicians, but I would estimate the fruit of that labor to be 10-15 years out. These projects are mostly driven by bioinformaticists, since they're more experienced with dealing with their huge datasets and doing the data mining that they do, while the medical informaticists would be more interested in how they feed data into something like that and get something useful out.
At any rate, someone who labels themselves as a "bioinformaticist" and is doing EMR research is clearly just gunning for research money using the bio label (there's more money there).
All the abstruse stuff that an OS, DBMS, or compiler writer should know about, but that an application programmer does not need.
Well, there are at least two answers to that. The first is general: the idea that "programmers don't need to know all that theory" is, IMNSDGHO, largely responsible for all the crappy bloatware that the computing world has to deal with; if programmers spent more time learning real CS than the latest buzzwords, software would generally be much better than it is.
The second is specific to the topic of discussion: scientific programming, including bioinformatics, is much closer to the theoretical level than is most application programming. Pretty widgets don't matter nearly as much as the fact that you're dealing with complex operations on huge data sets, and if you write your program without any awareness of What's Really Going On, then your program will run like shit.
The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.
The poor quality of existing software is a function of both those who work on it _and_ the number of features required for the product. I would argue that MS Office is a great application if you require the list of features the various applications need to provide. The problem with Office is that 99% of the list is extraneous for 99% of its users. I have seen people try to use Excel as a database. Others use it as a viewer for slices of a database. Excel is OK at doing this, but making it OK at this has detracted from its actual problem domain, namely analyzing rather small numeric datasets.
Now the other 99% of software (domain-specific software) developed in house for a company will fall into the other category. Most companies don't want to hire a scientist to develop for them; they really would rather hire a spreadsheet jockey who can understand what if/then/else does and pay them accordingly. The real problem is the proprietary domain-specific applications that are developed by those same spreadsheet jockeys (you know who you are wintam developer). You get neither the skill of a well trained scientist nor the internal expertise of the application.
Such a nicely written point deserves an answer, so I hope this helps.
My experience is that formal training in biology and chemistry cannot hurt, but they're not mandatory.
I have degrees in Comp Sci & Math (like a double major in US), but nothing beyond an introduction to biology and chemistry. I have a good understanding of what I know in biology and chemistry, but I'm just a novice in these areas.
I hold a PhD in CS, with a thesis on bioinformatics. I am fairly active in the area, so my experience might be relevant.
Over the years I found that the only necessary skills are good communication and some mathematical intuition. Programming skills are useful, but marginally so. One good idea easily compensates for ten top programmers. I am a good programmer, with years of practice and a few projects of at least 50,000 lines (some published under GPL). So don't think I'm bashing coders because I'm not good at it myself.
However, I always found that the most successful projects followed from good communication between the modellers and the biologists. As long as they were able to tell each other what they wanted and where things weren't going well, all went beautifully.
The quality of the code was a side issue, discussed only when we didn't have anything else to say.
There were some pitfalls I encountered over time, too.
Modellers thinking they understood everything, and that they could do everything on their own. Usually they produced beautiful theories, without much practical application or success.
Biologists thinking the modellers were trying to devise programmes that would replace them. They generally sneered at our projects and went back to staring at experimental results, hoping they could sift through thousands of rows in Excel. It rarely worked.
Overly complex programme design because some programmer decided it was useful to use the latest buzzword technology. Usually this failed because it actually wasn't necessary to make the project so complex.
As for the available literature, there are some books that deal with the problems and solutions in the field. One such example would be "Bioinformatics" written by Baldi & Brunak. Another would be "Molecular modelling" by Alan Hinchliffe.
I found these geared more towards presenting the problems at hand, and some of the existing algorithms.
So, all in all: one can work in bioinformatics without much training in the life sciences. Some general knowledge is necessary, although mostly to allow communication with the experts in biology or chemistry.
From a social perspective, a somewhat modest attitude (not humble, just know your limitations!) is also important, because it facilitates communication. A positive attitude towards group work is also necessary, since I really cannot see anyone being able to do such research alone.