Slashdot Mirror


Beginning Perl for Bioinformatics

babbage writes: "As the banner above the title of James Tisdall's Beginning Perl for Bioinformatics indicates, this book is 'an introduction to Perl for biologists.' What the banner doesn't mention is that it's also an introduction to biology and bioinformatics for Perl programmers, and it's also an introduction to both Perl *and* biology for people that have never really been exposed to either field. The author has clearly thought a lot about making one book to please these different audiences, and he has pulled it off nicely, in a way that manages to explain basic topics to people learning about each field for the first time while not coming off as condescending or slow-paced to those that might already have some exposure to it." Read on for the rest of his review.

Beginning Perl for Bioinformatics
author: James Tisdall
pages: 400
publisher: O'Reilly & Associates
rating: 8
reviewer: babbage
ISBN: 0-596-00080-4
summary: Well-balanced approach to applying Perl's sorting and analytical abilities to the field of bioinformatics.

Superficially, this book isn't all that different from a lot of introductory Perl books: the Perl material starts out with an overview of the language, followed by a crash course on installing Perl, writing programs, and running them. From there, it goes on to introduce all the various language constructs, from variables to statements to subroutines, that any programmer is going to have to get comfortable with. Pretty run of the mill so far. Tisdall starts with two interesting assumptions, though: [1] that the reader may have never written a computer program before, and so needs to learn how to engineer a robust application that will do its job efficiently and well, and [2] that the reader wants to know how to write programs that can solve a series of biological problems, specifically in genetics and proteomics.

As such, there is at least as much material about the problems that a biologist faces and the places she can go to get the data she needs as there is about the issues that a Perl programmer needs to be aware of. The author introduces the reader to the basics of DNA chemistry, the cellular processes that convert DNA to RNA and then proteins, and a little bit about how and why this is important to the biologist and what sorts of information would help a biologist's research. The main sources of public genetic data are noted, and the often confusing -- and huge -- datafiles that can be obtained from these sources are examined in detail.
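To make the DNA-to-RNA-to-protein pathway described above concrete, here is a minimal sketch. It is written in Python for illustration and is not code from the book, which uses Perl. The function names are hypothetical, and the codon table is a deliberately tiny subset of the standard genetic code, so codons outside it show up as "?".

```python
# Illustrative subset of the standard genetic code (RNA codons).
# A real program would carry all 64 codons.
CODON_TABLE = {
    "AUG": "M",                                      # methionine, the usual start codon
    "UUU": "F", "UUC": "F",                          # phenylalanine
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",  # glycine
    "AAA": "K", "AAG": "K",                          # lysine
    "UAA": "*", "UAG": "*", "UGA": "*",              # stop codons
}


def transcribe(dna):
    """Return the RNA form of a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Translate RNA three bases at a time, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino = CODON_TABLE.get(rna[i:i + 3], "?")  # "?" marks a codon missing from the subset
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


rna = transcribe("ATGTTTGGCAAATAA")
protein = translate(rna)
print(rna, protein)
```

Both steps are plain string operations, which is why this kind of work is so often described as a text-processing problem.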

With the code he presents for solving these problems, Tisdall makes a point of not falling into the indecipherable-Perl trap: this is a useful language, well-suited to the essentially text-analysis problems that bioinformatics means, and he doesn't want to encourage the kind of dense, obscure, idiomatic coding style that has given Perl an undeservedly bad reputation. Some of Perl's more esoteric constructs are useful, and they show up when they're needed, but they're left out when they would only serve to confuse the reader. This is a good decision.

Rather, the focus is on teaching readers how to solve biological problems with a carefully developed library of code that happens to leverage some of Perl's most useful properties. The result is pretty much a biologist's edition of Christiansen & Torkington's Perl Cookbook or Dave Cross' Data Munging With Perl. The author presents a series of issues that a working bioinformaticist might have to deal with daily -- parsing over BLAST, GenBank, and PDB files, finding relevant motifs in that parsed data, and preparing reports about all of it. If a bioinformaticist's job is to be able to report on interesting patterns from these various sources, then following the programming techniques that Tisdall explains in clear, easy-to-follow prose would be an excellent way to go about doing it.
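As a rough illustration of the "find motifs, then report on them" workflow described above, the following Python sketch searches a sequence with a regular expression and builds a tab-separated report. It is an assumption-laden example rather than code from the book: the function name and the sample sequence are made up, and the patterns are just short example recognition sites.

```python
import re


def find_motif(sequence, pattern):
    """Return (start, matched_text) for each non-overlapping match, 0-based."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, sequence)]


# Made-up sample sequence for illustration only.
dna = "TTGAATTCAAGGATCCGAATTC"

exact_hits = find_motif(dna, "GAATTC")       # exact six-base motif
fuzzy_hits = find_motif(dna, "GGATC[CT]")    # last base may be C or T

# A simple report: one line per hit, position and matched text.
report = "\n".join(f"{start}\t{text}" for start, text in exact_hits)
print(report)
```

Keeping the search in its own small function mirrors the decomposition into reusable subroutines that the review praises.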

And when I say "programming techniques," note that I'm not specifically mentioning Perl. The code in this book is clear and organized, and all programs are carefully decomposed into logical subroutines that are then packaged up into a library file that each later sample program gets to draw from. Each new program typically contains a main section of a dozen lines of code or less, followed by no more than two or three new subroutines, along with calls to routines written earlier and called from the BeginPerlBioinfo.pm that is built up as the book progresses. Each sample is typically preceded by a description of what it's trying to accomplish and followed by a detailed description of how it was done, as well as suggestions of other ways that might have worked or not worked.

This modular approach is fantastic -- too many Perl books seem to focus so heavily on the mechanics of getting short scripts to work that they lose sight of how to build up a suite of useful methods and, from those methods, to develop ever-more-sophisticated applications. It isn't quite object-oriented programming, but that's clearly where Tisdall is headed with these samples, and given a few more chapters he probably would have started formally wrapping some of this code into OO packages.

If I have a complaint with the book, in fact, it's that Tisdall doesn't go any further: everything is good, but it ends too soon. Seemingly important topics such as OO programming, XML, graphics (charts & GUIs), CGI, and DBI are mentioned only in passing, under "further topics" in the last chapter. I also have a feeling that some of the biology was shorted, and the book barely touches upon the statistical analysis that probably is a critical aspect of the advanced bioinformaticist's toolbox. I can understand wanting to keep the length of a beginner's book relatively short, and this was probably the right decision, but it would have been nice to see some of the earlier sample problems revisited in these new contexts by, for example, formally making an OO library, showing a sample program that provided a web interface to some of the methods already written, or presenting code that presented results as XML or exchanged them with a database.

But these are minor quibbles, and if the reader is comfortable with the material up to this point, she shouldn't have a hard time figuring out how to go a step further and do these things alone. It's a solid book, and one that should be able to get people learning Perl, genetics, or both up to speed and working on real world problems quickly.


14 of 127 comments

  1. The challenge of Bioinformatics by nesneros · · Score: 5, Informative

    Bioinformatics is probably the biggest challenge facing the biological sciences in the next few years. It's becoming more and more apparent that even slight changes in very small elements of a system (e.g., a small sequence of a protein, or the behavior of a single neuron within a group of 10,000) can have a drastic effect on the behavior of the entire system. As a result, to really study the problem, you have to acquire massive amounts of data. For example, in our lab we routinely collect 64 channels of 16-bit data (monitoring neuron firing in culture) at 1 kHz. In addition, we're simultaneously taking calcium imaging video at 100 fps at 256x256 (at 256 colors). This results in about 200 MB of data gathered every second. Considering we run tests for over 10 minutes, just acquiring and storing this data is a challenge, but finding useful methods to analyze it is even more difficult. It's refreshing to see texts being written on how to bridge the gap between comp. sci. and biology. I've been working in the area for about 4 years now, and it's really great to see the field growing and getting more mainstream attention.

    --
    Some men spend their entire lives trying to kill themselves for having been born. --Ross MacDonald
    1. Re:The challenge of Bioinformatics by babbage · · Score: 3, Informative
      Ok, so you can work out about how much data is coming out of that machine. Now assume that the lab in question has several such machines, and that labs all over the world are churning out this degree of output, and maybe your lab needs to keep a local copy of all the relevant data. Go ahead & make up your own numbers if you'd like, but keep in mind that this is a huge field these days, so there are probably hundreds or thousands of such groups working on it all, and they're churning out, by your math, 7 MB per second per machine per lab.
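The per-machine figure can be reproduced from the acquisition parameters quoted earlier in the thread (64 channels of 16-bit samples at 1 kHz, plus 256x256 video at 100 fps). The Python sketch below is a back-of-the-envelope check, not a measurement. It assumes that "256 colors" means one byte per pixel and that the streams are stored uncompressed with no file-format overhead.

```python
# Electrode recording: 64 channels, 16-bit (2-byte) samples, 1 kHz sampling.
electrode_bytes_per_s = 64 * (16 // 8) * 1_000

# Calcium imaging: 256x256 pixels per frame, 100 frames per second.
# Assumption: 256 colors means 8 bits, so 1 byte per pixel.
video_bytes_per_s = 256 * 256 * 1 * 100

total_bytes_per_s = electrode_bytes_per_s + video_bytes_per_s

# A single 10-minute run at that rate.
run_bytes = total_bytes_per_s * 10 * 60

print(f"{total_bytes_per_s / 1e6:.2f} MB/s, {run_bytes / 1e9:.2f} GB per 10-minute run")
```

Under these assumptions the raw stream is a little under 7 MB per second, dominated by the video, and one 10-minute run comes to roughly 4 GB.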

      Now take a step in a different direction, and realize that we don't know what *any* of this stuff means (much less than 1% of it, at a rough estimate). We've got a completed genome project that has produced another mountain of mostly undecoded data.

      Or to go to the central issue, we understand that DNA translates trivially to RNA, then to chains of amino acids that fold up into balls of protein, with secondary, tertiary, (etc.) levels of structure. Largely this is determined by how the chemical bonds between each amino acid twist together, and how disparate segments of the chain come close together or far apart. And the effect of this protein chain biologically is determined by which segments of the chain end up at which parts of the knot: the same sequence of amino acids can be neutral or active depending on whether it's near the surface, for example. Finally, go a step further and realize that all these proteins in the body are in a constant state of flux, constantly changing each other, catalyzing each other, restricting each other, and so on. The number of active variables very quickly hits a point that becomes incalculable, and we're down to a new version of the travelling salesman problem, which no contemporary computer system can even dent, never mind solve.

      Laugh at the stream of data if you want to, but keep in mind that it's not like we're just talking about a piece of network hardware that needs to be able to shuffle this much data around more or less blindly. Rather, any & all of it could be biologically relevant in any given context, and so each bit of that data stream has to be scrutinized, often more than once in different contexts. It is, simply, a *huge* amount of computational work.

  2. More for your library by chundercanada · · Score: 5, Informative
    I just spent a couple of days trying to choose a few books in this area. My interest was as a computer guy needing to get filled in on the bio side of things. Here are the books I ended up ordering:

    Human Molecular Genetics 2: Looks to be a great primer on all the biology background.

    Bioinformatics: A Practical Guide...: This book is a detailed tour of the online databases and existing tools for analysis of genes and proteins.

    Algorithms on Strings, Trees and Sequences: This is a book for real computer science types who want to do high-performance implementations of new tools.

  3. Re:statistical approaches by babbage · · Score: 3, Informative
    Well, of course loading data into a DBMS is the ideal here; it's the loading of the data into one that's the tricky part :)

    Generally, a lot of biological data is publicly available from sources such as NCBI (the US National Center for Biotechnology Information) and EMBL (the European Molecular Biology Laboratory), but it could be coming in as SQL statements ready for loading into your database, CSV or TSV files, any of several annoyingly flexible standard biological data exchange formats, or worst of all something like an Excel spreadsheet or just scraped from a web page somewhere. There is way too much of this stuff to pump it all into your local storage system by hand, so you need something like Perl that can munge it into an intermediate format that can be loaded properly. Once it's actually in there then yeah, you only revert to some sort of flat file system if you want to redistribute data.
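As a small illustration of that munging step, here is a sketch that turns tab-separated input into CSV while keeping only the columns a database table needs. It is in Python for illustration, though the comment is about doing this in Perl. The function name, column names, and sample records are all invented, and real biological exchange formats are considerably messier than this.

```python
import csv
import io


def tsv_to_csv(tsv_text, columns):
    """Convert tab-separated text to CSV, keeping only the named columns."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=columns,
        extrasaction="ignore",   # silently drop columns we did not ask for
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Invented sample records for illustration only.
raw = "gene_id\tsymbol\tnotes\nG001\tabcA\thas, comma\nG002\txyzB\t\n"

ids_and_symbols = tsv_to_csv(raw, ["gene_id", "symbol"])
ids_and_notes = tsv_to_csv(raw, ["gene_id", "notes"])
print(ids_and_notes)
```

Letting the csv module handle quoting matters here: the free-text note containing a comma is quoted automatically, which hand-rolled string joins tend to get wrong.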

    A related but more central problem is in looking for interesting patterns in these huge datasets once you have them locally, whether in flat files or a database or what have you. This is a huge area of research right now, because modern biological lab techniques can slurp up data extremely fast, we have the whole genome decoded but uninterpreted, etc, and now we need computational techniques that can chew through this fire hose of information efficiently.

    A lot of this seems to be unsolvable at the moment, because the algorithmic complexity is up there with the Travelling Salesman problem (e.g. protein folding), so every little bit that can chip away at the difficulty of it helps. Perl is good at this, and a lot of places are using it heavily right now. Being able to work with flat files is only one aspect of it; it just happens to be a useful one to teach with, which is why it was used so heavily in the book, but in actual use the applications of Perl go way beyond simple file manipulation.

  4. Re:Not all biologists are doing genomics! by SloppyElvis · · Score: 2, Informative

    From Gray's Lab Dictionary on medical sciences:

    Bioinformatics: The use of computers in solving information problems in the life sciences.

    This says nothing about bioinformatics being used solely for genomics, though I hear your gripe, as many think of the two as the same. No doubt, this author has made the same assumption. I speculate it has something to do with money, since genomics is a "hot topic". The point is, you may be a bioinformatician and not even know it.

  5. Re:statistical approaches by Marcus+Brody · · Score: 4, Informative

    why would you want to use Perl over a flat file data set

    Good Question. Answer is yes and no.
    Flat Files are really quite useful in biology (btw, when a biologist mentions a "database", he almost certainly means a "flatfile"). DNA/RNA/Proteins are just a long sequence of letters, and therefore these are perfectly represented by good ol' ASCII. This is particularly useful for means of distribution etc. When annotations are added to the data, they are traditionally added to the flatfile by way of an "annotation table", to keep the simple ease of ASCII.

    However, more advanced ways are used to store annotations of biological data, although traditional databases aren't always that good at expressing the rather messy randomness of biology ;-) Therefore, specialised databases such as acedb are quite useful and intuitive to the biological mind. Furthermore, projects such as ensembl (which ambitiously attempts annotations on the whole genome) store their data in an SQL database. However, they still make extensive use of perl to interact with the database.

  6. Re:Flashbacks by glwtta · · Score: 4, Informative

    I've worked in bioinformatics for the last few years, and I can say that there's a bit of a difference between bioinf and perl, and engineering and fortran - perl is suited for bioinformatics far, FAR better than any other language. And so far the benefits of modern languages just can't seem to outweigh this innate suitability.

    Traditionally almost all bioinformatics tools have been done in perl, and they continue to be so, for one very simple reason - bioinformatics, when it comes down to it, is just plain text processing.

    Anyway, about the book itself - it's nice for biologists who want to learn something about programming, but I neither learned much about biology from it, nor am I afraid I will lose my job because all the bio people are gonna start doing their own programming :)

    --
    sic transit gloria mundi
  7. Re:Why a scripting language? by babbage · · Score: 3, Informative
    Why do scientists gravitate to these scripting languages?

    For the same reasons that people gravitated to them for internet programming: there is so much ad hoc work to be done that it isn't worth the effort to work "that close to the metal". Perl's text analysis capabilities are so sophisticated that it would be hard to match them with custom written C code -- and if you did manage to pull it off without getting ensnared in infuriating memory leaks and so on, a well designed system will end up approaching Perl anyway. Yeah, Python is well suited towards modularizing systems and reworking bottleneck components in something like C, but Python just isn't as slick at text analysis as Perl is, and this kind of genetic/proteomic work is essentially a text analysis problem.

    I mean, look at it the other way around -- Perl isn't actually that hideous if you avoid all the stupid features, and you can do the development 50 times faster. If it really runs that slowly -- and usually the execution time won't be a problem -- then sure, redo parts in C (or XS), but 99% of the time that really doesn't help very much.

  8. Re:Why a scripting language? by scottcain · · Score: 2, Informative

    It's easy to explain really: text manipulation. Bioinformatics is really about moving text around. What are DNA and protein sequences? Text. What are the reports generated by the plethora of analysis programs? Text. And Perl has outstanding and easy to use text manipulation tools. Add to that CPAN and BioPerl, and you have the makings of excellent Bioinformatics tools.

  9. Perl and Bioinformatics by fasta · · Score: 5, Informative

    I would like to answer several questions that were raised in this discussion.

    (1) How does a CS person learn biology? I recommend "Recombinant DNA, A short Course", as an accessible (Scientific American style) introduction to the cloning breakthroughs and discoveries that led to genome science.

    (2) How does a CS person learn "Bioinformatics"? I strongly recommend "Bioinformatics - Sequence and Genome Analysis" by David Mount as an accessible and extremely comprehensive survey of current approaches in Biological Sequence Analysis.

    (3) Why do Biologists use Perl? Much of the information Biologists want is on the WWW, and Perl's LWP makes it extremely easy to get it. We don't use Perl for sophisticated text analysis (similarity searching, motif searching, etc) because the algorithms that are appropriate are typically not exact (or even regular expression) matches. But it's difficult to beat Perl for getting stuff off the WWW.

    (4) Why do Biologists use Flat files? Several reasons: (a) the most useful information is sequence information, and it can be read much more quickly out of a flatfile (esp. one that is memory mapped) than a DB; (b) flat files solve some versioning problems that DBs make very complex and slow; (c) most data providers only provide flatfiles. This will change, however, over the next 2 - 3 years, as MySQL and PostgreSQL are moving into biology labs.

    It is very exciting that Bioinformatics has high visibility now, and many people with CS background are considering bioinformatics problems. Unfortunately, many of the introductory books on bioinformatics (particularly the O'Reilly books) do not adequately present the substantial foundations of bioinformatics that have been built over the past 15 - 20 years, and some newcomers are misled into believing there are simple problems looking for a few good programmers. Most of the simple problems have been solved; many of the complicated problems are challenging not because we do not know enough CS, but because we do not know enough biology.

  10. Another language for bioinformatics by Jon+Howard · · Score: 2, Informative

    Since I'm a Lisp fiend: while we're on the subject of programming for bioinformatics, I'd like to point out that Allegro Common Lisp has been used by a few folks in the field. Here are two links:

    Pangea Systems Inc. (now DoubleTwist) for EcoCyc.

    MDL Information Systems to design new drugs.

  11. PubMed Books online by NullSpaceKid · · Score: 2, Informative

    A selection of possibly relevant books (_Introduction to Genetic Analysis_, _Molecular Cell Biology_, etc) can be found at: www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books

    NSK

  12. bioinformatics does not equal string manipulation by Anonymous Coward · · Score: 1, Informative

    It seems that perl is still being used purely because many bioinformatics departments are full of people who know how to program in perl. And this is because bioinformatics *used* to be pretty much only about string manipulation.
    This is just not true any more - proteomics require in silico trypsin digest and algorithms for protein identification for MALDI mass spec (prediction of protein sequence via analysis of digested protein fragments); microarray experiments require cluster analysis of expression data in order to identify functional relationships. Added to this there are lots of issues relating to integrating the many many databases there are out there.
    The systems are becoming bigger and have to deal with lots of other systems around the world. Is Perl the best language for all this? I don't know but languages shouldn't be pushed into unsuitable roles purely for historical reasons and lots of bioinformaticians are trying to do this by trying to cling onto perl.

    martin
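For readers unfamiliar with the in silico trypsin digest mentioned above, here is a minimal Python sketch. It uses the common simplified rule that trypsin cuts after lysine (K) or arginine (R) unless the next residue is proline (P). The function name and the sample peptide are made up, and real tools also handle missed cleavages and compute fragment masses, which this sketch ignores.

```python
def trypsin_digest(protein):
    """Split a protein after each K or R that is not followed by P (simplified rule)."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1:i + 2]  # empty string at the end of the chain
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])     # trailing fragment with no cleavage site
    return peptides


# Made-up sequence: the R at position 6 is followed by P, so no cut happens there.
fragments = trypsin_digest("MKWVTFRPLLKAG")
print(fragments)
```

The digest itself is still string manipulation. The harder parts the comment points to, such as matching predicted fragments against mass spectra, are numerical work that comes after this step.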

  13. Re:Why a scripting language? by jslag · · Score: 2, Informative

    scripting languages avoid several common things that non-programmers usually have a hard time with:

    * Variable declarations


    Actually, most perl programs more than a few lines long (hopefully) use strict; thus requiring variable declarations.

    * Memory allocation

    Seems like plenty of programmers have trouble with this as well, based on the number of memory leaks out in the wild.

    Really, why scripting languages?
    Why not? Hardware is fast and cheap compared to programmer time, so slightly slower (but written!) programs are often better than super-optimized programs that are only half done.

    Scripting languages aren't necessarily slower, anyhow. Perl programs, for example, tend to do all their heavy lifting in libraries, with performance-critical parts coded in C. If you're into benchmarks, you can dig some up showing perl outpacing java and c++ at various text-processing tasks.