Beginning Perl for Bioinformatics

Posted by timothy on Tuesday January 29, 2002 @03:00AM from the listen-up-class dept.

babbage writes: "As the banner above the title of James Tisdall's Beginning Perl for Bioinformatics indicates, this book is 'an introduction to Perl for biologists.' What the banner doesn't mention is that it's also an introduction to biology and bioinformatics for Perl programmers, and it's also an introduction to both Perl *and* biology for people that have never really been exposed to either field. The author has clearly thought a lot about making one book to please these different audiences, and he has pulled it off nicely, in a way that manages to explain basic topics to people learning about each field for the first time while not coming off as condescending or slow-paced to those that might already have some exposure to it." Read on for the rest of his review.

Beginning Perl for Bioinformatics
author: James Tisdall
pages: 400
publisher: O'Reilly & Associates
rating: 8
reviewer: babbage
ISBN: 0-596-00080-4
summary: Well-balanced approach to applying Perl's sorting and analytical abilities to the field of bioinformatics.

Superficially, this book isn't all that different from a lot of introductory Perl books: the Perl material starts out with an overview of the language, followed by a crash course on installing Perl, writing programs, and running them. From there, it goes on to introduce all the various language constructs, from variables to statements to subroutines, that any programmer is going to have to get comfortable with. Pretty run of the mill so far. Tisdall starts with two interesting assumptions, though: [1] that the reader may have never written a computer program before, and so needs to learn how to engineer a robust application that will do its job efficiently and well, and [2] that the reader wants to know how to write programs that can solve a series of biological problems, specifically in genetics and proteomics.

As such, there is at least as much material about the problems that a biologist faces and the places she can go to get the data she needs as there is about the issues that a Perl programmer needs to be aware of. The author introduces the reader to the basics of DNA chemistry, the cellular processes that convert DNA to RNA and then proteins, and a little bit about how and why this is important to the biologist and what sorts of information would help a biologist's research. The main sources of public genetic data are noted, and the often confusing -- and huge -- datafiles that can be obtained from these sources are examined in detail.
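
To make the DNA-to-RNA-to-protein step concrete, here is a minimal sketch. It is not code from the book: it is written in Python rather than Perl, the function names `transcribe` and `translate` are hypothetical, and transcription is simplified to the coding-strand convention of swapping T for U. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, listed in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Simplified transcription: treat the input as the coding strand, T -> U."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Translate RNA codon by codon, stopping at the first stop codon."""
    dna = rna.upper().replace("U", "T")
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, `translate(transcribe("ATGGCCTAA"))` gives `"MA"`: methionine, alanine, then a stop codon.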

With the code he presents for solving these problems, Tisdall makes a point of not falling into the indecipherable-Perl trap: this is a useful language, well-suited to the essentially text-analysis problems that bioinformatics poses, and he doesn't want to encourage the kind of dense, obscure, idiomatic coding style that has given Perl an undeservedly bad reputation. Some of Perl's more esoteric constructs are useful, and they show up when they're needed, but they're left out when they would only serve to confuse the reader. This is a good decision.

Rather, the focus is on teaching readers how to solve biological problems with a carefully developed library of code that happens to leverage some of Perl's most useful properties. The result is pretty much a biologist's edition of Christiansen & Torkington's Perl Cookbook or Dave Cross' Data Munging With Perl. The author presents a series of issues that a working bioinformaticist might have to deal with daily -- parsing over BLAST, GenBank, and PDB files, finding relevant motifs in that parsed data, and preparing reports about all of it. If a bioinformaticist's job is to be able to report on interesting patterns from these various sources, then following the programming techniques that Tisdall explains in clear, easy-to-follow prose would be an excellent way to go about doing it.
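
As a rough illustration of the "find motifs, then report" pattern, here is a small sketch. It is not the book's code: it is in Python rather than Perl, the names `find_motif` and `motif_report` are hypothetical, and it assumes the sequences have already been parsed out of their source files into a name-to-sequence mapping.

```python
import re


def find_motif(sequence, motif):
    """Return 1-based start positions of every match, overlapping ones included.

    `motif` is a regular expression; wrapping it in a lookahead lets the
    search restart one character after each hit instead of after the match.
    """
    pattern = re.compile("(?=(" + motif + "))", re.IGNORECASE)
    return [match.start() + 1 for match in pattern.finditer(sequence)]


def motif_report(sequences, motif):
    """Build a one-line-per-sequence report of where the motif occurs."""
    lines = []
    for name, sequence in sequences.items():
        hits = find_motif(sequence, motif)
        where = ", ".join(str(pos) for pos in hits) if hits else "none"
        lines.append(f"{name}: {where}")
    return "\n".join(lines)
```

Calling `find_motif("GGGTATAAAGG", "TATA[AT]A")` returns `[4]`, and the report function simply repeats that search over each named sequence.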

And when I say "programming techniques," note that I'm not specifically mentioning Perl. The code in this book is clear and organized, and all programs are carefully decomposed into logical subroutines that are then packaged up into a library file that each later sample program gets to draw from. Each new program typically contains a main section of a dozen lines of code or fewer, followed by no more than two or three new subroutines, along with calls to routines written earlier and pulled in from the BeginPerlBioinfo.pm library that is built up as the book progresses. Each sample is typically preceded by a description of what it's trying to accomplish and followed by a detailed description of how it was done, as well as suggestions of other ways that might have worked or not worked.

This modular approach is fantastic -- too many Perl books seem to focus so heavily on the mechanics of getting short scripts to work that they lose sight of how to build up a suite of useful methods and, from those methods, to develop ever-more-sophisticated applications. It isn't quite object-oriented programming, but that's clearly where Tisdall is headed with these samples, and given a few more chapters he probably would have started formally wrapping some of this code into OO packages.

If I have a complaint with the book, in fact, it's that Tisdall doesn't go any further: everything is good, but it ends too soon. Seemingly important topics such as OO programming, XML, graphics (charts & GUIs), CGI, and DBI are mentioned only in passing, under "further topics" in the last chapter. I also have a feeling that some of the biology was shortchanged, and the book barely touches upon the statistical analysis that is probably a critical part of the advanced bioinformaticist's toolbox. I can understand wanting to keep the length of a beginner's book relatively short, and this was probably the right decision, but it would have been nice to see some of the earlier sample problems revisited in these new contexts by, for example, formally making an OO library, showing a sample program that provided a web interface to some of the methods already written, or presenting code that output results as XML or exchanged them with a database.

But these are minor quibbles, and if the reader is comfortable with the material up to this point, she shouldn't have a hard time figuring out how to go a step further and do these things alone. It's a solid book, and one that should be able to get people learning Perl, genetics, or both up to speed and working on real world problems quickly.



statistical approaches by ciole · 2002-01-29 03:12 · Score: 5, Insightful

I felt the same about the lack of statistical approaches. While this book is probably great for biologists just learning to write code, for coders entering the field (bioinformatics) it contains too little biology or math to be really educational. My opinion.

What I'd love would be a dissection of the construction of various motif analysis tools, critiquing various impl's of HMMs, really going into detail. This seems like a perfect complementary work to OSS, so I might even find one, someday...

I haven't read it myself but by Theodore Logan · 2002-01-29 03:15 · Score: 5, Insightful

I have a number of friends in the business who have read that book. In summary:
1) It is good for biologists who want to learn how to program
2) It is not good for programmers who want to learn biology
Obviously, my friends disagree with reviewer Babbage on this point. However, a quick look on Amazon reveals that most reviewers who found the book interesting are biologists with no programming experience instead of the other way round.

--
"If you think education is expensive, try ignorance" - Derek Bok

Not all biologists are doing genomics! by RevAaron · 2002-01-29 03:45 · Score: 4, Insightful

This book seems to equate biology with genomics/bioinformatics, when that is simply not the case. There are a fair number of scientists in the general school of biology who *are not* bioinformaticians. As someone who does computational ecology, I wouldn't get much help from this book, and I am a biologist. Sure, DNA is swell, but it won't tell us about the complex interactions between a number of populations of organisms and the environment in which they live; it doesn't provide strategies and formulas (or references to Perl modules?) that *other* kinds of biologists use. ...sigh.

--

Working toward a usable PDA environment in the spirit of Newton OS: Dynapad

1. Re:Not all biologists are doing genomics! by jfrumkin · 2002-01-29 04:58 · Score: 2, Insightful
  
  Agreed - I happen to work on a phylogenetic project, which heavily uses Perl and other Open Source technologies. I believe O'Reilly's other book, "Developing Bioinformatics Computer Skills", makes some mention of phylogeny, but it is rather limited, to be sure.
  
  On the other hand, my guess is most of the big money is in genomics at this point, so I can understand the heavy emphasis in that area at this time. Perhaps the increased attention given to this area will allow for increased interest in other biology-related arenas....
  
  --
  
  "What we have here, is a failure to communicate." - Cool Hand Luke

Re:As a biologist... by mfarah · 2002-01-29 04:02 · Score: 3, Insightful

Still I doubt whether Perl should be the language of choice due to it tending to be "write-only code". Maybe this book will change my mind though.

FWIW, in my personal experience, I find Perl to lend itself to some very obscure code, worthy of the IOCCC [*], just as easily as to extremely clear code - the latter, though, requires a disciplined programmer and some effort (not much, though) directed to that goal.

[*]: So, when will the first International Obfuscated Perl Code Contest come? Perl poetry is getting kinda old.

--
"Trust me - I know what I'm doing."
- Sledge Hammer

Re:For fun or for work? by babbage · 2002-01-29 04:12 · Score: 3, Insightful

I would say that it's a crash course in two linked fields, targeted at an audience of people looking for bioinformatics work who might be familiar with one or the other of these fields, but need to get up to speed on the other one quickly.
And I *do* think it does a good job at this -- I'm a Perl hacker that hasn't taken a biology class since my freshman year of high school (ten years ago, oy vey), but the genomics & proteomics covered in this book did bring me up to speed to the point where I understand the terminology and have a decent grasp of the computational issues involved in doing work in this field, as well as some techniques that can be applied to these issues. After reading this book, I read The Cartoon Guide to Genetics by Larry Gonick -- it's a better introduction to the field than you might expect from a title like that -- and felt satisfied that I had already been exposed to 95% of the material in there, with a significant portion of that coming from this book (and O'Reilly's other bioinformatics book, and skimming over web sites).
No, it isn't a master's degree by a long shot, but it's a solid start at learning the field, if I choose to follow it that far. And it is enough of a crash course to land you a job, if you feel comfortable with the Perl stuff. You might not be expected to understand all the subtleties of DNA and proteins on your first day on the job, but you will at least come in knowing what your colleagues are talking about, and you'll be able to begin working with it immediately.
Give it a chance, it's a good book for starting out with. Yes, there's more to learn -- I understand that James Tisdall is doing a followup that'll be more like a "Perl-Bioinformatics Cookbook" for more advanced users, and there are of course other books out there besides the O'Reilly stuff -- but it's a worthwhile & solid start.

--
DO NOT LEAVE IT IS NOT REAL