As someone who has both written and read _a lot_ of Perl in bioinformatics, in particular in Bioperl and Ensembl, I have a rather love/hate relationship with Perl.
I love: the low learning curve for people coming from biology, with a lot of forgiving behaviour (in particular, I think, the auto-creation of data structures as you use notation to fill in complex anonymous - think pointer-based - structures). This is probably the critical one: it means we can hire a much broader group of people with a much better understanding of biology, and they can be productive far earlier.
I love: the large and robust libraries accessing nearly every sort of database, web-app and other things you need
I love: the consistency of behaviour between systems (don't get me started on Java or porting C++ code between compilers/library systems. Ugh! Unbelievable pain as one starts using those languages and moves between high-end systems. It's C for the fast stuff and Perl for anything else for portability in my book).
I love/hate: The (huge) amount of robust existing Perl code that we have in Ensembl and that works day in, day out on multiple outings
I hate: The lack of clean objects. Why, oh why, oh why?
I hate: The inability to switch on strong typing and bigger checking optionally in libraries - I know you can do more these days, but it is still clunky.
I hate: switching the word "continue" (in C) to "next" (it gets me every time)
I hate: having to always brace if statements
I hate: operators designed for one-liners that get in the way of good readable code - grep and map in complex lines are a pet hate of mine.
I hate: the tortuous cross-language capabilities - compare Python's Jython and the C-level compilers. Soooo much better.
Interestingly I coded in Python for about 6 months in the late 90s - very early on in Python - and lots about Python appeals to me. But then Perl came along, and lots of bioinformaticians were using it, and systems people were installing it by default on systems...
Roll on Parrot. I want Parrot to be able to run Perl5 syntax code, Perl6 and Python/Java syntax all together, with easy ways to load in C level or compiled down libraries. That's what Perl needs to save it.
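To make the auto-creation point above concrete: the sketch below is a rough Python analogue, not Perl, and the gene name and coordinates are made up for illustration. Perl creates the intermediate hashes for you when you assign into a nested structure; one way to approximate that in Python is a recursive defaultdict.

```python
from collections import defaultdict


def autoviv():
    """Return a dict whose missing keys spring into existence as nested dicts."""
    return defaultdict(autoviv)


# Hypothetical annotation data; the gene name and coordinates are invented.
genes = autoviv()
genes["geneA"]["exons"][1]["start"] = 100
genes["geneA"]["exons"][1]["end"] = 250
genes["geneA"]["strand"] = "+"

exon_length = genes["geneA"]["exons"][1]["end"] - genes["geneA"]["exons"][1]["start"]
print(exon_length)
```

The forgiving behaviour cuts both ways: merely reading a missing key here also creates it, which is much the same trap as in Perl.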
Also, don't forget that each person has two haplotypes, one from each parent, so when one sequences a person, one captures the variation on two human genomes at once.
Of course, this all relies on the coverage you sequence at, and one option for the 1,000 genomes project is doing this at low (2x?) coverage, using pretty sophisticated methods to combine statistical power between sample datasets.
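A rough way to see why coverage matters, assuming reads land uniformly at random so that the depth at a site is Poisson distributed. This is an idealised model I am assuming for the example, not necessarily what the project uses, and the function names are my own.

```python
import math


def fraction_covered(coverage):
    """Chance a site is hit by at least one read when depth ~ Poisson(coverage)."""
    return 1.0 - math.exp(-coverage)


def both_haplotypes_seen(coverage):
    """Chance both parental copies are sampled at least once at a site.

    Assumes reads split evenly between the two haplotypes, so each copy
    has depth ~ Poisson(coverage / 2), independently of the other.
    """
    per_haplotype = 1.0 - math.exp(-coverage / 2.0)
    return per_haplotype ** 2


for depth in (2, 4, 30):
    print(depth, round(fraction_covered(depth), 3), round(both_haplotypes_seen(depth), 3))
```

Under these assumptions a single 2x sample sees both copies at well under half of its sites, which is why combining statistical power across samples is needed.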
The "1,000" though is more a round number that is in the right range. It might well be 1346 people or something like that (often some multiple of 96, as 96, or 4*96 = 384, is the standard size of a molecular biology "tray" put into a robotic system).
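The plate arithmetic can be sketched like this. The 1346 figure is just the example number from above, and the helper names are made up.

```python
import math

PLATE_96 = 96
PLATE_384 = 4 * PLATE_96  # four 96-well layouts on one plate


def plates_needed(samples, plate_size=PLATE_96):
    """Number of plates required to hold every sample (last one may be part empty)."""
    return math.ceil(samples / plate_size)


def nearest_full_plates(samples, plate_size=PLATE_96):
    """Closest sample count that fills a whole number of plates."""
    return round(samples / plate_size) * plate_size


print(plates_needed(1346), nearest_full_plates(1346), nearest_full_plates(1000))
```

So 1346 samples would leave most of a fifteenth 96-well plate empty, while 1344 fills 14 plates exactly.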
This is probably why you need a consultancy firm (dare I say it... IBM or someone) to show you what is going on. If I had time... I'd be happy to show you what is going on.
Raw open source generally only appeals to people who are confident about what they want and understand the IT problem correctly. Then you can get this stuff for free, off the net, and set things up for just the cost of the time of the guys who install it. And generally it is far more stable than any "commercial" solution.
But, in the absence of someone like that in your department, ring up IBM or RedHat (or hopefully they will see your post here, and some salesman will give you a call). You'll have to spend money at some point, but your total cost will be waaaay lower than a heavily marketed (presumably M$oft) "solution".
Don't dismiss open source straight out because the raw software doesn't come with a fancy brochure.... that's a sign of strength...
(if you would like some more pointers, I can help you out. But... looking at your web page, you seem to have a high comfort level with MS stuff, so I think it would be slightly pointless unless you really want to learn stuff.
At some point you will be using open source directly - you already do indirectly via web sites and email - so you might as well build your skill set up sooner rather than later)
The idea that people would actively get into a swimming pool and put on a helmet to answer a work phone call. The mental image... is quite worrying in some cases.
Though I find the best thing about working from home is that people don't have my phone number here, so... no one calls me. And I go to no meetings. Magical.
There is a comment somewhere down here which is
really that no one knows how to convert a whole bunch of ESTs hitting the genome into genes. The EST data is *very* messy. We've looked at this recently inside Ensembl and don't see a big win from confidently placed ESTs. Our opinion is that the Ohio State thang is just somewhat enthusiastic researchers getting good PR for their work.
Check out http://www.ensembl.org/ for the more sober-headed view of this.
This annoys me. Slashdot are really happy to pander to the PR that these sorts of companies have but consistently turn down interesting stories about how we are trying to make the human genome open and accessible for all, in projects like Ensembl. What are these guys really going to do with this? Probably nothing. They don't look like they know what they are doing. And yet they get posted to Slashdot.
I wish Slashdot was more interested in the real science of the genome and less PR orientated. Slashdot ain't what it used to be...
Yo Chris - thanks for the tag. I always feel that the signal to noise discussions on Slashdot are pretty skewed. Who knows how this is all going to pan out.
I have to admit I think we have done pretty well with the latest bioperl. Kudos to you as well, Chris...
It always amuses me how clueless Slashdot as a group generally is about these things.... Despite best efforts otherwise. It comes up as an "Ask Slashdot" related question regularly; Slashdot posts pseudo-science stories or op-eds about cloning etc, and yet... Slashdot hasn't attempted to *contact the actual scientists* involved to get their opinion.
Yes - I have suggested this as an interview topic a number of times. Slashdot editorials are more interested in "wow-science" stories than real science. It annoys me. (but I still read slashdot).
I have long been a UK -> US road traveller, and bizarrely the best thing to do is track down a Kinko's - Kinko's offer reasonable (still pretty steep) cybercafe-type access, but they are everywhere (even in Knoxville, Tennessee, for example).
I never tried to get dhcp into an ethernet port. I don't think they offered it then (this summer). But you never know - if enough of us ask;)
I have worked (closely) with two female programmers. One was an ex-physicist with strong java/perl/c skills and the other was a c programmer who used to code asynchronous signalling stuff.
Both were/are great. And we get on well. And the work is good.
It confuses me why there are not so many girls in the industry but I guess the best way to solve it (like most things) is just to live to your ideals. So - I don't worry about the sex/age/culture/race of the people I work with and that seems good enough for me...
I think talking about it helps air some issues but doesn't really change much.
Thanks troc - just got around to reading this comment.
I have sort of appealed at the top to people to come along. People seem more interested in writing about patents than getting down to nuts and bolts of course.... ;)
If there is anyone out there who would like to do this coding, please do, as I sure as hell don't know how to do it ;). But I know what to run...
It is clear from these postings that people would like the client to run. If there are people with experience in writing these sorts of d.net systems then please drop me a note. We have the problem for you to work on - it is just a question of figuring out how to do it.
Hardware at the moment is generally clusters of Alpha boxes or Intel boxes (running Tru64 or Linux respectively).
The two big drainers on CPU for analysis are gene prediction (genscan) and database searching (blast). Database searching can't be distributed easily as you have to worry about the database ;)
However, there are programs like sim4, genewise and est2genome that could greatly help us and could be distributed.
Genewise - you can download it (I wrote it) at Wise2. est2genome is somewhere around as well.
For the more general overview of the problem - check out ensembl for an idea of the project.
I assume that the original poster did not understand what was going on ;). Like a lot of Slashdot in this case - concerned but not knowledgeable.
Celera always talk about the assembly problem as they have Gene Myers solving it (he has) and think it is pretty cool. It is not trivial, but from my view (an annotation-centric view) not the most important thing.
This is only for the assembly and not for the analysis. With analysis you have a better data/cycles ratio. Assembly is done at the genome centres anyway...
Great that you were following the talk. I thought I put everyone to sleep
The rate-limiting step at the moment is effectively the mapping in fact, then sequencing. The interesting thing about the analysis is that the amount of CPU is unbounded. If we have more CPU we just use more accurate algorithms. We can do something within the CPU bounds on the Hinxton campus, but if anyone wants to give me a supercomputer, then we could get more accurate analysis.
Bioinformatics generally has a very good cycles to data ratio - i.e. we have algorithms that take a lot of cycles for very little data. So it is feasible...
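As a toy illustration of that cycles to data ratio, here is a minimal Smith-Waterman style local alignment scorer. It is a stand-in I picked for the example, not genewise or any of the other programs mentioned, and the scoring parameters are arbitrary. The input is just two short strings, but the work grows with the product of their lengths.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local alignment score, number of matrix cells computed).

    Linear gap penalty, scores only (no traceback), one row kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    cells = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
            cells += 1
        prev = curr
    return best, cells


print(smith_waterman_score("ACGT", "ACGT"))
```

Two 1 kb sequences are about 2 kB of data to ship to a client but a million cells to fill in, and more elaborate models only add to the work per cell.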
Does anyone want to write it? If so - I have a lot of CPU-hungry algorithms to run.
There are some good open source genome projects for doing this efficiently - and we do welcome help of any kind. Here are some open source projects which I know about/work on:
- ensembl is an open source genome project designed to get as much data and software into the public domain as possible
- EMBOSS
- bioperl
All these are well backed, strong open source projects with different strengths. Every time genome stuff comes up on Slashdot I try to point these things out to people, but everything gets lost in the noise about people $%!"'ing on about patents (generally without a lot of knowledge!).
Anyway - check out these projects for more information about real open source efforts in biology.
I could not submit a bug in the SourceForge bug report area (doh. Can't even submit a bug that the bug submission does not work!)
Had a dodgy certificate that explorer didn't like...
And the projects that are there seem to be focused on mirroring other projects.
Finally - could you/would you trust someone else to keep a server up 24-7 for your source code? My experience of projects is that they need more than cvs/mailing list. They need a coordinated web site and people close by to make it all work.
So. I am not moving from my work machine yet. But I guess this is the way things are going to go
Check out bioperl. In particular the new 0.6 series (just available via anonymous cvs). Bioperl is more up to date than readseq, and it is in your favourite language.
We're going to have a lot of fun at http://www.ensembl.org/ with this...
I like the article. The more of these sorts of articles that are around, the easier it is for people like me to make an impact.
BTW - on topic here somewhat - if you want to see
an open source genome management system, take
a trip over to
http://www.ensembl.org/
for your open source project
That is an impressive show of being pissed off
by CmdrTaco. I guess it got to him.
Stock Market is a non-story to me as well.
I don't think it should be commented on by slashdot either!
Here are some pointers:
The largest public sequencing center in the world
http://www.sanger.ac.uk/
The US biological information portal
http://www.ncbi.nlm.nih.gov/
The European biological information portal
http://www.ebi.ac.uk/
Some open source projects in this area:
(The bio* group.)
http://bio.perl.org/
http://www.biojava.org/
http://www.biopython.org/
http://www.bioxml.org/
Open source genome annotation project
http://www.ensembl.org/
ewanb
Join in with Ensembl and help us out. You would learn *a lot* of biology v. quickly ;)
Drop me a mail (birney@sanger.ac.uk).
There are aspects of the work which have a good data/cycles ratio (surprisingly) ;)
I would read about the subject before you pronounce...
Absolutely - see my reply to the post above yours.
I can always use more juice!
ewan
Bioperl at bio.perl.org