Bioinformatics
tadghin pointed out this Newsweek article on bioinformatics, and also notes: "At O'Reilly, we just published our first bioinformatics book last week, Learning Bioinformatics Computer Skills, by Cynthia Gibas and Per Jambeck, and it immediately rocketed to the top of the Amazon Computer bestseller list. This definitely appears to be a new area for the computer industry that's just starting to hit people's radar big time. I've also made the point to VCs looking at distributed computation startups that what I see on sites like slashdot is a lot of movement by hackers towards new and interesting problems. And science looks a lot more interesting than some of the business computing that's been front and center the past couple of years. And the Biological Open Source Computing Conference I spoke at last year was definitely popping with ideas and excitement. Unfortunately, this year's conference is in Copenhagen, right before the O'Reilly open source convention, but I definitely urge slashdotters to check out this area. Demand for perl expertise is especially high."
There are two factors that I think are driving the emergence of bioinformatics: culture and data explosion.
.com's use: big iron data warehouses running heavy duty RDBMS's like oracle, DB2. Nope. I have yet to come across a single bioinformatics project that has a clue about data modelling. It's actually much above average to use a database at all, let alone well. If I was head of the NIH, you can bet that Freshmen biologists would take a class in SQL starting immediately.
When I was in college, the computer science majors "hung out" with the math majors, the physics majors, and the electrical engineering majors. Biologists hung out with the less analytical crowd. Obviously these are generalizations, but I believe a lot of "the problem" is that culturally biologists just don't have very good computer skills. Suddenly it is the case that biology as a science absolutely requires these skills. If you were one of the few (and some do exist) that broke the stereotype, you need to be starting a company about now. Otherwise the race is on for the biologists to learn programming and the CS-math-physics types to learn biology.
Second is the fact that biologists are drowning in data. Projects like the human genome project are producing lots of data, but thats just the tip of the iceberg. There is already an exploding market in high throughput assays and measurement computation. The result is that the field as a whole simply isn't managing it's data well. Often groups store there data in extremely crappy formats. Custom text formats, asn.1, etc... I'm an Oracle programmer, so I expect the kind of solutions that Banks and
When you combine the two factors: culture and data innundation, very strange things start to happen. The data infrastructure just isn't there and worse a lot of people just don't realize it. Biology is presenting problems that require massive data warehousing solutions to a field whose main data background is calculating p-values to show the effect of a drug is significant.
Stripped of header and gzipped, I get 366 bytes, X4 is 1464 nucleotides.
The probability of any two random sequences of the same length being equal is the inverse of the number of expressible sequences of that length. In this case, it is 4^length.
When you are looking for a random sequence (of length N) within a longer sequence (length M), the probability of finding it is the above probability multiplied by M-N (the same chance over and over again for every sub-sequence of length N, assuming you don't count wrapping substrings).
So N=1464, and M equals roughly 3 billion. So the probability is:
(3*10^9-1464)/4^1464
Which is in the neighborhood of one to a squared googol odds.
Of course, that assumes random data, but I figure it's a good enough approximation.
Don't knock yourself out looking for it. It's not there.
--