Genetic Database Hits One Billion Entries

Posted by ScuttleMonkey on Tuesday January 17, 2006 @02:39PM from the start-the-recount dept.

ChocSnorfler writes to tell us that the Sanger Institute is reporting that their Genetic Record Database has hit one billion entries, making it the world's largest. From the announcement: "The Trace Archive is a store of all the sequence data produced and published by the world scientific community, including the Sanger Institute's own prodigious output as a world-leading genomics institution. To grasp how much data is in the Archive, if it were printed out as a single line of text, it would stretch around the world more than 250 times. Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest. The Archive is 22 Terabytes in size and doubling every ten months."
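To put the quoted figures in perspective, here is a rough back-of-the-envelope sketch. It assumes the 22 Terabytes are decimal terabytes and that the ten-month doubling continues unchanged, neither of which the announcement states; the function names are made up for illustration.

```python
import math

ARCHIVE_TB = 22.0        # size quoted in the announcement
DOUBLING_MONTHS = 10.0   # doubling period quoted in the announcement
ENTRIES = 1_000_000_000  # one billion entries


def projected_size_tb(months_from_now: float) -> float:
    """Archive size after the given number of months, assuming steady doubling."""
    return ARCHIVE_TB * 2 ** (months_from_now / DOUBLING_MONTHS)


def months_until(target_tb: float) -> float:
    """Months needed to reach target_tb under the same assumption."""
    return DOUBLING_MONTHS * math.log2(target_tb / ARCHIVE_TB)


def average_bytes_per_entry() -> float:
    """Mean size of one entry, assuming decimal terabytes (1 TB = 1e12 bytes)."""
    return ARCHIVE_TB * 1e12 / ENTRIES


if __name__ == "__main__":
    print(f"in 30 months: {projected_size_tb(30):.0f} TB")
    print(f"months to reach 1000 TB: {months_until(1000):.1f}")
    print(f"average entry: {average_bytes_per_entry():.0f} bytes")
```

On these assumptions an average entry is on the order of tens of kilobytes, far more than a raw base sequence of typical read length would need.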

I will be more impressed... by Stachybotris · 2006-01-17 15:00 · Score: 5, Informative

When we figure out what all of that does. For every organism as complex as, or more complex than, your average bacterium, there's a large amount of what amounts to filler DNA. Viruses don't have this problem, as few of them are large enough to even get by without overlapping reading frames. If you shrink this dataset down to only sequences that encode functional proteins (read: genes), there's still an insane amount of information. If you then remove the introns, the dataset gets even smaller. But of course, we don't really know if the introns and intergenic regions of DNA (the so-called 'junk DNA') have functions (or how many they have), although some do act as regulators of transcription.

Given that a change of just 1 base in 500 of the 16S rRNA gene is sufficient to differentiate between two different species of bacteria, I have to wonder how many of these entries are quasi-redundant. When you consider how many species of bacteria are known to man, that means that there are literally thousands of potential entries for each gene. Unless, of course, they're storing only consensus sequences, which still vary widely between genera.
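For scale, 1 base in 500 is 99.8 % identity. A minimal sketch of that comparison, assuming the two sequences are already aligned and of equal length (real 16S comparisons need a proper alignment step first); the sequences here are synthetic placeholders, not real 16S data.

```python
def mismatches(seq_a: str, seq_b: str) -> int:
    """Count differing positions between two pre-aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which the two sequences agree."""
    if not seq_a:
        raise ValueError("sequences must not be empty")
    return 100.0 * (len(seq_a) - mismatches(seq_a, seq_b)) / len(seq_a)


if __name__ == "__main__":
    reference = "ACGT" * 125          # synthetic 500-base stand-in
    variant = "T" + reference[1:]     # a single substitution at position 0
    print(mismatches(reference, variant), percent_identity(reference, variant))
```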

Sadly, the trend here seems to be more of 'sequence it, upload it, and patent it' instead of 'sequence it, upload it, figure out what it does/makes, do something useful with it'. Knowing the sequence for the Ubiquitin gene is all well and good, but it's of little practical importance. Being able to construct designer proteins to treat illnesses based on that information, however, is a truly worthy goal. Unfortunately, that's also where the 'patent it' part comes into play...
1. Re:I will be more impressed... by floWing · 2006-01-17 21:49 · Score: 3, Informative
  
  First of all I want to point out that so-called "junk DNA" has proven to be a very bad way of thinking about introns and other untranslated regions, i.e. regions of DNA which are not used to create proteins in the regular way (transcribed to mRNA [messenger RNA], then translated to protein), such as UTRs (the untranslated regions around protein-coding regions). Most scientists will agree nowadays that there is _a lot_ of information in these non-exonic regions, the most prominent example to date being microRNA - small RNA pieces from intronic and untranslated regions - affecting the cell machinery, for example by silencing protein translation from existing mRNAs.
  As for the figure of 1 billion sequence records, it is far less impressive once you start removing redundant entries, and more than half of these entries originate from so-called ESTs (expressed sequence tags), meaning DNA regions [exonic regions] which are transcribed to mRNA. Since exons constitute only a minority of the genomes of higher organisms, these entries cover less than 5 % of the complete genome. Redundancies might not even be discernible because of the high error rates of most "quick-and-dirty" sequencing methods, ranging up to several percent of erroneous bases. Another _big_ problem is the sequencing of highly repetitive regions of the genome, as current sequencing procedures allow strands to be sequenced up to a length of approx. 1 kb (1000 bases), not much more [the error rate grows intolerably high when sequencing anything significantly longer than this]. But repetitive DNA regions can often run on for more than this length, so we are still not able to "close the gaps" and cannot say where these pieces belong (although excellent scientists are working on exactly this tough problem using so-called "whole genome assemblers").
  In conclusion, I would not be astonished to see that less than 10 % (perhaps far less) of these billion records actually contain original information. So, if you want to stick to the hype, you are free to do so, but it's about hype, not facts.
Re:How do they map their function? by AlanKilian · 2006-01-17 15:04 · Score: 5, Informative

From: http://www.learner.org/channel/courses/biology/textbook/genom/genom_7.html

A biological approach to determining the function of a gene is to create a mutation and then observe the effect of the mutation on the organism. This is called a knockout study. While it is not ethical to create knockout mutants in humans, many such mutants are already known, especially those that cause disease. One advantage of having a genome sequence is that it greatly facilitates the identification of genes in which mutations lead to a particular disease.

The mouse, where one can make and characterize knockout mutants, is an excellent model system for studying genetic diseases of humans; its genome is remarkably similar to a human's. Nearly all human genes have homologs in mice, and large regions of the chromosomes are very well conserved between the two species. In fact, human chromosomes can be (figuratively) cut into about 150 pieces, mixed and matched, and then reassembled into the 21 chromosomes of a mouse. Thus, it is possible to create mutants in mice to determine the probable function of the same genes in humans. Genetic stocks of mutant mice have been developed and maintained since the 1940s.

One goal of the mouse genome project is to make and characterize mutations in order to determine the function of every mouse gene. After a particular gene mutation has been linked to a particular disorder, the normal function of the gene may be determined. An example of this approach is the mutated gene that resulted in cleft palates in mice. The researchers found that the gene's normal function is to close the embryo's palate. An understanding of the genetics behind cleft palate in mice may one day be used to help prevent this common birth defect in humans.
Re:How do they map their function? by Stachybotris · 2006-01-17 15:11 · Score: 5, Informative

In most cases they work backwards. You start with a known protein, determine its amino acid sequence, and then convert that into the most likely DNA sequence (accounting for codon bias). Primers/probes are then generated for the 3' and 5' ends of the probable DNA sequence. If you're working with a small genome like that of a bacterium, you can perform a restriction digest to get random hunks of chromosome. These are then amplified via PCR using your designer primers. The final product is then sequenced.
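The "work backwards" step can be sketched roughly as follows. This is only an illustration: the codon table holds one hypothetical preferred codon per amino acid (valid codons, but not a measured usage table for any organism), and `back_translate`, `reverse_complement` and `primer_pair` are made-up helper names. Real primer design also considers melting temperature, degeneracy and so on.

```python
# One illustrative "preferred" codon per amino acid (one-letter codes).
# Assumption: real codon usage is organism-specific and must be looked up.
PREFERRED_CODON = {
    "A": "GCG", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGC", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCG",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAT", "V": "GTG",
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def back_translate(protein: str) -> str:
    """Guess a coding DNA sequence by picking the preferred codon per residue."""
    try:
        return "".join(PREFERRED_CODON[aa] for aa in protein.upper())
    except KeyError as exc:
        raise ValueError(f"no codon listed for residue {exc.args[0]!r}") from None


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def primer_pair(dna: str, length: int = 18) -> tuple[str, str]:
    """Forward primer from the 5' end, reverse primer from the 3' end."""
    if length <= 0 or length > len(dna):
        raise ValueError("primer length must be between 1 and len(dna)")
    return dna[:length], reverse_complement(dna[-length:])


if __name__ == "__main__":
    guess = back_translate("MKLVAG")
    print(guess, primer_pair(guess, 9))
```

Because most amino acids have several codons, the guess is only the most likely sequence under the chosen table, which is why the amplified product still has to be sequenced.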

In other cases you can create a gene knockout by splicing a random gene into your gene of interest. This causes your target gene to encode a non-functional protein. Then you watch and see what happens to the test subject. In some cases the creature dies because the gene turned out to be extremely important. In others it results in minor to significant impairment. But because of the complexity of most organisms, single-gene knockouts usually don't have too much effect - the creature has multiple pathways that can accomplish the same goal. This is especially true for critical functions like those in the immune system.
Here's your standard by TubeSteak · 2006-01-17 16:06 · Score: 2, Informative

I'm gonna assume 12 point, single spaced with inch (or inch and a half) margins is pretty standard fare.

And by standard, I mean: whatever MS Office defaults to

Diana Hacker's "A Writer's Reference" says the same thing.
/I'm not a grammar Nazi, I was forced to purchase it many years ago and have kept it handy ever since.

--
[Fuck Beta]
o0t!
Re:Dubious claims by RedWizzard · 2006-01-17 16:13 · Score: 2, Informative

Such claims should be taken with a grain of salt until they reveal what fonts and point sizes they use.
It's just meaningless reporter-speak. A stupid attempt to provide context for readers who can't visualise that much data. Of course, I doubt many such readers have a good concept of the circumference of the world or the height of Mt Everest either.
I actually have my master's thesis on a single sheet of A4. I had to use a 1.5 point font to make it fit. You could still read it though.
DNA sequence makes up only a small proportion. by cerebis · 2006-01-17 22:42 · Score: 3, Informative

As this is a trace archive, it stores not just the DNA sequence (ACGT) but also the signal data produced by the machines used in these experiments, which is used to determine the DNA sequence (or basecall).
The signal data is composed of peaks and troughs across 4 channels, corresponding to the 4 base types. A peak in a channel corresponds to a base of that type passing in front of the detector. In your typical sampling configuration, a peak is made up of about 12 data pts.
Now, since each sampled point in the signal is stored as a 4-byte int and the base for that peak is stored as a 1-byte char, you've got basically a 192:1 ratio (12 points x 4 channels x 4 bytes per called base) of technically superfluous signal data to actual DNA sequence.
Since there are yet other pieces of information in the file, this ratio is actually larger.
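The arithmetic behind that ratio, as a small sketch. The constants are the ones given above; the function names and the 1000-base example read are assumptions for illustration, and file headers and other per-file metadata are ignored.

```python
POINTS_PER_PEAK = 12     # sampled points per peak, as described above
CHANNELS = 4             # one signal channel per base type (A, C, G, T)
BYTES_PER_SAMPLE = 4     # each sampled point stored as a 4-byte int
BYTES_PER_BASECALL = 1   # each called base stored as a 1-byte char


def signal_bytes_per_base() -> int:
    """Bytes of signal data stored for every called base."""
    return POINTS_PER_PEAK * CHANNELS * BYTES_PER_SAMPLE


def signal_to_sequence_ratio() -> float:
    """Ratio of signal bytes to basecall bytes."""
    return signal_bytes_per_base() / BYTES_PER_BASECALL


def trace_size_bytes(read_length: int) -> int:
    """Signal plus basecall bytes for one read, ignoring all other file content."""
    return read_length * (signal_bytes_per_base() + BYTES_PER_BASECALL)


if __name__ == "__main__":
    print(signal_bytes_per_base(), signal_to_sequence_ratio())
    print(trace_size_bytes(1000))  # hypothetical 1000-base read
```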
Of course, there is a good reason for keeping trace data rather than just the DNA sequences, the notion being that you have more information with which to validate the integrity of what you've done. There have been cases where scientific databases have had their data integrity damaged over time by low-quality (i.e. mistaken) submissions.
In this case, they're retaining the wrong file type, as it doesn't store the original unfiltered data signal, only a heavily filtered and manipulated one. Most modern basecallers start from the original unfiltered data to gain more advantage through better processing; you cannot do this with the file type they are retaining.