Slashdot Mirror


Genetic Database Hits One Billion Entries

ChocSnorfler writes to tell us that the Sanger Institute is reporting that their Genetic Record Database has hit one billion entries, making it the world's largest. From the announcement: "The Trace Archive is a store of all the sequence data produced and published by the world scientific community, including the Sanger Institute's own prodigious output as a world-leading genomics institution. To grasp how much data is in the Archive, if it were printed out as a single line of text, it would stretch around the world more than 250 times. Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest. The Archive is 22 Terabytes in size and doubling every ten months."

189 comments

  1. w00t! Opensource genetics! by themysteryman73 · · Score: 2, Funny

    genetic information of organisms - mice, fish, flies, bacteria and, of course, humans... All the data are freely available to the world scientific community (http://trace.ensembl.org/) Sweet, now I can finally build myself that fleet of flying super monkeys I've always wanted!

  2. For God's sake, don't print it! by BadAnalogyGuy · · Score: 5, Funny

    Some dumbass is always printing 300 pages of documents and hogging the printer. Forchrissakes, just figure out what pages you need and print those! Asshole.

    The amount of data here is really enormous. To put it in perspective, if you lined up 7143 blondes, the number of strands of hair present would approximately equal the number of entries in this database.

    1. Re:For God's sake, don't print it! by davidsyes · · Score: 1

      Well, that depends on how bald those Barbies are. Some blonds might have more genes than they have hair. Some certainly have hair longer than 7 or 9 inches. But, add up all the surface hair... That could be one helluva "helix".

      Now, I wonder when gene blending will happen outside of the morphing software and bedroom. Maybe we can bring tails back...

      --
      Previously: "Linux... Toward the Sunrise..." Now: "Linux... Toward the-- No, now, part of Every Sunrise"
    2. Re:For God's sake, don't print it! by Anonymous Coward · · Score: 4, Funny

      I love those things: "To put this in perspective, here's another image or figure that won't fit in the human mind either." They always clear those huge numbers right up for me.

      At least your name is "BadAnalogyGuy", which gives you a better excuse than the story submitter.

    3. Re:For God's sake, don't print it! by kahanamoku · · Score: 3, Interesting

      Printing would be an issue in itself,

      By the time you successfully print the 22TB of data, you would no doubt pass the 10 month threshold for the double sized growth. Once you start printing, you'd never stop!

      then again, a new challenge for Epson/HP etc... develop a printer that is robust enough to print a paper mount everest!

      --
      ----- Concentrate on promoting more than demoting.
    4. Re:For God's sake, don't print it! by margaret · · Score: 5, Funny

      Some dumbass is always printing 300 pages of documents and hogging the printer. Forchrissakes, just figure out what pages you need and print those! Asshole.

      Like when I was in grad school, I remember our IT guy was hopping mad because he had to come in on a sunday to reboot the server because some dumbass decided to print the entire mouse chromomome 22 sequence. Something about a spool file and crashing his server...

    5. Re:For God's sake, don't print it! by tomhudson · · Score: 0, Troll

      By the time you successfully print the 22TB of data, you would no doubt pass the 10 month threshold for the double sized growth. Once you start printing, you'd never stop!

      I have a simpler solution - just study Bush supporters - they come from the shallow end of the gene pool, so your flood of data would also slow to a trickle.

      .

      [tt]

    6. Re:For God's sake, don't print it! by queenb**ch · · Score: 3, Interesting

      If it doubles every 10 months, in about 8 years we should no longer have enough hard drive space to store it.

      2 cents,

      Queen B

      --
      HDGary secures my bank :/
    7. Re:For God's sake, don't print it! by Anonymous Coward · · Score: 0

      no, there will always be enough hard drive space for it, it'll just keep costing more & more, until they buy up all the world's hard drives & are using more space than can be produced, but extra manufacturers would start up to fill the demand, etc.

      as long as their money supply is infinite, they won't run out of hard drive space until we run out of natural resources in the near universe to make hard drives from, or until almost every human spends all their time making hard drives & no more could possibly be produced.

      i suspect this will take longer than 8 years.

      but yes, their data does seem to be growing faster than the storage space of average hard drives, so their storage requirements will keep getting more expensive.

    8. Re:For God's sake, don't print it! by Firehed · · Score: 3, Insightful
      Right. Because hard drives aren't ever going to be made from today forward, and certainly won't get bigger in capacity.

      If I'm doing the math right, that would put the storage needed at about 25PB eight years out from now (about ten doublings is 1024 times the current needs). Which is only 50,000 500GB drives. While certainly quite a lot, if the average hard drive space is even 10GB, times millions of computers just in the US, I think we're set. Seagate probably sells that much storage every week.

      I'm pretty sure we'll run out of species to map the genetic info of before we run out of space to store that info.

      Still, quite the accomplishment.
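
The doubling arithmetic in this subthread can be checked with a short sketch. It assumes only the figures quoted in the summary (22 TB today, doubling every ten months) and decimal units; the function names `projected_size_tb` and `drives_needed` are made up for illustration.

```python
import math


def projected_size_tb(current_tb: float, months: float,
                      doubling_months: float = 10.0) -> float:
    """Archive size after `months` of steady exponential growth."""
    return current_tb * 2 ** (months / doubling_months)


def drives_needed(size_tb: float, drive_gb: float = 500.0) -> int:
    """Whole drives required to hold `size_tb` (1 TB = 1000 GB)."""
    return math.ceil(size_tb * 1000 / drive_gb)


# Exactly ten doublings (100 months), as in the comment's estimate.
ten_doublings_tb = projected_size_tb(22.0, 100)
# Eight calendar years is 96 months, i.e. 9.6 doublings.
eight_years_tb = projected_size_tb(22.0, 96)

print(f"ten doublings: {ten_doublings_tb:,.0f} TB")
print(f"eight years:   {eight_years_tb:,.0f} TB")
print(f"500 GB drives for ten doublings: {drives_needed(ten_doublings_tb):,}")
```

Under these assumptions ten doublings comes to 22,528 TB (about 22.5 PB), or roughly 45,000 drives of 500 GB each; eight years exactly is a little less, around 17 PB.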

      --
      How are sites slashdotted when nobody reads TFAs?
    9. Re:For God's sake, don't print it! by Oxen · · Score: 1

      some dumbass decided to print the entire mouse chromomome 22 sequence

      Nice try. Mice only have 20 different chromosomes. Whose the dumbass now!

      I kid. I kid.

      --
      First you animate. Then you SUSPEND!!!
    10. Re:For God's sake, don't print it! by davidsyes · · Score: 1

      Yeh, I am sure Chaka and the Sleetaks out in the Oval Orifice would love you counting their pubic, ummm, PUBLIC hairs...

      Hey, that could be an information disinformation campaign band... A modern spin on "Pontius Pilate and the Nail-Drivin' Five" from the 70's.

      With enough gene therapy, though, Chaka and the Sleetaks can ALL be transformed into uubersekshoowalls (read: uubersexuals).

      --
      Previously: "Linux... Toward the Sunrise..." Now: "Linux... Toward the-- No, now, part of Every Sunrise"
    11. Re:For God's sake, don't print it! by davidsyes · · Score: 1
      --
      Previously: "Linux... Toward the Sunrise..." Now: "Linux... Toward the-- No, now, part of Every Sunrise"
    12. Re:For God's sake, don't print it! by Phiu-x · · Score: 1

      Oh come on, this is Slashdot! We'd make a beowulf cluster of .... It's possible. Really ...

      --
      This is a stolen sig.
    13. Re:For God's sake, don't print it! by krunk4ever · · Score: 1

      I remember one of my classmates was too cheap to buy the reader and the reader was available in pdf format. I believe 600-800 pages and he printed the entire thing one weekend. Our professor found out and said to the entire class, "The book's only $20. It's worth the paper's price."

    14. Re:For God's sake, don't print it! by dartarrow · · Score: 1

      The amount of data here is really enormous. To put it in perspective, if you lined up 7143 blondes, the number of strands of hair present would approximately equal the number of entries in this database.

      Are we also counting the hair on their head?

      --
      I love humanity, it is people I hate
    15. Re:For God's sake, don't print it! by Stan+Vassilev · · Score: 1

      "Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest."

      Holy shit batman! Just imagine if instead we chisel it on stone plates! It might go to the moon and back.

      "The Archive is 22 Terabytes in size and doubling every ten months."

      22 Terabytes, i.e. if written on a holodisk (coming: 2007 - 2008) it'll be about 22 of 'em. That would produce a stack about the height of my home scanner.

    16. Re:For God's sake, don't print it! by Anonymous Coward · · Score: 1, Funny

      I think you mean 'who is' not 'whose'. So yes, you are!

    17. Re:For God's sake, don't print it! by Archades54 · · Score: 0

      and to represent the amount of data in brain cells, the amount of blondes you'd need..... wait a few billion years for the universe to grow enough to fit them in

      --
      If your neighbours roof is flying past your window, you know it's cyclone season.
    18. Re:For God's sake, don't print it! by CFTM · · Score: 1

      Yeah I'm really going to have to agree with Firehed on this one, not to mention the fact that in the next ten years we'll probably have drastically different methods of storing dense pieces of information. I expect that within ten years terabyte drives will be common in personal computers and it may be even five years away. Space isn't a big deal ;)

    19. Re:For God's sake, don't print it! by margaret · · Score: 1

      Ah, the typo. The bane of my existence. It was chromosome 12. And I also misstyped "chromomome." But alas, I was too quick on the submit button.

      But I still got a +5 funny :-)

    20. Re:For God's sake, don't print it! by MarkCollette · · Score: 1

      Presumably that's bounded by earth's population...

    21. Re:For God's sake, don't print it! by gstoddart · · Score: 1
      our IT guy was hopping mad because he had to come in on a sunday to reboot the server because some dumbass decided to print the entire mouse chromomome 22 sequence.

      Mmmmmm ... Mouse Chromomomes.

      =)
      --
      Lost at C:>. Found at C.
  3. How many LOCs is that? by Anonymous Coward · · Score: 2, Funny

    I could make this sentence wrap around the world a zillion times if I used 10^100 point text.

  4. i love meaningless data by JeanBaptiste · · Score: 5, Funny

    "To grasp how much data is in the Archive, if it were printed out as a single line of text, it would stretch around the world more than 250 times. Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest. "

    I have twice that much data on my 128k thumbdrive, if printed out in 72 point font size.

    Anyone care to translate this into volkswagens, or libraries of congress?

    1. Re:i love meaningless data by Anonymous Coward · · Score: 2, Funny

      you have a 128k thumbdrive? Does it use the serial port?

    2. Re:i love meaningless data by Snarfangel · · Score: 3, Funny

      Anyone care to translate this into volkswagens, or libraries of congress?

      I keep forgetting, how many Volkswagens to the Ferrari?

      --
      This tagline is copyrighted material. Please send $10 for an affordable replacement.
    3. Re:i love meaningless data by davidsyes · · Score: 2, Funny

      Can anyone translate that into strands of 1/4 inch-long pubic hair, or SPI (density in strands per cubic inch)? Maybe we can turn humans into mink or felt. Imagine the hygiene business stock if you could put this hair-densification stuff into the food and water supply. I hear Pantene helps women's hair grow FAST. I've been using it to see what would happen, and my own hair seems to be growing faster than normal. But, it could be coincidental. (I wonder if Pantene has bull sperm/semen in it like the hair clubs for bald men purportedly do. I'd read in GQ or one of those uuber-sexual mags that it did. So, I dog-eared one of my friend's magazines so his guests would probably notice it and read the article.)

      But, speaking of thumbnails, can anyone translate that information density into THUMBNAILS or corneas or eyeball spheres worth of information?

      Next: Subspace Neural Network...

      --
      Previously: "Linux... Toward the Sunrise..." Now: "Linux... Toward the-- No, now, part of Every Sunrise"
    4. Re:i love meaningless data by Frogbert · · Score: 5, Funny

      No, but to put it in some perspective. It would take over 6 minutes for a japanese school girl to type it all out on her phone.

    5. Re:i love meaningless data by borisborf · · Score: 3, Interesting

      Well, according to Wikipedia, It is estimated that the print holdings of the Library of Congress would, if digitized and stored as plain text, constitute 17 to 20 terabytes of information. Remember, this is without images or diagrams. Just plain text.
      So this is roughly the size of the TEXT in the library of congress.

    6. Re:i love meaningless data by Brent+Spiner · · Score: 5, Funny

      If you choose a fixed-width font such as 12 point Courier, about 75 letters fit on a single line with half inch margins. This means that each letter is about 2.54 millimeters in length. The earth is 24900 miles in circumference, which means that it would take 15776640000 letters to stretch around the earth.

      If we take a 1967 Volkswagen to be a measurement of length then it is 1606.01 times larger than a single letter so it would take 9823500.48 Volkswagi to tailgate around the earth. Multiply that by 250 and you get ~ 2.455875x10^9 Volkswagens.

      Since it is quite easy to convert Volkswagens to Library of Congresses I won't go into further detail.
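
The conversion above can be reproduced in a few lines. This is only a sketch of the comment's own arithmetic: the 1606.01 letters-per-Volkswagen ratio and the 24900-mile circumference are taken from the comment as given, and the constant names are invented for illustration.

```python
INCHES_PER_MILE = 63_360
LETTERS_PER_INCH = 10            # 2.54 mm per letter is 0.1 inch
EARTH_CIRCUMFERENCE_MILES = 24_900
LETTERS_PER_VOLKSWAGEN = 1606.01  # ratio quoted in the comment
LAPS = 250                        # "around the world more than 250 times"

# Letters needed for one lap of the earth.
letters_per_lap = EARTH_CIRCUMFERENCE_MILES * INCHES_PER_MILE * LETTERS_PER_INCH

# Volkswagens parked bumper to bumper for one lap, then for all laps.
volkswagens_per_lap = letters_per_lap / LETTERS_PER_VOLKSWAGEN
volkswagens_total = volkswagens_per_lap * LAPS

print(f"{letters_per_lap:,} letters per lap")
print(f"{volkswagens_per_lap:,.2f} Volkswagens per lap")
print(f"{volkswagens_total:.6e} Volkswagens for {LAPS} laps")
```

The figures in the comment hold up: one lap is 15,776,640,000 letters, which is about 9.82 million Volkswagens, and 250 laps is about 2.456 billion.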

      --
      Reality test... am I dreaming?
    7. Re:i love meaningless data by mrscorpio · · Score: 2, Funny

      "Volkswagi"

      I just threw up in my mouth.

    8. Re:i love meaningless data by c0dedude · · Score: 1

      no that would be a generic not genetic database this data has meaning

      --
      Since when has this country used intellectual elite as a pejorative term?
    9. Re:i love meaningless data by RedWizzard · · Score: 1
      If we take a 1967 Volkswagen to be a measurement of length then it is 1606.01 times larger than a single letter so it would take 9823500.48 Volkswagi to tailgate around the earth. Multiply that by 250 and you get ~ 2.455875x10^9 Volkswagens.
      No, no! The Volkswagen Beetle can only be used as a unit of mass or volume, never as a unit of length. Length should be measured in "football fields" (which can also be used for area)!
    10. Re:i love meaningless data by mattjb0010 · · Score: 1

      (800 billion bases) / (3 billion bases * 0.5 per sperm) = about 533 sperm or roughly 0.0013% of one normal ejaculate (40 million sperm).

    11. Re:i love meaningless data by Bishop_Of_Battle · · Score: 1

      "... doubling time is ten months." heheh so ahh... in ten months they will have already entered an additional (whatever they have now) and then 20 months from now they'll have 4(whatever they have now). I double dare them. They better be typing like mad. It seems to me that statement could only be possible if every ten months they figure out how to enter data twice as quickly. nope. not gonna do it.

    12. Re:i love meaningless data by grimJester · · Score: 1

      It's around seven million mp3z, giving a rough estimate of $10^12 annual loss for the music industry.

    13. Re:i love meaningless data by ls+-la · · Score: 1

      No, but to put it in some perspective. It would take over 6 minutes for a japanese school girl to type it all out on her phone.

      But less than 5 minutes for a 93-year-old using morse code

    14. Re:i love meaningless data by Hynee · · Score: 1

      According to the post it's 1 billion database entries for a total of 22TB, so it's about 22kB per entry. Quite what's in each entry I'm not sure, probably whole genes (I think they're about 10000 base pairs, and a base pair could be encoded as two bits (there are 4 different ones), but they're probably encoded as bytes (G, C, T or A)).
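
The two-bits-per-base idea can be sketched as follows. This is an illustration only, not the Trace Archive's actual storage format; the mapping in `CODE` and the function names are arbitrary choices, and the sketch ignores ambiguity codes such as N.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # arbitrary 2-bit assignment
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a string of A/C/G/T into 2 bits per base, 4 bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed data."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


packed = pack("GATTACA")
print(len(packed), unpack(packed, 7))
```

Packed this way, a 22,000-base sequence would take 5,500 bytes instead of 22,000. The original length has to be stored separately because the last byte may be padded.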

      --
      Damn, I already moderated this topic. Now I'll have to log in with my sock puppet to comment.
    15. Re:i love meaningless data by djdavetrouble · · Score: 1

      "Volkswagi"

      I just threw up in my mouth.


      No stupids its already plural, like Vaxen.

      Get it right !

      --
      music lover since 1969
    16. Re:i love meaningless data by sploxx · · Score: 1

      This means that each letter is about 2.54 millimeters in length.
      Hey, you have admitted the true power of the Imperial System here, it's 0.1 inches!

      Disclaimer: I'm a EU citizen :)

    17. Re:i love meaningless data by smoker2 · · Score: 1
      If we take a 1967 Volkswagen to be a measurement of length then it is 1606.01 times larger than a single letter so it would take 9823500.48 Volkswagi to tailgate around the earth.
      That's one hell of a Burning Man fest !
  5. 22TB is nothing. by Duncan3 · · Score: 3, Insightful

    Wow, that's almost 12U of rack space. Oh my *yawn*

    Now the fact that that's all genetic data, that's amazing considering a human is only ~1GB so 22,000 humans worth.

    --
    - Adam L. Beberg - The Cosm Project - http://www.mithral.com/
    1. Re:22TB is nothing. by Endymion · · Score: 2, Interesting

      seriously... I've personally added at least that much to NCBI's archive...

      I guess it depends on what they mean by "genetic data", exactly. if they are including the traces, that's not much.

      --
      Ce n'est pas une signature automatique.
    2. Re:22TB is nothing. by TheSpoom · · Score: 4, Funny

      I'm pretty sure storing humans on your hard drive is illegal.

      --
      It's better to vote for what you want and not get it than to vote for what you don't want and get it.
      - E. Debs
    3. Re:22TB is nothing. by RootsLINUX · · Score: 1

      Yeah, it's really nothing to be impressed about. I have well over 22TB of porn sitting on my computer.

      --
      Hero of Allacrost, a FOSS RPG for *NIX/*BSD/OS X/Win
    4. Re:22TB is nothing. by damneinstien · · Score: 0

      Now the fact that that's all genetic data, that's amazing considering a human is only ~1GB so 22,000 humans worth.

      What's more amazing is that a human has 1 GB on all of his/her non-sex cells. Considering the amount of cells in a human body, I doubt that all of that would fit on 12U worth of rack space.

    5. Re:22TB is nothing. by roesti · · Score: 3, Funny
      I'm pretty sure storing humans on your hard drive is illegal.
      Well, the HIAA keeps saying that, but the Digital Human Copyright Act (DHCA) is pretty vague.

      In the meantime, you can still get the genetic layouts of other animals on eDonkey. (groan)

    6. Re:22TB is nothing. by Anonymous Coward · · Score: 0

      5U, on a good day, if you use one of those 48-drive pile cases and 500GB drives.

    7. Re:22TB is nothing. by Anonymous Coward · · Score: 0

      According to the NCBI trace site, their archive currently contains 986143227 traces, about 18 million fewer than in the Sanger Archive. And yet you claim that you have personally contributed more than 22TB of data to them? Maybe the tapes got lost in the mail...?

    8. Re:22TB is nothing. by Anonymous Coward · · Score: 0

      Note the name of the database: The trace archive. This is not simply an archive of the ACGT sequence, which as others have said is a relatively compact 3GB of data for an entire human genome. This is the raw trace data that came off the sequencing machines, and essentially consists of graphs of light intensity as recorded by the machine. To get the ACGT out, you need to run some signal analysis software on it.

      When genomes are sequenced, short sequences of a few hundred, or less, base pairs are sequenced. They are done multiple times, and to get the final answer, software looks for overlaps in these fragments to build up a picture of the whole thing.

      The raw data remains useful because it contains other data, which the reduction to plain sequence throws away. The sequences may have come from different individuals, so obtaining the raw data for a given region of DNA will help identify variations in that data. By applying a slightly different signal analysis method, you can look for heterozygosity as well (i.e. places where you get two different peaks in the same place, which the normal base-caller software would just choose one and treat it as a low quality prediction). Secondly, someone may come up with a better base-calling algorithm, and therefore want to re-analyse the existing data.

      But that's why there's so much data. To make the genome assembly accurate, everything has been sequenced many times, from multiple individuals, and this is still ongoing. And the image data is approximately 20 times as large as the ACGT sequence you get from it, I'd guess, possibly even more than that.
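
The "look for overlaps in these fragments" step can be illustrated with a toy greedy merge. This is a minimal sketch under strong assumptions (error-free reads, a single strand, no repeats) and is not how any particular assembler works; `overlap` and `greedy_assemble` are names invented here.

```python
def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, -1, -1)  # (overlap size, left index, right index)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


print(greedy_assemble(["ACGTAC", "TACGGA", "GGATTC"]))
```

Real data is much harder: reads contain base-calling errors, come from both strands, and genomes are full of repeats, which is part of why the raw traces are worth keeping.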

    9. Re:22TB is nothing. by Phillip2 · · Score: 1

      It's probably a little bit more than that, as it's managed. Also, it's doubling in size every 10 months; a problem, as hard drive capacity doubles only something like once every 18 months. This means the cost of providing this storage will increase exponentially.

      Incidentally, it's not "genetic data"--that is sequence data. It's trace data which is then interpreted to produce sequence data. So actually, the data storage requirements for each base take more than 2 bits. Moreover it's redundant (DNA is sequenced at least 10 times). So you're probably several orders of magnitude out with your calculations.

      Phil
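
The mismatch between the two growth rates can be made concrete with a small sketch. It takes the 10-month and 18-month doubling times from the comment at face value and assumes both trends continue unchanged; the function name is illustrative.

```python
def relative_drive_count(months: float,
                         data_doubling: float = 10.0,
                         drive_doubling: float = 18.0) -> float:
    """Drives needed after `months`, relative to today, if both trends hold."""
    data_growth = 2 ** (months / data_doubling)     # how much more data
    drive_growth = 2 ** (months / drive_doubling)   # how much bigger each drive
    return data_growth / drive_growth


for months in (0, 30, 60, 90):
    print(months, round(relative_drive_count(months), 2))
```

With these figures the number of drives required doubles about every 22.5 months, so after 90 months the archive would need 16 times as many drives as today even though each drive is 32 times larger.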

    10. Re:22TB is nothing. by Harodotus · · Score: 1

      It's not illegal if it's voluntary.

      What two consenting adults do on hard disks is none of my concern...

      --
      Its not users who are broken, it's systems not taking account their likely behaviour and fixing it technically.
    11. Re:22TB is nothing. by Anonymous Coward · · Score: 0

      I think the Human Genome is over 3 billion base pairs- so it's more like 3GB (assuming 1 letter = 1 byte). Furthermore, when they sequence genomes, they try and sequence the same area multiple times to ensure accuracy- this is referred to as "depth of coverage". In the human genome's case, they aimed for 8x to 9x coverage, thus increasing total raw sequencing data to at least 24GB.

      More information at ornl.org

    12. Re:22TB is nothing. by Amouth · · Score: 1

      i had the same reaction.. the rack behind me right now has around 6TB on it.. and we are a small company of ~15 people

      22TB just doesn't seem impressive..

      --
      '...if only "Jumping to a Conclusion" was an event in the Olympics.'
    13. Re:22TB is nothing. by TheSpoom · · Score: 1

      Strange, that wasn't what I got when I searched for "animal"...

      --
      It's better to vote for what you want and not get it than to vote for what you don't want and get it.
      - E. Debs
    14. Re:22TB is nothing. by Endymion · · Score: 1

      don't believe the web site numbers...

      --
      Ce n'est pas une signature automatique.
    15. Re:22TB is nothing. by Anonymous Coward · · Score: 0

      This gives a whole new meaning to "Web 2.0 is made of people!"

  6. Dubious claims by Dr.+Photo · · Score: 3, Interesting

    if it were printed out as a single line of text, it would stretch around the world more than 250 times. Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest.

    Such claims should be taken with a grain of salt until they reveal what fonts and point sizes they use.

    1. Re:Dubious claims by kahanamoku · · Score: 1

      or, print it Duplex and you're only 1.25 the size of everest!

      I wanna see someone print it in booklet format, then try and fold the edge!

      --
      ----- Concentrate on promoting more than demoting.
    2. Re:Dubious claims by RedWizzard · · Score: 2, Informative
      Such claims should be taken with a grain of salt until they reveal what fonts and point sizes they use.
      It's just meaningless reporter-speak. A stupid attempt to provide context for readers who can't visualise that much data. Of course, I doubt many such readers have a good concept of the circumference of the world or the height of Mt Everest either.

      I actually have my masters thesis on a single sheet of A4. I had to use a 1.5 point font to make it fit. You could still read it though.

    3. Re:Dubious claims by timeOday · · Score: 4, Insightful
      Such claims should be taken with a grain of salt until they reveal what fonts and point sizes they use.
      Let me interpret for you: it's a lot.

      What's incredibly more lame is that 99% of the slashdot comments on this article so far are stuck on units of measure. Clearly it's a lot. Instead of debating the length of a piece of string, how about some discussion on how to distribute and analyze so much data. At this point I'd almost welcome some grousing about patents or dumb google DNA-related theories. We're barely scratching the surface on understanding genetic data. Even finding approximate substring matches within samples is fairly difficult. Here we have the world's biggest crossword puzzle which encodes the secrets of life itself and most of you guys are stuck on the point size of the font.
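
To give a feel for the approximate substring matching mentioned above, here is a naive sketch that slides a pattern along a sequence and counts substitutions. It is an illustration only: it ignores insertions and deletions, it runs in time proportional to text length times pattern length, and the function name is made up.

```python
def approximate_matches(text: str, pattern: str, max_mismatches: int) -> list[int]:
    """Start positions where `pattern` matches `text` with few substitutions."""
    hits = []
    for start in range(len(text) - len(pattern) + 1):
        window = text[start:start + len(pattern)]
        # Hamming distance between the window and the pattern.
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


print(approximate_matches("ACGTTGCATGGA", "TGCA", 0))
print(approximate_matches("ACGTTGCATGGA", "TGCA", 1))
```

A brute-force scan like this is hopeless at the scale of a 22 TB archive, which is why practical tools rely on indexes and heuristics rather than comparing every position.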

    4. Re:Dubious claims by ObsessiveMathsFreak · · Score: 1

      Such claims should be taken with a grain of salt until they reveal what fonts and point sizes they use.

      It's a moot point anyway. You're never going to be able to open the whole file in Word to begin with.

      --
      May the Maths Be with you!
  7. Whoa!! I thought it said: by davidsyes · · Score: 0

    "...every ten MINUTES." Imagine we'd look like the Ferengi with loads of teeth and slick heads.

    --
    Previously: "Linux... Toward the Sunrise..." Now: "Linux... Toward the-- No, now, part of Every Sunrise"
    1. Re:Whoa!! I thought it said: by Anonymous Coward · · Score: 0
      "...every ten MINUTES." Imagine we'd look like the Ferengi with loads of teeth and slick heads.


      okay, seriously... what the hell?

  8. If printed out... by MarkusQ · · Score: 5, Funny

    if it were printed out as a single line of text, it would stretch around the world more than 250 times. Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest

    Did anybody else think "Wow, I've got a great idea for a mural for the space elevator!"

    Anybody?

    Uh, well, it's late...

    --MarkusQ

    1. Re:If printed out... by Geoffreyerffoeg · · Score: 1

      Oh wow. I just realized...you could do a lot of cool things with that. How about a reproduction of all the major written works in civilization, e.g., from Epic of Gilgamesh and the Vedas through the Bible through the Divine Comedy through Paradise Lost to the Lord of the Rings and other modern texts? Print them in their entirety, in standard font, in order from the oldest at the surface to the most recent at the top. It would be impressively symbolic.

  9. Torrent? by mendaliv · · Score: 5, Funny

    Would somebody please torrent it?

    1. Re:Torrent? by ShaneThePain · · Score: 1

      do you really have a hard drive that can store 22TB? I consider my 120 GB drive pretty big actually. You would need a whole set of 500GB drives.

      --
      Fascism is the greatest political ideology ever conceived. Sorry.
    2. Re:Torrent? by jftitan · · Score: 1

      just wait and see, Some music artist is going to give his genetic code to this database. Once torrenting becomes the method of open code surfing, the MPAA/RIAA will be knocking at our doors.

      --
      "Don't Forget to Salt the Fries"
  10. If we're not careful.. by AkA+lexC · · Score: 2, Funny
    The Archive is 22 Terabytes in size and doubling every ten months

    This enormous archive will devour us all.. ARGHH!
    --
    -AlexC
  11. How do they map their function? by bubulubugoth · · Score: 1, Interesting

    This is a real question...

    How do scientists do that?

    Do they wiggle this gene and see what happens?
    How do they go about the "scientific method" of experimentation?

    --
    Â_Â
    1. Re:How do they map their function? by AlanKilian · · Score: 5, Informative

      From: http://www.learner.org/channel/courses/biology/textbook/genom/genom_7.html

      A biological approach to determining the function of a gene is to create a mutation and then observe the effect of the mutation on the organism. This is called a knockout study. While it is not ethical to create knockout mutants in humans, many such mutants are already known, especially those that cause disease. One advantage of having a genome sequence is that it greatly facilitates the identification of genes in which mutations lead to a particular disease.

      The mouse, where one can make and characterize knockout mutants, is an excellent model system for studying genetic diseases of humans; its genome is remarkably similar to a human's. Nearly all human genes have homologs in mice, and large regions of the chromosomes are very well conserved between the two species. In fact, human chromosomes can be (figuratively) cut into about 150 pieces, mixed and matched, and then reassembled into the 21 chromosomes of a mouse. Thus, it is possible to create mutants in mice to determine the probable function of the same genes in humans. Genetic stocks of mutant mice have been developed and maintained since the 1940s.

      One goal of the mouse genome project is to make and characterize mutations in order to determine the function of every mouse gene. After a particular gene mutation has been linked to a particular disorder, the normal function of the gene may be determined. An example of this approach is the mutated gene that resulted in cleft palates in mice. The researchers found that the gene's normal function is to close the embryo's palate. An understanding of the genetics behind cleft palate in mice may one day be used to help prevent this common birth defect in humans.

    2. Re:How do they map their function? by Stachybotris · · Score: 5, Informative

      In most cases they work backwards. You start with a known protein, determine its amino acid sequence, and then convert that into the most likely DNA sequence (accounting for codon bias). Primers/probes are then generated for the 3' and 5' ends of the probable DNA sequence. If you're working with a small genome like that of a bacterium, you can perform a restriction digest to get random hunks of chromosome. These are then amplified via PCR using your designer primers. The final product is then sequenced.

      In other cases you can create a gene knockout by splicing a random gene into your gene of interest. This causes your target gene to encode a non-functional protein. Then you watch and see what happens to the test subject. In some cases the creature dies because the gene turned out to be extremely important. In others it results in minor to significant impairment. But because of the complexity of most organisms, single-gene knockouts usually don't have too much effect - the creature has multiple pathways that can accomplish the same goal. This is especially true for critical functions like those in the immune system.
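
The "convert the amino acid sequence into the most likely DNA sequence" step can be sketched as a table lookup. The `PREFERRED_CODON` table below is hypothetical and partial: each entry is a valid codon for that amino acid in the standard genetic code, but the choice of which synonymous codon to prefer is made up here and is not real codon-usage data for any organism.

```python
# Hypothetical "preferred codon" per amino acid (one-letter codes).
# A real table would be derived from codon-usage statistics for the
# organism being studied and would cover all twenty amino acids.
PREFERRED_CODON = {
    "M": "ATG",  # methionine (only one codon)
    "W": "TGG",  # tryptophan (only one codon)
    "K": "AAA",  # lysine
    "F": "TTT",  # phenylalanine
    "L": "CTG",  # leucine
    "G": "GGC",  # glycine
    "A": "GCG",  # alanine
    "E": "GAA",  # glutamate
    "D": "GAT",  # aspartate
    "S": "AGC",  # serine
}


def back_translate(protein: str) -> str:
    """Guess a DNA sequence for `protein` using one preferred codon per residue."""
    try:
        return "".join(PREFERRED_CODON[aa] for aa in protein.upper())
    except KeyError as missing:
        raise ValueError(f"no codon listed for residue {missing}") from None


print(back_translate("MKLGE"))
```

Because most amino acids have several synonymous codons, the result is only a best guess, which is why the probes built from it are then checked against the real genome.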

    3. Re:How do they map their function? by Anonymous Coward · · Score: 0

      They lock it in a room with Jack Bauer and it gives up its secrets in about 30 seconds. The protein is frequently damaged in the process.

    4. Re:How do they map their function? by zen-theorist · · Score: 1
      An understanding of the genetics behind cleft palate in mice may one day be used to help prevent this common birth defect in humans.

      how do scientists expect to modify the genetic material of every living human being so as to prevent this defect? is there some parallel technology that promises a mass-producible mutating vaccine or something equal in function?

    5. Re:How do they map their function? by Anonymous Coward · · Score: 0

      So what you're saying is that they inject stuff into animals and plants and see if they turn funky colours?

    6. Re:How do they map their function? by AlanKilian · · Score: 1

      "how do scientists expect to modify the genetic material of every living human being so as to prevent this defect? is there some parallel technology that promises a mass-producable mutating vaccine or something equal in function?"

      It's not often an easy fix, and it's not often even fixable right now.

      Once the defective gene is identified, and its CHEMICAL function is understood, scientists can attempt to make a pill that provides a similar chemical so that the metabolic pathway regains its function.

      Cleft palate is a developmental problem, so its treatment would be the mother taking some kind of pill while pregnant. This is unlikely to be an easy thing to fix for a large number of reasons.

  12. The proper unit of measure by Anonymous Coward · · Score: 0

    ...is in LoC's (Libraries of Congress).

  13. Metric, not Imperial by BadAnalogyGuy · · Score: 1

    He referenced A4 paper, so he's obviously not in the U.S. They use the metric system overseas.

    1. Re:Metric, not Imperial by Anonymous Coward · · Score: 0
      He referenced A4 paper, so he's obviously not in the U.S. They use the metric system overseas.
      Especially, since the earth is much bigger in the metric system.
    2. Re:Metric, not Imperial by dotgain · · Score: 1

      What do they call a whopper?

    3. Re:Metric, not Imperial by pompomtom · · Score: 1

      No, the fonts are much smaller.

      --

      Buckets,

      pompomtom

      "There's an exception to every rule. Except for some rules"
  14. So tired. So very, very tired. Of that. by ScentCone · · Score: 5, Insightful

    If we stacked up all of the useless length metaphors/comparisons from end to end, they'd still add up to a non-useful mental image of a billion genetic records.

    I mean, "printed out as a single line of text, it would stretch around the world more than 250 times" means what, in terms of helping us picture this? I take it that we're not supposed to be able to imagine a billion records, but we can all clearly picture some text wrapped around the planet 250 times? Ah, that's much more helpful!

    Now, I just got done re-indexing 10 million records in a database, and I can sort of picture 100 times that much work. This is slashdot! More nerdly examples, please.

    --
    Don't disappoint your bird dog. Go to the range.
    1. Re:So tired. So very, very tired. Of that. by borisborf · · Score: 1

      You have to wonder what means of database management they use to keep all this data. We're not talking about a small company Access database here. It would be interesting to know how their servers are set up to hold this data and how they process all of it should it ever need "re-indexing". Just imagine that processor load!

    2. Re:So tired. So very, very tired. Of that. by Otter · · Score: 1
      If we stacked up all of the useless length metaphors/comparisons from end to end, they'd still add up to a non-useful mental image of a billion genetic records.

      Helpfully, that's precisely how meaningless a milestone one billion sequencer traces is.

    3. Re:So tired. So very, very tired. Of that. by jmv · · Score: 4, Funny

      More nerdly examples, please.

      - It would require 100,000 liters of ink to write down all the 1's and 0's
      - It would take 400 years to transmit it over a 14.4 kbps modem
          * Requiring about 10 Giga Joules
      - If each bit was encoded on a single hydrogen atom, the whole db would weigh about 0.1 mg
      - If ones are transmitted as a single (infrared) photon, it would take 0.01 Joules to transmit the whole db
          * You could transmit it 100 times with the energy of a mouse trap
      - It would require about one year for a million monkeys to type it in (without having to guess)
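
The modem figure in the list above is easy to sanity-check. A minimal sketch, assuming decimal terabytes (22 x 10^12 bytes), a raw 14,400 bits per second with no protocol overhead, and a 365.25-day year; `transfer_years` is just an illustrative helper name.

```python
ARCHIVE_BYTES = 22 * 10**12          # 22 TB, decimal terabytes assumed
MODEM_BITS_PER_SECOND = 14_400       # 14.4 kbps, ignoring protocol overhead
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def transfer_years(size_bytes, bits_per_second):
    """Years needed to push size_bytes through a link of the given speed."""
    return size_bytes * 8 / bits_per_second / SECONDS_PER_YEAR


print(round(transfer_years(ARCHIVE_BYTES, MODEM_BITS_PER_SECOND)))
```

Under those assumptions the result lands a little under 400 years, so the rounded figure in the list holds up.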

    4. Re:So tired. So very, very tired. Of that. by ScentCone · · Score: 1

      It would take 400 years to transmit it over a 14.4 kbps modem

      See, now that's what I'm talking about. A proper, well-scaled, nerdly example. Except at least half the readers here will say 14.4-whatis-that-now?

      --
      Don't disappoint your bird dog. Go to the range.
    5. Re:So tired. So very, very tired. Of that. by Dhalka226 · · Score: 1
      This is slashdot! More nerdly examples, please.

      You do realize that announcements written by the Sanger Institute are not written for Slashdot readers, right?

      It's a quote. Deal with it.

    6. Re:So tired. So very, very tired. Of that. by TedTschopp · · Score: 1

      Is that a bit slower than 150baud? I think the decimal place is in the wrong location.

      --
      Fantasy remains a human right; we make in our measure and in our derivative mode... -- JRR Tolkien
    7. Re:So tired. So very, very tired. Of that. by ScentCone · · Score: 1

      You do realize that announcements written by the Sanger Institute are not written for Slashdot readers, right?

      I do. But in some ways, I think my point is even more appropriate for the lay audience. Meaning, again, how is someone supposed to picture text wrapping around the planet 250 times? Isn't that just another way of saying "more than you can really get your head around" anyway? Most analogies like that aren't really helpful to anyone. Is text going around the planet 100 times really a lot less in your mind than that going around 250 times? Sure, it's 100/250.. but it's not like it makes the number of genetic records in the discussed database more comprehensible to anyone, regardless. Obviously I'm talking about how this type of presentation is used throughout semi-scientific news coverage of any topic, not just in this press release.

      --
      Don't disappoint your bird dog. Go to the range.
    8. Re:So tired. So very, very tired. Of that. by Amouth · · Score: 1

      you my friend have way too much time on your hands..

      but thanks for the references :)

      --
      '...if only "Jumping to a Conclusion" was an event in the Olympics.'
    9. Re:So tired. So very, very tired. Of that. by Anonymous Coward · · Score: 0

      More nerdy examples, please.

      It would only take 4x of it to go the length of my penis.

  15. I will be more impressed... by Stachybotris · · Score: 5, Informative

    When we figure out what all of that does. For every organism as complex as or more complex than your average bacterium, there's a large amount of what amounts to filler DNA. Viruses don't have this problem, as few of them are large enough to even get by without overlapping reading frames. If you shrink this dataset down to only sequences that encode functional proteins (read: genes), there's still an insane amount of information. If you then remove the introns, the dataset gets even smaller. But of course, we don't really know if the introns and intergenic regions of DNA (the so-called 'junk DNA') have functions (or how many they have), although some do act as regulators of transcription.

    Given that a change of just 1 base in 500 of the 16S rRNA gene is sufficient to differentiate between two different species of bacteria, I have to wonder how many of these entries are quasi-redundant. When you consider how many species of bacteria are known to man, that means that there are literally thousands of potential entries for each gene. Unless, of course, they're storing only consensus sequences, which still vary widely between genera.

    Sadly, the trend here seems to be more of 'sequence it, upload it, and patent it' instead of 'sequence it, upload it, figure out what it does/makes, do something useful with it'. Knowing the sequence for the Ubiquitin gene is all well and good, but it's of little practical importance. Being able to construct designer proteins to treat illnesses based on that information, however, is a truly worthy goal. Unfortunately, that's also where the 'patent it' part comes into play...

    1. Re:I will be more impressed... by QuantumG · · Score: 1

      When we can do full quantum electromagnetic simulation of even a square micron of space in at least 1/100th of real time then we'll have no trouble figuring out how this stuff works. Either that or three-dimensional microscopic scanning technology (or a combination of both).

      --
      How we know is more important than what we know.
    2. Re:I will be more impressed... by alicenextdoor · · Score: 1

      They store everything, every little variant of every gene that they can find in any species. And it's not just stamp collecting; modern genetics is becoming more and more reverse genetics. This means that instead of using the traditional "knock out a gene and see what happens to the mouse" approach, scientists studying a particular condition identify regions of the genome that appear to be different in patients with the condition (using markers like single nucleotide polymorphisms (SNPs)), find a gene in that vicinity, sequence it and then compare that sequence with all of those in the databases to try and figure out what it codes for. It's a hugely powerful technique, and totally dependent upon having a comprehensive database.

      --
      of course, biting monkeys is not to everyone's taste - Konrad Lorenz
    3. Re:I will be more impressed... by floWing · · Score: 3, Informative
      First of all I want to point out that so-called "junk DNA" has proven to be a very bad way of thinking about introns and other untranslated regions (like UTRs [untranslated regions around protein-coding regions], regions of DNA which are not used to create proteins [in the regular way] via mRNA (messenger RNA), then translated to protein). Most scientists will agree nowadays that there is _a lot_ of information in these non-exonic regions, the most prominent example to date being microRNA - small RNA pieces from intronic and UT regions - affecting the cell machinery, for instance by silencing protein translation from existing mRNAs.

      As for the figure of 1 billion sequence records, it is far less impressive once you start removing redundant entries. More than half of these entries originate from so-called ESTs (expressed sequence tags), meaning DNA regions [exonic regions] which are transcribed to mRNA. Since exons constitute only a minority of the genomes of higher organisms, these entries cover less than 5 % of the complete genome. Redundancies might not even be discernible because of the high error rate of most "quick-and-dirty" sequencing methods, ranging up to several percent of erroneous bases. Another _big_ problem is sequencing highly repetitive regions of the genome: current sequencing procedures can only read strands up to a length of approx. 1 kb (1000 bases), not much more [the error rate grows intolerably high when sequencing anything significantly longer than this]. Repetitive DNA regions often run longer than this, so we are still not able to "close the gaps" and cannot say where these pieces belong (although excellent scientists are working on exactly this tough problem using so-called "whole genome assemblers").

      Concluding this, I would not be astonished to see that less than 10 % (and even far less) of these billion records do actually contain original information. So, if you want to stick to the hype, you are free to do so, but: it's about hype, not facts.

    4. Re:I will be more impressed... by Phillip2 · · Score: 1

      "I have to wonder how many of these entries are quasi-redundant."

      All of them, pretty much. This is a trace archive. It stores the traces as they come off the sequencing machine. Given that DNA is normally sequenced to 10x (five times in both directions), most of the data in this database will be replicates.

      "Sadly, the trend here seems to be more of 'sequence it, upload it, and patent it' instead of 'sequence it, upload it, figure out what it does/makes, do something useful with it'."

      Data collection is the bedrock of all good science. You can sit and think of clever hypotheses all that you like, but it's all junk unless you actually have the data to test it against.

      People are already doing interesting things with this sort of trace data that were not thought of when it first came out (prediction of polymorphisms is possible, for instance). More will come over time.

      Gathering, storing and archiving data is vital. The trace archive is a modern day Library of Alexandria.

      Phil

  16. a metric we can use, please? by indole · · Score: 1

    All well and good, but how many Libraries of Congress does 2.5 Mt Everest / A4 pages equal?

    My calculator has no Mt Everest button.

    --
    (2,3-Benzopyrrole)
    1. Re:a metric we can use, please? by Anonymous Coward · · Score: 0

      but it does have a LoC button?

    2. Re:a metric we can use, please? by kadathseeker · · Score: 1

      My calculator has no Mt Everest button.

      Turn in your geek badge. Now.

      --
      The 'Net is a waste of time, and that's exactly what's right about it. - William Gibson
  17. don't use their database by rubycodez · · Score: 1

    use my sequence generator:

    ruby -e 'while 1; print "c a t g".split[(rand 4)]; end'

    Just hit control-c when the sequence is long enough to suit you

  18. A4 paper wouldn't work. by suso · · Score: 3, Funny

    Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest.

    You can't do that with ordinary A4 paper. You need to reinforce it on the sides at least so it won't tumble over. Plus, I doubt the paper would sit still with the high winds once it gets above a few thousand feet. Sheesh.

    1. Re:A4 paper wouldn't work. by keraneuology · · Score: 1

      Based on these comments I'd say the papers would be fairly well cemented together.

      --
      If the g'vt kept the data on you that google does you'd better believe you'd be calling it "doing evil"
    2. Re:A4 paper wouldn't work. by Anonymous Coward · · Score: 0

      To grasp how much data is in the Archive, if it were printed out as a single line of text, it would stretch around the world more than 250 times.

      Yeah, and around the world - well the paper would get wet and not make it. Seagulls, whales, african-eurasian-tiger-pussy cats and other manner of beasts would most certainly tamper with it also.

    3. Re:A4 paper wouldn't work. by iwein · · Score: 1

      duh...

      How else do you think they make diamonds. Wait for them to grow by accident from trees?

      --
      Show a man some news, distract him for an hour. Show a man some mod points, distract him for the rest of his life.
  19. More Impressive.. by The-Perl-CD-Bookshel · · Score: 1

    Pfft. I would be more impressed if it was all running on MSDE.

    --
    I don't keep a lid on my coffee so when I walk around I look busy -me
  20. Anybody know what DB Software they're using? by Vorondil28 · · Score: 2, Funny

    Something tells me a 22TB MS Access table just wouldn't cut it. :-P

    --
    This sig rocks the casbah.
    1. Re:Anybody know what DB Software they're using? by wllf · · Score: 1

      It's in the article:

      "The Database is hosted on a single HP ES45 (a 4-CPU server with 16GB of memory) with the storage consisting of HSV EVA5000s and EVA8000s on a SAN. The data are processed into the database using a cluster of 4 ES45s. The database is an Oracle Database 10g Enterprise Edition."

    2. Re:Anybody know what DB Software they're using? by jonoton · · Score: 1

      It's running on Oracle 9, currently running on Tru64 Unix.

      There's a project underway to migrate this to SuSE Linux & Oracle 10, this will be running on HP DL585 4 way Opteron boxes.

  21. Which Database? by Anonymous Coward · · Score: 1, Funny

    Are they using the latest MSSQL 2005 beta 3?

    1. Re:Which Database? by Anonymous Coward · · Score: 0

      www.ensembl.org/info/software/index.html

      "(...) Ensembl uses MySQL relational databases to store its information. (...)"

  22. 22TB = a lot of rebates by skiingyac · · Score: 1

    Now I know who waits in line at 5am at CircuitCity to get the $40 after rebate HDDs! You should be ashamed CmdrTaco, come back when your measly 1/4 to 1/2 TB doubles every ten months.

  23. Do the math by Kickboy12 · · Score: 2, Interesting

    1 billion entries = ~22 Terabytes
    1 billion x 1,000 Bytes = ~0.9 Terabytes

    Which means, on average, your genetic code can be stored in 22KB.

    Just an interesting thought.
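
The per-entry average above can be reproduced in a couple of lines. A small sketch, assuming exactly one billion entries and showing both readings of "terabyte"; note that the figure is the average size of one archive entry.

```python
ENTRIES = 10**9
ARCHIVE_TB = 22

# Average bytes per entry under the two common definitions of a terabyte.
bytes_per_entry_decimal = ARCHIVE_TB * 10**12 / ENTRIES   # 1 TB = 10^12 bytes
bytes_per_entry_binary = ARCHIVE_TB * 2**40 / ENTRIES     # 1 TB = 2^40 bytes

print(bytes_per_entry_decimal)
print(round(bytes_per_entry_binary))
```

Either way the average comes out in the low tens of kilobytes per entry.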

    1. Re:Do the math by Anonymous Coward · · Score: 0

      Umm doesn't it mean the average size of the genetic code in their DB is 22kb? They're not all human ones...

    2. Re:Do the math by Wabin · · Score: 3, Interesting
      except that each entry is not an individual. It is a trace from a sequencing rig, usually. Which means that it is usually 500-1000 bases of sequence (with a bunch of other info there as well... it is not just the As Ts Gs and Cs, but also sequence quality and such). The human genome is roughly 3 billion bases. So they have the equivalent of say 200x the genome of an individual. Of course, the data they have is probably much more concentrated on some areas, where they have thousands of traces, and other areas where they have very few.

      Anyway, the point is you are not about to be able to fit a genome on a floppy disk. Not even close.
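
The "say 200x" estimate above can be bracketed with the numbers given in the comment. A rough sketch, assuming one billion traces, 500 to 1000 bases per trace, and a 3-billion-base human genome; `coverage` is an illustrative helper, and it pretends every trace is human, which the archive's traces are not.

```python
TRACES = 10**9
HUMAN_GENOME_BASES = 3 * 10**9


def coverage(bases_per_trace):
    """Total sequenced bases expressed as multiples of one human genome."""
    return TRACES * bases_per_trace / HUMAN_GENOME_BASES


low, high = coverage(500), coverage(1000)
print(round(low), round(high))
```

So the total is somewhere between roughly 170 and 330 human-genome equivalents, with 200x sitting comfortably inside that range.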

      --
      Most exciting phrase in science: not "Eureka!" but "Hmm... That's funny..." -Asimov (abridged for \. limits)
  24. How big compressed? by Mr_Tulip · · Score: 3, Insightful

    I mean, most of that data is just redundant pairs of A-G C-T T-G etc...

    I reckon you could zip it up and it'll fit on a couple of floppy disks.

    1. Re:How big compressed? by quokkapox · · Score: 1
      Algorithms for this exist; they can also be used for determining how closely related (in some sense) two different sequences are.

      Google cache of PDF: "A Compression Algorithm for DNA Sequences and Its Applications in Genome Comparison".

      --
      it's a blue bright blue Saturday hey hey
    2. Re:How big compressed? by Cygnus78 · · Score: 1

      ...zip it up...

      zip it down ?

  25. Re:first comment by Anonymous Coward · · Score: 0

    dude, imagine how many chicks you are going to get.

  26. Prostitute Schedule for Jan. 17 at the MBOT in SF by Anonymous Coward · · Score: 0
    Folks, check out the updated prostitute schedule for January 17 at the Mitchell Brother's O'Farrell Theater (MBOT) in San Francisco. The MBOT is the most convenient way for you to buy a blow job, a hand job, and full service (i.e. vaginal sexual intercourse).

    I kid you not.

    Please establish a hypertext link to this message. Spread the word!

  27. Standard Units of Measurement... by HockeyPuck · · Score: 1

    As reported on /. the standard units of measurement are:

    Football Fields in Length
    Mt Everest in Height (even tho the avg person has no idea how tall it really is).
    Olympic Sized Swimming Pools in Volume (which again the avg person has no idea)
    Number of Chins in a Chinese phonebook (when talking about someone's momma).

  28. 733t speak! by Orion+Blastar · · Score: 0

    \/\/3 pwn j00! \/\/3 g07 y0ur DN@ 0n 0ur d@7@b@$3 b17ch!

    Anyone thought of the privacy issues of storing human DNA in a public database?

    I am not a number, I refuse to be processed and let some strangers catalog my DNA into a public database.

    --
    Remember, Slashdot does not have a -1 disagree moderation, and no, troll, flamebait, and overrated are not substitutes.
    1. Re:733t speak! by TubeSteak · · Score: 1

      If it doesn't have any identifying information attached to it, why would it matter?

      And by identifying, I mean name, SSN, Age, race, etc

      --
      [Fuck Beta]
      o0t!
    2. Re:733t speak! by Ravatar · · Score: 1

      You are a number. Now sit down, 457579.

    3. Re:733t speak! by Smuffe · · Score: 1

      I am not a number, I refuse to be ...

      Really? I would have figured you for 1337...

    4. Re:733t speak! by jthayden · · Score: 1
      If it doesn't have any identifying information attached to it, why would it matter?

      And by identifying, I mean name, SSN, Age, race, etc



      I suspect it does have species, race, gender and such. Granted this can be derived from the data itself, but I suspect they have a lot of information about the donor in order to categorize the sample. I could see researchers wanting to be able to query up all human mongolian females in order to study some genetic disease/trait.

  29. Compare by Anonymous Coward · · Score: 0

    How much data is this in comparison to the amount that google stores? Seems like google would be storing a lot more.

  30. i love meaningless duping. by Anonymous Coward · · Score: 0

    "Anyone care to translate this into volkswagens, or libraries of congress?"

    How about "number of slashdot dupes"?

    1. Re:i love meaningless duping. by Anonymous Coward · · Score: 0

      0.5 dupes.

  31. Wrong standards by sehryan · · Score: 2, Funny

    These people are obviously not aware that the standard unit of measurements for the press is Rhode Island and Texas. Without phrasing it in these units, I have no idea how much data that really is.

    --
    The world moves for love. It kneels before it in awe.
  32. Re:first comment by clragon · · Score: 0
    dude, imagine how many chicks you are going to get.
    one domestic chicken "only" has 20,000 to 23,000 genes, so, in 22 Terabytes, a lot!
  33. Bigger Than Jesus by DogDude · · Score: 1

    I had my gf read the summary of this article, and she promptly said, "Now that's bigger than Jesus!" :)

    --
    I don't respond to AC's.
    1. Re:Bigger Than Jesus by Anonymous Coward · · Score: 0

      You've got a talking Saint Bernard? That's pretty amazing!

      You should put your fat hairy bitch in the circus, DogFucker.

  34. Use a smaller font by Anonymous Coward · · Score: 0

    idiot...

    What a schtoopidttt analogy.

  35. So what? by Anon.Pedant · · Score: 4, Funny

    I'm not impressed. I already have genetic material all over my computer.

    (Oops, did I just admit something bad?)

    1. Re:So what? by Anonymous Coward · · Score: 0

      did I just admit something bad?

      Only if you think that everyone doesn't already know that you jack off to pictures of cartoon squirrel-women.

    2. Re:So what? by Anonymous Coward · · Score: 0

      oh please. When I was your age I had about a thousand people's genetic material all over my trunk.

  36. A lot of data by Anonymous Coward · · Score: 2, Funny

    Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest.

    Tapping it out on morse code would take 10000 drummers 5 years!

    Expressing it in smoke signals would burn 100 amazon rain forests!

    Putting it in fortune cookies would require flour and sugar with the same approximate mass as the moon!

    And sending it in semaphore would require every man, woman and child on the planet to signal nonstop with every flag ever made until the year 2010!

    That's a lot of data.

  37. Here's your standard by TubeSteak · · Score: 2, Informative
    I'm gonna assume 12 point, single spaced with inch (or inch and a half) margins is pretty standard fare.

    And by standard, I mean: whatever MS Office defaults to

    Diana Hacker's "A Writer's Reference" says the same thing.

    /I'm not a grammar Nazi, I was forced to purchase it many years ago and have kept it handy ever since.

    --
    [Fuck Beta]
    o0t!
  38. Just like in that song... by Errandboy+of+Doom · · Score: 0, Offtopic

    "I made this half-pony half-monkey monster to please you,
    But I get the feeling that you don't like it.
    What's with all the screaming?
    You like monkeys, you like ponies,
    Maybe you don't like monsters so much?
    Maybe I used too many monkeys;
    Isn't it enough to know that I ruined a pony making a gift for you?"
    -Skullcrusher Mountain by Jonathan Coulton

  39. The amazing thing is how SMALL it is. by sbaker · · Score: 5, Insightful

    All this hype about how vastly much paper you get if you print it all out misses the wonder of the thing.

    The wonder isn't how BIG the human genome is - the amazing thing is how *TINY* it is.

    The human genome is 3 billion base pairs...each base pair is one of only four possibilities - so two bits each. 750 Megabytes...that's one CD-ROM. There is a lot of redundancy in it too - many of those base pairs are never 'expressed' as proteins, many are replicated redundantly dozens of times. So with compression, or even just deleting the junk - you'd get it down to maybe 100 to 200 megs - tops.

    I find it utterly amazing that all that complexity is so amazingly compactly encoded.

    Yeah - that's a lot of bits of paper - or 600 floppy disks or some other bullshit - but by the standards of modern media, it's MICROSCOPIC.

    Announcements like this would do better to explain how LITTLE data this really is - that's the wonder of the thing.
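
The two-bits-per-base arithmetic above can be made concrete with a toy encoder. A minimal sketch, assuming sequences contain only A, C, G and T (no ambiguity codes or quality scores); `pack` and `unpack` are illustrative names, not an existing format.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))   # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


seq = "GATTACA"
packed = pack(seq)
print(len(seq), "bases ->", len(packed), "bytes")
print(unpack(packed, len(seq)))

# Scaling the same encoding to 3 billion bases:
print(3 * 10**9 * 2 / 8 / 10**6, "MB")
```

The original length has to be stored alongside the bytes, since the last byte may be padded.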

    --
    www.sjbaker.org
    1. Re:The amazing thing is how SMALL it is. by The+Step+Child · · Score: 4, Interesting

      Just as amazing is that there are only about 25,000 protein coding genes in the entire human genome (though obviously there are more proteins possible through splicing and post-translational modification, but I digress). Also amazing is the precision with which the chromosomes wind up all that DNA. Imagine taking a piece of yarn miles and miles long and compacting it into something that could fit into a paper bag - now imagine someone asking you to take out a VERY specific piece of that yarn and exposing it from your roll, disturbing the rest of the yarn as little as possible, then putting it back exactly as it was before when they're finished with it...that's basically what each chromosome has to do when genes are expressed. And it's all mediated by proteins coded in that very DNA.

  40. On the other hand... by Chris+Snook · · Score: 3, Interesting

    ...the entire database would fit on just one sheet of A(-24) paper. (Yes, I actually did the math.)
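
One way to arrive at a sheet size like this is sketched below. It assumes the figures quoted elsewhere in this thread (800 billion letters, 70 x 50 characters per A4 page), single-sided printing, and the usual A-series rule that A0 is 1 square metre and each step halves the area. The poster's own inputs are not given, so this is a reconstruction and not necessarily the original calculation.

```python
import math

LETTERS = 800 * 10**9           # letters in the archive (figure quoted in the thread)
LETTERS_PER_A4_PAGE = 70 * 50   # one solid page of 12 point text (same source)


def a_series_area_m2(n):
    """Area of an A(n) sheet in square metres: A0 is 1 m^2, each step halves it."""
    return 2.0 ** (-n)


pages = LETTERS / LETTERS_PER_A4_PAGE
needed_m2 = pages * a_series_area_m2(4)

# Largest n (i.e. smallest sheet) whose area still covers the needed area.
n = math.floor(-math.log2(needed_m2))
print(n, a_series_area_m2(n) >= needed_m2)
```

With those inputs the smallest sheet that fits everything is indeed A(-24), which is about 16.8 square kilometres of paper.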

    --
    There's no failure quite as dissatisfying as a complete and total solution to the wrong problem.
  41. Re:w00t! Opensource genetics! by Anonymous Coward · · Score: 0

    Why isn't this thread submitted by Beatles, is he slacking off?

  42. I've read the whole thing.. by tinrobot · · Score: 4, Funny

    I won't give away the ending, but my favorite part is:

    ctattggacttggaatcggatattggacacttggaatcggata

    1. Re:I've read the whole thing.. by caluml · · Score: 1

      Look a little further on, and you'll see a twist: actgacgccggctatataSCOtgctagtagcgtatgctagctagtag. I don't know how the author thinks of things like that... :)

    2. Re:I've read the whole thing.. by Redwin · · Score: 1

      I won't give away the ending, but my favorite part is:

      ctattggacttggaatcggatattggacacttggaatcggata


      Great, that's another twist in the story ruined!

      --
      Warning, comments may not have been passed by the sanity department of my brain.
    3. Re:I've read the whole thing.. by Cygnus78 · · Score: 2, Funny

      I won't give away the ending, but my favorite part is: ctattggacttggaatcggatattggacacttggaatcggata

      Man that's disgusting. Please keep your fantasies to yourself.

  43. This could only be.. by musakko · · Score: 3, Funny
    The Archive is 22 Terabytes in size and doubling every ten months.

    Go FoxPro!

  44. 22 Terabytes! Wow! by Pedrito · · Score: 1

    The Archive is 22 Terabytes in size and doubling every ten months.

    Doubling every 10 months? I think hard drives are doing that as well, or damn close to it. A few years ago, 22 terabytes sounded like a lot, but these days, not so much. I've got half a terabyte in my server and another half in the other two computers in my home and if I didn't regularly burn stuff to DVD, I would have run out of space a long time ago. Terabytes just aren't what they used to be. Well, they are and they aren't.

    1. Re:22 Terabytes! Wow! by jonoton · · Score: 1

      Sure, 22TB isn't an enormous amount of disk space these days, it represents about 5% of the total storage at the institute.

      What you find however is that when you get above a threshold on disk space the cost of the actual disks becomes less relevant than the cost of the infrastructure to support & manage them. There are many 'cheap' RAID arrays out there that will allow you to install large numbers of TB very cheaply, and they are cheap & work - right up until they stop.

      22TB may be a 'small' amount of disk space, but it still takes one hell of a long time to recover that from tape!

      'Been there - done that' not doing it again :)

  45. storage by dartarrow · · Score: 1

    The Archive is 22 Terabytes in size and doubling every ten months.

    Wow, in a coupla years they'll need Google to help them store data

    --
    I love humanity, it is people I hate
    1. Re:storage by metalcup · · Score: 1

      actually, if I remember correctly, Craig Venter has teamed up with google folks to develop tools for analysing genomes - (Source: The Google Story, by David Vise)

      --
      "Laziness is an optimisation protocol"
  46. in other words... by avi33 · · Score: 3, Funny

    All your base (pairs) belong to us.

  47. Re:w00t! Opensource genetics! by Anonymous Coward · · Score: 1, Funny

    Got a torrent?
    I want to print it out to read off screen...

  48. oops by Anonymous Coward · · Score: 0

    ha, i read the title as "Generic database hits 1 billion entries" and i was wondering for awhile what all this talk about genetics was... oops! lol

  49. Print resolution by clambake · · Score: 1

    To grasp how much data is in the Archive, if it were printed out as a single line of text, it would stretch around the world more than 250 times.

    Not at a 100 million DPI it won't.

  50. Doubling? by nick255 · · Score: 1

    The Archive is 22 Terabytes in size and doubling every ten months.

    I doubt that. Surely that means by the end of the day it will be:

    22 * 2^144 Terabytes = 5*10^44 Terabytes

    in size.....I don't even know what you call that!

    1. Re:Doubling? by nick255 · · Score: 1

      D'oh! Just realised I read 10 minutes rather than 10 months!
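
For the record, the two readings can be compared directly. A small sketch, assuming pure exponential growth from 22 TB and an average month of 365.25/12 days; `size_after` is an illustrative helper.

```python
SIZE_TB = 22


def size_after(elapsed, doubling_period):
    """Size in TB after `elapsed` time units, doubling every `doubling_period` units."""
    return SIZE_TB * 2 ** (elapsed / doubling_period)


minutes_per_day = 24 * 60
misread = size_after(minutes_per_day, 10)      # doubling every 10 minutes, for one day

days_per_10_months = 10 * 365.25 / 12
actual = size_after(1, days_per_10_months)     # doubling every 10 months, for one day

print(f"{misread:.1e} TB vs {actual:.2f} TB")
```

The ten-minute reading gives about 5 x 10^44 TB by the end of the day, while the ten-month reading adds only about 0.05 TB.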

  51. For bonus points by Walkiry · · Score: 1

    Now, if you want to do something really cool with that database, you'd blast it against itself using no repeat masking. Or just blast it against the repeats database :-)

    --
    ---- Take the Space Quiz!
  52. To answer all the 'what font' questions.... by jonoton · · Score: 1

    Here are the 'official' calculations....

    The 1 billion traces equates to 800 billion letters of genetic information.

    70*50 is a solid page in Times New Roman 12 point font == 3,500 characters

    100 sheets is 1 cm high = 350,000 letters

    800,000,000,000/350,000 = 2,285,714.29 cm

    So the stack of paper would be 22,857 m high

    22.8 kilometers.

    Mount Everest is 8.848 km high.

    So the stack of paper would be 2 1/2 times the height of Everest.
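
The same calculation as a short script, using only the figures given above (800 billion letters, 3,500 characters per page, 100 sheets per centimetre, Everest at 8.848 km) and assuming single-sided pages:

```python
LETTERS = 800 * 10**9
LETTERS_PER_PAGE = 70 * 50
SHEETS_PER_CM = 100
EVEREST_KM = 8.848

pages = LETTERS / LETTERS_PER_PAGE
stack_km = pages / SHEETS_PER_CM / 100 / 1000   # cm -> m -> km
everests = stack_km / EVEREST_KM

print(round(stack_km, 1), round(everests, 2))
```

That gives a stack of roughly 22.9 km, or about 2.58 Everests, consistent with the "two-and-a-half times" in the announcement.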

  53. Animals on a flash stick by Stan+Vassilev · · Score: 1

    If those are the full sequences, and the bio technology evolves enough that we can build the full sequence out of digital data...

    Woa.. Just imagine the possibilities.

    We won't have to feel guilty for extinct species anymore!

    PS.: Anyone wanna join my safari party next weekend?

  54. The word "The" 251 times around the world by tod_miller · · Score: 1

    The word "The" printed out as a single line could stretch around the world two hundred and fifty-one times, given a sufficiently large font.

    While that is crazy, it begs the question, are they thinking in points? 10? 11? 12? 72? Why didn't they say 500 times? 1000 times? A million times?

    Is there an rfc for this specification of measurement? Can I order things in 'printed word lengths around the world'?

    Can I measure my penis with this?

    Does google calculator support this?

    I shot the sheriff but I sold the deputy some SCO licenses.

    please type the word in this image: sheriff
    random letters - if you are visually impaired, please email us at pater@slashdot.org

    Hello, visually impaired. I hope you are reading this, either in a large font or some braille device.

    Did you email fatboy slim, I mean cowboy neal about this CAPTCHA? What did the tellytubby do about it?

    --
    #hostfile 0.0.0.0 primidi.com 0.0.0.0 www.primidi.com 0.0.0.0 radio.weblogs.com
    1. Re:The word "The" 251 times around the world by smoker2 · · Score: 1
      The word "The" printed out .... Can I measure my penis with this?
      Maybe once ...
  55. DMCA Violation by Anonymous Coward · · Score: 0

    I'm pretty sure that gene information is copyrighted, and the whole project should be canned before some association takes up suing kids for looking up information for science class.

  56. Less work, faster results with the database by cerebis · · Score: 1
    Hang on hang on, you're detailing how to find a gene in a genome by direct experiment (something you do when that's all that you've got to work with), when this article is talking about genomic databases, so bioinformatics should be used to a greater extent.

    Rather than go through the entire process you outline, one could avoid a great deal of the wet work by sequencing the protein and then jumping into computer space, searching the genome database for hits.

    This assumes your organism of study has been sequenced, but that isn't uncommon for a number of reasons.

    1. Re:Less work, faster results with the database by Stachybotris · · Score: 1

      Very true. But they had to start somewhere, right?

      I actually love the databases. At my current job I have almost no use for them, but I still like to cruise around and see what all has been sequenced.

  57. AAaaargh! *SPOILER!* Mod parent down!!! by Thorsten+Timberlake · · Score: 1

    Damn you.

  58. DNA sequence makes up only a small proportion. by cerebis · · Score: 3, Informative
    As this is a trace archive, it stores not just the DNA sequence (ACGT) but also the signal data produced by the machines used in these experiments, which is used to determine the DNA sequence (or basecall).

    The signal data is composed of peaks and troughs across 4 channels, corresponding to the 4 base types. A peak in a channel corresponds to a base of that type passing in front of the detector. In your typical sampling configuration, a peak is made up of about 12 data points.

    Now, since each sampled point in the signal is stored as a 4 byte int and the base for that peak is stored as a 1 byte char, you've got basically a 192:1 ratio of technically superfluous signal data to actual DNA sequence.

    Since there are yet other pieces of information in the file, this ratio is actually larger.

    Of course, there is a good reason for keeping trace data rather than just the DNA sequences, the notion being that you have more information with which to validate the integrity of what you've done. There have been cases where scientific databases have had their data integrity damaged over time by low-quality (i.e. mistaken) submissions.

    In this case, they're retaining the wrong file type, as it doesn't store the original unfiltered data signal, only a heavily filtered and manipulated one. Most modern basecallers start from the original unfiltered data to gain more advantage through better processing, and you cannot do this with the file type they are retaining.
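
    To show where the 192:1 figure comes from, here is a small sketch. It assumes every one of the 4 channels is sampled at each of the roughly 12 points in a peak, which is the reading that reproduces the number; the constant names are illustrative and not taken from any trace file format.

```python
# Sketch: bytes of signal data stored per called base, under the assumption
# that all 4 channels are sampled at each of the ~12 points in a peak.
SAMPLES_PER_PEAK = 12   # data points per peak (typical, per the comment)
CHANNELS = 4            # one channel per base type: A, C, G, T
BYTES_PER_SAMPLE = 4    # each sampled point stored as a 4 byte int
BYTES_PER_BASE = 1      # each basecall stored as a 1 byte char

signal_bytes_per_base = SAMPLES_PER_PEAK * CHANNELS * BYTES_PER_SAMPLE
signal_to_sequence = signal_bytes_per_base // BYTES_PER_BASE

print(f"{signal_bytes_per_base} bytes of signal per base")
print(f"ratio of signal to sequence: {signal_to_sequence}:1")
```

    Other per-file metadata is not counted here, so the real ratio would be somewhat higher.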

  59. If you look for yourself in Google... by neveragain4181 · · Score: 1

    you now get...

    TCGGAGACCAAGGCAAGGAAGCA...
    Mostly human, better watch this one, he might do something soon...
    www.sanger.ac.uk/ - 1.2Tb - Feb 17, 1965 - Cached - Similar pages - Remove result

    AGGCATCGATCAGTCAAGTCAACA...
    Bad speller. Looks kind of odd in daylight...
    www.sanger.ac.uk/ - 1.2Tb - Jan 10, 1970 - Cached - Similar pages - Remove result

    CCGGTGACCAAGGTAAGGATGCA...
    Beneath the Digg comments IQ threshold, mostly harmless
    www.sanger.ac.uk/ - 14k - Sept 11, 1982 - Cached - Similar pages - Remove result

    Try your search again on Google Book Search

                                      Gooooooooooooooogle >

                         Result Page:  1 2 3 4 5 6 7 8 9 10   Next

    (you are meant to smile btw)

  60. 2 columns by Narc · · Score: 2, Interesting

    I can't confirm this, maybe someone can though. I had an Oracle training course last year and the instructor told us she had someone from Sanger working on the human genome stuff, and their database was something daft like 2 columns wide. It was used in an example to explain the intricacies of hot backups and such.

    Interesting if it's true!

    1. Re:2 columns by dbwidders · · Score: 1

      Not sure what you mean by the database being two columns wide. The database in question consists of four major tables, each with one record for each of the 1 billion traces. One of the tables has about 60 columns in it, the others 4-8.

  61. 22TB really ain't that much friend by zeronitro · · Score: 1

    it kinda makes me sad that this is considered a lot (with something useful in science) when AOL has petabytes of AIM logs sitting at their server farms. sad indeed.

  62. Moore trouble ahead by mlush · · Score: 1

    The data doubles every 10 months and computing power doubles every 18 months, so we're going to hit a problem sooner or later...
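
    A back-of-the-envelope sketch of how fast that gap opens up, assuming both trends are clean exponentials with the doubling periods quoted above (10 and 18 months). The function name is hypothetical.

```python
# Sketch: how much faster the data grows than compute, assuming both are
# pure exponentials with fixed doubling periods measured in months.
DATA_DOUBLING_MONTHS = 10
COMPUTE_DOUBLING_MONTHS = 18

# Net doublings of the data-to-compute ratio per month.
net_rate = 1 / DATA_DOUBLING_MONTHS - 1 / COMPUTE_DOUBLING_MONTHS

# Months for the data-to-compute ratio itself to double.
gap_doubling_months = 1 / net_rate

def gap_after(months: float) -> float:
    """Factor by which data has outgrown compute after `months`."""
    return 2 ** (months * net_rate)

print(f"gap doubles every {gap_doubling_months:.1f} months")
print(f"gap after 5 years: {gap_after(60):.2f}x")
```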

  63. Forget about the printing... by Call+Me+Black+Cloud · · Score: 1


    ...can you imagine how much it would cost to have it bound?

    Really, though, they should come up with a better comparison. "If burned to CD, it would take half as many CDs as AOL sends out in a year".

  64. Well, the issue seems quite serious to me by He_Is_Me · · Score: 1

    I'm pretty undecided as to what I should think of this project.
    A sort of "Opensource genetics" organisation seems like a good idea at first. The fact that information likely to help researchers is made public is quite a good thing in my view, be it data about genes, the 1958 census of the Uzbek population, or about how many people in Uzbekistan wear jeans.
    At least, this is far less freaky than a biotech company getting an "exclusive contract" from the Icelandic parliament to get access to the centralized database of all the Icelandic people's genealogical, genetic, and personal medical information. (See details here: http://www.actionbioscience.org/genomic/hlodan.html)
    Yet, the information published by the Sanger Institute seems to be used mainly by private firms (Quote: "Dotcoms are responsible for about 80% of download each week"). I just wonder whether the Institute assesses these firms' goals before letting them download the data. I wouldn't be too glad to learn that they gave it to companies using genetic engineering for purposes other than medical.

    1. Re:Well, the issue seems quite serious to me by mortuusangelus · · Score: 0

      And without purposes other than medical you wouldn't see half the tech that you do now. Hell /. wouldn't exist. :D

      --
      Oh god... not again.
  65. For further reading on sequencing trace files... by Coco+Lopez · · Score: 1
  66. At last! by infochuck · · Score: 1

    Finally, my search for the chosen one might yield something tangible! The Rambaldi device will be MINE! MU-HAHAHAHA!

  67. Yes by Anonymous Coward · · Score: 0

    how many Texases could one wallpaper with the sheets from this hypothetical printout? Anyone? It's a challenge.

  68. Re:w00t! Opensource genetics! by landrol · · Score: 1

    ouch... that hurt... that was the funniest thing I've read in a long time... a fleet of flying super monkeys... rotflmao!!!!

  69. Attn Sanger Employees, Do Not: by recurve7 · · Score: 1

    SELECT * FROM GENOME;

  70. 1 billion traces of non-unique data by theflattman · · Score: 1

    Most of the sequence in the trace archive is from large genome sequencing projects where we intentionally oversample genomes to 6-10x or more. Also, each trace, while averaging 864 characters, only contains about 600 bp of real data. What this means is that the trace archive currently represents about 60 billion basepairs of unique sequence. In human genome numbers that's 20 genomes' worth of data. The approximate output in traces right now is at most about 30 million per month, so the 1 billion traces represent 33 months at the current output of the world's sequencing centers. As the trace archive was started after the human genome project, most of the traces related to the human genome aren't in the repository. It is not difficult to produce massive amounts of sequencing data; the trick is in turning it into something that one can use to answer scientific questions.
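
    The numbers above can be reproduced with a short sketch. It assumes 10x coverage (the upper end of the 6-10x range, which is what gives 60 billion) and a human genome of roughly 3 billion basepairs, a figure not stated in the comment. Variable names are illustrative.

```python
# Sketch of the unique-sequence estimate. Coverage and genome size are
# assumptions chosen to match the figures quoted in the comment.
TRACES = 1_000_000_000
USEFUL_BP_PER_TRACE = 600          # real data per trace, per the comment
COVERAGE = 10                      # assumed: upper end of the 6-10x oversampling
HUMAN_GENOME_BP = 3_000_000_000    # assumed: roughly 3 billion basepairs
TRACES_PER_MONTH = 30_000_000      # current peak worldwide output

raw_bp = TRACES * USEFUL_BP_PER_TRACE          # all usable basepairs
unique_bp = raw_bp // COVERAGE                 # after removing oversampling
human_genomes = unique_bp / HUMAN_GENOME_BP
months_of_output = TRACES / TRACES_PER_MONTH

print(f"unique sequence: {unique_bp:,} bp")
print(f"human genome equivalents: {human_genomes:.0f}")
print(f"months at current output: {months_of_output:.1f}")
```

    At 6x coverage instead of 10x, the same sum gives 100 billion unique basepairs, so the 60 billion figure is best read as a conservative estimate.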