Genome Researchers Have Too Much Data

Last post by Anonymous Coward · 2011-12-02 07:29 · Score: 2, Funny

All previous posts have been purged due to too much data.

Re:Last post by NFN_NLN · 2011-12-02 08:18 · Score: 5, Funny

"There is now so much data, researchers cannot keep it all. One researcher says, 'We are going to have to come up with really clever ways to throw away data so we can see new stuff.'"
Perhaps they can come up with a new type of storage mechanism modeled after nature. They could store this data in tight helical structures and instead of base 2 use base 4.
Re:Last post by edremy · 2011-12-02 09:32 · Score: 4, Informative

The error rate is too high- data copying using that medium and the best available (naturally derived) technology makes an error roughly every 100,000 bases. There are existing correction routines, but far too much data is damaged on copy, even given the highly redundant coding tables.
Then again, it could be worse: you could use the single strand formulation. Error rates are far higher. This turns out to be a surprisingly effective strategy for organisms using it, although less so for the rest of us.

--
"Seven Deadly Sins? I thought it was a to-do list!"
Re:Last post by gorzek · 2011-12-02 09:41 · Score: 1

"RNA does what you want, unless what you want is consistency." -- Larry Wall (sort of)

--
Check out my world simulator thingy.
Re:Last post by cyachallenge · 2011-12-02 13:30 · Score: 1

Maybe there is a use for this... http://en.wikipedia.org/wiki/Write_only_memory

Wrong problem by sunderland56 · 2011-12-02 07:30 · Score: 4, Interesting

They don't have too much data, they have insufficient affordable storage.

Re:Wrong problem by Anonymous Coward · 2011-12-02 07:31 · Score: 0, Informative

Only kind of correct - they also don't really have a clue what it means. It is kind of like reading a binary program and trying to say what the program does.
Re:Wrong problem by TheRealMindChild · 2011-12-02 07:40 · Score: 2

"to the cloud!"

--

"When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
Re:Wrong problem by bugs2squash · 2011-12-02 07:40 · Score: 5, Funny

If only they had some kind of small living cell it could be stored in...

--
Nullius in verba
Re:Wrong problem by jacoby · 2011-12-02 07:43 · Score: 4, Insightful

Yes and no. It isn't just storage. What we have comes off the sequencers as TIFFs first, and after the first analysis we toss the TIFFs to free up some big space. But that's just the first analysis, and we go to machines with kilo-cores and TBs of memory in multiple modes, and many of our tools are not yet written to be threaded.
Re:Wrong problem by TooMuchToDo · 2011-12-02 07:51 · Score: 5, Informative

Genomes have *a lot* of redundant data across multiple genomes. It's not hard to do de-duplication and compression when you're storing multiple genomes in the same storage system.
Wikipedia seems to agree with me:
http://en.wikipedia.org/wiki/Human_genome#Information_content

The 2.9 billion base pairs of the haploid human genome correspond to a maximum of about 725 megabytes of data, since every base pair can be coded by 2 bits. Since individual genomes vary by less than 1% from each other, they can be losslessly compressed to roughly 4 megabytes.
Disclaimer: I have worked on genome data storage and analysis projects.
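The 725 megabyte figure quoted above follows directly from packing each base into 2 bits. A minimal sketch of that packing, assuming an alphabet of only A, C, G and T (no ambiguity codes such as N) and using hypothetical helper names that do not come from any genomics library:

```python
# Illustrative sketch: 2 bits per base, 4 bases per byte.
# The mapping and helper names are assumptions for this example only.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"


def pack_bases(seq: str) -> bytes:
    """Pack a string of A/C/G/T into 2 bits per base."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so unpacking is uniform.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_bases(data: bytes, length: int) -> str:
    """Reverse pack_bases; `length` trims the padding in the last byte."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def packed_size_bytes(n_bases: int) -> int:
    """Bytes needed for n_bases at 2 bits each, rounded up."""
    return (n_bases * 2 + 7) // 8


if __name__ == "__main__":
    seq = "GAATTCGATTACA"
    packed = pack_bases(seq)
    assert unpack_bases(packed, len(seq)) == seq
    print(len(seq), "bases ->", len(packed), "bytes")
    # 2.9 billion base pairs is the figure from the quoted text.
    print(packed_size_bytes(2_900_000_000) / 1e6, "MB")
```

This only covers the raw 2-bit upper bound; the roughly 4 megabyte figure depends on storing differences against a reference, which this sketch does not attempt.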
Re:Wrong problem by Anonymous Coward · 2011-12-02 07:52 · Score: 2, Informative

Only kind of kind of correct - they also don't really have a clue as to the accuracy, with the short read illuminas that dominate, they have problems with repeats and inversions and deletions, the basepairs with hydroxy methyl C or thiophosphate, the sequence of the centromeres and telomeres, and the ability to put contigs into phase with parental genomes....aside from that, it's all peachy
oh yeah, I bet the contamination rates are not real good either (there was a paper a few months ago on this, looking at public data bases, kinda scary)
Re:Wrong problem by rubycodez · 2011-12-02 07:54 · Score: 1

surely storage and transmission can't be an issue, the capacity and bandwidth of a mini-van full of 2TB disks from RAID sets should be sufficient
Re:Wrong problem by tgd · 2011-12-02 07:58 · Score: 1

Since individual genomes vary by less than 1% from each other, they can be losslessly compressed to roughly 4 megabytes.
640K will always be enough!
Re:Wrong problem by GAATTC · 2011-12-02 08:00 · Score: 5, Informative

Nope - the bottleneck is largely analysis. While the volume of the data is sometimes annoying in terms of not being able to attach whole data files to emails (19GB for a single 100bp flow cell lane from a HiSeq2000) it is not an intellectually hard problem to solve and it really doesn't contribute significantly to the cost of doing these experiments (compared to people's salaries). The intellectually hard problem has nothing to do with data storage. As the article states "The result is that the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data.". We just finished up generating and annotating a de novo transcriptome (sequences of all of the expressed genes in an organism without a reference genome). Sequencing took 5 days and cost ~$1600. Analysis is going on 4 months and has taken at least one man year at this point and there is still plenty of analysis to go.
Re:Wrong problem by Samantha+Wright · 2011-12-02 08:00 · Score: 1

Actually, short reads aren't as bad as they may seem from a distance; the lab for which I consult has spent about a year surveying second-gen sequencing platforms, and it turns out that the 5th-generation ABI SOLiD platform finally lives up to its name, even though it uses only ~20 nt reads instead of the Illumina's 100. The chemistry has improved to a point where read quality isn't the biggest issue any more.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Re:Wrong problem by StikyPad · 2011-12-02 08:05 · Score: 5, Funny

Warning: Monkeying with lossy compression for human genomic data may lead to monkeys.

--
https://www.eff.org/https-everywhere
Re:Wrong problem by Anonymous Coward · 2011-12-02 08:11 · Score: 3, Informative

It's not lossy compression.
You store the first human's genome exactly. Then you store the second as a bitmask of the first -- 1 if it matches, 0 if it doesn't. You'll have 99% 1's and 1% 0's. You then compress this.
Of course it's more complicated than this due to alignment issues, etc, but this need not be lossy compression
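The bitmask idea above can be sketched in a few lines. This is a toy illustration, not a real genome format: it assumes the two sequences are already aligned and the same length (the "alignment issues" the comment mentions are ignored), and it stores the mismatching bases alongside the mask, since the mask alone says where the sequences differ but not what the second one contains there. The function names are hypothetical.

```python
import random
import zlib


def encode_against_reference(reference: str, sample: str) -> bytes:
    """Encode `sample` as a match bitmask plus its mismatching bases."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    mask = bytearray((len(sample) + 7) // 8)
    mismatches = []
    for i, (r, s) in enumerate(zip(reference, sample)):
        if r == s:
            mask[i // 8] |= 1 << (i % 8)  # 1 = matches the reference
        else:
            mismatches.append(s)  # keep the differing base so nothing is lost
    payload = bytes(mask) + "".join(mismatches).encode("ascii")
    return zlib.compress(payload, 9)


def decode_against_reference(reference: str, blob: bytes) -> str:
    """Rebuild the sample exactly from the reference and the encoded blob."""
    payload = zlib.decompress(blob)
    mask_len = (len(reference) + 7) // 8
    mask = payload[:mask_len]
    mismatches = iter(payload[mask_len:].decode("ascii"))
    out = []
    for i, r in enumerate(reference):
        matches = (mask[i // 8] >> (i % 8)) & 1
        out.append(r if matches else next(mismatches))
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    n = 10_000
    reference = "".join(rng.choice("ACGT") for _ in range(n))
    sample = list(reference)
    # Mutate about 1% of positions to a different base.
    for pos in rng.sample(range(n), n // 100):
        sample[pos] = rng.choice([b for b in "ACGT" if b != reference[pos]])
    sample = "".join(sample)

    blob = encode_against_reference(reference, sample)
    assert decode_against_reference(reference, blob) == sample  # lossless
    print("sample bases:", n, "encoded bytes:", len(blob))
```

The round trip is exact, which is the point the comment is making: delta encoding against a reference is lossless as long as the differences themselves are kept.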
Re:Wrong problem by Remus+Shepherd · 2011-12-02 08:12 · Score: 1

So compressed, you have 4 megabytes of data...per individual. 7 billion individual human beings means you potentially need 28 petabytes of storage...
That's just for human beings. If we look at the sequences of non-human species the storage needed expands exponentially. Why, even if we used efficient DNA storage to keep all this data, we'd need a whole planet just to house it.

--
Genocide Man -- Life is funny. Death is funnier. Mass murder can be hilarious.
Re:Wrong problem by msauve · 2011-12-02 08:14 · Score: 1

"they have insufficient affordable storage."

I've got an idea to solve that, which I'm going to patent.

You store the sequence as a chain of different types of molecules (I'll call them "base pairs") which can link together, that way the storage will take up really minimal space. You could even have a chemical process which replicated the original, to produce more of the original.

--
"National Security is the chief cause of national insecurity." - Celine's First Law
Re:Wrong problem by Anonymous Coward · 2011-12-02 08:14 · Score: 1

"The amount stored has more than tripled...taking up nearly 700 trillion bytes of computer memory"
Let's see, 1 trillion bytes is 1TB. Have they considered Newegg? Even with the Thailand flooding disaster, Western Digital Caviar Green WD30EZRX 3TB hard drives are going for $299.99, plus $7.86 shipping. They'd need (Dr. Evil pinkie to corner of the mouth maneuver) 234 of these hard drives. Without quantity or government discount this comes out to be a little over $72K. Somehow I can't see a federal program set up to act as a centralized database straining under this load. What did they think they were budgeting for? An accumulation of Slashdot comments?
P.S. C. Titus Brown, quoted in the article, used to be at CalTech and was part of the local Python group.
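The drive estimate above can be checked with a short calculation. This sketch uses only the numbers from the comment and assumes decimal terabytes, shipping charged per drive, and no allowance for redundancy, filesystem overhead, servers or power:

```python
import math

TOTAL_BYTES = 700 * 10**12   # "nearly 700 trillion bytes" from the quoted article
DRIVE_BYTES = 3 * 10**12     # one 3TB drive, decimal terabytes as marketed
PRICE_PER_DRIVE = 299.99     # price quoted in the comment
SHIPPING_PER_DRIVE = 7.86    # assumption: shipping is paid once per drive


def drives_needed(total_bytes: int, drive_bytes: int) -> int:
    """Whole drives required to hold total_bytes with no redundancy."""
    return math.ceil(total_bytes / drive_bytes)


def total_cost(n_drives: int, price: float, shipping: float) -> float:
    """Purchase plus shipping cost for n_drives."""
    return n_drives * (price + shipping)


if __name__ == "__main__":
    n = drives_needed(TOTAL_BYTES, DRIVE_BYTES)
    print(n, "drives, about $%.2f" % total_cost(n, PRICE_PER_DRIVE, SHIPPING_PER_DRIVE))
```

Any mirroring or backup copy multiplies the result, so this is a floor on the raw disk cost rather than a budget for a working archive.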
Re:Wrong problem by Anonymous Coward · 2011-12-02 08:19 · Score: 0

Whoosh! me away but i thought the actual data capacity of dna was rather limited...
but then again you dont have to place everything in 1 place
class ameba
string DNA { DNA = "evolushiuooon"
class lawyer public : ameba
string DNA { DNA ="WRITE ERROR CONTACT TECH SUPPORTwrWRUQHRIU#QWHR#r3r13"
Re:Wrong problem by bunratty · 2011-12-02 08:36 · Score: 0

They have plenty of storage. It's just Darwinists covering up evidence of ID by throwing away the evidence that points to that conclusion. That way the researchers get to keep their jobs that allow them to bilk millions from the government. Oh, wait, this doesn't involve climate change, so I'll be modded down instead of up.

--
What a fool believes, he sees, no wise man has the power to reason away.
Re:Wrong problem by Anonymous Coward · 2011-12-02 08:37 · Score: 0

So compressed, you have 4 megabytes of data...per individual. 7 billion individual human beings means you potentially need 28 petabytes of storage...
I'm not sure what you think you're computing, but just to let you know somatic DNA mutates fairly easily. I've got more than just one DNA sequence in my body. I've probably got millions of unique sequences.
Re:Wrong problem by YaHooL · 2011-12-02 08:37 · Score: 1

They don't have too much data, they have insufficient affordable storage.
or insufficient affordable "cpu time". Performing tasks like "Multiple sequence alignment", Finding "Motifs in sequences" and even running a "Basic Local Alignment Search Tool" means using probabilistic models which requires a lot of computation.
Re:Wrong problem by Anonymous Coward · 2011-12-02 08:57 · Score: 1

Just call the Magratheans, they'll sort it out.
Re:Wrong problem by Anonymous Coward · 2011-12-02 08:57 · Score: 0

Transcriptome datasets like the one you're describing are on a lower end of the dataset spectrum - high-coverage cancer genome or meta-genome samples can easily exceed a terabyte, and when you have hundreds of these coming in, storage does become an issue.
Re:Wrong problem by next_ghost · 2011-12-02 08:59 · Score: 1

28 petabytes only if you save each file separately. If you store multiple files in one archive, it'll be much smaller.
Re:Wrong problem by hesaigo999ca · 2011-12-02 09:25 · Score: 1

I suggested way back to google about storage capacity for gmail, to implement a pointer system that would mark identical emails across multiple accounts and use one version of the same files but with multiple pointers to that file, which is what many people do with compression algorithms. I know in their system for GF and GFS2 they also do many threaded calls that also cross check for duplication.
Maybe someone ought to set up these tools to recognize similar patterns and set up compression.
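The pointer scheme described above is usually called content-addressed deduplication. A minimal in-memory sketch, assuming whole-object SHA-256 hashing; the `DedupStore` class is hypothetical and is not a description of how Gmail or any Google file system actually works:

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical payloads are kept only once."""

    def __init__(self):
        self._blobs = {}  # digest -> bytes, stored once
        self._names = {}  # name -> digest (the "pointer")

    def put(self, name: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # setdefault keeps the existing copy if this content was seen before.
        self._blobs.setdefault(digest, data)
        self._names[name] = digest
        return digest

    def get(self, name: str) -> bytes:
        return self._blobs[self._names[name]]

    def stored_bytes(self) -> int:
        """Bytes actually held, counting each unique payload once."""
        return sum(len(b) for b in self._blobs.values())


if __name__ == "__main__":
    store = DedupStore()
    attachment = b"GATTACA" * 1000
    store.put("alice/inbox/1", attachment)
    store.put("bob/inbox/7", attachment)  # same bytes, only a new pointer
    print("logical bytes:", 2 * len(attachment), "stored bytes:", store.stored_bytes())
```

Hashing whole objects only collapses byte-identical copies. Two genomes that differ in a handful of positions would hash differently, so this approach would need chunking or reference-based encoding before it helped with near-duplicates.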
Re:Wrong problem by yodleboy · 2011-12-02 09:28 · Score: 1

out of curiosity, what's the analysis? are you looking for something specific? comparing to something else? poking around to see what looks interesting? all of the above? thx
Re:Wrong problem by Hognoxious · 2011-12-02 09:29 · Score: 2

Given recent events in Thailand, it might be wise to replace the mini-van with something that floats.

--
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Re:Wrong problem by Hognoxious · 2011-12-02 09:40 · Score: 1

You store the sequence as a chain of different types of molecules (I'll call them "base pairs") which can link together, that way the storage will take up really minimal space. You could even have a chemical process which replicated the original, to produce more of the original.
But the vessels that are used to perform those chemical processes take up a hell of a lot of space.
Bizarrely, if the size of the vessels is measured in non-metric they're considerably bigger.

--
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Re:Wrong problem by Anonymous Coward · 2011-12-02 09:41 · Score: 0

Whoosh! me away but i thought the actual data capacity of dna was rather limited...
but then again you dont have to place everything in 1 place
class ameba string DNA { DNA = "evolushiuooon"
class lawyer public : ameba string DNA { DNA ="WRITE ERROR CONTACT TECH SUPPORTwrWRUQHRIU#QWHR#r3r13"
I'll woosh you. You're aware that storing the information about the DNA of a living cell... would normally be done by the DNA of that living cell... and thus all you (theoretically) need is one cell of the original subject... right? It wasn't a DNA computer joke...
Re:Wrong problem by buchner.johannes · 2011-12-02 10:01 · Score: 2

To be clear, the problem is this. The sequencing (cheap now) produces a lot of strips of a few DNA elements. They are overlapping, and it's unknown which position they are from.
So the difficulty is to arrange those strips to reproduce the original DNA sequence. It is an NP-hard problem, no wonder Moore's law doesn't outrun that!

--
NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
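To make the "arrange overlapping strips" problem concrete, here is a toy greedy assembler. One common formalization of assembly, shortest common superstring, is NP-hard, so practical tools rely on heuristics; the greedy merge below is one such heuristic and is not guaranteed to find the best answer. It assumes error-free reads from a single strand with no repeats longer than the minimum overlap, which real sequencing data does not satisfy, and the function names are made up for this example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the largest overlap.

    Returns the list of contigs left when no pair overlaps by at least
    `min_len` bases. This is a heuristic, not an exact solver.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i == j:
                    continue
                k = overlap(a, b, min_len)
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return reads
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)


if __name__ == "__main__":
    # Three overlapping reads from the made-up sequence ATGCCGTAAGCT.
    print(greedy_assemble(["TAAGCT", "ATGCCG", "CCGTAA"]))
```

Each pass compares every pair of reads, so the cost grows quickly with the number of reads, and a single sequencing error in an overlap region is enough to stop two reads from merging. That is part of why assembly remains harder in practice than this sketch suggests.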
Re:Wrong problem by c6gunner · 2011-12-02 10:25 · Score: 2

So compressed, you have 4 megabytes of data...per individual. 7 billion individual human beings means you potentially need 28 petabytes of storage...
I'm not sure why you'd want to store the genome of every human on the planet, but for that kind of project 28 petabytes is peanuts. The newest IBM storage array is 120-ish petabytes. We're talking about storing 4 megabytes per person. In the modern world, most people have at least a 4 gigabyte flash drive. I could store the genomic information of myself, all my relatives, and all my friends, and still have space left over.
Re:Wrong problem by gr8_phk · 2011-12-02 10:26 · Score: 1

To be clear, the problem is this. The sequencing (cheap now) produces a lot of strips of a few DNA elements. They are overlapping, and it's unknown which position they are from. So the difficulty is to arrange those strips to reproduce the original DNA sequence. It is an NP-hard problem, no wonder Moore's law doesn't outrun that!
I thought that was a solved problem. I don't see it as being too difficult. Certainly not NP-hard. If I can solve this problem, how do I cash in on it?
Re:Wrong problem by complete+loony · 2011-12-02 10:28 · Score: 1

I don't know if this is the same kind of DNA related problem, but a friend of mine did his PhD on searching DNA. I don't think anyone is using the results of his work yet.

--
09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
Re:Wrong problem by Daniel+Dvorkin · 2011-12-02 10:36 · Score: 2

I thought that was a solved problem.
No, sequence assembly is still an area of active research; here is a sampling of papers published on the problem this year alone. Part of the problem is that "next gen" sequencing produces reads which are less reliable the farther down the fragment you go -- and the fragments are short, so there are a hell of a lot of them to reassemble. The overall volume of sequencing is getting bigger and cheaper all the time, but there are some really serious reliability problems that need to be ironed out.

--
The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.
Re:Wrong problem by TooMuchToDo · 2011-12-02 10:39 · Score: 1

I had more storage than that across both spinning disk and tape archives at Fermilab working on the USCMS Tier-1 team. 28PB? Not that big of a deal.
Re:Wrong problem by TooMuchToDo · 2011-12-02 10:40 · Score: 1

This is how Dropbox works storing objects in S3, but they do it with 2MB chunks of data.
Re:Wrong problem by rubycodez · 2011-12-02 10:51 · Score: 1

A PT boat full of disk has even more bandwidth and capacity! Hell, that sounds like a good Black Lagoon story arc
Re:Wrong problem by Anonymous Coward · 2011-12-02 11:29 · Score: 0

28 PB is certainly doable. Maybe $3 million or so.
http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets/
Re:Wrong problem by GAATTC · 2011-12-02 11:55 · Score: 2

We're trying to do a good job with the annotation, which includes manually curating the gene families we are interested in and characterizing splicing isoforms, and we're looking for genes/gene families that may be expanded or unique and provide us with insights into the evolution of the unique morphological structures we study in our critter.
Re:Wrong problem by GAATTC · 2011-12-02 11:59 · Score: 1

Completely true - I did not mean to make light of the storage issues that come along with big genomic data sets. The point was more that storage issues are easier to address (you can for the most part throw $$ at these issues until you get to really big data sets) than the challenges of analyzing the data which cannot necessarily be solved with brute force approaches.
Re:Wrong problem by WillAffleckUW · 2011-12-02 12:22 · Score: 1

A lot of what you think of as noise - inserts, deletes, misfolds - is in fact, either viral rewrites of DNA segments or is actually genetic programming that allows DNA to adapt and express differently in different biochemical and environmental conditions - kind of like how we used to code chip storage to store more data than we could use in the old days before RAM or storage was cheap.
The problem is that we lack the open-source tools and the ability to recognize what we need.

--
-- Tigger warning: This post may contain tiggers! --
Re:Wrong problem by WillAffleckUW · 2011-12-02 12:31 · Score: 1

Mod parent up.
Sequencing is pretty cheap now. Analysis and processing is expensive. Storage isn't that big of a deal, really, if you do it right, although SNPs do churn up TB of data storage.

--
-- Tigger warning: This post may contain tiggers! --
Re:Wrong problem by WillAffleckUW · 2011-12-02 12:32 · Score: 1

You store the sequence as a chain of different types of molecules (I'll call them "base pairs") which can link together, that way the storage will take up really minimal space. You could even have a chemical process which replicated the original, to produce more of the original.
But the vessels that are used to perform those chemical processes take up a hell of a lot of space.
Bizarrely, if the size of the vessels is measured in non-metric they're considerably bigger.
Not really, you can store DNA in paraffin or in freezers. You're thinking functional biochemical DNA in a working unit, not DNA used for transcription.

--
-- Tigger warning: This post may contain tiggers! --
Re:Wrong problem by Kjella · 2011-12-02 12:49 · Score: 1

So the difficulty is to arrange those strips to reproduce the original DNA sequence. It is an NP-hard problem, no wonder Moore's law doesn't outrun that!
What does that even mean? The length of a human genome is for all practical purposes fixed, so scaling is utterly irrelevant. And even so, whether a problem is NP-hard has little relevance for whether we have a practical solution for any given n. n^1000000 is polynomial, 2^0.00000001n is not. If you're trying to analyze n genomes then that should be a simple O(n) scaling. Now it's possible that sequencing technology means n is increasing really, really fast but that doesn't mean the complexity of sequencing a genome changes at all.

--
Live today, because you never know what tomorrow brings
Re:Wrong problem by Anonymous Coward · 2011-12-02 15:23 · Score: 0

They don't have too much data, they have insufficient affordable storage.
For real smart guys apparently they're too stupid to make compressed data formats and 3TB drives that are about $100 or less.
Assuming a genome is like 4GB, gzipping would probably make it a quarter that size if not less because of the massive amounts of repetition.
Re:Wrong problem by elyons · 2011-12-02 16:59 · Score: 2

Well, as others have said, this is kind of correct. After sequencing, the raw reads (short sequences of DNA) are assembled into either transcripts or genome fragments (usually called contigs). This leads to a great reduction in the amount of data, but there is a lot of concern by scientists over whether or not to save all the raw data for future work. My take is that unless the sample is impossible to collect DNA/RNA from again, then toss it and assume that the sequencing technology will be better/faster/cheaper/longer in the future.

I'm actually involved with a large US National Science Foundation project to help build the cyberinfrastructure to help handle these data and analyses: the iPlant Collaborative: http://iplantcollaborative.org./ In addition, I maintain a set of web-based software for comparative genomics: CoGe, http://genomevolution.org./ From the standpoint of genomes, I adopted the philosophy of building a system that can easily accommodate new versions of existing genomes and new genomes. Thus, as new data becomes available, they get quickly loaded into the system and made available for analysis by any of the existing tools or compared to any of the already loaded genomes. So far, the system has scaled quite well and it is storing over 16,000 genomes from over 12,500 organisms. While the science is a lot of fun (sort of like the ultimate video game except no one knows the rules and there are no pre-built user interfaces), it is awesome to see how quickly the number of sequenced genomes has grown over such a short period of time. This is driven by how cheap the technology has become to use and the quantity of data that can be produced. For those interested, the National Human Genome Research Institute keeps track of this and has some very informative graphs: http://www.genome.gov/SequencingCosts/.

While it has also been said, the analyses and interpretation of these data is extremely rate limiting. Lots of opportunity for folks with programming, algorithm, data visualization, web, and user interface experience.
Re:Wrong problem by StikyPad · 2011-12-02 17:39 · Score: 2

I didn't say it was lossy compression, I was just warning against it... though judging by your response, it may already be too late!

--
https://www.eff.org/https-everywhere
Re:Wrong problem by Anonymous Coward · 2011-12-02 18:31 · Score: 0

While deduplication helps a lot for the actual reads (sequences) the problem lies more in the storage of the data before: the "raw data".
Let me tell you: scientists want to keep that data.
That could be a few million TIFF pictures (Illumina GA) or Intensity files (HiSeq). These are much less redundant.
But with something like ZFS online compression you could get compression factors of about 2.
Disclaimer: I'm working on genome data storage and administration of a small analysis cluster.
Every working day I'm busy with archiving sequencing runs to tape or moving data around. And moving a few TB over the network is no fun.
Currently we have about 500 TB live data and it's growing.
Fortunately the newest machines don't let so much "raw data" off of the instrument and do the first stage analysis on-line.
Re:Wrong problem by mlush · 2011-12-02 20:37 · Score: 2

The internet needs to catch up first.
At my Uni I can get ~80Mbps download 40Mbps upload speed. One high throughput sequencer can generate ~700GB/day (1) so it would take 1.6 days to upload 1 day's worth of data. For a small lab it may just be possible to improve the upload speed enough to get by on, but with little to no margin for downtime.
(1) this data can be discarded after analysis but needs to be retained for at least 2-3 months in case a reanalysis is needed
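The 1.6 day figure above can be reproduced with a short calculation. This sketch assumes decimal gigabytes and megabits, a link that sustains its full rated upload speed around the clock, and no protocol overhead, all of which are optimistic:

```python
SECONDS_PER_DAY = 86_400


def upload_days(gigabytes_per_day: float, upload_mbps: float) -> float:
    """Days needed to upload one day's output at the given link speed."""
    bits = gigabytes_per_day * 1e9 * 8
    seconds = bits / (upload_mbps * 1e6)
    return seconds / SECONDS_PER_DAY


def required_mbps(gigabytes_per_day: float) -> float:
    """Sustained upload speed needed just to keep pace with the sequencer."""
    bits = gigabytes_per_day * 1e9 * 8
    return bits / SECONDS_PER_DAY / 1e6


if __name__ == "__main__":
    # 700 GB/day and 40 Mbps are the figures from the comment.
    print("days per day of data: %.2f" % upload_days(700, 40))
    print("break-even upload speed: %.1f Mbps" % required_mbps(700))
```

The break-even speed is the rate at which the backlog stops growing. Anything at or just above it leaves no room for outages, which matches the "little to no margin" remark.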
Re:Wrong problem by mlush · 2011-12-02 20:57 · Score: 1

So the difficulty is to arrange those strips to reproduce the original DNA sequence. It is an NP-hard problem, no wonder Moore's law doesn't outrun that!
What does that even mean? The length of a human genome is for all practical purposes fixed, so scaling is utterly irrelevant.
Who says we're just (re)sequencing the Human genome? There are plenty of Model Organisms in the pipeline. Then there are things like the Human microbiome project.
Re:Wrong problem by Thiez · 2011-12-03 01:00 · Score: 1

> If we look at the sequences of non-human species the storage needed expands exponentially.
Actually that would merely be a linear expansion.
Re:Wrong problem by Anonymous Coward · 2011-12-03 03:16 · Score: 0

I'm not sure why you'd want to store the genome of every human on the planet,
You need to ask this question to the head of DHS or DoJ (in the USA, for all the foreigners who hate assumed American centricity). I think they would very much like to store the full genome of every human on the planet. Think GATTACA, but more sinister. Think about every crime that could be solved. Think about being able to truly determine who someone really is. Think about the children...
Re:Wrong problem by ultranova · 2011-12-03 07:04 · Score: 1

So compressed, you have 4 megabytes of data...per individual. 7 billion individual human beings means you potentially need 28 petabytes of storage...

28 petabytes is 28 000 terabytes. Currently, a terabyte of hard disk space seems to cost about 100 euros. That makes a total cost of about 2 800 000 euros, or about 5 600 000 euros (roughly 7 500 000 US dollars) assuming 100% redundancy.
In other words, it costs peanuts, even with no compression.

That's just for human beings. If we look at the sequences of non-human species the storage needed expands exponentially.

No, it doesn't. Most of the genes between various species are the same, since they code the same proteins. Consequently, a useful - normalized - database of world's genomes grows fast for a while, but then the growth slows since all the proteins are already there; the same is true of an archive that simply lists all the genes of all species and is compressed with any decent compressor.

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Re:Wrong problem by Pseudonym · 2011-12-03 21:13 · Score: 1

The length of a genome is fixed, but the amount of data coming off a sequencer isn't. In the 80s, Sanger sequencing technology would produce 1000 bases or so with essentially no errors, very slowly and expensively. Modern machines give you 100 bases with some errors extremely quickly and cheaply. We've traded accuracy for volume and price. We now sample genomes with hundreds of times coverage knowing that there will be a lot of errors, and volume and read length are improving faster than accuracy.
Having said that, while the length of a human genome is fixed, researchers would also like to assemble larger ones with less favourable repeat structures, such as plants. And then there's RNA transcripts, which add yet another layer of complexity (e.g. multiple isoforms, which are sometimes indistinguishable from read errors).

--
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});

Nope by masternerdguy · 2011-12-02 07:32 · Score: 3, Insightful

No such thing as too much data on a scientific topic.

--
To offset political mods, replace Flamebait with Insightful.

Re:Nope by Anonymous Coward · 2011-12-02 07:40 · Score: 0

Um, yes there can if the data kept is less valuable than the data not able to be stored.
Re:Nope by blair1q · 2011-12-02 07:42 · Score: 2, Insightful

Sure there is.
They're collecting data they can't analyze yet.
But they don't have to collect it if they can't analyze it, because DNA isn't going away any time soon.
It's like trying to fill your swimming pool before you've dug it. I hope you have a sealed foundation, because you've got too much water. You might as well wait, because it's stupid to think you'll lose your water connection before the pool is done.
Same way they've got too much data. No reason for them to be filling up disk space now if they can just get the data again when they know what to do with it.
Re:Nope by Anonymous Coward · 2011-12-02 08:12 · Score: 0

Trying to fill up your pool before you dig really gets in the way; it makes it more difficult to actually do the digging. Having too much research data shouldn't have any effect on the analysts. They don't need to start looking at each new genome that comes in. It can just sit on the hard drives of the people doing the sequencing.
Re:Nope by Anonymous Coward · 2011-12-02 08:33 · Score: 1

I disagree. This is similar to the problems that astronomers/astrophysicists, and high energy physicists have. Bigger telescopes and bigger, higher resolution detectors but way too much data to analyze in a lifetime. Same with experiments going on at CERN where there is a huge amount of data coming out and a lot of effort is devoted into simply filtering out information that isn't relevant to the particular experiment.
Re:Nope by blair1q · 2011-12-02 09:00 · Score: 1

If they're discussing it with us, it's in their way.
Take as much data as you need to test your analysis methods, then ramp up data collection when your analysis works.
Re:Nope by ace37 · 2011-12-02 09:19 · Score: 1

Chuck Testa.
Re:Nope by Anonymous Coward · 2011-12-02 09:44 · Score: 0

Bollocks. That's not too much data, it's too little storage.
Re:Nope by Hognoxious · 2011-12-02 09:57 · Score: 1

That's like saying don't buy any more books until you've read the ones you've already got. Or don't download any more pr0n until ... well, you get the drift.

--
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Re:Nope by blair1q · 2011-12-02 11:48 · Score: 1

>That's like saying don't buy any more books until you've read the ones you've already got.
Yes, it is. And? If you have too many books to fit into your house, you're probably not going to be able to read them all anyway. When someone develops the Kindle, get that.
>Or don't download any more pr0n until
Interestingly, I stopped downloading that a while ago. There's no need. I know there will be plenty more out on the web.

Bad... by Ixne · 2011-12-02 07:33 · Score: 3, Insightful

Throwing out data in order to be able to analyze other data, especially when it comes to genes and how they interact, sounds like one of the worst ideas I've heard.

Re:Bad... by Samantha+Wright · 2011-12-02 08:06 · Score: 3, Informative

Although that isn't quite what we're talking about here, reductionism in biology has been an ongoing problem for decades. Traditional biochemists often reduce the system they're examining to simple gene-pair interactions, or perhaps a few components at once, and focus only on the disorders that can be succinctly described by them. That's why very small-scale issues like haemophilia and sickle-cell anaemia were sorted out so early on. As diseases with larger and more complex origins become more important, research and money is being directed toward them. Cancer has been by far the most powerful driving force in the quest to understand biology from a broader viewpoint, primarily because it's integrally linked to a very important, complicated process (cell replication) that involves hundreds if not thousands of genes, miRNAs, and proteins.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Re:Bad... by Anonymous Coward · 2011-12-02 08:11 · Score: 0

Forgetting is essential.
What did you have for lunch on May 23, 1986?
High energy physics experiments call selectively looking at interesting things "triggering". The rate of raw data pouring out of a high energy experiment is nigh unstorable by mankind.
Re:Bad... by JEBowers · 2011-12-02 08:22 · Score: 1

The problem with the raw data (and dna sequence) is a lot of it is wrong (errors). When confronted with a large data set with errors it is often best to reduce it to the portion that is more correct, than to treat all data as correct for later analysis. For some sorts of analysis such as genome assemblies this may be the only realistic way to proceed.
Re:Bad... by maiki · 2011-12-02 09:17 · Score: 1

It's tremendously useful to throw out data sometimes. It's called feature pruning. Get rid of the noise and the patterns become more lucid.
The problem here, it appears, is that sequencing is becoming cheaper and faster than processing the data, so unless the storage/transfer/processing methods and resources improve at a similar rate, they're guaranteed to be overwhelmed with data.
In other words, "Sure I can process this data in an hour. Put it in the queue and I'll get to it in about a year." Processing the data then takes a year, and this wait time will only increase.
Re:Bad... by Hognoxious · 2011-12-02 10:00 · Score: 1

Re: your sig.
It would appear by your name that you are in fact a biologistess.

--
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Re:Bad... by Hognoxious · 2011-12-02 10:10 · Score: 1

In other words, "Sure I can process this data in an hour. Put it in the queue and I'll get to it in about a year." Processing the data then takes a year, and this wait time will only increase.
That's like saying in 1811 that it'll take a century to get somewhere, because at the time the fastest thing is a horse or a sailboat. But by 1911 there's trains and steamships.

--
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Re:Bad... by Anonymous Coward · 2011-12-02 10:19 · Score: 0

Throwing out the raw data is common in many branches of science, including my own (radio astronomy). Let me try an analogy:
Say that you're doing a study on the number of geese in a particular lake. You set up a camera so that it takes a photo of the lake every few minutes. Every month, you collect the data from the camera, and run an image-recognition program that looks at the pictures, picks out the geese, and tells you how many of them there are. You now have the data you wanted, which is the number of geese at each moment in time - so you can delete the original pictures if you like, to save space.
In this case, the "pictures" are TB of raw data from the gene sequencing machines, the "numbers" are the GB of actual gene sequences, which losslessly compress down to MB of diffs.
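A minimal sketch of the "diffs" idea in the analogy above, assuming a sample is stored only as its differences from a shared reference sequence. The function names are hypothetical, and the sketch handles substitutions only (equal-length sequences, no insertions or deletions), which real variant formats do not assume.

```python
def diff_against_reference(reference, sample):
    """Return (position, base) pairs where sample differs from reference.

    Assumption: both sequences have the same length, so only
    substitutions are represented (no insertions or deletions).
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_diff(reference, diff):
    """Rebuild the sample from the reference plus its stored differences."""
    bases = list(reference)
    for position, base in diff:
        bases[position] = base
    return "".join(bases)


reference = "ACGTACGTACGT"
sample = "ACGTACCTACGA"

diff = diff_against_reference(reference, sample)
restored = apply_diff(reference, diff)
print(diff)                 # [(6, 'C'), (11, 'A')]
print(restored == sample)   # True
```

The round trip is lossless because the reference is kept; only the per-sample storage shrinks to the handful of positions that differ.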
Re:Bad... by Samantha+Wright · 2011-12-02 10:50 · Score: 1

I don't think too many Latinists would support that construction, but it's certainly semantically correct.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Re:Bad... by EricScott · 2011-12-02 12:50 · Score: 2

I specialize in compressing (20:1), in real-time, stock quotes (and anything else that trades). On the order of a few million records/sec, a few billion records a day. I'd like to take a look at a sample file that needs to be compressed -- based on what I've read so far, I'm thinking my algorithm classes might just work right out of the box. How difficult would it be to obtain this information to test?
Re:Bad... by Anonymous Coward · 2011-12-02 14:31 · Score: 0

As others have mentioned, it's not the compression, per se, that's a problem. Up until fairly recently, any time a protein was sequenced it went into a central government database (or two). Now that high throughput sequencing is on the verge of being used for things like routine medical diagnosis, there's a need to decide what goes into the databases and what doesn't - or perhaps for some new specialized databases containing different subsets of the data.
But if you want to see what the data looks like you could try the NCBI Blast databases.
Re:Bad... by maiki · 2011-12-02 16:41 · Score: 1

That's like saying in 1811 that it'll take a century to get somewhere, because at the time the fastest thing is a horse or a sailboat. But by 1911 there's trains and steamships.
I feel like you may have missed the point. Adapting your analogy, consider that the sequencing step speeds up too, and is now an airplane. How will the trains and steamships keep up?
Re:Bad... by Anonymous Coward · 2011-12-02 21:24 · Score: 0

Check out the Sequence Squeeze competition; there are example datasets available. If your compression algo is up to scratch, you can win yourself US$15,000: http://www.sequencesqueeze.org/

ASCII storage? by Anonymous Coward · 2011-12-02 07:34 · Score: 0

ACGT... 4 symbols only in this alphabet. I hope they're not storing it in ASCII form ;)
If so, better get this bzip2 or lzma compressor going.

Re:ASCII storage? by Samantha+Wright · 2011-12-02 08:11 · Score: 3, Informative

ASCII storage of nucleotide and protein information is actually very standard. The most widespread format is called FASTA, named after the fast alignment program that introduced it. When you sequence a whole genome on a second-generation sequencing platform (like Illumina or SOLiD), there's a step in the process where you end up with a huge (10-100 GB) text file containing little puzzle pieces of DNA that must then be assembled by a specialized program. These files usually don't hang around very long, but the point of keeping them in this inefficient storage format is, simply, performance: CPUs are oriented toward byte-based computing at a minimum, and so frequent compression/decompression becomes prohibitively inefficient.

Big biotechnology purchases are typically hundreds of thousands of dollars though, so most labs are used to shelling out for this kind of price bracket.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Re:ASCII storage? by wezelboy · 2011-12-02 10:31 · Score: 1

Actually, performance would be enhanced by ASCII to 2-bit compression. The performance cost to compress is much less than the I/O throughput gained.
Re:ASCII storage? by Samantha+Wright · 2011-12-02 10:53 · Score: 1

And what about the performance penalty incurred by needing to bitshift a nucleotide quartet around within a single byte register every single time you wanted to actually work with the data? Busses be damned; at some point you have to do individual base pair comparison, and that's by far the most common operation in sequence analysis.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Re:ASCII storage? by wezelboy · 2011-12-02 14:48 · Score: 1

In some cases you can use masks instead of shifting. Even if you do shift bits, processors are so fast compared to I/O that it is still faster.
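A rough sketch of the ASCII to 2-bit packing discussed in this exchange, assuming an alphabet of exactly A, C, G and T (no N or other ambiguity codes). The encoding table and function names are illustrative choices, not any particular tool's format, and pure Python will not show the speed argument either way; it only shows the shift-and-mask mechanics.

```python
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # assumed 2-bit code
DECODE = "ACGT"                                        # index = 2-bit code


def pack(seq):
    """Pack an ACGT string into bytes: four bases per byte, first base in the high bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        shift = 6 - 2 * (i % 4)
        out[i // 4] |= ENCODE[base] << shift
    return bytes(out)


def base_at(packed, i):
    """Extract base i with one shift and one mask."""
    shift = 6 - 2 * (i % 4)
    return DECODE[(packed[i // 4] >> shift) & 0b11]


def unpack(packed, length):
    """Recover the original string; length is needed because the last byte may be padded."""
    return "".join(base_at(packed, i) for i in range(length))


seq = "ACGTTGCA"
packed = pack(seq)
print(packed.hex())          # 1be4
print(unpack(packed, 8))     # ACGTTGCA
print(base_at(packed, 5))    # G
```

One consequence of this layout is that two packed bytes XOR to zero exactly when all four of their bases match, so equality checks can run four bases at a time without unpacking.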
Re:ASCII storage? by Samantha+Wright · 2011-12-02 16:36 · Score: 1

I'll remember that! You might win me a paper some day. :)

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Re:ASCII storage? by wezelboy · 2011-12-02 17:36 · Score: 1

I think the UCSC Genome browser has been doing this for a while on their backend, so it might not be new.
Re:ASCII storage? by Samantha+Wright · 2011-12-02 17:59 · Score: 1

Doubtful, but introducing an optimization into a novel domain can always be valuable incremental progress.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!

Work! by Anonymous Coward · 2011-12-02 07:35 · Score: 0

I see an opportunity for work, and jobs.

Re:Work! by Anonymous Coward · 2011-12-02 07:58 · Score: 3, Funny

I see an opportunity for work, and jobs.
Wozniak. He is called Wozniak. But opportunity will have to wait, because Jobs is dead. Sorry to break it to you like this.
Come on, every story has an Apple angle, if you look at it the right way.. in fact, I bet those researchers could store all that data on an iPod if they wanted! You can plug it right in and sync with iTunes!
Re:Work! by Samantha+Wright · 2011-12-02 08:14 · Score: 2

Bioinformatics is indeed a very lucrative profession, but few programmers have the willingness to memorize the huge canon of data while they're in college that is required to be proficient in it. The curriculum is about 70% computer science and 30% life sciences, including organic chemistry at some universities.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Re:Work! by WillAffleckUW · 2011-12-02 12:38 · Score: 1

Bioinformaticians also have to wrap their heads around the fact that biology changes while you measure it, and that not everything that can be measured is binary on/off or straight digital, but the very methodology of measurement needs to match what you're measuring.
A lot of good computer programmers will never be good bioinformaticians, just as a lot of good biologists or biochemists will never be good bioinformaticians. It is possible training can help some of them, but unlikely it could help all of them.

--
-- Tigger warning: This post may contain tiggers! --

Time for the scientists to ge to work by Hentes · 2011-12-02 07:35 · Score: 4, Insightful

Most scientific topics are like this: there is too much raw data to analyze it all. But a good scientist can spot the patterns and can distinguish between important stuff and noise.

Re:Time for the scientists to ge to work by BagOBones · 2011-12-02 07:38 · Score: 5, Insightful

Research team finds important role for junk DNA
http://www.princeton.edu/main/news/archive/S24/28/32C04/
Except in the field of DNA they still don't know what is and is not important.

--
EA David Gardner -"... but the consumers have proven that actually what they want is fun."
Re:Time for the scientists to ge to work by blair1q · 2011-12-02 07:45 · Score: 1

A good scientist will design the experiment before collecting the data. If he spots patterns, it's because something interesting happened to another experiment. Then he'll design a new experiment to collect data on the interesting thing.
Seriously, this is a non-problem. Don't waste resources keeping and managing the data if you can make more. And I can't imagine how you can't make more data from DNA. The stuff is everywhere.
Re:Time for the scientists to ge to work by Nyall · 2011-12-02 07:48 · Score: 1

Time for scientists to get to work? What an elegantly simple solution.
The next time I have to debug something, maybe my first step should be identifying the problem [taken from Dilbert..]

--
http://en.wikipedia.org/wiki/Jury_nullification
Re:Time for the scientists to ge to work by Hentes · 2011-12-02 07:59 · Score: 1

That's exactly what makes science interesting, when new better models show that some of the data previously discarded as junk can also be predicted. But making a perfect model would require infinite resources, so sometimes tradeoffs have to be made.
Re:Time for the scientists to ge to work by bberens · 2011-12-02 08:01 · Score: 1

And I can't imagine how you can't make more data from DNA. The stuff is everywhere.
I work in a cheap motel you insensitive clod!

--
Check out my lame java blog at www.javachopshop.com
Re:Time for the scientists to ge to work by sirlark · 2011-12-02 08:18 · Score: 4, Insightful

A good scientist will design the experiment before collecting the data. If he spots patterns, it's because something interesting happened to another experiment. Then he'll design a new experiment to collect data on the interesting thing.
Flippant response: A good scientist doesn't delete his raw data...
More sober response: Except to do an experiment said scientist might need a sequence. And that sequence needs to be stored somewhere, often in a publicly accessible database as per funding stipulations. And that sequence has literally gigabytes more information than he needs for his experiment, because he's only looking at part of the sequence. Consider also that sequencing a small genome may take a few days in the lab, but annotating can take weeks or even months of human time. And the sequence is just the tip of the iceberg, it doesn't tell us anything because we need to know how the genome is expressed, and how the expressed genes are regulated, and how they are modified after transcription, and how they are modified after translation, and how the proteins that translation forms interact with other proteins and sometimes with the DNA itself. Life is messy, and singling out stuff for targeted experimentation in the biosciences is a lot more difficult than in physics, and even chemistry.

Seriously, this is a non-problem. Don't waste resources keeping and managing the data if you can make more. And I can't imagine how you can't make more data from DNA. The stuff is everywhere.
Sequencing may be getting cheaper, but it's not so cheap that scientists facing funding cuts can afford to throw away data simply to recreate it. Also, DNA isn't the only thing that's sequenced or used. Proteins are notoriously hard to purify and sequence, and RNA can also be difficult to get in sufficient quantities. The only reason DNA is plentiful is because it's so easy to copy using PCR, but those copies are not necessarily perfect.
Re:Time for the scientists to ge to work by Samantha+Wright · 2011-12-02 08:18 · Score: 1

Transposons are interesting and complex, but they don't play much of a role in mammals. Intergenic DNA is still important in that it provides scaffolding (an active chromosome resembles a puff-ball with all of the important genes at the outside edges, where they're most accessible to incoming proteins) and flex room (sometimes proteins will actually bend DNA and pinch it to make sure the important genes stick out) but so far we believe that the actual sequence of most of the human genome isn't very important. 95% of it appears to be under no evolutionary pressure (that is, even if it mutates, the organism is fine.)

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Re:Time for the scientists to ge to work by Samantha+Wright · 2011-12-02 08:21 · Score: 1

Fortunately, most biologists are unimaginative, and the medical establishment's coffers are bottomless, so really only four genomes ever actually get much mileage: human, rat, mouse, and chimpanzee. Perhaps a parasite or virus here and there. I weep for plant biologists.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Re:Time for the scientists to ge to work by Anonymous Coward · 2011-12-02 11:34 · Score: 0

S. cerevisiae and A. thaliana both have been sequenced and have been the subject of numerous ChIP-Seq and RNA-Seq studies. Additionally, considerable effort is being put into sequencing Arabidopsis accessions. Also there is an impressive 2008 bisulfite analysis of the A. thaliana genome.
Re:Time for the scientists to ge to work by blair1q · 2011-12-02 11:51 · Score: 1

>Sequencing may be getting cheaper, but it's not so cheap that scientists facing funding cuts can afford to throw away data simply to recreate it.
They should, in their original budget, have determined that they were able to do something with it before they budgeted money to create it.
If they didn't, then they failed in their original budgeting, and the problem isn't so much that we have too much data and not enough brainpower, but that we simply aren't applying any brainpower to the part of the lifecycle of the scientific process.
They wasted their (probably my) money, and now they're asking for more? Nuh-uh. Someone else's turn.
Re:Time for the scientists to ge to work by Anonymous Coward · 2011-12-02 12:55 · Score: 0

...the medical establishment's coffers are bottomless...
Which is great if you're a medical doctor but most bioinformatics research is funded by governments - and a lot of governments these days are focused on "austerity". For example, I've got a friend in bioinformatics with a PhD in biochemistry and solid programming skills who makes about $60K per year - which is nothing to complain about - but he did have to move to Asia for his job - and when his 3 year contract runs out next year he (and his family) will be back out on the street looking for work again.

...so really only four genomes ever actually get much mileage: human, rat, mouse, and chimpanzee. Perhaps a parasite or virus here and there...
For basic science the other genomes do get quite a bit of mileage (i.e. when you build up a multiple sequence alignment you might easily include homologs from 20 different organisms). But in a certain sense I agree: the immediate problem facing bioinformatics is huge numbers of slightly different versions of the same human or viral sequence.
For example, there's a database of "all" unique (non-redundant) protein sequences (NCBI nr). The database comes in various formats, but a common format is "Fasta", which is a text file with each sequence having a header line followed by some lines of sequence data. Well, there was this one bioinformatics program that had hard-coded a maximum line length of 32K characters. The thing is, when this non-redundant database is constructed all the headers for identical sequences are concatenated into a single header line, and someone had gone and sequenced a bunch of hepatitis B viruses from the population, and there were so many versions of the same sequence that the header line exceeded 32K and the program crashed. Now I know this isn't a "hard" problem to solve - but the point is that the paradigm in sequence data is rapidly changing: the sequence databases are being flooded with multiple versions of the same sequence.
And this is really the future. A couple of weeks ago I had to take my daughter to the doctor for a bad cough and very high fever. Of course, the key question is whether it's a bacterial infection and, truth is, the doctor just didn't know. But imagine that you run some phlegm through a high throughput sequencer - of course you're going to get all kinds of human sequences from the immune system cells but you'll probably also get some viral or bacterial sequence. In another year or two the sequencing itself will only cost $100 or so - and with the right bioinformatics analysis of the sequence data you could get a definitive answer on whether the infection is bacterial or viral.
It used to be that any and every sequence was deposited in a central government database (or two) but when, in a few years, high throughput sequencing is used for routine diagnostics then that just won't work anymore - at least not in the current form.
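The 32K header crash described in this comment comes from reading lines into a fixed-size buffer. As a hedged illustration, here is a minimal FASTA reader that makes no assumption about line length; `read_fasta` is a hypothetical helper written for this example, not the API of any existing library, and it ignores format details such as comment lines.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream.

    Lines are read whole, so neither headers nor sequence lines have
    a built-in maximum length.
    """
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# A header well past 32K characters, like the concatenated nr headers described above.
long_header = "seq1 " + "x" * 40000
text = ">" + long_header + "\nMKT\nAYI\n>seq2\nGG\n"

records = list(read_fasta(io.StringIO(text)))
print(len(records))            # 2
print(len(records[0][0]))      # 40005
print(records[0][1])           # MKTAYI
```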
Re:Time for the scientists to ge to work by smi.james.th · 2011-12-02 19:04 · Score: 1

That's easier said than done though. Especially depending on what format the data is in. One needs to be able to look at it to spot patterns, and if it's too much to even look at...
I am not a geneticist, I'm an electronic engineer, and as such I know a bit about information theory. I understand that the human genome contains quite a lot of information...

--
One thing I know, and that is that I am ignorant...
Re:Time for the scientists to ge to work by Anonymous Coward · 2011-12-02 20:55 · Score: 0

Not to mention that the data might not be easily reproducible. I work in cancer research, and we tend to do large-scale genotyping and CGH (and methylation, and phosphorylation, and mRNA/miRNA/protein - expression) instead of sequencing, but the same ideas apply. If the patient in question is cured - or dead - there's no more tumor tissue to be had, so we have to be really careful with what we have. (We have run into having too little material left to do interesting studies on old cohorts.)
Re:Time for the scientists to ge to work by sirlark · 2011-12-02 21:41 · Score: 1
What you're suggesting is that scientists be actively prohibited from using previously gathered data to perform secondary or even novel experiments. It's not a failure in budgeting. You can't budget for the experimental outcome, because YOU DON'T KNOW WHAT IT IS YET, that's why you're performing the fucking experiment in the first place. And because people like you tend to end up in decision making positions regarding scientific funding, budget items like #3 below just never get funding, and often the only way to get funding for #2 is to hide money away elsewhere in the budget. Otherwise, put that research on hold until the next grant cycle rolls around.
1. Experiment to determine if X is true
2. If X is true then Y is likely, and that costs a bunch less than Testing for Z
3. If X is not true, then Z is likely and this is really interesting but costs a whole fuck load to test for
They should, in their original budget, have determined that they were able to do something with it before they budgeted money to create it.
They do. They perform their experiment, but why should they then have to toss the data away? Car analogy: You want to drive to middleofnowheresville but no road exists yet. For the sake of argument off-roading is not an option, so you need to build a road. It's not ridiculously expensive to do so, but not so cheap you can afford to do it every time you need to get there. You do (this is data generation). You drive there (this is data analysis/further experimentation). And then you send some dudes back along the road to tear it up (this is what you advocate), OR you could leave the road there, because other people can use it too (do different experiments with the data).
Re:Time for the scientists to ge to work by sunzoomspark · 2011-12-03 01:14 · Score: 1

They are starting to use whole genome analysis clinically. http://www.completegenomics.com/
That data might be from a patient with cancer, not an experiment in a lab. Comparing samples over time might provide valuable information, like what changed after the last round of treatment.
Re:Time for the scientists to ge to work by ultranova · 2011-12-03 10:01 · Score: 1

We are not talking about experiments. We are talking about data acquisition. Acquiring data faster than you can record is a failure of logistics. That's all there is to it - and, frankly, it's a rather pathetic problem, seeing how cheap hard drive space is, even during this time of a spike (100 euros/terabyte).

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

Seems by Anonymous Coward · 2011-12-02 07:37 · Score: 0

Seems like we need to stop sequencing genomes until we've figured out if there's anything useful we can do with all that data.

I don't see how "too much data" can be a problem. Just stop taking in the new data, concentrate on the data sets you already have and only get more when you find a gap in what you need.

Have they tried compression? by Anonymous Coward · 2011-12-02 07:40 · Score: 0

I mean, it's only the same four letters after all.

Re:Have they tried compression? by Samantha+Wright · 2011-12-02 08:24 · Score: 1

Can't analyse a compressed sequence. Gotta decompress it first. Disks are cheaper than time.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Re:Have they tried compression? by whereiswaldo · 2011-12-06 07:46 · Score: 1

Imagine a 1GB file compressed to 25% of the original size (so 250MB).
Decompress the file to memory (not disk!) and process it. My bet is that decompression time will be the least of your worries: the time saved by reading only 250MB of data from disk instead of 1GB will be the more dominant factor in the total time taken.
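A small sketch of the decompress-while-reading approach described above, using Python's standard `gzip` module so the file stays compressed on disk and is decompressed on the fly as it is read. The file name and the `count_bases` helper are made up for the demo, and whether this beats reading uncompressed data depends on the disk and CPU involved; the sketch shows the pattern, not a benchmark.

```python
import gzip
import os
import tempfile


def count_bases(path):
    """Count bases in a gzip-compressed FASTA file without writing a decompressed copy."""
    counts = {}
    with gzip.open(path, "rt", encoding="ascii") as handle:
        for line in handle:
            if line.startswith(">"):      # skip FASTA header lines
                continue
            for base in line.strip():
                counts[base] = counts.get(base, 0) + 1
    return counts


# Demo on a tiny temporary file (hypothetical name and contents).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.fa.gz")
    with gzip.open(path, "wt", encoding="ascii") as out:
        out.write(">demo\nACGT\nAACC\n")
    counts = count_bases(path)

print(counts == {"A": 3, "C": 3, "G": 1, "T": 1})   # True
```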
Re:Have they tried compression? by Samantha+Wright · 2011-12-06 07:58 · Score: 1

That seems to be the consensus among Python programmers! (There's another comment on this story from a bioinformatician who basically agrees with you, but out of experience and not expectation.) Someone's probably done a study somewhere.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!

So, create a public DNA museum of sequences by Anonymous Coward · 2011-12-02 07:42 · Score: 0

So, create a public DNA museum of sequences.

I assume that some of this data will be useful, one day.

Re:So, create a public DNA museum of sequences by bberens · 2011-12-02 08:02 · Score: 1

So, create a public DNA museum of sequences.
They have those, they call it the "public school system."

--
Check out my lame java blog at www.javachopshop.com
Re:So, create a public DNA museum of sequences by Samantha+Wright · 2011-12-02 08:27 · Score: 2

Done: NCBI, DDBJ, and Ensembl all perform that role. The problem is what to do with all of it.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!

They should learn by hbar+squared · 2011-12-02 07:43 · Score: 4, Insightful

...from CERN. Sure, the Grid was massively expensive, but I doubt genome researchers are generating 27 TB of data per day.

Re:They should learn by Anonymous Coward · 2011-12-02 08:06 · Score: 0

Pretty close... the article claims a single center (BGI in China) outputs the equivalent of 2,000 human genomes per day. That's over 6 trillion A,C,G,T's per day, which is the processed result. If you include the signal generated by the sequencing instruments before converting to nucleotides it's much, much more data. Now, multiply that by the 5 or so centers like BGI around the world, and add all the smaller independent labs, and I bet it's way more data than CERN produces per day. However, biology has the advantage that it's decentralized and they all are not working on the same problem and data at the same time.
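The "over 6 trillion" figure can be checked with back-of-envelope arithmetic. The genome length used here (about 3.1 billion bases for one haploid human genome) is an assumption added for the calculation, and the estimate ignores sequencing coverage, quality scores and raw instrument signal, all of which make the real volume larger.

```python
GENOMES_PER_DAY = 2000              # figure quoted from the article
BASES_PER_GENOME = 3_100_000_000    # assumption: approximate haploid human genome length

bases_per_day = GENOMES_PER_DAY * BASES_PER_GENOME

ascii_tb = bases_per_day / 1e12         # one byte per base, in decimal terabytes
two_bit_tb = bases_per_day / 4 / 1e12   # four bases per byte

print(bases_per_day)   # 6200000000000
print(ascii_tb)        # 6.2
print(two_bit_tb)      # 1.55
```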
Re:They should learn by AtomicDevice · 2011-12-02 08:28 · Score: 1

Plus, they could make a digital frontier to reshape the human condition

--
Ze Atomic Device! It iz Ztolen!
Re:They should learn by Samantha+Wright · 2011-12-02 08:33 · Score: 2

As an aside, BGI is not just any centre, it is the centre. Much like biochemists send their crystals to a synchrotron for X-ray crystallography, biologists send their samples to BGI to get them sequenced. They own something like 180 high-throughput sequencing instruments, which is about 5-10% of the installed base, give or take.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Re:They should learn by Anonymous Coward · 2011-12-02 09:38 · Score: 0

At BGI they have 180 machines... each run from a machine is approximately 3 TB of raw data. A single run takes one week. That is 77 TB per day of data being produced. And that is only BGI.. there are at least 180 machines outside of BGI scattered across the world. So imagine 140TB per day.
CERN is nothing compared to Genomic data.
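The arithmetic behind the 77 TB figure, using only the numbers quoted in this comment (180 machines, 3 TB per run, one run per week). The helper name is hypothetical, and the second line simply applies the same assumptions to 360 machines, i.e. BGI plus the "at least 180" elsewhere.

```python
def daily_output_tb(machines, tb_per_run, days_per_run):
    """Average raw output in TB per day if every machine runs back to back."""
    return machines * tb_per_run / days_per_run


bgi = daily_output_tb(machines=180, tb_per_run=3, days_per_run=7)
combined = daily_output_tb(machines=360, tb_per_run=3, days_per_run=7)

print(round(bgi, 1))        # 77.1
print(round(combined, 1))   # 154.3
```

Under those same assumptions the combined figure comes out nearer 154 TB per day than 140, though both are rough estimates that assume every machine is always running.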
Re:They should learn by Anonymous Coward · 2011-12-02 10:15 · Score: 0

a single center (BGI in China) outputs the equivalent of 2,000 human genomes per day.

Isn't it the same one repeated 2000 times? They all look the same to me.
Re:They should learn by Anonymous Coward · 2011-12-02 11:22 · Score: 0

One full HiSeq run basically generates 4.5TB. You can have multiple runs per day if you have more than one machine, so yes, it's entirely possible. Once the bases are called, though, this drops to 200-300GB/run.
Re:They should learn by Anonymous Coward · 2011-12-02 13:55 · Score: 1

As an aside, BGI is not just any centre, it is the centre.
Hmm, when I'm putting together a multiple sequence alignment and I see that the sequence came from BGI I cringe a bit. A lot of the BGI sequences that I've seen have been a mess - lots of XXXs and problems with correctly determining the beginning and end of the sequence. But I hear they're cheap and easy - essentially the Walmart of high throughput sequencing.
Re:They should learn by Arabani · 2011-12-02 16:08 · Score: 2

At BGI they have 180 machines... each run from a machine is approximately 3 TB of raw data. A single run takes one week. That is 77 TB per day of data being produced. And that is only BGI.. there are at least 180 machines outside of BGI scattered across the world. So imagine 140TB per day.
CERN is nothing compared to Genomic data.
When LHC is running at full luminosity, it produces roughly a megabyte per event per detector (for CMS and ATLAS at least). Of course, the events are happening at ~40MHz, so 288 TB of raw data per hour. That's why they have to trigger, and hence throw out 99% of the data.

Genomic data is nothing compared to elementary particles.
Re:They should learn by Samantha+Wright · 2011-12-03 12:40 · Score: 1

Pretty much. I was asked to flip through some Illumina sequence data from BGI in August—more than half of the reads were garbage.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!

Is it .. by ackthpt · 2011-12-02 07:43 · Score: 3, Interesting

Is it outpacing their ability to file patents on genome sequences?

--

A feeling of having made the same mistake before: Deja Foobar

as a genome researcher by ecorona · 2011-12-02 07:44 · Score: 5, Informative

As a genome researcher, I'd like to point out that I, for one, do not have nearly enough genome data. I simply need about 512GB of RAM on a computer with a hard drive that is about 100x faster than my current SSD, and processing power about 1000x cheaper. Right now, I bite the bullet and carefully construct data structures and implement all sorts of tricks to make the most out of the RAM I do have, minimize how much I have to use a hard drive, and extract every bit of performance available out of my 8 core machine. I wait around and eventually get things done, but my research would go way faster and be more sophisticated if I didn't have these hardware limitations.

Re:as a genome researcher by Overzeetop · 2011-12-02 08:01 · Score: 4, Insightful

It will come, but it doesn't make the wait less frustrating. I'm an aerospace engineer, and I remember building and preparing structural finite element models by hand on virtual "cards" (I'm not old enough to have used actual cards), and trying to plan my day around getting 2-3 alternate models complete so that I could run the simulations overnight. In the span of 5 years, I was building the models graphically on a PC, and runs were taking less than 30 minutes. Now, I can do models of foolish complexity and I fret when a run takes more than a minute, wondering if the computer has hung on a matrix inversion that isn't converging.
You should, in some ways, feel lucky you weren't trying to do this twenty years ago. I understand your frustration, though.
Just think - in twenty years, you'll be able to tell stories about hand coding optimizations and efficiencies to accommodate the computing power, as you describe to your intern why she's getting absolute garbage results from what looks like a very complete model of her project.

--
Is it just my observation, or are there way too many stupid people in the world?
Re:as a genome researcher by martas · 2011-12-02 08:03 · Score: 1, Offtopic

Goatse alert.

--
weinersmith
Re:as a genome researcher by GameboyRMH · 2011-12-02 08:07 · Score: 1

Oh god someone modded this informative. IT'S GOATSE, GENIUSES.

--
"When information is power, privacy is freedom" - Jah-Wren Ryel
Re:as a genome researcher by Anonymous Coward · 2011-12-02 08:15 · Score: 0

I find it hilarious that you use 'she' instead of 'he' in reference to your intern.
Funny how even in a field of aerospace engineering, interns are assumed female. :D
Re:as a genome researcher by Samantha+Wright · 2011-12-02 08:35 · Score: 1

I think the point is more like "in 20 years, there won't be any men left in the STEM fields."

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Re:as a genome researcher by Hognoxious · 2011-12-02 10:21 · Score: 1

I simply need about 512GB of RAM
What? Why can't you make do with 640k?

--
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Re:as a genome researcher by Anonymous Coward · 2011-12-02 10:23 · Score: 0

I'm surprised you're not using one of the many massively parallel super-computers out there. Many of them have idle time. You'd probably need to re-write your code to be more parallel but it seems like a win.
Re:as a genome researcher by sunzoomspark · 2011-12-03 01:36 · Score: 1

piss off
Re:as a genome researcher by lars_stefan_axelsson · 2011-12-03 23:54 · Score: 1

I think the point is more like "in 20 years, there won't be any men left in the STEM fields."
Joke or not it's an interesting question. When it comes to fields like chemistry, biology etc. females have been the clear majority for some time (although not at the highest levels); in medicine they are now the majority (something like 60% here in Sweden); some fields are holdouts, surgery being one, but I think that's going. The veterinarians have been slapped on the wrist for trying to accept male students on a quota (otherwise there would be none) so that trend is pretty clear as well.
When it comes to the "TEM" though the picture is bleaker. Even though I had a record 20% females in my (comp. sci./eng.) masters programme this year, the levels have been discouraging for the past 20 years. We might be seeing a small encouraging trend in maths, but it's not much to write home about.
So, that's the billion dollar question. What could/should we do to even out the gender inequality in these fields? In Sweden we've had the debate of not enough women at the highest rungs of corporate leadership for quite some time, but my experience from places like Ericsson, with maybe 20%-30% women, many of them (as much as 80%) in management is that the "problem" isn't necessarily that women don't advance up the corporate leadership ladder (yes there's a "glass ceiling" effect, I'm not denying that), but rather that they don't become technical specialists. The lack of female CEOs isn't nearly as striking as the lack of female nerds when you think about it. I've had ten or so female bosses/project managers/whatnot, but only one socially awkward technically adept female colleague...
I can't help but think that in the "beaker intensive" fields, the picture must be different. There the specialists can't be an all male club, in fact I'd expect the opposite, but don't really have any experiences or facts to base that on.

--
Stefan Axelsson
Re:as a genome researcher by Samantha+Wright · 2011-12-04 05:29 · Score: 1

There are two big factors that drive gender participation in the sciences: cultural attitudes and, alas, raw biology. While women and men have the same mean for many traits such as intelligence, the standard deviation for men is wider; i.e. there are more smart men and more stupid men than smart women and stupid women. As a result, when you take a cross-section of the population at one far end or the other, you get more men in low-income situations and more men suitable for certain positions in academia. We're pretty sure this is because the two copies of the X chromosome balance each other out. A number of other traits behave similarly.

The cultural issues are much more complex. One (the reality), the academic culture is not very well-suited to the traditional socialized setting of women (who prefer to collaborate more rather than maintaining complete independence) and indeed the competitive nature of publishing is responsible for women not generally remaining in science long enough to get doctorates or professorships. Two (the expectation), girls are still not regularly raised with the same focus on their development as intellectuals, and so by the time college is available to them they only see it as an instrument for supporting their careers, and not integral to the direction of their lives (a consistent statistical aberration arises when computer scientists and mathematicians reproduce). Three (the reflection), culture still doesn't have a lot of strong images of women in these fields in it (movies, books, shows), since culture's primary function is to reflect the current state of things, and it is notoriously bad at being inspirational unless it's regurgitating a biography; also there's not much culture about computer scientists, engineers, and mathematicians in the first place.

At the same time, this last cultural reason is driving male participation in academia down. With (toxic) exceptions like Gregory House, there aren't many male heroes left in culture that are clearly defined as having an intelligent upbringing; there are no astronauts in the sky and no Shakespeareans captaining the Enterprise. What smart protagonists do remain are morally ambiguous characters, like House, that are more interesting from a literary perspective but have some pretty bad effects on society when viewed in reflection. Instead, boys in the United States are being raised solely by action heroes, and disappearing straight into the military or business; the popularity of hip-hop culture caused a similar (but much more rapid) disaster for middle-class blacks a few years ago.

The outcome of all this is that my university's first-year computing courses have more than 50% female attendance in class sizes of over a hundred students, and our second-year computing courses have about 20% female attendance in class sizes of no more than thirty.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!

Isn't it compressable? by BlueCoder · 2011-12-02 07:44 · Score: 2

I would figure most genomes are highly compressible. Especially if compressed against thousands of samples of a species and even across different species.

I have half my mother's genome and half my father's. I couldn't have that many mutations. To store all three genomes couldn't take more than 2.0001 times the size of a human genome.

Re:Isn't it compressable? by Derekloffin · 2011-12-02 07:59 · Score: 2

That is what I was thinking. Maybe they just need a more customized compression algorithm. The problem there, I suppose, is that figuring out matches can be an expensive operation in itself.
Re:Isn't it compressable? by Anonymous Coward · 2011-12-02 08:01 · Score: 0

Exactly. But watch out for the alien species called "the butterfly" -- 380 chromosomes for that super species (compared to 46 for us humans). "The butterfly" probably flew to earth with their "fern" plants, which have upwards of 1,260 chromosomes.
Re:Isn't it compressable? by oodaloop · 2011-12-02 08:01 · Score: 1

I would figure most genomes are highly compressible
I know right? I can fit all of my DNA inside of a single cell! When will these people learn?

--
Tic-Tac-Toe, Global Thermonuclear War, and relationships all have the same winning move.
Re:Isn't it compressable? by Anonymous Coward · 2011-12-02 08:14 · Score: 0

Multiple human genomes can be stored very efficiently, but there is a huge production of data that comes from sources that don't have a well studied reference. Take the field of "metagenomics" for example where they sequence the DNA from a patch of dirt or a sample of the ocean. The vast majority of organisms in these samples have never been seen or characterized before so you can't do a reference based compression. Compression of these types of sequences using something like bzip only gives a 3-5x reduction.
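For anyone who wants to check the "3-5x" figure, here is a rough sketch in Python. It assumes the sequence is stored as plain text (one byte per base) and uses uniformly random bases as a stand-in for an uncharacterized sample, which is a simplification. The function name `bzip_ratio` is made up for the example.

```python
import bz2
import random


def bzip_ratio(n_bases: int = 200_000, seed: int = 0) -> float:
    """Return original_size / compressed_size for a random A/C/G/T text."""
    rng = random.Random(seed)
    # One ASCII byte per base, the way a plain sequence file stores it.
    sequence = "".join(rng.choices("ACGT", k=n_bases)).encode("ascii")
    compressed = bz2.compress(sequence)
    # Sanity check: the compression is lossless.
    assert bz2.decompress(compressed) == sequence
    return len(sequence) / len(compressed)


ratio = bzip_ratio()
print(f"compression ratio: {ratio:.2f}x")
```

Each random base carries 2 bits of information and the text spends 8 bits on it, so a general-purpose compressor cannot do better than 4x on input like this. Real reads have repeats (which help) and quality scores (which hurt), so treat this only as a ballpark.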
Re:Isn't it compressable? by complete+loony · 2011-12-02 10:12 · Score: 1

The finished product of the analysis can be stored reasonably efficiently. But I don't think that's what they are talking about. I believe this has more to do with the memory / disk / cpu load of putting all the pieces of the jigsaw together.

--
09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.

Time to outsource these efforts by bogaboga · 2011-12-02 07:45 · Score: 1

I think these researchers should look at outsourcing these efforts, and China now has bragging rights to the fastest computer.

After all most of our electronics are all imported. It's sad, but what do you do when "...the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data..." as the intro to this submission says?

Re:Time to outsource these efforts by Punchcardz · 2011-12-02 08:13 · Score: 1

FTFA: "BGI, based in China, is the world’s largest genomics research institute, with 167 DNA sequencers producing the equivalent of 2,000 human genomes a day."
Re:Time to outsource these efforts by the+gnat · 2011-12-02 08:27 · Score: 1

I think these researchers should look at outsourcing these efforts, and China now has bragging rights to the fastest computer.
Except they don't - the Japanese just brought a system online that is around 3x more powerful.
But a more general issue is that you don't need a conventional supercomputer to analyze genomic data - you just need a lot of aggregate processing power. Supercomputers are good for serial numerical methods like molecular dynamics, climate modeling, or simulating nuclear explosions (the only reason anyone cares that China is "beating" the US). Genome analysis can easily be done with distributed power, and there is plenty of that in the US, although probably not enough to be dedicated 24/7 to genome analysis. Besides, the problem of "too much" data is as much an issue of researcher time as it is of technology. Computers can take care of the brute force stuff like sequence assembly, but genomics isn't simply about generating files full of ACGT - ultimately the goal is to come up with new hypotheses about disease, evolution, etc., and that takes actual brains.

Re:Well... by Baloroth · 2011-12-02 07:46 · Score: 2

Oh hey look you made another account to goatse /. with. Good job.

--
"None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton

Re:Well... by tom17 · 2011-12-02 07:46 · Score: 0

Goatse!

Damn, I have not been goatsed in YEARS :(

Steve Yegge is on the way. by doom · 2011-12-02 07:46 · Score: 1

But stand back! Steve Yegge is on the way to show them how to get things to scale:

https://www.youtube.com/watch?v=vKmQW_Nkfk8

Where does it all come from? by WaffleMonster · 2011-12-02 07:51 · Score: 3, Funny

I was under the impression the complete DNA sequence for a human can be stored on an ordinary CD.

Given the amount of data mentioned in TFA it begs the question what the hell are they sequencing? The genome of everyone on the planet?

Re:Where does it all come from? by godrik · 2011-12-02 08:07 · Score: 1

no, the genome of every bacteria in a soil sample. We (I work as a computer scientist in a genome related research lab) do not work only on human genomes.
Re:Where does it all come from? by Daniel_Staal · 2011-12-02 08:11 · Score: 1

Nope.
Every living being on the planet. And as many of the dead ones as they can get their hands on.

--
'Sensible' is a curse word.
Re:Where does it all come from? by Punchcardz · 2011-12-02 08:12 · Score: 2

This is true, but doesn't really capture the types of experiments that are being done in many cases. Yes, your genome can be stored on a CD. However, next gen sequencing is usually done with a high degree of overlapping coverage, to catch any mistakes in the sequencing, which is still basically a biochemical process despite getting large text files as the end result. So any genome is sequenced multiple times: say 8x coverage, which is fairly standard. That is if you are interested in sequencing a single genome. If you are interested in sequencing all the mRNAs that tell you which genes are active in which tissue and cell type, expect that you need to do a similar amount of sequencing for each tissue and cell type in the human body. Now imagine doing that with different experimental conditions: disease states, environmental factors etc. Of course, on top of that, you will need replicates of each experimental condition in order to have statistical power to say anything meaningful. On top of that there is the sequencing that you can do to identify differences in the epigenome: how the DNA is marked with things like methyl-groups, how it is wrapped around histones, all of which we are finding has a huge functional difference. Having a genome sequence is a lot like having the total word list of the English language. It is huge and powerful, but there is a lot more information you need before you can write Shakespeare.
Re:Where does it all come from? by Anonymous Coward · 2011-12-02 08:53 · Score: 1

To remove the possibility of errors, sequencing of the same genome needs over 50 times the coverage (the same genome will in fact be sequenced roughly 50 times in the same run), so 1 genome is closer to 50 CDs after the experiment.
Re:Where does it all come from? by Anonymous Coward · 2011-12-02 13:25 · Score: 0

A sequence is 3 Giga bases. If you use 2 bits to represent each base (actg), then you can fit this into 750 MB. Clever compression of repeat sequences can reduce size further. However, the starting data for each sequence is based on image data. The machines produce a signal graph of peaks and valleys for each base in the sequence. Other software analyzes these peaks to determine whether a base is a, c, t, or g. That's a big chunk of the data. Then you have multiple snippets of sequence (called reads, which are roughly 500 base pairs long). Usually to form a consensus, anywhere from 8 to 30+ reads are required underneath (depending on how long the sequences are that you are assembling). So, basically, multiply your finished product by 8x, and then keep in mind you have a ton of analysis to do. You can easily chew through a terabyte per genome. Now, that's not the only issue. Since we can now sequence genomes for individuals with relative ease, studies that sequence the entire genome of multiple individuals are occurring, which is multiplying those many terabytes of data for each individual.
However, the above said, to really understand the problems these idiots are facing, you should do a google search on an article by Lincoln Stein about how perl saved the human genome project. Yep, that's right, fucking perl. That's what a lot of them are using and they are wondering why they are having problems analyzing all that data. Having gotten out of the field I can't help but be happy at watching these morons, who many times described any attempt at high performance design as "premature optimization", drown in their data. My previous employer had a grant for half a billion, and a massive data center that required its own power substation and thousands of CPUs, and they still couldn't get it done. A lot of the issues facing this field are the result of incompetence, political infighting, and corruption. As much as I hate to see it happen, the public and academic sector are going to get their collective asses handed to them by private industry over the next few years, as you couldn't possibly screw the software up any worse than it is. It's too bad that a lot of the same douchebags that are in academics will find a way to weasel to the head of these companies (in fact, many of them already are getting out while working their public sector jobs, can you say conflict of interest?).

I have an idea by Anonymous Coward · 2011-12-02 07:52 · Score: 0

Let's store all that data in tightly coiled long-chain molecules made of Carbon, Nitrogen, Oxygen and Phosphorous (with a dash of Hydrogen.) It'll be really compact and much cheaper than hard disks.

Re:Well... by Anonymous Coward · 2011-12-02 07:52 · Score: 0

goatse alert!

Cheap Storage by Anonymous Coward · 2011-12-02 07:53 · Score: 0

You can build a cheap array of 3TB Sata Drives for about .10$/GB. (a couple months back). How much data do they have?

I have too much books to read! by Anonymous Coward · 2011-12-02 07:54 · Score: 0

Having lots of data is a good thing! It eliminates the time researchers spend waiting for more data. Limiting your view to smaller scopes of data is a lot more powerful/flexible than simply relying on what little data you get as it comes in. This simply means we need to develop new research methods to deal with such large amounts of data or simply that we need more researchers. Other fields of science have also encountered this issue and found ways to deal with it. NASA for example is also in this position somewhat thanks to all the space telescopes. Yet they don't complain and instead look for ways to increase the amount of data as well as new methods to deal with the data.

Dealing with storage of large sets of data, while a large task, isn't impossible. It's just a matter of money. It may mean that the data sets need to be centralized with resources pooled so that costs are kept at a minimum. Well, it's hard to say anything about this aspect with so little info.

Transmission of data is one issue they can't deal with beyond a certain point unless they pay to put down more fiber directly between them and the places they want it to go (generally impractical due to extreme costs). Technically, since latency doesn't seem to be an issue, they can always just mail hard drives to the places they want the data. It is more work and adds shipping cost, but that is a pretty small price for being able to send large amounts of data quickly.

In the end, it really all just comes down to money and that's pretty much normal and a good thing. It means that things are going as fast as possible, limited by money rather than by technical issues that cause wait time.

Your genome will fit conveniently on a CD. by bhspencer · 2011-12-02 07:59 · Score: 2

From the article "three billion bases of DNA in a set of human chromosomes". A base may hold 1 of 4 values A, C, G and T. So each base can be represented with 2 bits. 2 bits * 3 billion = 750MB.
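To make the 2-bits-per-base arithmetic concrete, here is a minimal sketch in Python. It assumes only the four plain bases appear (real data also contains ambiguity codes such as N, which this ignores), and the names `pack`, `unpack` and `packed_size_bytes` are invented for the example.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a string of A/C/G/T into 2 bits per base, 4 bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, n_bases: int) -> str:
    """Reverse of pack(); n_bases is needed because the last byte may be padded."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n_bases)
    )


def packed_size_bytes(n_bases: int) -> int:
    """Bytes needed at 2 bits per base, rounded up to a whole byte."""
    return (n_bases * 2 + 7) // 8


seq = "GATTACA"
packed = pack(seq)
assert unpack(packed, len(seq)) == seq
print(len(packed))                       # 2 bytes for 7 bases
print(packed_size_bytes(3_000_000_000))  # 750000000 bytes, i.e. 750 MB
```

This only covers the finished sequence; it says nothing about the raw reads and quality data that come off the sequencer.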

Re:Your genome will fit conveniently on a CD. by bhspencer · 2011-12-02 08:04 · Score: 1

From the article "There will probably be 30,000 human genomes sequenced by the end of this year". 750MB*30,000 = 2.813 TB (terabytes). So we can store all sequenced genomes so far on a single 3TB HDD. You can pick one up from newegg for less than 250 USD.
Re:Your genome will fit conveniently on a CD. by bhspencer · 2011-12-02 08:09 · Score: 1

Correction that total is 22.5 TB.
Re:Your genome will fit conveniently on a CD. by Daniel_Staal · 2011-12-02 08:13 · Score: 2

That's human genomes.
They are also sequencing plants, and (other) animals, and fungus, and bacteria, and viruses, and...

--
'Sensible' is a curse word.
Re:Your genome will fit conveniently on a CD. by Anonymous Coward · 2011-12-02 08:16 · Score: 0

Still not really that much data...
Re:Your genome will fit conveniently on a CD. by bhspencer · 2011-12-02 08:27 · Score: 1

That is the point I was trying to make.

diff by dabridgham · 2011-12-02 08:00 · Score: 2

Someone needs to introduce these researchers to the 'diff' program.

Welcome to the club by pingbat · 2011-12-02 08:02 · Score: 1

Well, hard times for these guys. They have tons of data with next to no noise, errors or uncertainty. I can name 20 people I know personally that would love datasets like that for their research. Am I the only one seeing it this way? Shame you didn't buy the hard drives 4 months ago though. Tough break.

Don't we all? by angiasaa · 2011-12-02 08:02 · Score: 1

I thought we (humans) all had roughly (if not exactly) the same amount of data.. This title reeks of intent to mislead! :)

--
Geekism is your _only_ God!

Re:Don't we all? by Daniel+Dvorkin · 2011-12-02 10:53 · Score: 1

No, bioinformaticists have a bunch of extra DNA. It codes for redundancy in the brain structures that we burn out thinking about this stuff.

--
The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.

Drops in NGS Costs Outpacing Storage Costs by Anonymous Coward · 2011-12-02 08:02 · Score: 4, Informative

The big problem is that the dramatic decreases in sequencing costs driven by next-gen sequencing (in particular the Illumina HiSeq 2000, which produces in excess of 2TB of raw data per run) have outpaced the decreases in storage costs. We're getting to the point where storing the data is going to be more expensive than sequencing it. I'm a grad student working in a lab with 2 of the HiSeqs (thank you HHMI!) and our 300TB HP Extreme Storage array (not exactly "extreme" in our eyes) is barely keeping up (on top of the problems we're having with datacenter space, power, and cooling).

I'll reference an earlier /. post about this:
http://science.slashdot.org/story/11/03/06/1533249/graphs-show-costs-of-dna-sequencing-falling-fast

There are some solutions to the storage problems such as Goby (http://campagnelab.org/software/goby/) but those require additional compute time, and we're already stressing our compute cluster as is. Solutions like "the cloud(!)" don't help much when you have 10TB of data to transfer just to start the analysis - the connectivity just isn't there.

Re:Drops in NGS Costs Outpacing Storage Costs by bhspencer · 2011-12-02 08:40 · Score: 1

What is in that 2TB of data? A human genome only takes up 750MB. (A base may hold 1 of 4 values A, C, G and T. So each base can be represented with 2 bits. 2 bits * 3 billion = 750MB)
Re:Drops in NGS Costs Outpacing Storage Costs by Daniel+Dvorkin · 2011-12-02 10:44 · Score: 2

What is in that 2TB of data? A human genome only takes up 750MB. (A base may hold 1 of 4 values A, C, G and T. So each base can be represented with 2 bits. 2 bits * 3 billion = 750MB)
What you get out of next-gen sequencing isn't actually the sequence of a genome; it's the sequences of a bunch of fragments, each of which has to be resequenced several times (8 or 16 is the current standard, so you'll hear about "8x sequencing" for example; anything less than 4x sequencing is considered so unreliable as to be worthless, and even 16x may not really be enough) to reduce the number of read and assembly errors to an acceptable level. And although the final "consensus sequence" which is the outcome of this process can indeed be stored in 750MB, or considerably less with good compression, the original data still has to be kept around somewhere in order to reproduce the work.
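As a toy illustration of why the repeated reads matter, here is a sketch of building a consensus by majority vote in Python. It assumes the reads have already been aligned to the same coordinates and padded with '-' where a read does not cover a position; real assemblers do far more than this, and the function name `consensus` and the example reads are made up.

```python
from collections import Counter


def consensus(aligned_reads: list[str]) -> str:
    """Majority vote per column over pre-aligned reads ('-' means no coverage)."""
    length = max(len(read) for read in aligned_reads)
    result = []
    for col in range(length):
        votes = Counter(
            read[col]
            for read in aligned_reads
            if col < len(read) and read[col] != "-"
        )
        # Positions with no coverage at all are reported as N (unknown).
        result.append(votes.most_common(1)[0][0] if votes else "N")
    return "".join(result)


reads = [
    "ACGTAC--",
    "ACGTTCGA",  # read error at position 4 (T instead of A)
    "--GTACGA",
    "ACCTACGA",  # read error at position 2 (C instead of G)
]
print(consensus(reads))  # ACGTACGA
```

With only one or two reads over a position a single error cannot be outvoted, which is the intuition behind low coverage being considered unreliable.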

--
The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.
Re:Drops in NGS Costs Outpacing Storage Costs by Anonymous Coward · 2011-12-02 10:46 · Score: 1

Sequence reads for one. The human genome you are talking about is (part) of the final product, the intermediate files (FASTQ, BAM, etc...) take up a lot of space and so does all the meta-data and temporary data you generated.
Re:Drops in NGS Costs Outpacing Storage Costs by Anonymous Coward · 2011-12-02 14:27 · Score: 0

> Solutions like "the cloud(!)" don't help much when you have 10TB of data to transfer just to start the analysis - the connectivity just isn't there.
How about literally placing your compute center at a colo or right next to a large server farm like Google, Apple, Echilon (okay it exists), or Fusion (same answer, the addresses are on websites now).
Some problems require physical compromises.
Also the process of analyzing the data can be semi-automated with AI techniques and simple rules based procedures. It may not catch everything on the first pass, but as time passes it will get smarter and by then compute technology will have changed enough and your skills will have improved for a meaningful second pass on all past data sets.
Sometimes you literally have to "ready, shoot, aim".
JJ
Re:Drops in NGS Costs Outpacing Storage Costs by eli+pabst · 2011-12-02 17:31 · Score: 1

Yeah, this problem really sank in with us when we realized it was faster to download the data onto 2TB external drives and ship it to collaborators rather than transmit it over the internet (even with Aspera). Seemed so bizarre to be surrounded by all this high tech equipment and yet we're putting stamps on our data so we can give it to the mailman.
Re:Drops in NGS Costs Outpacing Storage Costs by v.vamsi · 2011-12-07 18:34 · Score: 1

We ran into the same problem when downloading the 1000 genome project BAM files from the Short Read Archive. Now, we know that the BAM files are indexed so you can easily retrieve all reads overlapping some portion of chromosome 10 etc. But, do the files really need to be that big? Turns out that with simple run-length encoding and other measures we can cut BAM file sizes in half and we can probably use the same indexing scheme. A writeup on that is on our blog.
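For readers who have not met run-length encoding, a generic sketch in Python follows. This is not the scheme from the blog writeup, just the basic idea, applied to a made-up string that stands in for a per-base quality line (where long runs of the same character are plausible). The function names are invented for the example.

```python
from itertools import groupby


def rle_encode(text: str) -> list[tuple[str, int]]:
    """Collapse each run of identical characters into a (char, count) pair."""
    return [(char, sum(1 for _ in run)) for char, run in groupby(text)]


def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Expand (char, count) pairs back into the original string."""
    return "".join(char * count for char, count in pairs)


quals = "IIIIIIIIHHHHGGG#####"  # hypothetical quality string
encoded = rle_encode(quals)
print(encoded)  # [('I', 8), ('H', 4), ('G', 3), ('#', 5)]
assert rle_decode(encoded) == quals
```

RLE only pays off when runs are long; on data that changes at nearly every position it makes things bigger, so whether it helps depends entirely on what is being encoded.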

Re:Well... by GameboyRMH · 2011-12-02 08:03 · Score: 1

Still doesn't elude my filter. See, I was thinking ahead when I designed that. He'll either have to quit trolling or get off his lazy ass and put some effort into trolling us.

--
"When information is power, privacy is freedom" - Jah-Wren Ryel

"GNome" researchers have too much data by MikeTheGreat · 2011-12-02 08:05 · Score: 1

Hehe - I mis-read this as "GNome researchers" have too much data.

Probably along the lines of several thousand comments to the effect of "I can't stand GNOME 3", "I liked GNOME 2 better", etc, etc :)

Re:"GNome" researchers have too much data by sunzoomspark · 2011-12-03 02:30 · Score: 1

+1 funny There are sure a lot of comments along those lines. The 'gnome researchers' unlike the biology people have the option of ignoring any data that does not agree with their hypotheses.
--
Westheimer's Discovery: A couple of months in the laboratory can frequently save a couple of hours in the library.

Here's an idea.... by Nidi62 · 2011-12-02 08:05 · Score: 1

I'm sure all the insurance companies would love to buy up all that data...

--
The only thing necessary for evil to triumph is for it to be pitted against a slightly greater evil

Not True by Anonymous Coward · 2011-12-02 08:07 · Score: 0

No, we really have too much data by comparison to the number of people who can analyze that data. That's the problem, and yes, I do this for a living and at the University I am at we can't analyze it fast enough.

Now back to analyzing genomes...

Go to the cloud by sowalsky · 2011-12-02 08:09 · Score: 1

For individual research units, the cost of maintaining the processing power and storage space for these types of projects can be prohibitive. Cloud-based options offer distributed computing power and low-cost storage that is often a more economical solution than paying for the equipment in house, especially when genomic projects can come in spurts rather than a continuous stream.

Disclaimer: I work with large amounts of genomic data and use both in-house and cloud-based analysis tools.

Easy... by jellomizer · 2011-12-02 08:09 · Score: 1

'We are going to have to come up with really clever ways to throw away data so we can see new stuff.'

Have all your data open on a Windows share and an FTP server. Have them available on the full internet. Make some honest mistakes in setting up permissions. Copy and paste the "wrong link" into a hackers/gaming website. Wait a while.... All your data has been replaced with illegal information, which makes it easy to clean out. Problem solved.

--
If something is so important that you feel the need to post it on the internet... It probably isn't that important.

AI by slyrat · 2011-12-02 08:15 · Score: 2

This seems like just the kind of problem that AI will help with narrowing the field of 'interesting' things to look at. Either that or better ways to search through the data that is available along with better ways to store said data will probably work.

Re:AI by laejoh · 2011-12-02 08:34 · Score: 1

That's what you think, but honestly, we just hacked most of it together with perl.

Reminded of a Parallel Computing Problem by wbtittle · 2011-12-02 08:17 · Score: 2

Way back in 1993, I visited an atomic laboratory in Pennsylvania. On the tour, they showed us the 30,000 core computing machine they had purchased several years before. "We still can't program it".

30 seconds later he pointed to the next piece of metal.

This is our 120,000 core computer.

I raised my hand "Why did you buy a 120,000 core machine when you can't even program the 30,000 core machine!"

"Well it's faster."

One of my early lessons in big companies attacking the wrong problem.

--
God: "I don't leave footprints!"

Is it a searching problem? by camh · 2011-12-02 08:18 · Score: 2

A couple of researchers in Sydney think they've got a model for searching the genome much more efficiently. They're trying to fund their research and development with crowdsourcing: http://rockethub.com/projects/4065-a-gps-for-the-genome : "The PASTE project [is] based on a new number system we call Permutahedral Indexing - P.I. for short, an N-dimensional map that efficiently locates and interrelates complex datasets in the space of all possible data. P.I. does this efficiently even when the data has hundreds of independent dimensions and comes in petabytes and exabytes."
They don't seem to need much money in the scheme of things - I might just throw in $25.

Still waiting. by Anonymous Coward · 2011-12-02 08:20 · Score: 0

Got the cure for cancer yet?

RAW Data... by Anonymous Coward · 2011-12-02 08:23 · Score: 0

I think there is general misconception here that the amount of data is too much to store. This is not the case, genomes are not that big. The real problem is that sequences can be obtained at much much higher rate and extreme ease (guaranteed good results) these days than a biochemist/molecular biologist/etc. can characterize said genes on a molecular level and assign proper function.

Just having the sequence for a new species is not enough. There is a certain amount of knowledge that can be transferred to the data of this new organism, since a lot of proteins/genes have already been characterized in other organisms. But there is also plenty of new or different stuff and doing real research on those is the hard part.

Speaking from personal experience I've entered a field where people have been working with a certain class of genes/proteins for *ages*, yet the way they work and do their job is still a complete mystery. The classic approach of just knocking out a gene, calling the resulting mutant some silly name (developmental biologist anyone?) simply is not sufficient anymore.

No &@$^% by Anonymous Coward · 2011-12-02 08:25 · Score: 0

I just had to reprimand a genome research student who was using up multiple TB of space (I'm not allowed to put them under quota) with stupid uncompressed xml files.

Re:Well... by GameboyRMH · 2011-12-02 08:27 · Score: 1

3140? Haha that's pretty lame. I got more than that on a pastebin URL I had in my sig for a couple of weeks. Still I can see why you keep going, considering some of the lulzy responses, although there's a good amount of pity mixed in there these days.

Seriously though, you're a one-trick pony, the Microsoft of trolling. Try to mix it up. You're boring the grizzled veterans on here.

--
"When information is power, privacy is freedom" - Jah-Wren Ryel

LHC has similar problem by peter303 · 2011-12-02 08:33 · Score: 1

They only save a tiny fraction of collision events, those deemed "interesting". Even so, that's petabytes a year. This keeps the researchers busy during shutdowns such as now, analyzing these data for new particles or anomalies.

Compress at the level of PROTEINS by wisebabo · 2011-12-02 08:36 · Score: 2

So, why can't they compress the data at the level of proteins? I mean it takes thousands of DNA base pairs to code for 1 protein, like hemoglobin, so instead of storing all that just say "here is the DNA sequence for protein X". Any exceptions, like mutations could then be indicated as "at position 758, the A is replaced by a G".
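The "store the known sequence once, then list the exceptions" idea can be sketched in a few lines of Python. This version handles single-base substitutions only (no insertions or deletions, which real variant formats must handle), positions are 0-based, and the sequences and function names are invented toy examples rather than real gene data.

```python
def diff_against_reference(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """List (position, reference_base, sample_base) for every mismatch."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [
        (pos, ref_base, new_base)
        for pos, (ref_base, new_base) in enumerate(zip(reference, sample))
        if ref_base != new_base
    ]


def apply_diff(reference: str, diff: list[tuple[int, str, str]]) -> str:
    """Rebuild the sample from the reference plus its list of substitutions."""
    bases = list(reference)
    for pos, ref_base, new_base in diff:
        if bases[pos] != ref_base:
            raise ValueError(f"diff does not match reference at position {pos}")
        bases[pos] = new_base
    return "".join(bases)


reference = "ATGGCCAAGTTGACC"  # toy stand-in for "the DNA sequence for protein X"
sample = "ATGGCCAGGTTGACC"     # same, except the A at position 7 is replaced by a G

diff = diff_against_reference(reference, sample)
print(diff)  # [(7, 'A', 'G')]
assert apply_diff(reference, diff) == sample
```

Storage then scales with the number of differences rather than the length of the sequence, which works well as long as a good reference exists.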

Of course if there is something REALLY novel, like a bioengineered virus that used different (non-standard) 3 base pair codons to encode the same amino acid, this kind of data compression wouldn't work but for 99.9999% of "natural" cases it would. (I saw this idea in the tv series "regenesis"). So for these (hopefully rare, it was for a bio-weapon!) cases a different type of compression would be used. "My" compression algorithm would, of course, break which would be a good indication this wasn't a natural DNA sequence.

I am neither a bio-expert nor a compression expert but this seems to me to be similar to the problem of compressing a vast library of books. Is it best to compress at the level of letters, words or even sentences? I'm only guessing what this entails because I'm not a linguist either! :(

(Then there's the whole business of introns or exons which "seem" to be content/protein free but I understand contain lots of regulatory information despite their repetitive nature. I would imagine these could be handled by some sort of pattern RLE.)

Re:Compress at the level of PROTEINS by Anonymous Coward · 2011-12-02 08:54 · Score: 0

Human DNA is badly fragmented. Many proteins are coded for by chunks on separate chromosomes. Likewise, we have no idea what everything codes for, and new regulatory mechanisms (like micro RNA) built into DNA are being discovered frequently
Re:Compress at the level of PROTEINS by wisebabo · 2011-12-02 08:56 · Score: 1

If I may wax philosophical about my own posting, the advantage of using this "level" of encoding is that nature has, through ruthlessly efficient evolution, pruned out the almost-infinite number of non-useful proteins. Almost every DNA sequence that encodes a protein that is deleterious to the survival of the organism has been eliminated by the grim reaper. The few "bad" but non-lethal proteins that are still around in a living organism (like the sickle-cell hemoglobin variant, which persists because it protects against malaria) will stick out like a sore thumb with this sort of compression algorithm due to the exceptions it will throw.
But maybe I'm missing something completely obvious! (I believe I covered the intron, exon and by extension other "regulatory" aspects of DNA, they would possibly have a different library to be compared to or use a lower level compression scheme).

Compression / dedupe? by dave562 · 2011-12-02 08:41 · Score: 1

I am not a geneticist, so I might be way off base here. But isn't DNA data a grouping of ATCG bonds in various arrangements? It seems like the nature of the data itself would lend to effective compression and/or de-duplication.

Re:Compression / dedupe? by certron · 2011-12-02 10:06 · Score: 1

...way off base...
Well played.

--

fair.org counterpunch.com truthout.com indymedia.org salon.com
eff.org guerrilla.net debian.org gentoo.org
Re:Compression / dedupe? by dave562 · 2011-12-02 10:14 · Score: 1

It just goes to show, being good at one thing does not necessarily mean you are qualified to comment on something else. ;)

Funny... by Kamiza+Ikioi · 2011-12-02 08:46 · Score: 1

In a previous post, people were saying that mixing biology and computer science was a stupid idea, here. However, this clearly shows that it is much needed, except that in this case, the computer geeks can help out the biology nerds.

--
I8-D

The amount of data is so ridiculously small! by Anonymous Coward · 2011-12-02 08:47 · Score: 0

Come on, are they really talking about "700 trillion bytes of computer memory"? What's the problem or did they drop a factor of 1000?

The Dawn of "Big Data" by Anonymous Coward · 2011-12-02 08:51 · Score: 0

Genomics and other data-intensive industries (geophysics, pharmaceuticals, GIS, etc.) are indeed beginning to experience new challenges in an area of computing that has traditionally been remedied with a rather simplistic solution--simply adding more storage. At a certain point, the management of large (and constantly growing) datastores becomes cumbersome, expensive and quite difficult. The confounding aspect of many of these instances is that, while this data needs to be stored and accessible (usually by many users and for extended periods of time, if not indefinitely), oftentimes only a very small segment of this data is actually accessed or utilized, but it is all important to have on hand for future use.

This presents a number of challenges:

First and foremost, multiple users accessing the same dataset usually requires high-performance storage.
High performance disk storage is not cheap to operate; in fact, there are a number of capital and operational costs involved, including, but not limited to:
Acquisition, operation (power/cooling), expansion (we're talking massive RAID arrays, so the costs increase exponentially), backup (which is in itself another challenge and expense), replication, support and management. This can be a daunting task (and budget item) in small to medium businesses managing Terabytes of data--now translate that to Petabytes, Exabytes, etc.

As mentioned, the cost of creating the data has declined significantly as competition in the computer commodity marketplace has not only led to the development of faster processing, but simultaneously driven the cost down. What else has driven the cost down? New technologies such as virtualization that allow users to more efficiently and effectively utilize these resources. Instead of buying a separate server for each application, it became possible to run multiple "virtual" machines on a single physical server.

Consider the notion of virtualization ten years ago. A salesman walks into your data-center, points at your racks of servers and tells you that with this software, he can help you run all of your servers on 10% of the hardware you currently use. Save on hardware/support, upgrades, administration, licensing and software. Magic!
It took some time, but virtualization is now the norm--it makes sense, it works and it allows people to solve problems without simply adding more hardware.

Similarly, two technologies in the storage industry are changing the way we address data growth.

In a traditional backup scenario, a copy of your data is written to a medium--usually disk or tape, night after night, or more/less frequently as your requirements/backup window/budget allow. How many times do you think the same file is present within the data set that is being backed up? Why back it up two, three or five hundred times? Why do that every night? Talk about a waste of space and money.

De-duplication technology is offered by a number of vendors in a number of flavors, the most robust of which involves a block-level analysis of data being backed up. As the data goes across the wire (usually en-route to an underlying appliance comprised of high-performance disk with robust processing and cache), it is analyzed for redundancy. If a pattern is unique, it is written to disk. If it is redundant, a pointer or reference is made to the location of the duplicate instance, instead of writing the data again. What happens on the first backup is an impressive reduction in the amount of data that is written to the device (most manufacturers cite up to 20:1 reduction ratios). On successive backups, only the changed data is written. Over time, the amount of data actually being written to the appliance is drastically reduced. As an added bonus, replication to a second device appreciates the same increase in efficiency--only unique instances of data need to be sent across the wire.
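
The block-level scheme described above can be sketched roughly as follows. This is only an illustration, not any vendor's implementation: the fixed 4096-byte block size, the use of SHA-256 digests as the "pointers", and the in-memory dictionary standing in for the appliance are all assumptions.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size


def dedupe(data, store, block_size=BLOCK_SIZE):
    """Store each unique block once and return the list of block digests.

    The returned list (the "recipe") plays the role of the pointers:
    it is all that has to be kept per backup to rebuild the data.
    """
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:      # unique pattern: write it
            store[digest] = block
        recipe.append(digest)        # redundant pattern: just reference it
    return recipe


def restore(recipe, store):
    """Rebuild the original bytes from a recipe and the shared block store."""
    return b"".join(store[digest] for digest in recipe)


store = {}
night1 = b"A" * 8192 + b"B" * 4096   # three blocks, two of them identical
night2 = b"A" * 8192 + b"C" * 4096   # only the last block changed

recipe1 = dedupe(night1, store)
recipe2 = dedupe(night2, store)

print(len(store))                          # 3 unique blocks for 6 logical blocks
print(restore(recipe2, store) == night2)   # True
```

The second backup only adds one new block to the store, which is the "only the changed data is written" behaviour the paragraph describes.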

Need to restore a file? No problem. Instead of pulling your last full and subsequent incremental backup tapes and

@Home? by ad1217 · 2011-12-02 09:03 · Score: 1

maybe an @home-type project would work?

It's not the data by thisisauniqueid · 2011-12-02 09:04 · Score: 3, Insightful

It's not that there's too much data to store. There's too much to analyze. Storing 1M genomes is tractable today. Doing a pairwise comparison of 1M genomes requires half a trillion whole-genome comparisons. Even Google doesn't compute on that scale yet. (Disclaimer: I'm a postdoc in computational biology.)
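
For anyone checking the "half a trillion" figure: the number of unordered pairs among n genomes is n(n-1)/2. A quick sketch (the function name is made up for illustration):

```python
import math


def pairwise_comparisons(genome_count):
    """Number of distinct unordered pairs among genome_count genomes."""
    return math.comb(genome_count, 2)


pairs = pairwise_comparisons(1_000_000)
print(pairs)  # 499999500000, i.e. about half a trillion
```

Because the count grows with the square of n, doubling the number of genomes roughly quadruples the work.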

Simple. by Anonymous Coward · 2011-12-02 09:14 · Score: 0

Disk is cheap. Buy More Disk.

The problem starts at the instrument by Anonymous Coward · 2011-12-02 09:18 · Score: 0

The 'problem' starts at the sequencing level - a typical trace file (the output of the sequencing instrument) tends to be about 250K. The trace file contains luminescence data on 4 or 5 channels taken at intervals while the marked molecules are drawn electrically through a matrix-filled channel. This is analyzed and results in something like 500 basecalls from the electropherogram. Allowing for redundancy and overlap, it takes 2-3 such traces to get about 500 finished bases. So a small bacterial genome of 32K bases needs roughly 64 x 3 = 192 traces, or about 48 Mbytes in trace data.

All well and good - but it turns out that there's a lot more data in those trace files than once imagined. mtDNA traces may be re-examined for polymorphisms including heteroplasmic insertions and deletions - this is where not all of the mtDNA is the same. Analysis of these traces (and older traces, for that matter) can reveal a lot more information than we got out of the simple sequence data when we first looked at it.

So the problem isn't at the ACTG level but below that where each basecall took about 500 bytes of instrument data and 1ms of processing power to generate. If you were to recall all the traces done to date you're looking at some pretty big storage and processing requirements.
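
The arithmetic in this comment can be reproduced with a few lines. The constants are the figures quoted above (about 250K per trace, about 500 basecalls per trace, 3 traces per stretch for redundancy); the function name is just for illustration.

```python
import math

TRACE_FILE_BYTES = 250_000    # "about 250K" per trace file
BASECALLS_PER_TRACE = 500     # "something like 500 basecalls"
TRACES_PER_STRETCH = 3        # upper end of "2-3" for redundancy and overlap


def trace_storage_bytes(genome_bases, traces_per_stretch=TRACES_PER_STRETCH):
    """Estimate raw trace storage for a genome of the given length."""
    stretches = math.ceil(genome_bases / BASECALLS_PER_TRACE)
    return stretches * traces_per_stretch * TRACE_FILE_BYTES


bytes_per_basecall = TRACE_FILE_BYTES // BASECALLS_PER_TRACE
small_genome = trace_storage_bytes(32_000)

print(bytes_per_basecall)  # 500 bytes of instrument data per basecall
print(small_genome)        # 48000000 bytes, i.e. about 48 Mbytes
```

Plugging in a larger genome length shows how quickly the trace data outgrows the finished sequence, which is stored at roughly one byte per base as text.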

They're just not asking the right people for help by Anonymous Coward · 2011-12-02 09:19 · Score: 0

Over in Eastern Europe there are people who deal with more data than that every day -- the bastards who end up with the results of all the malware that's capturing data from everyone's parents, grandparents, and children on their PC's every day. Those guys must REALLY know how to handle big data.

Gnome Researchers by Fjandr · 2011-12-02 09:40 · Score: 1

I'm not awake enough yet. I read the title as "Gnome Researchers Have Too Much Data."

TEDx Talk on the Subject by rockmuelle · 2011-12-02 10:00 · Score: 3, Informative

I did a talk on this a few years back at TEDx Austin (shameless self promotion): http://www.youtube.com/watch?v=8C-8j4Zhxlc

I still deal with this on a daily basis and it's a real challenge. Next-generation sequencing instruments are amazing tools and are truly transforming biology. However, the basic science of genomics will always be data intensive. Sequencing depth (the amount of data that needs to be collected) is driven primarily by the fact that genomes are large (E. coli has around 5 M bases in its genome, humans have around 3 billion) and biology is noisy. Genomes must be over-sampled to produce useful results. For example, detecting variants in a genome requires 15-30x coverage. For a human, this equates to 45-90 Gbases of raw sequence data, which is roughly 45-90 GB of stored data for a single experiment.
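
A quick sketch of the coverage arithmetic, using the genome size and coverage range quoted above. The one-byte-per-base figure is the rough rule of thumb implied by "roughly 45-90 GB" and is an assumption; real files that also carry quality scores will be larger.

```python
HUMAN_GENOME_BASES = 3_000_000_000  # "around 3 billion"
BYTES_PER_BASE = 1                  # assumption: one ASCII character per base


def raw_bases(genome_bases, coverage):
    """Total bases sequenced when the genome is over-sampled `coverage` times."""
    return genome_bases * coverage


def raw_gigabytes(genome_bases, coverage, bytes_per_base=BYTES_PER_BASE):
    """Approximate stored size in GB (10**9 bytes)."""
    return raw_bases(genome_bases, coverage) * bytes_per_base / 10**9


low = raw_gigabytes(HUMAN_GENOME_BASES, 15)
high = raw_gigabytes(HUMAN_GENOME_BASES, 30)
print(low, high)  # 45.0 90.0
```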

The two common solutions I've noticed mentioned often in this thread, compression and clouds, are promising, but not yet practical in all situations. Compression helps save on storage, but almost every tool works on ASCII data, so there's always a time penalty when accessing the data. The formats of record for genomic sequences are also all ASCII (fasta, and more recently fastq), so it will be a while, if ever, before binary formats become standard.

The grid/cloud is a promising future solution, but there are still some barriers. Moving a few hundred gigs of data to the cloud is non-trivial over most networks (yes, those lucky enough to have Internet2 connections can do it better, assuming the bio building has a line running to it) and, despite the marketing hype, Amazon does not like it when you send disks. It's also cheaper to host your own hardware if you're generating tens or hundreds of terabytes. 40 TB on Amazon costs roughly $80k a year whereas 40 TB on an HPC storage system is roughly $60k total (assuming you're buying 200+ TB, which is not uncommon). Even adding an admin and using 3 years' depreciation, it's cheaper to have your own storage. The compute needs are rather modest as most sequencing applications are I/O bound - a few high memory (64 GB) nodes are all that's usually needed.
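
Using only the two prices quoted above, the three-year comparison works out as follows. The variable names are illustrative, and the sketch ignores power, cooling and price changes, none of which are given here.

```python
CLOUD_COST_PER_YEAR = 80_000   # "40 TB on Amazon costs roughly $80k a year"
OWNED_COST_TOTAL = 60_000      # "40 TB on an HPC storage system is roughly $60k total"
YEARS = 3                      # the depreciation period mentioned above

cloud_total = CLOUD_COST_PER_YEAR * YEARS
headroom = cloud_total - OWNED_COST_TOTAL

print(cloud_total)  # 240000
print(headroom)     # 180000 left over for administration before the cloud breaks even
```

In other words, on these numbers the in-house option stays cheaper as long as the share of an admin's time attributable to those 40 TB costs less than about $60k a year.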

Keep in mind, too, that we're asking biologists to do this. Many biologists got into biology because they didn't like math and computers. Prior to next-generation sequencing, most biological computation happened in calculators and lab notebooks.

Needless to say, this is a very fun time to be a computer scientist working in the field.

-Chris

Moore's Law != Performance by ender06 · 2011-12-02 10:11 · Score: 1

DNA sequencing is becoming faster and cheaper at a pace far outstripping Moore's law.

Moore's law, or rather Moore's observation, has absolutely nothing to do with performance and everything to do with the number of transistors. For the love of deity of your choice, will they stop using it regarding performance? Simply mentioning something computer related doesn't make the writer look smarter. Yes, an increase in the number of transistors can see an increase in performance, but it isn't guaranteed. E.g., Bulldozer.

creative ways to get rid of data? by Anonymous Coward · 2011-12-02 10:29 · Score: 0

mv genome > /dev/null &2>

??

Compression by gr8_phk · 2011-12-02 10:31 · Score: 1

You know just about any compression algorithm will squash the hell out of a file containing only 4 letters. Zip them for archiving. And those 100GB files will fit on a small hard drive and can be thrown out once the pieces are all put together.
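
To make the "only 4 letters" point concrete: four symbols need only 2 bits each, so four bases fit in one byte. Below is a minimal sketch of that packing; the function names are made up, and it assumes the sequence contains nothing but A, C, G and T (no N calls, no quality scores), which real sequencer output does not guarantee.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(sequence):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))   # zero-pad a short final chunk
        packed.append(byte)
    return bytes(packed)


def unpack(packed, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


sequence = "GATTACA"
packed = pack(sequence)
print(len(sequence), len(packed))                 # 7 2
print(unpack(packed, len(sequence)) == sequence)  # True
```

That is a fixed 4:1 saving over ASCII before any general-purpose compressor is applied; the original length has to be stored alongside the bytes because of the padding.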

Re:Compression by Samantha+Wright · 2011-12-02 10:48 · Score: 1

Of course it will, but that's pretty inefficient when you need to decompress it regularly to work with it. At any rate, when you're working with data sets this big, for most people it really is more efficient to just buy more hard drives, and the article is actually about not having the CPU and human power to do the analysis, anyway.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Re:Compression by rnaiguy · 2011-12-02 16:58 · Score: 1

I've actually found it FASTER to keep the data compressed. Hard drive read speed has become a real bottleneck with big datasets. Python has built-in gzip support, so working with it has become trivial for me :)
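
For illustration, this is roughly what working on compressed data looks like with Python's standard `gzip` module. The file name and toy FASTA contents are made up; the point is that the file is read line by line without ever being decompressed on disk.

```python
import gzip
import os
import tempfile


def count_bases(path):
    """Count sequence characters in a gzip-compressed FASTA file, streaming it."""
    total = 0
    with gzip.open(path, "rt") as handle:   # "rt" = read as text
        for line in handle:
            if not line.startswith(">"):    # skip FASTA header lines
                total += len(line.strip())
    return total


# Write a tiny compressed FASTA file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "toy.fasta.gz")
with gzip.open(path, "wt") as handle:
    handle.write(">toy_sequence\nACGTACGT\nACGT\n")

n_bases = count_bases(path)
print(n_bases)  # 12
```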
Re:Compression by Anonymous Coward · 2011-12-02 17:24 · Score: 0

So does PHP. At work, I compress log files once they are done being written to and show them to the user.
An added benefit is that I can still allow appending to a file (in case the program isn't done logging) and continue reading it.
Re:Compression by Samantha+Wright · 2011-12-02 17:37 · Score: 1

Right, but then you're faced with the 'blistering' speed of Python. I was thinking of C—still, good to think about for the future.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Re:Compression by rnaiguy · 2011-12-03 02:40 · Score: 1

when faced with the 'blistering' speed of developing in C, I'm willing to accept slower runtimes.
I imagine that it would be a different story for someone making production code for distribution, but most of my stuff gets used a few times, and only within the lab.

The problem isn't completed genomes... by Vornzog · 2011-12-02 10:36 · Score: 2

Though, there is quite a lot of that being generated these days.

The problem is the *raw* data - the files that come directly off of the sequencing instruments.

When we sequenced the human genome, everything came off the instrument as a 'trace file' - 4 different color traces, one representing a fluorescent dye for each base. These files are larger than text, but you store the data on your local hard drive and do the base calling and assembly on a desktop or beefy laptop by today's standards.

2nd gen sequencers (Illumina, 454, etc) take images, and a lot of them, generating many GB of data for even small runs. The information is lower quality, but there is a lot more of it. You need a nice storage solution and a workstation grade computer to realistically analyze this data.

3rd gen sequencers are just coming out, and they don't take pictures - they take movies with very high frame rates. Single molecule residence time frame rates. Typically, you don't store the rawest data - the instrument interprets it before the data gets saved out for analysis. You need high end network attached storage solutions to store even the 'interpreted' raw data, and you'd better start thinking about a cluster as an analysis platform.

This is what the article is really about - do you keep your raw 2nd and 3rd gen data? If you are doing one genome, sure! why not? If you are a genome center running these machines all the time, you just can't afford to do that, though. No one can really - the monetary value of the raw data is pretty low, you aren't going to get much new out of it once you've analyzed it, and your lab techs are gearing up to run the instrument again overnight...

The trick is that this puts you at odds with data retention policies that were written at a time when you could afford to retain all of your data...

--

-V-

Who can decide a priori? Nobody.
-Sartre

Re:The problem isn't completed genomes... by Anonymous Coward · 2011-12-02 19:01 · Score: 0

3rd gen sequencers are just coming out, and they don't take pictures - they take movies with very high frame rates. Single molecule residence time frame rates. Typically, you don't store the rawest data - the instrument interprets it before the data gets saved out for analysis. You need high end network attached storage solutions to store even the 'interpreted' raw data, and you'd better start thinking about a cluster as an analysis platform.
Actually our 3rd gen machine (PacBio) produces much less data than our HiSeqs.
I don't have hard data yet since it only ran a few times but it's like a few tens of GB vs. 1-4 TB

the pell mell fullness of time by epine · 2011-12-02 11:03 · Score: 2

640K will always be enough!

Yeah, back when Slashdot ran at 2400 bps, the comment limit was shorter than Twitter. But not to worry, like the Witnesses, the "great crowd" with seven-digit UIDs are relegated to a paradise on earth.

I have to say in 1981 making those decisions I felt like I was providing enough freedom for ten years, that is the move from 64K to 640K felt like something that would last a great deal of time.

The complaints as Gates recalls began in five years. He was off by a factor of two. I remember 1981 clear as day. There was hardly a baseline by which to judge the trajectory of the home computer. A monochrome 80 column display with mixed case was state of the art. By the end of 1982, the PC was selling a decimal order of magnitude faster than IBM projected, which put a whole different spin on enough. Volume drove down cost, and lower cost made eyes bigger sooner than almost anyone guessed.

I've read a lot from Gates over the years. Arrogant in most regards, but rarely stupid. Gates might have had the sentiment that a 0.33 MIPS processor didn't need 16MB of system memory, and figured that the memory limit would be addressed in a less anemic platform in the fullness of time. No-one in 1981 thought that 8088 byte code would still reign supreme thirty years later, any more than COBOL programmers in the 1960s worried about Y2K.

There's Plenty of Room at the Bottom, as Feynman already understood in 1959.

I don't really see a problem here. We have more than enough storage for the amount of analysis we're able to do. It's a short term nuisance that we have to invest some resources in being a little more selective in what we save, until storage or analysis catches up again.

There are some applications of genetics where the error component is the signal you're looking for. These methods are less forgiving of lossy synopsis. There might be room for some improvements to storage and compression algorithms in this space.

Re:the pell mell fullness of time by WillAffleckUW · 2011-12-02 12:29 · Score: 0

My old ID was 5 digits, but then I used to remember when 300 baud was fast, and we used bubble memory and LED ribbons for screens.

--
-- Tigger warning: This post may contain tiggers! --

Git biorepo? by dak664 · 2011-12-02 11:36 · Score: 1

Put each sequence into a git repo, and periodically rebase ...

Not sure I see the problem by WillAffleckUW · 2011-12-02 12:18 · Score: 1

Look, I just bought a 6 core 16GB 2TB machine from newegg for $822 this week.

And I work with genetic data. We use 2TB external HDD to back up laptops, and have rack mounts of 8 core 8 blade machines in the basement.

The problem mostly is with deciding what data is relevant and categorizing it, and then processing it.

A lot of what some people think is "useless data" in the genome is precisely NOT useless. Inserts, deletes, misfolds - all are ways for DNA to store adaptive biochemical patterns in a finite space and let the same strand of RNA express differently in different environmental and biological conditions.

Half the time I've worked on a biochem or genetics project, it turns out what we initially thought was useless data in fact turns out to be important data.

But, that's just my personal opinion.

Do we need better open-source tools? Yes.

--
-- Tigger warning: This post may contain tiggers! --

Re:Wrong problem or why clouds not ok by WillAffleckUW · 2011-12-02 12:24 · Score: 1

We don't use clouds. Especially for human subjects. We might use a lab-only distributed storage paradigm, but it's not a "cloud", since we tightly control access and location.

--
-- Tigger warning: This post may contain tiggers! --

Re:Wrong problem or redundant genome data by WillAffleckUW · 2011-12-02 12:28 · Score: 1

It is not redundant data.

The actual shifts, misfolds, deletes, and insertions around hotspots in the DNA sequence itself store adaptive information that allows DNA to replicate differently in different biochemical and environmental conditions, so that you can survive famine AND feast, or cold AND heat.

Just because you think it's useless and repetitive doesn't mean it IS useless OR repetitive.

Some of it is, of course, like the viral rewrites from infections. Some is backup adaptations that you may not think useful today, but allow you to have gills - which you will need when global warming results in Bangladesh being underwater.

--
-- Tigger warning: This post may contain tiggers! --

Re:Wrong problem or Darwins Law by WillAffleckUW · 2011-12-02 12:34 · Score: 1

They have plenty of storage. It's just Darwinists covering up evidence of ID by throwing away the evidence that points to that conclusion. That way the researchers get to keep their jobs that allow them to bilk millions from the government. Oh, wait, this doesn't involve climate change, so I'll be modded down instead of up.

Actually, we're covering up the fact that your DNA not only encodes for you to have lungs, it also codes for you to have gills and turn into Zombie Lizards, if we just expose it to the correct biochemistry and environmental conditions.

You should see what embryos look like in the early stages.

Fish People from Outer Space!

--
-- Tigger warning: This post may contain tiggers! --

FPGA's and ASIC's by Anonymous Coward · 2011-12-02 17:36 · Score: 0

I guess the next step will be to code custom chips to deal with this data efficiently. Imagine all the cool pieces of hardware you could build, like a chip that (de)compresses new genome data based on diffs from standard sequences. Some good custom chips could move you years ahead of where you are now. The high frequency traders have learned this lesson. Does anyone know if this is being attempted?

Re:Wrong problem or redundant genome data by PaladinAlpha · 2011-12-02 19:32 · Score: 1

Hi! We were having a conversation about compression. Data that is losslessly compressible can be compressed and then expanded to the exact same thing. Which means that your prized gill sequences will survive unmolested.

A duplicated pattern of bits is exactly redundant data. That's how compression works.
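
A small demonstration of the point, using Python's standard `zlib` module on a made-up repetitive sequence: the compressed form is much smaller, and decompressing it gives back exactly the same bytes.

```python
import zlib

# Toy data: a highly repetitive stretch followed by a short unique tail.
original = ("ACGT" * 1000 + "GATTACA").encode("ascii")

compressed = zlib.compress(original, level=9)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))  # compressed is far smaller
print(restored == original)            # True: nothing was lost
```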

Wrong assumptions by Anonymous Coward · 2011-12-02 23:34 · Score: 0

What about insertions, deletions? What about no data? What about 4 different nucleotides?

Moreover, the article is about something completely different than storing genomes. The article is about analysis and storing raw sequencing data.
Since we don't yet know how to assemble these reads into a genome well, we want to store the raw reads until we do.

*sigh* by Anonymous Coward · 2011-12-03 02:57 · Score: 0

Yes they are.
I'm looking for a petabyte hard disk right now.
WD doesn't have one, yet.

It's the Economics by Anonymous Coward · 2011-12-03 07:02 · Score: 0

Wall Street pays more to do the same math. The reason the "data deluge" is such a problem in biology is because there is no money in it. A wall street quant has a starting salary of $80,000 to $120,000. A bioinformatics scientist is paid $60,000 to $80,000. As a senior bioinformatics scientist for the NIH, I was making $70,000 with no hope of breaking six figures in my lifetime. Do you have any idea what college tuition is projected to be in 2020?

I submit that any intelligent analyst capable of routinely reducing biological data analysis to tractable and publishable information will not sacrifice legacy for prestige. The financial calculation is eminently present to the numerical mind required for crunching these numbers.

Bottom line: the reason "this is so hard" is because the right people aren't stupid and don't do bioinformatics. Boost the pay table and then we can talk.

BGI pays only 10% for supplies! by PythonM · 2011-12-03 16:07 · Score: 1

The Chinese government negotiated a good price for supplies too. BGI gets the reagents really cheap - paying only 10% of the official price! Welcome to the 21st century, when American corporations help China to kill competition: American science and Biotech!

I'm willing to give up 10 gig by obscuro · 2011-12-06 15:05 · Score: 1

Is there a distributed computing platform that does for storage what SETI et al do for processing? Anyone want to build one?

--
Every rule has more than one consequence.

Slashdot Mirror

239 comments