Genome Researchers Have Too Much Data
An anonymous reader writes "The NY Times reports, 'The field of genomics is caught in a data deluge. DNA sequencing is becoming faster and cheaper at a pace far outstripping Moore's law. The result is that the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data. Now, it costs more to analyze a genome than to sequence a genome. There is now so much data, researchers cannot keep it all.' One researcher says, 'We are going to have to come up with really clever ways to throw away data so we can see new stuff.'"
All previous posts have been purged due to too much data.
They don't have too much data, they have insufficient affordable storage.
No such thing as too much data on a scientific topic.
To offset political mods, replace Flamebait with Insightful.
Throwing out data in order to be able to analyze other data, especially when it comes to genes and how they interact, sounds like one of the worst ideas I've heard.
ACGT... 4 symbols only in this alphabet. I hope they're not storing it in ASCII form ;)
If so, better get this bzip2 or lzma compressor going.
I see an opportunity for work, and jobs.
Most scientific topics are like this: there is too much raw data to analyze it all. But a good scientist can spot the patterns and can distinguish between important stuff and noise.
Seems like we need to stop sequencing genomes until we've figured out if there's anything useful we can do with all that data.
I don't see how "too much data" can be a problem. Just stop taking in the new data, concentrate on the data sets you already have and only get more when you find a gap in what you need.
I mean, it's only the same four letters after all.
So, create a public DNA museum of sequences.
I assume that some of this data will be useful, one day.
...from CERN. Sure, the Grid was massively expensive, but I doubt genome researchers are generating 27 TB of data per day.
Is it outpacing their ability to file patents on genome sequences?
A feeling of having made the same mistake before: Deja Foobar
As a genome researcher, I'd like to point out that I, for one, do not have nearly enough genome data. I simply need about 512GB of RAM on a computer with a hard drive that is about 100x faster than my current SSD, and processing power about 1000x cheaper. Right now, I bite the bullet and carefully construct data structures and implement all sorts of tricks to make the most out of the RAM I do have, minimize how much I have to use a hard drive, and extract every bit of performance available out of my 8 core machine. I wait around and eventually get things done, but my research would go way faster and be more sophisticated if I didn't have these hardware limitations.
I would figure most genomes are highly compressible. Especially if compressed against thousands of samples of a species and even across different species.
I have half my mother's genome and half my father's. I couldn't have that many mutations. To store all three genomes couldn't take more than 2.0001 times the size of a human genome.
I think these researchers should look at outsourcing these efforts, and China now has bragging rights to the fastest computer.
After all, most of our electronics are imported. It's sad, but what do you do when "...the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data..." as the intro to this submission says?
Oh hey look you made another account to goatse /. with. Good job.
"None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
Goatse!
Damn, I have not been goatsed in YEARS :(
But stand back! Steve Yegge is on the way to show them how to get things to scale:
https://www.youtube.com/watch?v=vKmQW_Nkfk8
I was under the impression the complete DNA sequence for a human can be stored on an ordinary CD.
Given the amount of data mentioned in TFA, it begs the question: what the hell are they sequencing? The genome of everyone on the planet?
Let's store all that data in tightly coiled long-chain molecules made of Carbon, Nitrogen, Oxygen and Phosphorus (with a dash of Hydrogen.) It'll be really compact and much cheaper than hard disks.
goatse alert!
You can build a cheap array of 3TB SATA drives for about $0.10/GB (as of a couple of months back). How much data do they have?
Having lots of data is a good thing! It eliminates the time researchers spend waiting for more data. Narrowing your view to smaller scopes within a large data set is a lot more powerful and flexible than simply relying on what little data you get as it comes in. This simply means we need to develop new research methods to deal with such large amounts of data, or simply that we need more researchers. Other fields of science have also encountered this issue and found ways to deal with it. NASA, for example, is in this position to some extent thanks to all the space telescopes. Yet they don't complain; instead they look for ways to increase the amount of data as well as new methods to deal with it.
Dealing with storage of large sets of data, while a large task, isn't impossible. It's just a matter of money. It may mean that the data sets need to be centralized, with resources pooled so that costs are kept to a minimum. Well, it's hard to say anything about this aspect with so little info.
Transmission of data is one issue they can't deal with beyond a certain point unless they pay to put down more fiber directly between them and the places they want it to go (generally impractical due to extreme costs). Technically, since latency doesn't seem to be an issue, they can always just mail hard drives to the places they want the data. That is more work and adds shipping cost, but it is a pretty small price for sending large amounts of data quickly.
In the end, it really all just comes down to money, and that's pretty much normal and a good thing. It means that things are going as fast as possible, limited by money rather than by technical issues that cause wait time.
From the article: "three billion bases of DNA in a set of human chromosomes". A base may hold 1 of 4 values: A, C, G and T. So each base can be represented with 2 bits. 2 bits * 3 billion = 6 billion bits = 750MB.
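To make the 2-bits-per-base arithmetic concrete, here is a minimal sketch of such a packing. The names (pack, unpack, packed_size_bytes) are made up for illustration, and it assumes the sequence contains only A, C, G and T; real sequence files also carry ambiguity codes such as N and quality scores, which this ignores.

```python
# Hypothetical helpers for illustration only: pack A/C/G/T into 2 bits each.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so unpack() reads it in order.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def packed_size_bytes(n_bases: int) -> int:
    """Bytes needed for n_bases at 2 bits per base, rounded up."""
    return (2 * n_bases + 7) // 8


if __name__ == "__main__":
    print(packed_size_bytes(3_000_000_000))  # size for three billion bases
```

The length has to be stored separately because the last byte may be padded.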
Someone needs to introduce these researchers to the 'diff' program.
Well, hard times for these guys. They have tons of data with next to no noise, errors or uncertainty. I can name 20 people I know personally that would love datasets like that for their research. Am I the only one seeing it this way? Shame you didn't buy the hard drives 4 months ago though. Tough break.
I thought we (humans) all had roughly (if not exactly) the same amount of data.. This title reeks of intent to mislead! :)
Geekism is your _only_ God!
The big problem is that the dramatic decreases in sequencing costs driven by next-gen sequencing (in particular the Illumina HiSeq 2000, which produces in excess of 2TB of raw data per run) have outpaced the decreases in storage costs. We're getting to the point where storing the data is going to be more expensive than sequencing it. I'm a grad student working in a lab with 2 of the HiSeqs (thank you HHMI!) and our 300TB HP Extreme Storage array (not exactly "extreme" in our eyes) is barely keeping up (on top of the problems we're having with datacenter space, power, and cooling).
I'll reference an earlier /. post about this:
http://science.slashdot.org/story/11/03/06/1533249/graphs-show-costs-of-dna-sequencing-falling-fast
There are some solutions to the storage problems such as Goby (http://campagnelab.org/software/goby/) but those require additional compute time, and we're already stressing our compute cluster as is. Solutions like "the cloud(!)" don't help much when you have 10TB of data to transfer just to start the analysis - the connectivity just isn't there.
Still doesn't elude my filter. See, I was thinking ahead when I designed that. He'll either have to quit trolling or get off his lazy ass and put some effort into trolling us.
"When information is power, privacy is freedom" - Jah-Wren Ryel
Hehe - I mis-read this as "GNome researchers" have too much data.
Probably along the lines of several thousand comments to the effect of "I can't stand GNOME 3", "I liked GNOME 2 better", etc, etc :)
I'm sure all the insurance companies would love to buy up all that data...
The only thing necessary for evil to triumph is for it to be pitted against a slightly greater evil
No, we really have too much data by comparison to the number of people who can analyze that data. That's the problem, and yes, I do this for a living and at the University I am at we can't analyze it fast enough.
Now back to analyzing genomes...
For individual research units, the cost of maintaining the processing power and storage space for these types of projects can be prohibitive. Cloud-based options offer distributed computing power and low-cost storage that is often a more economical solution than paying for the equipment in house, especially when genomic projects can come in spurts rather than a continuous stream.
Disclaimer: I work with large amounts of genomic data and use both in-house and cloud-based analysis tools.
'We are going to have to come up with really clever ways to throw away data so we can see new stuff.'
Have all your data open on a Windows share and an FTP server. Make them available on the full internet. Make some honest mistakes in setting up permissions. Copy and paste the "wrong link" into a hackers/gaming website. Wait a while... All your data has been replaced with illegal information, which makes it easy to clean out. Problem solved.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
This seems like just the kind of problem that AI will help with, by narrowing the field of 'interesting' things to look at. Either that, or better ways to search through the available data, along with better ways to store it, will probably work.
Way back in 1993, I visited an atomic laboratory in Pennsylvania. On the tour, they showed us the 30,000 core computing machine they had purchased several years before. "We still can't program it".
30 seconds later he pointed to the next piece of metal.
"This is our 120,000 core computer."
I raised my hand: "Why did you buy a 120,000 core machine when you can't even program the 30,000 core machine?"
"Well it's faster."
One of my early lessons in big companies attacking the wrong problem.
God: "I don't leave footprints!"
A couple of researchers in Sydney think they've got a model for searching the genome much more efficiently. They're trying to fund their research and development with crowdsourcing: http://rockethub.com/projects/4065-a-gps-for-the-genome : "The PASTE project [is] based on a new number system we call Permutahedral Indexing - P.I. for short, an N-dimensional map that efficiently locates and interrelates complex datasets in the space of all possible data. P.I. does this efficiently even when the data has hundreds of independent dimensions and comes in petabytes and exabytes."
They don't seem to need much money in the scheme of things - I might just throw in $25.
Got the cure for cancer yet?
I think there is general misconception here that the amount of data is too much to store. This is not the case, genomes are not that big. The real problem is that sequences can be obtained at much much higher rate and extreme ease (guaranteed good results) these days than a biochemist/molecular biologist/etc. can characterize said genes on a molecular level and assign proper function.
Just having the sequence for a new species is not enough. There is a certain amount of knowledge that can be transferred to the data of this new organism, since a lot of proteins/genes have already been characterized in other organisms. But there is also plenty of new or different stuff and doing real research on those is the hard part.
Speaking from personal experience I've entered a field where people have been working with a certain class of genes/proteins for *ages*, yet the way they work and do their job is still a complete mystery. The classic approach of just knocking out a gene, calling the resulting mutant some silly name (developmental biologist anyone?) simply is not sufficient anymore.
I just had to reprimand a genome research student who was using up multiple TB of space (I'm not allowed to put them under quota) with stupid uncompressed xml files.
3140? Haha that's pretty lame. I got more than that on a pastebin URL I had in my sig for a couple of weeks. Still I can see why you keep going, considering some of the lulzy responses, although there's a good amount of pity mixed in there these days.
Seriously though, you're a one-trick pony, the Microsoft of trolling. Try to mix it up. You're boring the grizzled veterans on here.
They only save a tiny fraction of collision events, those deemed "interesting". Even so, that's petabytes a year. This keeps the researchers busy during shutdowns such as now, analyzing these data for new particles or anomalies.
So, why can't they compress the data at the level of proteins? I mean it takes thousands of DNA base pairs to code for 1 protein, like hemoglobin, so instead of storing all that just say "here is the DNA sequence for protein X". Any exceptions, like mutations could then be indicated as "at position 758, the A is replaced by a G".
Of course if there is something REALLY novel, like a bioengineered virus that used different (non-standard) 3 base pair codons to encode the same amino acid, this kind of data compression wouldn't work but for 99.9999% of "natural" cases it would. (I saw this idea in the tv series "regenesis"). So for these (hopefully rare, it was for a bio-weapon!) cases a different type of compression would be used. "My" compression algorithm would, of course, break which would be a good indication this wasn't a natural DNA sequence.
I am neither a bio-expert nor a compression expert but this seems to me to be similar to the problem of compressing a vast library of books. Is it best to compress at the level of letters, words or even sentences? I'm only guessing what this entails because I'm not a linguist either! :(
(Then there's the whole business of introns, which "seem" to be content/protein free but I understand contain lots of regulatory information despite their repetitive nature. I would imagine these could be handled by some sort of pattern RLE.)
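The "at position 758, the A is replaced by a G" idea from this comment can be sketched as storing a sample as a list of differences against a shared reference. This is only an illustration under a big assumption: it handles single-base substitutions between equal-length sequences, not insertions or deletions. The function names are hypothetical.

```python
# Hypothetical sketch: store a sample as substitutions against a reference.
# Assumes both sequences have the same length (no insertions or deletions).

def diff_against_reference(reference: str, sample: str) -> list:
    """Return (position, base) pairs where the sample differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch handles substitutions only")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_diff(reference: str, variants: list) -> str:
    """Rebuild the sample from the reference plus its substitution list."""
    bases = list(reference)
    for position, base in variants:
        bases[position] = base
    return "".join(bases)


if __name__ == "__main__":
    ref = "ACGTACGT"
    sample = "ACGTAGGT"
    variants = diff_against_reference(ref, sample)
    print(variants)
    print(apply_diff(ref, variants) == sample)
```

The saving comes from the variant list being short when two genomes are nearly identical; handling indels would need an alignment step that this sketch leaves out.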
I am not a geneticist, so I might be way off base here. But isn't DNA data a grouping of ATCG bonds in various arrangements? It seems like the nature of the data itself would lend to effective compression and/or de-duplication.
In a previous post here, people were saying that mixing biology and computer science was a stupid idea. However, this clearly shows that it is much needed, except that in this case, the computer geeks can help out the biology nerds.
I8-D
Come on, are they really talking about "700 trillion bytes of computer memory"? What's the problem or did they drop a factor of 1000?
Genomics, like other data-intensive industries (geophysics, pharmaceuticals, GIS, etc.), is indeed beginning to experience new challenges in an area of computing that has traditionally been remedied with a rather simplistic solution--simply adding more storage. At a certain point, the management of large (and constantly growing) datastores becomes cumbersome and expensive. The confounding aspect of many of these instances is that, while this data needs to be stored and accessible (usually by many users and for extended periods of time, if not indefinitely), oftentimes only a very small segment of it is actually accessed or utilized, yet all of it must be kept on hand for future use.
This presents a number of challenges:
First and foremost, multiple users accessing the same dataset usually requires high-performance storage.
High performance disk storage is not cheap to operate; in fact, there are a number of capital and operational costs involved, including, but not limited to:
Acquisition, operation (power/cooling), expansion (we're talking massive RAID arrays, so the costs increase exponentially), backup (which is in itself another challenge and expense), replication, support and management. This can be a daunting task (and budget item) in small to medium businesses managing Terabytes of data--now translate that to Petabytes, Exabytes, etc.
As mentioned, the cost of creating the data has declined significantly as competition in the computer commodity marketplace has not only led to the development of faster processing, but simultaneously driven the cost down. What else has driven the cost down? New technologies such as virtualization that allow users to more efficiently and effectively utilize these resources. Instead of buying a separate server for each application, it became possible to run multiple "virtual" machines on a single physical server.
Consider the notion of virtualization ten years ago. A salesman walks into your data-center, points at your racks of servers and tells you that with this software, he can help you run all of your servers on 10% of the hardware you currently use. Save on hardware/support, upgrades, administration, licensing and software. Magic!
It took some time, but virtualization is now the norm--it makes sense, it works and it allows people to solve problems without simply adding more hardware.
Similarly, two technologies in the storage industry are changing the way we address data growth.
In a traditional backup scenario, a copy of your data is written to a media--usually disk or tape, night after night, or more/less frequently as your requirements/backup window/budget allows. How many times do you think the same file is present within the data set that is being backed up? Why back it up two, three or five hundred times? Why do that every night? Talk about a waste of space and money.
De-duplication technology is offered by a number of vendors in a number of flavors, the most robust of which involves a block-level analysis of data being backed up. As the data goes across the wire (usually en-route to an underlying appliance comprised of high-performance disk with robust processing and cache), it is analyzed for redundancy. If a pattern is unique, it is written to disk. If it is redundant, a pointer or reference is made to the location of the duplicate instance, instead of writing the data again. What happens on the first backup is an impressive reduction in the amount of data that is written to the device (most manufacturers cite up to 20:1 reduction ratios). On successive backups, only the changed data is written. Over time, the amount of data actually being written to the appliance is drastically reduced. As an added bonus, replication to a second device appreciates the same increase in efficiency--only unique instances of data need to be sent across the wire.
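The block-level mechanism described above (write a block if it is unique, otherwise store a pointer to the existing copy) can be sketched roughly as follows. This is a toy under stated assumptions, not how any particular vendor's appliance works: fixed-size blocks, SHA-256 as the fingerprint, and an in-memory dictionary standing in for the disk. The DedupStore class is hypothetical.

```python
import hashlib


class DedupStore:
    """Toy block-level de-duplicating store (in-memory, fixed-size blocks)."""

    def __init__(self, block_size: int = 4096):
        self.block_size = block_size
        self.blocks = {}  # fingerprint -> block bytes, each unique block stored once

    def write(self, data: bytes) -> list:
        """Store data and return its 'recipe': the list of block fingerprints."""
        recipe = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            fingerprint = hashlib.sha256(block).hexdigest()
            if fingerprint not in self.blocks:
                self.blocks[fingerprint] = block  # unique: write it
            recipe.append(fingerprint)            # redundant or not: keep a pointer
        return recipe

    def read(self, recipe: list) -> bytes:
        """Reassemble the original data from a recipe."""
        return b"".join(self.blocks[fp] for fp in recipe)

    def stored_bytes(self) -> int:
        """Bytes actually held, after de-duplication."""
        return sum(len(block) for block in self.blocks.values())


if __name__ == "__main__":
    store = DedupStore(block_size=4)
    first = store.write(b"AAAABBBBAAAA")   # third block repeats the first
    second = store.write(b"AAAABBBBCCCC")  # only the last block is new
    print(store.stored_bytes())
    print(store.read(first), store.read(second))
```

A second backup that shares most blocks with the first adds only its changed blocks, which is the effect the comment describes for successive backups.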
Need to restore a file? No problem. Instead of pulling your last full and subsequent incremental backup takes and
maybe an @home-type project would work?
It's not that there's too much data to store. There's too much to analyze. Storing 1M genomes is tractable today. Doing a pairwise comparison of 1M genomes requires half a trillion whole-genome comparisons. Even Google doesn't compute on that scale yet. (Disclaimer: I'm a postdoc in computational biology.)
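For anyone checking the "half a trillion" figure: comparing every genome against every other one is n choose 2 comparisons. A quick sketch (the function name is made up for illustration):

```python
import math


def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    n = 1_000_000
    print(pairwise_comparisons(n))
    # Same value via the standard library's binomial coefficient.
    print(math.comb(n, 2))
```

The count grows quadratically, so doubling the number of genomes roughly quadruples the work.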
Disk is cheap. Buy More Disk.
The 'problem' starts at the sequencing level - a typical trace file (the output of the sequencing instrument) tends to be about 250K. The trace file contains luminescence data on 4 or 5 channels taken at intervals while the marked molecules are drawn electrically through a matrix-filled channel. This is analyzed and results in something like 500 basecalls from the electropherogram. With 2-3 of those (redundancy and overlap) you can get about the same number in bases. So a small bacterial genome of 32K bases will need about 48 Mbytes in trace data.
All well and good - but it turns out that there's a lot more data in those trace files than once imagined. mtDNA traces may be re-examined for polymorphisms including heteroplasmic insertions and deletions - this is where not all of the mtDNA is the same. Analysis of these traces (and older traces for that quality) can reveal a lot more information than we got out of the simple sequence data when we first looked at it.
So the problem isn't at the ACTG level but below that where each basecall took about 500 bytes of instrument data and 1ms of processing power to generate. If you were to recall all the traces done to date you're looking at some pretty big storage and processing requirements.
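The trace-file numbers in this comment can be reproduced with a few lines. This is just a sketch of the arithmetic: it takes "250K" as 250,000 bytes and uses the upper end (3) of the "2-3" redundancy figure, both of which are my reading of the comment rather than anything it states outright.

```python
import math

TRACE_FILE_BYTES = 250_000   # "about 250K" per trace file, taking K as 1000
BASECALLS_PER_TRACE = 500    # "something like 500 basecalls"
REDUNDANCY = 3               # upper end of the "2-3" traces per stretch of bases


def trace_storage_bytes(genome_bases: int, redundancy: int = REDUNDANCY) -> int:
    """Trace storage needed to cover a genome at the given redundancy."""
    traces_for_one_pass = math.ceil(genome_bases / BASECALLS_PER_TRACE)
    return traces_for_one_pass * redundancy * TRACE_FILE_BYTES


def bytes_per_basecall() -> int:
    """Instrument data behind each basecall in a single trace."""
    return TRACE_FILE_BYTES // BASECALLS_PER_TRACE


if __name__ == "__main__":
    print(trace_storage_bytes(32_000))  # the 32K-base example from the comment
    print(bytes_per_basecall())
```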
Over in Eastern Europe there are people who deal with more data than that every day -- the bastards who end up with the results of all the malware that's capturing data from everyone's parents, grandparents, and children on their PC's every day. Those guys must REALLY know how to handle big data.
I'm not awake enough yet. I read the title as "Gnome Researchers Have Too Much Data."
I did a talk on this a few years back at TEDx Austin (shameless self promotion): http://www.youtube.com/watch?v=8C-8j4Zhxlc
I still deal with this on a daily basis and it's a real challenge. Next-generation sequencing instruments are amazing tools and are truly transforming biology. However, the basic science of genomics will always be data intensive. Sequencing depth (the amount of data that needs to be collected) is driven primarily by the fact that genomes are large (E. coli has around 5 M bases in its genome, humans have around 3 billion) and biology is noisy. Genomes must be over-sampled to produce useful results. For example, detecting variants in a genome requires 15-30x coverage. For a human, this equates to 45-90 Gbases of raw sequence data, which is roughly 45-90 GB of stored data for a single experiment.
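The coverage arithmetic above works out as follows. A small sketch, assuming roughly one stored byte per sequenced base (which is what the 45-90 GB figure implies; formats that also store quality scores and read names would take more). The constant and function names are hypothetical.

```python
HUMAN_GENOME_BASES = 3_000_000_000  # "around 3 billion"
BYTES_PER_BASE = 1                  # assumption implied by the 45-90 GB figure


def raw_sequence_bases(genome_bases: int, coverage: int) -> int:
    """Bases that must be sequenced to reach the given coverage depth."""
    return genome_bases * coverage


def stored_gigabytes(genome_bases: int, coverage: int,
                     bytes_per_base: float = BYTES_PER_BASE) -> float:
    """Approximate storage in GB (10**9 bytes) for one experiment."""
    return raw_sequence_bases(genome_bases, coverage) * bytes_per_base / 1e9


if __name__ == "__main__":
    for coverage in (15, 30):
        print(coverage, stored_gigabytes(HUMAN_GENOME_BASES, coverage))
```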
The two common solutions I've noticed mentioned often in this thread, compression and clouds, are promising, but not yet practical in all situations. Compression helps save on storage, but almost every tool works on ASCII data, so there's always a time penalty when accessing the data. The formats of record for genomic sequences are also all ASCII (fasta, and more recently fastq), so it will be a while, if ever, before binary formats become standard.
The grid/cloud is a promising future solution, but there are still some barriers. Moving a few hundred gigs of data to the cloud is non-trivial over most networks (yes, those lucky enough to have Internet2 connections can do it better, assuming the bio building has a line running to it) and, despite the marketing hype, Amazon does not like it when you send disks. It's also cheaper to host your own hardware if you're generating tens or hundreds of terabytes. 40 TB on Amazon costs roughly $80k a year whereas 40 TB on an HPC storage system is roughly $60k total (assuming you're buying 200+ TB, which is not uncommon). Even adding an admin and using 3 years' depreciation, it's cheaper to have your own storage. The compute needs are rather modest as most sequencing applications are I/O bound - a few high memory (64 GB) nodes are all that's usually needed.
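A rough way to check the cloud versus in-house comparison over a 3-year depreciation window. The $80k/year and $60k figures come from the paragraph above; the admin cost is not given there, so admin_share_per_year is a purely hypothetical parameter for the slice of an administrator's time attributed to these 40 TB.

```python
CLOUD_COST_PER_YEAR = 80_000   # "40 TB on Amazon costs roughly $80k a year"
IN_HOUSE_HARDWARE = 60_000     # "roughly $60k total" for 40 TB of HPC storage


def cloud_total(years: int) -> int:
    """Cumulative cloud storage cost over the period."""
    return CLOUD_COST_PER_YEAR * years


def in_house_total(years: int, admin_share_per_year: int = 0) -> int:
    """One-off hardware cost plus a hypothetical yearly admin share."""
    return IN_HOUSE_HARDWARE + admin_share_per_year * years


def break_even_admin_share(years: int) -> float:
    """Yearly admin share at which both options cost the same."""
    return (cloud_total(years) - IN_HOUSE_HARDWARE) / years


if __name__ == "__main__":
    years = 3
    print(cloud_total(years))
    print(in_house_total(years, admin_share_per_year=20_000))  # 20k is made up
    print(break_even_admin_share(years))
```

Power, cooling and floor space are left out here, so this leans in favor of the in-house option.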
Keep in mind, too, that we're asking biologists to do this. Many biologists got into biology because they didn't like math and computers. Prior to next-generation sequencing, most biological computation happened in calculators and lab notebooks.
Needless to say, this is a very fun time to be a computer scientist working in the field.
-Chris
DNA sequencing is becoming faster and cheaper at a pace far outstripping Moore's law.
Moore's law, or rather Moore's observation, has absolutely nothing to do with performance and everything to do with the number of transistors. For the love of deity of your choice, will they stop using it regarding performance? Simply mentioning something computer related doesn't make the writer look smarter. Yes, an increase in the number of transistors can bring an increase in performance but it isn't guaranteed. E.g., Bulldozer.
mv genome /dev/null 2>&1
??
You know just about any compression algorithm will squash the hell out of a file containing only 4 letters. Zip them for archiving. And those 100GB files will fit on a small hard drive and can be thrown out once the pieces are all put together.
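A quick way to try this claim with a general-purpose compressor. This sketch uses Python's standard zlib module on random A/C/G/T text, which is a pessimistic stand-in: with four equally likely letters the theoretical floor is 2 bits per base, while real genomes contain repeats that a compressor can exploit further. The helper names are made up.

```python
import random
import zlib


def random_dna(n_bases: int, seed: int = 0) -> bytes:
    """Generate n_bases of random A/C/G/T as ASCII bytes (toy test data)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(n_bases)).encode("ascii")


def compression_ratio(data: bytes, level: int = 9) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data, level)) / len(data)


if __name__ == "__main__":
    data = random_dna(100_000)
    print(f"ratio: {compression_ratio(data):.3f}")
    # Lossless: decompressing gives back exactly the original bytes.
    print(zlib.decompress(zlib.compress(data, 9)) == data)
```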
Though, there is quite a lot of that being generated these days.
The problem is the *raw* data - the files that come directly off of the sequencing instruments.
When we sequenced the human genome, everything came off the instrument as a 'trace file' - 4 different color traces, one representing a fluorescent dye for each base. These files are larger than text, but you could store the data on a local hard drive and do the base calling and assembly on what would be a desktop or beefy laptop by today's standards.
2nd gen sequencers (Illumina, 454, etc) take images, and a lot of them, generating many GB of data for even small runs. The information is lower quality, but there is a lot more of it. You need a nice storage solution and a workstation grade computer to realistically analyze this data.
3rd gen sequencers are just coming out, and they don't take pictures - they take movies with very high frame rates. Single molecule residence time frame rates. Typically, you don't store the rawest data - the instrument interprets it before the data gets saved out for analysis. You need high end network attached storage solutions to store even the 'interpreted' raw data, and you'd better start thinking about a cluster as an analysis platform.
This is what the article is really about - do you keep your raw 2nd and 3rd gen data? If you are doing one genome, sure! why not? If you are a genome center running these machines all the time, you just can't afford to do that, though. No one can really - the monetary value of the raw data is pretty low, you aren't going to get much new out of it once you've analyzed it, and your lab techs are gearing up to run the instrument again overnight...
The trick is that this puts you at odds with data retention policies that were written at a time when you could afford to retain all of your data...
-V-
Who can decide a priori? Nobody.
-Sartre
Yeah, back when Slashdot ran at 2400 bps, the comment limit was shorter than Twitter. But not to worry, like the Witnesses, the "great crowd" with seven-digit UIDs are relegated to a paradise on earth.
I have to say in 1981 making those decisions I felt like I was providing enough freedom for ten years, that is the move from 64K to 640K felt like something that would last a great deal of time.
The complaints as Gates recalls began in five years. He was off by a factor of two. I remember 1981 clear as day. There was hardly a baseline by which to judge the trajectory of the home computer. A monochrome 80 column display with mixed case was state of the art. By the end of 1982, the PC was selling a decimal order of magnitude faster than IBM projected, which put a whole different spin on enough. Volume drove down cost, and lower cost made eyes bigger sooner than almost anyone guessed.
I've read a lot from Gates over the years. Arrogant in most regards, but rarely stupid. Gates might have had the sentiment that a 0.33 MIPS processor didn't need 16MB of system memory, and figured that the memory limit would be addressed in a less anemic platform in the fullness of time. No-one in 1981 thought that 8088 byte code would still reign supreme thirty years later, any more than COBOL programmers in the 1960s worried about Y2K.
There's Plenty of Room at the Bottom as capiced already in 1959.
I don't really see a problem here. We have more than enough storage for the amount of analysis we're able to do. It's a short term nuisance that we have to invest some resources in being a little more selective in what we save, until storage or analysis catches up again.
There are some applications of genetics where the error component is the signal you're looking for. These methods are less forgiving of lossy synopsis. There might be room for some improvements to storage and compression algorithms in this space.
Put each sequence into a git repo, and periodically rebase ...
Look, I just bought a 6 core 16GB 2TB machine from newegg for $822 this week.
And I work with genetic data. We use 2TB external HDD to back up laptops, and have rack mounts of 8 core 8 blade machines in the basement.
The problem mostly is with deciding what data is relevant and categorizing it, and then processing it.
A lot of what some people think is "useless data" in the genome is precisely NOT useless. Inserts, deletes, misfolds - all are ways for DNA to store adaptive biochemical patterns in a finite space and let the same strand of RNA express differently in different environmental and biological conditions.
Half the time I've worked on a biochem or genetics project, it turns out what we initially thought was useless data in fact turns out to be important data.
But, that's just my personal opinion.
Do we need better open-source tools? Yes.
-- Tigger warning: This post may contain tiggers! --
We don't use clouds. Especially for human subjects. We might use a lab-only distributed storage paradigm, but it's not a "cloud", since we tightly control access and location.
It is not redundant data.
The actual shifts, misfolds, deletes, and insertions around hotspots in the DNA sequence itself store adaptive information that allows DNA to replicate differently in different biochemical and environmental conditions, so that you can survive famine AND feast, or cold AND heat.
Just because you think it's useless and repetitive doesn't mean it IS useless OR repetitive.
Some of it is, of course, like the viral rewrites from infections. Some is backup adaptations that you may not think useful today, but allow you to have gills - which you will need when global warming results in Bangladesh being underwater.
They have plenty of storage. It's just Darwinists covering up evidence of ID by throwing away the evidence that points to that conclusion. That way the researchers get to keep their jobs that allow them to bilk millions from the government. Oh, wait, this doesn't involve climate change, so I'll be modded down instead of up.
Actually, we're covering up the fact that your DNA not only encodes for you to have lungs, it also codes for you to have gills and turn into Zombie Lizards, if we just expose it to the correct biochemistry and environmental conditions.
You should see what embryos look like in the early stages.
Fish People from Outer Space!
I guess the next step will be to code custom chips to deal with this data efficiently. Imagine all the cool pieces of hardware you could build, like a chip that (de)compresses new genome data based on diffs from standard sequences. Some good custom chips could move you years ahead of where you are now. The high frequency traders have learned this lesson. Does anyone know if this is being attempted?
Hi! We were having a conversation about compression. Data that is losslessly compressible can be compressed and then expanded to the exact same thing. Which means that your prized gill sequences will survive unmolested.
A duplicated pattern of bits is exactly redundant data. That's how compression works.
What about insertions, deletions? What about no data? What about 4 different nucleotides?
Moreover, the article is about something completely different than storing genomes. The article is about analysis and storing raw sequencing data.
Since we don't yet know how to assemble these reads into a genome well, we want to keep the raw reads until we do.
Yes they are.
I'm looking for a petabyte hard disk right now.
WD doesn't have one, yet.
Wall Street pays more to do the same math. The reason the "data deluge" is such a problem in biology is because there is no money in it. A wall street quant has a starting salary of $80,000 to $120,000. A bioinformatics scientist is paid $60,000 to $80,000. As a senior bioinformatics scientist for the NIH, I was making $70,000 with no hope of breaking six figures in my lifetime. Do you have any idea what college tuition is projected to be in 2020?
I submit that any intelligent analyst capable of routinely reducing biological data analysis to tractable and publishable information will not sacrifice legacy for prestige. The financial calculation is eminently present to the numerical mind required for crunching these numbers.
Bottom line: the reason "this is so hard" is because the right people aren't stupid and don't do bioinformatics. Boost the pay table and then we can talk.
The Chinese government negotiated a good price for supplies too. BGI gets the reagents really cheap - paying only 10% of the official price! Welcome to the 21st century, when American corporations help China to kill the competition: American science and Biotech!
Is there a distributed computing platform that does for storage what SETI et al do for processing? Anyone want to build one?
Every rule has more than one consequence.