Data Archiving Standards Need To Be Future-Proofed
storagedude writes: Imagine, in the not-too-distant future, your entire genome is on archival storage and accessed by your doctors for critical medical decisions. You'd want that data to be safe from hackers and data corruption, wouldn't you? Oh, and it would need to be error-free and accessible for about a hundred years too. The problem is, we currently don't have the data integrity, security and format migration standards to ensure that, according to Henry Newman at Enterprise Storage Forum. Newman calls for standards groups to add new features like collision-proof hashes to archive interfaces and software.
'It will not be long until your genome is tracked from birth to death. I am sure we do not want to have genome objects hacked or changed via silent corruption, yet this data will need to be kept maybe a hundred or more years through a huge number of technology changes. The big problem with archiving data today is not really the media, though that too is a problem. The big problem is the software that is needed and the standards that do not yet exist to manage and control long-term data,' writes Newman.
While there certainly is an issue with data integrity and retention, it is unlikely that anyone will need their entire DNA sequence "stored" for future use. It's becoming clear that the DNA you're born with isn't the same as the DNA you have when they recycle you. Further, medicine doesn't need your entire genome. Just the part that the doctor (or whatever they're called at that point in time) is interested in.
It is far more likely that you will be resequenced as needed.
Besides, you won't be able to afford it anyway.
Faster! Faster! Faster would be better!
Keep your important data on current mainstream storage. This is the only way to preserve it - copy data from old disks to new disks whenever you upgrade.
Of course, at each upgrade you can also discard a lot of data that isn't necessary, but pictures and similar material should be preserved. Image formats have been stable for decades. Despite some improvements, a 25-year-old jpg is still viewable.
However, some document formats have to be upgraded to the latest version, since Microsoft especially has a tendency to "forget" its old versions. You may still lose some formatting, but the content of the documents is the important part.
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
Seriously, what's wrong with the MS Word .doc format? Feature complete, stable, lots of free implementations. I don't think for a second that I will be able to open any standardized "future-proofed" data archiving format in 500 years; but wouldn't be surprised if a good old-fashioned .doc works just fine.
What other storage medium, besides rock carving, can survive an EMP blast?
Preserving the bits accurately is only a small part of the problem. Knowing what the bits mean is critical. Having a bunch of .xlsx spreadsheet files in the year 2050 will be useless unless you also have Excel 2050, and it knows how to read them.
Unless you want to basically just 'print' all your data to a format like .pdf (or just plain old .txt), the programs to access data are as critical as the data.
The problem is that we do have formats that do work for long term archiving, but are limited to a platform and are not open, so decoding them in the future may be problematic.
WinRAR is one example. It has the ability to do error detection and correction with recovery records. However, it is a commercial product.
PAR records are another way, but it is a relatively clunky mechanism for long term storage.
Even medium term storage on disk/tape can be problematic:
There is one standard for backup programs for tape, and that is tar. Very useful format, but zero error correction or detection, other than reading and looking for hard errors. There are tons of backup programs that work with tapes. Networker, TSM, NetBackup, and many others come to mind, all using a different format. Of course, once you get the program, there is still finding the registration key, and some programs require online activation (which means when the activation servers get shut off, you can never do a restore from scratch again.) We need one archive grade standard for tape, perhaps with a standard facility for encryption as well.
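To make the tar point concrete, here is a minimal Python sketch of bolting detection onto tar from the outside: keep a SHA-256 manifest next to the archive and recheck it on read. The names build_archive and verify_archive are made up for illustration, and this only detects damage; it corrects nothing.

```python
import hashlib
import io
import tarfile


def build_archive(files):
    """Pack {name: bytes} into an in-memory tar; return (tar_bytes, manifest)."""
    buf = io.BytesIO()
    manifest = {}
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
            # The manifest lives outside the tar, since tar itself has no content checksum.
            manifest[name] = hashlib.sha256(data).hexdigest()
    return buf.getvalue(), manifest


def verify_archive(tar_bytes, manifest):
    """Return the sorted names whose content is missing or no longer matches."""
    seen = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar:
            if member.isfile():
                data = tar.extractfile(member).read()
                seen[member.name] = hashlib.sha256(data).hexdigest()
    return sorted(n for n, digest in manifest.items() if seen.get(n) != digest)
```

An empty list from verify_archive means every file still matches its recorded hash. The manifest needs to be stored and protected separately from the archive it describes.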
Same with disks. It wasn't until recently that there was any bit rot detection in filesystems at all. Now with ReFS, Storage Spaces, ZFS, and btrfs, we now can tell if a file is damaged... but none of the filesystems have the ability to store ECC on an entire (other than ZFS and ditto blocks.) It would be nice to have part of a filesystem be a large area for ECC on a block basis. It would take some optimization for performance, but adding ECC in the filesystem is more geared for long term storage than day to day file I/O.
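As a rough illustration of block-level redundancy, here is a Python sketch that uses per-block SHA-256 checksums plus one XOR parity block per group. This is a simple stand-in, not real ECC such as Reed-Solomon, and not how any of the filesystems named above work: it can rebuild only one damaged block per group and assumes equal-sized blocks. protect and repair are hypothetical names.

```python
import hashlib


def _xor(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def protect(blocks):
    """Return (checksums, parity) for a group of equal-sized blocks."""
    checksums = [hashlib.sha256(b).digest() for b in blocks]
    parity = bytes(len(blocks[0]))  # all-zero starting value
    for block in blocks:
        parity = _xor(parity, block)
    return checksums, parity


def repair(blocks, checksums, parity):
    """Return the blocks with at most one damaged block rebuilt from parity."""
    bad = [i for i, b in enumerate(blocks)
           if hashlib.sha256(b).digest() != checksums[i]]
    if not bad:
        return list(blocks)
    if len(bad) > 1:
        raise ValueError("more than one damaged block; single parity cannot repair")
    rebuilt = parity
    for i, block in enumerate(blocks):
        if i != bad[0]:
            rebuilt = _xor(rebuilt, block)
    # Recheck the rebuilt block in case the parity itself was damaged.
    if hashlib.sha256(rebuilt).digest() != checksums[bad[0]]:
        raise ValueError("rebuilt block does not match its checksum")
    fixed = list(blocks)
    fixed[bad[0]] = rebuilt
    return fixed
```

The checksums say which block is bad and the parity supplies the missing bits. Surviving more than one bad block per group takes a stronger code.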
Finally there is paper. Other than limited stuff on QR codes, there isn't any real way to print a document onto paper, then scan it to get it back. There was a utility called Paperbak that purported to do this, offering encryption, error correction, various DPI codes, and so on. It printed well, but could never scan and read any of the documents printed, so it is worthless. What is needed is something like the Paperbak utility, but with a lot more robust error detection (like checking if blocks are at an angle, similar to how QR codes can be scanned from any direction.) This utility would have to be completely open for it to have any use at all. However, if it could be done to print small documents to paper, it would help greatly in some situations, such as recovering encryption keys, archived tax documents, and so on.
Ironically, in general, we have the formats for long term storage. We just don't have any that are open.
Hardware is an issue too. Hard drives are not archival media. Tapes are, but one with a reasonable capacity is expensive, well out of reach for all but the enterprise customers. It would be a viable niche for a company to make a relatively low cost tape drive that could work on USB 3, has a large buffer (combined with variable tape speeds to prevent shoe-shining), and has backup software with it that is usable and open, where the formats can be re-engineered years down the road for decoding.
The problem is, we currently don't have the data integrity, security and format migration standards to ensure that, according to Henry Newman at Enterprise Storage Forum.
He is wrong, of course. We have all of that right now.
You won't need to archive your genome. It will be re-sequenced in 5 seconds each time you go to the doctor. Because it will be cheap, and because it may evolve over time. The same way blood samples are not archived for life, or teeth X-rays are taken periodically, they're just taken when needed.
Wakes up, "WTF? I have a....Vagina!? Hoooneeeyyy!"
Table-ized A.I.
Any optical media, actually... Like a CD, remember?
Fifteen years ago I long-term stored some important files.
Rather than keeping the fancy formatting I saved them in plain ASCII text and saved as *.txt.
I burned it all on gold-plated CDs, which were then considered archive proof.
These CDs are stored at three different locations.
I'm sure there are better ways today.
I propose storing it in a new medium. A "molecular chain", which should withstand the effects of EMP, right?
A name for it. Hmmm. How about the Destroy-Not Archive, or D.N.A. for short.
How is it possible to have a collision-proof hash?
Lol, there is no such thing. It's a hash function, you get 2^n width, some reasonable cryptographic assuredness against collision, and that's it. Collision proof is not a hash function, it's a data compressor. Remind me not to put this guy in control of my storage.
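For a sense of what "reasonable cryptographic assuredness" means in numbers, here is a small Python sketch of the standard birthday-bound approximation, p ≈ 1 - exp(-k(k-1) / 2^(n+1)) for k items and an n-bit hash. It assumes an ideal hash whose outputs are uniformly random, and collision_probability is just an illustrative name.

```python
import math


def collision_probability(num_items, hash_bits):
    """Approximate chance of at least one collision among num_items random hashes."""
    space = 2.0 ** hash_bits
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-num_items * (num_items - 1) / (2.0 * space))
```

With a 64-bit hash, about four billion objects (2^32) already give roughly a 39% chance of a collision. With a 256-bit hash, the figure for a billion billion objects is still below 1e-40. So not collision proof, just collision improbable.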
By the way, ZFS works great with raidzN, internal sha256, and lz4. Get it on FreeBSD.
Great, so you keep the compressed data and the original data, that way you can check if the data is still valid and as a bonus if either the compressed data or original data breaks you can restore it.
While you may be right about the current use we have for DNA, it's very likely that medicine will have many more uses for it in the future. Prices on genome sampling are going down rapidly too, so it's reasonable to use this as an example why we might want to store data error free for at least a century.
There will be many more things we want to store. Remember all those old city records and paper books? The newspaper archives? Early 20th century cellulose film? All those data sources have their problems and we have already lost a lot of information that is valuable to us now. Your parents' and grandparents' color photographs have lost a lot of the color in them already. Not just the prints, but also the negatives. Those VHS video tapes of your dad growing up? They're turning into noisy images right now.
People have plenty of reasons to come up with a proper way to store data in such a way that it's still accessible for future generations, or themselves later in life.
I was promised a flying car. Where is my flying car?
Technology is always changing. Whatever is today's commodity storage device will be tomorrow's rare anachronism.
So we put haxx0z in ur data so u can get haxx0rzd while u get haxx0rzd.
You!
We already have the technology to preserve the data: http://www.pcworld.com/article...
Just scrape off the rust and you're good to go. Now, where did I put my M14G and FR3010.
My karma is not a Chameleon.
Get the acid-free paper. Will last forever
nuff sed
Your bank records exist despite changing hardware and software because the data is kept in use. It's kept alive. It is added to, modified... active. Your genetic records could be kept active. Keep them part of a patient record and they'll be copied, migrated, translated, from one system to the next to the next to the next for as long as you live.
Only when the data goes dormant can it rot. By all means... have long term storage media for long term data archiving. But the best means of keeping data current is to keep it moving.
All that said... the data we're talking about can't be that much data. A few terabytes should be more than what you need to store that kind of stuff for one person. And that kind of storage is already cheap.
I've decided to stop wasting my time responding to AC trolls/sockpuppets... so if you want a response from me... login.
... or for that matter any of your medical history. MDs do spot-diagnosis in 5 minutes or less based exclusively on what they've memorized or else they do no diagnosis at all.
My wife has a major genetic defect (MTHFR C677T), which causes severe nutritional problems. We haven't yet met an MD who has a clue about nutrition. Moreover, we had to diagnose this problem ourselves through genetic testing, with no doctors involved. We've shown the results to doctors, and they don't entirely disbelieve us, but they also have no clue what to do about it and still are dubious of the symptoms. (Who has symptoms of Beriberi these days? Someone whose general ability to absorb nutrients is severely compromised.)
What makes anyone think that this will change if your doctor has access to your DNA, even with detailed analysis? They won't take the time to actually read any of it. In fact a lot of what we know about genetic defects pertains to problems in generating certain kinds of enzymes, a lot of which participate in nutrient absorption. (So obviously RESEARCHERS know something about nutrition.) These nutritional problems require supplementation that MDs don't know about. Do you think the typical MD knows that Folic Acid is poison to those with C677T? Nope. They don't know the differences between folic acid, folinic acid, and methylfolate and still push folic acid on all pregnant women (they should be pushing methylfolate). They also don't know the differences between the various forms of B12 and always prescribe cyanocobalamin even for people who need the methyl and hydroxy forms.
Another way in which MDs are useless is caused by their training. Basically, they're trained to be skeptical and dismissive. Many nutritional and autoimmune disorders manifest with a constellation of symptoms, along with severe brainfog. Someone with one of these problems will generally want to write down the symptoms when talking to a doctor, because they can't think clearly. The thing is, in med school, doctors are specifically trained to look out for patients with constellations of symptoms and written lists, and they are told to recognize this as a condition that is entirely within the mind of the patient. Of course, a lot of doctors, even if not trained to dismiss things as "all in their head", are terrible at diagnosis anyway. They'll have no clue where to start and won't have the patience to do extensive testing. It's too INCONVENIENT and time-consuming. They won't make enough money off patients like this, so they get patients like this out the door as fast as possible.
I've had some good experiences with surgeons. But for any other kind of medical treatment, MDs have been mostly useless to me and my family. In general, if we go to one NOW, we've already diagnosed the problem (correctly) and possibly need advice on exactly which medicine is required, although when it comes to antibiotics, it's easy enough to find out which ones to use. (Medical diagnosis based on stuff you look up on the internet is really hard and requires a very well-trained bullshit filter, and you also have to know how to use the more authoritative sources properly. However, it's not impossible for people with training in things like law, information science, and biology. It just requires really good critical thinking skills. BTW, most MDs don't have that.)
MDs are technicians. Most of them are like those B-average CS grads from low-ranked schools who can barely manage to write Java applications. If you know how to deal with a low-level technician, guide them properly, and stroke their ego in the right way, you can deal with an MD.
One of the big differences between archiving and backup is that in archiving I want to keep this exact version intact, if it changes on me it's an error while a backup takes a copy of whatever is now - maybe I wanted to edit that file. Unlike backups I think it's not about versioning, it's about maintaining one logical instance of the archive across different physical copies. Here's what I'm thinking, you create a system with three folders:
archived
to_archive
to_trash
The archive acts like a CD/DVD/BluRay and is read-only. So far, nothing but a really awkward way to create a WORM(-ish) drive, but the real point comes next in distribution and synchronization.
When you put a file in "to_archive" a job will pick it up and wrap it in AES (with AES-NI the cost of on-the-fly encryption/decryption is very slim) and create a torrent-like file for it and move it to archived. If you want to delete it from the archive, you drag the file to the "to_trash" folder or maybe you put some kind of lock/freeze/undo timer on that function. Files that are in "archived" are sync'ed to other computers - still encrypted - which means you can shop around for storage/bandwidth, maybe you got multiple locations yourself (home/cabin), maybe swap backup with friends or family or you can buy it on the open market and they'll all mingle and share data because it's based on basic torrents.
They can all do basic limits on size/bandwidth so you can have pricing plans and caps, you can have one-way "leeches" that download and archive it on tape that can physically deliver it to you. If you build it fairly smart you can also have local, offline backups and if you restore them it'll pick up that 95% is the same as last week and sync up the rest. Basically a "Redundant Array of Inexpensive Archive Locations." It will leak a little bit of metadata as to size and number of files, but not file or directory names and you can probably muddle that metadata up with padding and dummy files if you want.
Of course you can choose to have the AES key on several computers so you can access your media from any of them. And as a free bonus a device that has the AES key like say your cell phone can use this as an online library, it doesn't have to auto-sync everything. With many locations = many peers it won't matter if one is down and you aggregate up the bandwidth, just like in any other torrent swarm. Through the seed/peer numbers you can at any time watch the state of your backup in progress as you add files. If your computer goes to shit, tell it the archive key and it'll hook up and start syncing. Just like a torrent client you can set priorities on what to download first.
It's not for all your data, but I think a lot of common user data is that way. Those RAW photos or video or audio you took? Archive them, "single" everlasting master copy. It doesn't replace backup of say documents you're working on or source code you're developing but it complements it.
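A minimal Python sketch of the ingest job described above, under some loud assumptions: the AES wrapping and the torrent-style sync are left out (the standard library has no AES), and a SHA-256 manifest stands in for the torrent metadata. The ingest function and manifest.json are hypothetical names; only the folder names come from the post.

```python
import hashlib
import json
import shutil
import stat
from pathlib import Path


def ingest(root):
    """Move files from to_archive into archived, record hashes, mark them read-only."""
    root = Path(root)
    inbox, archive = root / "to_archive", root / "archived"
    inbox.mkdir(parents=True, exist_ok=True)
    archive.mkdir(parents=True, exist_ok=True)
    manifest_path = root / "manifest.json"
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    for path in sorted(inbox.iterdir()):
        if not path.is_file() or path.name in manifest:
            continue  # an archived master copy is never overwritten
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        target = archive / path.name
        shutil.move(str(path), str(target))
        # Read-only for everyone: any later change counts as an error, not an edit.
        target.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
        manifest[path.name] = digest
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

Each remote copy can be checked against the same manifest, which is what makes it one logical archive across many physical locations. A file whose name is already archived stays in to_archive instead of replacing the master.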
Live today, because you never know what tomorrow brings
for the huge and growing number of people on this planet. I get how wonderful it is that genetic medicine might allow us all to live to the age of 150, eliminate birth defects, and cure Aunt Millie's cancer. But really, just where are we going to put all the people whose lives we save and extend while at the same time the birth rate keeps climbing? How will we feed them? How will we maintain a viable biosphere in an era of rapidly accelerating extinctions?
All that long term data will be meaningless if human society collapses under its own weight. If we're going to invest in keeping data viable so we can maintain and extend our scientific and technological reach, perhaps we should use it to help solve more important problems than our navel-gazing, narcissistic fixation on immortality and eternal youth.
'The Economy' is a giant Ponzi scheme whose most pitiable suckers are the youngest among us and the yet-unborn.
Something about securing genomes, coming from a guy called Newman? And not a single Jurassic Park joke after 79 posts?
What a shame.
Glad you mentioned LTFS, but you should also look at CDMI and SIRF. The standards are there, but need to be put into an Offering that implements the policies and procedures you mentioned. The only thing holding this back is the business case and awareness of the issues, which your article does nicely.
Good luck trying to archive this one. We are having serious issues reading media that were in use as recently as the 1980s... Some media formats are so obsolete there's just no way to read them.
What we really need is a new type of media format whose newer access devices are kept backwards compatible... Maybe something like Star Trek's data crystals? Heck, their readers were so good they were able to adapt them to read different races' data crystals in some episodes...
One can whine and wax poetic all one wants, but since we don't have a good archival format, the practical solution today is continual refresh of data: periodically copying data to fresh, and technologically up-to-date media. It's not sexy, but it does address three of the four points at the end of the linked piece (end-to-end data integrity, format migration and secondary media formats). The unaddressed point, access audit trails, makes no sense given the premise stated at the beginning of the piece that "No matter what anyone tells you, there is data that does not need to be on primary storage".
Yes, this is expensive. Yes, it would be nicer (cheaper) if a one-time single format could address the archive problem.
P.S. There is also this gem from the piece:
Of course the whole point of a hash is a mapping from a high-cardinality space to a low-cardinality space, and thus collisions are always a possibility. Collisions are minimized when a good hashing function uniformly distributes the resulting hashes, but given a large enough collection of source documents (no more are needed than the cardinality of the hash space), collisions will occur.
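The pigeonhole argument is easy to watch happen if you shrink the hash. A small Python sketch, assuming SHA-256 truncated to a handful of bits as a toy hash (first_collision is an illustrative name):

```python
import hashlib


def first_collision(hash_bits):
    """Hash 0, 1, 2, ... with truncated SHA-256; return the first colliding pair of inputs."""
    seen = {}
    i = 0
    while True:
        digest = hashlib.sha256(str(i).encode()).digest()
        tag = int.from_bytes(digest, "big") >> (256 - hash_bits)  # keep the top hash_bits
        if tag in seen:
            return seen[tag], i
        seen[tag] = i
        i += 1
```

With hash_bits=8 there are only 256 possible tags, so the loop must return by the 257th input at the latest. In practice it returns much sooner, because of the birthday effect.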
Your body produces tons of it, and it can be stored and sequenced considerably longer than human lifespans, especially if care is taken to preserve it.
This is my signature. There are many like it, but this one is mine.
I wrote an article about long-term storage *hardware* in CACM -- "The Forever Disc". My favorite musing had to do with writing the data into a population's genetics, and letting redundancy correct errors/mutations..
Tom Geller
There's a lot of work in this space from digital libraries for preservation of cultural heritage, state/official archives, etc. Start with the Open Archival Information System reference model (ISO OAIS, an international standard originally from space agencies). PREMIS is a preservation metadata standard from the US Library of Congress, but it is used around the world for digital assets. It works well with the METS encoding standard and the MIX technical metadata standard. Look at PRONOM and DROID for format policy registries, monitoring and migration planning. Digital asset repositories like Fedora Commons Repository 4 (being worked on by the DuraSpace community) have built-in fixity checking for bit rot and store to a wide range of underlying technologies (including file systems like ZFS, tar), etc. There's LOCKSS for distributed archive relocation and exchange programs. Or Tahoe-LAFS. (There are also things like CryptoSphere coming.) There are tools like Archivematica too, for ingestion workflows, characterization, etc. See also the recent partnership with DuraSpace. http://www.duraspace.org/artic....
Check out guidelines for data archival at http://datasealofapproval.org/en/. Also http://www.duraspace.org/, they support Dspace and Fedora repository work.
Is it too much to ask that every patient be looked at as a unique case? Should every patient be put into a known category as soon as possible?
I know, if it quacks like a duck, etcetera. But it happens quite often that it isn't a duck. Why not examine every patient as a genuine case, instead of lumping them into one of the few general cases? Is it money, greed, what?
Fix what ails people, that is what doctors are supposed to do, right? So why do so many not do that?
When the Vatican decided to digitize their archives, they chose to store the images in FITS format for pretty much the same reasons. One thing FITS doesn't address is preventing unauthorized access to the data.
A Shadeless room is a brighter room.
(Disclaimer: I am an Arvados developer)
The Arvados project is a free and open source (AGPLv3 and Apache v2) bioinformatics platform for genomic and biomedical data, designed to address precisely the issues raised in this article. Arvados features 1) a content-addressed filesystem (blocks are addressed by a hash of their actual content rather than some arbitrarily assigned identifier) which performs end-to-end data integrity checks, 2) fine-grained access controls, 3) a cluster scheduling system that tracks the input and output results of every job (enabling you to track processing pipelines and establish data provenance), and 4) data replication by default. Arvados is developed and commercially supported by Curoverse, which is 100% committed to free software (in fact, one of the founders is a former employee of the Free Software Foundation.) I encourage slashdotters in the bioinformatics, big data, or data archiving space to come check it out and join the community.
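For anyone unfamiliar with content addressing, here is a toy Python sketch of the general idea. It is not the Arvados API or its storage format: ContentStore is a hypothetical in-memory class, and SHA-256 is an arbitrary choice for the example.

```python
import hashlib


class ContentStore:
    """Toy content-addressed block store: the key is the hash of the data."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blocks[key] = data  # identical content always lands under the same key
        return key

    def get(self, key: str) -> bytes:
        data = self._blocks[key]
        # End-to-end check: the address itself tells us what the bytes must hash to.
        if hashlib.sha256(data).hexdigest() != key:
            raise IOError("block failed integrity check: " + key)
        return data
```

Because the address is derived from the content, every read can be verified without a separate checksum database, and duplicate blocks are stored once.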
http://gattaca.wikia.com/wiki/...