Slashdot Mirror


Data Archiving Standards Need To Be Future-Proofed

storagedude writes Imagine in the not-too-distant future, your entire genome is on archival storage and accessed by your doctors for critical medical decisions. You'd want that data to be safe from hackers and data corruption, wouldn't you? Oh, and it would need to be error-free and accessible for about a hundred years too. The problem is, we currently don't have the data integrity, security and format migration standards to ensure that, according to Henry Newman at Enterprise Storage Forum. Newman calls for standards groups to add new features like collision-proof hash to archive interfaces and software.

'It will not be long until your genome is tracked from birth to death. I am sure we do not want to have genome objects hacked or changed via silent corruption, yet this data will need to be kept maybe a hundred or more years through a huge number of technology changes. The big problem with archiving data today is not really the media, though that too is a problem. The big problem is the software that is needed and the standards that do not yet exist to manage and control long-term data,' writes Newman.

113 comments

  1. Nope by ColdWetDog · · Score: 1, Offtopic

    While there certainly is an issue with data integrity and retention, it is unlikely that anyone will need their entire DNA sequence "stored" for future use. It's becoming clear that the DNA you're born with isn't the same as the DNA you have when they recycle you. Further, medicine doesn't need your entire genome. Just the part that the doctor (or whatever they're called at that point in time) is interested in.

    It is far more likely that you will be resequenced as needed.

    Besides, you won't be able to afford it anyway.

    --
    Faster! Faster! Faster would be better!
    1. Re:Nope by Anonymous Coward · · Score: 1

      Besides, you won't be able to afford it anyway.

      Why not? Whole genome sequencing is already down to a few thousand dollars. Within the next decade it will almost certainly have dropped below a thousand. And there will be standard analysis pipelines (hopefully some of which are freely available and open source) to check for the most common pathogenic mutations. Now, paying an expert to do a custom analysis could easily reach into the hundreds of thousands of dollars. But I'm just not seeing why the sequencing itself would be unaffordable.

    2. Re: Nope by Anonymous Coward · · Score: 0

      It's important to study how DNA changes, so storing it fully and accurately is important.

    3. Re:Nope by Anonymous Coward · · Score: 0

      Astronomers are able to read very old files today with no issues. They use the FITS format that was created with punch cards in mind and keep using the format as a standard because astronomical data really doesn't get old and you never know when it will be collected again. It has checksums for both meta info and binary data so data integrity can be checked.

      The author assumes that this hasn't been thought of before.
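A rough sketch of the integrity check described above, in Python. As far as I know the FITS convention keeps its checksums in header keywords and is built on a 32-bit ones' complement sum; the code below only illustrates that kind of sum over arbitrary bytes and does not reproduce the actual FITS encoding. The function names are made up for this example.

```python
def ones_complement_sum32(data: bytes) -> int:
    """32-bit ones' complement sum of data, zero-padded to a multiple of 4 bytes."""
    if len(data) % 4:
        data += b"\x00" * (4 - len(data) % 4)
    total = 0
    for i in range(0, len(data), 4):
        total += int.from_bytes(data[i:i + 4], "big")
        # End-around carry: fold any overflow back into the low 32 bits.
        total = (total & 0xFFFFFFFF) + (total >> 32)
    return total


def verify(data: bytes, stored_sum: int) -> bool:
    """Return True if data still matches the checksum recorded at archive time."""
    return ones_complement_sum32(data) == stored_sum
```

Keeping one sum for the metadata and another for the binary payload, as the comment describes, lets a reader tell which part of a file was damaged.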

  2. Keep your important data on current storage. by Z00L00K · · Score: 4, Insightful

    Keep your important data on current mainstream storage. This is the only way to preserve it - copy data from old disks to new disks whenever you upgrade.

    Of course at each upgrade you can also discard a lot of data that isn't necessary, but pictures and similar stuff shall be preserved. Data formats for images have been stable for the last few decades. Even though some improvements have occurred, a 25 year old jpg is still viewable.

    However, some document formats have to be upgraded to the latest version, since Microsoft especially has a tendency to "forget" its old versions. You may still lose some formatting, but the content of the documents is the important part.

    --
    If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
    1. Re:Keep your important data on current storage. by _merlin · · Score: 1, Informative

      JPEG wasn't standardised until 1992. There are no 25-year-old JPEG files. Things have changed a lot since 1989.

    2. Re:Keep your important data on current storage. by Anonymous Coward · · Score: 0

      I, for one, welcome our pedant overlord !

    3. Re:Keep your important data on current storage. by _merlin · · Score: 1

      But seriously, JPG is everywhere now but how long will it last? Could you read pre-JPEG image formats? Do you have software that will open PhotoCD, or PBM, or XBM or IFF in all their variants? I expect some formats like DNG will be around for a while, but the XMP processing instructions contain a "process version", and how long will software continue to support the process versions we use today? Data security really isn't straightforward when you don't know what the future holds.

    4. Re:Keep your important data on current storage. by Dadoo · · Score: 1

      What I want to know is, what ever happened to fuse-based proms, and why we can't use similar technology to store important data? I have to believe that, with current technology, we could create proms with a density at least as high as current usb keys, and since they're just microscopic wires in a hermetically sealed package, they'd last basically forever.

      --
      Sit, Ubuntu, sit. Good dog.
    5. Re:Keep your important data on current storage. by ihtoit · · Score: 1

      yep, out of the box my windows 7 laptop could read GIF89a and Targa formats. Pretty sure yours could too.

      --
      Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
    6. Re:Keep your important data on current storage. by _merlin · · Score: 0

      You're picking easy formats. What about Macintosh PICT, including vector information (not just a bitmap PICT)? What about WMF? I could see both those formats becoming effectively unusable within a decade, as they effectively depend on the drawing API/environment of ancient operating systems (classic MacOS and Windows 3).

    7. Re:Keep your important data on current storage. by Anonymous Coward · · Score: 0

      It seems like you are arguing that all the open source code that implemented readers in C at the time has somehow disappeared in a cloud or become impossible to use. Seriously, are you trolling, or do you not know about conversion tools like http://www.imagemagick.org which support the formats you mentioned? (see http://www.imagemagick.org/script/formats.php).

    8. Re:Keep your important data on current storage. by ihtoit · · Score: 1

      PICT is a legacy Mac format, precursor to PDF. WMF is tangentially similar in that it uses function calls (PICT uses opcodes) to "draw" a scalable image, however the WMF specification continues to be updated to this day (last update was in February?).

      You couldn't use either on a RISC box running on RISC OS 3.1 (without plugins and/or serious hacking), so for me they're both useless for archiving right out the gate.

      You want an open vector standard such as SVG (for the simple reason that future systems will be more likely able to read these than .wmf or .pict since the specification is publicly available with no restrictive licensing or patent issues). Come back when you've completed the conversion. :)

      --
      Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
    9. Re:Keep your important data on current storage. by _merlin · · Score: 1

      ImageMagick doesn't support PICT with vector information. I'm just trying to make a point that even if a format seems to be widespread now, it may become effectively useless in the future. Believe me PICT files were everywhere in the classic Mac days.

    10. Re:Keep your important data on current storage. by _merlin · · Score: 2

      Yeah, SVG renderers have more chance of being around in 50 years than WMF or PICT. But you still need to actively go through your data archives, find things in "endangered" formats, and migrate them to more future-proof formats. This requires substantial effort that increases as the collection grows. Then there's verifying that nothing was lost in the conversion to consider.

    11. Re:Keep your important data on current storage. by ihtoit · · Score: 1

      archiving serious amounts of data does require careful forethought. Actually, I would say that archiving your photo collection requires as much forethought. Futureproofing is but one facet of the problem, you've also got disaster preparedness among a great many other things to consider. Storage media, not just the file format, is another. Will a floppy disk drive be available in fifty years? How about a five inch optical disc reader? Quarter inch tape? DAT? Vinyl? Etched steel plate? How resilient is your storage? Will it withstand an EMP, fire, freezing, earthquake? Is one copy in one format in one location enough? Future proof your media as well as your formats.

      --
      Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
    12. Re:Keep your important data on current storage. by aix+tom · · Score: 1

      Which goes to show that you had better not use proprietary formats that are used by only one software vendor for archiving purposes, no matter how "everywhere" they are at a specific point in time.

    13. Re:Keep your important data on current storage. by Anonymous Coward · · Score: 1

      And LBM files were everywhere in the DOS/AMIGA days due to artists using Deluxe Paint for graphics. Still, your point is about implementation details and barking up the wrong tree. At that point in time, where (I believe you) PICT files were everywhere in the classic Mac days, were there not ANY open source libraries for reading them? If there were none, you are screwed anyway due to storing your documents in a closed format, nothing can save your soul (maybe a VM with that crap closed source software). You took the wrong decision and now have to live with it.

      But if there WERE open source libraries, why should they stop existing? Or why would you be unable to rewrite the format's reader in a newer language? Interested gamers continuously reverse engineer closed source game file formats to mod them without any documentation. Are you really telling me PICT vector information is somehow impossiburu to read back? Haven't you heard of scummvm? If there is no scummvm for PICT vector information maybe it's because nobody cares.

      Man, maybe you should tell U2 and Apple to ship their new DRM crap in PICT vector files...

    14. Re:Keep your important data on current storage. by RabidReindeer · · Score: 1

      Not WMF. I had to write a WMF module to generate graphics commands for a laser printer back ages ago. It wasn't that hard.

      Actually, I'm reasonably sure that WMF or a descendent of it is used for the device-independent spool format on modern versions of Windows. Since it's basically a recording of the GDI commands.

      Still, a better bet would be to convert those WMFs to Postscript format if you want real longevity.

      My votes for things most likely to still be decodable 1000 years from now are PDF/Postscript, JPEG, GIF, and ZIP with LZW. Assuming that the media can remain uncorrupted and readable and that civilization hasn't crashed between now and then.

    15. Re:Keep your important data on current storage. by Jawnn · · Score: 2

      JPEG wasn't standardised until 1992. There are no 25-year-old JPEG files. Things have changed a lot since 1989.

      So what's your point? I have GIF images that predate 1989. They still render just fine. I could convert them if I felt the need. I don't, because the formats are indeed "stable".

    16. Re:Keep your important data on current storage. by Bing+Tsher+E · · Score: 1

      Well, maybe fuse based PROMs with dies the size of a 12" record album jacket.

    17. Re:Keep your important data on current storage. by Anonymous Coward · · Score: 0

      JPEG wasn't standardised until 1992. There are no 25-year-old JPEG files. Things have changed a lot since 1989.

      TIFF then.

      "Revision 3.0" was published in 1986: https://en.wikipedia.org/wiki/Tagged_Image_File_Format

    18. Re:Keep your important data on current storage. by Anonymous Coward · · Score: 0

      Who has anything really important stored as PICT or WMF?

      But look at ancient word processor files. Could easily be something important - although such stuff is often printed out as well.

    19. Re:Keep your important data on current storage. by ChumpusRex2003 · · Score: 1

      And only one variant of one JPEG protocol ever found widespread use. JPEG actually published both a lossless and a lossy compression algorithm and accompanying file format. The lossless format faded into near total obscurity, apart from some medical software, where the lossless JPEG data would be encapsulated in a medical (DICOM) container. Technically, lossless JPEG is a mandatory part of the DICOM specification, but not every product (free or commercial) supports it, and it's virtually impossible to find an open source implementation of lossless JPEG outside of limited implementations as part of medical imaging tools. There have also been a variety of extensions published to the JPEG lossy algorithm - notably extension to 12 or 16 bit depths. Good luck finding any support for these, at all. Again, these formats were nominally supported in the DICOM standard for medical imaging, but were very poorly supported. A flurry of naive new-entrant machine vendors ended up embracing these "novel" formats, only to cause total chaos for their customers, as they found that the files were unviewable on incumbent viewing software or untransmittable to other systems.

    20. Re:Keep your important data on current storage. by Dr_Barnowl · · Score: 1

      it's basically a recording of the GDI commands.

      There were a number of WMF exploits just because of this - because the WMF parser had insufficient bounds checking and you could pass malformed input directly to the Win32 API just by sending someone a picture.

      This is also part of the reason that Microsoft Office Open XML isn't an implementable standard - because it contains a bunch of stuff that boils down to "call the Windows API".

  3. Word! by Anonymous Coward · · Score: 0

    Seriously, what's wrong with the MS Word .doc format? Feature complete, stable, lots of free implementations. I don't think for a second that I will be able to open any standardized "future-proofed" data archiving format in 500 years; but wouldn't be surprised if a good old-fashioned .doc works just fine.

    1. Re:Word! by Tablizer · · Score: 1

      BSOD DNA=TMNJ

    2. Re:Word! by phantomfive · · Score: 2

      Seriously, what's wrong with the MS Word .doc format? Feature complete, stable, lots of free implementations.

      Because it's not feature complete (otherwise Microsoft wouldn't keep adding features), it's not stable, and the free implementations aren't completely compatible.

      data archiving format in 500 years; but wouldn't be surprised if a good old-fashioned .doc works just fine.

      You can have trouble opening a .doc from a few years ago......

      --
      "First they came for the slanderers and i said nothing."
    3. Re:Word! by Anonymous Coward · · Score: 0

      Because it's not feature complete (otherwise Microsoft wouldn't keep adding features), it's not stable, and the free implementations aren't completely compatible.

      You're confusing .doc with .docx. The .doc format has been stable since 2003; .docx is where all the new things are happening.

    4. Re:Word! by Drinking+Bleach · · Score: 0

      I really really really hope you're just trying to troll.

    5. Re:Word! by sjames · · Score: 1

      Even MS can't say exactly what that spec is. Sure, there's an alleged standard but Word never actually followed it and in spite of over 1000 pages of documentation, it's incomplete.

    6. Re:Word! by ihtoit · · Score: 0

      uh... because the MS Word .doc format is a proprietary binary format that's closed up tighter than a spinsters snizz? MS Word is not, never has been and never will be a legitimate document exchange format, and so far away from an archival format it's not funny.

      Future proofing a document in my experience has involved the following:

      removing unnecessary formatting;
      removing unnecessary whitespace;
      if images are absolutely essential, supply them in uncompressed and/or lossless format (ie TIFF, GIF89a (although the compression algorithm for this is patented so might pose a problem later)) as separate files;
      ensuring that as many contemporary readers as possible are able to parse and display the data in human readable format. This makes it more likely that a future reader might be able to open the document (correctly, every time) than one that's saved in a format with a secret specification.

      --
      Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
    7. Re: Word! by Anonymous Coward · · Score: 0

      Please stop suggesting GIF as an archival format. It has an 8-bit palette, with 1 color reserved for transparency. PNG at least has a 32-bit palette with an 8-bit alpha channel, and it is 'P'ortable.

      Going forward, we'll need a format with 10-bits-per-channel artwork.

      The only reason to use a GIF today is for a cheap cheesy animation.

  4. Punch cards by QuietLagoon · · Score: 1

    What other storage medium, besides rock carving, can survive an EMP blast?

    1. Re:Punch cards by Firethorn · · Score: 1

      Glass master CDs? Anything that's sufficiently shielded, and the shielding isn't actually all that hard to make?

      --
      I don't read AC A human right
    2. Re:Punch cards by Number42 · · Score: 1

      But those don't hold up well against time.

    3. Re:Punch cards by QuietLagoon · · Score: 0

      What is the high temperature limit for optical media?

      .
      Will a CD-ROM survive at 400 degrees Fahrenheit? Punch cards and rocks will.

    4. Re:Punch cards by QuietLagoon · · Score: 1

      Don't forget temperature survival. Yeah, I mentioned EMP, but there are also other environmental attacks that must be diverted, such as temperature, and water. Shielding won't prevent something from melting.

      .
      It's the end of the world, how will you save your data?

    5. Re:Punch cards by mirix · · Score: 2

      stamped / punched stainless steel sheets would probably be about the best option, if you want something to really stick around. Less brittle than rock carvings too.

      --
      Sent from my PDP-11
    6. Re:Punch cards by Firethorn · · Score: 4, Insightful

      The ultimate strategy is to duplicate it in so many different areas that at least one of them survives. Preferably multiple ones.

      The more critical the data, the more spots you duplicate it in.

      Though you have to realize that eventually everything will be lost.

      --
      I don't read AC A human right
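The duplication strategy above can be taken one step further: with three or more copies, a byte-wise majority vote can rebuild a clean copy even when every replica has rotted somewhere, as long as no position is damaged in most of them. A minimal sketch, assuming equal-length copies; `majority_repair` is a name invented for this example.

```python
from collections import Counter


def majority_repair(copies: list[bytes]) -> bytes:
    """Rebuild data by taking the most common byte at each position across copies."""
    if not copies:
        raise ValueError("need at least one copy")
    length = len(copies[0])
    if any(len(c) != length for c in copies):
        raise ValueError("all copies must have the same length")
    repaired = bytearray()
    # zip(*copies) yields, for each position, the byte found in every copy.
    for column in zip(*copies):
        value, _count = Counter(column).most_common(1)[0]
        repaired.append(value)
    return bytes(repaired)
```

With an even number of copies a position can tie, so an odd replica count is the safer choice for this kind of vote.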
    7. Re:Punch cards by ShanghaiBill · · Score: 2

      What other storage medium, besides rock carving, can survive an EMP blast?

      Nearly all of them. Flash media, including SD-cards, SSD, etc. should survive. A HDD that is powered off, should survive. The biggest threat is to anything that is connected to mains power. The power supply in your desktop computer may die, but a powered off laptop should be fine.

    8. Re:Punch cards by Anonymous Coward · · Score: 0

      Any medium with the right coding. Do your homework ! Read Noisy-channel_coding_theorem and A Mathematical Theory of Communication -- C. E. SHANNON.

    9. Re:Punch cards by jones_supa · · Score: 1

      There are fireproof containers into which the heat won't get so easily.

    10. Re:Punch cards by RabidReindeer · · Score: 2

      What is the high temperature limit for optical media?

      .

      Will a CD-ROM survive at 400 degrees Fahrenheit? Punch cards and rocks will.

      But what about 451 degrees Fahrenheit? You're down to rocks at that point.

  5. More than just data by slowdeath · · Score: 2

    Preserving the bits accurately is only a small part of the problem. Knowing what the bits mean is critical. Having a bunch of .xlsx spreadsheet files in the year 2050 will be useless unless you also have Excel 2050, and it knows how to read them. Unless you want to basically just 'print' all your data to a format like .pdf (or just plain old .txt) programs to access data are as critical as the data.

    1. Re: More than just data by Fwipp · · Score: 2

      We store genomic variation data in VCF files - it's just tab-delimited-text.
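To show how little machinery a tab-delimited format needs, here is a minimal reader sketch in Python. It assumes the usual VCF layout: `##` meta lines, one `#CHROM ...` column header, then one tab-separated record per line. It does no validation and is not a substitute for a real VCF library; `parse_vcf` and `SAMPLE` are made up for this example.

```python
SAMPLE = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t12345\trs1\tA\tG\t50\tPASS\t.\n"
)


def parse_vcf(text: str) -> list[dict[str, str]]:
    """Return one dict per record, keyed by the names in the #CHROM header line."""
    columns = None
    records = []
    for line in text.splitlines():
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#"):
            columns = line[1:].split("\t")  # drop the leading '#'
            continue
        if columns is None:
            raise ValueError("record found before the column header")
        records.append(dict(zip(columns, line.split("\t"))))
    return records
```

Calling `parse_vcf(SAMPLE)` gives a single record whose `REF` is `A` and whose `ALT` is `G`.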

    2. Re: More than just data by Tablizer · · Score: 1

      2075: "A tab? How quaint"

    3. Re: More than just data by Anonymous Coward · · Score: 0

      Ideally, each time a new reference genome is released you'd rerun the mapping/alignment of your raw reads to the new reference genome - generating new VCF files. But the raw reads themselves are also typically in very simple text-based formats like FASTQ so your underlying point stands. And people will most likely get better and better raw sequence data over the course of their lives anyway- much like upgrading a computer.

    4. Re: More than just data by sound+vision · · Score: 1

      Tab delimitation has been in use for 50 years, I can see it lasting another 60.

    5. Re: More than just data by Electricity+Likes+Me · · Score: 2

      More importantly: it's a regular, repeating sequence that would visibly separate variable data.

      Even with no knowledge of what a tab is, it would be obvious in analysing the data that it was doing something special. Anyone with some knowledge of DNA's structure would be able to infer the rest.

    6. Re: More than just data by Fwipp · · Score: 1

      Yep - it'll be way easier to view genomic data in the future than an excel document. Bioinformaticists are lazy, so we store everything as text :)

  6. There is a lot we need for long term archiving by mlts · · Score: 4, Informative

    The problem is that we do have formats that do work for long term archiving, but are limited to a platform and are not open, so decoding them in the future may be problematic.

    WinRAR is one example. It has the ability to do error detection and correction with recovery records. However, it is a commercial product.

    PAR records are another way, but it is a relatively clunky mechanism for long term storage.

    Even medium term storage on disk/tape can be problematic:

    There is one standard for backup programs for tape, and that is tar. Very useful format, but zero error correction or detection, other than reading and looking for hard errors. There are tons of backup programs that work with tapes. Networker, TSM, NetBackup, and many others come to mind, all using a different format. Of course, once you get the program, there is still finding the registration key, and some programs require online activation (which means when the activation servers get shut off, you can never do a restore from scratch again.) We need one archive grade standard for tape, perhaps with a standard facility for encryption as well.

    Same with disks. It wasn't until recently that there was any bit rot detection in filesystems at all. Now with ReFS, Storage Spaces, ZFS, and btrfs, we now can tell if a file is damaged... but none of the filesystems have the ability to store ECC on an entire (other than ZFS and ditto blocks.) It would be nice to have part of a filesystem be a large area for ECC on a block basis. It would take some optimization for performance, but adding ECC in the filesystem is more geared for long term storage than day to day file I/O.

    Finally there is paper. Other than limited stuff on QR codes, there isn't any real way to print a document onto paper, then scan it to get it back. There was a utility called Paperbak that purported to do this, offering encryption, error correction, various DPI codes, and so on. It printed well, but could never scan and read any of the documents printed, so it is worthless. What is needed is something like the Paperbak utility, but with a lot more robust error detection (like checking if blocks are at an angle, similar to how QR codes can be scanned from any direction.) This utility would have to be completely open for it to have any use at all. However, if it could be done to print small documents to paper, it would help greatly in some situations, such as recovering encryption keys, archived tax documents, and so on.

    Ironically, in general, we have the formats for long term storage. We just don't have any that are open.

    Hardware is an issue too. Hard drives are not archival media. Tapes are, but one with a reasonable capacity is expensive, well out of reach for all but the enterprise customers. It would be a viable niche for a company to make a relatively low cost tape drive that could work on USB 3, has a large buffer (combined with variable tape speeds to prevent shoe-shining), and has backup software with it that is usable and open, where the formats can be re-engineered years down the road for decoding.
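On the tar point above: a plain tar stream carries a checksum for each header but none for the file contents, so one common workaround is to keep a manifest of per-file hashes next to the archive. The sketch below uses only the Python standard library and builds the archive in memory; `build_archive` and `manifest` are hypothetical helper names, and this gives detection only, not correction.

```python
import hashlib
import io
import tarfile


def build_archive(files: dict[str, bytes]) -> bytes:
    """Pack name -> content pairs into an uncompressed in-memory tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def manifest(archive: bytes) -> dict[str, str]:
    """Map each regular file in the archive to the SHA-256 of its contents."""
    result = {}
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        for member in tar:
            if member.isfile():
                content = tar.extractfile(member).read()
                result[member.name] = hashlib.sha256(content).hexdigest()
    return result
```

Store the manifest taken at write time alongside the tape or disk image; on restore, recompute it and compare to see which members were damaged.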

    1. Re:There is a lot we need for long term archiving by Anonymous Coward · · Score: 0

      Uhh, ECC on disk media is called RAID. Aka: ZFS raidzN + sha256 chained all the way back to the uberblock.
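The "chained all the way back" idea can be pictured as a hash tree: each block is hashed, pairs of hashes are hashed again, and so on up to a single root, so changing any block changes the root. The Python sketch below is a generic Merkle-root calculation under that assumption, not ZFS's actual on-disk layout, and `merkle_root` is an illustrative name.

```python
import hashlib


def merkle_root(blocks: list[bytes]) -> bytes:
    """Hash every block, then hash pairs of hashes upward until one root remains."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [hashlib.sha256(block).digest() for block in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash on odd-sized levels
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]
```

Note that a tree like this only detects damage; the repair still has to come from redundancy such as the raidzN parity or a second copy.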

    2. Re:There is a lot we need for long term archiving by Anonymous Coward · · Score: 0

      License keys?

      Just use amanda and sqlite in shell/python.

    3. Re:There is a lot we need for long term archiving by bugnuts · · Score: 1

      As far as long term media, we have mdisc. Whether or not we'll have anything that can read the intact medium is another issue.

      It's sad how we're still able to print from photographic plates shot a century ago, but I'm worrying about bit rot on my digital pics stored for 5 years.

    4. Re:There is a lot we need for long term archiving by mlts · · Score: 1

      There was an IBM computer made in the 1970s which stored data on black and white negatives. It would "write" to them via exposing light, then pass the negatives through the usual developer, stop, and fixer baths, finally into a storage area. Reading was done by having them scanned in, similar to punchcards.

      It definitely is a nonstandard way of doing things, but I'm sure film chemistry has advanced quite well since then, so storing information as colored dots might be a long term archiving solution, provided there is an easy way to handle the negatives without them tearing. The grain of the film, ISO, amount of ECC per negative and other processes can be tuned as well.

      There is an irony that the negatives I have from my 35mm camera will be printable long after I'm gone (assuming no mishandling), while on a SD card, once the electrons bail from the gates, the data is gone, no way to recover it, whatsoever. It would be nice to have some form of long term archiving format so bit rot doesn't claim picture collections.

      I'd probably guess the only real way is to create some type of CAS that periodically copies data and checks/rebuilds ECC info to new media every so often, with multiple layers of bit rot detection in place, as well as a cryptographic signing layer to ensure that data dropped there hasn't been altered even though it has been ECC-ed and de-ECC-ed many times.

    5. Re:There is a lot we need for long term archiving by Electricity+Likes+Me · · Score: 1

      Images are a sparse data set though. See the preponderance of techniques which rebuild a nearly complete image from 1% of the pixels.

      If you took those negatives and tried to write densely packed information to them, how recoverable would it be then?

    6. Re:There is a lot we need for long term archiving by Anonymous Coward · · Score: 0

      There is one standard for backup programs for tape, and that is tar. Very useful format, but zero error correction or detection, other than reading and looking for hard errors. There are tons of backup programs that work with tapes. Networker, TSM, NetBackup, and many others come to mind, all using a different format.

      Tar, specifically using the UStar (IEEE P1003.1) and POSIX.1-2001 header formats, is probably the best archiving format, with the ZIP coming in a close second.

      You mention Networker, TSM, NetBackup: these are backup programs, not archiving programs. There's a difference. Even the storage vendors will tell you this (and try to sell you another product):

      http://eval.symantec.com/mktginfo/enterprise/white_papers/b-your_backup%20_is_not_an_archive_WP_21075780-1.en-us.pdf

      Also, while tar is the best, vendor ones may not be all that bad if they're properly documented:
      * http://udel.edu/~grim/networker/pdf/nsr_data.pdf
      * http://udel.edu/~grim/networker/pdf/mm_data.pdf

      Of course, once you get the program, there is still finding the registration key, and some programs require online activation (which means when the activation servers get shut off, you can never do a restore from scratch again.)

      All decent backup products that I know of will allow you to do restores without a license key (or only an eval one). It's the backing up part that is limited without a key.

    7. Re:There is a lot we need for long term archiving by Anonymous Coward · · Score: 0

      Finally there is paper. Other than limited stuff on QR codes, there isn't any real way to print a document onto paper, then scan it to get it back.

      I've worked for a large law firm where no digital documents left the building without first being printed and then OCRed back. We could print and scan a half-million pages per week. This was the only way to ensure that no metadata ever leaked out.

      I don't know where you got your "there isn't any real way to print a document onto paper, then scan it to get it back" claim, but it's wrong.

    8. Re:There is a lot we need for long term archiving by mlts · · Score: 1

      I should have been clearer -- Paperbak is a way to not just print a document, but encode one onto paper, so a 100 page Word document fits on a single page (in theory), rather than needing 100 pages.

  7. wrong by Anonymous Coward · · Score: 0

    The problem is, we currently don't have the data integrity, security and format migration standards to ensure that, according to Henry Newman at Enterprise Storage Forum

    .

    He is wrong, of course. We have all of that right now.

  8. Absolutely not by Anonymous Coward · · Score: 1

    You won't need to archive your genome. It will be re-sequenced in 5 seconds each time you go to the doctor. Because it will be cheap, and because it may evolve over time. The same way blood samples are not archived for life, or teeth X-rays are taken periodically, they're just taken when needed.

    1. Re:Absolutely not by Tablizer · · Score: 1

      <spooky music> That's what the NSA wants you to think </spooky music>

    2. Re:Absolutely not by Anonymous Coward · · Score: 0

      You won't need to archive your genome. It will be re-sequenced in 5 seconds each time you go to the doctor.

      It's going to be at least a few decades before genome sequencing is that cheap. The next decade will most likely see whole genome sequencing become widely available for somewhere under a thousand dollars. But I suspect it will be a bit like computers. As the price per gigabase continues to drop, people will choose better quality sequencing (i.e. higher read depth) rather than lower price so the price will most likely stay around a thousand or so.

  9. Hacked by Hell by Tablizer · · Score: 1

    not be long until your genome is tracked from birth to death. I am sure we do not want to have genome objects hacked or changed via silent corruption

    Wakes up, "WTF? I have a....Vagina!? Hoooneeeyyy!"

  10. Punch cards by Anonymous Coward · · Score: 0

    Any optical media, actually... Like a CD, remember?

  11. I used ascii and saved as *txt by Anonymous Coward · · Score: 0

    Fifteen years ago I long-term stored some important files.

    Rather than keeping the fancy formatting I saved them in plain ascii text and saved as *.txt.

    I burned it all on gold-plated CDs, which were then considered archive proof.

    These CDs are stored at three different locations.

    I'm sure there are better ways today.

  12. My proposal by Anonymous Coward · · Score: 1

    I propose storing it in a new medium. A "molecular chain", which should withstand the effects of EMP, right?

    A name for it. Hmmm. How about the Destroy-Not Archive, or D.N.A. for short.

    1. Re:My proposal by Anonymous Coward · · Score: 0

      Or how about 4 different atoms (say Fe, Co, Ni and Cu) on a substrate to represent the 4 different bases? The tech certainly is there - http://www.research.ibm.com/articles/madewithatoms.shtml#fbid=RiKFxiFL-GI

    2. Re:My proposal by KitFox · · Score: 1

      I propose storing it in a new medium. A "molecular chain", which should withstand the effects of EMP, right?
      A name for it. Hmmm. How about the Destroy-Not Archive, or D.N.A. for short.

      But then cosmic rays and ionizing radiation and other things will still introduce errors.

      So we would further need a method to reliably store the chains themselves and that could replicate the data to ensure there was a high chance of accurate data surviving. Little cartridges with all of the necessary environment and materials to power the reading system and maintain the chain and that could, as needed, replicate the data into new cartridges. The second versions, Contained Environment II (CE-II) work decently.*

      (*Hey, I couldn't think of anything with L. I guarantee I gave it at least four seconds of thought too!)

      --

      @Whee

    3. Re:My proposal by __aaltlg1547 · · Score: 1

      Who cares? I will still have about 30 trillion intact copies.

  13. What is a collision-proof hash? by Anonymous Coward · · Score: 0

    How is it possible to have a collision-proof hash?

    1. Re:What is a collision-proof hash? by cmarkn · · Score: 1

      Sure. All you have to do is make sure your keys are at least as long as any possible file you want to preserve, using at least all the possible characters in the original.

      --
      People should not fear their government. Governments should fear their people.
  14. Collision Proof Hash by Anonymous Coward · · Score: 0

    Lol, there is no such thing. It's a hash function, you get 2^n width, some reasonable cryptographic assuredness against collision, and that's it. Collision proof is not a hash function, it's a data compressor. Remind me not to put this guy in control of my storage.
    By the way, ZFS works great with raidzN, internal sha256, and lz4. Get it on FreeBSD.
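
    To put a number on "reasonable cryptographic assuredness": a minimal sketch of the standard birthday approximation, assuming the hash behaves like a uniform random function. The function name and the example item count are made up for illustration.

```python
import math


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_items inputs share a hash.

    Birthday approximation: p ~= 1 - exp(-k(k-1) / (2N)) with N = 2**hash_bits.
    Assumes the hash output is uniformly distributed.
    """
    space = 2.0 ** hash_bits
    exponent = -num_items * (num_items - 1) / (2.0 * space)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


if __name__ == "__main__":
    # A trillion archived objects under hashes of different widths.
    for bits in (32, 128, 256):
        print(bits, collision_probability(10**12, bits))
```

    The probability is never exactly zero for more than one item, which is the point being made above: a wider hash makes collisions astronomically unlikely, not impossible.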

    1. Re:Collision Proof Hash by phantomfive · · Score: 1

      Yeah, this is exactly what I came here to wonder about. A collision proof hash = 100% duplication

      --
      "First they came for the slanderers and i said nothing."
  15. Collision Proof Hash by Anonymous Coward · · Score: 0

    Great, so you keep the compressed data and the original data, that way you can check if the data is still valid and as a bonus if either the compressed data or original data breaks you can restore it.

  16. Many other reasons to store data by dutchwhizzman · · Score: 1

    While you may be right about the current use we have for DNA, it's very likely that medicine will have many more uses for it in the future. Prices on genome sampling are going down rapidly too, so it's reasonable to use this as an example why we might want to store data error free for at least a century.

    There will be many more things we want to store. Remember all those old city records and paper books? The newspaper archives? Early 20th century cellulose film? All those data sources have their problems and we have already lost a lot of information that is valuable to us now. Your parents' and grandparents' color photographs have lost a lot of the color in them already. Not just the prints, but also the negatives. Those VHS video tapes of your dad growing up? They're turning into noisy images right now.

    People have plenty of reasons to come up with a proper way to store data in such a way that it's still accessible for future generations, or themselves later in life.

    --
    I was promised a flying car. Where is my flying car?
  17. You can't by epyT-R · · Score: 1

    Technology is always changing. Whatever is today's commodity storage device will be tomorrow's rare anachronism.

  18. We herd u liek haxx0rz by Anonymous Coward · · Score: 0

    So we put haxx0z in ur data so u can get haxx0rzd while u get haxx0rzd.

    1. Re:We herd u liek haxx0rz by Anonymous Coward · · Score: 0

      0nlÿ \\'h£Ñ +h£ÿ hÅר® j0 \\'ïÑñÙ]£

  19. There's always the main backup. by nospam007 · · Score: 1

    You!

  20. just a reminder by xuchilpaba · · Score: 1

    We already have the technology to preserve the data: http://www.pcworld.com/article...

    1. Re: just a reminder by xuchilpaba · · Score: 1

      There is also this : http://www.extremetech.com/ext...

  21. Gimmie tape. by Stumbles · · Score: 1

    Just scrape off the rust and you're good to go. Now, where did I put my M14G and FR3010.

    --
    My karma is not a Chameleon.
    1. Re:Gimmie tape. by Anonymous Coward · · Score: 0

      Umm, once you scrape off the rust, your tape is about as useful as adhesive tape with too much dust on it.

  22. Paper tape by Squidlips · · Score: 1

    Get the acid-free paper. Will last forever

    1. Re:Paper tape by dkf · · Score: 1

      Get the acid-free paper. Will last forever

      Or until it gets wet.

      --
      "Little does he know, but there is no 'I' in 'Idiot'!"
  23. ZFS by ChadMilios · · Score: 1

    nuff sed

  24. Live data lives by Karmashock · · Score: 1

    Your bank records exist despite changing hardware and software because the data is kept in use. It's kept alive. It is added to, modified... active. Your genetic records could be kept active. Keep them part of a patient record and they'll be copied, migrated, translated, from one system to the next to the next to the next for as long as you live.

    Only when the data goes dormant can it rot. By all means... have long term storage media for long term data archiving. But the best means of keeping data current is to keep it moving.

    All that said... the data we're talking about can't be that much data. A few terabytes should be more than what you need to store that kind of stuff for one person. And that kind of storage is already cheap.
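
    A rough back-of-the-envelope check of that size claim, as a sketch only. It assumes roughly 3.2 billion base pairs for one haploid copy of the human genome and a plain 2-bit packing per base; both constants are illustrative assumptions, and raw sequencer output with quality scores is considerably larger than the packed sequence.

```python
BASE_PAIRS = 3_200_000_000  # assumed approximate haploid human genome length
BITS_PER_BASE = 2           # A, C, G, T fit in two bits


def packed_size_bytes(bases: int = BASE_PAIRS, bits_per_base: int = BITS_PER_BASE) -> int:
    """Bytes needed to store the sequence with a fixed number of bits per base."""
    return (bases * bits_per_base + 7) // 8  # round up to whole bytes


print(packed_size_bytes() / 1e9)  # size in GB (decimal)
```

    Under those assumptions the finished sequence comes in under a gigabyte, so a few terabytes leaves plenty of room for raw reads and metadata.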

    --
    I've decided to stop wasting my time responding to AC trolls/sockpuppets... so if you want a response from me... login.
  25. Too bad your DNA is useless to most MDs by Theovon · · Score: 2

    ... or for that matter any of your medical history. MDs do spot-diagnosis in 5 minutes or less based exclusively on what they've memorized or else they do no diagnosis at all.

    My wife has a major genetic defect (MTHFR C677T), which causes severe nutritional problems. We haven't yet met an MD who has a clue about nutrition. Moreover, we had to diagnose this problem ourselves through genetic testing, with no doctors involved. We've shown the results to doctors, and they don't entirely disbelieve us, but they also have no clue what to do about it and still are dubious of the symptoms. (Who has symptoms of Beriberi these days? Someone whose general ability to absorb nutrients is severely compromised.)

    What makes anyone think that this will change if your doctor has access to your DNA, even with detailed analysis? They won't take the time to actually read any of it. In fact a lot of what we know about genetic defects pertains to problems in generating certain kinds of enzymes, a lot of which participate in nutrient absorption. (So obviously RESEARCHERS know something about nutrition.) These nutritional problems require supplementation that MDs don't know about. Do you think the typical MD knows that Folic Acid is poison to those with C677T? Nope. They don't know the differences between folic acid, folinic acid, and methylfolate and still push folic acid on all pregnant women (they should be pushing methylfolate). They also don't know the differences between the various forms of B12 and always prescribe cyanocobalamin even for people who need the methyl and hydroxy forms.

    Another way in which MDs are useless is caused by their training. Basically, they're trained to be skeptical and dismissive. Many nutritional and autoimmune disorders manifest with a constellation of symptoms, along with severe brain fog. Someone with one of these problems will generally want to write down the symptoms when talking to a doctor, because they can't think clearly. The thing is, in med school, doctors are specifically trained to look out for patients with constellations of symptoms and written lists, and they are told to recognize this as a condition that is entirely within the mind of the patient. Of course, a lot of doctors, even if not trained to dismiss things as "all in their head", are terrible at diagnosis anyway. They'll have no clue where to start and won't have the patience to do extensive testing. It's too INCONVENIENT and time-consuming. They won't make enough money off patients like this, so they get patients like this out the door as fast as possible.

    I've had some good experiences with surgeons. But for any other kind of medical treatment, MDs have been mostly useless to me and my family. In general, if we go to one NOW, we've already diagnosed the problem (correctly) and possibly need advice on exactly which medicine is required, although when it comes to antibiotics, it's easy enough to find out which ones to use. (Medical diagnosis based on stuff you look up on the internet is really hard and requires a very well-trained bullshit filter, and you also have to know how to use the more authoritative sources properly. However, it's not impossible for people with training in things like law, information science, and biology. It just requires really good critical thinking skills. BTW, most MDs don't have that.)

    MDs are technicians. Most of them are like those B-average CS grads from low-ranked schools who can barely manage to write Java applications. If you know how to deal with a low-level technician, guide them properly, and stroke their ego in the right way, you can deal with an MD.

    1. Re:Too bad your DNA is useless to most MDs by Bite+The+Pillow · · Score: 2

      Paraphrased:

      I forgot that doctors are people, and that the bottom half are generally worthless, and the average ones are average. Also, diagnosing a rare problem is hard because it is unlikely to be a rare problem.

      I also forgot that doctors are the people who didn't tire of medical school shenanigans and change studies.

      And I bear a grudge because I didn't find that top-notch House-like genius who, despite being wrong every show, succeeds in the end.

      Finally, I have no idea why and how insurance, both medical and liability, affects what care is given.

      Seriously, it is a hard position to be in, but you are angry at the wrong things.

    2. Re:Too bad your DNA is useless to most MDs by Anonymous Coward · · Score: 0

      Re-paraphrased. The guy sounds like he has above average intelligence, a deep commitment to his family, and a willingness to expend great effort in an attempt to guide licensed drug pushing, insurance fraud morons in the right direction. True, it's meaningless to blame the doctors for getting sucked into the vortex. Sorry, I don't have an answer, except to say we are all in the same boat and none of us gets out of this game alive. In the meantime, if you come across a person or government or a business or a philosophy that profits from other people's pain and suffering, you should shun them. Obamacare is just the latest example of cows being led to the slaughter. Hmmm, that really didn't paraphrase anything.

    3. Re:Too bad your DNA is useless to most MDs by Anonymous Coward · · Score: 0

      let me guess, they missed your chronic lyme as well, right?

      Aneuploidies and unbalanced translocations are major genetic defects. Things like common MTHFR variants with marginal clinical significance are exactly the reasons many primary care docs are dreading the advent of clinical genomics. Every smartypants person with "training in things like law, information science and biology" with "really good critical thinking skills" is going to show up bitching about how their variant of dubious clinical significance is the cause of all their ills and if the stupid doctor who is really just a technician would just order the right test or prescribe the right drug as evidenced by some random wet lab or computational study of dubious quality then assuredly everyone would be shitting rainbows.

      Pushy patients with long lists of complaints and demands get pushed out the door because they're goddam unpleasant to deal with. They yell at staff, they refuse to follow clinic protocol and speak to the triage nurse and they insist that everything is an emergency but refuse to go to the ED. If you want to monopolize a doctor's time, go pay a concierge MD to listen to your bullshit and then pay out of pocket for whatever the fark you want.

    4. Re:Too bad your DNA is useless to most MDs by Theovon · · Score: 1

      We seriously considered chronic lyme as a possibility and even got testing. The test came back negative, although there can be false negatives. We ultimately ruled it out on the basis of certain key symptoms being absent. Basically, we considered a LOT of things and did our best to rank the chances of each illness that might explain the symptoms. We were open to the idea of more than one cause but considered it a remote possibility; fortunately we were right.

      Anyhow, homozygous MTHFR C677T can be serious, especially if there are other complicating mutations. Compared to some people my wife has a moderate problem. She had chronic fatigue (not to be necessarily confused with CFIDS), brain fog, autoimmune disease, gluten intolerance, weight gain, pale skin, hair loss, and many more symptoms. But she never lost feeling in her limbs; some people do. When you mess up the methylation cycle, all sorts of things can go wrong.

      I'm not sure why you (an anonymous coward, so why am I feeding the trolls?) think that this mutation is of "dubious clinical significance." It's one of the more serious mutations, and the appropriate treatments have worked. Taking methylfolate, a few different forms of B12, and several other supplements has caused massive improvement in energy, return of proper skin tone, hair regrowth, appropriate weight loss, and so on. In other words THE TREATMENT WORKED.

      This is one of those fortunate cases where a hard-to-find single cause has been identified. It explains ALL of the symptoms (many of which are secondary, caused by a deficiency caused by the underlying problem), and the treatment has worked very well. It's a little hard to get the exact dosages of vitamins right, because as soon as you get enough of one thing, the body will start repairing things, which requires other chemicals, and cause a deficiency in another thing, etc. So the fix isn't an overnight sort of thing, but the progress is rapid.

      And my biggest complaint is not that the MDs didn't know how to diagnose this. My complaint is that they EXPLICITLY REFUSED to help us when we were trying to track down the cause. Seriously. Most doctors just didn't have a clue and were unwilling to "do a lot of speculative testing," while some outright said they refused to help us. Even if we came in with a list of tests to do to try to narrow down a range of possibilities (like a decision tree), they wouldn't do it. We had to figure this out completely on our own.

      I don't expect MDs to know everything or be super-human. But I do expect them to listen and take patients seriously.

    5. Re:Too bad your DNA is useless to most MDs by badkarmadayaccount · · Score: 1

      Dissolve the vitamin in question in DiMethylSulfOxide, apply topically. Or in water, and inject, snort or inhale. HTH

      --
      I know tobacco is bad for you, so I smoke weed with crack.
  26. Archiving vs backups by Kjella · · Score: 1

    One of the big differences between archiving and backup is that in archiving I want to keep this exact version intact; if it changes on me it's an error, while a backup takes a copy of whatever is there now - maybe I wanted to edit that file. Unlike backups I think it's not about versioning, it's about maintaining one logical instance of the archive across different physical copies. Here's what I'm thinking, you create a system with three folders:

    archived
    to_archive
    to_trash

    The archive acts like a CD/DVD/BluRay and is read-only. So far, nothing but a really awkward way to create a WORM(-ish) drive, but the real point comes next in distribution and synchronization.

    When you put a file in "to_archive" a job will pick it up and wrap it in AES (with AES-NI the cost of on-the-fly encryption/decryption is very slim) and create a torrent-like file for it and move it to archived. If you want to delete it from the archive, you drag the file to the "to_trash" folder or maybe you put some kind of lock/freeze/undo timer on that function. Files that are in "archived" are sync'ed to other computers - still encrypted - which means you can shop around for storage/bandwidth, maybe you got multiple locations yourself (home/cabin), maybe swap backup with friends or family or you can buy it on the open market and they'll all mingle and share data because it's based on basic torrents.
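
    A minimal sketch of the "to_archive" job described above, under stated assumptions: the piece size, the manifest layout and the function names are all hypothetical, SHA-256 stands in for whatever piece hash a real torrent-like format would use, and the AES step is left out because the Python standard library has no AES implementation.

```python
import hashlib
import json
import shutil
from pathlib import Path

PIECE_SIZE = 256 * 1024  # assumed piece size for the torrent-like manifest


def build_manifest(path: Path, piece_size: int = PIECE_SIZE) -> dict:
    """Hash the file piece by piece, similar to the piece list in a torrent file."""
    pieces = []
    with path.open("rb") as f:
        while chunk := f.read(piece_size):
            pieces.append(hashlib.sha256(chunk).hexdigest())
    return {"name": path.name, "size": path.stat().st_size,
            "piece_size": piece_size, "pieces": pieces}


def ingest(root: Path) -> list:
    """Move each file from to_archive into archived and write its manifest."""
    inbox, archive = root / "to_archive", root / "archived"
    archive.mkdir(parents=True, exist_ok=True)
    done = []
    for src in sorted(p for p in inbox.iterdir() if p.is_file()):
        # The encryption step from the post would go here (omitted in this sketch).
        manifest = build_manifest(src)
        dst = archive / src.name
        shutil.move(str(src), str(dst))
        (archive / (src.name + ".manifest.json")).write_text(json.dumps(manifest))
        dst.chmod(0o444)  # read-only, to get the WORM-ish behaviour
        done.append(dst)
    return done
```

    Peers holding a copy could re-hash each piece against the manifest to decide what needs to be synced again, which is the part of the idea that torrents already solve.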

    They can all do basic limits on size/bandwidth so you can have pricing plans and caps, you can have one-way "leeches" that download and archive it on tape that can physically deliver it to you. If you build it fairly smart you can also have local, offline backups and if you restore them it'll pick up that 95% is the same as last week and sync up the rest. Basically a "Redundant Array of Inexpensive Archive Locations." It will leak a little bit of metadata as to size and number of files, but not file or directory names and you can probably muddle that metadata up with padding and dummy files if you want.

    Of course you can choose to have the AES key on several computers so you can access your media from any of them. And as a free bonus a device that has the AES key like say your cell phone can use this as an online library, it doesn't have to auto-sync everything. With many locations = many peers it won't matter if one is down and you aggregate up the bandwidth, just like in any other torrent swarm. Through the seed/peer numbers you can at any time watch the state of your backup in progress as you add files. If your computer goes to shit, tell it the archive key and it'll hook up and start syncing. Just like a torrent client you can set priorities on what to download first.

    It's not for all your data, but I think a lot of common user data is that way. Those RAW photos or video or audio you took? Archive them, "single" everlasting master copy. It doesn't replace backup of say documents you're working on or source code you're developing but it complements it.

    --
    Live today, because you never know what tomorrow brings
  27. What we need is viable storage and maintenance, by jenningsthecat · · Score: 1

    for the huge and growing number of people on this planet. I get how wonderful it is that genetic medicine might allow us all to live to the age of 150, eliminate birth defects, and cure Aunt Millie's cancer. But really, just where are we going to put all the people whose lives we save and extend while at the same time the birth rate keeps climbing? How will we feed them? How will we maintain a viable biosphere in an era of rapidly accelerating extinctions?

    All that long term data will be meaningless if human society collapses under its own weight. If we're going to invest in keeping data viable so we can maintain and extend our scientific and technological reach, perhaps we should use it to help solve more important problems than our navel-gazing, narcissistic fixation on immortality and eternal youth.

    --
    'The Economy' is a giant Ponzi scheme whose most pitiable suckers are the youngest among us and the yet-unborn.
    1. Re:What we need is viable storage and maintenance, by Anonymous Coward · · Score: 0

      > while at the same time the birth rate keeps climbing

      Fortunately with the Republicans denying more and more people healthcare, there is ironically a positive result to that. It does something about our massive population. It helps to keep it lower because of the number of children that they kill. Of course in their minds they are not killing people since they do not consider us people. They want everyone that isn't a white man to die.

  28. You didn't say the magic word by ArcadeMan · · Score: 1

    Something about securing genomes, coming from a guy called Newman? And not a single Jurassic Park joke after 79 posts?

    What a shame.

    1. Re:You didn't say the magic word by Anonymous Coward · · Score: 0

      Don't make jokes about Newman. He might go postal on you.

  29. Standards by Anonymous Coward · · Score: 0

    Glad you mentioned LTFS, but you should also look at CDMI and SIRF. The standards are there, but need to be put into an Offering that implements the policies and procedures you mentioned. The only thing holding this back is the business case and awareness of the issues, which your article does nicely.

  30. Good luck by Anonymous Coward · · Score: 0

    Good luck trying to archive this one. We are having serious issues reading media that were in use as recently as the 1980s... Some media formats are so obsolete there's just no way to read them.

    What we really need is a new type of media format whose newer access devices are kept backwards compatible... Maybe something like Star Trek's data crystals? Heck, their readers were so good they were able to adapt them to read different races' data crystals in some episodes...

  31. Pragmatic: continual, active refresh by michaelmalak · · Score: 1

    One can whine and wax poetic all one wants, but since we don't have a good archival format, the practical solution today is continual refresh of data: periodically copying data to fresh, and technologically up-to-date media. It's not sexy, but it does address three of the four points at the end of the linked piece (end-to-end data integrity, format migration and secondary media formats). The unaddressed point, access audit trails, makes no sense given the premise stated at the beginning of the piece that "No matter what anyone tells you, there is data that does not need to be on primary storage".

    Yes, this is expensive. Yes, it would be nicer (cheaper) if a one-time single format could address the archive problem.

    P.S. There is also this gem from the piece:

    creation of a collision-proof hash

    Of course the whole point of a hash is a mapping from a high-cardinality space to a low-cardinality space, and thus collisions are always a possibility. Collisions are minimized when a good hashing function uniformly distributes the resulting hashes, but given a large enough collection of source documents (one more than the cardinality of the hash space is always enough), collisions will occur.
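
    The pigeonhole argument can be shown with a deliberately tiny hash. This is only a sketch: the 8-bit "hash" is just the first byte of SHA-256, and the document names are made up, so that the guaranteed collision shows up within 257 inputs.

```python
import hashlib


def tiny_hash(data: bytes) -> int:
    """First byte of SHA-256: a deliberately tiny hash with 256 possible values."""
    return hashlib.sha256(data).digest()[0]


def first_collision(max_docs: int = 257):
    """Hash distinct documents until two share a value; return (i, j, hash)."""
    seen = {}
    for i in range(max_docs):
        h = tiny_hash(f"document-{i}".encode())
        if h in seen:
            return seen[h], i, h
        seen[h] = i
    return None  # cannot happen when max_docs > 256 (pigeonhole principle)


if __name__ == "__main__":
    print(first_collision())
```

    With 257 distinct inputs and only 256 outputs a collision is guaranteed, and in practice it turns up much sooner. The same holds for a 256-bit hash; the input count needed for a guarantee is just too large to ever reach.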

  32. Why not store the DNA itself? by king+neckbeard · · Score: 1

    Your body produces tons of it, and it can be stored and sequenced considerably longer than human lifespans, especially if care is taken to preserve it.

    --
    This is my signature. There are many like it, but this one is mine.
    1. Re:Why not store the DNA itself? by __aaltlg1547 · · Score: 1

      Nobody ever needs to know their complete genome and nobody ever will need to. Instead, you'll go to a doctor with a complaint and if they suspect a genetic component, they'll do a cheek swab and a quick test tuned to look for the particular genetic condition you might have. Or if something really exciting and common is discovered, you'll be offered an opportunity to get a new test to see if you're at risk for living to be 200. (You need to be warned because you probably won't have saved enough for near-permanent retirement.)

  33. My article about it in Communications of the ACM by tgeller · · Score: 1

    I wrote an article about long-term storage *hardware* in CACM -- "The Forever Disc". My favorite musing had to do with writing the data into a population's genetics, and letting redundancy correct errors/mutations.

    --
    Tom Geller
  34. Digital libraries have been doing this for decades by dlmetcalf · · Score: 1

    There's a lot of work in this space from digital libraries for preservation of cultural heritage, state/official archives, etc. Start with the Open Archival Information System reference model (OAIS, an ISO standard originally from space agencies). Then PREMIS, the preservation metadata standard from the US Library of Congress, which is used around the world for digital assets. It works well with the METS encoding standard and the MIX technical metadata standard. PRONOM and DROID cover format policy registries, monitoring and migration planning. Digital asset repositories like Fedora Commons Repository 4 (being worked on by the DuraSpace community) have built-in fixity checking for bit rot and store to a wide range of underlying technologies (including file systems like ZFS, tar, etc.). LOCKSS for distributed archive relocation and exchange programs. Or Tahoe-LAFS. (There are also things like CryptoSphere coming.) There are tools like Archivematica too for ingestion workflows, characterization, etc. See also the recent partnership with DuraSpace: http://www.duraspace.org/artic....
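
    For anyone unfamiliar with the term, fixity checking boils down to recording a digest at ingest time and re-computing it later. A minimal sketch, not how Fedora or any of the tools above implement it; the function names and the flat-folder layout are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large files do not need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def record_fixity(folder: Path) -> dict:
    """Digest of every regular file in the folder, keyed by file name."""
    return {p.name: sha256_of(p) for p in sorted(folder.iterdir()) if p.is_file()}


def check_fixity(folder: Path, recorded: dict) -> list:
    """Names of files that are missing or whose content no longer matches."""
    bad = []
    for name, digest in recorded.items():
        p = folder / name
        if not p.is_file() or sha256_of(p) != digest:
            bad.append(name)
    return bad
```

    Run on a schedule, a non-empty result from `check_fixity` is the signal to restore that file from another copy before the rot spreads to all of them.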

  35. Librarians are already at work by Anonymous Coward · · Score: 0

    Check out guidelines for data archival at http://datasealofapproval.org/en/. Also http://www.duraspace.org/, they support Dspace and Fedora repository work.

  36. What? You're ridiculous! by Anonymous Coward · · Score: 0

    Is it too much to ask that every patient is looked at as a unique case? Should every patient be put into a known category as soon as possible?

    I know, if it quacks like a duck, etcetera. But it happens quite often that it isn't a duck. Why not examine every patient as a genuine case, instead of lumping it into one of the few general cases? Is it money, greed, what?

    Fix what ails people, that is what doctors are supposed to do, right? So why do so many not do that?

  37. FITS standard by calidoscope · · Score: 1

    When the Vatican decided to digitize their archives, they chose to store the images in FITS format for pretty much the same reasons. One thing FITS doesn't address is preventing unauthorized access to the data.

    --
    A Shadeless room is a brighter room.
  38. Arvados: the open source solution by Peter+Amstutz · · Score: 1

    (Disclaimer: I am an Arvados developer)

    The Arvados project is a free and open source (AGPLv3 and Apache v2) bioinformatics platform for genomic and biomedical data, designed to address precisely the issues raised in this article. Arvados features 1) a content-addressed filesystem (blocks are addressed by a hash of their actual content rather than some arbitrarily assigned identifier) which performs end-to-end data integrity checks, 2) fine-grained access controls, 3) a cluster scheduling system that tracks the input and output results of every job (enabling you to track processing pipelines and establish data provenance), and 4) data replication by default. Arvados is developed and commercially supported by Curoverse, which is 100% committed to free software (in fact, one of the founders is a former employee of the Free Software Foundation). I encourage slashdotters in the bioinformatics, big data, or data archiving space to come check it out and join the community.
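
    To illustrate the general idea of content addressing in point 1, here is a toy in-memory sketch. It is not the Arvados API or its storage format; the class name, the choice of SHA-256 and the method names are assumptions made for the example.

```python
import hashlib


class BlockStore:
    """Toy content-addressed store: the key of a block is the hash of its bytes."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blocks[key] = data  # identical content always lands on the same key
        return key

    def get(self, key: str) -> bytes:
        data = self._blocks[key]
        # Re-hash on read: the address doubles as an integrity check.
        if hashlib.sha256(data).hexdigest() != key:
            raise IOError("block failed integrity check")
        return data


if __name__ == "__main__":
    store = BlockStore()
    key = store.put(b"ACGTACGT")
    print(key, store.get(key))
```

    Because the address is derived from the content, a reader can verify every block it receives without trusting the storage layer, and duplicate blocks are stored once.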

  39. Need to borrow a ladder by mcswell · · Score: 1