Slashdot Mirror


Data Archiving Standards Need To Be Future-Proofed

storagedude writes Imagine in the not-too-distant future, your entire genome is on archival storage and accessed by your doctors for critical medical decisions. You'd want that data to be safe from hackers and data corruption, wouldn't you? Oh, and it would need to be error-free and accessible for about a hundred years too. The problem is, we currently don't have the data integrity, security and format migration standards to ensure that, according to Henry Newman at Enterprise Storage Forum. Newman calls for standards groups to add new features like collision-proof hash to archive interfaces and software.

'It will not be long until your genome is tracked from birth to death. I am sure we do not want to have genome objects hacked or changed via silent corruption, yet this data will need to be kept maybe a hundred or more years through a huge number of technology changes. The big problem with archiving data today is not really the media, though that too is a problem. The big problem is the software that is needed and the standards that do not yet exist to manage and control long-term data,' writes Newman.
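
As a minimal sketch of the kind of integrity check Newman is asking for, the snippet below records a digest for each archived object and later reports any object whose bytes no longer match. SHA-256 is an assumption here (the article does not name a specific hash), and the function names `build_manifest` and `find_corrupted` are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hex."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(objects: dict[str, bytes]) -> dict[str, str]:
    """Record one digest per archived object at ingest time."""
    return {name: digest(blob) for name, blob in objects.items()}


def find_corrupted(objects: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """List objects that are missing or whose bytes no longer match the manifest."""
    return sorted(
        name
        for name, expected in manifest.items()
        if name not in objects or digest(objects[name]) != expected
    )
```

The manifest only detects silent corruption; it has to be stored and protected separately from the data it describes to be of any use.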

14 of 113 comments

  1. Keep your important data on current storage. by Z00L00K · · Score: 4, Insightful

    Keep your important data on current mainstream storage. This is the only way to preserve it - copy data from old disks to new disks whenever you upgrade.

    Of course, at each upgrade you can also discard a lot of data that isn't necessary, but pictures and similar stuff should be preserved. Data formats for images have been stable for the last decades. Even though some improvements have occurred, a 25-year-old jpg is still viewable.

    However, some document formats have to be upgraded to the latest version, since Microsoft especially has a tendency to "forget" its old versions. You may still lose some formatting, but the content of the documents is the important part.

    --
    If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
    1. Re:Keep your important data on current storage. by _merlin · · Score: 2

      Yeah, SVG renderers have more chance of being around in 50 years than WMF or PICT. But you still need to actively go through your data archives, find things in "endangered" formats, and migrate them to more future-proof formats. This requires substantial effort that increases as the collection grows. Then there's verifying that nothing was lost in the conversion to consider.
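
      A minimal sketch of that first pass over an archive, assuming formats can be recognised by file extension alone. The `ENDANGERED` set and the function names are hypothetical; a real survey would also need to inspect file contents.

```python
from collections import Counter
from pathlib import PurePath

# Assumption: extensions treated as "endangered" for this illustration.
ENDANGERED = {".wmf", ".pict", ".pct"}


def endangered_files(paths):
    """Return the paths whose extension is on the endangered list."""
    return [p for p in paths if PurePath(p).suffix.lower() in ENDANGERED]


def format_census(paths):
    """Count files per extension, to show where migration effort will go."""
    return Counter(PurePath(p).suffix.lower() for p in paths)
```

      This only finds candidates for migration. Verifying that nothing was lost in the conversion is a separate and harder step.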

    2. Re:Keep your important data on current storage. by Jawnn · · Score: 2

      JPEG wasn't standardised until 1992. There are no 25-year-old JPEG files. Things have changed a lot since 1989.

      So what's your point? I have GIF images that predate 1989. They still render just fine. I could convert them if I felt the need. I don't, because the formats are indeed "stable".

  2. More than just data by slowdeath · · Score: 2

    Preserving the bits accurately is only a small part of the problem. Knowing what the bits mean is critical. Having a bunch of .xlsx spreadsheet files in the year 2050 will be useless unless you also have Excel 2050, and it knows how to read them. Unless you want to basically just 'print' all your data to a format like .pdf (or just plain old .txt), the programs to access data are as critical as the data.

    1. Re: More than just data by Fwipp · · Score: 2

      We store genomic variation data in VCF files - it's just tab-delimited-text.
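
      To illustrate how little software that takes, here is a minimal sketch that reads the eight fixed VCF columns with nothing but string splitting. The function name `parse_vcf` is hypothetical, the sample record used with it is invented, and real VCF files carry more structure (INFO keys, per-sample columns) than this handles.

```python
# The eight fixed columns of a VCF data line, in order.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf(text):
    """Parse VCF text into a list of dicts, skipping header lines."""
    records = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # "##" meta lines and the "#CHROM" header row
        fields = line.split("\t")
        record = dict(zip(VCF_COLUMNS, fields[:8]))
        record["POS"] = int(record["POS"])  # position is a 1-based integer
        records.append(record)
    return records


# Invented sample data, not taken from any real genome.
SAMPLE = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t12345\t.\tC\tT\t50\tPASS\t.\n"
)
```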

    2. Re: More than just data by Electricity+Likes+Me · · Score: 2

      More importantly: it's a regular, repeating sequence that would visibly separate variable data.

      Even with no knowledge of what a tab is, it would be obvious in analysing the data that it was doing something special. Anyone with some knowledge of DNA's structure would be able to infer the rest.
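
      As a rough sketch of that inference, the snippet below looks for any character that occurs the same number of times on every line, which is how a column separator would stand out to someone who has never heard of a tab. The function name `guess_delimiters` is invented for illustration.

```python
from collections import Counter


def guess_delimiters(lines):
    """Return characters that appear equally often (at least once) on every line."""
    counts = [Counter(line) for line in lines if line]
    if not counts:
        return []
    first = counts[0]
    return [
        ch
        for ch in first
        if all(c[ch] == first[ch] for c in counts)
    ]
```

      On a handful of rows this can produce false positives. The more rows are analysed, the less likely it is that anything other than the real separator survives the filter.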

  3. There is a lot we need for long term archiving by mlts · · Score: 4, Informative

    The problem is that we do have formats that do work for long term archiving, but are limited to a platform and are not open, so decoding them in the future may be problematic.

    WinRAR is one example. It has the ability to do error detection and correction with recovery records. However, it is a commercial product.

    PAR records are another way, but it is a relatively clunky mechanism for long term storage.
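
    To show the basic idea behind recovery data, here is a deliberately simplified sketch using a single XOR parity block. This is not how WinRAR or PAR work internally (they use stronger codes that survive more than one loss); the function names are hypothetical, and all blocks are assumed to be the same length.

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)


def make_parity(blocks):
    """Compute one parity block that can rebuild any single lost block."""
    return xor_blocks(blocks)


def recover_missing(blocks, parity):
    """Rebuild the one block marked None from the survivors and the parity."""
    missing = [i for i, b in enumerate(blocks) if b is None]
    if len(missing) != 1:
        raise ValueError("single parity can rebuild exactly one missing block")
    survivors = [b for b in blocks if b is not None]
    return xor_blocks(survivors + [parity])
```

    The cost is one extra block of storage for the whole set, and the limit is that a second lost block makes the data unrecoverable.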

    Even medium term storage on disk/tape can be problematic:

    There is one standard for backup programs for tape, and that is tar. Very useful format, but zero error correction or detection, other than reading and looking for hard errors. There are tons of backup programs that work with tapes. Networker, TSM, NetBackup, and many others come to mind, all using a different format. Of course, once you get the program, there is still finding the registration key, and some programs require online activation (which means when the activation servers get shut off, you can never do a restore from scratch again.) We need one archive grade standard for tape, perhaps with a standard facility for encryption as well.
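
    One workaround with plain tar is to carry a checksum manifest inside the archive itself, since tar's own header checksum does not cover file contents. The sketch below does that in memory with Python's `tarfile` module; the member name `MANIFEST.sha256` and both function names are assumptions made for illustration, not an existing standard.

```python
import hashlib
import io
import tarfile

MANIFEST_NAME = "MANIFEST.sha256"  # hypothetical member name


def make_archive(files: dict[str, bytes]) -> bytes:
    """Build an uncompressed tar whose first member lists a SHA-256 per file."""
    manifest = "".join(
        f"{hashlib.sha256(data).hexdigest()}  {name}\n"
        for name, data in sorted(files.items())
    ).encode()
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in [(MANIFEST_NAME, manifest)] + sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def verify_archive(archive: bytes) -> list[str]:
    """Return the names of members whose contents no longer match the manifest."""
    bad = []
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        manifest = tar.extractfile(MANIFEST_NAME).read().decode()
        for line in manifest.splitlines():
            expected, name = line.split("  ", 1)
            actual = hashlib.sha256(tar.extractfile(name).read()).hexdigest()
            if actual != expected:
                bad.append(name)
    return bad
```

    This gives detection only. Correcting a damaged member would still need redundant data stored alongside the archive.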

    Same with disks. It wasn't until recently that there was any bit rot detection in filesystems at all. With ReFS, Storage Spaces, ZFS, and btrfs, we can now tell if a file is damaged... but none of the filesystems have the ability to store ECC on an entire (other than ZFS with ditto blocks). It would be nice to have part of a filesystem be a large area for ECC on a block basis. It would take some optimization for performance, but adding ECC in the filesystem is more geared for long term storage than day to day file I/O.
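
    A minimal sketch of the detection half of that idea, assuming a fixed 4096-byte block and CRC32 as the per-block checksum. Both choices and the function names are assumptions for illustration; no particular filesystem is being described here.

```python
import zlib

BLOCK_SIZE = 4096  # assumption: fixed block size for this illustration


def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> list[int]:
    """Compute one CRC32 per block of the data."""
    return [
        zlib.crc32(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


def damaged_blocks(data: bytes, checksums: list[int],
                   block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the indexes of blocks whose checksum no longer matches."""
    current = block_checksums(data, block_size)
    return [i for i, (now, then) in enumerate(zip(current, checksums)) if now != then]
```

    Knowing which block rotted is what makes block-level repair possible: only that block needs to be rebuilt from ECC or fetched from another copy.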

    Finally there is paper. Other than limited stuff on QR codes, there isn't any real way to print a document onto paper, then scan it to get it back. There was a utility called Paperbak that purported to do this, offering encryption, error correction, various DPI codes, and so on. It printed well, but could never scan and read any of the documents printed, so it is worthless. What is needed is something like the Paperbak utility, but with a lot more robust error detection (like checking if blocks are at an angle similar to how QR codes can be scanned from any direction.) This utility would have to be completely open for it to have any use at all. However, if it could be done to print small documents to paper, it would help greatly in some situations, such as recovering encryption keys, archived tax documents, and so on.
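
    For something as small as an encryption key, a much cruder approach than 2D codes can be sketched: print the data as numbered base32 text lines, each with its own CRC32, so the page can be retyped or OCRed and every line checked on its own. The line layout and function names below are invented for illustration, and this offers error detection only, with none of the rotation tolerance of a QR code.

```python
import base64
import zlib


def to_paper_lines(data: bytes, bytes_per_line: int = 20) -> list[str]:
    """Encode data as 'index base32 crc32' text lines suitable for printing."""
    lines = []
    for n, start in enumerate(range(0, len(data), bytes_per_line)):
        chunk = data[start:start + bytes_per_line]
        text = base64.b32encode(chunk).decode("ascii")
        lines.append(f"{n:04d} {text} {zlib.crc32(chunk):08X}")
    return lines


def from_paper_lines(lines: list[str]) -> bytes:
    """Decode lines back to bytes, raising ValueError on any damaged line."""
    out = bytearray()
    for expected_n, line in enumerate(lines):
        n, text, crc = line.split()
        chunk = base64.b32decode(text)  # raises a ValueError subclass on bad input
        if int(n) != expected_n or zlib.crc32(chunk) != int(crc, 16):
            raise ValueError(f"line {n} is damaged or out of order")
        out += chunk
    return bytes(out)
```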

    Ironically, in general, we have the formats for long term storage. We just don't have any that are open.

    Hardware is an issue too. Hard drives are not archival media. Tapes are, but one with a reasonable capacity is expensive, well out of reach for all but the enterprise customers. It would be a viable niche for a company to make a relatively low cost tape drive that could work on USB 3, has a large buffer (combined with variable tape speeds to prevent shoe-shining), and has backup software with it that is usable and open, where the formats can be re-engineered years down the road for decoding.

  4. Re:Punch cards by mirix · · Score: 2

    Stamped / punched stainless steel sheets would probably be about the best option, if you want something to really stick around. Less brittle than rock carvings too.

    --
    Sent from my PDP-11
  5. Re:Word! by phantomfive · · Score: 2

    Seriously, what's wrong with the MS Word .doc format? Feature complete, stable, lots of free implementations.

    Because it's not feature complete (otherwise Microsoft wouldn't keep adding features), it's not stable, and the free implementations aren't completely compatible.

    data archiving format in 500 years; but wouldn't be surprised if a good old-fashioned .doc works just fine.

    You can have trouble opening a .doc from a few years ago......

    --
    "First they came for the slanderers and i said nothing."
  6. Re:Punch cards by Firethorn · · Score: 4, Insightful

    The ultimate strategy is to duplicate it in so many different areas that at least one of them survives. Preferably multiple ones.

    The more critical the data, the more spots you duplicate it in.

    Though you have to realize that eventually everything will be lost.
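
    The duplication strategy above can be sketched as a majority vote over whatever copies are still readable. The function name `recover_from_replicas` is hypothetical, and lost copies are represented as `None` for the purposes of the example.

```python
from collections import Counter


def recover_from_replicas(replicas):
    """Return the value that a strict majority of surviving copies agree on."""
    surviving = [r for r in replicas if r is not None]
    if not surviving:
        raise LookupError("every copy has been lost")
    value, votes = Counter(surviving).most_common(1)[0]
    if votes * 2 <= len(surviving):
        raise ValueError("surviving copies disagree and none has a majority")
    return value
```

    With three or more copies, one silently corrupted replica is outvoted. With only two, a disagreement can be detected but not resolved.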

    --
    I don't read AC A human right
  7. Re:Punch cards by ShanghaiBill · · Score: 2

    What other storage medium, besides rock carving, can survive an EMP blast?

    Nearly all of them. Flash media, including SD cards, SSDs, etc., should survive. An HDD that is powered off should survive. The biggest threat is to anything that is connected to mains power. The power supply in your desktop computer may die, but a powered-off laptop should be fine.

  8. Too bad your DNA is useless to most MDs by Theovon · · Score: 2

    ... or for that matter any of your medical history. MDs do spot-diagnosis in 5 minutes or less based exclusively on what they've memorized or else they do no diagnosis at all.

    My wife has a major genetic defect (MTHFR C677T), which causes severe nutritional problems. We haven't yet met an MD who has a clue about nutrition. Moreover, we had to diagnose this problem ourselves through genetic testing, with no doctors involved. We've shown the results to doctors, and they don't entirely disbelieve us, but they also have no clue what to do about it and still are dubious of the symptoms. (Who has symptoms of Beriberi these days? Someone whose general ability to absorb nutrients is severely compromised.)

    What makes anyone think that this will change if your doctor has access to your DNA, even with detailed analysis? They won't take the time to actually read any of it. In fact a lot of what we know about genetic defects pertains to problems in generating certain kinds of enzymes, a lot of which participate in nutrient absorption. (So obviously RESEARCHERS know something about nutrition.) These nutritional problems require supplementation that MDs don't know about. Do you think the typical MD knows that Folic Acid is poison to those with C677T? Nope. They don't know the differences between folic acid, folinic acid, and methylfolate and still push folic acid on all pregnant women (they should be pushing methylfolate). They also don't know the differences between the various forms of B12 and always prescribe cyanocobalamin even for people who need the methyl and hydroxy forms.

    Another way in which MDs are useless is caused by their training. Basically, they're trained to be skeptical and dismissive. Many nutritional and autoimmune disorders manifest with a constellation of symptoms, along with severe brainfog. Someone with one of these problems will generally want to write down the symptoms when talking to a doctor, because they can't think clearly. The thing is, in med school, doctors are specifically trained to look out for patients with constellations of symptoms and written lists, and they are told to recognize this as a condition that is entirely within the mind of the patient. Of course, a lot of doctors, even if not trained to dismiss things as "all in their head" are terrible at diagnosis anyway. They'll have no clue where to start and won't have the patience to do extensive testing. It's too INCONVENIENT and time-consuming. They won't make enough money off patients like this, so they get patients like this out the door as fast as possible.

    I've had some good experiences with surgeons. But for any other kind of medical treatment, MDs have been mostly useless to me and my family. In general, if we go to one NOW, we've already diagnosed the problem (correctly) and possibly need advice on exactly which medicine is required, although when it comes to antibiotics, it's easy enough to find out which ones to use. (Medical diagnosis based on stuff you look up on the internet is really hard and requires a very well-trained bullshit filter, and you also have to know how to use the more authoritative sources properly. However, it's not impossible for people with training in things like law, information science, and biology. It just requires really good critical thinking skills. BTW, most MDs don't have that.)

    MDs are technicians. Most of them are like those B-average CS grads from low-ranked schools who can barely manage to write Java applications. If you know how to deal with a low-level technician, guide them properly, and stroke their ego in the right way, you can deal with an MD.

    1. Re:Too bad your DNA is useless to most MDs by Bite+The+Pillow · · Score: 2

      Paraphrased:

      I forgot that doctors are people, and that the bottom half are generally worthless, and the average ones are average. Also, diagnosing a rare problem is hard because it is unlikely to be a rare problem.

      I also forgot that doctors are the people who didn't tire of medical school shenanigans and change studies.

      And I bear a grudge because I didn't find that top-notch House-like genius who, despite being wrong every show, succeeds in the end.

      Finally, I have no idea why and how insurance, both medical and liability, affects what care is given.

      Seriously, it is a hard position to be in, but you are angry at the wrong things.

  9. Re:Punch cards by RabidReindeer · · Score: 2

    What is the high temperature limit for optical media?

    .

    Will a CD-ROM survive at 400 degrees Fahrenheit? Punch cards and rocks will.

    But what about 451 degrees Fahrenheit? You're down to rocks at that point.