Slashdot Mirror


Long-Term Storage of Moderately Large Datasets?

hawkeyeMI writes "I have a small scientific services company, and we end up generating fairly large datasets (2-3 TB) for each customer. We don't have to ship all of that, but we do need to keep some compressed archives. The best I can come up with right now is to buy some large hard drives, use software RAID in Linux to make a RAID5 set out of them, and store them in a safe deposit box. I feel like there must be a better way for a small business, but despite some research into Blu-ray, I've not been able to find a good, cost-effective alternative. A tape library would be impractical at the present time. What do you recommend?"

19 of 411 comments

  1. Exactly what you're doing by rwa2 · · Score: 4, Informative

    I don't think you can beat a bunch of conventional hard disks in a RAID5 for both cost-per-TB and backup/restore performance, not to mention medium-term data integrity. You might be able to make hooking up the drives more convenient with an eSATA multi-bay enclosure, but those are kinda expensive. I bet your backup box already has some sort of hot-swap on it, though, like: http://www.amazon.com/Thermaltake-BlacX-eSATA-Docking-Station/dp/B001A4HAFS

    I assume you already compress your data, since scientific datasets tend to compress well. You might consider compressing to squashfs, since it will let you do transparent decompression later on so you can skip the restore step if you just need a handful of files.

    1. Re:Exactly what you're doing by forgottenusername · · Score: 5, Interesting

      I don't think it's a great solution. You're storing relatively fragile hard drives in a RAID5 configuration in a lock box? It's not like you can tell if one of the drives goes bad and needs to be replaced when it's sitting in a box. You'd have to regularly pull the data sets out, fire them up and make sure everything is still functional.

      I'd at least want to do 2 complete sets of mirrored drives.

      Tape storage does store better.

      Depending on how important the data is, I might do something like a local mirrored drive set in storage and an online copy at something like rsync.net - stay away from S3, it's not designed to protect data, despite what AWS fans may say.
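
      The "pull them out and make sure everything is still functional" step could be scripted. A minimal sketch in Python, assuming you record a SHA-256 manifest when the archive is written and re-run the check whenever the drives come out of storage. The function names and the manifest layout are illustrative, not taken from any particular backup tool:

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-GB archives never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return the relative paths that are missing or no longer match."""
    root = Path(root)
    bad = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel)
    return bad
```

      Keep the manifest somewhere other than the drives it describes; an empty list from verify_manifest means every file still reads back with the digest it had when it was archived.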

    2. Re:Exactly what you're doing by hardburn · · Score: 4, Insightful

      That's why you hot-swap them. You treat them just like tapes. In fact, once you start doing that, you realize that RAID mirroring isn't helping you any (striping is another matter).

      The best way to back up a big hard drive these days is with another big hard drive.

      --
      Not a typewriter
    3. Re:Exactly what you're doing by rwa2 · · Score: 4, Informative

      Yeah, keeping those drives in a huge online storage array is probably better. Then they can mirror them across multiple sites.

      Here's a compelling petabyte online RAID system for cheap:

      http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/

    4. Re:Exactly what you're doing by TrippTDF · · Score: 4, Funny

      your sig is incredibly apt for your post...

    5. Re:Exactly what you're doing by lgw · · Score: 4, Informative

      Tape is really best for archiving, to this day. A single LTO drive won't break the bank for a small business, and it will be reliable.

      3 Things to remember about tape backup:

      Encrypt your backups. This is becoming available in the tape drive itself, but many backup applications will also do it for you in software. Limits embarrassment if a tape goes missing.

      Occasionally test restores. This is incredibly important - almost every unreadable tape in existence was unreadable when created. Any reasonable backup software will give you the ability to do this automatically (as part of the backup job). If practical, create a job that does a backup of everything, but verifies only some small volume. If you can read anything, chances are high that the whole tape is fine.

      Get those tapes offsite. A safe deposit box works for a tiny company, but someone like Iron Mountain works better and is less hassle. Store a copy of your encryption key in the same facility (but don't transport the tape and key together).
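
      If your backup software can't do the partial verify for you, the "back up everything, verify a small sample" idea can be approximated by hand. A rough sketch in Python, assuming you have restored some or all of the backup into a scratch directory next to the original data; the function name and parameters are made up for illustration:

```python
import filecmp
import random
from pathlib import Path


def spot_check_restore(source_root, restored_root, sample_size=20, seed=None):
    """Compare a random sample of restored files byte-for-byte with the originals.

    Returns the relative paths that are missing or differ in the restore.
    """
    source_root = Path(source_root)
    restored_root = Path(restored_root)
    files = sorted(p for p in source_root.rglob("*") if p.is_file())
    rng = random.Random(seed)
    sample = rng.sample(files, min(sample_size, len(files)))
    failures = []
    for original in sample:
        rel = original.relative_to(source_root)
        restored = restored_root / rel
        # shallow=False forces a content comparison rather than a stat() check
        if not restored.is_file() or not filecmp.cmp(original, restored, shallow=False):
            failures.append(str(rel))
    return sorted(failures)
```

      An empty list only says the sampled files came back intact; it is a cheap smoke test, not a full verify.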

      --
      Socialism: a lie told by totalitarians and believed by fools.
    6. Re:Exactly what you're doing by Again · · Score: 4, Insightful

      (Or btrfs on a Linux distro)

      Are you honestly suggesting using an in-development filesystem for backup purposes?

  2. bzip2 by Colin Smith · · Score: 5, Funny

    And optar:

    http://ronja.twibright.com/optar/

    You know it makes sense.

    --
    Deleted
  3. Amazon AWS? by TSHTF · · Score: 4, Interesting

    It might not be the cheapest option, but with Amazon's AWS, you can snail mail them a copy of the drive with the data and they'll store it in S3 storage buckets.

  4. Different manufacturers by idiot900 · · Score: 4, Insightful

    Hard drives are ridiculously cheap these days, especially for how much data you are storing. You may wish to consider buying drives from different manufacturers but of the same size to put in a single mirrored set. This way if there is a problem with a particular batch of drives it won't ruin everything.

  5. Tape is your friend by chill · · Score: 5, Informative

    LTO tape, properly stored, will outlast burned optical media and hard drives. Great stuff and designed specifically for what you're talking about.

    http://en.wikipedia.org/wiki/Linear_Tape-Open

    --
    Learning HOW to think is more important than learning WHAT to think.
    1. Re:Tape is your friend by Saint Aardvark · · Score: 5, Informative

      Couldn't agree more. A tape library (as in autochanger) might be out of your budget, but a simple tape drive wouldn't be too much -- say $5000 for an LTO4. Media is $50-$100 or so depending on where you shop. Seriously, you're not going to find a reasonable way of storing that much data anywhere else.
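
      To put rough numbers on the media side: a back-of-the-envelope sketch in Python, assuming LTO-4's native (uncompressed) capacity of 800 GB per cartridge and the $50-$100 media price quoted above. The 2-3 TB dataset sizes come from the original question; decimal units (1 TB = 1000 GB) and data that is already compressed (so no further gain from drive compression) are assumptions:

```python
import math

LTO4_NATIVE_GB = 800  # native capacity of one LTO-4 cartridge


def tapes_needed(dataset_gb, tape_gb=LTO4_NATIVE_GB):
    """Cartridges required for one copy of a dataset (no spanning overhead modelled)."""
    return math.ceil(dataset_gb / tape_gb)


def media_cost_range(dataset_gb, low=50, high=100, copies=1):
    """(min, max) media cost in dollars for the given number of copies."""
    count = tapes_needed(dataset_gb) * copies
    return count * low, count * high


for size_gb in (2000, 3000):
    print(size_gb, tapes_needed(size_gb), media_cost_range(size_gb))
```

      Under those assumptions a 2-3 TB dataset needs 3 or 4 cartridges per copy, i.e. roughly $150-$400 of media per customer, on top of the one-time cost of the drive.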

      BTW, if you're not a member of LOPSA, you may want to seriously consider it. Even if you're not a sysadmin, this is definitely a sysadmin-type question, and their mailing lists are second to none. It's an excellent resource.

  6. Re:Exactly. by TooMuchToDo · · Score: 5, Informative

    Because Amazon can be *expensive* compared to doing it yourself ($$$ for data in, $$$ for data out, $$$ for monthly storage). But heh, what do I know. I just manage the storage for one of the LHC detectors (5PB spinning disk, 17PB tape). Amazon is good when you've got VC money or have no IT folks.

  7. I'd encrypt the data and... by Rivalz · · Score: 5, Funny

    Label it something like "Complete American Idol Blu-ray Collection" and upload it over P2P to The Pirate Bay. Every couple of years, rename it to some other horrible popular TV series. It will be a self-sustaining form of storage with an infinite number of redundant hosts.

  8. Re:Exactly. by Anonymous Coward · · Score: 5, Insightful

    Ok, yes, we see you know a lot about this.

    So what's your recommendation?

  9. Never use DVDs as long term storage. by strangeattraction · · Score: 4, Interesting

    Repeat: never use DVDs as long-term storage. I have seen them go unreadable after anywhere from 2 to 5 years. I have fired up disk drives 10 years later with no problems. They are cheap, reliable and fast. Don't try to get fancy: just compress and store data sets over multiple volumes. Don't use RAID.
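
    For the "store data sets over multiple volumes" part, one way it might look in Python, assuming the dataset has already been compressed into a single archive file (a .tar.gz, say) and each part then goes onto its own drive. The part-naming scheme and function names are invented for this sketch:

```python
import shutil
from pathlib import Path


def split_into_volumes(archive_path, out_dir, volume_bytes, block_bytes=1 << 20):
    """Cut one archive file into numbered parts of at most volume_bytes each."""
    if volume_bytes <= 0:
        raise ValueError("volume_bytes must be positive")
    archive_path = Path(archive_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    with open(archive_path, "rb") as src:
        while True:
            block = src.read(min(block_bytes, volume_bytes))
            if not block:
                break  # source exhausted
            part = out_dir / f"{archive_path.name}.part{len(parts):04d}"
            written = 0
            with open(part, "wb") as dst:
                while block:
                    dst.write(block)
                    written += len(block)
                    remaining = volume_bytes - written
                    if remaining == 0:
                        break  # this volume is full
                    block = src.read(min(block_bytes, remaining))
            parts.append(part)
    return parts


def join_volumes(parts, dest_path):
    """Reassemble the archive by concatenating the parts in name order."""
    with open(dest_path, "wb") as dst:
        for part in sorted(parts):
            with open(part, "rb") as src:
                shutil.copyfileobj(src, dst)
```

    The parts are plain byte ranges, so losing one volume loses the whole archive; that is the trade-off against RAID, and a reason to keep a second full set.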

  10. Re:GMail Drive by 0100010001010011 · · Score: 4, Interesting

    That's what ZFS is for.

    mount -t gmailfs /disk1 -o username=gmailuser,password=gmailpass
    mount -t gmailfs /disk2 -o username=gmailuser,password=gmailpass
    mount -t gmailfs /disk3 -o username=gmailuser,password=gmailpass
    mount -t gmailfs /disk4 -o username=gmailuser,password=gmailpass
    mount -t gmailfs /disk5 -o username=gmailuser,password=gmailpass

    zpool create gzfs raidz1 disk1 disk2 disk3 disk4 disk5

    Actually.... I think I just found my project for the evening. I mean it's already been done with 12 USB drives

  11. Depends on frequency of access by adosch · · Score: 4, Informative

    I work as a contractor for the USGS, and the projects I've been involved with host, archive and provide means for customers to access all our different satellite data products. We've got a long-term archive method for tons of data products (digital and tangible), and I can honestly tell you the first thing that always comes up is: how often will the data need to be accessed?

    For the longest time (almost a decade) we used three big STK tape silos for data archive and retrieval for custom orders. The problem with that design is that we used an archive in completely the wrong manner: we tried to make it serve as both an archive and a quasi-online retrieval system feeding a caching filesystem. We had tape mount counts in the hundreds and thousands, constant mechanical tape issues because of the excessive use, etc. We eventually decided to move it all to online storage on enterprise RAID (EMC CLARiiON), with a small LTO-4 tape unit for the almost-permanent, maybe-once-in-a-great-while storage; the rest we leave completely on spinning disk and control access to it via application-layer network protocols as needed.

    IMHO, it really depends on the access frequency of your data. If that customer needs their data once, and maybe never again unless they lose it, put it on tape. If it's a requirement that they can get the data from you any time they want, and you've got the hardware and administrative resources, power and bandwidth, put it on some RAID.

  12. Re:Exactly. by TooMuchToDo · · Score: 4, Insightful

    Either MogileFS, Lustre, or possibly Hadoop (depending on the type and size of the data). Any sort of distributed file system where multiple chunks, replicas, etc. (3 is a good number, more is better if you have cheap disk and deduping at the filesystem level) are constantly available.

    Feel free to ask more questions.