Ask Slashdot: Simple Way To Backup 24TB of Data Onto USB HDDs ?

Posted by samzenpus on Thursday August 9, 2012 @09:24PM from the save-often dept.

An anonymous reader writes "Hi there! I'm looking for a simple solution to back up a big data set consisting of files between 3MB and 20GB, for a total of 24TB, onto multiple hard drives (USB, FireWire, whatever). I am aware of many backup tools which split the backup onto multiple DVDs with the infamous 'insert disc N and press continue', but I haven't come across one that can do it with external hard drives (insert next USB device...). OS not relevant, but Linux (console) or MacOS (GUI) preferred... Did I miss something, or is there no such thing already done, and am I doomed to code it myself?"

25 of 405 comments

Re:USB and disk Speed by gagol · 2012-08-09 21:30 · Score: 4, Informative

If you can achieve a sustained write speed of 50 megabytes per second, you are in for 140 hours of data transfer. I hope it is not a daily backup!
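As a rough check on that figure, the arithmetic can be sketched like this. It assumes a constant 50 MB/s with no per-file overhead, and it shows both the binary (TiB/MiB) and decimal (TB/MB) readings of the units; the function name is only illustrative.

```python
def transfer_hours(total_bytes: float, bytes_per_second: float) -> float:
    """Hours needed to move total_bytes at a constant rate."""
    return total_bytes / bytes_per_second / 3600


TIB = 2 ** 40  # bytes in one tebibyte
MIB = 2 ** 20  # bytes in one mebibyte

# Binary reading: 24 TiB at 50 MiB/s.
hours_binary = transfer_hours(24 * TIB, 50 * MIB)
# Decimal reading: 24 TB at 50 MB/s.
hours_decimal = transfer_hours(24e12, 50e6)

print(f"binary units:  {hours_binary:.1f} hours")
print(f"decimal units: {hours_decimal:.1f} hours")
```

The binary reading comes out just under 140 hours and the decimal one a little over 133, so either way it is most of a week of continuous writing.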

--
Tomorrow is another day...
Bacula is your friend by bernywork · 2012-08-09 21:32 · Score: 4, Informative

http://www.bacula.org/en/
There's even a howto here:
http://wiki.bacula.org/doku.php?id=removable_disk

--
Curiosity was framed; ignorance killed the cat. -- Author unknown
1. Re:Bacula is your friend by Anonymous Coward · 2012-08-09 21:50 · Score: 3, Informative
  
  Yes, Bacula is the only real solution out there that isn't going to cost you an arm and a leg, and that allows you to switch easily between any backup medium. As long as your MySQL catalog is intact, restoration is a cinch...
  Did I mention it supports backup archiving as well, if you want duplicate copies for tapes being shipped off site...
2. Re:Bacula is your friend by arth1 · 2012-08-09 23:17 · Score: 5, Informative
  
  Yes, Bacula is the only real solution out there that isn't going to cost you an arm and a leg, and that allows you to switch easily between any backup medium.
  Except for good old tar, which is present on all systems.
  Most people are probably not aware that tar has the ability to create split tar archives. Add the following options to tar:
  -L <max-size-in-k-per-tarfile> -M -F myscript.sh ... where myscript.sh supplies the name to use for the next tar file in the series (GNU tar reads it from the file descriptor given in $TAR_FD). It can be as easy as a for loop checking whether the tar file already exists and returning the next hooked-up volume where it doesn't.
  Or it could even unmount the current volume and automount the next volume for you. Or display a dialogue telling you to replace the drive.
  One advantage is that you can easily extract from just one of the tar files; you don't need all of them or the first-and-last like with most backup systems. Each tar file is a valid one, and at most you need two tar files to extract any file, and most of them just one.
  Note that GNU tar will not combine multivolume mode with its built-in compression, so compress the data beforehand if you need to.
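A minimal sketch of the multi-volume setup, assuming GNU tar. The first function only builds the command line (it does not run tar); the second shows the naming logic a volume script could use. The paths, the script name and both function names are hypothetical; the options and the TAR_VOLUME / TAR_FD environment variables are GNU tar's.

```python
import re
from typing import List


def tar_multivolume_command(source_dir: str, first_volume: str,
                            volume_kib: int, info_script: str) -> List[str]:
    """Build (but do not run) a GNU tar command for a multi-volume archive."""
    return [
        "tar",
        "--create",
        "--multi-volume",                # same as -M
        f"--tape-length={volume_kib}",   # same as -L, in units of 1024 bytes
        f"--info-script={info_script}",  # same as -F, run at the end of each volume
        f"--file={first_volume}",
        source_dir,
    ]


def next_volume_name(current_archive: str, volume_number: int) -> str:
    """Name for the next volume: drop any numeric '-N' suffix, add the new one.

    A real info script would write this name to the file descriptor that
    GNU tar passes in TAR_FD; TAR_VOLUME holds the upcoming volume number.
    """
    base = re.sub(r"-\d+$", "", current_archive)
    return f"{base}-{volume_number}"


# Hypothetical example: 1 GiB volumes written to a mounted USB disk.
print(tar_multivolume_command("/data", "/mnt/usb/backup.tar",
                              1024 * 1024, "./myscript.sh"))
print(next_volume_name("/mnt/usb/backup.tar-2", 3))
```

Keeping the naming logic in one small function makes it easy to swap in something smarter later, such as picking the next mounted drive.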
Split into multiple tar files? by Anonymous Coward · 2012-08-09 21:34 · Score: 5, Informative

I'm guessing you don't have enough space to split a backup on the original storage medium and then mirror the splits onto each drive?
Given the size requirements, it seems that might be prohibitive, but it would make things easier for you:
How to Create a Multi Part Tar File with Linux
RAID by Anonymous Coward · 2012-08-09 21:34 · Score: 5, Informative

For that much data you want a RAID, since drives tend to fail if left sitting on the shelf, and they also tend to fail (for different reasons) if they are spinning.
Basically: buy a RAID enclosure, insert drives so it looks like one giant drive, then copy files.
For 24TB you can use eight 4TB drives for a 6+2 RAID-6 setup. Then if any two of the drives fail you can still recover the data.
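The capacity sum behind that layout, as a small sketch: RAID-6 spends two drives' worth of space on parity, so usable space is the drive count minus two, times the drive size. The function name is illustrative, and real enclosures lose a little more to formatting.

```python
def raid6_usable_tb(drive_count: int, drive_tb: float) -> float:
    """Usable capacity of a RAID-6 set: two drives' worth goes to parity."""
    if drive_count < 4:
        raise ValueError("RAID-6 needs at least four drives")
    return (drive_count - 2) * drive_tb


# Eight 4TB drives, as in the 6+2 layout described above.
print(raid6_usable_tb(8, 4))
```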
1. Re:RAID by Sarten-X · 2012-08-10 00:58 · Score: 3, Informative
  
  As mentioned already, RAID is not a backup solution. While it will likely work fine for a while, the risk of a catastrophic failure rises as drive capacity increases. From the linked article:
  
  With a twelve-terabyte array the chances of complete data loss during a resilver operation begin to approach one hundred percent - meaning that RAID 5 has no functionality whatsoever in that case. There is always a chance of survival, but it is very low.
  Granted, this is talking about RAID 5, so let's naively assume that doubling the parity disks for RAID 6 will halve the risk... but then since we're trying to duplicate 24 terabytes instead of twelve, we can also assume the risk doubles again, and we're back to being practically guaranteed a failure.
  Bottom line is that 24 terabytes is still a huge amount of data. There is no reliable solution I can think of for backing it all up that will be cheap. At that point, you're looking at file-level redundancy managed by a backup manager like Backup Exec (or whatever you prefer) with the data split across a dozen drives. As also mentioned already, the problem becomes much easier if you're able to reduce that volume of data somewhat.
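To put a rough number on the rebuild risk quoted above, here is a sketch of the usual back-of-the-envelope model. It assumes one unrecoverable read error per 1e14 bits read (a figure often quoted for consumer drives, not something stated in this thread), that errors are independent, and that a single error during a single-parity rebuild is fatal. All three are simplifications.

```python
import math


def p_read_error(bytes_read: float, ure_per_bit: float = 1e-14) -> float:
    """Probability of at least one unrecoverable read error while reading
    bytes_read bytes, assuming independent errors at ure_per_bit."""
    bits = bytes_read * 8
    # (1 - p) ** bits, computed via logs to stay numerically stable.
    p_clean = math.exp(bits * math.log1p(-ure_per_bit))
    return 1.0 - p_clean


p_12tb = p_read_error(12e12)
p_24tb = p_read_error(24e12)
print(f"12 TB read: {p_12tb:.0%} chance of at least one error")
print(f"24 TB read: {p_24tb:.0%} chance of at least one error")
```

Under these assumptions the 12 TB case lands a bit above 60% and the 24 TB case around 85%: high, though short of certain, and very sensitive to the assumed error rate.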
  
  --
  You do not have a moral or legal right to do absolutely anything you want.
2. Re:RAID by louic · 2012-08-10 02:54 · Score: 3, Informative
  
  As mentioned already, RAID is not a backup solution.
  Nevertheless, there is nothing wrong with using disks that happen to be in a RAID configuration as backup disks. In fact, it is probably a pretty good idea for large files and large amounts of data.
git-annex by Anonymous Coward · 2012-08-09 21:40 · Score: 4, Informative

You might want to look into git-annex:
http://git-annex.branchable.com/
I've not tried it, but it sounds like an ideal solution for your request, especially if your data is already compressed.
Tar already does this by cyocum · 2012-08-09 21:56 · Score: 3, Informative

Have a look at tar and its "multi-volume" option.
1. Re:Tar already does this by leuk_he · 2012-08-09 22:09 · Score: 5, Informative
  
  Multi volume tar. Just mount a new USB disk whenever it is full.
  However, to have a reasonable retrieval rate (going through 24 TB of data will take some days over USB2), you had better split the dataset into multiple smaller sets. That also has the advantage that if one disk crashes (AND consumer-grade USB disks will crash!), your entire dataset is not lost.
  For that reason (disk failure), do not use some Linux disk-spanning feature. File systems are lost when one of the disks they are written on is lost, unless you use a feature that can handle lost disks (RAID / RAID-Z).
  And last but not least: test your backup. I have seen cheap USB interfaces fail to write the data to disk without a good error message. All looks OK until you retrieve the data and some files are corrupted.
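One way to test a backup for the silent corruption described above is to compare checksums of the source and the copy. This is a minimal sketch that assumes the backup is a plain file tree mirroring the source; the function names are illustrative.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(source_root: Path, backup_root: Path) -> List[str]:
    """Relative paths of files that are missing or differ in the backup."""
    bad = []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_root)
        copy = backup_root / rel
        if not copy.is_file() or sha256_of(src) != sha256_of(copy):
            bad.append(rel.as_posix())
    return bad
```

Calling `find_mismatches(Path("/data"), Path("/mnt/usb"))` (both paths hypothetical) returns an empty list when every file made it across intact. It reads everything twice, so on a data set this size expect it to take about as long as the backup itself.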
Linuxquestions thread on multi-disk backups by Anonymous Coward · 2012-08-09 21:56 · Score: 2, Informative

Here's a Linuxquestions thread outlining multi-disk backup strategies.
The gist of the discussion is to use DAR.
Bash.... by djsmiley · 2012-08-09 22:08 · Score: 4, Informative

First bash script to grab the size of the "current" storage;
compress the files up until that size;
Move compressed file onto storage;
request new storage, start again.
----------
Or, if you've got all the storage already connected: for x in $(seq 0 "$n"); do cp "$archive$x" "/mount/$x/"; done :D
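The fill-one-drive-then-ask-for-the-next loop above can be sketched as a planning step. This version leaves out compression and simply walks the file list in order, starting a new volume whenever the next file would not fit; names and sizes are hypothetical inputs you would collect yourself.

```python
from typing import Iterable, List, Tuple


def plan_volumes(files: Iterable[Tuple[str, int]], capacity: int) -> List[List[str]]:
    """Assign files, in order, to volumes of the given capacity (same unit
    as the file sizes). A new volume starts when the next file won't fit."""
    volumes: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, size in files:
        if size > capacity:
            raise ValueError(f"{name} does not fit on an empty volume")
        if used + size > capacity:
            volumes.append(current)   # current drive is full, "request new storage"
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        volumes.append(current)
    return volumes


# Toy example with sizes in GB and 10 GB "drives".
print(plan_volumes([("a", 6), ("b", 5), ("c", 4), ("d", 10)], 10))
```

Going in order keeps related files together on the same drive at the cost of some wasted space; sorting by size first would pack tighter but scatter directories.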

--
- http://www.milkme.co.uk
Use DAR or KDAR by pegasustonans · 2012-08-09 22:18 · Score: 2, Informative

If you don't want to invest in new hardware, you could use DAR or KDAR (KDE front-end for DAR).
With KDAR, what you want is the slicing settings.
There's an option to pause between slices, which gives you time to mount a new disk.

--
And all our yesterdays have lighted fools The way to dusty death. --Will
Re:solution by aglider · 2012-08-09 22:23 · Score: 4, Informative

3.samba
Uh? Why?
cp -a is all you need once you put the HDD inside the target machine.
And if you put it into another machine on the same network, then rsync is the answer.
Forget about the buggy and slow SAMBA.

--
Sent as ripples into the electromagnetic field. No single photon has been harmed in the process.
Re:No. by ledow · 2012-08-09 22:34 · Score: 1, Informative

USB 2.0 provides 480Mbps of (theoretical) bandwidth. So unless you go Gigabit all over your network (not unreasonable), you won't beat it with a NAS. Even then, Gigabit is only about twice as fast as USB working flat-out (the difference being that if you have multiple USB buses, you can get multiple drives working at once). And USB 3.0 would beat it again. And 10Gb between the client and a server is still an expensive network to deploy.
Granted, eSATA would probably be faster but there's nothing wrong with USB for such tasks if you *don't* want to provide Gigabit connections everywhere and (presumably) greater-than-gigabit backbones.
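For comparison, the nominal link rates can be put side by side like this. These are signalling rates only; real-world throughput is lower on every one of them, and the table and function names are just for illustration.

```python
# Nominal link rates in megabits per second.
LINKS_MBIT = {
    "USB 2.0": 480,
    "Gigabit Ethernet": 1000,
    "USB 3.0": 5000,
    "10 Gigabit Ethernet": 10000,
}


def mbit_to_mbyte_per_s(mbit: float) -> float:
    """Convert megabits per second to megabytes per second."""
    return mbit / 8


for name, mbit in LINKS_MBIT.items():
    print(f"{name:>20}: {mbit_to_mbyte_per_s(mbit):7.1f} MB/s nominal")

gigabit_vs_usb2 = LINKS_MBIT["Gigabit Ethernet"] / LINKS_MBIT["USB 2.0"]
print(f"Gigabit Ethernet vs USB 2.0: {gigabit_vs_usb2:.2f}x")
```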
Re:Tape? by Anonymous Coward · 2012-08-09 22:57 · Score: 5, Informative

No kidding. For $2400, you get 24x 1TB HDs and a bookkeeping nightmare if you ever actually resort to the "backup." For $3k, you get a network-ready tape autoloader with 50-100TB capacity and easy access through any number of highly refined backup and recovery systems.
Now, if the USB requirement is because that's the only way to access the files you want to steal from an employer or government agency, then the time required to transfer across the USB will almost guarantee you get caught. Even over the weekend. You should come up with a different method for extracting the data.
PAR by fa2k · 2012-08-09 22:59 · Score: 3, Informative

I have just seen "PAR" a couple of times here on slashdot, haven't used it, but it seems great for this: http://en.wikipedia.org/wiki/Parchive . You need enough redundancy to allow one USB drive to fail. And I would rather get a SATA bay and use "internal" drives than having to deal with external USB drives. Get "green" drives, they are slow but cheap.
Re:solution by myowntrueself · 2012-08-09 23:35 · Score: 1, Informative

3.samba
Uh? Why?
cp -a is all you need once you put the HDD inside the target machine.
And if you put it into another machine on the same network, then rsync is the answer.
Forget about the buggy and slow SAMBA.
cp copies file by file.
A more efficient way is something like
tar -cf - .|(cd /somewhere/ ; tar xf -)
tar treats the directory contents as a data stream. It's much faster for large amounts of files and data.

--
In the free world the media isn't government run; the government is media run.
Re:USB and disk Speed by Anonymous Coward · 2012-08-10 00:09 · Score: 2, Informative

It's "nudge-nudge", not "notch-notch".
Also, you left out "wink-wink".
Yes, I know, I should get a life..
Re:USB and disk Speed by v1 · 2012-08-10 00:31 · Score: 4, Informative

I have a setup here where the server's video media is about 8TB in size. That backs up via rsync to the backup server, which is in another room. The backup server contains a large number of internal and external drives. None of them are over 2TB in capacity. The main drive has data separated into subfolders and the rsync jobs back up specific folders to specific drives.
A few times I've had to do some rearranging of data on the main and backup drives when a volume filled up. So it helps to plan ahead to save time down the road. But it works well for me here.
The only thing with rsync you need to worry about is users moving large trees or renaming root folders in large trees. This tends to cause rsync to want to delete a few TB of data and then turn around and copy it all over again on the backup drive. It doesn't follow files and folders by inode, it just goes by exact location and name.
I help mitigate this by hiding the root folders from the users. The share points are a couple levels deeper so they can't cause TOO big of a problem if someone decides to "tidy up". If they REALLY need something at a lower level moved or renamed, I do it myself, on both the source and the backup drives at the same time.
Another alternative is to get something like a Drobo where you can have a fairly inexpensive large pool of backup storage space that can match your primary storage. This prevents the problem of smaller backup volumes filling up and requiring data shuffling, but does nothing for the issue of users mucking with the lower levels of the tree.
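The folder-to-drive arrangement described above could be written down as a simple mapping that generates one rsync job per folder. This sketch only builds the command lines; the paths are made up, and the use of `-a --delete` is an assumption about how such jobs are typically run, not a detail given here.

```python
from typing import Dict, List


def rsync_jobs(source_root: str, folder_to_drive: Dict[str, str]) -> List[List[str]]:
    """One rsync command per top-level folder, each aimed at its own drive."""
    jobs = []
    for folder, drive in sorted(folder_to_drive.items()):
        # Trailing slashes make rsync copy the folder's contents into dest.
        src = f"{source_root.rstrip('/')}/{folder}/"
        dest = f"{drive.rstrip('/')}/{folder}/"
        # --delete mirrors removals, which is also why a rename on the
        # source turns into a delete plus a full re-copy on the backup.
        jobs.append(["rsync", "-a", "--delete", src, dest])
    return jobs


# Hypothetical layout: two folders, two backup drives.
for job in rsync_jobs("/srv/video", {"raw": "/mnt/backup1", "edits": "/mnt/backup2"}):
    print(" ".join(job))
```

Keeping the mapping in one place makes the rearranging step easier when a drive fills up: move the folder, change one entry.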

--
I work for the Department of Redundancy Department.
Re:solution by fnj · 2012-08-10 00:58 · Score: 5, Informative

No. It's slower. Informative, my ass.
Re:USB and disk Speed by milgr · 2012-08-10 01:17 · Score: 2, Informative

The LHC generates a petabyte per second.

--
Where law ends, tyranny begins -- William Pitt
Re:DaisyChain by Painted · 2012-08-10 02:35 · Score: 4, Informative

DON'T DO THIS.

We did this exact thing using WD Green drives for our 18TB backup problem. Got two of 'em, planning on using their built-in rsync for onsite/off siting the data. Unfortunately, the units never broke 1MB/s transfer, and no amount of work with Drobo yielded faster performance reliably. Both of our units are now sitting unused ($2500 each!), and we put the drives into a RAID-50 8 bay USB3 enclosure. The new unit runs about 150x faster, and ended up costing $400 (prices are for enclosures only, drives were additional).

Most disappointing was Drobo's support - they just seemed to shrug a lot, and were hyper-aggressive about closing trouble tickets.

--
http://marsandmore.com - Posters of space, spacecraft, and astronomy.
Re:USB and disk Speed by voltorb · 2012-08-10 02:36 · Score: 3, Informative

But only 1GB/s is recorded: http://www.itnews.com.au/News/310769,computing-for-the-large-hadron-collider.aspx