Ask Slashdot: Simple Way To Backup 24TB of Data Onto USB HDDs ?
An anonymous reader writes "Hi there! I'm looking for a simple solution to back up a big data set consisting of files between 3MB and 20GB, for a total of 24TB, onto multiple hard drives (USB, FireWire, whatever). I am aware of many backup tools which split the backup onto multiple DVDs with the infamous 'insert disc N and press continue', but I haven't come across one that can do it with external hard drives (insert next USB device...). OS not relevant, but Linux (console) or MacOS (GUI) preferred... Did I miss something or is there no such thing already done, and am I doomed to code it myself?"
If you can achieve a sustained write speed of 50 megabytes per second, you are in for 140 hours of data transfer. I hope it is not a daily backup!
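For anyone who wants to check that estimate, here is a quick sketch of the arithmetic. The function name is made up for illustration, and the result depends on whether "TB" and "megabytes" are read as binary or decimal units, so both are shown.

```python
def transfer_hours(total_bytes: float, bytes_per_second: float) -> float:
    """Hours needed to move total_bytes at a sustained transfer rate."""
    return total_bytes / bytes_per_second / 3600


# Binary interpretation: 24 TiB at 50 MiB/s.
binary_hours = transfer_hours(24 * 2**40, 50 * 2**20)

# Decimal interpretation: 24 TB at 50 MB/s.
decimal_hours = transfer_hours(24 * 10**12, 50 * 10**6)

print(f"binary units:  {binary_hours:.1f} hours")
print(f"decimal units: {decimal_hours:.1f} hours")
```

Either way it is well over five days of continuous writing, before counting any time spent swapping drives.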
Tomorrow is another day...
http://www.bacula.org/en/
There's even a howto here:
http://wiki.bacula.org/doku.php?id=removable_disk
Curiosity was framed; ignorance killed the cat. -- Author unknown
I'm guessing you don't have enough space to split a backup on the original storage medium and then mirror the splits onto each drive?
Given the size requirements, it seems that might be prohibitive, but it would make things easier for you:
How to Create a Multi Part Tar File with Linux
For that much data you want a RAID, since drives tend to fail if left sitting on the shelf, and they also tend to fail (for different reasons) if they are spinning.
Basically: buy a RAID enclosure, insert drives so it looks like one giant drive, then copy files.
For 24TB you can use eight 4TB drives for a 6+2 RAID-6 setup. Then if any two of the drives fail you can still recover the data.
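The 6+2 sizing can be checked with a couple of lines. This is only a sketch of the raw capacity arithmetic (RAID-6 gives up two drives' worth of space to parity); the function name is hypothetical, and real formatted capacity will come out somewhat lower once filesystem overhead is included.

```python
def raid6_usable_tb(drive_count: int, drive_tb: float) -> float:
    """Raw usable capacity of a RAID-6 array: two drives' worth goes to parity."""
    if drive_count < 4:
        raise ValueError("RAID-6 needs at least four drives")
    return (drive_count - 2) * drive_tb


usable = raid6_usable_tb(8, 4)
print(f"8 x 4TB in RAID-6: {usable:.0f}TB usable before filesystem overhead")
```

Note that eight 4TB drives come out at exactly 24TB raw, which leaves no headroom for growth.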
You might want to look into git-annex:
http://git-annex.branchable.com/
I've not tried it, but it sounds like an ideal solution for your request, especially if your data is already compressed.
Have a look at tar and it's "multi-volume" option.
Here's a Linuxquestions thread outlining multi-disk backup strategies.
The gist of the discussion is to use DAR.
First bash script to grab the size of the "current" storage;
compress the files up until that size;
Move compressed file onto storage;
request new storage, start again.
----------
Or, if you've got all the storage already connected: for i in $(seq 0 "$x"); do cp "$archive$i" "/mount/$i/"; done :D
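The fill-a-drive-then-ask-for-the-next-one idea above can be sketched as a planning step that runs before anything is copied. This is a rough illustration, not a finished tool: `plan_volumes` and its arguments are made-up names, it skips the compression step, and it keeps every file whole on a single drive so each drive can be read on its own.

```python
from typing import Iterable


def plan_volumes(files: Iterable[tuple[str, int]], capacities: list[int]) -> list[list[str]]:
    """Assign files to drives in order, moving to the next drive when one is full.

    files      -- (path, size_in_bytes) pairs, in the order they should be stored
    capacities -- free bytes on each drive, in the order the drives will be used
    """
    plan: list[list[str]] = [[] for _ in capacities]
    free = list(capacities)
    drive = 0
    for path, size in files:
        # "Request new storage": skip ahead until the file fits somewhere.
        while drive < len(free) and size > free[drive]:
            drive += 1
        if drive == len(free):
            raise ValueError(f"no remaining drive can hold {path} ({size} bytes)")
        plan[drive].append(path)
        free[drive] -= size
    return plan


# Toy example with tiny made-up sizes.
example = plan_volumes([("a", 6), ("b", 5), ("c", 4), ("d", 3)], [10, 10, 10])
print(example)
```

Because it never goes back to an earlier drive, this wastes some space at the end of each one; sorting the files largest-first before planning usually packs them tighter.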
- http://www.milkme.co.uk
If you don't want to invest in new hardware, you could use DAR or KDAR (KDE front-end for DAR).
With KDAR, what you want is the slicing settings.
There's an option to pause between slices, which gives you time to mount a new disk.
And all our yesterdays have lighted fools The way to dusty death. --Will
3.samba
Uh? Why?
cp -a is all you need once you put the HDD inside the target machine.
And if you put it into another machine on the same network, then rsync is the answer.
Forget about the buggy and slow SAMBA.
Sent as ripples into the electromagnetic field. No single photon has been harmed in the process.
USB 2.0 provides 480Mbps of (theoretical) bandwidth. So unless you go Gigabit all over your network (not unreasonable), you won't beat it with a NAS. Even then, it's only about twice as fast as USB working flat-out (and the difference being if you have multiple USB busses, you can get multiple drives working at once). And USB 3.0 would beat it again. And 10Gb between the client and a server is an expensive network to deploy still.
Granted, eSATA would probably be faster but there's nothing wrong with USB for such tasks if you *don't* want to provide Gigabit connections everywhere and (presumably) greater-than-gigabit backbones.
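To put numbers on the link comparison, here is a small sketch using nominal signalling rates only. Real-world throughput is lower on every one of these links, so treat the output as a best case; the helper names are made up for illustration.

```python
# Nominal link rates in megabits per second (theoretical maximums).
LINK_MBPS = {
    "USB 2.0": 480,
    "Gigabit Ethernet": 1000,
    "USB 3.0": 5000,
    "10 Gigabit Ethernet": 10000,
}


def speed_ratio(faster: str, slower: str) -> float:
    """How many times faster one link's nominal rate is than another's."""
    return LINK_MBPS[faster] / LINK_MBPS[slower]


def hours_for(total_tb: float, mbps: float) -> float:
    """Best-case hours to move total_tb (decimal terabytes) over a link."""
    bits = total_tb * 1e12 * 8
    return bits / (mbps * 1e6) / 3600


for name, mbps in LINK_MBPS.items():
    print(f"{name:>20}: {hours_for(24, mbps):6.1f} hours for 24TB")
```

On paper, Gigabit Ethernet is a little over twice the rate of USB 2.0, and USB 3.0 is five times Gigabit.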
No kidding. For $2400, you get 24x TB HDs and a bookkeeping nightmare if you ever actually resort to the "backup." For $3k, you get a network-ready tape autoloader with 50-100TB capacity and easy access through any number of highly refined backup and recovery systems.
Now, if the USB requirement is because that's the only way to access the files you want to steal from an employer or government agency, then the time required to transfer across the USB will almost guarantee you get caught. Even over the weekend. You should come up with a different method for extracting the data.
I have just seen "PAR" a couple of times here on slashdot, haven't used it, but it seems great for this: http://en.wikipedia.org/wiki/Parchive . You need enough redundancy to allow one USB drive to fail. And I would rather get a SATA bay and use "internal" drives than having to deal with external USB drives. Get "green" drives, they are slow but cheap.
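To show what "enough redundancy to allow one drive to fail" means, here is a toy sketch using plain XOR parity. This is not what PAR does internally (PAR2 uses Reed-Solomon coding and can survive more than one loss); it is just the simplest scheme that tolerates a single missing block. The data and names are made up.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Three "drives" holding equal-sized blocks, plus one parity block.
drives = [b"alpha---", b"bravo---", b"charlie-"]
parity = xor_blocks(drives)

# Simulate losing the second drive and rebuilding it from the survivors plus parity.
rebuilt = xor_blocks([drives[0], drives[2], parity])
print(rebuilt)
```

The cost is one extra drive's worth of space for the parity, and it only helps if exactly one drive is lost.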
3.samba
Uh? Why?
cp -a is all you need once you put the HDD inside the target machine.
And if you put it into another machine on the same network, then rsync is the answer.
Forget about the buggy and slow SAMBA.
cp copies file by file.
A more efficient way is something like
tar -cf - .|(cd /somewhere/ ; tar xf -)
tar treats the directory contents as a data stream. It's much faster for large amounts of files and data.
In the free world the media isn't government run; the government is media run.
It's "nudge-nudge", not "notch-notch".
Also, you left out "wink-wink".
Yes, I know, I should get a life..
I have a setup here where the server's video media is about 8TB in size. That backs up via rsync to the backup server, which is in another room. It contains a large number of internal and external drives. None of them are over 2TB in capacity. The main drive has data separated into subfolders and the rsync jobs back up specific folders to specific drives.
A few times I've had to do some rearranging of data on the main and backup drives when a volume filled up. So it helps to plan ahead to save time down the road. But it works well for me here.
The only thing with rsync you need to worry about is users moving large trees or renaming root folders in large trees. This tends to cause rsync to want to delete a few TB of data and then turn around and copy it all over again on the backup drive. It doesn't follow files and folders by inode, it just goes by exact location and name.
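A simplified model shows why a rename looks so expensive to a tool that compares by path. This is not rsync's actual algorithm, just a sketch of path-based comparison with made-up file names: once the top folder is renamed, nothing on the backup matches any more.

```python
def path_based_diff(source: set[str], backup: set[str]) -> tuple[list[str], list[str]]:
    """Compare two trees purely by relative path.

    Returns (paths to copy to the backup, paths to delete from the backup).
    """
    to_copy = sorted(source - backup)
    to_delete = sorted(backup - source)
    return to_copy, to_delete


# The backup still has the old layout; a user renamed "video" to "archive".
backup = {"video/2011/a.mov", "video/2011/b.mov"}
source = {"archive/2011/a.mov", "archive/2011/b.mov"}

to_copy, to_delete = path_based_diff(source, backup)
print("copy:  ", to_copy)
print("delete:", to_delete)
```

Every file ends up in both lists, even though no data actually changed, which is why making the same rename by hand on both sides avoids the recopy.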
I help mitigate this by hiding the root folders from the users. The share points are a couple levels deeper so they can't cause TOO big of a problem if someone decides to "tidy up". If they REALLY need something at a lower level moved or renamed, I do it myself, on both the source and the backup drives at the same time.
Another alternative is to get something like a Drobo where you can have a fairly inexpensive large pool of backup storage space that can match your primary storage. This prevents the problem of smaller backup volumes filling up and requiring data shuffling, but does nothing for the issue of users mucking with the lower levels of the tree.
I work for the Department of Redundancy Department.
No. It's slower. Informative, my ass.
The LHC generates a petabyte per second.
Where law ends, tyranny begins -- William Pitt
DON'T DO THIS.
We did this exact thing using WD Green drives for our 18TB backup problem. Got two Drobos, planning on using their built-in rsync for onsite/offsite copies of the data. Unfortunately, the units never broke 1MB/s transfer, and no amount of work with Drobo yielded faster performance reliably. Both of our units are now sitting unused ($2500 each!), and we put the drives into a RAID-50 8 bay USB3 enclosure. The new unit runs about 150x faster, and ended up costing $400 (prices are for enclosures only, drives were additional).
Most disappointing was Drobo's support: they just seemed to shrug a lot, and were hyper-aggressive about closing trouble tickets.
http://marsandmore.com - Posters of space, spacecraft, and astronomy.
But only 1GB/s is recorded: http://www.itnews.com.au/News/310769,computing-for-the-large-hadron-collider.aspx