Ask Slashdot: Simple Way To Back Up 24TB of Data Onto USB HDDs?
An anonymous reader writes "Hi there! I'm looking for a simple solution to back up a big data set consisting of files between 3MB and 20GB, for a total of 24TB, onto multiple hard drives (USB, FireWire, whatever). I am aware of many backup tools which split the backup onto multiple DVDs with the infamous 'insert disc N and press continue', but I haven't come across one that can do it with external hard drives (insert next USB device...). OS not relevant, but Linux (console) or MacOS (GUI) preferred... Did I miss something, or is there no such thing already done, and am I doomed to code it myself?"
May be your limiting factor here.
Tomorrow is another day...
http://www.bacula.org/en/
There's even a howto here:
http://wiki.bacula.org/doku.php?id=removable_disk
Curiosity was framed; ignorance killed the cat. -- Author unknown
I'm guessing you don't have enough space to split a backup on the original storage medium and then mirror the splits onto each drive?
Given the size requirements, it seems that might be prohibitive, but it would make things easier for you:
How to Create a Multi Part Tar File with Linux
Assuming you're not worried about backup speed, you could use a four-bay external hard-drive enclosure in combination with rsync and LVM on any Linux variety. I don't know if they all do, but the MediaSonic HF2-SU3S2 supports 3TB hard drives per bay, which means that two of them (eight bays in total) could be used in conjunction to provide 24TB of backup storage. Since you can make one large volume out of the full 24TB using LVM, you could even use something like dd to write to the disk (rsync with the archive option would be a better choice though, imho).
For that much data you want a RAID, since drives tend to fail if left sitting on the shelf, and they also tend to fail (for different reasons) if they are spinning.
Basically: buy a RAID enclosure, insert drives so it looks like one giant drive, then copy files.
For 24TB you can use eight 4TB drives for a 6+2 RAID-6 setup. Then if any two of the drives fail you can still recover the data.
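As a quick check of that arithmetic, here is a minimal Python sketch of usable capacity for a single parity RAID group. The function name `usable_tb` and its parameters are made up for illustration, and it ignores filesystem overhead and the TB/TiB difference.

```python
def usable_tb(drives, drive_tb, parity_drives):
    """Usable capacity of one parity RAID group, ignoring filesystem overhead."""
    if parity_drives >= drives:
        raise ValueError("need more drives than parity drives")
    return (drives - parity_drives) * drive_tb


# RAID-6 spends two drives' worth of space on parity: the 6+2 layout of 4TB drives.
print(usable_tb(drives=8, drive_tb=4, parity_drives=2))  # 24
```

The same eight drives as a single RAID-5 group (one parity drive) would give 28TB, at the cost of surviving only one drive failure.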
Out on bail mate?
It's not mentioned by the author, so I might be assuming too much, but if he's trying to write to USB drives as opposed to a RAID of some sort, I figured he wanted to be able to read the drives individually, perhaps on a different machine without a network connection between them.
The Drobo won't allow that; the file system is spread across all the drives.
I guess it kind of depends on what the author needs to do with the drives when he's finished writing to them.
These comments are my personal opinions and do not necessarily reflect the opinions of the other voices in my head.
You might want to look into git-annex:
http://git-annex.branchable.com/
I've not tried it, but it sounds like an ideal solution for your request, especially if your data is already compressed.
Why not tape, backup RAID, SAN or some other dedicated backup hardware solution?
24TB is well within the range where a professional solution would be required.
Given a hard disk size of ~1TB, making a single backup to 24 disks isn't a backup; it's throwing data in a garbage can.
More than likely at least one of those disks will die before its time.
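To put a rough number on that: if each drive fails independently with probability p, the chance that at least one of n drives fails is 1 - (1 - p)^n. The Python sketch below uses a made-up per-drive failure probability of 5% purely for illustration; real failure rates depend on model and age, and failures are rarely fully independent.

```python
def p_any_failure(n_drives, p_single):
    """Probability that at least one of n independent drives fails."""
    return 1.0 - (1.0 - p_single) ** n_drives


# Hypothetical 5% chance per drive, 24 drives in the backup set.
print(f"{p_any_failure(24, 0.05):.0%}")  # roughly 71%
```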
Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
Evidently, our UNIX founding fathers had similar challenges...
Have a look at tar and its "multi-volume" option.
Here's a Linuxquestions thread outlining multi-disk backup strategies.
The gist of the discussion is to use DAR.
Porn is a renewable resource, there's no need to store so much of it.
What you're attempting isn't easy; it's actually difficult.
Buy a cheap and big refurbished workstation or rackmount server, install a few extra SATA controllers and maybe a new power supply, hook up 12 2TB drives, install Debian, check out LVM and you're all set.
Messing around with 12 - 24 external HDDs and their power supplies is a big hassle and asking for trouble. Don't do it. Do seriously go through the possibility of building your own NAS. You'll be thankful in the end and it won't take much longer; it might even go faster and be cheaper if you can get the parts fast.
My 2 cents.
We suffer more in our imagination than in reality. - Seneca
First bash script to grab the size of the "current" storage;
compress the files up until that size;
Move compressed file onto storage;
request new storage, start again.
----------
Or, if you've got all the storage already connected: for i in $(seq 0 "$x"); do cp "$archive$i" "/mount/$i/"; done :D
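The fill-one-drive-then-ask-for-the-next idea above boils down to deciding which files go on which drive. Here is a minimal Python sketch of that planning step. It swaps in a first-fit-decreasing packing rather than the strictly sequential fill described above, and the function name, file names and sizes are all hypothetical.

```python
def pack_first_fit(files, capacities):
    """Assign files (name -> size) to drives, largest files first.

    Returns one list of file names per drive, in the order of `capacities`.
    """
    free = list(capacities)
    plan = [[] for _ in capacities]
    for name, size in sorted(files.items(), key=lambda kv: kv[1], reverse=True):
        for i, room in enumerate(free):
            if size <= room:
                plan[i].append(name)
                free[i] -= size
                break
        else:
            # No drive had enough room left for this file.
            raise ValueError(f"{name} ({size}) does not fit on any drive")
    return plan


# Made-up sizes in arbitrary units, two drives of 1000 units each.
files = {"a.mkv": 700, "b.iso": 400, "c.tar": 300, "d.flac": 200}
print(pack_first_fit(files, [1000, 1000]))
```

The resulting plan can then drive a plain copy per disk, so each drive stays readable on its own.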
- http://www.milkme.co.uk
If you don't want to invest in new hardware, you could use DAR or KDAR (KDE front-end for DAR).
With KDAR, what you want is the slicing settings.
There's an option to pause between slices, which gives you time to mount a new disk.
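To show what slicing with a pause amounts to, here is a rough Python sketch. It is not DAR's archive format or its API, just the general idea: cut one large stream into fixed-size numbered pieces and call a hook before each new piece. The function name, the `backup.N.part` naming and the `pause` callback are assumptions for illustration.

```python
import os


def write_slices(src_path, out_dir, slice_bytes, pause=None, chunk_bytes=1 << 20):
    """Split src_path into numbered slices of at most slice_bytes each."""
    paths = []
    with open(src_path, "rb") as src:
        chunk = src.read(min(chunk_bytes, slice_bytes))
        while chunk:
            if paths and pause is not None:
                # Hook called before every slice after the first,
                # e.g. to wait while the operator mounts the next disk.
                pause(len(paths) + 1)
            path = os.path.join(out_dir, f"backup.{len(paths) + 1}.part")
            written = 0
            with open(path, "wb") as out:
                while chunk:
                    out.write(chunk)
                    written += len(chunk)
                    if written >= slice_bytes:
                        break
                    chunk = src.read(min(chunk_bytes, slice_bytes - written))
            paths.append(path)
            chunk = src.read(min(chunk_bytes, slice_bytes))
    return paths
```

Concatenating the slices in order gives back the original file. In real use the output directory would change with each mounted disk, which this sketch leaves to the `pause` hook.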
And all our yesterdays have lighted fools The way to dusty death. --Will
3.samba
Uh? Why?
cp -a is all you need once you put the HDD inside the target machine.
And if you put it into another machine on the same network, then rsync is the answer.
Forget about the buggy and slow SAMBA.
Sent as ripples into the electromagnetic field. No single photon has been harmed in the process.
How transportable is that though?
I mean, if I copied 200 gig across 3 drives in a JBOD RAID, could I plug just one drive in to access the information on another machine? Suppose my laptop only has 2 USB ports and I do not have a hub, plus I'm running a different OS; does this mean I can't look for information on the set?
I have never used JBOD for RAID. I have, however, used regular mirrored and striped RAIDs with and without fault tolerance (RAID 5 and 10 or a mirrored stripe, for instance) and know this can be a problem. In fact, I've even seen issues reading a complete RAID set across systems when you aren't using a true hardware RAID controller.
Actually 8x4 TB disks will do it, with the overhead etc, giving you 24.96 TB usable space.
I have just seen "PAR" a couple of times here on slashdot, haven't used it, but it seems great for this: http://en.wikipedia.org/wiki/Parchive . You need enough redundancy to allow one USB drive to fail. And I would rather get a SATA bay and use "internal" drives than having to deal with external USB drives. Get "green" drives, they are slow but cheap.
A 24TB NAS is not very hard to assemble. Relatively cheap, and basically transfers data at Gb speed - assuming that you populate it with fast disks. Set one up with RAID and you're away. Personally, I would do it with a low end server and a big-ass RAID array. That way, you can really control its behaviour via the OS. Linux is ferpect for this kind of thing.
Seems like a very bad idea to me. You'll have trouble creating a JBOD device without connecting all the drives simultaneously. Also, you're basically increasing the chance that the entire JBOD volume will be broken as the number of drives goes up. If you've got one drive failing, you'll be lucky to get any data back at all.
To my mind, Bacula would be a good choice as you can set up virtual tapes that will correspond to the drives and you can set the backup to wait for the operator to swap over the drive and then continue the backup. Also, once you've got Bacula installed and working, it's easy to do incremental backups and thus not need to write out the whole dataset again.
You're a temporary arrangement of matter sliding towards oblivion in a cold, uncaring universe
The iCloud! ;-)
"Only wimps use tape[*] backup: real men just upload their important stuff on ftp, and let the rest of the world mirror it ;)"
Linus Torvalds (1996) http://en.wikiquote.org/wiki/Linus_Torvalds
(Isn't that prescience of "The Cloud"?)
–––––––––– ;-)
* replace this with your favorite backup media of today
I like my spaghetti with source.
Count Bacula as your friend ;) -> http://www.bacula.org/
--- I am known for the ones who want to find me on the net. Is that a privacy risk or a privilege? One might wonder..
I do things like this all the time with a data set about half of that, ~12TB. You didn't say anything about what the data is, but from the request and the fact you mentioned USB I would gather this is your typical warez hoarding: mp3/flac, mkv, apps, and also a personal picture and video collection of fam.
Here is a checklist I would execute, similar to mine. I find the most reliable way to keep your data over the years is by following a checklist or procedure and choosing when to move to the next storage platform.
Step 0: Get USB out of your head. Pop open the drive and attach it to the native bus, PATA or SATA. If it's SATA, you may want to invest in eSATA cases. It's not solely the speed. I have done stupid things like this, in which the data backup takes over 2 days, and on the 2nd day some unrelated event affecting my USB bus causes all kinds of problems with the transfer. Over time doing cheesy things like this affects other things, like doing stupid shit in real life, usually with duct tape or Gorilla Glue, then you have your wife on you. Right now your wife may not catch on to this, but it will escalate. Just do shit the right way.
Step 1: Organize. Actually understand what you are backing up. I never got into these tools like Google Desktop that allow a user to accept the fact that he/she has no idea where their files are. Understand and make an effort to organize your files before you back them up, and know the capacity of each 'genre' of crap you are backing up. Run a tool like 'jdiskreport' to find this information out after you organize. Create a mapping on paper of where shit is going, Zork style. If you have really important shit like family pictures, taking up say 200GB, and your mkv collection is 12TB, you may want to make 2x copies of your family shit. Anything you download off the internet is easily replaceable, despite how obscure your tastes may be, and will turn up again. I would question even backing it up, but that is another conversation.
Step 2. Label your drives according to your documentation.
Step 3. Format the drives in the native format you are most likely to use and are familiar with. If you are a noob Linux guy who runs Windows 7 all the time, don't be an idiot and experiment with your backup on ext3. It is not that ext3 is a bad filesystem, but you may not be the most skilled in restoring your data in various scenarios. For example, I'm a Linux and Solaris geek but am just getting into Macs --- I'm not comfortable enough with Mac failures to store my crap on a Mac fs. Whatever your skillset is, don't use the most optimal file system on paper; use what you know, even if it is NTFS (which imo is very reliable).
Step 4. Copy your shit over using your knowledge of your data organization and native OS commands or tools.
Step 5. Run a checksum on your important stuff and store the hashes to verify everything is fine over time. Odd situations occur when backing up data. I have run into cases where I didn't realize the files I was about to back up were bad/corrupt until I saw the good copy on a backup drive I was about to incrementally overwrite.
Step 6. Store the shit somewhere else if you can reasonably do this and feel confident in the security of your data. If you have to start encrypting your crap, you add some more complexity that can affect the reliability of your restoration, but again if you proceduralize and keep up on it you will be fine.
Backup design and integrity is hard work and serious business when dealing with large volumes. It reminds me of the Seinfeld episode where he goes to the car rental place and they don't have his car and he goes into his "Anyone can take the ticket" diatribe. Anyone can back up their data. But can you get it back? I am not an expert in this area and don't pretend to be; I am just a seasoned IT administrator who has performed a lot of backups in my day and has managed to keep most of my data safe over the years.
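The checksum step in the checklist above could look something like the following Python sketch. It assumes SHA-256 and an in-memory dictionary as the manifest; the function names are hypothetical, and a real setup would save the manifest to a file kept alongside (and apart from) the backup.

```python
import hashlib
import os


def sha256_of(path, chunk_bytes=1 << 20):
    """Hash a file in chunks so multi-gigabyte files never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_bytes), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def changed_files(old, new):
    """Paths from the old manifest that are missing or differ in the new one."""
    return sorted(p for p in old if new.get(p) != old[p])
```

Building a manifest of the source before copying and of the backup drive afterwards, then comparing the two, catches silent corruption before an older good copy gets overwritten.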
# rsync -avz /this /that. Split your directories to match the sizes of your drives. If on Linux, run smartctl -H /dev/sdX to check your disk health and, if possible, take the HDDs out of their USB enclosures and connect them directly to SATA for faster transfer speeds. Nine times out of ten these drives will mount just like a normal drive, since usually they are just a normal drive housed in an enclosure.
:)
Good luck
Damn, that's a lotta pr0n!
USB 2.0 provides 480Mbps of (theoretical) bandwidth. So unless you go Gigabit all over your network (not unreasonable), you won't beat it with a NAS. Even then, it's only 1-and-a-bit times as fast as USB working flat-out (and the difference being if you have multiple USB busses, you can get multiple drives working at once).
The 480Mbps is nowhere near what you will see in practise, unlike network speeds which are far closer to the rated maximum. Most USB drives I've seen top out at somewhere between 25 and 30MByte/sec, and if there are no other bottlenecks it isn't unusual to see 100Mbyte/sec from a gbit switched network. My main desktop pulls things from the fileserver at around 80Mbyte/sec, which is as fast as local reads tend to be on that array. So you are right about 100mbit networks: that'll be the bottleneck not USB, but gbit networking should outdo USB2 by at least a factor of 2, possibly 3, maybe even more if you have better drives in your main storage array than I do.
Before trying to run several USB drives to max out your network bandwidth, consider that you will be taxing the source disks too. Unless they are SSDs, having 2, 3, or more concurrent bulk reads going on may not be any faster than a single read, as all the extra head movements will wipe out the bulk speed potential. If the OP's 24TB is spread over numerous physical drives this need not be an issue though (with planning careful enough to ensure there aren't two bulk processes reading from the same physical devices).
And USB 3.0 would beat it again.
That it would. I have an SSD in a USB3 enclosure, and it can happily consume 80Mbyte/sec read over my little network. It might even be able to do better than that: I've not measured a bulk write read from the internal SSD yet.
And 10Gb between the client and a server is an expensive network to deploy still.
Granted, eSATA would probably be faster but there's nothing wrong with USB for such tasks if you *don't* want to provide Gigabit connections everywhere and (presumably) greater-than-gigabit backbones.
If I wanted more speed than USB3+gbit can provide (due to the size of data being backed up on each run) I'd be plugging the backup device(s) in locally to the source (via eSATA, USB3, or such) rather than using the network (though again taking note to be careful how things are done if trying to use more than one backup device at once).
For the size of data being described, I'd not want a set of USB drives to be my primary backup solution though.
Is it that much faster for 3mb to 20 gb files?
Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
No. It's slower. Informative, my ass.
Whether tape or disk is appropriate really depends what you are intending to use the backup for and how important your data is. You might even choose to use a mixture of the two.
If it's your only backup, I would suggest that it's not wise to leave it permanently online in the way you suggest; that leaves you open to any number of potential issues which your backup is supposed to protect you from (OS bug, misconfiguration, lightning strike, power failure, overheating, ...). Tape libraries have the same issue although at least there you are exposed to a different set of software bugs and the other tapes in the library might be OK if they are not physically in use when the worst happens.
For the inadvertent file deletion, you can cover this with better tools using true online storage - effectively some form of regular snapshotting (ZFS snapshots, rdiff-backup, Windows VSS, etc) to keep a (shortish) recent history. This should cover a good proportion of restore requests depending on how much history you can keep. For the rest, you're right that if you need to restore files very regularly then you might need a second drive and/or robot. Whether you need to do that or not will just depend on your use case.
Even if you do go with disk, make sure you use something which can properly keep multiple versions of files - just rsync'ing a big directory onto another disk is a recipe for disaster. My personal favourites are rdiff-backup and DAR (which can handle multiple volumes as others have pointed out) but there are others out there too, eg bacula.
cp doesn't preserve exact timestamps. If you want to run rsync later, it will copy all the files all over again. Just do
rsync --dry-run --archive --stats --progress --whole-file --exclude "/lost+found" --delete-after /source/ /destination
which is reproducible. Note that --dry-run only previews the transfer; once the preview looks right, drop it, and later runs will copy only the newer files.
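Why timestamps matter here: by default rsync decides whether to skip a file by comparing its size and modification time on both sides. The Python sketch below mimics only that quick check, under the assumption of whole-second timestamp comparison; the function name is made up, and the real tool handles many more cases (for example its --checksum and --modify-window options).

```python
import os


def needs_copy(src, dst):
    """Rough imitation of a size-and-mtime quick check."""
    if not os.path.exists(dst):
        return True
    s, d = os.stat(src), os.stat(dst)
    # Compare whole seconds only; sub-second precision varies by filesystem.
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)
```

If the first copy did not carry the modification times over, every file looks changed under this check and gets transferred again.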
A cloud backup service released information on how they build their own disk based backup servers. Maybe something that would help with your endeavor? http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets/
I work for a data backup company as a dev monkey/admin/jack-of-all-trades.
Do you ever want to restore these backups? If the answer is "yes" (and it should be, otherwise why are you backing up in the first place...?), then you need to be guarded against failure of an individual disk. That means you need some sort of RAID solution.
For reference, Datto's 3U nodes store 20TB across 14 2TB drives, and the next larger size of node we have is somewhere around 55TB in 4U. No, I'm not trying to sell you our hardware (we only sell to resellers anyway) but hear me out. You really are going to save yourself some headache if you build a NAS device.
USB 2.0 is SLOW AS BALLS. I see our USB seed drives (HDDs we mail out to customers to get their initial datasets up into the ether) max out at 20-30MB/sec on a good day. By comparison, Gigabit Ethernet will give you 112MB/sec after NFS/TCP/Ethernet overhead -- much better. For this reason, and because it's just so impractical to handle large collections of failure-prone USB drives, our largest round trip drive that is shipped as USB is 4TB. After that, we actually ship our customers NAS devices (usually a returned/development box with a different OS image on it).
Go with NAS. You need the resilience against disk failure, you need the additional speed, and while yes, it's a greater investment, the alternative is utter agony when one of your 12 2TB disks takes a dump.
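For a sense of scale, here is a back-of-the-envelope Python calculation using the throughput figures quoted above (30MB/sec for USB 2.0 on a good day, 112MB/sec for Gigabit Ethernet). It assumes decimal units and a sustained rate with no pauses for swapping drives, so real runs will take longer; the function name is made up.

```python
def transfer_hours(total_tb, mb_per_sec):
    """Hours needed to move total_tb at a sustained rate of mb_per_sec."""
    total_mb = total_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return total_mb / mb_per_sec / 3600


for label, rate in [("USB 2.0 at 30 MB/s", 30), ("Gigabit Ethernet at 112 MB/s", 112)]:
    hours = transfer_hours(24, rate)
    print(f"{label}: {hours:.0f} hours ({hours / 24:.1f} days)")
```

For 24TB that works out to roughly nine days over USB 2.0 versus about two and a half days over Gigabit Ethernet.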
DON'T DO THIS.
We did this exact thing using WD Green drives for our 18TB backup problem. Got two of 'em, planning on using their built-in rsync for onsite/offsite copies of the data. Unfortunately, the units never broke 1MB/s transfer, and no amount of work with Drobo yielded reliably faster performance. Both of our units are now sitting unused ($2500 each!), and we put the drives into a RAID-50 8-bay USB3 enclosure. The new unit runs about 150x faster, and ended up costing $400 (prices are for enclosures only, drives were additional).
Most disappointing was Drobo's support: they just seemed to shrug a lot, and were hyper-aggressive about closing trouble tickets.
http://marsandmore.com - Posters of space, spacecraft, and astronomy.
You buy one of these:
http://www.newegg.com/Product/Product.aspx?Item=N82E16816322007
populate it with 4TB drives and create two RAID5 arrays (or one RAID6 array), then you've got 24 or 28 TB of backup space, without having to change drives or break up your backup into smaller chunks.
But really, your backup methodology is broken; you need to organize the data into manageable chunks because aside from a large dedicated backup server/SAN, there is no reliable (don't tell me tape is reliable) backup solution for such a large quantity of data in a single chunk.
What I do for backups: in my 24-bay server I have eight large drives in a (HARDWARE) RAID5 array (were 4TB drives available at the time I'd have gone RAID6) and rsync the virtualized server contents to that, then archive them into tarballs, and send copies of them across the LAN to another server that is running (HARDWARE) RAID5 as well. Every once in a while I back up the critical data (source, scripts, financial data, production web sites, /etc, and so forth but not the program binaries nor system binaries which are easily recreated or reinstalled, respectively) to optical media and external hard drives.
So what I have in summary is:
* Massive server with a backup array separate from the production array
* Separate backup server running another array (again, using a quality HARDWARE RAID controller. Safeguard your data and don't bother with Intel, Adaptec, Promise, or Highpoint "hybrid" RAID)
* Periodic backups of non-recreatable data to USB drives and optical media that are moved off site.
The Christian Right is Neither (Christian nor right). See: Matthew 23, Matthew 25, Ezekiel 16:48-50
Yes. The above tar command is really from a time when cp did not have r and p options (and still likely doesn't on some systems so it's worth knowing). OTOH, you can add in the z option (compress) if you're doing something networky (though you'll probably want to throw in netcat or ssh too in that case). Of course, if you're doing that, rsync is probably the better option if available and leads to some interesting backup options going forward.
The 200GB range drives in my main server have been trundling along for many years while I have a pile of 0.5-2TB hard drives I need to go through and get warrantied (three of them Caviar blacks). Not impressed with the big drives.