Ask Slashdot: Simple Way To Backup 24TB of Data Onto USB HDDs ?
An anonymous reader writes "Hi there! I'm looking for a simple solution to back up a big data set consisting of files between 3MB and 20GB, for a total of 24TB, onto multiple hard drives (USB, FireWire, whatever). I am aware of many backup tools which split the backup onto multiple DVDs with the infamous 'insert disc N and press continue', but I haven't come across one that can do it with external hard drives (insert next USB device...). OS not relevant, but Linux (console) or MacOS (GUI) preferred... Did I miss something or is there no such thing already done, and am I doomed to code it myself?"
May be your limiting factor here.
Tomorrow is another day...
1. Take all hard drives out of USB enclosures
2. Install in PC with multiple SATA cards
3. Samba
I believe you can daisy chain external drives together if you have the right cases.
For ease though, I'd consider a Drobo http://www.drobo.com/products/professionals/drobo-5d/index.php
http://en.wikipedia.org/wiki/Spanned_volume
http://macs.about.com/od/usingyourmac/ss/raidjbod.htm
JBOD allows you to create a large virtual disk drive by concatenating two or more smaller drives together. The individual hard drives that make up a JBOD RAID can be of different sizes and manufacturers. The total size of the JBOD RAID is the combined total of all the individual drives in the set.
http://www.bacula.org/en/
There's even a howto here:
http://wiki.bacula.org/doku.php?id=removable_disk
Curiosity was framed; ignorance killed the cat. -- Author unknown
Use 'dd' in Linux
Are you REALLY sure that you want to use USB HDDs? The cost savings of using a box of HDDs may well be offset by the hassle in finding the backup software, the manual labor of swapping them, finding the correct drive to retrieve a certain file, etc.
How about a pair of Synology DS1512+ NASes? In addition to getting all of the storage online at all times, you get RAID support, etc.
I'm guessing you don't have enough space to split a backup on the original storage medium and then mirror the splits onto each drive?
Given the size requirements, it seems that might be prohibitive, but it would make things easier for you:
How to Create a Multi Part Tar File with Linux
Assuming you're not worried about backup speed, you could use a four-bay external hard-drive enclosure in combination with rsync and LVM on any Linux variety. I don't know if they all do, but the MediaSonic HF2-SU3S2 supports 3TB hard drives per bay, which means that two of them could be used in conjunction to provide 24TB of backup storage. Since you can make a large volume out of the full 24TB using LVM, you could even use something like dd to write to the disk (rsync with the archive option would be a better choice though, imho).
For that much data you want a RAID, since drives tend to fail if left sitting on the shelf, and they also tend to fail (for different reasons) if they are spinning.
Basically: buy a RAID enclosure, insert drives so it looks like one giant drive, then copy files.
For 24TB you can use eight 4TB drives for a 6+2 RAID-6 setup. Then if any two of the drives fail you can still recover the data.
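As a quick check of that arithmetic, here is a minimal sketch of usable capacity when a fixed number of equal-size drives' worth of space goes to parity, as in RAID-6. The function name is made up for illustration, and it ignores filesystem overhead and the TB/TiB difference.

```python
def usable_capacity_tb(num_drives: int, drive_tb: float, parity_drives: int = 2) -> float:
    """Usable space of an array of equal-size drives.

    `parity_drives` is how many drives' worth of space holds parity
    (2 for RAID-6, 1 for RAID-5). Overheads are ignored.
    """
    if num_drives <= parity_drives:
        raise ValueError("need more drives than parity drives")
    return (num_drives - parity_drives) * drive_tb


if __name__ == "__main__":
    # Eight 4TB drives in a 6+2 RAID-6 layout
    print(usable_capacity_tb(8, 4))  # 24
```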
Out on bail mate?
You might want to look into git-annex:
http://git-annex.branchable.com/
I've not tried it, but it sounds like an ideal solution for your request, especially if your data is already compressed.
http://www.synology.com/products/product.php?product_name=DS2411%2B&lang=uk Still portable enough to do your backup then take offsite.
If you have 24TB of data to back up, it would be easier to just build another 24TB storage array. The amount of time you would spend swapping disks and then validating that disks don't go bad would sap any "savings" of not building a big array to begin with.
So, I would buy up some of the cheap dual-core, dual-processor Xeon systems that eBay is currently flooded with, buy as many RAID 5 controllers and SATA disks as it takes to get to 24TB with RAID 5, and then you can actually do a meaningful backup that doesn't have a labor cost factored into each iteration.
I'm assuming the original 24TB exists in RAID 5 already, so if you have access to the existing hardware infrastructure, just build a RAID 5 mirror. If you're doing a web mirror, RAID 5 should be good enough, and if you lose more than one disk, then worry about restoring from the other mirror members.
Why not tape, backup RAID, SAN or some other dedicated backup hardware solution?
24TB is well within the range where a professional solution is required.
Given a hard disk size of ~1TB, making a single backup to 24 disks isn't a backup; it's throwing data in a garbage can.
More than likely at least one of those disks will die before its time.
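To put a rough number on that worry, here is a small sketch of the chance that at least one of N disks fails, assuming failures are independent and every disk has the same failure probability over the storage period. The 5% figure in the example is purely a placeholder assumption, not a measured rate.

```python
def p_any_failure(num_disks: int, p_single: float) -> float:
    """Probability that at least one of `num_disks` fails, assuming
    independent failures with identical per-disk probability `p_single`."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    return 1.0 - (1.0 - p_single) ** num_disks


if __name__ == "__main__":
    # Hypothetical 5% per-disk failure chance over the period, 24 disks
    print(round(p_any_failure(24, 0.05), 2))  # 0.71
```

Under those assumptions the "more than likely" threshold (above 50%) is crossed once the per-disk failure chance is a bit under 3%.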
Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
The Btrfs filesystem allows you to merge multiple physical disks to a single filesystem.
(AFAIK it's not stable yet, but it just had to be mentioned!)
Evidently, our UNIX founding fathers had similar challenges...
Have a look at tar and its "multi-volume" option.
Here's a Linuxquestions thread outlining multi-disk backup strategies.
The gist of the discussion is to use DAR.
I'm not sure if you posed the question out of being naive, or if it's just being daft. You don't want to be moving 24TB over the USB bus. End of discussion really - at least in terms of USB.
However you ended up looking at USB for this, it was the wrong way to go.
You have lots of choice in terms of boxes, servers, NAS boxes, locally attached storage. 24TB is in the range of midrange NAS boxes.
Once you have this, you can start to make choices on the many backup, replication, and duplication bits of software that already exist, both free and proprietary.
We`re all equal
Porn is a renewable resource, there's no need to store so much of it.
Script your own solution for your specific problems.
That’s kinda the whole point of having a computer... as opposed to a set of appliances that happen to run on a computer you never use directly.
What you're attempting isn't easy, it's actually difficult.
Buy a cheap and big refurbished workstation or rackmount server, install a few extra SATA controllers and maybe a new power supply, hook up 12 2TB drives, install Debian, check out LVM and you're all set.
Messing around with 12 - 24 external HDDs and their power supplies is a big hassle and asking for trouble. Don't do it. Do seriously consider the possibility of building your own NAS. You'll be thankful in the end and it won't take much longer; it might even go faster and be cheaper if you can get the parts fast.
My 2 cents.
We suffer more in our imagination than in reality. - Seneca
First bash script to grab the size of the "current" storage;
compress the files up until that size;
Move compressed file onto storage;
request new storage, start again.
----------
Or, if you've got all the storage already connected; for i in $(seq 0 "$x"); do cp "$archive$i" "/mount/$i/"; done :D
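The first approach above (fill the current drive, then ask for the next one) could be planned ahead of time with something like the sketch below. It only decides which file goes on which drive; it skips the compression step, never splits a file across drives, and never goes back to an earlier drive. The function name and the (name, size) input format are assumptions for illustration.

```python
from typing import Iterable, List, Tuple


def plan_volumes(files: Iterable[Tuple[str, int]], capacities: List[int]) -> List[List[str]]:
    """Assign (name, size) pairs to drives in order.

    Files are placed on the current drive until one no longer fits, then the
    plan moves on to the next drive. Sizes and capacities share one unit.
    """
    plan: List[List[str]] = [[] for _ in capacities]
    free = list(capacities)
    drive = 0
    for name, size in files:
        # Skip forward until we find a drive with room for this file
        while drive < len(free) and size > free[drive]:
            drive += 1
        if drive == len(free):
            raise RuntimeError(f"ran out of drives at {name!r}")
        plan[drive].append(name)
        free[drive] -= size
    return plan


if __name__ == "__main__":
    demo = [("a", 3), ("b", 3), ("c", 5), ("d", 1)]
    print(plan_volumes(demo, [6, 6]))  # [['a', 'b'], ['c', 'd']]
```

Sorting the files largest-first before planning usually wastes less space, at the cost of scattering related files across drives.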
- http://www.milkme.co.uk
... by employing a detector with a size of 2463 x 2527 pixels (6M) at 12 Hz (12 times / sec). When run continuously for a set of data (roughly 900 degrees) ...
we collect 900 frames in roughly 2 minutes including hardware limitations for starting/stopping.
In proper format for processing, this works out to about 6MB/image and roughly 3GB/min for 2 minutes.
With an experienced crew of 3-4 people ... one handling the samples, one handling the liquid nitrogen, one running the software and one taking notes (overall monitoring also) ... we can run through 600 samples in a 24-hour shift ...
Which roughly works out to about 600 x 6GB = 3.6 TB on a "working" day.
To answer your question ... we never make physical copies of stuff ... the data stays online in multiple places on multiple continents ... and when something is published the data becomes publicly available in a central database
Why do you need a physical copy anyway?
USB is for a second working copy.
Backups should also ensure durability of the copy, while USB HDDs have a shorter lifespan than normal HDDs, which in turn have a shorter lifespan than tapes, the usual medium for durable backups.
Sent as ripples into the electromagnetic field. No single photon has been harmed in the process.
If you don't want to invest in new hardware, you could use DAR or KDAR (KDE front-end for DAR).
With KDAR, what you want is the slicing settings.
There's an option to pause between slices, which gives you time to mount a new disk.
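For anyone who ends up coding it themselves, the slice-and-pause idea can be sketched in a few lines. This is not DAR's archive format, just a plain byte-level split of one big file into fixed-size slices, with a callback hook between slices where you would swap disks. All names here are made up for illustration.

```python
import os
from typing import Callable, List, Optional


def write_slices(src_path: str, dest_dir: str, slice_size: int,
                 between_slices: Optional[Callable[[int], None]] = None,
                 chunk_size: int = 1 << 20) -> List[str]:
    """Split src_path into numbered slices of at most slice_size bytes.

    between_slices(n) is called just before slice n (n >= 2) is written,
    e.g. to wait while a new disk is mounted at dest_dir.
    """
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    paths: List[str] = []
    base = os.path.basename(src_path)
    with open(src_path, "rb") as src:
        index = 1
        while True:
            first = src.read(min(chunk_size, slice_size))
            if not first:
                break  # nothing left to write
            if index > 1 and between_slices is not None:
                between_slices(index)
            path = os.path.join(dest_dir, f"{base}.{index:04d}")
            with open(path, "wb") as out:
                out.write(first)
                written = len(first)
                while written < slice_size:
                    chunk = src.read(min(chunk_size, slice_size - written))
                    if not chunk:
                        break
                    out.write(chunk)
                    written += len(chunk)
            paths.append(path)
            index += 1
    return paths

# Example hook: between_slices=lambda n: input(f"Mount the disk for slice {n}, then press Enter")
```

Restoring is just concatenating the slices in order, which also means every slice has to be readable.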
And all our yesterdays have lighted fools The way to dusty death. --Will
My experience is that eSATA II (3G) is about 4X faster than USB2. The benchmarks I have seen show that it is still faster than USB3. Today you can probably get eSATA III (6G)
Backup tapes were designed precisely for the problem you have. LTO-5 tapes are about 1.5TB, if I remember right. Stored correctly they shouldn't give any problems when you come to retrieve whatever is backed up. Most archiving efforts use backup tape, and they can't all be wrong :)
" I am aware of many backup tools which split the backup onto multiple DVDs with the infamous 'insert disc N and press continue', but I haven't come across one that can do it with external hard drives (insert next USB device...)" - the split archive functions in the Linux zip program might be able to do this. But, I've never used this feature in Linux but remember using it on good old pkzip on dos when trying to span files across multiple floppy disks.
I have just seen "PAR" a couple of times here on slashdot, haven't used it, but it seems great for this: http://en.wikipedia.org/wiki/Parchive . You need enough redundancy to allow one USB drive to fail. And I would rather get a SATA bay and use "internal" drives than having to deal with external USB drives. Get "green" drives, they are slow but cheap.
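To illustrate the "survive one failed drive" idea: PAR files use Reed-Solomon codes, which are more flexible than what is shown here. The sketch below is the much simpler single XOR parity (the RAID-4/5 trick), which can rebuild exactly one lost block. Function names are made up, and blocks are treated as zero-padded to the longest length.

```python
from typing import List, Optional


def xor_parity(blocks: List[bytes]) -> bytes:
    """XOR all blocks together; shorter blocks count as zero-padded."""
    size = max(len(b) for b in blocks)
    parity = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def recover_block(blocks: List[Optional[bytes]], parity: bytes, length: int) -> bytes:
    """Rebuild the single missing block (marked None).

    `length` is the original size of the lost block, which has to be
    recorded somewhere because padding is not stored.
    """
    missing = [i for i, b in enumerate(blocks) if b is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can rebuild exactly one missing block")
    survivors = [b for b in blocks if b is not None]
    # XOR of the survivors and the parity cancels everything but the lost block
    rebuilt = xor_parity(survivors + [parity])
    return rebuilt[:length]
```

In the USB-drive setting, each "block" would be one drive's contents and the parity would live on an extra drive, so the extra drive has to be at least as big as the largest data drive.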
Some sort of NAS or tape would be your best option without knowing more. How often do you need to do the "backup"? Is it really a "backup" or data replication, i.e. will you need to restore the data after a serious failure? Have a look at this; it seems to have some good advice and I think it could be a solution to your issue, as I see the big problem being the amount of time and the restorability of the data after a failure. http://www.smallnetbuilder.com/nas/nas-howto/31485-build-your-own-fibre-channel-san-for-less-than-1000-part-1
If this is work-related, and the 24 TB of data is critical to your company, DON'T FUCK AROUND WITH TOYS.
Get a real backup solution - before they get a real sysadmin.
A 24TB NAS is not very hard to assemble. Relatively cheap, and basically transfers data at Gb speed - assuming that you populate it with fast disks. Set one up with RAID and you're away. Personally, I would do it with a low end server and a big-ass RAID array. That way, you can really control its behaviour via the OS. Linux is ferpect for this kind of thing.
What I want to know is this:
Who would have managed to get 24TB of data, without already having a backup solution in place?
24TB is a lot of data. It isn't something you get overnight. It should have been apparent a *long* time ago that some kind of backup was going to be needed.
If this is business data, then someone has been negligent.
Your best bet for speed is likely to be eSATA.
Have you looked into something like this:
http://eshop.macsales.com/shop/NewerTech/Voyager/Hard_Drive_Dock
The cost becomes noise when you consider how many drives you will end up needing, and per TB, will be cheaper than USB solutions.
I don't know how your data is organized, but if possible, you may want to back it up by project/directory/etc.
There are also online backup systems that can do what you want, but it'll take an extremely long time...
The iCloud! ;-)
Get an old computer... anything will work really. You have to know someone that has one lying in their basement. Plug your drives into that. Share the drives on your network. Use any general backup software and sequentially back up what you need to back up over the network. Now it will do it overnight and you really don't care how long it takes. It can even do it every night. If you want it safe from fire and such.... build a box out of 2x4s and drywall scraps from Home Depot. Make it 5 sheets thick and it'll withstand any housefire you could possibly have. If you really want to go hardcore you can pour a box out of concrete, but that'll be hard to move.
"Only wimps use tape[*] backup: real men just upload their important stuff on ftp, and let the rest of the world mirror it ;)"
Linus Torvalds (1996) http://en.wikiquote.org/wiki/Linus_Torvalds
(Isn't that prescience of "The Cloud"?)
–––––––––– ;-)
* replace this with your favorite backup media of today
I like my spaghetti with source.
Connect 8 x 3 TB USB drives (more if you want RAIDZ), add all the disks to a pool and copy your data. If you need more space later, just add more disks to the pool. This will obviously be slow, but if what you need is a navigable copy of a lot of data, once you've made the copy, it won't matter.
This is what ZFS was designed for. I use Solaris & OpenIndiana, but there's a MacOS port, MacZFS.
1 You need to maximise your computer's RAM
2 FireWire is definitely preferable to USB; it's much faster, and you can get discs which offer both.
If you are actually going to be copying from another USB device and not an internal hard disc, that's even more necessary
3 Use high capacity, high speed discs, minimum a terabyte
4 I use the OS facilities to copy from my internal drives to the BU drives and just copy the whole drive. If there is space left for another drive I'll put more on. I never split a drive across two or more BU drives. Thus it's easy to keep track of what's where. Wastes space, but GB are cheap.
5 If you are going to overwrite data on a BU drive then format it first; it will then work faster and there is less likelihood of errors.
6 If the data is really important to you, then you have an additional bind. For that sort of data, you also need to have a copy in another location in case the place where the computer is gets wrecked for some reason, which would take out both the original and the backups. If it's at work then your home should suffice to store the BU.
JS
It's a little late to be asking that now.
Online RaidZ ZFS with dual parity + ongoing offsite tape backups is the only way I would conduct this backup.
Let's say both the primary file and a 1TB backup disk fail. Is the damage felt by OP equal to 1/24 of his happiness or less? Then multiple drives are very justified. Is there a chance this 1TB drive contains a database that makes half the data completely useless when it crashes? Then multiple backups/redundancy is required. This is a vital piece of information to make a recommendation.
Count Bacula as your friend ;) -> http://www.bacula.org/
--- I am known for the ones who want to find me on the net. Is that a privacy risk or a privilege? One might wonder..
Sometimes the easiest way to duplicate (back up) data is to simply duplicate the hardware it's already on. If it's on a 16-disk (x 2TB) NAS system, build another one. If it's on tape, buy more tapes. If it's on random HDDs scattered all over the place, then you have bigger problems to deal with first (like building a NAS box)!
I do things like this all the time with a data set about half of that, ~12TB. You didn't say anything about what the data is, but from the request and the fact you mentioned USB I would gather this is your typical warez hoarding: mp3/flac, mkv, apps and also a personal picture and video collection of fam.
Here is a checklist I would execute, similar to mine. I find the most reliable way to keep your data over the years is by following a checklist or procedure and choosing when to move to the next storage platform.
Step 0: Get USB out of your head. Pop open the drive and attach it to the native bus, PATA or SATA. If SATA, you may want to invest in eSATA cases. It's not solely the speed. I have done stupid things like this, in which the data backup takes over 2 days, and on the 2nd day some unrelated event affecting my USB bus causes all kinds of problems with the transfer. Over time doing cheesy things like this affects other things, like doing stupid shit in real life, usually with duct tape or Gorilla Glue, then you have your wife on you. Right now your wife may not catch on to this, but it will escalate. Just do shit the right way.
Step 1: Organize. Actually understand what you are backing up. I never got into these tools like Google Desktop that allow a user to accept the fact that he/she has no idea where their files are. Understand and make an effort to organize your files before you back them up, and know the capacity of each 'genre' of crap you are backing up. Run a tool like 'jdiskreport' to find this information out after you organize. Create a mapping on paper of where shit is going, Zork style. If you have really important shit like family pictures, taking up say 200GB, and your mkv collection is 12TB, you may want to make 2x copies of your family shit. Anything you download off the internet is easily replaceable, despite how obscure your tastes may be, and will turn up again. I would question even backing it up but that is another conversation.
Step 2. Label your drives according to your documentation.
Step 3. Format the drives in the most likely native format you will use and are familiar with. If you are a noob Linux guy who runs Windows 7 all the time, don't be an idiot and experiment with your backup on ext3. It is not that ext3 is a bad filesystem, but you may not be the most skilled in restoring your data in various scenarios. For example, I'm a Linux and Solaris geek but am just getting into Macs; I'm not comfortable enough with Mac failures to store my crap on a Mac fs. Whatever your skillset is, don't use the most optimal file system on paper, use what you know, even if it is NTFS (which imo is very reliable).
Step 4. Copy your shit over using your knowledge of your data organization and native OS commands or tools.
Step 5. Run a checksum on your important stuff and store the hashes to verify everything is fine over time. Odd situations occur when backing up data. I have run into cases where I didn't realize the files I was about to back up were bad/corrupt until I saw the good copy on a backup drive I was about to incrementally overwrite.
Step 6. Store the shit somewhere else if you can reasonably do this and feel confident in the security of your data. If you have to start encrypting your crap, you add some more complexity that can affect the reliability of your restoration, but again if you proceduralize and keep up on it you will be fine.
Backup design and integrity is hard work and serious business when dealing with large volumes. It reminds me of the Seinfeld episode where he goes to the car rental place and they don't have his car and he goes into his "Anyone can take the ticket" diatribe. Anyone can back up their data. But can you get it back? I am not an expert in this area and don't pretend to be; I am just a seasoned IT administrator who has performed a lot of backups in my day and have managed to keep most of my data safe over the years.
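For Step 5, something along these lines would do. It is only a sketch: SHA-256 is my choice here (any decent hash works), the function names are made up, and how you store the manifest between runs is left out.

```python
import hashlib
import os
from typing import Dict, List


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so huge files never sit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: str) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    manifest: Dict[str, str] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def changed_files(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Paths from the old manifest that are now missing or hash differently."""
    return sorted(p for p, h in old.items() if new.get(p) != h)
```

Build one manifest from the source right after copying, another from the backup drive, and compare; then re-run against the backup every so often to catch silent corruption before you overwrite a good copy with a bad one.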
When moving really large amounts of data it is not unlikely to see an incidental bit error, especially when new hardware is involved. Data on disk is generally safe because of ECC. But pumping that much data through RAM, associated controllers and all the non-ECC protected buses on a mainboard will increase the chance of experiencing bitrot because of tolerance or thermal issues. At some point it is just a matter of statistics.
I really have to ask why USB? You're looking at a top speed of 40MB/s on USB 2.0; more commonly you get 20 to 30MB/s.
Either a cable designed to hot swap drives, or a drive bay would work a lot better if a NAS is out of the question. The cable solution involves a SATA cable which has both the data and power lines bundled together on the HDD side. Reduces the risk of damaging the HDD when pulling the plug or plugging another in. A drive bay would cost a bit more, but is significantly less risky and much easier to use.
Even putting the destination drives in another machine and connecting the machines with a crossover cable will be much faster than USB 2.0 speeds. I really wouldn't suggest going through a router unless you take care to keep the router cool (a desk fan should be sufficient).
That leads me to my last point. When transferring that much data, the machine(s) are going to get much hotter than they do even in intense computational work. You're going to want to pop the side of the case off and set up a good fan to pull heat away from it. The easiest way to kill a HDD is to let it run hot for long periods of time.
# rsync -avz /this /that. Split your directories corresponding to the sizes of your drives. If on Linux, run smartctl -H /dev/sdX to check your disk health and, if possible, take the HDDs out of their USB enclosures and connect them directly to SATA for faster transfer speeds. Nine times out of ten these drives will mount just like a normal drive, since usually they are just a normal drive housed in an enclosure.
:)
Good luck
Why not use Crashplan http://www.crashplan.com/consumer/compare.html
50$/Year for one computer and unlimited data or 120$ for 2-10 computers and unlimited data.
Cloud is the way to go!
Someone doesn't know how to count... The answer to this question will be easier if we know whether we are helping you with a solution to back up ~20 gigabytes of stuff or 24 terabytes....
Plug all the disks into a USB hub. Ensure that each one has a unique volume name eg bak1, bak2... The old skool way is to make a little tar script and use volume spanning. Otherwise, configure all the disks as a single JBOD and run DejaDup.
Excuse me, but please get off my Pennisetum Clandestinum, eh!
It depends on whether you are looking for reliability or a separate copy.
For reliability, you should be using a large RAID array (RAID-6?).
For a separate backup, cobbling together a collection of external hard drives sounds painful. Recovery from them would be even more painful. You want a tape drive.
Or you may consider creating archive copies of each of your files, possibly writing them to Blu-ray discs. Kinda slow; depends on how often you need to back up the files.
24 TB of backups is not done with hdd. They invented tapes for that kind of work.
24TB = 25165824MB. Even with a 20MB/sec write speed (I assume 24TB of SSD will be too expensive) it will take about 14 days.
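The same arithmetic as a tiny sketch, so you can plug in other speeds. The rates in the example are the ones quoted in this thread (20-30MB/s for USB 2.0, 112MB/s for Gigabit Ethernet); sustained real-world throughput is an assumption either way, and binary units are used to match the 25165824MB figure.

```python
TIB = 1024 ** 4  # bytes in a binary terabyte
MIB = 1024 ** 2  # bytes in a binary megabyte


def transfer_days(total_bytes: float, bytes_per_second: float) -> float:
    """Days needed to move total_bytes at a constant sustained rate."""
    if bytes_per_second <= 0:
        raise ValueError("rate must be positive")
    return total_bytes / bytes_per_second / 86400  # 86400 seconds per day


if __name__ == "__main__":
    for label, mb_per_s in [("USB 2.0, slow", 20), ("USB 2.0, good day", 30),
                            ("Gigabit Ethernet", 112)]:
        days = transfer_days(24 * TIB, mb_per_s * MIB)
        print(f"{label}: {days:.1f} days")
```

That covers a single pass only; verifying the copy afterwards means reading everything again.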
New software coming out in alpha this month will solve your problem. Check out Infinit (infinit.io). It's a distributed network that integrates into your file management system like Dropbox. It encrypts all of your data and stores it in chunks on a p2p network that you can create with other users. If you're going to buy hardware to connect to your own network, then it makes everything you're storing accessible on demand from any device. It's secure and safe and it also will give you instant access to all of your files via streaming. There's a Wikipedia article here: http://en.wikipedia.org/wiki/Infinit
a rich man by NOT patenting stuff (i.e. using the GPL2 for Linux). So why shouldn't he do the same with other stuff? Also, I guess there is loads of "prior art" regarding this cloudy PR talk of today.
Private Manning, is that you?
You are welcome on my lawn.
Damn, that's a lotta pr0n!
When faced with similar situations, cheaper can be better. Put multiple DVD burners in the computer (all separate buses, or else burn speed suffers), run QuickPar on the source directories and burn those directories and par files endlessly until the job is done. The discs already have error correcting codes, plus with the recovery of QuickPar it should last about a decade - even with cheap media if you avoid a variable or harsh storage environment.
Discs are one step up from etched in stone, IMHO. They just don't make stones these days (last) like they used to. I look forward to the day where archivists use a plasma cutter to burn barcodes into granite (or something else as effective). That'd be a permanent optical medium.
was it that?
and to extract arj -va.
there, problem solved.
world was created 5 seconds before this post as it is.
shhhh blizzard might see
Why go through all that? Set up a ZFS volume, and snapshot it to another ZFS volume, and then offline that. Put it on a sata cage, and you can just take it with you when it is done.
I'm assuming you have a bunch of disks that are old and of different sizes. I'd recommend assuming no disks will crash in your backup set (you really won't be using it for extended periods) and creating a large JBOD partition using mdadm in Linux. I'd also recommend using ext4 for a Linux filesystem, which probably means getting the latest code from git, because some features needed to create a large enough partition are probably missing from your version of ext4. Assuming you don't need anything fancy for file permissions, rsync will probably be your best copy tool. As others have pointed out, the physical transfer will take days and will bottleneck at the USB interface. A "problem" with Linux is that the hard disk portion of the transfer is fast in comparison, so as much RAM as possible will be used internally by the Linux kernel for buffers to support the transfer (this is not the disk cache, incidentally). This means when you want to run a program while the large transfer is in progress, you will have to wait a long time for sufficient memory to free up. You can work around this by writing a large number to /proc/sys/vm/min_free_kbytes
Take the drives out of the enclosures, put them in a system and copy the data over the network.
Concatenates filesystems via FUSE.
http://romanrm.ru/en/mhddfs
http://packages.debian.org/search?keywords=mhddfs
Tape!
Tape it out. You don't store giant data sets on hard drives as a back-up. Store them on LTO-5 tapes, which are 1.5 TB each. LTO-4 tapes are 800 GB, and I don't know how pricing works out for the tape writers and tapes to decide which is better for you (I'd suggest LTO-5 if you're making a long term investment and likely to have many data sets to deal with in the future).
LTO tapes are meant to be archival quality. They're meant to store your data for 15+ years. Hard drives, by contrast, fail easily and aren't meant for archiving anything. Naturally, you should still make 2 copies if this is very important to you.
If you subsequently go somewhere that doesn't have an LTO drive, then any local data recovery service will be able to get your data off tape and back onto hard drives easily if and when you need it.
This is how we make backups of large data sets in nuclear physics. My career depends on these data sets, so I have a lot invested in backing them up correctly.
Check out cpio under Linux or many other Unix-flavored OSes; cpio can span backups over multiple target media. Make sure to test backup AND, just as important, restore.
All the answers are what I expected: there are lots of professional or high cost solutions to your problem like raids, based, tapes, but if you are after an on-the-cheap solution this is what I did: I bought a 10-way USB board on eBay and I have a stack of external drives plugged into it. On my Linux server I have all the drives combined into one with mhddfs. From the Linux side the drives look like a single directory. Obviously this solution lacks the error-checking or redundancy features of pro solutions but it is CHEAP - you don't need anything but an extra USB hub and a power board to plug all the external drive plug packs into...
If this is a one or two time occurrence, then just bite the awful time bullet and do the transfers. It will get done eventually. If this is an ongoing thing then Panasas in Pittsburgh specializes in terabyte and petabyte parallel backup systems. When I dealt with them maybe 5 years ago, their prices were not that bad. There should be similar companies by now, too.
Since MacOS is an option, doesn't the latest Time Machine support multiple drives?
LVM is another possibility. If you can get SATA drives and plug them all in, you can then create an LVM volume spanning all the drives and simply copy the data over to one large volume. LVM will take care of spanning it across the drives.
I have actually had to do this in an OS X environment before. We have an Xserve hosting about 30TB of data in small files, and we are scheduled to move away from the system, but we need backups in the meantime. My solution for the short term was to create a concatenated "RAID" of 35TB worth of external hard drives connected via FireWire AND USB (the external drives range from 6TB to 12TB), and use Retrospect to back up to the resulting volume. There is no room for anything but an up-to-date backup, but it's getting the job done until we move to a large RAID with offsite backup.
Apple's software RAID as configured through Disk Utility is surprisingly versatile, and though my transfer speed is slow when the data hits a USB drive, it is entirely transparent to the software when switching between FW and USB. It is also fairly robust, because if there is a hardware failure on our server, we can take the disks, plug them into another Mac, and the RAID configuration is maintained without any futzing around (as the config is listed at the beginning of each volume).
Now, before everyone goes apeshit on me for using a concatenated set instead of a RAID solution, there were a couple limiting factors in my decision to concatenate rather than RAID 5/0, the major one being the range of sizes of external drives that we have, and a lack of funds available to purchase more. OS X's software RAID goes by the lowest common denominator (6TB in my case), so I would lose ~1/2 to 1/3 of my space if I used ANY of the RAID options, and I didn't have any space to spare.
I feel your pain, and good luck.
Has anyone posted ^A Shift-Del yet?
I would advise against using anything like that. If one disc fails... your entire backup is gone. Unless you can logically separate the data as the guy says regarding brunette/blonde etc.. lol.. then you can just back up each portion onto a different disc
I guess there could be issues with space while making the rar files, but they can break the archive up into chunks of any size you desire. You will need all of them accessible to unpack them again though. Perhaps it isn't the greatest solution, but it may do what the poster wants.
-- ssoorrrryy,, dduupplleexx sswwiittcchh oonn.. -Quote found on actual fortune cookie.
Just put it in the cloud... *rimshot*
A cloud backup service released information on how they build their own disk based backup servers. Maybe something that would help with your endeavor? http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets/
Get a PC with 6 USB 3 ports, connect a powered, 4-port USB 3 hub to each PC USB port. Then connect 24 1TB external USB HDDs (or SSDs) to the hubs, format as necessary and run your backup software.
Thunderbolt may be higher performance and have daisy chaining capabilities. But the USB solution should work just fine.
I work for a data backup company as a dev monkey/admin/jack-of-all-trades.
Do you ever want to restore these backups? If the answer is "yes" (and it should be, otherwise why are you backing up in the first place...?), then you need to be guarded against failure of an individual disk. That means you need some sort of RAID solution.
For reference, Datto's 3U nodes store 20TB across 14 2TB drives, and the next larger size of node we have is somewhere around 55TB in 4U. No, I'm not trying to sell you our hardware (we only sell to resellers anyway) but hear me out. You really are going to save yourself some headache if you build a NAS device.
USB 2.0 is SLOW AS BALLS. I see our USB seed drives (HDDs we mail out to customers to get their initial datasets up into the ether) max out at 20-30MB/sec on a good day. By comparison, Gigabit Ethernet will give you 112MB/sec after NFS/TCP/Ethernet overhead -- much better. For this reason, and because it's just so impractical to handle large collections of failure-prone USB drives, our largest round trip drive that is shipped as USB is 4TB. After that, we actually ship our customers NAS devices (usually a returned/development box with a different OS image on it).
Go with NAS. You need the resilience against disk failure, you need the additional speed, and while yes, it's a greater investment, the alternative is utter agony when one of your 12 2TB disks takes a dump.
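To put the throughput figures above in perspective, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 TB = 1,000,000 MB) and a sustained rate with no pauses for swapping drives; the function name is made up for illustration.

```python
def transfer_days(total_tb: float, mb_per_sec: float) -> float:
    """Days needed to move total_tb terabytes at a sustained mb_per_sec."""
    total_mb = total_tb * 1_000_000   # assumption: 1 TB = 1,000,000 MB
    seconds = total_mb / mb_per_sec
    return seconds / 86_400           # 86,400 seconds per day


# Rates taken from the comment above: 20-30 MB/sec for USB 2.0, 112 MB/sec for GigE.
for label, rate in [("USB 2.0 (low)", 20), ("USB 2.0 (high)", 30), ("Gigabit Ethernet", 112)]:
    print(f"{label:>16}: {transfer_days(24, rate):.1f} days")
```

Under those assumptions a single pass over 24TB takes roughly two weeks at 20MB/sec and about two and a half days over Gigabit Ethernet.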
I know you are likely trying to do this for a cheap alternative, but just don't. It is really an unworkable solution for that amount of data.
Some have mentioned Tape, which I know very little of. However I would simply build another RAID machine to copy to, or use a NAS if you can find one big enough, as it amounts to pretty much the same thing, but more specialized.
If this isn't sensitive data, another option might be to cloud it. Amazon and a few others have some competitive prices. The advantage here is you additionally get off site backup.
I guess one of the key factors in your decision will be how often this 24TB of data changes. Will it only get occasional updates, or will a big chunk need to be backed up regularly? That is the other question: how often will you need to back up? Lastly, how quickly do you need recovery?
You're doing it wrong.
Single drives sitting on a shelf are not a "backup." You need to invest in renting or purchasing a tape drive and some tapes. THAT is long-term reliable backup. Those USB hard drives are a disaster waiting to happen.
Sans Digital makes an 8-slot drive enclosure with either a PCI-E or USB 3.0 interface for about 350 bucks. Put 8 3TB drives in it and run it JBOD. You can buy the cheap 3TB drives because you're going to run them JBOD. At 150 bucks a drive, your total cost is about $1600.
You might be able to get Windows to do incrementals to those drives, although I haven't tried it myself. And remember to run the enclosure sparingly, because non-enterprise drives aren't rated for the same number of spin-up hours.
Of course, it's not as safe as putting everything on a billion optical disks. But even using BD-R (at 46gig a pop), you're talking about 534 Blu-rays, and that's pretty much ridiculous, unless you have an intern you really dislike or something.
USB seems inane and insane for that level of data. How redundant is this 24TB of data as well? Running it through a data deduplicator (possibly reducing storage requirements, depending on the type of data) and then onto a tape drive or RAID array may be a cheaper and more time-effective option.
I back up a 10TB array to multiple USB HDs using aufs. I have the aufs mount configured to drop new files onto the drive with the most free space (I simply add drives as the data in the array gets larger), but aufs supports other modes like round robin. I then rsync the data from the array to the aufs mount as a nightly cron job.
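For anyone wondering what a "most free space" policy amounts to, here is a toy simulation. It is not aufs code, just a sketch of the placement rule described above; the function and drive names are hypothetical.

```python
def place_files(file_sizes, drive_free):
    """Assign each file to whichever drive has the most free space at that moment.

    file_sizes: list of (name, size) pairs, in the order they arrive.
    drive_free: dict mapping drive name -> free space (same unit as size).
    Returns a dict mapping file name -> drive name.
    """
    free = dict(drive_free)  # work on a copy so the caller's dict is untouched
    placement = {}
    for name, size in file_sizes:
        drive = max(free, key=free.get)  # drive with the most room right now
        if free[drive] < size:
            raise ValueError(f"no drive has room for {name}")
        free[drive] -= size
        placement[name] = drive
    return placement
```

Each file lands whole on one drive, which is why any single drive stays readable on its own.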
How about Linux system performance while doing even a single 2.0 USB copy? It seems to gank things up.
My system gets really sluggish, in odd ways. Mouse focus updates are very slow, and sometimes the mouse pointer gets left in odd states because of oddness that my window mgr (Enlightenment) doesn't seem to anticipate.
I find I need to slow down my mousing and browsing to avoid issues.
Debian Squeeze user on a Thinkpad T61, 2.6.32.
"I'm looking for a simple solution to backup"
And USB drives are your idea of simple? Seriously? Please hand the lady your Admin card at the door when you leave.
For 24TB, if you want to have a job after someone asks you to restore a chunk of that, you'll want to insist on tape. Or perhaps an equally sized NAS or SAN array. USB? Hope your resume is up to date.
You buy one of these:
http://www.newegg.com/Product/Product.aspx?Item=N82E16816322007
populate it with 4TB drives and create two RAID5 arrays (or one RAID6 array), then you've got 24 or 28 TB of backup space, without having to change drives or break up your backup into smaller chunks.
But really, your backup methodology is broken; you need to organize the data into manageable chunks because aside from a large dedicated backup server/SAN, there is no reliable (don't tell me tape is reliable) backup solution for such a large quantity of data in a single chunk.
What I do for backups: in my 24-bay server I have eight large drives in a (HARDWARE) RAID5 array (were 4TB drives available at the time I'd have gone RAID6) and rsync the virtualized server contents to that, then archive them into tarballs, and send copies of them across the LAN to another server that is running (HARDWARE) RAID5 as well. Every once in a while I back up the critical data (source, scripts, financial data, production web sites, /etc, and so forth but not the program binaries nor system binaries which are easily recreated or reinstalled, respectively) to optical media and external hard drives.
So what I have in summary is:
* Massive server with a backup array separate from the production array
* Separate backup server running another array (again, using a quality HARDWARE RAID controller. Safeguard your data and don't bother with Intel, Adaptec, Promise, or Highpoint "hybrid" RAID)
* Periodic backups of non-recreatable data to USB drives and optical media that are moved off site.
The Christian Right is Neither (Christian nor right). See: Matthew 23, Matthew 25, Ezekiel 16:48-50
Plug the drives in. Tell Win8 to treat them as one large drive. Good to go.
Make a NAS server with enough space in a RAID5 array and do the transfer via the network.
If you're willing to deal with the time it will take to write it all out, then it's doable. You need backup software that supports VTL (virtual tape library). With this, the physical drives are seen as tape devices. So it will start writing to drive #1, and when it's full it will say "out of media" and it *should* pause for new media. You "eject" the drive, attach a fresh one, and hit continue. Then wash, rinse, repeat until complete. As others pointed out, it will take some time. You can speed it up with eSATA or USB 3. If you're on a Mac, you can speed it up using Thunderbolt. I believe Arkeia still offers a free version and they did/do support VTL. I haven't been current on free backup wares for a while.
One thing to bear in mind as well: once you write this 24TB to a collection of media, any single media failure will result in all data being unrecoverable. So you might opt for doubling your backup window and making a duplicate copy.
Otherwise your best bet is to put all the drives in a NAS configuration (think FreeNAS) with a RAID6 structure, then have the backup s/w use this as its destination. You could do this with an 8-drive chassis of 8x4TB SATA disks (2 lost for RAID6 leaves 6x4TB = 24TB usable). A similar idea could be accomplished with ZFS, but its future is somewhat unknown with Oracle these days. If you need longevity, I'd stick with a more open/compatible filesystem. If you manage to set it up correctly and use exFAT, you could mount the backup volume on any current Linux, Windows, or Mac system, and if the backup s/w runs on all platforms you'd have a lot more compatibility and recovery options.
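The RAID capacity arithmetic above can be written out as a small helper. This is only a sketch: it assumes identical drives and ignores filesystem overhead, and the names are invented for the example.

```python
# Number of drives' worth of capacity given up to parity at each level.
PARITY_DRIVES = {"jbod": 0, "raid5": 1, "raid6": 2}


def usable_tb(drives: int, drive_tb: float, level: str) -> float:
    """Usable capacity of an array of identical drives, before filesystem overhead."""
    parity = PARITY_DRIVES[level]
    if drives <= parity:
        raise ValueError(f"{level} needs more than {parity} drives")
    return (drives - parity) * drive_tb


print(usable_tb(8, 4, "raid6"))  # the 8x4TB RAID6 case from the comment
```

With these inputs the 8x4TB RAID6 chassis comes out at 24TB usable, which only just covers the data set with no room to grow.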
Make a script that creates symbolic links to all your files. Not much extra space required.
Script figures out which files will fit based on output disk size (X GB), and puts links in created \Disk1 \Disk2 subdirs.
Then copy the DiskX subdir (follow symlinks) to HDX. Something like this surely exists by now.
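A minimal sketch of the symlink idea, assuming a POSIX system, a staging directory outside the source tree, and that filling disks in walk order is good enough (no attempt at optimal packing). The function name and the Disk1, Disk2 layout are illustrative only.

```python
import os


def stage_links(src_root, stage_root, disk_bytes):
    """Fill stage_root/Disk1, Disk2, ... with symlinks to the files under src_root.

    A new DiskN directory is started whenever the next file would not fit.
    Returns a dict mapping each relative file path to its disk number.
    """
    disk_no, used = 1, 0
    plan = {}
    for dirpath, dirs, files in os.walk(src_root):
        dirs.sort()  # make the walk order deterministic
        for fname in sorted(files):
            src = os.path.join(dirpath, fname)
            size = os.path.getsize(src)
            if size > disk_bytes:
                raise ValueError(f"{src} is larger than one disk")
            if used + size > disk_bytes:  # current disk is full: move to the next
                disk_no, used = disk_no + 1, 0
            rel = os.path.relpath(src, src_root)
            link = os.path.join(stage_root, f"Disk{disk_no}", rel)
            os.makedirs(os.path.dirname(link), exist_ok=True)
            os.symlink(os.path.abspath(src), link)
            used += size
            plan[rel] = disk_no
    return plan
```

Each DiskN directory can then be copied with a tool that follows symlinks, for example rsync with -L.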
We do much the same thing. We have a backup NAS that we then rsync to a set of "offsite" drives.
My recommendation would be to investigate ZFS. (Picture software RAID and LVM rolled into one with filesystem encryption and compression built in.) Easy to compile and install on Linux.
We created a pool for the offsite drives, then rsync to that file system. "Export" the file system and take the drives out. (Hot swap in trays, buy extra trays for rotation drives.) When you need to put in the next set just put them in and import it. Order and placement does not matter as long as enough drives are in. You could even have one or two parity drives in case a drive fails.
We have a cron job that rsyncs to the offsite drives, then exports them and emails the admins that it is ready for rotation. We keep 2 sets, one is in all the time and the other rotates offsite. You can swap on whatever schedule you are comfortable with. With compression, depending on data you could easily cut your drive requirements in half. Turn on encryption to keep your porn safe while in transit. All you need is a hot swap JBOD chassis. You could back up directly to the removable filespace, or do what we do: back up to a local (local to datacenter, not to machine) filespace and rsync it over regularly.
It is something else to learn, etc. But it is a system that works well.
Plan a week-long vacation. Before you leave, copy & then paste.
The mhddfs FUSE (filesystem in user space) for Linux is good at this sort of thing.
It combines a bunch of "real" filesystems into a large single-filesystem storage pool.
So take eight drives, each 3TB in size. Partition/format each of them with a single large ext4/xfs/whatever filesystem.
Mount all eight of them. Then issue a mhddfs command to create a new mount that pools the storage from all eight drives
into a single 24TB filesystem. Copy your data there. mhddfs will allocate individual files to individual drives, and the underlying
filesystems can be accessed without mhddfs involved if you like.
A very powerful tool -- should really be in the kernel, but isn't.
This problem is NP-complete. It is a bin packing problem regardless.
Use an archival tool that allows you to specify chunk size and set it to the drive size, or an even fraction of it. Use 10% parity files.
Or back up onto s3 with something like s3ql.
That would probably be cheaper and less of a headache, since space isn't a concern.
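Splitting files across fixed-size drives is an instance of bin packing, and since an exact answer is expensive, a greedy heuristic is usually good enough. Here is a sketch of first-fit decreasing, a standard heuristic (not something the comment above prescribes); sizes and capacity are in whatever unit you like.

```python
def first_fit_decreasing(sizes, capacity):
    """Pack sizes into as few bins of the given capacity as the heuristic manages.

    Items are taken largest first; each goes into the first bin with room,
    and a new bin is opened only when none fits. Returns a list of bins.
    """
    bins = []  # contents of each bin
    free = []  # remaining room in each bin, parallel to bins
    for size in sorted(sizes, reverse=True):
        if size > capacity:
            raise ValueError(f"item of size {size} exceeds bin capacity {capacity}")
        for i, room in enumerate(free):
            if size <= room:
                bins[i].append(size)
                free[i] -= size
                break
        else:  # no existing bin had room
            bins.append([size])
            free.append(capacity - size)
    return bins
```

The result is not guaranteed optimal, but it rarely wastes more than a drive or so on realistic file sets.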
Stop doing it wrong. There is no reason to do this over USB.
Buy an MD3000, pack it out with 1TB disks. If that is not enough, get an MD1000 and daisy chain it to the MD3000.
In another job, I put 2 MD1000s on a single box and used it as a backup server. The total was 24TB.
Partimage can do this.
If you've got that much data, with a setup like that, you can afford to buy something better than USB. Consider eSATA, though I, personally, would push for a simple, fast backup server.
However you go, and I admit to not having read all the comments, you didn't mention how often the backups need to occur. Here, where we've got terabytes of data on many systems, we do a nightly rsync, and use hard links, which speeds it up and decreases space usage.
mark
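For readers who haven't seen the hard-link trick: each night's snapshot is a full directory tree, but files that haven't changed are hard links to the previous night's copy, so they take no extra space. rsync can do this itself with --link-dest; the sketch below just illustrates the idea in plain Python, assuming a POSIX filesystem and using size plus modification time as the "unchanged" test. All names are made up.

```python
import os
import shutil


def unchanged(a, b):
    """Treat two files as identical if size and whole-second mtime match."""
    sa, sb = os.stat(a), os.stat(b)
    return sa.st_size == sb.st_size and int(sa.st_mtime) == int(sb.st_mtime)


def snapshot(src, prev, dest):
    """Copy src into dest, hard-linking files that are unchanged since prev.

    prev is the previous snapshot directory, or None for the first run.
    """
    for dirpath, dirs, files in os.walk(src):
        rel = os.path.relpath(dirpath, src)
        os.makedirs(os.path.join(dest, rel), exist_ok=True)
        for fname in files:
            source = os.path.join(dirpath, fname)
            target = os.path.join(dest, rel, fname)
            old = os.path.join(prev, rel, fname) if prev else None
            if old and os.path.isfile(old) and unchanged(source, old):
                os.link(old, target)          # no new data written
            else:
                shutil.copy2(source, target)  # copy2 keeps the mtime for next time
```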
42
Buy 8 x 4TB drives + this enclosure
http://www.amazon.com/Mediasonic-H8R2-SU3S2-ProRaid-External-Enclosure/dp/B005GYDMYQ
Armaments, 2-9-21 And Saint Attila raised the hand grenade up on high, saying, 'O Lord, bless this Thy hand grenade' N
If you want to minimize the number of volumes required to pack your files I recommend GAFFitter (http://gaffitter.sf.net/), which reorders a set of files/directories to best fit the volumes and so to avoid waste of space.
If it turns out that the source data is not porn (unlikely) and is highly compressible
Would dirty photos of his blow-up doll count as "compressible?"
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
quoting Linus again: "First off, I'm actually perfectly well off. I live in a good-sized house, with a nice yard, with deer occasionally showing up and eating the roses (my wife likes the roses more, I like the deer more, so we don't really mind). I've got three kids, and I know I can pay for their education. What more do I need? The thing is, being a good programmer actually pays pretty well; being acknowledged as being world-class pays even better. I simply didn't need to start a commercial company. And it's just about the least interesting thing I can even imagine. I absolutely hate paperwork. I couldn't take care of employees if I tried. A company that I started would never have succeeded – it's simply not what I'm interested in! So instead, I have a very good life, doing something that I think is really interesting, and something that I think actually matters for people, not just me. And that makes me feel good." http://en.wikiquote.org/wiki/Linus_Torvalds
I like my spaghetti with source.
I've done this (tar backups) for ages, partly from back in the day when I *had* to backup to cdrs/dvdrs, and partly from a desire to be able to more easily restore a partially corrupted backup.
Watch out for ACLs and Sparse files. They can cause grief. (Test before you rely on tar's -S flag.)
Watch out for using bzip2 (-j) and gzip (-z) compression. Aside from greatly slowing things down, the output stream is compressed rather than the individual files and thus a single bitflip can render the remaining tarfile (tarfiles?) unreadable.
Watch out for system pseudo-directories (/sys, /proc, /dev, etc). Letting tar backup /dev/hda can be a mistake.
You can mount all your backup drives and, in multi-volume mode (-M), specify --file= multiple times without a tape-changing (disk-mounting) script.
You can use -F, --info-script=NAME, or --new-volume-script=NAME to run a script at the end of each file (tape), umounting and mounting new disks.
Older versions of tar used to have problems with long file/path names. Shouldn't be a problem these days, but it gave me headaches half a decade ago.
One Grand Final Rule: Don't backup the backup tarfiles. It just doesn't end well.
Oh, and consider eSATA or firewire or at least USB3. Disk throughput will (obviously) be a huge issue on a backup of this magnitude.
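As an illustration of the pseudo-directory warning, here is a small sketch using Python's tarfile module rather than the tar command line. It writes an uncompressed archive and skips a hard-coded list of top-level directories; the list and the function name are assumptions for the example, so adjust them for your system.

```python
import os
import tarfile

# Top-level directories to leave out (assumed list, relative to the tree root).
PSEUDO_DIRS = {"sys", "proc", "dev"}


def archive_tree(root, out_path):
    """Write an uncompressed tar of root, skipping the pseudo-directories."""
    with tarfile.open(out_path, "w") as tar:  # mode "w" means no compression
        for entry in sorted(os.listdir(root)):
            if entry in PSEUDO_DIRS:
                continue
            tar.add(os.path.join(root, entry), arcname=entry)
```

Keep out_path outside root, otherwise the archive ends up trying to include itself.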
--Anon. (Don't have time to find my old Slashdot password right now. Machine it was stored on died and was replaced, about 6 times over now. Someday I have to dig through my old backups and find that thing. Really glad I used tar and not something obscure/closed-source/obsolete.)
Use FreeNAS to manage RAID on the array. And rsync. Yes, you may have to do some handiwork yourself. GTG!
I come to Slashdot only to read sigs. One you are reading is mine.
I know this has nothing to do with USB and maybe the OP has very good reasons for wanting it on USB. In any case...
Amazon S3 pricing:
First 1 TB / month: $0.125 per GB
Next 49 TB / month: $0.110 per GB
Those prices are per GB, so taking 1 TB as 1,000 GB: (1,000 x 0.125 + 23,000 x 0.11) * 12 = about $32,000 per year for 24 TB. That's a lot more than buying a bunch of hard drives.
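Because the rates above are quoted per GB, the terabyte counts have to be converted to gigabytes before multiplying. A quick check of the tiered arithmetic, assuming 1 TB = 1000 GB and only the two tiers listed (the names below are made up for the sketch):

```python
# (tier size in TB, USD per GB per month), from the two price lines quoted above.
TIERS = [(1, 0.125), (49, 0.110)]
GB_PER_TB = 1000  # assumption: decimal terabytes; using 1024 raises the total slightly


def monthly_cost(total_tb):
    """Monthly storage cost in USD for total_tb terabytes under the tiers above."""
    cost, remaining = 0.0, total_tb
    for tier_tb, usd_per_gb in TIERS:
        used = min(remaining, tier_tb)
        cost += used * GB_PER_TB * usd_per_gb
        remaining -= used
        if remaining <= 0:
            break
    if remaining > 0:
        raise ValueError("total exceeds the tiers listed")
    return cost


print(f"per month: ${monthly_cost(24):,.0f}   per year: ${12 * monthly_cost(24):,.0f}")
```

That works out to about $2,655 per month for storage alone, before any transfer or request charges.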
Dear Slashdot: next time you want to mess with the site, add a rich-text editor for comments.
But only 1GB/s is recorded
Only?
You can do all of this with rsync and some text editing/a script.
rsync -av /mnt/srctree /mnt/destdrive > backup-list
When that copy runs out of drive space, insert next disk,
rsync -av /mnt/srctree /mnt/destdrive --exclude-from=backup-list >> backup-list
You do have to do some maintenance on "backup-list" in two ways. First, rsync will list the parent directory before individual files as output when copying. Unfortunately, if it doesn't FINISH that directory and you then provide it in the --exclude-from, it will skip the entire directory. A simple script that runs through the backup-list file and removes any entry with a trailing forward slash, e.g. "parent/child/child/" instead of "parent/child/child/porn.jpg", will cause rsync to inspect each directory next time and catch anything it missed.
Secondly, the very last line of the "backup-list" file will most likely be in error. This will be the file that rsync was copying when disk space ran out on the destination. Delete that line and it will be caught on the next disk.
Writing the script takes about a minute, rsync is in every distro, et voila, your complete backup solution for large volumes to a series of different sized small volumes.
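The two cleanup steps described above could look something like this. It is a sketch that assumes backup-list contains only path lines; status lines that rsync -v also prints (such as "sending incremental file list") would need filtering out first. The function name is hypothetical.

```python
def clean_backup_list(lines):
    """Return the entries that are safe to hand to rsync --exclude-from.

    Drops the final entry (probably the file being copied when the disk filled)
    and every directory entry (anything ending in "/"), so rsync re-inspects
    each directory on the next disk.
    """
    entries = [line.rstrip("\n") for line in lines if line.strip()]
    entries = entries[:-1]  # the last file may be incomplete
    return [e for e in entries if not e.endswith("/")]


def rewrite_file(path="backup-list"):
    """Clean the list file in place."""
    with open(path) as fh:
        kept = clean_backup_list(fh.readlines())
    with open(path, "w") as fh:
        fh.writelines(e + "\n" for e in kept)
```

Run rewrite_file() after each disk fills up and before starting rsync against the next one.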
Most external HDD enclosures are limited in capacity to around 2 TB. Using larger drives is quite possible, but very unsafe for your data: a single connection to a machine whose BIOS is unable to handle them and you could lose your file system, effectively erasing your drives. Be advised that NTFS is very vulnerable to this.
Large drives have a hard time being on the same controller, because you are exceeding hardware limitations, and that goes for both the enclosure and the computer side controllers.
SATA is GOOD, since it's 1 controller per drive.
Firewire is BAD, limit yourself to less than 2^32 (3.7 TB) total or you lose everything.
USB is BAD, since one controller has to handle multiple drives, and you would need to look carefully at the hardware of any computer touching those drives.
I propose to build yourself a file server with multiple drives, like a smaller NAS enclosure. The objective is to keep the drives and the hardware that operate them together.
If you want to move all this data in a decent amount of time, you'll need to look into optical network cards. This might require computers on both ends.
Buy some USB hubs and maybe some SansDigital multi-drive enclosures. Hook them all up at once, build RAIDs out of each SansDigital chassis, and use LVM to aggregate the chassis. lvcreate, mkfs, and start copying the data.
"I'm just a simple caveman, ..." with a mainframe background, so I have a question of curiosity here
At what point does the bandwidth/throughput of the DMA start limiting the performance of your backup?
In my world, DMA for I/O is called a "channel". We have many, and while there are a lot of nuances we could discuss, basically we try to segregate the I/O for the input to backup (disk) and the output of backup (usually tape) , and have the backup task process in parallel as much as possible - my nightly backup, for example, runs 9 parallel tasks, 9 being the limit that this particular backup program has. I could run multiple instances of the program, but then I have to have mechanisms to make sure I don't back up the same disk twice between two concurrent executions; with one instance and 9 tasks I can just say 'back up everything that's online at the moment'. So, the throughput is limited by the performance of the slowest devices, multiplied by the parallelism we are able to achieve. In the PC / server environment, does the DMA limit the I/O capability?
Backup Exec does exactly what you are asking for. Free 30 day trial.
I ran across a FUSE module (mhddfs) that seemed relevant when I wanted to combine several USB drives into a single file system. My main goal was to make each drive usable independently for file recovery if I had to move it to another system.
The module appears to be a fairly thin wrapper over an existing file system. It only appears to choose which of the sub-file systems to write new data to, automatically writing files to whichever drive has the most space. This provides nothing in the way of redundancy, however.
What is nice is that you can easily access the files on a drive without needing the other drives. May be helpful for someone.
http://romanrm.ru/en/mhddfs
One byte at a time
Each has at least 6 external USB 2.0 ports, an eSATA port &
for use as a possible back-up, up to 4x internal SATA HDDs,
not to mention a Gigabit wired-network port.
Using 3 TB external USB HDDs, of the same brand & model
running, eg, FreeNAS, or your fav x86 (32- or 64-bit) op sys;
boot from internal USB stick frees an internal SATA drive for
use as back-up.
It may not be the fastest, but it's a SIMPLE solution, that fits
in a small space.
Honestly, I know it isn't your question, but skip USB. Too slow. WAY too expensive. Get yourself a RocketRAID card or similar, a SAS expander, and a trayless enclosure with 8+ disk bays. I use a 12 disk enclosure (8 for regular backups, 4 for all the one off stuff I do) with 2TB drives. I wrote a program in Java using NIO that stripes the backups across the disks so that it can saturate the bus. A solution like this will ultimately be faster and cheaper. One day I will port the code to native, as the Java program was just a proof of concept that has worked so well I haven't gotten around to it. This setup works exponentially better than the VXA-3 tape backup we were using before, and I couldn't imagine having to do it with USB drives, either from a cost or a logistics perspective.
USB will copy files, but not identical copies. Firewire is better.
But the best/cheapest solution is a Dell MD-1000. It will take 2TB generic drives.
If you build a small, cheapish system (an ITX board with 6 SATA ports, each connected to a port multiplier, OS on flash), you could have 30TB from 1TB drives. Set it all up as an LVM2 volume; then you can slap the drives back into a new system any way you want, and they'll come back up in the right order. rsync(backupmypc) will keep the backup in good condition. You'd of course need spares in case the verify shows a drive failure, plus a duplicate system. Using Linux RAID to turn the two NFS mounts into a RAID 1 array would be nice, but parity would be better, yet it'd take even more drives. Yeah, bigger drives would be a good first start.
this sounds exactly what the guy is looking for..
world was created 5 seconds before this post as it is.
Step 1: Buy yourself something like this: http://www.aberdeeninc.com/abcatg/Stirling-X339.htm
Step 2: Install it
Step 3: rsync
Step 4: Go do something else -- this is going to take a while
In Reason We Trust
At $35 each you get a dozen Raspberry Pi's.
While not fast you have a USB port and can connect them
via ethernet and ssh and start tinkering.
A good USB hub can turn one USB to four
The local Costco has 3TB USB disks. Yes
you have to organize your data into 2.8TB chunks
or so with some script foo but rsync can help
verify the bits.
N.B. this is 10/100 ethernet not GigE and USB2 (at best)
and they share a single USB link to an onboard USB hub.
But you could automate the thing and not have to
swap out USB cables for a week.
MD5 checksums and an index...
Let us know how it goes. ;-)
No matter what you do you will have to do some
scripting. Do label each of the USB disks
(physical and logical names that match).
Did someone say that this was a marginal
idea? Backing up to USB has some value but does not
sound magical and error free.
Since 24TB is a lot of junk -- good luck
but with the crazy big USB disks -- what the hey.
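On the "MD5 checksums and an index" point, a rough sketch of what that could look like: walk one disk, hash every file, and record which labelled disk it lives on. The function names and the index layout are made up for the example; reading in 1 MiB blocks keeps memory flat even for 20GB files.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """MD5 hex digest of a file, read in chunks so large files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_index(root, disk_label):
    """Map each file's path (relative to root) to (disk_label, md5 digest)."""
    index = {}
    for dirpath, dirs, files in os.walk(root):
        for fname in files:
            full = os.path.join(dirpath, fname)
            index[os.path.relpath(full, root)] = (disk_label, md5_of(full))
    return index
```

Merging the per-disk indexes gives one lookup table that says which physical disk to fetch for a given file and lets you verify it after the restore.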
Truth is stranger than fiction, but it is because Fiction is obliged to stick to possibilities; Truth isn't. Mark Twain.
Shouldn't you be looking at DLT devices for this kind of data set size?
http://www.high-rely.com/
We ran some of these for off siting data in rotation... Way faster than tape and designed for swapping... Might not be the best for long term storage.
EA David Gardner -"... but the consumers have proven that actually what they want is fun."
to see someone here knows this.
Sent as ripples into the electromagnetic field. No single photon has been harmed in the process.