The Amazing $5k Terabyte Array
An anonymous reader writes: "Running out of space on your local disk? How about a Terabyte array for only a few thousand dollars? This article at KCGeek.com shows how to put together 1000 Gigs of hard drive space for the cost of a few desktop computers."
I could rip my entire anime collection for instant access! Rip all my
CDs and still have .9 Terabytes left! Maybe Mirror Usenet! I guess
the simple truth is that now that 100 gig drives are a couple hundred
bucks, we now have the ability to store anything we reasonably could
need (unless you define "Reasonable" as "I need to store DNA Sequences").
It's only a matter of time 'til video becomes as commonplace as MP3s on our drives. 100 Gigs is what...20 movies??? I don't see my appetite for disk space slowing down any time soon.
Hmmm...video; logfiles that don't roll over - ever; online network backup... I'm sure to figure out a way to fill that terabyte. :)
BRENT ROCKWOOD, EST'd 1975
Yes, this is a groovy/geeky/cool solution for under your desk, but at least spend the extra dollars for a SCSI card and tape backup unit. You could fit the whole thing on a few DLTs. You can also keep incremental backups to keep the tape swapping to a minimum.
Yup, except that to a hard drive manufacturer:
1 Terabyte = 1000GB = 1000000 MB
Their marketdroids have a bad habit of rounding the values down and evening them off. This allows them to post bigger numbers on the actual size of the drive since dividing by 1000000 instead of 1048576 yields a larger end result.
But I won't have to worry about that. I can't even afford a 9GB SCSI drive at this point.
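To put a number on the decimal-versus-binary gap described above, here is a rough Python sketch. It assumes the OS counts in binary units (1 GB = 1024^3 bytes) while the box counts in decimal ones (1 GB = 1000^3 bytes); the function name is made up for illustration.

```python
def marketed_to_binary_gb(marketed_gb: float) -> float:
    """Convert a decimal 'on the box' GB figure to binary GB as the OS counts it."""
    total_bytes = marketed_gb * 1000**3   # manufacturer's definition of a GB
    return total_bytes / 1024**3          # binary definition of a GB


if __name__ == "__main__":
    for size in (80, 100, 1000):
        binary = marketed_to_binary_gb(size)
        print(f"{size} GB on the box is about {binary:.1f} GB to the OS")
```

By this arithmetic the "terabyte" array comes out roughly 7% smaller than the label suggests.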
And at this point, why even bother anymore? Performance isn't really that different anymore. A SCSI 73GB drive is almost $800 whereas you can pick up an 80GB UltraDMA 133 drive for $160. SCSI is dead. People would be crazy to pay those kinds of prices for SCSI over IDE anymore just because they think it is cool to have SCSI. Believe me it's not. It doesn't make a damned bit of difference. The IDE drives are just as fast as the SCSI drives, if not faster for average everyday use, and cost 25% of what the SCSI drive does. Unless you're building a huge server array and have shitloads of money to throw away, just use IDE in your systems.
OK, I'm not an expert in this area, but I think when people do research into DNA sequences they get DNA sequences from a large sample of people so they can look for statistical links between certain gene sequences and various properties. Therefore they will need (reasonable sample size) * (size of sequence), so if you have a sample of just 1000 people you could easily be getting into terabyte land. (Then they need a highly optimised version of diff to spot the differences in the sequences!)
That reminds me, I don't know where the hell the tape manufacturers think they're marketing to, but with 80 GB hard drives common now, it's rare to find a tape backup solution that is affordable for a consumer that can handle that much. By affordable I mean drives around $250 and tapes under $10/piece for at least 50GB of storage. I've seen some of the proprietary drives but the tapes cost almost as much as the drive! 5 or 6 years ago the backup drives available to consumers could handle backing up the entire average hard drive of the time onto a $15 tape (Travan), but now people are probably just doing without backups which is a disaster waiting to happen.
You are correct that the human genome is "only about" 3 giga basepairs of sequence, but to only store that would be rather egocentric. There are as of Dec 3 2001 some 14396883064 bp in GenBank, and the amount of sequence information still grows in a roughly exponential manner.
Now, this will not hit the TB line anytime soon. The trouble starts if you are involved in genome sequencing. Then you need to store the raw data for all that sequence. Each stretch of some 450 bp of sequence is reconstructed from about 5 - 10 different fairly high resolution gel images (in the ballpark of 150 KiB per image). Also, recall that even short stretches of the sequence can be accompanied by a lot of annotating information, such as names and functions of genes, regulatory elements or pointers to articles explaining the experimental evidence for such. This multiplies the storage requirement by quite a factor - nothing a neat little linux box with a huge RAID-array cannot handle though. That's how we handle the sequencing data from Trypanosoma cruzi, by the way.
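For a feel of how quickly the raw data piles up, here is a back-of-the-envelope sketch in Python using the figures above: one read per 450 bp, 5 to 10 images per read, 150 KiB per image. The 3 Gbp genome size is only an example input, annotations are ignored, and the function name is hypothetical.

```python
KIB = 1024  # bytes per KiB


def raw_image_bytes(total_bp, bp_per_read=450, images_per_read=5, image_kib=150):
    """Estimate raw gel-image storage for a sequencing project, in bytes."""
    reads = -(-total_bp // bp_per_read)  # ceiling division: reads needed to cover total_bp once
    return reads * images_per_read * image_kib * KIB


if __name__ == "__main__":
    genome_bp = 3_000_000_000  # example size only: roughly a human-sized genome
    low = raw_image_bytes(genome_bp, images_per_read=5)
    high = raw_image_bytes(genome_bp, images_per_read=10)
    print(f"raw images: about {low / 1e12:.1f} to {high / 1e12:.1f} TB")
```

Even with single coverage and nothing but the images, a genome of that size lands in the multi-terabyte range on these assumptions.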
In case you didn't notice, it's RAID5. One hard disk could go bad with no issues other than slowdown.
They could also do what we did with our IDE TB. We used three RAID5s in hardware, each with hot swap. In theory, if they failed just right, we could lose up to 6 drives without losing any data.
The three RAID5s are hardware RAID0ed together. The worst case scenario is a simultaneous failure of two drives on the same array. But we saved so much money using IDE that we just built two complete systems for less than the price of SCSI. So really, we would have to hit the worst case scenario twice at nearly the same time to have a total loss. It gets less and less likely.
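The failure rule for this kind of layout can be written down in a few lines. This is only a sketch of the logic as described (RAID 0 striped over RAID 5 sets, plus a second complete system), not a model of any real controller; the function names are made up.

```python
def array_survives(failed_per_raid5):
    """RAID 0 over several RAID 5 sets: data survives only if
    every RAID 5 set has lost at most one drive."""
    return all(failed <= 1 for failed in failed_per_raid5)


def pair_survives(system_a, system_b):
    """With two complete systems holding the same data,
    a total loss needs both systems to fail."""
    return array_survives(system_a) or array_survives(system_b)


if __name__ == "__main__":
    print(array_survives([1, 1, 1]))              # one drive lost in each set
    print(array_survives([2, 0, 0]))              # two drives lost in the same set
    print(pair_survives([2, 0, 0], [1, 1, 1]))    # second system still has the data
```

The first case survives, the second does not, and the third survives only because of the duplicate system.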
I've had enough abrasive sigs. Kittens are cute and fuzzy.
I figure this is the easiest way to add as you grow without having to break open the case and try to figure out how to add another damn drive in there. For backup, just have two systems with identical capacities and rsync between the two nightly.
RAID is nice, but for home use, it's not as nice as a nightly mirror. Why? I've seen RAID controllers fail and take out an entire RAID set. RAID also doesn't deal with the "Holy shit, I just accidentally typed `rm * ~` instead of `rm *~`" problem.
FIRE!
Any serious data store needs to include a backup system which allows for copies off-site. Fire is the obvious risk of course, but floods, vandalism and lightning strikes are all possibilities.
AFAIK the only generally available tape backup for something this big is DLT, which IIRC can now do around 40GB per tape before compression. With the 2:1 compression usually quoted that's 80GB per tape, or around 13-14 tapes for a full backup. So you really need about 30 tapes for a double cycle, and maybe more if lots of the data is non-compressible (like movies). But this stuff ain't cheap. DLT drives start at around £1000 and the tapes cost £55 each. So that's around £2500 = $4200 to back this beastie up.
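The tape arithmetic is easy to redo for other sizes. A small Python sketch, assuming the prices and capacities quoted above (40GB native, 2:1 compression, £1000 drive, £55 tapes) and exactly 1000GB of data; the function names are illustrative.

```python
import math


def tapes_needed(data_gb, tape_native_gb=40, compression=2.0):
    """Tapes for one full backup, rounding up to whole tapes."""
    return math.ceil(data_gb / (tape_native_gb * compression))


def tape_backup_cost(data_gb, cycles=2, drive_gbp=1000, tape_gbp=55, compression=2.0):
    """Return (total tapes, total cost in pounds) for the given number of cycles."""
    tapes = tapes_needed(data_gb, compression=compression) * cycles
    return tapes, drive_gbp + tapes * tape_gbp


if __name__ == "__main__":
    print(tape_backup_cost(1000))                    # data that compresses 2:1
    print(tape_backup_cost(1000, compression=1.0))   # incompressible data such as movies
```

With 2:1 compression this gives 26 tapes and £2430, close to the round figures above. With no compression at all the tape count doubles and the total climbs to £3750.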
Having said that, the possibility of using hot-swappable IDE drives as backup devices is intriguing. Just point your backup program at /dev/hdx3 or whatever. One big advantage is that if your tape drive gets cooked in the server-room fire you don't have the risk of tapes that can only be read on the drive that wrote them. A Seagate 5400RPM 60GB drive costs £110, which is only a third more per megabyte than a bare DLT tape. Two cycles-worth of backup (34 drives) would be £3,700. And you can probably do better by shopping around. For servers with only a few hundred GB on line this might well be more cost-effective than buying a DLT drive.
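The "a third more per megabyte" and "34 drives" figures can be checked with a few lines of Python. The prices and capacities are the ones quoted above; the comparison uses native (uncompressed) tape capacity and assumes 1000GB to back up.

```python
import math


def cost_per_gb(price_gbp, capacity_gb):
    """Price per gigabyte in pounds."""
    return price_gbp / capacity_gb


tape = cost_per_gb(55, 40)     # bare DLT tape, native capacity
disk = cost_per_gb(110, 60)    # 60GB 5400RPM IDE drive
premium = disk / tape - 1      # how much more the disk costs per GB

drives = math.ceil(1000 / 60) * 2   # whole drives per cycle, two cycles
total = drives * 110                # pounds for all the backup drives

if __name__ == "__main__":
    print(f"disk costs {premium:.0%} more per GB than tape")
    print(f"{drives} drives, about {total} pounds")
```

The totals leave out the hot-swap bays, which both cycles would share.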
We use Amanda to do backups here. It's a useful program, but it can't back up a partition bigger than a tape. So you need to think carefully about your partition strategy. (Side note: you can use tar rather than dump to break up over-large partitions, but it's still a pain.)
Suddenly that terabyte starts looking a bit more expensive.
Paul.
You are lost in a twisty maze of little standards, all different.
Ironically, I just built something very similar to this a few weeks ago (it runs great BTW), but I spent <$1500US on all the components. The biggest thing you have to watch out for is the hard drives. I went for the ones with the best bang/buck ratio at the time (Maxtor 80GB 5400RPM drives). This let me build a system with well over 1/2 a Terabyte of usable space at a fraction of the cost. Additionally, the slower drives require less power and less cooling, making them easier to fit in a standard full tower case with a merely beefy (as opposed to server-class) power supply. I think the processor requirements he stated were a little overboard as well. I've found that disk access tends to be limited by the PCI bus (it doesn't help that I used an older motherboard with 33 MHz 32-bit PCI), especially on writes where you can spread data across the write cache on the drives. Be careful when you build an array like this, ATA *hates* having access to both a master and a slave drive at the same time. Be sure to avoid having two disks on the same plex on the same controller. This was natural for me fortunately, since I was building two plexes, a "backup" and a "media" plex.
A final word of warning: Promise ATA100 TX2 controllers may look like a natural choice for a server like this, but they only support UDMA on up to 8 drives at once, and Promise's tech support only supports a maximum of 1 (one!) of their cards in any system.
I read the internet for the articles.
So, then, I'm confused... He's trying to use software raid, but he has 4 Promise FastTrak 100TX2 raid controllers. WTF? First off, each of those cards supports 4 drives on 2 channels... Why does he need 4 cards when he only has 8 drives? He only needs 2 cards. Second, why is he using expensive raid hardware (that doesn't even support RAID 5) when he's using software raid!?
All he needs are two of Maxtor's cards, which you can buy packaged with the drive for an extra $13. Not only that, but his prices on hard drives are way too high. 8 drives (2 with Maxtor's IDE cards) are $2122, per pricescan.com. Since he lists $500 for the IDE cards and $3000 for the HDDs, that's a savings of $1378.
Then, he quotes $500 for 2 GB of RAM. At $70 per .5GB stick that's $280. $500 for a case??? Try $365.
That said, the $5720 price he quotes is high by $1733. You could build one of these for just under $4000.
Ok, I admit, I didn't include shipping.
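For anyone who wants to check the sums above, here they are as a short Python tally. All figures are the ones quoted in this comment; the dictionary keys are just labels.

```python
# Prices from the article versus the repriced components, in US dollars.
quoted = {"drives_and_cards": 3000 + 500, "ram": 500, "case": 500}
repriced = {"drives_and_cards": 2122, "ram": 4 * 70, "case": 365}

savings = {item: quoted[item] - repriced[item] for item in quoted}
total_savings = sum(savings.values())

article_total = 5720
revised_total = article_total - total_savings

if __name__ == "__main__":
    print(savings)
    print(total_savings, revised_total)
```

The per-item savings come to $1378, $220 and $135, which add up to the $1733 figure and leave $3987, before shipping.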
> He's trying to use software raid, but he has 4
> Promise FastTrak 100TX2 raid controllers. WTF?
> First off, each of those cards supports 4
> drives on 2 channels... Why does he need 4
> cards when he only has 8 drives? He only needs
> 2 cards.
I'm a firmware engineer for Maxtor... if you're going for performance, you want 1 drive on each bus, and you don't want to use the motherboard connectors. With 2 drives on each bus, you are limiting the average transfer rate out of cache to 50% of the max transfer rate. On a modern drive with their 60-65MB/sec channel rates, you cannot stream sequentially off of 2 drives without saturating an ATA-100 cable. Even running ATA-133 won't help starting a year from now.
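The saturation argument is simple addition, sketched here in Python. The bus figures are the nominal ATA-100 and ATA-133 rates in MB/sec, the drive rates are the 60-65MB/sec quoted above, and protocol overhead is ignored; the function name is made up.

```python
def bus_headroom_mb_s(bus_mb_s, drive_rates_mb_s):
    """Nominal bus rate minus the combined sequential rate of the drives on it.
    A negative result means the cable is the bottleneck."""
    return bus_mb_s - sum(drive_rates_mb_s)


if __name__ == "__main__":
    print(bus_headroom_mb_s(100, [60]))        # one drive on ATA-100
    print(bus_headroom_mb_s(100, [60, 60]))    # two drives on ATA-100
    print(bus_headroom_mb_s(133, [65, 65]))    # two drives on ATA-133
```

One drive per cable leaves plenty of room, two drives overrun ATA-100, and ATA-133 only just covers two of today's drives, with nothing to spare for faster ones.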
Additionally, every BIOS I have looked at sucks in terms of performance. In most cases they have small DMA FIFOs which stutter the pipe during high speed transfers -- they literally hang the DMA lines while they empty their FIFO into memory, then come back and grab another 8 words or something sad. They also tend to be very poor managers of the IRQ line. This causes delays at times when your hard drive could be giving you more data, but the host hasn't gotten around to asking for it yet.
All the 3rd party cards have like 2Kbyte FIFOs which prevents any overrun from occurring, which alone is quite helpful in high bandwidth applications.
The cards we include with our drives are in the lower end of Promise's spectrum... you can spend more and get more performance if you want to, which is what I suspect the author of the original article did.
--eric
More data, damnit!
I think Usenet is underestimated here. I remember reading on the site of one of the larger ISPs, specialized in good usenet access (i.e. 30000+ groups & week+ retention even on binaries groups), that they have significantly more than 1 TB of storage space (don't remember how much, but several TB). So mirroring Usenet might be a tight fit.
beauty is only a light switch away
i believe the bigger problem is performance rather than capacity.
a typical configuration that cheap will use ide hdds (and, to make it cheaper still, software raid).
the main problem (for us in this case) is the performance. how do you increase the data transfer rate? for the past few years, storage space has increased tremendously but the transfer rate of the drives has not kept pace with it.
ide controllers usually sit on a 33mhz/32bit pci bus, which gives a burst transfer rate of 133mbyte/sec. that is the max whatever you do. and if you add a nic, it will share that bandwidth unless it is placed on a different bus.
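The 133mbyte/sec figure falls straight out of the bus clock and width. A minimal Python sketch, assuming the standard 33.33 MHz PCI clock and a theoretical peak with no protocol overhead; the NIC rate is a made-up example value and the function names are illustrative.

```python
def pci_peak_mb_s(clock_mhz, width_bits):
    """Theoretical burst rate: one transfer of width_bits per clock cycle."""
    return clock_mhz * width_bits / 8


def left_for_disks_mb_s(clock_mhz, width_bits, other_traffic_mb_s):
    """Bandwidth left for the disk controllers once other cards on the
    same bus (for example a NIC) have taken their share."""
    return pci_peak_mb_s(clock_mhz, width_bits) - other_traffic_mb_s


if __name__ == "__main__":
    print(f"{pci_peak_mb_s(33.33, 32):.0f} MB/sec peak on 33MHz/32bit PCI")
    # assumption: a 100 Mbit NIC running flat out moves about 12.5 MB/sec
    print(f"{left_for_disks_mb_s(33.33, 32, 12.5):.0f} MB/sec left for the disks")
```

Real sustained throughput sits below the peak, so every controller on the same bus splits even less than this.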
for the interface itself, scsi can handle more i/o operations/sec and fc even more. today's technologies can implement raid5 at almost no performance hit.
so given 1tb of data, definitely many people will be accessing it (unless you really plan to use it all for your own insane storage needs). so the more people are able to store, the slower the rate at which they can access it.
so you won't see scsi and fc becoming obsolete even when serial ata arrives. serial ata will remain in the low-end segment of the storage market.
and besides, if you want to back up your data, the best way is to store it on tape, and that will cost big (since mirroring the data to another server will not give you the reliability of tape).
Live your life each day as if it was your last.
This guy totally went the wrong way for expandability and speed. You can get the Promise SuperTrakSX 6000 for $480 and that has hardware RAID 5 and supports 6 drives. I'd throw one of those in with 6 drives to start and take my 800 Gigs and be happy. That would save me at least $1000 up front. I wouldn't need 2 of the hard drives, the second processor or so much RAM. Plus it would be faster and much more reliable. Then later on I could add another one for about $2500 and have 1.6 TB of space to store my huge collection of pornography... err rather MP3s, software and G-rated DVD movies.
If you're not cheating you're not trying. If you're not trying you're not winning and if you're not winning why play?