Does ZFS Obsolete Expensive NAS/SANs?
hoggoth writes "As a common everyman who needs big, fast, reliable storage without a big budget, I have been following a number of emerging technologies and I think they have finally become usable in combination. Specifically, it appears to me that I can put together the little brother of a $50,000 NAS/SAN solution for under $3,000. Storage experts: please tell me why this is or isn't feasible." Read on for the details of this cheap storage solution.
Get a CoolerMaster Stacker enclosure like this one (just the hardware not the software) that can hold up to 12 SATA drives. Install OpenSolaris and create ZFS pools with RAID-Z for redundancy. Export some pools with Samba for use as a NAS. Export some pools with iSCSI for use as a SAN. Run it over Gigabit Ethernet. Fast, secure, reliable, easy to administer, and cheap. Usable from Windows, Mac, and Linux. As a bonus, ZFS lets me create daily or hourly snapshots at almost no cost in disk space or time.
Total cost: 1.4 Terabytes: $2,000. 7.7 Terabytes: $4,200 (Just the cost of the enclosure and the drives). That's an order of magnitude less expensive than other solutions.
Add redundant power supplies, NIC cards, SATA cards, etc as your needs require.
It should also be noted that FreeBSD has added ZFS support to Current (v7). It's built on top of GEOM too, so if you know what that is, you can leverage it underneath ZFS.
Not enough specifics here. I am going to say do your thing. If it works, you're a hero and saved 47k. If it doesn't, obfuscate and negotiate the 50k of storage down to 47k. Win for all.
Unless you would like to give more specifics. Cause I am going to say in 99% of cases where you want fast, reliable, and cheap storage you only get to pick two.
For quite a while now, it has been less expensive to build a DIY file server than to purchase NAS equipment. I personally build gateway/NAS products using Via c7/8 boards as they are low power, have hardware encryption, and are easy to work with under Linux. Accessory companies even make backplane drive cages for this purpose that fit nicely into commodity PCs.
BBH
place where i work looked at one of these things from another company. did the math and it's too slow even over gigabit for database and exchange server. OK for regular file storage, but not for heavy I/O needs
Does the overuse of TLAs obfuscate the meaning of SDS?
Infiltrated dot Net
For starters, our SAN uses extremely fast connectivity. It sounds like you're moving your disk I/O over the network, which is a fairly significant bottleneck (even Gb). We also have the flexibility of multiple tiers - 1st tier being expensive, fast disks, and 2nd tier being cheaper IDE drives. I imagine you can fake that a variety of ways, but it's built in. Finally, there's the enclosure itself, with redundant power and such.
Still, I bet you could do what you want on the cheap. Being in health care, response time and availability really are life-and-death, but many other industries don't need to spend the extra. Best of luck.
A good $20k RAID array does much more. First, it doesn't use cheap SATA drives, but Fibre Channel or even SAS drives, which are tested to a higher level of quality (each disk costs $500 or more). And those cheap SATA drives also react much more poorly to non-sequential access (like when you have multiple users). They are unusable for serious file serving. You can never compare RAID arrays that use SATA/IDE to ones that use enterprise drives like FC/SCSI/etc., because the drives are quite different.
Then you have the other features like dual redundant everything: controllers, power supplies, etc. Then you have thermal capabilities of rack-mount solutions that often are different from SATA, etc, etc.
http://milek.blogspot.com/2007/04/hw-raid-vs-zfs-software-raid-part-iii.html
Infiltrated dot Net
Porn jokes indeed aside. I may not be an "everyman", but I think I'm close enough. My desire for storage (though not yet in the terabyte range) comes from my photography (no, not porn...). I take a bunch of pictures, and, well, because storage is cheap I leave them all at the original file size (which in this case is about 2-5 MB depending).
I don't have a proper video camera, but I'm sure that people who do, have even bigger storage requirements.
Not only that, what with all the music you can copy off a friend's HD now, your storage just jumps a bit more! (I've got literally more than 10 gigabytes of music on my desktop HD. And I know people who have hundreds of CDs, so if they ripped all those, they would have much more...)
Added to all those movies you can either rip or download...
Chuck in a decent network, family and/or friends, and you can now stream all this stuff around to wherever you want it.
I'd say, then, that the most common use of all this space is multimedia. Not sure who has terabytes of multimedia, though.
I wank in the shower.
Speaking from personal experience - This file system is far from ready. It can kernel panic and reboot after minor IO errors, we were hosed by it, and probably won't ever revisit it. This phenomenon can be repeated with a USB device, you might want to try it before you hype it. Try a Google search on it and see what you think...there is no fsck or repair, once it's hosed, it's hosed, the recovery is to go to tape. http://www.google.com/search?hl=en&q=zfs+io+error+kernel+panic&btnG=Google+Search
Businesses buy SANs to consolidate storage, placing all their eggs in one basket. They need redundant everything, which this doesn't have. Additionally, SATA drives are not as reliable long term as SCSI. Compare the data sheets for Seagate drives, they don't even mention MTBF on the SATA sheet.
Businesses also want service and support. They want the system to phone home when a drive starts getting errors, so a tech shows up at their door with a new drive before they even notice there are problems. They want to have highly trained tech support available 24/7 and parts available within 4 hours for as long as they own the SAN.
Finally, the performance of this solution almost certainly pales in comparison to a real SAN. These are all things that a home grown solution doesn't offer. Saving 47K on a SAN is great, unless it breaks 3 years from now and your company is out of business for 3 days waiting for a replacement motherboard off eBay.
That being said, everything has a cost associated with it. If management is ok with saving actual money in the short term by giving up long term reliability and performance, then go for it. But by all means, get a rep from EMC or HP in so the decision makers completely understand what they're buying.
ZFS does not obsolete NAS/SAN. However, in a great many instances, DIY fileservers were more appropriate than SAN or NAS solutions long before ZFS came along, and ZFS has done little to change that situation (though administering ZFS is more straightforward and in some ways more efficient than the traditional, disparate strategies to achieve the same thing).
I haven't gotten the point of standalone NAS boxes. They were never fundamentally different from a traditional server, but with a premium price attached. I may not have seen the high-end stuff, however.
SAN is an entirely different situation altogether. You could have ZFS implemented on top of a SAN-backed block device (though I don't know if ZFS has any provisions to make this desirable). SAN is about solid performance to a number of nodes with extreme availability in mind. Most of the time in a SAN, every hard drive would be a member of a RAID, with each drive having two paths to power and to two RAID controllers in the chassis, each RAID controller having two uplinks to either two hosts or two FC switches, and each host either having two uplinks to the two different controllers or to two FC switches. Obviously, this gets pricey for good reason which may or may not be applicable to your purposes (frequently not), but the point of most SAN situations is no single point of failure. For simple operation of multiple nodes on a common block device, HA is used to decide which single node owns/mounts any FS at a given time. Other times, a SAN filesystem like GPFS is used to mount the block device concurrently among many nodes, for active-active behavior.
For the common case of 'decently' available storage, a robust server with RAID arrays has for a long time been more appropriate for the majority of uses.
XML is like violence. If it doesn't solve the problem, use more.
150GB mp3s
80GB DVDs
120GB games
14GB/hr for DV editing
1 whole drive for OSes
RAID-5ed (1 parity drive)
So I'm up to four 200GB drives right now, without even trying hard.
Soon I'm going to jump to 500GB drives, and I expect to be hitting their limits in a year or so.
Also, how the hell am I supposed to back up all this?! Incrementals would be 10GB+ / week
You're not trying hard enough ;)
I've got just over a terabyte of live storage around the house and I probably use about half of it - I have a couple of hundred gigs of video and about 60 gigs of music. I know of someone who is currently buying seven of Hitachi's new terabyte HDs for an in-home video streaming system. There's always someone who has a use for it.
I'm a photographer and my RAW image files are 15MB each. At every shooting, I come back with 1 to 8GB worth of data to be processed. My workflow involves working on 16-bit TIFFs that weigh in excess of 40MB/file and I'm not even counting the photoshop work files. 40GB would last less than a week here.
:P
Not being rich, I have a couple of external HDs totalling a little less than 1TB, and it's nearly full. The rest is archived on DVD or transferred to HD for storage (cheaper, faster and more reliable than DVD).
So yeah, I can easily imagine why any organisation dealing with huge media files would be interested. Heck, I'd be a client for a safe, multi-TB storage system if I could afford it... Not everybody only deals with text files for a living
-- It's always darker before it goes pitch black.
It's no NetApp - yet. One thing to realize is that iSCSI target isn't even in Solaris proper yet - you have to run Solaris Express or OpenSolaris for the functionality. That may be fine for some people, but it's a deal-breaker for most companies - you're really going to place all those TB of data on a system that's basically unsupported? I'm sure Sun would lend you a hand for enough money, but running essentially a pre-release version of Solaris is a non-starter where real business is concerned. Even when iSCSI target makes it into Solaris 10 - which should be in the next release - are you really comfortable running critical services off of essentially the first release of the technology? Furthermore, while ZFS is amazingly simple to manage in comparison to any other UNIX filesystem/volume manager, it still requires you to know how to properly administer a Solaris box in order to use it. Even GUI-centric sysadmins are generally able to muddle through the interface on a Filer, but ZFS comes with a full-fledged OS that requires proper maintenance. Your Windows admins may be fine with a NetApp - especially with all that marvelous support you get from them - but ask them to maintain a Solaris box and you're asking for trouble. Not to mention, since it's a real, general purpose server OS, you'll have to maintain patches just like you do on the rest of your servers - and the supported method for patching Solaris is *still* to drop to single user mode and reboot afterwards (yes, I know that's not necessarily *required*). Also, "zfs send" is no real replacement for snapmirrors. And while ZFS snapshots are functionally equivalent to NetApp snapshots, there is no method for automatic creation and management of them - it's up to the admin to create any snapshotting scheme you want to implement. Don't get me wrong - I love ZFS and I use it wherever it makes sense to do so. 
It may even be acceptable as a "poor man's Filer" right now, assuming you don't need iSCSI or any of the more advanced features of a NetApp. In fact, it's a really great solution for home or small office fileservers, where you just need a bunch of network storage on the cheap - assuming, of course, that you already have a Solaris sysadmin at your home or small office. Just don't fool yourself, Filer it ain't - at least not yet.
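Since the comment above notes that any snapshot schedule has to be scripted by the admin, here is a minimal sketch of what such a rotation script might compute. The dataset name tank/home, the "auto-" naming scheme, and the 7-day retention are all assumptions made for illustration. The function only builds `zfs snapshot` and `zfs destroy` argument lists; it does not run them.

```python
from datetime import datetime, timedelta

PREFIX = "auto-"  # hypothetical naming scheme for scripted snapshots


def snapshot_name(dataset, when):
    """Build a name such as tank/home@auto-2007-05-30-1200."""
    return f"{dataset}@{PREFIX}{when:%Y-%m-%d-%H%M}"


def plan_rotation(dataset, existing, now, keep=timedelta(days=7)):
    """Return (create_cmd, destroy_cmds) as argument lists; nothing is executed."""
    create = ["zfs", "snapshot", snapshot_name(dataset, now)]
    destroy = []
    prefix = f"{dataset}@{PREFIX}"
    for name in existing:
        if not name.startswith(prefix):
            continue  # leave manually created snapshots alone
        taken = datetime.strptime(name[len(prefix):], "%Y-%m-%d-%H%M")
        if now - taken > keep:
            destroy.append(["zfs", "destroy", name])
    return create, destroy
```

A cron job could call plan_rotation with the current snapshot list and hand each argument list to subprocess.run; that wiring is left out here on purpose.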
This doesn't strike me as having much to do with ZFS at all. You've been able to do a home grown NAS / SAN box for years on the cheap using commodity equipment. Take ZFS out of the picture and you just need to use a hardware raid controller or a block level RAID (like dmraid on Linux or geom on FreeBSD). There are even canned solutions for this, like OpenFiler.
That being said, this sort of solution may or may not be appropriate, depending on site needs. Sometimes support is worth it.
You're also grossly overestimating the cost of an entry-level iSCSI SAN solution. Even going with EMC, hardly the cheapest of vendors, you can pick up a 6TB solution for about $15k, not $50k. Go with a second tier vendor and you can cut that number in half.
Some of these issues looked familiar, so I thought I'd do a basic comparison:
Reiser4 had the same problems with fsync -- basically, fsync called sync. This was because their sync is actually a damned good idea -- wait till you have to (memory pressure, sync call, whatever), then shove the entire tree that you're about to write as far left as it can go before writing. This meant awesome small-file performance -- as long as you have enough RAM, it's like working off a ramdisk, and when you flush, it packs them just about as tightly as you can with a filesystem. It also meant far less fragmentation -- allocate-on-flush, like XFS, but given a gig or two of RAM, a flush wasn't often.
The downside: Packing files that tightly is going to fragment more in the long run. This is why it's common practice for defragmenters to insert "air holes". Also, the complexity of the sync process is probably why fsync sucked so much. (I wouldn't mind so much if it was smarter -- maybe sync a single file, but add any small files to make sure you fill up a block -- but syncing EVERYTHING was a mistake, or just plain lazy.) Worse, it causes reliability problems -- unless you sync (or fsync), you have no idea if your data will be written now, or two hours from now, or never (given enough RAM).
(ZFS probably isn't as bad, given it's probably much easier to slice your storage up into smaller filesystems, one per task. But it's a valid gotcha -- without knowing that, I'd have just thrown most things into the same huge filesystem.)
There's another problem with reliability: Basically, every fast journalling filesystem nowadays is going to do out-of-order write operations. Entirely too many hacks depend on ordered writes (ext3 default, I think) for reliability, because they use a simple scheme for file updating: Write to a new temporary file, then rename it on top of the old file. The problem is, with out-of-order writes, it could do the rename before writing the data, giving you a corrupt temporary file in place of the "real" one, and no way to go back, even if the rename is atomic. The only way to get around this with traditional UNIX semantics is to stick to ordered writes, or do an fsync before each rename, killing performance.
I think the POSIX filesystem API is too simplistic and low-level to do this properly. On ordered filesystems, tempfile-then-rename does the Right Thing -- either everything gets written to disk properly, or not enough to hurt anything. Renames are generally atomic on journalled filesystems, so either you have the new file there after a crash, or you simply delete the tempfile. And there's no need to sync, especially if you're doing hundreds or thousands of these at once, as part of some larger operation. Often, it's not like this is crucial data that you need to be flushed out to disk RIGHT NOW, you just need to make sure that when it does get flushed, it's in the right order. You can do a sync call after the last of them is done.
Problem is, there are tons of other write operations for which it makes a lot of sense to reorder things. In fact, some disks do that on a hardware level, intentionally -- SATA calls it "native command queuing". Using "ordered mode" is just another hack, and its drawback is slowing down absolutely every operation just so the critical ones will work. But so many are critical, when you think about it -- doesn't vim use the same trick?
What's needed is a transaction API -- yet another good idea that was planned for someday, maybe, in Reiser4. After all, single filesystem-metadata-level operations are generally guaranteed atomic, so I would guess most filesystems are able to handle complex transactions -- we just need a way for the program to specify it.
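For reference, the tempfile-then-rename trick discussed above, with the fsync that makes it safe on filesystems that reorder writes, might look roughly like this. It is a sketch, not a transaction API: the function name atomic_write is made up for illustration, and it pays the full fsync cost on every call, which is exactly the performance complaint raised above.

```python
import os
import tempfile


def atomic_write(path, data: bytes):
    """Replace the file at path with data, or leave the old file untouched."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force the data to disk before the rename
        os.replace(tmp, path)  # atomic rename over the old file
    except BaseException:
        try:
            os.unlink(tmp)  # clean up the temp file if the rename never happened
        except FileNotFoundError:
            pass
        raise
```

Dropping the os.fsync call gives the faster variant that is only safe when the filesystem keeps data writes ordered ahead of the rename.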
The fragmentation issue I see as a simple tradeoff: Packing stuff tightly saves you space and gives you performance, but increases fragmentation. Running a defragger (or "repacker") every once in awhile would have been nice. Problem is, they never got one written. Common UNIX (and Mac) philosoph
Don't thank God, thank a doctor!
I guess this setup could replace some people's need for a turnkey NAS solution. But your thinking it could replace SAN solutions shows you haven't looked into SAN too much. To start, there's a reason Fibre Channel is way more popular than iSCSI. The financial services company I work for has about 3 petabytes of SAN storage, and not a drop of it is iSCSI. Storage Area Networks are special built for a purpose. They typically have multiple fabrics for redundancy, special purpose hardware (we use Cisco Andiamo, i.e., the 9500 series), and a special purpose layer 2 protocol (Fibre Channel). iSCSI adds the overhead of TCP/IP. TCP does a really nice job of making sure you don't drop packets, i.e. layer 3 chunks of data, but at the expense of possibly dropping frames, i.e. layer 2 data. The nature of TCP just does this, as it basically ramps up data sending until it breaks, then slows down, rinse and repeat. This also has the effect of increasing latency. Sometimes this is okay, people use FCIP (Fibre Channel over IP), for example. But, sometimes it's not. Fibre Channel does not drop frames. In addition, Fibre Channel supports cool things like SRDF which can provide atomic writes in two physically separate arrays. (We have arrays 100 km away from each other that get written basically simultaneously and the host doesn't think its write is good until both arrays have written it.) So, like I said, this might be good for some uses, but not for any sort of significant SAN deployment.
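The "ramps up until it breaks, then slows down" behaviour described above is usually called additive increase, multiplicative decrease. A toy sketch of that sawtooth follows. It is a deliberately simplified model (no slow start, no timeouts, loss assumed whenever the window exceeds a fixed capacity), and the parameter names are invented for illustration; real TCP stacks are considerably more involved.

```python
def aimd(rounds, capacity, increase=1.0, decrease=0.5, start=1.0):
    """Return the congestion window at the start of each round trip."""
    window = start
    history = []
    for _ in range(rounds):
        history.append(window)
        if window > capacity:
            # Loss detected: back off multiplicatively, never below one segment.
            window = max(1.0, window * decrease)
        else:
            # No loss: probe for more bandwidth additively.
            window += increase
    return history
```

Calling aimd(8, capacity=4) shows the window climbing one step per round, overshooting the capacity, halving, and climbing again.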
"Nature doesn't care how smart you are. You can still be wrong." - Richard Feynman
Google have a great solution that focuses on the “cheap” part without compromising the other two much. If you have not read up on the Google Filesystem, definitely take the time to. At the very least, it seems to call into question the need to shell out tens of thousands for high-end storage solutions that promise reliability in proportion to the dollar.
Why bother.
Two words: High Bitrate.
If you like your music to actually SOUND good, 128kbps sucks. I personally rip my music using a Variable bitrate between 224 and 320 kbps. Unfortunately, this makes for VERY large files. But my music sounds FANTASTIC!
Official Heretic from the "Church of Global Warming". Proven right thanks to whistle blowers. AGW = Flat Earth Theory
Potentially it will obsolete low-end NAS/SAN hardware (eg: Dell/EMC AX150i, StoreVault S500) in the next couple of years, for companies who are prepared to expend the additional people time in rolling their own and managing it (a not insignificant cost - easily making up $thousands or more a year). There's a lot of value in being able to take an array out of a box, plug it in, go to a web interface and click a few buttons, then forget it exists.
However, your DIY project isn't going to come close to the performance, reliability and scalability of even an off the shelf mid-range SAN/NAS device using FC drives, multiple redundant controllers and power supplies - even if the front end is iSCSI.
Not to mention the manageability and support aspects. When you're in a position to drop $50k on a storage solution, you're in a position to be losing major money when something breaks, which is where that 24x7x2hr support contract comes into play, and hunting around on forums or running down to the corner store for some hardware components just isn't an option.
ZFS also still has some reliability aspects to work out - eg: hot spares. Plus there isn't a non-OpenSolaris release that offers iSCSI target support yet AFAIK.
I've looked into this sort of thing myself, for both home and work - and while it's quite sufficient for my needs at home, IMHO it needs 1 - 2 years to mature before it's going to be a serious alternative in the low-end NAS space.
Linux has more performance testing on x86 than OpenSolaris (so you are not as likely to run into a bad bottleneck). On Linux you can create a RAID-1, -4, -5 and -6 under Multiple Device Driver Support in the kernel. You can then use mkraid to include all the drives you want. This code is not new at all. It was stable in 2.4, maybe even in 2.2.
After that you just create a filesystem on top of the raid. If you don't like ext3 or don't trust it, there is always xfs. I had some rough times with reiserfs, xfs, and ext3 and for all the experience I had I would go xfs for long running server environments (and now get flamed for this little bit, use ext3 all you want).
The advantage is that you use very well tested code.
The problem comes with hotswapping. I don't know if the drivers are up to that yet. But I also highly doubt that OpenSolaris SATA drivers for some low price chip in a low price storage box can deal with hotswapping. So Linux might be faster on that one.
That is a setup I would compare to a plug'n play SAN solution. And it totally depends on the environment. If the Linux box goes down for some reason for a couple hours/days, how much will that cost you? If it is more than twice the SAN-solution, you might just buy the SAN and if it fails just pull the disks and put them in the new one. I dunno if that would work on Linux.
Ever heard of Starfish? It's a new distributed clustered file system:
Starfish Distributed Filesystem
From the website:
And you can build clusters at relatively low cost:
(warning: I work for the company that created Starfish)
-- manu
Manu Sporny (skype: msporny, twitter: manusporny, G+: +Manu Sporny)
Founder/CEO - Digital Bazaar, Inc.
Consolidate your multimedia and run MythTV for a while. Once you rip and encode several TV series, all your DVD films, and have the Myth recording your favourite shows, a terabyte doesn't seem that much. If you want an idea for future examples of massive storage consumption, imagine having MythTV recording all channels all the time, so you'd basically be able to decide post-transmission what you want to view and save...
Of course, while I agree most NAS and SAN solutions are grotesquely overpriced and mainly useful for separating fools from their money, I can't really see why one would bring up ZFS and OpenSolaris for this purpose. Something like Openfiler would be vastly more appropriate, proven and easy to manage.
Well, I'll have to buy a digital camera that shoots in ASCII. Oh wait.......
I prefer Flambe as opposed to flamebait.
I actually have two 48GB databases full of minimal instruction sequences for generating boolean functions. Do I win the obscure use of disk space prize?
Program Intellivision!
If you need more than 1-3 TB, you can't use generic components
;-)
Why not?
Sure, a 16-channel SATA controller with RAID 0/1/5 will cost you $400. But that will handle, using 750GB drives that have recently entered the "affordable" range, a total of 12TB (or more practically, a 10.5TB RAID5 with one hot spare). Find an OEM that can set you up with that for under $5000 total.
Now, that uses a PC chassis and wouldn't look "nice" in a rack. So what? If you need 10TB and don't want to blow $50k on it, you don't have a lot of choices... So if you insist on all racked equipment, buy a rack shelf kit and lay it on its side (and hide it with blanks if you care that much)
I've been using ZFS for quite some time, and have yet to have any form of failure or data corruption. I've used it with simple JBOD, SAN attached, even USB attached drives - it just plain works. With Solaris 10 U3, and the latest revs of OpenSolaris, you have access to RAIDZ2 - which gives you double parity, even more protection. Snapshotting can be scripted to run as often as you like. I keep 2 months worth of snapshots every 5 minutes day in and day out. You can disassemble the system, scramble the drives, reattach and bring the system back up and all will be well. ZFS will just re-assemble the pool and continue. You can replace drives on the fly. Let's say you make your 12 disk enclosure with 750GB drives. Now, 2 years later you want to replace them with 2TB drives. Simply use the zpool replace command to replace each drive, wait for it to resilver and rebuild the data on the replaced drive, then move on to the next drive. As you replace them, the pool will grow automatically. This would grow your (assuming 12x750GB in a RAIDZ2) 7.5TB pool to a 20TB pool, without downtime. With OpenSolaris on x86, you can even boot off of ZFS now, so use ZFS to mirror your boot disk with a like drive and you should be good for quite some time... Using something like Sun's Thumper system, you can get a 12TB system for less than 30k (for those who have something akin to a budget). ZFS, it's fast, safe, secure - and I enjoy working with it (as I don't have to do much with it).
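The 7.5TB and 20TB figures above follow from simple parity arithmetic, sketched here. This is a back-of-the-envelope estimate that assumes identical drives and ignores filesystem metadata overhead and the TB versus TiB difference; the function name is made up for illustration.

```python
def raidz_usable_tb(drives, drive_tb, parity):
    """Rough usable capacity: every drive except the parity-equivalent ones."""
    if drives <= parity:
        raise ValueError("need more drives than parity devices")
    return (drives - parity) * drive_tb


# 12 drives in RAIDZ2 (double parity), before and after the drive swap
before = raidz_usable_tb(12, 0.75, parity=2)  # 750GB drives
after = raidz_usable_tb(12, 2.0, parity=2)    # 2TB drives
```

With single-parity RAID-Z the same 12 x 750GB enclosure would come out at 11 x 0.75 = 8.25TB by this estimate.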
Who is general failure, and why is he reading my hard drive?
They made the Buffalo Terastation for guys like you. Google it, it's not too expensive and is pretty much hands-off.
Why? Because it's 128 bits! One hundred and twenty eight fucking bits! That's 64 bits more than any other FS, so any fool can see that it's twice as good as the alternatives.
It may be new and untested, but that's hardly important in the face of 128 fucking bits is it? Besides it's designed by Sun engineers and nobody has more experience in FS design and implementation. That's why all the previous Solaris filesystems rocked so hard. Nothing can beat UFS in terms of stability and performance after all.
Oh, and I nearly forgot - because it's made by Sun it's going to be 10 times as Enterprisey as any half-baked, so-called "tried and tested" RAID/SAN solution that those other suppliers are going to come up with.
Quite frankly, the fact that you're even asking this question suggests you are guilty of criminal hype evasion.
What amazes me is all the talk of iSCSI, but almost no mention of AoE (ATA over Ethernet).
What you have is a box that exports block devices out over layer 2. Another device loads it as a block device, and can now treat it in whatever fashion it could deal with any other block device. So, for example, I have 2 "shelves" of Serial ATA drives going. I have a third box that I could either load Linux on, using md to create RAID sets, or what I've actually done is used the hardware on each of the two shelves, created a RAID5 set on each, then used md to create a RAID1 set out of the two RAID5s. I then take my spankin' new md0 device, which is huge for my needs (7.5TB), use LVM to create a volume group (called 'office' for me), and that creates /dev/office. Then I create several lv's (logical volumes) of arbitrary size beneath *that*. So I have /dev/office/home, /dev/office/mp3, /dev/office/blah, etc.
Now you can format those lv's like any other partition/slice. I've used xfs on all of mine, but you could use ext2/3 if you really wanted.
Karma: Chameleon (mostly due to the fact that you come and go).
I'd like the following scenarios explained.
RAID0 = bunch of hard drives strung together, look like one big drive. In the implementation I'm referring to, the data is striped to a block size and written across each disk simultaneously (or nearly). This is the fastest disk subsystem available but the most susceptible to failure. If one disk fails, you're toast.
Does ZFS do anything in this situation? I have 'one big drive' presented to the operating system, the striping is abstracted at the hardware layer, and I have a semi-expensive ($300 - 800) RAID card running this. In a random I/O workload I get about 150 iops per drive, streaming is another ball of wax typically more interface limited/block size limited than head movement.
To save money, I'd drop the RAID card and put ZFS down. I now have 12 drives (SAS attached), can I get better performance with ZFS like I could with the RAID Card? Think Log Drives for a big DB, or scratch storage space while manipulating a metric assload of video files. Gbit / sec transfer rates for real-time storage of HD Video. In a random I/O workload I get about 150 iops per drive, streaming is another ball of wax.
RAID 0+1. All the perf benefits of RAID0, but 2x the drives. Typically two cabinets RAID 0'd then RAID 1 the two RAID 0s. I get redundancy at a slight penalty of performance due to 2x the writes happening, but no degradation on reads.
What can ZFS do for me here? Again, performance improvements/changes?
RAID5
50% penalty in performance even with the best card because in a high drive count you have to read in data, calc, then do the write. However one drive gives me full redundancy; if I lose a second, though, I'm toast. RAID6 sometimes is used to describe distributing the hot spare into the array, so no more disk space but can take two simultaneous drive failures and keep running.
What does ZFS do for us here?
This isn't a troll, I really know nothing about ZFS and I'm really curious how I could not have to do the above to protect my photography / video data for my photography business. Would be cool if I could do it on my Mac Pro
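To make the RAID5 "read in data, calc, then do the write" cycle from the questions above concrete, here is a small sketch of XOR parity over equal-sized blocks. It illustrates the general single-parity idea only; it is not how any particular RAID card or ZFS lays out data, and the function names are invented for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(surviving, parity):
    """Recover the one missing data block from the survivors plus parity."""
    return xor_blocks(surviving + [parity])


def update_parity(old_parity, old_data, new_data):
    """Small write: read old data and old parity, then compute the new parity."""
    return xor_blocks([old_parity, old_data, new_data])
```

The update_parity path is where the write penalty comes from: one logical write turns into two reads (old data, old parity) and two writes (new data, new parity).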
As a rock-and-roll physicist once said, No matter where you go, there you are.
It seems to me that even if the entire setup is prone to failure, all you really need is a gigabit crossover or two running to an identical setup. I don't know if ZFS does anything like this, but I can think of at least one way to make it work on Linux: DRBD + OCFS2 + heartbeat. If you're smart, you can even do some load balancing, at least until one of them fails -- and when that happens, the other should be able to take over very quickly, if not instantly -- Linux heartbeat means it would simply take over the other machine's IP and start its services.
So, that's $6k total instead of $3k.
The one problem I have with OCFS2 is that when it fences a system, it tends to either bring the whole thing down (kernel panic), or in newer versions, give you the option of forcibly rebooting instead. This killed it for a project I was working on, where one of the machines had other mission-critical systems running that were not on the OCFS2, and thus, it seemed retarded to panic and bring down everything else too.
So if that's your problem, you can always build a third, identical system to run the other stuff on. $9k.
Even if you figure another $1k for random stuff, like maybe a LOT of gigabit crossovers, or 10gig fiber, or something, that's still a fifth of the cost of the "business-grade" or whatever else he was considering. Even assuming the worst-case scenario, where the homebrew system costs a lot more to maintain (even electricity and cooling, maybe), how long will it take for it to cost another $40k? And this way, you have an ENTIRELY redundant system -- the only way you lose it is if, say, the whole building blows up.
I mean, I sort of agree that you get what you pay for. But when the difference in price is that much, the only way it's ever worth it is if there's really great support with the high-end package. And is it $40k worth of support? If not, I imagine this guy could put together a company selling little $3k, $6k, and $10k systems for $20k each (including support), shaving off $30k even for the most paranoid.
And all of that is pretending you're right about the cheap consumer-grade hardware actually being less reliable.
Don't thank God, thank a doctor!
I am the original poster, and I am not actually a typical user.
I routinely work with files that are 100 GB - 300 GB each.
Just copying one file from drive to drive takes hours.
I have about 4 Terabytes in use, with another 4 Terabytes for backup.
My usage is the exact opposite of database usage (which most storage is optimized for).
I need to copy huge sequential files. I rarely need many small reads or writes.
Because of the long times it takes to move these files around, I think NFS or CIFS would be too slow. That's why I am interested in the ability of ZFS to easily export iSCSI targets. Some tests I read showed that ZFS exporting iSCSI is about 4 times faster than ZFS exporting NFS or CIFS.
I am comparing to drives directly attached via eSATA, so it's got to be fast to come anywhere close to what I get with eSATA.
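A back-of-the-envelope sketch of why file size dominates here. The 125 MB/s figure is just the raw Gigabit Ethernet line rate (1000 Mbit/s divided by 8) before any protocol overhead, and the 60 MB/s figure is a purely hypothetical sustained rate picked for illustration; neither is a measurement of ZFS, iSCSI, NFS or eSATA.

```python
def copy_hours(size_gb: float, rate_mb_s: float) -> float:
    """Hours needed to move size_gb gigabytes at rate_mb_s megabytes/second."""
    seconds = size_gb * 1000 / rate_mb_s
    return seconds / 3600


GIGABIT_LINE_RATE_MB_S = 1000 / 8   # raw line rate, no protocol overhead
HYPOTHETICAL_SUSTAINED_MB_S = 60    # assumed figure, not a benchmark

for size_gb in (100, 300):
    best = copy_hours(size_gb, GIGABIT_LINE_RATE_MB_S)
    slower = copy_hours(size_gb, HYPOTHETICAL_SUSTAINED_MB_S)
    print(f"{size_gb} GB: {best:.2f} h at line rate, {slower:.2f} h at 60 MB/s")
```

Plugging in a measured throughput for each protocol instead of the assumed numbers gives a quick way to compare them against the eSATA baseline.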
- For the complete works of Shakespeare: cat
From the ZFS FAQ:
Q: Suppose I have three E450s. Would ZFS allow me to integrate storage across all three boxes into one big "poor man's SAN"?
A: No, ZFS is a local filesystem (for the time being). To access storage attached to a different host, use NFS.
That's why I am doing an upgrade to larger disks soon. Hopefully I can get a deal on 1TB drives. The data has all been offloaded to a myriad of machines 3 times so I could upgrade the arrays and stay consistent with disk sizes. Latency isn't as noticeable as one would think. The array is mounted read-only except for scheduled uploads of new content (usually only 2-3 times a month). Once the reads start, they play without any problem (I've never played more than 2 HD and 4 SD streams at once), and writes are slow (read that as EXTREMELY slow) but not a problem, as I only sync 2-3 times a month, like I said. I'm looking to do a single RAID6 using 2 PCI-E 16x cards and up to 12 drives. My initial storage requirements have been met and I only add a few movies and EPs a month now, so a gain of 4TB over my current capacity would keep me going for a couple of years, considering I am only using a little over 6TB right now. The reason I have the first RAID5 is that the SATA II port multipliers I am using support JBOD, RAID 0, 1 and 5. Backup is not CRITICAL, as I have legit copies of all the movies, and most of the TV shows have been bought as season bundles. Redundancy of data is important, though, so I can suffer a pretty massive crash of a number of drives in this setup. It seems like a reasonable trade-off to use RAID as I did to ensure I could recover without having to rip everything all over again. At the time I built this (200GB drives were new at the inception), RAID6 was not an option, so please don't call me stupid for building a setup on a much smaller scale and growing with it until a large enough disk capacity became available at a price point to make it worth building a system from scratch. It's served my needs, and now with the 750s at a good price point and 1TBs coming out, the next few months will see my capacity grow and complexity diminish.
> I cant really see why one would bring up ZFS and OpenSolaris for this purpose
Here's why:
1) Snapshots. ZFS lets me make lots of snapshots to protect myself from user error, viruses, etc destroying my data. ZFS snapshots are so lightweight that I can make them hourly at nearly no cost in time and disk space.
2) Data integrity. Even RAID-5 can allow some errors to creep into my data (google: bit rot). ZFS has a much higher level of data integrity protection.
3) Cost/Performance. ZFS RAID-Z appears to be much faster than software RAID-5. It appears to be as fast as or faster than hardware RAID-5, and hardware RAID-5 is much more expensive than software.
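To make point 2 concrete: ZFS keeps a checksum for every block and verifies it on read, which is how silent corruption gets noticed. The sketch below only illustrates that idea with Python's hashlib. The block contents, the flipped bit position and the helper names are made up for the example, and this is not ZFS's actual on-disk format.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Checksum of one data block (SHA-256 chosen for the illustration)."""
    return hashlib.sha256(block).hexdigest()


def verify(block: bytes, expected: str) -> bool:
    """True if the block still matches the checksum recorded at write time."""
    return checksum(block) == expected


block = b"photo data " * 100
stored = checksum(block)        # recorded separately from the data itself

# Simulate bit rot: a single bit flips on disk without any I/O error.
damaged = bytearray(block)
damaged[500] ^= 0x01
damaged = bytes(damaged)

print(verify(block, stored))    # intact block passes
print(verify(damaged, stored))  # rotted block is caught on read
```

Plain RAID-5 parity only comes into play when a drive reports a failure, so a block that reads back "successfully" with a flipped bit goes unnoticed. A per-block checksum catches it, and a redundant copy can then be used for repair.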
- For the complete works of Shakespeare: cat
And what happens when the RAID controller fails and corrupts all of your drives?
Because I've seen that happen more than once.
I'm not saying the more expensive solution is better. I'm just saying that in my personal experience I've seen *more* data destroyed from RAID controller failure than from hard drive failure. I would love to find out the solution to that one.
I do not claim to be a hardware expert or system administrator, so there may be a well known solution (don't buy 'brand X' RAID controllers). I just don't happen to know it.
I worked at ATMEL many years ago in their EPROM division. I had an up close and personal view of the screening flows, both Military and otherwise. Let's put aside the issue of Military screening, which is extensive and costly. You can't make very much out of Military grade ICs, because there are not very many available.
The difference between commercial and industrial parts is one of operating temperature, not quality. (In point of fact, there was no actual difference in the screening or handling.) The quality standards for both parts were the same - the goal was always zero defects. I spent weeks weeding out a problem with a 50 ppm failure rate that was slipping through our screening, and everyone was damned happy when I fixed it.
There's no reason to expect a correlation between maximum operating temperature and quality. A part might run too slow at elevated temperature to pass, but this will usually happen for process variation reasons that do not affect the expected lifetime of the part.
Any part coming from a reputable IC manufacturer should have the same level of quality, regardless of the rating.
Now, that being said, there is a very serious quality issue that an OEM does need to address, and that's counterfeit parts. If an OEM is not careful about where their parts come from, or buys them cheap and looks the other way, then their quality will obviously suffer. But this isn't so much a matter of commercial versus industrial quality; it's about honest versus dishonest business practices.
It's not wasting time, I'm educating myself.
You can pick up those 750GB Seagate SATA drives for about $200 each now...
I'd never claim to be an everyman, but I broke 2TB on my desktop three years ago with a huge pile of SATA drives and a couple extra controller cards. Besides, chicks dig the little side-cart full of hard drives :) I just took a couple of four-slot drive cages from cheap PC cases and built them into an enclosure, complete with its own ATX power supply.
:)
Of course that was before I jumped onto the NAS storage cash cow, doing pretty much exactly what the article poster wrote, only I turned around and sold my PC-based NAS boxes for about half the price of "enterprise" solutions, which still represents a 400% markup for me
You have to realize, the companies and people building these overpriced RAID arrays are just your average greedy bastard, usually no smarter or more skilled than any other geek. Most of the computer-attached devices today are little more than an XScale processor, a tiny bit of RAM and Linux. Broadband routers, NAS boxes, KVM/IP switches, "smart" network adapters, heck I wouldn't be surprised to find home entertainment devices running Linux. We're in the age of mashups, where any idiot with a marketing budget can slap various I/O ports on a board and "invent" appliances.
-Billco, Fnarg.com
Old guys like me may remember when National Semiconductor was, if I recall rightly, fined for faking test records in the same year they won an award for the reliability in the field of their military products. Or the discovery in the early 90s that volume produced Japanese semiconductors were far more reliable than many JAN devices. There is just no substitute for having to manufacture in volume in a competitive world.
In semiconductors, the downside is that things that produce higher reliability, like thicker oxide and bigger anti-static diodes, also slow down clocks. You would think that, for a really reliable disk array, you would want a less than state of the art system running conservatively. I guess that this is a case where having a great deal of practical and experimental experience is the best recipe for success, and perhaps this is where SAN manufacturers shine.
Pining for the fjords
I'm familiar with SATA and IDE...but, the FC ones are new to me..
;-)
Just a brief summary:
-- SATA refers to the new Serial ATA.
-- ATA or PATA refers to the older "Parallel" ATA. (ATA dates back to IBM PC AT and refers to that machine's AT Attachment interface.)
-- IDE refers to any drive (ATA, SCSI...) with integrated drive electronics, that is, everything that has come after the ancient dumb drives that required a model-specific controller on the motherboard. In other words, not a very useful term anymore.
-- SCSI refers to Small Computer System Interface; funny how it's the one used in the bigger iron. Beats the pants out of ATA when handling multiple daisy-chained drives; SATA is catching up in handling multiple drives. SCSI also has parallel interface and cabling.
-- SAS refers to Serially Attached SCSI (some inspiration from SATA perhaps?).
-- FC refers to Fibre Channel, a SCSI-like very fast interconnect type and interface protocol; often (but not always) uses optical cabling.
-- iSCSI refers to SCSI carried over TCP/IP, typically on Ethernet (thus it could be "SCSIoIP"...).
But I never understood the difference between a SAN and a NAS when the configuration gains any complexity beyond a textbook example. You can have a SAN with many NAS boxes, or you can have NAS with multiple SANs, sooo...
A SAN is not a host. It presents itself to a host machine as native storage in the form of RAID groups/LUNs and/or raw storage. Access controls related to end users are handled by the host OS, not the SAN. The SAN has no concept of file locking either; that is also done at the OS level on the host. The SAN does, however, provide access controls for which host OS can connect to it. A NAS is the storage plus some type of OS supplying network shares to a host. There are many tools that can make a NAS appear to a host as a native file system as well, which kind of blurs the lines.
In really really simple terms, a SAN provides configurable disk space to a host, a NAS supplies file space and file serving to a host(s). Many storage solutions offer various functionality and can provide both NAS and SAN functionality at some level.
True story:
Two years ago next month, a clumsy plumber got a propane torch too close to a sprinkler head with the expected consequences: LOTS of water took the path of least resistance, where it finally filtered its way into the basement data center, coming out right on top of our SAN.
Obviously, it didn't survive.
"I don't know, therefore Aliens" Wafflebox1
Anonymous Coward writes: "IDE refers to any drive (ATA, SCSI...) with integrated drive electronics, that is, everything that has come after the ancient dumb drives that required a model-specific controller on the motherboard. In other words, not a very useful term anymore."
Well, actually, IDE's history is a bit different from that. IDE requires a host bus interface, but, yes, the drives do have their disk controllers built into the PCA attached to the disk mechanism.
Before Compaq and others developed the first IDE systems, hard drives usually had external controller boards that used low-level commands. IDE standardized the host interface to disk storage at the driver level, and standardized the host bus/drive command set at the bus level.
And it's not just disk drives that use the IDE stack. Other devices can be attached to the IDE bus, too.
SCSI drives require a SCSI host bus adapter with a dedicated processor, and that adapter does the heavy lifting for disk access. IDE requires the host CPU to do a lot of processing, whereas with SCSI the adapter does the majority of the work. This model was used for the FC technology. It, too, offloads the processing from the CPU.
SCSI/FC are preferred in 'big iron' installations. IDE/ATA/SATA are fine on a dedicated NAS system. In effect, the CPU of the NAS motherboard is doing the work that is done on the host bus adapters in SCSI/FC.
At the drive mech level, FC is a copper interface. The design of the connector on the disk mech allows it to be plugged. This provides the ability to quickly replace failed drives. The drive mechs are aggregated into some type of array to provide protection from data loss. This array of drives is then attached to systems via fiber optic cabling.
You can simulate some, but not all, of the benefits of a FC/SCSI array using SATA technology. I don't know if the IDE drivers are being rewritten to use the multi-core processors yet, but that would help reduce some of the latency.
Short answer, if what the OP was aiming for is to get into a large disk array for cheap, trading some reliability and performance for low cost, the idea is a good one. I would be looking for a multi-core cpu in the motherboard and an OS that has parallel processing drivers for the IDE channel. Be sure that all the drives have plenty of cooling. Have a backup solution. Some day, this lash-up will give you heartache, but till then, you've saved money.
Can you tell I used to work in the disk storage business?
Good luck.
Best regards.
I would avoid that card. It's limited to striping or mirroring, for starters. It's also not true hardware raid and depends on the drivers to do all the raid work. You really do get what you pay for here. You also get very little notice when one drive starts going bad. You just start getting random system hangs.
The first problem with your gnuplot script is that you're assuming a Poisson distribution for HDD failures (which is incorrect). Statistical failure distribution follows a Weibull distribution with k roughly equivalent to 7.5. Unfortunately, because you build your argument off of a Poisson distribution approximation, the rest of the analysis doesn't make much sense.
If you are interested in HDD failure rates and failure prediction, there is a fantastic paper done by Bianca Schroeder and Garth Gibson of CMU. I think this is the link to their main research website.
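To show what the Poisson versus Weibull distinction means in practice: a Poisson failure model has a constant failure rate (exponentially distributed lifetimes), while a Weibull with shape k greater than 1 has a failure rate that climbs with age. The sketch below compares the two hazard functions. The k of 7.5 is the figure quoted above, while the 5-year scale parameter and the sample ages are arbitrary assumptions for illustration.

```python
def weibull_hazard(t: float, k: float, scale: float) -> float:
    """Instantaneous failure rate at age t for a Weibull(k, scale) lifetime."""
    return (k / scale) * (t / scale) ** (k - 1)


def exponential_hazard(t: float, scale: float) -> float:
    """Constant failure rate implied by a Poisson failure process."""
    return 1.0 / scale


K = 7.5            # shape parameter quoted in the comment above
SCALE_YEARS = 5.0  # hypothetical characteristic life, for illustration only

for age in (1.0, 3.0, 5.0):
    w = weibull_hazard(age, K, SCALE_YEARS)
    e = exponential_hazard(age, SCALE_YEARS)
    print(f"age {age:.0f}y: Weibull {w:.5f}/yr, constant-rate {e:.5f}/yr")
```

With k set to 1 the Weibull hazard collapses to the constant rate, so the Poisson assumption is just the special case where drives never wear out.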
I think you miss the point of systems such as Starfish and other distributed clustered file systems. You have many other points of failure in a system: memory, CPU, power supply, power outage, motherboard, network switch, OS kernel, router, network cable, and the all important "oops, I tripped over the power cord". There are also times that you want to take down nodes in a highly-available cluster for maintenance without affecting your applications - to do this, you need a file system that assumes and can work around node-level failure.
There is much more to highly-available clustering than just making sure your disk sub-systems are bulletproof.
-- Manu Sporny (skype: msporny, twitter: manusporny, G+: +Manu Sporny)
Founder/CEO - Digital Bazaar, Inc.
Where's the uncertainty? Sun fears Linux, and their programmers have already admitted this is why they deliberately made a GPL-incompatible license. Using their patent minefield to prohibit GPL implementations would be incredibly foolish if widespread use of ZFS were actually their goal.
That's nice except Jonathan Schwartz has indicated that OpenSolaris will go GPL3, assuming the final version of the license is OK.
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
Hello, I work at a 3-letter company whose name starts with "E"; a friend of mine works at another 3-letter company starting with "I" as a field service supervisor. Not a month goes by without seeing a double disk failure in a RAID-5 system from a customer site for either of these companies, and that's on SCSI, SAS and Fibre Channel drives. ATA and SATA drive MTBF values given by drive manufacturers can be off by as much as a factor of 1000, depending on the lot. Most consumer-level drives are listed as "Nonrecoverable Read Errors per Bits Read: 1 per 10^14". A 1TB drive contains 8*10^12 bits, and 10^14 / (8*10^12) = 100/8 = 12.5. So the manufacturer says that if you read the entire drive 12.5 times you will get 1 nonrecoverable read error. If you think the manufacturer is off by a factor of 100, then you get a nonrecoverable read error every 0.125 times you read the entire drive.
Do you still feel your data is safe?
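The arithmetic above is easy to check in a few lines. This sketch uses only the figures quoted in the comment (1 error per 10^14 bits, a 1TB drive, a factor-of-100 fudge). The last step, the chance of at least one error during a single full read, is an added calculation that assumes independent bit errors.

```python
import math

BITS_PER_ERROR = 10**14      # spec: 1 nonrecoverable read error per 1e14 bits
DRIVE_BITS = 8 * 10**12      # 1 TB = 1e12 bytes = 8e12 bits

# Full-drive reads per expected error, per the manufacturer's spec.
full_reads_per_error = BITS_PER_ERROR / DRIVE_BITS

# Same figure if the spec is optimistic by a factor of 100.
pessimistic_reads_per_error = full_reads_per_error / 100

# Chance of at least one error in a single full read, assuming each bit
# fails independently (an assumption, not something the spec promises).
p_error_one_read = -math.expm1(DRIVE_BITS * math.log1p(-1 / BITS_PER_ERROR))

print(full_reads_per_error)
print(pessimistic_reads_per_error)
print(f"{p_error_one_read:.3f}")
```

A RAID-5 rebuild has to read every surviving drive end to end, so the per-read probability compounds across the array, which is one reason a rebuild after the first failure can go wrong.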
We have a large (geographically replicated) Hitachi disk array (as well as many NetApp boxes), mostly it works very well indeed.
However 2-3 years ago we stumbled (very painfully!) across a firmware bug which took the primary Hitachi array down:
As we (i.e. the Hitachi service reps) were upgrading the mirrored cache, an error hit the active half, and it turned out that the firmware would always check the mirror (a very good idea, right?) before falling back on re-reading the disk(s). However, the firmware error handler which could have handled an error on the mirror copy as well (as long as the data wasn't dirty, of course), did not know how to handle a _missing_ copy, instead it blew away the entire array while crashing.
It took us three days to get everything back up, even though most of the critical systems were running off of the WAN backup copy after 2-3 hours.
Terje
PS. That particular firmware bug has of course been extinguished, but there's bound to be some more lurking around. Getting totally non-stop operation is a _hard_ problem!
"almost all programming can be viewed as an exercise in caching"