Build Your Own $2.8M Petabyte Disk Array For $117k
Chris Pirazzi writes "Online backup startup Backblaze, disgusted with the outrageously overpriced offerings from EMC, NetApp and the like, has released an open-source hardware design showing you how to build a 4U, RAID-capable, rack-mounted, Linux-based server using commodity parts that contains 67 terabytes of storage at a material cost of $7,867. This works out to roughly $117,000 per petabyte, which would cost you around $2.8 million from Amazon or EMC. They have a full parts list and diagrams showing how they put everything together. Their blog states: 'Our hope is that by sharing, others can benefit and, ultimately, refine this concept and send improvements back to us.'"
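For anyone who wants to check the headline number, here is a minimal sketch of the arithmetic. It uses only the two figures quoted in the summary and assumes decimal units (1 PB = 1,000 TB); the function name is made up for illustration.

```python
# Figures quoted in the summary above.
POD_COST_USD = 7_867      # material cost of one pod
POD_CAPACITY_TB = 67      # raw capacity of one pod
TB_PER_PB = 1_000         # assumption: decimal units, as drive vendors count them


def cost_per_petabyte(cost_usd: float, capacity_tb: float) -> float:
    """Scale the cost of one unit up to one petabyte of raw capacity."""
    return cost_usd / capacity_tb * TB_PER_PB


if __name__ == "__main__":
    per_pb = cost_per_petabyte(POD_COST_USD, POD_CAPACITY_TB)
    pods_needed = TB_PER_PB / POD_CAPACITY_TB
    print(f"${per_pb:,.0f} per petabyte, about {pods_needed:.1f} pods")
```

This covers raw hardware only; power, cooling, floor space and staff are not in the figure.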
Good luck with all the silent data corruption. Shoulda used ZFS.
"Nature doesn't care how smart you are. You can still be wrong." - Richard Feynman
Support.
Hail Eris, full of mischief...
E pluribus sanguinem
Before realizing that we had to solve this storage problem ourselves, we considered Amazon S3, Dell or Sun Servers, NetApp Filers, EMC SAN, etc. As we investigated these traditional off-the-shelf solutions, we became increasingly disillusioned by the expense. When you strip away the marketing terms and fancy logos from any storage solution, data ends up on a hard drive.
That's odd, where I work we pay a premium for what happens when the power goes out, what happens when a drive goes bad, what happens when maintenance needs to be performed, what happens when the infrastructure needs upgrades, etc. This article left out a lot of buzzwords but they also left out the people who manage these massive beasts. I mean, how many hundreds (or thousands) of drives are we talking here?
You might as well add a few hundred thousand a year for the people who need to maintain this hardware and also someone to get up in the middle of the night when their pager goes off because something just went wrong and you want 24/7 storage time.
We don't pay premiums because we're stupid. We pay premiums so we can relax and concentrate on what we need to concentrate on.
My work here is dung.
Looks like a cheap downscale undersized version of a Sun X4500/X4540.
And as others have pointed out, you pay a vendor because in 4 years they will still be stocking the drives you bought today, whereas for this setup you will be praying they are still on eBay.
"If everybody is thinking alike, somebody isn't thinking" - Gen. George S. Patton
That's all fine and dandy but where is my support going to come from when this server has issues? Are they throwing in for free maintenance and upgrades to this server when it no longer meets requirements? If not, this figure is highly disingenuous.
Nominally a Slashvertisement, but the detailed specs for their "pods" (watch out guys, Apple's gonna SUE YOU) are pretty damn cool. 45 drives on two consumer grade power supplies gives me the heebie jeebies though (powering up in stages sounds like it would take a lot of manual cycling, if you were rebooting a whole rack, for instance), and I'd be interested to know why they chose JFS (perfectly valid choice) over some other alternative... There are plenty of petabyte capable filesystems out there.
Very interesting though. I tried to push a much less ambitious version of this for work, and got slapped down because it wasn't made by (insert proprietary vendor here). Of course, we're still having storage issues because we can't afford the proprietary solution, but at least there is no non-branded hardware in our server room.
ad logicam Claiming a proposition is false because it was presented as the conclusion of a fallacious argument.
AHhh, this is why the EMC guy committed suicide. It wasn't because he was dying of cancer.
Trolling is a art,
...but that doesn't add up. $7,867 / 67 petabytes = $117.42/petabyte, not $117,000/petabyte.
Perhaps they were using the 'new' math.
Soon I shall have a single media server with every episode of "General Hospital" ever made stored at a high bitrate. WHO'S LAUGHING NOW, ALL YOU WHO DOUBTED ME!!!!
And how big is a petabyte you ask? There have been about 12,000 episodes of General Hospital aired since 1963. If you encoded 45 minute episodes at DVD quality mpeg2 bitrate, you could fit over 550,000 episodes of America's finest television show on a 1 petabyte server, enough to archive every episode of this remarkable show from its auspicious debut in 1963 until the year 4078.
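The numbers above can be sanity-checked backwards. This is a rough sketch only: the 550,000-episode and 45-minute figures come from the comment, while the 260 episodes per year (five a week) is my own assumption, not something the comment states.

```python
PETABYTE_BYTES = 1e15          # assumption: decimal petabyte
EPISODES_PER_PB = 550_000      # from the comment
EPISODE_SECONDS = 45 * 60      # 45 minute episodes, from the comment
EPISODES_PER_YEAR = 260        # assumption: five episodes a week, 52 weeks
FIRST_YEAR = 1963              # from the comment


def implied_bitrate_mbit() -> float:
    """Average bitrate each episode would need for 550,000 of them to fill 1 PB."""
    bytes_per_episode = PETABYTE_BYTES / EPISODES_PER_PB
    return bytes_per_episode * 8 / EPISODE_SECONDS / 1e6


def last_archived_year() -> int:
    """Year the archive would fill up at the assumed production rate."""
    return int(FIRST_YEAR + EPISODES_PER_PB / EPISODES_PER_YEAR)


if __name__ == "__main__":
    print(f"{implied_bitrate_mbit():.2f} Mbit/s per episode")
    print(f"archive is full in {last_archived_year()}")
```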
SJW: Someone who has run out of real oppression, and has to fake it.
How do you replace disks in the chassis? We've got 1,000 spinning disks and we've got a few failures a month. With 45 disks in each unit you are going to have to replace a few consumer grade drives.
But when we priced various off-the-shelf solutions, the cost was 10 times as much (or more) than the raw hard drives.
Um..and what do you plan on running these disks with? HD's don't magically store and retrieve data on their own. The HD's are cheap compared to the other parts that create a storage system. That's like saying a Ferrari is a ripoff because you can buy an engine for $3,000.
If you check out what the company does, they are an online backup company. They don't host servers on this array, just backup data from your desktop. They just need massive amounts of space which they make redundant.
I love free shipping, even if it costs me more !! I like FREE STUFF !!
They designed and built it so they should know how to support it. If someone else builds one, just learning how to get that beast up and running is excellent hands on training.
Yeah, this only works if you're the geeks building the hardware to begin with. The real cost is in setup and maintenance. Plus, if the shit hits the fan, the CxO is going to want to find some big butts to kick. 67TB of data is a lot to lose (though it's only about 35 disks at max cap these days).
These guys, however, happen to be the geeks, the maintainers, and the people-whose-butts-get-kicked-anyway all at once. This is not a project for a one or two man IT group that has to build a storage array for their 100-200 person firm. These guys are storage professionals with the hardware and software know-how to pull it off. Kudos to them for making it and sharing their project. It's a nice, compact system. It's a little bit of a shame that there isn't OTS software, but at this level you're going to be doing grunt work on it with experts anyway.
FWIW, Lime Technology (lime-technology.com) will sell you a case, drive trays, and software for a quasi-RAID system that will hold 28TB for under $1500 (not including the 15 2TB drives - another $3k on the open market). This tolerates only one fault (though failure is more graceful than with a traditional RAID). I don't know if they've implemented hot spares or automatic failover yet (which would put them up to 2 fault tolerant on the drives, like RAID6).
Is it just my observation, or are there way too many stupid people in the world?
where's the extensive stuff that sun (I work at sun, btw; related to storage) and others have for management? voltages, fan-flow, temperature points at various places inside the chassis, an 'ok to remove' led and button for the drives, redundant power supplies that hot-swap and drives that truly hot-swap (including presence sensors in drive bays). none of that is here. and these days, sas is the preferred drive tech for mission critical apps. very few customers use sata for anything 'real' (it seems, even though I personally like sata).
this is not enterprise quality no matter what this guy says.
there's a reason you pay a lot more for enterprise vendor solutions.
personally, I have a linux box at home running jfs and raid5 with hotswap drive trays. but I don't fool myself into thinking its BETTER than sun, hp, ibm and so on.
--
"It is now safe to switch off your computer."
Since you can now get 2TB drives you should be able to fit 90TB in one of these boxes :)
And I thought I was doing well with a few terabytes in my home server (but hey, ZFS should save me from silent data corruption when the drives inevitably start to fail).
It's not exactly rocket surgery.
Reliant Technology sells you NetApp FAS 6040 for $78,500 with a maximum capacity of 840 drives, without the hard drives (source: Google Shopping). If you buy FAS 6040 with the drives, most vendors will use more expensive and lower-capacity 15k rpm drives instead of the 7200rpm drives the Backblaze Pod uses, and this makes up a lot of the price difference. The point is, you could buy NetApp and install it yourself with cheap off-the-shelf consumer drives and end up spending about the same magnitude amount of money. I estimate that NetApp would cost just 1.5x the amount.
NetApp FAS 6040 at $78,500 + 840 x 1.5TB drives at $120 each = $179,300 which gives you 1.26PB. Cost per petabyte is about $142,300, only slightly more expensive than Backblaze's $117,000 from the article. The real story is that Backblaze is able to show a competitive edge of about $25,000 per petabyte, or being roughly 18% cheaper.
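The comparison is easy to re-run with different drive prices. A minimal sketch, using only the prices quoted in this comment plus the $117,000 figure from the summary; the function and variable names are illustrative.

```python
def cost_per_pb(fixed_usd, drives, usd_per_drive, tb_per_drive):
    """Return (total cost, capacity in PB, cost per PB) for a filer plus drives."""
    total_usd = fixed_usd + drives * usd_per_drive
    capacity_pb = drives * tb_per_drive / 1000  # assumption: 1 PB = 1,000 TB
    return total_usd, capacity_pb, total_usd / capacity_pb


# Prices as quoted in the comment: filer head, 840 consumer 1.5TB drives.
total, capacity, per_pb = cost_per_pb(78_500, 840, 120, 1.5)

POD_USD_PER_PB = 117_000  # figure from the article summary
gap = per_pb - POD_USD_PER_PB

if __name__ == "__main__":
    print(f"total ${total:,.0f} for {capacity:.2f} PB -> ${per_pb:,.0f} per PB")
    print(f"gap vs pods: ${gap:,.0f} per PB ({gap / per_pb:.1%} cheaper)")
```

Whether a filer would actually accept unqualified consumer drives is a separate question the arithmetic does not answer.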
I once had a signature.
and save $2,799,720.
If you build a petabyte stack using 1.5TB disks you need about 800 drives including RAID overhead. With an MTBF for consumer drives of 500,000 hours, a drive will fail roughly every 10-15 days, if your design is good and you create no hotspots/vibration issues.
Rebuild times on large RAID sets are such that it is only a matter of time before they run into a double drive failure and lose their customers' data. The money they saved by going cheap will be spent on lawyers when they get the liability claims in.
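The failure-interval estimate can be sketched under the simplest possible model: independent drives, a constant failure rate, and the spec-sheet MTBF taken at face value. Those are all assumptions; real fleets often do worse, which may be what the 10-15 day figure reflects. The helper names are made up.

```python
HOURS_PER_DAY = 24


def days_between_failures(drive_count: int, mtbf_hours: float) -> float:
    """Expected gap between failures across the whole fleet.

    Assumes independent drives with a constant failure rate, so the
    fleet-wide rate is simply drive_count / mtbf_hours.
    """
    return mtbf_hours / drive_count / HOURS_PER_DAY


def implied_mtbf_hours(drive_count: int, days_per_failure: float) -> float:
    """Per-drive MTBF that would produce one failure every N days."""
    return drive_count * days_per_failure * HOURS_PER_DAY


if __name__ == "__main__":
    print(f"{days_between_failures(800, 500_000):.1f} days between failures")
    print(f"{implied_mtbf_hours(800, 10):,.0f} h MTBF for one failure per 10 days")
    print(f"{implied_mtbf_hours(800, 15):,.0f} h MTBF for one failure per 15 days")
```

Under this model 800 drives at 500,000 hours give a failure roughly every 26 days; an interval of 10-15 days would correspond to an effective MTBF of about 190,000 to 290,000 hours.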
To Terminate, or not to Terminate, that's the question - SCSIROB
These cost a bit and have drives which fail at a fairly infrequent rate. It doesn't hurt that the data center is kept at 64 degrees by two (redundant) chillers and has 450 kVA redundant power conditioners keeping the electricity on at all times. (We do shut off the power to the building once a month to check these and the diesel generator housed on the premises as well.)
Now - paying $x,xxx per year for maintenance on these units is cheap insurance in my mind. If something goes wrong, HP is available 24/7 to be onsite with replacement parts. This has - in fact happened - during the past few years. A controller on the array went bad, causing disk read failures. We instantly called HP, had a tech onsite, and had the controller replaced within a few hours of the problem being detected.
OTOH - for someone's 4 petabyte home pr0n collection, this might be a good idea! :P
The Kai's Semi-Updated Website Thingy
If you need the support, go pay the premium. Those of us with the appropriate technical background welcome the cheaper implementations.
If an article went up describing how a major vendor released a petabyte array for $2M the comments would be full of people saying "I could make an array with that much storage far cheaper!"
Now someone has gone and done exactly that (they even used Linux to do it) and suddenly everyone complains that it lacks support from a major vendor.
This may not be perfect for everyone's needs, but it's nice to see this sort of innovation taking place instead of blindly following the same path everyone else takes for storage.
These guys build their own hardware, think it might be able to be improved on or help the community, and they release the specs, for free, on the Internet. They then get jumped on by people saying "bbbb-but support!". They're not pretending to offer support, if you want support, pay the 2MM for EMC, if you can handle your own support in-house, maybe you can get away with building these out.
It's like looking at KDE and saying "But we pay Apple and Microsoft so we get support" (even though, no you don't). The company is just releasing specs, if it fits in your environment, great, if not, bummer. If you can make improvements and send them back up-stream, everyone wins. Just like software.
I seem to recall similar threads whenever anyone mentions open routers from the Cisco folks.
I like music
Not too shabby.
I had recently built a "storage pod" for my media @ home (6T using 4 1.5T drives), and had a hell of a time finding "good" components. So, I looked this over, and while it's made up of "consumer components" a couple of the components seem impossible to find for this as well.
Case: Custom Built
HD Backplane: Custom made by Chinese manufacturer.
So good luck building a "one off" for your small business/home, as I'll also bet these prices are for "quantity" (quality notwithstanding).
hell, *I* would like to buy one, for my own personal use! $8000 seems very cheap for 67 terabytes of storage in a neat little package. My 4TB raid was quite expensive compared to this (on a $ per TB basis) and it's almost full now. I can definitely see something like this in my future. running ZFS for error detection, of course. And probably 2 redundant PSUs instead of standard consumer-grade ones. Wouldn't want one of those to go out and take half of my drives with it!
Online storage is way too expensive and internet connection speeds here in the USA will suck too badly for too long to even consider it..
These guys have a little more to worry about than redundancy... The two cheap ATX supplies in each box are split between the drives. So if one of the two supplies dies, the whole thing goes down. How's that for MTBF?
I think people are missing the point of this whole thing... instead of trashing and tearing the idea down, think what would make it better and improve the design... I've been researching for a while now for something to store a life's worth of data, and this looks like something that will meet my needs. scalable, and enough space for a lifetime (I hope)
you know you can fry stuff putting things into things that dont like the things you put into it...
We're a backup service, so our datacenter contains a complete copy of all of our customers' data, plus multiple versions of files that change. In rough terms, every time one of our customers buys a hard drive, Backblaze needs another hard drive.
Data deduplication (see http://en.wikipedia.org/wiki/Data_deduplication) drastically reduces the storage requirements for backups. While email attachments are the classic example, it's doubtful that every one of their customers is using a unique build of their OS. Ditto for third-party software. A lot of media also gets duplicated between people: vendors' whitepapers, video, even porn gets downloaded by lots of people. Rsync uses de-dup techniques to reduce bandwidth requirements; there's no reason why a clever storage node couldn't use that de-dup meta data to keep its own storage costs down.
Nothing for 6-digit uids?
For your average datacenter, primary storage needs to be on a major vendor's hardware, because you need the extras that the major vendors supply. However, Backblaze is in the business of providing off-site storage for their customers. Their data is the secondary copy, so it can be as cheap as they can make it. No one is going to be running their data center off of this copy, so it can be low performance. And while I'm not saying that they should, they could probably get away with running non-protected storage for everything. Even if they lose a drive every day, it's unlikely to hold the data needed for that day's requested restores. That means they can almost always rebuild a failed drive's contents the next time the affected customers sync up.
Nothing for 6-digit uids?
they used incredibly cheap-ass HBAs for no good reason.
In their defence:
A note about SATA chipsets: Each of the port multiplier backplanes has a Silicon Image SiI3726 chip so that five drives can be attached to one SATA port. Each of the SYBA two-port PCIe SATA cards has a Silicon Image SiI3132, and the four-port PCI Addonics card has a Silicon Image SiI3124 chip. We use only three of the four available ports on the Addonics card because we have only nine backplanes. We don't use the SATA ports on the motherboard because, despite Intel's claims of port multiplier support in their ICH10 south bridge, we noticed strange results in our performance tests. Silicon Image pioneered port multiplier technology, and their chips work best together.
where I work we pay a premium for what happens when the power goes out, what happens when a drive goes bad,
Whoever spec'd your systems should have accommodated obvious failures like this. As in, paying for colo, using servers with dual power supplies that fail over, sensible RAID strategy. Giving money to EMC in this situation is not sensible.
but they also left out the people who manage these massive beasts. I mean, how many hundreds (or thousands) of drives are we talking here?
I have a couple of hundred drives going at any one time and I get an SNMP alert when a drive goes bad. I take one out of the closet and destroy the broken one. The RAID does the rest.
someone to get up in the middle of the night when their pager goes off because something just went wrong and you want 24/7 storage time.
Our storage strategy is N+1 all the way and required to be online 24/7 so failures are part of the plan. They are probably part of the plan at this startup.
We pay premiums so we can relax and concentrate on what we need to concentrate on.
I don't understand this. If your job is 89% software dev, then EMC may be the way to go. Expensive! But, it makes a little business sense. If you aren't spending most of your time writing software that adds value to your service/product, then EMC is doing your job and you are some kind of TPS generator. Do you pay a premium to blame someone else? I've had the opportunity to work in places like this and I've always passed because of the veiled contempt for IT.
Please, explain this to me.
http://www.maxineudall.com/2010/02/should-economists-be-sued-for-malpractice.html
I like how you dismiss a detailed real world design example based simply on a claimed feature without any further substantiation. Very classy. I'm not saying you are wrong, but would it kill you to go into a little more detail about why these folks need "luck" when they are clearly very successful with their existing design?
STFU about slashdot bias.
That's amusing, since EMC was born out of the outrageously overpriced offerings from IBM and other mainframe companies of the day.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
I think this solution is quite interesting and probably fits their needs, but comparing it to the storage solutions of the vendors listed is quite ridiculous. Another thing to note: there are vendors, such as NEXSAN, that sell cheaper storage systems that, while still more expensive than this, would probably have met their needs.
The first issue is high availability. There are many single points of failure on this box. There is only a single controller. The power supplies are not redundant. With this number of drives, a single fan failure might lead to heat high enough to damage components. Single-port backplane. No NVRAM. The only things that aren't a single point of failure are the drives themselves, because they are in a RAID6 config, but I still see a problem with that: their configuration uses no hot spares. A high-end storage system is going to have multiple controllers and redundant power supplies, be able to sustain multiple fan failures, and have multiple backplanes with interposer cards. It's also going to have NVRAM so that, should a power failure occur, acknowledged cached data would not be lost.
The second issue is maintenance. A high-end storage system's parts are highly accessible and often hot-swappable. A controller goes out, it's like changing a Nintendo cartridge. With this box, if anything goes except a drive, the box is coming down. If you are replacing a drive you'll have to slide the box out (hopefully you left enough clearance for the power cords), then you have to pop in a new drive, and hopefully not break the SATA connector on the backplane. Oh man, I forgot to put on a new rubber band, I mean vibration dampener.
What's the performance of this box like? With software RAID and only a single processor with no ASIC acceleration for anything, I would have to imagine the processor is going to get pretty bogged down. With a high-end box everything is pretty much designed, within reason, to make the drives the ultimate performance bottleneck. Can this system fully utilize all the drives, or can the drives deliver more IOPS and throughput than the controller can handle?
Extra features. What does this box offer in terms of volume copying, flash copying, and remote mirroring?
The value of an enterprise solution is that it provides the features that keep it working 99.999% of the time, not just 99%. I see so many possible areas where data could possibly be lost or corrupted. A couple of comments have suggested this just being a block in a bigger solution, treating it just like a drive. In that case you are going to have to add an additional layer of redundancy, probably a mirror. With a straight mirror you are going to see a doubling in cost of hardware, infrastructure, power and cooling, which is going to start disrupting the cost/benefit of this solution. If you just want a bunch of file space accessible through HTTP with the ability to tolerate the occasional loss of data and downtime, this solution will work fine. If data loss or downtime means the loss of data or jobs, you'll go with one of the major storage vendors.
A plain-vanilla SAN is worth every penny.
Especially now that you can get them from distressed companies who paid too much for them a couple of years ago, $15,000 will get you a refrigerator-sized solution. Straight retail on a 2U san is still getting cheaper every year. http://h71016.www7.hp.com/dstore/ctoBases.asp?oi=E9CED&BEID=19701&SBLID=&ProductLineId=450&FamilyId=2569&LowBaseId=15222&LowPrice=$1,899.00
http://www.maxineudall.com/2010/02/should-economists-be-sued-for-malpractice.html
I remember when i first got my hands on a sun thumper... impressive piece of kit but with a sun price tag... this one is way kewler on the price.
A project i worked on tried to deploy quite a number of thumpers and we ran into "issues"... First the racks - the thumper weighs in the vicinity of 150kgs (330lbs i think?), try putting 10 of them in a rack and you're in for a shock, assuming the rack can handle it, most data center floors have "issues" supporting the weight.
The second problem we had was cooling, the temp coming out the back of the rack was quite astronomical, and lastly power. In AU, this can often be a pain in the rear, especially with each thumper taking in about 2kw - 20kw per rack = PAIN.
Still, it's kewl to see them do it all open.
I'd love to see someone do something like that though with computing power. Take the proliferation of mini-itx boards with "real" cpu's on them (ok, desktop cpus, but still, not shabby really) you could do a similar setup with a custom case supporting quite a number of those little buggers quite easily. There's a beautiful little zotac AMD board that is almost ideal - supports the quad core, has a gig interface (sadly it has been aimed at htpc's cause you could replace the wireless, video, 6usb ports, etc with server-useful componentry - i.e. 2 or more gig ports and ipmi). But mostly it would be cheap and could run something like ovirt or abicloud quite happily. Shame that. There are other options in the space, like intel have a half-width xeon board (and a 1ru case that can support 2 of them side-by-side), but they're hard to get a hold of and quite long. get rid of local storage on the servers, use serial instead of video and add gpxe for remote boot - brilliant, and they don't exist!
On a completely side note, one thing that i'd love to see in linux that has yet to exist in a useful format is replication (async) - the only real option is drbd, but it's such a pain to set up and very inflexible. I was always so disappointed that neither zfs, lvm nor btrfs include it (even basic local replication would have sufficed given the existence of so many network level block storage transports (iscsi, etc)).
They went RAID 6, even though it is slow as shit, for the added failsafe mechanisms.
How long does it take to rebuild the array if a disk has to be replaced? Each RAID 6 volume is 15 disks of 1.5TB each (19.5TB data + 3TB parity). So either they'd have to take a real performance hit during a relatively short rebuild period, or a smaller hit over a longer rebuild period. Longer rebuild periods increase the odds of further failures before the rebuild is complete.
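To put rough numbers on the question, here is a sketch of the capacity split and a best-case rebuild time. The 15 x 1.5TB layout comes from the comment; the sustained rebuild rates are pure assumptions picked for illustration, not measurements from a pod.

```python
def raid6_capacity_tb(disks: int, tb_per_disk: float):
    """Split a RAID 6 set into (data TB, parity TB); two disks' worth is parity."""
    data_tb = (disks - 2) * tb_per_disk
    parity_tb = 2 * tb_per_disk
    return data_tb, parity_tb


def rebuild_hours(tb_per_disk: float, mb_per_s: float) -> float:
    """Time to rewrite one whole replacement disk at a sustained rate.

    Best case only: assumes the rate holds for the entire rebuild.
    """
    return tb_per_disk * 1e12 / (mb_per_s * 1e6) / 3600


if __name__ == "__main__":
    data, parity = raid6_capacity_tb(15, 1.5)
    print(f"{data} TB data + {parity} TB parity")
    # Hypothetical rates: throttled, moderate, and near-idle array.
    for rate in (25, 50, 100):
        print(f"{rate:>3} MB/s -> {rebuild_hours(1.5, rate):.1f} h")
```

The slower the rebuild is throttled to protect foreground traffic, the longer the array sits with reduced redundancy, which is the trade-off described above.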
Those who can make you believe absurdities can make you commit atrocities. - Voltaire
These are pretty impressive and a good start but there are a few things I would do differently. I used to build NAS boxes for a living and there are some issues with this design. I think they have cut too many corners.
1) Their choice of RAID cards is somewhat questionable. What 4 cards can you get for $175 total which will support proper hotswap? Even running software raid, I would still want cards that provide proper monitoring and drive management like 3ware. Yeah, maybe it would have cost you a fair bit more than $175 per box but it would be worth the difference. You would still be saving a ton. Also I am not sure I would put more than one drive on a cable with a multiplexer. You can get 16 port 3ware cards that use multiport cables that break out at the back plane. Now you would also have to upgrade to a server class motherboard with at least 3 PCIe slots.
2) I haven't checked recently but is software raid 6 even recommended yet? I know the 2.6 kernel has supported it for a while but it was still listed as experimental last I checked. I might stick with raid 5 here.
3) While using Zippy power supplies is an excellent choice, I would definitely want redundant power in these boxes.
You realize they are USING this NOT SELLING it, right? They tell YOU how YOU can build one, nowhere are they offering to sell some schmuck a storage array.
If you don't know how to maintain it, do not try to do it yourself! however if you do, and you can save the kind of money they are saving, then go for it.
Umm, he was talking about the power redundancy. Also those are not 2 cheap ATX supplies, they are top quality server grade power supplies, just not redundant. Though if you provide redundant power, you really shouldn't need battery backup on the sata cards as the datacenter would certainly have a UPS. I guess the motherboard could blow and battery backup could protect against that.
Umm, it's kind of obvious but whatever.
This is a NAS box. They aren't adding 10GbE NICs so the network will be GigE and would be the bottleneck if they weren't using PCI sata cards.
You would have support for typical NAS stuff like NFS, Samba, distributed filesystems like AFS. You could also set up iSCSI nodes as well. But definitely no Fibre Channel support. I don't know, maybe you could add a card to the box to add this but I am pretty sure for the same money you could set up a 10GbE storage network. Of course the 10GbE storage network would be faster.
The article talks about how it is not intended as a complete solution. They do not go into, or intend to, describe their redundancy features, their performance issues, or anything else.
From the Article:
A Backblaze Storage Pod is a Building Block
We have been extremely happy with the reliability and excellent performance of the pods, and a Backblaze Storage Pod is a fully contained storage server. But the intelligence of where to store data and how to encrypt it, deduplicate it, and index it is all at a higher level (outside the scope of this blog post). When you run a datacenter with thousands of hard drives, CPUs, motherboards, and power supplies, you are going to have hardware failures - it's irrefutable. Backblaze Storage Pods are building blocks upon which a larger system can be organized that doesn't allow for a single point of failure. Each pod in itself is just a big chunk of raw storage for an inexpensive price; it is not a "solution" in itself.
If you did want to attack this concept, it would be based on the fact that I cannot think of a good general storage use for this besides serving static webpages.
The only access method is through https.
There is only 1 gigabit/s of bandwidth per 67 terabytes. 67 terabytes is, duh, 67,000 gigabytes... That's 536,000 gigabits. A 1 gigabit/s interface needs about 6 days to move all that data. Oh, and it can only be accessed through https. So it's somewhat questionable that you can actually move nearly that much data. I don't really know what the limitations of the hard drives or SATA are, but no matter how much speed any of that has, the network link and latency are going to be significant if you are really moving large scale data. I can only assume their applications don't require speed, or that by duplicating it over a large number of systems they are going to get some load balancing. So then one asks... How many of these pods equal a redundant system with reasonable performance? And what is the power usage involved?
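The transfer-time figure above works out like this. A minimal sketch that assumes the link runs at full line rate the whole time, with no HTTPS or TCP overhead, so it is a lower bound rather than a prediction.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(capacity_tb: float, link_gbit_per_s: float) -> float:
    """Days needed to push the full capacity through one network link.

    Assumes decimal units and a link saturated for the whole transfer.
    """
    gigabits = capacity_tb * 1000 * 8  # TB -> GB -> gigabits
    return gigabits / link_gbit_per_s / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"1 GbE:  {transfer_days(67, 1):.1f} days")
    print(f"10 GbE: {transfer_days(67, 10):.1f} days")
```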
There is Raid6 based on 15 drive sets with 2 parity drives spread across between 1 and 3 controllers but there is no hot swappable drive, fan, or controller.
Essentially a single drive failure requires you to take down the entire system. Now I assume there is a replicated system, so you can just take down any of these boxes with no planning.
--------------------
Honestly I am sure this suits their purpose. I can't imagine what purpose it would suit for me.
I give it a 50/50 chance of actually breaking even vs. buying the cheaper Dell solution in a 5 year time frame.
I give it a 10% chance of causing an EPIC FAIL that causes the company to go out of business from a massive loss of customer data.
I have mod points and I am not afraid to use them
After adjustment, storage capacity has increased about 100,000x per dollar in the last 25 years. To get to a petabyte in the desktop price range requires just another 10 years or so.
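The extrapolation can be sketched as simple compound growth. The 100,000x over 25 years figure is from the comment and the $117,000 per petabyte is from the article summary; the $1,000 "desktop price" target is my own assumption, as is the idea that the trend continues unchanged.

```python
import math


def annual_growth(total_factor: float, years: float) -> float:
    """Constant yearly improvement factor that compounds to total_factor."""
    return total_factor ** (1 / years)


def years_to_reach(current_usd: float, target_usd: float, growth_per_year: float) -> float:
    """Years until the price falls from current_usd to target_usd at that rate."""
    return math.log(current_usd / target_usd) / math.log(growth_per_year)


if __name__ == "__main__":
    growth = annual_growth(100_000, 25)
    # Assumption: "desktop price range" means roughly $1,000 for the storage.
    wait = years_to_reach(117_000, 1_000, growth)
    print(f"about {growth:.2f}x more capacity per dollar each year")
    print(f"about {wait:.1f} years from $117k to $1k per petabyte")
```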
(-1: Post disagrees with my already-settled worldview) is not a valid mod option.
The $117K is just the computer hardware. You still need UPS, A/C, Power, and floor space. Add up those, and a reasonable profit, and I'll bet Amazon and EMC don't look so bad. But if you already have the infrastructure, and the marginal cost of adding the storage arrays is low, then the design could save money.
and not something you'd want to store valuable data on. First off, it does not have redundant power. You could probably add redundant power for another $1,000 or so.
Second of all, if you did set up something like RAID 5 or RAID 6 (or RAIDZ/RAIDZ2), the rebuild time on a drive would probably be well over 12 hours with 1.5TB SATA drives.
I'm sure many people would be tempted to put all 45 drives in a large RAID 5 volume, which would be even scarier.
A more practical version would be to go with 41x 500GB SATA, 3x 60GB SSD, dual redundant power supplies, 32GB RAM, and Solaris or OpenSolaris.
You would probably break it down something like this:
2 disks - RAID 1 mirror for the system
2 30GB SSD drives for the slog (definitely helps improve performance)
3 hot spares
and then 6 sets of 6 drives in RAIDZ-2 in a single pool
This leaves out a couple of drives. You could put in a couple of 1.5TB (or even 2TB) in a RAID 1 mirror for some supplementary storage or just leave them out. You're not going to have as much storage, but your data will be safer. Plus, dropping down to 500GB from 1.5TB drives is a large difference in price (as much as $50-$60 per drive), and the price differentials mean that the added expenses (such as power and the SSD drives) are covered.
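A quick tally of that layout, as a sketch. The drive counts and 500GB size are from the comment; the usable figure ignores ZFS metadata overhead and the decimal/binary difference, so treat it as an upper bound.

```python
def raidz2_usable_tb(vdevs: int, disks_per_vdev: int, tb_per_disk: float) -> float:
    """Approximate usable space: each RAIDZ2 vdev gives up two disks to parity."""
    return vdevs * (disks_per_vdev - 2) * tb_per_disk


# Hard-drive budget as described in the comment.
layout = {
    "system mirror": 2,
    "hot spares": 3,
    "raidz2 pool": 6 * 6,
}
hdd_total = sum(layout.values())
usable_tb = raidz2_usable_tb(vdevs=6, disks_per_vdev=6, tb_per_disk=0.5)

if __name__ == "__main__":
    print(f"{hdd_total} hard drives in use, about {usable_tb:.0f} TB usable in the pool")
```

That is roughly 12 TB usable against the 67 TB raw of the original design, which is the capacity being traded for safety.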
The real value in a data storage system isn't in the hardware, it's in the data. And the real cost incurred in a data storage system is measured in the inability of the customer to access that data quickly, efficiently and (in the case of a disaster) at all.
If you need to crunch the data quickly, a higher-performing system is going to save you money in the end. Look at all the benchmarks: no home-grown systems are anywhere on the lists. If you want to stream through your data at several gigabytes per second, you need to pay for a fast interconnect. Putting 45 drives behind a single 1GbE just doesn't cut it.
Similarly, if you want to ensure that the data is protected (integrity, immutable storage for folks who need to preserve data and be certain it hasn't been tampered with, etc) and stored efficiently (single instance store, or dedupe, so you don't fill your petabytes of disks with a bajillion copies of the same photos of Anna Kournikova) then you need to pay for the extra goodness in that software and hardware as well.
Finally, if you want extremely high availability, then the cost of the hardware is minuscule compared to the cost of downtime. We had customers that would lose millions of dollars per service interruption. They're willing to pay a million dollars to eliminate or even reduce downtime.
These folks are essentially just building a box that makes a bunch of disks behave like a honking big tape drive. It's a viable business--that's all some folks need. But EMC et al are not going to lose any sleep over this.
Am I part of the core demographic for Swedish Fish?
:D
How about reading the section "A Backblaze Storage Pod is a Building Block".
<snip> the intelligence of where to store data and how to encrypt it, deduplicate it, and index it is all at a higher level (outside the scope of this blog post). When you run a datacenter with thousands of hard drives, CPUs, motherboards, and power supplies, you are going to have hardware failures — it's irrefutable. Backblaze Storage Pods are building blocks upon which a larger system can be organized that doesn't allow for a single point of failure. Each pod in itself is just a big chunk of raw storage for an inexpensive price; it is not a "solution" in itself.
Emphasis mine. I believe there are quite a few successful and reliable storage vendors not using ZFS. We get the point, you like it. Doesn't mean you can't succeed without it. Be more open-minded.
I'm sorry if I haven't offended anyone
I'm somewhat serious about building one of these boxes myself.
I have to buy a lot of little parts from a multitude of vendors, fine. A small premium to pay over their quoted price.
My question falls to: where the heck do I buy a "Chyang Fun Industry (CFI Group) CFI-B53PM 5 Port Backplane (SiI3726)"?
Spend a few minutes and try and find that part for sale.
--frustrated--
Checksums, error correction, self-healing shit or whatever is the solution. Such things are easy to put on top of these pods.
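As a rough illustration of what "checksums on top" could look like, here is a minimal sketch that stores a SHA-256 digest next to each blob and verifies it on every read. This is not how Backblaze does it (their post says that layer is out of scope); the `put`/`get` names and the `.sha256` sidecar file are invented for the example. Detecting corruption is only half of self-healing: you would still need a second copy somewhere to repair from.

```python
import hashlib
import tempfile
from pathlib import Path


class CorruptionError(Exception):
    """Raised when stored data no longer matches its recorded checksum."""


def _sidecar(path: Path) -> Path:
    # Hypothetical convention: the checksum lives in "<name>.sha256".
    return Path(str(path) + ".sha256")


def put(path: Path, data: bytes) -> None:
    """Write the data and record its SHA-256 digest alongside it."""
    path.write_bytes(data)
    _sidecar(path).write_text(hashlib.sha256(data).hexdigest())


def get(path: Path) -> bytes:
    """Read the data back, refusing to return it if the digest differs."""
    data = path.read_bytes()
    expected = _sidecar(path).read_text().strip()
    if hashlib.sha256(data).hexdigest() != expected:
        raise CorruptionError(f"checksum mismatch for {path}")
    return data


# Small demo in a throwaway directory: flip one byte and watch the read fail.
with tempfile.TemporaryDirectory() as tmp:
    blob = Path(tmp) / "blob.bin"
    put(blob, b"backup payload")
    print(get(blob))
    blob.write_bytes(b"backup pAyload")  # simulate silent corruption
    try:
        get(blob)
    except CorruptionError as err:
        print("detected:", err)
```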
I typed up a lengthy critique.. but decided not to post it...
I'll replace it with... "wow..."
This thing needs a lot more thought, especially with respect to redundancy, fault coverage, and maintenance.
Anyone else notice that they seriously restricted the throughput to/from the drives based on their choice of SATA cards (good old 32-bit PCI 2.3 only has max theoretical of 266MB/s, and the same goes for PCIe 1x [250MB/sec])? In the worst case scenario, each drive is at max getting 17MB/sec of transactional bandwidth, which is just pathetic (based on some very back of the envelope calculations). For the amount of money they spent on making a custom solution, an extra 100-200 bucks to get a few 4-8 lane PCIe SATA cards is a pittance. Overall, it just demonstrates to me, at least, a poor understanding of what goes into making a good storage solution. And don't get me started on the lack of backup power supplies, error-checking RAM, etc.
Raw storage will always be cheaper than the effort of designing fault-tolerant, high-availability systems, but it's worth the effort to at least implement "good enough" systems to attempt to achieve these qualities rather than sticking with the dumb "stack-em-high" approach. Scalability matters, or else your "super cluster" will quickly be overtaken by the next dumb implementation when the next 18-month increment rolls around.
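The back-of-envelope math behind a per-drive figure like that can be sketched as follows. The comment above does not say how many drives it assumed behind each card; 15 drives sharing one PCIe 1.x x1 link is my assumption, picked because it lands near the quoted 17MB/s.

```python
def per_drive_mb_s(bus_mb_s: float, drives_on_bus: int) -> float:
    """Worst case: every drive behind one controller is busy at once
    and they split the bus bandwidth evenly."""
    return bus_mb_s / drives_on_bus


PCIE1_LANE_MB_S = 250  # theoretical per-lane rate of PCIe 1.x
DRIVES_PER_CARD = 15   # assumption, not taken from the parts list

for lanes in (1, 4, 8):
    share = per_drive_mb_s(PCIE1_LANE_MB_S * lanes, DRIVES_PER_CARD)
    print(f"x{lanes}: {share:.1f} MB/s per drive")
```

With those assumptions an x1 card leaves each drive under 17MB/s, while an x4 card would leave about 67MB/s, which is the argument for spending the extra money on wider cards.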
who's taking bets on how long it takes them to bite the bullet and shell out the cash for a netapp, emc, ibm, hp or other true SAN?
there are reasons that companies pay large sums of money for them. it's not because they *can* or because they *want to*.
one day they'll realize this.
not only is time travel possible, it's irrelevant.
I visited LCCC several times when in student government a few years back. They're a legitimate college with a good student population and decent teachers. They're right outside Ocala, FL - halfway between Tallahassee and Orlando.
At least I wouldn't trust software RAID10 to write to both disk sets and then fill in the other set with the redundant copy when it had time. That really needs a battery-backed cache to implement safely. The overhead of RAID6 parity calculation should decrease for bulk writes, but at some point the CPU is going to be spending too much time calculating parity and not doing other stuff. 16 100MB/s drives in RAID6 would put quite a load on the system, but if it's only a file server it may be acceptable. I agree that degraded drives suffer a much worse slowdown, especially for partial stripe reads. You could easily start getting only 100MB/s for lots of small reads on that same RAID6 with one or two failed drives, and that's assuming the CPU is fast enough to do error correction at 100MB/s (with two drives missing the fast algorithm for accelerating RAID6 stops working and it has to emulate GF(2^8) multiplication with lookup tables). Most of my personal needs are cheap bulk data storage (movies, isos, etc.), so RAID5/6 makes sense. At work, I use RAID1, 10, and 5 since we don't have hardware support for RAID6 on the SANs. Production data goes on RAID10 because we can afford it, mirroring for system drives, and RAID5 for test/development systems that just need lots of storage.
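For readers who haven't seen it, table-driven GF(2^8) multiplication looks roughly like the sketch below. It assumes the field polynomial 0x11d with generator 2, which is the choice usually cited for Linux md RAID6; treat that as an assumption here. This is plain illustrative Python, not the kernel's code, and the function names are mine.

```python
POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed RAID6 field polynomial)


def gf_mul_slow(a: int, b: int) -> int:
    """Reference shift-and-xor multiply in GF(2^8), one bit of b at a time."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:  # reduce whenever the product overflows 8 bits
            a ^= POLY
        b >>= 1
    return result


# Build exp/log tables from successive powers of the generator 2.
EXP = [0] * 510  # doubled so LOG[a] + LOG[b] never needs a modulo
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= POLY
for i in range(255, 510):
    EXP[i] = EXP[i - 255]


def gf_mul(a: int, b: int) -> int:
    """Table-driven multiply: add the logs, look up the antilog."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


print(hex(gf_mul(2, 0x80)))
```

Each multiply becomes two log lookups, an add, and one antilog lookup per byte, which is much slower than the vectorized xor tricks available for plain parity generation.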