Build Your Own $2.8M Petabyte Disk Array For $117k
Chris Pirazzi writes "Online backup startup Backblaze, disgusted with the outrageously overpriced offerings from EMC, NetApp and the like, has released an open-source hardware design showing you how to build a 4U, RAID-capable, rack-mounted, Linux-based server using commodity parts that contains 67 terabytes of storage at a material cost of $7,867. This works out to roughly $117,000 per petabyte, a capacity that would cost you around $2.8 million from Amazon or EMC. They have a full parts list and diagrams showing how they put everything together. Their blog states: 'Our hope is that by sharing, others can benefit and, ultimately, refine this concept and send improvements back to us.'"
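The per-petabyte figure in the summary is easy to recheck. Here is a minimal Python sketch that uses only the two pod numbers quoted above and assumes decimal units (1 PB = 1,000 TB); the function name is made up for illustration.

```python
def cost_per_petabyte(pod_cost_usd: float, pod_capacity_tb: float) -> float:
    """Scale one pod's material cost up to a petabyte (assumes 1 PB = 1,000 TB)."""
    cost_per_tb = pod_cost_usd / pod_capacity_tb
    return cost_per_tb * 1000


pod_pb = cost_per_petabyte(7867, 67)
print(f"${pod_pb:,.0f} per petabyte of raw pod hardware")
print(f"The $2.8M quote is about {2_800_000 / pod_pb:.0f}x that")
```

This counts materials only; labor, power, and any redundancy above the pod level are outside the calculation.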
Good luck with all the silent data corruption. Shoulda used ZFS.
"Nature doesn't care how smart you are. You can still be wrong." - Richard Feynman
Support.
Hail Eris, full of mischief...
E pluribus sanguinem
Before realizing that we had to solve this storage problem ourselves, we considered Amazon S3, Dell or Sun Servers, NetApp Filers, EMC SAN, etc. As we investigated these traditional off-the-shelf solutions, we became increasingly disillusioned by the expense. When you strip away the marketing terms and fancy logos from any storage solution, data ends up on a hard drive.
That's odd, where I work we pay a premium for what happens when the power goes out, what happens when a drive goes bad, what happens when maintenance needs to be performed, what happens when the infrastructure needs upgrades, etc. This article left out a lot of buzzwords but they also left out the people who manage these massive beasts. I mean, how many hundreds (or thousands) of drives are we talking here?
You might as well add a few hundred thousand a year for the people who need to maintain this hardware and also someone to get up in the middle of the night when their pager goes off because something just went wrong and you want 24/7 storage time.
We don't pay premiums because we're stupid. We pay premiums so we can relax and concentrate on what we need to concentrate on.
My work here is dung.
What's that in Libraries of Congress?
Build a man a fire and he'll be warm for an hour. Set him on fire and he'll be warm for the rest of his life.
Looks like a cheap downscale undersized version of a Sun X4500/X4540.
And as others have pointed out, you pay a vendor because in 4 years they will still be stocking the drives you bought today, whereas for this setup you will be praying they are still on eBay.
"If everybody is thinking alike, somebody isn't thinking" - Gen. George S. Patton
That's all fine and dandy, but where is my support going to come from when this server has issues? Are they throwing in free maintenance and upgrades to this server when it no longer meets requirements? If not, this figure is highly disingenuous.
Nominally a Slashvertisement, but the detailed specs for their "pods" (watch out guys, Apple's gonna SUE YOU) are pretty damn cool. 45 drives on two consumer grade power supplies gives me the heebie jeebies though (powering up in stages sounds like it would take a lot of manual cycling, if you were rebooting a whole rack, for instance), and I'd be interested to know why they chose JFS (perfectly valid choice) over some other alternative... There are plenty of petabyte capable filesystems out there.
Very interesting though. I tried to push a much less ambitious version of this for work, and got slapped down because it wasn't made by (insert proprietary vendor here). Of course, we're still having storage issues because we can't afford the proprietary solution, but at least there is no non-branded hardware in our server room.
ad logicam Claiming a proposition is false because it was presented as the conclusion of a fallacious argument.
AHhh, this is why the EMC guy committed suicide. It wasn't because he was dying of cancer.
Trolling is a art,
...but that doesn't add up. $7,867 / 67 petabytes = $117.42/petabyte, not $117,000/petabyte.
Perhaps they were using the 'new' math.
No ECC? Good luck....
Soon I shall have a single media server with every episode of "General Hospital" ever made stored at a high bitrate. WHO'S LAUGHING NOW, ALL YOU WHO DOUBTED ME!!!!
And how big is a petabyte you ask? There have been about 12,000 episodes of General Hospital aired since 1963. If you encoded 45 minute episodes at DVD quality mpeg2 bitrate, you could fit over 550,000 episodes of America's finest television show on a 1 petabyte server, enough to archive every episode of this remarkable show from its auspicious debut in 1963 until the year 4078.
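The episode math above can be reproduced with a short Python sketch. The comment does not give a bitrate or a broadcast schedule, so the 5 Mbit/s average DVD-quality bitrate and the 260 episodes per year (five a week) used here are assumptions, as are the function names.

```python
PETABYTE = 10**15  # bytes, decimal definition


def episodes_per_petabyte(minutes: float, mbit_per_s: float) -> int:
    """Whole episodes that fit in 1 PB at a constant average bitrate."""
    bytes_per_episode = minutes * 60 * mbit_per_s * 1_000_000 / 8
    return int(PETABYTE // bytes_per_episode)


def last_year_archived(episodes: int, first_year: int = 1963, per_year: int = 260) -> int:
    """Year reached if `per_year` episodes air every year starting in `first_year`."""
    return first_year + episodes // per_year


print(episodes_per_petabyte(45, 5))   # comfortably over 550,000 at the assumed bitrate
print(last_year_archived(550_000))    # matches the year quoted in the comment
```

At 260 episodes a year, 550,000 episodes span about 2,115 years, which is where the year 4078 comes from.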
SJW: Someone who has run out of real oppression, and has to fake it.
How do you replace disks in the chassis? We've got 1,000 spinning disks and we've got a few failures a month. With 45 disks in each unit you are going to have to replace a few consumer grade drives.
But when we priced various off-the-shelf solutions, the cost was 10 times as much (or more) than the raw hard drives.
Um... and what do you plan on running these disks with? HDs don't magically store and retrieve data on their own. The HDs are cheap compared to the other parts that create a storage system. That's like saying a Ferrari is a ripoff because you can buy an engine for $3,000.
If you check out what the company does, they are an online backup company. They don't host servers on this array, just backup data from your desktop. They just need massive amounts of space which they make redundant.
I love free shipping, even if it costs me more !! I like FREE STUFF !!
They designed and built it so they should know how to support it. If someone else builds one, just learning how to get that beast up and running is excellent hands on training.
Yeah, this only works if you're the geeks building the hardware to begin with. The real cost is in setup and maintenance. Plus, if the shit hits the fan, the CxO is going to want to find some big butts to kick. 67TB of data is a lot to lose (though it's only about 35 disks at max cap these days).
These guys, however, happen to be the geeks, the maintainers, and the people-whose-butts-get-kicked-anyway all at once. This is not a project for a one or two man IT group that has to build a storage array for their 100-200 person firm. These guys are storage professionals with the hardware and software know-how to pull it off. Kudos to them for making it and sharing their project. It's a nice, compact system. It's a little bit of a shame that there isn't OTS software, but at this level you're going to be doing grunt work on it with experts anyway.
FWIW, Lime Technology (lime-technology.com) will sell you a case, drive trays, and software for a quasi-RAID system that will hold 28TB for under $1500 (not including the 15 2TB drives, another $3k on the open market). This is only single-fault tolerant (though failure is more graceful than with a traditional RAID). I don't know if they've implemented hot spares or automatic failover yet (which would put them up to 2 fault tolerant on the drives, like RAID6).
Is it just my observation, or are there way too many stupid people in the world?
where's the extensive stuff that sun (I work at sun, btw; related to storage) and others have for management? voltages, fan-flow, temperature points at various places inside the chassis, an 'ok to remove' led and button for the drives, redundant power supplies that hot-swap and drives that truly hot-swap (including presence sensors in drive bays). none of that is here. and these days, sas is the preferred drive tech for mission critical apps. very few customers use sata for anything 'real' (it seems, even though I personally like sata).
this is not enterprise quality no matter what this guy says.
there's a reason you pay a lot more for enterprise vendor solutions.
personally, I have a linux box at home running jfs and raid5 with hotswap drive trays. but I don't fool myself into thinking it's BETTER than sun, hp, ibm and so on.
--
"It is now safe to switch off your computer."
Since you can now get 2TB drives you should be able to fit 90TB in one of these boxes :)
And I thought I was doing well with a few terabytes in my home server (but hey, ZFS should save me from silent data corruption when the drives inevitably start to fail).
It's not exactly rocket surgery.
When are they going to sell the Backblaze kit with everything but the hard drives?
Everything looks rather standard except the case and the HD panels inside it
I am sure there are companies who would really like to buy one or two.
Those "outrageously overpriced" models have multiple controllers that have battery backed up caches that mirror their data, SAS or FC instead of SATA, hot swappable components (power supplies, fans, drives, controllers, cache modules, etc), 99.999% uptime, testing/certification for EMI, shock, vibration, thermal, GUI/phone home management, and 24/7 on-site support. They are designed for high performance, mission critical situations. The blog is from a company that's doing backups. They did a good job, but it's apples and oranges. They don't have the performance, uptime, or support requirements. They're doing their own support and aren't selling the HW, so they don't have the certification. Their top loading trays are going to make it fun to replace a drive at the top of the cabinet.
Reliant Technology sells you a NetApp FAS 6040 for $78,500 with a maximum capacity of 840 drives, without the hard drives (source: Google Shopping). If you buy the FAS 6040 with drives, most vendors will use more expensive, lower-capacity 15k rpm drives instead of the 7200rpm drives the Backblaze Pod uses, and this makes up a lot of the price difference. The point is, you could buy NetApp and install it yourself with cheap off-the-shelf consumer drives and end up spending about the same order of magnitude of money. I estimate that NetApp would cost just 1.5x the amount.
NetApp FAS 6040 at $78,500 + 840 x 1.5TB drives at $120 each = $179,300, which gives you 1.26PB. Cost per petabyte is about $142,300, only slightly more expensive than Backblaze's $117,000 from the article. The real story is that Backblaze is able to show a competitive edge of about $25,000, or being roughly 18% cheaper.
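The NetApp comparison can be rechecked with a few lines of Python. This sketch uses only the prices quoted in the comment and assumes raw capacity with no RAID overhead, which is what those figures imply; the function name is illustrative.

```python
def array_cost(head_usd: float, drives: int, drive_usd: float, drive_tb: float):
    """Return (total cost, raw capacity in PB, cost per raw PB) for a filer head plus drives."""
    total = head_usd + drives * drive_usd
    capacity_pb = drives * drive_tb / 1000
    return total, capacity_pb, total / capacity_pb


total, capacity_pb, per_pb = array_cost(78_500, 840, 120, 1.5)
backblaze_per_pb = 117_000  # figure quoted in the article summary

print(f"total ${total:,.0f} for {capacity_pb:.2f} PB -> ${per_pb:,.0f} per PB")
print(f"difference ${per_pb - backblaze_per_pb:,.0f} "
      f"({(per_pb - backblaze_per_pb) / per_pb:.0%} cheaper with the pod)")
```

Under these assumptions the gap is closer to $25,000 per petabyte than $30,000. It would shift again once usable capacity after RAID, support contracts, and power are included.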
I once had a signature.
and save $2,799,720.
So there is a small blog write up that demonstrates you use a fairly unreliable hardware setup... love those rubber bands!... but your $5 a month service only supports Windows and newer Mac computers? .... meh
So, air will take the path of least resistance - so all those fans will be moving the coolest air (under the hard drives) and pushing it out the back. They really, really are going to hate replacing ~ 30 drives per enclosure after a few weeks.
Better plug them in from the bottom of the chassis, and put the standoffs on the "top" so the hot air will at least rise off the disks and can be pushed/pulled out by that godawful fan system.
Oh yeah - TWO 760W power supplies? 1500 watts per 45 drives? That's pretty horrible by enterprise standards. They will spend 2X on powering these over an HP EVA4400.
If you build a petabyte stack using 1.5TB disks you need about 800 drives including RAID overhead. With an MTBF for consumer drives of 500,000 hours, a drive will fail roughly every 10-15 days, if your design is good and you create no hotspots/vibration issues.
Rebuild times on large RAID sets are such that it is only a matter of time before they run into a double drive failure and lose their customers' data. The money they saved by going cheap will be spent on lawyers when they get the liability claims in.
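A back-of-the-envelope way to turn an MTBF figure into a fleet-wide failure interval is sketched below in Python. It assumes independent drives with a constant failure rate, which is an idealization; real fleets usually do worse because of batch defects, heat, and vibration. The function name is illustrative.

```python
def days_between_failures(drive_count: int, mtbf_hours: float) -> float:
    """Expected days between drive failures across the whole fleet.

    Assumes independent drives with a constant (exponential) failure rate.
    """
    return mtbf_hours / drive_count / 24


print(f"{days_between_failures(800, 500_000):.0f} days at the rated MTBF")
# An interval of 10-15 days would correspond to a lower effective MTBF:
for days in (10, 15):
    print(f"{days} days between failures -> {800 * 24 * days:,} hours effective MTBF")
```

With the quoted numbers the idealized model gives about one failure every 26 days. The 10-15 day estimate fits an effective MTBF of roughly 190,000 to 290,000 hours, which may be closer to field experience with consumer drives.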
To Terminate, or not to Terminate, that's the question - SCSIROB
These cost a bit and have drives which fail at a fairly infrequent rate. It doesn't hurt that the data center is kept at 64 degrees by two (redundant) chillers and has 450 kVA redundant power conditioners keeping the electricity on at all times. (We do shut off the power to the building once a month to check these and the diesel generator housed on the premises as well.)
Now - paying $x,xxx per year for maintenance on these units is cheap insurance in my mind. If something goes wrong, HP is available 24/7 to be onsite with replacement parts. This has, in fact, happened during the past few years. A controller on the array went bad, causing disk read failures. We instantly called HP, had a tech onsite, and had the controller replaced within a few hours of the problem being detected.
OTOH - for someone's 4 petabyte home pr0n collection, this might be a good idea! :P
The Kai's Semi-Updated Website Thingy
If you need the support, go pay the premium. Those of us with the appropriate technical background welcome the cheaper implementations.
If an article went up describing how a major vendor released a petabyte array for $2M the comments would be full of people saying "I could make an array with that much storage far cheaper!"
Now someone has gone and done exactly that (they even used Linux to do it) and suddenly everyone complains that it lacks support from a major vendor.
This may not be perfect for everyone's needs, but it's nice to see this sort of innovation taking place instead of blindly following the same path everyone else takes for storage.
These guys build their own hardware, think it might be able to be improved on or help the community, and they release the specs, for free, on the Internet. They then get jumped on by people saying "bbbb-but support!". They're not pretending to offer support, if you want support, pay the 2MM for EMC, if you can handle your own support in-house, maybe you can get away with building these out.
It's like looking at KDE and saying "But we pay Apple and Microsoft so we get support" (even though, no you don't). The company is just releasing specs, if it fits in your environment, great, if not, bummer. If you can make improvements and send them back up-stream, everyone wins. Just like software.
I seem to recall similar threads whenever anyone mentions open routers from the Cisco folks.
I like music
I'd like to see support for power supply redundancy or, at least, battery backup for the raid cards before I consider this as a viable solution - even for homebrew.
Not too shabby.
I had recently built a "storage pod" for my media @ home (6T using 4 1.5T drives), and had a hell of a time finding "good" components. So, I looked this over, and while it's made up of "consumer components" a couple of the components seem impossible to find for this as well.
Case: Custom Built
HD Backplane: Custom made by Chinese manufacturer.
So good luck building a "one off" for your small business/home, as I'll also bet these prices are for "quantity" (quality notwithstanding).
Lots missing here. How do I access the box? FC? FCoE? NFS only? 1GbE or 10GbE? What about cache, and caching algorithms/logic? Snapshotting? Mirroring? Clones? Remote mirroring?
Yes, all of these can be done at limited levels with Linux (openfiler for instance) but this implementation loses a lot in the ports and cache.
I think people are missing the point of this whole thing. Instead of trashing and tearing the idea down, think what would make it better and improve the design. I've been researching for a while now for something to store a life's worth of data, and this looks like something that will meet my needs: scalable, and enough space for a lifetime (I hope).
you know you can fry stuff putting things into things that dont like the things you put into it...
Those Seagate drives have been fraught with problems since their release. The model they quote is ST31500341AS. The reviews on both Amazon and NewEgg detail the history. Supposedly, Seagate finally got the firmware sorted out, but would you want to test it with a couple grand of drives? More to the point, would you want to support it? That choice has the air of penny-wise and pound foolish.
Or you could wait a decade and buy a similar capacity drive for $100.00
I guess my problem with this system is the multiple "single points of failure". There would be tons of downtime with this configuration. I guess I don't see where a piece of HW like this fits into a real data center? Consumer level reliability in a datacenter? The support required for this setup would be much higher than a piece of HW from EMC, netapp, etc....that's probably why Amazon isn't using this setup.
Basically, cost has been traded for quality in every possible design decision...
We're a backup service, so our datacenter contains a complete copy of all of our customers' data, plus multiple versions of files that change. In rough terms, every time one of our customers buys a hard drive, Backblaze needs another hard drive.
Data deduplication (see http://en.wikipedia.org/wiki/Data_deduplication) drastically reduces the storage requirements for backups. While email attachments are the classic example, it's doubtful that every one of their customers is using a unique build of their OS. Ditto for third-party software. A lot of media also gets duplicated between people: vendors' whitepapers, video, even porn gets downloaded by lots of people. Rsync uses de-dup techniques to reduce bandwidth requirements; there's no reason why a clever storage node couldn't use that de-dup metadata to keep its own storage costs down.
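To make the de-dup idea concrete, here is a toy Python sketch of a content-addressed block store. It is not how Backblaze or rsync actually work: rsync uses rolling checksums over shifting windows, while this sketch swaps in fixed-size blocks keyed by SHA-256 for simplicity. The `DedupStore` class, its methods, and the file names are hypothetical.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical blocks are kept only once."""

    def __init__(self):
        self.blocks = {}  # SHA-256 hex digest -> block bytes
        self.files = {}   # file name -> ordered list of block digests

    def put(self, name: str, data: bytes, block_size: int = 4096) -> None:
        digests = []
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store only unseen blocks
            digests.append(digest)
        self.files[name] = digests

    def get(self, name: str) -> bytes:
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self) -> int:
        return sum(len(block) for block in self.blocks.values())


# Two customers back up the same 16 KiB installer; every 4 KiB block in it is distinct.
payload = b"".join(i.to_bytes(4, "big") for i in range(4096))
store = DedupStore()
store.put("alice/setup.exe", payload)
store.put("bob/setup.exe", payload)
print(store.stored_bytes(), "bytes stored for", 2 * len(payload), "bytes backed up")
```

The catch for a backup provider is encryption. If each customer's data is encrypted with a different key before upload, identical files no longer produce identical blocks, so cross-customer de-dup only works on data the service can see in the clear or hash before encrypting.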
Nothing for 6-digit uids?
More problematic than the hardware is their storage of the PRIVATE key on THEIR servers. Hello, gov't fishing expedition. This is just as bad as that "secure" email service some Canadian outfit was offering a few years back.
Granted, this effort is about how to get the cheapest possible mass storage with just enough redundancy to sleep fitfully at night. Shoot, my home rig is better designed but not quite as dense (I don't have the budget/time to make a custom case).
they used incredibly cheap-ass HBAs for no good reason.
a power supply failure instantly kills the LUN
proper fault-tolerant power supplies exist (see Chenbro) and are very reasonably priced
questionable air-handling (side vents let air bypass drives)
I use
Chenbro 24port SAS multiplier
LSI 8-port SAS card (pci-x)
Dell Perc6i
Coolermaster Centurion 590
Coolermaster 4-in-3 (5 units)
650w power supply.
Chenbro has good backplanes (see their 16, 24 drive cases)
For your average datacenter, primary storage needs to be on a major vendor's hardware, because you need the extras that the major vendors supply. However, Backblaze is in the business of providing off-site storage for their customers. Their data is the secondary copy, so it can be as cheap as they can make it. No one is going to be running their data center off of this copy, so it can be low performance. And while I'm not saying that they should, they could probably get away with running non-protected storage for everything. Even if they lose a drive every day, it's unlikely to hold the data needed for that day's requested restores. That means they can almost always rebuild a failed drive's contents the next time the affected customers sync up.
I need to lose about $117K. I really need petabytes for home use. Slashdot has finally cracked. You guys must have ingested a huge amount of crack.
Yours In Russia,
Kilgore Trout
After looking at the petabytes-on-a-budget blog entry, click on "Home" for the main website where you'll spend 15 seconds watching a spokesgirl douse a laptop with lighter fluid and light it. Cheeky, but effective. .... but will it blend?
where I work we pay a premium for what happens when the power goes out, what happens when a drive goes bad,
Whoever spec'd your systems should have accommodated obvious failures like this. As in, paying for colo, using servers with dual power supplies that fail over, sensible RAID strategy. Giving money to EMC in this situation is not sensible.
but they also left out the people who manage these massive beasts. I mean, how many hundreds (or thousands) of drives are we talking here?
I have a couple of hundred drives going at any one time and I get an SNMP alert when a drive goes bad. I take one out of the closet and destroy the broken one. The RAID does the rest.
someone to get up in the middle of the night when their pager goes off because something just went wrong and you want 24/7 storage time.
Our storage strategy is N+1 all the way and required to be online 24/7 so failures are part of the plan. They are probably part of the plan at this startup.
We pay premiums so we can relax and concentrate on what we need to concentrate on.
I don't understand this. If your job is 89% software dev, then EMC may be the way to go. Expensive! But, it makes a little business sense. If you aren't spending most of your time writing software that adds value to your service/product, then EMC is doing your job and you are some kind of TPS generator. Do you pay a premium to blame someone else? I've had the opportunity to work in places like this and I've always passed because of the veiled contempt for IT.
Please, explain this to me.
http://www.maxineudall.com/2010/02/should-economists-be-sued-for-malpractice.html
I like how you dismiss a detailed real world design example based simply on a claimed feature without any further substantiation. Very classy. I'm not saying you are wrong, but would it kill you to go into a little more detail about why these folks need "luck" when they are clearly very successful with their existing design?
STFU about slashdot bias.
That's amusing, since EMC was born out of the outrageously overpriced offerings from IBM and other mainframe companies of the day.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
I think this solution is quite interesting and probably fits their needs, but comparing it to the storage solutions of the vendors listed is quite ridiculous. Another thing to note: there are vendors, such as NEXSAN, that sell cheaper storage systems that, while still more expensive than this, would probably have met their needs.
The first issue is high availability. There are many single points of failure on this box. There is only a single controller. The power supplies are not redundant. With this number of drives, a single fan failure might lead to heat high enough to damage components. Single-port backplane. No NVRAM. The only things that aren't a single point of failure are the drives themselves, because they are in a RAID6 config, but I still see a problem with that: their configuration uses no hot spares. A high-end storage system is going to have multiple controllers and redundant power supplies, be able to sustain multiple fan failures, and have multiple backplanes with interposer cards. It's also going to have NVRAM so that, should a power failure occur, acknowledged cached data would not be lost.
The second issue is maintenance. A high-end storage system's parts are highly accessible and often hot swappable. A controller goes out, it's like changing a Nintendo cartridge. With this box, if anything goes except a drive, the box is coming down. If you are replacing a drive you'll have to slide the box out (hopefully you left enough clearance for the power cords when you slide it out), then you have to pop in a new drive, and hopefully not break the SATA connector on the backplane. Oh man, I forgot to put on a new rubber band, I mean vibration dampener.
What's the performance of this box like? With software RAID and only a single processor with no ASIC acceleration for anything, I would have to imagine the processor is going to get pretty bogged down. With a high-end box everything is pretty much designed, within reason, to make the drives the ultimate performance bottleneck. Can this system fully utilize all the drives, or can the drives deliver more IOPS and throughput than the controller can handle?
Extra features. What does this box offer in terms of volume copying, flash copying, and remote mirroring?
The value of an enterprise solution is that it provides the features that keep it working 99.999% of the time, not just 99%. I see so many possible areas where data could possibly be lost or corrupted. A couple of comments have suggested this just being a block in a bigger solution, treating it just like a drive. In that case you are going to have to add an additional layer of redundancy, probably a mirror. With a straight mirror you are going to see a doubling in cost of hardware, infrastructure, power and cooling, which is going to start disrupting the cost/benefit of this solution. If you just want a bunch of file space accessible through HTTP with the ability to tolerate the occasional loss of data and downtime, this solution will work fine. If data loss or downtime means the loss of data or jobs, you'll go with one of the major storage vendors.
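The gap between 99.999% and 99% is easier to see as downtime per year. A minimal Python sketch, assuming a 365-day year and treating availability as a simple fraction of wall-clock time; the function name is illustrative.

```python
def downtime_hours_per_year(availability_pct: float) -> float:
    """Hours of downtime per 365-day year at the given availability percentage."""
    return (1 - availability_pct / 100) * 365 * 24


for pct in (99.0, 99.9, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% available -> {hours:.2f} hours ({hours * 60:.1f} minutes) down per year")
```

"Five nines" allows a little over five minutes of downtime a year, while 99% allows more than three and a half days.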
Oh, please stop with the rationale that *something* is accomplished with the average PHB cost/benefit analysis.
The *vast* majority of the time both the costs and benefits are fabricated out of whole cloth to support a foregone conclusion.
A plain-vanilla SAN is worth every penny.
Especially now that you can get them from distressed companies who paid too much for them a couple of years ago, $15,000 will get you a refrigerator-sized solution. Straight retail on a 2U san is still getting cheaper every year. http://h71016.www7.hp.com/dstore/ctoBases.asp?oi=E9CED&BEID=19701&SBLID=&ProductLineId=450&FamilyId=2569&LowBaseId=15222&LowPrice=$1,899.00
I remember when i first got my hands on a sun thumper... impressive piece of kit but with a sun price tag... this one is way kewler on the price.
A project i worked on tried to deploy quite a number of thumpers and we ran into "issues"... First the racks - the thumper weighs in the vicinity of 150kgs (330lbs i think?). Try putting 10 of them in a rack and you're in for a shock: assuming the rack can handle it, most data center floors have "issues" supporting the weight.
The second problem we had was cooling, the temp coming out the back of the rack was quite astronomical, and lastly power. In AU, this can often be a pain in the rear, especially with each thumper taking in about 2kw - 20kw per rack = PAIN.
Still, it's kewl to see them do it all open.
I'd love to see someone do something like that though with computing power. Take the proliferation of mini-itx boards with "real" cpu's on them (ok, desktop cpus, but still, not shabby really) you could do a similar setup with a custom case supporting quite a number of those little buggers quite easily. There's a beautiful little zotac AMD board that is almost ideal - supports the quad core, has a gig interface (sadly it has been aimed at htpc's, cause you could replace the wireless, video, 6 usb ports, etc with server-useful componentry - i.e. 2 or more gig ports and ipmi). But mostly it would be cheap and could run something like ovirt or abicloud quite happily. Shame that. There are other options in the space, like intel have a half-width xeon board (and a 1ru case that can support 2 of them side-by-side), but they're hard to get a hold of and quite long. Get rid of local storage on the servers, use serial instead of video and add gpxe for remote boot - brilliant, and they don't exist!
On a completely side note, one thing that i'd love to see in linux that has yet to exist in a useful format is replication (async) - the only real option is drbd, but it's such a pain to set up and very inflexible. I was always so disappointed that neither zfs, lvm nor btrfs include it (even basic local replication would have sufficed, given the existence of so many network level block storage transports: iscsi, etc).
They went RAID 6, even though it is slow as shit, for the added failsafe mechanisms.
How long does it take to rebuild the array if a disk has to be replaced? Each RAID 6 volume is 15 disks of 1.5TB each (19.5TB data + 3TB parity). So either they'd have to take a real performance hit during a relatively short rebuild period, or a smaller hit over a longer rebuild period. Longer rebuild periods increase the odds of further failures before the rebuild is complete.
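The capacity split and a rough lower bound on rebuild time can be sketched in Python. The sustained rebuild rates used here (100 MB/s with the array otherwise idle, 25 MB/s while still serving traffic) are assumptions for illustration, not measurements from the pod, and the function names are made up.

```python
def raid6_layout(disks: int, disk_tb: float):
    """Return (data TB, parity TB) for one RAID 6 volume: two disks' worth of parity."""
    parity_tb = 2 * disk_tb
    data_tb = (disks - 2) * disk_tb
    return data_tb, parity_tb


def rebuild_hours(disk_tb: float, mb_per_s: float) -> float:
    """Hours to rewrite one replacement disk at a sustained rate (decimal TB and MB)."""
    return disk_tb * 1_000_000 / mb_per_s / 3600


print(raid6_layout(15, 1.5))  # (19.5, 3.0), matching the figures in the comment
for rate in (100, 25):        # assumed MB/s: idle array vs. busy array
    print(f"{rate} MB/s -> {rebuild_hours(1.5, rate):.1f} hours to rebuild one 1.5TB disk")
```

Even the optimistic case is several hours. During that time the other 14 disks in the volume are all being read end to end, which is the window the comment is worried about.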
Those who can make you believe absurdities can make you commit atrocities. - Voltaire
These people are in the same business as Carbonite. I am sure Carbonite wired together some similarly hacked solution.
OOOPS!
It seems to me that people are trying to engineer an impossible business model: unlimited storage for $5 per month.
These are pretty impressive and a good start, but there are a few things I would do differently. I used to build NAS boxes for a living and there are some issues with this design. I think they have cut too many corners.
1) Their choice of RAID cards is somewhat questionable. What 4 cards can you get for $175 total which will support proper hotswap? Even running software raid, I would still want cards that provide proper monitoring and drive management, like 3ware. Yeah, maybe it would have cost you a fair bit more than $175 per box, but it would be worth the difference. You would still be saving a ton. Also I am not sure I would put more than one drive on a cable with a multiplexer. You can get 16 port 3ware cards that use multiport cables that break out at the backplane. Now you would also have to upgrade to a server class motherboard with at least 3 PCIe slots.
2) I haven't checked recently, but is software raid 6 even recommended yet? I know the 2.6 kernel has been supporting it for a while but it was still listed as experimental last I checked. I might stick with raid 5 here.
3) While using Zippy power supplies is an excellent choice, I would definitely want redundant power in these boxes.
Skimming through this article, one thing is very clear here: performance. In many cases the extremely slow response times from this type of array setup will just not be acceptable.
Some quick math: basic 8-lane SAS cards that will give you any kind of performance are going to run you about $500 per card. So that's an extra $2000 right there per machine. Then once you have all these nice SAS cards so that your storage isn't slow as dirt, you'll need a nice 4-slot, 8x PCI-E motherboard and some serious CPU power to drive the software array. Next you'll need some kind of interconnect to make this useful. You could go with 10GbE cards for around $500+ each, however a simple switch is going to set you back many thousands of dollars. The same goes for IB interconnects.
Looking at many of the software-based solutions such as the referenced sun x4550 - these are not simple servers. They are quite beefy and still find themselves overwhelmed by the IO from even slow disks.
These seem like a neat idea for cheap slow storage but I wouldn't ever look to something like this for enterprise or HPC class storage solutions.
You realize they are USING this NOT SELLING it, right? They tell YOU how YOU can build one, nowhere are they offering to sell some schmuck a storage array.
If you don't know how to maintain it, do not try to do it yourself! however if you do, and you can save the kind of money they are saving, then go for it.
PCI SATA card? Low-end MB? IDE boot disk?
The low-end MB with onboard video is also bad. Add a cheap PCI-E or PCI video card and turn off the onboard video, or don't get a board with onboard video that uses system RAM in a server. Real servers have onboard PCI video that has its own RAM. Onboard video also uses up chipset I/O.
IDE is dying on new boards, and the boot disk needs to move off IDE.
The SATA card should be a good PCI-E one, and you do not need 3-4 low-end ones.
Most of what you're paying for with vertical hardware such as Sun, SGI, Netapp, EMC, etc. is the SUPPORT, and the THROAT TO CHOKE. You're buying a service from them really, it's best not to think of it as hardware you bought.
Yeah, the hardware has a big pricetag. The support has pretty much the same pricetag as the hardware, every year. And those guys will move heaven and earth to debug your wierd problems that you hit under your specific stres scenarios. It takes a lot of work to figure out where the bug is when there are components from many different vendors interacting. If you built your storage array yourself, it's your problem when it does weird stuff.
Hope you're good with a kernel debugger, because under stress, it gets really, really fun to figure out what's going on, and hundreds of hours of debugging sometimes.
But yeah, you can build a big array for a lot less than a SAN vendor will charge you. You're just on the hook for when it eats your data, which it will, eventually. And, you don't have a big company *testing* the hell out of the exact hardware config you're running, so every issue is a one-off.
It's not just the hardware, it's all the components in the storage stack between your app and the bits on disk. Multipathing, caching, network, all that stuff. How does the whole storage stack respond to a particular error thrown on a consumer-grade hard disk? It might respond properly, it might mask the error. Who knows? Do you trust that your writes made it to disk when the OS says they did?
Do-it-yourself only makes sense if a) you don't care about the data (for example, the caching disks in the Google farms can die, and it's no big deal), or b) the risk is worthwhile and you're willing to accept a higher probability of data corruption and/or loss, or c) the SAN you're talking about is ghetto anyway, and you can engineer better perf/reliability yourself; aka you know what the hell you're doing and accept the aforementioned risks.
The article talks about how it is not intended as a complete solution. They do not go into, or intend to, describe their redundancy features, their performance issues, or anything else.
From the Article:
A Backblaze Storage Pod is a Building Block
We have been extremely happy with the reliability and excellent performance of the pods, and a Backblaze Storage Pod is a fully contained storage server. But the intelligence of where to store data and how to encrypt it, deduplicate it, and index it is all at a higher level (outside the scope of this blog post). When you run a datacenter with thousands of hard drives, CPUs, motherboards, and power supplies, you are going to have hardware failures - it's irrefutable. Backblaze Storage Pods are building blocks upon which a larger system can be organized that doesn't allow for a single point of failure. Each pod in itself is just a big chunk of raw storage for an inexpensive price; it is not a "solution" in itself.
If you did want to attack this concept, it would be based on the fact that I cannot think of a good general storage use for this besides serving static webpages.
The only access method is through https.
There is only 1 gigabit/s of bandwidth per 67 terabytes. 67 terabytes is, duh, 67000 gigabytes... That's 536000 gigabits. A 1 gigabit/s interface needs about 6 days to move all that data. Oh, and it can only be accessed through https. So it's somewhat questionable that you can actually move nearly that much data. I don't really know what the limitations of the hard drives or SATA are, but no matter how much speed any of that has, the network link and latency are going to be significant if you are really moving large-scale data. I can only assume their applications don't require speed, or that by duplicating it over a large number of systems they are going to get some load balancing. So then one asks... How many of these pods equal a redundant system with reasonable performance? And what is the power usage involved?
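For anyone who wants to redo the arithmetic above with their own numbers, here is a minimal sketch. It assumes decimal units (1 TB = 1000 GB) and a link running at full line rate unless told otherwise; the function name and the `efficiency` knob are made up for illustration, and the 80% figure is just a placeholder, not a measurement.

```python
def transfer_days(capacity_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move capacity_tb terabytes over a link of link_gbps gigabits/s.

    efficiency is a hypothetical fudge factor for protocol overhead
    (1.0 = perfect line rate).
    """
    gigabits = capacity_tb * 1000 * 8          # TB -> GB -> gigabits
    seconds = gigabits / (link_gbps * efficiency)
    return seconds / 86400                     # seconds per day


if __name__ == "__main__":
    print(f"67 TB over 1 Gb/s at line rate: {transfer_days(67, 1.0):.1f} days")
    print(f"67 TB over 1 Gb/s at an assumed 80%: {transfer_days(67, 1.0, 0.8):.1f} days")
```

Any real https overhead would only push the number up from the line-rate figure.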
There is Raid6 based on 15 drive sets with 2 parity drives spread across between 1 and 3 controllers but there is no hot swappable drive, fan, or controller.
Essentially a single drive failure requires you to take down the entire system. Now I assume there is a replicated system, so you can just take down any of these boxes with no planning.
--------------------
Honestly I am sure this suits their purpose. I can't imagine what purpose it would suit for me.
This is definitely an ambitious project, kudos to them for pulling it off and sharing with the world. If you look at the diagram, they're powering half of the drives from each power supply, meaning that a single AC or PSU failure takes out half of the drives in the shelf. The PSUs and motherboard are big single points of failure, which means they must be willing to accept loss of an entire shelf (not too much of a problem given the low per-shelf cost).
I also echo the comments above about lack of enclosure management and difficulty of replacing a single disk. They talk about using a rubber band around each disk as a vibration dampener, and a piece of foam across the top of all 45 for the same purpose. If you pull the lid off to replace a drive, then you're taking the foam out too. The vibration during disk replacement will likely kill the performance of the system, even if only for the minute it takes to replace the drive. And I expect that even with the lid on, the vibration-induced performance loss is pretty severe.
However, it doesn't sound like these guys care too much about performance, so this is likely a non-factor. After all, they're accessing 45 desktop SATA disks via what appears to be a single 1 Gbps ethernet link. Using their 87% of 67.5 TB figure, they have 58.725 TB of usable storage and a maximum throughput of around 100 MBps to the shelf. That comes out to nearly 7 days to fill one of these shelves up.
I give it a 50/50 chance of actually breaking even vs. buying the cheaper Dell solution in a 5 year time frame.
I give it a 10% chance of causing an EPIC FAIL that causes the company to go out of business from a massive loss of customer data.
I have mod points and I am not afraid to use them
After adjustment, storage capacity has increased about 100,000x per dollar in the last 25 years. To get to a petabyte in the desktop price range requires just another 10 years or so.
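A quick way to sanity-check that extrapolation, as a rough sketch only: it assumes the 100,000x-per-25-years trend simply continues at a constant exponential rate, and it borrows the $117,000 per petabyte figure from the summary as the starting price. The function names are invented for this example.

```python
import math


def doubling_time_years(growth_factor: float, years: float) -> float:
    """Doubling time implied by growing growth_factor-fold over the given years."""
    return years * math.log(2) / math.log(growth_factor)


def improvement_after(years_ahead: float, growth_factor: float, years: float) -> float:
    """Improvement factor after years_ahead, if the same exponential trend holds."""
    return growth_factor ** (years_ahead / years)


if __name__ == "__main__":
    print(f"doubling time: {doubling_time_years(1e5, 25):.2f} years")
    factor = improvement_after(10, 1e5, 25)
    print(f"improvement in 10 more years: {factor:.0f}x")
    print(f"projected cost of a petabyte: ${117_000 / factor:,.0f}")
```

Under those assumptions capacity per dollar doubles roughly every year and a half, and ten more years is about another 100x.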
(-1: Post disagrees with my already-settled worldview) is not a valid mod option.
The $117K is just the computer hardware. You still need UPS, A/C, Power, and floor space. Add up those, and a reasonable profit, and I'll bet Amazon and EMC don't look so bad. But if you already have the infrastructure, and the marginal cost of adding the storage arrays is low, then the design could save money.
and not something you'd want to store valuable data on. First off, it does not have redundant power. You could probably add redundant power for another $1,000 or so.
Second of all, if you did set up something like RAID 5 or RAID 6 (or RAIDZ/RAIDZ2), the rebuild time on a drive would probably be well over 12 hours with 1.5TB SATA drives.
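To put rough numbers on the rebuild time, here is a back-of-the-envelope sketch. The sustained rebuild rates are assumptions picked for illustration (a busy array typically rebuilds well below a drive's sequential peak), and the function name is hypothetical.

```python
def rebuild_hours(drive_tb: float, rebuild_mb_s: float) -> float:
    """Hours to rewrite one whole drive at a sustained rebuild rate (decimal units)."""
    megabytes = drive_tb * 1_000_000
    return megabytes / rebuild_mb_s / 3600


if __name__ == "__main__":
    # Assumed sustained rates in MB/s; not measured on any real pod.
    for rate in (100, 50, 30):
        print(f"1.5 TB at {rate} MB/s: {rebuild_hours(1.5, rate):.1f} hours")
```

By this arithmetic, a rebuild of over 12 hours corresponds to a sustained rate below about 35 MB/s on a 1.5TB drive.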
I'm sure many people would be tempted to put all 45 drives in a large RAID 5 volume, which would be even scarier.
A more practical version would be to go with 41x 500GB SATA, 3x 60GB SSD, dual redundant power supplies, 32GB RAM, and Solaris or OpenSolaris.
You would probably break it down something like this:
2 disks - RAID 1 mirror for the system
2 30GB SSD drives for the slog (definitely helps improve performance)
3 hot spares
and then 6 sets of 6 drives in RAIDZ-2 in a single pool
This leaves out a couple of drives. You could put in a couple of 1.5TB (or even 2TB) in a RAID 1 mirror for some supplementary storage or just leave them out. You're not going to have as much storage, but your data will be safer. Plus, dropping down to 500GB from 1.5TB drives is a large difference in price (as much as $50-$60 per drive), and the price differential should help cover the added expenses (such as power and the SSD drives).
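The layout above can be tallied with a small sketch like this one. It only counts drives and raw usable space; it ignores ZFS metadata overhead, and the class and field names are invented for the example rather than taken from any ZFS tool.

```python
from dataclasses import dataclass


@dataclass
class PoolLayout:
    system_mirror: int = 2       # RAID 1 pair for the OS
    slog_ssds: int = 2           # SSDs for the separate log device
    hot_spares: int = 3
    vdevs: int = 6               # RAIDZ2 sets in the pool
    drives_per_vdev: int = 6
    parity_per_vdev: int = 2     # RAIDZ2 keeps two drives' worth of parity per set
    drive_gb: int = 500

    def sata_drives(self) -> int:
        return self.system_mirror + self.hot_spares + self.vdevs * self.drives_per_vdev

    def bays_used(self) -> int:
        return self.sata_drives() + self.slog_ssds

    def usable_tb(self) -> float:
        data_drives = self.vdevs * (self.drives_per_vdev - self.parity_per_vdev)
        return data_drives * self.drive_gb / 1000


if __name__ == "__main__":
    layout = PoolLayout()
    print(f"SATA drives: {layout.sata_drives()}")
    print(f"bays used of 45: {layout.bays_used()}")
    print(f"usable pool space: {layout.usable_tb():.1f} TB")
```

With the defaults, the SATA count matches the 41 drives proposed, and two of the 45 bays stay free for the optional mirror.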
The real value in a data storage system isn't in the hardware, it's in the data. And the real cost incurred in a data storage system is measured in the inability of the customer to access that data quickly, efficiently and (in the case of a disaster) at all.
If you need to crunch the data quickly, a higher-performing system is going to save you money in the end. Look at all the benchmarks: no home-grown systems are anywhere on the lists. If you want to stream through your data at several gigabytes per second, you need to pay for a fast interconnect. Putting 45 drives behind a single 1GbE just doesn't cut it.
Similarly, if you want to ensure that the data is protected (integrity, immutable storage for folks who need to preserve data and be certain it hasn't been tampered with, etc) and stored efficiently (single instance store, or dedupe, so you don't fill your petabytes of disks with a bajillion copies of the same photos of Anna Kournikova) then you need to pay for the extra goodness in that software and hardware as well.
Finally, if you want extremely high availability, then the cost of the hardware is minuscule compared to the cost of downtime. We had customers that would lose millions of dollars per service interruption. They're willing to pay a million dollars to eliminate or even reduce downtime.
These folks are essentially just building a box that makes a bunch of disks behave like a honking big tape drive. It's a viable business--that's all some folks need. But EMC et al are not going to lose any sleep over this.
Am I part of the core demographic for Swedish Fish?
And that's the thing. Instead of being lazy, you can, with this setup, expand your own and your staff's knowledge and do it all yourself. Even if you had to completely replace the entire system yourself every year, you could operate the whole thing for over 10 years, including the associated costs of keeping the whole system going, for what it would cost just to buy from Amazon or EMC. In 10 years you'll likely just replace the whole setup anyway. I'd rather have someone I can ask down the hall what's going on with the system than a support company that'll be there in 4-6 hours.
:D
How about reading the section "A Backblaze Storage Pod is a Building Block".
<snip> the intelligence of where to store data and how to encrypt it, deduplicate it, and index it is all at a higher level (outside the scope of this blog post). When you run a datacenter with thousands of hard drives, CPUs, motherboards, and power supplies, you are going to have hardware failures — it's irrefutable. Backblaze Storage Pods are building blocks upon which a larger system can be organized that doesn't allow for a single point of failure. Each pod in itself is just a big chunk of raw storage for an inexpensive price; it is not a "solution" in itself.
Emphasis mine. I believe there are quite a few successful and reliable storage vendors not using ZFS. We get the point, you like it. Doesn't mean you can't succeed without it. Be more open minded.
I'm sorry if I haven't offended anyone
I'm somewhat serious about building one of these boxes myself.
I have to buy a lot of little parts from a multitude of vendors, fine. A small premium to pay over their quoted price.
My question falls to: where the heck do I buy a "Chyang Fun Industry (CFI Group) CFI-B53PM 5 Port Backplane (SiI3726)"?
Spend a few minutes and try and find that part for sale.
--frustrated--
Checksums, error correction, self-healing shit or whatever is the solution. Such things are easy to put on top of these pods.
I typed up a lengthy critique.. but decided not to post it...
I'll replace it with... "wow..."
This thing needs a lot more thought, especially with respect to redundancy, fault coverage, and maintenance.
Wow. Where is your support going to come from? You, you dumb ass. If you don't know how to replace a drive, you're an idiot.
Anyone else notice that they seriously restricted the throughput to/from the drives based on their choice of SATA cards (good old 32-bit PCI 2.3 only has a max theoretical of 266MB/s, and the same goes for PCIe 1x [250MB/sec])? In the worst-case scenario, each drive is at max getting 17MB/sec of transactional bandwidth, which is just pathetic (based on some very back-of-the-envelope calculations). For the amount of money they spent on making a custom solution, an extra 100-200 bucks to get a few 4-8 lane PCIe SATA cards is a pittance. Overall, it just demonstrates to me, at least, a poor understanding of what goes into making a good storage solution. And don't get me started on the lack of backup power supplies, error-checking RAM, etc.
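One way to reproduce a figure in that ballpark, as a sketch: divide the theoretical bus bandwidth evenly among the drives behind one card. The 15-drives-per-card count is an assumption chosen here because it lands near the 17MB/sec quoted; the actual wiring of the pod may differ, and the names below are made up.

```python
# Theoretical peak bandwidths in MB/s, as quoted in the comment above.
BUS_MB_S = {
    "32-bit PCI (66 MHz)": 266,
    "PCIe 1.x x1": 250,
}


def per_drive_mb_s(bus_mb_s: float, drives: int) -> float:
    """Even share of bus bandwidth when every drive behind the card is busy."""
    return bus_mb_s / drives


if __name__ == "__main__":
    drives_per_card = 15  # assumption, not taken from the pod's wiring diagram
    for bus, bandwidth in BUS_MB_S.items():
        share = per_drive_mb_s(bandwidth, drives_per_card)
        print(f"{bus}: {share:.1f} MB/s per drive")
```

Real throughput would be lower still, since neither bus sustains its theoretical peak.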
I think we'll stick with these, http://www.xyratex.com/products/storage-systems/storage-F5404E.aspx . A little more expensive but hey, they come with an OS
Raw storage will always be cheaper than the effort of designing fault-tolerant, high-availability systems, but it's worth the effort to at least implement "good enough" systems to attempt to achieve these qualities rather than sticking with the dumb "stack-em-high" approach. Scalability matters, or else your "super cluster" will quickly be overtaken by the next dumb implementation when the next 18-month increment rolls around.
who's taking bets on how long it takes them to bite the bullet and shell out the cash for a netapp, emc, ibm, hp or other true SAN?
there are reasons that companies pay large sums of money for them. it's not because they *can* or because they *want to*.
one day they'll realize this.
not only is time travel possible, it's irrelevant.
tech r for tech guys
not dummies
I began designing video servers about 15 years ago, and the single factor I consider most important to device longevity is keeping the temperature low. The BackBlaze configuration attempts, as many have done, to solve the problem with multiple fans. Most folks also make the mistake of trying to push air into the box.
In the images shown, the packaging places the drives so close together that achieving any reasonable cooling effect, especially for the front two rows, will be difficult. This is also exacerbated by the layer of foam they use on top of the drives to ensure they remain firmly seated. A drive's worst enemy is heat, and they make their own.
I visited LCCC several times when in student government a few years back. They're a legitimate college with a good student population and decent teachers. They're right outside Ocala, FL - halfway between Tallahassee and Orlando.
At least I wouldn't trust software RAID10 to write to both disk sets and then fill in the other set with the redundant copy when it had time. That really needs a battery-backed cache to implement safely. The overhead of RAID6 parity calculation should decrease for bulk writes, but at some point the CPU is going to be spending too much time calculating parity and not doing other stuff. 16 100MB/s drives in RAID6 would put quite a load on the system, but if it's only a file server it may be acceptable. I agree that degraded arrays suffer a much worse slowdown, especially for partial stripe reads. You could easily start getting only 100MB/s for lots of small reads on that same RAID6 with one or two failed drives, and that's assuming the CPU is fast enough to do error correction at 100MB/s (with two drives missing, the fast algorithm for accelerating RAID6 stops working and it has to emulate GF(2^8) multiplication with lookup tables). Most of my personal needs are cheap bulk data storage (movies, isos, etc.), so RAID5/6 makes sense. At work, I use RAID1, 10, and 5 since we don't have hardware support for RAID6 on the SANs. Production data goes on RAID10 because we can afford it, mirroring for system drives, and RAID5 for test/development systems that just need lots of storage.
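To make the lookup-table remark concrete, here is a toy sketch of GF(2^8) multiplication via log/exp tables, used to compute RAID6-style P and Q parity over one byte per drive and to rebuild two missing data bytes. It assumes the field polynomial 0x11D with generator 2, which is the usual textbook choice for RAID6; it is an illustration of the math, not how any particular kernel or controller implements it, and all names are invented.

```python
def build_tables(poly: int = 0x11D):
    """Build exp/log tables for GF(2^8) with generator 2."""
    exp, log = [0] * 512, [0] * 256
    x = 1
    for i in range(255):
        exp[i], log[x] = x, i
        x <<= 1
        if x & 0x100:
            x ^= poly
    for i in range(255, 512):      # duplicate so log sums need no modulo
        exp[i] = exp[i - 255]
    return exp, log


EXP, LOG = build_tables()


def gf_mul(a: int, b: int) -> int:
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def gf_div(a: int, b: int) -> int:
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2^8)")
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % 255]


def pq_parity(data):
    """P is the plain XOR; Q weights drive i by 2**i in the field."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(EXP[i], d)
    return p, q


def recover_two(data, x, y, p, q):
    """Rebuild data[x] and data[y] (x != y) from the survivors plus P and Q."""
    survivors = [0 if i in (x, y) else d for i, d in enumerate(data)]
    pxy, qxy = pq_parity(survivors)
    pd, qd = p ^ pxy, q ^ qxy              # pd = Dx ^ Dy, qd = g^x*Dx ^ g^y*Dy
    gx, gy = EXP[x], EXP[y]
    dx = gf_div(qd ^ gf_mul(gy, pd), gx ^ gy)
    return dx, pd ^ dx


if __name__ == "__main__":
    stripe = [0x12, 0x34, 0x56, 0x78, 0x9A]
    p, q = pq_parity(stripe)
    damaged = [0x12, 0x00, 0x56, 0x00, 0x9A]   # drives 1 and 3 lost
    print(recover_two(damaged, 1, 3, p, q) == (0x34, 0x78))
```

A single-drive loss needs only the XOR in P; the table lookups and the field division show up once two drives are gone, which is where the extra CPU cost comes from.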