Build Your Own 135TB RAID6 Storage Pod For $7,384
An anonymous reader writes "Backblaze, the cloud-based backup provider, has revealed how it continues to undercut its competitors: by building its own 135TB Storage Pods which cost just $7,384 in parts. Backblaze has provided almost all of the information that you need to make your own Storage Pod, including 45 3TB hard drives, three PCIe SATA II cards, and nine backplane multipliers, but without Backblaze's proprietary management software you'll probably have to use FreeNAS, or cobble together your own software solution... A couple of years ago they showed how to make their first-generation, 67TB Storage Pods"
It's full of stars!!
Wow, are we already approaching Petabyte clusters? I'm still getting used to Terabyte!
Ugh, replying to myself. I missed the link in the post.
But nothing's changed, right? It's the same chassis, same diagrams from backblaze. Only ~2 years of bigger drives is new.
For a true porn collector yet.
Vote monkeys into Congress. They are cheaper and more trustworthy.
for both internet security and privacy: each of us can now store his own local copy of the internet and surf offline!
Or can somebody tell me if the cooling of the HDs is ok if they are stacked like in the picture?
The article says it uses RAID 6 - 45 hard drives are in the pod, which are grouped into arrays of 15 that use RAID 6 (the groups being combined by logical volumes), which gives you an actual data capacity of 39TB per group (3TB * (15 - 2) = 39TB), which then becomes 117TB usable space (39TB * 3 = 117TB). The 135TB figure is what it would be if you used RAID 0, or just used them as normal drives (45 * 3TB = 135TB).
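For anyone who wants to redo that arithmetic for a different layout, here is a minimal sketch. The function name and the four-drive minimum check are my own additions; the numbers (45 drives, 3TB each, three groups of 15) come from the article.

```python
def raid6_usable_tb(drives_per_group, groups, drive_tb):
    """Usable capacity when every group is its own RAID 6 array.

    RAID 6 spends two drives' worth of space per array on parity,
    so each group contributes (drives_per_group - 2) drives of data.
    """
    if drives_per_group < 4:
        raise ValueError("RAID 6 needs at least 4 drives per array")
    return (drives_per_group - 2) * drive_tb * groups


raw_tb = 45 * 3  # plain sum of all drives, no redundancy
usable_tb = raid6_usable_tb(drives_per_group=15, groups=3, drive_tb=3)

print("raw:", raw_tb, "TB")
print("usable with 3 x RAID 6:", usable_tb, "TB")
```

Note that this ignores filesystem overhead, so the space you can actually write to will be somewhat lower.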
And these are all "manufacturer's terabytes", which is 1,000,000,000,000 bytes per terabyte instead of 1,099,511,627,776 (2^40) bytes per terabyte like it should be. So it's a mere 106 terabytes, assuming you use the standard power-of-two terabyte ("tebibyte", if you prefer that stupid-sounding term).
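A quick way to check the conversion yourself, assuming drive makers count a terabyte as 10^12 bytes (the usual decimal definition) and a power-of-two terabyte is 2^40 bytes. The helper name is made up for this sketch.

```python
TB = 10**12   # decimal terabyte, assumed to be what drive labels use
TIB = 2**40   # binary terabyte (tebibyte)


def tb_to_tib(tb):
    """Convert decimal terabytes to binary tebibytes."""
    return tb * TB / TIB


for label, size_tb in (("usable (RAID 6)", 117), ("raw", 135)):
    print(label, size_tb, "TB =", round(tb_to_tib(size_tb), 1), "TiB")
```

Each decimal terabyte is about 9% smaller than a binary one, which is where the gap comes from.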
You can buy 68 internal drives (2TB each) for the low price of $5439.32 http://www.newegg.com/Product/Product.aspx?Item=N82E16822152245 I'm not a hardware expert, but I imagine you could connect them somehow for less than $1944.68 ($7384 - $5439.32).
RAID-6, really?
After 5+ years working with ZFS, personally, I wouldn't touch md/extX/xfs/btrfs/whatever with a 10 foot pole. Solaris pretty much sucks (OpenSolaris is dead and the open source spinoffs are a joke), but for a storage backend it's years ahead of Linux/BSD.
Sure, you can run ZFS on Linux (I did) and FreeBSD (I do), but for huge amounts of serious data? No thanks.
Both FreeBSD and FreeNAS, in addition to OpenSolaris, support ZFS.
When you choose which file system to use, you should consider what the purpose of the storage is. If it's to run a database, you may want to rethink the decision to go with a journaling file system, because databases often do their own journaling (like the PostgreSQL WAL), which actually means performance will suffer if you put a journaling file system underneath that. Just my 0.0003 grams of gold.
You can't handle the truth.
I wouldn't be surprised if the top of the case fit flush with the hard drive cases and was used as a heatsink. Alu top case, finned, with a bank of fans in push/pull configuration, and a hot/cold arrangement of ducting along the racks.
That's how I'd do it, anyway.
Finally had enough. Come see us over at https://soylentnews.org/
The multipliers make me more nervous!
Seriously... my experience with sata multipliers has been that they should be avoided at all costs.
It really won't cost that much because you can sell your furnace.
Why not use a SAS card?
Why have three PCIe cards that are only x1 when an x4 or better card with more ports has more PCIe bandwidth, and some even have their own RAID CPU on them?
Why use a low-end i3 CPU in a $7K system? At least go to an i5, even more so with software RAID.
This is nothing new. You've never been in a datacenter before, kid. You can ask a grownup one day and he can take you there and you will feel the heat. And NOISE. No offense, but I think you're one of those gamer kids who builds rigs for max FPS, with esoteric water cooling and silent fans everywhere.
Yeah, no, you don't need to pamper your hardware that much. Even laptop drives work way hot (60C+) for years with no issue.
Most servers are built that way too. The Sun x4500 is extremely densely packed. And there are hundreds running just fine.
If you're in the SF Bay Area check out http://geeksessions.com/ where Gleb Budman from Backblaze will be speaking about the Storage Pod and their approach to Network & Infrastructure scalability along with engineers from Zynga, Yahoo!, and Boundary. This event will also have a live stream on geeksessions.com.
Full Disclosure: This is my event.
50% discount to the event (about $8 bucks and free beer) for the Slashdot crowd here: http://gs22.eventbrite.com/?discount=slashdot
Sun has been selling this same design for several years -- Sun x4500 released October 2006. - 6 SATA controllers - 48 top loading SATA drives - 2 x86 CPU.
Here is a link to Backblaze's actual blog entry for the new pods 135TB, and here is the original 67TB pods. The blog article is actually quite fascinating. Apparently they are employee owned, use entirely off-the-shelf parts (except for the case, looks like), and recommend Hitachi drives (Deskstar 5K3000 HDS5C3030ALA630) as having the lowest failure rate of any manufacturer (less than 1% they say).
I found it kinda amusing that ext4's 16TB volume limit was an "issue" for them. Not because it's surprising, but because... well, it's 16TB. The whole blog post is actually recommended reading for anyone looking to build their own data pods like this. It really does a good job showing their personal experience in the field and problems/not problems they have. For instance: apparently heat isn't an issue, as 2 fans are able to keep an entire pod within the recommended temperature (although they actually use 6). It'll be interesting to see what happens as some of their pods get older, as I suspect that their failure rate will get pretty high fairly soon (their oldest drives are currently 4 years old, I expect when they hit 5-6 years failures will start becoming much more common.) All in all, pretty cool. Oh, and it shows how much Amazon/Dell price gouges, but that shouldn't really shock anyone. Except the amount. A petabyte for three years is $94,000 with Backblaze, and $2,466,000 with Amazon.
P.S. I suspect they use ext4 over ZFS because ZFS, despite the built in data checks, isn't mature enough for them yet. They mention they used to use JFS before switching to ext4, so I suspect they have done some pretty extensive checking on this.
"None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
Thank you for pointing that out about laptop drives. I have one at home burning it up at over 50C.
The diversity and expression of human opinion is essential to human survival.
and different hardware/raid/multiplier/power harness setup..
basically the same, just updated - and worth a note.. i wish they sold or someone sold the setup sans drives (or just the bare case) - it looks fun to mess with but i don't have a lot of free time nowadays.
'...if only "Jumping to a Conclusion" was an event in the Olympics.'
With the latest bandwidth caps I'm seeing on my provider (AT&T U-verse), I can download data at a rate of 250 GB per month. So it'll take me 45 YEARS to fill up that 135 TB array. Something tells me they'll have better storage solutions by then.
In the meantime, I'm just waiting for Google to roll out the high-speed internet in my locale next year - maybe then I'll have a chance at filling up my current file server.
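The 45-year figure checks out if you count in decimal units (1 TB = 1000 GB) and assume the whole monthly cap goes toward filling the array. A rough sketch, with a made-up function name:

```python
def months_to_fill(capacity_tb, cap_gb_per_month):
    """Months needed to download capacity_tb at a fixed monthly data cap."""
    return capacity_tb * 1000 / cap_gb_per_month  # 1 TB taken as 1000 GB


months = months_to_fill(capacity_tb=135, cap_gb_per_month=250)
print("months:", months)
print("years:", months / 12)
```

If you only count the 117TB that is usable under RAID 6, it drops to 39 years.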
I have one running as a server. The fan inside is broken so no cooling at all. It runs around 100C for several months now.
Don't fight for your country, if your country does not fight for you.
Well, that noise is the massive fans that keep the temperature of the equipment fairly close to ambient. If you quiet down the fans, the room temperature won't change much but power-hungry components will suddenly be way, way above room temperature. I had a really crappy cabinet crammed with back-to-back disks, didn't think much of it until they started dying... checked the SMART data, oh 75C for the top drive... that's 50C or so above the ambient temperature in the room. Better cabinet with more space, more and bigger fans, now it's down to 40-45C. They don't do it to "pamper" the hardware, they do it to cool it quietly. If you don't care that your gaming machine sounds like a jet engine taking off, there's no problem.
Live today, because you never know what tomorrow brings
You would be surprised that there is a piece of foam between the top of the case and the drives if you RTFA!
I did something a bit similar on a smaller scale about 9 years ago. (Linux software RAID, 12 disks in a cheap server). The trick is to make sure that you pay something like 70% of the total hardware cost for the disks. It is possible, it can be done reliably, but you have to know what you are doing. If you are not a competent and enterprising engineer, forget it (or become one). But the largest cost driver in storage is that people want to buy storage pre-configured and in a box that they do not need to understand. This is not only very expensive (when I researched this 9 years ago, the disk part of the total price was sometimes as low as 15%!), but gives you lower performance and lower reliability. And also less flexibility.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Thermal design is highly non-intuitive. So you experiment, measure and have monitoring and automated emergency-shutdown in place. You do not even need fan-monitoring with this setup. Just very simple disk-temperature monitoring will tell you when a fan is down. My guess would be that they can tolerate one fan failure for some time and do a forced shutdown if two go down.
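To make the "simple disk-temperature monitoring" idea concrete, here is a minimal sketch of just the decision logic. Everything in it is an assumption for illustration: the thresholds (45C warn, 55C critical), the "three critical disks means shut down" rule, and the names are mine, not Backblaze's. Actually reading the temperatures (for example from SMART data) and performing the shutdown are left out.

```python
from enum import Enum


class Action(Enum):
    OK = "ok"
    ALERT = "alert"        # something is warm: probably a fan is down
    SHUTDOWN = "shutdown"  # several disks critical: stop before damage


def decide(temps_c, warn_c=45.0, crit_c=55.0, crit_count=3):
    """Pick an action from a {disk_name: temperature_in_C} mapping.

    Thresholds are hypothetical; tune them from your own measurements.
    """
    hot = [name for name, t in temps_c.items() if t >= warn_c]
    critical = [name for name, t in temps_c.items() if t >= crit_c]
    if len(critical) >= crit_count:
        return Action.SHUTDOWN
    if hot:
        return Action.ALERT
    return Action.OK


print(decide({"sda": 38, "sdb": 41, "sdc": 40}))
print(decide({"sda": 38, "sdb": 47, "sdc": 40}))
print(decide({"sda": 56, "sdb": 57, "sdc": 58}))
```

Run something like this from cron every minute or so and you get fan-failure detection for free, since a dead fan shows up as a cluster of warm disks.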
This is for experienced engineers. I have done things like this before, and I think I could design both hardware and software for these boxes. It is not magic, just solid engineering with a solid understanding of the problems involved.
Didn't the summary say so and provide a link to the previous story? Of course, in addition to the drives getting bigger they changed a couple other things (MB, memory, CPU, SATA cards, SATA multipliers, wiring), but it is the same case so sure it's the same.
Oh no doubt. I mean they are using these things reliably as you said, so I'm sure it works. Same can be said about the heat issues (though I guess that would be dependent on external cooling as well).
Just saying that the mere mention of SATA multipliers makes me cringe and fear for my data/sanity :)
You can get this from the company that builds the cases for them, Protocase. Send an email to lpodgursky@protocase.com for details. It's $5395.00 (1-4 units) and $4995.00 (5-9 units). And yes, that's more than building it yourself naturally.
Help stamp out iliturcy.
An Intel i3 540, more powerful than the CPU on most hardware RAID controllers. This thing will be doing very little other than handling the RAID sets.
hmm.
What the hell else is Sean doing with his time? That's what the articles are really missing...
Yes, 640K disks with 640 Terabytes each ought to be enough for anybody. :-)
The Tao of math: The numbers you can count are not the real numbers.
i wish they sold or someone sold the setup sans drives (or just the bare case)
TFA says the case is available from Protocase for $875 in single unit quantities.
A "pod" is just a standard x86 PC in this custom 4U case. Sure, it has a few specific extras, but all are standard, off-the-shelf hardware that you can easily buy. Appendix A in the Backblaze blog post gives every detail you need.
If you start with just 15 hard drives (for a total of 45TB), then the price would be about $3300. You probably only save about $500 by using a standard case, because a decent one with room for 15 or more drives will set you back at least $300.
The redundancy is unit based, not component based. This makes a lot of sense, it's what Google does. You don't have to go for expensive proprietary parts, you just buy two commodity parts (or more).
Cheap storage VM.
you could use openfiler, but you would want to swap some of your disk space for network controllers.
Cheap storage VM.
Seriously... my experience with sata multipliers has been that they should be avoided at all costs.
SAS multipliers with SATA drives is a better risk/cost balance, for the general case.
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
5k just for the case? or is that everything sans drives?
Instead of FreeNAS, you can use Openfiler. Also, Open-E is really good, and has easy to setup block replication failover as well. If you want to go high end for custom storage, take a look at Datacore.
Number 1 thing would be more / bigger network links. I think this has used all its PCIe slots, so you might have to cut the capacity by 1/3 to put in a bigger network card - say a 4x Gigabit Ethernet card (cheap) or a 10gig card (more expensive, plus need the 10gig port to hook it to). Or get a motherboard with more slots.
More speculatively:
More RAM might help if you can set it up to cache the right things. Faster drives would help the IOPS (lower latency) but the bandwidth bottleneck is going to be the network. You'll likely want two or three of these boxes for redundancy and backup, too. (Plus spares of everything.) Or maybe a big tape loader, but at this scale I think whole-server redundancy is a lot less trouble. (Your backup will weigh ~145lbs, though.)
ZFS is likely a more reliable way to go than RAID, though some think it's too new. If you add some SSDs, you can get better system response, too:
"ZFS also supports both read and write caching, for which special devices can be used. Solid State Devices can be used for the L2ARC, or Level 2 adaptive replacement cache, speeding up read operations, while NVRAM buffered SLC memory can be boosted with supercapacitors to implement a fast, non-volatile write cache, improving synchronous writes. Finally, when mirroring, block devices can be grouped according to physical chassis, so that the filesystem can continue in the case of the failure of an entire chassis."- http://en.wikipedia.org/wiki/ZFS
With a lot of RAM, mostly set up as a disk, a UPS, (and a shutdown script running on line power loss, of course) the write caching could likely be implemented more inexpensively than the SLC solution. It isn't that big a perceived performance increase on most systems, though, and adds some risk. Regular MLC SSD read caching with L2ARC can make the system far more responsive at very little cost, and with no reliability concerns if ZFS is set up correctly.
"Is life so dear, or peace so sweet, as to be purchased at the price of chains and slavery?" - Patrick Henry
Everything sans drives.
oops - read the thing and missed the one-off price on the line item - thanks for pointing that out.
They also sell the case by itself. They wanted $872 for qty 1 on the case alone about 18 months ago, some reasonable customization extras are available (custom silkscreen logo, custom colors and so on). Shipping is extra. It's odd that they don't have a simple web store setup for this, but it looks like their business is almost exclusively bespoke tin bending.
See: http://bigip-blogs-adc.oracle.com/brendan/entry/test for more about ARC, L2ARC, using SSDs with ZFS. With 128GB RAM, 550GB of SSDs, and 18TB of disk, the speedup was 8.4x over just the RAM and disks, with 20x less latency. YMMV with different workloads.
But the RAID CPU is on its own, where the system CPU has to do the video, networking, and the OS on top of doing the RAID work.
I think that you are looking for redundancy at too small a scale: Yes, per-box, there is very little redundancy. RAID-6 makes it not completely useless; but a PSU going out will take out half the box, which will render it pretty useless until the PSU comes back online, and if the mobo dies, game over.
However, as the pictures suggested, they are running rather a lot of these boxes. Their (proprietary) software layer handles storing data across all the boxes and presenting it in some useful-to-the-backblaze-client way over the internet. An OSS analog would be something like Tahoe-FS treating each storage box as a backend server. In that scenario, you can, depending on the desired tradeoff between cost and risk, allow one or more entire servers to fail without compromising the overall logical filesystem...
My place is in the market for a new SAN device, so it was very interesting to see this post today. What kind of changes would people suggest in order to make this sort of thing perform better (and more reliably) as a SAN device instead of just backup storage?
The changes that you would need to make to this device to turn it into a decent SAN would probably be rather more expensive than just buying the SAN from somebody who has economies of scale. You could just install an iSCSI target on the OS and call it a day; but performance would be deeply miserable and uptime not so exciting, by SAN standards.
Such a comparatively unreliable node really starts to make sense if you are working at a scale where each storage pod is considered to be a swappable component where failure or downtime is acceptable. There are a number of filesystems, some OSS, some proprietary, which allow you to present a single logical filesystem whose contents (and a configurable amount of redundancy information) are spread among a (potentially large) number of storage nodes connected over an IP network.
If you were talking about needing that amount of storage, you could set up a 'SAN' head node, based on a fairly powerful, all-the-redundant-bells-and-whistles enterprise grade server, which would run such a filesystem across a large number of these pods and present an iSCSI target to the rest of the network. It would still be on the slower-but-cheaper side of Real Serious SAN gear; but the correct choice of head node or head nodes could get reliability up there. If your needs are less than or equal to a single pod, though, you really can't bolt on incrementally more reliability or performance without causing the price to zoom up...
This is a modern 3.1 GHz, dual-core CPU vs. .... let's take a Promise SATA RAID card with an Intel 333 controller. That's an 800 MHz ARMv5TE CPU, two ARM generations ago, not even superscalar. The i3 is going to have many cycles to spare after taking the load of three such controllers.
The drives do not look to be hot swappable
(Disclaimer: I work at Backblaze) All SATA drives are inherently hot swappable, including the ones in the Backblaze pod. We have tried it, it worked the few times we did it. But for normal operations, we shut the pod down completely to swap drives. The first reason is that because the pods are stacked on top of each other and the drives are replaced from the top, we have to slide the pod out half way out of the rack like a drawer. It feels kinda wrong to slide servers around like that while the drives are spinning, so we avoid it (I have no proof it actually causes significant problems). Another reason is that with the top of the pod open, the cooling airflow isn't the same and some of the drives in the center start rising in temperature. This isn't fatal, but it puts you on a "timer" where you want to get the hot swap done within a reasonable amount of time (like 5 minutes) and get the pod closed back up again. Finally, it just seems safer to let the machine come up cleanly with the drive replaced. For our application it doesn't matter at all, no customer can possibly know or care if one, two, or ten pods are offline during a reboot.
Yeah, no, you don't need to pamper your hardware that much. Even laptop drives work way hot (60C+) for years with no issue.
That sounds a little hot. Just logged into one of my compute servers and the sensors read between 34 and 44 degrees. Though it's a 1U quad 6100[*] with very little disk space. But in general, slightly cold is waaaaay worse than very hot since the oil gets too viscous. My laptop runs hotter (cpu reads between 50 and 70 degrees), but it has a flash disk.
[*] The 1U quad 6100s are astonishingly dense. You see many vendors bragging about how some silly hacked up job made of infinite Atom CPUs or ARM or MIPS is super dense and low power, and they tend not to stack up well against the 6100s in terms of flops / U (or often even cores / U) and don't do much (if at all) better in terms of flops per watt. I think AMD are the current winner in this regard.
But yes, the heat and noise is quite astonishing.
By the way, there are decent companies that will sell you a watercooled rig off the shelf if you need a GPU workstation that doesn't sound like a turbojet. They look a little funny and l33t-gamer but they work very well.
SJW n. One who posts facts.
The foam is there for good reason, too... you don't want hard drives banging (even at a sub-millimeter distance) against the top of the case - tends to wear out the drives, cause more errors, and makes noise.
I know, I know... 'but they're flush!' Well, unless you custom-machined each HDD case *and* the unit case they went in, you're guaranteed to have a few drives in that type of physical array vibrate like that.
Quo usque tandem abutere, Nimbus, patientia nostra?
I was talking about guys that buy full-tower machines to make 4-way RAID 0+1 arrays, with each disk 20cm apart from the other, and a hard drive cooler (with two fans) for each drive. That's overkill.
Just a small amount of wind running under the drive is enough to keep it cool. No need to keep it at room temperature. 50C is good enough.
Yes, you need forced air (that is, fans). But a few correctly placed fans for the whole case, are enough.
Let's see... first thing I see when I click on hard drives on Newegg is a 3TB drive for $180.
So.
135/3 = 45
45 * $180 = 8100
That's just drives, no RAID, no controllers, no chassis/case.
With more digging I find a 5400 RPM drive for 139 ... so ...
45 * 140 = 6300, but still just the drives ... and no RAID.
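Here is the same back-of-the-envelope math as a small sketch, so you can plug in other prices. The $180 and $140 figures are the single-unit retail prices quoted above, the $7,384 total is from the summary, and the function names are made up for illustration.

```python
POD_PRICE = 7384   # total parts cost claimed in the summary, in dollars
DRIVE_COUNT = 45


def drives_cost(count, unit_price):
    """Total cost of the drives alone at a given per-drive price."""
    return count * unit_price


def break_even_price(pod_price, count):
    """Per-drive price at which drives alone would eat the whole budget."""
    return pod_price / count


for unit_price in (180, 140):
    total = drives_cost(DRIVE_COUNT, unit_price)
    print("at $%d/drive: drives = $%d, left for everything else = $%d"
          % (unit_price, total, POD_PRICE - total))

print("break-even per-drive price: $%.2f"
      % break_even_price(POD_PRICE, DRIVE_COUNT))
```

So whether the claimed total is plausible depends on how far below roughly $164 per drive they could actually buy, and on what the rest of the parts cost.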
Can you find cheaper drives? I'm sure, I spent all of about 10 seconds looking, but I doubt you're going to want to.
You guys are all wandering around arguing over the silliness of their slashvertisement (which it most certainly is) and various software implementations that would take the place of theirs and be better (which I don't disagree with one bit) but ... you entirely overlooked the fact that their statements are bald-faced lies. They didn't build it for that much. They may have ignored a bunch of costs and said 'we built it for X amount', but that's like saying the space program only cost American taxpayers the cost of shutting it down because we already did the other stuff so it doesn't count against the cost!
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
And that takes into account price breaks and volume pricing?
There exist 10 to 25 OEM packs of drives from many manufacturers, did you look at those mfg part no.s?
What about a full pallet?
Only a moron would buy that amount of drives from a company that sells mainly to CONSUMERS.
Even as a consumer, with large enough volumes, you may in some cases purchase straight from a distributor.
While it's great that they posted the plans, some of the parts on the list are custom, and it's a bit too much hardware tinkering for me. What I would like to see is a similar commercially produced box, minus drives, for a few thousand. All the big players with turnkey solutions seem to sell only with drives at ridiculous prices.