Build Your Own 135TB RAID6 Storage Pod For $7,384
An anonymous reader writes "Backblaze, the cloud-based backup provider, has revealed how it continues to undercut its competitors: by building its own 135TB Storage Pods which cost just $7,384 in parts. Backblaze has provided almost all of the information that you need to make your own Storage Pod, including 45 3TB hard drives, three PCIe SATA II cards, and nine backplane multipliers, but without Backblaze's proprietary management software you'll probably have to use FreeNAS, or cobble together your own software solution... A couple of years ago they showed how to make their first-generation, 67TB Storage Pods"
It's full of stars!!
Wow, are we already approaching Petabyte clusters? I'm still getting used to Terabyte!
Yet another dup from long ago
They have had a blog post on this topic for almost a year at least.
For a true porn collector yet.
Vote monkeys into Congress. They are cheaper and more trustworthy.
Something about all those drives being packed in there like hot metal sardines gives me a bad feeling...
"When information is power, privacy is freedom" - Jah-Wren Ryel
for both internet security and privacy: each of us can now store his own local copy of the internet and surf offline!
Or can somebody tell me if the cooling of the HDs is ok if they are stacked like in the picture?
The article says it uses RAID 6: the pod holds 45 hard drives, grouped into three arrays of 15 that each use RAID 6 (the groups being combined by logical volumes). That gives an actual data capacity of 39TB per group (3TB * (15 - 2) = 39TB), or 117TB of usable space in total (39TB * 3 = 117TB). The 135TB figure is what you would get with RAID 0, or by just using them as normal drives (45 * 3TB = 135TB).
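The capacity arithmetic above can be sketched in a few lines of Python. This is only a check of the numbers quoted in the comment; the function name `raid6_usable_tb` is made up for illustration, and it assumes each RAID 6 group gives up exactly two drives' worth of space to parity.

```python
def raid6_usable_tb(drives_per_group, groups, drive_tb):
    """Usable capacity when every RAID 6 group spends two drives on parity."""
    if drives_per_group < 4:
        raise ValueError("RAID 6 needs at least 4 drives per group")
    return (drives_per_group - 2) * drive_tb * groups


# Figures from the comment: 45 drives of 3TB, in three groups of 15.
raw_tb = 45 * 3
usable_tb = raid6_usable_tb(drives_per_group=15, groups=3, drive_tb=3)

print("raw:", raw_tb, "TB")
print("usable:", usable_tb, "TB")
print("lost to parity:", raw_tb - usable_tb, "TB")
```

Changing `drives_per_group` and `groups` shows the trade-off: fewer, larger groups waste less space on parity but put more drives behind each pair of parity disks.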
And these are all "manufacturer's terabytes", which is probably 1,024,000,000,000 bytes per terabyte instead of 1,099,511,627,776 (2^40) bytes per terabyte like it should be. So it's a mere 108 terabytes, assuming you use the standard power-of-two terabyte ("tebibyte", if you prefer that stupid-sounding term).
Didn't we cover this story a couple of years ago with smaller drives?
You can buy 68 internal drives (2TB each) for the low price of $5439.32 http://www.newegg.com/Product/Product.aspx?Item=N82E16822152245 I'm not a hardware expert, but I imagine you could connect them somehow for less than $1944.68 ($7384 - $5439.32).
RAID-6, really?
After 5+ years working with ZFS, personally, I wouldn't touch md/extX/xfs/btrfs/whatever with a 10 foot pole. Solaris pretty much sucks (OpenSolaris is dead and the open source spinoffs are a joke), but for a storage backend it's years ahead of Linux/BSD.
Sure, you can run ZFS on Linux (I did) and FreeBSD (I do), but for huge amounts of serious data? No thanks.
.
Both FreeBSD and FreeNAS, in addition to OpenSolaris, support ZFS.
When you choose which file system to use, you should consider what the purpose of the storage is. If it's to run a database, you may want to rethink the decision to go with a journaling file system, because databases often do their own journaling (like the PostgreSQL WAL), which means performance will actually drop if you put a journaling file system underneath that. Just my 0.0003 grams of gold.
You can't handle the truth.
It really won't cost that much because you can sell your furnace.
Why not use a SAS card?
Why have three PCIe cards that are only x1, when an x4 or better card with more ports has more PCIe bandwidth, and some even have their own RAID CPU on them?
Why use a low-end i3 CPU in a $7K system? At least go to an i5, even more so with software RAID.
If you're in the SF Bay Area check out http://geeksessions.com/ where Gleb Budman from Backblaze will be speaking about the Storage Pod and their approach to Network & Infrastructure scalability along with engineers from Zynga, Yahoo!, and Boundary. This event will also have a live stream on geeksessions.com.
Full Disclosure: This is my event.
50% discount to the event (about $8 bucks and free beer) for the Slashdot crowd here: http://gs22.eventbrite.com/?discount=slashdot
Here is a link to Backblaze's actual blog entry for the new pods 135TB, and here is the original 67TB pods. The blog article is actually quite fascinating. Apparently they are employee owned, use entirely off-the-shelf parts (except for the case, looks like), and recommend Hitachi drives (Deskstar 5K3000 HDS5C3030ALA630) as having the lowest failure rate of any manufacturer (less than 1% they say).
I found it kinda amusing that ext4's 16TB volume limit was an "issue" for them. Not because it's surprising, but because... well, it's 16TB. The whole blog post is actually recommended reading for anyone looking to build their own data pods like this. It really does a good job showing their personal experience in the field and the problems/not problems they have. For instance: apparently heat isn't an issue, as 2 fans are able to keep an entire pod within the recommended temperature (although they actually use 6). It'll be interesting to see what happens as some of their pods get older, as I suspect that their failure rate will get pretty high fairly soon (their oldest drives are currently 4 years old; I expect that when they hit 5-6 years, failures will start becoming much more common). All in all, pretty cool. Oh, and it shows how much Amazon and Dell price gouge, but that shouldn't really shock anyone. Except the amount. A petabyte for three years is $94,000 with Backblaze, and $2,466,000 with Amazon.
P.S. I suspect they use ext4 over ZFS because ZFS, despite the built in data checks, isn't mature enough for them yet. They mention they used to use JFS before switching to ext4, so I suspect they have done some pretty extensive checking on this.
"None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
...And actual useful snapshot capabilities. And utilities so easy to use, even your freakin' grandma can sling about storage pools.
No. Most manufacturers define the terms as 1024 bytes per kilobyte, 1000 kilobytes per megabyte, 1000 megabytes per gigabyte, and 1000 gigabytes per terabyte. Which gets really confusing sometimes - they can't even stay consistent within their own system.
I haven't checked how Hitachi does it, but that's how Seagate and Western Digital do it. I would assume Hitachi marks them the same way.
No, actually, you're completely wrong.
Hitachi (click Specifications):
Capacity - One GB is equal to one billion bytes and one TB equals 1,000GB (one trillion bytes) when referring to hard drive capacity.
Seagate:
When referring to hard drive capacity, one gigabyte, or GB, equals one billion bytes and one terabyte, or TB, equals one trillion bytes.
Western Digital (click Specifications):
As used for storage capacity, one megabyte (MB) = one million bytes, one gigabyte (GB) = one billion bytes, and one terabyte (TB) = one trillion bytes.
Some floppies use hybrid measurements, but hard drives have been entirely powers of ten for ages.
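For anyone who wants to see what the powers-of-ten definitions quoted above mean for this pod, here is a rough conversion sketch. It takes the manufacturers' definition (one TB = one trillion bytes) at face value and reuses the 135TB raw and 117TB RAID 6 figures from earlier in the thread; the helper name `tb_to_tib` is just for illustration.

```python
TB = 10**12   # decimal terabyte, as the drive makers define it
TIB = 2**40   # binary "tebibyte"


def tb_to_tib(tb):
    """Convert decimal terabytes to binary tebibytes."""
    return tb * TB / TIB


for label, tb in (("raw", 135), ("usable after RAID 6", 117)):
    print(f"{label}: {tb} TB = {tb_to_tib(tb):.1f} TiB")
```

Under these definitions the gap between the two units is about 9% at the terabyte scale, before any file system overhead is counted.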
With the latest bandwidth caps I'm seeing on my provider (AT&T U-verse), I can download data at a rate of 250 GB per month. So it'll take me 45 YEARS to fill up that 135 TB array. Something tells me they'll have better storage solutions by then.
In the meantime, I'm just waiting for Google to roll out the high-speed internet in my locale next year - maybe then I'll have a chance at filling up my current file server.
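The 45-year figure above is easy to reproduce. The sketch below assumes decimal units (1 TB = 1000 GB) and that the full 250 GB monthly cap is used for downloads every month; `months_to_fill` is a hypothetical helper, not anything from the article.

```python
def months_to_fill(capacity_tb, cap_gb_per_month):
    """Months needed to download capacity_tb of data under a monthly cap."""
    return capacity_tb * 1000 / cap_gb_per_month


months = months_to_fill(capacity_tb=135, cap_gb_per_month=250)
print(f"{months:.0f} months, or {months / 12:.0f} years")
```

Passing 117 instead of 135 gives the time to fill the usable RAID 6 capacity rather than the raw capacity.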
Not really that useful for any data that needs to be accessible 100% of the time. The drives do not look to be hot-swappable and there is no redundancy anywhere in the design.
Even with all those RAID groups, with a single processor the read/write times are going to be hideous. Also, without knowing about the software, your volumes/aggregates may be limited to a single RAID group, which limits the usefulness.
Yeah, it's a cheap solution, but its usefulness in a production or backup environment is limited. There are storage providers out there that have systems with price points not much higher than this that aren't as unreliable.
I did something a bit similar on a smaller scale about 9 years ago (Linux software RAID, 12 disks in a cheap server). The trick is to make sure that you pay something like 70% of the total hardware cost for the disks. It is possible, and it can be done reliably, but you have to know what you are doing. If you are not a competent and enterprising engineer, forget it (or become one). But the largest cost driver in storage is that people want to buy storage pre-configured and in a box that they do not need to understand. This is not only very expensive (when I researched this 9 years ago, the disk part of the total price was sometimes as low as 15%!), but gives you lower performance and lower reliability. And also less flexibility.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
I can't imagine who has a need for such a ridiculous amount of storage, but nevertheless...
ME WANT!
After all, "640K ought to be enough for anybody"...okay, he was talking about memory, but still...
*Sigh* (goes back to tinkering with 3 TB RAID array/server)
My place is in the market for a new SAN device, so it was very interesting to see this post today. What kind of changes would people suggest in order to make this sort of thing perform better (and more reliably) as a SAN device instead of just backup storage?
An Intel i3 540, more powerful than the CPU on most hardware RAID controllers. This thing will be doing very little other than handling the RAID sets.
hmm.
What the hell else is Sean doing with his time? That's what the articles are really missing...
Instead of FreeNAS, you can use Openfiler. Also, Open-E is really good, and has easy to setup block replication failover as well. If you want to go high end for custom storage, take a look at Datacore.
P.S. I suspect they use ext4 over ZFS because ZFS, despite the built in data checks, isn't mature enough for them yet. They mention they used to use JFS before switching to ext4, so I suspect they have done some pretty extensive checking on this.
ZFS is mentioned in the blog comments, as well as in the HN thread: they looked into it, but given that they decided to go with Linux on their servers, ZFS isn't really available in a stable fashion. If they had decided to go with (Open)Solaris or illumos or FreeBSD, then ZFS would be a more viable option.
It'd certainly be a lot less convoluted than mdadm(RAID6) -> LVM(PV,VG, 3xLV) -> ext4. A 'zpool create mypool1 raidz2 disk1 disk2 ... diskN' is a lot simpler.
But the RAID CPU is on its own, whereas the system CPU has to do the video, networking, and the OS on top of doing the RAID work.
This is a modern 3.1 GHz, dual-core CPU vs. .... let's take a Promise SATA RAID card with an Intel 333 controller. That's an 800 MHz ARMv5TE CPU, two ARM generations ago, not even superscalar. The i3 is going to have many cycles to spare after taking the load of three such controllers.
I have a hard time envisioning such an extreme capacity versus throughput requirement.
Like I said, part of it is the question of scrubbing which is a local task that scales with the amount of storage. I have seen much smaller SATA MD arrays (3-5 disks) which end up taking over a day per week just to complete their scrubbing with some light background load. If these larger arrays cannot effectively scale the scrubbing with many more disks, they could wind up saturating with nothing but scrubbing 24x7. How long does it take for a pod to scrub its entire array? How long does it take to sync a replacement 3 TB drive?
And as for application requirements, I admit I have trouble understanding a new deployment that would be sized to the throughput of one or two disks as a relevant benchmark. When I think of RAID 6, I think of many disks in parallel and many times the throughput of a single disk. I also assume many times the typical client load, e.g. backup of a department full of PCs, or a rackful of servers, rather than one PC or one server.
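The scrub and resync questions above can at least be bounded from below with simple arithmetic. The sketch assumes a sustained 100 MB/s per drive and a 1000 MB/s aggregate limit for the whole pod; both throughput numbers are placeholders picked for illustration, not measurements from Backblaze or anyone in this thread, and real scrubs under load will be slower.

```python
def hours_to_read(capacity_tb, mb_per_s):
    """Best-case hours to read capacity_tb once at a sustained mb_per_s."""
    seconds = capacity_tb * 1e12 / (mb_per_s * 1e6)
    return seconds / 3600


# Assumed throughputs (hypothetical, adjust to your own hardware).
ASSUMED_DRIVE_MB_PER_S = 100
ASSUMED_POD_MB_PER_S = 1000

# Rebuilding one replacement 3TB drive means writing all of it once.
print(f"resync 3TB drive: {hours_to_read(3, ASSUMED_DRIVE_MB_PER_S):.1f} h")

# Scrubbing the pod means reading all 135TB through the shared controllers.
print(f"scrub 135TB pod:  {hours_to_read(135, ASSUMED_POD_MB_PER_S):.1f} h")
```

Even with these optimistic numbers a full-pod scrub runs well over a day, which is why the ratio of capacity to throughput matters for the scaling question raised above.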
Let's see... the first thing I see when I click on hard drives on Newegg is a 3TB drive for $180.
So.
135/3 = 45
45 * $180 = 8100
That's just drives: no RAID, no controllers, no chassis/case.
With more digging I find a 5400 RPM drive for 139 ... so ...
45 * 140 = 6300, but still just the drives ... and no RAID.
Can you find cheaper drives? I'm sure, I spent all of about 10 seconds looking, but I doubt you're going to want to.
You guys are all wandering around arguing over the silliness of their slashvertisement (which it most certainly is) and various software implementations that would take the place of theirs and be better (which I don't disagree with one bit), but you entirely overlooked the fact that their statements are bald-faced lies. They didn't build it for that much. They may have ignored a bunch of costs and said 'we built it for X amount', but that's like saying the space program only cost American taxpayers the cost of shutting it down, because we already did the other stuff so it doesn't count against the cost!
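The drive-cost arithmetic in the comment above can be laid out next to the claimed $7,384 total. The two per-drive prices ($180 and the rounded $140) are the retail figures quoted in that comment; the sketch says nothing about what Backblaze actually paid, and `drive_total` is a hypothetical helper.

```python
POD_COST = 7384   # claimed parts cost from the story summary
DRIVES = 45       # 135TB / 3TB per drive


def drive_total(price_per_drive):
    """Cost of the drives alone at a given per-drive price."""
    return DRIVES * price_per_drive


for price in (180, 140):
    total = drive_total(price)
    remainder = POD_COST - total
    print(f"${price}/drive: drives ${total}, "
          f"left for everything else ${remainder}")
```

A negative remainder means the drives alone already exceed the claimed total at that price; a positive one is the budget left for controllers, backplanes, power supplies, and the case.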
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
And that takes into account price breaks and volume pricing?
There exist 10 to 25 OEM packs of drives from many manufacturers, did you look at those mfg part no.s?
What about a full pallet?
Only a moron would buy that amount of drives from a company that sells mainly to CONSUMERS.
Even as a consumer, with large enough volumes, you may in some cases purchase straight from a distributor.
While it's great that they posted the plans, some of the parts on the list are custom, and it's a bit too much hardware tinkering for me. What I would like to see is a similar commercially produced box, minus drives, for a few thousand. All the big players with turnkey solutions seem to sell only with drives at ridiculous prices.