Ask Slashdot: How Do You Store a Half-Petabyte of Data? (And Back It Up?)
An anonymous reader writes: My workplace has recently had two internal groups step forward with a request for almost a half-petabyte of disk to store data. The first is a research project that will computationally analyze a quarter petabyte of data in 100-200MB blobs. The second is looking to archive an ever increasing amount of mixed media. Buying a SAN large enough for these tasks is easy, but how do you present it back to the clients? And how do you back it up? Both projects have expressed a preference for a single human-navigable directory tree. The solution should involve clustered servers providing the connectivity between storage and client so that there is no system downtime. Many SAN solutions have a maximum volume limit of only 16TB, which means some sort of volume concatenation or spanning would be required, but is that recommended? Is anyone out there managing gigantic storage needs like this? How did you do it? What worked, what failed, and what would you do differently?
we use Ceph, its fast, redundant, and crazy scalable, oh did i mention free (paid support)? ceph.com
Honestly, that's the WORST thing to do. When you talk to the pros, they will try and sell you some outrageous overpriced Fiber Channel system that's total overkill for what you are doing. I've worked with 'big data' storage companys like EMC and Netapp. We needed 300TB of 'nearline' storage, and EMC came up with a $3,000,000.00 TOTAL overkill Fiber Channel solution, and Netapp wasn't much better, coming in at close to $2,000,000.00. Total ripoff. The ONLY reason you would ever choose Fiber Channel over ISCSI is if you are doing HUGE transactional database, with millions of access per minute. If you just need STORAGE, I went with Synology, and got 300TB of RAID-10 storage for about 100K. I DUPLICATED it (200K total), and still only paid 10% of what the 'vendors' tried to sell me, I was VERY clear that I did not need Fiber Channel, I refused to spend tons of money for something that would have zero bearing on the performance, and found it's much better to research and provide your own solution at 10% of the cost of the big vendors. Why do you think EMC has almost 3Billion of revenue, because they convince pointy haired bosses that their solution is the best. Trust me, going with a 2nd tier vendor for 'near line storage' is a much better idea than talking to the 'big 5' to ask for a solution
You're not asking the right questions:
The first correct question is why on earth would someone need to access half a petabyte? In most cases the commonly accessed data is less than 1%. That's the amount of data that realistically needs to reside on disk. It never is more than 10% on such a large dataset. Everything else would be better placed on tape. Tiered storage is the answer to the first question. You have RAM, solid/flash storage (PCI based), fast disks, slow high capacity disks and tape. Choose your tiering wisely.
The second question you need to ask is how the customer needs to access that large datastore. In most cases you need serious metadata in parallel with that data. For Petabytes of data you cannot in most cases just use an intelligent tree structure. You need a web-site or an app to search that data and get the required "blob". For such an app you need a large database since you have 5M objects with searchable metadata (at 200MB/blob).
The third question is why do you have SAN as a premise? Do you want to put a clustered filesystem with 5-10 nodes? Probably Isilon or Oracle ZS3-2/ZS4-4 are your answer.
Fourth question: what are the requirements? (How many simultaneous clients? IOPS? Bandwidth? ACL support? Auditing? AD integration? Performance tuning?)
Fifth question: There is no such thing as 100% availability. The term disaster in Disaster Recovery is correctly placed. Set reasonable SLA expectations. If you go for five-nine availability it will triple the cost of the project. Keep in mind that synchronous replication is distance limited. Typically, for a small performance cost, the radius is 150 miles and everything above impacts a lot.
Even if you solve the problems above, if you want to share it via NFS/CIFS or something else you're going to run into troubles. Since CIFS was not realistically designed for clustered operation regardless of the distributed FS underneath the CIFS server, you get locking issues. Windows Explorer is a good example since it creates thumbs.db files, leaves them open and when you want to delete the folder you cannot unless you magically ask the same node that was serving you when it created the Thumbs.DB file. Apparently, the POSIX lock is transferred to the other server and stops you from deleting, but when Windows Explorer asks the other node who has the lock on the file you get screwed since the other server doesn't know. Posix locks are different from Windows locks. It affects all Likewise based products from EMC (VNX filler, Isilon, etc.) and it also affects the CIFS product from NetApp. I'm not sure about Samba CTDB though.
I would design a storage based on ZFS for the main tiers, exported via NFSv4 to the front-end nodes and have QFS on top of the whole thing in order to push rarely accessed data to Tape. The fronted nodes would be accessed via WebDAV by a portal in which you can also query the metadata with a serious DB behind it.
I've installed Isilon storage for 6000 xendesktop clients that all log-on at 9AM, i've worked on an SL8500, Exadata, various NetApp and Sun storages and I can tell you that you need to do a study. Have simulations with commodity hardware on smaller datasets to figure out the performance requirements and optimal access method (NAS, Web, etc.). Extrapolate the numbers, double them and ask for POC and demos from vendors, be it IBM, EMC, Oracle, NetApp or HP. Make sure that in the future, when you'll need 2PB you can expand in an affordable manner. Take care since vendors like IBM tend to use the least upgradable solution. They will do a demo with something that can hold 0,6PB in their max configuration and if you'll need to go larger you'll need a brand new solution from another vendor.
It's not worth doing it yourself since it will be time-consuming (at least 500 man-hours until production) and with at least 1 full-time employees for the storage. But if you must, look at Nexenta and the hardware that they recommend.
And remember to test DR failover scenarios.
Good luck!
UNIX was not designed to stop you from doing stupid things, because that would also stop you from doing clever ones.
Nope. Not 'owned'. It's covered under the CDDL and developed by a group that isn't associated with Sun. Open-ZFS.
There's no place like
Buy Storage Pods, designed by BackBlaze. You can get 270TB of raw storage in 4U of rackspace for $0.051 per gigabyte. Total cost for half a petabyte of raw storage: $27,686. To back it all up cheaply but relatively effectively, buy a second set to use as a mirror. $55,372. For use with off-the-shelf software (FreeNAS running ZFS or Linux running mdm RAID) to present a unified filesystem that won't self-destruct when a single drive fails, you'll need to over-provision enough to store parity data. Go big or go home. Just buy another pod for each of the primary and the backup sets. Total of 6 pods with 1620TB of raw storage: $83,058. Some assembly required. And 24U of rackspace required, with power and cooling and 10Gbe ethernet and UPSs (another 4-8U of rackspace).
Expect a ballpark price of something a little under $100,000 that will meet your storage requirements with sufficient availability and redundancy to keep people happy. It will require 2 racks of space, and regular care and feeding. Do the care and feeding in house. A support contract where you pay some asshole tens of thousands of dollars a year to show up and swap drives for you is a waste of money. Bearing that in mind, as other posters have said, talk to storage vendors selling turnkey solutions. Come armed with these numbers. When they bid $1 million, laugh in their faces. But there's an outside chance you'll find a vendor with a price that is something less than hyperinflated. Stranger things have happened.
If you don't generate data very quickly, you can ease into it. For around $35,000, you can start with just 2 pods and the surrounding infrastructure, and add pods in pairs as necessary to accommodate data growth. Add $27,000 in 2 chassis next year to double your space. Add $26,000 of space again in 2017 and increase your raw capacity another 50%. (Total storage cost using BackBlaze-inspired pods is dominated by hard drive prices, which trend downwards.) When you find out your users underestimated growth, another $25,000 of space in 2018 takes you to somewhere in the neighborhood of 2 petabytes of raw storage, that you're using with double parity and 100% mirrored backup for a total effective useable space of approximately 918TB. You'll be replacing 2-3 drives per year, starting out, and 0-1 after infant mortality has run its course. Keep extras in a drawer and do it yourself in half an hour each on a Friday night. If you configured ZFS with reasonably sized vdevs, (3-5 devices) the array rebuild should be done by Monday morning. By 2020, you'll be back up to replacing 2-3 drives per year again as you climb the far side of the bathtub curve. While you're at it, you can seriously consider replacing whole vdevs with larger capacity drives, so your total useable space can start to creep up over time, without buying new chassis. By 2025, you will have 8 chassis in two racks hosting 2.88PB of raw storage space that's young and vital and low maintenance, having spent roughly $200,000.
A bargain, really.
Lucky (?) for you, I just went through purchasing a storage refresh for a cluster, as we're planning to move to a new building and no one trusts the current 5 year old solution to survive the move (besides which, we can only get 2nd hand replacements now). The current system is 8 shelves of Panasas ActiveStor 12, mostly 4 TB blades, but the original 2-3 shelves are 2 TB blades, giving about 270 TB raw storage, or about 235ish TB in real use. The current largest volume is about 100 TB in size, the next-largest is about 65 TB, with the remainder spread among 5-6 additional volumes including a cluster-wide scratch space. Most of the data is genomic sequences and references, either downloaded from public sources or generated in labs and sent to us for analysis.
As for the replacement...
I tried to get a quote from EMC. Aside from being contacted by someone *not* in the sector we're in, they also managed to misread their own online form and assumed that we wanted something at the opposite end of the spectrum from what I requested info on. After a bit of back and forth, and a promise to receive a call that never materialized, I never did get a quote. My assumption is they knew from our budget that we'd never be able to afford the capacities we were looking for. At a prior job, a multi-million dollar new data center and quasi-DR site went with EMC Isilon and some VPX stuff for VM storage/migration/replication between old/new DCs, and while I wasn't directly involved with it there, I had no complaints. If you can afford it, it's probably worth it.
The same prior job had briefly, before my time there, used some NetApp appliances. The reactions of the storage admins wasn't all that great, and throughout the 6 years I was there, we never could get NetApp to come in to talk to us whenever we were looking for expansion of our storage. I've had colleagues swear by NetApp though, so YMMV.
I briefly looked at the offerings from Overland Storage (where we got our current tape libraries), on the recommendation of the VAR we use for tapes & library upgrades. It looked promising, but in the end, we'd made a decision before we got most of those materials...
What we ended up going with was Panasas, again. Part of it was familiarity. Part of it was their incredible tech support even when the AS12 didn't have a support contract (we have a 1 shelf AS14 at our other location for a highly specialized cluster, so we had *some* support, and my boss has a golden tongue, talking them into a 1-time support case for the 8 shelf AS12). We also have a good relationship with the sales rep for our sector, the prior one actually hooked us up with another customer to acquire shelves 6-8 (and 3 spares), as this customer was upgrading to a newer model. Based on that, we felt comfortable going with the same vendor. We knew our budget, and got quotes for three configurations of their current models, ActiveStor 14 & 16. We ended up with the AS16, with 8 shelves of 6 TB disk (x2) and 240 GB SSD per blade (10 per, plus a "Director Blade" per). Approximate raw storage is just a bit under 1 PB (roughly 970-980 TB raw for the system).
In terms of physical specs, each shelf is 4U, have dual 10 GbE connections, and adding additional shelves is as easy as racking them and joining them to the existing array (I literally had no idea what I was doing when we added shelves on the current AS12, it just worked as they powered on). Depending on your environment, they'll support NFS, CIFS, and their own PanFS (basically pNFS) through a driver (or Linux kernel module, in our case). We're snowflakes, so we can't take advantage of their "phone home" system to report issues proactively and download updates (pretty much all vendors have this feature now). Updating manually is a little more time-consuming, but still possible.
As for backups, I honestly have no idea what I'm going to do. Most data, once written, is static in our environment, so I can probably get away with infrequent longer retention period backups for every
"The urge to save humanity is almost always a false front for the urge to rule." --H.L. Mencken
500TB is nothing these days. You can easily buy any system and it will support it. Look at FreeBSD/FreeNAS with ZFS (or their commercial counterpart by iXSystems). If you want to have an extremely comfortable, commercial setup, go Nexenta or with a bit of elbow grease, use the open/free counterpart OpenIndiana (Solaris based).
You can build 2 systems (I personally have 3, 1 with SAS in Striped-Mirrors, 1 with Enterprise-SATA in RAIDZ2 and 1 with Desktop-SATA in RAIDZ2) and have ZFS snapshots every minute/hour/day replicated across the network for backups, both Nexenta and FreeNAS have that right in the GUI. The primary system also has a mirrored head node which can take over in less than 10s. As far as sharing out the data: AFP/SMB/NFS/iSCSI/WebDAV etc. whatever you need to build up on it.
My system is continuously snapshotted to it's primary backup so that in case of extreme failure (which has not happened in the 7 years since I've built this system) I can run from the primary backup until the primary has been restored with perhaps a few seconds of data loss (don't know if that's acceptable to you but in my case it's not a problem in case we do have a full meltdown)
Where are those systems limited to 16TB? I wouldn't touch them with a 10-foot pole because they're running behind (within a few years a single hard drive will surpass that limit).
Custom electronics and digital signage for your business: www.evcircuits.com