Ask Slashdot: How Do You Store a Half-Petabyte of Data? (And Back It Up?)
An anonymous reader writes: My workplace has recently had two internal groups step forward with a request for almost a half-petabyte of disk to store data. The first is a research project that will computationally analyze a quarter petabyte of data in 100-200MB blobs. The second is looking to archive an ever increasing amount of mixed media. Buying a SAN large enough for these tasks is easy, but how do you present it back to the clients? And how do you back it up? Both projects have expressed a preference for a single human-navigable directory tree. The solution should involve clustered servers providing the connectivity between storage and client so that there is no system downtime. Many SAN solutions have a maximum volume limit of only 16TB, which means some sort of volume concatenation or spanning would be required, but is that recommended? Is anyone out there managing gigantic storage needs like this? How did you do it? What worked, what failed, and what would you do differently?
It's all going to get backed up.
We use Ceph. It's fast, redundant, and crazy scalable. Oh, did I mention free (with paid support available)? ceph.com
Do you mean:
(a) "Don't store it. Employ Amazon (or some other cloud) storage."? or
(b) "Do not use Amazon."
Clarity: it's like that one thing that is not the other thing, except for when it is.
Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear
Honestly, you should talk to the pros. I would call a couple of storage vendors, give them the basic outline of what you want to do, and let them tell you how they would do it. You can even get more formal and issue a Request for Information (RFI) or even a Request for Quote (RFQ). If you're a biggish company, your purchasing people probably have an SOP and standard forms for how to issue an RFI/RFQ. For the big boy storage vendors, half a petabyte is commonplace. The bigger question may very well be what this is going to look like at a software level. Managing the data might be a bigger challenge than storing it. Is this going to be organized in some sort of big data solution like Hadoop? Is it just a whole bunch of files, and people are going to write R or SAS jobs to query against it? Sometimes the tool set that you want to use will drive your choices in how to build the infrastructure under it.
At Facebook, it's memcached, with an HDD backup, eventually put onto tape...
At Google, it's a ramdisk, backed up to SSD/HDD, eventually put onto tape...
For anyone who can't afford half a petabyte of RAM with the commensurate number of computers? I have no good ideas... except maybe RAM cache of SSD, cache of HDD, backed up on tape...
Using something like HDFS to store your data in a Hadoop cluster of file requests, is likely the best F/OSS solution you're going to get for that...
WARNING! This girl exceeds the MAXIMUM SAFE standards established by the FDA for BRATTINESS
This project must have an unrealistically low budget, otherwise there are quite a few Enterprise solutions that will do all OR a combination of these tasks.
> how do you present it back to the clients?
Look at a NAS, not a SAN, e.g. NetApp or 3Par C series.
> And how do you back it up?
Disaster Recovery replication to another system or hosted services. NetApp, EMC, 3Par, etc, etc
> Many SAN solutions have a maximum volume limit of only 16TB
NetApp Infinite volumes limit is 20PB
You can contact a sales person from any of those companies to answer any of these questions.
Seriously. Call iXsystems. They specialize in this stuff and they use ZFS.
There's no place like
The research projects I've seen using that amount of storage have usually used a tape solution with dCache in front of it. You use a number of tape robots filled with tape, put them in different locations and have them back up everything between them.
If you want to keep your data on-site, then unless you already have a lot of infrastructure that you can leverage, the path of least resistance is to use something like a NetApp Filer.
For backups it can create snapshots on a schedule (hourly/daily/weekly), then either replicate them to a second physical storage unit (hopefully at a different site) or present them to your backup solution.
Using the file services on the NetApp will also provide a solution to your "how do I present it to the storage consumers" question - iSCSI, CIFS with domain integration, NFS, Fibre Channel... You also get storage level de-duplication and compression, if that works for your data.
Of course you will pay what seems like a lot for it, but it does solve a lot of your problems in one unit. How much will it save in servers, backup capacity, a multi-drive tape library, daily visits to the server room to reload tapes, and so on?
But if your data center isn't up to providing the level of availability you want then any hardware solution is going to be problematic - large storage systems do not like having the power pulled out from under them. Minimum is dual-redundant UPS power and fault tolerant cooling, or you will most likely have problems.
Something like storage pods? https://www.backblaze.com/blog/storage-pod/
I use slashdotFS, which is a Markovian random comment generator that effectively embeds data in a steganographic comment. The FS handles the details of creating and saving these so it's all transparent and mounts on your desktop like a regular drive. It's slow, but its capacity seems unlimited and it frequently gets modded insightful.
Some drink at the fountain of knowledge. Others just gargle.
You could look into Lustre, although it would change your hardware configuration a bit (it's not a SAN). Depending on your configuration and desired redundancy, this will affect costs a bit (i.e. more Lustre nodes).
You could buy a traditional SAN and tie it all together with fibre, though you'd need a clustered file system like StorNext, or another commercial CFS, or even GFS if you prefer open source. This would help solve the issue of traversing the system as a regular directory structure.
Best bet for backup would be a robotic tape library of some sort. There is some work being done on dynamic backup of data in Lustre systems in the HPC space, but it's not very mature. CFS systems like StorNext have methods in place for automatically backing up data on the filesystem.
SanDisk's Infiniflash is 512TB in a 3U chassis that is SAS-connected. You can front this with something like DataCore's SANsymphony to turn it into a NAS/SAN appliance.
The pricing looks to be around $1/GB, which is a ton cheaper than building a SAN of that capacity, plus it's much smaller in power/space/cooling.
up 12 days, 22:30, 2 users, load averages: 993.20, 994.21, 994.56
*makes note to limit user processes...
Let's start growing brains in jars.
“He’s not deformed, he’s just drunk!”
What clients will you be exporting it to? Linux, OS X, Windows? All three?
What kind of throughput do you need? Is 10 MB/sec enough? 100 MB/sec? 10 GB/sec?
What kind of IO are you doing? Random or sequential? Are you doing mostly reads, mostly writes, or an even mix?
Is it mission critical? If something goes wrong, do you fix it the next day, or do you need access to a tier 3 help desk at 3 am?
We have a couple of petabytes of CMS-HI data stored on a homegrown object filesystem we developed and exported to the compute nodes via FUSE. Reed-Solomon 6+3 for redundancy. No SAN, no fancy hardware, just a bunch of Linux boxes with lots of hard drives.
There is no "one shoe fits all" filesystem, which is part of the reason we use our own. If you have the ability to run it, I'd suggest looking at Ceph. It only supports Linux, but has Reed-Solomon for redundancy (consider it a higher tier of RAID) and good performance if you need it. If you have to add Windows or OS X clients into the mix, you may need to consider NFS, Samba, WebDAV, or (ugh) OpenAFS.
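For a rough feel of what the Reed-Solomon 6+3 layout mentioned above costs in raw disk, here is a minimal Python sketch. The function name and the 750TB raw figure are made up for illustration; the only inputs taken from the comment are the 6 data and 3 parity shards. For comparison it treats plain 3-way replication as the degenerate 1+2 case.

```python
def erasure_profile(data_shards: int, parity_shards: int, raw_tb: float) -> dict:
    """Summarise a k+m erasure-coded layout (k data shards, m parity shards)."""
    total = data_shards + parity_shards
    efficiency = data_shards / total           # fraction of raw space holding user data
    return {
        "tolerated_failures": parity_shards,   # any m shards can be lost
        "efficiency": efficiency,
        "raw_per_usable": total / data_shards, # raw TB consumed per usable TB
        "usable_tb": raw_tb * efficiency,
    }

# 750 TB raw is a hypothetical pool size, not a figure from the thread.
rs_6_3 = erasure_profile(6, 3, raw_tb=750.0)
replica_3 = erasure_profile(1, 2, raw_tb=750.0)  # 3-way replication as 1+2

for name, p in (("RS 6+3", rs_6_3), ("3x replication", replica_3)):
    print(f"{name}: survives {p['tolerated_failures']} losses, "
          f"{p['efficiency']:.0%} efficient, {p['usable_tb']:.0f} TB usable")
```

Under these assumptions, 6+3 survives three lost shards while using 1.5TB of raw disk per usable TB, whereas 3-way replication survives two losses at 3TB raw per usable TB.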
You're asking like you will be implementing it... don't.
Gather all their requirements, gather your requirements on top of it (I'm pretty confident that some of those requirements were your additions for "you'd be an idiot to have that, but not also have this...", possibly including the backup).
Then put out a preliminary RFP to the major storage vendors, including asking them what they think you missed in the preliminary.
Then take the recommendations they make on top of the preliminary with a grain of salt, since most of them will be intended to ensure vendor lock-in to their solution set, revise the preliminary, and put out a final RFP.
Then accept the bid that you like which management is willing to approve.
Problem solved.
P.S.: You don't have to grow everything yourself from seed you genetically modify yourself, you know...
Unless you REALLY want to pay for it.
As someone who works in a hospital system, Imaging Informatics specifically, we have roughly that much data spread across 2 locations. Backups aren't what you think they are. We back up the infrastructure config: databases, VM cluster config, and VMs, which compressed probably comes to 5-10 terabytes. That's it. That's the stuff which, if the worst possible event happened, means we wouldn't be exactly back to 0 when we rebuilt.
As for the 400-500 terabytes of data, they're in what we call Archive state. There isn't a backup of them, but they are in proper data centers with fire suppression. So there's that... Still, if 1 site went up, we'd be down that data. Thems the breaks... Goes back to money! But what we do have is everything in RAID with a hot spare. I think... I know 2 drives can fail in a block, and have recently, and we can recover the block. As 75% of this data is pretty much read-only transfer, the only stuff being written to permanent storage is new data. I think we're seeing 120-150 terabytes of growth a year, and we're looking at new storage since current gear is at 'EOL'. Life cycle wise, not warranty or operation.
Point is, will we see a petabyte storage system bought? Maybe, but it will be the same setup: archive system, with backup for the 'guts', as I like to call it. Simply put, CXXs don't want to throw the $$ down for petabyte data store site duplication. If money were flowing more freely to us, we'd at least start there and implement a 100-150 terabyte SSD caching block with 10GB Fiber, in and out. Not happening, but a man can dream...
Backblaze blog has a rundown of their storage pod https://www.backblaze.com/blog/storage-pod-4-5-tweaking-a-proven-design/
Combine this with something like Gluster, Lustre, Ceph, or even just NFS.
Backblaze is an online backup provider. They have open sourced some of their software and hardware designs.
They are currently storing over 150 Petabytes of user data. https://www.backblaze.com/blog/150-petabytes-of-cloud-storage/
They are working on scalability into the Zettabyte range https://www.backblaze.com/blog/vault-cloud-storage-architecture/
They have open sourced their hardware design for anyone to use. https://www.backblaze.com/blog/storage-pod-4-5-tweaking-a-proven-design/
They also looked into using 3rd party vendors but decided that they could build a better solution for at least 1/8 the price. https://www.backblaze.com/blog/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/
I know that it is not a plug and play solution but if you are willing to build off of their work you can save a ton of money and have a solution that truly fits your needs.
That's the easiest question I've ever seen.
1. Wait about a decade or so.
2. Buy two half-petabyte flash drives.
3. Alternate your copies on the two flash drives, the previous one becomes your backup.
NEXT!
Get free satoshi (Bitcoin) and Dogecoins
Step 1: buy a metric shitton of storage space (virtual or physical)
Step 2: put your data on it
Step 3: ???
Step 4: profit
If you have a small budget and moderate reliability requirements, I'd suggest looking into building a couple Backblaze-style storage pods for block store (5x 180TB storage systems, apx $9000 each), each exporting 145TB RAID5 volumes via iSCSI to a pair of front-end NAS boxes. NAS boxes could be FreeBSD or Solaris systems offering ZFS filestores (putting multiples of 5 volumes, one from each blockstore, together in RAIDZ sets), which then export these volumes via CIFS or NFS to the clients. Total cost for storage, front-ends, 10GbE NICs and a pair of 10GbE switches: $60K, plus a few weeks to build, provision, and test.
If you have a bigger budget, switch to Fibre Channel SANs. I'd suggest a couple of HP StoreServ 7450s, connected via 8 or 16Gb FC across two fabrics to your front ends, which aggregate the block storage into ZFS-based NAS systems as above, implementing raidz for redundancy. This would limit storage volumes to 16TB each, but if they're all exposed to the front ends as a giant pool of volumes, then ZFS can centrally manage how they're used. A 7450 filled with 96 4TB drives will provide 260TB of usable volume space (thin or thick provisioned), and costs around $200K-$250K. Going this route would cost $500K-$550K (SANs, plus 8 or 16Gb FC switches, plus fibre interconnects, plus HBAs) but give you extremely reliable and fast block storage.
A couple of advantages of using ZFS for the file storage are its ability to migrate data between backing stores when maintenance on underlying storage is required, and its ability to compress its data. For mostly-textual datasets, you can see a 2x to 3x space reduction, at a slight cost in speed, depending on your front-ends' CPUs and memory speed. ZFS is also relatively easy to manage on the command line by someone with intermediate knowledge of SAN/NAS storage management.
Whatever you decide to use for block storage, you're going to want to ensure the front-end filers (managing filestores and exporting as network shares) are set up in an identical active/standby pair. There's lots of free software on Linux and FreeBSD that accomplishes this. These front-ends would otherwise be your single point of failure, and can render your data completely unusable and possibly permanently lost if you don't have redundancy in this department.
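To sanity-check the low-budget layout described above (5 block stores of 180TB raw, each exporting a 145TB RAID5 volume, joined into RAIDZ sets on the front ends), a small Python sketch follows. It assumes single-parity RAIDZ across the five exported volumes and ignores ZFS metadata overhead and TB/TiB differences; the constant and function names are illustrative only.

```python
BLOCK_STORES = 5          # storage pods acting as iSCSI block stores
RAW_TB_EACH = 180         # raw capacity per pod, from the comment
EXPORTED_TB_EACH = 145    # RAID5 volume each pod exports, from the comment
COST_EACH_USD = 9_000     # approximate cost per pod, from the comment

def raidz_usable_tb(members: int, member_tb: float, parity: int = 1) -> float:
    """Usable space of a RAIDZ set built from equal-sized members."""
    return (members - parity) * member_tb

raw_tb = BLOCK_STORES * RAW_TB_EACH
exported_tb = BLOCK_STORES * EXPORTED_TB_EACH
usable_tb = raidz_usable_tb(BLOCK_STORES, EXPORTED_TB_EACH)  # single parity assumed
pod_cost_usd = BLOCK_STORES * COST_EACH_USD

print(f"raw {raw_tb} TB -> exported {exported_tb} TB -> usable {usable_tb} TB")
print(f"pods alone: ${pod_cost_usd:,} of the quoted $60K total")
```

With one volume from each block store in every RAIDZ set, an entire pod can go offline without losing data, at the price of roughly 36% of the raw capacity going to the two layers of parity.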
But he used vague requirements, so as not to give enough information for an actually informed decision.
But in general it sounds like it is going to be expensive and a lot of work, with more effort going into working out the details than into storing and backing up the data.
Then there's the question of how you present it back to the clients. That is a different can of worms.
The real question should be:
Which consulting company should I work with on a big data project?
Have you worked with one that seems able to give you clear goals and timelines, and meet the budget specified?
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
They'll be happy to talk to you for free, for the prospect of getting their hands on that kind of cash. You're easily looking at $.5M-$1M between storage, processing, and redundancy.
Sounds like you need the storage onsite at least for the research project.
The mixed media thing sounds like something to throw at the cloud unless there's a reason not to do that.
As to spanning volumes etc... I don't really understand the file structure of this research project. Having a petabyte of data in a single directory is typically the opposite of good ideas.
I'd like more information.
As to back ups... it depends on how frequently the information changes. Backup tapes are probably the cheapest way to go for backups of archives. 3 TB at 20 dollars a tape.... not bad. And you can do incremental back ups if there are little changes.
The tapes are supposed to last about 10 years. So that's something.
If we're talking about high frequency changes... you almost need to replicate the primary storage... and the number of times you need to do that is variable on how badly you need to not lose the data.
If we're talking about data where, if it's lost, orphans are going to get ground up into hamburger and fed to the dogs... you're going to want multiple back ups. If it would merely be annoying... maybe one back up is fine.
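Taking the tape figures quoted above at face value (3TB and 20 dollars per tape), the media bill for a half-petabyte archive can be estimated with a few lines of Python. This is only a sketch: it counts media alone, assumes no compression, and leaves out drives, the library, and off-site handling. The function name and the two-copy example are hypothetical.

```python
import math

def tape_media(dataset_tb: float, tape_tb: float = 3.0,
               tape_usd: float = 20.0, copies: int = 1) -> tuple[int, float]:
    """Return (number of tapes, media cost in USD) for full copies of a dataset."""
    tapes_per_copy = math.ceil(dataset_tb / tape_tb)  # a partly filled tape still costs a tape
    tapes = tapes_per_copy * copies
    return tapes, tapes * tape_usd

one_copy = tape_media(500)              # one full backup of ~0.5 PB
two_copies = tape_media(500, copies=2)  # e.g. one on-site set and one off-site set

print(f"1 copy:   {one_copy[0]} tapes, ${one_copy[1]:,.0f}")
print(f"2 copies: {two_copies[0]} tapes, ${two_copies[1]:,.0f}")
```

Even with several full copies, the media stays in the low thousands of dollars at these prices, so the real cost of a tape approach is likely to sit in the drives, the robot, and the time a full pass takes.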
I've decided to stop wasting my time responding to AC trolls/sockpuppets... so if you want a response from me... login.
We recently bought a NAS server for our group with ~200TB of raw storage (175TB after RAID6 with a good card), and it is NFS-mounted on other servers. It is pretty easy to use and configure, and quite cheap (20k UK pounds). Regarding the backup, I would probably just buy a second server (maybe with a cheaper configuration, worse RAID card, etc.).
You will not get a good answer here, because even if there were one, it would be hard to find among all the nonsense.
BTW your scenario is incomplete, so you are unlikely to get a good answer. It looks a little bit like you want /. to do your homework.
You're not asking the right questions:
The first correct question is why on earth would someone need to access half a petabyte? In most cases the commonly accessed data is less than 1%. That's the amount of data that realistically needs to reside on disk. It never is more than 10% on such a large dataset. Everything else would be better placed on tape. Tiered storage is the answer to the first question. You have RAM, solid/flash storage (PCI based), fast disks, slow high capacity disks and tape. Choose your tiering wisely.
The second question you need to ask is how the customer needs to access that large datastore. In most cases you need serious metadata in parallel with that data. For Petabytes of data you cannot in most cases just use an intelligent tree structure. You need a web-site or an app to search that data and get the required "blob". For such an app you need a large database since you have 5M objects with searchable metadata (at 200MB/blob).
The third question is why do you have SAN as a premise? Do you want to put a clustered filesystem with 5-10 nodes? Probably Isilon or Oracle ZS3-2/ZS4-4 are your answer.
Fourth question: what are the requirements? (How many simultaneous clients? IOPS? Bandwidth? ACL support? Auditing? AD integration? Performance tuning?)
Fifth question: There is no such thing as 100% availability. The term disaster in Disaster Recovery is correctly placed. Set reasonable SLA expectations. If you go for five-nines availability it will triple the cost of the project. Keep in mind that synchronous replication is distance limited. Typically, for a small performance cost, the radius is 150 miles, and anything beyond that has a much larger impact.
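Two of the numbers in the questions above are easy to check with quick arithmetic: how many objects a metadata database would have to track at a given blob size, and how little downtime "five nines" actually allows. The Python sketch below uses decimal units (1PB = 10^15 bytes) and a 365-day year; both helper names are illustrative.

```python
PB = 10**15
MB = 10**6
MINUTES_PER_YEAR = 365 * 24 * 60

def object_count(dataset_bytes: int, blob_bytes: int) -> int:
    """Number of whole blobs of a given size in a dataset."""
    return dataset_bytes // blob_bytes

def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines (e.g. 5 -> 99.999%)."""
    return MINUTES_PER_YEAR * 10 ** (-nines)

# A full petabyte at 200 MB per blob gives the 5M objects mentioned above.
print(object_count(PB, 200 * MB))
# The quarter petabyte from the original question, at 100-200 MB per blob.
print(object_count(PB // 4, 200 * MB), object_count(PB // 4, 100 * MB))

for n in (3, 4, 5):
    print(f"{n} nines: {downtime_minutes_per_year(n):.1f} minutes/year")
```

Five nines works out to a little over five minutes of downtime a year, which helps explain why chasing it multiplies the project cost.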
Even if you solve the problems above, if you want to share it via NFS/CIFS or something else you're going to run into trouble. Since CIFS was not realistically designed for clustered operation, you get locking issues regardless of the distributed FS underneath the CIFS server. Windows Explorer is a good example: it creates Thumbs.db files and leaves them open, and when you want to delete the folder you cannot, unless you magically ask the same node that was serving you when it created the Thumbs.db file. Apparently, the POSIX lock is transferred to the other server and stops you from deleting, but when Windows Explorer asks the other node who has the lock on the file you get screwed, since the other server doesn't know. POSIX locks are different from Windows locks. It affects all Likewise-based products from EMC (VNX filer, Isilon, etc.) and it also affects the CIFS product from NetApp. I'm not sure about Samba CTDB though.
I would design storage based on ZFS for the main tiers, exported via NFSv4 to the front-end nodes, with QFS on top of the whole thing in order to push rarely accessed data to tape. The front-end nodes would be accessed via WebDAV by a portal in which you can also query the metadata, with a serious DB behind it.
I've installed Isilon storage for 6000 XenDesktop clients that all log on at 9AM, I've worked on an SL8500, Exadata, and various NetApp and Sun storage systems, and I can tell you that you need to do a study. Run simulations with commodity hardware on smaller datasets to figure out the performance requirements and optimal access method (NAS, Web, etc.). Extrapolate the numbers, double them and ask for POCs and demos from vendors, be it IBM, EMC, Oracle, NetApp or HP. Make sure that in the future, when you need 2PB, you can expand in an affordable manner. Take care, since vendors like IBM tend to use the least upgradable solution. They will do a demo with something that can hold 0.6PB in its max configuration, and if you need to go larger you'll need a brand new solution from another vendor.
It's not worth doing it yourself, since it will be time-consuming (at least 500 man-hours until production) and will need at least 1 full-time employee for the storage. But if you must, look at Nexenta and the hardware that they recommend.
And remember to test DR failover scenarios.
Good luck!
UNIX was not designed to stop you from doing stupid things, because that would also stop you from doing clever ones.
Library storage sounds like it may be your best choice. Several high-end vendors sell such systems, and you may need to submit RFS and RFQs, not to mention see the systems in action. This is not going to be cheap, but it's the best long-term investment. Ensure that it is scalable and can handle any future expansion without investing in a whole new kit, or it will simply put your department back to square one.
First rule of holes; When in one, stop digging.
On a SAN, the 16TB limit generally comes from 32-bit SANs; 64-bit SANs wouldn't have it. Plenty of SAN solutions can handle 500TB or 10x that much, so just upgrade. If you only want backup, there are plenty of hardware backup devices that handle this. For example, ExaGrid scales to, I believe, 300TB / hr, let alone 500TB total. This isn't gigantic in today's world. You just need to have a conversation with your vendor, or an agent. You aren't asking for anything abnormal or challenging.
Just put "bomb" and "assassinate" in every line. ... It's all going to get backed up.
But getting them to restore it after it's gotten lost or corrupted is difficult.
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
For high throughput/IOPS requirements build a Lustre/Ceph/etc. cluster and mount the cluster filesystems directly on as many clients as possible. You'll have to set up gateway machines for CIFS/NFS clients that can't directly talk to the cluster, so figure out how much throughput those clients will need and build appropriate gateway boxes and hook them to the cluster. Sizing for performance depends on the type of workload, so start getting disk activity profiles and stats from any existing storage NOW to figure out what typical workloads look like. Data analysis before purchasing is your best friend.
If the IOPS and throughput requirements are especially low (guaranteed < 50 random IOPS [for RAID/background process/degraded-or-rebuilding-array overhead] per spindle and what a couple 10gbps ethernet ports can handle, over the entire lifetime of the system) then you can probably get away with just some SAS cards attached to SAS hotplug drive shelves and building one big FreeBSD ZFS box. Use two mirrored vdevs per pool (RAID10-alike) for the higher-IOPS processing group and RAIDZ2 or RAIDZ3 with ~15 disk vdevs for the archiving group to save on disk costs.
Plan for 100% more growth in the first year than anyone says they need (shiny new storage always attracts new usage). Buy server hardware capable of 3 to 5 years of growth; be sure your SAS cards and arrays will scale that high if you go with one big storage box.
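The trade-off between the two ZFS layouts suggested above (mirrored vdevs for the higher-IOPS group, wide RAIDZ2/RAIDZ3 vdevs for the archive group) is mostly a matter of how many disks you buy per usable terabyte. The Python sketch below makes that concrete. The 4TB drive size and the 250TB requirement are assumptions for illustration, and ZFS metadata and reserved space are ignored.

```python
import math

def vdev_efficiency(disks: int, redundancy: int) -> float:
    """Fraction of a vdev's raw space that holds user data."""
    return (disks - redundancy) / disks

def disks_needed(usable_tb: float, disk_tb: float,
                 disks_per_vdev: int, redundancy: int) -> int:
    """Disks required to reach a usable capacity using whole vdevs only."""
    usable_per_vdev = (disks_per_vdev - redundancy) * disk_tb
    vdevs = math.ceil(usable_tb / usable_per_vdev)
    return vdevs * disks_per_vdev

stated_need_tb = 250           # hypothetical requirement from one group
target_tb = stated_need_tb * 2 # plan for 100% more growth than anyone asks for
disk_tb = 4                    # assumed drive size

layouts = {
    "2-way mirror": (2, 1),
    "15-disk RAIDZ2": (15, 2),
    "15-disk RAIDZ3": (15, 3),
}
for name, (width, redundancy) in layouts.items():
    n = disks_needed(target_tb, disk_tb, width, redundancy)
    print(f"{name}: {vdev_efficiency(width, redundancy):.0%} efficient, {n} disks")
```

Under these assumptions the mirrored pool needs 250 disks where the RAIDZ2 pool needs 150, which is the disk-cost saving the comment is pointing at; the mirrors buy that back in random IOPS and faster resilvering.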
Buy Storage Pods, designed by BackBlaze. You can get 270TB of raw storage in 4U of rackspace for $0.051 per gigabyte. Total cost for half a petabyte of raw storage: $27,686. To back it all up cheaply but relatively effectively, buy a second set to use as a mirror. $55,372. For use with off-the-shelf software (FreeNAS running ZFS or Linux running mdadm RAID) to present a unified filesystem that won't self-destruct when a single drive fails, you'll need to over-provision enough to store parity data. Go big or go home. Just buy another pod for each of the primary and the backup sets. Total of 6 pods with 1620TB of raw storage: $83,058. Some assembly required. And 24U of rackspace required, with power and cooling and 10GbE ethernet and UPSs (another 4-8U of rackspace).
Expect a ballpark price of something a little under $100,000 that will meet your storage requirements with sufficient availability and redundancy to keep people happy. It will require 2 racks of space, and regular care and feeding. Do the care and feeding in house. A support contract where you pay some asshole tens of thousands of dollars a year to show up and swap drives for you is a waste of money. Bearing that in mind, as other posters have said, talk to storage vendors selling turnkey solutions. Come armed with these numbers. When they bid $1 million, laugh in their faces. But there's an outside chance you'll find a vendor with a price that is something less than hyperinflated. Stranger things have happened.
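The pod arithmetic above can be reproduced in a few lines of Python. Note that this sketch uses the rounded $0.051/GB rate and 270TB per pod, so its totals come out slightly below the dollar figures quoted in the comment (which imply a rate nearer $0.0513/GB). The constant and function names are illustrative.

```python
POD_RAW_TB = 270      # raw capacity per 4U pod, from the comment
USD_PER_GB = 0.051    # rounded rate quoted in the comment
POD_RACK_UNITS = 4

def pod_fleet(pods: int) -> dict:
    """Raw capacity, approximate hardware cost, and rack space for N pods."""
    raw_tb = pods * POD_RAW_TB
    return {
        "raw_tb": raw_tb,
        "cost_usd": round(raw_tb * 1000 * USD_PER_GB),  # 1 TB = 1000 GB
        "rack_units": pods * POD_RACK_UNITS,
    }

for label, pods in (("primary only", 2), ("primary + mirror", 4),
                    ("with a parity pod per set", 6)):
    f = pod_fleet(pods)
    print(f"{label}: {pods} pods, {f['raw_tb']} TB raw, "
          f"~${f['cost_usd']:,}, {f['rack_units']}U")
```

The six-pod case lands at 1620TB raw in 24U for a little over $80K, consistent with the "a little under $100,000" ballpark once networking, UPSs, and spares are added.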
If you don't generate data very quickly, you can ease into it. For around $35,000, you can start with just 2 pods and the surrounding infrastructure, and add pods in pairs as necessary to accommodate data growth. Add $27,000 in 2 chassis next year to double your space. Add $26,000 of space again in 2017 and increase your raw capacity another 50%. (Total storage cost using BackBlaze-inspired pods is dominated by hard drive prices, which trend downwards.) When you find out your users underestimated growth, another $25,000 of space in 2018 takes you to somewhere in the neighborhood of 2 petabytes of raw storage, that you're using with double parity and 100% mirrored backup for a total effective useable space of approximately 918TB. You'll be replacing 2-3 drives per year, starting out, and 0-1 after infant mortality has run its course. Keep extras in a drawer and do it yourself in half an hour each on a Friday night. If you configured ZFS with reasonably sized vdevs, (3-5 devices) the array rebuild should be done by Monday morning. By 2020, you'll be back up to replacing 2-3 drives per year again as you climb the far side of the bathtub curve. While you're at it, you can seriously consider replacing whole vdevs with larger capacity drives, so your total useable space can start to creep up over time, without buying new chassis. By 2025, you will have 8 chassis in two racks hosting 2.88PB of raw storage space that's young and vital and low maintenance, having spent roughly $200,000.
A bargain, really.
Super-Micro has 36 and 72 drive racks that aren't horrible human effort wise (you can get 90 drive racks, but I wouldn't recommend it). You COULD get 8TB drives for like 9.5 cents / GB (including the $10k 4U chassis overhead). 4TB drives will be more practical for rebuilds (and performance), but will push you to near 11c / GB. You can go with 1TB or even 1/2TB drives for performance (and faster rebuilds), but now you're up to 35c / GB.
.. $200k.. But you can grow into it.
That's roughly 288TB of RAW for say $30k 4U. If you need 1/2 PB, I'd say spec out 1.5PB - thus you're at $175K
Note this is for ARCHIVE, as you're not going to get any real performance out of it.. Not enough CPU to disk ratio.. Not even sure if the MB can saturate a 40Gbps QSFP link and $30k switch. That's kind of why Hadoop with cheap 1CPU + 4 direct-attached HDs is so popular.
At that size, I wouldn't recommend just RAID-1ing, LVMing, ext4ing (or btrfsing) then n-way foldering, then nfs mounting... since you'll have problems when hosts go down, and with keeping the network from stalling / timing out.
Note, you don't want to 'back-up' this kind of system.. You need point-in-time snapshots.. And MAYBE periodic write-to-tape.. Copying is out of the question, so you just need a file-system that doesn't let you corrupt your data. DEFINITELY data has to replicate across multiple machines - you MUST assume hardware failure.
The problem is going to be partial network down-time, crashes, or stalls, and regularly replacing failed drives.. This kind of network is defined by how well it performs when 1/3 of your disks are in 1-week-long rebuild periods. Some systems (like HDFS) don't care about hardware failure.. There's no rebuild, just a constant sea of scheduled migration-of-data.
If you only ever schedule temporary bursts of 80% capacity (probably even too high), and have a system that only consumes 50% of disk-IO to rebuild, then a 4TB disk would take 12 hours to re-replicate. If you have an intelligent system (EMC, netapp, ddn, hdf, etc), you could get that down to 2 hours per disk (due to cross rebuilding).
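The rebuild estimate above follows from dividing the disk size by whatever bandwidth the rebuild is allowed to use. The Python sketch below backs out the assumption hiding in the 12-hour figure: with half the disk's IO reserved for rebuilding, a 4TB disk takes about 12 hours only if the disk sustains roughly 185MB/s. That throughput and the six-way parallelism used for the "cross rebuilding" case are assumptions chosen to match the comment's numbers, not measured values.

```python
def rebuild_hours(disk_tb: float, disk_mb_per_s: float,
                  io_fraction: float, parallel_streams: int = 1) -> float:
    """Hours to re-replicate one disk's worth of data.

    io_fraction is the share of each disk's bandwidth given to the rebuild;
    parallel_streams models rebuild work spread across several disks at once.
    """
    total_bytes = disk_tb * 1e12
    bytes_per_s = disk_mb_per_s * 1e6 * io_fraction * parallel_streams
    return total_bytes / bytes_per_s / 3600

# Assumed sustained throughput of ~185 MB/s per disk (hypothetical).
single = rebuild_hours(4, 185, io_fraction=0.5)
spread = rebuild_hours(4, 185, io_fraction=0.5, parallel_streams=6)

print(f"single target disk: {single:.1f} h")
print(f"spread over 6 disks: {spread:.1f} h")
```

Real drives rarely sustain their peak rate across the whole platter, and a loaded cluster will give the rebuild less than half of the IO, so treat both figures as lower bounds.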
I'm a big fan of object-file-systems (generally HTTP based).. That'll work well with the 3-way redundancy. You can typically fake out a POSIX-like file-system with fusefs.. You could even emulate CIFS or NFS. It's not going to be as responsive (high latency). Think S3.
There are also "experimental" POSIX systems like Ceph, GPFS, Lustre. Very easy to screw up if you don't know what you're doing. And really painful to re-format after you've learned it's not tuned for your use-case.
HDFS will work - but it's mostly for running jobs on the data.
There's also AFS.
If you can afford it, there are commercial systems to do exactly what you want, but you'll need to triple the cost again. Just don't expect a fault-tolerant multi-host storage solution to be as fast as even a dedicated laptop drive. Remember when testing.. You're not going to be the only one using the system... Benchmarks perform very differently when under disk-recovery or random-scatter-shot load by random elements of the system - including copying-in all that data.
-Michael
Lucky (?) for you, I just went through purchasing a storage refresh for a cluster, as we're planning to move to a new building and no one trusts the current 5 year old solution to survive the move (besides which, we can only get 2nd hand replacements now). The current system is 8 shelves of Panasas ActiveStor 12, mostly 4 TB blades, but the original 2-3 shelves are 2 TB blades, giving about 270 TB raw storage, or about 235ish TB in real use. The current largest volume is about 100 TB in size, the next-largest is about 65 TB, with the remainder spread among 5-6 additional volumes including a cluster-wide scratch space. Most of the data is genomic sequences and references, either downloaded from public sources or generated in labs and sent to us for analysis.
As for the replacement...
I tried to get a quote from EMC. Aside from being contacted by someone *not* in the sector we're in, they also managed to misread their own online form and assumed that we wanted something at the opposite end of the spectrum from what I requested info on. After a bit of back and forth, and a promise to receive a call that never materialized, I never did get a quote. My assumption is they knew from our budget that we'd never be able to afford the capacities we were looking for. At a prior job, a multi-million dollar new data center and quasi-DR site went with EMC Isilon and some VPX stuff for VM storage/migration/replication between old/new DCs, and while I wasn't directly involved with it there, I had no complaints. If you can afford it, it's probably worth it.
The same prior job had briefly, before my time there, used some NetApp appliances. The reactions of the storage admins weren't all that great, and throughout the 6 years I was there, we never could get NetApp to come in to talk to us whenever we were looking for expansion of our storage. I've had colleagues swear by NetApp though, so YMMV.
I briefly looked at the offerings from Overland Storage (where we got our current tape libraries), on the recommendation of the VAR we use for tapes & library upgrades. It looked promising, but in the end, we'd made a decision before we got most of those materials...
What we ended up going with was Panasas, again. Part of it was familiarity. Part of it was their incredible tech support even when the AS12 didn't have a support contract (we have a 1 shelf AS14 at our other location for a highly specialized cluster, so we had *some* support, and my boss has a golden tongue, talking them into a 1-time support case for the 8 shelf AS12). We also have a good relationship with the sales rep for our sector, the prior one actually hooked us up with another customer to acquire shelves 6-8 (and 3 spares), as this customer was upgrading to a newer model. Based on that, we felt comfortable going with the same vendor. We knew our budget, and got quotes for three configurations of their current models, ActiveStor 14 & 16. We ended up with the AS16, with 8 shelves of 6 TB disk (x2) and 240 GB SSD per blade (10 per, plus a "Director Blade" per). Approximate raw storage is just a bit under 1 PB (roughly 970-980 TB raw for the system).
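As a sanity check on that raw figure, here is a rough back-of-the-envelope sketch. It assumes 10 storage blades per shelf, each with two 6 TB disks and one 240 GB SSD, that the Director Blades add no capacity, and decimal units; those are my reading of the configuration, not vendor numbers.

```python
# Rough raw-capacity check for the 8-shelf configuration described above.
# Assumptions (not vendor-confirmed): 10 storage blades per shelf, each with
# two 6 TB disks and one 240 GB SSD; Director Blades contribute no capacity;
# decimal units (1 TB = 1000 GB).

SHELVES = 8
STORAGE_BLADES_PER_SHELF = 10
HDD_PER_BLADE = 2
HDD_TB = 6
SSD_GB = 240


def raw_capacity_tb(shelves=SHELVES):
    """Return (hdd_tb, ssd_tb, total_tb) for the given number of shelves."""
    blades = shelves * STORAGE_BLADES_PER_SHELF
    hdd_tb = blades * HDD_PER_BLADE * HDD_TB
    ssd_tb = blades * SSD_GB / 1000
    return hdd_tb, ssd_tb, hdd_tb + ssd_tb


if __name__ == "__main__":
    hdd, ssd, total = raw_capacity_tb()
    print(f"HDD: {hdd} TB, SSD: {ssd:.1f} TB, total: {total:.1f} TB")
```

Under those assumptions the disks alone come to 960 TB and the SSDs add a little under 20 TB, which lands in the 970-980 TB range quoted.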
In terms of physical specs, each shelf is 4U and has dual 10 GbE connections, and adding additional shelves is as easy as racking them and joining them to the existing array (I literally had no idea what I was doing when we added shelves on the current AS12, it just worked as they powered on). Depending on your environment, they'll support NFS, CIFS, and their own PanFS (basically pNFS) through a driver (or Linux kernel module, in our case). We're snowflakes, so we can't take advantage of their "phone home" system to report issues proactively and download updates (pretty much all vendors have this feature now). Updating manually is a little more time-consuming, but still possible.
As for backups, I honestly have no idea what I'm going to do. Most data, once written, is static in our environment, so I can probably get away with infrequent longer retention period backups for every
"The urge to save humanity is almost always a false front for the urge to rule." --H.L. Mencken
One of these will do you well
https://en.wikipedia.org/wiki/...
For storage that's trickier. You probably need to characterize your usage before you talk to a vendor otherwise they will oversell you into oblivion.
...of Windows10 boxes!
Sacred cows make the best burgers.
Where I work, we are running EMC's Isilon platform. We have ~4PB of data replicated between two data centers.
The platform supports the traditional CIFS/SMB and NFS for client connectivity.
It also has Hadoop support (HDFS). The great thing about the HDFS support is that you do not have to spin a separate file system for it. The same files that your clients access via CIFS or NFS can be accessed via HDFS. Isilon was built with Hadoop in mind and the Isilon nodes act as Hadoop "compute nodes".
The OneFS file system presents a practically unlimited in size, single file system. There are some interesting tuning options that can be leveraged depending on your data type and IO patterns. If you need to get REALLY crazy, the system has support for tiering data based on a whole slew of different factors (last accessed date, file date, file size... basically any file metadata attribute you can think of can be used for tiering purposes).
This probably does not matter for you, but the system also supports AES256 at-rest encryption. We deal with a lot of financial and other highly sensitive data for clients that demand at-rest encryption, so that was a must have for us.
The only downside is that since it is from EMC, you can plan on paying through the nose for it. (But never pay full retail for EMC, ever. Threaten them with NetApp if you have to. ;) )
We still leverage a SpectraLogic tape library to archive data off of the system. With a moderately specced NetBackup system we get a consistent ~35000kb/s restore rate off of a single drive. That lets us provide reasonable RTOs back to the business.
On the subject of backup, another great thing about Isilon is that you can dedicate certain nodes to specific tasks. In the Isilon architecture, the NL nodes are the slowest nodes that they have. We leverage those for backup to keep the network IO off of the faster X and S-nodes.
500TB is nothing these days. You can easily buy any system and it will support it. Look at FreeBSD/FreeNAS with ZFS (or their commercial counterpart by iXSystems). If you want to have an extremely comfortable, commercial setup, go Nexenta or with a bit of elbow grease, use the open/free counterpart OpenIndiana (Solaris based).
You can build 2 systems (I personally have 3, 1 with SAS in Striped-Mirrors, 1 with Enterprise-SATA in RAIDZ2 and 1 with Desktop-SATA in RAIDZ2) and have ZFS snapshots every minute/hour/day replicated across the network for backups, both Nexenta and FreeNAS have that right in the GUI. The primary system also has a mirrored head node which can take over in less than 10s. As far as sharing out the data: AFP/SMB/NFS/iSCSI/WebDAV etc. whatever you need to build up on it.
My system is continuously snapshotted to its primary backup so that in case of extreme failure (which has not happened in the 7 years since I built this system) I can run from the primary backup until the primary has been restored, with perhaps a few seconds of data loss (I don't know if that's acceptable to you, but in my case it's not a problem if we do have a full meltdown).
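For anyone who would rather script the snapshot replication than use the GUI, one cycle can be sketched roughly as below. This only builds the command lines and does not run them. The dataset names, backup host, and snapshot naming scheme are made up for illustration, and it assumes the previous snapshot already exists on both sides.

```python
# Sketch of one snapshot-and-replicate cycle, expressed as the shell commands
# an admin (or a cron job) would run. Nothing is executed here.
# Dataset names, host name and the "auto-" naming scheme are hypothetical.
import shlex
from datetime import datetime, timezone


def snapshot_name(dataset, now=None):
    """Build a timestamped snapshot name such as tank/data@auto-20150102-030405."""
    now = now or datetime.now(timezone.utc)
    return f"{dataset}@auto-{now:%Y%m%d-%H%M%S}"


def replication_commands(prev_snap, new_snap, backup_host, backup_dataset):
    """Return command lines for an incremental send from prev_snap to new_snap."""
    take = f"zfs snapshot {shlex.quote(new_snap)}"
    send = f"zfs send -i {shlex.quote(prev_snap)} {shlex.quote(new_snap)}"
    recv = f"ssh {shlex.quote(backup_host)} zfs receive -F {shlex.quote(backup_dataset)}"
    return [take, f"{send} | {recv}"]


if __name__ == "__main__":
    prev = "tank/data@auto-20150101-000000"
    new = snapshot_name("tank/data")
    for line in replication_commands(prev, new, "backup1", "backup/data"):
        print(line)
```

The incremental send (-i) is what keeps frequent replication cheap: only the blocks changed since the previous snapshot cross the network.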
Where are those systems limited to 16TB? I wouldn't touch them with a 10-foot pole because they're running behind (within a few years a single hard drive will surpass that limit).
Custom electronics and digital signage for your business: www.evcircuits.com
What are your performance requirements? If you just need a giant dump of semi-offline storage then look into building a Backblaze Storage Pod.
https://www.backblaze.com/blog...
For about $30,000 you could build four storage pods. Speed would not be terrific. Backups are handled through RAID. If you want faster, more redundant or fully serviced your next step up in price is probably a $300,000 NAS solution. Which might serve you better anyway.
Use Amazon S3 storage (gives you cloud storage with a directory tree).
Accessible via desktop apps or even web browser if you want.
For stuff they want to archive but will rarely ever use have those S3 folders archive to Glacier.
Nothing to back up and you can store petabytes in Glacier cheaper than any other option on the planet. :)
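The "archive to Glacier" part is normally done with a bucket lifecycle rule. A minimal sketch of such a rule follows, built as a plain Python dict in the shape that boto3's put_bucket_lifecycle_configuration accepts. The rule ID, the "archive/" prefix, and the 30-day delay are placeholders, not recommendations.

```python
# Sketch of an S3 lifecycle rule that moves objects under a prefix to Glacier.
# The rule ID, prefix and day count are hypothetical placeholders.
import json


def glacier_rule(prefix, days):
    """Return a lifecycle configuration transitioning `prefix` to GLACIER after `days`."""
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/')}-to-glacier",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


if __name__ == "__main__":
    # This dict could be passed as LifecycleConfiguration=... to an S3 client.
    print(json.dumps(glacier_rule("archive/", 30), indent=2))
```

Keep in mind that retrieval from Glacier is neither instant nor free, so this only suits data that really is rarely read.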
Where I work we deal with data sets of a similar order. However, different data sets are stored differently depending on need. For online relational data where performance is critical, it's in master/slave/backup DB clusters running with 4.8TB PCIe SSDs. The backups are taken from a slave node and stored locally, plus they're pushed offsite. No tape, if we need a restore we can't really wait that long.
For data we can afford to access more slowly we use large HDFS clusters with regular SATA discs. There's a level of redundancy built in there, and where data is important enough to need a real backup (much of it is not) it is also pushed offsite. The HDFS approach has the advantage of presenting as a very large filesystem, and obviously if you're running hadoop against it there's an automatic advantage.
---- Den ene knappen er powerknapp, den andre er Bender voice knapp "Bite My Shiny Metal Ass"
While I agree with most commenters that you need to supply many more details before even beginning to narrow the options, if you do look at the storage vendors, DDN (Data Direct Networks) is really hard to beat.
I see the EMC Isilon guys posting here and need to counter. :) They are overpriced and underpowered for almost every application. Their strength is typical enterprise environments - lots of small files accessed via NFS and "enterprise" SLAs. That's almost always the wrong solution for big data applications (NFS is terrible for big data). EMC Isilon sold a lot of storage into my space (gene sequencing) and very few customers are happy, especially when they find out what the other vendors could do.
I've organized bake-offs between DDN, Isilon, and a number of other vendors. DDN always came out ahead on price and performance (every time they were half the price and twice the speed of Isilon). DDN is the most represented of the vendors on the Top 500 Supercomputing list and also powers a certain streaming movie/TV service we all know and love. DDN is also pretty ethical - if they're a bad match for your application, they'll let you know and provide recommendations.
Whatever you do, don't build it yourself. As tempting and fun as it is, given that you're asking the question, you've already self-identified as someone who won't be able to support it. I've seen many smart people go the SuperMicro JBOD route only to create support nightmares for themselves.
Also, for that much space, avoid Amazon at all costs. It's way too expensive compared to dedicated hardware.
For cost, budget around $150-250k to get started. It might seem pricey, but you'll spend more than that on manpower building it yourself (or your first few months on Amazon).
In addition to DDN, IBM, Dell, and HP all have solutions in this range that aren't terribly expensive.
-Chris
Gluster or Ceph, depending on requirements.
Both are Open Source, call Red Hat if you want support.
I keep it all in a separate drive, and only mount it when I want to look at the data. Also, I mount it under .porn, so it isn't visible in a casual listing.
A republic cannot succeed till it contains a certain body of men imbued with the principles of justice and honour.
Given how few use cases there are like the one you describe, there are probably a lot of important considerations that didn't make it into your question that make your use case unique.
This is one of those cases where you really need to sit down and decide what works best for your situation, NOT what works best for other situations that require this amount of data storage.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
To store files close to a petabyte, you need a petafile, obviously.
Storing the data is the easy part, GlusterFS should do it just fine. The point I am curious about is backups: how do you back up such a volume?
Disclaimer: I work for a storage vendor. Also a long time Slashdot reader though, so this isn't meant as a sales pitch.
Half of a petabyte is not really a lot of data in today's world. I talk to people every day who are trying to find ways to manage many PBs (into the hundreds) and are having challenges doing this with traditional storage. The trend that was started by the big Internet companies is to get rid of the fibre-channel SANs and instead solve the problem of storage using standard x86 servers. They use Linux as an abstraction layer from the hardware, and applications acting as storage systems to pool many servers together.
One of the challenges you need to get over is stretching a namespace that big without filesystem limitations like maximum inode counts. This is generally accomplished using some type of key/value store (object) under the hood. Single flat namespaces with no practical size barrier.
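To make the "flat namespace that still looks like a directory tree" idea concrete, here is a toy sketch. It is an in-memory dict, not how any particular product works, and the class and key names are invented for illustration: objects live under full-path keys, and a "directory listing" is just a prefix scan.

```python
# Toy illustration of a flat key/value namespace that can still be browsed
# like a directory tree. Purely in-memory; class and key names are invented.


class FlatNamespace:
    def __init__(self):
        self._objects = {}  # full key -> bytes

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]

    def listdir(self, prefix=""):
        """Emulate one directory level by grouping keys on the next '/'."""
        if prefix and not prefix.endswith("/"):
            prefix += "/"
        entries = set()
        for key in self._objects:
            if key.startswith(prefix):
                rest = key[len(prefix):]
                head, sep, _ = rest.partition("/")
                entries.add(head + sep)  # trailing '/' marks a "directory"
        return sorted(entries)


if __name__ == "__main__":
    ns = FlatNamespace()
    ns.put("genomes/hg19/chr1.fa", b"...")
    ns.put("genomes/README", b"...")
    print(ns.listdir("genomes"))
```

There are no inodes or per-directory limits in this model; the "tree" exists only in how the keys are named. A real system would use an index rather than scanning every key.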
Some options that are available today are Swift from OpenStack and Ceph from Red Hat if you want to go the open source route. These can be good choices if you have the engineering staff on hand to piece it all together and the talent to keep it running. GPFS is also making a comeback in this area, and there are a ton of startups looking at this space now.
My company has a commercial solution for this stuff. Pretty cool - it's a Linux app and runs on the server of your choice. I'll save you the sales pitch, and if you want you can try it for free on your own here: http://scality.com/trial
Whatever you choose, best of luck to you!
I am a professional and manage several hundred petabytes globally. From experience I can tell you, they may be asking for half petabyte right now but tomorrow that will double and again next year and so on. Plan big to start with and you'll save your future self a lot of grief! If you PM me I can give you more details but in short I can suggest:
1) Look at a scalable filesystem like GPFS or StorNext. Yes there is a price tag associated with big iron filesystems (and no I don't work for any of them) but you get what you pay for, and scalability is everything. As an example - pairing GPFS with TSM and the right hardware, I can create an infinitely scalable filesystem that'll scale to yottabytes.
2) Tier the storage system. Think SSD for the cache (here and now) I/O, Winchester disk for the short term and tape for the long term. Yes, tape: compare the cost per TB of tapes in the vault with the cost of square footage in the data center.
3) Separate your networks. Keep the client access separated from the disk i/o. Doing this will save massive congestion problems from day one!
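The tiering in point 2 usually boils down to a policy keyed on file metadata. A minimal sketch, assuming last-access age is the only criterion; the 7-day and 180-day thresholds are made-up numbers that you would replace with ones taken from your own access statistics.

```python
# Minimal age-based tiering policy: SSD for hot data, disk for warm, tape for cold.
# The thresholds are hypothetical; derive real ones from measured access patterns.
from datetime import datetime, timedelta

HOT_WINDOW = timedelta(days=7)
WARM_WINDOW = timedelta(days=180)


def choose_tier(last_access, now):
    """Pick a storage tier from how long ago the file was last accessed."""
    age = now - last_access
    if age <= HOT_WINDOW:
        return "ssd"
    if age <= WARM_WINDOW:
        return "disk"
    return "tape"


if __name__ == "__main__":
    now = datetime(2015, 6, 1)
    for days in (1, 30, 400):
        print(days, choose_tier(now - timedelta(days=days), now))
```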
There are lots of other things to consider but by today's standards a half petabyte isn't an insurmountable amount of data just like a terabyte was twenty years ago.
It may sound "funny," but I once priced Mega (KimDotCom) for offsite backup & storage. They turned out to be less expensive than Amazon Glacier by a bit AND instantly available. We didn't go with them. Instead, we replicated across data centers with multi-terabyte storage nodes.
Isilons are a cool technology. Take FreeBSD, add a custom filesystem (OneFS), link individual nodes via Infiniband, and let the custom code automatically select which nodes/drives to fetch data from. If a hard drive blows, it shrinks the array in order to maintain redundancy.
Of course, Isilons support deduplication, iSCSI (you create a disk image and mount that), and your NAS protocols of choice. If you set a hard quota, the presented directory can be configured to show the quota as the disk space present. Very nifty, and not that expensive for an enterprise array. Need more space? Add drives or more nodes.
For long term backups, Isilons support NDMP [1].
[1]: Of course, you can always connect a tape silo to a UNIX machine, write a script that SSHes into an Isilon node and pulls off /ifs/data.
Store it in the cloud. 1/2 petabyte isn't even the "highest tier" requirement.
On Azure it will cost $168k/year to store this much data instantly accessible. Whatever other solution you come up with, if it takes more than 1 full time person to support, then it's already more expensive (and that's not even including the up-front capital costs, installation and setup costs, training costs, depreciation, maintenance, ...)
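For comparison shopping it helps to turn that annual figure into a unit price. A quick sketch, assuming "this much data" means 500 TB in decimal units and ignoring egress and transaction charges, which are not covered by the number above.

```python
# Convert the quoted annual cloud cost into a per-TB and per-GB monthly price.
# Assumes 500 TB (decimal) and ignores egress/transaction fees.

ANNUAL_COST_USD = 168_000
CAPACITY_TB = 500


def price_per_tb_month(annual_cost, capacity_tb):
    return annual_cost / 12 / capacity_tb


if __name__ == "__main__":
    per_tb = price_per_tb_month(ANNUAL_COST_USD, CAPACITY_TB)
    print(f"${per_tb:.2f} per TB-month, ${per_tb / 1000:.3f} per GB-month")
```

That works out to about $28 per TB per month, which is the number to hold up against the amortized cost of owned hardware plus staff.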
Sounds like a fairly simple case for a Hadoop cluster - a smallish one at that. We're currently deploying to clusters at 1PB/rack density, which means you could deploy a rack or two easily enough. You'd get compute, you get a single flat filesystem, you get redundancy, all built in. Our biggest cluster is now up to 16PB, all one big compute/storage beast, chugging away all day.
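A rough sizing sketch for the "rack or two" estimate. It assumes the 1 PB/rack figure is raw capacity and that HDFS runs with its default replication factor of 3; both the reading of the density figure and the lack of headroom for temp space are assumptions on my part.

```python
# Rough HDFS sizing: raw capacity and racks needed for a given usable size.
# Assumes 1 PB/rack is RAW capacity and the default replication factor of 3.
import math

REPLICATION = 3
RACK_RAW_TB = 1000


def racks_needed(usable_tb, replication=REPLICATION, rack_raw_tb=RACK_RAW_TB):
    """Return (raw_tb, racks) needed to hold usable_tb of data."""
    raw_tb = usable_tb * replication
    return raw_tb, math.ceil(raw_tb / rack_raw_tb)


if __name__ == "__main__":
    raw, racks = racks_needed(500)
    print(f"{raw} TB raw, {racks} rack(s)")
```

So half a petabyte of data at 3x replication is about 1.5 PB raw, or two racks at that density before allowing any headroom.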
I'd suggest starting with the Hortonworks Sandbox VM - grab it, fire it up, play with it. Add some files, poke around, see if it meets your needs. Learn about MapReduce, or maybe your data can be put into Hive for analysis.
The nice thing is that you can use hardware you may already have to get things going. Hortonworks is pretty much at the point of a 'next next finish' installer, so you really only need to dedicate a few hours to getting something up to test. Then, there's a lot of tuning and craziness to running a bigger cluster, but a POC is simple.
Anyhow, I'm blind, because all I do is Hadoop clusters all day, but this seems like an easy win for ya.
GL;HF!
We emerge from our mother's womb an unformatted diskette; our culture formats us. - Douglas Coupland
Not only are you out of your league, but you're barking up the wrong tree.
1) You should hire someone to figure it out for you- as either on-site consultancy or use something like amazon.
2) You should use a different site that has more than 5 legitimate comments on a thread.
Another example of posters trying to be cute and split their reply between the Subject and Comment blocks. It causes confusion when the comments don't stand alone and then you realize the subject line needs to preface the comment.
Just "Don't" do it.
--
If costs are not a priority look into using multiple EMC SANs striped in a RAID array. I've installed a few with the largest encompassing 14 physical units for ~100 VMs, they work great.
Get quotes from Netapp, EMC, and Red Hat.
Budget? I suppose we do a round of layoffs...
Sleep your way to a whiter smile...date a dentist!
I think that the intention was to stimulate a discussion amongst a community of geeks who have a genuine interest in this type of technology and enjoy discussing solutions that they have built. Sure, you could just outsource the service and pay consultants to do it for you but I don't think that is the general ethos of the traditional Slashdot reader. Also, if you feel that you should be paid for commenting here then this is probably not the forum for you. Twat.
How about MooseFS (http://moosefs.org) for an OSS solution, or if you want appliances off the shelf that won't cost you a limb or three, Exablox (http://exablox.com). Or if you need more than the 700TB that can give you, how about http://www.scality.com/ - which is software defined and you can use your own iron.
-- Sig Sig Sputnik
Which is the perfect situation to employ a consultant. Outcome 1: he'll ask the right questions, get accurate answers because management know the requirements, and it'll be a success. Outcome 2..N: it'll be a disaster but it won't be your fault.
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Would I make a major enterprise purchase based on a Slashdot discussion? Absolutely not. Would I want to read a Slashdot discussion and maybe follow suggested links and look up all the buzzwords BEFORE talking to vendors or consultants? Absolutely.
Both are free, hardware agnostic and the future of software defined storage. And Red Hat can provide enterprise support if you need.
Which is the perfect situation to employ a consultant. Outcome 1: he'll ask the right questions, get accurate answers because management know the requirements, and it'll be a success. Outcome 2..N: it'll be a disaster but it won't be your fault.
Excellent answer and the bit about covering your rear is priceless.
I have consulted on issues like this and there are multiple solutions some relatively simple and others complex, but a backup solution for half a petabyte of data is not going to come cheap so obviously any professional consultant will also want to cover themselves as well.
If a project has not been raised with all input being documented, milestones set and sign-off for all steps no professional consultant would want to touch this. Sure you can jury rig a solution and it may work but if anything goes wrong then whoever is perceived to be guiding this is effectively going to be looking for a new job.
Here are some very basic questions a consultant is going to ask and don't think these can be answered in a simple sentence:
The above is just the start of the questions and there are going to be many, many more that will require detailed answers before any recommendation is reached with regard to equipment, installation, maintenance as well as backup, storage and recovery strategies. This takes time and everyone wants to cover their rear, so sign-off for important steps (i.e. milestones) is essential.
There ain't no such thing as proprietary standards only proprietary formats. Standards are by definition open.
I did some work to help ease the traffic flow around Atlanta, GA. (There is a giant highway that runs around it in a circle; access was fairly easy but egress was not as good as it should have been. The idea was brilliant when they designed it. Importantly, population growth was around the outside of the circle, so there was congestion at peak hours and the load was not where it was anticipated and designed for.)
Anyhow, after bidding and getting the contract (a consulting contract - we would recommend design changes, for example, but not specify how the changes were made only what needed to be changed and where and traffic engineers would take care of the rest - traffic engineering was not a part of this contract and we did not bid on that project due to the mess that it was, it has only been marginally improved but it is great in off-peak hours except it is not really needed in off-peak hours) we learned something. They had effectively bid out to hire a consultant to see if they had needed to hire a consultant. Our internal name was, "The Georgia Recursive Loop." The City of Atlanta has its own traffic engineers, not as many as needed really, so we were unable to recommend consulting a consultant to keep the chain going.
That was one of the projects (surprisingly few) that made me feel a little bad for the tax payers. They were not the only ones that hired a consultant to consult on hiring consultants. Sometimes they hire a lawyer, a specialist who is not on the city budget, to determine if they should hire a consultant to determine if they should employ the services of a consultant. (I am looking at YOU District of Columbia. I am looking at you...) Buffalo, NY hired a lawyer who recommended a specialist lawyer to vet our proposals. The original lawyer remained on the books and handled communication between the specialist lawyer (who had ended up being our main contact) and the city council. The council, of course, reported to the manager of the local transportation department. It was a lot like the "Chinese Telephone" game we played as kids where you say one thing in one person's ear and they repeat it and so on and so on until it is munged silliness at the end.
It is quite lucrative, really. If you are not insane when you start then you will be by the time you get familiar with all of the silliness. Sorry for the novella but there simply is no easy way to share the experiences. Hopefully it is reasonably clear. My only justification, for being a part of the system, is that it paid well, provided great jobs, and the tech/educational aspects of it were originally mind blowing and fun.
"So long and thanks for all the fish."
Windows is limited to 512 total shadow copies. Shadow copies could accidentally be lost for a number of reasons, they are not guaranteed. Microsoft has a list of things to be careful about that can influence your chance of losing a shadow copy, including block size and defragmentation, which could cause older shadow copies to get destroyed.
LVM has performance issues. Many people complain of an over 10x reduction in performance after only a few snapshots. It also only works at the block level and not the FS level, which highly limits its usefulness.
'nuff said
Bingo Dictionary - Pragmatist, n. A myopic idealist.
We store and back up about this much data (a little more), although spread across a variety of machines. All in all, though, the data is primarily virtual hard drives (we run a private cloud environment).
Storing it on disk is easy enough - and cheap enough, that it's little concern. Amazon, Azure, etc. are *insanely* expensive for this task, month by month, compared to self owned disks.
As our hypervisors are all Microsoft (Hyper-V - and yes, I know this is Slashdot and I just said I use a Microsoft product but it's easily the most economical approach, when 99% of your clients need Windows licensing), we use Windows Server 2012 R2 native tiered storage pools on a mix of SATA HDD and SSD to achieve the storage, generally spread across a group of Supermicro servers with large numbers of disk bays - effectively software defined storage.
For backup, we use the highly dense 1RU servers, with 12 bays (Supermicro again), with commodity 6 or 8TB SATA disks. Each RU can get near to 100TB of storage (raw) and they don't use much kW - and they cost hardly anything. Backups are performed using Microsoft DPM 2012 R2, as well, because, again, cheapest option and so far, 0 problems.
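The density figure is easy to check, and the same arithmetic gives a rough count of backup boxes for a half-petabyte. A sketch assuming decimal TB and raw capacity only, with no allowance for RAID or parity overhead or for retaining multiple backup generations.

```python
# Raw capacity per 12-bay 1RU backup server and how many are needed for 500 TB.
# Raw numbers only: no RAID/parity overhead, no extra retention copies.
import math

BAYS = 12


def raw_tb_per_server(disk_tb, bays=BAYS):
    return bays * disk_tb


def servers_needed(target_tb, disk_tb, bays=BAYS):
    return math.ceil(target_tb / raw_tb_per_server(disk_tb, bays))


if __name__ == "__main__":
    for disk_tb in (6, 8):
        print(disk_tb, raw_tb_per_server(disk_tb), servers_needed(500, disk_tb))
```

With 8 TB disks that is 96 TB raw per rack unit, so a single raw copy of 500 TB fits in six such servers; real deployments need more once redundancy and retention are added.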
The biggest issue I have is airwalled backups - those are hard to manage, for low dollars, for this kind of setup. So I've resorted to having a few more backup machines and manually swapping the network cable from one group, to the next, as the equivalent of swapping tapes.