Open Source Highly Available Storage Solutions?
Gunfighter asks: "I run a small data center for one of my customers, but they're constantly filling up different hard drives on different servers and then shuffling the data back and forth. At their current level of business, they can't afford to invest in a Storage Area Network of any sort, so they want to spread the load of their data storage needs across their existing servers, like Google does. The only software packages I've found that do this seamlessly are Lustre and NFS. The problem with Lustre is that it has a single metadata server unless you configure fail-over, and NFS isn't redundant at all and can be a nightmare to manage. The only thing I've found that even comes close is Starfish. While it looks promising, I'm wondering if anyone else has found a reliable solution that is as easy to set up and manage? Eventually, they would like to be able to scale from their current storage usage levels (~2TB) to several hundred terabytes once the operation goes into full production."
When you're talking HA, you're always talking "big money". If you want a fully redundant infrastructure, you might have to start using commercial operating systems (like Novell Linux, RedHat, or even other Unix-based commercial OS like Solaris or AIX). The problem here is support. Full HA environments are incredibly complex, and you will need to make very, very sure that everything works well.
A wrongly implemented HA system will have less uptime than a US$499 Dell with a single ATA drive.
Entry-level SANs using iSCSI are available at quite affordable prices. Look at HP's and IBM's offerings (e.g. the DS300). Even the entry models allow you to use MPIO.
Use SCSI or SATA disks mirrored via DRBD on storage nodes that export iSCSI devices.
On usage nodes you can use ocfs2 or RedHat's gfs for accessing those iSCSI devices.
You should also use meaningful fencing/locking methods (read the manuals for ocfs2 or gfs for details).
Configuring failover for Lustre's metadata server seems the obvious solution... there are plenty of FOSS failover solutions (although I for one prefer Veritas VCS for anything critical).
"He who would learn astronomy, and other recondite arts, let him go elsewhere. " -- John Calvin, commenting on Genesis 1
What about ZFS, which runs on one of the above-listed OSes? Easy to manage, easy to grow, seamless usage from the user's point of view, etc...
Not exactly what you're asking for, but a damn good answer for your problem.
Apple's Xserve RAID and OpenFiler (openfiler.org). The Xserve RAID is basically an LSI Logic Engenio RAID at a very cheap price, and you can't beat OpenFiler for free. The Xserve RAID at 10.5TB costs about $1.31 a GB.
I know several people who back up their NetApps to this setup or just use it for storage where they don't require what NetApp offers and don't want to spend $25k+.
Before you go about deciding what file system you need, you need to spend some time thinking about what kinds of files your customers are storing. RDBMS data? Large graphics/audio/video files that rarely if ever change? Scanned documents? Large numbers of small files? Small numbers of large files? You get the idea.
Then you can start looking at solutions. 'Optimal File System' can mean many things to many people, and everyone here is going to have a different viewpoint. You need to decide what features of a file system make it optimal for you. Then you can go looking for a solution.
Sig? What sig? Do I have to have a sig!?!?
GlusterFS (www.gluster.org) is just "THE" best cluster filesystem I have ever studied. I'm testing a few of them for a project here at my job and, at least for my case, GlusterFS is the best. It can scale to the petabytes with as many servers as you want. It can also use InfiniBand as interconnect protocol besides the usual TCP/IP/Ethernet.
Its design is simple and smart. Every feature is a translator that interconnects with other translators, so you can organize your filesystem the way *you* want it.
Let me give you an example: they have two translators, 'unify' for unifying hard drives as one and 'afr' for automatic file replication. Depending on the order in which you stack them, you get two completely different setups: two clusters replicating each other, or a cluster of replicating server pairs.
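To make the ordering point concrete, here is a toy Python model of the two stackings. It is only a sketch: the tuple layout, the `survives` helper, and the brick names a-d are made up for illustration and are not GlusterFS configuration syntax. It assumes 'afr' keeps data reachable while any one child is alive and 'unify' needs every child alive.

```python
# Toy model only: this is NOT GlusterFS configuration syntax.
# A layout is either a brick name (str) or a tuple (kind, [children]).

def survives(layout, failed):
    """Return True if all data is still reachable given a set of failed bricks."""
    if isinstance(layout, str):
        return layout not in failed
    kind, children = layout
    alive = [survives(child, failed) for child in children]
    if kind == "afr":      # replication: any surviving copy is enough
        return any(alive)
    if kind == "unify":    # aggregation: every member holds unique files
        return all(alive)
    raise ValueError(f"unknown translator kind: {kind}")

# unify on top of afr: a cluster of replicating server pairs
pairs = ("unify", [("afr", ["a", "b"]), ("afr", ["c", "d"])])
# afr on top of unify: two clusters replicating each other
mirrored = ("afr", [("unify", ["a", "b"]), ("unify", ["c", "d"])])

for failed in ({"a", "c"}, {"a", "b"}):
    print(sorted(failed), survives(pairs, failed), survives(mirrored, failed))
```

With the same four bricks, the two stackings tolerate different double failures: the paired layout survives losing one brick from each pair, while the mirrored layout survives losing one whole cluster.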
Besides its features and design, its development team is *very* friendly. Yesterday a user asked for a feature on the devel list and got an answer saying: good idea, I'll do it.
Very good software.
Take a look: http://www.gluster.org/glusterfs.php
As others have mentioned, HA solutions are complicated and expensive. Unless you really need it, you probably don't want to go down that route.
But with 2 TB currently and scaling to perhaps a few hundred TB in the future, the obvious simple solution is to just buy bigger servers. With modern gear you can really connect a frightening amount of storage to a single server at modest cost. Say a rackmount box with space for 12 drives, then SAS card(s) with external connector(s), so you can chain together multiple enclosures. Taking Dell as an example (just what I quickly found with Google (http://www.dell.com/downloads/global/power/ps3q06-20050311-Kadam-OE.pdf); I'm sure other vendors sell similar gear), you can daisy-chain 3 enclosures per connector, and a SAS card has 2 connectors. With 2 cards per server (about what you can fit in a 2U box?) you then get 12 external enclosures of 15 drives each, which together with the 12 internal drives gives a total of 192 drives. With 750 GB SATA drives you then have 144 TB raw storage per server.
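The drive count in this post can be checked with a few lines of Python. This is just a sketch of the arithmetic; it assumes the 192-drive figure counts the 12 internal bays on top of the external enclosures, which is the only reading under which the quoted numbers add up.

```python
# Figures quoted in the post above.
ENCLOSURES_PER_CONNECTOR = 3
CONNECTORS_PER_CARD = 2
CARDS_PER_SERVER = 2
DRIVES_PER_ENCLOSURE = 15
DRIVE_SIZE_GB = 750
# Assumption: the 12 internal bays of the server are included in the total.
INTERNAL_DRIVES = 12

enclosures = ENCLOSURES_PER_CONNECTOR * CONNECTORS_PER_CARD * CARDS_PER_SERVER
external_drives = enclosures * DRIVES_PER_ENCLOSURE
total_drives = external_drives + INTERNAL_DRIVES
raw_tb = total_drives * DRIVE_SIZE_GB / 1000  # decimal TB, as drive vendors count

print(enclosures, external_drives, total_drives, raw_tb)
```

This gives 12 enclosures, 180 external drives, 192 drives in total, and 144 TB raw, before any RAID or filesystem overhead.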
When needs grow beyond one server, clever use of automount maps lets you manage the namespace for multiple servers more easily than doing it all by hand.
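As a rough illustration of the automount idea, the sketch below renders an autofs-style indirect map (one `key options server:/path` entry per line) from a single table, so that moving a share to another server means changing one entry. The share names, the servers `fs1` and `fs2`, the export paths, and the mount options are all hypothetical.

```python
# Hypothetical share -> (server, export path) assignments; none of these
# names come from the thread.
SHARES = {
    "projects": ("fs1", "/export/projects"),
    "scans": ("fs2", "/export/scans"),
    "video": ("fs2", "/export/video"),
}

def render_indirect_map(shares, options="-rw,hard"):
    """Render one 'key options server:/path' line per share, sorted by key."""
    lines = []
    for key in sorted(shares):
        server, path = shares[key]
        lines.append(f"{key}\t{options}\t{server}:{path}")
    return "\n".join(lines) + "\n"

print(render_indirect_map(SHARES), end="")
```

Clients would mount every share under one directory regardless of which server holds it, so data can be shuffled between servers without users seeing a path change.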
As for Lustre, it's really a specialized solution for HPC, made for multiple compute nodes striping to the storage nodes at full speed using a collective IO API like MPI-IO.
We've been using MogileFS on commodity Linux servers for a few months now and it's been working great. The MogileFS community/mailing list is very active, so it's actually been fun to implement.
Right now we have 22.8 TB spread across six 2U servers using a mix of 400 and 500 GB SATA drives. The great thing is that we can lose an entire file server (or two) with no downtime or loss of data.
Another reason to like MogileFS is that it removes the need to maintain RAID arrays. A RAID-5 array made of 750 GB disks is very risky. A high-end controller will still take many hours to rebuild a degraded array, during which time you could lose another disk and be largely screwed. (This actually happened to us very early on and we lost 0.02% of our data after restoring from backup, which still hurt.)
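For a feel of the "many hours" rebuild window, here is a back-of-the-envelope sketch. The sustained rebuild rates are assumptions, not measurements from this setup; real rebuilds on a busy array are often slower than the raw disk speed.

```python
def rebuild_hours(disk_gb, rebuild_mb_per_s):
    """Hours needed to rewrite one whole disk at a sustained rebuild rate."""
    seconds = disk_gb * 1000 / rebuild_mb_per_s  # decimal GB -> MB
    return seconds / 3600

# Assumed rebuild rates in MB/s; pick whatever your controller really sustains.
for rate in (20, 50, 100):
    print(f"750 GB at {rate} MB/s: {rebuild_hours(750, rate):.1f} h")
```

During that whole window a RAID-5 array has no redundancy left, which is the risk described above.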
I say we take off and nuke the entire site from orbit. It's the only way to be sure.
I don't know why you think NFS doesn't support failover; check out Red Hat Cluster (PDF) or Sun Cluster. You will need a RAID array that has two host ports, such as VTrak E310s, IBM DS3200, HP StorageWorks 500, or Xserve RAID.
I would not suggest cluster file systems such as Lustre for a small installation; they're generally designed to scale up to hundreds or thousands of servers, but not to scale down to a handful.
If they want 'several hundred' terabytes then they can afford a server with a SAS enclosure attached to it. The true cost will be the drives you put in it, but you can use standard SATA drives so it won't cost too much.
Alternatively, buy more drives and put them in 1 server with a good raid card. Even cheaper.
If they want true multiple-server redundancy, then you just need 2 of everything, and rsync them every so often, or make backups of the first onto the second.
ZFS is to disks what Pac-man is to dots. Run out of space? Feed it another block device.
Don't get me wrong, I've met enough problems myself in IT. But firstly, your problem needs to be expressed clearly.
"High Availability" can mean a lot of things. The most important part of it, though, is "how highly available do you need?". Do you want to survive the loss of a server? Of a room? An office? A city?
Basically, you've got two options.
1. Homebuilt, possibly based around either Solaris (ZFS looks interesting) or a specialised Linux distribution. OpenFiler looks interesting but doesn't appear to get a lot of attention, so community support may be lacking. Unless you've already got the hardware, however, you'll need at least two reasonably large servers.
Depending on how crucial all this is to your employer (I'm assuming it's fairly crucial or you wouldn't be looking at HA systems in the first place), the level of support you have available to fall back on with this may or may not be acceptable.
In any case, if you're going to have to spend the amount of money involved in buying two large servers and paying for support on a linux distro anyway, you may as well look at option 2.
2. An entry-level SAN.
Yes, I know you said you can't afford it. But I don't think the problem you're discussing can be easily tackled for zero cost, and if there's cost involved you'd be remiss in your duties not to cover every possible base.
I was faced with the same problem myself a few months ago. Eventually I concluded that there simply wasn't the business justification for highly-available storage - we could make do with servers with redundant power supplies and disks, and regular backups. However, I was surprised to find that an entry-level SAN from Dell (actually rebranded EMC units) isn't that much dearer than "buy two dirty great servers and run OpenFiler", and has the benefit that if you do need support, you don't run the risk of hardware and software support folks pointing the finger at each other, saying "it's not our problem, it's theirs".
Plus any half-decent SAN vendor will provide a clear upgrade path - if you roll your own, you'll have to figure out how you upgrade on your own when the time comes.
Finally, think of it like this.
Any business which relies on its backend systems to be solid and reliable should take any reasonable suggestion to maintain that reliability seriously. And by definition, this implies that storage must be reliable.
If it's that important to the business that your systems continue to operate in the face of extreme adversity, and you decided to save £1000 by taking the homebrew route, you're going to have a lot of justifying to do if the worst happens and your supposedly-HA system falls over. Particularly if your answer to "what are you doing about it?" is "I've posted a message to a forum and I'm awaiting a reply". Realistically the only way it can work is if you're competent enough to be able to fix even the worst outage yourself with little or no recourse to asking on forums (though reading documentation is OK). Even then, you should keep the system simple enough that it doesn't take several months of familiarising yourself with it before anyone else has a chance of fixing it, otherwise all you've done is moved the point of failure from the hardware to yourself.
The alternative answer "I've placed an emergency support call with our suppliers and they should be ringing me back within the hour" carries a heck of a lot more weight.
Software like Lustre and Starfish only wants you to help test it. Neither is OSS in my opinion, and neither is ready for production. So if you have to pay, why not go with commercial software? Have a look at polymatrix, although they do not have an integrated HSM. Or get SAMFS in an HA-NFS server configuration (could be Linux). Yes, you pay for the license by the GB, but the same is true for the hardware cost. Having a single (large enough and scalable) filesystem will stop your customer from duplicating data and moving things around, which causes increased maintenance cost.
I've never tried it, but I think I read somewhere you can do md (as in /dev/md or /sbin/mdadm) RAID using devices on the network. I'm not sure if it would apply to this situation anyway, but I thought it worth mentioning.
Max.
Take a look at ATA over Ethernet. You can use your existing Ethernet (or the secondary ports if you have them) and get what is essentially SAN storage. I would recommend getting an external box like a Coraid, but you could build your own with the vblade software. I would just recommend using as many physical disks as possible and striping them so you can get acceptable performance. You can even use a clustered file system like Red Hat GFS on them.
Many of the most helpful posts here have tried to touch on how important your data is, and what happens if you have downtime. My organization is going through its first, long overdue, SAN purchase. We currently have four 2 TB SATA arrays that are going to be replaced by two completely independent EMC CX3 SANs. We made this decision, after much pushing on the "moneymen", because downtime is not an option and the SATA arrays cannot be trusted.
So, how much does an hour of downtime cost you? How big is your IT staff, and can they handle management of a large open source file system cluster? The extra money spent on a smaller SAN may pay for itself in a matter of hours in a failure situation.
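The "pays for itself in a matter of hours" point is a simple break-even calculation. In the sketch below both the extra cost of the SAN and the hourly cost of downtime are made-up illustrative figures; plug in your own.

```python
def breakeven_hours(extra_cost, downtime_cost_per_hour):
    """Hours of avoided downtime needed to recover the extra purchase cost."""
    if downtime_cost_per_hour <= 0:
        raise ValueError("downtime cost per hour must be positive")
    return extra_cost / downtime_cost_per_hour

# Hypothetical numbers: the SAN costs 20,000 more than the homebuilt option,
# and an hour of downtime costs the business 5,000.
print(breakeven_hours(extra_cost=20_000, downtime_cost_per_hour=5_000))
```

If a single outage on the cheaper system would last longer than the break-even figure, the extra spend is already justified on that outage alone.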
Also, it sounds like you would like to use a SAN if the money were available. One thing I discovered about SANs is that buying bigger is not always better. You say you only need a few TB now, but hundreds later. My advice is not to look for a SAN that can grow to 100+ TB, but for one sized to what you "reasonably" expect to be using in 2-3 years. Here is why. SANs typically come with 3-year service warranties. At the same time, the model lifespans are also timed to be around 3 years. So, in 3 years, when your service contract is up, the model is being phased out. Instead of renewing your service contract, you trade in and upgrade your SAN, and in three years even the smallest SAN will be able to do 100+ TB.
I recently attended a SAN integration class for our new system where some students were from one of the largest US retailers. The students said their group buys a SAN, starts migrating data to it (which takes 18 months), and then starts migrating data off almost immediately after all the data has been migrated on. The migration off the SAN also takes 18 months.
I think you may really want to consider a small SAN as your choice, especially if you are going to be buying new hardware for this system.
That was extremely funny... even if I don't know if I took it the way you meant it.
Thread originator, have you read RefriedBean's post and checked out Coraid and the AoE protocol? Scalable data storage, LAN (but future for WAN), RAID capable, lower cost than Fibre et al., fast because of lower transmission overhead in the protocol (it does not use the TCP part of TCP/IP), and you pick your FS.
"turning espresso into code..."