Slashdot Mirror


Open Source Highly Available Storage Solutions?

Gunfighter asks: "I run a small data center for one of my customers, but they're constantly filling up different hard drives on different servers and then shuffling the data back and forth. At their current level of business, they can't afford to invest in a Storage Area Network of any sort, so they want to spread the load of their data storage needs across their existing servers, like Google does. The only software packages I've found that do this seamlessly are Lustre and NFS. The problem with Lustre is that it has a single metadata server unless you configure fail-over, and NFS isn't redundant at all and can be a nightmare to manage. The only thing I've found that even comes close is Starfish. While it looks promising, I'm wondering if anyone else has found a reliable solution that is as easy to set up and manage? Eventually, they would like to be able to scale from their current storage usage levels (~2TB) to several hundred terabytes once the operation goes into full production."

46 comments

  1. Entry level SAN? by lukas84 · · Score: 5, Insightful

    When you're talking HA, you're always talking "big money". If you want a fully redundant infrastructure, you might have to start using commercial operating systems (like Novell Linux, RedHat, or even other Unix-based commercial OS like Solaris or AIX). The problem here is support. Full HA environments are incredibly complex, and you will need to make very, very sure that everything works well.

    I wrongly implemented HA system will have less uptime than a 499US$ Dell with a single ATA drive.

    Entry level SANs using iSCSI are available at quite affordable prices. Look at HP's and IBM's (e.g. the DS300). Even the entry models allow you to use MPIO.

    1. Re:Entry level SAN? by jhines · · Score: 2, Interesting

      Solaris 10 is free these days. ZFS looks really good for this kind of application.

    2. Re:Entry level SAN? by lukas84 · · Score: 1

      Support contracts and supported hardware aren't.

    3. Re:Entry level SAN? by Metrol · · Score: 1

      Seems that ZFS will be available for FreeBSD here soon. Available now for the adventurous, which I imagine is not how the poster is feeling. As a FreeBSD user I'm definitely looking forward to a stable ZFS on there.

      --
      The line must be drawn here. This far. No further.
    4. Re:Entry level SAN? by jabuzz · · Score: 2, Insightful

      Thing is they want to be able to go from 2TB to hundreds of TB and they cannot afford a SAN!!!

      They need to accept that this ain't going to happen, and what they need to do is put in a solution for now and plan for a different solution when they go into production and presumably have the money.

      However, one has to wonder: if their current storage requirements are a measly 2TB, why the heck do they need more than one server, unless it is a second for failover?

    5. Re:Entry level SAN? by Linagee · · Score: 1, Offtopic

      I wrongly implemented HA system will have less uptime than a 499US$ Dell with a single ATA drive.

      Was I the only one to catch this disturbing Freudian slip? Is this guy dangerous around computers or what?

    6. Re:Entry level SAN? by msporny · · Score: 4, Informative

      Full Disclosure: I'm one of the authors of the Starfish Filesystem.

      Simply not true anymore, lukas84. High-availability solutions don't have to cost "big money". Starfish is the perfect example of such a system. In fact, it is THE reason we wrote Starfish: To provide an inexpensive, fault-tolerant, highly available clustered storage platform that works from the smallest website to the largest storage network. We've based the technology on the assumption that having expensive hardware/software is the wrong way to go about solving the problem.

      Full HA environments do not need to be incredibly complex. If your HA solution is incredibly complex, you've done something wrong. Take a look at how easy it is to set up a Starfish file system:

      Starfish QuickStart Tutorial

      That solution doesn't cost "big money", nor is it "incredibly complex".

      --
      Manu Sporny (skype: msporny, twitter: manusporny, G+: +Manu Sporny)
      Founder/CEO - Digital Bazaar, Inc.
    7. Re:Entry level SAN? by duffbeer703 · · Score: 2, Interesting

      I respectfully disagree. Products like Starfish might make storage highly available at a low price, but what about the other components of the system? If your network, app servers, etc aren't highly available, you have a whole new range of equipment and services that needs an HA solution as well.

      I worked at a place where a $400 million project that spent tons of money on high availability database and server components was crippled by bad switches and application servers.

      --
      Conformity is the jailer of freedom and enemy of growth. -JFK
    8. Re:Entry level SAN? by msporny · · Score: 2, Funny

      network, app servers, etc aren't highly available, you have a whole new range of equipment and services that needs an HA solution as well

      I couldn't agree with you more. I focused on the storage aspect because the article, thread, and Starfish are about HA storage.

      I worked at a place where a $400 million project that spent tons of money on high availability database and server components was crippled by bad switches and application servers.

      I'm sorry to hear that. What an embarrassingly colossal waste of money. I'm assuming that was US taxpayer dollars at work?
      --
      Manu Sporny (skype: msporny, twitter: manusporny, G+: +Manu Sporny)
      Founder/CEO - Digital Bazaar, Inc.
  2. On a normal hardware you can by WetCat · · Score: 4, Informative

    Use SCSI or SATA disks mirrored via DRBD on storage nodes that export iSCSI devices.
    On the client nodes you can use ocfs2 or Red Hat's gfs to access those iSCSI devices.
    You should also use meaningful fencing/locking methods (read the ocfs or gfs manuals for details).

    1. Re:On a normal hardware you can by purduephotog · · Score: 1

      Interesting.

      Our 'organization' has a similar situation - they will have about 200 GB of data coming in per day, yet only have 2.5 TB of data storage (RAID 5). It's hilarious when the engineers tell management that the solution isn't workable - they don't listen, and they 'dismiss' anyone that 'can't provide productive feedback'.
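      As a back-of-the-envelope check on those numbers, here is a minimal sketch of how long such an array lasts. It assumes decimal units (1 TB = 1000 GB), that the 2.5 TB is usable capacity and starts empty, and that nothing is ever deleted; the function name is purely illustrative.

```python
def days_until_full(capacity_tb: float, ingest_gb_per_day: float,
                    used_tb: float = 0.0) -> float:
    """Days until the array fills at a constant ingest rate (1 TB = 1000 GB)."""
    free_gb = (capacity_tb - used_tb) * 1000
    return free_gb / ingest_gb_per_day


# 2.5 TB of storage with 200 GB arriving per day
print(days_until_full(2.5, 200))  # 12.5
```

      Under those assumptions the storage is full in under two weeks, which is the engineers' point.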

      At least there's an answer out there, I'll read up on this and see if there isn't a way around it.

    2. Re:On a normal hardware you can by Fallen+Kell · · Score: 1, Insightful

      ummmm.... you are not even in the same league, let alone ballpark, with what the OP is asking. I believe he is talking about something like the Veritas Cluster Server, where you have multiple systems which can be used to serve out services such as NFS, or even software application services like running a ClearCase database, or even websites.

      Basically you set up two similar systems (well, they don't have to be, but it helps); they get a direct connection between the two, as well as the normal network connections. For this to work well, it is also assumed that you have a SAN set up with the data volumes that you want to share out being available to both servers. If one system goes down, the other takes over all the services that the failed one was handling. For instance, if you are doing NFS shares, the disk groups that hold the volumes are first brought under control by the surviving system, then the volumes are mounted; once the volumes are mounted, the virtual IP address used for the hostname sharing that data is configured, and then the volumes are shared out. All clients actually see the information through the virtual IP address/DNS name, so if a server fails, the clients will only see a hiccup in their connection to the data areas during the time it takes for the other server to take control of the disks and set up the virtual address.
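      The ordering of that takeover matters, so here is a minimal sketch of it as an ordered list of steps that stops at the first failure. This is not Veritas Cluster Server's API or configuration; the step names and the `take_over` and `nfs_takeover_steps` helpers are hypothetical, and each action only records a message where a real cluster would run the actual operation.

```python
from typing import Callable, List, Tuple

# A step is a name plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def take_over(steps: List[Step]) -> List[str]:
    """Run takeover steps in order and stop at the first failure.

    Returns the names of the steps that completed.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            break  # later steps depend on earlier ones, so do not continue
        completed.append(name)
    return completed


def nfs_takeover_steps(log: List[str]) -> List[Step]:
    """Hypothetical NFS takeover: each action only appends to a log."""
    def record(message: str) -> Callable[[], bool]:
        def action() -> bool:
            log.append(message)
            return True
        return action

    return [
        ("import_disk_groups", record("disk groups under control of this node")),
        ("mount_volumes", record("volumes mounted")),
        ("configure_virtual_ip", record("virtual IP configured")),
        ("share_volumes", record("volumes shared out over NFS")),
    ]


log: List[str] = []
print(take_over(nfs_takeover_steps(log)))
```

      The virtual IP comes up only after the volumes are mounted, so clients never reach a server that cannot yet see the data.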

      Now, you would also want other things like redundant SAN storage, using storage arrays that support multiple paths or possibly even mirroring between multiple arrays through software like Veritas Disk Suite.

      Again, this is well above the mirrored disks in a single server. The poster wants full redundancy in services. Your mirror only fixes a few disk failures, not a network subnet outage, a fibre switch failing, a motherboard failing, a fibre card dying, etc., etc., etc.... In other words there is a LOT that needs to be in place to get real high availability. A mirror won't cut it.

      --
      We were all warned a long time ago that MS products sucked, remember the Magic 8 Ball said, "Outlook not so good"
    3. Re:On a normal hardware you can by Fallen+Kell · · Score: 1

      hmmm... must have hit reply on wrong post.

      --
      We were all warned a long time ago that MS products sucked, remember the Magic 8 Ball said, "Outlook not so good"
    4. Re:On a normal hardware you can by sumdumass · · Score: 1

      I had a similar situation happen. We spec'ed out a 1 TB server that was supposed to store nothing but document images at one client's office. This was supposed to last for 4-5 years and account for the volume it would be seeing in that lifetime.

      One of their other servers started flaking out after it was dropped 2 feet to the ground by someone trying to access the rear of the cabinet to plug an extension cord into the outlet they saw (don't ask me). The solution was to temporarily use the document storage server. That was less than a year ago, and the hard drive is full already and I have had to open storage space on the main OS drive in the /tmp directory. I am hoping it fills up, the power goes off and the machine doesn't boot again, just to get things back into shape. But for some reason, spending another $1500 for a RAID controller and more drives isn't in the budget for a company that brings in millions a year.

  3. Then configure failover ... by Fished · · Score: 2, Informative

    Configuring failover for Lustre's metadata server seems the obvious solution... there are plenty of FOSS failover solutions (although I for one prefer Veritas VCS for anything critical).

    --
    "He who would learn astronomy, and other recondite arts, let him go elsewhere. " -- John Calvin, commenting on Genesis 1
  4. Zfs on Solaris 10, OpenSolaris, FreeBSD... by xeoron · · Score: 1

    What about the ZFS that runs on one of the above-listed OSes? Easy to manage, easy to grow, seamless from the user's point of view, etc...

  5. OpenFiler and Apple's Xserve RAID by C_Kode · · Score: 2, Informative

    Not exactly what you're asking for, but a damn good answer for your problem.

    Apple's Xserve RAID and OpenFiler (openfiler.org). The Xserve RAID is basically an LSI Logic Engenio RAID at a very cheap price, and you can't beat OpenFiler for free. The Xserve RAID at 10.5TB costs about $1.31 a GB.

    I know several people who back up their NetApps to this setup, or just use it for storage where they don't require what NetApp offers and don't want to spend $25k+.

  6. Rethink your drink by wcspxyx · · Score: 5, Insightful

    Before you go about deciding what file system you need, you need to spend some time thinking about what kinds of files your customers are storing. RDBMS data? Large graphics/audio/video files that rarely if ever change? Scanned documents? Large numbers of small files? Small numbers of large files? You get the idea.

    Then you can start looking at solutions. 'Optimal File System' can mean many things to many people, and everyone here is going to have a different viewpoint. You need to decide what features of a file system makes it optimal for you. Then you can go looking for a solution.

    --
    Sig? What sig? Do I have to have a sig!?!?
  7. GlusterFS by danielcolchete · · Score: 5, Informative

    GlusterFS (www.gluster.org) is just "THE" best cluster filesystem I have ever studied. I'm testing a few of them for a project here at my job and, at least for my case, GlusterFS is the best. It can scale to petabytes with as many servers as you want. It can also use InfiniBand as an interconnect besides the usual TCP/IP/Ethernet.
    Its design is simple and smart. Every feature is a translator that interconnects with other translators. So you may organize your filesystem the way *you* want it.
    Let me give you an example: they have 2 translators, 'unify' for unifying hard drives as one and 'afr' for automatic file replication. Depending on the order you use them in, you have two completely different setups. You can have two clusters replicating each other, or you can have a cluster of replicating server pairs.
    Besides its features and design, its development team is *very* friendly. Yesterday someone (a user) asked for a feature on the devel list, and got an answer saying: good idea, I'll do it.
    Very good software.
    Take a look: http://www.gluster.org/glusterfs.php
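    To illustrate the point about translator order, here is a toy model of the two setups described above. It is not GlusterFS code or its volume-file syntax: the `Brick`, `Unify` and `AFR` classes and the `placements` function are hypothetical, and a simple byte-sum hash stands in for whatever placement policy the real 'unify' translator uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Brick:
    name: str  # one disk/server holding files


@dataclass
class Unify:
    children: list  # presents the children as one namespace; a file lives on one child


@dataclass
class AFR:
    children: list  # replicates every file to all children


def placements(node, filename: str) -> List[str]:
    """Return the bricks that hold a copy of filename in this toy model."""
    if isinstance(node, Brick):
        return [node.name]
    if isinstance(node, AFR):
        copies: List[str] = []
        for child in node.children:
            copies += placements(child, filename)
        return copies
    # Unify: pick exactly one child (deterministic stand-in for a real scheduler)
    index = sum(filename.encode()) % len(node.children)
    return placements(node.children[index], filename)


a1, a2, b1, b2 = Brick("a1"), Brick("a2"), Brick("b1"), Brick("b2")

# Two clusters replicating each other: afr on top of two unify volumes.
mirrored_clusters = AFR([Unify([a1, a2]), Unify([b1, b2])])

# A cluster of replicating pairs: unify on top of two afr pairs.
cluster_of_pairs = Unify([AFR([a1, b1]), AFR([a2, b2])])

print(placements(mirrored_clusters, "report.pdf"))
print(placements(cluster_of_pairs, "report.pdf"))
```

    In both layouts each file ends up with two copies; what changes is which machines form a replication unit, and therefore how the cluster fails and how you grow it.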

    1. Re:GlusterFS by Anonymous Coward · · Score: 0

      I'll second that. The extremely modular approach is really paying off; each piece is very simple, new features are developed at an astonishing rate, and bugs are tracked down very quickly. It also runs on top of whatever underlying filesystem you like (even NFS!), and this gives one the assurance that, in case GlusterFS was completely broken somehow, you could still access the data.

      The developers are very friendly and active on the mailing list. Their roadmap is very aggressive. In a few months, they'll support most anything you'd want in a distributed filesystem.

      I think the developers themselves were surprised at how the benchmarks are turning out. They were inspired by Lustre when developing GlusterFS, but they wanted something simpler. However, the benchmarks are showing that, in many/most cases, GlusterFS is actually faster (and Lustre, last I knew, was the world-record holder)!

  8. Just buy bigger servers by joib · · Score: 3, Informative

    As others have mentioned, HA solutions are complicated and expensive. Unless you really need it, you probably don't want to go down that route.

    But with 2 TB currently and scaling to perhaps a few hundred TB in the future, the obvious simple solution is to just buy bigger servers. With modern gear you can really connect a frightening amount of storage to a single server at modest cost. Say a rackmount box with space for 12 drives, then SAS card(s) with external connector(s), so you can chain together multiple enclosures. Taking Dell as an example (just what I quickly found with google (http://www.dell.com/downloads/global/power/ps3q06-20050311-Kadam-OE.pdf), I'm sure other vendors sell similar gear), you can daisy-chain 3 enclosures per connector, and a SAS card has 2 connectors. With 2 cards per server (about what you can fit in a 2U box?) you then get 12 external enclosures of 15 drives each, for a total of 192 drives (180 in the enclosures plus the 12 in the server itself). With 750 GB SATA drives you then have 144 TB raw storage per server.
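    The arithmetic can be checked with a short sketch. The parameter names are illustrative, the defaults are the figures from the paragraph above, and it assumes the 12 internal bays are counted in the total and that 1 TB = 1000 GB.

```python
def raw_capacity(cards: int = 2, connectors_per_card: int = 2,
                 enclosures_per_connector: int = 3, drives_per_enclosure: int = 15,
                 internal_drives: int = 12, drive_gb: int = 750):
    """Return (enclosures, total drives, raw capacity in TB) for one server."""
    enclosures = cards * connectors_per_card * enclosures_per_connector
    drives = enclosures * drives_per_enclosure + internal_drives
    raw_tb = drives * drive_gb / 1000
    return enclosures, drives, raw_tb


print(raw_capacity())  # (12, 192, 144.0)
```

    This is raw capacity only; any RAID parity or hot spares would come out of the 144 TB.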

    When needs grow beyond one server, clever use of automount maps lets you manage the namespace for multiple servers more easily than doing it all by hand.

    As for Lustre, it's really a specialized solution for HPC, made for multiple compute nodes striping to the storage nodes at full speed using a collective IO API like MPI-IO.

    1. Re:Just buy bigger servers by msporny · · Score: 2, Informative

      Full Disclosure: I'm one of the authors of the Starfish Filesystem.

      As others have mentioned, HA solutions are complicated and expensive. Unless you really need it, you probably don't want to go down that route.

      High-availability solutions don't have to be complicated and expensive. Starfish is the perfect example of such a simple and low-cost system. In fact, it is THE reason we wrote Starfish: To provide an inexpensive, fault-tolerant, highly available clustered storage platform that works from the smallest website to the largest storage network. We've based the technology on the assumption that having expensive hardware/software is the wrong way to go about solving the problem.

      Buying bigger servers and attaching massive storage systems to them is not a very good idea when it comes to reducing single points of failure in your HA network. You must assume hardware failure - it is going to happen, when you have so many pieces of spinning metal you will hit the point at which you are losing a hard drive every day. You will start losing machines at least once a month. Or worse - what happens when you lose one out of your four "big servers" and 155TBs goes off-line in an instant? Buying bigger and more expensive hardware is a "throw money at the problem and maybe it'll disappear" solution. It is wishful thinking at best. The system you describe is a nightmare scenario when it comes to HA - I would highly advise that nobody solve their storage problem with that approach.

      As for Lustre, it's really a specialized solution for HPC, made for multiple compute nodes striping to the storage nodes at full speed using a collective IO API like MPI-IO.

      Not really. We've used it for years on several of our web clusters. It does a very good job at providing great I/O throughput, yes - but it is applicable to many more problems than that. It is a good file system back-end for any website that has to deal with a large amount of data. It might not be right for what you want to do with it, but that doesn't mean it should be pigeon-holed to only being a "specialized solution for HPC".

      --
      Manu Sporny (skype: msporny, twitter: manusporny, G+: +Manu Sporny)
      Founder/CEO - Digital Bazaar, Inc.
    2. Re:Just buy bigger servers by joib · · Score: 1


      High-availability solutions don't have to be complicated and expensive. Starfish is the perfect example of such a simple and low-cost system. In fact, it is THE reason we wrote Starfish: To provide an inexpensive, fault-tolerant, highly available clustered storage platform that works from the smallest website to the largest storage network. We've based the technology on the assumption that having expensive hardware/software is the wrong way to go about solving the problem.


      Oh, I absolutely agree. But I wasn't advocating any gold-plated SAN solution either. A standard rackmount server with hotswappable fans and power supplies is a few thousand, the daisy-chainable external enclosures (again, with redundant power supplies and fans) likewise. With 750 GB SATA drives, about 1/2 of the hardware budget goes to the drives, which is about the same as you get with the smaller nodes that you advocate. Some study (IBM, IIRC) showed that about 80% of server failures are due to failing power supplies, fans, or hard drives. With all of these redundant, even a single server can be quite reliable. And if that isn't enough, you can mirror it to another box using DRBD or GNBD. And the software is mature, which unfortunately can't be said of most parallel file systems (case in point, our $zillion Cray is down at the moment due to Lustre problems), and free. Your website says Starfish is $12000 per 10 TB, which is more than the hardware itself.

      Of course, at some point a bunch of big mirrored servers and playing with automount becomes pretty tedious to maintain.

      I guess my main point is that with current modestly priced gear (standard rackmount servers and external enclosures), you can go to pretty big systems before a "real" parallel FS becomes necessary. I guess in many cases the limiting factor will be the network BW, 2 GbE (most rackmount servers come with two gigabit ethernet interfaces, so you can bond them) is not that much for ~150 TB storage. 10 GbE would help, of course.
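      To put a number on the network bottleneck, here is a rough sketch of how long it takes to move the whole store over the wire. It assumes decimal units, a link running at full line rate with no protocol overhead, and disks that can keep up, so real transfers would be slower; the function name is illustrative.

```python
def transfer_days(storage_tb: float, link_gbit_per_s: float) -> float:
    """Days needed to move storage_tb over a link at full line rate."""
    bits = storage_tb * 1e12 * 8
    seconds = bits / (link_gbit_per_s * 1e9)
    return seconds / 86400


for link in (2, 10):
    print(f"{link} Gbit/s: {transfer_days(150, link):.1f} days")
```

      With these assumptions, 150 TB takes roughly a week over two bonded gigabit links and a bit under a day and a half over 10 GbE.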

      However, with a stable, mature, free and high performance parallel FS the balance would shift to much less storage per server (perhaps even using just internal storage). I'm just not convinced such a thing exists yet, however much I'd like to see it.

    3. Re:Just buy bigger servers by msporny · · Score: 2, Informative
      Full disclosure: I am one of the authors of the Starfish file system.

      With all of these redundant, even a single server can be quite reliable.

      Hmmm... you seem to be concerned with a completely different class of problem than the one Starfish addresses. HA systems assume that your single server will fail eventually (which it will). There are many single points of failure in the scenario you describe (RAM, motherboard, a glitch in the redundant power supply). What happens when you need to take the machine down for maintenance? What happens when the power strip or the UPS you have the machine plugged into fails? Your proposed solution also doesn't scale very well. If you connect 10 clients to a file system exported by your single redundant server, you have created a fantastic bottleneck in your system architecture.

      Of course, at some point a bunch of big mirrored servers and playing with automount becomes pretty tedious to maintain.

      I'm glad you said this - you are quite right. Most people do not address the amount of money that it costs their system administrators to get it right.

      However, with a stable, mature, free and high performance parallel FS the balance would shift to much less storage per server (perhaps even using just internal storage). I'm just not convinced such a thing exists yet, however much I'd like to see it.

      Just because something has just been released to the public doesn't mean it is not stable and mature. You are drawing a false parallel between "time that the software has been available to the public" and "stability".

      We postulated that most web server clusters out there right now did not need more than 1TB of back-end storage. We use Starfish internally for our storage needs. The system is free for the previously mentioned conditions and has the source code available. We are attempting to provide a solution to the problem that you state at the end of your post.

      --
      Manu Sporny (skype: msporny, twitter: manusporny, G+: +Manu Sporny)
      Founder/CEO - Digital Bazaar, Inc.
  9. Try out MogileFS by Geek+Dash+Boy · · Score: 3, Informative

    We've been using MogileFS on commodity Linux servers for a few months now and it's been working great. The MogileFS community/mailing list is very active, so it's actually been fun to implement.

    Right now we have 22.8 TB spread across six 2U servers using a mix of 400 and 500 GB SATA drives. The great thing is that we can lose an entire file server (or two) with no downtime or loss of data.

    Another reason to like MogileFS is that it removes the need to maintain RAID arrays. A RAID-5 array made of 750 GB disks is very risky. A high-end controller will still take many hours to rebuild a degraded array, during which time you could lose another disk and be largely screwed. (This actually happened to us very early on and we lost 0.02% of our data after restoring from backup, which still hurt.)
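    For a sense of scale on that rebuild window, here is a minimal sketch of the best-case time to rewrite one replacement disk. The sustained rebuild rate is an assumption picked for illustration, not a measured figure for any controller, and a live array serving I/O would rebuild more slowly.

```python
def rebuild_hours(disk_gb: float, rebuild_mb_per_s: float) -> float:
    """Best-case hours to rewrite one whole disk at a constant rate."""
    seconds = disk_gb * 1000 / rebuild_mb_per_s
    return seconds / 3600


# 750 GB disk at an assumed 50 MB/s sustained rebuild rate
print(round(rebuild_hours(750, 50), 1))  # 4.2
```

    Throughout that window a RAID-5 array has no redundancy left, so a second disk failure means restoring from backup.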

    --
    I say we take off and nuke the entire site from orbit. It's the only way to be sure.
    1. Re:Try out MogileFS by msporny · · Score: 1

      Full Disclosure: I'm one of the authors of the Starfish file system.

      We've played around with MogileFS. It does a very good job at archiving files. It is write-once, which is good for certain very specific applications. Unfortunately, it did not solve our problem. We needed a POSIX-compliant file system that looked like just another disk to Linux, but was inexpensive, simple to set up, fault-tolerant, and performed automatic data backup.

      Starfish and Lustre are really for people that just want the file system to work with most of the 15,000+ packages for Linux. No muss, no fuss.

      To give you some background: we needed applications like Samba, Apache, MySQL, NFS, and PHP to just work with the file system without needing any modifications. MogileFS is not POSIX-compliant, thus wasn't a good drop-in replacement for us. Starfish is POSIX-compliant and so is Lustre. Which file system fits your application really depends on your needs. In general, the further away from POSIX-compliant file systems that you go - the more development you will have to do to make your system work correctly.

      --
      Manu Sporny (skype: msporny, twitter: manusporny, G+: +Manu Sporny)
      Founder/CEO - Digital Bazaar, Inc.
  10. Take another look at NFS by Wesley+Felter · · Score: 3, Informative

    I don't know why you think NFS doesn't support failover; check out Red Hat Cluster (PDF) or Sun Cluster. You will need a RAID array that has two host ports, such as VTrak E310s, IBM DS3200, HP StorageWorks 500, or Xserve RAID.

    I would not suggest cluster file systems such as Lustre for a small installation; they're generally designed to scale up to hundreds or thousands of servers, but not to scale down to a handful.

    1. Re:Take another look at NFS by msporny · · Score: 1, Interesting

      I would not suggest cluster file systems such as Lustre for a small installation; they're generally designed to scale up to hundreds or thousands of servers, but not to scale down to a handful.

      Our first Lustre cluster was 3 servers - it worked just fine. Starfish effortlessly scales down to 2 servers. Here is an example of it doing so:

      Starfish Quickstart Tutorial

      Just because something scales to thousands of active nodes and disks, doesn't mean it can't scale down gracefully. The Internet is a good example of this concept.

      --
      Manu Sporny (skype: msporny, twitter: manusporny, G+: +Manu Sporny)
      Founder/CEO - Digital Bazaar, Inc.
  11. then buy 2 fileservers by gbjbaanb · · Score: 1

    If they want 'several hundred' terabytes then they can afford a server with a SAS enclosure attached to it. The true cost will be the drives you put in it, but you can use standard SATA drives so it won't cost too much.

    Alternatively, buy more drives and put them in 1 server with a good raid card. Even cheaper.

    If they want true multiple-server redundancy, then you just need 2 of everything, and rsync them every so often, or make backups of the first onto the second.

  12. OpenSolaris? by Ant+P. · · Score: 1

    ZFS is to disks what Pac-man is to dots. Run out of space? Feed it another block device.

  13. I don't think your question has enough detail by jimicus · · Score: 4, Informative

    Don't get me wrong, I've met enough problems myself in IT. But firstly, your problem needs to be expressed clearly.

    "High Availability" can mean a lot of things. The most important part of it, though, is "how highly available do you need?". Do you want to survive the loss of a server? Of a room? An office? A city?

    Basically, you've got two options.

    1. Homebuilt, possibly based around either Solaris (ZFS looks interesting) or a specialised Linux distribution. OpenFiler looks interesting but doesn't appear to get a lot of attention, so community support may be lacking. Unless you've already got the hardware, however, you'll need at least two reasonably large servers.

    Depending on how crucial all this is to your employer (I'm assuming it's fairly crucial or you wouldn't be looking at HA systems in the first place), the level of support you have available to fall back on with this may or may not be acceptable.

    In any case, if you're going to have to spend the amount of money involved in buying two large servers and paying for support on a linux distro anyway, you may as well look at option 2.

    2. An entry-level SAN.

    Yes, I know you said you can't afford it. But I don't think the problem you're discussing can be easily tackled at zero cost, and if there's cost involved you'd be remiss in your duties not to cover every possible base.

    I was faced with the same problem myself a few months ago. Eventually I concluded that there simply wasn't the business justification for highly-available storage - we could make do with servers with redundant power supplies and disks, and regular backups. However, I was surprised to find that an entry-level SAN from Dell (actually rebranded EMC units) isn't that much dearer than "buy two dirty great servers and run OpenFiler", and has the benefit that if you do need support, you don't run the risk of hardware and software support folks pointing the finger at each other, saying "it's not our problem, it's theirs".

    Plus any half-decent SAN vendor will provide a clear upgrade path - if you roll your own, you'll have to figure out how you upgrade on your own when the time comes.

    Finally, think of it like this.

    Any business which relies on its backend systems to be solid and reliable should take any reasonable suggestion to maintain that reliability seriously. And by definition, this implies that storage must be reliable.

    If it's that important to the business that your systems continue to operate in the face of extreme adversity, and you decided to save £1000 by taking the homebrew route, you're going to have a lot of justifying to do if the worst happens and your supposedly-HA system falls over. Particularly if your answer to "what are you doing about it?" is "I've posted a message to a forum and I'm awaiting a reply". Realistically the only way it can work is if you're competent enough to be able to fix even the worst outage yourself with little or no recourse to asking on forums (though reading documentation is OK). Even then, you should keep the system simple enough that it doesn't take several months of familiarising yourself with it before anyone else has a chance of fixing it, otherwise all you've done is moved the point of failure from the hardware to yourself.

    The alternative answer "I've placed an emergency support call with our suppliers and they should be ringing me back within the hour" carries a heck of a lot more weight.

  14. Free is not necessarily as in free beer by Jump · · Score: 1

    Software like Lustre and Starfish only wants you to help testing the software. Both are not OSS in my opinion and not ready for the production. So if you have to pay, why not go with commercial software? Have a look at polymatrix, although they do not have an integrated HSM. Or get SAMFS in an HA-NFS server configuration (could be Linux). Yes, you pay for the license by the GB, but the same is true for the hardware cost. Having a single (large enough and scalable) filesystem will stop your customer from duplicating data and moving things around, which causes increased maintenance cost.

    1. Re:Free is not necessarily as in free beer by Anonymous Coward · · Score: 1, Informative

      Lustre is used in production as the base filesystem for several of the largest computer systems in the world, numbnuts, and the source (with excellent admin docs) is available under the GPL (i.e. as open source as e.g. linux) from lustre.org, at least the bits that haven't made it into the kernel already (i.e. ext4).

      You may be just out-of-date - Lustre development hasn't used the "ghostscript" like "old versions are open source" model for ages now.

      But if you want HA Lustre, you still need HA-grade and doubled-up-for-failover hardware with shared and raided block devices. Lustre scales much better than virtually anything else, but it's not particularly cheaper hardware-wise if you're using it for HA.

      If you want commercial software, HP will sell you quality-assured Lustre and decent hardware in HA configurations, relabelled "HP SFS".

    2. Re:Free is not necessarily as in free beer by msporny · · Score: 2, Interesting

      Full Disclosure: I am one of the authors of the Starfish file system.

      Software like Lustre and Starfish only wants you to help testing the software.

      Both are not OSS in my opinion and not ready for the production.

      Lustre is open-source and it has been production ready for years. The open source notice is on their website - GPL. You don't get much more open source than GPL. Lustre provides support to commercial enterprises.

      As for Starfish, we eat our own dog food at our company. The newest version of Starfish will be taking over full-time for all of our HA storage systems in one month's time. The website that runs on top of it is Bitmunk, our bread and butter. The license allows anybody to set up a small HA cluster for free. This is going to help a great deal of small websites and research institutions. If they want us to fix bugs that they find, we'll be more than happy to oblige. However, depending on your customers to find your bugs is not only a horrible business practice, it is reckless. We put ourselves at risk far before we make a release - if there is a bug, we're usually the first people to find it.

      Please take a look at both sites more thoroughly.
      --
      Manu Sporny (skype: msporny, twitter: manusporny, G+: +Manu Sporny)
      Founder/CEO - Digital Bazaar, Inc.
    3. Re:Free is not necessarily as in free beer by LivinFree · · Score: 1

      If you want commercial software, HP will sell you quality-assured Lustre and decent hardware in HA configurations, relabelled "HP SFS".

      Also, in the spirit of full disclosure, I am a supporting engineer of Polyserve, now owned by HP.

      If you're looking for an HA storage solution, have you looked at Cluster Gateway? It's essentially a Polyserve file system with the NFS or CIFS solution pack, depending on which platform you're implementing. The software-only costs are relatively low (I've been bitching for a while that they're giving it away), and you can use commodity servers and storage.

      A scalable, clustered file system, that if properly implemented (the important part,) is single-point-of-failure immune. The minimum is two nodes - that's a scale-down if I ever heard one. It works with iSCSI (Openfiler is used by me in my test labs) or FC storage.

      Think of it as GFS + Red Hat Cluster Suite, but better implemented. On the other hand, if you're looking for zero-dollar investment, check out Cluster Suite and GFS with CentOS for free. The user interface is terrible, and simple tasks are made hard, but it does work well, again, if implemented properly.
    4. Re:Free is not necessarily as in free beer by Jump · · Score: 1

      I was looking at both sites recently, and I do appreciate the efforts put forward in both projects. However, the post was about an open source high availability solution which also scales well and deals with users which do not plan well how they distribute their data.

      Starfish has a limit of 1TB for the 'free' solution and is not GPL. Lustre is GPL (the limited free edition only), but cannot reexport over NFS, only Samba (not everyone's choice). What about backup? Both solutions are not providing any means of backing up the presumably huge amount of data. As you get into the 50 TB+ regime, how you would ever be able to make a backup? Here is where a HSM kicks in: backups are not necessary anymore.

      What is missing is an HSM kind of system natively integrated with NFS. Then you could take whatever cluster filesystem you like to provide r/w access to the same aggregated storage pool (iSCSI or FC-attached RAIDs), reexport it with NFS from all cluster nodes (scalable performance & failover!), and use HSM to manage the data growth.

    5. Re:Free is not necessarily as in free beer by msporny · · Score: 1

      What about backup? Both solutions are not providing any means of backing up the presumably huge amount of data. As you get into the 50 TB+ regime, how you would ever be able to make a backup? Here is where a HSM kicks in: backups are not necessary anymore.

      Starfish was designed to automatically back data up - HSM was designed in from the beginning. You never have to backup a Starfish storage network. Take another look at Starfish - it does exactly what you're asking for:

      Starfish Introduction (mentions file mirroring)

      --
      Manu Sporny (skype: msporny, twitter: manusporny, G+: +Manu Sporny)
      Founder/CEO - Digital Bazaar, Inc.
    6. Re:Free is not necessarily as in free beer by Jump · · Score: 1

      The link you quote only explains that you won't lose data if something breaks, because it distributes the data in a redundant way. This is what I expect from every RAID, and it clearly is an intriguing idea to take this one step further and distribute over multiple machines, sites, etc., especially if you can scale performance as well.

      However, with HSM I meant 'hierarchical storage management'; you may also know it by a different name. With HSM, data is kept on cheaper offline media like tapes and on disk. The disk only serves as a cache for the data on tape: less frequently used (forgotten?) data is only on tape and takes a while to be accessed, while more frequently used data is on disk as well. You do not need a backup, because every modification you make to the disk version is copied to tape after a preset delay. This way you have all former versions of a file if you need to recover. When you run out of tape space, you recycle space occupied by older versions, but theoretically you can consider the total space to be infinite (buy tapes as needed), while the 'online space' is limited and only needs to be big enough to hold your active data.

      With an HSM kind of setup, you combine the advantages of tapes (cheap, safe, low energy, easy to extend) with the advantages of disks (fast, random access). This is what is missing in solutions like Starfish. If somebody deletes a file within Starfish, it's probably gone forever, while if you have HSM, there is a tape copy for every change you made. It keeps as many as possible, getting rid of old ones as tape space needs to get recycled.
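
      To make the disk-as-cache idea concrete, here is a toy sketch in Python. It is not how any real HSM product works: the class name, the method names, and the "slot" model of disk space are all made up for illustration, and the copy to tape happens immediately instead of after a preset delay.

```python
from collections import OrderedDict


class ToyHSM:
    """Toy model of hierarchical storage management (HSM).

    Disk is a small LRU cache; 'tape' keeps every version of every file.
    All names are hypothetical and not taken from any real HSM product.
    """

    def __init__(self, disk_slots):
        self.disk_slots = disk_slots  # how many files fit on disk (assumption)
        self.disk = OrderedDict()     # path -> data, least recently used first
        self.tape = {}                # path -> list of versions, oldest first
        self.live = set()             # paths currently visible to users

    def write(self, path, data):
        # Simplification: a real HSM copies to tape after a preset delay.
        self.tape.setdefault(path, []).append(data)
        self.live.add(path)
        self._cache(path, data)

    def read(self, path):
        if path not in self.live:
            raise FileNotFoundError(path)
        if path in self.disk:
            self.disk.move_to_end(path)  # mark as recently used
            return self.disk[path], "disk"
        data = self.tape[path][-1]       # slow path: recall from tape
        self._cache(path, data)
        return data, "tape"

    def delete(self, path):
        # Only the online copy goes away; the tape versions stay.
        self.live.discard(path)
        self.disk.pop(path, None)

    def restore(self, path, version=-1):
        # Fetch any retained version, even for a deleted file.
        return self.tape[path][version]

    def recycle(self, keep):
        # Reclaim tape space by dropping all but the newest `keep` versions.
        if keep < 1:
            raise ValueError("keep must be at least 1")
        for versions in self.tape.values():
            del versions[:-keep]

    def _cache(self, path, data):
        self.disk[path] = data
        self.disk.move_to_end(path)
        while len(self.disk) > self.disk_slots:
            self.disk.popitem(last=False)  # evict least recently used file
```

      In this model a deleted file is still recoverable with restore() until recycle() drops its old versions, which is the difference from plain mirroring described above.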

    7. Re:Free is not necessarily as in free beer by draxbear · · Score: 1

      I agree this is an unfortunate trap for a few smaller shops out there to fall into. I've even had one support call where I asked them where the backups were, and the reply was, "we have mirrored disks"...

      You need to be able to recover from something being deleted (intentionally or otherwise) and often the ability to roll-back in time to older iterations of a document is useful.

      While not asked for by the original post, backing up this data is (hopefully) on his to-do list and is probably an entire post all by itself.

      --
      --- I've completed diagnosis of your problem and can classify it as a YOYO...You're On Your Own
  15. md? by dwater · · Score: 1

    I've never tried it, but I think I read somewhere you can do md (as in /dev/md or /sbin/mdadm) RAID using devices on the network. I'm not sure if it would apply to this situation anyway, but I thought it worth mentioning.

    --
    Max.
  16. AoE by Refried+Beans · · Score: 1

    Take a look at ATA over Ethernet. You can use your existing Ethernet (or the secondary ports if you have them) and get what is essentially SAN storage. I would recommend getting an external box like a Coraid, but you could build your own with the vblade software. I would just recommend using as many physical disks as possible and striping them so you can get some acceptable performance. You can even use a clustered file system like Red Hat GFS on them.

  17. Project scope and downtime costs by ebne0018 · · Score: 1

    Many of the most helpful posts here have tried to touch on how important your data is, and what happens if you have downtime. My organization is going through its first, long overdue, SAN purchase. We currently have 4 2TB SATA arrays that are going to be replaced by 2 completely independent EMC CX3 SANs. We made this decision, after much pushing on the "moneymen", because downtime is not an option and the SATA arrays cannot be trusted.
    So, how much does an hour of downtime cost you? How big is your IT staff and can they handle management of a large open source file system cluster? The extra money spent on a smaller SAN may pay for itself in a matter of hours in a failure situation.
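
    The "pays for itself in a matter of hours" argument is simple arithmetic. A rough sketch in Python follows; the function name is made up, and the dollar figures in the usage note are placeholders, not numbers from any real quote.

```python
def breakeven_hours(extra_cost, downtime_cost_per_hour):
    """Hours of avoided downtime needed to cover the extra spend.

    Both inputs are in the same currency. This ignores staff time,
    support contracts and everything else that a real comparison needs.
    """
    if downtime_cost_per_hour <= 0:
        raise ValueError("downtime cost per hour must be positive")
    return extra_cost / downtime_cost_per_hour
```

    For example, if the SAN costs 20,000 more than the alternative and an hour of downtime costs 5,000 (both invented figures), breakeven_hours(20000, 5000) gives 4.0 hours.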

    Also, it sounds like you would like to use a SAN if the money was available. One thing I discovered about SANs is that buying bigger is not always better. You say you only need a few TB now, but hundreds later. My advice is to not be looking for a SAN that can grow to 100+TB, but for one that covers what you "reasonably" expect to be using in 2-3 years. Here is why. SANs typically come with 3 yr service warranties. At the same time, the model lifespans are also timed to be around 3 years. So, in 3 years, when your service contract is up, the model is being phased out. So, instead of renewing your service contract, you trade in and upgrade your SAN, and in three years even the smallest SAN will be able to do 100+ TB.
    I recently attended a SAN integration class for our new system where some students were from one of the largest US retailers. The students said their group buys a SAN, starts migrating data to the SAN (which takes 18 months), and then starts migrating data off almost immediately after all data is migrated onto the SAN. The migration off the SAN also takes 18 months.
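
    A quick way to put numbers on "what you reasonably expect to be using in 2-3 years" is compound growth. The sketch below is Python with made-up function names; the growth rate is purely an assumption you would have to replace with your customer's real history.

```python
import math


def capacity_after(start_tb, annual_growth, years):
    """Projected capacity in TB, assuming steady compound growth.

    annual_growth is a fraction: 1.0 means the data doubles every year.
    """
    return start_tb * (1 + annual_growth) ** years


def years_to_reach(start_tb, target_tb, annual_growth):
    """Years until start_tb grows to target_tb at the same steady rate."""
    return math.log(target_tb / start_tb) / math.log(1 + annual_growth)
```

    Starting from the ~2TB in the question and assuming the data doubles every year, capacity_after(2, 1.0, 3) gives 16.0 TB at the end of a 3 year service contract, and years_to_reach(2, 200, 1.0) comes out at a bit over six and a half years. Under that assumption the hundreds of terabytes are a problem for the next SAN, or the one after it.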

    I think you may really want to consider a small SAN as your choice, especially if you are going to be buying new hardware for this system.

  18. Re:Windows Vista? by WgT2 · · Score: 1

    That was extremely funny... even if I don't know if I took it the way you meant it.

  19. Coraid & AoE by Ebola_Influenza · · Score: 1

    thread originator, have you read RefriedBean's post, and checked out Coraid & the AoE protocol? scalable data storage, LAN (but future for WAN), RAID capable, lower cost than Fibre et al, fast b/c of lower info transmission overhead in protocol, since it does not use the TCP part of TCP/IP, & pick your FS.

    --
    "turning espresso into code..."