Slashdot Mirror


No Hassle RAID 5 Implementations?

LambSpam asks: "I had a nightmare week (last week) with two of our servers running Intel's U3-1L RAID controller (RAID 5). Whenever there's a power outage in our building these controllers randomly mark one or more of the drives in the array offline (even with adequate UPS support), which means I have to manually mark them online and/or rebuild. Intel acknowledged the problem, but their solution involves updating the backplane's firmware, the controller firmware (destructive upgrade!), and even the firmware on our IBM drives in the array because they 'draw too much power' in certain conditions. I've only used one other RAID 5 implementation (MegaRAID), and it NEVER had these kinds of problems, whereas if you sneeze too hard around this U3-1L card it will go offline. Is this common with most hardware RAID implementations? What RAID 5 implementations work without hassle? What should I stay away from?"
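
For readers wondering what a "rebuild" involves: the following is a minimal sketch of the XOR parity idea that RAID 5 is built on, not of how the U3-1L or any other controller implements it. The four-drive stripe and the block contents are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a hypothetical 4-drive array: three data blocks plus parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the second drive, then rebuild its block
# from the surviving data blocks and the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

Because any single missing block can be recomputed this way, one offline drive is survivable; two drives marked offline at once is not, which is why spurious offline marks are so dangerous.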

11 of 51 comments

  1. Re:PERC? by krangomatik · · Score: 3, Interesting

    I haven't personally had any big problems with the PERC boards, although friends and co-workers always seem to have had bad experiences with them. I've had really good luck with IBM ServeRAID boards. We have quite a few of these in production boxes and haven't had any problems with them (the IBM hard drives, on the other hand... plenty of failures there). If your RAID problems are big enough that you're willing to put up lots of $$$ to get rid of them, you could look at buying a SAN or NAS. That way, in theory, you could have the vendor install and maintain the disk for you. Generally they seem to do an okay job. I must mention, however, that I have seen a vendor make an oops and drop power to an array while trying to fix a power supply problem. That took some time to get back online because the CE out on site wasn't familiar with that product and ended up having to get a senior CE to drive out and fix it. All in all it seems like the big boys (IBM, EMC, Sun, STK, etc.) are pretty good about keeping uptimes in the 99.99%+ range (I guess that's what you give them the big bucks for).
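
    To put the 99.99% figure in perspective, here is a small sketch that converts an uptime percentage into allowed downtime per year. The function name is made up for illustration, and a year is assumed to be 365.25 days.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumes an average year of 365.25 days

def downtime_minutes_per_year(uptime_percent):
    """Minutes of downtime per year permitted by a given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)

for pct in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

    At 99.99% that works out to a bit under an hour per year, so a single botched repair visit can use up the whole budget.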

  2. Re:PERC? by AnalogBoy · · Score: 3, Interesting

    Keep in mind there are a few different versions of the PERC, some better than others.

    Just a note on EMC: when I've had the joy of working with a Symmetrix, EMC has always done a wonderful job of never having any downtime. They would come out at any hour of the day or night to replace a redundant card or a spare disk that wasn't even being utilized. They always evaluate any changes before they are made. I'm sure it's possible for them to make a mistake, but for mass storage they're the ones I would choose.

  3. Re:PERC? by foobar104 · · Score: 3, Interesting

    "All in all it seems like the big boys (IBM, EMC, Sun, STK, etc.)"

    Just FYI, Sun doesn't actually make their high-end storage product. I think they call it the StorEdge 9900 or something but it's actually a rebranded Hitachi Data Systems 9960.

    Funny thing about HDS. When you buy one of their 9960 systems (a minimum investment of about $250,000), you get a guarantee. If you ever lose any data at all on that storage system due to a hardware or firmware fault, HDS will give you 30% of your purchase price back.

    According to one of the senior HDS VPs that I spoke to last month, they've never had to pay out that penalty clause.

  4. Tried Adaptec? by Judg3 · · Score: 5, Informative

    Where I used to work (an all-Windows shop) we used Adaptec RAID cards in all our "tower" based servers. Even the lower priced models (AAA-131U2) always performed without a hitch and we never had any problems with them at all. AMI's RAID controllers are real nice and all, but for the price it just wasn't worth it. The Adaptec solutions performed just as well and at a lower cost. You'd do well to check 'em out.

    Now the 3200 RAID controllers in the Compaqs, that's a different story altogether.
    We had roughly 2000 servers, operating 24/7 @ 67 degrees F. Two times a year we had a site shutdown. Every single time we had to bring everything back up we would have anywhere from 3-5 Compaq array controllers die. But never once did the low-buck Adaptecs crap out on us.

    --
    Looking for hardware (Currently need: Large Etch-a-Sketch) Have one? See my journal!
    1. Re:Tried Adaptec? by Sivar · · Score: 3, Informative

      The general consensus on StorageReview.com (a site that I would trust for anything storage related) is that Adaptec cards are crap: the performance under load is mediocre, they tend to die (despite being solid-state devices), and oftentimes the non-Windows drivers aren't the best.
      Don't take it from me, ask around there. If they worked for you, however, great. Whatever works.

      --
      Computer Science is no more about computers than astronomy is about telescopes. --E. W. Dijkstra
  5. Firmware by Holophax · · Score: 4, Informative

    Just as a shot in the dark, I would suggest trying to upgrade the firmware on the drives first. At one of my old jobs, we used nothing but IBM drives, and we constantly had problems with the drives becoming marked as bad or offline, but simply pulling them and plugging them back in (hot swap) would bring them back. In our situation, we were using IBM Netfinity servers with IBM RAID controllers. When we talked to IBM, they admitted there was a problem with the firmware on the drives: whenever an event (even a simple read error) happened, the drive would not spit out just one error but spew them constantly, which made the RAID controller mark the drive as bad. Seeing as the firmware upgrade only takes a few minutes of downtime and is non-destructive, it might be worth a shot.

  6. Two possibilities... by Vrallis · · Score: 4, Interesting

    First, are you sure your UPS is a *TRUE* UPS? Even a lot of the 'high end' UPSes out there are still REALLY switched UPSes. This could very well be your problem.

    The other one is something I've heard of (I'm not an electrical expert, but I'll try to explain). Larger sites (older installations, particularly) were wired for three-phase electricity. Over time, they split the phases for normal 110 volt usage. There is a chance that if the PC is connected to power on one phase, but the external unit is connected to power from a different phase, the differential between the two can cause problems, due to the ground connection between the two through the cable shielding. I know, it sounds like something from the BOFH daily calendar, but it does make sense. Try making sure both pieces of equipment are on the same true UPS, or at least on switched UPSes on the same circuit.

    1. Re:Two possibilities... by RatOmeter · · Score: 3, Informative

      "First, are you sure your UPS is a *TRUE* UPS?"

      The term you're looking for here is "On-line UPS". There are two basic varieties of UPS, switched and on-line. Both share the following common features: The AC (mains) power coming into the UPS is rectified (converted to DC, usually in the range of 24 to 48 VDC). The DC is used to charge the batteries, which are the source for backup power when the mains fails. AC backup power is supplied to your equipment by an inverter (DC to AC converter) in the UPS, which takes the battery's DC juice and "builds" a 50 or 60 Hz AC sine or pseudo sine wave at the right voltage.

      Switched UPS: When the AC mains is OK, your equipment is being powered by it. When the mains fails, the UPS literally switches to backup power from the inverter. This switching takes a measurable amount of time to complete and relies on your equipment's electronics to ride through the loss of power until the switch to inverter power is complete. Advantage? Switched UPSes are generally less expensive.

      Online UPS: Regardless of whether the mains power is OK or not, the UPS's inverter is already on and already supplying your equipment. When the AC mains does fail (momentary loss, glitch, blackout or brownout), it takes zero time to switch to UPS power, because your equipment was already on UPS power! Advantages? (1) Zero switching time, (2) the online UPS will feed a constant, glitch-free sine wave to your equipment at the right frequency and the right RMS voltage all the time.
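
      The difference between the two types can be sketched as a simple comparison between the UPS transfer gap and how long the equipment's power supply can coast without input. All the millisecond figures and names below are illustrative assumptions, not measurements of any real UPS or server.

```python
def rides_through(transfer_ms, holdup_ms):
    """True if the power supply's hold-up time covers the UPS transfer gap."""
    return holdup_ms > transfer_ms

# Hypothetical hold-up time of the server's power supply at its current load.
HOLDUP_MS = 16.0

# Hypothetical transfer times in milliseconds for each UPS type.
ups_transfer_ms = {
    "switched": 8.0,  # relay has to switch the load over to the inverter
    "online": 0.0,    # inverter already feeds the load, so there is no gap
}

for name, transfer in ups_transfer_ms.items():
    ok = rides_through(transfer, HOLDUP_MS)
    print(f"{name}: rides through = {ok}, margin = {HOLDUP_MS - transfer} ms")
```

      With a switched UPS the margin depends on both numbers, and hold-up time tends to shrink as the power supply is loaded more heavily; with an online UPS the margin is always the full hold-up time.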


  7. Compaq is good. by NetJunkie · · Score: 3, Interesting

    When I took over my current job the last network team had overloaded the circuits in the server room. We've had 3 circuits trip and had servers drop hard. None of the Compaq SmartArray controllers had any problems recovering.

    I suggest you also fix your power problem. The systems should have no idea power was lost to the building. If you are using a UPS and this is still happening, I'd find a better one.

  8. Re:Good advice above. by walt-sjc · · Score: 3, Insightful

    Ahhh! NO!!! Do NOT NOT NOT put everything on one circuit. First, computers with switching power supplies (almost 100% are) are NON-linear in power usage. They draw LARGE spikes of current sporadically. Second, if you blow a circuit, EVERYTHING YOU HAVE goes down. BAD BAD BAD! Third, if you run dual power supplies on your equipment, a power problem / spike on the circuit will affect both power supplies, not even counting that 50% of the benefit of dual power supplies is so that you have power redundancy.

    As others have stated, make sure you have a true "online" UPS, but ALSO make sure that you don't run over 50% power utilization on the UPS, due to the non-linear nature of switching power supplies.
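
    The 50% rule of thumb above can be checked with a few lines. This is only a sketch: the server wattages and the UPS rating are hypothetical, and real planning should use measured draw rather than nameplate numbers.

```python
MAX_UTILIZATION = 0.5  # the 50% ceiling suggested above

def ups_utilization(load_watts, ups_capacity_watts):
    """Fraction of the UPS's rated capacity used by the attached loads."""
    return sum(load_watts) / ups_capacity_watts

# Hypothetical measured draw of four servers, in watts.
servers = [350, 350, 420, 280]
# Hypothetical UPS rating, in watts.
capacity = 3000

util = ups_utilization(servers, capacity)
within_budget = util <= MAX_UTILIZATION
print(f"UPS utilization: {util:.0%} ({'OK' if within_budget else 'over the 50% ceiling'})")
```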

    Of course the BEST power stability solution is to use all 48VDC equipment like telcos do. When was the last time your phone went down due to telco hardware failure? Note that most major hardware vendors have 48VDC versions of their equipment (Sun, Cisco, etc.).

  9. Clarification by Futurepower(tm) · · Score: 3, Informative


    Everything needs to be on the same ground circuit. This is necessary to avoid ground loops.

    "They draw LARGE spikes of current sporadically."

    I don't think this is correct. I have designed power supplies, and I don't immediately think of any reason why the power input of a switching power supply should vary differently from the power output. The only surge is when the hard disks spin up, but with SCSI there is a means to stagger the spin-up.
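
    To illustrate why staggering the spin-up matters, here is a rough sketch of worst-case current draw when drives start in groups. It assumes each group has finished spinning up before the next one starts, and the drive count and amp figures are hypothetical, not taken from any datasheet.

```python
def peak_current(n_drives, spinup_amps, running_amps, group_size):
    """Worst-case draw when drives spin up in groups of `group_size`.

    Assumes every earlier group has settled to its running current
    before the next group begins to spin up.
    """
    peak = 0.0
    started = 0
    while started < n_drives:
        group = min(group_size, n_drives - started)
        draw = started * running_amps + group * spinup_amps
        peak = max(peak, draw)
        started += group
    return peak

# Hypothetical 8-drive array: 2.0 A per drive while spinning up, 0.7 A once running.
all_at_once = peak_current(8, 2.0, 0.7, group_size=8)
one_at_a_time = peak_current(8, 2.0, 0.7, group_size=1)
print(f"all at once: {all_at_once:.1f} A, staggered: {one_at_a_time:.1f} A")
```

    Under these assumptions the staggered peak is well under half of the all-at-once peak, which is the whole point of the feature.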
