Slashdot Mirror


Sun Gagging Customers Damaged By Memory Problems?

cchuter writes "Apparently Sun has been getting its customers to keep 'mum' about a certain memory problem for as long as 18 months. The problem is assumed to be the cause of many website outages (most visibly, eBay's)."

20 of 156 comments

  1. I've personally seen this happen.. by Anonymous Coward · · Score: 3

    quite a number of times. 6 errors that I have handled in my Sun shop of 3 admins and 50 machines. Only seen 'em on the 400 MHz/8 MB cache processors. Looks something like:

    panic[cpu28]/thread=0x307dbe80: CPU28 Writeback Data Parity Error: AFSR 0x00000000 00800002 AFAR 0x00000001 8104dfe0

    We've found that attaching a grounding strap to all of the servers that were affected has cleaned the problem up. Haven't seen one since the straps; it's been around 3 months, where before it was once a month.
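    For anyone who wants to scan logs for this kind of panic, here is a minimal Python sketch. It is modeled only on the single line quoted in this comment, so the pattern is an assumption and other releases may format the message differently. It also assumes the two 8-digit groups after AFSR and AFAR are the high and low halves of one 64-bit value; the names `PANIC_RE` and `parse_panic` are made up for illustration.

```python
import re

# Pattern modeled only on the one panic line quoted in the comment;
# other OS or firmware releases may word the message differently.
PANIC_RE = re.compile(
    r"panic\[cpu(?P<cpu>\d+)\]/thread=(?P<thread>0x[0-9a-fA-F]+):\s+"
    r"CPU\d+\s+(?P<error>.+?):\s+"
    r"AFSR\s+0x(?P<afsr_hi>[0-9a-fA-F]{8})\s+(?P<afsr_lo>[0-9a-fA-F]{8})\s+"
    r"AFAR\s+0x(?P<afar_hi>[0-9a-fA-F]{8})\s+(?P<afar_lo>[0-9a-fA-F]{8})"
)


def parse_panic(line):
    """Return the CPU number, error text and register values, or None."""
    m = PANIC_RE.search(line)
    if m is None:
        return None
    return {
        "cpu": int(m.group("cpu")),
        "error": m.group("error"),
        # Assumption: the two groups are the high and low 32 bits.
        "afsr": int(m.group("afsr_hi") + m.group("afsr_lo"), 16),
        "afar": int(m.group("afar_hi") + m.group("afar_lo"), 16),
    }


sample = (
    "panic[cpu28]/thread=0x307dbe80: CPU28 Writeback Data Parity Error: "
    "AFSR 0x00000000 00800002 AFAR 0x00000001 8104dfe0"
)
info = parse_panic(sample)
print(info["cpu"], info["error"], hex(info["afsr"]), hex(info["afar"]))
```

    Counting matches per CPU number would show whether the same processor has panicked more than once.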

    1. Re:I've personally seen this happen.. by Anonymous Coward · · Score: 3

      Yes, that's a classic case of the problem. I'm a Sun support engineer of sorts by trade.

      A few points - The Ecache problem should, and normally only does, happen once on a CPU. If it happens more than once, Sun replace your CPU.

      The problem is severely localised, i.e. it seems to affect systems in certain system rooms. The problem only ever occurs when a system is idle on SMP systems (Who runs 400mhz 8mb Ecache UltraSPARC cpus in a single CPU system?).

      I've seen figures and graphs about the problem - It was a problem that has been generally sorted since about November last year, but the "one off reboot" problem appears to have persisted.

      The reason why Sun got customers to sign NDAs is that they were given in severe detail what the problem was, what Sun were doing to deal with it, and how Sun have generally sorted the problem. There's no large scale cover up, Sun gave customers extremely detailed and relevant information. This can only be a consistent problem if a customer doesn't act on the problem.

      This problem had precisely nothing to do with the Ebay admins problem with running their Oracle database. I'm posting anonymously for obvious reasons, even though I've given out far more detailed info. to customers with problems, and probably on occasion on USENET.

      Sun reacted to the problem with much vigor. I don't think you'd get the same detail from INTEL etc. Obviously Sun don't go out of their way to advertise the problem. I've heard a few "solutions" to the problem conjured by folks on USENET, including tightening a screw on your CPU.

      There was no cover up. Customers got information. Customers who wanted more info. got detailed sensitive information on the problem, if they had a good case for it. Sun aren't about to force people to sign NDAs to use StarOffice.

      I don't believe this is as severe a problem as it used to be. I've seen the stats.

  2. Like Ronald Reagan... by namespan · · Score: 5
    Perhaps memory problems are contagious. First, the computer displays them. Then, the customer mysteriously acquires them ("er, um, no. I don't think we had a memory problem"). Then, the vendor ("why no. I don't recall us pressuring anyone").


    Darn Jedi mind tricks...

    --
    Libertarianism is rich wolves and poor sheep playing gambler's ruin for dinner.
  3. Re:Nothing secret about it by Mr.+Slippery · · Score: 4
    Dunno where the Gartner Group gets its figures from.
    I believe the answer is "thin air." Or "out of their collective ass."
    --
    Tom Swiss | the infamous tms | my blog
    You cannot wash away blood with blood
  4. The ebay issue is entirely separate by ragnar · · Score: 3
    Please refer to the following article I wrote a few months back to dispel some of the hype about the Ebay problems. The article that /. cites is speculating that these events are related, but to the best of my research and feedback from many parties involved, the problem lies with ebay.

    Also, do understand that these sort of NDA's are somewhat common when dealing with potentially explosive matters like this. Certainly Sun is interested in keeping tight lips, but they also would prefer to announce a solution along with the problem. It is an engineering problem where the "more eyes on the problem" approach doesn't necessarily bring about the greatest good.

    --
    -- Solaris Central - http://w
    1. Re:The ebay issue is entirely separate by softsign · · Score: 3
      If my computer is crashing, one of the very first things I do is get on the 'net, check www.deja.com, and see if other people are having similar problems

      Yes, I see the parallel. You're having some trouble with Samba, post to your Linux newsgroup and within a few hours you may have a few people who've experienced the same problem offering a solution.

      There's just one problem. Your computer is not an Enterprise 10000. How many people do you know that have an E10000? And out of those hundreds, how many do you know that are identically configured?

      This isn't some run-of-the-mill, I-just-installed-RH6.2-from-the-ISO-and-can't-connect-to-the-Internet problem. When people have problems with a system like the E10000 they call the people who know the E10000 best: Sun.

      You aren't going to find many employed administrators who have a habit of disclosing detailed explanations of their E10000 troubles on Usenet, hoping to find some help from their competitors.

      The reality is, if you've got an issue with an E10000 that Sun can't help you with themselves, then there ain't nobody else that's going to help you fix it, either. An NDA is really kind of redundant and I suspect it's just a legal exercise more than anything else.

      --

    2. Re:The ebay issue is entirely separate by Thagg · · Score: 3
      >Also, do understand that these sort of NDA's are somewhat common
      >when dealing with potentially explosive matters like this.
      >Certainly Sun is interested in keeping tight lips, but
      >they also would prefer to announce a solution along
      >with the problem.

      I'm sorry, but this is total bullshit. This might have worked back in the pre-'net days, but fortunately those days are behind us. If my computer is crashing, one of the very first things I do is get on the 'net, check www.deja.com, and see if other people are having similar problems. It's good to know that other people are having similar problems! And, if anybody has gotten a satisfactory solution, you can demand that same solution. Keeping people in the dark is so neandertal I can't quite believe that anybody can defend it anymore.

      I'm sure that Firestone would have preferred that nobody talk about their tire problems, or that the makers of Rezulin would have preferred people not talk about those annoying liver-failure deaths that occurred, but it's just pigheaded of them. And, in these connected days, it will not stand.

      thad

      --
      I love Mondays. On a Monday, anything is possible.
  5. The Other Shoe Drops by empesey · · Score: 4

    Wasn't it Sun that was complaining about Microsoft forcing clients to sign all those agreements forbidding them to talk about some of Microsoft's practices? Granted, these are two separate issues, but now that Sun is having the issues, it's suddenly a different matter. Shut everyone up, and hope that no one finds out before we can remedy the problem or ship a new product.

    And why would you have to bribe people with a promise that you'll fix something quicker if they sign an NDA? That's an automatic red flag. I find it hard to believe that CEOs and other top brass fall for such nonsense. There must be more to the story than has been disclosed.

    --

  6. Why customers sign the NDA by echo8 · · Score: 3

    I certainly can't speak to why ALL the customers who signed the NDA did so. What I can speak to is why my company (a decently large telecom company which shall remain nameless) did so: Sun had a software patch that they felt might help alleviate the problem (I still can't reveal the details of what it does or why). In order to receive the software, we had to sign a non-disclosure agreement. It's that simple: we have a problem. If you want us to solve it, sign the paper. Otherwise, shut up and wait along with everyone else. As we had business-critical systems that were affected, it's not hard to understand why management did not hesitate to sign the NDA.

  7. Ebay Was Not a RAM Problem by sabat · · Score: 3

    Ebay's problem was that it was running new hardware (E10000) with a very old OS (Solaris 2.5, not even 2.5.1) and a version of Oracle that had documented problems with that version of Solaris.

    It had nothing to do with RAM, although I'm sure their former IT director would love to claim that.

    --
    I, for one, welcome our new Antichrist overlord.
    1. Re:Ebay Was Not a RAM Problem by Anonymous Coward · · Score: 3
      Nonsense... they are still having problems and have had constant hardware problems since they have been there. 4 hardware problems in the last few weeks. Get a UE10000 and watch it act like a yo-yo!

      User: aw@ebay.com

      Date: 08/09/00

      Time: 21:24:33 PDT

      *** TECH MESSAGE ***

      Recently we have experienced several issues that have impacted eBay's availability. We want to take a moment to update you about our situation and the things that we're doing to address the issues.

      First, over the last few weeks, we have been making a number of "headroom" improvements to the entire system to ensure the scalability of the site for the future. Normally, making these improvements should be invisible to you. Unfortunately, this was not the case.

      These changes resulted in availability issues with My eBay and Seller Search during high traffic periods. There were a number of fine-tunes that had to be made, as well as code issues that had to be addressed, to resolve this problem.

      We believe these issues have been resolved. To be sure, though, we will continue monitoring the system through a few more "prime times" (hours when traffic of the site is at its heaviest).

      Second, we have experienced three hardware failures in the last 10 days that have resulted in system downtimes, including the one tonight. During each failure, we have migrated to our backup system as quickly as possible to restore system availability.

      Later tonight and during our regularly scheduled maintenance on Friday morning, we plan to make additional improvements to the system to help address the hardware issues.

      System stability is still our number one priority. We appreciate your support.

      Regards, eBay

      User: aw@ebay.com

      Date: 08/09/00

      Time: 19:57:52 PDT

      *** SYSTEM STATUS ***

      The eBay system is currently available.

      At 19:15 PT, we experienced a hardware failure on our main server. We migrated to our backup system, and the site became available at 19:57 PT. Please accept our apologies.

      We will continue to carefully monitor the system and will inform you of any changes in its status.

      Regards, eBay

    2. Re:Ebay Was Not a RAM Problem by bigdogs · · Score: 3

      You're partially right. They did encounter some known Oracle problems, specifically with intimate shared memory.

      They were definitely *not* running 2.5; an E10K requires >=2.5.1. Their high profile outages last year were while running 2.6.

  8. Re:Very Interesting.. by barracg8 · · Score: 4
    • 'That sounds like a good deal, but I have a better one. I give you the finger, and you give me my phone call.'
    Trouble is, you would get about the same reaction Neo did.

    Think about it. There is nothing legally requiring Sun to deal with problems in the order that they are informed about them. There is nothing wrong with Sun implementing a high priority queue, of people who sign NDAs, and a low priority queue, of people who don't.

    So you face a decision: take the red pill, and you get your website back up and running. Take the blue pill, and Sun gets a bit of bad press, and you go bust.

    If you are someone like Ebay, it really comes down to that. You are your website, and you must sell your soul to keep it up 24/7 (or the best you can).

    Here's a little story:

    I know of a UK company who had a problem with Win95. It crashed every 49.7 (I think) days. So they went to M$ UK. They were told it would cost tens or hundreds of thousands of £ for M$ to look into the problem. M$ knew the company had no clout, and could not afford this, so they decided to fuck them.

    The company had some form of relationship to a larger US company, so they got them to take it to M$ in the US. This time, M$ insisted on the company signing a NDA. When they did so M$ admitted that this was a known flaw in '95. The clock didn't wrap nicely, so when you reach 2^32 milliseconds - 49.7 days (as I remember) Windows 95 (at least version A) crashes.
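    The 49.7 figure checks out if you do the arithmetic. A minimal Python sketch, assuming (as the story says) an unsigned 32-bit counter that ticks once per millisecond; the `tick` helper is hypothetical and only illustrates the wraparound, it is not Windows code:

```python
MS_PER_DAY = 24 * 60 * 60 * 1000   # milliseconds in one day
WRAP_MS = 2 ** 32                  # an unsigned 32-bit counter has this many values

wrap_days = WRAP_MS / MS_PER_DAY
print(f"counter wraps after {wrap_days:.2f} days")


def tick(counter_ms, elapsed_ms):
    """Advance a 32-bit millisecond counter, wrapping the way hardware would."""
    return (counter_ms + elapsed_ms) & 0xFFFFFFFF


# One millisecond before the wrap, then one more tick: back to zero.
print(tick(WRAP_MS - 1, 1))
```

    Any code that assumes the counter only ever grows will see time jump backwards at that point, which fits the failure described.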

    M$ has since admitted publicly.

    People like micros~1 and Sun have reputations to keep, and a great deal of power. When you are dependent on them for your business's survival, they can make you their bitches.

    Chalk it down on the 'List of Good Reasons to Use Opensource'.

    G

  9. Re:Very Interesting.. by Sangui5 · · Score: 4
    Think about it. There is nothing legally requiring Sun to deal with problems in the order that they are informed about them. There is nothing wrong with Sun implementing a high priority queue, of people who sign NDAs, and a low priority queue, of people who don't.

    Taken from the Sun website:

    (3) CUSTOMER-DEFINED PRIORITY AND RESPONSE TIME:

    When Customer's designated Contact calls for support assistance, Contact will assign a priority rating to the call: URGENT, SERIOUS, or NOT CRITICAL:

    URGENT (system unusable) - Live transfer of service request. Personnel arrive at the installation site within an average of two (2) hours of service request for on-site hardware support assistance.

    SERIOUS (system seriously impaired) - Callback within an average of two (2) hours of service request. Personnel arrive at the installation site within an average of one (1) business day for on-site hardware support assistance.

    NOT CRITICAL - Callback within an average of four (4) hours of service request. Personnel arrive at the installation site within an average of one (1) business day or at a later mutually convenient time for on-site hardware support assistance.
    ...
    (17) SYSTEM AVAILABILITY GUARANTEE: For properly configured, maintained and administered systems, Sun will commit to maintain certain levels of System Availability. System Availability Guarantees require a separate contract addendum which will contain the specific terms of the Guarantee.

    This is from the Platinum Warranty, which is standard with an E10K (what EBay runs). They have a contractual agreement with everybody that they sell such a standard configured E10K to: an average response time of two hours on urgent calls, and even on the most minor problems, within an average of one business day, if no other time is convenient.

    In addition, if your web site is that important to your business, you can have a separate system availability guarantee. If Sun has agreed to provide five 9's, then they get 5 minutes 15 seconds of downtime a year. Even if they only have to provide three 9's, that's still only ~ 8 hours downtime a year.
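    The downtime numbers above follow directly from the percentages. A small Python sketch, assuming a 365-day year and that "N nines" means an availability of 1 - 10^-N; the function name is made up for illustration:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # assumes a non-leap year


def downtime_budget_seconds(nines):
    """Allowed downtime per year for an availability of 'nines' nines."""
    unavailability = 10.0 ** (-nines)   # e.g. 5 nines -> 0.00001
    return unavailability * SECONDS_PER_YEAR


for n in (3, 4, 5):
    secs = downtime_budget_seconds(n)
    print(f"{n} nines: {secs / 60:.2f} minutes ({secs / 3600:.2f} hours) per year")
```

    Five nines comes out at about 315 seconds (5 minutes 15 seconds), and three nines at about 8.76 hours.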

    Sun makes their money by providing very reliable hardware, guaranteeing obscene quantities of uptime, charging an arm and a leg, and then delivering on all of their promises. If they don't deliver, then they will get their asses handed to them in a breach of contract lawsuit. If people agreed to an NDA, it was either Sun doing a very good job of talking fast, or promising better service than what they had contracted for. Any business which had to sign that NDA in order to stay afloat should have invested the extra money in a better warranty agreement, because if your web site is that important to you, you should spend the extra cash to get your uptime guaranteed and contracted.

    Business types don't really mind really expensive hardware/service agreements. Those are nice, fixed, predictable costs, especially if you have contracted with a reliable vendor (Sun). What they hate is having to lay out a bunch of money that they didn't plan for, because something unpredictable went wrong, and they didn't have their risks hedged. Hedging other people's risks is Sun's bread and butter.
  10. The game of misinformation and misplaced advocacy by Kysh · · Score: 4

    As a sr solaris sysadmin, who has worked on Sun boxes for years, I have /nothing/ but praises for Sun service and support. Sun QA is top-notch, in comparison to the rest of the tech industry. I got my start in Linux, and still use it a great deal. At home, all but three of my boxes run Linux, including several PCs and a Sun 670MP. I also use various BSDs. Pretty much, so long as it's Unix, it's ok by me.
    Bearing this in mind, realize that I am capable of objective, honest review.
    Sun has done more for the free software community than anyone therein seems to want to acknowledge, even though they are threatened by Linux. They are a large company, and do have their share of corporatism, but they also get an unfairly bad rap in the Linux community, for reasons I do not comprehend. Sun hardware has always been the industry standard for rock-solid reliability, and IO bandwidth. They never have been the blazing speed machines.
    Going back to Ebay, where people were asking whether this was a problem with the cache (It is not a RAM issue, but an issue with the cache on the 400Mhz UltraSparc II processors, and I have /never/ seen it outside of 2x400 configuration in an Ultra II).. It wasn't. Ebay was a victim of bad sysadmins. Perhaps they were very good sysadmins, who had no idea of what to do with an E10k. Perhaps management made the decision for them. (This happens with eerie regularity)
    The fact of the matter is, the E10k is not a 'super-processing-power' box. It's an 'IO pumping, high-availability' box. The sysadmins at Ebay had the E10k running flat out, not partitioned (As they're meant to be run) in quadrants. They grew so fast that they put the other E10k into production in the same fashion, instead of using it as a hot standby. Each E10k was a single point of failure, with the ability to be multiply redundant internally removed. A single problem with an OS that wasn't even officially supported on the E10k running at an invalid patchlevel caused a very highly publicised downtime. Instead of blaming bad setup (Which would be disastrous for investor relations), Ebay blamed Sun.

    As to the latter part of this article, I know nothing about Sun covering up that problem, (Which I have seen before), but don't deny that Sun, being a big corporation, might do such things, as all corporations are wont to do, even the ones very popular in the Linux community. Usually that problem manifests itself in the system log long before any problem is ever seen. This problem is also listed on Sunsolve.
    Sunsolve is one of the most open policies I've ever seen to system-related issues. The only group of people that even come close to that level of support is Debian.

    While I know this was rather long-winded and might generate lots of flames, I do mean it. Don't bash Sun summarily, and don't bash Sun on QA. It's like talking about raising "Serious questions about Honda QA" if Honda issued a recall for defective OEM tires (A year after the vehicles with those tires were issued). Almost nobody would think to bash Honda QA over a single issue. Sun may have had a few quality issues from time to time, but so does everyone. And at least Sun is actually saying something, unlike companies that deny forever.

    Why bash Sun, and not Intel - Another /. headline for today.

    -Kysh

    --
    --=:: Wings and tail and snout and scales of blackest night ::=- A dragon stands be
  11. Facts... FACTS please! by Anonymous Coward · · Score: 4

    I'm not representing Sun in this post-- just the facts. Get the facts straight:

    It affected very few customers. Sun has bent over backwards for customers to fix this. This included free consulting services to some sites, and Sun did not use NDAs to gag customers on this. The NDAs were for disclosure of strategic planning regarding future systems/products and service offerings. How can a company keep customers that have had problems without being friendly to them?

    It's an issue that only affects 400MHz processors with 8MB of cache.

    It's an issue w/ CACHE on a CPU NOT system memory.

    The problems usually crop up in systems in poorly maintained data centers. This includes centers with large temperature fluctuations, poor voltage regulation, poor humidity controls, and improper grounding. "User error" and "misuse" exacerbate the problem.

    The problems are limited to a particular production run of CPUs. New CPUs don't have this problem.

    Sun hasn't denied a problem. Sun hasn't bragged about the problem, either. Would you?

  12. Re:Yes but show me a computer that has the followi by Miguelito · · Score: 4

    Show me a PC that still costs that and has:
    1. Built in ability to boot off the net or _any_ other device you want, and can be set to default to that.
    2. Serial console ability out of the box.
    3. Massive online support center (sunsolve) from the vendor.
    4. If hardware is made for the system, it _will_ work, period (I've run into plenty of PC hardware that doesn't play nice on some mobos or in combination with some other cards).

    Sun hardware is expensive, and support is costly, but you damn well get what you pay for. We can get Sun here within an hour for critical issues, and the next day at the latest for non-critical. How's your local vendor and/or manufacturer on similar issues with PC hardware? I've had PC stuff fail and it takes forever to get replacements (unless the store's open, they have it in stock, and will let you return it). It's usually faster to just buy a new part.

    BTW to your #3: Which OS is native for PCs? DOS? Windows? Linux? Personally I prefer Linux, but PCs weren't designed for any specific OS. Sun hardware was. Linux runs nicely on it though. :)

    Oh yeah, can you run 64 bit on that PC? Didn't think so.

    --
    - My favorite error message: xscreensaver, running on an old Sparc 5 w/ 8bit color: bsod: Couldn't allocate color Blue
  13. Sun becoming Microsoft? by sterno · · Score: 4
    Perhaps Sun believed that the recent strong growth of Windows as a server platform meant that customers liked having sporadic reboots.

    ---

    --
    This sig has been temporarily disconnected or is no longer in service
  14. Re:Do they actually have any effect? by TheGratefulNet · · Score: 3
    I was told by one of our hardware engrs that only very high density ram (256meg or more) really needs ECC support.

    then again, each year ram density increases so I'm not sure which density is considered safe and won't need ECC.

    for boxes that will stay up a week or more, I tend to buy ECC 'just because'. it's not much more expensive and doesn't slow things down enough to care about (even though gamers and o/c'ers will disagree).
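    for anyone wondering what the extra bits buy you, here is a minimal Python sketch of plain even parity, the simplest form of error detection. this is an illustration only, not how any particular memory controller or CPU cache works: one parity bit can detect a single flipped bit but cannot say which bit flipped, and it misses double flips. real ECC adds more check bits so that single-bit errors can also be corrected. the function names are made up.

```python
def even_parity_bit(value):
    """Parity bit that makes the total number of 1 bits even."""
    return bin(value).count("1") % 2


def store(value):
    """Pretend to write a word to memory together with its parity bit."""
    return value, even_parity_bit(value)


def check(value, parity):
    """True if the stored parity still matches the data."""
    return even_parity_bit(value) == parity


word, parity = store(0b10110010)

one_flip = word ^ (1 << 3)                # a single bit flips in storage
two_flips = word ^ (1 << 3) ^ (1 << 5)    # two bits flip

print(check(word, parity))       # data intact
print(check(one_flip, parity))   # single-bit error is detected
print(check(two_flips, parity))  # double-bit error slips through
```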

    --
    "It is now safe to switch off your computer."
  15. Nothing secret about it by devphil · · Score: 5

    The existence of a problem with the 400MHz CPU with a 4 or 8MB cache has been well known on Usenet groups and a couple of online magazines. Sun engineers posting to the discussions say, "Yes, there's a problem. We think it might be foo, bar, or baz. Try the following steps..."

    I don't think I've ever heard or read anything about Sun denying a funky problem in those chips. They may still be looking for the precise cause, but every time the issue comes up, somebody from Sun generally admits to it.

    Dunno where the Gartner Group gets its figures from.

    --
    You cannot apply a technological solution to a sociological problem. (Edwards' Law)