Slashdot Mirror


Google Finds DRAM Errors More Common Than Believed

An anonymous reader writes "A Google study of DRAM errors in their data centers found that they are hundreds to thousands of times more common than has been previously believed. Hard errors may be the most common failure type. The DIMMs themselves appear to be of good quality, and bad mobo design may be the biggest problem." Here is the study (PDF), which Google engineers published with a researcher from the University of Toronto.

333 comments

  1. Percentage? by Runaway1956 · · Score: 4, Interesting

    "a mean of 3,751 correctable errors per DIMM per year."

    I'm much too lazy to do the math. Let's round up - 4k errors per year has to be a vanishingly small percentage for a system that is up 24/7/365, or 5 nines. The fact that these DIMMs were "stressed" makes me wonder about the validity of the test. Heat stress, among other things, will multiply errors far beyond what you will see in normal service.

    --
    "Windows is like the faint smell of piss in a subway: it's there, and there's nothing you can do about it." - Charlie Br
    1. Re:Percentage? by sopssa · · Score: 0

      Well, hasn't Google always used thousands and thousands of normal PCs in their server farms instead of powerful, actual premium server-grade hardware?

      Not really a surprise that they tend to break more.

    2. Re:Percentage? by gspear · · Score: 5, Informative
      From the study's abstract:

      "We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account."

    3. Re:Percentage? by CAIMLAS · · Score: 1

      Add to that the fact that Google (apparently) tends to run their data centers "hot" compared to what is commonly accepted, and use significantly cheaper components, and you've got a good explanation for why their error count is as high as it is.

      --
      ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
    4. Re:Percentage? by Runaway1956 · · Score: 5, Informative

      No, I don't believe so. They use server boards, custom made to their specs. And, I'm pretty sure that those specs include ECC memory - that is the standard for servers, after all. http://news.cnet.com/8301-1001_3-10209580-92.html If you're really interested, that story gives you a starting point to google from.

      --
      "Windows is like the faint smell of piss in a subway: it's there, and there's nothing you can do about it." - Charlie Br
    5. Re:Percentage? by Aneurysm · · Score: 1

      I'm much too lazy to do the math. Let's round up - 4k errors per year has to be a vanishingly small percentage for a system that is up 24/7/365, or 5 nines. The fact that these DIMMs were "stressed" makes me wonder about the validity of the test. Heat stress, among other things, will multiply errors far beyond what you will see in normal service.

      Except it depends on how the modules were originally tested. The study is saying that they break more than previously thought, rather than that they break a lot. If they were originally tested in a stressed system similar to Google's and Google is finding that they have far more errors than they should, then their study is still valid.

    6. Re:Percentage? by Tumbleweed · · Score: 4, Insightful

      Add to that the fact that Google (apparently) tends to run their data centers "hot" compared to what is commonly accepted, and use significantly cheaper components, and you've got a good explanation for why their error count is as high as it is.

      Yeah, but let's look at the more common situation - a home. Variable temperatures, most likely QUITE variable power quality, low-quality PSU, and almost certainly no UPS to make up for it. Add that to low-quality commodity components (mobo & RAM).

      I'd not be surprised to find the problem much more prevalent in non-datacenter environments.

      Switching to high-quality memory, PSU & UPS has made my systems unbelievably reliable the last several years. YMMV, but I doubt by much.

    7. Re:Percentage? by Anonymous Coward · · Score: 0

      The author has 12 gigs on his mac...

    8. Re:Percentage? by Red+Flayer · · Score: 2, Interesting
      Humorous ordering of replies to this article.

      Your post:

      Add to that the fact that Google (apparently) tends to run their data centers "hot" compared to what is commonly accepted, and use significantly cheaper components, and you've got a good explanation for why their error count is as high as it is.

      Post before yours:

      From the study's abstract:
      "We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account."

      The 'components bit' of your post may be spot-on, but the juxtaposition of your temperature claim, along with the previous poster's quoting of the abstract FTA, is funny (to me, anyway).

      --
      "Trolls they were, but filled with the evil will of their master: a fell race..." -- J.R.R. Tolkien on Olog-hai
    9. Re:Percentage? by jasonwc · · Score: 4, Informative

      The article suggests that errors are less likely on systems with few DIMMs and on those which are less heavily used, and that there was no significant difference among types of RAM or vendors, at least with regard to ECC RAM. Thus, laptop and desktop users, who likely only have 2 or 3 DIMMs and make only casual use of their systems, have a lower risk of errors. Non-ECC RAM may in general be of lower quality than ECC RAM, and thus more prone to error, but its usage is also less mission-critical. In addition, ECC RAM is usually used in systems with many DIMMs that are run 24/7/365.

      Good news
      The study had several findings that are good news for consumers:

              * Temperature plays little role in errors - just as Google found with disk drives - so heroic cooling isn't necessary.
              * The problem isn't getting worse. The latest, most dense generations of DRAM perform as well, error wise, as previous generations.
              * Heavily used systems have more errors - meaning casual users have less to worry about.
              * No significant differences between vendors or DIMM types (DDR1, DDR2 or FB-DIMM). You can buy on price - at least for the ECC-type DIMMS they investigated.
              * Only 8% of DIMMs had errors per year on average. Fewer DIMMs = fewer error problems - good news for users of smaller systems.

    10. Re:Percentage? by The+Archon+V2.0 · · Score: 2, Insightful

      No, I don't believe so. They use server boards, custom made to their specs.

      I suppose it depends on how you define "server board". Room for tons of ECC RAM and two CPUs is server or serious-workstation class (or maybe I-just-use-Notepad-and-my-sales-guy-is-on-commission class), but I think once you're on to custom boards that only use certain voltages of electricity, you've moved into a class by yourself.

      And, I'm pretty sure that those specs include ECC memory - that is the standard for servers, after all.

      Section 7: "All DIMMs were equipped with error correcting logic (ECC) to correct at least single bit errors."

      So, yes, it's ECC.

    11. Re:Percentage? by HornWumpus · · Score: 2, Interesting

      IIRC ECC ram has extra bits and hardware to fix any single bit error and record that it happened.

      Regular ram only has parity which can tell the MB the data is suspect but not which bit flipped. Kernel panic, Blue Screen, Guru Meditation# whatever.

      It's the same RAM, just arranged differently on the DIMM.

      I once had a dual Pentium PRO that required ECC RAM. BIOS recorded 0 ECC errors in the three years or so that it was my primary machine. Which is what the Google study would lead me to expect.

      --
      John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'
    12. Re:Percentage? by silent_artichoke · · Score: 5, Funny

      You know, maybe googling it isn't the best idea in this case. Memory errors and all...

    13. Re:Percentage? by skirtsteak_asshat · · Score: 2, Interesting

      Well, consider that they had a board CUSTOM MADE for them, which means custom BIOS fitments, custom feature implementations, custom BUGS. Then add the reality that is DRAM - an imperfect 'art' form of data storage and retrieval. No two chips are EXACTLY the same... though very close. Manufacturing defects may not manifest themselves under normal conditions, and require heating/cooling cycles or fluctuating voltages to break down. Running ECC performs a basic parity check, nothing more, and it's still possible to pass bad bits with ECC enabled, just much less likely. The idea is that you can't really test subcomponents individually, have them check out, and then assemble a system and expect it to just 'work'. Some RAM is pretty damn finicky. Standards are anything but.

    14. Re:Percentage? by SnarfQuest · · Score: 1

      To make it easier to comprehend, 3751 per DIMM per year means that you are getting about 10 errors per day per memory stick. Most machines have at least 2 sticks, so that is 20 errors per day. Since you probably don't have error correction built into your machine, that means those 20 errors actually cause something wrong to happen in your machine. You can hope it's causing the screensaver problems, but it can be doing something very bad to you.
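
      The per-day figures above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the 3,751 figure is the mean quoted from the study, the two-stick machine is the commenter's example, and a mean says nothing about how the errors are spread across DIMMs.

```python
# Mean correctable errors per DIMM per year, as quoted in the thread.
ERRORS_PER_DIMM_PER_YEAR = 3751
DAYS_PER_YEAR = 365


def errors_per_day(dimms: int) -> float:
    """Mean correctable errors per day for a machine with `dimms` sticks."""
    return dimms * ERRORS_PER_DIMM_PER_YEAR / DAYS_PER_YEAR


if __name__ == "__main__":
    for n in (1, 2):
        print(f"{n} DIMM(s): {errors_per_day(n):.1f} errors/day")
```

      So "about 10" and "about 20" per day are fair roundings of the mean.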

      --
      Who would win this election: Andrew Weiner vs Andrew Weiner's weiner.
    15. Re:Percentage? by vadim_t · · Score: 1

      * Temperature plays little role in errors - just as Google found with disk drives - so heroic cooling isn't necessary.

      Talk about a misunderstanding.

      First, the paper on hard drives did show that temperature was important. It did show though that too cold is worse than too hot. Also, the data wasn't perfect. Google doesn't have a whole lot of drives running at strange temperatures, since they're a datacenter. A consumer though, might well run a drive at 60C in a badly cooled desktop or laptop, and there's not a single datapoint on Google's graph for that.

      In my experience, a drive cooled by a case intake fan runs at about 35C, which comes up as just perfect on Google's graph.

      The memory paper finds an even bigger effect:

      Figure 7 (left) shows that for all platforms higher temperatures are correlated with higher correctable error rates. In fact, for most platforms the correctable error rate increases by a factor of 3 or more when moving from the lowest to the highest temperature decile (corresponding to an increase in temperature by around 20C for Platforms B, C and D and an increase by slightly more than 10C for Platform A).

      I believe 3x more errors is pretty damn significant, unless you want to adhere to the idea that a very rare event happening 3 times as often is still very rare, relatively speaking.

      But "a mean of 3,751 correctable errors per DIMM per year." sounds rather big to me. Sure, it's a tiny part of 4GB of RAM. But a single bit wrong in exactly the right place could result in things like very unpleasant disk corruption, and most FSes won't like that because they're not designed to compensate for random disk corruption (yeah, I know about ZFS and its checksums, but not everybody runs it)

    16. Re:Percentage? by ByOhTek · · Score: 1

      Switching to high-quality memory, PSU & UPS has made my systems unbelievably reliable the last several years. YMMV, but I doubt by much.

      I'll second this. Once or twice I skimped on mobo or memory in a pinch, and those have been the only machines of mine to have stability issues post Windows 98. (Even in Windows 98 I could get about 3 weeks of uptime before needing a reboot. It sucked, but it wasn't as bad as some people had to deal with).

      --
      Self proclaimed typo king, and inventor of the bear destroying coffee table (patent not pending).
    17. Re:Percentage? by DigiShaman · · Score: 0, Redundant

      Parity is different than ECC. Parity allows the system to detect, but *NOT* correct errors. ECC however, detects and corrects errors. Unless specified, all consumer desktops and laptops contain standard memory (non parity or ECC).
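
      The "detect but not correct" part can be seen in a small Python sketch. The function names and the 8-bit test word are made up for illustration; real parity memory does this in hardware, one parity bit per stored byte.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2


def looks_intact(word: int, stored_parity: int) -> bool:
    """True if the word still matches the parity bit stored alongside it."""
    return parity_bit(word) == stored_parity


word = 0b10110010                # hypothetical data byte
stored = parity_bit(word)        # parity saved when the byte was written

one_flip = word ^ (1 << 3)              # a single bit flips in memory
two_flips = word ^ (1 << 3) ^ (1 << 5)  # two bits flip

if __name__ == "__main__":
    print("single flip detected:", not looks_intact(one_flip, stored))
    print("double flip detected:", not looks_intact(two_flips, stored))
```

      A single flip is caught, but the parity bit gives no hint about which bit changed, so nothing can be repaired. An even number of flips passes the check unnoticed.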

      --
      Life is not for the lazy.
    18. Re:Percentage? by antifoidulus · · Score: 0, Flamebait

      Um, thats precisely what the GP said....so yeah, you are pretty representative of the intelligence level of Rush Limbaugh listener.

    19. Re:Percentage? by R2.0 · · Score: 1

      I believe 3x more errors is pretty damn significant, unless you want to adhere to the idea that a very rare event happening 3 times as often is still very rare, relatively speaking.

      I believe it depends on scale. If I buy 3 Lotto tickets instead of one, my odds of winning are 3x as much, or 200% larger. But I don't believe anyone would see a reduction from 1:195,249,054 to 1:65,083,018 as "significant" - for all practical purposes, your odds are still "1:a really big number, so don't buy that boat quite yet".
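
      The relative-versus-absolute point can be put into numbers. A small Python sketch using the odds quoted above; treating three tickets as exactly three times the single-ticket probability assumes the tickets carry different numbers.

```python
from fractions import Fraction

# Single-ticket odds as quoted in the comment above.
ONE_TICKET = Fraction(1, 195_249_054)


def win_probability(tickets: int) -> Fraction:
    """Chance of winning with `tickets` distinct tickets."""
    return tickets * ONE_TICKET


three = win_probability(3)
relative_increase = (three - ONE_TICKET) / ONE_TICKET  # increase as a ratio
absolute_increase = three - ONE_TICKET                 # increase in probability

if __name__ == "__main__":
    print("three tickets: 1 in", three.denominator)
    print("relative increase:", f"{float(relative_increase):.0%}")
    print("absolute increase:", f"{float(absolute_increase):.2e}")
```

      The relative increase is large while the absolute increase stays tiny, which is the distinction being argued about for the 3x error rate.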

      --
      "As God is my witness, I thought turkeys could fly." A. Carlson
    20. Re:Percentage? by The+Archon+V2.0 · · Score: 3, Insightful

      "a mean of 3,751 correctable errors per DIMM per year."

      Hey, the ECC did its job! Let's all go home.

      I'm much too lazy to do the math.

      I tried, based on the abstract. Wound up getting a figure of 8% of 2 gigabyte systems having 10 RAM failures per hour and the other 92% being just peachy. While a few bits going south is AFAIK the most common failure state for RAM, some of those RAM sticks must be complete no-POST duds and some are errors-up-the-wazoo massive swaths of RAM corrupted, so that throws my back of the envelope math WAY off....

      In other words, big numbers make Gronk head hurt. Gronk go make fire. Gronk go make boat. Gronk go make fire-in-a-boat. Gronk no happy with fire-in-a-boat. Boat no work, and fire no work, all at same time.

      Sorry, lost my thread there. So yeah, complex numbers, hard math, random assumptions that bugger our conclusions and maybe bugger theirs.

      The fact that these DIMMs were "stressed" makes me wonder about the validity of the test. Heat stress, among other things, will multiply errors far beyond what you will see in normal service.

      The problem with something like this is the assumption that Google world == real world.

      This RAM is all running on custom Google boards that no one else has access to, with custom power supplies in custom cases in custom storage units. To the researchers' credit, they split things by platform later on, but that just means Google-custom-jobbie-1 and Google-custom-jobbie-2, not Intel board/Asus board/Gigabyte board. Without listing the platforms down to chipsets and CPU types (not gonna happen), it's hard to compare data and check methodology.

      While Google is the only place you're going to find literal metric tons of RAM to play with, the common factor that it's all Google might be throwing the numbers. At least some confirmation that these numbers hold at someone else's data center would be nice.

      But then, I didn't RTWholeFA, so maybe I missed something.

    21. Re:Percentage? by poetmatt · · Score: 3, Informative

      uh, article showed that temperature has nothing to do with it.

      the rest is accurate.

    22. Re:Percentage? by clarkn0va · · Score: 1

      It says in the article that the study found temperature not to be a factor.

      --
      I am literally 3000 tokens away from the chaotic crossbow --Stephen
    23. Re:Percentage? by Anonymous Coward · · Score: 0, Interesting

      the mobo's used by google are the cheapest boards they can get made. There is NO testing until they hit the datacenter floor. Crap mobo plus poor environment (high heat and vibration + poor power controls) makes for a high failure rate. ECC ram has an odd number of memory chips. The odd chip allows for the parity ram. Google memory has even chip counts since non-ECC ram is much MUCH less expensive. So the bios is custom and carves out ECC function from non-ECC ram

    24. Re:Percentage? by poetmatt · · Score: 1

      eh? I was following this thread and I misread and followed the route of digishaman as well. I'm not defending him, just sometimes people fail to read properly when multitasking, myself included.

    25. Re:Percentage? by DigiShaman · · Score: 1

      Ok, first of all. You are a retard!!!

      Second. Standard memory is ***NOT*** the same thing as parity!

      What part of my previous post did you not get?

      --
      Life is not for the lazy.
    26. Re:Percentage? by evilviper · · Score: 1

      Well, hasn't Google always used thousands and thousands of normal PCs in their server farms instead of powerful, actual premium server-grade hardware?

      No, Google has always used servers. The trademark of Google, which you're misquoting, is the fact that they use clusters of x86 hardware, rather than big iron (mainframes).

      Compared to proprietary hardware, x86 servers are dirt cheap.

      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
    27. Re:Percentage? by vadim_t · · Score: 1

      At 3,751 errors per DIMM/year it means that a system with 2 sticks (very common for dual channel) is getting 20 bits flipped per day. The question then is how long will it take for that to screw up something important.

      Since a modern machine has plenty RAM for disk cache, and in many workloads most memory would be dedicated to that, this would easily mean that every day some software operates on data that's not exactly what was on disk, and if you write any significant amount of data back, it's quite possible you're writing the wrong thing as well.

      Since the data on disk persists, this means that your data is getting constantly more screwed up.

    28. Re:Percentage? by Anonymous Coward · · Score: 0

      GP said "IIRC ECC has extra bits and hardware.." and then "It's the same RAM, just arranged differently on the DIMM.". If it is the same then why does it have extra bits and hardware? It appears as though the GP is contradicting themselves. Sounds to me like you are pretty representative of the intelligence level of a Glenn Beck listener.

    29. Re:Percentage? by kimvette · · Score: 1

      Google's big surprise: each server has its own 12-volt battery to supply power if there's a problem with the main source of electricity. The company also revealed for the first time that since 2005, its data centers have been composed of standard shipping containers--each with 1,160 servers and a power consumption that can reach 250 kilowatts.

      I've actually been looking for a 12V power supply for a while. I wonder if they use power supplies off the shelf or if they are custom-manufactured just for Google?

      --
      The Christian Right is Neither (Christian nor right). See: Matthew 23, Matthew 25, Ezekiel 16:48-50
    30. Re:Percentage? by Runaway1956 · · Score: 1

      Yes, I saw that, and it was also pointed out earlier in this discussion. I, for one, am not willing to accept that statement. It should be noted that a lot of "assumptions" were made in this study, and that those assumptions are referred to throughout the TFA and the PDF. Of all the hardware errors I've ever dealt with, heat was the most common problem.

      --
      "Windows is like the faint smell of piss in a subway: it's there, and there's nothing you can do about it." - Charlie Br
    31. Re:Percentage? by Anonymous Coward · · Score: 0

      Who is Glenn Beck?

    32. Re:Percentage? by mcgrew · · Score: 1

      You know, maybe googling it isn't the best idea in this case. Memory errors and all...

      I was going to debunk that but I forgot what was on my mind. Damn dimms!

    33. Re:Percentage? by Extide · · Score: 1

      It's the same RAM chips, but there are just 9 chips on each stick, instead of 8 (or 18 instead of 16). It works just like raid 5.

      --
      Technophile
    34. Re:Percentage? by phantomcircuit · · Score: 2, Insightful

      Yeah, but let's look at the more common situation - a home. Variable temperatures, most likely QUITE variable power quality, low-quality PSU, and almost certainly no UPS to make up for it. Add that to low-quality commodity components (mobo & RAM).

      The vast majority of people have laptops now, which come with a built-in UPS.

    35. Re:Percentage? by osu-neko · · Score: 3, Informative

      ... Running ECC performs a basic parity check, nothing more...

      Not exactly...

      --
      "Convictions are more dangerous enemies of truth than lies."
    36. Re:Percentage? by Tumbleweed · · Score: 1

      The vast majority of people have laptops now, which come with a built-in UPS.

      I doubt the battery system of a laptop does any undervoltage or power spike protection. A UPS is more than a battery.

    37. Re:Percentage? by Lonewolf666 · · Score: 1

      Seconded - my private PC runs very reliably with a quality PSU and ECC RAM. It does not have a UPS but the power grid is quite stable here in Germany.

      --
      C - the footgun of programming languages
    38. Re:Percentage? by phantomcircuit · · Score: 2, Informative

      UPS - Uninterruptible Power Supply

      Now many UPSs also include a Power Conditioner, but a UPS is not a power conditioner.

    39. Re:Percentage? by drsmithy · · Score: 1

      Temperature plays little role in errors - just as Google found with disk drives [...]

      That's not what Google found at all. They found that in the temperature range typically seen in an air-conditioned datacentre, temperature is not a major influence on failure rates. Their data shows that once the temperature rises above about 40 degrees C, failure rates start to increase. 40 degrees is pretty typical for the average home PC, and downright cool in cramped cases like iMacs.

    40. Re:Percentage? by rwa2 · · Score: 1
      "I heard it from Slashdot", but I think the deal was that Google used "commodity" server hardware that they spec'd and cobbled together on their own, instead of just buying into established premium-grade servers like IBM Bladecenters or HP Proliants or Dell PowerEdge or something like that.

      Actually, I thought I had heard that they build their clusters using SuperMicro boxes (which are integrated and sold by a variety of distributors), but I can't find anything to back that up now. But yeah, black box commodity servers.

    41. Re:Percentage? by Tumbleweed · · Score: 1

      Now many UPSs also include a Power Conditioner, but a UPS is not a power conditioner.

      True, but the power conditioning is what's going to improve the life of your system, most likely, not the battery backup.

    42. Re:Percentage? by Anonymous Coward · · Score: 0

      >>ou are pretty representative of the intelligence level of Rush Limbaugh listener.

      If you're going to disparage the intelligence of others, at least make an attempt at being grammatically correct.

    43. Re:Percentage? by Fulcrum+of+Evil · · Score: 1

      Room for tons of ECC RAM and two CPUs is server or serious-workstation class (or maybe I-just-use-Notepad-and-my-sales-guy-is-on-commission class), but I think once you're on to custom boards that only use certain voltages of electricity, you've moved into a class by yourself.

      He probably means that the boxes are made to spec. Google isn't stupid enough to go with custom mobos for what amounts to generic grunt clusters.

      --
      "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
    44. Re:Percentage? by Yvan256 · · Score: 1
    45. Re:Percentage? by kimvette · · Score: 1

      Scratch that, they're not doing what I thought. I went and RTFA now that I have a few minutes and see there is no separate power supply outside of the 12VDC feed. Darn.

      --
      The Christian Right is Neither (Christian nor right). See: Matthew 23, Matthew 25, Ezekiel 16:48-50
    46. Re:Percentage? by drsmithy · · Score: 1

      "We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account."

      What temperature range does "the field" encompass, as opposed to "lab conditions" ?

      They found a similar result with hard disks, but their data pretty much finishes at around 40 degrees, roughly where the typical desktop PC's drive is starting.

    47. Re:Percentage? by vtcodger · · Score: 1

      ***Regular ram only has parity***

      Commodity DRAM hasn't had parity since the early 1990s when DRAM was selling for $100 a Megabyte. Microsoft -- which was trying to sell its memory hungry Windows OS -- pushed for the removal of parity in order to reduce DRAM prices, claiming (probably incorrectly) that DRAM failures were no longer a significant problem. I wished at the time, and still wish, they hadn't done that. Up to that point, Microsoft's record was actually pretty consumer friendly. No more, regrettably. Although they are still pretty mellow compared to IBM in the mainframe arena in the 1960s and 1970s.

      =====================

      As I understand it, ECC is not exactly the same as parity. It is a set of overlapping parity bits cleverly designed such that for single bit failures, the hardware can look at which parity bits have failed, figure out which data bit is causing the failures, and reverse it.
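
      The overlapping-parity idea can be shown with the textbook Hamming(7,4) code. This is a toy Python sketch, not what a DIMM actually runs: server ECC typically protects 64 data bits with 8 check bits and can also detect double-bit errors, and the function names here are made up for illustration.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7).

    Parity bits sit at positions 1, 2 and 4; data bits at 3, 5, 6 and 7.
    """
    c = [0] * 8                      # index 0 unused so indices match positions
    c[3], c[5], c[6], c[7] = data
    c[1] = c[3] ^ c[5] ^ c[7]        # covers positions 1, 3, 5, 7
    c[2] = c[3] ^ c[6] ^ c[7]        # covers positions 2, 3, 6, 7
    c[4] = c[5] ^ c[6] ^ c[7]        # covers positions 4, 5, 6, 7
    return c[1:]


def hamming74_correct(codeword):
    """Return (data bits, error position); position 0 means no error seen."""
    c = [0] + list(codeword)
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    position = s1 + 2 * s2 + 4 * s4  # the failed checks spell out the bad position
    if position:
        c[position] ^= 1             # reverse the faulty bit
    return [c[3], c[5], c[6], c[7]], position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1                     # flip the bit at position 5
    print(hamming74_correct(word))
```

      Each position is covered by a different combination of the three parity checks, so the pattern of failed checks identifies the flipped bit directly.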

      --
      You can't see ANYTHING from a car, You've got to get out of the goddamned contraption and walk...Edward Abbey
    48. Re:Percentage? by kimvette · · Score: 1

      Google has patents on the built-in battery design, "but I think we'd be willing to license them to vendors," Hoelzle said.

      Oh, so it's $FOO, but in a server.

      Running computers on batteries? It got a patent?

      I think there is a good bit of prior art if only one knows where to look.

      I mean, really. This is a good idea, and it's about darn time a large-form-factor motherboard running on low voltage is available, but IMHO this should not be patentable. It's simply designing around a low-voltage input.

      --
      The Christian Right is Neither (Christian nor right). See: Matthew 23, Matthew 25, Ezekiel 16:48-50
    49. Re:Percentage? by conureman · · Score: 1

      It varies from town to town here in the U.S. I've always been fortunate to live in good power areas (and Los Angeles used to give us 90 p.s.i. water pressure!) But when we move to our retirement house, I'm gonna need a power conditioner. The lights dimmed several times when I was re-painting it recently, and went off once for a few minutes; the tenants said it happens ALL the time. I always get high-end RAM and PSUs, I've seen others suffer for the lack.

      --
      The cost of that cleanup, of course, will be borne by taxpayers, not industry.
    50. Re:Percentage? by Carbaholic · · Score: 1

      Accelerated stress testing is common and covered in basic statistical reliability coursework. In Applied Reliability, Second Edition, Tobias and Trindade explain how to account for increased temperatures by using the Arrhenius Model.

      Physical acceleration models are pretty interesting....well at least for me and a few google engineers.
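
      For reference, the Arrhenius model gives an acceleration factor AF = exp((Ea / k) * (1/T_use - 1/T_stress)), with temperatures in kelvin. A minimal Python sketch follows; the 0.7 eV activation energy and the 40C/60C temperatures are placeholder assumptions, not values from the Google study or from the book.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def acceleration_factor(use_c: float, stress_c: float, ea_ev: float) -> float:
    """Arrhenius acceleration factor between a use and a stress temperature.

    use_c and stress_c are in degrees Celsius; ea_ev is the activation
    energy of the failure mechanism in electron-volts (an assumed input).
    """
    t_use = use_c + 273.15
    t_stress = stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use - 1.0 / t_stress))


if __name__ == "__main__":
    # Hypothetical example: parts used at 40C, stress-tested at 60C.
    print(f"AF = {acceleration_factor(40.0, 60.0, 0.7):.2f}")
```

      The result is very sensitive to the activation energy chosen, which is why it has to be established per failure mechanism rather than guessed.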

    51. Re:Percentage? by QuantumRiff · · Score: 1

      That's more than 10 errors per day... That is excessive, no matter the load they put on their servers, or how many DIMMs there are. And their memory loads aren't all that excessive in the day of 1U boxes holding 128GB of RAM for Virtual Machines...

      --

      What are we going to do tonight Brain?
    52. Re:Percentage? by PitaBred · · Score: 3, Informative

      Did you even read the article? They found that heat WAS NOT one of the factors. Which makes the rest of your statement seem like just as much bullshit.

    53. Re:Percentage? by Timothy+Brownawell · · Score: 1

      the mobo's used by google are the cheapest boards they can get made. There is NO testing until they hit the datacenter floor. Crap mobo plus poor environment (high heat and vibration + poor power controls) makes for a high failure rate. ECC ram has an odd number of memory chips. The odd chip allows for the parity ram. Google memory has even chip counts since non-ECC ram is much MUCH less expensive. So the bios is custom and carves out ECC function from non-ECC ram

      It sounds like you don't know what you're talking about. Would you care to show why we should believe you?

    54. Re:Percentage? by bzipitidoo · · Score: 1

      More like 20 errors per 24 hours of operation. Computers that do not run 24/7 will see fewer errors per day, of course.

      How likely is a bad bit to cause a serious problem? The bulk of RAM is used for data, not code. Data is read more often than written. Most of the time, the bad bit was in unused memory, or passes unnoticed as one wrong pixel in an image containing a million pixels, or didn't go bad before being used. Just taking a big guess here but I'm thinking about 1% of the bad bits cause a serious problem, rare enough compared to real software bugs that it's easy to lump it in. We'd notice our computers are flaky if it happened much more often than that. With an error rate almost as high as 1 per hour, bad bits can't be crashing the OS in more than about 0.001% of occurrences. Else we couldn't possibly get uptimes of many years.
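
      The uptime argument can be turned into a rough expected-value estimate. A Python sketch under the comment's own assumptions: 20 errors per 24 hours, with the 1% and 0.001% fractions taken as the guesses they are stated to be.

```python
ERRORS_PER_DAY = 20      # two DIMMs running 24/7, per the figures above
DAYS_PER_YEAR = 365


def events_per_year(errors_per_day: float, harmful_fraction: float) -> float:
    """Expected harmful events per year if a given fraction of errors bite."""
    return errors_per_day * DAYS_PER_YEAR * harmful_fraction


if __name__ == "__main__":
    serious = events_per_year(ERRORS_PER_DAY, 0.01)     # 1% guess
    crashes = events_per_year(ERRORS_PER_DAY, 0.00001)  # 0.001% guess
    print(f"serious problems per year: {serious:.0f}")
    print(f"OS crashes per year: {crashes:.3f}")
    print(f"mean years between crashes: {1 / crashes:.1f}")
```

      At 0.001% the expected gap between crashes comes out at over a decade, which is consistent with multi-year uptimes. The 1% figure would mean dozens of serious problems a year on a machine that never sleeps.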

      --
      Intellectual Property is a monopolistic, selfish, and defective concept. It is "tyranny over the mind of man"
    55. Re:Percentage? by Timothy+Brownawell · · Score: 1

      At 3,751 errors per DIMM/year it means that a system with 2 sticks (very common for dual channel) is getting 20 bits flipped per day.

      No, from the actual paper it looks like the errors are very "bunched up". So in a given day you're only, say, 10x as likely as thought to get a memory error, but when you do you'll get a whole bunch all at once. There's also the issue that most DIMMs had no errors.

      What it looks like to me, is that say 80-90% of DIMMs are "good" and will never see errors, most of the rest are "iffy" and overly susceptible to EMI or something (maybe the ones that barely made their speed bin?), and a few are just "bad" and will garble your data at the slightest excuse. Kinda makes me wonder if running at say 10% underclock might make the errors mostly go away.
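
      A toy population shows how a mean of 3,751 can coexist with most DIMMs seeing nothing. This Python sketch uses the 8% figure quoted earlier in the thread and assumes, purely for illustration, that every affected DIMM gets the same number of errors. The real distribution is not given here.

```python
import statistics

MEAN_ERRORS_PER_DIMM = 3751   # mean over all DIMMs, from the study
AFFECTED_FRACTION = 0.08      # share of DIMMs with any errors in a year

# If only the affected DIMMs account for the whole mean:
mean_among_affected = MEAN_ERRORS_PER_DIMM / AFFECTED_FRACTION

# Hypothetical population of 100 DIMMs: 92 clean, 8 carrying all the errors.
population = [0.0] * 92 + [mean_among_affected] * 8

if __name__ == "__main__":
    print("mean among affected DIMMs:", mean_among_affected)
    print("mean over all DIMMs:", statistics.mean(population))
    print("median over all DIMMs:", statistics.median(population))
```

      The median DIMM sees zero errors, so "20 bit flips a day on every two-stick machine" is not what the mean implies.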

    56. Re:Percentage? by Anonymous Coward · · Score: 0

      Actually, this error rate doesn't sound too bad. Let's say the memory bus is operating at a measly 400MHz for 24/7. That's about 1.26E+16 clock cycles per year. With an average of 3751 soft errors per year, that comes out to about 3 per 10 trillion (10E+12) clock cycles.

      Now, not all clock cycles will be memory reads, but the vast majority of them will be. So if even if the memory is only 1% utilized, that's still only 3 per 100 billion (100E+9) reads. That's 99.999 999 997% up-time/accuracy or whatever you want to call it.
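
      The per-cycle numbers above check out, as this Python sketch shows. The 400 MHz bus and the 1% utilization are the commenter's assumptions, and dividing errors by reads treats every error as tied to an access, which is a simplification.

```python
BUS_HZ = 400e6                       # assumed memory bus clock
SECONDS_PER_YEAR = 365 * 24 * 3600
ERRORS_PER_YEAR = 3751               # mean correctable errors per DIMM per year

cycles_per_year = BUS_HZ * SECONDS_PER_YEAR


def errors_per_read(utilization: float) -> float:
    """Errors per memory read if `utilization` of all cycles are reads."""
    return ERRORS_PER_YEAR / (cycles_per_year * utilization)


if __name__ == "__main__":
    print(f"cycles per year: {cycles_per_year:.3e}")
    print(f"errors per cycle: {errors_per_read(1.0):.2e}")
    print(f"errors per read at 1% utilization: {errors_per_read(0.01):.2e}")
```

      That is roughly 3 errors per 10 trillion cycles, or 3 per 100 billion reads at 1% utilization.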

    57. Re:Percentage? by Anonymous Coward · · Score: 0

      People may remember you for being a dimm-wit.

    58. Re:Percentage? by wagnerrp · · Score: 3, Interesting

      Actually, they are custom motherboards. They are a non-standard form factor, using a custom 12V power connector, instead of a normal ATX/EPS plug. When you figure they're buying tens of thousands of these systems, why would you not have an OEM build you custom boards?

    59. Re:Percentage? by pankkake · · Score: 1

      I have an old computer (almost ten years old) but with an upgraded case, PSU and heatsink (much much better than what was available at that time). Good (but no "server") parts. Result: 663 days uptime (behind an UPS). Not on Windows, of course.

      --
      Kill all hipsters.
    60. Re:Percentage? by Fulcrum+of+Evil · · Score: 1

      In other words, big numbers make Gronk head hurt. Gronk go make fire. Gronk go make boat. Gronk go make fire-in-a-boat. Gronk no happy with fire-in-a-boat. Boat no work, and fire no work, all at same time.

      Me am go too far!

      --
      "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
    61. Re:Percentage? by CAIMLAS · · Score: 1

      Actually, didn't the google drive study find that disks performed better at warm-but-not-hot temperatures, and that excessive cooling was actually detrimental? That's what I recall coming away from it.

      --
      ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
    62. Re:Percentage? by Fulcrum+of+Evil · · Score: 1

      Simple answer: I can custom hack a UPS, get a UPS mfr. to make it for me, and generally have a lower cost profile with longer times between generations and lower system risk. I'm also not tied to any OEM (save for the UPS, which is fairly simple)

      --
      "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
    63. Re:Percentage? by CAIMLAS · · Score: 1

      Correct; however, quite a few of the "higher end" (ie $50 and up, thereabout) PSUs have power conditioning. They call it active (or passive) PFC - power filter control. Any decent PSU you buy today (eg. Antec earthwatts) is going to have it.

      --
      ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
    64. Re:Percentage? by CAIMLAS · · Score: 1

      Like everything, there's a "temperature band" tolerance for any given device. There's the "works well" band, which is smaller and sits somewhere within the overall band which is "acceptable" - and a gradual gradient between "well" and "acceptable", all the way up to "melts solder" and "PCB forms frost". I'd wager that the temperature band google studied was almost entirely within the "works well" range, so any variances wouldn't show that much difference.

      --
      ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
    65. Re:Percentage? by Anonymous Coward · · Score: 0

      PFC is not power conditioning -- it's a way to make the PSU look like a resistive load (rather than a reactive load) to the electrical system supplying power. It doesn't have much direct benefit for the average consumer, but it is more efficient in that there is less power wasted to heat in the delivery wires, and it's more gentle to generators and the like. Industrial users are often billed differently than residences, and in those cases a PSU with PFC costs them much less in electricity.
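
      The wire-loss point can be sketched with a little arithmetic. Every number here (300W load, 120V supply, 0.1 ohm of wiring, power factors of 0.6 and 0.99) is an illustrative assumption, not a measurement of any real PSU.

```python
# Sketch: why a higher power factor means less heat in the supply wiring.
# For the same real power, a lower power factor means more RMS current,
# and wire heating goes with the square of that current.

def line_current(real_power_w, voltage_v, power_factor):
    """RMS current drawn for a given real power at a given power factor."""
    return real_power_w / (voltage_v * power_factor)


def wire_loss(real_power_w, voltage_v, power_factor, wire_ohms):
    """I^2 * R heat dissipated in the supply wiring, in watts."""
    current = line_current(real_power_w, voltage_v, power_factor)
    return current * current * wire_ohms


# Hypothetical values chosen only for illustration.
loss_no_pfc = wire_loss(300, 120, 0.6, 0.1)
loss_pfc = wire_loss(300, 120, 0.99, 0.1)

print(f"wire loss without PFC: {loss_no_pfc:.2f} W")
print(f"wire loss with PFC:    {loss_pfc:.2f} W")
```

      The computer gets the same 300W either way; the difference is only in what the wiring and the generator have to carry.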

    66. Re:Percentage? by Austerity+Empowers · · Score: 5, Informative

      I work on server design, specifically motherboards. ECC is a feature, it helps prevent bit errors from passing through undetected. It is not a method for preventing errors from happening in the first place, nor does it influence the number of bit errors. That is a property of the motherboard design, the chipset, the DIMM PCB and the DRAM. Second, just because you provide a spec for a mobo, does not mean that it is all inclusive. Generally people specify form factor, power, features. They don't specify quality and in most cases don't give a criterion for what it means for a feature to "work". In fact most customers I've talked to don't really understand what quality means from hardware (and sometimes in general). Hardware management, much like software, is designed with similar principles of impact/effort: if customers don't care, we don't test. In other words if it ain't listed on the box, or the salesman won't write it down, just assume it wasn't done.

      In spite of the fact that computer motherboards are digital electronics, there is in fact anything but a binary determination of "work" and "not work". Digital signals are an engineering approximation, one which falls apart at high speeds, dense routing and inexpensive design. Well designed and tested motherboards have a well known bit error rate, and reliable companies will not ship a new design until they meet their target. I do this on systems I design, but they aren't cheap, not by a lot. It is a very expensive, time consuming process, one which most companies really want to get rid of. Not all systems are so thoroughly tested, in fact the vast majority of boards out there, server or otherwise, aren't tested much at all.

      Forking over money for ECC is very similar to paying the mob to protect you. Yes, it will give you more peace of mind, but what you really want is to not be having these problems to begin with. For people who care about data integrity, you should be asking what the bit error rate is and how they know. If they don't know, then you don't want it, ECC or no ECC. Don't assume "the industry" is equal, and don't assume that because a vendor's product X is really good that their product Y is really good too: you WILL be wrong, particularly on computers.

    67. Re:Percentage? by Carnildo · · Score: 1

      Yes, something doesn't add up about Google's numbers: the error rates the study reports work out to 1600-5000 single-bit errors per gigabyte per year. I ran a small university computer lab for two years. The computers had parity (not ECC) memory, and the OS would panic on a single-bit error. Based on the crash rate, I figured the error rate was around 15 single-bit errors per gigabyte per year.

      Likewise, it doesn't match the results people are getting from Memtest86. According to Google's numbers, you'd expect between 4 and 13 errors per gigabyte of memory in a single 24-hour run, but people almost always report error-free runs.
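
      Here is a rough sketch of that unit conversion, starting from the abstract's figure of 25,000 to 70,000 errors per billion device hours per Mbit. It assumes 1 gigabyte = 8192 Mbit and a 365-day year, and it lands in the same ballpark as the numbers above (roughly 1,800 to 5,000 per gigabyte per year). The function name is made up for illustration.

```python
# Convert "errors per billion device hours per Mbit" into
# errors per gigabyte per year and per 24-hour run.
HOURS_PER_YEAR = 24 * 365
MBIT_PER_GBYTE = 8 * 1024  # assumes 1 GB = 1024 MB = 8192 Mbit


def errors_per_gb_year(errors_per_billion_hours_per_mbit):
    """Scale a per-Mbit, per-billion-hours rate up to one gigabyte for one year."""
    per_hour_per_mbit = errors_per_billion_hours_per_mbit / 1e9
    return per_hour_per_mbit * MBIT_PER_GBYTE * HOURS_PER_YEAR


low = errors_per_gb_year(25_000)
high = errors_per_gb_year(70_000)

print(f"per GB per year: {low:.0f} to {high:.0f}")
print(f"per GB per day:  {low / 365:.1f} to {high / 365:.1f}")
```

      Like the comparison above, this spreads the errors evenly over every gigabyte and every day, which is exactly the assumption the "most DIMMs had no errors" observation calls into question.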

      --
      "They redundantly repeated themselves over and over again incessantly without end ad infinitum" -- ibid.
    68. Re:Percentage? by mirix · · Score: 1
      --
      Sent from my PDP-11
    69. Re:Percentage? by mabinogi · · Score: 1

      you don't truly understand how patents work, do you?

      A patent isn't on a general concept, it's on a very specific design.
      So yes, others have put batteries in computers, that doesn't change anything. If Google has a patent, it's for the _specific_ way in which they use the battery.

      Don't let the existence of overly vague and broad software patents confuse you - real patents are supposed to be very specific.

      --
      Advanced users are users too!
    70. Re:Percentage? by Runaway1956 · · Score: 1

      Ahhh - ya got me. Yeah, I know that ECC is error "correction", not error "prevention". But, I was thinking along the lines of "prevention" when I posted.

      The rest of your post is informative as well - thanks!

      --
      "Windows is like the faint smell of piss in a subway: it's there, and there's nothing you can do about it." - Charlie Br
    71. Re:Percentage? by Austerity+Empowers · · Score: 1

      The article vaguely said "temperature plays little role" which doesn't help, and the supporting PDF in figure 7 paints a slightly different picture. Quite clearly increasing temperature increases error rate on every single system. However as you will note, motherboard variance had a far more significant contribution on the bottom line. In other words, compared to variable x, variable y is irrelevant. That's not the same as saying variable y is not a factor.

    72. Re:Percentage? by poopdeville · · Score: 1

      That's true, but Google obviously isn't going to go for that. Why have a million power supplies wasting power, when you can have all your boards running off of a single high quality power supply UPS unit?

      --
      After all, I am strangely colored.
    73. Re:Percentage? by poopdeville · · Score: 1

      Uh, that's what makes it cool... they have a giant 12VDC, Kilo- (or Mega-) Ampere power supply somewhere, and lots of 12VDC cables going around... That significantly reduced the cost of cooling a computer.

      --
      After all, I am strangely colored.
    74. Re:Percentage? by Anonymous Coward · · Score: 0

      A true UPS (as opposed to an SPS) is its own power conditioner. A UPS draws in AC power, uses that power to charge a battery. From the battery, it draws energy to generate AC. The AC power is then fed to the components plugged into the UPS.

      You are thinking of what is correctly identified as a Standby Power Supply - these are similar to Uninterruptible Power Supplies, but they merely monitor the incoming AC power, feeding it directly to components. When there is an interruption of service, they begin to draw from their batteries. They can't be called "uninterruptible" because they're not - there is always an interruption of power before they begin working, even though it is on the order of milliseconds.

      I don't know exactly when, but at some point in the last ten years, makers of SPSes began to label the packaging as UPSes instead. You can still tell whether or not you have a true UPS though - scan the box and instructions and look for small print with the words "Switching Time." A real UPS doesn't switch off, therefore has no switching time.

      Captcha is "cripples"

      Now that I'm thinking about it, I wonder if Consumer Reports knows about this - don't they sponsor "truth in advertising" lawsuits?

    75. Re:Percentage? by poopdeville · · Score: 1

      Correct; however, quite a few of the "higher end" (ie $50 and up, thereabout) PSUs have power conditioning. They call it active (or passive) PFC - power filter control. Any decent PSU you buy today (eg. Antec earthwatts) is going to have it.

      I doubt that any consumer computer case is going to have active power conditioning. The right way to do that is to have a tone generator play 60Hz into a big old power amplifier, using dirty power to create relatively clean sine waves. If your needs are very pressing, you can use multiple stages. Anything less is at best a crude approximation.

      --
      After all, I am strangely colored.
    76. Re:Percentage? by Fulcrum+of+Evil · · Score: 1

      Because then you need big freaking bus bars to deal with the current.

      --
      "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
    77. Re:Percentage? by khellendros1984 · · Score: 1

      But that's what they've elected to do, and it seems to have worked out well for them so far.

      --
      It is pitch black. You are likely to be eaten by a grue.
    78. Re:Percentage? by argosreality · · Score: 1

      I've only twice had memtest throw errors on sticks that were bad when the system was clearly crashing in a manner "similar" to memory errors. Swapping out the memory made the system completely stable. This has happened on ten systems, so memtest only picked up errors on 20% of the bad sticks. I think this could partly be attributed to the way memtest tests (it's not exactly real world, whether it's doing linear tests or its "random" mode; neither matches real-world usage, honestly), but it could also very well be an issue with the integrated memory management built into new processors (we've had a much harder time detecting issues with Athlon processors than Intels for some reason), or maybe I'm shooting shit into the wind too. All I know is I rarely rely on memtest for anything anymore. The PC-Doctor toolkit seems to hit bad memory more frequently, but it takes HOURS to run a full pass on even small memory amounts (running it on a machine with 8GB was just gdamn painful). But still... hours of testing vs. just buying a new kit of 4GB memory for $40 and RMA'ing the old shit. Simple math.

    79. Re:Percentage? by ignavus · · Score: 1

      This memory failure explains why I have such problems with Google Translate.

      Whenever I type in the German phrase "ich erinnere mich nicht", it keeps on telling me "I do not remember"

      --
      I am anarch of all I survey.
    80. Re:Percentage? by Trogre · · Score: 1

      The problem isn't getting worse.

      Well, maybe not *that* problem.

      --
      "Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
    81. Re:Percentage? by Trogre · · Score: 1

      > The problem isn't getting worse.

      Well maybe not *that* problem.

      --
      "Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
    82. Re:Percentage? by Anonymous Coward · · Score: 0

      Compared to proprietary hardware, x86 servers are dirt cheap.

      x86 is also "proprietary." Sure, the instruction set, interrupts, and other details are available to the public, but it is non-open hardware. Compare that to OpenSPARC or other architectures released under an FOSS license, which can be modified for any purpose, put onto an FPGA, etc. The 80x86 only shows the illusion of freedom, but in reality, all x86 chips are manufactured in accordance with an Intel license, at least AMD's and VIA's are. It is like comparing Windows to IBM Mainframe to Linux. The internals of Windows are more well-known than the Mainframe's z/OS, and people are even making a Windows clone, ReactOS, but that still means it's proprietary and not free and open like Linux.

    83. Re:Percentage? by klashn · · Score: 0

      A quick clarification on ECC RAM being "better quality." I do not believe that is true. The only difference between an ECC DIMM and a non-ECC DIMM is the 9th DRAM chip that is on the ECC DIMM. I also commend the quality of this whole discussion vs the ZDNET discussion where gamers are quarreling over whether a memory error will end up wiping your drive. In theory yes, but the probability is minuscule. We have fault handlers and Machine Check Architecture on our side to prevent it.

    84. Re:Percentage? by Nethead · · Score: 1

      Hate to say it, but that was a damn lame comic.

      --
      -- I have a private email server in my basement.
    85. Re:Percentage? by Fulcrum+of+Evil · · Score: 1

      Eh, I chuckled. It can't all be AD&D.

      --
      "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
    86. Re:Percentage? by Nethead · · Score: 1

      The phone company seems to do well with it too. -48VDC everywhere, big room of batteries in the basement. UPS? we don't need no stinkin' UPS, we're running on batteries all the time.

      --
      -- I have a private email server in my basement.
    87. Re:Percentage? by Nethead · · Score: 1

      Oh, it's an AD&D comic. Then I guess that one would be an improvement.

      If you haven't yet, check out Sluggy Freelance: http://www.sluggy.com/comics/archives/daily/970825

      --
      -- I have a private email server in my basement.
    88. Re:Percentage? by haruharaharu · · Score: 1
      --
      Reboot macht Frei.
    89. Re:Percentage? by Nethead · · Score: 1

      Ok, excellent art and dialog. But with that rack I wouldn't call her "low charisma" at all. I'll be following this one. Beats the hell out of the same lame Dilbert comics. Thanks for the tip.

      And in case you haven't seen this one: http://www.userfriendly.org/

      --
      -- I have a private email server in my basement.
    90. Re:Percentage? by haruharaharu · · Score: 1

      ok, girl genius online - I dig mad scientists, I guess. Kimiko is low charisma because she's socially awkward - can't talk to boys, calls people primates, that sort of thing.

      --
      Reboot macht Frei.
    91. Re:Percentage? by dontmakemethink · · Score: 1

      I suppose it depends on how you define "server board". Room for tons of ECC RAM and two CPUs is server or serious-workstation class

      Definitely open to interpretation. You've described my audio workstation there. ECC and two socket 940 dual-core Opterons was the only way to run ProTools on a quad-core box back in late 2005. The motherboard is a server/workstation/gaming hybrid, Asus K8N-DL. Hard to believe it's still pounding out albums after four years, only significant downtime was a PSU failure.

      However I can't imagine one of the most successful and server-dependent companies in the world having anything but the very best. The one guy above would have us think it runs on a room of clustered eMacs or something.

      --

      War as we knew it was obsolete
      Nothing could beat complete denial
      - Emily Haines
    92. Re:Percentage? by Hurricane78 · · Score: 1

      So it's the same as with the drugs of the pharma industry. You actually want to prevent a disease, but all they ever offer you, is to ignore your problems by treating the symptoms only. So that the next time, you can make the same mistake. And that as soon as you stop taking their meds, you're sick again. (On "good" meds even sicker than you were before.)

      Interesting.

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    93. Re:Percentage? by Halo1 · · Score: 1

      I work on server design, specifically motherboards. ECC is a feature, it helps prevent bit errors from passing through undetected. It is not a method for preventing errors from happening in the first place, nor does it influence the number of bit errors.

      It does, sort of, because errors can also occur in the parity RAM. Of course, this does not change the actual number of errors in the data ram, but the ECC will detect more errors than actually occur in a non-ECC chip of the same size.

      --
      Donate free food here
    94. Re:Percentage? by petermgreen · · Score: 1

      AC power conditioning seems a rather ass-backwards way of doing things to me anyway.

      IMO if you need AC power conditioning it means your PSU isn't doing its job properly.

      It's just unfortunate that nowhere seems to review PSUs properly :(

      --
      note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
    95. Re:Percentage? by Anonymous Coward · · Score: 0

      From the paper's abstract :

      For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode. We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account.

    96. Re:Percentage? by Anonymous Coward · · Score: 0

      Kernel panic, Blue Screen, Guru Meditation

      The probability that a bit-flip happens in a part that corresponds to your OS's kernel is very low. Typically your kernel's size in memory is tiny compared to the rest. On a correctly designed OS an application going crazy cannot take down your OS and should said application get stuck, a good old "kill -9" is guaranteed to release all the application's resources to the OS (this is the case on Linux and OS X, for example).

      I can guarantee you that if that bit-flip happens, say, while you're SCP'ing a file, the only thing that is going to happen is that a tiny part of the transfer shall be done twice, because of a failed checksum.

      If that bitflip happens, say, in a 1920x1200 picture that just got uncompressed from .jpg before being sent to the video card, you'll be very unlikely to notice that bit-flip at all.

      So bit-flip, yes. Certain OS crash when a bit-flip happens, definitely no. I guess it's safe to say 99.9% of the bit-flips do not cause an OS crash.

    97. Re:Percentage? by RockDoctor · · Score: 1

      However I can't imagine one of the most successful and server-dependent companies in the world having anything but the very best. The one guy above would have us think it runs on a room of clustered eMacs or something.

      From what I've heard Google people saying about their hardware (mostly through SlashDot I think), they've looked at the cost quite closely and realised that with the number of boxes they have to run they'd get a significant number of hardware failures every minute of every day, even using the highest quality gear available at eye-watering prices. So they have no option but to design their software to transparently handle hardware failure. At which point, there's no benefit to eye-wateringly expensive gear when you'd still have to have automatic failover and multiply redundant equipment.
      Probably, the man who pulls the dead servers, PSUs etc from the racks and replaces them spends the rest of the day sweeping floors and may moonlight at a McJob. The tech who runs the diagnostics and passes the failed component to the repair shop or the recycling bin ... maybe a bit more education (s/he may have to plug in multiple diagnostic leads before the computer says "shop" or "bin").
      We're talking about an industrial-scale operation here. It doesn't need intelligence at most levels ; the intelligence has gone into the design of the system.

      I had an unpleasant shock recently : I went into a computerised data acquisition lab at work (supplied by a 3rd-party company) to borrow their multimeter for 5 minutes. (I started my career in such labs, for that company.) To my astonishment, they're not allowed to have a multimeter. Or screwdrivers small enough to fit into the wiring terminals for the sensors, or the isolation barrier sets, or anything of that sort. An extremely depauperate toolbox indeed. When I was in the same units, doing the same job (nominally), I was expected to be able to fault-find the sensors and signal processing to at least board-level ; to be able to repair some board-level faults ; to completely replace any wiring system from A/D converter all the way out to the sensor (using appropriate explosion-proof rated cables, glands, JBs, isolation barriers and techniques), and to do the documentation necessary for any changes. And to do the actual job as well as maintenance of the equipment.
      These days absolutely all changes more complex than putting leads into sockets has to have a service technician come out to do the repairs.
      And I wondered why the modern breed of mudlogger doesn't understand his job and doesn't pay adequate attention to his data quality. Well, I know now.
      That's good. They're always going to need supervision.
      It shouldn't have been a surprise, I suppose. Last year I spent 3 days wondering how a different service provider in the same business could spend 3 days failing to get an RS-232 data link between their computers and a third company's computers. Not failing to get the computers to talk to each other, but failing to get the physical link cable to work.

      [shakes head] No wonder they run out of essential consumables. If we were still allowed to use the word, I'd call it shocking incompetence. As it is, it's just grounds for a "RFI - Request For Improvement".

      --
      Birds are not dinosaur descendants;birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"
    98. Re:Percentage? by Mr+Z · · Score: 1

      That sounds like a telecom voltage. Hmmm.... ;-)

    99. Re:Percentage? by Mr+Z · · Score: 1

      ECC doesn't change the DRAM's bit error rate. It changes the number of bits. ;-)

      Decompressing that witticism a bit: DRAM cells have a particular error rate associated with them. That is, you can expect the total number of errors in your system to be proportional to both total memory size and total time. The DRAM used to store ECC bits needs to be included in that calculation. If you have 1 gigabyte of ECC DRAM, that's 8 * 2^30 bits of storage in a non-ECC system, and 9 * 2^30 bits of storage in an ECC system--12.5% larger. So the absolute number of single-bit errors should be 12.5% larger in the ECC system than the non-ECC system.

      Now, since the amount of storage usable for data did not increase in the system with ECC RAM, the error rate per data bit increased, even if the error rate per DRAM bit did not. ECC, though, will at a minimum correct single bit errors (including in itself), so it more than amply carries its weight. (Most commonly, it will also detect double-bit errors. On higher end systems, you get nifty features like AMD's Chip Kill, which can correct up to 4 bit errors.)
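
      The 12.5% figure is just the 9/8 ratio of stored bits. A tiny sketch, where the per-bit error rate is a made-up placeholder (it cancels out of the ratio anyway):

```python
# Raw bit errors scale with the number of stored bits, so adding a 9th bit
# per byte for ECC raises the raw error count by the same 9/8 ratio.

def expected_errors(data_bytes, stored_bits_per_byte, per_bit_rate):
    """Expected raw bit errors, proportional to the total number of stored bits."""
    return data_bytes * stored_bits_per_byte * per_bit_rate


GIB = 2**30
RATE = 1e-12  # hypothetical per-bit error rate per unit time, illustrative only

non_ecc = expected_errors(GIB, 8, RATE)  # 8 * 2^30 bits
ecc = expected_errors(GIB, 9, RATE)      # 9 * 2^30 bits
increase = ecc / non_ecc - 1

print(f"extra raw errors with ECC storage: {increase:.1%}")
```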

    100. Re:Percentage? by Keeper+Of+Keys · · Score: 1

      Apart from (maybe) shortening the life of all their memory DIMMs.

    101. Re:Percentage? by Mr+Z · · Score: 4, Informative

      "Regular RAM" has neither parity nor ECC.

      The original PC added a 9th bit to each byte, creating parity RAM. It was unique among personal computers at the time. None (or nearly none) of the original PC's contemporaries did this. But, since IBM did, many clones followed suit in the PC space. Macs, notably, didn't support ECC for many, many years, but if you pop open a Columbia Data Products PC, you'll see parity RAM. (Note "128K RAM with parity" in that scan.) IBM went with byte parity in part because bytes were the smallest memory unit the CPU read or wrote to the memory. With byte parity, every memory access could be protected.

      This ratio of 9/8 stuck with the PC's memory system over the years, following it to ever wider interfaces. That includes the 16 bit buses of the 286 and 386SX, the 32-bit buses of the 386DX and 486, and the 64 bit bus of the original Pentium. While many manufacturers made the byte parity optional as a cost saver, it was still rather common.

      Once you get to 64 bits, you have 8 extra parity bits for a total memory width of 72 bits. This is enough bits to implement a single-error correct, double-error detect Hamming code on the 64-bit data. As long as you always read or write in multiples of 64 bits, you can also generate the Hamming code on writes and check it on reads.
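
      To make the 72-bit idea concrete, here is a minimal sketch of a single-error-correct, double-error-detect code over 64 data bits. It uses the textbook extended Hamming layout (7 check bits at power-of-two positions plus one overall parity bit); real memory controllers may well use a different code with the same 72/64 parameters, so treat the layout and the function names as assumptions for illustration.

```python
# Extended Hamming (72,64) sketch: 64 data bits + 7 Hamming check bits
# + 1 overall parity bit. Corrects any single-bit flip, detects double flips.
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]
CODE_LEN = 72  # index 0 holds overall parity, indexes 1..71 hold the Hamming code
DATA_POSITIONS = [p for p in range(1, CODE_LEN) if p not in PARITY_POSITIONS]


def encode(data):
    """Encode a 64-bit integer into a list of 72 bits."""
    code = [0] * CODE_LEN
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (data >> i) & 1
    for p in PARITY_POSITIONS:
        # Check bit p covers every position whose index has bit p set.
        code[p] = sum(code[pos] for pos in range(1, CODE_LEN) if pos & p) % 2
    code[0] = sum(code) % 2  # overall parity makes the whole word even
    return code


def decode(code):
    """Return (data, status); status is 'ok', 'corrected', or 'detected'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, CODE_LEN):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome < CODE_LEN:
        # Odd parity: assume one flip, located at the syndrome
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity with a nonzero syndrome (or an impossible position):
        # more than one bit flipped, so report it without guessing.
        return None, "detected"
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        data |= code[pos] << i
    return data, status


word = 0xDEADBEEFCAFEF00D
stored = encode(word)
stored[17] ^= 1  # simulate a single bit flip in memory
print(decode(stored) == (word, "corrected"))
```

      The check bits are computed over the full 64-bit word, which is why partial writes narrower than 64 bits are a problem: the controller would have to read the rest of the word before it could recompute them.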

      Note that caveat: "As long as you always read or write in multiples of 64 bits." By the time you get to the 486 era, on-board L1 caches started to become standard equipment. Caches can turn a single byte read or write into a multiple byte line-fill (assuming they do read-allocate and write-allocate). They can also make writes wider. In write-back mode, they tend to write back the entire cache line if any portion was updated. In write-through mode, they could theoretically package additional bytes from the cache line to go with whatever bytes the CPU wrote to get to a minimum data size. (I don't know if the 486 or Pentium actually did this, FWIW. I'm speaking of general principles of operation.)

      The combination of caches and wider buses made ECC practical for PC hardware starting with the Pentium. That's why you started to see it in that time frame and not before.

      BTW, the error rate for individual DRAM bit flips should increase as the bits get smaller. It doesn't surprise me that your Pentium Pro's bits never flipped. It was probably built around 16 megabit DRAM chips, or maybe 64 megabit. If you compare a 16 megabit DRAM chip to a 1 gigabit DRAM chip of the same physical size, the bit cells on the gigabit chip are 1/64th the size. That means far fewer electrons holding the bit. As you can imagine, that might increase the likelihood of error per bit. Google's study didn't show an increase in error rate across memory technologies, but its window of memory technologies didn't stretch back 15 years to the Pentium Pro era.

      There's also just the total quantity of memory. Your Pentium Pro system probably had at most 128MB. Compare that to a modern system with 4GB. A 4GB system has 32x the memory of a 128MB system. Even if the per-bit error rate remained constant, there are 32x as many bits, so 32x as many errors. Modern systems also implement scrubbing, meaning they actively read all of memory in the background looking for errors. Older systems just waited for the CPU to access a word with a bad bit to raise an error. This also makes the observed error rate drastically different, since many errors would go by unnoticed in a system without scrubbing, but would get proactively noticed (and fixed) in a system with scrubbing.

      FWIW, I run my systems these days with ChipKill ECC enabled and scrubbing enabled. Not taking chances. I'll give up 3-5% on performance since most of the time I won't notice it.

    102. Re:Percentage? by Mr+Z · · Score: 1

      First, the paper on hard drives did show that temperature was important. It did show though that too cold is worse than too hot. Also, the data wasn't perfect. Google doesn't have a whole lot of drives running at strange temperatures, since they're a datacenter. A consumer though, might well run a drive at 60C in a badly cooled desktop or laptop, and there's not a single datapoint on Google's graph for that.

      Yipe! 60C is 140 degrees F... Really? How do you manage to get a hard drive that hot, when it's bolted to a case that can act as a heat sink? I can imagine CPU temps getting that high, but that's because you're dissipating 20W to 100W in an area the size of a fingernail. Even with a heatsink, it can be hard to move that much energy that quickly from that small a space. Hard drives, though? Are you wrapping the thing in bubble wrap?

    103. Re:Percentage? by Mr+Z · · Score: 1
      Kinda makes me wonder if running at say 10% underclock might make the errors mostly go away.

      It depends on what the failure mode is. If the failure is due to a burst of cosmic rays or something flipping a set of bits, then it doesn't matter what clock rate you run at. If the failure is due to a bounce on the data bus that carries data to the DRAM while you're writing it (creating a hard error, the type Google observed at a much, much higher rate than expected), then a slower clock might work better there.

      This also raises another question for me: What about signal line termination? The lines connecting the DRAMs to the memory controller are transmission lines, and so are subject to reflections and such. Buffered vs. unbuffered affects the electrical characteristics, as does the number of populated slots vs. unpopulated. Do the numbers shift dramatically if I don't fill all my DRAM slots?

    104. Re:Percentage? by AmiMoJo · · Score: 2, Insightful

      Comparing ECC to mob protection is not a very good analogy. ECC lets you detect and in some cases fix memory errors. The key is the detection part.

      If you get a single bit error which results in corrupt data, unless you verify that data some other way you won't know about it unless you have ECC. Verifying data multiple times is computationally expensive and degrades performance, and most server OSs and software don't do it anyway.

      As well as error detection the fact that you know it was the memory which corrupted the data (rather than, say, a HDD read error or a malfunctioning CPU) is valuable. It's much better to be able to say "DIMM 3 is failing" than "there is a fault, let me spend time and effort figuring out where it is". Of course it isn't always as easy as that, but it's still better than non-ECC.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    105. Re:Percentage? by vadim_t · · Score: 1

      I had a 3.5" disk running at 65C in a cramped mini-ITX case. The case was almost entirely filled with hardware and cabling, and only had two 40mm fans for the whole thing. The disk was behaving rather oddly, which I didn't find surprising at all.

      In my experience, drives in a badly cooled desktop can quite easily reach 50C. 7200 RPM drives also manage to get impressively hot when used outside the case (I've occasionally done this to copy data over and such)

      People who like pointing to the Google study completely ignore that its data comes from servers in a datacenter, not from a set of normal users, some of whom have horrible cases with no airflow and an ambient temperature of 35C.

    106. Re:Percentage? by blagder · · Score: 1

      Huh? ECC should be able to _correct_ (and thus prevent) single-bit errors. If it is a double-bit error then it will only detect the error (preventing it from passing through undetected). And I believe that when a line has more than 2 bits corrupted, that may not be detected by ECC in all cases. Since the probability of single-bit errors is massively higher than that of double-bit errors, ECC does help correct errors, preventing them from affecting the end user.
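
      The correct-one, detect-two behaviour described above can be sketched with a toy extended Hamming code. This is only an illustration under stated assumptions: real ECC DIMMs typically protect 64 data bits with 8 check bits, while this hypothetical `encode`/`decode` pair protects 4 data bits with 4 check bits so the whole thing fits on a screen.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 = overall parity, 1..7 = Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2         # makes the total number of 1 bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of set positions is 0 for a valid word
    odd_parity = sum(code) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd number of flips: assume exactly one and repair it.
        # Syndrome 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a non-zero syndrome: detected, not fixable.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
one_flip = list(word); one_flip[6] ^= 1
two_flips = list(word); two_flips[2] ^= 1; two_flips[6] ^= 1
three_flips = list(word); three_flips[1] ^= 1; three_flips[2] ^= 1; three_flips[3] ^= 1
print(decode(one_flip))     # single flip is repaired
print(decode(two_flips))    # double flip is detected only
print(decode(three_flips))  # triple flip is "corrected" into the wrong data
```

      The last case is the point about more than 2 corrupted bits: the decoder cannot tell three flips from one, so it may silently hand back wrong data.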

    107. Re:Percentage? by poopdeville · · Score: 1

      That's true, especially for computers. (That was what made me doubt that there would be active conditioning in UPSes in the first place). There's one circumstance where your summary might be wrong: if your power company isn't doing its job. If you have some time and supplies, try putting an oscilloscope into your mains power. (Please be safe. There is a safe way to do this.) You might be surprised how square and noisy the "sine" wave looks.

      There are some electronic devices that require a lot more isolation from all that gunk than a generic power supply can create. Surprisingly, it can be cheaper to build a power amplifier and a high quality tone generator than to build/buy a custom passive power supply at the same power rating. After all, it's just a generic power supply, a "generic" tone generator and a "generic" amplifier glued together. Heck, in a sense every switching power supply pursues this tactic, though not to the end of creating sine waves. (The end is to create two sets of square waves that "mesh" like gears do to produce a steady voltage)

      --
      After all, I am strangely colored.
    108. Re:Percentage? by Anonymous Coward · · Score: 0

      Do you know what Google and Wikipedia are? If you don't know something do yourself a favour and use them to educate yourself. Only ask a question if you can't find the answer yourself without significant effort.

    109. Re:Percentage? by Anonymous Coward · · Score: 0

      Good work not reading it asshole

  2. Gentoo?? by Anonymous Coward · · Score: 1, Funny

    I use Gentoo; how does this affect me?

    1. Re:Gentoo?? by Runaway1956 · · Score: 4, Funny

      I would suspect that it has no bearing on you at all. Simply chanting "Gentoo Gentoo Gentoo" should cure any and all hardware errors. You're safe, AC.

      I'll keep this fool occupied, someone go call the guys in white coats for me.

      --
      "Windows is like the faint smell of piss in a subway: it's there, and there's nothing you can do about it." - Charlie Br
    2. Re:Gentoo?? by Anonymous Coward · · Score: 3, Funny

      If you use Gentoo, you'll have to make your own DRAM from the schematics.

    3. Re:Gentoo?? by K.+S.+Kyosuke · · Score: 1

      It means that you have a custom kernel with a random bug exclusively designed for you. Limited series only!

      --
      Ezekiel 23:20
    4. Re:Gentoo?? by Anonymous Coward · · Score: 0

      I think the idiot here is you and whoever voted troll.

      Compiling is affected by HW errors.
      So Gentoo is more at danger than others.

      His question is to what degree.

    5. Re:Gentoo?? by CarpetShark · · Score: 1

      If you use Gentoo, you'll have to make your own DRAM from the schematics.

      But not before reading an intensive debate on why you should choose DRAM over an array of floppy drives, and setting lots of variables to tell gentoo that you have not, in fact, chosen the array of floppy drives for main memory.

    6. Re:Gentoo?? by Jesus_666 · · Score: 1

      That's just if you emerge the memory package. That package is old and you should really emerge ram. But beware; without the proper USE flags it defaults to SRAM. You should really use USE="dynamic synchronous ddr ddr-level-3 ecc -kde -gnome" emerge -av ram. Although you could also just read the README that explains how to make your entire system run on 10 GHz octuple data rate PCRAM. You just need to edit a few dozen config files and recompile everything. It's really easy.

      --
      USE HOT GRITS WITH STATUE OF NATALIE PORTMAN (NAKED AND PETRIFIED)
    7. Re:Gentoo?? by poopdeville · · Score: 1

      I wouldn't mind having to edit some files if it produced a 10GHz system.

      --
      After all, I am strangely colored.
  3. ZFS by Hatta · · Score: 0

    This really makes me want to use ZFS.

    --
    Give me Classic Slashdot or give me death!
    1. Re:ZFS by jfengel · · Score: 4, Insightful

      Changing your file system solves RAM errors how?

    2. Re:ZFS by Anonymous Coward · · Score: 1, Funny

      it reduces the effects of universal entropy, obviously.

    3. Re:ZFS by Anonymous Coward · · Score: 0

      Corrupted data in memory gets written to disk files. A file system that will let you roll back files when an error is detected will mitigate that situation.

    4. Re:ZFS by fuzzyfuzzyfungus · · Score: 2, Informative

      Just as likely to crash, less likely to silently scribble bits of nonsense all over the filesystem before doing so...

      Obviously, not having RAM errors would be even nicer; but, if you can at least detect trouble when it arises rather than well afterwards, you can avoid having it propagate further, and get away with using cheap redundancy instead of expensive perfection.

    5. Re:ZFS by Anonymous Coward · · Score: 0

      Swap?

    6. Re:ZFS by profplump · · Score: 2, Interesting

      Adding checksumming adds another place for errors to occur though -- if data is written correctly but the checksum is miscalculated, either before it is stored or when the data is being verified -- you'll end up throwing out perfectly good data. If you also have redundancy you're probably willing to live with that, but if you're running on a single disk, ZFS is just adding more opportunities for data corruption in RAM.

    7. Re:ZFS by Anonymous Coward · · Score: 0

      And why would you choose ZFS anyway? ReiserFS makes errors disappear.

    8. Re:ZFS by K.+S.+Kyosuke · · Score: 2, Funny

      It makes the problem magically go away by redirecting his attention to a catchy new gadget.

      --
      Ezekiel 23:20
    9. Re:ZFS by Anonymous Coward · · Score: 1, Funny

      Are you kidding?! It's 100x greater than we thought!!

    10. Re:ZFS by The+Archon+V2.0 · · Score: 1

      it reduces the effects of universal entropy, obviously.

      Sorry, you're looking for the thread two doors over, "Universe Has 100x More Entropy Than We Thought"

    11. Re:ZFS by Anonymous Coward · · Score: 1, Funny

      A second vote for ReiserFS. It even flatly denies that any errors occurred.

    12. Re:ZFS by Sl4shd0t0rg · · Score: 1

      it reduces the effects of universal entropy, obviously.

      Sorry, you're looking for the thread two doors over, "Universe Has 100x More Entropy Than We Thought"

      Wooooosh....

    13. Re:ZFS by K.+S.+Kyosuke · · Score: 1

      Well, that's the problem, these issues are clearly linked: Once we found out that the entropy is 100x greater than we had thought, all the RAM errors immediately went through the roof. We must push the entropy lower to fix the RAMs.

      --
      Ezekiel 23:20
    14. Re:ZFS by Jesus_666 · · Score: 1

      ZFS has its own, superior implementation of RAM. Duh.

      --
      USE HOT GRITS WITH STATUE OF NATALIE PORTMAN (NAKED AND PETRIFIED)
    15. Re:ZFS by crazyvas · · Score: 1

      Changing your file system can solve a lot of errors and problems. For instance, if you have "Wife is still alive" errors, you can solve it by changing your file system to ReiserFS.

    16. Re:ZFS by Hatta · · Score: 1

      ZFS Features

              * Pooled Storage Model
              * Always consistent on disk
              * Protection from data corruption
              * Live data scrubbing

              * Instantaneous snapshots and clones
              * Fast native backup and restore
              * Highly scalable
              * Built in compression
              * Simplified administration model

      Come on, it's not that hard to figure out.

      --
      Give me Classic Slashdot or give me death!
    17. Re:ZFS by The+Archon+V2.0 · · Score: 1

      it reduces the effects of universal entropy, obviously.

      Sorry, you're looking for the thread two doors over, "Universe Has 100x More Entropy Than We Thought"

      Wooooosh....

      If you think I meant it, then yes, wooooosh.

    18. Re:ZFS by poopdeville · · Score: 1

      Adding checksumming adds another place for errors to occur though -- if data is written correctly but the checksum is miscalculated, either before it is stored or when the data is being verified -- you'll end up throwing out perfectly good data.

      Throwing out good, re-computable data is a lot less bad than writing bad un-re-computable data. At worst, you have to recompute it. Admittedly, this can be a serious problem, but it can potentially be mitigated by algorithm analysis and statistics to determine an upper bound on the size of the effect of error introduced as a function of "time" or "steps". For example, when I was working as a research analyst doing data mining, a single bit flip (or more generally, an erroneous "row") would not strongly affect the results of machine training. Certainly not enough to make us want to run the program for another week to fix it.

      --
      After all, I am strangely colored.
    19. Re:ZFS by Anonymous Coward · · Score: 0

      It is for you, apparently. Where the corruption comes from surprisingly plays a role here.

    20. Re:ZFS by Hatta · · Score: 1

      Well go on then. Explain how ZFS's end to end checksumming won't detect corrupted data due to RAM errors. Go on, I'm waiting.

      --
      Give me Classic Slashdot or give me death!
    21. Re:ZFS by ignavus · · Score: 1

      Changing your file system solves RAM errors how?

      "NTFS" - 4 bytes of memory
      "ZFS" - 3 bytes of memory

      That's 25% fewer bytes to get RAM errors in.

      --
      I am anarch of all I survey.
    22. Re:ZFS by L4t3r4lu5 · · Score: 1

      Yo dawg, I herd u want checksums for ur checksums...

      --
      Finally had enough. Come see us over at https://soylentnews.org/
    23. Re:ZFS by Anonymous Coward · · Score: 0

      Like writing the corrupted data with perfect checksum for that corrupted data to disk? You know, writing?

    24. Re:ZFS by Hatta · · Score: 1

      Ooh, hand waving. You can do better than that. How do you get from valid data, with a valid checksum on disk, to data in RAM, to corrupted data in RAM, to a checksum for that corrupted data, to corrupted data and checksum on disk, without ZFS noticing that the checksums don't match?

      --
      Give me Classic Slashdot or give me death!
    25. Re:ZFS by Anonymous Coward · · Score: 0

      Heh? Data in RAM isn't checksummed, only the transfer to disk. ZFS can only checksum data where ZFS is involved, so it will happily store your already corrupt data to disk and will also happily give garbage back to you, saying it's the stuff you wanted.

    26. Re:ZFS by Anonymous Coward · · Score: 0

      And just to be clear: ZFS doesn't know it should be the same data (why would you write the same data back to disk? It's already there!). It just makes sure the data you give it is the same as what it later restores. Even if it's garbage.
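
      The argument in the last two comments can be sketched in a few lines. This is a toy model, not ZFS: `write_block` and `read_block` are hypothetical helpers, and CRC32 from Python's `zlib` stands in for whatever checksum a real filesystem uses. The only point is that a checksum computed over an already corrupted buffer verifies perfectly.

```python
import zlib


def write_block(data):
    """Toy checksumming store: the checksum covers whatever bytes it is handed."""
    data = bytes(data)
    return {"data": data, "checksum": zlib.crc32(data)}


def read_block(block):
    """Return the stored bytes, or raise if they no longer match the checksum."""
    if zlib.crc32(block["data"]) != block["checksum"]:
        raise IOError("checksum mismatch")
    return block["data"]


# Case 1: a bit flips in RAM *before* the write. The store never sees the
# original bytes, so the corrupt data gets a valid checksum.
buf = bytearray(b"balance=1000")
buf[8] ^= 0x08                       # '1' (0x31) becomes '9' (0x39)
stored = write_block(buf)
print(read_block(stored))            # b'balance=9000', no error raised

# Case 2: the bytes change *after* the checksum was taken (e.g. on disk).
stored_ok = write_block(b"balance=1000")
stored_ok["data"] = b"balance=9000"
try:
    read_block(stored_ok)
except IOError as exc:
    print("caught:", exc)            # this kind of corruption is detected
```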

    27. Re:ZFS by Hatta · · Score: 1

      Fair enough.

      --
      Give me Classic Slashdot or give me death!
  4. Not that many by Anonymous Coward · · Score: 0

    I nmver havm any DRIM error{ on my comp}ter6

  5. Bus errors! by redelm · · Score: 5, Informative
    Hard DRAM errors are rather hard to explain if the cells are good -- maybe a bad write. After much DRAM testing (I use memtest86+ weeklong), I've yet to find bad cells.

    What I have seen (and generated) is the occasional (2-3/day) bus error with specific (nasty) data patterns, usually at a few addresses. I write that off to mobo trace design and crosstalk between the signals. Failing to round the corners sufficiently, or leaving spurs, is the likely problem. I think Hypertransport is a balanced design (push-pull differential like ethernet) and should be less susceptible.

    1. Re:Bus errors! by marcansoft · · Score: 2, Informative

      I had a RAM stick (256MB DDR I think) with a stuck bit once. At first I just noticed a few odd kernel panics, but then I got a syntax error in a system Perl script. One letter had changed from lowercase to uppercase. That's when I ran memtest86 and found the culprit.

      At the time, a "mark pages of memory bad" patch for the kernel did the trick and I happily used that borked stick for a year or so.
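
      The lowercase-to-uppercase symptom is what a single wrong bit looks like in ASCII text, where the two cases of a letter differ only in bit 5 (0x20). A minimal sketch, with a made-up string standing in for the Perl script:

```python
CASE_BIT = 0x20  # in ASCII, 'p' is 0x70 and 'P' is 0x50: they differ in one bit


def flip_bit(byte, mask):
    """Return the byte with the bits in mask inverted."""
    return byte ^ mask


original = b"print"
corrupted = bytes([flip_bit(original[0], CASE_BIT)]) + original[1:]
print(corrupted)  # b'Print'
```

      Perl keywords are case-sensitive, so one such flip is enough to turn working code into a syntax error.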

    2. Re:Bus errors! by Anonymous Coward · · Score: 0

      You are more likely to have cross talk and SI (Signal Integrity) problems with memory than chip level communication.
      Hyper-transport is not going to save you when the memory bus is single-ended.

      - Point to point connections between CPU and North Bridge (on older CPUs) are as good as it can get as far as SI is concerned.
      - Memory modules are bussed together, i.e. point to multi-point. Signals cross connectors, where you can have crosstalk issues because of non-ideal grounding.
      - Memory on modules is neither buffered nor routed point to point. It relies on a balanced branch topology. If the loading is unbalanced, then the topology is broken. ;P

      What might matter is how the board layout is done, decoupling and circuit design on termination and on-board power supply. All of these would affect the eye opening on the signals and noise that the system would tolerate.

    3. Re:Bus errors! by CastrTroy · · Score: 2, Insightful

      I find that more often than not, when people get blue screens or frequent crashes, it's due to a bad RAM chip. I think it's kind of a bad thing that most motherboards don't really test the RAM when you boot up. Usually running the real RAM test will pick up on most memory errors. You don't even need to run memtest. Sure you save a few seconds on boot up, but it's often better to know there is a problem with your memory than to go on for months thinking there is some other problem.

      --

      Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
    4. Re:Bus errors! by dotgain · · Score: 2, Interesting
      I had one mobo, can't remember brand/model exactly but the CPU was an AMD K6-2 450MHz, and back then we ran XFree86, which came as seven gzipped tarballs (if you compiled from source). I think it was file number three that would never gunzip on my PC, "invalid compressed data - CRC error", but the MD5 checked out, so I tried it on another machine and it was indeed fine (and this was back when MD5 was thought secure).

      This machine compiled a lot of source (it was a Gentoo box), so surely if errors like these had been happening frequently we'd have known from heaps of signal-elevens killing the compiles all the time, right?

      ~24 hours of Memtest86 revealed nothing. Googling at the time found someone with the exact same mobo+CPU having problems gunzipping the exact same file (with the correct MD5), and I wondered if there was some specific bit-pattern in the file (or gunzip's state) that b0rked on my mobo. In retrospect I should have tried Solaris x86 on the same machine to try gunzipping the file.

    5. Re:Bus errors! by afidel · · Score: 1

      That's why Nehalem server boards have ECC on the busses just like real servers have had since forever.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    6. Re:Bus errors! by Cyberax · · Score: 1

      Some hard errors occur because of natural alpha-decay - even one alpha particle can flip a bit. Also, energetic cosmic rays can cause problems.

    7. Re:Bus errors! by SuperQ · · Score: 1

      Yup, I'm really disappointed that consumer PCs are still lacking ECC ram. The support for it is in all the chipsets, but it adds $5 to the cost of the machines. Oh well.

    8. Re:Bus errors! by Anonymous Coward · · Score: 0

      This is a major challenge for makers of electronics for satellites. Shielding isn't enough, you also have to get rid of the big static charges that can build. You should see the mess you get when there is a solar flare.

    9. Re:Bus errors! by Anonymous Coward · · Score: 0

      I have a 512MB SD-RAM DIMM which has one faulty bit. It only shows up in one of the more complex memtest86+ tests, and only about every other pass. That single bit caused lots of file corruption until I found the cause. Don't underestimate the seriousness of RAM errors.

    10. Re:Bus errors! by Cthefuture · · Score: 1

      Ever heard of Tin Whiskers?

      Also, cosmic rays and such can screw up memory if they happen to pass through the chip.

      Besides that there are bad components on the motherboard that can affect memory as well. Bad capacitors, power regulators, defective interconnects. Lots and lots of things can cause memory errors.

      --
      The ratio of people to cake is too big
    11. Re:Bus errors! by Anonymous Coward · · Score: 0

      "... Hard to explain ..."?

      Has no-one heard of "background Ionizing radiation"?
      Seriously, especially for low-voltage high speed DRAM - if a cell happens to get hit by some completely unavoidable background radiation, you're going to get bits flipped.

      If you don't have ECC memory - you *won't even know* that you're getting such random bit flips.

      Forget lead shielding, btw - it's not going to stop cosmic rays and the hard gamma by much. Might help with the Beta radiation, but that *should* be the least of your worries... unless you've got bigger problems to worry about.

      Forget building deep underground too - the earth itself is radioactive.

      Hell, YOU are radioactive. Are you going to keep your distance from your computer to stop your data being oh-so-slowly corrupted?

      Flash memory is somewhat safer - mostly owing to its larger write voltage.

      Seriously - does no-one remember how much more reliable computers with 5v DRAM were? I'm thinking especially of the Apples before they went to more commodity-type components - somewhere before the G3s.

    12. Re:Bus errors! by WuphonsReach · · Score: 1

      Hard DRAM errors are rather hard to explain if the cells are good -- maybe a bad write. After much DRAM testing (I use memtest86+ weeklong), I've yet to find bad cells.

      Try something like Prime95 in torture test mode for a day or two if you really want to find bad / flaky memory systems. Or memory that can't handle the timings, even though it's supposed to. It pegs the CPU at 100% and uses all available memory. (You'll have to run multiple copies, last I looked, to bog down a multi-core system though.)

      --
      Wolde you bothe eate your cake, and have your cake?
    13. Re:Bus errors! by Anonymous Coward · · Score: 0

      Not necessarily.

      The failure is within the module, the motherboard or the memory controller. Depending on the architecture the memory controller could still be on the motherboard.

      If you are reliably generating errors it would be prudent to isolate the modules, i.e., swap positions and retest. If the error follows the module, it is a failing module. If the error remains within the same physical location, it will be necessary to isolate further (depending on bank population you may be able to move to an adjacent section and retest).

      Isolation tests are the quickest/dumbest test that can be performed in the face of uncertainty.
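
      The swap-and-retest reasoning above can be written down as a small decision function. This is a sketch only: `locate_fault` and its slot-number arguments are hypothetical names, and it assumes errors have been observed in exactly one slot before and after exchanging two modules.

```python
def locate_fault(failing_slot_before, failing_slot_after, swapped):
    """Interpret a swap test; swapped is the pair of slots whose modules were exchanged."""
    a, b = swapped
    if failing_slot_before not in (a, b):
        return "inconclusive: the failing slot was not part of the swap"
    other_slot = b if failing_slot_before == a else a
    if failing_slot_after == other_slot:
        # The error moved with the stick of RAM.
        return "module"
    if failing_slot_after == failing_slot_before:
        # The error stayed put even though the stick changed.
        return "slot, board or memory controller"
    return "inconclusive: retest"


print(locate_fault(0, 1, swapped=(0, 1)))  # module
print(locate_fault(0, 0, swapped=(0, 1)))  # slot, board or memory controller
```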

    14. Re:Bus errors! by Anonymous Coward · · Score: 0

      I find that more often than not, when people get blue screens or frequent crashes, it's due to a bad RAM chip. I think it's kind of a bad thing that most motherboards don't really test the RAM when you boot up. Usually running the real RAM test will pick up on most memory errors. You don't even need to run memtest. Sure you save a few seconds on boot up, but it's often better to know there is a problem with your memory than to go on for months thinking there is some other problem.

      Working as a Memory Stores only tech support for 3 years...

      Motherboards comprise about 33% of that. It's not always the RAM.

    15. Re:Bus errors! by redelm · · Score: 1

      yes, hard to explain because radiation produces _soft_errors_ (re-read and it goes away).

    16. Re:Bus errors! by redelm · · Score: 1
      Then why do the errors stay at the same memory locations when I swap modules? This puzzled me and caused me to attribute it to mobo busses.

      I agree the modules are partial nightmares. But the data sides have relatively few branches, usually one chip delivers 8 bits, and branches only for other modules or stubs. The address side has horrible fan-out, but fortunately lots of cycles to settle.

    17. Re:Bus errors! by klashn · · Score: 0

      I'm sure we all disable the POST memory test, which pretty much only tests for stuck bits. I agree constant blue screens can be caused by bad RAM chips, but the likelihood of a bad RAM chip reaching an end-user consumer is pretty small. Running into a case where that particular RAM module doesn't handle power management correctly is WAY more common, and of course the board design matters a whole lot!

    18. Re:Bus errors! by Mr+Z · · Score: 1

      It could also be bad/dirty connectors on the motherboard, I suppose.

    19. Re:Bus errors! by Mr+Z · · Score: 1

      Murrh? A soft-error means a bit flipped but the memory cell is still fine. It won't unflip until you rewrite it. A hard error means that a component repeatedly causes the same error, implying that the component itself is flaky.

      So, a cosmic ray or alpha particle might flip a bit, but it's an error that's random and affects all bits across all DIMMs with roughly equal probability. A hard error is something you could isolate to a particular chip, bus or memory controller. EMI-induced errors due to bad board design or bad manufacturing fall in the hard error category, since it's a defect that will cause repeated errors at an elevated rate. The fact that 90% of the errors come from 20% of the machines in Google's study indicates that hard errors are the dominant error mode.
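
      One way to apply that distinction to an error log is to count how often each location reports a fault. The sketch below is an assumption-laden illustration, not the study's method: the log format, the `classify` helper and the repeat threshold of 2 are all made up, and a repeated location is only a hint that the error is hard.

```python
from collections import Counter


def classify(error_log, repeat_threshold=2):
    """Label each (dimm, address) location 'hard' if it recurs, else 'soft'."""
    counts = Counter(error_log)
    return {
        location: ("hard" if n >= repeat_threshold else "soft")
        for location, n in counts.items()
    }


log = [
    ("DIMM3", 0x1F00),
    ("DIMM1", 0x0020),
    ("DIMM3", 0x1F00),
    ("DIMM3", 0x1F00),
]
for location, kind in classify(log).items():
    print(location, kind)
```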

    20. Re:Bus errors! by jandrese · · Score: 1

      I've had the opposite experience with those BIOS memory tests. I've never, ever had one find anything wrong, even on a system that had quite bad memory. OTOH, Memtest86 wasn't able to find any problem with the sticks either, but the instant you started a program that blitted a lot of data to the screen the system would crash. Screensavers for instance never lasted more than a couple of minutes. Replacing the memory eventually solved the problem, but it took a long time to track down the problem.

      --

      I read the internet for the articles.
    21. Re:Bus errors! by bendodge · · Score: 1

      Sounds more like bad VRAM to me...although I don't know how replacing your RAM would have fixed that.

      --
      The government can't save you.
  6. An explanation by Anonymous Coward · · Score: 1, Funny

    Maybe this is explainable by today's story that the universe has 100x more entropy than we thought

  7. Not posqible? by Anonymous Coward · · Score: 0

    Bad memory bits? How is thap posqible?

    And here I thought people were swearing at me (%^&%$&) in email when it really was just bad bits. Whew... What a relief!

  8. ECC on a home system? by eison · · Score: 4, Interesting

    I've always thought it would be a nice-to-have feature for my home system to have ECC - perhaps it might degrade over time and misbehave less if it could detect and fix some errors. But my normal sources don't seem to stock many choices. E.g. Newegg appears to have 2 motherboards to choose from, both for AMD CPUs, nothing for Intel. Frys appears to have one, same, AMD only. Is this just the way things are, or do I need to be looking somewhere else? Would picking one of these motherboards end up in not working out well for my gaming rig?

    --
    is competition good, or is duplication of effort bad?
    1. Re:ECC on a home system? by binarylarry · · Score: 1

      ECC is slightly slower.

      --
      Mod me down, my New Earth Global Warmingist friends!
    2. Re:ECC on a home system? by Anonymous Coward · · Score: 0

      I used to use ECC for my home systems, but they've segmented the market so completely it's now virtually impossible to get it unless you suck it up and pay "workstation" prices. Sorry, I don't need 8 cores and 16GB of RAM.

      Long for the good ol days of the trusty Intel 440BX. :(

    3. Re:ECC on a home system? by swb · · Score: 1

      You'd probably have to look at server boards rather than desktop boards.

      http://bit.ly/16EUiC

      Link to Newegg with filtered set of ECC compatible server boards.

      But you'll pay a lot more and probably need a larger case and a bunch of other BS, although it looks like there are some ATX factor boards.

    4. Re:ECC on a home system? by Anonymous Coward · · Score: 0

      I am almost positive that the latest Intel desktop chips, the i7s, do not support ECC memory. The older Core 2 Quads do, though. Last time I checked, Newegg had a Supermicro board for the socket 775 Intel quad cores that supports 1333-speed DDR3. Might be a nice way to build a low-cost Intel single-socket quad-core server with unbuffered ECC memory.

    5. Re:ECC on a home system? by Spoke · · Score: 1

      Because AMD Athlon/Phenom CPUs have the memory controller integrated into the CPU, the CPU (not the motherboard) actually dictates what type of RAM you can use.

      For all the desktop class AMD Athlon/Phenom CPUs, you can use un-buffered ECC memory. Just make sure it's not buffered or registered. You need an Opteron to use buffered or registered memory.

      If you want an Intel processor, you have to use a Xeon (and the right mobo) to use ECC memory.

    6. Re:ECC on a home system? by UserChrisCanter4 · · Score: 1

      ECC is a server-targeted feature. Newegg has 18 mainboards that support ECC listed in the Dual LGA 1366 category alone, and I'd imagine plenty more scattered throughout their server board categories.

      As you've already discovered, though, it's not terribly common on home-targeted boards. You're welcome to use one of those boards for gaming, but you'll probably have to use a pricier Xeon or Opteron processor, more expensive ECC RAM, and suffer with slower PCI-E links for your videocards. Higher prices and similar or slower gaming performance is probably not what you're interested in.

      You'll also have to assume that a bit will flip in an area of RAM that's actually holding information that's important at the moment that bit flips; it's a useless feature if nothing's in the bit of RAM that accidentally flips. It's extremely useful on servers that are on 24/7, always stressed, and likely to have the RAM completely filled with important information. For home users, it falls on the wrong side of the cost/benefit test.

    7. Re:ECC on a home system? by Anonymous Coward · · Score: 0

      You need to look at Tyan motherboards: http://www.tyan.com/

      They have been making fewer "desktop" boards over the years, but the lines between server, desktop, and workstation are awfully blurry these days.

    8. Re:ECC on a home system? by DAldredge · · Score: 4, Informative

      A lot of the AMD boards support ECC RAM but newegg doesn't show it. Most every AM2 motherboard supports it. My main workstation at home is a Phenom II with 8GB ECC RAM mainly for that reason.

    9. Re:ECC on a home system? by vadim_t · · Score: 2, Informative

      ECC is slower by something like 1%, which is completely unnoticeable since RAM contributes relatively little to the overall system performance. 2x faster RAM won't make things run twice as fast, because normally CPU caches get a > 90% hit ratio. Otherwise things would be incredibly slow, as the fastest RAM still is horribly slow and has a horrible latency compared to the cache.

    10. Re:ECC on a home system? by conureman · · Score: 1

      I guess it's gettin' pretty long in the tooth, but my favorite home board is a one-socket opteron. It's only got four gigs of RAM though (and two empty slots).
      What I learned from TFA is I didn't do anything but piss everyone off with the "heroic" cooling I've been doing all these years. I've never lost a HDD, and I've always blamed the wind tunnel factor. Live and learn, eh?

      --
      The cost of that cleanup, of course, will be borne by taxpayers, not industry.
    11. Re:ECC on a home system? by GiMP · · Score: 1

      You're right, the i7 does not support ECC. You need to instead run a Lynnfield or Bloomfield Xeon processor, which, like the i7, are based on Nehalem.

    12. Re:ECC on a home system? by MechaStreisand · · Score: 1

      If the motherboard doesn't have the traces for that extra chip, though, then it won't support it. Or if it does, but the BIOS doesn't have the option to enable it, same deal. Don't assume a motherboard supports ECC memory unless the manufacturer says so (and the manufacturer isn't Abit). Asus boards for the Athlon 64 and up all support ECC, for instance.

      --
      Disclaimer: IANAL. This post is, however, legal advice, and creates an attorney-client relationship.
    13. Re:ECC on a home system? by Anonymous Coward · · Score: 0

      "Heroic cooling" did save my ass when the AC system failed. The servers ran fine even though server room temperature was nearly 50C! (That's 120F)

    14. Re:ECC on a home system? by PitaBred · · Score: 2, Informative

      The article states 5-6%, which jibes with benchmarks I've found.

    15. Re:ECC on a home system? by vadim_t · · Score: 1

      It's been a while since I looked, but I remember seeing benchmarks of 1-2%.

      Still, here's what I was talking about

      Notice how going from 1066 to 2000 ( 87% faster memory ) gains less than 5% in framerate in HL2.

      Assuming that scaling holds in general, the 5% hit of ECC will slow you down by about 0.25% total. That's well within the noise inherent in a benchmark.

      Even in the case of Far Cry, which sees the most benefit from faster RAM, the hit from ECC would still be hard to notice at ~2%.
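
      The estimate above is a simple proportional scaling, which can be checked in a few lines. The helper name `overall_slowdown` is made up, and the linear scaling between memory speed and frame rate is the assumption stated in the comment, not a measured fact. Using the full 5% as the HL2 gain gives an upper bound, since the observed gain was less than that.

```python
def overall_slowdown(memory_slowdown, memory_speedup, observed_gain):
    """Scale a memory slowdown by how strongly the workload responds to memory speed."""
    sensitivity = observed_gain / memory_speedup   # frame-rate gain per unit of memory speedup
    return memory_slowdown * sensitivity


# HL2 figures from the comment: 87% faster memory, under 5% more frames,
# and a 5% memory-speed penalty for ECC.
hl2 = overall_slowdown(memory_slowdown=0.05, memory_speedup=0.87, observed_gain=0.05)
print(f"{hl2 * 100:.2f}%")  # 0.29%
```

      That lands in the same range as the roughly 0.25% quoted above.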

    16. Re:ECC on a home system? by modemboy · · Score: 1

      Actually every AMD board will. It is a function of the on chip memory controller, so is entirely dependent on the processor.

    17. Re:ECC on a home system? by Delkster · · Score: 1

      "Heroic cooling" sounds like someone frantically waving a hand-held fan at the servers and biting his teeth in pain while an epic score plays in the background.

    18. Re:ECC on a home system? by Anonymous Coward · · Score: 0

      Some of the socket 1366 boards will support ECC. They require a core i7 (and not the cheapest models, either - those use another socket), but that's the best gaming CPU money can buy right now. Overpriced, yes, but you'll get ECC with excellent gaming performance.

      The first MB I found that claims ECC support was the DFI LanParty DK, but I'm sure that's not the only one.
      All the dual-socket boards also do ECC, but those tend to require the shockingly expensive Xeon 55xx CPUs.

    19. Re:ECC on a home system? by Jeff+DeMaagd · · Score: 1

      The only sure way to get ECC support is to buy an Opteron or Xeon board. Sometimes they require ECC.

    20. Re:ECC on a home system? by Jeff+DeMaagd · · Score: 1

      There is some overlap, but you probably want workstation boards if you're looking to use it at your desk. Server boards are more likely to have on-board graphics, workstation boards don't, so you add your own graphics card that suits the task.

    21. Re:ECC on a home system? by Anonymous Coward · · Score: 0

      Careful -- while it's true the AM2+ on-die memory controller is responsible for ECC functionality, most motherboard BIOSes don't have support for turning it on. There are a handful of OS-specific hacks out there to enable it anyway on some boards, but it's not guaranteed in general terms.

      So far the only AM2+/AM3 boards I've found that actually let you turn on ECC in BIOS are Biostar's T-Series boards (I'm currently running a TA790GX XE with an Athlon II), and some versions of ASUS boards. Gigabyte claims support on a couple of their boards, but has no BIOS support for ECC. I've yet to find any others with real ECC support.

    22. Re:ECC on a home system? by springbox · · Score: 1

      I saw a MSI board with a nForce series 900 chipset that had ECC BIOS options

    23. Re:ECC on a home system? by Fulcrum+of+Evil · · Score: 1

      As you've already discovered, though, it's not terribly common on home-targeted boards. You're welcome to use one of those boards for gaming, but you'll probably have to use a pricier Xeon or Opteron processor, more expensive ECC RAM, and suffer with slower PCI-E links for your videocards. Higher prices and similar or slower gaming performance is probably not what you're interested in.

      I can get a dual AMD TYAN board with ECC ram and 2 x16 slots (which is pointless, speedwise - you won't need 32 lanes) on newegg. The opterons aren't expensive, since I buy lower speed parts (games are mostly gfx limited). For all that, I get a machine that stays up for months at a time and lasts many years.

      --
      "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
    24. Re:ECC on a home system? by level_headed_midwest · · Score: 1

      Just about all ASUS socket AM3 motherboards support ECC RAM, and all Phenom, Phenom II, and Athlon II CPUs also support ECC memory. Put the two together and you have an inexpensive way to support ECC RAM that does not involve buying a workstation board (although ASUS's AM3 790FX board is just about as expensive as a workstation board.)

      --
      Just "gittin-r-done," day after day.
    25. Re:ECC on a home system? by level_headed_midwest · · Score: 1

      Core i7s do not support ECC, so plugging them into an LGA1366 motherboard that has ECC support still means ECC doesn't work. You have to buy the Xeon equivalent of the i7 (Xeon X35xx series) to get ECC support in the CPU, and then pair that with an appropriate ECC-supporting motherboard to get ECC to work.

      --
      Just "gittin-r-done," day after day.
    26. Re:ECC on a home system? by Mashiki · · Score: 1

      Used to be back about 10 years ago every board out there supported ECC out of the box. In fact, it was enabled by default and when you were building a system you had to disable it otherwise it would cause errors with non-ecc memory. Now it's kind of hit and miss. Issues you'll run into with ECC include: Degraded performance, slower mean access time, occasional slower start up.

      ECC is generally set for what was called critical environments. These days, however, I'd say that most stability and RAM issues in modern machines can be fixed by proper use of spread spectrum (very few people use it). Most boards will let you muck with not only memory, but also the CPU and video card bus, and that's usually enough to reduce the error rate and make a machine stable.

      If you are looking at building one just for gaming, you'll see a small decrease in performance. Not much, really. Is it worth the bump in memory cost? Well, that's up to you to decide. In a lot of cases it works out to around $4/GB, which isn't a whole pile. And if you don't like it, you can still turn ECC mode off and see no difference.

      --
      Om, nomnomnom...
    27. Re:ECC on a home system? by argosreality · · Score: 1

      Actually, the lack of increase going from 1066 to 2000MHz makes sense because it's using a CPU that 1.) lacks an integrated memory controller, 2.) has a shared FSB to the northbridge that has the memory controller, and 3.) doesn't even need that much bandwidth to be maxed out (most current games are GPU limited, not CPU limited). So bandwidth isn't the limiter, but latency still is. So I bet you WOULD see more of a hit switching to ECC vs non-ECC on the same bandwidth memory; however, I can't really find many modern benchmarks to support that. Logically it makes sense, but I've been known to be completely wrong before, so it wouldn't be the first time ;)

    28. Re:ECC on a home system? by csartanis · · Score: 1

      Subscription only content? Niiiice!

    29. Re:ECC on a home system? by UserChrisCanter4 · · Score: 1

      You can get a dual CPU board for about $300 (Why dual CPU for gaming, again, when a multi-core chip will do fine for the few games that truly benefit from multiple CPUs?) You can spend more money on equal speed parts or kid yourself that your CPU speed won't have any effect - I'll give you that it affects some types of games less than others, but it always has some relevance to the argument.

      You can pay more money for slower RAM that will be guaranteed against a flipped bit, even if that bit doesn't affect what you're playing at the moment anyway.

      You can get 2 x16 slots that cut down to x8 speed when used in paired mode and further cripple your performance as compared to a regular ol' $150 semi-premium mainboard. Either computer may last for many years, but the presence of ECC RAM has no bearing on that case.

      Staying up for months? I had a craptastic K6-2 on ALi chipset computer that had no problem doing that. If someone's building a gaming computer, they're probably using/dual-booting into windows, and the updates or dual-booting is going to negate their "amazing" uptime anyway.

      In other words, you advocate spending a good deal more money for less performance than a non-server product would deliver on gaming, with "reliability" features that may never be relevant given the stated use. OP wants to know why he wouldn't want ECC RAM for gaming, and I explained it.

    30. Re:ECC on a home system? by atamido · · Score: 1

      I can't find anywhere in the study that describes a speed difference, other than the impact of actually fixing the errors. Do you have a link to a study that describes the performance impact that doesn't require a subscription?

    31. Re:ECC on a home system? by Fulcrum+of+Evil · · Score: 1

      You can get a dual CPU board for about $300 (Why dual CPU for gaming, again, when a multi-core chip will do fine for the few games that truly benefit from multiple CPUs?) You can spend more money on equal speed parts or kid yourself that your CPU speed won't have any effect - I'll give you that it affects some types of games less than others, but it always has some relevance to the argument.

      Yes, it's $300 for the board if you want ECC and no, the speed isn't much of a factor - GPU speed is.

      You can pay more money for slower RAM that will be guaranteed against a flipped bit, even if that bit doesn't affect what you're playing at the moment anyway.

      You can pay $100 for 4G of RAM. who cares that it's slower? It doesn't affect system speed.

      You can get 2 x16 slots that cut down to x8 speed when used in paired mode and further cripple your performance as compared to a regular ol' $150 semi-premium mainboard. Either computer may last for many years, but the presence of ECC RAM has no bearing on that case.

      Who cares if it does? You aren't using that bandwidth anyway.

      Staying up for months? I had a craptastic K6-2 on ALi chipset computer that had no problem doing that. If someone's building a gaming computer, they're probably using/dual-booting into windows, and the updates or dual-booting is going to negate their "amazing" uptime anyway.

      I don't build gaming computers and never claimed to. My systems stay up for months and I don't have to worry about random errors because I get quality power supplies and use ECC when possible. I like my rock-stable dev box.

      In other words, you advocate spending a good deal more money for less performance than a non-server product would deliver on gaming, with "reliability" features that may never be relevant given the stated use. OP wants to know why he wouldn't want ECC RAM for gaming, and I explained it.

      Here's a nickel kid, go get yourself a better computer.

      --
      "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
    32. Re:ECC on a home system? by UserChrisCanter4 · · Score: 1

      Hell, let's grant your claim that GPU speed is all that matters. It's not, but we'll roll with it. You're going to end up with a lousier GPU when you blow all of your budget on more expensive mainboards, RAM, and CPUs. Enjoy your sub-par (but not really any more stable) gaming experience!

      The OP is building a gaming computer. He specifically asked about it. Not a "dev box," not a DB server, not a home theater PC, but something for gaming. Before you go spouting off like a jack-ass about not building gaming computers, understand that I'm talking about them because the original poster asked why he wouldn't put ECC in a gaming machine. For someone who doesn't build gaming computers and never claimed to, though, you sure can spout bullshit about games not benefiting from CPU speed, RAM speed, or GPU bandwidth.

      Different needs for different apps; you can continue to pretend that RAM and GPU bandwidth and CPU speed have no effect on gaming. The rest of us will stay in reality, thanks.

    33. Re:ECC on a home system? by Fulcrum+of+Evil · · Score: 1

      You can pretend that GPU BW and RAM speed matter, but they don't. CPU can have an impact, but it's mostly the GPU and possibly the disk (but hey, RAID0 SATA solves that for cheap). I never claimed that my box was an ideal game box, just that ECC is easily done on a home PC. $1500 for a computer is cheap as hell, kid.

      --
      "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
    34. Re:ECC on a home system? by UserChrisCanter4 · · Score: 1

      Easily done on a home PC, and with no gain or benefit for that user. OP wants a gaming machine, and you want to spout off on things that won't benefit him.

      "CPU can have an impact" is a bit of an understatement. According to Tom's Hardware's Far Cry 2 benchmark, a $280 Core i7 920 will spank a pair of comparably priced 2.5GHz Opterons ($360 for the pair) to the tune of almost 30 additional frames per second. The Opteron isn't on that chart, mind you, but you can ballpark it pretty reasonably among the comparably clocked Phenom chips.

      I'm glad that you've realized the folly of claiming that a slow CPU isn't going to make a difference, but I'd encourage you to go check benchmarks on RAM speed and x8 vs x16 GPUs on games (the latter might be harder to find as it's been a while since mid-range boards likely to be used for gaming have dropped speed on second/both PCI-E slots). It's not but a few frames per second - maybe 10 or 12 in aggregate between the RAM and GPU bandwidth- but that's quite a bit when it costs nothing more and the original poster wants a gaming machine.

      FYI: calling people "kid" on the internet doesn't bolster your argument one bit. It just makes you look rude.

    35. Re:ECC on a home system? by Fulcrum+of+Evil · · Score: 1

      Is that the best you've got? The difference between 80 and 120 FPS is laughable, and the linked benchmark doesn't specify the system - would you have me believe that the machines in question ran 80 fps in software mode? No GPU is specified, so the chart is basically pointless.

      I claimed that the difference between CPUs wasn't a factor and that the GPU was - that's engineer speak for saying the GPU will drive the numbers most days. I already know about RAM speed - 5-10% is pointless unless you're amazon, and the same is true for pci-e slot speeds.

      By the way, I like being rude to you - you spout off about how many FPS you can get like it's your e-peen, when the truth is, if you can run a game at 60-80fps in bad conditions (which you haven't really addressed), it doesn't matter if you could make it faster. You sound like a goddamn teenager.

      --
      "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
    36. Re:ECC on a home system? by UserChrisCanter4 · · Score: 1

      What's the best I've got? For starters, not dismissing non-substantial differences as meaningless? The end result of our entire discussion is the same as it was at the beginning: The original poster wants a gaming machine, and you would have him select something that is either slower at the same price or more expensive. Assuming whatever budget you're working with, burning money on pricier mainboards and RAM (even if you don't care about the CPUs) still takes money away from the portion of the budget that you can devote to GPUs. I'll completely agree that GPUs make up the majority of the performance in the computer, so I'm very confused about why you seem focused on getting the poster to drop cash in areas other than his GPU.

      If you want to casually dismiss any performance difference as meaningless, go right ahead; don't be surprised when someone calls you on it. 1-2% is pointless and should be ignored, but 5-10% is not. The guy plays games, and wants to know why he would want ECC RAM in his gaming computer. Frames per second is a pretty damned important feature in a gaming computer, and is certainly a hell of a lot more compelling to someone who says they play games than months of uptime. ECC has its place, and clearly you value it, but its place is not in a gaming computer unless your goal is simply to throw money about.

      So, what's the best you've got? The poster wants to play games, and presumably he's interested in the best product for doing so. Please, tell me how dropping loads more money on a mainboard that supports ECC RAM, the ECC RAM itself, and the slower processors to run the whole system will benefit him rather than focusing that budget on videocards. Alternately, feel free to resort to condescension and insults, as it's about all you appear capable of mustering to bolster your argument.

    37. Re:ECC on a home system? by Fulcrum+of+Evil · · Score: 1

      1-2% is pointless and should be ignored, but 5-10% is not.

      Anything under about 30% total difference will most times not be noticed. In order to wow someone, it's got to be over 100% - subjectively, 10% difference doesn't matter.

      Frames per second is a pretty damned important feature in a gaming computer, and is certainly a hell of a lot more compelling to someone who says they play games than months of uptime. ECC has its place, and clearly you value it, but its place is not in a gaming computer unless your goal is simply to throw money about.

      Funny, random crashes are a lot more irritating than missing out on that 120FPS number on my 60Hz lcd.

      Please, tell me how dropping loads more money on a mainboard that supports ECC RAM, the ECC RAM itself, and the slower processors to run the whole system will benefit him rather than focusing that budget on videocards.

      Nowhere did we discuss that. Get the ECC systems and a pair of $200 cards and your rig will be quite fast enough for quite some time (2-3 years?). You make it sound like ECC is absurdly expensive when it's really pretty cheap - $100 isn't much. Hell, $1500 isn't a lot for a fast box. I paid almost that much for a 386 back in the day.

      Alternately, feel free to resort to condescension and insults, as it's about all you appear capable of mustering to bolster your argument.

      Speak for yourself - your best is a webpage somewhere that talks about FPS without saying what informs the number. For all you know, the faster proc has a newer GPU.

      --
      "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
    38. Re:ECC on a home system? by UserChrisCanter4 · · Score: 1

      Your ECC system is substantially more expensive; it's not just the RAM. $50 more for comparable RAM, $100-$150 more for a board capable of supporting it, and $100-$150 more for slower CPUs. It's the total cost of the platform, not just the RAM, that causes a sticking point. $250 or $300 goes a long way in performance. ECC is all of a sudden the difference between a single $200 card and a pair, and it doesn't provide any tangible benefit to the guy who wanted to game.

      5-10% is important once it represents the difference between playable and not. You're absolutely right that the difference between 110 and 120fps on a 60hz LCD is pointless and comical, but mid-range systems don't tend to do that. If you can keep it at 60-70fps on today's games, then next year's start dropping a little below that point. I'll bet the person asking for a gaming computer will want every performance gain he can get to keep him in the playable range for the next few years.

      Random crashes? I hate to fight your strawman with anecdotal evidence, but it's not really a problem. The flipped bit has to be one in use at the time it flips, and it has to be one that's actually important (and not just changing color on a texture or something). I haven't used ECC since the EDO RAM era, and I've never had a problem with system instability other than in the dark days of Windows 9x.

      My best may be a webpage that's a little vague (who benchmarks products without trying to keep things as similar as possible?), but you haven't even gone that far. Seems as if you're sticking to handwaving for this one. Your argument has boiled down to two points thus far:

      A) UPTIME! Besides months of uptime being useless to someone gaming (due to frequent reboots for Windows patches), I hope you understand that there isn't some mass "crash fest" afoot for the majority of the users not running ECC. Seriously, your system's uptime and stability, like any other well-built computer, likely owes far more to a good PSU, reputable brand components, and good cooling than it does ECC.

      B) It's good enough; you might pay a bit more and get a little less performance, but that's not such a big deal. Except that it is a big deal, because every dollar you advocate spending on hardware with minimal benefit is a dollar not spent on hardware with an actual tangible benefit.

      You and I both know that the best you could honestly tell someone in this position is "It might one day keep your OS or game from crashing, but I can't really tell you the likelihood that it will."

    39. Re:ECC on a home system? by Carl+Drougge · · Score: 1

      While it's certainly true that most consumer boards don't have BIOS support for ECC, my Gigabyte GA-MA-770-UD3 does, so gigabyte probably doesn't lie when they claim support on other boards. (But these options are not shown if you don't actually have ECC-memory, so you could easily fool yourself when you check for it.)

      Also my slightly older ABit AN-M2HD supports ECC. Both boards were bought with ECC in mind, it's not all that common. But it's certainly possible to get.

    40. Re:ECC on a home system? by Anonymous Coward · · Score: 0

      That's interesting; I noticed the GA-MA770T-UD3P claimed support for ECC, but the PDF manual showed no options for controlling it whatsoever, so I contacted support to ask. This was their reply:

      You do not need to use ecc memory, you can use non ecc un-buffered memory as well
      No ecc option available to be enable, you can use the memory but will not be able to take full advantage of it

      You're saying there are BIOS options for enabling ECC and controlling the background scrubber times for SDRAM and the caches?

      It wouldn't be the first time a manual was wrong; Biostar's primary manuals claim no ECC support, but the options are documented in the BIOS manuals for the T-Series boards. It would still be disconcerting if even tech support was wrong, though.

    41. Re:ECC on a home system? by Carl+Drougge · · Score: 1

      Yes, I am saying that. There are options for:
      DRAM ECC enable
      DRAM MCE enable
      Chip-Kill mode enable
      DRAM ECC Redirection
      DRAM background scrubber
      L2 cache background scrubber
      DCache background scrubber

      The full specs of my board are GA-MA770-UD3 rev 2.0 with BIOS version "FA", Kingston valueram ECC memory, Athlon X2 BE-2400 CPU. So tech support could be right for the board you asked about, though probably not.

    42. Re:ECC on a home system? by Anonymous Coward · · Score: 0

      Good to hear, thanks. If I'd known sooner, I'd likely be using a Gigabyte board right now instead of the Biostar, as there were a couple on my short list for other reasons. Ah well, next time...

  9. Dell by ^_^x · · Score: 5, Interesting

    In my experience at work ordering Dell desktops and laptops, by far the most common defect is 1-3% of machines with bad RAM. Typically it's made by Hynix, occasionally Hyundai, and I've never seen other brands fail. On many occasions though, I've predicted Hynix, pulled it, and sure enough theirs was the piece causing the errors in Memtest86+...

    1. Re:Dell by Jah-Wren+Ryel · · Score: 5, Interesting

      Hyundai is Hynix and they are the second largest DRAM manufacturer by marketshare (roughly 20% second to Samsung's 30%).

      It's no surprise that you've only seen Hynix brand fail in Dells; chances are they are in 90%+ of Dell (and HP and Apple) boxes because they primarily buy from Hynix in the first place. It's selection bias.

      --
      When information is power, privacy is freedom.
    2. Re:Dell by wiredlogic · · Score: 1

      Hynix is the former Hyundai Electronics.

      --
      I am becoming gerund, destroyer of verbs.
    3. Re:Dell by noidentity · · Score: 1

      If Hynix is simply used in most RAM modules, then even with the same failure rate you're more likely to find that brand in failed modules. And if your sample is small enough, you could easily find no other brand in failed modules.
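      To put a number on that, here is a minimal sketch. It assumes every brand has the same failure rate and that failures are independent, so the chance that a failed module is a given brand equals that brand's share of installed modules. The shares and sample sizes below are illustrative, not measured.

```python
def prob_all_failures_one_brand(share: float, failures: int) -> float:
    """Chance that every one of `failures` bad modules is the same brand,
    given that brand's share of installed modules and equal failure rates."""
    return share ** failures

# Illustrative shares and small sample sizes (assumptions, not data).
for share in (0.6, 0.7, 0.8, 0.9):
    for failures in (3, 5, 10):
        p = prob_all_failures_one_brand(share, failures)
        print(f"share={share:.0%} failures={failures:2d} -> "
              f"P(all same brand)={p:.1%}")
```

      With only a handful of failed sticks, seeing a single brand every time is quite likely under these assumptions even if that brand is no worse than the rest.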

    4. Re:Dell by ZosX · · Score: 1

      I've had the worst luck with hynix sticks. Usually when I rebuild systems the sticks that are bad are usually hynix or even hyundai. Mushkin and kingston have always been pretty good to me though and are usually pretty rock solid. Hell, mushkin even has a lifetime warranty. How many other manufacturers offer that?

    5. Re:Dell by klashn · · Score: 0

      Dell is known to buy the cheapest memory available to them, so the worst vendor could be any vendor at their time of purchase! In my test environment people have told me, historically Elpida has had the worst margins on their parts.

    6. Re:Dell by ^_^x · · Score: 1

      Very true on the selection bias thing - they're probably in 60-80% of the systems that come in. There is a mix of other brands installed though, and they haven't failed.
      Interesting that the two brands that DO noticeably fail are one and the same though.

  10. Google's system validation vs standard validation by Anonymous Coward · · Score: 0

    Isn't google running many servers in unusually higher ambient temperatures and in very uniform custom configurations? The results may not apply to anybody else.

  11. I thought that an inability to recall events by bugs2squash · · Score: 3, Funny

    was only a problem for government computers.

    --
    Nullius in verba
    1. Re:I thought that an inability to recall events by rrhal · · Score: 1

      And all this time we blamed Microsoft ...

      --
      All generalizations are false, including this one. Mark Twain
    2. Re:I thought that an inability to recall events by L4t3r4lu5 · · Score: 1

      No, theirs is the issue of not being able to store and transport data securely; A TPM issue, I believe.

      --
      Finally had enough. Come see us over at https://soylentnews.org/
  12. Re:gulff by Anonymous Coward · · Score: 0

    How sick is it that I contemplated moderating this as funny? The old troll-esque memes were so innocent, I really miss them.

    ... and Natalie Portman.

  13. Misleading, to say the very least. by jhfry · · Score: 5, Interesting

    Read the article and remember they are talking averages here.

    They give it away with this line:

    Only 8% of DIMMs had errors per year on average. Fewer DIMMs = fewer error problems - good news for users of smaller systems

    Essentially, only 8% of their ECC DIMMs reported ANY errors in a given year.

    Also this was pretty telling:

    Besides error rates much higher than expected - which is plenty bad - the study found that error rates were motherboard, not DIMM type or vendor, dependent.

    And this:

    For all platforms they found that 20% of the machines with errors make up more than 90% of all observed errors on that platform.

    So essentially, they are saying that only 8% of DIMMs reported errors, 90% of which were on 20% of the machines that had errors, mostly because of motherboard issues... yet DIMMs are less reliable than previously thought.

    I would imagine that if you removed all of the bad motherboards, power supplies, environmental, and other issues... that DIMMs are actually more reliable than I previously thought, not less! I wonder what percentage of CPU operations yield incorrect results. With billions of instructions per second, even an astronomically low average of undetected CPU errors would guarantee an error at least as often as failed DIMMs.

    What I did take from the article was that without ECC ram, you have no way of knowing that your RAM has errors. I guess I should rethink my belief that ECC was a waste of money.
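    A quick sketch of why the average misleads. It assumes the quoted mean of 3,751 correctable errors per DIMM per year is taken over all DIMMs, including the ones that never reported an error; that is an assumption about how the figure was computed, not something stated here.

```python
# Assumption: the quoted mean covers every DIMM, with or without errors.
mean_errors_per_dimm = 3751    # correctable errors per DIMM per year (quoted)
fraction_with_errors = 0.08    # share of DIMMs reporting any error in a year

# All the errors land on the 8% of DIMMs that report any,
# so the per-affected-DIMM average is much higher than the headline mean.
mean_errors_per_affected_dimm = mean_errors_per_dimm / fraction_with_errors

print(f"mean over all DIMMs:      {mean_errors_per_dimm:,.0f} errors/year")
print(f"mean over affected DIMMs: {mean_errors_per_affected_dimm:,.0f} errors/year")
print(f"DIMMs with zero errors:   {1 - fraction_with_errors:.0%}")
```

    Under that assumption the typical DIMM sees no errors at all, while the affected ones average tens of thousands per year.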

    --
    Sometimes the best solution is to stop wasting time looking for an easy solution.
    1. Re:Misleading, to say the very least. by PRMan · · Score: 1

      I do remember reading an article where I was surprised that Google used such low-quality cheap hardware...

      That being said, this isn't really that surprising. Like another poster said, once I started buying quality motherboards (Asus) and quality RAM brands, I really haven't had any problems.

      --
      Peter predicted that you would "deliberately forget" creation 2000 years ago...
    2. Re:Misleading, to say the very least. by RAMMS+EIN · · Score: 1

      ``What I did take from the article was that without ECC ram, you have no way of knowing that your RAM has errors.''

      But that's not actually true. Parity allows you to detect errors, but not correct them. Thus, parity RAM is not ECC RAM, but it will detect memory errors.
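      A minimal sketch of what a single parity bit per byte can and cannot do. The function names and the sample byte are made up for illustration.

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit: 1 if the byte has an odd number of set bits, else 0."""
    return bin(byte & 0xFF).count("1") % 2

def check(byte: int, stored_parity: int) -> bool:
    """True if the byte still matches the parity bit stored alongside it."""
    return parity_bit(byte) == stored_parity

data = 0b1011_0010
stored = parity_bit(data)

single_flip = data ^ 0b0000_0100   # one bit flipped
double_flip = data ^ 0b0001_0100   # two bits flipped

print("clean byte passes:   ", check(data, stored))
print("single flip detected:", not check(single_flip, stored))
print("double flip detected:", not check(double_flip, stored))
```

      Parity catches the single flip but cannot say which bit flipped, so it cannot repair it, and an even number of flips slips through unnoticed.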

      --
      Please correct me if I got my facts wrong.
    3. Re:Misleading, to say the very least. by ZosX · · Score: 1

      the quality of the hardware matters little when you have so much built in redundancy. who cares if a server fails when you got three to back the failed one up? they were smart in realizing that for the cost of a sun server you could buy like 10 pcs and basically achieve a lot more with a great deal more redundancy.

    4. Re:Misleading, to say the very least. by dzfoo · · Score: 1

      astronomically low average

      I wonder, is there such a thing? I thought "astronomical" is used metaphorically to mean vastly large quantities, as opposed to, say, "microscopic" used to mean significantly small quantities.

      I would imagine that an astronomically low measurement could still span the distance between planets.

              -dZ.

      --
      Carol vs. Ghost
      ...Can you save Christmas?
    5. Re:Misleading, to say the very least. by internet-redstar · · Score: 1

      For all platforms they found that 20% of the machines with errors make up more than 90% of all observed errors on that platform.

      So essentially, they are saying that only 8% of DIMMs reported errors, 90% of which were on 20% of the machines that had errors, mostly because of motherboard issues... yet DIMMs are less reliable than previously thought.

      I would imagine that if you removed all of the bad motherboards, power supplies, environmental, and other issues... that DIMMs are actually more reliable than I previously thought, not less!

      I agree.
      Most of the 'Uncorrectable Error' events they notice are probably more power-supply related than RAM related.

      At least that is our experience with 'bad systems' that go down periodically for no apparent reason; replacing the power supply mostly does the trick. The idea that it must be a RAM issue is often wrong, and it has left many customers struggling through weeks of memtesting, suspecting the innocent Linux kernel, or making up even stranger things, while the conclusion in such cases is mostly: power supply.

      It was so with all VA Linux Systems machines in the past (we did the European support for those), and it still holds in our experience with customer datacenters.

      Yet Google has a special power supply infrastructure...
      ... do they have a problem there, or am I missing the point completely somewhere?

    6. Re:Misleading, to say the very least. by jhfry · · Score: 1

      I agree that astronomically is not the right word, but what is its opposite?

      Microscopic isn't small enough, would you agree with "impossibly low average"?

      --
      Sometimes the best solution is to stop wasting time looking for an easy solution.
    7. Re:Misleading, to say the very least. by Mr+Z · · Score: 1

      "Parity" RAM just means that it has an extra bit for each byte. With enough bytes, you have enough "parity" storage that you can store an ECC value instead. See this comment I made elsewhere on this thread.

      Confusing things is that many people call Hamming codes "Hamming parity," which actually isn't as gross an error as it sounds, because Hamming codes are constructed from parity computations on different subsets of the data.

      So, when you buy a x72 DIMM instead of a x64 DIMM, you have enough additional storage that you can store an ECC value with each 64-bit datum. That's true regardless of the fact the sticker and common parlance might refer to such a DIMM as "parity RAM".
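      A small sketch of why 8 extra bits per 64-bit word are enough. It uses the standard Hamming bound (r check bits can cover m data bits when 2^r >= m + r + 1), plus one overall parity bit for double-error detection. The function names are made up for illustration.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r

def secded_check_bits(data_bits: int) -> int:
    """Hamming check bits plus one overall parity bit
    (single-error correct, double-error detect)."""
    return hamming_check_bits(data_bits) + 1

for width in (8, 16, 32, 64):
    extra = secded_check_bits(width)
    print(f"{width:2d} data bits -> {extra} check bits "
          f"({width + extra} bits stored, {extra / width:.1%} overhead)")
```

      Protecting each byte on its own would take 5 check bits per byte, but spreading the code over a 64-bit word needs only 8 in total, which is exactly the one spare bit per byte that a x72 module provides.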

  14. "RAID"-style system for RAM... by MattRog · · Score: 4, Interesting

    RAM is dirt cheap and most server systems support significantly more RAM than most people bother to install. For critical systems, ECC works but that doesn't prevent everything (double bit errors etc.). Is it time for a Redundant Array of Inexpensive DIMMs? Many HA servers now support Memory Mirroring (aka RAID-1 http://www.rackaid.com/resources/rackaid-blog/server-dysfunction/memory_mirroring_to_the_rescue/) but should there be more research into different RAID levels for memory (RAID5-6, 10, etc?)

    --

    Thanks,
    --
    Matt
    1. Re:"RAID"-style system for RAM... by imsabbel · · Score: 2, Insightful

      ECC IS Raid5 for RAM....

      --
      HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
    2. Re:"RAID"-style system for RAM... by TJamieson · · Score: 2, Interesting

      I think OP's point was, say you have 4G of non-ECC RAM. It would be neat if you could turn that into, say, 2G of "RAID RAM".

      --
      For the last time, PIN Number and ATM Machine are redundancies!
    3. Re:"RAID"-style system for RAM... by MattRog · · Score: 2, Informative

      No, not really.

      RAID-5 allows for disk failure via distributed block parity. ECC recovers single bit error.

      The "Memory RAID" design should prevent a larger issue (multi-bit/DIMM failure/etc. that ECC cannot prevent) from taking the whole system out.

      I would imagine that ECC memory would be used in conjunction with higher-level striping or mirroring to prevent and recover from both failures.

      --

      Thanks,
      --
      Matt
    4. Re:"RAID"-style system for RAM... by evil-barn · · Score: 2, Informative

      You can do this. My IBM x3550 servers (which are ancient) have this option. It's set by jumpers on the motherboard.

    5. Re:"RAID"-style system for RAM... by hemp · · Score: 1

      The "Memory RAID" design should prevent a larger issue (multi-bit/DIMM failure/etc. that ECC cannot prevent) from taking the whole system out

      There are Intel-based servers (IBM in my experience) that are reminiscent of mainframe systems in that they have two separate banks of memory to allow for redundancy.

      --
      Skip ------ See the latest from http://www.anArchyFortWorth.com
    6. Re:"RAID"-style system for RAM... by Fulcrum+of+Evil · · Score: 1

      It's cheaper to get 4G of ECC ram.

      --
      "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
    7. Re:"RAID"-style system for RAM... by citizenr · · Score: 1

      Dual Channel IS Raid0 for RAM

      --
      Who logs in to gdm? Not I, said the duck.
    8. Re:"RAID"-style system for RAM... by Anonymous Coward · · Score: 0

      Several Intel servers already have this feature, e.g. the HP DL380 G5: "Up to 64 GB PC2-5300 Fully Buffered DIMMs (DDR2-667) with 4:1 and 2:1 interleaving available, online spare and mirrored memory capabilities"

    9. Re:"RAID"-style system for RAM... by hvdh · · Score: 1

      You'd still need ECC RAM. Otherwise, when reading a broken page from 2 modules,
      you'd get different data, but wouldn't know which one is correct.
      With ECC modules, one of the pages read would fail the ECC check
      and you'd just use the other one.

    10. Re:"RAID"-style system for RAM... by jandrese · · Score: 1

      Or you could just buy ECC Memory and not destroy your memory bandwidth.

      --

      I read the internet for the articles.
    11. Re:"RAID"-style system for RAM... by afidel · · Score: 1

      IBM has a patent on RAID for memory, look up ChipKill. HP and IBM both use it, not sure about Dell.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    12. Re:"RAID"-style system for RAM... by Dwindlehop · · Score: 1

      /me waves hand.

      RAID is not the technology you are looking for.

      You are thinking of lockstep memory operation.
      http://en.wikipedia.org/wiki/Reliability,_Availability_and_Serviceability

      --
      Jonathan Pearce jonathan@pearce.name
      3EAAFB2A http://www.jonathan.pearce.name/
    13. Re:"RAID"-style system for RAM... by Apocros · · Score: 1

      is this thread still going...? probably not... still...

      The problem with this RAID memory scheme is: how do you know which memory chip is providing the correct data without some sort of encoded error correction data (eg ECC)? Drives store CRC data, so you can tell which one is providing wrong data (assuming it's not a drive-is-dead sort of issue). Without that error correction data though, you'd need at least a 3-way mirror (and go with the majority "vote") to know what any given byte of data was correctly supposed to be.

      I have to think this "mirroring" function is actually using some sort of Hamming code, rather than a direct duplication of bits. It wouldn't make any sense to do otherwise. It'd be interesting to see some documentation though.
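
      To make the 3-way "vote" concrete, here is a small sketch of a bitwise majority over three copies. It is purely illustrative (the function name is invented) and is not a description of how any real mirroring chipset works.

```python
def majority_vote(a: bytes, b: bytes, c: bytes) -> bytes:
    """Bitwise majority of three equal-length copies."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("copies must have equal length")
    # A result bit is 1 when at least two of the three copies have it set.
    return bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))

good = b"\x5a\xff\x00"
flipped = bytes([good[0] ^ 0x08]) + good[1:]  # one copy with a single flipped bit

print(majority_vote(good, flipped, good) == good)  # True
```

      With only two copies the same comparison can tell you that they disagree, but not which one to trust.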

      --
      "onward!" cried the copper man, little knowing brass corrupts...
  15. Want to confirm? Look at your bittorrent log. by sshir · · Score: 4, Interesting

    Seriously. If you download a lot, and I do, you see quite a few checksum mismatches in the log.
    Especially if the torrent is old. Some of them may be sabotage activity, but I doubt that, considering the kind of files.

    They are not transmission errors: TCP/IP checks for that. Not hard drive errors either: again, checksums. They can be intrasystem transmission errors though.

    I remember folks who did complete checkers wrote that they had a lot of them too.

    1. Re:Want to confirm? Look at your bittorrent log. by rdebath · · Score: 4, Interesting

      The TCP/IP checksums are really weak: only 16 bits, and a rather poor algorithm anyway. So more than one in 65 thousand errors will be undetected by a TCP/IP checksum. And that's not including buggy network adaptors and drivers that 'fix' or ignore the checksums.

      If you're transferring gigabytes of data you really need something a lot better.

      Still, that's probably not the most common source of errors. You see, the same problem exists when data is transferred across an IDE or SCSI bus: if there's a checksum at all, it's very weak, and the amounts of data transferred across a disk bus are scary.
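
      To illustrate how weak a 16-bit ones'-complement sum is, here is a sketch in the style of RFC 1071. It skips the TCP pseudo-header, so it is not a full TCP checksum, and the sample bytes are made up. Because addition doesn't care about order, swapping two 16-bit words goes unnoticed, while a SHA-1 hash of the same data changes.

```python
import hashlib

def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"            # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:             # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

original = b"\x12\x34\xab\xcd"
swapped = b"\xab\xcd\x12\x34"      # same 16-bit words, different order

print(internet_checksum(original) == internet_checksum(swapped))          # True
print(hashlib.sha1(original).digest() == hashlib.sha1(swapped).digest())  # False
```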

    2. Re:Want to confirm? Look at your bittorrent log. by pavon · · Score: 1

      That's interesting. If you were checking with a newer version of uTorrent, you may have been using UDP, and not TCP. They added UDP capability about a year ago, and I assume others have as well. I don't know if they do error correction on a per-packet basis or rely on block checksums.

    3. Re:Want to confirm? Look at your bittorrent log. by phantomcircuit · · Score: 4, Informative

      The checksum used by TCP is several orders of magnitude more likely to match a corrupted packet than the checksum used by bittorrent. (citation)

      More than likely these are transmission errors where the TCP checksum matched but the bittorrent checksum did not.

    4. Re:Want to confirm? Look at your bittorrent log. by noidentity · · Score: 1

      BitTorrent uses a different checksum algorithm than TCP/IP, one that's probably not nearly as thorough. Plus the fact that the BitTorrent checksum would have been verified on the receiving end, likely by reading the data from memory after the entire block/file was received. Are you suggesting that it got corrupted in the short time between being verified and written to the hard drive (or read from the hard drive and sent)?

    5. Re:Want to confirm? Look at your bittorrent log. by Anonymous Coward · · Score: 0

      I don't see where this study says that. Where in the study (which page) does it suggest that a cryptographic hash is less likely to find errors?

    6. Re:Want to confirm? Look at your bittorrent log. by Anonymous Coward · · Score: 0

      TCP checksum = 16 bits
      BitTorrent SHA1 checksum = 160 bits

      Additionally, I think you'd be hard pressed to find someone who thinks the cryptographic hash in question is less robust than the TCP checksum algorithm.
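
      For a sense of scale, here is the back-of-the-envelope arithmetic, under the simplifying assumption that a corrupted block ends up with an effectively random check value (real error patterns are not uniform, so treat this as a rough bound only):

```python
def pass_probability(check_bits: int) -> float:
    """Chance that a random check value happens to match, for an n-bit check."""
    return 2.0 ** -check_bits

p_tcp = pass_probability(16)    # TCP checksum
p_sha1 = pass_probability(160)  # BitTorrent SHA-1 piece hash

print(f"{p_tcp:.1e}")   # 1.5e-05
print(f"{p_sha1:.1e}")  # 6.8e-49
```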

    7. Re:Want to confirm? Look at your bittorrent log. by Vegeta99 · · Score: 1

      This is off topic, but man do I got karma to burn.

      How the HELL did you make that link? I'm a Penn Stater, and any time I save a link to something I found using one of PSU's online sources, the link doesn't work as soon as my session ends. Instead, I gotta write a damn proper citation so I can use Citeseer to find it later!

    8. Re:Want to confirm? Look at your bittorrent log. by Anonymous Coward · · Score: 0

      Too bad bittorrent uses SHA-1 hashes. :)

    9. Re:Want to confirm? Look at your bittorrent log. by Anonymous Coward · · Score: 0

      No.

    10. Re:Want to confirm? Look at your bittorrent log. by Mr+Z · · Score: 1

      You're confusing the different layers. The CRC-32 at the link layer is at a layer below the IP datagram. The TCP checksum catches errors between TCP and the link layer. Bittorrent's checksum checks at a layer above that, catching errors that may have occurred before the torrent's packets even entered the TCP/IP stack at the sending side. The authors of the paper you linked even recommend application-level error checking on the first page.

    11. Re:Want to confirm? Look at your bittorrent log. by phantomcircuit · · Score: 1

      I just included the session in the url. Posting right now the link does not work.

    12. Re:Want to confirm? Look at your bittorrent log. by phantomcircuit · · Score: 1

      Well you're right that with bittorrent there are three checksum levels, link-layer, tcp, and application; regardless though the checksums used in most link-layer protocols are as bad or worse than the checksum used by tcp.

      So like I said, more than likely the bittorrent mismatches are actually transmission errors in which the relatively weak checksums used by the link-layer and tcp managed to match, while the cryptographic strength checksum used by bittorrent catches the mismatch.

    13. Re:Want to confirm? Look at your bittorrent log. by Mr+Z · · Score: 1

      Then what exactly did you mean when you said this?

      The checksum used by TCP is several orders of magnitude more likely to match a corrupted packet than the checksum used by bittorrent.

      On one hand you're saying TCP's checksum is orders of magnitude more likely to catch the error, and on the other hand you're saying Bittorrent's is. Pick one.

      --Joe

    14. Re:Want to confirm? Look at your bittorrent log. by Mr+Z · · Score: 1

      Ugh... I see my error. I read that umpteen times as "catch a corrupted packet", not "match a corrupted packet." The unusual wording threw me off. I would have worded it as "let a corrupted packet slip by" or "miss a corrupted packet."

      Sorry for the confusion.

    15. Re:Want to confirm? Look at your bittorrent log. by phantomcircuit · · Score: 1

      No problem.

    16. Re:Want to confirm? Look at your bittorrent log. by phantomcircuit · · Score: 1

      Try re-reading my sentence.

      If the checksum of a corrupted packet matches the calculated checksum, has an error occurred?

    17. Re:Want to confirm? Look at your bittorrent log. by Mr+Z · · Score: 1

      *d'oh* our replies crossed. Again, my apologies.

  16. Corsair by Anonymous Coward · · Score: 0

    I had a pair of Corsair sticks that caused me months of grief. I would get kernel panics that gave absolutely no indication that memory was to blame, and memory tests and stress tests were never able to reproduce the problem. After 9 months I decided to try ignoring all indications that something else was wrong and bought new RAM. Sure enough, 12 months since then, and I haven't had a single problem. I suspect it's an issue to do with timing more than the storage medium itself, which supports Google's theory that it's often caused by bad motherboard design.

  17. Re:gulff by K.+S.+Kyosuke · · Score: 0, Offtopic

    Well, hot grits have never been too rich, not even those you have been keeping in your pants all the time.

    --
    Ezekiel 23:20
  18. NOT A PROBLEM HERE !! by Anonymous Coward · · Score: 0

    I run Linux so I am immune to this. Besides, I don't even use ECC so I have ZERO CE counts, so I am immune to that problem as well.

  19. Radiation Effects by Maximum+Prophet · · Score: 4, Interesting

    At Purdue, many years ago, one of the engineers mapped the ECC RAM errors in a room with hundreds of sparc stations and found that it was mostly in a cone shape pointed toward the window. That window looked out to a pile of coal, so the culprit was assumed to be low level alpha radiation.

    --
    All ideas^H^H^H^H^Hprocesses in this post are Patent Pending. (as well as the process of patenting all postings)
    1. Re:Radiation Effects by imsabbel · · Score: 1

      Well.
      Bullshit.

      Sorry, but true. Look up alpha radiation if you want to know why.

      --
      HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
    2. Re:Radiation Effects by Spliffster · · Score: 1

      BOFH excuse #345: Stray Alpha Particles from memory packaging^H^H^H^H^H^Hpile of coal caused Hard Memory Error on Server.

    3. Re:Radiation Effects by SleazyRidr · · Score: 1

      For some reason that post gave me the mental image of a bloke named 'low level alpha radiation' climbing out the window and hiding in a pile of coal...

    4. Re:Radiation Effects by QuestionsNotAnswers · · Score: 1

      Do students prefer to use machines that have a view out of the window?

      Did the sun shine in the window during the day?

      Was the error distribution at night the same?

      Was the pile offcenter? (the cone depends upon location)

      If the engineer presumed it was Alpha (or Beta, or Neutron) radiation then that is not a good sign they are that clued up. Although gamma could be a possibility.

      --
      Happy moony
  20. Mainboards by conureman · · Score: 1

    Alrighty then, which mainboards have the lowest error rates? TFA seems to have obfuscated that. That's MSs job, I thought Google was supposed to Do No Evile?

    --
    The cost of that cleanup, of course, will be borne by taxpayers, not industry.
    1. Re:Mainboards by DomNF15 · · Score: 1

      Wouldn't be very useful info to anyone buying consumer level products as the boards in question are server grade. Also, Google saying that company X's boards are more failure prone could get them into trouble. Furthermore, if you have a server farm/data center, you should do your own research, but barring that, shouldn't expect others to do it for you for free.

    2. Re:Mainboards by conureman · · Score: 1

      I rely on /. for all my free research. Thanks.

      --
      The cost of that cleanup, of course, will be borne by taxpayers, not industry.
    3. Re:Mainboards by jarden_from_cerberus · · Score: 1

      Alrighty then, which mainboards have the lowest error rates?

      According to the article, all of Google's machines are using boards from Gigabyte, which is definitely insufficient information to draw any sort of conclusion about which boards are the most error-prone.

    4. Re:Mainboards by cpghost · · Score: 1

      According to the article, all of Google's machines are using boards from Gigabyte (...)

      Google's machines consist of regular single-PC boards? How quaint!

      Seriously, doesn't Google use high performance servers, which are an entirely different type of beast anyway?

      --
      cpghost at Cordula's Web.
    5. Re:Mainboards by conureman · · Score: 1

      Thanks, I RTFA but missed that. I keep hearing good things about Gigabyte boards but haven't bought one yet. I got one a few years back from a guy who thought it was sketchy, but I built it up with a good PS (which was what looked sketchy in his application) and it worked fine for me. I ALMOST got one last time around, but couldn't quite overcome my prejudice, and got another ASUS. I'm way overdue to upgrade, not that I need to, it's just that "ooh shiny!" compulsion thing.

      --
      The cost of that cleanup, of course, will be borne by taxpayers, not industry.
    6. Re:Mainboards by aXis100 · · Score: 1

      Google uses custom trays fitted out with a 12V desktop motherboard & 12V SLA battery, and everything is velcro'd in place. The density is high, and the cost is very low. They don't need expensive servers because they maintain redundancy at the "whole computer" level.

  21. Error by Anonymous Coward · · Score: 0

    Um, what was the topic again? My memory isn't what it used to be.

  22. clearly not a radiation engineer by SuperBanana · · Score: 5, Insightful

    That window looked out to a pile of coal, so the culprit was assumed to be low level alpha radiation.

    Alpha radiation is stopped by a sheet of office paper. It certainly wouldn't make it through the window, through the machine case, electromagnetic shield, circuit board, chip case, and into the silicon. Even beta radiation would be unlikely to make it that far.

    What is much more likely: thermal effects. IE, infrared from the sun heating up machines near the window.

    1. Re:clearly not a radiation engineer by networkBoy · · Score: 1

      beta would be believable though (as opposed to alpha).

      I tend to agree thermal might be the culprit, specifically the delta T not the absolute T. It is the act of changing temperature that harms PCs the most, not the temp that they settle at. As the temperature changes different materials (FR4, lead/tin solder, copper, plastic) expand/contract at different rates. This change causes poor signal connections, and as RAM is likely the most sensitive (socketed rather than soldered) this would explain the bit errors.
      -nB

      --
      whois gawk date unzip strip find touch finger mount join nice man top fsck grep eject more yes exit umount sleep dump
    2. Re:clearly not a radiation engineer by paradigm82 · · Score: 1

      Right, however alpha radiation is the suspected cause of some DRAM failures. The problem is the alpha radiation emitted by trace radioactive contaminants (uranium and thorium, for example) in the packaging of the module itself.

    3. Re:clearly not a radiation engineer by John+Hasler · · Score: 1

      > beta would be believable though (as opposed to alpha).

      Beta particles (electrons) are slightly more penetrating than alphas but they still would never make it from the coal pile to the computers.

      --
      Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
  23. In that case, are these results useful at all? by pavon · · Score: 1

    Which really makes me question whether these results have any validity outside of google. The study found that the majority of errors appeared to be related to the motherboard, but didn't list any information about the motherboards in use. If they are all custom built for google, then there is absolutely no way for any of us to know whether the error rate they exhibited is representative of what you'd get from average COTS server-grade motherboards currently on the market. Thus these results are meaningless to anyone who uses different motherboards, ie everyone but google.

  24. Lessons learned from *Non* ECC RAM by Rashkae · · Score: 3, Insightful

    My takeaway from this paper is that maybe google should hire more technicians who are experienced with non-ecc ram systems. They even believed, prior to this study, that soft errors were the most common error state. I could have told you from the start that was bunk. In over 15 years of burn-in tests as part of pc maintenance, the number of soft-errors observed is... 0. Either the hardware can make it through the test with no error, or there is a DIMM that will produce several errors over a 24 hour test. This doesn't mean that random soft errors never happen when I'm not looking/testing, but the 'conventional wisdom' that soft errors are the predominant memory error doesn't even pass the laugh test.

    From looking at the numbers on this report, I get the feeling that hardware vendors are using ECC as an excuse to overlook flaws on flaky hardware. I would now be really interested in a study that compares the real world reliability of ECC vs non-ECC hardware that has been properly QC'd. I'll wager the results would be very interesting, even if ECC still proves itself worth the extra money.

    1. Re:Lessons learned from *Non* ECC RAM by Anonymous Coward · · Score: 0

      I think it costs far more to properly QC than to use ECC ...

    2. Re:Lessons learned from *Non* ECC RAM by Anonymous Coward · · Score: 0

      I'm guessing your burn-in test is with things like MemTest? That kind of test writes to memory and immediately reads back the data to see if it's correct. It cannot detect soft errors, so no wonder you never saw one.

    3. Re:Lessons learned from *Non* ECC RAM by DigiShaman · · Score: 1

      Memtest86 can detect soft-errors, but you have to run Test #9 manually. Below is their description of what it does.

      The bit fade test initializes all of memory with a pattern and then sleeps for 90 minutes. Then memory is examined to see if any memory bits have changed. All ones and all zero patterns are used. This test takes 3 hours to complete. The Bit Fade test is not included in the normal test sequence and must be run manually via the runtime configuration menu.

      If a soft-error were to occur, it would have happened within that 90 minute test window. However, it won't tell you *why* it happened.
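
      The idea behind the bit fade test can be sketched in a few lines. This is only a user-space toy, not Memtest86's implementation: the function name is made up, the OS and CPU caches sit between the script and the DRAM, and it only touches whatever buffer Python happens to allocate.

```python
import time

def bit_fade_check(size_bytes: int, hold_seconds: float) -> int:
    """Fill a buffer with a pattern, wait, and count bytes that changed."""
    mismatches = 0
    for pattern in (0x00, 0xFF):               # all zeros, then all ones
        buf = bytearray([pattern]) * size_bytes
        time.sleep(hold_seconds)               # Memtest86 sleeps 90 minutes here
        mismatches += size_bytes - buf.count(pattern)
    return mismatches

# Short demo run; a real bit fade test holds each pattern far longer.
print(bit_fade_check(1 << 20, 0.1))
```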

      --
      Life is not for the lazy.
    4. Re:Lessons learned from *Non* ECC RAM by Slashcrap · · Score: 1

      In over 15 years of burn-in tests as part of pc maintenance, the number of soft-errors observed is... 0.

      I've spent 15 years of my career doing repetitive, low level PC maintenance, now let me tell you why Google need to hire better engineers...

    5. Re:Lessons learned from *Non* ECC RAM by Rhys · · Score: 1

      Believe what you like about google's engineers, but speaking as someone who runs a rather large cluster with a rather large number of ECC DDR sticks I can tell you that yes you'll see a lot of soft errors. Not only will you see them, but you can detect them in ways other than looking at the reports to the OS. ECC error correction kicking in has a higher latency, so you can see differences in gflops between machines with problems and machines that are clean.

      Our terminal hard error rate was way below the soft error rate.

      --
      Slashdot Patriotism: We Support our Dupes!
    6. Re:Lessons learned from *Non* ECC RAM by Rashkae · · Score: 1

      I think you are confusing the terminology used by google in this report. If you have specific machines or DIMMs causing problems, those are classified by google as hard errors (ie, an error that is caused by poorly performing or outright defective hardware). The so-called soft errors are errors that can randomly 'just happen' anywhere.

      A terminal error, in google terms, is called an Uncorrectable error, and they make no effort to determine if it's soft or hard, the machine is simply removed and the affected DIMM replaced.

  25. Bad PCB layout ? by Anynomous+Coward · · Score: 1

    The length difference between both traces for differential clocks to ram must be less than 10 mils (1/4th of a millimeter) so that travel time of both signals is matched to within 2 picoseconds. Think about it. With such stringent requirements, a marginal(*) PCB design can easily cause errors once in a while.

    (*) a PCB design where corners have been cut would actually be a good design in this case. Unless Google does not mount the PCBs in a case. But I digress.
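
      The arithmetic behind those numbers, as a quick sketch. The propagation delay of roughly 170 ps per inch is an assumed ballpark for a trace on FR4, not a figure from the post or from any datasheet; the real value depends on the stackup.

```python
MIL_IN_INCHES = 0.001
MM_PER_INCH = 25.4
PS_PER_INCH = 170.0   # assumed propagation delay for a trace on FR4

def skew_ps(length_mismatch_mils: float, ps_per_inch: float = PS_PER_INCH) -> float:
    """Timing skew caused by a trace length mismatch given in mils."""
    return length_mismatch_mils * MIL_IN_INCHES * ps_per_inch

print(round(10 * MIL_IN_INCHES * MM_PER_INCH, 3))  # 0.254 (mm)
print(round(skew_ps(10), 2))                       # 1.7 (ps)
```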

    --
    I'm not a coward by any name.
  26. Difficult to find parts that support ECC by RAMMS+EIN · · Score: 4, Interesting

    When I was building the computer I'm typing this on, I had the grand idea of building it with so much RAM that I could basically work from RAM. Meaning, for example, that all my running programs and the project I was working on would have to fit in RAM.

    Of course, with such a dream, I was concerned about the reliability of my memory. So I wanted ECC. I found out that having ECC memory is not just a matter of buying ECC memory. There are different kinds of ECC memory, and you need to find a combination of memory, motherboard, and CPU that works together. Many sites that offer CPUs and/or motherboards don't list support for ECC among the specifications. Searching for it is difficult, because searching for "ECC" also returns hits for things like "non-ECC" and "ECC: no".

    Finally, I found a combination of motherboard and CPU that would support unbuffered ECC DDR2, and a matching pair of memory modules to go with it. And then, when I got all the parts, the RAM didn't fit in the motherboard. Turns out the RAM was FB-DIMM, which had not been listed in the advertisement. I gave up and just bought 2GB of non-ECC RAM to just get the system working. The FB-DIMM (all 8GB of it) is still sitting here, because I haven't found anyone who wants to buy it from me.

    Lessons learned: 1. The saying "the nice thing about standards is that there are so many to choose from" is still relevant. I don't know why there have to be so many hardware interfaces to memory chips, but there are, so be careful. 2. Apparently, nobody really cares about ECC RAM, otherwise information would be easier to find. 3. Apparently, AMD CPUs and matching motherboards more usually support ECC RAM than Intel parts and matching motherboards.

    --
    Please correct me if I got my facts wrong.
    1. Re:Difficult to find parts that support ECC by Carnildo · · Score: 1

      Finally, I found a combination of motherboard and CPU that would support unbuffered ECC DDR2

      That's why you were having trouble: ECC memory is almost invariably buffered.

      --
      "They redundantly repeated themselves over and over again incessantly without end ad infinitum" -- ibid.
    2. Re:Difficult to find parts that support ECC by DigiShaman · · Score: 1

      I can't vouch for the quality and performance of Crucial's memory. However, their product selector has never failed me. All you have to do is choose your make/model PC or motherboard. They will then filter out what modules are compatible. They even go so far as to tell you how many slots you will be working with and how to populate them. Very nice!

      www.crucial.com

      I often direct clients to purchase memory from them when they want to upgrade an old computer. It's simply not worth buying "generic" memory and risk spending a long afternoon (and large service bill) futzing around with it.

      --
      Life is not for the lazy.
    3. Re:Difficult to find parts that support ECC by value_added · · Score: 1

      Lessons learned: ...

      I'd suggest replacing Lessons Nos. 1-3, with One Lesson To Rule Them All:

      Buy quality hardware.

      In your case, what you should have been looking for is a "server board". The cheapest offering (often marketed as "entry level server board") would probably have been more than fine, and represented a substantial step up from consumer-grade hardware. Support for ECC RAM, among other things, is assumed.

    4. Re:Difficult to find parts that support ECC by citizenr · · Score: 1

      Actually the only lesson is don't build computers if you don't know enough about them.
      You don't need magic "matching" motherboards for AMD. The chipset DOESN'T touch RAM in the AMD world.

      --
      Who logs in to gdm? Not I, said the duck.
    5. Re:Difficult to find parts that support ECC by Anonymous Coward · · Score: 0

      AMD processors have been using on-die memory controllers for a number of years. Usually the support is limited to the non-budget line such as the current Phenom II and Athlon II X2.

  27. No cosmic ray effect? by Anonymous Coward · · Score: 0

    Did they consider the increase in cosmic rays during the lull in solar activity? They say the rate hasn't changed, but is that due to chips becoming more resistant to errors in recent years? Or does a cosmic ray tend to produce only one error on a chip, and denser chips have kept the error rate the same despite more errors?

  28. Re:Frying eggs? by conureman · · Score: 1

    Damn! I think my mainboard is set to power off before it gets that hot, maybe even if the CPU gets up there, IIRC. But I'm not stuck in a server room, anyways.

    --
    The cost of that cleanup, of course, will be borne by taxpayers, not industry.
  29. Memtest86 by Anonymous Coward · · Score: 0

    Does anyone know if these ECC corrected errors are errors that Memtest86 would have caught?

    1. Re:Memtest86 by Rashkae · · Score: 1

      Memtest would catch the errors that occur on its watch (assuming the same error happened on non-ECC memory, and was therefore not corrected by the hardware before memtest even sees it). However, memtest does not detect the errors that happen when it's not running, which should be the point of ECC: think of it as an always-on memtest that keeps your PC going even in the face of failures.

      What I see here that I find odd, however, is that Google (and presumably other large data centres) were operating under the presumption that memory errors are just normal, and that it's fine to keep going so long as the ECC is able to correct them. That's what I find hard for my small system mindset to comprehend. In my world, when hardware looks like it's unreliable, I schedule a replacement.

  30. One of the reasons I used Intel motherboards by Anonymous Coward · · Score: 0

    was because they qualified memory against the motherboard and would tell us what memory to use by maker's part number. We used ECC and I don't recall any memory problems in hundreds of machine years of use. Of course every machine was burned in for days.

    Not the cheapest way but, when you pay, you can pay for the quality parts up front or pay for the maintenance costs in the field.

    Since parts are cheaper than people...

    Of course that meant you had to be willing to take the long term view of profit being based on customer satisfaction.

  31. DRAM errors? by countertrolling · · Score: 1

    Or entropy? We just discovered the same about autism and climate change. What's up? We've been working with one eye closed all this time?

    --
    For justice, we must go to Don Corleone
    1. Re:DRAM errors? by RadioheadKid · · Score: 1

      We've known about them for a long time, why do you think there are three computers on the space shuttle...

      --
      "Karma can only be portioned out by the cosmos." -Homer Simpson
    2. Re:DRAM errors? by countertrolling · · Score: 1

      ...why do you think there are three computers on the space shuttle...

      Because four could produce a tied vote?

      --
      For justice, we must go to Don Corleone
  32. Dust by Anonymous Coward · · Score: 0

    Working at a computer surplus I've seen examples of systems that failed a memory test (occasionally) or crashed during Ubuntu install* (more common). Usually someone missed a blown cap, but the second most common cause is dust: I'll pull the RAM, blow the dust out of the DIMM slots, pop the RAM back in, and it's fine. (I do put it back in the same slots, so I'm not just moving a bad bit around in system memory. And I did run memtest86 the first few times this worked for me, just to make sure I wasn't just smoking crack.) Some of these have those huge dust bunnies, but the worst is actually the finer dust. I'm assuming it's slightly conductive.

    *Our Ubuntu install volume is so high we wore out close to 30 CDs a month and I got sick of burning them all the time. So, it is a netbooted automated Ubuntu 8.04 install... it's great for volume install to just be able to PXE boot a box and walk away (this is using the alternate installer and a preseed file). As a bonus it's enough of a burn-in that it seems to catch most flaky systems.

  33. What about soft errors due to chip packaging? by RadioheadKid · · Score: 1

    I find conclusion 7 a bit presumptuous. Soft errors are also caused by alpha particles emitted by contaminants in a chip's packaging in addition to cosmic rays. You could imagine that certain DIMMs might have lower quality (i.e. more contaminated) packaging than other DIMMs.

    --
    "Karma can only be portioned out by the cosmos." -Homer Simpson
  34. How to enable ECC after booting... by Anonymous Coward · · Score: 2, Informative
    Here's the technique I use on Linux, for a K10. The scrubber can be accessed via the PCI config space of vendor:device 1022:1203, using registers starting at offset 0x40, just after the 64-byte standard PCI config space.
    • Turn off ECC error reporting with the low 3 bits of 40.L
    • Turn on ECC (bit 22 of 44.L)
    • Set the scrub address to 0 (64 bits in 5C.L and 60.L), with the lsbit set to 1 (write back after correction)
    • Set the scrub rate to the maximum of 64 bytes/40 ns (1.5 GiB/s) using lsbyte of 58.L
    • Set the L1, L2, and L3 cache scrub rates to the AMD-recommended values (other bytes of 58.L).
    • Wait 6 seconds (5.37, actually) for 8 GiB of memory to be scrubbed
    • Set the scrub rate to 2^13 times less (0.66 GiB/hour) to scrub 8 GiB every 12 hours
    • Enable ECC error reporting.

    The commands to do this are:

    • setpci -v -d 1022:1203 40.L=0:3 44.L=00400000:00400000 5C.L=1,0 58.L=0F121001
    • sleep 6
    • setpci -v -d 1022:1203 58.L=0E:FF 40.L=3:3

    You can watch the scrub address register incrementing using
    setpci -d 1022:1203 60.L 5C.L

    Similar commands work on the K8 (single-core Athlon 64), but the device is :1103, and leave the msbyte of 58.L alone (there is no L3 cache scrubber).
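
    The timing figures in the list above (5.37 seconds for the fast pass, roughly 12 hours for the slow one) can be checked with a little arithmetic. This is only a sketch: it assumes, based on the numbers quoted in this comment, that scrub rate code 0x01 means one 64-byte line every 40 ns and that each higher code halves the rate, so 0x0E is 2^13 times slower. The function names are made up for illustration and nothing here touches hardware.

```python
# Sanity-check the scrub timings quoted in the comment above.
# Assumption (inferred from the comment, not from AMD documentation):
# rate code 0x01 scrubs 64 bytes every 40 ns, and each increment of the
# code doubles the interval between scrubbed lines.

LINE_BYTES = 64          # bytes scrubbed per step
BASE_INTERVAL_NS = 40    # interval at rate code 0x01
GIB = 2 ** 30


def scrub_bytes_per_second(rate_code: int) -> float:
    """Scrub throughput in bytes/second for a given rate code (>= 1)."""
    if rate_code < 1:
        raise ValueError("rate code must be at least 1")
    interval_ns = BASE_INTERVAL_NS * 2 ** (rate_code - 1)
    return LINE_BYTES * 1e9 / interval_ns


def full_pass_seconds(memory_bytes: int, rate_code: int) -> float:
    """Time to scrub the whole of memory once at the given rate code."""
    return memory_bytes / scrub_bytes_per_second(rate_code)


if __name__ == "__main__":
    mem = 8 * GIB
    fast = full_pass_seconds(mem, 0x01)
    slow = full_pass_seconds(mem, 0x0E)
    print(f"fast pass: {fast:.2f} s at {scrub_bytes_per_second(0x01) / GIB:.2f} GiB/s")
    print(f"slow pass: {slow / 3600:.1f} h at "
          f"{scrub_bytes_per_second(0x0E) * 3600 / GIB:.2f} GiB/hour")
```

    Under those assumptions the fast pass over 8 GiB takes about 5.37 seconds, which is why the `sleep 6` leaves a small margin before the rate is turned down.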

  35. So... by zogger · · Score: 1

    ...who makes a good board that is tested adequately?

    1. Re:So... by klashn · · Score: 0

      The company I work for does make good boards that are tested adequately. We use a multitude of internal and external tools to make sure that logic and electricals are within margin. We work with different DIMM vendors to ensure that our BIOSes work with all vendors. We have even added a DIMM training mechanism to our hardware to ensure the best margin for EACH cold boot.

    2. Re:So... by Nethead · · Score: 2, Insightful

      Then, for leaping gods sake, tell us who you work for!

      --
      -- I have a private email server in my basement.
    3. Re:So... by Austerity+Empowers · · Score: 1

      I bet not every board your company sells is tested equally well.

  36. Kingston makes unbuffered ECC RAM. DDR3-1600? by Anonymous Coward · · Score: 0
    When I was looking for 2 GiB sticks of DDR2-800 for my servers, the only source was Kingston. Look for part numbers KVR800D2E5/2G (1 stick) and KVR800D2E5K2/4G (2 sticks). For DDR3, one 4 GiB stick of DDR3-1333 unbuffered is KVR1333D3E9S/4G. 2 or 3 sticks would be KVR1333D3E9SK2/8G and KVR1333D3E9SK3/12G, but those SKUs don't seem to exist.

    Because Intel has been driving DDR3 adoption, their consumer stuff omits ECC support, and they don't officially support DDR3-1600 (even though people are running DDR3-2000 quite successfully), I haven't found DDR3-1600 ECC anywhere. Pointers would be appreciated.

    In general, it's KVR<speed>D<ddr#>E<CAS>, where D stands for double-sided and E for ECC. Suffixes after that are S (thermal sensor), K<n> (kit of n DIMMs), and /<x>G (total size). Note K2/4G is 2x2G sticks, not 2x4G.
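
    For anyone who wants to check a batch of part numbers before ordering, the naming scheme described above can be turned into a small parser. This is a sketch that follows only the pattern given in this comment (KVR, speed, D, DDR generation, E, CAS, optional S, optional K<n>, then /<x>G); it is not Kingston's official decoder, the names `KvrPart` and `parse_kvr` are invented for the example, and part numbers outside that pattern simply fail to parse.

```python
import re
from typing import NamedTuple, Optional


class KvrPart(NamedTuple):
    """Fields of a Kingston ValueRAM ECC part number, per the comment above."""
    speed: int            # e.g. 800 or 1333
    ddr_gen: int          # the digit after "D", e.g. 2 or 3
    cas: int              # the digits after "E"
    thermal_sensor: bool  # "S" suffix present
    kit_size: int         # n from "K<n>", 1 if absent
    total_gib: int        # x from "/<x>G": size of the whole kit

    @property
    def per_stick_gib(self) -> float:
        # K2/4G means two 2G sticks, so divide the total by the kit size.
        return self.total_gib / self.kit_size


_KVR_RE = re.compile(
    r"^KVR(?P<speed>\d+)D(?P<ddr>\d)E(?P<cas>\d+)"
    r"(?P<sensor>S)?(?:K(?P<kit>\d+))?/(?P<total>\d+)G$"
)


def parse_kvr(part_number: str) -> Optional[KvrPart]:
    """Return the decoded fields, or None if the string doesn't fit the pattern."""
    m = _KVR_RE.match(part_number)
    if m is None:
        return None
    return KvrPart(
        speed=int(m.group("speed")),
        ddr_gen=int(m.group("ddr")),
        cas=int(m.group("cas")),
        thermal_sensor=m.group("sensor") is not None,
        kit_size=int(m.group("kit") or 1),
        total_gib=int(m.group("total")),
    )


if __name__ == "__main__":
    for pn in ("KVR800D2E5/2G", "KVR800D2E5K2/4G", "KVR1333D3E9SK3/12G"):
        print(pn, parse_kvr(pn))
```

    The `per_stick_gib` property is there specifically to avoid the K2/4G trap: the size after the slash is for the whole kit, not for each DIMM.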

  37. This is why God invented ECC memory by kriston · · Score: 1

    This is why God invented ECC memory.
    Honestly, if I have to rebuild a Postgres database one more time I'm going to puke.
    Even with ECC memory, gamma-ray- and neutrino-induced ECC memory errors cause our generic x86 systems to corrupt memory, and thus corrupt my database. Half the time it corrupts the system table indices which are always kept as a memory-mapped file. Somehow it manages to corrupt the tables themselves.
    This is one of those things that the kids at Postgresql.org have no solution for.

    My solution is either to use Sun hardware or an x86 server with a recognizable brand name on it and equally recognizable brand-name memory, but that simply cannot happen because of who procures our systems and how.

    Incidentally, I have syslogs full of successfully recovered ECC errors on Sun Solaris machines. Even the non-recoverable ones have not once induced a data loss. All we need to do in these cases is swap the memory module and all is well.

    However, I do not have even ONE line of evidence of a recovered ECC memory error from ANY of our generic x86 Linux machines. All we need to do is restore the database from a backup. It's usually corrupted beyond repair.

    Why does Google so often seem to discover things the rest of us already knew and write "whitepapers" about them like they've stumbled upon some big, new discovery?

    I say to Google: next time give AOL or IBM a call before you publish.

    --

    Kriston

    1. Re:This is why God invented ECC memory by Slashcrap · · Score: 1

      This is why God invented ECC memory.

      They're using ECC memory, idiot.

      Even with ECC memory, gamma-ray- and neutrino-induced ECC memory errors cause our generic x86 systems to corrupt memory, and thus corrupt my database. Half the time it corrupts the system table indices which are always kept as a memory-mapped file. Somehow it manages to corrupt the tables themselves.

      However, I do not have even ONE line of evidence of a recovered ECC memory error from ANY of our generic x86 Linux machines. All we need to do is restore the database from a backup. It's usually corrupted beyond repair.

      If our database was regularly corrupting itself and the DBA kept blaming memory errors even though there was no evidence of memory errors, I think that DBA would soon be looking for alternative employment. Honestly, your story sounds pretty fishy.

      I say to Google: next time give AOL or IBM a call before you publish.

      I'm an incompetent DBA who can't even read articles, allow me to tell you what Google are doing wrong...

      Fuck this site.

  38. Past time for UNIVERSAL ECC & better QC. by Anonymous Coward · · Score: 0

    Good heavens, RAM is cheap enough these days, just pass through the extra 6% or 12% mark-up that it would cost to add a few ECC bits to a DIMM and make the same DIMM type STANDARD for consumer / workstation as well as server products. Overall it would reduce the cost of server memory and have only a slight impact on desktop memory due to the higher and more stable market volume benefits as well as lower costs due to warranty / QA issues.

    It was considered important enough for servers to have ECC back when they might have 512MBy RAM. Now a typical mid range desktop workstation will have 6GBy or so and a 64 bit OS, and servers and power user workstations several times that much.

    Given the vastly increased use of 'sleep' capability in workstations the average "uptime" before reboot / reload is likely to be in the range of several months for many systems, especially as patches less often require reboots. That is plenty of time for RAM to get an error due to radiation, EMI, a motherboard glitch, et al. and cause corrupted data or system instability.

    It is especially compelling now that so much ram is used for disk cache, and the average disk drive for a mid range consumer system is measured in the terabyte range. So there's a good likelihood that RAM errors will affect not only immediate calculations / system stability but also be persisted to disc from a write buffer and thus cause permanent corruption. Of course most common filesystems (thanks ZFS!) don't do any kind of buffer / block checksumming and are especially susceptible to happily passing on corrupted I/O buffer data between the disc and the filesystem.

    A little RAM ECC would go a long way toward making PCs 'appliances' instead of temperamental geek toys that still often enough crash or glitch. Although the average person may not always see how close to the edge of corruption / failure a typical system is, if you run something like memtest86 or the test mode of prime95 for a week or so (if it even lasts that long) it is very common to see data corrupting glitches on many systems running at stock factory "stable" configurations.

    GPUs are even worse than PC motherboards in this respect, they need ECC ASAP if we're going to rely on GPGPU as anything useful to deliver accurate long running computations at bleeding edge clock speeds.

  39. Reminds me of that old tale... by marciot · · Score: 1

    "Is there anything I can place in my AUTOEXEC.BAT to prevent memory errors? A software patch or something?"

    (if you don't know what I am talking about, google NOSMOKE.EXE. Funny read)

  40. Re:Kingston makes unbuffered ECC RAM. DDR3-1600? by klashn · · Score: 0

    I am currently testing DDR3-1600 ECC and non-ECC DIMMs in my lab. They are not on the market yet; these are direct samples from the vendors. Those Kingston DIMMs actually use Elpida parts, similar to the Crucial memory using Micron parts.

  41. Could it be caused by overheating? by dzfoo · · Score: 1

    Perhaps they should consider putting back the A/C units in their data centers. I'm just saying.

            -dZ.

    --
    Carol vs. Ghost
    ...Can you save Christmas?
  42. Comment removed by account_deleted · · Score: 1

    Comment removed based on user account deletion

  43. Numerator or denominator? by Mr+Z · · Score: 1

    Hmmm... The lottery has astronomical odds, which means I have a microscopic chance of winning.

  44. I'm just talking about Google... by Anonymous Coward · · Score: 0

    Hard errors may be the most common failure type. The DIMMs themselves appear to be of good quality, and bad mobo...

    Shut yo' mouth!

  45. Wrong by Anonymous Coward · · Score: 0

    This paper is a fraud, as said before google doesn't use ecc memory.

  46. how about letting the OS do it? by bill_mcgonigle · · Score: 1

    I've always thought it would be a nice-to-have feature for my home system to have ECC -

    Anybody know of a linux kernel module that will fake ECC on a regular system? Yeah, I know, it'll be slower.

    --
    My God, it's Full of Source!
    OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
  47. Re:Kingston makes unbuffered ECC RAM. DDR3-1600? by RAMMS+EIN · · Score: 1

    Thanks, that helps a lot. The memory I bought is actually Kingston, but I apparently didn't know enough about how to read their part numbers to tell that it was FB-DIMM instead of DDR2 SDRAM. The whole exercise sure taught me to look up the part numbers in the future, and not go by the description alone!

    --
    Please correct me if I got my facts wrong.