Slashdot Mirror


Open Compute Project Comes Under Fire

judgecorp writes: The Open Compute Project, the Facebook-backed effort to create low-cost open source hardware for data centers, has come under fire for a slack testing regime. The criticism was first aired at The Register, where an anonymous test engineer described the project's testing as a "complete and total joke." The founding director of the project, Cole Crawford, has penned an open letter in reply. The issue seems to be that the testing for standard highly reliable hardware used by telcos and the like is very thorough and expensive. Some want the OCP to use more rigorous testing to replicate that level of reliability. Crawford argues that web-scale data centers are designed to cope with hardware failures, and "Tier 1" reliability would be a waste of effort.

86 comments

  1. Smells like astroturf. by An Ominous Coward · · Score: 3, Insightful

    Probably Cisco trolling against a movement that's going to put them out of business.

    Sooner the better, I say.

    1. Re:Smells like astroturf. by DraconPern · · Score: 1

      Never heard of datacenterdynamics. Are they even legit?

  2. Autism ... Autism Everywhere. by Anonymous Coward · · Score: 3, Funny

    Some people just have to get a burr up their ass [arse] about everything.

    Wait, Register is still up? Do they still say 'boffin' every paragraph? I couldn't bear to click through.

    1. Re: Autism ... Autism Everywhere. by Anonymous Coward · · Score: 0

      Don't forget using "Middle Kingdom" as a slant.

  3. Sort of.. by Anonymous Coward · · Score: 0

    I'll agree, to a point, with the "web-scale data centers are designed to cope with hardware failures" bit, but when you are standing there with an internal customer shitting a brick because their product or project that is supposed to be mission critical is on COTS hardware and has no redundancies built in, you tend to wonder why something of this nature was done. Especially with networking gear, I've seen a lot of companies use cheap hardware, throw a lot of it at the problem, and when something large scale happens (usually once a year with this stuff) everyone starts asking questions, and you shrugging isn't a good enough answer even though you didn't buy it, didn't test it, and were forced to use it, maintain it, and take responsibility for it.

    1. Re:Sort of.. by Cramer · · Score: 2

      We aren't talking about a rack full of dell/hp knock-off "servers". OCP hardware is rows of racks full of stripped down, barebones systems. If your "mission critical" app fails, it's because you or your data center are a bunch of fools. Resilience comes from redundancy. If you fail to provide the redundant hardware, or capacity to spin up your crapplication on other systems, then that's your damn fault. (just as much as choosing to build your own rack full of budget trash.)

      OCP hardware is cheap, so you can afford a lot of it. But it's cheap, and thus, prone to higher failure rates. This equals, in enterprise definitions, an "unreliable infrastructure". In the end, it'll work out to roughly the same total cost, but with one all the money is spent up front to fill a room no one visits, vs. the other spending very little to fill the same room but has people in there regularly replacing failed components. (Banks prefer the former, Google, the latter.)

    2. Re:Sort of.. by Bing Tsher E · · Score: 2

      So all this cheap hardware gets deployed, then swapped out a whole bunch of times. The waste stream is much, much bigger because you're routinely scrapping out cheaply thrown together motherboards, etc.

      It doesn't sound very green.

    3. Re: Sort of.. by Entrope · · Score: 1

      On top of that, the more of these things you expect to deploy, the better an investment in test and verification amortizes. How much does the testing cost, and how long does it take someone to replace a failed system, and how many replacements does it take before the operations cost exceeds the verification cost?
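      A rough way to frame that break-even question, as a sketch only: every figure and function name below is a made-up placeholder, not an OCP or vendor number. It compares a one-time verification cost against the per-replacement operations cost that better testing would avoid, and shows how fleet size changes the answer.

```python
import math


def break_even_replacements(verification_cost, cost_per_replacement):
    """Number of avoided replacements at which verification pays for itself."""
    if cost_per_replacement <= 0:
        raise ValueError("cost_per_replacement must be positive")
    return math.ceil(verification_cost / cost_per_replacement)


def verification_pays_off(fleet_size, failure_rate_untested, failure_rate_tested,
                          verification_cost, cost_per_replacement):
    """True if the replacements avoided across the fleet cover the test bill."""
    avoided = fleet_size * (failure_rate_untested - failure_rate_tested)
    return avoided * cost_per_replacement >= verification_cost


# Hypothetical inputs: $500k of verification, $400 of labor and parts per swap,
# and testing that lowers the failure rate from 5% to 2% of the fleet.
needed = break_even_replacements(500_000, 400)
small_fleet = verification_pays_off(10_000, 0.05, 0.02, 500_000, 400)
large_fleet = verification_pays_off(100_000, 0.05, 0.02, 500_000, 400)
```

      With these placeholder numbers the verification bill is not recovered on a 10,000-node fleet but is on a 100,000-node fleet, which is the amortization effect described above.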

    4. Re:Sort of.. by KGIII · · Score: 1

      If I understand it, then it is not green at all. They, quite literally, plan on chucking out whole stripped down towers when a single component fails. They will not be replacing fans, hard drives, RAM, failed network cards, or any of that. It is cheaper for them to toss them in the trash than it is for them to debug, fault-check, and/or replace hardware. It is not that the techs are making that much, it is because the hardware is that cheap and the value of uptime is so high. They are probably even going to be paying someone to come haul these discarded husks out and to wipe/destroy the drives.

      Green? Hell no. They'll pay Jose to plant a few trees in South America and call it carbon neutral.

      --
      "So long and thanks for all the fish."
    5. Re:Sort of.. by jabuzz · · Score: 1

      What happens when due to a lack of testing your cheap OCP hardware has a design flaw and 10,000 servers all fail in a month?

      That is the criticism I think, that there is too little testing in OCP designs to make sure critical design flaws don't exist. No amount of fault tolerant software design is going to save you from mass hardware failures.

    6. Re:Sort of.. by Anonymous Coward · · Score: 1

      Your understanding is wrong.

      OpenCompute systems tend to be more reliable, because they have fewer components that can fail, and they have strict guidelines over which components may be used in their construction based on statistics from previous crops' failure analysis. In these sorts of operations systems are taken offline when they fail, and when a rack reaches a certain level of degradation, the whole rack is taken down, removed, and queued for refurbishment. They typically have a work center at one end of the datacenter building, where racks are stripped down and each failed node is tested and repaired; at the same time a separate team is putting together a replacement rack from refurbished nodes, and a third team is installing a refurbished rack in its place.

      This does translate to a larger pool of dead nodes deployed in the datacenter at any given time, but reduces the total cost of refurbishment by turning it into a production line.

      Of course hardware does get tossed after a few years. This is like HPC or anything else where the cost of replacing the node is lower than the cost of energy to keep it in operation. Reducing energy costs is pretty much definitionally green. No doubt the scrapped hardware is sent off for precious metal recovery and recycling.

    7. Re:Sort of.. by KGIII · · Score: 1

      Ah - thanks. I was under the impression that they were just going to be grabbing failed units out and chucking them in the bin. I was not surprised that they would do so. I am glad they will be fixing them for a while, at least. Disposal of eWaste is a problem even today though more and more is being recycled or reused.

      --
      "So long and thanks for all the fish."
  4. Web-scale by Tailhook · · Score: 1

    Web-scale? Way to be tone-deaf there Mr. Crawford.

    Or maybe the ridicule heaped on users of that particular term is something indulged only by the neckbeard wannabes that haunt Slashdot. In which case, carry on!

    --
    Maw! Fire up the karma burner!
  5. Because... by Anonymous Coward · · Score: 0

    Webscale.

  6. Cheap hardware. Smart Software by biojayc · · Score: 5, Insightful

    You don't need expensive hardware to run datacenters. You need cheap commodity hardware with smart software on top. Just ask Google or Facebook.

  7. Saying you test is easy. by digsbo · · Score: 4, Insightful

    But testing well is really, really hard. And expensive, especially for data center scenarios. If you haven't put it in an oven and observed the effects, it's not tested for telco data centers.

    1. Re:Saying you test is easy. by GerryGilmore · · Score: 3, Informative

      And there is the rub. NEBS testing (telco-level) is horrifically expensive and - for DC applications - totally unnecessary. NEBS servers have to withstand that because they are often the *only* server performing a certain function in the CO. Not anywhere near the same use-case.

    2. Re: Saying you test is easy. by Anonymous Coward · · Score: 1

      Financially, hardware depreciates for tax purposes in three years anyway. Lately, hardware has been a little slow on Moore's Law, but power efficiency/computing performance has kept about the same pace... If you're at the top end you're losing money not replacing fairly often. What happens after isn't their problem. There's no purpose in testing something to last in the desert for ten years because the vast majority of hardware is "disposable". If you want to complain about the waste, push for more recyclable materials, and of course boards that use fewer parts they don't need... Which is the purpose of this project.

    3. Re:Saying you test is easy. by digsbo · · Score: 2

      Agreed, but still, even in a non-NEBS scenario, there's still a lot to be tested because you're putting something potentially flammable in someone's data center. It's really easy to think of designing so a server failure doesn't bring a cluster down, but a server failure that results in a fire has the potential to do more.

      The one time I had a fire in a test lab, it really scared me, and made me realize as rare as that kind of thing is, it's potentially disastrous. And that's why they test for it.

    4. Re:Saying you test is easy. by Anonymous Coward · · Score: 0

      Of course, NEBS testing is only undergone by a subset of 'Tier 1' servers anyway.

    5. Re:Saying you test is easy. by Anonymous Coward · · Score: 0

      Yes, that is the interesting part. Some of the OCP mindset is 'we don't need UL testing'. Like you say, that will change real quick if a disaster causes toxins to affect people at a datacenter location.

    6. Re:Saying you test is easy. by JoeMerchant · · Score: 1

      Telco switches are ghost towns... big empty buildings out in the boonies that used to hold massive racks of relays with a little box in the middle that replaces all that, or tiny shacks built after the tech came up to speed that just holds the little box. They aren't manned, they are critical, and they need to have reliability due to their geographic dispersal.

      Datacenters are, eponymously, centralized. Keep a staff of 4-5 guys on-hand at all-times, give them a PC gaming center to play epic COD on when things are going well (in other words, pay them dirt and they'll be happy), and when the system detects a fault, they need to be on it before it gets out of hand, or their 100%, 100% uptime bonus is toast (in other words, their base pay is minimum wage, but they can make double that if they can keep the equipment failures from getting out of hand, which with automatic monitoring, diagnostics and failovers, should be a cake-walk).

       

    7. Re:Saying you test is easy. by randalware · · Score: 1

      toxic material is an important consideration.
      but NEBS test servers for a data center is ridiculous !

      Major manufacturers (HP,IBM,SUN,etc) only test one or two hardware chassis for NEBS.
            one basic 2u server & the next size up multi processor.

      NEBS servers are designed to be utility servers in a telco switch site.
      The power is DC and the site has a big bank of batteries to power the site during outages.
      A telco is aiming for NO outages and is very hardware focused.

      Anyone else's datacenter is AC and with software/hardware configurations that switch the load/programs/traffic to backup servers.
      Where most companies fail is in testing the high availability failovers.
      Databases, COTS, homegrown programs usually have a common problem, HA is NOT designed into it, it is a patch & cobble after the fact.

      When an entire datacenter goes offline, you will see almost every company go up in smoke for the day.
      Geo failover is NOT something many do well, and it is difficult to buy a magic solution from a vendor.
      And testing again is hard to get management buy in for.

      I have had these discussions with most of my employers, and few took more than baby steps toward a solid solution.
      And I have seen one of them make apologies on tv and one was involved in a merger then the IT was outsourced.

      None of them would have survived a disaster recovery with some major outages.
      Planning a disaster recovery system that comes back online within a few minutes is cheap compared to one that never stops.
       

      --
      This is my opinion based on what little I know and understand of the rumors and lies Thanks, Randal
    8. Re: Saying you test is easy. by KGIII · · Score: 1

      and of course boards that use fewer parts they don't need...

      I now have a picture in my head of a guy, his name is Ralph, sitting there, drilling holes, and soldering on random extra bits like capacitors, diodes, a spare bios chip bracket, and a USB port. I know what you meant but, really, that is how my brain works.

      --
      "So long and thanks for all the fish."
  8. Yeah we'll just do that in software? by captaindomon · · Score: 1

    "web-scale data centers are designed to cope with hardware failures". So.... it's OK if you use my motherboard design and they randomly fail, because you should just make up for that in software or hardware redundancy? Um, no.

    --
    Just because I can hook a shark from a boat, I do not offer to wrestle it in the water.
    1. Re:Yeah we'll just do that in software? by Anonymous Coward · · Score: 0

      Um, yes. That is what it means. If something fails, the system handles it. That is how Google, Facebook, Amazon, etc, etc work. And they are very successful.

    2. Re:Yeah we'll just do that in software? by mpoulton · · Score: 1

      "web-scale data centers are designed to cope with hardware failures". So.... it's OK if you use my motherboard design and they randomly fail, because you should just make up for that in software or hardware redundancy? Um, no.

      That's exactly what it means, and how it works. When you have tens of thousands of nodes, some of them WILL eventually fail during operation, no matter how good the hardware is. Thus, the software must be designed to accommodate hardware failures and seamlessly continue operation without interruption or data loss. If you already have to design the software to handle that anyway, then there is not much incentive to go to great lengths to improve hardware reliability. Whether the failure rate is 1:100000 or 1:1000 annually, the result is the same on the software side. But if the less reliable hardware is dramatically cheaper (which it is), then it makes more sense to use the cheap hardware and replace it more often.
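      To put rough numbers on that, here is a small sketch. The fleet size is an arbitrary example, the two rates are the ones quoted above, and node failures are assumed independent, which real fleets do not guarantee.

```python
def expected_annual_failures(nodes, annual_failure_rate):
    """Mean number of node failures per year across the fleet."""
    return nodes * annual_failure_rate


def prob_at_least_one_failure(nodes, annual_failure_rate):
    """Chance that at least one node fails in a year (independence assumed)."""
    return 1.0 - (1.0 - annual_failure_rate) ** nodes


FLEET = 50_000  # hypothetical fleet size, chosen only for illustration

summary = {
    rate: (expected_annual_failures(FLEET, rate),
           prob_at_least_one_failure(FLEET, rate))
    for rate in (1 / 100_000, 1 / 1_000)
}
```

      Under these assumptions even the reliable hardware has a sizeable chance of losing a node in any given year, so the failure-handling path has to exist either way; the cheaper hardware mostly changes how often it runs.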

      --
      I am a geek attorney, but not your geek attorney unless you've already retained me. This is not legal advice.
    3. Re: Yeah we'll just do that in software? by Anonymous Coward · · Score: 0

      They also have rather different business models than most organizations. Amazon is the closest to 'normal' of the bunch you mentioned, but even then they're just different. If you do not take that into account while you're busy worshipping what they do, you're going to fail.

  9. "designed to cope with hardware failure" by Anonymous Coward · · Score: 1

    Crawford thinks that web-scale data centers are designed to cope with hardware failures but hasn't tested it

    FTF Crawford.

  10. Isn't this expected? by fuzzyfuzzyfungus · · Score: 4, Insightful

    I don't know if it's a good idea or not (probably depends on who you are, and I'm sure that there will be some people who choose incorrectly); but is it really a surprise that OCP would be doing their testing on the cheap 'n cheerful side of things?

    It was my understanding that their premise, from the beginning, was that existing hardware vendors were excessively focused on adding costly, thermally demanding, and often proprietary, features at the hardware level that were unnecessary if you were willing to compensate for their absence in your software design.

    There is obviously some level of reliability below which no compensation at the software level is possible (if you can't run the algorithm for detecting errors because it keeps glitching out, it's probably not going to work); but the impression they always conveyed was that many of the more sophisticated reliability mechanisms are really features aimed at people who are substantially less able to cope with failure; and are therefore willing to pay substantially more for hardware that can invisibly paper over a variety of moderately serious failures and allow the software on top to run without incident; rather than buying lots of cheap hardware that has a risk of going down in a screaming heap.

    So long as nobody gets any stupid optimistic ideas, I don't really see the issue. Sure, if Facebook were about sending men to mars, they should seriously consider having three CPUs running in lockstep and voting on all operations and so on; but this project is about delivering as many ad impressions per dollar as possible; no reason to get worked up over the occasional glitch.

    1. Re:Isn't this expected? by Anonymous Coward · · Score: 0

      Keep in mind the gripe is that they claim to have a test process, but they basically allow everyone to 'self certify'. Basically, a vendor can do whatever the hell they want and say 'OCP Certified'. The phrase is literally meaningless, but OCP champions it as being something to look for.

      OCP is essentially discontinuing efforts to provide value. The market is no different with or without the OCP 'sticker' at this point, except OCP gets some money from the vendors that want to say 'me too'. Over time OCP has evolved to be more and more lax and less and less specific to the point where you can technically slap OCP on most any piece of datacenter equipment that exists today without any technical change and no one can really call you on it.

      The sensibilities they advocate may have something to it, but as an organization they are severely problematic. A great deal of their 'specifications' are nothing more than problem statements without answers.

    2. Re:Isn't this expected? by fuzzyfuzzyfungus · · Score: 1

      So, do we suspect mere incompetence, or is the OCP one of those 'open' projects where the lead is all gung-ho about industry collaboration and openness and such; so long as they are losing to somebody else, and then more or less immediately drops all but the barest vestiges of 'open' once they have the improvements they came for?

      I certainly can't rule out the former, especially since a bunch of preening software narcissists who "move fast and break things" and are proud of it don't seem like naturals for either project management or hardware engineering; but I'd also be unsurprised if this was Facebook's 'shit, Google is hammering us on hardware and operations costs per ad served, we need to beat some fear into our vendors...' project, and now that it has succeeded in doing that, there really isn't any advantage for them in bothering to improve, maintain, or prevent from being watered down into meaninglessness, the 'spec'. Any guesses?

      "I get it; some asshole said he was open; but he was only open for business."

    3. Re:Isn't this expected? by mysidia · · Score: 1

      if you can't run the algorithm for detecting errors because it keeps glitching out, it's probably not going to work

      Chances are you can't make good assurances about tolerating any kind of byzantine fault.

      I realize there are finally some options for tolerating certain kinds of Byzantine faults in specific kinds of scenarios. In general, it is too hard or expensive, so the fact is, less reliable hardware does mean the application will be less reliable. Buying cheaper hardware is still a cost tradeoff that adds risk. The risk may be more limited if the software is really really REALLY good, and there's a really resilient system of thousands of nodes.

      Imagine you don't want to pay for ECC protected buffers/RAM, because it's too expensive.... you receive some data from a user over the network.

      You decode the packet, verify the checksum, and store the data.

      Before you get to reading the data after verifying the checksum --- a 1-bit error occurs in the RAM. You won't detect the error, because you already verified the integrity, your next step is to add the corrupted data to the database.... boom you have data corruption.
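      The sequence above can be simulated in a few lines. This is only an illustration: the packet contents and helper names are invented, CRC-32 from Python's zlib stands in for whatever checksum the protocol really uses, and a bytearray plays the part of the unprotected RAM buffer.

```python
import zlib


def receive(payload: bytes, checksum: int) -> bytearray:
    """Verify the checksum once, then hand back a mutable in-memory buffer."""
    if zlib.crc32(payload) != checksum:
        raise ValueError("checksum mismatch")
    return bytearray(payload)  # stands in for a non-ECC RAM buffer


def flip_bit(buf: bytearray, bit_index: int) -> None:
    """Simulate a single-bit memory error at the given bit position."""
    buf[bit_index // 8] ^= 1 << (bit_index % 8)


packet = b"balance=1000"            # hypothetical user data
crc = zlib.crc32(packet)            # checksum carried with the packet

buf = receive(packet, crc)          # integrity check passes here
flip_bit(buf, 3)                    # memory error strikes after the check
stored = bytes(buf)                 # what would be written to the database

# Nothing re-verifies at write time, so the corruption goes unnoticed
# unless the checksum is recomputed just before storing.
would_be_caught_by_recheck = zlib.crc32(stored) != crc
```

      Re-running the checksum immediately before the write would catch this particular flip, but only narrows the window; ECC is what protects the data while it sits in memory.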

  11. why not get flammable parts and oil cool it by Anonymous Coward · · Score: 0

    it would be a lot better than water

  12. 5 9's by The Raven · · Score: 4, Insightful

    I'm gonna side with OCP on this one. It is far more economical to deal with reliability via redundancy than it is via expensive parts. This is why we use RAID rather than speccing our drives to last 10 years minimum. All the big players in the datacenter market have put thousands of hours each into designing systems tolerant of missing parts.

    The downside is that writing custom stacks tolerant of missing pieces is fucking hard and a huge up-front investment for a company. Most off-the-shelf software does not have that level of redundancy and fault tolerance baked in already. This means that for many small to medium sized deployments it's cheaper to buy a really expensive fault tolerant rack of servers and run your off-the-shelf software on it than it is to buy into OCP with inexpensive hardware that's more open to failure, because your software is NOT open to failure.

    Different strokes for different folks and all. Use the right tool for the job. And OCP was made by companies with massive data farms to fit their needs... and their needs are probably not your needs.

    --
    "I will trust Google to 'do no evil' until the founders no longer run it." Hello Alphabet.
    1. Re:5 9's by romanr · · Score: 1

      Exactly this. Pick the right tool for the right job. If you are just serving up simple web pages to the masses, go cheap, they can always hit refresh if things fail.

      If you have serious money flowing through the platform, plan and purchase accordingly. What is an outage going to cost you? A $50,000 server may end up being very, very cheap if an outage costs you $100,000 per hour.

    2. Re:5 9's by hawguy · · Score: 1

      Exactly this. Pick the right tool for the right job. If you are just serving up simple web pages to the masses, go cheap, they can always hit refresh if things fail.

      If you have serious money flowing through the platform, plan and purchase accordingly. What is an outage going to cost you? A $50,000 server may end up being very, very cheap if an outage costs you $100,000 per hour.

      If an outage costs you $100K/hour, you better not be running it on a single server.

    3. Re:5 9's by Anonymous Coward · · Score: 0

      There is an all important rule in computing: GIGO.

      1: HDD reliability has arguably gone down, so people have started stacking drives. Because of this and the fact that I/O capacity hasn't improved much (which causes a degraded array to remain degraded longer while rebuilding), RAID 5 has given way to RAID 6, and there is talk about triple parity becoming a must, just because of the rebuild times.

      2: NoSQL is a great concept in general. The car analogy is to remove the body, and only have one brake rotor for stopping the car. Other than MarkLogic (which in itself got a black eye due to the Obamacare website issues), ACID compliance is iffy to nonexistent.

      3: The reason why OCP is useful is because with offshoring and H-1B usage, devs can be had for cheap, so a large company can throw a ton of man-hours into creating a backend application with a lot of redundancy, and eventually (similar to having a bunch of monkeys eventually going to type out the works of Shakespeare), the code will stabilize enough to do this. Only a very few industries can do this. This is the same reason you see OpenStack in universities, but not the "real" world. A VMware cluster can be stood up and working in a reasonable amount of time. OpenStack needs a lot of manpower to make it work and keep it working, which universities have in abundance, but few businesses can do this.

    4. Re:5 9's by Anonymous Coward · · Score: 0

      The downside is that writing custom stacks tolerant of missing pieces is fucking hard and a huge up-front investment for a company.

      I don't think it is hard. But then, I guess that's why I get paid the big bucks to make it happen. Really it's just that average programmers are crap, and don't really have any ability.

  13. Re:Cheap hardware. Smart Software by drinkypoo · · Score: 1

    Yep. This thread is full of people pooh-poohing this idea and meanwhile it's the strategy used by the most successful corporations on the internet. Welcome to Slashdot!

    --
    "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
  14. Silicon Valley by BobSwi · · Score: 1

    Sounds like Hooli XYZ! Where's Nelson Big Head Bighetti?

  15. Cheap, reliable, fast.... by bobbied · · Score: 1

    Pick two...

    It all boils down to what you want, but of the three things we all say we want, you get only two...

    --
    "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
  16. Hardware failures by QuietLagoon · · Score: 1

    ...Crawford argues that web-scale data centers are designed to cope with hardware failures...

    By that logic, the telco data centers are not designed to cope with hardware failures?

    .
    Of course, I really don't care if facebook has downtime due to hardware reliability issues. facebook is more a waste of time than anything else.

    1. Re:Hardware failures by FranTaylor · · Score: 1

      Of course, I really don't care if facebook has downtime due to hardware reliability issues. facebook is more a waste of time than anything else.

      Facebook's customers would tend to disagree. They are paying a lot of money to Facebook and they want their money's worth.

      Facebook's users are not the customers, they are the product.

    2. Re:Hardware failures by Bing Tsher E · · Score: 1

      I'd imagine Facebook puts more resources into keeping the tracking and Ad-serving hardware 100% operational. The rest of the infrastructure is just the chicken feed sprinkle.

    3. Re:Hardware failures by FranTaylor · · Score: 1

      The rest of the infrastructure is just the chicken feed sprinkle.

      That "chicken feed sprinkle" is precisely what the customers are paying for. Facebook is not just selling ads, they are selling everything you type.

  17. It's in the name by Anonymous Coward · · Score: 0

    It's called Open Compute Project for a reason. The willing may pursue different goals with Open Telecom Project. Or read Jim Gray's technical report number 85.7 for Tandem Computers.

  18. Re:Cheap hardware. Smart Software by Anonymous Coward · · Score: 0

    Note that their datacenter disciplines are not actually proven to be the best, but boy do they think so. You ask their datacenter guys and you'd think it's because of *them* that the business plan works.

    Notably, the mindset centers around *ZERO* warranty and no testing at all. This encourages some nasty behavior in the vendors. They may be able to tolerate it, but they are paying a price in terms of how much they have to oversize and replace.

    There's a middle ground between ludicrous resiliency inside a single system and 'lose hundreds of servers / racks a day and you would never notice' (except for the replacement parts, spare capacity, power and cooling sucked down by zombie servers before being addressed, and so on and so forth).

  19. testing is for design problems, too by FranTaylor · · Score: 3, Interesting

    it doesn't matter how many redundant servers you have, if they are all going to fail in the same way

  20. Be highly available in software, not hardware by poopie · · Score: 4, Insightful

    I suspect Open Compute Project welcomes additional testing resources for the benefit of everyone... as long as it doesn't involve an oppressive amount of process that simply serves to slow down progress.

    But... Web scale IS different, so I can't blame the main sponsors for not prioritizing what isn't as important to them. Once you accept that ALL hardware fails, and that you can either pay more for more reliable hardware, or you can develop better software architecture to handle failures, you look at things differently. Spend your money once on good software engineering, instead of over and over on every server.

    1. Re:Be highly available in software, not hardware by FranTaylor · · Score: 1

      Once you accept that ALL hardware fails, and that you can either pay more for more reliable hardware

      If you have all the same hardware and it's not adequately tested, then all of your hardware is vulnerable to the same issues, and your application will possibly fail on all of them! Throwing more hardware at the problem just means more failures.

      or you can develop better software architecture to handle failures

      How can you develop software to work around systemic hardware problems? How can you write software that automatically detects if your floating point hardware is always correct? You say "do it on multiple systems and compare the results" but what if they all have the same flaw?

      you look at things differently. Spend your money once on good software engineering,

      really? spend money up front preparing for every single possible hardware problem? design it all in advance that way? so what is the world supposed to do for the next thousand years before you finish the software?

    2. Re:Be highly available in software, not hardware by Anonymous Coward · · Score: 0

      Spending the money on software architecting rather than hardware is the right idea asymptotically. Paying programmers to make your software fault-tolerant takes O(1) money and buying primo fault-tolerant servers costs O(t) money over time.

      But if you're not a large organization (huge enough to actually *reach* that asymptotic regime), you can't just ignore the constants in front of those 1 and t factors. How many programmer-years will it take to make your software stack fault tolerant? That's a pretty good chunk of change.

      Fortunately, I have some hope that coming advances as supercomputers make the leap to the exascale will render this whole thing moot. It's been under discussion in HPC for a while: For a long time now, MPI has simply barfed and died if a node crashes. The code has to checkpoint itself so that in the event of a crash, it can restart. But as you approach the exascale, you're talking millions of nodes and with any realistically achievable MTBF you approach spending 100% of your time writing checkpoint files. They were, from what I gathered, deciding that it's not reasonable to make every code re-implement the same reliability and fault-tolerance wheel and that instead the matter should be moved to hardware and OS: Improved hardware self-monitoring to detect soft problems before they become catastrophic, followed by the OS and MPI simply pausing execution of the job while the dying node's image is transparently migrated to a new node.
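      The checkpoint squeeze can be estimated with Young's classic first-order approximation, under which the best interval between checkpoints is roughly sqrt(2 x checkpoint time x system MTBF). The sketch below uses invented figures (a 5-year per-node MTBF and a 6-minute checkpoint) and assumes independent node failures; the approximation only holds while the checkpoint time is much smaller than the MTBF, so the result is capped at 100%.

```python
import math


def system_mtbf_hours(node_mtbf_hours, nodes):
    """MTBF of the whole machine if any single node failure kills the job."""
    return node_mtbf_hours / nodes


def young_interval_hours(checkpoint_hours, mtbf_hours):
    """Young's first-order estimate of the best time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_hours * mtbf_hours)


def overhead_fraction(checkpoint_hours, mtbf_hours):
    """Approximate share of wall-clock time lost to checkpoints and rework."""
    tau = young_interval_hours(checkpoint_hours, mtbf_hours)
    lost = checkpoint_hours / tau + tau / (2.0 * mtbf_hours)
    return min(lost, 1.0)  # the approximation breaks down past this point


NODE_MTBF = 5 * 365 * 24      # hypothetical: 5 years per node, in hours
CHECKPOINT = 0.1              # hypothetical: 6 minutes to write a checkpoint

small = overhead_fraction(CHECKPOINT, system_mtbf_hours(NODE_MTBF, 1_000))
huge = overhead_fraction(CHECKPOINT, system_mtbf_hours(NODE_MTBF, 1_000_000))
```

      With these placeholder inputs a thousand-node job loses a few percent of its time, while a million-node job hits the cap, which is the "100% of your time writing checkpoint files" regime described above.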

      Best get on it... The DOE plans to bring a 300 petaflop machine online in 2018 (Though GPUs will alleviate some of the insane-node-count issue) and to break the exaflop barrier by 2020.

    3. Re:Be highly available in software, not hardware by Anonymous Coward · · Score: 0

      We aren't talking about systemic design errors here that are present in all the hardware, we are talking about when individual hardware parts fail. Don't be dense. Obviously your designs need to be tested.

    4. Re:Be highly available in software, not hardware by FranTaylor · · Score: 1

      Obviously your designs need to be tested.

      the implementations need to be tested, too. all chips are not created equal. this batch works great, the next batch fails under certain circumstances. without actual ongoing hardware tests you won't catch it.

    5. Re:Be highly available in software, not hardware by Anonymous Coward · · Score: 0

      The chips are all from Intel or AMD or Broadcom or LSI etc.

      They are the same chips being used in Dell and HP and Oracle and SuperMicro and ...

      And all those chip suppliers already have adequate test procedures.

      OCP does have JTAG and burn-in tests. They aren't really testing much more than DOA-free, but that is good enough for their intended purpose.

  21. TL; DR; If you want good stuff pay for it. by Anonymous Coward · · Score: 0

    Who would believe any "testing" certification these guys came up with anyway? Cheap shit suppliers come and go. Suckers that insist on buying absolute shit will be with us always.

  22. Nonsense by YuppieScum · · Score: 2

    For *some* datacentre tasks you can use cheap, commodity hardware. For others, you need expensive, certified, bullet-proof hardware.

    --
    This sig left unintentionally blank.
    1. Re:Nonsense by HiThere · · Score: 1

      There is no such thing as "bullet-proof hardware" except in the sense that some of it would stop a .45 bullet.

      Cheaply built hardware fails more often, but *ALL* hardware fails, and you need to plan for it. Ever hear of "RAID"? That's the way all (almost all?) hard disks are built these days. But they still fail. They used to fail more frequently. ("RAID" == "Redundant Array of Inexpensive Disks").

      --

      I think we've pushed this "anyone can grow up to be president" thing too far.
    2. Re:Nonsense by mysidia · · Score: 2

      There is no such thing as "bullet-proof hardware"

      Uh no... there definitely is. There's no X86 based system that really falls into this category though. Many mainframe systems are bulletproof, in the sense the mainframe won't fail or crash, or lose work, or corrupt data, upon any component failures. Tandem computers' systems and some other past solutions on the market were pretty darned bullet proof.

      That didn't mean no components failed -- only that when components died - CPUs and system bus included, things kept working.

      The shift to platforms such as X86 was a shift away from super-reliable systems and towards super-cheap systems.

      The tradeoff was made a long time ago..... cheaper always wins in the long run.

      Now there are many expensive X86 computers being sold to businesses as "super reliable servers" just like the ol' business computers were sold. Ultimately...... they're going to give way to cheaper X86, or their successor as time marches on.

      This goes back to business rule #1: Lower cost = More profit.

      The fact is, the business doesn't need things to be close to bulletproof ---- especially after the competition switches to the cheaper thing and uses the lower cost of their cheaper X86 servers to offer their services at lower price and undercut you in the marketplace.

    3. Re: Nonsense by Anonymous Coward · · Score: 0

      Even mainframes go down, just not that often.

    4. Re:Nonsense by Anonymous Coward · · Score: 0

      Uh no... there definitely is. There's no X86 based system that really falls into this category though. Many mainframe systems are bulletproof, in the sense the mainframe won't fail or crash, or lose work, or corrupt data, upon any component failures.

      So.... Mainframes are bulletproof because they have redundancy built into the software and OS?

      Isn't that exactly what Google and Facebook are doing, albeit at a higher level of the software stack?

      I'll grant you that Mainframes and bigiron unix systems have pretty decent low latency IPI and shared memory that makes doing redundant execution more efficient when done at the lowest level, and that's precisely what you need in deeply serialised workloads like credit card processing and flight booking.

      Most applications don't need this kind of low latency atomicity, so they just put the redundancy at a higher level. The customer interface level reliability is the same or higher.

      All Opencompute is: a recognition that at the scale of Facebook, it's cheaper to do your own engineering than to pay royalties on someone else's, and then sharing that engineering with the hope that someone else will share back.

      The first part is hardly different to AT&T building their own computers and switches for the telephone network way back when.

  23. Re:Cheap hardware. Smart Software by Anonymous+Brave+Guy · · Score: 4, Interesting

    I think the point is that so far it is only used by "the most successful corporations on the internet". In fact, you can probably count the number of organisations in the entire world that qualify on the fingers of one hand, though it will take a few more fingers to count how much money they have invested to reach this point.

    Unfortunately, as lovely and friendly as all the Software Defined X advances seem with their mantra of openness, almost no-one is actually building a "web-scale data centre" with a 24/7 staff dedicated to just swapping out broken hardware and effectively unlimited resources to devote to designing hardware architectures and building control software that can cope with frequent failures without losing significant amounts of real money. For normal organisations, even those with heavy IT requirements and 12 figure market caps, running your critical infrastructure on hardware that does have a serious level of testing and consequent robustness may still be advantageous.

    (Full disclosure: I sometimes work for clients in the networking industry, though whether an industry shift towards things like OCP would benefit or harm them would be open to debate so I think I'm still reasonably neutral here.)

    --
    If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
  24. Re:Cheap hardware. Smart Software by Anonymous Coward · · Score: 3, Informative

    While I was working at Amazon we were told to expect hardware failures and to build our software around them. I have a couple of friends doing hardware testing for AWS and all of their hardware is of extremely low quality and has major visible issues such as bowing, flimsy connectors, and little to no redundancy in the hardware itself (no dual power supplies or hot-swappable anything). This really isn't a surprise at all; it's just where the industry is going.

  25. Re:Cheap hardware. Smart Software by drinkypoo · · Score: 1

    Unfortunately, as lovely and friendly as all the Software Defined X advances seem with their mantra of openness, almost no-one is actually building a "web-scale data centre" with a 24/7 staff dedicated to just swapping out broken hardware and effectively unlimited resources to devote to designing hardware architectures and building control software that can cope with frequent failures without losing significant amounts of real money.

    I think that's because most customers don't want that, partly because they don't understand how they would use it yet — but also because there is the fundamental problem of paying a middleman. If you are depending on someone to build the cloud for you, you're going to have to accept that they're going to want to get paid for their trouble. And nobody likes to write checks, they like to cash 'em.

    --
    "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
  26. Who needs OCP? by viperidaenz · · Score: 2

    MongoDB is Web-scale.

    1. Re:Who needs OCP? by Anonymous Coward · · Score: 0

      Right, now I'm thinking of how much fun it will be to castrate my first bull, down on the farm. I cannot wait to cut off the testicles of a 3000 pound raging bull as it tries to kick my head in.

  27. Re:Cheap hardware. Smart Software by JoeMerchant · · Score: 1

    Isn't this the point of the cloud: don't buy/build/maintain your own, rent from us and save because we do it cheaper and better than you ever could on your own?

    I think by the time you reach a scale where you have 24/7/365.24 staffing adequate to handle the failures as they happen, you can take advantage of the higher failure rate / lower cost equipment. You don't need to be Google scale to do this.

  28. Test engineer says... by tlambert · · Score: 2

    Test engineer says... big companies need to hire more test engineers.

    Are we surprised?

    1. Re:Test engineer says... by FranTaylor · · Score: 1

      the reality of massive system outages affecting NYSE and airlines says that more test engineers are needed

    2. Re:Test engineer says... by Bing+Tsher+E · · Score: 1

      Software engineers say 'give us much more money to make software that is ten times as complex so you can throw it on cheap hardware to run.'

      Are we surprised?

      The trick is, robust hardware is robust hardware. It's done, you test it, then you build quality metrics into the process of building it and you're done. Complicated software to accommodate less robust hardware is bigger, more complex, and thus more prone to software bugs. You fix it by making it even more complex.

      But the software guys will be there to get paid to write even more of it. Yay.

    3. Re:Test engineer says... by Anonymous Coward · · Score: 0

      Software engineers say 'give us much more money to make software that is ten times as complex so you can throw it on cheap hardware to run.'

      That's the wrong idea.

      The idea is to build simplified software. Simplified software, with built-in redundancy. Then write all your software the way you normally would, but on this simplified platform that, invisibly to the application programmer, executes your program with redundant storage, redundant processing and redundant transaction completion.

      If you're making your software more complex to handle redundancy: YOU'RE DOING IT WRONG.
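
      A toy sketch of what I mean, with made-up names throughout (`run_redundantly`, `healthy_node` and `dead_node` are illustrations, not any real platform's API). The application code is just a plain function; the retry-on-another-node logic lives in one place underneath it.

```python
def run_redundantly(task, nodes):
    """Try the task on each node in turn; return the first successful result."""
    errors = []
    for node in nodes:
        try:
            return node(task)
        except Exception as exc:  # a real platform would be pickier about what it catches
            errors.append(exc)
    raise RuntimeError(f"all {len(nodes)} nodes failed: {errors}")

def healthy_node(task):
    """Hypothetical node that simply runs the task."""
    return task()

def dead_node(task):
    """Hypothetical node that has fallen over."""
    raise ConnectionError("node unreachable")

def business_logic():
    """Application code, written with no knowledge of the redundancy below it."""
    return 2 + 2

print(run_redundantly(business_logic, [dead_node, healthy_node]))
```

      This only works cleanly when the task is safe to run twice. Anything with side effects needs the platform to handle transaction completion too, which is the hard part.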

  29. Even if they have a point, OCP is a joke by Anonymous Coward · · Score: 0

    So many chances to go metric and add other improvements, all deliberately missed. Now we have a "standard" that sits right between two telco standards, with no obvious indication why it would be better than either: It's just more of the same. Thus the thing is an elaborate shtick to be speshul and troll the manufacturers into doing Facebook's bidding.

  30. Re:Cheap hardware. Smart Software by mbkennel · · Score: 1

    The problem is when managers want to replicate this with cheap commodity developers and cheap commodity IT support on top of unreliable hardware infrastructure instead of the expensive, and rare, high-end personnel and internal resources that Google and Facebook have.

    Since most companies won't be able to hire the top 1% of those people, might it be more worthwhile to buy more reliable and expensive hardware?

  31. Re:Cheap hardware. Smart Software by mysidia · · Score: 1

    You need cheap commodity hardware with smart software on top. Just ask Google or Facebook.

    The software used by the rest of us (e.g. MySQL) isn't that smart, and it's very expensive to get software that is that smart --- building such a system can take hundreds of thousands of ops-engineer and developer man-hours.

    There are open source products that can be that smart, with enough deployment work. Developing smart custom applications is a bear.

    It may very well be cheaper in many cases for smaller scale applications to spend the extra money on some more reliable hardware instead of massive $$$ on extra development.

    I guess you can now say definitively that OpenCompute is not for everyone.... it's especially not for IaaS hosting providers, if the components are more prone to failures that the service provider will be held responsible for.

  32. Re:Cheap hardware. Smart Software by mysidia · · Score: 1

    instead of the expensive, and rare, high-end personnel and internal resources that Google and Facebook have.

    Then they are destined to fail, if they are unwilling to invest in suitably skilled personnel AND high enough quality development for the chosen architecture to implement their intended plan.

    might it be more worthwhile to buy more reliable and expensive hardware?

    Paying up to keep the more qualified personnel on staff can have other benefits. I think the competition for good people is much less than you imply.... if you are willing to pay up. Many times the top 1% of the technical talent does not wind up with significantly more pay than the next 30% down.

    Developers in the top 70% can still build highly-resilient applications, also, and if you pay more than the typical market rate for them, you can likely pick many of them up.

    It's those "C" level folks that are so hard to avoid, and the fact is, no interview screening procedure the average person will come up with is likely to reliably distinguish and eliminate them.

  33. Re:Cheap hardware. Smart Software by Anonymous+Brave+Guy · · Score: 4, Interesting

    Well, I have a few issues with the cloud hype, starting with the scarcity of evidence to support claims about cloud services being cheaper and/or more secure and/or more reliable than doing things yourself. Every major cloud provider has had serious downtime, and there is only so much you can attribute to being more visible at greater scale or to users not configuring HA tools properly. Far too many on-line services also run into significant security/privacy problems. And cost-wise going with the cloud rather than your own systems tends to be favourable at certain levels (other things being equal) but it can be outrageously expensive in other cases.

    These myths aren't really the point here anyway. The point in this case is that no matter how fast your recovery time may be, whatever was happening on your hardware at the time it failed is lost, and in some cases you simply can't make that transparent to your users. Not everything in the world of programming is a distributed map-reduce where losing a hardware node means you just redistribute the 0.0001% of the job it was doing to another and no-one notices. Not everything in the world of networking can tolerate a multi-second failover process without an observable blip in connectivity. As for redundant/HA storage, the CAP theorem called and asked to speak with you about your database, but I think you were on with physics at the time so I just took a message.

    It's not just about whether the wastage due to more frequent failures works out cheaper economically than paying a premium for better hardware. It's also about how much downtime you (or your customers) are willing to tolerate and what proportion of overall system time is spent just recovering from failures. If you've ever had the joy of watching the (N+1)-th drive fail in your RAID with N-way redundancy while it's still rebuilding from replacing the earlier failures, you'll know what I mean.
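
    For a feel of the rebuild-window risk, here's a minimal sketch assuming independent drives with exponentially distributed lifetimes. The drive count, rebuild time and MTBF are hypothetical. Real arrays tend to do worse than this, because drives from the same batch under the same rebuild load don't fail independently.

```python
import math

def p_failure_during_rebuild(surviving_drives, rebuild_hours, drive_mtbf_hours):
    """Chance that at least one surviving drive fails before the rebuild finishes."""
    combined_rate = surviving_drives / drive_mtbf_hours  # failures per hour
    return 1.0 - math.exp(-combined_rate * rebuild_hours)

# Hypothetical array: 7 surviving drives, each rated at 1,000,000 hours MTBF.
for hours in (6, 24, 72):
    print(hours, p_failure_during_rebuild(7, hours, 1_000_000.0))
```

    The probability scales with both the number of drives and the length of the rebuild, so bigger disks and bigger arrays both push it the wrong way.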

    --
    If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
  34. Re:Cheap hardware. Smart Software by JoeMerchant · · Score: 2

    I've never had an N+1 drive fail in a RAID setup. What I have had happen is the power supply to the whole array fail... then we can talk about redundant power supplies, but, really, the data needs to be mirrored offsite at a place where a serious (fire / flood / riot / meteor strike / whatever) event doesn't take down all copies of the data / service. This was sort of the founding principle of ARPANET, anyway.

    Economics varies, people negotiate bad contracts all the time that lead to higher costs of whatever approach they have taken. Not surprising that something with the hype of "the cloud" can get people to sign bad deals. Also not surprising that some "bulletproof" hardware is excessively premiumed compared to the advantages it conveys.

    In a "rich infrastructure" 100 cheap cars beats a single tank. In a desert with bad to non-existent roads and a costly supply chain, you'll want the tank. Assuming these "cloud" data centers have sufficient infrastructure and scale, they should be able to do it better and cheaper. Of course, it's always possible to mismanage anything, and this goes double for security concerns.
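
    The cars-versus-tank arithmetic, as a sketch. It assumes failures are independent and that any one surviving node can carry the whole load, both of which are generous assumptions; the availability figures are made up.

```python
def parallel_availability(node_availability, nodes):
    """Availability of a pool where any single surviving node can carry the load."""
    return 1.0 - (1.0 - node_availability) ** nodes

# Hypothetical: cheap boxes that are each up 99% of the time.
for n in (1, 2, 3):
    print(n, parallel_availability(0.99, n))
```

    Three mediocre boxes beat one five-nines box on paper. A shared power supply or a shared bug is what breaks the independence assumption and the paper result with it.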

    If you want / need control and you can't afford to point to a sub-contractor not living up to their contract terms when something goes wrong, then do it in-house. If in-house is a single site, or the multiple sites you do have can't afford around the clock technical maintenance presence, then, yeah, go for the "good stuff" and let the expensive machines help you in your (ultimately futile) pursuit of perfection. If your organization is numbered in the 10,000s or larger, and top management takes IT seriously, they should seriously be employing fault tolerant methodologies - whether you use cheap crap for equipment or not.

    The N+1 failure days will happen, and multi-second failover response times sound perfectly acceptable to me, unless you are in high-speed trading, in which case - a pox on you and your servers and may you lose Billions in your next equipment snafu. But, those days when you have the unacceptable failure (Fukushima Daiichi?) are the days when you step back and improve the design and methods. Generally speaking, there are bigger gains to be had with redundancy and distribution than there are with "more bulletproof" hardware slotted back into the same system design that just bit you.

  35. the main benefit is flexibility by Chirs · · Score: 3, Informative

    I don't think I'd ever go to the cloud because it's cheaper or more secure or more reliable. The main benefit that I see is flexibility.

    If your loads are stable and known in advance, it's likely cheaper to buy hardware and staff people to take care of it. On the other hand if loads spike wildly from one day to the next the cloud makes perfect sense. Need a thousand cores of compute power right this second? Amazon/Google/Rackspace/HP would be happy to rent it to you.

  36. there is some very reliable hardware out there by Chirs · · Score: 1

    I worked on a telecom switch that ran processing on cards that had two CPUs in lockstep. If the output of the two ever differed the card was taken out of service and its last transaction was rolled back. Memory contents were stored in at least three places at any given time. The dataplane was inductively coupled to avoid the possibility of DC current damaging things.

    We replaced it with commodity hardware and smarter software. It wasn't *quite* as reliable, but it was a whole lot cheaper and the speeds ramped up much faster.
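
    For the curious, the lockstep idea looks roughly like this in software terms. This is a toy model, not how the actual switch was built: `LockstepCard` and the two "CPU" functions are names invented for illustration, and the real comparison happened in hardware.

```python
import copy

class LockstepCard:
    """Toy model: two CPUs run every transaction; any disagreement retires the card."""

    def __init__(self, cpu_a, cpu_b, state):
        self.cpu_a = cpu_a
        self.cpu_b = cpu_b
        self.state = state
        self.in_service = True

    def run(self, transaction):
        if not self.in_service:
            raise RuntimeError("card is out of service")
        snapshot = copy.deepcopy(self.state)
        # Each CPU works on its own copy so neither can disturb the other.
        out_a = self.cpu_a(copy.deepcopy(snapshot), transaction)
        out_b = self.cpu_b(copy.deepcopy(snapshot), transaction)
        if out_a != out_b:
            self.in_service = False   # take the card out of service
            self.state = snapshot     # roll back the last transaction
            return False
        self.state = out_a            # both agree, so commit
        return True

def good_cpu(state, amount):
    state["total"] += amount
    return state

def faulty_cpu(state, amount):
    state["total"] += amount + 1      # simulated bit flip
    return state

card = LockstepCard(good_cpu, faulty_cpu, {"total": 10})
print(card.run(5), card.state, card.in_service)
```

    Note that two CPUs can only tell you that something went wrong, not which one was right. That's why the card gets pulled rather than corrected.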

    1. Re:there is some very reliable hardware out there by Anonymous Coward · · Score: 0

      Stratus Computer used to sell systems with 4 physical CPUs acting as 1 logical CPU. Each pair of CPUs was on a hot swappable board, and ran in lockstep (with cross-checking on a per-instruction basis). A fault in any single CPU could be detected, the affected board would take itself out of service, and the redundant board would continue processing. There was no failover, there was no rollback - transactions were not lost. Everything else was duplicated - memory, disk, power, I/O. Everything in lockstep, no lost transactions. All boards (even memory) could be replaced with the system on line and processing at full speed.

      These were expensive systems, engineered and tested to telco standards of reliability (5 9's availability, including planned and unplanned maintenance). They were used by customers who had real money riding on each transaction - telcos, banks, airlines - and who were willing to pay the premium for that sort of reliability.
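
      In case "5 9's" sounds abstract, the downtime budget it implies is easy to work out. A quick sketch, assuming a 365.25-day year:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes

def downtime_minutes_per_year(availability):
    """Minutes of allowed downtime per year at a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for label, a in (("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)):
    print(label, round(downtime_minutes_per_year(a), 2))
```

      Five nines works out to a little over five minutes a year, planned maintenance included.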

      This design became a lot harder when CPUs became non-deterministic, and lock-stepping on an instruction-by-instruction basis wasn't practical anymore. Stratus is still around in some form - not sure how their systems work today.

    2. Re:there is some very reliable hardware out there by HiThere · · Score: 1

      Yes. That was a bit better than a "tell me three times" system. But there are still failure modes (as you note) which was my point.

      --

      I think we've pushed this "anyone can grow up to be president" thing too far.
  37. If you'd been watching the attack maps, by tlambert · · Score: 1

    If you'd been watching the attack maps, you'd know that:

    (1) It's China
    (2) It's likely at the government level

    If you'd been watching current events, you know that:

    (3) China's economy has been crashing, going on three weeks now
    (4) They're really unhappy about people taking money out of, and shorting, Chinese stocks, adding to the crash
    (5) They've lost $3.25T in market cap since June 12th
    (6) That's just over 20% of their Gross National Product

    So it's likely they are attacking our financial markets over that.

    See also:

    "Key things to know about China's market meltdown"
    http://www.cnn.com/2015/07/08/...

  38. Re: Cheap hardware. Smart Software by Anonymous Coward · · Score: 0

    Ever notice just exactly who pushes renting everything? That would be people who own stuff. Cloud computing is like privatization in government. It will never get past the fact that somebody wants to make a profit, and so it will never be as cheap as everyone says.

    Look--if somebody buys cheap hardware, I can too. Control and monitoring stuff is getting better all the time. I can get that too. I don't need 'web scale' for everything and, unlike Google and Facebook, it damned well does matter if my data is in a consistent state everywhere all the time. They simply have different business needs for their own operations. It's not good or bad, it just is.

    Cloud computing isn't bad either. It's an excellent choice in some situations and an expensive and poor choice in others. It is a tool, and unless your business is very small, it had best not be your only tool.

  39. REGIMEN!! Not "regime". by Anonymous Coward · · Score: 0

    "has come under fire for a slack testing regime"

    The correct word is "regimen", not "regime".

  40. The OCP pedigree may be ok by Anonymous Coward · · Score: 0

    The arguments for less pedigreed hardware are
    The application is ok with an occasional server failure.
    A fancy compute server from a tier 1 vendor comes from the same ODM as an OCP server.
    A compute server will be obsolete in 3 years, but a telco platform is expected to last for 10-20.
    Both design and manufacturing make reliability. The OCP designs may actually get more thought and testing than their tier 1 cousins.
    (For example, the OCP power plan with distributed backup appears to be an improvement over the telco 48-volt centralized battery plant.)

    Some possible problems with this plan are:
    Part of the testing is for safety, this still seems necessary.
    Replacing h/w every 3 years isn't green.
    It seems to me that, if you populate a data center with junk, then failures might be more than occasional.
    Another problem is that some failures might just be flakies which is another thing for the application to deal with to prevent bad results.
    In anything new there will be other issues we don't know about yet.

    The OCP hardware needs to be good enough to avoid these problems, but no better.
    Even with zero testing in the design and manufacturing phases, it should be obvious in deployment whether the equipment meets these criteria.
    At best, this test engineer is saying that waiting till deployment is wasteful (or dangerous?).

    If there is a problem with OCP, it may be that writing general purpose applications to run on flaky hardware is a harder problem than just building stable hardware. Which says that market forces may push the OCP ODMs to make pretty good stuff.

  41. Re:Cheap hardware. Smart Software by sabri · · Score: 1

    Note that their datacenter disciplines are not actually proven to be the best, but boy do they think so.

    They are proven to be the best for their specific type of operations. I'm quite sure that their SOPs won't work for the banking or healthcare industry for example.

    If Facebook goes down, a bunch of 30 year olds are going to complain (teens use other social media these days, and grandparents won't care and try again later). If the Sutter Health (norcal hospital chain) network/DC goes down, people's health will be affected.

    Different operations and requirements require different budgets and ways of working. For hyperscalers such as FB and Google, RAID makes sense. Where RAID in this case is Redundant Amount of Inexpensive Devices.

    --
    I'm not a complete idiot... Some parts are missing.