Open Compute Project Comes Under Fire

Posted by samzenpus on Wednesday July 8, 2015 @09:35AM from the no-sir-I-don't-like-it dept.

judgecorp writes: The Open Compute Project, the Facebook-backed effort to create low-cost open source hardware for data centers, has come under fire for a slack testing regime. The criticism was first aired at The Register, where an anonymous test engineer described the project's testing as a "complete and total joke." The founding director of the project, Cole Crawford, has penned an open letter in reply. The issue seems to be that the testing for standard highly reliable hardware used by telcos and the like is very thorough and expensive. Some want the OCP to use more rigorous testing to replicate that level of reliability. Crawford argues that web-scale data centers are designed to cope with hardware failures, and "Tier 1" reliability would be a waste of effort.

Smells like astroturf. by An Ominous Coward · 2015-07-08 09:39 · Score: 3, Insightful

Probably Cisco trolling against a movement that's going to put them out of business.
Sooner the better, I say.
Autism ... Autism Everywhere. by Anonymous Coward · 2015-07-08 09:41 · Score: 3, Funny

Some people just have to get a burr up their ass [arse] about everything.
Wait, Register is still up? Do they still say 'boffin' every paragraph? I couldn't bear to click through.
Cheap hardware. Smart Software by biojayc · 2015-07-08 09:48 · Score: 5, Insightful

You don't need expensive hardware to run datacenters. You need cheap commodity hardware with smart software on top. Just ask Google or Facebook.
Saying you test is easy. by digsbo · 2015-07-08 09:48 · Score: 4, Insightful

But testing well is really, really hard. And expensive, especially for data center scenarios. If you haven't put it in an oven and observed the effects, it's not tested for telco data centers.
1. Re:Saying you test is easy. by GerryGilmore · 2015-07-08 10:02 · Score: 3, Informative
  
  And there is the rub. NEBS testing (telco-level) is horrifically expensive and - for DC applications - totally unnecessary. NEBS servers have to withstand that because they are often the *only* server performing a certain function in the CO. Not anywhere near the same use-case.
2. Re:Saying you test is easy. by digsbo · 2015-07-08 10:13 · Score: 2
  
  Agreed, but still, even in a non-NEBS scenario, there's still a lot to be tested because you're putting something potentially flammable in someone's data center. It's really easy to think of designing so a server failure doesn't bring a cluster down, but a server failure that results in a fire has the potential to do more.
  The one time I had a fire in a test lab, it really scared me, and made me realize as rare as that kind of thing is, it's potentially disastrous. And that's why they test for it.
Isn't this expected? by fuzzyfuzzyfungus · 2015-07-08 09:57 · Score: 4, Insightful

I don't know if it's a good idea or not (probably depends on who you are, and I'm sure that there will be some people who choose incorrectly); but is it really a surprise that OCP would be doing their testing on the cheap 'n cheerful side of things?

It was my understanding that their premise, from the beginning, was that existing hardware vendors were excessively focused on adding costly, thermally demanding, and often proprietary, features at the hardware level that were unnecessary if you were willing to compensate for their absence in your software design.

There is obviously some level of reliability below which no compensation at the software level is possible (if you can't run the algorithm for detecting errors because it keeps glitching out, it's probably not going to work); but the impression they always conveyed was that many of the more sophisticated reliability mechanisms are really features aimed at people who are substantially less able to cope with failure; and are therefore willing to pay substantially more for hardware that can invisibly paper over a variety of moderately serious failures and allow the software on top to run without incident; rather than buying lots of cheap hardware that has a risk of going down in a screaming heap.

So long as nobody gets any stupid optimistic ideas, I don't really see the issue. Sure, if Facebook were about sending men to Mars, they should seriously consider having three CPUs running in lockstep and voting on all operations and so on; but this project is about delivering as many ad impressions per dollar as possible; no reason to get worked up over the occasional glitch.
5 9's by The Raven · 2015-07-08 09:58 · Score: 4, Insightful

I'm gonna side with OCP on this one. It is far more economical to deal with reliability via redundancy than it is via expensive parts. This is why we use RAID rather than speccing our drives to last 10 years minimum. All the big players in the datacenter market have put thousands of hours each into designing systems tolerant of missing parts.
The downside is that writing custom stacks tolerant of missing pieces is fucking hard and a huge up-front investment for a company. Most off-the-shelf software does not have that level of redundancy and fault tolerance baked in already. This means that for many small to medium sized deployments it's cheaper to buy a really expensive fault tolerant rack of servers and run your off-the-shelf software on it than it is to buy into OCP with inexpensive hardware that's more open to failure, because your software is NOT open to failure.
Different strokes for different folks and all. Use the right tool for the job. And OCP was made by companies with massive data farms to fit their needs... and their needs are probably not your needs.

--
"I will trust Google to 'do no evil' until the founders no longer run it." Hello Alphabet.
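
A rough sketch of the arithmetic behind the redundancy argument in the comment above. It assumes node failures are independent (which the next comment rightly questions) and that the service stays up as long as any one replica is up. The function names and the 0.99 and 0.99999 figures are illustrative assumptions, not measurements from OCP or any vendor.

```python
def parallel_availability(node_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent nodes is up."""
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only when every replica is down at the same time.
    return 1.0 - (1.0 - node_availability) ** replicas


def replicas_needed(node_availability: float, target: float) -> int:
    """Smallest replica count whose combined availability meets `target`."""
    if not 0.0 < node_availability <= 1.0:
        raise ValueError("node_availability must be in (0, 1]")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    n = 1
    while parallel_availability(node_availability, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    cheap = 0.99        # hypothetical availability of one cheap node
    five_nines = 0.99999
    print("cheap replicas needed:", replicas_needed(cheap, five_nines))
```

Under these assumptions, replicas_needed(0.99, 0.99999) comes out to 3. That only holds if the software can actually fail over between the nodes, which is the hard and expensive part the comment describes.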
testing is for design problems, too by FranTaylor · 2015-07-08 10:29 · Score: 3, Interesting

it doesn't matter how many redundant servers you have, if they are all going to fail in the same way
Be highly available in software, not hardware by poopie · 2015-07-08 10:31 · Score: 4, Insightful

I suspect open compute project welcomes additional testing resources for the benefit of everyone... as long as it doesn't involve an oppressive amount of process that simply serves to slow down progress.
But... Web scale IS different, so I can't blame the main sponsors for not prioritizing what isn't as important to them. Once you accept that ALL hardware fails, and that you can either pay more for more reliable hardware, or you can develop better software architecture to handle failures, you look at things differently. Spend your money once on good software engineering, instead of over and over on every server.
Nonsense by YuppieScum · 2015-07-08 10:32 · Score: 2

For *some* datacentre tasks you can use cheap, commodity hardware. For others, you need expensive, certified, bullet-proof hardware.

--
This sig left unintentionally blank.
1. Re:Nonsense by mysidia · 2015-07-08 13:25 · Score: 2
  
  There is no such thing as "bullet-proof hardware"
  Uh no... there definitely is. There's no X86 based system that really falls into this category though. Many mainframe systems are bulletproof, in the sense the mainframe won't fail or crash, or lose work, or corrupt data, upon any component failures. Tandem computers' systems and some other past solutions on the market were pretty darned bullet proof.
  That didn't mean no components failed -- only that when components died - CPUs and system bus included, things kept working.
  The shift to platforms such as X86 was a shift away from super-reliable systems and towards super-cheap systems.
  The tradeoff was made a long time ago..... cheaper always wins in the long run.
  Now there are many expensive X86 computers being sold to businesses as "super reliable servers" just like the ol' business computers were sold. Ultimately...... they're going to give way to cheaper X86, or their successor as time marches on.
  This goes back to business rule #1: Lower cost = More profit.
  The fact is, the business doesn't need things to be close to bulletproof ---- especially after the competition switches to the cheaper thing and uses the lower cost of their cheaper X86 servers to offer their services at lower price and undercut you in the marketplace.
Re:Cheap hardware. Smart Software by Anonymous Brave Guy · 2015-07-08 10:48 · Score: 4, Interesting

I think the point is that so far it is only used by "the most successful corporations on the internet". In fact, you can probably count the number of organisations in the entire world that qualify on the fingers of one hand, though it will take a few more fingers to count how much money they have invested to reach this point.
Unfortunately, as lovely and friendly as all the Software Defined X advances seem with their mantra of openness, almost no-one is actually building a "web-scale data centre" with a 24/7 staff dedicated to just swapping out broken hardware and effectively unlimited resources to devote to designing hardware architectures and building control software that can cope with frequent failures without losing significant amounts of real money. For normal organisations, even those with heavy IT requirements and 12 figure market caps, running your critical infrastructure on hardware that does have a serious level of testing and consequent robustness may still be advantageous.
(Full disclosure: I sometimes work for clients in the networking industry, though whether an industry shift towards things like OCP would benefit or harm them would be open to debate so I think I'm still reasonably neutral here.)

--
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
Re:Cheap hardware. Smart Software by Anonymous Coward · 2015-07-08 10:48 · Score: 3, Informative

While I was working at Amazon we were told to expect hardware failures and to build our software around it. I have a couple of friends doing hardware testing for AWS and all of their hardware is of extremely low quality and has major visible issues such as bowing, flimsy connectors, and little to no hardware redundancy in the hardware itself (no dual power supplies or hot swappable anything). This really isn't a surprise at all, it's just where the industry is going.
Who needs OCP? by viperidaenz · 2015-07-08 11:49 · Score: 2

MongoDB is Web-scale.
Test engineer says... by tlambert · 2015-07-08 12:37 · Score: 2

Test engineer says... big companies need to hire more test engineers.
Are we surprised?
Re:Sort of.. by Cramer · 2015-07-08 12:39 · Score: 2

We aren't talking about a rack full of dell/hp knock-off "servers". OCP hardware is rows of racks full of stripped down, barebones systems. If your "mission critical" app fails, it's because you or your data center are a bunch of fools. Resilience comes from redundancy. If you fail to provide the redundant hardware, or capacity to spin up your crapplication on other systems, then that's your damn fault. (just as much as choosing to build your own rack full of budget trash.)
OCP hardware is cheap, so you can afford a lot of it. But it's cheap, and thus, prone to higher failure rates. This equals, in enterprise definitions, an "unreliable infrastructure". In the end, it'll work out to roughly the same total cost, but with one all the money is spent up front to fill a room no one visits, vs. the other spending very little to fill the same room but has people in there regularly replacing failed components. (Banks prefer the former, Google, the latter.)
Re:Sort of.. by Bing Tsher E · 2015-07-08 13:52 · Score: 2

So all this cheap hardware gets deployed, then swapped out a whole bunch of times. The waste stream is much, much bigger because you're routinely scrapping out cheaply thrown together motherboards, etc.
It doesn't sound very green.
Re:Cheap hardware. Smart Software by Anonymous Brave Guy · 2015-07-08 13:55 · Score: 4, Interesting

Well, I have a few issues with the cloud hype, starting with the scarcity of evidence to support claims about cloud services being cheaper and/or more secure and/or more reliable than doing things yourself. Every major cloud provider has had serious downtime, and there is only so much you can attribute to being more visible at greater scale or to users not configuring HA tools properly. Far too many on-line services also run into significant security/privacy problems. And cost-wise going with the cloud rather than your own systems tends to be favourable at certain levels (other things being equal) but it can be outrageously expensive in other cases.
These myths aren't really the point here anyway. The point in this case is that no matter how fast your recovery time may be, whatever was happening on your hardware at the time it failed is lost, and in some cases you simply can't make that transparent to your users. Not everything in the world of programming is a distributed map-reduce where losing a hardware node means you just redistribute the 0.0001% of the job it was doing to another and no-one notices. Not everything in the world of networking can tolerate a multi-second failover process without an observable blip in connectivity. As for redundant/HA storage, the CAP theorem called and asked to speak with you about your database, but I think you were on with physics at the time so I just took a message.
It's not just about whether the wastage due to more frequent failures works out cheaper economically than paying a premium for better hardware. It's also about how much downtime you (or your customers) are willing to tolerate and what proportion of overall system time is spent just recovering from failures. If you've ever had the joy of watching the (N+1)-th drive fail in your RAID with N-way redundancy while it's still rebuilding from replacing the earlier failures, you'll know what I mean.

--
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
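
The rebuild scenario in the comment above can be put in rough numbers with a simple model. This sketch assumes drives fail independently at a constant rate, which may understate the real risk if drives in one array share a batch or are stressed by the rebuild itself. The function name, parameters, and the 5% annual failure rate are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def p_failure_during_rebuild(surviving_drives: int,
                             annual_failure_rate: float,
                             rebuild_hours: float) -> float:
    """Chance that at least one surviving drive fails before the rebuild ends.

    Assumes independent drives with a constant failure rate, so one drive
    survives a window of t hours with probability (1 - AFR) ** (t / 8760).
    """
    if surviving_drives < 0 or rebuild_hours < 0:
        raise ValueError("drive count and rebuild time must be non-negative")
    if not 0.0 <= annual_failure_rate < 1.0:
        raise ValueError("annual_failure_rate must be in [0, 1)")
    # Total exposure in drive-years across all surviving drives.
    exposure_years = surviving_drives * rebuild_hours / HOURS_PER_YEAR
    return 1.0 - (1.0 - annual_failure_rate) ** exposure_years


if __name__ == "__main__":
    # Hypothetical array: 7 surviving drives, 5% annual failure rate each.
    for hours in (12, 48, 168):
        p = p_failure_during_rebuild(7, 0.05, hours)
        print(f"{hours:>4} h rebuild: {p:.4%}")
```

The longer the rebuild window and the larger the array, the more exposure there is, which is the proportion-of-time-spent-recovering concern raised above.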
Re:Cheap hardware. Smart Software by JoeMerchant · 2015-07-08 15:18 · Score: 2

I've never had an N+1 drive fail in a RAID setup. What I have had happen is the power supply to the whole array fail... then we can talk about redundant power supplies, but, really, the data needs to be mirrored offsite at a place where a serious (fire / flood / riot / meteor strike / whatever) event doesn't take down all copies of the data / service. This was sort of the founding principle of ARPANET, anyway.
Economics varies, people negotiate bad contracts all the time that lead to higher costs of whatever approach they have taken. Not surprising that something with the hype of "the cloud" can get people to sign bad deals. Also not surprising that some "bulletproof" hardware is excessively premiumed compared to the advantages it conveys.
In a "rich infrastructure" 100 cheap cars beats a single tank. In a desert with bad to non-existent roads and a costly supply chain, you'll want the tank. Assuming these "cloud" data centers have sufficient infrastructure and scale, they should be able to do it better and cheaper. Of course, it's always possible to mismanage anything, and this goes double for security concerns.
If you want / need control and you can't afford to point to a sub-contractor not living up to their contract terms when something goes wrong, then do it in-house. If in-house is a single site, or the multiple sites you do have can't afford around the clock technical maintenance presence, then, yeah, go for the "good stuff" and let the expensive machines help you in your (ultimately futile) pursuit of perfection. If your organization is numbered in the 10,000s or larger, and top management takes IT seriously, they should seriously be employing fault tolerant methodologies - whether you use cheap crap for equipment or not.
The N+1 failure days will happen, and multi-second failover response times sound perfectly acceptable to me, unless you are in high-speed trading, in which case - a pox on you and your servers and may you lose Billions in your next equipment snafu. But, those days when you have the unacceptable failure (Fukushima Daiichi?) are the days when you step back and improve the design and methods. Generally speaking, there are bigger gains to be had with redundancy and distribution than there are with "more bulletproof" hardware slotted back into the same system design that just bit you.
the main benefit is flexibility by Chirs · 2015-07-08 15:25 · Score: 3, Informative

I don't think I'd ever go to the cloud because it's cheaper or more secure or more reliable. The main benefit that I see is flexibility.
If your loads are stable and known in advance, it's likely cheaper to buy hardware and staff people to take care of it. On the other hand if loads spike wildly from one day to the next the cloud makes perfect sense. Need a thousand cores of compute power right this second? Amazon/Google/Rackspace/HP would be happy to rent it to you.