Intel Launches Its Own Apache Hadoop Distribution

Posted by Soulskill on Tuesday February 26, 2013 @09:51AM from the if-you-want-something-done-right-do-it-yourself dept.

Nerval's Lobster writes "The Apache Hadoop open-source framework specializes in running data applications on large hardware clusters, making it a particular favorite among firms with a lot of backend infrastructure (and a whole ton of data) to manage, such as Facebook and IBM. So it'd be hard to blame Intel for jumping into this particular arena. The chipmaker has produced its own distribution for Apache Hadoop, apparently built 'from the silicon up' to efficiently access and crunch massive datasets. The distribution takes advantage of Intel's work in hardware, backed by the Intel Advanced Encryption Standard (AES) Instructions (Intel AES-NI) in the Intel Xeon processor. Intel also claims that a specialized Hadoop distribution riding on its hardware can analyze data at superior speeds: namely, one terabyte of data can be processed in seven minutes, versus hours for some other systems. The company faces a lot of competition in an arena crowded with other Hadoop players, but that won't stop it from trying to throw its muscle around."

Re:Speed by Anonymous Coward · 2013-02-26 10:36 · Score: 5, Informative

The performance claim in the summary seems to come from page 15 of this presentation, where a 1TB sort (presumably distributed) goes from 4 hours to 7 minutes. I can't find the details for that test, but most of the speedup comes from better hardware (a faster CPU and network adapter, and SSDs instead of HDDs). Switching from some other Hadoop distribution to Intel's accounts for only a 40% speedup, which is a fairly modest gain.
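To put rough numbers on that split, here is a back-of-the-envelope sketch in Python. It is not from the presentation. It assumes the hardware and distribution gains multiply, and since "40% speedup" is ambiguous it tries both readings (40% more throughput, or 40% less runtime). The function name is made up for illustration.

```python
def hardware_share(total_speedup, distribution_speedup):
    """Speedup left over for hardware, assuming the two factors multiply."""
    return total_speedup / distribution_speedup

# Figures quoted in the comment: 4 hours down to 7 minutes for the 1TB sort.
before_minutes = 4 * 60
after_minutes = 7
total = before_minutes / after_minutes  # roughly 34x overall

# Assumption A: "40% speedup" means 1.4x the throughput.
dist_a = 1.4
# Assumption B: "40% speedup" means the runtime shrinks by 40%.
dist_b = 1 / (1 - 0.4)

hw_a = hardware_share(total, dist_a)  # roughly 24x from hardware
hw_b = hardware_share(total, dist_b)  # roughly 21x from hardware

print(f"total: {total:.1f}x")
print(f"hardware if distribution gives {dist_a:.2f}x: {hw_a:.1f}x")
print(f"hardware if distribution gives {dist_b:.2f}x: {hw_b:.1f}x")
```

Under either reading, the hardware would account for something like a 20x or greater improvement, with the distribution contributing well under 2x.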
The biggest performance benefit of Spark comes from avoiding disk and network access, so improving those bottlenecks will presumably reduce Spark's lead over Hadoop somewhat. But it's hard to say how well Spark would do with this particular hardware and test setup. I would guess it's still much faster than their Hadoop distribution. (Note: I'm a Spark power user but not an expert in its performance.)