Slashdot Mirror


Is Big Data Leaving Hadoop Behind?

knightsirius writes: Big Data was seen as one the next big drivers of computing economy, and Hadoop was seen as a key component of the plans. However, Hadoop has had a less than stellar six months, beginning with the lackluster Hortonworks IPO last December and the security concerns raised by some analysts.. Another survey records only a quarter of big data decision makers actively considering Hadoop. With rival Apache Spark on the rise, is Hadoop being bypassed in big data solutions?

3 of 100 comments (clear)

  1. Nope. Not happening. by Art+Popp · · Score: 5, Informative

    FTA: ...biggest problem is that people allegedly still can’t use Hadoop... Hadoop is still too expensive for firms...

    Hadoop is an ecosystem with lots of moving parts. Those are real problems above, but Spark (Particle) is not a stand alone replacement for an ecosystem the size of Hadoop. Moreover it has no problem running integrating with Yarn on Hadoop where you can run Hbase, Cassandra, MongoDB, Rainstor, Flume, Storm, R, Mahout and plenty of other Yarn-compatible goodies.

    It's also worth noting that Hortonworks and Cloudera may not be "taking off as hoped" because the branded big-iron players are finally in the ring. They hide the (rather hideous) complexity and integrate well with any existing systems you have with those vendors. Teradata for instance has a Hadoop/Aster integration that's impressive and turn key. They bought Rainstor, and will soon have it integrated, and that's Spark-fast and hassle free. IBM's BigInsights is very impressive if you have the means.

    So, no, Hadoop is in no danger of being replaced. The value proposition that my $4.2M cluster outperformed two $6M "big name" vendor supported appliances is undeniable, but only that stark when your $'s have an M suffix. What will probably occur though is that we'll end up replacing every component in Hadoop with a faster one, and MapReduce will become a memory as things like Spark and Hive/Tez move away from that methodology.
         

  2. Re:Rival? by Anonymous Coward · · Score: 5, Informative

    They need to refer the the pieces of hadoop. HDFS is the storage piece and many things can interface to it, it isn't great but is often good enough especially if you just have a couple local disks per node. YARN is the scheduler piece, it is mostly awful performance-wise but is fairly easy to use...long run it'll lose to something like mesos I think. MR is the map reduce piece that everyone thinks of when you say hadoop. Almost everything will run quicker in spark(still using a map/reduce methodology) than hadoop MR.

    As a side note, I don't know anyone who still writes MR jobs directly, they are all doing pig or hiveql.

  3. Hadoop was never really the right solution... by rockmuelle · · Score: 5, Insightful

    A scripting language with a good math/stats library (e.g., NumPy/Pandas) and decent raid controller are all most people really need for most "big data" applications. If you need to scale a bit, add few nodes (and put some RAM in them) and a job scheduler into the mix and learn some basic data decomposition methods. Most big data analyses are embarrassingly parallel. If you really need 100+ TB of disk, setup Lustre or GPFS. Invest in some DDN storage (it's cheaper and faster than the HDFS system you'll build for Hadoop).

    Here's the break down of that claim in more computer sciencey terms: Almost all big data problems are simple counting problems with some stats thrown in. For more advanced clustering tasks, most math libraries have everything you need. Most "big data" sizes are under a few TB of data. Most big data problems are also I/O bound. Single nodes are actually pretty powerful and fast these days. 24 cores, 128 GB RAM, 15 TB of disk behind a RAID controller that can give you 400 MB/s data rates will cost you just barely 5 figures. This single node will outperform a standard 8 node Hadoop cluster. Why? Because the local, high density disks that HDFS encourages are slow as molasses (30 MB/s). And...

    Hadoop has a huge abstraction penalty for each record access. If you're doing minimal computation for each record, the cost of delivering the record dominates your runtime. In Hadoop, the cost is fairly high. If you're using a scripting language and reading right off the file system, your cost for each record is low. I've found Hadoop record access times to be about 20x slower than Python line read times from a text file, using the _same_ file system for Hadoop and Python (of course, Hadoop puts HDFS on top of it). In Big-O terms, the 'c' we usually leave out actually matters here - O(1*n) vs. O(20*n). 1 hour or 20 hours, you pick.

    If you're really doing big data stuff, it helps to understand how data moves through your algorithms and architect things accordingly. Almost always, a few minutes of big-O thinking and some basic knowledge of your hardware will give you an approach that doesn't require Hadoop.

    tl;dr: Hadoop and Spark give people the illusion that their problems are bigger than they actually are. Simply understanding your data flow and algorithms can save you the hassle of using either.

    -Chris