Slashdot Mirror


Is Big Data Leaving Hadoop Behind?

knightsirius writes: Big Data was seen as one of the next big drivers of the computing economy, and Hadoop was seen as a key component of the plans. However, Hadoop has had a less than stellar six months, beginning with the lackluster Hortonworks IPO last December and the security concerns raised by some analysts. Another survey records only a quarter of big data decision makers actively considering Hadoop. With rival Apache Spark on the rise, is Hadoop being bypassed in big data solutions?

15 of 100 comments

  1. Nope. Not happening. by Art Popp · · Score: 5, Informative

    FTA: ...biggest problem is that people allegedly still can’t use Hadoop... Hadoop is still too expensive for firms...

    Hadoop is an ecosystem with lots of moving parts. Those are real problems above, but Spark (Particle) is not a standalone replacement for an ecosystem the size of Hadoop. Moreover, it has no problem integrating with YARN on Hadoop, where you can run HBase, Cassandra, MongoDB, Rainstor, Flume, Storm, R, Mahout and plenty of other YARN-compatible goodies.

    It's also worth noting that Hortonworks and Cloudera may not be "taking off as hoped" because the branded big-iron players are finally in the ring. They hide the (rather hideous) complexity and integrate well with any existing systems you have with those vendors. Teradata, for instance, has a Hadoop/Aster integration that's impressive and turnkey. They bought Rainstor, and will soon have it integrated, and that's Spark-fast and hassle-free. IBM's BigInsights is very impressive if you have the means.

    So, no, Hadoop is in no danger of being replaced. The value proposition that my $4.2M cluster outperformed two $6M "big name" vendor supported appliances is undeniable, but only that stark when your $'s have an M suffix. What will probably occur though is that we'll end up replacing every component in Hadoop with a faster one, and MapReduce will become a memory as things like Spark and Hive/Tez move away from that methodology.
         

  2. Rival? by Culture20 · · Score: 2

    I thought Spark worked from within Hadoop. Is that like using emacs to run vi?

    1. Re:Rival? by Anonymous Coward · · Score: 5, Informative

      They need to refer to the pieces of Hadoop. HDFS is the storage piece and many things can interface to it; it isn't great but is often good enough, especially if you just have a couple of local disks per node. YARN is the scheduler piece; it is mostly awful performance-wise but is fairly easy to use... long run it'll lose to something like Mesos, I think. MR is the map reduce piece that everyone thinks of when you say Hadoop. Almost everything will run quicker in Spark (still using a map/reduce methodology) than in Hadoop MR.

      As a side note, I don't know anyone who still writes MR jobs directly; they are all doing Pig or HiveQL.
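
      For anyone unsure what the "map/reduce methodology" shared by Hadoop MR and Spark looks like, here is a rough single-process sketch in plain Python. It is not Hadoop's or Spark's API; the names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for illustration, and a real framework would run each phase across many nodes.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: fold each key's list of values into a single count."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in groups.items()
    }


def word_count(lines):
    """Chain the three phases (hypothetical helper, not a library call)."""
    return reduce_phase(shuffle(map_phase(lines)))
```

      Pig and HiveQL generate this kind of pipeline for you, which is presumably why few people write the phases by hand.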

    2. Re:Rival? by careysub · · Score: 3, Interesting

      They need to refer to the pieces of Hadoop. HDFS is the storage piece and many things can interface to it; it isn't great but is often good enough, especially if you just have a couple of local disks per node. YARN is the scheduler piece; it is mostly awful performance-wise but is fairly easy to use... long run it'll lose to something like Mesos, I think.

      That's a good call. With Cloudera and HortonWorks both adding new components to the Hadoop stack, the number of components has exploded in the last year or two, and that can be a bad thing. The complexity of the whole ecosystem is getting horrendous: in the last year a typical configuration file has doubled from 250 or so to 500 configuration items, almost all of them undocumented (unless you read the code, which scarcely qualifies as "documented"). For a practical deployment you are pretty much forced to use a commercial stack to get something up and running in a manageable fashion. And then there is the fact that the HDFS foundation is showing its age.

      MR is the map reduce piece that everyone thinks of when you say Hadoop. Almost everything will run quicker in Spark (still using a map/reduce methodology) than in Hadoop MR.

      Spark on Mesos is looking mighty awesome.

      As a side note, I don't know anyone who still writes MR jobs directly; they are all doing Pig or HiveQL.

      MapReduce is still viable for stable production jobs, but not in a dynamic requirements environment.

      Although HiveQL is alive and kicking, the complete replacement of Hive Server with Hive Server 2, while possibly an improvement in usability overall (I am not convinced), trashes your skill investment in the (now) obsolete Hive stack component. Maybe I am just grousing, but I start having reservations about technology planning in the data center when a key stack component changes so much in a relatively short period of time.

      --
      Starships were meant to fly, Hands up and touch the sky - Nicki Minaj
  3. Relevance of Security by Luthair · · Score: 4, Funny

    Is security really that big of a deal? Isn't the intent to run it on a private network to crunch numbers behind the scene?

    We don't ask about the susceptibility of safety deposit boxes to crowbars and dynamite, they're inside a vault.

  4. Re:Nope. Not happening. by Hognoxious · · Score: 4, Insightful

    As a storage admin for a Multi-Hospital organization using anything open source is not really an option if we want to keep the HIPPA-potamus away.

    Makes sense. If they can see your source (which you have to show them, or it wouldn't be open) then it makes absolute sense they can totally see your data.

    You weren't previously the city manager of Tuttle, Oklahoma, were you?

    --
    Confucius say, "Find worm in apple - bad. Find half a worm - worse."
  5. Re:What Fucking Decade Is It? by Tablizer · · Score: 2

    If your data needs to be correct, define it and its relationships then use SQL. You will have to pay someone decent money to do this correctly.

    PHBs have to learn the hard way. They want it cheap, big, and now. Security & reliability issues are something they try to blame on somebody else using their well-honed spin skills.

  6. Only Spark? by sfcat · · Score: 2

    The problem with "big data" is that there are no vendor specs and the implementations are sometimes questionable. One provider that does better is SQLStream (http://www.sqlstream.com), which has a streaming DB that is controlled via SQL. In addition to normal tables, you have streams, which are relational typed conduits through which data flows, and windows, which are time- (and row-) based groups of tuples that can be used in aggregate queries with all the standard SQL functions (there's also Java UDXes and MED support). Designing your middleware on top of a SQL engine is a much better design pattern than doing it all with hand-wired Java. All this and about 100x the throughput of a Hadoop program. Disclaimer: I'm an engineer at SQLStream.
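
    To give a feel for what a time-based window over a stream means, here is a generic sketch in plain Python. It is not SQLStream's API or its SQL syntax; `windowed_sum` is a hypothetical helper, and it assumes events arrive in timestamp order and that `window_seconds` is positive.

```python
from collections import deque


def windowed_sum(events, window_seconds):
    """Yield (timestamp, running_sum) for each incoming event.

    The sum covers only events whose timestamps fall in the interval
    (timestamp - window_seconds, timestamp]. `events` is any iterable of
    (timestamp, value) pairs, so it can be an unbounded generator.
    """
    window = deque()
    total = 0
    for ts, value in events:
        window.append((ts, value))
        total += value
        # Evict rows that have slid out of the time window.
        while window[0][0] <= ts - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield ts, total
```

    Each output row is an aggregate over "the last N seconds" that is updated as data flows past, without ever storing the whole stream.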

    --
    "Those that start by burning books, will end by burning men."
    1. Re:Only Spark? by phantomfive · · Score: 2

      I read your post but I still have no idea what your 'streams' are, or why anyone would want to use them.

      --
      "First they came for the slanderers and i said nothing."
  7. Hadoop was never really the right solution... by rockmuelle · · Score: 5, Insightful

    A scripting language with a good math/stats library (e.g., NumPy/Pandas) and a decent RAID controller are all most people really need for most "big data" applications. If you need to scale a bit, add a few nodes (and put some RAM in them) and a job scheduler into the mix and learn some basic data decomposition methods. Most big data analyses are embarrassingly parallel. If you really need 100+ TB of disk, set up Lustre or GPFS. Invest in some DDN storage (it's cheaper and faster than the HDFS system you'll build for Hadoop).

    Here's the breakdown of that claim in more computer sciencey terms: Almost all big data problems are simple counting problems with some stats thrown in. For more advanced clustering tasks, most math libraries have everything you need. Most "big data" sizes are under a few TB of data. Most big data problems are also I/O bound. Single nodes are actually pretty powerful and fast these days. 24 cores, 128 GB RAM, 15 TB of disk behind a RAID controller that can give you 400 MB/s data rates will cost you just barely 5 figures. This single node will outperform a standard 8 node Hadoop cluster. Why? Because the local, high density disks that HDFS encourages are slow as molasses (30 MB/s). And...

    Hadoop has a huge abstraction penalty for each record access. If you're doing minimal computation for each record, the cost of delivering the record dominates your runtime. In Hadoop, the cost is fairly high. If you're using a scripting language and reading right off the file system, your cost for each record is low. I've found Hadoop record access times to be about 20x slower than Python line read times from a text file, using the _same_ file system for Hadoop and Python (of course, Hadoop puts HDFS on top of it). In Big-O terms, the 'c' we usually leave out actually matters here - O(1*n) vs. O(20*n). 1 hour or 20 hours, you pick.
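
    The arithmetic behind the last two paragraphs can be sketched in a few lines of Python. This is a back-of-the-envelope model, not a benchmark: the 400 MB/s, 30 MB/s, 8 node and 20x figures come from the text above, while the function names, the assumption that each Hadoop node reads at 30 MB/s, and the assumption that cluster throughput scales perfectly with node count are illustrative only.

```python
MB_PER_TB = 1_000_000  # decimal units, as disk vendors count


def scan_hours(data_tb, mb_per_s_per_node, nodes=1):
    """Hours to read data_tb once, assuming throughput scales perfectly
    with node count (an optimistic assumption for the cluster)."""
    aggregate_mb_per_s = mb_per_s_per_node * nodes
    return data_tb * MB_PER_TB / aggregate_mb_per_s / 3600


def slowdown(delivery_cost, compute_cost, delivery_penalty=20):
    """Ratio of total runtimes when only record delivery gets slower.

    Both runs do the same compute per record; the slow run pays
    delivery_penalty times more to fetch each record. The record count
    n cancels out, so only the per-record costs matter.
    """
    fast = delivery_cost + compute_cost
    slow = delivery_penalty * delivery_cost + compute_cost
    return slow / fast


single_node = scan_hours(1, 400)       # one node, RAID array at 400 MB/s
cluster = scan_hours(1, 30, nodes=8)   # 8 nodes at 30 MB/s = 240 MB/s total
```

    Under these assumptions the single node scans 1 TB in roughly 0.7 hours against roughly 1.2 hours for the cluster. `slowdown(1, 0)` gives the full 20x when there is no computation per record, and the ratio falls toward 1 as per-record computation grows, which is why the penalty hurts most on simple counting jobs.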

    If you're really doing big data stuff, it helps to understand how data moves through your algorithms and architect things accordingly. Almost always, a few minutes of big-O thinking and some basic knowledge of your hardware will give you an approach that doesn't require Hadoop.
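
    As an illustration of the basic data decomposition mentioned earlier, here is a minimal sketch of an embarrassingly parallel count in plain Python. The function names are hypothetical. It uses threads only to keep the example self-contained; for CPU-bound work in CPython you would more likely swap in `ProcessPoolExecutor` or run each chunk as a separate job under a scheduler.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Decompose the data into n_chunks roughly equal, independent slices."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_chunk(chunk):
    """Work done on one chunk; it needs nothing from any other chunk."""
    return Counter(chunk)


def parallel_count(records, n_workers=4):
    """Count occurrences of each record by processing chunks independently
    and merging the partial results at the end."""
    chunks = split_into_chunks(records, n_workers)
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)
    return total
```

    Because no chunk depends on another, the only coordination is the final merge, and that is the property that lets a plain job scheduler stand in for a heavier framework.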

    tl;dr: Hadoop and Spark give people the illusion that their problems are bigger than they actually are. Simply understanding your data flow and algorithms can save you the hassle of using either.

    -Chris

  8. Re:Nope. Not happening. by Rich0 · · Score: 4, Insightful

    I agree that the problem is that most companies don't know how to run it

    I think a bigger problem is that most companies don't even know what big data actually is. It is a big buzzword. I hear managers talking about it all the time. Half the time they're talking about some database table with a few hundred thousand records in it. Other times they're talking about some repository full of documents or binary files that might be terabytes in size, but it is just random stuff. They don't actually have questions in mind that they want to answer, and ultimately that is what tools like Hadoop are about.

    I've heard "big data" applied to problems that are basically just file shares or the like.

    Then if a company really does have a problem where Hadoop and such is useful, they want to buy some product off the shelf that solves that particular problem, and usually they don't exist. Or they want to hire a bunch of random rent-a-coders and have them solve the problem, and they go about solving it with single-threaded solutions written in .net or whatever the commodity solution in use is at the company.

    Sure, your Facebooks and Googles and Netflixes and Amazons know what they're doing. Your average GE or Exxon or Pfizer generally doesn't do that level of comp sci.

  9. Re:Nope. Not happening. by jbolden · · Score: 2

    You are overestimating the difficulty at this point. This is not compsci anymore and hasn't been for many, many years. It isn't even hard administration. It is probably easier to get a big data system running in 2015 than it was to use Oracle in 1995.

    As for your examples, you went way too big. GE is a huge DevOps shop; they know what Big Data is. Exxon has massive supercomputing datasets. I would bet they were doing big data long before it got cool. Pfizer has an IT department that is some of everything, but they have many, many data warehouses, so I can't imagine they aren't playing with data lakes.

  10. Re:What Fucking Decade Is It? by jbolden · · Score: 3, Insightful

    Hadoop didn't exist in 2005. The 1.0 release was December 2011; the earliest versions I know of were floating around in 2007.

    As for using SQL, Hadoop supports SQL (mostly). The problem with Hadoop is that the data sets are too big for RDBMS engines to handle. It has nothing to do with developer skill; it has to do with the type of database engine and how data is being handled.

  11. Big Data != toolset by Required Snark · · Score: 2

    Both Pointy Headed Bosses and Slashdot loooove talking about tools. As the posts generally show, both PHBs and Slashdotters have no clue about what Big Data is used for. It's all about the buzzwords and technology, not about use and utility.

    There are no references to any algorithms. Rank ordering? Nope. Social graph analytics? No. Netflix style recommendations? Uh-uh. Statistics? None.

    Without talking about data sets, algorithms and expected results, yammering about tools is meaningless. Hot air.

    But who cares, because you all get to call each other stupid, and try and prove that you are the biggest baddest tech weenie on the block. From here it seems that you don't even know where the block is. You don't even seem to know which direction you need to go to get to a street. (Like the implied car reference there?)

    I'm beyond unimpressed. It's obvious that no one has a clue what they are talking about. Go off and learn something, and then maybe you will be able to write a post that isn't a waste of time. Other than that, STFU and get off my lawn.

    --
    Why is Snark Required?
  12. Re:What Fucking Decade Is It? by Anonymous Coward · · Score: 2, Funny

    Hadoop didn't exist in 2005.

    Unless you work in recruitment.