Slashdot Mirror


The Joys and Hype of Hadoop

theodp writes "Investors have poured over $2 billion into businesses built on Hadoop," writes the WSJ's Elizabeth Dwoskin, "including Hortonworks Inc., which went public last week, its rivals Cloudera Inc. and MapR Technologies, and a growing list of tiny startups. Yet companies that have tried to use Hadoop have met with frustration." Dwoskin adds that Hadoop vendors are responding with improvements and additions, but for now, "It can take a lot of work to combine data stored in legacy repositories with the data that's stored in Hadoop. And while Hadoop can be much faster than traditional databases for some purposes, it often isn't fast enough to respond to queries immediately or to work on incoming information in real time. Satisfying requirements for data security and governance also poses a challenge."

55 comments

  1. Shorter by Anonymous Coward · · Score: 0

    Hadoop isn't a silver bullet.

  2. Advertising by Anonymous Coward · · Score: 0

    Most of the practical uses advertised for HDP are targeted at people who want to snarf and massage data to make a fraction of a penny.

  3. Well No Shi... by bigdady92 · · Score: 5, Informative

    Hadoop is not a magic thing that can all of a sudden produce reams of new data sets. The setup, on an enterprise scale, takes thousands or tens of thousands of dollars in hardware. Then you have the Map/Reduce jobs to create as well as pointing all your data to the new clusters. Then the tweaking starts, and then your pointy haired Boss or Accounting PencilTwit comes to you and demands results for all of this capital expense you just had them buy for some pinhead to get a better dashboard in sales.

    Hadoop, done right, takes many departments to work on and organize in a big enterprise. Small shops may have one guy who is both SA and Programmer who could get the job done enough to make a difference. Furthermore, you NEED a full install from a big vendor. Installing Hadoop from OpenSource is a nightmare, and the big vendors make it painfully simple to get the job done quickly. Can you do it by hand? Sure. Do you have the time? Not when you have other projects to work on and you can spend the company's capital to get the install and config done in 1/10th the time. /Cloudera Certified //A year later and they still don't know how to get data through the pipeline ///Setting up the hardware was a BLAST!

    --
    Wheel of Time: Book by Book and Sumview (summary review) Bigdady92 style: http://bigdady92.blogspot.com/
    1. Re:Well No Shi... by Xyrus · · Score: 1

      This also assumes the data and the domain you're trying to apply Hadoop to are ones that Hadoop can effectively be useful for. A lot of PHBs and such are pretty ignorant when it comes to the problems that Hadoop can be applied to.

      --
      ~X~
  4. Very overhyped technology by Anonymous Coward · · Score: 0

    It really doesn't solve the hardest problems of data analytics and the stuff it does solve is not overly hard. Hadoop is not without value but it is a fad. Misapplied in many places based on its hype alone.

  5. Doesn't seem to be stopping the NSA by Anonymous Coward · · Score: 4, Interesting

    Check out the job postings in central Maryland near BWI: Java, Hadoop, TS/SCI with full scope poly. Hundreds of postings.

    There is only one customer near BWI that requires the last.

  6. Just the passing of a fad by Anonymous Coward · · Score: 0

    There's a shocker. People jumping on the 'cloud' bandwagon and then suddenly finding that HDFS/MapReduce don't actually match their requirements properly.

    If you are doing realtime queries then Hadoop isn't for you. If your data fits on two or three servers then you're probably better off just getting one big server. Most people don't actually understand what Hadoop is for: It's designed to avoid the disk read bottleneck by aggregating hundreds of spindles on different servers together in one unified data volume. There's a major latency cost to that; the payback is that you can run queries on data that would never fit on a single server.
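
    A rough back-of-envelope sketch of that trade-off, in Python. The 100 MB/s per-disk read rate and the 10 TB dataset size are assumed round numbers for illustration, not measurements, and the model ignores network and scheduling overhead entirely:

```python
# Back-of-envelope scan time: one disk vs. many disks read in parallel.
# Both constants below are assumed round numbers, not measurements.

DISK_MB_PER_S = 100              # assumed sequential read rate of one spindle
DATASET_MB = 10 * 1000 * 1000    # 10 TB expressed in MB (decimal units)


def scan_hours(dataset_mb: float, spindles: int,
               mb_per_s: float = DISK_MB_PER_S) -> float:
    """Hours to read the whole dataset once, assuming perfectly even striping."""
    seconds = dataset_mb / (mb_per_s * spindles)
    return seconds / 3600


for n in (1, 10, 100, 500):
    print(f"{n:>4} spindles: {scan_hours(DATASET_MB, n):6.2f} hours")
```

    Under these assumptions a single disk needs a bit under 28 hours for one full pass, while 100 spindles need about 1000 seconds. The fixed job-startup latency is what makes the same setup a poor fit for small or interactive queries.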

  7. Apache Spark > Hadoop by Code+Herder · · Score: 5, Informative

    I used to be a big fan of Hadoop until I gave Apache Spark a try. My god, the speed, ease of use and install simplicity was just ridiculous. I mean, words failed me the first time I used it, I got it installed and working under 2 hours and it was so blazing fast, it was just a joke.

    For people who took a look a few years back, it has matured a lot from an interesting prototype to something I now use in production on my clients' data. Documentation is still a bit sketchy for niche functions but it's improved a lot also.

    https://spark.apache.org/

  8. It's simple... by Anonymous Coward · · Score: 5, Funny

    The reason they're running into problems is they haven't fully embraced the synergy in B2B ROI cloud possibilities. If they utilize agile scrum development, they will be able to be on the bleeding edge of viral blog immersion while reaching convergence with real-time content management crowdsourcing.

    1. Re:It's simple... by Narcocide · · Score: 1

      You forgot vertical integration. :-p

    2. Re:It's simple... by Anonymous Coward · · Score: 0

      Well that's going straight into the objective paragraph of the old cv.

    3. Re:It's simple... by tehcyder · · Score: 2

      The reason they're running into problems is they haven't fully embraced the synergy in B2B ROI cloud possibilities. If they utilize agile scrum development, they will be able to be on the bleeding edge of viral blog immersion while reaching convergence with real-time content management crowdsourcing.

      The first ten words made sense and were in actual English. You're doing it wrong.

      --
      To have a right to do a thing is not at all the same as to be right in doing it
  9. Horton hears a hadoop? by Anonymous Coward · · Score: 0

    clearly they need to yell in unison if they want to save hadoopville.

  10. ETL by mveloso · · Score: 1

    I remember Cloudera saying that most people use hadoop for ETL. Not sure if you've checked, but hadoop is like the ne plus ultra of ETL tools. It's worth a look if you have to transform lots and lots of data.

    1. Re:ETL by ionrock · · Score: 1

      I remember Cloudera saying that most people use hadoop for ETL. Not sure if you've checked, but hadoop is like the ne plus ultra of ETL tools. It's worth a look if you have to transform lots and lots of data.



      The problem is you still have to Extract data from other systems, Transform them to make them suitable for Hadoop and Load them in HDFS (or S3). Once that data is available to Hadoop, it becomes extremely powerful.

      Practically all analytical systems have the same issue. The reason to use an analytics system, like Hadoop, is because the database is not fast enough to query. I say "fast enough" because even though many databases *could* be fast enough, it becomes contentious to perform queries that utilize resources required in production.

      I'm not holding my breath for ETL companies to arrive that make this initial process easier as each client would have different network, databases, and software that would have to be supported. A better tactic is to work towards publishing streams of data from the start and building an ETL system that can help distribute the leg work across an organization.
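
      For anyone unfamiliar with the Extract / Transform / Load steps mentioned in this comment, here is a minimal sketch in Python. The CSV layout, the field names, and the in-memory sink are all hypothetical stand-ins; a real pipeline would pull from a production system and write to HDFS or S3:

```python
# Minimal Extract -> Transform -> Load sketch using only the standard library.
# The source data, field names and sink are hypothetical placeholders.
import csv
import io
import json
from typing import Dict, List, TextIO

# Stand-in for an export from some other system.
SOURCE_CSV = "id,name,signup\n1, Alice ,2014-12-01\n2,Bob,2014-12-02\n"


def extract(text: str) -> List[Dict[str, str]]:
    """Extract: parse raw CSV text into one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: List[Dict[str, str]]) -> List[dict]:
    """Transform: fix types and strip stray whitespace."""
    return [
        {"id": int(r["id"]), "name": r["name"].strip(), "signup": r["signup"]}
        for r in rows
    ]


def load(rows: List[dict], sink: TextIO) -> int:
    """Load: write one JSON document per line and return the row count."""
    for row in rows:
        sink.write(json.dumps(row, sort_keys=True) + "\n")
    return len(rows)


sink = io.StringIO()  # stands in for a file on HDFS or S3
written = load(transform(extract(SOURCE_CSV)), sink)
print(written, "rows loaded")
```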
    2. Re:ETL by sfcat · · Score: 1

      I remember Cloudera saying that most people use hadoop for ETL. Not sure if you've checked, but hadoop is like the ne plus ultra of ETL tools. It's worth a look if you have to transform lots and lots of data.

      Um, for what purpose? After you use it as an "ETL" tool, the idea is that afterwards you can query it, analyze it, etc. Traditionally you used an ETL tool to get data into a database then used tools that spoke SQL to analyze the data. With Hadoop, you have to write all your ETL tools yourself. So using Hadoop as an ETL tool is really a bridge to nowhere.

      --
      "Those that start by burning books, will end by burning men."
    3. Re:ETL by Anonymous Coward · · Score: 0

      You do not have to write all of the ETL tools yourself. A lot of the ETL vendors such as Informatica and Talend are now providing support for Hadoop. If you can write one of their mappings then you effectively write ETL that runs in Hadoop.
         

  11. Hadoop is webscale! by Anonymous Coward · · Score: 1

    That means it is better.

  12. Use a tool for what it's good for by ADRA · · Score: 1

    Hadoop is good at generally running massive queries of tons of data in a relatively efficient amount of time. I say efficient and not fast, because the requests can vary from well structured for grid data sets to massive bloated ugly queries that would be massive bloated and ugly in any DBMS environment. If you want to talk about regulation, etc., I think you're barking up the wrong tree with Hadoop. If you're concerned with regulation, seed the DB with unique though meaningless data when importing and avoid all of those problems.

    --
    Bye!
  13. these goddamn kids on my lawn. by nimbius · · Score: 0

    Joys and Hype of Hadoop and Hortonworks and cloudera MapR...I'll say it one last time: I dont know what a pokaman is and i dont give a shit. this is slashdot for crying out loud and back in my day we played nethack on the VAX-785! and the only damn color we had was GREEN or ORANGE if you were in the upstairs lab! AND WE LIKED IT THAT WAY.

    --
    Good people go to bed earlier.
    1. Re:these goddamn kids on my lawn. by cruff · · Score: 1

      ... back in my day we played nethack on the VAX-785!

      I started out playing rogue on the Vaxen. Then there was plain hack. Those were the days. Still play nethack now and again.

  14. More accurate by BitZtream · · Score: 5, Informative

    Hadoop isn't a database.

    It's a data processing system for massive quantities of data processed and distilled in large batches. If you're trying to treat it as a database, you're doing it wrong. The article is simply using Hadoop for the wrong purpose.

    You use Hadoop to reduce large amounts of data into smaller more manageable collections of useful data, which can then be queried real time.
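
    A toy illustration of that "reduce a big pile into a small summary" shape, in plain Python. This only mimics the map, shuffle and reduce phases in one process; it is not the Hadoop API, and the log format and records are made up for the example:

```python
# Toy batch job: boil raw records down to a small summary table.
# Mimics the map -> shuffle -> reduce shape; this is NOT the Hadoop API.
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical raw records: "date,page,bytes_sent"
RAW_LOG = [
    "2014-12-15,/index.html,512",
    "2014-12-15,/about.html,128",
    "2014-12-16,/index.html,256",
    "2014-12-16,/index.html,1024",
]


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit one (page, bytes) pair per raw record."""
    for line in lines:
        _date, page, size = line.split(",")
        yield page, int(size)


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values that share a key, as the framework would between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Collapse each group to a single total."""
    return {key: sum(values) for key, values in groups.items()}


summary = reduce_phase(shuffle(map_phase(RAW_LOG)))
print(summary)
```

    The small `summary` dictionary is the kind of output you would then load into something that can actually answer queries quickly.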

    --
    Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
    1. Re:More accurate by sexconker · · Score: 1

      Hadoop isn't a database.

      It's a data processing system for massive quantities of data processed and distilled in large batches. If you're trying to treat it as a database, you're doing it wrong. The article is simply using Hadoop for the wrong purpose.

      You use Hadoop to reduce large amounts of data into smaller more manageable collections of useful data, which can then be queried real time.

      They can't be queried in real time. Some set of data can be queried, relatively quickly. That's the whole point.
      Hadoop takes your shit and fucks with it to the point where you can't ever be sure if you're getting truly live or complete data or not.

    2. Re:More accurate by Anonymous Coward · · Score: 0

      And k is?

    3. Re:More accurate by Anonymous Coward · · Score: 0

      kafka

    4. Re:More accurate by steelfood · · Score: 2

      No, that's MapReduce. Hadoop is a distributed data store. Or a distributed file store optimized for large files.

      It's not a database. It certainly isn't database management software. It won't manage your data for you. Even the myriad of tools built to run on top of it are nowhere near effective at it. Rather, they're more data consumers than data managers.

      --
      "If a nation expects to be ignorant and free in a state of civilization, it expects what never was and never will be."
    5. Re:More accurate by Anonymous Coward · · Score: 0

      No, that's MapReduce. Hadoop is a distributed data store. Or a distributed file store optimized for large files.

      It's not a database. It certainly isn't database management software. It won't manage your data for you. Even the myriad of tools built to run on top of it are nowhere near effective at it. Rather, they're more data consumers than data managers.

      I would be interested to see viable uses for Hadoop that don't require MapReduce. Granted, a decent Google query might uncover some things to research further, but does anyone have any solid examples with citations?

      Hadoop, or HDFS to some, has interesting possibilities as a distributed filesystem, but it has potential competition from distributed filesystems like Ceph, Cinder, Lustre, and GlusterFS...to name a few.

    6. Re:More accurate by Anonymous Coward · · Score: 0

      kafka

      He probably means k language and kdb. Not Kafka

    7. Re:More accurate by Anonymous Coward · · Score: 0

      Kdb+ from http://kx.com/ a column database.

  15. More accurate by slashdice · · Score: 3, Interesting

    You're better off using k to process your data.

    Source: we replaced hadoop with k. After a couple weeks of training, I was getting results faster than the high-priced hadoop contractors (most of them worked on the hadoop codebase, had written hadoop books, etc).

    --
    Copyright (c) 1990 - 2014 Dice. All rights reserved. Use of this comment is subject to certain Terms and Conditions.
  16. Paywalled by LordLimecat · · Score: 3, Informative

    Since when is it acceptable to post articles that are paywalled?

    We're not even going to pretend to care about the article?

    1. Re:Paywalled by Anonymous Coward · · Score: 2, Insightful

      Reading TFA before responding is considered bad form.

    2. Re:Paywalled by Anonymous Coward · · Score: 0

      +1 nice.

    3. Re:Paywalled by forgottenusername · · Score: 1

      Yeah, obnoxious. People ought to browse submissions in private browsing mode or something. Then if they happen to have a sub to a paywall site they'd see the article the same way people who don't would.

    4. Re:Paywalled by Anonymous Coward · · Score: 0

      It's harder to get this right than you might think, particularly if you don't submit articles regularly. Sometimes sites will give you the first few views without problems but will then show you the paywall. You may or may not be able to work around that by deleting their cookies and/or getting a new IP address. WSJ, in particular, has somewhat complicated rules.

      OK, maybe people should know about WSJ but they probably won't remember until they've been bitten by linking them in a submission. And there's tons of other sites that are just a google search away, some of which have the same annoying now-you-see-it-now-you-don't policies.

    5. Re:Paywalled by Anonymous Coward · · Score: 0

      You must be new here.

      Sometimes, I don't even read the summary. Or most of the comments.

    6. Re:Paywalled by tehcyder · · Score: 1

      You must be new here.

      Sometimes, I don't even read the summary. Or most of the comments.

      You must not be new here.

      --
      To have a right to do a thing is not at all the same as to be right in doing it
  17. Re:Apache Spark > Hadoop by Anonymous Coward · · Score: 1

    Running spark on hdfs seems to be a pretty good idea though, and you'll still need a YARN setup.

    Or you can push your spark deployment on mesos.

  18. LOL by sexconker · · Score: 0

    And while Hadoop can be much faster than traditional databases for some purposes,

    If by "some purposes" you mean "idiots who don't know how to design a relational database", then sure.

    1. Re:LOL by Anonymous Coward · · Score: 0

      Or data sizes on the order of 5 - 10 TB... or so I'm told.

    2. Re: LOL by Anonymous Coward · · Score: 0

      My porn collection!

  19. We Made it Work! by Anonymous Coward · · Score: 0

    "And while Hadoop can be much faster than traditional databases for some purposes, it often isn't fast enough to respond to queries immediately or to work on incoming information in real time. Satisfying requirements for data security and governance also poses a challenge."

    We did not have any issue integrating HADOOP into our SOA backend or with a traditional thick client with full control over access to search results that include export controlled information.

  20. Apple - $3B on crappy headphones. $19B on WhatsApp by enjar · · Score: 0

    Apple bought out Beats for $3B and change. They make middling, overpriced headphones that come in a variety of colors. Facebook dropped $19B on an app that sends messages. Facebook dropped $1B on a company that makes Polaroids on your smartphone.

    $2B of investments into multiple companies that are working on a technology platform that provides methods for sifting through vast amounts of certain types of business data, running on low-cost, commodity hardware and backed by an open source project seems positively rational in comparison. I recall similar "hype" regarding companies like RedHat, who were working to commercialize Open Source projects. Sure, some of them are going to eventually fold or shut down (or get bought out), but that's part of the risk of investing. I'd imagine that one of them will become successful at offering a very saleable product.

    Hadoop is only on v2, and still has unpolished bits and weirdness. But there's a burgeoning collection of add-ons and tools, and there are plenty of people who are using it successfully in production. I recall other open source projects that went through similar growing pains and weirdness, but eventually matured very nicely.

  21. Free mirror on nasdaq.com by michaelmalak · · Score: 4, Informative

    Free nasdaq.com mirror of this particular article.

  22. link formatting by Anonymous Coward · · Score: 0

    slashdot make your links visible. having basically the same formatting between your links and text makes this site useless. I shouldnt have to adjust my monitor or mouse over the whole article to find a link.

    until then, this is a 10 year subscriber, signing out.

  23. Re:Apple - $3B on crappy headphones. $19B on Whats by Anonymous Coward · · Score: 0

    Apple bought out Beats for $3B and change. They make middling, overpriced headphones that come in a variety of colors.

    If you are judging Beats by the sound quality, you are missing the point. It's a designer purse that you wear on your head. Beats exist to be seen, not heard. They are valuable because they are expensive, as opposed to the other way around.

    Does Prada care if Bear Grylls doesn't think they make rugged handbags? No. Does Beats care if you think they make middling headphones? No. They just want you to talk about how expensive they are, so their customers feel like special snowflakes.

  24. Embarrassingly Parallel by Anonymous Coward · · Score: 1

    So, Hadoop is a framework for processing embarrassingly parallel (running the same function on a massive amount of aligned data chunked into pieces) tasks using Java.
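
    To make "embarrassingly parallel" concrete, here is a small sketch in Python rather than Java. The chunk size, the worker count, and the `work` function are arbitrary choices for illustration, and a thread pool on one machine merely stands in for a cluster of workers:

```python
# Embarrassingly parallel sketch: same function on independent chunks,
# then a trivial combine step. A thread pool stands in for a cluster here.
from concurrent.futures import ThreadPoolExecutor
from typing import List


def chunked(data: List[int], size: int) -> List[List[int]]:
    """Split the input into independent, fixed-size pieces."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def work(chunk: List[int]) -> int:
    """Per-chunk task; needs nothing from any other chunk."""
    return sum(x * x for x in chunk)


data = list(range(1000))
chunks = chunked(data, 100)

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(work, chunks))

total = sum(partials)  # the combine step is a plain sum
print(len(partials), "chunks, total =", total)
```

    Because no chunk depends on another, the same split could be spread across processes or machines without changing the logic, which is the property a framework like this relies on.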

    This seems like a cluster-fuck (pun intended) to me that could get done as well or faster with an ordinary cluster environment with less software and memory overhead. For those in HPC, am I missing something? This also seems to have a very narrow scope of usage so you're getting a lot of mess for moderate returns.

    1. Re:Embarrassingly Parallel by Narcocide · · Score: 1

      Nope, you've missed nothing. It's over-hyped crap that only gained initial popularity because someone did it in Java, and enterprises like Java.

    2. Re:Embarrassingly Parallel by Anonymous Coward · · Score: 0

      Thx. That's what I took from reading about Hadoop. ;)

    3. Re:Embarrassingly Parallel by godrik · · Score: 1

      The main interest of Hadoop is that it makes it easy to do out of core computation if the computations are loosely coupled and are mostly IO-bound. For anything else, Hadoop is probably not the right tool and is overhyped and typically inefficient.
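
      As a single-machine sketch of what "out of core" means: the data is streamed from disk one record at a time and never held in memory all at once. The temporary file and its contents are made up for the example, and this says nothing about how Hadoop itself schedules the IO:

```python
# Out-of-core sketch: compute a mean by streaming a file line by line,
# so memory use stays constant regardless of file size.
import os
import tempfile


def running_mean(path: str) -> float:
    """Mean of one number per line, without loading the file into memory."""
    count = 0
    total = 0.0
    with open(path) as handle:
        for line in handle:  # the file object yields one line at a time
            total += float(line)
            count += 1
    return total / count if count else 0.0


# Build a small hypothetical input file holding the integers 1..10000.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    for i in range(1, 10001):
        tmp.write(f"{i}\n")
    path = tmp.name

mean = running_mean(path)
os.remove(path)
print(mean)
```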

  25. ohoh by Anonymous Coward · · Score: 0

    Only people smart enough to reinvent BigQuery should be using Hadoop,
    while the bosses think Hadoop is a button... Developers don't even know how to break problems into tasks; all they have is end-user SQL typing skills... XDD

  26. Re:Apache Spark > Hadoop by Anonymous Coward · · Score: 0

    Advantage of Spark is that you can get it working in 2 hours on a decent sized data set.

    Disadvantage is once you cross from decent sized data to big data and you go past simple use cases, Spark goes bonkers and takes away all the time you saved so far. But still there is enough hype with Spark because most people spent only those initial 2 hours.

  27. Re:Apple - $3B on crappy headphones. $19B on Whats by Anonymous Coward · · Score: 0

    What do Beats and WhatsApp have to do with Hadoop, as business propositions? Absolutely nothing. Parent is completely off topic, and apparently doesn't even know it.

  28. Scale by brunes69 · · Score: 1

    "The setup, on an enterprise scale, takes thousands or tens of thousands of dollars in hardware"

    You are off by at least two orders of magnitude, at least by any reasonable definition of "Enterprise".

    An enterprise grade hadoop cluster that is dealing with enterprise workloads is going to start roughly in the mid-six figures and grow into the low 7 or 8 figures over time and scale. Scale is not cheap.

    1. Re:Scale by Alan+Shutko · · Score: 1

      That's still an order of magnitude cheaper than stuff like Teradata.