Apache Hadoop Has Failed Us, Tech Experts Say (datanami.com)

← Back to Stories (view on slashdot.org)

Apache Hadoop Has Failed Us, Tech Experts Say (datanami.com)

Posted by EditorDavid on Saturday March 25, 2017 @09:34AM from the future-is-cloud-y dept.

It was the first widely-adopted open source distributed computing platform. But some geeks running it are telling Datanami that Hadoop "is great if you're a data scientist who knows how to code in MapReduce or Pig...but as you go higher up the stack, the abstraction layers have mostly failed to deliver on the promise of enabling business analysts to get at the data." Slashdot reader atcclears shares their report: "I can't find a happy Hadoop customer. It's sort of as simple as that," says Bob Muglia, CEO of Snowflake Computing, which develops and runs a cloud-based relational data warehouse offering. "It's very clear to me, technologically, that it's not the technology base the world will be built on going forward"... [T]hanks to better mousetraps like S3 (for storage) and Spark (for processing), Hadoop will be relegated to niche and legacy statuses going forward, Muglia says. "The number of customers who have actually successfully tamed Hadoop is probably less than 20 and it might be less than 10..."

One of the companies that supposedly tamed Hadoop is Facebook...but according to Bobby Johnson, who helped run Facebook's Hadoop cluster before co-founding behavioral analytics company Interana, the fact that Hadoop is still around is a "historical glitch. That may be a little strong," Johnson says. "But there's a bunch of things that people have been trying to do with it for a long time that it's just not well suited for." Hadoop's strengths lie in serving as a cheap storage repository and for processing ETL batch workloads, Johnson says. But it's ill-suited for running interactive, user-facing applications... "After years of banging our heads against it at Facebook, it was never great at it," he says. "It's really hard to dig into and actually get real answers from... You really have to understand how this thing works to get what you want."
Johnson recommends Apache Kafka instead for big data applications, arguing "there's a pipe of data and anything that wants to do something useful with it can tap into that thing. That feels like a better unifying principal..." And the creator of Kafka -- who ran Hadoop clusters at LinkedIn -- calls Hadoop "just a very complicated stack to build on."

6 of 150 comments (clear)

Min score:

Reason:

Sort:

MapReduce is great by Anonymous Coward · 2017-03-25 09:44 · Score: 4, Insightful

If 1) you have a staff of elite programmers like Google or Facebook, who have CS degrees from top universities and are accustomed to picking up new programming languages and tools on a continuing basis; AND
2) your business has a pressing need to crunch terabytes of logs or document data with no fixed schema and continually changing business needs.
For the average Fortune 500 (or even IT) shop, not so much. A '90s style data warehouse accessible through SQL queries works much better.
1. Re:MapReduce is great by Tablizer · 2017-03-25 11:36 · Score: 5, Insightful
  
  As a reminder, SQL is a query language and not a hardware technology. It doesn't dictate HOW to store data (assuming it meets certain minimum standards). You probably are referring to typical RDBMS.
  
  --
  Table-ized A.I.
It has not by gweihir · 2017-03-25 09:58 · Score: 3, Insightful

What has happened instead is that quite a few "tech experts" did not understand what it actually was and had completely unrealistic expectations. Map-reduce is nice when you a) have computing power coming out of your ears and b) have very specific computing tasks. That means that in almost all cases, this technology is a bad choice and that was rather obvious to any actual expert right from the start.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Idiotic babble by lucm · 2017-03-25 11:40 · Score: 5, Insightful

People who bash Hadoop without understanding at a very minimum the moving parts have obviously no experience with it.
Hadoop is not one thing. It's three:
1) a distributed filesystem (HDFS)
2) a job scheduler (Yarn)
3) a distributed computing algorithm (MapReduce)
Many tools like Hbase or Accumulo *need* HDFS. That's a core component and there's no equivalent in Spark. Anyone saying HDFS is obsolete is a clueless idiot.
Anyways the Spark vs Hadoop narrative is bullshit. A serious Spark setup usually runs on top of a Hadoop cluster, and often you can't get away entirely from MapReduce (or its actual successor, Tez) because Spark runs in-memory and doesn't scale as much; for some workloads you need the read-crunch-save aspect of MapReduce because there's just too much data, and MapReduce is also more resilient as you don't lose as much when a node crashes during a job. Spark is more advanced and has actual analytics capabilities thanks to a powerful ML library (while Hadoop is just distributed computing), but it's not a case of either/or.
For instance a common approach is to use Hadoop jobs to trim down your data (via Pig or other blunt tool) to a point where you can run machine learning algorithms on Spark.
As for Kafka, it's just a fucking message queue. It's fast and very powerful, but comparing it to Hadoop is like saying you should use Linux instead of MySQL.
Whoever considers buying services from those Snowflake morons, run away.

--
lucm, indeed.
Re:Do not blame the tool(s), blame the workman... by somenickname · 2017-03-25 12:59 · Score: 5, Insightful

My 4th grade English teacher used to say, "A bad workman blames his tools."
Sounds relevant to me here.
Apparently your 4th grade English teacher has never tried to use a hammer covered in spikes that arrived in a box labeled "Screwdriver".
Hadoop isnt just mapreduce and pig by vile8 · 2017-03-25 17:13 · Score: 5, Insightful

Hadoop starts with a vastly distributable, and resilient file system (HDFS) which enables, as a base, technologies that include things like HBase (columnar stores), Impala (Parquet example), Titan (graphs), Spark (lord everything.. its the emacs of data frameworks), or the latest projects which completely change the paradigm of how you are looking at data at unbelievable speeds. (who the hell runs mapreduce and expects real time performance?... its a full disk scan across distributed stores... and fairly sane from that perspective)

If you don't have problems that relate to these paradigms... dont use it. Seriously. Just because its new doesnt mean it fits every situation. Its not mysql/mariadb/postgresql... if you think its even remotely close to that simple you should run for the hills. If you have a significantly large (not talking hundreds of megs or even a couple gigs... you need to be thinking in Billions of rows here) configuration management problem then its a great base to layer other projects on top of to solve your problem.

Also, I found a large number of problems to solve using timestamped individual data cells that CANNOT be done using traditional sql methodologies. Lexicographic configuration models, analytics (obv), massive backup history just to name a few. If the management and installation of the cluster are scary... well...not everything in CS is easy... especially when it gets to handling the worlds largest datasets.... so, this probably isn't really your problem... call the sysadmins and ask them (politely) to help. Believe it or not the main companies have wizards which can help get you going across clusters... and even manage them visually (not that I ever would... UI's are for people who can't type).

When people (or just this CEO) says it doesn't deliver on its promise. You are likely trying to solve a problem wholy inappropriately. I have personally used it to solve problems like making real time recommendations in under 200ms across several gigs of personal data daily (totalling easily into terabytes). (No you don't use mapreduce... think harder... but you DO use HDFS).

So what promise were you told?

Other than real time (as illustrated above), you can do archiving, ETL of course, and things like enabling SQL lookups, or RRDs... using a number of toolkits or spark. Seriously, this is one of the best things since sliced bread when it comes to processing and managing real big data problems. Check out the Lambda processing model when you get a chance... you might be impressed, or be utterly confused. Lambda (and not talking about programming Lambda, nor AWS Lambda) applies multiple apache technologies to solve historical with real time problems in a sane manner. Also managing massively distributed backups is much simpler with HDFS

Honestly, outside of Teradata implementations, there is no where in the world you can get this kind of data resiliency, efficiency, nor management. Granted it doesn't have the 20+ years of chops in HUGE datasets Teradata does, nor the support... but its open source and won't cost you much to try.

Long long story short. What the hell! I feel like programmers today are constantly ... whining... about complexity. It seems like a trend to say "well I couldn't use it for my project so that means no one really does.. they are just trying to look cool." Which I would have to reply... you're an idiot. Yes its complex... if you understand storage / manipulation / migration / replication / indexing... you should be impressed to say the very very least. If you dont, please go read the changelog, Readme, and any note based install guides. or do some research on the commercial companies using this technology successfully.... instead of making of figures and claiming its gospel.

Any commercial solution will cost you ... well... millions just to get started solving the problems Hadoop nailed out of the gate.

If Hadoop seems large and frightening just wait until y