Apache Hadoop Has Failed Us, Tech Experts Say (datanami.com)

Posted by EditorDavid on Saturday March 25, 2017 @09:34AM from the future-is-cloud-y dept.

It was the first widely-adopted open source distributed computing platform. But some geeks running it are telling Datanami that Hadoop "is great if you're a data scientist who knows how to code in MapReduce or Pig...but as you go higher up the stack, the abstraction layers have mostly failed to deliver on the promise of enabling business analysts to get at the data." Slashdot reader atcclears shares their report: "I can't find a happy Hadoop customer. It's sort of as simple as that," says Bob Muglia, CEO of Snowflake Computing, which develops and runs a cloud-based relational data warehouse offering. "It's very clear to me, technologically, that it's not the technology base the world will be built on going forward"... [T]hanks to better mousetraps like S3 (for storage) and Spark (for processing), Hadoop will be relegated to niche and legacy statuses going forward, Muglia says. "The number of customers who have actually successfully tamed Hadoop is probably less than 20 and it might be less than 10..."

One of the companies that supposedly tamed Hadoop is Facebook...but according to Bobby Johnson, who helped run Facebook's Hadoop cluster before co-founding behavioral analytics company Interana, the fact that Hadoop is still around is a "historical glitch. That may be a little strong," Johnson says. "But there's a bunch of things that people have been trying to do with it for a long time that it's just not well suited for." Hadoop's strengths lie in serving as a cheap storage repository and for processing ETL batch workloads, Johnson says. But it's ill-suited for running interactive, user-facing applications... "After years of banging our heads against it at Facebook, it was never great at it," he says. "It's really hard to dig into and actually get real answers from... You really have to understand how this thing works to get what you want."
Johnson recommends Apache Kafka instead for big data applications, arguing "there's a pipe of data and anything that wants to do something useful with it can tap into that thing. That feels like a better unifying principle..." And the creator of Kafka -- who ran Hadoop clusters at LinkedIn -- calls Hadoop "just a very complicated stack to build on."

MapReduce is great by Anonymous Coward · 2017-03-25 09:44 · Score: 4, Insightful

If 1) you have a staff of elite programmers like Google or Facebook, who have CS degrees from top universities and are accustomed to picking up new programming languages and tools on a continuing basis; AND
2) your business has a pressing need to crunch terabytes of logs or document data with no fixed schema and continually changing business needs.
For the average Fortune 500 (or even IT) shop, not so much. A '90s style data warehouse accessible through SQL queries works much better.
1. Re:MapReduce is great by Anonymous Coward · 2017-03-25 10:17 · Score: 4, Funny
  
  You've done an incredible amount of work to reach this conclusion. Congrats. Did you use map-reduce on your data set?
2. Re:MapReduce is great by Mitreya · 2017-03-25 11:06 · Score: 2
  
  1) you have a staff of elite programmers like Google or Facebook, who have CS degrees from top universities and are accustomed to picking up new programming languages and tools on a continuing basis;
  
  I disagree.
  MapReduce is actually great for teaching people about parallel processing! I have been able to teach a distributed computing course to non-CS (primarily data science) MS students because it achieves parallelization without most of the complexities associated with distributed query processing. With Hadoop streaming, all you need is basic knowledge of python (or similar) to write your own custom jobs, even without Hive/Pig/etc.
  That to me is one of the greatest accomplishments of MapReduce: bringing distributed computing concepts to a general audience.
  
  2) your business has a pressing need to crunch terabytes of logs or document data with no fixed schema and continually changing business needs.
  That part is true. Almost no one has that much of a data processing need.
  But it is still good for teaching distributed / remote computing to non-CS majors.
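To make the Hadoop streaming point above concrete, here is a minimal word-count sketch in Python. It is an illustration under assumptions, not the course material the comment refers to: the function names `mapper`, `reducer`, and `run_local` are hypothetical, and `run_local` only imitates on one machine what the framework does across a cluster (map, sort by key, reduce).

```python
from itertools import groupby


def mapper(lines):
    """Map step: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Reduce step: sum the counts of consecutive records that share a key.

    Relies on the input being sorted by key, which the framework's
    shuffle phase guarantees between the map and reduce steps.
    """
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Single-machine stand-in for the framework: map, sort, reduce."""
    return list(reducer(sorted(mapper(lines))))
```

In a real streaming job the two steps would typically be separate scripts reading standard input and writing standard output; the only contract between them is the tab-separated, key-sorted record stream, which is why basic scripting knowledge is enough to write a custom job.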
3. Re:MapReduce is great by gweihir · 2017-03-25 11:18 · Score: 5, Interesting
  
  Indeed. I went through their "interview-process" a while back at the request of a friend who was there and desperately wanted me for his team. Interestingly, I failed to get hired, and I think it is because I knew a lot more about the questions they asked than the people that created (and asked) these questions. For example, on (non-cryptographic) hash-functions my answer was to not do them yourself, because they would always be pretty bad, and to instead use the ones by Bob Jenkins, or, if things are slow because there is a disk-access in there, to use a crypto hash. While that is what you do in reality if you have more than small tables, that was apparently very much not what they wanted to hear. They apparently wanted me to start to mess around with the usual things you find in algorithm books. Turns out, I did way back, but when I put 100 million IP addresses into such a table, it performed abysmally. My take-away is that Google prefers to hire highly intelligent, but semi-smart people with semi-knowledge about things and little experience, and that experienced and smart people fail their interviews unless they prepare for giving dumber answers than they can give. I will never do that.
  On the plus side, my current job is way more interesting than anything Google would have offered me.
  
  --
  Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
4. Re:MapReduce is great by Tablizer · 2017-03-25 11:36 · Score: 5, Insightful
  
  As a reminder, SQL is a query language and not a hardware technology. It doesn't dictate HOW to store data (assuming it meets certain minimum standards). You probably are referring to typical RDBMS.
  
  --
  Table-ized A.I.
5. Re: MapReduce is great by Anonymous Coward · 2017-03-25 12:06 · Score: 5, Interesting
  
  That's because the mediocre programmers are the ones giving the interviews. Close friend interviewed last year only to sit in front of a bunch of know-it-all elitists. One douche rambled on about how he wishes there were monads in C++ and how great functional design is. Now my friend and his roommate are CS geeks and their spare time is doing shit like build a Lisp interpreter in C++ just for fun. So he asked mr monad if the project used a functional approach, which was a solid no. Idiot just wanted to show off the fact he knew what functional programming is and wasted time. He passed on the Google job for a big local company doing back end dev work. Job pays as well as Google, without the pompous know-nothings and with the ability to work remotely. Fuck working for Google.
6. Re:MapReduce is great by Mitreya · 2017-03-25 12:34 · Score: 2
  
  The underlying expense and architecture mistakes "scalability" for actual throughput in processing. It's proven extremely unstable in tasks larger than a small proof of concept
  Can you elaborate on some reasons?
  I was part of a research paper some time ago, and MapReduce does have the advantage of being able to resume (rather than restart) queries on failure, and of better handling of ad-hoc queries (compared to RDBMS).
7. Re:MapReduce is great by Kjella · 2017-03-25 16:04 · Score: 5, Interesting
  
  For example, on (non-cryptographic) hash-functions my answer was to not do them yourself, because they would always be pretty bad, and to instead use the ones by Bob Jenkins, or if things are slow because there is a disk-access in there to use a crypto hash. While that is what you do in reality if you have more than small tables, that was apparently very much not what they wanted to hear. They apparently wanted me to start to mess around with the usual things you find in algorithm books.
  No offense, but "I'd rather just use a library" seriously brings into question what you bring to the table and whether you'll just be searching experts-exchange for smart stuff other people have done. Like everybody knows you shouldn't use homegrown cryptographic algorithms, but if a cryptologist can't tell me what an S-box is and points me to using a library instead, it doesn't really tell me anything about his skill, except he didn't want to answer the question. In fact, dodging the question like that would be a pretty big red flag.
  Don't get me wrong, you can get there. But start off with roughly what you'd do if you had to implement it from scratch, what's difficult to get right, then suggest implementations you know or alternative ways to solve it. Because they're not that stupid that they think this is some novel issue nobody's ever looked at before or found decent answers to. They want to test if you have the intellect, knowledge and creativity to sketch a solution yourself. Once you've done that, then you can tell them why it's probably not a good idea to reinvent the wheel.
  
  --
  Live today, because you never know what tomorrow brings
8. Re:MapReduce is great by gweihir · 2017-03-26 05:52 · Score: 3, Interesting
  
  No offense, but you miss the point entirely. What I answered is very far from "use a library". First, it is an algorithm, not a library. That difference is very important. Second, it is a carefully selected algorithm that performs much better than what you commonly find in "libraries" in almost all situations. And third, the hash-functions by Bob Jenkins (and the newer ones by DJB, for example) are inspired by crypto, but much faster in exchange for reduced security assurances. In fact so fast that they can compete directly with the far worse things commonly in use. "Do not roll your own crypto" _does_ apply though.
  So while I think you meant to be patronizing, you just come across as incompetent. A bit like the folks at Google, come to think of it...
  
  --
  Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
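For readers unfamiliar with the hash functions discussed in this sub-thread: the comments do not say which of Bob Jenkins's functions is meant, so as an assumption the sketch below uses his short one-at-a-time hash, ported to Python with explicit 32-bit masking. The `bucket_for_ip` helper is hypothetical and only shows how such a hash might map IPv4 addresses to table slots.

```python
import ipaddress

MASK32 = 0xFFFFFFFF  # emulate unsigned 32-bit wraparound


def one_at_a_time(key: bytes) -> int:
    """Bob Jenkins's one-at-a-time hash over a byte string (32-bit result)."""
    h = 0
    for byte in key:
        h = (h + byte) & MASK32
        h = (h + (h << 10)) & MASK32
        h ^= h >> 6
    # Final avalanche so that the last input bytes affect all output bits.
    h = (h + (h << 3)) & MASK32
    h ^= h >> 11
    h = (h + (h << 15)) & MASK32
    return h


def bucket_for_ip(ip: str, n_buckets: int) -> int:
    """Hypothetical helper: pick a hash-table bucket for an IP address."""
    return one_at_a_time(ipaddress.ip_address(ip).packed) % n_buckets
```

This is a non-cryptographic hash: it aims for speed and good bit mixing, not resistance to deliberately chosen inputs, which matches the trade-off described in the comment above.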
It has not by gweihir · 2017-03-25 09:58 · Score: 3, Insightful

What has happened instead is that quite a few "tech experts" did not understand what it actually was and had completely unrealistic expectations. Map-reduce is nice when you a) have computing power coming out of your ears and b) have very specific computing tasks. That means that in almost all cases, this technology is a bad choice and that was rather obvious to any actual expert right from the start.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
A little clueless.... by Anonymous Coward · 2017-03-25 10:16 · Score: 5, Informative

Did nobody explain to the original poster that Spark in serious deployments is built on top of Hadoop? Or that Kafka uses the Hadoop (YARN) scheduler and is generally used to sink data to HDFS files, also built on top of Hadoop? This is kind of like someone saying that TCP/IP is no longer relevant because we now have DNS....
Just say Pachyderm by awilden · 2017-03-25 10:16 · Score: 3, Informative

People should check out these guys: http://pachyderm.io/ The power of Hadoop, but you choose whatever programming language you think is best for you.
Idiotic babble by lucm · 2017-03-25 11:40 · Score: 5, Insightful

People who bash Hadoop without understanding at a very minimum the moving parts have obviously no experience with it.
Hadoop is not one thing. It's three:
1) a distributed filesystem (HDFS)
2) a job scheduler (YARN)
3) a distributed computing algorithm (MapReduce)
Many tools like HBase or Accumulo *need* HDFS. That's a core component and there's no equivalent in Spark. Anyone saying HDFS is obsolete is a clueless idiot.
Anyways the Spark vs Hadoop narrative is bullshit. A serious Spark setup usually runs on top of a Hadoop cluster, and often you can't get away entirely from MapReduce (or its actual successor, Tez) because Spark runs in-memory and doesn't scale as much; for some workloads you need the read-crunch-save aspect of MapReduce because there's just too much data, and MapReduce is also more resilient as you don't lose as much when a node crashes during a job. Spark is more advanced and has actual analytics capabilities thanks to a powerful ML library (while Hadoop is just distributed computing), but it's not a case of either/or.
For instance a common approach is to use Hadoop jobs to trim down your data (via Pig or other blunt tool) to a point where you can run machine learning algorithms on Spark.
As for Kafka, it's just a fucking message queue. It's fast and very powerful, but comparing it to Hadoop is like saying you should use Linux instead of MySQL.
Whoever considers buying services from those Snowflake morons, run away.

--
lucm, indeed.
Re:Something less dismissive? by lucm · 2017-03-25 12:14 · Score: 2

The Hadoop defenders will no doubt counter with, "but Hadoop wasn't designed to be an RDBMS!", to which I say it doesn't matter. That's what people were trying to make Hadoop into because that's what businesses thought that they needed: a drop in replacement for SQL and RDBMS that addressed their scalability problems. In the meantime SQL and RDBMS developers have answered the challenge and continued improving their tools, addressing many of the shortcomings that Hadoop was supposed to resolve while Hadoop was still over promising and under delivering. The old quip is still true, "SQL is dead. Long live SQL."
That's bullshit and obviously you're a DBA defending his turf. A Hadoop cluster will scale beyond anything an RDBMS can handle, and if the only tool in your toolbox is SQL you can use products like Hive or HAWQ that will process your queries through a specialized JDBC driver and run them across as many nodes as your budget can afford.
For instance you could have petabytes of data in CSV format stored on your HDFS cluster, and you could create a relational model on top of them without rewriting a single byte, then use SQL to interact with this huge data set. It's like mounting external sources in Oracle or PostgreSQL, but at a scale that neither product can process.
Do you know what the NSA used to store all that big brother data? Accumulo, which sits on Hadoop. They would have never been able to crunch that volume of data with [insert your RDBMS product here].
Don't diss stuff you don't understand. Nobody is taking your precious database away, there's just an alternative for people with more complex needs.

--
lucm, indeed.
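The idea in the comment above of putting a relational model on top of CSV files "without rewriting a single byte" is often called schema-on-read. The following is only a single-machine sketch of that idea in plain Python, not how Hive or any other product implements it; the column names in `SCHEMA` and the helper functions are hypothetical.

```python
import csv
import io

# Hypothetical schema: it lives apart from the files, which stay untouched.
SCHEMA = [("user_id", int), ("country", str), ("bytes_sent", int)]


def read_table(files, schema=SCHEMA):
    """Lazily apply the schema to raw CSV rows while they are being read."""
    for f in files:
        for row in csv.reader(f):
            yield {name: cast(value) for (name, cast), value in zip(schema, row)}


def total_bytes_by_country(rows):
    """Roughly: SELECT country, SUM(bytes_sent) ... GROUP BY country."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0) + row["bytes_sent"]
    return totals


# Two in-memory "files" stand in for CSV files stored on a cluster.
files = [io.StringIO("1,CA,100\n2,US,50\n"), io.StringIO("3,CA,25\n")]
totals = total_bytes_by_country(read_table(files))
```

The point of the design is that the typed view is computed at query time, so the same files can be reinterpreted under a different schema later without a rewrite or reload step.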
Re:Do not blame the tool(s), blame the workman... by somenickname · 2017-03-25 12:59 · Score: 5, Insightful

My 4th grade English teacher used to say, "A bad workman blames his tools."
Sounds relevant to me here.
Apparently your 4th grade English teacher has never tried to use a hammer covered in spikes that arrived in a box labeled "Screwdriver".
1PB meh by lucm · 2017-03-25 16:38 · Score: 2

I think even a vanilla Postgresql will do 1-2 Petabytes.
The maximum column size for Postgres is 1GB. The maximum table size is 32TB. So let's say you have a 1PB data set, that means you need to shard your data in at least 25 tables of 250 columns.
Let's say you want to run a query vertically; you'll need to join those 25 tables, start the query and go on vacation for a month. That's how 1PB works on Postgres.
And don't you even dare do some leaf-level manipulations on that volume of data, like a lateral join - unless you enjoy a faint smell of burnt plastic in your data center. Meanwhile, that kind of thing runs smoothly on Hadoop, and if it's too slow you just add nodes.
I'm not saying RDBMS are dead - in my opinion the vast majority of use cases warrant a traditional RDBMS or non-Hadoop NoSQL database. But when it comes to seriously big data, fuggedaboutit.

--
lucm, indeed.
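The sharding arithmetic in the comment above can be checked quickly. Taking the quoted 32TB per-table limit at face value and using decimal units (both assumptions for this sketch), dividing 1PB by the limit gives a minimum of 32 tables, somewhat more than the 25 mentioned, which does not change the comment's conclusion. The `tables_needed` helper is hypothetical.

```python
TB = 10**12  # decimal terabyte, in bytes
PB = 10**15  # decimal petabyte, in bytes


def tables_needed(dataset_bytes: int, max_table_bytes: int) -> int:
    """Smallest number of tables that can hold the data set (ceiling division)."""
    return -(-dataset_bytes // max_table_bytes)


# 1 PB / 32 TB = 31.25, so at least 32 tables are required.
tables = tables_needed(1 * PB, 32 * TB)
```

With binary units (1 PiB over 32 TiB) the division is exact and also comes out at 32.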
Hadoop isn't just MapReduce and Pig by vile8 · 2017-03-25 17:13 · Score: 5, Insightful

Hadoop starts with a vastly distributable and resilient file system (HDFS), which enables, as a base, technologies that include things like HBase (columnar stores), Impala (Parquet example), Titan (graphs), Spark (lord everything... it's the emacs of data frameworks), or the latest projects which completely change the paradigm of how you are looking at data at unbelievable speeds. (Who the hell runs MapReduce and expects real-time performance? It's a full disk scan across distributed stores... and fairly sane from that perspective.)

If you don't have problems that relate to these paradigms... don't use it. Seriously. Just because it's new doesn't mean it fits every situation. It's not mysql/mariadb/postgresql... if you think it's even remotely close to that simple you should run for the hills. If you have a significantly large (not talking hundreds of megs or even a couple gigs... you need to be thinking in billions of rows here) configuration management problem, then it's a great base to layer other projects on top of to solve your problem.

Also, I found a large number of problems to solve using timestamped individual data cells that CANNOT be done using traditional SQL methodologies: lexicographic configuration models, analytics (obv), and massive backup history, just to name a few. If the management and installation of the cluster are scary... well... not everything in CS is easy... especially when it gets to handling the world's largest datasets... so, this probably isn't really your problem... call the sysadmins and ask them (politely) to help. Believe it or not, the main companies have wizards which can help get you going across clusters... and even manage them visually (not that I ever would... UIs are for people who can't type).

When people (or just this CEO) say it doesn't deliver on its promise, they are likely trying to solve a problem wholly inappropriately. I have personally used it to solve problems like making real-time recommendations in under 200ms across several gigs of personal data daily (totalling easily into terabytes). (No, you don't use MapReduce... think harder... but you DO use HDFS.)

So what promise were you told?

Other than real time (as illustrated above), you can do archiving, ETL of course, and things like enabling SQL lookups, or RRDs... using a number of toolkits or Spark. Seriously, this is one of the best things since sliced bread when it comes to processing and managing real big data problems. Check out the Lambda processing model when you get a chance... you might be impressed, or be utterly confused. Lambda (and not talking about programming lambdas, nor AWS Lambda) applies multiple Apache technologies to solve historical and real-time problems together in a sane manner. Also, managing massively distributed backups is much simpler with HDFS.

Honestly, outside of Teradata implementations, there is nowhere in the world you can get this kind of data resiliency, efficiency, or management. Granted, it doesn't have the 20+ years of chops in HUGE datasets Teradata does, nor the support... but it's open source and won't cost you much to try.

Long long story short: what the hell! I feel like programmers today are constantly... whining... about complexity. It seems like a trend to say "well I couldn't use it for my project so that means no one really does... they are just trying to look cool." To which I would have to reply... you're an idiot. Yes, it's complex... if you understand storage / manipulation / migration / replication / indexing... you should be impressed, to say the very, very least. If you don't, please go read the changelog, README, and any note-based install guides, or do some research on the commercial companies using this technology successfully... instead of making up figures and claiming it's gospel.

Any commercial solution will cost you ... well... millions just to get started solving the problems Hadoop nailed out of the gate.

If Hadoop seems large and frightening just wait until y