Apache Hadoop Has Failed Us, Tech Experts Say (datanami.com)
It was the first widely-adopted open source distributed computing platform. But some geeks running it are telling Datanami that Hadoop "is great if you're a data scientist who knows how to code in MapReduce or Pig...but as you go higher up the stack, the abstraction layers have mostly failed to deliver on the promise of enabling business analysts to get at the data." Slashdot reader atcclears shares their report:
"I can't find a happy Hadoop customer. It's sort of as simple as that," says Bob Muglia, CEO of Snowflake Computing, which develops and runs a cloud-based relational data warehouse offering. "It's very clear to me, technologically, that it's not the technology base the world will be built on going forward"... [T]hanks to better mousetraps like S3 (for storage) and Spark (for processing), Hadoop will be relegated to niche and legacy statuses going forward, Muglia says. "The number of customers who have actually successfully tamed Hadoop is probably less than 20 and it might be less than 10..."
One of the companies that supposedly tamed Hadoop is Facebook...but according to Bobby Johnson, who helped run Facebook's Hadoop cluster before co-founding behavioral analytics company Interana, the fact that Hadoop is still around is a "historical glitch. That may be a little strong," Johnson says. "But there's a bunch of things that people have been trying to do with it for a long time that it's just not well suited for." Hadoop's strengths lie in serving as a cheap storage repository and for processing ETL batch workloads, Johnson says. But it's ill-suited for running interactive, user-facing applications... "After years of banging our heads against it at Facebook, it was never great at it," he says. "It's really hard to dig into and actually get real answers from... You really have to understand how this thing works to get what you want."
Johnson recommends Apache Kafka instead for big data applications, arguing "there's a pipe of data and anything that wants to do something useful with it can tap into that thing. That feels like a better unifying principle..." And the creator of Kafka -- who ran Hadoop clusters at LinkedIn -- calls Hadoop "just a very complicated stack to build on."
If 1) you have a staff of elite programmers like Google or Facebook, who have CS degrees from top universities and are accustomed to picking up new programming languages and tools on a continuing basis; AND
2) your business has a pressing need to crunch terabytes of logs or document data with no fixed schema and continually changing business needs.
For the average Fortune 500 (or even IT) shop, not so much. A '90s style data warehouse accessible through SQL queries works much better.
How about: "Hadoop served many people well for a long time, but it is time for it to be deprecated now." ?
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
"It's very clear to me, technologically, that it's not the technology base the world will be built on going forward"... [T]hanks to better mousetraps like S3 (for storage) and Spark (for processing), Hadoop will be relegated to niche and legacy statuses going forward, Muglia says.
My 4th grade English teacher used to say, "A bad workman blames his tools."
Sounds relevant to me here.
What has happened instead is that quite a few "tech experts" did not understand what it actually was and had completely unrealistic expectations. Map-reduce is nice when you a) have computing power coming out of your ears and b) have very specific computing tasks. That means that in almost all cases, this technology is a bad choice and that was rather obvious to any actual expert right from the start.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
They're choosing someone to lead the merger of some high schools?
Fucking hell, unless you chew your tongue when you talk they don't even sound the same.
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Did nobody explain to the original poster that Spark in serious deployments is built on top of Hadoop? Or that Kafka uses the Hadoop (YARN) scheduler and is generally used to sink data to HDFS files, also built on top of Hadoop? This is kind of like someone saying that TCP/IP is no longer relevant because we now have DNS....
People should check out these guys: http://pachyderm.io/ The power of Hadoop, but you choose whatever programming language you think is best for you.
As far as I know, most people are using Apache Spark for new projects.
"First they came for the slanderers and i said nothing."
Like virtually every other technology that came before it, it is a stepping stone leading to what will come after. To say that it has "failed" is a gross overstatement and is frankly misleading. Since its inception, it has provided a platform to do things that no product before it could. It may not be the way to go going forward, but we probably wouldn't have any idea what the way forward might be if something like this hadn't been built.
I am sure that many of us feel like we could design our code better the second (or third) time around once we know the pitfalls and have had the opportunity to better understand the problem domain. No matter how you look at it, Hadoop has been an important and necessary step forward. Even if everyone stops using it tomorrow in favor of something better, the knowledge and experience we have gained from having it is irreplaceable. From my perspective, it seems like a huge success.
The world would be better off if we nuked Mecca and Hadoop... then build something useful on top of the heathen ashes.
When your software integration prevents your software from being used in conjunction with a variety of other platforms, you drastically reduce the number of users and in turn the number of developers that will work on it. As you integrate software more and more, you exponentially decrease the number of developers interested in making tools to make operation of your software easier. I'm not saying that making a system that works with everything will attract more developers but I am saying that making an overly integrated system will drive away many developers.
Anons need not reply. Questions end with a question mark.
People who bash Hadoop without understanding at a very minimum the moving parts have obviously no experience with it.
Hadoop is not one thing. It's three:
1) a distributed filesystem (HDFS)
2) a job scheduler (Yarn)
3) a distributed computing algorithm (MapReduce)
Many tools like Hbase or Accumulo *need* HDFS. That's a core component and there's no equivalent in Spark. Anyone saying HDFS is obsolete is a clueless idiot.
Anyways the Spark vs Hadoop narrative is bullshit. A serious Spark setup usually runs on top of a Hadoop cluster, and often you can't get away entirely from MapReduce (or its actual successor, Tez) because Spark runs in-memory and doesn't scale as much; for some workloads you need the read-crunch-save aspect of MapReduce because there's just too much data, and MapReduce is also more resilient as you don't lose as much when a node crashes during a job. Spark is more advanced and has actual analytics capabilities thanks to a powerful ML library (while Hadoop is just distributed computing), but it's not a case of either/or.
For instance a common approach is to use Hadoop jobs to trim down your data (via Pig or other blunt tool) to a point where you can run machine learning algorithms on Spark.
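To make the "read-crunch-save" shape of MapReduce concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps. It is only an illustration in plain Python, not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, run_job) and the sample records are made up for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle step: group all values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over an iterable of text records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input: two "log lines" standing in for files on a cluster.
counts = run_job(["hadoop spark hadoop", "spark kafka"])
print(counts)
```

On a real cluster each step reads from and writes to distributed storage between phases, which is where both the resilience and the latency come from.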
As for Kafka, it's just a fucking message queue. It's fast and very powerful, but comparing it to Hadoop is like saying you should use Linux instead of MySQL.
Whoever considers buying services from those Snowflake morons, run away.
lucm, indeed.
I need to scratch Hadoop off my list of technologies that I need to read about because everyone else in the office is reading a particular Big Data book?
The often overlooked addendum is that a good craftsman also knows both the value of a good tool and how to recognize a crappy tool. :)
If you look at the list of technical experts:
1 Bob Muglia - Head of a startup competitor that is marketing a data analytics product and trying to steer some of that Hadoop investment into his fold. His sales model is "Look how easy we are." What you should be asking is how much it costs and how you get your data back.
2 Bobby Johnson - Cofounder of an analytics company trying to steer some of that Hadoop investment into his pocket.
This is a beat up driven by people who wished that they had a slice of the Hadoop pie. Hadoop is a complex system, however it scales to levels far beyond relational database technologies. Basically if you can do what you wanted with relational databases in a cost effective manner then you wouldn't or shouldn't have contemplated Hadoop in the first place. I'm not saying that the above products are good, I am saying that you have to take what they're saying with a grain (or bucket) of salt.
A number of existing Hadoop interfaces are batch based and exhibit a significant degree of latency; however, other interfaces such as Impala are faster and bypass the map/reduce operation to achieve realtime results. When organisations get this wrong, it's a bit like your finance manager talking to your vehicle fleet manager (who recommended Ford vehicles) and based upon that conversation getting a great deal on 2000 tractors. Upon finding his staff are upset, he/she dumps American manufacturing and settles on BMW as their fleet vehicle of choice even though the cost is significantly higher. If you're a clever organisation you don't need to buy a Mercedes or BMW to achieve good results; however, the owners of BMW or Mercedes would certainly encourage you to do so.
Personally I have watched a number of organisations deploy Hadoop clusters in a poor manner without understanding the system fundamentals and they'd love to blame the tool. I've also seen clever organisations save millions on their existing licences and meet their business or compliance objectives. It really comes down to looking at your organisation in a pragmatic manner and deciding are you collectively clever, or are you collectively stupid.
Based upon what I've seen and know.
1 Data Locality - Don't deploy Hadoop as virtuals which rely on an underlying SAN technology, if you're doing this you don't understand either the problem or the solution. The issue that you might solve with virtualisation is deployment and management, don't kill locality in doing so.
2 Competent staff - You are going to need to hire, train, and retain highly competent staff to ensure that the interfaces are simple (see point 4)
3 Understand your business drivers. Are you a knowledge based organisation? What insights do you expect will make a difference to your bottom line?
4 Who will get access to the information and how.
5 Where is this information that you're going to throw into Hadoop anyway?
There are a number of managers who should be held accountable for poorly performing hadoop clusters and the above questions should be asked of all of them.
In some cases they would have been better off going with a simpler model initially, such as Cassandra, to meet their requirements; however, most organisations overestimate their abilities.
So I have a hadoop stack and a team of 4 data scientists. It takes them a month to develop an interface for new data... How do I get this dev time down? With new data-sets coming in on a weekly basis this team would need to grow 10X to keep up. In the meantime the average user needs to wait a month for access to new streams of info. That leaves our business a month behind on current trends that can definitely be predicted from the data streams. So what do I need to do?? Hire 36 new data scientists or change the stack I work on?
as it is eaten so it shall pass
After only 5 minutes with Hadoop I could figure out it was nothing but a giant boondoggle. It only took to the end of that afternoon to be completely sure. Now, what... 3, 4 years later the rest of the industry is starting to figure it out, en-masse? Seems about right.
Perhaps the issue here is about unreasonable expectations.
No software, Hadoop or other, will magically extract meaning from a huge dump of data. You need work to do that, whatever the tool you use.
This rant reminds me of the people who purchased an enterprise service bus to interconnect IT applications, just to discover that instead of interconnecting applications, they now need to interconnect applications with the enterprise service bus. No problem solved for free.
'"I can't find a happy Hadoop customer. It's sort of as simple as that," says Bob Muglia, CEO of Snowflake Computing, which develops and runs a cloud-based relational data warehouse offering' slashdot
Here's Bob Muglia while at Microsoft describing how to 'add additional semantics' to Outlook, that is perform a detailed analysis of Lotus Notes and then clone it into Outlook.
"Notes/Domino R5 is very scary. We all saw the demo. Exchange has worked with teams around the company to put together a very detailed analysis of the R5 betas and the hints they expose on their future direction.", Eric Lockard
"we will probably need to add additional semantics to the Outlook/CDO object model to enable easy conversion of Notes apps onto our solution", Bob Muglia
I read a great article where one guy compared Hadoop to tools such as grep. In many fundamental ways he was able to use UNIX command line tools to wildly outperform Hadoop on what I would consider to be on the larger end of a typical company's data set.
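For anyone wondering what the single-machine approach looks like, here is a rough sketch of the same idea in Python rather than shell: stream the input once, keep only a small tally in memory, and never load the whole data set. The log format, the regular expression and the function name count_matches are assumptions for illustration, not taken from the article being described.

```python
import io
import re
from collections import Counter


def count_matches(lines, pattern):
    """Stream lines once, tallying the first captured group of every matching line."""
    regex = re.compile(pattern)
    tally = Counter()
    for line in lines:
        match = regex.search(line)
        if match:
            tally[match.group(1)] += 1
    return tally


# Hypothetical access-log lines; a real run would iterate over an open file instead.
sample = io.StringIO("GET /a 200\nGET /b 404\nPOST /a 200\n")
status_counts = count_matches(sample, r"\s(\d{3})$")
print(status_counts)
```

Memory use depends on the number of distinct keys, not on the size of the input, which is why this kind of job often fits comfortably on one box.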
To me Hadoop was the classic solution desperately in quest of a problem. The worst problem with that being so many people who jumped onto Hadoop and thought they were ass kickers for doing so.
The simple reality is that for most corporate datasets the tool of choice is a boring relational database and usually something like MySQL. The common capacity roadblocks aren't found within the tool but in the tool users.
But if you use a tool like Hadoop, or go NoSQL with a tool like MongoDB, you get to say (until people realize you are actually quite stupid) "my datastore is better than your datastore".
Isn't mongoDB supposed to be similar to hadoop? Do the same pitfalls for hadoop apply to mongoDB?
I think even a vanilla Postgresql will do 1-2 Petabytes.
The maximum column size for Postgres is 1GB. The maximum table size is 32TB. So let's say you have a 1PB data set, that means you need to shard your data in at least 25 tables of 250 columns.
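As a back-of-envelope check on the table count, taking the 32TB per-table limit quoted above at face value, the minimum number of tables for 1PB can be worked out directly. The choice of binary units (1PB = 1024TB) is an assumption here; with decimal units the rounded-up result is the same.

```python
import math

TABLE_LIMIT_TB = 32   # per-table limit as quoted in the comment above
DATASET_TB = 1024     # 1 PB, assuming binary units (1 PB = 1024 TB)

# Smallest whole number of tables that can hold the full data set.
min_tables = math.ceil(DATASET_TB / TABLE_LIMIT_TB)
print(min_tables)
```

That is a lower bound only: it assumes every table is filled right up to the limit, which no real sharding scheme would manage.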
Let's say you want to run a query vertically; you'll need to join those 25 tables, start the query and go on vacation for a month. That's how 1PB works on Postgres.
And don't you even dare do some leaf-level manipulations on that volume of data, like a lateral join - unless you enjoy a faint smell of burnt plastic in your data center. Meanwhile, that kind of thing runs smoothly on Hadoop, and if it's too slow you just add nodes.
I'm not saying RDBMS are dead - in my opinion the vast majority of use cases warrant a traditional RDBMS or non-Hadoop NoSQL database. But when it comes to seriously big data, fuggedaboutit.
lucm, indeed.
Hadoop starts with a vastly distributable and resilient file system (HDFS) which enables, as a base, technologies that include things like HBase (columnar stores), Impala (Parquet example), Titan (graphs), Spark (lord everything.. it's the emacs of data frameworks), or the latest projects which completely change the paradigm of how you are looking at data at unbelievable speeds. (who the hell runs mapreduce and expects real time performance?... it's a full disk scan across distributed stores... and fairly sane from that perspective)
... whining... about complexity. It seems like a trend to say "well I couldn't use it for my project so that means no one really does.. they are just trying to look cool." To which I would have to reply... you're an idiot. Yes it's complex... if you understand storage / manipulation / migration / replication / indexing... you should be impressed to say the very very least. If you don't, please go read the changelog, Readme, and any note based install guides. Or do some research on the commercial companies using this technology successfully.... instead of making up figures and claiming it's gospel.
... well... millions just to get started solving the problems Hadoop nailed out of the gate.
If you don't have problems that relate to these paradigms... don't use it. Seriously. Just because it's new doesn't mean it fits every situation. It's not mysql/mariadb/postgresql... if you think it's even remotely close to that simple you should run for the hills. If you have a significantly large (not talking hundreds of megs or even a couple gigs... you need to be thinking in billions of rows here) configuration management problem then it's a great base to layer other projects on top of to solve your problem.
Also, I found a large number of problems to solve using timestamped individual data cells that CANNOT be done using traditional sql methodologies. Lexicographic configuration models, analytics (obv), massive backup history just to name a few. If the management and installation of the cluster are scary... well...not everything in CS is easy... especially when it gets to handling the world's largest datasets.... so, this probably isn't really your problem... call the sysadmins and ask them (politely) to help. Believe it or not the main companies have wizards which can help get you going across clusters... and even manage them visually (not that I ever would... UIs are for people who can't type).
When people (or just this CEO) say it doesn't deliver on its promise, they are likely trying to solve a problem wholly inappropriately. I have personally used it to solve problems like making real time recommendations in under 200ms across several gigs of personal data daily (totalling easily into terabytes). (No you don't use mapreduce... think harder... but you DO use HDFS).
So what promise were you told?
Other than real time (as illustrated above), you can do archiving, ETL of course, and things like enabling SQL lookups, or RRDs... using a number of toolkits or Spark. Seriously, this is one of the best things since sliced bread when it comes to processing and managing real big data problems. Check out the Lambda processing model when you get a chance... you might be impressed, or be utterly confused. Lambda (and not talking about programming Lambda, nor AWS Lambda) applies multiple Apache technologies to solve historical and real time problems together in a sane manner. Also managing massively distributed backups is much simpler with HDFS.
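For readers who have not met the Lambda model, the core idea can be sketched in a few lines: a batch layer periodically recomputes a view from the full history, a speed layer counts whatever has arrived since, and queries merge the two. This is a toy in-memory illustration, not any Apache project's API; the names build_batch_view, SpeedLayer and query, and the event shape, are invented for the example.

```python
from collections import Counter


def build_batch_view(master_events):
    """Batch layer: recompute per-user totals from the full, immutable event history."""
    return Counter(event["user"] for event in master_events)


class SpeedLayer:
    """Speed layer: incrementally count events the last batch run has not seen yet."""

    def __init__(self):
        self.recent = Counter()

    def ingest(self, event):
        self.recent[event["user"]] += 1


def query(user, batch_view, speed_layer):
    """Serving layer: merge the precomputed batch view with the recent increments."""
    return batch_view[user] + speed_layer.recent[user]


# Hypothetical events: three already covered by the batch run, one arriving afterwards.
history = [{"user": "ann"}, {"user": "bob"}, {"user": "ann"}]
batch_view = build_batch_view(history)
speed = SpeedLayer()
speed.ingest({"user": "ann"})
total_ann = query("ann", batch_view, speed)
print(total_ann)
```

In a real deployment the batch view would be rebuilt by a cluster job over data in distributed storage and the speed layer would be fed by a stream, with the speed layer's counts discarded once the next batch run has absorbed them.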
Honestly, outside of Teradata implementations, there is nowhere in the world you can get this kind of data resiliency, efficiency, or management. Granted it doesn't have the 20+ years of chops in HUGE datasets Teradata does, nor the support... but it's open source and won't cost you much to try.
Long long story short. What the hell! I feel like programmers today are constantly
Any commercial solution will cost you
If Hadoop seems large and frightening just wait until y
Hire someone competent with actual software development skills? Most data scientists I've met were glorified or relabeled data analysts. Some minor stats background and maybe they can hack together a script. That's fine and really valuable for analyzing large datasets and formatting the results into pretty figures for decision-makers to look at.
If your data is too complex for their basic ETL skills and it's taking a month to build interfaces, hire one competent and expensive developer to build those interfaces. You may call the role data engineer, data science engineer, software developer/engineer, or keep it as data scientist, but essentially get someone who actually understands software and database engineering, creating schemas, cleaning up data and knowing when all you have is shit and to move on. It would probably help if they know some stats or have an interdisciplinary background so they can talk to the data scientists and understand their needs. With one person focused on rapidly developing the interfaces the other data scientists can focus on analysis. Holy shit four people sitting on data for a month. It just screams that no one knows what they're doing. This ain't grad school anymore; you have to earn your paycheck.
Some of the most competent data scientists I know personally have come from bioinformatics/computational biology backgrounds. Usually were hacking as teenagers and they should have some published software (or academic papers resulting from their programming) that they can talk intelligently about to show their skills. Some of bioinformatics is close enough to big data that with a Master's degree they should have at least been exposed to the ins and outs of handling varied data. If you're hiring remote (U.S.) I can help you out. Or for my hefty consulting fee I can work on-site when you need new interfaces built. Any kind of positive reply and I can figure out how to get in touch.
There goes the big data bubble.
A real shocker, to make something useful out of hadoop you must "understand" MapReduce and you must have a background in statistics and data analysis theory...this is a show stopper!
Idiot gets embezzled by newfangled "open source" marketing language and must realize that, after all, Computing Is Hard, as every trade out there is.
This is just a variant of Dilbert's "teach me how to be an engineer, even if it takes all day".
I think many of the 'unhappy customers' the article refers to are companies where somebody who didn't quite understand the technology pushed hadoop as a replacement for (expensive) proprietary software like Oracle, to be then sorely disappointed especially on interactive performance.
I've been working with hadoop since 2007 and have successfully deployed it for multiple clients. First of all, you really want to see if the use case makes sense; sometimes you're just better off with an RDBMS like mysql. Some companies just jump on the 'NoSQL' bandwagon to find out almost immediately that they, oh well, actually do need SQL.
Hadoop is based on some Google technologies (GFS, etc) that were designed to process immutable, append only crawler logs, for the search engine. So anything that requires record-level CRUD is off the table in vanilla hadoop. Other systems (like kudu or hbase) try to address that, but even there, it depends on your use case. These technologies are also not that easy to operate, especially when you stack them on top of each other. That's why there's a flurry of companies (like - guess what - snowflake, whose CEO is the author of the article) that offer Data Warehouse as a Service on top of hadoop.
I think the article is meant to scare people into buying their service...and it may be actually worth it if you don't have the skills/manpower to run the hadoop stack yourself. If only a few people access data, it's probably worthwhile, but in larger organizations, hadoop is merely the engine that does the ETL heavy lifting but you have other systems where the data is queried. Note that for querying, there are several patterns: looking at long-term trends, looking at the most recent data, ad hoc querying, machine learning... each one may need its own specialized query engine.
Finally, the comparison with kafka is just insane and sounds like another sales pitch. Kafka is a whole different beast, it does streaming, but querying??
Big Data is a nice word. The fact that the concept of it is useful for roughly 5 ginormous global internet companies and beyond pointless for everybody else is probably something that 99.9% of all people making the final decisions on which technology stack is used have zero clue about. They haven't got the faintest idea what big data actually means and which of its problems solutions like hadoop actually address.
I'd bet money that 99 of 100 scenarios in which hadoop is used would even run better with some unspectacular type-a round-robin master-slave loadbalanced mysql setup or something. ... Of course then you couldn't use that nice word "Big Data".
We suffer more in our imagination than in reality. - Seneca
Sorry for the typos - using a tablet just now. :-)
We suffer more in our imagination than in reality. - Seneca
Or, a lumberjack who has been faced with using a chainsaw when all he has known his entire life has been axes and saws...they understand the concept of how it could improve their ability to work, but anything beyond "put fuel and oil here, and change the chain when it gets dull" is lost on them unless they can also grok the internal combustion engine...
In other news, bandwagon jumpers are shocked to discover that the cool new doohickey they read about in Tech Fashion Trends Magazine doesn't actually magically fix every problem you throw at it.
Computer technology has been around and commonplace for several decades now. It isn't new that this stuff is complicated, and getting even more complicated with each passing year.
And yet while a client would never demand a builder use this specific kind of scaffolding and cement to build with because they read about how cool it was in some magazine, for some inexplicable reason people DO think that this is an entirely acceptable thing to do when it comes to software.
But that's ok. Customers who do this are a fantastic boon to the consulting industry. First for the slimy consultants (usually offshored to keep costs "low") that sell customers exactly what they want, for cheap, and then for the much more expensive consultants later on who are tasked with trying to recover the steaming crater of a system the previous consultants left behind.
"You need competent high class programmers....luckily I am the Hadoop consultant you are looking for and for a small fee I will..."
Imagine a Beowulf cluster of these!
How did you end up in that misery? I know the answer to your questions. How deep are your pockets?
Excuse me, MPI has been the "first widely-adopted open source distributed computing platform", and it has succeeded.
You're a stupid motherfucker. You have nothing useful to say. You contribute nothing useful to this site or to society [...] (etc)
I was unable to read the rest of your comment because I have a policy of stopping when it becomes obvious that the other person is just throwing a tantrum.
If you disagree with the fact that Wikipedia clearly indicates that Spark is NOT based on Hadoop, support your claim with a link or citation. Otherwise there is no need to get your panties in a bunch, you clearly don't have enough trolling skills to make even a drunk Mike Tyson circa 1997 angry.
lucm, indeed.
Are these people for real?
The whole article screams, "I don't know what I'm doing but I love jumping on bandwagons."
Apache Hadoop and Kafka are two completely different tools, intended for two COMPLETELY different workloads.
So if you used Hadoop when you should have used Kafka, that doesn't mean Hadoop is bad. It means you haven't done your job and properly vetted the tools available for suitability.