Is Big Data Leaving Hadoop Behind?
knightsirius writes: Big Data was seen as one of the next big drivers of the computing economy, and Hadoop was seen as a key component of the plans. However, Hadoop has had a less than stellar six months, beginning with the lackluster Hortonworks IPO last December and the security concerns raised by some analysts. Another survey records only a quarter of big data decision makers actively considering Hadoop. With rival Apache Spark on the rise, is Hadoop being bypassed in big data solutions?
FTA: ...biggest problem is that people allegedly still can’t use Hadoop... Hadoop is still too expensive for firms...
Hadoop is an ecosystem with lots of moving parts. Those are real problems above, but Spark (Particle) is not a standalone replacement for an ecosystem the size of Hadoop. Moreover, it has no problem integrating with Yarn on Hadoop where you can run Hbase, Cassandra, MongoDB, Rainstor, Flume, Storm, R, Mahout and plenty of other Yarn-compatible goodies.
It's also worth noting that Hortonworks and Cloudera may not be "taking off as hoped" because the branded big-iron players are finally in the ring. They hide the (rather hideous) complexity and integrate well with any existing systems you have with those vendors. Teradata, for instance, has a Hadoop/Aster integration that's impressive and turnkey. They bought Rainstor and will soon have it integrated, and that's Spark-fast and hassle-free. IBM's BigInsights is very impressive if you have the means.
So, no, Hadoop is in no danger of being replaced. The value proposition that my $4.2M cluster outperformed two $6M "big name" vendor supported appliances is undeniable, but only that stark when your $'s have an M suffix. What will probably occur though is that we'll end up replacing every component in Hadoop with a faster one, and MapReduce will become a memory as things like Spark and Hive/Tez move away from that methodology.
Nope.
I thought Spark worked from within Hadoop. Is that like using emacs to run vi?
Don't tell Mary Jo Foley.
Yarn on Hadoop where you can run Hbase, Cassandra, MongoDB, Rainstor, Flume, Storm, R, Mahout and plenty of other Yarn-compatible goodies.
It's also worth noting that Hortonworks and Cloudera
I know R. My wife has a Yarn store. WTF are those other things?
I should use this sig to advertise my book ISBN-13 : 978-1501515132.
I tend to agree. As a storage admin for a Multi-Hospital organization using anything open source is not really an option if we want to keep the HIPPA-potamus away. However, we have the means to buy and use name brand storage appliances (currently NetApp and IBM SVC), which come with the vendor support large organizations need.
Things like Hadoop, I think, are going to end up in the niche market of organizations that have the expertise to manage something as complex as Hadoop and have the need for Big Data and performance but don't have the capital to buy a brand name array.
Is this a question for Hadoop employees or slashdot? If there's something better, why does it matter to anyone other than the company developing Hadoop if it's relevant?
-SaNo
Is security really that big of a deal? Isn't the intent to run it on a private network to crunch numbers behind the scene?
We don't ask about the susceptibility of safety deposit boxes to crowbars and dynamite, they're inside a vault.
At least at the startup I work at. It's too slow.
I've heard of MongoDB. It's Web Scale!!
Momentarily, the need for the construction of new light will no longer exist.
Not just Hadoop. The whole stack is too bloated and difficult to use; it was never the best language for anything but people tried to use it for everything.
Funny I was thinking they were all children's books. cloudera, horton hears a works, etc.
i thought once I was found, but it was only a dream.
ahahhha mongodb is shit
Makes sense. If they can see your source (which you have to show them, or it wouldn't be open) then it makes absolute sense they can totally see your data.
You weren't previously the city manager of Tuttle, Oklahoma, were you?
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Well for one, the comment doesn't really apply much to anything Hadoop-like. This actually is interesting in and of itself. In all this tech reporting, the boring reality of 90+% of IT is generally ignored. This article is a great example: Hadoop was declared by many to be the one true solution for storing data as it went from 0.1% to 0.2% share. When that pace levels out at no more than a couple percent at best, articles declare it dead. It is obscure in real world terms, but much more common than when the news unambiguously sang its praises. Of course IT is not alone; China gets blasted for slowing to something like an 8% annual growth rate, which is still quite strong positive growth, just not the unsustainable growth previously seen.
For another, the ultimate (valid) point is that a lot of IT orgs won't use 'Hadoop' instead using things like 'BigInsights'. Sort of how they don't use 'Linux', they use 'Red Hat'. It's less about the source and more about covering their ass.
I think Slashdot got trolled by a Wall Street troll. Or a competitor to Hortonworks.
I agree that the problem is that most companies don't know how to run it and it's left to bigger organizations that 1) have the expertise in house and 2) actually need the added complexity.
Understanding which pieces of the ecosystem you need, and how to deploy and run them in a production environment, can be daunting, not to mention all the different possibilities of which cloud provider to use, which services, etc.
Cloudera and Hortonworks are capitalizing on it, basically by helping sort out this complexity with consultants and training. But since this business model scales with the number of employees, they are not scaling up that fast, also because there are not enough skilled engineers in the field. I personally interviewed several self-proclaimed 'hadoop engineers' who had worked on hadoop for a year or more and yet didn't know what happens in the shuffle phase.
Another distinction to make is that Hadoop now has three major components: HDFS, YARN and map/reduce. Maybe Map/Reduce is losing its relevance as a hadoop component as Tez/Shark/Flink advance, but it should be noted that under the hood they use basically the same abstraction for parallelization. They just make better use of resources (especially memory), and they are not replacing HDFS or YARN. Mesos could be used as an alternative to YARN, but I don't see any competitor for HDFS yet.
So, I would not say that Hadoop is being replaced, but rather extended and, to use a botanical analogy, besides growing, it's also being grafted onto (flink, spark, cassandra, etc...).
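For readers unsure what the shuffle phase mentioned above does, here is a rough single-process sketch of the map/shuffle/reduce abstraction in plain Python. It is only an illustration under simplifying assumptions: nothing here is Hadoop's API, the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for this example, and a real cluster would spread each phase across many machines and move the grouped data over the network.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group every emitted value by its key, so that all values
    for one key end up together (on a cluster: at the same reducer)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(lines):
    """Chain the three phases: map, then shuffle, then reduce."""
    return reduce_phase(shuffle_phase(map_phase(lines)))

if __name__ == "__main__":
    sample = ["spark runs on yarn", "yarn runs on hadoop"]
    print(word_count(sample))
```

The shuffle is the step that regroups the mappers' output by key before any reducer runs, which is why it tends to dominate network and disk traffic on a real cluster.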
Did I trip into a time warp and come out a decade in the past?
Who the fuck is actually talking about hadoop or map reduce in 2015? The same retards that were creaming their little cunts about it in 2005?
Even when you ignore the joke that is Java, hadoop is unwieldy, unreliable shit if you actually care about storing and retrieving correct, synchronized data.
If you're fine with throwing all of your data in a pot and getting some sort of result that looks mostly correct, then knock yourself out and use hadoop.
If your data needs to be correct, define it and its relationships then use SQL. You will have to pay someone decent money to do this correctly.
I don't really foresee anybody moving away from the MapReduce paradigm any time soon considering it's essentially a 30-year old idea that has only been getting more successful every day. The issue is whether a single proprietary clustering solution based on Java is really what people want to sign up to work with for the next 10 years. After setting up the basics for Hadoop over a day or two at work (with no real practical need for it) I for one can say I'm not particularly interested in that proposition.
Why did you say Spark (Particle)?
in the park, next to the terrorist with the bomb strapped around her waist getting the nerve to board the bus. She too fell for the 100 virgins ...blah blah blah.
CHANGE
THE
NAME
None of that matters when HDP stock drops to 0$. It will die and become irrelevant. Hadoop is a brand name that has to make money to stay alive, and it is not making any money.
Especially if an intern set it up
http://saveie6.com/
The answer to the question is "No"
http://en.wikipedia.org/wiki/Betteridge%27s_law_of_headlines
The problem with "big data" is that there are no vendor specs and the implementations are sometimes questionable. There is a provider that does better, which is SQLStream (http://www.sqlstream.com); it has a streaming DB that is controlled via SQL. In addition to normal tables, you have streams, which are relational typed conduits through which data flows, and windows, which are time (and row) based groups of tuples that can be used in agg queries with all the standard SQL functions (there's also Java UDXes and MED support). Designing your middleware on top of a SQL engine is a much better design pattern than doing it all with hand-wired Java. All this and about 100x the throughput of a Hadoop program. Disclaimer: I'm an engineer at SQLStream.
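To make the idea of a time-based window concrete, here is a small plain-Python sketch of a tumbling (fixed, non-overlapping) window sum. This is not SQLStream's API or SQL dialect; the function name `tumbling_window_sum` and the `(timestamp, key, value)` tuple layout are assumptions made for illustration, and a real streaming engine would compute this incrementally as rows arrive rather than over a finished list.

```python
from collections import defaultdict

def tumbling_window_sum(events, window_seconds):
    """Group (timestamp, key, value) tuples into fixed, non-overlapping
    time windows and sum the values per (window_start, key)."""
    totals = defaultdict(float)
    for timestamp, key, value in events:
        # Every timestamp in [k * w, (k + 1) * w) maps to window start k * w.
        window_start = (timestamp // window_seconds) * window_seconds
        totals[(window_start, key)] += value
    return dict(totals)

if __name__ == "__main__":
    # Hypothetical sensor readings: (seconds since start, sensor id, reading).
    readings = [(0, "a", 1.0), (30, "a", 2.0), (61, "a", 4.0), (65, "b", 8.0)]
    print(tumbling_window_sum(readings, window_seconds=60))
```

In SQL terms this roughly corresponds to a `GROUP BY` on the window start and the key, which is the kind of aggregate the comment describes running over a window.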
"Those that start by burning books, will end by burning men."
A scripting language with a good math/stats library (e.g., NumPy/Pandas) and a decent RAID controller are all most people really need for most "big data" applications. If you need to scale a bit, add a few nodes (and put some RAM in them) and a job scheduler into the mix and learn some basic data decomposition methods. Most big data analyses are embarrassingly parallel. If you really need 100+ TB of disk, set up Lustre or GPFS. Invest in some DDN storage (it's cheaper and faster than the HDFS system you'll build for Hadoop).
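As a minimal sketch of what "basic data decomposition" can look like for an embarrassingly parallel counting job: split the records into independent chunks, count each chunk on its own, then merge the partial results. The helper names (`split_into_chunks`, `count_chunk`, `merge_counts`, `decomposed_count`) are hypothetical, and the chunks are processed sequentially here to keep the example self-contained. Because no chunk depends on another, each `count_chunk` call could instead be handed to a job scheduler or to something like `multiprocessing.Pool.map`.

```python
from collections import Counter

def split_into_chunks(records, n_chunks):
    """Decompose the input into at most n_chunks roughly equal slices."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]

def count_chunk(chunk):
    """Independent unit of work: count the items in one chunk only."""
    return Counter(chunk)

def merge_counts(partials):
    """Combine the per-chunk counts into one overall count."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def decomposed_count(records, n_chunks=4):
    """Count items by decomposing, counting per chunk, and merging."""
    partials = [count_chunk(chunk) for chunk in split_into_chunks(records, n_chunks)]
    return merge_counts(partials)

if __name__ == "__main__":
    events = ["login", "click", "login", "buy", "login", "click"]
    print(decomposed_count(events, n_chunks=3))
```

The merge step is cheap because partial counts are small compared with the raw data, which is what makes this pattern scale with very little coordination.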
Here's the breakdown of that claim in more computer sciencey terms: Almost all big data problems are simple counting problems with some stats thrown in. For more advanced clustering tasks, most math libraries have everything you need. Most "big data" sizes are under a few TB of data. Most big data problems are also I/O bound. Single nodes are actually pretty powerful and fast these days. 24 cores, 128 GB RAM, and 15 TB of disk behind a RAID controller that can give you 400 MB/s data rates will cost you just barely 5 figures. This single node will outperform a standard 8 node Hadoop cluster. Why? Because the local, high density disks that HDFS encourages are slow as molasses (30 MB/s). And...
Hadoop has a huge abstraction penalty for each record access. If you're doing minimal computation for each record, the cost of delivering the record dominates your runtime. In Hadoop, the cost is fairly high. If you're using a scripting language and reading right off the file system, your cost for each record is low. I've found Hadoop record access times to be about 20x slower than Python line read times from a text file, using the _same_ file system for Hadoop and Python (of course, Hadoop puts HDFS on top of it). In Big-O terms, the 'c' we usually leave out actually matters here - O(1*n) vs. O(20*n). 1 hour or 20 hours, you pick.
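A back-of-the-envelope sketch of the arithmetic behind those figures, using only the numbers quoted above (15 TB, 400 MB/s, 30 MB/s, 8 nodes, 20x). The function names are made up for this example, and it leans on two labeled assumptions: decimal units (1 TB = 1,000,000 MB), and an 8-node cluster that reads in perfect parallel with no replication or network overhead, which is generous to the cluster.

```python
def scan_hours(data_tb, mb_per_second):
    """Hours to read data_tb terabytes once at a sustained rate.
    Assumes decimal units: 1 TB = 1,000,000 MB."""
    seconds = (data_tb * 1_000_000) / mb_per_second
    return seconds / 3600

def runtime_hours(baseline_hours, per_record_slowdown):
    """Runtime when every record costs per_record_slowdown times more
    to deliver and the job is dominated by record delivery."""
    return baseline_hours * per_record_slowdown

if __name__ == "__main__":
    # Single node behind a RAID controller at 400 MB/s.
    print(f"single node: {scan_hours(15, 400):.1f} h")
    # Assumption: 8 nodes at 30 MB/s each, perfectly parallel (240 MB/s total).
    print(f"8-node cluster: {scan_hours(15, 8 * 30):.1f} h")
    # The constant factor: a 1-hour job with a 20x per-record penalty.
    print(f"with 20x overhead: {runtime_hours(1, 20):.0f} h")
```

Under these assumptions a full 15 TB scan takes roughly 10.4 hours on the single node versus roughly 17.4 hours on the idealized cluster, and the 20x per-record penalty comes on top of that.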
If you're really doing big data stuff, it helps to understand how data moves through your algorithms and architect things accordingly. Almost always, a few minutes of big-O thinking and some basic knowledge of your hardware will give you an approach that doesn't require Hadoop.
tl;dr: Hadoop and Spark give people the illusion that their problems are bigger than they actually are. Simply understanding your data flow and algorithms can save you the hassle of using either.
-Chris
BSD has been dying for how long again? It's still around and having monthly releases. For open source projects, popularity contests are much less important. With a massive existing user base, Hadoop will be actively maintained for a long time. So if you are already familiar with it and it serves the needs of your project, go right ahead.
Aren't SAP and SAS in this "Big Data" market, and have been for longer than this was a buzzword? My guess is that the companies that invested heavily in SAP and SAS in the late 1990s and early 2000s are sticking with that investment and toolset and couldn't be bothered by Hadoop. Sure, it might be free, but when your entire infrastructure for business information management is already in with SAP or SAS (presumably Oracle or MSSQL as well), you're not going to throw that away anytime soon without something more compelling than, "it's cheaper", because from a migration and overall cost of conversion viewpoint it ain't.
I agree that the problem is that most companies don't know how to run it
I think a bigger problem is that most companies don't even know what big data actually is. It is a big buzzword. I hear managers talking about it all the time. Half the time they're talking about some database table with a few hundred thousand records in it. Other times they're talking about some repository full of documents or binary files that might be terabytes in size, but it is just random stuff. They don't actually have questions in mind that they want to answer, and ultimately that is what tools like Hadoop are about.
I've heard "big data" applied to problems that are basically just file shares or the like.
Then if a company really does have a problem where Hadoop and such is useful, they want to buy some product off the shelf that solves that particular problem, and usually they don't exist. Or they want to hire a bunch of random rent-a-coders and have them solve the problem, and they go about solving it with single-threaded solutions written in .net or whatever the commodity solution in use is at the company.
Sure, your Facebooks and Googles and Netflixs and Amazons know what they're doing. Your average GE or Exxon or Pfizer generally doesn't do that level of comp sci.
I know R. My wife has a Yarn store. WTF are those other things?
It's a distributed exec for Java processes. That's really it. It has crappy monitoring built in that's unnecessary due to SNMP, but they built it in anyway because... well, I don't know why.
"Those that start by burning books, will end by burning men."
You are overestimating the difficulty at this point. This is not compsci anymore and hasn't been for many, many years. It isn't even hard administration. It is probably easier to get a big data system running in 2015 than it was to use Oracle in 1995.
As far as your examples you went way too big. GE is a huge DevOps shop, they know what Big Data is. Exxon has massive supercomputing datasets. I would bet they were doing big data long before it got cool. Pfizer has an IT department that is some of everything but they have many many data warehouses so I can't imagine they aren't playing with data lakes.
Not just covering, but documenting. Getting security approvals and documentation in place for open source can be work. When you pay a vendor, they do that work as well. Which eleviates significant operational complexity.
There are no references to any algorithms. Rank ordering? Nope. Social graph analytics? No. Netflix style recommendations? Uh-uh. Statistics? None.
Without talking about data sets, algorithms and expected results, yammering about tools is meaningless. Hot air.
But who cares, because you all get to call each other stupid, and try and prove that you are the biggest baddest tech weenie on the block. From here it seems that you don't even know where the block is. You don't even seem to know which direction you need to go to get to a street. (Like the implied car reference there?)
I'm beyond unimpressed. It's obvious that no one has a clue what they are talking about. Go off and learn something, and then maybe you will be able to write a post that isn't a waste of time. Other than that, STFU and get off my lawn.
Why is Snark Required?
I would strongly disagree. In 1995 relational theory and practice was well understood by a large set of developers and had stable, well documented implementations. Raw Hadoop and the associated computational model is not at that level of stability, documentation and usability. In addition the relational model applies to many business problems, large and small. Hadoop is generally applicable and cost efficient only for larger, more complex problems.
what could improve spark? where does it suck?
From 2010 until early this year I was responsible for Big Data technical marketing at Microsoft, and I recently joined AWS. I won't comment on any of the specifics for my current or former employer, but it's a fact that other nosql technologies have a higher adoption rate. It's clear that the traditional data warehouse had limitations, and that hadoop is not replacing the EDW. The largest companies are using proprietary technologies, not adopting hadoop. Hadoop 2.0 is much better; you should use it if you have the skills. But if you don't, relational, nosql and cloud databases are evolving to solve most use cases. I would invest more resources in Advanced Analytics, both open source (e.g. http://xpatterns.com/connect/ or https://aws.amazon.com/marketp... ) and proprietary (SAS, IBM, SAP...).
I would strongly disagree. In 1995 relational theory and practice was well understood by a large set of developers and had stable, well documented implementations. Raw Hadoop and the associated computational model is not at that level of stability, documentation and usability. In addition the relational model applies to many business problems, large and small. Hadoop is generally applicable and cost efficient only for larger, more complex problems.
you can't strongly anything as an AC, sorry buddy
Meaning the hype around big data has settled and it's back to business. I'd say there are fewer than 10 companies worldwide for whom big data actually might make sense. Others clean and aggregate their data in such a way that it's actually useful. .... I don't want my bank guessing my balance with big data statistics, I want them to know it. And so do most other people.
We suffer more in our imagination than in reality. - Seneca
You are overestimating the difficulty at this point. This is not compsci anymore and hasn't been for many, many years. It isn't even hard administration. It is probably easier to get a big data system running in 2015 than it was to use Oracle in 1995.
I think you're misunderstanding my point.
Sure, it is easy to install Hadoop, and run it.
The hard part is figuring out WHAT to run on it.
Agree with both your comments. From a developer's perspective, it was certainly easier to use Oracle once set up in 1995 than it is to use Hadoop today (by a bit). What the thread was about was setup. What wasn't understood well in 1995 was how to package complex enterprise software so that sysadmin times to get it installed were reasonable. The original poster was talking about the complexity from scratch.
That's easy, the big 5:
1) Datasets too big to use an RDBMS
2) 360 view of customers (CRM consolidation, sales systems consolidation...)
3) Security data from network security devices.
4) Stream in huge amounts of operational data (GPS on employees, physical sensors, machine health...) and do integrated data analysis
5) data warehouse consolidation
Good thing I didn't bother getting onto the Hadoop hype wagon a few years ago when Hadoop was the solution to every problem, the guarantee of a high-paying job forever, and the cure for cancer. How the mighty have fallen.
I work in the healthcare space as well, and Open Source stuff has never been a problem. HIPAA has no preference for big vendors.
For the most part, it's been EASIER to get approvals, because the software is often much more flexible, and it's cheaper to test the waters. Want to "try" a big vendor's solution? Write a justification, lobby a bunch of people, adjust a future budget, write a purchase order, etc etc etc. Want to try out an OSS package? Download it and work-up a proof-of-concept in some spare time and then show someone a working system.
It's easier to talk about the path from PoC to production than it is to talk about all the unknowns and costs associated with getting in bed with a vendor, because you KNOW that vendor is going to charge $500/hr for useless consulting services. Then finally, there's the possibility that the solution doesn't work out. With the OSS solution, you just shutter the work you've already done and move on. With a big vendor, you may find that "moving on" is not easily done, because they have spent most of that useless consulting time insinuating themselves into every part of your organization so you can't let them go.
So you are basically saying that hadoop will eventually fall into disuse but HDFS (Hadoop file system) will linger on with new platforms built on top of it? Or do you believe that HDFS will also be replaced eventually?
Spark is a horrible name for anything. Spark the M3 Wifi board people needed to change recently to Particle to try and have some sort of trademark (huge derp for anyone picking spark as a name to begin with). Otherwise it makes no sense.
No the Hadoop users are rolling their own solutions and don't actually give a fuck about boutique shops providing expensive solutions that their in house developers can provide. And their in house developers already know their data better than the expensive fucking consultants.
It's not getting that much attention because it's just not that big a deal anymore, just another tool for developers to use; especially Java developers.
In addition it seems like management has learned that Hadoop is a data-analysis package, and not a database.
Something as simple as manufacturing data far eclipses this number every day. Think of every screw from every supplier in every product. Then tracking the reliability of this product through the entire lifecycle with self diagnostic tests. No, this is not for your toy made in china, but when it comes to real top end products that HAVE to work, then you need this kind of data to figure out what went wrong and fix it fast. That could save your company millions. No, making your latest dot bomb app does not need this, but there are many places that do. Also check out financial apps like credit fraud, insurance, etc.
I tend to agree. As a storage admin for a Multi-Hospital organization using anything open source is not really an option if we want to keep the HIPPA-potamus away.
I worked for over a decade as an SE for an org that was both a hospital-IT vendor and a covered entity in its own right (we sold a PMS, a PACS, operated multiple HIEs, and were a claims clearinghouse). When choosing libraries and server technology, never once was the open source status of a piece of technology a consideration with regards to HIPAA. We would occasionally have to run things by the legal team to evaluate a new license or check our compliance, but that was it. HIPAA considerations were mostly operational: Is PHI encrypted when at rest or transmitted over the open internet? Are we ensuring only authorized personnel can see PHI? How are we handling backups? The ops team took care of most of those things and they didn't care what we built the software out of, as long as it conformed to the requirements.
"alleviates"....?
It is HIPAA, not HIPPA. Just remember it is an 'Accountability Act', and you know where the double letters are.
I think the idea was that Hadoop is a large ecosystem with good abstraction layering for attacking problems with large datasets. Whether you change out HDFS for MapR, or you change out MapReduce processing for Spark, or you decide to do all your data processing in real time with Storm, the Hadoop model works well.
If you wanted to justify discarding Hadoop's design completely, you'd have to be ready to re-write all the pieces that currently address people's needs. It's not unlike writing a new kernel (which you can do pretty easily) but then re-writing and testing all the hardware drivers (which would take decades).
Betteridge's law of headlines finally proven wrong?
You must be new to the game. Another vendor, with their special configuration platform and customized layers definitely "elevates" complexity.
Has anyone considered Joyent's Manta ?
This is a distributed object storage with integrated compute.
Data is stored on a cluster of SmartOS hosts.
And processed directly on each host inside an OS container (SmartOS zone), with no data movement.
Lots of APIs available: R, command-line, python, ruby, node.js, etc.
Available on their cloud and as an on-premises commercial product, open-sourced last November (simultaneously with smartdatacenter).