Apache Hadoop Has Failed Us, Tech Experts Say (datanami.com)

Posted by EditorDavid on Saturday March 25, 2017 @09:34AM from the future-is-cloud-y dept.

It was the first widely-adopted open source distributed computing platform. But some geeks running it are telling Datanami that Hadoop "is great if you're a data scientist who knows how to code in MapReduce or Pig...but as you go higher up the stack, the abstraction layers have mostly failed to deliver on the promise of enabling business analysts to get at the data." Slashdot reader atcclears shares their report: "I can't find a happy Hadoop customer. It's sort of as simple as that," says Bob Muglia, CEO of Snowflake Computing, which develops and runs a cloud-based relational data warehouse offering. "It's very clear to me, technologically, that it's not the technology base the world will be built on going forward"... [T]hanks to better mousetraps like S3 (for storage) and Spark (for processing), Hadoop will be relegated to niche and legacy statuses going forward, Muglia says. "The number of customers who have actually successfully tamed Hadoop is probably less than 20 and it might be less than 10..."

One of the companies that supposedly tamed Hadoop is Facebook...but according to Bobby Johnson, who helped run Facebook's Hadoop cluster before co-founding behavioral analytics company Interana, the fact that Hadoop is still around is a "historical glitch. That may be a little strong," Johnson says. "But there's a bunch of things that people have been trying to do with it for a long time that it's just not well suited for." Hadoop's strengths lie in serving as a cheap storage repository and for processing ETL batch workloads, Johnson says. But it's ill-suited for running interactive, user-facing applications... "After years of banging our heads against it at Facebook, it was never great at it," he says. "It's really hard to dig into and actually get real answers from... You really have to understand how this thing works to get what you want."
Johnson recommends Apache Kafka instead for big data applications, arguing "there's a pipe of data and anything that wants to do something useful with it can tap into that thing. That feels like a better unifying principle..." And the creator of Kafka -- who ran Hadoop clusters at LinkedIn -- calls Hadoop "just a very complicated stack to build on."

150 comments

MapReduce is great by Anonymous Coward · 2017-03-25 09:44 · Score: 4, Insightful

If 1) you have a staff of elite programmers like Google or Facebook, who have CS degrees from top universities and are accustomed to picking up new programming languages and tools on a continuing basis; AND
2) your business has a pressing need to crunch terabytes of logs or document data with no fixed schema and continually changing business needs.
For the average Fortune 500 (or even IT) shop, not so much. A '90s style data warehouse accessible through SQL queries works much better.
1. Re:MapReduce is great by gweihir · 2017-03-25 10:01 · Score: 1
  
  I have to say that I am less impressed with the quality of the coders at Google the more I know about them. The really good ones are leaving, are thinking about leaving or have already left a while ago. What is left is the mediocre ones that somehow managed to get in.
  
  --
  Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
2. Re:MapReduce is great by Anonymous Coward · 2017-03-25 10:17 · Score: 4, Funny
  
  You've done an incredible amount of work to reach this conclusion. Congrats. Did you use map-reduce on your data set?
3. Re:MapReduce is great by Anonymous Coward · 2017-03-25 10:19 · Score: 0
  
  I have to say that I am less impressed with the quality of the coders at Google the more I know about them. The really good ones are leaving, are thinking about leaving or have already left a while ago. What is left is the mediocre ones that somehow managed to get in.
  If you aren't impressed with the coders at google, you should talk to the ones at Amazon.
4. Re:MapReduce is great by Anonymous Coward · 2017-03-25 10:22 · Score: 1
  
  If you make people jump through hoops like circus animals to come work at your company you only get the desperate, or the ones who want the job as a status symbol.
5. Re:MapReduce is great by Anonymous Coward · 2017-03-25 10:25 · Score: 0
  
  I imagine everyone knows by now they're a shit company to work for and avoids them if possible. They're only good in terms of having a big name on your resume. Google has a slightly better reputation as an employer, but I imagine there is a lot of entry level competition to work for them, but once you have their name on your resume, you get better offers at numerous other companies where you will feel like you'll have more of an impact in the product(s) and success of the company.
6. Re:MapReduce is great by jbolden · 2017-03-25 10:33 · Score: 0
  
  I'd say it's really this.
  You have a business problem which is completely unrealistic to solve via vertical scaling on SQL. Things in the range of 50-200k CPU hours are what Hadoop is good for. SQL solutions are pretty dreadful at 1000 CPU hours type workloads. BTW, think petabytes and exabytes; SQL is pretty good for terabytes.
7. Re:MapReduce is great by Mitreya · 2017-03-25 11:06 · Score: 2
  
  1) you have a staff of elite programmers like Google or Facebook, who have CS degrees from top universities and are accustomed to picking up new programming languages and tools on a continuing basis;
  
  I disagree.
  MapReduce is actually great for teaching people about parallel processing! I have been able to teach a distributed computing course to non-CS (primarily data science) MS students because it achieves parallelization without most of the complexities associated with distributed query processing. With Hadoop streaming, all you need is basic knowledge of python (or similar) to write your own custom jobs, even without Hive/Pig/etc.
  That to me is one of the greatest accomplishments of MapReduce. Bringing distributed computing concepts to the general audience.
  
  2) your business has a pressing need to crunch terabytes of logs or document data with no fixed schema and continually changing business needs.
  That part is true. Almost no one has that much of a data processing need.
  But it is still good for teaching distributed / remote computing to non-CS majors.
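
As an illustration of the Hadoop streaming point above, here is a minimal word-count sketch in Python. With Hadoop streaming the mapper and reducer would normally be separate scripts that read stdin and print tab-separated key/value lines; the `run_local` helper below is a hypothetical stand-in for the framework's sort-and-shuffle step so the example runs on a laptop.

```python
from itertools import groupby


def mapper(lines):
    """Emit one "word<TAB>1" record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def record_key(record):
    """Return the key part of a tab-separated "key<TAB>value" record."""
    return record.split("\t", 1)[0]


def reducer(sorted_records):
    """Sum the counts for each run of identical keys in key-sorted input."""
    for word, group in groupby(sorted_records, key=record_key):
        yield word, sum(int(record.split("\t", 1)[1]) for record in group)


def run_local(lines):
    """Hypothetical stand-in for the framework: map, sort by key, reduce."""
    return dict(reducer(sorted(mapper(lines), key=record_key)))


counts = run_local(["the cat sat", "the dog sat down"])
print(counts)
```

The student only writes `mapper` and `reducer`; splitting the input, sorting by key, and running many copies in parallel is the framework's job, which is why the model is approachable for non-CS students.
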
8. Re:MapReduce is great by Nkwe · 2017-03-25 11:13 · Score: 1
  
  If you make people jump through hoops like circus animals to come work at your company you only get the desperate, or the ones who want the job as a status symbol.
  Or the ones who like being made to jump through hoops like a circus animal. I guess if you are into that it's okay; who am I to judge?
9. Re:MapReduce is great by Anonymous Coward · 2017-03-25 11:15 · Score: 0
  
  "What is left *are* the mediocre ones that somehow managed to get in."
10. Re:MapReduce is great by gweihir · 2017-03-25 11:18 · Score: 5, Interesting
  
  Indeed. I went through their "interview-process" a while back at the request of a friend that was there and desperately wanted me for his team. Interestingly, I failed to get hired, and I think it is because I knew a lot more about the questions they asked than the people that created (and asked) these questions. For example, on (non-cryptographic) hash-functions my answer was to not do them yourself, because they would always be pretty bad, and to instead use the ones by Bob Jenkins, or if things are slow because there is a disk-access in there to use a crypto hash. While that is what you do in reality if you have more than small tables, that was apparently very much not what they wanted to hear. They apparently wanted me to start to mess around with the usual things you find in algorithm books. Turns out, I did way back, but when I put 100 Million IP addresses into such a table, it performed abysmally. My take-away is that Google prefers to hire highly intelligent, but semi-smart people with semi-knowledge about things and little experience and that experienced and smart people fail their interviews unless they prepare for giving dumber answers than they can give. I will never do that.
  On the plus side, my current job is way more interesting than anything Google would have offered me.
  
  --
  Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
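
For readers who have not seen one, here is a sketch of one of Bob Jenkins's published non-cryptographic hashes, the "one-at-a-time" function, ported to Python. It is not necessarily the function the comment above had in mind (Jenkins published several, including the faster lookup3), and the `bucket_for` helper and the table size are assumptions made up for the example.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit unsigned arithmetic in Python


def one_at_a_time(data: bytes) -> int:
    """Bob Jenkins's one-at-a-time hash over a byte string (32-bit result)."""
    h = 0
    for byte in data:
        h = (h + byte) & MASK32
        h = (h + (h << 10)) & MASK32
        h ^= h >> 6
    # Final avalanche steps mix the last bytes into all output bits.
    h = (h + (h << 3)) & MASK32
    h ^= h >> 11
    h = (h + (h << 15)) & MASK32
    return h


def bucket_for(key: str, num_buckets: int) -> int:
    """Hypothetical helper: map a string key (e.g. an IP address) to a bucket."""
    return one_at_a_time(key.encode("utf-8")) % num_buckets


print(hex(one_at_a_time(b"a")))
print(bucket_for("192.0.2.1", 1024))
```
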
11. Re: MapReduce is great by Anonymous Coward · 2017-03-25 11:19 · Score: 1
  
  I went through the process a few years ago for an SRE position. It was exactly the same process used at most other tech companies: a couple of screening interviews on the office, and a half to 2/3 day of on-site, one on one, specific tech interviews with people who *seriously* know their stuff.
  The campus is overall a weird cult, and they don't have other offices in places I want to live (maybe Pittsburgh, someday), so I don't work there. But they haven't done the really weird interviews that they used to be famous for in quite a while.
12. Re:MapReduce is great by Anonymous Coward · 2017-03-25 11:19 · Score: 0
  
  and when the circus animals are running the circus, the hoops gradually get higher and higher.
13. Re:MapReduce is great by Anonymous Coward · 2017-03-25 11:21 · Score: 0
  
  They hired a few Bs... (e.g. As tend to hire As... Bs tend to hire Bs and Cs... once the whole company is made up of Bs and Cs, it's done [or rather, starts issuing dividends on incredible revenues and stops growing faster than others]).
14. Re:MapReduce is great by Anonymous Coward · 2017-03-25 11:24 · Score: 0
  
  Are you guys hiring ?
15. Re:MapReduce is great by angel'o'sphere · 2017-03-25 11:27 · Score: 1
  
  I have heard plenty of stories like this.
  And I have to say, while the questions google is asking in an interview are relevant for their business, they are rather simple.
  I guess I would fail an interview, too.
  On the other hand, I work freelance, so big companies are rarely interesting.
  
  --
  Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
16. Re:MapReduce is great by Antique+Geekmeister · 2017-03-25 11:35 · Score: 1
  
  > MapReduce is actually great for teaching people about parallel processing! I
  And about how _not_ to do it. The underlying expense and architecture mistakes "scalability" for actual throughput in processing. It's proven extremely unstable in tasks larger than a small proof of concept, and in any task I've encountered in which the actual data has to be successfully processed and verified within a specified deadline.
17. Re:MapReduce is great by Tablizer · 2017-03-25 11:36 · Score: 5, Insightful
  
  As a reminder, SQL is a query language and not a hardware technology. It doesn't dictate HOW to store data (assuming it meets certain minimum standards). You probably are referring to typical RDBMS.
  
  --
  Table-ized A.I.
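
A small sketch of the point that SQL does not dictate how data is stored: the same statements run unchanged against an in-memory database and an on-disk file. It uses Python's standard `sqlite3` module, so it only varies the storage under one engine; the `sales` table and its rows are invented for the example.

```python
import os
import sqlite3
import tempfile

QUERY = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"


def load_and_query(database):
    """Create the hypothetical sales table in `database` and run QUERY on it."""
    conn = sqlite3.connect(database)
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
        conn.executemany(
            "INSERT INTO sales VALUES (?, ?)",
            [("east", 10), ("west", 5), ("east", 7)],
        )
        conn.commit()
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


# Same SQL, two different storage backings.
in_memory = load_and_query(":memory:")
with tempfile.TemporaryDirectory() as tmp:
    on_disk = load_and_query(os.path.join(tmp, "sales.db"))

print(in_memory, on_disk)
```
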
18. Re:MapReduce is great by molarmass192 · 2017-03-25 11:59 · Score: 1
  
  I was working with multi-petabyte GIS databases on Oracle 9 in the late 90s. SQL is just fine in multi-PBs if you know how to write a query and tune a database.
  
  --
  
  Good people do not need laws to tell them to act responsibly, while bad people will find a way around the laws-Plato
19. Re: MapReduce is great by Anonymous Coward · 2017-03-25 12:06 · Score: 5, Interesting
  
  That's because the mediocre programmers are the ones giving the interviews. Close friend interviewed last year only to sit in front of a bunch of know-it-all elitists. One douche rambled on about how he wishes there were monads in C++ and how great functional design is. Now my friend and his roommate are CS geeks and their spare time is doing shit like build a lisp interpreter in C++ just for fun. So he asked mr monad if the project used a functional approach which was a solid no. Idiot just wanted to show off the fact he knew what functional programming is and wasted time. He passed on the Google job for a big local company doing back end dev work. Job pays as well as Google, comes with the ability to work remotely, and has none of the pompous know-nothings. Fuck working for Google.
20. Re:MapReduce is great by Pseudonym · 2017-03-25 12:31 · Score: 1
  
  I work with (multi-terabyte, not multi-petabyte) GIS databases. I am also a Haskell programmer (though not for my day job) so MapReduce doesn't scare me off at all. It's very hard to see how MapReduce specifically would help large-scale GIS.
  The main benefit of MapReduce for most problems isn't the programming model, it's the principle "move your code to where the data is" in a way that's agnostic to precisely where the data is. When you have big data, you need to do that. Precisely what that code does is a secondary concern.
  
  --
  sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
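
The "move your code to where the data is" principle can be sketched roughly as follows. The node names, the `NODES` dictionary, and `run_on_node` are all hypothetical: a thread pool stands in for a cluster, and the point is only that a small function travels to each partition and a small summary travels back, never the raw data.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical "nodes": each one holds a partition of the data locally.
NODES = {
    "node-a": [3, 8, 15],
    "node-b": [4, 23],
    "node-c": [16, 42, 1, 9],
}


def run_on_node(node_name, func):
    """Pretend to ship `func` to a node; only its small result comes back."""
    return func(NODES[node_name])


def local_summary(partition):
    """The code that runs next to the data: returns (count, sum)."""
    return (len(partition), sum(partition))


def global_mean():
    """Combine per-node summaries without ever moving the partitions."""
    with ThreadPoolExecutor() as pool:
        partials = list(
            pool.map(lambda name: run_on_node(name, local_summary), NODES)
        )
    count = sum(c for c, _ in partials)
    total = sum(s for _, s in partials)
    return total / count


mean = global_mean()
print(mean)
```

The caller never needs to know which node holds which rows, which matches the "agnostic to precisely where the data is" part of the comment.
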
21. Re:MapReduce is great by lucm · 2017-03-25 12:33 · Score: 1
  
  The problem those companies face is that they grew so fast that they're struggling with past technical decisions that are difficult to revert (e.g. Twitter and their initial RoR architecture). The wheels keep turning so they end up having to build sophisticated layers on top of their legacy garbage.
  We've all been there. Someone (maybe even you) builds a throwaway Excel macro or Wordpress-driven monstrosity just to address a temporary need that is not worth spending more than 2h on, and first thing you know it's become a mission critical component. Now imagine that this piece of shit ends up powering tools used by millions of users; you can't simply stop everything and rewrite from scratch, and you can't stop adding features, for which you need busloads of new programmers. At that point you don't need programmers who will tell you: dude you should use Nodejs or MySQL, you need programmers who can apply a very narrow and deep understanding of graph computing (or whatever powers your AI-like engine) to constantly changing requirements.
  Most things they do on a daily basis at Amazon or Google would get you fired from a normal IT job, just like throwing explosives in a fire is done on a regular basis in oilfields but would cost a NYC firefighter his job.
  
  --
  lucm, indeed.
22. Re:MapReduce is great by Mitreya · 2017-03-25 12:34 · Score: 2
  
  The underlying expense and architecture mistakes "scalability" for actual throughput in processing. It's proven extremely unstable in tasks larger than a small proof of concept
  Can you elaborate on some reasons?
  I was part of a research paper some time ago, and MapReduce does have the advantage of being able to resume (rather than restart) queries on failure and of handling ad-hoc queries better (compared to RDBMS).
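
The resume-versus-restart distinction can be sketched like this: a job is split into independent tasks, and a failed task is re-queued on its own while finished results are kept. This is a toy scheduler, not Hadoop's; `run_job`, `flaky_square`, and the simulated failure are assumptions for illustration.

```python
def run_job(tasks, run_task, max_attempts=3):
    """Run each task independently and retry only the ones that fail."""
    results = {}
    attempts = {}
    pending = list(tasks)
    while pending:
        task = pending.pop(0)
        attempts[task] = attempts.get(task, 0) + 1
        try:
            results[task] = run_task(task)
        except RuntimeError:
            if attempts[task] >= max_attempts:
                raise  # give up on the whole job only after repeated failures
            pending.append(task)  # re-queue just this task, keep other results
    return results, attempts


_failed_once = set()


def flaky_square(split_id):
    """Simulated task: input split 2 fails on its first attempt only."""
    if split_id == 2 and split_id not in _failed_once:
        _failed_once.add(split_id)
        raise RuntimeError("simulated node failure")
    return split_id * split_id


results, attempts = run_job([0, 1, 2, 3], flaky_square)
print(results, attempts)
```
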
23. Re:MapReduce is great by arglebargle_xiv · 2017-03-25 12:38 · Score: 1
  
  If you make people jump through hoops like circus animals to come work at your company
  They jump through hadoops, not hoops. That's how they show they're qualified to work with it.
24. Re:MapReduce is great by arglebargle_xiv · 2017-03-25 12:43 · Score: 1
  
  +1. To get hired at Google, you have to be smart enough to get through the tests but not smart enough to know that what you're working on has already been looked at by two dozen other people at different times and there's a pretty good solution already available. Instead, you get to reinvent the wheel yourself from scratch, but yours is going to be bigger, better, faster, newer, and web scale, because you're Google. I declined to work there, it would have driven me nuts to work in such an environment.
25. Re: MapReduce is great by Anonymous Coward · 2017-03-25 14:02 · Score: 0
  
  I've also heard many such stories, but more often than not, it is someone who has an inflated view of themselves and is likely missing the actual reason they didn't get hired. I've even had these stories come up from people I'm interviewing, as if they were expecting me to be impressed with how they outsmarted the people interviewing at the big companies. It makes me wonder if after we rejected them they went on to think their answers were too smart for our interview process, instead of the fact that their sample code simply didn't work, or that they lacked communication skills, or that they refused to answer technical questions directly relevant for the position (e.g. being hired to maintain a numerics library does require some knowledge of the algorithms involved).
  On the flip side, with good interviewees or when interviewing for a position myself, I've found it simple to avoid the problem the GP mentions where you answer the wrong way: just ask what type of answer they want and possibly give a quick version of different kinds of answer. If they ask you how to do X, don't just answer "library foo," but instead say, "do you want to know what algorithm I would use if doing it from scratch, as I would use bar's algorithm, but if you want a more real world answer for when not writing from scratch, I would use library foo because it is well optimized for most situations because of feature baz."
26. Re: MapReduce is great by Anonymous Coward · 2017-03-25 14:21 · Score: 0
  
  A 10GB drive in 1999 was pretty big, so to have a multipetabyte array would mean at least 200,000 drives. You actually worked on a project that would have purchased over half a percent of all drives sold in the US (or Europe)? If you didn't mean 1999 by late 90s, you would have to double or quadruple that to a couple percent of drives sold.
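
The drive-count arithmetic above can be checked with a few lines of Python. It assumes decimal units (1 PB = 1,000,000 GB), reads "multipetabyte" as at least 2 PB, and ignores redundancy; the 5 GB case is an assumed smaller drive size for an earlier year.

```python
GB_PER_PB = 1_000_000  # decimal units, as drive vendors count them


def drives_needed(petabytes, drive_gb):
    """Smallest whole number of drives that holds the given capacity."""
    total_gb = petabytes * GB_PER_PB
    return -(-total_gb // drive_gb)  # ceiling division on integers


late_1999 = drives_needed(2, 10)  # 2 PB on 10 GB drives
earlier = drives_needed(2, 5)     # same capacity on assumed 5 GB drives

print(late_1999, earlier)
```
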
27. Re:MapReduce is great by jbolden · 2017-03-25 14:29 · Score: 0
  
  We aren't talking size of the total database, we are talking size of the dataset that needs to be manipulated for the query. You can't tune Oracle for Hadoop levels of parallelism.
28. Re:MapReduce is great by Antique+Geekmeister · 2017-03-25 15:53 · Score: 1
  
  > Can you elaborate on some reasons?
  It has suffered from a problem common to various object oriented projects: by refusing to acknowledge the existence of lower level structures, such as the very real storage hardware and real network connections necessary to propagate the data among the nodes for effective access, the result is that it didn't scale. Backup of results from well-delineated processing steps, which is critical for debugging or re-running new versions of particular processing steps, wound up being quite slow, quite expensive, and was often ignored. The result often became a demand for complete snapshotting of the _entire_ system before scheduled processing, which does _not_ scale well. And the critical management nodes themselves increased in fragility as the system scaled.
  > Map Reduce does have the advantage of in the ability to resume (rather than restart) queries on failure.
  Only for failures of the particular query. From my experience, very few programmers elected to verify the results of any query. And if the query failed, it should be _reported_ as failed and re-run as needed. The possibility of a failure of the MapReduce[sic] system itself was seen as involving a different layer of abstraction. I'm afraid that some, if not most, Java programmers have been taught _not_ to acknowledge or to deal with errors from lower layers. Errors for individual operations grew exponentially in frequency as the size of a cluster grew. From a recent project, anything more than approximately 10 nodes with more than 20 TB of overall disk space failed so often that it was unusable.
  Ad-hoc queries are important to be able to handle. Many that I've seen in fieldwork are so badly structured that spending the engineering time on a database programmer who can help optimize them is much more cost effective. Such an engineer can even work on the underlying RDBMS structures if needed, at far greater cost/benefit than trying to maintain a MapReduce system.
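
A back-of-the-envelope model of why failures show up more often as a cluster grows: if each node independently fails during a job with probability p, the chance that at least one of n nodes fails is 1 - (1 - p)^n. The independence assumption and the 1% figure are illustrative assumptions, not measurements from the project described above.

```python
def p_any_failure(p_node, nodes):
    """Probability that at least one of `nodes` independent nodes fails."""
    return 1.0 - (1.0 - p_node) ** nodes


# Assumed per-node failure probability of 1% per job (illustrative only).
for n in (1, 10, 100, 1000):
    print(n, round(p_any_failure(0.01, n), 4))
```

Under these assumptions a job touching a handful of nodes rarely sees a failure, while one touching hundreds almost always does, so a system that does not handle lower-layer errors gracefully becomes unusable well before it becomes large.
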
29. Re:MapReduce is great by Kjella · 2017-03-25 16:04 · Score: 5, Interesting
  
  For example, on (non-cryptographic) hash-functions my answer was to not do them yourself, because they would always be pretty bad, and to instead use the ones by Bob Jenkins, or if things are slow because there is a disk-access in there to use a crypto hash. While that is what you do in reality if you have more than small tables, that was apparently very much not what they wanted to hear. They apparently wanted me to start to mess around with the usual things you find in algorithm books.
  No offense, but "I'd rather just use a library" seriously brings into question what you bring to the table and whether you'll just be searching experts-exchange for smart stuff other people have done. Like everybody knows you shouldn't use homegrown cryptographic algorithms, but if a cryptologist can't tell me what an S-box is and points me to using a library instead it doesn't really tell me anything about his skill, except he didn't want to answer the question. In fact, dodging the question like that would be a pretty big red flag.
  Don't get me wrong, you can get there. But start off with roughly what you'd do if you had to implement it from scratch, what's difficult to get right, then suggest implementations you know or alternative ways to solve it. Because they're not that stupid that they think this is some novel issue nobody's ever looked at before or found decent answers to. They want to test if you have the intellect, knowledge and creativity to sketch a solution yourself. Once you've done that, then you can tell them why it's probably not a good idea to reinvent the wheel.
  
  --
  Live today, because you never know what tomorrow brings
30. Re: MapReduce is great by aaronb1138 · 2017-03-25 17:04 · Score: 1
  
  When Hadoop arrived on the scene it had the exact smell of typical programmer make work projects like Ruby and SEO.
31. Re:MapReduce is great by globaljustin · 2017-03-25 17:10 · Score: 1
  
  Interesting comments on this thread, thanks. I've learned a lot.
  fwiw, I have a network engineering background and Hadoop always seemed like a clusterfsk to me...good to learn the actual story isn't far from my impressions.
  
  --
  Thank you Dave Raggett
32. Re:MapReduce is great by colinrichardday · 2017-03-25 17:46 · Score: 1
  
  Are hadoops hoops with ads in them?
33. Re: MapReduce is great by Anonymous Coward · 2017-03-25 20:47 · Score: 0
  
  The first time we looked at Hadoop was back when it had a single point of failure, so I could only recommend vanilla Hadoop for research purposes. It has at least improved since then.
34. Re: MapReduce is great by Anonymous Coward · 2017-03-25 20:51 · Score: 0
  
  Indeed, aren't there SQL shims over hadoop?
35. Re: MapReduce is great by nebosuke · 2017-03-25 21:50 · Score: 1
  
  Removable media (e.g. tape and WORM optical disk) libraries were typical for petabyte+ storage arrays back in the late 90s. I remember the Subaru telescope facility in Hawaii had a petabyte storage facility which was primarily an automated tape library (plus a large section of wall occupied by a physically massive ~40gb RAM array) when I very briefly interned* there in the late 90s.
  That was large, but not uniquely or ridiculously large. My WAG is that, globally, there were probably on the order of 1k installations of similar or greater magnitude at the time. Certainly some of the DOD projects at LLNL would easily have been at the scale your parent poster claims.
  *I.e. assisted with workstation builds & linux installs.
36. Re:MapReduce is great by Anonymous Coward · 2017-03-26 01:07 · Score: 0
  
  Your insight into his methods is phenomenal. Does it come from your ego?
  Captcha: really[?]
37. Re:MapReduce is great by Anonymous Coward · 2017-03-26 03:10 · Score: 0
  
  accustomed to picking up new programming languages and tools on a continuing basis
  Been in this industry forever, and every two-bit brogrammer claims they do this.
  None of them do.
38. Re:MapReduce is great by Anonymous Coward · 2017-03-26 04:07 · Score: 0
  
  No offense, but "I'd rather just use a library" seriously brings into question what you bring to the table.
  Except that's the right answer. It's arrogant pricks who think that they're hot shit who reinvent the wheel, do it badly and then charge headlong into their next coding disaster, energy drink in hand and earbuds in ears. Meanwhile, a more responsible engineer has to come along afterwards and clean up the hot mess so that the users can actually have a working system that isn't chock full of silly bugs.
  
  They want to test if you have the intellect, knowledge and creativity to sketch a solution yourself.
  The best way to determine that is to ask an abstract hypothetical question, where there is no existing implementation and no risk of getting it wrong. Bringing in real world concerns that you want the candidate to ignore because "it's an interview question" is stupid because it clouds the issue and prevents the type of answer that you're looking for. Maybe the candidate is an honest guy and prefers to give you the "don't write your own encryption algorithms" answer because in reality that is the right answer. Then you pass up an otherwise excellent candidate because your interview question was poor. Is that really what you want?
39. Re: MapReduce is great by Anonymous Coward · 2017-03-26 05:48 · Score: 0
  
  Basically add a bunch of useless fluff in there instead of just giving them a straight forward answer. I guess they are testing you to see how many company buzz words you can throw around.
  I bet if he answered "first I'd use google to find the best algorithms available. Then I'd use Google hangouts to post a survey to get some feedback from my fellow developers. Then I'd use Google Docs to setup a spread sheet to list the pros and cons...."
  You get the drift.
40. Re:MapReduce is great by gweihir · 2017-03-26 05:52 · Score: 3, Interesting
  
  No offense, but you miss the point entirely. What I answered is very far from "use a library". First, it is an algorithm, not a library. That difference is very important. Second, it is a carefully selected algorithm that performs much better than what you commonly find in "libraries" in almost all situations. And third, the hash-functions by Bob Jenkins (and the newer ones by DJB, for example) are inspired by crypto, but much faster in exchange for reduced security assurances. In fact so fast that they can compete directly with the far worse things commonly in use. "Do not roll your own crypto" _does_ apply though.
  So while I think you meant to be patronizing, you just come across as incompetent. A bit like the folks at Google, come to think of it...
  
  --
  Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
41. Re:MapReduce is great by gweihir · 2017-03-26 06:00 · Score: 1
  
  No offense, but "I'd rather just use a library" seriously brings into question what you bring to the table.
  Except that's the right answer. It's arrogant pricks who think that they're hot shit who reinvent the wheel, do it badly and then charge headlong into their next coding disaster, energy drink in hand and earbuds in ears. Meanwhile, a more responsible engineer has to come along afterwards and clean up the hot mess so that the users can actually have a working system that isn't chock full of silly bugs.
  Oh yes. Of course the answer is not to use "any library", but to carefully select a good algorithm and then use a library for that. I cannot count the times some "Rockstar"-wannabe has reinvented the wheel and did it really, really badly because they were not even aware of the basics.
  
  They want to test if you have the intellect, knowledge and creativity to sketch a solution yourself.
  The best way to determine that is to ask an abstract hypothetical question, where there is no existing implementation and no risk of getting it wrong. Bringing in real world concerns that you want the candidate to ignore because "it's an interview question" is stupid because it clouds the issue and prevents the type of answer that you're looking for. Maybe the candidate is an honest guy and prefers to give you the "don't write your own encryption algorithms" answer because in reality that is the right answer. Then you pass up an otherwise excellent candidate because your interview question was poor. Is that really what you want?
  While I know that this is not what Google wanted, it is what they did. And on the hash-question, I do know that I do not have what it takes to come up with a good solution (you need to be a cryptographer for that these days and I am only a competent user of crypto) and so I have stopped bothering to even look at it. This is something that everybody competent selects from a catalog. Of course the real problem is that the Google folks vastly overestimate their own skills, or they would have been able to evaluate what the actual quality level for the selection I proposed is and then could have asked why exactly I proposed this one. That question never came, which is an utter fail.
  
  --
  Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
42. Re:MapReduce is great by gweihir · 2017-03-26 06:02 · Score: 1
  
  Actually, A-players tend to hire A-players and B-players tend to hire C-players.
  
  --
  Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
43. Re: MapReduce is great by gweihir · 2017-03-26 06:09 · Score: 1
  
  Quite possibly these people are vastly overestimating their own skills because they "work at Google". Fortunately, I did not run into socially inept interviewers, but as to the questions asked, they did not have more than surface knowledge. That is not how you interview somebody with advanced skills and experience, because people on that level rarely run into things they have not seen before in some form and that they need to solve on an elementary level. I think this happened to me once in the last 5 years, and there likely is a next case upcoming in the next few months. Both are in research projects.
  The really funny thing is that I do know Google would have needed people like me desperately, because on architecture-level (where you need the real experience and insights), they still suck badly and may even be getting worse.
  
  --
  Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
44. Re:MapReduce is great by Anonymous Coward · 2017-03-26 08:09 · Score: 0
  
  To be clear: SQL is the wrong tool for what the social media-type companies are doing and want to be doing in the future.
  Just because your mediocre company forces its employees to do it, doesn't make it the correct decision, but you also need to pull it off successfully, which is harder.
45. Re:MapReduce is great by Boronx · 2017-03-26 08:54 · Score: 1
  
  No it isn't. If I'm going to hire someone to link in a library, give me somebody who has some clue what the library is doing. The initial results will be better, and if there's something wrong with the chosen black box, we'll have a chance of figuring it out.
  
  --
  Play Command HQ online
46. Re: MapReduce is great by Anonymous Coward · 2017-03-26 09:05 · Score: 0
  
  Clearly, it was for the NSA.
47. Re:MapReduce is great by michael_wojcik · 2017-03-27 04:19 · Score: 1
  
  Or to put it another way: There are better ways to determine someone's understanding of wheels than asking them to make one.
  If I were interviewing a candidate and wanted, for some reason, some sense of that person's understanding of hash functions, I'd hope for more than "just use a library", but I also wouldn't be looking for an Introduction to Algorithms exposition on them. gweihir's original post comes pretty close to the sweet spot: it shows an understanding of the problem domain, some sense of approaches that have proven useful for particular situations, and a degree of experience. I could ask for expansion if I wanted it.
  I don't understand the Google hagiographers, personally. Yes, they have a good R&D department: they manage both solid primary research and a steady rate of applying it to their various businesses and projects. But a lot of it is overrated in the general and industry press.
  Splinter is very good work, for example. The GoogLeNet architecture, particularly the concept of Inception layers, is quite good (though already arguably surpassed by competing approaches to enhancing ConvNets). But the essential concepts of MapReduce should be obvious to any competent CS undergrad. The original PageRank was just "hey, here's a proxy for site quality": it's more (basic) economics than CS. And that's true of a lot of what comes out of Google: perfectly good, but not some sort of astounding new idea.
48. Re: MapReduce is great by RevDisk · 2017-03-27 06:55 · Score: 1
  
  Worked for DISA. Can confirm, we had massive tape silos and entire teams that loaded/unloaded bulk lots of tapes in the LTO3/LTO4 days. Usually everything worked fine with the load time to snag the proper tape, load it to a free drive and start reading. Lot of that was done by mainframes using COBOL. Fun stuff.
  
  The new Samsung 16TB SSDs will be substantial game changers in... oh, five years. They're shipping now, but if the price drops to a grand or two per SSD, it'll be really interesting for bulk storage. Petabyte level SSD storage in 5U, for a hundred grand, will be very nifty.
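A quick back-of-the-envelope check of the figures above, as a sketch only: it assumes decimal terabytes and counts raw capacity, ignoring redundancy, chassis and controller costs (none of which the comment specifies).

```python
import math

DRIVE_TB = 16                # capacity per SSD, from the comment
TARGET_TB = 1000             # 1 PB in decimal terabytes (assumption)
PRICE_RANGE = (1000, 2000)   # "a grand or two" per SSD, in dollars

# Number of drives needed for one petabyte of raw capacity.
drives = math.ceil(TARGET_TB / DRIVE_TB)

# Total drive cost at the low and high end of the price range.
low, high = (drives * price for price in PRICE_RANGE)

print(drives, low, high)
```

Under those assumptions the "hundred grand" figure sits inside the resulting range for the drives alone.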
49. Re:MapReduce is great by Anonymous Coward · 2017-03-27 07:07 · Score: 0
  
  Captcha cannot help but keep answering? Great... Sounds somewhat feminine, right? Not your answer, the answer itself, like the I Ching...
50. Re:MapReduce is great by lucm · 2017-03-27 13:12 · Score: 1
  
  Just because your mediocre company forces its employees to do it, doesn't make it the correct decision
  In my experience, the bigger the organization gets, the more it's important to think in terms of "right" practice, not "best" practice. The correct decision is the one that makes the business successful consistently; and unless you have the psychic ability to see the future, slowing down the business to do things by the book is typically a bad idea, especially if the company is experiencing a huge growth.
  
  --
  lucm, indeed.
51. Re:MapReduce is great by Anonymous Coward · 2017-03-28 23:32 · Score: 0
  
  So while I think you meant to be patronizing, you just come across as incompetent. A bit like the folks at Google, come to think of it...
  I think you may have been turned down for reasons unrelated to your technical knowledge. Which is a shame, because fixing yourself is vastly harder than looking shit up on the Internet.
Something less dismissive? by Zero__Kelvin · 2017-03-25 09:46 · Score: 1

How about: "Hadoop served many people well for a long time, but it is time for it to be deprecated now." ?

--
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
1. Re:Something less dismissive? by Anonymous Coward · 2017-03-25 10:01 · Score: 0
  
  sooo.... "hadoop sux balls for developers and ops guys" would not fall into the "less dismissive, my balls have not dropped yet" category?
2. Re:Something less dismissive? by Anonymous Coward · 2017-03-25 10:52 · Score: 0
  
  How about: "Hadoop served many people well for a long time, but it is time for it to be deprecated now." ?
  Some of the dismissiveness must surely come from the fact that Hadoop is a tool in the data processing space which has long been dominated by reliable and industrial strength tools that have stood the test of time for decades through hardware upgrades, changes in user interfaces, operating systems and countless APIs. Consider for example the concept of the relational database management system (RDBMS) and SQL which has been a workhorse of data storage, processing and analysis since at least the 1980s and still stands tall in the business and scientific computing worlds. Compared with something like that, Hadoop looks more like the proverbial flash in the pan than a worthy competitor; hence the unfavorable reception after failing to meet lofty expectations set by previous tools, like SQL and the RDBMS. The Hadoop defenders will no doubt counter with, "but Hadoop wasn't designed to be an RDBMS!", to which I say it doesn't matter. That's what people were trying to make Hadoop into because that's what businesses thought that they needed: a drop in replacement for SQL and RDBMS that addressed their scalability problems. In the meantime SQL and RDBMS developers have answered the challenge and continued improving their tools, addressing many of the shortcomings that Hadoop was supposed to resolve while Hadoop was still over promising and under delivering. The old quip is still true, "SQL is dead. Long live SQL."
3. Re:Something less dismissive? by lucm · 2017-03-25 12:14 · Score: 2
  
  The Hadoop defenders will no doubt counter with, "but Hadoop wasn't designed to be an RDBMS!", to which I say it doesn't matter. That's what people were trying to make Hadoop into because that's what businesses thought that they needed: a drop in replacement for SQL and RDBMS that addressed their scalability problems. In the meantime SQL and RDBMS developers have answered the challenge and continued improving their tools, addressing many of the shortcomings that Hadoop was supposed to resolve while Hadoop was still over promising and under delivering. The old quip is still true, "SQL is dead. Long live SQL."
  That's bullshit and obviously you're a DBA defending his turf. A Hadoop cluster will scale beyond anything an RDBMS can handle, and if the only tool in your toolbox is SQL you can use products like Hive or Hawq that will process your queries through a specialized JDBC driver and run them across as many nodes as your budget can afford.
  For instance you could have petabytes of data in CSV format stored on your HDFS cluster, and you could create a relational model on top of them without rewriting a single byte, then use SQL to interact with this huge data set. It's like mounting external sources in Oracle or Postgresql, but at a scale that neither product can process.
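The "relational model on top of files, without rewriting a single byte" idea is usually called schema-on-read. A minimal sketch of the concept in plain Python follows; it is not Hive's API, and the column names, types and sample data are hypothetical stand-ins for files on a distributed filesystem.

```python
import csv
import io

# Hypothetical schema: column name -> type applied at read time.
SCHEMA = [("user_id", int), ("country", str), ("amount", float)]

def read_with_schema(lines, schema=SCHEMA):
    """Yield dict rows, casting each raw CSV field as it is read."""
    for raw in csv.reader(lines):
        yield {name: cast(value) for (name, cast), value in zip(schema, raw)}

# Stand-in for an untouched CSV file; the bytes are never rewritten.
raw_file = io.StringIO("1,CA,10.5\n2,US,3.0\n3,CA,4.5\n")

# Rough equivalent of: SELECT country, SUM(amount) ... GROUP BY country
totals = {}
for row in read_with_schema(raw_file):
    totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]

print(totals)
```

The schema lives beside the data rather than inside it, which is why the same files can be reinterpreted later with a different model.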
  Do you know what the NSA used to store all that big brother data? Accumulo, which sits on Hadoop. They would have never been able to crunch that volume of data with [insert your RDBMS product here].
  Don't diss stuff you don't understand. Nobody is taking your precious database away, there's just an alternative for people with more complex needs.
  
  --
  lucm, indeed.
4. Re:Something less dismissive? by Anonymous Coward · 2017-03-25 13:05 · Score: 0
  
  If Hadoop is as amazing as you say it is then why aren't more companies enjoying success with it? Either you're the smartest person in the world or maybe the emperor has no clothes. Which do you think is more likely?
5. Re:Something less dismissive? by lucm · 2017-03-25 15:18 · Score: 1
  
  If Hadoop is as amazing as you say it is then why aren't more companies enjoying success with it?
  Can you provide numbers to back your statement that not many companies are "enjoying success" with it? Or are you content to repeat the same bullshit over and over?
  A few interesting facts.
  -Cloudera, Hortonworks, MapR, Pivotal are all in Gartner Magic Quadrant for Data Warehouse and Database Management Solutions for Analytics
  -Most of the big BI products (MicroStrategy, etc) offer connectors to AWS EMR, HDInsight and various other Hadoop offerings. Do you know why? Because people use them.
  -Hortonworks and Cloudera, two big Hadoop vendors, have $100+ million in revenue.
  So spin your bullshit any way you want if it makes you more secure in your little DBA garden, but Hadoop is now a growing technology in the enterprise. Are all Hadoop projects successful? Of course not, but the same goes for SAP, Cognos, Oracle or any other data-related project. Doesn't mean they're not successful products.
  
  --
  lucm, indeed.
6. Re:Something less dismissive? by orlanz · 2017-03-25 16:04 · Score: 1
  
  Nothing against Hadoop. Every problem has a proper solution provided by a proper tool.
  But petabytes isn't exactly reaching limits of Oracle or Postgresql. You start having to tune these guys & properly setting up the hardware once they get near terabytes, but I think even a vanilla Postgresql will do 1-2 Petabytes.
  Now crossing 10 Petabytes... I think it makes more sense to use Teradata. It's decades old and I don't think anything really comes close to it in today's world. Even at 1+ Petabytes, I feel more comfortable with it than Oracle or Postgresql. The latter two aren't exactly trying to fit into this domain space. But then again, if you are actually utilizing more than 10 Petabytes of data... you are doing something wrong.
  And I don't think Hadoop is trying to fit into the 10+ Petabyte world either.
7. Re:Something less dismissive? by Anonymous Coward · 2017-03-25 16:39 · Score: 0
  
  Umm, you realize that one petabyte is 1000 terabytes, right?
8. Re:Something less dismissive? by TechyImmigrant · 2017-03-26 03:13 · Score: 1
  
  >For instance you could have petabytes of data in CSV format stored on your HDFS cluster
  And somewhere in a tiny sub-corner of those petabytes, someone generated the CSV with Excel and the quoting is all messed up.
  
  --
  I should use this sig to advertise my book ISBN-13 : 978-1501515132.
9. Re:Something less dismissive? by lucm · 2017-03-27 11:39 · Score: 1
  
  >For instance you could have petabytes of data in CSV format stored on your HDFS cluster
  And somewhere in a tiny sub-corner of those petabytes, someone generated the CSV with Excel and the quoting is all messed up.
  Almost all the tools default to tab-delimited (Pig, cut, etc) but yes there's usually an Excel saboteur or two in every organization.
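For anyone who hasn't been bitten by this, here is a small sketch of why the quoting matters. The sample line is made up; the point is only that splitting on commas disagrees with a real CSV parser as soon as a field contains a comma or an escaped quote.

```python
import csv
import io

# One record, three logical fields, written the way Excel-style CSV quotes them.
line = '42,"Smith, John","said ""hi"""\n'

# Naive approach: split on every comma, quotes and all.
naive = line.rstrip("\n").split(",")

# The standard library parser honours the quoting rules.
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive), naive)
print(len(parsed), parsed)
```

Tab-delimited files sidestep the comma case, but only as long as nobody puts a tab inside a field.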
  
  --
  lucm, indeed.
Do not blame the tool(s), blame the workman... by bogaboga · 2017-03-25 09:51 · Score: 0

"It's very clear to me, technologically, that it's not the technology base the world will be built on going forward"... [T]hanks to better mousetraps like S3 (for storage) and Spark (for processing), Hadoop will be relegated to niche and legacy statuses going forward, Muglia says.
My 4th grade English teacher used to say, "A bad workman blames his tools."
Sounds relevant to me here.
1. Re: Do not blame the tool(s), blame the workman... by Anonymous Coward · 2017-03-25 10:00 · Score: 0
  
  You can mine for gold with a pick axe or a machine. Which is better?
2. Re: Do not blame the tool(s), blame the workman... by Entrope · 2017-03-25 10:00 · Score: 1
  
  Hadoop is not tools, it is one particular tool. Some tools are just bad -- I give you the magnetic stud finder as an example.
3. Re: Do not blame the tool(s), blame the workman... by Anonymous Coward · 2017-03-25 10:03 · Score: 0
  
  Did your 4th grade English teacher also tell you about affirming the consequent?
4. Re:Do not blame the tool(s), blame the workman... by Anonymous Coward · 2017-03-25 10:06 · Score: 0
  
  You're a moron. Some tools are better for certain jobs than others. Your teacher might be a moron too, but was probably trying to get a bunch of 4th graders to stop complaining about the pencil sharpener.
5. Re:Do not blame the tool(s), blame the workman... by Nkwe · 2017-03-25 11:20 · Score: 1
  
  The point isn't that all tools are equal. Different tools are clearly better or worse for different jobs. The point is that poor performance is not the tool's fault, it is the fault of the worker for either choosing the wrong tool or not knowing how to be productive with the tools available. Or the short form, "Only a Tool would blame the tool."
6. Re:Do not blame the tool(s), blame the workman... by somenickname · 2017-03-25 12:59 · Score: 5, Insightful
  
  My 4th grade English teacher used to say, "A bad workman blames his tools."
  Sounds relevant to me here.
  Apparently your 4th grade English teacher has never tried to use a hammer covered in spikes that arrived in a box labeled "Screwdriver".
7. Re:Do not blame the tool(s), blame the workman... by Tablizer · 2017-03-25 19:24 · Score: 1
  
  Sounds like a Catholic school punishment tool.
  
  --
  Table-ized A.I.
8. Re:Do not blame the tool(s), blame the workman... by Anonymous Coward · 2017-03-25 21:15 · Score: 0
  
  Apparently your 4th grade English teacher has never tried to use a hammer covered in spikes that arrived in a box labeled "Screwdriver".
  The fact that you received a modded hammer instead of a screwdriver does not make it not your fault for trying to use said hammer instead of a screwdriver. Monkey see label, monkey use tool?
  I suspect this teacher would indeed never try to use such a hammer, regardless of what labeling was on the box it came in. Because that is not a workman's tool - about the only thing it would be good for is demolition, and there are tools for that as well.
9. Re:Do not blame the tool(s), blame the workman... by roman_mir · 2017-03-26 00:03 · Score: 1
  
  Home Depot saw your order for a meat tenderizer and did their best to help...
  
  --
  You can't handle the truth.
10. Re: Do not blame the tool(s), blame the workman... by Aighearach · 2017-03-26 16:54 · Score: 1
  
  Tools is as tools does
11. Re:Do not blame the tool(s), blame the workman... by michael_wojcik · 2017-03-27 04:29 · Score: 1
  
  My 4th grade English teacher used to say, "A bad workman blames his tools."
  Did your English teacher also explain the concept of the cliché?
  This particularly tiresome one, of dubious provenance (wikiquote cites numerous variations from a host of sources), is surely mentioned at least a few times in the comments for any thread about deficiencies in a product. It seems terribly unlikely that anyone is reading it here for the first time.
  It's a splendid example of sophomoric thinking. Yes, poor workers often blame tools. So do good ones, with reason. It's as uncompelling a maxim as I've heard.
12. Re:Do not blame the tool(s), blame the workman... by krisbrowne42 · 2017-03-27 05:31 · Score: 1
  
  My 4th grade English teacher used to say, "A bad workman blames his tools."
  Sounds relevant to me here.
  Apparently your 4th grade English teacher has never tried to use a hammer covered in spikes that arrived in a box labeled "Screwdriver".
  You found Windows! Don't forget that the handle's splinters each carry a different painful virus
It has not by gweihir · 2017-03-25 09:58 · Score: 3, Insightful

What has happened instead is that quite a few "tech experts" did not understand what it actually was and had completely unrealistic expectations. Map-reduce is nice when you a) have computing power coming out of your ears and b) have very specific computing tasks. That means that in almost all cases, this technology is a bad choice and that was rather obvious to any actual expert right from the start.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
1. Re:It has not by Anonymous Coward · 2017-03-25 10:58 · Score: 0
  
  What has happened instead is that quite a few "tech experts" did not understand what it actually was and had completely unrealistic expectations. Map-reduce is nice when you a) have computing power coming out of your ears and b) have very specific computing tasks. That means that in almost all cases, this technology is a bad choice and that was rather obvious to any actual expert right from the start.
  Sounds kind of like the blockchain craze that's mesmerizing the fintech folks these days. Maybe blockchains will turn out to be their version of Hadoop, another hot new technology that ultimately disappoints.
2. Re:It has not by gweihir · 2017-03-25 11:07 · Score: 1
  
  Would not surprise me at all.
  
  --
  Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
3. Re:It has not by lucm · 2017-03-25 12:02 · Score: 1
  
  What has happened instead is that quite a few "tech experts" did not understand what it actually was and had completely unrealistic expectations. Map-reduce is nice when you a) have computing power coming out of your ears and b) have very specific computing tasks.
  Spot on. Hadoop is meant to run on a shitload of commodity computers, which is something most organizations don't have - if you can afford a shitload of commodity computers your sysadmins will probably choose to buy high-end SAN and top notch blade servers, and virtualize everything.
  You can see it immediately when you install a packaged version like Hortonworks; the wizard will put data on all your volumes because it assumes you're running on a bunch of low-end servers with shitty RAID or even JBOD - but if you're in a typical enterprise situation, your server is a virtual machine and all the volumes come from the same virtual disk so there's no point in spreading the data across volumes.
  And the specialized computing part is also true. Processing data on cluster means that you have to "think" your workload in terms of map-reduce (whether you're crunching on Hadoop MR, Tez or Spark) and this does not always translate in a computing environment that is relevant for everyday situations.
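To make "thinking your workload in terms of map-reduce" concrete, here is a minimal single-process sketch of the three phases using word count. It assumes nothing about Hadoop's actual API; the function names and sample records are illustrative only.

```python
from collections import defaultdict
from functools import reduce

def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values seen for one key."""
    return key, reduce(lambda a, b: a + b, values)

records = ["Hadoop runs MapReduce", "Spark runs on Hadoop"]

# Map every record independently, then group and reduce per key.
pairs = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(counts)
```

Each map call sees one record and each reduce call sees one key, which is what lets a cluster spread the work across nodes, and also why workloads that need global state fit the model poorly.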
  Basically, those tools were designed for Google and Yahoo: tons of servers, team of highly skilled programmers. It's still a valuable technology stack if you have the right use case but more often than not, a typical BI product or an MPP appliance is a better choice.
  
  --
  lucm, indeed.
4. Re:It has not by Xyrus · 2017-03-26 01:36 · Score: 1
  
  Precisely. Hadoop was marketed as a big data panacea and everyone tried to apply it to everything only to discover that it really wasn't a panacea and really wasn't a good solution to the problems they were throwing at it. In addition, it's not particularly easy to use and you can spend a considerable amount of time just in configuring, tweaking, and maintaining the system.
  Hadoop, like any other tool, has its uses. But like any other tool, if you try to apply it outside of what it was really intended to be used for, you're going to run into problems.
  
  --
  ~X~
Illiterate cackwads by Hognoxious · 2017-03-25 10:04 · Score: 1

That feels like a better unifying principal
They're choosing someone to lead the merger of some high schools?
Fucking hell, unless you chew your tongue when you talk they don't even sound the same.

--
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
1. Re:Illiterate cackwads by phantomfive · 2017-03-25 11:18 · Score: 1
  
  they don't even sound the same.
  In Americanish they do.
  
  --
  "First they came for the slanderers and i said nothing."
2. Re:Illiterate cackwads by hackwrench · 2017-03-25 12:41 · Score: 1
  
  phantomfive's right. Do you happen to know where to go to listen to some sound files to hear how the rest of the world pronounces them?
3. Re:Illiterate cackwads by 50000BTU_barbecue · 2017-03-25 14:26 · Score: 1
  
  your rite, the author of this artical needs to be kicked in the testical
  
  --
  Mostly random stuff.
4. Re:Illiterate cackwads by Anonymous Coward · 2017-03-25 19:20 · Score: 0
  
  fuk grammer notsees!
5. Re:Illiterate cackwads by rl117 · 2017-03-25 23:30 · Score: 1
  
  It's not grammar that's the problem, it's basic spelling!
A little clueless.... by Anonymous Coward · 2017-03-25 10:16 · Score: 5, Informative

Did nobody explain to the original poster that Spark in serious deployments is built on top of Hadoop? Or that Kafka uses the Hadoop (YARN) scheduler and is generally used to sink data to HDFS files, also built on top of Hadoop? This is kind of like someone saying that TCP/IP is no longer relevant because we now have DNS....
1. Re:A little clueless.... by angel'o'sphere · 2017-03-25 21:58 · Score: 1
  
  Was about to say the same, hehe ...
  
  --
  Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
2. Re:A little clueless.... by Anonymous Coward · 2017-03-26 07:45 · Score: 0
  
  I'm sure the "expert," that sells proprietary cloud based data warehouse software, knows what he is talking about.
Just say Pachyderm by awilden · 2017-03-25 10:16 · Score: 3, Informative

People should check out these guys: http://pachyderm.io/ The power of Hadoop, but you choose whatever programming language you think is best for you.
1. Re:Just say Pachyderm by tikijetski · 2017-03-26 01:12 · Score: 1
  
  That plus the ability to track all intermediate results (data provenance) is very compelling. You can read more about the principles behind our tech here: http://pachyderm.io/dsbor.html [disclaimer: I'm employed by pachyderm]
Apache Spark by phantomfive · 2017-03-25 10:18 · Score: 1

As far as I know, most people are using Apache Spark for new projects.

--
"First they came for the slanderers and i said nothing."
1. Re:Apache Spark by lucm · 2017-03-25 11:53 · Score: 1
  
  As far as I know, most people are using Apache Spark for new projects.
  Spark is a framework that includes ETL, in-memory computing and a machine learning library - a typical case of wheel reinventing.
  Those "most" people you mention probably only use the machine learning part, and on a fairly small data set. In theory, Spark RDD can scale to "Petabytes" (says them) but I've never seen it work on even TB-level volumes of data, while Hadoop scales to unlimited volumes (Yahoo used to run a 40,000-node cluster).
  Spark is awesome but it's not a replacement for Hadoop for distributed computing, it's not as powerful as sqoop for ETL and it's not as advanced as Flink for streaming. They should just focus on the machine learning library, like Mahout ended up doing.
  
  --
  lucm, indeed.
2. Re:Apache Spark by phantomfive · 2017-03-25 12:30 · Score: 1
  
  Good to know.
  
  --
  "First they came for the slanderers and i said nothing."
3. Re:Apache Spark by angel'o'sphere · 2017-03-25 21:59 · Score: 1
  
  If you actually would work with Spark, you would know it is based on Hadoop, just saying.
  
  --
  Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
4. Re:Apache Spark by lucm · 2017-03-26 02:01 · Score: 1
  
  If you actually would work with Spark, you would know it is based on Hadoop, just saying.
  Even a retard with a low-speed internet access can look this up on Wikipedia and prove you wrong. Are you trolling or just stupid?
  
  --
  lucm, indeed.
5. Re: Apache Spark by Anonymous Coward · 2017-03-26 03:18 · Score: 0
  
  You're a stupid motherfucker. You have nothing useful to say. You contribute nothing useful to this site or to society. Go fuck yourself in the ass with a rusty spoon. I can't get over what a stupid motherfucker you are. There's a reason so many people have added you as a foe. Nobody likes you. You're a worthless sack of shit. Fuck off.
6. Re:Apache Spark by angel'o'sphere · 2017-03-26 04:58 · Score: 1
  
  Are we talking about the same : http://spark.apache.org/ ??
  Why so angry?
  
  --
  Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
7. Re:Apache Spark by lucm · 2017-03-27 12:57 · Score: 1
  
  Yes. Spark can optionally run on Hadoop, which is not the same thing as being based on Hadoop. So before implying that other people would "know" something if they had worked with Spark, make sure that the thing in question is true.
  
  --
  lucm, indeed.
8. Re:Apache Spark by angel'o'sphere · 2017-03-27 21:58 · Score: 1
  
  It is the other way around.
  Spark runs by default on Hadoop, it was designed on top of Hadoop.
  Perhaps it can run on other things, too. I never saw one doing it, though.
  What e.g. would be an example? Of such an "other file system"?
  
  --
  Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
I guess it depends on your definition of success by Anonymous Coward · 2017-03-25 10:46 · Score: 0

Like virtually every other technology that came before it, it is a stepping stone leading to what will come after. To say that it has "failed" is a gross overstatement and is frankly misleading. Since its inception, it has provided a platform to do things that no product before it could. It may not be the way to go going forward, but we probably wouldn't have any idea what the way forward might be if something like this hadn't been built.
I am sure that many of us feel like we could design our code better the second (or third) time around once we know the pitfalls and have had the opportunity to better understand the problem domain. No matter how you look at it, Hadoop has been an important and necessary step forward. Even if everyone stops using it tomorrow in favor of something better, the knowledge and experience we have gained from having it is irreplaceable. From my perspective, it seems like a huge success.
Hadoop... the Mecca of distributed computing by Anonymous Coward · 2017-03-25 10:51 · Score: 0

The world would be better off if we nuked Mecca and Hadoop... then build something useful on top of the heathen ashes.
Over-integrated software sucks. by Gravis+Zero · 2017-03-25 11:13 · Score: 1

When your software integration prevents your software from being used in conjunction with a variety of other platforms, you drastically reduce the number of users and in turn the number of developers that will work on it. As you integrate software more and more, you exponentially decrease the number of developers interested in making tools to make operation of your software easier. I'm not saying that making a system that works with everything will attract more developers but I am saying that making an overly integrated system will drive away many developers.

--
Anons need not reply. Questions end with a question mark.
Idiotic babble by lucm · 2017-03-25 11:40 · Score: 5, Insightful

People who bash Hadoop without understanding at a very minimum the moving parts have obviously no experience with it.
Hadoop is not one thing. It's three:
1) a distributed filesystem (HDFS)
2) a job scheduler (Yarn)
3) a distributed computing algorithm (MapReduce)
Many tools like Hbase or Accumulo *need* HDFS. That's a core component and there's no equivalent in Spark. Anyone saying HDFS is obsolete is a clueless idiot.
Anyways the Spark vs Hadoop narrative is bullshit. A serious Spark setup usually runs on top of a Hadoop cluster, and often you can't get away entirely from MapReduce (or its actual successor, Tez) because Spark runs in-memory and doesn't scale as much; for some workloads you need the read-crunch-save aspect of MapReduce because there's just too much data, and MapReduce is also more resilient as you don't lose as much when a node crashes during a job. Spark is more advanced and has actual analytics capabilities thanks to a powerful ML library (while Hadoop is just distributed computing), but it's not a case of either/or.
For instance a common approach is to use Hadoop jobs to trim down your data (via Pig or other blunt tool) to a point where you can run machine learning algorithms on Spark.
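A toy sketch of that two-stage pattern, assuming made-up records and a deliberately simple model: a streaming pass trims the raw data, and only the small remainder is loaded in memory for the analysis. Neither function is a Hadoop or Spark API; they just stand in for the batch stage and the in-memory stage.

```python
def trim(records, min_events=3):
    """Batch stage: stream over raw records, keep only what the model needs."""
    for rec in records:
        if rec["events"] >= min_events:          # drop low-signal rows
            yield rec["events"], rec["spend"]    # drop unused columns

def fit_line(points):
    """In-memory stage: ordinary least squares for y = a*x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x

# Hypothetical raw data; in practice this would be far too large for memory.
raw = [
    {"user": "a", "events": 1, "spend": 0.0},
    {"user": "b", "events": 3, "spend": 6.0},
    {"user": "c", "events": 5, "spend": 10.0},
    {"user": "d", "events": 7, "spend": 14.0},
]

small = list(trim(raw))            # only the trimmed set is materialized
slope, intercept = fit_line(small)

print(small, slope, intercept)
```

The first stage never holds more than one record at a time, so it can be as slow and disk-bound as it needs to be; the second stage is where the in-memory engine earns its keep.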
As for Kafka, it's just a fucking message queue. It's fast and very powerful, but comparing it to Hadoop is like saying you should use Linux instead of MySQL.
Whoever considers buying services from those Snowflake morons, run away.

--
lucm, indeed.
1. Re:Idiotic babble by Anonymous Coward · 2017-03-25 12:51 · Score: 0
  
  They're Precious Snowflakes! Actually Bob Muglia is an ex-Microsoft exec, I recognize the name from the Mini Microsoft blog.
2. Re:Idiotic babble by lucm · 2017-03-25 15:22 · Score: 1
  
  Please tell me they named the fantastic "Microsoft Bob" app after him.
  
  --
  lucm, indeed.
3. Re:Idiotic babble by Anonymous Coward · 2017-03-26 02:43 · Score: 0
  
  As a CEO I'd expect Bob to be in tune with the company's balance sheets, capabilities, and direction. I wouldn't expect him to be in tune with how to develop a solution using Hadoop. As a former executive of Microsoft, again he shouldn't be doing the programming.
  Just because someone is ex-Microsoft doesn't mean they have a no-expiration tech-skills card, or that they are elite skilled programmers. Getting hired at Microsoft is an exercise in memorizing the answers to the ~200 problem solving questions they ask, because they only ask the interesting problem solving questions, which get advertised in puzzle books published over the last 80 years. Keeping your job is more about politics and ensuring you have your "bottom 10% scapegoat" hand picked by your team which undermines their efforts to ensure the team has some stability.
4. Re:Idiotic babble by Anonymous Coward · 2017-03-26 03:15 · Score: 0
  
  This.
  If this guy's suggestion is "Spark is better," then he needs to understand that Spark needs HDFS.
  It's like saying "Microsoft's .NET framework has turned out to be a complete failure unless you know x86 assembly. Everyone has moved on to C#."
5. Re:Idiotic babble by Anonymous Coward · 2017-03-27 01:52 · Score: 0
  
  Thank you for this, you just saved me the time of writing pretty much exactly what you have, especially the bit about Kafka. I work with Kafka day in, day out, and we also have Spark jobs running on our Hadoop cluster (we've managed to avoid MapReduce/Tez up to now), and the bit about "use Kafka instead" hacked me off.
Does this mean... by __aaclcg7560 · 2017-03-25 11:47 · Score: 1

I need to scratch Hadoop off my list of technologies that I need to read about because everyone else in the office is reading a particular Big Data book?
Re: Do not blame the tool(s), blame the workman.. by Anonymous Coward · 2017-03-25 12:00 · Score: 0

The often overlooked addendum is that a good craftsman also knows both the value of a good tool and how to recognize a crappy tool. :)
Technical experts or competitors by Stonefish · 2017-03-25 12:55 · Score: 1

If you look at the list of technical experts:
1 Bob Muglia - Head of a startup competitor that is trying to market a data analytics product and steer some of that Hadoop investment into his fold. His sales model is "Look how easy we are". What you should be asking is how much does it cost and how do I get my data back.
2 Bobby Johnson - Cofounder of an analytics company trying to steer some of that Hadoop investment into his pocket.
This is a beat up driven by people who wished that they had a slice of the Hadoop pie. Hadoop is a complex system, however it scales to levels far beyond relational database technologies. Basically if you can do what you wanted with relational databases in a cost effective manner then you wouldn't or shouldn't have contemplated Hadoop in the first place. I'm not saying that the above products are good, I am saying that you have to take what they're saying with a grain (or bucket) of salt.
A number of existing Hadoop interfaces are batch based and exhibit a significant degree of latency, however other interfaces such as Impala are faster and bypass the map/reduce operation to achieve realtime results. When organisations get this wrong, it's a bit like your finance manager talking to your vehicle fleet manager (who recommended Ford vehicles) and based upon that conversation getting a great deal on 2000 tractors. Upon finding his staff are upset, he/she dumps American manufacturing and settles on BMW as their fleet vehicle of choice even though the cost is significantly higher. If you're a clever organisation you don't need to buy a Mercedes or BMW to achieve good results, however the owners of BMW or Mercedes would certainly encourage you to do so.
Personally I have watched a number of organisations deploy Hadoop clusters in a poor manner without understanding the system fundamentals and they'd love to blame the tool. I've also seen clever organisations save millions on their existing licences and meet their business or compliance objectives. It really comes down to looking at your organisation in a pragmatic manner and deciding are you collectively clever, or are you collectively stupid.
Based upon what I've seen and know.
1 Data Locality - Don't deploy Hadoop as virtuals which rely on an underlying SAN technology, if you're doing this you don't understand either the problem or the solution. The issue that you might solve with virtualisation is deployment and management, don't kill locality in doing so.
2 Competent staff - You are going to need to hire, retain, and train highly competent staff to ensure that the interfaces are simple (see point 4)
3 Understand your business drivers. Are you a knowledge based organisation? What insights do you expect will make a difference to your bottom line?
4 Who will get access to the information, and how?
5 Where is this information that you're going to throw into Hadoop anyway?
There are a number of managers who should be held accountable for poorly performing hadoop clusters and the above questions should be asked of all of them.
In some cases they would have been better off going with a simpler model initially such as Cassandra to meet their requirements, however most organisations overestimate their abilities.
How do I get "Joe User" to to access the data???? by FaytLeingod · 2017-03-25 13:30 · Score: 1

So I have a hadoop stack and a team of 4 data scientists. It takes them a month to develop an interface for new data... How do I get this dev time down? With new data-sets coming in on a weekly basis this team would need to grow 10X to keep up. In the meantime the average user needs to wait a month for access to new streams of info. That leaves our business a month behind on current trends that can definitely be predicted from the data streams. So what do I need to do?? Hire 36 new data scientists or change the stack I work on?

--
as it is eaten so it shall pass
Right on time. by Narcocide · 2017-03-25 13:34 · Score: 1

After only 5 minutes with Hadoop I could figure out it was nothing but a giant boondoggle. It only took to the end of that afternoon to be completely sure. Now, what... 3, 4 years later the rest of the industry is starting to figure it out, en masse? Seems about right.
1. Re:Right on time. by EmperorOfCanada · 2017-03-25 15:54 · Score: 1
  
  You should give MongoDB a try. It might impress you for a whole 6 minutes before you realize that the developers are a bunch of assholes.
2. Re:Right on time. by Anonymous Coward · 2017-03-25 17:21 · Score: 0
  
  https://www.youtube.com/watch?v=b2F-DItXtZs
Unreasonable expectations by manu0601 · 2017-03-25 13:37 · Score: 1

Perhaps the issue here is about unreasonable expectations.
No software, Hadoop or other, will magically extract meaning from a huge dump of data. You need work to do that, whatever the tool you use.
This rant reminds me of the people who purchased an enterprise service bus to interconnect IT applications, only to discover that instead of interconnecting applications, they now need to interconnect applications with the enterprise service bus. No problem solved for free.
Hadoop has failed says ex Microsoftie by najajomo · 2017-03-25 13:40 · Score: 1

'"I can't find a happy Hadoop customer. It's sort of as simple as that," says Bob Muglia, CEO of Snowflake Computing, which develops and runs a cloud-based relational data warehouse offering' slashdot

Here's Bob Muglia while at Microsoft describing how to 'add additional semantics' to Outlook, that is perform a detailed analysis of Lotus Notes and then clone it into Outlook.

"Notes/Domino R5 is very scary. We all saw the demo. Exchange has worked with teams around the company to put together a very detailed analysis of the R5 betas and the hints they expose on their future direction.", Eric Lockard

"we will probably need to add additional semantics to the Outlook/CDO object model to enable easy conversion of Notes apps onto our solution", Bob Muglia
Hadoop is easily put to shame by EmperorOfCanada · 2017-03-25 15:52 · Score: 1

I read a great article where one guy compared Hadoop to tools such as grep. In many fundamental ways he was able to use UNIX command line tools to wildly outperform Hadoop on what I would consider to be on the larger end of a typical company's data set.

To me Hadoop was the classic solution desperately in quest of a problem. The worst problem with that being so many people who jumped onto Hadoop and thought they were ass kickers for doing so.

The simple reality is that for most corporate datasets the tool of choice is a boring relational database and usually something like MySQL. The common capacity roadblocks aren't found within the tool but in the tool users.

But if you use a tool like Hadoop, or go NoSQL with a tool like MongoDB, you get to say (until people realize you are actually quite stupid) "my datastore is better than your datastore".
What about mongoDB? by manno · 2017-03-25 16:21 · Score: 1

Isn't mongoDB supposed to be similar to hadoop? Do the same pitfalls for hadoop apply to mongoDB?
1PB meh by lucm · 2017-03-25 16:38 · Score: 2

I think even a vanilla Postgresql will do 1-2 Petabytes.
The maximum column size for Postgres is 1GB. The maximum table size is 32TB. So let's say you have a 1PB data set, that means you need to shard your data in at least 25 tables of 250 columns.
Let's say you want to run a query vertically; you'll need to join those 25 tables, start the query and go on vacation for a month. That's how 1PB works on Postgres.
And don't you even dare do some leaf-level manipulations on that volume of data, like a lateral join - unless you enjoy a faint smell of burnt plastic in your data center. Meanwhile, that kind of thing runs smoothly on Hadoop, and if it's too slow you just add nodes.
I'm not saying RDBMS are dead - in my opinion the vast majority of use cases warrant for a traditional RDMBS or non-Hadoop NoSQL database. But when it comes to seriously big data, fuggedaboutit.
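
As a quick check on the arithmetic above, here is a minimal sketch that takes the quoted 32TB per-table limit at face value and assumes binary units (1PB = 1024TB). The helper name `min_tables` is made up for illustration. Under those assumptions the floor works out to 32 tables rather than 25, which does not change the point being made.

```python
# Assumption: binary units, so 1 PB = 1024 TB.
TB = 1024 ** 4
PB = 1024 ** 5

# Per-table limit as quoted in the comment above (taken at face value).
MAX_TABLE_BYTES = 32 * TB


def min_tables(dataset_bytes: int, max_table_bytes: int = MAX_TABLE_BYTES) -> int:
    """Smallest number of tables needed if every table is filled to its limit.

    Hypothetical helper: integer ceiling division, no float rounding.
    """
    return -(-dataset_bytes // max_table_bytes)


tables_for_1pb = min_tables(1 * PB)
print(f"Tables needed for 1 PB: {tables_for_1pb}")
```

This only counts tables; it says nothing about how long a query across them would take.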

--
lucm, indeed.
1. Re:1PB meh by TechyImmigrant · 2017-03-26 03:20 · Score: 1
  
  >in my opinion the vast majority of use cases warrant for a traditional RDMBS
  For my store PoS I'm working on dropping the Postgres backend and holding all the tables in memory. RAM grew faster than my tables did.
  
  --
  I should use this sig to advertise my book ISBN-13 : 978-1501515132.
2. Re:1PB meh by orlanz · 2017-03-26 05:45 · Score: 1
  
  s/column size/field size
  Not to sound pedantic on the terms you used, but want to make sure we don't confuse general readers.
  Normalizing:
  Initially, you should be Normalizing your data population. This is splitting it up into various tables. 25 isn't a lot of tables. I have seen DBs under 1TB with over 100 tables. How and what level you normalize is based on the type of data you have, their relationships, and most importantly, how you intend to utilize and extract the data to generate various kinds of information; now and hopefully for the foreseeable future. Ideally, you spend a lot of time designing and very little time implementing & redesigning (please ignore the real world for this post). Normalizing has nothing to do with the size or quantity of data; only quality and usage.
  Indexes:
  After you Normalize, you set up various indexes. The type you set up and on what columns is based on how much/often you intend to query the data, utilize it, and the information you want to generate from it. I normally set up indexes well after defining my queries. Joining properly indexed tables is not an expensive operation. Most data is either text or blobs. Indexes basically convert these to single number comparisons at the cost of an additional memory address lookup. Ignoring the write performance hit, this is a huge boost for joining and retrieving rows.
  Writing is also pretty good. You write to the tables you want to; no need to join the universe to update/add rows for one or two columns. Additionally, the only type of data that should be in your tables are the ones you need to manipulate, update, or look through. If you just want to get back a full book, picture, or LoC, offload that to a DFS and store the reference.
  With an unlimited number of tables, rows, and indexes at your disposal, Postgresql can easily store a Petabyte of data in the database.
  Sharding:
  But let's assume that the use case requires us to look through a Petabyte of data; all sitting in ONE table. You may want to see the title, and page number where your name appears in a few hundred Library of Congresses. In which case, you would shard the Postgresql DB. Sharding is for managing size or quantity of data.
  This is a 3rd party add-on; not part of the vanilla Postgresql. You don't _join_ tables that are sharded. You don't even define the shards themselves. They are managed by the tool and are hidden from the querying client. The client doesn't know which tables are sharded and which aren't. And yes, the connections, cursors, queries, subqueries, joins, and query components are load balanced across the shards. Yes, shards are fault tolerant. Yes, they are redundant. Yes, they allow fail/cascade over.
  In this situation, the capacity of the database is limited by the number of physical nodes (hardware) that you add to the collective (I am sure there is _some_ limit). From a client's view point, it looks like a single machine returning a single stream of data from a single instance of tables. Oracle has a similar solution. You might be thinking that I am almost describing Hadoop.
  But the true kings of the big data world are DB2 and TeraData. DB2 is designed from the ground up as a set of redundant components that unify to a singular system. TeraData is in terms of fully functional nodes that work together as a singular system. And both are decades old with their feature sets.
  I DO agree with you though. There are many use cases that warrant Hadoop over the above.
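
  The indexing point above can be seen in miniature with SQLite, used here only as a stand-in because it ships with Python. The table, column, and index names are hypothetical, and PostgreSQL's planner and its output differ. The sketch asks the planner how it would run the same lookup before and after an index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10_000)],
)

QUERY = "SELECT total FROM orders WHERE customer_id = ?"


def plan(sql, params):
    """Return the planner's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


plan_before = plan(QUERY, (42,))   # no index on customer_id yet
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(QUERY, (42,))    # same query, index now available

print("before:", plan_before)
print("after: ", plan_after)
```

  The first plan should report a scan of the whole table and the second a search via the index; the exact wording varies by SQLite version.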
3. Re:1PB meh by Anonymous Coward · 2017-03-26 08:51 · Score: 0
  
  So.. going to write your own reporting solution as well? Or does management not need any reports? Any guidelines from regulators viz. your storage and ability to audit the data? What about when the power fails or the server reboots unexpectedly? Any lost purchases where the money has been transferred already but the order didn't get through?
  All in all I think if the application was critical, I'd take out a license for an off-the-shelf database that does this already. Most modern RDBMS do have in-memory tables. SQL Server 2016 has them, and for small deployments it's pretty much free if you install SP1. It even runs on Linux nowadays, so no real issues there. Depending on what size data you have, of course.
4. Re: 1PB meh by Zero__Kelvin · 2017-03-26 10:38 · Score: 1
  
  Please tell me you are kidding, because if you are not you need to step away from the keyboard, and STAY away from the keyboard.
  
  --
  Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
5. Re:1PB meh by TechyImmigrant · 2017-03-26 12:48 · Score: 1
  
  >So.. going to write your own reporting solution as well?
  I already have. The reporting code doesn't change to accommodate this. The interface to the data model doesn't change. You have heard of data abstraction before haven't you? If you mess with stored procedures, you're still tied to the nipple of a DB vendor's tit. If the business logic code accesses the DB through your own procedures written in in-application code, then it's easy to adjust to a different storage model.
  >Or does management not need any reports?
  Well we own the company and so I write exactly the reports that we need. The accountant gets the transaction data in exactly the form they need to automatically reconcile with bank records. Apparently all her clients with the quickbooks tools can't do that.
  >Any guidelines from regulators viz. your storage and ability to audit the data?
  Nope. Personal data isn't in the PoS (point of sale) system. Financial transaction stuff (credit card info, etc.) is in PCI-DSS compliant kit. That doesn't change. The PoS is about an efficient storefront (checking out purchases) and inventory management.
  >Most modern RDBMS do have in-memory tables. SQL Server 2016 has them
  That's lovely, but I don't need an RDBMS. The properties of time series data (people buying things) allow a nice post hoc recombination at the back end and allow the front end (at checkouts) to carry on regardless if they lose contact with the back end, so no "The computers have gone down, we can't sell you this". It is all already in tables, then synced to the database on the server. The actual design change is to store transactions locally on disk - it's only a handful of bytes per transaction - in such a way that you can pull the plug and reboot the front end and have it carry on from the point it was switched off, including the contents of the current transaction that was being input. This is not a hard problem, it just requires knowledge of designing sound transactions (in the computer sense) and you can formally prove they are right with tools like Spin. I can't even start a Python database connector without a bunch of hacking on the platform.
  > It even runs on Linux nowadays
  So does the existing PoS software, which I wrote and the staff like because I iterated the design with them in the loop. It exists explicitly because off the shelf solutions were slow and cumbersome.
  >All in all I think if the application was critical, I'd take out a license for an off-the-shelf database that does this already.
  It is. It's the conduit through which we sell things. But it is based on a well polished bit of software that is essentially bug free and keeps running year after year. I wouldn't risk commercially licensed software. What happens when it breaks, or it requires an OS upgrade, or it just goes out of fashion?
  However the substance of my comment was to agree with the comment before, which suggests that large data tools are not applicable to most situations, and I have a situation that fits that hypothesis, where the transaction rate is 10s of thousands per year. There are no Petas involved. I wasn't requesting a lecture on how to run our business.
  Don't apply for the job of DB admin, we aren't hiring.
  
  --
  I should use this sig to advertise my book ISBN-13 : 978-1501515132.
6. Re: 1PB meh by TechyImmigrant · 2017-03-27 04:05 · Score: 1
  
  >Please tell me you are kidding, because if you are not you need to step away from the keyboard, and STAY away from the keyboard.
  No, not kidding. Now please explain how you know enough about our system to even be able to know if it's a good idea or not?
  The draw to tell people on the internet that they're doing it wrong seems to be very, very strong around here, even when armed with only a couple of paragraphs of information.
  
  --
  I should use this sig to advertise my book ISBN-13 : 978-1501515132.
7. Re:1PB meh by lucm · 2017-03-27 12:51 · Score: 1
  
  Just for the sake of discussion, if I was to design a POS today I think I'd consider the new in-memory engine in MongoDB. It's pretty cool; it writes nothing to disk (ever) but it can be part of a cluster where some other members use the normal engine. Each cluster supports up to 50 members, and the client can specify a preferred read node. So I would leave the write master in the backend and all the POS would have the inventory pretty much in real time on their local read node.
  Or since they bring up Kafka in the article, that could also be an interesting component. Unlike other queue engines, Kafka keeps track of how far each subscriber has gone in the sequence of events, so it's possible to have clients with wildly different needs connected to the same queue (such as a real-time ticker or a monthly dashboard).
  Well probably none of this applies in your case. I think I just have a thing for POS in general so I'm semi-jealous of your situation.
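
  The per-subscriber position idea can be sketched with a toy in-memory log. This is not the Kafka client API: `ToyLog` and its methods are invented for illustration, and real Kafka tracks committed offsets per consumer group and partition rather than per named reader in one list.

```python
from collections import defaultdict


class ToyLog:
    """In-memory stand-in for a single partition of an append-only event log."""

    def __init__(self):
        self._events = []
        # Reader name -> index of the next event that reader has not seen yet.
        self._offsets = defaultdict(int)

    def append(self, event):
        self._events.append(event)

    def poll(self, consumer, max_events=None):
        """Return unseen events for this consumer and advance only its offset."""
        start = self._offsets[consumer]
        end = len(self._events)
        if max_events is not None:
            end = min(start + max_events, end)
        batch = self._events[start:end]
        self._offsets[consumer] = end
        return batch


log = ToyLog()
for sale in ["sale-1", "sale-2", "sale-3"]:
    log.append(sale)

ticker_first = log.poll("ticker", max_events=1)   # fast reader, one event at a time
log.append("sale-4")
ticker_second = log.poll("ticker")                # resumes from its own position
dashboard_all = log.poll("dashboard")             # slow reader still sees everything
```

  Because the log itself is never consumed, a reader that polls once a month and one that polls every second can share it without interfering with each other.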
  
  --
  lucm, indeed.
8. Re:1PB meh by lucm · 2017-03-27 13:05 · Score: 1
  
  the true kings of the big data world are DB2 and TeraData.
  You had me until you mentioned DB2. I've never heard of a PB-level DB2 instance, I don't even think it's possible. Last time I checked a table couldn't go over 2TB and even BLOBs can't be bigger than 2GB.
  
  --
  lucm, indeed.
9. Re: 1PB meh by orlanz · 2017-03-27 15:29 · Score: 1
  
  I remember back in 2002 reading about a 2PB DB2 at some research university. My google-fu isn't good enough to hunt it down.
  But I hope the below provides some insight to where DB2 is at today. 500,000PB. I need to do more research because I am finding it hard to believe.
  http://it.toolbox.com/blogs/db...
  Anyway DB2 has always been more hardware limited than software. Every atom in DB2 can be plumped up in bits till it hits the hardware limits; multiplying its overall capacity. But too many bits and you are just wasting space.
10. Re: 1PB meh by lucm · 2017-03-27 17:22 · Score: 1
  
  Dang.
  https://www.ibm.com/developerw...
  Of course as with many IBM products, the miraculous setups are always in IBM labs.
  
  --
  lucm, indeed.
11. Re:1PB meh by TechyImmigrant · 2017-03-28 03:34 · Score: 1
  
  > So I would leave the write master in the backend and all the POS would have the inventory pretty much in real time on their local read node.
  That's pretty much it. So the front end is instant response for the user. Events are timestamped. The back end recombines the data in order when they are attached (normally they are attached all the time) and the inventories are kept in sync. The wrinkle is that the front end stores running state to local disk as pickled data so it can run solo (detached from the back end) across power cycles. Works great for trade shows - you can run it on a laptop.
  I can bring down the back end, upgrade the OS, have a coffee, watch some Netflix and bring it back up again and no one notices.
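
  A minimal sketch of the pattern described above, under stated assumptions: the record layout (a length prefix followed by a pickled dict), the file name, and the function names are all hypothetical and are not taken from the actual PoS code. It appends each timestamped event to a local journal, survives a torn final write, and lets the back end recombine streams in timestamp order.

```python
import os
import pickle
import struct
import tempfile

HEADER = struct.Struct("<I")  # 4-byte little-endian length prefix per record


def append_event(path, event):
    """Append one pickled event and force it to disk before returning."""
    payload = pickle.dumps(event)
    with open(path, "ab") as f:
        f.write(HEADER.pack(len(payload)) + payload)
        f.flush()
        os.fsync(f.fileno())


def replay(path):
    """Read back every complete record; stop quietly at a torn final write."""
    events = []
    if not os.path.exists(path):
        return events
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                break
            (length,) = HEADER.unpack(header)
            payload = f.read(length)
            if len(payload) < length:
                break  # power was cut mid-write; ignore the partial record
            # Only unpickle journals this program wrote itself.
            events.append(pickle.loads(payload))
    return events


def recombine(*streams):
    """Back-end view: merge events from several tills in timestamp order."""
    return sorted((e for s in streams for e in s), key=lambda e: e["ts"])


journal = os.path.join(tempfile.mkdtemp(), "till.journal")
append_event(journal, {"ts": 1.0, "sku": "A100", "qty": 2})
append_event(journal, {"ts": 2.0, "sku": "B200", "qty": 1})

# Simulate pulling the plug partway through a third write.
with open(journal, "ab") as f:
    f.write(HEADER.pack(500) + b"partial")

recovered = replay(journal)
```

  After the simulated power cut, `recovered` holds the two complete events and the partial third record is dropped.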
  
  --
  I should use this sig to advertise my book ISBN-13 : 978-1501515132.
Hadoop isnt just mapreduce and pig by vile8 · 2017-03-25 17:13 · Score: 5, Insightful

Hadoop starts with a vastly distributable and resilient file system (HDFS) which enables, as a base, technologies that include things like HBase (columnar stores), Impala (Parquet example), Titan (graphs), Spark (lord, everything... it's the emacs of data frameworks), or the latest projects which completely change the paradigm of how you are looking at data at unbelievable speeds. (Who the hell runs MapReduce and expects real time performance?... It's a full disk scan across distributed stores... and fairly sane from that perspective.)

If you don't have problems that relate to these paradigms... don't use it. Seriously. Just because it's new doesn't mean it fits every situation. It's not mysql/mariadb/postgresql... if you think it's even remotely close to that simple you should run for the hills. If you have a significantly large (not talking hundreds of megs or even a couple gigs... you need to be thinking in billions of rows here) configuration management problem then it's a great base to layer other projects on top of to solve your problem.

Also, I found a large number of problems to solve using timestamped individual data cells that CANNOT be done using traditional SQL methodologies. Lexicographic configuration models, analytics (obv), massive backup history, just to name a few. If the management and installation of the cluster are scary... well... not everything in CS is easy... especially when it gets to handling the world's largest datasets... so, this probably isn't really your problem... call the sysadmins and ask them (politely) to help. Believe it or not the main companies have wizards which can help get you going across clusters... and even manage them visually (not that I ever would... UIs are for people who can't type).

When people (or just this CEO) say it doesn't deliver on its promise, you are likely trying to solve a problem wholly inappropriately. I have personally used it to solve problems like making real time recommendations in under 200ms across several gigs of personal data daily (totalling easily into terabytes). (No, you don't use mapreduce... think harder... but you DO use HDFS.)

So what promise were you told?

Other than real time (as illustrated above), you can do archiving, ETL of course, and things like enabling SQL lookups, or RRDs... using a number of toolkits or Spark. Seriously, this is one of the best things since sliced bread when it comes to processing and managing real big data problems. Check out the Lambda processing model when you get a chance... you might be impressed, or be utterly confused. Lambda (and not talking about programming Lambda, nor AWS Lambda) applies multiple Apache technologies to solve historical with real time problems in a sane manner. Also, managing massively distributed backups is much simpler with HDFS.
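
For readers who have not met the Lambda model mentioned above, here is a toy sketch of its shape: a batch view recomputed over history, a speed view over whatever arrived since the last batch run, and a query that merges the two. The event fields, the cutoff, and the function names are assumptions for illustration; a real deployment would put distributed systems behind each function rather than in-memory counters.

```python
from collections import Counter

# Hypothetical event stream: timestamp plus the item that was viewed.
events = [
    {"ts": 1, "item": "widget"},
    {"ts": 2, "item": "gadget"},
    {"ts": 3, "item": "widget"},
    {"ts": 4, "item": "widget"},
]

BATCH_CUTOFF = 3  # the last batch run covered everything before ts = 3


def batch_view(all_events, cutoff):
    """Batch layer: slow, complete recomputation over historical events."""
    return Counter(e["item"] for e in all_events if e["ts"] < cutoff)


def speed_view(all_events, cutoff):
    """Speed layer: cheap counts for events the batch run has not seen yet."""
    return Counter(e["item"] for e in all_events if e["ts"] >= cutoff)


def query(item, batch, speed):
    """Serving layer: answer by merging the historical and recent views."""
    return batch[item] + speed[item]


batch = batch_view(events, BATCH_CUTOFF)
speed = speed_view(events, BATCH_CUTOFF)
widget_views = query("widget", batch, speed)
```

The design choice is that the batch view can always be rebuilt from raw data, so mistakes in the fast path are overwritten on the next batch run.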

Honestly, outside of Teradata implementations, there is nowhere in the world you can get this kind of data resiliency, efficiency, or management. Granted it doesn't have the 20+ years of chops in HUGE datasets Teradata does, nor the support... but it's open source and won't cost you much to try.

Long long story short. What the hell! I feel like programmers today are constantly ... whining... about complexity. It seems like a trend to say "well I couldn't use it for my project so that means no one really does.. they are just trying to look cool." To which I would have to reply... you're an idiot. Yes, it's complex... if you understand storage / manipulation / migration / replication / indexing... you should be impressed to say the very very least. If you don't, please go read the changelog, README, and any note based install guides, or do some research on the commercial companies using this technology successfully... instead of making up figures and claiming it's gospel.

Any commercial solution will cost you ... well... millions just to get started solving the problems Hadoop nailed out of the gate.

If Hadoop seems large and frightening just wait until y
Re:How do I get "Joe User" to to access the data?? by Anonymous Coward · 2017-03-25 17:18 · Score: 1

Hire someone competent with actual software development skills? Most data scientists I've met were glorified or relabeled data analysts. Some minor stats background and maybe they can hack together a script. That's fine and really valuable for analyzing large datasets and formatting the results into pretty figures for decision-makers to look at.
If your data is too complex for their basic ETL skills and it's taking a month to build interfaces, hire one competent and expensive developer to build those interfaces. You may call the role data engineer, data science engineer, software developer/engineer, or keep it as data scientist, but essentially get someone who actually understands software and database engineering, creating schemas, cleaning up data and knowing when all you have is shit and to move on. It would probably help if they know some stats or have an interdisciplinary background so they can talk to the data scientists and understand their needs. With one person focused on rapidly developing the interfaces the other data scientists can focus on analysis. Holy shit four people sitting on data for a month. It just screams that no one knows what they're doing. This ain't grad school anymore; you have to earn your paycheck.
Some of the most competent data scientists I know personally have come from bioinformatics/computational biology backgrounds. Usually were hacking as teenagers and they should have some published software (or academic papers resulting from their programming) that they can talk intelligently about to show their skills. Some of bioinformatics is close enough to big data that with a Master's degree they should have at least been exposed to the ins and outs of handling varied data. If you're hiring remote (U.S.) I can help you out. Or for my hefty consulting fee I can work on-site when you need new interfaces built. Any kind of positive reply and I can figure out how to get in touch.
Pop! by Anonymous Coward · 2017-03-25 17:51 · Score: 0

There goes the big data bubble.
This is a shocker!!! by Anonymous Coward · 2017-03-25 21:10 · Score: 0

A real shocker, to make something useful out of hadoop you must "understand" MapReduce and you must have a background in statistics and data analysis theory...this is a show stopper!
Idiot can't use computer, news at 11 by Anonymous Coward · 2017-03-25 22:19 · Score: 0

Idiot gets embezzled by newfangled "open source" marketing language and must realize that, after all, Computing Is Hard, as every trade out there is.
This is just a variant of Dilbert's "teach me how to be an engineer, even if it takes all day".
Unhappy customers: caveat emptor by rcongiu · 2017-03-25 22:57 · Score: 1

I think many of the 'unhappy customers' the article refers to are companies where somebody who didn't quite understand the technology pushed hadoop as a replacement for (expensive) proprietary software like Oracle, to be then sorely disappointed especially on interactive performance.
I've been working with hadoop since 2007 and have successfully deployed it for multiple clients. First of all, you really want to see if the use case makes sense; sometimes you're just better off with an RDBMS like MySQL. Some companies just jump on the 'NoSQL' bandwagon to find out almost immediately that they, oh well, actually do need SQL.
Hadoop is based on some Google technologies (GFS, etc) that were designed to process immutable, append only crawler logs, for the search engine. So anything that requires record-level CRUD is off the table in vanilla hadoop. Other systems (like kudu or hbase) try to address that, but even there, it depends on your use case. These technologies are also not that easy to operate, especially when you stack them on top of each other. That's why there's a flurry of companies (like - guess what - Snowflake, whose CEO is quoted in the article) that offer Data Warehouse as a Service on top of hadoop.
I think the article is meant to scare people into buying their service...and it may be actually worth it if you don't have the skills/manpower to run the hadoop stack yourself. If only a few people access data, it's probably worthwhile, but in larger organizations, hadoop is merely the engine that does the ETL heavy lifting but you have other systems where the data is queried. Note that for querying, there are several patterns: looking at long-term trends, looking at the most recent data, ad hoc querying, machine learning... each one may need its own specialized query engine.
Finally, the comparison with kafka is just insane and sounds like another sales pitch. Kafka is a whole different beast, it does streaming, but querying??
Big Data is a nice word. by Qbertino · 2017-03-26 01:41 · Score: 1

Big Data is a nice word. The fact that the concept of it is useful for roughly 5 ginormous global internet companies and beyond pointless for everybody else is probably something that 99.9% of all people making the final decisions on which technology stack is used have zero clue about. They haven't got the faintest idea what big data actually means and what problems solutions like Hadoop actually address.
I'd bet money that 99 of 100 scenarios in which Hadoop is used would run even better with some unspectacular type-a round-robin master-slave loadbalanced MySQL setup or something. ... Of course then you couldn't use that nice word "Big Data".

--
We suffer more in our imagination than in reality. - Seneca
1. Re:Big Data is a nice word. by Anonymous Coward · 2017-03-26 07:59 · Score: 0
  
  The number is closer to 5000, but otherwise you are right.
Re: Big Data is a nice wors. by Qbertino · 2017-03-26 01:42 · Score: 1

Sorry for the typos - using a tablet just now. :-)

--
We suffer more in our imagination than in reality. - Seneca
Re: Do not blame the tool(s), blame the workman.. by Anonymous Coward · 2017-03-26 05:04 · Score: 0

Or, a lumberjack who has been faced with using a chainsaw when all he has known his entire life has been axes and saws...they understand the concept of how it could improve their ability to work, but anything beyond "put fuel and oil here, and change the chain when it gets dull" is lost on them unless they can also grok the internal combustion engine...
Bandwagons by ilsaloving · 2017-03-26 05:57 · Score: 1

In other news, bandwagon jumpers are shocked to discover that the cool new doohickey they read about in Tech Fashion Trends Magazine doesn't actually magically fix every problem you throw at it.
Computer technology has been around and commonplace for several decades now. It isn't new that this stuff is complicated, and getting even more complicated with each passing year.
And yet while a client would never demand a builder use this specific kind of scaffolding and cement to build with because they read about how cool it was in some magazine, for some inexplicable reason people DO think that this is an entirely acceptable thing to do when it comes to software.
But that's ok. Customers who do this are a fantastic boon to the consulting industry. First for the slimy consultants (usually offshored to keep costs "low") that sell customers exactly what they want, for cheap, and then for the much more expensive consultants later on who are tasked with trying to recover the steaming crater of a system the previous consultants left behind.
Re: How do I get "Joe User" to to access the data? by Anonymous Coward · 2017-03-26 06:07 · Score: 0

"You need competent high class programmers....luckily I am the Hadoop consultant you are lookin for and for a small fee I will..."
The first? by 0dugo0 · 2017-03-26 06:57 · Score: 1

Imagine a Beowulf cluster of these!
Re:How do I get "Joe User" to to access the data?? by Anonymous Coward · 2017-03-26 07:56 · Score: 0

How did you end up in that misery? I know the answer to your questions. How deep are your pockets?
MPI by Anonymous Coward · 2017-03-26 19:10 · Score: 0

Excuse me, MPI has been the "first widely-adopted open source distributed computing platform", and it has succeeded.
There there by lucm · 2017-03-27 11:57 · Score: 1

You're a stupid motherfucker. You have nothing useful to say. You contribute nothing useful to this site or to society [...] (etc)
I was unable to read the rest of your comment because I have a policy of stopping when it becomes obvious that the other person is just throwing a tantrum.
If you disagree with the fact that Wikipedia clearly indicates that Spark is NOT based on Hadoop, support your claim with a link or citation. Otherwise there is no need to get your panties in a bunch, you clearly don't have enough trolling skills to make even a drunk Mike Tyson circa 1997 angry.

--
lucm, indeed.
Are these people really that stupid? by ilsaloving · 2017-03-29 01:04 · Score: 1

Are these people for real?
The whole article screams, "I don't know what I'm doing but I love jumping on bandwagons."
Apache Hadoop and Kafka are two completely different tools, intended for two COMPLETELY different workloads.
So if you used Hadoop when you should have used Kafka, that doesn't mean Hadoop is bad. It means you haven't done your job and properly vetted the tools available for suitability.