Slashdot Mirror


Is Big Data Leaving Hadoop Behind?

knightsirius writes: Big Data was seen as one of the next big drivers of the computing economy, and Hadoop was seen as a key component of the plans. However, Hadoop has had a less than stellar six months, beginning with the lackluster Hortonworks IPO last December and the security concerns raised by some analysts. Another survey records only a quarter of big data decision makers actively considering Hadoop. With rival Apache Spark on the rise, is Hadoop being bypassed in big data solutions?

100 comments

  1. Nope. Not happening. by Art+Popp · · Score: 5, Informative

    FTA: ...biggest problem is that people allegedly still can’t use Hadoop... Hadoop is still too expensive for firms...

    Hadoop is an ecosystem with lots of moving parts. Those are real problems above, but Spark (Particle) is not a standalone replacement for an ecosystem the size of Hadoop. Moreover, it has no problem integrating with Yarn on Hadoop where you can run Hbase, Cassandra, MongoDB, Rainstor, Flume, Storm, R, Mahout and plenty of other Yarn-compatible goodies.

    It's also worth noting that Hortonworks and Cloudera may not be "taking off as hoped" because the branded big-iron players are finally in the ring. They hide the (rather hideous) complexity and integrate well with any existing systems you have with those vendors. Teradata for instance has a Hadoop/Aster integration that's impressive and turn key. They bought Rainstor, and will soon have it integrated, and that's Spark-fast and hassle free. IBM's BigInsights is very impressive if you have the means.

    So, no, Hadoop is in no danger of being replaced. The value proposition that my $4.2M cluster outperformed two $6M "big name" vendor supported appliances is undeniable, but only that stark when your $'s have an M suffix. What will probably occur though is that we'll end up replacing every component in Hadoop with a faster one, and MapReduce will become a memory as things like Spark and Hive/Tez move away from that methodology.
         

  2. Nope. by Anonymous Coward · · Score: 0

    Nope.

  3. Rival? by Culture20 · · Score: 2

    I thought Spark worked from within Hadoop. Is that like using emacs to run vi?

    1. Re:Rival? by Anonymous Coward · · Score: 5, Informative

      They need to refer to the pieces of Hadoop. HDFS is the storage piece and many things can interface to it; it isn't great but is often good enough, especially if you just have a couple of local disks per node. YARN is the scheduler piece; it is mostly awful performance-wise but is fairly easy to use... long run it'll lose to something like Mesos, I think. MR is the map reduce piece that everyone thinks of when you say Hadoop. Almost everything will run quicker in Spark (still using a map/reduce methodology) than in Hadoop MR.

      As a side note, I don't know anyone who still writes MR jobs directly, they are all doing pig or hiveql.

    2. Re:Rival? by careysub · · Score: 3, Interesting

      They need to refer to the pieces of Hadoop. HDFS is the storage piece and many things can interface to it; it isn't great but is often good enough, especially if you just have a couple of local disks per node. YARN is the scheduler piece; it is mostly awful performance-wise but is fairly easy to use... long run it'll lose to something like Mesos, I think.

      That's a good call. With Cloudera and HortonWorks both adding new components to the Hadoop stack, it has exploded in the number of components in the last year or two, and that can be a bad thing. The complexity of the whole ecosystem is getting horrendous: in the last year a typical configuration file has doubled from 250 or so to 500 configuration items, which are almost all undocumented (unless you read the code, which scarcely qualifies as "documented"). For a practical deployment you are pretty much forced to use a commercial stack to get something up and running in a manageable fashion. And then there is the fact that the HDFS foundation is showing its age.

      MR is the map reduce piece that everyone thinks of when you say hadoop. Almost everything will run quicker in spark(still using a map/reduce methodology) than hadoop MR.

      Spark on Mesos is looking mighty awesome.

      As a side note, I don't know anyone who still writes MR jobs directly, they are all doing pig or hiveql.

      MapReduce is still viable for stable production jobs, but not in a dynamic requirements environment.

      Although HiveQL is alive and kicking, the complete replacement of Hive Server with Hive Server 2, while possibly an improvement in usability overall (I am not convinced), trashes your skill investment in the (now) obsolete Hive stack component. Maybe I am just grousing, but I start having reservations about technology planning in the data center when a key stack component changes so much in a relatively short period of time.

      --
      Starships were meant to fly, Hands up and touch the sky - Nicky Minaj
    3. Re:Rival? by Daniel+Hoffmann · · Score: 1

      You are absolutely right about the complexity of the ecosystem, but in my experience every Java-based platform eventually evolves such complexity (it is like an XML fetish).

  4. MJ by Anonymous Coward · · Score: 0

    Don't tell Mary Jo Foley.

  5. Re:Nope. Not happening. by TechyImmigrant · · Score: 1, Funny

    Yarn on Hadoop where you can run Hbase, Cassandra, MongoDB, Rainstor, Flume, Storm, R, Mahout and plenty of other Yarn-compatible goodies.

    It's also worth noting that Hortonworks and Cloudera

    I know R. My wife has a Yarn store. WTF are those other things?

    --
    I should use this sig to advertise my book ISBN-13 : 978-1501515132.
  6. Re:Nope. Not happening. by Anonymous Coward · · Score: 0

    I tend to agree. As a storage admin for a Multi-Hospital organization, using anything open source is not really an option if we want to keep the HIPPA-potamus away. However, we have the means to buy and use name brand storage appliances (currently NetApp and IBM SVC), which come with the vendor support large organizations need.

    Things like Hadoop, I think, are going to end up in the niche market of organizations that have the expertise to manage something as complex as Hadoop and have the need for Big Data and performance, but don't have the capital to buy a brand name array.

  7. Who is this question for? by sanosuke001 · · Score: 1

    Is this a question for Hadoop employees or slashdot? If there's something better, why does it matter to anyone other than the company developing Hadoop if it's relevant?

    --
    -SaNo
    1. Re:Who is this question for? by jbolden · · Score: 1

      Hadoop is open source. The companies building it are LinkedIn, Yahoo, Facebook and then the Hadoop vendors: Hortonworks (tightly tied to Microsoft), IBM, Cloudera (enterprise support vendor)...

    2. Re:Who is this question for? by Ksevio · · Score: 1

      Hadoop is open source software so it's more significant if it's in decline than a closed commercial alternative.

    3. Re:Who is this question for? by Anonymous Coward · · Score: 0

      Hortonworks is tightly tied to Microsoft, and Cloudera is the enterprise support vendor? Where the hell did you get that from? Would you like some more kool-aid?

  8. Relevance of Security by Luthair · · Score: 4, Funny

    Is security really that big of a deal? Isn't the intent to run it on a private network to crunch numbers behind the scene?

    We don't ask about the susceptibility of safety deposit boxes to crowbars and dynamite, they're inside a vault.

    1. Re:Relevance of Security by Hognoxious · · Score: 1
      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    2. Re:Relevance of Security by Anonymous Coward · · Score: 0

      The primary source of security threats is inside one's network. Either someone gains access to an employee's credentials or the employees themselves are hacking. So yes, security is that big of a deal. Just look at the forensics performed after any large hack: all the damage is done by people who have found some way inside the network. That is why all internal systems need to be secured: the perimeter defenses will be, and probably already have been, breached.

  9. Yeah - Hadoop is dead by Anonymous Coward · · Score: 0

    At least at the startup I work at. It's too slow.

  10. Re:Nope. Not happening. by QRDeNameland · · Score: 0

    I've heard of MongoDB. It's Web Scale!!

    --
    Momentarily, the need for the construction of new light will no longer exist.
  11. Big Data is leaving Java behind by Anonymous Coward · · Score: 0

    Not just Hadoop. The whole stack is too bloated and difficult to use; it was never the best language for anything but people tried to use it for everything.

  12. Re:Nope. Not happening. by peragrin · · Score: 0

    Funny I was thinking they were all children's books. cloudera, horton hears a works, etc.

    --
    i thought once I was found, but it was only a dream.
  13. Re:Web scale? by Anonymous Coward · · Score: 0

    ahahhha mongodb is shit

  14. Re:Nope. Not happening. by Hognoxious · · Score: 4, Insightful

    As a storage admin for a Multi-Hospital organization using anything open source is not really an option if we want to keep the HIPPA-potamus away.

    Makes sense. If they can see your source (which you have to show them, or it wouldn't be open) then it makes absolute sense they can totally see your data.

    You weren't previously the city manager of Tuttle, Oklahoma, were you?

    --
    Confucius say, "Find worm in apple - bad. Find half a worm - worse."
  15. Re:Nope. Not happening. by Anonymous Coward · · Score: 0

    Well for one, the comment doesn't really apply much to anything Hadoop-like. This actually is interesting in and of itself. In all this tech reporting the boring reality of 90+% of IT is generally ignored. This article is a great example: Hadoop was declared by many to be the one true solution for storing data as it went from 0.1% to 0.2% share. When that pace levels out at no more than a couple percent at best, articles declare it dead. It is obscure in real world terms, but much more common than when the news unambiguously sang its praises. Of course IT is not alone; China gets blasted for slowing to like an 8% annual growth rate, which is still quite strong positive growth, just not the unsustainable growth previously seen.

    For another, the ultimate (valid) point is that a lot of IT orgs won't use 'Hadoop' instead using things like 'BigInsights'. Sort of how they don't use 'Linux', they use 'Red Hat'. It's less about the source and more about covering their ass.

  16. This, the day after Hortonworks financials? by Anonymous Coward · · Score: 0

    I think Slashdot got trolled by a Wall Street troll. Or a competitor to Hortonworks.

  17. Re:Nope. Not happening. by ampsicora · · Score: 1

    I agree that the problem is that most companies don't know how to run it and it's left to bigger organizations that 1) have the expertise in house and 2) actually need the added complexity.
    Understanding which pieces of the ecosystem you need, and how to deploy and run them in a production environment, can be daunting, not to mention all the different possibilities of which cloud provider to use, which services, etc.

    Cloudera and Hortonworks are capitalizing on it, basically helping sort out this complexity with consultants and training, but since this business model scales with the number of employees, they are not scaling up that fast, also because there are not enough skilled engineers in the field. I personally interviewed several self-proclaimed 'hadoop engineers' who had worked on Hadoop for a year or more and yet didn't know what happens in the shuffle phase.

    Another distinction to make is that Hadoop now has three major components: HDFS, YARN and map/reduce. Maybe Map/Reduce is losing its relevance as a Hadoop component as Tez/Shark/Flink advance, but it should be noted that under the hood they use basically the same abstraction for parallelization; they just make better use of resources (especially memory), and they are not replacing HDFS or YARN. Mesos could be used as an alternative to YARN, but I don't see any competitor for HDFS yet.

    So, I would not say that Hadoop is being replaced, but more extended and, to use a botanical analogy, besides growing, it's also being grafted on (Flink, Spark, Cassandra, etc...).
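
For readers unsure what the shuffle phase mentioned above does, here is a minimal single-process sketch of the map / shuffle / reduce abstraction in plain Python. It is an illustration only, not Hadoop's API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for this example, and a real cluster would partition the shuffle across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (key, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, which is what sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["hadoop spark hadoop", "spark yarn"]))
```

Engines such as Spark or Tez keep this same map-then-group-then-reduce shape; the difference described above is mostly where the intermediate data lives (memory versus disk).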

  18. What Fucking Decade Is It? by sexconker · · Score: 1, Interesting

    Did I trip into a time warp and come out a decade in the past?
    Who the fuck is actually talking about hadoop or map reduce in 2015? The same retards that were creaming their little cunts about it in 2005?

    Even when you ignore the joke that is Java, hadoop is unwieldy, unreliable shit if you actually care about storing and retrieving correct, synchronized data.
    If you're fine with throwing all of your data in a pot and getting some sort of result that looks mostly correct, then knock yourself out and use hadoop.

    If your data needs to be correct, define it and its relationships then use SQL. You will have to pay someone decent money to do this correctly.

    1. Re:What Fucking Decade Is It? by Tablizer · · Score: 2

      If your data needs to be correct, define it and its relationships then use SQL. You will have to pay someone decent money to do this correctly.

      PHB's have to learn the hard way. They want it cheap, big, and now. Security & reliability issues are something they try to blame on somebody else using their well-honed spin skills.

    2. Re:What Fucking Decade Is It? by jbolden · · Score: 3, Insightful

      Hadoop didn't exist in 2005. The 1.0 release was December 2011; the earliest versions I know of were floating around in 2007.

      As for using SQL, Hadoop supports SQL (mostly). The problem Hadoop deals with is that the data sets are too big for RDBMS engines to handle. It has nothing to do with developer skill; it has to do with the type of database engine and how data is being handled.

    3. Re:What Fucking Decade Is It? by ToasterMonkey · · Score: 1

      Did I trip into a time warp and come out a decade in the past?
      Who the fuck is actually talking about hadoop or map reduce in 2015? The same retards that were creaming their little cunts about it in 2005?

      Even when you ignore the joke that is Java, hadoop is unwieldy, unreliable shit if you actually care about storing and retrieving correct, synchronized data.
      If you're fine with throwing all of your data in a pot and getting some sort of result that looks mostly correct, then knock yourself out and use hadoop.

      If your data needs to be correct, define it and its relationships then use SQL. You will have to pay someone decent money to do this correctly.

      None of these complaints seem to keep people from using Splunk.... unstructured data soup isn't going anywhere at any scale, we'll just call it different things.
      I can't even fathom a world where all the data we analyze in Splunk could have been fed into Oracle and turned into usable reports. All of our users would have to be Oracle DBAs.

    4. Re:What Fucking Decade Is It? by Anonymous Coward · · Score: 0

      Wait, it's 2015 and hadoop map reduce is a data base? It's a fucking parallel processing framework, not a database. Stop trying to make it a database. It doesn't work like that.

    5. Re:What Fucking Decade Is It? by Anonymous Coward · · Score: 2, Funny

      Hadoop didn't exist in 2005.

      Unless you work in recruitment.

    6. Re:What Fucking Decade Is It? by Anonymous Coward · · Score: 0

      Repeat after me: Hadoop is not a database.

    7. Re: What Fucking Decade Is It? by Anonymous Coward · · Score: 1

      +2 Interesting? More like -5 Ignorant.

      RDBMSs are not a workable solution for the kinds of problems Big Data is trying to solve. You need something else. There is no such thing as a "simple" Big Data solution.

      The Java-based Big Data solutions are really the only ones that exist in the world, other than those that were developed in-house years ago by companies who had to deal with huge-scale problems in the past.

      So if your solution for Big Data is Oracle (RDBMS), you don't belong in this conversation.

    8. Re:What Fucking Decade Is It? by Anonymous Coward · · Score: 0

      A modern MS SQL server can handle around half a petabyte per database (surely one of the Linux flavors is better, no?). Under what circumstances is that not enough per database? I guess if you are the NSA and recording every phone call, text message, email, and sneeze on the planet, perhaps. That said, most businesses will never need a DB that size, and by the time they do they will have upgraded to an RDBMS that can handle it.

    9. Re:What Fucking Decade Is It? by jbolden · · Score: 1

      SQL server is based around the idea of small amounts of changes with data retention being long.

      Assume a system throwing off 3 MB/s of data, which many companies can have if they are aggregating simple stuff like all customers on the websites and sequencing page-by-page access to look for correlations. There are 28,500 seconds in a workday (more if you have multiple locations). That's 85.5 petabytes of data per day. You need to aggregate this data fast. SQL Server's engine isn't designed for that.

      Or for example SQL Server doesn't handle queries against unstructured information. Imagine that each record has a field of text and you want to do joins based on fuzzy matching between these text fields. Even with a few gigs of data SQL Server will die.

    10. Re:What Fucking Decade Is It? by sexconker · · Score: 1

      Hadoop was created in 2005 and named after a toy elephant. It was an open source implementation of some shit Google wrote some papers on.
      The "Apache Hadoop" branded package hit RTM in 2011. Apache only got involved because of all the retards mindlessly jumping onto it. Those retards jumped onto it because they were told it was based on Google's work.

      As for datasets being too big for RDBMS engines to handle, WTF are you talking about? MS SQL can handle all the data you throw at it and has complete clustering capabilities. If your database is designed properly there's no problem. Oracle is the same.

      https://msdn.microsoft.com/en-...

      For SQL Server 2014, individual files are limited to a mere 16 TB, and you can have only 32,767 filegroups per database.
      The maximum size of a single database is 524,272 TB, and you can have 32,767 databases per instance, and you can have 50 instances per server.
      The number of rows per table is limited only by available storage. The max size of any in-page row is 8,060 bytes, with anything larger being moved off-row and the row maintaining a pointer.

      Please tell me what sort of dataset exceeds the 800 ZB maximum you get from SQL Server 2014.
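
As a quick check on the capacity figure, the limits quoted above can simply be multiplied out. The sketch below uses only the three numbers given in the comment and assumes binary units (1 ZB = 1024^3 TB); the variable names are illustrative. Under those assumptions the product comes to just under 800 ZB, which is about 0.78 YB.

```python
# Limits as quoted in the comment above (SQL Server 2014).
TB_PER_DATABASE = 524_272
DATABASES_PER_INSTANCE = 32_767
INSTANCES_PER_SERVER = 50

# Theoretical ceiling for one server, in terabytes.
total_tb = TB_PER_DATABASE * DATABASES_PER_INSTANCE * INSTANCES_PER_SERVER

# Convert using binary prefixes: TB -> PB -> EB -> ZB -> YB, 1024 per step.
total_zb = total_tb / 1024**3
total_yb = total_tb / 1024**4

print(f"{total_tb:,} TB is about {total_zb:.1f} ZB or {total_yb:.2f} YB")
```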

    11. Re:What Fucking Decade Is It? by sexconker · · Score: 1

      SQL server is based around the idea of small amounts of changes with data retention being long.

      Assume a system throwing off 3 MB/s of data, which many companies can have if they are aggregating simple stuff like all customers on the websites and sequencing page-by-page access to look for correlations. There are 28,500 seconds in a workday (more if you have multiple locations). That's 85.5 petabytes of data per day. You need to aggregate this data fast. SQL Server's engine isn't designed for that.

      Or for example SQL Server doesn't handle queries against unstructured information. Imagine that each record has a field of text and you want to do joins based on fuzzy matching between these text fields. Even with a few gigs of data SQL Server will die.

      Check your math on that, please. 8*3600*3 = 84.375 GB, not 85.5 PB.

      Further, 85.5 PB per day is no problem if you can manage to write all of that out to disk (1037 GB per second non stop lol).
      SQL Server 2014 handles .5 EB per database, so it's not a problem. And if you're tracking 3 MB per second per user, you're tracking bots, not users - it would actually be cheaper to just log all packets at that point.

      As for making sense of the data, SQL can handle all of it if you design your database sanely. Even if you don't, you can slap a full-text index on it (including "natural language" style indexes). You pay a penalty for your poor design, sure, but everything works.
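
The two figures in this exchange are easy to reproduce. The sketch below assumes an 8-hour workday (28,800 seconds), a steady 3 MB/s, and binary units (1 GB = 1024 MB, 1 PB = 1024^2 GB); the variable names are illustrative only.

```python
MB_PER_SECOND = 3
WORKDAY_SECONDS = 8 * 3600  # 28,800 seconds in an 8-hour workday

# Volume produced in one workday at a steady 3 MB/s.
daily_mb = MB_PER_SECOND * WORKDAY_SECONDS
daily_gb = daily_mb / 1024

# Sustained write rate needed to land 85.5 PB within 24 hours.
PB_PER_DAY = 85.5
gb_per_second = PB_PER_DAY * 1024**2 / (24 * 3600)

print(f"{daily_mb} MB per workday = {daily_gb} GB")
print(f"85.5 PB/day needs about {gb_per_second:.0f} GB/s sustained")
```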

    12. Re:What Fucking Decade Is It? by sexconker · · Score: 1

      Tell that to the people who use it as a database and say it heralds the end of SQL and other relational databases.

    13. Re: What Fucking Decade Is It? by sexconker · · Score: 1

      If you're using a term like "Big Data", you don't belong in the fucking building.
      Relational databases are perfectly suited to extremely large and complex datasets. You just have to intelligently design your database. You can't just throw noise into a pot and expect useful results. Hadoop (map reduce) tries to do exactly this. If you care about correctness, completeness, and synchronization of data, it's trash.

    14. Re: What Fucking Decade Is It? by Anonymous Coward · · Score: 0

      Aw, man... it must be so nice to live in such a simple world.

      Just ask Google, Facebook, and Twitter how big their relational databases are. They'll tell you they aren't very big. The real data is stored in their non-relational databases because it makes more sense that way.

      Store Tweets in a relational database? Sure, if all you want to do is count them. If you want to actually mine them for data (what's trending?), you're going to need a different kind of hammer.

      The least compelling reason for non-relational databases is the size argument. The most compelling argument for them is suitability to task. Oracle sure can store XML documents. Lots of them. Can it search them? Sure, any idiot can write a LIKE query against a text field. Some dbs (Oracle among them) even offer nice XML capabilities like being able to search a field using an XPath expression. That's great, but it spins up an XML parser for each field it's analyzing. Awful.

      The solution is a tool like MarkLogic, an XML document database, which is best suited to this type of task. MarkLogic makes a shitty relational database, but it makes a kickass XML document store.

      Hadoop is best used for distributed analysis of data. The great part is that the algorithm can go to the data, rather than having to e.g. SELECT from a centralized database to get your data set, and then analyze it. Oracle and MS SQL Server and the like are great at relational JOINs and fetching data. But they suck at analyzing it in any meaningful way, so you have to write your own analytics tools. If all your analytics tools are pounding on your central relational database just to fetch data, it becomes a clusterfuck.

      Using a tool like Hadoop allows you to say "I need to touch every one of my billion billion records and do something with them. Oh, and by the way, we have geo-distributed data centers across 85 countries, so there is no centralized database, and aggregating them makes no sense."

      Hadoop allows you to operate on data sets that RDBMSs can't handle. It's not the size of a table that matters, but the horrendous JOINs you might have to do in order to pull the kinds of data you need to do your analytics. Cartesian products are Bad Things in relational land, but handled in a different way in "Big Data" land.

      So stop shitting all over something you evidently don't understand. If you don't need something like Hadoop, then shut the fuck up and let those interested in this topic discuss it. I'm sure there is a part of your job that seems equally pointless to me, even though I don't know fuck all about it. Should I open my big mouth and tell you you're doing it wrong?

    15. Re:What Fucking Decade Is It? by bingoUV · · Score: 1

      Something fitting in maximum supported size of a database does not mean that performance of data manipulation with the database will meet the business criteria in the available budget.

      --
      Bingo Dictionary - Pragmatist, n. A myopic idealist.
    16. Re:What Fucking Decade Is It? by jbolden · · Score: 1

      Check your math on that, please. 8*3600*3 = 84.375 GB, not 85.5 PB.

      You are correct. Sorry.

      And if you're tracking 3 MB per second per user, you're tracking bots, not users

      Absolutely. You are mostly tracking network security events, computers talking to other computers. What you are generally looking for is unusual activity. Server 2047 never talks to Asia; all of a sudden it is talking to Vietnam regularly. But to do that you need to know who is talking to what across the network.

      SQL can handle all of it if you design your database sanely

      Yes and no. Obviously if you knew in advance every type of message, and designed good ways of getting it in there and good aggregates, then an RDBMS would be better. But with formats of data poorly understood, bad understanding of the types of data, complex matches, unclear rules about how to normalize... SQL Server's engine won't hold up. Of course you can just throw it in a table, but then you can't do much with it at reasonable performance. That's what Big Data engines are for. Once (if ever) you do understand the data well enough to get it into an RDBMS, of course you would rather use an RDBMS.

      You pay a penalty for your poor design, sure, but everything works.

      No it doesn't. RDBMS don't scale as well as Big Data systems. As the number of CPUs, total memory, total disk increases (particularly in cluster configuration) their performance does not increase linearly or even nearly linearly. You can't just pay a penalty and solve the problem by hardware.

  19. Re:Nope. Not happening. by Anonymous Coward · · Score: 0

    I don't really foresee anybody moving away from the MapReduce paradigm any time soon considering it's essentially a 30-year old idea that has only been getting more successful every day. The issue is whether a single proprietary clustering solution based on Java is really what people want to sign up to work with for the next 10 years. After setting up the basics for Hadoop over a day or two at work (with no real practical need for it) I for one can say I'm not particularly interested in that proposition.

  20. Re:Nope. Not happening. by Anonymous Coward · · Score: 0

    Why did you say Spark (Particle)?

  21. Hadoop sounds like something an arab dog does by Anonymous Coward · · Score: 0

    in the park, next to the terrorist with the bomb strapped around her waist getting the nerve to board the bus. She too fell for the 100 virgins ...blah blah blah.

    CHANGE
    THE
    NAME

  22. Re:Nope. Not happening. by Anonymous Coward · · Score: 0

    None of that matters when HDP stock drops to 0$. It will die and become irrelevant. Hadoop is a brand name that has to make money to stay alive, and it is not making any money.

  23. Didn't work very well for Hitler by Billly+Gates · · Score: 1
  24. When a headline asks a question by Anonymous Coward · · Score: 0

    the answer to the question is "No"

    http://en.wikipedia.org/wiki/Betteridge%27s_law_of_headlines

  25. Only Spark? by sfcat · · Score: 2

    The problem with "big data" is that there are no vendor specs and the implementations are sometimes questionable. There is a provider that does better, which is SQLStream (http://www.sqlstream.com), which has a streaming DB that is controlled via SQL. In addition to normal tables, you have streams, which are relational typed conduits through which data flows, and windows, which are time (and row) based groups of tuples that can be used in agg queries with all the standard SQL functions (there's also Java UDXes and MED support). Designing your middleware on top of a SQL engine is a much better design pattern than doing it all with hand-wired Java. All this and about 100x the throughput of a Hadoop program. Disclaimer: I'm an engineer at SQLStream.

    --
    "Those that start by burning books, will end by burning men."
    1. Re:Only Spark? by phantomfive · · Score: 2

      I read your post but I still have no idea what your 'streams' are, or why anyone would want to use them.

      --
      "First they came for the slanderers and i said nothing."
    2. Re:Only Spark? by Anonymous Coward · · Score: 0

      I do. The term is ill-defined, but in general: if you land the data before you process it, it's a not-stream. MapReduce picks up data off disk, plays with it, and writes it back out (not-stream). SQL Server's SQL runs on data that is already landed on disk (not-stream). Streams are for doing processing before the data has landed. CC-fraud detection is a good example as it does both:

      I want all the credit card transactions for all my users in a big database so I can run statistics on them and find the outliers: Classic database/MapReduce solution indicated (not-stream).

      On the way in, I want to use an in memory queue of all my customer IDs and the geographic coordinates and times of their last physically swiped credit card transaction, executing the following for every incoming record:

      hours_apart = new_swipe_time - old_swipe_time
      distance_apart = sqrt((new_x - old_x)^2 + (new_y - old_y)^2)
      if (distance_apart > hours_apart * 85)
                  fraud_inspection_queue.append(this.user)
      old_swipe_time = new_swipe_time; old_x = new_x; old_y = new_y

      This is cheap to compute in real time before you land the data and significantly reduces the time between a fraudulent transaction and the card's deactivation. When compared to the old way of inserting all the records into an "hourly" table, running hourly queries and then moving the hourly table into the month table it's a big improvement. Repeatedly picking up and putting down the same records in your database is a great way to make a hard to optimize mess of it. There are fixes, but when the whole purpose is to execute 5 lines of code per record, and only one record per customer needs to be in memory, it can be a very elegant fix to process the incoming data stream first.
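
The per-record check described above can be sketched as runnable Python. This is only an illustration under stated assumptions: the `Swipe` record, the `flag_impossible_travel` generator and the in-memory `last_seen` dictionary are hypothetical names, coordinates are treated as flat x/y values in whatever distance unit the 85-per-hour threshold is meant in, and records are assumed to arrive in time order.

```python
import math
from dataclasses import dataclass

MAX_SPEED = 85.0  # distance units per hour; threshold taken from the comment above


@dataclass
class Swipe:
    customer_id: str
    time_hours: float  # hours since an arbitrary starting point
    x: float
    y: float


def flag_impossible_travel(swipes, max_speed=MAX_SPEED):
    """Yield customer IDs whose consecutive swipes imply travel faster than max_speed.

    Only the most recent swipe per customer is kept in memory, so the check
    runs on the incoming stream before anything is written to a database.
    """
    last_seen = {}  # customer_id -> most recent Swipe
    for swipe in swipes:
        previous = last_seen.get(swipe.customer_id)
        if previous is not None:
            hours_apart = swipe.time_hours - previous.time_hours
            distance_apart = math.hypot(swipe.x - previous.x, swipe.y - previous.y)
            if distance_apart > hours_apart * max_speed:
                yield swipe.customer_id
        last_seen[swipe.customer_id] = swipe


if __name__ == "__main__":
    incoming = [
        Swipe("a", 0.0, 0.0, 0.0),
        Swipe("b", 0.0, 0.0, 0.0),
        Swipe("a", 1.0, 50.0, 0.0),   # 50 units in 1 hour: plausible
        Swipe("b", 1.0, 500.0, 0.0),  # 500 units in 1 hour: flagged
    ]
    print(list(flag_impossible_travel(incoming)))
```

Because the function is a generator, flagged IDs come out as soon as the offending record arrives, which matches the "before you land the data" idea.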

    3. Re:Only Spark? by sfcat · · Score: 1

      I do. The term is ill-defined, but in general: if you land the data before you process it, it's a not-stream.

      That's a pretty good way to think about it. A stream is really just a table without disk backing, which means you have to be reading from the stream before you write to it. In a streaming system, select queries run forever (or at least until a timeout), and inserts must happen after a select query on the same stream is made for the data to be transmitted through the stream. In this way you can take a stream of incoming data provided by one or more insert statements and send it to multiple different reading queries (i.e. selects) which only know they are reading from a named stream, not how data is being fed into that stream. You can then build up a graph of streams (plus views and tables) that process incoming streams of tuples and only dump said data to disk when you have finished transforming it, analyzing it and whatever else you do with it. Generally, customers will "compress" their high volume incoming stream using a group by into a summary of the data, which is then written to a data warehouse or database.
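
The "compress with a group by" step described above can be approximated in plain Python with a tumbling time window over an iterator. This is a generic sketch, not SQLstream's API or SQL dialect: `tumbling_window_counts` is a made-up name, and it assumes events arrive as `(timestamp, key)` pairs in timestamp order.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, key) events into fixed windows and yield per-key counts.

    Only the current window is held in memory; each summary is emitted as
    soon as the first event of the next window arrives.
    """
    current_window = None
    counts = Counter()
    for timestamp, key in events:
        window = int(timestamp // window_seconds)
        if current_window is not None and window != current_window:
            yield current_window * window_seconds, dict(counts)
            counts = Counter()
        current_window = window
        counts[key] += 1
    if current_window is not None:
        # Flush the final, possibly partial, window.
        yield current_window * window_seconds, dict(counts)


if __name__ == "__main__":
    stream = [(0, "a"), (1, "b"), (2, "a"), (10, "a"), (11, "a"), (25, "b")]
    for window_start, summary in tumbling_window_counts(stream, 10):
        print(window_start, summary)
```

Each yielded summary is the small record that would be written to a warehouse in place of the raw high-volume stream.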

      --
      "Those that start by burning books, will end by burning men."
  26. Hadoop was never really the right solution... by rockmuelle · · Score: 5, Insightful

    A scripting language with a good math/stats library (e.g., NumPy/Pandas) and a decent RAID controller are all most people really need for most "big data" applications. If you need to scale a bit, add a few nodes (and put some RAM in them) and a job scheduler into the mix and learn some basic data decomposition methods. Most big data analyses are embarrassingly parallel. If you really need 100+ TB of disk, set up Lustre or GPFS. Invest in some DDN storage (it's cheaper and faster than the HDFS system you'll build for Hadoop).

    Here's the break down of that claim in more computer sciencey terms: Almost all big data problems are simple counting problems with some stats thrown in. For more advanced clustering tasks, most math libraries have everything you need. Most "big data" sizes are under a few TB of data. Most big data problems are also I/O bound. Single nodes are actually pretty powerful and fast these days. 24 cores, 128 GB RAM, 15 TB of disk behind a RAID controller that can give you 400 MB/s data rates will cost you just barely 5 figures. This single node will outperform a standard 8 node Hadoop cluster. Why? Because the local, high density disks that HDFS encourages are slow as molasses (30 MB/s). And...

    Hadoop has a huge abstraction penalty for each record access. If you're doing minimal computation for each record, the cost of delivering the record dominates your runtime. In Hadoop, the cost is fairly high. If you're using a scripting language and reading right off the file system, your cost for each record is low. I've found Hadoop record access times to be about 20x slower than Python line read times from a text file, using the _same_ file system for Hadoop and Python (of course, Hadoop puts HDFS on top of it). In Big-O terms, the 'c' we usually leave out actually matters here - O(1*n) vs. O(20*n). 1 hour or 20 hours, you pick.
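
    To put rough numbers on the two claims above, here is a back-of-the-envelope sketch using only the figures quoted in this post. It assumes one full sequential scan of the 15 TB, decimal units, and perfectly ideal scaling across the 8 nodes (no replication, shuffle or network cost), so treat it as an upper bound on how well the cluster could do, not a benchmark.

```python
def scan_hours(data_bytes, bytes_per_second):
    """Hours needed to read data_bytes once at a sustained rate."""
    return data_bytes / bytes_per_second / 3600

TB = 10**12
MB = 10**6

# One node behind a RAID controller at 400 MB/s.
single_node = scan_hours(15 * TB, 400 * MB)

# Eight nodes at 30 MB/s each, assuming ideal scaling.
cluster = scan_hours(15 * TB, 8 * 30 * MB)

# The constant factor: same n records, 20x the per-record cost.
baseline_hours = 1.0
hadoop_hours = 20 * baseline_hours
```

    With those assumptions the single node scans everything in about 10.4 hours and the 8 node cluster needs about 17.4 hours, before the 20x per-record overhead is even counted.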

    If you're really doing big data stuff, it helps to understand how data moves through your algorithms and architect things accordingly. Almost always, a few minutes of big-O thinking and some basic knowledge of your hardware will give you an approach that doesn't require Hadoop.
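
    As a minimal sketch of what "basic data decomposition" looks like for a counting problem: split the input into independent slices, count each slice, merge the partial counts. The function names and the sample records are hypothetical, and the default `mapper=map` runs everything in one process to keep the example self-contained.

```python
from collections import Counter

def count_chunk(lines):
    """Count the first whitespace-separated field of each non-empty line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts

def decompose(lines, n_chunks):
    """Split the input into n_chunks roughly equal, independent slices."""
    return [lines[i::n_chunks] for i in range(n_chunks)]

def parallel_count(lines, n_chunks=4, mapper=map):
    """Map count_chunk over the slices, then merge the partial counts."""
    total = Counter()
    for partial in mapper(count_chunk, decompose(lines, n_chunks)):
        total.update(partial)
    return total

records = ["GET /a", "POST /b", "GET /c", "", "GET /d", "PUT /e"]
result = parallel_count(records, n_chunks=3)
```

    Because the slices don't depend on each other, swapping `mapper` for something like the `map` method of a `multiprocessing.Pool`, or handing each slice to a job scheduler, should be the only change needed to spread the work across cores or nodes.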

    tl;dr: Hadoop and Spark give people the illusion that their problems are bigger than they actually are. Simply understanding your data flow and algorithms can save you the hassle of using either.

    -Chris

    1. Re:Hadoop was never really the right solution... by sfcat · · Score: 1

      Here's the break down of that claim in more computer sciencey terms: Almost all big data problems are simple counting problems with some stats thrown in. For more advanced clustering tasks, most math libraries have everything you need. Most "big data" sizes are under a few TB of data. Most big data problems are also I/O bound. Single nodes are actually pretty powerful and fast these days. 24 cores, 128 GB RAM, 15 TB of disk behind a RAID controller that can give you 400 MB/s data rates will cost you just barely 5 figures. This single node will outperform a standard 8 node Hadoop cluster. Why? Because the local, high density disks that HDFS encourages are slow as molasses (30 MB/s). And...

      Hadoop has a huge abstraction penalty for each record access. If you're doing minimal computation for each record, the cost of delivering the record dominates your runtime. In Hadoop, the cost is fairly high. If you're using a scripting language and reading right off the file system, your cost for each record is low. I've found Hadoop record access times to be about 20x slower than Python line read times from a text file, using the _same_ file system for Hadoop and Python (of course, Hadoop puts HDFS on top of it). In Big-O terms, the 'c' we usually leave out actually matters here - O(1*n) vs. O(20*n). 1 hour or 20 hours, you pick.

      Optimization is usually about creating a small inner loop at the expense of setup cost. You can see this in compilers/languages (creating an optimized binary vs a script interpreter), in databases (prepare vs execute), and in these types of big data systems. Hadoop can't and doesn't optimize its inner loop very well at all due to its basic programming interface. It stores each row in an array of Java objects. A better design would process buffers of data with non-copying access libraries to hide this abstraction.
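
      To illustrate the "buffers with non-copying access" idea (in Python rather than Java, and only as a sketch): rows live packed in one contiguous buffer and are decoded in place as the loop walks over it. The record layout, a 4-byte id plus an 8-byte amount, is an assumption invented for the example.

```python
import struct

# Hypothetical fixed-width record: 4-byte unsigned id, 8-byte float amount.
RECORD = struct.Struct("<Id")

def pack_rows(rows):
    """Build one contiguous buffer from (id, amount) pairs."""
    buf = bytearray(RECORD.size * len(rows))
    for i, row in enumerate(rows):
        RECORD.pack_into(buf, i * RECORD.size, *row)
    return buf

def total_amount(buf):
    """Sum the amount field by walking the buffer in place."""
    view = memoryview(buf)  # a view, not a copy of the underlying bytes
    total = 0.0
    for _id, amount in RECORD.iter_unpack(view):
        total += amount
    return total

buf = pack_rows([(1, 2.5), (2, 4.0), (3, 1.5)])
```

      Python still builds a small tuple per record here, so this shows the shape of the design rather than its speed; in a compiled language the same layout lets the inner loop read fields straight out of the buffer.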

      Data processing systems can be thought of on a scale from simple, small systems that prepare each query quickly to scalable systems that take longer to optimize the query but execute it more quickly for each row. So MySQL, for instance, is at the quick/small end of the spectrum, Oracle near the middle, data warehouses towards the larger end, and finally distributed and streaming systems at the extremely large end of the spectrum. As you scale to millions of rows per second of throughput, you must have a highly optimized inner loop, and not touching disk frequently helps greatly. Hadoop does none of this well. SQLStream (see my post above) does it much much better via SQL.

      --
      "Those that start by burning books, will end by burning men."
    2. Re:Hadoop was never really the right solution... by Anonymous Coward · · Score: 0

      Heh, this is very true. Some dudes at my company are running a big expensive Hadoop project (it's a bad sign when the number of external consultants equals the number of domain experts) to analyze around 10 TB of rather low-dimensional data.

    3. Re:Hadoop was never really the right solution... by Anonymous Coward · · Score: 0

      You are right. For now. And only for 'big' data on the order of the few TB that you mention. How about for petabytes? Is there any alternative to Hadoop today that is petabyte scale? Care to mention one?

    4. Re:Hadoop was never really the right solution... by Anonymous Coward · · Score: 0

      tl;dr: Hadoop and Spark give people the illusion that their problems are bigger than they actually are. Simply understanding your data flow and algorithms can save you the hassle of using either.

      -Chris

      That's exactly right, but even if people had lots of data they would be a lot better off if they considered HDF instead of HDFS. The federal government ran into big data problems long before anyone else did, and NCSA developed a file format to deal with it, starting in the 1980s.

      It still exists and it is still public domain and it still has active development and it is still better than Hadoop. Spark can query data in it if you want to be buzzword compliant, and you can even publish a REST interface to a data set if you want.

      I suppose the problem is that it wasn't invented in Silicon Valley.

    5. Re:Hadoop was never really the right solution... by Anonymous Coward · · Score: 0

      Just use MongoDB, brah. MongoDB is web-scale.

    6. Re:Hadoop was never really the right solution... by Anonymous Coward · · Score: 0

      Single nodes are actually pretty powerful and fast these days. 24 cores, 128 GB RAM, 15 TB of disk behind a RAID controller that can give you 400 MB/s data rates will cost you just barely 5 figures. This single node will outperform a standard 8 node Hadoop cluster. Why? Because the local, high density disks that HDFS encourages are slow as molasses (30 MB/s). And...

      -Chris

      Most of the clients I work with use dozens of hosts larger than you just described, requiring data rates in the many GB/s range to complete various jobs on time. It sounds like your use case is actually "Small Data".

    7. Re: Hadoop was never really the right solution... by Anonymous Coward · · Score: 0

      > Hadoop has a huge abstraction penalty for each record access. If you're doing minimal computation for each record, the cost of delivering the record dominates your runtime. In Hadoop, the cost is fairly high. If you're using a scripting language and reading right off the file system, your cost for each record is low.

      You must have been misled at some point. The whole point of this kind of analysis is that the algorithm goes to the data, not the other way around. So the "cost" of fetching the record is exactly the same as your home-brewed solution where you read right off the disk. Basically, Hadoop is the exact thing you described as the best way to implement things, and it's already been implemented.

      Should be a win, right?

  27. Netcraft confirms it! by iamacat · · Score: 1

    BSD has been dying for how long again? It's still around and having monthly releases. For open source projects, popularity contests are much less important. With a massive existing user base, Hadoop will be actively maintained for a long time. So if you are already familiar with it and it serves the needs of your project, go right ahead.

  28. What about SAP and SAS? by Anonymous Coward · · Score: 0

    Aren't SAP and SAS in this "Big Data" market, and have been for longer than this was a buzzword? My guess is that the companies that invested heavily in SAP and SAS in the late 1990s and early 2000s are sticking with that investment and toolset and couldn't be bothered by Hadoop. Sure, it might be free, but when your entire infrastructure for business information management is already in with SAP or SAS (presumably Oracle or MSSQL as well), you're not going to throw that away anytime soon without something more compelling than, "it's cheaper", because from a migration and overall cost of conversion viewpoint it ain't.

  29. Re:Nope. Not happening. by Rich0 · · Score: 4, Insightful

    I agree that the problem is that most companies don't know how to run it

    I think a bigger problem is that most companies don't even know what big data actually is. It is a big buzzword. I hear managers talking about it all the time. Half the time they're talking about some database table with a few hundred thousand records in it. Other times they're talking about some repository full of documents or binary files that might be terabytes in size, but it is just random stuff. They don't actually have questions in mind that they want to answer, and ultimately that is what tools like Hadoop are about.

    I've heard "big data" applied to problems that are basically just file shares or the like.

    Then if a company really does have a problem where Hadoop and such is useful, they want to buy some product off the shelf that solves that particular problem, and usually they don't exist. Or they want to hire a bunch of random rent-a-coders and have them solve the problem, and they go about solving it with single-threaded solutions written in .net or whatever the commodity solution in use is at the company.

    Sure, your Facebooks and Googles and Netflixs and Amazons know what they're doing. Your average GE or Exxon or Pfizer generally doesn't do that level of comp sci.

  30. Re:Nope. Not happening. by sfcat · · Score: 0

    I know R. My wife has a Yarn store. WTF are those other things?

    It's a distributed exec for Java processes. That's really it. It has crappy monitoring built in that's unnecessary due to SNMP but they built it in anyway because...well I don't know why.

    --
    "Those that start by burning books, will end by burning men."
  31. Re:Nope. Not happening. by jbolden · · Score: 2

    You are overestimating the difficulty at this point. This is not compsci anymore and hasn't been for many many years. It isn't even hard administration. It is probably easier to get a big data system running in 2015 than it was to use Oracle in 1995.

    As far as your examples you went way too big. GE is a huge DevOps shop, they know what Big Data is. Exxon has massive supercomputing datasets. I would bet they were doing big data long before it got cool. Pfizer has an IT department that is some of everything but they have many many data warehouses so I can't imagine they aren't playing with data lakes.

  32. Re: Nope. Not happening. by Anonymous Coward · · Score: 0

    Not just covering, but documenting. To get security approvals and documentation in place for open source can be work. When you pay a vendor, they do that work as well. Which eleviates significant operational complexity.

  33. Big Data != toolset by Required+Snark · · Score: 2
    Both Pointy Headed Bosses and Slashdot loooove talking about tools. As the posts generally show, both PHBs and Slashdoters have no clue about what Big Data is used for. It's all about the buzzwords and technology, not about use and utility.

    There are no references to any algorithms. Rank ordering? Nope. Social graph analytics? No. Netflix style recommendations? Uh-uh. Statistics? None.

    Without talking about data sets, algorithms and expected results, yammering about tools is meaningless. Hot air.

    But who cares, because you all get to call each other stupid, and try and prove that you are the biggest baddest tech weenie on the block. From here it seems that you don't even know where the block is. You don't even seem to know which direction you need to go to get to a street. (Like the implied car reference there?)

    I'm beyond unimpressed. It's obvious that no one has a clue what they are talking about. Go off and learn something, and then maybe you will be able to write a post that isn't a waste of time. Other than that, STFU and get off my lawn.

    --
    Why is Snark Required?
    1. Re:Big Data != toolset by David_Hart · · Score: 1

      I agree. There is a distinct lack of discussion that outlines where Hadoop shines versus an RDBMS and these other tools. I did some reading and it seems like a database system does better with data that is organized and has a distinct relationship between data sets. Hadoop and parallel processing seem to work better for data that is highly unstructured and for which you need to delve deeply to find relationships and create ad hoc reports.

      Some have mentioned that one of the reasons for the decline of interest in Hadoop is that it is expensive. There are always newer tools being released that may cost less just to gain market share. The question is, as always, are they actually better products?

      I also agree that different problems require different solutions. Unless you are talking specifics, it becomes very difficult to produce a valid debate over the technology that would produce what is required. It's like arguing the merits of MS Excel vs. MySQL without knowing what the requirements are.

    2. Re:Big Data != toolset by Bob9113 · · Score: 1

      Both Pointy Headed Bosses and Slashdot loooove talking about tools. As the posts generally show, both PHBs and Slashdoters have no clue about what Big Data is used for. It's all about the buzzwords and technology, not about use and utility. There are no references to any algorithms.

      Heh. I've been doing big data since 2000. Fifteen years' experience in a field that's five years old, I like to say. And let me say this: You nailed it. Your whole post, not just the part I quoted. I've used the tools, from Colt to R, and there is no substitute for the ability to analyze and match a business model, data system, algorithms, implementation, and business controls.

      On the upside, give me (or, I'm guessing, you) a month or two to develop a big data strategy, and we'll generate a large, measurable improvement in the company's desired performance metric -- using whatever toolset the company is fawning over at the moment. It may not be what sells the PHBs, but it feeds the bulldog.

      It is a shame, though, to see so many charlatans diverting so much revenue into ill-conceived projects. Alas.

    3. Re:Big Data != toolset by Schnee · · Score: 1

      +1. Without analysis, big data is just a bunch of data

    4. Re:Big Data != toolset by rrr00bb5454 · · Score: 1

      Except, if you are talking about a centralized database tool, you already know that the default design of "everybody write into the centralized SQL database" is a problem. Therefore, people talk about alternative tools; which are generally designed around a set of data structures and algorithms as the default cases. A lot of streaming based applications (ie: log aggregation) are a reasonable fit for relational databases except for the one gigantic table that is effectively a huge (replicated, distributed) circular queue that eventually gets full - and must insert and delete data at the same rate. Or the initial design already rules out anything resembling a relational schema, etc.
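
      The "circular queue that inserts and deletes at the same rate" shape is easy to see in a few lines. This is a single-process Python sketch with a hypothetical `LogWindow` class, nothing replicated or distributed about it:

```python
from collections import deque

class LogWindow:
    """Bounded log buffer: once full, every insert evicts the oldest entry."""

    def __init__(self, capacity):
        self.entries = deque(maxlen=capacity)
        self.evicted = 0

    def append(self, entry):
        if len(self.entries) == self.entries.maxlen:
            self.evicted += 1  # the deque drops its oldest item on this append
        self.entries.append(entry)

window = LogWindow(capacity=3)
for n in range(5):
    window.append(f"event-{n}")
```

      Once the window is full, each append costs one eviction, which is exactly the steady insert-plus-delete load that a single giant relational table handles badly.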

    5. Re:Big Data != toolset by rrr00bb5454 · · Score: 1

      Actually, the biggest problem with RDBMS and similar tools is the fact that you are expected to mutate data in place, and mash it into a structure that is optimized for this case. Most of the zoo of new tools are about supporting a world in which incoming writes are "facts" (ie: append-only, uncleaned, unprocessed, and never deleted), while all reads are transient "views" (from combinations of batch jobs and real-time event processing) that can be automatically recomputed (like database indexes).
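
      A toy version of the facts-and-views split, assuming a made-up account/amount domain: the log only ever grows, and the "view" is a throwaway result recomputed from it. Names like `record` and `balances_view` are illustrative, not from any particular tool.

```python
from collections import defaultdict

facts = []  # append-only log; entries are never updated or deleted

def record(account, delta):
    """Append one immutable fact: (account, signed amount)."""
    facts.append((account, delta))

def balances_view(log):
    """Recompute a transient view from the full log, like rebuilding an index."""
    view = defaultdict(int)
    for account, delta in log:
        view[account] += delta
    return dict(view)

record("alice", 100)
record("bob", 50)
record("alice", -30)  # a correction is a new fact, not an in-place update
view = balances_view(facts)
```

      Nothing in `facts` was mutated to get the current balances, and the view can be thrown away and rebuilt at any time; real systems do the same thing with batch jobs plus incremental updates instead of a full rescan.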

  34. Re:Nope. Not happening. by Anonymous Coward · · Score: 0, Interesting

    I would strongly disagree. In 1995 relational theory and practice was well understood by a large set of developers and had stable, well documented implementations. Raw Hadoop and the associated computational model is not at that level of stability, documentation and usability. In addition the relational model applies to many business problems, large and small. Hadoop is generally applicable and cost efficient only for larger, more complex problems.

  35. What is spark bad at, regarding cassandra? by AnotherSeattlePrgmr · · Score: 0

    what could improve spark? where does it suck?

  36. Hadoop is growing but not that much by luisdans5060 · · Score: 1

    From 2010 until early this year I was responsible for Big Data technical marketing at Microsoft, and I recently joined AWS. I won't comment on any of the specifics for my current or former employer, but it's a fact that other NoSQL technologies have a higher adoption rate. It's clear that the traditional data warehouse had limitations, and that Hadoop is not replacing the EDW. The largest companies are using proprietary technologies, not adopting Hadoop. Hadoop 2.0 is much better, you should use it if you have the skills. But if you don't, relational, NoSQL and cloud databases are evolving to solve most use cases. I would invest more resources in Advanced Analytics, either open source (e.g. http://xpatterns.com/connect/ or https://aws.amazon.com/marketp... ) or proprietary (SAS, IBM, SAP...).

  37. Re:Nope. Not happening. by db10 · · Score: 1

    I would strongly disagree. In 1995 relational theory and practice was well understood by a large set of developers and had stable, well documented implementations. Raw Hadoop and the associated computational model is not at that level of stability, documentation and usability. In addition the relational model applies to many business problems, large and small. Hadoop is generally applicable and cost efficient only for larger, more complex problems.

    you can't strongly anything as an AC, sorry buddy

  38. No, the world is leaving big data behind. by Qbertino · · Score: 1

    Meaning the hype around big data has settled and it's back to business. I'd say there are fewer than 10 companies worldwide for whom big data actually might make sense. Others clean and aggregate their data in such a way that it's actually useful. .... I don't want my bank guessing my balance with big data statistics, I want them to know it. And so do most other people.

    --
    We suffer more in our imagination than in reality. - Seneca
  39. Re:Nope. Not happening. by Rich0 · · Score: 1

    You are overestimating the difficulty at this point. This is not compsci anymore and hasn't been for many many years. It isn't even hard administration. It is probably easier to get a big data system running in 2015 than it was to use Oracle in 1995.

    I think you're misunderstanding my point.

    Sure, it is easy to install Hadoop, and run it.

    The hard part is figuring out WHAT to run on it.

  40. Re:Nope. Not happening. by jbolden · · Score: 1

    Agree with both your comments. That is, from a developer's perspective it was certainly easier to use Oracle, once set up, in 1995 than it is to use Hadoop today (by a bit). What the thread was about was setup. What wasn't understood well in 1995 was how to package complex enterprise software so that sysadmin times to get it installed were reasonable. The original poster was talking about the complexity from scratch.

  41. Re:Nope. Not happening. by jbolden · · Score: 1

    That's easy, the big 5:

    1) Datasets too big to use an RDBMS
    2) 360 view of customers (CRM consolidation, sales systems consolidation...)
    3) Security data from network security devices.
    4) Stream in huge amounts of operational data (GPS on employees, physical sensors, machine health...) and do integrated data analysis
    5) data warehouse consolidation

  42. Hype for nothing and hadoop for free? by Anonymous Coward · · Score: 0

    Good thing I didn't bother getting onto the Hadoop hype wagon a few years ago when Hadoop was the solution to every problem, the guarantee of a high-paying job forever, and the cure for cancer. How the mighty have fallen.

  43. Re: Nope. Not happening. by Anonymous Coward · · Score: 0

    I work in the healthcare space as well, and Open Source stuff has never been a problem. HIPAA has no preference for big vendors.

    For the most part, it's been EASIER to get approvals, because the software is often much more flexible, and it's cheaper to test the waters. Want to "try" a big vendor's solution? Write a justification, lobby a bunch of people, adjust a future budget, write a purchase order, etc etc etc. Want to try out an OSS package? Download it and work-up a proof-of-concept in some spare time and then show someone a working system.

    It's easier to talk about the path from PoC to production than it is to talk about all the unknowns and costs associated with getting in bed with a vendor, because you KNOW that vendor is going to charge $500/hr for useless consulting services. Then finally, there's the possibility that the solution doesn't work out. With the OSS solution, you just shutter the work you've already done and move on. With a big vendor, you may find that "moving on" is not easily done, because they have spent most of that useless consulting time insinuating themselves into every part of your organization so you can't let them go.

  44. Re:Nope. Not happening. by Daniel+Hoffmann · · Score: 1

    So you are basically saying that Hadoop will eventually fall into disuse but HDFS (Hadoop file system) will linger on with new platforms built on top of it? Or do you believe that HDFS will also be replaced eventually?

  45. I am guessing it is a joke about IOT? by Anonymous Coward · · Score: 0

    Spark is a horrible name for anything. Spark, the M3 WiFi board people, recently needed to change their name to Particle to try to have some sort of trademark (huge derp for anyone picking Spark as a name to begin with). Otherwise it makes no sense.

  46. Hadoop is just another tool by Anonymous Coward · · Score: 0

    No, the Hadoop users are rolling their own solutions and don't actually give a fuck about boutique shops providing expensive solutions that their in-house developers can provide. And their in-house developers already know their data better than the expensive fucking consultants.

    It's not getting that much attention because it's just not that big a deal anymore, just another tool for developers to use; especially Java developers.

    In addition it seems like management has learned that Hadoop is a data-analysis package, and not a database.

  47. Manufacturing Data by clifwlkr · · Score: 1

    Something as simple as manufacturing data far eclipses this number every day. Think of every screw from every supplier in every product. Then tracking the reliability of this product through the entire lifecycle with self diagnostic tests. No, this is not for your toy made in china, but when it comes to real top end products that HAVE to work, then you need this kind of data to figure out what went wrong and fix it fast. That could save your company millions. No, making your latest dot bomb app does not need this, but there are many places that do. Also check out financial apps like credit fraud, insurance, etc.

    1. Re:Manufacturing Data by jbolden · · Score: 1

      Good examples.

    2. Re:Manufacturing Data by sexconker · · Score: 1

      Something as simple as manufacturing data far eclipses this number every day. Think of every screw from every supplier in every product. Then tracking the reliability of this product through the entire lifecycle with self diagnostic tests. No, this is not for your toy made in china, but when it comes to real top end products that HAVE to work, then you need this kind of data to figure out what went wrong and fix it fast. That could save your company millions.

      No, making your latest dot bomb app does not need this, but there are many places that do. Also check out financial apps like credit fraud, insurance, etc.

      Every screw from every supplier in every product? No one dealing in volume tracks that because it's fucking pointless. People dealing with really-expensive shit that requires that tracking don't deal in volumes where it would be a problem.

      But let's pretend we live in a fantasy world where that's true.

      SUPPLIERS
      ID - INT PrimaryKey, Identity
      Name nvarchar(whatever) ...

      PARTTYPES
      ID - INT PrimaryKey, Identity
      SupplierID INT ForeignKey (SUPPLIERS.ID)
      Name nvarchar(whatever) ...

      PARTS
      ID - BIGINT PrimaryKey, Identity
      PartTypeID INT ForeignKey (PARTTYPES.ID) ...

      PARTTRACKING
      ID - BIGINT PrimaryKey, Identity
      PartsID BIGINT ForeignKey (PARTS.ID) ...

      Add location/action/QC, result, timestamp, and employee id columns to the PARTTRACKING table and you get maybe 40 bytes per row total.

      Pretend there are 100 tracking entries per part (LOL), 10,000 tracked parts per product (LOLOL), and 1,000,000 of these presidential spacecraft level products per day (LOOOOL).
      You're only looking at 36 TB per day for PARTTRACKING, the most expensive (by far) table. Bump it up to 50 TB for all other shit in the DB.

      Your claim was that one day of manufacturing tracking would far eclipse .5 PB. Not only is this not true, SQL Server 2014 has a limit of .5 EB, not .5 PB.

      Further, .5 PB per day, even assuming non-stop operations, would be 6 GB per second of writing. And your claim is that that would be "far eclipsed".
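
      For anyone who wants to check the arithmetic, the figures above can be reproduced in a few lines. The constants are the ones from this post; the one assumption is binary units (1 TB = 2**40 bytes, 1 PB = 2**50 bytes), since that is what gives the 36 and 6 quoted here. In decimal units the same inputs come out to 40 TB per day and roughly 5.8 GB per second.

```python
ROW_BYTES = 40                    # estimated PARTTRACKING row size
ENTRIES_PER_PART = 100
PARTS_PER_PRODUCT = 10_000
PRODUCTS_PER_DAY = 1_000_000
SECONDS_PER_DAY = 24 * 60 * 60

rows_per_day = ENTRIES_PER_PART * PARTS_PER_PRODUCT * PRODUCTS_PER_DAY
bytes_per_day = rows_per_day * ROW_BYTES
tib_per_day = bytes_per_day / 2**40          # binary terabytes per day

# Sustained write rate, in binary GB per second, needed for 0.5 PB per day.
half_pb_rate = 0.5 * 2**50 / SECONDS_PER_DAY / 2**30
```

      That is 10**12 tracking rows a day, about 36.4 TB, and a sustained write rate of about 6.07 GB per second to reach half a petabyte.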

      Stop talking out of your ass.

  48. Re:Nope. Not happening. by flink · · Score: 1

    I tend to agree. As a storage admin for a Multi-Hospital organization using anything open source is not really an option if we want to keep the HIPPA-potamus away.

    I worked for over a decade as an SE for an org that was both a hospital-IT vendor and a covered entity in its own right (we sold a PMS, a PACS, operated multiple HIEs, and were a claims clearinghouse). When choosing libraries and server technology, never once was the open source status of a piece of technology a consideration with regards to HIPAA. We would occasionally have to run things by the legal team to evaluate a new license or check our compliance, but that was it. HIPAA considerations were mostly operational: Is PHI encrypted when at rest or transmitted over the open internet? Are we ensuring only authorized personnel can see PHI? How are we handling backups? The ops team took care of most of those things and they didn't care what we built the software out of, as long as it conformed to the requirements.

  49. Re: Nope. Not happening. by Anonymous Coward · · Score: 0

    "alleviates"....?

  50. Re:Nope. Not happening. by number6x · · Score: 1

    It is HIPAA, not HIPPA. Just remember it is an 'Accountability Act', and you know where the double letters are.

  51. Re:Nope. Not happening. by Anonymous Coward · · Score: 0

    I think the idea was that Hadoop is a large ecosystem with good abstraction layering for attacking problems with large datasets. Whether you change out HDFS for MapR, or you change out MapReduce processing for Spark, or you decide to do all your data processing in real time with Storm, the Hadoop model works well.

    If you wanted to justify discarding Hadoop's design completely, you'd have to be ready to re-write all the pieces that currently address people's needs. It's not unlike writing a new kernel (which you can do pretty easily) but then re-writing and testing all the hardware drivers (which would take decades).

  52. Betteridge's law of headlines by allo · · Score: 1

    Betteridge's law of headlines finally proven wrong?

  53. Re: Nope. Not happening. by Anonymous Coward · · Score: 0

    You must be new to the game. Another vendor, with their special configuration platform and customized layers definitely "elevates" complexity.

  54. Hadoop alternatives by jean-guy69 · · Score: 1

    Has anyone considered Joyent's Manta?

    This is a distributed object storage with integrated compute.
    Data is stored on a cluster of SmartOS hosts.
    And processed directly on each host inside an OS container (SmartOS zone), no data movement.

    Lots of APIs available: R, command-line, Python, Ruby, Node.js, etc.

    Available on their cloud and as an on-premises commercial product, open-sourced last November (simultaneously with SmartDataCenter).