Slashdot Mirror


How Open Sourcing Made Apache Kafka A Dominant Streaming Platform (techrepublic.com)

Open sourced in 2010, the Apache Kafka distributed streaming platform is now used at more than a third of Fortune 500 companies (as well as seven of the world's top 10 banks). An anonymous reader writes: Co-creator Neha Narkhede says "We saw the need for a distributed architecture with microservices that we could scale quickly and robustly. The legacy systems couldn't help us anymore." In a new interview with TechRepublic, Narkhede explains that while working at LinkedIn, "We had the vision of building the entire company's business logic as stream processors that express transformations on streams of data... [T]hough Kafka started off as a very scalable messaging system, it grew to complete our vision of being a distributed streaming platform."

Narkhede became the CTO and co-founder of Confluent, which supports enterprise installations of Kafka, and now says that being open source "helps you build a pipeline for your product and reduce the cost of sales... [T]he developer is the new decision maker. If the product experience is tailored to ensure that the developers are successful and the technology plays a critical role in your business, you have the foundational pieces of building a growing and profitable business around an open-source technology... Kafka is used as the source-of-truth pipeline carrying critical data that businesses rely on for real-time decision-making."

48 comments

  1. helps you what in the what? by war4peace · · Score: 5, Interesting

    The amount of corporate bullshit in TFS makes my head hurt and spin... at the same time.

    --
    ...gis sdrawkcab (usually not responding to ACs; don't bother posting as AC)
    1. Re:helps you what in the what? by Anonymous Coward · · Score: 1

      its another shitty java clusterfuck

    2. Re:helps you what in the what? by Anonymous Coward · · Score: 0

      Think tibco, but without voodoo magic required to play with it.

    3. Re:helps you what in the what? by Anonymous Coward · · Score: 0

      Whenever you hear "business-" prepended to anything, think "automating a manager's job."

    4. Re:helps you what in the what? by Anonymous Coward · · Score: 3, Funny

      I'm using Kafka to create data-driven feeds that leverage business intelligence in semantic data-driven mashups. By integrating a webscale dynamic platform it's added critical synergies to my KPIs.

    5. Re:helps you what in the what? by Anonymous Coward · · Score: 0

      I have to admit that after reading the summary, article and Apache Kafka's home page, I still have no fucking idea what it's all about. All I found out was that it uses Java, which doesn't really correlate with all the hype it seems to be getting because nobody is this excited about anything written in Java.

    6. Re: helps you what in the what? by Anonymous Coward · · Score: 0

      Why is this even a slashdot article? It's like saying 10 out of 10 top 10 Banks use OpenSSH because something you install runs it in the background.

      If you have the Hortonworks Hadoop stack, then you have it installed whether you know it or not, and whether you actively use it or not.

      It's a stream processing API that can be used for a bunch of stuff, like ingesting data into a distributed store like Hadoop, or as a background building block for a CEP (Complex Event Processor) platform without coding your own.

      Some big-name product comes bundled with some free underlying API, and the underlying API guys start posting to Slashdot to claim everyone uses it and that they are so successful.

    7. Re: helps you what in the what? by Anonymous Coward · · Score: 0

      Yeah, that's why all the banks use it; they're known for pissing money away on shit.

    8. Re: helps you what in the what? by Anonymous Coward · · Score: 0

      You have to have more than 1 week of JavaScript to get the gist of it

  2. Stop the oppression of the sporteists NOW!!!! by Anonymous Coward · · Score: 0

    The majority sports culture in the United States suffocates the non-adherents. Super Bowl is not inclusive of the full spectrum of society. Its existence only serves to show how little progress our society has made in the past decades and how much work remains to be done. Stop the oppression of America's football-rejecting subcultures now by making this year the last year of Super Bowl!!

    1. Re: Stop the oppression of the sporteists NOW!!!! by Anonymous Coward · · Score: 0

      What the serious fuck?!?

    2. Re:Stop the oppression of the sporteists NOW!!!! by Anonymous Coward · · Score: 0

      sorry you never got picked to play. bet you look hawt in cheerleader outfit!

    3. Re: Stop the oppression of the sporteists NOW!!!! by Anonymous Coward · · Score: 0

      Everyone gets a participation trophy... even if they have never played football.

    4. Re:Stop the oppression of the sporteists NOW!!!! by Anonymous Coward · · Score: 0

      and well it should. sports aint for snowflakes. stay in your moms basement like a good little k0d3r m0nk3y and leave the real world for real people.

    5. Re: Stop the oppression of the sporteists NOW!!!! by Anonymous Coward · · Score: 0

      Because sports is what constitutes the real world now.

  3. How open source creates fake news by Anonymous Coward · · Score: 0

    More at 11

  4. Re: Trump by Anonymous Coward · · Score: 0

    Hey, cock sucking fagget, are you in canada yet?

  5. Re: by Anonymous Coward · · Score: 0

    Have pity on a fake journalist!

  6. Kafka by Anonymous Coward · · Score: 1

    The experience I've had testing Kafka with large amounts of data led me to a couple of conclusions.

    Kafka is a lot of overhead to control streams, and it doesn't solve the problems you actually have when you need distributed streaming solutions: primarily bottlenecks, write speeds, read speeds, and processing performance irregularity (including debugging).

    The idea that Kafka helps you with stream processing in a way that more traditional methods (load balancing, splitting on load, processing in parallel) can't or don't or that it's easier, is false.

    If you're on an ec2, open a socket to S3 and write, have something process it. You'll save a lot of cycles (in every way).

    This article is some slick IBM-style marketing which is not very helpful to people with existing technical challenges.

    1. Re:Kafka by K.+S.+Kyosuke · · Score: 1

      Kafka is a lot of overhead to control streams, and it doesn't solve the problems you actually have when you need distributed streaming solutions: primarily bottlenecks, write speeds, read speeds, and processing performance irregularity (including debugging). The idea that Kafka helps you with stream processing in a way that more traditional methods (load balancing, splitting on load, processing in parallel) can't or don't or that it's easier, is false.

      Isn't this ideally a subject for a specialized language, or an embedded one? I know nothing about Kafka but it really seems that bolting a framework or a library to your system does little to help you with either performance or abstraction in a case like this.

      --
      Ezekiel 23:20
  7. Aptly named by Anonymous Coward · · Score: 0

    That was Kafkaesque.

  8. i live in canada STAY AWAY by Anonymous Coward · · Score: 0

    we are building a massive igloo over all canada to keep.....everyone out.....we even have 30 foot mutated polar bears imported from ...alaska to guard us....oh and a giant 50 foot beaver is said to be wandering about eating any americans it finds....

    1. Re:i live in canada STAY AWAY by Anonymous Coward · · Score: 0

      That's just terrific. We need some of those too.

    2. Re: i live in canada STAY AWAY by Anonymous Coward · · Score: 0

      I hear there are guys in the White House that throw hissy fits and act like little beavers.

  9. Helps you Java guys stay busy by thesjaakspoiler · · Score: 1

    because getting through those docs and getting the whole d*** thing to work is creating more jobs than it proclaims to save.

    1. Re:Helps you Java guys stay busy by Anonymous Coward · · Score: 0

      Kafka can be a bit laborious to operate, but it is a solid building block for many data pipeline use cases. There is a good selection of service offerings if you just want to use it but don't care about running it yourself: check out Aiven (https://aiven.io/kafka), CloudKarafka (https://www.cloudkarafka.com/) as well as Heroku (https://www.heroku.com/kafka).

  10. The vision by Anonymous Coward · · Score: 1

    We had the vision of building the entire company's business logic as stream processors that express transformations on streams of data

    Yikes, reality surpasses Dilbert.

    1. Re:The vision by K.+S.+Kyosuke · · Score: 1

      Sounds like a special case of functional programming. I didn't know that Dilbert had problems with functional programming mentioned in it.

      --
      Ezekiel 23:20
  11. Moves data from this cluster to that cluster relia by raymorris · · Score: 5, Informative

    Suppose you have some service that produces data. This service might be on one server, or a group of servers.

    Some other service receives this data. Perhaps the receiving service transforms the data in some way before passing it along to some other system.

    Kafka helps with that. It avoids some simple problems. For example, I once worked on a system in which a cron transferred the data at midnight each day. Each day, it sent over that day's data. Records created right at midnight might get skipped, or might get sent twice. In case of a network glitch, you'd have to manually retry in the morning. Kafka avoids those kinds of problems.

    Kafka is built on the idea that both producers and consumers may be groups of partially redundant servers, with the data split up between different servers. Kafka has features to enable load balancing.

    So it's appropriate where you want to get data from some group of servers to another group, possibly through a middle group, you want it reliable, load balanced, etc, without inventing and later correcting your own protocols.
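
    The midnight-cron problem described above comes from selecting records by time window. Here is a minimal sketch, in plain Python with no Kafka client, of the alternative idea: the reader remembers the offset of the last record it handled, so a retry after a failure resumes where it stopped. The names `Log` and `transfer` are hypothetical, and this is a toy model rather than Kafka's actual API or delivery guarantees.

```python
class Log:
    """Toy append-only log; each record gets a sequential offset."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

    def read_from(self, offset):
        """Return (offset, record) pairs starting at the given offset."""
        return list(enumerate(self.records))[offset:]


def transfer(log, committed_offset, sink, fail_at=None):
    """Copy records from committed_offset onward into sink.

    Returns the new committed offset. fail_at simulates a network
    glitch just before the record at that offset is sent.
    """
    for offset, record in log.read_from(committed_offset):
        if offset == fail_at:
            return committed_offset  # glitch: stop, keep progress made so far
        sink.append(record)
        committed_offset = offset + 1  # commit only after a successful send
    return committed_offset


log = Log()
for name in ["a", "b", "c", "d"]:
    log.append(name)

sink = []
pos = transfer(log, 0, sink, fail_at=2)  # glitch before the third record
pos = transfer(log, pos, sink)           # retry resumes from the committed offset
```

    Because progress is tracked by position rather than by clock time, nothing depends on what "right at midnight" means.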

  12. My read was that Kafka assists with those methods by raymorris · · Score: 1

    > The idea that Kafka helps you with stream processing in a way that more traditional methods (load balancing, splitting on load, processing in parallel) can't or don't or that it's easier, is false.

    My read was not that Kafka is supposed to *replace* "load balancing, splitting on load, processing in parallel", but that it's intended to *enable* "load balancing, splitting on load, processing in parallel". Not that it does something that load balancing doesn't do, but that it provides a proven load balancing solution, or at least some key parts.
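
    As a rough illustration of how a partitioned log can enable "splitting on load": records are assigned to partitions by hashing a key, so each worker can own a subset of partitions. The sketch below is plain Python, not a Kafka client; `partition_for` and `split` are hypothetical helpers, and CRC32 is only a stand-in for whatever hash a real partitioner uses.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a record key to a partition; the same key always lands in the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def split(records, num_partitions):
    """Group (key, value) records by partition so each group can be processed in parallel."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


records = [("user-1", "login"), ("user-2", "search"), ("user-1", "logout")]
partitions = split(records, num_partitions=4)
```

    Records that share a key stay in one partition and keep their relative order, which is what lets the work be spread out without scrambling per-key sequences.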

  13. Slashvertisement? by Anonymous Coward · · Score: 0

    "An anonymous reader" ... named Neha Narkhede, by any chance?

  14. WTF does it do? by PhunkySchtuff · · Score: 1

    I've got no idea what Kafka does, and the summary really doesn't tell you much at all. I was about to put in a helpful post saying what it is, but even after visiting their home page I've still got no idea.

    Apparently Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.

    How about the Intro
    We think of a streaming platform as having three key capabilities:
          It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
          It lets you store streams of records in a fault-tolerant way.
          It lets you process streams of records as they occur.

    What is Kafka good for?
    It gets used for two broad classes of application:
          Building real-time streaming data pipelines that reliably get data between systems or applications
          Building real-time streaming applications that transform or react to the streams of data

    OK, I still am not really sure what it does.

    1. Re:WTF does it do? by PhunkySchtuff · · Score: 3, Informative

      OK, now it's starting to make more sense looking at the use cases

      Here is a description of a few of the popular use cases for Apache Kafka. For an overview of a number of these areas in action, see this blog post.

      Messaging
      Kafka works well as a replacement for a more traditional message broker. Message brokers are used for a variety of reasons (to decouple processing from data producers, to buffer unprocessed messages, etc). In comparison to most messaging systems Kafka has better throughput, built-in partitioning, replication, and fault-tolerance which makes it a good solution for large scale message processing applications.
      In our experience messaging uses are often comparatively low-throughput, but may require low end-to-end latency and often depend on the strong durability guarantees Kafka provides.

      In this domain Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.

      Website Activity Tracking
      The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. This means site activity (page views, searches, or other actions users may take) is published to central topics with one topic per activity type. These feeds are available for subscription for a range of use cases including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehousing systems for offline processing and reporting.
      Activity tracking is often very high volume as many activity messages are generated for each user page view.
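
      The "one topic per activity type, many independent subscribers" idea can be sketched in a few lines of plain Python. This is a toy in-memory model, not Kafka's API: the `Broker` class, the topic names, and the subscriber names are all made up for illustration.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory publish-subscribe broker with per-subscriber read positions."""

    def __init__(self):
        self.topics = defaultdict(list)    # topic name -> list of messages
        self.positions = defaultdict(int)  # (subscriber, topic) -> next offset to read

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, subscriber, topic):
        """Return the messages this subscriber has not seen yet on the topic."""
        start = self.positions[(subscriber, topic)]
        messages = self.topics[topic][start:]
        self.positions[(subscriber, topic)] = start + len(messages)
        return messages


broker = Broker()
broker.publish("page_views", {"user": 1, "url": "/home"})
broker.publish("searches", {"user": 1, "query": "kafka"})
broker.publish("page_views", {"user": 2, "url": "/about"})

# Two subscribers read the same feed independently of each other.
monitoring = broker.poll("monitoring", "page_views")
warehouse = broker.poll("warehouse", "page_views")
monitoring_again = broker.poll("monitoring", "page_views")
```

      Each subscriber keeps its own position, so a slow offline loader does not hold back a real-time monitor reading the same topic.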

      Metrics
      Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.

      Log Aggregation
      Many people use Kafka as a replacement for a log aggregation solution. Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing. Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages. This allows for lower-latency processing and easier support for multiple data sources and distributed data consumption. In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency.

      Stream Processing
      Many users of Kafka process data in processing pipelines consisting of multiple stages, where raw input data is consumed from Kafka topics and then aggregated, enriched, or otherwise transformed into new topics for further consumption or follow-up processing. For example, a processing pipeline for recommending news articles might crawl article content from RSS feeds and publish it to an "articles" topic; further processing might normalize or deduplicate this content and publish the cleansed article content to a new topic; a final processing stage might attempt to recommend this content to users. Such processing pipelines create graphs of real-time data flows based on the individual topics. Starting in 0.10.0.0, a light-weight but powerful stream processing library called Kafka Streams is available in Apache Kafka to perform such data processing as described above. Apart from Kafka Streams, alternative open source stream processing tools include Apache Storm and Apache Samza.
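
      The news-article pipeline described above can be approximated with chained Python generators, where each stage consumes one stream and emits another. This is only a sketch of the staged-transformation idea; it does not use Kafka Streams, and the stage functions and sample articles are invented for illustration.

```python
def normalize(articles):
    """Stage 1: trim and lowercase titles so equivalent articles compare equal."""
    for article in articles:
        yield {"title": article["title"].strip().lower(), "url": article["url"]}


def deduplicate(articles):
    """Stage 2: drop articles whose normalized title has already been seen."""
    seen = set()
    for article in articles:
        if article["title"] not in seen:
            seen.add(article["title"])
            yield article


# Stand-in for the raw "articles" topic.
raw_articles = [
    {"title": "  Breaking News ", "url": "http://example.com/a"},
    {"title": "breaking news", "url": "http://example.com/b"},
    {"title": "Streams Explained", "url": "http://example.com/c"},
]

# Stand-in for the cleansed topic that a recommender stage would consume.
cleansed_articles = list(deduplicate(normalize(raw_articles)))
```

      In a real deployment each stage would read from one topic and write to the next, so stages can be scaled or restarted independently.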

      Event Sourcing
      Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka's support for very large stored log data makes it an excellent backend for an application built in this style.
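
      A small Python sketch of the event-sourcing style: the current state is never stored directly, it is recomputed by replaying the ordered event log. The account example and the `apply_event` and `replay` names are hypothetical, chosen only to show the pattern.

```python
def apply_event(balance, event):
    """Apply one state change to the current balance."""
    kind, amount = event
    if kind == "deposit":
        return balance + amount
    if kind == "withdraw":
        return balance - amount
    raise ValueError(f"unknown event type: {kind}")


def replay(events, initial=0):
    """Rebuild state by folding the time-ordered events over the initial state."""
    state = initial
    for event in events:
        state = apply_event(state, event)
    return state


events = [("deposit", 100), ("withdraw", 30), ("deposit", 5)]
balance = replay(events)              # state after all events
earlier_balance = replay(events[:2])  # state as of the second event
```

      Replaying a prefix of the log gives the state at any earlier point, which is why a durable, ordered log suits this design.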

      Commit Log
      Kafka can serve as a kind of external commit-log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data. The log compaction feature in Kafka helps support this usage. In this usage Kafka is similar to the Apache BookKeeper project.
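
      To give a feel for what log compaction means, here is a simplified Python sketch: for each key, only the most recent record is kept, so a node replaying the compacted log still ends up with the same final state. The `compact` function and sample data are hypothetical, and details such as delete markers are deliberately left out.

```python
def compact(log):
    """Keep only the latest record per key, preserving the order of the survivors."""
    last_index = {}
    for index, (key, _value) in enumerate(log):
        last_index[key] = index  # later records overwrite earlier positions
    return [entry for index, entry in enumerate(log) if last_index[entry[0]] == index]


log = [
    ("user1", "a@example.com"),
    ("user2", "b@example.com"),
    ("user1", "c@example.com"),  # supersedes the first user1 record
]
compacted = compact(log)
```

      Replaying either the full log or the compacted one into a key-value map yields the same result; the compacted one is just shorter.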

  15. The cybercrud is thick with this one by Anonymous Coward · · Score: 0

    Frameworks like this are invariably heavyweight and have a real appreciable cost to use. Be sure you actually have enough work to do with one to make the investment (and continuing care-and-feeding headache) worth the time and money. Otherwise something built directly on your business process will be a much better solution, though you may have to periodically defend it against the framework salesmen.

  16. WTF by fisternipply · · Score: 1

    WTF was all that gibberish? Can someone tell me what this thing actually does?

    1. Re:WTF by Anonymous Coward · · Score: 1

      It handles data. Streams of data. The simplest (and entirely realistic) case is probably log data; some application generates log output, so you use a connector (say, logback-kafka-appender if you're using slf4j) to send that log output to Kafka (this would be "publishing" to a "topic" in enterprise speak). Subscribers (clients) then subscribe to (read from) the topic and consume the log output. A simple subscriber might then simply write that data to log files.

      So far, so good. Looks like any ordinary message broker. Now add some fault tolerance; your Kafka server is really a node in a cluster. You can add more Kafka nodes at will — say, different machines in the same rack or something — and your log traffic will reliably replicate between those nodes. If a node goes down everything keeps working; publishers keep publishing and consumers keep consuming without messages being lost, so now you can cycle hosts (hardware upgrades, patches, whatever) without disturbing anyone.
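
      A toy Python model of the fault-tolerance idea above: every message is copied to each live node, so readers keep working when one node goes down. The `Cluster` class is hypothetical and heavily simplified; real Kafka replicates per partition with its own protocol, and a failed node here simply falls behind.

```python
class Cluster:
    """Toy cluster that copies every published message to all live nodes."""

    def __init__(self, node_names):
        self.nodes = {name: [] for name in node_names}
        self.alive = set(node_names)

    def publish(self, message):
        for name in self.alive:
            self.nodes[name].append(message)

    def fail(self, name):
        """Simulate taking a node down, e.g. for a patch or hardware upgrade."""
        self.alive.discard(name)

    def read_all(self):
        """Read the full stream from any node that is still alive."""
        name = sorted(self.alive)[0]
        return list(self.nodes[name])


cluster = Cluster(["node-a", "node-b", "node-c"])
cluster.publish("log line 1")
cluster.fail("node-a")          # one host is cycled out
cluster.publish("log line 2")   # publishers keep publishing
messages = cluster.read_all()   # consumers keep consuming
```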

      Now you need to send this stuff to another data center. You set up a mirror between a Kafka cluster at your first data center and a new Kafka cluster at your new data center. Consumers talk to their local cluster at the new data center (minimizing WAN traffic and latency.) Data will reliably flow from one cluster to the other. If communication fails the flow will resume and catch up when communication is restored. Kafka transparently compresses data before sending over the WAN saving you lots of bandwidth.

      So lots of consumers get built around this flow of data. Dozens of applications, databases and other stuff consume all this data. Some throw most of it out looking for very specific things. Others ETL it all into big analytical systems. Meanwhile, your original application is oblivious; it's just sending log output to some Kafka cluster, entirely unaware of how many data centers and consumers are getting it.

      Then you start running your application that publishes the log output at some new site. The new application instance publishes its log output to the same topic as the existing instance. Clients just keep consuming the same topic, oblivious to the new topology.

      Meanwhile you sit there monitoring all this, making sure your redundancy quotas are met, managing disk space and IOPS used to persist all the data, and calling developers and their bosses to make them stop overloading the system with excessive data.

      That's Kafka in a nutshell. It's not new; it's been done before in various ways. Kafka's novelty is that it's free and popular and succeeding in the market, displacing older systems and accumulating a lot of mind share.

    2. Re:WTF by Anonymous Coward · · Score: 0

      Thank you for that explanation. You answered in such a way that I can understand it even before my morning coffee.
      This is a really old-school approach: answering the question with ... information, not a rant ...

  17. Re:Moves data from this cluster to that cluster re by war4peace · · Score: 1

    Thank you very much for the clarification.
    I am a Business Intelligence Analyst and to my shame I had never heard of this solution, or maybe I had but it was so riddled with buzzwords and corporate bullshit that it became unintelligible to plebs like me.

    Yes, I can see quite a few use cases for it. If they only used your words to describe it :)

    --
    ...gis sdrawkcab (usually not responding to ACs; don't bother posting as AC)
  18. Re:Moves data from this cluster to that cluster re by Anonymous Coward · · Score: 0

    I had no idea what that Kafka was about before this, but it sure does sound like a BizTalk clone. The only difference seems to be that they gave some new material for the buzzword bingos.

  19. Re:Moves data from this cluster to that cluster re by kiwipom · · Score: 0

    If you think Kafka is like BizTalk, I suggest you look at the documentation / download it, it's nothing like it. It's a highly scalable, ultra-high throughput messaging system. The stream processing API is just a bolt on the side, but again is nothing like BizTalk. Kafka has been proved in production to handle 1.2 trillion messages per day, no way BizTalk can do that.

    --
    Dum spiro spero
  20. Re:Moves data from this cluster to that cluster re by Anonymous Coward · · Score: 0

    It's a highly scalable, ultra-high throughput messaging system.

    You mean like BizTalk?

    It's not like you couldn't have JFGI'ed when you encountered something you had never heard of.

  21. Re:Moves data from this cluster to that cluster re by Anonymous Coward · · Score: 0

    From what I can dig up about it, Kafka is just the message queue. BizTalk uses a message queue along with all of the routing, transformation, and customization stuff, all in one package.

    Basically, if you're running a Microsoft platform, you don't need Kafka because WCF or MSMQ or some other suitable alternative is already built in to the OS and dev tool chain. BizTalk is built on top of those things and goes way above and beyond Kafka. Also, BizTalk is built for heavier message loads than Kafka is, with maybe (maybe!) not as much throughput... unless you need it, then you can cluster it (or rent it on Azure, because, as the GP said, your groin will suffer).

  22. Re:Moves data from this cluster to that cluster re by abmw · · Score: 1

    More like MQ with more optional plumbing?

  23. Re:Moves data from this cluster to that cluster re by kiwipom · · Score: 0

    I have never, ever, heard those words used to describe BizTalk

    --
    Dum spiro spero
  24. Re:Moves data from this cluster to that cluster re by kiwipom · · Score: 0

    Ho ho ho!

    --
    Dum spiro spero