Slashdot Mirror


How Open Sourcing Made Apache Kafka A Dominant Streaming Platform (techrepublic.com)

Open sourced in 2010, the Apache Kafka distributed streaming platform is now used at more than a third of Fortune 500 companies (as well as seven of the world's top 10 banks). An anonymous reader writes: Co-creator Neha Narkhede says "We saw the need for a distributed architecture with microservices that we could scale quickly and robustly. The legacy systems couldn't help us anymore." In a new interview with TechRepublic, Narkhede explains that while working at LinkedIn, "We had the vision of building the entire company's business logic as stream processors that express transformations on streams of data... [T]hough Kafka started off as a very scalable messaging system, it grew to complete our vision of being a distributed streaming platform."

Narkhede became the CTO and co-founder of Confluent, which supports enterprise installations of Kafka, and now says that being open source "helps you build a pipeline for your product and reduce the cost of sales... [T]he developer is the new decision maker. If the product experience is tailored to ensure that the developers are successful and the technology plays a critical role in your business, you have the foundational pieces of building a growing and profitable business around an open-source technology... Kafka is used as the source-of-truth pipeline carrying critical data that businesses rely on for real-time decision-making."

2 of 48 comments (clear)

  1. Moves data from this cluster to that cluster relia by raymorris · · Score: 5, Informative

    Suppose you have some service that produces data. This service might be on one server, or a group of servers.

    Some other service receives this data. Perhaps the receiving service transforms the data in some way before passing it along to some other system.

    Kafka helps with that. It avoids some simple problems. For example, I once worked on a system in which a cron transferred the data at midnight each day. Each day, it sent over that day's data. Records created right at midnight might get skipped, or might get sent twice. In case of a network glitch, you'd have to manually retry in the morning. Kafka avoids those kinds of problems.

    Kafka is built on the idea that both producers and consumers may be groups of partially redundant servers, with the data split up between different servers. Kafka has features to enable load balancing.

    So it's appropriate where you want to get data from some group of servers to another group, possibly through a middle group, you want it reliable, load balanced, etc, without inventing and later correcting your own protocols.

  2. Re:WTF does it do? by PhunkySchtuff · · Score: 3, Informative

    OK, now it's starting to make more sense looking at the use cases

    Here is a description of a few of the popular use cases for Apache Kafka. For an overview of a number of these areas in action, see this blog post.

    Messaging
    Kafka works well as a replacement for a more traditional message broker. Message brokers are used for a variety of reasons (to decouple processing from data producers, to buffer unprocessed messages, etc). In comparison to most messaging systems Kafka has better throughput, built-in partitioning, replication, and fault-tolerance which makes it a good solution for large scale message processing applications.
    In our experience messaging uses are often comparatively low-throughput, but may require low end-to-end latency and often depend on the strong durability guarantees Kafka provides.

    In this domain Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.

    Website Activity Tracking
    The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. This means site activity (page views, searches, or other actions users may take) is published to central topics with one topic per activity type. These feeds are available for subscription for a range of use cases including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehousing systems for offline processing and reporting.
    Activity tracking is often very high volume as many activity messages are generated for each user page view.

    Metrics
    Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.

    Log Aggregation
    Many people use Kafka as a replacement for a log aggregation solution. Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing. Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages. This allows for lower-latency processing and easier support for multiple data sources and distributed data consumption. In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency.

    Stream Processing
    Many users of Kafka process data in processing pipelines consisting of multiple stages, where raw input data is consumed from Kafka topics and then aggregated, enriched, or otherwise transformed into new topics for further consumption or follow-up processing. For example, a processing pipeline for recommending news articles might crawl article content from RSS feeds and publish it to an "articles" topic; further processing might normalize or deduplicate this content and published the cleansed article content to a new topic; a final processing stage might attempt to recommend this content to users. Such processing pipelines create graphs of real-time data flows based on the individual topics. Starting in 0.10.0.0, a light-weight but powerful stream processing library called Kafka Streams is available in Apache Kafka to perform such data processing as described above. Apart from Kafka Streams, alternative open source stream processing tools include Apache Storm and Apache Samza.

    Event Sourcing
    Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka's support for very large stored log data makes it an excellent backend for an application built in this style.

    Commit Log
    Kafka can serve as a kind of external commit-log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data. The log compaction feature in Kafka helps support this usage. In this usage Kafka is similar to Apache BookKeeper project.