Meet Flink, the Apache Software Foundation's Newest Top-Level Project

← Back to Stories (view on slashdot.org)

Meet Flink, the Apache Software Foundation's Newest Top-Level Project

Posted by timothy on Tuesday January 13, 2015 @01:50AM from the name-is-self-explanitory dept.

Open source data-processing language Flink, after just nine months' incubation with the Apache Software Foundation, has been elevated to top-level status, joining other ASF projects like OpenOffice and CloudStack. An anonymous reader writes The data-processing engine, which offers APIs in Java and Scala as well as specialized APIs for graph processing, is presented as an alternative to Hadoop's MapReduce component with its own runtime. Yet the system still provides access to Hadoop's distributed file system and YARN resource manager. The open-source community around Flink has steadily grown since the project's inception at the Technical University of Berlin in 2009. Now at version 0.7.0, Flink lists more than 70 contributors and sponsors, including representatives from Hortonworks, Spotify and Data Artisans (a German startup devoted primarily to the development of Flink). (For more about ASF incubation, and what the Foundation's stewardship means, see our interview from last summer with ASF executive VP Rich Bowen.)

34 comments

Min score:

Reason:

Sort:

What Apache needs by Anonymous Coward · 2015-01-13 02:02 · Score: 3, Funny

A big data project to keep track of all of Apache's big data projects. Seems like there's a new one every month.
Ok, I give up by Anonymous Coward · 2015-01-13 02:03 · Score: 0

Where do you find an overview to give a clue as to what it is and why it is a good idea?
1. Re:Ok, I give up by fhueske · 2015-01-13 02:14 · Score: 3, Informative
  
  Flink is a parallel data processing engine similar to Hadoop and Spark with some unique features: 1) combines realtime stream and batch processing, 2) features an DBMS-style optimizer, 3) in-memory processing which goes gracefully to disk if memory is scarce, 4) provides special operators for iterative processing, ... Check out http://flink.apache.org/ for details.
2. Re:Ok, I give up by Anonymous Coward · 2015-01-13 02:37 · Score: 0
  
  Why do we need this when we already have Hadoop?
3. Re: Ok, I give up by samkass · 2015-01-13 02:43 · Score: 2
  
  But how does Apache Flink compare to Apache Spark? They both claim to be largely compatible with but much faster than Hadoop...
  
  --
  E pluribus unum
4. Re:Ok, I give up by Anonymous Coward · 2015-01-13 02:45 · Score: 0
  
  Q: What it is?
  A: Looks like a drop-in replacement for hadoop.
  Q: Why it is a good idea?
  A: It is 5 to 10 times faster than hadoop.
5. Re: Ok, I give up by fhueske · 2015-01-13 03:01 · Score: 3, Informative
  
  Flink and Spark address similar use cases and have similar APIs, but the technology under the hood differs a bit. Flink serializes data to memory and works a lot on binary representations. This also makes (partial) spilling to disk easier if the system runs out of memory. Flink's processing engine uses pipelined processing (like many DBMS) which allows for true streaming and batch processing in a single engine. Iterations are build-in operators in Flink which means that data continuously flows in cycles. A special iteration operator reduces the number of computations as iteration count increases which gives very good performance for certain use cases such as some ML algorithms.
6. Re:Ok, I give up by rockmuelle · 2015-01-13 03:20 · Score: 4, Interesting
  
  More importantly, why did we need Hadoop when we already had [your_favorite_language] + [your_favorite_job_scheduler] + [your_favorite_parallel_file_system]?
  Seriously, standard HPC batch processing methods are always faster and easier to develop for than latest_trendy_distributed_framework.
  The challenges of data at scale* are almost all related to IO performance and the overhead of accessing individual records.
  IO performance is solved by understanding your memory hierarchy and designing your hardware and tuning your file system around your common access patterns. A good multi-processor box with a fast hardware raid and decent disks and sufficient RAM will outperform a cheap cluster any day of the week and likely cost less (it's 2015, things have improved since the days of Beowulf). If you need to scale, a small cluster with Infiniband (or 10 GigE) interconnects and Lustre (or GPFS if you have deep pockets) will scale to support a few petabytes of data at 3-4 GB/s throughput (yes, bytes, not bits). You'd be surprised what the right 4 node cluster can accomplish.
  On the data access side, once the hardware is in place, record access times are improved by minimizing the abstraction penalty for accessing individual records. As an example, accessing a single record in Hadoop generates a call stack of over 20 methods from the framework alone. That's a constant multiplier of 20x on _every_ data access**. A simple Python/Perl/JS/Ruby script reading records from the disk has a much smaller call stack and no framework overhead. I've done experiments on many MapReduce "algorithms" and always find that removing the overhead of Hadoop (using the same hardware/file system) improves performance by 15-20x (yes, that's 'x', not '%'). Not surprisingly, the non-Hadoop code is also easier to understand and maintain.
  tl;tr: Pick the right hardware and understand your data access patterns and you don't need complex frameworks.
  Next week: why databases, when used correctly, are also much better solutions for big data than latest_trendy_framework. ;)
  -Chris
  *also: very few people really have data that's big enough to warrant a distributed solution, but let's pretend everyone's data is huge and requires a cluster.
  ** it also assumes the data was on the local disk and not delivered over the network, at which point, all performance bets are off.
7. Re:Ok, I give up by IAMBatman · 2015-01-13 08:20 · Score: 1
  
  It's scary how you communicate so much bullshit as if you are the expert.
Same old same old? by Bazman · 2015-01-13 02:14 · Score: 2, Funny

We need another distributed system for counting words like we need another javascript framework for writing a Todo list app.
1. Re:Same old same old? by QilessQi · 2015-01-13 03:52 · Score: 1
  
  A Javascript framework for writing a TODO list app? That's GREAT idea! And I know JUST how to do it! (starts typing)
  
  --
  Koans and fables for the software engineer
2. Re:Same old same old? by LeadSongDog · 2015-01-13 03:57 · Score: 1
  
  We need another distributed system for counting words like we need another javascript framework for writing a Todo list app.
  We need another bafflegab project description like a school of fish needs a robotic assembly line for bicycles. From the site:
  
  Flink Overview
  Apache Flink (incubating) is a platform for efficient, distributed, general-purpose data processing. It features powerful programming abstractions in Java and Scala, a high-performance runtime, and automatic program optimization. It has native support for iterations, incremental iterations, and programs consisting of large DAGs of operations.
  If you quickly want to try out the system, please look at one of the available quickstarts. For a thorough introduction of the Flink API please refer to the Programming Guide.
  So, what's that say it's for?
  
  --
  Oh, I'm sorry sir, I thought you were referring to me, Mr. Wensleydale.
3. Re:Same old same old? by Anonymous Coward · 2015-01-13 10:16 · Score: 0
  
  Says right at the beginning: is a platform for efficient, distributed, general-purpose data processing.
9 months by Anonymous Coward · 2015-01-13 02:22 · Score: 0

9 months were enough for me.
Flink? by Anonymous Coward · 2015-01-13 02:32 · Score: 0

Sounds like slang for flicking boogers while taking a piss
Re:Who uses Apache anymore? by Anonymous Coward · 2015-01-13 02:44 · Score: 0

Quite a lot of people use Apache OpenOffice.
"Flink" means "clever"! by Terje+Mathisen · 2015-01-13 02:50 · Score: 3, Interesting

In Scandinavian languages (Norwegian, Danish, Swedish), flink means clever or accomplished.
Was this by accident or intentional? :-)
Terje

--
"almost all programming can be viewed as an exercise in caching"
1. Re:"Flink" means "clever"! by fhueske · 2015-01-13 03:07 · Score: 1
  
  :-) The original intention was the German meaning, which is agile or swift (Flink was started in Berlin). But clever is great as well ;-)
2. Re:"Flink" means "clever"! by Anonymous Coward · 2015-01-13 03:14 · Score: 0
  
  in Dutch it is sturdy or vigorous
3. Re:"Flink" means "clever"! by Anonymous Coward · 2015-01-13 03:43 · Score: 0
  
  flink (Middle Low German) = bright; quick
  |-> flink (German) = quick
  |-> flink (Nordic languages) = clever
  *snellaz (Proto-Germanic) = quick
  |-> snäll (Swedish) = nice, kind
  |-> snild (Danish) = clever
  |-> schnell (German) = quick
  So if Sapir/Whorf are right:
  Germans are all about speed, but neither clever or nice.
  Scandinavians like it clever and nice, but can't really relate to the concept of speed.
4. Re:"Flink" means "clever"! by Reaper9889 · 2015-01-13 04:13 · Score: 1
  
  Actually, flink means nice in Danish.
5. Re:"Flink" means "clever"! by Anonymous Coward · 2015-01-13 05:09 · Score: 0
  
  http://en.wikipedia.org/wiki/Flink
Hey I've got prior art! by flink · 2015-01-13 03:20 · Score: 2

See user name.
1. Re:Hey I've got prior art! by Anonymous Coward · 2015-01-13 19:44 · Score: 0
  
  Now I've met Flink :)
Welcome to Apache! by Anonymous Coward · 2015-01-13 03:33 · Score: 0

Where projects go to die!
Hadoop needs a fairly specialized problem by erikscott · 2015-01-13 03:57 · Score: 5, Interesting

I've been running Hadoop on a 400 node ethernet cluster for a couple years now, and Spark for a few months. I'll give Spark points for speed - as long as your problem fits in RAM, it screams. They have their problems, certainly. Hadoop's dependence on Java and Spark's dependence on Scala... seriously, Java for HPC? WTF? If you're running on anything but x86 Linux you need your head examined. C and Fortran, folks.
You're absolutely right- Hadoop needs the right kind of job. It needs a problem where processing is per-record and has no dependencies on any other record. That eliminates a lot of interesting problems right there. It needs colossal logical block sizes, both to keep the network and drives saturated, but also to keep from bottlenecking on the HDFS namenode. This strongly suggests a small number of utterly huge files - maybe a hundred 100G files. These problems are, commercially, rare. I'm doing genomics-related things, and my 3 to 60 gig files (about 3TB total) are probably not big enough.
Spark is pretty clever. As long as your problem fits in RAM. :-) Since you're writing code in Scala, you're (a) the only person who can be on call and (b) irreplacable, so on balance that may not be so bad. Just depends.
As far as "conventional" cluster programming, I think a good MPI programmer is about as hard to hire as a Scala programmer. MPI looks easy until you get into the corner cases, as I'm sure you've experienced yourself. Trying to do scatter/gather in an environment where worker nodes can vanish without warning is basically a whole lot of not fun. Then there's infiniband. Infiniband FDR is kind of... touchy. If you order a hundred cables, you'll get 98 good ones, and 2 will fail intermittently. It'd be nice if the vendor would label which two were bad, but somehow they don't do this. It was bad enough that Mellanox blamed an earnings miss on bad cables. Maybe they're overcome that? Probably. Maybe. I'll give Hadoop points for working around dead machines and crippled networks.
You know, I've wanted to try sector and sphere, but somehow never gotten around to it.
1. Re:Hadoop needs a fairly specialized problem by rockmuelle · 2015-01-13 04:46 · Score: 4, Interesting
  
  MPI is definitely for very specific problems and really isn't what I'd consider "conventional" cluster programming. Most people associate MPI with clusters and parallel computing, but if you look at what's actually running on most big clusters, it's almost always just batch jobs (or batch jobs implemented using MPI :) ).
  Interestingly, all my examples were on genomics problems (processing SOLiD and ION Torrent runs). We started going down the Hadoop path because we thought it'd be more accessible to the bioinformaticians. But, once we saw the performance differences (and, importantly, understood the source of them) we abandoned it pretty quickly for more appropriate hardware designs (fast disks, fat pipes, lots of RAM, and a few linux tuning tricks -- swappiness=0 is your friend). Incidentally, GATK suffered from these same core performance problems. The original claims that the map-reduce framework would make GATK fast were never actually tested, just simply claimed in the paper. GATK's performance was always been orders of magnitude less than the same algorithms implemented without map-reduce. But, it's from the Broad, so it must be perfect. ;)
  I like sector and sphere. We also did a POC with them and they performed much better than the alternatives. Unfortunately, they also required very good programmers to use effectively.
  Good stuff!
  -Chris
2. Re:Hadoop needs a fairly specialized problem by Anonymous Coward · 2015-01-13 05:44 · Score: 0
  
  If you're running on anything but x86 Linux you need your head examined. C and Fortran, folks.
  Are you mechanical engineer perchance? How about C++?
3. Re:Hadoop needs a fairly specialized problem by IAMBatman · 2015-01-13 08:29 · Score: 1
  
  MPI *is* conventional. Anything else which came after it (Hadoop) is for the people with less programming skills. Programming full on distributed systems is not a task which all Hadoop programmers can do.
4. Re:Hadoop needs a fairly specialized problem by heuermh · 2015-01-14 05:41 · Score: 1
  
  Real experience with genomics on clusters, Hadoop and Spark, bad-mouthing the Broad, who are you guys (or gals)? We're hiring. :)
Already met Flink by HalAtWork · 2015-01-13 15:47 · Score: 1

Back in the 90s on the Sega Mega CD.

--
Twinstiq, game news