Meet Flink, the Apache Software Foundation's Newest Top-Level Project
Open source data-processing engine Flink, after just nine months' incubation with the Apache Software Foundation, has been elevated to top-level status, joining other ASF projects like OpenOffice and CloudStack.
An anonymous reader writes: The data-processing engine, which offers APIs in Java and Scala as well as specialized APIs for graph processing, is presented as an alternative to Hadoop's MapReduce component with its own runtime. Yet the system still provides access to Hadoop's distributed file system and YARN resource manager. The open-source community around Flink has steadily grown since the project's inception at the Technical University of Berlin in 2009. Now at version 0.7.0, Flink lists more than 70 contributors and sponsors, including representatives from Hortonworks, Spotify and Data Artisans (a German startup devoted primarily to the development of Flink).
(For more about ASF incubation, and what the Foundation's stewardship means, see our interview from last summer with ASF executive VP Rich Bowen.)
A big data project to keep track of all of Apache's big data projects. Seems like there's a new one every month.
Where do you find an overview to give a clue as to what it is and why it is a good idea?
We need another distributed system for counting words like we need another javascript framework for writing a Todo list app.
9 months were enough for me.
Sounds like slang for flicking boogers while taking a piss
Quite a lot of people use Apache OpenOffice.
In Scandinavian languages (Norwegian, Danish, Swedish), flink means clever or accomplished.
Was this by accident or intentional? :-)
Terje
"almost all programming can be viewed as an exercise in caching"
See user name.
Where projects go to die!
I've been running Hadoop on a 400 node ethernet cluster for a couple years now, and Spark for a few months. I'll give Spark points for speed - as long as your problem fits in RAM, it screams. They have their problems, certainly. Hadoop's dependence on Java and Spark's dependence on Scala... seriously, Java for HPC? WTF? If you're running on anything but x86 Linux you need your head examined. C and Fortran, folks.
You're absolutely right: Hadoop needs the right kind of job. It needs a problem where processing is per-record and has no dependencies on any other record. That eliminates a lot of interesting problems right there. It needs colossal logical block sizes, both to keep the network and drives saturated and to keep from bottlenecking on the HDFS namenode. This strongly suggests a small number of utterly huge files - maybe a hundred 100G files. These problems are, commercially, rare. I'm doing genomics-related things, and my 3 to 60 gig files (about 3TB total) are probably not big enough.
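For anyone wondering what "per-record, no dependencies on any other record" looks like in practice, here is a toy sketch in plain Python. It uses no Hadoop API at all; `map_record`, `merge_counts`, and `word_count` are names made up for illustration, and the two-line input is invented. The point is only that the map step sees one record at a time and the merge step is associative, which is roughly the shape of job that splits cleanly across nodes.

```python
from collections import Counter
from functools import reduce


def map_record(line):
    """Map step: looks at a single record only, so any worker could run it alone."""
    return Counter(line.lower().split())


def merge_counts(left, right):
    """Reduce step: adding Counters is associative, so partial results merge in any order."""
    return left + right


def word_count(records):
    """Apply the map step to every record, then fold the partial counts together."""
    partials = map(map_record, records)
    return reduce(merge_counts, partials, Counter())


# Hypothetical input; a real job would stream records from very large files.
records = ["to be or not to be", "that is the question"]
totals = word_count(records)
print(totals["to"], totals["be"], totals["question"])  # prints: 2 2 1
```

A problem where one record's result depends on another record (a graph traversal, say) does not decompose this way, which is why it is a poor fit for this model.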
Spark is pretty clever. As long as your problem fits in RAM. :-) Since you're writing code in Scala, you're (a) the only person who can be on call and (b) irreplaceable, so on balance that may not be so bad. Just depends.
As far as "conventional" cluster programming, I think a good MPI programmer is about as hard to hire as a Scala programmer. MPI looks easy until you get into the corner cases, as I'm sure you've experienced yourself. Trying to do scatter/gather in an environment where worker nodes can vanish without warning is basically a whole lot of not fun. Then there's InfiniBand. InfiniBand FDR is kind of... touchy. If you order a hundred cables, you'll get 98 good ones, and 2 will fail intermittently. It'd be nice if the vendor would label which two were bad, but somehow they don't do this. It was bad enough that Mellanox blamed an earnings miss on bad cables. Maybe they've overcome that? Probably. Maybe. I'll give Hadoop points for working around dead machines and crippled networks.
You know, I've wanted to try Sector and Sphere, but somehow I've never gotten around to it.
Back in the 90s on the Sega Mega CD.
Twinstiq, game news