Apache Flink 1.9.0 Release Announcement

continuations · on Aug 25, 2019

Apache has a large number of stream processing frameworks:

Flink vs Spark vs Storm vs Kafka vs Samza vs Apex

How do they compare? How would you choose which one to use?

StevePerkins · on Aug 25, 2019

I don't have experience with Samza or Apex, but as for the first three:

1. Flink - Focused on stateful stream processing.

2. Spark - Focused on batch processing. Can be used for continuous streams, but approaches them as "micro-batches".

3. Kafka - A message queue system (for all practical purposes). Has an optional stream processing add-on for basic needs.
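To make the difference between points 1 and 2 concrete, here is a toy sketch in plain Python. It uses neither the Flink nor the Spark API, and every name in it is made up for illustration. It also assumes count-based batches to keep the example deterministic, whereas real micro-batch engines usually cut batches by time interval.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical event shape for this sketch: (key, value)
Event = Tuple[str, int]


def per_event_totals(events: Iterable[Event]) -> Iterator[Tuple[str, int]]:
    """Record-at-a-time style: update keyed state and emit after every event."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in events:
        totals[key] += value
        yield key, totals[key]


def micro_batch_totals(
    events: Iterable[Event], batch_size: int
) -> Iterator[Dict[str, int]]:
    """Micro-batch style: buffer events, then process each buffer as a small batch."""
    totals: Dict[str, int] = defaultdict(int)
    batch: List[Event] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            for key, value in batch:
                totals[key] += value
            yield dict(totals)  # one result snapshot per batch
            batch = []
    if batch:  # flush the final, partially filled batch
        for key, value in batch:
            totals[key] += value
        yield dict(totals)
```

The first function produces a result as soon as each event arrives; the second only produces a result once a batch is full, so results lag by up to one batch.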

Separate use cases and strengths aside, it's worth calling out that all of these products are primarily backed by completely different companies. Apache is a consortium made of many companies, and serves as common branding for "community editions" of their "enterprise edition" products. There can be quite a lot of overlap between sponsored products in this consortium.

lern_too_spel · on Aug 25, 2019

Spark supports both microbatch and continuous stream processing.

Apache Software Foundation is not a consortium made of many companies but a single non-profit that provides organizational support for open source projects, some of which have contributors employed as such by other companies and some of which have only volunteer contributors.

nivertech · on Aug 25, 2019

4. Apache Beam (same model as Google Cloud Dataflow)

jdm2212 · on Aug 25, 2019

The only two of those I know are Kafka and Flink. For those two: Flink is much more full-featured and performant (basically the full Google Dataflow API, and several orders of magnitude faster than Kafka Streams), but Kafka Streams has a stupid simple API that is useful if you need streaming because $reason but don't care about scaling up to infinity. If you're doing some really hacky demoware, Kafka Streams will probably be faster to spin up because you just need the Kafka Streams jar and a Kafka cluster.

BFLpL0QNek · on Aug 25, 2019

Do you have any numbers to back up the claim that Flink is faster than KStreams, and under what scenario?

I am genuinely interested, as I use KStreams a lot, but the engineering discipline in the API leaves a lot to be desired, and I am more than happy to switch if Flink is that much better.

jdm2212 · on Aug 25, 2019

Here's a benchmark of KStreams and Flink [1]. Note that the Flink vs Spark comparison is disputed [2], but both Flink and Spark are several orders of magnitude faster than KStreams. This is inevitable given KStreams' architecture -- it stores all its state in Kafka rather than in a data store with data structures optimized for the use case, and it doesn't do much coordination among workers. KStreams is there if you want streaming semantics on top of a small-ish Kafka topic you own, but don't care too much about perf. Deploying and maintaining Flink is a much bigger hassle than KStreams -- you need DevOps support to get Flink running, whereas KStreams runs (albeit quite slowly) inside your application with no new state store needed.

Confluent has a good discussion of the ownership issue (DevOps for Flink, devs for KStreams) here [3] though they seriously downplay the huge gap in perf.

[1] https://databricks.com/blog/2017/10/11/benchmarking-structur...

[2] https://www.ververica.com/blog/curious-case-broken-benchmark...

[3] https://www.confluent.io/blog/apache-flink-apache-kafka-stre...

je42 · on Aug 25, 2019

Hmm. I found this more recent benchmark, where Flink was still faster but KStreams' performance was much closer than in the 2017 benchmarks.

I guess KStreams improved its performance over time? Or is the benchmark design just different?

barrkel · on Aug 25, 2019

You don't need me to search the internet for you, but https://thenewstack.io/apache-gets-another-real-time-stream-... has some comparisons.

There's also Apache Beam, which is an API for streaming, and has Flink and Apex execution engines. Google's Cloud Dataflow is another implementation of Apache Beam.

As to which one to choose, you need to evaluate them; there are no simple answers. If you have Hadoop already then Apex may be a better fit than Flink; OTOH if you do Akka stuff already, then Flink might integrate better with your stack. If you have more batch than streaming use cases, maybe you want Spark. Etc.

readme3 · on Aug 25, 2019

Also include Storm in the mix. Storm 2.0 was released recently. We have been using Storm for a long time and we really like its a) simple programming model, b) support for a wide variety of sources (e.g. Kinesis, EventHub), and c) easy troubleshooting. We did evaluate Spark Streaming (we use Spark for batch workloads and it works well), but fell back to Storm because of the above.

jdm2212 · on Aug 25, 2019

This is exciting! I've been using Flink a lot lately, and fine-grained recovery is going to be very useful for [work stuff]!

whoevercares · on Aug 25, 2019

From a WeChat post I heard that 1.5M LOC changed. Wow.