Discovering Anomalies in Real-Time with Apache Flink (mux.com)
81 points by GeneticGenesis on Feb 16, 2017 | 19 comments



Since I maintain a pretty large ETL (batch) application for a living, I am genuinely curious about this. How do you handle failure in event-processing systems? I mean, in batch it's simple - if there is a record (event) that causes an unexpected failure (or the program fails for some other reason, for example it runs out of space), we just restart the batch.

But in event processing, unless you can afford to skip events, how do you deal with that sort of thing, especially if the processing needs to keep track of internal state between events?

I read about event sourcing, which is kind of a solution to that, but add checkpoints and you pretty much have batch processing again.


Having done several event-processing systems (lately with Storm in Java/Clojure and Onyx in Clojure), I fail any problematic event out of the hot path, which sends it to an alternate queue where remediation passes attempt processing. This results in slightly delayed processing (we're talking a handful of seconds max), but keeps the high-throughput hot path very simple. Events that can't be handled in the secondary queue are logged/journaled, and a report is generated periodically when such events exist.

It's fairly rare that I would even attempt to track internal state in an event-processing system (a node will typically emit/attach all necessary information for the next node), but in cases where I do (real-time numerical calculations), we accept a small percentage of error. We'd typically write a health check around our confidence in that calculation and expose that to systems that need to interact with it.

In terms of errors that are systemic rather than due to malformed events, I'd reprocess out of the queue assuming the timing makes sense. For nodes that care about time, we'd write in checks using the event's timestamp as a guard against processing (failing events outside the time range).
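
Since the thread is about Flink, here is a minimal sketch of that hot-path/alternate-queue pattern using Flink's side outputs (the commenter's systems are Storm/Onyx; the source, the parsing step, and the print sinks below are placeholders, not their actual code):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;
    import org.apache.flink.util.OutputTag;

    public class HotPathWithDeadLetter {

        // Side-output tag for events that fail hot-path processing.
        static final OutputTag<String> DEAD_LETTER = new OutputTag<String>("dead-letter") {};

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            DataStream<String> events = env.socketTextStream("localhost", 9999); // placeholder source

            SingleOutputStreamOperator<String> hotPath = events.process(
                new ProcessFunction<String, String>() {
                    @Override
                    public void processElement(String event, Context ctx, Collector<String> out) {
                        try {
                            out.collect(parse(event));       // hot path stays simple
                        } catch (Exception e) {
                            ctx.output(DEAD_LETTER, event);  // route the bad event aside
                        }
                    }
                });

            hotPath.print();                              // normal downstream processing
            hotPath.getSideOutput(DEAD_LETTER).print();   // remediation queue / journal goes here

            env.execute("hot path with dead-letter side output");
        }

        // Placeholder for the real per-event work, which may throw on malformed input.
        static String parse(String event) {
            return event.trim().toUpperCase();
        }
    }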


Flink handles failures pretty well with its automated checkpointing mechanism (in addition to another feature called "Savepoints", which lets you take a manual snapshot of the current state of the streaming pipeline and restore from it later in case of failure).

You can find more details in the docs:

- https://ci.apache.org/projects/flink/flink-docs-release-1.2/...

- https://ci.apache.org/projects/flink/flink-docs-release-1.2/...
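
For a concrete feel, enabling automatic checkpointing is a one-liner on the execution environment; a minimal sketch, with an arbitrary 30-second interval. Savepoints are taken manually, e.g. with "bin/flink savepoint <jobId>" and restored with "bin/flink run -s <savepointPath>".

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointingExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Snapshot all operator state every 30 seconds with exactly-once semantics.
            env.enableCheckpointing(30_000L, CheckpointingMode.EXACTLY_ONCE);

            // ... define sources, operators, and sinks, then call env.execute().
        }
    }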


Handling failure in a distributed stream processing system is challenging, but Flink removes a lot of this complexity from the job of the application developer.

A high-availability Flink cluster will often use an Apache Zookeeper cluster to elect a leader Job Manager (coordinator) instance. One or more Task Managers (the systems that actually execute the pieces of a Flink application) discover the current Job Manager leader by querying Zookeeper.

Zookeeper tracks the current leader and running jobs. State for those jobs is written periodically to external storage, often an HDFS cluster. Flink's automatic checkpointing writes application state to storage at fixed intervals and, in the event of a failure, the job automatically resumes from the most recent checkpoint. There's also support for manual savepoints, which can be used to restore state when submitting a new job or recovering from a catastrophic failure.

Flink provides exactly-once guarantees within the context of the Flink application; any side-effects of your application, such as calls to external services or records written to a database, can happen multiple times if you're recovering from a failure.
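
As a rough sketch of the knobs involved (the HDFS path and retry values below are placeholders, not a recommended production setup): the state backend points checkpoint data at durable storage, and a restart strategy controls how the job is retried after a failure.

    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FaultToleranceConfig {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint operator state to durable external storage (placeholder HDFS path).
            env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

            // After a failure, retry the job up to 3 times, waiting 10 seconds between attempts.
            env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

            // ... build the pipeline and call env.execute() as usual.
        }
    }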


Spot on. Can you recommend some good articles or books dealing with these kinds of problems?

My current understanding is that event-based (in a very broad sense) systems are hard to "replay" in the case of a failure (an error in the data or just a bug) unless you build additional machinery, which decreases robustness. In contrast, the task of making a batch-processing system perform fast is easy and much better defined.


I highly recommend reading Tyler Akidau's article titled "The world beyond batch: Streaming 101": https://www.oreilly.com/ideas/the-world-beyond-batch-streami...

And its follow-up post: https://www.oreilly.com/ideas/the-world-beyond-batch-streami...


The product I work on, IBM Streams, handles this through a combination of checkpoints, a consistency protocol, and rollback to the last known-good consistent state for the entire application on failure detection. For more information, you can read a blog post on it: "Guaranteed tuple processing in InfoSphere Streams v4 with consistent regions", https://developer.ibm.com/streamsdev/2015/02/20/processing-t...

There's also an academic paper that was in VLDB's industry track in 2016, "Consistent Regions: Guaranteed Tuple Processing in IBM Streams", http://www.vldb.org/pvldb/vol9/p1341-jacquesSilva.pdf


"The Apache Flink project is a relative newcomer to the stream-processing space. Competing Open-Source platforms include Apache Spark, Apache Storm, and Twitter Heron."

Can someone explain why Apache are creating projects that compete with each other? Why not focus on one?


Apache often houses existing projects, sometimes becoming a home for formerly proprietary software that gets thrown over the wall. Remember Google Wave? That's Apache Wave now. Apache Storm started out as just Storm, open sourced after a Twitter acquisition. Twitter Heron was open sourced by Twitter. Flink is a fork of Stratosphere. Et cetera :) So: the ASF is doing no such thing; it's just providing a framework for open-source projects to function in.


When it comes to high-volume stream processing, there are various tradeoffs that result in incompatible design decisions, so it makes sense that there are multiple stream-processing projects.


I'm the author of this Mux blog post and would love to take any questions or comments, as well as suggestions for future posts. Thank you for your interest!


Can you elaborate on the "novel anomaly-detection algorithms" used here?


Sure! We evaluated several anomaly-detection tools & libraries. They included:

- tried-and-true statistical methods like probability density functions

- the Yahoo EGADS anomaly-detection library

- the Numenta HTM neural-network anomaly-detection library

We ruled out HTM due to AGPL licensing concerns. It's an interesting product, but wasn't a good fit for us at this point in time. EGADS and other basic statistical methods can actually get you pretty far.
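
For a sense of what the basic statistical approach looks like, here is a minimal, illustrative sliding-window z-score check (not Mux's actual algorithm): flag a value that lands too many standard deviations away from the mean of recent observations.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Minimal sliding-window z-score check (illustrative only).
    public class ZScoreDetector {
        private final Deque<Double> window = new ArrayDeque<>();
        private final int windowSize;
        private final double threshold;

        public ZScoreDetector(int windowSize, double threshold) {
            this.windowSize = windowSize;
            this.threshold = threshold;
        }

        // Returns true when the window is full and the value deviates strongly from its mean.
        public boolean isAnomalous(double value) {
            boolean anomalous = false;
            if (window.size() == windowSize) {
                double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
                double variance = window.stream()
                        .mapToDouble(v -> (v - mean) * (v - mean))
                        .average().orElse(0.0);
                double stdDev = Math.sqrt(variance);
                anomalous = stdDev > 0 && Math.abs(value - mean) / stdDev > threshold;
                window.removeFirst();  // slide the window forward
            }
            window.addLast(value);
            return anomalous;
        }
    }

For example, new ZScoreDetector(60, 3.0) would flag a per-minute error rate that sits more than three standard deviations away from the mean of the trailing hour.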


And what do you do with this video anomaly information?


Mux (https://mux.com/) collects performance metrics for video delivery & playback on the websites & apps of our customers. These metrics feed into our real-time alerting system. If the error-rate for a customer property (site) or video-title is exceptionally high then an alert will be triggered. Mux customers can configure alert notifications to be sent to Slack & email, and view a history of alerts in the Mux web dashboard. We also offer the ability to view breakdowns of playback failures, video start-up times, and more through our dashboard. This can be helpful for diagnosing playback issues related to specific browsers, geographies, ISPs, and more.


I'd like to see more about how they used Flink and less about their system architecture (which gives great detail, up until the data is processed with Flink).


Great suggestion! We'll have a follow-up post soon that goes into greater detail on the mechanics of how Mux uses Flink.


We ditched Spark Structured Streaming for Flink for a Kafka consumer processing 3B events per day. It's been extremely stable so far, and it costs half as much as the Spark cluster did.
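
For anyone curious what that wiring looks like, here's a minimal sketch of a Flink job reading from Kafka (broker addresses, group id, and topic name are placeholders; this is not the poster's actual setup):

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    public class KafkaToFlink {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000L); // checkpoints cover Kafka offsets along with operator state

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "kafka-1:9092,kafka-2:9092"); // placeholder brokers
            props.setProperty("group.id", "events-consumer");                    // placeholder group id

            DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props)); // placeholder topic

            events.print(); // replace with real processing and sinks

            env.execute("kafka consumer job");
        }
    }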


Could you provide more details about your cluster setup? How big is your cluster?



