<!doctype html><html lang="en" class="no-js"><head><meta charset="utf-8"> <!-- begin SEO --><title>Local recovery and high availability in Apache Flink - Marios Fragkoulis</title><meta property="og:locale" content="en-US"><meta property="og:site_name" content="Marios Fragkoulis"><meta property="og:title" content="Local recovery and high availability in Apache Flink"><link rel="canonical" href="https://mfragkoulis.github.io/development/flink.md"><meta property="og:url" content="https://mfragkoulis.github.io/development/flink.md"><meta property="og:description" content="Apache Flink is a stream processing system offering high-throughput, low-latency processing of millions of events per second with exactly-once processing guarantees even in the face of failures. However, Flink’s failover strategy stops data processing, resets the whole job graph, and resumes from the latest checkpoint, assuming a replayable input source such as Apache Kafka. This is not suitable for mission-critical jobs that need fast recovery, or for large-scale jobs running on tens or hundreds of machines, where the probability of a node failing becomes so high that successive failures may leave the job with no progress between them. This project provides a faster recovery strategy for higher availability that resets only the failed operators. The strategy sets up standby tasks that mirror running tasks and receive the state snapshots of the running tasks upon each checkpoint. Tasks that produce output log it until the next checkpoint. When a failure happens, the standby task that mirrors the failed running task takes over and substitutes for it. It requests the in-flight log from its upstream tasks and starts processing, thereby confining the recovery to the tasks that failed. This recovery strategy guarantees at-least-once processing, since some records may be processed twice: once by the failed task and once by the standby task that substituted for it. 
Techniques like causal logging can help strengthen the processing guarantee to exactly-once by also accounting for non-deterministic events such as record delivery order, processing-time windows, callback functions executed by timers, and checkpoint signals. Check out the work by Pedro Silvestre (https://github.com/PSilvestre). GitHub repo: https://github.com/tud-delta/flink"><meta name="twitter:site" content="@mariofragkoulis"><meta name="twitter:title" content="Local recovery and high availability in Apache Flink"><meta name="twitter:description" content="Apache Flink is a stream processing system offering high-throughput, low-latency processing of millions of events per second with exactly-once processing guarantees even in the face of failures. However, Flink’s failover strategy stops data processing, resets the whole job graph, and resumes from the latest checkpoint, assuming a replayable input source such as Apache Kafka. This is not suitable for mission-critical jobs that need fast recovery, or for large-scale jobs running on tens or hundreds of machines, where the probability of a node failing becomes so high that successive failures may leave the job with no progress between them. This project provides a faster recovery strategy for higher availability that resets only the failed operators. The strategy sets up standby tasks that mirror running tasks and receive the state snapshots of the running tasks upon each checkpoint. Tasks that produce output log it until the next checkpoint. When a failure happens, the standby task that mirrors the failed running task takes over and substitutes for it. It requests the in-flight log from its upstream tasks and starts processing, thereby confining the recovery to the tasks that failed. This recovery strategy guarantees at-least-once processing, since some records may be processed twice: once by the failed task and once by the standby task that substituted for it. 
Techniques like causal logging can help strengthen the processing guarantee to exactly-once by also accounting for non-deterministic events such as record delivery order, processing-time windows, callback functions executed by timers, and checkpoint signals. Check out the work by Pedro Silvestre (https://github.com/PSilvestre). GitHub repo: https://github.com/tud-delta/flink"><meta name="twitter:url" content="https://mfragkoulis.github.io/development/flink.md"><meta name="twitter:card" content="summary"> <script type="application/ld+json"> { "@context" : "http://schema.org", "@type" : "Person", "name" : "Marios Fragkoulis", "url" : "https://mfragkoulis.github.io", "sameAs" : null } </script> <!-- end SEO --><link href="https://mfragkoulis.github.io/feed.xml" type="application/atom+xml" rel="alternate" title="Marios Fragkoulis Feed"> <!-- http://t.co/dKP3o1e --><meta name="HandheldFriendly" content="True"><meta name="MobileOptimized" content="320"><meta name="viewport" content="width=device-width, initial-scale=1.0"> <script> document.documentElement.className = document.documentElement.className.replace(/\bno-js\b/g, '') + ' js '; </script> <!-- For all browsers --><link rel="stylesheet" href="https://mfragkoulis.github.io/assets/css/main.css"><meta http-equiv="cleartype" content="on"> <!-- start custom head snippets --><link rel="apple-touch-icon" sizes="57x57" href="https://mfragkoulis.github.io/images/apple-touch-icon-57x57.png?v=M44lzPylqQ"><link rel="apple-touch-icon" sizes="60x60" href="https://mfragkoulis.github.io/images/apple-touch-icon-60x60.png?v=M44lzPylqQ"><link rel="apple-touch-icon" sizes="72x72" href="https://mfragkoulis.github.io/images/apple-touch-icon-72x72.png?v=M44lzPylqQ"><link rel="apple-touch-icon" sizes="76x76" href="https://mfragkoulis.github.io/images/apple-touch-icon-76x76.png?v=M44lzPylqQ"><link rel="apple-touch-icon" sizes="114x114" href="https://mfragkoulis.github.io/images/apple-touch-icon-114x114.png?v=M44lzPylqQ"><link 
rel="apple-touch-icon" sizes="120x120" href="https://mfragkoulis.github.io/images/apple-touch-icon-120x120.png?v=M44lzPylqQ"><link rel="apple-touch-icon" sizes="144x144" href="https://mfragkoulis.github.io/images/apple-touch-icon-144x144.png?v=M44lzPylqQ"><link rel="apple-touch-icon" sizes="152x152" href="https://mfragkoulis.github.io/images/apple-touch-icon-152x152.png?v=M44lzPylqQ"><link rel="apple-touch-icon" sizes="180x180" href="https://mfragkoulis.github.io/images/apple-touch-icon-180x180.png?v=M44lzPylqQ"><link rel="icon" type="image/png" href="https://mfragkoulis.github.io/images/favicon-32x32.png?v=M44lzPylqQ" sizes="32x32"><link rel="icon" type="image/png" href="https://mfragkoulis.github.io/images/android-chrome-192x192.png?v=M44lzPylqQ" sizes="192x192"><link rel="icon" type="image/png" href="https://mfragkoulis.github.io/images/favicon-96x96.png?v=M44lzPylqQ" sizes="96x96"><link rel="icon" type="image/png" href="https://mfragkoulis.github.io/images/favicon-16x16.png?v=M44lzPylqQ" sizes="16x16"><link rel="manifest" href="https://mfragkoulis.github.io/images/manifest.json?v=M44lzPylqQ"><link rel="mask-icon" href="https://mfragkoulis.github.io/images/safari-pinned-tab.svg?v=M44lzPylqQ" color="#000000"><link rel="shortcut icon" href="/images/favicon.ico?v=M44lzPylqQ"><meta name="msapplication-TileColor" content="#000000"><meta name="msapplication-TileImage" content="https://mfragkoulis.github.io/images/mstile-144x144.png?v=M44lzPylqQ"><meta name="msapplication-config" content="https://mfragkoulis.github.io/images/browserconfig.xml?v=M44lzPylqQ"><meta name="theme-color" content="#ffffff"><link rel="stylesheet" href="https://mfragkoulis.github.io/assets/css/academicons.css"/> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { equationNumbers: { autoNumber: "all" } } }); </script> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ tex2jax: { inlineMath: [ ['$','$'], ["\\(","\\)"] ], processEscapes: true } }); </script> <script 
src='https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.4/latest.js?config=TeX-MML-AM_CHTML' async></script> <!-- end custom head snippets --></head><body> <!--[if lt IE 9]><div class="notice--danger align-center" style="margin: 0;">You are using an <strong>outdated</strong> browser. Please <a href="http://browsehappy.com/">upgrade your browser</a> to improve your experience.</div><![endif]--><div class="masthead"><div class="masthead__inner-wrap"><div class="masthead__menu"><nav id="site-nav" class="greedy-nav"> <button><div class="navicon"></div></button><ul class="visible-links"><li class="masthead__menu-item masthead__menu-item--lg"><a href="https://mfragkoulis.github.io/">Marios Fragkoulis</a></li><li class="masthead__menu-item"><a href="https://mfragkoulis.github.io/publications/">Publications</a></li><li class="masthead__menu-item"><a href="https://mfragkoulis.github.io/teaching/">Teaching</a></li><li class="masthead__menu-item"><a href="https://mfragkoulis.github.io/service/">Service</a></li><li class="masthead__menu-item"><a href="https://mfragkoulis.github.io/development/">Development</a></li><li class="masthead__menu-item"><a href="https://mfragkoulis.github.io/talks/">Talks</a></li></ul><ul class="hidden-links hidden"></ul></nav></div></div></div><div id="main" role="main"><div class="sidebar sticky"><div itemscope itemtype="http://schema.org/Person"><div class="author__avatar"> <img src="https://mfragkoulis.github.io/images/mfg.jpg" class="author__avatar" alt=""></div><div class="author__content"><h3 class="author__name"></h3></div><div class="author__urls-wrapper"> <button class="btn btn--inverse">Follow</button> <a href="mailto:M.Fragkoulis@tudelft.nl"><i class="fa fa-fw fa-envelope-square" aria-hidden="true"></i></a> <a href="https://scholar.google.gr/citations?user=bJiiymEAAAAJ&hl=en"><i class="ai ai-google-scholar-square ai-fw"></i></a> <a href="https://github.com/mfragkoulis"><i class="fa fa-fw fa-github" aria-hidden="true"></i></a> <a 
href="https://twitter.com/mariofragkoulis"><i class="fa fa-fw fa-twitter-square" aria-hidden="true"></i></a> <a href="https://www.linkedin.com/in/marios-fragkoulis-80141a53/"><i class="fa fa-fw fa-linkedin-square" aria-hidden="true"></i></a> <a href="https://orcid.org/0000-0002-0160-0855"><i class="ai ai-orcid-square ai-fw"></i></a> <br> <br><ul class="author__urls social-icons"></ul></div></div></div><article class="page" itemscope itemtype="http://schema.org/CreativeWork"><meta itemprop="headline" content="Local recovery and high availability in Apache Flink"><meta itemprop="description" content="Apache Flink is a stream processing system offering high-throughput, low-latency processing of millions of events per second with exactly-once processing guarantees even in the face of failures. However, Flink’s failover strategy stops data processing, resets the whole job graph, and resumes from the latest checkpoint, assuming a replayable input source such as Apache Kafka. This is not suitable for mission-critical jobs that need fast recovery, or for large-scale jobs running on tens or hundreds of machines, where the probability of a node failing becomes so high that successive failures may leave the job with no progress between them. This project provides a faster recovery strategy for higher availability that resets only the failed operators. The strategy sets up standby tasks that mirror running tasks and receive the state snapshots of the running tasks upon each checkpoint. Tasks that produce output log it until the next checkpoint. When a failure happens, the standby task that mirrors the failed running task takes over and substitutes for it. It requests the in-flight log from its upstream tasks and starts processing, thereby confining the recovery to the tasks that failed. 
This recovery strategy guarantees at-least-once processing, since some records may be processed twice: once by the failed task and once by the standby task that substituted for it. Techniques like causal logging can help strengthen the processing guarantee to exactly-once by also accounting for non-deterministic events such as record delivery order, processing-time windows, callback functions executed by timers, and checkpoint signals. Check out the work by Pedro Silvestre (https://github.com/PSilvestre). GitHub repo: https://github.com/tud-delta/flink"><div class="page__inner-wrap"><header><h1 class="page__title" itemprop="headline">Local recovery and high availability in Apache Flink</h1></header><section class="page__content" itemprop="text"></section><footer class="page__meta"></footer><section class="page__share"><h4 class="page__share-title">Share on</h4><a href="https://twitter.com/intent/tweet?text=https://mfragkoulis.github.io/development/flink.md" class="btn btn--twitter" title="Share on Twitter"><i class="fa fa-fw fa-twitter" aria-hidden="true"></i><span> Twitter</span></a> <a href="https://www.facebook.com/sharer/sharer.php?u=https://mfragkoulis.github.io/development/flink.md" class="btn btn--facebook" title="Share on Facebook"><i class="fa fa-fw fa-facebook" aria-hidden="true"></i><span> Facebook</span></a> <a href="https://plus.google.com/share?url=https://mfragkoulis.github.io/development/flink.md" class="btn btn--google-plus" title="Share on Google Plus"><i class="fa fa-fw fa-google-plus" aria-hidden="true"></i><span> Google+</span></a> <a href="https://www.linkedin.com/shareArticle?mini=true&url=https://mfragkoulis.github.io/development/flink.md" class="btn btn--linkedin" title="Share on LinkedIn"><i class="fa fa-fw fa-linkedin" aria-hidden="true"></i><span> LinkedIn</span></a></section><nav class="pagination"> <a href="#" class="pagination--pager disabled">Previous</a> <a href="https://mfragkoulis.github.io/development/dgsh.md" 
class="pagination--pager" title="The Directed Graph Shell (dgsh) ">Next</a></nav></div></article></div><div class="page__footer"><footer> <!-- start custom footer snippets --> <!-- end custom footer snippets --><div class="page__footer-follow"><ul class="social-icons"><li><strong>Follow:</strong></li><li><a href="https://twitter.com/mariofragkoulis"><i class="fa fa-fw fa-twitter-square" aria-hidden="true"></i> Twitter</a></li><li><a href="http://github.com/mfragkoulis"><i class="fa fa-fw fa-github" aria-hidden="true"></i> GitHub</a></li><li><a href="https://mfragkoulis.github.io/feed.xml"><i class="fa fa-fw fa-rss-square" aria-hidden="true"></i> Feed</a></li></ul></div><div class="page__footer-copyright">&copy; 2025 Marios Fragkoulis. Powered by <a href="http://jekyllrb.com" rel="nofollow">Jekyll</a> &amp; <a href="https://github.com/academicpages/academicpages.github.io">AcademicPages</a>, a fork of <a href="https://mademistakes.com/work/minimal-mistakes-jekyll-theme/" rel="nofollow">Minimal Mistakes</a>.</div></footer></div><script src="https://mfragkoulis.github.io/assets/js/main.min.js"></script> <script> (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','//www.google-analytics.com/analytics.js','ga'); ga('create', 'UA-127030690-1', 'auto'); ga('send', 'pageview'); </script></body></html>
