Development

Local recovery and high availability in Apache Flink

Apache Flink is a stream processing system offering high-throughput, low-latency processing of millions of events per second with exactly-once processing guarantees, even in the face of failures. However, Flink's default failover strategy stops data processing, resets the whole job graph, and resumes from the latest checkpoint, assuming a replayable input source such as Apache Kafka. This is unsuitable for mission-critical jobs that need fast recovery, or for large-scale jobs running on tens or hundreds of machines, where the probability of some node failing grows so large that repeated failures can leave the job making no progress between them. This project provides a faster recovery strategy, for higher availability, that resets only the failed operators. The strategy sets up standby tasks that mirror running tasks and receive the running tasks' state snapshots at each checkpoint. Tasks log the output they produce until the next checkpoint. When a failure happens, the standby task that mirrors the failed running task takes its place: it requests the in-flight log from its upstream tasks and starts processing, thereby confining recovery to the tasks that actually failed. This recovery strategy guarantees at-least-once processing, since some records may be processed twice: once by the failed task and once by the standby task that substituted it. Techniques like causal logging can strengthen the guarantee to exactly-once by also accounting for non-deterministic events such as record delivery order, processing-time windows, callback functions executed by timers, and checkpoint signals. Check out the work by Pedro Silvestre (https://github.com/PSilvestre).
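The mechanics described above can be sketched with a small simulation. This is not Flink's API; the class and method names below are hypothetical, and the sketch only illustrates why replaying an upstream in-flight log from the last checkpointed snapshot yields at-least-once (rather than exactly-once) processing.

```python
# Illustrative sketch (not Flink's API): a standby task restores the last
# checkpointed state and replays its upstream's in-flight log, so records
# emitted after the checkpoint may be processed twice.

class UpstreamTask:
    """Produces records and logs its output until the next checkpoint."""
    def __init__(self):
        self.inflight_log = []       # records emitted since the last checkpoint

    def emit(self, record):
        self.inflight_log.append(record)
        return record

    def checkpoint(self):
        self.inflight_log.clear()    # a completed checkpoint truncates the log


class CountingTask:
    """A stateful operator; its state (a count) is what gets snapshotted."""
    def __init__(self, state=0):
        self.count = state
        self.processed = []

    def process(self, record):
        self.count += 1
        self.processed.append(record)

    def snapshot(self):
        return self.count


upstream = UpstreamTask()
running = CountingTask()

# Records 1-3 are processed and then covered by a checkpoint; the standby
# task receives the state snapshot taken at that point.
for r in [1, 2, 3]:
    running.process(upstream.emit(r))
checkpointed_state = running.snapshot()
upstream.checkpoint()

# Records 4-5 arrive after the checkpoint, then the running task fails.
for r in [4, 5]:
    running.process(upstream.emit(r))

# The standby substitutes the failed task: it starts from the snapshot and
# requests and replays the upstream in-flight log.
standby = CountingTask(state=checkpointed_state)
for r in upstream.inflight_log:
    standby.process(r)

# Records 4 and 5 were processed by both tasks: at-least-once semantics.
duplicates = sorted(set(running.processed[3:]) & set(standby.processed))
print(duplicates)      # -> [4, 5]
print(standby.count)   # -> 5
```

The duplicated records are exactly those between the last checkpoint and the failure, which is why additional machinery such as causal logging is needed to reach exactly-once.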

  1. Github repo: https://github.com/tud-delta/flink

The Directed Graph Shell (dgsh)

The directed graph shell, dgsh (pronounced /dæɡʃ/ — dagsh), provides an expressive way to construct sophisticated and efficient big data set and stream processing pipelines using existing Unix tools as well as custom-built components. It is a Unix-style shell (based on bash) allowing the specification of pipelines with non-linear non-uniform operations. These form a directed acyclic process graph, which is typically executed by multiple processor cores, thus increasing the operation’s processing throughput.
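A dgsh script expresses such non-linear graphs natively in shell syntax; the sketch below approximates the same idea in plain Python on a Unix system, scattering one input stream to two concurrently running Unix tools and gathering their results. The specific tools (`wc -l`, `wc -w`) are just an example graph, not taken from the dgsh documentation.

```python
import subprocess

# Fan one input stream out to two Unix tools running concurrently, then
# join their outputs -- a tiny instance of the directed acyclic process
# graphs that dgsh scripts describe directly in shell notation.
text = b"hello world\nfoo bar baz\n"

wc_l = subprocess.Popen(["wc", "-l"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
wc_w = subprocess.Popen(["wc", "-w"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)

for p in (wc_l, wc_w):       # scatter: feed both branches the same stream
    p.stdin.write(text)
    p.stdin.close()

lines = int(wc_l.stdout.read())  # gather: collect each branch's result
words = int(wc_w.stdout.read())
wc_l.wait()
wc_w.wait()
print(lines, words)  # -> 2 5
```

Because the two branches run as separate processes, a multi-core machine executes them in parallel, which is the throughput benefit the dgsh description refers to.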

  1. Project page: https://www.spinellis.gr/sw/dgsh
  2. Github repo: https://github.com/dspinellis/dgsh

The Pico Collections Query Language (PiCO QL) software library

PiCO QL is an SQL query interface to C++ collections and C data structures. It is also configurable as a loadable Linux kernel module and as an extension to Valgrind tools.
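To convey the flavor of querying program data structures with SQL, here is a rough Python analogue: PiCO QL itself maps live C/C++ collections to SQL virtual tables in place, whereas this sketch simply copies some hypothetical in-memory records into an in-memory SQLite table to show the query style. The table and column names are invented for illustration.

```python
import sqlite3

# Hypothetical in-memory data: (pid, name, rss_kb) records such as a
# C program might hold in a struct array.
processes = [
    (1, "init", 1024),
    (42, "sshd", 2048),
    (99, "bash", 512),
]

# Expose the records to SQL. PiCO QL avoids this copy by presenting the
# native data structures as virtual tables; SQLite here only mimics the
# resulting query interface.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE process (pid INTEGER, name TEXT, rss_kb INTEGER)")
db.executemany("INSERT INTO process VALUES (?, ?, ?)", processes)

rows = db.execute(
    "SELECT name FROM process WHERE rss_kb > 1000 ORDER BY rss_kb DESC"
).fetchall()
print([name for (name,) in rows])  # -> ['sshd', 'init']
```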

  1. Github repo: https://github.com/mfragkoulis/PiCO_QL