Publications

Beyond Analytics: The Evolution of Stream Processing Systems

Published in SIGMOD (tutorial), 2020

Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. The goal of this tutorial is threefold. First, we aim to review and highlight noteworthy past research findings, which were largely ignored until very recently. Second, we intend to underline the differences between early (’00-’10) and modern (’11-’18) streaming systems, and how those systems have evolved through the years. Most importantly, we wish to turn the attention of the database community to recent trends: streaming systems are no longer used only for classic stream processing workloads, namely window aggregates and joins. Instead, modern streaming systems are being increasingly used to deploy general event-driven applications in a scalable fashion, challenging the design decisions, architecture and intended use of existing stream processing systems.
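
As a rough, self-contained illustration of the “classic” workload the tutorial contrasts with modern event-driven uses, the following Python sketch computes a per-key count over tumbling windows. The window size and the toy event stream are invented for illustration and do not come from the tutorial.

```python
from collections import defaultdict

# A tumbling-window count per key: the "classic" stream processing
# workload (window aggregates). Event times, keys, and the 10-second
# window size are illustrative assumptions.
WINDOW_SECONDS = 10

def tumbling_window_counts(events):
    """events: iterable of (event_time_seconds, key) pairs."""
    windows = defaultdict(int)  # (window_start, key) -> count
    for event_time, key in events:
        window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        windows[(window_start, key)] += 1
    return dict(windows)

stream = [(1, "a"), (3, "b"), (4, "a"), (12, "a"), (19, "b")]
print(tumbling_window_counts(stream))
# {(0, 'a'): 2, (0, 'b'): 1, (10, 'a'): 1, (10, 'b'): 1}
```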

Recommended citation: Paris Carbone, Marios Fragkoulis, Vasiliki Kalavri, and Asterios Katsifodimos. 2020. Beyond Analytics: The Evolution of Stream Processing Systems. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD ’20). Association for Computing Machinery, New York, NY, USA, 2651–2658. DOI:https://doi.org/10.1145/3318464.3383131 https://dl.acm.org/doi/pdf/10.1145/3318464.3383131

REMA: Graph Embeddings-based Relational Schema Matching

Published in the SEAData workshop at EDBT (short paper), 2020

Schema matching is the process of capturing correspondence between attributes of different datasets and it is one of the most important prerequisite steps for analyzing heterogeneous data collections. State-of-the-art schema matching algorithms that use simple schema- or instance-based similarity measures struggle with finding matches beyond the trivial cases. Semantics-based algorithms require the use of domain-specific knowledge encoded in a knowledge graph or an ontology. As a result, schema matching still remains a largely manual process, which is performed by few domain experts. In this paper we present the Relational Embeddings MAtcher, or REMA for short. REMA is a novel schema matching approach which captures the semantic similarity of attributes using relational embeddings: a technique which embeds database rows, columns and schema information into multidimensional vectors that can reveal semantic similarity. This paper aims at communicating our latest findings and at demonstrating REMA’s potential with a preliminary experimental evaluation.
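
The following Python sketch conveys the core matching idea under simplifying assumptions: each column is already represented by an embedding vector (the toy vectors below are invented) and columns are paired by cosine similarity. REMA’s actual pipeline, which derives the embeddings from rows, columns and schema information, is substantially more involved.

```python
import numpy as np

# Embedding-based schema matching in the spirit of REMA: columns are
# vectors; candidate matches are the most cosine-similar columns in the
# other schema. The vectors below are fabricated for illustration.
def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def match_columns(schema_a, schema_b, threshold=0.8):
    """schema_*: dict mapping column name -> embedding vector."""
    matches = []
    for name_a, vec_a in schema_a.items():
        # Pick the most similar column in the other schema.
        name_b, sim = max(
            ((n, cosine(vec_a, v)) for n, v in schema_b.items()),
            key=lambda pair: pair[1],
        )
        if sim >= threshold:
            matches.append((name_a, name_b, round(sim, 3)))
    return matches

customers = {"surname": np.array([0.9, 0.1, 0.2]),
             "dob": np.array([0.1, 0.8, 0.3])}
clients = {"last_name": np.array([0.85, 0.15, 0.25]),
           "birth_date": np.array([0.12, 0.82, 0.28])}
print(match_columns(customers, clients))
# [('surname', 'last_name', ...), ('dob', 'birth_date', ...)]
```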

Recommended citation: Christos Koutras, Marios Fragkoulis, Asterios Katsifodimos, Christoph Lofi: REMA: Graph Embeddings-based Relational Schema Matching. EDBT/ICDT Workshops 2020 http://ceur-ws.org/Vol-2578/SEAData5.pdf

Stateful functions as a service in action

Published in VLDB (demo paper), 2019

In the serverless model, users upload application code to a cloud platform and the cloud provider undertakes the deployment, execution and scaling of the application, relieving users from all operational aspects. Although very popular, current serverless offerings provide poor support for the management of local application state, the main reason being that managing state and keeping it consistent at large scale is very challenging. As a result, the serverless model is inadequate for executing stateful, latency-sensitive applications. In this paper we present a high-level programming model for developing stateful functions and deploying them in the cloud. Our programming model allows functions to retain state as well as call other functions. In order to deploy stateful functions in a cloud infrastructure, we translate functions and their data exchanges into a stateful dataflow graph. With this paper we aim at demonstrating that, using a modified version of an open-source dataflow engine as a runtime for stateful functions, we can deploy scalable and stateful services in the cloud with surprisingly low latency and high throughput.
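
The Python sketch below mimics the flavor of such a programming model under stated assumptions: the `stateful` decorator, the `send` helper and the function names are hypothetical stand-ins, and the in-process dispatcher merely emulates what the paper translates into a stateful dataflow graph.

```python
from collections import defaultdict

# Hypothetical stateful-functions API: functions keep per-key state and
# may call other functions. A real runtime would route `send` messages
# through a dataflow engine; here they are direct calls.
FUNCTIONS = {}
STATE = defaultdict(dict)  # function name -> {key: state dict}

def stateful(name):
    def register(fn):
        FUNCTIONS[name] = fn
        return fn
    return register

def send(name, key, payload):
    # In the real runtime this message would travel through the dataflow.
    FUNCTIONS[name](STATE[name].setdefault(key, {}), key, payload)

@stateful("cart")
def cart(state, key, payload):
    state.setdefault("items", []).append(payload["item"])
    if payload.get("checkout"):
        send("order", key, {"items": state["items"]})

@stateful("order")
def order(state, key, payload):
    state["last_order"] = payload["items"]
    print(f"order for {key}: {payload['items']}")

send("cart", "user-1", {"item": "book"})
send("cart", "user-1", {"item": "pen", "checkout": True})
```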

Recommended citation: Adil Akhter, Marios Fragkoulis, and Asterios Katsifodimos. 2019. Stateful functions as a service in action. Proc. VLDB Endow. 12, 12 (August 2019), 1890-1893. DOI: https://doi.org/10.14778/3352063.3352092. http://asterios.katsifodimos.com/assets/publications/stateful-functions.pdf

Operational stream processing: towards scalable and consistent event-driven applications

Published in EDBT, 2019

Over the last decade we have witnessed a widespread adoption of architectural styles such as microservices for building event-driven software applications and deploying them in cloud infrastructures. Such services favor the separation of a database into independent silos of data, each of which is owned entirely by a single service. As a result, traditional OLTP systems no longer fit the architectural picture and developers often turn to ad-hoc solutions that rarely support ACID transaction consistency. At the same time, we are witnessing the gradual maturation of distributed streaming dataflow systems. These systems have by now departed from the mere analysis of streaming windows and complex-event processing, employing sophisticated methods for managing state, keeping it consistent, and ensuring exactly-once processing guarantees in the presence of failures. The goal of this paper is threefold. First, we illustrate the requirements of stateful software services in terms of consistency and scalability. Second, we present how well existing solutions meet those requirements. Finally, we outline a set of challenging problems and propose research directions for enabling event-driven applications to be developed on top of streaming dataflow systems. We strongly believe that streaming dataflows can have a central place in service-oriented architectures, taking over the execution of ACID transactions and ensuring message delivery and processing, in order to perform scalable execution of services.

Recommended citation: Asterios Katsifodimos and Marios Fragkoulis. (2019). "Operational stream processing: towards scalable and consistent event-driven applications." EDBT'19. https://openproceedings.org/2019/conf/edbt/EDBT19_paper_317.pdf

Live interactive queries to a software application’s memory profile

Published in IET Software, 2018

Memory operations are critical to an application’s reliability and performance. To reason about their correctness and track opportunities for optimisations, sophisticated instrumentation frameworks, such as Valgrind and Pin, have been developed. Both provide only limited facilities for analysing the collected data. This work presents a Valgrind extension for examining a software application’s dynamic memory profile through live interactive analysis with SQL. The Pico COllections Query Library (PiCO QL) module maps Valgrind’s data structures that contain the instrumented application’s memory operations metadata to a relational interface. Queries are type-safe and the module imposes only a trivial overhead when idle. We evaluate our approach on ten applications and through a qualitative study. We find 900 KB of undefined bytes in bzip2 that account for 12% of its total memory use, and a performance-critical code execution path in the Unix commands sort and uniq. The referenced functions are part of glibc and have been independently modified to boost the library’s performance. The qualitative study has users rate the usefulness, usability, effort, correctness, and expressiveness of PiCO QL queries compared to Python scripts. The findings indicate that querying with PiCO QL incurs lower user effort.
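
To give a flavor of such analysis, the sketch below runs a PiCO QL-style SQL query over a mock table in SQLite; the table, its columns, and the rows are hypothetical stand-ins for the module’s actual relational views of Valgrind’s metadata.

```python
import sqlite3

# Mock relational view of memory-operation metadata. The memory_block
# table and its rows are fabricated; PiCO QL exposes comparable views
# over Valgrind's live data structures.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE memory_block (
    start_address INTEGER, size_bytes INTEGER, undefined_bytes INTEGER)""")
db.executemany("INSERT INTO memory_block VALUES (?, ?, ?)",
               [(0x1000, 4096, 512), (0x3000, 8192, 0), (0x6000, 2048, 2048)])

# Total undefined bytes and their share of allocated memory, similar in
# spirit to the bzip2 finding reported in the paper.
row = db.execute("""
    SELECT SUM(undefined_bytes),
           ROUND(100.0 * SUM(undefined_bytes) / SUM(size_bytes), 1)
    FROM memory_block""").fetchone()
print(f"undefined: {row[0]} bytes ({row[1]}% of allocations)")
```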

Recommended citation: Marios Fragkoulis, Diomidis Spinellis, and Panos Louridas. (2018). "Live interactive queries to a software application's memory profile." IET Software. http://ietdl.org/t/uIR5q

Measuring and understanding database schema quality

Published in International Conference on Software Engineering - Software Engineering in Practice track, 2018

Context: Databases are an integral element of enterprise applications. Similarly to code, database schemas are also prone to smells: best practice violations. Objective: We aim to explore database schema quality, associated characteristics and their relationships with other software artifacts. Method: We present a catalog of 13 database schema smells and elicit developers’ perspective through a survey. We extract embedded SQL statements and identify database schema smells by employing the DbDeo tool which we developed. We analyze 2925 production-quality systems (357 industrial and 2568 well-engineered open-source projects) and empirically study quality characteristics of their database schemas. In total, we analyze 629 million lines of code containing more than 393 thousand SQL statements. Results: We find that the index abuse smell occurs most frequently in database code, that the use of an ORM framework does not immunize the application against database smells, and that some database smells, such as adjacency list, are more prone to occur in industrial projects than in open-source projects. Our co-occurrence analysis shows that whenever the clone table smell in industrial projects or the values in attribute definition smell in open-source projects is spotted, it is very likely that other database smells are present in the project. Conclusion: The awareness and knowledge of database smells are crucial for developing high-quality software systems and can be enhanced by the adoption of better tools that help developers identify database smells early.
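
As a toy illustration of rule-based smell detection (much simpler than DbDeo’s actual implementation), the following Python sketch flags one smell, values in attribute definition, with a single regular expression over DDL statements.

```python
import re

# Heuristic for the "values in attribute definition" smell: a column
# restricted to a fixed list of values via CHECK ... IN (...), which
# often indicates data that belongs in a separate reference table.
# This single-rule detector is illustrative only.
VALUES_IN_ATTRIBUTE = re.compile(r"CHECK\s*\(\s*\w+\s+IN\s*\(", re.IGNORECASE)

def find_values_in_attribute(ddl_statements):
    return [ddl for ddl in ddl_statements if VALUES_IN_ATTRIBUTE.search(ddl)]

schema = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
    "CREATE TABLE orders (id INTEGER, status TEXT "
    "CHECK (status IN ('new', 'paid', 'shipped')))",
]
print(find_values_in_attribute(schema))
```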

Recommended citation: Tushar Sharma, Marios Fragkoulis, Stamatia Rizou, Magiel Bruntink, and Diomidis Spinellis. (2018). "Measuring and understanding database schema quality." 40th International Conference on Software Engineering - Software Engineering in Practice. pages 55-64. Gothenburg, Sweden. ACM. https://dl.acm.org/citation.cfm?id=3183529

The exception handling riddle: an empirical study on the Android API

Published in Journal of Systems and Software, 2018

We examine the use of Java exception types in the Android platform’s Application Programming Interface (API) reference documentation and their impact on the stability of Android applications. We develop a method that automatically assesses an API’s quality regarding the exceptions listed in the API’s documentation. We statically analyze ten versions of the Android platform’s API (14–23) and 3539 Android applications to determine inconsistencies between the exceptions that analysis can find in the source code and the exceptions that are documented. We cross-check the analysis of the Android platform’s API and applications with crash data from 901,274 application execution failures (crashes). We discover that almost 10% of the undocumented exceptions that static analysis can find in the Android platform’s API source code manifest themselves in crashes. Additionally, we observe that 38% of the undocumented exceptions that developers use in their client applications to handle API methods also manifest themselves in crashes. These findings argue for documenting the exceptions that might be thrown and are known to lead to execution failures. However, a randomized controlled trial we ran shows that such documentation improvements are ineffective and that making these exceptions checked is a more effective way of improving applications’ stability.
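
At its core, the documented-versus-thrown comparison is a set computation, sketched below on fabricated exception sets; the study performs it at scale with static analysis and real crash data.

```python
# Cross-check sketch: exceptions static analysis finds in an API method
# versus those its documentation lists, flagged against crash reports.
# All sets below are fabricated for illustration.
documented = {"java.io.IOException"}
statically_found = {"java.io.IOException", "java.lang.IllegalStateException"}
seen_in_crashes = {"java.lang.IllegalStateException"}

undocumented = statically_found - documented
manifested = undocumented & seen_in_crashes
print(f"undocumented: {undocumented}")
print(f"undocumented and crashing: {manifested}")
```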

Recommended citation: Maria Kechagia, Marios Fragkoulis, Panos Louridas, and Diomidis Spinellis. (2018). "The exception handling riddle: an empirical study on the Android API." Journal of Systems and Software. volume 142. pages 248-270. doi: 10.1016/j.jss.2018.04.034. https://doi.org/10.1016/j.jss.2018.04.034

House of cards: code smells in open-source C# repositories

Published in ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 2017

Background: Code smells are indicators of quality problems that make software hard to maintain and evolve. Given the importance of smells for source code maintainability, many studies have explored the characteristics of smells and analyzed their effects on software quality. Aim: We aim to investigate fundamental characteristics of code smells through an empirical study on frequently occurring smells that examines the inter-category and intra-category correlation between design and implementation smells. Method: The study mines 19 design smells and 11 implementation smells in 1988 C# repositories containing more than 49 million lines of code. The mined data are statistically analyzed using methods such as Spearman’s correlation and presented through hexbin and scatter plots. Results: We find that unutilized abstraction and magic number smells are the most frequently occurring smells in C# code. Our results also show that implementation and design smells exhibit strong inter-category correlation. The results of the co-occurrence analysis imply that whenever unutilized abstraction or magic number smells are found, it is very likely to find other smells from the same smell category in the project. Conclusions: Our experiment shows high average smell densities (14.7 for design smells and 55.8 for implementation smells) for open-source C# programs. Such high smell densities turn a software system into a house of cards, reflecting the fragility introduced into the system. Our study advocates greater awareness of smells and the adoption of regular refactoring within the developer community to avoid turning software into a house of cards.
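
The sketch below shows the kind of correlation computation involved, using SciPy (assumed available) on fabricated per-repository smell counts; the paper runs such analyses over 1988 repositories.

```python
from scipy.stats import spearmanr

# Spearman's rank correlation between per-repository counts of a design
# smell and an implementation smell. The counts are fabricated for
# illustration.
unutilized_abstraction = [12, 4, 30, 7, 22, 15]   # design smell counts
magic_number = [80, 25, 190, 40, 150, 95]         # implementation smell counts

rho, p_value = spearmanr(unutilized_abstraction, magic_number)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```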

Recommended citation: Tushar Sharma, Marios Fragkoulis, and Diomidis Spinellis. (2017). "House of cards: code smells in open-source C# repositories." 2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). pages: 424-429. Toronto, Canada. IEEE. https://ieeexplore.ieee.org/document/8170129/

Extending Unix pipelines to DAGs

Published in IEEE Transactions on Computers, 2017

The Unix shell dgsh provides an expressive way to construct sophisticated and efficient non-linear pipelines. Such pipelines can use standard Unix tools, as well as third-party and custom-built components. Dgsh allows the specification of pipelines that perform non-uniform non-linear processing. These form a directed acyclic process graph, which is typically executed by multiple processor cores, thus increasing the processing task’s throughput. A number of existing Unix tools have been adapted to take advantage of the new shell’s multiple pipe input/output capabilities. The shell supports visualization of the process graphs, which can also aid debugging. Dgsh was evaluated through a number of common data processing and domain-specific examples, and was found to offer an expressive way to specify processing topologies, while also generally increasing processing throughput.
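
The Python sketch below emulates one such DAG-shaped pipeline with invented branches: a single input fans out to two independent branches (a total line count, and a distinct line count via sort and uniq) whose results are then joined. dgsh expresses this directly in shell syntax; Python merely plays the coordinator here, and standard Unix tools are assumed to be on the PATH.

```python
import subprocess

# A non-linear (DAG-shaped) pipeline: one data source, two independent
# processing branches, one join point. Requires Unix tools sort, uniq, wc.
data = "apple\nbanana\napple\ncherry\n"

# Branch 1: count total lines.
total = subprocess.run(["wc", "-l"], input=data, text=True,
                       capture_output=True, check=True).stdout.strip()

# Branch 2: count distinct lines (sort | uniq).
sorted_out = subprocess.run(["sort"], input=data, text=True,
                            capture_output=True, check=True).stdout
unique = subprocess.run(["uniq"], input=sorted_out, text=True,
                        capture_output=True, check=True).stdout
distinct = unique.count("\n")

# Join point: combine the two branches' outputs.
print(f"total lines: {total}, distinct lines: {distinct}")
```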

Recommended citation: Diomidis Spinellis and Marios Fragkoulis. (2017). "Extending Unix pipelines to DAGs." IEEE Transactions on Computers. 66(9). doi: 10.1109/TC.2017.2695447. https://ieeexplore.ieee.org/document/7903579/

Technologies for main memory data analysis

PhD dissertation, Athens University of Economics and Business, Department of Management Science and Technology, 2017

This thesis studies methods and technologies for supporting queries on main-memory data, and how the currently widespread architecture of software systems affects such technologies. Based on findings from the literature we develop a method and a technology for performing interactive queries on data that reside in main memory, guided by the criteria of usefulness and usability. After an overview of the programming languages that suit data analysis we choose SQL, which has been the standard data manipulation language for decades. The method we develop represents program data structures in the relational terms that SQL requires. We implement the method in the C and C++ programming languages because of their wide use for the development of systems and applications. The implementation extends to two diagnostic tools that identify problems in software systems through queries on main-memory metadata related to their state. The overall evaluation of our approach involves its integration in three C++ software applications, in the Linux kernel, and in Valgrind, where we also perform a user study with students.

Recommended citation: Marios Fragkoulis. Technologies for main memory data analysis. PhD dissertation. Athens University of Economics and Business, Department of Management Science and Technology. 2017. http://mariosfragkoulis.gr/files/thesis-en-only.pdf

PiCO QL: A software library for runtime interactive queries on program data

Published in SoftwareX, Elsevier, 2016

PiCO QL is open-source C/C++ software whose scientific scope is real-time interactive analysis of in-memory data through SQL queries. It exposes a relational view of a system’s or application’s data structures, which is queryable through SQL. While the application or system is executing, users can input queries through a web-based interface or issue web service requests. Queries execute on the live data structures through the respective relational views. PiCO QL is a good candidate for ad-hoc data analysis in applications and for diagnostics in systems settings. Applications of PiCO QL include the Linux kernel, the Valgrind instrumentation framework, a GIS application, a virtual real-time observatory of stellar objects, and a source code analyser.

Recommended citation: Marios Fragkoulis, Diomidis Spinellis, and Panos Louridas. (2016). "PiCO QL: A software library for runtime interactive queries on program data." SoftwareX. volume 5. pages 134-138. Elsevier. https://www.sciencedirect.com/science/article/pii/S2352711016300176

Does your configuration code smell?

Published in 13th International Conference on Mining Software Repositories, 2016

Infrastructure as Code (IaC) is the practice of specifying computing system configurations through code and managing them through traditional software engineering methods. The wide adoption of configuration management and the increasing size and complexity of the associated code call for assessing, maintaining, and improving the configuration code’s quality. In this context, traditional software engineering knowledge and best practices associated with code quality management can be leveraged to assess and manage configuration code quality. We propose a catalog of 13 implementation and 11 design configuration smells, where each smell violates recommended best practices for configuration code. We analyzed 4,621 Puppet repositories containing 8.9 million lines of code and detected the cataloged implementation and design configuration smells. Our analysis reveals that the design configuration smells show 9% higher average co-occurrence among themselves than the implementation configuration smells. We also observed that configuration smells belonging to one smell category tend to co-occur with configuration smells belonging to another category when correlation is computed by volume of identified smells. Finally, design configuration smell density shows negative correlation with the size of a configuration management system, whereas implementation configuration smell density exhibits no correlation.
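
As a toy illustration of rule-based configuration smell detection (a single invented rule, not the paper’s catalog or detector), the Python sketch below flags hard-coded passwords in a Puppet manifest.

```python
import re

# Single illustrative rule: flag hard-coded passwords in Puppet code.
# The paper's catalog of 13 implementation and 11 design smells is
# broader and its detection rules differ.
HARDCODED_SECRET = re.compile(r"password\s*=>\s*'[^']+'", re.IGNORECASE)

manifest = """
user { 'deploy':
  ensure   => present,
  password => 'hunter2',
}
"""

for lineno, line in enumerate(manifest.splitlines(), start=1):
    if HARDCODED_SECRET.search(line):
        print(f"line {lineno}: possible hard-coded secret: {line.strip()}")
```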

Recommended citation: Tushar Sharma, Marios Fragkoulis, and Diomidis Spinellis. (2016). "Does your configuration code smell?." 13th International Conference on Mining Software Repositories (MSR ’16). pages: 189-200. Austin, Texas. ACM. https://dl.acm.org/citation.cfm?id=2901761

An interactive SQL relational interface for querying main-memory data structures

Published in Computing, Springer, 2015

Query formalisms and facilities have received significant attention in the past decades, resulting in the development of query languages with varying characteristics, many of which resemble SQL. Query facilities typically ship as part of database management systems or, sometimes, bundled with programming languages. For applications written in imperative programming languages, database management systems impose an expensive model transformation. In-memory data structures can represent sophisticated relationships in a manner that is efficient in terms of storage and processing overhead, but most general-purpose programming languages lack an interpreter and/or an expressive query language for issuing interactive queries. Interactive ad-hoc querying of program data structures is hard. This work presents a method and an implementation for representing an application’s arbitrary imperative programming data model as a queryable relational one. The Pico COllections Query Library (PiCO QL) uses a domain-specific language to define a relational representation of application data structures and an SQL interface implementation. Queries are issued interactively and are type-safe. We demonstrate our relational representation for objects and the library’s usefulness on three large C++ projects. PiCO QL enhances query expressiveness and boosts productivity compared to querying via traditional programming constructs.
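
The sketch below conveys the relational-view idea under a strong simplification: it copies Python objects into an in-memory SQLite table before querying, whereas PiCO QL maps C/C++ structures to relational views through its DSL and queries them in place.

```python
import sqlite3
from dataclasses import dataclass

# Expose an in-memory collection of objects as a relational table and
# query it with SQL. The Employee type and its rows are invented; this
# copy-based sketch only illustrates the relational-view concept.
@dataclass
class Employee:
    name: str
    department: str
    salary: int

employees = [Employee("Ada", "R&D", 95000),
             Employee("Grace", "R&D", 105000),
             Employee("Linus", "Ops", 90000)]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employee (name TEXT, department TEXT, salary INTEGER)")
db.executemany("INSERT INTO employee VALUES (?, ?, ?)",
               [(e.name, e.department, e.salary) for e in employees])

for row in db.execute("""SELECT department, AVG(salary)
                         FROM employee GROUP BY department"""):
    print(row)
```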

Recommended citation: Marios Fragkoulis, Diomidis Spinellis, and Panos Louridas. (2015). "An interactive SQL relational interface for querying main-memory data structures." Computing. 97(12). pages 1141-1164. Springer Vienna. https://link.springer.com/article/10.1007/s00607-015-0452-y

Relational access to Unix kernel data structures

Published in Ninth European Conference on Computer Systems (EuroSys), 2014

State-of-the-art kernel diagnostic tools, such as DTrace and SystemTap, provide a procedural interface for expressing analysis tasks. We argue that a relational interface to kernel data structures can offer complementary benefits for kernel diagnostics. This work contributes a method and an implementation for mapping a kernel’s data structures to a relational interface. The Pico COllections Query Library (PiCO QL) Linux kernel module uses a domain-specific language to define a relational representation of accessible Linux kernel data structures, a parser to analyze the definitions, and a compiler to implement an SQL interface to the data structures. It then evaluates queries written in SQL against the kernel’s data structures. PiCO QL queries are interactive and type-safe. Unlike SystemTap and DTrace, PiCO QL is less intrusive because it does not require kernel instrumentation; instead, it hooks into existing kernel data structures through the module’s source code. PiCO QL imposes no overhead when idle and needs access only to the kernel data structures that contain information relevant to the input queries. We demonstrate PiCO QL’s usefulness by presenting Linux kernel queries that provide meaningful custom views of system resources and pinpoint issues such as security vulnerabilities and performance problems.
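
The sketch below imitates the query style on mock, user-space stand-ins for kernel views: the process and open_file tables and all their rows are invented, whereas the actual module evaluates SQL against live kernel structures without copying them.

```python
import sqlite3

# Mock relational views of kernel data structures; in the real module
# such tables are backed directly by in-kernel structures.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE process (pid INTEGER, name TEXT, uid INTEGER);
CREATE TABLE open_file (pid INTEGER, path TEXT, mode TEXT);
INSERT INTO process VALUES (101, 'sshd', 0), (202, 'editor', 1000);
INSERT INTO open_file VALUES (101, '/etc/shadow', 'r'),
                             (202, '/home/u/notes', 'rw');
""")

# Diagnostic-style query: which root-owned processes hold files open?
query = """
SELECT p.pid, p.name, f.path
FROM process p JOIN open_file f ON p.pid = f.pid
WHERE p.uid = 0
"""
for row in db.execute(query):
    print(row)
```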

Recommended citation: Marios Fragkoulis, Diomidis Spinellis, Panos Louridas, and Angelos Bilas. (2014). "Relational access to Unix kernel data structures." Ninth European Conference on Computer Systems (EuroSys ’14). pages 1-14. Amsterdam, Netherlands. ACM. https://dl.acm.org/citation.cfm?id=2592798.2592802