Live interactive queries to a software application’s memory profile

Published in IET Software, 2018

Memory operations are critical to an application’s reliability and performance. To reason about their correctness and track opportunities for optimisations, sophisticated instrumentation frameworks, such as Valgrind and Pin, have been developed. Both provide only limited facilities for analysing the collected data. This work presents a Valgrind extension for examining a software application’s dynamic memory profile through live interactive analysis with SQL. The Pico COllections Query Library (PiCO QL) module maps Valgrind’s data structures that contain the instrumented application’s memory operations metadata to a relational interface. Queries are type-safe and the module imposes only a trivial overhead when idle. We evaluate our approach on ten applications and through a qualitative study. We find 900 KB of undefined bytes in bzip2 that account for 12% of its total memory use and a performance-critical code execution path in the Unix commands sort and uniq. The referenced functions are part of glibc and have been independently modified to boost the library’s performance. The qualitative study has users rate the usefulness, usability, effort, correctness, and expressiveness of PiCO QL queries compared to Python scripts. The findings indicate that querying with PiCO QL incurs lower user effort.

Recommended citation: Marios Fragkoulis, Diomidis Spinellis, and Panos Louridas. (2018). "Live interactive queries to a software application's memory profile." IET Software.

Measuring and understanding database schema quality

Published in International Conference on Software Engineering - Software Engineering in Practice track, 2018

Context: Databases are an integral element of enterprise applications. Similarly to code, database schemas are also prone to smells, i.e., best practice violations. Objective: We aim to explore database schema quality, associated characteristics and their relationships with other software artifacts. Method: We present a catalog of 13 database schema smells and elicit developers’ perspective through a survey. We extract embedded SQL statements and identify database schema smells by employing the DbDeo tool which we developed. We analyze 2925 production-quality systems (357 industrial and 2568 well-engineered open-source projects) and empirically study quality characteristics of their database schemas. In total, we analyze 629 million lines of code containing more than 393 thousand SQL statements. Results: We find that the index abuse smell occurs most frequently in database code, that the use of an ORM framework does not make the application immune to database smells, and that some database smells, such as adjacency list, are more prone to occur in industrial projects compared to open-source projects. Our co-occurrence analysis shows that whenever the clone table smell in industrial projects and the values in attribute definition smell in open-source projects get spotted, other database smells are very likely to be found in the project. Conclusion: The awareness and knowledge of database smells are crucial for developing high-quality software systems and can be enhanced by the adoption of better tools helping developers to identify database smells early.

Recommended citation: Tushar Sharma, Marios Fragkoulis, Stamatia Rizou, Magiel Bruntink, and Diomidis Spinellis. (2018). "Measuring and understanding database schema quality." 40th International Conference on Software Engineering - Software Engineering in Practice. pages 55-64. Gothenburg, Sweden. ACM.

The exception handling riddle: an empirical study on the Android API

Published in Journal of Systems and Software, 2018

We examine the use of the Java exception types in the Android platform’s Application Programming Interface (API) reference documentation and their impact on the stability of Android applications. We develop a method that automatically assesses an API’s quality regarding the exceptions listed in the API’s documentation. We statically analyze ten versions of the Android platform’s API (14–23) and 3539 Android applications to determine inconsistencies between exceptions that analysis can find in the source code and exceptions that are documented. We cross-check the analysis of the Android platform’s API and applications with crash data from 901,274 application execution failures (crashes). We discover that almost 10% of the undocumented exceptions that static analysis can find in the Android platform’s API source code manifest themselves in crashes. Additionally, we observe that 38% of the undocumented exceptions that developers use in their client applications to handle API methods also manifest themselves in crashes. These findings argue for documenting known might-thrown exceptions that lead to execution failures. However, a randomized controlled trial we run shows that relevant documentation improvements are ineffective and that making such exceptions checked is a more effective way for improving applications’ stability.

Recommended citation: Maria Kechagia, Marios Fragkoulis, Panos Louridas, and Diomidis Spinellis. (2018). "The exception handling riddle: an empirical study on the Android API." Journal of Systems and Software. volume 142. pages 248-270.

House of cards: code smells in open-source C# repositories

Published in ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 2017

Background: Code smells are indicators of quality problems that make software hard to maintain and evolve. Given the importance of smells in the source code’s maintainability, many studies have explored the characteristics of smells and analyzed their effects on the software’s quality. Aim: We aim to investigate fundamental characteristics of code smells through an empirical study on frequently occurring smells that examines inter-category and intra-category correlation between design and implementation smells. Method: The study mines 19 design smells and 11 implementation smells in 1988 C# repositories containing more than 49 million lines of code. The mined data are statistically analyzed using methods such as Spearman’s correlation and presented through hexbin and scatter plots. Results: We find that unutilized abstraction and magic number smells are the most frequently occurring smells in C# code. Our results also show that implementation and design smells exhibit strong inter-category correlation. The results of co-occurrence analysis imply that whenever unutilized abstraction or magic number smells are found, other smells from the same smell category are very likely to be found in the project. Conclusions: Our experiment shows high average smell density (14.7 and 55.8 for design and implementation smells respectively) for open source C# programs. Such high smell densities turn a software system into a house of cards reflecting the fragility introduced in the system. Our study advocates greater awareness of smells and the adoption of regular refactoring within the developer community to avoid turning software into a house of cards.
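The abstract above mentions Spearman’s correlation between smell categories. As a minimal sketch of that statistic only (not the study’s analysis pipeline), the following computes a Spearman rank correlation in plain Python. The per-repository smell counts are hypothetical values invented for illustration, and the helper names are assumptions.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical smell counts for five repositories (not data from the study).
design_smells = [3, 10, 7, 25, 1]
implementation_smells = [12, 40, 30, 90, 5]
rho = spearman(design_smells, implementation_smells)
```

Because the hypothetical counts rise and fall together, `rho` is at its maximum here; real repository data would generally give a value strictly between -1 and 1.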

Recommended citation: Tushar Sharma, Marios Fragkoulis, and Diomidis Spinellis. (2017). "House of cards: code smells in open-source C# repositories." 2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). pages 424-429. Toronto, Canada. IEEE.

Extending Unix pipelines to DAGs

Published in IEEE Transactions on Computers, 2017

The Unix shell dgsh provides an expressive way to construct sophisticated and efficient non-linear pipelines. Such pipelines can use standard Unix tools, as well as third-party and custom-built components. Dgsh allows the specification of pipelines that perform non-uniform non-linear processing. These form a directed acyclic process graph, which is typically executed by multiple processor cores, thus increasing the processing task’s throughput. A number of existing Unix tools have been adapted to take advantage of the new shell’s multiple pipe input/output capabilities. The shell supports visualization of the process graphs, which can also aid debugging. Dgsh was evaluated through a number of common data processing and domain-specific examples, and was found to offer an expressive way to specify processing topologies, while also generally increasing processing throughput.
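To make the idea of a non-linear pipeline concrete, here is a loose Python analogy: one input is scattered to two independent processing branches that may run concurrently, and their outputs are gathered at the end. This is not dgsh syntax and does not use dgsh; the function names and the sample input are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def count_lines(text):
    """Branch 1: count the lines of the input."""
    return len(text.splitlines())


def count_unique_words(text):
    """Branch 2: count the distinct whitespace-separated words."""
    return len(set(text.split()))


def run_graph(text):
    """Scatter the same input to both branches, then gather the results.

    The two branches are independent nodes of a small directed acyclic
    graph, so they can be scheduled concurrently.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        lines = pool.submit(count_lines, text)
        words = pool.submit(count_unique_words, text)
        return {"lines": lines.result(), "unique_words": words.result()}


summary = run_graph("a b\nb c\n")
```

A linear pipeline would have to read the input twice or run the two steps in sequence; the graph form lets both consumers share one source.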

Recommended citation: Diomidis Spinellis and Marios Fragkoulis. (2017). "Extending Unix pipelines to DAGs." IEEE Transactions on Computers. 66(9). doi: 10.1109/TC.2017.2695447.

PiCO QL: A software library for runtime interactive queries on program data

Published in SoftwareX, Elsevier, 2016

PiCO QL is open source C/C++ software whose scientific scope is real-time interactive analysis of in-memory data through SQL queries. It exposes a relational view of a system’s or application’s data structures, which is queryable through SQL. While the application or system is executing, users can input queries through a web-based interface or issue web service requests. Queries execute on the live data structures through the respective relational views. PiCO QL makes a good candidate for ad-hoc data analysis in applications and for diagnostics in systems settings. Applications of PiCO QL include the Linux kernel, the Valgrind instrumentation framework, a GIS application, a virtual real-time observatory of stellar objects, and a source code analyser.
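The idea of a relational view over in-memory data, as described in the abstract above, can be illustrated with a minimal sketch using Python’s standard `sqlite3` module. This is not PiCO QL’s interface: the `Process` record, the table layout, and the sample values are hypothetical, and the sketch copies a snapshot of the objects into a table, whereas PiCO QL is described as executing queries on the live data structures.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Process:
    """Hypothetical in-memory record standing in for application data."""
    pid: int
    name: str
    rss_kb: int


def build_relational_view(processes):
    """Load the records into an in-memory SQLite table.

    Assumption: the snapshot copy is for illustration only; it is not
    how PiCO QL exposes data.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE process (pid INTEGER, name TEXT, rss_kb INTEGER)")
    conn.executemany(
        "INSERT INTO process VALUES (?, ?, ?)",
        [(p.pid, p.name, p.rss_kb) for p in processes],
    )
    return conn


def top_memory_users(conn, limit):
    """Ad-hoc SQL query over the relational view."""
    rows = conn.execute(
        "SELECT name, rss_kb FROM process ORDER BY rss_kb DESC LIMIT ?",
        (limit,),
    )
    return rows.fetchall()


# Hypothetical sample data.
processes = [
    Process(1, "init", 1200),
    Process(42, "sort", 5400),
    Process(43, "uniq", 800),
]
conn = build_relational_view(processes)
result = top_memory_users(conn, 2)
```

Once the data has a relational shape, questions such as "which records use the most memory" become one-line declarative queries instead of hand-written loops.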

Recommended citation: Marios Fragkoulis, Diomidis Spinellis, and Panos Louridas. (2016). "PiCO QL: A software library for runtime interactive queries on program data." SoftwareX. volume 5. pages 134-138. Elsevier.

Does your configuration code smell?

Published in 13th International Conference on Mining Software Repositories, 2016

Infrastructure as Code (IaC) is the practice of specifying computing system configurations through code, and managing them through traditional software engineering methods. The wide adoption of configuration management and the increasing size and complexity of the associated code call for assessing, maintaining, and improving the configuration code’s quality. In this context, traditional software engineering knowledge and best practices associated with code quality management can be leveraged to assess and manage configuration code quality. We propose a catalog of 13 implementation and 11 design configuration smells, where each smell violates recommended best practices for configuration code. We analyzed 4,621 Puppet repositories containing 8.9 million lines of code and detected the cataloged implementation and design configuration smells. Our analysis reveals that the design configuration smells show 9% higher average co-occurrence among themselves than the implementation configuration smells. We also observed that configuration smells belonging to a smell category tend to co-occur with configuration smells belonging to another smell category when correlation is computed by volume of identified smells. Finally, design configuration smell density shows negative correlation whereas implementation configuration smell density exhibits no correlation with the size of a configuration management system.

Recommended citation: Tushar Sharma, Marios Fragkoulis, and Diomidis Spinellis. (2016). "Does your configuration code smell?" 13th International Conference on Mining Software Repositories (MSR '16). pages 189-200. Austin, Texas. ACM.

An interactive SQL relational interface for querying main-memory data structures

Published in Computing, Springer, 2015

Query formalisms and facilities have received significant attention in the past decades resulting in the development of query languages with varying characteristics; many of them resemble SQL. Query facilities typically ship as part of database management systems or, sometimes, bundled with programming languages. For applications written in imperative programming languages, database management systems impose an expensive model transformation. In-memory data structures can represent sophisticated relationships in a manner that is efficient in terms of storage and processing overhead, but most general purpose programming languages lack an interpreter and/or an expressive query language for manipulating interactive queries. Issuing interactive ad-hoc queries on program data structures is tough. This work presents a method and an implementation for representing an application’s arbitrary imperative programming data model as a queryable relational one. The Pico COllections Query Library (PiCO QL) uses a domain specific language to define a relational representation of application data structures and an SQL interface implementation. Queries are issued interactively and are type safe. We demonstrate our relational representation for objects and the library’s usefulness on three large C++ projects. PiCO QL enhances query expressiveness and boosts productivity compared to querying via traditional programming constructs.

Recommended citation: Marios Fragkoulis, Diomidis Spinellis, and Panos Louridas. (2015). "An interactive SQL relational interface for querying main-memory data structures." Computing. 97(12). pages 1141-1164. Springer Vienna.

Relational access to Unix kernel data structures

Published in Ninth European Conference on Computer Systems (EuroSys), 2014

State-of-the-art kernel diagnostic tools like DTrace and SystemTap provide a procedural interface for expressing analysis tasks. We argue that a relational interface to kernel data structures can offer complementary benefits for kernel diagnostics. This work contributes a method and an implementation for mapping a kernel’s data structures to a relational interface. The Pico COllections Query Library (PiCO QL) Linux kernel module uses a domain specific language to define a relational representation of accessible Linux kernel data structures, a parser to analyze the definitions, and a compiler to implement an SQL interface to the data structures. It then evaluates queries written in SQL against the kernel’s data structures. PiCO QL queries are interactive and type safe. Unlike SystemTap and DTrace, PiCO QL is less intrusive because it does not require kernel instrumentation; instead it hooks to existing kernel data structures through the module’s source code. PiCO QL imposes no overhead when idle and needs only access to the kernel data structures that contain relevant information for answering the input queries. We demonstrate PiCO QL’s usefulness by presenting Linux kernel queries that provide meaningful custom views of system resources and pinpoint issues, such as security vulnerabilities and performance problems.

Recommended citation: Marios Fragkoulis, Diomidis Spinellis, Panos Louridas, and Angelos Bilas. (2014). "Relational access to Unix kernel data structures." Ninth European Conference on Computer Systems (EuroSys '14). pages 1-14. Amsterdam, Netherlands. ACM.