introducing the archetype Betty
“Hello, I’m Betty. I’m an engineer and self-taught programmer that develops research software. Very often, this software represents a computational model for the simulation of an engineering application. For validating such a model, I have to compare my results with data such as other simulation data or experimental observations. For this and other purposes, I also write code for analysing and converting research data. My software usually has a lot of dependencies in form of the operating system and third-party libraries. While I’m very keen on guaranteeing the reproducibility of my computational results, I can’t dedicate too much working time to achieve this. My professional background can be located in any engineering discipline.”
key challenges & objectives
As defined here (PDF), “research software (as opposed to simply software) is software that is developed within academia and used for the purposes of research: to generate, process, and analyse results. This includes a broad range of software, from highly developed packages with significant user bases to short (tens of lines of code) programs written by researchers for their own use.” In the task area Betty we address this whole range of software, and in particular the fact that it is usually developed by domain specialists rather than software engineers. While research software certainly is research data, it exhibits particular characteristics compared to more “conventional” data. This yields the following challenges:
- Code inside a research software project that is under continuous development is subject to permanent change. Employing version control is mandatory to keep track of changes and to provide means to refer to a particular code instance.
- Any particular piece of research software often exhibits complex dependencies on other software such as third-party libraries and operating systems, as well as possibly on the hardware architecture on which the software is supposed to be executed.
- The broad variety of engineering applications is reflected in a huge number of engineering research software projects exhibiting very different scope, size, quality etc. This demands correspondingly adaptable RDM processes and tools.
- Engineering research software is often dedicated to numerical simulation. To ensure the quality of the software, the underlying model needs to be validated, which in turn usually requires the comparison of simulation results with other data. Analogous validation requirements hold for other types of software such as software for systems control.
- Software itself generates data in form of computational results. While the results are commonly discussed in a scientific publication, a proper RDM strategy for research software should aim for ensuring the reproducibility of these results.
A broad range of tools already exists to tackle several technical challenges listed above. The fundamental challenge within this task area rather is to offer the engineering community a consistent toolchain that actually will be employed in daily engineering research. From these key challenges and our common goals, we identify three key objectives for this task area. In particular, we would like that every engineer:
- can be equipped with the tools and knowledge that are necessary and useful to develop validated, quality-assured engineering research software.
- is able to guarantee the reproducibility or at least the transparency of his computational results and to provide his peers with usable solutions for the actual reproduction.
- can easily equip his own engineering research software and validation data with standardised metadata and find such software and data of others for his research.
We implement five measures to achieve the abovementioned goals. Measures B-1 and B-2 target primarily the first goal, measure B-3 the second, and measures B-4 and B-5 the third one:
- B-1Integrated toolchain for validated engineering research software: Several solutions for individual RDM measures in the context of software already exist. By integrating these tools, we will implement and support a consistent toolchain for developing validated engineering research software.
- B-2Best practice guides and recommendations: We aim to design best practice guides and recommendations for developing engineering research software by means of the toolchain that results from the tasks of measure B-1, starting from the first lines of code to the provision of web frontends and backends for the reproduction of computational results.
- B-3Containerisation and generation of web frontends: We focus on two main ingredients to achieve our second goal of enabling every engineer to guarantee and facilitate the reproducibility or at least the transparency of computational results: containerisation and generation of web frontends. With Docker and Singularity regarding containerisation as well as JupyterLab concerning web frontends, the technical tools are readily available.
- B-4Standardisation and automated extraction of research software metadata: Up to now, no standardised metadata format for engineering research software has emerged. CodeMeta defines subsets of the schema.org vocabulary as possible entries for the description of software. However, it is discipline- and even research-agnostic and doesn’t include any engineering-related entries such as the software’s application areas or details on a computational model. On the other hand, there exist recent efforts (ArXiv) on metadata formats for engineering sciences with no particular focus on software.
- B-5Catalogue of engineering research software and validation data: In order to help an engineer to find software that is supposed to be capable of solving his current research question, we aim to establish a catalogue of engineering research software and related validation data.
In the scope of this task area, we are planning to realise the abovementioned sustainability measures within three pilot use cases in collaboration with partners from different fields of engineering sciences. In turn, feedback from these use cases will allow us to refine the needs and demands that originate from different engineering disciplines, and to adjust the recommended measures accordingly.
We have recently completed the project descriptions of the pilot cases, which cover the following topics:
- Application of workflow tools to represent, share and publish scientific workflows: an evaluation of the requirements, a selection of tools, and best practices on how to use them. We have initiated a Git repository in which we will document our findings and experiences, open for contribution by the scientific community. Besides this, we aim to initiate a Special Interest Group on the topic.
- Software quality assurance for projects distributed over several code repositories, where downstream code repositories have to be tested against changes in upstream code repositories.
- Automated and continuous testing of simulation frameworks against remotely-hosted and possibly version-controlled reference/benchmark solutions to guarantee that the code produces physically meaningful results.
Moreover, we are in the process of developing an abstract model for the description of software-driven scientific workflows. On the basis of this model, we can define adequate metadata schemes for their description in a machine-readable format.