Final Report: Seed Fund “Tools for creating reproducible scientific workflows”

Data processing is usually not a single task but relies on a chain of tools. To achieve transparency, adaptability, and reproducibility of (computational) research, the FAIR principles must therefore be applied to all components of the research process. This includes the tools (i.e. any research software) used to analyze the data, but also the scientific workflow itself, which describes how the various processes depend on each other. The goal of the project described in this article was the development of an NFDI4Ing portfolio of tools for the creation and documentation of reproducible simulation workflows.

Illustration by Mudassar Iqbal@Pixabay

In the field of computational science and engineering, workflows often entail the application of various software tools, for instance for simulation or for pre- and postprocessing. Typically, these components are combined in arbitrarily complex workflows to address a specific research question. For peer researchers to understand, reproduce and (re)use the findings of a scientific publication, several challenges must be addressed. For instance, the employed workflow has to be automated, and information on all software used must be available so that the results can be reproduced. Moreover, the results must be traceable, and the workflow must be documented and readable to allow for external verification and greater trust.
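To make these requirements concrete, the following is a minimal Python sketch of an automated workflow, not one of the WfMSs investigated in the project. It assumes three hypothetical steps (preprocess, simulate, postprocess), runs them in dependency order using the standard library's graphlib module, and records a simple provenance log with the Python version, the platform, and a checksum of each step's output. All function and variable names are invented for illustration.

```python
import hashlib
import json
import platform
import sys
from graphlib import TopologicalSorter  # standard library, Python >= 3.9


def preprocess(inputs):
    # Hypothetical preprocessing step: generate a small 1D "mesh".
    return [i / 4 for i in range(5)]


def simulate(inputs):
    # Hypothetical simulation step: evaluate x^2 on the mesh.
    return [x * x for x in inputs["preprocess"]]


def postprocess(inputs):
    # Hypothetical postprocessing step: reduce the field to one number.
    return sum(inputs["simulate"])


# Each step name maps to (function, names of the steps it depends on).
STEPS = {
    "preprocess": (preprocess, []),
    "simulate": (simulate, ["preprocess"]),
    "postprocess": (postprocess, ["simulate"]),
}


def run_workflow(steps):
    """Run all steps in dependency order and collect provenance."""
    graph = {name: deps for name, (_, deps) in steps.items()}
    results = {}
    provenance = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "steps": [],
    }
    for name in TopologicalSorter(graph).static_order():
        func, deps = steps[name]
        # Pass each step only the outputs of its declared dependencies.
        results[name] = func({dep: results[dep] for dep in deps})
        # A checksum of the output makes later reruns comparable.
        digest = hashlib.sha256(json.dumps(results[name]).encode()).hexdigest()
        provenance["steps"].append(
            {"name": name, "depends_on": deps, "sha256": digest}
        )
    return results, provenance


if __name__ == "__main__":
    results, provenance = run_workflow(STEPS)
    print(results["postprocess"])
    print(json.dumps(provenance, indent=2))
```

A real WfMS adds much more on top of this, for example caching of intermediate results, execution on remote or HPC resources, and capture of the full software environment. The sketch only illustrates the two ideas named above: explicit dependencies between steps and a machine-readable record of what was run.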

The project presented here was part of the NFDI4Ing Seed Fund programme 2022. Its goal was the development of a portfolio of tools for the creation and documentation of reproducible simulation workflows, to support researchers in overcoming the challenges above.

To achieve this, existing workflow management systems (WfMSs) were assessed for their suitability for describing, reproducing, and reusing scientific workflows. To this end, a set of general requirements for WfMSs was derived from user stories that the authors deemed relevant in the domain of computational science and engineering. A selection of WfMSs was then compared against these requirements, based on an implementation of an exemplary workflow in each tool and on each tool's available documentation. The comparison is intended to help fellow scientists identify which system best suits their needs.

The data generated over the course of the project is publicly available and hosted on GitHub. The GitHub repository (Diercks, Gläser, Unger, & Flemisch, 2022) that contains the WfMS implementations of the exemplary workflow was created with the aim of continuously adding more tools in the future and extending the documentation accordingly. An additional GitHub repository (Diercks, Gläser, Unger, & Flemisch, NFDI4Ing HPC Workflows, 2022) was created to document different approaches to achieving portable workflow implementations in the context of high-performance computing (HPC).

As part of the project, a special interest group (SIG) was formed within NFDI4Ing to report findings to interested members of NFDI4Ing and to incorporate their feedback. The SIG “workflow tools” remains active and is open to new participants. Feel free to join the biweekly operative meetings or the SIG-wide meetings held every 6 weeks.

Over the course of the project, there was close and continuous cooperation with multiple NFDI4Ing task areas and members. The collaborations in the context of reproducible workflows in HPC environments and the use of container technology were particularly fruitful and led to synergies with other research projects in that field (e.g. SURESOFT). Contributions by the community were made possible and encouraged through the publicly accessible repository on GitHub. Moreover, the developers of the investigated WfMSs were contacted to give feedback and enable exchange. The full results of the project can be found here.

Dr. Jörg F. Unger
Bundesanstalt für Materialforschung und -prüfung (BAM)

This text is based on the final report delivered to the NFDI4Ing steering committee.
Edited by Thorsten Schwetje