Betty’s (Re)Search Engine

Betty is the NFDI4Ing task area covering the research data management challenges faced by engineers who develop their own software solutions. Betty envisions a future in which every engineer produces verified, high-quality software that can be reused and extended. To that end, Betty wants to identify and provide the missing tools, teaching material and recommendations.

Most tasks in today’s research require software for generating, manipulating or visualizing data. Moreover, particularly in the engineering sciences, researchers often need to write their own code for very specific tasks that out-of-the-box solutions do not cover. As science advances, these tasks tend to become more complex and call for software that cannot feasibly be implemented from scratch within a single PhD thesis. For this reason, it is of major importance that existing research software can be found, reused and maintained, so that it can be developed incrementally. We also believe that computational research could benefit from the following practices related to software:

Proper use of version control for every bit of code that is used in research activities.
Tests for all code in order to verify its correctness.
Validation of the code using established benchmarks.
Use of established open-source packages for required algorithms and functionality instead of reinventing the wheel.
Use of well-established and widely used data formats for any data that the research software produces.
Automation of the entire computational pipeline for research results that are the artifacts of multiple steps of data processing with different software tools. This automation should be published alongside a paper, related data and code.

To help with these practices, in the task area Betty we are working on:

solutions that help researchers to find software of peers that is relevant for their work
recommendations that help researchers with the development of their own software
recommendations on how to automate and publish computational pipelines
tools for easier compliance with standard and established file formats
tools for research software testing
tools for automated collection and metadata annotation of software and results that should be published

Betty’s (Re)Search Engine

Betty’s (Re)Search Engine aims to solve the problem of finding research software relevant to one’s own work. The engine searches for software repositories that match a given search string and then tries to find corresponding publications. This lets users sort the repositories by the number of citations, and it also gives a direct impression of the research contexts in which a piece of software has been applied successfully. Since the last newsletter, the source code has been made public here (Link), and a running prototype instance (Link) can be explored. You can find a more detailed description in a preprint for the Ing.Grid journal (Link), or, if you prefer a more visual guide, a “Getting Started” video is available on YouTube (Link).
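The ranking idea can be illustrated with a minimal sketch. The `Publication` and `Repository` types and the `search` function below are hypothetical and are not taken from the engine’s source code; the sketch only assumes that each matched repository already carries a list of associated publications with citation counts, and it shows how the matches could then be ordered.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Publication:
    """Hypothetical record of a publication linked to a repository."""
    title: str
    citations: int


@dataclass
class Repository:
    """Hypothetical record of a software repository."""
    name: str
    description: str
    publications: list[Publication] = field(default_factory=list)

    def total_citations(self) -> int:
        # Sum the citations of all publications linked to this repository.
        return sum(p.citations for p in self.publications)


def search(repositories: list[Repository], query: str) -> list[Repository]:
    """Return repositories matching the query, most cited first."""
    needle = query.lower()
    matches = [
        repo for repo in repositories
        if needle in repo.name.lower() or needle in repo.description.lower()
    ]
    return sorted(matches, key=lambda repo: repo.total_citations(), reverse=True)
```

A real engine would query code hosting platforms and bibliographic services instead of filtering an in-memory list; here the matching is a plain substring test to keep the sketch self-contained.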

Reuse and software quality

Reuse of research software can range from reproducing published research results to modifying or extending the source code or incorporating it into another project. These two reuse scenarios place different requirements on published software.

The first scenario requires that the software is accessible in exactly the version that was used to produce the original results, and that a suitable software and hardware environment can be recreated. In a recent preprint (Link) we discuss ways of employing version control, data repositories and metadata to continuously yield research results together with software artifacts that can be used to reproduce them.
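As a small illustration of the kind of metadata that helps with recreating an environment, the following sketch records the Python version, the platform and the versions of selected packages. It is an assumption-laden example and not the approach described in the preprint; the function name `capture_environment` and the structure of the returned record are hypothetical.

```python
import platform
import sys
from importlib import metadata


def capture_environment(package_names=()):
    """Return a dictionary describing the current software environment.

    `package_names` lists the packages whose installed versions should be
    recorded; packages that are not installed are recorded as None.
    """
    packages = {}
    for name in package_names:
        try:
            packages[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            packages[name] = None  # package is not installed
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": packages,
    }
```

Such a record could be stored next to the results it belongs to, for example as a JSON file written with `json.dump`.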

The second scenario requires that the source code has sufficiently high internal quality that it can be easily understood, extended and adapted. This cannot be achieved by external tooling; it requires that the developers have the skills and knowledge to write code with these characteristics. Therefore, we have compiled teaching material on sustainable software development that aims to convey the experience of the software engineering community to an audience with little background in programming. The material is available as a semester course (Link) and as a one-week PhD seminar including several exercises (Link).

To help researchers automate their computational pipelines, we have collected our experiences from the NFDI4Ing Special Interest Group “Tools for describing, reproducing and reusing scientific workflows”. A discussion of a selection of tools as well as exemplary workflow implementations can be found in our public git repository (Link) and in a preprint recently submitted to the Ing.Grid journal (Link).

Tools & Outlook

On the tooling side, we are happy to announce that our regression-testing tool “FieldCompare” is now available on PyPI (Link) and GitLab (Link), and a detailed description has been published in the Journal of Open Source Software (Link). FieldCompare detects deviations in numerical results stored in a variety of standard file formats, and can be used to perform regression tests or to compare results against established references. To help researchers write their numerical results into established file formats, we have developed the utility library “GridFormat”, which we recently made publicly available (Link), although extensive documentation and a first official release are still pending.
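The principle behind such a regression test can be sketched in a few lines. The following example does not use FieldCompare and does not reflect its API; the function `find_deviations`, the dictionary-based data layout and the tolerance values are assumptions made for illustration. It compares named fields of floating-point values against a reference, using a relative and an absolute tolerance.

```python
import math


def find_deviations(result, reference, rel_tol=1e-9, abs_tol=1e-12):
    """Compare two mappings of field name -> list of floats.

    Returns a list of messages describing each deviation; the list is
    empty if the result matches the reference within the tolerances.
    """
    messages = []
    for name in sorted(set(result) | set(reference)):
        if name not in result or name not in reference:
            messages.append(f"field '{name}' is missing in one of the data sets")
            continue
        values, expected = result[name], reference[name]
        if len(values) != len(expected):
            messages.append(f"field '{name}': length {len(values)} != {len(expected)}")
            continue
        for i, (x, y) in enumerate(zip(values, expected)):
            # Values count as equal if they agree within either tolerance.
            if not math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol):
                messages.append(f"field '{name}'[{i}]: {x} != {y}")
    return messages
```

In a test suite, one would typically assert that the returned list is empty and print the messages otherwise, so that a failing regression test points directly to the fields and entries that changed.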

As an outlook, we recently teamed up with the research data management (RDM) team of the cluster of excellence SimTech at the University of Stuttgart to develop RDM solutions that allow researchers to automatically collect results from the local file system, annotate them with metadata, and upload them into a dataset on an institutional data repository. On the reuse side, we plan to improve the user experience by adding preview capabilities for a variety of file formats to the data repository DaRUS. This will enable users to inspect individual files without having to download the entire dataset.
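The collection and annotation steps could look roughly as follows. This is a minimal sketch and not the solution under development: the function `collect_results`, the default file pattern and the fields of each record are hypothetical, and the upload step is left out because the repository interface is not described here.

```python
import hashlib
from pathlib import Path


def collect_results(directory, pattern="*.csv", extra_metadata=None):
    """Return one metadata record per result file found below `directory`.

    Each record contains the path relative to `directory`, the file size
    and a SHA-256 checksum, plus any user-supplied metadata.
    """
    root = Path(directory)
    records = []
    for path in sorted(root.rglob(pattern)):
        if not path.is_file():
            continue
        content = path.read_bytes()
        record = {
            "file": path.relative_to(root).as_posix(),
            "size_bytes": len(content),
            "sha256": hashlib.sha256(content).hexdigest(),
        }
        # User-supplied metadata, e.g. author or project name, is merged in.
        record.update(extra_metadata or {})
        records.append(record)
    return records
```

The checksum lets a later consumer verify that a downloaded file is identical to the one that was collected. Reading each file fully into memory is a simplification that would need to be replaced by chunked hashing for large result files.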

Call for participation

Like all archetype-related task areas in NFDI4Ing, Betty follows a bottom-up approach, implementing pilot use cases together with partners from several areas of the engineering sciences. We are always keen to identify new use cases and encourage every engineer interested or already involved in research software development to contact us and participate: betty@nfdi4ing.de.

Dennis Gläser for Betty