Abstracts for the
NFDI4Ing Conference 2022
Unifying the Understanding of RDM in Engineering Science
The ID of each abstract is used solely for the purposes of the review process.
Abstracts are sorted by ID.
Wikidata beginners’ workshop - Workshop
The world's knowledge at your fingertips! This beginners' workshop introduces ways of gathering information using Wikidata and includes a brief overview of the Wikimedia Foundation. The hands-on part starts with a general introduction to subject-predicate-object queries as implemented in the Resource Description Framework (RDF). RDF lies at the core of Wikidata and is widely used in academic contexts. It is the standard Web technology on which the concept of Linked Data is built. In short, RDF is the state-of-the-art way to describe data with metadata.
In this workshop, you will learn to use the RDF query language SPARQL and perform semantic queries. We will retrieve information about various consortia of Germany's Research Data Infrastructure (Nationale Forschungsdateninfrastruktur - NFDI). One of the goals of NFDI is to become Germany's backbone in research data management. Consequently, the structures and services of NFDI have to be Findable, Accessible, Interoperable and Re-usable (FAIR) in a sustainable way. Hence, more and more information relating to NFDI is being added to Wikidata. You can find out in this workshop how to gather it and how to join the Wikidata community. While we will not delve into the subject of becoming a contributor to Wikidata too much, a clear picture of what kind of data on NFDI is already there and what's missing will emerge.
A hallmark of this workshop is its use of Donald E. Knuth's "Literate Programming" approach from the 1980s. In doing so, we leave aside the graphical user interface of Wikidata's query builder and focus on actually writing down - typing - the code without assuming any prior experience with programming languages. Our method of teaching with two trainers and a split screen is born out of the necessities of teaching at a distance and has been developed and tested with different programming environments during the recent pandemic (DOI: 10.17192/bfdm.2021.3.8336).
As a matter of fact, Wikidata's query builder is limited when it comes to generating geographical maps or visualizing lists of image items like, for example, logos of different consortia. Therefore, writing the queries directly with your fingertips comes with the added bonus of getting to know a wide range of possible visualizations. Visualizations in Wikidata can be easily exported to other platforms and can also be put on websites, where they keep updating depending on new data added.
Last, but not least, adding data to Wikidata is what makes this workshop possible in the first place. Wikidata is not built on a conventional database. Conventional, or more traditional, databases are relational and are queried with the Structured Query Language (SQL). Wikidata, by contrast, offers linked datasets and allows for creating a knowledge graph. While SPARQL may initially look like SQL, there are important differences because the data is linked. With SPARQL, your query matches graph patterns instead of SQL's relational matching operations. Using this language, one can perform a distributed, or federated, query across multiple databases in a single query statement. Enjoy!
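As a purely illustrative sketch (not part of the workshop material), the following Python snippet sends such a SPARQL query to the public Wikidata Query Service; the class Q-ID is a placeholder that would have to be replaced with the actual Wikidata item for NFDI consortia, and P31/P856 are the well-known "instance of" and "official website" properties.

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"   # public Wikidata Query Service

    # Placeholder Q-ID: replace wd:Q000000 with the actual item class for NFDI consortia.
    QUERY = """
    SELECT ?consortium ?consortiumLabel ?website WHERE {
      ?consortium wdt:P31 wd:Q000000 .                 # P31 = instance of (class is a placeholder)
      OPTIONAL { ?consortium wdt:P856 ?website . }     # P856 = official website
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """

    response = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"}, timeout=60)
    response.raise_for_status()

    for row in response.json()["results"]["bindings"]:
        print(row["consortiumLabel"]["value"], "->", row.get("website", {}).get("value", "no website"))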
License: CC-BY 4.0 ↗
PIA – A concept for a Personal Information Assistant for Data Analysis - Presentation
Data and their analysis always play a crucial role in research and science. In recent years, especially with steadily increasing computational power and the advances in the field of artificial intelligence (AI), the importance of high-quality data has continued to grow. In this context, entire research fields and committees are exclusively concerned with improving data and their quality to maximize the potential of their use. However, the use of AI in industry also offers enormous potential for companies. For example, in condition monitoring tasks, the early detection of damage and wear of parts or machines can avoid the costs of unplanned machine downtime. Maintenance can instead be scheduled predictively, and downtimes can thus be minimized.
A basic requirement for a successful application of AI in the industrial context is a solid database with high-quality data, e.g., from the production process itself, but also from testing processes. In practice, however, the application of AI algorithms often fails because of insufficient data quality, caused by missing or incomplete annotation of the data, incomplete data acquisition, problems when linking measurement data to the corresponding manufactured products, or a lack of synchronization between different data acquisition systems. Furthermore, in industry, data are typically acquired continuously without saving relevant metadata. This often leads to a brute-force approach that tries to use all acquired data. Such large data sets are subsequently difficult to manage and their use is computationally expensive. A knowledge-driven approach can use resources efficiently and increase the information density within the data, e.g., by using process knowledge to reduce the amount of sensor data that has to be recorded. By recording data in a targeted manner, redundancies can be avoided. However, linking knowledge and data is difficult and a known problem in many companies. The necessary process knowledge is often limited to a few individuals and cannot be easily accessed by colleagues. In the worst case, the process knowledge is lost if the specialist leaves the company.
For these cases and to retain the knowledge, the concept “PIA – Personal Information Assistant for Data Analysis” has been developed. PIA is an open-source framework, based on Angular, a platform for building mobile and desktop web applications, which runs locally on a server and can be accessed via the intranet. It allows companies to easily access their data as well as their process knowledge, to link both, and to gain further insights into their manufacturing process and the acquired data. PIA is developed in accordance with the FAIR (Findable, Accessible, Interoperable, and Reusable) data principles widely applied in scientific research and aims to transfer these principles to industry.
In this contribution, the PIA framework is applied to part of an assembly line demonstrator, which assembles a specific product in several variants, with a focus on screwing processes. In general, PIA provides a checklist as well as a knowledge base which supports users in performing a data analysis project on brownfield assembly lines and increases data quality. The checklist is based on the Cross-Industry Standard Process for Data Mining (CRISP-DM) and provides must-checkboxes and best-practice checkboxes, as well as tips and hints, to obtain high-quality data. Owing to the modular approach of Angular, PIA can be easily extended to further processes or products and adapted to a company's individual regulations and requirements.
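As a language-agnostic sketch (PIA itself is implemented in Angular), the following Python snippet illustrates how such a CRISP-DM-based checklist with must- and best-practice items could be modelled; all names and items are illustrative assumptions, not PIA's actual data model.

    from dataclasses import dataclass, field
    from enum import Enum

    class Phase(Enum):
        # The six CRISP-DM phases on which the PIA checklist is based
        BUSINESS_UNDERSTANDING = "business understanding"
        DATA_UNDERSTANDING = "data understanding"
        DATA_PREPARATION = "data preparation"
        MODELING = "modeling"
        EVALUATION = "evaluation"
        DEPLOYMENT = "deployment"

    @dataclass
    class ChecklistItem:
        text: str
        phase: Phase
        mandatory: bool = True        # "must" checkbox vs. best-practice checkbox
        hint: str = ""                # optional tip shown to the user
        done: bool = False

    @dataclass
    class Checklist:
        items: list = field(default_factory=list)

        def open_mandatory_items(self):
            return [i for i in self.items if i.mandatory and not i.done]

    # Example: a minimal checklist for a screwing process (items are illustrative)
    checklist = Checklist([
        ChecklistItem("Record torque and angle for every screwing operation",
                      Phase.DATA_UNDERSTANDING,
                      hint="Link each measurement to the product serial number."),
        ChecklistItem("Synchronize clocks of all data acquisition systems",
                      Phase.DATA_PREPARATION),
        ChecklistItem("Document sensor calibration dates", Phase.DATA_UNDERSTANDING,
                      mandatory=False),
    ])
    print(len(checklist.open_mandatory_items()), "mandatory items still open")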
License: CC-BY 4.0 ↗
“Continuous” Integration of Scientific Software (in Computational Science and Engineering) - Presentation
We present a Research Software Engineering (RSE) workflow that addresses the challenges of developing research software in Computational Science and Engineering (CSE) at university research groups in a sustainable way. Researchers with backgrounds in different scientific disciplines, often with no background in computer science, develop CSE research software to solve complex problems. Academic research software development lasts many years; however, many team members (Ph.D. students and postdoctoral researchers) leave after 3-5 years as their employment contracts end. Funding dedicated to sustainable research software development is generally not provided by sources that fund fundamental research, so there is a need for a simple and highly effective workflow that does not cause significant work overhead.
We propose a CSE-RSE workflow that is simple, effective, and largely ensures the FAIR principles [Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016).]. The CSE-RSE workflow uses established practices, services, and tools, pragmatically adapted for CSE research software: version-control, secondary-data standardization, data and publication repositories, continuous integration (CI), and containerization.
We employ a straightforward feature-branch version control model that centers around research ideas. Every research idea is associated with one feature branch. Once it is successfully implemented, it is merged into the main branch. The feature-branch model simplifies the cross-linking of research artifacts addressed below. Specifically, the feature branch already links many digital artifacts, such as CI pipelines with their associated Merge Requests and CI artifacts (research data, software binaries, data processing scripts), that all belong to the research idea. Even failed ideas can be cross-linked with this feature-branch version control model. In contrast to successful ideas, failed ideas are not merged back into the main branch. At any point in time, a failed idea can be found and revived with new insights.
Another advantage of the feature-branch version control model is the ability to connect a feature branch to issue tracking. Depending on the size of a project, the possibilities offered by the version control repository or collaborative development platform, e.g. GitLab's GUI, are adequate to track the progress of the software. This not only helps the researchers themselves to keep track and to report, but also enables other researchers to get an impression of the current project status and helps with the frequently recurring on- and off-boarding of project participants.
Secondary data is crucial - it is the basis for the scientific review process and for comparing publications. Scientific publications contain secondary data from Verification and Validation (VV) tests, which in CSE are organized into parameter studies. We store parameter-study metadata directly as column data within the secondary data - the data duplication caused by this approach is negligible, while it significantly simplifies secondary-data analysis and comparison. Jupyter notebooks are used for secondary-data analysis and VV test documentation. CSE-specific visualizations such as error-convergence diagrams and visualizations of field data (e.g. the velocity field), which are fundamental for understanding the results, are included in the visualization stages of the CI pipeline via Jupyter notebooks.
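A minimal sketch of this metadata-as-columns idea, assuming CSV secondary data and purely illustrative parameter names, could look as follows with pandas.

    import pandas as pd

    # Secondary data from one parameter variant of a VV study: error norms per mesh resolution.
    results = pd.DataFrame({
        "resolution": [16, 32, 64, 128],
        "L2_error":   [2.1e-2, 5.4e-3, 1.4e-3, 3.5e-4],
    })

    # Store the parameter-study metadata directly as (constant) columns in the secondary data.
    # The duplication is negligible, but every row is now self-describing, and tables from
    # different studies can be concatenated and compared without an extra metadata file.
    results["solver"] = "explicit-euler"       # illustrative parameter names and values
    results["cfl_number"] = 0.5
    results["git_commit"] = "abc1234"          # link back to the feature branch / CI pipeline

    results.to_csv("vv_study_convergence.csv", index=False)

    # Later analysis (e.g. in a Jupyter notebook) can group and compare across studies:
    all_studies = pd.read_csv("vv_study_convergence.csv")
    print(all_studies.groupby(["solver", "cfl_number"])["L2_error"].min())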
Contrary to the standard practice of unit testing from the ground up, we propose a top-down testing approach. The workflow employs Continuous Integration and adapts it to RSE in CSE to construct the VV tests necessary for scientific publications. Low-level (unit) tests are implemented once the test-driven VV applications fail, i.e. do not deliver correct output. In the case of a failing VV test, we analyze the data generated by the visualization stage in the CI and isolate suspect sub-algorithms that might have caused the failure. Those algorithms are then tested using the same input as used for the VV test application. Algorithms that fail the new unit tests are improved until the VV test passes. The researchers therefore focus first and foremost on the VV tests - the basis for publishable scientific results - and add detailed tests while improving sub-algorithms, thus increasing the test coverage in a top-down manner. This approach to VV-driven testing makes the CI pipeline an automation and documentation tool for scientific workflows, since it can reproduce publication results without human interaction. It motivates the researchers to employ automatic testing in the form of CI, since it speeds up research and removes the burden of unit-testing large-scale legacy research software without having the means to do so. Containerization as a part of CI is another necessary tool for ensuring the reproducibility of research results.
Once a new method has been implemented, the last step in the workflow is to cross-link the research data, research software archives and software images, and the research report (pre-print, manuscript submitted to peer-review), using Persistent IDentifiers (PIDs). Links to the software development repository are also included, knowing that these might not be available after extended time periods.
The proposed workflow makes it possible to find all digital artifacts from a research report, find and readily re-use the crucial secondary data, and reproduce results from research software using software images and CI as a set of automated research workflows.
License: CC-BY 4.0 ↗
A-Match: Facilitating Data Exchange Between Different Applications via API Matching - Presentation
To answer challenging research questions, several distinct tools and software packages are often necessary. Their successive execution is also called a toolchain. Toolchains often require data to be exchanged between their tools. Especially in the engineering context, the software landscape is heterogeneous, and the different tools often have different data formats and conventions.
Formatting the data output from one tool so that it can be correctly loaded by a second is a tedious manual process. Some tools offer different export formats, which can reduce the effort to some extent. However, this does not guarantee an error-free import, and manual checks are still needed. Additionally, compatibility can be broken by software updates. This makes a completely automatic toolchain nearly impossible. A more reliable way to ensure correct data exchange is to interconnect compatible interfaces (APIs) of the tools.
The idea of matching two APIs is not new; there are several approaches in the research literature [1-5]. However, these methods often focus on narrow use cases and perform differently in different domains. Like A-Match, many of them rely on ontologies to integrate domain knowledge into the matching system.
As a first step in the direction of automatic API matching, we have developed a prototypical user interface (UI) to offer a fast and correct manual match between the data of two different tools in the space domain. With this project, we wanted to extend the functionality of our prototype and adapt it to other contexts, namely the engineering domain. The resulting tool is called A-Match.
A-Match consists of two parts: the user interface (UI) and the matching backend. The user can select the APIs and data objects to match. Then, they can combine the terms by hand as needed or get automatic matching suggestions calculated by the backend. For these suggestions, a combination of semantic distance metrics and ontologies defining synonyms is used. When the user is satisfied with the matched terms, the resulting changes are sent directly to the second API as an update. If individual attributes carry units of measurement, it is ensured that the input value is converted to the output unit. Thus, the toolchain can continue correctly.
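The following Python sketch illustrates, in strongly simplified form, the kind of matching the backend performs - string similarity combined with a synonym ontology, plus unit conversion. The terms, conversion factors and function names are illustrative assumptions, not A-Match's actual implementation.

    from difflib import SequenceMatcher

    # Toy "ontology" of synonyms (in A-Match this knowledge comes from domain ontologies).
    SYNONYMS = {
        "mass": {"weight"},
        "temperature": {"temp"},
        "length": {"len", "distance"},
    }

    def semantic_distance(term_a, term_b):
        """Return a similarity score in [0, 1] combining synonyms and string distance."""
        a, b = term_a.lower(), term_b.lower()
        if a == b or b in SYNONYMS.get(a, set()) or a in SYNONYMS.get(b, set()):
            return 1.0
        return SequenceMatcher(None, a, b).ratio()

    def suggest_matches(source_fields, target_fields, threshold=0.6):
        """Suggest the best target field for every source field of the two APIs."""
        suggestions = {}
        for src in source_fields:
            best = max(target_fields, key=lambda tgt: semantic_distance(src, tgt))
            if semantic_distance(src, best) >= threshold:
                suggestions[src] = best
        return suggestions

    # Simple unit conversion applied when matched attributes carry units (illustrative factors).
    TO_SI = {"mm": 0.001, "m": 1.0, "g": 0.001, "kg": 1.0}

    def convert(value, from_unit, to_unit):
        return value * TO_SI[from_unit] / TO_SI[to_unit]

    print(suggest_matches(["weight", "len"], ["mass", "length", "temperature"]))
    print(convert(2500.0, "mm", "m"))  # -> 2.5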
To ensure that the NFDI4Ing community’s needs and wishes were incorporated throughout the project, we held two workshops, and the feedback was then incorporated into our product. In the first workshop, we assessed the state of the current implementation and gathered further functional requirements and wishes. After their implementation, we held a second workshop focused on usability, to ensure we were delivering a well-designed tool that has the support of the NFDI4Ing community. Additionally, we conducted a user study to evaluate the usability, usage frequency, and usefulness of A-Match. We made the resulting, finished software available as open source at the end of the project. A-Match could still be extended to more domains, and thus gain more functionality, in follow-up projects. It would also be possible to integrate and adapt A-Match into a company’s workflow in an industrial cooperation.
In this presentation, we report A-Match’s functionality and results from the project. At the end, we will give a live demo of A-Match.
------------------
Literature
[1] C. Wu, T. Dillon, and E. Chang, "Intelligent matching for public internet web services towards semi-automatic internet services mashup," in 2009 IEEE International Conference on Web Services, IEEE, 2009, pp. 759-766.
[2] R. R. Khorasgani, E. Stroulia, and O. R. Zaiane, "Web service matching for RESTful web services," in 2011 13th IEEE International Symposium on Web Systems Evolution (WSE), IEEE, 2011, pp. 115-124.
[3] D. Caragea and T. Syeda-Mahmood, "Semantic API matching for automatic service composition," in Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters, 2004, pp. 436-437.
[4] W. Xie, X. Peng, M. Liu, C. Treude, Z. Xing, X. Zhang, and W. Zhao, "API method recommendation via explicit matching of functionality verb phrases," in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020, pp. 1015-1026.
[5] F. Legler and F. Naumann, "A classification of schema mappings and analysis of mapping tools," in Datenbanksysteme in Business, Technologie und Web (BTW 2007) - 12. Fachtagung des GI-Fachbereichs "Datenbanken und Informationssysteme" (DBIS), 2007.
License: CC-BY 4.0 ↗
How GitOps solves Experiment Configuration Documentation - Presentation
One of the main goals of research data management is to make experiments, which are the basis of most scientific conclusions, transparent and repeatable, and to increase their trustworthiness. This enables researchers to verify those conclusions, to build on them and to compare them with their own. This approach, however, requires thoroughly recording, documenting, distributing and archiving all data related to the experimental setup and its outcome. Otherwise, missing or unknown parameters, configurations or assumptions render any attempt to understand or repeat the experiment impossible. In practice, experiments conducted in manufacturing science are especially difficult to reproduce. Because of the great complexity of the experimental setups in use, a multitude of settings must be recorded and documented. However, those configurations tend to be either unknown, intransparent or hidden in proprietary software modules. Therefore, experimental results in manufacturing science tend to be lost for future use.
As one approach to tackling these difficulties, we suggest the extensive exploitation of automated cloud-edge infrastructures. The basis for this approach is a system which automatically configures all manufacturing systems and machinery in a laboratory that are used for experiments and research. The configuration files of this software can be used to precisely document large parts of the experimental setup. Together with recordings of the outcome of the experiments (e.g. measured sensor values), this setup information serves as a foundation for appropriate research data management in manufacturing science.
Infrastructure-as-Code is a new approach that promotes the configuration of IT infrastructure through text and configuration files. It is part of the GitOps paradigm. The configuration files are supposed to include all parameters and configurations that are needed to set up, manage and operate an IT system. The main idea behind this is to fully document the intended state of the system and its setup procedure and to facilitate changes. When combined with a versioning tool like Git, this allows tracking all parameters and configurations across the complete infrastructure. In the domain of cloud infrastructure engineering, this approach has become very popular in recent years. Cloud providers and large internet companies like Amazon, Netflix and Google rely on it to keep their infrastructure running.
When transferred to manufacturing science, Infrastructure-as-Code allows the complete configuration of an experimental setup - in a well-documented, archivable, machine- and human-readable, vendor-independent form. Therefore, the code defining the configuration of an experimental setup can be saved together with the results of the experiments. This leads to transparent, comprehensible, and trustworthy datasets. Every change of parameters can be easily tracked across all recorded datasets.
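A minimal sketch of this idea - a plain-text setup description versioned with Git and referenced from the experiment record - might look as follows in Python; all parameter names, values and file paths are illustrative assumptions, not the toolkit described below.

    import json
    import subprocess
    from pathlib import Path

    # Experimental setup described as code: one plain-text file holds every parameter
    # needed to (re)configure the manufacturing unit (values are purely illustrative).
    setup = {
        "laser": {"pulse_duration_fs": 900, "repetition_rate_khz": 400, "power_w": 20},
        "axes": {"feed_rate_mm_s": 250},
        "services": ["scanner-control", "sensor-logger"],   # containers to deploy
    }
    Path("experiment_setup.json").write_text(json.dumps(setup, indent=2))

    # The setup file lives in the same Git repository as the experiment results, so the
    # exact configuration state can be recorded alongside the measured data.
    commit = subprocess.run(["git", "rev-parse", "HEAD"],
                            capture_output=True, text=True, check=True).stdout.strip()

    results_record = {
        "setup_commit": commit,                # which configuration produced these results
        "raw_data": "measurements/run_042.h5"  # illustrative path to the recorded sensor data
    }
    Path("run_042_metadata.json").write_text(json.dumps(results_record, indent=2))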
The Chair for Laser Technology of RWTH Aachen University has developed a microservice-based manufacturing software toolkit to configure, control and execute an ultra-short pulse manufacturing unit. This toolkit is designed around the deployment of software containers, whose configuration is accomplished by sets of (human-readable) files according to Infrastructure-as-Code. In our contribution to the NFDI4Ing conference, we would like to present this manufacturing software infrastructure as well as the steps needed to rebuild such a system. The open-source ecosystem this software is embedded in is also explained and demonstrated. Using Infrastructure-as-Code, the state of these parameters is automatically tracked and documented without the need for additional documentation steps, lowering the manual effort involved.
License: CC-BY 4.0 ↗
FAIR play integrated right from the start - Coscine - Demonstration
For many researchers, whether from engineering sciences or other fields, an involvement with the FAIR principles does not begin until the publication of an article and the sometimes obligatory transfer of the research data to a repository. At this point, a significant amount of valuable information about the research project is often already lost. Therefore, only a fraction of the data (and metadata) collected during a research project is ever published. One solution to make research data FAIR from the very beginning of its life cycle is to use a storage environment on a daily basis that implicitly implements FAIR principles.
To create such a storage environment, the research data management platform Coscine was developed at RWTH Aachen University. Coscine provides an integrated concept for research (meta)data management in addition to storage, management and archiving of research data. In the following, we present how Coscine supports FAIR principles - from the initial collection of data to its subsequent reuse.
To enable the reuse of research data in line with the FAIR principles across institutional borders, Coscine can be accessed either through participating universities or, at a low-threshold level, via ORCID. After registration, researchers can create a research project for which both research data and metadata at various levels are collected and automatically linked. The first level of metadata relates to the research project (including name, description, PIs, discipline). The W3C standards RDF and SHACL are used for the technical representation and validation of all metadata stored in Coscine. This largely complies with the FAIR principles regarding interoperability and reusability of metadata. In addition, during the life of a project and after its completion, all associated metadata can be publicly shared within Coscine and are searchable and findable. A connection to the NFDI4Ing metadata hub is currently being realized via "FAIR Digital Object" interfaces.
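To illustrate the underlying mechanism, the following minimal Python sketch validates an RDF metadata record against a SHACL shape using the open-source libraries rdflib and pySHACL; the vocabulary and the shape are illustrative and not Coscine's actual application profiles.

    from rdflib import Graph
    from pyshacl import validate

    # Illustrative metadata record in RDF (Turtle), using Dublin Core terms.
    metadata_ttl = """
    @prefix dcterms: <http://purl.org/dc/terms/> .
    <https://example.org/project/42> dcterms:title "Cavitation test rig, run 42" ;
                                     dcterms:creator "Jane Doe" .
    """

    # Illustrative SHACL application profile: every record must carry a title and a creator.
    shapes_ttl = """
    @prefix sh: <http://www.w3.org/ns/shacl#> .
    @prefix dcterms: <http://purl.org/dc/terms/> .
    <https://example.org/shapes/ProjectShape>
        a sh:NodeShape ;
        sh:targetSubjectsOf dcterms:title ;
        sh:property [ sh:path dcterms:title ;   sh:minCount 1 ] ;
        sh:property [ sh:path dcterms:creator ; sh:minCount 1 ] .
    """

    data = Graph().parse(data=metadata_ttl, format="turtle")
    shapes = Graph().parse(data=shapes_ttl, format="turtle")

    conforms, _, report_text = validate(data, shacl_graph=shapes)
    print("metadata conforms to the application profile:", conforms)
    print(report_text)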
In the next step, different data sources, called resources, can be assigned to the research project. For each resource, Coscine assigns a handle-based ePIC PID. This is used to uniquely and permanently identify the location of the resource and all contained files on a global level. Within resources, fragment identifiers are used to address individual files. Thus, the research data is permanently referenceable and findable in the sense of the FAIR principles.
To date, Coscine offers storage resources and Linked Data resources. The storage resources allow researchers to access the Research Data Storage (RDS), a consortial object storage system funded by the Ministry of Culture and Science of the State of North Rhine-Westphalia (MKW) and the Deutsche Forschungsgemeinschaft (DFG). To support researchers' processes as much as possible, Coscine provides multiple ways to interact with research data: via a browser, using a REST API or directly via an S3 interface. This allows for high-performance transfer of even large amounts of research data. When using RDS resources, a retention and archiving period of 10 years after the end of a research project is ensured in terms of good scientific practice and reusability. Within Linked Data resources, externally stored research data is assigned a PID and can be linked and tagged with metadata. Thus, even for externally stored research data, Coscine allows increasing FAIRness by linking the data with metadata and assigning PIDs.
After specifying high-level metadata for the respective resource (including resource name, discipline, keywords, metadata visibility, license), researchers select a suitable set of metadata fields for their files from various so-called application profiles; e.g., for engineering research data the established EngMeta profile can be used. If a suitable application profile has not yet been added to Coscine, the AIMS Application Profile Generator can be used to create a profile with individual and discipline-specific metadata.
Within a resource, researchers can upload their files or store the link to their research data. When using Coscine via the web frontend, file upload is only possible after entering the associated metadata in the application profile. In this way, Coscine makes metadata entry a direct part of the researcher’s workflow, supporting the FAIR principles.
Coscine also ensures that data objects and associated metadata, linked by PID, are independently findable and accessible via a REST API. The REST API allows researchers to easily enter their data and metadata into the system and facilitates their subsequent use. In addition, the REST API enables token-based authentication to automate workflows. To help researchers interact with Coscine through the interfaces and improve integration with existing data management processes, a team of data stewards and developers has been established to provide tools, programs, and consultation for the technical adaptation of the platform. This includes the collection or extraction of metadata based on the data or the environment in which it was generated. Although the possibilities for automation depend strongly on the research project, examples and tools support researchers in the implementation and thereby improve the quality of the collected metadata.
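A generic sketch of such token-based automation is shown below; the base URL, endpoint path and payload fields are hypothetical placeholders for illustration only and do not reflect the documented Coscine API.

    import requests

    BASE_URL = "https://example.org/coscine/api"   # placeholder, not the real endpoint
    TOKEN = "YOUR-API-TOKEN"                       # personal API token issued by the platform

    headers = {"Authorization": f"Bearer {TOKEN}"}

    # Hypothetical call: register metadata for a file according to an application profile.
    metadata = {
        "title": "Tensile test, specimen 17",      # fields defined by the application profile
        "creator": "Jane Doe",
        "license": "CC-BY-4.0",
    }
    response = requests.post(
        f"{BASE_URL}/resources/RESOURCE-ID/files/specimen17.csv/metadata",
        json=metadata, headers=headers, timeout=30)
    response.raise_for_status()
    print("metadata registered:", response.status_code)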
Thanks to the interfaces for automation, high technical security standards as well as extensive collaboration possibilities, Coscine is a strong partner for researchers in their daily management of research data. Coscine enables compliance with the FAIR principles from the very first storage of data by bundling raw data, metadata, interfaces and PIDs into a linked record according to the "FAIR Digital Objects" model. In this way, Coscine makes a valuable contribution to the goal of NFDI4Ing: to foster proper research data management in engineering sciences that implements the FAIR data principles.
License: CC-BY 4.0 ↗
OntoHuman: User Interface for Ontology-based Information Extraction from Technical Documents with Human-in-the-loop interaction - Presentation
In this talk, we present DSAT (Document Semantic Annotation Tool), a tool to automatically extract information from technical documents based on ontologies and natural language processing techniques, within the context of the OntoHuman project [1]. The OntoHuman project aimed to enrich ontologies, which contain semantic information describing objects or concepts, with information extracted from documents. The central component of the OntoHuman project is DSAT, which was originally designed to assist users in annotating key-value-unit tuples in technical documents.
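As a toy illustration of the key-value-unit idea (not DSAT's actual NLP pipeline), the following Python sketch extracts such tuples from data-sheet-like text with a simple regular expression; the pattern and example text are assumptions for illustration.

    import re

    # Toy pattern for "key: value unit" statements as they appear in technical data sheets.
    PATTERN = re.compile(
        r"(?P<key>[A-Za-z][A-Za-z ]+?)\s*[:=]\s*(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z%°/]+)?")

    text = """
    Operating temperature: -20 °C
    Supply voltage = 28 V
    Mass: 3.4 kg
    """

    for match in PATTERN.finditer(text):
        print(match.group("key").strip(), "|", match.group("value"), "|", match.group("unit"))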
Besides the user interface, there are other modules used in OntoHuman: an ontology enrichment module (ConTrOn - Continuously Trained Ontology), a DSAT database (DSAT DB) for storing annotations and custom ontologies, and an information extraction module (PLIX[2]). These components are integrated into OntoHuman to achieve the automatic information extraction.
Prior to OntoHuman, the ontologies used by DSAT for automatic extraction were fixed and limited to one specific domain, i.e. spacecraft engineering. Updating and customizing an ontology manually is tedious and requires additional effort to use ontology modelling tools. Therefore, a semi-automatic process to enrich ontologies can assist domain experts, who are not necessarily ontology experts, in mapping their knowledge into ontologies.
To enable the customization of ontologies, we improved DSAT and ConTrOn in the OntoHuman project. We also pursue the Human-in-the-Loop (HiL) approach, which requires humans to verify the results of an automatic process by providing feedback to the system. We added the HiL component to generalize the automatic information extraction process. In contrast to the prototypical solution, we can now apply and customize ontologies to extract data from documents of other domains. Feedback from users can now be collected via a web-based user interface and used to further update the ontologies.
The following proposed features were implemented: correction of automatically extracted data, resolution of word ambiguities, adding new annotations, and an export function for annotations. Additionally, we simplified the UI according to feedback from workshop participants from the NFDI4Ing community. We also conducted a user survey and received a rather good rating for the tool (DSAT). Regarding the user experience, the tool is considered easy to use (6 points out of 7), supportive (5.5/7), efficient (6/7) and novel (5/7). The workshop participants rated the domain of usage of DSAT as generic purpose (3.5 points out of 5), somewhat relevant to their colleagues' work (3/5), and not very relevant to their own work (2/5). However, since the workshops were limited to 9 and 6 participants, we hope to collect further feedback and attract more users from various fields of work during this conference.
Since the automatic annotation of documents depends largely on the ontologies used, users should know where to find relevant ontologies in order to fully use the tools for other domains. An ontology search API could assist users in finding the right ontologies in the future. Furthermore, topics suggested in the workshops, such as semantic disambiguation, multi-language support, and graph value extraction, are rather complex, and we therefore decided to research them beyond the project period. They are currently being studied and could be integrated into DSAT in the future.
[1] The project was finished in June 2022, and the source code is publicly available on Zenodo (10.5281/zenodo.6783007)
[2] PLIX (Information Extraction module) version 1.0, license Apache-2.0, authors: Sarah Böning, Christian Kiesewetter
License: CC-BY 4.0 ↗
Towards a closed-loop data collection and processing ecosystem - Presentation
The digitization of research requires integrated tools which support researchers in the collection and processing of datasets stemming from experiments. In this paper, we introduce tools that have been developed to facilitate data storage and analysis. Both provide bidirectional integration, enabling trackable data flows. This means a direct link is established between collected data sets and their respective processing algorithms, further enabling direct data usage and higher reproducibility of experiments.
To provide an overview of the ecosystem, we first provide a short general description of the shepard data management system.
A great challenge in the handling of experimental data concerns the storage and structuring of data stemming from experiments with many steps, which may be of uniform or diverse character. The structured storage of this data is the basis for applying AI methods and big data approaches to the analysis and evaluation of the obtained data. To deal with this issue, the German Aerospace Center (DLR) developed the tool "shepard", which aims to ease the processing of experimental data.
Currently, the shepard system comes with a backend and a frontend. The backend handles the actual storage and structuring of the data and provides a REST API as an interface to its functionality. In the background, it uses a Neo4j database for organizational elements like users, collections, and data objects with various relations between them. Time series are stored in an InfluxDB, while files or structured data are stored in a MongoDB. The frontend can be used for manual navigation through the data. However, it is not designed to upload large amounts of data (which should happen via the REST API of the backend).
Next, we introduce ReBAR as a data processing framework strongly leaning towards machine-learning supported analysis.
ReBAR is based on Apache Airflow and provides a state-of-the-art task orchestration framework. By directly integrating further tools like MLflow, experiment tracking is simplified. Further components are two databases: one for metadata using PostgreSQL and one for storing intermediate artifacts using MinIO as an S3-compatible object store. Finally, Celery is used as a distributed task execution framework, enabling a high degree of parallel processing.
The modular structure of this framework allows easy deployment in local as well as server-based environments and custom tailoring to specific discipline needs.
Through the use of the aforementioned shepard REST API, ReBAR can automatically request data and provide it to internal processing workflows implemented in Python (or other supported languages).
After processing, results and generated artifacts can be fed back to shepard again using its API, attaching them directly to the original data.
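A generic sketch of this pull-process-push loop is shown below; the base URL, endpoint paths, authentication header and payload structure are hypothetical placeholders, not shepard's documented API.

    import requests

    SHEPARD = "https://example.org/shepard/api"   # placeholder base URL
    HEADERS = {"X-API-KEY": "YOUR-API-KEY"}       # illustrative authentication header

    # 1. Pull a timeseries attached to a data object (endpoint path and payload are hypothetical;
    #    here we assume a list of {"timestamp": ..., "value": ...} points).
    raw = requests.get(f"{SHEPARD}/collections/1/dataObjects/42/timeseries",
                       headers=HEADERS, timeout=30).json()

    # 2. Process it inside a ReBAR task, e.g. a simple statistical summary.
    values = [point["value"] for point in raw]
    summary = {"n": len(values),
               "mean": sum(values) / len(values),
               "max": max(values)}

    # 3. Push the result back and attach it to the original data object,
    #    so the analysis stays traceable to the raw data.
    requests.post(f"{SHEPARD}/collections/1/dataObjects/42/structuredData",
                  json=summary, headers=HEADERS, timeout=30).raise_for_status()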
Finally, we demonstrate the system on a real-world use case. We applied shepard to collect data during the manufacturing of a thermoplastic composite with in-situ automated fiber placement. At the Center for Lightweight Production Technology (ZLP) in Augsburg, we placed a skin for a test shell with full-scale single-aisle dimensions; the upcoming final demonstrator will be an 8 m long upper half shell. Three controllers serve as data sources. The robot controller, for example, delivers its position, motor temperature and some process information like the current track and layer number. The multi-tow laying head controller provides, for example, the tape temperature in the melting zone and the pressure of the compaction roller. The tape placement sensor controller publishes measurement results that contain, e.g., gap information between the tracks.
To establish a viable digital twin, the acquired data needs to be correlated with the design data. Thus, detected gaps between tracks at a specific position during production of the component can be traced back, for example, to a rapid axial movement of the robot. Through the different time series and files, which are connected via references, the data can be overlaid and anomalies can be detected. To maintain the process context of the data, a special data reference generator was developed. This tool observes the guiding component of the process (in our example, the KRC4) and generates data elements with a process ID, which can be used for the structured acquisition of the generated data. The data can then be transferred to the shepard system, triggered by events like reaching the end of a track or the abortion of a process.
In this case, this data is then fed to ReBAR and several analysis tasks, including visualization of data and statistical analysis, are performed to establish a standardized overview of the collected data, opening up easy avenues for further research questions. The results are automatically fed back to shepard to provide traceability of data.
License: CC-BY 4.0 ↗
Metadata Standards and Metadata Generation in HPMC - Presentation
Within NFDI4Ing, the task area DORIS develops research-data concepts and software infrastructure for data from high performance measurement and computing (HPMC). The main goal is to make tier-0 HPC research data findable, accessible, interoperable, and reusable (FAIR). In particular, the findability and interoperability of HPMC-data are typically impeded by the lack of documentation and metadata. This problem is in part caused by the absence of a thorough semantics, or at least a common terminology, to describe data and workflows in HPMC.
One of the aims of the task area DORIS is to define and disseminate metadata standards for HPMC environments. Therefore, we are developing within the group a comprehensive HPMC metadata scheme as a sub-ontology in the framework of the recently released NFDI4Ing ontology Metadata4Ing [1]. The Metadata4Ing ontology allows for a description of the whole data-generation process that revolves around the object of investigation, from the steps before the generation of the data, such as the setup of the simulation, to the post-processing steps, where the raw data are manipulated to extract further, refined data. Finally, the ontology also covers the summary of the data files and the information contained in them, together with all the personal and institutional roles. The subordinate classes and relations can be built according to the two principles of inheritance and modularity. Thanks to this concept, expanded ontologies for any possible combination of methods and objects of research can be generated [2]. The HPMC sub-ontology [3] has been developed together with pilot users and HPC centers to cover common and specific demands. The main elements are methods (such as partial differential equations), tools (hardware and software) and processing steps (code setup), which allow describing any specific workflow and the produced research data. Taking advantage of this flexibility, the ontology will be constantly expanded and adjusted based on user experience and community demands.
To facilitate metadata enrichment for researchers and to transfer the metadata scheme into real usage, the process should be highly automated and standardized. Our newly developed metadata crawler [4] helps in retrieving metadata for physics-based data, which are related to the physical parameters of the problem under investigation, as well as for machine data, which are related to the usage and the performance of the computing system where simulations are run. The required metadata can be retrieved from all processing steps of the investigation and can be classified according to a chosen ontology, such as Metadata4Ing in our test case.
Originally developed at the Chair of Aerodynamics and Fluid Mechanics at TUM, the crawler is a Python-based application capable of reading ontologies in order to create a dictionary with user-relevant properties to be filled. The dictionary is then used to create a metadata file in a human-readable format that accompanies the main data set. The usage of the crawler will be shown by means of a minimal working example, even though the application is capable of handling large datasets as they are typical for HPC systems such as those available at LRZ, HLRS and JSC.
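A minimal sketch of this ontology-to-dictionary-to-sidecar-file idea is shown below using rdflib; the file names, property selection and example values are illustrative assumptions and greatly simplified compared to the actual crawler.

    import json
    from rdflib import Graph
    from rdflib.namespace import OWL, RDF, RDFS

    # 1. Read the ontology (e.g. a local copy of Metadata4Ing) and collect its datatype properties.
    onto = Graph().parse("metadata4ing.ttl", format="turtle")   # illustrative local file

    template = {}
    for prop in onto.subjects(RDF.type, OWL.DatatypeProperty):
        label = onto.value(prop, RDFS.label) or prop.split("#")[-1]
        template[str(label)] = None          # to be filled by the user or by parsing job files

    # 2. Fill a few entries automatically (illustrative values) ...
    template.update({"number of MPI ranks": 4096, "solver": "CFD code X"})

    # 3. ... and write a human-readable metadata file next to the simulation data set.
    with open("simulation_run.metadata.json", "w") as f:
        json.dump(template, f, indent=2, ensure_ascii=False)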
To complement the holistic HPMC-metadata approach, we have developed a concept to publish the retrieved metadata along with the corresponding research data in an institutional repository. The published metadata will be connected to the upcoming (indexable) NFDI4Ing Metadata Hub [5] to make HPMC research data findable and accessible for the whole research community.
[1] https://zenodo.org/record/5957104#.YhizODUo-Uk ↗
[2] https://nfdi4ing.pages.rwth-aachen.de/metadata4ing/metadata4ing/1.0.0/index.html ↗
[3] https://git.rwth-aachen.de/nfdi4ing/metadata4ing ↗ (still to come)
[4] https://gitlab.lrz.de/nfdi4ing/crawler ↗
[5] https://nfdi4ing.de/base-services/s-3/ ↗
License: CC-BY 4.0 ↗
Effective Research Data Sharing during the Plant Evolution in the automated Production Domain - Presentation
Automated production systems often operate for a long period of time and evolve to fulfill changing requirements. Similarly, machines and plants may also evolve to incorporate innovative technologies and changing functional requirements [1], especially in the case of research systems. Accordingly, the process and engineering data of the plant also change during its evolution. In each change scenario, multiple disciplines are involved and domain-specific models may be generated to shape the system from multiple perspectives. The evolutionary information and the heterogeneity of data sources challenge the efficiency and reusability of shared research data in the automated production domain [2]. Although various variant- and version-management tools have been developed, it is still time-intensive to identify the data changes within shared models or documents in each change scenario. In the wake of digitalization, increasing amounts of research data are becoming available in the automated production domain; however, their usage is often limited to data collection and simple visualization [3]. To help researchers gain an overview of the changes in each evolution more effectively and to reuse plant-related knowledge, the relations between information changes and the terminological data of the system should be formally represented and shared in an openly accessible data repository.
In this contribution, the authors introduce a guideline for sharing research data of automated production systems with regard to the information changes during their evolution. First, the basic building blocks of an open-access data repository are introduced. To further promote the interoperability of the shared data, an ontology of the information in the repository is developed, which links the terminologies of a production system to the relevant engineering models that are developed during each evolution. With the formalized data dependencies and system terminologies, the available data within each version of the models is restructured and can be presented to the users intuitively. As a proof of concept, the proposed methodology is applied to share the research data of a lab-sized demonstrator that undergoes different evolutions. System data is systematically stored together with formalized knowledge and terminologies in the form of an ontology, using a popular version management tool (GitHub).
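A minimal sketch of how one evolution step could be represented as linked triples is shown below with rdflib; the namespace, class and property names are illustrative assumptions and not the authors' ontology.

    from rdflib import Graph, Namespace, Literal
    from rdflib.namespace import RDF

    EX = Namespace("https://example.org/plant#")   # illustrative namespace
    g = Graph()
    g.bind("ex", EX)

    # Terminological data of the demonstrator ...
    g.add((EX.ConveyorModule, RDF.type, EX.SystemComponent))

    # ... linked to an evolution step and the engineering models it changed.
    g.add((EX.Evolution_3, RDF.type, EX.ChangeScenario))
    g.add((EX.Evolution_3, EX.modifiesComponent, EX.ConveyorModule))
    g.add((EX.Evolution_3, EX.producesModelVersion, EX.ElectricalPlan_v4))
    g.add((EX.Evolution_3, EX.rationale, Literal("Replace light barrier with RFID reader")))

    # The serialized graph can be versioned together with the models, e.g. on GitHub.
    g.serialize("plant_evolution.ttl", format="turtle")

    # Simple query: which models does a researcher need to review after Evolution_3?
    for model in g.objects(EX.Evolution_3, EX.producesModelVersion):
        print(model)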
In conclusion, an approach to effectively share the research data of an automated production plant will be proposed. We will discuss the requirements for sharing research data during plant evolution and propose the necessary components of an openly accessible data repository to satisfy these requirements. To make the shared information regarding the changes in each scenario not only intuitively visible for other researchers but also interpretable by computers, an ontology is developed. Benefiting from this information ontology, the research data of a production system, including the embedded knowledge, is formally represented and can be reused for upcoming updates. In addition, when new changes arrive, the relevant data can be automatically added to the ontology. In the future, supporting tools can be developed for users to query the system information in each change scenario. To further improve the efficiency of data sharing, the developed ontology can be merged with other graph-based databases to realize automatic inconsistency checking of system information.
References:
[1] C. Legat, J. Folmer, and B. Vogel-Heuser, “Evolution in industrial plant automation: A case study,” in IECON 2013-39th Annual Conference of the IEEE Industrial Electronics Society, 2013, pp. 4386–4391.
[2] B. Vogel-Heuser, F. Ocker, I. Weiß, R. Mieth, and F. Mann, “Potential for combining semantics and data analysis in the context of digital twins,” Philosophical Transactions of the Royal Society A, vol. 379, no. 2207, p. 20200368, 2021.
[3] M. Fuhrmans and D. Iglezakis, “Metadata4Ing - Ansatz zur Modellierung interoperabler Metadaten für die Ingenieurwissenschaften,” RWTH-2021-08329, 2020. [Online].
License: CC-BY 4.0 ↗
Data literacy right from the start - FAIR Data Management in Engineering Sciences in the first semester of the Bachelor of Mechanical Engineering - Sustainable Engineering at TU Darmstadt - Presentation
Data literacy right from the start: this paradigm shows the way to sustainability in the use and provision of research data. FAIR data creates transparency and trust and is part of good scientific practice. The Department of Mechanical Engineering at TU Darmstadt has revised the curricula of its degree programmes in recent years. The first aim was education in responsible research and innovation; the focal points in mechanical engineering are derived from the UN Sustainable Development Goals. The second aim was to anchor digitisation as a competence in the curriculum. In the process, data and digital competence as well as FAIR data management were integrated into existing modules and distributed across all semesters.
One of these modules is Fundamentals of Digitisation in the first semester of the Bachelor's programme in Mechanical Engineering - Sustainable Engineering. It consists of a lecture (2 SWS) and an exercise (2 SWS). The latter is divided into a lecture hall and a group exercise. The module is worth 4 credit points. As a competence-oriented examination, the students submit project work with a programming focus during the semester.
FAIR Data Management in Engineering Sciences is the second part of the course, comprising four lectures and four exercises. The chapters of the lecture are:
(i) Motivators for FDM, (ii) Data Quality, (iii) Data and Models, (iv) Knowledge Management.
In the first chapter, the motivators for data management are elaborated with four examples from academia, industry and society. With examples from research (urbanisation, cavitation), data management and the research process are linked. The students learn about the steps they will later have to carry out themselves in the module Practical Digitisation through Research-Based Learning. Based on good scientific practice, the topics of software development and the FAIR principles are addressed.
In the chapter on data quality, a distinction is made between formal and content-related data quality. Aspects of formal data quality are metadata, file formats and identifiers. Content-related data quality is addressed via a short introduction to statistics; a more in-depth treatment takes place in the module Measurement Technology, Sensor Technology and Statistics (4th semester). An essential aspect of the introduction is the differentiation of uncertainty into data, model and structural uncertainty.
The consideration of data and models in chapter 3 leads from axiomatic to data-driven to hybrid models. The requirements for models according to Heinrich Hertz are illuminated. This forms the basis for considering the value of data and the data economy. As this subject is highly topical, current issues such as the EU Data Act are discussed.
The last chapter deals with knowledge management. The central question is how knowledge and wisdom can emerge from data. The concepts of semantics, ontologies and knowledge graphs are taught using everyday examples.
The exercises are programming exercises in Python. A basic understanding of algorithms and an introduction to MATLAB and Python are already provided in the first part of the module. The contents of the exercise are:
(i) Version control (git), (ii) Visualisation of data, (iii) file formats (HDF5), (iv) data structures.
A characteristic of the exercises is that they build on each other and are application-oriented. The exercise is also the basis for the project work. There, the students have to carry out the following data processing steps:
(i) Read in data, (ii) check data, (iii) analyse data, (iv) process data, (v) visualise data, (vi) version code, (vii) document code.
The subject of the project work is a data set from research, which the students have to re-use.
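A minimal Python sketch of such a pipeline, assuming a hypothetical HDF5 measurement file with a pressure time series, is shown below purely for illustration; steps (vi) and (vii) are handled outside the script with git and documentation.

    import h5py
    import numpy as np
    import matplotlib.pyplot as plt

    # (i) read in data from a (hypothetical) HDF5 file with a pressure time series
    with h5py.File("measurements.h5", "r") as f:
        time = f["run_1/time"][:]
        pressure = f["run_1/pressure"][:]

    # (ii) check data: drop non-finite samples
    mask = np.isfinite(pressure)
    time, pressure = time[mask], pressure[mask]

    # (iii)/(iv) analyse and process: mean value and a simple moving average
    mean_p = pressure.mean()
    smoothed = np.convolve(pressure, np.ones(50) / 50, mode="same")

    # (v) visualise data
    plt.plot(time, pressure, label="raw")
    plt.plot(time, smoothed, label="moving average")
    plt.xlabel("time / s")
    plt.ylabel("pressure / Pa")
    plt.legend()
    plt.savefig("pressure.png")

    print(f"mean pressure: {mean_p:.2f} Pa")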
At the end of the course, the students know the basics of FAIR data management, can classify and evaluate them and implement them as algorithms.
With the first-time implementation of the module Fundamentals of Digitisation in the winter semester 21/22, initial experience could be gained with regard to the teaching of data competence in the first semester. In addition to an insight into the course, these are presented as lessons learned. Furthermore, the conception and implementation of the exercises will be explained in detail and potential for improvement for coming semesters will be worked out.
License: CC-BY 4.0 ↗
Towards Improved Findability of Energy Research Software by Developing a Metadata-based Registry - Presentation
In energy research, self-designed software and software components are crucial tools for multiple purposes, such as the visualization of processes and values (e.g., power quality), the (co-)simulation of smart grids, or the analysis of transition paths. Within a typical research cycle, this self-designed software is often a starting point and therefore fundamental for producing new research results, while it also represents a result of the performed research.
Energy research software (ERS) can be defined as software that is employed in the scientific discovery process to understand, analyze, improve, and/or design energy systems. Software-wise, ERS ranges from simple scripts over libraries (e.g., for Python) up to full software solutions. Content-wise, it can, for example, visualize, analyze, and/or generate (artificial) data from energy (sub-)components or grids in laboratories or the real world. Alternatively, it can also represent particular energy (sub-)components, energy (distribution) systems, and transition paths of energy use, distribution, conversion, and/or generation to analyze their design and/or control in simulations and optimizations.
The increased need for ERS leads to the development of multiple models and frameworks from different subdomains, partly with overlapping scope. Often, new tools are developed without reusing already existing ones, especially across institutional barriers. Since ERS will become even more complex in the upcoming years, e.g., due to the relevance of research on cyber-physical systems, a lot of time is spent on (re)developing software instead of doing research, which slows down scientific progress. Different approaches to formulating FAIR criteria for research software show that metadata and repositories for these metadata, e.g., software registries, are key elements for FAIR research software. In particular, the findability of ERS can be increased by describing it with useful metadata and including it in a registry. Good metadata and a registry are a first step toward increasing the reuse of ERS and improving the research process in energy research. Therefore, our goal is to develop a good metadata schema for ERS, a metadata generation tool, and a metadata-based registry for ERS.
A metadata schema for ERS should be usable for all different types of ERS to increase their findability and is the foundation for the other two artifacts. It consists of elements describing the categories of metadata, guidelines for creating metadata, a syntax, and constraints for the metadata, e.g., ontologies as value vocabularies.
A metadata generation tool should support all researchers to create high-quality metadata for their ERS. It should lower the entrance barrier for creating metadata for all researchers in the energy domain without the need for a deeper understanding of the underlying technologies. For example, developers of a framework for co-simulation can register their software by entering a link to their repository and adding a few more descriptive data. The description should include as many terms from ontologies as possible. This can be achieved by linking the metadata generation tool to the terminology service developed in NFDI4Ing.
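As a minimal sketch of what such a generation tool might emit, the following Python snippet builds a record reusing elements of the existing CodeMeta vocabulary; the field selection, repository URL and keywords are illustrative assumptions, not the project's final schema.

    import json

    def generate_ers_metadata(repo_url, name, description, language, keywords):
        """Build a minimal, CodeMeta-flavoured metadata record for an ERS entry."""
        return {
            "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
            "@type": "SoftwareSourceCode",
            "name": name,
            "description": description,
            "codeRepository": repo_url,
            "programmingLanguage": language,
            "keywords": keywords,            # ideally terms from energy-domain ontologies
        }

    record = generate_ers_metadata(
        repo_url="https://example.org/git/grid-cosim",      # illustrative entry
        name="grid-cosim",
        description="Co-simulation framework for distribution grids",
        language="Python",
        keywords=["co-simulation", "smart grid"],
    )
    print(json.dumps(record, indent=2))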
A registry for ERS should help researchers to find the right software based on multiple search criteria, e.g., researchers wanting to do a grid simulation can look for a library or framework compatible with the models in a certain programming language they already use.
Based on the requirements of these three artifacts, we formulated three research questions:
1. Which metadata elements are required to enable a useful classification and description of ERS to make it more FAIR?
2. Which elements from existing metadata schemas are suitable to be reused to describe ERS and which existing domain-specific ontologies can be used as a value vocabulary?
3. How can additional input information like keywords or already used software be used to improve the search results when looking for energy research software?
For the development of a metadata schema for ERS, existing metadata schemas for research software in general and in other domains as well as existing approaches in the energy research domain are relevant. Therefore, we will give an overview of this related work including CodeMeta for general research software, OntoSoft for research software in geoscience, biotoolsXSD for research software in life science, EngMeta for data in engineering, and the Open Energy Metadata for data in the energy domain. Within good metadata, ontologies are used as value vocabularies to improve the interoperability of metadata. Therefore, we also give a brief overview of existing energy-related ontologies.
Finally, we present our approach for the further development of our artifacts. In the first two steps, we will develop a metadata schema for ERS as an application profile based on the methodology of Curado Malta and Baptista. Therefore, we will perform a requirement analysis to develop a domain model. Then, based on the domain model, we will decide on metadata elements and will examine, which elements of existing metadata schemas can be reused. In the third step, we will develop the registry and in the fourth and final step, we will build the metadata generation tool.
License: CC-BY 4.0 ↗
ing.grid – FAIR Publishing with Open Review - Presentation
What is FAIR research data management without a FAIR journal in which its methods can be published? ing.grid is the first journal for FAIR data management in engineering sciences, offering a platform and recognition for sound scientific practice that involves all integral parts of a publication: the manuscripts, the datasets and the software. Focusing not only on the publication but also on the publishing process itself, ing.grid opens up the traditional peer review process into its novel concept of Open Peer Review and makes the entire lifecycle of the publication transparent. Review comments can be submitted by referees assigned by the editors as well as by members of the respective community or the public. After the article is accepted, the review comments are immediately visible to the public while the review discussion is still ongoing. ing.grid believes that this process will lead to a higher quality of the review comments, the authors' responses and the submissions in general. In the scope of the talk, an overview of the focus and scope of ing.grid will be given and the Open Peer Review process will be presented in detail.
ing.grid Open Peer Review Process
Using a hybrid peer review process combining single-blind peer review and community peer review, ing.grid fosters open discussion and exchange in the engineering community on all issues related to data management and gives scientific credit by publishing contributions on all endeavors regarding FAIR data management. The novelty of ing.grid's approach to open peer review lies in the following three points:
- Transparent publishing process of the manuscript involving a preprint server
- For each publication published at ing.grid, readers can view the submitted version of the manuscript as well as the entire review discussion on ing.grid’s own preprint server. Peer review no longer takes place behind closed doors where unprofessional communication is common. Such "open peer review" has been shown to increase the quality and professionalism of reviewer comments (https://www.science.org/content/article/rude-paper-reviews-are-pervasive-and-sometimes-harmful-study-finds).
- High quality of publications ensured by single-blind peer review
- Once a manuscript is submitted to ing.grid, editors invite experts in the respective field to submit specially highlighted review comments on the preprint repository. In this way, the established method for scientific quality assurance is still maintained within the framework of open review.
- Involvement of the community and the public in the review process
- The review process is not restricted to referees assigned by the editors, but rather open to the members of the community or the public.
Novel Approaches, Golden Standards
While radically rethinking the review process, ing.grid recognizes several initiatives that are standard in scientific publishing.
First and foremost, ing.grid is hosted on the TUjournals service of the Technical University of Darmstadt, which signed the Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities and extended an Open Access Policy to the whole university. TUDa as well as ing.grid understand that the long-term availability of human knowledge should not depend on economic interests. ing.grid is classified as a Diamond OA journal: authors do not pay article processing charges, and all content is freely available without charge to the user or their institution. Users are allowed to read, download, copy, distribute, print, search, or link to the full texts of the articles, or use them for any other lawful purpose, without asking prior permission from the publisher or the author. Authors retain the copyright and full publishing rights of their articles without restrictions.
ing.grid assures an environment in which the whole community is connected and able to share and discuss the results of research before its publication. A successful open review process also depends on all parties taking part in this setting constructively, since respectful behaviour is necessary for an atmosphere of ideas and collaboration. To ensure that the ing.grid community is a harassment-free experience for everyone, the editors supervising the publication process of a submission are obliged to moderate community and review comments.
Future scientific publishing
With the presented concepts, ing.grid lays the foundation for the future of scientific publishing. Transparency, traceability, comprehensibility and reproducibility are fostered by Open Peer Review and by complete submissions including all three entities: manuscripts, software and datasets. Further challenges consist in linking these entities through suitable metadata. In the scope of future work, ing.grid is set to develop a journal knowledge graph to enhance the user features for authors, readers, reviewers and editors.
License: CC-BY 4.0 ↗
Betty’s (Re)Search Engine: A client-based search engine for research software stored in repositories. - Demonstration
The findability, accessibility, interoperability and reusability (FAIR [1]) of research software often depends on the willingness of researchers and scientists to publish the results of their work beyond the written explanation (papers, articles, technical reports, etc.) and to reference those results accordingly. Promoting a novel approach or an improvement of an existing method without providing the source code that was used to perform the research means a greater effort for every researcher downstream who wants to use the described research software. Services like Papers With Code [2] acknowledge this problem and address it by requiring a valid link to a corresponding repository with every publication on the platform. However, the process of finding research software for a specific purpose is in itself inefficient. One must look through a publication before being able to hope for a corresponding link to a repository where the research software is stored. This way of searching also does not allow applying preferences (e.g., only searching for research software written in Python).
Therefore, we present Betty’s (Re)Search Engine, a novel approach to searching and sorting research software directly. In a cascading search, the tool first finds a number of GitHub repositories that correspond to a given search query. In a second step, it then tries to find papers, articles, data instances, etc. on platforms like Zenodo [3], Semantic Scholar [4] or the Open Research Knowledge Graph [5] that are linked to a repository in the result list. If a matching publication exists, we check the number of citations that publication has collected over time. This additional meta-information about the repositories lets us sort them according to their number of citations.
The result of this process is a list of repositories for a search query, sorted by their relevance in research. One key attribute of Betty’s (Re)Search Engine is that it is written solely in languages that can run in the browser; a classic frontend/backend architecture is therefore not required. Users effectively access the different APIs themselves through our tool, supplying their own credentials and consequently managing the rate limits themselves. This gives users the maximum amount of privacy possible with the selected APIs. We plan to put Betty’s (Re)Search Engine online by the end of 2022 under an NFDI4Ing (“nationale Forschungsdateninfrastruktur für die Ingenieurwissenschaften”) domain.
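To illustrate the cascading idea, the following is a minimal Python sketch under simplifying assumptions: it uses the public GitHub and Semantic Scholar search APIs, matches publications naively by repository name, and ignores authentication and rate limiting. The actual tool runs client-side in the browser and also queries Zenodo and the Open Research Knowledge Graph.

    import requests

    def cascading_search(query, limit=5):
        # Step 1: find candidate repositories on GitHub.
        repos = requests.get(
            "https://api.github.com/search/repositories",
            params={"q": query, "per_page": limit},
            timeout=30,
        ).json().get("items", [])

        results = []
        for repo in repos:
            # Step 2: look for a linked publication and its citation count,
            # here via the Semantic Scholar Graph API as one example source.
            papers = requests.get(
                "https://api.semanticscholar.org/graph/v1/paper/search",
                params={"query": repo["full_name"],
                        "fields": "title,citationCount", "limit": 1},
                timeout=30,
            ).json().get("data", [])
            citations = papers[0].get("citationCount", 0) if papers else 0
            results.append({"repository": repo["html_url"], "citations": citations})

        # Step 3: sort the repositories by the citations of their linked publications.
        return sorted(results, key=lambda entry: entry["citations"], reverse=True)

    print(cascading_search("research software metadata"))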
Further work will deal with extending Betty’s (Re)Search Engine’s functionality by supporting more services and databases that users can simply plug into the existing tool, so that users with access to paid services (e.g., IEEE Xplore [5]) or platforms with restricted access (e.g., a university’s or company's internal GitLab) can apply the cascading search even more broadly. We believe that by enabling users to utilize this novel approach to searching for research software, we make research software stored in GitHub/GitLab repositories more FAIR and, by that, save our users valuable time.
[1] M. Wilkinson et al., "The FAIR Guiding Principles for scientific data management and stewardship", Scientific Data, vol. 3, no. 1, 2016. Available: https://doi.org/10.1038/sdata.2016.18 [Accessed: 19 Aug. 2022].
[2] "Papers with Code - The latest in Machine Learning", Paperswithcode.com, 2022. [Online]. Available: https://paperswithcode.com/ ↗ [Accessed: 19 Aug. 2022].
[3] "Zenodo - Research. Shared.", Zenodo.org, 2022. [Online]. Available: https://zenodo.org/ ↗ [Accessed: 19 Aug. 2022].
[4] "Semantic Scholar – AI-Powered Research Tool", semanticscholar.org, 2022. [Online]. Available: https://www.semanticscholar.org/ ↗ [Accessed: 19 Aug. 2022].
[5] "IEEE Xplore", ieeexplore.ieee.org, 2022. [Online]. Available: https://ieeexplore.ieee.org/Xplore/home.jsp ↗ [Accessed: 19 Aug. 2022].
License: CC-BY 4.0 ↗
Automatic Extraction of Descriptive Metadata to Promote the Usage of RDM Tools - Presentation
Transforming research data into FAIR Digital Objects (FDOs) is an ongoing endeavor. One critical aspect of FDOs is the annotation of research data with metadata. Metadata comes in many forms: administrative metadata (e.g. location or rights), structural metadata (e.g. provenance information), and descriptive metadata (e.g. who, when or what). Administrative and technical metadata is usually generated automatically due to its nature, while descriptive metadata mostly needs to be entered manually. This creates a barrier to the adoption of research data management (RDM) tools, since manually entering metadata for an FDO is often a very tedious and time-consuming task. Thankfully, some manual entering of descriptive metadata is in fact unnecessary, since specific types of data, or the ways data was generated, already contain required metadata values (e.g. who and when). In particular, the "what" part is reflected by the data itself and can usually be formulated as a summary of the data (e.g. a text summary of a PDF). Therefore, this work proposes an automated workflow to extract this descriptive metadata instead of entering it manually.
The first step of the metadata extraction looks at and into the research data itself, collecting the data surrounding it (such as the creation date) and describing the content (e.g. with a text summary or a list of objects displayed in an image). For different types of data, additional extracting algorithms can be added. These so-called extractors are registered by referencing the specific Internet Media Types they are responsible for. An additional component is the extraction of the text artifacts found in the research data, under the assumption that these text artifacts contain valuable information about the content. Therefore, in a second step, text that can be extracted from e.g. a PDF document or an image is collected and passed to an extractor specifically created for dealing with text. This extractor transforms the text into so-called facts which describe the content. The resulting metadata values of all extractors are then described in the Resource Description Framework (RDF) using existing and use-case-specific ontologies. For values which cannot easily be mapped to an ontology, a further mapping step is proposed, which maps metadata values to a given metadata schema or application profile formulated in the Shapes Constraint Language (SHACL). With this mapping, it can be ensured that several use cases of different domains can be integrated into a generic extraction step.
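As a rough illustration of the extractor idea, the following sketch registers an extractor by Internet Media Type and describes its output in RDF with Dublin Core terms. The function names, properties and example resource are our own assumptions, not the tool's actual interface.

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DCTERMS

    EXTRACTORS = {}  # registry keyed by Internet Media Type

    def extractor(media_type):
        """Register a function as the extractor responsible for a media type."""
        def register(func):
            EXTRACTORS[media_type] = func
            return func
        return register

    @extractor("application/pdf")
    def extract_pdf(path):
        # A real extractor would open the PDF, read embedded dates and text,
        # and summarise the content; placeholder values are returned here.
        return {"created": "2022-08-19", "summary": "Placeholder text summary."}

    def to_rdf(resource_uri, values):
        """Describe the extracted values in RDF, here with Dublin Core terms."""
        g = Graph()
        subject = URIRef(resource_uri)
        g.add((subject, DCTERMS.created, Literal(values["created"])))
        g.add((subject, DCTERMS.abstract, Literal(values["summary"])))
        return g

    values = EXTRACTORS["application/pdf"]("report.pdf")
    print(to_rdf("https://example.org/data/report.pdf", values).serialize(format="turtle"))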
For the technical implementation of metadata extraction, the following describes and demonstrates an RDM tool that can apply the workflow presented above. The RDM tool itself can be adapted and extended for every use case by following open-source principles and providing generic interfaces. To be easily integrable into other RDM tools, it is available as a Docker image and provides a simple HTTP API. Thanks to its dynamic configuration, users can describe the requirements of specific use cases and use the extracted metadata in other RDM tools.
One such RDM tool is the research data management platform Coscine. In Coscine, metadata values have so far mainly been added manually. Therefore, it is a perfect fit for the implementation of an automated metadata extraction. Experiments with smaller use cases of research data show that the number of currently required manually entered values can be reduced by making use of the extracted values. The proposed solution to implement the metadata extraction in Coscine is to utilize a template engine. The idea is that, with this, researchers can describe what kind of metadata values they expect for their research data. The template engine then collects the metadata values and puts them in their correct place.
Therefore, this work aims to show the effectiveness of automated metadata extraction and tries to support researchers in the annotation of their research data. Looking into future use cases, with more implementations of data type-specific extractors, the power of the extracted metadata values could be brought to full use, e.g. traversing the content of archive formats like HDF5 by only utilizing the metadata. Specifically, this would make it easier to compare research data since the extracted metadata is more likely to correspond to defined ontologies. In summary, the tool for automatic extraction of descriptive metadata described here can significantly simplify the management of research data for researchers in terms of FAIR Digital Objects.
License: CC-BY 4.0 ↗
Metadata4Ing: An ontology for describing the generation and provenance of research data within a scientific activity - Presentation
Research in engineering is heavily based on existing scientific and process data, which become increasingly available with the rise of data publications and Open Science. A fast and effective data exchange and re-use requires suitable tools to locate existing datasets, to merge them with newly generated data, and to ingest them into automated workflows. A prerequisite for that is the capability to search and filter across data collections based on their content, and to make data content interpretable for machines. For that reason, datasets should be provided in a machine-readable format, together with a rich and unequivocal metadata description.
This can be achieved via the adoption of a formalized language common to data search engines and data repositories, which unifies a semantic description of the datasets, their content and their provenance/generation, thus allowing selective searching and filtering as well as machine interpretability of the selected information. The most powerful form of such a language is an ontology, i.e. a formal conceptualization of knowledge related to a specific domain, which provides basic terms for the creation of (meta)data schemata enabling a rich, coherent and fully semantic description of research data.
Metadata4Ing (m4i) (https://w3id.org/nfdi4ing/metadata4ing/ ↗) is an ontology that provides a framework for the semantic description of research data, with a particular focus on engineering and neighbouring disciplines. It offers terms and properties for the description of engineering workflows, the data generation process (experiment, observation, simulation) and engineering results. It considers, for example, the object of investigation, sample and data manipulation procedures, a summary of the data files, and personal and institutional roles of participants in data-driven research processes.
Metadata4Ing builds on existing ontologies like the Basic Formal Ontology (BFO), the PROV Ontology and the Data Catalogue Vocabulary (DCAT) and is extendable to the requirements of specific fields by deriving subclasses with specific properties. With its central class "processing step", Metadata4Ing makes it possible to model the input and output of data processing tasks together with the methods and tools used. Steps can be connected to each other and divided into substeps (a brief example sketch follows below). The subclasses and relations are built according to the two principles of inheritance and modularity:
- Inheritance: a subclass inherits all properties of its superordinate class, possibly adding some new ones.
- Modularity: m4i allows the definition of modular expansions dedicated to engineering subfields, for instance defining new classes and properties for the description of new combinations of method × object of research.
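The following minimal sketch shows how a processing step with its input, output and tool could be described with rdflib. The namespace and the exact class and property IRIs (e.g. m4i:ProcessingStep, m4i:hasInput) are assumptions for illustration only and should be checked against the published ontology documentation.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, RDFS

    M4I = Namespace("https://w3id.org/nfdi4ing/metadata4ing#")  # namespace as assumed here
    EX = Namespace("https://example.org/project/")              # hypothetical project namespace

    g = Graph()
    g.bind("m4i", M4I)
    g.bind("ex", EX)

    step = EX["simulation-run-42"]
    g.add((step, RDF.type, M4I.ProcessingStep))            # assumed class IRI
    g.add((step, RDFS.label, Literal("CFD simulation of test case 42")))
    g.add((step, M4I.hasInput, EX["mesh-v3"]))             # assumed property IRI
    g.add((step, M4I.hasOutput, EX["pressure-field-42"]))  # assumed property IRI
    g.add((step, M4I.hasEmployedTool, EX["openfoam-v9"]))  # assumed property IRI

    print(g.serialize(format="turtle"))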
The ontology is developed and curated by the NFDI4Ing working group of the same name, Metadata4Ing, in an ongoing process considering feedback and requests from the community, e.g. via competency questions (https://git.rwth-aachen.de/nfdi4ing/metadata4ing/kompetenzfragen/-/issues ↗). The terms of Metadata4Ing are available on the terminology service of NFDI4Ing (https://terminology.nfdi4ing.de/ts4ing/ontologies/m4i ↗) and on Linked Open Vocabularies (https://lov.linkeddata.es/dataset/lov/vocabs/m4i ↗).
Machine-readable metadata descriptions employing m4i terms can prospectively be generated using the metadata crawler (developed as part of the Doris archetype) and the ELN Kadi4Mat (further developed as part of the Caden archetype) and be deposited in metadata services like CoSciNe, DaRUS and the NFDI4Ing metadata hub.
In this session we will demonstrate the usage of the ontology based on a use case. After an introduction of the main classes and properties, we show the application of Metadata4Ing using a real-world example and conclude with an open discussion on the possible future evolution of the ontology. A first subontology of Metadata4Ing, specializing in HPC processes, will also be presented at the conference.
License: CC-BY 4.0 ↗
FAIR-IMPACT: Expanding FAIR Solutions across the European Open Science Cloud - Presentation
In 2015 the vision of a European Open Science Cloud (EOSC) emerged. EOSC would provide an open and trusted environment for accessing and managing a wide range of publicly funded research data and related services, helping researchers reap the full benefits of data-driven science. EOSC is now entering its implementation phase (2021-2027), which requires active engagement and support to ensure widespread implementation and adoption of the FAIR (Findable, Accessible, Interoperable, Reusable) principles, to define and share standards, and to develop tools and services that allow researchers to find, access, reuse and combine research results. With the ambitious goal of realising an EOSC of FAIR data and services, the FAIR-IMPACT project responds to these needs by supporting the implementation of FAIR-enabling practices, tools and services across scientific communities at European, national and institutional level.
The project will focus on developing a more coherent implementation of persistent identifiers (PIDs), increasing data accessibility through enhancing interoperability on all levels, and progressing work around metrics and FAIR assessment. This short talk will provide an overview of the FAIR-IMPACT support programme and pillar activities to advance the findability, accessibility, interoperability and reusability (“FAIRness”) of data and other research objects. A key focus of the project is to help research performing organisations, research data repositories, and national level initiatives to take up those successful and emerging practices, policies, tools and technical specifications that are available to enable FAIR in a practical sense. To this end, the project will feature a number of open calls for support – both financial and in-kind. There will be time dedicated to open questions and discussion following the brief overview of the project and our aims.
License: CC-BY 4.0 ↗
nfdi4energy - Coupling Points to the Engineering RDM World - Workshop
nfdi4energy is a new consortium recommended for funding in the third and final funding round of NFDI. It focuses on energy system research which overlaps with some engineering fields like mechanical engineering, electrical engineering, and computer science. The necessary transformation of energy systems towards net zero greenhouse gas emissions provides a plethora of new research challenges. New interconnections between different energy sectors, such as power, heat, and mobility, increase the system's complexity. In this context, the digitalisation towards cyber-physical energy systems (CPES) alleviates change, and equally affects technical, social, and societal topics, as well as the mode of research in the CPES research community. Research efforts towards CPES heavily rely on modelling and (co-)simulation-based approaches. Tracking of models together with all data creates a complex software and data management challenge, which needs to be addressed in each research project. Therefore, the handling of research software presents a key motivation for nfdi4energy.
To this end, nfdi4energy covers the whole research and transfer cycle of projects in energy system research, ranging from (1) identifying relevant competencies for a research field; (2) defining relevant scenarios and experimental setups; (3) integrating models and data, and coupling tools and laboratories; (4) extracting results and facilitating public consultation; to (5) identifying research challenges for follow-up activities.
We define the following key objectives for nfdi4energy: (1) Establish common research community services for FAIR data, models, and processes in energy system research and motivate its use in the community. (2) Allow traceability, reproducibility, and transparency of results for the scientific community as well as for the society, improving the overall FAIRness. (3) Enable and motivate the involvement of society for the identification and solution of relevant research questions. (4) Promote better collaboration and knowledge transfer between scientific research institutes and business partners via FAIR research data management. (5) Simplify identification, integration, and coordination of simulation-based models. (6) Integrate the provided services for energy system research within the wider NFDI ecosystem to improve cross-domain collaboration.
To fulfil these objectives, nfdi4energy concentrates on five key services to provide non-discriminatory access to digital research objects for all relevant stakeholders: (1) Competence to help to navigate the interdisciplinary research field, (2) Best Practices to get information about the successful conduct of research including research data management, (3) Registry to find suitable data and software, (4) Simulation to couple existing simulations and, therefore, reuse software artefacts, and (5) Transparency to involve more stakeholders in all research stages.
The work programme of nfdi4energy includes seven Task Areas (TAs). The first three TAs address different stakeholders in energy system research: the research community (TA1), the public & society (TA2), and industry (TA3). nfdi4energy will perform an intensive requirement analysis with all three stakeholder groups to design all services according to their needs. In particular, how to include society in energy research data remains an open methodological question and is therefore specifically addressed in TA2. Besides improving the FAIRness of data and software from researchers, nfdi4energy also concentrates on developing methods to improve the FAIRness of data from industry and applied research, because this data is often applied as input data for energy system research. But as it is often partly confidential, new approaches for FAIRification supporting the needs of industrial partners are required (TA3). TA4 focuses on FAIR research data, needed e.g. within simulations, which are at the centre of TA5. To determine more requirements and to evaluate the community services in TAs 1 to 5, we define three standard use cases of energy system research. These use cases are examples of three general types of energy system research projects, which should be supported by the developed community services in TAs 1 to 5. Finally, TA7 covers the overall organisation, realising the operation of the consortium, as well as the engagement with the sections of NFDI supporting different standardization processes, e.g. of metadata schemas and ontologies.
Based on the work in these seven TAs, nfdi4energy aims to develop and provide an open and FAIR research ecosystem in the energy system domain containing a large share of common workflows from data gathering to the inclusion into research software, together with data publications for researchers spanning from single component development (like battery storage systems or other new smart grid equipment) up to system-of-systems research (based on mathematical or analytical models).
In this workshop, we would like to present the general ideas behind nfdi4energy and discuss possible connection points to NFDI4Ing and further activities in RDM, especially reflecting some characteristics of nfdi4energy, like the intended integration of industry in RDM, support for simulation-based studies, and the chosen approach to requirement engineering. Thus, the workshop is especially open to the interested public from other consortia.
License: CC-BY 4.0 ↗
Agile RDM with Open Source Software: CaosDB (Covered Conference Topic: RDM tools and services: usability and automatisation) - Workshop
Cutting edge research is characterized by rapidly changing research questions and methods. At the same time, experiments can be very costly or impossible to repeat, more and more parameters need to be controlled, varied and their influences studied, and funding bodies impose requirements on proper data management. Still, standardized tools which fit these challenges are scarce.
We present CaosDB, an open scientific data management toolkit with a design focus on flexibility and semantic data representation. CaosDB was originally developed at the Max Planck Institute for Dynamics and Self-Organization because no other software could meet the requirements of the rapidly developing research environment. In 2018, it was released to the public under the terms of the AGPLv3; the source code is available at https://gitlab.com/caosdb. For CaosDB, flexibility comes first, so that changes to the data model are simple to implement if new requirements arise. Changing the data model requires neither migrating nor discarding old data: data added via the old and the new data models can simply be used, queried, modified and retrieved side by side. Data in CaosDB is structured according to constraints given by an object-oriented data model, which can be modified at any time while keeping the existing data unchanged. This data model sets CaosDB apart from NoSQL-like data lakes, which tend to turn into “data swamps” over time, although it was developed with similar goals in mind.
Data sets in CaosDB can, and should, link to each other and have meaningful properties which contain the semantic meaning of the stored data. Raw binary data and data files are not stored in CaosDB directly but can be referenced natively, enabling users to stick to their existing workflows, with files on network stores or locally. Changes to data or the data model are stored in a versioning schema, so that previous states can be reproduced later for audit trails, which allows fixing mistakes while keeping compliance with good scientific practice. Access to data can be controlled with a fine-grained role-based access permission system.
Usability of real-life systems arises not only from the ability of the system to adapt to changes, but often also from invisibility. CaosDB's server architecture provides a REST and a gRPC API and corresponding high-level client libraries in Python, C++, Julia, Octave/Matlab and JavaScript, for seamless integration into existing workflows. For cases where other data sources, such as files in a file system, Excel files or existing databases, should be used to feed CaosDB, there is a synchronization library written in Python, which relieves scientists from the chore of manually inserting their data into CaosDB by automating the data integration.
We show that these features combined enable users to implement an agile research data management with CaosDB, where small steps and a “starting now” mentality are encouraged and where the data management system can grow together with the developing requirements and experiences. We will also present a few real-world examples where CaosDB is used by research institutes to automate data collection and improve the long-term usability of valuable and often irreproducible data sets.
Hands-on workshop: Manage your Data with Python and CaosDB
Managing your data (or the data of your co-scientists) can be hard. Is there a way to organize your data in a way that is convenient and allows you to find relevant data sets, fast? In this hands-on workshop, we will give a practical introduction with CaosDB on how to store semantic relationships between data sets, how to annotate data, how to link search results and publications, and how to save time with an easy-to-learn (also for your non-technical colleagues) query interface.
The scientific data management toolkit CaosDB promotes an agile workflow which is difficult to follow with traditional databases, and at the same time it provides more structured approaches than NoSQL solutions. CaosDB is open source software (source code is available at https://gitlab.com/caosdb ↗) and offers connectivity via REST and gRPC APIs. High-level client libraries exist for a number of languages; in this workshop we will use the Python client library.
Participants are invited to work on their own data: bring an overview of what you consider your important data objects and how they are related (a pencil sketch on paper is sufficient), and install the required libraries on your machine: pip install caosdb caosadvancedtools
After this step, the following command should work in Python: import caosdb
At the end of the workshop, the participants will know how to implement their data model in CaosDB and how to modify it later. The workshop also teaches how to insert, update and retrieve data, and how complex questions can be easily translated into CaosDB's query language.
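As a rough preview of the hands-on part, the sketch below shows what working with the CaosDB Python client can look like, based on the publicly documented high-level interface. Exact argument names and query syntax may differ between versions, so treat it as an assumption to be checked against the current documentation rather than as workshop material.

    import caosdb as db

    # Connect to a CaosDB server (URL and credentials are placeholders).
    db.configure_connection(
        url="https://caosdb.example.org",
        username="workshop",
        password_method="plain",
        password="change-me",
    )

    # Define a small data model: a property and a record type.
    db.Property(name="date", datatype=db.TEXT).insert()
    db.RecordType(name="Experiment").insert()

    # Insert a record that instantiates the data model.
    db.Record(name="exp-2022-09-01") \
        .add_parent(name="Experiment") \
        .add_property(name="date", value="2022-09-01") \
        .insert()

    # Retrieve it again with CaosDB's query language (syntax as per the documentation).
    print(db.execute_query("FIND RECORD Experiment WITH date=2022-09-01"))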
License: CC-BY 4.0 ↗
Some important aspects of the NFDI4Cat ontology and metadata design - Presentation
NFDI4Cat is one of the NFDI consortia and targets research data management for catalysis-related sciences. The initial outcome of this project with respect to the architecture and interoperability requirements was obtained in the course of user interviews, collecting competency questions, and exploring the landscape of semantic artefacts [1,2]. Despite the numerous efforts in the development of metadata standards and domain ontologies, the advantages offered by an optimal interplay between them for the implementation of the FAIR principles seem not to be well understood yet.
In this work, general aspects of the metadata and ontology organization are considered and applied to some typical research steps in chemistry and spectroscopy. For representing the metadata, a prototype ontology is built. In general, it should be metadata-centered, suitable for the description of research workflows in experimental chemistry and spectroscopy, involve all relevant semantic artefacts for searching metadata and, in particular, enable semantic search according to workflow patterns. In the current implementation, these features are based on a model representation of a general research workflow in experimental chemistry and spectroscopy. By construction, the corresponding vocabulary can to a great extent be subsumed under overarching ontologies like BFO [3] and Metadata4Ing [4], which makes it compatible with other standards.
Last but not least, it is demonstrated how properly structured class hierarchies and relations can unequivocally shape the acquisition of required metadata in a systematic way and make it only loosely coupled with the nature of the metadata. To the best of my knowledge, this aspect of ontology design has not been considered in much detail. In contrast to this approach, the amount and quality of metadata created based exclusively on a top-level ontology seem to be very subjective.
The metadata for a given research step are encapsulated within a so-called metaset, i.e. a collection of the corresponding metadata files organized into a separate folder. The metadata are stored in the RDF data model using the Turtle serialization. Metasets are organized as an additional independent layer on top of the existing data infrastructure. Each metaset has its own URI that is based on an overarching (institutional) URI. At the same time, the URI of a metaset provides the base URI for the corresponding local resources. Published metasets can be hyperlinked by appropriate software via common URIs. This can be done by introducing hyperlinks on top of the Turtle serialization, or by transforming the latter into an informal hypertext. This would provide highly intuitive means for navigating and understanding various resources, which is more convenient than the application of a query protocol such as SPARQL. These and other possibilities for the lookup of metadata can be very practical. In principle, this implies Electronic Lab Notebook functionality for searching and accessing various information resources which is invariant with respect to the scaling of the scope from a local group to an overarching inter-institutional level. Such an invariance could have a huge impact on the global acceptance of the proposed approach in the community.
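The following is a small, purely illustrative sketch of a metaset as described above: a folder of Turtle files whose local resources share a base URI derived from an overarching institutional URI. All URIs, classes and properties used here are hypothetical placeholders.

    from pathlib import Path
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import DCTERMS, RDF

    BASE = "https://data.example-institute.org/metasets/2022-0815/"  # hypothetical metaset URI
    META = Namespace(BASE)                            # local resources live under the metaset URI
    VOCAB = Namespace("https://example.org/vocab/")   # placeholder mini-vocabulary

    g = Graph()
    g.bind("dcterms", DCTERMS)

    step = META["uvvis-measurement"]                  # a local resource within the metaset
    g.add((step, RDF.type, VOCAB.MeasurementStep))
    g.add((step, DCTERMS.description, Literal("UV/Vis spectrum of sample A-17")))
    g.add((step, DCTERMS.created, Literal("2022-08-15")))

    metaset_dir = Path("metaset-2022-0815")           # the metaset is just a folder of Turtle files
    metaset_dir.mkdir(exist_ok=True)
    g.serialize(destination=str(metaset_dir / "measurement.ttl"), format="turtle")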
References
1. M. Horsch, T. Petrenko, V. Kushnarenko, B. Schembera, B. Wentzel, A. Behr, N. Kockmann, S. Schimmler & T. Bönisch (2022). Interoperability and architecture requirements analysis and metadata standardization for a research data infrastructure in catalysis. In A. Pozanenko et al. (editors), Data Analytics and Management in Data Intensive Domains - 23rd International Conference: 166-177. Springer, Cham (ISBN 978-3-031-12284-2).
2. S. Schimmler, T. Bönisch, M. T. Horsch, T. Petrenko, B. Schembera, V. Kushnarenko, B. Wentzel, F. Kirstein, H. Viemann, M. Holeňa & D. Linke (2022). NFDI4Cat: Local and overarching data infrastructures. In V. Heuveline and N. Bisheh (editors), E-Science-Tage 2021 - Share Your Research Data: 277-284. heiBOOKS, Heidelberg (ISBN 978-3-948083-55-7).
3. http://basic-formal-ontology.org/bfo-2020.html ↗
4. https://nfdi4ing.pages.rwth-aachen.de/metadata4ing/metadata4ing/index.html ↗
License: CC-BY 4.0 ↗
RSpace: An Electronic Lab Notebook designed to enhance FAIR workflows and FAIRification of research data - Demonstration
This presentation and demo will introduce the RSpace electronic lab/research notebook and sample management system (https://www.researchspace.com/). RSpace is designed to enhance reproducibility by making experimental and sample data more findable and accessible. This is enabled by an open architecture that facilitates connectivity with relevant research project data generated or stored in other tools. Data in RSpace and linked external data can also be associated with the data management plan related to the project. The data and the data management plan can then be exported to data repositories like Dataverse, Figshare, Dryad and Zenodo for public discovery, query and experimental reproduction. Comprehensive data capture and availability are further enhanced through the ability to easily integrate RSpace with infrastructure commonly in place at major research institutions, e.g. institutional file stores and NextCloud. RSpace and the workflows and ecosystem it supports are depicted in the following graphic: https://bit.ly/rspace-graphic.
RSpace is typically deployed at an institutional level at universities and research institutes. In Germany examples include the Max Delbruck Center for Molecular Medicine, Bonn University, the University of Goettingen and several Leibniz Institutes. RSpace is also used by multi-institutional research consortia, e.g., the Multi Scale Bioimaging Cluster of Excellence. We are exploring incorporation of RSpace into the BIH/Charite Virtual Research Environment.
Detailed description
The session will be divided into three parts. The first is a presentation with an introduction/overview of RSpace.
This will be followed by the demonstration. The demonstration will include a high level introduction to both RSpace ELN and RSpace Inventory, and ways in which they can be used together. It will continue with a focus on why RSpace is particularly suitable for use in engineering workflows. Primarily this is for two reasons. The first is RSpace’s flexibility. The second is the deep integration between the ELN side of RSpace and RSpace Inventory, the sample management system. The demonstration also will cover issues and requirements particularly relevant to engineers that have come up in our past engagements with the engineering community:
- How to customise workflows for one-of-a-kind experiments
- How to track physical and data samples
- How to synchronise and efficiently manage data from distributed sources
- How to capture metadata of tooling (sensors, components)
- Discoverability and association of metadata schemas and PIDs with RSpace experimental and sample datasets.
- Ways of incorporating standardised vocabularies for structuring metadata according to domain specific schemas.
- Scalable export of domain specific sample metadata from the RSpace Inventory module in standardized formats (templates) required by domain repositories and registries.
- An integration with the iRODS virtual file management system.
License: CC-BY 4.0 ↗
Development of a novel platform for teaching and learning materials as a knowledge base for FAIR data use and provision. - Workshop
Data literacy from the beginning. With this paradigm, we emphasise the importance of data literacy education (data usage, data supply) and show the way to sustainability in the use and supply of research data. The DALIA project makes a significant contribution to the implementation of this paradigm through the development of a platform for teaching and learning materials as a knowledge base for "FAIR data usage and supply". In line with NFDI and the FAIR concept, this is implemented in a modern way as a semantically linked knowledge graph. This knowledge base, based on semantic web technology, will serve users by answering questions on data usage (data science) and data supply (research data management).
DALIA is the acronym of the BMBF project entitled "Knowledge Graph of the Data Literacy Alliance (DALIA) for FAIR data use and provision based on Semantic Web technology". The partners in this joint project are the Institute of Fluid Systems Technology and the University and State Library at the Technical University of Darmstadt, the Chair of Bioinorganic Chemistry and the IT Centre at RWTH Aachen University, the Digital Academy at the Academy of Sciences and Literature in Mainz, the Institute of Medical Informatics at the University Medical Centre Göttingen, and the Leibniz Information Centre for Technology and Natural Sciences at the Technical Information Library (TIB) Hanover. The start date for the project is November of this year.
A knowledge base can only be successful, i.e. accepted, if, firstly, quality-assessed information and teaching and learning materials are available and, secondly, these materials are offered and recommended according to the personal needs of the users. The users form the potential market for the knowledge base and are highly heterogeneous: they include students, doctoral candidates, researchers and teachers, who come from different academic disciplines and have different levels of RDM competence. Therefore, the project focuses on the benefits and the users. The users' needs and questions for the knowledge base are to be processed via natural language queries (natural language processing). The goals of the project are the integration of providers and future users, the development of an information model, and the integration of a search and recommendation service.
The formulation of competency questions from the learner's point of view is essential. The formulated requirements can be directed at the information model, the user interface, the backend or other functionalities. In this way, the structure of the knowledge graph on which the knowledge base is based is also specified. Competency questions state which questions the knowledge graph is expected to answer. They are a tool to systematise the requirements analysis and to integrate the user perspective well. Since the knowledge graph is to be used by different users with different needs, a broad spectrum of competency questions has to be created and integrated.
Use cases serve to systematise and classify the competency questions and are defined by the following formulation:
"As a [role with specific FDM experience], I have the [need] to be fulfilled by a [knowledge base function]."
Strong and early networking in the individual communities and the iterative development of the knowledge base based on the needs of the future users (use cases) will ensure a high level of acceptance of the project results and long-term use of the knowledge base.
During the workshop, such use cases will be collected from the community. These will serve as input for the corresponding work packages of the project. At the same time, the workshop will introduce the project and make it known so that the level of awareness is increased right at the beginning and the basis for communication and interaction with the provider and user community is created.
License: CC-BY 4.0 ↗
Beyond Data Literacy in Engineering – What Media Literacy has to offer for Critical Data Literacy - Workshop
Data literacy is going to be a key ingredient of a digital and sustainable engineering education. The main motivation comes from the need to understand and empirically solve real-world questions (Wolff et al. 2016, 23). An important focus lies on the fact that, through digital transformation, more digital data is generated and can be interpreted. As a highly applicable and proactive scientific field, with developments such as Industry 5.0 on the horizon, mechanical engineering is predestined to integrate data literacy into higher education curricula (Schüller 2019, 300). Additionally, on a more individual level, the ability to understand and interpret data for shaping solutions is a highly demanded skill in the job market and thus raises the employability of engineering students leaving schools and universities.
There are multiple data literacy definitions, approaches, and frameworks in recent literature. While some researchers see data literacy as a “cross-cutting competence” for effective decision making (Taibi et al. 2021, 112) or as the ability to access, use, understand and create digital tools (Rowe 2017, 47), others tend to combine data and digital literacy and describe “a cluster of behavior and attitudes for the effective execution of value creation process steps on the basis of data” (Schüller, Busch, and Hindinger 2019, 27). The German Hochschulforum Digitalisierung bases its definition on Ridsdale et al. (2015) as "the ability to collect, manage, evaluate and apply data with a critical mindset" (Schüller, Busch, and Hindinger 2019, 10). To this understanding, Wolff et al. add communicative and design aspects while selecting, cleaning, analyzing, visualizing, critiquing and interpreting data (Wolff et al. 2016, 23). Yet, as a literature review of nine definitions of data literacy and four definitions of statistical literacy shows, major aspects of literacy frameworks such as ethics are barely included (Wolff et al. 2016, 11).
Thus, the German association of media science critiques, in one of its position papers, the narrow understanding of data and digital literacy in favor of well-established critical-reflexive methods grounded in media literacy (Braun et al. 2021, 3). Starting with the clarification that these literacies consist of multiple competencies and thus cannot be understood as synonyms (Braun et al. 2021, 3), their main concern lies in the lack of cultural, critical or self-reflective, and creative perspectives in data and digital literacy (Braun et al. 2021, 3). As they stress the need for a demystification of technological progress, they remind us that every data-based solution also profoundly shapes social interactions and that engineers need to be trained to become aware of those potential changes. Additionally, the influence of culture and media on the understanding of real-world problems seems to be barely considered in current data literacy frameworks (Rowe 2017, 46). While some scientists see digital literacy as the meta-literacy combining information and media literacy (Alexander et al. 2017, 4), others see media literacy as the umbrella literacy over data, visual and digital literacy (Leaning 2017, 39).
In this conference contribution, we would like to enhance the definition of data literacy with the knowledge developed over the last decades in media literacy. We aim to discuss applicable methods for data literacy frameworks. The idea is to lend to and borrow from competence concepts of different sister literacies such as media literacy, sustainability literacy and digital literacy. This contribution consists of two parts: first, a critical literature review of the different existing literacy frameworks; second, an interactive workshop at the NFDI4Ing Conference 2022 to develop a collective concept to enhance data literacy frameworks, reflecting cultural aspects, biases, and media influences on data. Through this process, different families of literacy gain and learn from one another to enhance education.
References
Braun, Tom, Andreas Büsch, Valentin Dander, Sabine Eder, Annina Förschler, Max Fuchs, Harald Gapski et al. 2021. “Positionspapier Zur Weiterentwicklung Der KMK-Strategie ‹Bildung in Der Digitalen Welt›.” MedienPädagogik, 1–7. https://doi.org/10.21240/mpaed/00/2021.11.29.X.
Alexander, Bryan, Samantha Adams Becker, Michele Cummins, and Courtney Hall Giesinger. 2017. “Digital Literacy in Higher Education, Part II: An NMC Horizon Project Strategic Brief.” The New Media Consortium. https://www.learntechlib.org/p/182086/.
Weitze, Charlotte Lærke, and Gunver Majgaard. 2020. “Developing Digital Literacy Through Design of VR/AR Games for Learning.” In ECGBL 2020: 14th European Conference on Games Based Learning, ACPI, virtual conference hosted by the University of Brighton, UK, 24-25 September 2020, 674-83.
Leaning, Marcus. 2017. Media and Information Literacy: An Integrated Approach for the 21st Century. Chandos Information Professional Series. Cambridge, MA, Kidlington: Chandos Publishing, an imprint of Elsevier. https://aml.ca/wp-content/uploads/2017/03/JMLVo.64No.12-2017.pdf.
Rowe, Marieli, ed. 2017. The Journal of Media Literacy 64. Accessed April 03, 2022.
Schüller, Katharina. 2019. “Ein Framework für Data Literacy.” AStA Wirtsch Sozialstat Arch 13 (3-4): 297–317. https://doi.org/10.1007/s11943-019-00261-9.
Schüller, Katharina, Paulina Busch, and Carina Hindinger. 2019. “Future Skills: Ein Framework Für Data Literacy.” Hochschulforum Digitalisierung (47): 1–128.
Taibi, Davide, Luis Fernandez-Sanz, Vera Pospelova, Manuel Leon-Urrutia, Ugljesa Marjanovic, Sergio Splendore, and Laimute Urbsiene. 2021. “Developing Data Literacy Competences at University: The Experience of the DEDALUS Project.” In 2021 1st Conference on Online Teaching for Mobile Education (OT4ME), 112–13: IEEE.
Wolff, Annika, Daniel Gooch, Jose J. Cavero Montaner, Umar Rashid, and Gerd Kortuem. 2016. “Creating an Understanding of Data Literacy for a Data-Driven Society.” JoCI 12 (3). https://doi.org/10.15353/joci.v12i3.3275.
License: CC-BY 4.0 ↗
An iterative strategy for the planning and execution of data acquisition in field experiments involving technical systems - Presentation
Field experiments pose many challenges concerning data management. This presentation discusses documentation and communication as core difficulties for the planning and execution of field data acquisition campaigns using technical systems. We propose a strategy to develop and enhance field documentation and communication iteratively.
Field experiments are characterized by environmental conditions that cannot be fully controlled by the experimenter. Examples from the authors' work are tests of submarine robots in open water, or tests of driver assistance systems in public traffic, which serve to illustrate our proposed strategy.
Experimenters planning to go into the field need to prepare for different potential environment states, e. g. weather situations or traffic circulation. During data acquisition, all relevant field conditions must be documented, so that data users can later reconstruct the context and factor it into their data analysis. This is especially important for data re-users who are not part of the original team, which makes this a crucial issue of FAIR data management.
Documenting these environmental conditions during a field experiment puts additional stress on a situation that may already be difficult: The technical systems that run the experiment tend to have a large number of sensors and computers. The environment may be taxing on the hardware, and critical infrastructure may be scarce. The experiment may extend over multiple days under changing conditions. In this situation, where a single error might ruin the entire data acquisition, experimenters tend to prioritize the safe operation of their technical system, and the reliable recording of their payload data. Meanwhile, important events in the environment may happen unnoticed and undocumented. Planning and organizing the documentation of the environmental conditions and other important metadata may alleviate some of the additional stress, and reduce the risk of losing critical information.
Additionally, field experiments are often carried out in teams. The communication between team members has a decisive influence on the success of the experiment. As documenting all relevant field experiment aspects can only be achieved collaboratively, there need to be robust communication channels in rough field conditions like stormy weather or heavy traffic as well as a uniform experiment terminology within the team. Imprecise terminology and unsuitable ways of communication in the team reduce the quality of communication and thus also of the documentation, especially for re-users.
To address these issues, we propose an iterative approach. Even with many years of experience, it is difficult to anticipate all possible events and influences in the field during planning. Fortunately, data-acquisition campaigns are rarely one-time efforts but are executed repeatedly throughout a project or the lifetime of a team.
Implementing an iterative field data acquisition strategy requires three elements: First, an initial planning and documentation structure, bootstrapped from previous experience of the team members and other potential experts. Second, an execution process that documents unpredicted events as material for the next planning cycle. And third, a review routine that adapts the planning and documentation tools to these additional experiences and other changing circumstances. Once the iteration loop is closed, further development breaks down into a planning phase and an execution phase.
Regarding the planning phase, the development of a custom documentation scheme is necessary. This includes a metadata structure for the experiment and its sessions as well as taxonomy and terminology: The metadata structure supports the findability, searchability, and interoperability of the generated data sets. A taxonomy is necessary for qualitative documentation variables, e.g. the differentiation of the weather into sunny, clear, or cloudy. Defined terminology enhances data interpretability for subsequent users. For the creation of the documentation schema, we present a flexible checklist that can be adjusted to individual requirements and can be developed iteratively over multiple data acquisition campaigns. The checklist addresses the identification of data flows in the field experiment, their documentation, and various data-centric aspects of organization and preparation for successful field experiments.
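As one possible, purely illustrative outcome of such a checklist, a session-level context-metadata record with a small controlled vocabulary could look like the sketch below; all field names and allowed values are examples, not a prescribed schema.

    WEATHER_TERMS = {"sunny", "clear", "cloudy"}   # example taxonomy from the text

    def session_record(session_id, date, weather, sea_state=None, notes=""):
        # Reject qualitative values outside the agreed terminology.
        if weather not in WEATHER_TERMS:
            raise ValueError(f"undefined weather term: {weather!r}")
        return {
            "session": session_id,
            "date": date,
            "weather": weather,
            "sea_state": sea_state,        # e.g. Douglas scale, if applicable
            "unpredicted_events": notes,   # free text, reviewed after the campaign
        }

    print(session_record("dive-03", "2022-06-14", "cloudy", sea_state=2,
                         notes="brief loss of acoustic link at 10:42"))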
Regarding the execution phase, we discuss the documentation of unpredicted events as part of the metadata collection. A data acquisition that follows the FAIR principles for scientific data management requires rich metadata. The primary purpose of metadata is to support potential data re-users in evaluating whether a given data set is suitable for their use case, and in integrating the chosen data set into their work environment. We propose that, additionally, data producers take advantage of their existing metadata collection tools and processes for the systematic documentation of unforeseen events. There are many types of metadata, referring e. g. to the data structure, possible use cases, the data production personnel, legal issues, etc. We propose to handle unpredicted events as part of context metadata, which describes the environmental conditions of the field during data acquisition.
The practical implementation of our approach is illustrated with examples from field experiments involving an experimental vehicle to validate new driver assistance systems in road traffic, and a sensor system improving the safety monitoring equipment of rescue divers.
License: CC-BY 4.0 ↗
Metadata and Terminology Services. A Toolchain for comprehensive Data- and Knowledge Management - Presentation
The effective implementation of data and knowledge management in research is becoming increasingly important in academia and industry, especially as the requirements for this become more complex and thus more cost-intensive. Automation, through well-adapted services, promises multiple benefits for researchers and their contracting authorities. An example of this is the reproducibility of research data. In 2016, Nature published a survey on the effects and reasons of the so-called reproducibility crisis. In this survey, researchers from various disciplines were asked whether they could reproduce their own research or the research of others. The results were more than sobering: in chemistry, for example, more than 80% of others' research and more than 60% of researchers' own research could not be reproduced. In other research disciplines, the situation is not much better. Furthermore, the survey also shows how these poor results can occur. Essentially, the most important reason for this situation is selective reporting; other predominant reasons are a lack of available resources (such as data and software) and an incomplete description of experiments' design and execution.
Another aspect in this context is the use of ambiguous language. Imagine a discussion about the results of a transdisciplinary project employing technical terms and indicators from the domain of micro- and macronutrients: soil scientists will discuss nitrogen, phosphorus, potassium, calcium, magnesium and sulphur, while nutritionists will discuss carbohydrates, protein and fat. Another example is the Mars Climate Orbiter, which was lost in 1999 because two engineering teams did not communicate clearly with each other: one team worked with US customary units while the other used metric units. This led to a navigational error, which resulted in a loss of about 125 million USD. This and other examples vividly illustrate the importance of agreeing on a commonly understood language within a designated community. Even though the introduced examples are very striking, they are representative of the pervasive problem of miscommunication, large and small.
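As a small aside, the unit example also shows how explicit, machine-checked units can prevent this kind of ambiguity; the following sketch uses the pint library with arbitrary example numbers.

    import pint

    ureg = pint.UnitRegistry()

    # The same physical quantity, expressed by two teams in different unit systems.
    impulse_team_a = 4.45 * ureg.newton * ureg.second
    impulse_team_b = 1.0 * ureg.pound_force * ureg.second

    # Converting makes the comparison explicit instead of silently mixing numbers.
    print(impulse_team_b.to(ureg.newton * ureg.second))   # roughly 4.45 N*s
    print(impulse_team_a > impulse_team_b)                # comparison respects units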
The NFDI4Ing project measure Metadata and Terminology Services addresses this problem of ambiguity in data management and provides a comprehensive service offering. This offer facilitates the use of subject- and application-specific standardised metadata and their integration into engineering-specific workflows. Furthermore, it provides services for generating, sharing and reusing application-specific metadata profiles as well as a terminology service that enables researchers and infrastructure providers to access, curate, and update terminologies. Essentially, the task area will provide web-based software services to support the following requirements (a small illustrative sketch follows the list):
- Flexible application-specific metadata schemas via selection of suitable elements from controlled terminologies (Metadata Profile Service)
- Provision of a service that allows access to, curation of, and subscription to domain-specific terminologies (Terminology Service)
- Uniform access to multiple metadata repositories through a single API (Metadata Hub)
- The Metadata4Ing ontology, to describe research processes and research results in engineering.
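The sketch below illustrates, in a simplified form, what an application-specific metadata profile assembled from controlled terminologies might look like; the chosen terms and the profile structure are examples only and do not reflect the actual Metadata Profile Service format.

    profile = {
        "name": "tensile-test-profile",
        "fields": [
            {"label": "creator",
             "term": "http://purl.org/dc/terms/creator",
             "required": True},
            {"label": "method",
             "term": "https://w3id.org/nfdi4ing/metadata4ing#Method",  # assumed IRI
             "required": True},
            {"label": "specimen material",
             "term": "https://example.org/terminology/material",       # placeholder
             "required": False},
        ],
    }

    # Check a metadata record against the profile's required fields.
    record = {"creator": "Jane Doe", "method": "uniaxial tensile test"}
    missing = [field["label"] for field in profile["fields"]
               if field["required"] and field["label"] not in record]
    print("missing required fields:", missing)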
After one year of project runtime, first demonstrators and minimum viable products are available for all three envisioned services as well as the Metadata4Ing ontology. To learn, inspire and discuss the application of these services in engineering research workflows, we are preparing a series of workshops. This will help us to better shape the future development of our service offering. At the same time, we want to raise interest in the community for the opportunities to improve their research data management through richer metadata by showing use cases and demands from today’s engineering research.
We present our range of services based on a specific use case. First, there is a general introduction to the task area, and then a concrete demonstration of how our services can be used.
License: CC-BY 4.0 ↗
An approach to improve the reuse of research software - Presentation
The reuse of research software is central to research efficiency and scientific exchange. The inspection of source codes supports the understanding and comparison of methods and approaches, and the execution of programs enables the reproduction and validation of research results to users of diverse scientific backgrounds.
Often, however, no subsequent use takes place, due to a variety of reasons: for example, when the software is not published at all or is not licensed, when no software can be found which fulfills a specific set of requirements, or when the list of potentially useful software is very long but there is no adequate information to allow for making an informed choice. In addition, it regularly occurs that for published software not enough documentation is provided to give sufficient guidance on how to use the software, requiring potential new users to invest significant effort in familiarizing themselves with the source code. Although most software authors are aware of the importance of detailed documentation, the provision of this information is often associated with such a high level of effort that it frequently is not publicly provided. To make matters worse, documentation standards often do not exist or are not adhered to, making it much more difficult for users to interpret the information. Furthermore, documentation is only rarely available in a machine-actionable form, so that computing resources cannot be utilized to assist in the search, selection and application of software.
Without an adequate way to get a targeted overview of available software and to identify and reuse useful software, the degree of redundant development will inevitably increase. If redundant programs have been developed, they all must be individually documented, maintained and further developed, although they each serve the same purpose. This slows down both individual scientists and the research community as a whole. Furthermore, redundant software implementations lead to independent and smaller user communities, which frequently translates into a decrease in software quality, as larger developer and user communities combine more expertise and a larger workforce.
In this talk, we present an approach to annotate software with detailed metadata without heavily increasing the documentation effort for researchers. The necessary metadata annotations are partially embedded directly into the source code, thereby also increasing the adequacy of the code documentation. Although the documentation of software requires a certain amount of effort, especially if best-practice standards are adhered to, the software must be described anyway, which is why the approach presented here does not involve any significant additional effort. The resulting software descriptions capture the software interfaces in a detailed and machine-actionable way, giving rise to many different opportunities, such as the implementation of semantic software search engines, the automated creation of documentation websites, the automated composition of complex software workflows, or the registration of the software on community platforms or research knowledge graphs. The adoption of existing standards like CodeMeta and OpenAPI supports the broad applicability of the presented approach. Once detailed and machine-actionable software metadata has been compiled, research software developers will benefit from an increased impact of their work, whilst their respective communities can improve their work efficiency and exchange of expertise through the increased findability and reusability of research software.
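To make the role of machine-actionable metadata more tangible, the following minimal Python sketch writes a software description using terms from the CodeMeta vocabulary mentioned above. All values are placeholders; in the presented approach such a file would be generated from the annotations embedded in the source code rather than written by hand.

    # Minimal sketch of machine-actionable software metadata using CodeMeta terms.
    # All values are placeholders for illustration.
    import json

    codemeta = {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": "example-solver",                                    # hypothetical software name
        "version": "1.2.0",
        "programmingLanguage": "Python",
        "license": "https://spdx.org/licenses/MIT",
        "codeRepository": "https://git.example.org/example-solver",  # placeholder URL
        "description": "Solves an example engineering problem.",
        "author": [{"@type": "Person", "givenName": "Jane", "familyName": "Doe"}],
    }

    with open("codemeta.json", "w", encoding="utf-8") as fh:
        json.dump(codemeta, fh, indent=2)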
License: CC-BY 4.0 ↗
Maturity models for RDM processes - Publication of research data - Workshop
When dealing with research data, researchers are confronted with a multitude of guidelines and directives which must be taken into account in their research. Models and tools are being developed to help meet these expectations. The entire handling of research-related data along the data lifecycle, from the planning of data collection to post-use, with the aim of making data reusable, traceable and verifiable, is understood as research data management (RDM). The additional requirements that RDM entails confront researchers with the challenge of possessing knowledge in all of these areas. In addition, engineering research comprises heterogeneous research fields in which different research activities are carried out and different research data are collected with different analysis methods, which leads to different requirements for the execution of RDM. Because of this, RDM activities must be designed based on the research fields and integrated into the research processes.
To evaluate and improve the execution of research data handling along the data lifecycle, maturity models are being developed. The models thus serve as a solution approach for the improvement and execution of RDM. Individual maturity levels reflect the quality of the execution of RDM. By representing RDM in maturity levels, an evolutionary path for the improvement of RDM can be depicted, which suggests stepwise improvements to researchers in the execution of RDM. In addition, an assessment of RDM practices can be carried out with the maturity models. In this way, researchers' self-assessments can identify the current state of RDM execution. Further, funding agencies or project leaders can specify maturity levels to be achieved.
Based on the data lifecycle, superordinate process areas are defined that contribute to the execution of the RDM of the researchers. Specific and generic (cross-process) goals are assigned to the process areas, which must be met when executing the processes. Practices are assigned to these goals, which describe expected activities to achieve the assigned goals. In addition, best practices or detailed descriptions are assigned to the practices, which should describe the goals' achievement in more detail. The objectives are then assigned to the various maturity levels according to a defined maturity level definition.
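To make the model components more concrete, the following minimal Python sketch structures a process area with goals, practices and maturity levels and derives an achieved maturity level from a self-assessment. All concrete names are invented examples; the actual maturity models are still under development.

    # Illustrative sketch of the maturity model components described above.
    # All concrete names (process area, goals, practices) are invented examples.
    from dataclasses import dataclass, field

    @dataclass
    class Practice:
        description: str
        best_practices: list[str] = field(default_factory=list)

    @dataclass
    class Goal:
        name: str
        generic: bool                 # generic (cross-process) or specific goal
        maturity_level: int           # level at which the goal must be fulfilled
        practices: list[Practice] = field(default_factory=list)

    @dataclass
    class ProcessArea:
        name: str
        goals: list[Goal] = field(default_factory=list)

    publishing = ProcessArea(
        name="Publishing of research data",    # example process area
        goals=[
            Goal(
                name="Data are deposited in a suitable repository",
                generic=False,
                maturity_level=2,
                practices=[Practice("Select a repository",
                                    ["Prefer discipline-specific repositories"])],
            )
        ],
    )

    def achieved_level(area: ProcessArea, fulfilled: set[str]) -> int:
        """Highest maturity level for which all assigned goals are fulfilled (self-assessment)."""
        achieved = 0
        for level in sorted({g.maturity_level for g in area.goals}):
            if all(g.name in fulfilled for g in area.goals if g.maturity_level <= level):
                achieved = level
        return achieved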
For the development of user-centered solutions, the heterogeneity of engineering research must be taken into account and the various requirements of different research fields must be worked out. In this context, the planned workshop will elaborate, in a research-differentiated manner, on the needs and goals that arise for researchers when executing processes for dealing with research data. The workshop will focus on data access management, i.e. the publishing of research data, as a phase in the research data lifecycle. Research data on which a publication is based are often not published, even though they offer added value for further research. The publishing of research data refers to making collected data available so that it can be used by other researchers; the aim is to provide access to data that has already been collected and stored. Accordingly, a first structure of a maturity model with a defined maturity level definition is presented, which guides the classification of the process goals into the different maturity levels. A first input with common goals for the process area, related to engineering, is also presented. Based on the developed maturity level definition and the structure of the model components, general and research-dependent goals will be identified in the workshop and classified, in a research-differentiated manner, according to their maturity and their associated characteristics.
The goal of the workshop is to develop and collect goals for the publishing of research data which researchers should fulfill. At the end of the workshop, a detailed process description and goal definition should have been developed with which the execution of the publishing of research data can be represented. A differentiated classification of the goals according to a given maturity level definition along an evolutionary path will then take place in a discussion, in order to be able to evaluate the process area in its execution based on the defined goals and to realize step-by-step improvements. Through the NFDI4Ing community, we hope to achieve a differentiated and research-project-dependent execution of the publication of research data in the field of engineering. The results of the workshop will be used to extend the maturity models and to develop user- and research-oriented maturity models to evaluate the execution of publishing research data, but also to support researchers through defined goals to be realized.
License: CC-BY 4.0 ↗
Adaptive Visualizations for Cross-Stakeholder Development of Ontologies for Research Data Management - Presentation
Ontologies are increasingly establishing themselves as a solution for developing metadata standards for research data management systems.
In computer science, ontologies represent a conceptual model of the world. They are implemented using a formal language and can be processed extensively and automatically by machines. There are standards for ontology implementation, but they are purely technical in nature. This technical specificity turns out to be very problematic when the content of the ontology is to be defined by domain experts.
Within the Semantically Connected Semiconductor Supply Chains (SC3) Support Action, funded by the European Commission, we are working on solutions to address exactly this challenge through work on an ontology platform. The SC3 ontology platform enables the collaborative development of ontologies, taking into account the different groups involved in ontology development. In this context, the visualization of ontologies plays an important role when users need to understand the content of ontologies. To address this need within SC3, new approaches have been developed to efficiently map information between different visualization formats. At the core of the various SC3 visualizations is the so-called Resource-Relation Model (RRM). This model is the basic data structure for the different types of visualizations in the SC3 Ontology Platform. The RRM takes advantage of the fact that resources, relations, annotations, axioms, rules, and type assertions describe the ingredients of an ontology. Accordingly, resources carry type assertions and annotations, while relations extend resources by providing domain and range constraints that form the links between resources. The RRM reorganizes the structured textual representation of ontologies into a representation format for further processing. It represents the network-like structure of ontologies with additional grouping and classification of triples. This is exactly what is used for further information processing in preparation for the different types of visualization techniques; each visualization mode is constructed from and interacts directly with the RRM.
An ontology graph can be displayed as a node-link diagram with customization options. For this graph-based visualization, the RRM is first converted into a node-link model (NLM) and then sent to a rendering module for visualization as a graph. In the graph-based view, the user can choose between UML and VOWL notations.
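The following minimal Python sketch illustrates the idea of the RRM and its conversion into a node-link model for rendering, as described above. The field names are illustrative and do not reflect the actual SC3 implementation.

    # Sketch of the Resource-Relation Model (RRM) and its conversion to a node-link model (NLM)
    # for graph rendering. Field names are illustrative, not the SC3 code base.
    from dataclasses import dataclass, field

    @dataclass
    class Resource:
        iri: str
        types: list[str] = field(default_factory=list)         # type assertions
        annotations: dict[str, str] = field(default_factory=dict)

    @dataclass
    class Relation:
        iri: str
        domain: str                                             # source resource IRI
        range: str                                              # target resource IRI
        annotations: dict[str, str] = field(default_factory=dict)

    def to_node_link_model(resources: list[Resource], relations: list[Relation]) -> dict:
        """Convert the RRM into a node-link model that a rendering module can display."""
        nodes = [{"id": r.iri, "label": r.annotations.get("label", r.iri), "types": r.types}
                 for r in resources]
        links = [{"source": rel.domain, "target": rel.range,
                  "label": rel.annotations.get("label", rel.iri)} for rel in relations]
        return {"nodes": nodes, "links": links}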
For intermediate level users, ontologies in the system can be also visualized in Hybrid Mode. Using this mode, information from the ontology is displayed as resources and relations. An individual resource and relation is represented by its header, description, widget-based representation, and graph-based representation. This view also displays meta information. We are assuming that our intermediate users have some understanding of ontology construction and are interested in more detailed information, such as the given axioms.
For expert users, ontologies can also be displayed in textual form. It is possible to view every detail of the ontology using this view. In this visualization mode, we have chosen to display the data in TTL format.
In many cases, work on new ontologies draws on existing ontologies to reuse already established definitions. This especially promotes the acceptance of ontologies outside the target domain. In our platform, we are therefore working on a sophisticated concept for managing and visualizing "external" ontologies in order to collaboratively explore their possible reuse. In the future, we also want to enable editing of ontologies via an integrated WebProtégé installation including synchronization to established management systems like GitHub.
License: CC-BY 4.0 ↗
Making Research Data Findable - B2FIND - Presentation
Thematic Scope
Because research is becoming increasingly data-intensive, research (data) infrastructures need to deal with challenges concerning data and metadata. In an ideal world, Open Science would be the norm and not be limited by national restrictions. In reality, however, various research infrastructures exist on national, European and international levels that rely on different funding bodies and legal frameworks. Intertwining these already existing infrastructures is not an easy task and is (intended to be) advanced by the European Open Science Cloud (EOSC), which aims at developing and maintaining a common research data ecosystem. As searching for relevant information is the beginning of every research project, and finding appropriate data that encourages scientific work is the foundation for the "F" in the FAIR data principles, we will present a generic service that makes research data from heterogeneous sources discoverable across disciplines. B2FIND was developed and is offered as a service of the pan-European Collaborative Data Infrastructure EUDAT CDI; it has been the central indexing tool for EOSC-hub and plays a major role in the Data Infrastructure Capacity for EOSC (DICE).
Therefore, a comprehensive joint metadata catalogue was built up that includes metadata records for data stored in various data centres, using different meta/data formats on divergent granularity levels and representing all kinds of scientific output: from huge netCDF files of climate modeling outcomes to small audio records of Swahili syllables and phonemes; from immigrant panel data in the Netherlands to a paleoenvironment reconstruction from the Mozambique Channel; and from an image of the "Maison du Chirurgien" in ancient Greek Pompeia to an Excel file with concentrations of calcium, magnesium, potassium and natrium in throughfall, litterflow and soil in an Oriental beech forest. As the metadata catalogue is a central EUDAT CDI service, it represents data collections that are stored in EUDAT centres as well as those of (big) international research infrastructures or small (national) data providers. Those are represented as Repositories in B2FIND; whether that is the outcome of a community platform for computational materials science (MaterialsCloud), a national generic repository for open research data from all academic disciplines in Norway (DataverseNO), or the outcome of a (European) infrastructure for digital language resources (CLARIN) depends on the data provider. As of August 2022, more than 1.1 million records from 29 Communities and 37 Repositories are searchable and findable in our web portal, and both numbers will steadily increase. B2FIND can therefore be seen as a best-practice example of a generic, productive (European) service that fosters already existing research data infrastructures and initiatives.
Service Description
The backbone of our service is its ingestion process, which consists of three steps. Firstly, the metadata records provided by various research communities are harvested (using different harvesting protocols, preferably OAI-PMH, but also REST APIs or CSW). Afterwards, the raw metadata records are converted and mapped to unified key-value dictionaries as specified by our schema, which is based on DataCite 4.1 with the supplementary elements 'Instrument', 'Discipline' and 'Community'. This allows users to search and find research data across scientific disciplines and research areas as well as to search for certain measurement instruments, e.g. data produced by specific beamlines or measurement stations. Transparent access to the scientific data objects is provided through the references and identifiers given in the metadata. Finally, the mapped and checked records are uploaded as datasets to the metadata catalogue, which is based on an open-source data portal software (CKAN) that provides a rich RESTful JSON API and uses SOLR for indexing.
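The following minimal Python sketch illustrates the mapping step: a harvested, community-specific record is converted into a unified key-value dictionary in the spirit of the B2FIND schema (DataCite-based, plus 'Community', 'Discipline' and 'Instrument'). The field names and mapping rules are simplified placeholders, not the production mapping.

    # Illustrative sketch of the mapping step. Field names are simplified placeholders
    # in the spirit of the B2FIND schema, not the actual mapping rules.
    def map_record(raw: dict, community: str) -> dict:
        return {
            "Community": community,
            "Title": raw.get("title", ["(untitled)"])[0],
            "Creator": raw.get("creator", []),
            "Publisher": raw.get("publisher", [None])[0],
            "PublicationYear": raw.get("date", [None])[0],
            "Discipline": raw.get("subject", ["Other"])[0],   # mapped against a controlled list
            "Instrument": raw.get("instrument", []),          # supplementary B2FIND element
            "Identifier": raw.get("identifier", [None])[0],   # DOI/handle used for data access
        }

    # Example: a (simplified) Dublin Core record as harvested via OAI-PMH.
    raw_record = {
        "title": ["Example measurement dataset"],
        "creator": ["Example, A."],
        "subject": ["Environmental Science"],
        "identifier": ["https://doi.org/10.1234/example"],
        "date": ["2020"],
    }
    print(map_record(raw_record, community="ExampleCommunity"))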
Within the ingestion process, the main challenge is the divergence of metadata standards and schemas, which correspond to the specific communities' needs. Currently, B2FIND supports DataCite and DublinCore as generic metadata standards, DDI as a standard for the social sciences, ISO 19139 (which is basically the underlying structure for the EU INSPIRE Directive) and FGDC for geolocated resources, EUDATCORE as a generic standard for the exchange of metadata between different EUDAT services, and FF for Danish archaeological data. Even though some data providers offer their metadata in a generic standard, many do not. The semantic mapping of the non-uniform, community-specific metadata to homogeneously structured datasets is therefore the most subtle task and is done in close collaboration with all partners involved. To assure and improve metadata quality, the mapping process is accompanied by the use of controlled vocabularies (e.g. ISO 639-3 for language codes) and by formal and semantic mapping and validation. The web portal offers a common free-text search as well as geospatial and temporal search options. Results may be narrowed down using the (currently 17) facets, enabling interdisciplinary discoverability of research data with satisfying search results.
Good metadata management is guided by the FAIR principles, including the establishment of common standards and guidelines for data providers. Close cooperation and coordination with scientific communities (for the part of metadata exposure), research infrastructures (whether thematic ones such as ICOS or big clusters within EOSC such as SSHOC) and other initiatives dealing with metadata standardization (such as the RDA or projects like FAIRsFAIR or GO FAIR) is essential in order to establish standards that are both reasonable for community-specific needs and usable for enhanced exchangeability.
We are well aware that the question of "metadata standardization" cannot be answered conclusively. However, as a generic service within EOSC we now have several years of experience in the ingestion and curation of metadata, and we would like to share this experience, especially when it comes to practice.
License: CC-BY 4.0 ↗
Collaborative Metadata Definition using Controlled Vocabularies and Ontologies - FAIR Data Showcase in Experimental Tribology - Presentation
Data's role in a variety of technical and research areas is undeniably growing. This can be seen, for example, in the increased investments in the development of data-intensive analytical methods such as artificial intelligence (Zhang et al. 2022), as well as in the rising rate of data generation, which is expected to continue into the near future (Rydning and Shirer 2021). Academic research is one of the areas where data is the lifeblood of generating hypotheses, creating new knowledge, and reporting results. Unlike proprietary industry data, academic research data is often subject to stricter requirements regarding transparency and accessibility. This is in part due to the public funding which many research institutions receive. One way to fulfil these requirements is by observing the FAIR (Findability, Accessibility, Interoperability, Reusability) principles for scientific data (Wilkinson et al. 2016). These introduce a variety of benefits, such as increased research reproducibility, a more transparent use of public funding, and environmental sustainability.
Serially produced FAIR data could be the key ingredient for making tribological results available to scalable machine-learning-based analyses and could thus potentially help solve tribology's greatest challenges. In this presentation we will, first, address the challenges of implementing big-data techniques in the inherently interdisciplinary field of tribology research and, second, propose a framework of scalable, non-intrusive techniques for FAIR data collection. In this light, we will describe the engineering and application of controlled vocabularies, ontologies, virtual research environments (electronic lab notebooks), data collection, and FAIR data publication, using experimental tribology as a testbed showcase for their application.
The first component of the presented framework is a controlled vocabulary of the domain related to the data which needs to be annotated. A controlled vocabulary denotes a curated list of terms, their definitions, and the relations between them. In the framework presented in this contribution, the terms correspond to the metadata fields used in the data annotation process. Formally, the type of controlled vocabulary used in the framework is a thesaurus (National Information Standards Organization 2010). Thesauri consist not only of the elements mentioned previously, but also allow for the inclusion of synonyms for every defined term. This eliminates the ambiguity which can occur when using terms with similar definitions. Additionally, thesauri specify simple hierarchical relations between the terms in the vocabulary, which can provide an explicit structure to the set of defined metadata fields. The most important feature of our framework, however, is that the controlled vocabularies can be developed collaboratively by the domain experts of a given research field. Specifically, people are able to propose term definitions and edits, as well as cast votes on the appropriateness of terms which have already been proposed.
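The following minimal Python sketch shows how a thesaurus entry for a metadata field, including definition, synonyms, hierarchical relations and community votes, could be represented. It illustrates the concept only and is not the actual VocPopuli data model.

    # Sketch of a thesaurus entry as used for metadata fields: definition, synonyms,
    # hierarchical (broader/narrower) relations and community votes. Illustrative only.
    from dataclasses import dataclass, field

    @dataclass
    class Term:
        label: str
        definition: str
        synonyms: list[str] = field(default_factory=list)
        broader: list[str] = field(default_factory=list)    # labels of parent terms
        narrower: list[str] = field(default_factory=list)   # labels of child terms
        votes: int = 0                                       # community approval votes

    sliding_speed = Term(
        label="sliding speed",
        definition="Relative velocity between the contacting bodies during a tribological test.",
        synonyms=["sliding velocity"],
        broader=["test parameter"],
    )
    sliding_speed.votes += 1   # a domain expert endorses the proposed definition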
Despite their advantages, one limitation of thesauri is their limited capability to relate metadata fields to each other in a semantically richer fashion. This motivated the use of the second component of the framework, namely ontologies. An ontology can be defined as "a specification of a conceptualization" (Gruber 1995). More precisely, it is a data structure which represents entities in a given domain, as well as various relations between them. After a set of metadata fields has been defined within a controlled vocabulary, that vocabulary can be transformed into an ontology which contains additional relations between the fields. These can extend beyond the hierarchical structure of a thesaurus and can contain domain-specific information about the metadata fields. For example, one such relation can denote the data type of the value which a given field must take. Furthermore, ontologies can be used to link not only metadata, but also data. This can contribute to the reusability aspect of FAIR data.
The components described above are being implemented in the form of multiple software tools related to the framework. The first one, a controlled vocabulary editor written as a Python-based web application called VocPopuli, is the entry point for domain experts who want to develop a metadata vocabulary for their field of research or lab. The software annotates each term, as well as the entire vocabulary, with the help of the PROV Data Model (PROV-DM) (Moreau and Missier 2013) - a schema used to describe the provenance of a given object. Finally, it assigns a PID to each term in the vocabulary, as well as the vocabulary itself. It is worth noting that the generated vocabularies themselves can be seen through the prism of FAIR data: they contain data (the defined terms) which is annotated with metadata (e.g., the terms' authors) and provided with a PID.
License: CC-BY 4.0 ↗
Automated storage & visualization of field study data with Metal Oxide Semiconductor Gas Sensors - Presentation
The automated storage and visualization system presented in this work has been developed within the research project VOC4IAQ, which addresses the transferability of a laboratory calibration of commercial indoor air quality sensors to their actual field use. For this purpose, four field test systems were set up based on the chair's own measuring systems, holding a total of 10 commercial sensors operated in manufacturer mode and 5 sensors operated in temperature cycled operation (TCO). The latter allows for subsequently building and testing our own machine learning models [1]. Due to the various parameters of a single sensor, this results in a total of 32 data fields per system at a sample rate of 10 Hz; in addition to the electrical resistance of the sensor, various model-based environmental variables such as humidity, total VOC, air quality index, ozone level, NOx, etc. are calculated on the sensor chip and sent to the host board. A Raspberry Pi serves as a data buffer and provides access to the internet. Prior to the field test, the sensor systems are calibrated in a simulated and simplified lab calibration procedure, explained in detail in [2].
These sensor systems are in the field at different sites. One is located at the Chair of Measurement Technology, one in a city flat, one in a flat in the countryside and one in a company. To correlate the sensor signals to specific events, everyday processes in the flats such as cleaning (cleaning agents) or cooking (frying, baking) are recorded manually by the inhabitants or company employees, respectively.
Due to the distributed installation sites, a manual data transfer to the server at the chair and a direct monitoring of functionality are not feasible. Hence, an automated solution with resilient data storage and monitoring of functionality must be developed, ensuring that the systems do not lose their data even in the event of an unwanted malfunction.
The data is stored by the Pi in HDF5 files [3]. A new file is created every hour. On the one hand, this ensures that only the current hour is lost in the event of a system failure; on the other hand, it enables the transfer of one complete file per hour. Another important point is that an internet connection cannot be guaranteed at all times. Accordingly, a resilient data connection is needed. This is made possible by the open-source tool Syncthing [4]. This peer-to-peer synchronization tool works similarly to cloud services such as OneDrive, iCloud or Google Drive, with the difference that the data is not stored with third parties but is only synchronized between the connected computers. Furthermore, the tool can be configured to ignore deleted files. This makes it possible to delete measurement data from the Pi without also deleting it on the server. Just as with other cloud services, synchronization stops when the internet connection is lost and resumes automatically when the connection is restored. Due to the encapsulation of this software and the control software, it functions independently and does not influence the measurements.
As soon as the data are on the server, they should be visualized automatically, but with the option of visualizing the data manually in case a dataset has unknown features. The automatic visualization consists of a Python script, the time series database InfluxDB [5] and the visualization tool Grafana [6]. The Python script unpacks the HDF5 file and uploads it to the InfluxDB server. This creates a continuously expanding data set, which is automatically equipped with the available metadata such as timestamp, sensorID, boardID, systemID, and measured variables. This shows the power of the HDF5 data format: since the data is equipped with the necessary keys and timestamps when it is generated in the field, at this point only a conversion from the HDF5 format into a format that can be interpreted by the database is necessary.
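The following minimal Python sketch illustrates this unpack-and-upload step using h5py and the official InfluxDB Python client. The HDF5 attribute and dataset names, the bucket and organization names, and the file name are assumptions made for illustration, not the actual implementation.

    # Sketch of the upload step: unpack one hourly HDF5 file and write its samples to InfluxDB,
    # tagged with the available metadata. File layout and bucket/org names are assumptions.
    import h5py
    from influxdb_client import InfluxDBClient, Point, WritePrecision
    from influxdb_client.client.write_api import SYNCHRONOUS

    def upload_hour(path: str) -> None:
        with h5py.File(path, "r") as f:
            system_id = f.attrs["systemID"]           # assumed attribute names
            board_id = f.attrs["boardID"]
            sensor_id = f.attrs["sensorID"]
            timestamps = f["timestamp"][...]          # assumed dataset layout
            resistance = f["sensor_resistance"][...]

        with InfluxDBClient(url="http://localhost:8086", token="TOKEN", org="lab") as client:
            write_api = client.write_api(write_options=SYNCHRONOUS)
            for ts, value in zip(timestamps, resistance):
                point = (
                    Point("field_test")
                    .tag("systemID", system_id)
                    .tag("boardID", board_id)
                    .tag("sensorID", sensor_id)
                    .field("sensor_resistance", float(value))
                    .time(int(ts), WritePrecision.S)
                )
                write_api.write(bucket="voc4iaq", record=point)

    upload_hour("2022-08-01_13.h5")   # hypothetical file name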
In the future, the system will be extended so that the machine learning models generated from the raw data are also uploaded and can easily be compared with the manufacturer-based models.
Due to the amount of generated data that needs to be viewed at regular intervals, various established tools were used. On the one hand, this promotes the quick availability of the services through the many available tutorials and functional descriptions; on the other hand, almost no special knowledge is required to pass the system on to others. In this way, a high degree of automation could be achieved with simple means. Due to the generalizability of this toolchain, other systems can also be added, such as the monitoring of environmental parameters (humidity, temperature) within a laboratory.
Furthermore, FAIR data handling was aimed at, as an automated system needs unique, machine-readable data to store and represent it properly.
- Findable: The system is findable due to the unique identification of the measured variables within the database and the search possibilities of a modern database system.
- Accessible: Access is cross-platform via web browser or programmatically via REST API. Due to storage on a web server with access authorization, everyone can be granted appropriate access as needed.
- Interoperable: Where available, all existing (meta)data is stored in the HDF5 files, which are persistently stored on the data server. Using timestamps and IDs from the database, each file can be uniquely traced back.
- Reusable: The database uses the standardized JSON format, which is readable and uniquely identifiable by all systems.
[1] T. Baur, J. Amann, C. Schultealbert, A. Schütze, Field Study of Metal Oxide Semiconductor Gas Sensors in Temperature Cycled Operation for Selective VOC Monitoring in Indoor Air, Atmosphere (Basel). 12 (2021) 647. doi:10.3390/atmos12050647.
[2] T. Baur, M. Bastuck, C. Schultealbert, T. Sauerwald, A. Schütze, Random gas mixtures for efficient gas sensor calibration, J. Sensors Sens. Syst. 9 (2020) 411–424. doi:10.5194/jsss-9-411-2020.
[3] The HDF Group, Hierarchical Data Format, (1997). https://www.hdfgroup.org/HDF5/.
[4] The Syncthing Foundation, Syncthing, (2013). https://syncthing.net/.
[5] InfluxData, InfluxDB, (2013). https://www.influxdata.com/.
[6] Grafana Labs, Grafana, (2014). https://grafana.com/.
License: CC-BY 4.0 ↗
SaxFDM – towards a comprehensive research data management support network for Saxony - Presentation
SaxFDM – the Saxon state-level initiative to support research data management (RDM) – was founded in 2019 as a bottom-up initiative, aiming to bring together relevant actors and foster cooperation and collaboration in research data management across Saxony. Currently, the network comprises 23 partner institutions. The aims, fields of activity and structure were defined and described in a strategy paper [1]. SaxFDM is coordinated by a board of five speakers and supported by a scientific advisory board consisting of 13 members. The quarterly meetings of the SaxFDM plenum are public, and anyone from Saxony interested in RDM topics is free to join. Three working groups were formed which, based on volunteer work, deal with specific topics of data management (knowledge transfer and consultation, common services and tools, outreach) and have developed a number of initial offers, for example the "Digital Kitchen" event format.
Since the end of 2021, SaxFDM has been receiving funding from the Saxon State Ministry of Science, Culture and Tourism (SMWK) to conduct the project "SaxFDM – establishing a cooperative research data management support in the Free State of Saxony". Building on the outputs of previous SaxFDM activities, this allows for the advancement of the existing offers into a comprehensive service portfolio as well as the development of sustainable and effective structures. The project activities are coordinated and carried out by a team of three RDM experts who took up their posts in March 2022. The fields of activity include:
- Development of a statewide RDM consultation service
- Fostering and supporting the implementation of RDM in the policies, guidelines and strategies of the SaxFDM partner institutions
- Design of training and education offers
- Further development of the portfolio of technical RDM services based on needs assessment
- Development of a business model to sustain SaxFDM activities and structures beyond the end of the project lifetime
- Coordination and networking with the NFDIs and RDM initiatives of other federal states
- Design, evaluation and implementation of RDM services and solutions in the context of independently working focus projects
There are different approaches for arriving at a broad, statewide service portfolio. Some RDM initiatives of other German federal states, for example, provide consultation primarily as an institutional service, where the local service units of network partners provide services only for the members of their institution. SaxFDM is following an approach that combines a central, state-wide consultation service with local offers that cooperate with the initiative. One of the reasons for this is that there are many smaller institutions in Saxony that can afford RDM staff only at a very limited scale or not at all. Currently, Leipzig University and Technische Universität Dresden have dedicated RDM consultation services. There are also individual staff members who provide RDM support in a number of libraries, research institutes and universities of applied sciences. The Research Data Working Group of Leipzig University comprises staff members of three institutional units, namely the research services department, the computing centre and the university library. Their services can only be used by members of Leipzig University. The situation is a bit different in Dresden. Since 2021, the Service Center Research Data of TU Dresden, the service unit offering RDM support for the university, has been collaborating with DRESDEN-concept, a local alliance of 33 universities, research institutes as well as cultural institutions engaging in research. They now jointly offer RDM consultation and a number of other services for the whole alliance.
The central RDM consultation service of SaxFDM primarily targets smaller institutions with no or only very limited RDM support. But larger partners who already have considerable RDM support capacity in place can also benefit from what an initiative like SaxFDM has to offer, especially in terms of networking and the pooling and sharing of resources and expertise.
What would such a pooling of resources look like? The SaxFDM project will design a concept which will include a central contact point, providing consultation and other services, that will, depending on the availability, cooperate with local support units. In the case of complex questions or issues requiring very specific expertise or skills, the initiative can act as a moderator, facilitating smooth contact with domain experts.
There are still a number of questions to be answered on the way to implementing and sustaining such an approach. For example: How to ensure an effective interplay and collaboration between the different actors? How to communicate the offers to all researchers in Saxony without them having to gather relevant pieces of information from many different places? What could an appropriate operating and business model look like?
The service portfolios of RDM support networks (e.g. on the level of federal states, but also NFDIs) are, in many cases, still in development. SaxFDM shares a lot of challenges with these and is eager to share knowledge, build on existing experiences and incorporate suggestions from various communities such as NFDI4Ing. This way, we want to develop a comprehensive service portfolio that will help turn research data management from a mere concept into living practice.
[1] https://saxfdm.de/ueber-uns/strategiepaper-des-arbeitskreises-saxfdm/
License: CC-BY 4.0 ↗
Implementing usability-improved data management plans in interdisciplinary engineering projects - Presentation
Over the last decades, the role of data in production environments has gained considerable importance. The amount and variety of new data sources and tools, as well as the degree of connectivity between the domains along the entire production chain, have increased drastically. In this light, the research of the Cluster of Excellence "Internet of Production" (IoP) at RWTH Aachen University revolves around enabling data-driven manufacturing to improve production processes and products by making the massive amounts of data available for cross-domain processing, analysis, and exchange. Naturally, research in cross-domain engineering features highly interdisciplinary scientific collaboration and research data, and the comprehensible documentation of the data in this increasingly complex environment is becoming both more challenging and more relevant.
Data management plans (DMP) have become a common tool for researchers to plan and document their research data. DMPs are questionnaire-like documents guiding researchers through the stages of the data lifecycle, addressing the necessary information needed to make their research data comprehensible, reproducible, and potentially reusable. There is already a variety of DMP templates available, e.g., provided by funding organizations, such as in the Horizon Europe program, or by universities, such as the RWTH Aachen University. In theory, creating a DMP for a project is as simple as choosing a template and distributing it to the project researchers for them to answer its questions. Practically, having a DMP completed and ensuring the quality of the information content of the filled-in document demands several considerations, especially concerning the needs of the researchers.
To get a better understanding of those needs, we conducted a series of semi-structured interviews with IoP researchers. The focus of the interviews was three-fold: First, the kind of research data handled; second, the researchers’ level of awareness and knowledge on research data management (RDM) topics; and third, their experience with RDM measures currently put into practice, and perceived incentives and barriers regarding the implementation of further RDM methods. The series comprises 28 interviews with researchers from all cluster research domains. In the following, we discuss the requirements that we derived from the interview series for creating and implementing a usability-improved DMP template for IoP projects and similar interdisciplinary engineering projects. We distinguish between content-related (CR) and organizational (OR) requirements.
CR1: Elaborating on relevant topics. Templates provided by large organizations are often intended to be as generally applicable as possible. Subsequently, the included questions are kept generic, and the covered topics inevitably lack potential project- or discipline-specific focus. However, there may be specific RDM topics inherent to the interdisciplinary project environment and engineering research area that are of particular relevance for making the research data understandable, and which should therefore be addressed in more detail in a customized DMP template.
CR2: Ensuring comprehensibility and unambiguity. Existing templates often presume knowledge about vocabularies such as metadata, persistent identifiers, or the FAIR principles, which are not always common among researchers. Furthermore, we found that even basic terms such as ‘research data’ may leave room for different interpretations, and even more so in an interdisciplinary environment. A suitable DMP template should account for different levels of RDM proficiency in researchers as well as for various notions of important terms.
OR1: Top-down structure specifications. Without clear instructions from the management on responsibilities, hand-in schedules, and review processes, many researchers have indicated that they see neither the possibilities nor the reasons for creating DMPs. This applies in particular to projects where different institutes work together. The implementation of a DMP template should therefore be well coordinated and communicated by the project management.
OR2: Reducing overhead. A frequently expressed sentiment among the interviewed researchers was that DMPs were mainly perceived as an additional workload, which may have undesirable consequences on completion behavior. Methods for reducing the amount of time needed should therefore be taken into consideration. Approaches to make the user experience more pleasant, such as gamification, may also be promising.
Currently, we are developing a DMP template for IoP researchers which dedicates questions specifically to documenting and handling industrial data, e.g., machine data, or business-sensitive data from industry partners (CR1). We are working towards including help texts and examples for all terms that the interviews revealed to be unclear among researchers (CR2). Steps towards the planned roll-out are closely coordinated with the cluster management (OR1). Finally, we are exploring two methods for reducing the time required to complete the DMP: First, adding fixed response options to questions with free text answer fields, e.g., by surveying the most frequently handled data types and making them available as selectable response options; and second, indicating summarizations at suitable places within the questionnaire, i.e., questions skippable based on previous answers (OR2).
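The following minimal Python sketch illustrates the two overhead-reduction ideas, fixed response options and questions that are skipped based on previous answers. The question texts and the skip rule are invented examples, not the actual IoP template.

    # Minimal sketch of fixed response options and skip logic in a DMP questionnaire.
    # Question texts and the skip rule are invented examples.
    QUESTIONS = [
        {
            "id": "data_types",
            "text": "Which data types does your project handle?",
            "options": ["machine data", "simulation results", "survey data", "other"],  # fixed options
        },
        {
            "id": "machine_data_details",
            "text": "Which machine interfaces / formats are used?",
            "skip_unless": ("data_types", "machine data"),   # only asked if machine data was selected
        },
    ]

    def next_questions(answers: dict) -> list[dict]:
        """Return the questions that still need to be asked, applying the skip rules."""
        remaining = []
        for q in QUESTIONS:
            if q["id"] in answers:
                continue
            rule = q.get("skip_unless")
            if rule and rule[1] not in answers.get(rule[0], []):
                continue   # previous answer makes this question skippable
            remaining.append(q)
        return remaining

    # -> [] : the machine-data question is skipped for a project handling only simulation results
    print(next_questions({"data_types": ["simulation results"]}))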
As a result, we expect a template that is easy to work with for researchers from all involved disciplines, and that provides useful insights into the research data of the cluster. Further investigations of methods especially about enriching the DMPs with more functionalities may be promising. For example, this can include the integration of other RDM services with DMPs, such as the automatic creation of a repository entry, or a certification function on the degree of compliance to the FAIR principles, including methods to support researchers with the FAIRification of their data.
License: CC-BY 4.0 ↗
Establishment of a Guideline for the Intuitive Creation of Semantic Models in the Internet of Production (IoP) - Presentation
The Internet of Production (IoP) is one of the Clusters of Excellence at RWTH Aachen University. Its goal is to enable a new way of understanding data by integrating semantics into real-time data related to the production system, including process and user data. To achieve this, it is necessary to have semantic models (ontologies), i.e., a formal description of the knowledge related to, in this case, the IoP and particularly the production domain and its connections. Indeed, to obtain reasonable and optimal knowledge modelling, both groups of actors, Domain Experts (DEs) and Knowledge Engineers (KEs), should collaborate and complement each other's work. DEs typically do not know how to properly create semantic models, while KEs do not have the expertise concerning the production and manufacturing domain.
Moreover, several methodologies for ontology development exist, but there is no single user-friendly guidebook or set of templates that facilitates and saves time in the modelling process. As a consequence, DEs may rely, for the most part, on KEs for creating ontologies, while the latter lack proper expertise regarding the specific domain, which creates a cycle of knowledge dependency. It may also be that KEs are unavailable or that this role does not exist in a particular organisation. In that case, DEs try to build ontologies on their own; however, as they do not know how to formalise knowledge, this can become a challenging and time-consuming task.
We propose a guideline booklet in which we recommend an existing methodology, the Linked Open Terms (LOT) methodology. We complement it with a descriptive workflow to follow in the entire ontology development process and best practices to consider in ontology creation and maintenance. Furthermore, we provide templates to support different stages of the workflow, recommended tools, and ontologies catalogues covering different levels of concept descriptions, e.g., domain-specific and cross-domain. We also include examples and explanations regarding the definitions and usage of ontologies to provide the fundamental knowledge required to use the guideline. As a result, we offer support to both groups of actors to collaborate more efficiently and create semantic models. To do so, we performed descriptive and qualitative research consisting of a literature review and comparative analyses of the existing methodologies, tools, and domain-specific ontologies using various criteria. We investigate the commonalities in existing approaches for ontology development and unify them.
We assess our results via two evaluation methods: one consists of evaluating the booklet by getting feedback from DEs on its suitability for the ontology development task; the other is letting DEs who had no experience creating ontologies create one from scratch based on a real use case in the manufacturing domain. The evaluation process, through both methods, shows that our proposed guideline is a step forward in achieving more fluent collaboration between both parties, DEs and KEs. It also offers valid aid to support DEs in the initial tasks of ontology creation, especially with the templates and additional resources we provide. Furthermore, the participants of the evaluation process suggested some improvements, which we will include in a new version of the guideline. These suggestions refer, for the most part, to more practical examples to keep motivating the adoption of ontologies and, in this way, to improve the scenario toward achieving the goals of the IoP.
License: CC-BY 4.0 ↗
Documentation for FAIR Modelling - Presentation
As computer performance continues to improve, more and more complex theories can be simulated at different scales, with different physical models and with a gradual transition from simulation to experiment. In engineering science, for example, a fluid simulation is combined with a solid body simulation, or the model is combined with an industrial application. These multi-x-simulations need transitions from one model into another to gain more insights. This combination of different models brings new challenges: the models are written with different (research) software, and sometimes models from other fields of experience are used. Therefore, good documentation is crucial. Documentation is often seen as a software engineering problem, but research software is just the tool for simulating models.
The focus of this presentation is on a multidisciplinary approach to understand the complexity of documentation for multi-x-simulations. Perspectives of different disciplines are brought together to determine how simulations need to be documented to be reusable.
With a case study approach, two use cases were evaluated at our institute to understand what is needed for the documentation to be perceived as valuable. In conducting the research, two main aspects caught our attention: the complexity of documenting the whole multi-x-model and its parts, and the validation and verification of a reused model as input for another model, in the broader sense of trusting that a model from another field of experience is valid. We looked deeper into the philosophy of science, research data management and software engineering.
In the philosophy of science, we found three aspects:
(i) Decisions determine our research question: first, we have to decide with which model we build the physics, then with which software solution we simulate the model, and then which question we want to answer. The first two decisions determine the third.
(ii) The validity and verification of models must be ensured through transparency and skill: in the form of documentation and knowing how to document.
(iii) The emergence of models, which depend on their parts and yet are independent of them: the whole model and its parts, as well as the dependencies, must be documented.
These aspects need to be understood and documented in order to understand the model. Within research data management, we see the FAIR principles as an important aspect of describing research results with metadata. Until now, discipline-specific standards have been missing, but hopefully the NFDI initiatives will bring more standardization. In software engineering, we focus on the "docs as code" approach, with the idea of handling documentation like code in order to bring it into the daily routine. Here, concrete steps for writing documentation are in focus. Two main ideas emerged from this approach for our research: documentation needs to be integrated into the researchers' everyday work, and documentation reviews can provide feedback on the content level from experts and beginners.
Documentation is a crucial part of reusable results. Until now, the focus has been on the tools, not on the methods, but the tools can change. In practice, little thought is given to the theoretical framework of simulation in computational engineering. Not only software development methods, but rather a basic understanding of what constitutes the method of simulation and what it should contain, is required. There needs to be a discussion of what the scientific method in simulation science demands, focusing on practice rather than theory, but with epistemologically reflected theory in the background.
License: CC-BY 4.0 ↗
Metadata Schema for Terahertz Research - Presentation
1. Introduction
The success of Collaborative Research Centres (CRCs) requires the exchange of research outcomes between researchers. Furthermore, in the engineering sciences, a wide variety of data is created from observations, simulations, experiments, or subsequent analysis (Sheveleva et al., 2020). Hence, efficient research data management (RDM) is a crucial success factor for large, collaborative projects. RDM systems are developed to process and store data and descriptive metadata, and to make them accessible in an effective way. To search and find relevant research data, it is necessary that the data is well described in a standardized way, usually by means of a metadata schema. Following the FAIR principles, each dataset should be described with detailed metadata providing context information on how the dataset was generated and who acquired or edited it (Wilkinson et al., 2016). Metadata thus provides the information required for the correct utilization and proper interpretation of research data.
A specialized metadata schema for the engineering sciences is EngMeta (Schembera and Iglezakis 2020); however, EngMeta does not support all aspects of terahertz research. Moreover, involving terahertz researchers in developing such a specific metadata schema is critical in order to cover all details of this research area. Therefore, we developed a metadata schema for terahertz research, based on EngMeta, by applying the concept of competency questions in close cooperation with the researchers of the CRC/TRR 196 MARIE. This paper describes how competency questions help to build a new metadata schema or enhance an available schema for terahertz research.
2. Methodology
The method of competency questions was used to identify the needs of terahertz researchers for finding and describing terahertz research data. This method lets the participants assume that there is a comprehensive, omniscient database with well-developed research data from their domain and asks them to formulate questions that they would like to ask this database. Hence, terahertz researchers from the CRC were invited to an online workshop. We explained the assumption and asked them to write down the questions they would like to ask such a database. In the next step, the questions were analyzed and sorted by our team based on the following categories (a minimal sketch of this classification is given after the list):
- complexity of question (values: simple or complicated) -> as the complicated ones needed more focus and discussion
- administrative (values: yes or no)
- out of scope (values: yes or no)
- type of data (values: measurement, simulation, software, design or multiple) -> defines the different sections of the metadata schema, and the metadata fields are determined by sub-classes and their children
- sub-classes (values: method, instrument, software, variable or material)
- instruments/tools (values: type, name, description or version)
- material (values: type, name, description, e.g. optical material properties)
- method (values: type (generation, processing, analysis, others), name or description)
- software (values: name, version, description, open-source, operating system)
- variable (values: name, constant (value) if not: measured minimum value, measured maximum value).
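The following minimal Python sketch shows how one evaluated competency question can be recorded using the categories above. The example question is taken from the results section below; the derived field values are illustrative.

    # Sketch of one evaluated competency question recorded with the categories listed above.
    # The dictionary keys mirror those categories; the derived values are illustrative.
    competency_question = {
        "question": "What algorithm has been used for post processing, what were the parameters?",
        "complexity": "simple",
        "administrative": "no",
        "out_of_scope": "no",
        "type_of_data": ["measurement"],
        "sub_classes": ["software", "variable"],
        "software": {"name": None, "version": None, "description": None},  # fields derived for the schema
        "variable": {"name": "parameter", "constant": False},
    }
    print(competency_question)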
3. Results and Discussion
The first workshop hosted 22 researchers. Overall, 100 questions were collected. The questions were classified according to the type of data that each question addresses. After the first workshop, the competency questions were evaluated according to the above categories. At this point, two competency questions and their evaluation are mentioned as examples. Question: What algorithm has been used for post processing, and what were the parameters? Evaluation: an understandable and simple question which reflects a utilized software and algorithm; its sub-classes are software and parameters. Question: Give me all details about the sample (composition, geometry, reference measurements, other groups' measurements, photos). Evaluation: an understandable and simple question; measurement and design are the data types, and the query about other groups' measurements is administrative and requires contacting other groups, especially when the data is confidential. Thereafter, some relevant fields were added to the sub-classes, such as name, type, description, etc. The extracted fields were assessed by researchers in the second workshop, as well as through uploading the data to the database platform (Dataverse). The researchers confirmed that the terahertz metadata schema effectively covers their needs to archive their data in the database.
4. Conclusion
A metadata schema for terahertz research was developed based on EngMeta, by applying the concept of competency questions in close cooperation with the researchers of the CRC/TRR 196 MARIE. Overall, 100 questions were collected; the questions were then analyzed and sorted according to the following categories: complexity of question, administrative, out of scope, type of data, and sub-classes. An initial metadata schema was obtained based on this sorting. The collection of competency questions has proven to be a good tool to involve subject scientists who do not have prior knowledge of metadata schema creation in the creation of the schema.
5. Acknowledgment
The research work presented in this paper is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) Project-ID 287022738 TRR 196 for Project Z03.
6. References
Schembera, Björn; Iglezakis, Dorothea (2020): EngMeta: metadata for computational engineering. In IJMSO 14 (1), Article 107792, p. 26. DOI: 10.1504/IJMSO.2020.107792.
Sheveleva, Tatyana; Koepler, Oliver; Mozgova, Iryna; Lachmayer, Roland; Auer, Sören (2020): Development of a Domain-Specific Ontology to Support Research Data Management for the Tailored Forming Technology. In Procedia Manufacturing 52, pp. 107–112. DOI: 10.1016/j.promfg.2020.11.020.
Wilkinson, Mark D.; Dumontier, Michel; Aalbersberg, IJsbrand Jan; Appleton, Gabrielle; Axton, Myles; Baak, Arie et al. (2016): The FAIR Guiding Principles for scientific data management and stewardship. In Scientific Data 3, p. 160018. DOI: 10.1038/sdata.2016.18.
License: CC-BY 4.0 ↗
Expanding an Ontology with semantically linked CFD-Simulation Data by Segmentation into reusable Concepts - Presentation
Computational Fluid Dynamics (CFD) simulations generate a large variety of complex (meta-)data, which are inherently difficult to store in a FAIR (Findable, Accessible, Interoperable, Reusable) manner. While the created data itself needs to be handled by domain experts, runtime-generated metadata, e.g. the simulation settings and major output variables, offer the possibility of restoring, revising, and re-evaluating existing simulations. As the parameters of such simulations are interlinked, classifying and storing metadata in a standardized manner is a difficult task. Ontologies are key to the FAIRness of such data, as they make it possible to classify and query the data.
The interlinked parameters of the simulations can be represented in an ontology as a mesh of individuals, which achieves a similar interconnectivity. Each individual thereby acts as a data point in and of itself. However, directly storing the data in an ontology comes with several challenges, as it would generate a large number of connections and axioms and, therefore, cause a high computational effort in the reasoning. Because several common settings do not change between simulations, they would need to be implemented recurrently for each simulation. Classifying these settings increases the computational effort even more, as several properties need to be defined to infer the similarities between the representative individuals.
To solve these challenges, a segmentation method is introduced for data condensation and pre-classification. The method proposed here uses nested Python dictionaries in the form of JSON files as input and intermediate storage and populates an existing ontology with the respective simulation data. Such a nested structure is contained in the solver-logging files of ANSYS CFX. For example, the CEL (CFX Expression Language), used for the setup of boundary conditions and the simulation case in general, is written in nested structures. These structures are parsed as dictionaries and then segmented into sub-dictionaries representing the main concepts. The sub-dictionaries are archived, and relations between different simulation dictionaries are established in order to pre-classify the data. The population of the ontology via the above-mentioned sub-dictionaries then allows for knowledge graph generation. This condenses the data by linking and reusing concepts between multiple simulations. The method helps to reduce the number of manual inputs required to populate the ontology with the given data and thus enables operable and FAIRer data storage for non-experts.
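The following minimal Python sketch illustrates the segmentation and condensation idea: a nested simulation dictionary is split into sub-dictionaries per top-level concept, and identical sub-dictionaries are archived only once and reused across simulations. The key names and the hashing scheme are illustrative and not the actual implementation.

    # Sketch of the segmentation and condensation idea described above. A nested simulation
    # dictionary (parsed from a JSON/CEL export) is split into sub-dictionaries per top-level
    # concept; identical sub-dictionaries are stored once and linked from each simulation.
    import json
    import hashlib

    def segment(simulation: dict) -> dict[str, dict]:
        """Split a nested simulation dictionary into its top-level concept sub-dictionaries."""
        return {concept: sub for concept, sub in simulation.items() if isinstance(sub, dict)}

    def fingerprint(sub: dict) -> str:
        """Stable hash of a sub-dictionary, used to detect settings shared between simulations."""
        return hashlib.sha256(json.dumps(sub, sort_keys=True).encode()).hexdigest()

    archive: dict[str, dict] = {}          # fingerprint -> sub-dictionary (stored once)
    links: dict[str, dict[str, str]] = {}  # simulation id -> {concept: fingerprint}

    def register(sim_id: str, simulation: dict) -> None:
        links[sim_id] = {}
        for concept, sub in segment(simulation).items():
            fp = fingerprint(sub)
            archive.setdefault(fp, sub)    # common settings are archived only once
            links[sim_id][concept] = fp    # the simulation only links to the shared concept

    register("sim_001", {"BOUNDARY CONDITIONS": {"inlet": {"velocity": "1 [m s^-1]"}},
                         "SOLVER CONTROL": {"max_iterations": 500}})
    register("sim_002", {"BOUNDARY CONDITIONS": {"inlet": {"velocity": "2 [m s^-1]"}},
                         "SOLVER CONTROL": {"max_iterations": 500}})   # solver settings reused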
Starting from simulation files, important steps of this method, such as data conversion, are shown, and key challenges are addressed, such as handling arbitrarily named entities that occur throughout the dictionaries, the multiplicity of concepts, varying linkages, and the renaming into semantically aligned classes.
While the method presented is generic in its concept, it is applied to two datasets of CFD simulations of the flow and heat transfer in microreactors, containing a total of 911 simulations and the respective log files. Using EMMO as the top-level ontology, 318 classes are generated and used for classifying the data, resulting in 28,817 automatically created individuals in the ontology to represent the data set. In addition, the efficiency of the method is evaluated: the number of individuals is reduced to approximately 10.2 % of the starting number.
License: CC-BY 4.0 ↗
Prepare RDM for Decentralized and Blockchain-Based Data Source - Presentation
RDM needs to prepare for mixed data sources: centralized data storage, decentralized data infrastructures, and private data governed via blockchains.
So far, RDM groups all over the world have done an impressive job of establishing RDM as a required procedure in the research process to ensure data integrity. Canada, for example, will soon require grant recipients to curate and actively manage their research data: funding agencies such as the Canadian Institutes of Health Research (CIHR), the Social Sciences and Humanities Research Council (SSHRC), and the Natural Sciences and Engineering Research Council of Canada (NSERC) will shortly require grant recipients to adhere to requirements supporting the management of their research data.
However, as more researchers begin to use RDM, data from a wider variety of sources need to be accommodated so that RDM does not become an obstacle for researchers who want to utilize data. Although RDM groups encourage the use of open access data, the fact that data sources will come with mixed permissions in the age of Web 3.0 needs to be acknowledged and handled by RDM groups and tools.
Besides centralized data management, as researchers practice it today, RDM needs to consider the accessing, organizing, securing and sharing of data from two further types of sources. The first is decentralized data that may be stored anywhere in the world with user-controlled access permissions, where the permissions may be coordinated by third-party organizations. The second is private data stored offline with peer-to-peer permissions granted directly from one party to the targeted researcher, with permission-related information encrypted on blockchains. Both types of data could be converted into fully open-access data at any time.
For the first type of data source, RDM groups should prepare a 'whitelist' of third-party coordinators that are not the owners of the data but help users control access permissions, for example by providing tools such as mobile apps. The Solid project, initiated by Sir Tim Berners-Lee and supported by the company Inrupt, is one example; ownership of the data remains with the user. These organizations may follow a particular protocol or standard to enable the decentralized features, so RDM should consider corresponding metadata and enhance current standards to accommodate this scenario. The securing of data also needs to be adapted to the fact that the actual data could be located anywhere, and additional security checks of data sources may be necessary before this type of data can be used safely.
For the second type of data source, RDM groups need to consider the preservation of blockchain-related metadata in RDM systems. The particularity of a blockchain is that, on the one hand, transactions are open and transparent, while on the other hand, entities on the blockchain are autonomous. It is therefore foreseeable that RDM needs special standards for this type of data to remain accessible and shareable for researchers. A potential solution is to build an RDM chain that stores the metadata needed to access and share this type of data and allows users of the RDM system to use the corresponding metadata to decrypt the related research data. Another option is to oblige the receiver to open the data once certain criteria are satisfied, such as passing a point in time or getting approval for funding. There is still a long way to go on this path, but RDM groups need to make sure that this type of data can still be used by researchers in the future.
The overall principle of RDM is to allow researchers to utilize the power of data from all possible sources while guaranteeing that the data are correctly accessed, preserved and secured. RDM groups should therefore work out best practices for integrating decentralized data and blockchain-governed data into RDM systems and continuously seek input from researchers to accommodate their needs.
License: CC-BY 4.0 ↗
Dealing with in-situ monitoring data of Laser Powder Bed Fusion processes: from acquisition to deep learning applications - Presentation
Laser Powder Bed Fusion (LPBF) is a promising additive manufacturing (AM) technology for manufacturing metallic components with complex geometries in a layer-wise manner. To ensure the quality of the end product and process stability, in-situ monitoring systems are integrated into LPBF machines to acquire process information. Due to the layer-by-layer nature of the process, monitoring data can be captured continuously, and a considerable amount of raw data accumulates after each job. With the captured monitoring data, deep learning methods can be used to automatically detect defects within parts and anomalous manufacturing behavior. In order to efficiently store, analyze and use the data for these deep learning methods, a preprocessing pipeline is necessary.
In this paper, we present a data processing pipeline from data acquisition with an optical tomography (OT) monitoring system to deep learning applications. Our OT system is installed on an EOS M 290 LPBF machine and uses a 5.5-megapixel sCMOS (scientific complementary metal-oxide-semiconductor) camera operating in the near-infrared (NIR) spectral range. The captured monitoring images contain process quality information, since they represent the emission energy of the melt pool where the metallic powder melts and consolidates. Raw data is acquired continuously at each layer during the LPBF exposure process. Because the camera's exposure time is limited, multiple images are taken for each layer with a predefined exposure time of 2 s. The acquired OT images are calibrated to reduce systematic noise, including flat-field, defective-pixel, and perspective corrections determined via camera calibration experiments.
With the described monitoring setup, and depending on the processing time of each layer, one layer can produce up to a thousand images. Each image has a size of around 4 megabytes, which is necessary to cover the whole build platform at the required resolution. A single print job with 150 layers therefore requires around 600 gigabytes of storage space. Besides the issue of limited storage space, this raw image data is of no use in its unstructured form. The most intuitive approach to structuring the data is to map each image to a layer number. This is done with an integrated photodiode within the process chamber that monitors the laser emission continuously: during processing, the laser is active continuously and is only turned off after a layer is finished, during powder recoating. Measuring the laser emission therefore allows a clear distinction between the layers of the printing process, and integrating the binary photodiode signal with the OT image data enables a clear allocation of each image to a layer number.
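A minimal sketch of this allocation step is shown below, assuming the photodiode signal is available as a sampled binary array and each image carries an acquisition timestamp; all names and the example signal are hypothetical, not taken from the actual pipeline.

import numpy as np

def assign_layers(image_times, laser_on, signal_times):
    """Map image acquisition times to layer numbers.

    laser_on is the binary photodiode signal (1 while the laser exposes a
    layer); signal_times are the timestamps of those samples. A falling edge
    marks the end of a layer, so the layer number of an image is the number
    of layers completed before its timestamp plus one.
    """
    falling_edges = signal_times[1:][(laser_on[:-1] == 1) & (laser_on[1:] == 0)]
    return np.searchsorted(falling_edges, np.asarray(image_times)) + 1

# Hypothetical example: a stand-in binary signal with two exposure phases.
t = np.arange(0.0, 10.0, 0.5)
laser = (np.sin(t) > 0).astype(int)
print(assign_layers([1.0, 5.0], laser, t))  # -> [1 2]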
The images captured by this approach have a low information density: most of them are dominated by noise or contain no information at all. In order to increase the information density per image, we propose to crop out only the parts of the image where information is present. This requires the position of the printed part. Using a simple transformation matrix that maps part positions within the platform coordinate system to part positions within the image coordinate system, a cropping algorithm reduces the image to a size that encloses all necessary information. In case multiple parts are printed within one job, we propose a simple threshold-based search that compares the standard deviation of the pixel distribution around each part position with a predefined pixel value to detect whether a printed part is present there. Finally, by looping over all part positions, we save to disk all cropped images that satisfy the pixel distribution condition.
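A condensed sketch of this cropping step follows; it assumes that a 2x3 affine matrix mapping platform coordinates (mm) to pixel coordinates has already been determined during calibration, and the window size, threshold and synthetic data are placeholders rather than the values of the actual setup.

import numpy as np

def crop_parts(image, part_positions_mm, T, window=128, std_threshold=5.0):
    """Crop fixed-size windows around transformed part positions.

    A crop is kept only if the standard deviation of its pixel values exceeds
    std_threshold, i.e. only if the window contains actual melt-pool signal.
    """
    crops = []
    for x_mm, y_mm in part_positions_mm:
        u, v = T @ np.array([x_mm, y_mm, 1.0])   # platform (mm) -> image (px)
        r, c, half = int(round(v)), int(round(u)), window // 2
        crop = image[max(r - half, 0): r + half, max(c - half, 0): c + half]
        if crop.size and crop.std() > std_threshold:
            crops.append(((x_mm, y_mm), crop))
    return crops

# Hypothetical usage: noise-only frame with one bright region at (50 mm, 50 mm).
frame = np.random.default_rng(0).normal(0.0, 1.0, (2160, 2560))
frame[450:550, 450:550] += 100.0                      # synthetic melt-pool signal
T = np.array([[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])    # 10 px per mm, no offset
print(len(crop_parts(frame, [(50.0, 50.0), (150.0, 150.0)], T)))  # -> 1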
With the presented data acquisition and processing setup, we propose the following data structure. At the highest level, we define printing jobs with their unique identifiers. A job can contain one or multiple different geometries. Each printed geometry has a position within the build platform, a specific set of laser parameter configurations, and a unique part identifier. Under each part, we gather all processed images that correspond to the part according to its position. Each image is named by layer number and a unique identifier. In this way, each image is associated with a layer number within a part, the corresponding machine parameters, its geometry, and the printing job.
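As a purely illustrative example of this hierarchy (the naming scheme and identifiers below are hypothetical, not the exact convention used on the storage platform), a small helper can derive the storage location of a processed image from these identifiers:

from pathlib import Path

def image_path(root: str, job_id: str, part_id: str, layer: int, image_id: str) -> Path:
    """Build the job/part/layer-image path for one cropped OT image."""
    return Path(root) / job_id / part_id / f"layer{layer:04d}_{image_id}.png"

print(image_path("ot_data", "job_2022_001", "part_A3", 17, "f81c2"))
# e.g. ot_data/job_2022_001/part_A3/layer0017_f81c2.png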
In order to evaluate our data acquisition and processing pipeline, we executed a job that prints multiple parts with different parameter configurations distributed equally over the build platform. All data acquired during the print were processed and stored according to the described structure. With our approach, the average image size is reduced from 4 megabytes to around 40 kilobytes. We implemented our data storage structure within the cloud data storage service Coscine. The platform allows user-friendly access, management, and sharing of the image data collected during LPBF processes and enables further processing through deep learning applications.
License: CC-BY 4.0 ↗
The federal state initiative HeFDI, state of Hesse - Presentation
Many fields of research are increasingly relying on digital research data. This is especially true for the engineering sciences. The amount of data is constantly growing and the possibilities for collecting, processing, and analyzing these data are multiplying. They are of high value for researchers and - in line with good scientific practice - must be responsibly secured at all stages of the data lifecycle. Furthermore, they should also be accessible to other researchers as well as reproducible. Thus, research data management (RDM) plays an increasing role within the scientific community. Currently, it is a rapidly developing field with funding agencies and the German National Research Data Infrastructure (NFDI) as the main drivers of change and individual researchers as the target group for services, standards and best practices.
In this landscape, state initiatives play an important role in providing the link between national (and international) developments and local needs. Their representatives serve as a first point of contact for researchers, constitute a knowledge network with reciprocal support structures, and disseminate information from national and international RDM stakeholders such as the NFDI, the European Open Science Cloud (EOSC), or the Research Data Alliance (RDA).
In the project "Hessian Research Data Infrastructures" (HeFDI), eleven Hessian universities have been working together since the end of 2016 to develop and establish infrastructures and services for research data management (RDM). Central aspects are the development of policies and supporting tools for active data management, the preparation and implementation of training and consulting services, and the support of researchers in publishing and securing their research data.
The presentation highlights the central fields of action of HeFDI in its current funding phase (2021-2024) and describes the services offered by the network. Special attention will be paid to topics concerning the engineering sciences. The following measures are part of HeFDI's strategy to advance RDM in the engineering sciences and beyond:
- Implementing and updating research data policies at all of the participating universities.
- Offering training and consulting services at all of the participating universities.
- Supporting efforts to advance data literacy amongst students.
- Offering advice on data strategies and data management plans in applications for third-party funding.
- Offering tools for the structured development of data management plans.
- Developing a collaborative information pool.
- Providing researchers with information on legal issues (licenses, data protection).
- Jointly developing local solutions for institutional repositories for securing, archiving and publishing data.
- Providing support for actively used data, i.e. for tools, versioning, and licensing.
- Networking with and participating in national initiatives on RDM (NFDI and its consortia, forschungsdaten.info, forschungsdaten.org, RDA, data competence centers).
- Finding solutions for subject-specific RDM.
- Hosting workshops and conferences open to the general public.
The eleven participating institutions are:
- Philipps-Universität Marburg (project coordination)
- Frankfurt University of Applied Sciences
- Goethe-Universität Frankfurt
- Hochschule Darmstadt
- Hochschule Fulda
- Hochschule Geisenheim
- Hochschule RheinMain
- Justus-Liebig-Universität Gießen
- Technische Hochschule Mittelhessen
- Technische Universität Darmstadt
- Universität Kassel
License: CC-BY 4.0 ↗
How to use re3data – Creating, editing, and analyzing information on research data repositories - Workshop
The year 2022 marks the 10th anniversary of the Registry of Research Data Repositories, re3data (https://www.re3data.org/). The global index currently lists close to 3,000 digital repositories across all scientific disciplines – critical infrastructures to enable the global exchange of research data. The openly accessible service is used by researchers and other services worldwide. It provides extensive descriptions of repositories based on a detailed and publicly available metadata schema. Ingests in re3data are managed by an international team of editors who thoroughly analyze the repositories and take care of metadata completeness and quality. re3data promotes open science practices and the visibility of science-driven open infrastructures. Scientific communities, funding agencies, libraries, and other scientific infrastructures use re3data and recommend the service to researchers.
The contribution kicks off with an introduction to the re3data service and the current re3data COREF project, which is dedicated to further improving and enhancing re3data. We will also take a closer look at how repositories for the engineering sciences are represented in re3data.
This is followed by the practical part: we will demonstrate which information an entry in re3data should contain to ensure maximum visibility of the repository. The ingestion process for repositories in re3data will be explained, and participants will learn how to submit new entries and suggest changes.
To round it off, the following section will be dedicated to the reuse and analysis of re3data metadata. re3data provides a central access point and user interface through its website as well as via API. We will introduce you to the API and show use case examples implemented in the free software R.
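The workshop examples are written in R; purely as an illustration of the same idea, the short Python sketch below retrieves the list of registered repositories from the public re3data API and prints the first few entries. The endpoint path and XML element names are stated to the best of our knowledge and should be checked against the current API documentation.

import urllib.request
import xml.etree.ElementTree as ET

# List of all repositories registered in re3data (returned as XML).
URL = "https://www.re3data.org/api/v1/repositories"

with urllib.request.urlopen(URL, timeout=30) as response:
    tree = ET.parse(response)

repositories = tree.getroot().findall(".//repository")
print(f"{len(repositories)} repositories are currently listed in re3data")

# Each list entry carries an id and a name; the full metadata record can be
# fetched via /api/v1/repository/<id>.
for repo in repositories[:5]:
    print(repo.findtext("id"), "-", repo.findtext("name"))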
The session will conclude with an open discussion round and ample time for questions and feedback.
License: CC-BY 4.0 ↗
Fostering the Creation and Management of Solar Energy Data: Scholars' Needs and FAIR RDM Good Practices - Presentation
Data-driven solar energy research produces large amounts of data, which are crucial for informed decision-making. However, these data are often fragmented and difficult to find, which creates inefficiencies. Since solar energy data are vital for addressing, planning, and managing the energy transition, it is fundamental to strengthen solar energy data creation and management. It is therefore relevant to identify the needs of researchers in this field in order to know how best to promote good data creation and management practices.
Researchers' readiness to open databases and make solar energy data findable, accessible, interoperable, and re-usable (FAIR) is the driver of successful data investments and of the reproduction of knowledge for guiding and accelerating the transition to a clean energy system. At the same time, research data management (RDM) is increasingly becoming a subject of interest for academics and researchers. The interest is motivated by a need to support research activities through data sharing and collaboration both locally and internationally. Many institutions, especially in developed countries, practice RDM to accelerate research and innovation, but this is not yet common in developing countries.
This study targets stakeholders (both academic staff and students in the solar energy sector) from two countries with very different socio-economic and political settings: Palestine, a developing country that has suffered political conflict and financial crises for several decades, and Italy, a developed country within the European Union with considerable advantages in scientific research and research data management. We aim to (1) assess the current FAIR RDM practice of the stakeholders and (2) evaluate the FAIRness of available energy research datasets in the targeted countries.
Mixed-method research (quantitative and qualitative) is applied: a focus group and a survey are used to gather information from the stakeholders. The study is underpinned by the DCC curation lifecycle model, which enables a descriptive research design to capture data from researchers. In the focus group with stakeholders from the participating universities (professors in the electrical engineering and solar energy sectors), we discuss the following topics:
- Data types, formats, and categories that you are using/producing and that are particularly important to current energy research demands
- Current data management practice: how, where, when, visibility, and sharing
- Awareness and knowledge of data management tools, and open datasets and repositories in the energy sector
- Main data management issues and challenges
- Previous training in data management and the status of RDM curricula at the participating universities, which include IUG and the University of Parma
To test the compliance of solar energy data resources in the targeted countries with the FAIR principles and to offer guidance for improving their FAIR status, we use different evaluation tools, ranging from checklists to (semi-)automated evaluators.
The collected data and the test results are analyzed to generate descriptive and inferential statistics to address the needs. The recommendations and actions proposed should be based on global best practices while also incorporating the needs and experiences of stakeholders in diverse socio-economic and political settings.
License: CC-BY 4.0 ↗
A No-Nonsense Guide to Higher Code Quality for Researchers - feat. Unit Testing - Workshop
For the vast majority of researchers, working with code is common practice. In the engineering sciences, however, many researchers lack knowledge and experience in how to write good code. As a result, their code is often chaotic, non-reusable, and non-interoperable, and it is hard (or impossible) to include new people such as students or colleagues in developing it further. On the positive side, researchers as well as students in the engineering sciences are highly motivated to improve their programming skills – provided the hurdle is not too high and it does not take too much time.
Drawing on past experiences of getting vertigo when looking at code a supervised student just handed in, of months-long refactoring, and of the frustration of debugging code that was working "just a minute ago", the RDM team at the Institute of Fluid Systems of the Technical University of Darmstadt has developed a set of guidelines for assuring the fundamental quality of code developed there. It consists of steps that are easy and fast to implement and lead to a significant improvement of the written code.
Being researchers themselves, the authors realize that there is no one-size-fits-all, universal solution and that a set of guidelines for code quality should offer a good amount of flexibility while costing minimal effort. These guidelines will be presented in the talk, followed by a workshop on the fundamentals of unit testing.
The guidelines focus on the programming language Python but are applicable to other programming languages as well. They rely on existing standards such as PEP 8 and other thorough guidelines such as Google's. They consist of the following two groups:
- Language Rules: these include linting, spellchecking, and code structuring.
- Style Rules: these include writing good documentation, using naming conventions for variables, classes, modules, etc., managing whitespace, and automatic formatting.
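To give a flavour of the unit-testing fundamentals covered in the workshop, a minimal pytest example is sketched below; the function under test is a made-up illustration, not material from the guidelines themselves.

# test_pump_power.py -- run with `pytest test_pump_power.py`
import pytest

def hydraulic_power(flow_rate_m3_s: float, pressure_pa: float) -> float:
    """Hydraulic power P = Q * dp in watts (made-up example under test)."""
    if flow_rate_m3_s < 0 or pressure_pa < 0:
        raise ValueError("flow rate and pressure must be non-negative")
    return flow_rate_m3_s * pressure_pa

def test_nominal_operating_point():
    # 0.01 m^3/s at a pressure difference of 2e5 Pa corresponds to 2 kW.
    assert hydraulic_power(0.01, 2e5) == pytest.approx(2000.0)

def test_negative_input_is_rejected():
    with pytest.raises(ValueError):
        hydraulic_power(-0.01, 2e5)

In a real project the function under test would live in its own module and be imported by the test file; each test checks one behaviour and fails with a readable message, which is exactly the kind of safety net that makes refactoring and onboarding new contributors less painful.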
The session will also contain a discussion part in which lessons learned from other teams can be shared and participants can present similar approaches, including examples from other programming languages.
License: CC-BY 4.0 ↗
Become a reviewer
- FAIR engineering science: applying principles to practice
- RDM tools and services: usability and automation
- Open science: impact and visibility of engineering insights
- Data literacy in engineering education
Everything at a glance
Data is the commodity of the future. In research, FAIR data, data quality and research data management (RDM) are therefore experiencing a steady increase in importance. Likewise, in industry it is important to collect, structure and evaluate data.
The NFDI4Ing conference brings together research and industry for exchange, mutual learning and networking.
The NFDI4Ing Community Meetings bring together research and industry for exchange, mutual learning and networking. To do justice to the diversity of the engineering sciences, this community meeting focuses on the community in mechanical engineering and production technology (CC-41).
Learn more about current activities in research data management in our topic segments "Get a grip on RDM" and "Metadata, Workflows and Ontologies". The event is divided into these two slots, so there will be two talks at any given time, each associated with one of the two topic segments. What you can expect in the individual topic segments can be seen in the following overview. Additionally, we have added fold-out abstracts to the contributions; you can find them by clicking the contribution titles with the ▶ symbol.
When? | October 26th and 27th, 2022, between 9AM and 5PM |
Where? | TBD: Online or live at the RWTH Aachen |
Who? | Everyone interested in RDM within and beyond NFDI4Ing |
What? | Topics regarding RDM, results and questions from the NFDI4Ing as well as guest lectures both national and international |
As part of the conference, NFDI4Ing also presented an award for the best FDM solution of the year 2021. All solutions that were not developed in the context of NFDI4Ing itself could be nominated. Conference participants had the opportunity to vote for their favorite solution. A list of all nominated FDM solutions can be found here.
Nominate Your Favorite FDM Solution of the Year!
As part of the Community Meeting, NFDI4Ing is presenting an award for the best FDM solution of the year 2022. All solutions that were not developed in the context of NFDI4Ing itself can be nominated. If you helped shape a solution yourself or would like to nominate someone else's, please send us a message with the subject "NFDI4Ing Award" to the following address: community@nfdi4ing.de
As part of the conference, NFDI4Ing is presenting an award for the best FDM solution of the year 2022. All solutions that were not developed in the context of NFDI4Ing itself can be nominated. So if you contributed to a solution yourself or would like to recommend a solution by others, we would be pleased if you nominated it below. As a conference participant, you can also vote for your favorite solution; a list of all nominated FDM solutions can be found here.
08:45 – 09:00
🎫 Arrival and virtual Check-In
09:00 – 17:00
TBA
08:45 – 09:00
🎫 Arrival and virtual Check-In
09:00 – 17:00
TBA
Get a grip on RDM
Special Interest Groups (SIGs) and sections
Learn more about the cross-topic collaborations within NFDI4Ing as well as the sections that span the entire NFDI. In this slot, the topics of RDM quality assurance, RDM training, and legal and ethical aspects are discussed.
RDM Use Cases in the clusters of excellence of CC-41
RDM in practice will be presented in two exciting use cases: in its workshop, the Cluster of Excellence Internet of Production wants to find out what discipline-specific and application-oriented data management plans can look like. Here, the community is called upon to participate! Subsequently, the Cluster of Excellence Fuel Science Center will present the FSC platform, a tool that is used in practice.
Previously little-known or little-used tools to support research
Learn in this slot about tools that could help your research. In production engineering, too, there are experiments that need to be documented. Electronic lab notebooks are currently hardly used in the CC-41 community, but they can be very helpful here. The same is true for the research data repository registry re3data, in which CC-41 is hardly represented. We want to change that!
Metadata, Workflows and Ontologies
Special Interest Groups (SIGs) and sections
This slot addresses cross-application collaborations within NFDI4Ing and beyond. Here, topics like terminologies, metadata, ontologies, workflows and infrastructures are presented.
From Terminologies to Metadata Usage
In three consecutive presentations, the Terminology Service will be introduced first, which is used in the AIMS platform. With AIMS, in turn, application profiles can be created based on existing metadata standards. The data repository Coscine uses these profiles to provide metadata. The result is a toolchain whose components build on one another and strive to deliver real value in RDM.
Ontology Prime Time
Prime Time at the Community Meeting: The "Ontologies Experts Group" of the Cluster of Excellence Internet of Production and fairsharing.org introduce themselves. A must for all ontology fans!
Special issue
Overarching special topics
For those of you who are already well informed about NFDI4Ing and its activities, this trio of articles includes special topics on the ing.grid journal, FAIR Data Spaces, and hardware.
High noon
RDM Use Cases in CC-41
Get hands-on with RDM in these three exciting use cases!
High noon
From Terminologies to Metadata Usage
In three consecutive presentations, the Terminology Service will be introduced first, which will later be integrated into AIMS. AIMS itself will be applied in Coscine. The result is a toolchain whose components build on one another and strive to deliver real value in RDM.
Data dawn at dusk
There is no such thing as "too much RDM"
Not enough RDM for your liking? Find out about other offers regarding RDM information and networking!
Data dawn at dusk
Ontology Prime Time
Prime Time at the Community Meeting: The Ontologies Experts Group of the IoP and fairsharing.org introduce themselves. A must for all ontology fans!
Agenda
The CC-41 2022 Community Meeting comes along with the following program of presentations, workshops and networking opportunities:
Thursday, March 3rd, 2022
All times in CET
08:45 – 09:00 Uhr
🎫 Arrival and virtual Check-In
09:00 – 09:15 Uhr
Welcoming and Introduction to the agenda
» Tobias Hamann, Mario Moser (NFDI4Ing Community Cluster CC-41)
(15 Min, )
09:15 – 09:45 Uhr
New tools for sustainable production research
» Prof. Robert Schmitt (Spokesperson, NFDI4Ing)
(30 Min, )
09:45 – 10:15 Uhr
Presentation and Update regarding NFDI4Ing
» Mario Moser (NFDI4Ing Community Cluster CC-41)
(30 Min, )
10:15 – 10:30 Uhr
☕ Break
10:30 – 12:00 Uhr
Current Activities in the NFDI4Ing: Special Interest Groups (SIGs) of the NFDI4Ing and sections of the NFDI e.V.
Cross-topic collaborations within and beyond NFDI4Ing
10:30 - 11:00 Uhr
SIG FDM Training
» Manuela Richter
(30 Min, )
11:00 - 11:30 Uhr
SIG Quality Assurance & FAIR Metrics
» Iryna Mozgova
(30 Min, )
11:30 - 11:45 Uhr
Section EduTrain (Training & Education)
» Manuela Richter
(15 Min, )
11:45 - 12:00 Uhr
Section ELSA (Ethical, Legal & Social Aspects)
» Grischka Petri
(15 Min, )
Cross-application collaborations within and beyond NFDI4Ing
10:30 - 11:00 Uhr
SIG Workflow Tools
» Dennis Gläser
(30 Min, )
11:00 - 11:30 Uhr
SIG Metadata & Ontologies
» Susanne Arndt
(30 Min, )
11:30 - 11:45 Uhr
Section Common Infrastructures
» Dr. Sonja Schimmler
(15 Min, )
11:45 - 12:00 Uhr
Section (Meta)data, Terminologies and Provenance
» Dr. Rainer Stotzka
(15 Min, )
Cross-topic collaborations within and beyond NFDI4Ing
SIG FDM Training
Research data management (RDM) training and data literacy education is a crosscutting topic within NFDI4Ing and beyond the consortium. While a variety of materials and concepts already exist, it is challenging to select and adapt them to the needs of the engineering community.
The SIG RDM training & education serves to facilitate communication between archetypes, base services and community clusters (CCs) within NFDI4Ing. It also connects to the developments in the NFDI section "Training & Education" as well as other cross-regional initiatives on RDM training. By creating a channel for communication with the community, the SIG aims to support the measures S-6/CC-2 in their goal of utilization, evaluation and adaptation of materials and concepts for basic RDM training according to the needs of the engineering sciences. Source ↗
» Manuela Richter
(30 Min, )
SIG Quality Assurance & FAIR Metrics
The SIG quality assurance and metrics for FAIR data provides and discusses standards, metrics and guidelines for organizing data curation based on existing best practices and the FAIR data principles. The SIG focuses on challenges and applications of quality assurance and metrics in the context of NFDI4Ing, but many issues are transferable and apply to many types of data and research data management processes. Furthermore, the methods and tools developed by NFDI4Ing for the self-organization and self-monitoring of quality and maturity of research data and data management processes require an ongoing exchange of experience between different national and international stakeholders. For this reason, close collaboration with researchers in other NFDI consortia and the larger scientific community is a major goal of the SIG. Source ↗
» Dr. Iryna Mozgova
(30 Min, )
Section EduTrain (Training & Education)
Better research data management in everyday operations relies on the knowledge of the individual researchers. The NFDI therefore advocates strengthening data literacy in university and non-university research. In the section "Training & Education", training modules with teaching materials are to be developed that are geared to the needs of the target groups. In addition, a certificate course is to be designed for data stewards, the people responsible for good data quality within an institution. Source ↗
» Manuela Richter
(15 Min, )
Section ELSA (Ethical, Legal & Social Aspects)
Providing and sharing research data often involves legal questions. When handling personal data, for example, data protection must be given particular consideration, while in other areas intellectual property law plays a special role. The section "Ethical, Legal & Social Aspects" is intended to offer a forum for exchanging legal, social-science and research-ethics experience. Together, guidelines and legal standards for research data management in everyday scientific practice are to be developed. Source ↗
» Dr. Dr. Grischka Petri
(15 Min, )
Cross-everything: Journals, Data Spaces, Hardware
Journal ing.grid
The journal is committed to the principles of Open Data, Open Access and Open Review. Open Data enhances transparency of scientific processes, accountability of researchers and reusability of research results. Open Access facilitates the dissemination of scientific discoveries, making them operational for addressing societal issues. Open Review nurtures vibrant scholarly discussion.
The journal is firmly rooted in the engineering sciences. It welcomes contributions from all engineering subject areas. Moreover, the journal recognizes connections and common practices across all subdisciplines of the engineering community and encourages active exchange of experiences. Source ↗
» Kevin Logan
(30 Min, )
Gaia-X meets RDM: The FAIR Data Spaces
In the FAIR Data Spaces, the federated, secure data infrastructure Gaia-X and the NFDI are connected into a common, cloud-based data space for industry and research in compliance with the FAIR principles. That is, data are to be shared in a findable, accessible, interoperable and reusable way. Source ↗
» Dr. Marius Politze
(30 Min, )
Open.Make - Towards open and FAIR hardware
Abstract: Currently, some grassroots initiatives are pushing to develop an "open hardware strategy for science" that extends open source principles from software to physical products. This is very much in line with the promotion of FAIR data principles. Open.Make aims to explore best practices for creating and publishing research hardware and to develop a prototype publishing platform for research hardware. Lessons learned will also inform open access guidelines to assess and ensure replicability of hardware in science. Given the wider role of academic research in society, a new career path for open and FAIR hardware engineers may also emerge in the future. Source ↗
» Robert Mies
(30 Min, )
11:30 - 12:00 Uhr
Coming soon
Cross-application collaborations within and beyond NFDI4Ing
SIG Workflow Tools
Software-driven scientific workflows are often characterized by a complex interplay of various pieces of software executed in a particular order. Moreover, each process in a workflow may pose a number of requirements on the software or hardware environment.
In this SIG, we want to elaborate, together with the scientific community, a vision on how scientific workflows should be created, packaged, and published in order to be as FAIR as possible. We want to evaluate if existing workflow tools provide reusable solutions, and identify the capabilities that are missing to reach our goal: reproducible research workflows, by anyone, anywhere and anytime. Source ↗
» Dr. Dennis Gläser
(30 Min, )
SIG Metadata & Ontologies
The SIG metadata & ontologies supports the exchange between NFDI4Ing Archetypes, Community Clusters and Base Services on cross-sectional topics on metadata and ontologies as well as services using such resources. It allows the different groups to update each other on their progress and requirements and to develop services that meet practical demands in research data management in engineering.
The SIG will develop a metadata model for engineering research metadata according to the community’s needs in its subgroup Metadata4Ing, thereby fostering harmonization activities in NFDI4Ing. It will also follow and contribute to cross-consortia activities related to metadata and ontologies. Its subgroup Metadata4Ing has already hosted a cross-consortia workshop and will continue to collaborate with other NFDI consortia by joining the future NFDI section “(Meta)data – terminologies – provenance” which is currently forming. Source ↗
» Susanne Arndt
(30 Min, )
Section Common Infrastructures
Several research disciplines already have their own, isolated information infrastructures. So that these heterogeneous offerings can be used interoperably in the future and interdisciplinary (data) exchange is advanced, they are to be embedded into a common infrastructure. The realization of a multi-cloud-based basic infrastructure is envisaged. The section "Common Infrastructures" will also address the topics of sustainable usability and long-term archiving. Source ↗
» Dr. Sonja Schimmler
(15 Min, )
Section (Meta)data, Terminologies and Provenance
Common standards in research data management are the basis for effective data reuse. The section "(Meta)data, Terminologies and Provenance" works on this: best practices for modelling terminologies, vocabularies and ontologies are to be developed. In addition, uniform and traceable procedures for documenting the technical and cultural aspects of the context in which (meta)data are created are to be established, among other things. Source ↗
» Dr. Rainer Stotzka
(15 Min, )
12:00 – 13:00 Uhr
🍝 Lunch break
13:00 – 14:30 Uhr
Use Cases and Metadata
RDM Use Cases related to mechanical and production engineering on the basis of two clusters of excellence
Workshop: Developing and implementing data management plans for engineering projects of the Cluster of Excellence Internet of Production (IoP)
How can generic data management plans be adapted and extended for a specific discipline?
How can data management plans be established within existing project structures?
We will get to the bottom of these two questions with input from projects of the Research Data Alliance and the Cluster of Excellence Internet of Production.
» Soo-Yon Kim
» Sabine Schönau
(60 Min, )
Harnessing the valuable knowledge in the interdisciplinary cluster of excellence "The Fuel Science Center (FSC)"
The FSC platform keeps about 30 scientific institutes and 170 researchers up to date on the cluster's research results. Project descriptions and results as well as current publications are displayed in a transparent and standardised form. This enables researchers to see who has published what, with whom, and on which topic in one or more projects. This information is used by the FSC steering committee to adjust the research direction of the entire cluster with respect to content.
» Robert Jungnickel
(30 Min, )
14:00 - 14:30 Uhr
Coming soon
From Terminologies to Metadata Usage
13:00 - 13:30
Terminology Service
» Dr. Felix Engel
(30 Min, )
AIMS: Applying Interoperable Metadata Standards
Good scientific practice requires precise and comprehensible documentation of results. Research data management based on widely standardized metadata is therefore essential. This is all the more important when researchers want to share and publish their own research data or reuse archived data of third parties. Using a use case from production engineering, it is demonstrated how the integration of modern standards will enrich everyday research in the future. In AIMS, an interdisciplinary team is developing a platform for creating and sharing application-specific metadata schemas with high interoperability and reusability.
» Nils Preuß
(30 Min, )
Coscine – Research (meta)data management made easy
Metadata is a core aspect of good research data management, but data is rarely tagged with metadata in everyday research. The research data management platform Coscine aims to change this by linking the uploading of data to the assignment of metadata.
» Benedikt Heinrichs
» Dr. Ilona Lang
(30 Min, )
13:30 - 14:00 Uhr
Coming soon
14:30 – 15:00 Uhr
☕ 💬 Virtual coffee break and networking session
Grab a coffee and join other interested people in one of our virtual breakout rooms.
15:00 – 16:30 Uhr
Networking and Ontologies
RDM tools in mechanical and production engineering
Electronic lab notebooks in physics and engineering
Many institutes and working groups are currently looking for electronic lab notebook (ELN) solutions. The market is hard to survey, and those searching can often barely formulate their requirements. This contribution shows what ELNs can do, which different approaches exist, and where further help can be found. It is limited to solutions for physicists, materials scientists and engineers.
» Torsten Bronger
(30 Min, )
Workshop: re3data – Indexing and discovering research data repositories for the engineering sciences
The service re3data, the Registry of Research Data Repositories, currently lists over 2700 digital repositories across all scientific disciplines and is used by researchers and other services worldwide. A variety of funders, publishers, and scientific organizations refer to re3data within their guidelines and policies, recommending it as a trusted service to researchers looking for appropriate repositories for storage and discovery of research data. Since January 2020, the re3data COREF project has been funded by the DFG to further develop and enhance the registry service.
The talk will give a general introduction as well as a demo of the service, in particular the indexing and curation process. We will take a brief look at the subject classification implemented by re3data and examine which repositories indexed in re3data are associated with institutions participating in NFDI4Ing. We conclude by opening a discussion with the audience with the goal of identifying options to ensure good representation of research data repositories for engineering sciences.
» Robert Ulrich
» Dorothea Strecker
» Rouven Schabinger
(60 Min, )
15:30 - 16:30 (preliminary)
re3data – Indexing and discovering research data repositories for the engineering sciences
» Robert Ulrich
» Dorothea Strecker
» Rouven Schabinger
(60 Min, tbd)
15:30 - 16:00 Uhr
Coming soon
The Power of Ontologies
Cluster of Excellence Internet of Production: Proposed Ontology for Digital Shadows from Theory to Practice
In this presentation, we present our proposed conceptual model for describing digital shadows, which has been established via interdisciplinary research in the German Cluster of Excellence Internet of Production. Furthermore, we provide a schema and a tool to implement and analyze the proposed framework. This includes data lifting, validation, querying, etc., where the user can validate the schema against a reference schema and run the desired queries.
» Anahita Farhang Ghahfarokhi
(30 Min, )
Live-Demo: FAIRsharing: an ecosystem of research standards and databases for effective RDM
FAIRsharing is an informative and educational resource on interlinked standards, databases and policies, three key elements of the FAIR ecosystem. FAIRsharing is adopted by funders, publishers and communities across all research disciplines. It promotes the existence and value of these resources to aid data discovery, interoperability and sharing across all of our stakeholder groups. Here we discuss how FAIRsharing can be searched and updated by our user community, and how you can make the best use out of it as part of a broader data management infrastructure. We will also let you know how you can take part and contribute to the development of the description of engineering resources within FAIRsharing.
» Allyson Lister
(60 Min, )
16:30 – 17:00 Uhr
(preliminary)
Journal ing.grid
» Kevin Logan
(30 Min, )
16:30 – 17:00 Uhr
Outlook and farewell
- Feedback
- Outlook for next steps
- Future offerings for networking in the community and contact options
(30 Min, )
17:00 Uhr
🏴 End of the event
from 17:00
🍻 Virtual After-Glow
Was the networking session too short? Do you still have one question open? Feel free to stay a few minutes longer.
Excited for more? Sign up now!
Questions
Do you have questions regarding the event? Please use this contact form to get in touch with us!