Base Service “Automated data and knowledge discovery in engineering literature”

Text and data mining methods offer the potential to gain new insights from publications. The NFDI4Ing base service “Automated data and knowledge discovery in engineering literature” is developing the legal and technical basis for making engineering literature accessible for this purpose.

Text and data mining methods allow the computer-based analysis of large volumes of documents (symbolic image). Image by Reto Scheiwiller@Pixabay.

Motivation
According to the scientific database Dimensions.ai, over 950,000 scientific publications in the field of engineering were published in 2023. These publications contain a wealth of information in the form of full text, tables and figures, including their captions and headings, bibliographies and appendices. Text and data mining (TDM) methods offer the potential to gain new insights from this wealth of information. TDM methods are already being used, for example in materials science, to answer a wide range of research questions.

Our work programme
In the base service “Automated data and knowledge discovery in engineering literature”, we are working on making large quantities of engineering literature accessible for TDM applications. We are focusing on the following questions:

  • How can engineering publications be aggregated and made available via a standardized system and in a data format that is as uniform as possible?
  • What is the legal framework for TDM for research purposes in Germany?
  • How can existing TDM technologies be implemented in such a way that engineering researchers can derive practical insights from the aggregated literature?

Provision of documents for TDM
In the context of the first question, we are identifying the engineering literature that is relevant to research institutions in Germany. To this end, Fraunhofer Publica, the bibliography of the Fraunhofer Society, was analysed with the kind support of the Library and Information Services of the Fraunhofer Institute for Production Technology IPT. It served as an example for identifying the journals and conference proceedings in which members of the Fraunhofer Society have frequently published. The results of the analysis can be found here. In the next steps, we will focus on the most important open access journals on this list and attempt to provide them in a machine-readable, structured format. An agreement has already been reached with MDPI, the open access publisher that accounted for most of the open access journal publications. It allows us to harvest, convert and provide the publisher’s data. The necessary infrastructure is currently being set up. We plan to use further data sources, such as the Open Access Monitor or the bibliographies of other university and non-university research institutions, to identify relevant open access literature and then make it available.
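The kind of venue analysis described above can be illustrated with a minimal sketch. It is not the analysis that was actually carried out on Fraunhofer Publica: the records, the field names "venue" and "open_access", and the function rank_venues are all assumptions made up for this example.

```python
from collections import Counter

# Hypothetical bibliography records. The field names "venue" and
# "open_access" are assumptions for illustration only and do not
# reflect the actual Fraunhofer Publica data schema.
records = [
    {"title": "Paper A", "venue": "Journal X", "open_access": True},
    {"title": "Paper B", "venue": "Journal X", "open_access": True},
    {"title": "Paper C", "venue": "Conference Y", "open_access": False},
    {"title": "Paper D", "venue": "Journal Z", "open_access": True},
    {"title": "Paper E", "venue": "Journal X", "open_access": False},
]


def rank_venues(records, open_access_only=False):
    """Return (venue, count) pairs sorted by descending publication count."""
    counts = Counter(
        record["venue"]
        for record in records
        if not open_access_only or record["open_access"]
    )
    # Sort by count (descending), then by venue name for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Ranking over all publications, and over open access publications only.
all_ranked = rank_venues(records)
oa_ranked = rank_venues(records, open_access_only=True)

for venue, count in oa_ranked:
    print(f"{venue}: {count} open access publication(s)")
```

Restricting the ranking to open access publications mirrors the focus on open access journals described above. A real analysis would additionally have to normalize venue names, for example by ISSN, before counting.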

Legal aspects
Our initial focus on open access journals is based on aspects of copyright law that must always be taken into consideration when providing and using literature for TDM purposes. These legal aspects are explained in detail in the Guidelines on text and data mining for research purposes in Germany, which we published last year. The guidelines describe the conditions under which TDM may be applied to scientific publications for scientific purposes on the basis of statutory exceptions and/or contracts, and the risks involved. Finally, they describe how publications can be used for TDM if neither a statutory exception exists nor a contractual authorization is given.

Usage scenarios
TDM usage scenarios are being developed as part of the third question outlined above. For example, we are investigating how a data mining service for training large language models (LLMs) could be implemented on the basis of our legal and technical work results to date. This would open up new research opportunities for scientific work. Initial tests on the open access literature that has been harvested so far on a trial basis are planned for the near future. In addition, further concepts are being developed to enable researchers to gain knowledge from the literature, for example through the possible integration of literature-based data into the materials science research data infrastructure Kadi4Mat.

Do you use TDM methods in your research and need open access literature in a standardized structured data format? We would be pleased to hear about your use case. Feel free to contact us via one of the contact addresses provided on our website!

Elke Brehm
Jens Freund
Marvin Gusen
Michael Selzer
