S-7 - NFDI4Ing

The NFDI4Ing base services.
Measure S-7

Automated data & knowledge discovery in engineering literature

As outlined in the FAIR principles, research data management aims not only to enable humans to find and interpret data, but also to enable automated statistical analysis of data by machines, i.e. data mining and machine learning. In the engineering sciences, the vast majority of available data is hidden in textual publications. For example, in a 2019 survey by the NFDI4Ing consortium, three out of four research groups stated that they do not publish results in data or code repositories (the survey data can be found at https://doi.org/10.25534/tudatalib-104). Additionally, open access publishing is rather uncommon in engineering (FAIR study group TIB). Therefore, generally available engineering research data, in the sense of structured representations of information such as tables or datasets, is in fact not accessible to state-of-the-art scientific methods. Well-prepared, high-quality (meta)data is a prerequisite for algorithmic approaches from fields like high performance data analytics, machine learning, artificial intelligence, and text mining.

Key challenges and objectives

Measure S-7 aims to promote the application of text and data mining (TDM) methods in the engineering sciences. These technologies will enable engineering scientists to identify patterns in, and gain new insights from, large quantities of publications that cannot be uncovered by reading individual publications one at a time.

The basis of every TDM analysis is a text corpus. There are three main challenges in the process of generating a suitable corpus:

1. For TDM analyses, publications should be in a machine-readable, structured format. In reality, however, many publications are still provided in PDF format, often with diagrams and tables that are not machine-readable and are therefore not accessible to TDM analyses without pre-processing steps.
2. There is no central source for scientific publications in machine-readable, structured formats. Publications are provided by different publishers via different technical systems, e.g. APIs, FTP servers or websites.
3. Since most of the engineering science literature is not published under open access licences, the legal aspects of the text and data mining process itself, as well as of the publication of its results, are often complicated. Nevertheless, these aspects have to be carefully considered in order to avoid copyright infringements.

Measure S-7 will develop services to address these challenges, including the provision of a machine-readable corpus of engineering literature and legal guidelines concerning publications with different licence models. In addition, data science approaches will be evaluated to fulfil the knowledge discovery requirements of several archetypes.

Tasks

Task S-7-1: Providing a service to enable text and data mining in engineering literature (TUDA)
Based on the communities’ and archetypes’ requirements, the goal is to build a large digital corpus from, e.g., articles, proceedings, and other document types relevant for engineering. All documents will be converted to machine-readable text, using OCR and layout recognition where needed. Documents already available in machine-friendly formats will be harmonised into a structured XML format as far as possible.
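The harmonisation step can be illustrated with a minimal sketch. It assumes that the text of a document has already been extracted into a list of paragraphs. The function name `to_harmonised_xml` and the element names `document`, `title`, `body` and `p` are hypothetical placeholders and do not represent the schema actually used in Task S-7-1.

```python
import xml.etree.ElementTree as ET


def to_harmonised_xml(doc_id, title, paragraphs):
    """Wrap already-extracted text in a small, hypothetical XML structure."""
    root = ET.Element("document", attrib={"id": doc_id})
    ET.SubElement(root, "title").text = title
    body = ET.SubElement(root, "body")
    kept = 0
    for text in paragraphs:
        # Collapse line breaks and repeated spaces left over from extraction.
        cleaned = " ".join(text.split())
        if not cleaned:
            continue  # skip fragments that contain only whitespace
        kept += 1
        ET.SubElement(body, "p", attrib={"n": str(kept)}).text = cleaned
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_harmonised_xml(
        "example-001",
        "A sample title",
        ["First  paragraph\nof text.", "   ", "Second paragraph."],
    ))
```

A real pipeline would additionally have to represent figures, tables, references and bibliographic metadata, which this sketch leaves out.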

Task S-7-2: Providing guidelines for the legal aspects of text and data mining (TIB)
The goal of Task S-7-2 is to provide the legal basis for copyright clearance. Researchers will be able to check if and how particular publications can be made available in digital form for the purpose of TDM. In particular, the task will provide guidelines concerning publications with different licence models, including open access licences, and concerning compliance with copyright regulations.

Task S-7-3: Applying data science methods for knowledge discovery (KIT)
Data science approaches benefit from access to as much data as possible, from as many sources as possible, in order to gain new insights for research. Task S-7-3 evaluates how the information gathered with the help of tasks S-7-1 and S-7-2 can be used to apply data science approaches that meet the knowledge discovery demands of different archetypes.
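As a simple illustration of a corpus-level analysis, the following sketch counts in how many documents each term occurs, a pattern that is not visible when reading a single publication. It is an assumption for illustration only and not a method chosen in Task S-7-3. The function names and the in-memory dictionary that stands in for a corpus are hypothetical.

```python
import re
from collections import Counter


def document_frequency(corpus):
    """Count in how many documents each lower-cased term occurs."""
    counts = Counter()
    for text in corpus.values():
        # Simplification: only ASCII letter sequences are treated as terms.
        # Using a set means each document counts at most once per term.
        terms = set(re.findall(r"[a-z]+", text.lower()))
        counts.update(terms)
    return counts


def terms_in_at_least(corpus, min_docs):
    """Return the sorted terms that occur in at least `min_docs` documents."""
    counts = document_frequency(corpus)
    return sorted(term for term, n in counts.items() if n >= min_docs)


if __name__ == "__main__":
    sample = {
        "a": "Steel fatigue test",
        "b": "Fatigue of steel welds",
        "c": "Polymer test",
    }
    print(terms_in_at_least(sample, 2))
```

Real analyses would typically add steps such as stop-word removal, handling of non-ASCII characters and domain-specific term normalisation.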

Results

As a first step, several engineering working groups at TU Darmstadt were asked about the literature that plays an important role in their research and whose automated analysis could offer them added value. As a result, we received lists of specific literature (e.g. journals and conference proceedings) that are of particular importance for engineering research at TU Darmstadt. One request that came up several times was the provision of machine-readable figures and tables.

In general, TDM methods are still rarely used in the engineering sciences, with the exception of materials science, where extensive preliminary work exists (see e.g. https://ceder.berkeley.edu/publications/2021_text_mining_review.pdf).