Update on automated data and knowledge discovery in engineering literature

Analysing large collections of documents using text-and-data-mining methods can yield entirely new scientific insights. In the NFDI4Ing task area "Automated data and knowledge discovery in engineering literature" (S-7), services are being developed to make it easier for researchers to apply these innovative methods in practice.

In the age of digital media and a steadily increasing number of scientific publications, text and data mining (TDM) methods offer the opportunity to gain insights from large volumes of documents that could not be obtained by reading the documents individually. TDM methods are already in use: in materials science, for example, they are applied to extract synthesis routes for existing materials from scientific publications in structured form, with the aim of eventually simulating synthesis pathways for novel materials based on these data.¹

Current obstacles for TDM

The first step in a TDM analysis is usually to construct the corpus. The relevant texts must be downloaded from their respective sources (most often publisher platforms), stored locally, and prepared for the actual analysis. Currently, these steps are very time-consuming because publisher platforms vary widely: each has its own technical access options, ranging from simple web pages to FTP servers and APIs, and its own legal terms of use. In addition, the downloaded documents usually arrive in different file formats and under various licences.

Services currently under development in S-7

To address these obstacles, one aim of our measure is to provide researchers, to the extent legally and technically possible, with access to engineering literature in a machine-readable, structured XML format that is particularly well suited to TDM applications. To achieve this goal, we are currently harvesting engineering publications, preparing their conversion into the uniform format, and investigating how they can be provided in compliance with the applicable copyright regulations. In parallel, guidelines for researchers on the legal aspects of text and data mining are being developed. To this end, we are analysing the precise boundaries of the statutory exceptions for text and data mining, as well as the open-access licences and licence contracts of the publishers of the most important resources in the field.
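To illustrate what a uniform, structured XML representation could look like in practice, here is a minimal Python sketch. It is not the S-7 format or implementation: the function name `to_uniform_xml` and the element names (`document`, `title`, `license`, `body`, `p`) are assumptions chosen purely for illustration, and the sketch presumes that the text has already been extracted from the source file.

```python
import xml.etree.ElementTree as ET


def to_uniform_xml(doc_id, title, license_name, paragraphs):
    """Wrap already-extracted text in a simple, uniform XML record.

    All element and attribute names are hypothetical; they only
    illustrate the idea of one structure for all sources.
    """
    root = ET.Element("document", attrib={"id": doc_id})
    ET.SubElement(root, "title").text = title
    # Keeping the licence next to the text makes later legal checks easier.
    ET.SubElement(root, "license").text = license_name
    body = ET.SubElement(root, "body")
    for number, text in enumerate(paragraphs, start=1):
        paragraph = ET.SubElement(body, "p", attrib={"n": str(number)})
        # Collapse line breaks and repeated spaces left over from extraction.
        paragraph.text = " ".join(text.split())
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_uniform_xml(
        "doc-1",
        "Example article",
        "CC BY 4.0",
        ["First  paragraph\nwith a line break.", "Second paragraph."],
    ))
```

Once every document follows the same structure, regardless of the platform it came from, a TDM pipeline can iterate over the `p` elements without any source-specific parsing.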

Looking for feedback

We are always very interested in the literature needs of text-and-data-mining projects, both to check them against the literature we can already provide and to extend our harvesting base. If you are working on a TDM project, we would be very pleased if you could send us a short message listing the literature you need.

¹ Kononova, O., Huo, H., He, T. et al. Text-mined dataset of inorganic materials synthesis recipes. Sci Data 6, 203 (2019). https://doi.org/10.1038/s41597-019-0224-1

J. Freund