Developing an advanced research assistant system for engineering literature

By using large language models to automatically extract knowledge from large volumes of literature, and by integrating this capability with a research data management platform, we aim to offer users more intelligent methods for data management and analysis.

Enhancing Research Efficiency: Automated Knowledge Acquisition and Data Management in Engineering Literature

In an era of rapid technological advancement and data proliferation, the ability to efficiently access and exploit scientific knowledge hidden in textual publications has become paramount. An automated knowledge acquisition platform that is integrated with data management in research workflows can support the advancement of scientific research. We are developing an advanced research assistant system specifically designed to navigate and interpret diverse engineering literature and to extract valuable information from it.

We are exploring the potential of modern Text and Data Mining (TDM) methods, in particular the Retrieval-Augmented Generation (RAG) architecture. RAG combines a retrieval step, which selects relevant passages from a document collection, with a generative machine learning model such as a Large Language Model (LLM), which formulates the answer based on the retrieved passages. This combination is intended to improve information retrieval and answer accuracy.
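To illustrate the retrieve-then-generate idea behind RAG, the following is a minimal sketch and not the implementation used in the assistant system. It assumes a tiny, hypothetical in-memory corpus (`DOCUMENTS`) and ranks passages with a simple bag-of-words cosine similarity. A real system would more likely use dense embeddings and a vector index, and would pass the assembled prompt to an LLM. All names in the sketch are illustrative assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus standing in for passages of harvested literature.
DOCUMENTS = {
    "doc1": "Phase-field simulations model microstructure evolution in alloys.",
    "doc2": "Research data management platforms store records and metadata.",
    "doc3": "Fatigue testing measures crack growth in steel under cyclic load.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(text):
    """Represent a text as a bag-of-words term-frequency vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, k=2):
    """Retrieval step: return the ids of the k passages most similar to the query."""
    query_vec = vectorize(query)
    ranked = sorted(
        documents.items(),
        key=lambda item: cosine(query_vec, vectorize(item[1])),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


def build_prompt(query, documents, k=2):
    """Augmentation step: place the retrieved passages in front of the question.

    The returned string is what would be sent to a generative model;
    the model call itself is deliberately left out of this sketch.
    """
    doc_ids = retrieve(query, documents, k)
    context = "\n".join(f"[{doc_id}] {documents[doc_id]}" for doc_id in doc_ids)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    print(build_prompt("How does crack growth behave in steel?", DOCUMENTS, k=1))
```

Keeping the passage identifiers in the prompt is one simple way to let the generated answer point back to its sources, which matters when the answers are meant to support scientific work.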

The assistant system is designed to support researchers by providing concise summaries, extracting key information, and answering complex queries related to scientific documents. We are also working on its integration with Research Data Management (RDM) frameworks such as Kadi4Mat, an open-source platform designed to manage research data efficiently and to provide seamless access to a large repository of research data. By combining the RDM platform with TDM approaches, we aim to give users unified access to current, high-quality research data and to their own uploaded document collections. The result is intended to be a robust tool for scientific inquiry that enables more informed decision-making and accelerates the research process. Our aim is to demonstrate the effectiveness of the research assistant, highlighting its ability to improve research efficiency and collaboration, and its potential to change the landscape of scientific research support.

Harvesting and Provision of Machine-Readable Engineering Literature

To provide input data for this and other literature-based services, we continue to work on obtaining open access engineering literature and providing it in a uniform XML format. We have already concluded agreements with the open access publishers Copernicus and MDPI, which kindly provide us with their journal archives, consisting of PDF files and, to a large extent, full-text XML files. In addition, we are in negotiations with Frontiers and have contacted PLOS. Unfortunately, IEEE declined our request to make its open access journals publicly available for text and data mining analyses. Alongside these negotiations, we are continuing to work on the technical systems that harvest, convert and finally provide the document files.

The first example files are accessible in the public test area of our document repository TUstorage. At the moment, they are limited to the original formats provided by the publishers, i.e. they have not yet been converted to our uniform XML target format.

If you are interested in specific open access literature that should be provided for text and data mining analyses in the future, please send an email to jens.freund[at]tu-darmstadt.de. We look forward to your suggestions.

Yinghan Zhao
Michael Selzer
Arnd Koeppe
Jens Freund

Tags

NFDI4ING services may be relevant to different users according to varying requirements. To support filtering or sorting, we added a tag system outlining which archetype, phase of the data lifecycle, or degree of maturity a service corresponds to. By clicking on one of the tags below, you can get an overview of all services aligned with each tag.

The tags correspond to:
The Archetypes: Services relevant to Alex – Bespoke Experiments, Betty – Research Software Engineering, Caden – Provenance Tracking, Doris – High Performance Computing, Ellen – Complex Systems, Fiona – Data Re-Use and Enrichment

The data lifecycle: Services related to Informing & Planning, Organising & Processing, Describing & Documenting, Storing & Computing, Finding & Re-Using, Learning & Teaching

The maturity of the service: Services sorted according to their maturity and status of their integration into the larger NFDI service landscape. For this we use the Integration Readiness Level (IRL), ranging from IRL0 (no specifications, strictly internal use) up to IRL4 (fully integrated in the German research data landscape and the EOSC). Click here for a diagram outlining all Integration Readiness Levels.