NFDI4Ing Storage & Repositories

In the NFDI4Ing base service “repositories & storage”, we reached several milestones during the last year, with many new developments in both the storage and the repositories domains. Below we present recent developments in applying for storage, updates to the Coscine API, the progress of the Data Transfer Federation, and the ongoing development of the Data Collections Explorer.

Applying for Storage with JARDS

Storage space is valuable, and many researchers need to store their research data. Most existing storage distribution systems are ad hoc, require (internal) transfers of funds, or do not scale to the institutional or even national level. Most importantly, the value for the scientific community often remains unaddressed. Starting with its existing data management platform Coscine, RWTH Aachen University adapted JARDS, a tool already used by many computing centers in Germany to handle applications for computing time. The aim is to unify applications for scientific IT resources, make them more transparent and comprehensible, and ease the handling of formalities.

Coscine API v2

RWTH Aachen University’s Coscine application programming interface (API) has been redesigned from the ground up. The new API v2, which is currently being rolled out, offers a more flexible yet more consistent and easier-to-use interface. Highlights include response pagination, which improves the handling of large amounts of data: results can now be retrieved in manageable chunks instead of all at once. Data Transfer Objects simplify data transfer, reduce the likelihood of errors, and improve performance. Validation filters automatically check data as it enters the system, minimizing errors and ensuring more consistent data. Resource-based authorization is a security mechanism that determines access permissions based on the properties of the protected resources themselves. Access and privileges are granted only to those users and systems that can prove their identity.
More consistent API responses make it easier to understand what is happening under the hood. Should something go wrong, detailed status and error codes enhance transparency and ease debugging.
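
To illustrate how a client might consume paginated responses, here is a minimal Python sketch. The page layout (`data`, `pagination`, `hasNext`) and the helper names `iter_items` and `make_fake_fetch` are assumptions made for this example, not the documented Coscine API v2 schema. A stand-in fetch function replaces the HTTP call so that the sketch runs without network access.

```python
from typing import Any, Callable, Dict, Iterator, List

# Hypothetical page shape: {"data": [...], "pagination": {"hasNext": bool}}.
# The actual Coscine API v2 response schema may differ.
Page = Dict[str, Any]


def iter_items(fetch_page: Callable[[int, int], Page],
               page_size: int = 50) -> Iterator[Any]:
    """Yield items page by page until the server reports no further page."""
    page_number = 1
    while True:
        page = fetch_page(page_number, page_size)
        yield from page["data"]
        if not page["pagination"]["hasNext"]:
            break
        page_number += 1


def make_fake_fetch(items: List[Any]) -> Callable[[int, int], Page]:
    """Return a stand-in for an HTTP call that serves `items` in pages."""
    def fetch_page(page_number: int, page_size: int) -> Page:
        start = (page_number - 1) * page_size
        chunk = items[start:start + page_size]
        has_next = start + page_size < len(items)
        return {"data": chunk, "pagination": {"hasNext": has_next}}
    return fetch_page
```

In a real client, `fetch_page` would issue an authenticated HTTP request with the page number and page size as query parameters; the loop itself would stay the same.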

The new API v2 has been set up as a single project, which makes it easier to manage and minimizes potential dependency issues. In addition, the API v2 supports OPTIONS requests, which let clients query the API about the available communication options. Furthermore, new discovery features provide clients with information about API features and resources without the need to fully integrate with the API.
Last but not least, the new API v2 allows live configuration updates. Changes can be made while the system is running and in use, which reduces downtime and improves service availability.
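
As a rough illustration of such a query, the following Python sketch builds an HTTP OPTIONS request and parses the standard `Allow` response header into a set of methods. The URL in the usage note is a placeholder, and whether the Coscine API v2 reports its options through `Allow` is an assumption of this example.

```python
import urllib.request
from typing import Set


def build_options_request(url: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP OPTIONS request for `url`."""
    return urllib.request.Request(url, method="OPTIONS")


def parse_allow_header(value: str) -> Set[str]:
    """Split an Allow header such as 'GET, POST, OPTIONS' into methods."""
    return {part.strip().upper() for part in value.split(",") if part.strip()}
```

A client could send the request with `urllib.request.urlopen(build_options_request("https://api.example.org/resources"))` and pass the `Allow` header of the response, if present, to `parse_allow_header`.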

Data Transfer Federation

The NFDI and other large collaborative projects are distributed over different locations, which can complicate working together. A data transfer federation gives researchers in such projects seamless access to a wider range of resources, thereby fostering transparency and collaboration between research centers.

Cross-site collaboration between research centers is becoming increasingly crucial, with researchers accessing storage or computing resources from other institutions. However, these collaborations often require the transfer of large amounts of data between different storage systems for computational or archival purposes. This can be a challenging and time-consuming task. To address this need, the scientific computing and data center SCC at Karlsruhe Institute of Technology designed a data transfer federation as part of its work in the NFDI4Ing base services “repositories & storage” (S-4).

The data transfer federation is a collaboration of file-based storage providers across multiple organizations. It enables seamless access to data with federated identities, as well as automated, scalable data access and data transfers between the file-based storage centers. The federation allows researchers to access and transfer data files between various storage systems using their home organization’s user account, based on their access rights to the resources in a collaborative project. The storage systems in this federation are either dedicated large-scale systems or systems associated with High-Performance Computing.

At the SCC, we have integrated the Large Scale Data Facility: Online Storage (LSDF OS) with the WebDAV protocol and OAuth2 authentication to give third-party applications access to the storage service. Beyond improved support for programmatic access using OAuth2, the WebDAV server enables users to inspect the LSDF OS filesystem via their web browser. To enable the movement of file data, we have deployed an on-premises instance of the File Transfer Service (FTS) and integrated it with our identity provider.
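
Programmatic access of this kind can be sketched as a WebDAV directory listing authorized with an OAuth2 bearer token. The following Python example only builds the request. The collection URL and the token are placeholders supplied by the caller, and the function name `build_listing_request` is hypothetical; how a token is obtained depends on the identity provider and is not shown.

```python
import urllib.request

# Minimal PROPFIND body asking for the name and size of each entry.
PROPFIND_BODY = b"""<?xml version="1.0" encoding="utf-8"?>
<d:propfind xmlns:d="DAV:">
  <d:prop><d:displayname/><d:getcontentlength/></d:prop>
</d:propfind>"""


def build_listing_request(collection_url: str,
                          access_token: str) -> urllib.request.Request:
    """Build (but do not send) a WebDAV PROPFIND request for one directory."""
    return urllib.request.Request(
        collection_url,
        data=PROPFIND_BODY,
        method="PROPFIND",
        headers={
            # OAuth2 access token presented as a bearer token.
            "Authorization": f"Bearer {access_token}",
            # Depth 1 lists the collection and its direct children only.
            "Depth": "1",
            "Content-Type": "application/xml; charset=utf-8",
        },
    )
```

Sending the request with `urllib.request.urlopen` would return a multistatus XML document describing the entries, provided the server accepts the token.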

FTS is a low-level data management service that orchestrates the reliable bulk transfer of files from one storage endpoint to another. It is open-source software developed at CERN that distributes most of the Large Hadron Collider data across the Worldwide LHC Computing Grid infrastructure. However, FTS is currently designed to be used with a single identity provider, whereas a federation involves multiple identity providers. To address this limitation, we designed a central identity provider that issues tokens which are recognized by the other identity providers in the federation. These downstream identity providers then handle access to the storage systems at their local institution. To fully achieve such a federation, a unified NFDI-wide approach to the information included in these tokens is necessary.
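
A bulk transfer can be thought of as a list of source and destination pairs plus a few job-wide parameters. The Python sketch below assembles such a job description as a plain dictionary. The field names (`files`, `sources`, `destinations`, `params`, `overwrite`) are assumptions modelled on the job format of the FTS REST interface and should be checked against the FTS documentation; the URLs in the usage note are placeholders.

```python
import json
from typing import Any, Dict, List, Tuple


def build_transfer_job(pairs: List[Tuple[str, str]],
                       overwrite: bool = False) -> Dict[str, Any]:
    """Describe a bulk transfer as one entry per (source, destination) pair.

    Field names are assumptions for illustration, not a verified schema.
    """
    files = [{"sources": [src], "destinations": [dst]} for src, dst in pairs]
    return {"files": files, "params": {"overwrite": overwrite}}


def to_json(job: Dict[str, Any]) -> str:
    """Serialize the job description for submission over HTTP."""
    return json.dumps(job, indent=2)
```

For example, `build_transfer_job([("https://storage-a.example.org/data/run1.h5", "https://storage-b.example.org/archive/run1.h5")])` describes a single file transfer between two WebDAV endpoints.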

Data Collections Explorer

The NFDI4Ing Data Collections Explorer is an information system for the engineering sciences allowing scientists to share and discover repositories and data sets. In addition, it provides a quick overview of the most important facts about services and data sets, such as access rights or usage restrictions.

Currently, the Data Collections Explorer is a human-centered service with a couple of limitations, mainly the lack of one-to-many mappings and of API access. To overcome these limitations, we are working on a new, graph-based implementation (cf. Figure 1). This encompasses a new RDF-based data model and an import of the current database. The new graph-based version allows for more flexibility, and machine accessibility comes naturally via SPARQL. In addition, it enables smoother integration with existing efforts in NFDI4Ing and beyond.
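
To show why a graph model handles one-to-many mappings naturally, here is a small Python sketch that stores statements as subject-predicate-object triples and matches them against a pattern. All identifiers and predicates (`ex:collectionA`, `ex:hostedBy`, and so on) are hypothetical and do not reflect the Explorer's actual data model. The SPARQL string expresses the same lookup as it might be written against an RDF store.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical example data: one collection hosted by two repositories.
TRIPLES: Set[Triple] = {
    ("ex:collectionA", "ex:hostedBy", "ex:repository1"),
    ("ex:collectionA", "ex:hostedBy", "ex:repository2"),
    ("ex:collectionA", "ex:license", "CC BY 4.0"),
    ("ex:collectionB", "ex:hostedBy", "ex:repository1"),
}

# The same question phrased in SPARQL (prefix and terms are placeholders).
SPARQL_EQUIVALENT = """
PREFIX ex: <https://example.org/>
SELECT ?repo WHERE { ex:collectionA ex:hostedBy ?repo }
"""


def match(triples: Set[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def repositories_of(collection: str) -> List[str]:
    """List every repository hosting `collection` (a one-to-many lookup)."""
    return [o for _, _, o in match(TRIPLES, s=collection, p="ex:hostedBy")]
```

Adding a further repository for a collection means adding one more triple; no schema change is needed, which is the flexibility a table with one column per relation lacks.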

Fig. 1: All relations between all data entries of the new graph-based Data Collections Explorer. Philipp Ost (KIT-SCC), CC BY-SA 3.0 DE

We are currently working on making a preview of this graph-based version available to users. Future plans include a new user interface and the integration of vocabularies.

Philipp Ost
Mozhdeh Farhadi
Petar Hristov
Serge Sushkov
Marcel Nellesen
Andreas Petzold

Tags

NFDI4Ing services may be relevant to different users according to varying requirements. To support filtering or sorting, we added a tag system outlining which archetype, phase of the data lifecycle, or degree of maturity a service corresponds to. By clicking on one of the tags below, you can get an overview of all services aligned with each tag.

The tags correspond to:
The Archetypes: Services relevant to Alex – Bespoke Experiments, Betty – Research Software Engineering, Caden – Provenance Tracking, Doris – High Performance Computing, Ellen – Complex Systems, Fiona – Data Re-Use and Enrichment

The data lifecycle: Services related to Informing & Planning, Organising & Processing, Describing & Documenting, Storing & Computing, Finding & Re-Using, Learning & Teaching

The maturity of the service: Services sorted according to their maturity and the status of their integration into the larger NFDI service landscape. For this we use the Integration Readiness Level (IRL), ranging from IRL0 (no specifications, strictly internal use) up to IRL4 (fully integrated in the German research data landscape and the EOSC).