Strategies for Managing and Querying Distributed Simulation Data for AI

Tutorial, ISC High Performance 2026, Hamburg, Germany

This tutorial covers the core concepts and practical tools for organizing, labeling, and accessing the scientific datasets used to develop AI models. A major challenge in AI training is managing the large volumes of simulation and experimental data spread across many files and facilities. The session focuses on strategies for making these datasets easier to find, search, and prepare for use in AI training workflows.

We introduce PULSE (Platform for Unified data Lifecycle and Scalable Execution), a software framework that lets a workflow or a group of scientists manage many related datasets, stored in multiple files across multiple facilities, through a concept called Campaign Archives (CA). These archives hold metadata and statistics about the datasets, so content can be discovered and organized regardless of where the original data physically resides. Integrating the archives into AI workflows lets scientists prepare large ensembles of simulations for training, with data search, query, and remote access tuned for machine learning tasks.


Presenters:

  • Ana Gainaru, Oak Ridge National Laboratory
  • Scott Klasky, Oak Ridge National Laboratory
  • Norbert Podhorszki, Oak Ridge National Laboratory