Efficient Digital Twin Training using Uncertainty-Guided Data Generation

Invited Talk, SIAM CSE25, Fort Worth, Texas

I had the pleasure of giving an invited talk at SIAM CSE25 in the minisymposium Neural Acceleration, Surrogate Models, and Learning Techniques for HPC Kernels. The minisymposium offered an introduction to current approaches for approximating expensive kernels with machine-learning-based heuristics, and to the trade-offs encountered when these are used in HPC environments.

The talks covered improving performance, fault tolerance, and scalability for sparse linear algebra, graph partitioning, and other areas important to high-performance scientific computing. My presentation focused on the challenges of training digital twins for large-scale scientific applications that generate petabytes of data at every simulation step. The slides describe the limitations and cost of training under different solutions, along with advice on how to improve the performance and accuracy of training through uncertainty-guided data generation.

Abstract: Machine learning proxy models offer a powerful approach to accelerating and even replacing computationally expensive models. However, constructing these digital twins presents a unique challenge in efficiently generating training data. A naive uniform sampling of the input space can lead to a non-uniform sampling of the output space, resulting in gaps in training data coverage and potentially compromising accuracy. While massive datasets could eventually fill these gaps, the computational burden of full-scale simulations can make this impractical. In this talk, we introduce a framework for adaptive data generation that leverages uncertainty estimation to identify regions requiring additional training data and re-triggers simulations to fill the identified gaps. Essentially, this approach iteratively steers large-scale simulations towards generating the data needed for training, and thus reduces the amount of data required to train accurate digital twins. We will demonstrate the challenges of training at scale and the effectiveness of such methods on both a simple one-dimensional function and a complex multidimensional physics model.
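To make the idea concrete, here is a minimal sketch of such an uncertainty-guided loop on a one-dimensional function. This is illustrative Python of my own, not the framework from the talk or paper: a bootstrap ensemble of cheap polynomial surrogates stands in for the digital twin, ensemble disagreement serves as the uncertainty estimate, and the expensive simulation (here just the closed-form function simulate, a hypothetical stand-in) is re-triggered only at the inputs where disagreement is highest.

```python
import numpy as np

# Hypothetical stand-in for an expensive simulation: a 1-D test function.
def simulate(x):
    return np.sin(3 * x) * np.exp(-0.3 * x**2)

rng = np.random.default_rng(0)

# Start from a small uniform sample of the input space.
X = rng.uniform(-3, 3, size=12)
y = simulate(X)

candidates = np.linspace(-3, 3, 400)  # pool of possible new inputs

for round_ in range(5):
    # Bootstrap ensemble of cheap surrogates (polynomial fits here);
    # disagreement across members serves as the uncertainty estimate.
    preds = []
    for _ in range(20):
        idx = rng.integers(0, len(X), size=len(X))
        coeffs = np.polyfit(X[idx], y[idx], deg=5)
        preds.append(np.polyval(coeffs, candidates))
    std = np.std(preds, axis=0)

    # Re-trigger the "simulation" only at the most uncertain inputs,
    # filling the identified gaps in training data coverage.
    new_x = candidates[np.argsort(std)[-4:]]
    X = np.concatenate([X, new_x])
    y = np.concatenate([y, simulate(new_x)])
    print(f"round {round_}: max ensemble std = {std.max():.3f}, n = {len(X)}")
```

In the actual setting, simulate would be a full-scale simulation launched on an HPC system and the surrogate would be the digital twin under training; the ensemble-variance criterion above is just one of several possible uncertainty estimates.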

Link to the event: https://meetings.siam.org/sess/dsp_programsess.cfm?SESSIONCODE=82428
Access my slides here

Related paper: Adaptive Generation of Training Data for ML Reduced Model Creation
Mark Cianciosa, Richard Archibald, Wael Elwasif, Ana Gainaru, Jin Myung Park, Ross Whitfield
2022 IEEE International Conference on Big Data (Big Data)
Paper PDF: from OSTI.gov