Strategies for querying large-scale scientific data

Date: June 26, 2024

Invited Talk, 17th Scheduling for large-scale systems workshop, Aussois, France

My presentation focuses on strategies for efficiently querying large scientific datasets, in particular for quantities of interest and derived data. Although the talk is not directly about scheduling, these strategies generate analysis tasks and data transformations that can strain a study's resources, especially at scale. Realizing the full potential of these data management and query techniques, and completing analyses on time, will require custom scheduling solutions.

Abstract: Scientific data analysis often involves complex queries across distributed datasets. These queries manipulate multiple primary variables and generate derived data that must be handled efficiently, which is a challenge for applications that need to parse many large datasets. In this talk, we investigate the performance of different approaches in which applications define derived variables as quantities of interest (QoIs) and offload the computation and transfer of these QoIs to the I/O library. We present a detailed analysis of the performance-storage trade-offs of the different solutions and showcase results on two large-scale datasets produced by climate and combustion simulations.
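To make the performance-storage trade-off concrete, here is a minimal sketch in Python with NumPy. It is not the API of any real I/O library and not the implementation from the talk: `ToyStore`, `define_derived`, and the velocity-magnitude QoI are hypothetical names chosen for illustration. The sketch contrasts storing a derived variable when it is defined (more bytes stored, cheap reads) with keeping only its expression and computing it on each read (no extra storage, recomputation cost).

```python
import numpy as np


def velocity_magnitude(vx, vy, vz):
    """Hypothetical QoI: pointwise magnitude of a 3-component velocity field."""
    return np.sqrt(vx**2 + vy**2 + vz**2)


class ToyStore:
    """In-memory stand-in for an I/O library (illustrative only, not a real API)."""

    def __init__(self):
        self.primary = {}         # primary variables written by the application
        self.stored_derived = {}  # derived variables materialized at definition time
        self.derived_exprs = {}   # derived variables kept only as (function, inputs)

    def write(self, name, data):
        self.primary[name] = np.asarray(data, dtype=np.float64)

    def define_derived(self, name, func, inputs, store=False):
        """Register a derived variable computed from already written primaries.

        store=True  : compute now and keep the result (costs storage).
        store=False : keep only the expression and compute on read (costs time).
        """
        if store:
            args = [self.primary[i] for i in inputs]
            self.stored_derived[name] = func(*args)
        else:
            self.derived_exprs[name] = (func, tuple(inputs))

    def read(self, name):
        if name in self.primary:
            return self.primary[name]
        if name in self.stored_derived:
            return self.stored_derived[name]
        func, inputs = self.derived_exprs[name]
        return func(*(self.primary[i] for i in inputs))

    def stored_bytes(self):
        """Bytes held by the store; expression-only variables add nothing."""
        arrays = list(self.primary.values()) + list(self.stored_derived.values())
        return sum(a.nbytes for a in arrays)
```

In this toy setting both strategies return the same values from `read`; they differ only in `stored_bytes()` and in when the computation happens, which is the trade-off the abstract refers to.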

Link to the event: https://graal.ens-lyon.fr/~abenoit/aussois24

Related paper:
To Derive or Not to Derive: I/O Libraries Take Charge of Derived Quantities Computation
A. Gainaru, N. Podhorszki, L. Dulac, Q. Gong, S. Klasky, G. Eisenhauer, A. Kougkas, X. Sun, J. Lofstead
2024 IEEE 36th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp. 105-115, 2024. DOI: 10.1109/SBAC-PAD63648.2024.00030