Neural operators have recently emerged as a powerful machine-learning framework for learning infinite-dimensional mappings in physics-based systems described by partial differential equations. However, their application to realistic high-dimensional problems often encounters computational challenges. Existing approaches either down-sample the input-output spaces, sacrificing accuracy by overlooking intricate details, or rely on latent-space learning, which is unsuitable for non-dissipative systems. Alternatively, researchers have opted for smaller mini-batch training to accommodate high-dimensional problems in the operator learning framework, which sacrifices the generalization ability of the network. To address these limitations, we propose a scalable, data-parallel, multi-node, multi-GPU training framework for neural operators on high-dimensional systems.
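The abstract does not specify the software stack, so the following is only a minimal sketch of data-parallel neural operator training, assuming PyTorch's DistributedDataParallel with one process per GPU; the model constructor, dataset shapes, and hyperparameters are hypothetical placeholders, not the authors' configuration.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

# Hypothetical sketch: launch with `torchrun --nproc_per_node=<gpus> train.py`.
def train(model_fn, dataset, epochs=10, lr=1e-3, batch_per_rank=1024):
    dist.init_process_group(backend="nccl")          # one process per GPU
    rank = dist.get_rank()
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

    model = DDP(model_fn().to(device), device_ids=[device.index])
    sampler = DistributedSampler(dataset)            # shards data across ranks
    loader = DataLoader(dataset, batch_size=batch_per_rank, sampler=sampler)
    opt = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)                     # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            loss = torch.nn.functional.mse_loss(model(x), y)
            opt.zero_grad()
            loss.backward()                          # gradients all-reduced across ranks here
            opt.step()

    dist.destroy_process_group()
```

In this data-parallel pattern, the effective global batch size is the per-rank batch size times the number of devices, which is how multi-GPU systems enable the large-batch training discussed below.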
To demonstrate the capabilities of the proposed framework, we study the subsurface multiphase flow modeling problem using the deep operator network (DeepONet), a popular neural operator architecture. The task is to learn a neural operator that maps a spatially varying permeability field to the corresponding pressure field. The input and output dimensionalities are of the order of 10^4 and 10^7, respectively, posing significant computational challenges when training on a single GPU. Our implementation demonstrates promising results: for a batch size of 8,200, speedups of 1.93x and 3.88x were achieved using two and four devices, respectively; for a larger batch size of 131,000, these speedups improved to 2.09x and 4.47x. Beyond computational gains, the large batch sizes enabled by multi-GPU systems significantly enhance generalization performance, a critical requirement for robust machine learning models of physical systems.
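As background, a DeepONet (Lu et al.) approximates an operator as G(u)(y) ≈ sum_{k=1}^p b_k(u) t_k(y), where a branch network b encodes the input function sampled at fixed sensor locations and a trunk network t encodes the query coordinate. The sketch below illustrates this structure for the permeability-to-pressure mapping described above; the class name, layer widths, sensor count, and coordinate dimension are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    """Minimal DeepONet sketch: the branch net encodes the input function
    (e.g., permeability sampled at m sensor points), the trunk net encodes a
    query coordinate, and their dot product gives the output (e.g., pressure)."""
    def __init__(self, m_sensors=10_000, coord_dim=3, p=128):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Linear(m_sensors, 512), nn.Tanh(), nn.Linear(512, p))
        self.trunk = nn.Sequential(
            nn.Linear(coord_dim, 512), nn.Tanh(), nn.Linear(512, p))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, u, y):
        # u: (batch, m_sensors) sampled permeability fields
        # y: (n_points, coord_dim) output query locations
        b = self.branch(u)                  # (batch, p) branch coefficients
        t = self.trunk(y)                   # (n_points, p) trunk basis values
        return b @ t.T + self.bias          # (batch, n_points) predicted pressures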