The emergence of data-driven approaches for developing interatomic potentials promises to transform materials design and synthesis. Machine learning interatomic potentials (MLIPs) build on recent advances in ML to accurately model the potential energy surface of a material system, inferring its form from a large number of atomic configurations and their properties obtained from first-principles (FP) calculations. The success of MLIP development hinges on (1) access to ample, curated FP data, (2) the availability of fitting frameworks that support rapid prototyping, (3) the ability to exchange MLIP implementations and training sets, and (4) the ability to deploy these tools at scale within molecular simulation packages. Currently, these components are at best sparsely integrated. The ColabFit project seeks to provide a framework that integrates them.
In this talk, I will provide an overview of ColabFit, which consists of an online platform, the “ColabFit Exchange,” for sharing, analyzing, and discovering first-principles and experimental data, together with associated tools for data wrangling and MLIP development and deployment. The ColabFit Exchange (https://colabfit.org/) is the largest curated database of first-principles and experimental data spanning the periodic table, optimized for training ML models and MLIPs through the adoption of a new, efficient standard for storing heterogeneous datasets. The associated ColabFit tools include a significant expansion of OpenKIM’s MLIP development package, KLIFF, enabling rapid prototyping and on-the-fly use of ColabFit datasets in the fitting process, as well as seamless integration with a dozen major simulation codes that support the KIM API, such as LAMMPS and ASE. This expansion includes support for distributed-memory and GPU parallelism, support for state-of-the-art compiler-based automatic differentiation of descriptors, offering near analytical derivat