The MLOps Behind Recursion’s Foundation Model Phenom-1
Decoding MLOps: How Recursion Built Phenom-1 to Revolutionize Drug Discovery through Machine Learning and Automation Technologies
June 4, 2024

Who Is Recursion?
Recursion is a clinical-stage TechBio company leading the space by decoding biology to industrialize drug discovery with tools like machine learning and automation technologies. At its core, Recursion uses a variety of machine learning approaches on in-house-generated, fit-for-purpose data to tackle the many challenges of drug discovery. Our paramount ML use case, phenomics, uses deep learning models to embed images of genetically or chemically perturbed cells into high-dimensional feature spaces and then compares those embeddings to one another to infer gene-gene, compound-compound, or gene-compound biological relationships, known as our maps of biology. Over the last decade, Recursion has been curating datasets using highly scalable, automated laboratory workflows that enable execution of up to 2.2 million experiments per week for up to 50 weeks per year.
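As a simplified illustration of that comparison step (not Recursion's production pipeline), relationships can be inferred by measuring the similarity of aggregated perturbation embeddings, for example with cosine similarity; the embedding dimension and the random vectors below are purely hypothetical:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two perturbation embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical aggregate embeddings for two genetic perturbations,
# e.g. the mean of the per-well image embeddings for each perturbation.
rng = np.random.default_rng(0)
gene_a = rng.standard_normal(128)
gene_b = rng.standard_normal(128)

# A high similarity suggests a putative gene-gene relationship in the map.
print(f"similarity(gene_a, gene_b) = {cosine_similarity(gene_a, gene_b):.3f}")
```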
Building A Phenomics Foundation Model
Why Did We Need It?
Recursion decided to pursue the development of a foundation model because of its potential to unlock existing constraints around data and training. Previously, our best deep learning phenomics models, which used a DenseNet architecture, required training separate weakly-supervised models for each cell type and imaging modality that we work with. Furthermore, these models could only be trained on images of cells genetically perturbed with siRNA technology and not with CRISPR technology, both of which have been used extensively at Recursion. While CRISPR is superior in its ability to knock out genes, it is also subject to a phenomenon known as proximity bias, where CRISPR-Cas9-induced double-strand breaks sometimes result in chromosome arm truncations; the resulting phenotypes no longer reflect a pure gene knockout, which degraded model performance when such images were included in training. This, along with other challenges such as curating large quantities of high-quality labels, severely limited the scale at which we could train our phenomics models. Foundation models offered the promise of training a single model on all our data that could generalize well to different cell types and imaging modalities, as well as overcome the limitations of CRISPR-based genetic perturbations.
How Did We Do It?
Model Architecture
Early on, the model developers on the team experimented with a variety of model architectures, eventually landing on a Vision Transformer (ViT) backbone for self-supervised pre-training as a masked autoencoder (MAE). The overarching objective was to reconstruct randomly masked patches of an image using the remaining unmasked patches as input. In this way, the model learned representations of cellular morphology. The team experimented with patch size and mask ratio to yield optimal performance.
Figure 1: Visualizing reconstructions from masked images for different MAEs.
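For readers unfamiliar with the technique, here is a minimal sketch of the MAE objective described above, using a toy ViT-style encoder and decoder; the patch size, mask ratio, and model dimensions are illustrative defaults, not Phenom-1's actual settings:

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Toy masked autoencoder: reconstruct masked patches from the visible ones."""

    def __init__(self, img_size=256, patch_size=16, in_chans=6,
                 embed_dim=192, mask_ratio=0.75):
        super().__init__()
        self.patch_size, self.mask_ratio = patch_size, mask_ratio
        num_patches = (img_size // patch_size) ** 2
        patch_dim = in_chans * patch_size ** 2
        self.patch_embed = nn.Linear(patch_dim, embed_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = lambda: nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.decoder = nn.TransformerEncoder(layer(), num_layers=1)
        self.head = nn.Linear(embed_dim, patch_dim)

    def patchify(self, imgs):
        b, c, h, w = imgs.shape
        p = self.patch_size
        x = imgs.reshape(b, c, h // p, p, w // p, p)
        return x.permute(0, 2, 4, 1, 3, 5).reshape(b, -1, c * p * p)

    def forward(self, imgs):
        patches = self.patchify(imgs)                      # (B, N, patch_dim)
        b, n, _ = patches.shape
        num_keep = int(n * (1 - self.mask_ratio))
        # Randomly choose which patches stay visible to the encoder.
        ids = torch.rand(b, n, device=imgs.device).argsort(dim=1)
        keep, masked = ids[:, :num_keep], ids[:, num_keep:]
        expand = lambda idx, d: idx.unsqueeze(-1).expand(-1, -1, d)
        tokens = self.patch_embed(patches) + self.pos_embed
        visible = torch.gather(tokens, 1, expand(keep, tokens.size(-1)))
        encoded = self.encoder(visible)
        # Decoder sees encoded visible tokens plus positioned mask tokens.
        mask_tokens = self.mask_token.expand(b, n - num_keep, -1) + torch.gather(
            self.pos_embed.expand(b, -1, -1), 1, expand(masked, tokens.size(-1)))
        decoded = self.decoder(torch.cat([encoded, mask_tokens], dim=1))
        pred = self.head(decoded[:, num_keep:])            # predicted masked patches
        target = torch.gather(patches, 1, expand(masked, patches.size(-1)))
        return ((pred - target) ** 2).mean()               # reconstruction loss

loss = TinyMAE()(torch.randn(2, 6, 256, 256))              # 6-channel cell image crops
loss.backward()
```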
Data
Predictive Scaling
We trained early versions of the foundation model on 12 million microscopy images that included RxRx1 and RxRx3 images. We then scaled our datasets to 53 million and then 95 million images, roughly 1000x the size of the original RxRx1 dataset used to train our DenseNet-based models. We also iteratively scaled the size of our ViT. Our state-of-the-art foundation model was a 300-million-parameter ViT trained on the 95-million-image dataset, equivalent to 4 PB of data.
Migrating Data to the Training Site
Most of Recursion’s image data is stored in the cloud on Google Cloud Platform (GCP), but we do the majority of training on-prem using our BioHive-1 supercomputer. To make the foundation model training loop faster, we needed compute and data close to each other to optimize performance and latency. Transferring 4 PB of data over the public internet at 10 Gb/s would typically take about 37 days. To speed up data egress, we used a Data Center Interconnect (DCI), a dedicated link between Google Cloud and our on-prem data center, which increased transfer speeds by nearly 4x. As a result, the data transfer took roughly 11 days.
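A quick back-of-the-envelope check of those transfer times; the 40 Gb/s figure below simply illustrates a "nearly 4x" speedup and is not the actual DCI link specification:

```python
# Rough transfer times for 4 PB at different line rates (1 PB = 1e15 bytes).
bits = 4 * 8e15

for gbps in (10, 40):                       # public internet vs. an assumed ~4x faster DCI
    days = bits / (gbps * 1e9) / 86400
    print(f"{gbps} Gb/s -> about {days:.0f} days")
# 10 Gb/s -> about 37 days; 40 Gb/s -> about 9 days (the actual transfer took roughly 11 days)
```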
Preparing the Data
One of the bottlenecks in machine learning is memory bandwidth. Unlike RGB images, our images have 6 channels; each channel captures a range of frequencies that visualizes a specific cellular structure or function (e.g., nuclei, mitochondria). To help alleviate bandwidth concerns, we transformed our 6-channel images into large N-dimensional arrays stored in the Zarr format, which allowed our GPUs to access the data efficiently. After batch-processing the images into Zarr arrays, they were ready to be fed into the data loaders within our training pipeline.
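A minimal sketch of that conversion step, assuming a small, hypothetical batch of 6-channel images; the chunking scheme and array layout are illustrative, not Recursion's actual schema:

```python
import numpy as np
import zarr

# Hypothetical batch of 6-channel microscopy images (N, C, H, W), uint16 like raw camera output.
images = np.random.randint(0, 2**16, size=(4, 6, 2048, 2048), dtype=np.uint16)

# Write a chunked Zarr array; one-image chunks keep reads aligned with training samples.
arr = zarr.open("images.zarr", mode="w",
                shape=images.shape, chunks=(1, 6, 2048, 2048), dtype=images.dtype)
arr[:] = images

# At training time, a dataset can read single images (or crops) without loading the whole array.
store = zarr.open("images.zarr", mode="r")
crop = store[0, :, 0:256, 0:256]   # one 6-channel 256 x 256 crop
print(crop.shape)                  # (6, 256, 256)
```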
Compute
Training Tech Stack
- Determined AI: Determined was our platform of choice for experiment tracking, resource allocation, checkpointing, logging, fault tolerance and hyperparameter sweeps at the time of this project.
- BioHive-1: In 2021, Recursion built BioHive-1, one of the world’s most performant supercomputers, designed for large-scale model training. Since the publication of our last article on the MLOps Community blog, Recursion has made significant changes to the BioHive-1 supercomputer, including:
- Replacing the original Lustre file system with IBM’s General Parallel File System (GPFS), which is better suited to our use case
- Implementing infrastructure as code with Ansible to prevent configuration drift between nodes
- Replacing Determined’s proprietary resource manager with Slurm, a leading scheduler used by many of the world’s supercomputers
- Purchasing an additional 500 H100s (currently being integrated into the supercomputer)
- PyTorch Lightning + Hydra (training harness and configuration): Back in 2022, we re-wrote many of Recursion’s model training workflows. Early ML team members found success with Determined’s training harness, but newer team members found the framework overly opinionated: it prevented short iteration cycles and made the code inflexible to new model architectures like UNets, ViTs, and MAEs. In response, we adopted PyTorch Lightning and improved training configuration management by replacing a long, complex, and painful-to-maintain configuration script with composable configuration files managed by Hydra (a minimal sketch follows the list below). Adopting PyTorch Lightning and Hydra had the following benefits:
- Fewer bugs in the training cycle
- Easier-to-navigate codebases
- Increased test coverage with shorter build times
- Simplified distributed training: we reduced complexity by moving away from Horovod to TorchElastic; a one-line change enables distributed data parallel training
- Abstracted metrics collection behind a unified API
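Here is the minimal sketch referenced above of how a Hydra-managed config can drive a PyTorch Lightning run; the module, config fields, and stand-in model are hypothetical and not Recursion's actual training harness:

```python
# config.yaml (saved next to this script; in practice split into composable groups):
#   model:
#     lr: 1.0e-4
#   trainer:
#     max_epochs: 1
#     devices: 1
#     strategy: auto    # setting "ddp" (with devices > 1) enables distributed data parallel

import hydra
import pytorch_lightning as pl
import torch
import torch.nn.functional as F
from omegaconf import DictConfig
from torch.utils.data import DataLoader, TensorDataset


class LitAutoencoder(pl.LightningModule):
    """Stand-in LightningModule; the real module wraps the ViT-based MAE."""

    def __init__(self, lr: float):
        super().__init__()
        self.lr = lr
        self.net = torch.nn.Linear(256, 256)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        loss = F.mse_loss(self.net(x), x)   # placeholder reconstruction loss
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)


@hydra.main(config_path=".", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    data = DataLoader(TensorDataset(torch.randn(64, 256)), batch_size=8)
    trainer = pl.Trainer(
        max_epochs=cfg.trainer.max_epochs,
        devices=cfg.trainer.devices,
        strategy=cfg.trainer.strategy,      # the "one-line change" for DDP
    )
    trainer.fit(LitAutoencoder(lr=cfg.model.lr), train_dataloaders=data)


if __name__ == "__main__":
    main()
```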
Training Details
Our foundation model was trained on BioHive-1 in a distributed data parallel setting for up to 100 epochs on 128 80GB-A100 GPUs across DGX nodes (20K GPU hours). Training was done on 3.5 billion image crops sampled from the 95-million-image dataset using PyTorch 2.1 + PyTorch Lightning to leverage bleeding-edge developments such as FlashAttention, allowing us to train transformers at the scale of hundreds of millions of parameters. Each input was a 256 x 256 crop randomly sampled from the original 2048 x 2048, 6-channel image. We were able to demonstrate that the scaling hypothesis, the notion that increasing data and compute yields better performance, held true for our model, giving us confidence that our best performance would come from training the largest model reasonably possible on all our data, given time and resource constraints.
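PyTorch 2.x exposes fused attention kernels, including FlashAttention, through `torch.nn.functional.scaled_dot_product_attention`; the minimal illustration below assumes a CUDA GPU, and the batch size, head count, and sequence length are arbitrary rather than Phenom-1's actual values:

```python
import torch
import torch.nn.functional as F

# (batch, heads, sequence_length, head_dim), e.g. one token per visible image patch.
q = torch.randn(8, 12, 256, 64, device="cuda", dtype=torch.float16)
k = torch.randn(8, 12, 256, 64, device="cuda", dtype=torch.float16)
v = torch.randn(8, 12, 256, 64, device="cuda", dtype=torch.float16)

# Dispatches to a fused kernel (FlashAttention where supported), avoiding
# materializing the full attention matrix and reducing memory traffic.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([8, 12, 256, 64])
```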
Scaling Plot
Figure 2: Lower left: small models trained on public data. Lower right: small models trained on smaller subsets of private data. Upper right: large models trained on private data.
Inference
We ran inference in a Google Kubernetes Engine (GKE) cluster on T4 GPUs. Our inference pipeline is computationally expensive. For example, the genetics-only experiments in RxRx3 yield about 140 million crops to feed forward through the encoder (64 crops per well x 1,380 wells per plate x 9 plates per experiment x 175 experiments) to obtain representations.
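The crop count follows directly from the experiment layout quoted above:

```python
crops_per_well = 64
wells_per_plate = 1380
plates_per_experiment = 9
experiments = 175

total_crops = crops_per_well * wells_per_plate * plates_per_experiment * experiments
print(f"{total_crops:,}")   # 139,104,000 -> roughly 140 million forward passes
```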
Our prior inference pipeline had to be adapted because the new models no longer fit the constraints it was built around. We worked closely with NVIDIA to optimize the model on dedicated hardware: converting our models to TensorRT sped up inference and yielded a 3x performance boost. We also worked closely with GCP to increase the overall throughput of our systems by 3.8x.
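One common route to a TensorRT engine is exporting the PyTorch encoder to ONNX and compiling it with `trtexec`; the sketch below uses a stand-in model and illustrative flags and may not be the exact path Recursion followed:

```python
import torch
import torch.nn as nn

# Stand-in for the trained encoder; the real model is the 6-channel ViT encoder.
model = nn.Sequential(
    nn.Conv2d(6, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 128),
).eval()
example = torch.randn(1, 6, 256, 256)

# Export to ONNX with a dynamic batch dimension so the engine can serve variable batch sizes.
torch.onnx.export(
    model, example, "encoder.onnx",
    input_names=["crops"], output_names=["embeddings"],
    dynamic_axes={"crops": {0: "batch"}, "embeddings": {0: "batch"}},
)

# Then build a TensorRT engine from the ONNX graph, e.g. with FP16 enabled:
#   trtexec --onnx=encoder.onnx --saveEngine=encoder.plan --fp16
```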
Architecture Diagram
People
Culture is a key part of successfully delivering on a high-stakes ML project, and it is hard to get right: people need to interact effectively across roles and functions. The team that built Phenom-1 brought together diverse skills, including machine learning, data engineering, and software engineering, which were applied collaboratively to assemble the datasets, develop the model, and deploy it.
We divided ourselves into workstreams to deliver on project milestones. We were able to move quickly because we had the data, compute, and skill sets to iterate efficiently and there were minimal hand-offs between the phases of the model development workflow. There were few silos to slow communication between workstreams.
Recursion’s high performance computing (HPC) team did a lot of heavy lifting to get BioHive-1 ready for this project. We also relied on data scientists with a deep understanding of how to evaluate the effectiveness of the model.
We integrated the foundation model into the Recursion OS platform to seamlessly produce embeddings from new wet-lab experiments. We also collaborated with the map building teams to generate maps of biology and chemistry using foundation model embeddings inferred from all previously executed experiment images.
Summary
It took a combination of large amounts of proprietary data and compute, the right MLOps tooling, and the right people for Recursion to build our first phenomics foundation model. Recursion’s culture of open communication and minimal silos contributed to fast-paced project delivery that had a massive impact on our ability to better decode biology to radically improve lives.
Appendix
- Experimental batches: A group of experimental plates that are processed together at the same time using identical reagents with the exception of perturbations
- Phenomics: the analysis of billions of images of human cells that have been systematically modified by various genetic or chemical factors, studying the observable traits that result from the expression of a cell’s genes and their interaction with the cell’s environment after modification
- Perturbation: a set of reagents or conditions applied to cells in a well of the experimental plates Recursion uses to run wet-lab experiments
- siRNA: short interfering RNA. Used to regulate or silence specific genes
1 A Cell Painting dataset with 125,510 images of 4 human cell types under 1,108 different siRNA perturbations across 51 experimental batches, released at NeurIPS 2019.
2 A publicly-available proprietary Cell Painting dataset containing over 2.2 million images of HUVEC cells under 17,063 CRISPR knockouts or 1,674 compounds across 180 experimental batches.