MLOps Community
The MLOps Community is where machine learning practitioners come together to define and implement MLOps.
Our global community is the default hub where MLOps practitioners meet other industry professionals, share real-world experience and challenges, learn skills and best practices, and collaborate on projects and employment opportunities. We are the world's largest community dedicated to addressing the unique technical and operational challenges of production machine learning systems.
Events
8:00 PM - 9:00 PM, Feb 6 GMT
Coding Agents Lunch & Learn
5:00 PM - 6:00 PM, Feb 5 GMT
Serving LLMs in Production: Performance, Cost & Scale
4:00 PM - 5:20 PM, Jan 22 GMT
MLOps Reading Group January: Agent Use for Coding in 2025
Content
Video
As AI moves beyond the cloud and simulation, the next frontier is Physical AI: systems that can perceive, understand, and act within real-world environments in real time. In this conversation, Nick Gillian, Co-Founder and CTO of Archetype AI, explores what it actually takes to turn raw sensor and video data into reliable, deployable intelligence.
Drawing on his experience building Google’s Soli and Jacquard, and now leading development of Newton, a foundational model for Physical AI, Nick discusses how real-time physical understanding changes what’s possible across safety monitoring, infrastructure, and human–machine interaction. He shares lessons learned from translating advanced research into products that operate safely in dynamic environments, and explains why many organizations underestimate the challenges and opportunities of AI in the physical world.
Feb 6th, 2026 | Views 15
Blog
This blog explains a systematic way to fix CUDA out-of-memory (OOM) errors during GRPO reinforcement learning training, instead of randomly lowering hyperparameters until something works.
Subham argues that most GPU memory issues come from three sources: vLLM reserving GPU memory upfront (often the biggest chunk), training activations (which scale with batch size, sequence length, number of generations, and model size), and model memory (usually the smallest contributor). By carefully reading the OOM error message and estimating how memory is distributed across these components, you can identify exactly what’s causing the crash.
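To make the idea concrete, a back-of-the-envelope estimate along these lines might look like the sketch below; the default values and the activation formula are illustrative assumptions, not the exact accounting from the post.

```python
# Rough, illustrative estimate of where GPU memory goes during GRPO training.
# The constants and formulas below are simplified assumptions (no optimizer
# state, KV cache, or framework overhead), not the blog post's exact numbers.

def estimate_gpu_memory_gb(
    total_gpu_gb: float = 80.0,                 # assumed 80 GB card
    vllm_gpu_memory_utilization: float = 0.5,   # fraction vLLM reserves up front
    num_params_b: float = 7.0,                  # model size in billions of params
    bytes_per_param: int = 2,                   # bf16 weights
    batch_size: int = 4,
    num_generations: int = 8,                   # GRPO completions per prompt
    seq_len: int = 1024,
    hidden_size: int = 4096,
    num_layers: int = 32,
) -> dict:
    """Return a rough per-component memory breakdown in GB."""
    gib = 1024 ** 3

    # 1) vLLM reserves a fixed fraction of the GPU as soon as it starts.
    vllm_gb = vllm_gpu_memory_utilization * total_gpu_gb

    # 2) Activations scale with batch size * generations * sequence length.
    tokens = batch_size * num_generations * seq_len
    activations_gb = tokens * hidden_size * num_layers * bytes_per_param * 2 / gib

    # 3) Model weights (gradients and optimizer state would add more).
    model_gb = num_params_b * 1e9 * bytes_per_param / gib

    total = vllm_gb + activations_gb + model_gb
    return {
        "vllm_reserved_gb": round(vllm_gb, 1),
        "activations_gb": round(activations_gb, 1),
        "model_weights_gb": round(model_gb, 1),
        "estimated_total_gb": round(total, 1),
        "fits_on_gpu": total < total_gpu_gb,
    }

print(estimate_gpu_memory_gb())
```

Comparing a breakdown like this against the numbers reported in the OOM error message shows which component to shrink first.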
The recommended approach is to calculate memory usage first, then adjust the highest-impact settings, such as GPU memory allocation for vLLM, number of generations, batch size, and sequence length. The guide also shows how to maintain training quality by using techniques like gradient accumulation instead of simply shrinking everything.
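On the gradient-accumulation point, a generic PyTorch loop (a minimal sketch, not the post's actual GRPO training setup) shows how a smaller per-step batch can preserve the same effective batch size:

```python
import torch
import torch.nn as nn

# Illustrative gradient accumulation: hold only one small micro-batch of
# activations in memory at a time while keeping the effective batch size at
# micro_batch * accumulation_steps. The model and data here are stand-ins.
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

micro_batch, accumulation_steps = 2, 4          # effective batch size = 8

optimizer.zero_grad()
for step in range(8):                           # 8 dummy micro-batches
    x, y = torch.randn(micro_batch, 16), torch.randn(micro_batch, 1)
    loss = loss_fn(model(x), y)
    (loss / accumulation_steps).backward()      # scale so accumulated grads average
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                        # one update per 4 micro-batches
        optimizer.zero_grad()
```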
Overall, the key message is: treat OOM debugging as a measurable engineering problem, not trial-and-error, so you can fix memory issues faster while preserving training performance.
Feb 4th, 2026 | Views 462
Video
Hundreds of neocloud operators and "AI Factory" builders have emerged to serve the insatiable demand for AI infrastructure. These teams are compressing the design, build, deploy, operate, and scale cycle of their infrastructure down to months, while managing massive footprints with lean teams. How? By applying modern intent-driven infrastructure automation principles to greenfield deployments. We'll explore how these teams carry design intent through to production, and how operating and automating around consistent infrastructure data is compressing "time to first train".
Feb 3rd, 2026 | Views 548

