MLOps Reading Group

Valdimar Eggertsson, Adam Becker & Arthur Coleman · Feb 27th, 2026
Advancing Open-source World Models // MLOps Reading Group // February 2026
We present LingBot-World, an open-source world simulator built on video generation. Positioned as a top-tier world model, LingBot-World offers the following features. (1) It maintains high fidelity and robust dynamics across a broad spectrum of environments, from realistic scenes to scientific contexts, cartoon styles, and beyond. (2) It sustains a minute-level horizon while preserving contextual consistency over time, also known as "long-term memory". (3) It supports real-time interactivity, achieving a latency of under 1 second while producing 16 frames per second. We provide public access to the code and model in an effort to narrow the divide between open-source and closed-source technologies. We believe our release will empower the community with practical applications across areas like content creation, gaming, and robot learning.
# Open Source World Models
# LingBot World



Valdimar Eggertsson, Lucas Pavanelli, Rohan Prasad & 2 more speakers · Jan 26th, 2026
AI agents aren’t just “helping” devs write code anymore; they’re starting to run the workflow. This panel pokes at the uncomfortable question: are engineers still in control, or just supervising very confident machines that are slowly reshaping how we think, design, and build software?
# AI Agents
# Coding Agents
# LLMs



Jon Saad-Falcon, Jimin (Anna) Yoon & Arthur Coleman · Dec 1st, 2025
Language models are getting better at reasoning, but their ability to verify their own outputs still lags behind. This paper tackles that challenge head-on by introducing Weaver, a framework that combines multiple weak verifiers into a single, stronger verifier without relying heavily on labeled data.
Weaver uses weak supervision to estimate verifier reliability, normalize inconsistent outputs, and filter low-quality signals, producing a unified score that better reflects true response quality. In practice, this approach significantly boosts performance on reasoning and math tasks, rivaling models several times larger: for example, it achieves o3-mini-level accuracy using only Llama 3.3 70B as the generator.
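The idea of weighting noisy verifiers by estimated reliability and filtering near-random ones can be sketched in a few lines. This is a minimal illustrative toy, not Weaver's actual implementation; the function name, the accuracy threshold, and the log-odds weighting are assumptions borrowed from classic weak-supervision aggregation.

```python
import math

def combine_verifiers(votes, accuracies, min_accuracy=0.55):
    """Combine binary accept/reject votes from weak verifiers.

    votes: list of 0/1 votes, one per verifier.
    accuracies: estimated accuracy of each verifier (weak supervision
    would estimate these without labels; here they are given).
    Verifiers at or below min_accuracy are filtered as low-quality.
    """
    score = 0.0
    for vote, acc in zip(votes, accuracies):
        if acc <= min_accuracy:
            continue  # drop near-random signals
        weight = math.log(acc / (1 - acc))  # log-odds weight per verifier
        score += weight if vote == 1 else -weight
    # squash into a unified probability-like quality score
    return 1 / (1 + math.exp(-score))

# Usage: two reliable verifiers accept, one near-random verifier rejects
print(combine_verifiers([1, 1, 0], [0.7, 0.8, 0.52]))
```

The near-random verifier (0.52 accuracy) is filtered out, so the two reliable accepts dominate and the combined score lands well above 0.5.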
# LLM Verification
# Weak Verifiers
# RAG Systems



Sophia Skowronski, David DeStefano, Valdimar Eggertsson & 1 more speaker · Oct 31st, 2025
As AI agents become more capable, their real-world performance increasingly depends on how well they can coordinate tools.
This month's paper introduces a benchmark designed to rigorously test how AI agents handle multi-step tasks using the Model Context Protocol (MCP), the emerging standard for tool integration.
The authors present 101 carefully curated real-world queries, refined through iterative LLM rewriting and human review, that challenge models to coordinate multiple tools such as web search, file operations, mathematical reasoning, and data analysis.
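The kind of multi-step coordination these queries demand can be sketched as a loop that threads one tool's output into the next. This is a hypothetical illustration only; the tool names and the fixed-plan `run_agent` helper are made up, not part of the benchmark or of MCP itself.

```python
def run_agent(query, tools, plan):
    """Execute a fixed plan of tool calls, feeding each result forward."""
    result = query
    for tool_name in plan:
        result = tools[tool_name](result)  # each step consumes the last output
    return result

# Stub tools standing in for real MCP servers (web search, math, etc.)
tools = {
    "web_search": lambda q: f"docs for: {q}",
    "math": lambda d: f"computed stats over ({d})",
}

print(run_agent("GDP growth 2024", tools, ["web_search", "math"]))
```

Real agents must also decide the plan itself; the benchmark's hard part is exactly that models pick and order the right tools, which this fixed-plan sketch deliberately leaves out.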
# MCP
# AI Agents
# LLM Judge



Adam Becker, Matt Squire, Rohan Prasad & 1 more speaker · Sep 17th, 2025
LLM performance isn’t just about the model; it’s about the scaffolding we build around it. “Context Engineering” reframes the conversation: prompt design is the toy problem, while the real frontier is systematically engineering the information environments that shape model behavior. Surveying 1,400+ papers, this work defines the field’s taxonomy (retrieval, generation, processing, management) and shows how it powers RAG, memory, tool use, and multi-agent systems. The survey also reveals a paradox: LLMs can absorb increasingly complex contexts but remain clumsy at producing equally complex outputs. This tension signals a coming split between research obsessed with cramming more into context windows and the harder question of whether models can ever match the sophistication of what they’re given.
# Context Engineering
# LLMs
# Prompt Engineering



Kelly Hong, Adam Becker, Matt Squire & 2 more speakers · Sep 1st, 2025
When Bigger Isn’t Always Better: How Context Length Can Break Your LLM
Longer context windows are the new bragging rights in LLMs, now stretching into millions of tokens. But can models really handle the first and the 10,000th token equally well?
# Context Windows
# LLMs
# Prompt Engineering



Sonam Gupta, Adam Becker, Nehil Jain & 1 more speaker · Sep 1st, 2025
This paper challenges the LLM-dominant narrative and makes the case that small language models (SLMs) are not only sufficient for many agentic AI tasks; they’re often better.
🧠 As agentic AI systems become more common, handling repetitive, task-specific operations, giant models may be overkill. The authors argue that:
SLMs are faster, cheaper, and easier to deploy
Most agentic tasks don't require broad general intelligence
SLMs can be specialized and scaled with greater control
Heterogeneous agents (using both LLMs and SLMs) offer the best of both worlds
They even propose an LLM-to-SLM conversion framework, paving the way for more efficient agent design.
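The heterogeneous-agent argument reduces to a routing decision: send known, repetitive skills to a cheap specialized SLM and fall back to a general LLM otherwise. The sketch below is hypothetical (the `route` function and skill names are not from the paper), just a minimal illustration of the design.

```python
def route(task_type, slm_skills, call_slm, call_llm):
    """Dispatch a task to a specialized SLM when it matches a known skill,
    otherwise fall back to a general-purpose LLM."""
    if task_type in slm_skills:
        return call_slm(task_type)   # fast, cheap, specialized path
    return call_llm(task_type)       # broad general-intelligence path

# Usage with stub model calls standing in for real inference endpoints
slm = lambda t: f"SLM handled {t}"
llm = lambda t: f"LLM handled {t}"
skills = {"extract_json", "classify_intent"}

print(route("extract_json", skills, slm, llm))
print(route("open_ended_planning", skills, slm, llm))
```

The paper's LLM-to-SLM conversion framework would, in effect, grow the `slm_skills` set over time as recurring tasks are distilled into specialized small models.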
# Small Language Models
# Agentic AI
# LLMs



Sophia Skowronski, Adam Becker & Valdimar Eggertsson · Apr 9th, 2025
We break down key insights from the paper, discuss what these findings mean for AI’s role in the workforce, and debate their broader implications. As always, our expert moderators guide the session, followed by an open, lively discussion where you can share your thoughts, ask questions, and challenge ideas with fellow MLOps enthusiasts.
# Generative AI
# Claude
# Hierarchical Taxonomy



Adam Becker, Nehil Jain, Matt Squire & 1 more speaker · Mar 6th, 2025
We dive deep into this groundbreaking paper, break down its key insights, and discuss what makes DeepSeek-R1 so special. Our expert moderators guide the session, followed by a lively round-robin discussion where everyone shares their thoughts, asks questions, and debates the implications with fellow MLOps enthusiasts.
This is the reading group for anyone passionate about MLOps, from seasoned practitioners to the AI-curious. We meet every month on the second Thursday, and trust us: you don’t want to miss this one.
# DeepSeek
# AI
# MLOps



Nehil Jain, Adam Becker, Valdimar Eggertsson & 1 more speaker · Dec 27th, 2024
In the December Reading Group session, we explored A Taxonomy of Agents for Enabling Observability of Foundation Model-Based Agents. Participants discussed the challenges of building agentic AI systems, focusing on four key capabilities: perception, planning, action, and adaptation. The paper highlights issues like lack of controllability, complex inputs and outputs, and the difficulty of monitoring AI systems. Early-stage insights drew on DevOps and MLOps practices and underscored the need for improved tools and evaluation strategies for agent observability. The session fostered a collaborative exchange of ideas and practical solutions.
# AI Agents
# Observability
# AI Systems



Valdimar Eggertsson, Sophia Skowronski, Adam Becker & 1 more speaker · Dec 2nd, 2024
This November Reading Group conversation covers advanced retrieval techniques, strategies like iter-drag and hyper-drag for complex queries, and the impact of larger context windows on model performance. The Reading Group also examines challenges in generalizing these methods.
# Long-Context RAG
# Inference Scaling
# Complex Queries
