MLOps Reading Group

Valdimar Eggertsson, Adam Becker & Arthur Coleman · Feb 27th, 2026
Advancing Open-source World Models // MLOps Reading Group // February 2026
We present LingBot-World, an open-source world simulator built on video generation. Positioned as a top-tier world model, LingBot-World offers the following features. (1) It maintains high fidelity and robust dynamics across a broad spectrum of environments, from realistic scenes to scientific contexts, cartoon styles, and beyond. (2) It sustains a minute-level horizon while preserving contextual consistency over time, also known as "long-term memory". (3) It supports real-time interactivity, achieving a latency of under 1 second while producing 16 frames per second. We provide public access to the code and model in an effort to narrow the divide between open-source and closed-source technologies. We believe our release will empower the community with practical applications across areas like content creation, gaming, and robot learning.
# Open Source World Models
# LingBot World



Valdimar Eggertsson, Lucas Pavanelli, Rohan Prasad & 2 more speakers · Jan 26th, 2026
AI agents aren’t just “helping” devs write code anymore; they’re starting to run the workflow. This panel pokes at the uncomfortable question: are engineers still in control, or just supervising very confident machines that are slowly reshaping how we think, design, and build software?
# AI Agents
# Coding Agents
# LLMs



Jon Saad-Falcon, Jimin (Anna) Yoon & Arthur Coleman · Dec 1st, 2025
Language models are getting better at reasoning, but their ability to verify their own outputs still lags behind. This paper tackles that challenge head-on by introducing Weaver, a framework that combines multiple weak verifiers into a single, stronger verifier without relying heavily on labeled data.
Weaver uses weak supervision to estimate verifier reliability, normalize inconsistent outputs, and filter low-quality signals, producing a unified score that better reflects true response quality. In practice, this approach significantly boosts performance on reasoning and math tasks, rivaling models several times larger: for example, it achieves o3-mini-level accuracy using only Llama 3.3 70B as the generator.
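The idea of weighting noisy verifiers by estimated reliability and filtering near-random ones can be sketched in a few lines. This is a minimal illustrative toy, not Weaver's actual implementation; the function name, the accuracy threshold, and the log-odds weighting are assumptions borrowed from classic weak-supervision aggregation.

```python
import math

def combine_verifiers(votes, accuracies, min_accuracy=0.55):
    """Combine binary accept/reject votes from weak verifiers.

    votes: list of 0/1 votes, one per verifier.
    accuracies: estimated accuracy of each verifier (weak supervision
    would estimate these without labels; here they are given).
    Verifiers at or below min_accuracy are filtered as low-quality.
    """
    score = 0.0
    for vote, acc in zip(votes, accuracies):
        if acc <= min_accuracy:
            continue  # drop near-random signals
        weight = math.log(acc / (1 - acc))  # log-odds weight per verifier
        score += weight if vote == 1 else -weight
    # squash into a unified probability-like quality score
    return 1 / (1 + math.exp(-score))

# Usage: two reliable verifiers accept, one near-random verifier rejects
print(combine_verifiers([1, 1, 0], [0.7, 0.8, 0.52]))
```

The near-random verifier (0.52 accuracy) is filtered out, so the two reliable accepts dominate and the combined score lands well above 0.5.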
# LLM Verification
# Weak Verifiers
# RAG Systems



Sophia Skowronski, David DeStefano, Valdimar Eggertsson & 1 more speaker · Oct 31st, 2025
As AI agents become more capable, their real-world performance increasingly depends on how well they can coordinate tools.
This month's paper introduces a benchmark designed to rigorously test how AI agents handle multi-step tasks using the Model Context Protocol (MCP), the emerging standard for tool integration.
The authors present 101 carefully curated real-world queries, refined through iterative LLM rewriting and human review, that challenge models to coordinate multiple tools such as web search, file operations, mathematical reasoning, and data analysis.
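The kind of multi-step coordination these queries demand can be sketched as a loop that threads one tool's output into the next. This is a hypothetical illustration only; the tool names and the fixed-plan `run_agent` helper are made up, not part of the benchmark or of MCP itself.

```python
def run_agent(query, tools, plan):
    """Execute a fixed plan of tool calls, feeding each result forward."""
    result = query
    for tool_name in plan:
        result = tools[tool_name](result)  # each step consumes the last output
    return result

# Stub tools standing in for real MCP servers (web search, math, etc.)
tools = {
    "web_search": lambda q: f"docs for: {q}",
    "math": lambda d: f"computed stats over ({d})",
}

print(run_agent("GDP growth 2024", tools, ["web_search", "math"]))
```

Real agents must also decide the plan itself; the benchmark's hard part is exactly that models pick and order the right tools, which this fixed-plan sketch deliberately leaves out.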
# MCP
# AI Agents
# LLM Judge



Adam Becker, Matt Squire, Rohan Prasad & 1 more speaker · Sep 17th, 2025
LLM performance isn’t just about the model; it’s about the scaffolding we build around it. “Context Engineering” reframes the conversation: prompt design is the toy problem, while the real frontier is systematically engineering the information environments that shape model behavior. Surveying 1,400+ papers, this work defines the field’s taxonomy (retrieval, generation, processing, management) and shows how it powers RAG, memory, tool use, and multi-agent systems. The survey also reveals a paradox: LLMs can absorb increasingly complex contexts but remain clumsy at producing equally complex outputs. This tension signals a coming split between research obsessed with cramming more into context windows and the harder question of whether models can ever match the sophistication of what they’re given.
# Context Engineering
# LLMs
# Prompt Engineering



Kelly Hong, Adam Becker, Matt Squire & 2 more speakers · Sep 1st, 2025
When Bigger Isn’t Always Better: How Context Length Can Break Your LLM
Longer context windows are the new bragging rights in LLMs, now stretching into millions of tokens. But can models really handle the first and the 10,000th token equally well?
# Context Windows
# LLMs
# Prompt Engineering



Sonam Gupta, Adam Becker, Nehil Jain & 1 more speaker · Sep 1st, 2025
This paper challenges the LLM-dominant narrative and makes the case that small language models (SLMs) are not only sufficient for many agentic AI tasks; they’re often better.
🧠 As agentic AI systems become more common, handling repetitive, task-specific operations, giant models may be overkill. The authors argue that:
SLMs are faster, cheaper, and easier to deploy
Most agentic tasks don't require broad general intelligence
SLMs can be specialized and scaled with greater control
Heterogeneous agents (using both LLMs and SLMs) offer the best of both worlds
They even propose an LLM-to-SLM conversion framework, paving the way for more efficient agent design.
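The heterogeneous-agent argument reduces to a routing decision: send known, repetitive skills to a cheap specialized SLM and fall back to a general LLM otherwise. The sketch below is hypothetical (the `route` function and skill names are not from the paper), just a minimal illustration of the design.

```python
def route(task_type, slm_skills, call_slm, call_llm):
    """Dispatch a task to a specialized SLM when it matches a known skill,
    otherwise fall back to a general-purpose LLM."""
    if task_type in slm_skills:
        return call_slm(task_type)   # fast, cheap, specialized path
    return call_llm(task_type)       # broad general-intelligence path

# Usage with stub model calls standing in for real inference endpoints
slm = lambda t: f"SLM handled {t}"
llm = lambda t: f"LLM handled {t}"
skills = {"extract_json", "classify_intent"}

print(route("extract_json", skills, slm, llm))
print(route("open_ended_planning", skills, slm, llm))
```

The paper's LLM-to-SLM conversion framework would, in effect, grow the `slm_skills` set over time as recurring tasks are distilled into specialized small models.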
# Small Language Models
# Agentic AI
# LLMs



Sophia Skowronski, Adam Becker & Valdimar Eggertsson · Apr 9th, 2025
We break down key insights from the paper, discuss what these findings mean for AI’s role in the workforce, and debate their broader implications. As always, our expert moderators guide the session, followed by an open, lively discussion where you can share your thoughts, ask questions, and challenge ideas with fellow MLOps enthusiasts.
# Generative AI
# Claude
# Hierarchical Taxonomy



Adam Becker, Nehil Jain, Matt Squire & 1 more speaker · Mar 6th, 2025
We dive deep into this groundbreaking paper, break down its key insights, and discuss what makes DeepSeek-R1 so special. Our expert moderators guide the session, followed by a lively round-robin discussion where everyone shares their thoughts, asks questions, and debates the implications with fellow MLOps enthusiasts.
This is the reading group for anyone passionate about MLOps, from seasoned practitioners to the AI-curious. We meet every month on the second Thursday, and trust us: you don’t want to miss this one.
# DeepSeek
# AI
# MLOps



Nehil Jain, Adam Becker, Valdimar Eggertsson & 1 more speaker · Dec 27th, 2024
In the December Reading Group session, we explored A Taxonomy of Agents for Enabling Observability of Foundation Model-Based Agents. Participants discussed the challenges of building agentic AI systems, focusing on four key capabilities: perception, planning, action, and adaptation. The paper highlights issues like lack of controllability, complex inputs and outputs, and the difficulty of monitoring AI systems. Early-stage insights drew on DevOps and MLOps practices and underscored the need for improved tools and evaluation strategies for agent observability. The session fostered a collaborative exchange of ideas and practical solutions.
# AI Agents
# Observability
# AI Systems



Valdimar Eggertsson, Sophia Skowronski, Adam Becker & 1 more speaker · Dec 2nd, 2024
This November Reading Group conversation covers advanced retrieval techniques, strategies like iter-drag and hyper-drag for complex queries, and the impact of larger context windows on model performance. The Reading Group also examines challenges in generalizing these methods.
# Long-Context RAG
# Inference Scaling
# Complex Queries
