MLOps Community

Advancing Open-source World Models // MLOps Reading Group // February 2026

Posted Feb 27, 2026
# Coding Agents
# Open Source World Models
# LingBot-World

Speakers

Valdimar Eggertsson
AI Development Team Lead @ Snjallgögn (Smart Data inc.)

Raised in Reykjavík, living in Berlin. Studied computational and data science, did R&D in NLP, and started making LLM apps as soon as GPT-4 changed the game.

Adam Becker
IRL @ MLOps Community

I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.

I am now building Deep Matter, a startup still in stealth mode...

I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.

For example, I previously co-founded Call Time AI, where we used Artificial Intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we helped Democrats raise tens of millions of dollars. In April of 2021, we sold Call Time to Political Data Inc. Our success, in large part, is due to our ability to productionize machine learning.

I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.

I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.

I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!

Arthur Coleman
CEO @ Online Matters

Arthur Coleman is the CEO at Online Matters. He has held 3 previous roles, including VP Product and Analytics at 4INFO.


SUMMARY

We present LingBot-World, an open-sourced world simulator stemming from video generation. Positioned as a top-tier world model, LingBot-World offers the following features. (1) It maintains high fidelity and robust dynamics in a broad spectrum of environments, including realism, scientific contexts, cartoon styles, and beyond. (2) It enables a minute-level horizon while preserving contextual consistency over time, which is also known as "long-term memory". (3) It supports real-time interactivity, achieving a latency of under 1 second when producing 16 frames per second. We provide public access to the code and model in an effort to narrow the divide between open-source and closed-source technologies. We believe our release will empower the community with practical applications across areas like content creation, gaming, and robot learning.


TRANSCRIPT

Paper: Advancing Open-source World Models

Arthur Coleman [00:00:00]: Good morning, everybody. Welcome back to the reading group for February. We've got a great topic for you today, called Advancing Open Source World Models. I'm actually excited about this one, because one of the areas I work in is the entertainment space, and I've been working in this area. So this is kind of cool stuff for me. Hopefully you'll find it that way too.

Arthur Coleman [00:00:21]: Before we get started, I wanted to let you know something from the last session. Everybody who saw that video, including Demetrios and others, said, you know, why can't we follow up and have a weekly place or something where we can all share best practices? Because this is a hard thing. So we started a channel in Slack called Coding Agents, which I urge you to join. It has the best ideas; I use it all the time, and I'm living this stuff 13 to 15 hours a day. You'll continually find people giving you ideas and best practices to improve your performance and the performance of your LLMs. They also have a biweekly Lunch and Learn, and I think it's on Pacific time, so for those of you in Europe or Asia this may not be the most convenient, although you might want to start your own at some point. We could help you do that. Benoit is an expert at setting up those kinds of things.

Arthur Coleman [00:01:17]: And you can find it there; this is the last event, which happened last Friday. On home.mlops, the community announces these events. So just know that the kinds of things we do here, people do take notice. They do watch the videos. So thank you for your participation. Our speakers today are obviously Valdimar, who is the AI development team lead at Smart Data, and the infamous Adam Becker, who's the founder of HeadOn. And with that, I'm going to turn it over, because Valdimar is going to start with the application and data, the early parts of the presentation of the paper. Then Adam's going to go into the training and evaluation of the models.

Arthur Coleman [00:01:59]: And as always, these sessions— and I will put the— there's a Google Doc for questions. I will put that into the chat, so just know that's going to happen in a minute. But these sessions belong to you. They're intended to be interactive. I say this every time, so, you know, I don't want to say it too much, but: no judgment zone. We want to get your questions.

Arthur Coleman [00:02:19]: The more you participate, the more involved you are, the more you share what you're doing and any ideas you have, the better this will be. So with that, I will turn it over to Valdimar, with one last thing: don't forget to fill out the post-event survey, because that is really critical to us doing better for you and putting on better events. Okay, Valdimar, all yours.

Valdimar Eggertsson [00:02:40]: Thank you, Arthur. I'll share my screen. We have a Miro board, and some other stuff. Let's see. So we are going to talk about open world models. I'm very happy to take a break from LLMs. I've been immersed in LLMs for some years, and it's exciting to talk about something else.

Valdimar Eggertsson [00:03:00]: And it was also very fun to play with. So we're talking about the new open-source world simulator. It was published recently by the Chinese company Robiont, which is a part of Alibaba. So they have a lot of resources, and they made this pretty magical machine here that I'll briefly talk about and show you. Yeah, it's an interactive world simulation. And we can just look at the website; I shared it in the reading group channel, where you can see demos of it, and we can talk a little bit about it. So we're looking at something that seems like a video game where you can move around in an environment, but what makes it unique is that it's generated on the fly, so to say, just from an image and a prompt of what the environment should look like.

Valdimar Eggertsson [00:04:10]: So effectively it simulates a camera in a world, which is more reliable than previous video generation approaches and less dreamlike. So what is LingBot-World? It's an open-source framework designed for interactive world modeling. It delivers high-fidelity, controllable, and logically consistent simulations. So yeah, basically, like in a game, you can control the path of the camera, as if you're walking around in a video game, first person. And it's open source, which is one of the coolest things about it. This is a pattern that's been happening with these big machine learning models recently. In this case, Google DeepMind was researching this topic: they did something called Genie, Genie-3, which was researched last year and came out as a product from Google recently. And then a couple of months later, we get a Chinese open-source version that works and that I can run.

Valdimar Eggertsson [00:05:29]: And that's what I did. And I'll show you a bit what I did with it. Because it's not just some like theoretical paper. It's something that I managed to just get up and running and playing with relatively quickly. So it's not just a text-to-video generation, it's a text-to-world simulation. And how do they do it? There's like two main components. There's a scalable data engine with hierarchical semantics. I'll talk a bit about this.

Valdimar Eggertsson [00:06:04]: And then Adam will talk about the multi-stage evolutionary training pipeline. And they combine all kinds of things to make this work. And it's cool. How does it compare to other systems? In this table here, theirs is the best; it ticks all the boxes they defined. For example, if you compare it to Genie-3, it's very similar, except it has a higher what they call dynamic degree, which is how much the world can change. I think this has to do with having consistency also in things changing. So off camera, there's a cat walking in the room, and you can go do something else. And if you look back, the cat will still be doing its thing, because it's simulating a world.

Valdimar Eggertsson [00:06:56]: It's not just simulating a video. And before jumping into the data engine, I wanted to really just show you what I did. Chapter 5 here is about the application itself. So yeah, they've made lots of different videos that you can see on the website. Sorry, it's taking a bit long. Yeah, they talk about 3 things that they have. They have something called promptable events, where you can just add things on the fly, like: I want fireworks, I want lightning, I want a shield. So it's a way to magically conjure forth things from text, effectively.

Valdimar Eggertsson [00:07:42]: And I think my videos are here. So yeah, I did this on a cheap machine that I rented. Basically, you start with a picture. This is a picture by one of my favorite pumpkin illustrators. And then you give it a prompt; I described the environment of the Shire. Here we have Gandalf. And then you give it a sequence of movements for how you would move in the world. That's the version that's already available and easy to run, but they're working on a real-time version, something that feels more like an open video game engine.

Valdimar Eggertsson [00:08:28]: So this was not in the picture. It's very short, but from the picture it generates, you can look around, and you have something that feels like the Shire, based on the prompt that I gave it. I've got a different one, a different format. I made it fly into the sky instead of just walking and turning, looking over Hobbiton. And I took a still from Blade Runner here. And it's not just an image, it's a world, so you can turn, and there's a car that we almost bumped into. So how do they do this? The foundation, as for machine learning models in general, is the data, and they built a scalable data engine where they gathered the data from 3 main sources.

Valdimar Eggertsson [00:09:48]: There were videos: they have access both to in-house collections from Alibaba and to open-source repositories with lots of videos, but they had to choose ones that made sense, something from a first-person or third-person perspective where you're moving around in the world. And they had some tricks to select useful videos for that. And in addition to gathering videos for training, they used game data. They developed a dedicated game data acquisition platform engineered for high-fidelity capture and synchronization of visual data, agent actions, and camera movement. So it's basically someone playing a game while you keep track of how they're moving. So you get this association between movement in a world and how the world changes. And these are pretty important pillars, but they're not enough to make it robust. So they had a synthetic rendering pipeline built on the Unreal Engine, where they could build synthetic data that they could precisely control and make geometrically consistent.

Valdimar Eggertsson [00:11:17]: Like, the point was to make the resulting model make sense. Then they would try all kinds of things to generate rendered video associated with input like how the camera is moving, and they managed to expand the diversity beyond the biases of real-world datasets, because there are like a billion videos on the internet, but they don't capture everything. Okay. And gathering the data is not enough. They basically need to filter it, which is what they call the data profiling engine, where they would filter down to videos that had nice resolution and so on, but also used their internal vision language model to filter based on visual quality and different parameters. Is it first person, third person? And an important thing it did was approximate where the camera is in the video, to know the relationship between the camera and objects. So they could use that to generate environments from a perspective. And then they needed to map it to text, so you could describe the world.

Valdimar Eggertsson [00:12:42]: And so they made captions for the videos, and that's not just subtitles, but what they call a hierarchical captioning module with 3 distinct categories at different levels of granularity and control. There were the general narrative captions, which are just like: what's this little video about? And then there's the scene static caption, which is not as general, but about what's where in the scene. And then there's the temporal caption, which is the sequence of events. And with all of this, they were able to train these powerful models. So here's an example, briefly, of the narrative caption. It's like a prompt: "The video unfolds as a tranquil first-person exploration of a meticulously designed East Asian-style courtyard."

Valdimar Eggertsson [00:13:46]: And they described the things that are going on there. And then they made a kind of shorter description that's decoupled from the actions, just about what's there. Yeah, let's read it quickly: "The video presents a first-person perspective of someone wandering through a serene, ornately decorated courtyard or temple complex with traditional East Asian architectural elements." It doesn't talk about everything that the person does, but then they have the temporal caption saying how at the start, the first few seconds, there's an event where the camera approaches a decorative screen, and onwards, new events at 5 seconds, 10 seconds, 30 seconds, 35 seconds. And with this hierarchy, they managed to describe it well enough to associate what's going on in the video with the text. And since I got it up and running, I can say a few words about the code, or the architecture of the thing itself, since I dived a bit into it. Let's see: it's an open-source world simulator generating interactive navigable video environments from a single image, text prompt, and a camera trajectory.

Valdimar Eggertsson [00:15:14]: Just 3 files for each video as your input. I actually made a website for it, where you can upload an image, put in text, and put in the path. What it does: it has a model that encodes the text, and a model that encodes the image into a latent space along with the camera poses. Yeah. And it turns the camera poses into some kind of embeddings, Plücker embeddings. Then there's a diffuser model that runs in a loop, with a high-noise and a low-noise model, that generates the image in this diffusion way, where you denoise it and it appears eventually. I maybe should have found a picture describing how diffusion models work. And after this has been encoded and denoised, they map the tensor, like the vector, to red, green, and blue video frames.

Valdimar Eggertsson [00:16:21]: To get an output video. So MP4 video. And yeah, just like the ones I generated earlier. And I think I'll skip going into further details. Maybe mention that, yeah, there's a text encoder model, T5. It's a classic one, it's an old one. There's a thing that maps the input image into a latent space and decodes it back to video. And two diffusion transformers handling low noise and high noise stuff.
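The pipeline Valdimar walks through (a T5 text encoder, an image-to-latent encoder, Plücker embeddings for the camera poses, a high-noise and a low-noise diffusion transformer, and a decoder back to RGB frames) can be sketched structurally. Everything below is a stand-in: the function bodies and all shapes are illustrative stubs, not the real LingBot-World components.

```python
import numpy as np

rng = np.random.default_rng(0)

def t5_encode(prompt):            # stub for the T5 text encoder
    return rng.normal(size=(len(prompt.split()), 512))

def vae_encode(image):            # stub: pixel image -> latent tensor
    return rng.normal(size=(32, 32, 4))

def plucker_embed(camera_poses):  # stub: per-frame camera poses -> ray embeddings
    return rng.normal(size=(len(camera_poses), 32, 32, 6))

def denoise(latent, text, cam, timesteps, expert):
    # stub loop over timesteps; a real model would condition each step
    # on t, the text embedding, and the camera embedding
    for t in timesteps:
        latent = latent - 0.1 * rng.normal(size=latent.shape)
    return latent

def vae_decode(latent):           # stub: latent -> RGB video frames
    return np.clip(rng.normal(size=(16, 256, 256, 3)), 0, 1)

# The three inputs the talk mentions: an image, a prompt, a camera trajectory.
text = t5_encode("a hobbit village in the Shire at sunrise")
latent = vae_encode(np.zeros((256, 256, 3)))
cam = plucker_embed([np.eye(4) for _ in range(16)])

# High-noise expert handles the early (foggy) steps, low-noise the late ones.
latent = denoise(latent, text, cam, [1.0, 0.75], expert="high_noise")
latent = denoise(latent, text, cam, [0.5, 0.25], expert="low_noise")
frames = vae_decode(latent)       # (frames, height, width, RGB) -> write to MP4
print(frames.shape)
```

The point of the sketch is only the data flow: text, image, and camera trajectory each get their own encoder, the diffusion loop runs twice with different experts, and the decoder maps the final latent back to pixel frames.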

Valdimar Eggertsson [00:16:58]: And then I will hand it over to Adam to go a bit into the details and tell us about how it was trained. Nice. Okay.

Adam Becker [00:17:11]: Thank you, Valdimar. So I'll share my screen as well. And it's a good place to hand it off, because I don't know how much background people have in thinking about diffusion. And it was useful for me to get a nice little refresher here, because to fully appreciate what these guys are doing, I thought it would be useful to dive back in. So let's just start right from the top. In terms of their training, as Valdimar said, they broke it down into an evolution of 3 different stages. The first one: just create a general video model. Then they're gonna add some interactivity and make it remember things long-term.

Adam Becker [00:17:57]: And then last, actually make it so that Valdimar can deploy it on a machine, because otherwise it's very, very large, it takes a very long time, and it doesn't have an understanding of causality. So we're gonna spend some time going through each of these stages. You're gonna see how the whole thing kind of connects. So first I just wanna start by asking: if you were to literally try to put yourself in their shoes, if you were to try to create a world model, what would you do? What would be the first thing that you do? And in some ways you should think about it like: what is the difference between a video and a game? Everybody's had this experience where, you know, you go to an arcade and then you try to play the thing, but you never actually put the coins in, right? So it's kind of like you're playing the game, but really you're just watching a movie, right? What you don't have access to yet is the controllability. So I think it was based on this intuition that they decided: you know what, let's just start with a video. So let's just grab whatever video generator model we can, and then we'll make all of our modifications on top of that. So that video generator model will essentially serve as the visual canvas.

Adam Becker [00:19:09]: This is sort of what the world looks like before we start to modify it and convert it into a world model. And so stage 1: just get your general video prior. How do they do it? They're looking for something off the shelf. They found WAN 2.2. If you haven't yet checked it out, it's WAN.video, I think it's called. And you could just go; it's by the same, I don't know if it's the exact same team, but also within Alibaba, they created WAN, and you can go and play with it in the same way that you can play with Gemini and Veo 3 and all of these other things. Okay, so how exactly does that video generator model work in the first place? I think it would be valuable if we spend a couple of minutes just on that, because you'll see that the modifications that they're making to the architecture and to the training are downstream of how that video generator model works in the first place. So we're going to do a quick tour of diffusion architecture.

Adam Becker [00:20:03]: And for that, I want to focus for a second on transformers. So you could see that in a typical transformer, you have the encoder, you have the decoder. And the encoder here was meant to process incoming text with bidirectional attention. Remember, I think even in the original paper, it was about language translation. So you say, "I gave him the book", da da da da, it encodes it, and then slowly, autoregressively, it's going to come up with the translation. Notice that the decoder in this case is using the learned representation to generate one token at a time, autoregressively. That is, it's causal. You could say that "the book", whatever is gonna come here, it doesn't know.

Adam Becker [00:20:42]: So the "je" doesn't know about the book yet. It can't know about that yet. It hadn't yet generated it. It can only look backwards; it can't look forward. And so in the exact same way, when you think about how transformers are to be modified for image generation, and then later for video generation, we simply kill off the decoder. And so we're only going off of the encoder.

Adam Becker [00:21:04]: That is, you cannot have a sense of causality when you're generating images, right? Because an image is not temporal in the same sense, right? Every part of the image needs to know, or could know, about other parts of the image. And so you're getting rid of the decoder that has a sense of causality. You no longer have a sense of causality when you're trying to use transformers for images. Okay, now when I say that every part of the image can affect every other part, or can learn about every other part, what exactly do I mean? Well, in the convolution world, we just used to restrict our attention only to the neighborhood. So I look at a pixel and then I say, okay, what are the other pixels around me? All right, that's what I can pay attention to. Fine. But in the self-attention world, you can have every pixel pay attention to every other pixel, right? So if you have a 256 by 256 image, that's about 4.3 billion pixel pairs. I mean, this is an extremely expensive proposition, right? And so pretty quickly you're like, oh yeah, this is unlikely to be the best way to go.
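Adam's numbers check out with a little arithmetic. The sketch below reproduces the roughly 4.3 billion pixel-pair figure for full self-attention over a 256 by 256 image, and the roughly 65K figure once you attend over a 16 by 16 grid of patches instead:

```python
# Back-of-envelope cost of pixel-wise self-attention vs. patch attention.
# The 256x256 image and the 16x16 patch grid are the figures from the talk.

pixels = 256 * 256            # 65,536 pixels in the image
pixel_pairs = pixels ** 2     # every pixel attends to every other pixel
print(f"{pixel_pairs:,}")     # 4,294,967,296 -> the ~4.3 billion pairs

patches = 16 * 16             # ViT-style 16x16 grid of patches
patch_pairs = patches ** 2    # every patch attends to every other patch
print(f"{patch_pairs:,}")     # 65,536 -> the ~65K pairs
```

Attention cost grows with the square of the sequence length, which is why shrinking the sequence from 65,536 pixels to 256 patches cuts the pair count by a factor of about 65,000.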

Adam Becker [00:22:07]: Let's find other ways to nevertheless deploy the attention paradigm onto images. There's a few ways to do it. The first one is the original Vision Transformer. And it came up with a very simple solution. It just says an image is worth 16 by 16 words. That is, you take a picture and you patchify. You just split, tick, tick, tick, tick. You split it into 16 by 16.

Adam Becker [00:22:30]: And then you say, fine, every patch can attend to every other patch. And so immediately you reduce it down to 65K. So previously 4.3 billion; now you're at 65K. What do you do? Remember, we're still in the transformer paradigm. We're still thinking about something that kind of represents a text. You need to get it into one long sequence of token vectors. How are we going to convert it into that? Well, so you take the image, maybe 3 channels, the colors, whatever, fine.

Adam Becker [00:22:59]: You patchify them, you flatten them into 1D patches. You could see we're processing it more and more so that it looks like something that can be fed into what we saw previously with the original transformer, for language. Okay: flatten, 1D patch, a linear transformation to the token embeddings. And now you need to add the positional embeddings, right? So that you have some sense of where every patch was. Once there, you just reuse the transformer encoder block and you're good to go. So this was the original vision transformer. What WAN 2.2 does, though, is not use the vision transformer; it uses the diffusion transformer. So let's linger for a minute on what diffusion is.
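The patchify, flatten, project, add-positions recipe just described can be sketched in a few lines of NumPy. Sizes here are illustrative (a 64x64 RGB image cut into a 16x16 grid of 4x4 patches, 32-dimensional tokens), and the projection and positional weights are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """Split an (H, W, C) image into (N, p*p*C) flattened patches."""
    H, W, C = img.shape
    x = img.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)       # (H/p, W/p, p, p, C)
    return x.reshape(-1, p * p * C)      # one row per patch

img = rng.normal(size=(64, 64, 3))
patches = patchify(img, 4)               # (256, 48): 256 patches of 48 values

d_model = 32
W_proj = rng.normal(size=(48, d_model))              # "linear transformation"
pos = rng.normal(size=(patches.shape[0], d_model))   # positional embeddings
tokens = patches @ W_proj + pos          # (256, 32): ready for encoder blocks
print(tokens.shape)
```

After this, the 256 tokens go through standard transformer encoder blocks, exactly as a 256-token sentence would.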

Adam Becker [00:23:42]: So let's say we have (this is only in 2 dimensions) and let's imagine there's another dimension just for the probability. If you imagine the landscape of all possible pixel values, you could see that it has sharp peaks and it has valleys. That is, if you have, let's say, this picture of a dog, it's very likely that it doesn't have, let's say, the nose of a clown, right? So if you gave me just enough of the image, I have a certain sense of how the probability is likely to maximize. It's likely to maximize with the nose of a dog. And so you could just sort of start picturing this entire landscape. And if you were to just randomly grab a point in the pixel value space that is, let's say, empty, it would just look like noise, right? It would just look like fog. There's no structure there.

Adam Becker [00:24:40]: But structure is sharply peaked. Okay. So what you're trying to do in the diffusion model is you're trying to teach the model how to identify the direction of highest probability increase. If you start here, you want to start walking up toward the most probable structure, the most probable peak, and then you're going to get the image that you care about. And so the way you do it is you're just learning to follow the steepest ascent. Picture there's a ball, and the ball is looking for the highest probability increase from where it is right now. The problem is, this is very sharply peaked. So it's very difficult for you to even identify it.

Adam Becker [00:25:23]: It's not like it's a hill. It's just very sharply peaked right next to that image, right? So right next to that constellation of pixel values. So what you do is you start by flattening it a little bit. You say, okay, fine, it shouldn't just be, you know, a very sharp peak. Let's diffuse the signal a little bit, right? Let's just spread it around a little bit so it looks a little bit more like a hill, right? And then, as you get closer to the top, you start to sharpen the image back to what the original, clean image was. So how do we do it? We just add noise. So this is the forward diffusion case. You start by, let's say, t equals 0.

Adam Becker [00:26:02]: Let's think about it as time; yeah, it's kind of like time steps here. At t equals 0, clean image, fine. At t equals 0.25, you add a little bit of jitter, and then more jitter, more, up until t equals 1 and it's fully— it's just fog. So the idea is you teach it how to move from more fog to less fog. You teach it how to move from any random spot here up towards the hills. So Valdimar spoke about these two different processes. There's the high noise and then the low noise. You're going to see how they're going to plug into this insight here.
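The forward noising process Adam describes (clean image at t = 0, pure fog at t = 1) can be sketched as a simple interpolation toward noise. The linear schedule below is an illustrative assumption; real diffusion models use more careful schedules, but the qualitative behavior is the same:

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.uniform(size=(8, 8))       # stand-in for a "clean image"
noise = rng.normal(size=(8, 8))        # the "fog"

def add_noise(x0, eps, t):
    """t = 0 -> the clean image, t = 1 -> pure noise."""
    return (1 - t) * x0 + t * eps

corrs = []
for t in [0.0, 0.25, 0.5, 1.0]:
    xt = add_noise(clean, noise, t)
    # how much of the original structure survives at this noise level
    corr = np.corrcoef(xt.ravel(), clean.ravel())[0, 1]
    corrs.append(corr)
    print(f"t={t:.2f}  correlation with clean image = {corr:+.2f}")
```

The printed correlations start at exactly 1.0 and fade as t grows, which is the "more fog" direction; the reverse model is trained to walk the other way.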

Adam Becker [00:26:45]: The poetic way to say it is: we start out by just sculpting a vague shape in the fog, right? So this is maybe the high noise. You're going to see there's going to be an expert, a high-noise expert, for the high-noise regime. We're starting by sculpting something vague. And then over time, we're chiseling finer and finer and finer detail into the emergent structure. Okay, so this is how the diffusion works. In this case, in the diffusion transformer, we take the image and immediately we use a variational autoencoder just to reduce the number of dimensions. Remember, look, we went from about 200K values down to 4K values, from 256 by 256 down to 32 by 32. But we added an extra dimension here.

Adam Becker [00:27:32]: So we might have not just the 3 color channels, but an extra dimension. And now we're gonna start to play this game of diffusion. Now, while we're in latent space, we also patchify it. Remember how we saw the patchification earlier? So we're patchifying not the original image but the latent image. Okay. And now we're gonna start to play the game. So we're gonna fog it up a little bit, and we're gonna add a lot of noise.

Adam Becker [00:28:04]: And then we're going to run the forward and reverse diffusion. We go first from the image to the latent space, and then we go from extremely noisy to increasingly clear. And the diffusion transformer is learning and implementing the defogging process step by step. You need to feed it what step you're in. You need to say, okay, right now we're at t equals 1. It needs to know how far it has to go. Now t is 0.75. Now t is 0.5.

Adam Becker [00:28:31]: And then you also give it the caption. This is some of the captions that Valdimar was speaking about. So the model needs to keep track of the timestep and the text to condition on. And then you end up getting, again, a clean image in latent space. And then you use the variational autoencoder's decoder; this you just take off the shelf. With the decoder step, you can get it back to the original pixel space. The question is: how are you going to incorporate the time embedding and the class embedding, or the text, the caption? How are you gonna do it? There's a lot of different ways to do it.

Adam Becker [00:29:04]: So I'll just mention a couple of them. One of them: okay, so you have the time embedding and class embedding in this case. And again, you have the image; I think in this case it's already in latent space. Well, one way to do it is to just concatenate. So just take all of the patchification in whatever space and just append the time and the class. The problem is that they're now competing with the visual signal, and the time and the class might get ignored. Okay, so there's another approach. The other approach is cross-attention, where you use the time and the class as the keys and the values, and the queries come from the visual signal.

Adam Becker [00:29:43]: The problem with this is that it's very expensive to do computationally. The next approach is to use adaptive layer normalization. So you take the time and class, you pass 'em through an MLP, and then you simply modulate the signal instead of using just a regular layer normalization. The layer normalization is now parameterized with 2 extra parameters, let's say gamma and beta, and those are learned over time. And so the time embedding and the class embedding go through the MLP, and what comes out is some way to modulate the signal.

Adam Becker [00:30:14]: So this is called adaptive layer normalization. And this is extremely cheap to do. It's very cost-efficient. Okay, so this is a little bit of the architecture of what the WAN 2.2 underlying video model is doing. So far we've spoken a little bit about images, but in order to convert to more of a video paradigm, you should probably picture it like this: instead of feeding a single image, really we're passing a batch of frames of a video, right? So, okay, these are 4 different frames that we're passing in. And then instead of trying to predict, okay, is there the nose of a clown or the nose of a dog or what's happening here, you're just saying: okay, I want to actually predict 4, 5, 6, however many other frames, right? So you're just moving from a set of images to another set of images.
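The adaptive layer normalization trick Adam describes can be sketched as follows: a small stand-in MLP maps the conditioning vector (time embedding plus class/text embedding) to a gamma and beta that modulate an ordinary layer norm. All weights here are random placeholders for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
tokens = rng.normal(size=(256, d))       # visual tokens in latent space
cond = rng.normal(size=(d,))             # time embedding + class/text embedding

# Tiny stand-in MLP: conditioning vector -> (gamma, beta)
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, 2 * d))
h = np.tanh(cond @ W1)
gamma, beta = np.split(h @ W2, 2)

def ada_layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token, then scale/shift with the conditioning."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return (1 + gamma) * x_hat + beta    # conditioning modulates the norm

out = ada_layer_norm(tokens, gamma, beta)
print(out.shape)                         # same shape as the input tokens
```

This is why it is so cheap: the conditioning touches the tokens only through two d-dimensional vectors per layer, instead of competing in the token sequence (concatenation) or paying for an extra attention pass (cross-attention).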

Adam Becker [00:31:06]: So the probability peaks here are not just what a single image would look like, but what a scene would look like. So there's also some dynamic and coherent and cohesive structure in the probability space for the scene itself. And so this is what the model ends up learning. And the way they split it in WAN, and also here, is to focus on two different experts. The first is the expert on high noise, and the second is the expert on low noise. And the reason I think it's very clever to think about this and to decide to do it is because really you're trying to learn very different types of things. For the high noise, think about it. You're still— do I have a picture of high noise? Not really.

Adam Becker [00:31:52]: OK. High noise, you could just picture it like this. There you go. This is why I put it here. You got this film. You're not seeing the specific structure. You're not seeing fine visual detail. But you get a sense, OK, we're outside.

Adam Becker [00:32:04]: We're inside. The camera is moving like this. The camera is moving like that. What's the scene? Is the car coming? Is the car not coming? Like we're just thinking about very high level. This is still in the fog. This is still sculpting vague shapes, but sculpting vague shapes is necessary in a longer domain. That is like right now when I'm trying to understand camera movement, it's very different from if I'm just trying to understand how to correctly model this pen. So they're focusing on these two different experts, one expert learning almost like high-level camera motion and scene detection.

Adam Becker [00:32:39]: And the second is low noise, high fidelity video structure. Okay, so this is so far we're good. We have our first stage done. We got our video. The problem is it's still video. It's not interactive. It doesn't yet respond to any interactive logic. And as you could probably, as you probably know from having played with a bunch of these video generator models, they produce very short clips.

Adam Becker [00:33:06]: And we need to extend this to something much longer that we can actually play with. So they go through a few different processes. The first is they say, okay, fine, we start with 5-second clips; let's incrementally show it longer and longer videos, right? So they start with 5 seconds and end up at 60 seconds. The problem here is that while I was telling you earlier that yes, we can denoise, what I didn't say is how long we should spend denoising in the high noise regime versus denoising fine detail. And what they realized is that if you feed in longer videos and expect to get longer videos out, you want to spend much more time in the very high noise regime. That kind of makes sense, right? You want to be able to remember the scenes and then to picture how the camera moves from scene to scene. And that matters more when your video is longer.

Adam Becker [00:34:10]: So first of all, they go through a threshold and they say, okay, which expert should pick up a given level of noise in the first place? That's one consideration. The second: how long should you spend in that high noise regime? And they have a variable called shift. I played around with it a little bit — you can just see it in their code base. They just crank the shift up really high. And it's also a function of the duration of the video that is coming in and the duration of the video that you expect to come out. And with the original shift, you could see, okay, we split the sampling into, say, 70 steps, right? So from 1 to 0 in 70 steps, and we spend an equal amount of time on every denoising step.

Adam Becker [00:35:00]: Fine. But when the shift starts to rise, you could see that right now we're spending much more time on high noise frames. So that's what ends up happening with the shift. And they start to modulate that. They mess around with that in reaction to the video's duration. Okay. Yeah, cool. Video duration.

Adam Becker [00:35:24]: All right, nice. Now cool, so at this point they got a video model and it can ingest and produce even longer videos than anything that we're used to. Fine, we still didn't add the action parameters. That is still not interactive. So when you think about action here, you should think about it in two different ways. So first, What they ask you to do is kind of like a move, a key, let's say like W, A, S, D, right? You know, like when you play video games, W would be like forward and then A would be left, D right, S down. Okay, so they feed in, this is the stuff that Valdimar was talking about, they take the synthetic video generation and it already comes paired up with those keys. So I know that they moved left.

Adam Becker [00:36:12]: I know they moved right. I know they moved up. But up and left and down and right can mean different things. On one hand, it could mean, okay, now I'm deciding to open the door. Now it means that I'm walking downstairs. Now it means I'm entering the car. It could mean different things. So we also need to represent— question is, how are we going to represent this action? We can represent it, first of all, as a multi-hot encoding.

Adam Becker [00:36:34]: Just picture it like a vector with 4 dimensions. Let's say w would be 0, 0, 1, 0, right? Something like that. But second, we're also trying to shift our perspective in the world, and we're trying to shift every ray that is arriving to our field of view. And this is done using the Pluker coordinates. And so we take the action, and we convert it into both the state change and a ray change using the Pluker coordinates. Which you could think about it as kind of like 6 dimensions, almost like X, Y, Z, and then DX, DY, DZ. So you have a point in space and then you have the direction in which that point is orienting. And so, but they didn't show in the code, they didn't show it.

Adam Becker [00:37:21]: They didn't show how they did it. I think the code that they've released is only for the inference and not for the training. And this still goes into the training. So I couldn't find how they did the Pluker coordinates transformation. And it's also that they didn't specify it in the text. But this is sort of what I've managed to gather. Okay, so now just picture that they take— the ultimate representation might just be a concatenation of those two. So the 4 dimensions for the multi-hot, just so that like a state change plus 6 dimensions for the PLUKER.

Adam Becker [00:37:49]: And remember how earlier we were talking about the adaptive layer normalization? That's how they end up treating this. They take the action and then they put it in the same way that you would expect for us to have put the time embedding and the class embedding. That's how they treat it also. That's why I spent a minute on that. So they take the representation and then they feed it in in this way. That is to modify the gamma and the beta that are going to control the normalization before the feedforward and the multi-head attention. Now, another cool thing that they do is they say, wait, wait, wait, we already have a very good video model. They spent a lot of money with WAN 2.2 creating a very good video model.

Adam Becker [00:38:24]: Now they're going to start to add a bunch of Unreal Engine action keys like that, maybe it's going to produce, maybe it's going to ruin the video model. So he says, you know what, we're going to freeze all the weights. We're going to freeze virtually everything but this gamma and beta. And when we're learning how to react interactively, we're only going to be learning on the gamma and the beta in reaction to the state change and to the rate change. That's all they're doing. They also don't want to mess with catastrophic forgetting. They have a good video model. They freeze everything.

Adam Becker [00:39:05]: OK, so far so good. We have the video. We've made it longer. And now it's reacting to the action— left, right, up, down. The problem is we still don't have any causality enforced. This is still bidirectional. We never said you can't look into the future yet. And it's also very large and it's a very slow model and Valdimar would not have been able to run it had it just been this.

Adam Becker [00:39:33]: So now the goal is, how do we reduce the size of the model without losing too much accuracy? So the first thing we do, we say, you know what, this is too large. It's a mixture of experts — the low noise, the high noise — and each one is 14 billion parameters. We're going to kill one of the experts. Which expert do you think they're going to kill? Just linger with that thought for a second. Okay, you might have the intuition that they're going to kill the low noise expert: for longer video, for scene detection, for all of that stuff, the high noise expert is what's necessary, so the low noise one is the one they kill. But now there's a catch.

Adam Becker [00:40:16]: The remaining expert has only been trained on high noise. It doesn't even know what a good clean image looks like. Won't that ruin the quality of the ultimate output? It will. So what they do is they say, okay, let's find some strategically high-quality, zero-noise or very low-noise images, and we'll fine-tune this high noise expert on very, very clean images. That's the first thing they do, and it tends to help. In the evaluation they go back and ask, did it help or not? And it looks like it helped. Next, they say we need to convert the model into a block causal model.

Adam Becker [00:40:55]: What does that mean? We have a single batch of frames. Let's call it like a scene or a block. It could just be, let's say, like 1 second. It might even be less than a second. So 4 or 5 frames, da, da, da, da, da. What we say is that within this block, sure, you can look into the future. It's okay for you to look, but still within that 1 second, you can leak into the future. Why? Because when you don't have it, things just start to flicker and to jitter, and you need to have some cohesive sense of temporality even within that microsecond.

Adam Becker [00:41:30]: But between the seconds, you can't do it. And so they then enforce this causality at the level of the architecture so that you cannot look to the future. Okay. Last thing that we're going to talk about is that they introduced another concept called self-rollout. So unlike in the video generation case where you could just get a block, you say, okay, I want 5 seconds. Boom. It just gives you 5 seconds. Right now we're in an autoregressive regime.

Adam Becker [00:41:59]: That is, we need to keep pushing forward not just based on what the original high-quality input was, but based on what the last second was and the last 2 seconds and the last 3 seconds. We keep sort of like eating our own tail, right? Like whatever the output was is now going back in as input in order to continue autoregressing. And the problem with that is that when you, right, like, 'cause this is what we wanna do with a world model. It's not just a video, we wanna be able to move forward. And so whatever we're seeing right now needs to then be fed in as the input to the next generation of the next few frames. The problem there is that we keep introducing noise. As you're generating something, you're gonna— let's imagine that this is the thing that we've just introduced. So this is output 1.

Adam Becker [00:42:43]: Output 1 is already a little bit off. If you're now using this as input for output 2, now you have bad input. You're gonna get another bad output, which is gonna be another bad output and another bad output. And you just keep making things worse for you. What you're trying to do is you're trying to get the model to learn how it makes its own mistakes and starts to self-correct. So the way you can do it is you can say, OK, fine, you've produced one. Let's say the next few frames. Use those.

Adam Becker [00:43:17]: This is the self-rollout. Use those to generate the next few frames and then the next few frames. And right now we're still in the training regime. Now compare this frame with the ground truth and see if you got it right and if you got it wrong. And insofar as you got it wrong, then you're gonna, you need to self-correct. So this is what it's doing. This is the self-rollout. And with this, they're done with the training.

Adam Becker [00:43:42]: So stage 3 was a student-teacher setup. They started with the teacher being the two experts, they killed one expert, then they fine-tuned it on high-quality, zero-noise images, added the block causality, and introduced self-rollout. So with this, we're done with all the training. Arthur is telling me I've got a couple minutes left. Maybe we can spend a moment on the evaluation, though they didn't spend much time on evaluation, to be honest. They broke the evaluation into 2 parts: the first is qualitative and the second is quantitative. For the qualitative one, they basically just looked at it and asked, okay, how much worse is this fast student model than the original teacher model? Let's just look at it by eye.

Adam Becker [00:44:31]: And they look at it by eye and they say, it looks, it's okay. That's kind of what they did. But then they also do, they also look at emergent memory capability, which is pretty cool. So they say, here, look at this. Right now we're looking at this bridge. You see we're on the bridge, and we could see the top of the bridge. Now let's linger. Let's move around a little bit.

Adam Becker [00:44:52]: Now we're going to keep driving. We'll keep driving forward. Oh, and look, it looks like this bridge is even bigger now because we got closer to the tower of the bridge. So they say, OK, that's cool. So we managed to remember even though we lost sight of it. Then we came back to it, and it recognized that it needs to look to appear larger because we've moved towards it. Another cool emergent memory capability that we're seeing is you see this car here, you can see the car and then we move away from it. And then when we come back, we see that the car moved further along.

Adam Becker [00:45:30]: So it's sort of inferred that the car has this certain motion and it knew how far the car is gonna get. So this is another cool thing. And then last, what they do is— so this is all for the qualitative. For quantitative, they look at V-Bench and they compare everything. And I think that the Valdimar spoke about that. They compare it against these few dimensions. When I looked at the V-Bench paper, there's a lot of different dimensions. And they didn't— I don't think that they evaluated on all of them.

Adam Becker [00:46:01]: At least they didn't tell us. And some of these are not entirely clear. Some of these are informed by models of how humans would react to things. The dynamic degree is also not entirely very— not very clear. It's just like how much dynamic motion you're packing into your segment. Yeah, so that's— but they also recognize that— here you go. Considering evaluation protocols for world models still in a very nascent stage. So didn't spend much time on the, on the analysis and evaluation, but I thought the training was very, very good.

Arthur Coleman [00:46:40]: All done? Yep.

Adam Becker [00:46:42]: All right.

Arthur Coleman [00:46:43]: Adam, I think you should start a site for education. Do, do training on models and LLMs. You do a great job explaining things. Some of it went way over my head, frankly. I'll knock to open.

Adam Becker [00:46:59]: Okay.

Arthur Coleman [00:46:59]: We have a few questions, a couple of different folks, and I'm going to try and balance them out pretty well. So Daniel, you've been the most active and I have two questions for you. You know, ask, go ahead and ask your question, but before you do, are you in the business? Because your questions make it sound like you're actually doing video, and I'd love to know that.

Daniel Yung Chi [00:47:25]: I don't do video, but I do study it.

Adam Becker [00:47:28]: Okay, so let me start my questions.

Daniel Yung Chi [00:47:33]: How do you think the operating costs of this model or this type of use case compared to regular engines? Someone added about Adobe Maya and Unity, right? Those are what we're using now. And if this type of model is more dynamic, more creative, what is the trade-off?

Valdimar Eggertsson [00:48:00]: I can talk about how much it costs me to make these small clips. So yeah, I generated 3 videos in about 1 hour on a GPU that costs $3 per hour to rent. So each clip was like $1 in inference costs. Not totally free, but not expensive at all. And I don't know how difficult it is to render in an actual engine, since I'm no expert in this topic. It's easy to use, and I imagine if you have plenty of GPUs, this could be made into a feasible service. I don't know — what do you think? Arthur, you made a comment about the cost of a movie shot.

Arthur Coleman [00:48:55]: Maya is damned expensive. Maya is what, $4,000 or $5,000 a month on a flat basis to do what it does. And I don't know about Unity, but everyone I know who's doing anything in real-time interactive of any serious order is using those two. And they're, they're, if it's, you're down to a dollar a, a generation in terms of cost, that's a huge reduction. That's, that's like 100 times kind of reduction in my monitoring cost. How many, how long a scene was that you were saying you did?

Valdimar Eggertsson [00:49:31]: It's 5 to 8 seconds. So I rented like an H100 GPU, which in my mind, but it was not enough to run the full version at all and not enough to run more than like 8 seconds. So yeah, if you have 8 times H100, you can do something more proper and that's gonna cost like $40 per hour.

Daniel Yung Chi [00:49:57]: Yeah.

Arthur Coleman [00:49:57]: Daniel, go ahead and ask your next question. I'm gonna do some calculations here real quick and see if I can figure what the, that without data.

Daniel Yung Chi [00:50:05]: Those are broad stroke questions you're not going to answer in a day, like a fitness fit question. Okay, so the next question is the customizability of those environments, right? Like, sure, they generate some, use videos, use environments to produce the data, but like when you prompt it, do you feel like it like snaps to certain distributions of data, like it always goes back to some of those very similar environments, especially when you just prompt it with text, right? Like Alfreza, you can use a feeder image and then you can adapt to more closely to your input image. But the cheapest use case is just text. And then people would just like type a lot of things and try to use the video inside the model.

Valdimar Eggertsson [00:50:56]: I didn't test it enough to really know, but, um, you're supposed to— it's supposed to be quite controllable and steerable based on the description you put in. They claim in the paper that it can cover, you know, many different styles. Adam, maybe you have some words on this? Because yeah, I cannot answer this properly.

Adam Becker [00:51:20]: There's so many comments. No, I'm not entirely sure, to be honest.

Valdimar Eggertsson [00:51:29]: Yeah. I figured that it, I mean, it tends to just by default mimic the original image and it go from there. It does what's probable and that is good stuff maybe. And it's what it's seen in the training distribution. Yeah, you need to play with it to figure that out.

Daniel Yung Chi [00:51:46]: I haven't done it enough.

Adam Becker [00:51:49]: No problem.

Daniel Yung Chi [00:51:49]: I think the next question is from Sam, and Sam can answer. Yeah.

Valdimar Eggertsson [00:51:53]: Ask the question.

Arthur Coleman [00:51:53]: Before you— we can back to the costing question. So I got my pricing. First of all, Maya is $4,600 for a 3-year subscription, not a 1-year, not a monthly subscription. It's about $255 a month. Now your 10 minutes of video, Valdemar, based on your costing, and I think that can commercially would be lower, is about $75 for a 10-minute clip. And I would imagine in a video game, a 10-minute clip is actually a lot because you're moving through the space and, you know, you're coming back to the same space in many cases. If you think about like a first-person shooter game, you're moving through this space, but then it can get very repetitive in terms of where you're moving through. So in that sense, it's, it looks to me to be cheaper by about 30%.

Arthur Coleman [00:52:44]: It's about 30% of the cost in a monthly version, but that's just a guess, Vladimir. We'd have to really do some analysis on this. Sam is up.

Valdimar Eggertsson [00:52:56]: Sam's eye went—

Arthur Coleman [00:52:56]: Sam, are you in the business, by the way?

Valdimar Eggertsson [00:52:58]: Do you do VIN walk? Sam said she had to go.

Arthur Coleman [00:53:05]: All right. So here's the question I get to stand in. How far can you go from any entry point? Have you tested starting from the Blade Runner scene to see what happens if you keep going? This, I think, is the same thing I'm asking, which is how far does it—

Adam Becker [00:53:21]: does—

Arthur Coleman [00:53:21]: how far, how many minutes of editing of content do you have to have in the video to be able to generate a real-world scenario of, you a worldview.

Valdimar Eggertsson [00:53:35]: I think it's limited to around like one, like one minute videos that have proper context. They talked about that being kind of a breakthrough to go after one minute.

Adam Becker [00:53:46]: Earlier it was like previous methods are even shorter and Yeah, I think that they mentioned somewhere in the paper that they even pushed it forward by 10 minutes and they were shocked at how little degradation they experienced, but I don't think that that's what they're releasing.

Valdimar Eggertsson [00:54:09]: So they haven't released like a 10-minute version. Yeah, they haven't released also like, when I looked at the first of the paper, I thought I could just go into the world and play in the world. But they only released this first version. Upcoming is the longer version and the faster version where it can be more interactive. And that sounds revolutionary if that's gonna be available.

Arthur Coleman [00:54:38]: Okay, Daniel, you get the last question since you got a lot of questions for voice.

Daniel Yung Chi [00:54:46]: Okay, the last question is, that question is inside, we talk about video games earlier. Player. So sometimes you have multiple players, then you have multiple actions interacting with the world and you want to create a consistent world. So Adam talks about an action, open the door. So if the action was done by user 1, but it was opened by user 2, and then the next person comes along, they have to like see the door open, right? So that is what I was trying to get to a consistent world. Like you cannot like User 1 sees the door open and User 2 sees the door close, then it's not a consistent world. So would you, like, obviously the paper didn't do it, otherwise you would talk about it, but like, how would they extend their model paradigm to accommodate something like that?

Adam Becker [00:55:43]: I love that question.

Daniel Yung Chi [00:55:44]: Do they just add other users' like hand vector into that adaptive layer that you talk about? And then that is the input, right? Because like you think these two friends are asking at the same time and then you put all the users, what do they do to the world?

Valdimar Eggertsson [00:56:00]: And then they generate a list.

Adam Becker [00:56:02]: Yeah, well, maybe, but why do you feel like it's important to know which user opened the door rather than just the door had opened?

Daniel Yung Chi [00:56:10]: I think that is also a good question.

Adam Becker [00:56:15]: Just discussion questions, I think.

Daniel Yung Chi [00:56:17]: Yeah, yeah, yeah. For users, maybe sometimes you want to know who did what in the world. I don't know if the world model level cares about, but like in other— you may want to care about.

Valdimar Eggertsson [00:56:33]: Yeah.

Adam Becker [00:56:33]: I haven't thought it through. That sounds fascinating. So like maintaining, yeah, maintaining like cohesive and consistent world across lots of different actions that can all happen simultaneously or near simultaneously. That's, uh, I mean, we're for sure getting there. Like, we'll see if it's like a year or a 2-year or 3-year, but let's, uh, yeah, maybe something to think about, about this.

Arthur Coleman [00:56:57]: Maybe we can find that as a special guest tonight, and then I can work on that as a follow-up to the portion, get a guest speaker on gaming visualizations. And we are at time. So I'm lucky. And I say to everybody, uh, first of all, remember to fill out the post event survey so we can do better. Remember that there is now, thanks to you, a community channel on our Slack. Join there. You know, find ways to improve working with agents. And with that, leave good data, everybody.

Arthur Coleman [00:57:28]: Thank you to our speakers, Adam and Balmori. You did a great job. And we will see you at the next pleasure.

Valdimar Eggertsson [00:57:36]: Thanks very much.

Adam Becker [00:57:37]: Thanks everybody.

Valdimar Eggertsson [00:57:38]: Bye-bye.

Adam Becker [00:57:38]: Thanks guys.
