MLOps Community

DeepSeek That, DeepSeek This: MLOps Reading Group

Posted Mar 06, 2025 | Views 72
# DeepSeek
# AI
# MLOps

SPEAKERS

Adam Becker
IRL @ MLOps Community

I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.

I am now building Deep Matter, a startup still in stealth mode...

I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.

For example, I previously co-founded Call Time AI, where we used Artificial Intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we had helped Democrats raise tens of millions of dollars. In April of 2021, we sold Call Time to Political Data Inc. Our success, in large part, was due to our ability to productionize machine learning.

I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.

I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.

I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!

Nehil Jain
MLE Consultant @ TBA

Hey! I’m Nehil Jain, an Applied AI Consultant in the SF area. I specialize in enhancing business performance with AI/ML applications. With a solid background in AI engineering and experience at QuantumBlack, McKinsey, and Super.com, I transform complex business challenges into practical, scalable AI solutions. I focus on GenAI, MLOps, and modern data platforms. I lead projects that not only scale operations but also reduce costs and improve decision-making. I stay updated with the latest in machine learning and data engineering to develop effective, business-aligned tech solutions. Whether it’s improving customer experiences, streamlining operations, or driving AI innovation, my goal is to deliver tangible, impactful value. Interested in leveraging your data as a key asset? Let’s chat.

Matt Squire
CTO and Co-founder @ Fuzzy Labs

Matt is CTO and co-founder at Fuzzy Labs, a consultancy dedicated to using MLOps to help technical teams get the most out of AI and ML. He enjoys AI, bio-inspired computing, and functional programming.

Sophia Skowronski
Data Scientist @ Breckinridge Capital Advisors

Sophia Skowronski is a Data Scientist at Breckinridge Capital Advisors with previous experience as a Business Analyst at Pledge 1%. Sophia has also worked as a Data Science Intern at Candid, an AI Investigations Intern at Deep Discovery, and held roles at Singularity University and the Global CO2 Initiative. Sophia holds a Bachelor of Arts in Astrophysics and Cognitive Science, as well as a Master's degree in Information & Data Science from the University of California, Berkeley.


SUMMARY

We dive deep into this groundbreaking paper, break down its key insights, and discuss what makes DeepSeek-R1 so special. Our expert moderators guide the session, followed by a lively round-robin discussion where everyone shares their thoughts, asks questions, and debates the implications with fellow MLOps enthusiasts.

This is the reading group for anyone passionate about MLOps, from seasoned practitioners to the AI-curious. We meet every month on the second Thursday, and trust us—you don’t want to miss this one.


TRANSCRIPT

Adam Becker [00:00:00]: Obviously, today, DeepSeek-R1, and we split up the paper into four parts. So each one of us is going to cover a quarter of it, roughly. And the first part is going to be me covering kind of like the background. But when you look at the background in the paper, it's just like a page. There's not much going on there. So I figured either I'll just kind of do a quick run through it, or we get to go quite deep into some of these topics. And so I tried to do something a little bit in between, where we kind of go and explore some other papers that are relevant, just to bring everybody onto, like, the same page. It's likely that I will have some blind spots here because it's just massive, right?

Adam Becker [00:00:47]: I just, like, ended up falling down all these different rabbit holes. And then I'm like, okay, this is not relevant. Maybe this is more relevant. And then we tried to figure that one out, and then I come back out. So if you have some questions and something still seems somewhat confusing, just shout or just say something.

Nehil Jain [00:01:05]: Right?

Adam Becker [00:01:05]: Like, in between, it's fine. And so let's just get started. Okay. So the paper itself came out just a couple of weeks ago. As Binoy said, you know, the Internet has been ablaze with it. I didn't quite understand why yet. And so this opportunity was just useful for me to actually try to make sense of it. People are talking about reasoning, people are talking about all these different words that I feel like I have an intuitive grasp of, but I don't really know what's going on.

Adam Becker [00:01:32]: So I'll just walk you through my own understanding of it and of the relevant background here. I'll try to keep it somewhat brief. So obviously, we're trying to make language models intelligent. Right. You guys can see my screen. Yeah. My Miro board. Okay.

Adam Becker [00:01:48]: We're trying to make language models intelligent. How do you do it? The last few years, everybody knows, just make them very, very big, right? And just, like, feed a lot of data into them. Fine. And we've seen over the last few years that models have just continued to increase in size. How big? I mean, we started out, BERT had 345 million parameters; GPT-2, 2 billion. And almost every iteration of these models, we're growing by, like, 10 times. That's not really sustainable for lots of different reasons. First, it's very expensive.

Adam Becker [00:02:22]: It takes a very long time. And we're also going to run out of data at some point. If the whole point is to just feed in more data, we're going to run out. And I think estimates are like 2028. I found this paper that was kind of interesting: yeah, by 2028 we're going to just run out of data. That's it, we're done. It's like we're not generating enough.

Adam Becker [00:02:41]: And if all you're doing is generating data by an AI, right, then it's not even going to be that good. So we have a bunch of problems, and people are trying to find different ways to increase intelligence despite not having access to so much data. Or, in the case of China for example, not having access to enough compute, to the typical, whatever chips we're using in the US; they have less and less access to it. Well, they need to find some more efficient mechanism. So there's a bunch of different mechanisms and everybody's trying to figure out what that new future is going to look like. The problem is that even with the current state, they're still not good enough on certain things. And this was kind of interesting to me.

Adam Becker [00:03:21]: If you think about whether or not they can reason, just try to picture this, like, what do you call these things? The crossword puzzle. Right. How do they. If the way we're training them is just to predict the next token, how are they going to actually figure something like this out that requires some type of self-reflection and iteration? Well, you tried this approach and then it doesn't work. And you try that approach. And the answer is that before o1 came around, they weren't very good at things like this. That I think is kind of like the kicker. And this is almost just an example of a more abstract class of problem.

Adam Becker [00:04:01]: The class of problem is just anything that requires that type of experimentation. Well, let's try this thing. Oh no, that didn't work. We'll retrace our steps. Well, maybe this one doesn't work. Let's try again. So that type of iterative refinement has simply not been very, very good before o1. Why is that? I think we all have some intuition, but the idea here is that LLMs, we've trained them on sort of like what to answer, right? So in the typical kind of supervised fine-tuning, what we do is we try to match the output token with some sense of what the input distribution was like, right? So we're trying to come up with some type of mirroring, some type of predictor, as we see here.

Adam Becker [00:04:40]: Abstractly, this paradigm trains models to produce a single input-output mapping. This is fine when the goal is to directly solve a set of similar queries from a given distribution: you say France and capital, you think Paris. Got it, easy. But if you're trying to now go out of distribution, you're trying to think outside the box, you're trying to take a step back and kind of reason, this type of paradigm isn't really going to serve you all that well. It's shocking how well it has served us thus far. But you could see that when you press it on, like, math and other types of logical puzzles, at some point it starts to break. So instead you're trying to create a model that is robust enough that it can generalize to unseen problems, and it has to try different approaches and seek information to different extents. So that's kind of the idea. Now, how do you train an LLM on how to answer a question as opposed to just what to answer? So through the research, I found this paper that came out in August of last year, and it's by Google, and they're saying scaling LLM test-time compute optimally can be more effective than scaling model parameters.

Adam Becker [00:05:49]: What does that mean? What is test time? What are all these different words? Basically, the idea is this. There's a bunch of different approaches for trying to do things not just during the training, but after the training, perhaps even during the testing. We're not really thinking about this. People that are just using ChatGPT might not get a good intuitive sense of this yet, because ChatGPT is doing this under the hood in a sense. But basically Google found these different approaches for trying to come up with better solutions at test time. That is, you submitted a query, you're saying here's the crossword puzzle, try to solve this, and then there's different ways to go about it. Well, perhaps the LLM can do best of N. So I'll try it with different approaches, and then I'll select the approach that worked best, and then you're going to receive that as the response afterwards.

Adam Becker [00:06:46]: Right, so that's best of N. Beam search is: you can try to take the solution and break it down into individual steps, then do that in parallel, and the step that seems to be quite effective and promising, keep going with that one and shut down the other ones.

Matt Squire [00:07:05]: Right.

Adam Becker [00:07:05]: So there's different ways for you to try to come about a better solution. Another one is lookahead search, which is, okay, again we break it down into these different steps, but then we try to simulate what the future might look like if we were to pick this answer. Right? You see what's going on here, and then based on the simulation, you come up with some reward and you're like, you know, maybe it makes sense to pick a particular topic, a particular approach, and keep going with that. The problem is that they, for the most part, suck. They're not very good, none of these things.
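
To make the test-time compute idea concrete, here is a minimal sketch of best-of-N sampling with a verifier. The `generate` and `score` functions are hypothetical stand-ins for an LLM call and a verifier or reward model; this is just the shape of the idea, not the code from the Google paper.

```python
# Illustrative sketch of best-of-N sampling with a verifier (not the paper's code).
import random
from typing import Callable, List

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Hypothetical LLM call that returns one sampled candidate answer."""
    return f"candidate-{random.randint(0, 9999)}"  # placeholder

def score(prompt: str, answer: str) -> float:
    """Hypothetical verifier: higher means the answer looks more promising."""
    return random.random()  # placeholder

def best_of_n(prompt: str, n: int = 8,
              gen: Callable = generate, verifier: Callable = score) -> str:
    # Spend extra test-time compute: sample N independent answers...
    candidates: List[str] = [gen(prompt) for _ in range(n)]
    # ...then let the verifier pick the one that scores highest.
    return max(candidates, key=lambda ans: verifier(prompt, ans))

print(best_of_n("Solve this crossword clue: ..."))
```

Beam search and lookahead search follow the same pattern, except the verifier scores partial steps (or simulated continuations) instead of whole answers.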

Adam Becker [00:07:40]: So Google is like, yeah, maybe you can make some advance with this and this type of regime. And nothing is very good yet. Okay, keep in mind, this is August. Also, we didn't even say how you're going to pick which solution, and which solution even seems promising in the first place. This is again an open challenge. They call this the verifier. Right? Like the verifier needs to go step by step and figure out, does this look promising? Does this not look promising? A lot of complexity here.

Adam Becker [00:08:08]: I don't think we have time to get into how they do this verification, but there's some interesting stuff. We could put the link in a minute. Okay, so this is August, September of last year. OpenAI comes out with o1. And I think that starts to change the game a little bit. So what they do is they increase the length of the chain of thought. You can actually. Let's zoom into this.

Adam Becker [00:08:32]: The idea here is, as they say, similar to how humans think. If you ask me a very difficult question, I just need. Just give me a minute. I need to think about it. And then once I think about it, I can come up with perhaps a better answer. So what o1 did is essentially gave the model more time to think about it. But that thinking is normally done under the hood. But you could see it.

Adam Becker [00:08:55]: So come here to, like, this crossword. If this is GPT-4o, it gets the wrong answer here. With o1-preview, which is the small one, you see, it thinks for five seconds. You can click into this. You could see how it's thinking. It says, okay, we are asked to solve this crossword, so let's first make sense of the grid. It's like it's going through the process, kind of like step by step by step by step.

Adam Becker [00:09:26]: You're not even seeing it. You're charged for it, but you're not seeing it. All of these tokens are hidden. So the AI is thinking about how to best respond. And it might try different things. How does o1 do it? Probably, we know, with some form of reinforcement learning, but I don't. If people know how it does it, let me know. I haven't been able to understand exactly how it does it.

Adam Becker [00:09:50]: I suspect it's because they're not telling us. Maybe. Maybe they are telling us somewhere, but I haven't been able to.

Nehil Jain [00:09:56]: Yeah. So the paper talks about it, Adam, where even, like, with DeepSeek-R1, my takeaway is like, long chain of thought is all you need. And what that means is, like, can you train it to generate the right chain of thought, and it's long enough that it'll reach the right answer? And that's it. Like, that's basically what they're trying to do. And now there are different ways to get to an accurate long chain of thought, which is the thinking.

Adam Becker [00:10:18]: So. But do you know how it does the reinforcement learning on the basis of that? Do you know what they're using? Because I feel like for DeepSeek, they've made some innovation in the reinforcement learning. I suspect Sophia is going to get into that.

Binoy Pirera [00:10:35]: Hold on, Adam. We have a question from Arthur. He's asking where did those steps come from if they're hidden?

Adam Becker [00:10:44]: Yeah, it's.

Arthur Coleman [00:10:45]: I just.

Adam Becker [00:10:46]: Did you just assume them?

Arthur Coleman [00:10:47]: What I'm thinking is, you kind of made those up. Were those steps your best guess at what's going on?

Adam Becker [00:10:56]: Oh, no, no. So they're telling us, this is it. You guys can see this part, right? Like the. The OpenAI.

Matt Squire [00:11:02]: This is their own documentation.

Adam Becker [00:11:05]: Yeah, they're saying it. It's just that at the bottom, they'll tell you why they decided not to show it to you. It was interesting, hiding the chains of thought. So they think that maybe people can start to manipulate them. Maybe people can. I don't know exactly. Maybe they have some.

Adam Becker [00:11:21]: They also said that there's something about a competitive advantage here that they don't want to show how it's thinking.

Matt Squire [00:11:28]: Yeah, sorry. I was going to say, just bringing out, there's two concepts here that are kind of running side by side, neither of which we know much about from the o1 perspective. One is that the model has a way to signal that it wants to take more time. It has that thing where it can emit tokens that say, I'm going to think about this a little bit. And OpenAI then removes that from the output we see.

Matt Squire [00:11:56]: At least if I've understood that right. But then the other thing we don't know is how they've trained the model to do that. What that looks like. What's the reinforcement learning scheme that they've used behind the scenes?

Adam Becker [00:12:11]: That's exactly right. It's almost like they're using part of the context for the thinking. And how much of it. That's an open question. And that depends on the prompt. And that depends on. There's a lot of different things here. Yeah, this is an example of.

Adam Becker [00:12:27]: It's from another paper, but you could just sort of see how it's like, oh, we got these extra compute tokens for verifying the response. There might be some iterativeness there. There's something about that. But I couldn't get a good answer for how OpenAI does it. I think hence some of the success and the popularity of DeepSeek. So where were we? We're here. We know that they've done it with reinforcement learning, but we don't really know exactly how. But the verdict is nevertheless incredibly impressive.

Adam Becker [00:13:00]: Whatever they did with o1 has so far been sort of like cutting edge. Okay. Now DeepSeek is coming around, and essentially what they're doing is they're saying, let's just do it with reinforcement learning. Let's just be pure RL as far as we can. And the way they've done it is they started with DeepSeek-V3 Base. I'm not going to get too deeply into all the different. I think different folks will talk about the approaches, but just so that you get a sense: V3 Base.

Adam Becker [00:13:28]: So this is. I think they came out with V3 in December of last year. It's a mixture of experts, 671 billion parameters. I think they spent like $6 million on it. Like, it's not a cheap model. Right.

Adam Becker [00:13:41]: Like, they've spent a good amount on it. Some people think it's just, you know, a couple of kids in a garage. Like, it's a full team. And it's pretty interesting. Even what they did for V3, I think, again, was in reaction to just more constrained resources. If people want to read that, I'll put the link there too. But basically they ended up training. They took V3 and then they did GRPO on it, which is a type of reinforcement learning.

Adam Becker [00:14:09]: Sophia, I don't know how much you want me to go into GRPO or if you want to get into that.

Sophia Skowronski [00:14:15]: It looks like you walk through the equation a little bit. I do that as well. So it might be good to see it twice. If you want to do like a high level version of it, I'll do.

Adam Becker [00:14:24]: A very high level, because I feel like I only really understood it today, this morning, like 20 minutes ago. So we'll see. So basically, since 2017 we've been using PPO, right? This is proximal policy optimization. The idea is we're pre-training an LLM. Fine. This is just like next-token prediction.

Adam Becker [00:14:50]: It's supervised, self-supervised. That's the setup. Fine. But then again, the models are still not very good. They're not very readable. They're not actually aligning with human preference. So then what you do is you need to try to collect human signals. So you show a human three different options and they're like, okay, I like this one, I don't like this one.

Adam Becker [00:15:08]: This is the one I like the least. So this is reinforcement learning with human preference, with human feedback. So basically what you do is you need to try to find a way to propagate that signal from the human back into the model, so that the model is able to shift a little bit and become basically like a policy that is predicting the human preference. And there are different ways to go about it. And PPO is one of those ways. And I think the key insight here is that before PPO you had a bunch of different approaches, in like 2013, 2012, 2014, but they were too unstable, because basically the models would be updating, the policy would be updating itself too quickly. And so it ends up being incredibly unstable.

Adam Becker [00:15:57]: What PPO ended up doing is saying, you know what, it's just really this ratio that is relevant. Just think about it, it's like the new model versus the old model. You want to make as small a change to the new model as could nevertheless maximize the gain, so the leverage. So the next policy needs to be really good, but don't make too many changes to it, basically. So this was PPO. Again, I think because of resource constraints, they needed to come up with a more efficient way to go about it. And they came up with GRPO, basically like a group PPO, which is: do the same thing, but not just on a batch, but on a group of batches, and kind of come up with the average and then find a way to optimize for the group, and that ends up being way more efficient and you can actually leverage this much more.
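
For reference, the "small change to the ratio" idea Adam is describing is the standard PPO clipped surrogate objective (from the original PPO paper; it isn't shown on the board):

```latex
L^{\mathrm{CLIP}}(\theta) =
\mathbb{E}_t\!\left[
\min\!\left( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \right)
\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

GRPO keeps this clipped-ratio core but, as Sophia explains next, replaces the learned value function with a group-relative advantage.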

Adam Becker [00:16:46]: But Sophia, you could do the rest on that. So basically we got DeepSeek-R1-Zero simply by applying this GRPO. Again, it's a little bit. It wasn't very readable, it mixed languages, there were a bunch of different problems. And so they ended up applying a few other techniques on it that we're going to get into. And they ended up getting to DeepSeek-R1, and then they've distilled it so that you could just download it on your own computer or whatever. Like, these things are much more manageable now. But again, taking a big.

Adam Becker [00:17:20]: This is like the large model, distilling it into a student model that is much smaller, fewer parameters, and now you can just download it. I downloaded it last night and you could just run it on your computer. And that's pretty cool. So how exactly all of that's done, I will leave to everybody else. Sophia, you're next.

Sophia Skowronski [00:17:40]: Oh yeah. So I kick off the explanation of the R1-Zero model. So I made some simple slides this time, just really pulling from the paper itself. So let me get that started. So this isn't actually in the paper. This is in the other DeepSeekMath paper that came out around the same time. I find that this is much easier to understand versus the math, which we will also get to. But I think it helps to first kind of ground yourself in this.

Sophia Skowronski [00:18:18]: So this is the reinforcement learning algorithm that takes DeepSeek-V3 into DeepSeek-R1-Zero. And at its simplest level, what GRPO is doing is it's dropping the need for this value model. And so what all these colored boxes are, are either trained LLMs or frozen LLMs. So the policy model here is DeepSeek-V3. It starts as DeepSeek-V3, and it's the LLM that we're currently training. This reference model is a frozen version of the original model, so probably DeepSeek-V3. And then the reward model can also be an LLM. It can be an LLM as a judge, to judge all of these outputs across some metric.

Sophia Skowronski [00:19:11]: Or what the Deep SEQ authors did is they actually just used a bunch of Python fun or we don't know. I think it's Python functions like, as simple as like regex expressions to like. And we'll get into that when we go over the reward modeling. And so at a high level, like, what the advantage is here is it eliminates the need for a second trained LLM. So you have so the lack of like or you reduce the memory compute by reducing the number of LLMs. So from four to three. Or in this case it looks like it's two or. Yeah, one that we're actually training.

Sophia Skowronski [00:19:52]: And so just at a high level, what is going on left to right? So there's a query that's processed by the LLM to generate a set of outputs. The outputs are evaluated through a reference model, and that's where we get into KL divergence, which is here, and then the reward model, or in DeepSeek's case, a bunch of Python functions, evaluates each of the outputs based off of a set of metrics and outputs a score. And then all these rewards are grouped and the advantage is computed through a mechanism. And then both the KL and the advantage are brought into the objective function, which is then used for updating the model parameters. And so this, I think, is a lot easier to understand than the math here. So again, like Adam, I think I learned mostly what it is about 20 minutes ago. I like the other slide more.
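
As a rough sketch of that left-to-right flow (the function names here are illustrative placeholders, not DeepSeek's code), one GRPO step looks something like this:

```python
# Rough sketch of one GRPO step: sample a group of outputs, score them,
# and turn rewards into group-relative advantages (no separate value model).
import statistics
from typing import Callable, List, Tuple

def group_advantages(rewards: List[float]) -> List[float]:
    """Each output's advantage is its reward normalized against its own group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero std
    return [(r - mean) / std for r in rewards]

def grpo_step(query: str,
              sample_from_policy: Callable[[str], str],
              reward_fn: Callable[[str, str], float],
              group_size: int = 16) -> List[Tuple[str, float, float]]:
    # 1. The current policy generates a group of candidate outputs for one query.
    outputs = [sample_from_policy(query) for _ in range(group_size)]
    # 2. A rule-based reward (or reward model) scores each output.
    rewards = [reward_fn(query, o) for o in outputs]
    # 3. Advantages are computed relative to the group itself.
    advantages = group_advantages(rewards)
    # 4. Outputs, rewards, and advantages then feed the clipped policy-gradient
    #    update, with a KL penalty toward the frozen reference model.
    return list(zip(outputs, rewards, advantages))
```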

Sophia Skowronski [00:20:56]: So this objective function, it aims to optimize the LLM by updating it relative to a baseline LLM, while also controlling for divergence from the reference LLM. So the system keeps track of multiple versions of the same LLM. So there's the policy model, which is just pi theta here, the old policy model, which is a snapshot of the same model but from a previous training step, and then KL divergence also looks at the reference model, which in this case is DeepSeek-V3 before training has started. And so this reference policy model is basically there to create a more stable, trained version of the policy through KL regularization. And so this ratio here represents how much the new policy deviates from the old policy. And then again.

Sophia Skowronski [00:22:00]: Okay, let me see, there's. I can't see what people are saying in the chat. Sorry. So if there's a question, please.

Binoy Pirera [00:22:06]: Yeah, please, yeah, I'm going to read it. All right, so PUIA is asking, is there a continual learning aspect for replacing the reference models, slash monitoring the reference models for drift? What is the divergence calculation based on?

Sophia Skowronski [00:22:21]: Right, okay. The KL divergence is based off of the. So, okay, let me just give another high-level overview of the steps taken here, because this is used in gradient ascent for updating the model at each training step. So this is like using the same model but pointing to different checkpoints. So this is the current model currently being updated, and old is possibly from the most recent training step.

Sophia Skowronski [00:22:52]: So at each iteration these model parameters get updated, essentially. But so, a high-level overview of what this equation is doing. So it calculates the policy gradient ratio. So it tells you how much the policy is changing between training steps. And then it also does the same thing, but it clips it to be within a certain range. So this limits excessive updates during training. And then we scale both of these ratios by the advantage, which is the reward signal. And I think if you look at the paper, it is just a normalization.

Sophia Skowronski [00:23:34]: So it's each individual output's reward minus the mean, over the standard deviation, which just centers it at zero. It tells you what direction the new policy model is shifting from the mean. So it tells you how good of a reward, or how much of an advantage, a particular output has. Whoops, let me go back, sorry. And so, yeah, then the advantage is applied to the gradient ratios. Sorry. And then you take the minimum between these two. So this selects the more conservative update.
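
For readers following along without the slide, the objective Sophia is walking through is, roughly, the GRPO objective from the DeepSeekMath and DeepSeek-R1 papers:

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
\mathbb{E}\!\left[
\frac{1}{G}\sum_{i=1}^{G}
\left(
\min\!\left(
\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\,A_i,\
\mathrm{clip}\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon\right) A_i
\right)
- \beta\, \mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
\right)
\right],
\qquad
A_i = \frac{r_i - \mathrm{mean}(\{r_1,\dots,r_G\})}{\mathrm{std}(\{r_1,\dots,r_G\})}
```

where q is the query and o_1, ..., o_G are the group of outputs sampled from the old policy.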

Sophia Skowronski [00:24:18]: And then you subtract out the KL divergence term, which penalizes. Sorry, which penalizes large deviations from the DeepSeek-V3 model. So let me see, what else is there to say here? Oh yeah, so KL divergence, the regularizing effect, what is it? So you can see it's a penalty term here. So it basically ensures that the new policy model does not deviate too far from the reference. So when the reference model and the new policy are equal, then this goes to zero. And it's basically there to make sure that the new policy model doesn't employ any reward hacking. So that's a term that you see in references to this paper. So if during training one of the outputs here has a highly rare word in it, like, let's see, quixotic, or I don't know exactly how you pronounce it, and that returns, from the reward model.

Sophia Skowronski [00:25:36]: It returns a really high score. For whatever reason, you don't want the new model to then optimize for saying that same word over and over again. So, as I think they kind of say in the paper, this enables the new policy model to keep its language modeling ability without any of this reward hacking creating some weird deviations for optimizing the output. And so reward modeling is the other piece here. So as I kind of mentioned, it's a lot simpler for this particular paper, because I think they were mostly training this R1-Zero model off of math and coding problems, which are very deterministic. They either work or they don't, or they're correct or they're not. So the accuracy here in this case is just specifying, did they get the correct answer.
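
The paper doesn't publish the actual reward code, so the following is only a guess at what such a rule-based accuracy check might look like for deterministic math answers; the regex and the boxed-answer convention are illustrative assumptions.

```python
# Illustrative rule-based accuracy reward for deterministic math answers.
# DeepSeek doesn't release the real functions; the \boxed{...} convention
# and the regex below are assumptions for the sake of the example.
import re

def accuracy_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the final answer matches the known answer, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    predicted = match.group(1).strip() if match else model_output.strip()
    return 1.0 if predicted == ground_truth.strip() else 0.0

print(accuracy_reward(r"... so the result is \boxed{42}", "42"))  # 1.0
```

The format reward Sophia describes next can be a similarly simple check that the output contains well-formed think tags.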

Sophia Skowronski [00:26:31]: And they also included format rewards as another input to the overall reward. And in this case, did the output include the think tags or not? And was there text between the think tags, and was there an output response keeping the same formatting? And so it was a lot simpler than I thought it was going to be. But I guess I'm not. I think Nehil is going to talk about how this changes for incorporating non-math and non-coding problems. But so this is kind of like what's going into this process, this step right here. And so the other signal that they have, so they have the reinforcement learning update optimization step, but also, where the reasoning comes in in this particular case is through the use of think tags. And so this is the prompt that they're using for training.

Sophia Skowronski [00:27:32]: And so yeah, this will basically be replaced. So any sort of input query that they have will be inserted where this red prompt word is. So this is kind of how they're also trying to generate the right signal for developing reasoning capabilities. And then they also show, performance-wise, that R1-Zero gets better and better. So it's comparable to OpenAI o1-mini and o1 on similar math and coding benchmarks. I read that this AIME benchmark only has 30 examples, which is a little weird. So it shows that you can get big deviations if you get plus or minus one more question right on this benchmark.

Sophia Skowronski [00:28:32]: But still, R1-Zero outperformed, purely by bootstrapping this reinforcement learning process, which is pretty nice. So no training data that explicitly told the model how to reason. Just using this prompt and this reinforcement learning process, it was able to get the model to naturally reason, or use anthropomorphic reasoning language in its think tags. So that's kind of what they mention in the paper about the aha moment. And so I guess it's also kind of pointing to the pattern of reinforcement learning algorithms, where they start to pick up behaviors that weren't explicitly in the training data. So as we mentioned, there was no reasoning training data that was used to supervise this model. And it's kind of a hard thing to label and generate. So you can see, from what they found by looking at some of the outputs, they noticed the model telling itself to wait, and that forces the model to reevaluate and then generate a better response or evaluate itself.

Sophia Skowronski [00:30:03]: And so that was kind of the gist of R1-Zero. What they found was that there were issues with just using it. So they found that the reasoning output sometimes switched languages midway through the reasoning component, which isn't a really good user experience. And let's see, there was one other issue with it. I probably should have read this beforehand.

Nehil Jain [00:30:25]: Yeah, it was that language inconsistency, and readability was hurt in the chain of thought. So, like, you cannot understand what it's actually...

Sophia Skowronski [00:30:33]: Oh yeah, okay, cool. So yeah, that's the gist for R1-Zero. So I guess I can hand it off to Nehil to kind of walk through R1 proper.

Nehil Jain [00:30:46]: So my takeaway when I reached this point, which is what I was going to talk about, was that all the building blocks required for understanding how we got to R1 were covered by Adam and Sophia. Mostly now it's just process, like what they actually did, combining those concepts to get to R1. What do we know? We know that just doing the RL based on GRPO, as Sophia explained, gets you pretty close to o1, like the original version, the September version, and o1-mini, etc. No supervised data required. But there are issues. Issue number one is that the readability is not good, and then it's still not the best one. That's the other problem. And so what did they do? They were like, okay, maybe we should try doing a little bit of fine-tuning with some data, and then we go back to doing the RL.

Nehil Jain [00:31:38]: And so they started with three types of data sets, and we will also cover how they got to reasoning generally, because with R1-Zero the reward model mostly only had math and coding questions, because they were deterministic, so they could just write rules and say, oh, this is right or this is wrong, or which one is the right answer. But how do you generalize that to other thinking questions? Right. And so what they did was they said, okay, first let's just generate some few-shot examples using other LLMs. So they wrote some prompts, they gave an example of what a long chain of thought looks like, and generated some examples. They just did simple one-shot prompts where they were like, hey, do a lot of reflection and verification. I didn't find these prompts in the paper. Correct me if I'm wrong, but they do mention it, very vaguely, without giving the specifics. And then the last thing they did was they also filtered the output of R1-Zero, because there was a lot of good data that was generated during the training process, where they're like, oh, this was what the model predicted, and the reward model can already tell you which answers are actually good.

Nehil Jain [00:32:47]: And so you can use that as feed for the next kind of fine-tuning training. And so they did use humans to review and filter the data even more, just to make sure that they are feeding the best quality data. Small sample set, but still, they want it to be high quality. And then they start the kind of fine-tuning process, which is just the standard fine-tuning that we do, where you have the input and output. And then after that they went and did the exact same thing that they were doing for R1-Zero, but on the new model, which is fine-tuned with all that data. And they're doing the same thing, the same reward model. They're checking for accuracy, format and language consistency. So I guess language consistency is a new thing that they added for R1, but the reward model is exactly the same as before, which is rule-based.

Nehil Jain [00:33:38]: Hey, did you get these questions right or wrong? And is the format correct? Like, do they use the think tag and all the other things? And then they also added language consistency, like, are you thinking in the same language? So that they can eventually make it useful for users once the RL converges. Kind of similar to the graph that Sophia was showing. Once you converge, you're like, okay, let's move on to the next step. And then they basically iterated on the same thing one more time. And they said, now we have even more data. So what we do is we will take this long chain of thought generated by this new step. Like, you were doing RL here, instead of using the original one; now I have language consistency and other things.

Nehil Jain [00:34:25]: So let me just do another set of fine-tuning on top of this. So it's kind of like an iteration within itself where we did that. They also used some supervised reasoning data. So now they're trying to also broaden the data set to use something that is not just math and coding, but English and just general thinking. And so they added some more data sets that they had from just training V3. Again, they didn't share the data set itself, but they talk about using the data set that they used to do supervised fine-tuning for V3. And they also got data for non-reasoning tasks, again from V3, and once they have that, then they simply do supervised fine-tuning and then another set of RL. But this time they made sure that they're also checking in the reward function for helpfulness and harmlessness.

Nehil Jain [00:35:15]: And that's how we got R1. So it was really iteration, with a little bit of cold start, giving it curated data and then just doing the R1-Zero method all over again. And then, as Adam was showing, what they did was they took this model and they generated a lot of data, I think. Yeah, actually they just used these 8,000, took 8,000 CoTs that they had. They got R1 to generate it again and they just distilled it. And so they just had Llama supervised fine-tuned on it, and then Qwen also supervised fine-tuned on it. And that's it, that's how they got the distilled models, which you can download. And then the R1 model is.

Nehil Jain [00:36:03]: I kind of went through the whole process. So that's the high level of how they got R1. Nothing like crazy new innovation in the R1 piece. I think this GRPO application in the Zero model was where they really figured out how to get a good quality piece.
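
Piecing together the stages Nehil just walked through, the R1 recipe reads roughly like the pseudocode below; the stage functions are trivial stubs standing in for large training jobs, not DeepSeek tooling.

```python
# Pseudocode summary of the DeepSeek-R1 training recipe as described in the talk.
# The stage functions are trivial stubs standing in for large training jobs.

def supervised_finetune(model, data):            # stub: SFT stage
    return f"SFT({model})"

def grpo_rl(model, rewards):                     # stub: GRPO RL stage
    return f"RL({model}, rewards={rewards})"

def rejection_sample(model):                     # stub: keep only good CoTs
    return ["<curated long chain-of-thought samples>"]

def train_deepseek_r1(v3_base, cold_start_cots, v3_sft_data):
    # 1. Cold start: SFT on a small, curated set of long chains of thought.
    model = supervised_finetune(v3_base, cold_start_cots)
    # 2. Reasoning-oriented RL with rule-based rewards, now including language consistency.
    model = grpo_rl(model, ["accuracy", "format", "language_consistency"])
    # 3. Rejection sampling plus non-reasoning SFT data from the V3 pipeline, then SFT again.
    model = supervised_finetune(model, rejection_sample(model) + v3_sft_data)
    # 4. A final RL round that also rewards helpfulness and harmlessness.
    return grpo_rl(model, ["accuracy", "format", "helpfulness", "harmlessness"])

print(train_deepseek_r1("DeepSeek-V3-Base", ["cold-start CoTs"], ["V3 SFT data"]))

# 5. Distillation: R1 acts as the teacher; Qwen and Llama students are
#    supervised fine-tuned on R1-generated reasoning data.
```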

Matt Squire [00:36:18]: I guess maybe, if I can make an observation, thinking about this from an MLOps standpoint, what's really interesting here is, like, these are wonderful diagrams, by the way. They help, but they show firstly how complex the process is, how many different steps there are. But they also tell a story that can almost make you think that the authors have gone through that process linearly. Like they've laid it all out and they've run it, and out of the other side came their various models, their R1, their R1-Zero and their various distilled models. But we know in reality that's not what's happening, right? This has been a very iterative process for them to build this up. And so it makes you wonder, what tooling do they need to have in place to do that effectively? Like the degree of collaboration, the degree of false starts and different directions, and even preparing the data and iterating on that before they even try to train a model. They don't say anything about this in the paper at all.

Matt Squire [00:37:24]: I think they focus entirely on methodology and outcomes. But it makes you wonder, you know, what does it look like? Because while most people aren't going to do this, most businesses that want to do some kind of AI, they're not going to be trying to train their own reasoning model, but they might be doing something of similar complexity. So that's kind of interesting to think about.

Nehil Jain [00:37:48]: Yeah, I think in the niches people are fine-tuning, so the process looks kind of similar there. But you're right, it takes a lot of effort to actually get to this level of quality.

Bruno Lannoo [00:37:59]: I also think it's mentioned a bit later in the paper, I'm not sure if someone else will mention it, that they had a couple of unsuccessful attempts. So what you're saying, that they obviously didn't get there straight away, is actually highlighted in the paper. It's very interesting to see what they tried that didn't fit the pattern.

Matt Squire [00:38:13]: Can I.

Arthur Coleman [00:38:14]: It's in the chat. From a high level, as I interpret what you've said, if I'm looking at this as a big picture, what it seems that DeepSeek has done is, instead of using billions of parameters, they've synthetically generated parameters from the chain of thought and substituted them back into the model. Would you say that's the basic notion of what's been done here?

Nehil Jain [00:38:37]: Yeah, I think so. I mean, that's what they were doing with RL as well. But for R1, the only difference is, yeah, some of it is not just synthetic data. Some of it is also data sets that already existed, which they have used. So it's a combination of the two. But there's a lot of synthetic data that is being used to get to the good quality accuracy.

Arthur Coleman [00:38:59]: Yeah, it's basically they use the output of the chain of thought as parameters. Basically, in my mind, as I'm understanding what you said. That's cool.

Matt Squire [00:39:06]: Yeah.

Nehil Jain [00:39:07]: And so this is what I was interjecting, Adam. This was my takeaway: long chain of thought is all you need to make, like, amazing reasoning, or at least as of today. And all of that shows up at inference time, which is all the rage about, hey, can we spend more on that side. On the experimentation side, there were some things that are useful, just because I run a lot of evals, and I think a lot of the community here would be interested to know that as well, of how they evaluated things. And well, the first thing is, how do you even evaluate thinking? That's not straightforward. And so they shared some stuff of what they were doing, how that can be useful for evaluating thinking. So the first thing they did was, let's just turn on the maximum token output so that the model is provoked to think longer and generate a lot more tokens, and then that will capture all the detailed thinking as well.

Nehil Jain [00:40:08]: The other thing they did was they tried the greedy response, which is like they will take whatever the next token is and evaluate that. But that didn't work out well. So what they did was then they generated multiple outputs and they took the average of the correctness of all of those. And while generating multiple outputs, they did have a non-zero temperature. Of course they wanted variety in it, so that the LLM can do different thinking, so that then it can figure out, like, okay, this is the right answer versus this, and how often is it able to think correctly. And k in this case is how many responses they are generating per query. So that's how they evaluated it. They did it across everything in the end with R1; R1-Zero was mostly, again, coding and math.
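
A small sketch of that evaluation scheme: sample k responses per query at a non-zero temperature and average the correctness, rather than scoring a single greedy decode. The `ask_model` call and its defaults are hypothetical stand-ins.

```python
# Sketch of "generate k samples per query and average correctness".
import random
from typing import Callable

def ask_model(question: str, temperature: float = 0.6, max_tokens: int = 32768) -> str:
    """Hypothetical model call with a generous token budget and non-zero temperature."""
    return random.choice(["42", "41", "42"])  # placeholder answers

def avg_correctness(question: str, reference: str, k: int = 16,
                    model: Callable = ask_model) -> float:
    # Sample k independent responses instead of a single greedy one...
    answers = [model(question) for _ in range(k)]
    # ...and report the fraction that are correct.
    return sum(a.strip() == reference for a in answers) / k

print(avg_correctness("What is 6 * 7?", "42"))
```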

Nehil Jain [00:40:58]: But then R1 they wanted to be more general purpose. And so here they took all the different coding benchmarks. There are so many, I'm not familiar with a lot of them as well. But the overall gist is that they evaluate against different things, and then the quality is pretty good. Almost in all cases this process got them to be better than the base model, which was V3.

Matt Squire [00:41:21]: Right.

Nehil Jain [00:41:22]: And so that was already a win, where with very little data and adding this long inference-time generation of tokens, you can surpass the accuracy of what you can produce for complex queries. So that was super cool to see. It was all the rage. It was competitive with o1's latest update, while being open source and all that. So that was the other thing. And then even with distillation, I think it is pretty efficient if you can take all this data and distill it down to smaller models, which could be very effective for a lot of us practitioners, where we are taking a specific example, like a subset of a problem, and we don't need foundational models. Then we can actually just do what they did, where you generate a long chain of thought and you use these models and then just distill them down to a smaller one, and then you can run them on the edge or you can run them inside your web app, et cetera. Like, you can use it in different surface areas, basically. So those were kind of my takeaways.

Nehil Jain [00:42:23]: Yeah. Anything else? Yeah, we'll see. RL and distillation, I think it's just the beginning. As Adam was saying, o1 just opened the curtain to people, that this is how you can do long chain of thought. And I think a lot of people want to do more RL and distillation from here.

Adam Becker [00:42:43]: Nehil, I think, if I'm understanding the chat correctly, they want to know what you're using for these visualizations.

Nehil Jain [00:42:50]: Oh, there's nothing. So this one is React Flow. It's some React library which just generates graphs, and I told Claude to always render the next node I'm building. And this is nothing. This is just some React. I didn't do anything here. I just wrote markdown and told Claude.

Nehil Jain [00:43:09]: Hey, generate a React visualization. So I guess prompting is what I'm using. So it's just like a slideshow built in React, nothing else. So yeah, that's kind of my take from here. Matt, maybe you can walk us through the discussion and where does this lead us?

Matt Squire [00:43:29]: Sure. Yes, absolutely. I think before I do that, before I share my screen, it may be just a good moment to pause and ask if there are any other questions or any other comments, any thoughts. Does, you know, does what you've heard so far make sense? Because I'm conscious there's a lot going on here.

Arthur Coleman [00:43:46]: To Dan's question: Nehil, great presentation. For your Vercel app, what was the visualization tool that you used?

Nehil Jain [00:43:56]: I was using React Flow, which is like a React library, and then using CSS animations to, in this case, just bring in one thing at a time.

Binoy Pirera [00:44:07]: I think Kachi has an interesting one. What are some possible applications of reasoning models to agentic software?

Matt Squire [00:44:14]: There's a lot. Yeah. I mean, well, for example, imagine that, you know, maybe your agent needs to look at three people's calendars and figure out where there are some available slots, you know, as part of a bigger task. So maybe what the agent is doing is organizing appointments and sending out invitations, doing all of this stuff. But one of the tasks it's got to do is reason through this logical problem. Or maybe it's got to find what's the maximum amount of time that these three people are available together so that we can schedule a meeting. All of these things are stuff that we can all do, but they're kind of busy work that we would rather not do ourselves.

Matt Squire [00:44:59]: But they're all reasoning problems. They're like constraint satisfaction problems and things like that. I don't know if any other speakers have thoughts on specific applications.
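
For the calendar example Matt gives above, the underlying busy work is just interval intersection; here is a toy sketch with made-up times, not tied to any particular agent framework.

```python
# Find the slots where three people's calendars are all free (times in hours).
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end)

def common_free_slots(free: List[List[Interval]]) -> List[Interval]:
    """Intersect each person's free intervals; return slots everyone shares."""
    common = free[0]
    for person in free[1:]:
        merged = []
        for a_start, a_end in common:
            for b_start, b_end in person:
                start, end = max(a_start, b_start), min(a_end, b_end)
                if start < end:
                    merged.append((start, end))
        common = merged
    return common

calendars = [
    [(9, 12), (14, 17)],     # person A's free slots
    [(10, 11.5), (15, 18)],  # person B
    [(9.5, 12), (15, 16)],   # person C
]
slots = common_free_slots(calendars)
print(max(slots, key=lambda s: s[1] - s[0]))  # longest shared slot -> (10, 11.5)
```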

Nehil Jain [00:45:10]: I mean, I think like everything where you have to think a little bit more before taking an action, I think goes in there.

Matt Squire [00:45:17]: Yeah. So firstly, I've not put any visualizations together. I'm sorry about that. I've been put to shame by my co-presenters. But what I wanted to do was cover off the last two sections of the paper, which is the discussion, and the conclusions and future work as well. Now, just to be a little bit unconventional, I'm going to jump over to the conclusions first and then come back to the discussion point. And that's because, well, firstly, the conclusions aren't actually that long, but they kind of summarize all of the different moving parts here in quite a nice way. So, you know, they recap the journey.

Matt Squire [00:45:59]: Right, so we're enhancing model reasoning capabilities using reinforcement learning. So we've gone from V3, which is DeepSeek's existing state-of-the-art large language model. We've used reinforcement learning to teach that model how to do reasoning, and the details of the reinforcement learning we've covered. But, you know, with the various techniques there that they've applied, they come out with a model that is competitive; we all know that. And actually we saw massive stock market crashes a few weeks ago because of precisely that. But the other thing they do is they distill that knowledge, they distill that capability into other models. And for me, that bit's kind of spread throughout the paper. It's not emphasized as much.

Matt Squire [00:46:51]: And that's fair enough, because a lot of the interesting stuff here is about the reinforcement learning that precedes it. But what interests me about this distillation bit is we get good models that are very small. So the distillation part is really about: we take some other models, so we take Qwen or we take Llama, and what we're trying to do is teach those models to do reasoning as well. So DeepSeek-R1 knows how to do reasoning. So we position it as the teacher and we generate a bunch of training samples, and then we use that to fine-tune those other models. But it's via that kind of knowledge distillation approach.
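
A hedged sketch of what that distillation step boils down to: the teacher (R1) writes out reasoning traces, and a smaller student such as a Qwen or Llama base model is supervised fine-tuned on them. The function names and calls below are placeholders, not DeepSeek's pipeline.

```python
# Sketch of distillation-as-SFT: teacher generates reasoning traces,
# student is fine-tuned on them. All calls are hypothetical placeholders.
from typing import List, Dict, Callable

def build_distillation_set(prompts: List[str],
                           teacher_generate: Callable[[str], str]) -> List[Dict[str, str]]:
    """The teacher (e.g. R1) writes out full chain-of-thought answers."""
    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]

def distill(student_model, dataset: List[Dict[str, str]], finetune: Callable):
    """Plain supervised fine-tuning of the student (e.g. a Qwen or Llama base) on teacher outputs."""
    return finetune(student_model, dataset)

# Usage sketch (all names hypothetical):
# data = build_distillation_set(reasoning_prompts, r1_api_call)
# small_model = distill(qwen_14b_base, data, sft_trainer)
```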

Matt Squire [00:47:34]: And the interesting finding, we get that performance, and it's kind of summarized just here, I think. So we're kind of comparing the distilled models and the original models against these various benchmarks. Now, I'm not going to go into all of these benchmarks, but we're able to kind of get that performance. And actually, you know, now I'll kind of cover the discussion. So, you know, sorry, I actually looked at the wrong table there, didn't I? This table. So this is where they look at the various distilled models. So they do four distilled models from Qwen, two distilled models from Llama, and they outline the performance of all of those. But we get very small models, right? You look at the numbers of parameters we have in most of these, we're looking at, you know, 14, 32, 8.

Matt Squire [00:48:23]: That's a reasonable model for somebody to run on infrastructure that's not obscenely expensive to run. So that kind of caught my interest in particular, even though it's not the main point of the paper. So let's move on to this discussion point, though, because it touches on some of the questions that I've seen in the chat. So the first one is distillation versus reinforcement learning: which works best in general when we're trying to do this? Because what we've shown is that we can take this R1 model and we can distill it into much smaller models which still show impressive performance. So one question they ask is, well, did we need to do that? Did we need to go through that distillation process, or could we have achieved comparable performance just through reinforcement learning? So could we have actually taken our, you know, our Qwen model and done the reinforcement learning in the first place and got the same results? Essentially, that is what they're asking here. So they try that as well. So they take Qwen-32B and they apply the same reinforcement learning techniques, and they compare it to the distilled models and to the R1 model as well.

Matt Squire [00:49:43]: So we've got those results in this table, and they pull two broad conclusions out of it. The first is that distilling more powerful models into a smaller one yields really good results. But comparatively, the smaller models that are trained using reinforcement learning require more computational power and don't match the performance that we get from distillation. So the high level here is that, for them, if you have a powerful big reasoning model, you can use distillation to take that knowledge and inject it into a smaller model. And that works. It uses less compute and it produces good quality, well-performing models; distillation strategies are economical and effective. But they also feel that it's not enough if you need to push the state of the art. And that's where they see the large-scale reinforcement learning techniques which they cover in this paper sitting within this context. So there's a difference, I suppose, between: I've got this big good model and I just want to distill its capabilities into something smaller, versus I want to make a better big model in the first place.

Matt Squire [00:51:12]: I want to advance the state of the art there. They also talk about some of the things that didn't work. Reinforcement learning is a complex and expensive process, so do we need to use it at all? They talk about setbacks and failures that they encountered along the way. And, you know, you sort of picture their experiment tracker, hopefully they had a good experiment tracker, but you sort of picture that being quite a convoluted and interesting place. They look at two things. So there's the process reward model. Essentially this is a way where you can tie the reward to the process that the model decides to go through when it reasons through, when it solves a problem. Now the issue they found with that was that it's actually quite hard to define those processes, to label those processes, to evaluate them in an automated fashion.

Matt Squire [00:52:14]: But also they encountered this problem of reward hacking, where the model starts to try to optimize for generating a good process, but that doesn't necessarily mean it's good at reasoning in general. It can just kind of tell you how to do it, but that doesn't mean it can necessarily do it well. So that was a limitation there. The other one was Monte Carlo tree search. So they're kind of looking at successful models for things like chess and Go. These are the models that have been pioneered at DeepMind, and they are training models to reward how they search. So they're basically saying, we have a search space. So the problem you're trying to solve has a search space, and you can explore that search space and find an optimal solution within it.

Matt Squire [00:53:06]: Now, there was a question about, you know, can we sort of use a reasoning engine and have the bot, sorry, the AI, explore using that reasoning engine. And I think that ties into this as well. The big challenge I found here was simply that the search space is huge. You know, if your model is solving chess, then it's a search space that's kind of a manageable size. If your model is generating tokens, that search space is enormous and it grows at a very high rate. So they found that that doesn't really work. What they tried to do was then limit the length of the search, which does work, but only insofar as you then end up in local optima, because you're not actually allowing it to fully explore where it might lead to. So yeah, in conclusion, neither of these techniques really worked at the scale they needed them to work at.

Matt Squire [00:54:01]: And that's why they ended up sticking with the reinforcement learning based approach that we've seen described so far. Now then, we're down to the conclusions, limitations and future work. Now, the only other thing for me is to come back to that theme of MLOps. The disappointment for me, as I look through the conclusions, as I look through the summary discussion, is, well, I don't really see how to reproduce their results. Like, fundamentally, I'm not sure what they've shared that would allow me to get the same results. Okay, I don't have the GPUs for it, but putting that to one side, you know, how would we verify and reproduce this? They've shared their methodology, they've shared their outcomes, they've shared the model. But what would we have to do to reproduce that nice pipeline, that workflow that Nehil shared earlier? For me, I still feel like that's something that's not discussed much in these papers around various large language models and how they've been trained and how they've been optimized and so forth.

Matt Squire [00:55:07]: What does the infrastructure look like? What does the tooling look like? What's the process? How are these people collaborating when they work on these things? What works, what doesn't work? And I feel like, as an MLOps community, that's something that we ought to know more about, and it feels like we don't at the moment. It is something that we've been researching quite heavily at my company as well. And I've been writing a series of articles on my newsletter, which I'll share a link to in a second. Shameless promotion, I know, but a series of articles kind of exploring how certain open source models are built up from scratch, and looking at DeepSeek in the next one as well. So there's that. And anyone who has any thoughts on MLOps for large language models.

Arthur Coleman [00:55:52]: Matt, have you heard about Cake? Cake AI or Cake IO? So the problem is the pipelines, right? What Nehil was so accurately able to show is, can we create a templatizable pipeline where you can replace the reward function, replace whatever function you want, so that we all know that this is the kind of pipeline that we want to try, and be able to plug in pieces in it that are reproducible? That's what Cake is trying to do with open source libraries, but again, it's a proprietary pipelining tool that leverages open source libraries. What I've been trying to do is create a pipelining tool which is generic, where you can say, all right, if I know what the pipeline is going to look like, can I deploy that to any cloud and plug and play the different abstracted pieces: the reward function, the model, a GPU or CPU, or grid search. All of those things you should be able to put in based on some kind of interface layer that you're going to define. That's what I think the whole community is going to benefit from, from an MLOps approach.

Matt Squire [00:56:48]: Would you, like, are you thinking you have, out of the box, all the different reward functions and you can kind of plug them together then?

Arthur Coleman [00:56:55]: No. So you would have some, but what you would have is at least the interface, the abstraction that says this is what you're going to have to implement to design your own. So that you can just drop it in there and swap them out, test them against each other. As long as that pipeline is generic enough, to Nehil's point, then we can go ahead and define our own applications in that pipeline.

Matt Squire [00:57:14]: Yeah, no, that makes sense. That makes sense. Yeah.

Nehil Jain [00:57:17]: I mean, I felt the same way, Matt, when I was reading the R1 piece. They so roughly mentioned, oh, we use this data set, or, we tried some few-shot prompts. How is this a paper? It feels like someone just wrote the thoughts they were having while they were reflecting. Yeah, it's not reproducible at all, and I think it should have more specificity, or at least citations to other things.

Bruno Lannoo [00:57:41]: Yeah, I have another question, something like that. When I read it, I was kind of starting to realize this too, but I found it very clearly stated here, in the prompt engineering section in the conclusions: they're saying the few-shot technique is making things worse. And I found it somewhat interesting, because it makes some sense that for reasoning models that want to have very long text generation, few-shot is actually not a good match. But at the same time, I noticed that in practice I rely a lot on few-shot techniques to get outputs that I can afterward parse and extract all the components I would like to get from that answer in automated pipelines. And so I'm wondering, is this something we need to start learning to work with, like that we always need to have two models: one that reasons and does the chain of thought, and then another one that kind of reformulates, with few shots, the conclusions that the smarter one has reached. I'm wondering what people think about that.

Nehil Jain [00:58:38]: I have intuitively, in practice, ended up doing what you said, the two-model play. I don't have a fundamental understanding to answer why few shots might be hurting. But I usually do this where I use a complex model to do the reasoning or solve the problem, and I let it think, I don't format the output, and then I just use a very simple, very, very cheap model to just do, maybe even function calling, and format the output the way I would, which becomes like an extraction problem almost.

Bruno Lannoo [00:59:05]: Yeah.

Matt Squire [00:59:07]: How well does it work? As Ezra asks. I think that's for you.

Nehil Jain [00:59:13]: That technique works much better than just shoving both the solving of the query and the formatting in one prompt, most of the time. I mean, it's all vibe-based, so I haven't done evals on it across different tasks to compare, but I think that's the other mental model I have, where I use one model, an expensive model, to solve the problem, and then another model to just extract the data, because it's usually part of a pipeline which is doing other things, and I need the data in a format so that code can understand what's going on. Works pretty well, and very, very cheap models are able to extract all the relevant stuff accurately. It hasn't been a problem.
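
A minimal sketch of the two-model pattern Bruno and Nehil are describing, with both model calls stubbed out: an expensive reasoning model answers free-form, then a cheap extractor turns that into structured JSON.

```python
# Two-stage "reason, then format" pattern: a reasoning model answers free-form,
# a cheap second model (stubbed here) turns that into structured JSON.
import json
from typing import Callable

def reasoning_model(question: str) -> str:
    """Hypothetical expensive reasoning model; no output-format constraints."""
    return "Thinking it through... the meeting should be Tuesday at 3pm for 45 minutes."

def extractor_model(free_text: str, schema_hint: str) -> str:
    """Hypothetical cheap model prompted (often with few-shot examples) to emit JSON."""
    return json.dumps({"day": "Tuesday", "time": "15:00", "duration_minutes": 45})

def answer_structured(question: str, schema_hint: str,
                      reason: Callable = reasoning_model,
                      extract: Callable = extractor_model) -> dict:
    draft = reason(question)                        # step 1: let the big model think freely
    return json.loads(extract(draft, schema_hint))  # step 2: cheap extraction into a schema

print(answer_structured("When should we meet?", '{"day": str, "time": str, "duration_minutes": int}'))
```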

Binoy Pirera [00:59:52]: If you guys have any questions or thoughts, the Slack channel is wide open as always, so please feel free to drop your thoughts there. And yeah, I think this is probably the biggest reading group session we've had so far, so thank you guys so much for joining, we really appreciate it. I've dropped Nehil's, Matt's and Sophia's LinkedIn profiles in the chat, so if you want to go say hi, please go ahead, and we will see you again next month with a brand new paper.

