MLOps Community

The Truth About LLM Training

Posted Aug 09, 2025 | Views 7
# LLMs
# AI Agents
# Prosus Group

SPEAKERS

Zulkuf Genc
Director of AI @ Prosus Group

Zulkuf Genc is a PhD computer scientist and Head of Data Science at Prosus Group. His work focuses on AI/ML, including ProLLM Benchmarks, FinBERT (1M+ downloads), and generative AI research. He shares insights on LinkedIn.

Paul van der Boor
Senior Director Data Science @ Prosus Group

Paul van der Boor is a Senior Director of Data Science at Prosus and a member of its internal AI group.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.


SUMMARY

Agents in Production [Podcast Limited Series] Episode Nine – Training LLMs, Picking the Right Models, and GPU Headaches

Paul van der Boor and Zulkuf Genc from Prosus join Demetrios to talk about what it really takes to get AI agents running in production. From building solid eval sets to juggling GPU logistics and figuring out which models are worth using (and when), they share hard-won lessons from the front lines. If you're working with LLMs at scale—or thinking about it—this one’s for you.


TRANSCRIPT

Paul van der Boor [00:00:00]: I mean, these models are the quickest depreciating assets, right? They've got a half life and the.

Demetrios [00:00:05]: Most expensive depreciating assets, right?

Paul van der Boor [00:00:07]: There are hundreds of millions of dollars that, you know, after weeks or months have lost their value because a better model came out.

Demetrios [00:00:19]: We talked with Paul and Zulkuf today about training models for agentic use and evals for agents. Paul is the VP of AI at Prosus and Zulkuf is the Director of AI. Got two hard hitters. And myself, the founder of the MLOps Community. My name is Demetrios. Let's get into this conversation. Paul, set the scene. What are you training? How are you doing it?

Paul van der Boor [00:00:48]: Right, so we're talking about what we do at Prosus in terms of bringing AI, in particular agents, into production. In the scope of our work, we care about making sure that the systems work for our use cases, that they are scalable and that they are reliable. Right. And so to do that we have to do a lot of different things. And I'm going to focus on the AI part mostly, because of course there's also the product side and the user needs — we obviously also spend time designing new experiences with AI. But I'm going to focus on the AI piece. One, we need to make sure we understand how well models perform on our use cases.

Paul van der Boor [00:01:29]: And to nobody's surprise, the benchmarks that we see in this space are not good indicators for that. We saw recently even Meta optimizing for the benchmarks — that's the oldest ML problem, right? Like overfitting to the test set. So I think that underlines evaluating these models ourselves. New models come out every week now, very quickly, and we need to know whether they can be used for Polish car-dealer call summaries or for predicting ingredients in a Brazilian food dish recipe. Those are the use cases we care about, and how models perform on them. So evaluations is one. Two is how well do they fit our use case? So it's domain understanding, right? But also language capabilities.

Paul van der Boor [00:02:16]: We benchmark voice models in our eval sets, which we'll talk about in a second, in Afrikaans, Polish, Brazilian Portuguese, Hindi and so on, to see how well they perform for our voice calls, because most of our traffic is not English. Then we also care about cost. So once we believe that we've found a good fit, we need to make sure that these models are actually affordable to scale to hundreds of millions of calls, because we have that many users interacting with our products on a daily basis. So how do you optimize for that cost? Well, you either need to fine-tune smaller models, distill them, or find open-source models that you maybe want to host somewhere and control the full, let's say, inference stack. So then you get into the world of how do you actually fine-tune, how do you distill them, how do you pick the GPUs, how do you optimize for utilization? So we'll talk a little bit about that. And then the last piece: once you've actually gone through this ability to take the performance of a top, Ferrari-level model and get that performance out of a fine-tuned smaller model, how do you set up the entire inference stack to work for you with the same SLAs that you can get from an Anthropic, an OpenAI, or any of the other commercial model providers? And that brings in a whole new set of challenges.

Zulkuf Genc [00:03:39]: So.

Paul van der Boor [00:03:40]: And we've had to solve things across this entire spectrum of problems because we care about bringing GenAI or agentic systems into production at scale.

Demetrios [00:03:52]: You run the gamut, basically, from soup to nuts. You've got to figure it out from top to bottom.

Paul van der Boor [00:03:58]: That's it. Yeah. So I don't know where you want to start, but maybe we can talk a little about the evaluation piece because we have Zulkuf here, right. Since the beginning of working together it was very clear that a lot of things change all the time. Models get better, they can now code, they can now translate, they can now generate images. But one thing that will not change is our need to be able to evaluate how they perform on the tasks we care about. And so Zulkuf has spent a lot of time on developing the in-house capability for us to do that.

Demetrios [00:04:32]: Break down how you're evaluating right now.

Zulkuf Genc [00:04:35]: ProLLM.ai is the short answer — which is what?

Demetrios [00:04:40]: It's probably good to give a little context on what ProLLM is, actually.

Zulkuf Genc [00:04:43]: It's how we share our evaluations. Like, we have many different eval sets touching different tasks from our group companies and from our own AI team, and then we build evaluation pipelines. Every time there is a new model out there — if, I mean, it's from a credible source — we are putting that model in our pipeline, and in, I think, a matter of hours, if not days, we are getting the results from that model on different tasks and we are sharing them with the outside.
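[Editor's note: a minimal sketch of the kind of evaluation pipeline described here — when a new model appears, run it against every internal eval set and record per-task scores. The `EvalExample` structure, `query_model` client and `score` function are hypothetical placeholders, not ProLLM's actual implementation.]

```python
# Hypothetical sketch of an automated eval pipeline: run a newly released
# model against every internal eval set and collect per-task scores.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalExample:
    prompt: str
    reference: str  # ground-truth or accepted answer

def run_benchmark(
    model_name: str,
    eval_sets: dict[str, list[EvalExample]],
    query_model: Callable[[str, str], str],   # (model_name, prompt) -> answer
    score: Callable[[str, str], float],       # (answer, reference) -> score in [0, 1]
) -> dict[str, float]:
    results = {}
    for task, examples in eval_sets.items():
        task_scores = [
            score(query_model(model_name, ex.prompt), ex.reference)
            for ex in examples
        ]
        results[task] = sum(task_scores) / len(task_scores)
    return results

# Example usage: plug a new model into the pipeline as soon as it is released.
# leaderboard = run_benchmark("new-model-v1", eval_sets, query_model, score)
# print(sorted(leaderboard.items(), key=lambda kv: -kv[1]))
```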

Demetrios [00:05:11]: And the evaluation sets are not public.

Zulkuf Genc [00:05:18]: They are not. That's the critical point, actually. Maybe I can give a bit more background on how we started. It's, I think, two years ago — you know, Prosus owns Stack Overflow — and at that time we really wanted to understand the value of Stack Overflow data, how it's helping, because everybody was training on their data. So we wanted to see, okay, how much are we really helping these models? And it turned out a lot. But that's a different topic. What we were doing is that we were training the models, and then we needed to see, okay, how do these models perform after training on Stack data and without Stack data.

Zulkuf Genc [00:05:52]: And we needed really thorough eval sets to see that. And also, you know, in Stack data you can really see different programming languages, different types of questions — debugging, non-debugging, explaining conceptually, guiding — all different types of this. And then we wanted to go deeper in the evaluations. We just didn't want to see: you put some data in the model and how good is it overall. We wanted to know how good it is at answering a Python debugging question compared to writing JavaScript. So then we built all these eval sets with different attributes, and we were using this basically to follow and track the models we trained. But it turned out that occasionally you get some claims: hey, we built this great model, it's very small, just 3 billion parameters, but it does better than GPT-4. And hey, then we are curious, does it really? Then we get our eval set, we benchmark — not at all, not even close.

Demetrios [00:06:54]: It's just marketing.

Paul van der Boor [00:06:55]: Yeah.

Zulkuf Genc [00:06:56]: And then we say, hey, there's a lot of hype there. And it came up more and more often. So we said, since we already have these benchmarks, let's put them somewhere so we can also follow up on the models, because we will have more of those. And then we had this. Then we said, hey, everybody in our group companies is asking us which model they should use for this use case. Then we said, hey, why don't we share it with the outside? And that's how it started, actually.

Demetrios [00:07:23]: Yeah. And you have a few different types of benchmarks. There's the Stack dataset and benchmarks, but there are tool-calling benchmarks too. Right? What other ones do you have?

Zulkuf Genc [00:07:34]: I mean, for the Stack benchmarks, we have two benchmarks there. One is typical Stack Overflow questions, historical ones — like a thousand questions we've got there — and that one has kind of saturated, actually. I can give you a good example. We were benchmarking the models on this historical data, and Stack Overflow had an API agreement with Google. And after that agreement we saw that Google models suddenly picked up on that dataset because, we figured, they started training on it. So we said, hey, this benchmark is good, but we need something fresher — some data the models haven't seen in their training data. Then we created Stack Unseen, which we build from recent Stack Overflow data. We hope models haven't seen those ones.

Zulkuf Genc [00:08:24]: And so this is a bit more new — they're asking about new libraries or new ways of solving problems. So that became Stack Unseen, but both of them are Q&A. Then we have tool calling for our agentic use cases: how good models are at picking the right tool and using it. We have summarization, to see how well a model follows the instruction — I think you should also talk about that, because not every summarization is the same. And we have open-book Q&A, where you give a big context — for example, that's very relevant for our use cases: we have finance colleagues bringing large datasets, putting them there and asking questions. So we want to see, if you give such a big context length, how each model performs in answering those questions. And I think we have a couple more, like entity extraction, and we have Q&A from Toqan, actually, from Toqan questions and answers. But that's also very new, so models haven't seen that data.

Demetrios [00:09:27]: Yeah, you need to keep the freshness of these evals up to date. How often are you refreshing the data?

Zulkuf Genc [00:09:35]: Yeah, that's a problem, actually. They are getting outdated. They are getting rotten very quickly. So the eval sets, I think after six months or something, they all start to get a smell — hey, you need to update it. For Stack Unseen we are trying to do it every three months. For the other datasets we are a bit slower, but we are adding more and taking some stuff out. But that requires resources too.

Zulkuf Genc [00:10:01]: So I cannot say that we are doing it every month, but I think every six months or so you should give some attention to these eval sets and check what's going on there. I mean, models are getting way better, but you also see some parts where they are still not that good. Maybe you put more emphasis on that. But it also depends on our team needs — like, hey, we need this type of thing. For example, in Toqan we started using Go, so it's important for us which models are doing well in Go.

Demetrios [00:10:30]: Right.

Zulkuf Genc [00:10:30]: So suddenly it's a requirement for us. Then we say, hey, let's make sure we have enough coverage for Go, and we see which models are doing well there. So also based on our needs, we are changing the scope of the eval sets.

Paul van der Boor [00:10:43]: And what's really nice, if you go to ProLLM — so Prosus LLM — you can see that there are basically these leaderboards on all these tasks that we've scored hundreds of models on. Because something will happen, right? So DeepSeek comes out, and of course we are very curious — oh, DeepSeek's out, how well does it perform? And then, like Zulkuf was saying, somebody will claim XYZ performance. We run it through our ProLLM evaluation pipeline, and within an hour or a couple of hours we know exactly where it shows up in the leaderboard, and we know what it costs, and we know whether it does better at function calling or summarization, whatever. And then it's like, hey, great, so now we should consider this model and move to, let's say, solving the inference piece — which we'll talk about, the GPUs and so on — for this model. Because actually we've got three or four use cases with 5 billion tokens each, right? So then we have confidence, because of the quick evaluation we're able to do, that it's worth investing more time in bringing this model into an actual agentic or AI product that we have live.

Demetrios [00:11:48]: It's great signal amongst all the noise and all of the hype on social media that you see. Oh, this model got released. It's beating all these benchmarks. You get to see for yourself. And I can imagine you did the same thing with the Llama 4 model and you probably were a little bit unimpressed.

Zulkuf Genc [00:12:06]: We can say that. But also, good point there. Now we have agentic systems, right? There are many tools — summarization can be a tool there. I mean, for example, if you look at the latest Google framework, you'll see agents can use agents as a tool. So for summarization, which consumes a lot of tokens — it's very costly, you put a huge text in it and then you ask questions — if you know that, hey, there is an open-source model that can deliver almost the same performance as the very expensive flagship models, then you can just replace that model there. You don't have to use the same model for all the tasks. And ProLLM gives us this kind of freedom to choose.

Zulkuf Genc [00:12:50]: You look around for that type of task: hey, we can use this model. What is the best alternative? Oh, we can cut cost here, we can get better performance there. So we can also do a kind of routing across models.
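[Editor's note: the per-task routing described here — use a cheap open-source model where it matches the flagship on the benchmark, keep the expensive model where it doesn't — could look roughly like this sketch. The model names, task keys and scores are made-up illustrations, not Prosus's actual routing table.]

```python
# Illustrative per-task model routing: pick the cheapest model whose
# benchmark score on that task is within a tolerance of the best model.
BENCHMARKS = {
    # task -> {model: (score, cost_per_1m_tokens_usd)} -- made-up numbers
    "summarization": {"flagship-model": (0.92, 15.0), "small-open-model": (0.90, 0.60)},
    "tool_calling":  {"flagship-model": (0.88, 15.0), "small-open-model": (0.71, 0.60)},
}

def pick_model(task: str, tolerance: float = 0.03) -> str:
    candidates = BENCHMARKS[task]
    best_score = max(score for score, _ in candidates.values())
    affordable = {
        model: cost
        for model, (score, cost) in candidates.items()
        if score >= best_score - tolerance
    }
    return min(affordable, key=affordable.get)  # cheapest "good enough" model

print(pick_model("summarization"))  # small-open-model
print(pick_model("tool_calling"))   # flagship-model
```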

Paul van der Boor [00:13:03]: One point to add about ProLLM is, you may ask, how do we get the eval sets created? Right? Because we know that need is there. Like we said, all the time new models come out. You want to know, is it worth investigating further, considering it for moving into production? And so you need these eval sets. But this is one area where we found that there are a lot of solutions on the market, tools that claim to help, but actually tools are not the biggest problem we have for evaluations. It's the eval sets themselves. And so how do we get those? Well, we have a couple of advantages: we have, like I said, a big group, lots of companies with real-world data. So what we typically do is we'll gather a whole group of people that are hungry and we organize a labeling party.

Paul van der Boor [00:13:56]: We provide them some pizza and some other snacks and we put them in a room and explain, hey, we're going to go through a bunch of summaries or we're going to go through a bunch of, I don't know, listings on OLX and figure out what are the entities that we would want to recognize. And we sit down and we manually, fueled by pizza, curate these eval sets which we use. And that's the hard part because that's laborious. You need people to kind of come in and do that. And by the way, it's been a lot of fun. And I think we've discovered that this is a great way also to give people who are not so familiar with the work that's happening inside an AI team a little bit of exposure of how we work. And so we start to organize these labeling parties, pizza and labeling, and then you have eval sets that we now have ready to be evaluated against anytime a new model comes in. And maybe it's worth also talking a little bit about how we use LLM as a judge to very quickly be able to evaluate a new model against that eval set.

Demetrios [00:15:03]: Yeah, tell me about that.

Zulkuf Genc [00:15:05]: Yeah, I mean, we have been using LLM as a judge for a long time. I think the first thing we discovered is that, for example, for the Stack Overflow benchmarks — like Stack Unseen and the Stack Q&A datasets — there is a technical answer, but there are many ways of giving that answer. Even if you go to Stack Overflow itself, you will see different scores for different answers. So there is not one ground truth. So how are we going to evaluate that? We cannot just compare it. So what helps is using whatever ground truth is there. In the Stack Overflow case we always have a true answer, and we also have a couple of recent answers that got lots of good scores.

Zulkuf Genc [00:15:48]: So you bring those answers to the model, then you bring the question, then you bring the answer from the other models, and you ask the model, basically: okay, is this answer correct or not? And here is a clue, kind of a hint of what the ground truth looks like. Using that, I think we are now at around 90% accuracy in evaluating whether a model-generated answer is correct or not. So as long as you have credible ground truth there, I think models are pretty good at judging the accuracy of their own generation or the generation of other models. But then you need another eval set — you need to judge the judges, to find out which judge you should use, because every model performs differently at that. And to Paul's point, then we organized a labeling party. We generated many labels from the models by putting each of them in the judge position and asking each model to evaluate. And then humans evaluated their evaluations, and from there we had an okay labeled dataset.

Zulkuf Genc [00:16:59]: And then we use it — we have it in ProLLM as the LLM-as-a-judge benchmark. And that benchmark shows how LLMs are performing as a judge. And from there we pick our judge and then use it. I think at the moment GPT-4.1 is our current judge. But if something else becomes better, we will just replace our judge.

Demetrios [00:17:21]: And how are you setting up the judging system? Because there's a million and one different ways to have an LLM as a judge.

Zulkuf Genc [00:17:31]: Yeah, what we do — I mean, you always need to explain the task, I think that's a no-brainer. What is this task about? This is a technical Q&A, so you need a technical answer. And there are also many ways to score. Some people use 0-to-10 scoring: okay, give us a score — one model gives seven, another model says eight. What are you going to do with that? Right? So we came to the point of what matters for the user, as a person asking a technical question.

Zulkuf Genc [00:18:07]: What's important for me is I get an answer and I stop searching. That's good enough. I just go and continue my work. So we said okay — and Stack Overflow has something similar, like acceptance, right? It's acceptable or not. So we wanted to bring it to a kind of binary: it solves my problem or not, for a technical question.

Zulkuf Genc [00:18:29]: But we also realized, hey, models are easy to flip around, so we'd better give them a bit of room to make mistakes. So rather than 0 and 1, what we did is 0, 1, 2 and 3. So 0 means really completely off; 1 is a bit more okay — it got some clue, but it is still useless for me; 2 is, I can live with it as an answer, I don't need more; and 3 is, that's perfect. So putting 1 and 2 in the middle gives a bit of room: if it flips, it still flips within the same area, it doesn't go to the other side and become completely wrong.

Zulkuf Genc [00:19:05]: And with that I think we could get pretty good results. So in the end, what is important for you for each task? If it's summarization, for example, then that definition changes — what's important there. In summarization it has to follow your instruction: if you ask for two key takeaways, it should give you two, not three, and things like that. So for each task you have to go a bit deeper, see what matters, and then reflect it in your rubric.
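[Editor's note: a minimal sketch of the judge setup described above — the judge model gets the task definition, the question, the ground-truth answer as a hint, and the 0–3 rubric, so a near-miss flips within the same half of the scale. The prompt wording and the `call_judge_model` function are assumptions for illustration, not the exact ProLLM prompt.]

```python
# Hypothetical LLM-as-a-judge prompt using the 0-3 rubric described above.
JUDGE_RUBRIC = """You are judging answers to technical Q&A (Stack Overflow style).
Score the candidate answer from 0 to 3:
  0 - completely off, does not address the question
  1 - has some clue, but still useless to the asker
  2 - acceptable; the asker could stop searching and use it
  3 - fully correct and complete
Use the reference answer only as a hint for what a correct answer looks like.
Return only the integer score."""

def build_judge_prompt(question: str, reference_answer: str, candidate_answer: str) -> str:
    return (
        f"{JUDGE_RUBRIC}\n\n"
        f"Question:\n{question}\n\n"
        f"Reference answer (ground-truth hint):\n{reference_answer}\n\n"
        f"Candidate answer to judge:\n{candidate_answer}\n\n"
        f"Score:"
    )

def judge(question: str, reference: str, candidate: str, call_judge_model) -> bool:
    """Binary 'solves my problem or not': scores 2-3 pass, 0-1 fail."""
    raw = call_judge_model(build_judge_prompt(question, reference, candidate))
    return int(raw.strip()) >= 2
```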

Demetrios [00:19:38]: Basically one thing I think about, because I am a huge user of Gemini and if you would have asked me six months ago what model I was using, it would not have been Gemini. Have you been surprised by certain models gaining traction or losing traction over time? Because you've been tracking this for so long.

Zulkuf Genc [00:20:02]: I mean, in the beginning OpenAI was always on top. In every benchmark, from summarization to technical question answering, it was always OpenAI models on top, and with a very big margin over the second-best model if it was not from OpenAI. But over time, what we see now is that margin gets narrower and narrower. And now OpenAI is also not always on top. We see other models getting very close or getting ahead of OpenAI, and sometimes you just see, okay, for different tasks the models are changing too. We've got a lot more variety now, with Google models, with Anthropic models.

Zulkuf Genc [00:20:42]: And sometimes even Mistral — a small model was really doing a great job on summarization, better than the OpenAI flagship model. And now we also have the China dimension. Maybe, Paul, you can also share your thoughts about Chinese models. But we see them getting better and better over time, and that's also helping us as an open-source community. We are training a lot of models, and now we are just using those benchmarks to see which model can be our baseline for different types of tasks, by looking at the parameters. Okay, that 32B model, that can really be a great reasoning model. And then from the benchmark we can decide on that one. Or look at the small models in the benchmark.

Zulkuf Genc [00:21:24]: What is the smallest, best model?

Paul van der Boor [00:21:25]: Yeah. If you go back to this angle of building agents in production, let's take Toqan, which we talked about — our internal platform for productivity and other things. Over the last two years of building Toqan, we've had continuous challenger-champion models to see which ones are better at certain tasks for the agent, whether it's summarization or image generation. And we've gone through over 100 models that we've put into production. And I can tell you with confidence that the models we have today in Toqan that are answering the questions — in six months, all those models will be different.

Zulkuf Genc [00:22:07]: Wow.

Paul van der Boor [00:22:08]: And so think about that as a product builder. Right. That means that we need to understand which models, in six months, are going to be the ones that we need to replace these with. Right. Because there are going to be tons. And as Zulkuf mentioned, initially it was actually easy — it was just OpenAI that was the best one. Right. Unless there was some cost consideration and so on, it was easy to know which one was best.

Paul van der Boor [00:22:32]: Today, what's been very interesting is to see the rise of new model providers, in particular open-source ones coming from China and other parts of the world, that have basically become very, very good alternatives for many use cases. And we are starting to use those in production. Which again brings us to the point, which we'll talk about: how do you organize inference for open-source models of that size? And the first open-source options that were really competitive were essentially Llama and Mistral's models. Now all of a sudden you have Qwen models coming out, and of course you've got the DeepSeek ones. And as we switch to more reasoning workflows, in particular for the agents, it's really cool to be able to get much more visibility into how these models were trained and what they work well on. And we see that — because we've got two years of history of ProLLM, these leaderboards — you can actually play the video to see, okay, what does this leaderboard look like over time? In the beginning it was very concentrated, just three or four players. And now you start to see that there's a dozen players in the top 10 that are continuously catching up to each other.

Paul van der Boor [00:23:53]: I mean, these models are the quickest depreciating assets, right? They've got a half life and the.

Demetrios [00:23:58]: Most expensive depreciating assets, right?

Paul van der Boor [00:24:00]: There are hundreds of millions of dollars that after weeks or months have lost their value because a better model came out. And so again, coming back to this product-builder perspective, we're agnostic to where the model comes from, because we just want the best model — cheap, fast, whatever criteria you use for best. But then once we know which one's best, we still have to solve: how do we actually host this? Right? Is it a commercial API as a service? Or do we actually need to figure out where this model is going to run — on the cloud, on our own GPUs and so on?

Demetrios [00:24:38]: Yeah, it's probably worth talking a bit about the infrastructure side of things and going out there and figuring out the GPUs. And I am fascinated by the GPU market right now, because if you do decide to go the open-source route, you need GPUs. There's no question. All other tools are kind of optional. We talked with Bruce about what you chose to buy and what you chose to build yourself. There's no way you're building your own GPUs. Now, you might go and decide to buy GPUs, but that is a very hard way of solving this problem. So can you talk about this journey of going out there and looking at GPU providers? There are so many right now, so maybe you can break down what key considerations you had to look for when you were going out there.

Zulkuf Genc [00:25:34]: Yeah, actually, as you say, there are so many players there. I think we talked to almost every leading player out there, from Nvidia to CoreWeave to Together AI to Mosaic — which became Databricks later. So we had a long exploratory phase. And if you talk to GPU providers, initially everybody is very open, and everybody is the cheapest, everybody is the fastest. But once you start talking, it shifts: I think most of them are asking for some dedication, like a commitment. So you need to get reserved capacity for a certain number of GPUs for a certain number of months. And if you do the math, hey, that's a lot of money. And if you're not going to use these GPUs immediately, you are just burning your money, and they are just sitting there idle, waiting.

Zulkuf Genc [00:26:31]: So that's the kind of dilemma we had initially. Hey, we know that we are going to explore a lot, and there will be lots of times when we are not training — we are just manipulating data, preparing for another experiment. Then we will have intensive weeks of training, so many parallel trainings, so we need many GPUs those weeks — peak times. But then we again get the experiment results, analyze, learn and prepare for the next set of experiments, until we are sure that, okay, now we can do a big training because now we know which data to use and how everything is settled. But that part we needed on demand. And when we started talking, this on-demand piece was not something every GPU provider was very enthusiastic about. So that was a tricky point.

Zulkuf Genc [00:27:20]: We couldn't get it from, yeah, almost anybody. Like, hey, if you want to do on demand, then suddenly either the price goes super high — then, wow, you pay almost as much as reserved capacity — or you get a very small number of GPUs assigned to you, so you cannot really run any meaningful experiment. So it was a tricky point. With our current GPU provider — that was the differentiator for us — we have a kind of freedom. We can use on-demand GPUs and scale quickly. And when we need a much bigger capacity, we can reserve a week ahead and then get more GPUs for bigger trainings.

Paul van der Boor [00:28:05]: Maybe to add to that, because I think you've described a very specific type of need that we have for GPUs, which is when we train models for which for a period we need a few dozen, a few hundred, maybe a few thousand GPUs to run a training run.

Demetrios [00:28:19]: It's very spiky.

Zulkuf Genc [00:28:20]: Yeah, spiky.

Paul van der Boor [00:28:21]: And this is sort of the on-demand piece, right? Where a commitment of one year or whatever longer is really hard to make, because we just need it for that batch, that training run, right. And then, you know, we'll look at the results and we'll run it again. Then there are other use cases, inference related, where we also need GPUs. Right. So we looked at: do we need to, let's say, again commit to dedicated GPUs, or do we do on-demand spot instances or whatever on the cloud providers? That is an option, but they typically have these quotas. They're very slow to respond. They're not great for the setup because they tell you, oh, you can have eight A100s or whatever, right.

Paul van der Boor [00:29:04]: But sometimes we need 12, or sometimes we need whatever. So for those inference workloads we had the issue that if we had committed, let's say, a certain number of GPU resources to a service — which was doing classification or translation or whatever in real products — you also have a lot of cycles, daily or even weekly, where people come in and they post pictures or they order food. And so you need to adjust, because if you don't adjust your underlying GPUs for inference, you have really low utilization and then your costs are through the roof again. And that's where talking to providers like Together AI has been interesting, because that's more inference related — for the training we haven't worked with them yet — but they've been great because they basically offer on demand, or tokens as a service, for open-source models with all the privacy guarantees that we needed. And so depending on the workloads, you actually need to solve for very different numbers of resources and so on.
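[Editor's note: the utilization point is essentially arithmetic — a dedicated GPU only pays off if traffic keeps it busy. A back-of-the-envelope sketch, with made-up prices and throughput rather than any provider's real numbers, might look like this.]

```python
# Back-of-the-envelope: effective cost per 1M tokens on a dedicated GPU
# as a function of utilization, vs. a pay-per-token endpoint.
# All numbers below are illustrative assumptions, not quoted prices.
GPU_HOURLY_RATE = 2.50              # USD/hour for a dedicated GPU
PEAK_TOKENS_PER_S = 1500            # sustained throughput at full load
API_PRICE_PER_1M = 0.60             # USD per 1M tokens, hosted open-source model

def dedicated_cost_per_1m(utilization: float) -> float:
    tokens_per_hour = PEAK_TOKENS_PER_S * 3600 * utilization
    return GPU_HOURLY_RATE / (tokens_per_hour / 1e6)

for u in (1.0, 0.5, 0.1):
    print(f"utilization {u:>4.0%}: ${dedicated_cost_per_1m(u):.2f} per 1M tokens "
          f"(pay-per-token reference: ${API_PRICE_PER_1M:.2f})")
```

With these assumed numbers, the dedicated GPU is competitive near full utilization but roughly 10x more expensive per token at 10% utilization, which is the cyclical-traffic problem described above.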

Demetrios [00:30:09]: The elasticity is key for you, if I'm hearing that correctly. But also you bring up a great point that potentially if you can, you're going to try and just go with tokens versus I'm going to figure out how to put the model onto this infrastructure and then figure out everything around that to make that inference very fast or whatever. So how do you break down the, oh, we'll just hit the open source provider in a token API way versus we're going to go into the GPUs and do it ourselves.

Paul van der Boor [00:30:44]: Well, we have our own models, so sometimes it's not that simple. Let's take a DeepSeek example, right? We want to use DeepSeek for a workload. Then we could do a few things. We either go to DeepSeek themselves, which typically for privacy reasons doesn't work. Then we can say, well, we download the models from Hugging Face and put them on some bare metal in the cloud or elsewhere. Then you get into the issue that we still have that day-to-day cyclical usage pattern, so we probably can't get the economics to be very favorable, because utilization fluctuates too much or is too low if you have guaranteed bare metal of your own. So then, say, a service that offers tokens as a service for DeepSeek — which is just the vanilla model, the model everybody has access to — Together AI is great, because then you just send the request and you get it back, and they organize the workloads. But we're also training our own models. And so that's where, say, Zulkuf has trained a model.

Paul van der Boor [00:31:50]: How do you expose that for people to use? Then you can't use Together AI or a commercial provider, obviously, because they don't have that model. So then you still need to go back and say, okay, how do I provision and make this work for our own trained model?

Demetrios [00:32:05]: And the other piece is on the. When you're out there and you're looking for hardware, I'm sure there's a lot of other considerations that you've thought about. Maybe in this specific training instance, it comes with its own set of considerations. In the inference instance, it comes with a whole different set of considerations. How do you look at what you need, what boxes you need to check as you're getting for the training?

Zulkuf Genc [00:32:32]: I mean, initially we actually didn't know. We learned it by experience. When we started training the models for Stack Overflow, at that time Mosaic, I think, was the only provider helping there. And then we started using their platform, and they were great guys, helping with the support. But we learned that we cannot use every tool or every model out there — we have to just go with whatever they support — and that started being a limiting factor for us later. So that's an important checkpoint for us: can we use the frameworks and every model out there immediately with these GPUs, or do we need to wait for support to come.

Paul van der Boor [00:33:11]: To the platform. Just to make it real: what we were facing is that Llama 2 came out and we wanted to train that model on the Mosaic servers. Right. And that wasn't possible because their middleware, essentially their framework — Composer, I think it was called — was not compatible with that model yet, or that model wasn't compatible with it. So we had to wait weeks to get that model available. So then we would want to have access to the bare underlying GPUs and run everything of our own on there, rather than having to go through their services.

Demetrios [00:33:48]: Well, it feels like the bigger question is, when you're looking at a GPU provider: what kind of support do they have? What kind of software comes with the GPUs?

Zulkuf Genc [00:34:00]: Exactly. And most of them nowadays support Slurm, for example — that's kind of the default you get out of the box. You always need a Hugging Face kind of integration, so we can bring every model from Hugging Face to the GPUs. And then you also see later, hey, how are we going to bring the data here? Initially you think, hey, that's a no-brainer, there should be many ways. But you see, hey, not everybody really supports every way, and most of our data is in S3.

Zulkuf Genc [00:34:28]: How are we going to set up a secure connection to bring this data over, and once the data is there, how are we going to handle it? What kind of privacy guarantees do we get? Can we remove the data anytime we want? Do they keep a replica? Those kinds of simple questions don't have the same answers for every provider. So they all turned into checkboxes for us after we started talking to the GPU providers. And then there are natural things like network speed, and how it scales — is there auto-scaling or not? So there are also technical checkboxes we ask the GPU providers about.

Paul van der Boor [00:35:09]: One thing that was interesting is that availability was also hard, right? I mean this shortage, we felt that sort of very real, right. Like it was actually hard to get access. And even some of the providers said, well, we only do, you know, millions of dollars dedicated capacity of course. And so, and then of course the quotas and the cloud providers and so that, that shortage was real.

Zulkuf Genc [00:35:37]: Yeah, especially if you have a certain type of GPU in mind. Like, hey, we'll give you as many A100s as you want, but for H100s you have to wait, and for H200s, well, we need commitment, things like that. And then, okay, we always want the best, fastest GPU, but sometimes there is a big cost difference. Luckily with Nebius, for example, we got the same price for H200s, so we migrated there. But that's also changing from GPU provider to GPU provider.

Demetrios [00:36:13]: And I can imagine it's not as easy to swap out the GPU provider as it is to just swap out the API if it's tokens. So going back to the point of if we can, let's think about just using the API. If we need to go further because of some certain criteria, then let's figure out what we have to do to make that the most efficient as possible.

Paul van der Boor [00:36:40]: Actually, swapping out GPU providers is easier than you think, other than maybe the data and so on. At the end of the day we've got pipelines of code that we run, especially if it's not tied to their frameworks. It's just getting the environment set up and so on. It's more that if you switch from an H200 to an A100, obviously you need to change a few things, but it's not super sticky, right. It's pretty commoditized.

Zulkuf Genc [00:37:06]: I mean, you need to change your data routes, like where the data is coming from. And also, on paper everybody supports every framework, but in practice it's never the case. So we started using a certain framework for training our models, and then the GPU providers needed to do some adjustments. So if you go to another one, I'm not sure if the out-of-the-box support will be enough for us to continue; probably they will also need to do some adjustments. But yeah, it's not a big deal.

Demetrios [00:37:40]: I know. You also mentioned one big thing for you is the support that the providers give you. And I can imagine that if the scenario is you're inundated with a lot of requests and you have to prioritize which ones you want to go with, you're not going to be offering support to the person who's saying, hey, can we get like spot instances? And we'll see. Maybe we use a little money this month, maybe not as much next month, versus someone who's going to pony up and pay the actual big dollars. So did you find providers that would offer you support despite not doing these large commitments from the get go?

Zulkuf Genc [00:38:21]: Yeah, we were lucky. And we saw the importance of support later. You know, initially you think, okay, it's just bare metal, we take it, we put our frameworks and libraries on it and we do the training. But many things can go wrong. Like I say, it's a huge network of GPUs, clusters, connected nodes — there are so many reasons that things can go wrong, and then you always need support. And we were lucky with Nebius on that front. We had a Slack channel and the guys were always available. Anytime we asked a question, we got an answer back.

Zulkuf Genc [00:38:54]: That's how we could move fast. Otherwise I cannot really imagine it, after seeing all this. Initially I didn't think we really needed that much support, but after seeing in practice how much we needed it, I think it's a big priority. I will definitely ask about it — it's one of the first questions.

Demetrios [00:39:10]: That moves up on the hierarchy.

Zulkuf Genc [00:39:12]: Yeah, definitely.

Paul van der Boor [00:39:13]: And I think it's easy to forget how hard it is to run clusters of thousands of GPUs. And not all companies are equally experienced in that. Right. And just having the GPUs doesn't mean that we can run our training runs uninterrupted. Right. I mean, sometimes we'd wake up the next day and the runs had failed. Right. Without any alert.

Paul van der Boor [00:39:35]: And so then you're losing time. Right. So that is definitely something that's important to keep in mind.

Zulkuf Genc [00:39:42]: Yeah.

Demetrios [00:39:43]: It also helps that they're right down the street from you. So you can go and pop into their office. Right?

Zulkuf Genc [00:39:47]: Exactly. It's just next door, so we can go knock on their door. That's helping.

Demetrios [00:39:54]: I wanted to get into the labeling that you do on the Toqan answers. This is changing gears. All right, so let's change gears for a moment, because I want to talk a little bit about the output and the labeling that you've done for Toqan and its answers. It's not evals, you mentioned — it's tagging, and it's different than the labeling parties you were talking about too. I think you've done some innovative stuff around figuring out how people are using Toqan without needing to read tons of messages or a lot of output from Toqan.

Zulkuf Genc [00:40:55]: Right.

Demetrios [00:40:56]: And also that helps because then you have a bit of privacy.

Zulkuf Genc [00:41:00]: Yeah, I mean, it's more than a bit of privacy. It's really a lot of privacy there. We cannot read any message — not even in our team — without the consent of the person. So Toqan messages are super private. Nobody can touch and read them, not even the database admins. So you have that requirement, and at the same time we want to understand how people are using Toqan. Right.

Zulkuf Genc [00:41:21]: Where Toqan is helping them. And also, companies come to us and say: I want to know how my employees are using it — are they really getting value from it? Or can you give me any insights into how we can support them, make it more useful, and things like that? That's the first point: understanding how people use it. But also, as the builders of Toqan, it's important for us to see if Toqan is fulfilling expectations, where it is delivering, where it's failing and where it's performing great. So, understanding Toqan's performance. The third thing is understanding the impact: where it's providing the most value and where we can improve it. Those are the dimensions we wanted to focus on when we put the tagging in place.

Zulkuf Genc [00:42:14]: So what we do there: we built a system that takes the Toqan conversations — not just the question, the entire conversation with the model: the user question, the model answer, and follow-up questions from the user with follow-up answers from the model. Then we give this conversation to our LLM and we ask for, I think, 25 tags. For example: what is the domain of this conversation — finance, for example, or HR, or data science, or IT? And what is the task type? It can be coding, debugging. And what is the use case? Because it can be coding, but it can be somebody from finance writing an analysis for evaluating a competitor or a startup. So okay, this is the use case, this is the task type, this is the domain. We make it very granular to see how exactly Toqan is being used. And then we also ask the model: okay, how much time are people gaining with this task? The model first considers how people would do this manually, without Toqan — okay, manually they would write this code, go check Google, probably spend 60 minutes. With Toqan, with the conversation and answer, it took 10 minutes, so they saved 50 minutes.

Zulkuf Genc [00:43:41]: Then we say, okay, 50 minutes of saving. I think on average each Toqan session saves like 17 minutes or something. And then in a day, if you use it three times, you save one hour by using Toqan. So that's also one of the metrics we surface, to see, okay, how much we gain by using Toqan. That's one of the tags. But I don't know if you want to deep dive into the technical part.
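[Editor's note: a rough sketch of the tagging flow described here — the full conversation goes to an LLM, which returns structured tags (domain, task type, use case, estimated manual time vs. time with the assistant), so no human reads the raw messages. The tag schema, prompt and `call_llm` function are hypothetical, not the actual Toqan pipeline.]

```python
# Hypothetical conversation-tagging sketch: ask an LLM for structured tags
# plus a time-saved estimate, without a human ever reading the raw messages.
import json

TAGGING_INSTRUCTIONS = """Read the full conversation (user and assistant turns).
Return JSON with these fields:
  domain: e.g. finance, HR, data science, IT
  task_type: e.g. coding, debugging, analysis, writing
  use_case: one short phrase, no private or sensitive details
  manual_minutes: estimated time to do this task manually (search, write code, ...)
  assisted_minutes: estimated time it took with the assistant
Return only JSON."""

def tag_conversation(conversation: str, call_llm) -> dict:
    raw = call_llm(f"{TAGGING_INSTRUCTIONS}\n\nConversation:\n{conversation}")
    tags = json.loads(raw)
    tags["minutes_saved"] = tags["manual_minutes"] - tags["assisted_minutes"]
    return tags

# Aggregating tags["minutes_saved"] across sessions gives the
# "about 17 minutes saved per session" style metric mentioned above.
```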

Demetrios [00:44:06]: Yeah, let's talk about the tags because I know the tags were a bit of a headache since it is so open ended.

Zulkuf Genc [00:44:11]: Yeah, that's the point. So you have two choices there. You either restrict the model to a certain number of tags, and then you say, okay, you have to choose one of those tags. But Toqan is being used by different companies, in different languages, for different types of things, so there is no way to capture everything up front. The model will always try to go beyond those tags, and then basically what you do is you just mis-cluster your tags. Then you introduce inaccuracy into your system. If you do that, all of a.

Demetrios [00:44:46]: Sudden you have 10,000 tags.

Zulkuf Genc [00:44:48]: Yeah, I mean, we didn't want to go that way. We said, okay, we are going to keep it free. But what we saw then is there's no end to it — it goes all wild. And the model is also not always consistent, even if you keep the temperature at zero. Sometimes it calls it Python — I mean, it's coding — sometimes script writing, sometimes programming. You can get many different tags for the same task, actually.

Zulkuf Genc [00:45:15]: So we needed to do lots of things there to kind of tame the model. We didn't touch the model, we kept it free, but we did lots of post-processing, and we used the model again in this post-processing. So we basically built a new way of LLM-powered clustering here. If you use embeddings directly, embeddings are also not perfect — you will get inaccurate clusters too. Hey, then the idea: we have such powerful models, why don't we use them for clustering? Technically you think, I'll put all the text, pour all the tags into the model prompt, and it will give me clusters out of the box, nicely organized. It turned out that after a hundred tags, models start failing.

Zulkuf Genc [00:46:00]: Some of the tags don't come out at all — the model forgets them. And the clusters: in the beginning it creates a cluster, but at the end it creates a very similar cluster, and suddenly it's not consistent at all. Again, hey, we cannot use that. It turned out we cannot really put in more than 100 — maybe with new models that has changed, but a year ago it was around 100. So we said, okay, we use the model, but we do the clustering ourselves. We create a hierarchy of tags.

Zulkuf Genc [00:46:30]: We created good-enough clusters, then used the model to refine those clusters. And that worked the best. So you get a kind of good-enough cluster, you ask the model if it is good or not. The model can say, okay, this tag is good, this one not — that should be the name for this cluster. So we go over all the clusters, and after that we did another round of clustering, because sometimes you just need five clusters to communicate to an audience that needs only the very high level.

Zulkuf Genc [00:47:00]: But we also wanted to go low level and see what people are doing. So we need about three levels of clusters there to deep dive into what people are doing. And it turned out, okay, I know the cluster — you are writing Python, and you are using it for finance somehow. But what exactly are you doing? I want to know a bit more, but it's also private, I cannot look into it. So we also created another tag giving a bit more detail — like the goal, without anything private, anything sensitive, just a few words more. So if you want to get a bit more of an idea, you can go read those texts in a nice UI.

Zulkuf Genc [00:47:41]: So with that, we came up with a solution where you can go from all the way down to all the way up across different clusters, depending on your needs. So we can serve it to our users, and it can tell you how people are using Toqan, how Toqan is performing and what the impact is.
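[Editor's note: a condensed sketch of the LLM-assisted clustering described above — embed the free-form tags, build good-enough clusters with a standard algorithm, then have an LLM name or refine each small cluster so it stays consistent. The library choices and the `embed`/`call_llm` functions are assumptions, not the exact pipeline.]

```python
# Sketch of LLM-assisted tag clustering: cluster embeddings first, then use
# an LLM only to name/refine each (small) cluster -- avoids dumping hundreds
# of tags into a single prompt, where models start dropping or duplicating them.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_tags(tags: list[str], embed, call_llm, n_clusters: int = 20) -> dict[str, list[str]]:
    """embed: list[str] -> array of shape (n, d); call_llm: prompt -> str."""
    vectors = np.asarray(embed(tags))
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(vectors)

    clusters: dict[str, list[str]] = {}
    for cluster_id in range(n_clusters):
        members = [t for t, lab in zip(tags, labels) if lab == cluster_id]
        # The LLM only sees one small cluster at a time, so it stays consistent.
        name = call_llm(
            "Give one short, canonical name for this group of task tags:\n"
            + "\n".join(members)
        ).strip()
        clusters[name] = members
    return clusters

# Re-running cluster_tags on the cluster names themselves gives the higher,
# five-or-so-bucket level used for high-level reporting.
```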

Paul van der Boor [00:47:56]: One of the reasons that we wanted to do this is because we have large volumes of data and we need to get some understanding of it in a way that scales and, in Toqan's case, is compliant with the privacy requirements of the SOC 2 certification and so on. So we built Flow, which is our data analyst agent. And it's an interesting case, because Toqan handles millions of questions, and the interest we had was: what types of topics, how helpful is it, how much time is it saving? And so we built the Flow data analyst. It took a lot of time, right? But now Flow analyzes these millions of questions continuously and provides this data. So it's very useful. But it turns out this pattern of understanding large volumes of data is useful in a lot of different settings in the group, right? People that are dealing with customer interactions or support tickets or other kinds of things, like sales calls. So we are now moving more heavily into this AI workforce as a topic and theme, and creating AI agents that become part of the team to help make things easier. We now have lots of data analysts connected to all sorts of internal datasets that people can ask data-analyst questions of.

Paul van der Boor [00:49:17]: So we've got Flow, our data analyst, but there are many Toqan data analysts out there in the group, in iFood and OLX and so on, that are doing very similar things: taking raw customer data, adding a layer of intelligence — clustering, tagging and so on — that can then be used by the teams to make informed decisions about what to do.

Demetrios [00:49:41]: So you did mention before that models are the fastest depreciating asset. And at the same time, I think I heard you just say that you're training models and you have to go out there and procure GPUs. How do you weigh those two seemingly opposite ideas in your head?

Zulkuf Genc [00:50:01]: Yeah, I mean, I think you must train models if you are in that game and you are serious about it, and if you have really good domain data, you have to leverage that. And in our case, we really have companies with lots of valuable data. So we wanted to create specialized domain models for these companies — for their own business, in their geographies and languages. For example, we are creating a large commerce model now that can help the customers of these companies way better than the generic models. Those models will know their domain better, and the companies can later build on top of them their agents, conversational assistants, and even use them for their existing AI systems. And I think we should be mindful about the depreciation part there: what depreciates is the base model. We are not interested in creating a language model from scratch.

Zulkuf Genc [00:50:55]: We have our pipelines, we have our data, and if the base model depreciates — for example, it was Qwen 2.5 a few months ago, and now last week we got Qwen 3 — we can just replace the base model and continue, and just enjoy the performance gains coming from the base model. On top of that we continue fine-tuning and even further pre-training and alignment. But that part doesn't really add that much to the cost.
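[Editor's note: the "swap the base model, keep the pipeline" idea amounts to keeping the base checkpoint as a single config value in the fine-tuning setup, so a Qwen 2.5 to Qwen 3 move is a one-line change. The sketch below assumes a Hugging Face Transformers/PEFT-style setup, which is an assumption on my part; the model IDs and LoRA settings are illustrative only.]

```python
# Sketch: the base checkpoint is just a config value; the domain data and
# fine-tuning recipe stay the same when a better base model is released.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE_MODEL = "Qwen/Qwen3-32B"   # was a Qwen 2.5 checkpoint a few months earlier

def load_finetunable_model(base_model: str = BASE_MODEL):
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForCausalLM.from_pretrained(base_model)
    # Illustrative LoRA setup for continued fine-tuning on domain data.
    lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
    return tokenizer, get_peft_model(model, lora)

# The domain dataset, training loop, and alignment steps are unchanged;
# only BASE_MODEL moves when a new base model tops the internal benchmarks.
```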

Paul van der Boor [00:51:28]: I think if you think about effort spent on training, it is about fine-tuning and pre-training for our specific tasks on top of an existing base model, like Zulkuf just said. We will see, and expect to see, many more of our proprietary models going into production. Today that is still a small percentage of the use cases; we try and test open-source models that we've fine-tuned because we want to make sure we know how to do that. I think in the next 12 to 18 months it's very likely that we will displace a lot of the commercial model traffic to our own models for very specific tasks, because it's higher performance and much more scalable in terms of cost.

Demetrios [00:52:18]: Yeah, the cost is a big one there, especially once you get it out to that many users and you're starting to see, wow, it's not only our internal company that is using it or our portfolio companies that is using it, but now when we start to push this out to their users, then that cost is going to go through the roof. If you're using just off the shelf API I can imagine.

Paul van der Boor [00:52:43]: Yeah.

Zulkuf Genc [00:52:44]: And also the customization factor: you have lots of flexibility. You have a model you can train on multiple tasks, and you can use it not just for one thing — you can also scale it to different tasks. Like, why don't we change things? I mean, we have lots of statistical models, right? In every company, the AI infra hosts I don't know how many XGBoost models and other models. Then, hey, we have better intelligence here — maybe we can also replace some of those models with this one. Now you start thinking that way about how to leverage that resource that I have, that I trained. Can we also introduce this task, maybe structure the data differently, but start utilizing better intelligence and get a bit more performance gain from it? And once you have your model and the training, you start thinking around that, and then you suddenly find yourself coming up with some innovative ideas — like, hey, I can also use it here. I can also.

Zulkuf Genc [00:53:36]: Let me try it there. Let me do this. Maybe then put this agent on this model rather than the OpenAI models, and suddenly you have something more interesting.

Demetrios [00:53:46]: It comes right back around to ProLLM. You're not going to know which ones are performing better in what ways unless you have something like ProLLM.

Zulkuf Genc [00:53:54]: Exactly. Actually, we also publish that — it's a kind of eval-driven development. When we have a use case, we always try to start from the end. We create an eval set for ProLLM and we put it there. What is our baseline? Okay, what do we expect from the model? We also do initial labeling for some eval sets if we don't have any labels.

Zulkuf Genc [00:54:14]: And then after that we start developing, training and then experimenting. But we always have the benchmark there. In ProLLM, we recently created private spaces, so those are not public — it's only for our team or Prosus companies. So when you start development on a project, every time you benchmark on your private space, you can see which model is doing what, which technique works best, which prompt works best. You can do the follow-up in that space, and then we iterate through the benchmark.

Demetrios [00:54:45]: Are you allowing folks from the portfolio companies to add their own eval space?

Zulkuf Genc [00:54:50]: That's the idea. Yeah.

Demetrios [00:54:51]: Yeah, that's cool.

Zulkuf Genc [00:54:52]: Everybody can use it. They can create their iFood space, they can create their OLX space and put things there. We give them the pipes, or the pipelines; the only thing they need to do is come up with the eval set, and then they put the eval set in and everything else is automatically run through all the models they want to test. We can also test the fine-tuned models, so they will be able to see those.

Demetrios [00:55:15]: That's all we've got for today. But the good news is there are 10 other episodes in this series that I'm doing with Prosus, deep diving into how they are approaching building AI products. You can check it out in the show notes — I'll leave a link — and as always, see you on the next one.

