Fine-Tuned Models Are Getting Out of Hand
SPEAKERS

A product engineer obsessed with solving natural language problems one conundrum at a time.

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
SUMMARY
How do fine-tuned models and RAG systems power personalized AI agents that learn, collaborate, and transform enterprise workflows? What kind of technical challenges do we need to first examine before this becomes real?
TRANSCRIPT
Jaipal Singh Goud [00:00:00]: Slow data is more when I want to emulate a decision-making process; a lot of things are beyond prompting. But if an agent or a system observes you for long enough, it can pick up on the intrinsic patterns of how and why you are making certain decisions.
Demetrios Brinkmann [00:00:19]: There's the whole small language model kick that you've been on, and then there's this other idea: as someone that works at a company, I have the internal uses of AI that hopefully I'm enabling with the agents. But then I also have just, like, when I interact with a chatbot directly, and usually I'll interact with Gemini and ask it questions.
Jaipal Singh Goud [00:00:44]: Correct. There's different use cases. Where you're doing general purpose stuff, you go to Gemini, you go to OpenAI. You want to summarize a document? Yeah, I don't message my colleague to summarize a document for me. But when I want my colleague's opinion on, is this compliant with what our company is doing? Like, if I'm going to adopt this and we're going to push this out, I go to my compliance officer because he knows information about the company nobody else does. And that's what we want to capture in these models.
Demetrios Brinkmann [00:01:10]: Why do you need a small language model, or why do you need a fine-tuned model for that, as opposed to some RAG-based system or something that is a little bit more simple to set up, for lack of a better word? Because I know you're trying to make this fine-tuning much more simple to set up, so I don't want to say that it's not simple. But it feels like if you're trying to fine-tune small language models, you're adding a bit more complexity.
Jaipal Singh Goud [00:01:40]: Yeah, I think we've taken care of the adding-more-complexity part for you, so you don't have to really take on the engineering overhead of fine-tuning models. And I think fine-tuning and RAG are complementary. At the end of the day, when you want to add static memory sources for an LLM to refer to during inference, you could do RAG, or you could do an agentic memory framework to supplement it. But the model itself, what should it be sending as a query to the RAG system? Is it aware of the overall context? One of the issues with agentic RAG is that a lot of times, because you're doing only semantic similarity, framing the right questions, or framing the right intermediate questions, becomes really difficult. And that's where you want to do fine-tuning on large context, that's all the enterprise data you've collected, so that it knows what the right questions are to ask for the RAG to go and retrieve memory. Because fine-tuned models do not evolve as fast as RAG datasets evolve. Right? RAG datasets, you can add new context every day.
Jaipal Singh Goud [00:02:40]: But what questions should be asked, and what data should be retrieved from the RAG system, is what a fine-tuned model would do much better than a vanilla base model.
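To make that division of labor concrete, here's a minimal runnable sketch: the fine-tuned model frames the questions, and the RAG store holds the fast-moving facts. The function names, the toy substring search, and the corpus are illustrative stand-ins, not Prem's API.

```python
# Toy sketch of the split described above: the fine-tuned model's job is to
# frame the right questions; the RAG store's job is to hold fast-moving facts.
# Real systems would use embeddings, not crude term overlap.

def frame_queries(task: str) -> list[str]:
    # Stand-in for a fine-tuned SLM that has absorbed which questions matter.
    return [f"compliance policy relevant to: {task}",
            f"prior decisions similar to: {task}"]

def search(corpus: dict[str, str], query: str, k: int = 2) -> list[str]:
    # Stand-in for semantic search: rank documents by term overlap.
    terms = set(query.lower().split())
    ranked = sorted(corpus.values(),
                    key=lambda text: len(terms & set(text.lower().split())),
                    reverse=True)
    return ranked[:k]

corpus = {
    "doc1": "Policy: vendor tools must pass a compliance review before adoption.",
    "doc2": "2024 decision log: we adopted an open-source vector database.",
}

task = "adopt a new vendor analytics tool"
evidence = [hit for q in frame_queries(task) for hit in search(corpus, q)]
print(evidence)  # the fine-tuned model would now reason over this context
```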
Demetrios Brinkmann [00:02:49]: So then do you have some kind of framework that you think about for fine-tuning certain data versus throwing it into the, almost like, fast data? You've got the fast and slow thinking. How do you look at, we're gonna make that a fine-tuning thing versus we're gonna make that a RAG thing?
Jaipal Singh Goud [00:03:11]: You would do this as a RAG thing when there's a lot of factual information that needs to be recovered or that needs to be given out. So if you're doing something like customer support, then a RAG agent or a RAG system is really good, because you can refer to similar semantic queries, really quickly retrieve them, and then give them to the user. And as new use cases get added, as my database is growing, it's plugged into the RAG and I can recover them and give them out. So I'd say yes, that's where I use fast data. Slow data is more when I want to emulate a decision-making process. Right. When you want to see, okay, what considerations go into making this decision, it's not really black and white.
Jaipal Singh Goud [00:03:51]: You consider multiple variables as you make those decisions. And fine-tuned models are really good at emulating how you would think to make those decisions. So what are the more mission-critical operations? For those we go for fine-tuned models. And then ideally, in the best-case scenario, you would have them both working together, because then the fine-tuned model can refer to similar cases, draw inspiration from them, and say, okay, this is what we've done, but the problem I have is completely new. So I know what our framework and our embodiment of thinking is; here are some things that people have been doing; now, based on all this info, here's my suggestion or here's my recommendation. Because I like to think of models really as just humans. Like, how would you do it? Okay, if you ask me, I have notes. I carry my notes with me everywhere, and my notes are like my RAG.
Jaipal Singh Goud [00:04:40]: I have things written down. And when someone says, let's talk about vaccines, sure, I have something written about these vaccines, I know what you should do. But if somebody asks me, what's your opinion on getting vaccinated? Okay, it's something that I've thought about, something I've known, something I've gathered from multiple sources, and that's where my fine-tuned model kicks in. So for the different kinds of problem statements you have, there are different approaches you would take.
Demetrios Brinkmann [00:05:02]: Thank you for getting this podcast banned on YouTube now for mentioning vaccines. I appreciate that one.
Jaipal Singh Goud [00:05:08]: It's not on purpose.
Demetrios Brinkmann [00:05:12]: Okay. But we were also talking about. So I like this how you're thinking about the different decision making process and almost like what my values are versus there's this fast stuff that I can reference and I don't necessarily need to always have that in mind. I can reference it. I know where to go to find it.
Jaipal Singh Goud [00:05:33]: Yeah.
Demetrios Brinkmann [00:05:34]: And we also talked about the, what were we calling it? Like the AI workers, the AI workforce. We didn't really have a good term for it.
Jaipal Singh Goud [00:05:42]: Yeah. Because it's so difficult to put a term to it. Like, what do you call them? It's almost everything: an AI assistant, an AI agent, an agentic workforce.
Demetrios Brinkmann [00:05:52]: And none of it sounds really appealing.
Jaipal Singh Goud [00:05:55]: And none of it sounds human.
Demetrios Brinkmann [00:05:56]: Yeah.
Jaipal Singh Goud [00:05:57]: And the issue with AI systems is they are low-trust systems in a high-trust environment. Right. People don't like to read documents written by ChatGPT. Like we were talking about earlier: when somebody sends a doc across to you and you see those double hyphens between the text, or emojis galore, your internal classifier goes, nah, I'm not going to pay attention. And then you ignore all the effort that went into writing the document.
Demetrios Brinkmann [00:06:23]: Yeah, well, it feels to me, whenever I get one of those, like, oh, this person just phoned it in, they prompted it. And now I have to be the one that spends my time on this jargon that got output.
Jaipal Singh Goud [00:06:37]: Yeah, yeah, it's so true. And that's where, coming back to your point, and I'll come back to this in a second, when you mentioned the AI agent, AI workforce example. Right. I think making it more human would solve this problem of also making it more consumable. Because when we talk about these concepts of AI-assisted work or AI-assisted workflows or AI-enabled workflows, you want them to not only think like a human but also behave like a human. Right. So a human would send you a Slack message about something that you asked.
Demetrios Brinkmann [00:07:19]: And the worst kind of humans would send you a Slack message that just says, hey, you around?
Jaipal Singh Goud [00:07:25]: I hate those.
Demetrios Brinkmann [00:07:26]: They don't give you any context or anything. And it's just like, no, I'm not. Fuck off. Anyway, sorry to derail that.
Jaipal Singh Goud [00:07:35]: No, and I think a lot about building these AI agents and enterprise AI agents is about building the interaction design around them. You can build the best model, like OpenAI's GPT-4o. They're great models, but their interaction design is not something that makes people comfortable working with them. We develop these biases towards them, and we've sort of, as a society, developed this bias right now about documents generated by ChatGPT, that we will not read them. And you said it rightly: it almost feels like they're just throwing it my way and they want me to review it. They've not even taken the pain of deleting the emojis. So I think a lot about this workforce: while there's the technical part of it, there's also a huge design aspect and the nomenclature aspect.
Jaipal Singh Goud [00:08:25]: Like, what do we call an AI workforce? Enterprise agent thing. Yeah. It's something that even we're still figuring out where we point the finger.
Demetrios Brinkmann [00:08:36]: But how do you plug the small language models into that? I've heard it referred to as, almost like, if you look at the jobs-to-be-done framework and you look at what I do as a marketer, I'm moving around data and transforming data. And ideally you can get an agent to do some, or most, of that data transformation, moving, presenting it in different ways.
Jaipal Singh Goud [00:09:06]: So, enterprise agents, or small language models working with agents. Right. I think it's important to define what an agent is. An agent is essentially, in my opinion, three parts. One, you have the intelligence layer, which is more of the decision maker. Then you have the memory layer, which learns and remembers things. And the third bit is the actions: what can the agent do? Can it perform something? Can it read? Can it write? Can it make a phone call? I think it's these three things together that make an agent, in any environment, in any setting. Now, small language models fit into the intelligence layer of the agent.
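A minimal sketch of that three-part decomposition, with hypothetical class names standing in for real components:

```python
# Minimal sketch of the three-part agent described above: intelligence decides,
# memory learns and remembers, actions execute. All names are illustrative.

class Memory:
    def __init__(self):
        self.log: list[str] = []
    def recall(self, event: str) -> list[str]:
        return [e for e in self.log if event.split()[0] in e]
    def remember(self, entry: str) -> None:
        self.log.append(entry)

class Intelligence:  # the decision maker, e.g. a fine-tuned SLM
    def decide(self, event: str, context: list[str]) -> str:
        return f"respond to {event!r} using {len(context)} remembered items"

class Actions:  # what the agent can do: read, write, call, click
    def execute(self, decision: str) -> str:
        return f"executed: {decision}"

class Agent:
    def __init__(self):
        self.memory, self.brain, self.hands = Memory(), Intelligence(), Actions()
    def step(self, event: str) -> str:
        context = self.memory.recall(event)
        result = self.hands.execute(self.brain.decide(event, context))
        self.memory.remember(f"{event} -> {result}")
        return result

print(Agent().step("review this vendor contract"))
```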
Jaipal Singh Goud [00:09:45]: Right. And small language models off the shelf are really not that intelligent. I mean, when I say small, I mean anywhere between, I'd say 1 to 7 billion parameters in size. And they're not really very intelligent off the shelf, but they're really good students and they're really, really good at absorbing knowledge if you teach them the right way. Right. How you train, when somebody New joins your company, or when you have an intern fresh of university coming in and you want to train them in how the organization works, they're fresh, they're ready to learn. A small language model is like that. If you feed it the right data of how you perform certain processes, you can fine tune it to emulate that behavior.
Jaipal Singh Goud [00:10:30]: And it's important to use a small language model and not a larger one here. Because small language models are easily trainable, they require much less compute. You can fully fine tune them instead of just building Lora adapters on top of them, which you end up doing with the larger models. But smaller ones, you can do it entirely. That helps you instill much more deeper insights into the model itself. But when you're doing these fine tunings, you can run RL on them as well. So if you want to do reasoning thinking, you want to run some GRPO chains on it, you can do that effectively, cheaply and fairly fastly as compared to if you do it with a much bigger model. And for SLMs working in enterprises, it's essentially about capturing process knowledge.
Jaipal Singh Goud [00:11:15]: You just said as a marketer, a lot of your job involves moving data from one place to the right place and then deploying it. But you do a lot of thinking in the middle, right? And this thinking is in your head. But if an agent or a system observes you for long enough, it can pick up on the intrinsic patterns of how and why you are making certain decisions and then instill all of that knowledge into a model. And that's what we try to do when we fine tune small language models is, is instill these intrinsic patterns into them. Something you cannot achieve with prompting. Like a lot of times people confuse that. Can we just prompt them? But then a lot of things are beyond sort of prompting.
Demetrios Brinkmann [00:11:58]: Well, and this is where that high-trust, low-trust thing comes in. Because if I am going to let someone observe me at my job, first of all, that makes me uncomfortable just thinking about it. But second of all, I really want to make sure that, from one angle, I am not giving up trade secrets, and from the other angle, that I'm instilling the right things into the model.
Jaipal Singh Goud [00:12:29]: True. I think I also hate being monitored. Like nobody wants to be seen what you're doing when you're doing. But creepily, it's becoming the norm that we are letting sort of observers into our ecosystem. You know, when you go to a Google Meet or now even a slack hurdle, there's this AI which just creeps in and it's listening and it's making notes. I'm like, you're so it's slowly sensitizing us to the fact that there's going to be these latent observers behind the scenes collecting data on them and it's only going to, I'd say, grow from here we are Microsoft launch products where they sort of record your screen and they collect information.
Demetrios Brinkmann [00:13:15]: Well, I mean you're even wearing the meta glasses.
Jaipal Singh Goud [00:13:17]: I'm wearing the meta glasses, yes.
Demetrios Brinkmann [00:13:19]: And it's becoming more popular, like pendants that will record your whole day.
Jaipal Singh Goud [00:13:23]: Exactly. So constant data capture is becoming a norm. But the most important part is where does this data go and what really happens with that data? Right. And this is where we've gotta be really, really careful. And we need to give individuals whose data is being collected the authority and the option to control where this data flows and how it's being used. Right. And as we're building prem, that's one of the things that we're doing very, very carefully is when we help you build these small language models or even medium sized language models built on your data. Our promise to you is that this data does not leave.
Jaipal Singh Goud [00:14:04]: Our ecosystem is only exposed to open source models that we host and we run. And then we have like a 7 day delete policy after that that once your models are trained and fine tuned and you walk away with them, we delete all of the data from our infrastructure from our logs. So then you know and you're confident that it's not being used in a manner that you're not aware about. And that's the most scary part. I mean, not knowing where your data is going is worse than knowing that somebody is doing something mean with your data. And that's what I think we're very careful about. Yeah.
Demetrios Brinkmann [00:14:37]: And so getting back to the capabilities that you're trying to unlock with this almost like virtuous cycle of observer, seeing how you're making your decisions, picking up these patterns and then creating workflows or agentic workflows out of it. Like how does, what's that next step of okay, I see that you do XYZ whenever you pull up Google Ads. Now you're creating a workflow or you're fine tuning a small language model that can reason on Google Ads. Like yeah, land the plane for me.
Jaipal Singh Goud [00:15:14]: Great. So I think this, there's three things in building this virtual workforce that we talk about. It's data collection, it's model fine tuning or knowledge distillation in particular and then it's using them in action. So we Covered data collection a little bit. Right? Data collection can happen on your Google Meets, on your notes, on your slack notion, process documents, or even just screen.
Demetrios Brinkmann [00:15:39]: It's watching your screen.
Jaipal Singh Goud [00:15:40]: Just watching your screen. Exactly. Taking screenshots, understanding the steps that you take to reach from point A to point B, that look like success for you, that's the jobs to be done and then how are you doing the job? So that's data collection, there's data parsing and then there's building the data sets. Right. So imagine we've built a data set to fine tune an LLM. Usually data sets to fine tune LLMs are conversational data sets. So they would be like, hey, I'm at this stage and I see this, what am I doing next? And the user replies, ah, okay, I would click on this.
Demetrios Brinkmann [00:16:12]: So it's a data set for a specific task or it's a data set for your whole day or what does that data set look like and how do you curate it?
Jaipal Singh Goud [00:16:22]: I think that data set is for your role and your job. So if you are a marketer and you're Demetrios, it would be Demetrios model. It's Demetrios way of operating when he's doing some work. So it's like your digital clone sidekick. It's your sidekick, it's your sidekick who's always with you. So this data set really captures how does Demetrios work and then why he does the things that he does. So that's the data part of it then on the fine tuning part of it. So with prem, we've built this product called Prem Studio wherein you can feed in all of this data and then we take care of building synthetic data for you that can be used to fine tune the slms and then also selecting which SLM is the best for you to fine tune and then running the jobs for you as well.
Jaipal Singh Goud [00:17:12]: So we do all of the MLOPS monitoring and sort of taking care that the GPUs are provisioned all the different. If you're doing any RL, we take care of running RL experiments as well. So that way you can train a bunch of different open source models and then once you get them back, we evaluate them on some of the data set that we'd held out as our test data set early on. And the goal at the end of the whole training process, what's your training objective? Right. Your training objective is that if given a situation and certain context about the situation as the input to the slm, the SLM should Predict the next step of what needs to be done. Right. So that is what we do. Now how does this go into action? Right until now, agentic interfaces have been really chat first.
Jaipal Singh Goud [00:18:02]: You know, you talk to an agent and it gives you some information back. But that's not again, that's not how Demetrios works. Dimitrios does stuff, he's more than just talking.
Demetrios Brinkmann [00:18:11]: I like that. Action oriented.
Jaipal Singh Goud [00:18:14]: So, so now the next part is having these action, having these sort of models deployed and giving them access to a set of tools. So if the set of tools that you want to run this model on prem and then give it access to like your calendar, your Gmail, maybe your slack, maybe your GitHub, right. And then you control the permissions of what the model can do for you. And then you just execute and you say okay, now it's running, it's always on. We can give it an email address or we can give it a Slack handle. It can join your Slack channel as one of your team members. Then in action when you send it a message or when you send it an email, it will emulate how you as an individual would be responding to it. Right? And it would do that action.
Jaipal Singh Goud [00:19:03]: If you send it an email, it would read the email and it would think how Demetrios is gonna respond to it, Write back the response and send it back to you. And that's where we see it first. You can send it a document to review. Maybe somebody sent you a five pager, you send it to him on Slack and say, hey, can you review this and tell me what do you think about this? Or is this compliant? Or is this something that you know, aligns with our company's goals, objectives? That's phase one is where it's still conversational, but you're moving to multiple channels of conversation. You're going beyond just like one chat, you can set it a voice note and it understand the voice note and reply back to you in text. Right. But it's how you would work with a coworker. The next step after that, which we see coming in, let's say about a year from now, is when you start giving it control over your systems and your tooling, where the model can take control of a digital workspace and start doing things in browser on your screen using the tools that you use on a more day to day basis to perform more solid actions beyond conversation for you.
Jaipal Singh Goud [00:20:09]: And I think that's very important. I'm super excited about all the work that's being done for computer use for Browser use. I think they're really, really great frameworks and we see some really good models coming out as well. We've seen the agents come out to do it, but then a lot of the frustration of the people who are using these agents is like, he's not doing things the way I would do it or how I would do it. And that's what we want to solve for and that's what we think is going to be the next evolution in this is how you sort of delegated tools to do stuff on your behalf.
Demetrios Brinkmann [00:20:42]: It almost feels like you would need many small language models in a way that you're not only going to be using one for Demetrius and it feels weird referring to myself in the third person, but you're not going to have that as this is my small language model and it's my one that's been trained and retrained and continuously retrained on. It's almost like this is my Google Analytics small language model that my big brain model can call as a tool. Or there's the web browser model that can be called as a tool in that way. Is that kind of how you're thinking it's going to shake out?
Jaipal Singh Goud [00:21:26]: Yes, there would be two ways. There would be almost a hierarchical structure of models which would be experts in certain things. So you could look at it as a mixture of experts, but decoupled almost mixture of experts, like in one way. So we decouple them and we have lots of small language models. But you would almost have like the social media side of Demetrios, the social media expert. Initially, when we are having a less amount of data, we would keep it confined to a singular model. Right. Because then otherwise you become too thin across.
Jaipal Singh Goud [00:21:56]: You spread across different models. But as the volume of data grows. Yes. You would want to split it out into different sort of individual language models that can take care of it. But on top of it, you may have your vision models which are parsing the screen and understanding what's on the screen. Where do I click, where do I press? So that would be the sort of layering on top of it is to understand context, to understand different tools. Maybe we have something that can take control of your phone and do things on your behalf on the phone. So that would be the layer and then that would parse the information and then give it down to the ideal model.
Jaipal Singh Goud [00:22:33]: Depending on the application. You are in that. Okay, now how would Dimitrios, maybe you know, do marketing on Telegram on his phone? He's in these five groups. What do we send. We've just launched a new tool. It's always been about shilling.
Demetrios Brinkmann [00:22:46]: It's always, that's we're on telegram. That's just shilling. Well, one other piece that I wanted to touch on was how you think about the differences between the workflows that are almost hard coded. And when you observe me doing something multiple times in a row, it's almost like you know that I'm going to do XYZ and there's that next best action. And you don't necessarily need to have the agentic capability be this open space that it can choose whatever it wants as an expect best action. Because you've seen me do it this same way five times.
Jaipal Singh Goud [00:23:26]: True.
Demetrios Brinkmann [00:23:27]: And so it can be a hard coded workflow versus oh, just go figure it out agent. And then you have that reliability a little bit more certain.
Jaipal Singh Goud [00:23:39]: What you're talking about is almost like you can build process Markov chains. Yeah. Where a Markov chain is essentially a set of events happening and then a probability connecting them, saying, okay, this, if these three things happen, what's the probability of us going into either of these one or two ways? And if there are certain chains which are like not too divergent and they're always going straight. Yeah. You can be certain that if these four steps were taken, this is what the next step is going to be. And I think that's going to be more on. This is going to sound meta, but if, if you were to run an analysis on the data, not to, not to sort of emulate how you behave, but to understand, okay, what patterns do I see in this behavior? I see a lot of noise or a lot of. A different distribution for when you are working on Twitter because you're doing like 100 different things.
Jaipal Singh Goud [00:24:32]: But then when you go to Google Ads, maybe there's just one workflow that you are always doing that and the distribution of data and that looks fairly, fairly just linear. It's very, it doesn't have a high amount of variance. It's really tight and in one place. So I think it would be interesting to run some experiments just on the data set that's collected to explore or to surface what these latent patterns are and then maybe to see if we can control the inference space when the model is running inference based on these inputs. Maybe there could be some work done with memory frameworks or some confidence scores being paused that, hey, okay, if, if I'm at this step, has Demetrios done something like in the past? How many times has he done Something else. It's about framing those questions, retrieving that information and then also taking the right step and controlling, let's say, your answer space during inference.
Demetrios Brinkmann [00:25:25]: Well, the other piece on this is that a lot of the things that I do are not just specifically done in the browser. And so there's taking data from the browser and then again transforming it or downloading it to my local computer and then putting. Putting it on a program that I have. I'm thinking specifically about when I edit this video and I'm going to be taking the data and then putting it onto the DaVinci resolve. And then I do a few things. I'm going to import the different vocal tracks and the different videos, maybe do some color correction, but I'm also going to put them on one timeline is going to be just the full thing with a few edits and then there's going to be the clips, timeline. So all of that is very menial work that I. Every time I do it, I'm like, I really wish at least like some kind of a template could be done here because it's not necessarily that it's a template within DaVinci resolve.
Demetrios Brinkmann [00:26:25]: It's not as easy as that.
Jaipal Singh Goud [00:26:26]: It's true.
Demetrios Brinkmann [00:26:27]: It's like take these that are in this folder, upload them, also upload these other the intro music and then create these different timelines and find the different clips and then the clips are going to be verticalized. So there's all these things that I know could be templatized, but I guess the question that I'm trying to ask is there's different abstraction layers that you play at and that's just for one of my workflows. Yeah, I imagine somebody that is a dentist that's trying to figure out their systems to work with and the insurance that you can pull from and trying to charge their client uses a whole lot of other systems. Right. And so how can you enable the agents to take actions when the distribution of systems is so large?
Jaipal Singh Goud [00:27:25]: Yeah, I think when you move more into working with custom software, it really starts getting challenging with less examples to train or fine tune a model on and then delegate to it 100% responsibility to be autonomous and work on your behalf. Because you are making not just a big plan of execution, but at each step in the plan you are making a creative decision of what it should be. Like your video do I do like 3 seconds? What title font do I use? Color grading. The lighting was bad that day and my other videos have this kind of color grading. So you're making a lot of like these atomic decisions as you go along the process, right? And this is where I believe, I believe this is also a lot of creative work. Like there is creativity in every field. Engineering, medicine, video editing. It's a very creative process in such fields.
Jaipal Singh Goud [00:28:20]: I think it's really going to be human augmented agents that run. So you wouldn't give them complete authority and autonomy. Maybe you would make a plan for them. You would say, hey, okay, I want you to now edit this video. Here's the plan. I want you to do the next following 10 things. It says, first is, okay, figure out a good title sequence. Then first five minutes, I want you to focus on this, two on this, three on this.
Jaipal Singh Goud [00:28:47]: Maybe you give it that whole plan and then you let it go and do its thing. Today we rely on these agents to make the plans for us as well. But I think it's not the right approach because when they fail and they fail miserably, a lot of times there's not a lot you can do about it. So we need something where me and the AI agent have a common playing ground and it's something we can both read on, we can both riff on something maybe the agent can add to the plan and say, bro, I don't think these two steps are good. You want to add this. So you vibe on the plan for a while and they're like, okay, now the plan's ready, let me go and try and execute it. And maybe it's going to stop at some point. Maybe you get a notification saying, hey, I've done these four things.
Jaipal Singh Goud [00:29:30]: Do you want to review? And maybe this review step could be a part of the plan. Saying, okay, step one, step two, step three, review. Step three, step four, review. So you work with it instead of just delegating and then going to bed and hoping it's done. For some, for some domains you can do it, but it's not always like now you've got like coding agents, you've got cloud code who would sort of go and you give it a sort of issue and it creates a PR for you, it writes the code for you, but you have to review it and you have to go back and forth sometimes with it. And it would be the same with these sort of. You bring the same analogies to these different spaces and these different tools. But it's just that these are more information, sparse areas, they're more creativity rich areas.
Jaipal Singh Goud [00:30:13]: So we need to think of frameworks of how humans and agents would work together here.
Demetrios Brinkmann [00:30:18]: And I Guess the other thing on the information sparse piece is that a lot of this stuff doesn't have APIs that you can just ping.
Jaipal Singh Goud [00:30:29]: Yeah.
Demetrios Brinkmann [00:30:30]: And so integrating with these tools is so hard because it's not just, all right, we have an API or an MCP server these days. And so how do you even think about that? And that's where a lot of the, I think people get excited about the browser use or the computer use because you can bypass the API and you just go into the gui. However, I think anybody that's played around with computer use is like, yeah, it's cool, but it's also very hard to make consistently do things.
Jaipal Singh Goud [00:31:01]: True. So first off, I'm bullish on computer use and I completely acknowledge that the interfaces we have right now for tooling such as APIs, MCPs, they are not for, let's say consumer software or consumer products. They are more for SaaS. All SaaS tools have an API which is very easily consumable and because they were built for that purpose. But Premiere Pro probably doesn't have like a kick ass API you can edit a whole movie on. So because it's not made for that thing, it was never made to be used with an API. Right. It was made for somebody to look at it, to see what's on it and then make those decisions.
Jaipal Singh Goud [00:31:37]: And I think this is where computer use will really become the norm for a lot of operations that we sort of work on with AI agents. And yeah, as somebody who is sort of always playing around with computer use, I completely acknowledge that. Sometimes it's too slow, sometimes it veers off path, sometimes it gets stuck on really dumb things like, bro, it's right there.
Demetrios Brinkmann [00:32:01]: Like, come on, just scroll down the damn page, please.
Jaipal Singh Goud [00:32:07]: And this is where sort of purpose specific training data is going to become super important. And this is where fine tuning will come into play. You cannot prompt it, you cannot always prompt it because your prompt will explode because there's so much context it has to go through. Right. If you want long tail tasks, it needs so much context. And this is where you want to do process fine tuning. On this model I see this and then I click here and then I do this and then these are the chains. I think, why am I clicking here? Why am I going there? I also think there's going to be tool specific models that come out very soon.
Jaipal Singh Goud [00:32:40]: Computer use model fine tuned on Photoshop, right. Imagine if somebody at Adobe sat down and collected and made this data. The query being, I want to, you know, add this hue to my image. And give it a 1970s vibe. Right. And then there's a screen recording of five minutes of a great designer doing this. That's your data point.
Demetrios Brinkmann [00:33:04]: Yeah.
Jaipal Singh Goud [00:33:05]: So there would be these process specific models or, sorry, software specific models that would come out that would then help you do it. And then you can fine tune them further on your style. Okay. This is the base model for Photoshop. I'm going to make a Demetrios version of the Photoshop based model because that's my style and my vibe and then I make it run well.
Demetrios Brinkmann [00:33:25]: It also makes me think like, then where do you plug that in and how do you interact with it? Because we've seen and we know that the agentification of the Internet and SaaS is coming and it's almost already here in certain ways. Like if you're dealing with cursor, if you're dealing with lovable, you just talk to it. Yeah. And then it gets done. And so in Photoshop, I think Photoshop has kind of tried to do this, but it's more with image generation where you can generate an image. I haven't used Photoshop in a while, but I, I also. Let's think about like Premiere Pro and I want to edit this or create a few snippets or whatever. I just talk to it and then it does it create a new timeline and then do this or I don't even need to be that specific because it already knows if I want to create clips, it's going to create a new timeline for each of those.
Demetrios Brinkmann [00:34:20]: But am I talking to Photos or Premiere directly? Am I talking to an overarching agent that then will use Premiere as a tool? Where does that abstraction fit in?
Jaipal Singh Goud [00:34:36]: Right.
Demetrios Brinkmann [00:34:37]: Because if you're thinking about it from the enterprise perspective, again, going back to, we've been talking very consumer ish, but if I now am in the marketing position and I'm trying to do these three things that require me to use a specific tool that we have at my company, Maybe there's the API for that, maybe not. Like you were saying, like SaaS, there's great APIs because that's how we've been dealing with them, but maybe it's local to my own computer and that's how we have to use it. And so do I have my agent that is powering or my Sidekick that is now going into each individual SAS tool and grabbing the data, transforming it, bringing it into the next SAS tool, or is it within the SaaS tool? I tell that tool do this.
Jaipal Singh Goud [00:35:33]: Yeah, it's something We've been thinking about a lot and I think where we've come down to is that it's going to be an intermittent layer, which is more at an operating system level instead of the application level. You would have, let's say a dynamic model orchestrator who would given the task, of course, plan on what sort of models it would need to use to reach the final goal. Right. So if on your screen there is Premiere Pro open or you wanted to eventually open Premiere Pro and do something, the planner would, this would be like an intermediate agent that's running in the middle. You would give it that instruction. Its job would be, okay, first command, move the cursor, open Premiere Pro or use the CLI to open Premiere Pro. And then behind the scene it's okay. Next step, load the Premiere Pro model into memory.
Jaipal Singh Goud [00:36:29]: Start using that for inference. And then, okay, now look at the screen Premiere Pro model and tell me, I want to achieve this goal. What do I do? What should I do next? And then once the task is done, that model gets offloaded and then maybe you want to publish on YouTube. So like the YouTube specific browser use model gets loaded in and step one, open browser, go to YouTube.com, step two, go here. Step three, create a title. Create a title in Demetrios style because it's fine tuned on how you would give it a title. And that's where sort of this personalization.
Demetrios Brinkmann [00:37:02]: And there's a whole thumbnail one. There's a whole.
Jaipal Singh Goud [00:37:04]: Yeah, so it's going to be this family of models and we're 100% confident that it's going to be a billion models fine tuned for specific people and applications that are going to be live out there that will get hot swapped at runtime and then be used for inference. Boom. Yeah, right, right, right, bro. It's right there.
Demetrios Brinkmann [00:37:46]: I like that.

