Computers that Think and Take Actions for You
Speakers

Zengyi holds a PhD from MIT and is the founder of the OpenAGI Foundation. He leads a team pioneering Computer-Use Agent models that can understand screens, click, type, and operate computers much like humans do. Their model already surpasses comparable models from OpenAI and Google.

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
SUMMARY
What if the computer itself could think and take actions for you? You just give it a goal, and it performs every click, type, and drag to get work done across the desktop and web. In this talk, Zengyi reveals the breakthrough technology that his company OpenAGI is developing: AI that can use computers like humans do. He talks about how his team developed the model, why it outperforms similar models from OpenAI and Google, and its wide range of use cases across different domains.
TRANSCRIPT
Zengyi Qin [00:00:00]: In five to 10 years, the computer will change a lot, and the interface between humans and computers will also change a lot. And the fundamental enabler of this is AI agents that understand human intent and are able to operate digital devices by themselves.
Demetrios Brinkmann [00:00:30]: What's your story?
Zengyi Qin [00:00:34]: I started working on AI models in 2016, when I had just gotten into college and people were working on self-driving, and especially on computer vision, where you teach the car's perception system to detect the cars and the pedestrians around you so you don't hit them. I don't like taking classes, I don't like doing exams, so I found opportunities to do research in our university's lab. And later I was fortunate enough to go to Microsoft Research Asia as a second-year college student, even though they typically take graduate students. But I was fortunate enough to prove myself and get in. So I was working on a self-driving 3D computer vision program. And later I went to Stanford, to Fei-Fei Li's lab, which pioneered computer vision.
Zengyi Qin [00:01:53]: And there I started working on not just perception, not just sensing the environment, but also taking actions on the environment, which is actually robotics. It closes the perception-action loop. Perceiving the environment and taking actions on the physical environment is robotics. I really like the concept of closing the perception-action loop. So later I went to MIT for a PhD, but as I said before, I don't like taking classes, I don't like taking exams. So I felt that in the future I would either go to a big company such as Google or Meta, et cetera, or start my own company. I didn't want to go into academia because I don't like writing papers. So that's why, while I was doing my PhD at MIT, I started my first company, called MyShell, which is an AI agent platform. We basically provide an infrastructure for creators or developers who don't know how to build AI models.
Zengyi Qin [00:03:14]: But we provide all those AI models as LEGO bricks so that they can orchestrate the models to build their own applications. So right now it has around 6 million active users, and it went publicly listed around half a year ago. And during that time we also did a lot of open-source research work, because I believe science is better when it's open. We released two open-source models at that time. One is the voice model called OpenVoice, which was released just after ChatGPT released their voice mode. Because our model at that time performed on par with OpenAI's model, but was free and open source, it immediately caught a lot of attention and went straight to GitHub trending global top one two days after we put the model on GitHub. It has around 30 to 35k stars right now.
Zengyi Qin [00:04:27]: And I believe it ranked around the top 0.3% of GitHub projects at that time. We had another model later called MeloTTS, which is a text-to-speech model that also ranked GitHub trending global top one. At least based on my understanding, we were the only team with more than one project trending at GitHub top one in 2024. You can get there once, but getting to the top one twice is very rare. Later we released a model called JetMoE. The context was, at that time there was the Llama 2 model, which was very famous, and a lot of frontier labs claimed they had hundreds of millions or billions of dollars to train their models.
Zengyi Qin [00:05:35]: And it was commonly believed that it's impossible for a small team to train a model from scratch, although they can of course fine-tune a model based on the already-trained Llama model. But training from scratch, doing the pre-training stage, is extremely challenging. We didn't believe that, and we tried to train a model from scratch and match Llama 2's performance, which was trained with hundreds of millions of dollars. Given the constraints, there were mainly two issues: one is where to get the data, high-quality data, and the second is where to get the compute. Both questions are extremely critical. Basically it means you have neither the fuel nor the car to drive.
Zengyi Qin [00:06:30]: But creativity likes constraints. From the data perspective, we were able to extract from public datasets a layered dataset: at the bottom of the pyramid there's a large quantity, but the quality is relatively poor, and as you gradually go up the pyramid, the quantity decreases but the quality increases. So when we train the model, we first train with the low-quality data and then gradually converge to the high-quality data. Even though we don't have too much high-quality data, we eventually converge to that data in the training process. It turns out to be extremely effective. And for the model architecture we designed a novel architecture called mixture of attention, which cut the computational cost by 75%. So eventually we were able to train that model with around $100k and match Llama 2's performance.
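A minimal sketch of that pyramid curriculum, assuming a simple linear re-weighting from the noisy bottom tier toward the curated top tier; the tier names, weights, and schedule are illustrative assumptions, not the actual JetMoE recipe:

```python
# Illustrative data pyramid: train mostly on plentiful low-quality data early,
# then shift sampling toward scarce high-quality data. All values are made up.
import random

TIERS = [
    ("bottom: large, noisy",    ["noisy sample"] * 1000),
    ("middle: smaller, better", ["decent sample"] * 100),
    ("top: scarce, curated",    ["curated sample"] * 10),
]

def tier_weights(progress):
    """Linearly interpolate sampling weights as training progress goes 0 -> 1."""
    start, end = [0.80, 0.15, 0.05], [0.05, 0.25, 0.70]
    return [s + (e - s) * progress for s, e in zip(start, end)]

def sample_batch(progress, batch_size=8):
    pools = [pool for _, pool in TIERS]
    picked = random.choices(pools, weights=tier_weights(progress), k=batch_size)
    return [random.choice(pool) for pool in picked]

early_batch = sample_batch(progress=0.0)  # dominated by the noisy bottom tier
late_batch = sample_batch(progress=1.0)   # converges to the curated top tier
```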
Zengyi Qin [00:07:53]: And it was the first time the community proved that with around $0.1 million you can train an extremely good model. So we caught a lot of attention there. Yeah, but that's the past story, and it's not quite focused on the current model we are building. But that's my background.
Demetrios Brinkmann [00:08:19]: All right, so you have a very decorated past, like you said, going from basically all the top academia spots to then training your own large language model at 1/10th or 1/100th of the cost it normally takes. You've done voice models also. Now you're focused on computer use.
Zengyi Qin [00:08:45]: Yeah, computer use. That is because we discovered a grand new opportunity for the computer to be able to operate itself. With computer use, the model basically takes the screen as input and controls the keyboard and mouse. This makes all the tools on the computer available for the model to use, not just living inside a chat box like ChatGPT does. So when the model has access to the computer, it can empower a huge number of white-collar jobs, helping people do them faster, or even having the AI handle the job by itself. And this capability is something that frontier labs such as OpenAI, Google, and Anthropic are not doing quite well. Although they have very good base models, they don't have very good computer-use models. And this requires a very, very different training paradigm.
Zengyi Qin [00:10:02]: And given the massive potential, the economic potential, of this capability, and also the current state of this field where the big labs are not doing much, we felt there was an opportunity to build a new model around this topic.
Demetrios Brinkmann [00:10:25]: What is different about training computer-use models versus training large language models?
Zengyi Qin [00:10:32]: Training a large language model has two stages: one is called pre-training and one is called post-training. In the pre-training stage, OpenAI, Anthropic, and Google basically feed the entire Internet's knowledge in text format to the model and let the model compress, remember, and generalize all this knowledge. But the issue is that the model is not able to understand how to interact with the environment, and it doesn't understand causality. What does causality mean? Imagine that you go to driving school to learn to drive, and your coach gives you all the training materials, for example driving manuals and all the YouTube videos or Internet knowledge about how to drive, and just hands it to you: okay, recite it, remember it, but they never let you touch the car. By training you this way, you become a very good chatbot, a person who seems to understand almost all knowledge about driving and seems to know all the driving skills. But actually, because you never touched the car, you don't know how to drive. So which knowledge is missing? All the driving materials are written by people who know how to drive.
Zengyi Qin [00:12:10]: And if you read them, you should know that. But when you actually learn to drive, you learn the causality. For example, you drive along the road, you turn the wheel left, and then you see the car going left. So you understand that this action causes that observation, or that world state. And after you turn left, you turn the wheel right, and the car turns right. So you have the action and the observation, and then the observation is taken in again and produces the next action.
Zengyi Qin [00:12:54]: So you clearly understand what the consequence of an action is, what the next world state will be after taking the action. This is strong causality. And why doesn't memorizing the driving manuals give you the causality? Because you didn't experience it. Even though the driving manuals, the text, contain some causality, it's very different from trying it yourself and learning the strong causality yourself. And the pre-training data that Anthropic and Google use, just the text or the videos on the Internet, doesn't contain any strong causality, especially for computer use. Also, considering that the Internet doesn't have too much computer-use data, they were not performing very well in this domain.
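A toy illustration of that distinction: what interaction adds over static text is the (observation, action, next observation) transition, where the next observation is caused by the action. The environment below is a stand-in, not a real training setup:

```python
# Toy version of the driving example: the training signal is a transition
# whose next observation is *caused* by the action, something static text
# rarely provides. The 'world' below is a stand-in.
from dataclasses import dataclass

@dataclass
class Transition:
    observation: str       # what the agent saw (the road, or a screenshot)
    action: str            # what it did (turn the wheel, click a button)
    next_observation: str  # the consequence it then observed

def world_step(action: str) -> str:
    """Stand-in environment: the action determines the next world state."""
    return "car drifts left" if action == "turn wheel left" else "car drifts right"

obs = "driving straight on the road"
action = "turn wheel left"
t = Transition(obs, action, world_step(action))
# The model learns: this action -> that world state. That is the strong
# causality a text-only pre-trained model never experiences.
```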
Demetrios Brinkmann [00:13:53]: And how are you then going about it? Are you giving models sandboxes so they can understand that? Or are you training, like, world models? Is that what they're specifically called?
Zengyi Qin [00:14:05]: No, we are not training world models. We do have an extremely large scale of sandboxes so that the agent can run in those sandboxes, do tasks, and collect trajectories. And we have a reward model to help identify which part of a trajectory is good and which part is not, so that the agent has a feedback signal to optimize itself. So basically learning by doing, learning by trial and error. This is how you teach the model to understand causality.
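A minimal sketch of the loop he describes, with stand-in classes for the sandbox, agent, and reward model; none of this is OpenAGI's actual stack:

```python
# Roll out in sandboxes, score the trajectories with a reward model, update
# the policy: learning by trial and error. Every class here is a stand-in.
import random

class Sandbox:
    """Stand-in desktop sandbox; a real one is a full Ubuntu/Windows VM."""
    def screenshot(self):
        return "initial screen"
    def execute(self, action):
        return f"screen after {action}"

class Agent:
    """Stand-in policy; the real one is the computer-use model itself."""
    def act(self, obs, task):
        return random.choice(["click(120, 40)", "type('hello')", "scroll(down)"])
    def is_done(self, obs, task):
        return random.random() < 0.2
    def update(self, scored_batch):
        pass  # a real implementation takes a gradient step here

class RewardModel:
    """Stand-in judge: scores which parts of a trajectory were good."""
    def score(self, trajectory, task):
        return [random.random() for _ in trajectory]

def rollout(sandbox, agent, task, max_steps=30):
    obs, trajectory = sandbox.screenshot(), []
    for _ in range(max_steps):
        action = agent.act(obs, task)
        obs = sandbox.execute(action)      # the action changes the world state
        trajectory.append((obs, action))
        if agent.is_done(obs, task):
            break
    return trajectory

agent, reward_model = Agent(), RewardModel()
batch = []
for task in ["open the settings page", "export the report as PDF"]:
    traj = rollout(Sandbox(), agent, task)
    batch.append((traj, reward_model.score(traj, task)))
agent.update(batch)
```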
Demetrios Brinkmann [00:14:47]: So I'm assuming it's a little bit easier because you know the end state, and in a way it's: did it do that thing or did it not? It's not subjective at all, it's very objective. And so that gives you the ability to create a stronger reward signal.
Zengyi Qin [00:15:05]: That seems correct, but in reality, how the model reached that state can have a lot of variance. For example, it might take 10 steps or 20 steps or 30 steps to get to that state. Apparently the 10-step one could be the best, but you cannot tell; probably the 20-step way is the most standard way, and it accounts for some other configurations on the computer, et cetera. So just checking the end state is actually not sufficient. And it's also very challenging to actually check the end state. Let me give you an example. If you are training the agent to manage an electronic health record system, you have a lot of patients and you want to transfer the patient data from a database to that healthcare system, and you need to copy every detail, every digit, into that system. The end state is extremely large; it's not captured by a single screenshot, it's deeply integrated with the internal state of the computer. And you cannot use code to judge the end state in many cases, because those software systems are very, very closed.
Zengyi Qin [00:17:03]: You actually don't have access to the internal state; you don't have an internal API for that software. So for filling in the forms and transferring the data, it's non-trivial to judge whether it was successful or not. I'm not sure if I'm making it clear, but there are a lot of technical details where the difficulty cannot be reduced to just judging the end states.
Demetrios Brinkmann [00:17:43]: Like tell me about some of the technical details.
Zengyi Qin [00:17:46]: For example, say you're asking the agent to operate PowerPoint and make a presentation for you. It's really hard to identify whether a slide looks good or not. It's very easy for a human to tell, but if you are training the agent and there are more than 1,000 agents running in parallel, you need something to judge whether the PowerPoint looks good or not. Because it's not an objective judgment, it could be very difficult.
Zengyi Qin [00:18:29]: Another situation is when you're letting the agent operate an enterprise software, and the enterprise software itself has its own state. For example, there is already a lot of information stored in that software, such as contact information, and you need the agent to modify or transfer it, or send some messages to someone, et cetera. But after your agent takes the action, the world state changes, and you cannot reproduce that state unless you rebuild that internal state again. So the agent faces a very non-stationary state during training. And although it's not very difficult for humans to judge, it's actually very difficult for the agents to judge, because at first they don't have the knowledge of that domain. I mean, if they already had the knowledge of that domain, then why would we need to train them? It's a chicken-and-egg problem.
Demetrios Brinkmann [00:19:42]: How are you dealing with the state changing constantly?
Zengyi Qin [00:19:46]: We must accept that. We are training in non-stationary states.
Demetrios Brinkmann [00:19:52]: And so you're just figuring out how to make that work within your training. And are you simulating these states also? Because you mentioned that you have a lot of sandboxes. Are you simulating different states so that you can get that training?
Zengyi Qin [00:20:13]: We do run a lot of software in those sandboxes to simulate the environments the agent might work in. By running the agents, we collect a lot of trajectories, the screenshot and action sequences. And we are able to use either human annotation or a trained reward model to figure out which is good and which is not, and use that to update the model. And when we train the model, there's another very, very challenging point. You see, DeepSeek and also OpenAI and Anthropic use reinforcement learning to train their models so that they can achieve gold medals in the international math and physics olympiads. We all know they are using reinforcement learning, and the thing works this way: they first give the model a math problem and let the model solve it, roll out the tokens, roll out the solutions. And then they have a judging function to judge whether the solution is correct or not.
Zengyi Qin [00:21:28]: This judging function is very, very simple, because they only need to check whether the correct answer is inside the end of the language model's rollout. So every time they have a math problem, they roll out around 10 or 20 times, pick out which is good and which is not, and optimize the model based on the reward signal. And one difficulty we face is that when solving a math problem, you don't need to interact with the environment; you just need to roll out, and probably after two seconds the rollout is finished. But interacting with the computer has latency, you need to deal with the world state, and every rollout takes several minutes. Several minutes is around 100 times longer than rolling out a math solution.
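For contrast, the verifiable-reward judge he describes for math RL can be sketched in a few lines; this is a generic illustration, not any lab's actual implementation:

```python
# Verifiable reward for math: roll out several solutions and check whether the
# known correct answer appears at the end of the rollout.
def judge(rollout_text: str, correct_answer: str) -> float:
    final_line = rollout_text.strip().splitlines()[-1]
    return 1.0 if correct_answer in final_line else 0.0

rollouts = [
    "Let x satisfy 2x + 1 = 15.\nSo x = 7",
    "Let x satisfy 2x + 1 = 15.\nTherefore x = 6",
]
rewards = [judge(r, "7") for r in rollouts]  # -> [1.0, 0.0]
# Each math rollout finishes in seconds; a computer-use rollout needs a live
# sandbox and takes minutes, roughly 100x longer per reward sample.
```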
Zengyi Qin [00:22:30]: And you also need to maintain an extremely large infrastructure. For example, when you manage one or two virtual machines or sandboxes, it's totally fine. But if you have 1,000 or 10,000 sandboxes, then you almost constantly encounter sandbox failures. So you need an extremely robust mechanism to let the system recover from a failure. And we do have a very robust mechanism: during our evaluation, we start from complete failure, a complete crash of the entire system, and then see how long it takes for the system to go back to normal. With around 32 sandboxes, it takes around 80 seconds to go back to 100% healthy from 100% failure. With around 1,000 sandboxes, it takes around 300 seconds to go from complete failure to a completely healthy state.
Zengyi Qin [00:23:38]: It is actually quite impressive, because all of this is automatic. And in reality you won't start from a state of complete failure, so you probably don't actually need the 80 seconds.
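A rough sketch of what such an automatic recovery watchdog could look like, using the 32-sandbox case as the example; the Sandbox API here is a hypothetical stand-in:

```python
# Watchdog sketch: poll the fleet and restart anything unhealthy until the
# whole fleet is back to 100% healthy. All APIs below are made up.
import time

class Sandbox:
    def __init__(self):
        self.healthy = False          # evaluation starts from complete failure

    def restart(self):
        self.healthy = True           # a real restart takes seconds, not instant

def recover_fleet(fleet, poll_interval=0.1):
    """Restart failed sandboxes until the whole fleet is healthy again."""
    start = time.monotonic()
    while not all(s.healthy for s in fleet):
        for s in fleet:
            if not s.healthy:
                s.restart()           # in practice, issued asynchronously
        time.sleep(poll_interval)
    return time.monotonic() - start

elapsed = recover_fleet([Sandbox() for _ in range(32)])
```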
Demetrios Brinkmann [00:23:56]: How are you getting the data for all this, to train? Because I imagine you have to train on specific software, like maybe it's Google Flights or maybe it's HubSpot or maybe it's Salesforce.
Zengyi Qin [00:24:08]: Right. We use all the software available on the computer. For real websites, if we can access the website without being blocked, then we use the real website directly, for example running reinforcement learning on Google Flights. And for other apps, for example Google Docs or Microsoft PowerPoint or Microsoft Word, we just install them in the sandbox and run them.
Demetrios Brinkmann [00:24:47]: But I guess there's, there's going to.
Zengy Qin [00:24:49]: Be.
Demetrios Brinkmann [00:24:52]: Instances where the agent encounters software that it wasn't trained on and you're going to want it to be able to try things and make it work. Right. So maybe it wasn't trained on HubSpot, but it was trained on Salesforce. And you're going to hope that it can go and figure out HubSpot when.
Zengy Qin [00:25:14]: The agent is trying to operate a software they never seen before. And we at least hope that it can have a cold start. This comes from two perspectives. One is that because the agent already see a lot of data, a lot of screenshots data and a lot of data that it's been trained on. So it has a relatively general understanding of how the UI works. So that when they see a new screenshot, see a new software, at least they understand that, okay, that button is set. If I go to that button I'll probably change, change the thing to dark. So that's a general knowledge which is absorbed by operating with their existing software.
Zengyi Qin [00:26:18]: If that doesn't work, we collect human data, letting humans operate the computers and the software, to give the model some prior knowledge. So one part comes from natural generalization from the existing training, and the second part is humans teaching it.
Demetrios Brinkmann [00:26:45]: And the human teaching, the human data that you're collecting, you then fine-tune the models with?
Zengyi Qin [00:26:52]: Yeah, we fine-tune the model with human data. And we actually collect a huge amount of human data. It's not really called fine-tuning, it's called mid-training, where there's a huge amount of data, much more than what typical fine-tuning needs.
Demetrios Brinkmann [00:27:15]: Yeah, so it's not fine-tuning. It's basically step two in training.
Zengyi Qin [00:27:21]: Yeah.
Demetrios Brinkmann [00:27:23]: Okay, so I'm starting to get a picture of this, and tell me if I'm going down the right track. You have a large number of sandboxes. You have models that are trying things inside of these sandboxes, and they're getting rewarded when they complete the right task in the best way possible, meaning, I imagine, the least number of steps. From there you get a base model that understands how to do things with your computer, like scroll and take a screenshot and then submit, click buttons, whatever it may be, go back, go forward. And then when you want to use a specific tool that the model's never seen before, you're giving it human data so you can do that stage-two training.
Zengyi Qin [00:28:25]: That's very correct. The model first needs to have general knowledge and then train on relatively specific tools.
Demetrios Brinkmann [00:28:36]: And the training needs to happen in these sandboxes, or are the sandboxes just how you're generating data to then train?
Zengyi Qin [00:28:45]: Yeah, the sandbox is only for generating the data, because training needs a lot of GPUs and a relatively centralized infrastructure. The sandboxes are just desktop environments, Ubuntu or Windows or macOS, just plain desktop environments where the agents run. When we collect the data, the agent is not trained inside the sandbox; it's trained on a GPU cluster in our training infrastructure. Yeah. And we actually already open-sourced our training infrastructure with a lot of sandboxes. And oh, by the way, we just launched our model on 12/1, just...
Demetrios Brinkmann [00:29:38]: Oh, nice.
Zengyi Qin [00:29:38]: Yesterday. Yeah. And I can actually send you the...
Demetrios Brinkmann [00:29:46]: Yeah, tell me more about the model.
Zengyi Qin [00:29:48]: Yeah, you can see the comparison between the Lux Thinker and all the other computer-use models.
Demetrios Brinkmann [00:29:55]: Yeah, doing really well.
Zengyi Qin [00:29:59]: Yeah, it's doing extremely well. The Online Mind2Web benchmark is a benchmark with around 300 real-world computer-use tasks, and it's widely known in the community, so many models are evaluated on it. We were able to reach a score of 83.6, which is much higher than Google's Gemini model and also OpenAI's Operator and Claude. And it's not just more powerful, it's also faster: on average two to three times faster than these competing models. And the token cost is around just 10% of these models'.
Zengyi Qin [00:30:51]: So it's extremely powerful. And we are releasing the model, the SDK, and the API so that developers can build their own computer-use applications.
Demetrios Brinkmann [00:31:03]: And so now, I've seen quite a few folks trying computer use. It's captivated our imagination. Why did you feel like you wanted to go to the model level versus trying to just give existing models more infrastructure?
Zengyi Qin [00:31:20]: That is because the current models are not good enough, and by adding infrastructure you cannot solve the fundamental performance issue, you cannot solve the speed issue, and you probably end up with a higher cost. By giving the model some tools, it can do some things more easily, but it's not general enough. And also, if I'm starting a company that's just wrapping around someone else's model, what is my competitive edge? We must have something very fundamental, our own place in the field of computer-use models. I think that is not an easy path, but it's definitely the most promising path in the long run.
Demetrios Brinkmann [00:32:22]: Do you feel like you also want to, or need to, add to the different framework layers? Or is it something where you're just going to focus on the model, make sure the model is as good as possible, and let others build on top of it?
Zengyi Qin [00:32:40]: Very good question. When we released our model, called Lux, we also released the agent framework. For example, on our SDK page there is a thing called the Tasker Agent. What is the Tasker Agent? It's an agent structure that is based on our Lux model, but it gives extremely high controllability for a fixed workflow. In the Tasker Agent, the developer can specify each precise to-do, each subtask for the model, so that the model can do them one by one. And inside the Tasker Agent, if a subtask is not completed correctly, the model knows how to make up for the mistake and correct it. So basically the Tasker Agent gives the developer extremely high controllability to run a repetitive workflow step by step, do it fast, and have the ability to correct its own mistakes.
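A minimal sketch of the Tasker Agent pattern as described: a developer-specified to-do list, executed one subtask at a time, with verification and correction on failure. The stub agent stands in for the real Lux SDK client:

```python
# To-do-list execution with retry-and-correct, as in the Tasker Agent idea.
class StubAgent:
    def execute(self, subtask):
        return f"attempted: {subtask}"    # real agent clicks/types in a VM
    def verify(self, subtask, result):
        return True                       # real check: screenshot or app state
    def correct(self, subtask, result):
        pass                              # real agent makes up for the mistake

def run_tasker(agent, todos, max_retries=2):
    for subtask in todos:
        for _ in range(max_retries + 1):
            result = agent.execute(subtask)
            if agent.verify(subtask, result):
                break                     # subtask done, move to the next one
            agent.correct(subtask, result)
        else:
            raise RuntimeError(f"subtask kept failing: {subtask}")

run_tasker(StubAgent(), [
    "open the invoice form",
    "fill in the customer name and amount",
    "click save and confirm the success message",
])
```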
Zengyi Qin [00:34:04]: If you are just using the model itself, it's not very easy to achieve that level of controllability. So we also build the framework around our own model. There are three ways to use the model. One is called the Actor mode, where it runs blazing fast, much faster than existing competing models. The second mode is called the Thinker mode, where it's able to handle vague queries and long-horizon tasks, and it can think before it takes actions. It can also run for more than one hour on a task, and even if it encounters some problem in the middle, it knows how to fix it by itself. That is the Thinker mode.
Zengyi Qin [00:34:58]: And there's also the Tasker mode, which we just talked about, where it offers extremely high controllability. This is especially suitable when you already know how to do a workflow, but you just need an agent, like an intern, to help you do it. You give the intern a to-do list, a subtask list, and it can execute it with extremely high controllability and stability.
Demetrios Brinkmann [00:35:32]: And actually, for all these different modes, how different was it to train each of them? Or was it just that stage-two training that was the different part? You had the base model and then you added these different flavors of the model? Or are you just giving more compute at test time, at runtime, for the Thinker?
Zengyi Qin [00:35:55]: Yeah, so the training is actually very, very similar. We share the first stage of training. In the second stage, for the Actor model, we prioritize speed, so we don't let the model think too much. But for the Thinker model, we let the model think extremely deeply. So the stage-two training is quite different; the objective of the optimization is different.
Demetrios Brinkmann [00:36:24]: Okay. So stage two, you're just giving it more time to think about what it's going to do and then try it, come back, try again, that type of thing.
Zengyi Qin [00:36:34]: Yeah, yeah, exactly.
Demetrios Brinkmann [00:36:37]: And this is so cool. What have you done with it so far?
Zengyi Qin [00:36:43]: You mean the model release, or...?
Demetrios Brinkmann [00:36:47]: What you playing around with the model? What are some things that you've done?
Zengyi Qin [00:36:52]: Yeah, yeah. We actually have some examples we really like on the website. One example is letting the model do software QA, where you are developing a software product and you add a new feature, and you want to make sure all the features are still functioning correctly. So you need to run through the software and do QA on each part. This is something that can be done with a Tasker Agent, where you specify all the features you want to test as to-dos, and you just do them one by one.
Zengyi Qin [00:37:37]: So every time you add a new feature, you can spin up a Tasker Agent in a virtual machine and let it do the QA for you. Another example is finding insider trading activity for stocks. You just ask the model to find the insider trading activity for Apple stock on Nasdaq, and it will go to the website, search for AAPL, and somewhere in the page it's able to find the insider trading activity. That's an interesting example.
Demetrios Brinkmann [00:38:26]: Wow.
Zengyi Qin [00:38:27]: Yeah, yeah. And also helping you file your taxes on TurboTax. It's able to just fill those boxes one by one, based on some information that you give it.
Demetrios Brinkmann [00:38:44]: You give it, that's a godsend, especially right around March, April time.
Zengy Qin [00:38:52]: So we're, we're building the, so we're increasing the reliability of the model so that you can completely trust it. So there's another very important aspect about the visibility of the model because for some tasks when the model operates the computer, you actually want to have the control of it. But for some tasks you don't even want to look at it. So to be specific, there are three situations. One is the completely background task, for example, helping you to crawling information from Amazon or buying a toilet paper on Amazon, so you don't even want to take a look at it. So in that case the agent just do it. And the second level is where you don't care the process, you just care the results and when. So in the last stage you need to check the results.
Zengyi Qin [00:40:05]: So the second level is where the agent can do the process, but you need to confirm at the end; you don't care about or don't want to look at the process. For example, feeding your information to the health insurance provider: you want the agent to do it, but before the agent hits submit, you want it to stop, you check it, and you hit submit. Tax filing is also in this category: you want the agent to do it, but you don't want it to submit for you; you want to check before you submit. And the third level is where you want to see every step the agent takes, for example moving your bank information from here to there, or entering some extremely important information where you don't want any digit going wrong. In that case you observe the agent step by step. For these three scenarios, we offer different visibility for developers so that they can build corresponding applications.
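Those three levels could surface to developers as something like the following sketch; the enum names and approval hook are illustrative assumptions, not the actual SDK surface:

```python
# Three visibility levels for an agent run, as a developer-facing switch.
from enum import Enum

class Visibility(Enum):
    BACKGROUND = 1      # fire and forget: crawl data, reorder toilet paper
    CONFIRM_AT_END = 2  # agent works unattended; human approves before submit
    STEP_BY_STEP = 3    # human watches and approves every single action

def human_approves(step):
    return True         # stand-in for a real confirmation prompt in the UI

def run_task(agent, steps, visibility):
    for step in steps:
        if visibility is Visibility.STEP_BY_STEP and not human_approves(step):
            return "aborted by user"
        agent.execute(step)
    if visibility is Visibility.CONFIRM_AT_END and not human_approves("submit"):
        return "held for human review"   # e.g. tax filing: stop before submit
    return "done"

class StubAgent:
    def execute(self, step):
        pass             # real agent performs the click/type in a sandbox

status = run_task(StubAgent(), ["fill form", "review totals"],
                  Visibility.CONFIRM_AT_END)
```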
Demetrios Brinkmann [00:41:34]: One thing I think About a lot with computer use models is sometimes I also want to be using my computer. And so what in your mind does the infrastructure look like for these different models? Is it that they are going out into the cloud and doing things or they're not doing this on your computer, they're looking at other resources or they're doing it in the background like you said. But then maybe it eats up. If it is doing it on my computer the whole time, it eats up all of my memory or my. Just my computer usage.
Zengyi Qin [00:42:14]: Yeah. So the local background mode can have its own virtual desktop on the local machine, so it can work without interfering with yours. You can still use the computer while it's doing things in the background. It's also possible to run it on a cloud machine, but you have to upload some information to the cloud machine. Or, if you only ask it to do some browser operations, it can just retrieve the information from the browser, and that's completely fine. So there are different modes; all are possible.
Demetrios Brinkmann [00:42:58]: Dude. What else do you want to talk about?
Zengyi Qin [00:43:00]: I would like to talk about one last thing: our imagination and vision for the future of computers. We've seen several evolutionary stages of the computer in history. The first one is probably the mainframe, the second one is the PC, and the third one is the mobile phone. And what is the fourth one? It's very likely that in the future the computer will not be in its current form. It's possible that the keyboard will disappear, the mouse will disappear, or you'll have some other input format, such as voice, to let the computer understand your intent, and then the computer can do it for you. Probably we can still have some keyboard. But eventually the computer starts self-driving, just like today's self-driving cars. This probably won't happen in one or two years, but in five to 10 years the computer will change a lot.
Zengyi Qin [00:44:16]: And the interface between humans and computers will also change a lot. And the fundamental enabler of this is AI agents that understand human intent and are able to operate digital devices by themselves. So we are going towards that direction. Right now it's computer use, and in the future it's probably just an AI operating system that is completely autonomous: you only have hardware, you don't need to touch it, you just say your intent, and it does the things for you.

