From Robotics to AI NPCs // Nyla Worker // AI in Production Talk
Nyla Worker is the product lead at Convai, an award-winning AI-NPC start-up. There she focuses on developing conversational non-player characters. Before Convai, she product-managed synthetic data generation at NVIDIA's Omniverse Replicator and honed her skills in deep learning for embedded devices. Nyla has experience accelerating eBay's image recommendation system and has conducted research at the University of Minnesota. Her expertise spans AI, product management, and deployment of AI in production systems.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
Nyla delves into the intersection of robotics, simulation, and AI techniques, particularly in the context of powering Non-Player Characters (NPCs) in games using multi-modal Large Language Models (LLMs). Drawing from the principles of training AIs in simulation environments, the talk explores how Convai utilizes these technologies to enable NPCs to perform actions and react dynamically to their virtual environments. Viewers can expect insights into the methodologies employed and the practical applications derived from robotics research.
AI in Production
From Robotics to AI NPCs
Demetrios [00:00:05]: And right now we're going to talk about NPCs, is that right? Yes, we're talking about robotics, NPCs, and how LLMs can be powering all these different things, and agent frameworks. This is going to be fascinating. Let's see. I think your slides are up and I'll come back in 10 minutes. Take it away.
Nyla Worker [00:00:22]: Awesome. Thank you. Yeah. Hello everyone. My name is Nyla Worker. I lead product at Convai. Prior to Convai, I led synthetic data generation for NVIDIA, and there we used synthetic data to train perception models for robotics and manufacturing tasks. So, for example, you wanted to make a robot move around, but you also wanted it to take feedback from the environment.
Nyla Worker [00:00:49]: You could do this in synthetic worlds way before you did it in the real world. And I've translated those learnings from that space into the AI NPC space. So let's dive into it and see how this all comes together. So here we have an AI NPC. As you can see, the AI NPC is responding to the person who is talking to it and asking it questions. So in this case, we ask it to pass me something and she will do it. So this is an embodied NPC that then moves its body and does the desired task. This is similar to what you would see in a robotics environment.
Nyla Worker [00:01:35]: AI NPCs are embodied AI agents. People want a conversational, human-like experience in which you are effectively driving this character. So here we have our dear tour guide, and the AI NPC is going to show us around this space. This embodied AI character is spatially aware, so it knows where it is in the space, and it can talk about the items in the space as well as perform desired actions. And this is very similar to what you would have to do in a robotics workflow. So now let's talk about what it takes to build a robot. First, before we get into any of this cool stuff, the first thing it takes to build a robot is the actual physical robot platform. You spend years building the robotics platform, and then you spend years building the system to operate this platform.
Nyla Worker [00:02:36]: And finally, once you have the actual body of the robot, you need to handle these tasks: localization, mapping, perception, planning and control, machine learning, and human-robot interaction. Localization is: where am I in the space? Mapping is understanding where you are on a map. Perception is understanding what you're seeing, what is in your surroundings. Planning and control is deciding what tasks to do and whether you can actually do them; the control system you have enables you to plan and carry out those tasks. Then there is machine learning and AI. This is a very broad topic, but it is basically leveraged by robots in many areas, for instance for voice recognition, environment adaptability, and way more, which I'll touch on in a bit. And lastly, human-robot interaction: how is this robot, once it is really deployed in the real world, going to interact with a human? Developing and integrating all of this is required for you to have a robot that will, for instance, help you do your laundry.
Nyla Worker [00:03:58]: The dream, anyway. But what does it take to build an AI NPC? In this case, we end up doing quite the opposite of what you would do with a robot. With a robot, we first start with the robot, building the physical hardware; with the AI NPC, we start by building the mind. In this case, I've heard the previous talks were about retrieval-augmented systems, having your own LLM, or using an API. Well, in this case we use that LLM, with techniques such as a retrieval-augmented system for the knowledge base, to create this character with a personality, with the required actions that it needs to perform, with a voice, and with memory. Then we have to make it listen; it has to hear us.
Nyla Worker [00:04:55]: So that is speech to text. Then we have voice output: the NPC has to be able to talk back to us with input from the LLM. However, an AI NPC goes way beyond just the speaking and reasoning, which is quite a lot already. It goes into the embodiment in a game engine. There we are talking about putting it into a body and giving it facial animations, lip sync, gestures, and movements. You can equate that to a control system in a robot, for instance. And then there is a whole virtual-robot area: the AI NPC has to be able to perceive the environment it's in and understand where it is, to then carry out actions, leveraging things like scene metadata or scene information.
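To make that pipeline concrete, here is a minimal sketch of the listen-think-speak-act loop just described. Every name in it is a hypothetical stand-in for the speech-to-text, LLM, text-to-speech, and animation services; it is not Convai's actual API.

```python
# Hypothetical sketch of an AI NPC's listen-think-speak-act loop.
# All functions are stand-ins for real ASR, LLM, and TTS services.
from dataclasses import dataclass, field


@dataclass
class Character:
    backstory: str                                    # personality / state of mind
    knowledge_base: list[str]                         # documents for retrieval augmentation
    memory: list[str] = field(default_factory=list)   # long-term memory


def speech_to_text(audio: bytes) -> str:
    return "Can you pass me the mojito?"              # stand-in for a real ASR service


def llm_respond(character: Character, user_text: str, scene: dict) -> tuple[str, list[str]]:
    # A real system would prompt an LLM with the backstory, retrieved knowledge,
    # memory, and scene metadata; here we return a canned reply and action list.
    return "Sure, here you go.", ["move_to:mojito", "pickup:mojito", "move_to:user", "drop:mojito"]


def text_to_speech(text: str) -> bytes:
    return text.encode()                              # stand-in for a real TTS voice


def npc_turn(character: Character, mic_audio: bytes, scene: dict) -> None:
    user_text = speech_to_text(mic_audio)                       # listen
    reply, actions = llm_respond(character, user_text, scene)   # think
    _audio = text_to_speech(reply)                              # speak (lip sync happens downstream)
    for action in actions:                                      # act in the game engine
        print("triggering animation:", action)
    character.memory.append(f"user: {user_text} | npc: {reply}")


character = Character(backstory="A cheerful museum tour guide.",
                      knowledge_base=["The T. rex skull is in hall B."])
npc_turn(character, b"<mic audio>", {"mojito": {"position": (2.5, 0.0, 1.0)}})
```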
Nyla Worker [00:05:51]: And those actions have to be dynamically generated based on either the instruction from the human or environmental feedback, such as a human walking into the space; thus the AI NPC should go and greet them. Now let's look at AI NPCs in the context of the core components of a robotics workflow. Localization and mapping are actually much simpler tasks for an AI NPC, because the AI NPC has perfect feedback from the environment. We get this from the game engine, so we know exactly where the AI NPC is, and we have a virtual map in the game engine. So we can skip those two sections, which is huge. I mean, massive robotics books have been dedicated to those problems. But there is a lot more outside of that that still needs to be done, which is that perception side.
Nyla Worker [00:07:02]: How do you leverage that data to then do planning and control, which is needed to successfully do the task? And then how do you successfully leverage other sensor data, for instance for voice recognition and so on, to make this robot or AI NPC viable for conversation? And lastly, we have to have the human-AI NPC interaction, which is similar to human-robot interaction. It has to feel natural; it has to feel as if you are able to completely guide it, as if you were guiding a human-like entity. And this is very hard in both robotics and AI NPCs. Let's start with perception. In both cases, processing the environment is key. In the case of a robot, we're talking about a whole sensor set. We're talking about cameras, thanks to DALL-E for creating this image. But there is a lot more than just cameras.
Nyla Worker [00:08:13]: We're talking about lidar, radar, sonar, and so on. So that is extremely challenging. In the case of AI NPCs, we thankfully get to simplify our lives a little bit and just take camera data. We also get to cheat a little bit because we get to use scene metadata, and on top of that we get perfect information about the environment. So with AI NPCs, because of that, we have a shortcut to knowing where we are. We know the items in the game engine, and we know exactly where they are. So in that sense, this is a much simpler task.
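As a toy illustration of that shortcut, consider a hypothetical scene-metadata dictionary of the kind a game engine could export (the structure here is invented, not any particular engine's format): the NPC localizes an object by simply reading its entry, with no detection step at all.

```python
import math

# Hypothetical scene metadata as a game engine might expose it: every object's
# name and exact position is known, so no camera or lidar perception is needed.
scene_metadata = {
    "npc_guide":  {"position": (0.0, 0.0, 0.0)},
    "mojito":     {"position": (2.5, 0.0, 1.0)},
    "trex_skull": {"position": (10.0, 0.0, -3.0)},
}

def locate(scene: dict, object_name: str) -> tuple[float, float, float]:
    """Perfect localization: just read the object's position from the metadata."""
    return scene[object_name]["position"]

def distance_to(scene: dict, npc: str, target: str) -> float:
    return math.dist(locate(scene, npc), locate(scene, target))

print(distance_to(scene_metadata, "npc_guide", "mojito"))  # ~2.69, no sensors required
```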
Nyla Worker [00:08:59]: With AI NPCs, however, it's still hard to put them into context, and we still need to utilize, for instance, a multimodal LLM to understand how things fit in context. But an individual item, such as an object like a mojito, would be in the metadata, and we would be able to fully localize it. Still, both share perception as a challenge, just in slightly different ways. And here you can see that in action: as I mentioned before, the AI NPC needs to be aware of where the skull is to do this tour and teach us about this T. rex skull in the virtual museum. Now let's talk a little bit about planning. Planning in robotics is an extremely hard task, in particular because of imperfect information about the environment. In that case you potentially have to discretize a space, then work out the potential routes through those discrete sections of the space, and then decide which of the paths is shortest and most optimal, as in the case of this image.
Nyla Worker [00:10:21]: So this is a very difficult task. In the case of AI NPCs, it's a little bit easier because the environment is defined, the map is defined, and we have perfect information about where the items are. However, we still need to plan what to do. We know what movements the AI NPC has. In robotics, you would also know what movements this robotic arm is constrained to by the control stack. But then you have to decide, with those movements, what the actions are, and then with those known actions, decide what to do. So let's break it down into a simple example. At Convai, we call these atomic actions.
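For contrast, here is a toy version of the robotics-style planning mentioned a moment ago: discretize the space into an occupancy grid and search it for a shortest path. This is only a breadth-first-search sketch; real planners deal with continuous space, dynamics, and uncertain maps.

```python
from collections import deque

# Toy planning example: discretize the space into a grid (1 = obstacle, 0 = free)
# and find a shortest path with breadth-first search.
GRID = [
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

def shortest_path(start: tuple[int, int], goal: tuple[int, int]) -> list[tuple[int, int]]:
    rows, cols = len(GRID), len(GRID[0])
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        r, c = path[-1]
        if (r, c) == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and GRID[nr][nc] == 0 and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(path + [(nr, nc)])
    return []

print(shortest_path((0, 0), (4, 4)))   # cell-by-cell route around the obstacles
```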
Nyla Worker [00:11:06]: So we have a simple set of core actions that the NPC can perform because of its animation set. In the robotics case that would equate to something like: I can move the robotic arm left and right, and so on. In the AI NPC case, it's the animations it has available. Those animations that it has available we call atomic actions: move, dance, pick up, drop, and so on. Now, it is one thing to tell the NPC: can you move? Can you dance? That is just giving an instruction. But how do you get to complex, dynamic actions? For example, if I tell the AI NPC, fetch me a jetpack, it should know to plan the following: move to the jetpack, pick up the jetpack, move back to the user, and drop the jetpack. This is quite difficult to do. And the way that we've been doing it at Convai is by leveraging voiced input such as "bring me that."
Nyla Worker [00:12:17]: A fine-tuned LLM receives that query and then decides on a set of actions to do. Keep in mind, "bring me that" is also leveraging spatial input, right? Because what is "that"? That has to be gathered from the perception that we mentioned before. Anyway, the fine-tuned LLM gives a set of actions, and then we validate whether the actions are actually logical to use. And this would be similar if you were to use this path for robotics. So let's watch it in action.
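A rough sketch of that request-to-plan flow, with hypothetical helpers standing in for the fine-tuned planner and the validation step (illustrative only, not Convai's implementation): a voiced request comes in with "that" already resolved by perception, the planner proposes a sequence of atomic actions, and the plan is checked for logical consistency before anything is executed.

```python
# Hypothetical atomic-action planning flow: request -> planned actions -> validation.
ATOMIC_ACTIONS = {"move_to", "pickup", "drop", "dance", "wave"}

def plan_actions(request: str, resolved_target: str) -> list[tuple[str, str]]:
    # Stand-in for the fine-tuned LLM: a fetch-style request becomes a fetch plan,
    # with "that" already resolved to an object by the perception layer.
    if "bring" in request.lower() or "fetch" in request.lower():
        return [("move_to", resolved_target), ("pickup", resolved_target),
                ("move_to", "user"), ("drop", resolved_target)]
    return []

def validate(plan: list[tuple[str, str]]) -> bool:
    """Reject plans that use unknown actions or drop an object that was never picked up."""
    holding = None
    for action, target in plan:
        if action not in ATOMIC_ACTIONS:
            return False
        if action == "pickup":
            holding = target
        if action == "drop" and holding != target:
            return False
    return True

plan = plan_actions("Hey, can you bring me the axe, please?", "axe")
if validate(plan):
    for step in plan:
        print("execute:", step)   # handed off to the game engine's animation system
```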
Demetrios [00:12:54]: For this example, you see that axe lying there over on the ground? We can ask the character to fetch it for us. Hey, can you bring me the axe, please?
Nyla Worker [00:13:07]: Sure, I'll bring you the axe. You can tell she is very proud of herself that she brought the axe. Anyway, this just shows a complex set of actions that has been dynamically processed in real time to bring the axe back to the user. And this would be something that, if you're having a conversation with a robot, you would like to have guaranteed and improved over time. Now let's jump into machine learning and AI. This might feel like an oxymoron, because I've been talking about AI the whole time and now I'm adding even more. But there is AI that doesn't fit into the previous picture, which is, for instance, leveraging AI for planning logistics at a warehouse, which might then influence a whole robot, and then communicating with a human. So there is a lot more AI that can be used for robotics, but that goes beyond the scope of this talk.
Nyla Worker [00:14:13]: What we would like to dive deeper into is voice recognition. Automatic speech recognition is a great way of communicating with our AI NPC. However, it has to be extremely low latency. It has to be able to capture new words and different pronunciations in order for you to communicate with it properly. And it has to be able to potentially handle multiple languages. In the case of, say, a factory, there is a likelihood that there would be custom words that you would like it to pronounce, potentially in different languages. Another thing is proper human feedback: voices with emotions would be nice.
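Purely as an illustration, a speech-recognition configuration capturing those requirements might look something like the sketch below; the field names are invented for this example and do not correspond to Convai's or any ASR vendor's actual settings.

```python
from dataclasses import dataclass, field

# Hypothetical ASR configuration reflecting the requirements above:
# low latency, custom vocabulary (e.g. factory-specific terms), and
# several concurrent languages. Field names are illustrative only.
@dataclass
class SpeechRecognitionConfig:
    languages: list[str] = field(default_factory=lambda: ["en-US", "es-MX"])
    custom_vocabulary: list[str] = field(default_factory=list)  # new words / pronunciations
    max_latency_ms: int = 300          # streaming budget for a responsive NPC
    emit_partial_results: bool = True  # stream partial transcripts to cut perceived latency

config = SpeechRecognitionConfig(custom_vocabulary=["Convai", "jetpack", "Omniverse"])
print(config)
```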
Nyla Worker [00:14:55]: Within the Convai platform, we enable you to select multiple languages, up to four concurrently, and the AI NPC can talk in all of them, and you can add custom pronunciation and new word recognition. So you can see here, in this case, it was speaking in Spanish, and then we spoke in English, and then it responded in English. This would be similar to an interaction you could have with an embodied robot. And with that, let's jump to embodiment. The thing that AI NPCs and robots have in common is that both are embodied, one in a virtual world, the other in the real world. The embodiment in gaming happens through different gestures, as well as facial animations and lip sync. In the case of robotics, each hardware robotics platform would have its own set of potential movements.
Nyla Worker [00:16:04]: And that would be the embodiment in that case. This leads me to the next step, which is embodiment... sorry, human-computer interaction. In this case, we're talking about AI NPCs that are going to be talking with humans and getting feedback from humans as to whether their interactions are actually realistic. Here we're leveraging Audio2Face from NVIDIA to make the facial animations feel realistic and to give humans the proper reaction to what they expect. This, I think, is where AI NPCs are going to leapfrog robotics for a while. Figuring out what the proper interaction with humans is, and giving the right feedback to humans from these embodied AI agents, is going to be key for the future. And lastly, the point about embodiment brings us to the fact that these embodied agents get environmental feedback. What do I mean by that? Whether it is in a game engine or in the real world, there is something that you can use to understand whether a task has been performed and how to improve it over time.
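Here is a toy sketch of that act-observe-improve loop, with entirely made-up names: the agent attempts a skill, the simulated environment reports success or failure (the "perfect feedback" a game engine can give), and only the skills that prove reliable are kept.

```python
import random

# Toy sketch of learning from environmental feedback: the agent tries a skill,
# the simulated environment reports success or failure, and successful skills
# are kept in a growing library. Purely illustrative.
random.seed(0)
skill_library: dict[str, float] = {}   # skill name -> observed success rate

def attempt(skill: str) -> bool:
    """Stand-in for executing a skill in the game engine and reading the outcome."""
    return random.random() < 0.6

def practice(skill: str, trials: int = 20) -> None:
    successes = sum(attempt(skill) for _ in range(trials))
    rate = successes / trials
    if rate > 0.5:                      # keep skills the environment confirms as reliable
        skill_library[skill] = rate

for candidate in ["pickup_axe", "open_door", "pour_drink"]:
    practice(candidate)

print(skill_library)
```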
Nyla Worker [00:17:25]: Here I have the Voyager paper, which was made by NVIDIA. There, the AI agent got feedback from the environment, from which it learned to upgrade its tool set. That upgrade over time is thanks to the environmental feedback. In this case, AI NPCs have perfect environmental feedback and will be able to learn from that and get better over time. And we'll start seeing things like more animations, more custom movements, more responsive movements, and things like that, which are only possible because of the feedback that we are getting from the simulated world, which has simulated physics, dynamics, and so on. In the long term, you can imagine how that barrier between simulation and reality is going to shrink. We call this sim-to-real, and we'll be able to see how that simulated world could slowly lead to AI NPCs coming into the real world due to the reduction of the sim-to-real gap.
Nyla Worker [00:18:36]: What I worked on at NVIDIA previously was minimizing the sim-to-real barrier for perception tasks, but this also applies to this kind of improvement and self-betterment of the agent. Yeah, and just to wrap up, I want to say what it takes to build AI NPCs as a whole, holistically. Again, we discussed today how an LLM takes text input and produces text output. The text input includes information such as the backstory, the personality, and the state of mind, all of that contributing to the human-AI interaction, as well as the long-term memory and the knowledge base, the knowledge base being a retrieval-augmented system that enables the AI NPC to remain factual: all things that a robot with a mind would have.
Nyla Worker [00:19:35]: Then we discussed voice input and voice output, and how that is required to converse with the NPCs. We discussed the embodiment, how an AI NPC lives within this rigged avatar body that has animations, movements, and gestures and can lift things, and how that all integrates with this simulated world. And within that simulated world we have an agent that can perceive, carry out actions, and learn from the environment over time, and thus potentially become something smarter and more usable for us humans. I think the two fields will benefit immensely from each other. In particular, I think human-computer interaction, as well as making the action sets more robust, is something where gaming can help a lot. And yeah, how can you get started today? Just go to convai.com and you can make your AI agent with actions and a knowledge bank today. So, yeah, that's all. You can go and watch a presentation from Jensen where he talks about us. But yeah.
Nyla Worker [00:20:51]: Thank you so much for your time. I'm ready to answer any questions you may have.
Demetrios [00:20:56]: Nice, Nyla, thank you very much. I think we have a bunch of questions from the people from Westworld. Let's see if... No, I'm just kidding. A few people are asking. Let's see a couple of things. One is, okay, Nisarg Amin says: if you were to build an NPC application for augmented reality, it would be more similar to robotics because you do not have environment and sensor metadata. Though I believe that question came up before you were introducing some of the game-engine capabilities of the entity.
Demetrios [00:21:31]: Then the next part of the question is curious how this can be used in simulations as well, where not all metadata would be available up front.
Nyla Worker [00:21:38]: Yes, I want to start with AR. Within AR, we definitely need to rely on that perception set. That's why within Convai we are also very focused on multimodality, because from multimodal input we can at the very least get segmentation inputs or placement inputs as to where the objects are, and then with that make a more intelligent decision as to how to converse about the environment such that the AI NPC is spatially aware. That's definitely on our roadmap and highly considered. But yeah, in the case of a perfect game, we have perfect information, so it's very different. What was the second question?
Demetrios [00:22:22]: The simulation. Yeah. I'm curious how this can be used in simulations as well, where not all metadata would be available up front.
Nyla Worker [00:22:32]: What kind of simulations are they speaking about?
Demetrios [00:22:36]: So I read this one paper where, basically, the idea is that you can create a simulated society, right? Why don't you just allow those NPCs to talk to one another and have emergent behaviors, and all of a sudden there are going to be economies and new departments. And I think in the one I saw, they started having parties going on.
Nyla Worker [00:23:00]: I saw that one.
Demetrios [00:23:02]: Yeah.
Nyla Worker [00:23:03]: So we actually have a feature called NPC to NPC where we let NPCs just talk with each other. I can put it in the chat if anyone wants to go and play and build a game. It's actually quite fun to see the NPC conversations, interestingly. Yeah. But yes, you can use this in those kinds of simulations. And even then, I think the objects in the scene would remain the same and the map would remain the same, unless the AI NPCs can actually change the map, which would be a whole new set of generative games, which, by the way, are definitely in the pipeline. There are a lot of people auto-generating maps, auto-changing them, and that's a whole other world. But sticking to the subject of simulating AI NPC interaction, that's already happening.
Nyla Worker [00:23:52]: We have that as a feature of the platform, and it is actually quite fun to watch. Convai was at CES, we had a demo, and the funniest thing to watch was how people read into these AI NPC conversations. They were chatting with each other, and then the humans were like, oh, I think they hate each other. And then they would go and chat with them and give the AI NPCs some feedback that they hate each other, and then the NPCs would continue talking, and it was, like, odd. But yeah, this is already happening, and I think it's going to lead to a completely new set of games and experiences. That's very exciting.
Demetrios [00:24:34]: Yeah, very cool. That's very interesting. Just like that, the verbal reaction of the human can also be used as a supervising guidance signal.
Nyla Worker [00:24:41]: Yeah.
Demetrios [00:24:43]: The idea is that you could probably have some NPCs that learn differently from others, so that they're not all continuing to optimize in the same way, which can cause much strife. And this is a fascinating space.
Nyla Worker [00:24:54]: Yeah. And that's why we focus a lot on staying in character, because what happened in that paper is that all of them were very ChatGPT-esque. They always speak in this very professional way, and they don't have their own personalities, their own incentives, their own narratives, and their own agendas. That's actually another set of features that we've been working on within Convai, which is extremely exciting, because now you can drive a completely new game here.
Demetrios [00:25:25]: Nyla, thank you so much. And you put up the QR codes for folks to follow you, and if you want to be in the chat, I imagine lots of folks would like to continue the conversation.
Nyla Worker [00:25:34]: I'd love to. And you can always reach out to me on LinkedIn, Nyla Worker, and then on my Twitter. And yeah, I'm very happy to connect with all of you.
Demetrios [00:25:44]: Awesome, thanks again.