
Physical AI: Teaching Machines to Understand the Real World

Posted Feb 06, 2026 | Views 15
Tags: AI Agents, AI Engineer, AI agents in production, AI Agents use case, System Design

Speakers

Nick Gillian
Co-Founder and CTO @ Archetype AI

Nick Gillian, Ph.D., is Co-Founder and CTO of Archetype AI with over 15 years of experience turning advanced AI and interaction research into real-world products. At Archetype, he leads the AI and engineering teams behind Newton—a first-of-its-kind Physical AI foundational model that can perceive, understand, and reason about the physical world. Before co-founding Archetype, Nick was a Senior Staff Machine Learning Engineer at Google and a researcher at MIT, where he developed AI and ML methods for real-time sensor understanding. At Google’s Advanced Technology and Projects group, he led machine learning research that powered breakthrough products like Soli radar and Jacquard, and helped advance sensing algorithms across Pixel, Nest, and wearable devices.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.


SUMMARY

As AI moves beyond the cloud and simulation, the next frontier is Physical AI: systems that can perceive, understand, and act within real-world environments in real time. In this conversation, Nick Gillian, Co-Founder and CTO of Archetype AI, explores what it actually takes to turn raw sensor and video data into reliable, deployable intelligence.

Drawing on his experience building Google’s Soli and Jacquard and now leading development of Newton, a foundational model for Physical AI, Nick discusses how real-time physical understanding changes what’s possible across safety monitoring, infrastructure, and human–machine interaction. He’ll share lessons learned translating advanced research into products that operate safely in dynamic environments, and why many organizations underestimate the challenges and opportunities of AI in the physical world.


TRANSCRIPT

Nick Gillian [00:00:00]: So you kind of flip the problem on its head. Instead of constantly asking the questions to the model and giving back an answer, you basically set up this lens and then you basically stream your real time physical AI data through the lens. And now you get this very rich structured output that you can then build into a bigger system which we typically call a physical agent.

Demetrios Brinkmann [00:00:27]: You're in the US, right?

Nick Gillian [00:00:28]: I am, yeah. Palo Alto. Yeah.

Demetrios Brinkmann [00:00:30]: But I would take you for more of a Celsius kind of guy.

Nick Gillian [00:00:35]: You're right, you're right. I grew up in Northern Ireland, so that's where my slightly interesting accent is coming from.

Demetrios Brinkmann [00:00:41]: Yeah, it is one that I couldn't place right off the bat.

Nick Gillian [00:00:46]: Yeah, I've been here since 2011, which is when I moved to the US to do my postdoc. So yeah, I mean I've been in the US for a while. So yeah, it's a bit of a myth at this point.

Demetrios Brinkmann [00:00:56]: And you've been doing cool stuff with physical AI. And I know that you were at Google, you did a lot of incredible things there, have since spun out a company, you're leading the charge with physical AI. But I recognize that when I first heard the term physical AI, I instantly thought about robots. I imagine you get that a lot. Can we just clarify terms? What are you talking about when you say that?

Nick Gillian [00:01:30]: Sure, sure. This is a great place to start. So I mean robots are definitely physical AI. I mean they're a key part of it. But when others in this field talk about physical AI, I think they may really just think about robots and maybe autonomous driving, and then they maybe stop there. Right. At Archetype we are thinking much bigger than that, and we're really thinking of any use case in the physical world, in the real world, where you have some form of sensor and you need to make sense of it. Right.

Nick Gillian [00:01:59]: So just think for a moment how broad that is. I mean of course that includes robots, it includes self driving. But you know, think of like a large factory. Big machines in those factories typically have hundreds of different types of sensors in them to measure the pressure and the voltage and the gear rotations per second and all of these types of things. They're great indicators of what's going to happen with that machine. Right. Think of all the cameras you might have around that factory to kind of help monitor the safety of workers and how efficient the logistics are as things get shipped out and everything else. Right. Think of the electrical grid; that's physical AI, right? VR, AR, you know...

Nick Gillian [00:02:41]: Anything where you have a sensor and you're trying to basically do projection onto the world and do projection digitally on top of that. Right. So it's very broad. It's a really big area. And really, at Archetype, we're trying to build this horizontal platform that understands the physical world, that customers can use to build on top of for their vertical. And the breadth of customers that come to us is kind of wild. Everything from, you know, autonomous driving to people in factories, to folks that want to put sensors 2km underground in a big drill head or something.

Nick Gillian [00:03:20]: Each of these users is very different. Their requirements are very different. But really the thing that unites them all is they all have some sensor in the world and they struggle to make sense of it with traditional machine learning techniques or traditional signal processing techniques. They're basically missing that intelligence layer to map sensors into reasoning and then to be able to take some action on top of that. Right. So for us, physical AI is really that intelligence layer that's able to perceive, reason, and act on the physical world in any form. So it's very broad.

Demetrios Brinkmann [00:03:54]: And you're not talking about a model that will be multimodal by default, that can understand things at a higher level of abstraction, where you throw everything at it, whether it's camera data or sensor data that comes in tabular form? It's more like specialized models in that case, for the specialized use case?

Nick Gillian [00:04:21]: Well, our goal is to build one big foundation model that customers can plug lots of different sensors into, and then they can use natural language to focus it on their use case. When it comes to actually deploying the model, they may need to take that really big model and then break off a slice of it and compress it down for their use case, for example to deploy it literally inside a drill that goes many kilometers underground. I'm not sure if you need to know every single possible fact in the world to put it underground. Right.

Demetrios Brinkmann [00:04:53]: There's probably every Bob Dylan song.

Nick Gillian [00:04:55]: Yes, every Bob Dylan song. Right. Or, you know, all of the presidents of the U.S. Like, there's a lot of knowledge there that is required initially for the foundation model to understand what's happening. But then once you know the use case, you can break off a subset of that model, compress it very efficiently, and then that's the thing you actually deploy in the real world. Right. But the base model, or the mothership model that the customers start with, is a much, much bigger model that you can plug many, many different types of sensors into, and then you can apply natural language to focus what that model is paying attention to. So a typical use case could be you plug in, you know, four different cameras that are overlooking the factory, and then all of the time series sensors from a big industrial machine inside the factory, and you focus on: what is the health of that machine right now? Right.

Nick Gillian [00:05:48]: So imagine if a technician comes up to it and starts to take it apart to do maintenance on it, versus the gearbox in the machine slowly getting out of sync. Without taking in all of those different sensors, the model has no idea why the gearbox rotation is changing. It could be because it's getting out of sync, or it could be because actually there's a person about to repair this right now and they're taking off all of the outer brackets and that's causing the machine to start to vibrate more, for instance. So being able to plug in all of these different modalities and use natural language to steer the model, for real time safety alerts, or for being able to predict when a machine might need to be repaired, or being able to actually help that technician as they're walking through the repair manual. The model can read the manual, it can watch what the user is doing in the world, it can look at the sensors that come out of the machine, it can fuse all that together and it can help the technician actually move through the steps to repair the machine safely and in the right order, for instance. So one big model that you can essentially use natural language to control and then focus on the task in the real world.
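
One common way to "break off a slice" of a large model for a constrained deployment like that is knowledge distillation: a much smaller student model is trained to match the big model's outputs on the narrow slice of data the deployment will actually see. The sketch below illustrates that generic pattern only; it is not Archetype's actual compression pipeline, and all sizes and data are placeholders.

```python
import torch
import torch.nn as nn

# Generic knowledge-distillation sketch: a large "teacher" model's predictions
# supervise a much smaller "student" that is cheap enough for edge deployment.
# Sizes and data are placeholders; this is not Archetype's actual pipeline.

teacher = nn.Sequential(nn.Linear(32, 512), nn.ReLU(), nn.Linear(512, 4)).eval()
student = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 4))

opt = torch.optim.Adam(student.parameters(), lr=1e-3)

# Unlabeled sensor windows from the narrow target domain (e.g. one drill head).
domain_data = torch.randn(256, 32)

for _ in range(20):
    with torch.no_grad():
        soft_targets = teacher(domain_data).softmax(dim=-1)  # teacher's beliefs
    student_logits = student(domain_data)
    # KL divergence pulls the student toward the teacher's output distribution.
    loss = nn.functional.kl_div(
        student_logits.log_softmax(dim=-1), soft_targets, reduction="batchmean"
    )
    opt.zero_grad()
    loss.backward()
    opt.step()

print(sum(p.numel() for p in teacher.parameters()),  # teacher parameter count
      sum(p.numel() for p in student.parameters()))  # far smaller student
```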

Demetrios Brinkmann [00:06:52]: That's some wild stuff. Talk to me more about building that model, because that's where the fun part comes in, I imagine.

Nick Gillian [00:07:02]: Yeah, yeah. So let me go back and give some context here. I mean, I've been working on sensors since the start of my career. Even before I started my PhD, I was doing a lot of stuff with real time sensors, trying to understand them initially with signal processing techniques and heuristics and everything else. And I've been working on machine learning since about 2006, so long before it was cool, before deep learning was even a thing. So I've been able to go through all these trends and it's been a wild ride, it's been a lot of fun. And before I started Archetype, I was at Google for almost 10 years.

Nick Gillian [00:07:42]: Where I met a lot of my co-founders, and we've all been working on some form of sensing in the physical world for different parts of our careers, either on the ML stack for myself, or more of the signal processing stack for Jamie, my co-founder, our chief scientist. And Leo, who's our designer, has been looking at it more from the interaction standpoint, and so on and so forth. Right. So when we left Google in 2023, we saw what was happening in the foundation model space and we were able to kind of extrapolate that this is going to go well beyond just text, this is going to go well beyond vision; this is going to go potentially to every single sensor there is in the world. And this is the missing link that we've all been looking for in our careers to actually understand what's happening in the world. Right. So when we left Google, ChatGPT had kind of just exploded. Text models were a very big thing, and we knew that we wanted to build a foundation model for sensing for the physical world that could take in any form of sensor data.

Nick Gillian [00:08:45]: But to actually build that model, the architecture of it is extremely difficult, because the techniques you use to build a large language model or a vision language model are very different than the techniques you would need to build something where you can plug in LiDAR, for example, or models where you can take hundreds of different time series sensors across, say, a large wind farm, where every turbine might have hundreds of sensors in it and you maybe want to plug in all of those sensors across the entire wind farm, so now you can see the micro and macro patterns between each of the turbines and the whole holistic field. The techniques to build this just didn't exist. So we had to really go back to first principles and look at how we build a model where you can plug in any form of sensor, and in most cases it won't have nice Internet-sized data sets with very nice human labels annotated for them, or 10 or 20 years of things you can go and scrape. So there's a big technical challenge there of how to do that.

Demetrios Brinkmann [00:09:54]: There's no Common Crawl.

Nick Gillian [00:09:55]: Yeah, exactly. So we're able to leverage things like Common Crawl, you know, to start the model. But that's the start, that's the tip of the iceberg. Right. There's just so much more that has to happen there. So we had to basically go back to basics and look at how we take in all of these different types of sensors, all of this data that's actually not on the Internet, and look at fundamental techniques that we can leverage, and then use that to build very large models that do leverage the same kind of recipe that you would see in an LLM. Right? So there's a big self supervised pre training phase, there's post training, there's RL fine tuning and so forth. But the strategies of how you build those data sets, of how you do the alignment, of how you synchronize different sensors: it's a very different approach to actually make this work.

Nick Gillian [00:10:44]: So if you squint at it, it looks very like an LLM. But when you actually go and build it and build the data sets and think about how to construct this, it really requires us to go back to basics, right? Our fundamental thesis is similar to what's happened in language, where if you look at the original GPT-1 paper, there's a very interesting phenomenon that was picked up there, where simply by learning to predict the next token over these text sequences (and there, I think early on, they were using a lot of data scraped from Amazon reviews, right), they were able to actually have the model predict on its own the sentiment of the review, whether it was a positive sentiment or a negative sentiment, simply by predicting the next token in the sequence. Which is kind of wild if you step back and think about it: by being able to predict the next token in this arbitrary sequence, you actually start to pull out these very rich sentiment models that other people in the NLP space were struggling to build, right? And suddenly this model, like out of nowhere, is able to do it really well. So we have a similar thesis on sensor data, which is that simply by observing the physical world, these really large, powerful models can fundamentally start to learn the intrinsic principles that govern the world through these sensors, and then be able to model and leverage that knowledge in the same way that an LLM can read the Internet and then suddenly it understands human history and it can generate beautiful essays and all of these types of things. We could do the same thing with sensor data. So the fundamental thesis is the same; how you go about actually building it requires a little bit of innovation.
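
The analogy to next-token prediction can be made concrete with a toy example: train a small model purely to predict the next reading in an unlabeled multivariate sensor window, with no human annotations involved. This is a minimal generic sketch of that self-supervised objective, not Newton's training code; the architecture, shapes, and data are arbitrary.

```python
import torch
import torch.nn as nn

# Hypothetical illustration: next-step prediction on unlabeled sensor windows,
# analogous to next-token prediction on text. Shapes and sizes are arbitrary.

class NextStepPredictor(nn.Module):
    def __init__(self, n_channels: int, hidden: int = 64):
        super().__init__()
        self.rnn = nn.GRU(n_channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_channels)

    def forward(self, x):                      # x: (batch, time, channels)
        h, _ = self.rnn(x)
        return self.head(h)                    # predicted next reading per step

model = NextStepPredictor(n_channels=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Fake unlabeled sensor data standing in for pressure, RPM, temperature, etc.
batch = torch.randn(32, 128, 8)

for _ in range(10):                            # tiny training loop
    pred = model(batch[:, :-1, :])             # predict reading t+1 from readings <= t
    loss = loss_fn(pred, batch[:, 1:, :])      # compare against the true next readings
    opt.zero_grad()
    loss.backward()
    opt.step()
```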

Demetrios Brinkmann [00:12:41]: There's so many avenues that I want to go down, but the one that I think will probably be the most valuable right now for myself is the idea of the world models and how that plays into what you are doing. Is there any crossover? Again, I'm trying to clarify the terms because I see things pop up on social media for a blip and I kind of am following it. I understand what it is a little bit, but I don't necessarily do a deep dive. And then I'm graced with someone like your presence on this podcast and I get to go like, okay, what's going on here?

Nick Gillian [00:13:19]: Yeah, yeah, okay. This is a great thing to deep dive on. So there's definitely some overlap, but I think how we look at world models is very different than most of the community. Right. And that's fundamentally because a lot of folks who are working on world models are very vision based. Right. There's an intrinsic bias there to take in a bunch of vision and spatially aligned data and use that to build almost like a digital twin, a projection of the physical world. It doesn't have to be a digital twin.

Nick Gillian [00:13:57]: It can now be a completely simulated world, but it's extremely anchored. You can think of it as four-dimensional data, where you have the three spatial dimensions of the world plus time. Right. So you can basically move around these virtual worlds that you can generate by observing huge amounts of video data, for example. Right. And while that's definitely a world model, I would not argue against it, the problem is it's extremely biased toward vision. Right.

Nick Gillian [00:14:28]: So if you want to take another sensor modality and capture things in the world, like CO2, for example, how do you project that into this kind of very vision dominant system? Right. How do you take a temperature sensor and project that into this type of system? Right. And then kind of leverage that and bootstrap that? Right. So the way we think about world models: the core of our foundation model is called Newton, and we're trying to allow Newton to learn this world model itself from the sensor data, to learn the representation of that model directly from the sensor data, and not have this kind of human bias of a very structured 3D form that very much biases the model towards vision based systems. That's very useful, but to actually capture the subtle details that are happening in the real world, you need to go well beyond vision. There are a lot of things in the world that you can't see, you can't touch, you can't sense, that a camera cannot pick up: the rest of the electromagnetic spectrum, all of the chemical processes that might be happening, and all the things that are happening in the electrical grid.

Nick Gillian [00:15:51]: I mean, you can't see that with a camera. Right. So even take a teapot right on the stove: unless there's steam coming out of it, you can't really tell what the temperature of the teapot is. Right. There are fundamental things that are missing. And sure, if you go into other spectrums, like infrared for instance, then you might be able to start to get these things. But the fundamental point here is we believe these world models need to go beyond vision. They need to bring in lots of other modalities to be able to fundamentally understand the subtleties of what's happening in the world, and then to be able to reason about the world, or to be able to act on top of it, or be able to predict what might happen in the future, for example, which is what a lot of world models are being used for.

Nick Gillian [00:16:35]: You need to be able to capture all of these modalities. So there's definitely an overlap of how others in the field look at world models. But I think we are trying to go beyond these kind of very vision based models that essentially are being used to build fancy computer games at this point. It's cool, but a lot of our customers are coming to us and they're not trying to build a computer game rendering of their factory, they're trying to optimize the throughput of the factory, for example, or they're not trying to build a nice digital twin of the windmill farm. They want to actually understand why machine number 62 has dropped 10% in efficiency today, for example. Right. Things like this. Yeah.

Demetrios Brinkmann [00:17:17]: Is there certain data that you throw at these models that you should be weighting differently?

Nick Gillian [00:17:25]: Well, I guess one of the biggest ones is the multimodal nature of looking at the same phenomenon in the physical world, but through different modalities. I guess that's one of the things we are betting on, in that we are trying to build a model which is natively multimodal in, multimodal out. So I can talk a little bit about that. But multimodal in here means you can either take multiple instances of the same sensor, so it could be multiple cameras, for example, that are overlapping or non overlapping in a factory or in an autonomous car, or it could be hundreds of different time series sensors in one big industrial machine. So it could either be multi instance or it could be multimodal. Right.

Nick Gillian [00:18:15]: So you maybe have a camera and a pressure sensor, or you have lidar and a radar and the gearbox RPMs, for example. And each of those sensors, even the very simple ones, captures details that other types of sensors just fundamentally can't. They're orthogonal to each other. So combining these different modalities into one model, and being able to do deep fusion on those sensor modalities, allows the model to understand these fundamental relationships that you really can't solve with just one sensor. Right. And there are a lot of problems in the world where you need more than one sensor modality to be able to solve it. The teapot example is a very simple case of that, right. Or the thing I mentioned before of the large machine, where you start to see these anomalies in the signal.
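
For a rough sense of what "deep fusion" of modalities can look like in the most generic form: each sensor type gets its own encoder that projects into a shared token space, and a single transformer then attends across camera tokens and time series tokens jointly. This is an illustrative pattern only, not Newton's architecture; all module names and dimensions here are made up.

```python
import torch
import torch.nn as nn

# Generic multimodal fusion sketch: per-modality encoders project into a shared
# embedding space, then one transformer attends across all sensor tokens jointly.

D = 128  # shared embedding size (arbitrary)

class TimeSeriesEncoder(nn.Module):
    def __init__(self, n_channels: int):
        super().__init__()
        self.proj = nn.Linear(n_channels, D)

    def forward(self, x):              # x: (batch, time, channels)
        return self.proj(x)            # -> (batch, time, D) sensor tokens

class ImageEncoder(nn.Module):
    def __init__(self, patch_dim: int = 16 * 16 * 3):
        super().__init__()
        self.proj = nn.Linear(patch_dim, D)

    def forward(self, patches):        # patches: (batch, n_patches, patch_dim)
        return self.proj(patches)      # -> (batch, n_patches, D) image tokens

fusion = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2,
)

camera = ImageEncoder()(torch.randn(2, 196, 16 * 16 * 3))    # fake camera patches
pressure = TimeSeriesEncoder(8)(torch.randn(2, 64, 8))       # fake machine sensors

tokens = torch.cat([camera, pressure], dim=1)  # one joint sequence of tokens
fused = fusion(tokens)                         # cross-modal attention over both
print(fused.shape)                             # torch.Size([2, 260, 128])
```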

Nick Gillian [00:19:10]: Well, is that because there is actually a malfunction in the machine, or is it actually that the technician is just working on the machine, or on the machine next to it, and it's a known anomaly? It's explained. You don't need to stop the factory today to go and fix it. It's like, yeah, yeah, that's scheduled maintenance. It's kind of the normal thing.

Nick Gillian [00:19:29]: Right?

Demetrios Brinkmann [00:19:30]: That's Scott.

Nick Gillian [00:19:31]: Yeah, it's Scott again. Yeah, yeah, yeah.

Demetrios Brinkmann [00:19:35]: Now if you extrapolate this out, it feels like you can have an infinite amount of data sources. Where do you stop?

Nick Gillian [00:19:47]: Well, this is it. This is one of the big challenges of working on the physical world, right? There's a very long tail required to make the system work. Folks that are working on autonomous driving know this, right? After you've driven a few hundred miles, the system already starts to gain some intelligence, but you have to drive billions of miles to be able to capture all the possible things that happen. Because, frankly, in the physical world, it's on 24/7, right? It never stops. And most of the time things are not that interesting, right. Most intersections, if you have a bunch of sensors there, most of the time there are not accidents, thankfully. Right. But you have to look at maybe 6 months, 7 months, 8 months of data before you actually see that rare event happening.

Nick Gillian [00:20:42]: By nature, it's a rare event or an anomaly. So going back to your question, how much data do you need? Well, you need a small amount of high quality data that captures all of these different parts of the distribution. But unfortunately, to sample those important moments, you do need to look at most of the distribution, and therefore you need to look at a lot of data. So a big part of what we're building at Archetype is the ability to take in these huge data streams that are essentially continuous streams from the physical world and be able to very quickly triage them, be able to catalog them, to be able to understand what parts of that data are useful, versus what data you need to store in cold storage, versus what you can just throw away because it has very little value in it, for example. So this is one of the big challenges of working in the physical world. It's very hard to capture those little moments that really help push the performance of the model forward. Or from an eval perspective, it takes a lot of sampling in the world to capture those hard moments that really show where your model is breaking, so that you know then what to do next to go and improve it.

Nick Gillian [00:21:54]: Right? Yeah.
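
A toy version of that triage loop might look like the following: score each incoming window of sensor data, keep the rare, interesting slices hot, archive the mildly interesting ones, and drop the routine bulk. The scoring heuristic, thresholds, and storage tiers below are purely illustrative stand-ins, not Archetype's pipeline.

```python
from dataclasses import dataclass
from typing import Iterable

# Toy triage loop for a continuous sensor stream: score each window of data,
# keep the interesting slices hot, archive the plausible ones, drop the rest.
# The thresholds and the novelty_score() heuristic are purely illustrative.

@dataclass
class Window:
    sensor_id: str
    start_s: float
    values: list[float]

def novelty_score(w: Window) -> float:
    # Stand-in heuristic: how far the window deviates from its own mean.
    mean = sum(w.values) / len(w.values)
    return max(abs(v - mean) for v in w.values)

def triage(stream: Iterable[Window], hot, cold):
    for w in stream:
        score = novelty_score(w)
        if score > 5.0:          # rare / anomalous: keep for training and evals
            hot.append(w)
        elif score > 1.0:        # mildly interesting: cheap archival storage
            cold.append(w)
        # else: routine data, discarded to keep storage costs bounded

hot_storage, cold_storage = [], []
stream = [Window("turbine-7", t, [0.1 * t, 0.2, 60.0 if t == 3 else 0.3]) for t in range(5)]
triage(stream, hot_storage, cold_storage)
print(len(hot_storage), len(cold_storage))   # 1 0: only the spike at t=3 is kept hot
```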

Demetrios Brinkmann [00:21:56]: By definition, they're anomalies. And that's literally what an anomaly is.

Nick Gillian [00:22:00]: Yeah. By definition, a rare event is rare. So you may have to look at two years, three years of data before you see the event. Right. So these are some of the challenges. And again, going back to that first principles thing we talked about, this is not a new thing with foundation models and everything else. This is a problem with the physical world. I've had this problem my whole career.

Nick Gillian [00:22:24]: I've been trying to solve this problem many, many times. And unfortunately, to solve it in the previous deep learning waves, where everything was focused on mostly supervised methods as opposed to self supervised and everything else, you would have to go and get very high quality labeled samples for every instance. Right. You'd have to go and get millions of labeled pairs to be able to build models you could ship. When we were building radar models back at Google, there was no ImageNet for radar. We had to go and build those data sets ourselves. And we'd have to go do these massive data collection campaigns and collect millions of examples of people doing motions and gestures and picking up mobile phones and holding them in weird ways, just to capture all these nuances, and then...

Demetrios Brinkmann [00:23:17]: Give it to the intern.

Nick Gillian [00:23:18]: Yeah, yeah, exactly. And we'd have these huge teams. That was most of the effort of those big production products, honestly. And this is the thing, I think I've heard Karpathy say this as well when he went to Tesla: when you're a researcher in academia, you do all the fun things, you're doing all the research and you're building the papers and you're working on the algorithms and you're drinking coffee and sitting and thinking deeply, and in most cases you're lucky enough to be able to download some reference data set and work on that and then publish a paper on it. Right. When you go to industry, it inverts: most of your time is spent building the protocols. What's the data collection protocol plan? How are you going to work with legal to get this out? What are the product requirements of the features? And the actual research part of the system gets much smaller, in terms of what you're actually spending your time on.

Nick Gillian [00:24:12]: It's still equally as fun, but it's a very different set of problems you have to solve to really build production models.

Demetrios Brinkmann [00:24:18]: Yeah, well, I appreciate that your mind went to the idea of the anomalies and the massive amount of data that you need in order to accomplish what you're doing, in the sense of the size of the data. What I was thinking of is the diversity of data sources and data streams. So as you were mentioning, you've got lidar, you've got temperature, you've got whatever; there's time series data, there are words, and then you have cameras. If you again extrapolate that out, when is enough enough? Because you could say, well, maybe we need to get some kind of infrared sensor in there. Maybe we need to test the humidity of the air. Maybe we... How can you know that you have enough data points? Or can you ever?

Demetrios Brinkmann [00:25:13]: Or should you just be trying to get as many as possible?

Nick Gillian [00:25:16]: Yeah, this is a great question. And you know, it's always the dirty secret of machine learning that no one really wants to talk about, particularly when they look at log-log plots, that you get these diminishing returns, right? In terms of how much data you put in versus how much improvement you're going to get out of the system. So with your question, I always like to think about it along two axes, right? One is more of a product-driven axis, and one is more of an evaluation, performance, reliability, trust perspective. So from the product perspective it's: well, what sensors are customers coming to us with that they want to plug into this model? Because it's all cool for us to be very academic about it and say we want to support infinite sensors. But at what point does a customer come to us and say, great, you support all my sensors?

Nick Gillian [00:26:10]: Cool. You know, so we definitely have not got there yet. And I think it will take us many, many more years before we get there. But there is this classic kind of Pareto distribution where most of the customers are coming with a small subset of common sensors. Most of our customers come to us with some form of camera fleet and some form of time series sensor fleet. And in most cases, they have both, and they want a model where they can plug both of these sensor types in. It's the classic example I've already given.

Nick Gillian [00:26:44]: You have a big factory, you have a bunch of machines there, they have hundreds of time series sensors in the machine measuring pressure and RPMs and oil temperature and electrical consumption and so forth. And they want to take the systems measuring those large industrial machines and fuse that with all the cameras that are around the building that are actually looking at how those machines are being operated by humans, or how one is connected to the machine next to it, for example, or what all the logistics are of things moving around the factory, and so forth, right? So one aspect of this is how rich and diverse those sensors are that customers are bringing to us, right? And this is where, again going back to that first principles concept, we're trying to build fundamental techniques where, if a customer does come to us with a completely new type of time series sensor or camera, ideally we're building general enough encoders that they can plug that into our model without us having to retrain it from scratch, right? Because the structure of the data is close enough to everything else we've trained it on, and the fundamental processes that those sensors are measuring are similar enough to what we've seen before, the model already understands some of the key components, and maybe you can either use it out of the box, or you can tune it with a little bit of data to recalibrate the model to understand your special new type of sensor. If Bosch comes out with a brand new temperature sensor tomorrow, they should be able to plug it into our model and it should just work, right? But then there are some customers that come to us with very rich, unique sensors, like in the RF domain, for instance, where the literal structure of the data itself looks nothing like what we've given the model before. And that's where we are trying to build general recipes where we can take a large amount of unlabeled data for that sensor and use it to build these very powerful encoders that can then be plugged into Newton, and then you can retrain Newton to pull in that new modality type. Right. So you don't have to train the model from scratch, but you will need to build a new custom encoder for this very exotic type of RF sensor, for example.
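
That "new encoder, same backbone" idea is a familiar transfer-learning pattern: freeze a pretrained backbone and train only a small encoder for the new sensor type so that it emits tokens the backbone already understands. The sketch below shows that generic pattern under made-up names and sizes; it is not Archetype's actual recipe for onboarding a new modality.

```python
import torch
import torch.nn as nn

# Generic sketch: adapt a pretrained multimodal backbone to a brand-new sensor
# type by training only a small new encoder; the backbone stays frozen.

D = 128  # shared embedding size the backbone expects (arbitrary)

backbone = nn.TransformerEncoder(            # stand-in for a pretrained model
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2,
)
for p in backbone.parameters():
    p.requires_grad = False                  # pretrained weights stay untouched

class NewSensorEncoder(nn.Module):
    """Maps raw readings from a hypothetical new RF sensor into backbone tokens."""
    def __init__(self, n_channels: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_channels, D), nn.GELU(), nn.Linear(D, D))

    def forward(self, x):                    # x: (batch, time, channels)
        return self.net(x)

encoder = NewSensorEncoder(n_channels=12)
head = nn.Linear(D, 1)                       # tiny task head, e.g. an anomaly score
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)

x = torch.randn(16, 64, 12)                  # raw windows from the new sensor (fake)
y = torch.randn(16, 1)                       # stand-in training target

for _ in range(5):
    tokens = encoder(x)                      # only these weights (and the head) learn
    pooled = backbone(tokens).mean(dim=1)
    loss = nn.functional.mse_loss(head(pooled), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```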

Nick Gillian [00:29:08]: But then if they come with another type of RF sensor, you should be able to plug that into the same encoder and reuse it and bootstrap from there. So part of this is coming up with those general recipes and then building the right software abstractions and platform tools around it so that our customers can actually do that themselves. They can come with their own data, they can run our platform on their infrastructure, so the data never leaves their networks, et cetera, et cetera. And they can use that to build their own custom encoder without having to know any of the deep science that's going on under the hood to build those things. So that's the sensor side. To quickly talk about the other aspect, the other axis, which is when is enough data enough: that's more on the evaluation axis. Right. So there, this is where I've learned a bunch of hard lessons in building production systems, where it's very different from how academia typically thinks about building models. Right.

Nick Gillian [00:30:12]: So if you open any machine learning textbook, chapter one, it'll be: take your data, everything is IID, you can take a random split, use 80% of it for training, 20% for testing. Cool, great. You do that and you get these great numbers, publish your paper, ship the model, etc.

Demetrios Brinkmann [00:30:30]: Oh, if only it were so easy, huh?

Nick Gillian [00:30:32]: Yeah, if everything was that easy. So when building production models, this is a really bad plan, fundamentally. Right. The main reason being that the distribution you typically capture in that data set is not fully representative of the product and the real world. Right. And I've made a few mistakes in my early career where I would go out and have teams collect the training data first for a product, and you'd be randomly splitting it and keeping back a subset that you would be testing on. And then later on you'd realize, like, we're getting 99% performance on our held-out test set, in our eval. But when we try the product, it's kind of crap; at best it's 80%, maybe it's 60%, right.

Demetrios Brinkmann [00:31:27]: Oh man, I laugh because I think everybody's been there at least once. Yeah.

Nick Gillian [00:31:32]: And you know, it took me a long time to work out that the way to build the test set is not to build the training set, make the model work, and then start to build the test set. From day zero you build the test set, and ideally you build the test set using a different protocol and a different team, maybe even in a different part of the country, so you really start to see the differences between your training set and your test set. And the way to build a test set, and this gets to your question of how much data is enough, is that before you build the test set, you need to sit down and... I like this framework, it's called bucketed analysis. Right. Which is, instead of just looking at some proxy number, let's take the most obvious one, like accuracy, for example. Right. Having accuracy over your test set is useful for exec presentations and simple graphs and papers and things.

Nick Gillian [00:32:23]: But there's a lot of nuance in there. If you have a model which is 92% accurate and you have another model which is 92.5% accurate, well, is the second model actually better than the first model? I mean, you would maybe think it is, it's 0.5% better, but in production it may be fundamentally worse. Right. And you don't know, because you aggregate everything into a single number. So a really good framework that I use a lot is this bucketed analysis technique, where essentially, before you start any data collection, you sit down, you break apart your problem, and you try to understand what the key variables are in the system you're building that are going to impact performance. And some of those are human variables: height, weight, gender, skin tone, whether it's a wrist-based device versus something you're going to put on your eyes versus something you're holding in your hand. There's going to be a bunch of things that are going to impact the performance of that system.

Nick Gillian [00:33:21]: Then you have to think about other aggressors, you have to think about other key variables in the system. And typically you come up with maybe 6, 7, 8 critical variables for your system. And now you can do the mathematics: well, that's now a seven or eight dimensional system. And you can basically say, for each combination of those key variables there is a bucket that captures that slice. We need to find enough participants that are seven foot tall, that have darker skin, that have blue and green eyes, and that really like cucumber sandwiches or something. Right? So now you have this bucket which represents that slice of the distribution. And then you say, okay, well, what is our confidence level on that bucket? And then you can actually say, well, if we want to be 95% confident, for example, in the performance at that bucket, that tells us how many samples we need to get of seven foot people that like cucumber sandwiches.

Demetrios Brinkmann [00:34:21]: Oh, I see. Yeah.

Nick Gillian [00:34:22]: Okay. And now you can measure accuracy and F1 and all your typical metrics on that specific bucket. But more importantly, because again you should do this at like day minus one, before you even start the project: as you start the project, you can already start to build out your test set. And initially all those buckets are going to be empty, right? So now as you get data, you can basically see which parts of the distribution you're actually filling up. You'll never fill that full distribution, right? With six or seven variables...

Nick Gillian [00:34:53]: You're already at millions of buckets. And for each bucket, you may need tens, hundreds, thousands of users to be more statistically certain. Right? So everyone's data collection budget is already blown up just by this. Right. So you have to use very sophisticated sampling techniques about which buckets you pick, et cetera, et cetera. But essentially what you can do is, as you are building out this test set, you can look at the performance of those buckets, and now you can actually say, for a specific bucket, is the performance good enough? Right. And once it's good enough, ideally you tell your training team: stop collecting training data for that type of thing, we're already good enough in that area.

Nick Gillian [00:35:36]: Go over here instead. You basically have a torch now that you can shine around, with this analysis framework, and it will help guide you on what data you should collect. Right? So this is a very powerful framework to help structure projects and build production AI models, because it allows you to start to see, first of all, where your gaps in testing are. Because when you actually do that analysis, you'll be shocked, on production systems, when you do this bucketed slicing, at how little of the distribution you've captured. In most cases, you realize: oh wow, we are 95% accurate because most of our data samples land in this one bucket. But actually, if we normalize by per-bucket performance (once we hit a certain threshold of samples in a bucket, we cap that bucket's contribution rather than letting it keep inflating the number) and then average across all the other buckets, you realize you're not at 95%. You're at 60%. Right. Because you'd pooled it all together.
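
The 95%-versus-60% effect he describes is easy to reproduce on made-up numbers: pooled accuracy is dominated by whichever bucket has the most samples, while a per-bucket, macro-averaged view surfaces the weak slices, and a standard confidence-interval approximation gives the per-bucket sample count he alludes to. Everything in the sketch below is hypothetical data for illustration, not figures from the conversation.

```python
import math
from collections import defaultdict

# Toy bucketed analysis: the same predictions scored two ways.
# Each record: (bucket key, was the prediction correct?). Data is made up.
records = (
    [(("tall", "outdoor"), True)] * 950 + [(("tall", "outdoor"), False)] * 30 +
    [(("short", "indoor"), True)] * 6   + [(("short", "indoor"), False)] * 14
)

pooled_acc = sum(ok for _, ok in records) / len(records)

by_bucket = defaultdict(list)
for bucket, ok in records:
    by_bucket[bucket].append(ok)
bucket_acc = {b: sum(v) / len(v) for b, v in by_bucket.items()}
macro_acc = sum(bucket_acc.values()) / len(bucket_acc)

print(f"pooled accuracy: {pooled_acc:.2%}")         # ~95.6%, looks great
print(f"per-bucket:      {bucket_acc}")
print(f"macro average:   {macro_acc:.2%}")          # ~63.5%, the honest number

# Rough sample size per bucket for a 95% confidence interval with +/-5% margin,
# using the conservative p = 0.5 normal approximation.
z, margin, p = 1.96, 0.05, 0.5
n_per_bucket = math.ceil(z**2 * p * (1 - p) / margin**2)
print(f"samples needed per bucket: {n_per_bucket}")  # ~385
```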

Demetrios Brinkmann [00:36:36]: You see it everywhere. It's such a clearer picture of what you're looking at as far as performance. And as you mentioned in the beginning, it's really nuanced when you're looking at performance or accuracy as just one number, as opposed to breaking it up into these buckets.

Nick Gillian [00:36:53]: Yeah, yeah. And then of course, you can aggregate all those together to get one nice metric that you can track over time. So for a bigger team that can't go into that level of detail, you can actually ask: is your performance improving over time? Is your coverage on those test buckets improving over time? And normally what you see is this kind of sawtooth-wave-like trend, where the performance increases and then you expand to a whole new set of variables that either you couldn't collect in the early days or maybe you didn't even know about when you started the campaign. You discover, like, oh, actually whether the person exercised or not before they used the product actually impacts the performance of the product. Okay, that's another variable we have to add.

Nick Gillian [00:37:37]: So you're going to see performance increase, and then you'll hit this next set of variables and performance will drop, and then it will increase again, and then it will drop. And over 6, 12, 18 months, you'll build the actual performance you need to really launch these things. So this is a long way of saying, to go back to how much data is enough: I think it's either when the customers tell you you have covered all the sensors, or it's when you actually have these very rich eval slices and you can say, for the product we're trying to ship, this is the performance we need to hit, and now we're really confident, for 95% of all these buckets, that we do have the true performance numbers on where the system is going. Right.

Nick Gillian [00:38:18]: And typically you never reach that because you always add in more complexity. There's always a new feature you want to add, or there's always something that will add one more dimension that, you know, keeps growing the system. Right, exactly.

Demetrios Brinkmann [00:38:31]: It goes back to that whole thing of you're doing a ton of R&D, I'm sure, as you're also creating a product that folks are using on a daily basis.

Nick Gillian [00:38:45]: Yeah. And you know, honestly, that's the type of team or company where I really love to be: on those edges of going from zero to one, and then really taking that and building a product around it, and then building a rich set of features and things that people love on top of that core R&D. But yeah, it really starts from a core piece of technology, and that normally requires some significant innovation to work out how you're going to solve it. And this is where I should mention one of our tricks for how we do that with my co-founders. So I mentioned Jamie, who's my long term collaborator and our chief scientist, and we're kind of both on the technical side of the team. Right.

Nick Gillian [00:39:31]: But then we have Leo, who's our chief designer. So he's coming in and really looking at things from an interaction standpoint. And then we have Brandon, who's our COO, and really he's just the best operator I've ever met. He's just amazing at smashing through walls and getting things done, being able to really understand where a product needs to be built, building teams and executing, and, you know, changing laws when they need to be changed to allow things to get shipped, and all this stuff. Right. And then we have Ivan, who's our CEO, who is kind of the glue. He's able to look at all these things and bring them together. And our big secret is, while we're building fundamental technology, and we're always building it from the ground up from a technology standpoint, from the very start we bring in design and interaction and we really think about how people will use the technology.

Nick Gillian [00:40:29]: And then we use that both to kind of guide where the technology is going, but also to put some strong constraints on the technology itself to allow us to actually ship it. Right. Because if you don't put those constraints on it, you're just trying to. You're trying to boil the ocean and solve every problem at once and it's too much. Right. There's always this valley of death that R and D goes to die in. Or you try to take an R and D project and you typically pass it to a production team and they will take it and ship it. And that doesn't go well for very new technology.

Nick Gillian [00:41:05]: You really have to make sure that's successful. And one of the tricks of making this happen is making sure that we use interaction as a way to put some strong constraints on this, and really think about how people will use the technology, and how to kind of guide them and make it easier for the model itself to solve these problems, et cetera.

Demetrios Brinkmann [00:41:22]: Well, that's a great segue into what we were talking about before we hit record on the construction sites and the use case around that. Can you connect those dots of the interaction, how you ship something so that you had those constraints with this use case of you've now got sensors and all of this multimodal data for a construction site?

Nick Gillian [00:41:46]: Yeah, that's a great segue. So construction is a really good example of why doing things in the physical world is hard. Right. We worked with a Japanese construction company, this mammoth company, Kajima. If you've ever been to Japan, they're the company that built the island that the airport sits on top of. They didn't build the airport, they built the island that the airport is resting on top of, in Japan. It's phenomenal.

Nick Gillian [00:42:24]: So they build bridges and tunnels and islands and buildings and everything, right. And we worked with them on a five year project, and we came in at the end, where they were basically moving a river, right, as one does on a normal day, to avoid flooding. And they essentially had four or five years' worth of data from all of these cameras placed around different parts of the river, as the construction teams put these large excavators on barges, floated them out on the river, and dredged the river to widen and deepen it to avoid flooding. And their big goal was to understand from that data how to estimate the throughput of each of the dredging teams. A team had a schedule where they were supposed to dredge six hours that day.

Nick Gillian [00:43:19]: Did they actually do six hours, or did they do two, or did they do zero because of maintenance issues or weather or et cetera, et cetera? And this is a great example of why doing AI in the real world is hard, because a construction site is literally different every day, right. The thing that was safe yesterday is no longer safe today, because they've moved the bridge a little bit further out over the river, or that wall is now 3 meters higher than it was before, so if you fell off it yesterday you'd be okay, but if you fall off it today you're going to break your leg, right? These types of things. So you have weather, you have very large sites that are 25 square kilometers, and the excavator is now just barely visible in the furthest reaches of the camera.

Nick Gillian [00:44:05]: And you have to get the model to be able to understand this. So a few things we did to really use interaction to solve this. One of the big concepts we have in our model is called a lens. And you can think of this as the ability to, instead of like a chatbot where you ask a question, it gives you back an answer. You ask a question, it gives you back an answer. You can basically set up a lens and connect a real time data stream to it. And then the lens will constantly analyze the data with whatever focus you've put on that lens. Right? So take this simple example.

Nick Gillian [00:44:37]: You want to work out if your package has arrived at your home. You don't want to keep having to ask your doorbell, like, did my package arrive? Did my package arrive? Did my package arrive? Right. You want to just say, like, notify me when my package arrives today. So you set up a lens, it will continuously run, and then when something happens, it can actually output an answer. Right? So we use this interaction technique for construction where instead of constantly having to ask the model, like, what's the efficiency of the work team today? You set up this lens before you get the data. And now you actually run that lens on either prerecorded videos or live data streams. And now the model as it's running will actually basically output this rich structured response of, you know, is the work team resting? You know, is the barge on the river, Is it actually dredging or is it being moved, you know, et cetera, et cetera. So you kind of flip the problem on its head.

Nick Gillian [00:45:30]: Instead of constantly asking questions of the model and getting back answers, you basically set up this lens and then you stream your real time physical AI data through the lens. And now you get this very rich, structured output that you can then build into a bigger system, which we typically call a physical agent. And that's the system you can now actually deploy, where you basically deploy these physical agents in the real world that will in real time measure the productivity of the construction teams, for example. Right. And it will give you back notifications and responses, but you don't have to keep asking a question like, is the team operating well today, for example?

Demetrios Brinkmann [00:46:08]: So lens is a way for you to tell the model what you care about.

Nick Gillian [00:46:12]: Yes, beforehand, exactly. And then it's interactive, right? So it's real time. You can adjust the focus in real time, you can ask it questions if you want to, and it can give you back the answers. But essentially you give the model some explicit goal of what you're interested in, and the model can now, on its own, autonomously analyze that and then work with you to give you back a report, or send you a notification, or you can ask it a follow up question and it can then give you the answer. But it doesn't have to do it from scratch. Right.

Nick Gillian [00:46:44]: And then of course, you can have dozens of lenses running on the same sensor stream to look at different aspects of the data.
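
As a rough sketch of that flow, a lens can be thought of as a standing query: you declare the focus once, stream frames through it, and consume structured events downstream in a physical agent. The class names, fields, and classify() stub below are hypothetical illustrations, not Archetype's actual API.

```python
import random
from dataclasses import dataclass
from typing import Iterable, Iterator

# Hypothetical sketch of the "lens" interaction pattern: instead of polling a
# model with questions, you declare a focus up front and stream data through it.
# Class names, fields, and the classify() stub are illustrative, not a real API.

@dataclass
class Frame:
    timestamp_s: float
    camera_id: str
    pixels: bytes    # raw sensor payload (ignored by this toy example)

@dataclass
class LensEvent:
    timestamp_s: float
    label: str       # e.g. "dredging", "idle", "barge_moving"
    confidence: float

class Lens:
    def __init__(self, focus: str):
        self.focus = focus  # natural-language description of what to watch for

    def classify(self, frame: Frame) -> LensEvent:
        # Stand-in for the model call; a real system would run inference here.
        label = random.choice(["dredging", "idle", "barge_moving"])
        return LensEvent(frame.timestamp_s, label, random.random())

    def run(self, stream: Iterable[Frame]) -> Iterator[LensEvent]:
        for frame in stream:
            yield self.classify(frame)

# A "physical agent" consumes the structured lens output and decides what to do.
lens = Lens(focus="Is the dredging team actively working, idle, or relocating?")
stream = (Frame(t, "river-cam-3", b"") for t in range(5))

dredging_seconds = 0
for event in lens.run(stream):
    if event.label == "dredging":
        dredging_seconds += 1
    elif event.label == "barge_moving" and event.confidence > 0.9:
        print(f"[notify] barge relocating at t={event.timestamp_s}s")
print(f"estimated active dredging time: {dredging_seconds}s")
```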

Demetrios Brinkmann [00:46:50]: That's pretty wild. So I can get a daily report of how my construction site is going.

Nick Gillian [00:46:55]: Exactly. So you can get a daily report, or, if a worker happens to go up in one of the rigs, it'll immediately notify you on your pager and say, hey, you know, Claire has now climbed up on something dangerous, pay more attention to her to make sure that she's safe, for example. Right? Yeah. Or you can say: here's the work at this section of the river, and two months later, here's the productivity of this section of the river. Why is one section 20% lower in productivity? Help me analyze this and understand it.

Demetrios Brinkmann [00:47:33]: That's because of Scott. Scott's over there with his little side gig, checking over there.

Nick Gillian [00:47:39]: And it happens to be winter, which means there's a lot more rainfall, which means the river is much higher, which means there's a lot more dredge, kind of, you know, dirt in it, et cetera, et cetera. But ultimately it's Scott, we should move Scott.

Demetrios Brinkmann [00:47:50]: And so it will be able to extrapolate that kind of information of, hey, there's all this stuff that's happening.

Nick Gillian [00:47:57]: Yeah.

Demetrios Brinkmann [00:47:58]: And this will come with a confidence level, I'm sure, that we think it's because of these things.

Nick Gillian [00:48:05]: Exactly. So I'm joking about weather, but actually that's a real thing on that project, where, alongside these multimodal camera rigs that are all capturing different aspects of the work, we fused in the local weather station report, so we get rainfall, we get temperature, all those things. We fused in the hydrometer sensors that are on the river that actually tell you the height of the river, right, and things like water flow and things like this. We fuse these all into Newton, and now you can actually ask questions like: show me the top 10% of teams that are working this month. Show me the bottom 10%.

Nick Gillian [00:48:49]: Like, what's the difference between these three teams? And it can actually go and tell you: oh, because of the weather. There was a storm three days before, and it takes several days for that storm to impact things downstream. Right. And because of that, even though the team was able to go and work that day, it was a bit wet but it wasn't that bad, because of the storm a couple of days before, it's actually now decreased the productivity of the system. Right. So it's a great example again of that multimodal data, where you're bringing in the weather data plus the cameras, and now you can combine it together and look at more than just five minutes of data. You can look at many days of data, and because of that you can start to see these bigger patterns that even the engineers on the site weren't able to pick up.

Nick Gillian [00:49:29]: Yeah.

Demetrios Brinkmann [00:49:29]: So what do they do with these insights?

Nick Gillian [00:49:32]: Well, mostly they're using this to help them with the way they structure their teams. A lot of the work is being done by subcontractors. Right. So a lot of this is from a compliance standpoint, to make sure that the teams they're subcontracting are actually getting the work done. For example, a six hour shift to go and dredge: are they actually going and doing six hours of dredging? And if they're not, is it because of the weather? Or is it because the drill bit broke that day and they had to spend two hours repairing it? Or is it because Sam didn't turn up to work that day, and the team didn't have the barge driver and therefore couldn't do the work? And then, based on that, how do you go and start to optimize the site, so that, oh, next time we see a storm coming, let's actually just have the team go and work on this section of the site instead, because they're going to be a lot more productive, for instance, and then we'll go back to that area of the river where we know, four days after the storm, it's okay to dredge there. Right.

Demetrios Brinkmann [00:50:32]: Wow, dude, this is so fascinating. I think we'll end it there. Unless... is there anything that I didn't ask you that you wanted to talk about?

Nick Gillian [00:50:45]: Well, I think it's been a fun conversation. I love to talk about this stuff. I guess the final thing I can end on is: we have a huge amount of customers coming in, and you can probably tell that the team is growing quite a bit. So we are hiring. If folks want to come and work on physical AI, hit me up on LinkedIn, or maybe we can attach our careers page to the podcast. Come and check us out, because there's a lot of work to be done and a lot of very cool things happening now in the physical world. And I think in the next few years, this space is going to explode.

Nick Gillian [00:51:18]: It's going to be huge. Come and join us. Come and work on very cool technology in the real world.

Demetrios Brinkmann [00:51:24]: Is it hiring in the Bay Area? Hiring across?

Nick Gillian [00:51:29]: It's in the Bay and in Europe. So, yeah, we are hiring a team across the board. We're a distributed company, so we have folks all around the world. The two major axes are the US and Europe.

Demetrios Brinkmann [00:51:43]: But yes: engineering, sales?

Nick Gillian [00:51:46]: Yes, everything. Engineering, sales, research. Yep.

Demetrios Brinkmann [00:51:49]: Excellent.
