Synthetic Data for Computer Vision
Rich is a Machine Learning engineer with a background in Software Engineering. Passionate about all things related to AI.
An introduction to using Synthetic Data for Computer Vision tasks.
Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com/
Jose Navarro [00:00:00]: Thanks for coming, everyone. We've got two really good talks tonight, and they're the ones I really wanted to see, so that's how I wanted it to be. This first one we've got is Rich from Rowden Technologies out near Bradley Stoke. They're a really forward-thinking company in terms of research on edge ML, and they're doing some absolutely amazing stuff at the edge and also some amazing stuff with synthetic data, which Rich is going to talk about today, so I'll let him wade straight in.
Rich Riley [00:00:25]: Cool, thank you. So, yeah, thank you for the introduction. I'm going to give you an overview of synthetic data for computer vision. Firstly, I want to go over what we mean by synthetic data, why we need it, and then really how to get started using it. And then hopefully, towards the end, we can touch a little bit on what the future of this field could look like. As you are probably aware, machine learning tasks are really heavily reliant on data. We need very, very large datasets. What's more, we need that data to be annotated.
Rich Riley [00:00:55]: It's not enough just to go off and collect a load of images for whatever it is you're trying to train your ML to do. You have to go through and annotate that data. So you need that data to have the ground truth labels, as we call them. What the ML is going to do during training is use that annotation, that ground truth, to correct itself. Right. It's going to update itself during training in order to get better at whatever it is that we're training it to do. And whilst there are lots of public datasets available which are annotated, so this is ImageNet, a very famous dataset, they're unlikely to cover the use case, whatever it is that you're looking for. And if they do cover your use case, they're not going to provide the diversity and variance of data that you need to create a robust model, which we'll talk about in a second.
Rich Riley [00:01:40]: So collecting and annotating data is arguably the biggest bottleneck when we're trying to train an ML algorithm today. We can take a lot of computer vision architectures that have been proven to work off the shelf pretty much these days. Anyone can train an ML algorithm to some degree with just a few lines of code, but unless you've got that dataset, you're not going to get started. So, just to be clear what we mean by labeling data: we have differing degrees of computer vision tasks. This isn't an exhaustive list by any means, but it's going to depend on your use case, basically. It's going to depend on what you want your ML algorithm to be able to do at the end of the training. That is, what is it actually going to be put into production to do? So, in the top left here we've got image recognition, and this is the simplest one. This is just saying, does the item of interest exist in the image at all? Is there a dog present? Is there a sheep? Is there a horse? It's just saying, is it there in the image at all? Below that we have the object detection case. So this is what's referred to as bounding boxes, and this is object localization: where in the image do we think the thing is? Then we have semantic segmentation.
Rich Riley [00:02:45]: So this is actually a pixel-level classification task. Now it's actually providing this detailed image mask and it's telling us exactly where the algorithm thinks the items of interest are. And then we have instance segmentation, which is arguably the most complicated labeling task, where we actually distinguish between the different instances of the sheep. In this case, it's not just blocked out in one color, we've actually distinguished between them. There's an entire industry now of companies who will do this kind of labeling work for you. So it seems quite simple really. Like if you imagine just doing one of these, it's really straightforward. You could just draw some bounding boxes.
Rich Riley [00:03:21]: But as soon as you try and scale this to the amount of images that you're going to need for a real task, it suddenly becomes impractical. It's not just a case of having to sit there for a day or so; it requires very large teams of people to do this process. There are lots and lots of companies now whose entire job is just to label data for you. Google offer this service, AWS with Amazon Mechanical Turk. And what they're doing is subcontracting this work out to large teams of people who are being paid relatively small amounts of money to do this very repetitive task over and over again in order to label your data. Just to talk about some of the other challenges you've got when you're just collecting the data: actually, it could be very hard just to collect the data in the first place. Depending on your use case, it could be that the things you're trying to detect are very rare, they're not things that you have access to. You can't just take a video camera out and start filming them.
Rich Riley [00:04:11]: It could be quite dangerous situations, or just situations that are very costly to get to. If you're trying to send something to the moon, for example, and want it to work well, you're not going to be able to just go and collect data. There are privacy issues as well: under GDPR we can't necessarily just be using our customers' data for this. You might want to go beyond the sort of RGB spectrum that we're seeing here, so you might want to do LiDAR, thermal, all those kinds of use cases. And then it could be that your data changes all the time, right? So if you're training, let's say, a robot that's operating in a warehouse where the goods that it's dealing with are changing on a weekly basis, then it's not just a one-time thing; you're going to have to iterate a lot on this process anyway. So what is synthetic data? Synthetic data is really any data that has been generated in some sort of artificial manner.
Rich Riley [00:04:59]: So by that, I mean it hasn't been taken, in the computer vision sense, just with a traditional camera. And that's likely to be through some sort of 3D modeling software. So what we're seeing here is a drone that's been rendered in Blender, which is the sort of thing you'd use to do animations or make games, like Unreal, Unity, Maya, all these kinds of tools that graphic designers or artists use. Or it could have been generated by another ML algorithm. So we're now seeing generative AI, this rise of things like Stable Diffusion and DALL-E and all these models, Sora, which is about to change everything, where actually we can be generating data from another process as well. And it's worth noting at this point that synthetic data doesn't just cover computer vision, right. It also covers large language models as well. So a lot of smaller large language models are being trained on data that is coming from larger models, and that's also considered a synthetic approach.
Rich Riley [00:05:52]: So just as a summary here, we've touched on a lot of these already, but time and cost are your main reasons for wanting to go with a synthetic approach. Essentially, removing the need for human labeling is the primary use case, because when you render an image in 3D modeling software, what's happening is a process called ray tracing. You're probably all familiar with this idea, where you're essentially simulating light passing through that scene into a camera. So you're taking what is a 3D model of a world, and you're rendering it into a 2D image that we can recognize as a photograph. Pixar and all these studios are essentially built around rendering. And because you're able to render those individual paths through the scene, you can get your annotations for free. We can get that bounding box, we can get that image mask, we can get depth masks, we can get any kind of annotation we want without a human having to sit there doing it manually by hand. And they're much more precise labels.
Rich Riley [00:06:46]: And obviously we can produce many more for a fraction of the time and the cost. We can introduce more data variance, because we're able to simulate and randomize the data much, much more easily. So if you were to go out and do a manual collect, that collect is likely to be whatever the lighting conditions were that day, the weather conditions that day. We can simulate light, we can simulate weather, we can simulate different landscapes, whatever you want to simulate. Rare events is another one, especially when using synthetic data for verifying a model. We've all probably seen examples online where a driverless car has encountered something on the road that is very unusual, maybe a horse and cart, or an advertisement on the side of a lorry or something, and it completely confuses it because it hasn't seen that in training. We can create those kinds of anomalous, rare events. Occlusion is where something has been obscured by something else. And because the rendering software effectively can see through the occlusions, we can actually get much more detailed labels for predicting where we think something is behind a barrier of some kind.
Rich Riley [00:07:48]: And then we can look at custom sensors. So we can actually simulate very specific types of sensors. We could have thermal cameras, we could have specific different types of lenses, right? If you were trying to capture images on your phone, it's going to look very different to a GoPro camera, for example. Or it's going to be very different from a drone that is using a different sort of lens, like a fisheye lens or something. So we can actually make it very bespoke to your use case. So this is the pipeline, basically. So we build a scene of some kind.
Rich Riley [00:08:15]: So this scene is just the entire space that we want to be deploying our model into. So that could be a factory or a warehouse or a production assembly line, or just whatever it is you're doing. We're going to import the assets into that scene, the stuff that we're interested in detecting, but not just the stuff we're interested in detecting, also all the other clutter that's going to be in the scene in production as well. We don't want the model to just fit to the specific item of interest; we want to make sure that we're showing it stuff that it's going to encounter as well. We want to randomize that. We want to create random variation in the dataset. So that's what this second step is about.
Rich Riley [00:08:51]: And then we generate the data. So this is the rendering process. So we're actually going from that 3D randomized configuration of the domain and turning that into an image. And we're also getting our labels, the most important thing, and those will be output into the correct KITTI or COCO format, whatever format you want to use. And then we're going to train and we're going to validate our model. So depending on what you're doing, you could work in a purely synthetic way if you want to. But if you're doing this in any kind of professional capacity where you really care about the reality of how well your model is performing, you need to be validating against real data. So you are going to have to do a small amount of real data collection.
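For reference, the COCO format Rich mentions is just a structured label file that sits alongside the rendered images. A trimmed, hand-written illustration of what one record looks like is below; the file name and box values are made up purely to show the shape of the data:

```python
# A minimal illustration of the COCO detection format a synthetic pipeline
# might write out. File names and numbers here are invented placeholders.
coco_labels = {
    "images": [
        {"id": 1, "file_name": "render_0001.png", "width": 1920, "height": 1080},
    ],
    "categories": [
        {"id": 1, "name": "drone"},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [652.0, 410.0, 118.0, 74.0],  # [x, y, width, height] in pixels
            "area": 8732.0,  # width * height for a box-only annotation
            "iscrowd": 0,
        },
    ],
}
```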
Rich Riley [00:09:27]: Because the challenge with synthetic, and this is the real challenge, is going from synthetic to real, right. This is the gap that we're trying to bridge. We're trying to get an algorithm that's only ever seen synthetic virtual images to perform well when it's seeing physical images. Ideally, our synthetic data would be so good that the AI doesn't even realize, in inverted commas, that it's now actually in the real world rather than in the synthetic world. But that's obviously down to how good you can make your synthetic data. What we can do when we see poor performance on our real data is, normally in traditional processes, we'd go back and we'd tune the hyperparameters of our model, right. We'd start tuning and say, okay, we'll increase the learning rate, or we'll add some dropout or something, whatever it is we're going to do. But we can actually go back now and tune the generation pipeline as well.
Rich Riley [00:10:12]: So we can tweak the way that we're generating the data. We can add extra assets. If it's failing on particular things, we can add things. It's still a very tricky process. You do that and something else might break over here, but we've suddenly got a whole automated pipeline that we can play with. This is just a very quick use case I want to show in case you want to get started with this kind of thing. This is based on a blog that I wrote maybe two or three years ago now, actually.
Rich Riley [00:10:35]: So it's a little bit out of date, but I think it's a good example of how you might get started with this. And then I'm going to talk about some of the more advanced ways of doing stuff a little bit. So what I wanted to do was create an algorithm that could detect a drone flying through the sky. That was the use case. And I wanted to put a bounding box around that drone if it saw it. And what I was able to do is use Blender. So, Blender is an open-source 3D modeling piece of software. So the first thing is we have to build the scene.
Rich Riley [00:11:00]: So what I found is these things called high dynamic range images, HDRIs. I didn't know what that meant at the time. And what these are is people who are really into photography will go out and collect a whole range of pictures of a scene, and they will, using software, put them all together and create a 360-degree realistic photo backdrop. So you can imagine it's like a snow globe, and projected onto that snow globe is the scene. What we're seeing here are flattened versions of those environments. So I downloaded a load of these. There's a free open-source site where you can download them.
Rich Riley [00:11:37]: There are thousands on there, but I picked about 100 just to choose from. And then I went to do this scene randomization. So I took the 3D model drone that you saw before. That was just an open-source model that I got, again, online somewhere. And I dropped that into the center of the scene, actually, for this one. So we've got this kind of snow globe with this photorealistic background, and I've dropped the drone in there. Now I'm pointing the camera at it, and I'm randomizing the orientation of the camera.
Rich Riley [00:12:01]: So that's the camera's distance to the target, but also all the different parameters that you can play with for the variation of the camera. And then I also played around with the lighting conditions slightly. So those HDRI backgrounds actually encapsulate lighting conditions in them, so you actually do get that for free when you bring in different backgrounds. But I also did some manual playing around with that, and then I generated the data. So there's a plugin for Blender called BlenderProc, and this is open source. I've got some links at the end, if you're interested in finding out more.
Rich Riley [00:12:35]: But BlenderProc is exactly for this use case. It's been developed, I think, by a university somewhere who have realized that there's a use case for this kind of stuff. So you can see what I've got from this: I've got these bounding boxes. So this is as if a human has done the labeling for me. And I think I generated about 5,000 to 10,000 of these. It took a few hours on my desktop machine at home, so it's not mega amounts of time, and you can scale that as well if you're trying to produce lots of them. So there's a Python SDK that you can use for BlenderProc, and it's really well done. It's not huge amounts of code at all, and it actually pulls Blender down for you.
Rich Riley [00:13:08]: So you haven't got to do any kind of configuration stuff, and it actually works pretty quickly. And then I did some evaluation. This was the first pass, and it does pretty well, right? So this isn't synthetic now, this is real, and there are some false positives in the background. And I'll be honest with you, I didn't try and optimize this at all. I was doing this just to try and demonstrate to some stakeholders that this concept kind of works in general.
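To make the BlenderProc part a bit more concrete, here is a minimal sketch of a drone scene along the lines Rich describes, assuming the BlenderProc 2.x Python API and run with `blenderproc run script.py`. The file paths, category ID and pose ranges are placeholders, not his actual script:

```python
import blenderproc as bproc  # assumes BlenderProc 2.x (pip install blenderproc)

bproc.init()

# Load the drone model and tag it with a category id for the COCO annotations.
drone = bproc.loader.load_obj("assets/drone.obj")[0]  # placeholder path
drone.set_cp("category_id", 1)

# Use an HDRI as the 360-degree photo backdrop; it also drives the lighting.
bproc.world.set_world_background_hdr_img("hdris/sky_noon_4k.hdr")  # placeholder

# Randomize the camera: look at the drone from random distances and angles.
for _ in range(25):
    location = bproc.sampler.shell(center=[0, 0, 0], radius_min=3, radius_max=12)
    rotation = bproc.camera.rotation_from_forward_vec(drone.get_location() - location)
    bproc.camera.add_camera_pose(bproc.math.build_transformation_mat(location, rotation))

# Render RGB plus segmentation maps, then write COCO-style annotations.
bproc.renderer.enable_segmentation_output(
    map_by=["category_id", "instance", "name"], default_values={"category_id": 0}
)
data = bproc.renderer.render()
bproc.writer.write_coco_annotations(
    "output/coco_data",
    instance_segmaps=data["instance_segmaps"],
    instance_attribute_maps=data["instance_attribute_maps"],
    colors=data["colors"],
    color_file_format="JPEG",
)
```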
Rich Riley [00:13:32]: If I were being critical of what I've done here, I would probably get some other things flying through the scene and see what false positives I get, right? So if I had a bird fly past or something, I reckon we'd probably get some false positives. And at that point, what I'd do is go back and update the data. I'd say, okay, well, let's add some birds to this. Let's go and take some pictures. And they wouldn't necessarily have to be synthetic. You know, we could just add lots of different pictures into that mix. But that idea of updating your training data to see better results on your validation set is actually the power of using this sort of synthetic approach.
Rich Riley [00:14:04]: So this is just a quick video I thought I'd include. This is from the BlenderProc project, so you can just get a feel for what they offer. So again, you need somebody who understands how to create 3D assets and things, but you can see the power of what you get from this, where you can actually start to create scenes where you would never get a human labeler to do stuff to that level of complexity. Pausing there, then, that's the concept and the simple sort of hello world of how you might get started with it. And now I want to talk a little bit more about actually using more sophisticated simulators that are a bit more purpose-driven for this type of work. So this is Nvidia's Isaac Sim. Isaac Sim is a robotics simulator that's designed for simulating very realistic environments for robots to operate in. It's used for all sorts of different use cases, and it allows you to very easily import real components, real manipulators, real robots that you might be using into highly realistic scenarios, and you can test them, evaluate them, you can do reinforcement learning, all that kind of stuff.
Rich Riley [00:15:16]: So it's a really big, serious thing, and it's designed to scale, to very vast scale. Everything Nvidia does is designed to scale. And there's a barrier to entry here, of course, to learning this. It's not just going to be a few lines of Python code; you really have to get stuck in. But they are making that easier and easier with tutorials and things. But there's an extension, as they call it, to Isaac Sim, which is called Replicator. And Replicator is doing exactly what BlenderProc was doing before. So it's allowing you to use all the power of this rendering engine to still take all these different kinds of shots.
Rich Riley [00:15:50]: And you can see, when it comes up, all the very sophisticated, different types of labeling that you might need. So Isaac Sim is very powerful, and it's something that, when I get time, I'm trying to get up to speed on, as to what it would offer you. But I think the main thing is, if you're working in this space, in all likelihood you're going to want to move beyond just labeling data at some point, and you're going to want to start testing and evaluating. So actually starting here is a really good place. This is also fully supported by Nvidia, like aggressively supported by Nvidia, because they want you to buy it. And BlenderProc, from my understanding, is more of an academic tool that people have created. So, yeah, I recommend checking out Isaac Sim. I'll carry on and then we'll do questions and discussions.
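As a rough flavour of what Replicator scripting looks like, here is a small sketch modelled on NVIDIA's getting-started examples, meant to be run from inside Isaac Sim. Exact function names and arguments vary between Isaac Sim releases, so treat this as an outline rather than a recipe:

```python
import omni.replicator.core as rep  # available inside Isaac Sim / Omniverse, not on PyPI

with rep.new_layer():
    # A camera plus a render product define what gets captured each frame.
    camera = rep.create.camera(position=(0, 0, 5))
    render_product = rep.create.render_product(camera, (1024, 1024))

    # A simple semantically labelled asset to randomize (stand-in for real assets).
    cone = rep.create.cone(semantics=[("class", "cone")])

    # Writer outputs RGB frames plus 2D bounding box annotations to disk.
    writer = rep.WriterRegistry.get("BasicWriter")
    writer.initialize(output_dir="_out_sdg", rgb=True, bounding_box_2d_tight=True)
    writer.attach([render_product])

    # Randomize the asset's pose on every captured frame.
    with rep.trigger.on_frame(num_frames=20):  # newer releases may name this argument differently
        with cone:
            rep.modify.pose(
                position=rep.distribution.uniform((-2, -2, 0), (2, 2, 0)),
                rotation=rep.distribution.uniform((0, 0, 0), (0, 0, 360)),
            )
```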
Rich Riley [00:16:34]: I think it'd be good to have a bit of a discussion, but I wanted to talk a little bit just about generative data. This was another use case I was playing around with, just to give you a feel for it. So we have generative models now, I think everyone's familiar with this, right? We have DALL-E, we have Stable Diffusion, we have all of these things. So we can actually, just through a text prompt, ask an algorithm to produce us 100 pictures of a particular thing that we're interested in. So this was a use case that I haven't written up yet, but I wanted to test this idea of being able to take the Kaggle dataset of all the different breeds of dog. I think it's something like 200 breeds of dog. And then I systematically went through Stable Diffusion and asked it to generate me 100 images of each of those breeds.
Rich Riley [00:17:14]: So I just had a prompt which was like, generate me a picture of [insert breed of dog here]. And then I varied the location: on the beach, in the countryside, all these different prompts I could vary. Then I ran it through Stable Diffusion and got the images. It was actually quite a costly process in terms of compute, but you very quickly get yourself a dataset. So this doesn't have bounding box annotations or anything. So this is just a mosaic of the sort of different ones that I produced.
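A rough sketch of how that kind of prompt sweep might be scripted with the Hugging Face diffusers library is below; the model ID, breed list, prompt template and image counts are illustrative placeholders rather than the exact ones used in the experiment:

```python
from pathlib import Path

import torch
from diffusers import StableDiffusionPipeline

# Placeholder values; swap in the full breed list and your own prompt variations.
breeds = ["beagle", "border collie", "whippet"]
locations = ["on the beach", "in the countryside", "in a city park"]
images_per_prompt = 4

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

out_dir = Path("synthetic_dogs")
for breed in breeds:
    for location in locations:
        prompt = f"a photo of a {breed} dog {location}"
        for i in range(images_per_prompt):
            image = pipe(prompt).images[0]  # returns a PIL image
            target = out_dir / breed.replace(" ", "_")
            target.mkdir(parents=True, exist_ok=True)
            image.save(target / f"{location.replace(' ', '_')}_{i}.png")
```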
Rich Riley [00:17:41]: But it was just an idea that you could then train on this data and then test it against the Kaggle dataset and see what kind of results you might get. Video language models are this thing that we're going to see more and more of. Well, they're going to just dominate everything, I think, at some point. So this is going from text to videos. This is what Sora is, which is what Facebook are doing now. Not Facebook, sorry, this is what Meta.
Rich Riley [00:18:01]: Sorry, OpenAI are doing, and we're going to see just more and more of this, being able to generate stuff very, very quickly. And I think there's probably going to be a use case where you can take the stuff that's being produced by those ML algorithms and then actually use another algorithm to create your bounding boxes, and then use that to train your smaller model, if that makes sense. So there are lots of ways you can plug these things together to avoid having to collect and label data yourself. You need to be really, really mindful of data contamination when you're doing this. So with this example here, Stable Diffusion, in all likelihood, has seen the evaluation set that I was about to test it on. It's going to have seen the Kaggle dogs dataset. So these images are going to be inspired, in inverted commas, by the Kaggle data. Some of them could be near duplicates.
Rich Riley [00:18:45]: So when you're benchmarking, it's a bit like, well, I'm not sure how valid that benchmark might be. But if you're looking just to get good results in reality, like in production, then, you know, just use what you can would be my approach there. Cool. I've listed a few, just a couple of the resources. I sort of ran out of time, as you point out, with the slides here. But, yeah. Do we have any questions or any observations? Go for it, feel free.
Q1 [00:19:11]: [inaudible] I remember reading a paper about how these models have started to, like, homogenize, or basically [inaudible].
Rich Riley [00:19:25]: Because they've been trained on generated data.
Q1 [00:19:26]: [inaudible] and how does that translate back to real life, ideally, in your experience?
Rich Riley [00:19:34]: So I wouldn't say I've gone into, in all seriousness, deploying into production, but yeah, these are exactly the risks that you would face on any of these tasks, like overfitting. So the video language models that we're seeing, these so-called foundation models, are going to have seen pretty much all the content that's on the Internet, right. So they're going to have seen everything, which is arguably the most diverse set of data that you can see. But, and I'm sorry if I'm not answering your question, it's also going to mean you're going to see lots of models that are essentially the same model, because they're going to have been trained on the same thing. So if you look at some of the large language models, like the 7 billion parameter ones, they've all kind of got the same dataset in a lot of ways. So they're actually not really that dissimilar to each other.
Rich Riley [00:20:14]: So really shaping and refining your dataset is actually going to be really critical. And this is the challenge, right. You know, I've made it sound very simple, oh, we're just going to generate some images and things, but actually tracking the results, where you improve in one area and it gets worse in another area, it's a hard problem.
Q1 [00:20:34]: In the drone detection demo that you showed, what part of the training data was actually synthetic? Was it the drone that was synthetic? Was it the scene that was synthetic? Or was it the variation in lighting in the location?
Rich Riley [00:20:49]: So all the training data that I produced would be classed as synthetic. The drone itself was a rendered polygon model that was being rendered through ray tracing. The background was a photograph, which I guess you could still say is synthetic in the sense that the final image wasn't being directly captured by a camera, with the lighting and everything else; all of those pictures came out of Blender, if you see what I mean. But I fine-tuned from a model that had already been trained; I didn't train from scratch, if that makes sense. It had probably been trained against COCO or ImageNet or something, so it would have seen some real data, in the sense that it had learned, say, all the ImageNet classes. I don't know if that covers drones or not, but it probably would have.
Rich Riley [00:21:33]: It would understand things that exist in the sky, for example; it's probably learned to distinguish that, and then I'm fine-tuning on top of it. The drone in that video isn't, you know, similar to the 3D drone; I didn't attempt to model that specific drone, if that makes sense. And like I said, I should probably make that more of a rigorous kind of study, but it was really just a proof of concept of how to get that end-to-end pipeline working.
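For anyone wanting to reproduce that fine-tune-from-pretrained step, here is a minimal sketch using torchvision. This is not the actual training code from the demo, just one common way to start from COCO-pretrained detector weights and swap in a single "drone" class; the data loader is assumed to be built from the synthetic COCO annotations:

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from a Faster R-CNN pretrained on COCO (so it has already seen a lot of
# real images), then replace the box head with one sized for our classes.
num_classes = 2  # background + drone
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

def train_one_epoch(model, optimizer, train_loader, device="cuda"):
    """train_loader is assumed to yield (images, targets) pairs where each target
    dict holds "boxes" and "labels" tensors built from the synthetic labels."""
    model.to(device).train()
    for images, targets in train_loader:
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss_dict = model(images, targets)  # detection models return losses in train mode
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```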
Q1 [00:21:54]: Have you played around with any of the other sort of multimodal models, to actually request it to give you a box?
Rich Riley [00:22:05]: So, no, this is something that I want to try. I haven't actually asked for labels from a large language model. I've used some language video models, which again, I need to go into the absolute details of. So there's a model called, I think it's called OWL, where you can essentially have a live video feed and you can just dynamically say what you want the bounding boxes to be. So I can say, put a bounding box around all the people, and suddenly it's got all the people, and now it's, put a bounding box around all the, you know, flower pots or something. So I think that's why, when I started putting this together, I was thinking I need to refresh this a little bit, because there's so much new stuff that I could be looking at. But I do think that's probably going to be the future, being able to generate images depending on your use case.
Rich Riley [00:22:46]: If you've got a really niche thing that some startup's doing, then probably you're not going to be able to just use it. But if you've got just this neat little thing you want sorted, I do think you can definitely harness the generative models for that.
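The OWL model Rich mentions is presumably OWL-ViT, an open-vocabulary detector. A small sketch of querying it with free-text box labels via the Hugging Face transformers API is below; the checkpoint name, prompts and score threshold are just examples:

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("frame.jpg")  # e.g. a frame grabbed from a video feed
queries = [["a person", "a flower pot"]]  # free-text classes, changeable per frame

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs into boxes, scores and labels in image pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.2, target_sizes=target_sizes
)[0]
for box, score, label in zip(results["boxes"], results["scores"], results["labels"]):
    print(queries[0][label.item()], [round(v, 1) for v in box.tolist()], float(score))
```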
Q1 [00:22:58]: I was just curious what applications you missed out from the slides.
Rich Riley [00:23:06]: So the synthetic data stuff. I was lucky enough to see some stuff at Nvidia recently, and they were talking about digital twins, which I actually missed out here. So digital twins is this idea where you have a near-perfect simulation of the thing that you're building. So if you're building a factory, you first build a perfect simulation of that factory. And what we mean by that is not just a 3D rendering of that factory; you're actually modeling every single component in that factory. And the idea there is that you can optimize the running of that factory before you even begin to build it. That's one of the main use cases. You can also experience that factory through virtual reality.
Rich Riley [00:23:42]: So you can walk around your digital twin before you even start building it. But the robots and all the automated stuff that's going on, all the sensors and everything, can be getting thousands of years of experience of existing in that factory before the factory is even built. So the idea would be that when you go and build it, you build it much quicker, because you have all this technology and this reference point, which is the digital twin. But when you drop your robots into that space, to them it's almost like nothing's changed, right? If your digital twin is good enough, then it's like the opposite of the Matrix, right? They've gone from the simulation into reality and suddenly they're working really, really well. Or they can use the digital twin for planning, right? They could use it for, you know, a box just suddenly got in my way, I'll just go and experience how to get around that in the digital twin for a thousand years or something.
Rich Riley [00:24:28]: You know what I mean? They can actually use that and then go, okay, I'm going to go and do this, once they've worked it out. So I think that stuff's really cool. They're using it for warehouses and very controlled environments at the moment, environments where they can control everything, the light levels, and it's a big market as well. I think taking that out into reality is going to be hard, because reality is much more complicated. Truly outdoors, all sorts of stuff goes wrong, but in a controlled environment like a warehouse, I think that's where we are seeing that already.
Q1 [00:24:55]: This was a really great talk, by the way. We've been doing something similar, actually, with text, so using a large language model to create labeled text data and then using that.
Rich Riley [00:25:06]: Yeah.
Q1 [00:25:08]: The cost of those big models is coming down so fast. It's actually often cheaper to just use them, you know, rather than train your own.
Rich Riley [00:25:14]: Yeah. One of the issues with those big models: we do a lot of stuff around edge-based deployment, so we can't rely on an Internet connection. We're looking at deploying, say, on a drone or on a robot or something, so we can't actually call back to those models. And we're also looking at very low latency, like real-time low latency, where we can't have any lag in terms of referencing it. And even if we had direct connections to one of those large language models, I think the inference time would probably be too long. So I think we always want to be thinking about that. I think smaller models will always have their place. And what you've described there is absolutely the right approach, I would say: harness the amount of training that's gone into those large foundation models and then put that into your smaller models.
Rich Riley [00:25:55]: Without a doubt. I do understand that the cost has gone down, but the speed is something else as well. I've heard of people creating a dataset from GPT-4 and then training a smaller model on that in order to do what GPT-4 would do for them. They could be using it directly, but it's just too slow and a bit too costly. But yeah, there is a trade-off, definitely. I don't know, I'm incredibly jet-lagged right now, so for all I know, this could be a dream. But, yeah, I hope it's not a simulation.
Rich Riley [00:26:28]: It feels quite real. Yes, that's true. Yeah, that's true. Cool. Awesome.