Designing Human in the Loop Experiences for LLMs
SPEAKER

Alberto Rizzoli is co-founder of V7, a platform for deep learning teams to create training data workflows that automate labeling. V7 is used by over 500 global AI companies and enterprises, including GE, Siemens, Merck, and MIT.
He previously founded Aipoly, the first engine for running large convolutional neural networks on smartphones, which led to the creation of an app enabling the blind to identify 5,000 objects through their phone camera, used on over 1 billion images.
Alberto's work on AI earned him an award and a personal audience with Italian President Sergio Mattarella, as well as Italy's Premio Gentile for Science and Innovation. His work won the CES Best of Innovation award in 2017 and 2018.
SUMMARY
How will we teach large models to behave in organisations at scale? We’ll be discussing both the technical and user experience challenge of hundreds of humans influencing one agent. Who must it listen to? How must new learnings be represented? How can we make the labeling experience of LLMs an ongoing collaboration between people using them?
TRANSCRIPT
So let's give a warm welcome to our first speaker of the day on Track 2, Alberto, who's joining us from V7. He'll be talking about designing human-in-the-loop experiences for LLMs, and I think this is going to be a great talk to start the day off with. So without further ado, let's bring Alberto on.
Hello, how's it going? Hello, everyone. Thank you for having me. It's going great. It's very warm here in London; right now I'd rather be in Bosnia with the rain and cooler weather. But I'm excited to be here and to present to you all. Awesome, I like your background as well.
I can't tell if it's bugs or animals. This is the original image classification. After Darwin, there was a craze about depicting (we're already starting the talk, I guess, so I can eat into my time) animals and creatures and plants by their ground-truth classification.
So it's the original way of depicting what data should look like. And if you're creative enough, you can notice that there is some very broad level of clustering; this is almost like an embedding space. Wow, very cool. Awesome. Well, here are your slides, take it away. We only have 10 minutes, so I'm going to go in a flash, and we're only going to be able to scratch the surface on some of these problems.
And it's probably a good thing, because when I was first invited to talk, I thought that by now, given the rapid progress of how LLMs are making their way into products, there would be a lot more... am I good? Yeah, audio coming through? Cool. There would be a lot more to talk about in human-in-the-loop interaction within alignment.
By human in the loop, I mean specifically anything that has to do with labeling, or teaching, or getting information that's inside here to make its way into a model's knowledge. But actually, we've made very slow progress on this. I'm just going to talk a little bit about human-computer interaction principles, so we're not just going to talk about labeling, which is what V7 as a company is most known for.
But also the act of people teaching these LLMs. We're going to be talking about how humans are supposed to fix data issues when LLMs make mistakes, which happens quite often; not just hallucinations, but perhaps answers that are subpar. And then we're going to touch a little bit on multimodal human-in-the-loop approaches.
So anything that spans across modalities: not just language, but language plus vision. And maybe to summarize it all, the question that we wish to ask ourselves is: is this enough? We've seen apps developed everywhere in which the feedback on a model's responses is a thumbs up and a thumbs down.
And I think we will look back at this time and cringe at the huge untapped potential that we had in improving these systems, which are sometimes deployed in production and sometimes handling really important information, while we're treating them like the very earliest machine learning implementations, with just a thumbs-up or thumbs-down critique.
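To make the contrast concrete, here is a minimal sketch of what a richer-than-thumbs feedback record could look like. All names and fields here are hypothetical, not any particular product's schema: the point is simply that capturing a problem category and an optional correction turns a binary signal into potential training data.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical feedback record: instead of a bare thumbs up/down,
# capture what kind of problem the user saw and, ideally, a better answer.
@dataclass
class FeedbackRecord:
    prompt: str
    response: str
    rating: int                      # -1 = thumbs down, +1 = thumbs up
    category: Optional[str] = None   # e.g. "hallucination", "subpar", "off-topic"
    correction: Optional[str] = None # a user-supplied better answer, if any

def to_training_pair(rec: FeedbackRecord) -> Optional[tuple]:
    """Turn feedback into a (prompt, target) pair only when a correction exists."""
    if rec.correction:
        return (rec.prompt, rec.correction)
    return None

rec = FeedbackRecord("What is HITL?", "A dance move.", -1,
                     category="subpar", correction="Human in the loop.")
assert to_training_pair(rec) == ("What is HITL?", "Human in the loop.")
```

A bare thumbs down, by contrast, yields nothing a fine-tuning pipeline can directly learn from, which is the untapped potential the talk is pointing at.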
I see that the slides are coming through with a bit of a delay in the presentation; I hope it's not confusing for you all at home. So, to give you a little bit of an intro to myself: I founded a company called V7. We're a training data platform, and because of that, we handle the ground truth of hundreds of AI companies, and through that, we're able to learn what exactly good knowledge looks like to be fed into neural networks.
With LLMs, this has changed significantly, and continues to change, because we're moving further and further away from over-supervising our data. We've gone from a world, in industry specifically, in which we generally needed smaller amounts of data that were very well labeled, to a world where we are using enormous amounts of data that are really poorly labeled. And we have an interesting vantage point, because within V7 we actually have enormous amounts of data that we use in research, we're talking about petabytes of training data, that is also really well labeled. And does that actually matter for use in industry?
The big question of human-in-the-loop processes, specifically with LLMs, is: are LLMs necessary to solve industry problems? Do you actually need something that can write Shakespeare as well as it can tell you ice cream recipes?
Or do you need something that's a bit more restricted? One other very important point when it comes to human-computer interaction is that we like to think of LLMs as co-pilots. A co-pilot is this thing that is continuously supporting your actions, continuously aware of the stream of a task, and a task, in the case of, for example, a flight, is taking off somewhere and landing somewhere else.
But realistically, the atomic unit of a task in most machine learning software is much shorter. And we're actually using LLMs in this very same way: just like we're treating this cute Labrador here, we're asking it to retrieve some information for us. The dog comes back, we're happy, and then we start off another task and send it back out.
So the reality is we're still scratching the surface of what we should be using LLMs in production for today. And the majority of production use cases, at least the ones we're seeing from our perspective, still tend to be relatively simplistic. Even within our own product, the use of LLMs still tends to be as a glorified zero-shot model.
In the case, for example, of using it in computer vision, they're generally used to manipulate other models, since, as multimodal models, large models still tend to be very unreliable. They're still generally used as a glorified Command-K. This is potentially okay, but there's still a lot for us to explore within this paradigm.
So in this specific case, we're saying: hey, I want to label bees, so pull up the bee model, or a model that knows the bee class, run it, and then maybe do some very basic transformations on this data. So what we ask ourselves day in and day out as product designers is: how do we make an experience where this thing actually learns over time?
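The "glorified Command-K" pattern can be sketched as a request router: a natural-language command is reduced to an intent, which is then dispatched to an existing specialist model. Everything here, the registry, the model names, and the toy intent parser standing in for the LLM call, is hypothetical illustration, not V7's actual implementation.

```python
# Hypothetical command palette: map a parsed intent to a registered model.
# In a real system an LLM would produce the intent; here a keyword match fakes it.
MODEL_REGISTRY = {"bee": "bee-segmentation-v1", "airplane": "airplane-seg-v2"}

def parse_intent(request: str) -> dict:
    """Stand-in for the LLM call that turns 'I want to label bees' into an intent."""
    for cls in MODEL_REGISTRY:
        if cls in request.lower():
            return {"action": "run_model", "target_class": cls}
    return {"action": "unknown"}

def dispatch(request: str) -> str:
    """Route the request to the model that knows the requested class."""
    intent = parse_intent(request)
    if intent["action"] == "run_model":
        return MODEL_REGISTRY[intent["target_class"]]
    raise ValueError(f"no model for request: {request!r}")

assert dispatch("I want to label bees in these images") == "bee-segmentation-v1"
```

Note that nothing in this loop learns: each request starts from scratch, which is exactly the limitation the talk is raising.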
And unfortunately, this is much harder than we initially thought. Within our product, we have something called Auto-Label, which is effectively a large model that, given a small prompt, such as segmenting out one of these airplanes, goes off and starts to segment all the other airplanes that it sees in the picture.
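That propagate-one-prompt-across-many-images behavior could be sketched roughly as below. The `segment` callback, the confidence field, and the threshold are stand-ins for whatever multimodal model backs such a feature, not the real Auto-Label API.

```python
from typing import Callable

def auto_label(images, prompt: str,
               segment: Callable, min_confidence: float = 0.5) -> dict:
    """Propagate one text prompt (e.g. 'a Qantas airplane') across many images,
    keeping only detections the backing model is reasonably confident about."""
    labels = {}
    for i, img in enumerate(images):
        masks = [m for m in segment(img, prompt)
                 if m["confidence"] >= min_confidence]
        labels[i] = masks
    return labels

# Tiny fake backend so the sketch is runnable end to end.
def fake_segment(img, prompt):
    return [{"mask": "...", "confidence": 0.9},
            {"mask": "...", "confidence": 0.3}]

result = auto_label(["img0", "img1"], "a Qantas airplane", fake_segment)
assert len(result) == 2 and len(result[0]) == 1  # low-confidence mask dropped
```

The human stays in the loop by reviewing the kept masks; the hard part, as the next section argues, is that the cases a reviewer actually needs to fix are precisely the ones the model can't propagate.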
So it's truly a multimodal co-pilot that is able to understand your language instructions, such as "an airplane", or "I know it's supposed to be only the Qantas airplane in this class", as well as understanding the vision side. The problem that we tend to see, at least within the labeling space, in automating labeling beyond the point where we're at, is that most of the time, if your user, who is an expert labeler on the other side, can be fully automated by a model, then they probably shouldn't be labeling that piece of data. And if your user is not an average-distribution person but a true expert, an engineer or a radiologist, then it's very, very hard to automate them, because by design they're always introducing an out-of-distribution piece of knowledge to your training set.
And so this makes the job of developing human-in-the-loop experiences quite difficult. There are a few challenges that we've encountered by now. The first one that we see in industry is that automation is often overrated, and most of the time it takes quite a while for an end-to-end automation system, or even a co-pilot system that assists someone by letting them just do QA, to actually find its way into prod.
And with LLMs, it's even harder, because we're using these things that are very impressive, that can convince us they're very intelligent by their means of speech, but they're actually not that much better, in both NLP and multimodal approaches, than something that is just fine-tuned with a smaller amount of data.
The challenge with getting these into prod within, for example, computer vision, is that most use cases within vision don't have any undo. They involve atoms: you can't undo the picking of an apple or the cutting of a tree if you're doing it robotically. In many use cases, there's no room for error.
Most industry use cases have defined outcomes, and this is actually one reason why large models are not finding their way into many industry use cases: we have designed industry use cases to have very strictly defined, discrete outcomes. You're buying a stock or you're selling a stock; you're pushing the accelerator or you're not pushing the accelerator.
So all the reasoning and logic that can happen in that 55,000-token vocabulary of a large model kind of goes to waste when the actual problems that we're using AI for are quite simple, because we have designed simple systems. So we're still in a transitional period in which these can find their way.
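One generic way to reconcile free-form LLM output with those strictly discrete outcomes is to validate the model's answer against a closed action set before anything executes. This is a sketch of the general idea, not any particular product's approach; the action names are illustrative.

```python
from enum import Enum

class Action(Enum):
    BUY = "buy"
    SELL = "sell"
    HOLD = "hold"

def constrain(llm_output: str) -> Action:
    """Map free-form model text onto the discrete actions the system defines;
    refuse anything that doesn't parse rather than guessing."""
    stripped = llm_output.strip().lower()
    token = stripped.split()[0] if stripped else ""
    for action in Action:
        if token == action.value:
            return action
    raise ValueError(f"model output {llm_output!r} is not a permitted action")

assert constrain("BUY 100 shares") is Action.BUY
```

The enum is the system's "accelerator pedal": however elaborate the model's reasoning, only one of three things can actually happen, and anything unparseable fails loudly instead of acting.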
And they also bring a whole new set of challenges. One of them is co-pilots versus SaaS. In reality, we already have a lot of software that serves a particular profession. For example, Bloomberg has their own Bloomberg Terminal, which is great for trading, and it already has the buttons to be pushed to complete trades.
As a result of that, it doesn't really need an LLM-like interface in many cases. It just needs to complete actions, and these actions can actually be done without using a large model, by simply using a classical machine learning model, or even just a regular deep learning model that's fine-tuned on bespoke data. One minute? One minute, cool. Yep, cool. The other problem is that people are terrible teachers. Most of the time, if you give people the ability to retrain a model, they will usually give it incorrect information, or information that is just not written in the way you would normally use to teach the model.
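Because people are unreliable teachers, one common mitigation (a generic pattern, not something the talk attributes to any product) is to require agreement between several reviewers before a user-taught correction reaches the training set. A minimal consensus gate might look like:

```python
from collections import Counter
from typing import Optional

def consensus_label(votes: list, threshold: float = 0.66) -> Optional[str]:
    """Accept a user-taught label only when a supermajority of reviewers agree;
    otherwise return None and keep the example out of the training set."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= threshold else None

assert consensus_label(["cat", "cat", "dog"]) == "cat"  # 2/3 clears the bar
assert consensus_label(["cat", "dog"]) is None          # no supermajority
```

This doesn't solve the deeper problem, that experts disagree for good reasons, but it filters out the casually wrong teaching the talk warns about.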
And then finally, there's the problem of information asymmetry. Most of the time the answer is not wrong; it is just the wrong information given to the wrong person. And these are still problems that, within LLMs, are, I would say, largely unsolved in production systems. We've seen many implementations of these: Adept has famously created a way to navigate websites and click around things for you, but most of the time this is not the only way in which you would use a piece of software to go and search for homes; it's just the way that works for now. OpenAI has created a pretty simplistic interface that still works, but it is not the way you would want to get house prices out of it.
And so many other pieces of software have done something similar; I think one of them has one of the better ways of implementing visual feedback with the responses of an LLM. Shout out also to Glean, an enterprise search company, for doing so. So, 10 minutes goes really fast, but to keep it short: there are many challenges for us to tackle in implementing the training of new information within LLMs.
I hope you've enjoyed this talk, and see you all in future talks within the LLMOps day. Have a good rest of your day. Awesome, thank you so much, Alberto. And make sure to check out the chat as well; people can interact and follow up with you there. Thank you. Thank you, chat. Cool.
