Data Labeling Best Practices
Charles Brecque is the founder and CEO of TextMine. Charles started TextMine in Oxford in 2020 with Amber Akhtar after experiencing data loss and friction when working with legal and financial business-critical documents. TextMine leverages patented knowledge graph technology and large language models in order to structure the unstructured data in documents. Prior to TextMine, Charles was the first commercial hire at Mind Foundry, a machine learning spin-out from the University of Oxford. Charles is a graduate of the École Centrale de Lyon.
I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.
I am now building Deep Matter, a startup still in stealth mode...
I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.
For example, I previously co-founded Call Time AI, where we used Artificial Intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we helped Democrats raise tens of millions of dollars. In April of 2021, we sold Call Time to Political Data Inc. Our success, in large part, is due to our ability to productionize machine learning.
I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.
I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.
I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!
Data labeling is a key part of fine-tuning open-source LLMs. However, poor labeling practices can hurt your LLM's performance. This lightning talk will cover data labeling best practices, from hiring to preparing your data and managing your data labelers.
AI in Production
Adam Becker [00:00:00]: Thank you very much. And we have, coming up, Charles. Charles, are you here?
Charles Brecque [00:00:04]: I'm here. Can you hear me?
Adam Becker [00:00:06]: Charles? Nice to see you. Well, Andy spoke about the various challenges that we have operationalizing LLMs, and one of those is obviously data labeling. Right. And this is not a new problem; it existed even in the more classical era of machine learning. Right. Even just in the supervised problems that we had seen before. I suspect that it takes on a slightly different kind of dimension and consequence right now, and perhaps that will trigger new types of workflows and pipelines and ways to think about all these different things.
Adam Becker [00:00:39]: We would love to get your thoughts on this, so please, the stage is yours. Do you need to share your screen?
Charles Brecque [00:00:46]: I do. Let me share. And I guess, can everybody see the slides?
Adam Becker [00:00:54]: Yes, we can see it.
Charles Brecque [00:00:55]: Great.
Adam Becker [00:00:55]: I'll be back in ten minutes.
Charles Brecque [00:00:56]: Perfect. So, yeah, thank you everybody for tuning in. Again, this talk isn't prescribing a way of building your own data labeling pipelines or doing data labeling. It's really just some of the learnings that we've uncovered at TextMine through building our own data labeling team and fine-tuning our own models. And of course, this is not the way to label. It's a way, and hopefully you'll find this talk useful. So in terms of the agenda: some context about TextMine, and then some bullet points around when you might need to consider building a data labeling team, how to design the problem for data labeling, and in particular for the data labelers, and the importance of feedback loops. And then we'll touch briefly on some of the tooling that we've used.
Charles Brecque [00:01:57]: But as Andy said in the previous talk, there are so many tools and technologies popping up that, again, it's not the way to do your data labeling. It's a way. And in terms of next steps, I don't want to say what the future of data labeling will be like, but I do have some predictions about where the next challenges will be and how tooling will evolve. So briefly, about TextMine: I'm one of the co-founders, and Amber, who you can see in the picture, is my other co-founder. Essentially what we have is an end-to-end platform that allows business users to extract key data points from their PDF documents and navigate and manage them within a knowledge graph. We're combining LLMs and knowledge graphs to deliver accurate and actionable insights and knowledge. So, in terms of when you need data labeling: obviously, the foundational models have been trained on vast amounts of data, and they can write poems and all sorts of wonderful things. But for most business use cases, they will not necessarily have enough knowledge to deliver relevant or accurate enough performance to be useful.
Charles Brecque [00:03:27]: So especially if you are working with open-source models and trying to fine-tune them for your own specific requirements, that's when you might want to consider data labeling. You might also want to do this if you have proprietary data. For example, if you're a SaaS platform, you've got thousands of users and you've been around for a couple of years, you will probably have very interesting data that can be used to fine-tune the model. And then, thinking beyond chat, especially if you are using LLMs to perform a specific task, whether it's an ETL task, an extraction task, or something else that's relevant to your business, fine-tuning can significantly help with performance. And obviously for that you'll need some form of data labeling. When it comes to designing your data labeling solution, the framework is very similar to any problem-solving framework, but there are some considerations that are specific to data labeling. First of all, what is the problem that you're looking to solve? And I'd say what's most critical with LLMs is: does the solution require reasoning? LLMs can't reason by default. So if you do need reasoning, then you need to think hard about whether an LLM is the right solution.
Charles Brecque [00:04:54]: If it is, then you might want to try to split the problem into smaller tasks and chain the prompts or the LLMs together to solve the specific questions. Then, once you've narrowed down the problem and defined what it is you're looking to solve, you need to have a think about the labels or the prompts that a labeler will need to answer. There are two aspects to that. On the one hand, there's designing the prompt for the LLM, so making sure that it will give the results that you're looking for. But you also need to bear in mind that the labelers are not always experienced with machine learning or deep learning, and they might also be data labelers for the first time, so they might not necessarily understand the prompt. I've got some examples which highlight where we, as a company, have maybe not done a great job at designing prompts that are clear for the labelers. That's something you really need to bear in mind, because if you get it wrong, then you'll potentially be teaching the LLM the wrong behaviors, and that's not good for your LLM. The next part, once you've done all that, is: do you actually have data, and how are you going to slice it? Because, again, you need to chunk the data so that it can be labeled, and you need to make sure that the chunks are created in a way where you're extracting enough signal to answer the question. What you don't want is to have a whole bunch of chunks where the answer to the question is not applicable or no answer.
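Editor's note: the chunking step described above can be sketched roughly as follows. This is a minimal illustration and not TextMine's pipeline. The function names, the word-window sizes, and the keyword filter are all assumptions made for the example; a keyword match is only a crude stand-in for "this chunk carries enough signal to answer the question".

```python
def chunk_text(text, max_words=50, overlap=10):
    """Split text into overlapping windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the document.
        if start + max_words >= len(words):
            break
    return chunks


def has_signal(chunk, keywords):
    """Hypothetical signal test: does the chunk mention any keyword?"""
    lowered = chunk.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def select_chunks_for_labeling(text, keywords, max_words=50, overlap=10):
    """Keep only the chunks that are likely to contain an answer."""
    return [
        chunk
        for chunk in chunk_text(text, max_words, overlap)
        if has_signal(chunk, keywords)
    ]


if __name__ == "__main__":
    document = "filler " * 60 + "This agreement is governed by the laws of England."
    for chunk in select_chunks_for_labeling(document, ["governed by", "governing law"]):
        print(chunk)
```

In this sketch, chunks that never mention the topic are dropped before they reach a labeler, so labeling time is not spent on a long run of "not applicable" answers.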
Charles Brecque [00:06:42]: And if you do that, you're ultimately not teaching anything to the model. The next step after that is sourcing the labelers. I do have some tips for sourcing them, but essentially we found LinkedIn to be quite effective for finding data labelers. In practice you probably want to find domain experts, and that can mean students or people with many years of experience. And finally, around the labeling itself, what's probably most important is the quality control and the feedback loops. So just to give some examples of what some of these steps mean: here we've got a prompt where we've given some context to the LLM, telling it that it's intelligent and able to interpret CSVs and transform dates, and then giving some clear instructions around, well, do not do this, do not do that. This is really to ensure that the input and the output are aligned with what you're looking to do with the LLM.
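Editor's note: the slide itself is not reproduced in the transcript, so the sketch below only illustrates the general shape of such a prompt: a role statement, a list of explicit "do not" rules, the input, and the question. The wording of the role and rules, and the build_prompt helper, are hypothetical and are not the prompt shown in the talk.

```python
def build_prompt(role, rules, context, question):
    """Assemble a prompt from a role statement, explicit rules, input, and question."""
    lines = [role.strip(), "", "Rules:"]
    lines += [f"- {rule}" for rule in rules]
    lines += ["", "Input:", context.strip(), "", f"Question: {question.strip()}", "Answer:"]
    return "\n".join(lines)


# Hypothetical wording, for illustration only.
ROLE = "You are an intelligent assistant that can interpret CSV data and transform dates."
RULES = [
    "Do not invent values that are not present in the input.",
    "Do not change the order of the columns.",
    "If the answer is not in the input, reply: I don't know.",
]

if __name__ == "__main__":
    print(build_prompt(ROLE, RULES, "name,signed\nAcme Ltd,03/02/2021", "What is the signing date?"))
```

Keeping the rules in a plain list makes it easier to show the same text to the labelers and to iterate on one rule at a time.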
Charles Brecque [00:07:58]: But again, you won't find this on day one. You'll have to iterate until you find the right prompts that work both for the data labelers and the LLM. It's also really crucial, when you have data labelers, to emphasize that they should indicate when the answer isn't there. If the model doesn't know, then the model should say that it doesn't know. That's really crucial, because, again, you don't want the model to hallucinate answers if they don't exist. And then one other thing that we've encountered, especially when you think about forms with addresses: there's often a first line and a second line. Our software team initially suggested that we split the prompt in two. So first line address, second line address.
Charles Brecque [00:08:49]: But in practice that just confuses the model. So I think it's really important to detach yourself from the engineering requirement and actually think about what makes sense for the model. In this case, you can see that with the prompts 'What is the first line address?' and 'What is the second line address?', it's giving the same answer. So that's an example of what to avoid. In terms of finding data labelers, I think the key thing is that you'll find lots of domain experts, but not everyone will necessarily have the skill for being a data labeler. Even though it might seem like a mundane task, it's really important to hire or contract people who are consistent, have attention to detail, and can deliver quality labeling. You can consider students, but we found that the best way is actually just to run a trial before doing a proper contract, and to have a community or create a Slack channel for them to communicate with each other and improve. But most crucially, when you're doing all of this, make sure that you are signing confidentiality agreements, because the labelers might have access to sensitive data.
Charles Brecque [00:10:01]: And even if the data is not sensitive, you don't necessarily want them sharing the data with your competitors or other companies. So in terms of more examples of why feedback loops are really important with data labeling: sometimes we forget that LLMs don't actually know what things are. In the first example, the LLM is asked for a city from a document, and it's saying United Kingdom. If the LLM knew that the United Kingdom was a country and not a city, then it wouldn't have even suggested that answer. So that's why it's really important to feed that back to the LLM. There are obviously other mechanisms you can use to provide that knowledge, especially with knowledge graphs. But in the absence of that, it's really important to make sure there's feedback. And I think the other thing is that these foundational models that you'll be fine-tuning have been heavily trained on named entity recognition examples.
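Editor's note: the city versus country example suggests a simple automated check that could flag answers for a labeler to review. The sketch below is an assumption made for illustration, not something described in the talk: the ENTITY_TYPES table is a tiny hypothetical stand-in for a knowledge graph or gazetteer, and check_answer and Feedback are made-up names.

```python
from dataclasses import dataclass

# Hypothetical lookup table standing in for a knowledge graph or gazetteer.
ENTITY_TYPES = {
    "united kingdom": "country",
    "france": "country",
    "london": "city",
    "oxford": "city",
}


@dataclass
class Feedback:
    question: str
    answer: str
    expected_type: str
    found_type: str  # "unknown" when the answer is not in the lookup table
    needs_review: bool


def check_answer(question, answer, expected_type):
    """Flag a model answer whose entity type does not match what the question asks for."""
    found_type = ENTITY_TYPES.get(answer.strip().lower(), "unknown")
    return Feedback(
        question=question,
        answer=answer,
        expected_type=expected_type,
        found_type=found_type,
        needs_review=(found_type != expected_type),
    )


if __name__ == "__main__":
    print(check_answer("What is the city?", "United Kingdom", "city"))
    print(check_answer("What is the city?", "Oxford", "city"))
```

Answers that are missing from the table are also flagged here, on the assumption that a human should look at anything the check cannot confirm.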
Charles Brecque [00:10:58]: So, for example, for the question 'What is the governing law of the agreement?', it has identified 'law of England', but that's not how a human would respond. A human would respond 'the laws of England' or 'English law'. So it's really important to feed that back to the model, because the models are generative and they should be able to provide those types of answers in the long run. And this example here is one where the question has been misinterpreted by data labelers, and we've ended up with answers from the model which are not actually consistent. In terms of tooling, again, this isn't the prescribed tech stack, but we use Argilla for our data labeling, and it is a great interface for the data labelers. And going back to the importance of feedback, if you can establish a connection between your live model and your data labeling, you can then accelerate the improvements to the performance. So in terms of what's next, I think really...
Adam Becker [00:12:09]: We have to. We're at time.
Charles Brecque [00:12:11]: Okay, well, that's it then. And thank you for your time.
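Editor's note: the connection between a live model and the labeling workflow mentioned near the end of the talk could look roughly like the sketch below. It uses plain Python rather than any labeling tool's API. The LabelingQueue class, its method names, and the prompt/completion record layout are assumptions made for illustration only.

```python
import json
from dataclasses import dataclass, field


@dataclass
class LabelingQueue:
    """Hypothetical queue linking live model answers to human relabeling."""
    pending: list = field(default_factory=list)
    completed: list = field(default_factory=list)

    def add_from_live(self, prompt, model_answer, reason):
        """Record a live answer that someone flagged as worth reviewing."""
        self.pending.append(
            {"prompt": prompt, "model_answer": model_answer, "reason": reason}
        )

    def resolve(self, index, corrected_answer):
        """Store a labeler's corrected answer and move the item to completed."""
        item = self.pending.pop(index)
        item["label"] = corrected_answer
        self.completed.append(item)
        return item

    def export_jsonl(self):
        """Serialize corrected items as one JSON record per line for fine-tuning."""
        return "\n".join(
            json.dumps({"prompt": item["prompt"], "completion": item["label"]})
            for item in self.completed
        )


if __name__ == "__main__":
    queue = LabelingQueue()
    queue.add_from_live(
        "What is the governing law of the agreement?",
        "law of England",
        "phrasing is not how a human would answer",
    )
    queue.resolve(0, "The laws of England")
    print(queue.export_jsonl())
```

The point of the sketch is only that corrections made on live answers end up in the same place as the rest of the labeled data, so they can feed the next round of fine-tuning.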
Adam Becker [00:12:15]: Charles, thank you very much for the presentation. If you could please stay on the chat, we have a few questions for you. Snariti asked you a question about the healthcare industry, and she would love to get your thoughts on it. If you scroll up, there's another question that she asked earlier. You mentioned that you have to evaluate the task that you actually have on hand, whether or not it requires reasoning. You'll see it up on the chat.
Adam Becker [00:12:40]: If you could please get back to her, she will be right there. I'll send you the link.
Charles Brecque [00:12:44]: Okay, perfect.
Adam Becker [00:12:46]: Thank you, Charles. Thank you very much.