Turn Data Chaos into AI Strategy with Programmatic AI Data Development // Elena Boiarskaia // DE4AI
Join Elena and learn how programmatic data development is transforming enterprise AI specialization from disconnected, manual data tasks into a streamlined, strategic development process. Learn how a programmatic approach to data development allows enterprises to efficiently manage, curate, and label data at scale, accelerating production AI and aligning models to unique business criteria—especially critical in sectors like banking and healthcare, where accuracy is non-negotiable. Discover how Snorkel's data development platform empowers AI teams to build and release faster with high-quality custom training datasets.
Link to Presentation Deck: https://drive.google.com/file/d/1uS9jEvz_1-Z479IjhJnkAeVMd6WIPDzy/view?usp=drive_link
Adam Becker [00:00:05]: Ready to snorkel? We are. Elena, the floor is yours.
Elena Boiarskaia [00:00:10]: Thank you so much. So I've got my own slide sharing, right, so I'm good to go. Okay, great.
Adam Becker [00:00:16]: Yep, you're good.
Elena Boiarskaia [00:00:17]: Okay. Thank you so much. So excited to be here and talk about what we're doing at Snorkel. I wanted to start with a quick intro and then get right into it. I'm looking to do a little bit of a paradigm shift in how we think about data in the AI development process. That's what this presentation is about, and hopefully, at the end of the presentation, that's your takeaway. So why am I talking with you today? I lead a team of applied machine learning engineers who are developing solutions for customers in a variety of use cases.
Elena Boiarskaia [00:00:53]: So we're delivering for teams across verticals in finance, healthcare, and everything in between. And my background: my perspective is always data scientist first. I didn't grow up as a data engineer; I had to learn it along the way to power my needs in data science. So that's always my perspective. And I mention that because at all of the startups I've been a part of in my career, and that includes Databricks, that includes Snorkel, I'm always looking to optimize: how can I solve the data science problem? So there is this huge data component to creating good machine learning and AI.
Elena Boiarskaia [00:01:39]: And this is the key part about enterprise AI: in any of your verticals, the companies that have a ton of data they're working with, that data is their key differentiator. That's exactly what they need to leverage to create a good machine learning model or a good gen AI chatbot. This doesn't go away. So what do we do with all this data? This is where Snorkel's take is programmatic data labeling, because all along your machine learning pipeline, you're going to need some concept of labels. So here are my three key takeaways. The first one is that we need to specialize our out-of-the-box models for enterprise tasks, so specialize them to the data. And I think no one's going to argue with that. But we do see, as LLMs come out, you know, and become really popular, that we are using out-of-the-box chatbots.
Elena Boiarskaia [00:02:39]: And they do perform okay, but they're not the solution for enterprise by any means. Next, we'll talk about why labeled data is the key to this whole approach, and then I'll talk about exactly how we do it in a programmatic, software-like way. So think of Snorkel AI, if nothing else, as an IDE for programmatic data development. This is just a quick slide to show that basically, if you take an off-the-shelf LLM, you're going to be lacking accuracy and it's not going to be unique to the organization. So why would an enterprise use it? It lacks differentiation and it lacks accuracy. And the more specialized the industry gets, the more this matters. At my previous company, actually, Tempus AI, we were working with patient data and with oncology. Predicting cancer outcomes is not something you want to be inaccurate about. So the more specialized your system gets, the more critical it is to use the data and build LLMs in a specialized way.
Elena Boiarskaia [00:03:48]: So data curation and labeling is the takeaway that we at Snorkel AI propose, basically. High-quality, curated, labeled data is how you optimize this whole pipeline. And what is it that we always lack when we're building a good machine learning model? Well, we need labeled data, right? Especially in predictive tasks, that's super clear: I need labels to then build a machine learning model. That's how prediction works. Well, guess what? In reality, we don't have good labeled data to work with. When it comes to gen AI, a sentiment that has started to develop, that I've noticed in the industry, is: well, it's generative, so it doesn't require labeled data, right? But the reality is that whatever the generative AI algorithm puts out, we're still going to need to understand if it's any good or not; we're going to need to label.
Elena Boiarskaia [00:04:46]: If this is what I want my chatbot to be saying to my customers, I'm going to want to, you know, give it feedback. I'm going to want to train a quality model. I'm going to make sure that it's utilizing my documents appropriately. So there's data labeling across that whole pipeline, and that's absolutely crucial for a successful system. In this one, I talk specifically about LLMs, but it also definitely applies to predictive use cases, of course. But if you think about it, along the way, as you build a generative AI system, there is a RAG component, where you want to curate the data that is presented to your algorithm to then utilize, say, in a chatbot for customers. That involves embedding, indexing, and chunking, and that process needs to be labeled. It requires labeling.
Elena Boiarskaia [00:05:41]: It requires some sort of human annotation component or some SME involvement. Then we're looking at fine-tuning, alignment, or prompt engineering. All of those go back to making sure that the LLM is quality. So building quality models and making sure they have alignment for the enterprise, even down to how the LLM talks and what its sentiment is as it interacts with your customer, is super crucial, right? So there's that customization. And then the last piece, which is true for every single model ever that goes out into production: you want to evaluate it and make sure you're catching errors ahead of time, that you're catching bias. The theme of this conference, with the t-shirt, is of course, I hallucinate more than my gen AI model.
Elena Boiarskaia [00:06:30]: Well, yeah, this is why evaluation is important. We want to catch biases and hallucinations in this step. So as you can see, along the way, labeling and curating data is the key ingredient. That's why I'm sort of taking data engineering and turning it a little bit on its side, because I'm not talking about where the data is stored. I'm not talking about how we serve it. It's literally how we utilize it to build a successful AI system. And that goes back to labeling. So, part of the stack.
Elena Boiarskaia [00:07:08]: So where does Snorkel fit in? There's compute infrastructure, and there are vector databases that are required for building strong RAG systems. There are a lot of great talks about how to do that at this conference; I've enjoyed many of them. Then there are models being created, and there are lightweight LLMs coming out all the time that are more specialized or tuned to, you know, be good at one thing or another. So there are foundational models, starting with BERT models, a little outdated, but they are still in use. And this is where we fit in. On top of all of that, we have the concept of data development, taking the IP from the business.
Elena Boiarskaia [00:07:49]: So again, it requires SME involvement; it does require domain expertise. With all of my experience in machine learning and data science, we can't forget that, you know, there are subject matter experts who understand whether the output of the model is any good, bottom line. So we never discount that. What we want to do is take the overwhelm out of that system and give the SMEs a way to look at the right sample of data to understand where the error is coming from, and give them a way to create labels that are scalable. So this is our core: data development. And then AI applications sit on top of that.
Elena Boiarskaia [00:08:33]: So that's what you're putting out into the world, into your production. So let's talk about how we do it. Labeling and curating data: we think of it as programmatic. I've kind of skipped over the slide, but this takes hours. Each of these pieces can be systematized, which is how we move away from manual data development. First of all, to scale manual data development, you need to explain to a bunch of people how to do it, which is very slow and difficult, because you can't always nail down exactly what the correct label looks like. It's definitely error prone because, you know, they're no longer SMEs.
Elena Boiarskaia [00:09:16]: You're just kind of trying to, you know, scale it, but how do you trust the label, basically? And then it's not centralized; you no longer control the process. Typically you're starting with a handful of subject matter experts who are hard to scale. So how do we do it? This is where we want to make sure that we're distilling what these SMEs are thinking about when they're labeling data into a faster system that's adaptable to them changing their mind, and auditable, so you can understand how a label was created for a given data point. That is true for almost every enterprise: there needs to be some sort of trail to understand why we made the decision the way we did. It's absolutely crucial, and it sometimes isn't possible even when we have a human annotator for a vast amount of data. So again, the key for us is making this process programmatic.
Elena Boiarskaia [00:10:10]: Just like any software development process, we work with the data in a programmatic way. So here's kind of the overview of what we think about when we think about labeling data. There is that SME collaboration that is absolutely crucial. We need to understand and evaluate how they're labeling data, so they are creating our ground truth, if you will. But then we distill that into programmatic data operations. You can think of it as weak supervision: we can be fairly loose with some of the rules, because what we're going to do is put a bunch of them together to arrive at a good label. That's at the core of what the Snorkel platform does: we create a variety of domain-expert heuristics. For example, if I know that this text contains "free cash," it's probably spam. It's not a complete statement of everything I need to know about spam, but it does contain a piece of information on how to identify spam in your email.
Elena Boiarskaia [00:11:15]: Then, we've included all of the open-source LLMs in the platform, so you can basically prompt an LLM to label your data. And that's just a way for you to get the power of LLMs directly into making your data better, without actually having to put them in production and so on, just leveraging LLMs in a really unique way to label your data. So you can ask an LLM, you can prompt it and say: go through my data; if you see that the email is asking about money, label it spam. That's a pretty good way to use an LLM that's already trained on understanding how to read the email, summarize it, and so on, and from that, that's how I'm creating my labels. So I can use all sorts of code, custom code, and embeddings to build these high-level labeling functions.
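To make the labeling-function idea concrete, here is a minimal sketch using the open-source snorkel library's labeling-function pattern; the enterprise Snorkel platform has its own no-code and SDK interfaces, so treat this purely as an illustration. The `text` field, the label constants, the keyword, and the LLM model name are all assumptions made for the example.

```python
# Minimal sketch of programmatic labeling heuristics using the open-source
# `snorkel` library; field names, constants, and the LLM model are illustrative.
from snorkel.labeling import labeling_function
from openai import OpenAI

ABSTAIN, HAM, SPAM = -1, 0, 1
client = OpenAI()  # assumes an API key is configured in the environment

@labeling_function()
def lf_keyword_free_cash(x):
    # Loose keyword heuristic: "free cash" is weak evidence of spam,
    # not a complete definition of spam.
    return SPAM if "free cash" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_llm_asks_for_money(x):
    # Prompt an LLM to vote: does this email ask the reader for money?
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": "Does this email ask the recipient for money? "
                       "Answer YES or NO.\n\n" + x.text,
        }],
    )
    answer = resp.choices[0].message.content.strip().upper()
    return SPAM if answer.startswith("YES") else ABSTAIN
```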
Elena Boiarskaia [00:12:04]: And also including human annotators, if that data is available, is fantastic. But it's just not scalable, right? We do of course use it because it's invaluable. And then we combine all of this information together into one label per data point. So we're basically providing a ton of signal per data point, with various facets of how we think about labeling, again trying to get SME collaboration and all of that, and getting the power of LLMs, and then distilling it into one label per data point that we will then use to build and train a model. And this applies in a similar way to a RAG system, because it's again: how do we identify what should be presented to the LLM? How do we identify the correct context chunk? So, looking at prompt and context-chunk pairs, then looking at the LLM and understanding: okay, was this response good? Given this prompt, is this a response I expect? And then tuning your LLM to improve the quality that it puts out. Right. So we do need a way to evaluate that the responses are actually giving us what we want. So this is the programmatic development approach; otherwise there's a lot of human annotation that would be required to even verify that any of this is any good, right? To put a model in production, especially one that uses enormous generative AI models, it actually needs to be a little bit more rigorous.
Elena Boiarskaia [00:13:46]: Like, how do you trust the outcome? So this is our approach to that. And to further clarify where Snorkel fits: we start with the AI development process, thinking about metrics and evaluation; this is exactly what my team does at Snorkel. We literally collaborate with the customer and look at what they need and what they're benchmarking against. Oftentimes we'll benchmark against either an existing model or an out-of-the-box LLM and look to improve its performance quite significantly, in almost every case, through this process of understanding how the SMEs think about labels and creating programmatic labels, which we then empower the SMEs to actually do themselves, because the platform is very easy to use in a no-code way. We also can help customize some of that thought process. And then definitely the error analysis is the key piece to this, because then we guide our SMEs to look at the data points where the model is uncertain, where there seems to be a bias, where there seems to be a lack of coverage or precision. Right. So we have that iterative loop, and after a few iterations you see significant improvement in the model.
Elena Boiarskaia [00:15:05]: And at this point, this is the paradigm shift I want everyone to walk away with. Notice I'm not talking about feature extraction or feature importance, nor am I talking about model tuning. If you have a good label, your model can be very simple, and then it's much lighter weight to deploy in production. So this is the reusability, the risk mitigation, and actually also the scalability that this approach offers. I'm coming from earlier systems where feature engineering was the key to the whole thing; think of this as shifting that effort to thinking about labels. Yes, you need to understand what's in the data, but we can work off of unstructured data here and then create labels for it. And to give it a more concrete look for you guys, I'll quickly run through exactly how it works with some screenshots.
Elena Boiarskaia [00:16:12]: So again, SMEs are already doing this implicitly; it's in their head. How do you translate that to programmatic labeling? So this is the platform; this is just a direct screenshot of the platform, if you can see it. We have a no-code and a code-based SDK, and we allow both. So the SME can literally go in and write a fuzzy keyword match and say, you know, if the document has these keywords, it's likely this label. And that's great, to empower them to literally put their intuition into this programmatic labeling. And then they can do it themselves, of course, but we also help with completely custom labeling functions. So if there's some nuance that we want to capture, it's completely supported, if you will.
Elena Boiarskaia [00:17:04]: And then once we create these labeling functions, and you notice I've created a bunch here: one is prompt based, one is using embeddings, where I'm clustering the embeddings, and another one is using just a direct regex, we combine them all together into a single label for our data point. So each data point gets a label, and that's how we scale up a small number of labels across the whole dataset. This is what enterprises struggle with, right? They have a whole bunch of data with maybe some small number of ground-truth labels, like a hundred, that they're working off of, and they're trying to figure out: how do I apply that to my entire dataset? So this is exactly that process. Through distilling these labeling functions, we're able to scale up to the entire dataset, iterate from there, and verify that it actually works. And part of Snorkel's IP is how we denoise and orchestrate the combination of these labeling functions, and then guided user error analysis.
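Continuing the earlier sketch, this is roughly how several noisy labeling functions can be combined into a single label per data point using the open-source LabelModel; the platform's own denoising and orchestration are proprietary, so this only illustrates the weak-supervision idea, and the toy DataFrame is made up.

```python
# Combine noisy labeling-function votes into one label per data point,
# continuing the earlier sketch; the toy data is illustrative.
import pandas as pd
from snorkel.labeling import PandasLFApplier
from snorkel.labeling.model import LabelModel

df_train = pd.DataFrame({"text": [
    "Claim your free cash now!!!",
    "Agenda for tomorrow's team meeting",
    "Please send money via wire transfer today",
]})

lfs = [lf_keyword_free_cash, lf_llm_asks_for_money]   # from the earlier sketch
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)                  # one vote per LF per row

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=500, seed=42)

probs = label_model.predict_proba(L=L_train)          # probabilistic labels
preds = label_model.predict(L=L_train)                # one label per data point
```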
Elena Boiarskaia [00:18:09]: So we will immediately know where the model has low confidence with respect to the labels, meaning the labels disagree with each other. Let's pretend we have, you know, five labeling functions for one data point. We are using a model that distills those into a single label, but we see that there's low certainty, low confidence, and then we build a model on top of that. How do we expect that model to be any good? We don't. We put it to the user, right, to identify: okay, there are actually 47 data points that I need you to look through again.
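As a rough sketch of how those low-confidence points could be surfaced for SME review, continuing from the previous snippet: where the labeling functions disagree, the label model's probabilities sit near uniform, so those rows are the ones to hand back to a human. The 0.7 threshold is an arbitrary example.

```python
# Surface data points where the combined labeling functions are uncertain,
# so SMEs review only that small slice; the threshold is illustrative.
import numpy as np

confidence = probs.max(axis=1)                 # confidence of the winning label
needs_review = np.where(confidence < 0.7)[0]   # e.g. "47 data points to re-check"

print(f"{len(needs_review)} data points flagged for SME review")
print(df_train.iloc[needs_review]["text"].tolist())
```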
Elena Boiarskaia [00:18:44]: So this is how we narrow in on what to look at. We as humans don't have the ability to look through the entire dataset; that is how we scale. We guide them into tagging the smaller, distilled, error-prone set of documents or whatever data points they're looking at. Then there's the fine-grained evaluation. This is where we start to look at data slices: do we have enough coverage across our labels? Do we have enough coverage across facets of the data? We compare metrics and then again identify where to look. This is not a single shot; this is multiple iterations.
Elena Boiarskaia [00:19:26]: But the multiple iterations are almost gamified. You're constantly seeing improvements in the model as you add more and more labeling functions and SME knowledge into it. So I've pretty much covered this slide. The main advantage is that you're going to improve accuracy, it's going to scale, and it's going to work a lot faster. This is something I've done as a data scientist in my past, right? Working with the SMEs to understand labeled data, but then how do you apply that to the rest of the dataset? This applies to any system you're trying to build, and RAG and LLM creation is no different.
Elena Boiarskaia [00:20:03]: Right. Because we still need someone to verify that this is a good model to put into production. So I'll talk quickly about the use cases that we support. We pretty much cover the breadth of enterprise use cases that involve document classification and information extraction. We support computer vision and entity linking, all sorts of use cases that apply to a variety of verticals. So we have a lot of telco, insurance, finance, healthcare, pharma, and so on. Happy to connect with you more about that, but this is my last slide, I believe.
Elena Boiarskaia [00:20:45]: So, to clarify, the data ingestion can come from a variety of systems that we support, including Databricks and others. And then the modeling happens in Studio, inside the platform, and we can be hosted on any cloud or on-prem. Then we output an MLflow object to put into production anywhere that you want. So that is my talk, and hopefully the key takeaway you leave with today is that programmatic labeling helps with...
Adam Becker [00:21:22]: Your AI system, with your AI system. Programmatic labeling and data development. I can't get over it; that is just the perfect phrase, data development. I have a couple of questions, or just something that perhaps folks in the audience would want clarified. So can you go back a few slides, I think, to the one where we said, okay, it might be spam if you see this string, it might be spam if it's asking for money. I thought that was also an excellent slide.
Elena Boiarskaia [00:21:53]: Yeah.
Adam Becker [00:21:54]: Yes, yes.
Elena Boiarskaia [00:21:54]: It's very concrete.
Adam Becker [00:21:55]: Yeah, it's so good. And what I want to get your thoughts on is this: you say that we can essentially create these labels programmatically, and we're essentially creating these labeling functions, and these functions can take the form of this. Now, one person could ask: why can't we then use the exact same labeling functions for inference? Like, okay, fine, an email comes in, we should just check. Why not do it like that?
Elena Boiarskaia [00:22:31]: A rule-based system, which, as we know, is not robust enough to make inferences about new things that come in. And I've done this with a variety of use cases, converting rule-based systems into ML. The interest lies in identifying the fuzziness, right? If we create a single rule, the reality is you're going to say: okay, if it says "free cash" and then it has this and then it has this, and it gets really complex for you to build out that line of thinking. And then you're probably going to have a really long SQL WHERE clause. I've seen this with enterprises, like an 80-page WHERE clause in SQL. That is the rule-based system.
Adam Becker [00:23:15]: Right.
Elena Boiarskaia [00:23:16]: In this case, it's not scalable.
Adam Becker [00:23:18]: Yeah, it's not scalable. And also, the goal here isn't necessarily to create a classifier. The goal here is to be able to attach labels that can then go into downstream training, right? And that's what is so interesting about this. So we still need to do this sort of thing, coming up with these rules, but not so much in order to create the system as in order to inform the downstream intelligence of the kinds of...
Elena Boiarskaia [00:23:46]: So actually, the two-stage process that we have is exactly this: we go through this, we create the labels, and then, if you want to do inference, we can build a model on top. The model then becomes a lot simpler, because you go pretty much from your raw data to these labeled data points, and then you can have something as simple as a logistic regression or a BERT model. So it's no longer about model tuning to get the power for inference; it's about distilling the understanding of how to label the data across the entire dataset, not just a small known set of ground truth.
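As a sketch of that two-stage idea, continuing the earlier snippets: once each data point has a programmatic label, a deliberately simple model (here TF-IDF plus logistic regression) can be trained on those labels and used for inference. Dropping abstained rows and the choice of features are assumptions made for the example, not the platform's prescribed pipeline.

```python
# Second stage: train a simple downstream classifier on the programmatic
# labels and use it for inference on new data; details are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

mask = preds != ABSTAIN                        # drop points the label model abstained on
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(df_train["text"][mask], preds[mask])

# Inference on unseen emails now goes through the simple model,
# not through the pile of rules that produced the training labels.
print(clf.predict(["Urgent: wire transfer needed today"]))
```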
Adam Becker [00:24:21]: I love this. And there was one more question that I had, which was, when you say a sample or a data point, can that be an entire document?
Elena Boiarskaia [00:24:32]: Absolutely.
Adam Becker [00:24:33]: Text. Right. It's not limited to what people might initially think. It's just like some, like...
Elena Boiarskaia [00:24:39]: I want to reiterate this again. Yes. Thank you so much for raising that. Because in the previous way of thinking, I consider, like, you know: I want structured data, I want it to be featurized, each feature is a column. This is not what we're talking about. We're talking about basically raw, unstructured text, and we can then work with it.
Elena Boiarskaia [00:24:59]: You know, we use NLP, so you can see the examples here: you can use keywords, you can use embeddings, and that's where we do the most processing, which takes away the burden of feature engineering or of data preparation. There's still some prep required, but a lot less.
Adam Becker [00:25:17]: A couple of questions from the chat. Let's see. We're running out of time, but I think this would be good to get your thoughts on. So, Nick is asking: any thoughts on best practices for evaluating model outputs if the training labels were produced programmatically? So, human manual review labeling?
Elena Boiarskaia [00:25:35]: Yeah, so this is where, I don't know if you're still sharing my presentation, but this is where we do error-guided analysis. This is really where it shines. We identify exactly where the model and the labels disagree, and so we have a way to distill that for manual annotation: we can give you, say, the five data points that you need to review to make sure they're good. And we're always comparing to ground truth if you have it. But this process checks the labels we just created programmatically against what the model that's bridging our raw data to those labels is telling us. So there are several layers of ways we can identify the mismatch, again in a very systematic way.
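As a rough sketch of that error-guided comparison, continuing the earlier snippets: compare the downstream model's predictions with the programmatic labels and send the disagreements, plus the low-confidence rows flagged earlier, to manual review. The specifics here are illustrative, not the platform's actual workflow.

```python
# Error-guided review sketch: disagreements between the downstream model
# and the programmatic labels become the short SME review queue.
import numpy as np

model_preds = clf.predict(df_train["text"])
disagreements = np.where(model_preds != preds)[0]

review_queue = sorted(set(disagreements.tolist()) | set(needs_review.tolist()))
print(f"Send {len(review_queue)} data points to manual review")
print(df_train.iloc[review_queue]["text"].tolist())
```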
Adam Becker [00:26:25]: We got one more. Is Snorkel time-efficient during inference for nuanced data?
Elena Boiarskaia [00:26:31]: So hopefully that's the takeaway from my talk for you guys: this is exactly where we can take the actual document that you're trying to understand and, you know, classify or make into a RAG system or whatever. Any level of nuance required can be captured in the labeling function. The labeling function can be as simple as a regex or as complex as any Python code; really, we support that as a way to distill the nuance.
Adam Becker [00:27:04]: Awesome, Elena, thank you so much. This was beautiful.