MLOps Community

RAG Quality Starts with Data Quality

Posted Sep 20, 2024 | Views 96
# RAG
# Named Entity Recognition
# Tonic.ai
SPEAKERS
Adam Kamor
Co-Founder & Head of Engineering @ Tonic.ai

Adam Kamor, PhD, is the Co-founder and Head of Engineering of Tonic.ai. Since completing his PhD in Physics at Georgia Tech, Adam has committed himself to enabling the work of others through the programs he develops. In his roles at Microsoft and Kabbage, he handled UI design and led the development of new features to anticipate customer needs. At Tableau, he played a role in developing the platform’s analytics/calculation capabilities. As a founder of Tonic.ai, he is leading the development of unstructured data solutions that are transforming the work of fellow developers, analysts, and data engineers alike.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

Dive into what makes Retrieval-Augmented Generation (RAG) systems tick—and it all starts with the data. We’ll be talking with an expert in the field who knows exactly how to transform messy, unstructured enterprise data into high-quality fuel for RAG systems.

Expect to learn the essentials of data prep, uncover the common challenges that can derail even the best-laid plans, and discover some insider tips on how to boost your RAG system’s performance. We’ll also touch on the critical aspects of data privacy and governance, ensuring your data stays secure while maximizing its utility.

If you’re aiming to get the most out of your RAG systems or just curious about the behind-the-scenes work that makes them effective, this episode is packed with insights that can help you level up your game.

TRANSCRIPT

Adam Kamor [00:00:00]: Hey, everyone, my name is Adam Kamor. I'm the co-founder and head of engineering at Tonic AI. And I take my coffee black using a normal coffee machine, like with the filter, the drip, the drip coffee, I think is what it's called. And I use Folgers because that's what they have at Costco.

Demetrios [00:00:23]: Welcome back to the MLOps community podcast. Surprise, surprise. Today we're talking about data, not just any kind of data, the data that is a little bit more sensitive, that PII data. And how, when you are dealing with RAGs, when you've got that chatbot that you think is going to change the whole direction of your company, it's going to inform all of your leaders on the status of projects, how do you make sure that all that sensitive data does not get leaked by the chatbot? Well, we get into it today with Adam. Huge shout out to the folks at Tonic for supporting the community on this one. If you enjoy this episode, feel free to share it with one friend. And let's get into the show.

Demetrios [00:01:17]: So we gotta start with the name of the tool. Did you guys know what you were doing when you called it textual? Because in my mind, getting textual with somebody is something totally different.

Adam Kamor [00:01:31]: We realized that connection after we came out with the name. It's like, you know, we come up with the name and I start using it in all of like, the sales calls. And after the third or fourth sales call, I'm like, wait a minute. And then it kind of hit me. And it has been an ongoing annoying joke within the company since then. I think, you know, there's that saying among, like in software development. Like the two hardest things are like cache problems and variable naming. Well, product naming is way harder.

Adam Kamor [00:02:07]: And by the way, the biggest constraint in naming a product is figuring out what domains are available. Of course, like when we started Tonic AI, like before, we were like naming our products like specific names. We went with Tonic AI because like, everything else is taken, man, there's just no domains out there. And unless you're willing to just like, like really, really misspell the name of your product. Oh, yeah. Then you just can't get a good domain. And I remember, yeah, I hate the miss, the misspelling of products as it's impossible to search for them.

Demetrios [00:02:40]: Exactly. Or you go with some random dot, like the .io got popular, the .so's are pushing it. Or then you have, I've seen you can go with the .ca or the .co and that's right, that's right.

Adam Kamor [00:02:56]: I mean, we're AI and I. So, yeah, you're loud and clear. That was an issue that we had as well.

Demetrios [00:03:02]: Well, good call on Tonic, because that gives an aura of sleekness and like a gin and tonic type thing.

Adam Kamor [00:03:12]: Yeah. And, you know, we actually had a product for a while named Djinn, but we spelled it. We didn't spell it G-I-N. Actually, going back to not spelling things correctly, we spelled it D-J-I-N-N. Like, you know, like a djinn. Like a genie.

Demetrios [00:03:26]: Oh, yeah, yeah.

Adam Kamor [00:03:27]: Like a clever play on that. But I. I think it was perhaps too clever.

Demetrios [00:03:32]: Yeah, yeah. Sometimes you can outdo yourself and.

Adam Kamor [00:03:36]: Yeah, that's right.

Demetrios [00:03:37]: So the. But speaking of Textual, let's get textual right now. Yes, I did do that. I can only imagine how much you hear that.

Adam Kamor [00:03:45]: That's actually a great. I might put that on t-shirts for conferences. Let's get textual.

Demetrios [00:03:49]: There you go.

Adam Kamor [00:03:49]: That's great.

Demetrios [00:03:50]: Textual tech.

Adam Kamor [00:03:52]: Textual healing, perhaps.

Demetrios [00:03:54]: Exactly.

Adam Kamor [00:03:54]: Yeah.

Demetrios [00:03:55]: That is 100% what I was thinking of. So.

Adam Kamor [00:03:58]: All right, thank you.

Demetrios [00:03:59]: That is not exactly the road we're going to go down right now, though. I will switch it up because. Yeah, yeah. I mean, we could go PG-13 if you want, but we've been known to go R sometimes. But for this very thing, I think what you all are doing with Textual is really cool. And it's going to the data layer and making sure that some of these challenges that you face in the data layer can help you downstream when you're working with AI and ML. And so can you just give us this overarching idea of what Textual is real fast so we can understand it? And then I'll dive into some of the challenges that come off of it or that it helps to curb.

Adam Kamor [00:04:48]: Yeah, absolutely. So, Tonic Textual is a tool that is used to build top quality data pipelines for people that are building RAG systems. We got started on this because we realized quickly, when we were building our own internal RAG solutions, that data quality really trumps everything else when it comes to getting good quality answers from your RAG system. If you're not providing your LLM with the right context, then, especially when you're asking questions about, like, private data, you're just never going to get the right answer. So building a pipelining tool that actually gives you, like, good quality chunks, where you're kind of, like, optimizing for RAG retrieval, is really the best way to build a good RAG system.

Demetrios [00:05:37]: So the obvious statement here is, you gotta have the quality data before you can do anything with it. Right. And I've heard a lot of people complain about how. Well, my knowledge assistant would be much better if people actually documented properly in my company.

Adam Kamor [00:05:58]: Yeah, that's. Yeah, well, yeah, you have to have the data. That's definitely the prerequisite. I, you know, I can't do anything about that for you, unfortunately. I'd say getting people to document is not something that anyone is going to ever solve. Yeah, but, yeah.

Demetrios [00:06:12]: Doesn't that scare you? Doesn't it feel like you're built on a house of cards there? Because it's like, documenting, getting people just.

Adam Kamor [00:06:19]: Fighting human nature every step of the way? Yes. No. So, like, when you look at, like, most enterprises, or not even enterprises, just really companies of any size, they always have very large amounts of data. I mean, sometimes too much data, honestly, where you actually don't even need all of it. And a lot of it can be a distraction for what you're trying to do. And especially as organizations are starting to. What we were seeing early on, and it's still mostly true today, is that companies are building RAG systems to solve internal problems. Here's our HR chatbot that, you know, employees can go to, to ask HR questions, or, you know, here.

Adam Kamor [00:06:58]: Here's my chatbot where people can ask about the company handbook, you know, things of this nature. So, like, internal chatbots that were mostly, I don't want to say that they were there for entertainment. They're certainly solving a need, but people were doing things that were, I think, relatively straightforward to begin with, to kind of figure out what's going on. Cause this is new to everybody, and we're starting to see, like, organizations starting to, like, build more serious RAG systems as time progresses. Then we're now starting to see companies build externally facing chatbots or RAG systems. And when you get into the externally facing ones, it's typically being used by their customers, and it's incorporating customers' personal data into the experience. And that's where we really come into play, when we're dealing with information that isn't just publicly available in, like, a company's policies or on some, like, internal Notion page. Like, we're getting into the actual data of the enterprise that involves their customers, and that's where we prefer to focus.

Demetrios [00:07:59]: Yeah. And I know one of the huge benefits of this is you don't get that leakage. And I have a fairly cool understanding of how you can swap out a lot of this PII, because I've looked at what you all are doing before, and I've gotten textual, we could say. But maybe you can give me a quick overview, or the rest of us a quick overview, of how you make sure what data stays and what data goes, and what is not safe versus safe.

Adam Kamor [00:08:32]: Sure. So behind Tonic Textual is a set of proprietary, built-in-house NER models. So NER is a subset of NLP, which is natural language processing. NER means named entity recognition. It's essentially a machine learning solution that identifies entities within unstructured text. So I say, hey, my name is Adam, I work at Tonic. Presumably the entities that we would pull out from that are Adam being a name and Tonic being a company. That would be an example of named entity recognition.

Adam Kamor [00:09:07]: All of the data that you bring into Tonic Textual for the purposes of building your data pipeline runs through our NER models to identify entities that are, A, sensitive, so that we can kind of address any sensitive information to prevent data leakage, and, B, entities that are interesting. Like, not all entities we find are sensitive. Like, yeah, okay, sure, we can find SSNs and credit card numbers. Those should probably never land in your RAG system. But we can also identify things like what product is being discussed in this customer service transcript, or like, okay, what company does this person work at. That might not be sensitive. Then we can use the entities that we find to actually build for you better chunks that can optimize your RAG retrieval to get better quality answers. So all of the data goes through these NER models, and we address anything that's sensitive through a variety of factors, or a variety of techniques, rather. And then we also kind of augment your chunks with additional entity metadata that we find. So, you know, we talk about both quality and safety of the data in general.
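
As a rough illustration of the NER step Adam describes, here is a minimal sketch using the open source spaCy library. This is only an illustration of the concept, not Tonic's proprietary models.

```python
# Minimal named entity recognition sketch with spaCy (open source).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with an NER component
doc = nlp("Hey, my name is Adam. I work at Tonic.")

for ent in doc.ents:
    # e.g. "Adam" -> PERSON, "Tonic" -> ORG
    print(ent.text, ent.label_)
```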

Demetrios [00:10:15]: Okay. And I know that chunking is one of the hardest things, making sure that, a, you don't have too many chunks, but you give it enough context.

Adam Kamor [00:10:26]: Right.

Demetrios [00:10:26]: And have you figured out some tricks around that?

Adam Kamor [00:10:30]: I mean, tricks, yes. You know, general solution that works for everybody perfectly every time. No, of course not. I don't think such a thing exists. It might not even be possible. Our approach is, of course, our solution ships with different chunking techniques that we've developed. I wouldn't want to claim any of them are state of the art or revolutionary. Our approach instead is that our system plugs very well into the existing work you've already done, and you can really integrate into the workflow that we provide to you any chunking algorithm you like.

Adam Kamor [00:11:05]: So you might already have your own purpose built chunking algorithm that you think is ideal for your data that you developed in house, and you can work that into your textual pipeline without issue. So we basically take this approach out of the box. We have something for you. If you think you can do better, and maybe you can, you can also bring in your own algorithm.

Demetrios [00:11:27]: Nice.

Adam Kamor [00:11:28]: The idea that there's going to be one chunking algorithm to rule them all. Um, I don't think it'll, I don't think, you know, we're there yet.

Demetrios [00:11:37]: No, I like how you say that. It is a very case by case basis, because we don't really understand until you have lots of cycles and lots of iterations on what this looks like and how it performs on this specific use case with this set of data or this embedding model that we're using, whatever it may be. You've got a lot of different variables in there.

Adam Kamor [00:12:03]: That's right. I'll give a real trivial example. Let's say you are building a RAG system where some of the documents are, like, FAQs, right? You know, typically for an FAQ, what I've seen work very well is you want to chunk each question-answer pair separately. Yeah. So you kind of, like, don't really look at, like, length as much, but it's just like, okay, here's the question, here's the answer, that's one chunk, and then go to the next. Like, that's a very simple solution, but it's also incredibly bespoke and does not apply to, like, any other document out there.

Adam Kamor [00:12:33]: Right.

Demetrios [00:12:33]: Yeah.

Adam Kamor [00:12:34]: So, like, you know, it's like you might not even have one chunking algorithm for your entire corpus. It might be like, this is what I do on these types of documents, this is what I do on these other types of documents. It could really be all over the place.
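
A toy sketch of the "different chunkers for different document types" idea might look like this. The function names and the Q:/A: FAQ format are assumptions for illustration, not anything shipped by Tonic Textual.

```python
# Dispatch a chunking strategy per document type instead of forcing one
# algorithm on the whole corpus. Purely illustrative.
import re

def chunk_faq(text: str) -> list[str]:
    # One chunk per question-answer pair, regardless of length.
    pairs = re.findall(r"Q:.*?A:.*?(?=(?:\nQ:)|\Z)", text, flags=re.S)
    return [p.strip() for p in pairs]

def chunk_by_length(text: str, max_chars: int = 1000) -> list[str]:
    # Naive fallback: fixed-size character windows.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def chunk_document(doc_type: str, text: str) -> list[str]:
    if doc_type == "faq":
        return chunk_faq(text)
    return chunk_by_length(text)
```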

Demetrios [00:12:45]: Oh, dude, what a good point. That is how you don't need to be stuck with one chunking algorithm and figure out what the best algorithm is for the different documents that you're working with.

Adam Kamor [00:13:00]: That's right. And I think this goes to that idea. Like, I don't know, we've been doing RAG stuff now for a while, and it gets more complex every day. I'm not saying that's the most complex concept in the world or anything, but it's one example of how things are ramping up quickly in this space.

Demetrios [00:13:21]: Yeah, I always laugh because it's a little bit, to me, like we are engineering band-aids, but not really going to the root cause of the problem, which is that these LLMs can spit out things that are absolutely made up sometimes. And so we have to try and do all this engineering work to make sure that we cover our asses so that it doesn't happen.

Adam Kamor [00:13:48]: That's right. It is. It's a very challenging, I mean it's all, it's all very challenging. Like I think you can get to like a decent quality system, I think, I'm not going to say with ease, but like you could, you can get something that mostly works out there quickly and then dealing with the edge cases and like the occasional hallucination, like it just, the complexity goes high quick. It's like the last 10% is the hardest, that.

Demetrios [00:14:16]: Exactly. Yeah, it's so true. So aside from scrubbing the PII, or the entities that are actually worth scrubbing versus ones that can give you insight, and making sure that you're chunking properly depending on the documents and the algorithms you're using, are there any other things that you feel are essential for this data preparation before you're sending it to the RAG, or when you're using the data for RAG use cases?

Adam Kamor [00:14:48]: Yeah. So let me give you the common workflow that we see among our users, and I think that paints a pretty good picture both of our capabilities and, really, I think what should be done when building these pipelines. The first is you need to understand, okay, well, what kind of documents do I have and where do they live? For every document type, there's a different way of extracting out the relevant information. In a Word document, a Word document actually is just a zip file, it's a zip archive. And inside the zip archive is a bunch of XML. Programmatically, you can actually traverse through Word documents, Excel sheets, PowerPoints, et cetera. You can programmatically traverse through these files and extract out all of the content and material you need. Within these XML files is all of the information you need about these documents.
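
As a concrete sketch of the "a Word document is just a zip of XML" point, the snippet below pulls the paragraph text out of a .docx with nothing but the Python standard library. It illustrates the idea only; it is not Tonic Textual's extraction code.

```python
# Pull plain text out of a .docx by treating it as what it is: a zip of XML.
# Illustrative only -- real pipelines also handle tables, headers, styles, etc.
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_to_text(path: str) -> str:
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{W}p"):                         # each <w:p> is a paragraph
        runs = [t.text or "" for t in p.iter(f"{W}t")]   # <w:t> holds the text runs
        paragraphs.append("".join(runs))
    return "\n".join(p for p in paragraphs if p.strip())

print(docx_to_text("handbook.docx"))  # hypothetical file name
```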

Adam Kamor [00:15:35]: So for office documents, extraction is just a programming problem. PDFs, on the other hand, typically you're using some type of OCR process to extract information. Then you have other file types like images, which is also OCR, and plain text, which is relatively straightforward. But for every file type you've got to consider: how am I going to extract out this information? Then you have the question, okay, well, where does my data live? We commonly see data being stored in S3 and Azure Blob Store and GCS, but it's also going to be stored in Confluence pages and Jira tickets, Notion docs, et cetera. How are you going to access this data? Then you have this question of, okay, well, in all of these file systems, and I'm using the word file system loosely here, how often are files changing? What process do I need to go through to ensure that, as files are being added, modified, and deleted, those changes percolate into my vector database? Obviously I'm bringing this up because we have a solution to it, which I'll get to momentarily. You've identified where the files live, what the files are, how frequently they're changing. Then what essentially you want to do is begin extracting out all of the information that you need, which typically means just converting all of these documents into some version of plain text. We prefer to use markdown.

Adam Kamor [00:16:57]: So what we do is we convert all of your complex file formats into markdown, which is plain text with some syntax sugar that you can add on top. It's nice because you can represent complex file formats like PDFs and Word documents in plain text and preserve a lot of the layout and formatting. You still get lists, tables, headers, titles, things of that nature.

Demetrios [00:17:21]: Let me stop you right there, because you are bringing up such a great point that goes right in line with the idea of how important data quality is for your AI systems. How are you confident? Or how can you know, you as a user or an engineer that's creating these systems? Know that when I just ingested that PDF and it references, there's a paragraph in there that references table one. Table one is in there labeled table one. All that fun stuff, which is, I've heard an incredibly challenging problem when you're trying to do it at scale.

Adam Kamor [00:18:01]: Yeah, it is. So, I mean, the first step is you need to be able to extract tables from your documents. And that means identifying that some section of the PDF actually has a table in it. In an Excel sheet it's pretty obvious, and in Microsoft Word you can make a table object and add rows and columns. That's all relatively straightforward, but in a PDF, the table could be anything. It could even be very artistic in nature, where a designer crafted a table that is highly stylized. The first step in what you're asking is how do you identify tables in documents and PDFs? And that alone is pretty challenging. There's a few techniques out there, and there's a few libraries that'll just solve it for you.

Adam Kamor [00:18:51]: And I say solve, meaning they do. Decent. A common OCR system that most folks know is tesseract. Tesseract out of the box doesn't support table identification. There's some other open source OCR systems that do a better job, like paddle OCR is one that I've looked into recently that actually does support table detection. There's also cloud providers. Azure has the document intelligence service and AWS is textract. These OCR providers actually support table identification and extraction out of the box.

Adam Kamor [00:19:24]: You could just go with one of these cloud providers and get this table extraction. There's also cool techniques, actually, that combine using Tesseract, like, this open source OCR engine, along with an LLM, and kind of combining the two to identify tables. Like, extract all the text with Tesseract, send it to the LLM, ask it to kind of format this in the way that you might expect, and it'll typically do a pretty good job at identifying tables for you. So that's another approach you have. But you've got to figure out how you're going to do your table identification. And of course, this goes without saying, Tonic Textual supports the identification and extraction of tables as well as key-value pairs, which is another, like, pretty common concept in OCR documents. Like, imagine a tax form. Like, it's got the name box, you put your name, it's got the SSN box, you put your SSN.

Adam Kamor [00:20:12]: Like, you know, the keys would be name and SSN, and the values would be the name and SSN itself. Yeah, yeah. And then the second point you're asking, okay, well, like, there's a paragraph that references the table. How do you tie them together? Typically it requires a second pass be done with some additional model that can kind of help associate things together. And it's not easy, but what a lot of folks will do is they'll typically take follow-up passes with large language models. It could be OpenAI or it could be a lower-weight or a lower-parameter-count model that they deployed locally, but they're going to have to do a second pass of the LLM to kind of, like, draw those associations.
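
A very rough sketch of the Tesseract-plus-LLM combination Adam mentions might look like the following. The prompt wording and model name are assumptions, and this illustrates the general pattern rather than anything Tonic ships.

```python
# Sketch: OCR a page with Tesseract, then ask an LLM to recover table structure.
# Assumes the pytesseract, pillow, and openai packages and an OPENAI_API_KEY.
from PIL import Image
import pytesseract
from openai import OpenAI

client = OpenAI()

def page_to_markdown(image_path: str) -> str:
    raw_text = pytesseract.image_to_string(Image.open(image_path))  # plain OCR text
    prompt = (
        "The following text was OCR'd from a document page. "
        "Rewrite it as markdown, reconstructing any tables as markdown tables:\n\n"
        + raw_text
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model works here; name is an assumption
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```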

Demetrios [00:20:57]: Yeah, yeah, yeah, it's awesome. All right, continue back with the rest of what you were talking about. I just had to ask because, dude, that is such a difficult problem to get into and really have the confidence that, all right, I just ingested two gigs of PDFs. What? Or let's say two terabytes of PDFs.

Adam Kamor [00:21:23]: Yeah.

Demetrios [00:21:23]: How do I know that all these tables are, a, showing the right data, b, being referenced and connected correctly? And so I like this very robust way of going about it and just throw another model at it when in doubt.

Adam Kamor [00:21:39]: Basically. I know I don't like saying, oh, just run it through an LLM a second time, but like, I. They're incredibly effective at those types of challenges. And like, look, in an ideal world, you have your data science team, you guys go train your own BERT-specific model to do it. And you can do that, and it'll run faster and more efficiently, and you can run it for cheaper, but it's going to take time to develop, and at the end of the day, it might be faster just to go send it to, like, your LLM of choice. Yeah. And it all comes down to, like, you know, the pros and cons and, like, the opportunity cost for your specific situation.

Demetrios [00:22:17]: Yep.

Adam Kamor [00:22:17]: Right. So, for example, like, in Tonic Textual out of the box, we don't use any large language models. All of the models that we kind of had developed and that we ship with, they're all proprietary models trained and fine-tuned by us. Typically, we focus on BERT or RoBERTa models as, like, our, like, core architecture, and then we fine-tune and improve on top of those. We found those to just be more cost effective and better at, like, very specific tasks that we're doing. Like, specifically, like, our NER models. There's no LLM in use for any of our NER detection.

Demetrios [00:22:51]: Yeah, I was going to say, potentially what you're looking at is better, cheaper, faster. Right?

Adam Kamor [00:22:56]: Yes, it is. It's. It's more expensive to develop on the front end because, you know, we have to go, you know, train. We have to go manually annotate, you know, large bodies of text so that we can kind of feed these models. But once you build out that machinery, it becomes actually relatively straightforward and quick to improve. Add new entities to detect things of that nature, and that's kind of where we're at now.

Demetrios [00:23:19]: Yeah, I imagine once you developed one and you saw the payoff, it's hard to go back to or do anything else besides that. Even though you have that initial upfront cost and as you're going through it, you're thinking to yourself, I don't know, should we just use an LLM? It might end up being cheaper, but after you see that this actually works, now I can see why you would only want to use that.

Adam Kamor [00:23:46]: Yeah, that's right. I mean, essentially, like, it also really pays off for our customers in terms of, like, cost of operating, um, both in terms of, like, how long it takes to run. Like, you know, our NER models, like, running on, like, commodity hardware, you know, it's tens of thousands of words per second that are being scanned, and you can scale that out, like, you know, to whatever desired throughput you want. And, like, you just won't get that from an LLM. It's just, it's not going to happen at this point, and if you could, it would probably be very expensive. So, um, back, back to that question about, like, okay, you know, what are folks doing to build these pipelines? So, you know, they, they've begun extracting all the data. Our approach, and I think this works well, is we convert all of your documents into markdown, and, and we do this in, like, I think, a really high quality way where the markdown is really representative of the original document.

Adam Kamor [00:24:36]: But then we also provide our users with essentially a structured output of their document. So think of our solution essentially as taking every unstructured document you have sitting in S3, for example, and creating a structured counterpart to it. Our counterparts are typically JSON. So essentially every document you have gets converted into a JSON document, regardless of the file type and the details of the file. Every JSON document has the same structure and schema to it. So you essentially end up now with a huge body of JSON documents that are all schematically, structurally the same, that you can interact with for building your RAG system in lieu of the original files and the different file formats that you have. And then on top of this body of JSON that we give you, we provide nice utilities and SDKs for interacting with it, and then quickly and efficiently getting the JSON into your data pipeline. So that is typically the approach we.

Demetrios [00:25:34]: take, but the data isn't. So you basically transform the data, but then you throw it back into S3, or you have that in a database.

Adam Kamor [00:25:46]: Where does that, it's really, like, up to the user. By default, if the user, for example, has all of their PDFs or Word docs or whatever in S3, then we're going to dump a bunch of JSON documents into S3 as well. You kind of specify where, but they'll be sitting in S3. And it's very easy then to programmatically, like, interact with these documents via our, like, SDK. And then from there you can do whatever you like. So, for example, our solution is also available in the Snowflake Native App marketplace. So for customers that want to keep all of their compute within Snowflake, you can run our entire solution on your Snowflake compute. Everything stays within Snowflake.

Adam Kamor [00:26:23]: All the web calls, all of the models are running in Snowflake, et cetera. In that solution, we don't dump into S3. That doesn't make as much sense. We're writing these JSON blobs directly into a Snowflake table. Snowflake has a column type called a variant, which is basically, it's not exactly JSON, but it's very analogous to JSON. And we're just writing all of the results into a variant column in a Snowflake table, and then you can programmatically access it that way.
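
To make the "one structured counterpart per document" idea concrete, a normalized record might look something like the sketch below. The field names here are purely illustrative assumptions; Textual's actual schema will differ.

```python
# A hypothetical shape for a per-document JSON record -- every document,
# regardless of original format, gets the same structure. Field names are made up.
doc_record = {
    "source_path": "s3://my-bucket/policies/handbook.docx",
    "file_type": "docx",
    "last_modified": "2024-09-01T12:34:56Z",
    "markdown": "# Employee Handbook\n\n## PTO Policy\n...",
    "entities": [
        {"type": "NAME_GIVEN", "text": "Adam", "start": 17, "end": 21},
        {"type": "ORGANIZATION", "text": "Tonic", "start": 120, "end": 125},
    ],
    "tables": [
        {"page": 3, "markdown": "| Plan | Days |\n| --- | --- |\n| PTO | 20 |"},
    ],
}
```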

Demetrios [00:26:48]: Nice.

Adam Kamor [00:26:48]: Yeah.

Demetrios [00:26:49]: So you're ingesting and then transforming this data and making it real pretty for anybody to use with. I can imagine. The next steps are, all right, how do we throw this at the embedding models?

Adam Kamor [00:27:06]: Yeah, well, how do we chunk it and then how do we throw it at the embedding models? Which is. Yeah, that's. Sorry, that's what you said, so.

Demetrios [00:27:11]: Yes, yeah, yeah, exactly. And so then once you're there, there was something that you mentioned before, which I know is another very sticky problem. And how do you know those documents? Like you referenced Jira tickets. Right. Those are getting updated all the time.

Adam Kamor [00:27:32]: Right.

Demetrios [00:27:33]: How can you ensure that if somebody is an end user of that internal chatbot and asking a question about a project that is continuously being updated on Jira, they're getting the most up to date information.

Adam Kamor [00:27:48]: That's right. So for all the different, like, data locations that we support, like, on the input side, there's this notion that we have of, like... I want to say this in a generic way. You know what? I'm going to stick with S3 as my example for now, because I think most folks know S3, or Azure Blob or GCS, they're all roughly equivalent. So let's say I upload a file to S3. There's two things that can happen, right? Maybe I upload a new file. So we want to know, our system needs to know, when a new file has been added, meaning we're reprocessing and we encounter a file we haven't seen before. We also want to know when a file has been changed. Meaning, yeah, we processed this file before, but wait a second, its checksum isn't the same as it was the last time.

Adam Kamor [00:28:35]: And then there's the issue of when a file is deleted, and we need to know that as well, meaning we've seen it before, but we're not seeing it this time. So those are the three situations that we find ourselves in. And if we wanted to talk Jira tickets, the equivalent would be Jira will have a notion for a ticket of last modified or last updated, and we will be hooking into columns of that type in Jira. But regardless, whenever you process data using Tonic Textual, whether it's in S3, Jira, or somewhere else, we are keeping track of the three things that I just mentioned. And every time we process, it's just running a delta. Everything we do is a delta operation. The first time you run, it's the whole thing. We have to process the entire backlog.

Adam Kamor [00:29:14]: That'll take time. The next time you run, we're only processing the things that have changed since the previous run. Because of that ability. You can be running our pipelines as frequently as you like. The more frequently you run them, the smaller the delta will be each time. And then they just complete very quickly, like you run it every 30 seconds. Well, most of those runs the delta is zero, nothing has happened. And then occasionally a file has been added, modified, deleted, et cetera.

Adam Kamor [00:29:42]: And then we pick up on that. Then we just provide nice hooks to the SDK user to allow them to easily say, okay, compute the delta in files between the current run and the previous run, give me all the files that have been changed, and then you can do your pipelining based on that change set. That is typically how we approach it.
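
A bare-bones sketch of the added/modified/deleted delta Adam describes, assuming you keep a snapshot of file checksums from the previous run. This is the general pattern, not Textual's SDK.

```python
# Classify files as added, modified, or deleted by comparing a previous
# {path: checksum} snapshot against the current one. Illustrative pattern only.
import hashlib
from pathlib import Path

def checksum(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def snapshot(root: Path) -> dict[str, str]:
    return {str(p): checksum(p) for p in root.rglob("*") if p.is_file()}

def delta(previous: dict[str, str], current: dict[str, str]):
    added    = [p for p in current if p not in previous]
    deleted  = [p for p in previous if p not in current]
    modified = [p for p in current if p in previous and current[p] != previous[p]]
    return added, modified, deleted

# On each pipeline run, only (re)process `added` and `modified`, and remove
# chunks for `deleted`, instead of redoing the whole backlog every time.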

Demetrios [00:30:05]: What have you seen once you know, okay, this has changed. I think the real sticky part happens in the vector database, right? Oh, this file has been deleted. Now how do I go and rip that out of the vector database?

Adam Kamor [00:30:19]: Yeah, that's right. So the most common situation is the file hasn't been modified. So in that case, skip it, do nothing. If a file has been added, then you're going to go through your chunking and embedding and inserting process just like you did for the initial backlog. If the file has been modified, initially you'll do the deletion: you've got to delete the original chunks. It's better to just delete those chunks as opposed to trying to do a delta on a chunk level; just do it on the file level. Delete all the chunks for the file, and then you're going to go and rechunk it using whatever strategy you have already selected. That's typically what we see.

Demetrios [00:31:02]: Basically if a certain file has been deleted, you just go and find those chunks and delete those chunks.

Adam Kamor [00:31:09]: Yes, that's right. How to identify those chunks in the database? That's left, just quoting a textbook, as an exercise to the reader. Right. So you need to set up your vector database in a way that kind of, like, makes this easy and straightforward. Like, typically what we'll see is, like, folks will include, like, a pointer to the file alongside the chunk, to kind of show where the chunk came from. So some approach like that, where it's just easy to identify which chunks came from which files, makes this a straightforward process.
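
The "pointer to the file alongside the chunk" pattern is easy to picture with a toy in-memory store; in a real vector database it is just a metadata column or field you filter on when a file changes or disappears. Everything below is a made-up illustration.

```python
# Toy illustration: every chunk record carries a pointer back to its source file,
# so handling a modified or deleted file is just "delete where source_file == X".
chunk_store: list[dict] = []

def delete_file(source_file: str) -> None:
    chunk_store[:] = [c for c in chunk_store if c["source_file"] != source_file]

def upsert_file(source_file: str, chunks: list[str], embed) -> None:
    # Drop any stale chunks for this file, then re-insert fresh ones.
    delete_file(source_file)
    for i, text in enumerate(chunks):
        chunk_store.append({
            "source_file": source_file,   # the pointer back to the original document
            "chunk_index": i,
            "text": text,
            "embedding": embed(text),     # your embedding model of choice
        })
```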

Demetrios [00:31:50]: Yeah, I've also heard trying to link up the creation dates. And then inside the vector database you can also filter for creation dates and say, okay, this one goes here and still seems like it can get a little bit messy.

Adam Kamor [00:32:09]: It's all like a vector database is a database first and foremost. Right?

Demetrios [00:32:13]: Yeah.

Adam Kamor [00:32:13]: These are very standard database operations that no one would bat an eye at doing in a traditional RDBMS. And you've got to play the same games here. Yeah. Okay. You have chunks and that's cool. And you have your embeddings. That's neat. But there's other columns too.

Adam Kamor [00:32:31]: And it's still software engineering.

Demetrios [00:32:33]: Yeah, nothing's changed just because we call it a vector database.

Adam Kamor [00:32:37]: Yeah, that's right.

Demetrios [00:32:39]: It falls under the umbrella of databases. So treat it like a database, not some new fancy thing that is specific for AI. Although it is, but it's not. It doesn't change everything. Yeah, that's right. I like that.

Adam Kamor [00:32:53]: You know, a good example of this: you can create your vector database in Postgres. Just go install the pgvector extension.

Demetrios [00:32:59]: Exactly.

Adam Kamor [00:32:59]: Get all the power of Postgres and you have your vector columns as well.
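
For anyone who hasn't seen it, the pgvector setup Adam is describing looks roughly like this. It is a minimal sketch using psycopg2; the connection details, table layout, and toy 3-dimensional embeddings are assumptions.

```python
# Minimal pgvector sketch: enable the extension, store chunks with a source-file
# pointer, and run a nearest-neighbor query. Connection string is a placeholder.
import psycopg2

conn = psycopg2.connect("dbname=rag user=postgres")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY,
        source_file text,       -- pointer back to the original document
        content text,
        embedding vector(3)     -- toy dimension; real embedding models use e.g. 1536
    );
""")
cur.execute(
    "INSERT INTO chunks (source_file, content, embedding) VALUES (%s, %s, %s::vector);",
    ("handbook.docx", "Employees accrue 20 PTO days per year.", "[0.1,0.2,0.3]"),
)

# Similarity search: '<->' is pgvector's L2 distance operator.
cur.execute(
    "SELECT content FROM chunks ORDER BY embedding <-> %s::vector LIMIT 5;",
    ("[0.1,0.2,0.3]",),  # a real query embedding goes here
)
print(cur.fetchall())
conn.commit()
```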

Demetrios [00:33:05]: Yeah, I just saw a company that was doing that. Basically their whole thing was, hey, we're postgres, we're managed postgres. And one of the big selling points was we can give you a managed postgres database. Vector database.

Adam Kamor [00:33:21]: Vector database. Yeah. When we were initially building RAG systems, we were just using. God, I hope this is correct. We were using just Postgres on RDS, the managed service at Amazon. And I think we were just installing pgvector on the boxes. I think that's what we did. I'm just debating if pgvector was available in RDS then.

Adam Kamor [00:33:45]: And I believe it was.

Demetrios [00:33:47]: Well, I think that's best practice. Use it until you outgrow it, and then, when you need something more specific, make the transfer. Get one of these, like, the Milvus or the Qdrant type, or LanceDB, whatever it may be. That is when you hit your limitations with what you're already very good at. Because if you already know Postgres and you've been working with pgvector, then that is probably going to be the best tool that you can work with. Until you can't work with it.

Adam Kamor [00:34:26]: Yeah, until, until you can't. But you know, that might never come. So, you know. Yeah. If you're like my advice to everyone, you know, building companies is start, start simple. Make no assumptions about the future. Just focus on today.

Demetrios [00:34:41]: I think that's, that's... focus on good names and .ai's and .coms. That's what you want, folks.

Adam Kamor [00:34:47]: Thanks for including the .ai in there. I appreciate that.

Demetrios [00:34:51]: So there was another thing that I wanted to mention when it comes to the data prep side. So before we hit record, one thing that we were talking about was role based access control and allowing certain things to come into the database, or certain people to look at the database. And this feels like, again, it goes back to that same theory of this is a database. So database things should happen at the database level. And I was asking you, okay, so do you allow with textual, do you allow certain people to have certain access to certain files? And you kind of push back on that. So I would love to hear, let's try and rehash that conversation and act like it's the first time that we're having it.

Adam Kamor [00:35:44]: Yeah, that's good, that's good. So I think that the question that was posed to me before we started recording was, does Tonic Textual help with granting access on an individual basis or on a user basis to different portions of the vector database? Tables, chunks, basically, can you ensure that user A only has access to this set of chunks, whereas user B has access to this other set of chunks? And my response is, no, we don't do that. I see that more as the responsibility of the database system. To me, that is a very vanilla use case for row-based access control. Given users have access to given rows, and that's really the end of it. Where we come into play more is an organization is going to have certain requirements on it in terms of what types of user data can land in the vector database. The common fear that these organizations have, and rightfully so, is data leakage. Right.

Adam Kamor [00:36:47]: Anything you put into the vector database is fair game for the chatbot to regurgitate to the end user that's interacting with it.

Demetrios [00:36:54]: Yeah.

Adam Kamor [00:36:54]: So how do you deal with that? That's really where Tonic Textual comes in. Yeah, we give you very high quality data to make your RAG system better. And then this is empirically proven. And I'm happy to go into those details. But really what we do is we ensure that an organization that says, hey, these types of entities cannot land in the vector database. They are too sensitive. We cannot have them regurgitated. We don't want that data leakage.

Adam Kamor [00:37:17]: That's where we come into play with our NER models to ensure that that data is either never included in the vector database or it is somehow rectified, typically by either redacting certain pieces of information or by synthesizing it in order to get high quality chunks.

Demetrios [00:37:36]: So there's a fascinating kind of screw-up recently that happened with Slack. I don't know if you saw it, how Slack AI can give data leakage for private channels. So you can ask Slack AI. Did you see that?

Adam Kamor [00:37:53]: I think I saw it on Hacker News or something. Honestly, I should have probably read the whole thing because it's incredibly pertinent to what we do, but I must have been busy that day. I have the high level, but I don't have all the details, which maybe you could share.

Demetrios [00:38:08]: Yeah, so it's basically saying, the long and short of it is that you can do some prompt injection and get information from the private channels out in the open when you're talking with Slack AI.

Adam Kamor [00:38:25]: Yeah. Okay.

Demetrios [00:38:26]: And I just wonder from your approach, how would you combat something like that? Because it's not necessarily like you can say this channel is 100% entities that we don't want. Maybe it's certain things that come up we don't want others to see, but actually maybe everything is, we don't want to even touch that or put it in the vector database. Like would you just, yeah, just not put it in or, I mean, yeah.

Adam Kamor [00:39:01]: Certainly you can do that, and then you have a worse system because of it. So that's not ideal. That specific problem, and like I said, I don't know all the details, what you said I'm hearing for the first time. On the surface, it seems like that's more of, like, an access control issue, where, like, how, how could I, you know, how should I be, or why would I be given access to chunks that were generated by other users in private DMs? So to me, that, that's, that on the surface sounds like a row-based access control type deal. Let me give you an example of where you would want to use our solution for removing sensitive information. Let's say you are building a customer service chatbot that customer service agents can use to quickly answer questions that come in from users, or users might interact with themselves directly on the website. Now, one solution could be a user comes in, they're authenticated, so the chunks that they can retrieve are only going to be chunks that have been generated from previous conversations of that user with the customer service system.

Adam Kamor [00:40:19]: Okay, well you could do that, but every user's probably going to have a very small amount of data. Maybe some user only ever calls customer service once. So that is not very helpful. Really, though, you know that customers call in with the same issues. Like if one person's confused, then 10,000 people are confused. Okay, so you want to basically use all of the data generated by everyone to answer any individual's problem. Well, okay, I call, you know, customer support for some place. First thing I do, they're like, okay, what's your name? Okay, I give them my first and last name.

Adam Kamor [00:40:51]: What's your address? What's your credit card number? Like, what's your, what's the last four of your SSN? Like, all of these things they've got to do to identify, you know, you are who you say you are, right? In that situation, those are the types of chunks, or entities rather, that we need to remove prior to, like, building our RAG application and exposing this chatbot. Because that type of information, like my name, has no bearing on, like, the actual problem I'm facing or the solution or whether or not it's going to be useful to someone else. We need to remove that. We need to remove, obviously, the last four of my SSN or, like, my, my three-digit credit card code, things of that nature. So that's more where we come into play. So previously an organization just couldn't use this data. It's like, oh man, there is literally nothing we can do. We can't touch this data because anything that we put into the vector database, we are just going to be in a load of pain and trouble.

Adam Kamor [00:41:44]: But now they can do that because we will go in, we'll identify every entity, every utterance, that is potentially sensitive, and then we can either redact it, meaning we just strike it from the record, replace the utterances with tokens, or we can synthesize it. You know, given a chunk that has a bunch of sensitive entities, we will create for you a new chunk that is semantically identical. But, you know, Adam is replaced with John, my SSN is replaced with a fake SSN, you know, things of that nature. And of course it's done in a very intelligent way where even, like, the semantics across multiple chunks is still maintained. So that's the approach we take. So previously, an organization that couldn't use any of this data now gets to use it safely and in a high quality way to build their experience.

Demetrios [00:42:25]: Yeah. So first of all, I do like that you can kind of do a hot swap so that you still retain the information that is being said. It just has no meaning because it's not the person's name. It's not the company, it's not their screen or their Social Security number. None of the actual valuable stuff is there, but the whole meaning is there. And so you can understand it, hot.

Adam Kamor [00:42:55]: swapping, as you call it. We call it synthesis, but hot swapping, it's actually safer. Let's say I'm given a corpus of text and it's like, okay, let's replace every name with a token. That's like redacting it, essentially. Okay, well, you now get the safe text and you see Adam, and you're like, oop, the model missed Adam. Adam's a real person in the data, cause every other name is tokenized. So you know the model.

Adam Kamor [00:43:19]: And these NER models, by the way, and this is true for any NER model out there, they miss. They're not 100%.

Demetrios [00:43:27]: I mean, this is true for. Yeah. Any ML model, folks, they will miss.

Adam Kamor [00:43:32]: You have to consider that when building the system. Yeah. But instead, if you do this synthesis, or the hot swapping, as you called it, well, okay, Demetrios goes to John. Adam, though, is missed. The attacker or the end user, whoever sees this data, you don't know. You don't, you know, you have some more plausible deniability there. It actually gives you a safer footing.
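
A small sketch of the difference between redaction (type tokens) and synthesis (realistic fake values), starting from detected entity spans. The span format and the use of the Faker library are assumptions for illustration; Tonic's synthesis is considerably more sophisticated, for example keeping replacements consistent across chunks.

```python
# Given NER spans, either redact (replace with type tokens) or synthesize
# (swap in realistic fakes). Illustrative only -- the span format is made up here.
from faker import Faker

fake = Faker()

text = "My name is Adam and my SSN is 123-45-6789."
spans = [  # (start, end, entity_type): a hypothetical NER output for this sentence
    (11, 15, "NAME_GIVEN"),
    (30, 41, "SSN"),
]

def redact(text: str, spans) -> str:
    out = text
    for start, end, label in sorted(spans, reverse=True):  # right-to-left keeps offsets valid
        out = out[:start] + f"[{label}]" + out[end:]
    return out

def synthesize(text: str, spans) -> str:
    fakers = {"NAME_GIVEN": fake.first_name, "SSN": fake.ssn}
    out = text
    for start, end, label in sorted(spans, reverse=True):
        out = out[:start] + fakers[label]() + out[end:]
    return out

print(redact(text, spans))      # "My name is [NAME_GIVEN] and my SSN is [SSN]."
print(synthesize(text, spans))  # e.g. "My name is John and my SSN is 865-50-6891."
```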

Demetrios [00:43:55]: So I just pulled up the report on Slack AI just because I wanted to scratch my own itch. And you really should have read this because it is talking about how it is leaking API keys, which is totally what you could have done. You could have saved Slack. I mean, it looks like they would be a good customer for what? Okay, somebody from the sales team should reach out to the Slack AI team, because in private channels, someone set up API keys, and then they put in a public channel, basically, some prompt injection, and from that they were able to leak the API key when talking with the LLM. Oh, and so the, the funny thing here, though, that I, I think is the sticky problem with Slack, or the, the messy problem, I think we should call it, is that private channels, you, you kind of think they're going to be private. So you might be talking shit about Karen, and now all of a sudden the LLM can leak that data.

Adam Kamor [00:45:04]: Yeah, yeah.

Demetrios [00:45:05]: Now I can't talk shit on Karen.

Adam Kamor [00:45:08]: That's right. Like, I gotta do it. Why even go to work if you can't do that, you know? It's not fun anymore. Yeah, that is, that's interesting. I mean, yeah, like, I would expect my DMs on, really, any system I use, not Slack in particular, to certainly stay private. And the thought of them going into a vector database is certainly concerning. And not even the ones where you're just like, you know, you know. Was it.

Adam Kamor [00:45:38]: Is it. Wait, is it spilling tea? Is it. That's the phrase for gossiping. Yeah, I'm showing my.

Demetrios [00:45:44]: I think so. That's what the cool kids are calling it these days.

Adam Kamor [00:45:47]: Not even those situations. But imagine it's like the private channel of, like, leadership. And like, they're talking about, like, you know, presumably like, important things or.

Demetrios [00:45:56]: Yeah, just important stuff. Yeah, that's right. Porn stuff. That's right. I don't know. You're on the leadership team, right? You know what those conversations are?

Adam Kamor [00:46:05]: Yeah, yeah, absolutely. It's not just spilling tea. There's other things discussed, too.

Demetrios [00:46:08]: Yeah, there's, there's quote-unquote important stuff happening, like the direction of the company. And you want to be able to have those conversations in private before you bring them out to the team. But if somebody can prompt-inject the LLM, then that's a little bit dangerous.

Adam Kamor [00:46:25]: I mean, this is true of, this is going to be true of, like, a company comes in saying, okay, we're building a RAG system on our documents in Notion or in Google Drive or really in any system where, like, there's this notion of, like, shared versus private. You know, this group can see it. This group can't. You know, you have to, like, you know, when I create a document in Google Drive or in, like, Google Sheets or whatever, like, I have faith that the sharing functionality works, just because I've used it for years. Right. But it's complex. And, like, if you're going to be, like, building something on top of that, you need to be as good as Google Drive is in that situation.

Demetrios [00:47:06]: Well, what a great point. Because it's not only Slack that has the notion of private channels and public channels, it's everything that we're using. Any documents, if you're on Notion and you have your private workspace, or you have your private documents in there that you've shared with a few people to socialize around, do you want that to be publicly queryable through the Q&A bot, like the chatbot? And how, how have you seen best practices on what to do, so what's private stays private and what's public is public?

Adam Kamor [00:47:43]: So we, we don't take a, like, our product doesn't take a position here. It's like, our product typically does not. It doesn't, like, hook into, like, you know, the sharing and, like, the access controls of given documents. You can certainly feed that into our system, like, as metadata, as you're processing documents. And, like, we can assist in landing the metadata in the right place. But in terms of, like, what, what happens once the data goes into the vector database and who can access it? That's really up to the end user. And really, in my opinion, it's the responsibility of the database system to provide, like, reasonable, like, RBAC. I think the problem that you're describing, though, like, the ultimate form of this problem is we have tons of data. We need to make it shareable among everyone.

Adam Kamor [00:48:35]: Right? Like the customer service example I gave where I want to train my model on all customer service engagements, not per user. So how do I. Basically, in a world where you have tons of documents and there is implicit sharing in the sense that, like, you know, everyone's discussing their own private stuff, but you have no structured way of knowing what's what, how do you deal with that? And that's where we shine just because of, like, what we're able to do in terms of, like, identifying information within documents in an unstructured way.

Demetrios [00:49:04]: Yeah. And in the ways that you've seen customers building their chatbots, have you seen much of these design patterns of, like, trying to have some kind of guardrails for the question, or guardrails on the output and, or is it mainly around just, hey, the database has got to take the brunt of this one, and it's got to be the roles on that.

Adam Kamor [00:49:29]: Well, you know, I thought you were going a different direction with this. When I hear guardrails, I. My mind immediately goes to guardrails around, like, answer quality and preventing hallucinations and ensuring that, like, the answer given isn't just, like, complete and utter nonsense. The guardrails in terms of, like, who can see what, I'm not really privy to those, like, portions of, like, the setup of our customers. Like, what we try to do for our customers is help them easily build a safe and high quality data pipeline where we try to optimize your retrieval metrics, meaning for a given question, you're always retrieving the optimal set of chunks or the optimal collection of chunks. That's where we typically draw the line. And downstream of that, Tonic Textual doesn't kind of go further. There's, like, things on our roadmap that we're doing that I think actually will end up going further over time, but beyond, right now, like, the data layer, we kind of stop playing at the moment.

Demetrios [00:50:40]: Yeah. And since it feels like in the last six months, graph RAG has had an explosion. Do you have thoughts on that?

Adam Kamor [00:50:52]: I am interested to see where it goes. It is, I guess my only qualm with it at the moment, and I don't even know if it's really a problem I have or just an obvious. It's more of an observation, really. It is relatively expensive right now to build out your graph, and it's expensive because it typically requires an obscene number of LLM calls in order to, like, kind of, like, build out the connections between documents or chunks, or just depending on what level you're working at. We're, we're working on our own technology here at the moment, kind of powered by, like, our entity recognition, to help identify connections between different documents and between different chunks. I think the first version of this we're aiming to kind of release over the next week or two. And it's not going to be as powerful as a graph RAG approach at the moment, but it'll eventually get there. It's more like a technique for grouping documents together that all pertain to a same or similar topic.

Adam Kamor [00:51:50]: So it's more of like a document level topic analysis that's being done. And I think from there we'll likely move to finer grained like per chunk and then building in like, ways to connect documents and chunks together. But I'm excited by graph rag and approaches like it. I think there will be a way to do it that is like reasonable for like large enterprise sets of data. And I think hopefully, you know, we're the ones to do it, but, you know, time will tell.

Demetrios [00:52:19]: Nice. I cannot wait to hear what name you come up with for the.

Adam Kamor [00:52:25]: Yeah, I mean you, you have a, you have a, you have a good take on all that. Maybe you'll, if you think of something funny, um, you know, puns allowed, then let me know.

Demetrios [00:52:33]: There we go. All right. And hey, if anybody listening has some good names, we would love to hear them too. Yeah, that's awesome.

Adam Kamor [00:52:41]: That's good.

Demetrios [00:52:41]: We can give. Yeah, we can have, you know, when I was in third grade, they had all the third graders and, basically, all the elementary kids in the city that I was from try and compete to create the new mascot for the recycling trash trucks. Yeah, yeah, yeah. A kid in my class actually won. So from that point in time, I believe that anything is possible. All right? That is the big learning. And my top takeaway is, like, that kid created the mascot that I now see driving all over town, and that's my buddy right there.

Adam Kamor [00:53:17]: And that's incredible, actually. Did the kid win like an award or anything?

Demetrios [00:53:23]: So these days you would think like, wow, he probably should have gotten like ten grand or at least free trash for a year or something. He didn't get anything, man. And they just used that. They blatantly stole it. They were like, hey, thanks. You applied to this, which we didn't really have a choice, right? We had to do it as part of school. It was like, now we're going to submit this to the trash, the recycling folks.

Adam Kamor [00:53:46]: I mean, that's.

Demetrios [00:53:47]: Bitch.

Adam Kamor [00:53:48]: If my kid won that, I'd be very pleased. That's going on the college application. At least, you know, he got that out of it, at least.

Demetrios [00:53:58]: Exactly. You got a lot. You could juice that for a lot of miles.

Adam Kamor [00:54:02]: Yeah, yeah. 100% for sure. Yeah.

Demetrios [00:54:06]: So, like, that very same use case, if someone out there is listening and they have a great name and you end up using it. Yeah, yeah.

Adam Kamor [00:54:16]: Just like your friend. We won't pay for it. Just give it to me.

Demetrios [00:54:22]: And put it on your college or your MBA or PhD application.

Adam Kamor [00:54:27]: Yeah, yeah, yeah. We're going to pay you in exposure. Perfect. People love that.

Demetrios [00:54:33]: Or at the very least, you might get a promotion. We'll let your boss know that you deserve a pay bump. So. All right, last one before we jump and we stop bullshitting on this retrieval evaluation.

Adam Kamor [00:54:51]: Yes.

Demetrios [00:54:51]: Practices.

Adam Kamor [00:54:54]: Oh, man, that's a hell of a question to end on. That's tough. Look, I mean, I'll say something at first that I think is obvious to people, but I'm going to say it anyways. When you're building a system, you write tests in general, and given this input, this is what I want the output to be. Well, okay, that paradigm doesn't work as well when you're working with large language models because they're not deterministic in the way that a normal software system that you build would be. So how do you test the quality of the system you're building? Like, the end to end quality? Like, given this question, I want an answer that, like, touches on these points or contains, you know, at least this in its answer, you know, things of that nature. How. How do you test this? Um, there's different approaches out there.

Adam Kamor [00:55:43]: We actually have our own open source solution in the space. It's called Tonic Validate, and it's a. It's a RAG performance evaluation framework.

Demetrios [00:55:49]: I was hoping for a better name than that, honestly, after all this talk.

Adam Kamor [00:55:55]: I mean, it's actually like, you know, traditionally it's a good name.

Demetrios [00:55:59]: It says what it does cause people understand it.

Adam Kamor [00:56:01]: You can, you can spell it, there's no, there's no jokes to make. It's just, you know, Tonic Validate. Yeah, by those standards, yeah, it's, yeah, very, very vanilla. So Tonic Validate works essentially by treating your RAG application, or really any LLM-backed application, as a black box. Given this input, we're going to evaluate both the context retrieved and the answer given to determine if it's a good answer. And there's different metrics that we support out of the box that you can use. The metrics themselves aren't perfect. Many of them require the use of a large language model to do the evaluation.

Adam Kamor [00:56:45]: But what they get you is a consistent and easily automatable observer, which is worth a lot because it allows you to see, in a consistent way, trends over time, trends in your scoring. So I'll give you an example of some of the metrics that we see as being the more popular ones. One is called answer similarity. This is probably the best metric, but it's a very expensive one to compute because it requires every test question to have a human-provided reference answer. And essentially we evaluate the RAG answer against the reference answer and we check for similarity. That's a very powerful metric, but again, expensive. And then we have other metrics. For example, like, one metric looks at the question and it looks at the context retrieved, and it kind of tries to determine how relevant the context is to the question.

Adam Kamor [00:57:39]: And then on the flip side, we look at the context retrieved in the answer. How much of the answer is derived from the context? So we're doing metrics like that to kind of help you understand both answer quality, retrieval quality, answer generation quality, etcetera. And then there's other metrics that don't require LLMs, like does the context contain phi? Does the answer contain pii, things of that nature? Or what's the latency, what's the response time? How much did the call cost? How many tokens were used? Did the answer contain this exact substring? Simple things like this. And we make all this available in an open source Python SDK. And then we provide a web UI to help you visualize and understand metrics over time. But the problem is not solved. It's very challenging, especially as you get into more complex rag systems coming on the market. They're complex and they're hard to evaluate, and they have multiple steps and they're agentic and they're multimodal.

Adam Kamor [00:58:40]: And, like, you can't just do, like, a black box evaluation at that point. You want to do, like, careful testing of each step in the process. And I don't think there's a great answer yet. I think there's, like, there's no, like, there's no great solution out there yet. I think Tonic Validate does a decent job. It does a good job, but it's a very challenging problem, and I think it's one that's going to get more and more relevant, especially as RAG systems go from being just, like, internal tools to externally facing production deployments.

Demetrios [00:59:11]: Yeah. And they leak data.

Adam Kamor [00:59:14]: And they leak data.

Demetrios [00:59:16]: And they have more agency.

Adam Kamor [00:59:18]: Yes, that's right. All of the above.

Demetrios [00:59:20]: Yeah. Dude, this has been great.

Adam Kamor [00:59:36]: Close.
