MLOps Community

AI for Customer Experience Teams

Posted Apr 01, 2024 | Views 181
# AI
# RAG
# TwigAI
# Twig.so
Chandan Maruthi
Founder and CEO @ Twig

Chandan Maruthi has over 10 years of experience in enterprise data platforms and AI. He has worked with some of the most renowned companies in the world, helping them develop and deploy cutting-edge AI solutions.

SUMMARY

Chandan discusses the concept of retrieval-augmented generation (RAG), emphasizing its relevance in enterprise settings where specific data and knowledge take precedence over generalized internet information. He delves into the intricacies of building and optimizing RAG systems, including data pipelines, data ingestion, semantic stores, embeddings, vector stores, semantic search algorithms, and caching. Maruthi also addresses the challenges and considerations in building and fine-tuning AI models to ensure high-quality responses and effective evaluation processes for AI systems. Throughout the talk, he provides practical guidance and valuable considerations for implementing AI solutions to elevate customer experience.

TRANSCRIPT


Chandan Maruthi [00:00:00]: Quick intro and I'll dive straight into the content. So this is me. I run a company called Twig. We are AI for customer experience teams. We're based in San Mateo. We have twelve customers and some interesting investors. I used to work at H2O.ai. We used to build an AutoML platform.

Chandan Maruthi [00:00:25]: I also built semantic search at Lambda School. So I'd done a little bit of AI before language models came out, and it just made a whole lot of sense to go all in when we saw how great these models were. So, a quick show of hands, who does not know what RAG is? All right, so almost everyone knows what RAG is, so I'll not spend too much time on this. Essentially, RAG, or retrieval-augmented generation, is a way of telling a large language model to use a specific set of data you give it, rather than trying to use all the Internet knowledge it knows. And it makes a lot of sense to use a technique like this in enterprise B2B use cases where you're trying to solve a specific problem, and you don't care about what is out there on the Internet; you have better information than anyone else. And so, grounding it to your data, your knowledge, is what RAG does, and the most deterministic solutions out there, meaning you can depend on the answers and you know where the answers came from, happen to be RAG systems. This is how we look at a reference architecture for RAG.

Chandan Maruthi [00:01:56]: So you have your data layer, you have data ingestion, you have a semantic layer, you have models, and you have some optimization layers. Essentially, when you get a user question, you convert it into an embedding. You see if you have this information in your semantic cache. A semantic cache is essentially historic answers, where you know these answers are good for these questions. And if you find it in cache, you don't even try to generate a new answer, you just serve it from cache. If you don't, then you look at your vector store, see what information you have that matches the question, and then you use a language model to get that response, right? Pretty standard here. So I'm going to just dive into what we have learned building this in production.
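Editor's note: the request flow described above (embed the question, check the semantic cache, otherwise retrieve from the vector store and generate) can be sketched roughly as follows. This is a minimal illustration, not Twig's implementation; the function name `answer_question` and the four injected callables are hypothetical placeholders for whatever embedding model, cache, vector store, and LLM client a real system uses.

```python
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def answer_question(
    question: str,
    embed: Callable[[str], Vector],                 # hypothetical: text -> embedding
    cache_lookup: Callable[[Vector], Optional[str]],  # hypothetical: known-good answer or None
    retrieve: Callable[[Vector, int], list[str]],   # hypothetical: vector-store search
    generate: Callable[[str, list[str]], str],      # hypothetical: LLM call with context
    top_k: int = 5,
) -> tuple[str, str]:
    """Return (answer, source), where source is "cache" or "generated"."""
    query_vec = embed(question)

    # 1. A cache hit short-circuits everything: no retrieval, no generation.
    cached = cache_lookup(query_vec)
    if cached is not None:
        return cached, "cache"

    # 2. Otherwise, fetch matching chunks and let the model answer from them.
    context = retrieve(query_vec, top_k)
    return generate(question, context), "generated"
```

Passing the components in as callables keeps the control flow visible and lets each layer be swapped or stubbed independently.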

Chandan Maruthi [00:02:50]: So, the first thing I'll talk about is data. Any data scientist, any ML practitioner, knows that the first thing you do is you stare at data, and for a long time. So what works: getting data is a solved problem, especially from known systems. So you have Zapier, Prismatic, Paragon, Workato, all these iPaaS systems, which you can embed to get data from existing tools, right? So don't try to build your own scripts unless you need to. Where it becomes interesting is when you're working with enterprises that have been around for like 20, 30 years. They have a lot of legacy systems, and these legacy systems don't have built-in connectors. So you're going to end up writing a bunch of Python scripts to get this data, which is fine, but the caution is you're going to end up with a lot of these scripts, and the person that wrote them is going to change companies or move on to other roles. Nobody knows what the script did, and they don't like how it was written.

Chandan Maruthi [00:03:57]: No two engineers like each other's code. And then the next step is data pipelines. Let me show you some of our data pipelines. We run our pipelines on Prefect. And if you look at the amount of time some of these pipelines take, it's pretty decent, right? So some of these run in minutes, some of these take 50 minutes, some of these take 38 minutes. So you can't expect to be running these, or looking at these, at runtime. You need some kind of orchestration layer to do this. And Prefect is one; you could use other DAG tools like Dagster, Airflow, et cetera.

Chandan Maruthi [00:04:41]: But what they do is they allow you to run these long-running processes overnight and on multiple nodes that you don't have to worry about. A lot of data, even within the enterprise, sits in websites, and so you'll need to have scrapers. Scraping is nondeterministic because you're going after visual information. And so the kind of problems you'll run into is JavaScript or client-side rendered pages. So you try to pull a page and see what's in it. You just see a bunch of JavaScript, there's nothing in there, right? So you need to have some kind of browser emulation on the scraping side to first render the page and then access the content in there. And I don't like Cloudflare, because it is always a pain to try to scrape a company that uses Cloudflare. They have it for good reasons, but it's not friendly for scraping. The other thing that we didn't realize is when you work with devtool companies and they have a command line interface and they have documentation for their CLI, that documentation was not written by a tech writer.

Chandan Maruthi [00:05:50]: That documentation for a CLI was written by an engineer. And they are very, very efficient with words. Language models, like the previous presenter spoke about, don't like brevity, and engineers love brevity. So you're going to pull CLI docs, and per page you're going to get like 20, 30 words. It's not useful. So there are ways to probably solve it. You can expand that using a language model and then use it. But it takes a lot more work to work with very little text. And then stale data. What is the strategy to work with stale data? Timestamps are one, but then websites don't have timestamps, and there's also conflicting data.

Chandan Maruthi [00:06:36]: Like you have two pages that say completely different things, which may be because they were on different versions of the product. And so you need different strategies to handle stale data. So we didn't even get to RAG. These are all the kinds of things you will face when dealing with data. We'll talk about semantic stores. So in the previous diagram, you saw, you take the data, ingest it and put it into a semantic store. So let's talk about what happens in a semantic store. The first thing is semantic stores don't hold a lot of data per record, and so you have to break it down.

Chandan Maruthi [00:07:16]: And when you break it down, you have to chunk it into small parts. So what's the strategy to break down a piece of text into smaller parts? The problem is, let's say you have a long story where it started with how, say, Steve Jobs started Apple and ended up with starting another company. When you break it down, where do you break it? What is the logic of breaking it? Right. The simplest strategy is just taking chunks based on the text size, but then this may break data in semantically irrelevant places. So chunking is a problem. And then you'll end up with orphans, because you will take a large piece of text, break it down, and the last couple of paragraphs, no RAG system will pull them in relationship to the original text, because on their own they had nothing to do with the original topic. Right? So what do you do with orphans? You could always summarize the entire content and add it as a header to every piece of text. But then you're taking away from what you can store.
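Editor's note: a minimal sketch of the two ideas above, assuming paragraphs are separated by blank lines. It packs whole paragraphs into size-limited chunks (so breaks fall on paragraph boundaries rather than mid-sentence) and optionally prefixes every chunk with a document-level summary so trailing "orphan" chunks stay tied to the original topic. The function name `chunk_text` and the `max_chars` budget are illustrative choices, and the summary itself is assumed to come from elsewhere, for example a language model.

```python
def chunk_text(text: str, max_chars: int = 500, header: str = "") -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than
    split mid-sentence. max_chars applies to the chunk body only; the
    optional header is added on top, which is the storage cost the talk
    mentions.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)

    prefix = f"{header}\n\n" if header else ""
    return [prefix + chunk for chunk in chunks]
```

For example, `chunk_text(doc, max_chars=800, header=summary)` would return chunks that each begin with the same summary text.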

Chandan Maruthi [00:08:30]: Embeddings, not that big of a problem. In the early days, you can use standard ada-002 kind of embeddings. We tried fine-tuning embedding models. They are good, but ada-002 is still really good. So just large models for embeddings work well. Vector stores are a commodity today. Pinecone, Weaviate, you'll have 100 vector DBs to work with. Just use one of them.

Chandan Maruthi [00:09:02]: They all work the same. If I'm not wrong, even Postgres and others are getting vector support. Now, the next one's interesting, the algorithm for semantic search, right? So you have cosine, which is the most common algorithm that we have all heard of. But the problem you run into with cosine is if you have ten pages in your website saying the same thing, or something similar, you'll have ten entries in your RAG search which say exactly the same thing. So you will not get to the 11th piece of information, which is the useful one, because if you're only retrieving ten entries, then the first ten were all similar, right? FAISS, on the other hand, has a method to penalize something that's very similar. But FAISS is not implemented in Pinecone, and so you'll have to do your own thing. So this is not a well-solved problem, and I don't see too many people trying to solve it either. But this will have to be solved over time as RAG systems become more enterprise ready.
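Editor's note: the talk does not name a specific algorithm for penalizing near-duplicate results. One widely used technique is maximal marginal relevance (MMR), sketched below as an assumption about what "doing your own thing" could look like. Each candidate is scored by its relevance to the query minus its similarity to results already selected. The function names and the `lambda_` trade-off parameter are illustrative.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    k: int = 10,
    lambda_: float = 0.5,
) -> list[int]:
    """Return indices of up to k candidates chosen by maximal marginal relevance.

    lambda_ = 1.0 is plain relevance ranking; lower values penalize
    candidates that resemble something already selected.
    """
    selected: list[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_ * relevance - (1 - lambda_) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In practice this would typically be applied as a re-ranking step over a larger candidate set (say, the top 50 by cosine) fetched from the vector store.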

Chandan Maruthi [00:10:14]: Finally, what do you do with the data you didn't store in the vector store, right? So you probably need to tag it and go along with Mongo or Elastic to put a bunch of data that you didn't store there, metadata and so on. So those are all the things that work with a semantic store, and things that don't work or that you need to be careful of. Semantic caches are very interesting. What we do is when we get a customer question, we first look at our cache to see if we have something in there. So right now there are like 20,000 human-edited questions and answers in our cache, and if we find a match there, we just present that answer. The reason this is important is when a human edits an AI response with a lot of nuance, they want exactly that answer the next time. They don't want it to be reworded in any way. And the cache helps there.
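Editor's note: a minimal in-memory sketch of a semantic cache along the lines described above. It stores (question embedding, approved answer) pairs and returns the stored answer verbatim when a new question's embedding is similar enough. The class name, the linear scan, and the 0.95 threshold are assumptions for illustration; a production cache would sit on a vector index, and the threshold would need tuning against real traffic.

```python
import math
from typing import Optional, Sequence


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Serve human-approved answers verbatim for sufficiently similar questions."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold  # hypothetical cut-off, not from the talk
        self._entries: list[tuple[list[float], str]] = []

    def put(self, question_vec: Sequence[float], answer: str) -> None:
        self._entries.append((list(question_vec), answer))

    def get(self, question_vec: Sequence[float]) -> Optional[str]:
        best_score, best_answer = 0.0, None
        for vec, answer in self._entries:
            score = _cosine(vec, question_vec)
            if score > best_score:
                best_score, best_answer = score, answer
        # Below the threshold, report a miss so the caller generates an answer.
        return best_answer if best_score >= self.threshold else None
```

Because the answer text is returned unchanged, a human edit is preserved exactly on the next matching question.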

Chandan Maruthi [00:11:11]: We built our own, but there are tools like Portkey or GPTCache that do the same thing. This can be pretty deep. So just be aware, if you go down that path, you will end up building a lot yourself. We built it because we see ourselves doing a lot of precaching. What is precaching? You take a bunch of text, you generate your own questions and your own answers, synthetic Q&A, and that synthetic Q&A you put in the pre-cache, so that when a customer asks a question, you have a lot more answers coming from cache rather than being generated from scratch, which means you're spending less time in delivering those answers. Right. So a typical customer question may take anywhere between six and 15 seconds when generated from scratch.

Chandan Maruthi [00:12:01]: Served from cache, it can be delivered in two to three seconds. And that's significant when people use it in volume. If they use it like 20, 30 times a day, they will notice that difference. Hosted models are a great place to start. We really don't need to think beyond the large language models in the early days, but very quickly you'll start thinking about how to build your own models. And there'll be a lot of scenarios. We have a scenario where we generate synthetic Q&A, which means we have to make thousands of calls in batch every day. Using GPT-3.5 or GPT-4 is very expensive. So using a fine-tuned open source model is a cheaper approach for us.

Chandan Maruthi [00:12:50]: And so we use fine-tuned models as well as off-the-shelf open source models. The thing to think about there is cost and security. Some of these vendors are certified, and some of these vendors may not yet be. So something to think about. Just ignore this third one there. And then fine-tuned models on proprietary data are very interesting. Now we are seeing some very interesting models come out which are fine-tuned on Llama 2 or Mistral and so on. And we are going to see a lot of them out there which are very domain specific or task specific, and those would be very interesting to look at. Finally, I'll talk about evals.

Chandan Maruthi [00:13:32]: And the previous presenter did a great job going really deep on evals. But I'll talk about something sort of net new. The simplest way to look at evals is: did you even give an answer or not, right? And one way to do that is to have the LLM respond in a JSON format saying, hey, did I have an answer? And if I did, this is the answer, right? Human feedback. So we have a training layer that looks like, let's see, collab, fine-tune. But this is our platform, so humans can go in here and they can edit questions. So you can see the AI gave an answer that looked like this, and the human just reduced the text that was there. And they marked some questions as accurate. They marked edited more.
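Editor's note: the "did you even give an answer" check via a JSON-formatted reply might be handled as in the sketch below. The field names `has_answer` and `answer` are hypothetical; the talk only says the model is asked to respond in JSON. Malformed or unexpected output is treated as "no answer" so that the no-answer rate can still be counted.

```python
import json


def parse_llm_reply(raw: str) -> tuple[bool, str]:
    """Parse a reply expected to look like {"has_answer": true, "answer": "..."}.

    Returns (has_answer, answer). Anything malformed counts as no answer.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, ""
    if not isinstance(data, dict):
        return False, ""

    answer = data.get("answer")
    answer = answer.strip() if isinstance(answer, str) else ""

    # Require both an explicit true flag and non-empty answer text.
    if data.get("has_answer") is True and answer:
        return True, answer
    return False, ""
```

Logging the boolean for every request gives the "times where I had no answer" series that can be tracked over time.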

Chandan Maruthi [00:14:23]: So this is like a training or a feedback playground that you can offer. The other thing you can do is look at your RAG context itself. And the way to do that is to look at your RAG context and look at the answer that was generated and ask: was the information in the answer present in the RAG context? And for that you'll use a large language model. It sounds good in theory, it sounds good in retrospect. You can't do this in production, because your answer is going to be ready in 10 seconds. Your evals are going to be ready in 20, 30, 40 seconds. And so who's going to use that eval? Right? So it's more a retrospective thing than a runtime thing. Some of the simplest things you can do to improve AI response quality is look at each question and see what really went on.

Chandan Maruthi [00:15:18]: And so what we do is we have operational dashboards that look like that, and we can look at a single user and see what was going on in their day, and we can see all the questions they asked, and we can see why they made this edit. Right? So this was the original answer, this was the user-edited answer, and then we can go in and see what was the nuance there, what was the change that the user made. And sometimes these are grammatical, sometimes the information itself is different. And the way we try to make sense of that is we have observability built in. And just like the previous presenter talked about, edit distances, Levenshtein, cosine distances, these are things you can measure mathematically. Again, these are retrospective things that help you measure and make the system better over time. It's not something you can do at runtime, but these things really help, because you can see what's going on at any point in time and you can track it on a graph and see: did the edits come down? Did the edit distances come down? Did the times where I had no answer come down? Right, so we can see those kinds of graphs at any point in time. A mathematical way, a quantifiable way, to measure NLP responses is a hard problem, and it's not fully solved. So that's, again, something you'd build when you're building production RAG. And then just bad ingestion, right? A lot of times when I've tried to debug a problem and ask, why did we not do a good job answering this question, it has gone back to: did we ingest it properly? There was something about that data set that was different from most other data sets that we saw. So just better ingestion is a good way to solve the problem, right? Often, data ingestion is the first place to look.
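Editor's note: as one concrete instance of the edit-distance metrics mentioned above, the sketch below computes the Levenshtein distance between the AI's original answer and the human-edited version, plus a length-normalized variant that is easier to chart over time. The normalization (dividing by the longer string's length) is an illustrative choice, not something specified in the talk.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(original: str, edited: str) -> float:
    """0.0 means the human left the answer untouched; 1.0 means a full rewrite."""
    longest = max(len(original), len(edited))
    return levenshtein(original, edited) / longest if longest else 0.0
```

Averaging `normalized_edit_distance` per day across all edited answers yields the kind of "did the edit distances come down" graph described in the talk.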

Chandan Maruthi [00:17:19]: RAG is the second and third place to look. Models are the last one. What would you do if GPT-3.5 or Claude or Llama 13B is bad? It is bad. That's it. You can't do much. What you can do is do everything other than that. And you can fine-tune, but you're not going to get better than large models in certain contexts. And so you do everything you can do other than those things.

Chandan Maruthi [00:17:46]: So, yeah, those are some of the key areas that we think about while we build our production systems. We are constantly thinking about how to offer a system that works for our customers. How do we know when we were successful? How do we know when we failed? And is there a quantifiable way for us to measure it? And is that number improving every day? So that's what we think about. Yeah, I'll be around. If you have any questions, please feel free to ask me.

Q1 [00:18:22]: Open to Q&A. Hi, great presentation. I really like the way that you have described the different steps. My question is about ingestion. Say I have inserted a lot of documents. Right. The point is, the assistant API is very good; you can remove the documents. But if you are building your own kind of semantic vector database, how can you say that I don't want this right now, I want to get rid of this and create it from scratch?

Q1 [00:18:55]: What are different steps where you go into a bad ingestion and then what are the recovery methods?

Chandan Maruthi [00:19:04]: So your question is you ingested a bunch of data and some of it is bad. Now you want to redo it: you just purge the whole thing and then start over. There is a scenario which is slightly different from what you asked, which is that the data ingested is not bad, but may not be relevant for a particular use case. There we have built the idea of multiple AI agents, and so we have different agents. So for instance, if you look at this agent, there's one agent that looks at all these eleven data sources and provides a detailed natural response, whereas we have a different AI which looks at fewer data sources and provides a neutral response. So this is a way to have different AIs that look at different data and do different things, but they are all assuming the data is good. But if the data set itself is bad, what we do essentially is we have a data ingestion layer where we go in, and if you don't like a data set, we kick it out and we just re-ingest the whole thing. Yeah, if you're working with very large data sets, right, millions of rows.

Chandan Maruthi [00:20:15]: Then just like any data engineering problem, you start with a small batch. You make sure that small batch works and you like what you see, and you eyeball the ingestion, the output, and only if you like what you see at a few thousand records, then you do a million records.
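Editor's note: the "small batch first" approach can be sketched as a pilot-then-scale loop. The function name `staged_ingest` and its parameters are hypothetical. The `validate` callable stands in for the human eyeballing step described in the talk; in practice it might be a manual review gate rather than an automated check.

```python
from itertools import islice
from typing import Any, Callable, Iterable


def staged_ingest(
    records: Iterable[Any],
    ingest: Callable[[Any], Any],          # hypothetical: ingest one record, return its output
    validate: Callable[[list], bool],      # hypothetical: approve or reject the pilot output
    pilot_size: int = 1000,
) -> int:
    """Ingest a pilot batch, validate it, and only then ingest the rest.

    Returns the total number of records ingested. Raises ValueError if the
    pilot output is rejected, leaving the remaining records untouched.
    """
    iterator = iter(records)

    # Stage 1: run a small sample through the real ingestion path.
    pilot = list(islice(iterator, pilot_size))
    pilot_output = [ingest(record) for record in pilot]
    if not validate(pilot_output):
        raise ValueError("pilot batch failed validation; fix ingestion before scaling up")

    # Stage 2: the sample looked right, so process everything else.
    count = len(pilot)
    for record in iterator:
        ingest(record)
        count += 1
    return count
```

If validation fails, only the pilot records need to be purged and re-ingested, which is far cheaper than redoing millions of rows.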

Q1 [00:20:32]: Just a follow-up question. In the basic process of getting evals up to the mark, you start with prompt engineering. I saw one of your slides, but can you just highlight, before attempting RAG, what prompt engineering techniques we can use, like adding some context within some token size, or just playing around? Can you add a couple of points over there?

Chandan Maruthi [00:20:57]: Yeah, a lot of these methods are not mutually exclusive. In fact, if you see, they're all built on top of each other, right? The most basic thing is you just play with LLMs and you get a sense of how they work. But then you realize that people have got really good at making it work in a certain way because they have used different prompting methods, and that is non-trivial. You can't expect end users to do that. So there will be examples or libraries out there that show how to do it right, and they will be different for different models. What you use for a GPT model will be different from what you use for a Llama or Mistral. So you'll have to depend on looking at what works. And then context sizes are definitely a problem, right? Every model is restricted to a certain context size.

Chandan Maruthi [00:21:50]: We run into that problem all the time. One way to solve a context size problem is to summarize data, and summarization is cheap. Today, even a Llama 7B can summarize pretty deterministically, pretty well. So you can use a small model to summarize and then a large model to respond. So the question was, at what stage do you decide to fine-tune? And the answer is, well, fine-tuning is not a solution for most of the problems. In fact, what we have seen is fine-tuning is not good for deterministic responses. Fine-tuning is not good for data. Fine-tuning is good for behavior.

Chandan Maruthi [00:22:44]: So if you want to have a model that is giving JSON responses or giving latitude/longitude responses, that's a great use case to build a fine-tuned model that just responds in a particular way. But fine-tuning for knowledge does not work, it just hallucinates. Fine-tuning for behavior works. The decision to fine-tune is not on a progression scale, but more on a use case scale. You'll still continue to use a large model. You'll use RAG for deterministic responses. As far as we've seen, the only way to get deterministic responses is from RAG. The only way to get complex reasoning is through larger language models, right? And so fine-tuning is not a progression in the journey.

Chandan Maruthi [00:23:33]: It's a separate problem. You use fine tuning to do something an existing model does not.
