MLOps Community

Navigating through Retrieval Evaluation to demystify LLM Wonderland // Atita Arora // AI in Production

Posted Feb 18, 2024 | Views 795
# LLM
# Evaluation
# AI
# ML
SPEAKERS
Atita Arora
Developer Relations Manager @ Qdrant

Atita Arora is a seasoned and esteemed professional in information retrieval systems who has decoded complex business challenges and pioneered innovative information retrieval solutions over her 15-year journey as a solution architect, search relevance strategist, and individual contributor. She has a robust background built on impactful contributions as a committer on various information retrieval projects, and a keen interest in making revolutionary tech innovations accessible and implementable to solve real-world problems. She is currently immersed in researching the evaluation of RAG systems while navigating the world of vectors and LLMs, seeking to uncover insights that can enhance their practical applications and effectiveness.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in machine learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.

SUMMARY

This session discusses the pivotal role of retrieval evaluation in Large Language Model (LLM)-based applications like RAG, emphasizing its direct impact on the quality of responses generated. We explore the correlation between retrieval accuracy and answer quality, highlighting the significance of meticulous evaluation methodologies. - Atita Arora

TRANSCRIPT

AI in Production

Navigating through Retrieval Evaluation to demystify LLM Wonderland

Slides: https://docs.google.com/presentation/d/1KZiAM0gon20iwszYS4ZUcX9ijv3N0KkS/edit?usp=drive_link&ouid=112799246631496397138&rtpof=true&sd=true

Demetrios [00:00:01]: Next up, we've got none other than Atita coming at us straight from Qdrant talking about those RAGs. What's happening, Atita, how you doing?

Atita Arora [00:00:21]: I think I'm pretty good, and that was a great song. I see that you wrote it especially for the conference.

Demetrios [00:00:28]: Conference that was in real time. We are talking real time happening and so some of it may have come out a little bit off, but think was all right.

Atita Arora [00:00:43]: Pretty good. I have heard you play before, but this was unexpected.

Demetrios [00:00:50]: You've had the displeasure maybe, or pleasure? Whenever.

Atita Arora [00:00:54]: Totally a pleasure.

Demetrios [00:00:58]: Incredible. Well, I know that you've got an excellent talk for us coming up. We have ten-minute talks now; the next three talks are going to be lightning talks, and that means ten minutes each. We encourage everyone to drop questions in the chat if you want. Atita, I know that you are also watching us from the platform, so you can answer those questions in the chat after the talk. But I just want to let everybody know that we're going to be going bing boom boom boom boom boom boom boom real quick and get it all rocking. So without further ado, I see you shared your screen. I'm going to put it on there and put ten minutes on the clock. I will see you soon.

Atita Arora [00:01:43]: Okay, so, thank you so much. Today my topic is navigating through retrieval evaluation in the world of large language models. I know that this is all about AI in production, but we're taking a little detour, and I'll tell you why. But before that, a short introduction. My name is Atita Arora and I work with Qdrant as a solution architect and as a developer relations manager. If you haven't noticed yet, my name is a palindrome, the complete name, which is why you will not forget me. A short promo about the company that I'm representing here: if you haven't tried Qdrant before, it's the most loved vector search database.

Atita Arora [00:02:31]: And if these stats haven't convinced you, I might as well again say that it's a performance centric, scalability oriented and resource optimization focused vector database. If you haven't tried it as of yet, do give it a shot and you can scan this QR code to get your free instance today. Moving on. So, as I said, that a lot of people might be wondering that the event is supposed to be AI in production. Like, we shouldn't really be talking about retrieval. That's like so 2022. But you also would also realize that artificial intelligence is becoming an integral part of our lives at every stage and of our daily interactions as well with the technology. So whether we are shopping online or building reports, or using assistants, or even while watching films and listening to music, I mean, the key enabler of these sophisticated application is none other than information retrieval in the first step, and which is why the quality of the information retrieved has a direct impact on our choices, like what item are you going to buy next, how are you going to perform your work next? Or even our mood, because that's impacting the videos and audio that we're listening to.

Atita Arora [00:03:48]: Also, one of the key problems here is that the data is growing every single day. I mean, structured and unstructured data, catalog data, anything that a user could search through is growing and growing. And those models are trained on user behavior data as well; that's also growing. And let's not forget the media: images, videos, audio, and whatnot. There may be several other things that I might be forgetting in terms of the modalities. So yes, that's something we should not forget either, which is why measuring relevance and retrieval quality becomes of prime importance. It becomes much more important than before, because the primary challenge right now is to refine this huge volume of information and present the user with only the relevant information, which influences the behavior of the user.

Atita Arora [00:04:46]: And these factors basically are very subjective, very contextual, and also temporal. So we might also see that we already know about certain metrics that are used to evaluate the retrieval quality of our information, like precision recall, harmonic mean of precision and recall. Like f one scores average precision. That's like something really basic that the users usually start with. And if your use case require more like ranking the relevant data onto the top, or the ranking matters just so. In all the use cases, DCG, NDCG and MRR are the metrics to go forward to. I think we also know about the similarity score having been introduced into the vector word. And yes, last but not least is also the human evaluation, which is super expensive and which is kind of reserved for the last stage.

Atita Arora [00:05:42]: I'm sure a lot of people have already mentioned retrieval augmented generation in the presentations before; I've been following the event as well. Over here I show a small implementation of RAG, and if you notice, it is called RAG because it has three key components: retrieval, where we retrieve the relevant information from our knowledge store (Qdrant in this case); augmentation, where we augment our prompt with that information; and generation, where we query our large language model. So please understand that the key piece here is retrieval, which is why it is very important to pay attention to retrieval quality, because after all, if you retrieve garbage, you're going to get hallucinations. And I think that's what the song that Demetrios had composed was all about. Your RAG system is only as good as your retrieval. Because I just have ten minutes, I'm trying to make the most of my time, which is why I will not go into the complete depth of this experiment here. But I've also attached the link if you want to go on and read more about it.
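
To make those three components concrete, here is a minimal sketch of that loop, assuming a running Qdrant instance with a "docs" collection; the embed() and llm() helpers are hypothetical placeholders, not from the talk:

```python
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")  # assumes a local Qdrant instance

def embed(text: str) -> list[float]:
    """Hypothetical embedder; in practice, a sentence-transformer or an API call."""
    raise NotImplementedError

def llm(prompt: str) -> str:
    """Hypothetical LLM call; in practice, your model provider of choice."""
    raise NotImplementedError

def answer(question: str) -> str:
    # Retrieval: fetch the nearest neighbours from the knowledge store.
    hits = client.search(
        collection_name="docs",        # assumed collection name
        query_vector=embed(question),
        limit=3,
    )
    context = "\n".join(hit.payload["text"] for hit in hits)

    # Augmentation: add the retrieved passages to the prompt.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"

    # Generation: the answer is only as good as the retrieved context above.
    return llm(prompt)
```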

Atita Arora [00:06:46]: But in short, I would say that if your retrieval is bad, you ought to have very bad responses as well, which is exemplified by this snapshot here. And on the other side, if your retrieval is good, you're going to get good responses from the LLM as well. So obviously that is not the only way the RAC could be implemented. And there are again several resources available in which we talk about different formats, different shades of Rag, like naive Rag, the advanced rag and modular Rag. And while we are helping a lot of customer at quadrant implement different formats and different data structures which are ingested into quadrant to get the most sophisticated format of RAG, we have figured out that there are several ways to improve your retrieval quality by improving your chunking strategy. I think that's where usually you should begin. If you have resources available, you could also try model fine tuning that definitely improves the retrieval quality as well. Some of the other techniques that have been tried before is query rewriting and augmented adapted retriever, and this is the paper which is attached to the resources as well.

Atita Arora [00:08:03]: I think once you have the access to the slides you would be able to see them as well. That not only are we experimenting with models and chunking strategies and query rewriting, we are also changing the way we are retrieving and interacting with the vector databases, which is super amazing to see how that's impacting the retrieval quality. And yes, I haven't forgotten about the media content that I spoke about earlier. For the images, we usually rely on the metadata based retrieval, and one of the other things that has been key is the content based image retrieval as well, which is the technique used in search and retrieval of the images from the collection based on their visual content, rather than relying on their text based metadata or annotations. So CBIR is the system that analyzes the visual features of the images, such as color, texture, shape and spatial arrangement, to represent and index images in the way that allows for the efficient retrieval and for the audio. Most of our customers are relying on the whisper based transcript generated and processing them as text retrieval. And for video, it's kind of a little bit tricky because it's the combination of audio and the images. So which is why we rely on the techniques which combine both of these approaches to bring best of both world.

Atita Arora [00:09:26]: The metrics used are map, again mean average precision, which relies on existing judgment data similarity score and AUC RoC. If you don't know, it's more of the precision recall area under curve and receiver operated characteristic curve. So the idea here is that AUC quantifies overall classifier performance, that higher values indicate better performance and equality, so maximum of one is obviously preferred. It's perfect and 0.5 for the random when the model is not able to decide if it is sureshot this thing or not. And the RoC curve is basically graphical plot between sensitivity, which is true positive rate, and specificity, which is false positive rate. And this is basically more like a binary classification which is used to identify if the image or the visual content is relevant or not. So apart from the metrics that I've discussed, and given the limited time that I have, I would like to quickly talk about the challenges. So even then we have the evaluation mechanism.

Atita Arora [00:10:36]: There are still some outstanding challenges like training and testing data biases. I think bias becomes really critical here because if the testing data has biases and it is missing that diversity in the test data set, the models are, and the retrieval process itself becomes very biased. Again, we're talking about a lot of models supporting big context size or the context length. I think that gives birth to the lost in the middle problem that's also attached to the resources. Once you have access to the slides, you would be able to see them. And I know it's obviously something that all the solutions have been targeting for is like personalizing responses for the given user. But too much personalization tends to throw a user into echo chamber or filter bubble. For example, if the valuation data set consists of predominantly similar or redundant documents, the assessment metric may reflect the system's ability to retrieve those specific types of documents well, but it fails to capture its performance in the broader range of content.

Atita Arora [00:11:41]: So similarly, the evaluation queries are overly specific or focused on a particular domain. The assessment may not adequately reflect the system's effectiveness in retrieving diverse and relevant information across different contexts. So I guess that was about it. If this content interested you and the presentation interested you and you would like to connect with me, please make sure you send in your LinkedIn invite and do try quadrant. Thank you.

Demetrios [00:12:12]: Excellent, Atita, awesome stuff. Thank you so much for that. I know RAG evaluation and retrieval is huge on so many people's minds right now, so I think it's very topical that you chatted about it. I am going to keep this moving. I encourage anyone that is hot on the retrieval evaluation topic to reach out to Atita and see what's up. For now, we'll be saying bye, and I will see you soon.

