MLOps Community

Benchmarking LLM performance with LangChain Auto-Evaluator

Posted Jul 06, 2023 | Views 821
# LLM in Production
# Auto-Evaluator
# LangChain
# langchain.com
Lance Martin
Software engineer @ LangChain

Lance recently joined LangChain after several years working on vision systems for self-driving bots, trucks, and cars at Nuro, Ike, and Uber ATG. He has authored a few open-source packages, including the Auto-Evaluator (https://github.com/rlancemartin/auto-evaluator), which is hosted at https://autoevaluator.langchain.com/.

SUMMARY

Document question-answering is a popular LLM use case. LangChain makes it easy to assemble LLM components (e.g., models and retrievers) into chains that support question-answering. But it is not always obvious how to (1) evaluate answer quality and (2) use that evaluation to guide improved QA chain settings (e.g., chunk size, number of retrieved documents) or components (e.g., model or retriever choice). We recently released an open-source, hosted app to address these limitations (see blog post here). We have used it to compare the performance of various retrieval methods, including Anthropic's 100k context length model (blog post here). This talk will discuss our results and future plans.

TRANSCRIPT

Link to slides

All right. Next on the agenda, running one talk behind, is Lance. Lance, thank you so much for joining us, and thank you for your patience. Hello, I'm here. Hello. We're talking about benchmarking LLM performance with the LangChain Auto-Evaluator. That's right. Awesome. Can you share my screen? Yep. Here you go. All righty.

Thank you. Well, thanks for having me, and I really enjoyed the talks today. I'll talk a bit about LangChain, which is a framework for building language model applications. One of the popular paradigms here is retrieval-augmented LLMs. This is where we want to retrieve documents from some source,

put them in the working memory of an LLM, and produce answers. LangChain has integrations across this whole flow: over 80 integrations for document loading, over 30 for storage (different vector stores), over 30 for retrieval, and over 40 different LLMs. I'll walk you through how you can use it to build an application and evaluate it with some of our tools.

So let's take an application where we want to build a chat app from a set of YouTube URLs. We can actually get the documents, or text, from a set of URLs in really only three or four lines of code using one of our document loaders. I show this here, and I'll provide slides and links later so you can dig into this yourself.
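For illustration, a minimal sketch of what that loading step can look like with LangChain's YoutubeLoader (the talk doesn't show its exact code; this loader requires the youtube-transcript-api package, and the URL is a placeholder):

```python
from langchain.document_loaders import YoutubeLoader

urls = [
    "https://www.youtube.com/watch?v=...",  # your video URLs here
]

docs = []
for url in urls:
    loader = YoutubeLoader.from_youtube_url(url)
    docs.extend(loader.load())  # one Document per video transcript
```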

But it's very easy to go from any list of YouTube URLs or playlists to a set of documents. And when you have this, you're confronted with a challenge: how do you set all these parameters? The split size, the split overlap, the split method, the retriever, the language model. So we actually have an application for this called the Auto-Evaluator. It's a free-to-use hosted application, and it's also open source. You can input your documents of choice and your settings, and it will evaluate them against an eval set that you provide or an auto-generated test set of question-answer pairs drawn from your documents.
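For concreteness, the split-related knobs map to ordinary LangChain components. A minimal sketch, assuming the `docs` list from the loader above (the specific values are illustrative, not the app's defaults):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split size and overlap are two of the parameters the Auto-Evaluator varies
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters per chunk ("split size")
    chunk_overlap=100,  # characters shared between adjacent chunks
)
chunks = splitter.split_documents(docs)
```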

This shows you what it looks like; I share the link down there. It has user settings on this side, so it's kind of a playground environment. You can select all sorts of different settings: different chunk sizes, different models, different retriever methods, different numbers of docs to retrieve.

Again, you can actually just upload your documents of interest and an eval set, run a bunch of experiments, and in this nice UI, down here, it logs your experiments. You probably can't see the details of this particular run, but that's okay; it's showing you the big idea.

It's a nice little playground where you can upload any documents and quickly evaluate different chain parameters, and behind the scenes LangChain will build the chain for you and evaluate it. For our example application, we took all the lectures from Karpathy's recent LLM course.

We input them into our app and test a bunch of different settings. Here, say, I want to try different chunk sizes and different LLMs to synthesize answers. The visualization shows you which settings gave better quality versus latency, and you can see that in this particular case, smaller chunks were a little bit better.

You can also see that open source models like Vicuna lagged GPT-3.5 and GPT-4 a little bit in this case, while GPT-4 and GPT-3.5 performed about the same. In any case, this shows generally how to use this application for really any question-answering application you'd want to build.
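Behind the scenes, the chain being evaluated is roughly a standard LangChain retrieval QA chain. A minimal sketch, assuming the `chunks` from the splitter above, an OpenAI API key in the environment, and FAISS as an example vector store (the app supports several):

```python
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Embed the chunks and index them in a vector store
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Assemble a QA chain: retrieve k chunks, then synthesize an answer
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
)
answer = qa.run("What is backpropagation?")  # hypothetical eval question
```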

So again, use the Auto-Evaluator to select your settings in a no-code manner; very easy. Then we provide some nice templates. LangChain works very nicely with, for example, FastAPI: you can set up a streaming backend and connect it to any front-end framework you like, for example Vercel, as in the app shown here.

This is it running, and it'll stream answers; LangChain is powering all of this. Again, we used the Auto-Evaluator itself to choose our best settings before we went and implemented them. Since building your vector DB is somewhat costly, it's nice to sanity-check things before you go to all the effort of building the full vector DB.
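As a rough illustration of the streaming-backend pattern (not the template's actual code), here is a minimal FastAPI endpoint that streams tokens via a LangChain callback handler; the route name and query parameter are assumptions:

```python
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from langchain.callbacks import AsyncIteratorCallbackHandler
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

app = FastAPI()

@app.get("/chat")  # hypothetical route
async def chat(question: str):
    handler = AsyncIteratorCallbackHandler()
    llm = ChatOpenAI(streaming=True, callbacks=[handler], temperature=0)

    async def token_stream():
        # Kick off generation in the background; tokens arrive via the handler
        task = asyncio.create_task(
            llm.agenerate([[HumanMessage(content=question)]])
        )
        async for token in handler.aiter():
            yield token
        await task

    return StreamingResponse(token_stream(), media_type="text/plain")
```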

I'll share a few other learnings we've had with this Auto-Evaluator app. This slide's a little bit busy, but the main thing I'm trying to communicate is that there are lots of different ways to approach retrieval in these retrieval-augmented generation applications. Of course, there's lexical, statistical search.

Of course, there's semantic search, which many are very familiar with. There are also some newer methods that combine semantic search with metadata filtering; the self-query retriever is one very interesting one in LangChain, and I'll talk about that in a little bit.
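A minimal sketch of the self-query retriever, which uses an LLM to turn a natural-language question into a semantic query plus a metadata filter. It assumes an existing `vectorstore`, and the metadata fields here are hypothetical:

```python
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever

# Describe the metadata fields so the LLM can construct filters
metadata_field_info = [
    AttributeInfo(name="episode_id", description="Podcast episode number", type="integer"),
    AttributeInfo(name="guest", description="Name of the episode guest", type="string"),
]

retriever = SelfQueryRetriever.from_llm(
    llm=OpenAI(temperature=0),
    vectorstore=vectorstore,  # assumes an existing vector store
    document_contents="Podcast episode transcripts",
    metadata_field_info=metadata_field_info,
)
docs = retriever.get_relevant_documents("What did Elon say in episode 252?")
```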

One other thing I'll highlight is newer models with very large context sizes, like Anthropic's 100k model: you can actually stuff documents in together. I also want to highlight post-processing; Cohere's Rerank is a very interesting way of doing that. We've integrated a lot of these options with the Auto-Evaluator.
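A sketch of the Cohere Rerank post-processing step via LangChain's contextual compression retriever, assuming an existing `vectorstore` and a Cohere API key in the environment:

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

# Over-retrieve with the base retriever, then keep the top reranked docs
reranker = CohereRerank(top_n=3)
retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
)
docs = retriever.get_relevant_documents("How does GPT-3 do few-shot learning?")
```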

Let me share a few little insights. One thing I found very interesting is that Anthropic's 100k model is actually really good. This is inputting the entire GPT-3 paper, a 75-page PDF. I have a test set of, in this case, five questions, and it performs as well as GPT-3.5 and GPT-4 with an independent vector store retriever. So again, this is taking that whole 75-page PDF, putting it into the working memory of the model, and just asking the model questions about it. It's very impressive.

Of course, it's slower, but it's an interesting thing to highlight: if you have a single-document use case, maybe you can get away with using one of these larger models and putting the entire document in, without worrying about building an independent vector DB.
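A sketch of that "no vector DB" pattern: load the PDF and stuff all of its pages into a large-context model with a stuff chain. The file path and the Anthropic model name are assumptions; use whichever 100k-context model is available to you:

```python
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatAnthropic
from langchain.document_loaders import PyPDFLoader

# Load the whole paper; no splitting, no vector store
pages = PyPDFLoader("gpt3_paper.pdf").load()  # hypothetical local path

# "stuff" puts every page into a single prompt, relying on the long context
chain = load_qa_chain(ChatAnthropic(model="claude-instant-1"), chain_type="stuff")
answer = chain.run(input_documents=pages, question="How many parameters does GPT-3 have?")
```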

Another thing I'll highlight: open source models are very much worth considering. MosaicML's MPT-7B, for example, is quite fast; kudos to them on the inference work there. I've seen very good results from Vicuna as well, which we host on Replicate. You can play with both of these in the Auto-Evaluator, and they're very much worth considering, particularly for retrieval-augmented generation: you may not need a massive model, and some of these smaller open-source models can be sufficient.
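A sketch of calling a Replicate-hosted open-source model through LangChain; the model slug and version hash are placeholders to look up on replicate.com:

```python
from langchain.llms import Replicate

# Model slug format is "owner/name:version"; the version hash below is a
# placeholder, look up the current one on replicate.com
llm = Replicate(model="replicate/vicuna-13b:<version-hash>")
print(llm("In one sentence, what is retrieval-augmented generation?"))
```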

The final thing I'll wrap with: metadata filtering is a very interesting topic. We have an application linked here, part of the Auto-Evaluator, where you can test different metadata filtering schemes. It can be a little bit tricky. This is a case where I had an app built for the entire Lex Fridman podcast. If you want to ask a question like "What did Elon say in episode 252?", I want to do a metadata filter for that. In some cases, your metadata is actually stored using a different format, which is non-obvious.

In that case, something like the self-query retriever might fail, and other libraries, like Kor, may be more useful; I provide a bunch of examples in the notes here, and you can dig into that in my slides. But that provides an alternative, and we provide a nice way to evaluate different metadata filtering schemes, which can be quite tricky, because it's often hard to infer the metadata filter from the natural language query.
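One way this failure shows up: if episode numbers were stored as strings, a filter built with the "obvious" integer type silently matches nothing. A hypothetical sketch with a Chroma-style `filter` argument (the field name and stored types are assumptions):

```python
# Assumes an existing Chroma `vectorstore` whose documents carry an
# "episode_id" metadata field (hypothetical name).

# Misses if episodes were stored as strings rather than integers...
docs = vectorstore.similarity_search(
    "What did Elon say?", filter={"episode_id": 252}
)

# ...while matching the stored format works.
docs = vectorstore.similarity_search(
    "What did Elon say?", filter={"episode_id": "252"}
)
```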

So, in summary: LangChain is a framework for building and evaluating LLM applications, with many integrations, over 80 for document loading, over 30 for storage and retrieval, and over 40 different LLMs. We have tools for evaluation and prototyping. We have app templates, like Karpathy-GPT, that show how to build streaming applications with LangChain and integrate them with different front ends, like Vercel.

We have support for Python and JavaScript. Everything is free to use, and everything is open source. That's probably about it; please reach out to me with any questions, and thanks for the opportunity to speak.

Awesome. Thank you so much, Lance. I really appreciate it. We'll send out the slides and the recordings to everybody who's tuned in a little bit later. Thank you so much. Cool. Thank you. All right. See you later.
