MLOps Community

A Systematic Approach to Improve Your AI Powered Applications

Posted Aug 08, 2024 | Views 73
# LLMs
# AI
# Scale 3
Karthik Kalyanaraman
Co-founder and CTO @ Langtrace AI - A Scale3 Labs Product

Karthik Kalyanaraman is the Co-founder and CTO of Langtrace AI/Scale3 Labs. With a decade of engineering and leadership experience, Karthik has previously worked on and contributed to several projects at companies such as Coinbase, HP, and VMware. He possesses extensive expertise in software design, observability, and infrastructure. Additionally, Karthik is an active member of the OpenTelemetry working group, where he plays a key role in developing semantic naming conventions for the GenAI stack.

SUMMARY

LLM-powered applications unlock capabilities previously unattainable. While they may seem magical, LLMs are essentially black boxes controlled by a few hyperparameters, which can lead to unreliability, such as hallucinations and undesirable responses. To maximize the benefits of LLMs and deliver high-quality user experiences, it's essential to implement a system for regular monitoring, measurement, and evaluation. In this talk, I will present a straightforward approach for developing a minimal system that helps application developers continuously monitor and evaluate the performance of their applications.

TRANSCRIPT

Slides: https://docs.google.com/presentation/d/1hFdU1CmbW8GdR8XK6BKVSW1vt0GXWCIB5gF35kkN3u8/edit?usp=sharing

KARTHIK KALYANARAMAN [00:00:00]: Yeah. My name is Karthik. I am one of the co-founders and core maintainers of Langtrace. Langtrace is an open source observability product for tracing LLM applications. So what is Langtrace again? Langtrace is an open source platform for running observability and evaluations on LLM applications. Essentially, Langtrace helps you transform your AI apps from a shiny demo into a reliable and accurate application that delights your customers. About us: I'm one of the co-founders. My other co-founder is over there.

KARTHIK KALYANARAMAN [00:00:41]: We are a small startup team based in the Bay Area. Okay, let's set the context. The number one challenge for enterprises today with respect to adopting AI-based applications is accuracy, as you can see from all these headlines from different sources. This is the number one reason why enterprises are skeptical about adopting AI, even though most of them see a clear business use case. The biggest blocker at the moment is: how can I get the confidence to deploy my AI-based application to my customers? The good news is, with the right set of tools and the right process, you can iterate your way to good accuracy and a reliable system. But the key thing to realize is that these are probabilistic systems. You cannot just reproduce, fix, and iterate your way to 100% reliability. After you deploy your application, you need a closed-loop feedback system to understand how to improve it and take it to a much more reliable and accurate state.

KARTHIK KALYANARAMAN [00:01:59]: And that is what I'm going to talk about in the upcoming slides. Typically, when an enterprise starts out, they do some prompt engineering: they identify the right prompt, the right model, and the hyperparameter settings for the model, they identify a use case, they deploy to production, and you have a nice and shiny little demo that goes viral on Twitter and LinkedIn. And then what? You have no idea whether your users are finding value out of it, or whether it's accurate. Let's say you are building a customer support chatbot agent. You have no clue how to improve it, and no idea what the accuracy looks like. You probably did some eyeballing and vibe checking before you deployed it to production, but now your customers are interacting with it. So how do I go from this place to a place where it truly delivers value to your customers? The first step is to start aggressively tracing all the interactions in production.

KARTHIK KALYANARAMAN [00:03:08]: This is important, and it needs to be done at multiple layers, not just at the LLM layer. Let's say you're running a RAG pipeline: you need to trace the vector DB retrievals, and if you're using frameworks, you need to trace those too. Frameworks are notorious for prompt front-loading, where a lot of prompt content gets added before it hits the LLM inference endpoint. So you need to trace the entire end-to-end flow, and make sure your retrieval is also traced, because your RAG application is only as good as your retrieval pipeline. Once you start tracing, the next step is to manually evaluate. You need to look at your data; that is table stakes. Of course it's not a sustainable approach, but initially you need to look at your data deeply to understand your use case. And the nice side effect is that you can curate a golden dataset out of it, which you can use for regression testing in the future.
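
To make the multi-layer tracing idea concrete, here is a minimal sketch using plain OpenTelemetry spans around a RAG pipeline. The `retrieve_chunks` and `call_llm` functions are stand-ins for your own vector DB and LLM client code, and the console exporter is only for illustration; a real setup would export to Langtrace or another OTLP-compatible backend.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print spans to the console for illustration.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag-pipeline")

def retrieve_chunks(question: str) -> list[str]:
    return ["relevant chunk for: " + question]  # stand-in for a vector DB query

def call_llm(question: str, chunks: list[str]) -> str:
    return f"Answer grounded in {len(chunks)} retrieved chunks."  # stand-in for an LLM call

def answer(question: str) -> str:
    # One root span per request, with child spans for each layer of the pipeline.
    with tracer.start_as_current_span("rag.request") as root:
        root.set_attribute("rag.question", question)
        with tracer.start_as_current_span("vectordb.retrieve") as retrieval_span:
            chunks = retrieve_chunks(question)
            retrieval_span.set_attribute("retrieval.num_chunks", len(chunks))
        with tracer.start_as_current_span("llm.generate") as llm_span:
            response = call_llm(question, chunks)
            llm_span.set_attribute("llm.response_length", len(response))
        return response

print(answer("How do I reset my password?"))
```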

KARTHIK KALYANARAMAN [00:04:08]: Once you start manually evaluating and curating, you will get a baseline for how your application is doing in production. The next step is to use the curated golden dataset to set up automated evals. Automated evals can be as simple as unit tests that do keyword-based assertions, or they can be LLM-based evals, where LLMs evaluate the responses of other LLMs; maybe use a powerful LLM like GPT-4o to evaluate the responses of Mistral. You can set it up with pytest; it can be as simple as that. The important thing is to set this up so that the next time you iterate on your application, before you deploy, you run it against the golden dataset and make sure your application has not regressed. That gives you the confidence to deploy once again. Once you set that up, you should completely avoid the earlier pattern of prompt engineering and deploying directly.
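
As a sketch of the "evals as unit tests" idea: the snippet below assumes a `golden.jsonl` file with one `{"question", "expected_keywords"}` record per line, and a `run_app` function standing in for your real application entry point. An LLM-as-judge eval would replace the keyword assertion with a call to a stronger model that scores the response.

```python
import json
import pytest

def load_golden_dataset(path: str = "golden.jsonl") -> list[dict]:
    # Each line: {"question": "...", "expected_keywords": ["...", "..."]}
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def run_app(question: str) -> str:
    # Stand-in for your real application entry point (RAG pipeline, agent, etc.).
    return "To reset your password, open Settings and click 'Reset password'."

@pytest.mark.parametrize("case", load_golden_dataset())
def test_no_regression_on_golden_dataset(case: dict) -> None:
    # Simplest possible eval: keyword-based assertions on the response.
    response = run_app(case["question"]).lower()
    for keyword in case["expected_keywords"]:
        assert keyword.lower() in response, f"missing '{keyword}' for: {case['question']}"
```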

KARTHIK KALYANARAMAN [00:05:12]: You need to run it through the automated evals and then deploy. Run it in your CI/CD pipeline, so you can hook it up to CircleCI, Jenkins, GitHub Actions, whatever you use. And you also need to set up alerting so that if the newly iterated version of your application regresses, it doesn't go to production. This is the feedback system you need to establish to iterate your way to over 90% accuracy, and it's going to be very specific to the use case you are dealing with. In the upcoming slides I'm going to show how Langtrace can help you establish this. Like I said, Langtrace is open source and OpenTelemetry-based. You can self-host and run it, or you can use the cloud-hosted version that we offer. And the setup is literally two lines of code.
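
One way to wire the "don't ship a regression" gate into CI is a small script that runs after the evals and fails the build if the pass rate drops below a baseline. The `eval_results.json` file format and the 90% threshold here are illustrative assumptions; the non-zero exit is what CircleCI, Jenkins, or GitHub Actions would treat as a failed step, blocking the deploy.

```python
import json
import sys

BASELINE_PASS_RATE = 0.90  # assumed target; set this from your measured baseline

def main() -> None:
    # Assumed format: a list of {"case": "...", "passed": true/false} records
    # produced by the automated eval run.
    with open("eval_results.json") as f:
        results = json.load(f)
    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    print(f"Eval pass rate: {pass_rate:.1%} (baseline {BASELINE_PASS_RATE:.0%})")
    if pass_rate < BASELINE_PASS_RATE:
        # Non-zero exit fails the pipeline, blocking the deploy step.
        sys.exit(1)

if __name__ == "__main__":
    main()
```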

KARTHIK KALYANARAMAN [00:06:09]: You can just install the SDK, generate an API key from Langtrace, and it's just two lines of code. We instantly start tracing all the layers: framework, LLM, and vector DB. So I'll run through each of the tabs in Langtrace. This is the metrics tab. It gives you a quick overview of what your metrics look like, mostly around cost and latency. The main tab over here is the traces tab. Here, as you can see, this is a CrewAI agent being traced. We have native support for CrewAI.
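
For reference, the two-line setup with the Langtrace Python SDK looked roughly like this at the time of the talk; check the current Langtrace docs for the exact package name and signature.

```python
# pip install langtrace-python-sdk
from langtrace_python_sdk import langtrace

# Initialize before importing your LLM, framework, or vector DB clients so
# the automatic instrumentation can wrap them.
langtrace.init(api_key="<LANGTRACE_API_KEY>")  # key generated in the Langtrace dashboard
```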

KARTHIK KALYANARAMAN [00:06:52]: You can clearly see there is a crew kickoff, there is a task being executed, and then there is an agent that executes that task. Under the hood, CrewAI uses LangChain, and finally it hits the OpenAI inference endpoint. So with just two lines of code you get these deep, high-cardinality traces. With this you can, first of all, identify the bottlenecks in your system, but at the same time you also get a blueprint of your model settings and your hyperparameter settings. And not just that, even at the framework level, you can see what kind of settings you have set up; for instance, CrewAI has a bunch of settings.

KARTHIK KALYANARAMAN [00:07:32]: All of those things get captured. In addition to that, you can start manually annotating. We have an annotations tab where all the LLM request/response pairs get automatically captured, so you can start annotating. You can create tests with custom scales. You can start looking at your data immediately, as soon as you take it to production. And once you start looking at your data, let's say you captured 100 traces, 50 of them are good and 50 of them are bad, which means your accuracy is at 50%. Now you can take the 50 good ones, create a golden dataset, and use that dataset to establish evaluations.
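
As a rough sketch of that curation step, assuming you have exported your annotated traces as a list of records with a human good/bad label, the baseline accuracy and the golden dataset fall out directly:

```python
import json

# Assumed shape of exported, manually annotated traces.
annotated_traces = [
    {"question": "How do I reset my password?", "response": "...", "label": "good"},
    {"question": "Cancel my subscription", "response": "...", "label": "bad"},
    # ... the rest of the 100 annotated production traces
]

good = [t for t in annotated_traces if t["label"] == "good"]
baseline_accuracy = len(good) / len(annotated_traces)
print(f"Baseline accuracy: {baseline_accuracy:.0%}")  # 50 good out of 100 -> 50%

# Persist the good interactions as a golden dataset for future regression evals.
with open("golden.jsonl", "w") as f:
    for t in good:
        record = {
            "question": t["question"],
            "reference_response": t["response"],
            # Filled in during curation; used by keyword-based evals.
            "expected_keywords": [],
        }
        f.write(json.dumps(record) + "\n")
```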

KARTHIK KALYANARAMAN [00:08:13]: You can run automatic evaluations using LLMs or unit tests, and capture the reports directly in Langtrace. You can do comparative analysis between GPT-3.5, GPT-4, or other models. In addition to that, you can also version and manage all your prompts within Langtrace. Why this is important: let's say you are confident with a new version of your prompt, you deploy it, and then you realize it regressed and you want to go back to a previous version. The nice thing is you can fetch the prompts directly from Langtrace, so you can click a button and revert back to the previous prompt without having to do a code deploy. And finally, there is a playground where you can compare and contrast different models.
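
The design idea that makes a one-click rollback possible is resolving prompts at runtime rather than hardcoding them in the codebase. The sketch below shows that generic pattern with an in-memory stand-in for the registry; it is not Langtrace's actual API, just an illustration of why a rollback then needs no code deploy.

```python
# Generic prompt-registry pattern: the application resolves a prompt by name
# at runtime, so rolling back means changing which version is marked active
# in the registry, not redeploying code. In practice the registry would be a
# call to your prompt management service rather than this in-memory dict.
PROMPT_REGISTRY = {
    "support-agent-system-prompt": {
        "active_version": 2,
        "versions": {
            1: "You are a helpful support agent. Answer briefly.",
            2: "You are a helpful support agent. Cite the relevant help article.",
        },
    }
}

def fetch_prompt(name: str, version: int | None = None) -> str:
    entry = PROMPT_REGISTRY[name]
    return entry["versions"][version or entry["active_version"]]

# Normal operation uses whatever version is currently active...
system_prompt = fetch_prompt("support-agent-system-prompt")
# ...and a rollback is just flipping the active version in the registry.
PROMPT_REGISTRY["support-agent-system-prompt"]["active_version"] = 1
```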

KARTHIK KALYANARAMAN [00:09:01]: You can see the cost, you can see the latency, and you can also store prompts directly from the playground. Yeah, so that's basically it. That's how Langtrace helps you establish this closed-loop feedback system. This is the quick traction on how Langtrace has been going. We launched just two months ago. This is our GitHub star history. And a bunch of startups, and also big companies like Elastic, are already deploying Langtrace within their infrastructure. We support tracing for most of the popular LLMs, vector DBs, and frameworks, and we are continuing to add support.

KARTHIK KALYANARAMAN [00:09:40]: And like I said, we are OpenTelemetry-compatible, which means you can consume our SDKs without having to use Langtrace at all. You can just use our SDK to generate traces and send those traces to Datadog or Grafana or SigNoz, whatever you are already using. So if you're not comfortable introducing a new observability toolkit into your stack, you don't have to: you can create traces using the Langtrace SDK and send them directly to Grafana, SigNoz, or Datadog. We are continuing to add support for additional observability tools. In addition to that, we are also working in an OpenTelemetry working group with Microsoft and AWS to establish standards for all of these naming conventions. So if you're an observability expert, or if you're just interested, reach out; you can directly join the group. Yeah, that's pretty much it. You can scan the QR code to get to our website, or you can just look up Langtrace. But yeah, that's about it.
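
Because the SDK emits standard OpenTelemetry spans, routing them to an existing backend comes down to configuring an OTLP exporter. The sketch below uses the vanilla OpenTelemetry Python exporter pointed at an assumed local collector endpoint (Datadog, Grafana, and SigNoz all accept OTLP, typically through a collector or agent); the way the Langtrace SDK itself exposes this configuration may differ, so check its docs.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Assumed endpoint: an OpenTelemetry Collector (or vendor agent) that forwards
# traces to Datadog, Grafana Tempo, SigNoz, etc.
exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Any OpenTelemetry-compatible instrumentation (including LLM and vector DB
# instrumentation) that uses the global tracer provider will now export its
# spans to the configured backend.
tracer = trace.get_tracer("llm-app")
with tracer.start_as_current_span("smoke-test"):
    pass
```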
