
Prompt Engineering Copilot: AI-Based Approaches to Improve AI Accuracy for Production

Posted Mar 15, 2024 | Views 343
# Prompt Engineering
# AI Accuracy
# Log10
SPEAKERS
Arjun Bansal
CEO and Co-founder @ Log10.io

Arjun Bansal is an entrepreneur and AI expert focused on understanding and building intelligent systems. He is currently CEO & co-founder of Log10.io, a platform for building more accurate LLM-powered applications via LLMOps Copilots. Arjun previously co-founded Nervana Systems (acq. Intel). Arjun's career spans research in brain-machine interfaces, building AI processors, and AI sidekicks.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

LLM demos are aplenty but bringing LLM apps into production is rare, and doing so at scale is rarer still. Managing and improving accuracy of LLM apps up to the desired quality threshold is one of the main hold ups. In this lightning talk, we’ll share the workflows and AI based approaches that have been successful in deploying AI in production at scale with high accuracy.

TRANSCRIPT

Prompt Engineering Copilot: AI-Based Approaches to Improve AI Accuracy for Production

AI in Production

Slides: https://drive.google.com/file/d/18NJmJLZk5g7M4MPyETqc4Eq3P8Y-6TC7/view?usp=drive_link

Demetrios 00:00:06: Next up is Mr. Arjun. Where you at, bro? Hey, there he is.

Arjun Bansal 00:00:12: Hey Demetrios, how are you?

Demetrios 00:00:14: I'm doing well. I am rocking to the sound of my own drums, apparently. So I know you've got all kinds of cool stuff to talk to us about when it comes to prompt engineering and the prompts themselves, which is very fitting considering the song I just played. We're a little bit behind schedule, so I'm going to let you have at it, and you will see me in like ten minutes.

Arjun Bansal 00:00:44: Sounds great. Yeah, and I got that Spotify QR code, so thanks for that. Awesome, love it. Great. So yeah, my name is Arjun, and I'm super excited to share with you some of the work we've been doing on improving AI accuracy for production at Log10 via our prompt engineering copilot. At Log10, we've been working towards a vision of building systems that can improve themselves by automatically tuning prompts and models. These steps are quite manual today, but there's a path to using AI to automate some of that manual work while improving accuracy and reliability.

Arjun Bansal 00:01:33: So let's take a look at some of the manual steps involved in improving accuracy for LLM apps today. When an accuracy issue is flagged, developers or solutions engineers typically have to write SQL queries in their data warehouse to pull out the relevant call logs of interest. Then they have to copy and paste those prompts into a playground or a Jupyter notebook, followed by iterating on those prompts and hyperparameter optimization. Next, they need to share those results by copying and pasting into a spreadsheet so others on the team can review and sign off on the changes. And finally, there might be end users or customers who need to get notified of the updated prompts. And so we have found that streamlined tooling can help provide an integrated experience across the developer journey of logging, debugging, and evaluation, and this is a great first step to help overcome these pain points in the workflow. However, manual review unfortunately still tends to be required as part of the workflow of deploying LLM apps reliably, which adds time and expense to the process and in many cases may make the adoption of AI infeasible.
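
To make that first manual step concrete, here is a minimal sketch of pulling flagged call logs out of a warehouse. The table and column names are hypothetical, and sqlite3 stands in for a real warehouse client:

```python
# Hypothetical sketch of the manual log-pull step: the table and column
# names are invented, and sqlite3 stands in for a real warehouse client.
import sqlite3

conn = sqlite3.connect("warehouse.db")
rows = conn.execute(
    """
    SELECT prompt, completion, model, created_at
    FROM llm_call_logs
    WHERE status = 'flagged'
    ORDER BY created_at DESC
    LIMIT 50
    """
).fetchall()

# These rows would then be copied by hand into a playground or notebook --
# exactly the kind of manual hop the talk argues tooling should remove.
for prompt, completion, model, created_at in rows:
    print(model, created_at, prompt[:80])
```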

Arjun Bansal 00:02:56: So we talked with over 150 companies about their wish lists for how they'd like to improve the accuracy of their AI systems. Here were some of the main constraints that we learned about. Developers wanted to be able to preserve the LLM programming model via prompting and work with their existing code bases with minimal integration overhead. They wanted to be able to see accuracy improvements even with small data sets that may or may not have human labels yet. And although the operational overhead of fine-tuning has been greatly reduced by API calls from LLM providers, fine-tuning still creates algorithmic overhead due to a loss of general reasoning ability when fine-tuning on specific tasks, additional prompt engineering needed on the fine-tuned model, and loss of output structure validation. And developers also wanted to avoid optimization approaches that might add a lot of time and expense, such as evolutionary search or hyperparameter optimization. So at Log10, we asked: what if you could use AI to automate accuracy improvements of AI systems while taking into account the constraints of these real-world production systems? And we've built a prompt engineering copilot that does this. It has a simple integration path that preserves developers' existing programming model and doesn't require specialized ML expertise.

Arjun Bansal 00:04:32: The copilot leverages synthetic data to bootstrap from smaller data sets, and we actually surprised ourselves by finding that there's a lot of room for improvement with prompt engineering itself before needing to leverage more advanced techniques such as fine-tuning. And our approach goes beyond simple model-based evals, which often suffer from bias and accuracy issues, while factoring in the cost and latency of the overall system. So the overall developer journey can be divided into phases before and after the availability of ground truth via feedback, and we'll go through each of these stages next. Typically, the developer journey for building and deploying an LLM app starts with identifying a task and writing a basic prompt, and then trying to manually optimize it by inspecting the output before any ground truth data is available. The prompt engineering copilot leverages taxonomy-based prompt optimizations to improve the robustness of the prompt. You just heard about the SPADE taxonomy from Shreya's talk earlier. The prompt engineering copilot leverages SPADE to analyze prompts entered by users and then suggests changes based on how well the user's prompt covers the taxonomy. The copilot can then extend to new tasks or domains by incorporating user-defined rubrics or principles.
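
As a rough illustration of the taxonomy-based idea (a sketch, not Log10's actual implementation), one can ask an LLM to grade a prompt against SPADE-style criteria and propose edits. The criteria list below is an invented subset:

```python
# Illustrative sketch only, not Log10's implementation: grade a user's
# prompt against a SPADE-style taxonomy with an LLM and ask for edits.
from openai import OpenAI

client = OpenAI()

# Hypothetical subset of taxonomy criteria, in the spirit of SPADE.
TAXONOMY = [
    "Does the prompt specify the desired output format?",
    "Does it give explicit quantity instructions (e.g. 'list three steps')?",
    "Does it describe the workflow the model should follow?",
    "Does it say what to do when information is missing?",
]

def suggest_prompt_fixes(user_prompt: str) -> str:
    criteria = "\n".join(f"- {c}" for c in TAXONOMY)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Review the prompt below against these criteria. For each gap, "
                "suggest a concrete edit and a severity (red/yellow/green):\n"
                f"{criteria}\n\nPROMPT:\n{user_prompt}"
            ),
        }],
    )
    return resp.choices[0].message.content

print(suggest_prompt_fixes("Answer the customer's billing question."))
```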

Arjun Bansal 00:06:05: We recently saw a 22 percentage point improvement by extending SPADE with principles from the 26 principles paper, as one example of this kind of improvement. And here's an example of the taxonomy-based suggestion in action, recommending changes to a customer support prompt. The recommendations are graded here by severity and color-coded red, yellow, and green. For this example, the copilot recommended that there be specific quantity instructions and more details about the format and the workflow. Next, once some example data has been collected and annotated for use as ground truth, the copilot makes data-driven accuracy improvements to the prompt. Using these data-driven accuracy improvements, we've seen accuracy improve by up to 21 F1 points with as few as ten to 20 examples on customer use cases. We described some of these in a recent case study we published with Echo AI that's available at this link. In these results, the copilot was able to match the accuracy of fine-tuning with just the data-driven, accuracy-based prompt optimization.
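
The data-driven step can be pictured as a small search over candidate prompts scored against the labeled examples. A minimal sketch, not Log10's optimizer; run_llm is a placeholder for your own client:

```python
# Minimal sketch of data-driven prompt optimization: score candidate
# prompts on a small labeled set and keep the best by F1.
from sklearn.metrics import f1_score

def run_llm(prompt: str, text: str) -> str:
    """Placeholder: call your LLM with `prompt` applied to `text`, return a label."""
    raise NotImplementedError

def best_prompt(candidates: list[str], examples: list[tuple[str, str]]):
    # examples: (input_text, gold_label) pairs; per the talk, ten to 20
    # can already be enough to see measurable gains.
    gold = [label for _, label in examples]
    scored = [
        (f1_score(gold, [run_llm(p, text) for text, _ in examples],
                  average="macro"), p)
        for p in candidates
    ]
    return max(scored)  # (best F1, best prompt)
```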

Arjun Bansal 00:07:25: So while these improvements are great, one could ask if we are back to square one in terms of needing manual review to get feedback to improve accuracy. And the answer is: not quite. As you might have guessed, the copilot uses AI during the feedback generation process as well, by training custom fine-tuned models to mimic the end user feedback. The auto feedback system uses synthetically generated feedback from as few as 25 human-labeled examples to fine-tune a custom evaluation model that learns to mimic human reviews. The automated feedback is used to automate the prompt and, optionally, the model improvements using the data-driven accuracy optimizations described earlier. As a side note, the automatic feedback system can augment or automate human review of LLM outputs and enable developers to gate what gets sent to end users, or to manage the accuracy of their LLM applications. The base model could still be optionally fine-tuned as desired, using standard approaches such as SFT, RLHF, RLAIF, or DPO, but it's not required. We published detailed results as well as our system architecture in this technical report that's available via our Substack, and the slides will be available after the talk as well.
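
A sketch of what training such an auto-feedback model can look like (not Log10's pipeline): a handful of human-labeled reviews, plus synthetic ones bootstrapped from them, written out in the chat-format JSONL that fine-tuning APIs such as OpenAI's accept:

```python
# Illustrative sketch, not Log10's pipeline: turn ~25 human-labeled reviews
# plus synthetic ones into chat-format JSONL for fine-tuning an eval model.
import json

human_labeled = [
    # hypothetical example; the talk cites as few as 25 of these
    {"output": "Summary: the customer asked about a refund...",
     "feedback": "accurate, grade 7/7"},
]
synthetic = []  # feedback from prompting a strong model to imitate the reviewer

with open("eval_model_train.jsonl", "w") as f:
    for ex in human_labeled + synthetic:
        record = {"messages": [
            {"role": "system",
             "content": "Grade this LLM output the way our human reviewers do."},
            {"role": "user", "content": ex["output"]},
            {"role": "assistant", "content": ex["feedback"]},
        ]}
        f.write(json.dumps(record) + "\n")

# The JSONL file is then uploaded to a fine-tuning API; the resulting model
# predicts feedback automatically and can gate what reaches end users.
```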

Arjun Bansal 00:09:02: We showed that in a summary grading task, through a combination of fine-tuning and synthetic data generation, we were able to improve the feedback prediction by 44%. And in a follow-up report, we demonstrated that fine-tuned open source models such as Mistral and Llama could surpass GPT-4 and match fine-tuned GPT-3.5 on this task. And finally, there's one more thing, and maybe you're already noticing it, which is that in steady state, the prompt engineering copilot enables a virtuous cycle where collected human feedback data is bootstrapped by AI and used in turn to continuously improve the accuracy of the AI system by tuning the prompts and, optionally, the underlying models themselves. This forms a nice closed-loop system that can keep improving with time. So if you'd like to try this out on your LLM apps, it's easy to get started. You can start logging at log10.io today via a one-line integration and watch the prompt engineering copilot do its magic and help your AI apps self-improve. And yeah, you can reach me on email, LinkedIn, or X, whatever is your favorite platform. And we also have a Substack that covers some of the topics that might be relevant for the kinds of things we're talking about here today.
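
For reference, the one-line integration roughly follows Log10's documented pattern at the time; check the current docs for the exact import path:

```python
# Roughly Log10's documented one-line integration at the time of the talk;
# verify against current docs. It patches the OpenAI client so every call
# is logged to log10.io for the copilot to analyze.
import openai
from log10.load import log10

log10(openai)

# Existing OpenAI code then runs unchanged, and its calls appear in Log10.
```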

Arjun Bansal 00:10:35: Thanks for your attention.

Demetrios 00:10:37: What's the story with Coffee Phoenix? You rise like a phoenix when you have your coffee?

Arjun Bansal 00:10:44: I think it was back in 2009, when Twitter suggested ideas for how to get your handle, and they were like, what's your favorite beverage and what's your favorite band? And I think it was like a cold winter morning and I was having coffee and listening to Phoenix. So that's how that happened.

Demetrios 00:10:59: Oh, the band. Oh, nice. Yeah, that does make sense. I like it. All right, dude, thank you so much.
