MLOps Community

Introducing the Prompt Engineering Toolkit // Sishi Long // AI in Production 2025

Posted Mar 17, 2025 | Views 118

SPEAKER

Sishi Long
Staff Software Engineer @ Uber

Staff Engineer on Uber's AI platform, Michelangelo


SUMMARY

A well-crafted prompt is essential for obtaining accurate and relevant outputs from LLMs (large language models). Prompt design enables users new to machine learning to control model output with minimal overhead. To facilitate rapid iteration and experimentation with LLMs at Uber, there was a need for a centralized way to seamlessly construct prompt templates, manage them, and execute them against various underlying LLMs for the tasks those LLMs support. To meet these needs, we built a prompt engineering toolkit that offers standard strategies encouraging prompt engineers to develop well-crafted prompt templates. The centralized prompt engineering toolkit enables the creation of effective prompts with system instructions, dynamic contextualization, massive batch offline generation (LLM inference), and evaluation of prompt responses. Furthermore, there is a need for version control, collaboration, and robust safety measures (hallucination checks, a standardized evaluation framework, and a safety policy) to ensure responsible AI usage.


TRANSCRIPT


Demetrios [00:00:04]: Next up. I am so excited about our speaker. Let's bring you up. Sishi. Yeah. Hey.

Sishi Long [00:00:14]: Hi.

Demetrios [00:00:14]: Wrote the coolest blog post for the Uber Engineering blog. I always like the Uber Engineering blog. First, I'm just going to say that I'm a huge fan, but this one on the prompt toolkit was magnificent. So I feel thankful that you are here joining us, and you're going to talk about it right now.

Sishi Long [00:00:35]: Yes. I'm honored to be here to share the blog with you guys. Yeah.

Demetrios [00:00:39]: Excellent. Well, I see that you have shared your screen so I'm going to bring that up right now and then we can get rocking and rolling.

Sishi Long [00:00:48]: Okay, great. Yes. So welcome to our session on the prompt engineering toolkit for AI in production. My name is Sishi. I'm a software engineer on the Uber AI platform, which we call Michelangelo, and today we are going to take a deep dive into how the prompt engineering toolkit is transforming the way we make use of the power of large language models. For your GenAI journey, prompt engineering is usually the first step.

Sishi Long [00:01:21]: It is the most cost-effective way to improve the performance of an LLM, so usually you do prompt engineering first. If it doesn't work, then you integrate RAG to give the LLM more context. If RAG integration still doesn't satisfy your need, then you integrate tools with the LLMs. For tools, for example, you can have a Google Search tool or any search engine tools, and also the Docker integration tool. And if that's still not working for your LLM, you fine-tune the model, which is the most expensive option. That's roughly the whole journey of your LLM, and prompt engineering is the first step. So our journey today will cover several topics.

Sishi Long [00:02:08]: First is background and motivation. We'll see why LLMs are a game changer inside Uber and what challenges we face with prompt design, performance consistency, and, most importantly, scalability in real-world applications. Second, we'll cover the prompt template lifecycle, so we'll see how we evolve our prompt templates, from exploration and iteration at the start to full-scale production deployment and monitoring at the end. The third one, I think the juiciest one, is about the architecture and the key components. We'll get an insider look at the whole toolkit architecture, including our playground, the prompt builder, the management and versioning system, the deployment process, and the eval framework. And the last is the use cases we have at Uber.

Sishi Long [00:03:07]: The prompt toolkit powers both online use cases and offline use cases, ensuring efficiency and robust performance across these two scenarios. And the last one is a Q&A, so we'll wrap up by answering your questions if you have any. Now for the background: as LLMs evolve, the way prompts are structured can dramatically impact model performance. Each model actually has its own flavor of constructing a prompt. OpenAI is different from the Mistral models, so you cannot use the same prompt template across different LLM models. We need to provide a way for you to tune your parameters, e.g.

Sishi Long [00:03:59]: top_p, top_k, temperature, et cetera, for different LLM models. Because we want to solve real-world business problems during the experiment phase, and it should also apply to your online case. So there was a need inside Uber at that time for a centralized way to seamlessly construct prompt templates, manage them, and execute them. In our North Star vision, we want our toolkit to not only boost our developers' velocity but also establish industry-standard best practices for prompt engineering. First, we want our engineers to explore a diverse range of LLM models. So we support both third-party models, for example OpenAI's list of models, and we also want to support our in-house models.
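
To make this concrete, here is a minimal sketch (hypothetical, not Uber's actual SDK) of keeping one shared template while tuning sampling parameters per model:

```python
# Hypothetical sketch: one shared template plus per-model sampling parameters,
# since each provider expects its own settings and value ranges.
PROMPT_TEMPLATE = "Classify the username '{username}' as VALID or GIBBERISH."

MODEL_PARAMS = {
    "openai/gpt-4o":       {"temperature": 0.0, "top_p": 1.0},
    "in-house/mistral-7b": {"temperature": 0.2, "top_p": 0.9, "top_k": 40},
}

def build_request(model: str, username: str) -> dict:
    """Hydrate the shared template and attach the model-specific parameters."""
    return {
        "model": model,
        "prompt": PROMPT_TEMPLATE.format(username=username),
        **MODEL_PARAMS[model],
    }

print(build_request("in-house/mistral-7b", "xzqwpl123"))
```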

Sishi Long [00:04:56]: Like if an engineer fine-tunes a model, for example Mistral, and wants to serve it in-house at Uber, we also want to expose that model to every engineer inside Uber. And finally, we want to have an evaluation framework. Whenever you have a prompt template targeting different models with different parameters, the end goal is to effectively evaluate their performance, so when you want to go to production, you roll out the one that beats all the others with the best performance. The prompt template lifecycle basically contains two stages: the development stage and the production stage. The process begins in the development stage, where we experiment with different LLM models and continuously iterate on the prompt templates.

Sishi Long [00:06:02]: Once refined after the evaluation stage, we transition into production, where the focus shifts to the deployment of these templates and the ongoing monitoring of their performance. On the right side of this slide you can see the model catalog. This is the first phase of prompt template iteration. It's very similar to a marketplace: all the model specs are visible to our users, they can play around with different models, and we also provide user guides as well as the cost and various metrics. And this is the GenAI playground we built for exploring prompt templates. In the playground, users can pinpoint their specific business use case, gather sample data, and create and analyze different prompts with different LLM models.

Sishi Long [00:07:11]: You can see there is a dropdown here. You can select different LLM models, input different datasets, assess the responses, and make revisions as needed. We apply the same idea from software engineering, so we make it code-driven. You can make revisions to all your prompt templates, keep all of those revisions, and deploy the best prompt template revision to production. We also have a prompt template catalog, so it's easy to share templates between different teams inside Uber. The last stage is the evaluation phase, where users can evaluate the effectiveness of the prompt templates by testing them with more extensive data. On this slide you can see the authoring flow, where you author the prompt templates: you create a prompt template and you want to test the responses.

Sishi Long [00:08:17]: So we built the prompt template eval framework. Basically, this eval framework has a feature prep step, meaning you can incorporate all the production logs from your data pipelines; then you fetch your prompt template revision, hydrate those prompt templates, and run either LLM-as-a-judge or your own customized metrics to evaluate the prompt template performance. Then we have a human in the loop, so you can see whether the performance satisfies your needs. If it does, you deploy your prompt template to production. The last stage of the lifecycle is the production stage. Once refined, we transition into the production stage, where the focus shifts to the deployment of these templates and ongoing performance monitoring. Users only productionize the prompt templates that pass the evaluation threshold on an eval, or what we call a golden dataset. The lifecycle ensures that we are not just launching prompts into production, but also actively nurturing and optimizing them over time.
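
A hedged sketch of that eval flow, with invented function names and an invented judge prompt (the real framework's API may differ):

```python
# Illustrative eval flow: hydrate a template revision with logged inputs, then
# score each response with an LLM-as-a-judge callable or a custom metric.
from statistics import mean
from typing import Callable

JUDGE_PROMPT = (
    "You are grading an LLM answer.\n"
    "Input: {input}\nAnswer: {answer}\n"
    "Reply with a single integer score from 1 (bad) to 5 (excellent)."
)

def evaluate_revision(records: list[dict],            # sampled production logs
                      template: str,                  # fetched prompt template revision
                      generate: Callable[[str], str], # calls the target LLM
                      judge: Callable[[str], str]) -> float:
    scores = []
    for rec in records:
        answer = generate(template.format(**rec))                    # hydrate + run target LLM
        verdict = judge(JUDGE_PROMPT.format(input=rec, answer=answer))
        scores.append(int(verdict.strip()[0]))                       # naive parse of the score
    return mean(scores)  # a human then reviews this before promoting the revision
```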

Sishi Long [00:09:34]: This is the architecture of our prompt template toolkit. On the left side you can see our LLM model catalog. For each model we do a deployment first. For example, for an OpenAI model we deploy through the GenAI Gateway; it's not a real deployment, but mostly metadata containing all the information about this model. For our in-house models, for example a fine-tuned Mistral, that's a real deployment of the model on our own infra. Those models are then all visible in the model catalog I showed before.

Sishi Long [00:10:13]: And in the prompt template toolkit, we first have the UI: users can create a prompt template from a UI similar to what I showed in the playground. They can also do it with the SDK, which is Python code-driven, so they can author their prompt template in code. Inside Uber there are different preferences. Some people really like the UI way; they think the visualization is pretty good and iteration is faster for them in the UI. But some developers actually prefer the SDK way, because with the SDK they have a more customized way to create a prompt template. We support both ways to author a prompt template.
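
As a hypothetical illustration of the code-driven SDK path (the class and field names below are invented, not Uber's actual API):

```python
# Hypothetical SDK-style prompt template authoring (illustrative only).
from dataclasses import dataclass, field

@dataclass
class PromptTemplate:
    name: str
    system_instruction: str
    template: str                       # uses {placeholders} for dynamic context
    parameters: dict = field(default_factory=dict)

    def hydrate(self, **context) -> str:
        return self.template.format(**context)

summarize_ticket = PromptTemplate(
    name="support_ticket_summary",
    system_instruction="You are a concise assistant for support agents.",
    template="Summarize this ticket for an agent handoff:\n{ticket_text}",
    parameters={"temperature": 0.1},
)
# This definition would live in a code repo and go through the normal review flow.
```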

Sishi Long [00:10:57]: With both methods we support code review, even for the UI. When you create a prompt template from the UI, you can save it as a draft for your personal usage. A draft is similar to your local branch: it's visible only to you and you can play around with it. Once you're ready with the UI draft and want to send it out for review, you can do that through the UI: just send it out for review. A Phabricator diff will be created for you with your change and sent to your teammates for code review. For the SDK, I think everyone knows the flow: you work in the repo, create your local branch, run arc diff, and that creates the Phabricator diff for everyone's review. After the code is reviewed, you can land it.

Sishi Long [00:11:50]: The landing process creates a new revision of your prompt template. Underneath, when we create a prompt template, we save it in a couple of places. One is etcd: because we are using the Kubernetes API, there is a reconciliation loop running on the controller side, so the prompt template first goes into etcd and is saved there, and it also goes into MySQL for future data analysis. The other place is UCS, which is Uber's internal object config; this object config enables fast fetching for all the applications. Once you have this prompt template, you can use the offline generation notebook. We scaffold a notebook for you, so you can fetch the prompt template, test it with different parameters, and hydrate the prompt templates with any data you have in your notebook.
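
In a scaffolded notebook, fetching a revision and hydrating it against a handful of rows might look roughly like this; `fetch_prompt_template` and `call_llm` are placeholder stubs for whatever clients the platform provides:

```python
# Notebook-style quick iteration over a small dataset (helper names are hypothetical).
import pandas as pd

def fetch_prompt_template(name: str, revision: int) -> str:
    # Stand-in for fetching a stored revision from the prompt template service.
    return "Is the username '{username}' gibberish? Answer yes or no."

def call_llm(prompt: str, model: str = "gpt-4o-mini", temperature: float = 0.0) -> str:
    # Stand-in for the platform's LLM client.
    return "no"

template = fetch_prompt_template("username_check", revision=3)
sample = pd.DataFrame({"username": ["john_doe", "xzqwpl123", "maria.garcia"]})

sample["response"] = [call_llm(template.format(username=u)) for u in sample["username"]]
print(sample)   # eyeball a few responses before scaling up to the offline pipeline
```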

Sishi Long [00:12:49]: This method is very fast: a user can play around with a small dataset and get results back within five minutes. For users who have a huge dataset, we scaffold an offline generation pipeline for them. These pipelines basically have several steps. First is the feature prep step: you can use a Spark SQL query to fetch your Hive data, or you can upload a Parquet file. Then we have a scorer, which runs offline inference against the different models.

Sishi Long [00:13:26]: Whichever LLM models you want to choose. And the last one is the pusher, which saves your data in any format you want; you can save to a Hive DB or publish your experiment report. The other consumer of a prompt template is our prompt template eval pipeline, so you can also set up a similar eval pipeline to evaluate your prompt template. For the eval pipeline we support sampling, because for the offline generation pipeline people usually want to run a million data records through batch offline inference, but for evaluation they don't need that much data, and it would be a big cost for them. So we support sampling.
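
The prep, score, and push steps, plus the sampling used by the eval pipeline, could be sketched in PySpark along these lines; the table name, the stubbed scorer, and the output path are all illustrative:

```python
# Sketch of a feature-prep -> score -> push pipeline in PySpark (illustrative names).
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("prompt_offline_generation").getOrCreate()

# 1) Feature prep: Spark SQL over a Hive table (or spark.read.parquet(...) for uploads).
rows = spark.sql("SELECT user_id, username FROM consumer_db.users")

# Optional sampling (~0.1% of records) that an eval pipeline would score instead.
eval_rows = rows.sample(fraction=0.001, seed=42)

# 2) Scorer: hydrate the template and call the LLM (stubbed out here).
TEMPLATE = "Is the username '{username}' gibberish? Answer yes or no."

@udf(returnType=StringType())
def score_with_llm(username: str) -> str:
    prompt = TEMPLATE.format(username=username)
    return "no"   # placeholder: the real scorer would send `prompt` to the chosen model

scored = rows.withColumn("verdict", score_with_llm(rows["username"]))

# 3) Pusher: write results back to Parquet/Hive for downstream reports.
scored.write.mode("overwrite").parquet("/tmp/username_check_results")
```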

Sishi Long [00:14:06]: You can sample around 0.1% of your data, just to see whether the performance is good or not, and then we publish an eval report. I want to recap the key components we just covered in the previous slides. At the heart of our toolkit there are several very important features and components we built to support our prompt engineering process. One is the GenAI Playground: users can select any model from the model catalog we built, which is like a marketplace for the models, craft their customized prompts, adjust parameters, and evaluate the model's responses in real time. And we also have a prompt builder.

Sishi Long [00:14:58]: This prompt builder is currently built on top of LangChain, and it automatically creates prompts for your use case. You just give the use case to the prompt builder and it can automatically create a prompt for you, and it also helps you discover the most advanced prompting techniques. The auto prompt builder is very helpful and heavily used inside Uber, because many people do not want to start from scratch. They can just ask the prompt builder to create one according to their business needs and then iterate from there. Then there's prompt management and versioning. We follow software engineering best practices: it's code-based iteration, so users can modify the instructions, add any dynamic parameters they want for testing purposes, and test it on their small dataset. Code review is enforced to make sure this also follows software engineering best practices: your teammates review your code, and once it's approved and passes the CI/CD testing, post-landing will merge your prompt template into our etcd, MySQL, and UCS.
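
A bare-bones LangChain-based prompt builder could look like the sketch below. This is purely illustrative: the real builder also layers RAG over internal docs and existing templates, and the model name here is just an example.

```python
# Minimal "auto prompt builder" sketch on top of LangChain (illustrative only).
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

builder_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a prompt-engineering assistant. Given a business use case, draft a "
     "prompt template with clear instructions, placeholders for dynamic context, "
     "and an explicit output format."),
    ("human", "Use case: {use_case}"),
])

llm = ChatOpenAI(model="gpt-4o", temperature=0.3)
chain = builder_prompt | llm

draft = chain.invoke({"use_case": "Summarize a support ticket for an agent handoff"})
print(draft.content)   # a starting point the prompt engineer then iterates on
```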

Sishi Long [00:16:17]: A prompt template revision will be created at that point. Furthermore, for the production serving flow, users may not want their prompt templates in production altered with each update, so they can point their production flow to a deployed prompt template. We also incorporate best practices from microservices: similar to deploying a service, you deploy your prompt templates, and several components are involved in evaluating the performance of one or two prompt template revisions along with the LLM models. For the eval mechanism, we built a prompt eval framework. One option is LLM-as-a-judge, which means you leverage another LLM as an evaluator, and you can give it guidance on how to evaluate the prompt response for your use case. The other type is customized evaluation. The reason we have customized evaluation is that LLM-as-a-judge only works for clearly defined and automatable judgments, such as content moderation, preliminary screening, and technical assessment. But many users inside Uber prefer to have their own customized evaluation metric.
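
Where LLM-as-a-judge is a poor fit, a customized metric can be a plain deterministic function. A made-up example for scoring a support-ticket summary (not an actual Uber metric):

```python
# Invented custom metric: checks that a handoff summary is concise and covers key fields.
def handoff_summary_score(summary: str, ticket: dict, max_words: int = 120) -> float:
    required = [ticket["customer_name"], ticket["issue_type"]]
    coverage = sum(1 for item in required if item.lower() in summary.lower()) / len(required)
    length_ok = 1.0 if len(summary.split()) <= max_words else 0.5
    return 0.7 * coverage + 0.3 * length_ok   # weighted score in [0, 1]

score = handoff_summary_score(
    "Maria Garcia reports a billing issue with her last trip.",
    {"customer_name": "Maria Garcia", "issue_type": "billing"},
)
print(score)   # 1.0
```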

Sishi Long [00:17:58]: So we also support that in our eval framework, and this method can be highly tailored to the specific aspects of performance the customers want to measure. Now for the use cases inside Uber. There are two main scenarios: one is the offline LLM use case, the other one is online. I will cover the offline one first. The LLM batch offline inference pipeline facilitates batch inference for large-scale LLM response generation. One use case inside Uber is called rider verification, because we want to verify every consumer's, that is, our rider's, username: is it gibberish, or is it valid? We have many millions of usernames in our system.

Sishi Long [00:18:56]: We leverage this pipeline to assess all existing riders inside Uber's consumer database. That's for backfilling: we analyze all the existing usernames inside the DB. We also support newly registered users, because every day new riders register on the Uber website. So we employ a synchronized method to process and generate responses for usernames in batches. You can think of it this way: first, the offline inference pipeline does an ad hoc check across all the user information in the database, and then we have an orchestration pipeline that can run daily, weekly, or monthly. The user can adjust the cron and trigger settings they want to screen all the newly registered users.

Sishi Long [00:19:59]: The challenge of this offline use case is that we want to handle millions of records in a very fast, effective way. So in this offline inference pipeline we not only support the standard API, for example row-by-row record inference; we also support OpenAI's Batch API inference. What is OpenAI Batch API inference? OpenAI has an API called the Batch API. Basically, you prepare your records as JSONL files and upload those files to OpenAI's side. They process them asynchronously, using their off-cycle compute power to process all these records at non-peak times, so the cost is roughly half. So we support both: standard API row-by-row inference and Batch API inference.
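
A minimal sketch of that Batch API flow with the OpenAI Python SDK (toy data; the model choice and prompt wording are illustrative):

```python
# Prepare requests as JSONL, upload the file, and create an asynchronous batch job.
import json
from openai import OpenAI

client = OpenAI()

usernames = ["john_doe", "xzqwpl123", "maria.garcia"]   # toy data
with open("username_checks.jsonl", "w") as f:
    for i, name in enumerate(usernames):
        f.write(json.dumps({
            "custom_id": f"user-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{
                    "role": "user",
                    "content": f"Is the username '{name}' gibberish? Answer yes or no.",
                }],
            },
        }) + "\n")

batch_file = client.files.create(file=open("username_checks.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",     # results come back asynchronously, at reduced cost
)
print(batch.id, batch.status)
```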

Sishi Long [00:21:04]: And for the online serving case, I will go through text summarization inside Uber. In Uber customer obsession support, a contact is a support ticket a customer uses to reach a support agent. There are scenarios where multiple agents handle the same contact. In this case, the new agent receiving the handoff must either go through the ticket to understand the context or ask the customer to repeat their problem; usually it's the former. To solve this, we leverage an LLM to provide a summary to the agents when a handoff happens from one agent to another. We use a well-tested prompt template in this flow, applying real-time data to produce a summary for the next agent. Each prompt template is also linked with production logging, so you can see your prompt template's performance with real-time data.

Sishi Long [00:22:10]: And in the future, what we'll support is converting this production logging into your prompt template's evaluation dataset, for example your golden dataset, because that's the real data occurring in your production environment. There's also production monitoring, which measures the performance of prompt templates used in production. The purpose is to track regressions of the prompt template and the LLM models, because the LLM model can sometimes regress too. We saw this scenario: whenever a third-party vendor updated its models, the model could degrade for the same prompt template. So a daily performance monitoring pipeline runs against production traffic to evaluate performance. The metrics include latency, accuracy, and cost, so you can see how much money you spend on your use case monthly or daily, as well as the input and output tokens and the availability of the LLMs. Because for our production flow we also need a fallback.

Sishi Long [00:23:23]: For example, if the LLM response does not come within the defined SLA, we fall back to either a second LLM or our backend logic. Yes, that pretty much covers my talk today. Now I think I'm ready for the questions.
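
A minimal sketch of the SLA fallback just described, assuming hypothetical primary/secondary callables and an arbitrary SLA value (a production system would use an async client with real cancellation):

```python
# Illustrative SLA-based fallback between a primary LLM call and a backup path.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
from typing import Callable

def call_with_fallback(prompt: str,
                       primary: Callable[[str], str],
                       secondary: Callable[[str], str],
                       sla_seconds: float = 2.0) -> str:
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary, prompt)
    try:
        return future.result(timeout=sla_seconds)   # primary answered within the SLA
    except TimeoutError:
        return secondary(prompt)                    # fall back to a second LLM / backend logic
    finally:
        # Don't block on the slow call; the background thread keeps running to completion.
        pool.shutdown(wait=False, cancel_futures=True)
```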

Demetrios [00:23:47]: Yeah. Oh, so good. So while the questions are rolling in, there were a few that were asked during the talk. We've got Arthur asking: I find that most of my time is spent tuning prompts and it is often trial and error. Are there any tools you have that look at your prompts and what you are trying to accomplish and recommend changes to better achieve the goal, based on the prior results from existing or evolving prompts or, even better, other people's prompts?

Sishi Long [00:24:20]: Yes, yes. That's the prompt builder I just mentioned. The prompt builder is actually built on top of LangChain right now. You can think of it as an agent, right? The first piece is a RAG process: we build in all the best practices and feed all the documents, all the information, to the prompt builder.

Sishi Long [00:24:43]: That's its RAG route. And for tools, we also give it tools, including search tools, third-party ones like Google Search, plus an integration with our own internal prompt template database. That's why the prompt builder has information about all the prompts inside Uber, the best practices, and also the Internet content. So it will iterate; you can ask the prompt builder to iterate for you. Either you ask the prompt builder to create a prompt from scratch, or you give the prompt builder what is already roughly the first iteration of your prompt template and ask it to iterate on it and improve it for you. Yeah.

Sishi Long [00:25:32]: Instead of you trying to modify the parameters yourself. Yeah.

Demetrios [00:25:38]: Well along those lines each model has a bit of a different way of dealing with prompts. Do you take that into account also?

Sishi Long [00:25:49]: Yes. Yes. We need to account for different LLMs because different LLMs have different formats. If you've seen OpenAI's API requests, they're very different from Mistral's.

Demetrios [00:26:01]: Yeah. And with the prompt builder, if I write one prompt that I want to work, do you figure out on the back end how to make it work with each LLM? Or do I have to write a prompt that's specialized for a specific LLM?

Sishi Long [00:26:16]: The prompt builder can give you a prompt specific to your LLM. But yeah, it is the user's responsibility to iterate from that. And that's why we have the notebook around, right? You take a small dataset, hydrate it into the prompt template, and then use the offline inference pipeline with the notebook to do a fast iteration. Or we have the eval framework to evaluate it in real time or offline.

Demetrios [00:26:49]: And do you give tips that like. Because I imagine there's insights that are surfaced with so many people using the prompt builder and potentially it's data scientists that are looking at the prompts and trying to surface insights. Or maybe there's some other way that you're doing that. Maybe you're just asking LLMs. But I wonder if. Let's say that I want to use Mistral.

Sishi Long [00:27:15]: Okay.

Demetrios [00:27:16]: Are there tips that I can look at or documentation on the best way to build a prompt with Mistral versus OpenAI or Anthropic?

Sishi Long [00:27:24]: Yeah, that's the model catalog's responsibility. You can think of the model catalog as a marketplace for all the models, right? So if you put a model there, you have to give the user guidance on how to use this model: is it for text summarization or for chat completion? And also, for the customer.

Sishi Long [00:27:46]: They also care about the cost, right? So you can compare different things in this model catalog marketplace. And we also support the case where you fine-tune your model and serve it inside Uber: that model should also be visible to users, so they can see what kinds of in-house models there are and whether the cost is much cheaper than other third-party vendors. Yeah.

Demetrios [00:28:11]: So it feels like right now it's very much on me as the user: I have to go find the right model, I have to make sure that it matches my price sensitivity, and then I have to build my prompt. Have you thought about how to make it more declarative, so that I can just say, I want this price, I want this prompt to run, go figure it out?

Sishi Long [00:28:36]: Yeah. Actually, that should also be on the roadmap. For example, one use case the marketplace team is asking about is when they have an LLM offline inference pipeline with many millions of records. For now they have to tune the parallelism themselves, right? Yeah.

Sishi Long [00:28:59]: They want to see how much parallelism they can use and what the cost is. So we will give them dynamic parameters to say, I want to cap at this cost, or we can cap at this rate limit, and then go through the LLMs. Yeah.

Demetrios [00:29:13]: Awesome. There are a few questions coming through here in the chat. One: is DSPy used in optimizing the prompts in the toolkit?

Sishi Long [00:29:23]: Yes. The prompt builder can help you optimize the prompts. Yeah.

Demetrios [00:29:27]: Are you using the framework DSPy?

Sishi Long [00:29:31]: No, we didn't use DSPy. Yeah.

Demetrios [00:29:33]: Okay. Yeah.

Sishi Long [00:29:34]: We basically leverage LangChain right now for the prompt builder. Yeah.

Demetrios [00:29:40]: Is Prompt builder open source? Sadly, no. Right.

Sishi Long [00:29:45]: Actually, we have a plan to open source it. Yeah, we have a plan. Yeah.

Demetrios [00:29:51]: That's awesome. Yes, that's really cool. Okay, so when.

Sishi Long [00:29:58]: I think maybe later this year, 2025. It's already in the talks, yeah, to open source the GenAI prompt engineering toolkit. Yeah.

Demetrios [00:30:09]: Okay, so last one from me: when you're dealing with prompts, you mentioned how it's kind of like working with Git and you can branch prompts. Can you work collaboratively on prompts, so that I could ask you, hey, can I grab this prompt, or can I just tweak this prompt, or maybe comment on a prompt as if it were a Google Doc?

Sishi Long [00:30:36]: Yes. So we have two ways. One is the SDK. It's kind of like your local branch, right? It's very similar to coding in software engineering. If you have your local branch, absolutely.

Sishi Long [00:30:45]: You can grab other people's pushes to the remote branch, and people can grab your remote branch into their local code repo and then modify it. That's one way, the SDK, the coding way. The other one is from the UI. In the UI we have this concept called a draft; it's similar to your local branch, right? In the UI, you modify and save as a draft. It can be visible just to you, or you can also make it visible to your teammates.

Sishi Long [00:31:10]: Then your teammate can go to the UI, go to that draft, modify or comment on the same draft, and then he or she can save it or send it for code review and iterate on top of your prompt template. Yeah, so we support both ways. Yeah, that's awesome.

Demetrios [00:31:27]: That's really cool. Okay, a few more questions coming through here in the chat, and then I'm going to be giving away some headphones for the folks that are sticking with us, so stay around for the break. But in your monitoring example where performance degradation occurred due to model updates, can that type of model updating without validation be phased out?

Sishi Long [00:31:56]: Can you rephrase the question again?

Demetrios [00:31:59]: Yeah, actually now that I am saying it, I'm not sure that I fully understand it because I think what you are saying is that.

Sishi Long [00:32:06]: Okay, so for my use case: for ChatGPT, for example, right, if they pin a specific version, then the model usually will not degrade. But if they provide a model like, for example, GPT-4o, they may upgrade it on the back end from time to time because there is no pinned release version at all. It's not like a Python library you pinned at 1.0.0, where you just use that library all the time in your code and it doesn't change unless someone goes to your code repo and updates the Python package. But for GPT-4o on OpenAI's side, they may upgrade that model from time to time. Yes.

Sishi Long [00:32:50]: Then your production performance may degrade because they upgrade their model and your prompt template does not work anymore for their new updated model. Yeah.

Demetrios [00:33:01]: Yeah, that's always fun to try and figure out why your prompts don't work anymore. And so I imagine you start throwing alerts in that evaluation piece, like if the prompts stop working.

Sishi Long [00:33:13]: That's correct. So we have this thing called MES; it's also in the Uber blog post. It has several indicators, and these indicators run daily. Every day we do production monitoring with different indicators: availability, cost, and accuracy. If any indicator violates your threshold, for example you set it between 90 and 99.9...

Sishi Long [00:33:40]: If it violates your SLA, it will send you an alert. Yes, automatically, every day.
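
An illustrative version of such a daily threshold check (the metric names and numbers are invented):

```python
# Invented daily indicator check: compare aggregates against per-template thresholds.
THRESHOLDS = {
    "availability": 0.999,      # alert if below 99.9%
    "accuracy": 0.90,
    "p95_latency_s": 3.0,
    "daily_cost_usd": 500.0,
}

def check_daily_metrics(metrics: dict, alert) -> None:
    if metrics["availability"] < THRESHOLDS["availability"]:
        alert(f"availability {metrics['availability']:.4f} below SLA")
    if metrics["accuracy"] < THRESHOLDS["accuracy"]:
        alert(f"accuracy {metrics['accuracy']:.2f} below threshold")
    if metrics["p95_latency_s"] > THRESHOLDS["p95_latency_s"]:
        alert(f"p95 latency {metrics['p95_latency_s']:.1f}s over budget")
    if metrics["daily_cost_usd"] > THRESHOLDS["daily_cost_usd"]:
        alert(f"daily spend ${metrics['daily_cost_usd']:.2f} over budget")

check_daily_metrics(
    {"availability": 0.9985, "accuracy": 0.93, "p95_latency_s": 2.1, "daily_cost_usd": 612.0},
    alert=print,   # in production this would page the owning team instead
)
```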

Demetrios [00:33:49]: Yeah. And you get to choose what the threshold is. I really like that. I remember reading that too. All right, so last one. Where's the best place to monitor for the open source release announcement? I would imagine it's your blog. Yeah.

Sishi Long [00:34:03]: Yes. Okay. Yeah, the blog. Yeah.

Demetrios [00:34:06]: Yeah. I will also make sure to put it in our newsletter too, because I'm very excited for when that comes out. And yeah, this has been super cool. I really appreciate you coming on here. As I mentioned before, I'm a huge fan of the engineering blog at Uber and also this prompt toolkit that you're building. I know. I was talking to Kai about some of the other stuff that you're working on for other parts of the company and how they can use prompts. And he was like, yeah, it's.

Demetrios [00:34:39]: It's in the works. I'll tell you when we release it. We have a blog on it.

Sishi Long [00:34:42]: We have many GenAI use cases inside Uber leveraging this toolkit already. Yeah. Yeah.

Demetrios [00:34:49]: Excellent. Well, I will talk to you later and thank you so much for coming on here and giving this talk.
